<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Language modelling examples

This folder contains some scripts showing examples of *language model pre-training* with the 🤗 Transformers library.
For straightforward use-cases you may be able to use these scripts without modification, although we have also
included comments in the code to indicate areas that you may need to adapt to your own projects. The two scripts
have almost identical arguments, but they differ in the type of LM they train - a causal language model (like GPT) or a
masked language model (like BERT). Masked language models generally train more quickly and perform better when
fine-tuned on new tasks with a task-specific output head, like text classification. However, their ability to generate
text is weaker than causal language models.

## Pre-training versus fine-tuning

These scripts can be used both to *pre-train* a language model completely from scratch and to *fine-tune*
a language model on text from your domain of interest. To start from an existing pre-trained language model, use
the `--model_name_or_path` argument; to train from scratch, use the `--model_type` argument to indicate the class
of model architecture to initialize.
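
As a rough sketch, a from-scratch run might look like the command below. Here `bert` and the `--tokenizer_name` value are illustrative choices: a freshly initialized model has no tokenizer of its own, so you would typically point the script at an existing tokenizer (or one you trained yourself).

```
python run_mlm.py \
--model_type bert \
--tokenizer_name bert-base-cased \
--output_dir output \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1
```
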

### Multi-GPU and TPU usage

By default, these scripts use a `MirroredStrategy` and will use multiple GPUs effectively if they are available. TPUs
can also be used by passing the name of the TPU resource with the `--tpu` argument.
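
For instance, a TPU run might look like this, where `my-tpu` is a placeholder for the name or address of your TPU resource:

```
python run_mlm.py \
--model_name_or_path distilbert-base-cased \
--output_dir output \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1 \
--tpu my-tpu
```
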

## run_mlm.py

This script trains a masked language model.

### Example command

```
python run_mlm.py \
--model_name_or_path distilbert-base-cased \
--output_dir output \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1
```

When using a custom dataset, a validation file can be passed separately as an input argument; otherwise, a (configurable) split of the training data is used for validation.

```
python run_mlm.py \
--model_name_or_path distilbert-base-cased \
--output_dir output \
--train_file train_file_path
```
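
If you do have a separate validation set, it can be supplied alongside the training file; assuming the script's `--validation_file` argument (the counterpart of `--train_file`) and with `validation_file_path` as a placeholder:

```
python run_mlm.py \
--model_name_or_path distilbert-base-cased \
--output_dir output \
--train_file train_file_path \
--validation_file validation_file_path
```
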

## run_clm.py

This script trains a causal language model.

### Example command

```
python run_clm.py \
--model_name_or_path distilgpt2 \
--output_dir output \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1
```

When using a custom dataset, a validation file can be passed separately as an input argument; otherwise, a (configurable) split of the training data is used for validation.

```
python run_clm.py \
--model_name_or_path distilgpt2 \
--output_dir output \
--train_file train_file_path
```
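
As with `run_mlm.py`, a separate validation set can be supplied, again assuming the `--validation_file` argument and placeholder file paths:

```
python run_clm.py \
--model_name_or_path distilgpt2 \
--output_dir output \
--train_file train_file_path \
--validation_file validation_file_path
```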