# Zero-shot classifier distillation

Author: @joeddav

This script provides a way to improve the speed and memory performance of a zero-shot classifier by training a more efficient student model from the zero-shot teacher's predictions over an unlabeled dataset.
The zero-shot classification pipeline uses a model pre-trained on natural language inference (NLI) to determine the compatibility of a set of candidate class names with a given sequence. This serves as a convenient out-of-the-box classifier without the need for labeled training data. However, for a given sequence, the method requires each possible label to be fed through the large NLI model separately. Thus for `N` sequences and `K` classes, a total of `N*K` forward passes through the model are required. This requirement slows inference considerably, particularly as `K` grows.
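To make the `N*K` cost concrete, the sketch below shows roughly what the zero-shot pipeline does for a single sequence with the default teacher and hypothesis template: one forward pass through the large NLI model per candidate label, followed by a softmax over the entailment scores. This is only an illustration, not code from the script:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

sequence = "A new moon has been discovered in Jupiter's orbit"
class_names = ["the world", "sports", "business", "science/tech"]  # K = 4

# One forward pass through the large NLI model per (sequence, candidate label) pair
entailment_logits = []
for name in class_names:
    hypothesis = f"This example is {name}."
    inputs = tokenizer(sequence, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # [contradiction, neutral, entailment]
    entailment_logits.append(logits[0, -1])

# Normalize the entailment scores over the K candidates (the single-label setting)
probs = torch.softmax(torch.stack(entailment_logits), dim=0)
```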
Given (1) an unlabeled corpus and (2) a set of candidate class names, the provided script trains a student model with a standard classification head with `K` output dimensions. The resulting student model can then be used for classifying novel text instances with a significant boost in speed and memory performance while retaining similar classification performance to the original zero-shot model.
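For reference, the student is just an ordinary sequence classification model whose head has `K` outputs, one per candidate class name. A minimal sketch of such a student, using the script's default student checkpoint:

```python
from transformers import AutoModelForSequenceClassification

class_names = ["the world", "sports", "business", "science/tech"]  # K = 4 classes

# A standard classifier with a K-dimensional output head on top of a small pre-trained encoder
student = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(class_names),
    id2label=dict(enumerate(class_names)),
    label2id={name: i for i, name in enumerate(class_names)},
)
```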
### Usage

A teacher NLI model can be distilled to a more efficient student model by running [`distill_classifier.py`](https://github.com/huggingface/transformers/blob/main/examples/research_projects/zero-shot-distillation/distill_classifier.py):

```bash
python distill_classifier.py \
--data_file <unlabeled_data.txt> \
--class_names_file <class_names.txt> \
--output_dir <output_dir>
```

`<unlabeled_data.txt>` should be a text file with a single unlabeled example per line. `<class_names.txt>` is a text file with one class name per line.
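As an illustration of the expected formats, the two input files could be written like this (the file names and example contents are placeholders):

```python
# class_names.txt: one candidate class name per line
with open("class_names.txt", "w") as f:
    f.write("\n".join(["the world", "sports", "business", "science/tech"]))

# unlabeled_data.txt: one unlabeled example per line
examples = [
    "A new moon has been discovered in Jupiter's orbit",
    "The home team won in overtime after a late penalty",
]
with open("unlabeled_data.txt", "w") as f:
    f.write("\n".join(examples))
```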
Other optional arguments include:

- `--teacher_name_or_path` (default: `roberta-large-mnli`): The name or path of the NLI teacher model.
- `--student_name_or_path` (default: `distilbert-base-uncased`): The name or path of the student model which will be fine-tuned to copy the teacher predictions.
- `--hypothesis_template` (default `"This example is {}."`): The template used to turn each label into an NLI-style hypothesis when generating teacher predictions. This template must include a `{}` or similar syntax for the candidate label to be inserted into the template. For example, the default template is `"This example is {}."` With the candidate label `sports`, this would be fed into the model like `[CLS] sequence to classify [SEP] This example is sports . [SEP]`.
- `--multi_class`: Whether or not multiple candidate labels can be true. By default, the scores are normalized such that the sum of the label likelihoods for each sequence is 1. If `--multi_class` is passed, the labels are considered independent and probabilities are normalized for each candidate by doing a softmax of the entailment score vs. the contradiction score. This is sometimes called "multi-class multi-label" classification.
- `--temperature` (default: `1.0`): The temperature applied to the softmax of the teacher model predictions. A higher temperature results in a student with smoother (lower confidence) predictions than the teacher, while a value `<1` results in a higher-confidence, peaked distribution. The default `1.0` is equivalent to no smoothing (see the sketch after this list for an illustration).
- `--teacher_batch_size` (default: `32`): The batch size used for generating a single set of teacher predictions. Does not affect training. Use `--per_device_train_batch_size` to change the training batch size.
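To make `--hypothesis_template` and `--temperature` more concrete, here is a small illustrative sketch; the logit values are made up, and dividing logits by the temperature before the softmax is simply the standard way such scaling is applied:

```python
import torch

hypothesis_template = "This example is {}."
class_names = ["the world", "sports", "business", "science/tech"]

# Each candidate label is inserted into the template to form an NLI hypothesis
hypotheses = [hypothesis_template.format(name) for name in class_names]
# ['This example is the world.', 'This example is sports.', ...]

# Temperature scaling: T > 1 smooths the teacher distribution, T < 1 sharpens it,
# and T = 1 leaves it unchanged
entailment_logits = torch.tensor([2.1, -0.3, 0.4, 3.0])  # made-up scores for the 4 labels
for temperature in (1.0, 2.0, 0.5):
    probs = torch.softmax(entailment_logits / temperature, dim=0)
    print(temperature, probs)
```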
Any of the arguments in the 🤗 Trainer's [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html?#trainingarguments) can also be modified, such as `--learning_rate`, `--fp16`, `--no_cuda`, `--warmup_steps`, etc. Run `python distill_classifier.py -h` for a full list of available arguments or consult the [Trainer documentation](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).
> **Note**: Distributed and TPU training are not currently supported. Single-node multi-GPU is supported, however, and will run automatically if multiple GPUs are available.
### Example: Topic classification

> A full colab demo notebook of this example can be found [here](https://colab.research.google.com/drive/1mjBjd0cR8G57ZpsnFCS3ngGyo5nCa9ya?usp=sharing).

Let's say we're interested in classifying news articles into one of four topic categories: "the world", "sports", "business", or "science/tech". We have an unlabeled dataset, [AG's News](https://huggingface.co/datasets/ag_news), which corresponds to this problem (in reality AG's News is annotated, but we will pretend it is not for the sake of example).
We can use an NLI model like `roberta-large-mnli` for zero-shot classification like so:

```python
>>> from transformers import pipeline
>>> class_names = ["the world", "sports", "business", "science/tech"]
>>> hypothesis_template = "This text is about {}."
>>> sequence = "A new moon has been discovered in Jupiter's orbit"
>>> zero_shot_classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")
>>> zero_shot_classifier(sequence, class_names, hypothesis_template=hypothesis_template)
{'sequence': "A new moon has been discovered in Jupiter's orbit",
 'labels': ['science/tech', 'the world', 'business', 'sports'],
 'scores': [0.7035840153694153, 0.18744826316833496, 0.06027870625257492, 0.04868902638554573]}
```
Unfortunately, inference is slow since each of our 4 class names must be fed through the large model for every sequence to be classified. But with our unlabeled data we can distill the model to a small distilbert classifier to make future inference much faster.

To run the script, we will need to put each training example (text only) from AG's News on its own line in `agnews/unlabeled.txt`, and each of the four class names in the newline-separated `agnews/class_names.txt`.
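One way to produce these two files with the `datasets` library (a sketch; the linked colab may prepare the data slightly differently) is:

```python
import os

from datasets import load_dataset

os.makedirs("agnews", exist_ok=True)
dataset = load_dataset("ag_news", split="train")

# One unlabeled training example per line (strip any stray newlines just in case)
with open("agnews/unlabeled.txt", "w") as f:
    f.write("\n".join(text.replace("\n", " ") for text in dataset["text"]))

# One class name per line, in the order we want the student's outputs
class_names = ["the world", "sports", "business", "science/tech"]
with open("agnews/class_names.txt", "w") as f:
    f.write("\n".join(class_names))
```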
Then we can run distillation with the following command:

```bash
python distill_classifier.py \
--data_file ./agnews/unlabeled.txt \
--class_names_file ./agnews/class_names.txt \
--teacher_name_or_path roberta-large-mnli \
--hypothesis_template "This text is about {}." \
--output_dir ./agnews/distilled
```
The script will generate a set of soft zero-shot predictions from `roberta-large-mnli` for each example in `agnews/unlabeled.txt`. It will then train a student distilbert classifier on the teacher predictions and save the resulting model in `./agnews/distilled`.
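Conceptually, this kind of distillation amounts to minimizing the cross-entropy between the student's predicted distribution and the teacher's soft labels. The snippet below is a minimal sketch of that objective, not the script's actual training loop (which runs through the 🤗 Trainer):

```python
import torch
import torch.nn.functional as F

# Soft zero-shot predictions from the teacher over K = 4 classes for a batch of 2 examples
teacher_probs = torch.tensor([[0.70, 0.19, 0.06, 0.05],
                              [0.10, 0.75, 0.05, 0.10]])

# Raw outputs of the student's K-way classification head for the same batch
student_logits = torch.randn(2, 4, requires_grad=True)

# Cross-entropy against soft targets: -sum_k p_teacher(k) * log p_student(k)
loss = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
loss.backward()
```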
The resulting model can then be loaded and used like any other pre-trained classifier:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("./agnews/distilled")
tokenizer = AutoTokenizer.from_pretrained("./agnews/distilled")
```
and even used trivially with a `TextClassificationPipeline`:

```python
>>> from transformers import TextClassificationPipeline
>>> distilled_classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
>>> distilled_classifier(sequence)
[[{'label': 'the world', 'score': 0.14899294078350067},
  {'label': 'sports', 'score': 0.03205857425928116},
  {'label': 'business', 'score': 0.05943061783909798},
  {'label': 'science/tech', 'score': 0.7595179080963135}]]
```
> Tip: pass `device=0` when constructing a pipeline to run on a GPU.
As we can see, the results of the student closely resemble those of the teacher, despite the student never having seen this example during training. Now let's do a quick & dirty speed comparison simulating 16K examples with a batch size of 16:
```python
%%time
for _ in range(1000):
    zero_shot_classifier([sequence] * 16, class_names)
# runs in 1m 23s on a single V100 GPU
```

```python
%%time
for _ in range(1000):
    distilled_classifier([sequence] * 16)
# runs in 10.3s on a single V100 GPU
```
As we can see, the distilled student model runs an order of magnitude faster than its teacher NLI model. This is also a setting where we only have `K=4` possible labels. The higher the number of classes for a given task, the more drastic the speedup will be, since the zero-shot teacher's complexity scales linearly with the number of classes.
Since we secretly have access to ground truth labels for AG's News, we can evaluate the accuracy of each model. The original zero-shot model `roberta-large-mnli` gets an accuracy of 69.3% on the held-out test set. After training a student on the unlabeled training set, the distilled model gets a similar score of 70.4%.
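A sketch of how such an evaluation could be run on the labeled test split is shown below; it assumes AG's News label ids follow the order World, Sports, Business, Sci/Tech, matching `class_names`:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model = AutoModelForSequenceClassification.from_pretrained("./agnews/distilled")
tokenizer = AutoTokenizer.from_pretrained("./agnews/distilled")
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)

test_set = load_dataset("ag_news", split="test")
class_names = ["the world", "sports", "business", "science/tech"]  # assumed to match the label id order

correct = 0
for example in test_set:
    scores = classifier(example["text"])[0]  # one dict per class with return_all_scores=True
    predicted = max(scores, key=lambda s: s["score"])["label"]
    correct += int(predicted == class_names[example["label"]])

print(f"student accuracy: {correct / len(test_set):.3f}")
```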
Lastly, you can share the distilled model with the community and/or use it with our inference API by [uploading it to the 🤗 Hub](https://huggingface.co/transformers/model_sharing.html). We've uploaded the distilled model from this example at [joeddav/distilbert-base-uncased-agnews-student](https://huggingface.co/joeddav/distilbert-base-uncased-agnews-student).
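If you distill your own model, one way to upload it is via `push_to_hub` (a sketch, assuming a recent version of `transformers` and that you have logged in with `huggingface-cli login`; the repository name is a placeholder):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./agnews/distilled")
tokenizer = AutoTokenizer.from_pretrained("./agnews/distilled")

# Uploads the model and tokenizer files to a repository under your username
model.push_to_hub("distilbert-base-uncased-agnews-student")
tokenizer.push_to_hub("distilbert-base-uncased-agnews-student")
```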