One of the “secrets” behind the success of Transformer models is the technique of Transfer Learning. In Transfer Learning, a model (in our case, a Transformer model) is pre-trained on a gigantic dataset using an unsupervised pre-training objective. This same model is then fine-tuned (typically with supervised training) on the actual task at hand. The beauty of this approach is that the fine-tuning dataset can be as small as 500–1000 training samples! A number small enough that it might be scoffed out of the room if one were to call it Deep Learning. This also means that the expensive and time-consuming part of the pipeline, pre-training, only needs to be done once, and the pre-trained model can be reused for any number of tasks thereafter. Since pre-trained models are typically made publicly available, we can grab the relevant model, fine-tune it on a custom dataset, and have a state-of-the-art model ready to go in a few hours!
If you are interested in learning how pre-training works and how you can train a brand new language model on a single GPU, check out my article linked below!
ELECTRA is one of the latest classes of pre-trained Transformer models released by Google, and it switches things up a bit compared to most other releases. For the most part, Transformer models have followed the well-trodden path of Deep Learning, with larger models, more training, and bigger datasets equalling better performance. ELECTRA, however, bucks this trend by outperforming earlier models like BERT while using less computational power, smaller datasets, and less training time. (In case you are wondering, ELECTRA is the same “size” as BERT).
In this article, we’ll look at how to use a pre-trained ELECTRA model for text classification, and we’ll compare it to other standard models along the way. Specifically, we’ll be comparing the final performance (measured by the Matthews correlation coefficient, or MCC; its definition is given after the list) and the training times for each model listed below.
- electra-small
- electra-base
- bert-base-cased
- distilbert-base-cased
- distilroberta-base
- roberta-base
- xlnet-base-cased
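For reference, the Matthews correlation coefficient used here is the standard binary-classification MCC, computed from the confusion matrix counts. It ranges from -1 to 1, with 1 indicating perfect predictions and 0 indicating performance no better than random guessing:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$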
As always, we’ll be doing this with the Simple Transformers library (based on the Hugging Face Transformers library) and we’ll be using Weights & Biases for visualizations.
You can find all the code used here in the examples directory of the library.
Installation
- Install Anaconda or Miniconda Package Manager from here.
- Create a new virtual environment and install packages.
conda create -n simpletransformers python pandas tqdm
conda activate simpletransformers
conda install pytorch cudatoolkit=10.1 -c pytorch
- Install Apex if you are using fp16 training. Please follow the instructions here.
- Install simpletransformers.
pip install simpletransformers
Data Preparation
We’ll be using the Yelp Review Polarity dataset, which is a binary classification dataset. The script below will download it and store it in the data directory. Alternatively, you can manually download the data from FastAI.
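The original download script isn’t reproduced here, but a minimal sketch of the idea, assuming the usual FastAI S3 mirror and archive layout (both assumptions; adjust them if they change), could look like this:

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed FastAI mirror for the Yelp Review Polarity archive.
DATA_URL = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
archive_path = data_dir / "yelp_review_polarity_csv.tgz"

if not archive_path.exists():
    # Download the compressed archive into the data directory.
    urllib.request.urlretrieve(DATA_URL, str(archive_path))

with tarfile.open(archive_path, "r:gz") as tar:
    # Extracts yelp_review_polarity_csv/train.csv and test.csv under data/.
    tar.extractall(data_dir)
```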
Hyperparameters
Once the data is in the data directory, we can start training our models.
Simple Transformers models can be configured extensively (see docs), but we’ll just be going with some basic, “good enough” hyperparameter settings. This is because we are more interested in comparing the models to each other on an equal footing, rather than trying to optimize for the absolute best hyperparameters for each model.

With that in mind, we’ll increase the train_batch_size to 128 and the num_train_epochs to 3 so that all models will have enough training to converge.
One caveat here is that the train_batch_size is reduced to 64 for XLNet as it cannot be trained on an RTX Titan GPU with train_batch_size=128. However, any effect of this discrepancy is minimized by setting gradient_accumulation_steps to 2, which changes the effective batch size to 128. (Gradients are calculated and the model weights are updated only once for every two steps.)
All other settings which affect training are unchanged from their defaults.
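As a rough sketch (not the author’s exact configuration), these settings map onto Simple Transformers arguments roughly as follows. The Weights & Biases project name is an assumption, and evaluate_during_training is inferred from the fact that evaluation scores are logged while training:

```python
from simpletransformers.classification import ClassificationArgs

model_args = ClassificationArgs(
    num_train_epochs=3,
    train_batch_size=128,           # reduced to 64 for xlnet-base-cased
    gradient_accumulation_steps=1,  # set to 2 for XLNet so the effective batch size stays 128
    evaluate_during_training=True,  # log evaluation scores (e.g. to W&B) while training
    wandb_project="transformer-comparison",  # assumed project name
)
# Everything else keeps the Simple Transformers defaults.
```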
Training the Models
Setting up the training process is quite simple. We just need the data loaded into Dataframes and the hyperparameters defined and we are off to the races!
For convenience, I’m using the same script to train all models as we only need to change the model names between each run. The model names are supplied by a shell script which also automatically runs the training script for each model.
The training script is given below:
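The embedded script isn’t shown here; a minimal sketch of a comparable training script, which takes the model type and name as command-line arguments, might look like the following. The file name train.py, the CSV paths, and the W&B project name are assumptions; the column layout follows the dataset’s standard headerless CSV format:

```python
import sys

import pandas as pd
from simpletransformers.classification import ClassificationArgs, ClassificationModel

# Model type and name are supplied by the wrapper script, e.g.
#   python train.py electra google/electra-base-discriminator
model_type, model_name = sys.argv[1], sys.argv[2]

# The Yelp Review Polarity CSVs have no header; column 0 is the label, column 1 is the review text.
train_df = pd.read_csv("data/yelp_review_polarity_csv/train.csv", header=None, names=["labels", "text"])
eval_df = pd.read_csv("data/yelp_review_polarity_csv/test.csv", header=None, names=["labels", "text"])

# Shift the labels from [1, 2] to [0, 1] (see the note below).
train_df["labels"] = train_df["labels"] - 1
eval_df["labels"] = eval_df["labels"] - 1

model_args = ClassificationArgs(
    num_train_epochs=3,
    train_batch_size=64 if model_type == "xlnet" else 128,
    gradient_accumulation_steps=2 if model_type == "xlnet" else 1,
    evaluate_during_training=True,
    wandb_project="transformer-comparison",  # assumed project name
)

model = ClassificationModel(model_type, model_name, args=model_args)
model.train_model(train_df, eval_df=eval_df)

result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)  # includes the MCC score
```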
Note that the Yelp Reviews Polarity dataset uses the labels [1, 2] for negative and positive, respectively. I’m changing this to [0, 1] for negative and positive, respectively. Simple Transformers requires the labels to start from 0 (duh!), and a label of 0 for negative sentiment is a lot more intuitive (in my opinion).
The bash script which can automate the entire process:
Note that you can remove the saved models at each stage by adding rm -r outputs to the bash script. This might be a good idea if you don’t have much disk space to spare.
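The original bash wrapper isn’t reproduced here either; as a stand-in, a short Python driver that loops over the model list and cleans up between runs could look like this (train.py refers to the assumed training script sketched above):

```python
import shutil
import subprocess

# (model_type, model_name) pairs understood by Simple Transformers / the Hugging Face model hub.
models = [
    ("electra", "google/electra-small-discriminator"),
    ("electra", "google/electra-base-discriminator"),
    ("bert", "bert-base-cased"),
    ("distilbert", "distilbert-base-cased"),
    ("roberta", "distilroberta-base"),  # DistilRoBERTa uses the RoBERTa architecture
    ("roberta", "roberta-base"),
    ("xlnet", "xlnet-base-cased"),
]

for model_type, model_name in models:
    subprocess.run(["python", "train.py", model_type, model_name], check=True)
    # Equivalent of `rm -r outputs`: delete the saved model to free up disk space.
    shutil.rmtree("outputs", ignore_errors=True)
```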
The training script will also log the evaluation scores to Weights & Biases, letting us compare models easily.
For more information on training classification models, check out the Simple Transformers docs.
Results
You can find all my results here. Try playing around with the different graphs and information available!
Let’s go through the important results.
Final Scores
These are the final MCC scores obtained by each model. As you can see, the scores are quite close to each other for all the models.
To get a better view of the differences, the chart below zooms into the X-axis and shows only the range 0.88–0.94.
Note that a zoomed-in view, while helpful for spotting differences, can distort the perception of the results. Therefore, the chart below is for illustrative purposes only. Beware the graph that hides its zeros!
The roberta-base model leads the pack, with xlnet-base close behind. The distilroberta-base and electra-base models follow next, with barely anything between them. Honestly, the difference between the two is probably more due to random chance than anything else in this case. Bringing up the rear, we have bert-base-cased, distilbert-base-cased, and electra-small, respectively.
Looking at the actual values shows how close they are.
In this experiment, RoBERTa seems to outperform the other models. However, I’m willing to bet that with some tricks like hyperparameter tuning and ensembling, the ELECTRA model is capable of making up the difference. This is confirmed by the current GLUE benchmark leaderboard where ELECTRA is sitting above RoBERTa.
It is important to keep in mind that the ELECTRA model required substantially fewer pre-training resources (about a quarter) compared to RoBERTa. This is true for distilroberta-base as well; even though the distilroberta-base model is comparatively smaller, you need the original roberta-base model before you can distil it into distilroberta-base.
The XLNet model is nearly keeping pace with the RoBERTa model but it requires far more computational resources than all other models shown here (see training time graph).
The venerable (although less than two years old) BERT model is starting to show its age and is outperformed by all but the electra-small model.

The electra-small model, although not quite matching the standards of the other models, still performs admirably. As might be expected, it trains the fastest, has the smallest memory requirements, and is the fastest at inference.
Speaking of training times…
The speed of training is determined mostly by the size (number of parameters) of the model, except in the case of XLNet. The training algorithm used with XLNet makes it significantly slower than the comparative BERT, RoBERTa, and ELECTRA models, despite having roughly the same number of parameters. The GPU memory requirement for XLNet is also higher compared to the other models tested here, necessitating the use of a smaller training batch size as noted earlier (64 compared to 128 for the other models).
The inference times (not tested here) should also follow this general trend.
Finally, another important consideration is how quickly each of the models converges. All these models were trained for 3 full epochs without using early stopping.
Evidently, there is no discernible difference between the models with regard to how many training steps are required for convergence. All the models seem to be converging around 9000 training steps. Of course, the time taken to converge would vary due to the difference in training speed.
Conclusion
It’s a tough call to choose between different Transformer models. However, we can still gain some valuable insights from the experiment we’ve seen.
- ELECTRA models can be outperformed by older models depending on the situation. However, their strength lies in reaching competitive performance levels with significantly fewer computational resources used for pre-training.
- The ELECTRA paper indicates that the electra-small model significantly outperforms a similar-sized BERT model.
- Distilled versions of Transformer models sacrifice a few accuracy points for the sake of quicker training and inference. This may be a desirable exchange in some situations.
- XLNet sacrifices speed of training and inference in exchange for potentially better performance on complex tasks.
Based on these insights, I can offer the following recommendations (although they should be taken with a grain of salt as results may vary between different datasets).
- distilroberta-base
It would be interesting to see if the large models also follow this trend. I hope to test this out in a future article (where T5 might also be thrown into the mix)!
If you would like to see some more in-depth analysis regarding the training and inference speeds of different models, check out my earlier article (sadly, no ELECTRA) linked below.