The Guide to Multi-Tasking with the T5 Transformer
The T5 Transformer can perform any NLP task. It can perform multiple tasks, at the same time, with the same model. Here’s how!
The T5 (Text-To-Text Transfer Transformer) model was the product of a large-scale study (paper) conducted to explore the limits of transfer learning. It builds upon popular architectures like GPT, BERT, and RoBERTa (to name only a few) that used transfer learning with incredible success. While BERT-like models can be fine-tuned to perform a variety of tasks, the constraints of the architecture mean that each model can perform only one task.
Typically, this is done by adding a task-specific layer on top of the Transformer model. For example, a BERT Transformer can be adapted for binary classification by adding a fully-connected layer with two output neurons (corresponding to each class). The T5 model departs from this tradition by reframing all NLP tasks as text-to-text tasks. This results in a shared framework for any NLP task, as the input to the model and the output from the model are always strings. In the binary classification example, the T5 model will simply output a string representation for the class (i.e. "0" or "1").
Since the input and output formats are identical for any NLP task, the same T5 model can be taught to perform multiple tasks! To specify which task should be performed, we can simply prepend a prefix (string) to the input of the model. The animation in the Google AI Blog article (linked in the references) demonstrates this concept.
In this article, we’ll be using this technique to train a single T5 model capable of performing 3 NLP tasks: binary classification, multi-label classification, and regression.
All code can also be found on Github.
Task Specification
Binary Classification
The goal of binary classification in NLP is to classify a given text sequence into one of two classes. In our task, we will be using the Yelp Reviews dataset to classify the sentiment of the text as either positive ("1") or negative ("0").
Multi-label Classification
In multi-label classification, a given text sequence should be labeled with the correct subset of a set of pre-defined labels (note that the subset can include both the null set and the full set of labels itself). For this, we will be using the Toxic Comments dataset where each text can be labeled with any subset of the labels toxic, severe_toxic, obscene, threat, insult, identity_hate.
Regression
In regression tasks, the target variable is a continuous value. In our task, we will use the STS-B (Semantic Textual Similarity Benchmark) dataset where the goal is to predict the similarity of two sentences. The similarity is denoted by a continuous value between 0 and 5.
Data Preparation
Since we are going to be working with 3 datasets, we’ll put them in 3 separate subdirectories inside the data directory:
data/binary_classification
data/multilabel_classification
data/regression
Downloading
- Download the Yelp Reviews Dataset.
- Extract train.csv and test.csv to data/binary_classification.
- Download the Toxic Comments dataset.
- Extract the csv files to data/multilabel_classification.
- Download the STS-B dataset.
- Extract the csv files to data/regression.
Combining the datasets
As mentioned earlier, the inputs and outputs of a T5 model are always text. A particular task is specified by a prefix text that lets the model know what it should do with the input.
The input data format for a T5 model in Simple Transformers reflects this fact. The input is a Pandas dataframe with 3 columns: prefix, input_text, and target_text. This makes it quite easy to train the model on multiple tasks, as you just need to change the prefix.
The data preparation notebook (included in the Github repo) loads each of the datasets, preprocesses them for T5, and finally combines them into a unified dataframe.
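As a rough illustration, the preprocessing for the binary classification data might look something like the sketch below. The file layout, column names, and label mapping are assumptions about the Yelp Review Polarity CSVs, not the exact notebook code:

```python
import pandas as pd

# Load the Yelp Reviews data (assumed layout: no header, columns = label, text;
# labels assumed to be 1 = negative, 2 = positive).
yelp_train = pd.read_csv(
    "data/binary_classification/train.csv", header=None, names=["label", "text"]
)

# Reframe as text-to-text: the sentiment label becomes the target string ("0" or "1").
binary_df = pd.DataFrame(
    {
        "prefix": "binary classification",
        "input_text": yelp_train["text"],
        "target_text": (yelp_train["label"] - 1).astype(str),
    }
)

# The Toxic Comments and STS-B data are prepared the same way (with the prefixes
# "multilabel classification" and "similarity"), concatenated with binary_df,
# shuffled, and written out as train.tsv / eval.tsv.
train_df = binary_df.sample(frac=1)
train_df.to_csv("data/train.tsv", sep="\t", index=False)
```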
This gives us a dataframe with 3 unique prefixes, namely binary classification, multilabel classification, and similarity. Note that the prefixes themselves are fairly arbitrary; the important thing is to ensure that each task has its own unique prefix. The input to the model will take the following format:
<prefix>: <input_text>
The ": "
is automatically added when training.
A few other things to note:
- The output of the multilabel classification task is a comma-separated list of the predicted labels (toxic, severe_toxic, obscene, threat, insult, identity_hate). If no label is predicted, the output should be clean.
- The input_text for the similarity task includes both sentences, as shown in the following example:
sentence1: A man plays the guitar. sentence2: The man sang and played his guitar.
- The output of the similarity task is a number (as a string) between 0.0 and 5.0, in increments of 0.2 (e.g. 0.0, 0.4, 3.0, 5.0). This follows the same format used by the authors of the T5 paper. A quick sketch of how these target strings can be built follows this list.
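For concreteness, here is a minimal sketch of how the target_text strings could be built for the multilabel and similarity tasks. The helper names and the assumption that the Toxic Comments labels are 0/1 columns are mine, not taken from the original notebook:

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def multilabel_target(row):
    """Join the names of the active labels; fall back to 'clean' when none apply."""
    active = [label for label in LABELS if row[label] == 1]
    return ", ".join(active) if active else "clean"

def similarity_target(score):
    """Round an STS-B score to the nearest 0.2 increment and render it as a string."""
    return str(round(score * 5) / 5)

print(multilabel_target({"toxic": 1, "severe_toxic": 0, "obscene": 1,
                         "threat": 0, "insult": 0, "identity_hate": 0}))  # toxic, obscene
print(similarity_target(2.76))  # 2.8
```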
As you can see from the way the different inputs and outputs are represented, the T5 model’s text-to-text approach gives us a great deal of flexibility both in terms of representing various tasks and in terms of the actual tasks we can perform.
The only limitation is imagination! (Well, imagination and compute resources, but that’s another story.)
Getting back to the data, running the notebook should have given you a train.tsv and an eval.tsv file, which we’ll be using to train our model in the next section!
Setup
We will be using the Simple Transformers library (based on the Hugging Face Transformers ) to train the T5 model.
The instructions given below will install all the requirements.
- Install Anaconda or Miniconda Package Manager from here .
- Create a new virtual environment and install packages.
conda create -n simpletransformers python
conda activate simpletransformers
conda install pytorch cudatoolkit=10.1 -c pytorch
- Install simpletransformers.
pip install simpletransformers
See installation docs
Training the T5 Model
As always, training the model with Simple Transformers is quite straightforward.
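The full training script is in the repo; the sketch below shows roughly what it looks like with the Simple Transformers T5Model API. The max_seq_length, batch sizes, evaluation interval, and wandb project name are illustrative assumptions rather than the exact values used:

```python
import pandas as pd
from simpletransformers.t5 import T5Model

# The tab-separated files produced by the data preparation step.
train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "max_seq_length": 196,                    # illustrative value
    "train_batch_size": 16,                   # illustrative value
    "eval_batch_size": 64,                    # illustrative value
    "num_train_epochs": 1,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,  # illustrative value
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "fp16": False,
    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "wandb_project": "T5 mixed tasks",        # assumed project name
}

# Fine-tune a pre-trained t5-base checkpoint on the combined dataframe.
model = T5Model("t5", "t5-base", args=model_args)
model.train_model(train_df, eval_data=eval_df)
```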
Most of the arguments used here are fairly standard.
- max_seq_length: Chosen such that most samples are not truncated. Increasing the sequence length significantly affects the memory consumption of the model, so it’s usually best to keep it as short as possible (ideally without truncating the input sequences).
- train_batch_size: The bigger the better (as long as it fits on your GPU).
- eval_batch_size: Same deal as train_batch_size.
- num_train_epochs: Training for more than 1 epoch would probably improve the model’s performance, but it would obviously increase the training time as well (about 7 hours per epoch on an RTX Titan).
- evaluate_during_training: We’ll periodically test the model against the test data to see how it’s learning.
- evaluate_during_training_steps: The aforementioned period at which the model is tested.
- evaluate_during_training_verbose: Show us the results when a test is done.
- use_multiprocessing: Using multiprocessing significantly reduces the time taken for tokenization (done before training starts); however, this currently causes issues with the T5 implementation. So, no multiprocessing for now.
- fp16: FP16 (mixed-precision) training reduces the memory consumption of training the models (meaning larger batch sizes are possible). Unfortunately, fp16 training is not stable with T5 at the moment, so it’s turned off as well.
- save_steps: Setting this to -1 means that checkpoints aren’t saved.
- save_eval_checkpoints: By default, a model checkpoint will be saved when an evaluation is performed during training. Since this experiment is being done for demonstration only, let’s not waste space on saving these checkpoints either.
- save_model_every_epoch: We only have 1 epoch, so we don’t need this one either.
- reprocess_input_data: Controls whether the features are loaded from cache (saved to disk) or whether tokenization is done again on the input sequences. It only really matters when doing multiple runs.
- overwrite_output_dir: This will overwrite any previously saved models if they are in the same output directory.
- wandb_project: Used for visualization of training progress.
Speaking of visualization, you can check my training progress here . Shoutout to W&B for their awesome library!
Testing the T5 model
Considering the fact that we are dealing with multiple tasks, it’s a good idea to use suitable metrics to evaluate each task. With that in mind, we’ll be using the following metrics (a rough sketch of computing them follows the list):
- Binary Classification: F1 score and Accuracy score
- Multilabel Classification: F1 score (Hugging Face SQuAD metrics implementation) and Exact matches (Hugging Face SQuAD metrics implementation)
- Similarity: Pearson correlation coefficient and Spearman correlation
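As a sketch, the binary classification and similarity scores can be computed with standard sklearn/scipy functions; the SQuAD-style F1 and exact-match helpers used for the multilabel task come from the Hugging Face squad metrics code and are not reproduced here. The function names below are my own:

```python
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import pearsonr, spearmanr

def binary_scores(truth, preds):
    # truth and preds are lists of "0"/"1" strings produced by the model
    return {
        "f1": f1_score(truth, preds, pos_label="1"),
        "accuracy": accuracy_score(truth, preds),
    }

def similarity_scores(truth, preds):
    # truth and preds are similarity values rendered as strings, e.g. "2.8"
    truth = [float(t) for t in truth]
    preds = [float(p) for p in preds]
    return {
        "pearson": pearsonr(truth, preds)[0],
        "spearman": spearmanr(truth, preds)[0],
    }
```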
Note that a ": " is inserted between the prefix and the input_text when preparing the data. This is done automatically when training but needs to be handled manually for prediction.
If you’d like to read more about the decoding arguments (num_beams, do_sample, max_length, top_k, top_p), please refer to this article.
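Putting it together, prediction might look like the sketch below. The output directory and the decoding values are assumptions (Simple Transformers writes the fine-tuned model to outputs/ by default):

```python
import pandas as pd
from simpletransformers.t5 import T5Model

eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "eval_batch_size": 64,   # illustrative value
    "max_length": 20,        # this and the decoding arguments below are illustrative
    "num_beams": 1,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
}

# Load the fine-tuned model from the (assumed) training output directory.
model = T5Model("t5", "outputs", args=model_args)

# The ": " separator has to be added manually for prediction.
to_predict = [
    prefix + ": " + input_text
    for prefix, input_text in zip(eval_df["prefix"], eval_df["input_text"])
]
predictions = model.predict(to_predict)
```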
Time to see how our model did!
-----------------------------------
Results:
Scores for binary classification:
F1 score: 0.96044512420231
Accuracy Score: 0.9605263157894737
Scores for multilabel classification:
F1 score: 0.923048001002632
Exact matches: 0.923048001002632
Scores for similarity:
Pearson Correlation: 0.8673017763553101
Spearman Correlation: 0.8644328787107548
The model performs quite well on each task, despite being trained on 3 separate tasks! We’ll take a quick look at how we can try to improve the performance of the model even more in the next section.
Closing Thoughts
Possible improvements
A potential issue that arises when mixing tasks is the discrepancy between the sizes of the datasets used for each task. We can see this issue in our dataset by taking a look at the training sample counts.
binary classification 560000
multilabel classification 143613
similarity 5702
The dataset is substantially unbalanced, with the plight of the similarity task seeming particularly dire! This can be clearly seen in the evaluation scores, where the similarity task lags behind the others (although it’s important to note that we are not looking at the same metrics between the tasks).
A possible remedy to this problem would be to oversample the data for the similarity task so that the model sees a more balanced mix of the three tasks during training; a minimal sketch of this is shown below.
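This is one simple (hypothetical) way to do it with pandas; the duplication factor is arbitrary and would need tuning:

```python
import pandas as pd

train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)

# Duplicate the similarity rows so the task is better represented in the mixture.
similarity_df = train_df[train_df["prefix"] == "similarity"]
oversampled_df = pd.concat([train_df] + [similarity_df] * 10).sample(frac=1)
```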
In addition to this, increasing the number of training epochs (and tuning other hyperparameters) is also likely to improve the model.
Finally, tuning the decoding parameters could also lead to better results.
Wrapping up
The text-to-text format of the T5 model paves the way to apply Transformers and NLP to a wide variety of tasks with next to no customization necessary. The T5 model performs strongly even when the same model is used to perform multiple tasks!
Hopefully, this will lead to many innovative applications in the near future.
References
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — https://arxiv.org/abs/1910.10683
- Google AI Blog — https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html