The Guide to Multi-Tasking with the T5 Transformer
The T5 Transformer can perform any NLP task. It can perform multiple tasks, at the same time, with the same model. Here’s how!
The T5 (Text-To-Text Transfer Transformer) model was the product of a large-scale study (paper) conducted to explore the limits of transfer learning. It builds upon popular architectures like GPT, BERT, and RoBERTa (to name only a few) that used transfer learning with incredible success. While BERT-like models can be fine-tuned to perform a variety of tasks, the constraints of the architecture mean that each model can perform only one task.
Typically, this is done by adding a task-specific layer on top of the Transformer model. For example, a BERT Transformer can be adapted for binary classification by adding a fully-connected layer with two output neurons (corresponding to each class). The T5 model departs from this tradition by reframing all NLP tasks as text-to-text tasks. This results in a shared framework for any NLP task, as the input to the model and the output from the model are always strings. In the binary classification example, the T5 model will simply output a string representation for the class (i.e. "0" or "1").
Since the input and output formats are identical for any NLP task, the same T5 model can be taught to perform multiple tasks! To specify which task should be performed, we can simply prepend a prefix (string) to the input of the model. The animation in the Google AI Blog article (linked in the references) demonstrates this concept.
In this article, we’ll be using this technique to train a single T5 model capable of performing 3 NLP tasks: binary classification, multi-label classification, and regression.
All code can also be found on Github.
Task Specification
Binary Classification
The goal of binary classification in NLP is to classify a given text sequence into one of two classes. In our task, we will be using the Yelp Reviews dataset to classify the sentiment of the text as either positive ("1") or negative ("0").
Multi-label Classification
In multi-label classification, a given text sequence should be labeled with the correct subset of a set of pre-defined labels (note that the subset can include both the null set and the full set of labels itself). For this, we will be using the Toxic Comments dataset where each text can be labeled with any subset of the labels toxic, severe_toxic, obscene, threat, insult, identity_hate.
Regression
In regression tasks, the target variable is a continuous value. In our task, we will use the STS-B (Semantic Textual Similarity Benchmark) dataset where the goal is to predict the similarity of two sentences. The similarity is denoted by a continuous value between 0 and 5.
Data Preparation
Since we are going to be working with 3 datasets, we’ll put them in 3 separate subdirectories inside the data directory:
data/binary_classification
data/multilabel_classification
data/regression
Downloading
- Download the Yelp Reviews Dataset.
- Extract train.csv and test.csv to data/binary_classification.
- Download the Toxic Comments dataset.
- Extract the csv files to data/multilabel_classification.
- Download the STS-B dataset.
- Extract the csv files to data/regression.
Combining the datasets
As mentioned earlier, the inputs and outputs of a T5 model are always text. A particular task is specified by a prefix text that lets the model know what it should do with the input.
The input data format for a T5 model in Simple Transformers reflects this fact. The input is a Pandas dataframe with 3 columns: prefix, input_text, and target_text. This makes it quite easy to train the model on multiple tasks, as you just need to change the prefix.
The data preparation notebook (included in the Github repo) loads each of the datasets, preprocesses them for T5, and finally combines them into a unified dataframe.
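As a rough illustration, the preprocessing for the binary classification data might look something like the sketch below. The file layout, column names, and label mapping are assumptions about the Yelp Review Polarity CSVs, not the exact notebook code:

```python
import pandas as pd

# Load the Yelp Reviews data (assumed layout: no header, columns = label, text;
# labels assumed to be 1 = negative, 2 = positive).
yelp_train = pd.read_csv(
    "data/binary_classification/train.csv", header=None, names=["label", "text"]
)

# Reframe as text-to-text: the sentiment label becomes the target string ("0" or "1").
binary_df = pd.DataFrame(
    {
        "prefix": "binary classification",
        "input_text": yelp_train["text"],
        "target_text": (yelp_train["label"] - 1).astype(str),
    }
)

# The Toxic Comments and STS-B data are prepared the same way (with the prefixes
# "multilabel classification" and "similarity"), concatenated with binary_df,
# shuffled, and written out as train.tsv / eval.tsv.
train_df = binary_df.sample(frac=1)
train_df.to_csv("data/train.tsv", sep="\t", index=False)
```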
This gives us a dataframe with 3 unique prefixes, namely binary classification, multilabel classification, and similarity. Note that the prefixes themselves are fairly arbitrary; the important thing is to ensure that each task has its own unique prefix. The input to the model will take the following format:
<prefix>: <input_text>
The ": "
is automatically added when training.
A few other things to note:
- The output of the multilabel classification task is a comma-separated list of the predicted labels (toxic, severe_toxic, obscene, threat, insult, identity_hate). If no label is predicted, the output should be clean.
- The input_text for the similarity task includes both sentences, as shown in the following example:
sentence1: A man plays the guitar. sentence2: The man sang and played his guitar.
- The output of the similarity task is a number (as a string) between 0.0 and 5.0, in increments of 0.2 (e.g. 0.0, 0.4, 3.0, 5.0). This follows the same format used by the authors of the T5 paper. A quick sketch of how these target strings can be built follows this list.
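For concreteness, here is a minimal sketch of how the target_text strings could be built for the multilabel and similarity tasks. The helper names and the assumption that the Toxic Comments labels are 0/1 columns are mine, not taken from the original notebook:

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def multilabel_target(row):
    """Join the names of the active labels; fall back to 'clean' when none apply."""
    active = [label for label in LABELS if row[label] == 1]
    return ", ".join(active) if active else "clean"

def similarity_target(score):
    """Round an STS-B score to the nearest 0.2 increment and render it as a string."""
    return str(round(score * 5) / 5)

print(multilabel_target({"toxic": 1, "severe_toxic": 0, "obscene": 1,
                         "threat": 0, "insult": 0, "identity_hate": 0}))  # toxic, obscene
print(similarity_target(2.76))  # 2.8
```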
As you can see from the way the different inputs and outputs are represented, the T5 model’s text-to-text approach gives us a great deal of flexibility both in terms of representing various tasks and in terms of the actual tasks we can perform.
The only limitation is imagination! (Well, imagination and compute resources, but that’s another story.)
Getting back to the data, running the notebook should have given you a train.tsv and an eval.tsv file, which we’ll be using to train our model in the next section!
Setup
We will be using the Simple Transformers library (based on the Hugging Face Transformers ) to train the T5 model.
The instructions given below will install all the requirements.
- Install Anaconda or Miniconda Package Manager from here .
- Create a new virtual environment and install packages.
conda create -n simpletransformers python
conda activate simpletransformers
conda install pytorch cudatoolkit=10.1 -c pytorch
- Install simpletransformers.
pip install simpletransformers
See installation docs
Training the T5 Model
As always, training the model with Simple Transformers is quite straightforward.
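The full training script is in the repo; the sketch below shows roughly what it looks like with the Simple Transformers T5Model API. The max_seq_length, batch sizes, evaluation interval, and wandb project name are illustrative assumptions rather than the exact values used:

```python
import pandas as pd
from simpletransformers.t5 import T5Model

# The tab-separated files produced by the data preparation step.
train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "max_seq_length": 196,                    # illustrative value
    "train_batch_size": 16,                   # illustrative value
    "eval_batch_size": 64,                    # illustrative value
    "num_train_epochs": 1,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,  # illustrative value
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "fp16": False,
    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "wandb_project": "T5 mixed tasks",        # assumed project name
}

# Fine-tune a pre-trained t5-base checkpoint on the combined dataframe.
model = T5Model("t5", "t5-base", args=model_args)
model.train_model(train_df, eval_data=eval_df)
```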
Most of the arguments used here are fairly standard.
- max_seq_length: Chosen such that most samples are not truncated. Increasing the sequence length significantly affects the memory consumption of the model, so it’s usually best to keep it as short as possible (ideally without truncating the input sequences).
- train_batch_size: The bigger the better (as long as it fits on your GPU).
- eval_batch_size: Same deal as train_batch_size.
- num_train_epochs: Training for more than 1 epoch would probably improve the model’s performance, but it would obviously increase the training time as well (about 7 hours per epoch on an RTX Titan).
- evaluate_during_training: We’ll periodically test the model against the test data to see how it’s learning.
- evaluate_during_training_steps: The aforementioned period at which the model is tested.
- evaluate_during_training_verbose: Show us the results when a test is done.
- use_multiprocessing: Using multiprocessing significantly reduces the time taken for tokenization (done before training starts); however, this currently causes issues with the T5 implementation. So, no multiprocessing for now.
- fp16: FP16 (mixed-precision) training reduces the memory consumption of training the models (meaning larger batch sizes are possible). Unfortunately, fp16 training is not stable with T5 at the moment, so it’s turned off as well.
- save_steps: Setting this to -1 means that checkpoints aren’t saved.
- save_eval_checkpoints: By default, a model checkpoint will be saved when an evaluation is performed during training. Since this experiment is being done for demonstration only, let’s not waste space on saving these checkpoints either.
- save_model_every_epoch: We only have 1 epoch, so we don’t need this one either.
- reprocess_input_data: Controls whether the features are loaded from cache (saved to disk) or whether tokenization is done again on the input sequences. It only really matters when doing multiple runs.
- overwrite_output_dir: This will overwrite any previously saved models if they are in the same output directory.
- wandb_project: Used for visualization of training progress.
Speaking of visualization, you can check my training progress here . Shoutout to W&B for their awesome library!
Testing the T5 model
Considering the fact that we are dealing with multiple tasks, it’s a good idea to use suitable metrics to evaluate each task. With that in mind, we’ll be using the following metrics (a rough sketch of computing them follows the list):
- Binary Classification: F1 score and Accuracy score
- Multilabel Classification: F1 score (Hugging Face SQuAD metrics implementation) and Exact matches (Hugging Face SQuAD metrics implementation)
- Similarity: Pearson correlation coefficient and Spearman correlation
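As a sketch, the binary classification and similarity scores can be computed with standard sklearn/scipy functions; the SQuAD-style F1 and exact-match helpers used for the multilabel task come from the Hugging Face squad metrics code and are not reproduced here. The function names below are my own:

```python
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import pearsonr, spearmanr

def binary_scores(truth, preds):
    # truth and preds are lists of "0"/"1" strings produced by the model
    return {
        "f1": f1_score(truth, preds, pos_label="1"),
        "accuracy": accuracy_score(truth, preds),
    }

def similarity_scores(truth, preds):
    # truth and preds are similarity values rendered as strings, e.g. "2.8"
    truth = [float(t) for t in truth]
    preds = [float(p) for p in preds]
    return {
        "pearson": pearsonr(truth, preds)[0],
        "spearman": spearmanr(truth, preds)[0],
    }
```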
Note that a ": " is inserted between the prefix and the input_text when preparing the data. This is done automatically when training but needs to be handled manually for prediction.
If you’d like to read more about the decoding arguments (num_beams, do_sample, max_length, top_k, top_p), please refer to this article.
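Putting it together, prediction might look like the sketch below. The output directory and the decoding values are assumptions (Simple Transformers writes the fine-tuned model to outputs/ by default):

```python
import pandas as pd
from simpletransformers.t5 import T5Model

eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "eval_batch_size": 64,   # illustrative value
    "max_length": 20,        # this and the decoding arguments below are illustrative
    "num_beams": 1,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
}

# Load the fine-tuned model from the (assumed) training output directory.
model = T5Model("t5", "outputs", args=model_args)

# The ": " separator has to be added manually for prediction.
to_predict = [
    prefix + ": " + input_text
    for prefix, input_text in zip(eval_df["prefix"], eval_df["input_text"])
]
predictions = model.predict(to_predict)
```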
Time to see how our model did!
-----------------------------------
Results:
Scores for binary classification:
F1 score: 0.96044512420231
Accuracy Score: 0.9605263157894737
Scores for multilabel classification:
F1 score: 0.923048001002632
Exact matches: 0.923048001002632
Scores for similarity:
Pearson Correlation: 0.8673017763553101
Spearman Correlation: 0.8644328787107548
The model performs quite well on each task, despite being trained on 3 separate tasks! We’ll take a quick look at how we can try to improve the performance of the model even more in the next section.
Closing Thoughts
Possible improvements
A potential issue that arises when mixing tasks is the discrepancy between the sizes of the datasets used for each task. We can see this issue in our dataset by taking a look at the training sample counts.
binary classification 560000
multilabel classification 143613
similarity 5702
The dataset is substantially unbalanced, with the plight of the similarity task seeming particularly dire! This can be clearly seen in the evaluation scores, where the similarity task lags behind the others (although it’s important to note that we are not looking at the same metrics between the tasks).
A possible remedy to this problem would be to oversample the data for the similarity task so that the model sees a more balanced mix of the three tasks during training; a minimal sketch of this is shown below.
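This is one simple (hypothetical) way to do it with pandas; the duplication factor is arbitrary and would need tuning:

```python
import pandas as pd

train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)

# Duplicate the similarity rows so the task is better represented in the mixture.
similarity_df = train_df[train_df["prefix"] == "similarity"]
oversampled_df = pd.concat([train_df] + [similarity_df] * 10).sample(frac=1)
```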
In addition to this, increasing the number of training epochs (and tuning other hyperparameters) is also likely to improve the model.
Finally, tuning the decoding parameters could also lead to better results.
Wrapping up
The text-to-text format of the T5 model paves the way to apply Transformers and NLP to a wide variety of tasks with next to no customization necessary. The T5 model performs strongly even when the same model is used to perform multiple tasks!
Hopefully, this will lead to many innovative applications in the near future.
References
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — https://arxiv.org/abs/1910.10683
- Google AI Blog — https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html