Evolution of NLP — Part 3 — Transfer Learning Using ULMFit
Introduction to Transfer Learning for NLP using fast.ai
This is the third part of a series of posts showing the improvements in NLP modeling approaches. We have seen the use of traditional techniques like Bag of Words and TF-IDF, then moved on to RNNs and LSTMs. This time we’ll look into one of the pivotal shifts in approaching NLP tasks: Transfer Learning!
The complete code for this tutorial is available at this Kaggle Kernel
The idea of using Transfer Learning is quite new in NLP tasks, although it has long been prominent in Computer Vision tasks! This new way of looking at NLP was proposed by Jeremy Howard and Sebastian Ruder, and it has changed the way we approach text data!
The core idea is two-fold: a generative pre-trained Language Model followed by task-specific fine-tuning. This combination was first explored in ULMFiT (Howard & Ruder, 2018), directly motivated by the success of ImageNet pre-training for computer vision tasks. The base model is AWD-LSTM.
A Language Model is exactly what it sounds like: a model trained to predict the next word of a sentence. The goal is to have a model that captures the semantics, grammar, and unique structure of a language.
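To make "predict the next word" concrete, here is a minimal sketch of the simplest possible language model: a bigram counter. It is nothing like AWD-LSTM, and the toy corpus and function names are made up for illustration, but the input/output contract is the same: given the words so far, guess the next one.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count, for every word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the movie was great",
    "the movie was long",
    "the movie was great fun",
]
model = train_bigram_lm(corpus)
print(predict_next(model, "was"))
```

A neural language model replaces the count table with learned weights, which is what lets it generalise to word sequences it has never seen.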
ULMFiT follows three steps to achieve good transfer learning results on downstream language classification tasks:
- General Language Model pre-training: on Wikipedia text.
- Target task Language Model fine-tuning: ULMFiT proposed two training techniques for stabilizing the fine-tuning process.
- Target task classifier fine-tuning: The pretrained LM is augmented with two standard feed-forward layers and a softmax normalization at the end to predict a target label distribution.
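The classifier head from the third step can be sketched in plain Python. This is only an illustration of "two feed-forward layers and a softmax": the weights and feature values below are made-up numbers, and details such as dropout and batch normalisation are left out.

```python
import math

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_head(features, w1, b1, w2, b2):
    """Two feed-forward layers followed by a softmax over the labels."""
    hidden = relu(linear(features, w1, b1))
    return softmax(linear(hidden, w2, b2))

# Hypothetical numbers: 3 features from the LM -> 2 hidden units -> 2 labels.
features = [0.5, -1.0, 2.0]
w1 = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]]
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]
probs = classifier_head(features, w1, b1, w2, b2)
print(probs)
```

In the real model the features come from the fine-tuned language model, and the weights are learned during classifier fine-tuning.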
Using fast.ai for NLP
fast.ai’s motto, “Making neural nets uncool again”, tells you a lot about their approach. Implementing these models is remarkably simple and intuitive, and with good documentation you can easily find a solution if you get stuck anywhere. For this and a few other reasons I elaborate on below, I decided to try out the fast.ai library, which is built on top of PyTorch, instead of Keras. Despite being used to working in Keras, I didn’t find it difficult to navigate fast.ai, and it doesn’t take long to learn how to implement advanced things as well!
In addition to its simplicity, fast.ai’s implementation comes with the training techniques proposed in ULMFiT, including:
- Discriminative fine-tuning is motivated by the fact that different layers of the LM capture different types of information. ULMFiT proposed to tune each layer with a different learning rate, {η1,…,ηℓ,…,ηL}, where η1 is the learning rate for the first layer, ηℓ is for the ℓ-th layer and there are L layers in total.
- Slanted triangular learning rates (STLR) refer to a special learning rate scheduling that first linearly increases the learning rate and then linearly decays it. The increase stage is short so that the model can converge to a parameter space suitable for the task fast, while the decay period is long allowing for better fine-tuning.
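Both techniques boil down to a few lines of arithmetic. The sketch below follows the formulas given in the ULMFiT paper as I understand them: each lower layer gets the learning rate of the layer above divided by 2.6, and the STLR schedule uses cut_frac=0.1 and ratio=32. The function names are my own, not part of any library.

```python
import math

def discriminative_lrs(lr_last, n_layers, factor=2.6):
    """Per-layer learning rates, first layer first.

    The last layer uses lr_last; each earlier layer uses the rate of the
    layer after it divided by `factor`.
    """
    return [lr_last / factor ** (n_layers - 1 - i) for i in range(n_layers)]

def stlr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at training step t.

    Assumes total_steps * cut_frac >= 1 so that the warm-up is not empty.
    """
    cut = math.floor(total_steps * cut_frac)  # step at which the rate peaks
    if t < cut:
        p = t / cut  # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

layer_lrs = discriminative_lrs(lr_last=0.01, n_layers=3)
print(layer_lrs)

total_steps = 1000
schedule = [stlr(t, total_steps) for t in range(total_steps)]
peak_step = max(range(total_steps), key=lambda t: schedule[t])
print(peak_step, schedule[0], schedule[peak_step], schedule[-1])
```

With these settings the rate climbs for the first 10% of the steps, peaks at lr_max, and then decays over the remaining 90%.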
Let’s try to see how well this approach works for our dataset. I would also like to point out that all these ideas and code are available at fast.ai’s free official course for Deep Learning.
Loading the data!
Data in fast.ai is loaded using a DataBunch, such as TextLMDataBunch. This is very similar to ImageDataGenerator in Keras: the path, labels, etc. are provided and the method prepares the train, validation and test data depending on the task at hand!
Data Bunch for Language Model
from fastai.text import *  # fastai v1; assumed import, not shown in the original
data_lm = TextLMDataBunch.from_csv(path, 'train.csv', text_cols = 3, label_cols = 4)
Data Bunch for Classification Task
data_clas = TextClasDataBunch.from_csv(path, 'train.csv', vocab=data_lm.train_ds.vocab, bs=32, text_cols = 3, label_cols = 4)
As discussed in the steps above, we start with a language model learner, which basically predicts the next word given a sequence. Intuitively, this model tries to understand what language and context are. We then take this model and fine-tune it for our specific task: Sentiment Classification.
Step 1. Training a Language Model
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)
By default, we start with a pre-trained model based on the AWD-LSTM architecture. This model is built on top of simple LSTM units but applies dropout in several places, each with its own hyperparameter. The drop_mult argument scales all of these dropouts at once. I’ve kept it at 0.5. You can set it higher if you find that this model is overfitting.
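Here is a rough sketch of what a single multiplier like drop_mult does, assuming the library simply multiplies every dropout probability in the model config by it. The config keys and values below are illustrative assumptions, not copied from fast.ai.

```python
def scale_dropouts(config, drop_mult):
    """Return a copy of `config` with every dropout probability scaled.

    Keys ending in "_p" are treated as dropout probabilities (an assumed
    naming convention); all other settings are left untouched.
    """
    return {
        name: value * drop_mult if name.endswith("_p") else value
        for name, value in config.items()
    }

# Illustrative config: the names and numbers are assumptions for this sketch.
base_config = {
    "n_layers": 3,
    "input_p": 0.25,
    "hidden_p": 0.15,
    "output_p": 0.1,
    "embed_p": 0.02,
    "weight_p": 0.2,
}
scaled = scale_dropouts(base_config, drop_mult=0.5)
print(scaled)
```

A drop_mult below 1 weakens all the regularisation together, and a value above 1 strengthens it, so one knob is enough for a first pass at fighting overfitting or underfitting.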
Discriminative Fine-Tuning
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4,1e-2))
learn.unfreeze() makes all the layers of AWD-LSTM trainable. We set the learning rates using slice(): here 1e-4 is the learning rate for the earliest layer group and 1e-2 for the last one, while the groups (of layers) in between get geometrically scaled learning rates.
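The geometric spacing is easy to reproduce by hand. Below is a small sketch of the arithmetic, not fast.ai’s own code: the function name is hypothetical, and the choice of three layer groups is an assumption made to keep the numbers readable.

```python
def spread_learning_rates(lr_start, lr_end, n_groups):
    """Geometrically spaced rates from lr_start (earliest group) to lr_end (last group)."""
    if n_groups == 1:
        return [lr_end]
    # Constant multiplicative step between consecutive layer groups.
    step = (lr_end / lr_start) ** (1 / (n_groups - 1))
    return [lr_start * step ** i for i in range(n_groups)]

# Assumed: 3 layer groups, with the same bounds as slice(1e-4, 1e-2).
rates = spread_learning_rates(1e-4, 1e-2, n_groups=3)
print(rates)
```

With three groups the rates come out at roughly 1e-4, 1e-3 and 1e-2, so each group learns about ten times faster than the one before it.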
Slanted Triangular Learning Rates
This can be achieved simply by using the fit_one_cycle() method in fast.ai, whose one-cycle schedule likewise ramps the learning rate up first and then anneals it.
Gradual Unfreezing
Though I’ve not experimented with this here, the idea is pretty simple. At the start, we keep the earlier layers of the model frozen and train only the last ones; then we gradually unfreeze the earlier layers as we keep on training. I’ll cover this in detail in the next post.
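As a preview, the unfreezing order can be written down as a simple schedule. This is a sketch of the idea only, with a hypothetical helper name and four layer groups chosen for illustration. In fast.ai, each stage would presumably correspond to a call such as learn.freeze_to(-2) followed by another round of training.

```python
def gradual_unfreeze_schedule(n_groups):
    """For each stage, list the indices of the layer groups that are trainable.

    The first stage trains only the last group; every later stage also
    unfreezes the group just before the ones already being trained.
    """
    return [list(range(n_groups - stage, n_groups))
            for stage in range(1, n_groups + 1)]

# Assumed: a model split into 4 layer groups, indexed 0 (earliest) to 3 (last).
for stage, trainable in enumerate(gradual_unfreeze_schedule(4)):
    print(f"stage {stage}: trainable groups {trainable}")
```

The last stage has every group trainable, which is the same state that learn.unfreeze() reaches in one go.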