Evolution of NLP — Part 3 — Transfer Learning Using ULMFit

Category: IT Technology · Published: 4 years ago

Introduction to Transfer Learning for NLP using fast.ai

This is the third part of a series of posts showing the improvements in NLP modeling approaches. We have seen traditional techniques such as Bag of Words and TF-IDF, then moved on to RNNs and LSTMs. This time we'll look at one of the pivotal shifts in approaching NLP tasks: Transfer Learning!

The complete code for this tutorial is available at this Kaggle Kernel

ULMFiT

Transfer Learning is still fairly new in NLP, although it has long been prominent in Computer Vision. ULMFiT, proposed by Jeremy Howard and Sebastian Ruder in 2018, brought this way of working to NLP and changed how we approach text data.

The core idea is two-fold: pre-train a generative Language Model, then fine-tune it for a specific task. This combination was first explored in ULMFiT (Howard & Ruder, 2018), directly motivated by the success of ImageNet pre-training for computer vision tasks. The base model is AWD-LSTM.

A Language Model is exactly what it sounds like: a model that predicts the next word of a sentence. The goal is to have a model that captures the semantics, grammar, and unique structure of a language.
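To make "predict the next word" concrete, here is a minimal sketch of the simplest possible language model, a bigram counter. It is not how AWD-LSTM works internally; the function names and the three-sentence corpus are made up for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus, only for illustration.
corpus = [
    "the movie was great",
    "the movie was boring",
    "the movie was great fun",
]
model = train_bigram_model(corpus)
print(predict_next(model, "was"))
```

A neural language model replaces these raw counts with learned representations, which is what lets it generalize to word sequences it has never seen.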

ULMFiT follows three steps to achieve good transfer learning results on downstream language classification tasks:

  1. General Language Model pre-training: on Wikipedia text.
  2. Target task Language Model fine-tuning: ULMFiT proposed two training techniques for stabilizing the fine-tuning process, discriminative fine-tuning and slanted triangular learning rates.
  3. Target task classifier fine-tuning: The pretrained LM is augmented with two standard feed-forward layers and a softmax normalization at the end to predict a target label distribution.
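The classifier head from step 3 can be sketched in plain Python as two feed-forward layers followed by a softmax. This is only a rough illustration under stated assumptions: the layer sizes, the random weights, and the `encoder_output` vector standing in for the language model's features are all hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classifier_head(features, layer1, layer2):
    """Two feed-forward layers, then softmax over the target labels."""
    hidden = relu(linear(features, *layer1))
    return softmax(linear(hidden, *layer2))


rng = random.Random(0)
encoder_output = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for LM features
layer1 = init_layer(rng, 4, 8)  # hypothetical sizes: 8 features -> 4 hidden units
layer2 = init_layer(rng, 2, 4)  # 4 hidden units -> 2 labels
probs = classifier_head(encoder_output, layer1, layer2)
print(probs)
```

The output is a probability distribution over the target labels, which is what the fine-tuned classifier is trained to predict.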

Using fast.ai for NLP

fast.ai's motto, "Making Neural Networks Uncool again", tells you a lot about their approach. Implementing these models is remarkably simple and intuitive, and with good documentation you can easily find a solution if you get stuck anywhere. For this and a few other reasons I elaborate on below, I decided to try out the fast.ai library, which is built on top of PyTorch, instead of Keras. Despite being used to working in Keras, I didn't find it difficult to navigate fast.ai, and it doesn't take long to learn how to implement more advanced things as well!

In addition to its simplicity, fast.ai's implementation offers some advantages:

  • Discriminative fine-tuning is motivated by the fact that different layers of the LM capture different types of information. ULMFiT proposed tuning each layer with a different learning rate, {η1,…,ηℓ,…,ηL}, where η1 is the learning rate for the first layer, ηℓ is the rate for the ℓ-th layer, and there are L layers in total.
Weight update for Stochastic Gradient Descent (SGD): θ(ℓ)_t = θ(ℓ)_{t−1} − η(ℓ) · ∇θ(ℓ) J(θ), where ∇θ(ℓ) J(θ) is the gradient of the loss function with respect to θ(ℓ) and η(ℓ) is the learning rate of the ℓ-th layer.
  • Slanted triangular learning rates (STLR) refer to a special learning rate schedule that first linearly increases the learning rate and then linearly decays it. The increase stage is short, so that the model quickly converges to a region of parameter space suitable for the task, while the decay period is long, allowing for better fine-tuning.
Figure: the learning rate increases until the 200th iteration and then slowly decays. Source: Howard & Ruder (2018), Universal Language Model Fine-tuning for Text Classification.
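The STLR schedule described above can be sketched as follows. The formula and the defaults (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`) follow my reading of Howard & Ruder (2018). The total of 2000 iterations is my own assumption, chosen only so that the peak lands at iteration 200 as in the figure caption.

```python
import math


def stlr(t, total_iters, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t.

    cut_frac: fraction of iterations spent increasing the learning rate.
    ratio: how much smaller the lowest rate is than lr_max.
    Assumes total_iters * cut_frac >= 1 so that the warm-up has at least one step.
    """
    cut = math.floor(total_iters * cut_frac)
    if t < cut:
        p = t / cut  # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


total = 2000  # assumed length of training, in iterations
schedule = [stlr(t, total) for t in range(total)]
peak = schedule.index(max(schedule))
print(f"peak at iteration {peak}, lr = {schedule[peak]:.4f}")
```

With these settings the rate climbs for the first 10% of training and decays over the remaining 90%, which is the "slanted" shape the name refers to.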

Let's see how well this approach works for our dataset. I would also like to point out that all these ideas and code are available in fast.ai's free official Deep Learning course.

Loading the data!

Data in fast.ai is loaded through a DataBunch, here TextLMDataBunch. This is very similar to ImageDataGenerator in Keras: you provide the path, labels, etc., and the method prepares the training, validation and test data for the task at hand!

Data Bunch for Language Model

from fastai.text import *  # fastai v1 API; `path` is the folder containing train.csv
data_lm = TextLMDataBunch.from_csv(path, 'train.csv', text_cols=3, label_cols=4)

Data Bunch for Classification Task

# Reuse the language model's vocabulary so token ids match between the two models
data_clas = TextClasDataBunch.from_csv(path, 'train.csv', vocab=data_lm.train_ds.vocab, bs=32, text_cols=3, label_cols=4)

As discussed in the steps above, we start with a language model learner, which predicts the next word given a sequence. Intuitively, this model tries to understand what language and context are. We then take this model and fine-tune it for our specific task: Sentiment Classification.

Step 1. Training a Language Model

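# Pre-trained AWD-LSTM language model; drop_mult scales all dropout rates
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
# One epoch with the one-cycle policy, peak learning rate 1e-2
learn.fit_one_cycle(1, 1e-2)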

By default, we start with a pre-trained model based on the AWD-LSTM architecture. It is built on plain LSTM units but adds several kinds of dropout, each with its own hyperparameter. The drop_mult argument scales all of these dropout rates at once. I've kept it at 0.5; you can set it higher if you find that the model is overfitting.
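As a rough illustration of what drop_mult does, the sketch below multiplies a set of dropout rates by a single factor. The dictionary values are what I believe the fastai v1 defaults for the AWD-LSTM language model to be. Treat them, and the helper name `scale_dropouts`, as assumptions rather than the library's actual code.

```python
# Assumed fastai v1 default dropout rates for the AWD-LSTM language model
# (illustrative values; check your installed version).
DEFAULT_DROPOUTS = {
    "output_p": 0.1,
    "hidden_p": 0.15,
    "input_p": 0.25,
    "embed_p": 0.02,
    "weight_p": 0.2,
}


def scale_dropouts(dropouts, drop_mult):
    """Multiply every dropout probability by the same factor."""
    return {name: p * drop_mult for name, p in dropouts.items()}


scaled = scale_dropouts(DEFAULT_DROPOUTS, 0.5)
for name, p in scaled.items():
    print(f"{name}: {p:.3f}")
```

A single knob like this is convenient because it lets you tune regularization strength without touching each dropout rate by hand.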

Discriminative Fine-Tuning

learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4,1e-2))

learn.unfreeze() makes all the layers of AWD-LSTM trainable. The learning rate is passed as slice(1e-4, 1e-2): the first (earliest) layer group trains at 1e-4, the last group at 1e-2, and the groups in between get geometrically spaced learning rates.
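The geometric spacing can be sketched in a few lines of plain Python. This mirrors my understanding of how fastai v1 spreads a slice over layer groups. The helper name `geometric_lrs` and the choice of three groups are hypothetical.

```python
def geometric_lrs(start, stop, n_groups):
    """Learning rates from `start` (first group) to `stop` (last group),
    each one a constant multiple of the previous."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]


# Hypothetical model with three layer groups.
lrs = geometric_lrs(1e-4, 1e-2, 3)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.0e}")
```

Earlier groups hold more general language knowledge, so they get the smaller rates and change less, while the later, more task-specific groups are allowed to move faster.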

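Slanted Triangular Learning Rates

fast.ai's fit_one_cycle() method gives a similar effect. It follows the one-cycle policy, which also warms the learning rate up and then anneals it, although it is not an exact implementation of STLR.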

Gradual Unfreezing

Though I've not experimented with this here, the idea is pretty simple. At the start, we keep the earlier layers of the model frozen and train only the last layers; then we gradually unfreeze the earlier layers as training goes on. I'll cover this in detail in the next post.
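As a sketch of the idea (not something run in this tutorial), the schedule below lists which layer groups would be trainable at each stage, starting from the last group and working backwards. The function name and the choice of four layer groups are hypothetical.

```python
def gradual_unfreezing_schedule(n_groups):
    """Yield, stage by stage, the indices of the trainable layer groups.

    Stage 1 trains only the last group; each later stage unfreezes one
    more group, moving towards the input.
    """
    for stage in range(1, n_groups + 1):
        yield list(range(n_groups - stage, n_groups))


# Hypothetical model with four layer groups.
for stage, trainable in enumerate(gradual_unfreezing_schedule(4), start=1):
    print(f"stage {stage}: train groups {trainable}")
```

In fastai v1 the corresponding calls are, as far as I know, learn.freeze_to(-2), learn.freeze_to(-3) and so on, with a round of training after each one and learn.unfreeze() at the end.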

