Evolution of NLP — Part 3 — Transfer Learning Using ULMFit

Introduction to Transfer Learning for NLP using fast.ai

This is the third part of a series of posts showing the improvements in NLP modeling approaches. We have seen the use of traditional techniques like Bag of Words and TF-IDF, then moved on to RNNs and LSTMs. This time we'll look at one of the pivotal shifts in approaching NLP tasks: Transfer Learning!

The complete code for this tutorial is available in this Kaggle Kernel.

ULMFiT

The idea of using Transfer Learning is quite new for NLP tasks, although it has been prominently used in Computer Vision for a while. This new way of approaching NLP was first proposed by Jeremy Howard, and it has transformed the way we look at text data!

The core idea is two-fold: generative pre-training of a Language Model followed by task-specific fine-tuning. It was first explored in ULMFiT (Howard & Ruder, 2018), directly motivated by the success of ImageNet pre-training for computer vision tasks. The base model is AWD-LSTM.

A Language Model is exactly what it sounds like: a model trained to predict the next word of a sentence. The goal is to end up with a model that understands the semantics, grammar, and unique structure of a language.

ULMFiT follows three steps to achieve good transfer learning results on downstream language classification tasks:

  1. General Language Model pre-training: the LM is trained on general-domain (Wikipedia) text.
  2. Target task Language Model fine-tuning: ULMFiT proposed two training techniques to stabilize this fine-tuning process (discussed below).
  3. Target task classifier fine-tuning: the pre-trained LM is augmented with two standard feed-forward layers and a softmax normalization at the end to predict a target label distribution.

Using fast.ai for NLP –

fast.ai's motto, "Making neural nets uncool again", tells you a lot about their approach. Implementation of these models is remarkably simple and intuitive, and with good documentation you can easily find a solution if you get stuck anywhere. For this reason, and a few others I elaborate on below, I decided to try out the fast.ai library, which is built on top of PyTorch, instead of Keras. Despite being used to working in Keras, I didn't find it difficult to navigate fast.ai, and the learning curve for implementing more advanced things is quite gentle as well!

In addition to its simplicity, there are some advantages of using fast.ai’s implementation –

  • Discriminative fine-tuning is motivated by the fact that different layers of the LM capture different types of information. ULMFiT proposes to tune each layer with a different learning rate, {η(1), …, η(ℓ), …, η(L)}, where η(1) is the learning rate for the first layer, η(ℓ) is the rate for the ℓ-th layer, and there are L layers in total. The SGD weight update for the ℓ-th layer then becomes θ(ℓ)_t = θ(ℓ)_{t−1} − η(ℓ) · ∇θ(ℓ) J(θ), where ∇θ(ℓ) J(θ) is the gradient of the loss function with respect to the parameters θ(ℓ) of that layer. A small numeric sketch after this list illustrates both this idea and the schedule described next.
  • Slanted triangular learning rates (STLR) refer to a special learning rate schedule that first increases the learning rate linearly and then decays it linearly. The increase stage is short so that the model converges quickly to a parameter space suitable for the task, while the decay period is long, allowing for better fine-tuning.
The learning rate increases until roughly the 200th iteration and then slowly decays (figure from Howard & Ruder, 2018, Universal Language Model Fine-tuning for Text Classification).
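To make these two ideas concrete, here is a small illustrative sketch in plain Python/NumPy (not fast.ai internals; the number of layer groups, the learning-rate bounds, and the iteration counts are assumptions chosen for illustration). The first part spreads learning rates geometrically across layer groups, roughly what slice(1e-4, 1e-2) does later in this post; the second part implements the slanted triangular schedule from Howard & Ruder (2018).

import numpy as np

# Discriminative fine-tuning (illustration): geometrically spaced per-layer learning rates,
# from a small rate for the earliest layer group to a larger rate for the last group.
n_groups = 4                                      # assumed number of layer groups
per_layer_lrs = np.geomspace(1e-4, 1e-2, num=n_groups)
print(per_layer_lrs)                              # ~[1e-4, 4.6e-4, 2.2e-3, 1e-2]

# Slanted triangular learning rate (illustration): a short linear warm-up followed by
# a long linear decay, as described in Howard & Ruder (2018).
def slanted_triangular(t, total_steps, cut_frac=0.1, lr_max=1e-2, ratio=32):
    cut = int(total_steps * cut_frac)             # iteration at which the peak is reached
    if t < cut:
        p = t / cut                               # warm-up phase
    else:
        p = 1 - (t - cut) / (total_steps - cut)   # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

schedule = [slanted_triangular(t, total_steps=2000) for t in range(2000)]  # peaks at iteration 200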

Let's see how well this approach works for our dataset. I would also like to point out that all of these ideas and the code are available in fast.ai's free official course on Deep Learning.

Loading the data!

Data in fast.ai is loaded using a TextLMDataBunch. This is very similar to ImageDataGenerator in Keras: you provide the path, the text and label columns, etc., and the method prepares the training, validation, and test data depending on the task at hand!

Data Bunch for Language Model

from fastai.text import *  # fastai v1 text API
data_lm = TextLMDataBunch.from_csv(path, 'train.csv', text_cols=3, label_cols=4)  # path: folder containing train.csv

Data Bunch for Classification Task

data_clas = TextClasDataBunch.from_csv(path, 'train.csv', vocab=data_lm.train_ds.vocab, bs=32, text_cols=3, label_cols=4)  # reuse the LM vocab so token ids line up
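A quick sanity check on either DataBunch is to look at a few processed rows. A small usage note (the row count is an arbitrary choice):

data_lm.show_batch(rows=5)    # tokenized text batches used to train the language model
data_clas.show_batch(rows=5)  # tokenized text together with its sentiment label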

As discussed in the steps above, we first start with a language model learner, which basically predicts the next word given a sequence. Intuitively, this model tries to understand what the language and its context are. We then take this model and fine-tune it for our specific task: Sentiment Classification.

Step 1. Training a Language Model

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)  # AWD-LSTM pre-trained on Wikipedia text
learn.fit_one_cycle(1, 1e-2)  # one training cycle before unfreezing the rest of the model

By default, we start from a pre-trained model based on the AWD-LSTM architecture. This model is built on top of ordinary LSTM units but adds several kinds of dropout, each with its own hyperparameter. The drop_mult argument scales all of these dropouts at once; I've kept it at 0.5. You can set it higher if you find that the model is overfitting.
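Since the language model's only job is to predict the next word, an easy sanity check is to let it generate a short continuation. A hedged usage sketch (the prompt and the word count are arbitrary choices, not from the original post):

learn.predict("I liked this movie because", n_words=10)  # appends 10 predicted words to the prompt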

Discriminative Fine-Tuning

learn.unfreeze()  # make every layer group of the AWD-LSTM trainable
learn.fit_one_cycle(3, slice(1e-4, 1e-2))  # discriminative learning rates across layer groups

learn.unfreeze() makes all the layers of the AWD-LSTM trainable. The learning rate is specified with slice(1e-4, 1e-2): the earliest layer group is trained at 1e-4, the last layer group at 1e-2, and the groups in between get geometrically scaled learning rates. This is the discriminative fine-tuning described above.

Slanted Triangular Learning Rates

This is handled for us by the fit_one_cycle() method in fast.ai, whose schedule likewise increases the learning rate quickly and then decays it for the rest of training.
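If you want to see the schedule that was actually applied, fast.ai records the learning rate at every iteration; a small usage note (run it after any of the fit_one_cycle() calls above):

learn.recorder.plot_lr()  # plot the learning rate used at each training iteration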

Gradual Unfreezing

Though I haven't experimented with it here, the idea is pretty simple: at the start only the last layer group is trainable, and as training progresses we gradually unfreeze the earlier layers, a few at a time. I'll cover this in detail in the next post; a minimal sketch of how it fits into the classifier stage follows below.
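For completeness, here is a minimal, hedged sketch of what the remaining step, target task classifier fine-tuning with gradual unfreezing, typically looks like in fast.ai v1, using the data_clas bunch created earlier. The encoder name 'ft_enc', the learning rates, and the epoch counts are illustrative assumptions rather than values from this post:

learn.save_encoder('ft_enc')                      # save the fine-tuned LM encoder (name is an arbitrary choice)

learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')                  # reuse the language model's encoder weights

learn_clf.fit_one_cycle(1, 2e-2)                  # train only the new classifier head first
learn_clf.freeze_to(-2)                           # unfreeze the last two layer groups
learn_clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn_clf.freeze_to(-3)                           # unfreeze one more layer group
learn_clf.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
learn_clf.unfreeze()                              # finally fine-tune the whole model
learn_clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))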

