Deep Learning For NLP with PyTorch and Torchtext

Torchtext’s Pre-trained Word Embedding, Dataset API, Iterator API, and training model with Torchtext and PyTorch

PyTorch has been an awesome deep learning framework that I have been working with. However, when it comes to NLP, I could not find a utility library as good as torchvision. It turns out PyTorch has torchtext, which, in my opinion, lacks examples of how to use it, and the documentation [6] could be improved. There are some great tutorials like [1] and [2], but we still need more examples.

This article’s purpose is to give readers sample code on how to use torchtext: in particular, how to use pre-trained word embeddings, the Dataset API, and the Iterator API for mini-batching, and finally how to use these together to train a model.

Pre-Trained Word Embedding with Torchtext

There are some alternatives for pre-trained word embeddings, such as Spacy [5], Stanza (Stanford NLP) [3], and Gensim [4], but in this article I want to focus on doing word embedding with torchtext.

Available Word Embedding

You can see the list of pre-trained word embeddings in the torchtext documentation. At the time of writing, there are 3 pre-trained word embedding classes supported: GloVe, FastText, and CharNGram, with little additional detail on how to load them. The exhaustive list is stated here, but it took me some time to read, so I will lay out the list here.

charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d

There are two ways to load pre-trained word embeddings: initiate a word embedding object, or use a Field instance.

Using Field Instance

You need a toy dataset to use this, so let’s set one up.

import pandas as pd

df = pd.DataFrame([
    ['my name is Jack', 'Y'],
    ['Hi I am Jack', 'Y'],
    ['Hello There!', 'Y'],
    ['Hi I am cooking', 'N'],
    ['Hello are you there?', 'N'],
    ['There is a bird there', 'N'],
], columns=['text', 'label'])

Then we can construct Field objects that hold the metadata of the feature column and the label column.

from torchtext.data import Field

text_field = Field(
    tokenize='basic_english',
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(lambda x: text_field.preprocess(x))

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

To get the actual tensor that holds the pre-trained word embeddings, you can use

vocab.vectors
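For example, here is a quick sanity check on what was loaded; this sketch assumes the toy df, text_field, and vocab built above.

# the whole embedding matrix: one 300-d row per vocabulary entry
print(vocab.vectors.shape)   # (vocab size, 300)

# fetch the vector of a single token through the string-to-index map
jack_vector = vocab.vectors[vocab.stoi['jack']]
print(jack_vector.shape)     # torch.Size([300])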

Initiate Word Embedding Object

Each of these snippets will download a large set of word embeddings, so be patient and do not execute all of the code below at once.

FastText

The FastText object has one parameter: language, which can be ‘simple’ or ‘en’. Currently it only supports 300 embedding dimensions, as shown in the embedding list above.

from torchtext.vocab import FastText
embedding = FastText('simple')

CharNGram

from torchtext.vocab import CharNGram
embedding_charngram = CharNGram()

GloVe

The GloVe object has 2 parameters: name and dim. You can check the embedding list above to see which values each parameter supports.

from torchtext.vocab import GloVe
embedding_glove = GloVe(name='6B', dim=100)
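As a quick check of what was downloaded, you can inspect the vectors, stoi, and itos attributes shared by all of these embedding classes; the token chosen below is just an illustration.

print(embedding_glove.vectors.shape)   # roughly 400k tokens, 100 dimensions each
tree_vector = embedding_glove.vectors[embedding_glove.stoi['tree']]
print(tree_vector.shape)               # torch.Size([100])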

Using Word Embedding

Using the torchtext API for word embeddings is super easy! Say you have built your vocabulary in the variable vocab as above; you can then use it like a Python dict.

# known token, in my case print 12
print(vocab['are'])
# unknown token, will print 0
print(vocab['crazy'])

As you can see, it handles unknown tokens without throwing an error! If you play with encoding words into integers, you will notice that, by default, the unknown token is encoded as 0 while the pad token is encoded as 1.
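Putting these pieces together, a sentence can be tokenized, encoded into indices, and mapped to its vectors; here is a minimal sketch using the text_field and vocab built above.

import torch

# tokenize and lowercase with the Field defined earlier
tokens = text_field.preprocess('Hi I am Jack')

# unknown tokens map to 0, so this is safe for any input
indices = [vocab[token] for token in tokens]
print(indices)

# look up the corresponding rows of the embedding matrix
vectors = vocab.vectors[torch.tensor(indices)]
print(vectors.shape)   # (number of tokens, 300)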

Using Dataset API

Assuming variable df has been defined as above, we now proceed to prepare the data by constructing Field for both the feature and label.

from torchtext.data import Field

text_field = Field(
    sequential=True,
    tokenize='basic_english',
    fix_length=5,
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(
    lambda x: text_field.preprocess(x)
)

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

I could not find a ready-made Dataset API that loads a pandas DataFrame into a torchtext dataset, but it is pretty easy to build one.

from torchtext.data import Dataset, Example
ltoi = {l: i for i, l in enumerate(df['label'].unique())}
df['label'] = df['label'].apply(lambda y: ltoi[y])

class DataFrameDataset(Dataset):
    def __init__(self, df: pd.DataFrame, fields: list):
        super(DataFrameDataset, self).__init__(
            [
                Example.fromlist(list(r), fields) 
                for i, r in df.iterrows()
            ], 
            fields
        )

We can now construct the DataFrameDataset and initialize it with the pandas DataFrame.

train_dataset, test_dataset = DataFrameDataset(
    df=df, 
    fields=(
        ('text', text_field),
        ('label', label_field)
    )
).split()

A bit of a warning here: Dataset.split may return three datasets (train, val, test) instead of the two unpacked above, depending on the split ratio you pass.

Using Iterator Class for Mini-batching

We then use the BucketIterator class to easily construct a mini-batching iterator.

from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset), 
    batch_sizes=(2, 2),
    sort=False
)

Remember to use sort=False, otherwise iterating test_iter will lead to an error: we have not defined a sort key, yet by default test_iter is set to be sorted.
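Alternatively, you can keep sorting enabled by telling the iterator how to sort; here is a sketch assuming the sort_key argument of the legacy BucketIterator.

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset),
    batch_sizes=(2, 2),
    # group examples of similar length to minimize padding
    sort_key=lambda example: len(example.text),
    sort_within_batch=False
)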

A little note: while I agree that we should use the DataLoader API to handle mini-batches, at this moment I have not explored how to use DataLoader with torchtext.

Example in Training PyTorch Model

Let’s define an arbitrary PyTorch model using 1 embedding layer and 1 linear layer. In this example, I do not use a pre-trained word embedding; instead, I use a new, untrained embedding layer.

import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

class ModelParam(object):
    def __init__(self, param_dict: dict = dict()):
        self.input_size = param_dict.get('input_size', 0)
        self.vocab_size = param_dict.get('vocab_size')
        self.embedding_dim = param_dict.get('embedding_dim', 300)
        self.target_dim = param_dict.get('target_dim', 2)
        
class MyModel(nn.Module):
    def __init__(self, model_param: ModelParam):
        super().__init__()
        self.embedding = nn.Embedding(
            model_param.vocab_size, 
            model_param.embedding_dim
        )
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim, 
            model_param.target_dim
        )
        
    def forward(self, x):
        features = self.embedding(x).view(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

Then I can easily iterate through the training (and testing) routine as follows.
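Here is a minimal sketch of that routine; the hyperparameters, the epoch count, and the transpose of batch.text out of its default (sequence length, batch size) layout are illustrative choices of mine rather than anything prescribed by torchtext.

import torch

model_param = ModelParam(param_dict=dict(
    vocab_size=len(vocab),
    input_size=5,   # matches fix_length of the text Field
))
model = MyModel(model_param)
optimizer = Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for batch in train_iter:
        x = batch.text.t()   # (batch size, sequence length)
        y = batch.label
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # quick evaluation on the test iterator
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in test_iter:
            predictions = model(batch.text.t()).argmax(dim=1)
            correct += (predictions == batch.label).sum().item()
            total += batch.label.size(0)
    print(f'epoch {epoch}: test accuracy {correct / total:.2f}')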

Reusing The Pre-trained Word Embedding

It is easy to modify the currently defined model into a model that uses the pre-trained embedding.

class MyModelWithPretrainedEmbedding(nn.Module):
    def __init__(self, model_param: ModelParam, embedding):
        super().__init__()
        self.embedding = embedding
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim, 
            model_param.target_dim
        )
        
    def forward(self, x):
        features = self.embedding[x].reshape(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

I made 3 lines of modifications. Notice that I have changed the constructor to accept an embedding as input. Additionally, I have also changed the view method to reshape, and I use the get operator [] instead of the call operator () to access the embedding.

model = MyModelWithPretrainedEmbedding(model_param, vocab.vectors)
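As a side note, an alternative is to wrap the pre-trained matrix in a standard nn.Embedding layer with from_pretrained, which keeps the original call-operator forward pass intact; a sketch:

# build an embedding layer initialized from the pre-trained vectors
pretrained_layer = nn.Embedding.from_pretrained(
    vocab.vectors,   # (vocab size, 300) FloatTensor built earlier
    freeze=True      # keep the vectors fixed during training
)

# reuse MyModel unchanged and swap in the pre-trained layer;
# model_param is the instance defined in the training sketch above
alternative_model = MyModel(model_param)
alternative_model.embedding = pretrained_layer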

Conclusion

I have finished laying out my own exploration of using torchtext to handle text data in PyTorch. I began writing this article because I had trouble using it with the current tutorials available on the internet. I hope this article may reduce overhead for others too.

Need help writing this code? Here’s a link to a Google Colab notebook.

Link to Google Colab

References

[1] Nie, A. A Tutorial on Torchtext. 2017. http://anie.me/On-Torchtext/

[2] Text Classification with TorchText Tutorial. https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

[3] Stanza Documentation. https://stanfordnlp.github.io/stanza/

[4] Gensim Documentation. https://radimrehurek.com/gensim/

[5] Spacy Documentation. https://spacy.io/

[6] Torchtext Documentation. https://pytorch.org/text/

