Deep Learning For NLP with PyTorch and Torchtext

Torchtext’s Pre-trained Word Embedding, Dataset API, Iterator API, and training model with Torchtext and PyTorch

PyTorch has been an awesome deep learning framework that I have been working with. However, when it comes to NLP, I could not find a utility library as good as torchvision. It turns out PyTorch has torchtext, which, in my opinion, lacks examples of how to use it, and the documentation [6] could be improved. There are some great tutorials like [1] and [2], but we still need more examples.

This article’s purpose is to give readers sample code on how to use torchtext: in particular, how to use pre-trained word embeddings, the Dataset API, and the Iterator API for mini-batching, and finally how to use these together to train a model.

Pre-Trained Word Embedding with Torchtext

There are some alternatives for pre-trained word embeddings, such as Spacy [5], Stanza (Stanford NLP) [3], and Gensim [4], but in this article I want to focus on doing word embedding with torchtext.

Available Word Embedding

You can see the list of pre-trained word embeddings in the torchtext documentation. At the time of writing, there are 3 pre-trained word embedding classes supported: GloVe, FastText, and CharNGram, with little additional detail on how to load them. The exhaustive list is stated here, but it took me some time to read, so I will lay out the list here.

charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d

There are two ways to load pre-trained word embeddings: initiate a word embedding object, or use a Field instance.

Using Field Instance

You need a toy dataset to use this, so let’s set one up.

import pandas as pd

df = pd.DataFrame([
    ['my name is Jack', 'Y'],
    ['Hi I am Jack', 'Y'],
    ['Hello There!', 'Y'],
    ['Hi I am cooking', 'N'],
    ['Hello are you there?', 'N'],
    ['There is a bird there', 'N'],
], columns=['text', 'label'])

Then we can construct Field objects that hold the metadata of the feature column and the label column.

from torchtext.data import Field

text_field = Field(
    tokenize='basic_english',
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(lambda x: text_field.preprocess(x))

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

To get the actual tensor that holds the pre-trained word embeddings, you can use

vocab.vectors
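For example, here is a quick sanity check on what was loaded; this sketch assumes the toy df, text_field, and vocab built above.

# the whole embedding matrix: one 300-d row per vocabulary entry
print(vocab.vectors.shape)   # (vocab size, 300)

# fetch the vector of a single token through the string-to-index map
jack_vector = vocab.vectors[vocab.stoi['jack']]
print(jack_vector.shape)     # torch.Size([300])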

Initiate Word Embedding Object

Each of these snippets will download a large set of word embeddings, so be patient and do not execute all of the code below at once.

FastText

The FastText object has one parameter: language, which can be ‘simple’ or ‘en’. Currently it only supports 300 embedding dimensions, as shown in the embedding list above.

from torchtext.vocab import FastText
embedding = FastText('simple')

CharNGram

from torchtext.vocab import CharNGram
embedding_charngram = CharNGram()

GloVe

The GloVe object has 2 parameters: name and dim. You can check the embedding list above to see which values each parameter supports.

from torchtext.vocab import GloVe
embedding_glove = GloVe(name='6B', dim=100)
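As a quick check of what was downloaded, you can inspect the vectors, stoi, and itos attributes shared by all of these embedding classes; the token chosen below is just an illustration.

print(embedding_glove.vectors.shape)   # roughly 400k tokens, 100 dimensions each
tree_vector = embedding_glove.vectors[embedding_glove.stoi['tree']]
print(tree_vector.shape)               # torch.Size([100])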

Using Word Embedding

Using the torchtext API for word embeddings is super easy! Say you have built your vocabulary in the variable vocab as above; you can then use it like a Python dict.

# known token, in my case print 12
print(vocab['are'])
# unknown token, will print 0
print(vocab['crazy'])

As you can see, it handles unknown tokens without throwing an error! If you play with encoding words into integers, you will notice that, by default, the unknown token is encoded as 0 while the pad token is encoded as 1.
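Putting these pieces together, a sentence can be tokenized, encoded into indices, and mapped to its vectors; here is a minimal sketch using the text_field and vocab built above.

import torch

# tokenize and lowercase with the Field defined earlier
tokens = text_field.preprocess('Hi I am Jack')

# unknown tokens map to 0, so this is safe for any input
indices = [vocab[token] for token in tokens]
print(indices)

# look up the corresponding rows of the embedding matrix
vectors = vocab.vectors[torch.tensor(indices)]
print(vectors.shape)   # (number of tokens, 300)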

Using Dataset API

Assuming variable df has been defined as above, we now proceed to prepare the data by constructing Field for both the feature and label.

from torchtext.data import Field

text_field = Field(
    sequential=True,
    tokenize='basic_english',
    fix_length=5,
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(
    lambda x: text_field.preprocess(x)
)

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

I could not find a ready-made Dataset API that loads a pandas DataFrame into a torchtext dataset, but it is pretty easy to build one.

from torchtext.data import Dataset, Example
ltoi = {l: i for i, l in enumerate(df['label'].unique())}
df['label'] = df['label'].apply(lambda y: ltoi[y])

class DataFrameDataset(Dataset):
    def __init__(self, df: pd.DataFrame, fields: list):
        super(DataFrameDataset, self).__init__(
            [
                Example.fromlist(list(r), fields) 
                for i, r in df.iterrows()
            ], 
            fields
        )

We can now construct the DataFrameDataset and initialize it with the pandas DataFrame.

train_dataset, test_dataset = DataFrameDataset(
    df=df, 
    fields=(
        ('text', text_field),
        ('label', label_field)
    )
).split()

A bit of a warning here: Dataset.split may return three datasets (train, val, test) instead of the two unpacked above, depending on the split ratio you pass.

Using Iterator Class for Mini-batching

We then use the BucketIterator class to easily construct a mini-batching iterator.

from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset), 
    batch_sizes=(2, 2),
    sort=False
)

Remember to use sort=False, otherwise iterating test_iter will lead to an error: we have not defined a sort key, yet by default test_iter is set to be sorted.
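Alternatively, you can keep sorting enabled by telling the iterator how to sort; here is a sketch assuming the sort_key argument of the legacy BucketIterator.

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset),
    batch_sizes=(2, 2),
    # group examples of similar length to minimize padding
    sort_key=lambda example: len(example.text),
    sort_within_batch=False
)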

A little note: while I agree that we should use the DataLoader API to handle mini-batches, at this moment I have not explored how to use DataLoader with torchtext.

Example in Training PyTorch Model

Let’s define an arbitrary PyTorch model using 1 embedding layer and 1 linear layer. In this example, I do not use a pre-trained word embedding; instead, I use a new, untrained embedding layer.

import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

class ModelParam(object):
    def __init__(self, param_dict: dict = dict()):
        self.input_size = param_dict.get('input_size', 0)
        self.vocab_size = param_dict.get('vocab_size')
        self.embedding_dim = param_dict.get('embedding_dim', 300)
        self.target_dim = param_dict.get('target_dim', 2)
        
class MyModel(nn.Module):
    def __init__(self, model_param: ModelParam):
        super().__init__()
        self.embedding = nn.Embedding(
            model_param.vocab_size, 
            model_param.embedding_dim
        )
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim, 
            model_param.target_dim
        )
        
    def forward(self, x):
        features = self.embedding(x).view(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

Then I can easily iterate through the training (and testing) routine as follows.
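Here is a minimal sketch of that routine; the hyperparameters, the epoch count, and the transpose of batch.text out of its default (sequence length, batch size) layout are illustrative choices of mine rather than anything prescribed by torchtext.

import torch

model_param = ModelParam(param_dict=dict(
    vocab_size=len(vocab),
    input_size=5,   # matches fix_length of the text Field
))
model = MyModel(model_param)
optimizer = Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for batch in train_iter:
        x = batch.text.t()   # (batch size, sequence length)
        y = batch.label
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # quick evaluation on the test iterator
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in test_iter:
            predictions = model(batch.text.t()).argmax(dim=1)
            correct += (predictions == batch.label).sum().item()
            total += batch.label.size(0)
    print(f'epoch {epoch}: test accuracy {correct / total:.2f}')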

Reusing The Pre-trained Word Embedding

It is easy to modify the currently defined model into a model that uses the pre-trained embedding.

class MyModelWithPretrainedEmbedding(nn.Module):
    def __init__(self, model_param: ModelParam, embedding):
        super().__init__()
        self.embedding = embedding
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim, 
            model_param.target_dim
        )
        
    def forward(self, x):
        features = self.embedding[x].reshape(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

I made 3 lines of modifications. Notice that I have changed the constructor to accept an embedding as input. Additionally, I have also changed the view method to reshape, and I use the get operator [] instead of the call operator () to access the embedding.

model = MyModelWithPretrainedEmbedding(model_param, vocab.vectors)
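As a side note, an alternative is to wrap the pre-trained matrix in a standard nn.Embedding layer with from_pretrained, which keeps the original call-operator forward pass intact; a sketch:

# build an embedding layer initialized from the pre-trained vectors
pretrained_layer = nn.Embedding.from_pretrained(
    vocab.vectors,   # (vocab size, 300) FloatTensor built earlier
    freeze=True      # keep the vectors fixed during training
)

# reuse MyModel unchanged and swap in the pre-trained layer;
# model_param is the instance defined in the training sketch above
alternative_model = MyModel(model_param)
alternative_model.embedding = pretrained_layer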

Conclusion

I have finished laying out my own exploration of using torchtext to handle text data in PyTorch. I began writing this article because I had trouble using it with the current tutorials available on the internet. I hope this article may reduce overhead for others too.

Need help writing this code? Here’s a link to a Google Colab notebook.

Link to Google Colab

References

[1] Nie, A. A Tutorial on Torchtext. 2017. http://anie.me/On-Torchtext/

[2] Text Classification with TorchText Tutorial. https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

[3] Stanza Documentation. https://stanfordnlp.github.io/stanza/

[4] Gensim Documentation. https://radimrehurek.com/gensim/

[5] Spacy Documentation. https://spacy.io/

[6] Torchtext Documentation. https://pytorch.org/text/

