Deep Learning For NLP with PyTorch and Torchtext
Torchtext’s Pre-trained Word Embedding, Dataset API, Iterator API, and training model with Torchtext and PyTorch
May 24 · 5 min read
PyTorch has been an awesome deep learning framework to work with. However, when it comes to NLP, I could not find a utility library as good as torchvision. It turns out PyTorch has torchtext, which, in my opinion, lacks examples of how to use it, and its documentation [6] could be improved. There are some great tutorials, like [1] and [2], but we still need more examples.
This article’s purpose is to give readers sample code on how to use torchtext: in particular, how to use pre-trained word embeddings, the Dataset API, and the Iterator API for mini-batching, and finally how to use all of these together to train a model.
Pre-Trained Word Embedding with Torchtext
There are a few alternatives for pre-trained word embeddings, such as Spacy [3], Stanza (Stanford NLP) [4], and Gensim [5], but in this article I want to focus on doing word embedding with torchtext.
Available Word Embedding
You can see the list of pre-trained word embeddings in torchtext. At the time of writing, there are 3 pre-trained word embedding classes supported: GloVe, FastText, and CharNGram, with little additional detail on how to load them. The exhaustive list is stated here, but it took me some time to read, so I will lay it out below; a short snippet after the list also shows how to print these aliases programmatically.
charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d
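If you would rather not hard-code these strings, the legacy torchtext API I am using here also exposes them through a pretrained_aliases dict (an assumption worth checking against your installed version); a minimal sketch:

from torchtext.vocab import pretrained_aliases

# every key is an alias string accepted by build_vocab(vectors=...)
for alias in pretrained_aliases:
    print(alias)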
There are two ways we can load pre-trained word embeddings: by initiating a word embedding object or by using a Field instance.
Using Field Instance
You need a toy dataset to try this out, so let’s set one up.
import pandas as pd

df = pd.DataFrame(
    [
        ['my name is Jack', 'Y'],
        ['Hi I am Jack', 'Y'],
        ['Hello There!', 'Y'],
        ['Hi I am cooking', 'N'],
        ['Hello are you there?', 'N'],
        ['There is a bird there', 'N'],
    ],
    columns=['text', 'label']
)
Then we can construct Field objects that hold the metadata of the feature column and the label column.
from torchtext.data import Field

text_field = Field(
    tokenize='basic_english',
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(lambda x: text_field.preprocess(x))

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab
To get the actual tensor of pre-trained word embeddings, you can use
vocab.vectors
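As a quick sanity check (a minimal sketch, assuming the text_field and vocab built above), vocab.vectors is a plain tensor of shape (vocabulary size, embedding dimension), and you can pull out a single word’s vector through the stoi mapping:

# vocab.vectors is a (len(vocab), 300) float tensor for fasttext.simple.300d
print(vocab.vectors.shape)

# look up the 300-d vector of a word from the toy dataset (lower=True, so 'jack' not 'Jack')
jack_vector = vocab.vectors[vocab.stoi['jack']]
print(jack_vector.shape)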
Initiate Word Embedding Object
Each of these snippets downloads a large set of word embeddings, so be patient and do not execute all of them at once.
FastText
The FastText object has one parameter: language, which can be 'simple' or 'en'. Currently it only supports 300 embedding dimensions, as mentioned in the embedding list above.
from torchtext.vocab import FastText

embedding = FastText('simple')
CharNGram
from torchtext.vocab import CharNGram

embedding_charngram = CharNGram()
GloVe
The GloVe object has 2 parameters: name and dim. You can check the embedding list above to see which values each parameter supports.
from torchtext.vocab import GloVe

embedding_glove = GloVe(name='6B', dim=100)
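These embedding objects also support dict-style lookup by token, so you can play with the vectors directly; the following is a small sketch, assuming the 6B/100d GloVe download above has completed, that compares two words with cosine similarity:

import torch.nn.functional as F

# [] returns the 100-d vector for a token (unknown tokens get a default vector)
king = embedding_glove['king']
queen = embedding_glove['queen']

# cosine_similarity expects a batch dimension, hence unsqueeze(0)
print(F.cosine_similarity(king.unsqueeze(0), queen.unsqueeze(0)))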
Using Word Embedding
Using the torchtext API for word embeddings is super easy! Say you have built the vocabulary above and stored it in the variable vocab; then you can use it like a Python dict.
# known token, in my case prints 12
print(vocab['are'])

# unknown token, will print 0
print(vocab['crazy'])
As you can see, it handles unknown tokens without throwing an error! If you play with encoding words into integers, you will notice that, by default, the unknown token is encoded as 0 while the pad token is encoded as 1.
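You can confirm this by peeking at the vocabulary’s index-to-string list; a minimal sketch, assuming the vocab built earlier with the default special tokens:

# itos[0] and itos[1] are the special tokens, '<unk>' and '<pad>' by default
print(vocab.itos[:5])

# stoi is the reverse mapping used when text is numericalized
print(vocab.stoi['<pad>'])  # prints 1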
Using Dataset API
Assuming the variable df has been defined as above, we now proceed to prepare the data by constructing a Field for both the feature and the label.
from torchtext.data import Field

text_field = Field(
    sequential=True,
    tokenize='basic_english',
    fix_length=5,
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(
    lambda x: text_field.preprocess(x)
)

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab
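To see what these settings do to a single sentence (a minimal sketch, assuming the text_field and preprocessed_text above), Field.process pads or truncates each tokenized example to fix_length and numericalizes it with the vocab in one call:

# process() takes a list of tokenized examples and returns a (fix_length, batch_size) LongTensor
numericalized = text_field.process([preprocessed_text[0]])

# shape is (5, 1): the 4 tokens of 'my name is jack' plus one '<pad>' (index 1)
print(numericalized)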
I did not find any ready-made Dataset API to load a pandas DataFrame into a torchtext dataset, but it is pretty easy to form one.
from torchtext.data import Dataset, Example

ltoi = {l: i for i, l in enumerate(df['label'].unique())}
df['label'] = df['label'].apply(lambda y: ltoi[y])


class DataFrameDataset(Dataset):
    def __init__(self, df: pd.DataFrame, fields: list):
        super(DataFrameDataset, self).__init__(
            [
                Example.fromlist(list(r), fields)
                for i, r in df.iterrows()
            ],
            fields
        )
We can now construct the DataFrameDataset and initiate it with the pandas dataframe.
train_dataset, test_dataset = DataFrameDataset(
    df=df,
    fields=(
        ('text', text_field),
        ('label', label_field)
    )
).split()
A bit of a warning here: Dataset.split may return 3 datasets (train, val, test) instead of the 2 values unpacked above, depending on the split ratios you pass.

Using Iterator Class for Mini-batching
We then use the BucketIterator class to easily construct a mini-batching iterator.
from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset),
    batch_sizes=(2, 2),
    sort=False
)
Remember to use sort=False, otherwise it will lead to an error when you try to iterate test_iter, because we haven’t defined a sort function; yet somehow, by default, test_iter is defined to be sorted.
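To check that mini-batching behaves as expected (a minimal sketch, assuming the iterators above), pull a single batch and look at its shape; because the Field is not batch-first, batch.text comes out as (fix_length, batch_size):

# grab one mini-batch from the training iterator
batch = next(iter(train_iter))

# text is (fix_length, batch_size), e.g. torch.Size([5, 2]); label is (batch_size,)
print(batch.text.shape)
print(batch.label)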
A little note: while I do agree that we should use the DataLoader API to handle mini-batches, at this moment I have not explored how to use DataLoader with torchtext.
Example in Training PyTorch Model
Let’s define an arbitrary PyTorch model using 1 embedding layer and 1 linear layer. In the current example, I do not use pre-trained word embeddings but instead a new, untrained word embedding.
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam


class ModelParam(object):
    def __init__(self, param_dict: dict = dict()):
        self.input_size = param_dict.get('input_size', 0)
        self.vocab_size = param_dict.get('vocab_size')
        self.embedding_dim = param_dict.get('embedding_dim', 300)
        self.target_dim = param_dict.get('target_dim', 2)


class MyModel(nn.Module):
    def __init__(self, model_param: ModelParam):
        super().__init__()
        self.embedding = nn.Embedding(
            model_param.vocab_size,
            model_param.embedding_dim
        )
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim,
            model_param.target_dim
        )

    def forward(self, x):
        features = self.embedding(x).view(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features
Then I can easily iterate the training (and testing) routine.
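Since the loop itself is not reproduced above, here is a minimal sketch of what such a routine can look like, assuming the fields, vocab, and iterators built earlier; the epoch count and learning rate are arbitrary, and batch.text is transposed so the batch dimension comes first, matching the flatten in MyModel.forward:

import torch

# input_size must match fix_length=5 used in the Field above
model_param = ModelParam(param_dict={
    'vocab_size': len(vocab),
    'input_size': 5,
    'embedding_dim': 300,
    'target_dim': 2,
})

model = MyModel(model_param)
optimizer = Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for batch in train_iter:
        # batch.text is (fix_length, batch_size); transpose to (batch_size, fix_length)
        x = batch.text.t()
        y = batch.label

        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # quick evaluation pass on the held-out split
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in test_iter:
            preds = model(batch.text.t()).argmax(dim=1)
            correct += (preds == batch.label).sum().item()
            total += batch.label.size(0)
    print(f'epoch {epoch}: test accuracy {correct / total:.2f}')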
Reusing The Pre-trained Word Embedding
It is easy to modify the currently defined model into one that uses a pre-trained embedding.
class MyModelWithPretrainedEmbedding(nn.Module):
    def __init__(self, model_param: ModelParam, embedding):
        super().__init__()
        self.embedding = embedding
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim,
            model_param.target_dim
        )

    def forward(self, x):
        features = self.embedding[x].reshape(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features
I made 3 lines of modifications. You should notice that I changed the constructor to accept an embedding as input. Additionally, I changed the view method to reshape and used the get operator [] instead of the call operator () to access the embedding, since it is now a plain tensor of pre-trained vectors rather than an nn.Embedding layer.
model = MyModelWithPretrainedEmbedding(model_param, vocab.vectors)
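As a side note, and not something the snippet above does, PyTorch’s nn.Embedding.from_pretrained can wrap the same vocab.vectors tensor in a regular frozen embedding layer, which lets you keep the original MyModel (with its call-operator forward) untouched:

# wrap the pre-trained vectors in an nn.Embedding; freeze=True keeps them fixed during training
pretrained_layer = nn.Embedding.from_pretrained(vocab.vectors, freeze=True)

model = MyModel(model_param)
model.embedding = pretrained_layer  # swap the untrained layer for the pre-trained one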
Conclusion
I have finished laying out my own exploration of using torchtext to handle text data in PyTorch. I began writing this article because I had trouble using it with the tutorials currently available on the internet. I hope this article reduces that overhead for others too.
Need help writing this code? Here’s a link to the Google Colab.
References
[1] Nie, A. A Tutorial on Torchtext. 2017. http://anie.me/On-Torchtext/
[2] Text Classification with TorchText Tutorial. https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
[3] Spacy Documentation. https://spacy.io/
[4] Stanza Documentation. https://stanfordnlp.github.io/stanza/
[5] Gensim Documentation. https://radimrehurek.com/gensim/
[6] Torchtext Documentation. https://pytorch.org/text/