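Summary: PyTorch Quick Start in 10 Minutes (8)
In the previous article we briefly introduced word embeddings, the simplest word-vector representation in natural language processing. In this article we show how to use word embeddings for word prediction.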
N-Gram Language Modeling
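We start with the N-Gram model. In a text, every sentence is made up of many words, and the order of those words matters. We would like to know whether, given a few words, we can predict the word that follows them. For example, in 'I lived in France for 10 years, I can speak _ .' we want to predict that the last word is French.
Now that the goal is clear, we can introduce the N-Gram model. We start with its formula.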
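As a sketch in standard notation (the symbols $w_i$ for the $i$-th word, $T$ for the sentence length and $n$ for the N-Gram order are assumptions here, not taken from the original text), the probability of a sentence factorizes by the chain rule, and the N-Gram model approximates each factor by conditioning only on the previous $n-1$ words:

```latex
P(w_1, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1})
\approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

With two context words, as in the code that follows, this corresponds to $n = 3$, a trigram model.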
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

CONTEXT_SIZE = 2    # number of preceding words used as context
EMBEDDING_DIM = 10  # dimension of the word embedding
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
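CONTEXT_SIZE is the number of preceding words used to predict a word. It is set to 2 here, meaning that each word is predicted from the two words before it. EMBEDDING_DIM is the dimension of the word embedding, which was introduced in the previous article.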
# Group the text into ((word1, word2), next_word) triples
trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
           for i in range(len(test_sentence)-2)]
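The list comprehension above pairs every two consecutive words with the word that follows them. A minimal sketch of the same idea for an arbitrary context size is shown below; the helper name `build_ngrams` and the short example sentence are illustrative assumptions, not part of the original code.

```python
def build_ngrams(tokens, context_size):
    """Return ((context words...), target) pairs for every position with a full context."""
    return [
        (tuple(tokens[i:i + context_size]), tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


tokens = "I lived in France for 10 years".split()
pairs = build_ngrams(tokens, context_size=2)
print(pairs[0])    # (('I', 'lived'), 'in')
print(len(pairs))  # 5
```

With `context_size=2` this produces the same structure as `trigram`: the first element of each pair is the model input and the second is the word to predict.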
vocb = set(test_sentence)  # use set to remove duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}
Each word also has to be encoded, that is, represented by a number, so that it can be passed to the word embedding layer to obtain its word vector.
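To make the encoding step concrete, here is a small sketch on a shortened token list. It deviates from the code above in one labeled way: the vocabulary is passed through `sorted()`, an assumption added here so that the indices are the same on every run (the iteration order of a set of strings can change between runs).

```python
tokens = "thy beauty lies within thy deep sunken eyes".split()

# sorted() is an addition for reproducibility; the original iterates the set directly
vocab = sorted(set(tokens))
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Encode a two-word context as integer indices, then decode it again
context = ("thy", "beauty")
context_ids = [word_to_idx[w] for w in context]
print(context_ids)                            # [5, 0]
print([idx_to_word[i] for i in context_ids])  # ['thy', 'beauty']
```

The list of indices is what gets wrapped in a `LongTensor` and passed to the embedding layer.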
class NgramModel(nn.Module):
    def __init__(self, vocb_size, context_size, n_dim):
        super(NgramModel, self).__init__()
        self.n_word = vocb_size
        self.embedding = nn.Embedding(self.n_word, n_dim)
        self.linear1 = nn.Linear(context_size*n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)

    def forward(self, x):
        emb = self.embedding(x)  # (context_size, n_dim)
        emb = emb.view(1, -1)    # flatten to (1, context_size*n_dim)
        out = self.linear1(emb)
        out = F.relu(out)
        out = self.linear2(out)  # (1, vocb_size)
        # dim=1 normalizes over the vocabulary dimension
        log_prob = F.log_softmax(out, dim=1)
        return log_prob

# Note: the embedding size is passed as 100 here, not EMBEDDING_DIM
ngrammodel = NgramModel(len(word_to_idx), CONTEXT_SIZE, 100)
criterion = nn.NLLLoss()
optimizer = optim.SGD(ngrammodel.parameters(), lr=1e-3)
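In the forward pass, the input words are first looked up to obtain their word vectors. In this model two words are passed in, so the word vectors have shape (2, 100). They are flattened to (1, 200) and passed through a linear layer, a ReLU activation, and a second linear layer whose output dimension is the total number of words. This can be viewed as a classification problem in which we maximize the probability of the word to be predicted. The last step is a log softmax activation.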
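The forward pass described above can be written out as a short derivation. The symbols are assumptions introduced for illustration ($E$ for the embedding matrix, $V$ for the vocabulary size, $W_1, b_1, W_2, b_2$ for the parameters of the two linear layers); the sizes 100, 200 and 128 come from the text and the model definition.

```latex
\begin{aligned}
e &= [E_{w_1}; E_{w_2}] \in \mathbb{R}^{200}, \qquad E \in \mathbb{R}^{V \times 100} \\
h &= \mathrm{ReLU}(W_1 e + b_1) \in \mathbb{R}^{128} \\
z &= W_2 h + b_2 \in \mathbb{R}^{V} \\
\log p_k &= z_k - \log \sum_{j=1}^{V} \exp(z_j)
\end{aligned}
```

Training with the negative log-likelihood loss then minimizes $-\log p_{k^*}$ for the true next word $k^*$, which is the same as maximizing its predicted probability.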
for epoch in range(100):
    print('epoch: {}'.format(epoch+1))
    print('*'*10)
    running_loss = 0
    for data in trigram:
        word, label = data
        # tensors are used directly (PyTorch 0.4 and later); no Variable wrapper is needed
        word = torch.LongTensor([word_to_idx[i] for i in word])
        label = torch.LongTensor([word_to_idx[label]])
        # forward
        out = ngrammodel(word)
        loss = criterion(out, label)
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # average loss per training sample
    print('Loss: {:.6f}'.format(running_loss / len(trigram)))
word, label = trigram[3]
word = torch.LongTensor([word_to_idx[i] for i in word])
out = ngrammodel(word)
_, predict_label = torch.max(out, 1)  # index of the highest log-probability
predict_word = idx_to_word[predict_label.item()]
print('real word is {}, predict word is {}'.format(label, predict_word))
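Above we showed how to predict a word with the simplest one-sided N-Gram model. There is also a slightly more complex variant that predicts the middle word from the words on both sides of it. This model has its own name, the Continuous Bag-of-Words model (CBOW). It differs little in the details, so we will not go through it here; the code implementation is on GitHub.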
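To illustrate how the CBOW variant differs, the sketch below shows only the data preparation step: each target word is paired with the words on both sides of it. The helper name `build_cbow_pairs` and the `window` parameter are hypothetical, and this is not the implementation from the GitHub repository mentioned above.

```python
def build_cbow_pairs(tokens, window):
    """Return (context, target) pairs; context is `window` words on each side of target."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "When forty winters shall besiege thy brow".split()
pairs = build_cbow_pairs(tokens, window=2)
print(pairs[0])  # (['When', 'forty', 'shall', 'besiege'], 'winters')
```

Compared with the one-sided model, only the input changes: the network receives `2 * window` context words instead of the words that precede the target.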
