The Transformer: A Quick Run Through


Explore the advances in natural language modeling enabled by the Transformer, and understand its architecture and inner workings.

This is Part 3 of the 5-part series on language modeling.

The seq2seq task of machine translation solved using a Transformer-like architecture (BERT) (translate.google.com)

Introduction

In the previous post, we looked at how ELMo and ULMFiT boosted the prominence of language-model pre-training in the community. This post assumes that you have read through the previous two parts of this series and builds upon that knowledge.

English input being translated to German output using the Transformer model (Mandar Deshpande)

The Transformer is widely seen as the model that finally removed the limitations that recurrent neural networks imposed on sequence-model training. The ideas that had gained traction in language modeling and machine translation around encoder-decoder stacking proved to be valuable groundwork for this architecture. The Transformer is a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It has been shown to generalize well to other language understanding and modeling tasks, with both large and limited training data. It also achieved state-of-the-art results on the English-to-German translation task and anchored itself as the go-to architecture for subsequent advances in NLP model pre-training.

Encoder-Decoder Architecture


The stack of 6 encoders and 6 decoders used in the Transformer (Mandar Deshpande)

In this model, multiple encoders are stacked on top of each other, and the decoders are likewise stacked together. In earlier encoder-decoder models, each encoder/decoder was built from recurrent or convolutional layers, and the hidden representation from each encoder stage was passed ahead to be used by the next layer. Most seq2seq tasks can be solved with such a stack of encoders and decoders, which processes each word in the input sequence in order.
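
A toy numpy sketch of this stacking pattern, with each "encoder layer" reduced to a single illustrative transformation (real layers, whether recurrent, convolutional, or attention-based, are far richer; all names and sizes here are placeholders):

```python
import numpy as np

def toy_encoder_layer(x, W):
    # One stand-in encoder layer: a linear map plus a nonlinearity.
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
seq_len, d_model, num_layers = 5, 8, 6
x = rng.normal(size=(seq_len, d_model))  # one vector per input word
weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_layers)]

# The hidden representation from each layer is passed ahead to the next one.
for W in weights:
    x = toy_encoder_layer(x, W)
print(x.shape)  # (5, 8): one refined vector per input position after 6 layers
```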

Attention Mechanism

Attention mechanisms have become an integral part of sequence modeling and transduction models in various tasks because they allow modeling of dependencies without regard to their distance in the input or output sequences. In simple terms, the attention mechanism helps us tackle long-range dependency issues in neural networks without the use of recurrent neural networks (RNNs). It serves the same purpose as the hidden state shared across all time steps in an RNN-based encoder-decoder architecture. The attention model focuses on the relevant part of the input text sequence or image for the task being solved.

In a regular RNN-based seq2seq model, the context is passed to the decoder as the final hidden state produced by the encoder, and the decoder uses it to produce the next token of the translation or generated text.


Regular seq2seq models without an attention mechanism use only the last hidden state as the context vector (Mandar Deshpande)

Steps involved in generating the Context Vector:

  1. Initialize the context vector with random values and a size suited to the task (e.g. 128, 256, 512)
  2. Process one token from the input sequence through the encoder
  3. Use the encoder's hidden-state representation to update the context vector
  4. Repeat Steps 2 and 3 until the entire input sequence is processed (a toy numpy sketch follows this list)
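
A toy numpy sketch of these four steps using a vanilla-RNN-style encoder (the weight matrices, sizes, and token embeddings are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_ctx = 16, 128                               # embedding size, context vector size
W_xh = rng.normal(size=(d_emb, d_ctx)) * 0.1         # input-to-hidden weights (learned in practice)
W_hh = rng.normal(size=(d_ctx, d_ctx)) * 0.1         # hidden-to-hidden weights

tokens = [rng.normal(size=d_emb) for _ in range(7)]  # embeddings of a 7-token input sequence

h = rng.normal(size=d_ctx) * 0.01                    # Step 1: initialize the context vector
for x_t in tokens:                                   # Steps 2 and 4: one token at a time
    h = np.tanh(x_t @ W_xh + h @ W_hh)               # Step 3: fold this token into the context

context_vector = h                                   # the fully updated context handed to the decoder
print(context_vector.shape)                          # (128,)
```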

Once the context vector has been fully updated, it is passed to the decoder as an additional input alongside the word/token being translated. The context vector is a useful abstraction, but it acts as a bottleneck: the entire meaning of the input sequence has to be squeezed into it.

Instead of passing a single context vector to the decoder, the attention mechanism passes all the intermediate hidden states within the stack of encoders to the decoder. This enables the decoder to focus on different parts of the input sequence according to the relevance of the current word/token being processed.

Unlike previous seq2seq models, attention models perform two extra steps:

  1. More data is passed from the encoder to the decoder: all the intermediate hidden states instead of only the last one
  2. The decoder scores each encoder hidden state for its relevance to the word currently being produced, turns the scores into softmax weights, and uses the resulting weighted sum of hidden states as the context vector (see the sketch after this list)
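
A hedged numpy sketch of that scoring step, using a plain dot-product score (one of several scoring functions used in practice; the hidden states here are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 64
encoder_hidden = rng.normal(size=(7, d))  # all intermediate hidden states from the encoder
decoder_state = rng.normal(size=d)        # hidden state for the word currently being produced

scores = encoder_hidden @ decoder_state   # relevance score for each input position
weights = softmax(scores)                 # softmax scores; relevant positions dominate
context = weights @ encoder_hidden        # weighted sum of hidden states = context vector
print(weights.round(2), context.shape)
```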


Attention Mechanism used to create the context vector passed to the decoder (Mandar Deshpande)

Peek Inside the Transformer

The Transformer consists of 6 stacked encoders and 6 stacked decoders that form the main architecture of the model. This number can vary with the use case, but 6 was used in the original paper.

Let us consider a single encoder and decoder from the stack to simplify our understanding of how they work.


Components inside the Encoder and Decoder in the Transformer (Mandar Deshpande)

Architecture

Each encoder consists of a Self-Attention layer followed by a Feed-Forward network. In most attention mechanisms, hidden states from previous time steps are used to compute attention. Self-attention instead uses the trained embeddings from the same layer to compute the attention vector. To illustrate, self-attention can be thought of as a mechanism for coreference resolution within a sentence:

“The man was eating his meal while he was thinking about his family”

In the above sentence, the model needs to build an understanding of what "he" refers to, namely that it is a coreference to "the man". This is enabled by the self-attention mechanism in the Transformer. A detailed discussion of self-attention (using multiple heads) is beyond the scope of this blog and can be found in the original paper.
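
A minimal single-head scaled dot-product self-attention sketch in numpy; the projection matrices and sizes are illustrative placeholders, and the actual model runs several such heads in parallel:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 16
X = rng.normal(size=(seq_len, d_model))      # embeddings of one sentence, one row per word

W_q = rng.normal(size=(d_model, d_k)) * 0.1  # query/key/value projections (learned in practice)
W_k = rng.normal(size=(d_model, d_k)) * 0.1
W_v = rng.normal(size=(d_model, d_k)) * 0.1

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # all derived from the same layer's embeddings
scores = Q @ K.T / np.sqrt(d_k)              # how strongly each word attends to every other word
A = softmax(scores, axis=-1)                 # attention weights, e.g. "he" attending to "man"
Z = A @ V                                    # each position becomes a weighted mix of all positions
print(A.shape, Z.shape)                      # (6, 6) (6, 16)
```

In the full Transformer this computation is repeated across multiple heads, and the per-head outputs are concatenated and projected back to the model dimension.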

The decoder has the same two layers as the encoder, except that an additional encoder-decoder attention layer is introduced between them to help the decoder focus on the relevant parts of the attention vectors produced by the encoder.


Simplified view of 2 stacked encoders and 2 stacked decoders, used to explore the internal architecture (Mandar Deshpande)

Position-wise Feed-Forward Networks

It is important to note that the words in the input sequence interact with one another in the self-attention layer, but each word then flows through the feed-forward network independently, with the same weights applied at every position. The output of the feed-forward network is passed on to the next encoder in the stack, which builds on the context learned by the previous encoders.
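
A sketch of this feed-forward sub-layer as described in the paper: two linear transformations with a ReLU in between, applied identically at every position (the 512/2048 sizes follow the base model; the weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 512, 2048
x = rng.normal(size=(seq_len, d_model))             # self-attention output, one row per word

W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

# FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
ffn_out = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(ffn_out.shape)                                # (6, 512)
```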

Positional Encoding

To embed a sense of order in the input sequence, a positional encoding is added to each word embedding. This augmented word embedding is passed as input to Encoder 1. Since the model uses no recurrence or convolution, the positional encodings inject information about the relative position of each token in the input sentence.
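
A sketch of the sinusoidal positional encodings used in the original paper (it assumes an even d_model; the word embeddings are random placeholders):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 10, 512
word_embeddings = rng.normal(size=(seq_len, d_model))
encoder_input = word_embeddings + positional_encoding(seq_len, d_model)  # summed element-wise
print(encoder_input.shape)  # (10, 512)
```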

Residual Connections with Normalization

The output of the self-attention layer is added to its input (the position-encoded word embeddings, in the first encoder) through a residual connection and then layer-normalized. The feed-forward layer follows the same scheme.
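
A minimal sketch of this Add & Norm step, with the learnable scale and bias of layer normalization fixed to 1 and 0 for simplicity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
seq_len, d_model = 6, 512
x = rng.normal(size=(seq_len, d_model))             # sub-layer input (e.g. the word embeddings)
sublayer_out = rng.normal(size=(seq_len, d_model))  # stand-in for the self-attention output

out = layer_norm(x + sublayer_out)                  # residual connection, then layer normalization
print(out.shape)                                    # (6, 512)
```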

Fully Connected Linear with Softmax

Once the final decoder in the stack emits its output vector, it needs to be converted into the translated word. All the required information is already embedded as floats in this vector; we just need to convert it into a probability distribution over the possible next words in the translation.

The fully connected linear layer converts the float vector into scores, which are transformed into probability values using the softmax function. The index with the highest softmax value is chosen, and the corresponding word is retrieved from the output vocabulary learned from the training set.
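
A small sketch of this final projection over a toy vocabulary (the weight matrix is a random placeholder that would be learned during training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10000
decoder_out = rng.normal(size=d_model)                 # float vector from the final decoder
W_out = rng.normal(size=(d_model, vocab_size)) * 0.02  # fully connected linear layer

logits = decoder_out @ W_out                           # one score per word in the output vocabulary
probs = softmax(logits)                                # scores turned into probabilities
next_word_id = int(np.argmax(probs))                   # index with the highest softmax value
print(next_word_id)                                    # looked up in the output vocabulary
```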

Transformer Training Explained

Training is supervised, i.e. it uses a labeled dataset that serves as the benchmark against which the output word probabilities are compared and corrected.

Essentially, each word in the output vocabulary is represented as a one-hot vector that is 1 only at the index of that word and 0 everywhere else. Once we obtain the softmax output vector of normalized probability values, we can compare it with the one-hot ground-truth vector to improve the model parameters/weights.

The two vectors can be compared using measures such as cosine similarity, cross-entropy, and/or Kullback-Leibler divergence. At the beginning of training, the output probability distribution is far from the ground-truth one-hot vector; as training proceeds and the weights are optimized, the output word probabilities track the ground-truth vectors more and more closely.
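
A tiny numpy sketch of that comparison using cross-entropy (the probability values are made up for illustration):

```python
import numpy as np

vocab_size = 6
probs = np.array([0.05, 0.10, 0.60, 0.10, 0.10, 0.05])  # model's softmax output
target = np.zeros(vocab_size)
target[2] = 1.0                                          # one-hot ground truth: correct word at index 2

cross_entropy = -np.sum(target * np.log(probs + 1e-12))  # equals -log(probability of the correct word)
print(round(cross_entropy, 3))                           # ~0.511; this loss shrinks as training proceeds
```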

