Blender Bot — Part 2: The Transformer

Facebook’s open-sourced chatbot “Blender” is breaking all the records previously set by Google’s “Meena”. In this post, we will go over the Poly-Encoder Transformer architecture that forms the crux of Blender.

You can read Part 1 of this series, where we went over the Data Sets on which the chatbot is trained, on TDS.

Assuming the reader has a prior understanding of Attention, Transformers, BERT and Generative Language Models, I shall march forth.

Introduction:

Before seeing how the Poly-Encoder is used in the context of Blender, we will first understand it independently. The datasets and (fake) training tasks employed in pre-training and fine-tuning Blender (which are explained in detail in Part 1) should not be confused with the details I am about to explain below. The experimental settings given here are meant to explain a specific task called “Multi-Sentence Scoring” and the Encoder architectures trained for that task, in a generic setting. Then, among the Encoder architectures trained for this task, we will see how the Poly-Encoders are superior.

Task:

Multi-Sentence Scoring performs a pairwise comparison between input and output sequences: given an input sequence, we score a set of candidate labels.

From here on, we’ll represent the input-output pair by [INPUT, LABEL].

The goal is to find the best label from among a finite list of candidate labels. The Encoder used is BERT-Base, with 12 Encoder blocks, 12 Attention heads and a hidden size of 768.
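To make the encoder size concrete, here is a minimal sketch of such a BERT-Base-sized encoder, assuming the Hugging Face transformers library (the library choice and names are my own and are not part of the original setup):

```python
# Minimal sketch of a BERT-Base-sized encoder (12 blocks, 12 heads, hidden size 768),
# assuming the Hugging Face `transformers` library. Illustrative only.
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=12,    # 12 Encoder blocks
    num_attention_heads=12,  # 12 Attention heads
    hidden_size=768,         # 768-dimensional hidden states
)
encoder = BertModel(config)  # randomly initialised; pre-training and fine-tuning follow below
# Alternatively, start from the released BERT-Base weights:
# encoder = BertModel.from_pretrained("bert-base-uncased")
```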

Pre-Training:

Two versions of pre-training are done for this task:

  1. Pre-trained like BERT, on the Toronto Book Corpus and Wikipedia. Here, the [INPUT, LABEL] can be thought of as [Sentence A, Sentence B].
  2. Pre-trained on the public-domain social media conversations available from Reddit. Here, the [INPUT, LABEL] can be understood as [Context, Next Sentence].

Fake Training Tasks:

The training tasks are the same ones used in the pre-training of BERT.

  1. MLM (Masked Language Model): Here, a certain percentage of the input tokens are masked at random (with the [MASK] token). The task is to learn to predict the masked tokens.
  2. NSP (Next Sentence Prediction): Here, given two sentences A and B, the task is to predict whether B follows A (with Negative Sampling). Negative Sampling is implemented by taking a random sentence from the dataset as B, 50% of the time. (A toy sketch of both tasks follows below.)
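As a rough illustration of how such training pairs are typically constructed, here is a toy sketch in plain Python (my own example, not the authors’ code; the 15% masking rate is BERT’s standard choice):

```python
# Toy sketch of building MLM and NSP training examples, BERT-style. Illustrative only.
import random

def make_mlm_example(tokens, mask_prob=0.15):
    """Mask a random subset of tokens; the model learns to predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)    # loss is computed only on masked positions
        else:
            masked.append(tok)
            targets.append(None)   # ignored by the loss
    return masked, targets

def make_nsp_example(sentence_a, true_next, corpus):
    """Keep the true next sentence half the time, else sample a random (negative) one."""
    if random.random() < 0.5:
        return sentence_a, true_next, 1          # label 1: B really follows A
    return sentence_a, random.choice(corpus), 0  # label 0: random sentence as B
```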

A little digression here. A trick I use to remember the nature of these pre-training tasks in BERT is to draw a direct comparison with the fake training tasks used to generate Word2Vec embeddings, namely 1) CBOW and 2) Skip-Gram. If you recall, in CBOW (Continuous Bag of Words), given a context the task is to predict the target word, which is similar to the MLM task. In the Skip-Gram model, given the target word we predict the context; but instead of predicting the context/neighbouring word directly, we change the dataset so that the task becomes: given the target word and another word, predict whether the other word is a neighbour of the target word or not (a binary classification problem). Since the initial dataset was formed only from target words and words in their context, the modified dataset contains only positive examples, so we introduce noise by negative sampling. This is very similar to the NSP task of BERT. (If you think there is any inconsistency in drawing such a comparison between the training tasks of BERT and Word Embeddings, do let me know in the comments. Thanks!)

Fine-Tuning:

The model is fine-tuned separately on the ConvAI2 dataset, which encourages it to learn the “Personality” trait, and on the Ubuntu chat logs, which help it learn “Domain Knowledge/Expertise”.

Architectures:

We will see 3 Encoder architectures to solve the “Multi-Sentence Scoring” task, namely,

  1. Bi-Encoder
  2. Cross-Encoder
  3. Poly-Encoder

The performance of an architecture during inferencing is measured both by the quality of the prediction and also by the prediction speed.

Before proceeding, it is important to remember that this is a Retrieval task and NOT a Generative one: we only need to retrieve the correct label from a fixed set of candidate labels.

Bi-Encoder:

[Figure: Bi-Encoder architecture, from Ref. [1]]

In Bi-Encoders, Self-Attention is performed over the Input and the Label separately. This is nothing but the more generic concept of a Vector Space model. This architecture has the advantage of being faster during inferencing, because we can pre-compute and cache the encodings of a large, fixed set of candidate labels. This is possible because the labels are encoded separately and have no dependency on the input context.

  • Both the INPUT and LABEL are surrounded by a special token [S]. This is similar to the [CLS] token in BERT, which captures the features of the entire sentence.
  • The embedding input to the Encoder is the sum of the Token Embeddings, Segment Embeddings and Position Embeddings. The Segment Embedding generally indicates whether a token belongs to Sentence A or Sentence B (in the context of BERT). Since the INPUT and LABEL are encoded separately here, the Segment Embedding is ‘0’ in both cases.
  • Map the input and candidate label separately to a common feature space: y_ctxt = red(T1(INPUT)) and y_cand = red(T2(LABEL)), where T1 and T2 are two separate Transformers (Encoders) and red is the reduce function described below.
  • The Encoder, after performing Self-Attention on the Input token embeddings, gives the encoder representations for every token like:
T1(INPUT) = (Out_1, Out_2, …, Out_N), i.e. one output embedding per input token.
  • A reduce function (red) is then used to reduce this to a single embedding representation. The reduce function can be any of the following:

-> take the representation of the first token, i.e. the one corresponding to the special token [S],

-> or take the average over all the output embeddings,

-> or take the average over the first ‘m’ (m < N) output embeddings.

  • Once the INPUT and LABEL are represented thus in a common vector space, we measure the similarity between them using a standard dot product (or any other non-linear function): s(INPUT, LABEL) = y_ctxt · y_cand.
  • We then minimize the Cross-Entropy loss function, where the logits look like y_ctxt · y_cand_1, …, y_ctxt · y_cand_n, with cand_1 being the correct label and the rest negatives (the other labels in the batch). A minimal scoring sketch in code follows this list.
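The sketch below shows this scheme in PyTorch, assuming BERT-style encoders that return per-token hidden states; all names are illustrative and not from the paper’s code.

```python
# Bi-Encoder scoring sketch: candidates are encoded once (and can be cached),
# the input context is encoded at query time, and scores are dot products.
import torch

def reduce_first(hidden_states):             # hidden_states: (batch, seq_len, dim)
    return hidden_states[:, 0, :]            # representation of the first ([S]) token

def precompute_candidates(cand_encoder, candidate_batches):
    # Labels have no dependency on the input, so this runs offline and is cached.
    with torch.no_grad():
        embs = [reduce_first(cand_encoder(**batch).last_hidden_state)
                for batch in candidate_batches]
    return torch.cat(embs, dim=0)            # (num_candidates, dim)

def bi_encoder_scores(ctxt_encoder, ctxt_inputs, cached_cand_embs):
    y_ctxt = reduce_first(ctxt_encoder(**ctxt_inputs).last_hidden_state)  # (1, dim)
    return y_ctxt @ cached_cand_embs.T       # (1, num_candidates) dot-product scores
```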

Cross-Encoder:

[Figure: Cross-Encoder architecture, from Ref. [1]]
  • Here, the INPUT and the LABEL are concatenated, and Full Self-Attention is performed over the entire sequence of input and label. That is, every token of the input attends to every token of the label and vice versa. This gives rise to rich interactions between the input and the label.
  • Even here, both the INPUT and LABEL are surrounded by a special token [S].
  • Again, the embedding input to the Encoder is the sum of the Token, Segment and Position Embeddings. Since the INPUT and LABEL are combined, the Segment Embedding is ‘0’ for an INPUT token and ‘1’ for a LABEL token.
  • Cross-Encoders give higher accuracy than Bi-Encoders because of the full bidirectional attention between the input and the label. At the same time, they are extremely slow during inferencing, because each candidate label must be concatenated with the input context and cannot be encoded separately as in the Bi-Encoder. Therefore, candidate embeddings cannot be pre-computed and cached. When the number of candidate labels is huge (as it is in most real scenarios), Cross-Encoders do not scale.
  • After Self-Attention, the Transformer gives the encoder representations for all the input tokens. We reduce this to a single representation, by taking the embedding corresponding to the first token (i.e. the special token [S]). This embedding vector is then converted to a scalar score by doing a linear projection. These two steps are shown below:

y_ctxt,cand = first(T(INPUT, LABEL)),    s(INPUT, LABEL) = y_ctxt,cand · W

  • The training objective here too is to minimize the Cross-Entropy loss function given by the logits s(INPUT, cand_1), …, s(INPUT, cand_n),
  • where ‘cand_1’ is the correct candidate and the others are negatives taken from the training set. One problem here is that, unlike in the Bi-Encoder, we cannot use the other labels in the batch as negative training samples; we have to use the external negatives provided in the training set. Because the computation is heavy, the in-memory batch size of the Cross-Encoder is also very small. (A rough scoring sketch in code follows this list.)
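As a rough sketch under the same illustrative assumptions (a BERT-style encoder returning per-token hidden states; names are mine, not the authors’):

```python
# Cross-Encoder scoring sketch: every candidate is concatenated with the input and
# re-encoded, so nothing can be cached; a linear layer W maps [S] to a scalar score.
import torch
import torch.nn as nn

class CrossEncoderScorer(nn.Module):
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder              # a BERT-style encoder
        self.w = nn.Linear(hidden_size, 1)  # linear projection to a scalar

    def forward(self, input_ids, attention_mask, token_type_ids):
        # input_ids holds the concatenated [S] INPUT [S] LABEL [S] sequence;
        # token_type_ids (segment embeddings) are 0 for INPUT tokens, 1 for LABEL tokens.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask,
                              token_type_ids=token_type_ids).last_hidden_state
        y = hidden[:, 0, :]                 # first-token ([S]) representation
        return self.w(y).squeeze(-1)        # one score per (input, candidate) pair
```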

Poly-Encoder:

[Figure: Poly-Encoder architecture, from Ref. [1]]
  • The Poly-Encoder takes the best qualities of the Bi- and Cross-Encoders. It is therefore faster during inferencing than the Cross-Encoder and more accurate than the Bi-Encoder.
  • The Candidate Label is encoded separately.
  • Given the input context (a sequence of N tokens),

we perform 3 types of Attention, as explained below:

  • First, Self-Attention is performed over the Input Context’s tokens, giving one output embedding per token: Out_1, …, Out_N.
  • Second, we learn ‘m’ codes (or queries in the parlance of Self-Attention), where m < N (N being the length of the INPUT). The number of codes to be learnt, ‘m’, is a hyperparameter. Each code Ci attends over all the outputs of the previous Self-Attention. The ‘m’ codes are randomly initialized.
  • We first get the Attention weights (w’s) by performing dot-product attention (or multiplicative attention in general) between the ‘m’ codes, which serve as the “Queries”, and the previous Self-Attention outputs (Out’s), which serve as the “Keys”. We then use these attention weights to compute a weighted sum of the previous Self-Attention outputs (Out’s), which serve as the “Values”. This yields ‘m’ global feature vectors of the Input Context.

  • Think about why we are doing this kind of an Attention mechanism here. In a Bi-Encoder, the candidate label does not attend over the tokens of the input context. A Cross-Encoder on the other extreme, makes the candidate label attend over every token of the input context. Somehow in the Poly-Encoder we are trying to find a middle ground, by making the candidate label embedding attend over not the entire input context, but over a subset of features learnt from the input context.
  • The third kind of attention (alluded to in the previous paragraph) is between the ‘m’ global features of the Input Context and the embedding of the Candidate Label.

  • Now we compute the Similarity score between the Input Context embedding and the Candidate Label embedding as a dot product: s(INPUT, LABEL) = y_ctxt · y_cand, where y_ctxt is the weighted sum of the ‘m’ global features obtained from the third attention step and y_cand is the Candidate Label embedding.
  • Once again, the training objective here too is to minimize the Cross-Entropy loss function given by the logits, as before. (An illustrative sketch of the Poly-Encoder attention in code follows this list.)
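The sketch below illustrates the three attention steps in PyTorch, written from my reading of the paper; the ‘m’ codes, the context Self-Attention outputs (Out’s) and a separately encoded candidate embedding are assumed to be given, and all names are illustrative.

```python
# Poly-Encoder attention sketch: m learned codes attend over the context outputs,
# then the candidate embedding attends over the resulting m global features.
import torch
import torch.nn as nn

class PolyEncoderHead(nn.Module):
    def __init__(self, m_codes=64, dim=768):
        super().__init__()
        # the m codes are randomly initialised and learned; m is a hyperparameter
        self.codes = nn.Parameter(torch.randn(m_codes, dim))

    def forward(self, ctxt_outputs, y_cand):
        # ctxt_outputs: (N, dim) self-attention outputs over the input context tokens
        # y_cand:       (dim,)   candidate label embedding, encoded separately
        w = torch.softmax(self.codes @ ctxt_outputs.T, dim=-1)  # (m, N): codes as Queries, Outs as Keys
        global_feats = w @ ctxt_outputs                         # (m, dim): weighted sums of the Values
        w2 = torch.softmax(y_cand @ global_feats.T, dim=-1)     # (m,): candidate attends over the m features
        y_ctxt = w2 @ global_feats                              # (dim,): final input-context embedding
        return y_ctxt @ y_cand                                  # scalar similarity score
```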

We looked at three different Encoder architectures for the task of “Multi-Sentence Scoring” and saw how the Poly-Encoder is better. In the next part, we will see how the Poly-Encoders are used in Blender, along with the different Model Architectures and training objectives. We will also touch upon the Evaluation methods used to compare the performance of Blender with that of other Chatbots.

Note: All the notations, formulae and Encoder block diagrams above are the same as those used in the original paper in Ref. [1].

References:

  1. Poly-Encoder Transformer: https://arxiv.org/abs/1905.01969
  2. BERT: https://arxiv.org/abs/1810.04805
  3. Transformers: https://arxiv.org/abs/1706.03762
