Facebook’s open-sourced chatbot “Blender” is breaking the records previously set by Google’s “Meena”. In this post, we will go over the Poly-Encoder Transformer architecture, which forms the crux of Blender.
You can read Part 1 of this series, where we have gone over the Data Sets on which the chatbot is trained, on TDS.
Assuming the reader has a prior understanding of Attention, Transformers, BERT and Generative Language Models, I shall march forth.
Introduction:
Before seeing how the Poly-Encoder is used in the context of Blender, we will first understand it independently. The datasets and (fake) training tasks employed in pre-training and fine-tuning Blender (which are explained in detail in Part 1) should not be confused with the details I am about to explain below. The experimental settings given here are meant to explain a specific task called “Multi-Sentence Scoring” and the Encoder architectures trained for that task, in a generic setting. Then, among the Encoder architectures trained for this task, we will see how the Poly-Encoders are superior.
Task:
Multi-Sentence scoring does pairwise comparison between the input and output sequences. Given an input sequence, we score a set of candidate labels.
From here on, we’ll represent the input-output pair by [INPUT, LABEL] .
The goal is to find the best label from among a finite list of candidate labels. The Encoder used is BERT-Base, with 12 Encoder blocks, 12 Attention heads and a hidden size of 768.
Pre-Training:
Two versions of pre-training are done for this task:
- pre-trained like BERT, on the Toronto Book Corpus and Wikipedia. Here, the [INPUT, LABEL] can be thought of as [Sentence A, Sentence B].
- pre-trained on the public domain social media conversations available from Reddit. Here, the [INPUT, LABEL] can be understood as [Context, Next Sentence]
Fake Training Tasks:
The training tasks are the same ones used in the pre-training of BERT.
- MLM: Masked Language Model: Here, a certain percentage of the input tokens are masked at random (with the [MASK] token). The task is then to learn to predict the masked tokens.
- NSP: Next Sentence Prediction: Here, given 2 sentences A and B, the task is to say whether B follows A (with Negative Sampling). Negative Sampling is implemented by taking a random sentence from the dataset as B, 50% of the time.
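To make the two tasks concrete, here is a minimal sketch of how the training examples could be built. The function names (`mask_tokens`, `make_nsp_pairs`) are hypothetical, and the 15% masking rate is an assumption taken from BERT's usual setting, since the text above only says "a certain percentage". The sketch also always substitutes [MASK], whereas BERT sometimes keeps the token or swaps in a random one.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (masked, targets). targets[i] holds the original token at a
    masked position and None everywhere else."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            targets.append(tok)   # ...and keep it as the prediction target
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

def make_nsp_pairs(sentences, rng=random):
    """Build (A, B, is_next) examples from an ordered list of at least
    3 distinct sentences. About half of the pairs are negatives."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really follows A.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative sample: a random sentence that is neither A
            # nor the true next sentence.
            choices = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(choices), 0))
    return pairs
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.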
A little digression here. A trick I use to remember the nature of these pre-training tasks in BERT is to compare them with the fake training tasks used to generate Word2Vec embeddings, namely 1) CBOW and 2) Skip-Gram. If you recall, in CBOW (Continuous Bag of Words), the task is to predict the target word given its context, which is similar to the MLM task. In the Skip-Gram model, the task is to predict the context given the target word. With negative sampling, however, instead of predicting the context/neighbouring word, we change the dataset and the task becomes: given the target word and another word, predict whether the other word is a neighbour of the target word or not (a binary classification problem). Since the initial dataset was formed only from target words and the words in their context, the modified dataset contains only positive examples, so we introduce noise by negative sampling. This is very similar to the NSP task of BERT. (If you think there is any inconsistency in drawing such a comparison between the training tasks of BERT and Word Embeddings, do let me know in the comments. Thanks!)
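The Skip-Gram side of this comparison can be sketched in a few lines. This is only an illustration of how the binary-classification dataset is built, not the Word2Vec implementation. The helper names, the window size and the number of negatives `k` are all assumptions.

```python
import random

def skipgram_pairs(tokens, window=2):
    """All (target, context) pairs whose positions are within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def with_negative_samples(pairs, vocab, k=2, rng=random):
    """Label every observed pair 1, then add up to k noise pairs labelled 0
    per observed pair. Noise words are never true neighbours of the target."""
    observed = set(pairs)
    examples = []
    for target, context in pairs:
        examples.append((target, context, 1))
        candidates = [w for w in vocab
                      if w != target and (target, w) not in observed]
        for noise in rng.sample(candidates, min(k, len(candidates))):
            examples.append((target, noise, 0))
    return examples
```

The label-1 rows play the role of "B follows A" in NSP, and the label-0 rows play the role of the randomly drawn sentence B.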
Fine-Tuning:
The model is fine-tuned separately on two datasets: ConvAI2, which encourages it to learn the “Personality” trait, and the Ubuntu chat logs, which help it learn “Domain Knowledge/Expertise”.
Architectures:
We will see 3 Encoder architectures to solve the “Multi-Sentence Scoring” task, namely,
- Bi-Encoder
- Cross-Encoder
- Poly-Encoder
The performance of an architecture at inference time is measured both by the quality of the prediction and by the prediction speed.
Before proceeding, it is important to remember that this is a Retrieval task and NOT a Generative one: we only need to retrieve the correct label from a fixed set of candidate labels.
Bi-Encoder:
In Bi-Encoders, Self-Attention is performed over the Input and Label separately. This is nothing but the more generic concept of a Vector Space model. This architecture has the advantage of being faster at inference time, because we can pre-compute and cache the encodings of a large, fixed set of candidate labels. This is possible because the labels are encoded separately and have no dependency on the input context.
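The caching idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_encode` is a hypothetical stand-in for a real Transformer encoder (it just averages pseudo-random token vectors), and scoring by the dot product of L2-normalised vectors is one simple choice of similarity, not necessarily the exact scoring used in the paper.

```python
import math
import random
import zlib

DIM = 16

def token_vector(token):
    """Deterministic pseudo-random vector per token (toy embedding)."""
    rng = random.Random(zlib.crc32(token.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def toy_encode(text):
    """Hypothetical encoder: average the token vectors, then L2-normalise."""
    vectors = [token_vector(tok) for tok in text.lower().split()]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class BiEncoderIndex:
    def __init__(self, candidates):
        self.candidates = list(candidates)
        # Candidate labels are encoded once, independently of any input
        # context, and the results are cached for every later query.
        self.cache = [toy_encode(c) for c in self.candidates]

    def best_label(self, context):
        ctx = toy_encode(context)  # only the context is encoded per query
        scores = [dot(ctx, cand) for cand in self.cache]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.candidates[best]
```

At query time the cost is one encoder pass for the context plus one cheap dot product per cached candidate, which is why this design scales to large candidate sets.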
- Both the INPUT and LABEL are surrounded by a special token [S]. This is similar to the [CLS] token in BERT, which captures the features of the entire sentence.
- The embeddings input to the Encoder are the sum of Token Embeddings + Segment Embeddings + Position Embeddings. The Segment Embedding is generally used to say if a token belongs to Sentence A or Sentence B (in the context of BERT). Since the INPUT and LABEL are encoded separately here, the Segment Embedding is ‘0’ in both cases.
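- Map the input and candidate label separately to a common feature space. In the paper’s notation, y_ctxt = red(T1(ctxt)) and y_cand = red(T2(cand)), where ctxt is the INPUT, cand is the LABEL, and T1 and T2 are two separate Transformers (Encoders).
- The Encoder, after performing Self-Attention on the Input token embeddings, gives an output representation for every token: T(x) = h1, ..., hN, where N is the number of input tokens.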
- A reduce function ( red ) is then used to reduce this to a single embedding representation. The reduce function can be any of the following:
-> it can either take the representation of the first token. This is the representation corresponding to the special token [S]
-> or we can take the average over all the output embeddings
-> or we can take the average over the first ‘m’ (m < N) output embeddings
Cross-Encoder:
Poly-Encoder:
[…] we perform 3 types of Attention, as explained below:
[…]
We saw three different Encoder architectures for the task of “Multi-Sentence Scoring” and saw how the Poly-Encoders were better. In the next part, we will see how the Poly-Encoders are used in Blender, along with the different Model Architectures and training objectives. We will also touch upon the Evaluation methods used to compare the performance of Blender with that of other Chatbots.
Note: All the notations, formulae and the Encoder block diagrams above are the same as used in the original paper mentioned in Ref. [1].
References: