Blender Bot — Part 2: The Transformer

Category: IT Technology · Published: 3 years ago


Facebook’s open-sourced chatbot “Blender” is breaking all records previously set by Google’s “Meena”. In this post, we will go over the Poly-Encoder Transformer architecture, which forms the crux of Blender.

You can read Part 1 of this series, where we went over the Data Sets on which the chatbot is trained, on TDS.

Assuming the reader has a prior understanding of Attention, Transformers, BERT and Generative Language Models, I shall march forth.

Introduction:

Before seeing how the Poly-Encoder is used in the context of Blender, we will first understand it independently. The datasets and (fake) training tasks employed in pre-training and fine-tuning Blender (which are explained in detail in Part 1) should not be confused with the details I am about to explain below. The experimental settings given here are meant to explain a specific task called “Multi-Sentence Scoring” and the Encoder architectures trained for that task, in a generic setting. Then, among the Encoder architectures trained for this task, we will see how the Poly-Encoders are superior.

Task:

Multi-Sentence scoring does pairwise comparison between the input and output sequences. Given an input sequence, we score a set of candidate labels.

From here on, we’ll represent the input-output pair by [INPUT, LABEL] .

The goal is to find the best label from among a finite list of candidate labels. The Encoder used is BERT-Base, with 12 Encoder blocks, 12 Attention heads and a hidden size of 768.

Pre-Training:

Two versions of pre-training are done for this task:

Pre-trained like BERT, on the Toronto Book Corpus and Wikipedia. Here, the [INPUT, LABEL] can be thought of as [Sentence A, Sentence B].
Pre-trained on the public domain social media conversations available from Reddit. Here, the [INPUT, LABEL] can be understood as [Context, Next Sentence].

Fake Training Tasks:

The training tasks are the same ones used in the pre-training of BERT.

MLM: Masked Language Model: Here a certain percentage of the input tokens are masked at random (with [MASK] token). The task then is to learn to predict the masked tokens.
NSP: Next Sentence Prediction: Here, given 2 sentences A and B, the task is to say whether B follows A. Negative Sampling is implemented by taking a random sentence from the dataset as B, 50% of the time.
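The two tasks above can be sketched in a few lines of plain Python. This is only an illustration of how the training examples might be built, not BERT's actual preprocessing: the function names, the `[MASK]` handling and the 15% default rate are assumptions made for the sketch, and real BERT additionally leaves some selected tokens unchanged or swaps them for random tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (masked, targets); targets[i] holds the original token
    where a mask was applied and None everywhere else."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must predict this position
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

def make_nsp_pairs(sentences, rng=random):
    """Build (A, B, is_next) examples; about half the time B is a
    randomly drawn sentence instead of the true next one."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative sample: any sentence except the true next one
            choices = [s for j, s in enumerate(sentences) if j != i + 1]
            pairs.append((sentences[i], rng.choice(choices), 0))
    return pairs
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.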

A little digression here. A trick I use to remember the nature of these pre-training tasks in BERT is to draw a direct comparison with the fake training tasks used to generate the Word2Vec embeddings, namely 1) CBOW and 2) Skip-Gram. If you recall, in CBOW (Continuous Bag of Words) the task is to predict the target word given its context, which is similar to the MLM task. In the Skip-Gram model, the task is to predict the context given the target word. With negative sampling, instead of predicting the context/neighbouring word directly, we change the dataset and the task becomes: given the target word and another word, predict whether the other word is a neighbour of the target word or not (a binary classification problem). Since the initial dataset was formed only from target words and the words in their context, the modified dataset contains only positive examples, so we introduce noise by negative sampling. This is very similar to the NSP task of BERT. (If you think there is any inconsistency in drawing such a comparison between the training tasks of BERT and Word Embeddings, do let me know in the comments. Thanks!)
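To make the Skip-Gram analogy concrete, here is a rough sketch of how a (target word, other word) -> neighbour-or-not dataset could be built. The function name, window size and uniform sampling of negatives are assumptions made for the sketch; Word2Vec itself draws negatives from a smoothed unigram distribution.

```python
import random

def skipgram_examples(tokens, window=2, num_negatives=2, rng=random):
    """Return (target, other, label) triples: label 1 for true
    neighbours, label 0 for sampled noise words."""
    vocab = sorted(set(tokens))
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = {tokens[j] for j in range(lo, hi) if j != i}
        # Positive examples: words that really occur in the window
        for word in sorted(neighbours):
            examples.append((target, word, 1))
        # Negative sampling: words outside this occurrence's window
        noise = [w for w in vocab if w not in neighbours and w != target]
        for _ in range(min(num_negatives, len(noise))):
            examples.append((target, rng.choice(noise), 0))
    return examples
```

The resulting binary classification dataset plays the same role as the [Sentence A, Sentence B, is-next] examples of NSP.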

Fine-Tuning:

The model is fine-tuned separately on the ConvAI2 dataset, which encourages it to learn the “Personality” trait, and on the Ubuntu chat logs, which help it learn “Domain Knowledge/Expertise”.

Architectures:

We will see 3 Encoder architectures to solve the “Multi-Sentence Scoring” task, namely,

Bi-Encoder
Cross-Encoder
Poly-Encoder

The performance of an architecture at inference time is measured both by the quality of its predictions and by its prediction speed.

Before proceeding, it is important to remember that this is a Retrieval task and NOT a Generative one: we only need to retrieve the correct label from a fixed set of candidate labels.

Bi-Encoder:

Figure: Bi-Encoder architecture, from Ref. [1].

In Bi-Encoders, Self-Attention is performed over the Input and the Label separately. This is an instance of the more generic Vector Space model, in which both are embedded in a common space and compared there. This architecture has the advantage of being faster at inference time, because we can pre-compute and cache the encodings of a large, fixed set of candidate labels. This is possible because the labels are encoded separately and have no dependency on the input context.

Both the INPUT and LABEL are surrounded by a special token [S]. This is similar to the [CLS] token in BERT, which captures the features of the entire sentence.
The embedding fed to the Encoder is the sum of Token Embeddings + Segment Embeddings + Position Embeddings. The Segment Embedding is generally used to say whether a token belongs to Sentence A or Sentence B (in the context of BERT). Since the INPUT and LABEL are encoded separately here, the Segment Embedding is ‘0’ in both cases.
The input and the candidate label are mapped separately to a common feature space. In the formula below, ctxt is the INPUT, cand is the LABEL, and T1 and T2 are two separate Transformers (Encoders):

$$y_{ctxt} = red(T_1(ctxt)), \qquad y_{cand} = red(T_2(cand))$$

The Encoder, after performing Self-Attention on the Input token embeddings, gives an encoder representation for every token:

$$T(x) = (h_1, \ldots, h_N)$$

A reduce function ( red ) is then used to reduce this to a single embedding representation. The reduce function can be any of the following:

-> it can either take the representation of the first token. This is the representation corresponding to the special token [S]

-> or we can take the average over all the output embeddings

-> or we can take the average over the first ‘m’ (m < N) output embeddings
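The three choices of reduce function listed above can be sketched as follows. The encoder outputs are assumed to be a plain list of equal-length vectors, one per token, and the function names are made up for this illustration.

```python
def reduce_first(outputs):
    """Take the representation of the first token, i.e. [S]."""
    return list(outputs[0])

def reduce_mean(outputs):
    """Average over all the output embeddings, dimension by dimension."""
    n = len(outputs)
    return [sum(column) / n for column in zip(*outputs)]

def reduce_first_m(outputs, m):
    """Average over the first m output embeddings only."""
    return reduce_mean(outputs[:m])
```

All three return a single vector of the same dimension as one token's output, which is what the similarity computation needs.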

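Once the INPUT and LABEL are represented in a common vector space, we measure the similarity between them with a standard dot product, $s(ctxt, cand_i) = y_{ctxt} \cdot y_{cand_i}$, where $y_{ctxt}$ and $y_{cand_i}$ are the reduced embeddings of the INPUT and of the i-th candidate LABEL. Some other (possibly non-linear) similarity function could be used instead.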
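The caching advantage of the Bi-Encoder can be illustrated with a small sketch. Here `toy_encode` is a hypothetical stand-in for the real Transformer plus reduce step (it is just a hashed bag of words), and `BiEncoderIndex` is a name invented for this example; only the overall flow (encode candidates once, then score by dot product) reflects the text above.

```python
def toy_encode(text, dim=8):
    """Hypothetical stand-in for red(T(text)): hashed bag-of-words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class BiEncoderIndex:
    def __init__(self, candidates, encode=toy_encode):
        self.candidates = list(candidates)
        self.encode = encode
        # Candidate embeddings are computed once and cached
        self.cache = [encode(c) for c in self.candidates]

    def best(self, context):
        """Encode the context once, then score every cached candidate."""
        y_ctxt = self.encode(context)
        scores = [dot(y_ctxt, y_cand) for y_cand in self.cache]
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        return self.candidates[best_idx], scores
```

At query time only the context goes through the encoder; the cost per candidate is a single dot product.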

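We then minimize the Cross Entropy loss function, where the logits are the dot products of the INPUT embedding with each of the n candidate embeddings:

$$y_{ctxt} \cdot y_{cand_1}, \; \ldots, \; y_{ctxt} \cdot y_{cand_n}$$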
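A minimal sketch of this loss, assuming the other labels in the batch are used as negatives for each input (so row i of the candidates is the correct label for row i of the contexts). The function name and the list-of-vectors representation are assumptions for the illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_cross_entropy(ctxt_embs, cand_embs):
    """Mean cross-entropy where cand_embs[i] is the correct label for
    ctxt_embs[i] and every other row of cand_embs is a negative."""
    losses = []
    for i, y_ctxt in enumerate(ctxt_embs):
        logits = [dot(y_ctxt, y_cand) for y_cand in cand_embs]
        # Numerically stable log-sum-exp of the logits
        top = max(logits)
        log_z = top + math.log(sum(math.exp(z - top) for z in logits))
        # Negative log-probability assigned to the correct candidate
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

The loss shrinks as each context embedding moves closer to its own label and further from the other labels in the batch.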

Cross-Encoder:

Here, the INPUT and the LABEL are concatenated and full Self-Attention is performed over the entire sequence of input and label. That is, every token of the input attends to every token of the label and vice versa. This gives rise to rich interactions between the input and label.
Even here, both the INPUT and LABEL are surrounded by a special token [S].
Again, the embedding fed to the Encoder is the sum of Token Embeddings + Segment Embeddings + Position Embeddings. Since the INPUT and LABEL are combined, the Segment Embedding is ‘0’ for an INPUT token and ‘1’ for a LABEL token.
Cross-Encoders give higher accuracy than Bi-Encoders, because of the full bi-directional attention between the input and the label. At the same time, they are extremely slow at inference time, because each candidate label has to be concatenated with the input context and cannot be encoded separately as in the Bi-Encoder. Therefore candidate embeddings cannot be pre-computed and cached. When the number of candidate labels is huge (as it is in most real scenarios), Cross-Encoders do not scale.
After Self-Attention, the Transformer gives the encoder representations for all the input tokens. We reduce this to a single representation by taking the embedding corresponding to the first token (i.e. the special token [S]). This embedding vector is then converted to a scalar score by a linear projection W. These two steps are:

$$y_{ctxt,cand} = h_1 = first(T(ctxt, cand))$$

$$s(ctxt, cand_i) = y_{ctxt,cand_i} W$$
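The two steps, and the reason candidates cannot be cached, can be sketched as below. `toy_joint_encode` is a hypothetical stand-in for the real Transformer over the concatenated sequence (its first vector simply holds a word-overlap count and the candidate length), and the weights `W` are made up for the example.

```python
def toy_joint_encode(context, candidate):
    """Hypothetical stand-in for T(ctxt, cand): one vector per position,
    with the first one playing the role of h_1 for the [S] token."""
    ctx, cand = context.lower().split(), candidate.lower().split()
    overlap = float(len(set(ctx) & set(cand)))
    first = [overlap, float(len(cand))]
    rest = [[0.0, 0.0] for _ in ctx + cand]
    return [first] + rest

def cross_encoder_score(joint_encode, W, context, candidate):
    outputs = joint_encode(context, candidate)
    y = outputs[0]                                # first(T(ctxt, cand))
    return sum(h * w for h, w in zip(y, W))       # linear projection

def rank_candidates(joint_encode, W, context, candidates):
    # One full encoder pass per (context, candidate) pair: nothing
    # about a candidate can be pre-computed independently of the context
    scores = [cross_encoder_score(joint_encode, W, context, c)
              for c in candidates]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx], scores
```

With n candidates, every new context costs n encoder passes, which is what makes this architecture slow when the candidate set is large.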

The training objective here too is to minimize the Cross-Entropy loss function, given by the logits:

$$s(ctxt, cand_1), \; \ldots, \; s(ctxt, cand_n)$$

where ‘cand1’ is the correct candidate and the others are negatives taken from the training set. One difference from the Bi-Encoder is that there we could use the other labels in the batch as negative training samples; here we cannot, so we use external negatives provided in the training set. Because the Cross-Encoder is computationally heavy, its in-memory batch size is also very small.

Poly-Encoder:

The Poly-Encoder takes the best qualities of the Bi- and Cross-Encoders. It is faster at inference time than the Cross-Encoder and has better accuracy than the Bi-Encoder.
The Candidate Label is encoded separately.
Given an input context of N tokens:

$$ctxt = (x_1, \ldots, x_N)$$

we perform 3 types of Attention, as explained below:

First, Self-Attention over the Input Context’s tokens, which gives one output per token:

$$T(ctxt) = (h_1, \ldots, h_N)$$

Second, we learn ‘m’ codes (or queries in the parlance of Self-Attention), where m < N (N being the length of the INPUT). The number of codes to be learnt, ‘m’, is a hyperparameter. Each code $c_i$ attends over all the outputs of the previous Self-Attention. The ‘m’ codes are randomly initialized.
We first get the Attention weights (w’s) by performing a dot-product attention (or a multiplicative attention in general) between the ‘m’ codes, which serve as the “Queries”, and the previous Self-Attention outputs (Out’s, i.e. $h_1, \ldots, h_N$), which serve as the “Keys”. We then use these attention weights to get a weighted sum of the previous Self-Attention outputs (Out’s), which serve as the “Values”:

$$(w_1^{c_i}, \ldots, w_N^{c_i}) = softmax(c_i \cdot h_1, \ldots, c_i \cdot h_N)$$

$$y^i_{ctxt} = \sum_j w_j^{c_i} h_j$$
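This second attention step can be sketched in plain Python as follows. The function names are made up for the illustration, the codes are drawn from a standard normal distribution as one possible random initialization, and in a real model they would be learnt during training.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention: weights from query . keys, then a
    weighted sum of the values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(dim)]

def init_codes(m, dim, seed=0):
    """Randomly initialize m codes (assumed Gaussian for this sketch)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]

def global_features(codes, outputs):
    """Each of the m codes attends over the N self-attention outputs,
    which serve as both keys and values."""
    return [attend(c, outputs, outputs) for c in codes]
```

Whatever the length N of the input, the result is always m vectors, which is what keeps the later comparison with each candidate cheap.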

Think about why we are using this kind of Attention mechanism here. In a Bi-Encoder, the candidate label does not attend over the tokens of the input context. A Cross-Encoder, at the other extreme, makes the candidate label attend over every token of the input context. The Poly-Encoder finds a middle ground, by making the candidate label embedding attend not over the entire input context, but over a small set of global features learnt from the input context.

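The third kind of attention (alluded to in the previous paragraph) is between the ‘m’ global features of the Input Context, $y^1_{ctxt}, \ldots, y^m_{ctxt}$, and the embedding of the Candidate Label, $y_{cand_i}$, which serves as the Query:

$$(w_1, \ldots, w_m) = softmax(y_{cand_i} \cdot y^1_{ctxt}, \ldots, y_{cand_i} \cdot y^m_{ctxt})$$

$$y_{ctxt} = \sum_k w_k \, y^k_{ctxt}$$

Now we compute the Similarity score between the Input Context embedding and the Candidate Label embedding as:

$$s(ctxt, cand_i) = y_{ctxt} \cdot y_{cand_i}$$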
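The final step can be sketched as below, assuming the ‘m’ global features of the context and the candidate embedding are already available as plain lists of floats. The function names are invented for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def poly_encoder_score(global_feats, y_cand):
    """The candidate embedding attends over the m global context
    features; the attended context vector is then scored against
    the candidate with a dot product."""
    weights = softmax([dot(y_cand, g) for g in global_feats])
    dim = len(global_feats[0])
    y_ctxt = [sum(w * g[d] for w, g in zip(weights, global_feats))
              for d in range(dim)]
    return dot(y_ctxt, y_cand)

def best_candidate(global_feats, cand_embs):
    """Score every (cacheable) candidate embedding for one context."""
    scores = [poly_encoder_score(global_feats, y) for y in cand_embs]
    return max(range(len(scores)), key=scores.__getitem__), scores
```

The candidate embeddings can still be pre-computed and cached as in the Bi-Encoder; only this light attention over m vectors is done per candidate at inference time. With m = 1 the score reduces to a plain dot product.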

Once again, the training objective here too is to minimize the Cross-Entropy loss function given by the logits as before.

We saw three different Encoder architectures for the task of “Multi-Sentence Scoring” and saw how the Poly-Encoders were better. In the next part, we will see how the Poly-Encoders are used in the Blender and also about the different Model Architectures and training objectives. We will also touch upon the Evaluation methods used to compare the performance of Blender with that of the other Chatbots.

Note: All the notations, formulae and the Encoder block diagrams above are the same as used in the original paper mentioned in Ref. [1].

References:

[1] Poly-Encoder Transformer: https://arxiv.org/abs/1905.01969
[2] BERT: https://arxiv.org/abs/1810.04805
[3] Transformers: https://arxiv.org/abs/1706.03762
