Blender Bot — Part 3: The Many Architectures


We have been looking into Facebook’s open-sourced conversational offering, Blender Bot.

In Part-1, we went over in detail the datasets used in its pre-training and fine-tuning, as well as the failure cases and limitations of Blender.

And in Part-2, we studied the more generic problem setting of “Multi-Sentence Scoring”, the Transformer architectures used for such a task, and learnt about Poly-Encoders in particular, which are used to provide the encoder representations in Blender.

In this 3rd and final part, we return from our respite with Poly-Encoders, back to Blender. We shall go over the different Model Architectures, their respective training objectives, the Evaluation methods and performance of Blender in comparison to Meena.

Model Architectures:

The paper (Ref.[2]) discusses several model variations that differ along multiple factors (we’ll discuss these variations in more detail later). But at a high level, there are 3 different model architectures discussed.

  1. Retriever
  2. Generator
  3. Retrieve & Refine

1. Retriever:

Given a dialogue history (context) as input, retrieval systems select the next dialogue utterance by scoring a large set of candidate responses and outputting the highest-scoring one. This is the exact same setting that we saw in the Multi-Sentence Scoring task using a Poly-Encoder, in Part-2 of this series. Two variations of the model are developed: with 256M and 622M parameters. The training objective here is to effectively rank the candidate responses. This is done by minimizing a Cross-Entropy loss whose logits are the candidate scores s_1, …, s_n, with s_1 being the score of the correct (“gold”) response and the rest being sampled negatives.

Each ‘s’ is a score between the context embedding and one of the candidate responses’ embeddings. The score can be a standard dot-product similarity between the context and candidate label encoder representations (when projected onto a common vector space), or more generally any non-linear function of the two. In the dot-product case, s(ctxt, cand) = ctxt_emb · cand_emb.

The encoder representations of the context and candidate response are obtained using a Poly-Encoder — that undergoes 3 types of Attention mechanisms: 1) Self-Attention among the token embeddings of the Input context, 2) Learn ‘m’ codes by performing Self-Attention between the codes and the outputs of the previous Self-Attention, 3) Self-Attention between candidate embedding and ‘m’ global learned features. (Read Part-2 for an in-depth explanation).
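To make the ranking objective concrete, here is a minimal PyTorch sketch (an illustration, not the actual ParlAI/Blender implementation). It assumes the Poly-Encoder has already produced the context and candidate embeddings (random tensors stand in for them here), scores them with a dot product, and minimizes cross-entropy over the candidates, treating the other items in the batch as negatives:

```python
import torch
import torch.nn.functional as F

# Toy setup: a batch of 4 contexts and their 4 gold candidates, 256-dim embeddings.
# In the real model these come from the Poly-Encoder and the candidate encoder.
batch_size, dim = 4, 256
ctxt_emb = torch.randn(batch_size, dim)   # context representations
cand_emb = torch.randn(batch_size, dim)   # candidate representations

# Dot-product score s(ctxt_i, cand_j) for every pair in the batch.
# Row i holds the logits for context i; the diagonal entry is the gold response.
scores = ctxt_emb @ cand_emb.t()          # shape: (batch_size, batch_size)

# Cross-entropy over the candidate scores, where context i's target is candidate i.
targets = torch.arange(batch_size)
loss = F.cross_entropy(scores, targets)
print(loss.item())
```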

2. Generator:

Here we use a standard Seq2Seq Transformer (Decoder) architecture to generate responses rather than retrieve them from a fixed set of candidates. Three variations of the model are developed with: 90M, 2.7B and 9.4B parameters.

Maximum Likelihood Estimation:

The training objective here is Maximum Likelihood Estimation — that is to minimize the negative log likelihood.

The Likelihood models the overall sequence probability distribution. The objective as given in the paper (Ref.[2]):

L_MLE(p_θ, x, y) = − Σ_t log p_θ(y_t | x, y_{<t})      (equation from Ref.[2])

where the likelihood is the probability of generating the token y_t at time step ‘t’, given the input context ‘x’ and the tokens that were generated up to time step ‘t’ (y_{<t}). Here ‘y’ refers to the ground-truth next utterance provided by humans, given the context.
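As a rough illustration (not the paper’s code), the MLE term can be computed from the decoder’s per-step logits under teacher forcing; the dimensions and tensors below are toy stand-ins:

```python
import torch
import torch.nn.functional as F

# Toy dimensions: a gold response of T tokens over a vocabulary of size V.
T, V = 5, 1000
logits = torch.randn(T, V)           # decoder outputs, one row per time step t
y = torch.randint(0, V, (T,))        # gold tokens y_1 ... y_T

# L_MLE = - sum_t log p(y_t | x, y_<t); cross_entropy applies the log-softmax.
nll = F.cross_entropy(logits, y, reduction="sum")
print(nll.item())
```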

Un-Likelihood:

In addition to maximizing the likelihood of getting the ground-truth token ‘y’ at time step ‘t’, here, we also try to minimize the probability of getting certain negative candidate tokens at that time step. This is called the Un-Likelihood objective and it helps decrease the occurrences of tokens (or n-grams) that are repeated. The Un-likelihood objective as given in the paper (Ref.[2]):

L_UL(p_θ, C, x, y) = − Σ_t Σ_{y_c ∈ C_t} log(1 − p_θ(y_c | x, y_{<t}))      (equation from Ref.[2])

But how can we get the set of negative candidates at every time step ‘t’? One option is to maintain a static list of negative candidates (frequent n-grams generated by the model) and use the same list at every time step. The more acceptable solution is to keep a running n-gram distribution of the tokens generated by the model. Whenever an n-gram’s count becomes greater than the corresponding n-gram count observed in the gold responses (that is, the n-grams used by humans), we add that n-gram to the set of negative candidates maintained at that time step.
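Here is a minimal sketch of that bookkeeping, assuming whitespace-tokenised responses and a fixed n; the function and variable names are illustrative, not from the paper’s code:

```python
from collections import Counter

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Running n-gram counts over the model's generations and over the gold responses.
model_counts, gold_counts = Counter(), Counter()

def overused_ngrams(model_tokens, gold_tokens, n=3):
    """Update the running counts and return the n-grams the model now uses more
    often than they appear in the gold responses; tokens completing one of these
    n-grams become negative candidates at the corresponding time steps."""
    model_counts.update(ngrams(model_tokens, n))
    gold_counts.update(ngrams(gold_tokens, n))
    return {g for g, c in model_counts.items() if c > gold_counts[g]}

# Toy usage
print(overused_ngrams("i am good i am good to go".split(),
                      "i am doing well thanks".split()))
```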

Decoding: Beam Search:

At the time of inferencing, the model has to select the best next response from among the available hypotheses, given an input context. This is called “Decoding”. The output (next response) is generated as a probability distribution over all the tokens of the vocabulary. Instead of taking the highest probability token at each time step (which is nothing but the “Greedy Search”), we can try to get the partial sentence that maximizes the joint probability up to that instant.

  • In “Beam Search”, at each time step ‘t’, keep in memory a list of the top k partially formed hypotheses, i.e. those whose joint probability is the highest up to that time step.
  • At the next time step, append each token in the vocabulary to each of the top k hypotheses.
  • Compute the new joint probability of each extended hypothesis.
  • The score can also be normalized by the length of the sequence formed so far.
  • Rank the hypotheses based on the joint probability score and select the top k from among the new set of hypotheses. The remaining hypotheses are discarded.
  • In order to enforce a stopping condition, the EOS (End Of Sentence) token is also included in the vocabulary. So the procedure can be stopped once a hypothesis generates the EOS token, or when a maximum number of tokens is reached.
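Here is a minimal, toy beam search sketch to make the procedure concrete. It omits length normalization, the minimum-length constraint and the n-gram blocking discussed next, and `step_log_probs` is a stand-in for the model’s next-token distribution:

```python
import math

def beam_search(step_log_probs, beam_size=3, max_len=10, eos=0):
    """Minimal beam search: `step_log_probs(prefix)` returns (token, log_prob)
    pairs for the next token given the partial hypothesis `prefix`."""
    beams = [([], 0.0)]                    # (partial hypothesis, joint log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix):
                candidates.append((prefix + [tok], score + lp))
        # Keep only the top-k hypotheses by joint log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                      # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy next-token distribution: the same three options at every step (0 is EOS).
toy = lambda prefix: [(1, math.log(0.5)), (2, math.log(0.3)), (0, math.log(0.2))]
print(beam_search(toy, beam_size=2, max_len=5))
```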

However, the beam search heuristic has traditionally resulted in responses that are shorter than actual human responses, which therefore tend to be dull and less engaging. So we introduce a minimum length constraint, such that the EOS token will not be generated until the partial hypotheses satisfy a minimum length. This forces the model to generate long responses. Even though longer responses are considered more engaging (during human evaluation), they are also prone to more errors. So the minimum response length is a trade-off.

Another improvement that can be made to Beam Search is n-gram beam blocking: if a hypothesis in a beam contains more than one occurrence of an n-gram, that hypothesis is discarded. This is done to avoid repetition of sub-sequences (n-gram sequences).
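A hypothesis can be checked for such a repeated n-gram with a small helper (an illustrative sketch, not the actual decoding code):

```python
def has_repeated_ngram(tokens, n=3):
    """Return True if any n-gram occurs more than once in the hypothesis,
    in which case n-gram beam blocking would discard (or block) it."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False

print(has_repeated_ngram("do you like music ? do you like music ?".split()))  # True
```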

3. Retrieve & Refine:

The Retriever model gets the next response from a limited set of candidate responses and uses only the Input Context as its knowledge. The Generator model is not limited to a fixed set of candidate responses, but still uses no additional knowledge other than the context to generate the next response. In the third alternative, external knowledge is incorporated into the model, giving rise to Retrieval before Generation.

Dialog Retrieval:

Retrieve & Refine: Dialog Retrieval System

For a given Input Context, the Retrieval system (Poly-Encoder) gets the most probable next response from a fixed set of candidate responses. This is marked as “Retrieved Next Response” in the animation above. In the “Retriever” model, we stop at this point. But here, the retrieved response is appended to the Input Context via a separator, and this combined sequence is fed as input to the Generator (Decoder block), which then generates a response for it. The purpose of doing this is to improve the quality of the responses the generator can produce. Remember that the candidate labels are human-generated responses. Even though the “Retrieved Next Response” need not be the same as the “gold response” for a given input context, we can assume that a good retriever will pick a candidate that closely aligns with the gold response, and human responses are generally considered more engaging than decoder-generated ones. The aim is for the Decoder to learn when to simply use the “Retrieved Next Response” directly, without generating anything on its own, and when to ignore the retrieved response and generate one based only on the context. If the Decoder learns this association, it will be able to generate more human-like responses.

Training Objective: alpha-blending: The ideal behaviour for the Decoder would be to simply use the retrieved response when it’s good and ignore it when it’s not. But in reality, more often than not, the Decoder chooses to ignore the retrieved next response and generate on its own. This is attributed to the issue alluded to in the previous paragraph: the lack of understanding of the relationship between the retrieved response and the gold (human) response given the input context. To mitigate this, we do “alpha-blending”, where the retrieved next response is replaced with the gold response alpha% of the time. This is nothing but the more generic idea of “Teacher Forcing”.
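A minimal sketch of how the generator’s training input could be assembled under alpha-blending; the `[SEP]` separator, the function name and the alpha value are illustrative, not the exact choices from the paper:

```python
import random

SEP = " [SEP] "   # illustrative separator token

def build_generator_input(context, retrieved_response, gold_response, alpha=0.5):
    """Retrieve & Refine training input: append a response to the context via a
    separator. With probability alpha the gold (human) response replaces the
    retrieved one, so the decoder learns that the appended text is often worth
    using directly."""
    appended = gold_response if random.random() < alpha else retrieved_response
    return context + SEP + appended

print(build_generator_input("Hi! What do you do for fun?",
                            "I love hiking on weekends.",
                            "I mostly play guitar and hike."))
```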

Knowledge Retrieval:

Retrieve & Refine: Knowledge Retrieval System

In this variant, an external knowledge base is used. An Information Retrieval (IR) System is built to store and retrieve the Wikipedia dump. A little aside on how the IR system works:

  • The documents (Wikipedia articles in this case) are parsed and an inverted index is built that is of the form — {term: list of all documents in which the term appears}.
  • The inverted index can be searched by the “Query” and the output is the list of documents that contain any of the query terms.
  • Then we rank the returned documents by the similarity between the query and each document, both represented in a common vector space whose coordinates are the TF-IDF scores of the terms in the query/document (see the sketch after this list).
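Here is a small sketch of that ranking step using scikit-learn’s TF-IDF vectorizer over a toy document set; it skips the inverted-index lookup and simply ranks all documents by cosine similarity to the query:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# A toy "Wikipedia dump" of three documents.
docs = [
    "The guitar is a fretted musical instrument with six strings.",
    "Hiking is a long, vigorous walk, usually on trails in the countryside.",
    "Blender is an open-source 3D computer graphics software tool.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # documents in TF-IDF space

def retrieve(query, top_k=2):
    """Rank documents by cosine similarity between the query and each document,
    both represented by their TF-IDF term weights."""
    q = vectorizer.transform([query])
    sims = linear_kernel(q, doc_vectors).ravel()
    return sorted(zip(sims, docs), reverse=True)[:top_k]

print(retrieve("I play guitar and go hiking"))
```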

Back to our Knowledge Retrieval System: the Input Context is used as the “Query” into the IR System in order to retrieve the most suitable Knowledge Candidates (i.e. Wiki articles that are relevant to the context). These candidates and the context are fed to the Retrieval system (Poly-Encoder) to get the best candidate, that is, the best piece of external knowledge on which the next dialog response is going to be based. This knowledge candidate is then given as input to the Generator model, which generates the next response conditioned on the knowledge sentence.

Evaluation Methods:

Automatic Evaluation:

Retrieval Models: The Retrieval models are fine-tuned on the crowd-sourced clean datasets that we talked about in Part-1, namely ConvAI2, ED, Wizard of Wikipedia and BST. The evaluation metric reported is Hits@1/K (which is similar in principle to the Top@N classification metric), on the corresponding dataset’s validation data.

[Table: Hits@1/K of the retrieval models on the ConvAI2, ED, Wizard of Wikipedia and BST validation sets (screenshot from the paper in Ref.[2])]
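As a rough illustration of the metric, Hits@1/K simply counts how often the gold candidate is ranked first among the K scored candidates:

```python
def hits_at_1(scores_per_example):
    """Hits@1/K: fraction of examples where the gold candidate (index 0 here,
    by convention) receives the highest score among the K candidates."""
    hits = sum(1 for scores in scores_per_example if scores.index(max(scores)) == 0)
    return hits / len(scores_per_example)

# Two toy examples with K=3 candidates each; the gold candidate sits at index 0.
print(hits_at_1([[0.9, 0.2, 0.4], [0.3, 0.8, 0.1]]))  # 0.5
```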

Generator Models: Here we measure the perplexity of the underlying language model. Perplexity measures the uncertainty of a language model: the lower the perplexity, the more confident the model is in generating the next token (character, subword, or word). Conceptually, perplexity represents the number of choices the model is choosing from when producing the next token. From the graph below, we see that the larger model achieves better performance in fewer steps.

[Figure: perplexity of the generator models vs. training steps (screenshot from the paper in Ref.[2])]
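As a quick illustration, perplexity can be computed from the log-probabilities the model assigns to the gold tokens (the exponent of the average negative log-likelihood per token):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token). A value of p
    roughly means the model is choosing among p equally likely tokens per step."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Log-probabilities the model assigned to the gold tokens of one response.
print(perplexity([math.log(0.2), math.log(0.1), math.log(0.25)]))
```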

Human Evaluation:

This kind of evaluation lets us compare different versions of Blender with one another, as well as compare Blender against other chatbots in the field, since these chat logs are available in the public domain for analysis. Many different versions of Blender have been developed, varying in factors such as:

  • Minimum beam length while using beam search in the generator models
  • Whether or not n-gram beam blocking is done
  • Smaller vs larger models (in terms of number of parameters learnt)
  • Whether a persona context was given or no persona context was given (during fine-tuning)
  • Using likelihood vs. combination of likelihood and unlikelihood
  • etc.

Human evaluations were done for all kinds of variants and the detailed results are given in the paper (Ref.[2]), where you can go over them.

ACUTE Eval:

In this method, human evaluators are given pairs of complete conversations between a human and a chatbot. These conversations are generated by the 2 models/systems that are up for comparison. And the job of the evaluator is to compare the two dialogues and answer the following questions, as given in Ref.[2]:

  1. Engagingness question: “Who would you prefer to talk to for a long conversation?”
  2. Humanness question: “Which speaker sounds more human?”

Self-Chat ACUTE Eval:

This is the same type of evaluation as above, except that the conversations evaluated are between 2 models, instead of a human and a model.

A sample conversation pair as presented for a human evaluator is given below:

[Figure: a sample conversation pair as presented to a human evaluator (screenshot from the paper in Ref.[2])]

Comparisons:

Finally, I’ll leave you with the human evaluation scores for Blender (and its variants) and Meena; as well as Blender and Humans — as reported in Ref.[2].

Blender v. Meena:

