The Transformer: A Quick Run Through
Explore the best of natural language modeling enabled by the Transformer. Understand its architecture and internal workings.
May 2 · 7 min read
This is Part 3 of the 5-part series on language modeling.
Introduction
In the previous post, we looked at how ELMo and ULMFiT boosted the prominence of language-model pre-training in the community. This post assumes that you have read the previous two parts of this series and builds on that knowledge.
The Transformer is widely seen as the model that finally removed the limitations of training sequence models with recurrent neural networks. The encoder-decoder stacking idea that had gained traction in language modeling and machine translation proved valuable in building this architecture. The Transformer is a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It has been shown to generalize well to other language understanding and modeling tasks, with both large and limited training data. It also achieved state-of-the-art results on the English-to-German translation task and anchored itself as the go-to architecture for later advances in model pre-training in NLP.
Encoder-Decoder Architecture
In this model, multiple encoders are stacked on top of each other, and the decoders are stacked similarly. In pre-Transformer architectures, each encoder/decoder comprises recurrent connections or convolutions, and the hidden representation from each encoder stage is passed ahead for use by the next layer. Most seq2seq tasks can be solved with such a stack of encoders and decoders, which processes each word of the input sequence in order.
Attention Mechanism
The attention mechanism has become an integral part of sequence modeling and transduction models in various tasks, allowing dependencies to be modeled without regard to their distance in the input or output sequences. To put it simply, attention helps us tackle long-range dependency issues in neural networks without recurrent neural networks (RNNs). It serves the same purpose as the hidden state shared across all time steps in an RNN, through an encoder-decoder based architecture. The attention model focuses on the relevant part of the input text sequence or image for the task being solved.
In a regular RNN-based seq2seq model, context is passed as the final hidden state produced by the encoder, which the decoder uses to produce the next token of the translation.
Steps involved in generating the Context Vector:
- Initialize the context vector with random values, sized as per the task (e.g. 128, 256, 512)
- Process one token from the input sequence through the encoder
- Use the hidden-state representation in the encoder to update the context vector
- Repeat steps 2 and 3 until the entire input sequence has been processed
Once the context vector has been fully updated, it is passed to the decoder as an additional input alongside the word/token being translated. The context vector is a useful abstraction, but it acts as a bottleneck for representing the entire meaning of the input sequence.
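The looped updates above can be sketched in NumPy. The weight matrices, sizes, and the tanh update rule here are illustrative assumptions, not the exact recipe of any particular model:

```python
import numpy as np

def encode_sequence(token_embeddings, hidden_size=128, seed=0):
    """Toy RNN-style encoder: one pass over the tokens, repeatedly
    updating a single context vector (illustrative shapes/weights)."""
    rng = np.random.default_rng(seed)
    # Stand-ins for trained weights; random here for illustration only.
    W_in = rng.standard_normal((hidden_size, hidden_size)) * 0.01
    W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.01
    context = rng.standard_normal(hidden_size) * 0.01  # step 1: initialize
    for emb in token_embeddings:                       # step 2: one token at a time
        hidden = np.tanh(W_in @ emb + W_h @ context)
        context = hidden                               # step 3: update the context
    return context                                     # handed to the decoder
```

The final `context` is the single vector handed to the decoder, which is exactly the bottleneck described above: the whole input sequence has to squeeze through it.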
Instead of passing a single context vector to the decoder, the attention mechanism passes all the intermediate hidden states within a stack of encoders to the decoder. This enables the decoder to focus on different parts of the input sequence according to the relevance of the current word/token being processed.
Unlike previous seq2seq models, attention models perform two extra steps:
- More data is passed from the encoder to the decoder
- The decoder in an attention model uses this additional data to focus on the relevant words in the input sequence, weighting each encoder hidden state by its softmax score to form the context vector
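A minimal sketch of that focusing step, assuming simple dot-product scoring (the scoring function varies across attention variants):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Score every encoder hidden state against the current decoder
    state, softmax the scores, and mix the states into a context vector.
    encoder_states: (seq_len, hidden); decoder_state: (hidden,)"""
    scores = encoder_states @ decoder_state            # one score per input token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax: relevance weights
    context = weights @ encoder_states                 # weighted sum of hidden states
    return context, weights
```

The softmax weights make the focus soft: the most relevant token dominates the context vector, but the others still contribute.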
Peek Inside the Transformer
The Transformer's main architecture consists of 6 stacked encoders and 6 stacked decoders. This number can vary by use case, but 6 was used in the original paper.
Let us consider a single encoder and a single decoder to simplify our understanding of how they work.
Architecture
Each encoder consists of a Self-Attention layer followed by a Feed-Forward network. In earlier attention mechanisms, hidden states from previous time steps are used to compute attention. Self-attention instead uses the embeddings of the other tokens in the same layer to compute the attention vector. To elucidate, self-attention can be thought of as a mechanism for coreference resolution within a sentence:
“The man was eating his meal while he was thinking about his family”
In the above sentence, the model needs to build an understanding of what he refers to, and that it is a coreference to the man.
This is enabled by the self-attention mechanism in the Transformer. A detailed discussion on self-attention (using multiple heads) is beyond the scope of this blog and can be found in the original paper.
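For intuition, single-head scaled dot-product self-attention (the core operation that the paper's multi-head version repeats in parallel) can be sketched as follows; the projection matrices stand in for trained weights:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # each row mixes in its context
```

Every output row is a mixture of all the value vectors, which is how a token like "he" can pull in information from "the man".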
The decoder has the same two layers as the encoder, except that an additional encoder-decoder attention layer is introduced between them to help the model extract relevant features from the encoder's attention vectors.
Point-wise Feed-Forward Networks
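The network itself is the two-layer MLP from the original paper, FFN(x) = max(0, x·W1 + b1)·W2 + b2; a minimal sketch:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Two-layer feed-forward network with a ReLU in between,
    applied to each position (row of X) independently."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```

Because the same weights are applied row by row, running the whole sequence through at once is equivalent to feeding each position separately.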
It is important to notice that the self-attention layer lets the words of the input sequence share computation, while the feed-forward network is applied to each word's representation independently. The output of the feed-forward network is passed on to the next encoder in the stack, which builds on the context learned by the previous encoders.
Positional Encoding
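The sinusoidal scheme from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), can be sketched as follows (assuming an even model dimension):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sines on even indices,
    cosines on odd indices (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```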
To embed a sense of order in the input sequence, a positional encoding is added to each word embedding. This augmented word embedding is passed as input to Encoder 1. Since the model uses no recurrence or convolution, the positional encodings inject information about the relative position of each token in the input sentence.
Residual Connections with Normalization
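The add-and-normalize step, LayerNorm(x + Sublayer(x)), can be sketched as follows (the learnable gain and bias of layer normalization are omitted for brevity):

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection followed by layer normalization:
    normalize each position's vector to zero mean, unit variance."""
    y = x + sublayer_out                               # residual (skip) connection
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)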
The output of the self-attention layer is added to the original word embedding through a residual connection, and the sum is layer-normalized. The same scheme is applied around the feed-forward layer.
Fully Connected Linear with Softmax
Once the final decoder in the stack emits its output vector, it needs to be converted into the translated word. All the required information is already embedded as floats in this output vector; we just need to convert it into a probability distribution over the possible next words in the translation. The fully connected linear layer converts the float vector into scores, which the softmax function transforms into probability values. The index with the highest softmax value is chosen, and the corresponding word is retrieved from the output vocabulary learned from the training set.
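A sketch of that final projection step; the projection matrix and vocabulary here are illustrative stand-ins for the trained ones:

```python
import numpy as np

def predict_next_word(decoder_output, W_vocab, vocab):
    """Project the decoder's float vector to one score per vocabulary
    word, softmax into probabilities, and pick the highest-scoring word."""
    logits = decoder_output @ W_vocab                  # (d_model,) -> (vocab_size,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                               # softmax
    return vocab[int(np.argmax(probs))], probs
```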
Transformer Training Explained
The training is supervised, i.e. it uses a labeled training dataset that serves as a benchmark for comparing and correcting the output word probabilities.
Essentially, each word in the output vocabulary is converted into a one-hot vector: 1 at the index where the word is present and 0 everywhere else. Once we receive the softmax output vector of normalized probability values, we can compare it with the one-hot vector to improve the model parameters/weights.
These two vectors can be compared using similarity metrics such as cosine similarity, cross-entropy, or Kullback-Leibler divergence. At the beginning of training, the output probability distribution is far from the ground-truth one-hot vector. As training proceeds and the weights are optimized, the output word probabilities come to closely track the ground-truth vectors.
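With a one-hot target, cross-entropy collapses to the negative log-probability the model assigned to the correct word, which is what makes it a convenient training loss here; a minimal sketch:

```python
import numpy as np

def cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot
    target: only the target index contributes, giving -log(p[target])."""
    one_hot = np.zeros_like(probs)
    one_hot[target_index] = 1.0
    return float(-(one_hot * np.log(probs + eps)).sum())
```

As the model improves, the probability at the target index rises and the loss falls toward zero.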