The Transformer: A Quick Run Through


Explore the best of natural language modeling enabled by the Transformer, and understand its architecture and internal workings.

This is Part 3 of the 5-part series on language modeling.

The seq2seq task of machine translation solved using a Transformer-like architecture (BERT) (translate.google.com)

Introduction

In the previous post, we looked at how ELMo and ULMFiT boosted the prominence of language model pre-training in the community. This blog assumes that you have read through the previous two parts of this series and thus builds upon that knowledge.

English input being translated to German output using the Transformer model (Mandar Deshpande)

The Transformer is widely seen as the model that finally removed the limitations of training sequence models with recurrent neural networks. The encoder-decoder stacking idea that had gained traction in language modeling and machine translation proved to be a valuable lesson in the process of building this architecture. The Transformer is a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It has been shown to generalize well to other language understanding and modeling tasks, with both large and limited training data. It also achieved state-of-the-art results on the English-to-German translation task and anchored itself as the go-to architecture for subsequent advances in model pre-training in NLP.

Encoder-Decoder Architecture

The stack of 6 encoders and 6 decoders used in the Transformer (Mandar Deshpande)

In this model, multiple encoders are stacked on top of each other, and decoders are similarly stacked together. Conventionally, each encoder/decoder comprises recurrent connections or convolutions, and the hidden representation from each encoder stage is passed ahead to be used by the next layer. Most seq2seq tasks can be solved using such a stack of encoders and decoders, which processes each word in the input sequence in order.
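
As a rough illustration of this stacking idea (a hypothetical NumPy toy, not the actual layers of any particular model), each stage simply consumes the representation produced by the one before it:

```python
import numpy as np

def encoder_layer(h, W):
    """One hypothetical encoder stage: transform the hidden representation."""
    return np.tanh(h @ W)

def encoder_stack(x, weights):
    """Pass the representation through each encoder in the stack in turn."""
    h = x
    for W in weights:            # the output of one stage feeds the next
        h = encoder_layer(h, W)
    return h

# toy usage: a sequence of 5 tokens with 8-dimensional embeddings, 6 stacked encoders
x = np.random.randn(5, 8)
weights = [np.random.randn(8, 8) for _ in range(6)]
print(encoder_stack(x, weights).shape)   # (5, 8)
```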

Attention Mechanism

Attention mechanisms have become an integral part of sequence modeling and transduction models for various tasks because they allow modeling dependencies without regard to their distance in the input or output sequences. To put it in simple terms: the attention mechanism helps us tackle long-range dependency issues in neural networks without the use of recurrent neural networks (RNNs). It serves the same purpose as the hidden state shared across all time steps of an RNN, but does so within an encoder-decoder based architecture. The attention model focuses on the relevant part of the input text sequence or image for the task being solved.

In a regular RNN-based seq2seq model, context is passed as the final hidden state produced by the encoder, and the decoder uses it to produce the next token of the translation or text.

Regular seq2seq models without an attention mechanism only use the last hidden state as the context vector (Mandar Deshpande)

Steps involved in generating the Context Vector:

  1. Initialize the context vector with random values and a size chosen for the task (e.g. 128, 256, or 512)
  2. Process one token from the input sequence through the encoder
  3. Use the hidden state representation in the encoder to update the context vector
  4. Keep repeating Steps 2 and 3 until the entire input sequence is processed
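
A minimal NumPy sketch of these steps, with hypothetical weight matrices and sizes (a real encoder would use learned weights and a gated cell such as an LSTM or GRU):

```python
import numpy as np

def rnn_encoder(tokens, W_x, W_h):
    """Run each input token through the encoder and keep updating the context."""
    context = np.random.randn(W_h.shape[0]) * 0.01    # step 1: random initialization
    for x_t in tokens:                                # steps 2-4: one token at a time
        context = np.tanh(W_x @ x_t + W_h @ context)  # hidden state updates the context
    return context

# toy usage: 4 tokens with 16-dim embeddings, a 128-dim context vector
tokens = [np.random.randn(16) for _ in range(4)]
W_x, W_h = np.random.randn(128, 16), np.random.randn(128, 128)
print(rnn_encoder(tokens, W_x, W_h).shape)  # (128,)
```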

Once the context vector has been fully updated, it is passed to the decoder as an additional input alongside the word/token being translated. The context vector is a useful abstraction, except that it acts as a bottleneck: the entire meaning of the input sequence has to be squeezed into this single vector.

Instead of passing a single context vector to the decoder, the attention mechanism passes all the intermediate hidden states within a stack of encoders to the decoder. This enables the decoder to focus on different parts of the input sequence according to their relevance to the current word/token being processed.

Unlike the previous seq2seq models, attention models perform 2 extra steps:

  1. More data is passed from the encoder to the decoder
  2. The decoder in an attention model uses this additional data to focus on the relevant words from the input sequence: it scores every encoder hidden state, turns the scores into softmax weights, and uses the weighted combination of the hidden states (dominated by the highest-scoring ones) as the context vector

Attention mechanism used to create the context vector passed to the decoder (Mandar Deshpande)
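
A minimal NumPy sketch of those two steps, assuming simple dot-product scoring between the decoder state and each encoder hidden state (other scoring functions are possible):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Score every encoder hidden state, then build a weighted context vector."""
    scores = encoder_states @ decoder_state       # one score per input position
    weights = softmax(scores)                     # normalized attention weights
    return weights @ encoder_states               # weighted sum of hidden states

# toy usage: 5 encoder hidden states and one decoder state, all 64-dimensional
encoder_states = np.random.randn(5, 64)
decoder_state = np.random.randn(64)
print(attention_context(decoder_state, encoder_states).shape)  # (64,)
```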

Peek Inside the Transformer

The Transformer consists of 6 stacked encoders and 6 stacked decoders that form the main architecture of the model. This number can be varied as per the use case, but 6 was used in the original paper.

Let us consider a single encoder and decoder to simplify our understanding of how they work.

Components inside the Encoder and Decoder in the Transformer (Mandar Deshpande)

Architecture

Each encoder consists of a self-attention layer followed by a feed-forward network. In earlier attention mechanisms, hidden states from previous time steps were used to compute attention. Self-attention instead uses the learned embeddings from the same layer to compute the attention vector. To elucidate, self-attention can be thought of as a mechanism for coreference resolution within a sentence:

“The man was eating his meal while he was thinking about his family”

In the above sentence, the model needs to build an understanding of what he refers to, and that it is a coreference to the man. This is enabled by the self-attention mechanism in the Transformer. A detailed discussion on self-attention (using multiple heads) is beyond the scope of this blog and can be found in the original paper.
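
A minimal NumPy sketch of single-head scaled dot-product self-attention with hypothetical projection matrices; the Transformer itself uses multiple heads and learned projections, which are beyond the scope of this sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Every position attends to every other position in the same sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product scores
    return softmax(scores) @ V                # weighted sum of value vectors

# toy usage: 10 tokens, 64-dim embeddings
d = 64
X = np.random.randn(10, d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (10, 64)
```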

The decoder has the same two layers as the encoder, except that an additional encoder-decoder attention layer is introduced between them to help the model attend to the relevant attention vectors coming from the encoder.

A simplified stack of 2 encoders and 2 decoders, used to explore the internal architecture (Mandar Deshpande)

Position-wise Feed-Forward Networks

It is important to notice that the words in the input sequence interact with each other in the self-attention layer, but each word then flows through the feed-forward network independently, with the same feed-forward weights applied at every position. The output from the feed-forward network is passed on to the next encoder in the stack, which builds on the context learned by the previous encoders.
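
A minimal NumPy sketch of a position-wise feed-forward network with hypothetical dimensions (two linear transformations with a ReLU in between, applied identically at every position):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer network to every position independently."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2   # ReLU, then a second projection

# toy usage: 10 positions, model dimension 64, inner dimension 256
d_model, d_ff = 64, 256
X = np.random.randn(10, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)  # (10, 64)
```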

Positional Encoding

To embed a sense of word order in the input sequence, each word embedding is summed with a positional encoding. This augmented word embedding is passed as input to Encoder 1. Since the model doesn’t use any recurrence or convolution, the positional encodings inject information about the relative position of each word in the input sentence.
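
A minimal NumPy sketch of the sinusoidal positional encodings used in the original paper (alternating sines and cosines at geometrically spaced frequencies), added to the word embeddings before Encoder 1:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sine on even indices, cosine on odd indices, at different frequencies."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)      # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# toy usage: add the encoding to 10 token embeddings of size 64
embeddings = np.random.randn(10, 64)
augmented = embeddings + positional_encoding(10, 64)
print(augmented.shape)  # (10, 64)
```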

Residual Connections with Normalization

The output from the self-attention layer is added to its original input through a residual connection and then layer-normalized. The feed-forward layer follows the same scheme.
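
A minimal NumPy sketch of that add-and-normalize step, assuming a plain layer normalization without the learned gain and bias parameters used in practice:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

# toy usage: pretend `attended` came out of the self-attention layer
x = np.random.randn(10, 64)
attended = np.random.randn(10, 64)
print(add_and_norm(x, attended).shape)  # (10, 64)
```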

Fully Connected Linear with Softmax

Once an output vector is produced by the final decoder in the stack, it needs to be converted into the translated word. All the required information is already embedded as floats in this output vector; we just need to convert it into a probability distribution over the possible next words of the translation.

The fully connected linear layer converts the float vector into scores, which are transformed into probability values by the softmax function. The index with the highest probability is chosen, and the corresponding word is retrieved from the output vocabulary learned from the training set.
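
A minimal NumPy sketch with a tiny hypothetical vocabulary, just to show the linear projection, the softmax, and the argmax lookup:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_word(decoder_output, W_vocab, vocab):
    """Project the decoder output to vocabulary scores and pick the most likely word."""
    logits = decoder_output @ W_vocab     # one score per vocabulary entry
    probs = softmax(logits)
    return vocab[int(np.argmax(probs))]

# toy usage: a 64-dim decoder output and a 5-word vocabulary
vocab = ["ich", "bin", "ein", "Student", "<eos>"]
W_vocab = np.random.randn(64, len(vocab))
print(next_word(np.random.randn(64), W_vocab, vocab))
```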

Transformer Training Explained

The training is supervised, i.e. it uses a labeled training dataset that serves as a benchmark for comparing and correcting the output word probabilities.

Essentially, each word in the output vocabulary is converted into a one-hot vector that is 1 only at the index of that word and 0 everywhere else. Once we receive the softmax output vector of normalized probability values, we can compare it with the one-hot target vector to improve the model parameters/weights.

These two vectors can be compared using similarity metrics such as cosine similarity, cross-entropy, and/or Kullback-Leibler divergence. At the beginning of the training process, the output probability distribution is far from the ground-truth one-hot vector. As training proceeds and the weights get optimized, the output word probabilities closely track the ground-truth vectors.
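
A minimal NumPy sketch of the cross-entropy comparison between the model’s softmax output and the one-hot ground truth for a single position:

```python
import numpy as np

def cross_entropy(probs, one_hot_target, eps=1e-12):
    """Penalize low probability on the true word; 0 when the model is certain and correct."""
    return -np.sum(one_hot_target * np.log(probs + eps))

# toy usage: a 5-word vocabulary where the true word is at index 3
probs = np.array([0.1, 0.1, 0.1, 0.6, 0.1])     # model's softmax output
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0])    # one-hot ground truth
print(cross_entropy(probs, target))              # ≈ 0.51
```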

