5 Secrets About LSTM and GRU Everyone Else Knows


Mechanics explained with powerful visuals and a funny story

We secretly explain why Long Short Term Memory (LSTM) networks have been so effective and popular for processing sequence data at Apple, Google, Facebook, and Amazon.


Photo from Peggy Choucair on Pixabay

Secret 1 — LSTM was invented because RNNs had serious memory leaks.

Previously, we introduced recurrent neural networks (RNNs) and demonstrated how they can be used for sentiment analysis.

The issue with RNNs is long-range memory. For example, they are able to predict the next word “sky” in the sentence “the clouds are in the …”, but they fall short in predicting the missing word in the following sentence:

“She grew up in France. Now she has been in China for a few months only. She speaks fluent …”

As that gap grows, RNNs become unable to learn to connect the information. In this example, recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. In natural language text, it is entirely possible for the gap between the relevant information and the point where it is needed to be very large. This is also very common in the German language.

Image adapted from Tejas Patil on FB Engineering blog

Why do RNNs have huge problems with long sequences? By design, RNNs take two inputs at each time step: an input vector (e.g. one word from the input sentence), and a hidden state (e.g. a memory representation from previous words).

The next RNN step takes the second input vector and the first hidden state to create the output of that time step. Therefore, in order to capture semantic meaning in long sequences, we need to run the RNN over many time steps, turning the unrolled RNN into a very deep network.
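
To make this concrete, here is a minimal NumPy sketch of a single RNN step, run over a toy two-step sequence. The weights are made-up illustrative values, not trained parameters:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: mix the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: 3-dimensional word vectors, 2-dimensional hidden state.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(2, 3))    # input-to-hidden weights (illustrative, untrained)
W_h = rng.normal(size=(2, 2))    # hidden-to-hidden weights (illustrative, untrained)
b = np.zeros(2)

h = np.zeros(2)                  # initial hidden state: no memory yet
sentence = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]  # a toy 2-word "sentence"
for x in sentence:
    h = rnn_step(x, h, W_x, W_h, b)   # the same weights are reused at every time step
print(h)
```

Because the same weights are applied again and again, unrolling the loop over a long sentence behaves like a very deep network.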

Long sequences are not the only troublemakers for RNNs. Just like any very deep neural network, RNNs suffer from the vanishing and exploding gradients problem, and thus take forever to train. Many techniques have been suggested to alleviate this problem, but none of them eliminates it:

  • initializing parameters carefully,
  • using non-saturating activation functions like ReLU,
  • applying batch normalization, gradient clipping, dropout,
  • using truncated backpropagation through time.

Still, these workarounds have their limits. Besides the long training time, another problem faced by long-running RNNs is that the memory of the first inputs gradually fades away.

After a while, the RNN’s state contains virtually no trace of the first inputs. For example, if we want to perform sentiment analysis on a long review that starts with “I loved this product,” but the rest of the review lists the many things that could have made the product even better, then, the RNN will gradually forget the first positive sentiment and will completely misinterpret the review as negative.

To solve these RNN problems, various types of cells with long-term memory have been introduced in research. In practice, basic RNNs are not used anymore and most of the work is done using the so-called Long Short Term Memory (LSTM) networks. They were invented by S. Hochreiter and J. Schmidhuber.

Secret 2 — A key idea in LSTM is the (star)Gate.

Each LSTM cell governs what to remember, what to forget, and how to update the memory using gates. By doing so, the LSTM network solves the problem of exploding or vanishing gradients, as well as all the other problems mentioned previously!

The architecture of an LSTM cell is depicted in the impressive diagram below.

h is the hidden state, representing short-term memory. C is the cell state, representing long-term memory, and x is the input.

The gates perform only a few matrix transformations and sigmoid and tanh activations, yet this is enough to magically solve all the RNN problems.

We will dive into how this happens in the next sections, by looking at how the cell forgets, remembers and updates its memory.


source: lecture notes, prof. P. Protopapas, Harvard University

A funny story

Let’s explore the diagram through a funny story. Assume that you are the boss, and your employee asks for a salary increase. Will you agree? Well, this will depend, let’s say, on your state of mind.

Below, we model your mind as an LSTM cell, with no intention to offend your lightning-fast brain.

source: lecture notes, prof. P. Protopapas, Harvard University

Your long-term state C will impact your decision. On average, 70% of the time you are in a good mood, and you have 30% of your total budget left. Therefore your cell state is C = [0.7, 0.3].

Recently, things have been going really well for you: you are in a good mood with probability 100%, and you are sure (probability 100%) that you have operating budget left. This sets your hidden state to h = [1, 1].

Today, three things happened: your kids succeeded in their school exams, you got an ugly review from your own boss, but you figured out that you still have plenty of time to complete the work. So today’s input is x = [1, -1, 1].

Based on this evaluation, will you give a salary increase to your employee?

Secret 3 — LSTM forgets by using Forget Gates.

In the situation described above, your first step will probably be to figure out how the things that happened today (input x) and the things that happened recently (hidden state h) will affect your long-term view of the situation (cell state C). Forget Gates control how much of the past memory is kept.

After receiving your employee’s request for a salary increase, your forget gate will run the following calculation of f_t, whose value will ultimately affect your long-term memory.

The weights shown in the picture below are chosen arbitrarily for illustration purposes; their values are normally learned during training of the network. The result [0, 0] tells you to erase (completely forget) your long-term memory and not let it affect your decision today.


source: lecture notes, prof. P. Protopapas, Harvard University
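
In equation form, the forget gate computes f_t = sigmoid(W_f · [h, x] + b_f). Below is a minimal NumPy sketch with made-up weights (not the ones on the slide), chosen so that the story values reproduce the result [0, 0]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Story values: hidden state h (short-term memory) and today's input x.
h = np.array([1.0, 1.0])
x = np.array([1.0, -1.0, 1.0])
hx = np.concatenate([h, x])          # the gate looks at [h, x] together

# Illustrative (made-up) weights; in a real network they are learned.
W_f = -2.0 * np.ones((2, 5))
b_f = np.zeros(2)

f_t = sigmoid(W_f @ hx + b_f)        # forget gate: f_t = sigmoid(W_f . [h, x] + b_f)
print(np.round(f_t, 2))              # -> [0. 0.]  erase the long-term memory
```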

Secret 4 — LSTM remembers using Input Gates.

Next, you need to decide which information about what happened recently (hidden state h) and what happened today (input x) you want to record in your long-term view of the situation (cell state C). LSTM decides what to remember by using Input Gates.

First, you will calculate your input gate values i_t, which fall between 0 and 1 thanks to the sigmoid activation.

Next, you will scale your candidate input to values between -1 and 1 using the tanh activation.

Finally, you will combine both results by multiplying them element-wise, which gives the gated new information that will be written into your cell state.

The result [1, 1] indicates that, based on the recent and current information, you are 100% in a good mood and very likely to have operating budget. Things are looking promising for your employee.


source: lecture notes, prof. P. Protopapas, Harvard University
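
A minimal NumPy sketch of this step, with made-up weights (biases omitted) chosen so that the story values yield a gated new-information vector of approximately [1, 1]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([1.0, 1.0])             # recent state: good mood, budget available
x = np.array([1.0, -1.0, 1.0])       # today: exams passed, ugly review, plenty of time
hx = np.concatenate([h, x])

W_i = 3.0 * np.ones((2, 5))          # illustrative input-gate weights
W_C = 2.0 * np.ones((2, 5))          # illustrative candidate weights

i_t = sigmoid(W_i @ hx)              # input gate: how much to let in (between 0 and 1)
C_tilde = np.tanh(W_C @ hx)          # candidate values, scaled between -1 and 1
new_info = i_t * C_tilde             # gated new information to write into the cell state
print(np.round(new_info, 2))         # -> [1. 1.]
```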

Secret 5 — LSTM keeps long-term memory using Cell State.

Now you know how the things that happened recently affect your state. Next, it is time to update your long-term view of the situation based on this new information.

When new values come in, the LSTM decides how to update its memory, again by using gates. The gated new values are added to the (forget-gated) current memory. This additive operation is what solves the exploding or vanishing gradients problem of simple RNNs.

Instead of repeatedly multiplying the state, the LSTM adds things to compute the new state. The result C_t is stored as the new long-term view of the situation (cell state).

The values [1, 1] suggest that you are in a good mood 100% of the time and 100% likely to have budget available! You are the perfect boss!


source: lecture notes, prof. P. Protopapas, Harvard University
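
In equation form, the update is C_t = f_t * C_{t-1} + i_t * C_tilde (element-wise). Plugging in the story values gives the new long-term memory:

```python
import numpy as np

f_t = np.array([0.0, 0.0])        # forget gate output from Secret 3 (erase everything)
C_prev = np.array([0.7, 0.3])     # old cell state: long-term view of mood and budget
new_info = np.array([1.0, 1.0])   # gated new information from Secret 4 (i_t * C_tilde)

# Additive update: old memory is scaled by the forget gate, new info is added on top.
C_t = f_t * C_prev + new_info
print(C_t)                        # -> [1. 1.]  the new long-term view of the situation
```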

Based on this information, you can update your short-term view of the situation h_t (the next hidden state). The values [0.9, 0.9] indicate that there is a 90% likelihood that you will increase your employee’s salary at the next time step! Congratulations to him!


source: lecture notes, prof. P. Protopapas, Harvard University
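
For completeness, the short-term view comes from an output gate: h_t = o_t * tanh(C_t). The exact numbers on the slide depend on its own weight choices; the sketch below only illustrates the mechanics with an assumed output gate value close to 1:

```python
import numpy as np

C_t = np.array([1.0, 1.0])       # new cell state from Secret 5
o_t = np.array([0.95, 0.95])     # assumed output-gate values (normally sigmoid of W_o . [h, x])

h_t = o_t * np.tanh(C_t)         # new hidden state: the short-term view passed to the next step
print(np.round(h_t, 2))          # -> [0.72 0.72], a strong "yes" signal, like the slide's [0.9, 0.9]
```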

Gated Recurrent Unit

A variant of the LSTM cell is called the Gated Recurrent Unit, or GRU. It was proposed by Kyunghyun Cho et al. in a 2014 paper.

The GRU is a simplified version of the LSTM cell; it can be a bit faster to train than the LSTM and seems to perform similarly, which explains its growing popularity.


source: lecture notes, prof. P. Protopapas, Harvard University

As shown above, both state vectors are merged into a single vector. A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first.

There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.
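
In equations, the GRU uses an update gate z_t (the single gate controller described above), a reset gate r_t, and a candidate state. A minimal NumPy sketch with random illustrative weights (biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step on the concatenated vector [h_prev, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                      # update gate: how much to overwrite
    r = sigmoid(W_r @ hx)                      # reset gate: how much past state to expose
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde    # one gate both "forgets" old and "writes" new

rng = np.random.default_rng(1)
h = np.zeros(2)                                # previous state
x = np.array([1.0, -1.0, 1.0])                 # current input
W_z, W_r, W_h = (rng.normal(size=(2, 5)) for _ in range(3))  # illustrative weights
print(gru_step(x, h, W_z, W_r, W_h))
```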

Stacking LSTM cells

By chaining multiple LSTM cells, we can process sequence data, for example the 4-word sentence in the picture below.


source: lecture notes, prof. P. Protopapas, Harvard University

LSTM units are typically arranged in layers, so that the output of each unit is the input to the units of the next layer. In the example, we have 2 layers, each with 4 cells. In this way, the network becomes richer and captures more dependencies.
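
As an illustration in code, here is a minimal PyTorch sketch of a 2-layer stacked LSTM over a 4-step sequence (the framework choice and layer sizes are assumptions for the example):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: the output sequence of layer 1 is the input of layer 2.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(1, 4, 8)       # 1 sentence, 4 time steps (words), 8 features per word
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)           # torch.Size([1, 4, 16])  top-layer hidden state at every step
print(h_n.shape)               # torch.Size([2, 1, 16])  final hidden state of each of the 2 layers
```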

Bidirectional LSTM

RNNs, LSTMs and GRUs are designed to analyze sequences of values. Sometimes it also makes sense to analyze the sequence in reverse order.

For example, in the sentence “he needs to work harder, the boss said about the employee”, although “he” appears at the very beginning, it refers to the employee, mentioned at the very end.

Therefore, either the order has to be reversed, or a forward pass and a backward pass have to be combined. This bidirectional architecture is depicted in the figure below.


source: lecture notes, prof. P. Protopapas, Harvard University

The following diagram further illustrates bidirectional LSTMs. The network at the bottom receives the sequence in the original order, while the network at the top receives the same input in reverse order. The two networks are not necessarily identical. What matters is that their outputs are combined for the final prediction.


source: lecture notes, prof. P. Protopapas, Harvard University
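
A minimal PyTorch sketch of this idea (again with made-up sizes): setting bidirectional=True runs a forward and a backward LSTM over the same sequence and concatenates their outputs:

```python
import torch
import torch.nn as nn

# One bidirectional LSTM layer: a forward and a backward pass over the same sequence.
bilstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(1, 4, 8)       # 1 sentence, 4 words, 8 features per word
outputs, _ = bilstm(x)
print(outputs.shape)           # torch.Size([1, 4, 32])  forward and backward states concatenated
```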

Asking for more secrets?

As we have just disclosed, an LSTM cell can learn to recognize an important input (that’s the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that’s the role of the forget gate), and learn to extract it whenever it is needed.

LSTMs have transformed machine learning and are now available to billions of users through the world’s most valuable public companies like Google, Amazon and Facebook.

LSTMs greatly improved speech recognition on over 4 billion Android phones (since mid 2015).

LSTMs greatly improved machine translation through Google Translate since Nov 2016.

Facebook performed over 4 billion LSTM-based translations per day.

Siri was LSTM-based on almost 2 billion iPhones since 2016.

The answers of Amazon’s Alexa were based on LSTMs.

Further Reading

If you wish to know even more about LSTMs and GRUs, check this article with amazing animations by Michael Nguyen. For those who prefer to build their own LSTM from scratch, this article might work.

Practical implementations of LSTM networks in Python are available in my article below.

Attention-based sequence-to-sequence models and Transformers go beyond LSTMs and have amazed folks recently with their impressive results in machine translation at Google and text generation at OpenAI. You might want to check this blog or my article below to learn more.

A comprehensive implementation of text classification using BERT, FastText, TextCNN, Transformer, Seq2seq, etc. can be found in this GitHub repository, or you can check my tutorial about BERT.

Thanks to Anne Bonner from Towards Data Science for editorial notes.

