5 Secrets About LSTM and GRU Everyone Else Knows
Mechanics explained with powerful visuals and a funny story
Feb 27 · 9 min read
We explain why Long Short Term Memory (LSTM) networks have been so effective and popular for processing sequence data at companies like Apple, Google, Facebook, and Amazon.
Secret 1 — LSTM was invented because RNNs had serious memory leaks.
Previously, we introduced recurrent neural networks (RNNs) and demonstrated how they can be used for sentiment analysis.
The issue with RNNs is long-range memory. For example, they are able to predict the next word "sky" in the sentence "the clouds are in the …", but they fall short when predicting the missing word in the following sentence:
“She grew up in France. Now she has been in China for a few months only. She speaks fluent …”
As that gap grows, RNNs become unable to learn to connect the information. In this example, the recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France from further back. In natural language text, it is entirely possible for the gap between the relevant information and the point where it is needed to be very large. Such long gaps are also very common in German, where key verbs often come at the end of the sentence.
Why do RNNs have huge problems with long sequences? By design, RNNs take two inputs at each time step: an input vector (e.g. one word from the input sentence), and a hidden state (e.g. a memory representation from previous words).
The next RNN step takes the second input vector and the first hidden state to create the output of that time step. Therefore, in order to capture semantic meaning in long sequences, we need to run RNNs over many time steps, turning the unrolled RNN into a very deep network.
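To make this mechanism concrete, here is a minimal NumPy sketch of a single RNN time step; the weight matrices and the toy sequence are arbitrary placeholders, not trained values:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN time step: mix the current input with the
    previous hidden state and squash the result with tanh."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: 3-dimensional inputs, 2-dimensional hidden state.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(2, 3)), rng.normal(size=(2, 2)), np.zeros(2)

h = np.zeros(2)                      # initial hidden state
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)
```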
Long sequences are not the only troublemakers for RNNs. Just like any very deep neural network, RNNs suffer from the vanishing and exploding gradients problem, and thus take forever to train. Many techniques have been suggested to alleviate this problem, but none of them eliminate it:
- initializing parameters carefully,
- using non-saturating activation functions like ReLU,
- applying batch normalization, gradient clipping (sketched just after this list), dropout,
- using truncated backpropagation through time.
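As an illustration of one of these workarounds, here is a minimal gradient-clipping sketch in PyTorch; the toy RNN, the random batch and the dummy loss are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Toy RNN, optimizer and data, purely for illustration.
model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 50, 10)   # batch of 8 sequences, 50 time steps, 10 features

optimizer.zero_grad()
output, h_n = model(x)
loss = output.pow(2).mean()  # dummy loss, just to get gradients
loss.backward()

# Cap the global gradient norm at 1.0 before the parameter update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```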
Still, these workarounds have their limits. Additionally, besides the long training time, long-running RNNs face another problem: the memory of the first inputs gradually fades away.
After a while, the RNN's state contains virtually no trace of the first inputs. For example, if we want to perform sentiment analysis on a long review that starts with "I loved this product," but the rest of the review lists the many things that could have made the product even better, then the RNN will gradually forget the initial positive sentiment and will completely misinterpret the review as negative.
In order to solve these RNN problems, various types of cells with long-term memory have been introduced in research. In practice, basic RNNs are not used anymore, and most of the work is done using the so-called Long Short Term Memory (LSTM) networks. They were invented by S. Hochreiter and J. Schmidhuber in 1997.
Secret 2 — A key idea in LSTM is the (star)Gate.
Each LSTM cell governs what to remember, what to forget, and how to update the memory using gates. By doing so, the LSTM network solves the problem of exploding or vanishing gradients, as well as all the other problems mentioned previously!
The architecture of an LSTM cell is depicted in the impressive diagram below.
h is the hidden state, representing short-term memory. C is the cell state, representing long-term memory, and x is the input.
The gates perform only a few matrix transformations and sigmoid and tanh activations in order to magically solve all the RNN problems.
We will dive into how this happens in the next sections, by looking at how the cell forgets, remembers and updates its memory.
A funny story
Let's explore the diagram within a funny plot. Assume that you are the boss, and your employee asks for a salary increase. Will you agree? Well, that will depend, let's say, on your state of mind.
Below, we consider your mind as an LSTM cell, with no intention to offend your lightning-fast brain.
Your long-term state C will impact your decision. On average, you are in a good mood 70% of the time, and you have 30% of your total budget left. Therefore, your cell state is C = [0.7, 0.3].
Recently, things have been going really well for you: your good mood is boosted with probability 100%, and you are 100% sure to have operating budget left. This sets your hidden state to h = [1, 1].
Today, three things happened: your kids succeeded in their school exams, you got an ugly review from your own boss, but you figured out that you still have plenty of time to complete the work. So today's input is x = [1, -1, 1].
Based on this evaluation, will you give a salary increase to your employee?
Secret 3 — LSTM forgets by using Forget Gates.
In the situation described above, your first step will probably be to figure out how the things that happened today (input x) and the things that happened recently (hidden state h) should affect your long-term view of the situation (cell state C). Forget Gates control how much of the past memory is kept.
After receiving your employee's request for a salary increase, your forget gate will run the following calculation of f_t, whose value will ultimately affect your long-term memory.
The weights shown in the picture below are chosen arbitrarily for illustration purposes; their values are normally learned during training of the network. The result [0, 0] indicates that you should erase (completely forget) your long-term memory and not let it affect your decision today.
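In standard LSTM notation this step is f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f). Here is a minimal NumPy sketch of it using the story's h and x; the weights are hypothetical values chosen so the output lands near the [0, 0] of the example, not the ones from the picture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([1.0, 1.0])        # short-term memory from the story
x_t = np.array([1.0, -1.0, 1.0])     # today's events from the story
hx = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

# Hypothetical weights; in a real network they are learned during training.
W_f = np.array([[-2.0, -2.0, -1.0, -2.0, -1.0],
                [-1.0, -2.0, -2.0, -1.0, -2.0]])
b_f = np.zeros(2)

f_t = sigmoid(W_f @ hx + b_f)        # forget gate
print(f_t.round(2))                  # close to [0, 0] -> erase the long-term memory
```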
Secret 4 — LSTM remembers using Input Gates.
Next, you need to decide which information about what happened recently (hidden state h) and what happened today (input x) you want to record in your long-term view of the situation (cell state C). The LSTM decides what to remember by using Input Gates.
First, you will calculate the input gate values i_t, which fall between 0 and 1 thanks to the sigmoid activation.
Next, you will scale your input between -1 and 1 using the tanh activation.
Finally, you will estimate the new information to add to the cell state by multiplying both results element-wise.
The result [1, 1] indicates that, based on the recent and current information, you are 100% in a good mood and very likely to have operating budget left. Things are looking promising for your employee.
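Continuing the NumPy sketch from the forget gate (same h_prev, x_t, hx and sigmoid helper), the input gate and the tanh-scaled candidate can be computed as follows; the weights are again hypothetical stand-ins chosen so the gated result lands near the story's [1, 1]:

```python
# Hypothetical weights, standing in for trained values.
W_i = np.array([[2.0, 2.0, 2.0, -1.0, 2.0],
                [2.0, 2.0, 1.0, -2.0, 2.0]])
W_c = np.array([[2.0, 1.0, 2.0, -2.0, 2.0],
                [1.0, 2.0, 2.0, -1.0, 2.0]])
b_i = b_c = np.zeros(2)

i_t = sigmoid(W_i @ hx + b_i)        # input gate: how much new information to write
c_tilde = np.tanh(W_c @ hx + b_c)    # candidate values, scaled between -1 and 1
new_info = i_t * c_tilde             # gated contribution to the cell state
print(new_info.round(2))             # close to the story's [1, 1]
```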
Secret 5 — LSTM keeps long-term memory using Cell State.
Now you know how the things that happened recently affect your state. Next, it is time to update your long-term view of the situation based on this new information.
When new values come in, the LSTM decides how to update its memory, again by using gates. The gated new values are added to the current memory. This additive operation is what solves the exploding or vanishing gradients problem of simple RNNs.
Instead of multiplying, the LSTM adds things to compute the new state. The result C_t is stored as the new long-term view of the situation (cell state).
The values [1, 1] suggest that you are now in a good mood 100% of the time and 100% likely to have budget left all the time! You are the perfect boss!
Based on this information, you can update your short-term view of the situation h_t (the next hidden state). The values [0.9, 0.9] indicate that there is a 90% likelihood that you will increase your employee's salary at the next time step. Congratulations to him!
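Putting the pieces together, here is the cell-state update and the output gate that produces the next hidden state, continuing the same NumPy sketch; the output-gate weights are hypothetical, and the story's exact [0.9, 0.9] depends on the illustrative weights in the figures:

```python
# Cell state update: forget part of the old memory, add the gated new information.
C_prev = np.array([0.7, 0.3])        # long-term memory from the story
C_t = f_t * C_prev + i_t * c_tilde   # additive update, not another long chain of multiplications

# Output gate (hypothetical weights): decides which part of the cell state
# is exposed as the new short-term memory h_t.
W_o = np.array([[2.0, 2.0, 2.0, -1.0, 2.0],
                [2.0, 1.0, 2.0, -2.0, 2.0]])
o_t = sigmoid(W_o @ hx)
h_t = o_t * np.tanh(C_t)             # new hidden state

print(C_t.round(2), h_t.round(2))    # cell state near [1, 1]; h_t is high in both
                                     # dimensions, in the spirit of the story's [0.9, 0.9]
```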
Gated Recurrent Unit
A variant of the LSTM cell is called the Gated Recurrent Unit, or GRU. The GRU was proposed by Kyunghyun Cho et al. in a 2014 paper.
The GRU is a simplified version of the LSTM cell; it can be a bit faster to train than the LSTM and seems to perform similarly, which explains its growing popularity.
As shown above, both state vectors are merged into a single vector. A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first.
There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.
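For reference, here is a minimal NumPy sketch of one GRU time step, following the original Cho et al. formulation (biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)          # update gate: the single controller for forgetting and writing
    r = sigmoid(W_r @ hx)          # reset gate: how much of the previous state to expose
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand  # z = 1: drop the old state, store the new information
```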
Stacking LSTM cells
By aligning multiple LSTM cells, we can process sequence data, for example the 4-word sentence in the picture below.
LSTM units are typically arranged in layers, so that the output of each unit is the input to the units of the next layer. In this example, we have 2 layers, each with 4 cells. In this way, the network becomes richer and captures more dependencies.
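In a framework such as Keras, this kind of stack can be written in a few lines; a minimal sketch, with arbitrary vocabulary, embedding and layer sizes and a binary sentiment head like the review example from earlier:

```python
import tensorflow as tf

# Two stacked LSTM layers: the first returns the full sequence of hidden
# states so the second layer receives one vector per time step.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),  # arbitrary vocabulary/embedding size
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```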
Bidirectional LSTM
RNNs, LSTMs and GRUs are designed to analyze sequences of values. Sometimes it makes sense to analyze a sequence in reverse order.
For example, in the sentence "he needs to work harder, the boss said about the employee," although "he" appears at the very beginning, it refers to the employee, mentioned at the very end.
Therefore, the order has to be reversed, or the forward and backward passes have to be combined. This bidirectional architecture is depicted in the figure below.
The following diagram further illustrates bidirectional LSTMs. The network at the bottom receives the sequence in the original order, while the network at the top receives the same input in reverse order. The two networks are not necessarily identical. What matters is that their outputs are combined for the final prediction.
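In Keras, this forward/backward combination is available as a wrapper layer; a minimal sketch with arbitrary sizes:

```python
import tensorflow as tf

# Bidirectional runs one LSTM over the sequence in the original order and a
# second one in reverse order, then combines (by default concatenates) their outputs.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```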
Asking for more secrets?
As we have just disclosed, an LSTM cell can learn to recognize an important input (that's the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that's the role of the forget gate), and learn to extract it whenever it is needed (that's the role of the output gate).
LSTMs have transformed machine learning and are now available to billions of users through the world’s most valuable public companies like Google, Amazon and Facebook.
LSTMs greatly improved speech recognition on over 4 billion Android phones (since mid 2015).
LSTMs greatly improved machine translation through Google Translate since Nov 2016.
Facebook performed over 4 billion LSTM-based translations per day.
Siri has been LSTM-based on almost 2 billion iPhones since 2016.
The answers of Amazon's Alexa were based on LSTMs.
Further Reading
If you wish to know even more about LSTMs and GRUs, check this article with amazing animations by Michael Nguyen. For those who prefer to build their own LSTM from scratch, this article might work.
Practical implementations of LSTM networks in Python are available in my article below.
Attention-based sequence-to-sequence models and Transformers go beyond LSTMs and have amazed folks recently with their impressive results in machine translation at Google and text generation at OpenAI. You might want to check this blog or my article below to learn more.
A comprehensive implementation of text classification using BERT, FastText, TextCNN, Transformer, Seq2seq, etc. can be found in this GitHub repository, or you can check my tutorial about BERT.
Thanks to Anne Bonner from Towards Data Science for editorial notes.