Long Short Term Memory (LSTM) Networks: Demystified (Part 1)



Long Short Term Memory (LSTM) Networks are a popular choice for sequential modelling tasks, due to its ability to store both long and short term. LSTMs are inspired from (Recurrent Neural Networks) RNNs. RNNs function in a similar manner, but does not possess effective ways of storing long-term memory. However, opposed to RNNs, LSTMs specifically implement “memory cells” which has the capability to store memory for longer periods of time.

Note: All the explanations are carried out inline with the exercise 6_lstm.ipynb from Udacity Deeplearning course. Furthermore “Generating Sequences With Recurrent Neural Networks” by Alex Graves will also be used.


RNNs are known for their performance in sequential tasks due to there recurrent connections (e.g. W_{h^ih^i}) which provide them with limited memorizing capabilities. RNN functions similar to a standard neural network (e.g. affine transformations, activation function, etc.) except for the recurrent connections (Figure below). Though the math looks hairy, this conceptually adds only a tiny bit of complexity. Also, I’m ignoring the “skip connections” for clarity.

Let’s briefly look at how RNNs work. First the training phase. Given a sequence of inputs \mathbf{X}=\{x_1,x_2,\ldots,x_t,\ldots,x_T\}, given a single input x_t, calculate activations of N layers of the RNN (h_t^0,h_t^1,\ldots,h_t^N) using the transformation,
h_t^i=\sigma(W_{h^{i}h^{i-1}}h^{i-1}_t+W_{h^ih^i}h_{t-1}^i+b^i_h) \text{ where } h^0_t = x_t.
Next, calculate the output y_t=\text{Softmax}(W_{h^Ny}h_t^N+b^N_h). Note that we’re calculating softmax with logits instead of activations (this is numerically stable).Subsequently calculate the loss log cross-entropy L_t between the predicted label and the actual label. Finally propagate the loss through the network using Backpropagation (I will not dive into mathematics behind backpropagation as there are plenty of resources for that).

Next, how does RNN work as a prediction network? First we start with a randomly sampled input x_{T+1}. Then using the already trained RNN, we propagate the random input forward and calculate y_{T+1}. Then use y_{T+1} as the next input (x_{T+2}) and repeat the process.


I think I did justice to RNNs with this explanation. Let’s move on to LSTMs.

Motivation behind LSTMs

RNNs are okey, but…

RNNs are great and does a decent job at sequential tasks. But you know what! Beside having a short-term memory, RNNs can also suffer from the vanishing/exploding gradient. Vanishing gradient is when the gradient becoming so small over time resulting in poor learning, where the exploding gradient causes very large gradients making the computations very unstable eventually leading to numerical errors. The reason being, pushing many highly-correlated weight updates consecutively over time. Now, how do we overcome these adverse phenomena?

To overcome exploding gradient, we apply a hack called gradient clipping. That is, whenever the gradient exceeds a threshold value, we clip it. Thus resulting numerically stable gradients.

Above illustrated RNN overcomes the vanishing gradient by using “skip connections”. Argument behind this is, having less steps to reach the lower layers in the back propagation reduces the diminishing nature of gradient. But this is a poorly scalable solution.

LSTMs to the rescue…

RNNs lacks long-term memory and suffer from gradient descent issues. As a solution, a team from IDSIA, Switzerland invented a better model; Long Short-Term Memory (LSTM) models. LSTM models not only have both long and short term memory, but also uses an elegant way to descend the gradient while preventing vanishing gradient.

LSTM replaces the hidden layers of RNN with a different model. LSTM architecture comprises 4 main components; input gate (i), forget gate (f), output gate (o) and memory cell (c). As it is obvious from the names, it uses a gated architecture. Each gate as a specific purpose. For each input, gates decide the followings.

  • Input gate should I write to the memory from the current input?
  • Forget gate should I forget the existing knowledge in the memory cell?
  • Output gate should I read from the current output and memory?
  • Memory cell stores the long-term memory

Sounds pretty cool right? But we can make it even cooler. We replace binary gates with sigmoidal functions. Allowing continuous operations on reading,forgetting and writing. For example now the network can write 50% of the input, forget 25% of the memory and read 75% of the output. This is a much better approach in contrast to binary gates. So this amazing devising allows LSTM to perform remarkably in sequential tasks. Let’s now dive into specifics.

What can LSTMs do

There are couple of amazing things done with LSTM networks. For example, this movie is entirely written by a LSTM network, by analyzing other movie scripts. This video shows a LSTM network generating music.

Model: Demistyfied

Whew! this one is a mouthful. So brace yourself. This sounds complicated due to the amount of the things present, but is not difficult to understand if you break things down.

Note: I refer both 6_lstm.ipynb and “Generating Sequences With Recurrent Neural Networks” to explain the model and keep it as simple as I can. For example, I dropped some connections that goes from memory cell to gates to preserve the clarity. Moreover, explanation considers a single layer LSTM but the generalization to multiple layers is straight-forward.


The main thing you need to understand is that, beside the intimidating appearance of LSTM everything except calculating the hidden layer output remains similar to RNNs. In other words, the high-level operations are similar. Feed an input x_t, calculate hidden layer activations (h_t^0,h_t^1,\ldots,h_t^N), predict the output (y_t), calculate the loss (L_t) and finally backpropagate the loss through the network. Of course the backpropagation would change as LSTMs employ a rich architecture with more connections. But you can implement LSTMs without understanding the specifics of backpropragation (for example, backpropagation cange implemented with just 1 line of code in tensorflow).

For calculating hidden layer activations, you just need to decompose operations inside the LSTM to one of these operations; reading input/writing output/forgetting/memorizing. And this can be implemented using only 5 lines of code.

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)
f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)
c_t = f_tc_{t-1} + i_t\text{tanh}(W_{xc}x_t + W_{hc}h_{t-1} + b_c)
o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)
h_t = o_t\text{tanh}(c_t)

If c_t looks funny to you, intuitively what it is doing is, forgetting some amount of previous knowledge c_{t-1} and memorizing some amount of current input. You can read original LSTM paper to understand the actual mechanics of the model.

Putting All Together

Now the LSTM cell defined, all that is left to do is connect a classifier on top of the LSTM. The diagram below explains how that should be done. We add a softmax classifier with weights W and biases b. And connect W,b to the output h_t of the LSTM. Input remains x_t.



I think I covered the basics well enough. I think I’ll spare the implementation for another post. This is complicated enough as it is!

Light on Math Machine Learning: Intuitive Guide to Understanding Word2vec

Here comes the third blog post in the series of light on math machine learning A-Z. This article is going to be about Word2vec algorithms. Word2vec algorithms output word vectors. Word vectors, underpin many of the natural language processing (NLP) systems, that have taken the world by a storm (Amazon...

Light on Math Machine Learning: Intuitive Guide to Convolution Neural Networks

This is the second article on my series introducing machine learning concepts with while stepping very lightly on mathematics. If you missed previous article you can find in here. Fun fact, I’m going to make this an interesting adventure by introducing some machine learning concept for every letter in the...

Light on Math Machine Learning: Intuitive Guide to Understanding KL Divergence

I’m starting a new series of blog articles following a beginner friendly approach to understanding some of the challenging concepts in machine learning. To start with, we will start with KL divergence. Code: Here Concept Grounding First of all let us build some ground rules. We will define few things...