Long Short Term Memory (LSTM) Networks: Implementing with Tensorflow (Part 2)

This post assumes an intermediate understanding of how LSTM networks work. If you need a refresher, please read Long Short Term Memory (LSTM) Networks: Demystified (Part 1) first.

I’m using the following versions
Python: 3.4
Tensorflow: 0.10.0

Let’s get right into it. I’ll explain the implementation using code snippets from 6_lstm.ipynb, which belongs to the Deep Learning course on Udacity.

Before implementing anything, let’s understand what we want to achieve. We are trying to implement a generative network that can produce meaningful text. We achieve this by training the model on pairs (input: \text{character}_i, output: \text{character}_{i+1}) for every character in the text. Now let’s look at the specific implementation details.

First we have the following methods, which are straightforward and don’t need a detailed walkthrough.

def maybe_download(filename, expected_bytes): # download data
def read_data(filename): # read data as a string
def char2id(char): # convert a character to an ID
def id2char(dictid): # convert an ID to a character
def batches2string(batches): # convert a given set of batches to a string
def characters(probabilities): # convert softmax predictions to characters
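For reference, the character↔ID mapping in the notebook treats the space as ID 0 and 'a'–'z' as IDs 1–26. A minimal sketch of char2id and id2char along those lines:

```python
import string

vocabulary_size = len(string.ascii_lowercase) + 1  # 26 letters + ' '
first_letter = ord('a')

def char2id(char):
    # map 'a'..'z' to 1..26; space (and anything unexpected) to 0
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    return 0

def id2char(dictid):
    # inverse mapping: 0 -> ' ', 1..26 -> 'a'..'z'
    return chr(dictid + first_letter - 1) if dictid > 0 else ' '
```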

Next we have the BatchGenerator, which generates num_unrollings batches of batch_size characters at a time when you call its next(self) method. Let’s first understand the high-level functionality of this class. BatchGenerator generates batches such that \text{batch}_i is the input and \text{batch}_{i+1} is the output. For example, given the sentence 'the quick brown fox ' with num_unrollings=2 and batch_size=10, we can generate two batches the following way.
\text{batch}_0 = [t, e, q, i, k, b, o, n, f, x]
\text{batch}_1 = [h, ' ', u, c, ' ', r, w, ' ', o, ' ']
Note: In the actual implementation, characters are represented by numerical IDs.
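To make the batching concrete, here is a toy sketch (plain Python, not the notebook’s class) of how the text is carved into batch_size parallel streams, with the labels being the same positions shifted by one character:

```python
text = 'the quick brown fox '
batch_size = 10

# each of the batch_size cursors starts at the beginning of its own segment
segment = len(text) // batch_size
cursors = [i * segment for i in range(batch_size)]

batch_0 = [text[c] for c in cursors]      # inputs
batch_1 = [text[c + 1] for c in cursors]  # labels: shifted by one position
```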

Variable Initialization

We initialize all the variables here. I’ve added a comment after every line of code to show what each of these variables correspond to in the LSTM diagram from previous post.

num_nodes = 64

graph = tf.Graph()
with graph.as_default():

  # Parameters:
  # Input gate: input, previous output, and bias
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xi
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_hi
  ib = tf.Variable(tf.zeros([1, num_nodes])) #b_i

  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xf
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_hf
  fb = tf.Variable(tf.zeros([1, num_nodes])) #b_f

  # Memory cell: input, state and bias.
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xc
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_hc
  cb = tf.Variable(tf.zeros([1, num_nodes])) #b_c

  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xo
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_ho
  ob = tf.Variable(tf.zeros([1, num_nodes])) #b_o

  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False) #h_t
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False) #c_t

  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1)) #Softmax W
  b = tf.Variable(tf.zeros([vocabulary_size])) #Softmax b

Next we define the operations of the LSTM cell. Nothing too fancy here; these operations were defined at the end of the previous post.

  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

Since this is a sequential learning process, we cannot define just two placeholders for input and output. Instead we define num_unrollings+1 placeholders (train_data), where the first num_unrollings placeholders are the inputs and the last num_unrollings are the outputs. (Remember that \text{batch}_{i+1} is the output for \text{batch}_i; imagine a sliding window of size num_unrollings.)

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

Next, we calculate the output for the data in each input placeholder and save it to a list called outputs.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

Calculating the logits for the softmax is a little tricky. This is a temporal (time-based) network, so after processing each set of num_unrollings batches through the LSTM cell, we update h_{t-1}=h_t and c_{t-1}=c_t before calculating the logits and the loss. This is done with tf.control_dependencies: the logits will not be calculated until saved_output and saved_state are updated. Finally, as you can see, num_unrollings acts as the amount of history the network remembers.

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

Next, we implement the optimizer. Remember: we should use gradient clipping (tf.clip_by_global_norm) to avoid the exploding-gradient phenomenon. We also decay the learning_rate over time.

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)
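To see what tf.clip_by_global_norm actually computes, here is a NumPy sketch of its semantics: every gradient in the list is scaled by clip_norm / max(global_norm, clip_norm), so the gradients are only ever shrunk, never amplified:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # global norm taken over the whole list of gradient arrays
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)  # always <= 1.0
    return [g * scale for g in grads], global_norm
```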

Now we are coming to the end of the variable definitions. Here we define the train_prediction variable and several more input, output, and state variables used to generate new text after training. We also define the reset_sample_state operation to clear the memory at the start of every newly generated sentence.

  # Predictions.
  train_prediction = tf.nn.softmax(logits)

  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

Training and Generating

Nothing too fancy here. I won’t break the code into sections to explain it, but here is high-level pseudocode to make it easier to follow.

#Initialize all variables
#For each step
    # Get the train inputs and outputs
    # Run the optimizer
    # For every summary_frequency steps:
         # calculate mean_loss for last set of batches
         # calculate the perplexity for the last set of batches
         # For every 10*summary_frequency:
             # Generate 5 sentences with 80 characters
             # For each sentence
                 # Reset the state of LSTM
                 # Sample a random letter
                 # For each character to generate
                     # Get the prediction for the last letter of the sentence
                     # Add the prediction to the sentence
         # Reset state after the sentence generation
         # Calculate the perplexity of an independent predefined validation dataset         

The above functions are achieved by the following code.
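One helper worth knowing before reading the snippet is logprob, which the notebook defines to compute the average negative log-probability of the true (one-hot) labels; perplexity is then np.exp(logprob(predictions, labels)):

```python
import numpy as np

def logprob(predictions, labels):
    """Average negative log-probability of the true labels (labels are one-hot)."""
    predictions[predictions < 1e-10] = 1e-10  # avoid log(0)
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
```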

num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))


So that’s it for a basic LSTM network that generates text by learning from a given text file. Hope you enjoyed it!

  • Ankit Lohani

    This is a very nice tutorial. However, I am stuck at a point and would like to clarify — In the code snippet that you have shared, “6_lstm.ipynb”, they have used word embedding in “Problem 2”. The code for unrolling the LSTM goes like —
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
      i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, tf.argmax(i, dimension=1))
      output, state = lstm_cell(i_embed, output, state)

    Here, ‘i’ is a matrix of dimen. [batch_size, vocab_size]. Suppose my text is “Cancer is a group of diseases characterized by an uncontrolled growth of abnormal cells”. For batch_size = 5, num_unrollings = 2, vocab_size = 14
    i would look like this —
    i =: [cancer, is, a] = [1,1,1,0,0,0,0,0,0,0,0,0,0,0]
    If we don’t go for word_embeddings, we can directly pass this vector in the LSTM cell as input. However, if we wish to pass the embedding instead of this, how does it work? I mean how can I get it? Could you please elaborate how this statement is working — i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, tf.argmax(i, dimension=1))

    • admin

      Hi Ankit,

      I appreciate you taking time to read the post. If I understood you correctly, you got two questions.

      1. Can we use something like [cancer,is,a] as the input instead of embeddings.
      The purpose of embeddings is to reduce the dimensionality of the input. If you use words as they are (i.e. without embeddings) your input will be of ~10000 dimensions and very sparse. Which is a waste of computational power. This is why embeddings are useful

      2. What does tf.nn.embedding_lookup(embeddings, ids) do
      Well, you know that, embeddings is a huge matrix of size, [vocabulary_size,embedding_size] (e.g 50000×128) so what embedding lookup does is it will get the embedding vectors for the corresponding ids (in this case tf.argmax(i)) from that huge matrix. (more info: https://www.tensorflow.org/api_docs/python/nn/embeddings)
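      A quick NumPy sketch (with hypothetical sizes) of what that lookup does — it is just row indexing into the embedding matrix:

```python
import numpy as np

vocabulary_size, embedding_size = 14, 8  # hypothetical sizes for illustration
vocabulary_embeddings = np.arange(
    vocabulary_size * embedding_size,
    dtype=np.float64).reshape(vocabulary_size, embedding_size)

i = np.zeros((3, vocabulary_size))       # batch of 3 one-hot rows
i[0, 5] = i[1, 2] = i[2, 9] = 1.0
ids = np.argmax(i, axis=1)               # what tf.argmax(i, dimension=1) gives
i_embed = vocabulary_embeddings[ids]     # what tf.nn.embedding_lookup returns
```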

      Hope these help

  • 1010101010110

    hello, thank you very much for the tutorial.
    My question is this :
    What is the practical relationship between number of unrolligns, batch size and input data? I mean, why should i use num_unrolling=2 and batch_size=10 instead of num_unrolling=10 and batch_size=2 for example? Why not go for 50×50 if I can ? Will my input dictate my approach and how?

    Thank you

    • Thushan Ganegedara

      Your argument is correct. The higher num_unrollings is, the further into the past we look at each time step, and therefore the longer the memory stored in the state will be.
      On the other hand, the higher the batch_size, the better the generalization properties your model will possess, because a larger batch size means you see more text at a single time step.
      However, higher values for these come at a cost:
      higher num_unrollings and higher batch_size mean a higher memory requirement at a given time step.

  • Ilya Malyutin

    Thank you! I spent a lot of time trying to understand 6_lstm before I found your tutorial!

    • Thushan Ganegedara

      Appreciate your feedback very much 🙂