Long Short Term Memory (LSTM) Networks: Implementing with Tensorflow (Part 2)

Before proceeding, note that this post assumes an intermediate understanding of how LSTM networks work. If you don’t have that yet, please read Long Short Term Memory (LSTM) Networks: Demystified (Part 1) first.

I’m using the following versions
Python: 3.4
Tensorflow: 0.10.0

Let’s get right into it. To explain the implementation, I’m using code snippets from 6_lstm.ipynb, part of the Deep Learning course on Udacity.

Before implementing anything, let’s be clear about the goal: a generative network that can produce meaningful text. We train the model on pairs (input: \text{character}_i, output: \text{character}_{i+1}) for every character in the text. Now let’s look at the implementation details.
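As a small illustration of those training pairs, here is a sketch in plain Python. The sample string and the variable names are made up for this example, and characters are left as characters rather than numerical IDs.

```python
text = 'the quick brown fox'

# Pair every character with the one that follows it:
# the first element is the input, the second is the expected output.
pairs = list(zip(text[:-1], text[1:]))

print(pairs[:3])
```

A text of n characters yields n - 1 such pairs, because the last character has no successor.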

First come a few helper functions. They are straightforward, so I won’t go into their details.

def maybe_download(filename, expected_bytes): # download data
def read_data(filename): # read data as a string
def char2id(char): # convert a character to an ID
def id2char(dictid): # convert an ID to a character
def batches2string(batches): # convert a given set of batches to a string
def characters(probabilities): # convert softmax predictions to characters
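The bodies of these helpers are not shown in this post. As a rough idea of what the two ID converters might look like, here is a sketch that assumes a vocabulary of the lowercase letters plus the space character, with the space mapped to ID 0. That vocabulary and the mapping are assumptions made for this example, not something stated above.

```python
import string

# Assumed vocabulary: ID 0 is the space, IDs 1..26 are 'a'..'z'.
vocabulary_size = len(string.ascii_lowercase) + 1
first_letter = ord('a')

def char2id(char):
    """Map a character to its ID under the assumed vocabulary."""
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    return 0  # the space (and anything unexpected) shares ID 0

def id2char(dictid):
    """Map an ID back to its character."""
    return chr(dictid + first_letter - 1) if dictid > 0 else ' '

ids = [char2id(c) for c in 'fox ']
print(ids, repr(''.join(id2char(i) for i in ids)))
```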

Next we have the BatchGenerator, which returns a list of batches, each holding batch_size characters, every time you call next(). Let’s first understand the high-level behaviour of this class. Consecutive batches are offset by one character, so \text{batch}_i is the input and \text{batch}_{i+1} is the corresponding output. For example, given the sentence 'the quick brown fox ' with num_unrollings=2 and batch_size=10, the first two batches look like this:
\text{batch}_0 = [t, e, q, i, k, b, o, n, f, x]
\text{batch}_1 = [h, ' ', u, c, ' ', r, w, ' ', o, ' ']
Here ' ' is the space character. Note: In the actual implementation, characters are represented by numerical IDs.
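The batch layout in this example can be reproduced with a few lines of plain Python. This is only a sketch, not the notebook’s BatchGenerator: the function name make_batches and its parameters are hypothetical, and it keeps raw characters instead of one-hot encoded IDs. The idea is to split the text into batch_size equal segments and read one character from each segment per batch.

```python
def make_batches(text, batch_size, num_batches):
    """Return num_batches lists, each holding one character per segment."""
    segment = len(text) // batch_size            # distance between cursors
    cursors = [k * segment for k in range(batch_size)]
    batches = []
    for _ in range(num_batches):
        batches.append([text[c] for c in cursors])
        # Advance every cursor by one character, wrapping around at the end.
        cursors = [(c + 1) % len(text) for c in cursors]
    return batches

batches = make_batches('the quick brown fox ', batch_size=10, num_batches=2)
print(batches[0])
print(batches[1])
```

Each position k in the second batch holds the character that follows position k in the first batch, which is exactly the input/output relationship the network is trained on.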

Variable Initialization

We initialize all the variables here. I’ve added a comment after each line of code to show which symbol in the LSTM diagram from the previous post the variable corresponds to.

num_nodes = 64

graph = tf.Graph()
with graph.as_default():

  # Parameters:
  # Input gate: input, previous output, and bias
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xi
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_hi
  ib = tf.Variable(tf.zeros([1, num_nodes])) #b_i

  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xf
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_hf
  fb = tf.Variable(tf.zeros([1, num_nodes])) #b_f

  # Memory cell: input, state and bias.
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xc
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_hc
  cb = tf.Variable(tf.zeros([1, num_nodes])) #b_c

  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1)) #W_xo
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1)) #W_ho
  ob = tf.Variable(tf.zeros([1, num_nodes])) #b_o

  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False) #h_t
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False) #c_t

  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1)) #Softmax W
  b = tf.Variable(tf.zeros([vocabulary_size])) #Softmax b

Next we define the operations of the LSTM cell. Nothing too fancy here. These operations are defined at the end of the previous post.

  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state
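To check the shapes without building a graph, the same cell can be sketched in NumPy. This is an illustrative re-implementation rather than the notebook’s code: the sizes, the random initialisation, the parameter dictionary p and the names sigmoid and lstm_cell_np are all assumptions made for this example.

```python
import numpy as np

vocabulary_size, num_nodes, batch_size = 27, 64, 4  # illustrative sizes

# Parameters keyed by the same names as in the graph (ix, im, ib, fx, ...).
rng = np.random.RandomState(0)
p = {}
for gate in 'ifco':
    p[gate + 'x'] = rng.normal(0.0, 0.1, (vocabulary_size, num_nodes))
    p[gate + 'm'] = rng.normal(0.0, 0.1, (num_nodes, num_nodes))
    p[gate + 'b'] = np.zeros((1, num_nodes))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_np(i, o, state, p):
    """One LSTM step: i is the input, o the previous output, state the cell state."""
    input_gate = sigmoid(np.dot(i, p['ix']) + np.dot(o, p['im']) + p['ib'])
    forget_gate = sigmoid(np.dot(i, p['fx']) + np.dot(o, p['fm']) + p['fb'])
    update = np.dot(i, p['cx']) + np.dot(o, p['cm']) + p['cb']
    state = forget_gate * state + input_gate * np.tanh(update)
    output_gate = sigmoid(np.dot(i, p['ox']) + np.dot(o, p['om']) + p['ob'])
    return output_gate * np.tanh(state), state

# A one-hot batch: each row selects one character ID.
x = np.eye(vocabulary_size)[np.array([0, 5, 12, 26])]
h = np.zeros((batch_size, num_nodes))
c = np.zeros((batch_size, num_nodes))
h, c = lstm_cell_np(x, h, c, p)
print(h.shape, c.shape)
```

Both the output and the state have shape [batch_size, num_nodes], which is why saved_output and saved_state are declared with that shape.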

Because this is a sequential learning process, we cannot define just 2 placeholders for input and output. Instead we define num_unrollings+1 placeholders (train_data), where the first num_unrollings placeholders are the inputs and the last num_unrollings placeholders are the outputs. (Remember, batch_{i+1} is the output for batch_i. Imagine a sliding window of size num_unrollings.)

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.
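The effect of the two slices is easy to see with stand-in values. In this sketch the placeholders are replaced by plain strings, and num_unrollings = 3 is an arbitrary choice for illustration.

```python
num_unrollings = 3  # arbitrary value for this illustration

# Stand-ins for the num_unrollings + 1 placeholders.
train_data = ['batch_%d' % k for k in range(num_unrollings + 1)]

train_inputs = train_data[:num_unrollings]
train_labels = train_data[1:]  # labels are inputs shifted by one time step

# Each input is paired with the batch that comes one step later.
pairs = list(zip(train_inputs, train_labels))
print(pairs)
```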

Next, we calculate the output for the data in each input placeholder and save it to a list called outputs.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

Calculating the logits for the softmax is a little bit tricky. This is a temporal (time-based) network, so after passing the num_unrollings batches through the LSTM cell, we update h_{t-1}=h_t and c_{t-1}=c_t before calculating the logits and the loss. This is done with tf.control_dependencies, which ensures that the logits are not calculated until saved_output and saved_state have been updated. Finally, as you can see, num_unrollings sets how many time steps are processed, and backpropagated through, in a single training step.

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

Next, we implement the optimizer. Remember: we should use “gradient clipping” (tf.clip_by_global_norm) to avoid the “exploding gradient” phenomenon. We also decay the learning rate over time.

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)
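To see what these two pieces do numerically, here is a NumPy sketch of clipping by global norm and of a staircase exponential decay. The function names clip_by_global_norm_np and staircase_decay are hypothetical, and the gradient values are made up. The formulas follow the documented behaviour of the TensorFlow ops: gradients are scaled by clip_norm / max(global_norm, clip_norm), and the rate is multiplied by decay_rate once every decay_steps steps.

```python
import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    """Scale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

def staircase_decay(base_rate, step, decay_steps, decay_rate):
    """Learning rate that drops by decay_rate every decay_steps steps."""
    return base_rate * decay_rate ** (step // decay_steps)

# Made-up gradients whose global norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm_np(grads, 1.25)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)

# Settings used above: start at 10.0, multiply by 0.1 every 5000 steps.
for step in (0, 4999, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```

Because all gradients are scaled by the same factor, clipping changes the size of the update but not its direction. With num_steps = 7001, the learning rate stays at 10.0 for the first 5000 steps and drops to 1.0 afterwards.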

Now we are coming to the end of the graph definition. Here we define train_prediction, plus a separate set of input, output and state variables used to generate new text after training. We also define the reset_sample_state op, which clears the memory at the start of every newly generated sentence.

  # Predictions.
  train_prediction = tf.nn.softmax(logits)

  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

Training and Generating

Nothing too fancy here. I won’t break the code into sections to explain it, but here is some high-level pseudocode to make it easier to follow.

#Initialize all variables
#For each step
    # Get the train inputs and outputs
    # Run the optimizer
    # For every summary_frequency steps:
         # calculate mean_loss for last set of batches
         # calculate the perplexity for the last set of batches
         # For every 10*summary_frequency:
             # Generate 5 sentences with 80 characters
             # For each sentence
                 # Reset the state of LSTM
                 # Sample a random letter
                 # For each character to generate
                     # Get the prediction for the last letter of the sentence
                     # Add the prediction to the sentence
         # Reset state after the sentence generation
         # Calculate the perplexity of an independent predefined validation dataset         

The following code implements these steps.

num_steps = 7001
summary_frequency = 100
skip_window = 2

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()  # initialize all variables (TF 0.10 API)
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()  # clear the LSTM memory before each sentence
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()  # reset state after sentence generation
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))
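The training loop calls logprob, sample and random_distribution, whose definitions are not shown in this post. The following is a minimal sketch of what such helpers might look like, under stated assumptions: a vocabulary of 27 symbols, one-hot labels, and an explicit rng argument added here for reproducibility (the calls above take no such argument). Treat it as an illustration of the perplexity and sampling steps, not as the notebook’s code.

```python
import numpy as np

vocabulary_size = 27  # assumed: 26 letters plus the space

def logprob(predictions, labels):
    """Mean negative log-probability assigned to the true (one-hot) labels."""
    predictions = np.clip(predictions, 1e-10, 1.0)  # avoid log(0)
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]

def random_distribution(rng):
    """A random probability row vector of shape [1, vocabulary_size]."""
    b = rng.uniform(0.0, 1.0, size=(1, vocabulary_size))
    return b / np.sum(b, axis=1, keepdims=True)

def sample(prediction, rng):
    """Draw one ID from a [1, vocabulary_size] distribution; return it one-hot."""
    idx = rng.choice(vocabulary_size, p=prediction[0])
    one_hot = np.zeros((1, vocabulary_size))
    one_hot[0, idx] = 1.0
    return one_hot

rng = np.random.RandomState(0)
label = sample(random_distribution(rng), rng)

# A model that predicts every character with equal probability.
uniform = np.full((1, vocabulary_size), 1.0 / vocabulary_size)
perplexity = float(np.exp(logprob(uniform, label)))
print(perplexity)
```

Perplexity is the exponential of the mean negative log-probability, so a model that guesses uniformly scores a perplexity of about vocabulary_size. Anything lower means the model has learned something about the text.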


So that’s it for a basic LSTM network that generates text by learning from a given text file. Hope you enjoyed it!
