## Long Short Term Memory (LSTM) Networks: Implementing with Tensorflow (Part 2)

This post assumes an intermediate understanding of how LSTM networks work. If you need a refresher, please read Long Short Term Memory (LSTM) Networks: Demystified (Part 1) first.

I’m using the following versions

Python: 3.4

Tensorflow: 0.10.0

Let’s get right into it. To explain the implementation, I’m using code snippets from `6_lstm.ipynb`, which is part of the Deep Learning course on Udacity.

Before implementing anything, let’s understand what we want to achieve. We are trying to implement a generative network that can generate meaningful text. We will achieve this by training the model with **(input: the current character, output: the next character)** pairs for all the characters in the text. Now let’s see the specific implementation details.

First, we have the following helper functions. They are straightforward, so there is no need to dive into their details.

```python
def maybe_download(filename, expected_bytes):
    ...  # download data

def read_data(filename):
    ...  # read data as a string

def char2id(char):
    ...  # convert a character to an ID

def id2char(dictid):
    ...  # convert an ID to a character

def batches2string(batches):
    ...  # convert a given set of batches to a string

def characters(probabilities):
    ...  # convert softmax predictions to characters
```
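To make the character-to-ID mapping concrete, here is a minimal sketch of what `char2id` and `id2char` could look like. I'm assuming the vocabulary is the 26 lowercase ASCII letters plus the space (ID 0); the `one_hot` helper is a hypothetical name I've added for illustration and is not part of the notebook.

```python
import string

# Assumption: the vocabulary is the 26 lowercase ASCII letters plus the space.
VOCABULARY_SIZE = len(string.ascii_lowercase) + 1  # ID 0 is reserved for ' '
FIRST_LETTER = ord(string.ascii_lowercase[0])


def char2id(char):
    """Map a single character to its ID; anything unexpected maps to 0 (space)."""
    if len(char) == 1 and char in string.ascii_lowercase:
        return ord(char) - FIRST_LETTER + 1
    return 0


def id2char(dictid):
    """Map an ID back to its character."""
    if dictid > 0:
        return chr(dictid + FIRST_LETTER - 1)
    return ' '


def one_hot(char):
    """Hypothetical helper: encode a character as a one-hot list."""
    vector = [0.0] * VOCABULARY_SIZE
    vector[char2id(char)] = 1.0
    return vector


print(char2id('a'), char2id('z'), char2id(' '))  # 1 26 0
print(id2char(1), id2char(26))                   # a z
```

With this mapping, every character becomes a vector of length `VOCABULARY_SIZE` with a single 1.0 in it.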

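Next we have the `BatchGenerator`, which generates `num_unrollings` new batches of `batch_size` characters each time you call the method `next(self)`. (Each call also returns the last batch of the previous call as its first element, which is why the training code later feeds `num_unrollings + 1` batches per step.) Let’s first understand the high-level functionality of this class. `BatchGenerator` generates batches such that each batch is the input for which the following batch is the output. For example, given the sentence `'the quick brown fox '` with `num_unrollings=2` and `batch_size=10`, I can generate two batches in the following way.

first batch = [t, e, q, i, k, b, o, n, f, x]

second batch = [h, ' ', u, c, ' ', r, w, ' ', o, ' ']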

Note: In the actual implementation, each character is represented by a numerical ID, encoded as a one-hot vector of length `vocabulary_size`.
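The example can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not the notebook's `BatchGenerator`: `make_batches` is a hypothetical function, it works on raw characters instead of one-hot vectors, and I'm assuming the text is split into `batch_size` equal segments with one cursor per segment.

```python
def make_batches(text, batch_size, num_batches):
    """Return `num_batches` consecutive batches of `batch_size` characters.

    The text is split into `batch_size` equal segments with one cursor per
    segment. Every batch takes the character under each cursor, then all
    cursors advance by one position (wrapping around at the end of the text).
    """
    segment = len(text) // batch_size
    cursors = [offset * segment for offset in range(batch_size)]
    batches = []
    for _ in range(num_batches):
        batches.append([text[c] for c in cursors])
        cursors = [(c + 1) % len(text) for c in cursors]
    return batches


batches = make_batches('the quick brown fox ', batch_size=10, num_batches=2)
print(batches[0])  # ['t', 'e', 'q', 'i', 'k', 'b', 'o', 'n', 'f', 'x']
print(batches[1])  # ['h', ' ', 'u', 'c', ' ', 'r', 'w', ' ', 'o', ' ']
```

Position by position, each character in the second batch is the character that follows the corresponding one in the first batch, which is exactly the input/output relationship the network is trained on.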

### Variable Initialization

We initialize all the variables here. I’ve added a comment after every line of code to show which part of the LSTM diagram from the previous post each variable corresponds to.

```python
import tensorflow as tf  # written against TensorFlow 0.10.0

# `vocabulary_size` and `batch_size` are defined elsewhere in the notebook.
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xi
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_hi
    ib = tf.Variable(tf.zeros([1, num_nodes]))  # b_i
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xf
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_hf
    fb = tf.Variable(tf.zeros([1, num_nodes]))  # b_f
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xc
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_hc
    cb = tf.Variable(tf.zeros([1, num_nodes]))  # b_c
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xo
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_ho
    ob = tf.Variable(tf.zeros([1, num_nodes]))  # b_o
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)  # h_t
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)  # c_t
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))  # Softmax W
    b = tf.Variable(tf.zeros([vocabulary_size]))  # Softmax b
```
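To get a feel for the size of this model, the shapes of the trainable variables can be tallied in plain Python. This is a rough sketch: `lstm_parameter_shapes` and `count_parameters` are hypothetical helper names, and `vocabulary_size = 27` (26 lowercase letters plus the space) is my assumption here. The non-trainable `saved_output` and `saved_state` are left out.

```python
def lstm_parameter_shapes(vocabulary_size, num_nodes):
    """Shapes of the trainable variables, keyed by the names used in the graph."""
    shapes = {}
    for gate in ('i', 'f', 'c', 'o'):  # input, forget, memory cell, output
        shapes[gate + 'x'] = (vocabulary_size, num_nodes)  # input weights
        shapes[gate + 'm'] = (num_nodes, num_nodes)        # previous-output weights
        shapes[gate + 'b'] = (1, num_nodes)                # bias
    shapes['w'] = (num_nodes, vocabulary_size)             # softmax weights
    shapes['b'] = (vocabulary_size,)                       # softmax bias
    return shapes


def count_parameters(shapes):
    """Total number of scalar parameters across all shapes."""
    total = 0
    for shape in shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


shapes = lstm_parameter_shapes(vocabulary_size=27, num_nodes=64)
print(count_parameters(shapes))  # 25307
```

Most of the parameters sit in the four `num_nodes x num_nodes` matrices that connect the previous output to the gates.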

Next we define the operations of the LSTM cell. Nothing too fancy here. These operations are defined at the end of the previous post.

```python
    # Definition of the cell computation (inside `with graph.as_default():`).
    def lstm_cell(i, o, state):
        # i: current input, o: previous output, state: previous cell state.
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        # Returns the new output and the new cell state.
        return output_gate * tf.tanh(state), state
```
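For reference, here is the same cell computation written as equations. This is my transcription of `lstm_cell`, using the weight names from the comments in the variable definitions; $x_t$ is the input `i`, $h_{t-1}$ is the previous output `o`, $c_{t-1}$ is the previous `state`, $\sigma$ is the sigmoid, and $\odot$ is element-wise multiplication. Inputs are row vectors, so they multiply the weights from the left.

$$
\begin{aligned}
i_t &= \sigma(x_t W_{xi} + h_{t-1} W_{hi} + b_i) \\
f_t &= \sigma(x_t W_{xf} + h_{t-1} W_{hf} + b_f) \\
\tilde{c}_t &= \tanh(x_t W_{xc} + h_{t-1} W_{hc} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(x_t W_{xo} + h_{t-1} W_{ho} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$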

Because this is a sequential learning process, we cannot define just two placeholders for input and output. Instead we have to define `num_unrollings + 1` placeholders (`train_data`), where the first `num_unrollings` placeholders are the inputs and the last `num_unrollings` placeholders are the outputs. (Remember, the batch at each time step is the output for the batch at the previous time step. Imagine a sliding window of size `num_unrollings`.)

```python
    # Input data (inside `with graph.as_default():`).
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.
```
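The sliding window can be illustrated on a plain string. In this sketch `windows` is a hypothetical helper, each "batch" is a single character rather than a one-hot matrix, and I'm assuming consecutive training steps overlap by one element, so that the last label of one step becomes the first input of the next.

```python
def windows(sequence, num_unrollings):
    """Yield (inputs, labels) pairs from overlapping windows.

    Each window holds num_unrollings + 1 items: the first num_unrollings are
    the inputs and the last num_unrollings are the labels.
    """
    for start in range(0, len(sequence) - num_unrollings, num_unrollings):
        window = sequence[start:start + num_unrollings + 1]
        yield window[:num_unrollings], window[1:]


for inputs, labels in windows('the quick', num_unrollings=2):
    print(repr(inputs), '->', repr(labels))
# 'th' -> 'he'
# 'e ' -> ' q'
# 'qu' -> 'ui'
# 'ic' -> 'ck'
```

Within each window the labels are simply the inputs shifted by one position, which is what `train_data[:num_unrollings]` and `train_data[1:]` express.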

Next, we calculate the output for the data in each input placeholder and save it to a list called `outputs`.

```python
    # Unrolled LSTM loop (inside `with graph.as_default():`).
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
```

Now, calculating the logits for the softmax is a little bit tricky. This is a temporal (time-based) network, so after processing each set of `num_unrollings` batches through the LSTM cell, we update `saved_output` and `saved_state` before calculating `logits` and the `loss`. This is done by using `tf.control_dependencies`, which ensures that `logits` will not be calculated until `saved_output` and `saved_state` are updated. Finally, as you can see, `num_unrollings` acts as the amount of history we are remembering: it is the number of time steps the gradients are propagated back through, while the saved output and state carry information forward from earlier steps.

```python
    # State saving across unrollings (inside `with graph.as_default():`).
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier. TensorFlow 0.10 argument order: tf.concat(concat_dim, values).
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits, tf.concat(0, train_labels)))
```
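Why saving the state matters can be seen with a toy example in plain Python. This is a sketch under loud assumptions: `cell` is a decaying running sum, not an LSTM, and `run_in_chunks` is a hypothetical helper. It only illustrates the forward pass, showing that processing a sequence in chunks of `num_unrollings` steps gives the same outputs as processing it in one go, provided the state is carried over between chunks.

```python
def cell(x, state):
    """Toy recurrent step (not an LSTM): a decaying running sum."""
    return 0.5 * state + x


def run_in_chunks(sequence, num_unrollings, carry_state):
    """Process `sequence` in chunks of `num_unrollings` steps."""
    saved_state = 0.0  # plays the role of saved_output / saved_state
    outputs = []
    for start in range(0, len(sequence), num_unrollings):
        # Start each chunk from the saved state, or from scratch.
        state = saved_state if carry_state else 0.0
        for x in sequence[start:start + num_unrollings]:
            state = cell(x, state)
            outputs.append(state)
        saved_state = state  # the "assign" that happens at every training step
    return outputs


sequence = [1.0, 2.0, 3.0, 4.0]
print(run_in_chunks(sequence, num_unrollings=4, carry_state=True))   # [1.0, 2.5, 4.25, 6.125]
print(run_in_chunks(sequence, num_unrollings=2, carry_state=True))   # [1.0, 2.5, 4.25, 6.125]
print(run_in_chunks(sequence, num_unrollings=2, carry_state=False))  # [1.0, 2.5, 3.0, 5.5]
```

Without the carry-over, everything the network saw before the current chunk is lost at each chunk boundary.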

Next, we implement the optimizer. Remember, we should use gradient clipping (`tf.clip_by_global_norm`) to avoid the exploding gradient phenomenon. We also decay the `learning_rate` over time.

```python
    # Optimizer (inside `with graph.as_default():`).
    global_step = tf.Variable(0)
    # Start at 10.0 and multiply by 0.1 every 5000 steps.
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    # Rescale the gradients so that their global norm is at most 1.25.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
```
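To see what these two operations compute, here is a plain-Python sketch of both. `staircase_decay` and `clip_global_norm` are hypothetical stand-ins that mirror my understanding of `tf.train.exponential_decay` (with `staircase=True`) and `tf.clip_by_global_norm`; they work on nested lists of floats instead of tensors.

```python
import math


def staircase_decay(initial_rate, global_step, decay_steps, decay_rate):
    """The rate is multiplied by `decay_rate` once every `decay_steps` steps."""
    return initial_rate * decay_rate ** (global_step // decay_steps)


def clip_global_norm(gradients, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm.

    Returns the rescaled gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in gradients], global_norm


print(staircase_decay(10.0, 4999, 5000, 0.1))  # 10.0

clipped, norm = clip_global_norm([[3.0], [4.0]], clip_norm=1.25)
print(norm)     # 5.0
print(clipped)  # [[0.75], [1.0]]
```

Because every gradient is scaled by the same factor, clipping by the global norm shrinks the update without changing its direction.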

Now we are coming to the end of the graph definition. Here, we define the `train_prediction` tensor and several more input, output, and state variables (`sample_input`, `saved_sample_output`, `saved_sample_state`) used to generate new text after the training process. We also define the `reset_sample_state` operation to clear the memory at the start of every new generated sentence.

```python
    # Predictions (inside `with graph.as_default():`).
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
```
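`sample_prediction` is a probability distribution over the vocabulary, so to generate text we still need to pick one character from it. A minimal sketch of that step in plain Python follows; `sample_index` and `sample_one_hot` are hypothetical names, and drawing by walking the cumulative sum is my assumption about a reasonable way to do it, not necessarily the notebook's exact code.

```python
import random


def sample_index(distribution, rng=random):
    """Draw an index i with probability distribution[i] (inverse-CDF sampling)."""
    r = rng.uniform(0, 1)
    cumulative = 0.0
    for index, probability in enumerate(distribution):
        cumulative += probability
        if cumulative >= r:
            return index
    return len(distribution) - 1  # guard against floating-point rounding


def sample_one_hot(distribution, rng=random):
    """Turn a probability distribution into a one-hot sample of the same length."""
    vector = [0.0] * len(distribution)
    vector[sample_index(distribution, rng)] = 1.0
    return vector


rng = random.Random(0)  # seeded only to make the example repeatable
print(sample_one_hot([0.1, 0.7, 0.2], rng))
```

Sampling, rather than always taking the most likely character, is what keeps the generated sentences from repeating the same few characters over and over.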

### Training and Generating

Nothing too fancy here. I won’t break the code into sections to explain it, but I will write up high-level pseudocode to make it easy to understand.

```
# Initialize all variables
# For each step
#     Get the train inputs and outputs
#     Run the optimizer
#     For every summary_frequency steps:
#         Calculate mean_loss for the last set of batches
#         Calculate the perplexity for the last set of batches
#         For every 10*summary_frequency steps:
#             Generate 5 sentences with 80 characters
#             For each sentence
#                 Reset the state of the LSTM
#                 Sample a random letter
#                 For each character to generate
#                     Get the prediction for the last letter of the sentence
#                     Add the prediction to the sentence
#         Reset the state after the sentence generation
#         Calculate the perplexity of an independent predefined validation dataset
```

The above steps are implemented by the following code.

```python
import numpy as np

# `train_batches`, `valid_batches`, `valid_size`, `logprob`, `sample`, and
# `random_distribution` are defined elsewhere in the notebook.
num_steps = 7001
summary_frequency = 100
skip_window = 2  # not used in this snippet

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate],
            feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))
```
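The training loop reports perplexity as `np.exp(logprob(...))`. Here is a plain-Python sketch of what that measures, assuming `logprob` is the mean negative log-probability the model assigns to the true (one-hot) labels. The `1e-10` clamp and the `perplexity` wrapper are my additions for illustration.

```python
import math


def logprob(predictions, labels):
    """Mean negative log-probability assigned to the true (one-hot) labels."""
    total = 0.0
    for predicted_row, label_row in zip(predictions, labels):
        for p, y in zip(predicted_row, label_row):
            total += y * -math.log(max(p, 1e-10))  # clamp to avoid log(0)
    return total / len(labels)


def perplexity(predictions, labels):
    """Exponential of the mean negative log-probability."""
    return math.exp(logprob(predictions, labels))


# A model that spreads its probability evenly over 4 characters.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
labels = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
print(round(perplexity(uniform, labels), 6))  # 4.0
```

A perplexity of 4 means the model is as uncertain as if it were choosing uniformly among 4 characters, so lower values indicate better predictions, with 1.0 as the best possible score.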

## Conclusion

So that’s it for a basic LSTM network that generates text by learning from a given text file. Hope you enjoyed it!