Before proceeding, note that this post assumes an intermediate knowledge of how LSTM networks work. If you don’t have that, please look at Long Short Term Memory (LSTM) Networks: Demystified (Part 1) first.

I’m using the following versions

Python: 3.4

Tensorflow: 0.10.0

Let’s get right into it. To explain the implementation, I’m using code snippets from 6_lstm.ipynb, from the Deep Learning course on Udacity.

Now, before implementing, let’s understand what we want to achieve. We are trying to implement a generative network that can generate meaningful text. We will achieve this by training the model with **(input: the character at position t, output: the character at position t+1)** pairs for all the characters in the text. Now let’s see the specific implementation details.

First we have the following methods, which are very straightforward, so there is no need to dive into their details.

    def maybe_download(filename, expected_bytes):  # download data
    def read_data(filename):                       # read data as a string
    def char2id(char):                             # convert a character to an ID
    def id2char(dictid):                           # convert an ID to a character
    def batches2string(batches):                   # convert a given set of batches to a string
    def characters(probabilities):                 # convert softmax predictions to characters
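To give a feel for the two conversion helpers, here is a minimal sketch. It assumes the vocabulary is the 26 lowercase letters plus the space character, with space mapped to ID 0. That mapping is my assumption for illustration, so check the notebook for the exact implementation.

```python
import string

# Assumed vocabulary: space (ID 0) followed by 'a'..'z' (IDs 1..26).
vocabulary_size = len(string.ascii_lowercase) + 1
first_letter = ord(string.ascii_lowercase[0])


def char2id(char):
    """Convert a single character to an ID; anything outside a-z maps to 0."""
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    return 0


def id2char(dictid):
    """Convert an ID back to a character; ID 0 is the space character."""
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    return ' '


print(char2id('a'), char2id('z'), char2id(' '))
print(id2char(1), id2char(26), repr(id2char(0)))
```

With this mapping every character of the text becomes an integer below `vocabulary_size`, which is what lets us one-hot encode the batches later.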

Next we have the `BatchGenerator`, which generates `num_unrolling` batches of `batch_size` characters each time you call the method `next(self)`. Let’s first understand the high-level functionality of this class. `BatchGenerator` generates batches such that each batch is the input and the batch that follows it is the output. For example, given the sentence `'the quick brown fox '`, `num_unrolling=2` and `batch_size=10`, I can generate two batches the following way.

input = [t, e, q, i, k, b, o, n, f, x]

output = [h, ' ', u, c, ' ', r, w, ' ', o, ' ']

Note: In the actual implementations, characters are represented with a numerical ID
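The example above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not the notebook's `BatchGenerator` class: `make_batches` is a hypothetical helper, and it keeps raw characters instead of numerical IDs to stay readable.

```python
def make_batches(text, batch_size, num_batches):
    """Hypothetical helper: split `text` into `batch_size` equal segments and
    read one character per segment for each batch, moving every cursor
    forward by one character between batches."""
    segment = len(text) // batch_size
    cursors = [offset * segment for offset in range(batch_size)]
    batches = []
    for _ in range(num_batches):
        batches.append([text[c] for c in cursors])
        # Advance every cursor by one character, wrapping around at the end.
        cursors = [(c + 1) % len(text) for c in cursors]
    return batches


inputs, outputs = make_batches('the quick brown fox ', batch_size=10, num_batches=2)
print(inputs)
print(outputs)
```

Each position in `outputs` holds the character that follows the character at the same position in `inputs`, which is exactly the pairing the network is trained on.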

We initialize all the variables here. I’ve added a comment after every line of code to show what each of these variables corresponds to in the LSTM diagram from the previous post.

    num_nodes = 64

    graph = tf.Graph()
    with graph.as_default():

      # Parameters:
      # Input gate: input, previous output, and bias.
      ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xi
      im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_hi
      ib = tf.Variable(tf.zeros([1, num_nodes]))  # b_i
      # Forget gate: input, previous output, and bias.
      fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xf
      fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_hf
      fb = tf.Variable(tf.zeros([1, num_nodes]))  # b_f
      # Memory cell: input, state and bias.
      cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xc
      cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_hc
      cb = tf.Variable(tf.zeros([1, num_nodes]))  # b_c
      # Output gate: input, previous output, and bias.
      ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))  # W_xo
      om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))  # W_ho
      ob = tf.Variable(tf.zeros([1, num_nodes]))  # b_o
      # Variables saving state across unrollings.
      saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)  # h_t
      saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)  # c_t
      # Classifier weights and biases.
      w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))  # Softmax W
      b = tf.Variable(tf.zeros([vocabulary_size]))  # Softmax b

Next we define the operations of the LSTM cell. Nothing too fancy here. These operations are defined at the end of the previous post.

      # Definition of the cell computation.
      def lstm_cell(i, o, state):
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state
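For reference, the cell computation above can be written as the following equations. The weight and bias names follow the comments in the variable definitions. The symbols $x_t$ (input `i`), $h_{t-1}$ (previous output `o`), $c_{t-1}$ (previous `state`), $\tilde{c}_t$, $\sigma$ (sigmoid) and $\odot$ (element-wise product) are my own notation, so they may differ slightly from the diagram in the previous post.

```latex
\begin{aligned}
i_t &= \sigma(x_t W_{xi} + h_{t-1} W_{hi} + b_i) \\
f_t &= \sigma(x_t W_{xf} + h_{t-1} W_{hf} + b_f) \\
\tilde{c}_t &= \tanh(x_t W_{xc} + h_{t-1} W_{hc} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(x_t W_{xo} + h_{t-1} W_{ho} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The function returns the pair $(h_t, c_t)$, which becomes the previous output and state for the next time step.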

Because this is a sequential learning process, we cannot define just 2 placeholders for input and output. Instead we have to define `num_unrollings + 1` placeholders (`train_data`), where the first `num_unrollings` placeholders are the inputs and the last `num_unrollings` placeholders are the outputs. (Remember, each placeholder holds the output for the placeholder before it. Imagine a sliding window of size `num_unrollings`.)

      # Input data.
      train_data = list()
      for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
      train_inputs = train_data[:num_unrollings]
      train_labels = train_data[1:]  # labels are inputs shifted by one time step.
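To make the placeholder shapes and the one-step shift concrete, here is a small plain-Python sketch. The character IDs are toy values I picked (they spell "the " and "quic" under an assumed mapping of space to 0 and a to z to 1 through 26), and `one_hot` is a hypothetical helper rather than something from the notebook.

```python
def one_hot(char_id, vocabulary_size):
    """Hypothetical helper: encode one character ID as a one-hot row."""
    row = [0.0] * vocabulary_size
    row[char_id] = 1.0
    return row


vocabulary_size = 27
num_unrollings = 3

# num_unrollings + 1 batches, each with batch_size = 2 character IDs.
batch_ids = [[20, 17], [8, 21], [5, 9], [0, 3]]

# Every batch becomes a [batch_size, vocabulary_size] table of floats,
# matching the shape of the placeholders.
train_data = [[one_hot(i, vocabulary_size) for i in batch] for batch in batch_ids]

train_inputs = train_data[:num_unrollings]  # batches 0, 1, 2
train_labels = train_data[1:]               # batches 1, 2, 3

print(len(train_inputs), len(train_labels))
print(train_labels[0] == train_inputs[1])
```

Notice that the label for step `k` is simply the input for step `k + 1`, so only one extra batch is needed on top of the `num_unrollings` inputs.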

Next, we calculate the output for the data in each input placeholder and save it to a list called `outputs`.

      # Unrolled LSTM loop.
      outputs = list()
      output = saved_output
      state = saved_state
      for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

Now, calculating the logits for the softmax is a little bit tricky. This is a temporal (time-based) network. So after processing each set of `num_unrollings` batches through the LSTM cell, we update `saved_output` and `saved_state` before calculating the `logits` and the `loss`. This is done by using `tf.control_dependencies`, which ensures that `logits` will not be calculated until `saved_output` and `saved_state` are updated. Finally, as you can see, `num_unrollings` acts as the amount of history we backpropagate through at each training step, while the saved state still carries information over from earlier windows.

      # State saving across unrollings.
      with tf.control_dependencies([saved_output.assign(output),
                                    saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            logits, tf.concat(0, train_labels)))
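If the role of the saved output and state is still fuzzy, the following toy may help. It is a plain-Python, single-unit LSTM (scalars instead of matrices) with arbitrary made-up parameters, and it only covers the forward pass. It shows that processing a sequence as two short windows with the state carried over gives the same outputs as processing it as one long window. `run_window` and the `params` dictionary are my own illustrative names, not part of the notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell(x, h, c, p):
    """One step of a single-unit LSTM; `p` maps parameter names to scalars."""
    input_gate = sigmoid(x * p['ix'] + h * p['im'] + p['ib'])
    forget_gate = sigmoid(x * p['fx'] + h * p['fm'] + p['fb'])
    update = x * p['cx'] + h * p['cm'] + p['cb']
    c = forget_gate * c + input_gate * math.tanh(update)
    output_gate = sigmoid(x * p['ox'] + h * p['om'] + p['ob'])
    return output_gate * math.tanh(c), c


def run_window(inputs, saved_output, saved_state, p):
    """Unroll the cell over one window. Returns the outputs plus the final
    output and state, which the caller saves for the next window."""
    outputs = []
    h, c = saved_output, saved_state
    for x in inputs:
        h, c = lstm_cell(x, h, c, p)
        outputs.append(h)
    return outputs, h, c


# Arbitrary illustrative parameters (not trained values).
params = {'ix': 0.5, 'im': -0.3, 'ib': 0.0,
          'fx': 0.2, 'fm': 0.4, 'fb': 0.0,
          'cx': 0.7, 'cm': 0.1, 'cb': 0.0,
          'ox': -0.6, 'om': 0.3, 'ob': 0.0}
sequence = [1.0, 0.0, -1.0, 0.5]

# Two windows of two steps each, carrying the saved output and state over.
first, saved_output, saved_state = run_window(sequence[:2], 0.0, 0.0, params)
second, saved_output, saved_state = run_window(sequence[2:], saved_output, saved_state, params)

# One window of four steps, starting from the same zero state.
full, _, _ = run_window(sequence, 0.0, 0.0, params)

print(first + second)
print(full)
```

In the TensorFlow graph the "save for the next window" step is what the two `assign` operations do, and `tf.control_dependencies` makes sure they run whenever the loss is evaluated.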

Next, we implement the optimizer. Remember, we should use gradient clipping (`tf.clip_by_global_norm`) to avoid the exploding gradient phenomenon. Also, we decay the `learning_rate` over time.

      # Optimizer.
      global_step = tf.Variable(0)
      learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
      optimizer = tf.train.GradientDescentOptimizer(learning_rate)
      gradients, v = zip(*optimizer.compute_gradients(loss))
      gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
      optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
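Here is what those two TensorFlow calls compute, written as a plain-Python sketch on lists of floats. The function names mirror the TensorFlow ones only for readability. These are simplified re-implementations of the documented formulas, not the library code.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm
    is at most `clip_norm`. Gradients below the limit are left unchanged."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in gradients], global_norm


def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate that is multiplied by `decay_rate` once every
    `decay_steps` steps (the staircase=True behaviour)."""
    return initial_rate * decay_rate ** (step // decay_steps)


clipped, norm = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], clip_norm=1.25)
print(norm, clipped)

for step in (0, 4999, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```

Because all gradients are scaled by the same factor, clipping by the global norm shrinks the size of the update without changing its direction. With the settings used in the post, the learning rate stays at 10.0 for the first 5000 steps and then drops by a factor of 10.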

Now we are coming to the end of the variable definitions. Here, we define the `train_prediction` variable and several more `input`, `output` and `state` variables (`sample_input`, `saved_sample_output` and `saved_sample_state`) used to generate new text after the training process. We also define the `reset_sample_state` operation to clear the memory at the start of every newly generated sentence.

      # Predictions.
      train_prediction = tf.nn.softmax(logits)

      # Sampling and validation eval: batch 1, no unrolling.
      sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
      saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
      saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
      reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
      sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
      with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                    saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

Nothing too fancy here either. I won’t break the training code into sections to explain it. Instead, I will write up some high-level pseudocode to make it easy to understand.

    # Initialize all variables
    # For each step
    #   Get the train inputs and outputs
    #   Run the optimizer
    #   For every summary_frequency steps:
    #     Calculate mean_loss for the last set of batches
    #     Calculate the perplexity for the last set of batches
    #     For every 10*summary_frequency:
    #       Generate 5 sentences with 80 characters
    #       For each sentence
    #         Reset the state of LSTM
    #         Sample a random letter
    #         For each character to generate
    #           Get the prediction for the last letter of the sentence
    #           Add the prediction to the sentence
    #     Reset state after the sentence generation
    #     Calculate the perplexity of an independent predefined validation dataset

The above steps are implemented by the following code.

    num_steps = 7001
    summary_frequency = 100
    skip_window = 2

    with tf.Session(graph=graph) as session:
      tf.initialize_all_variables().run()
      print('Initialized')
      mean_loss = 0
      for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
          feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
          [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
          if step > 0:
            mean_loss = mean_loss / summary_frequency
          # The mean loss is an estimate of the loss over the last few batches.
          print(
            'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
          mean_loss = 0
          labels = np.concatenate(list(batches)[1:])
          print('Minibatch perplexity: %.2f' % float(
            np.exp(logprob(predictions, labels))))
          if step % (summary_frequency * 10) == 0:
            # Generate some samples.
            print('=' * 80)
            for _ in range(5):
              feed = sample(random_distribution())
              sentence = characters(feed)[0]
              reset_sample_state.run()
              for _ in range(79):
                prediction = sample_prediction.eval({sample_input: feed})
                feed = sample(prediction)
                sentence += characters(feed)[0]
              print(sentence)
            print('=' * 80)
          # Measure validation set perplexity.
          reset_sample_state.run()
          valid_logprob = 0
          for _ in range(valid_size):
            b = valid_batches.next()
            predictions = sample_prediction.eval({sample_input: b[0]})
            valid_logprob = valid_logprob + logprob(predictions, b[1])
          print('Validation set perplexity: %.2f' % float(np.exp(
            valid_logprob / valid_size)))
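The training loop relies on a few helpers that I haven't shown (`sample`, `random_distribution`, `logprob`). To illustrate the two ideas behind them, sampling a character from a softmax output and measuring perplexity, here is a simplified plain-Python sketch. It is not the notebook's code: the function names and the use of label IDs instead of one-hot arrays are my own simplifications.

```python
import math
import random


def sample_distribution(distribution):
    """Draw one index from a list of probabilities that sums to 1."""
    r = random.random()
    cumulative = 0.0
    for index, probability in enumerate(distribution):
        cumulative += probability
        if r < cumulative:
            return index
    # Guard against rounding error in the cumulative sum.
    return len(distribution) - 1


def perplexity(predictions, labels):
    """exp of the average negative log-probability given to the true characters.

    predictions: list of probability rows, one per character.
    labels: list of the true character IDs.
    """
    total = 0.0
    for row, label in zip(predictions, labels):
        # Floor tiny probabilities so that log(0) cannot occur.
        total += -math.log(max(row[label], 1e-10))
    return math.exp(total / len(labels))


uniform = [1.0 / 27] * 27
print(perplexity([uniform, uniform], [3, 14]))
print(sample_distribution([0.0, 1.0, 0.0]))
```

A model that guesses uniformly over 27 characters has a perplexity of about 27, so anything well below that means the network has learned something about the text. Sampling from the predicted distribution, rather than always taking the most likely character, is what keeps the generated sentences from repeating the same few letters.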

So that’s it for a basic LSTM network that generates text by learning from a given text file. Hope you enjoyed it!

IPython Notebook: Here