TensorFlow Version: 1.2

Original paper: Convolutional Neural Networks for Sentence Classification

Full code: Here

So,

you’re all exhausted from trying to implement a Recurrent Neural Network with TensorFlow to classify sentences? You somehow wrote some TensorFlow code that looks like an RNN, but it fails to live up to its god-like standards while turning your tasteless, watery text into wine? I know how you feel, as I’ve been in the same position. But there’s good news and bad news. First the good news: for the moment you can forget about RNNs and get comfy in your seat, because we will be using Convolutional Neural Networks (CNNs) to classify sentences. The bad news: CNNs have limited power in the domain of NLP, and at some point you will have to turn your head towards RNN-based algorithms if you want to go further in NLP.

Though of limited use, classifying sentences is not without its advantages. For example, think of a company wanting to mine the online feedback it receives. Often such feedback forms have a limited number of words, so we can generally assume the content of that field to be a sentence (though it can be a few!). Now if the CEO of the company asks, “What’s the general impression of the public about our company?”, you should have some condensed, rich statistics supporting your findings. Trust me, walking in with a pile of printed feedback forms won’t end well. An effective way to do this is to write an algorithm that outputs + or – depending on the type of feedback. And thankfully, we can use a CNN for this.

A few other such tasks include automatic movie rating and identifying the type of a question (e.g. useful for chatbots).

You might want to keep reading even if you can implement a CNN with your eyes closed. This is not the everyday CNN you would see. So, first let us get to know this new cool kid in town well!

CNNs perform either 1D or 2D convolutions, which require 2D or 3D inputs respectively. In this task we will be using 1D convolutions; therefore, we need to represent each sentence as a 2D matrix. Here is the formulation.

Let us assume a sentence is composed of *n* words, where *n* is a fixed number. In order to deal with variable-sized sentences, we set *n* to be the number of words in the longest sentence and pad all the other sentences with a special character so that their length is *n*. Each word is represented by a vector of length *k*. Such word vectors can be obtained by,

- One-hot encoding the distinct words found in sentences
- Learning word embeddings using algorithms such as Skip-gram or CBOW or GloVe

Therefore, for *b* sentences, each with *n* words of *k* dimensions, we can define a matrix of size *b × n × k*.

Let us illustrate this with an example. Consider the following two sentences. The matrix can be built as follows.

- Mark plays basketball
- Abby likes to play Tennis on Sundays
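As a minimal sketch of this formulation (using NumPy rather than the article’s TensorFlow code; the `word_to_id` mapping and `<PAD>` token are illustrative choices), building the padded one-hot matrix for these two sentences could look like:

```python
import numpy as np

sentences = [
    "Mark plays basketball".split(),
    "Abby likes to play Tennis on Sundays".split(),
]

# Vocabulary over all distinct words, plus a special padding token
vocab = sorted({w for s in sentences for w in s}) + ["<PAD>"]
word_to_id = {w: i for i, w in enumerate(vocab)}

n = max(len(s) for s in sentences)  # length of the longest sentence
k = len(vocab)                      # one-hot vector length per word
b = len(sentences)

# b x n x k matrix: one one-hot row per word, shorter sentences padded
X = np.zeros((b, n, k), dtype=np.float32)
for i, s in enumerate(sentences):
    padded = s + ["<PAD>"] * (n - len(s))
    for j, w in enumerate(padded):
        X[i, j, word_to_id[w]] = 1.0

print(X.shape)  # (2, 7, 11): b=2 sentences, n=7 words, k=11 vocabulary entries
```

The same matrix could instead be filled with learned embeddings; only the last dimension (*k*) would change.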

The CNN architecture used for classifying sentences differs from a CNN used in the computer vision domain in several aspects.

- The input is 1-dimensional (ignoring the dimensionality of the word vectors), whereas CNNs are typically designed to deal with data containing spatial information (e.g. images)
- The output of the convolution operations has only a single channel depth, compared to the hundreds of channels used in image classification tasks.
- A single convolution layer has multiple sub-layers performing convolution with different filter sizes (similar to inception modules)
- The pooling operation has a kernel size the same as the size of the output. Therefore, for the output of a single sub-layer, the pooling layer produces a single scalar

Though the paper redefines the precise execution of the convolution and pooling operations, the high-level functionality is no different from a standard CNN. We outline the CNN architecture below.

Now let us delve into the details of the architecture.

This is simply a 1D convolution on the sentence matrix. Remember that our input sentence matrix is of size *b × n × k*. Let us drop *b* and assume a single sentence (i.e. an *n × k* input). Now let us consider a convolution filter of width *m* and an input channel depth of *k* (i.e. an *m × k* filter). We convolve the input over the *n* dimension and produce an output of size *n × 1* (with zero padding).

Similarly, we use *q* convolution filters of different widths to produce *q* convolutional outputs, each of size *n × 1*. Then we concatenate these outputs to produce a new output of size *n × q*.
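The mechanics of this 1D convolution can be sketched in plain NumPy (the sizes *n*, *k*, *m* below are illustrative, and the random matrices stand in for a sentence matrix and a learned filter):

```python
import numpy as np

n, k, m = 7, 11, 3           # sentence length, word dimension, filter width
rng = np.random.default_rng(0)

S = rng.random((n, k))       # stand-in for an n x k sentence matrix
W = rng.random((m, k))       # a single m x k convolution filter

# Zero-pad along the word dimension so the output keeps length n ('SAME' padding)
pad = m // 2
S_pad = np.pad(S, ((pad, pad), (0, 0)))

# Slide the filter over the n dimension; each step is a full m x k element-wise
# product summed to a single scalar
out = np.array([np.sum(S_pad[i:i + m] * W) for i in range(n)])
print(out.shape)  # (7,), i.e. an n x 1 output
```

With *q* such filters, stacking the *q* resulting vectors column-wise gives the *n × q* concatenated output described above.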

How can the convolution operation help in NLP tasks? To understand this, let us consider the following sentences from the context of a movie-rating classification problem.

The movie plot was not amazing.

The movie plot was amazing. I could not believe it

Let us write down the words that fit within the convolution window as the window moves over the first sentence, and compare with what the bag-of-words method produces.

[The, movie, plot] ... [plot, was, not] [was, not, amazing]

In the bag of words model, if you consider the vector,

The, movie, plot, was, not, amazing, I, could, believe, it

we get

1,1,1,1,1,1,0,0,0,0

For the second sentence we have,

[The, movie, plot] ... [plot, was, amazing] ... [could, not, believe]

The bag of words model produces the following for the second sentence.

1,1,1,1,1,1,1,1,1,1

Let us now compare what just happened.

I brought up this example to show that, depending on the placement of the word “not”, the review can change from positive to negative. The convolutional representation of sentences preserves this contextual information: it helps the model see that `was, not, amazing` is different from `plot, was, amazing`. In the bag-of-words model, however, the contextual information is lost (it loses information about the placement of the word “not”), which can lead to false input representations.
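A short sketch makes this concrete (the `windows` helper is illustrative, not from the article's code): the window capturing the negation exists only in the first sentence, even though both sentences contain all three words.

```python
def windows(sentence, width=3):
    """Return all contiguous word windows of the given width."""
    words = sentence.split()
    return [tuple(words[i:i + width]) for i in range(len(words) - width + 1)]

s1 = "The movie plot was not amazing"
s2 = "The movie plot was amazing I could not believe it"

w1, w2 = set(windows(s1)), set(windows(s2))

# The negated phrase appears as a window only in the first sentence
print(("was", "not", "amazing") in w1)  # True
print(("was", "not", "amazing") in w2)  # False
```

A bag-of-words vector over the shared words cannot make this distinction, since it only records which words occur, not where.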

Still ignoring the batch size, the pooling-over-time layer receives the convolution output of size *n × q*. The pooling operation picks only the maximum element from the output produced by each convolutional sub-layer. Therefore, this operation transforms the *n × q* input to a *1 × q* output.
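In NumPy terms (a sketch with illustrative sizes, not the article's code), pooling over time is just a max over the whole word dimension:

```python
import numpy as np

n, q = 7, 3                    # sentence length, number of conv sub-layers
rng = np.random.default_rng(1)
conv_out = rng.random((n, q))  # stand-in for the concatenated n x q conv output

# Pooling over time: one maximum per sub-layer, taken over all n positions
pooled = conv_out.max(axis=0)
print(pooled.shape)  # (3,), i.e. a 1 x q output
```

Intuitively, each sub-layer contributes the single strongest response its filter produced anywhere in the sentence.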

Let us dive with confidence into the implementation, as we’ve covered all the important bits of the CNN. The following code is available on Github. I’m not going to talk in detail about how to generate the data, as it is just some Python function crunching to put the data in the correct format.

First we define the input placeholder (size: `batch_size × sent_length × vocabulary_size`) and the output placeholder (size: `batch_size × num_classes`).

```python
batch_size = 16

# Inputs and labels
sent_inputs = tf.placeholder(shape=[batch_size,sent_length,vocabulary_size],dtype=tf.float32,name='sentence_inputs')
sent_labels = tf.placeholder(shape=[batch_size,num_classes],dtype=tf.float32,name='sentence_labels')
```

Here we define the weights and biases for 1D convolution operation.

```python
# 3 filters with different context window sizes (3, 5, 7)
# Each filter spans the full one-hot-encoded length of each word and the context window width
w1 = tf.Variable(tf.truncated_normal([3,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_1')
b1 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_1')
w2 = tf.Variable(tf.truncated_normal([5,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_2')
b2 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_2')
w3 = tf.Variable(tf.truncated_normal([7,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_3')
b3 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_3')
```

Here we calculate the convolution output and apply a non-linear activation (i.e. Tanh).

```python
# Calculate the output for all the filters with a stride of 1
h1_1 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w1,stride=1,padding='SAME') + b1)
h1_2 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w2,stride=1,padding='SAME') + b2)
h1_3 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w3,stride=1,padding='SAME') + b3)
```

We calculate the pooling over time output at this point. We calculate max of each convolution output and concatenate them to produce the output.

```python
# Pooling over time: take the max over the word dimension of each conv output
h2_1 = tf.reduce_max(h1_1,axis=1)
h2_2 = tf.reduce_max(h1_2,axis=1)
h2_3 = tf.reduce_max(h1_3,axis=1)
h2 = tf.concat([h2_1,h2_2,h2_3],axis=1)
```

The weights and bias of the fully connected output layer (i.e. softmax layer).

```python
# h2 is of size [batch_size, 3] (one scalar per convolutional sub-layer)
h2_shape = h2.get_shape().as_list()
w_fc1 = tf.Variable(tf.truncated_normal([h2_shape[1],num_classes],stddev=0.005,dtype=tf.float32),name='weights_fulcon_1')
b_fc1 = tf.Variable(tf.random_uniform([num_classes],0,0.01,dtype=tf.float32),name='bias_fulcon_1')
```

Calculate logits (final output before applying the softmax) and predictions.

```python
# Since h2 is 2D [batch_size, output_width], reshaping the output is not
# required as it usually is in CNNs
logits = tf.matmul(h2,w_fc1) + b_fc1
predictions = tf.argmax(tf.nn.softmax(logits),axis=1)
```

Define the loss function (i.e. cross entropy loss) and an optimizer to minimize the loss.

```python
# Loss (cross-entropy)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=sent_labels,logits=logits))
# Momentum optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,momentum=0.9).minimize(loss)
```
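For intuition on what `tf.nn.softmax_cross_entropy_with_logits` computes, here is a small NumPy sketch of the same quantity (a numerically stable log-softmax followed by cross-entropy, averaged over the batch; the toy labels and logits are illustrative):

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    # Subtract the row max for numerical stability, then compute log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Cross-entropy per example, averaged over the batch
    return -(labels * log_probs).sum(axis=1).mean()

labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot class labels
logits = np.array([[2.0, 0.5], [0.3, 1.7]])   # pre-softmax scores
print(round(softmax_cross_entropy(labels, logits), 4))
```

The TensorFlow op fuses these steps for stability and efficiency, which is why it takes raw logits rather than softmax outputs.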

Yup, this is it. You should be able to achieve a decent accuracy on the given dataset, without any fancy techniques such as batch normalization or the Adam optimizer.

Full code: Here

PS: Sorry about the click-baity title. But I couldn’t help it.

Cheers.
