Make CNNs for NLP Great Again! Classifying Sentences with CNNs in Tensorflow

Tensorflow Version: 1.2
Original paper: Convolutional Neural Networks for Sentence Classification
Full code: Here

RNNs can be miracle workers, but…


Are you exhausted from trying to implement a Recurrent Neural Network with Tensorflow to classify sentences? Did you somehow write Tensorflow code that looks like an RNN but fails to live up to its god-like standard of turning your tasteless, watery text into wine? I know how you feel, as I’ve been in the same position. But there’s good news and bad news. First the good news: for the moment you can forget about RNNs and get comfy in your seat, because we will be using Convolutional Neural Networks (CNNs) to classify sentences. The bad news: CNNs have limited power in the domain of NLP, and at some point you will have to turn your head towards RNN-based algorithms if you want to go further in NLP.

Why classify sentences?

Though of limited use, classifying sentences is not without its advantages. For example, think of a company wanting to mine the online feedback it has received. Such feedback forms often allow only a limited number of words, so we can generally assume the content of that field to be a single sentence (though it can be a few!). Now if the CEO of the company asks, “What’s the general impression of the public about our company?”, you should have some condensed, rich statistics supporting your findings. Trust me, you don’t want to walk in with a pile of printed feedback forms. An effective way to produce such statistics is to write an algorithm that outputs + or – depending on the type of feedback. And thankfully, we can use a CNN for this.

A few other such tasks include automatic movie rating and identifying the type of a question (e.g. useful for chatbots).

Let’s buckle up and classify some sentences

You might want to keep reading even if you can implement a CNN with your eyes closed. This is not the everyday CNN you would see, so first let us get to know this cool new kid in town well!

Transforming a sentence into a Matrix

CNNs perform either 1D or 2D convolution, which requires a 2D or 3D input. In this task, we will be using 1D convolutions, therefore, we need to represent the sentence with a 2D matrix. Here is the formulation.

Let us assume a sentence is composed of n words, where n is a fixed number. To deal with variable-sized sentences, we set n to be the number of words in the longest sentence and pad all shorter sentences with a special character so that each has length n. Each word is represented by a vector of length k. Such word vectors can be obtained by,

  1. One-hot encoding the distinct words found in sentences
  2. Learning word embeddings using algorithms such as Skip-gram, CBOW, or GloVe

Therefore, for b sentences, each with n words of k dimensions, we can define a matrix of size b\times n\times k.

Let us illustrate this with an example. Consider the following two sentences, from which such a matrix can be built.

  • Mark plays basketball
  • Abby likes to play Tennis on Sundays
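As a rough sketch of the padding and one-hot scheme described above (the `"<pad>"` token and the helper names below are my own, not from the paper), the b×n×k sentence matrix can be built with NumPy:

```python
import numpy as np

# Toy corpus; "<pad>" is an assumed padding token (not named in the paper)
sentences = [
    ["Mark", "plays", "basketball"],
    ["Abby", "likes", "to", "play", "Tennis", "on", "Sundays"],
]

n = max(len(s) for s in sentences)          # length of the longest sentence
vocab = {"<pad>": 0}
for sent in sentences:
    for w in sent:
        vocab.setdefault(w, len(vocab))
k = len(vocab)                              # one-hot dimensionality

# b x n x k matrix: each sentence padded to n words, each word one-hot encoded
matrix = np.zeros((len(sentences), n, k), dtype=np.float32)
for i, sent in enumerate(sentences):
    padded = sent + ["<pad>"] * (n - len(sent))
    for j, w in enumerate(padded):
        matrix[i, j, vocab[w]] = 1.0

print(matrix.shape)  # (2, 7, 11)
```

Each of the 2×7 word positions holds exactly one 1, so the first sentence simply carries four copies of the `"<pad>"` one-hot vector at its tail.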

CNN Architecture

The CNN architecture used for classifying sentences differs from a CNN used in the computer vision domain in several aspects.

  • The input is 1-dimensional (ignoring the dimensionality of the word), whereas CNNs are typically designed to deal with data containing spatial information (e.g. images)
  • The output of the convolution operations has only a single channel depth, compared to the hundreds of channels used in image classification tasks.
  • A single convolution layer has multiple sub-layers performing convolution with different filter sizes (similar to inception modules)
  • The pooling operation has a kernel size equal to the size of the convolution output. Therefore, for the output of a single sub-layer, the pooling layer produces a single scalar

Though the paper redefines the precise execution of the convolution and pooling operations, the high-level functionality is no different from that of a standard CNN. We outline the CNN architecture below.

Now let us delve into the details of the architecture.


This is simply 1D convolution on the sentence matrix. Remember that our input sentence matrix is b\times n \times k. Let us drop b and assume a single sentence (i.e. an n\times k input). Now let us consider a convolution filter of width m and an input channel depth of k (i.e. {\rm I\!R}^{m\times k}). Now we convolve the input over the 1^{st} dimension and produce an output h of size {\rm I\!R}^{1\times n} (with zero padding).

Similarly, we use l convolution filters of different widths m to produce l convolutional outputs, each of size 1 \times n. Then we concatenate these outputs to produce a new matrix H = \{h_1, h_2, \ldots, h_l\} \in {\rm I\!R}^{n\times l}
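For intuition only (plain NumPy, not the Tensorflow code used later; the function name and random inputs are my own), a single 'same'-padded 1D convolution and the concatenation into H can be sketched as:

```python
import numpy as np

def conv1d_same(x, w):
    """x: (n, k) sentence matrix, w: (m, k) filter -> (n,) output, zero-padded."""
    n, k = x.shape
    m = w.shape[0]
    left = (m - 1) // 2
    # Zero-pad so the output keeps length n ('same' padding)
    padded = np.vstack([np.zeros((left, k)), x, np.zeros((m - 1 - left, k))])
    return np.array([np.sum(padded[i:i + m] * w) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 11))                     # n=7 words, k=11 dims
filters = [rng.standard_normal((m, 11)) for m in (3, 5, 7)]

# l=3 filters of different widths -> H of shape (n, l)
H = np.stack([conv1d_same(x, w) for w in filters], axis=1)
print(H.shape)  # (7, 3)
```

Each column of H is one filter's response at every word position, which is exactly what the pooling-over-time layer will consume.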

Importance of the Convolution in NLP

How can the convolution operation help in NLP tasks? To understand this, let us consider the following sentence in the context of a movie rating classification problem.

The movie plot was amazing. I enjoyed every second of it.

Let us write down a few of the word sets that fall within the convolution window (here, of width 5) as it moves over the sentence.

[The, movie, plot, was, amazing]
[movie, plot, was, amazing, I]
[I, enjoyed, every, second, of]
[enjoyed, every, second, of, it]
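The windows above can be enumerated with a few lines of plain Python (an illustrative sketch, not part of the model; `conv_windows` is a made-up helper name):

```python
def conv_windows(words, m):
    """All positions of a width-m convolution window over a word list."""
    return [words[i:i + m] for i in range(len(words) - m + 1)]

words = "The movie plot was amazing I enjoyed every second of it".split()
for window in conv_windows(words, 5):
    print(window)
# The first window printed is ['The', 'movie', 'plot', 'was', 'amazing']
```

For an 11-word sentence and a width-5 window, this yields 7 windows in total.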

As you can see, the words falling within the convolution window often carry enough information to classify a given sentence; we do not need to look at all the words at once.

Pooling over time

Still ignoring the batch size, the pooling-over-time layer receives the output H of size n\times l. The pooling operation picks only the maximum element from the output produced by each convolutional sub-layer. Therefore, this operation transforms the n\times l input into a 1\times l output.
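Continuing the NumPy sketch (shapes assumed as above, with a made-up H so the result is easy to check by eye), pooling over time is just a column-wise max:

```python
import numpy as np

n, l = 7, 3
# Stand-in for the conv output H (n positions x l sub-layers)
H = np.arange(n * l, dtype=np.float32).reshape(n, l)

# Max over the time (word-position) axis: (n, l) -> (l,)
pooled = H.max(axis=0)
print(pooled)  # [18. 19. 20.]
```

One scalar per convolutional sub-layer survives, no matter how long the sentence was, which is what lets the fully connected layer have a fixed input size.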


Let us dive with confidence into the implementation, as we’ve covered all the important bits of the CNN. The following code is available on GitHub. I’m not going to talk in detail about how the data is generated, as that is just some Python number crunching to put the data in the correct format.

First we define the input placeholder (size: b\times n\times k) and output (size: b\times C)

import tensorflow as tf

batch_size = 16
# sent_length (n), vocabulary_size (k) and num_classes (C) are set
# during data generation

# inputs and labels
sent_inputs = tf.placeholder(shape=[batch_size,sent_length,vocabulary_size],dtype=tf.float32,name='sentence_inputs')
sent_labels = tf.placeholder(shape=[batch_size,num_classes],dtype=tf.float32,name='sentence_labels')

Here we define the weights and biases for 1D convolution operation.

# 3 filters with different context window sizes (3,5,7)
# Each of these filters spans the full one-hot-encoded length of a word and the context window width
w1 = tf.Variable(tf.truncated_normal([3,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_1')
b1 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_1')

w2 = tf.Variable(tf.truncated_normal([5,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_2')
b2 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_2')

w3 = tf.Variable(tf.truncated_normal([7,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_3')
b3 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_3')

Here we calculate the convolution output and apply a non-linear activation (i.e. Tanh).

# Calculate the output for all the filters with a stride 1
h1_1 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w1,stride=1,padding='SAME') + b1)
h1_2 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w2,stride=1,padding='SAME') + b2)
h1_3 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w3,stride=1,padding='SAME') + b3)

We calculate the pooling-over-time output at this point: we take the max of each convolution output and concatenate the results.

h2_1 = tf.reduce_max(h1_1,axis=1)
h2_2 = tf.reduce_max(h1_2,axis=1)
h2_3 = tf.reduce_max(h1_3,axis=1)

h2 = tf.concat([h2_1,h2_2,h2_3],axis=1)

The weights and bias of the fully connected output layer (i.e. softmax layer).

h2_shape = h2.get_shape().as_list() # [batch_size, output_width]

w_fc1 = tf.Variable(tf.truncated_normal([h2_shape[1],num_classes],stddev=0.005,dtype=tf.float32),name='weights_fulcon_1')
b_fc1 = tf.Variable(tf.random_uniform([num_classes],0,0.01,dtype=tf.float32),name='bias_fulcon_1')

Calculate logits (final output before applying the softmax) and predictions.

# since h2 is 2D [batch_size,output_width], reshaping the output is not required as is usually done in CNNs
logits = tf.matmul(h2,w_fc1) + b_fc1

predictions = tf.argmax(tf.nn.softmax(logits),axis=1)

Define the loss function (i.e. cross entropy loss) and an optimizer to minimize the loss.

# Loss (Cross-Entropy)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=sent_labels,logits=logits))

# Momentum Optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,momentum=0.9).minimize(loss)
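For readers who want to verify the loss by hand, the softmax cross-entropy used above is equivalent to the following NumPy computation (a sketch with made-up logits and labels; the function name is my own):

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    """Mean cross-entropy between one-hot labels and unnormalized logits."""
    # Subtract the row max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())

logits = np.array([[2.0, 0.5], [0.1, 1.2]])   # batch of 2, num_classes = 2
labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot ground truth
print(softmax_cross_entropy(labels, logits))
```

The smaller this value, the more probability mass the softmax puts on the correct class for each sentence in the batch.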

Yup, this is it. You should be able to achieve around \sim 90\% accuracy on the given dataset without any fancy techniques such as batch normalization or the Adam optimizer.


PS: Sorry about the click-baity title. But I couldn’t help it.