Make CNNs for NLP Great Again! Classifying Sentences with CNNs in Tensorflow

Tensorflow Version: 1.2
Original paper: Convolutional Neural Networks for Sentence Classification
Full code: Here

RNNs can be miracle workers, but…


So you’re exhausted from trying to implement a Recurrent Neural Network (RNN) with Tensorflow to classify sentences? You somehow wrote some Tensorflow code that looks like an RNN, but it fails to live up to its god-standards while turning your tasteless, watery text into wine? I know how you feel, as I’ve been in the same position. But there’s good news and bad news. First the good news: for the moment you can forget about RNNs and get comfy in your seat, because we will be using Convolutional Neural Networks (CNNs) to classify sentences. The bad news: CNNs have limited power in the domain of NLP, and at some point you will have to turn your head back towards RNN-based algorithms if you want to keep doing NLP.

Why classify sentences?

Though limited in usage, classifying sentences is not without its advantages. For example, think of a company wanting to mine the online feedback it receives. Such feedback forms often allow a limited number of words, so we can generally assume the content of each field to be a sentence (though it can be a few!). Now if the CEO of the company asks, “What’s the general impression of the public about our company?”, you should have some condensed, rich statistics supporting your findings. Trust me, you don’t want to walk in with a pile of printed feedback forms. An effective way to do this is to write an algorithm which can output + or – depending on the type of feedback. And thankfully, we can use a CNN for this.

A few other such tasks include automatic movie rating, identifying the type of a question (e.g. useful for chatbots), etc.

Let’s buckle up and classify some sentences

You might want to keep reading even if you can implement a CNN with your eyes closed. This is not the everyday CNN you would see. So, first let us get to know this new cool kid in town well!

Transforming a sentence into a Matrix

CNNs perform either 1D or 2D convolution, which requires a 2D or 3D input respectively. In this task, we will be using 1D convolutions; therefore, we need to represent each sentence with a 2D matrix. Here is the formulation.

Let us assume a sentence is composed of n words, where n is a fixed number. To deal with variable-sized sentences, we set n to be the number of words in the longest sentence and pad all shorter sentences with a special character so that each has length n. Each word is represented by a vector of length k. Such word vectors can be obtained by,

  1. One-hot encoding the distinct words found in sentences
  2. Learning word embeddings using algorithms such as Skip-gram, CBOW, or GloVe

Therefore, for b sentences, each with n words of k dimensions, we can define a matrix of size b\times n\times k.

Let us illustrate this with an example. Consider the following two sentences. The matrix can be built as follows.

  • Mark plays basketball
  • Abby likes to play Tennis on Sundays
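To make this concrete, here is a minimal NumPy sketch that builds the b\times n\times k matrix for these two sentences using one-hot word vectors. The padding token name (<PAD>) is an assumption for illustration; the original article does not specify it.

```python
import numpy as np

sentences = [
    "Mark plays basketball".split(),
    "Abby likes to play Tennis on Sundays".split(),
]

# n = length of the longest sentence; pad the shorter ones
n = max(len(s) for s in sentences)
padded = [s + ["<PAD>"] * (n - len(s)) for s in sentences]

# k = size of the vocabulary (one-hot dimensionality)
vocab = sorted({w for s in padded for w in s})
k = len(vocab)
word_to_id = {w: i for i, w in enumerate(vocab)}

# b x n x k matrix of one-hot word vectors
mat = np.zeros((len(sentences), n, k), dtype=np.float32)
for i, sent in enumerate(padded):
    for j, w in enumerate(sent):
        mat[i, j, word_to_id[w]] = 1.0

print(mat.shape)  # (2, 7, 11): b=2 sentences, n=7 words, k=11 distinct tokens
```

In a real pipeline the one-hot rows would typically be replaced by learned embeddings, but the b\times n\times k shape stays the same.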

CNN Architecture

The CNN architecture used for classifying sentences is different from a CNN used in the computer vision domain in several aspects.

  • The input is 1-dimensional (ignoring the dimensionality of the word vectors), whereas CNNs are typically designed to deal with data containing 2D spatial information (e.g. images)
  • The output of the convolution operations has only a single channel depth, compared to the hundreds of channels used in image classification tasks.
  • A single convolution layer has multiple sub-layers performing convolution with different filter sizes (similar to inception modules)
  • The pooling operation has a kernel size equal to the size of the convolution output. Therefore, for the output of a single sub-layer, the pooling layer produces a single scalar

Though the paper redefines the precise execution of the convolution and pooling operations, the high-level functionality is no different from that of a standard CNN. We outline the CNN architecture below.

Now let us delve into the details of the architecture.

The convolution operation

This is simply 1D convolution on the sentence matrix. Remember that our input sentence matrix is b\times n \times k. Let us drop b and assume a single sentence (i.e. an n\times k input). Now let us consider a convolution filter of width m and an input channel depth of k (i.e. {\rm I\!R}^{m\times k}). We convolve the input over the 1^{st} dimension (with zero padding) and produce an output h of size {\rm I\!R}^{1\times n}.

Similarly, we use l convolution filters of different filter widths m to produce l convolutional outputs, each of size 1 \times n. Then we concatenate these outputs to produce a new output H = \{h_1, h_2, \ldots, h_l\} \in {\rm I\!R}^{n\times l}
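The operation above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the Tensorflow implementation used later in the post; the sizes (n=7, k=11) and the random filters are assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 7, 11                      # sentence length, word-vector dimensionality
x = rng.standard_normal((n, k))   # one sentence matrix (batch dim dropped)

def conv1d_same(x, w):
    """1D convolution over the word axis with zero ('SAME') padding."""
    m = w.shape[0]                              # filter width
    pad = m // 2
    xp = np.pad(x, ((pad, m - 1 - pad), (0, 0)))
    # each output position sums an m x k input patch times the m x k filter
    return np.array([(xp[i:i + m] * w).sum() for i in range(n)])

# l = 3 filters of different widths, each yielding a 1 x n output
filters = [rng.standard_normal((m, k)) for m in (3, 5, 7)]
H = np.stack([conv1d_same(x, w) for w in filters], axis=1)
print(H.shape)  # (7, 3) -> n x l
```

Each filter slides over the word axis only, so every sub-layer contributes one length-n column to H.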

Importance of the Convolution in NLP

How can the convolution operation help in NLP tasks? To understand this, let us consider the following sentences from the context of a movie-rating classification problem.

The movie plot was not amazing. 
The movie plot was amazing. I could not believe it 

Let us write down the words that fall within the convolution window as it moves across the sentence, alongside what the bag of words method produces.

Convolution – First Sentence

[The, movie, plot]
[plot, was, not]
[was, not, amazing]

Bag of Words – First Sentence

In the bag of words model, if we consider the vocabulary vector,

[The, movie, plot, was, not, amazing, I, could, believe, it]

we get

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Convolution – Second Sentence

For the second sentence we have,

[The, movie, plot]
[plot, was, amazing]
[could, not, believe]

Bag of Words – Second Sentence

The bag of words model produces the following for the second sentence.

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Let us now compare what just happened.


I brought up this example to show that, depending on the placement of the word “not”, the review can change from positive to negative. The convolutional representation of sentences preserves this contextual information: it helps the model see that [was, not, amazing] is different from [plot, was, amazing]. In the bag of words model, however, the contextual information is lost (the model loses track of where the word “not” appears), which can lead to misleading input representations.
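To make the contrast concrete, here is a minimal Python sketch of both representations for the two example sentences (punctuation dropped for simplicity; the window extraction uses stride 1, so it produces every window, of which the article lists a subset).

```python
s1 = "The movie plot was not amazing".split()
s2 = "The movie plot was amazing I could not believe it".split()

def windows(sent, m=3):
    """All width-m convolution windows (stride 1)."""
    return [sent[i:i + m] for i in range(len(sent) - m + 1)]

vocab = ["The", "movie", "plot", "was", "not", "amazing",
         "I", "could", "believe", "it"]

def bag_of_words(sent):
    """Word-count vector over the fixed vocabulary."""
    return [sent.count(w) for w in vocab]

# only the negative review has "not" and "amazing" in the same window
print([w for w in windows(s1) if "not" in w and "amazing" in w])
print([w for w in windows(s2) if "not" in w and "amazing" in w])

# bag of words cannot tell where "not" occurred
print(bag_of_words(s1))  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(bag_of_words(s2))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

The windowed representation keeps “not” attached to “amazing” in the first sentence only, while the bag of words vectors differ merely in which words appear, not where.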

Pooling over time

Still ignoring the batch size, the pooling-over-time layer receives the output H of size n\times l. The pooling operation picks only the maximum element from the output produced by each convolutional sub-layer. Therefore, this operation transforms the n\times l input into a 1\times l output.
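In NumPy terms, pooling over time is just a max over the word axis. The values below are an arbitrary stand-in for a real conv output, used only to show the shape change.

```python
import numpy as np

n, l = 7, 3   # sentence length, number of convolutional sub-layers
H = np.arange(n * l, dtype=np.float32).reshape(n, l)  # stand-in conv output

# max over the time (word) axis: n x l -> 1 x l
pooled = H.max(axis=0, keepdims=True)
print(pooled.shape)  # (1, 3)
print(pooled)        # [[18. 19. 20.]]
```

Each sub-layer is thus reduced to a single scalar (its strongest activation anywhere in the sentence), regardless of the sentence length n.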


Implementation

Let us dive with confidence into the implementation, as we’ve covered all the important bits of the CNN. The following code is available on Github. I’m not going to talk in detail about how the data is generated, as that is just some Python number-crunching to put the data into the correct format.

First we define the input placeholder (size: b\times n\times k) and the output placeholder (size: b\times C, where C is the number of classes).

import tensorflow as tf

batch_size = 16
# sent_length, vocabulary_size and num_classes are set during data generation

# inputs and labels
sent_inputs = tf.placeholder(shape=[batch_size,sent_length,vocabulary_size],dtype=tf.float32,name='sentence_inputs')
sent_labels = tf.placeholder(shape=[batch_size,num_classes],dtype=tf.float32,name='sentence_labels')

Here we define the weights and biases for 1D convolution operation.

# 3 filters with different context window sizes (3,5,7)
# Each of these filters spans the full one-hot encoded length of a word and the width of the context window
w1 = tf.Variable(tf.truncated_normal([3,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_1')
b1 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_1')

w2 = tf.Variable(tf.truncated_normal([5,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_2')
b2 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_2')

w3 = tf.Variable(tf.truncated_normal([7,vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_3')
b3 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_3')

Here we calculate the convolution output and apply a non-linear activation (i.e. Tanh).

# Calculate the output for all the filters with a stride 1
h1_1 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w1,stride=1,padding='SAME') + b1)
h1_2 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w2,stride=1,padding='SAME') + b2)
h1_3 = tf.nn.tanh(tf.nn.conv1d(sent_inputs,w3,stride=1,padding='SAME') + b3)

We calculate the pooling-over-time output at this point. We take the max of each convolution output and concatenate them to produce the final output.

h2_1 = tf.reduce_max(h1_1,axis=1)
h2_2 = tf.reduce_max(h1_2,axis=1)
h2_3 = tf.reduce_max(h1_3,axis=1)

h2 = tf.concat([h2_1,h2_2,h2_3],axis=1)

The weights and bias of the fully connected output layer (i.e. softmax layer).

# h2_shape[1] is the width of the concatenated pooled output (i.e. the number of filters)
h2_shape = h2.get_shape().as_list()

w_fc1 = tf.Variable(tf.truncated_normal([h2_shape[1],num_classes],stddev=0.005,dtype=tf.float32),name='weights_fulcon_1')
b_fc1 = tf.Variable(tf.random_uniform([num_classes],0,0.01,dtype=tf.float32),name='bias_fulcon_1')

Calculate logits (final output before applying the softmax) and predictions.

# since h2 is 2D [batch_size, output_width], reshaping the output is not required as is usually done in CNNs
logits = tf.matmul(h2,w_fc1) + b_fc1

predictions = tf.argmax(tf.nn.softmax(logits),axis=1)

Define the loss function (i.e. cross entropy loss) and an optimizer to minimize the loss.

# Loss (Cross-Entropy)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=sent_labels,logits=logits))

# Momentum Optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,momentum=0.9).minimize(loss)

Yup, this is it. You should be able to achieve around \sim90\% accuracy on the given dataset, without any fancy techniques (such as batch normalization or the Adam optimizer).

Full code: Here

PS: Sorry about the click-baity title. But I couldn’t help it.

