Neural Machine Translator with 50 Lines of Code + Guide

Jupyter Notebook for this Tutorial: Here

Recently, I had to take a dive into the seq2seq library of TensorFlow, and I wanted a quick intro to the library for the purpose of implementing a Neural Machine Translator (NMT). I simply wanted to know “what do I essentially need to know about the library?”. In other words, I didn’t want an 8-layer-deep-bi-directional-LSTM-network-with-Beam-Search-and-Attention-based-decoding that does amazing translation. That’s not what I came for. I just wanted to know how to implement the most basic NMT in the world. And boy, was I surprised when I had a look around for a “simple” reference code.

We need Tutorials with more focus on the foundations

Don’t get me wrong, the official TensorFlow NMT Tutorial is a great tutorial and provides a wonderful conceptual foundation. But it, like 90% of the code bases implementing NMT, is quite complex. They are so focused on great performance that they pile on embellishments and pay little attention to the basics. This means a novice has to make a huge jump, with no smooth escalator leading up to the super-complex code bases doing “amazing” translations. I’m here to bridge that gap and be the humble escalator that makes the transition smoother! So let’s get on with it.

What is a Neural Machine Translator

NMT is the latest era of machine translation (i.e. translating a sentence/phrase from a source language to a target language). NMT currently holds the record for state-of-the-art performance, which is bit by bit approaching human-like performance. So what does an NMT look like? It has 3 major components:

  • Embedding layers (for both source and target vocabulary): Convert words to word vectors
  • An encoder: LSTM cell(s) (can be deep) that encodes the source sentence
  • A decoder: LSTM cell(s) (can be deep) that decodes the encoded source sentence

And they all are connected to each other as below.

This depicts the inference process of a “trained” NMT. Both the encoder and decoder are essentially LSTM cells, and can be upgraded to a deep LSTM or a bi-directional deep LSTM to gain better performance. But these are small details you can easily code in once you get the basics! So let’s focus on the basics. Note that this is not a blog post about the theory behind an NMT, but a practical insight into implementing one. So if you’re not familiar with the theory, I recommend you read some literature and then return to this post.

These models are more broadly known as sequence to sequence models, because we input a sequence of words and output an arbitrary-length sequence of words (the translation). The Sequence to Sequence paper is one of the first to introduce this architecture. Many interesting real-world systems, such as machine translators, chatbots and text summarizers, use this architecture.

What is tensorflow.seq2seq?

Basically, seq2seq is a TensorFlow overlay that takes care of all the difficult work you would have to do if you implemented a sequence to sequence model with raw TensorFlow. Implementing the whole pipeline shown in the above figure yourself can be quite a hassle. For example, you would have to manually deal with things like:

  • Not all sentences have the same length. So it is tricky to process sentences in batches
  • You have to make sure the decoder is always initialized with the last encoder state by using control flow ops

Trust me! These become more and more difficult with every “upgrade” you add to make your model better. That’s why the seq2seq library is handy!

Right into the implementation (No performance related distractions)

I gave you a very brief overview of what an NMT is. So let’s get into actually implementing one. We’ll be implementing the full NMT in around 50 lines of code, thanks to TensorFlow’s beloved seq2seq library.

Defining inputs, outputs and masks

First we define placeholders for feeding in the source sentence words (enc_train_inputs) and target sentence words (dec_train_inputs). We also define a mask for the decoder (dec_label_masks) to mask out the elements beyond the actual length of the target sentence during training. This is necessary because, to process data in batches, we have to make all sentences the same length by padding short sentences with a special token (that is, </s>), and possibly truncating very long sentences.

enc_train_inputs,dec_train_inputs = [],[]

# Defining unrolled training inputs for encoder
for ui in range(source_sequence_length):
    enc_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size],name='enc_train_inputs_%d'%ui))

dec_train_labels, dec_label_masks=[],[]

# Defining unrolled training inputs, labels and masks for decoder
for ui in range(target_sequence_length):
    dec_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size],name='dec_train_inputs_%d'%ui))
    dec_train_labels.append(tf.placeholder(tf.int32, shape=[batch_size],name='dec-train_outputs_%d'%ui))
    dec_label_masks.append(tf.placeholder(tf.float32, shape=[batch_size],name='dec-label_masks_%d'%ui))
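To make the padding and masking idea concrete, here is a minimal plain-Python sketch of how one tokenized sentence could be padded (or truncated) and paired with a mask. The helper pad_and_mask is hypothetical and not part of the notebook, and whether the first </s> should itself count as a real label is a design choice I am glossing over here.

```python
# Hypothetical helper (not from the notebook): pad or truncate one
# tokenized sentence and build the matching 0/1 mask.
PAD = '</s>'  # the post pads short sentences with the </s> token

def pad_and_mask(tokens, max_len, pad=PAD):
    """Return (padded_tokens, mask), both of length max_len."""
    kept = tokens[:max_len]                  # truncate very long sentences
    n_pad = max_len - len(kept)
    padded = kept + [pad] * n_pad            # pad short sentences
    mask = [1.0] * len(kept) + [0.0] * n_pad # 1.0 = real word, 0.0 = padding
    return padded, mask

padded, mask = pad_and_mask(['public', 'parking', 'is', 'possible'], 6)
print(padded)  # ['public', 'parking', 'is', 'possible', '</s>', '</s>']
print(mask)    # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

The mask values for position ui of every sentence in the batch are what would be fed into dec_label_masks[ui].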

Defining word embedding related operations

With that, we now define the word embedding related operations required to fetch the correct word vectors for the data fed in through enc_train_inputs and dec_train_inputs. I have already created word embeddings for the two languages, which are available as numpy matrices (de-embeddings.npy and en-embeddings.npy). These are loaded into TensorFlow as tensors with the tf.convert_to_tensor operation. You can also initialize encoder_emb_layer and decoder_emb_layer as TensorFlow variables and jointly train them with the NMT. It’s just a matter of changing tf.convert_to_tensor to tf.Variable(...).

Next we look up the corresponding embeddings for a batch of training source words (encoder_emb_inp) and training target words (decoder_emb_inp). encoder_emb_inp will be a list of source_sequence_length tensors, each of size [batch_size, embedding_size]. We also define an enc_train_inp_lengths placeholder that contains the length of each sentence in a batch of data. This will be used later. Finally, the tf.stack operation stacks all the list elements and produces a tensor of size [source_sequence_length, batch_size, embedding_size]. This is a time_major tensor, as the time step of the sequence is denoted by the first axis. We do the same for decoder_emb_inp.

# Need to use pre-trained word embeddings
encoder_emb_layer = tf.convert_to_tensor(np.load('de-embeddings.npy'))
decoder_emb_layer = tf.convert_to_tensor(np.load('en-embeddings.npy'))

# looking up embeddings for encoder inputs
encoder_emb_inp = [tf.nn.embedding_lookup(encoder_emb_layer, src) for src in enc_train_inputs]
encoder_emb_inp = tf.stack(encoder_emb_inp)

# looking up embeddings for decoder inputs
decoder_emb_inp = [tf.nn.embedding_lookup(decoder_emb_layer, src) for src in dec_train_inputs]
decoder_emb_inp = tf.stack(decoder_emb_inp)

# to contain the sentence length for each sentence in the batch
enc_train_inp_lengths = tf.placeholder(tf.int32, shape=[batch_size],name='train_input_lengths')
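If the time_major layout feels abstract, the following plain-Python sketch mimics the per-step lookup and the stacking with nested lists. The tiny embedding table and the word ids are made-up values for illustration, not the real embeddings.

```python
# Toy stand-in for an embedding matrix of shape [vocab_size, embedding_size]
# (assumed values, not the real de/en embeddings)
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# One list of word ids per time step, each holding batch_size ids,
# mirroring the enc_train_inputs placeholders
step_inputs = [[1, 3], [2, 0], [3, 1]]   # source_sequence_length=3, batch_size=2

# Per-step lookup (the role of tf.nn.embedding_lookup), then "stack" along time
time_major = [[embeddings[i] for i in ids] for ids in step_inputs]

shape = (len(time_major), len(time_major[0]), len(time_major[0][0]))
print(shape)              # (3, 2, 2) -> [time, batch, embedding_size]
print(time_major[0][1])   # [0.5, 0.6] -> step 0, sentence 1, word id 3
```

The first index picks the time step and the second picks the sentence in the batch, which is exactly what "time major" means.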

Defining the encoder

Here we define the encoder and that’s just three lines of code!

encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

initial_state = encoder_cell.zero_state(batch_size, dtype=tf.float32)

encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp, initial_state=initial_state,
    sequence_length=enc_train_inp_lengths,
    time_major=True, swap_memory=True)

Defining the encoder is surprisingly simple (unless you dive into a pool of performance-focused implementations). We first define an encoder_cell, which says “use an LSTM cell with num_units units as the architecture of my encoder”. And as you guessed, you can define an array of such cells if you want a deep LSTM network. Then we say “initialize the encoder state (i.e. the state variables in the LSTM cell) to zero”. And in the third line, we use a special function called dynamic_rnn that is able to handle arbitrary-length sequences (just what we wanted!). It says “define a dynamic_rnn that uses the encoder_cell architecture and takes encoder_emb_inp as its arbitrary-length input; I will feed the length of each sequence in enc_train_inp_lengths”. Finally, we say that our input is time_major and turn on swap_memory (for improved performance). And that’s just it for the encoder.
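To see what feeding per-sentence lengths buys you, here is a toy plain-Python loop imitating the behaviour documented for dynamic_rnn's sequence_length argument: once a sentence has ended, its state is copied through unchanged and its outputs are zeroed. The toy_cell below is only a stand-in I made up for an LSTM cell; treat the whole thing as a sketch of the idea and not of TensorFlow's internals.

```python
import math

def toy_cell(x, state):
    """Stand-in for an LSTM cell (an assumption for illustration only)."""
    new_state = math.tanh(0.5 * state + x)
    return new_state, new_state            # (output, new state)

def run_with_lengths(inputs, lengths):
    """inputs: time-major [time][batch] floats; lengths: real length per sentence."""
    batch_size = len(lengths)
    states = [0.0] * batch_size            # zero initial state
    outputs = []
    for t, step in enumerate(inputs):
        step_out = []
        for b in range(batch_size):
            if t < lengths[b]:
                out, states[b] = toy_cell(step[b], states[b])
            else:
                out = 0.0                  # past the end: zero output, state untouched
            step_out.append(out)
        outputs.append(step_out)
    return outputs, states

# Sentence 1 has length 1; the 9.0 values play the role of padding
inputs = [[1.0, 1.0], [1.0, 9.0], [1.0, 9.0]]
outputs, final_states = run_with_lengths(inputs, lengths=[3, 1])
print(outputs[2][1])                        # 0.0
print(final_states[1] == math.tanh(1.0))    # True: padding never touched the state
```

This is why the final encoder state of a short sentence is not polluted by its padding tokens.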

Define the decoder

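decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

projection_layer = tf.layers.Dense(units=vocab_size, use_bias=True)

# Helper: feeds the target embeddings to the decoder one step at a time
# (assumes tgt_max_sent_length-1 matches the number of decoder input steps)
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, [tgt_max_sent_length-1 for _ in range(batch_size)], time_major=True)

# Decoder
if decoder_type == 'basic':
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)
elif decoder_type == 'attention':
    # BahdanauAttention is an attention mechanism, not a decoder, so it is
    # wrapped around the cell; its memory has to be batch-major
    attention = tf.contrib.seq2seq.BahdanauAttention(
        num_units, tf.transpose(encoder_outputs, [1, 0, 2]),
        memory_sequence_length=enc_train_inp_lengths)
    attn_cell = tf.contrib.seq2seq.AttentionWrapper(decoder_cell, attention)
    attn_state = attn_cell.zero_state(batch_size, tf.float32).clone(
        cell_state=encoder_state)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        attn_cell, helper, attn_state,
        output_layer=projection_layer)

# Dynamic decoding
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, output_time_major=True)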

The decoder requires a little bit more work, but the basic version is still not more than 10 lines of code. We first define the decoder_cell. Next we define a projection_layer, which is the softmax layer that scores every word in the vocabulary and so yields the translated words. And we define a helper that iteratively feeds in the inputs of the sequence.
Then we define the most important bit of the decoder, a Decoder! There are many different decoders to choose from, and you can see the list here. As an example, I’m providing two different decoders. The line producing the decoder says,

“to build a decoder, BahdanauAttention type, use the decoder_cell for the architecture, and use the helper to fetch inputs to the decoder, use the last encoder state as the state initialization for the decoder, and to make predictions use the projection_layer (i.e. softmax layer)”

There are a few things we should talk about in the above statement, so we will go through them briefly.

Why do we need the last encoder state as the first state of the decoder

This is the single link that’s responsible for the communication between the encoder and the decoder (the arrow connecting the encoder and the decoder in the above figure). In other words, this last encoder state provides the context for the decoder in terms of what the translated prediction should be about. The last state of the encoder can be interpreted as a “language-neutral” thought vector.

What is BahdanauAttention?

The code defines two types of decoder: a BasicDecoder, which is essentially a standard LSTM decoder, and an attention-based decoder built on BahdanauAttention, which is more complex and performs better than the standard decoder. With a standard decoder, the encoder is forced to condense all the information in the sentence (subject, objects, dependencies, grammar, etc.) into a fixed-length vector, because this is the only piece of information a standard decoder has access to. That’s asking too much from the encoder, don’t you think? BahdanauAttention gives the decoder access to the full state history of the encoder during decoding, so it doesn’t have to rely on a single last state vector. And since seq2seq provides this as built-in functionality, you don’t need to worry about the actual underlying mechanism.
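You don’t need this to use the library, but if you are curious, here is a rough plain-Python sketch of the idea behind additive (Bahdanau-style) attention: score every encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context. To keep it short I use scalar weights (w_dec, w_enc) where the real mechanism uses learned matrices, so treat it as a simplification under that assumption and not as what seq2seq does internally.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Toy additive scoring: score_i = v . tanh(w_dec * s + w_enc * h_i)."""
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(w_dec * s + w_enc * hi)
                  for s, hi in zip(decoder_state, h)]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))
    weights = softmax(scores)
    # Context vector: attention-weighted sum over ALL encoder states
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(len(decoder_state))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one state per source word
decoder_state = [0.5, -0.5]
weights, context = additive_attention(
    decoder_state, encoder_states, w_dec=1.0, w_enc=1.0, v=[1.0, 1.0])
print(round(sum(weights), 6))   # 1.0
```

The point to take away is that the context is recomputed at every decoding step from the whole encoder history, instead of being fixed to the last encoder state.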

What is the newly introduced projection layer

Well, without it we can’t infer anything from the decoder. There should be a way to map the decoder state at each decoding step to some vocabulary prediction, and that’s exactly what the projection_layer does.
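As a rough illustration of that mapping, the plain-Python sketch below projects a toy decoder state onto a three-word vocabulary and picks the highest-scoring word. The vocabulary, weights and state are made-up numbers; in the real model the weights are learned.

```python
def project(state, weights, bias):
    """Dense layer: one logit per vocabulary word (weights is [vocab][num_units])."""
    return [sum(w * s for w, s in zip(row, state)) + b
            for row, b in zip(weights, bias)]

vocab = ['<unk>', 'the', 'hotel']               # toy vocabulary (assumed)
weights = [[0.1, 0.0], [0.2, 0.9], [0.7, -0.3]]
bias = [0.0, 0.0, 0.0]

decoder_state = [1.0, 2.0]                      # toy decoder output, num_units = 2
logits = project(decoder_state, weights, bias)
word_id = max(range(len(logits)), key=logits.__getitem__)   # argmax over the vocabulary
print(vocab[word_id])                           # the
```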

Finally, we use the dynamic_decode function to decode the translation and get the outputs through the projection_layer. The output_time_major option basically says that the time axis should be the first axis of the output.

Defining Loss

We have the inputs, the true labels and the predicted labels, so we can define the loss now.

logits = outputs.rnn_output

crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=dec_train_labels, logits=logits)
loss = (tf.reduce_sum(crossent*tf.stack(dec_label_masks)) / (batch_size*target_sequence_length))

Note how we use dec_label_masks to mask out the unwanted labels (the padded positions) from the loss. This masking is optional.
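If you want to convince yourself of what the mask does, here is a small plain-Python version of the same computation: cross entropy per position, multiplied by the mask, summed, and divided by batch_size times the number of steps. The numbers are toy values chosen so the result is easy to check by hand.

```python
import math

def masked_cross_entropy(logits, labels, masks):
    """logits: [time][batch][vocab]; labels and masks: [time][batch]."""
    total = 0.0
    for step_logits, step_labels, step_masks in zip(logits, labels, masks):
        for scores, label, mask in zip(step_logits, step_labels, step_masks):
            m = max(scores)
            log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
            total += mask * (log_norm - scores[label])   # -log softmax(scores)[label]
    n_steps, batch_size = len(labels), len(labels[0])
    return total / (batch_size * n_steps)   # same normalisation as the loss above

logits = [[[0.0, 0.0]], [[5.0, -5.0]]]      # 2 time steps, batch of 1, vocab of 2
labels = [[0], [1]]
masks = [[1.0], [0.0]]                      # second step is padding

loss = masked_cross_entropy(logits, labels, masks)
unmasked = masked_cross_entropy(logits, labels, [[1.0], [1.0]])
print(abs(loss - math.log(2) / 2) < 1e-9)   # True
print(unmasked > loss)                      # True: the padded step would add a big error
```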

Getting out the predictions

train_prediction = outputs.sample_id

One liner! Easy.

Defining the optimizer

with tf.variable_scope('Adam'):
    adam_optimizer = tf.train.AdamOptimizer(learning_rate)

adam_gradients, v = zip(*adam_optimizer.compute_gradients(loss))
adam_gradients, _ = tf.clip_by_global_norm(adam_gradients, 25.0)
adam_optimize = adam_optimizer.apply_gradients(zip(adam_gradients, v))

with tf.variable_scope('SGD'):
    sgd_optimizer = tf.train.GradientDescentOptimizer(learning_rate)

sgd_gradients, v = zip(*sgd_optimizer.compute_gradients(loss))
sgd_gradients, _ = tf.clip_by_global_norm(sgd_gradients, 25.0)
sgd_optimize = sgd_optimizer.apply_gradients(zip(sgd_gradients, v))

I use the Adam optimizer initially (for example, for the first 10000 iterations) and then switch to SGD. This is because using Adam continuously gave some weird results. The gradients are clipped (to a global norm of 25.0) to avoid gradient explosion. That’s all that’s taking place here.
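For the curious, this is roughly what clipping by global norm amounts to, written in plain Python: all gradients are rescaled together by clip_norm / max(global_norm, clip_norm), so their directions are preserved. The pick_optimizer helper is hypothetical and only illustrates the Adam-then-SGD switch, using the 10000 figure from the text as an example threshold.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)   # 1.0 when already small enough
    return [[g * scale for g in grad] for grad in grads], global_norm

def pick_optimizer(step, switch_at=10000):
    """Hypothetical helper: name of the op to run at a given training step."""
    return 'adam_optimize' if step < switch_at else 'sgd_optimize'

grads = [[30.0, 0.0], [0.0, 40.0]]          # global norm = sqrt(900 + 1600) = 50
clipped, norm = clip_by_global_norm(grads, 25.0)
print(norm)                                 # 50.0
print(clipped)                              # [[15.0, 0.0], [0.0, 20.0]]
print(pick_optimizer(500), pick_optimizer(12000))   # adam_optimize sgd_optimize
```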

Running an Actual Translation Task: German to English

All done. What’s left is to use this in a real translation task. For that we will be using the WMT’14 English-German data. I’ve created a tutorial using this data set and you can download it here.

Jupyter Notebook: Here

You will need to download the following to run this.

I’ve already created word embeddings (each around 25MB) for both vocabularies and they are available with the Jupyter notebook.

Some Results

Here we show some results of the translator we just implemented. Actual is the actual English translation of the German sentence we fed to the encoder, and Predicted is the sentence predicted by our decoder. Remember, we replace words not found in our vocabulary with the special <unk> token.

At 500 steps…

Actual: To find the nearest car park to an apartment <unk> , have a look at this map link <unk> . <unk> </s> 
Predicted: The the the hotel of <unk> <unk> the <unk> <unk> , the the <unk> <unk> the <unk> <unk> <unk> , <unk> </s> 

At 2500 steps…

Actual: Public parking is possible on site and costs EUR 20 <unk> per day <unk> . <unk> </s> 
Predicted: If parking is possible at site ( costs EUR 6 <unk> per day <unk> . <unk> </s> 
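In case you are wondering where all those <unk> tokens come from, here is a minimal plain-Python sketch of the replacement step. The tiny vocabulary and the to_ids helper are made up for illustration and are not taken from the notebook.

```python
def to_ids(tokens, word_to_id, unk='<unk>'):
    """Map words to ids, falling back to the <unk> id for unknown words."""
    return [word_to_id.get(tok, word_to_id[unk]) for tok in tokens]

# Toy vocabulary (assumed); a real one would hold tens of thousands of words
word_to_id = {'<unk>': 0, '</s>': 1, 'parking': 2, 'is': 3, 'possible': 4}
id_to_word = {i: w for w, i in word_to_id.items()}

ids = to_ids(['parking', 'is', 'possible', 'onsite', '</s>'], word_to_id)
print(ids)                                    # [2, 3, 4, 0, 1]
print(' '.join(id_to_word[i] for i in ids))   # parking is possible <unk> </s>
```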

How Do You Improve NMT Systems

So far our goal has been understanding the basics of an NMT system. But our journey shouldn’t stop here. The idea should be to achieve better and better performance. So I’m going to provide some tips on where you can improve the NMT system.

  • Adding more layers to help the system capture more and more subtleties in languages
  • Using bi-directional LSTMs. Bi-directional LSTMs read text both forward and backward, making them wiser
  • Using attention to give the decoder access to the full state history of the encoder
  • Using hybrid NMTs: Hybrid NMTs have a special way of dealing with rare words without replacing them with a special token

I’ll stop here with the improvements. But they are not limited to this list. So that’s all folks! Hope you found this helpful.
