Neural Machine Translator with 50 Lines of Code + Guide


Jupyter Notebook for this Tutorial: Here

Recently, I had to take a dive into the seq2seq library of Tensorflow, and I wanted a quick intro to the library for the purpose of implementing a Neural Machine Translator (NMT). I simply wanted to know “what do I essentially need to know about the library?”. In other words, I didn’t want an 8-layer-deep-bi-directional-LSTM-network-with-Beam-Search-and-Attention-based-decoding that does amazing translation. That’s not what I came for. I just wanted to know how to implement the most basic NMT in the world. And boy, was I surprised when I had a look around for a “simple” reference code.

We need Tutorials with more focus on the foundations

Don’t get me wrong, the official Tensorflow NMT Tutorial is a great tutorial and provides a wonderful conceptual foundation. But this, and the rest of the 90% of code bases implementing NMT, are quite complex. They are so focused on great performance that the embellishments crowd out the basics. This leaves a huge jump for a novice, because all there’s left to hang on to are super-complex code bases doing “amazing” translations. I’m here to bridge that gap and make the jump smaller! So let’s get on with it.

What is a Neural Machine Translator?

Well, NMT is the latest era of machine translation (i.e. translating a sentence/phrase from a source language to a target language), and it currently holds the record for state-of-the-art performance. So what does an NMT look like? It has 3 major components,

  • Embedding layers for both source and target vocabulary
  • An encoder: LSTM cell(s) (can be deep)
  • A decoder: LSTM cell(s) (can be deep)

They are all connected to each other as shown below.

This depicts the inference process of a “trained” NMT. We do not need the target word embeddings during inference, only during training; we will see the details later. Both the encoder and the decoder are essentially LSTM cells, and each can be upgraded to a deep LSTM or a bi-directional deep LSTM for better performance. But these are small details you can easily code in once you get the basics! So let’s focus on the basics.

These models are more broadly known as sequence-to-sequence models, because we input a sequence of words and output an arbitrary-length sequence of words (the translation). The Sequence to Sequence paper was one of the first to introduce this architecture. Many interesting real-world problems, such as machine translators, chatbots, and text summarizers, use this architecture.

What is tensorflow.seq2seq?

Basically, seq2seq is a Tensorflow overlay that takes care of all the difficult work you would otherwise have to do when implementing an NMT with raw Tensorflow. If you were to implement the whole pipeline shown in the figure above yourself, it could be quite a hassle. For example, you would have to manually deal with things like,

  • Not all sentences have the same length, so it is tricky to process sentences in batches
  • You have to make sure the decoder is always initialized with the last encoder state, using control flow ops

Trust me! These only get more difficult with every “upgrade” you need to make your model better. That’s why the seq2seq library is handy!
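To get a feel for the first point, here is a minimal sketch (plain numpy, with made-up word IDs) of the bookkeeping you would otherwise do by hand: pad every sentence in a batch to a common length and build a mask so the padded positions can later be ignored in the loss.

import numpy as np

# A hypothetical batch of tokenized sentences (word IDs) of different lengths
batch = [[12, 5, 89], [7, 41], [3, 18, 52, 9]]
pad_id = 0  # assumed ID of the padding token (e.g. </s>)

max_len = max(len(s) for s in batch)
# Pad each sentence up to max_len; mark real tokens with 1.0 and padding with 0.0
padded = np.array([s + [pad_id]*(max_len - len(s)) for s in batch])
mask = np.array([[1.0]*len(s) + [0.0]*(max_len - len(s)) for s in batch])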

Right into the implementation (no performance-related distractions)

I gave you a very brief overview of what an NMT is. So let’s get into actually implementing one. We’ll implement the full NMT in around 50 lines of code, thanks to the beloved Tensorflow seq2seq library.

Defining inputs, outputs and masks

import numpy as np
import tensorflow as tf

enc_train_inputs, dec_train_inputs = [], []

# We use pre-trained word embeddings (German for the encoder, English for the decoder)
encoder_emb_layer = tf.convert_to_tensor(np.load('de-embeddings.npy'))
decoder_emb_layer = tf.convert_to_tensor(np.load('en-embeddings.npy'))

# Defining unrolled training inputs for encoder
for ui in range(source_sequence_length):
    enc_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size],name='enc_train_inputs_%d'%ui))

dec_train_labels, dec_label_masks=[],[]

# Defining unrolled training inputs for decoder
for ui in range(target_sequence_length):
    dec_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size],name='dec_train_inputs_%d'%ui))
    dec_train_labels.append(tf.placeholder(tf.int32, shape=[batch_size],name='dec_train_labels_%d'%ui))
    dec_label_masks.append(tf.placeholder(tf.float32, shape=[batch_size],name='dec_label_masks_%d'%ui))

# looking up embeddings for encoder inputs
encoder_emb_inp = [tf.nn.embedding_lookup(encoder_emb_layer, src) for src in enc_train_inputs]
encoder_emb_inp = tf.stack(encoder_emb_inp)

# looking up embeddings for decoder inputs
decoder_emb_inp = [tf.nn.embedding_lookup(decoder_emb_layer, src) for src in dec_train_inputs]
decoder_emb_inp = tf.stack(decoder_emb_inp)

# to contain the sentence length for each sentence in the batch
enc_train_inp_lengths = tf.placeholder(tf.int32, shape=[batch_size],name='train_input_lengths')

Nothing too fancy here. One curious detail might be dec_label_masks. This masks the padded </s> tokens at the end of each sentence; we do not want to take them into consideration in the loss calculation, so we mask out the labels that contain </s>. For the sake of completeness, let us talk about the dimensionality of the inputs and outputs. First we define a list of source_sequence_length placeholders, each of size [batch_size], as the encoder inputs. Then we do an embedding_lookup for each such placeholder, resulting in another list of source_sequence_length tensors, each of size [batch_size, embedding_size]. Finally, the tf.stack operation stacks all the list elements and produces a tensor of size [source_sequence_length, batch_size, embedding_size]. This is a time-major tensor, as the time step of the sequence is denoted by the first axis. We do the same for the decoder inputs.
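If the shape bookkeeping feels abstract, here is a minimal, self-contained sketch (with made-up sizes) confirming what tf.stack does to the list of looked-up embeddings:

import tensorflow as tf

# Made-up sizes, just to illustrate the shapes
batch_size, source_sequence_length, embedding_size = 4, 7, 32

emb_list = [tf.zeros([batch_size, embedding_size]) for _ in range(source_sequence_length)]
stacked = tf.stack(emb_list)
print(stacked.shape)  # (7, 4, 32), i.e. [source_sequence_length, batch_size, embedding_size]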

Defining the encoder

encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

initial_state = encoder_cell.zero_state(batch_size, dtype=tf.float32)

encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp, initial_state=initial_state,
    sequence_length=enc_train_inp_lengths, 
    time_major=True, swap_memory=True)

Surprise surprise! Defining the encoder is just 3 lines of code (unless you dive into a pool of performance-focused refinements). We first define an encoder_cell, which says “use an LSTM cell with num_units as the architecture of my encoder”. And as you guessed, you can stack an array of such cells if you want a deep LSTM network (see the sketch below). Then we say “initialize the encoder state (i.e. the state variables of the LSTM cell) to zero”. And in the third line, we define a special op called dynamic_rnn that is able to handle arbitrary-length sequences (just what we wanted!). It says “define a dynamic_rnn that uses the encoder_cell architecture, takes encoder_emb_inp as its arbitrary-length input, and I will feed the length of each sequence in enc_train_inp_lengths”. And that’s just it for the encoder.
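For instance, if you want a deep encoder, a minimal sketch (assuming the same num_units as above) wraps a list of cells in a MultiRNNCell and hands that to dynamic_rnn instead:

num_layers = 2  # assumed depth of the deep LSTM
encoder_cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units) for _ in range(num_layers)]
deep_encoder_cell = tf.nn.rnn_cell.MultiRNNCell(encoder_cells)
# deep_encoder_cell can now replace encoder_cell in the dynamic_rnn call above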

Defining the decoder

from tensorflow.python.layers.core import Dense

decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

projection_layer = Dense(units=vocab_size, use_bias=True)

# Helper that feeds the ground-truth target words to the decoder, step by step
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, [target_sequence_length-1 for _ in range(batch_size)], time_major=True)

# Decoder
if decoder_type == 'basic':
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)

elif decoder_type == 'attention':
    # BahdanauAttention is an attention mechanism, not a decoder itself,
    # so we wrap the cell with it and hand the wrapped cell to a BasicDecoder
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
        num_units, tf.transpose(encoder_outputs, [1, 0, 2]),
        memory_sequence_length=enc_train_inp_lengths)
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism)
    attention_initial_state = attention_cell.zero_state(
        batch_size, tf.float32).clone(cell_state=encoder_state)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        attention_cell, helper, attention_initial_state,
        output_layer=projection_layer)
    
# Dynamic decoding
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, output_time_major=True,
    swap_memory=True
)

The decoder requires a little more work, but not more than 10 lines of code. We first define the decoder_cell. Next we define a projection_layer, which is the softmax layer that maps the decoder output at each step to a prediction over the target vocabulary. And we define a helper that iteratively feeds the decoder its inputs in sequence.
Then we define the most important bit, a Decoder! There are many different decoders to choose from, and you can see the list here. As an example, I’m providing two variants: a basic decoder and an attention-based one (note that BahdanauAttention is an attention mechanism rather than a decoder, so the attention variant wraps the cell and still uses a BasicDecoder). The line producing the basic decoder says,

“to build a BasicDecoder, use the decoder_cell for the architecture, use the helper to fetch the inputs to the decoder, use the last encoder state as the initial state of the decoder, and to make predictions use the projection_layer (i.e. the softmax layer)”

There are a few things we should talk about in the above statement, so let’s go through them briefly.

Why do we need the last encoder state as the first state of the decoder?

This is the single link that’s responsible for the communication between the encoder and the decoder (the arrow connecting the encoder and the decoder in the above figure). In other words, this last encoder state provides the context for the decoder in terms of what the translated prediction should be about. The last state of the encoder can be interpreted as a “language-neutral” thought vector.

What is BahdanauAttention?

You see, the encoder is forced to condense all the information in the sentence (subject, objects, dependencies, grammar, etc.) into a fixed-length vector. That’s asking quite a lot of the encoder, don’t you think? BahdanauAttention provides a way for the decoder to peek at any part of the full state history of the encoder during predictions, instead of relying on the single last state vector. And as the code above shows, using it is just a matter of wrapping the decoder cell; you don’t need to worry about the underlying mechanism.

What is the newly introduced projection layer?

Well, without it we can’t infer anything from the decoder. There needs to be a way to map the decoder state at each decoding step to a prediction over the vocabulary, and that’s exactly what the projection_layer does.

Finally, we use the dynamic_decode function to decode the translation and get the outputs through the projection_layer. The output_time_major option simply says that the time axis should be the first axis of the output.
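One caveat: the TrainingHelper above only works during training, because it needs the ground-truth target words. For inference, a minimal sketch (assuming sos_id and eos_id are the integer IDs of your <s> and </s> tokens) swaps in a GreedyEmbeddingHelper, which feeds each step’s prediction back in as the next input:

# sos_id, eos_id: assumed integer IDs of the <s> and </s> tokens
infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    decoder_emb_layer, tf.fill([batch_size], sos_id), eos_id)
infer_decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, infer_helper, encoder_state,
    output_layer=projection_layer)
# Cap the decoding length, since there are no ground-truth lengths at inference time
infer_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    infer_decoder, output_time_major=True,
    maximum_iterations=2*source_sequence_length)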

Defining Loss

We’ve got the inputs, the true labels, and the predicted labels, so we can now define the loss.

logits = outputs.rnn_output

# Stack the unrolled label placeholders so they line up with the time-major logits
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.stack(dec_train_labels), logits=logits)
loss = (tf.reduce_sum(crossent*tf.stack(dec_label_masks)) / (batch_size*target_sequence_length))

Note how we use dec_label_masks to mask out the unwanted labels in the loss. But this is optional.

Getting out the predictions

train_prediction = outputs.sample_id

One liner! Easy.
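sample_id contains the predicted word IDs, so to read an actual sentence you still need to map the IDs back to words. A minimal sketch, assuming a hypothetical en_reverse_dictionary ({word_id: word} for the English vocabulary) and a pred_ids array of shape [target_sequence_length, batch_size] obtained from session.run(train_prediction, ...):

# en_reverse_dictionary and pred_ids are hypothetical, see the note above
translation = ' '.join(en_reverse_dictionary[wid] for wid in pred_ids[:, 0])
print(translation)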

Optimizer

with tf.variable_scope('SGD'):
    sgd_optimizer = tf.train.GradientDescentOptimizer(learning_rate)

sgd_gradients, v = zip(*sgd_optimizer.compute_gradients(loss))
sgd_gradients, _ = tf.clip_by_global_norm(sgd_gradients, 25.0)
sgd_optimize = sgd_optimizer.apply_gradients(zip(sgd_gradients, v))

An SGD optimizer, with gradient clipping to avoid exploding gradients. That’s all that’s taking place here.
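To tie everything together, here is a minimal sketch of a single training step. It assumes a hypothetical get_batch() helper that returns padded, ID-encoded source/target batches (numpy arrays of shape [sequence_length, batch_size]), the label masks, and the per-sentence source lengths:

session = tf.InteractiveSession()
tf.global_variables_initializer().run()

# get_batch() is a hypothetical data-feeding helper, not part of the library
src, tgt_inp, tgt_lbl, masks, src_lengths = get_batch()

feed_dict = {enc_train_inp_lengths: src_lengths}
for ui in range(source_sequence_length):
    feed_dict[enc_train_inputs[ui]] = src[ui]
for ui in range(target_sequence_length):
    feed_dict[dec_train_inputs[ui]] = tgt_inp[ui]
    feed_dict[dec_train_labels[ui]] = tgt_lbl[ui]
    feed_dict[dec_label_masks[ui]] = masks[ui]

_, l = session.run([sgd_optimize, loss], feed_dict=feed_dict)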

Running an Actual Translation Task: German to English

Yup, all done. All that’s left is to use this in a real translation task. For that we will use the WMT’14 English-German data. I’ve created a tutorial using this data set, and you can download it here.

Jupyter Notebook: Here

You will need to download the WMT’14 data set to run this. I’ve already created word embeddings (each around 25MB) for both vocabularies, and they are available with the Jupyter notebook.

Some Results

Here are some results from the translator we just implemented. “Actual” is the actual English translation of the German sentence we fed to the encoder, and “Predicted” is the sentence predicted by our decoder. Remember, we replace words not found in our vocabulary with the special <unk> token.

At 500 steps…

Actual: To find the nearest car park to an apartment <unk> , have a look at this map link <unk> . <unk> </s> 
Predicted: The the the hotel of <unk> <unk> the <unk> <unk> , the the <unk> <unk> the <unk> <unk> <unk> , <unk> </s> 

At 2500 steps…

Actual: Public parking is possible on site and costs EUR 20 <unk> per day <unk> . <unk> </s> 
Predicted: If parking is possible at site ( costs EUR 6 <unk> per day <unk> . <unk> </s> 

So that’s all folks! Hope you found this helpful.

