Word2Vec (Part 1): NLP With Deep Learning with Tensorflow (Skip-gram)


I will be writing about 2 popular techniques for converting words to vectors; Skip-gram model and Continuous Bag of Words (CBOW). These are unsupervised learning methods to learn the context of words. This post is structured as follows. First I’ll talk about the motivation for Word2Vec techniques. Next I will delve into the details about how Skip-gram and CBOW work. Next I will explain the algorithms in a different vantage point (i.e. implementation). Finally, I will point out technical concerns that might help you to code the algorithms. This will be explained inline with assignment 5_word2vec.ipynb from Udacity course Deep Learning and python syntax if coding required.

Introduction to Word2Vec

Word2Vec is motivated as an effective technique to elicit knowledge from large text corpora in an unsupervised manner. Given that Web (or even just Wikipedia) holds copious amounts of text, it would be immensely beneficial for Natural Language Processing (NLP) to use this already available data in an unsupervised manner. It should also be understood that, labeling data is a very “tedious” and “laborious” task requiring massive number of human hours. For further reading on these techniques, refer “Efficient Estimation of Word Representations in Vector Space” by Mikolov et.al from ICML 2013 and”Distributed Representations of Words and Phrases and their Compositionality” by Mikolov et.al from NIPS 2013. Finally, this blog is quite well written and helped me to understand about these topics.


So how exactly does Word2Vec help us to do NLP? This is by learning a numerical embeddings space of a vocabluary which would result in similar words ending up close to each other. For example, cat will be placed close to kitten, where dog will be placed further than kitten and iphone even further. By learning such numerical representation of the words in the vocabulary, further enables us to do various vectorized operations producing interesting results. For example, operation kitten - cat + dog should result in an embedded vector very close to puppy.

Below is an example of a learnt embedding space visualized using T-SNE. You can see that similar words are placed closed to each other.
Embedding Space

Skip-gram Model; an Approach to Learning Embeddings


So we approach a step closer to the final battle; Understanding how Word2Vec works. The main idea behind skip-gram is training a model on the context of each word, so similar words will have similar numerical representations (i.e similar embedding vectors). For example, when we see the following sentence, cat climbed the tree, we say to the model, if you see the word cat, you should expect words climbed or tree very soon. If we repeat this adequately, our model will learn a representation (i.e. Embedding) of the given vocabulary.


Now let’s understand how learning context of words help us to learn good embeddings. When repeating this process for 2 sentences, for example cat climbed a tree and kitten climbed a tree, and if we train a model with (input:cat,output:tree) and (input:kitten,output:tree) this will eventually force the model to understand that, cat and kitten both as related to tree, thus placing cat and kitten closely in the embedding space.

The model

Now let’s see how exactly we make this work with Skip-gram model. Also let us define the following Notation
V – Vocabulary Size (Number of unique words in the corpora)
P – The Projection or the Embedding Layer
D – Dimensionality of the Embedding Space
b – Size of a single Bach

The model we are assuming is a simple logistic regression (Softmax) model. So the high-level architecture looks like follows. I have included 2 diagrams, where the one on the left is the conceptual architecture, where the one on the right is the implemented architectures. Though the architectures is different, it should be noted it does not make any changes to functionality.

Conceptual and Implemented Diagrams of Skip-gram

Let’s talk about the differences in two views of the model. To make things obvious, let us assume the sentence The dog barked at the mailman. We can visualize the first model as a model that is being trained on data such as (input:'dog',output:['the','barked','at','the','mailman']) while sharing weights and biases of the softmax layer. In other words, the conceptual model trains multiple outputs to the same input simultaneously. However, this is practically difficult to implement. A more practical solution would be to break the above tuple (input:'dog', output:['the','barked','at','the','mailman']) to several single (input,output) tuples as (input:'dog', output:'the'),(input:'dog', output:'barked'),...,(input:'dog', output:'mailman'). This is what is illustrated in the right-sided model.

Skip-gram Model

Intuition (Data Generation)

First here’s the intuition for the Skip-gram model through an example. Consider the sentence, The dog barked at the mailman. First, we pick a word from a text (dog), then we define a window called skip_window which is the number of words back and forth from the selected word we are going to look at. For example if skip_window = 2 then, ['The','dog','barked','at'] will be inside the window. Also, let us assume the definition that skip_window assumes only one-side from the selected word, where span is the whole window. Let us now define another parameter called num_skips denoting the number of different output words we will pick within the span for a single word. Now assuming skip_window=2 and num_skips=2 we compose the following (input,output) tuples. ('dog','barked'),('dog','the').

I will not delve into specific details of batch generation as this is not the most important part of the model. Please refer functions read_data, build_dataset and generate_batch from 5_word2vec.ipynb for generating batches of data. Also you will be able to find rigorously commented completed assignment from my Github account (5_word2vec.py and 5_word2vec_cbow.py).

The important thing to understand is that, these functions will convert textual words to a numerical representations. In other words, it will assign a unique “ID” to unique words. For example, if the sentence The dog barked at the mailman is fed to the functions, they will output [1,2,3,1,4] as the output where id(‘the’)=1, id(‘dog’)=2, etc.

Training the Model

Now, we have (input,output) tuples without using any labeled data, thus Unsupervised learning. Now all there left to do is train the model with produced inputs and outputs. I will explain how to implement skip-gram model with tensorflow.

First we define the required input,output and other required Tensors and parameter values. This should be self explanatory if you are slightly familiar with how tensorflow works.

if __name__ == '__main__':
    batch_size = 128
    embedding_size = 128
    skip_window = 1 # How many words to consider left and right.
    num_skips = 2 # How many times to reuse an input to generate a label.

    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100 
    # pick 8 samples from (0,100) and (1000,1100) each ranges. lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,random.sample(range(1000,1000+valid_window), valid_size//2))
    num_sampled = 64 # Number of negative examples to sample for sampeled_softmax.

    graph = tf.Graph()

    with graph.as_default(), tf.device('/cpu:0'):

        # Input data.
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Variables.
        embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

Now we define symbolic operations. First embedding_lookup is used to look up corresponding embeddings of the inputs. In other words, the embedding layer P, is of size VxD, which will contain embedding vectors (D dimensional) for all the words in the vocabulary. In order to train the model for a single instance you need to find the corresponding embedding vectors for the given input words by an id lookup (train_dataset in this case contain a set of unique ids corresponding to each word in the batch of data). Although it possible to do this manually, use of this function is required as tensorflow doesn’t allow index lookup with Tensors.

        embed = tf.nn.embedding_lookup(embeddings, train_dataset)

Next, we use an altered version of softmax to calculate the loss. Since the vocabulary can be very large (~50000) for a standard text file it is computationally wasteful to calculate full softmax loss. Therefore, we sample num_sampled number of negative softmax units from full set of V softmax units (units that should have 0 as their output) and calculate the loss only with them. This is found to be very effective approximation to the full softmax but with increased performance.

        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                               train_labels, num_sampled, vocabulary_size))

Now we use an advance gradient optimization technique called Adagrad which allows you to find “needle in the haystack”. This works better than standard GD because, Adagrad works very well when there are lot of variables to optimize (softmax_weights, softmax_biases and embed). And mind you all these variables are of 1,000,000 scale.

        optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

This code snippet calculates the similarity (i.e. cosine distance) between a given minibatch of words against all the words, using the learnt embeddings.

        # We use the cosine distance:
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

Now, all the required things nicely defined, all we have to do is feed the data to the placeholders and run the optimizer, which will minimize the loss with respect to the parameters (softmax_weights, softmax_biases and embeds). Then we also run the loss operation to print out the average loss. This is important as we can see if something is wrong if the loss is behaving weird (e.g. negative or increasing with time).

    with tf.Session(graph=graph) as session:
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, num_skips, skip_window)
            feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
            _, l = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 2000 == 0:
                if step > 0:
                    average_loss = average_loss / 2000
                    # The average loss is an estimate of the loss over the last 2000 batches.
                print('Average loss at step %d: %f' % (step, average_loss))
                average_loss = 0

Now, every 10000 steps, we evaluate our model by looking the most similar top_k words to the data in the valid_dataset.

            # note that this is expensive (~20% slowdown if computed every 500 steps)
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log = '%s %s,' % (log, close_word)
        final_embeddings = normalized_embeddings.eval()


Now if you do this properly, you should see improvements of word similarity over time. Here are the results I got at 0th step and after 100,000 steps.

Average loss at step 0: 7.805069
Nearest to when: drummer, acceleration, compost, loan, severe, quicker, nite, pies,
Nearest to people: hr, infertile, detractors, programmability, capacitors, lounge, report, horn,
Nearest to american: department, corvettes, rejoining, lindy, laconic, wels, kojiki, bibliography,
Nearest to than: dallas, corrino, whispers, empowered, intakes, homer, salvage, fern,
Nearest to except: finite, altitudes, particular, helper, endeavoured, scenes, helaman, myocardium,
Nearest to report: occupants, costing, brooker, armas, adversaries, powering, hawkwind, people,
Nearest to professional: bronx, covalently, reappeared, inti, anthologies, alaska, described, midwestern,
Nearest to bbc: cruzi, galatia, football, grammaticus, tights, homilies, agonists, turbines,
Average loss at step 100000: 3.359176
Nearest to when: if, before, while, although, where, after, though, because,
Nearest to people: children, students, players, individuals, men, adapting, women, americans,
Nearest to american: british, australian, german, french, italian, scottish, canadian, soccer,
Nearest to than: or, much, announcements, and, leningrad, spark, kish, while,
Nearest to except: especially, embodied, endeavoured, scenes, devonshire, every, indoors, example,
Nearest to report: sloop, woodbridge, costing, pit, occupants, atheism, jannaeus, uns,
Nearest to professional: anthologies, major, cumings, inti, reset, hollow, copyrighted, covalently,
Nearest to bbc: paper, galatia, fliers, flavia, earth, manufacturing, icosahedron, grammaticus,

This is it for the skip-gram model.
Next: Continuous Bag of Words Model

New Word2Vec technique: GloVe

If you’re interested about new Word2Vec techniques, I’m putting the link to a new Word2Vec technique that saw the light recently.
GloVe: Global Vectors for Word Representation

  • Shadab Shaikh

    Hey Thushan.

    This is by far the most insightful blog I have read on the implementation of skip gram using tensorflow. Thanks a lot.

    Though I have a doubt regarding the parameter ‘num_skips’ used in generate_batch. What is the use of it exactly? Can you help me with it?

    • Shadab Shaikh

      Also does the equation:

      skip_window * 2 == num_skips

      usually hold true? I mean does it give good results or anything?

      • admin

        I’m not sure where you picked this equation (Hopefully not from my blogpost). But this does not necessarily need to hold true. For example, span = skip_window*2 + 1. But num_skips is totally upto you. The only constraint sensible is that num_skips < span ( so you don't have duplicates in the training set)

    • admin

      Hey, thanks for reading this. šŸ™‚
      To understand the purpose of num_skips, you first need to understand “span”. Span is the window we define around a certain word (input). Whatever the output for that input need to be inside the span. So the number of such outputs we pick for a single input is called num_skips. Hope this clarifies things.

      • Shadab Shaikh

        Yeah I understood what num_skips is. Thanks.
        But why is it even there? Isnt skip window (and consequently span) enough parameters to generate our batch and label?
        Why do we need to explicitly state how many times we use that particular input ?

        • admin

          Hi Shadab,

          Well, your argument makes sense and it is not crucial to have skip_window. And frankly, I’m not sure about the exact reason why it is there. But I think this act as some sort of regularization by reducing overfitting. In other words, if you use all the possible combinations from the training set, you can easily overfit to those exact words in the training set.

  • harsh

    what does final_embeddings contain?and why it is less than vocabulary size

    • admin

      Assuming that the final embedding is the output vector given by the neural network for a word,
      (1) Final embedding is a vectorized representation of the word, so that words with similar meanings will have similar values and vise-versa
      (2) The size of the vector (i.e. output size) is usually small as the size is determined by us and that’s one of the key points in embedding. To make the dimensionality small when dealing with a large vocabulary.

      • Laurie

        Hello, very good explanation !!
        But, I am a little confused. Now, when we trained the model, how can we generate vector from any word?

        • admin

          That’s the embedding layer. i th row of the embedding layer is the learned vector for i th word in the vocabulary.

  • Tom

    As I read in some others articles about Word2vec,
    -Input vector is [1, Vocabulary_size]
    -Hidden layer is size of word vector
    But implementation is difference. Input vector is a random vector with size [1,V] V is size of word vector. And the hidden layer is Vocabulary size
    Could you please explain this point?