Word2Vec (Part 2): NLP With Deep Learning with Tensorflow (CBOW)


This is a continuation of the previous post, Word2Vec (Part 1): NLP With Deep Learning with Tensorflow (Skip-gram). In this one I will be talking about another Word2Vec technique called Continuous Bag-of-Words (CBOW).

Intuition (CBOW)

So what exactly is CBOW? CBOW, or continuous bag-of-words, is conceptually similar to a reversed skip-gram model. Instead of training the model to predict the context (i.e. the words around the input) for a given input word, we train the model to predict the output (i.e. the word in the middle) for a given context.

Why CBOW?

So why exactly do we need CBOW when we already have the skip-gram model? It seems CBOW performs better than skip-gram, probably because its inputs are richer than those of the skip-gram model. In other words, taking the sentence "the dog barked at the mailman", the (input, output) tuples for skip-gram are single input, single output (e.g. input: 'dog', output: 'barked'), whereas in CBOW there are multiple inputs for a single output (e.g. input: ['the', 'barked', 'at'], output: 'dog'). As you can see from the example, in a given instance CBOW knows that 'dog' occurs when the words ['the', 'barked', 'at'] are "collectively" present, whereas skip-gram only knows that 'dog' occurs around 'barked'.
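To make the contrast concrete, here is a small, self-contained sketch (a toy illustration, not part of the model code) that builds the (input, output) pairs both methods would see for the example sentence, assuming a context window of one word on each side.

sentence = ['the', 'dog', 'barked', 'at', 'the', 'mailman']
window = 1

skip_gram_pairs = []  # (single input word, single output word)
cbow_pairs = []       # (list of context words, single output word)
for i in range(window, len(sentence) - window):
    context = sentence[i - window:i] + sentence[i + 1:i + window + 1]
    for c in context:
        skip_gram_pairs.append((sentence[i], c))
    cbow_pairs.append((context, sentence[i]))

print(skip_gram_pairs)  # [('dog', 'the'), ('dog', 'barked'), ('barked', 'dog'), ...]
print(cbow_pairs)       # [(['the', 'barked'], 'dog'), (['dog', 'at'], 'barked'), ...]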

CBOW model

The conceptual model looks like a reversed skip-gram model. Although the idea of CBOW is similar, it is not quite that simple, since our model is not symmetric. Below you can see what the model looks like.

CBOW Model

Note that, unlike in the skip-gram post, I haven't included a separate implementation architecture of the model, as it is very similar to the conceptual model in this case. To convert the conceptual model into the implementable one, all you need to do is process a batch of (input, output) tuples at once instead of one at a time. In other words, process b (b – batch size) words for each column of the model in a single step (i.e. b x word[t-2], b x word[t-1], b x word[t+1], b x word[t+2]).

The idea behind CBOW is that we use the average embedding vector, obtained by averaging the embedding vectors of all the input words, as the input to the learning model, instead of a single input as in the skip-gram model.
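As a quick illustration of that averaging step, here is a minimal numpy sketch (the names and sizes are made up for the example, not taken from the model code): we look up the embedding row of each context word and average the rows to form the single vector fed to the softmax layer.

import numpy as np

vocabulary_size, embedding_size = 6, 4
# one random embedding row per vocabulary word
embeddings = np.random.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

context_ids = [0, 2, 3]  # e.g. the IDs of 'the', 'barked', 'at'
avg_embedding = embeddings[context_ids].mean(axis=0)
print(avg_embedding.shape)  # (4,) – a single vector of size embedding_size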

Intuition (Data Generation)

Now, the data generation needs to change slightly to accommodate multiple inputs. Here's how it is done.

import collections
import math
import random

import numpy as np
import tensorflow as tf

# `data` (the list of word IDs) comes from the preprocessing step of the skip-gram post
data_index = 0

def generate_batch(batch_size, skip_window):
    # skip_window is the number of words we look at on each side of a given word
    # creates a single batch of (context, target) examples
    global data_index
    assert skip_window%2==1

    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    # e.g. if skip_window = 2 then span = 5
    # span is the length of the whole frame we consider for a single word (left + word + right)
    # skip_window is the length of one side

    batch = np.ndarray(shape=(batch_size,span-1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

    # a fixed-length queue; appending at the end pushes the oldest element out at the front
    buffer = collections.deque(maxlen=span)

    # fill the buffer with the first `span` words
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size):
        target = skip_window  # target label is at the center of the buffer

        # copy every word of the span except the center word into row i of the batch
        col_idx = 0
        for j in range(span):
            if j==span//2: # skip the center (target) word
                continue
            batch[i,col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[target]

        # slide the window one word to the right
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    assert batch.shape[0]==batch_size and batch.shape[1]== span-1
    return batch, labels

You should note that the size of the batch is (b x span-1), as opposed to (b x 1) in the skip-gram model. Also, we get rid of num_skips, because we use all the words in the span. Intuitively, the (i,j) element of batch can be understood as the word at offset j-skip_window (if j<skip_window) or j-skip_window+1 (if j>=skip_window) from the i-th word of labels in the document. For example, assuming skip_window=1 and the sentence the dog barked at the mailman, we will get,
batch: [['the','barked'],['dog','at'],['barked','the'],['at','mailman']]
labels: ['dog','barked','at','the']
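
If you want to check the output yourself, a quick sanity check like the following (a sketch, assuming data and reverse_dictionary from the preprocessing step of the skip-gram post are already in scope) prints the generated (context, target) pairs as words:

batch, labels = generate_batch(batch_size=8, skip_window=1)
for row, label in zip(batch, labels[:, 0]):
    print([reverse_dictionary[idx] for idx in row], '->', reverse_dictionary[label])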

Training the Model

Now the model training also needs to undergo some changes, but it is not that complicated: all you need to do is get the data placeholder sizes right and write the correct symbolic operation for averaging over multiple inputs. Since I consider the training process to be the most important part, I'll break the code into small snippets and explain where necessary.

Initialize Variables

First we need to change the size of the train_dataset placeholder to (b x 2*skip_window) (remember, span-1 = 2*skip_window). Everything else remains the same as in the skip-gram model.

if __name__ == '__main__':
    # vocabulary_size, data and reverse_dictionary are defined in the preprocessing step (see Part 1)
    batch_size = 128
    embedding_size = 128 # Dimension of the embedding vector.
    skip_window = 1 # How many words to consider left and right.
    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100 # Only pick dev samples from the head of the distribution.
    # pick 8 samples from the 100 most frequent words and 8 from the 1000-1100 range
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,random.sample(range(1000,1000+valid_window), valid_size//2))
    num_sampled = 64 # Number of negative examples to sample.
    num_steps = 100001 # Number of training steps.

    graph = tf.Graph()

    with graph.as_default(), tf.device('/cpu:0'):

        # Input data.
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size,2*skip_window])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Variables.
        # embedding, vector for each word in the vocabulary
        embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
Embedding lookup and Averaging

This is where we need some serious changes. We need to do the correct embedding lookups and average them properly. In summary, for each column of train_dataset (of size b x 2*skip_window) we look up the embeddings for the word IDs in that column, reshape each lookup into a temporary tensor (embedding_i, of size b x D x 1), and concatenate all of them along the last axis into a compound tensor (embeds, of size b x D x 2*skip_window). We then take the mean over that last axis. This produces the averaged embedding of the contextual words for each word in train_labels, for each batch of data.

        # Model.
        embeds = None
        for i in range(2*skip_window):
            embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:,i])
            print('embedding %d shape: %s'%(i,embedding_i.get_shape().as_list()))
            emb_x,emb_y = embedding_i.get_shape().as_list()
            if embeds is None:
                embeds = tf.reshape(embedding_i,[emb_x,emb_y,1])
            else:
                embeds = tf.concat(2,[embeds,tf.reshape(embedding_i,[emb_x,emb_y,1])])

        assert embeds.get_shape().as_list()[2]==2*skip_window
        print("Concat embedding size: %s"%embeds.get_shape().as_list())
        avg_embed =  tf.reduce_mean(embeds,2,keep_dims=False)
        print("Avg embedding size: %s"%avg_embed.get_shape().as_list())
Loss Function and Optimizing

Now, instead of the single-word embeddings used in the skip-gram model, we feed the averaged embeddings to sampled_softmax_loss. Not much else has changed here.

        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, avg_embed,
                               train_labels, num_sampled, vocabulary_size))
        optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

        # We use the cosine distance:
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
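
To see why the matmul above gives cosine similarity, here is a small numpy check (a toy example, not part of the graph): once the rows are L2-normalized, the dot product of two rows equals the cosine of the angle between the original vectors.

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_from_normalized = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))
print(cosine, cosine_from_normalized)  # both ~0.596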
Making things Work

And finally,

    with tf.Session(graph=graph) as session:
        tf.initialize_all_variables().run()
        print('Initialized')
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, skip_window)
            feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
            _, l = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 2000 == 0:
                if step > 0:
                    average_loss = average_loss / 2000
                    # The average loss is an estimate of the loss over the last 2000 batches.
                print('Average loss at step %d: %f' % (step, average_loss))
                average_loss = 0
            # note that this is expensive (~20% slowdown if computed every 500 steps)
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
        final_embeddings = normalized_embeddings.eval()
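
Once training finishes, final_embeddings is just a numpy array with one L2-normalized row per vocabulary word, so you can query nearest neighbours with a plain dot product. A small sketch, assuming dictionary and reverse_dictionary from the preprocessing step are in scope and the query word is in the vocabulary:

    query = dictionary['dog']
    scores = np.dot(final_embeddings, final_embeddings[query])
    nearest = (-scores).argsort()[1:9]  # skip the word itself
    print('Nearest to dog:', [reverse_dictionary[idx] for idx in nearest])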

Results

Average loss at step 0: 7.687360
Nearest to he: annoying, menachem, publicize, unwise, skinny, attractors, devastating, declination,
Nearest to is: iarc, agrarianism, revoluci, bachman, distinguish, schliemann, carbons, ne,
Nearest to some: routed, oscillations, reverence, collaborating, invitational, murderous, mortimer, migratory,
Nearest to only: walkway, loud, today, headshot, foundational, asceticism, tracked, hare,
...
Nearest to i: intermediates, backed, techs, duly, inefficiencies, ibadi, creole, poured,
Nearest to bbc: mprp, catching, slavic, mol, dorian, mining, inactivity, applet,
Nearest to cost: cakes, voltages, halter, disappeared, poking, buttocks, talents, salle,
Nearest to proposed: prisoners, ecuador, sorghum, complying, saturdays, positioned, probing, observables,
Average loss at step 100000: 2.422888
Nearest to he: she, it, they, there, who, eventually, neighbors, theses,
Nearest to is: was, has, became, remains, be, becomes, seems, cetacean,
Nearest to some: many, several, certain, most, any, all, both, these,
Nearest to only: settling, orchids, commutation, until, either, first, alcohols, rabba,
...
Nearest to i: we, you, ii, iii, iv, they, t, lm,
Nearest to bbc: news, corporation, coffers, inactivity, mprp, formatted, cara, pedestrian,
Nearest to cost: cakes, length, completion, poking, measure, enforcers, parody, figurative,
Nearest to proposed: introduced, discovered, foreground, suggested, dismissed, argued, ecuador, builder,

Full code is available for download at: 5_word2vec_cbow.py

A Newer Word Embedding Technique: GloVe

If you're interested in newer word embedding techniques, here is a link to one that appeared recently.
GloVe: Global Vectors for Word Representation


  • Daniel

    When running this code, I am facing an issue:

    Tensor("Mean:0", shape=(128, 128), dtype=float32) must be from the same graph as Tensor("Variable_1:0", shape=(500, 128), dtype=float32_ref, device=/device:CPU:0).

    • admin

      This is due to a GPU/CPU data sharing/allocation issue. Are you running this on a GPU? Try removing tf.device('/cpu:0') and running the code again.


  • Ilona

    Thank you for your great blog post. I’m just wondering about the
    assert skip_window%2==1
    in your generate_batch function – can you tell me the reason?

    • admin

      Thanks. I think this was an implementation-specific decision of my own, but skip_window could really be any value. I'll modify the code to get rid of that assert.