This is a continuation of the previous post, Word2Vec (Part 1): NLP With Deep Learning with Tensorflow (Skip-gram). In this one I will be talking about another Word2Vec technique called Continuous Bag-of-Words (CBOW).
So what exactly is CBOW? CBOW, or continuous bag-of-words, is conceptually similar to a reversed skip-gram model. Instead of training the model on what the context (i.e. the words around the input) should be for a given input word, we train it on what the output (i.e. the word in the middle) should be for a given context.
So why exactly do we need CBOW when we already have the skip-gram model? It seems CBOW performs better than skip-gram, probably because its inputs are richer than those of the skip-gram model. In other words, given the sentence the dog barked at the mailman, the (input, output) tuples for skip-gram are single input, single output (e.g. input: 'dog', output: 'barked'), whereas in CBOW there are multiple inputs for a single output (e.g. input: ['the','barked','at'], output: 'dog'). As you can see from the example, in a given instance CBOW knows that dog occurs when the words [the, barked, at] are "collectively" present, where skip-gram only knows that dog occurs around barked.
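To make that difference concrete, here is a tiny self-contained sketch (my own illustration, not the tutorial's data pipeline) that builds CBOW-style (context, target) pairs for the example sentence, using one context word on each side:

# Illustrative only: enumerate (context, target) pairs for CBOW
sentence = ['the', 'dog', 'barked', 'at', 'the', 'mailman']
skip_window = 1  # one context word on each side

cbow_pairs = []
for t in range(skip_window, len(sentence) - skip_window):
    context = sentence[t - skip_window:t] + sentence[t + 1:t + skip_window + 1]
    cbow_pairs.append((context, sentence[t]))

print(cbow_pairs)
# [(['the', 'barked'], 'dog'), (['dog', 'at'], 'barked'),
#  (['barked', 'the'], 'at'), (['at', 'mailman'], 'the')]

In skip-gram, each of those pairs would instead be split into separate single-word (input, output) tuples.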
The conceptual model looks like a reversed skip-gram model. Although the idea behind CBOW is similar, it is not quite that simple, since our model is not symmetric. Below you can see what the model looks like.
Note that, unlike for the skip-gram model, I haven't included a separate implementation architecture, as it is very similar to the conceptual model in this case. To convert the conceptual model into the implementable one, all you need to do is process a batch of (input, output) tuples at once instead of one at a time. In other words, process b (b being the batch size) words for each column of the model at a single time (i.e. b x word[t-2], b x word[t-1], b x word[t+1], b x word[t+2]).
The idea behind CBOW is to use the average embedding vector, obtained by averaging the embedding vectors of all the input words, as the input to the learning model, instead of a single input as in the skip-gram model.
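To make the averaging step concrete, here is a minimal NumPy sketch (my own illustration; names such as embedding_matrix and context_ids are made up): for each training example we look up the embeddings of all 2*skip_window context words and average them into a single vector, which becomes the model input.

import numpy as np

vocabulary_size, embedding_size = 50, 8   # toy sizes for illustration
b, skip_window = 4, 1                     # batch size and window size
embedding_matrix = np.random.uniform(-1.0, 1.0, (vocabulary_size, embedding_size))

# a batch of context word IDs, shape (b, 2*skip_window)
context_ids = np.random.randint(0, vocabulary_size, (b, 2 * skip_window))

# look up embeddings -> (b, 2*skip_window, embedding_size), then average over the context axis
avg_embeddings = embedding_matrix[context_ids].mean(axis=1)
print(avg_embeddings.shape)               # (4, 8): one averaged vector per training example

The TensorFlow version of this averaging appears later in the model code.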
Now, the data generation needs to change slightly to accommodate multiple inputs. Here's how it is done.
def generate_batch(batch_size, skip_window):
    # skip_window is the number of words we look at on each side of a given word
    # creates a single batch of (context, target) examples
    global data_index
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    # e.g. if skip_window = 2 then span = 5
    # span is the length of the whole frame we consider for a single word (left + word + right)

    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

    # double-ended queue that drops the oldest word once a new one is appended beyond maxlen
    buffer = collections.deque(maxlen=span)

    # fill the buffer with the first `span` words
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size):
        target = skip_window  # target label is at the center of the buffer
        # we only need the words around a given word, not the word itself
        col_idx = 0
        for j in range(span):
            if j == span // 2:
                continue  # skip the middle (target) word
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[target]

        # slide the window forward by one word
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    assert batch.shape[0] == batch_size and batch.shape[1] == span - 1
    return batch, labels
You should note that the size of the batch is (b x span-1), as opposed to the (b x 1) we had in the skip-gram model. We also get rid of num_skips, because we use all the words in the span. Intuitively, the (i,j) element of batch can be understood as the word at offset j-skip_window (if j < skip_window) or j-skip_window+1 (if j >= skip_window) from the i-th word of labels in the document. For example, assuming skip_window=1 and the sentence the dog barked at the mailman, we will get,

batch: [['the','barked'],['dog','at'],['barked','the'],['at','mailman']]
labels: ['dog','barked','at','the']
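As a quick sanity check (my own addition, not part of the original post), you can build a tiny dictionary from the toy sentence, point generate_batch at it, and map the resulting IDs back to words. This assumes generate_batch, collections and np are already defined/imported as above; in the real notebook, data comes from the text8 corpus instead, so the actual words will differ.

# Illustrative only: run generate_batch over the toy sentence
words = ['the', 'dog', 'barked', 'at', 'the', 'mailman']
dictionary = {w: i for i, w in enumerate(set(words))}
reverse_dictionary = {i: w for w, i in dictionary.items()}
data = [dictionary[w] for w in words]
data_index = 0

batch, labels = generate_batch(batch_size=4, skip_window=1)
print([[reverse_dictionary[w] for w in row] for row in batch])
# [['the', 'barked'], ['dog', 'at'], ['barked', 'the'], ['at', 'mailman']]
print([reverse_dictionary[l] for l in labels.reshape(-1)])
# ['dog', 'barked', 'at', 'the']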
Now the model training also needs to undergo some serious changes. It is not that complicated though: all you need to do is get the data placeholder sizes right and write the correct symbolic operations for averaging over multiple inputs. Since I consider the training process to be the most important part, I'll break the code into small snippets and explain where necessary.
First we need to change the size of the train_dataset placeholder to (b x 2*skip_window) (remember, span-1 = 2*skip_window). Everything else remains the same as in the skip-gram model.
if __name__ == '__main__':
    batch_size = 128
    embedding_size = 128  # Dimension of the embedding vector.
    skip_window = 1       # How many words to consider left and right.
    num_skips = 2         # How many times to reuse an input to generate a label (not used in CBOW).
    valid_size = 16       # Random set of words to evaluate similarity on.
    valid_window = 100    # Only pick dev samples in the head of the distribution.
    # pick 16 samples from 100
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, random.sample(range(1000, 1000+valid_window), valid_size//2))
    num_sampled = 64      # Number of negative examples to sample.

    graph = tf.Graph()

    with graph.as_default(), tf.device('/cpu:0'):
        # Input data.
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2*skip_window])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Variables.
        # embeddings: a vector for each word in the vocabulary
        embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                                          stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
This is where we need some serious changes. We need to do the correct embedding lookups and average them properly. In summary, we look at each column of train_dataset (of size b x 2*skip_window) and look up embeddings for the word IDs in that column. We save each lookup in a temporary variable (embedding_i) and concatenate all of them to create a compound variable embeds (of size b x D x 2*skip_window), then perform a reduce mean over that last axis. This produces the averaged embeddings of the contextual words surrounding the words in train_labels, for each batch of data.
# Model.
embeds = None
for i in range(2*skip_window):
    embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:, i])
    print('embedding %d shape: %s' % (i, embedding_i.get_shape().as_list()))
    emb_x, emb_y = embedding_i.get_shape().as_list()
    if embeds is None:
        embeds = tf.reshape(embedding_i, [emb_x, emb_y, 1])
    else:
        embeds = tf.concat(2, [embeds, tf.reshape(embedding_i, [emb_x, emb_y, 1])])

assert embeds.get_shape().as_list()[2] == 2*skip_window
print("Concat embedding size: %s" % embeds.get_shape().as_list())
avg_embed = tf.reduce_mean(embeds, 2, keep_dims=False)
print("Avg embedding size: %s" % avg_embed.get_shape().as_list())
Now, instead of the single embeddings used in the skip-gram model, we feed the averaged embeddings into sampled_softmax_loss. Not much else has changed here.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, avg_embed,
                               train_labels, num_sampled, vocabulary_size))

optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
And finally,
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = generate_batch(batch_size, skip_window)
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
Average loss at step 0: 7.687360
Nearest to he: annoying, menachem, publicize, unwise, skinny, attractors, devastating, declination,
Nearest to is: iarc, agrarianism, revoluci, bachman, distinguish, schliemann, carbons, ne,
Nearest to some: routed, oscillations, reverence, collaborating, invitational, murderous, mortimer, migratory,
Nearest to only: walkway, loud, today, headshot, foundational, asceticism, tracked, hare,
...
Nearest to i: intermediates, backed, techs, duly, inefficiencies, ibadi, creole, poured,
Nearest to bbc: mprp, catching, slavic, mol, dorian, mining, inactivity, applet,
Nearest to cost: cakes, voltages, halter, disappeared, poking, buttocks, talents, salle,
Nearest to proposed: prisoners, ecuador, sorghum, complying, saturdays, positioned, probing, observables,
Average loss at step 100000: 2.422888
Nearest to he: she, it, they, there, who, eventually, neighbors, theses,
Nearest to is: was, has, became, remains, be, becomes, seems, cetacean,
Nearest to some: many, several, certain, most, any, all, both, these,
Nearest to only: settling, orchids, commutation, until, either, first, alcohols, rabba,
...
Nearest to i: we, you, ii, iii, iv, they, t, lm,
Nearest to bbc: news, corporation, coffers, inactivity, mprp, formatted, cara, pedestrian,
Nearest to cost: cakes, length, completion, poking, measure, enforcers, parody, figurative,
Nearest to proposed: introduced, discovered, foreground, suggested, dismissed, argued, ecuador, builder,
Full code is available for download at: 5_word2vec_CBOW.ipynb
If you're interested in newer word-vector techniques, here is a link to another technique that saw the light of day recently.
GloVe: Global Vectors for Word Representation