Hi,

This post will be about a new Word2Vec technique that has come after skip-gram and CBOW, introduced in this paper. Why the authors claim that GloVe is better than context-window based methods is that, it tries to combine both global and local statistics in order to create more general word representations.

GloVe is designed to take advantage two primary families of learning word vectors, i.e. global matrix factorization methods (e.g. latent symantic analysis) and local context window based methods (e.g. skip-gram). Each has their own weakness. Global methods do poor in word analogy tasks, i.e. is to as is to _. But context window based methods do well in such tasks but fail to incorporate the global picture of things while doing so.

I will not go into details of Word2Vec and why is it important. You can refer my previous posts for that.

The motivation for GloVe starts with that ** ratios** between the probabilities of words appearing next to each other carry more information than considering those probabilities individually. Don’t worry if this doesn’t make sense. I’ll break it down. Let us assume element of the co-occurance matrix , denotes the number of times word occur near . Next let us define . These are pretty basic things.

With this notation we can find why the ratio between probabilities work. Consider 4 words; *solid, gas, water* and *fashion*. If you calculate the ratio where from , you will make following observations.

- will be very high
- is very low
- will be higher than

So it can be seen this ratio is quite expressive. Now we say that what ever we are to derive should start with this ratio, leading to the function below.

(1)

Then after some serious mathematical crunching the authors arrive at the following cost function.

(2)

where is the vocabulary size, is the target word, is the context word and is the bias for word . But this cost fuction treat words far-apart and close together equally. To solve this, we introduce a weighing function, turning the equation to,

(3)

You can choose any function for that satisfy a set of properties specified in the paper. The recommended function for f (in the paper) is as below.

This is it. The main contribution of the paper is this new cost function we just saw. Then you train the model as you would do with skip-gram but with the new cost function.

This is how it looks like when everything comes together. Inputs are s, Outputs are s, bias embeddings are and s. Note that the cost function is defined only for a single input,output set. But the picture depicts the algorithm for a batch of data of size .

You can find my implementation of GloVe in github: Here

Please note that this is my perspective of the algorithm. The exact algorithm might differ.

If you run things without any trouble you should see something like below as the output.

Initialized Average loss at step 0: 124.901360 Nearest to three: misskelley, acknowledge, exegesis, glide, sy, achish, dynamism, mandela, Nearest to many: conglomeration, gorgeous, epicycle, kj, compartment, leppard, myoglobin, prolog, Nearest to some: babe, shamanistic, daddy, imparts, also, treasures, fittings, wtc, Nearest to up: cts, prophecies, interlisp, framers, ferrara, boast, vara, arecibo, Nearest to than: monet, waterproof, noriega, elia, mario, csd, courses, chinook, Nearest to will: tartarus, schott, comrade, oath, axe, coups, sami, fax, Nearest to and: nine, the, home, swearing, with, alves, emedicine, tertius, Nearest to used: the, meridians, mop, also, schenectady, anabolic, applied, methanol, Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, resolutely, tleilaxu, hydrazine, Nearest to troops: solve, firing, eudoxus, splendour, impoverished, resentful, tend, unintelligible, Nearest to older: maine, bornu, ahead, seizures, resides, pausanias, deoxygenated, disenchanted, Nearest to brother: ramparts, satirical, breakage, assumptions, mcgowan, funicular, kc, amigaone, Nearest to powers: gulf, mael, marginalized, professed, rgermeister, obscurity, bran, newsstand, Nearest to bbc: oscillator, dragoons, spalding, yam, furthered, pandavas, termite, correspondance, Nearest to running: attar, shuts, praising, heck, whispering, sustainable, its, lightweight, Nearest to articles: parallelism, drogheda, goblin, finegold, anastasia, marquise, melodic, admiring, ... Average loss at step 100000: 0.143277 Nearest to three: four, five, six, seven, two, eight, one, nine, Nearest to many: some, several, of, all, and, are, other, their, Nearest to some: all, many, of, and, in, their, the, for, Nearest to up: set, to, go, them, with, power, and, back, Nearest to than: more, greater, less, larger, rather, other, or, a, Nearest to will: can, may, be, should, would, not, that, you, Nearest to and: the, UNK, of, in, a, with, for, by, Nearest to used: for, to, is, by, be, as, in, are, Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, robertson, albinus, pinching, Nearest to troops: solve, firing, nsdap, silla, resentful, boots, caveats, triumvirate, Nearest to older: ahead, bornu, maine, seizures, rodeo, musketeers, pausanias, distancing, Nearest to brother: big, his, father, works, work, own, its, their, Nearest to powers: gulf, bran, mael, marginalized, valkyrie, montreal, laches, newsstand, Nearest to bbc: news, radio, top, side, on, earth, convention, based, Nearest to running: attar, shuts, heck, praising, catania, sustainable, wah, seems, Nearest to articles: finegold, goblin, interstellar, drogheda, migrating, xa, anastasia, frankfurter,

Jupyter Notebook for this Tutorial: Here Recently, I had to take a dive into the seq2seq library of Tensorflow. And I wanted to a quick intro to the library for the purpose of implementing a Neural Machine Translator (NMT). I simply wanted to know “what do I essentially need to...

IPython Notebook: Here Introduction: Why Optimization? It is no need to stress that optimization is at the core of machine learning algorithms. In fact this was a big enabler of deep learning; where “pre-training” (i.e. an optimization process) the network was used to find a good initialization for deep models....

Tensorflow Version: 1.2 Original paper: Convolution Neural Networks for Sentence Classification Full code: Here RNN can be miracle workers, But… So, you’re all exhausted from trying to implement a Recurrent Neural Network with Tensorflow to classify sentences? You somehow wrote some Tensorflow code that looks like a RNN but unable...