Hi,

This post will be about a new Word2Vec technique that has come after skip-gram and CBOW, introduced in this paper. Why the authors claim that GloVe is better than context-window based methods is that, it tries to combine both global and local statistics in order to create more general word representations.

GloVe is designed to take advantage two primary families of learning word vectors, i.e. global matrix factorization methods (e.g. latent symantic analysis) and local context window based methods (e.g. skip-gram). Each has their own weakness. Global methods do poor in word analogy tasks, i.e. is to as is to _. But context window based methods do well in such tasks but fail to incorporate the global picture of things while doing so.

I will not go into details of Word2Vec and why is it important. You can refer my previous posts for that.

The motivation for GloVe starts with that ** ratios** between the probabilities of words appearing next to each other carry more information than considering those probabilities individually. Don’t worry if this doesn’t make sense. I’ll break it down. Let us assume element of the co-occurance matrix , denotes the number of times word occur near . Next let us define . These are pretty basic things.

With this notation we can find why the ratio between probabilities work. Consider 4 words; *solid, gas, water* and *fashion*. If you calculate the ratio where from , you will make following observations.

- will be very high
- is very low
- will be higher than

So it can be seen this ratio is quite expressive. Now we say that what ever we are to derive should start with this ratio, leading to the function below.

(1)

Then after some serious mathematical crunching the authors arrive at the following cost function.

(2)

where is the vocabulary size, is the target word, is the context word and is the bias for word . But this cost fuction treat words far-apart and close together equally. To solve this, we introduce a weighing function, turning the equation to,

(3)

You can choose any function for that satisfy a set of properties specified in the paper. The recommended function for f (in the paper) is as below.

This is it. The main contribution of the paper is this new cost function we just saw. Then you train the model as you would do with skip-gram but with the new cost function.

This is how it looks like when everything comes together. Inputs are s, Outputs are s, bias embeddings are and s. Note that the cost function is defined only for a single input,output set. But the picture depicts the algorithm for a batch of data of size .

You can find my implementation of GloVe in github: Here

Please note that this is my perspective of the algorithm. The exact algorithm might differ.

If you run things without any trouble you should see something like below as the output.

Initialized Average loss at step 0: 124.901360 Nearest to three: misskelley, acknowledge, exegesis, glide, sy, achish, dynamism, mandela, Nearest to many: conglomeration, gorgeous, epicycle, kj, compartment, leppard, myoglobin, prolog, Nearest to some: babe, shamanistic, daddy, imparts, also, treasures, fittings, wtc, Nearest to up: cts, prophecies, interlisp, framers, ferrara, boast, vara, arecibo, Nearest to than: monet, waterproof, noriega, elia, mario, csd, courses, chinook, Nearest to will: tartarus, schott, comrade, oath, axe, coups, sami, fax, Nearest to and: nine, the, home, swearing, with, alves, emedicine, tertius, Nearest to used: the, meridians, mop, also, schenectady, anabolic, applied, methanol, Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, resolutely, tleilaxu, hydrazine, Nearest to troops: solve, firing, eudoxus, splendour, impoverished, resentful, tend, unintelligible, Nearest to older: maine, bornu, ahead, seizures, resides, pausanias, deoxygenated, disenchanted, Nearest to brother: ramparts, satirical, breakage, assumptions, mcgowan, funicular, kc, amigaone, Nearest to powers: gulf, mael, marginalized, professed, rgermeister, obscurity, bran, newsstand, Nearest to bbc: oscillator, dragoons, spalding, yam, furthered, pandavas, termite, correspondance, Nearest to running: attar, shuts, praising, heck, whispering, sustainable, its, lightweight, Nearest to articles: parallelism, drogheda, goblin, finegold, anastasia, marquise, melodic, admiring, ... Average loss at step 100000: 0.143277 Nearest to three: four, five, six, seven, two, eight, one, nine, Nearest to many: some, several, of, all, and, are, other, their, Nearest to some: all, many, of, and, in, their, the, for, Nearest to up: set, to, go, them, with, power, and, back, Nearest to than: more, greater, less, larger, rather, other, or, a, Nearest to will: can, may, be, should, would, not, that, you, Nearest to and: the, UNK, of, in, a, with, for, by, Nearest to used: for, to, is, by, be, as, in, are, Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, robertson, albinus, pinching, Nearest to troops: solve, firing, nsdap, silla, resentful, boots, caveats, triumvirate, Nearest to older: ahead, bornu, maine, seizures, rodeo, musketeers, pausanias, distancing, Nearest to brother: big, his, father, works, work, own, its, their, Nearest to powers: gulf, bran, mael, marginalized, valkyrie, montreal, laches, newsstand, Nearest to bbc: news, radio, top, side, on, earth, convention, based, Nearest to running: attar, shuts, heck, praising, catania, sustainable, wah, seems, Nearest to articles: finegold, goblin, interstellar, drogheda, migrating, xa, anastasia, frankfurter,

Here comes the third blog post in the series of light on math machine learning A-Z. This article is going to be about Word2vec algorithms. Word2vec algorithms output word vectors. Word vectors, underpin many of the natural language processing (NLP) systems, that have taken the world by a storm (Amazon...

This is the second article on my series introducing machine learning concepts with while stepping very lightly on mathematics. If you missed previous article you can find in here. Fun fact, I’m going to make this an interesting adventure by introducing some machine learning concept for every letter in the...

I’m starting a new series of blog articles following a beginner friendly approach to understanding some of the challenging concepts in machine learning. To start with, we will start with KL divergence. Code: Here Concept Grounding First of all let us build some ground rules. We will define few things...