GloVe: Global Vectors for Word Representation + Implementation


This post will be about a new Word2Vec technique that has come after skip-gram and CBOW, introduced in this paper. Why the authors claim that GloVe is better than context-window based methods is that, it tries to combine both global and local statistics in order to create more general word representations.

GloVe is designed to take advantage two primary families of learning word vectors, i.e. global matrix factorization methods (e.g. latent symantic analysis) and local context window based methods (e.g. skip-gram). Each has their own weakness. Global methods do poor in word analogy tasks, i.e. a is to b as c is to _. But context window based methods do well in such tasks but fail to incorporate the global picture of things while doing so.

I will not go into details of Word2Vec and why is it important. You can refer my previous posts for that.

GloVe in More Detail

The motivation for GloVe starts with that ratios between the probabilities of words appearing next to each other carry more information than considering those probabilities individually. Don’t worry if this doesn’t make sense. I’ll break it down. Let us assume (i,j) element of the co-occurance matrix X, X_{i,j} denotes the number of times word j occur near i. Next let us define P_{ij}=P(j|i)=X_{ij}/\sum_{\forall k}X_{ik}. These are pretty basic things.

With this notation we can find why the ratio between probabilities work. Consider 4 words; solid, gas, water and fashion. If you calculate the ratio P_{ice,k}/P_{steam,k} where k \in \{solid, gas, water, fashion\} from X, you will make following observations.

  1. P_{ice,solid}/P_{steam,solid} will be very high
  2. P_{ice,gas}/P_{steam,gas} is very low
  3. P_{ice,water}/P_{steam,water} will be higher than P_{ice,fashion}/P_{steam,fashion}

Deriving the Cost Function

So it can be seen this ratio is quite expressive. Now we say that what ever we are to derive should start with this ratio, leading to the function below.

(1)   \begin{equation*} F(w_i,w_j,\tilde{w}_k)=\frac{P_{ik}}{P_{jk}} \end{equation*}

Then after some serious mathematical crunching the authors arrive at the following cost function.

(2)   \begin{equation*} J=\sum_{i,j=1}^{V} (w_i^T\tilde{w}_j + b_i + \tilde{b}_j - log(1+X_{ij}))^2 \end{equation*}

where V is the vocabulary size, w_i is the target word, \tilde{w}_j is the context word and b_i is the bias for word w_i. But this cost fuction treat words far-apart and close together equally. To solve this, we introduce a weighing function, turning the equation to,

(3)   \begin{equation*} J=\sum_{i,j=1}^{V} f(X_{ij})(w_i^T\tilde{w}_j + b_i + \tilde{b}_j - log(1+X_{ij}))^2 \end{equation*}

You can choose any function for f that satisfy a set of properties specified in the paper. The recommended function for f (in the paper) is as below.
\begin{aligned} f(x)=         \begin{cases}         x/x_{max} \text{ if } x<x_{max}\\         1 \text{ else }         \end{cases}  \end{aligned}

This is it. The main contribution of the paper is this new cost function we just saw. Then you train the model as you would do with skip-gram but with the new cost function.

Big Picture

This is how it looks like when everything comes together. Inputs are w_is, Outputs are \tilde{w}_js, bias embeddings are b_i and \tilde{b}_js. Note that the cost function is defined only for a single input,output set. But the picture depicts the algorithm for a batch of data of size b.

GloVe: High-level Architecture

GloVe: High-level Architecture

GloVe Implementation

You can find my implementation of GloVe in github: Here
Please note that this is my perspective of the algorithm. The exact algorithm might differ.

If you run things without any trouble you should see something like below as the output.

Average loss at step 0: 124.901360
Nearest to three: misskelley, acknowledge, exegesis, glide, sy, achish, dynamism, mandela,
Nearest to many: conglomeration, gorgeous, epicycle, kj, compartment, leppard, myoglobin, prolog,
Nearest to some: babe, shamanistic, daddy, imparts, also, treasures, fittings, wtc,
Nearest to up: cts, prophecies, interlisp, framers, ferrara, boast, vara, arecibo,
Nearest to than: monet, waterproof, noriega, elia, mario, csd, courses, chinook,
Nearest to will: tartarus, schott, comrade, oath, axe, coups, sami, fax,
Nearest to and: nine, the, home, swearing, with, alves, emedicine, tertius,
Nearest to used: the, meridians, mop, also, schenectady, anabolic, applied, methanol,
Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, resolutely, tleilaxu, hydrazine,
Nearest to troops: solve, firing, eudoxus, splendour, impoverished, resentful, tend, unintelligible,
Nearest to older: maine, bornu, ahead, seizures, resides, pausanias, deoxygenated, disenchanted,
Nearest to brother: ramparts, satirical, breakage, assumptions, mcgowan, funicular, kc, amigaone,
Nearest to powers: gulf, mael, marginalized, professed, rgermeister, obscurity, bran, newsstand,
Nearest to bbc: oscillator, dragoons, spalding, yam, furthered, pandavas, termite, correspondance,
Nearest to running: attar, shuts, praising, heck, whispering, sustainable, its, lightweight,
Nearest to articles: parallelism, drogheda, goblin, finegold, anastasia, marquise, melodic, admiring,
Average loss at step 100000: 0.143277
Nearest to three: four, five, six, seven, two, eight, one, nine,
Nearest to many: some, several, of, all, and, are, other, their,
Nearest to some: all, many, of, and, in, their, the, for,
Nearest to up: set, to, go, them, with, power, and, back,
Nearest to than: more, greater, less, larger, rather, other, or, a,
Nearest to will: can, may, be, should, would, not, that, you,
Nearest to and: the, UNK, of, in, a, with, for, by,
Nearest to used: for, to, is, by, be, as, in, are,
Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, robertson, albinus, pinching,
Nearest to troops: solve, firing, nsdap, silla, resentful, boots, caveats, triumvirate,
Nearest to older: ahead, bornu, maine, seizures, rodeo, musketeers, pausanias, distancing,
Nearest to brother: big, his, father, works, work, own, its, their,
Nearest to powers: gulf, bran, mael, marginalized, valkyrie, montreal, laches, newsstand,
Nearest to bbc: news, radio, top, side, on, earth, convention, based,
Nearest to running: attar, shuts, heck, praising, catania, sustainable, wah, seems,
Nearest to articles: finegold, goblin, interstellar, drogheda, migrating, xa, anastasia, frankfurter,

More Resources

Home Page of GloVe
Lecture Video

Light on Math Machine Learning: Intuitive Guide to Understanding Word2vec

Here comes the third blog post in the series of light on math machine learning A-Z. This article is going to be about Word2vec algorithms. Word2vec algorithms output word vectors. Word vectors, underpin many of the natural language processing (NLP) systems, that have taken the world by a storm (Amazon...

Light on Math Machine Learning: Intuitive Guide to Convolution Neural Networks

This is the second article on my series introducing machine learning concepts with while stepping very lightly on mathematics. If you missed previous article you can find in here. Fun fact, I’m going to make this an interesting adventure by introducing some machine learning concept for every letter in the...

Light on Math Machine Learning: Intuitive Guide to Understanding KL Divergence

I’m starting a new series of blog articles following a beginner friendly approach to understanding some of the challenging concepts in machine learning. To start with, we will start with KL divergence. Code: Here Concept Grounding First of all let us build some ground rules. We will define few things...