GloVe: Global Vectors for Word Representation + Implementation


This post is about GloVe, a word embedding technique that came after skip-gram and CBOW, introduced in this paper. The authors claim GloVe is better than context-window based methods because it combines both global and local statistics to create more general word representations.

GloVe is designed to take advantage of the two primary families of methods for learning word vectors: global matrix factorization methods (e.g. latent semantic analysis) and local context-window based methods (e.g. skip-gram). Each has its own weakness. Global methods perform poorly on word analogy tasks, i.e. a is to b as c is to _. Context-window based methods do well on such tasks, but fail to incorporate the global statistics of the corpus while doing so.

I will not go into the details of Word2Vec or why it is important. You can refer to my previous posts for that.

GloVe in More Detail

The motivation for GloVe starts with the observation that ratios between the probabilities of words appearing near each other carry more information than those probabilities considered individually. Don't worry if this doesn't make sense yet; I'll break it down. Let the (i,j) element of the co-occurrence matrix X, denoted X_{ij}, be the number of times word j occurs near word i. Next, define P_{ij}=P(j|i)=X_{ij}/\sum_{\forall k}X_{ik}. These are pretty basic quantities.
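To make the notation concrete, here is a toy sketch (the corpus and window size are invented for illustration) that builds X with a symmetric context window and derives the conditional probabilities P from it:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    """Count X[i, j]: how often word j appears within `window` tokens of word i."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for pos, center in enumerate(tokens):
        for off in range(-window, window + 1):
            ctx = pos + off
            if off == 0 or ctx < 0 or ctx >= len(tokens):
                continue  # skip the center word itself and out-of-range positions
            X[idx[center], idx[tokens[ctx]]] += 1.0
    return X

tokens = "the cat sat on the mat the cat lay on the mat".split()
vocab = sorted(set(tokens))
X = cooccurrence_matrix(tokens, vocab)
P = X / X.sum(axis=1, keepdims=True)  # P[i, j] = P(j|i) = X_ij / sum_k X_ik
```

Note that with a symmetric window X is symmetric, while P generally is not, since each row is normalized by a different total.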

With this notation we can see why the ratio between probabilities works. Consider the target words ice and steam and four probe words: solid, gas, water and fashion. If you calculate the ratio P_{ice,k}/P_{steam,k} for k \in \{solid, gas, water, fashion\} from X, you will make the following observations.

  1. P_{ice,solid}/P_{steam,solid} will be very high
  2. P_{ice,gas}/P_{steam,gas} will be very low
  3. P_{ice,water}/P_{steam,water} will be higher than P_{ice,fashion}/P_{steam,fashion}
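These observations can be checked numerically. The counts below are invented purely to illustrate the pattern (the paper reports real ratios computed from a large corpus):

```python
import numpy as np

# Hypothetical co-occurrence counts for the target rows "ice" and "steam"
# over the probe columns (solid, gas, water, fashion). The numbers are
# made up to mimic the qualitative pattern, not taken from real data.
probes = ["solid", "gas", "water", "fashion"]
X_ice = np.array([190.0, 6.0, 300.0, 2.0])
X_steam = np.array([2.0, 78.0, 220.0, 2.0])

P_ice = X_ice / X_ice.sum()      # P(k | ice)
P_steam = X_steam / X_steam.sum()  # P(k | steam)
ratio = P_ice / P_steam

for k, r in zip(probes, ratio):
    print(f"P(ice,{k}) / P(steam,{k}) = {r:.3f}")
```

The ratio is large for solid (related to ice only), small for gas (related to steam only), and closer to 1 for water and fashion (related to both or neither).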

Deriving the Cost Function

So it can be seen that this ratio is quite expressive. We therefore require that whatever we derive be grounded in this ratio, leading to the function below.

(1)   \begin{equation*} F(w_i,w_j,\tilde{w}_k)=\frac{P_{ik}}{P_{jk}} \end{equation*}

Then after some serious mathematical crunching the authors arrive at the following cost function.

(2)   \begin{equation*} J=\sum_{i,j=1}^{V} (w_i^T\tilde{w}_j + b_i + \tilde{b}_j - \log(1+X_{ij}))^2 \end{equation*}

where V is the vocabulary size, w_i is the target word vector, \tilde{w}_j is the context word vector, and b_i is the bias for word w_i. But this cost function treats all co-occurrences equally, whether rare or frequent. To solve this, we introduce a weighting function f, turning the equation into,

(3)   \begin{equation*} J=\sum_{i,j=1}^{V} f(X_{ij})(w_i^T\tilde{w}_j + b_i + \tilde{b}_j - \log(1+X_{ij}))^2 \end{equation*}

You can choose any function for f that satisfies a set of properties specified in the paper. The function recommended in the paper (with x_{max}=100 and \alpha=3/4) is as below.

\begin{equation*} f(x)= \begin{cases} (x/x_{max})^{\alpha} & \text{if } x<x_{max}\\ 1 & \text{otherwise} \end{cases} \end{equation*}
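A NumPy version of this weighting function, using the paper's recommended values x_max = 100 and exponent α = 3/4:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare co-occurrences, caps frequent ones at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)
```

Note that f(0) = 0, so word pairs that never co-occur contribute nothing to the cost, which is what lets training iterate only over the nonzero entries of X.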

That's it. The main contribution of the paper is this new cost function we just saw. You then train the model as you would with skip-gram, but with the new cost function.
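The whole training procedure can be sketched in a few lines of NumPy. This is a didactic, full-batch gradient descent version only; real implementations typically iterate over sampled nonzero (i, j) pairs and use AdaGrad. It uses log(1 + X_ij) as in the cost function above:

```python
import numpy as np

def train_glove(X, dim=10, epochs=300, lr=0.05, x_max=100.0, alpha=0.75):
    """Minimal full-batch gradient descent on the GloVe objective (a sketch)."""
    rng = np.random.default_rng(0)
    V = X.shape[0]
    W = rng.normal(scale=0.1, size=(V, dim))   # target vectors w_i
    Wc = rng.normal(scale=0.1, size=(V, dim))  # context vectors w~_j
    b = np.zeros(V)                            # target biases b_i
    bc = np.zeros(V)                           # context biases b~_j
    fX = np.where(X < x_max, (X / x_max) ** alpha, 1.0)  # weighting f(X_ij)
    logX = np.log1p(X)                         # log(1 + X_ij)
    losses = []
    for _ in range(epochs):
        err = W @ Wc.T + b[:, None] + bc[None, :] - logX  # inner term, per (i, j)
        losses.append(float(np.sum(fX * err ** 2)))
        grad = 2.0 * fX * err                  # d(loss) / d(inner term)
        gW, gWc = grad @ Wc, grad.T @ W        # compute both before updating
        W -= lr * gW
        Wc -= lr * gWc
        b -= lr * grad.sum(axis=1)
        bc -= lr * grad.sum(axis=0)
    return W + Wc, losses                      # paper sums target and context vectors

# Tiny invented co-occurrence matrix, just to exercise the loop.
X = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])
vectors, losses = train_glove(X, dim=4)
```

Returning W + Wc follows the paper's observation that summing the two sets of embeddings typically gives a small performance boost.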

Big Picture

This is how it looks when everything comes together. The inputs are the w_i, the outputs are the \tilde{w}_j, and the bias embeddings are the b_i and \tilde{b}_j. Note that the cost function is defined for a single input-output pair, but the picture depicts the algorithm for a batch of data of size b.

GloVe: High-level Architecture


GloVe Implementation

You can find my implementation of GloVe on GitHub: Here
Please note that this is my take on the algorithm; the exact algorithm may differ.

If things run without any trouble, you should see output like the following.

Average loss at step 0: 124.901360
Nearest to three: misskelley, acknowledge, exegesis, glide, sy, achish, dynamism, mandela,
Nearest to many: conglomeration, gorgeous, epicycle, kj, compartment, leppard, myoglobin, prolog,
Nearest to some: babe, shamanistic, daddy, imparts, also, treasures, fittings, wtc,
Nearest to up: cts, prophecies, interlisp, framers, ferrara, boast, vara, arecibo,
Nearest to than: monet, waterproof, noriega, elia, mario, csd, courses, chinook,
Nearest to will: tartarus, schott, comrade, oath, axe, coups, sami, fax,
Nearest to and: nine, the, home, swearing, with, alves, emedicine, tertius,
Nearest to used: the, meridians, mop, also, schenectady, anabolic, applied, methanol,
Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, resolutely, tleilaxu, hydrazine,
Nearest to troops: solve, firing, eudoxus, splendour, impoverished, resentful, tend, unintelligible,
Nearest to older: maine, bornu, ahead, seizures, resides, pausanias, deoxygenated, disenchanted,
Nearest to brother: ramparts, satirical, breakage, assumptions, mcgowan, funicular, kc, amigaone,
Nearest to powers: gulf, mael, marginalized, professed, rgermeister, obscurity, bran, newsstand,
Nearest to bbc: oscillator, dragoons, spalding, yam, furthered, pandavas, termite, correspondance,
Nearest to running: attar, shuts, praising, heck, whispering, sustainable, its, lightweight,
Nearest to articles: parallelism, drogheda, goblin, finegold, anastasia, marquise, melodic, admiring,
Average loss at step 100000: 0.143277
Nearest to three: four, five, six, seven, two, eight, one, nine,
Nearest to many: some, several, of, all, and, are, other, their,
Nearest to some: all, many, of, and, in, their, the, for,
Nearest to up: set, to, go, them, with, power, and, back,
Nearest to than: more, greater, less, larger, rather, other, or, a,
Nearest to will: can, may, be, should, would, not, that, you,
Nearest to and: the, UNK, of, in, a, with, for, by,
Nearest to used: for, to, is, by, be, as, in, are,
Nearest to recorded: scully, evasive, axumite, hoplite, swallowed, robertson, albinus, pinching,
Nearest to troops: solve, firing, nsdap, silla, resentful, boots, caveats, triumvirate,
Nearest to older: ahead, bornu, maine, seizures, rodeo, musketeers, pausanias, distancing,
Nearest to brother: big, his, father, works, work, own, its, their,
Nearest to powers: gulf, bran, mael, marginalized, valkyrie, montreal, laches, newsstand,
Nearest to bbc: news, radio, top, side, on, earth, convention, based,
Nearest to running: attar, shuts, heck, praising, catania, sustainable, wah, seems,
Nearest to articles: finegold, goblin, interstellar, drogheda, migrating, xa, anastasia, frankfurter,

More Resources

Home Page of GloVe
Lecture Video
