A Practical Guide to Understanding Stochastic Gradient Descent Methods: Workhorse of Machine Learning


IPython Notebook: Here

Introduction: Why Optimization?

There is no need to stress that optimization is at the core of machine learning algorithms. In fact, it was a big enabler of deep learning: “pre-training” the network (i.e. an optimization process) was used to find a good initialization for deep models. This allowed early deep models to train without suffering from vanishing or exploding gradients.

Good optimization is an essential part of machine learning, as a significant performance boost often comes from better optimization techniques. This is especially the case for stochastic algorithms: in stochastic settings we only observe a subset of the data at a given time, and better optimization techniques help us exploit the data efficiently. One such trick is to maintain a running mean of the gradients over time and add it to the current gradient. This is in fact the idea behind momentum. In this post, we will be talking about Stochastic Gradient Descent (SGD) methods.

Why this post when there are million other search results?

Yes, there are many good reads for understanding the basics of the latest SGD methods. A few such resources are:
1. Sebastian Ruder’s Blog
2. Tensorflow official page
3. Stanford Course
4. Geoffrey Hinton’s Slides

Though these resources are great for understanding the nuts and bolts of SGD methods and future trends, they don’t go into the practical details of why one would prefer one method over another. In other words, they don’t show the practical advantages of each method. In this blog post, I highlight with examples how each method behaves in a variety of settings, and thereby progressively move towards SGD optimization methods that perform better.

2D Problem Formulation

To understand the methods we assume the following problem setting. Assume we are trying to minimize y, where y=f(\theta) for some function of \theta. At time t=0 we start with some initial value \theta_0 and continue finding solutions \theta_1, \theta_2, \ldots, \theta_T s.t. we minimize y. For the following methods we simply set,
y = \theta^2

1. SGD Optimization: Take a Step in the Opposite Direction of Gradient

Update Rule

First we will talk about the naive Stochastic Gradient Descent (SGD) optimization method. Here, at time t, from the point \theta_t we take a small step in the opposite direction of the gradient to produce \theta_{t+1}. That is,
\theta_{t+1} = \theta_t - \eta \nabla_\theta y

where \eta is the learning rate, which controls the magnitude of the step we take.
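
To make this concrete, below is a minimal sketch in plain Python of the SGD update on our toy problem y = \theta^2. The starting point and learning rate are illustrative choices, not the exact values used in the accompanying notebook.

```python
def grad(theta):
    # dy/dtheta for y = theta**2
    return 2.0 * theta

theta = 2.0   # illustrative starting point
eta = 0.1     # illustrative learning rate
for t in range(20):
    theta = theta - eta * grad(theta)   # step in the opposite direction of the gradient
print(theta)  # theta shrinks towards the minimum at 0
```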

Visualizing SGD

This is what it looks like on our problem. We decrease the transparency of the points from old to new, so newer points (in time) are less transparent.

2. SGD with Momentum: Roll the Ball [1]

Update Rule

With momentum, in addition to the step of \eta \nabla_\theta y, we add another component: the accumulated gradient of the past. Mathematically,
v_t = \gamma v_{t-1} + \eta \nabla_\theta y
\theta_{t+1} = \theta_t - v_t
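
As a hedged sketch, here is the same loop with the momentum term added; \gamma = 0.9 is a commonly used default, assumed here rather than taken from the notebook.

```python
grad = lambda theta: 2.0 * theta   # gradient of y = theta**2
theta, v = 2.0, 0.0
eta, gamma = 0.1, 0.9              # illustrative learning rate and momentum coefficient
for t in range(20):
    v = gamma * v + eta * grad(theta)   # accumulate past gradients into a velocity
    theta = theta - v                   # take the accumulated step
```

Because v keeps a decayed memory of past gradients, consecutive steps in the same direction reinforce each other, which is exactly what makes the ball “roll” faster.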

Intuition behind momentum

Imagine two balls freely rolling down the same hill. For one ball we introduce the following behavioral change. We stop the ball at every 5m interval with our hand for an infinitesimally small amount of time, so that every 5m the ball has a starting velocity of 0. Obviously the ball freely rolling down the hill will reach the bottom much quicker than the ball we reset every 5m.

Here the ball reset every 5m represents the SGD optimization. The ball rolling freely represents SGD with Momentum.

Visualizing SGD with Momentum

Below are the results for SGD with momentum. As you can see, SGD with Momentum reaches a lower value than SGD in the same number of steps.

Visualizing the Evil side of Momentum

We now have a quicker way of reaching the bottom, so let us run the optimization a bit longer.

You can see that SGD with momentum overshoots quite a bit if we continue optimizing after reaching the bottom. This is explicable: the accumulated gradient keeps pushing the optimization forward even when the current gradient is small. How can we fix this?

3. Nesterov’s Accelerated Gradient (NAG): Look Ahead Before Jumping! [3]

Update Rule

One way to reduce this effect is to look ahead before taking the step. We already have an approximation of where we will be in the next step, namely \theta_t-\gamma v_{t-1}, so we can evaluate the gradient at f(\theta_t-\gamma v_{t-1}) instead. This way we can anticipate the overshoot one step before it actually happens, and the accumulated gradient adapts quicker as well.

v_t = \gamma v_{t-1} + \eta \nabla_\theta f(\theta_t-\gamma v_{t-1})
\theta_{t+1} = \theta_t - v_t
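
A minimal sketch of this look-ahead step, again with illustrative hyperparameters; the only change from the momentum sketch is that the gradient is evaluated at the look-ahead point rather than at the current \theta.

```python
grad = lambda theta: 2.0 * theta   # gradient of y = theta**2
theta, v = 2.0, 0.0
eta, gamma = 0.1, 0.9              # illustrative hyperparameters
for t in range(20):
    lookahead = theta - gamma * v            # where momentum alone would take us
    v = gamma * v + eta * grad(lookahead)    # gradient evaluated at the look-ahead point
    theta = theta - v
```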

Visualizing NAG

Let us see how NAG performs when we take steps for a longer duration. As shown in the image, the overshooting is significantly smaller with NAG than with momentum-based SGD.

Intuition behind Nesterov’s Accelerated Gradient

This image, taken from Geoffrey Hinton’s slides, gives an intuition for why NAG is better than Momentum.

Moving to bigger and better problems: 3D Space

Now we move on to optimizing 2 parameters instead of 1, because the newer methods employ individual learning rates/momentums per parameter, and we would like to see how that property might help in more complex learning environments.

3D Problem Formulation

Here, we are trying to minimize z, where z=f(\theta^1, \theta^2) for some function of \theta^1 and \theta^2. At time t=0 we start with some initial value (\theta^1_0, \theta^2_0) and continue finding solutions (\theta^1_1, \theta^2_1), (\theta^1_2, \theta^2_2), \ldots, (\theta^1_T, \theta^2_T) s.t. we minimize z. Here we use the following surface.

I’m not going to explain the exact equation. However, the surface consists of the following.

  • A “crude” Gaussian with low variance (high steepness) on \theta^1 axis and relatively high variance (low steepness) on \theta^2 axis (the light blue lump).
  • Just in front of that Gaussian is an inverted Gaussian, which contains the global minimum of the surface (the dark blue pit).
  • All this is wrapped in a paraboloid with the equation (\theta^1)^2 + \theta^1\theta^2 + (\theta^2)^2
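
For readers who want to build something similar, below is one hypothetical way of composing such a surface in NumPy. The Gaussian centres, variances, and scales are illustrative guesses and not the exact constants used to generate the plots in this post.

```python
import numpy as np

def surface(t1, t2):
    # paraboloid wrapper: (theta^1)^2 + theta^1*theta^2 + (theta^2)^2
    parab = t1**2 + t1 * t2 + t2**2
    # Gaussian bump: low variance (steep) along theta^1, higher variance (flatter) along theta^2
    bump = 2.0 * np.exp(-((t1 - 0.5)**2 / 0.02 + (t2 - 0.5)**2 / 0.2))
    # inverted Gaussian just in front of the bump, forming the global-minimum pit
    pit = -1.5 * np.exp(-((t1 - 0.5)**2 / 0.02 + (t2 - 0.2)**2 / 0.02))
    return parab + bump + pit
```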

4. AdaGrad: a Learning rate for Each Parameter [4]

Update Rule

One limitation of the momentum-based methods discussed so far is that all the parameters share a single learning rate. However, this is highly ineffective in the real world, as the loss surfaces dealt with in such problems are very high dimensional and very complex. So it is more intuitive to have a separate learning rate for each parameter. But tuning a learning rate for each parameter by hand is impractical (a typical deep network can easily have a few million parameters). Therefore we need a clever way of doing this.

Intuition: Small Gradient update means a Large Learning rate

That is the key idea behind AdaGrad. We will arrive at the Adagrad update in steps. We first start with the SGD update,

\theta_{t+1}^i = \theta_t^i - \eta \nabla_{\theta^i} y
let g^i = \nabla_{\theta^i} y so,
\theta_{t+1}^i = \theta_t^i - \eta g^i
and by generalizing this to the full parameter vector \theta = \{\theta^1, \theta^2, \ldots, \theta^M\},
\theta_{t+1} = \theta_t - \eta g

AdaGrad starts by modifying the SGD update to the following (of course with theoretically motivated reasons),
\theta_{t+1} = \theta_t - \eta {(G^{-1})}^{1/2} g

where G = \sum_{\tau=1}^{t} g_\tau g_\tau^T is the sum of outer products of the gradients observed so far, with g\in {\rm I\!R}^{M\times 1} and G \in {\rm I\!R}^{M\times M}.
Now, inverting an M\times M matrix can be expensive. But the good news is we don’t have to: we can restrict G to a diagonal matrix, such that,

\theta_{t+1} = \theta_t - \eta \left(diag(G)^{1/2} + \epsilon I\right)^{-1} g

Now both the inversion and the square root operations on diag(G) are linear in computation time, and can even be reduced to vectorized operations.

Let g_{diag} \in  {\rm I\!R}^{M\times 1} be the vector comprising only the diagonal elements of diag(G),

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{g_{diag}}+\epsilon} \odot g

where \odot represents element-wise multiplication of vectors, the division and square root are applied element-wise, and \epsilon avoids division by zero.
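
Sketched in NumPy, the per-parameter update could look like the following. Here g_diag accumulates the element-wise squared gradients over time, which is also the source of the diminishing learning rate discussed below. The gradient function and hyperparameters are illustrative stand-ins, not the actual 3D surface used in the visualizations.

```python
import numpy as np

grad = lambda theta: 2.0 * theta       # illustrative gradient of a simple quadratic bowl
theta = np.array([0.4, 0.5])
eta, eps = 0.1, 1e-8                   # illustrative learning rate and epsilon
g_diag = np.zeros_like(theta)          # running sum of element-wise squared gradients
for t in range(100):
    g = grad(theta)
    g_diag += g**2                                       # accumulate squared gradients
    theta = theta - eta / (np.sqrt(g_diag) + eps) * g    # per-parameter scaled step
```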

Visualizing AdaGrad

Now let us see how AdaGrad works in the 3D problem space we looked at earlier.

In order to see the advantage of AdaGrad, we compare the method to NAG, the best optimization method we have so far. We start very close to the highest point on the map (but not exactly on top). To be exact, we start at (0.4, 0.5). Now I’ll explain why I chose this point.

The mean of the Gaussian lies at (0.5, 0.5), and we designed the Gaussian to have a large steepness along the \theta^1 axis and a relatively lower steepness along the \theta^2 axis. By starting at (0.4, 0.5) we start a bit to the side where it is steeper. If we imagine rolling a ball from this point, the ball should travel more along the \theta^1 axis. But this is the undesired behavior, because we’ll get to the global minimum more quickly if we travel more along the \theta^2 axis. AdaGrad is supposed to solve this, as AdaGrad pushes the ball in the direction in which the gradient updates are small.

As can be seen, AdaGrad travels more along the \theta^2 axis, which is exactly the behavior we needed! NAG, on the other hand, ends up travelling more along the \theta^1 axis and less along \theta^2, which is not what we need. However, a downside of AdaGrad can also be observed: the learning rate for later updates aggressively diminishes, as forced by the update rule. In other words, as the accumulated squared gradients grow over time, the learning rate becomes smaller and smaller!

5. RMSProp Optimizer [5]

Update Rule

RMSProp serves the same purpose as AdaGrad: adapting the learning rate for individual parameters. However, instead of accumulating all past squared gradients, it keeps an exponentially decaying average of them, which keeps the learning rate from diminishing to zero. The update rule is as follows.

{\rm I\!E}[g_t^2] = \gamma {\rm I\!E}[g_{t-1}^2]  + (1-\gamma)g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{{\rm I\!E}[g_t^2]}+\epsilon} g_t
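
Here is a minimal NumPy sketch of this rule; \gamma = 0.9 and \epsilon = 10^{-8} are common defaults assumed here, and the gradient function is again an illustrative stand-in.

```python
import numpy as np

grad = lambda theta: 2.0 * theta       # illustrative gradient
theta = np.array([0.4, 0.5])
eta, gamma, eps = 0.01, 0.9, 1e-8      # illustrative hyperparameters
avg_sq = np.zeros_like(theta)          # E[g^2]: decaying average of squared gradients
for t in range(100):
    g = grad(theta)
    avg_sq = gamma * avg_sq + (1 - gamma) * g**2   # exponentially decayed average
    theta = theta - eta / (np.sqrt(avg_sq) + eps) * g
```

Because old squared gradients are gradually forgotten, the denominator does not grow without bound the way AdaGrad’s accumulated sum does.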

6. Adam Optimizer: Individual Learning Rates AND… Momentums for Parameters [6]

Update Rule

The Adam Optimizer takes what AdaGrad does to a new level and also solves the major limitation of AdaGrad: the learning rate diminishing over time. Adam employs an individual learning rate as well as an individual momentum for each parameter, thereby keeping the parameters continuously pushed towards the minimum.

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

Another new property of Adam is that it has a bias correction term. This bias correction term is important as we initialize m_0 and v_0 with zeros, creating a bias in m_t and v_t towards zero. This is counteracted by the following update rule.

\hat{m}_t = \frac{m_t}{(1-\beta_1^t)}
\hat{v}_t = \frac{v_t}{(1-\beta_2^t)}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t
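
Putting the pieces together, here is a hedged NumPy sketch of the full Adam update. \beta_1 = 0.9, \beta_2 = 0.999, and \epsilon = 10^{-8} are the defaults suggested in the Adam paper, and the gradient function is an illustrative stand-in.

```python
import numpy as np

grad = lambda theta: 2.0 * theta                    # illustrative gradient
theta = np.array([0.4, 0.5])
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8     # Adam paper defaults (assumed)
m = np.zeros_like(theta)                            # first moment (momentum-like term)
v = np.zeros_like(theta)                            # second moment (per-parameter scale)
for t in range(1, 101):                             # t starts at 1 for bias correction
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                      # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                      # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
```

Note how the bias correction matters most during the first few iterations, when m and v are still close to their zero initialization.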

Visualizing Adam

Now we compare Adam to AdaGrad to see if Adam actually alleviates the diminishing learning rate phenomenon. As can be seen from the image, Adam in fact maintains a higher velocity in the desired direction.

IPython Notebook: Here

That’s all for today folks!