**IPython Notebook: Here**

There is no need to stress that optimization is at the core of machine learning algorithms. In fact, it was a big enabler of deep learning: "pre-training" the network (i.e. an optimization process) was used to find a good initialization for deep models. This allowed early deep models to train without suffering from vanishing or exploding gradients.

Good optimization is an essential part of machine learning, as significant performance boosts often come from better optimization techniques. This is especially true for stochastic algorithms: in stochastic settings we only observe a subset of the data at a given time, and better optimization techniques help to exploit the data efficiently. One such trick is maintaining a running mean of gradients over time and adding that to the current gradient; this is in fact the idea behind momentum. In this post, we will be talking about Stochastic Gradient Descent (SGD) methods.

There are many good reads for understanding the basics of the latest SGD methods. A few such resources are:

1. Sebastian Ruder’s Blog

2. Tensorflow official page

3. Stanford Course

4. Geoffrey Hinton’s Slides

Though these resources are great for understanding the nuts and bolts of SGD methods and future trends, they do not go into the practical details of **why** one would prefer one method over another. In other words, they do not show the practical advantages of each method. In this blog post, I highlight with examples how each method behaves in a variety of settings, and thereby progressively work towards SGD optimization methods that perform better.

To understand the methods, assume the following problem setting. We are trying to optimize $\min_w f(w)$ for some function $f$ of a single parameter $w$. At time $t=0$ we start with some initial value $w_0$ and continue finding solutions $w_1, w_2, \ldots$ such that we minimize $f(w)$. For the following methods we simply set $f$ to a fixed one-dimensional surface (defined in the notebook).

First we will talk about the naive Stochastic Gradient Descent (SGD) optimization method. Here, at time $t$, from the point $w_t$, we take a small step in the opposite direction of the gradient to produce $w_{t+1}$. That is,

$$w_{t+1} = w_t - \eta \nabla f(w_t)$$

where $\eta$ is the magnitude of the step we take (the learning rate).
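As a concrete sketch (not the notebook's exact code), here is the vanilla SGD update applied to a toy quadratic $f(w) = w^2$, an assumed stand-in for the surface used in the post:

```python
def sgd_step(w, grad, lr=0.1):
    """Vanilla SGD: take a small step against the gradient."""
    return w - lr * grad

# Toy objective f(w) = w**2, so grad f(w) = 2*w (an assumed stand-in
# for the 1-D surface used in this post's notebook).
w = 2.0
for _ in range(20):
    w = sgd_step(w, 2 * w, lr=0.1)
```

With $\eta = 0.1$ on this quadratic, each step shrinks $w$ by a factor of $0.8$, so the iterate decays geometrically towards the minimum at $w = 0$.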

This is what it looks like in our problem. We decrease the transparency of the points from old to new, so newer points (in time) are less transparent.

With momentum, in addition to taking a step of $-\eta \nabla f(w_t)$, we add an additional component: the accumulated gradient of the past. Mathematically,

$$v_t = \gamma v_{t-1} + \eta \nabla f(w_t), \qquad w_{t+1} = w_t - v_t$$

where $\gamma$ controls how much of the past gradients is remembered.
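A minimal sketch of the momentum update above (function and parameter names are my own, not from the post):

```python
def momentum_step(w, v, grad, lr=0.1, gamma=0.9):
    """Momentum: accumulate a decaying running sum of gradients (the
    velocity v), then step with the velocity instead of the raw gradient."""
    v = gamma * v + lr * grad
    return w - v, v

# One step on the toy quadratic f(w) = w**2, starting from w=2 with v=0:
w, v = momentum_step(2.0, 0.0, grad=2 * 2.0)
```

On the very first step (with $v_0 = 0$) momentum matches plain SGD; the difference appears on later steps, when past gradients keep pushing.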

Imagine two balls freely rolling down the same hill. For one ball we introduce the following behavioral change. We stop the ball at every 5m interval with our hand for an infinitesimally small amount of time, so that every 5m the ball has a starting velocity of $0$. Obviously the ball rolling freely down the hill will reach the bottom much quicker than the ball we reset every 5m.

Here the ball reset every 5m represents the SGD optimization. The ball rolling freely represents SGD with Momentum.

Below are the results for SGD with momentum. As you can see, SGD with Momentum reaches a lower value than SGD in the same number of steps.

We now have a quicker way of reaching the bottom. Now let us run a bit longer.

You can see that SGD with momentum overshoots quite a bit if we continue optimizing after reaching the bottom. This is explicable, as we have an accumulated gradient pushing the optimization along even when the current gradient is small. How can we fix this?

- + Gets to the optimal quicker
- – Can overshoot if not careful with the learning rate

One way to reduce this effect is to look ahead before taking the step, because we already have an approximation of where we will be in the next step: $w_t - \gamma v_{t-1}$. Nesterov Accelerated Gradient (NAG) evaluates the gradient at this look-ahead point:

$$v_t = \gamma v_{t-1} + \eta \nabla f(w_t - \gamma v_{t-1}), \qquad w_{t+1} = w_t - v_t$$

Now we can anticipate the overshoot one step before it actually happens, and the accumulated gradient adapts quicker as well.
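The only change from plain momentum is where the gradient is evaluated: at the look-ahead point $w - \gamma v$ rather than at $w$ itself. A sketch with my own naming:

```python
def nag_step(w, v, grad_fn, lr=0.1, gamma=0.9):
    """Nesterov Accelerated Gradient: evaluate the gradient at the
    look-ahead point w - gamma*v, where the velocity is about to take us."""
    lookahead = w - gamma * v
    v = gamma * v + lr * grad_fn(lookahead)
    return w - v, v

def grad_fn(w):
    return 2 * w  # gradient of the toy quadratic f(w) = w**2

w, v = 2.0, 1.0   # pretend we already carry some accumulated velocity
w, v = nag_step(w, v, grad_fn)
```

Because the gradient is taken at the look-ahead point, a velocity that would carry us past the minimum produces a smaller (or even opposing) gradient contribution, damping the overshoot.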

Let us see how NAG performs when we take steps for a longer duration. As shown in the image, the overshooting is significantly less in NAG compared to Momentum-based SGD.

This image, taken from Geoffrey Hinton's slides, gives an intuition for why NAG is better than Momentum.

- + Prevents overshooting after reaching the optimum
- – Poor at complex problems as all the parameters are governed by a single learning rate

Now we move on to optimizing 2 parameters instead of 1, because the new methods employ individual learning rates/momentums per parameter, and we would like to see how that property helps in more complex learning environments.

Here, we are trying to optimize $\min_{w_1, w_2} f(w_1, w_2)$ for some function $f$ of $w_1$ and $w_2$. At time $t=0$ we start with some initial values $(w_1^{(0)}, w_2^{(0)})$ and continue finding solutions such that we minimize $f(w_1, w_2)$. Here we use the following surface.

I’m not going to explain the exact equation. However, the surface consists of the following.

- A “crude” Gaussian with low variance (high steepness) along one axis and relatively high variance (low steepness) along the other; this is the red-orange lump. I will refer to these as the steep axis and the shallow axis.
- Just in front of that Gaussian is an inverted Gaussian, which gives the global minimum of the surface somewhere within it (the dark blue pit).
- All this is wrapped in a paraboloid.

One limitation of the momentum-based methods discussed so far is that all parameters share a single learning rate. This is highly ineffective in the real world, as the loss surfaces dealt with in such problems are very high-dimensional and very complex. It is more intuitive to have a separate learning rate for each parameter, but tuning a learning rate for each parameter by hand is impractical (a typical deep network can easily have a few million parameters). Therefore we need a clever way of doing that.

Intuition: Small Gradient update means a Large Learning rate

That is the key idea behind AdaGrad. It works quite well, especially when gradients are sparse and complex (as they usually are in real-world problems). We will arrive at the AdaGrad update in steps. We first start with the SGD update,

$$w_{t+1} = w_t - \eta \nabla f(w_t)$$

Let $g_t = \nabla f(w_t)$, so

$$w_{t+1} = w_t - \eta g_t$$

and by generalizing this we obtain the sequence of gradients $g_{1:t} = \{g_1, g_2, \ldots, g_t\}$.

AdaGrad starts by modifying the SGD update to the following (with theoretically motivated reasons, of course):

$$w_{t+1} = w_t - \eta\, G_t^{-1/2} g_t$$

where $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$ and $g_\tau = \nabla f(w_\tau)$.

Now, inverting a matrix can be expensive. But the good news is we don't have to: we can restrict $G_t$ to a diagonal matrix, such that

$$w_{t+1} = w_t - \eta\, \mathrm{diag}(G_t)^{-1/2} g_t$$

Now both the inversion and the square-root operations on $\mathrm{diag}(G_t)$ are linear in computation time, and can even be reduced to vectorized operations.

Let $s_t$ be the vector composed only of the diagonal elements of $G_t$. Then,

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{s_t + \epsilon}} \odot g_t$$

where $\odot$ represents element-wise multiplication of vectors and $\epsilon$ avoids division by zero.
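In vector form, the diagonal AdaGrad update is a few lines of NumPy (a sketch; names are mine). Note that the very first step has roughly the same size in every direction regardless of the gradient's scale, which is why the shallow axis is no longer starved:

```python
import numpy as np

def adagrad_step(w, s, grad, lr=0.5, eps=1e-8):
    """AdaGrad: s accumulates squared gradients per parameter; parameters
    with a small gradient history get a larger effective learning rate."""
    s = s + grad ** 2
    w = w - lr * grad / np.sqrt(s + eps)
    return w, s

# Steep axis with gradient 4.0, shallow axis with gradient 0.4:
w = np.zeros(2)
s = np.zeros(2)
w, s = adagrad_step(w, s, np.array([4.0, 0.4]))
```

Both coordinates move by roughly $\eta$ on this first step, even though one gradient is 10× the other; on later steps the accumulated $s$ shrinks the step sizes, which is the diminishing-learning-rate issue discussed below.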

Now let us see how AdaGrad works in the 3D problem space we looked at earlier.

In order to see the advantage of AdaGrad we compare the method to NAG, the best optimization method we have so far. We start very close to the highest point on the map (but not exactly on top); to be exact, we start at (0.6, 0.55). Now I'll explain why I chose this point.

The mean of the Gaussian lies at (0.5, 0.5), and we designed the Gaussian to have a large steepness along one axis (the steep axis) and a relatively lower steepness along the other (the shallow axis). By starting at (0.6, 0.55) we start a bit to the side where it is more steep. If we imagine rolling a ball from this point, the ball will travel more along the steep axis, because the gradient is lower along the shallow axis (a similar effect to when gradients are "sparse"). But this is undesired behavior: we would get to the global minimum more quickly if we travelled more along the shallow axis. AdaGrad is supposed to solve this, as AdaGrad pushes the ball in the direction in which the gradient updates are small.

As can be seen, AdaGrad quickly starts travelling more along the shallow axis, which is exactly the behavior we need! NAG, on the other hand, ends up travelling more along the steep axis and less along the shallow one, which is not what we want. However, a downside of AdaGrad can also be observed: the learning rate for later updates diminishes aggressively, as forced by the update rule. In other words, as gradients accumulate over time, the learning rate becomes smaller and smaller! Therefore, in this experiment, we double the learning rate of AdaGrad.

- + Adaptive learning rates for each parameter
- – Learning rate diminishes very fast

RMSProp serves the same purpose as AdaGrad, adapting the learning rate for individual parameters, but it replaces the ever-growing sum of squared gradients with an exponentially decaying average, so the learning rate does not diminish as aggressively. The update rule is as follows:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$$
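A sketch of the RMSProp update (my naming), differing from the AdaGrad sketch only in that the squared-gradient history decays instead of growing without bound:

```python
import numpy as np

def rmsprop_step(w, avg, grad, lr=0.01, gamma=0.9, eps=1e-8):
    """RMSProp: keep an exponentially decaying average of squared
    gradients, so old gradients fade and the step size does not vanish."""
    avg = gamma * avg + (1 - gamma) * grad ** 2
    w = w - lr * grad / np.sqrt(avg + eps)
    return w, avg

# Feed the same gradient repeatedly: unlike AdaGrad, the step size
# settles near lr instead of shrinking towards zero.
w, avg = np.zeros(2), np.zeros(2)
for _ in range(100):
    w, avg = rmsprop_step(w, avg, np.array([4.0, 0.4]))
```

Under a constant gradient, `avg` converges to the squared gradient itself, so each step approaches $\eta \cdot \mathrm{sign}(g)$ per coordinate rather than decaying like $1/\sqrt{t}$.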

The Adam optimizer takes what AdaGrad does to a new level and solves the major limitation of AdaGrad as well: the learning rate diminishing over time. Adam employs individual learning rates as well as individual momentums for each parameter, thereby keeping the parameters continuously pushed towards the minimum. The first and second moments of the gradient are accumulated as

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

Another new property of Adam is its bias-correction term. This term is important because we initialize $m_0$ and $v_0$ with zeros, creating a bias in $m_t$ and $v_t$ towards zero. This is counteracted by the following update rule:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
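Putting the moment estimates and the bias correction together (a sketch with my own naming; $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are the commonly used defaults):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter momentum (m) and per-parameter scaling (v),
    both bias-corrected for their zero initialization."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (scaling)
    m_hat = m / (1 - b1 ** t)             # undo the bias towards zero
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# First step (t=1): thanks to bias correction, the step size is ~lr in
# each direction, regardless of the raw gradient magnitudes.
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
w, m, v = adam_step(w, m, v, np.array([4.0, 0.4]), t=1)
```

Without the bias correction, the first steps would be severely shrunk by the zero-initialized moments; with it, Adam moves at a sensible speed from step one.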

Now we compare Adam to AdaGrad to see whether Adam actually alleviates the diminishing-learning-rate phenomenon. As seen in the image, Adam in fact maintains a higher velocity in the desired direction.

- + Adaptive learning rate and momentum for each parameter
- + Learning rate does not diminish as in AdaGrad
- – Does not “look ahead” like NAG

Here we show how all the algorithms perform with the same learning rate, starting from the same position on the 3D surface. You can see that SGD makes the least progress; Momentum and RMSProp make good progress, but in a slightly incorrect direction. Adam works quite well, turning towards the minimum quicker than the other methods.

**IPython Notebook: Here**

That’s all for today folks!
