In the next few posts, I will be talking about several generative models that have seen daylight quite recently. Broadly speaking, generative models approximate the underlying distribution of the data in order to generate new samples, and they are quite useful for unsupervised and semi-supervised learning. A few state-of-the-art generative models are Variational Autoencoders (VAEs) by Kingma et al. and Generative Adversarial Networks (GANs) by Goodfellow et al. GANs shine when it comes to creating photo-realistic images.

A Variational Autoencoder is a model composed of 2 components: an Encoder and a Decoder. The encoder maps an input x into a latent variable z in the latent space. The encoder and decoder are both powerful function approximators (e.g. **Neural Networks**). Training pushes the encoder to learn a probability distribution over z, which we denote by q(z|x). The decoder samples z from the latent space and generates a sample x̂. The sample is drawn as x̂ ~ p(x|z), where p(x|z) is the approximated probability distribution for x learnt by the decoder. This is what happens under the hood in the VAE.

Now, we dive head first into the technical details. I will try to be as thorough as possible while keeping an easy-to-follow flow of how things fall into place. Beyond this point, you must clear all thoughts and enter a state of peace. Seriously though, forget the above explanation, because we are about to embark on the journey that will explain what exactly VAEs are.

Before coming up with the mechanism that dictates VAEs, let’s first understand what the ultimate goal of VAEs is. The goal of a VAE is to **find a distribution over some latent variable z which we can sample from (z ~ p(z)) to generate a new sample x̂ (to be precise, a sample from an approximated p(x|z))** through the decoder (a function approximator such as a Neural Network). I know, I know … this statement raises more questions than it answers. One key point you should take away from this point onwards is that the decoder maps (or decodes) the latent variable z to a new sample x̂ that comes from p(x|z).

This is harder than I thought. First of all, I’m going to explain this in both probabilistic and non-probabilistic terms. After that, I’m going to stick with the probabilistic interpretation, otherwise it is hard to explain the theory. So you had better pull out the probability drawers in your brain if they’ve been rusting away.

So let’s set our feet on this: we want to find the distribution of some latent (i.e. hidden) representation z of x ~ p(x) (p(x) is some underlying distribution of the data we don’t know about). z should capture some semantic features of x. This allows us to generate a new sample x̂ the following way using the decoder g(z).

- Sample z from p(z) (i.e. z ~ p(z))
- Sample x̂ from p(x|z) (i.e. x̂ ~ p(x|z))
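The two steps above can be sketched in a few lines of NumPy. This is only a sketch: the decoder here is a hypothetical stand-in (a fixed random linear map plus tanh), not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4))  # stand-in for trained decoder weights

def decoder(z):
    # Hypothetical decoder g(z): any deterministic map from latent
    # space to data space; a trained network would go here.
    return np.tanh(z @ W)

z = rng.standard_normal(2)  # step 1: sample z ~ p(z) = N(0, I)
x_hat = decoder(z)          # step 2: decode z into a new sample x_hat
```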

We want to find some latent (i.e. hidden) representation z of the data x. z should capture some semantic features of x, so that by looking at the values of z we have an idea of what the data x that generated this z should look like. This also allows us to generate new samples by tweaking the values of z (which will result in a new x̂) and using the decoder g(z), where x̂ ~ p(x|z).

For example, if we want to generate images of digits, by observing the images (i.e. x) we define some z that comprises a set of variables like **z = (stroke_thickness, angle, scale, …)**, so we can slightly change the values of z and generate new samples.

This is quite a nice approach with some solid theoretical foundation. But there are a few important questions left to be answered, which we will do below.

Easier said than done! Unfortunately, it’s not as easy as it looks to design z. If you attempt to hand-engineer z you will run into a multitude of problems (What are the variables to consider? What is the importance of each variable?). And to make matters worse, these variables might be correlated; from the digit example, a smaller stroke_thickness might go with a higher angle. And even if you figure all this out, it would take years to create a dataset labelled with such z values. So, by the looks of it, it is not a good idea to design z by hand.

Yes, you saw it correctly! We assume that z ~ N(0, I). That is, we assume z comes from a zero-mean, unit-variance normal distribution. Sounds crazy, but it works. That is because it is not as crazy as it looks. Here’s how this works.

N(0, I) is a powerful distribution. With a “powerful” enough function g(z) we can turn z into an arbitrarily complex distribution. This tutorial (Figure 2) highlights a nice example of that. This means that, if we have the correct g(z), we can map z ~ N(0, I) to the meaningful semantic features we want.
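Here is a small sketch of that idea in NumPy, assuming a g(z) in the spirit of the tutorial’s ring example: a deterministic function that pushes 2-D standard-normal samples onto a ring-shaped distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 2))  # z ~ N(0, I) in 2-D

def g(z):
    # Deterministic map: shrinks z and adds its unit vector, so the
    # Gaussian blob around the origin becomes a ring of radius ~1.
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / 10 + z / norms

x = g(z)
radii = np.linalg.norm(x, axis=1)
# The transformed samples concentrate near radius 1 instead of near 0.
```

A simple deterministic function has turned the “boring” N(0, I) into a distribution the Gaussian itself could never represent; a neural network g(z) can do far more.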

Now we have another question on our hands: what is g(z)?

Remember, our decoder is a function approximator. So why not approximate g(z) with that? To add more clarity, think of the decoder as a four-layer neural network and assume the digit example: the first two layers will map z ~ N(0, I) into the meaningful latent variable space we defined earlier. Then the next two layers will convert those latent variables to a fully-rendered image of the digit. Of course that’s not exactly the way it works, but it is a nice intuitive way to see how things finally become coherent.

From the above discussion, you know that we need to sample z to generate new samples. But we have some sort of a pragmatic problem in front of us. If you just sample z ~ N(0, I), for most z, p(x|z) will be (nearly) zero. This phenomenon is related to the curse of dimensionality. So we have another problem on our hands: how do we find an effective way to sample z so that z comes from the region where p(x|z) is non-zero?

It seems that there’s a smarter way to get around sampling, instead of waiting a millennium to find a z that gives a non-zero p(x|z). Why not use a function approximator to find a distribution q(z|x) that yields z values with non-zero p(x|z)? What is this q(z|x)? Ah, things are finally falling into place. q(z|x) is the Encoder we talked about earlier. More formally, the encoder will output μ(x) and σ(x), which will be the mean and variance of the isotropic Gaussian N(μ(x), σ²(x)I) we will be sampling z from.
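A minimal sketch of such an encoder, assuming a single linear layer with made-up placeholder weights (a real encoder would be a deeper trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical single-layer encoder; the weights below are random
# placeholders, not from any trained model or library.
W_mu = rng.standard_normal((784, 8)) * 0.01
W_lv = rng.standard_normal((784, 8)) * 0.01

def encoder(x):
    mu = x @ W_mu        # mean of q(z|x)
    log_var = x @ W_lv   # predict log-variance for numerical stability
    return mu, log_var

x = rng.random(784)                # a fake flattened 28x28 image
mu, log_var = encoder(x)
sigma = np.exp(0.5 * log_var)      # sigma is guaranteed positive
```

Predicting log σ² rather than σ directly is a common trick: the network output can be any real number, and exponentiation keeps the variance positive.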

So we came all the way from defining our goal, to why we need the decoder, to why we need the encoder. Here’s an image of how things look together.

So in order, we do the following.

- Sample x from the data, i.e. x ~ p(x)
- Sample z from the encoder’s distribution, i.e. z ~ q(z|x) = N(μ(x), σ²(x)I)
- Sample x̂ from the decoder, i.e. x̂ ~ p(x|z)
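The three steps above can be strung together into one forward pass. This is a toy sketch with random placeholder weights and tiny dimensions; only the shapes and the sampling logic matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 2  # toy data and latent dimensionalities (assumptions)

# Placeholder parameters standing in for trained encoder/decoder nets.
W_enc = rng.standard_normal((D, 2 * H)) * 0.1
W_dec = rng.standard_normal((H, D)) * 0.1

def forward(x):
    h = x @ W_enc
    mu, log_var = h[:H], h[H:]                  # encoder -> q(z|x) params
    z = rng.normal(mu, np.exp(0.5 * log_var))   # z ~ N(mu, sigma^2 I)
    x_hat = np.tanh(z @ W_dec)                  # decoder -> sample via p(x|z)
    return x_hat

x = rng.random(D)     # stands in for "x ~ p(x)", i.e. a data point
x_hat = forward(x)    # the reconstruction / generated sample
```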

By this time you have probably realized that you need some basis to train the encoder and the decoder. I’m going to assume they are neural networks. For that we are going to use the maximum likelihood approach. The intuition is that we want to find the encoder and decoder parameters such that we are likely to get data like x (or maximize p(x)). Now we will set up our objective function along this line. In other words, we need to find a PDF over z values that are likely under the data x. Mathematically, we define this via the divergence between q(z|x) and the true posterior p(z|x), specifically the Kullback-Leibler divergence KL(q(z|x) || p(z|x)).

After some number crunching, you will arrive at the following equation (see Section 2.1 of the tutorial for the exact maths):

log p(x) − KL(q(z|x) || p(z|x)) = E_{z~q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))

So by maximizing the RHS, we will be maximizing the log likelihood log p(x) while simultaneously minimizing the term KL(q(z|x) || p(z|x)). However, the term KL(q(z|x) || p(z|x)) cannot be ignored, as the minimization of this term (when it reaches 0) gives the additional benefit of making q(z|x) approximate the intractable p(z|x). Moreover, it has been proven (for the 1-D case) that this term in fact reaches zero given that σ is small.

We need some tangible values for the RHS if we want to maximize the likelihood of x. Well, the first term, E_{z~q(z|x)}[log p(x|z)], is something you can write blindfolded if you are into deep networks. It is (up to constants) the reconstruction error between x and x̂.
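For instance, assuming pixel values in [0, 1] and a Bernoulli p(x|z) (a common choice, not the only one), the negative log-likelihood is just binary cross-entropy:

```python
import numpy as np

def reconstruction_error(x, x_hat, eps=1e-7):
    # Negative Bernoulli log-likelihood (binary cross-entropy);
    # eps-clipping avoids log(0).
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([1.0, 0.0, 1.0])
good = reconstruction_error(x, np.array([0.9, 0.1, 0.8]))  # small error
bad = reconstruction_error(x, np.array([0.1, 0.9, 0.2]))   # large error
```

For real-valued x with a Gaussian p(x|z), the same term would instead reduce to a squared error.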

Next, the second term, KL(q(z|x) || p(z)), can be computed in closed form. It is the KL divergence between the two isotropic Gaussians N(μ(x), σ²(x)I) and N(0, I), given by,

KL(N(μ(x), σ²(x)I) || N(0, I)) = ½ Σ_k ( σ_k²(x) + μ_k²(x) − 1 − log σ_k²(x) )

Now, by putting these two terms together, the RHS becomes easy to implement. I won’t state the obvious here (the full equation).

Now it’s just a matter of optimizing this objective function w.r.t. the encoder and decoder parameters using a stochastic gradient method (e.g. momentum update, Adam optimizer). Right now, this is what our model looks like. The solid lines represent connections that result in deterministic outputs in the model, whereas dashed lines show connections that result in stochastic outputs.

There’s a major issue that does not allow us to backpropagate through the whole model. Do you see it? Since we are sampling z from a distribution, the end-to-end deterministic nature of the parameters is lost. In other words, backpropagation will work with stochastic inputs, but not with stochastic parameters. If we speak in terms of the above image, it is fine to have those dashed lines at the ends of our model, but we need solid lines from end to end, otherwise we cannot backpropagate end-to-end.

Don’t let your hopes down. There’s an elegant trick to level out this lump of misfortune, called the **“reparameterization trick”**. Instead of sampling z as below,

z ~ N(μ(x), σ²(x)I)

we do the following,

z = μ(x) + σ(x) ⊙ ε

where ε ~ N(0, I) and ⊙ is the elementwise product.

With this, the broken link between the encoder and decoder disappears, enabling backpropagation to work as well as ever.
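In code, the trick is a single line: the randomness is confined to ε, while z becomes a deterministic, differentiable function of μ and σ. A quick NumPy check that the reparameterized samples still follow N(μ, σ²):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])     # encoder outputs (toy values)
sigma = np.array([1.2, 0.3])

# Without the trick: z drawn directly from N(mu, sigma^2); the sampling
# node blocks gradients w.r.t. mu and sigma.
z_direct = rng.normal(mu, sigma)

# With the trick: eps carries all the randomness, z = mu + sigma * eps
# is deterministic in mu and sigma, so gradients flow through.
samples = mu + sigma * rng.standard_normal((100_000, 2))
# Empirically, the samples still have mean ~mu and std ~sigma.
```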

Yep, this is really it. That’s all there is to the basics of VAEs. I know it’s a mouthful of mathematics and probability and what not. But hopefully, after going through it a few times and implementing it on your own, things will make sense.

When implementing VAEs, be mindful of your implementation: if you are using rectified linear unit (ReLU) activations, make sure you have a good initialization (e.g. Xavier initialization), otherwise the model will converge very quickly to a poor solution, probably because of dead ReLU units.
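For reference, Xavier (Glorot) uniform initialization is only a few lines; this sketch assumes the uniform variant with a toy 784→128 layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot uniform initialization: draws weights from
    # U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)),
    # keeping activation variance roughly constant across layers.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_init(784, 128)  # e.g. first layer of an MNIST encoder
```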

My code can be found here.
