Convolutional Neural Networks: Mayor of Visionville

Hey there,


If you are a computer vision fanatic like me, I don’t think I need to convince you about the potency of CNNs. But let me take a shot at it anyway! Convolutional Neural Networks (CNNs) are hard to beat when it comes to computer vision tasks. Given their ability to preserve the spatial information of an image and their tolerance to small translations (and, to a lesser degree, to small rotations and changes in scale), it comes as no surprise that they outperform classical vision techniques by a wide margin.



There are quite a few advantages of CNNs compared to other vision techniques. To name a few, CNNs do not require any hand-crafted feature extraction from raw data. In fact, they jointly perform feature extraction and learning during the training process. So you won’t have to lie awake at night feeling guilty about not paying more attention to feature extraction! Next, CNNs, like deep learning in general, form hierarchical feature representations analogous to the functioning of the brain (Hubel and Wiesel found that neurons in the visual cortex are organized in a hierarchical fashion). This could be what enables CNNs to reach human-level accuracy (or even better) in many vision tasks.

Intricacies of CNNs

Great, so you want to learn how CNNs work. I’ll try to keep it short and simple. I will delineate the details under three components (figure below) in the order the data flows through the network.



The input to a CNN is an array of the RGB pixel values of an image. For example, a 32×32 RGB image can be represented as a 32×32×3 array. In other words, three 32×32 planes (one each for red, green and blue) stacked along the depth axis.

Spatial Operations

Then we perform a sequence of spatial operations, namely convolution and pooling. The objective of these operations is to learn the spatial structure of images. First, the convolution operation slides a small patch (the filter) over the whole input and computes a 2D convolution at each position; the results form the output of the convolution operation. Learning useful filters (e.g. edge detectors) is the main objective of this step. Then the pooling operation serves two main purposes. It reduces the computational overhead by shrinking the output, but more importantly it makes the network tolerant to small translations (and, to a lesser extent, to small rotations and changes in scale). We will later see how exactly this happens.


First let’s dive into how the convolution works. I’m not going to dive to the very bottom of theory but start somewhere in the middle. 2D Convolution works the following way.

Mathematically, this operation can be represented as below. Brace yourself! This is going to be a mouthful. But the silver lining is that you don’t have to worry about implementing convolution yourself. All the popular deep learning frameworks already have it implemented. But for the sake of completeness, I’m going to write it down anyway.

    \[ O_{(w_o,h_o)} = \sum_{w_f=0}^{w_F-1} \sum_{h_f=0}^{h_F-1} I_{(w_o \times S_w + w_f,\; h_o \times S_h + h_f)} \, F_{(w_f,h_f)} \]

where

    \[ \begin{aligned}
    O_{(i,j)} &: (i,j)^{th} \text{ component of the output} \\
    I_{(i,j)} &: (i,j)^{th} \text{ component of the input} \\
    F_{(i,j)} &: (i,j)^{th} \text{ component of the convolution filter} \\
    w_F, h_F &: \text{width and height of the convolution filter (i.e. receptive field)} \\
    S_w, S_h &: \text{stride along the width and height}
    \end{aligned} \]

Stride denotes how many pixels the filter moves along the width and height between consecutive convolutions. For example, if S_w=2 and S_h=2, the convolution would look like the image below. As you can observe from the image, a higher stride means a smaller output.
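To make the formula and the effect of stride concrete, here is a minimal NumPy sketch of a single-channel 2D convolution without padding. The function name `conv2d`, the toy 6×6 input and the 2×2 filter are my own illustrative choices, and the array is indexed as (row, column), i.e. (height, width), rather than the (width, height) order used in the equation. Real frameworks are far more optimized; this is only meant to mirror the sums.

```python
import numpy as np

def conv2d(image, kernel, stride=(1, 1)):
    """Strided 2D convolution (no padding) of a single-channel image.

    Each output value is the sum of an input patch multiplied
    element-wise by the filter, as in the formula above.
    """
    s_h, s_w = stride
    h_f, w_f = kernel.shape
    h_in, w_in = image.shape
    # Number of positions at which the filter fits along each axis.
    h_out = (h_in - h_f) // s_h + 1
    w_out = (w_in - w_f) // s_w + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = image[i * s_h:i * s_h + h_f, j * s_w:j * s_w + w_f]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy input and filter (illustrative values only).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

out_s1 = conv2d(image, kernel, stride=(1, 1))  # shape (5, 5)
out_s2 = conv2d(image, kernel, stride=(2, 2))  # shape (3, 3)
```

With a stride of (2, 2) the output shrinks from 5×5 to 3×3, which is the size reduction described above.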



The pooling operation is not as convoluted as the convolution operation (pun intended!), and it is quite simple to both understand and implement. The following image depicts max and average pooling with a stride of (2,2).


The pooling operation takes an input and a window (also called the kernel) and outputs the max or the average of the input values that fall within that window. The size of the window (called the kernel size) is a hyperparameter you need to choose. This operation makes the features a CNN learns tolerant to small translations (and, to a lesser extent, to small rotations and changes in scale). Intuitively, if we are trying to classify a vehicle and its wheels appear slightly offset (e.g. translated along the x axis) compared to the training images, the CNN will still activate the correct features because the wheel is still within the window.
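The pooling description above can be sketched in a few lines of NumPy. The function `pool2d` and the toy arrays are hypothetical names and values chosen for illustration; the sketch assumes a single-channel input and no padding. The last few lines also show the translation tolerance on a tiny example: an activation that moves by one pixel but stays inside the same window produces the same pooled output.

```python
import numpy as np

def pool2d(x, kernel=(2, 2), stride=(2, 2), mode="max"):
    """Max or average pooling over a single-channel 2D input."""
    k_h, k_w = kernel
    s_h, s_w = stride
    h_out = (x.shape[0] - k_h) // s_h + 1
    w_out = (x.shape[1] - k_w) // s_w + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * s_h:i * s_h + k_h, j * s_w:j * s_w + k_w]
            out[i, j] = reduce_fn(window)
    return out

x = np.array([[1.0, 3.0, 2.0, 0.0],
              [4.0, 2.0, 1.0, 1.0],
              [0.0, 1.0, 5.0, 6.0],
              [2.0, 3.0, 7.0, 8.0]])

max_pooled = pool2d(x, mode="max")  # [[4, 2], [3, 8]]
avg_pooled = pool2d(x, mode="avg")  # [[2.5, 1.0], [1.5, 6.5]]

# A feature shifted by one pixel within the same 2x2 window
# gives an identical max-pooled output.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0
same_after_pooling = np.array_equal(pool2d(a), pool2d(b))  # True
```

Note that the tolerance only holds while the feature stays inside one window; a larger shift moves the activation to a different output cell.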

Fully Connected Operations

Alright! Now we have some fancy operations that allow us to learn exciting features from images, but how are we going to actually classify objects? This is where the fully connected layers come in. A fully connected layer serves as a transition from all those 2D operations you performed to some linear classifier (e.g. Softmax, SVM). Nothing special about it. However, it seems that having fully connected layers is overkill and doesn’t give you a significant edge compared to connecting the classifier directly to the output of the 2D operations.
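As a rough sketch of that transition, the snippet below flattens a stack of feature maps, applies one fully connected layer and turns the result into class probabilities with softmax. Everything here is an assumption made for illustration: the 4×4×8 feature-map shape, the 10 classes, the random weights and the helper names `fully_connected` and `softmax`. In a real network the weights would be learned during training.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def fully_connected(features, weights, bias):
    """Flatten the feature maps and apply a single dense layer."""
    return features.reshape(-1) @ weights + bias

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(4, 4, 8))  # hypothetical output of the spatial operations
n_classes = 10                             # hypothetical number of classes

weights = rng.normal(scale=0.01, size=(4 * 4 * 8, n_classes))
bias = np.zeros(n_classes)

probs = softmax(fully_connected(feature_maps, weights, bias))
predicted_class = int(np.argmax(probs))
```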

Putting Everything Together

Now we know all the building blocks, so let’s start building a CNN right away. After you put these building blocks together this is what the final structure might look like.


You might notice a parameter called “Padding”, which I haven’t discussed but have included for completeness. I’m not going to dive head first into it because it is quite simple and I want to avoid clutter. Briefly, padding adds extra pixels (usually zeros) around the border of the input so that the output does not shrink after every convolution, which allows you to stack more layers in the CNN.
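To see why padding keeps the output size consistent, it helps to compute the output size along one axis. The sketch below uses the standard relation out = floor((in + 2·padding − filter) / stride) + 1; the helper name `conv_output_size` and the example sizes are my own, and symmetric zero padding is assumed.

```python
import numpy as np

def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Output size along one axis of a convolution or pooling layer."""
    return (in_size + 2 * padding - filter_size) // stride + 1

no_padding = conv_output_size(32, 5)               # 28: the output shrinks
same_size = conv_output_size(32, 5, padding=2)     # 32: the size is preserved
strided = conv_output_size(32, 3, stride=2, padding=1)  # 16

# Zero padding itself is a one-liner in NumPy.
image = np.ones((32, 32))
padded = np.pad(image, 2, mode="constant")  # shape (36, 36)
```

Without padding, a 5×5 filter removes 4 pixels per layer along each axis, so a 32×32 input would run out of pixels after a handful of layers.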

Looking from a higher vantage point: Hyperparameter selection

Now let’s climb a few steps up and look back. The basic structure is done, but we’re still missing the icing on top: hyperparameters. Hyperparameters are as critical as the architecture to properly implement a CNN. In fact, the importance of the hyperparameters increases with the number of layers, as carelessly chosen hyperparameters make deeper networks more prone to malfunction. So which hyperparameters exactly do we need to be careful about?

Learning Rate

The learning rate is the size of the step you take towards a minimum of the loss function you’re optimizing. A larger step will get you near the minimum quicker but results in lots of oscillations close to the minimum. A smaller learning rate is slower but will have fewer oscillations.

So to get the best of both worlds, we do something called learning rate decay. At the beginning we take large steps, and over time we decrease the size of the step. This has been shown to often work better than a fixed learning rate.
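One common way to do this is exponential decay, sketched below. This is just one possible schedule, not necessarily the best for your task; the function name `exponential_decay` and the values for the initial rate, decay rate and decay interval are assumptions picked for illustration.

```python
def exponential_decay(initial_lr, decay_rate, step, decay_steps):
    """Learning rate after `step` updates.

    The rate is multiplied by `decay_rate` every `decay_steps` updates.
    """
    return initial_lr * decay_rate ** (step / decay_steps)

# Hypothetical schedule: start at 0.1 and halve every 1000 steps.
lr_start = exponential_decay(0.1, 0.5, step=0, decay_steps=1000)
lr_1000 = exponential_decay(0.1, 0.5, step=1000, decay_steps=1000)
lr_2000 = exponential_decay(0.1, 0.5, step=2000, decay_steps=1000)
```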

Also, the deeper your model is, the smaller your starting learning rate should be. If you use aggressive learning rates in deeper models, the loss can explode and result in numerical instabilities.


Regularization

Regularization aims to tame the parameters by imposing various constraints. For example, L2 regularization pushes weights to be small so their values do not explode. I will focus on two regularization techniques: L2 and dropout.


L2 regularization looks as follows. Intuitively, you add the sum of squares of all the weight parameters in the CNN, scaled by a coefficient β, to the loss value. Therefore, to minimize the loss, it becomes essential to keep the weights small.

    \[ L = \text{classification error} + \beta \sum_{w_{i,j} \in CNN} w_{i,j}^2 \]
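In code, the L2 term is just a sum over all weight arrays. The sketch below uses made-up weights, a made-up classification loss of 0.8 and β = 0.01 purely for illustration; `l2_penalty` is a hypothetical helper name.

```python
import numpy as np

def l2_penalty(weights, beta):
    """beta times the sum of squared weights over all weight arrays."""
    return beta * sum(np.sum(w ** 2) for w in weights)

# Hypothetical weights of a tiny two-layer network.
weights = [np.array([[1.0, -2.0],
                     [0.5, 0.0]]),
           np.array([3.0, -1.0])]

classification_loss = 0.8  # hypothetical value
beta = 0.01                # regularization strength (a hyperparameter)

penalty = l2_penalty(weights, beta)         # 0.01 * 15.25 = 0.1525
total_loss = classification_loss + penalty  # 0.9525
```

Comparing `penalty` with `classification_loss` like this is also a quick way to check that the two terms are on a similar scale.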


Next we have dropout. Dropout is one of the most widely used regularization techniques because it almost always improves performance. Intuitively, dropout works by switching off parts of your network for different inputs. The switching off is usually stochastic. The idea is that the network will learn more robust features, so that even if some features are not present, it will have alternative features that allow it to classify a given sample correctly.
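Here is a minimal sketch of the stochastic switching off. Note one assumption that goes beyond the description above: the sketch uses the common "inverted dropout" variant, which rescales the surviving activations by 1/keep_prob during training so that nothing needs to change at test time. The function name `dropout` and the keep probability of 0.8 are illustrative choices.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout: randomly zero units and rescale the survivors."""
    if not training:
        # At test time the full network is used unchanged.
        return activations
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((4, 5))

dropped = dropout(activations, keep_prob=0.8, rng=rng)
unchanged = dropout(activations, keep_prob=0.8, rng=rng, training=False)
```

Because a fresh mask is drawn on every call, different inputs see different "thinned" versions of the network during training.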

Batch Size

In deep learning, we process data in batches because, in most of the interesting problems, the dataset is too big to keep in memory. A larger batch size is generally better because each step you take down the loss function becomes more accurate: the gradient is estimated from more data before taking that step. However, the trade-off is that the memory requirement and the computation per step go up.
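A simple way to process data in batches is to shuffle the sample indices once per epoch and slice them into chunks. The generator below is a sketch under that assumption; `iterate_minibatches`, the 10 dummy images and the batch size of 4 are hypothetical, and for datasets that truly don’t fit in memory you would load each batch from disk instead of indexing an in-memory array.

```python
import numpy as np

def iterate_minibatches(inputs, labels, batch_size, rng):
    """Yield shuffled (inputs, labels) batches covering the data once."""
    indices = rng.permutation(len(inputs))
    for start in range(0, len(inputs), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield inputs[batch_idx], labels[batch_idx]

rng = np.random.default_rng(0)
inputs = np.zeros((10, 32, 32, 3))  # 10 dummy 32x32 RGB images
labels = np.arange(10)

batch_sizes = [x.shape[0] for x, _ in iterate_minibatches(inputs, labels, 4, rng)]
# batch_sizes is [4, 4, 2]: the last batch holds the leftover samples
```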

Things to Remember

I think I’ve covered almost all the important things you need to know to build CNNs.

But what you should understand is that there isn’t a “Silver Bullet” CNN that miraculously works for every task you have. You will run into various issues, both analytical and numerical, and debugging can be a nightmare. So let me finish with a few tips and checks for building CNNs.

  • The deeper your CNN, the better the results tend to be, up to a point
  • Unless you are using special tricks, keep the number of layers of the CNN fewer than 12
  • Use global pooling as the last spatial operation (this will result in fewer fully connected weights) to reduce computational overhead
  • Avoid using fully connected layers; the slight performance increase is not worth the computational overhead
  • If possible, always use cross validation to find the best hyperparameters for your specific task
  • Don’t use aggressive learning rates. A typical symptom is the loss becoming inf or NaN
  • If you’re using L2 regularization, make sure the loss with L2 is of the same scale as the loss without L2
  • Experiment with different regularizations (Dropout, BatchNorm, L2) and combinations too!
  • Most importantly, have fun!
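As a parting sketch for the global pooling tip in the list above, the snippet below compares the number of classifier weights needed with and without global average pooling. The 8×8×64 feature-map shape, the 10 classes and the helper name `global_average_pool` are assumptions chosen for illustration.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over its spatial extent: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

feature_maps = np.ones((8, 8, 64))  # hypothetical output of the last convolution
n_classes = 10                      # hypothetical number of classes

pooled = global_average_pool(feature_maps)  # shape (64,)

# Weights needed by a classifier attached to each representation.
weights_after_flatten = feature_maps.size * n_classes  # 8 * 8 * 64 * 10 = 40960
weights_after_pooling = pooled.size * n_classes        # 64 * 10 = 640
```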