What is Regularization?

Regularization is a technique used in an attempt to solve the overfitting[1] problem in statistical models.*

First of all, I want to clarify how this problem of overfitting arises.

When you want to model a problem, say predicting someone’s wage from their age, you will first try a linear regression model with age as the independent variable and wage as the dependent one. This model will mostly fail, since it is too simple.

Then, you might think: well, I also have the sex and the education of each individual in my data set. I could add these as explanatory variables.

Your model becomes more interesting and more complex. You measure its accuracy with a loss metric L(X, Y), where X is your design matrix and Y is the vector of observations (also called targets), here the wages.

You find out that your results are quite good but not as perfect as you wish.

So you add more variables: location, profession of parents, social background, number of children, weight, number of books, preferred color, best meal, last holidays destination and so on and so forth.

Your model will do well on the data it was fit to, but it is probably overfitting, i.e. it will probably have poor prediction and generalization power: it sticks too closely to the data and has probably learned the background noise while being fit. This is, of course, not acceptable.

So how do you solve this?

It is here where the regularization technique comes in handy.

You penalize your loss function by adding a multiple of an L1 (LASSO[2]) or an L2 (Ridge[3]) norm of your weights vector w (the vector of the learned parameters in your linear regression). You get the following equation:

    L(X, Y) + λ N(w)

(N is either the L1, L2 or any other norm)

This will help you avoid overfitting and, for certain regularization norms, will perform feature selection at the same time (the L1 norm in the LASSO does the job).
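To make the penalty concrete, here is a minimal sketch of a penalized loss, that is, the plain loss plus λ times a norm of the weights. It assumes a mean squared error loss and a tiny made-up data set; the function names are hypothetical. Note that Ridge is usually written with the squared L2 norm, which is what the sketch uses.

```python
def mse(weights, X, y):
    """Mean squared error of the linear model y_hat = X . w."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(w * x for w, x in zip(weights, row))
        total += (prediction - target) ** 2
    return total / len(y)


def l1_penalty(weights):
    """L1 norm of the weights (the LASSO penalty)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Squared L2 norm of the weights (the usual Ridge penalty)."""
    return sum(w * w for w in weights)


def penalized_loss(weights, X, y, lam, penalty):
    """Plain loss plus lam times the chosen penalty on the weights."""
    return mse(weights, X, y) + lam * penalty(weights)


# Made-up data that the weights [1, 2] fit exactly, so the plain loss is 0
# and only the penalty term remains.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 2.0, 5.0]
w = [1.0, 2.0]

lasso_loss = penalized_loss(w, X, y, lam=0.1, penalty=l1_penalty)
ridge_loss = penalized_loss(w, X, y, lam=0.1, penalty=l2_penalty)
print(lasso_loss, ridge_loss)
```

With λ = 0 both values fall back to the plain loss; the larger λ is, the more large weights are punished.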

Finally you might ask: OK, I have everything now. How can I tune the regularization parameter λ?

One possible answer is to use cross-validation: you divide your training data into subsets, train your model on some of them for a fixed value of λ, test it on the remaining subsets, and repeat this procedure while varying λ. Then you select the λ that minimizes your loss function on the held-out subsets.
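The procedure can be sketched in a few lines. This is only an illustration under simplifying assumptions: a one-feature ridge regression without intercept (which has a closed-form solution), made-up data, and folds built by taking every k-th sample. All names are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)**2) + lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_loss(xs, ys, lam, k):
    """Average held-out squared error over k folds for a fixed lam."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th sample goes to this fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, lam)
        total += sum((w * xs[i] - ys[i]) ** 2 for i in held_out) / len(held_out)
    return total / k


def select_lambda(xs, ys, candidates, k=3):
    """Return the candidate with the lowest cross-validated loss, plus all losses."""
    losses = {lam: cv_loss(xs, ys, lam, k) for lam in candidates}
    best = min(losses, key=losses.get)
    return best, losses


# Made-up data, roughly y = 2x plus a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1, 14.2, 15.8, 18.1]
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]

best_lam, losses = select_lambda(xs, ys, candidates)
print(best_lam, losses)
```

The loss used to compare values of λ is the one measured on the held-out fold, never the one measured on the data the model was fit to.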

I hope this was helpful. Let me know if there are any mistakes. I will try to add some graphs and eventually some R or Python code to illustrate this concept.

Also, you can read more about these topics (regularization and cross validation) here:

* Actually this is only one of the many uses. According to Wikipedia, it can be used to solve ill-posed problems. Here is the article for reference: Regularization (mathematics).

As always, make sure to follow me for more insights about machine learning and its pitfalls: http://quora.com/profile/Yassine…


[1] Overfitting

[2] Lasso (statistics)

[3] Tikhonov regularization

Source – https://www.quora.com/What-is-regularization-in-machine-learning

What is batch size and epoch in neural network?

Batch size defines the number of samples that will be propagated through the network at once. For instance, let’s say you have 1050 training samples and you set batch_size to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (101st to 200th) and trains the network again. We keep doing this until all samples have been propagated through the network. A problem usually arises with the last set of samples: in our example we used 1050, which is not divisible by 100 without a remainder. The simplest solution is to take the final 50 samples and train the network on them.
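The slicing described above can be sketched as follows. The helper name is hypothetical, and the "samples" are just the integers 1 to 1050 standing in for real training data.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(1, 1051))  # stand-ins for 1050 training samples
batch_lengths = [len(batch) for batch in iterate_batches(samples, 100)]
print(len(batch_lengths), batch_lengths[-1])
```

Python slicing simply stops at the end of the list, so the last, shorter batch needs no special case.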


Using a batch size smaller than the number of all samples requires less memory. Since you train the network on fewer samples at a time, the overall training procedure requires less memory. This is especially important if you cannot fit the whole dataset in memory.

Typically, networks train faster with mini-batches, because we update the weights after each propagation. In our example we propagated 11 batches (10 of them had 100 samples and 1 had 50 samples), and after each of them we updated the network’s parameters. If we used all samples in a single propagation, we would make only 1 update to the network’s parameters.


The drawback is that the smaller the batch, the less accurate the estimate of the gradient. In the figure below you can see that the direction of the mini-batch gradient (green) fluctuates compared to that of the full batch (blue).

[Figure: gradient direction for mini-batch (green) vs. full batch (blue)]

Stochastic gradient descent is just mini-batch training with batch_size equal to 1. In that case the gradient changes its direction even more often than with a mini-batch.
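A small numerical sketch of this effect, assuming a one-weight linear model with squared error and made-up data. For each batch size it measures how far the worst mini-batch gradient lies from the full-batch gradient. The names are hypothetical.

```python
def batch_gradient(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def max_deviation(w, xs, ys, batch_size):
    """Largest gap between a mini-batch gradient and the full-batch gradient."""
    full = batch_gradient(w, xs, ys)
    gaps = []
    for start in range(0, len(xs), batch_size):
        batch_x = xs[start:start + batch_size]
        batch_y = ys[start:start + batch_size]
        gaps.append(abs(batch_gradient(w, batch_x, batch_y) - full))
    return max(gaps)


# Made-up data, roughly y = 2x. Real training would shuffle before batching.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w = 1.0

for batch_size in (1, 4, 8):
    print(batch_size, max_deviation(w, xs, ys, batch_size))
```

With batch_size equal to the data set size the gap is zero by construction; it grows as the batches shrink.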


In the neural network terminology:

  • one epoch = one forward pass and one backward pass of all the training examples
  • batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.
  • number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.

FYI: Tradeoff batch size vs. number of iterations to train a neural network



Epoch and Iteration describe slightly different things.

As others have already mentioned, an “epoch” describes the number of times the algorithm sees the ENTIRE data set. So each time the algorithm has seen all samples in the dataset, an epoch has completed.

An “iteration” describes the number of times a “batch” of data has passed through the algorithm. In the case of neural networks, that means the “forward pass” and “backward pass”. So every time you pass a batch of data through the NN, you complete an “iteration”.

An example might make it clearer:

Say you have a dataset of 10 examples/samples. You have a batch size of 2, and you’ve specified you want the algorithm to run for 3 epochs.

Therefore, in each epoch, you have 5 batches (10/2 = 5). Each batch gets passed through the algorithm, therefore you have 5 iterations per epoch. Since you’ve specified 3 epochs, you have a total of 15 iterations (5*3 = 15) for training.
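The arithmetic in these examples can be written down directly. The function names are hypothetical; rounding up covers the case where the last batch is smaller than the others.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


def total_iterations(num_samples, batch_size, epochs):
    """Iterations over the whole training run."""
    return iterations_per_epoch(num_samples, batch_size) * epochs


print(iterations_per_epoch(10, 2))     # 10 samples, batch size 2
print(total_iterations(10, 2, 3))      # the same, run for 3 epochs
print(iterations_per_epoch(1000, 500))
print(iterations_per_epoch(1050, 100))
```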


Source – https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network


What is momentum in neural networks?

In neural networks, we use the gradient descent optimization algorithm to minimize the error function and reach a global minimum. In an ideal world the error function would look like this:

Here you are guaranteed to find the global optimum because there are no local minima where your optimization can get stuck. In reality, however, the error surface is more complex, may comprise several local minima, and may look like this:

In this case, you can easily get stuck in a local minimum, and the algorithm may think you have reached the global minimum, leading to sub-optimal results. To avoid this situation, we use a momentum term in the weight update, which is a value between 0 and 1 that increases the size of the steps taken towards the minimum and so tries to jump out of local minima.

If the momentum term is large, then the learning rate should be kept smaller. A large value of momentum also means that convergence will happen fast. But if both the momentum and the learning rate are kept at large values, then you might skip the minimum with a huge step. A small value of momentum cannot reliably avoid local minima and can also slow down the training of the system. Momentum also helps smooth out the variations if the gradient keeps changing direction. A good value of momentum can be found either by trial and error or through cross-validation.
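The text gives no formula, so the following is only a sketch assuming the classical momentum update: the new step is the momentum times the previous step, minus the learning rate times the gradient. It is run on the simple function f(w) = w**2, which has no local minima, so it only illustrates how momentum enlarges the steps, not how it escapes a local minimum. All names are hypothetical.

```python
def descend(grad, w0, learning_rate, momentum, steps):
    """Gradient descent with a classical momentum term (momentum=0 is plain descent)."""
    w = w0
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous step and add the new gradient step.
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return w


def grad_quadratic(w):
    """Derivative of f(w) = w**2, whose minimum is at w = 0."""
    return 2.0 * w


w_plain = descend(grad_quadratic, w0=5.0, learning_rate=0.01, momentum=0.0, steps=50)
w_momentum = descend(grad_quadratic, w0=5.0, learning_rate=0.01, momentum=0.9, steps=50)
print(w_plain, w_momentum)
```

With the same small learning rate and the same number of steps, the run with momentum ends up closer to the minimum at 0.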

Pictures Source – Momentum and Learning Rate Adaptation

What is bias in neural network?

In a typical artificial neural network each neuron/activity in one “layer” is connected, via a weight, to each neuron in the next layer. Each of these activities stores the result of some computation, normally a composite of the weighted activities in previous layers.

A bias unit is an “extra” neuron added to each pre-output layer that stores the value of 1. Bias units aren’t connected to any previous layer and in this sense don’t represent a true “activity”.

Take a look at the following illustration:

The bias units are characterized by the text “+1”. As you can see, a bias unit is just appended to the start/end of the input and each hidden layer, and isn’t influenced by the values in the previous layer. In other words, these neurons don’t have any incoming connections.

So why do we have bias units? Well, bias units still have outgoing connections and they can contribute to the output of the ANN. Let’s call the outgoing weights of the bias units w_b. Now, let’s look at a really simple neural network that just has one input and one connection:

Let’s say act() – our activation function – is just f(x) = x, or the identity function. In such case, our ANN would represent a line because the output is just the weight (m) times the input (x).

When we change our weight w1 (the m in y = mx), we change the slope of the function to make it steeper or flatter. But what about shifting the function vertically? In other words, what about setting the y-intercept? This is crucial for many modelling problems! Our optimal models may not pass through the origin.

So, we know that our function output = w*input (y = mx) needs to have a constant term added to it. In other words, we can say output = w*input + w_b, where w_b is our constant term c. When we use neural networks, though, or do any multi-variable learning, our computations are done through linear algebra and matrix arithmetic, e.g. dot products and multiplication. This can also be seen graphically in the ANN. There should be a matching number of weights and activities for a weighted sum to occur. Because of this, we need to “add” an extra input term so that we can add a constant term with it. Since one multiplied by any value is that value, we just “insert” an extra value of 1 at every layer. This is called the bias unit.
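A minimal sketch of this trick, with made-up numbers and hypothetical helper names: appending a constant 1 to the inputs and w_b to the weights gives the same result as adding w_b to the weighted sum by hand.

```python
def weighted_sum(inputs, weights):
    """Dot product of inputs and weights; both must have the same length."""
    return sum(i * w for i, w in zip(inputs, weights))


def with_bias_unit(inputs):
    """Append the bias unit, which always holds the value 1."""
    return inputs + [1.0]


inputs = [2.0, 3.0]
weights = [0.5, -1.0]
w_b = 4.0  # outgoing weight of the bias unit

explicit = weighted_sum(inputs, weights) + w_b
via_bias_unit = weighted_sum(with_bias_unit(inputs), weights + [w_b])
print(explicit, via_bias_unit)
```

Both values agree, which is why the constant term can be handled by the same matrix arithmetic as every other weight.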

From this diagram, you can see that we’ve now added the bias term and hence the weight w_b will be added to the weighted sum and fed through the activation function as a constant value. This constant term, also called the “intercept term” (as demonstrated by the linear example), shifts the activation function to the left or to the right. With the identity activation it is also the output when the input is zero.
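The shift can be checked numerically. This sketch assumes a single sigmoid neuron with one input, a hypothetical function name `neuron`, and made-up weights. The sigmoid outputs 0.5 exactly where w*x + w_b is zero, so changing w_b moves that point along the x-axis.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, w_b):
    """One-input neuron: the activation applied to the weighted sum plus the bias weight."""
    return sigmoid(w * x + w_b)


# Without a bias weight the curve passes through 0.5 at x = 0.
print(neuron(0.0, w=1.0, w_b=0.0))

# With w_b = 2 the midpoint moves to x = -w_b / w = -2, i.e. the curve shifts left.
print(neuron(-2.0, w=1.0, w_b=2.0))
print(neuron(0.0, w=1.0, w_b=2.0))
```

Changing w alone only makes the curve steeper or flatter around x = 0; only w_b moves the midpoint.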

Here is a diagram of how different weights will transform the activation function (sigmoid in this case) by scaling it up/down:

But now, by adding the bias unit, the possibility of translating the activation function exists:

Going back to the linear regression example, since the bias unit’s value is 1, we add bias*w_b = 1*w_b = w_b to the weighted sum that is fed to the activation function. In the example with the line, we can create a non-zero y-intercept:

I’m sure you can imagine infinite scenarios where the line of best fit does not go through the origin or even come near it. Bias units are important with neural networks in the same way.


Source: Rohan Kapur