What is momentum in neural networks

In neural networks, we use gradient descent optimization algorithm to minimize the error function to reach a global minima. In an ideal world the error function would look like this

So you are guaranteed to find the global optimum because there are no local minimum where your optimization can get stuck. However in real the error surface is more complex, may comprise of several local minima and may look like this

In this case, you can easily get stuck in a local minima and the algorithm may think you reach the global minima leading to sub-optimal results. To avoid this situation, we use a momentum term in the objective function, which is a value between 0 and 1 that increases the size of the steps taken towards the minimum by trying to jump from a local minima. If the momentum term is large then the learning rate should be kept smaller. A large value of momentum also means that the convergence will happen fast. But if both the momentum and learning rate are kept at large values, then you might skip the minimum with a huge step. A small value of momentum cannot reliably avoid local minima, and can also slow down the training of the system. Momentum also helps in smoothing out the variations, if the gradient keeps changing direction. A right value of momentum can be either learned by hit and trial or through cross-validation.

Pictures Source – Momentum and Learning Rate Adaptation

What is bias in neural network?

In a typical artificial neural network each neuron/activity in one “layer” is connected – via a weight – to each neuron in the next activity. Each of these activities stores some sort of computation, normally a composite of the weighted activities in previous layers.

A bias unit is an “extra” neuron added to each pre-output layer that stores the value of 1. Bias units aren’t connected to any previous layer and in this sense don’t represent a true “activity”.

Take a look at the following illustration:

The bias units are characterized by the text “+1”. As you can see, a bias unit is just appended to the start/end of the input and each hidden layer, and isn’t influenced by the values in the previous layer. In other words, these neurons don’t have any incoming connections.

So why do we have bias units? Well, bias units still have outgoing connections and they can contribute to the output of the ANN. Let’s call the outgoing weights of the bias units w_b. Now, let’s look at a really simple neural network that just has one input and one connection:

Let’s say act() – our activation function – is just f(x) = x, or the identity function. In such case, our ANN would represent a line because the output is just the weight (m) times the input (x).

When we change our weight w1, we will change the gradient of the function to make it steeper or flatter. But what about shifting the function vertically? In other words, what about setting the y-intercept. This is crucial for many modelling problems! Our optimal models may not pass through the origin.

So, we know that our function output = w*input (y = mx) needs to have this constant term added to it. In other words, we can say output = w*input + w_b, where w_b is our constant term c. When we use neural networks, though, or do any multi-variable learning, our computations will be done through Linear Algebra and matrix arithmetic eg. dot-product, multiplication. This can also be seen graphically in the ANN. There should be a matching number of weights and activities for a weighted sum to occur. Because of this, we need to “add” an extra input term so that we can add a constant term with it. Since, one multiplied by any value is that value, we just “insert” an extra value of 1 at every layer. This is called the bias unit.

From this diagram, you can see that we’ve now added the bias term and hence the weight w_b will be added to the weighted sum, and fed through activation function as a constant value. This constant term, also called the “intercept term” (as demonstrated by the linear example), shifts the activation function to the left or to the right. It will also be the output when the input is zero.

Here is a diagram of how different weights will transform the activation function (sigmoid in this case) by scaling it up/down:

But now, by adding the bias unit, we the possibility of translating the activation function exists:

Going back to the linear regression example, if w_b is 1, then we will add bias*w_b = 1*w_b = w_b to the activation function. In the example with the line, we can create a non-zero y-intercept:

I’m sure you can imagine infinite scenarios where the line of best fit does not go through the origin or even come near it. Bias units are important with neural networks in the same way.


Source: Rohan Kapur