In neural networks, we use the gradient descent optimization algorithm to minimize the error function and reach the global minimum. In an ideal world, the error function would look like this:
Here you are guaranteed to find the global optimum because there are no local minima where the optimization can get stuck. In reality, however, the error surface is more complex: it may comprise several local minima and look like this:
In this case, you can easily get stuck in a local minimum, and the algorithm may conclude that it has reached the global minimum, leading to sub-optimal results. To avoid this situation, we add a momentum term to the weight update rule. The momentum coefficient is a value between 0 and 1 that increases the size of the steps taken towards the minimum, helping the optimizer push past a local minimum. If the momentum term is large, the learning rate should be kept smaller. A large value of momentum also means that convergence happens faster. But if both the momentum and the learning rate are large, you might skip over the minimum with a huge step. A small value of momentum cannot reliably avoid local minima and can also slow down training. Momentum also helps smooth out the updates when the gradient keeps changing direction. A good value of momentum can be found either by trial and error or through cross-validation.
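To make the update rule concrete, here is a minimal sketch of gradient descent with momentum. The function and parameter names (`gradient_descent_with_momentum`, `grad_fn`, `lr`, `momentum`, `n_steps`) and the default values are illustrative assumptions, not taken from the original article.

```python
import numpy as np

def gradient_descent_with_momentum(grad_fn, w_init, lr=0.01, momentum=0.9, n_steps=100):
    """Minimize an error function using gradient descent with a momentum term.

    grad_fn:  callable returning the gradient of the error function at w.
    lr:       learning rate.
    momentum: coefficient between 0 and 1 (illustrative default of 0.9).
    """
    w = np.array(w_init, dtype=float)
    velocity = np.zeros_like(w)  # accumulated update direction
    for _ in range(n_steps):
        grad = grad_fn(w)
        # Keep a fraction of the previous step and add the new gradient step.
        # This smooths out oscillating gradients and can carry the parameters
        # past shallow local minima.
        velocity = momentum * velocity - lr * grad
        w = w + velocity
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_opt = gradient_descent_with_momentum(lambda w: 2 * (w - 3), w_init=[0.0])
print(w_opt)  # converges towards 3.0
```

Note how the two hyperparameters interact, as described above: raising `momentum` effectively enlarges the steps, so `lr` usually has to be reduced to avoid overshooting the minimum.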
Picture source: Momentum and Learning Rate Adaptation