May 2017

In neural networks, we use the gradient descent optimization algorithm to minimize the error function and reach the global minimum. In an ideal world, the error function would look like this

So you would be guaranteed to find the global optimum, because there are no local minima where the optimization can get stuck. In reality, however, the error surface is more complex, may comprise several local minima, and may look like this

In this case, you can easily get stuck in a local minimum, and the algorithm may conclude that you have reached the global minimum, leading to sub-optimal results. To avoid this situation, we add a momentum term to the weight update: a coefficient between 0 and 1 that increases the size of the steps taken towards the minimum and helps the search jump out of a local minimum. If the momentum term is large, the learning rate should be kept smaller. A large value of momentum also means that convergence will happen faster. But if both the momentum and the learning rate are kept at large values, you might skip over the minimum with a huge step. A small value of momentum cannot reliably avoid local minima and can also slow down training. Momentum also helps smooth out the variations when the gradient keeps changing direction. A good value of momentum can be found either by trial and error or through cross-validation.
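To make the update rule concrete, here is a minimal sketch of gradient descent with momentum in Python. The toy error function, the starting point and the coefficient values are illustrative assumptions, not taken from the article.

```python
# Minimal sketch of gradient descent with momentum on a toy 1-D error
# function f(w) = w**4 - 3*w**2 + w (illustrative choice for this example).

def grad(w):
    return 4 * w**3 - 6 * w + 1   # derivative of the toy error function

learning_rate = 0.01
momentum = 0.9        # momentum coefficient, a value between 0 and 1
w = 2.0               # starting weight
velocity = 0.0        # decaying running sum of past gradient steps

for step in range(200):
    velocity = momentum * velocity - learning_rate * grad(w)
    w += velocity     # part of the previous step is kept, which can carry
                      # the search past a shallow local minimum

print(w)              # settles near a minimum of the toy function
```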

Picture source – Momentum and Learning Rate Adaptation


In a typical artificial neural network, each neuron/activity in one “layer” is connected – via a weight – to each neuron in the next layer. Each of these activities stores some sort of computation, normally a composite of the weighted activities in the previous layers.

A bias unit is an “extra” neuron added to each pre-output layer that stores the value of 1. Bias units aren’t connected to any previous layer and in this sense don’t represent a true “activity”.

Take a look at the following illustration:

The bias units are marked with the text “+1”. As you can see, a bias unit is just appended to the start/end of the input and of each hidden layer, and isn’t influenced by the values in the previous layer. In other words, these neurons don’t have any incoming connections.

So why do we have bias units? Well, bias units still have outgoing connections and they can contribute to the output of the ANN. Let’s call the outgoing weights of the bias units w_b. Now, let’s look at a really simple neural network that just has one input and one connection:

Let’s say act() – our activation function – is just f(x) = x, the identity function. In that case, our ANN would represent a line, because the output is just the weight (m) times the input (x).

When we change our weight w1, we change the gradient of the function, making it steeper or flatter. But what about shifting the function vertically? In other words, what about setting the y-intercept? This is crucial for many modelling problems! Our optimal models may not pass through the origin.

So we know that our function output = w*input (y = mx) needs to have this constant term added to it. In other words, we can say output = w*input + w_b, where w_b is our constant term c. When we use neural networks, though, or do any multi-variable learning, our computations are done through linear algebra and matrix arithmetic, e.g. dot products and matrix multiplication. This can also be seen graphically in the ANN: there must be a matching number of weights and activities for a weighted sum to occur. Because of this, we need to “add” an extra input term so that a constant can be added along with it. Since one multiplied by any value is that value, we just “insert” an extra value of 1 at every layer. This is called the bias unit.
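As a small sketch of this “insert a 1” trick (the variable names and values below are illustrative, not from the article), appending a constant 1 to the inputs and the bias weight w_b to the weight vector folds the constant term into the same dot product:

```python
import numpy as np

inputs = np.array([0.5, -1.2, 3.0])    # activities from the previous layer
weights = np.array([0.8, 0.1, -0.4])   # one weight per activity
w_b = 0.25                             # outgoing weight of the bias unit

# Without a bias unit the neuron can only compute a plain weighted sum:
z_no_bias = np.dot(weights, inputs)    # w * input

# Appending a constant 1 to the inputs and w_b to the weights folds the
# constant term into the very same dot product:
inputs_with_bias = np.append(inputs, 1.0)
weights_with_bias = np.append(weights, w_b)
z_with_bias = np.dot(weights_with_bias, inputs_with_bias)   # w * input + w_b

print(z_no_bias, z_with_bias)          # the second value is shifted by w_b
```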

From this diagram, you can see that we’ve now added the bias term, and hence the weight w_b will be added to the weighted sum and fed through the activation function as a constant value. This constant term, also called the “intercept term” (as demonstrated by the linear example), shifts the activation function to the left or to the right. It also determines the output when the input is zero.

Here is a diagram of how different weights will transform the activation function (sigmoid in this case) by scaling it up/down:

But now, by adding the bias unit, the possibility of translating the activation function exists:

Going back to the linear regression example, since the bias unit always holds the value 1, we add bias*w_b = 1*w_b = w_b to the weighted sum. In the example with the line, this lets us create a non-zero y-intercept:
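As a rough numerical illustration of the scaling-versus-shifting point above (the weight values here are just assumptions for the example), compare a sigmoid computed with and without a bias weight:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5, 5, 11)
w1 = 2.0      # input weight: controls how steep the curve is
w_b = -3.0    # bias weight: shifts the curve left/right

without_bias = sigmoid(w1 * x)        # always crosses 0.5 at x = 0
with_bias = sigmoid(w1 * x + w_b)     # the 0.5 crossing moves to x = -w_b / w1

for xi, a, b in zip(x, without_bias, with_bias):
    print(f"x={xi:+.1f}  sigmoid(w1*x)={a:.3f}  sigmoid(w1*x + w_b)={b:.3f}")
```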

I’m sure you can imagine infinite scenarios where the line of best fit does not go through the origin, or even come near it. Bias units are important to neural networks in the same way.

 

Source: Rohan Kapur
https://www.quora.com/What-is-bias-in-artificial-neural-network

https://stats.stackexchange.com/questions/153933/importance-of-the-bias-node-in-neural-networks
