In a typical artificial neural network each neuron/activity in one “layer” is connected – via a weight – to each neuron in the next layer. Each of these activities stores the result of some computation, normally a composite of the weighted activities in the previous layer.
A bias unit is an “extra” neuron added to each pre-output layer that stores the value of 1. Bias units aren’t connected to any previous layer and in this sense don’t represent a true “activity”.
Take a look at the following illustration:
The bias units are characterized by the text “+1”. As you can see, a bias unit is just appended to the start/end of the input and each hidden layer, and isn’t influenced by the values in the previous layer. In other words, these neurons don’t have any incoming connections.
So why do we have bias units? Well, bias units still have outgoing connections and they can contribute to the output of the ANN. Let’s call the outgoing weights of the bias units w_b. Now, let’s look at a really simple neural network that just has one input and one connection:
Let’s say act() – our activation function – is just f(x) = x, or the identity function. In such a case, our ANN would represent a line through the origin, because the output is just the weight (m) times the input (x).
When we change our weight w1, we change the gradient of the function, making it steeper or flatter. But what about shifting the function vertically? In other words, what about setting the y-intercept? This is crucial for many modelling problems! Our optimal models may not pass through the origin.
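To make this concrete, here is a minimal sketch of the one-input, identity-activation network without a bias. The function name `line_without_bias` and the sample weights are made up for illustration. Whatever weight we choose, the slope changes but the output at an input of zero stays at zero.

```python
def line_without_bias(x, w):
    """Identity activation, no bias unit: output = w * x."""
    return w * x

for w in (0.5, 1.0, 3.0):
    # Rise over a run of 1 gives the slope of the line.
    slope = line_without_bias(1.0, w) - line_without_bias(0.0, w)
    at_zero = line_without_bias(0.0, w)
    print(f"w={w}: slope={slope}, output at x=0 is {at_zero}")
```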
So, we know that our function output = w*input (y = mx) needs to have a constant term added to it. In other words, we can say output = w*input + w_b, where w_b is our constant term c. When we use neural networks, though, or do any multi-variable learning, our computations are done through linear algebra and matrix arithmetic, e.g. dot products and matrix multiplication. This can also be seen graphically in the ANN. There must be a matching number of weights and activities for a weighted sum to occur. Because of this, we need to “add” an extra input term so that its weight can act as the constant term. Since one multiplied by any value is that value, we just “insert” an extra value of 1 at every layer. This is called the bias unit.
From this diagram, you can see that we’ve now added the bias unit, and hence the weight w_b will be added to the weighted sum and fed through the activation function as a constant value. This constant term, also called the “intercept term” (as demonstrated by the linear example), shifts the activation function to the left or to the right. With the identity activation, it is also the output when the input is zero.
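The "insert a 1" trick can be sketched in a few lines of plain Python. The helper names `weighted_sum` and `unit_output` are hypothetical, and the identity activation is assumed as in the example above. Appending 1 to the inputs and w_b to the weights keeps the two lists the same length, so one dot product yields w*input + w_b.

```python
def weighted_sum(weights, activities):
    """Dot product of two equal-length lists."""
    if len(weights) != len(activities):
        raise ValueError("weights and activities must have the same length")
    return sum(w * a for w, a in zip(weights, activities))

def unit_output(inputs, weights, w_b):
    """Identity-activation unit with a bias unit appended to the inputs."""
    augmented_inputs = list(inputs) + [1.0]    # the bias unit always holds 1
    augmented_weights = list(weights) + [w_b]  # its outgoing weight is w_b
    return weighted_sum(augmented_weights, augmented_inputs)

x, w, w_b = 3.0, 2.0, 0.5
print(unit_output([x], [w], w_b))    # equals w * x + w_b
print(unit_output([0.0], [w], w_b))  # equals w_b when the input is zero
```

Folding the bias into the weight list this way means the same dot-product routine handles both ordinary weights and the constant term.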
Here is a diagram of how different weights will transform the activation function (sigmoid in this case) by scaling it up/down:
But now, by adding the bias unit, we gain the possibility of translating the activation function:
Going back to the linear regression example: because the bias unit always holds the value 1, it contributes bias*w_b = 1*w_b = w_b to the weighted sum that is fed to the activation function. In the example with the line, this lets us create a non-zero y-intercept:
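As a rough numerical check of both effects, the sketch below evaluates a single sigmoid unit. The function names and the weight values are made up for illustration. Without a bias, the output at an input of zero is 0.5 for every weight, so the weight only changes how steep the curve is. With a bias, the point where the output crosses 0.5 moves to x = -w_b / w.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit(x, w, w_b=0.0):
    """Single-input sigmoid unit; w_b is the weight on the bias unit (value 1)."""
    return sigmoid(w * x + 1.0 * w_b)

# Without a bias, every curve passes through (0, 0.5); w only changes steepness.
for w in (0.5, 1.0, 2.0):
    print(w, unit(0.0, w), unit(1.0, w))

# With a bias, the 0.5 crossing point is translated to x = -w_b / w.
w, w_b = 2.0, -4.0
midpoint = -w_b / w
print(midpoint, unit(midpoint, w, w_b))
```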
I’m sure you can imagine countless scenarios where the line of best fit does not go through the origin or even come near it. Bias units are important in neural networks for the same reason.
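One such scenario can be sketched with closed-form least squares rather than neural network training; that substitution is an assumption made to keep the example short. The data points are made up and lie on y = 2x + 5. A fit forced through the origin cannot match them, while a fit with an intercept term (playing the role of w_b) recovers the line.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y = w * x (no bias term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_with_intercept(xs, ys):
    """Least-squares slope and intercept for y = w * x + w_b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    w_b = mean_y - w * mean_x
    return w, w_b

def sum_squared_error(xs, ys, w, w_b=0.0):
    return sum((w * x + w_b - y) ** 2 for x, y in zip(xs, ys))

# Made-up data generated from y = 2x + 5.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [5.0, 7.0, 9.0, 11.0]

w_only = fit_through_origin(xs, ys)
w_fit, b_fit = fit_with_intercept(xs, ys)
print("no bias:  ", w_only, sum_squared_error(xs, ys, w_only))
print("with bias:", w_fit, b_fit, sum_squared_error(xs, ys, w_fit, b_fit))
```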
Source: Rohan Kapur