Backpropagation Algorithm
Posted on 17 Jan 2015, tagged algorithm, deep learning
These days I have started to learn neural networks again and to write some Matlab code from scratch. I try to understand everything I do while writing the code, so I derive the equations of backpropagation here while trying to keep them clear and easy to understand.
Neural Network Functions
A multi-layer neural network could be defined like this:
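As a sketch, assuming \(a_1 = x\) is taken as the input layer and \(\sigma\) is an element-wise activation:

\[
a_1 = x, \qquad z_{i+1} = a_i w_i + b_i, \qquad a_{i+1} = \sigma(z_{i+1}), \qquad i = 1, 2, \dots, l-1
\]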
Assume \(layer_i\) is the number of neurons in layer \(i\). Then the variables in the equations can be explained as below:

- \(x\) is the input, which is a row vector with \(layer_1\) elements.
- \(w_i\) means the weights in layer \(i\), which is a matrix with \(layer_i\) rows and \(layer_{i+1}\) columns.
- \(b_i\) means the biases in layer \(i\), which is a row vector with \(layer_{i+1}\) elements.
- \(a_i\) means the activation function in the \(i\)th layer, whose output is a row vector with \(layer_i\) elements.
- \(l\) means the last layer. The output of \(a_l\) is the output of the neural network.
The activation \(\sigma(z)\) may be different in different use cases. Here is one example:
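Assuming the element-wise sigmoid as the example (its derivative will be needed later):

\[
\sigma(z) = {1 \over 1 + e^{-z}}, \qquad \sigma^{'}(z) = \sigma(z)(1 - \sigma(z))
\]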
Gradient Descent Algorithm
We need a cost function to measure how well we are doing so far, and the training of the network then becomes an optimization problem. The method we use for this problem is gradient descent. Let me try to explain it.
Assume we have a cost function, and it is always non-negative. For example, this function is a good one:
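Assuming the quadratic cost as the example here, where \(y\) is the expected output for the input \(x\):

\[
C(w, b) = {1 \over 2} \| y - a_l \| ^ 2
\]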
Then the goal is to make the output of the cost function smaller. Since \(x\) and \(y\) are fixed, the change of the cost function when \(w\) and \(b\) change a little can be written like this:
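To first order (treating \(w\) and \(b\) as single variables here for intuition):

\[
\Delta C \approx {\partial C \over \partial w} \Delta w + {\partial C \over \partial b} \Delta b
\]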
In order to make the cost function smaller, we need to make \(\Delta C\) negative. We can choose \(\Delta w = - {\partial C \over \partial w}\) and \(\Delta b = - {\partial C \over \partial b}\), so that \(\Delta C = - ({\partial C \over \partial w}) ^ 2 - ({\partial C \over \partial b}) ^ 2\), which is never positive.
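In practice the step is usually scaled by a small learning rate \(\eta > 0\) (this \(\eta\) is an extra parameter, also assumed in the code sketch at the end):

\[
\Delta w = - \eta {\partial C \over \partial w}, \qquad \Delta b = - \eta {\partial C \over \partial b}
\]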
Compute the Partial Derivatives
So the goal is to compute the partial derivatives \(\partial C \over \partial w\) and \(\partial C \over \partial b\). Using equations (1), (2), (3), (5) and the chain rule, we can get this:
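As a sketch under the conventions above (the layer indexing \(z_{i+1} = a_i w_i + b_i\) and the quadratic cost are assumptions):

\[
{\partial C \over \partial w_i} = a_i^T \left( {\partial C \over \partial a_{i+1}} \odot \sigma^{'}(z_{i+1}) \right), \qquad
{\partial C \over \partial b_i} = {\partial C \over \partial a_{i+1}} \odot \sigma^{'}(z_{i+1}),
\]

\[
{\partial C \over \partial a_i} = \left( {\partial C \over \partial a_{i+1}} \odot \sigma^{'}(z_{i+1}) \right) w_i^T, \qquad
{\partial C \over \partial a_l} = a_l - y .
\]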
Note that we use \(\odot\) before \(\sigma^{'}\) because the function \(\sigma(z)\) is applied element-wise.
We can see that these equations share a common part: \({\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})}\). So we can define \(\delta_i = {\partial C \over \partial a_i} \odot {\sigma^{'}(z_i)}\) and rewrite the equations as below, to avoid computing this part many times:
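Under the same assumed conventions, the rewritten equations become:

\[
\delta_l = {\partial C \over \partial a_l} \odot \sigma^{'}(z_l), \qquad
\delta_i = \left( \delta_{i+1} w_i^T \right) \odot \sigma^{'}(z_i),
\]

\[
{\partial C \over \partial w_i} = a_i^T \delta_{i+1}, \qquad
{\partial C \over \partial b_i} = \delta_{i+1} .
\]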
With these equations, we can write the backpropagation algorithm easily.
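As a concrete illustration, here is a minimal Matlab/Octave sketch of one gradient-descent step built from the equations above (the function names, the sigmoid activation, the quadratic cost, and the learning rate `eta` are assumptions of this sketch, not the original code):

```matlab
% backprop_step.m -- one gradient-descent step for a fully connected network.
% Conventions (assumptions of this sketch): x and y are row vectors,
% W{i} is a layer_i-by-layer_{i+1} matrix, b{i} is a row vector with
% layer_{i+1} elements, and sigma is the element-wise sigmoid.
function [W, b] = backprop_step(W, b, x, y, eta)
  L = numel(W) + 1;                 % total number of layers
  a = cell(1, L);
  z = cell(1, L);
  a{1} = x;

  % Forward pass: z_{i+1} = a_i * w_i + b_i, a_{i+1} = sigma(z_{i+1})
  for i = 1:L-1
    z{i+1} = a{i} * W{i} + b{i};
    a{i+1} = sigmoid(z{i+1});
  end

  % Backward pass: delta_l = (a_l - y) .* sigma'(z_l) for the quadratic cost
  delta = (a{L} - y) .* sigmoid_prime(z{L});
  for i = L-1:-1:1
    dW = a{i}' * delta;             % dC/dw_i = a_i^T * delta_{i+1}
    db = delta;                     % dC/db_i = delta_{i+1}
    if i > 1
      % delta_i = (delta_{i+1} * w_i^T) .* sigma'(z_i), using the old weights
      delta = (delta * W{i}') .* sigmoid_prime(z{i});
    end
    W{i} = W{i} - eta * dW;         % gradient-descent update
    b{i} = b{i} - eta * db;
  end
end

function s = sigmoid(z)
  s = 1 ./ (1 + exp(-z));
end

function s = sigmoid_prime(z)
  s = sigmoid(z) .* (1 - sigmoid(z));
end
```

Here `W` and `b` are cell arrays holding \(w_1, \dots, w_{l-1}\) and \(b_1, \dots, b_{l-1}\); calling the function repeatedly over the training examples performs plain stochastic gradient descent.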