# Backpropagation Algorithm

Posted on 17 Jan 2015, tagged algorithmdeep learning

These days I start to learn neural networks again and write some Matlab codes from scratch. I try to understand everything I do while I write the code, so I derive the equations in the back propagation while try to keep it clear and easy to understand.

## Neural Network Functions

A multi layer neural network could be defined as this:

\begin{equation} a_1(x, w, b) = x \\ \end{equation} \begin{equation} z_i(x, w, b) = a_{i-1}(x, w, b) \cdot w_{i-1} + b_{i-1} \end{equation} \begin{equation} a_i(x, w, b) = \sigma(z_i(x, w, b)) \end{equation}

Assume $$layer_i$$ means the number of neural in layer i, then the variables in the equations could be explained as below:

• x is the input, which is a row vector, it has $$layer_1$$ elements.
• $$w_i$$ meas the weights in layer i, which is a matrix of $$layer_i$$ rows and $$layer_{i+1}$$ columns.
• $$b_i$$ means biases in layer i, which is a row vector of $$layer_{i+1}$$ elements.
• $$a_i$$ means activation function in the ith layer, which the output is a row vector, it has $$layer_i$$ elements.
• l means the last layer. The output of $$a_l$$ is the output of the neural network.

And $$\sigma(z)$$ may be different in different use cases. This one is an example:

\begin{equation} \sigma(z) = {1 \over 1 + e^{-z}} \end{equation}

We need a cost function to measure how well do we do for now. And the training of the network becomes a optimization problem. The method we use in the problem is gradient descent. Let me try to explain it.

Assume we have a cost function, and it is always non-negative. For example, this function is a good one:

\begin{equation} C(x, y, w, b) = {1 \over 2} (y - a_l(x, w, b)) ^ 2 \end{equation}

Then the goal is try to make the output of the cost function smaller. Since x and y is fixed, the change of cost function while change w and b little could be shown as this:

\begin{equation} \Delta C = {\partial C \over \partial w} \Delta w + {\partial C \over \partial b} \Delta b \end{equation}

In order to make the cost function smaller, we need to make $$\Delta C$$ negative. We can make $$\Delta w = - {\partial C \over \partial w}$$ and $$\Delta b = - {\partial C \over \partial b}$$. So that $$\Delta C = - ({\partial C \over \partial w}) ^ 2 - ({\partial C \over \partial b}) ^ 2$$ which is always negative.

## Compute Partial Derivative

So the goal is to compute the partial derivative $$\partial C \over \partial w$$ and $$\partial C \over \partial b$$. Using equations (1), (2), (3), (5) and chain rule, we can get this:

\begin{equation} {\partial C \over \partial a_l} = a_l - y \end{equation} \begin{equation} {\partial C \over \partial a_i} = {\partial C \over \partial a_{i+1}} {\partial a_{i+1} \over \partial \sigma} {\partial \sigma \over \partial z_{i+1}} {\partial z_{i+1} \over \partial a_i} = {\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})} w_i^{'} \end{equation} \begin{equation} {\partial C \over \partial w_i} = {\partial C \over \partial a_{i+1}} {\partial a_{i+1} \over \partial \sigma} {\partial \sigma \over \partial z_{i+1}} {\partial z_{i+1} \over \partial w_i} = a_i^{'} {\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})} \end{equation} \begin{equation} {\partial C \over \partial b_i} = {\partial C \over \partial a_{i+1}} {\partial a_{i+1} \over \partial \sigma} {\partial \sigma \over \partial z_{i+1}} {\partial z_{i+1} \over \partial b_i} = {\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})} \end{equation}

Note that we use $$\odot$$ before $$\sigma^{'}$$ is because function $$\sigma(z)$$ is element wise.

We can find there are many same parts in these equations: $${\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})}$$. So we can define $$\delta_i = {\partial C \over \partial a_i} \odot {\sigma^{'}(z_i)}$$, and rewrite these equations like this to avoid compute these parts many times:

\begin{equation} \delta_l = (a_l - y) \odot \sigma^{'}(z_l) \end{equation} \begin{equation} \delta_i = \delta_{i+1} w_i^{'} \odot \sigma^{'}(z_i) \end{equation} \begin{equation} {\partial C \over \partial w_i} = a_i^{'} \delta_{i+1} \end{equation} \begin{equation} {\partial C \over \partial b_i} = \delta_{i+1} \end{equation}

With these equations, we can write the back propagation algorithm easily.

 Prev: Why Use Reflections to Write A Web Framework Next: How About Translate IMAP And SMTP Into HTTP API?