# Lecture III. Neural Networks and Back-propagation Algorithm.

In this lecture we will discuss the backpropagation algorithm for training neural networks.

Definition: A (feed-forward) neural network (NN) is a function encoded in a graph. It is defined on a subset $X \subset \mathbb{R}^d$ and maps to $Y = \mathbb{R}$ in the case of regression or $Y = \{0, 1\}$ in the case of classification. The first layer of the network is the input layer and accepts the components of the vector $(x_1, x_2, \ldots, x_d)$. The last layer is the output layer. All remaining layers are called hidden layers. The NN is feed-forward in the sense that information flows through the graph from one layer to the next and there are no directed cycles.

There is also a numeric weight associated with every edge of the network, and a smooth activation function $h$ applied at every hidden node.

We will use the following notation: to every node $i$ of the graph we associate two numbers, $a_i = \sum_{k} w_k z_k$, where $k$ runs over all edges pointing into $i$, and $z_i = h(a_i)$. For the nodes of the input layer, $z$ equals the corresponding component of the argument.

Given the input vector $(x_1, x_2, \ldots, x_d)$, we propagate it through the hidden layers using the formulas above. Finally, the value of the function is given by $y = h(\sum_{k} w_k z_k)$, where $k$ runs over all nodes that point into the output node.
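As an illustration, the forward pass above can be sketched for a single hidden layer. This is a minimal sketch, assuming a fully connected network with a tanh activation; the names `forward`, `W1`, `W2` are illustrative, not from the lecture.

```python
import math

def forward(x, W1, W2):
    """Forward pass through one hidden layer.

    a_j = sum_i W1[j][i] * x[i]   (weighted input of hidden node j)
    z_j = tanh(a_j)               (hidden activation)
    y   = h(sum_j W2[j] * z_j)    (output; tanh plays the role of h here)
    """
    a = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    z = [math.tanh(a_j) for a_j in a]
    return math.tanh(sum(w * z_j for w, z_j in zip(W2, z)))
```

Each hidden node computes its weighted input $a_j$, applies $h$, and passes the result $z_j$ forward, exactly as in the formulas above.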

Thus, a neural network is simply a function from the input space to the output space. The totality of all weights constitutes the set of model parameters.

Note that biases can be included in the neural network by adding an extra input node with the constant value $1$.

### Symmetries of the Neural Networks.

One of the most common choices of activation function is the hyperbolic tangent. Since it is odd (anti-symmetric), simultaneously flipping the signs of the weights on all edges pointing into and out of any node in a hidden layer does not change the function that the neural network represents. Moreover, permuting the nodes within any hidden layer does not change the function either. One can easily show that if a hidden layer contains $M$ nodes, the number of weight configurations representing the same function is at least $M!\,2^M$. This can become a major problem, because it suggests that in general position the number of local minima of a neural network grows exponentially with the size of the network.
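Both symmetries are easy to verify numerically. A sketch for one tanh hidden layer with a scalar input; all weight values are illustrative.

```python
import math

def net(x, w_in, w_out):
    # one tanh hidden layer: y = sum_j w_out[j] * tanh(w_in[j] * x)
    return sum(wo * math.tanh(wi * x) for wi, wo in zip(w_in, w_out))

w_in, w_out = [1.2, -0.4], [0.5, 2.0]
# flip the signs of the weights into and out of hidden node 0:
# since tanh(-a) = -tanh(a), the represented function is unchanged
flipped_in, flipped_out = [-1.2, -0.4], [-0.5, 2.0]
# swap the two hidden nodes: the function is unchanged as well
swapped_in, swapped_out = [-0.4, 1.2], [2.0, 0.5]
```

Each hidden node contributes $2$ sign choices and the layer contributes $M!$ orderings, giving the $M!\,2^M$ count above.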

NN in regression problems. As usual, we minimize the mean squared error when dealing with regression.

NN in classification problems. In this case, we use the logistic sigmoid $y = \frac{1}{1+e^{-a}}$ as the activation function in the output layer. The risk functional that we minimize is the cross-entropy $R^{EMP}(w) = - \sum_{n=1}^N \left( y_n \log f(x_n, w) + (1-y_n) \log(1 - f(x_n, w)) \right)$.

The output of the network is interpreted in this case as the class probability $P(y=1 \mid x)$, and the training examples are assumed to be independent, so that the cross-entropy is the negative log-likelihood of the data.
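A minimal sketch of the sigmoid output and the cross-entropy risk above (function names are illustrative):

```python
import math

def sigmoid(a):
    # logistic activation used at the output node: 1 / (1 + e^{-a})
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(targets, outputs):
    # R_emp(w) = - sum_n [ y_n log f(x_n, w) + (1 - y_n) log(1 - f(x_n, w)) ]
    return -sum(y * math.log(f) + (1.0 - y) * math.log(1.0 - f)
                for y, f in zip(targets, outputs))
```

Note that the risk decreases as the predicted probabilities move toward the true labels, which is what makes it a sensible training objective.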

### Backpropagation Algorithm.

The backpropagation algorithm is at the core of most training algorithms for neural networks. It computes the gradient of the empirical risk of a neural network at a given point in weight space.

In most cases there is no analytical solution to the problem of minimizing the empirical risk, and it is approached by gradient descent. At every iteration the weight vector is updated in the direction of $-\nabla R^{EMP}$ according to $w^{(\tau + 1)} = w^{(\tau)} - \eta \nabla R^{EMP}$, where $\eta > 0$ defines the step size. Under suitable conditions this procedure converges to a local minimum of the empirical risk.
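The update rule can be sketched on a toy one-dimensional objective, here $R(w) = (w-3)^2$; the step size and iteration count are illustrative.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    # the update rule: w^(tau+1) = w^(tau) - eta * grad(w^(tau))
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w
```

For example, `gradient_descent(lambda w: 2 * (w - 3), 0.0)` approaches the minimizer $w = 3$. Backpropagation supplies the `grad` function when $R$ is the empirical risk of a network.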

We will briefly describe the idea of backpropagation without giving complete proofs.

Step 1.

Let $w$ denote the weight on some edge $ij$, where $i$ is the start node and $j$ is the end node of the edge. Let $a = \sum_k w_k z_k$ be the weighted input of node $j$.

Let us start with the computation of the partial derivative $\frac{\partial R^{EMP}}{\partial w} = \frac{\partial R^{EMP}}{\partial a}\frac{\partial a}{\partial w}$, which holds because the empirical risk depends on $w$ only through $a$.

If we denote $\delta = \frac{\partial R^{EMP}}{\partial a}$ and compute $\frac{\partial a}{\partial w} = z$, we can conclude that $\frac{\partial R^{EMP}}{\partial w} = \delta z$. We know $z$ simply by evaluating the neural network at the current value of the weights. Therefore, to compute the gradient we only need to know the value of $\delta$ for every node.

Step 2.

Let us take some node, and let $k$ index all the nodes it points to, i.e. the end nodes of the corresponding outgoing edges. We can write $\delta = \frac{\partial R^{EMP}}{\partial a} = \sum_k \frac{\partial R^{EMP}}{\partial a_k} \frac{\partial a_k}{\partial a} = \sum_k \delta_k \frac{\partial a_k}{\partial a}$. Now, from the definition of $a_k$ it is easy to see that $\frac{\partial a_k}{\partial a} = w_k h'(a)$. Therefore, to compute $\delta$ at a node we need to know the deltas of the next layer. If we do, we may apply this recursion and move from the output layer toward the input layer, one layer at a time. In particular, it only remains to compute the deltas at the output layer. This is left as an exercise for the reader.
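The two steps can be combined into a minimal backpropagation sketch for one tanh hidden layer. As an assumption for concreteness, the output here is linear with squared-error loss $\frac{1}{2}(y - t)^2$ (rather than the activated output used earlier), so that the output delta is simply $y - t$; all names are illustrative.

```python
import math

def forward_backward(x, t, W1, w2):
    """One tanh hidden layer, linear output, loss 0.5 * (y - t)^2.
    Returns the loss and the gradients dL/dW1, dL/dw2 via the deltas."""
    # forward pass: a_j, z_j = h(a_j), then the output y
    a1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    z1 = [math.tanh(a) for a in a1]
    y = sum(w * z for w, z in zip(w2, z1))
    loss = 0.5 * (y - t) ** 2
    # backward pass, Step 2: delta at the output node
    delta_out = y - t
    # Step 1: dR/dw = delta * z for the output-layer weights
    g_w2 = [delta_out * z for z in z1]
    # Step 2 recursion: delta = sum_k delta_k * w_k * h'(a), with h'(a) = 1 - tanh(a)^2
    delta1 = [delta_out * w * (1.0 - math.tanh(a) ** 2) for w, a in zip(w2, a1)]
    # Step 1 again, one layer back: dR/dw = delta * z, where z is the input
    g_W1 = [[d * xi for xi in x] for d in delta1]
    return loss, g_W1, g_w2
```

The gradients returned by the backward pass can be checked against finite differences of the loss, which is a standard sanity test for any backpropagation implementation.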