In this lecture we will discuss the backpropagation algorithm for training neural networks.
Definition: A (feed-forward) neural network (NN) is a function that is encoded in a graph. It is defined on a subset of $\mathbb{R}^d$ and maps to $\mathbb{R}$ in the case of regression or to a finite set of class labels in the case of classification.
The first layer of the network is the input layer and accepts the coordinates of the input vector $x = (x_1, \dots, x_d)$. The last layer is the output layer. All the remaining layers are called hidden layers. The NN is feed-forward in the sense that information in the graph flows from one layer to the next and there are no directed cycles.
There is also a numeric weight $w_{ij}$ associated with every edge $(i, j)$ of the network, and a smooth activation function $\sigma$.
We will use the following notation: to every node $j$ of the graph we associate two numbers, $a_j = \sum_i w_{ij} z_i$, where $i$ goes through all nodes whose edges point into $j$, and $z_j = \sigma(a_j)$. For the nodes of the input layer, $z_j$ is equal to the corresponding coordinate of the argument of the function.
Given the input vector $x$ we propagate it through the hidden layers using the formulas above. Finally, the value of the function is given by $\sum_i w_{i,\mathrm{out}} z_i$, where the sum runs over all nodes $i$ that point into the output node.
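As a concrete illustration, here is a minimal sketch of the forward pass for a fully connected network, implementing the recursion $a_j = \sum_i w_{ij} z_i$, $z_j = \sigma(a_j)$ above (the tanh activation, the linear output node, and the layer sizes in the example are illustrative assumptions, not part of the definition):

```python
import math

def forward(x, weights):
    """Forward pass through a fully connected feed-forward net.

    weights: list of layer matrices; weights[l][j][i] is the weight
    on the edge from node i of layer l to node j of layer l + 1.
    Hidden layers apply tanh; the output layer is left linear.
    """
    z = list(x)
    pre_activations = []  # the a_j values, kept because backprop needs them
    for l, W in enumerate(weights):
        a = [sum(W[j][i] * z[i] for i in range(len(z))) for j in range(len(W))]
        pre_activations.append(a)
        last = (l == len(weights) - 1)
        z = a if last else [math.tanh(aj) for aj in a]
    return z, pre_activations

# tiny example: 2 inputs -> 2 hidden nodes -> 1 output
W1 = [[0.5, -0.3], [0.8, 0.2]]
W2 = [[1.0, -1.0]]
y, _ = forward([1.0, 2.0], [W1, W2])
```

Each matrix row collects the weights on edges pointing into one node, so the inner sum is exactly $a_j = \sum_i w_{ij} z_i$.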
Thus, a neural network is simply a function from the input space into the output space. The totality of all the weights $\{w_{ij}\}$ constitutes the set of model parameters.
Note that biases can be included in the neural network by adding an extra input node whose value is fixed to $1$.
Symmetries of neural networks.
One of the most common choices of the activation function is the hyperbolic tangent $\sigma(a) = \tanh(a)$. Since it is anti-symmetric, $\tanh(-a) = -\tanh(a)$, it is easy to see that simultaneously changing the signs of the weights on all edges pointing into and out of any node in a hidden layer does not change the function that the neural network represents. Moreover, shuffling the nodes within any hidden layer does not change the function either. One can easily show that if a hidden layer contains $M$ nodes, the number of equivalent weight configurations is at least $M! \, 2^M$. This can be a serious problem, because it suggests that in general position the number of local minima of a neural network's risk grows exponentially with the size of the network.
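The sign-flip symmetry is easy to verify numerically. The sketch below (the one-hidden-layer architecture and the particular weights are arbitrary choices for the check) flips the signs of all weights into and out of one hidden tanh node and compares the outputs:

```python
import math

def net(x, w_in, w_out):
    """One hidden tanh layer, linear output node."""
    hidden = [math.tanh(sum(w[i] * x[i] for i in range(len(x)))) for w in w_in]
    return sum(w_out[j] * hidden[j] for j in range(len(hidden)))

w_in = [[0.4, -0.7], [1.1, 0.3], [-0.5, 0.9]]  # weights into the 3 hidden nodes
w_out = [0.6, -1.2, 0.8]                       # weights out of the hidden nodes

# flip the signs of all weights into and out of hidden node 1
w_in_flip = [w_in[0], [-w for w in w_in[1]], w_in[2]]
w_out_flip = [w_out[0], -w_out[1], w_out[2]]

x = [0.25, -1.5]
same = abs(net(x, w_in, w_out) - net(x, w_in_flip, w_out_flip)) < 1e-12
```

The flip works because $w \cdot \tanh(a) = (-w) \cdot \tanh(-a)$, so the two sign changes cancel.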
NN in the problems of regression. As usual, when dealing with regression problems we minimize the mean squared error $R(w) = \frac{1}{n} \sum_{k=1}^{n} \bigl( f(x_k; w) - y_k \bigr)^2$.
NN in the problems of classification. In this case, we use the sigmoid function $\sigma(a) = 1/(1 + e^{-a})$ as the activation function in the output layer. The risk functional that we are trying to minimize is the cross-entropy $R(w) = -\sum_{k=1}^{n} \bigl[ y_k \log f(x_k; w) + (1 - y_k) \log\bigl(1 - f(x_k; w)\bigr) \bigr]$.
The output of the network is interpreted in this case as the class probability $f(x; w) \approx P(y = 1 \mid x)$, and the observations are assumed to be conditionally independent.
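A short sketch of this risk for the binary case (the function names are illustrative; each $a_k$ below stands for the pre-activation of the output node on example $k$):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(outputs, labels):
    """R = -sum_k [ y_k log p_k + (1 - y_k) log(1 - p_k) ],  p_k = sigmoid(a_k)."""
    return -sum(y * math.log(sigmoid(a)) + (1 - y) * math.log(1 - sigmoid(a))
                for a, y in zip(outputs, labels))

# two examples: the net is confident and right on the first (a = 2, y = 1),
# moderately confident and right on the second (a = -1, y = 0)
loss = cross_entropy([2.0, -1.0], [1, 0])
```

The loss is small when the predicted probabilities agree with the labels and blows up as a confident prediction becomes wrong.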
The backpropagation algorithm is at the core of most training algorithms for neural networks. It computes the gradient of the risk of a neural network with respect to the weights at a given point.
In most cases there is no analytical solution to the problem of minimizing the empirical risk, and the minimization is performed by gradient descent. At every iteration the weight vector is updated in the direction of the negative gradient according to $w^{(t+1)} = w^{(t)} - \eta \, \nabla R(w^{(t)})$, where $\eta > 0$ defines the step size. With a suitably chosen step size this procedure converges to a local minimum (more precisely, a stationary point) of the empirical risk.
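In code, one iteration is just the update $w \leftarrow w - \eta \nabla R(w)$. A minimal sketch, using an illustrative quadratic toy risk in place of a neural network risk and a fixed step size:

```python
def gradient_descent(grad, w, eta=0.1, tol=1e-8, max_iter=10000):
    """Minimize a function given its gradient by fixed-step gradient descent.

    Stops when every component of the gradient is below tol.
    """
    for _ in range(max_iter):
        g = grad(w)
        if max(abs(gi) for gi in g) < tol:
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# toy risk R(w) = (w0 - 1)^2 + (w1 + 2)^2, so grad R = (2(w0 - 1), 2(w1 + 2));
# the unique minimum is at (1, -2)
grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]
w_star = gradient_descent(grad, [0.0, 0.0])
```

For a neural network, `grad` would be supplied by backpropagation, and the risk is generally non-convex, so the point reached depends on the initial weights.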
We will briefly describe the idea of backpropagation without giving complete proofs.
Let $w_{ij}$ denote the weight on some edge, where $i$ is the start node and $j$ is the end node. Let $a_j = \sum_i w_{ij} z_i$ as above.
Let us start with the computation of the partial derivative $\partial R / \partial w_{ij}$. By the chain rule, since the empirical risk depends on $w_{ij}$ only through $a_j$, $\frac{\partial R}{\partial w_{ij}} = \frac{\partial R}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}$.
If we denote $\delta_j = \partial R / \partial a_j$ and compute $\partial a_j / \partial w_{ij} = z_i$, we can conclude that $\frac{\partial R}{\partial w_{ij}} = \delta_j z_i$. We know $z_i$ by simply evaluating the neural net at the current value of the weights. Therefore, to compute the gradient we need to know the value of $\delta_j$ for all nodes $j$.
Let us take some node $i$ and index by $j$ all the nodes that are connected with it and are the ends of the corresponding edges (that is, all $j$ with an edge from $i$ to $j$). We can write $\delta_i = \frac{\partial R}{\partial a_i} = \sum_j \frac{\partial R}{\partial a_j} \frac{\partial a_j}{\partial a_i}$. Now, from the definition of $a_j$ it is easy to see that $\frac{\partial a_j}{\partial a_i} = w_{ij} \, \sigma'(a_i)$, and therefore $\delta_i = \sigma'(a_i) \sum_j w_{ij} \delta_j$. Thus, in order to compute the deltas in one layer we need to know the values of the deltas in the next layer. If we do, we may recursively apply these computations and move from the output layer to the input layer one layer at a time. In particular, it suffices to compute the values of the deltas on the output layer only. This is left as an exercise for the reader.
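Putting the recursion together, here is a minimal sketch of backpropagation for a one-hidden-layer tanh network with the squared-error loss $R = \frac{1}{2}(f(x) - y)^2$ (this specific architecture, the loss, and the $\frac{1}{2}$ convention are assumptions made for the example; for a linear output node the output delta is simply $f(x) - y$):

```python
import math

def backprop(x, y, w_in, w_out):
    """Gradients of R = (f(x) - y)^2 / 2 for a one-hidden-layer tanh net."""
    # forward pass: a_j = sum_i w_in[j][i] x_i,  z_j = tanh(a_j),  f = sum_j w_out[j] z_j
    a = [sum(wj[i] * x[i] for i in range(len(x))) for wj in w_in]
    z = [math.tanh(aj) for aj in a]
    f = sum(w_out[j] * z[j] for j in range(len(z)))
    # backward pass: delta at the (linear) output node, then one step of the
    # recursion delta_j = sigma'(a_j) * w_out[j] * delta_out, sigma'(a) = 1 - tanh(a)^2
    delta_out = f - y
    delta = [(1 - z[j] ** 2) * w_out[j] * delta_out for j in range(len(z))]
    # dR/dw = delta_at_end_of_edge * z_at_start_of_edge
    grad_out = [delta_out * z[j] for j in range(len(z))]
    grad_in = [[delta[j] * x[i] for i in range(len(x))] for j in range(len(z))]
    return grad_in, grad_out

w_in = [[0.4, -0.7], [1.1, 0.3]]
w_out = [0.6, -1.2]
g_in, g_out = backprop([0.5, -0.25], 1.0, w_in, w_out)
```

Note that the forward pass supplies the $z_i$ and $a_i$ values, and the backward pass fills in the deltas layer by layer; a sanity check against finite differences is a standard way to verify such an implementation.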