In the last lecture we discussed the backpropagation algorithm for training neural networks. This is all well and good in theory, but a practical implementation of a neural network can be quite challenging. In this lecture we discuss the main practical issues and provide some references.
- Activation functions on the output and hidden layers.
Backpropagation and training by gradient descent work with many kinds of activation functions, provided they are differentiable almost everywhere. However, some activation functions perform better than others. Typical good choices are sigmoid functions, rectified linear units (ReLU), etc.
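To make the comparison concrete, here is a minimal sketch of a few activation functions together with the derivatives that backpropagation needs. The function names are illustrative, and the hyperbolic tangent is included as an assumption (it is a common sigmoid-shaped choice, but the text does not name it explicitly).

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    """Rectified linear unit: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

# Compare the gradients far from zero and at zero.
z = np.array([-5.0, 0.0, 5.0])
grads = {
    "sigmoid": sigmoid_grad(z),
    "tanh": tanh_grad(z),
    "relu": relu_grad(z),
}
```

For large |z| the sigmoid and tanh gradients are close to zero (the units saturate), whereas the ReLU gradient stays at 1 for every positive input. This is one reason the choice of activation affects how fast gradient descent learns.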
- Initialisation of weights.
This is an important question, as setting the initial weights well can greatly increase the speed of learning. The weights are selected so that most neurons initially operate in the region where the activation function has the largest gradient, rather than in its saturated tails. For normalized and scaled data the usual choice is to draw the weights at random from a zero-mean distribution whose spread shrinks with $n$, where $n$ is the number of neurons in the hidden layer.
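As an illustration, the sketch below draws weights uniformly from $[-1/\sqrt{n}, 1/\sqrt{n}]$, with $n$ taken as the number of units feeding into the layer. This particular scaling is an assumption: it is one widely used heuristic (discussed in the Glorot and Bengio reference), not necessarily the exact formula intended above. The helper name `init_weights` and the layer sizes are hypothetical.

```python
import numpy as np

def init_weights(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix uniformly from [-1/sqrt(n_in), 1/sqrt(n_in)].

    Assumption: n_in (the fan-in) plays the role of n in the text.
    """
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_hidden = 100, 50

W = init_weights(n_in, n_hidden, rng)

# Standardized inputs (zero mean, unit variance), as assumed in the text.
x = rng.standard_normal((1000, n_in))

# Pre-activations of the hidden layer.
z = x @ W
```

With this scaling the pre-activations `z` have a standard deviation below one regardless of `n_in`, so a sigmoid-shaped unit starts out in its steep, non-saturated region.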
- Choice of the risk function.
There is evidence that, in classification tasks, cross-entropy is a better choice of risk function than mean-squared error. To avoid overfitting, regularization can be used.
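The following sketch compares the two risk functions on a confidently wrong prediction and shows one possible regularizer. The L2 (weight decay) penalty and the names `l2_penalty` and `lam` are assumptions chosen for illustration; the text does not prescribe a specific form of regularization.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y_onehot, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot targets."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def mean_squared_error(probs, y_onehot):
    """Mean over samples of the squared error summed over classes."""
    return np.mean(np.sum((probs - y_onehot) ** 2, axis=1))

def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(np.sum(W ** 2) for W in weights)

# Two samples whose true class is 0: the first is predicted correctly,
# the second is confidently assigned to the wrong class.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0]])
y = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

probs = softmax(logits)
ce_right = cross_entropy(probs[:1], y[:1])
ce_wrong = cross_entropy(probs[1:], y[1:])
mse_wrong = mean_squared_error(probs[1:], y[1:])
```

The squared error on probabilities can never exceed 2, while the cross-entropy grows without bound as the probability assigned to the true class approaches zero, so confident mistakes are penalized more heavily. A regularized risk would then be, for example, `cross_entropy(probs, y) + l2_penalty(weights, lam)`.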
- The choice of the learning rate.
This is a classical topic in numerical analysis. Where second-derivative information about the risk functional is available, methods that exploit it can be used to set the step size.
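A toy example may help here. The sketch below minimizes a hypothetical two-dimensional quadratic risk, first by gradient descent with a fixed learning rate and then with a single Newton step that uses the Hessian. The matrix `A`, the vector `b`, and the number of iterations are made up for illustration; real network risks are not quadratic, so this only shows the principle.

```python
import numpy as np

# Toy quadratic risk R(w) = 0.5 * w^T A w - b^T w.
# Its Hessian is A and its minimizer solves A w = b.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(w):
    """Gradient of the quadratic risk at w."""
    return A @ w - b

w_star = np.linalg.solve(A, b)

# Gradient descent with a fixed learning rate. For a quadratic, the rate
# must stay below 2 / lambda_max to converge; here we use 1 / lambda_max.
lam_max = np.linalg.eigvalsh(A).max()
eta = 1.0 / lam_max
w_gd = np.zeros(2)
for _ in range(20):
    w_gd = w_gd - eta * grad(w_gd)

# Newton's method rescales the gradient by the inverse Hessian, which
# amounts to a separate learning rate for each curvature direction.
w_newton = np.zeros(2)
w_newton = w_newton - np.linalg.solve(A, grad(w_newton))
```

After 20 iterations gradient descent is still noticeably away from the minimizer along the low-curvature direction, whereas the Newton step reaches it in a single update on this quadratic. The price is that the Hessian has to be computed and inverted, which is costly for large networks.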
- Transformation of data.
It is always assumed that the data have been shifted and scaled so that each input feature has zero mean and unit standard deviation.
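A minimal sketch of this transformation is given below. The helper name `standardize` and the synthetic data are hypothetical. Reusing the training-set statistics for the test set is a common convention assumed here, not something the text specifies.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-12):
    """Shift and scale each feature to zero mean and unit standard deviation.

    The mean and standard deviation are estimated on the training set only
    and then applied unchanged to the test set.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant features, which would give a zero divisor.
    std = np.where(std < eps, 1.0, std)
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)

# Two features on very different scales.
X_train = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(500, 2))
X_test = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(100, 2))

Z_train, Z_test = standardize(X_train, X_test)
```

After the transformation both columns of `Z_train` have mean zero and standard deviation one, so no single feature dominates the weighted sums purely because of its units.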
 Bengio, Yoshua. “Practical recommendations for gradient-based training of deep architectures.” Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.
 LeCun, Yann, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. “Efficient BackProp.” Neural Networks: Tricks of the Trade. 1998.
 Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks.” International Conference on Artificial Intelligence and Statistics. 2010.