I. Introduction
Deep Learning techniques have generated many of the state-of-the-art models [1, 2, 3] that have reached impressive results on benchmark datasets like MNIST [4]. Such models are usually trained with variations of the standard backpropagation method, such as stochastic gradient descent (SGD). In the field of shallow neural networks, several developments in training algorithms have sped up convergence [5, 6]. This paper aims to bridge the gap between the field of Deep Learning and these advanced training methods, by combining Resilient Propagation (Rprop) [5], Dropout [7] and ensembles of Deep Neural Networks.

I-A. Rprop
The Resilient Propagation [5] weight update rule was initially introduced as a possible solution to the “vanishing gradients” problem: as the depth and complexity of an artificial neural network increase, the gradient propagated backwards by standard SGD backpropagation becomes increasingly smaller, leading to negligible weight updates, which slow down training considerably. Rprop solves this problem by using a fixed update value Δij, which is increased or decreased multiplicatively at each iteration by an asymmetric factor, η+ or η− respectively, depending on whether the gradient with respect to wij has changed sign between two iterations or not. This “backtracking” allows Rprop to still converge to a local minimum, while the acceleration provided by the multiplicative factor helps it skip over flat regions much more quickly. To avoid double punishment when in the backtracking phase, Rprop artificially forces the gradient product to be 0, so that the following iteration is skipped. An illustration of Rprop can be found in Algorithm 1.

I-B. Dropout
Dropout [7] is a regularisation method by which only a random selection of nodes in the network is updated during each training iteration, while at the final evaluation stage the whole network is used. The selection is performed by sampling a dropout mask m from a Bernoulli distribution with success probability 1 − p, where p is the probability of node i being muted during the weight update step of backpropagation, and is called the dropout rate; it is usually 0.5 for the middle layers, 0.2 for the input layers, and 0 for the output layer. For convenience this dropout mask is represented as a binary weight matrix M, covering all the weights in the network, which can be used to multiply the weight space of the network to obtain what is called a thinned network for the current training iteration, where each weight is zeroed out based on the probability of its parent node being muted.

The remainder of this paper is structured as follows: Section II describes the zero-gradient problem introduced by Dropout and proposes an adaptation of Rprop; Section III evaluates the adapted method on MNIST, both for single networks and for ensembles; Section IV concludes and outlines future work.
II. Rprop and Dropout
In this section we explain the zero-gradient problem, and propose a solution by adapting the Rprop algorithm to be aware of Dropout.
II-A. The zero-gradient problem
In order to avoid double punishment when there is a change of sign in the gradient, Rprop artificially sets the gradient product associated with weight wij for the next iteration to 0. This condition is checked during the following iteration, and if it holds, no updates to the weights or the learning rate are performed.
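The skip behaviour just described can be sketched for a single weight as follows. This is an illustrative NumPy rendering of the standard Rprop rule, not the paper's Algorithm 1; the constants (η+ = 1.2, η− = 0.5, step bounds) are the usual defaults rather than values taken from the text, and all names are our own.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One Rprop iteration for a scalar weight.

    Returns the updated weight, the gradient to store for the next
    iteration, and the adapted step size.
    """
    sign_change = grad * prev_grad
    if sign_change > 0:          # same direction: accelerate
        step = min(step * eta_plus, step_max)
        w -= np.sign(grad) * step
        stored_grad = grad
    elif sign_change < 0:        # overshoot: backtrack and slow down
        step = max(step * eta_minus, step_min)
        stored_grad = 0.0        # force the product to 0, skip next adaptation
    else:                        # zero product: plain update, no adaptation
        w -= np.sign(grad) * step
        stored_grad = grad
    return w, stored_grad, step
```

Note how the backtracking branch stores a zero gradient, so the following iteration falls into the "zero product" branch and performs no learning-rate adaptation; this is exactly the skip mechanism that Dropout interferes with.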
Using the zero-valued gradient product as an indication to skip an iteration is acceptable in normal gradient descent, because the only other way it can occur is when learning has terminated. When Dropout is introduced, additional events can produce these zero values:

- When node i itself is muted by the dropout mask, the gradients of all its weights are forced to 0.
- When a node in the layer above is skipped, the gradient propagated back to all the weights below it is also 0.
These additional zero-gradient events force additional skipped training iterations and missed learning-rate adaptations, which slow down training unnecessarily.
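These events can be checked numerically on a toy two-layer network; the network (3 inputs, 4 ReLU hidden nodes, 1 linear output), the mask and the loss below are our own minimal example, not the paper's architecture.

```python
import numpy as np

# Show that a dropped-out hidden node yields exactly zero gradients
# for its incoming weights, producing a spurious "skip" signal.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))           # input -> hidden weights
w2 = rng.normal(size=4)                # hidden -> output weights
mask = np.array([1.0, 0.0, 1.0, 1.0])  # hidden node 1 is muted

a = W1 @ x                             # pre-activations
h = np.maximum(a, 0.0) * mask          # thinned hidden layer
y = w2 @ h                             # network output
# Backward pass for the loss L = 0.5 * y**2:
delta = y * w2 * mask * (a > 0)        # gradient arriving at each hidden node
grad_W1 = np.outer(delta, x)           # gradient w.r.t. incoming weights
```

Every entry of `grad_W1` in the row of the muted node is exactly zero, regardless of the data, which an unmodified Rprop cannot distinguish from its own backtracking signal.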
II-B. Adaptations to Rprop
By making Rprop aware of the dropout mask M, we are able to distinguish whether a zero-gradient event occurs as a signal to skip the next weight update, or whether it occurs for a different reason, in which case the weight and step-size updates should be allowed. The new version of the Rprop update rule for each weight is shown in Algorithm 2. We use t to indicate the current training example, t − 1 the previous training example, and t + 1 the next training example; where a value with t = 0 appears, it is intended to be the initial value. All other notation is the same as used in the original Rprop:

- E is the error function (in this case the negative log likelihood)
- Δij is the current update value (step size) for the weight at index (i, j)
- Δwij is the current weight update for index (i, j)
In particular, the conditions at line 5 and line 18 of Algorithm 2 provide the necessary protection from the additional zero-gradients, and correctly implement the recipe prescribed by Dropout, by completely skipping every weight for which Mij = 0 (which means that node i was dropped out, and therefore the gradient will necessarily be 0). We expect that this methodology can be extended to other variants of Rprop, such as, but not limited to, iRprop+ [8] and JRprop [6].
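The core of this adaptation can be sketched as a guarded version of the per-weight update; this is in the spirit of Algorithm 2, not a transcription of it, and the constants and names are our own illustrative choices.

```python
import numpy as np

def masked_rprop_step(w, grad, prev_grad, step, node_kept,
                      eta_plus=1.2, eta_minus=0.5,
                      step_min=1e-6, step_max=50.0):
    """Dropout-aware Rprop step for a scalar weight.

    `node_kept` is False when the dropout mask muted the parent node,
    in which case the zero gradient is spurious and all state is left
    untouched, rather than being treated as a backtracking skip.
    """
    if not node_kept:
        # Dropped-out node: gradient is necessarily 0; keep weight,
        # step size and stored gradient unchanged for the next pass.
        return w, prev_grad, step
    sign_change = grad * prev_grad
    if sign_change > 0:
        step = min(step * eta_plus, step_max)
        w -= np.sign(grad) * step
        stored_grad = grad
    elif sign_change < 0:
        step = max(step * eta_minus, step_min)
        stored_grad = 0.0   # genuine backtracking: skip next adaptation
    else:
        w -= np.sign(grad) * step
        stored_grad = grad
    return w, stored_grad, step
```

The key difference from plain Rprop is that a masked weight preserves its previously stored gradient, so the step-size adaptation resumes correctly the next time the node participates in training.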
III. Evaluating on MNIST
In this section we describe an initial evaluation of performance on the MNIST dataset. For all experiments we use a Deep Neural Network (DNN) with five middle layers, a dropout rate of 0.5 for the middle layers, and no Dropout on the inputs. This dropout rate has been shown to be an optimal choice for the MNIST dataset in [9]. A similar architecture has been used to produce state-of-the-art results [3]; however, the authors used the entire training set for validation, and graphical transformations of said set for training. These added transformations lead to a “virtually infinite” training set, whereby at every epoch a new training set is generated, while the much larger validation set consists of the original images. The test set remains the original test images. An explanation of these transformations is provided in [10], which also confirms that:
“The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data”
We therefore attribute these large improvements to the transformations applied, and did not consider it a primary goal to replicate them in order to obtain state-of-the-art results; instead we focused on the untransformed dataset, split into a training, a validation and a test set of the original images. Subsequently, we performed a search, using the validation set as an indicator, to find the optimal hyperparameters of the modified version of Rprop, and used the best values found in all subsequent experiments. We trained all models up to a maximum number of allowed epochs, and measured the error on the validation set at every epoch, so that it could be used to select the model to be applied to the test set. We also measured the time taken to reach the best validation error, and report its approximate magnitude as an order-of-magnitude comparison. The results presented are an average of repeated runs.

III-A. Compared to SGD
From the results in Table I we see that the modified version of Rprop starts up much more quickly and approaches its minimum error far sooner; SGD reaches a higher error value, and only after a much longer time. Although the overall error improvement is significant, the speed gain from using Rprop is more appealing, because it saves a large number of iterations that could be used for improving the model in other ways. The modified Rprop obtains its best validation error after only 35 epochs, whilst SGD reached its minimum after 1763. The early epochs of training are illustrated in Figure 1.
Method     | Min Val Err | Epochs | Time    | Test Err | Err @ First Epoch
-----------|-------------|--------|---------|----------|------------------
SGD        | 2.85%       | 1763   | 320 min | 3.50%    | 88.65%
Rprop      | 3.03%       | 105    | 25 min  | 3.53%    | 12.81%
Mod Rprop  | 2.57%       | 35     | 10 min  | 3.49%    | 13.54%
III-B. Compared to unmodified Rprop
We can see from Figure 2 that the modified version of Rprop has a faster startup than the unmodified version, and stays consistently below it until it reaches its minimum. The unmodified version does not reach the same final error as the modified version, starts overtraining much sooner, and does not reach a better error than SGD. Table I shows in more detail how the performance of the two methods compares over the early epochs.
III-C. Using modified Rprop to speed up training of Deep Learning Ensembles
The increase in speed of convergence can make it practical to produce Ensembles of Deep Neural Networks, as the time to train each member DNN is considerably reduced without undertraining the network. We have been able to train these Ensembles in less than 12 hours in total on a single-GPU, single-CPU desktop system (an Nvidia GTX 770 graphics card with a Core i5 processor, programmed with Theano in Python). We trained different Ensemble types, and report the final results in Table II. The methods used are Bagging [11] and Stacking [12], with two ensemble sizes of member DNNs. Each member was trained for a maximum number of epochs.
Bagging is an ensemble method by which several different training sets are created by random resampling, with replacement, of the original training set, and each of these is used to train a new classifier. The entire set of trained classifiers is then usually aggregated by taking an average or a majority vote to reach a single classification decision.
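This recipe can be sketched in a few lines; `make_bags`, `bag_predict` and the placeholder classifiers below are our own names, standing in for the member DNNs used in the paper.

```python
import numpy as np

def make_bags(X, y, n_models, rng):
    """Yield `n_models` bootstrap resamples of (X, y)."""
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        yield X[idx], y[idx]

def bag_predict(models, X):
    """Aggregate integer class predictions by majority vote."""
    votes = np.stack([m(X) for m in models])        # (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Each resample is the same size as the original training set, so on average each member sees about 63% of the distinct examples, which is what gives the ensemble its diversity.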

Stacking is an ensemble method by which the different classifiers are aggregated using an additional learning algorithm, which takes the outputs of these first-space classifiers as its inputs and learns how to reach a better classification result. This additional learning algorithm is called a second-space classifier.
In the case of Stacking, the final second-space classifier was another DNN with two middle layers, whose sizes depend on the number of DNNs in the Ensemble, trained for a maximum number of epochs with the modified Rprop. We used the same original training, validation and test sets, and collected the average over repeated runs. The results are still not comparable to what is presented in [3], which is consistent with the observations about the importance of the dataset transformations; however, we note that we are able to improve the error in less time than it took to train a single network with SGD. A Wilcoxon signed-ranks test shows that the increase in performance obtained from the larger ensembles compared to the smaller ones is statistically significant.
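The stacking aggregation described above can be sketched with a trivial least-squares learner standing in for the second-space DNN; `stack_features` and `fit_second_space` are our own illustrative names, not the paper's implementation.

```python
import numpy as np

def stack_features(first_space, X):
    """One feature column per first-space classifier output."""
    return np.column_stack([clf(X) for clf in first_space])

def fit_second_space(first_space, X, y):
    """Fit a second-space learner on the stacked first-space outputs.

    A linear least-squares fit is used here purely for illustration;
    the paper trains a DNN on these stacked features instead.
    """
    F = stack_features(first_space, X)
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return lambda X_new: stack_features(first_space, X_new) @ coef
```

The essential structure is the same regardless of the second-space model: the first-space classifiers become fixed feature extractors, and the second-space learner is trained on their outputs.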
Method   | Size | Test Err | Time
---------|------|----------|--------
Bagging  |      |          | 35 min
Bagging  |      |          | 128 min
Stacking |      |          | 39 min
Stacking |      |          | 145 min
IV. Conclusions and Future Work
We have highlighted that many training methods that have been used in shallow learning may be adapted for use in Deep Learning. We have looked at Rprop and how the appearance of zero-gradients during training, as a side effect of Dropout, poses a challenge to learning, and proposed a solution which allows Rprop to train DNNs to a better error while still being much faster than standard SGD backpropagation.
We then showed that this increase in training speed can be used to effectively train an Ensemble of DNNs on a commodity desktop system, and reap the added benefits of Ensemble methods in less time than it would take to train a single Deep Neural Network with SGD.
It remains to be assessed in further work whether this improved methodology would lead to a new state-of-the-art error when applying the pre-training and dataset enhancements that have been used in other methods, and how the improvements to Rprop can be ported to its numerous variants.
Acknowledgement
The authors would like to thank the School of Business, Economics and Informatics, Birkbeck College, University of London, for the grant received to support this research.
References

[1] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
[2] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2012, pp. 3642–3649.
[3] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[4] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/
[5] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in Proceedings of the IEEE International Conference on Neural Networks. IEEE, 1993, pp. 586–591.
[6] A. D. Anastasiadis, G. D. Magoulas, and M. N. Vrahatis, “New globally convergent training scheme based on the resilient propagation algorithm,” Neurocomputing, vol. 64, pp. 253–270, 2005.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
[8] C. Igel and M. Hüsken, “Improving the Rprop learning algorithm,” in Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000), vol. 2000. Citeseer, 2000, pp. 115–121.
[9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[10] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” 2003. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=68920
[11] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[12] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp. 241–259, 1992.