Adapting Resilient Propagation for Deep Learning

09/15/2015
by Alan Mosca, et al.
Birkbeck, University of London

The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.



I Introduction

Deep Learning techniques have generated many of the state-of-the-art models [1, 2, 3] that have reached impressive results on benchmark datasets like MNIST [4]. Such models are usually trained with variations of the standard backpropagation method, with stochastic gradient descent (SGD). In the field of shallow neural networks, there have been several developments to training algorithms that have sped up convergence [5, 6]. This paper aims to bridge the gap between the field of Deep Learning and these advanced training methods by combining Resilient Propagation (Rprop) [5], Dropout [7] and ensembles of Deep Neural Networks.

I-A Rprop

The Resilient Propagation (Rprop) [5] weight update rule was initially introduced as a possible solution to the “vanishing gradients” problem: as the depth and complexity of an artificial neural network increase, the gradient propagated backwards by standard SGD backpropagation becomes increasingly smaller, leading to negligible weight updates, which slow down training considerably. Rprop addresses this problem by using a fixed update value Δij for each weight wij, which is increased multiplicatively by a factor η+ when the gradient with respect to wij keeps its sign between two iterations, and decreased multiplicatively by an asymmetric factor η− when the sign changes. This “backtracking” allows Rprop to still converge to a local minimum, while the acceleration provided by the multiplicative factor helps it skip over flat regions much more quickly. To avoid double punishment during the backtracking phase, Rprop artificially forces the gradient product to be 0, so that the following iteration is skipped. An illustration of Rprop can be found in Algorithm 1.

1: require: η+ > 1, 0 < η− < 1, Δmax, Δmin
2: pick Δij(0) > 0
3: ∂E/∂wij(t−1) ← 0
4: for all wij do
5:     if ∂E/∂wij(t−1) · ∂E/∂wij(t) > 0 then
6:         Δij(t) ← min(Δij(t−1) · η+, Δmax)
7:         Δwij(t) ← −sign(∂E/∂wij(t)) · Δij(t)
8:         wij(t+1) ← wij(t) + Δwij(t)
9:         ∂E/∂wij(t−1) ← ∂E/∂wij(t)
10:     else if ∂E/∂wij(t−1) · ∂E/∂wij(t) < 0 then
11:         Δij(t) ← max(Δij(t−1) · η−, Δmin)
12:         ∂E/∂wij(t−1) ← 0
13:     else
14:         Δwij(t) ← 0
15:         wij(t+1) ← wij(t)
16:         ∂E/∂wij(t−1) ← ∂E/∂wij(t)
17:     end if
18: end for
Algorithm 1 Rprop
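
For concreteness, the following is a minimal NumPy sketch of the element-wise update in Algorithm 1. The function name, array layout and default constants (η+ = 1.2, η− = 0.5, Δmax = 50, Δmin = 1e-6) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One Rprop iteration over a weight array, following Algorithm 1.

    All arrays share one shape; the constants are common defaults, not the paper's.
    """
    grad = grad.copy()
    product = prev_grad * grad

    # Gradient kept its sign: grow the step size and move against the gradient.
    same = product > 0
    step[same] = np.minimum(step[same] * eta_plus, step_max)
    w[same] -= np.sign(grad[same]) * step[same]

    # Gradient changed sign: shrink the step size and zero the stored gradient,
    # so that the next iteration is skipped (the double-punishment guard).
    flipped = product < 0
    step[flipped] = np.maximum(step[flipped] * eta_minus, step_min)
    grad[flipped] = 0.0

    # Zero product (the skip signal): neither the weight nor the step is touched.
    return w, step, grad  # the returned grad becomes prev_grad at the next call
```

A training loop would call this once per iteration, feeding the returned gradient back in as prev_grad.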

I-B Dropout

Dropout [7] is a regularisation method by which only a random selection of nodes in the network is updated during each training iteration, while at the final evaluation stage the whole network is used. The selection is performed by sampling, for each node i, a dropout mask value mi from a Bernoulli distribution such that P(mi = 0) = pi, where pi is the probability of node i being muted during the weight update step of backpropagation; this dropout rate is usually 0.5 for the middle layers, lower for the input layer, and 0 for the output layer. For convenience the dropout mask is represented as a binary weight matrix M, covering all the weights in the network, which can be used to multiply the weight space of the network to obtain what is called a thinned network for the current training iteration, in which each weight is zeroed out based on the probability of its parent node being muted.
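
As a concrete illustration of this mask construction, the sketch below samples node-level Bernoulli masks for one fully connected layer and expands them into the binary weight mask that produces the thinned network. The function name, shapes and dropout rates are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def thin_layer(W, p_drop_in, p_drop_out):
    """Sample node-level dropout masks and expand them to a binary weight mask.

    W has shape (n_in, n_out); a weight survives only if both its source and
    destination nodes survive. The dropout rates here are illustrative.
    """
    keep_in = rng.binomial(1, 1.0 - p_drop_in, size=W.shape[0])    # incoming nodes
    keep_out = rng.binomial(1, 1.0 - p_drop_out, size=W.shape[1])  # outgoing nodes
    M = np.outer(keep_in, keep_out)   # binary weight mask covering the layer
    return W * M, M                   # thinned weights and the mask itself

# Usage with made-up sizes: a 784-to-1200 layer, 20% input and 50% hidden dropout.
W = rng.standard_normal((784, 1200)) * 0.01
W_thinned, M = thin_layer(W, p_drop_in=0.2, p_drop_out=0.5)
```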

The remainder of this paper is structured as follows:

  • In section II we explain why using Dropout causes an incompatibility with Rprop, and propose a modification to solve the issue.

  • In section III we show experimental results using the MNIST dataset, first to highlight how Rprop is able to converge much more quickly during the initial epochs, and then use this to speed up the training of a Stacked Ensemble.

  • Finally in section IV, we look at how this work can be extended with further evaluation and development.

II Rprop and Dropout

In this section we explain the zero gradient problem, and propose a solution by adapting the Rprop algorithm to be aware of Dropout.

II-A The zero-gradient problem

In order to avoid double punishment when there is a change of sign in the gradient, Rprop artificially sets the gradient product associated with weight wij for the next iteration to 0. This condition is checked during the following iteration and, if true, no updates to the weights or the learning rate are performed.

Using the zero-valued gradient product as an indication to skip an iteration is acceptable in normal gradient descent, because the only other occurrence of this would be when learning has terminated. When Dropout is introduced, additional events can produce these zero values:

  • When neuron i is skipped, the dropout mask for all of its weights going to the layer above has a value of 0.

  • When a neuron in the layer above is skipped, the gradient propagated back to all of its incoming weights is also 0.

These additional zero-gradient events force additional skipped training iterations and missed learning rate adaptations that slow down the training unnecessarily.
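
A small numerical sketch of these events, under the assumption that Dropout is applied by element-wise multiplication of the gradients with the binary weight mask:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative gradients for one layer over two consecutive iterations.
prev_grad = rng.standard_normal((4, 3))
grad = rng.standard_normal((4, 3))
mask = rng.binomial(1, 0.5, size=(4, 3))   # Dropout zeroes roughly half the weights' gradients

product = prev_grad * (grad * mask)
spurious_skips = (product == 0) & (mask == 0)
# Plain Rprop reads every zero product as "skip this weight", so all of these
# weights miss both their update and their step-size adaptation this iteration.
print(spurious_skips.sum(), "of", mask.size, "weights would be skipped only because of Dropout")
```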

II-B Adaptations to Rprop

By making Rprop aware of the dropout mask M, we are able to distinguish whether a zero-gradient event occurs as a signal to skip the next weight update, or whether it occurs for a different reason and therefore the Δij and wij updates should be allowed. The new version of the Rprop update rule for each weight is shown in Algorithm 2. We use (t) to indicate the current training example, (t−1) the previous training example, (t+1) the next training example, and where a value with index (0) appears, it is intended to be the initial value. All other notation is the same as used in the original Rprop:

  • E is the error function (in this case the negative log-likelihood)

  • Δij(t) is the current update value for the weight at index (i, j)

  • Δwij(t) is the current weight update for the weight at index (i, j)

1: require: η+ > 1, 0 < η− < 1, Δmax, Δmin
2: pick Δij(0) > 0
3: sample the dropout mask m(t)
4: for all wij do
5:     if mij(t) = 0 then
6:         wij(t+1) ← wij(t)
7:         Δij(t) ← Δij(t−1)
8:     else
9:         if ∂E/∂wij(t−1) · ∂E/∂wij(t) > 0 then
10:              Δij(t) ← min(Δij(t−1) · η+, Δmax)
11:              Δwij(t) ← −sign(∂E/∂wij(t)) · Δij(t)
12:              wij(t+1) ← wij(t) + Δwij(t)
13:              ∂E/∂wij(t−1) ← ∂E/∂wij(t)
14:         else if ∂E/∂wij(t−1) · ∂E/∂wij(t) < 0 then
15:              Δij(t) ← max(Δij(t−1) · η−, Δmin)
16:              ∂E/∂wij(t−1) ← 0
17:         else
18:              if mij(t−1) = 0 then
19:                  Δwij(t) ← −sign(∂E/∂wij(t)) · Δij(t)
20:                  wij(t+1) ← wij(t) + Δwij(t)
21:              else
22:                  wij(t+1) ← wij(t)
23:                  Δij(t) ← Δij(t−1)
24:              end if
25:         end if
26:     end if
27: end for
Algorithm 2 Rprop adapted for Dropout

In particular, the conditions at line 5 and line 18 provide the necessary protection from the additional zero-gradients and correctly implement the recipe prescribed by Dropout, by completely skipping every weight for which mij(t) = 0 (which means that its neuron was dropped out and therefore the gradient will necessarily be 0). We expect that this methodology can be extended to other variants of Rprop, such as, but not limited to, iRprop+ [8] and JRprop [6].
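
The following NumPy sketch mirrors Algorithm 2, using the same element-wise layout as the earlier Rprop sketch; mask and prev_mask stand for the binary weight masks of the current and previous iterations, and all names and constants are again illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def dropout_rprop_step(w, grad, prev_grad, step, mask, prev_mask,
                       eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One Dropout-aware Rprop iteration in the spirit of Algorithm 2."""
    grad = grad.copy()
    active = mask.astype(bool)           # line 5: dropped-out weights are skipped entirely
    product = prev_grad * grad

    # Same sign on an active weight: grow the step and update the weight.
    same = active & (product > 0)
    step[same] = np.minimum(step[same] * eta_plus, step_max)
    w[same] -= np.sign(grad[same]) * step[same]

    # Sign change on an active weight: shrink the step, keep the double-punishment guard.
    flipped = active & (product < 0)
    step[flipped] = np.maximum(step[flipped] * eta_minus, step_min)
    grad[flipped] = 0.0

    # Zero product on an active weight: treat it as a genuine skip signal only if the
    # previous gradient was really computed; if the zero was caused by Dropout
    # (prev_mask == 0), the weight update is still performed (line 18).
    recovered = active & (product == 0) & (prev_mask == 0)
    w[recovered] -= np.sign(grad[recovered]) * step[recovered]

    return w, step, grad
```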

III Evaluating on MNIST

In this section we describe an initial evaluation of performance on the MNIST dataset. For all experiments we use a Deep Neural Network (DNN) with five middle layers, with Dropout applied to the middle layers and no Dropout on the inputs; the chosen dropout rate has been shown to be an optimal choice for the MNIST dataset in [9]. A similar architecture has been used to produce state-of-the-art results [3]; however, the authors used the entire training set for validation, and graphical transformations of that set for training. These added transformations lead to a “virtually infinite” training set, whereby at every epoch a new training set is generated, together with a much larger validation set made of the original images. The test set remains the original image test set. An explanation of these transformations is provided in [10], which also confirms that:

“The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data”

We therefore attribute these large improvements to the transformations applied, and have not made it a primary goal to replicate them in order to reach state-of-the-art results; instead, we focused on the untransformed dataset, split into separate training, validation and test sets. Subsequently, we performed a search, using the validation error as an indicator, to find the optimal hyperparameters of the modified version of Rprop, namely η+, η−, Δmin, Δmax and the initial update value Δij(0); a sketch of such a search is given below. We trained all models up to a fixed maximum number of epochs and measured the error on the validation set at every epoch, so that it could be used to select the model applied to the test set. We also measured the time it took to reach the best validation error and report its approximate magnitude, to be used as a comparison of orders of magnitude. The results presented are averages over repeated runs.
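
A minimal sketch of such a validation-driven search follows, assuming a hypothetical helper train_and_validate that trains a DNN with the modified Rprop under the given hyperparameters and returns its best validation error; the grid values are placeholders, not the ones used in the paper.

```python
import itertools

def search_rprop_hyperparameters(train_and_validate):
    """Grid search over the Rprop hyperparameters, scored by validation error."""
    grid = {
        "eta_plus":   [1.1, 1.2, 1.3],   # placeholder candidate values
        "eta_minus":  [0.25, 0.5],
        "delta_init": [1e-3, 1e-2],
        "delta_max":  [1.0, 50.0],
    }
    best_err, best_params = float("inf"), None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        val_err = train_and_validate(**params)   # hypothetical training helper
        if val_err < best_err:
            best_err, best_params = val_err, params
    return best_err, best_params
```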

III-A Compared to SGD

From the results in Table I we see that the modified version of Rprop starts up much more quickly and reaches an error value close to the minimum in far fewer epochs. SGD reaches a higher error value, and only after a much longer time. Although the overall error improvement is significant, the speed gain from using Rprop is the more appealing property, because it saves a large number of iterations that could be used for improving the model in other ways. The modified Rprop obtains its best validation error after only 35 epochs, whilst SGD reached its minimum after 1763. An illustration of the initial epochs of training can be seen in Figure 1.

Fig. 1: Validation Error - SGD vs Mod. Rprop
Method      Min Val Err   Epochs   Time      Test Err   Val Err @ Epoch 1
SGD         2.85%         1763     320 min   3.50%      88.65%
Rprop       3.03%         105      25 min    3.53%      12.81%
Mod Rprop   2.57%         35       10 min    3.49%      13.54%
TABLE I: Simulation results

III-B Compared to unmodified Rprop

We can see from Figure 2 that the modified version of Rprop has a faster start-up than the unmodified version and stays consistently below it until it reaches its minimum. Moreover, the unmodified version does not reach the same final error as the modified version, starts overtraining much sooner, and does not reach a better error than SGD. Table I shows in more detail how the performance of the two methods compares.

Fig. 2: Validation Error - Unmod. vs Mod. Rprop

III-C Using Modified Rprop to speed up training of Deep Learning Ensembles

The increase in speed of convergence can make it practical to produce Ensembles of Deep Neural Networks, as the time to train each member DNN is considerably reduced without undertraining the network. We have been able to train these Ensembles in less than 12 hours in total on a single-GPU, single-CPU desktop system (an Nvidia GTX-770 graphics card and a Core i5 processor, programmed with Theano in Python). We trained different Ensemble types, and report the final results in Table II. The methods used are Bagging [11] and Stacking [12], with two different numbers of member DNNs; each member was trained for a limited maximum number of epochs.

  • Bagging is an ensemble method by which several different training sets are created by random resampling of the original training set, and each of these is used to train a new classifier. The entire set of trained classifiers is usually then aggregated by taking an average or a majority vote to reach a single classification decision.

  • Stacking is an ensemble method by which the different classifiers are aggregated using an additional learning algorithm that takes the outputs of these first-space classifiers as its inputs and learns how to reach a better classification result. This additional learning algorithm is called a second-space classifier. A sketch of both aggregation schemes is given after this list.
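
A minimal sketch of the two aggregation schemes, assuming each member DNN outputs class probabilities; the helper names and array shapes are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def bagging_sets(X, y, n_members):
    """Create one bootstrap resample of the training set per ensemble member."""
    n = X.shape[0]
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)    # sample with replacement
        yield X[idx], y[idx]

def bagging_predict(member_probs):
    """Aggregate member outputs by averaging their class probabilities."""
    return np.mean(member_probs, axis=0).argmax(axis=1)

def stacking_features(member_probs):
    """Concatenate member outputs into the input of the second-space classifier."""
    # member_probs has shape (n_members, n_examples, n_classes)
    return np.concatenate(member_probs, axis=1)

# Usage with made-up shapes: 3 members, 100 examples, 10 classes.
member_probs = rng.random((3, 100, 10))
labels = bagging_predict(member_probs)
meta_inputs = stacking_features(member_probs)   # fed to the stacked DNN trained with modified Rprop
```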

In the case of Stacking, the final second-space classifier was another DNN with two middle layers, whose sizes depend on N, the number of DNNs in the Ensemble, trained for a maximum number of epochs with the modified Rprop. We used the same original training, validation and test sets, and collected the average over repeated runs. The results are still not comparable to those presented in [3], which is consistent with the observations about the importance of the dataset transformations; however, we note that we are able to improve the error in less time than it took to train a single network with SGD. A Wilcoxon signed-ranks test shows that the increase in performance obtained from using the larger ensembles compared to the smaller ones is statistically significant.
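
For reference, a minimal sketch of the significance test mentioned above, assuming the per-run test errors of the two ensemble sizes are paired by run and using SciPy's wilcoxon; the error values are placeholders.

```python
from scipy.stats import wilcoxon

small_ensemble_errors = [0.021, 0.020, 0.022, 0.021, 0.020]   # illustrative values only
large_ensemble_errors = [0.019, 0.018, 0.019, 0.020, 0.018]

stat, p_value = wilcoxon(small_ensemble_errors, large_ensemble_errors)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p_value:.4f}")
```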

Method     Size   Test Err   Time
Bagging    –      –          35 min
Bagging    –      –          128 min
Stacking   –      –          39 min
Stacking   –      –          145 min
TABLE II: Ensemble performance

IV Conclusions and Future Work

We have highlighted that many training methods that have been used in shallow learning may be adapted for use in Deep Learning. We have looked at Rprop and at how the appearance of zero gradients during training, as a side effect of Dropout, poses a challenge to learning, and we have proposed a solution which allows Rprop to train DNNs to a better error while still being much faster than standard SGD backpropagation.

We then showed that this increase in training speed can be used to effectively train an Ensemble of DNNs on a commodity desktop system, and to reap the added benefits of Ensemble methods in less time than it would take to train a single Deep Neural Network with SGD.

It remains to be assessed in further work whether this improved methodology would lead to a new state-of-the-art error when applying the pre-training and dataset enhancements that have been used in other methods, and how the improvements to Rprop can be ported to its numerous variants.

Acknowledgement

The authors would like to thank the School of Business, Economics and Informatics, Birkbeck College, University of London, for the grant received to support this research.

References

  • [1] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
  • [2] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2012, pp. 3642–3649.
  • [3] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
  • [4] Y. Lecun and C. Cortes, “The MNIST database of handwritten digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/
  • [5] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The Rprop algorithm,” in Proceedings of the IEEE International Conference on Neural Networks. IEEE, 1993, pp. 586–591.
  • [6] A. D. Anastasiadis, G. D. Magoulas, and M. N. Vrahatis, “New globally convergent training scheme based on the resilient propagation algorithm,” Neurocomputing, vol. 64, pp. 253–270, 2005.
  • [7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
  • [8] C. Igel and M. Hüsken, “Improving the Rprop learning algorithm,” in Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000), vol. 2000. Citeseer, 2000, pp. 115–121.
  • [9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [10] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” 2003. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=68920
  • [11] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
  • [12] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp. 241–259, 1992.