1 Introduction
In recent years, neural networks have become empirically successful in a wide range of supervised learning applications, such as computer vision [Krizhevsky, Sutskever, and Hinton2012, Szegedy et al.2015], speech recognition [Hinton et al.2012][Sutskever, Vinyals, and Le2014] and computational paralinguistics [Keren and Schuller2016, Keren et al.2016]. Standard implementations of training feedforward neural networks for classification are based on gradient-based stochastic optimization, usually optimizing the empirical cross-entropy loss [Hinton1989].
However, the cross-entropy is only a surrogate for the true objective of supervised network training, which is in most cases to reduce the probability of a prediction error (or, in some cases, to optimize the BLEU score, word error rate, etc.). When optimizing using the cross-entropy loss, as we show below, the effect of training examples on the gradient is linear in the prediction bias, which is the difference between the network-predicted class probabilities and the target class probabilities. In particular, a wrong confident prediction induces a larger gradient than a similarly wrong, but less confident, prediction.
In contrast, humans sometimes employ a different approach to learning: when learning new concepts, they might ignore the examples they feel they do not understand, and focus more on the examples that are more useful to them. When improving proficiency regarding a familiar concept, they might focus on the harder examples, as these can contain more relevant information for the advanced learner. We make a first step towards incorporating this ability into neural network models, by proposing a learning algorithm with a tunable sensitivity to easy and hard training examples. Intuitions about human cognition have often inspired successful machine learning approaches [Bengio et al.2009, Cho, Courville, and Bengio2015, Lake et al.2016]. In this work we show that this can be the case also for tunable sensitivity.
Intuitively, the depth of the model should be positively correlated with the optimal sensitivity to hard examples. When the network is relatively shallow, its modeling capacity is limited. In this case, it might be better to reduce sensitivity to hard examples, since it is likely that these examples cannot be modeled correctly by the network, and so adjusting the model according to these examples might only degrade overall prediction accuracy. On the other hand, when the network is relatively deep, it has a high modeling capacity. In this case, it might be beneficial to allow more sensitivity to hard examples, thereby possibly improving the accuracy of the final learned model.
Our learning algorithm works by generalizing the cross-entropy gradient, where the new function can be used instead of the gradient in any gradient-based optimization method for neural networks. Many such training methods have been proposed, including, to name a few, Momentum [Polyak1964], RMSProp [Tieleman and Hinton2012], and Adam [Kingma and Ba2015]. The proposed generalization is parameterized by a value $k > 0$ that controls the sensitivity of the training process to hard examples, replacing the fixed dependence of the cross-entropy gradient. When $k = 1$ the proposed update rule is exactly the cross-entropy gradient. Smaller values of $k$ decrease the sensitivity during training to hard examples, and larger values of $k$ increase it.
We report experiments on several benchmark datasets. These experiments show, matching our expectations, that in almost all cases prediction error is improved using large values of $k$ for deep networks, small values of $k$ for shallow networks, and values close to the default $k = 1$ for networks of medium depth. They further show that using a tunable sensitivity parameter generally improves the results of learning.
The paper is structured as follows: In Section 1.1 related work is discussed. Section 2 presents our setting and notation. A framework for generalizing the loss gradient is developed in Section 3. Section 4 presents desired properties of the generalization, and our specific choice is given in Section 5. Experiment results are presented in Section 6, and we conclude in Section 7. Some of the analysis, and additional experimental results, are deferred to the supplementary material due to lack of space.
1.1 Related Work
The challenge of choosing the best optimization objective for neural network training is not a new one. In the past, the quadratic loss was typically used with gradient-based learning in neural networks [Rumelhart, Hinton, and Williams1988], but a line of studies demonstrated both theoretically and empirically that the cross-entropy loss has preferable properties over the quadratic loss, such as better learning speed [Levin and Fleisher1988], better performance [Golik, Doetsch, and Ney2013] and a more suitable shape of the error surface [Glorot and Bengio2010]. Other cost functions have also been considered. For instance, a novel cost function was proposed in [Silva et al.2006], but it is not clearly advantageous over cross-entropy. The authors of [Bahdanau et al.2015] address this question in a different setting of sequence prediction.
Our method allows controlling the sensitivity of the training process to examples with a large prediction bias. When this sensitivity is low, the method can be seen as a form of implicit outlier detection or noise reduction. Several previous works attempt to explicitly remove outliers or noise in neural network training. In one work [Smith and Martinez2011], data is preprocessed to detect label noise induced from overlapping classes, and in another work [Jeatrakul, Wong, and Fung2010] the authors use an auxiliary neural network to detect noisy examples. In contrast, our approach requires a minimal modification to gradient-based training algorithms for neural networks and allows emphasizing examples with a large prediction bias, instead of treating these as noise.
The interplay between “easy” and “hard” examples during neural network training has been addressed in the framework of Curriculum Learning [Bengio et al.2009]. In this framework it is suggested that training could be more successful if the network is first presented with easy examples, and harder examples are gradually added to the training process. In another work [Kumar, Packer, and Koller2010], the authors define easy and hard examples based on the fit to the current model parameters. They propose a curriculum learning algorithm in which a tunable parameter controls the proportions of easy and hard examples presented to a learner at each phase. Our method is simpler than curriculum learning approaches, in that the examples can be presented in random order to the network. In addition, our method also allows a heightened sensitivity to harder examples. In a more recent work [Zaremba and Sutskever2014], the authors indeed find that a curriculum in which harder examples are presented in early phases outperforms a curriculum that at first uses only easy examples.
2 Setting and Notation
We consider a standard feedforward multilayer neural network [Svozil, Kvasnicka, and Pospichal1997], where the output layer is a softmax layer [Bridle1990], with $n$ units, each representing a class. Let $\Theta$ denote the neural network parameters, and let $z_j(x;\Theta)$ denote the value of output unit $j$ when the network has parameters $\Theta$, before applying the softmax function. Applying the softmax function, the probability assigned by the network to class $j$ is $p_j(x;\Theta) := e^{z_j(x;\Theta)} / \sum_{i=1}^{n} e^{z_i(x;\Theta)}$. The label predicted by the network for example $x$ is $\mathrm{argmax}_{j}\, p_j(x;\Theta)$. We consider the task of supervised learning of $\Theta$, using a labeled training sample $S = \{(x_i, y_i)\}_{i=1}^{m}$, where $y_i \in \{1,\dots,n\}$, by optimizing the loss function $L(\Theta) := \sum_{i=1}^{m} \ell((x_i,y_i);\Theta)$. A popular choice for $\ell$ is the cross-entropy cost function, defined by $\ell((x,y);\Theta) := -\log p_y(x;\Theta)$.
3 Generalizing the Gradient
Our proposed method allows controlling the sensitivity of the training procedure to examples on which the network has large errors in prediction, by means of generalizing the gradient. A naïve alternative towards the same goal would be using an exponential version of the cross-entropy loss: $\ell = |\log(p_y)|^k$, where $p_y$ is the probability assigned to the correct class and $k$ is a hyperparameter controlling the sensitivity level. However, the derivative of this function with respect to $p_y$ is an undesired term since it is not monotone in $k$ for a fixed $p_y$, resulting in lack of relevant meaning for small or large values of $k$. The gradient resulting from the above form is of a desired form only for $k = 1$, due to cancellation of terms from the derivatives of $\ell$ and the softmax function. Another naïve option would be to consider $\ell = -k \cdot \log(p_y)$, but this is only a scaled version of the cross-entropy loss and amounts to a change in the learning rate.
In general, controlling the loss function alone is not sufficient for controlling the relative importance to the training procedure of examples on which the network has large and small errors in prediction. Indeed, when computing the gradients, the derivative of the loss function is multiplied by the derivative of the softmax function, and the latter is a term that also contains the probabilities assigned by the model to the different classes. In contrast, controlling the parameter updates themselves, as we describe below, is a more direct way of achieving the desired effect.
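Let $(x, y)$ be a single labeled example in the training set, and consider the partial derivative of $\ell((x,y);\Theta)$ with respect to some parameter $\theta$ in $\Theta$. We have
$$\frac{\partial \ell((x,y);\Theta)}{\partial \theta} = \sum_{j=1}^{n} \frac{\partial \ell}{\partial z_j}\,\frac{\partial z_j}{\partial \theta},$$
where $z_j$ is the input to the softmax layer when the input example is $x$, and the network parameters are $\Theta$.
If $\ell$ is the cross-entropy loss, we have $\frac{\partial \ell}{\partial p_y} = -\frac{1}{p_y}$ and
$$\frac{\partial p_y}{\partial z_j} = \begin{cases} p_y(1 - p_y) & j = y, \\ -p_y\, p_j & j \neq y. \end{cases}$$
Hence
$$\frac{\partial \ell}{\partial z_j} = \begin{cases} p_j - 1 & j = y, \\ p_j & j \neq y. \end{cases}$$
For given $x, y, \Theta$, define the prediction bias of the network for example $x$ on class $j$, denoted by $\epsilon_j$, as the (signed) difference between the probability assigned by the network to class $j$ and the probability that should have been assigned, based on the true label of this example. We get $\epsilon_j = p_j - 1$ for $j = y$, and $\epsilon_j = p_j$ otherwise. Thus, for the cross-entropy loss,
$$\frac{\partial \ell}{\partial \theta} = \sum_{j=1}^{n} \frac{\partial z_j}{\partial \theta}\,\epsilon_j. \tag{1}$$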
In other words, when using the cross entropy loss, the effect of any single training example on the gradient is linear in the prediction bias of the current network on this example.
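The linearity in the prediction bias can be checked numerically. The following is a minimal sketch, assuming the standard softmax and cross-entropy definitions; the function names and the example logits are illustrative and not taken from the paper. It compares the prediction bias (the softmax output minus the one-hot target) with a finite-difference estimate of the derivative of the cross-entropy with respect to each softmax input.

```python
import math

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    # Loss of a single example with softmax inputs z and true class y.
    return -math.log(softmax(z)[y])

def prediction_bias(z, y):
    # eps_j = p_j - 1 for the true class, p_j otherwise.
    p = softmax(z)
    return [p[j] - (1.0 if j == y else 0.0) for j in range(len(z))]

def numeric_grad(z, y, h=1e-6):
    # Central finite differences of the loss with respect to each z_j.
    grads = []
    for j in range(len(z)):
        zp = list(z); zp[j] += h
        zm = list(z); zm[j] -= h
        grads.append((cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h))
    return grads

z_example = [0.2, -0.5, 1.0]   # hypothetical softmax inputs
y_example = 0                  # hypothetical true class
eps = prediction_bias(z_example, y_example)
num = numeric_grad(z_example, y_example)
max_gap = max(abs(a - b) for a, b in zip(eps, num))
```

Here `max_gap` should be at the level of finite-difference noise, which is what Eq. (1) predicts for any parameter that enters the loss only through the softmax inputs.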
As discussed in Section 1, it is likely that in many cases, the results of training could be improved if the effect of a single example on the gradient is not linear in the prediction bias. Therefore, we propose a generalization of the gradient that allows nonlinear dependence in $\epsilon$.
For given $x, y, \Theta$ and for a function $f : [-1,1]^n \to \mathbb{R}^n$, let $\epsilon = (\epsilon_1, \dots, \epsilon_n)$, and consider the following generalization of $\frac{\partial \ell}{\partial \theta}$:
$$g(\theta) := \sum_{j=1}^{n} \frac{\partial z_j}{\partial \theta}\, f_j(\epsilon). \tag{2}$$
Here $f_j$ is the $j$'th component of $f$. When $f$ is the identity, we have $f_j(\epsilon) \equiv \epsilon_j$, and $g(\theta) = \frac{\partial \ell}{\partial \theta}$. However, we are now at liberty to study other assignments for $f$.
We call the vector of values of $g(\theta)$ for $\theta$ in $\Theta$ a pseudo-gradient, and propose to use $g$ in place of the gradient within any gradient-based algorithm. In this way, optimization of the cross-entropy loss is replaced by a different algorithm of a similar form. However, as we show in Section 5.2, $g$ is not necessarily the gradient of any loss function.
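For the parameters of the softmax layer itself, the pseudo-gradient has a particularly simple form. The sketch below assumes a softmax layer with inputs $z_j = \sum_i W_{j,i} h_i + b_j$, so that $\partial z_j / \partial W_{j,i} = h_i$ and $\partial z_j / \partial b_j = 1$. The names `pseudo_gradient_softmax_layer` and `identity_f` are hypothetical, and `f` is any function mapping the prediction-bias vector to a vector of the same length.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def identity_f(eps, y):
    # With the identity, the pseudo-gradient is the cross-entropy gradient.
    return list(eps)

def pseudo_gradient_softmax_layer(h, z, y, f):
    """Pseudo-gradient for the weights W[j][i] and biases b[j] of a softmax layer.

    h: outputs of the last hidden layer, z: softmax inputs, y: true class,
    f: function (eps, y) -> list replacing the identity in the gradient.
    """
    p = softmax(z)
    eps = [p[j] - (1.0 if j == y else 0.0) for j in range(len(z))]
    fe = f(eps, y)
    # Only z_j depends on W[j][i], with derivative h[i].
    g_W = [[fe[j] * h_i for h_i in h] for j in range(len(z))]
    # Only z_j depends on b[j], with derivative 1.
    g_b = list(fe)
    return g_W, g_b

g_W, g_b = pseudo_gradient_softmax_layer([1.0, 2.0], [0.0, 0.0], 1, identity_f)
```

Parameters in earlier layers would use the same `f(eps, y)` values in place of the prediction biases at the start of backpropagation; that part is omitted here.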
4 Properties of $f$
Consider what types of functions are reasonable to use for $f$ instead of the identity. First, we expect $f$ to be monotonic non-decreasing, so that a larger prediction bias never results in a smaller update. This is a reasonable requirement if we cannot identify outliers, that is, training examples that have a wrong label. We further expect $f_j$ to be positive when $j \neq y$ and negative otherwise.
In addition to these natural properties, we introduce an additional property that we wish to enforce. To motivate this property, we consider the following simple example. Assume a network with one hidden layer and a softmax layer (see Figure 1), where the inputs to the softmax layer are $z_j = \sum_i w_{j,i} h_i + b_j$ and the outputs of the hidden layer are $h_i = \langle w'_i, x \rangle + b'_i$, where $x$ is the input vector, and $b'_i, w'_i$ are the scalar bias and weight vector between the input layer and the hidden layer. Suppose that at some point during training, hidden unit $i$ is connected to all units in the softmax layer with the same positive weight $a$. In other words, for all $j$, $w_{j,i} = a$. Now, suppose that the training process encounters a training example $(x, y)$, and let $t$ be some input coordinate.
What is the change to the weight $w'_i(t)$ that this training example should cause? Clearly it need not change if $x(t) = 0$, so we consider the case $x(t) \neq 0$. Only the value $h_i$ is directly affected by changing $w'_i(t)$. From the definition of $p_j$, the predicted probabilities are fully determined by the ratios $e^{z_j} / e^{z_{j'}}$, or equivalently, by the differences $z_j - z_{j'}$, for all $j, j'$. Now, $\partial z_j / \partial h_i = w_{j,i} = a$. Therefore, $\partial (z_j - z_{j'}) / \partial h_i = a - a = 0$, and therefore
$$\frac{\partial p_j}{\partial w'_i(t)} = 0 \quad \text{for all } j.$$
We conclude that in the case of equal weights from unit $i$ to all output units, there is no reason to change the weight $w'_i(t)$ for any $t$. Moreover, preliminary experiments show that in these cases it is desirable to keep the weight stationary, as otherwise it can cause numerical instability due to explosion or decay of weights.
We would therefore like to guarantee this behavior also for our pseudo-gradients, and so we require $g(w'_i(t)) = 0$ in this case. It follows that
$$0 = g(w'_i(t)) = \sum_{j=1}^{n} \frac{\partial z_j}{\partial h_i}\,\frac{\partial h_i}{\partial w'_i(t)}\, f_j(\epsilon) = a\, x(t) \sum_{j=1}^{n} f_j(\epsilon).$$
Dividing by $a\, x(t)$, we get the following desired property for the function $f$, for any vector $\epsilon$ of prediction biases:
$$\sum_{j=1}^{n} f_j(\epsilon) = 0. \tag{3}$$
Note that this indeed holds for the cross-entropy loss, since $\sum_{j=1}^{n} \epsilon_j = \sum_{j=1}^{n} p_j - 1 = 0$, and in the case of cross-entropy, $f$ is the identity.
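The invariance that motivates this property can be illustrated with a small check. In the sketch below, a hidden unit connected to every softmax unit with the same weight changes its output by some amount, which shifts all softmax inputs by the same constant and leaves the predicted probabilities unchanged. The names `a` and `delta` and the numeric values are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.3, -1.2, 2.0]   # hypothetical softmax inputs
a = 0.7                # common weight from the hidden unit to all softmax units
delta = 1.5            # hypothetical change in the hidden unit's output

# Each softmax input moves by the same amount a * delta.
z_after = [z_j + a * delta for z_j in z]

p_before = softmax(z)
p_after = softmax(z_after)
max_change = max(abs(u - v) for u, v in zip(p_before, p_after))
```

Since `max_change` is zero up to rounding, no update of the incoming weights of that hidden unit can change the prediction, which is the behavior that the requirement on the components of the update function is meant to preserve.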
5 Our Choice of $f$
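In the case of the cross-entropy, $f$ is the identity, leading to a linear dependence on $\epsilon$. A natural generalization is to consider higher order polynomials. Combining this approach with the requirement in Eq. (3), we get the following assignment for $f$, where $k > 0$ is a parameter.
$$f_j(\epsilon) = \begin{cases} -|\epsilon_y|^k & j = y, \\ \dfrac{|\epsilon_y|^k}{\sum_{i \neq y} \epsilon_i^k}\, \epsilon_j^k & \text{otherwise.} \end{cases} \tag{4}$$
The expression $\frac{|\epsilon_y|^k}{\sum_{i \neq y} \epsilon_i^k}$ is a normalization term which makes sure Eq. (3) is satisfied. Setting $k = 1$, we get that $g(\theta)$ is the gradient of the cross-entropy loss. Other values of $k$ result in different pseudo-gradients.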
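A minimal sketch of this assignment follows. It assumes a reading of Eq. (4) in which the component for the true class is $-|\epsilon_y|^k$ and the remaining components are the $k$'th powers of the other prediction biases, rescaled so that all components sum to zero. The function name `f_tunable` and the zero-denominator guard are additions for illustration only.

```python
def f_tunable(eps, y, k):
    """Tunable-sensitivity replacement for the prediction bias vector.

    eps: prediction biases (eps[y] <= 0, the others >= 0), y: true class,
    k: positive sensitivity parameter; k = 1 recovers eps itself.
    """
    scale = abs(eps[y]) ** k
    denom = sum(eps[i] ** k for i in range(len(eps)) if i != y)
    if denom == 0.0:
        # Assumption: a perfectly predicted example produces no update.
        return [0.0] * len(eps)
    return [-scale if j == y else scale * eps[j] ** k / denom
            for j in range(len(eps))]

eps_example = [-0.7, 0.2, 0.5]            # hypothetical biases, true class 0
f_k1 = f_tunable(eps_example, 0, 1.0)     # should reproduce eps_example
f_k2 = f_tunable(eps_example, 0, 2.0)     # emphasizes the larger bias
```

With `k = 2`, the ratio between the two wrong-class components becomes the square of the ratio of their biases, while the components still sum to zero.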
To illustrate the relationship between the value of $k$ and the effect of prediction biases of different sizes on the pseudo-gradient, we plot $|f_y(\epsilon)|$ as a function of $|\epsilon_y|$ for several values of $k$ (see Figure 2). Note that absolute values of the pseudo-gradient are of little importance, since in gradient-based algorithms, the gradient (or in our case, the pseudo-gradient) is usually multiplied by a scalar learning rate which can be tuned.
As the figure shows, when $k$ is large, the pseudo-gradient is more strongly affected by large prediction biases, compared to small ones. This follows since $|\epsilon_y|^k / |\epsilon_y|$ is monotonic increasing in $|\epsilon_y|$ for $k > 1$. On the other hand, when using a small positive $k$ we get that $|\epsilon_y|^k$ tends to $1$; therefore, the pseudo-gradient in this case would be much less sensitive to examples with large prediction biases. Thus, the choice of $f$, parameterized by $k$, allows tuning the sensitivity of the training process to large errors. We note that there could be other reasonable choices for $f$ which have similar desirable properties. We leave the investigation of such other choices to future work.
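The effect of the parameter can be made concrete with a short calculation. The sketch below assumes that the size of the update for the true class scales as the prediction bias raised to the power of the sensitivity parameter, and compares an example with a large bias to one with a small bias. The function name `relative_weight` and the bias values 0.9 and 0.1 are illustrative.

```python
def relative_weight(large_bias, small_bias, k):
    # Ratio of update magnitudes for two examples, for sensitivity parameter k.
    return (abs(large_bias) ** k) / (abs(small_bias) ** k)

ratios = {k: relative_weight(0.9, 0.1, k) for k in (0.25, 1.0, 4.0)}
for k, r in sorted(ratios.items()):
    print(f"k = {k}: large-bias example weighs {r:.3f} times the small-bias one")
```

For the default value the ratio equals the ratio of the biases, a larger value inflates it by several orders of magnitude, and a small value brings it close to one, so that all examples contribute almost equally.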

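Table 1: Results for networks with one hidden layer.
| Dataset | Layer Size | Momentum | Selected $k$ | Test Error ($k = 1$) | Test Error (selected $k$) | Test Cross-Entropy Loss ($k = 1$) | Test Cross-Entropy Loss (selected $k$) |
|---|---|---|---|---|---|---|---|
| MNIST | 400 | 0.5 | 0.5 | 1.76% | 1.74% | 0.078 | 0.167 |
| MNIST | 800 | 0.5 | 0.5 | 1.67% | 1.65% | 0.072 | 0.150 |
| MNIST | 1100 | 0.5 | 0.5 | 1.67% | 1.65% | 0.071 | 0.145 |
| SVHN | 400 | 0.5 | 0.25 | 16.88% | 16.16% | 0.661 | 1.576 |
| SVHN | 800 | 0.5 | 0.125 | 16.09% | 15.64% | 0.648 | 3.108 |
| SVHN | 1100 | 0.5 | 0.25 | 16.04% | 15.53% | 0.626 | 1.525 |
| CIFAR10 | 400 | 0.5 | 0.25 | 48.32% | 47.06% | 1.430 | 3.034 |
| CIFAR10 | 800 | 0.5 | 0.125 | 46.91% | 46.01% | 1.388 | 5.645 |
| CIFAR10 | 1100 | 0.5 | 0.25 | 46.43% | 45.84% | 1.410 | 2.820 |
| CIFAR100 | 400 | 0.5 | 0.25 | 75.18% | 74.41% | 3.302 | 6.931 |
| CIFAR100 | 800 | 0.5 | 0.25 | 74.04% | 73.78% | 3.260 | 7.449 |
| CIFAR100 | 1100 | 0.5 | 0.125 | 73.69% | 73.11% | 3.239 | 13.557 |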
5.1 A Toy Example
To further motivate our choice of $f$, we describe a very simple example of a distribution and a neural network. Consider a neural network with no hidden layers, and only one input unit connected to two softmax units. Denoting the input by $x$, the input to softmax unit $i$ is $z_i = w_i x + b_i$, where $w_i$ and $b_i$ are the network weights and biases respectively.
It is not hard to see that the set of possible prediction functions that can be represented by this network is exactly the set of threshold functions of the form or .
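To see why this network can only represent threshold functions, note that its prediction changes exactly where the two softmax inputs are equal. The sketch below assumes that each softmax input is an affine function of the scalar input, with one weight and one bias per unit; the function names and parameter values are illustrative.

```python
def decision_threshold(w, b):
    # The two softmax inputs w[i] * x + b[i] are equal at a single point,
    # unless both weights coincide, in which case the prediction is constant.
    dw = w[0] - w[1]
    if dw == 0:
        return None
    return (b[1] - b[0]) / dw

def predict(x, w, b):
    # Index of the softmax unit with the larger input (ties go to unit 0).
    z = [w[i] * x + b[i] for i in range(2)]
    return 0 if z[0] >= z[1] else 1

w_example = [1.0, -1.0]   # hypothetical weights
b_example = [0.0, 0.2]    # hypothetical biases
t = decision_threshold(w_example, b_example)
```

Inputs on one side of `t` are assigned to one unit and inputs on the other side to the other unit; which side belongs to which unit depends on the sign of the weight difference.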
For convenience assume the labels mapped to the two softmax units are named . Let , and suppose that labeled examples are drawn independently at random from the following distribution over : Examples are uniform in ; Labels of examples in are deterministically , and they are for all other examples. For this distribution, the prediction function with the smallest prediction error that can be represented by the network is .
However, optimizing the cross-entropy loss on the distribution, or in the limit of a large training sample, would result in a different threshold, leading to a larger prediction error (for a detailed analysis see Appendix A in the supplementary material). Intuitively, this can be traced to the fact that certain examples cannot be classified correctly by this network when the threshold is close to $0$, but they still affect the optimal threshold for the cross-entropy loss.
Thus, for this simple case, there is motivation to move away from optimizing the cross-entropy, to a different update rule that is less sensitive to large errors. This reduced sensitivity is achieved by our update rule with $k < 1$. On the other hand, larger values of $k$ would result in higher sensitivity to large errors, thereby degrading the classification accuracy even more.
We thus expect that when training the network using our new update rule, the prediction error of the resulting network should be monotonically increasing with $k$, hence values of $k$ which are smaller than $1$ would give a smaller error. We tested this hypothesis by training this simple network on a synthetic dataset generated according to the distribution described above.
We generated 30,000 examples for each of the training, validation and test datasets. The biases were initialized to 0 and the weights were initialized from a uniform distribution on
. We used batch gradient descent with a learning rate of for optimization of the four parameters, where the gradient is replaced with the pseudogradient from Eq. (2), using the function defined in Eq. (4). is parameterized by , and we performed this experiment using values of between and. After each epoch, we computed the prediction error on the validation set, and training was stopped after 3000 epochs in which this error was not changed by more than
. The values of the parameters at the end of training were used to compute the misclassification rate on the test set.
Table 2 reports the results for these experiments, averaged over runs for each value of $k$. The results confirm our hypothesis regarding the behavior of the network for the different values of $k$, and further motivate the possible benefits of using $k \neq 1$. Note that while the prediction error is monotonic in $k$ in this experiment, the cross-entropy is not, again demonstrating the fact that optimizing the cross-entropy is not optimal in this case.
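Table 2: Results for the toy example.
| $k$ | Test error | Threshold | CE Loss |
|---|---|---|---|
| 4 | 8.36% | 0.116 | 0.489 |
| 2 | 6.73% | 0.085 | 0.361 |
| 1 | 4.90% | 0.049 | 0.288 |
| 0.5 | 4.27% | 0.037 | 0.299 |
| 0.25 | 4.04% | 0.030 | 0.405 |
| 0.125 | 3.94% | 0.028 | 0.625 |
| 0.0625 | 3.61% | 0.022 | 1.190 |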

5.2 Nonexistence of a Cost Function for $f$
It is natural to ask whether, with our choice of $f$ in Eq. (4), $g(\theta)$ is the gradient of another cost function, instead of the cross-entropy. The following lemma demonstrates that this is not the case.
Lemma 1.
Assume $f$ as in Eq. (4) with $k \neq 1$, and $g(\theta)$ the resulting pseudo-gradient. There exists a neural network for which $g(\theta)$ is not a gradient of any cost function.
The proof of this lemma is left for the supplemental material. Note that the above lemma does not exclude the possibility that a gradient-based algorithm that uses $g$ instead of the gradient still somehow optimizes some cost function.
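The lemma can be illustrated numerically; the following is only a sanity check under stated assumptions, not the proof from the supplementary material. For the biases of the softmax layer, the pseudo-gradient of each bias equals the corresponding component of the update function, so if the pseudo-gradient were the gradient of a twice-differentiable cost, the matrix of derivatives of these components with respect to the softmax inputs would have to be symmetric. The sketch assumes the reading of Eq. (4) with a negative power term for the true class and normalized power terms for the other classes; all function names and the example inputs are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def f_of_inputs(z, y, k):
    # Update function of Eq. (4), viewed as a function of the softmax inputs.
    p = softmax(z)
    eps = [p[j] - (1.0 if j == y else 0.0) for j in range(len(z))]
    scale = abs(eps[y]) ** k
    denom = sum(eps[i] ** k for i in range(len(z)) if i != y)
    return [-scale if j == y else scale * eps[j] ** k / denom
            for j in range(len(z))]

def jacobian(z, y, k, h=1e-6):
    # J[j][i] approximates d f_j / d z_i by central finite differences.
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        zp = list(z); zp[i] += h
        zm = list(z); zm[i] -= h
        fp = f_of_inputs(zp, y, k)
        fm = f_of_inputs(zm, y, k)
        for j in range(n):
            J[j][i] = (fp[j] - fm[j]) / (2 * h)
    return J

def asymmetry(z, y, k):
    # Largest violation of symmetry; zero for the gradient of a smooth cost.
    J = jacobian(z, y, k)
    n = len(z)
    return max(abs(J[i][j] - J[j][i]) for i in range(n) for j in range(n))

z_example = [0.2, -0.5, 1.0]   # three output units, hypothetical inputs
asym_k1 = asymmetry(z_example, 0, 1.0)
asym_k2 = asymmetry(z_example, 0, 2.0)
```

With the default parameter value the asymmetry vanishes up to finite-difference noise, as expected for the cross-entropy gradient, while for the other value it is clearly nonzero at this point.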
6 Experiments

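Table 3: Results for networks with three hidden layers.
| Dataset | Layer Sizes | Momentum | Selected $k$ | Test Error ($k = 1$) | Test Error (selected $k$) | Test CE Loss ($k = 1$) | Test CE Loss (selected $k$) |
|---|---|---|---|---|---|---|---|
| MNIST | 400 | 0.5 | 1 | — | — | — | — |
| MNIST | 800 | 0.5 | 1 | — | — | — | — |
| SVHN | 400 | 0.5 | 2 | 16.52% | 16.52% | 1.604 | 0.968 |
| SVHN | 800 | 0.5 | 1 | — | — | — | — |
| CIFAR10 | 400 | 0.5 | 2 | 46.81% | 46.63% | 3.023 | 2.121 |
| CIFAR10 | 800 | 0.5 | 1 | — | — | — | — |
| CIFAR100 | 400 | 0.5 | 0.5 | 75.20% | 74.95% | 3.378 | 4.511 |
| CIFAR100 | 800 | 0.5 | 1 | — | — | — | — |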

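Table 4: Results for networks with five hidden layers.
| Dataset | Layer Sizes | Momentum | Selected $k$ | Test Error ($k = 1$) | Test Error (selected $k$) | Test CE Loss ($k = 1$) | Test CE Loss (selected $k$) |
|---|---|---|---|---|---|---|---|
| MNIST | 400 | 0.5 | 0.5 | 1.71% | 1.69% | 0.113 | 0.224 |
| MNIST | 800 | 0.5 | 0.25 | 1.61% | 1.60% | 0.118 | 0.390 |
| SVHN | 400 | 0.5 | 4 | 17.41% | 16.49% | 1.436 | 0.708 |
| SVHN | 800 | 0.5 | 0.5 | 17.07% | 16.61% | 1.343 | 2.604 |
| CIFAR10 | 400 | 0.5 | 2 | 48.05% | 47.85% | 2.017 | 1.962 |
| CIFAR10 | 800 | 0.5 | 4 | 44.21% | 44.24% | 4.610 | 1.677 |
| CIFAR100 | 400 | 0.5 | 2 | 75.69% | 75.48% | 3.611 | 3.228 |
| CIFAR100 | 800 | 0.5 | 2 | 74.10% | 73.57% | 4.650 | 4.439 |
| MNIST | 400 | 0.9 | 1 | — | — | — | — |
| MNIST | 800 | 0.9 | 4 | 1.58% | 1.60% | 0.098 | 0.060 |
| SVHN | 400 | 0.9 | 4 | 17.89% | 16.54% | 1.284 | 0.718 |
| SVHN | 800 | 0.9 | 2 | 16.24% | 15.73% | 1.647 | 0.998 |
| CIFAR10 | 400 | 0.9 | 4 | 47.91% | 47.57% | 2.202 | 1.648 |
| CIFAR10 | 800 | 0.9 | 2 | 45.69% | 44.11% | 3.316 | 2.171 |
| CIFAR100 | 400 | 0.9 | 1 | — | — | — | — |
| CIFAR100 | 800 | 0.9 | 4 | 74.32% | 74.62% | 3.872 | 3.432 |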

For our experiments, we used four classification benchmark datasets from the field of computer vision: The MNIST dataset [LeCun et al.1998], the Street View House Numbers dataset (SVHN) [Netzer et al.2011] and the CIFAR10 and CIFAR100 datasets [Krizhevsky and Hinton2009]. A more detailed description of the datasets can be found in Appendix C.1 in the supplementary material.
The neural networks we experimented with are feedforward neural networks that contain one, three or five hidden layers of various layer sizes. For optimization, we used stochastic gradient descent with momentum [Sutskever et al.2013] with several values of momentum and a minibatch size of 128 examples. For each value of $k$, we replaced the gradient in the algorithm with the pseudo-gradient from Eq. (2), using the function $f$ defined in Eq. (4). For the multilayer experiments we also used gradient clipping [Pascanu, Mikolov, and Bengio2013] with a threshold of 100. In the hidden layers, biases were initialized to 0 and for the weights we used the initialization scheme from [Glorot and Bengio2010]. Both biases and weights in the softmax layer were initialized to 0.
In each experiment, we used cross-validation to select the best value of $k$. The learning rate was optimized using cross-validation for each value of $k$ separately, as the size of the pseudo-gradient can be significantly different between different values of $k$, as evident from Eq. (4). We compared the test error between the models using the selected $k$ and $k = 1$, each with its best performing learning rate. Additional details about the experiment process can be found in Appendix C.2 in the supplementary material.
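The update step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the classical momentum update (the velocity is the momentum times the previous velocity minus the learning rate times the update direction) and assumes that gradient clipping rescales the pseudo-gradient when its Euclidean norm exceeds the threshold. The function names and numeric values are hypothetical.

```python
import math

def clip_by_norm(g, threshold):
    # Rescale g so that its Euclidean norm does not exceed the threshold.
    norm = math.sqrt(sum(v * v for v in g))
    if norm > threshold:
        return [v * threshold / norm for v in g]
    return list(g)

def momentum_step(params, velocity, pseudo_grad, lr, momentum, clip=None):
    """One parameter update in which the pseudo-gradient replaces the gradient."""
    g = clip_by_norm(pseudo_grad, clip) if clip is not None else list(pseudo_grad)
    new_velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Hypothetical flattened parameters and pseudo-gradient for one minibatch.
params, velocity = [1.0, 2.0], [0.0, 0.0]
params_a, velocity_a = momentum_step(params, velocity, [3.0, 4.0],
                                     lr=0.1, momentum=0.5)
params_b, velocity_b = momentum_step(params, velocity, [3.0, 4.0],
                                     lr=0.1, momentum=0.5, clip=1.0)
```

Because the optimizer only sees a direction vector, nothing else in the training loop has to change when the gradient is replaced; only the learning rate needs to be re-tuned for each value of the sensitivity parameter.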
We report the test error of each of the trained models for MNIST, SVHN, CIFAR10 and CIFAR100 in Tables 1, 3 and 4 for networks with one, three and five layers respectively. Additional experiments are reported in Appendix C.1 in the supplementary material. We further report the cross-entropy values using the selected $k$ and the default $k = 1$.
Several observations are evident from the experiment results. First, aligned with our hypothesis, the value of $k$ selected by the cross-validation scheme was almost always smaller than $1$ for the shallow networks, larger than one for the deep networks, and close to one for networks with medium depth. Indeed, the capacity of the network is positively correlated with the optimal sensitivity to hard examples.
Second, for the shallow networks the cross-entropy loss on the test set was always worse for the selected $k$ than for $k = 1$. This implies that indeed, by using a different value of $k$ we are not optimizing the cross-entropy loss, yet are improving the success of optimizing the true prediction error. In contrast, in the experiments with three and five layers, the cross-entropy is also improved by selecting the larger $k$. This is an interesting phenomenon, which might be explained by the fact that examples with a large prediction bias have a high cross-entropy loss, and so focusing training on these examples reduces the empirical cross-entropy loss, and therefore also the true cross-entropy loss.
To summarize, our experiments show that overall, cross-validating over the value of $k$ usually yields improved results over $k = 1$, and that, as expected, the optimal value of $k$ grows with the depth of the network.
7 Conclusions
Inspired by an intuition in human cognition, in this work we proposed a generalization of the cross-entropy gradient step in which a tunable parameter controls the sensitivity of the training process to hard examples. Our experiments show that, as we expected, the optimal level of sensitivity to hard examples is positively correlated with the depth of the network. Moreover, the experiments demonstrate that selecting the value of the sensitivity parameter using cross-validation leads overall to improved prediction error performance on a variety of benchmark datasets.
The proposed approach is not limited to feedforward neural networks: it can be used in any gradient-based training algorithm, and for any network architecture. In future work, we plan to study this method as a tool for improving training in other architectures, such as convolutional networks and recurrent neural networks, as well as experimenting with different levels of sensitivity to hard examples in different stages of the training procedure, and combining the predictions of models with different levels of this sensitivity.
Acknowledgments
This work has been supported by the European Community’s Seventh Framework Programme through the ERC Starting Grant No. 338164 (iHEARu). Sivan Sabato was supported in part by the Israel Science Foundation (grant No. 555/15).
References
 [Bahdanau et al.2015] Bahdanau, D.; Serdyuk, D.; Brakel, P.; Ke, N. R.; Chorowski, J.; Courville, A.; and Bengio, Y. 2015. Task loss estimation for sequence prediction. arXiv preprint arXiv:1511.06456.
 [Bengio et al.2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proc. of the 26th annual International Conference on Machine Learning (ICML), 41–48. Montreal, Canada: ACM.

 [Bridle1990] Bridle, J. S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing. Springer. 227–236.
 [Cho, Courville, and Bengio2015] Cho, K.; Courville, A.; and Bengio, Y. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17(11):1875–1886.

 [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proc. of International Conference on Artificial Intelligence and Statistics, 249–256.
 [Golik, Doetsch, and Ney2013] Golik, P.; Doetsch, P.; and Ney, H. 2013. Cross-entropy vs. squared error training: a theoretical and experimental comparison. In Proc. of INTERSPEECH, 1756–1760.
 [Hinton et al.2012] Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE 29(6):82–97.
 [Hinton1989] Hinton, G. E. 1989. Connectionist learning procedures. Artificial intelligence 40(1):185–234.
 [Jeatrakul, Wong, and Fung2010] Jeatrakul, P.; Wong, K. W.; and Fung, C. C. 2010. Data cleaning for classification using misclassification analysis. Journal of Advanced Computational Intelligence and Intelligent Informatics 14(3):297–302.
 [Keren and Schuller2016] Keren, G., and Schuller, B. 2016. Convolutional RNN: an enhanced model for extracting features from sequential data. In Proc. of 2016 International Joint Conference on Neural Networks (IJCNN), 3412–3419.
 [Keren et al.2016] Keren, G.; Deng, J.; Pohjalainen, J.; and Schuller, B. 2016. Convolutional neural networks with data augmentation for classifying speakers’ native language. In Proc. of INTERSPEECH, 2393–2397.
 [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
 [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Proc. of Advances in Neural Information Processing Systems (NIPS), 1097–1105.
 [Kumar, Packer, and Koller2010] Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In Proc. of Advances in Neural Information Processing Systems (NIPS), 1189–1197.
 [Lake et al.2016] Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; and Gershman, S. J. 2016. Building machines that learn and think like people. arXiv preprint arXiv:1604.00289.
 [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Levin and Fleisher1988] Levin, E., and Fleisher, M. 1988. Accelerated learning in layered neural networks. Complex systems 2:625–640.

 [Netzer et al.2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning. Granada, Spain.
 [Pascanu, Mikolov, and Bengio2013] Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), 1310–1318.
 [Polyak1964] Polyak, B. T. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1–17.
 [Rumelhart, Hinton, and Williams1988] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1988. Learning representations by back-propagating errors. Cognitive modeling 5:3.
 [Silva et al.2006] Silva, L. M.; De Sa, J. M.; Alexandre, L.; et al. 2006. New developments of the ZEDM algorithm. In Intelligent Systems Design and Applications, volume 1, 1067–1072.
 [Smith and Martinez2011] Smith, M. R., and Martinez, T. 2011. Improving classification accuracy by identifying and removing instances that should be misclassified. In The 2011 International Joint Conference on Neural Networks (IJCNN), 2690–2697.
 [Sutskever et al.2013] Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In Proc. of the 30th International Conference on Machine Learning (ICML), 1139–1147.
 [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
 [Svozil, Kvasnicka, and Pospichal1997] Svozil, D.; Kvasnicka, V.; and Pospichal, J. 1997. Introduction to multilayer feedforward neural networks. Chemometrics and intelligent laboratory systems 39(1):43–62.
 [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [Tieleman and Hinton2012] Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
 [Zaremba and Sutskever2014] Zaremba, W., and Sutskever, I. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.
Appendix A Proof for Toy Example
Consider the neural network from the toy example in Section 5.1. In this network, there exists one classification threshold such that examples above or below it are classified to different classes. We prove that for a large enough training set, the value of the cross-entropy cost is not minimal when the threshold is at $0$.
Suppose that there is an assignment of network parameters that minimizes the crossentropy which induces a threshold at . The output of the softmax layer is determined uniquely by , or equivalently by . Therefore, we can assume without loss of generality that . Denote . If in the minimizing assignment, then all examples are classified as members of the same class and in particular, the classification threshold is not zero. Therefore we may assume . In this case, the classification threshold is . Since we assume a minimal solution at zero, the minimizing assignment must have .
When the training set size approaches infinity, the crossentropy on the sample approaches the expected crossentropy on . Let be the expected crossentropy on for network parameter values . Then
And we have:
Therefore
Differentiating under the integral sign, we get
Since we assume the cross-entropy has a minimal solution with , we have
Therefore
Since , it must be that . This contradicts our assumption, hence the cross-entropy does not have a minimal solution with a threshold at zero.
Appendix B Proof of Lemma 1
Proof.
Consider a neural network with three units in the output layer, and at least one hidden layer. Let be a labeled example, and suppose that there exists some cost function , differentiable in , such that for as defined in Eq. (2) and defined in Eq. (4) for some , we have for each parameter in . We now show that this is only possible if .
Under the assumption on , for any two parameters ,
hence
(5) 
Recall our notations: is the output of unit in the last hidden layer before the softmax layer, is the weight between the hidden unit in the last hidden layer, and unit in the softmax layer, is the input to unit in the softmax layer, and is the bias of unit in the softmax layer.
Plugging in as defined in Eq. (4), and using the fact that for , we get:
Since , we have . In addition, and for . Therefore
(6)  
Next, we evaluate each side of the equation separately, using the following:
Hence Eq. (6) holds if and only if:
For , this equality holds since . However, for any , there are values of such that this does not hold. We conclude that our choice of does not lead to a pseudo-gradient that is the gradient of any cost function. ∎
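The proof relies on a general fact: a vector field can be the gradient of a sufficiently smooth cost function only if its Jacobian is symmetric. The following minimal sketch checks this numerically for a three-unit softmax. It is an illustration only: the field used here is a hypothetical elementwise signed power of the prediction bias, taken with respect to the softmax inputs, and is not the pseudo-gradient of Eq. (4). All function names are assumptions made for the example.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def field(z, target, power):
    """Hypothetical update direction (NOT the paper's Eq. (4)):
    elementwise signed power of the prediction bias p - t."""
    p = softmax(z)
    return [math.copysign(abs(pi - ti) ** power, pi - ti)
            for pi, ti in zip(p, target)]

def jacobian(fn, z, h=1e-6):
    """Central finite-difference Jacobian J[i][j] = d fn_i / d z_j."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        fp, fm = fn(zp), fn(zm)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

def asymmetry(J):
    """Largest absolute difference between J[i][j] and J[j][i]."""
    n = len(J)
    return max(abs(J[i][j] - J[j][i]) for i in range(n) for j in range(n))

z = [0.3, -1.2, 0.8]   # arbitrary softmax inputs for three output units
t = [1.0, 0.0, 0.0]    # one-hot target

# power = 1 gives p - t, the cross-entropy gradient w.r.t. the softmax inputs.
asym_ce = asymmetry(jacobian(lambda v: field(v, t, 1.0), z))
# power = 2 is one example of a non-unit power.
asym_pow = asymmetry(jacobian(lambda v: field(v, t, 2.0), z))
```

With power 1 the Jacobian is symmetric up to finite-difference error, as expected for a true gradient. With power 2 it is visibly asymmetric at this point, so this hypothetical field cannot be the gradient of any twice continuously differentiable function of the softmax inputs.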
Appendix C Additional Experiment Details and Results
C.1 Datasets
The experiments used the following datasets: MNIST [LeCun et al.1998], consisting of grayscale 28x28 pixel images of handwritten digits, with 10 classes, 60,000 training examples and 10,000 test examples; the Street View House Numbers dataset (SVHN) [Netzer et al.2011], consisting of RGB 32x32 pixel images of digits cropped from house numbers, with 10 classes, 73,257 training examples and 26,032 test examples; and the CIFAR10 and CIFAR100 datasets [Krizhevsky and Hinton2009], consisting of RGB 32x32 pixel images of 10 and 100 object classes, respectively, with 50,000 training examples and 10,000 test examples. All datasets were linearly transformed such that all features are in the interval .
C.2 Choosing the value of
In each experiment, we used cross-validation to select the best value of . For networks with one hidden layer, was selected out of the values . For networks with 3 or 5 hidden layers, was selected out of the values , removing the smaller values of due to performance considerations (in preliminary experiments, these small values yielded poor results for deep networks). The learning rate was optimized using cross-validation for each value of separately, as the size of the pseudo-gradient can be significantly different between different values of , as evident from Eq. (4).
For each experiment configuration, defined by a dataset, network architecture and momentum, we selected an initial learning rate , based on preliminary experiments on the training set. Then the following procedure was carried out for , for every tested value of :

Randomly split the training set into 5 equal parts, .

Run the iterative training procedure on , until there is no improvement in test prediction error for epochs on the early stopping set, .

Select the network model that did the best on .

Calculate the validation error of the selected model on the validation set, .

Repeat the process for times after permuting the roles of . We set for MNIST, and for CIFAR10/100 and SVHN.

Let be the average of the validation errors.
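The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that three of the five parts are used for training, one for early stopping and one for validation, and that the roles are permuted by cyclic rotation. The names `rotate_roles`, `average_validation_error` and the callable `train_and_validate` are hypothetical.

```python
import random

N_PARTS = 5  # the text splits the training set into 5 equal parts

def rotate_roles(n_examples, n_rounds, seed=0):
    """Yield (train_idx, early_stop_idx, val_idx) for each round.

    Assumption: 3 parts train, 1 part early stopping, 1 part validation,
    with roles permuted by cyclic rotation between rounds.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::N_PARTS] for i in range(N_PARTS)]
    for r in range(n_rounds):
        shift = r % N_PARTS
        rotated = parts[shift:] + parts[:shift]
        train = [i for part in rotated[:3] for i in part]
        yield train, rotated[3], rotated[4]

def average_validation_error(n_examples, n_rounds, train_and_validate):
    """Average the validation error over all rounds.

    train_and_validate(train_idx, early_stop_idx, val_idx) is a hypothetical
    callable that trains with early stopping, keeps the model that did best
    on the early stopping set, and returns its error on the validation set.
    """
    errors = [train_and_validate(tr, es, va)
              for tr, es, va in rotate_roles(n_examples, n_rounds)]
    return sum(errors) / len(errors)

splits = list(rotate_roles(n_examples=20, n_rounds=5))
```

In this sketch the average returned by `average_validation_error` plays the role of the averaged validation error computed in the last step, for one learning rate and one parameter value.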
We then found . If the minimum was found with the minimal or the maximal that we tried, we also performed the above process using half the or double the , respectively. This continued iteratively until there was no need to add learning rates. At the end of this process we selected , and retrained the network with parameters on the training sample, using one fifth of the sample as an early stopping set. We compared the test error of the resulting model to the test error of a model retrained in the same way, except that we set (leading to standard cross-entropy training), and the learning rate to . The final learning rates in the selected models were in the range for MNIST, and for the other datasets.
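The iterative extension of the learning-rate grid described above can be sketched as follows, for a single value of the tuned parameter. This is an illustrative sketch rather than the authors' implementation: `avg_val_error` is a hypothetical callable returning the averaged validation error for a learning rate, and `max_expansions` is a safety cap that the text does not mention.

```python
import math

def select_learning_rate(avg_val_error, initial_rates, max_expansions=10):
    """Return (best_rate, errors), extending the grid at its boundaries.

    If the best rate is the smallest one tried, half of it is also tried;
    if it is the largest one tried, double of it is also tried. The search
    stops once the best rate lies strictly inside the grid.
    """
    errors = {lr: avg_val_error(lr) for lr in initial_rates}
    for _ in range(max_expansions):
        best = min(errors, key=errors.get)
        if best == min(errors):        # minimum at the lower boundary
            new_rate = best / 2
        elif best == max(errors):      # minimum at the upper boundary
            new_rate = best * 2
        else:                          # interior minimum: grid is sufficient
            break
        errors[new_rate] = avg_val_error(new_rate)
    return min(errors, key=errors.get), errors

# Toy error curve (an assumption for the demo) with its minimum at 0.125.
def toy_error(lr):
    return abs(math.log2(lr) + 3)

best_rate, tried = select_learning_rate(toy_error, [0.5, 1.0, 2.0])
```

Starting from the grid 0.5, 1.0, 2.0, the sketch adds 0.25, 0.125 and 0.0625 in turn and stops once 0.125 is no longer at the boundary of the grid.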
C.3 Results

Dataset  Layer Size  Momentum  Selected parameter value  Test Error (standard cross-entropy)  Test Error (selected value)  Test Cross-Entropy Loss (standard cross-entropy)  Test Cross-Entropy Loss (selected value)
MNIST  400  0  0.5  1.71%  1.70%  0.0757  0.148
MNIST  800  0  0.5  1.66%  1.67%  0.070  0.137
MNIST  1100  0  0.5  1.64%  1.62%  0.068  0.131
MNIST  400  0.9  0.5  1.75%  1.75%  0.073  0.140
MNIST  800  0.9  2  1.71%  1.63%  0.070  0.054
MNIST  1100  0.9  0.5  1.74%  1.69%  0.069  0.127
SVHN  400  0  0.25  16.84%  16.09%  0.658  1.575
SVHN  800  0  0.25  16.19%  15.71%  0.641  1.534
SVHN  1100  0  0.25  15.97%  15.68%  0.636  1.493
SVHN  400  0.9  0.125  16.65%  16.30%  0.679  2.861
SVHN  800  0.9  0.25  16.15%  15.68%  0.675  1.632
SVHN  1100  0.9  0.25  15.85%  15.47%  0.640  1.657
CIFAR10  400  0  0.125  48.15%  46.91%  1.435  5.609
CIFAR10  800  0  0.125  46.92%  46.14%  1.390  5.390
CIFAR10  1100  0  0.125  46.63%  46.00%  1.356  5.290
CIFAR10  400  0.9  0.0625  48.19%  46.71%  1.518  11.049
CIFAR10  800  0.9  0.125  47.09%  46.16%  1.616  5.294
CIFAR10  1100  0.9  0.125  46.71%  45.77%  1.850  5.904
CIFAR100  400  0.9  0.25  74.96%  74.28%  3.306  7.348
CIFAR100  800  0.9  0.125  74.12%  73.47%  3.327  13.267
CIFAR100  1100  0.9  0.25  73.47%  73.19%  3.235  7.489


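Dataset  Layer Sizes  Momentum  Selected parameter value  Test Error (standard cross-entropy)  Test Error (selected value)  Test Cross-Entropy Loss (standard cross-entropy)  Test Cross-Entropy Loss (selected value)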
MNIST  400  0  1  —  —  —  — 
MNIST  800  0  1  —  —  —  — 
MNIST  400  0.9  1  —  —  —  — 
MNIST  800  0.9  0.5  1.60%  1.53%  0.091  0.189 
SVHN  400  0.9  1  —  —  —  — 
SVHN  800  0.9  2  16.14%  15.96%  1.651  1.062 
CIFAR10  400  0.9  2  47.52%  46.92%  2.226  2.010 
CIFAR10  800  0.9  2  45.27%  44.26%  2.855  2.341 
CIFAR100  400  0.9  0.25  74.97%  74.52%  3.356  8.520 
CIFAR100  800  0.9  0.5  74.48%  73.17%  4.133  8.642 