are being used in many deep neural networks such as Convolutional Neural Networks (CNNs) since they can be easily implemented with high memory efficiency.
The empirical success of RMSProp could be explained by using Hessian-based preconditioning. The Hessian is the matrix that represents the curvature of the loss function; Hessian-based preconditioning locally changes the curvature of the loss function. When training deep neural networks, pathological curvatures such as saddle points and cliffs can slow the progress of first order gradient descent methods such as Stochastic Gradient Descent (SGD). Hessian-based preconditioning improves the condition of the curvature, and thus enhances SGD speed. However, SGD with Hessian-based preconditioning incurs high computation cost because it generally computes the inverse matrix of the Hessian. Since RMSProp approximates Hessian-based preconditioning by using first order gradients, it achieves efficient training. In addition, RMSProp is easy to implement. Therefore, in terms of practical use, RMSProp and its variants such as AdaDelta and Adam are still seen as the most powerful approach to training deep neural networks.
However, the first order gradients used in RMSProp include noise caused by stochastic optimization techniques such as mini-batch setting. With batch setting, since the model inputs are fixed in each iteration, only parameter updates change the gradients. On the other hand, with mini-batch setting, since the inputs are not fixed in each iteration, gradients can also be changed by randomly selecting the inputs in each iteration. This change in the mini-batch setting can be seen as noise. Since RMSProp uses the noisy first order gradients to approximate Hessian-based preconditioning, the approximation may be inaccurate. This indicates that the efficiency of RMSProp can be improved by effectively handling the noise in the first order gradients.
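The difference between the batch and mini-batch settings can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical toy linear regression problem (the data, the `gradient` helper and the batch sizes are illustrative choices, not taken from the experiments in this paper). With the parameters held fixed, the full-batch gradient never changes, while the mini-batch gradient fluctuates with the randomly selected inputs, and more so for smaller mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: linear regression with 1000 samples and 5 parameters.
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)  # parameters are held fixed, so only input selection varies


def gradient(X_batch, y_batch, w):
    """Gradient of 0.5 * mean((X_batch @ w - y_batch) ** 2) with respect to w."""
    return X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)


def minibatch_gradient_std(batch_size, n_draws=200):
    """Per-parameter standard deviation of the gradient over random mini-batches."""
    grads = []
    for _ in range(n_draws):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        grads.append(gradient(X[idx], y[idx], w))
    return np.std(grads, axis=0)


full_batch_gradient = gradient(X, y, w)  # identical on every call
std_small = minibatch_gradient_std(16)   # strongly fluctuating gradients
std_large = minibatch_gradient_std(128)  # weaker fluctuation
```

In this sketch the spread in `std_small` is larger than in `std_large`, which is the noise that the rest of the paper aims to handle.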
This paper proposes a novel adaptive learning rate algorithm called SDProp. The key idea is to use covariance matrix based preconditioning instead of Hessian-based preconditioning. The covariance matrix is derived by assuming a distribution for the noise in the observed gradients. Since the distribution effectively captures the noise, SDProp can effectively capture the changes in gradients caused by random input selection in each iteration. Interestingly, our theoretical analysis reveals that SDProp uses the information of directions over past gradients in adapting the learning rate while RMSProp and its variants use the magnitudes of the gradients. In experiments, we compare SDProp with RMSProp. SDProp needs 50% fewer training iterations than RMSProp to reach the final training loss for CNN on the Cifar-10, Cifar-100 and MNIST datasets. In addition, SDProp outperforms Adam, a state-of-the-art algorithm based on RMSProp, on several datasets. Our approach is also more effective than RMSProp for training Recurrent Neural Networks (RNNs) and very deep fully-connected neural networks.
We briefly review the background of this paper. First, we describe SGD, which is a basic algorithm in stochastic optimization such as mini-batch setting. Second, we review RMSProp. Finally, we explain the relationship between Hessian-based preconditioning and RMSProp.
2.1. Stochastic Gradient Descent
Many learning algorithms aim at minimizing a loss function $f(\theta)$ with respect to a parameter vector $\theta$ [10, 11]. SGD is a popular algorithm in the mini-batch setting. To minimize $f(\theta)$, SGD iteratively updates $\theta$ with a mini-batch of samples as follows:
$$\theta_{i,t+1} = \theta_{i,t} - \eta\, g_{i,t} \tag{2.1}$$
where $\eta$ is the learning rate, $\theta_{i,t}$ is the $i$-th element of the parameter vector at time $t$, $x_t$ is the sample or mini-batch at time $t$, and $g_{i,t}$ is the first order gradient with respect to the $i$-th parameter computed on $x_t$. SGD applies Equation (2.1) to each sample or mini-batch while Gradient Descent (GD) applies Equation (2.1) to all data in the batch setting. Although $g_{i,t}$ includes noise due to the random selection of mini-batch $x_t$, SGD uses it in the training phase. Since SGD only uses a part of the data for computing $g_{i,t}$, each iteration has reduced computation cost while memory efficiency is high.
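As a minimal sketch of the SGD update in Equation (2.1), the parameter vector is moved against the (noisy) gradient, scaled by the learning rate. The toy objective $0.5\,\|\theta\|^2$, the noise level and the names `sgd_step` and `lr` are assumptions made for illustration only.

```python
import numpy as np


def sgd_step(theta, grad, lr=0.1):
    """One SGD update: theta <- theta - lr * grad, applied element-wise."""
    return theta - lr * grad


rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
for t in range(200):
    # Gradient of the toy loss 0.5 * ||theta||^2 plus Gaussian noise that
    # stands in for the randomness of mini-batch selection.
    noisy_grad = theta + 0.01 * rng.normal(size=2)
    theta = sgd_step(theta, noisy_grad, lr=0.1)
```

Despite the noise, `theta` ends up close to the minimizer at the origin; the same learning rate is used for every parameter, which is what adaptive methods change.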
2.2. RMSProp
RMSProp is a popular algorithm based on SGD for training neural networks. AdaDelta and Adam are follow-up methods of RMSProp. RMSProp rapidly reduces the loss function by adapting the learning rate of SGD. The updating rule of RMSProp is as follows:
$$v_{i,t} = \beta\, v_{i,t-1} + (1 - \beta)\, g_{i,t}^2 \tag{2.2}$$
$$\theta_{i,t+1} = \theta_{i,t} - \frac{\eta}{\sqrt{v_{i,t}} + \epsilon}\, g_{i,t} \tag{2.3}$$
where $v_{i,t}$ is the moving average of uncentered variance over past first order gradients, $\beta$ is the decay rate for computing $v_{i,t}$, and $\epsilon$ is a small value for stable computation. Intuitively, RMSProp divides the learning rate, $\eta$, by the magnitude of the past first order gradients, $\sqrt{v_{i,t}}$. Therefore, if the gradients of the $i$-th parameter had large magnitude in the past, RMSProp yields a small learning rate because $\sqrt{v_{i,t}}$ in Equation (2.3) is large. Empirically, this idea efficiently reduces the loss function for deep neural networks. Follow-up methods such as AdaDelta and Adam are based on this idea. For convex optimization, regret analysis can be used to explain the efficiency of the methods. For non-convex optimization such as deep neural networks, the empirical success of RMSProp could be explained by using Hessian-based preconditioning. We briefly review the relationship between Hessian-based preconditioning and RMSProp in the next section.
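The RMSProp update can be sketched in a few lines. This is a minimal sketch assuming the standard form of the rule: a moving average `v` of squared gradients with decay rate `beta`, and a step of `lr * grad / (sqrt(v) + eps)`. The function name and the default values are illustrative assumptions rather than settings from this paper.

```python
import numpy as np


def rmsprop_step(theta, grad, v, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and the new moving average."""
    # Moving average of the uncentered variance (squared gradients).
    v = beta * v + (1.0 - beta) * grad ** 2
    # Divide the learning rate by the gradient magnitude, element-wise.
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v


theta0 = np.array([1.0])
theta_small, _ = rmsprop_step(theta0, np.array([2.0]), np.zeros(1), lr=0.1)
theta_large, _ = rmsprop_step(theta0, np.array([200.0]), np.zeros(1), lr=0.1)
```

In this sketch `theta_small` and `theta_large` are almost identical although the gradients differ by a factor of 100, because the step is normalized by the gradient magnitude.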
2.3. Hessian-based Preconditioning
Some kinds of pathological curvature of the loss function slow the progress of SGD. Therefore, it is important to capture the curvature in order to efficiently train deep neural networks.
Hessian-based preconditioning locally changes the function by using the Hessian $H$, which can capture the curvature of the function. The Hessian is the square matrix of the second order gradients of function $f(\theta)$, represented by $H = \nabla^2 f(\theta)$. The condition number of the Hessian estimates the extent to which the curvature is pathological. The condition number is defined as $\kappa(H) = \sigma_{\max}(H) / \sigma_{\min}(H)$, where $\sigma_{\max}(H)$ and $\sigma_{\min}(H)$ are the largest and smallest singular values of $H$, respectively. The function has less pathological curvature if the condition number is small, because the function then curves almost equally in every direction. Therefore, we can increase the efficiency of the training by reducing the Hessian condition number.
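To make the role of the condition number concrete, the following sketch computes it from singular values and counts gradient descent steps on two hypothetical quadratic losses $0.5\,\theta^\top H \theta$. The matrices, the tolerance and the helper names are illustrative assumptions.

```python
import numpy as np


def condition_number(H):
    """Ratio of the largest to the smallest singular value of H."""
    s = np.linalg.svd(H, compute_uv=False)
    return s.max() / s.min()


def gd_steps(H, tol=1e-6, max_steps=100000):
    """Steps of plain gradient descent on 0.5 * theta^T H theta until ||theta|| < tol."""
    theta = np.ones(H.shape[0])
    lr = 1.0 / np.linalg.svd(H, compute_uv=False).max()  # largest stable step of this form
    for step in range(1, max_steps + 1):
        theta = theta - lr * (H @ theta)  # the gradient of the quadratic is H @ theta
        if np.linalg.norm(theta) < tol:
            return step
    return max_steps


H_good = np.diag([1.0, 1.0])    # curves equally in both directions
H_bad = np.diag([100.0, 1.0])   # one direction is 100 times steeper

steps_good = gd_steps(H_good)
steps_bad = gd_steps(H_bad)
```

With the well-conditioned matrix a single step suffices, while the ill-conditioned one needs more than a thousand steps under the same rule for choosing the learning rate.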
Hessian-based preconditioning locally transforms an original parameter into another parameter so that the Hessian has a small condition number. A preconditioning matrix $D^{1/2}$ gives transformations such as $\hat{\theta} = D^{1/2}\theta$ where $\hat{\theta}$ is the transformed parameter. By using $D^{1/2}$, function $f$ is transformed into function $\hat{f}$ where $\hat{f}(\hat{\theta}) = f(D^{-1/2}\hat{\theta})$. If $\hat{f}$ has a smaller condition number than $f$, we can efficiently train a model by applying first order gradient descent to $\hat{f}$. The updating rule of $\hat{f}$ is $\hat{\theta}_{t+1} = \hat{\theta}_t - \eta \nabla \hat{f}(\hat{\theta}_t)$. Since $\nabla \hat{f}(\hat{\theta}) = D^{-1/2} \nabla f(\theta)$, we have the following form for the original parameter $\theta$:
$$\theta_{t+1} = \theta_t - \eta\, D^{-1} \nabla f(\theta_t) \tag{2.4}$$
If $\hat{H}$ is the Hessian of the transformed function $\hat{f}$, $\hat{H}$ is given as $(D^{-1/2})^\top H D^{-1/2}$. When $D^{1/2} = H^{1/2}$, $\hat{H}$ has a smaller condition number because $\hat{H}$ is an identity matrix. In this case, Equation (2.4) corresponds to the Newton method. However, $H^{1/2}$ exists only when $H$ is positive-semidefinite. Since deep neural networks have many saddle points where the Hessian can be indefinite, the Newton method is unsuitable for training deep neural networks. On the other hand, the diagonal equilibration matrix of $H$ works well even if $H$ is indefinite. This indicates that GD can efficiently escape from saddle points by preconditioning based on the diagonal equilibration matrix.
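The effect of preconditioning on the Hessian can be checked on a small example. The sketch below uses a hypothetical positive definite $2 \times 2$ Hessian. It assumes that the transformed Hessian is $(D^{-1/2})^\top H D^{-1/2}$, and it assumes that the diagonal equilibration matrix has the row norms of $H$ on its diagonal, as in the equilibration literature; neither formula is spelled out in the surviving text, so both are labeled assumptions here.

```python
import numpy as np

# Hypothetical ill-conditioned, positive definite Hessian.
H = np.array([[100.0, 2.0],
              [2.0, 1.0]])

# Full preconditioning with D^{1/2} = H^{1/2}: build H^{-1/2} from the
# eigendecomposition of the symmetric matrix H.
w, V = np.linalg.eigh(H)
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
H_hat = H_inv_sqrt.T @ H @ H_inv_sqrt  # transformed Hessian, expected to be I

# Diagonal equilibration (assumption: D_ii is the 2-norm of the i-th row of H).
d = np.linalg.norm(H, axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
H_eq = D_inv_sqrt.T @ H @ D_inv_sqrt

cond_original = np.linalg.cond(H)
cond_equilibrated = np.linalg.cond(H_eq)
```

Here `H_hat` is the identity up to rounding, and the diagonal preconditioner reduces the condition number substantially while storing only one value per parameter.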
In RMSProp, the role of $\sqrt{v_{i,t}}$ in Equation (2.3) could be explained by using Hessian-based preconditioning. A comparison of Equation (2.4) to Equation (2.3) indicates that $\sqrt{v_{i,t}}$ corresponds to the $i$-th element of the diagonal preconditioning matrix. In addition, empirical results suggest that $\sqrt{v_{i,t}}$ approximates the $i$-th element of the diagonal equilibration matrix, which can be used to efficiently train deep neural networks. Thus, RMSProp can be interpreted as Hessian-based preconditioning using an approximated diagonal equilibration matrix in the mini-batch setting. Therefore, since RMSProp is more efficient in escaping from saddle points than SGD, RMSProp and its follow-up methods achieve high efficiency.
3. Proposed Method
We first introduce the novel preconditioning idea. Then, we derive SDProp based on this idea.
RMSProp approximates Hessian-based preconditioning by using the first order gradients as described in the preliminary section. However, in stochastic optimization approaches such as mini-batch setting, the first order gradients include noise because input is randomly selected in each iteration. Since the first order gradients in Equation (2.2) and the square roots of the uncentered variances in Equation (2.3) contain noise, it is difficult to effectively approximate Hessian-based preconditioning. In order to effectively handle the noise, we replace Hessian-based preconditioning with covariance matrix based preconditioning.
In covariance matrix based preconditioning, we assume that the first order gradients $g_t$ follow a Gaussian distribution. This is because the field of probabilistic modeling uses Gaussian distributions to model the noise of observations [12, 13, 14, 15]. We assume the following Gaussian distribution of first order gradient $g_t$:
$$g_t \sim \mathcal{N}(\hat{g}_t, C_t) \tag{3.1}$$
where $\hat{g}_t$ is the true gradient without the noise while $g_t$ includes the noise. $\mathcal{N}(\hat{g}_t, C_t)$ is a Gaussian distribution with mean $\hat{g}_t$ and covariance matrix $C_t$; $C_t$ is the covariance matrix of $g_t$ whose size is $n \times n$ for $n$ parameters. The diagonal elements in $C_t$ represent the magnitude of oscillation of the first order gradients that include the noise. Specifically, let $c_{ij,t}$ be the element in the $i$-th row and the $j$-th column of $C_t$; $c_{ij,t}$ represents the covariance of the $i$-th and the $j$-th first order gradient. Therefore, if the $i$-th first order gradient strongly correlates with the $j$-th first order gradient, $c_{ij,t}$ has a large absolute value. On the other hand, $c_{ii,t}$ represents the variance of the $i$-th first order gradient. Therefore, $c_{ii,t}$ has a large value if the first order gradient strongly oscillates in the $i$-th dimension.
Intuitively, large oscillations in the $i$-th dimension incur high variance of updating directions and inefficient progress in plain SGD. However, it is difficult to reduce the oscillation since it can be a result of the noise induced by the mini-batch setting. How can we reduce the oscillation by using $C_t$? This is the motivation behind our approach; plain SGD efficiently progresses if we can control the oscillation by utilizing $C_t$. In this paper, we propose the preconditioning of $C_t$ to control the oscillation. While Hessian-based preconditioning reduces the condition number of the Hessian, our preconditioning reduces the condition number of $C_t$ by transforming $C_t$ into an identity matrix. We describe our approach in the next section.
3.2. Covariance Matrix Based Preconditioning
The previous section suggests that large values in the diagonal of $C_t$ prevent the efficient progress of SGD. Therefore, if we could control the values in the diagonal of $C_t$, we could improve the efficiency of SGD. Our covariance matrix based preconditioning transforms $C_t$ into $\rho^2 I$ where $I$ is an identity matrix whose size is $n \times n$ and $\rho$ is a hyper-parameter that has a positive value. Since each element in the diagonal of $C_t$ represents the variance of a first order gradient, we can hold the variance to the constant value $\rho^2$. If the variance is larger than $\rho^2$, its value is reduced to $\rho^2$. Therefore, SGD efficiently progresses if we transform $C_t$ into $\rho^2 I$.
We first describe the approach used to transform $C_t$ into $I$ instead of $\rho^2 I$. This is because once $C_t$ is transformed into $I$, it is easy to transform $I$ into $\rho^2 I$ as we describe later. Hessian-based preconditioning transforms first order gradients $g_t$ to yield $D^{-1} g_t$ where $D$ is a preconditioning matrix. The preconditioning matrix $D$ reduces the condition number of the Hessian as described in the preliminary section. Unlike the previous approach, we execute the preconditioning of $C_t$ and so use the transformation $\tilde{g}_t = A g_t$. In this transformation, $\tilde{g}_t$ is a transformed first order gradient and $g_t$ is a first order gradient as defined in Equation (3.1). Since the transformation is an affine transformation of $g_t$ generated from the Gaussian distribution in Equation (3.1), we have the following distribution of $\tilde{g}_t$:
$$\tilde{g}_t \sim \mathcal{N}(A \hat{g}_t,\; A C_t A^\top) \tag{3.2}$$
In Equation (3.2), we use the following rule to transform Equation (3.1) into (3.2): if $x \sim \mathcal{N}(m, \Sigma)$ and $y = Bx$, then $y \sim \mathcal{N}(Bm, B \Sigma B^\top)$; $\mathcal{N}(m, \Sigma)$ is a Gaussian distribution that has mean $m$ and covariance matrix $\Sigma$, $B$ is a matrix for affine transformation and $y$ is a transformed variable. By setting $A = C_t^{-1/2}$ in Equation (3.2), we have the following property:
Theorem 1. If we transform first order gradient $g_t$ to yield $C_t^{-1/2} g_t$, we have the following Gaussian distribution:
$$C_t^{-1/2} g_t \sim \mathcal{N}(C_t^{-1/2} \hat{g}_t,\; I) \tag{3.3}$$
where $I$ is an identity matrix whose size is $n \times n$.
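Proof. By using eigen decomposition, we can represent $C_t$ as $C_t = V \Lambda V^\top$ where $V$ is an orthogonal matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues. Since $C_t$ is assumed to be a positive semi-definite matrix, all eigenvalues are equal to or higher than 0 (the inverse square root below additionally requires them to be strictly positive). Thus, $C_t^{-1/2}$ can be computed as $V \Lambda^{-1/2} V^\top$. By setting the covariance term of Equation (3.2) to $C_t^{-1/2} C_t (C_t^{-1/2})^\top$, the Gaussian distribution of $C_t^{-1/2} g_t$ is represented as follows:
$$C_t^{-1/2} g_t \sim \mathcal{N}\big(C_t^{-1/2} \hat{g}_t,\; V \Lambda^{-1/2} V^\top\, V \Lambda V^\top\, V \Lambda^{-1/2} V^\top\big) = \mathcal{N}\big(C_t^{-1/2} \hat{g}_t,\; I\big)$$
In the above formulations, since $V$ is an orthogonal matrix, we use $V^\top V = I$ and $V^{-1} = V^\top$. As a result, we have the distribution of Equation (3.3). ∎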
The above theorem indicates that the transformation $C_t^{-1/2} g_t$ results in a Gaussian distribution whose covariance matrix is the identity matrix $I$. In other words, we can control the covariance matrix to be $I$ by using $C_t^{-1/2} g_t$ instead of $g_t$.
Our preconditioning transforms the value of variance for first order gradients into 1 by using $C_t^{-1/2}$. However, $C_t^{-1/2} g_t$ may have an extremely large value if the variance is 1. Thus, we introduce hyper-parameter $\rho$ to generalize our preconditioning. Specifically, by using the transformation $\rho\, C_t^{-1/2} g_t$ instead of $C_t^{-1/2} g_t$, we have the following distribution:
$$\rho\, C_t^{-1/2} g_t \sim \mathcal{N}(\rho\, C_t^{-1/2} \hat{g}_t,\; \rho^2 I) \tag{3.4}$$
The above equation denotes that $\rho$ controls the value of the covariance matrix while the previous transformation only gives an identity matrix as shown in Equation (3.3). We show that $\rho$ has the same role as the learning rate when we derive SDProp in the next section.
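The whitening property and the effect of the scaling hyper-parameter can be verified numerically. The sketch below assumes a hypothetical 3-dimensional gradient distribution with a strictly positive definite covariance matrix `C` (so that the inverse square root exists), builds the inverse square root from the eigendecomposition, and compares sample covariances before and after scaling by a hypothetical value `rho = 0.1`. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

g_true = np.array([1.0, -2.0, 0.5])  # hypothetical noise-free gradient
A = rng.normal(size=(3, 3))
C = A @ A.T + 0.1 * np.eye(3)        # hypothetical positive definite covariance

# C^{-1/2} = V diag(eigvals^{-1/2}) V^T from the eigendecomposition of C.
eigvals, V = np.linalg.eigh(C)
C_inv_sqrt = V @ np.diag(eigvals ** -0.5) @ V.T

rho = 0.1
g = rng.multivariate_normal(g_true, C, size=200_000)  # noisy gradients, one per row
g_white = g @ C_inv_sqrt.T                            # applies C^{-1/2} to every row
g_scaled = rho * g_white

cov_white = np.cov(g_white, rowvar=False)    # close to the identity matrix
cov_scaled = np.cov(g_scaled, rowvar=False)  # close to rho**2 times the identity
```

The sample covariances only approximate the identity and its scaled version because they are estimated from a finite number of draws; the product `C_inv_sqrt @ C @ C_inv_sqrt.T` equals the identity up to rounding.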
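Since we compute the first order gradients at each time step in SGD, we have to incrementally compute the covariance matrix $C_t$, although Theorem 1 is based on the property that $C_t$ is a positive semi-definite matrix. In order to incrementally compute $C_t$ as a positive semi-definite matrix, we use the following online updating rule:
$$C_t = \gamma\, C_{t-1} + \gamma (1 - \gamma)\, (g_t - \mu_{t-1})(g_t - \mu_{t-1})^\top \tag{3.5}$$
$$\mu_t = \gamma\, \mu_{t-1} + (1 - \gamma)\, g_t \tag{3.6}$$
where $\mu_t$ is the moving average of $g_t$ and $\gamma$ is the hyper-parameter of the decay rate for the moving average that has $0 \le \gamma \le 1$. $\mu_0$ and $C_0$ are initialized as $\mu_0 = 0$ and $C_0 = 0$. The above updating rule gives the following property:
Theorem 2. $C_t$ computed by Equations (3.5) and (3.6) is a positive semi-definite matrix.
Proof. Let $d_t = g_t - \mu_{t-1}$. For every column vector $x$ of real numbers, we have
$$x^\top d_t d_t^\top x = (d_t^\top x)^2 \ge 0.$$
By the definition of positive semi-definite matrices, if a matrix $M$ satisfies $x^\top M x \ge 0$ for every non-zero column vector $x$ of real numbers, $M$ is a positive semi-definite matrix. Since the above inequality shows that $x^\top d_t d_t^\top x \ge 0$ holds, it is clear that $d_t d_t^\top$ is a positive semi-definite matrix even if $\mu_{t-1}$ in Equation (3.6) has any real value.
Then, we prove that $C_t$ in Equation (3.5) is a positive semi-definite matrix by mathematical induction.
Initial step: If $t = 1$, the initialization yields $\mu_0 = 0$ and $C_0 = 0$. Since $C_1$ is computed as $\gamma (1 - \gamma)\, g_1 g_1^\top$ by using Equations (3.5) and (3.6), $C_1$ is a positive semi-definite matrix. This is because $g_1 g_1^\top$ is a positive semi-definite matrix as proved above.
Inductive step: We assume that $C_{t-1}$ is a positive semi-definite matrix. Since $C_t$ is computed as $\gamma\, C_{t-1} + \gamma (1 - \gamma)\, d_t d_t^\top$ by using Equations (3.5) and (3.6), $x^\top C_t x$ is represented as follows:
$$x^\top C_t x = \gamma\, x^\top C_{t-1} x + \gamma (1 - \gamma)\, x^\top d_t d_t^\top x$$
In the above equation, $x^\top C_{t-1} x \ge 0$ and $x^\top d_t d_t^\top x \ge 0$ because $C_{t-1}$ and $d_t d_t^\top$ are positive semi-definite matrices. Therefore, $C_t$ is a positive semi-definite matrix because $x^\top C_t x \ge 0$ holds in the above equation. This completes the inductive step. ∎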
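The induction argument can be checked empirically. The sketch below assumes that the online rule has the form used in the proof: the covariance estimate is a decayed copy of its previous value plus `gamma * (1 - gamma)` times the outer product of the deviation of the new gradient from the previous moving average, with both estimates initialized to zero. The function name, the dimension and the random gradients are hypothetical.

```python
import numpy as np


def update_moments(mu, C, g, gamma=0.99):
    """One online update of the moving average mu and the covariance estimate C."""
    diff = g - mu  # deviation from the previous moving average
    C_new = gamma * C + gamma * (1.0 - gamma) * np.outer(diff, diff)
    mu_new = gamma * mu + (1.0 - gamma) * g
    return mu_new, C_new


rng = np.random.default_rng(0)
mu, C = np.zeros(4), np.zeros((4, 4))
min_eigenvalues = []
for t in range(100):
    g = rng.normal(loc=1.0, scale=2.0, size=4)  # stand-in for a noisy gradient
    mu, C = update_moments(mu, C, g, gamma=0.9)
    # eigvalsh is intended for symmetric matrices; C is symmetric by construction.
    min_eigenvalues.append(np.linalg.eigvalsh(C).min())
```

Every entry of `min_eigenvalues` is non-negative up to floating point rounding, in line with the positive semi-definiteness shown above.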
Note that Hessian-based preconditioning cannot control the oscillation of first order gradients. This is because its transformation results in the distribution $\mathcal{N}(D^{-1} \hat{g}_t,\; D^{-1} C_t (D^{-1})^\top)$ where the covariance matrix is uncontrollable. In addition, since the Hessian may not be a positive semi-definite matrix, it is difficult to compute $H^{-1/2}$. Therefore, our covariance matrix based preconditioning inherently differs from Hessian-based preconditioning. Our idea of preconditioning is more suitable than Hessian-based preconditioning in handling the oscillation triggered by the noise of first order gradients.
Since deep neural networks have a large number of parameters, the idea described in the previous section incurs large memory consumption of $O(n^2)$ where $n$ is the number of parameters. In addition, it costs $O(n^3)$ time to compute $C_t^{-1/2}$ by using eigenvalue decomposition. To avoid these problems, we employ a diagonal preconditioning matrix. Since this approach only needs the diagonal terms, the memory and computation costs are $O(n)$. Although this approach ignores the correlation of first order gradients, it is sufficient to control the oscillation in each dimension. This is because the diagonal of $C_t$ represents the variance of the oscillation as described in the previous section. By picking the diagonal of Equation (3.4), the updating rule is:
$$\theta_{i,t+1} = \theta_{i,t} - \rho\, c_{i,t}^{-1/2}\, g_{i,t} \tag{3.7}$$
We rewrite this updating rule (all steps) as follows:
$$\mu_{i,t} = \gamma\, \mu_{i,t-1} + (1 - \gamma)\, g_{i,t} \tag{3.8}$$
$$c_{i,t} = \gamma\, c_{i,t-1} + \gamma (1 - \gamma)\, (g_{i,t} - \mu_{i,t-1})^2 \tag{3.9}$$
$$\theta_{i,t+1} = \theta_{i,t} - \frac{\rho}{\sqrt{c_{i,t}} + \epsilon}\, g_{i,t} \tag{3.10}$$
where $\mu_{i,t}$ is the moving average of first order gradients for the $i$-th parameter at time $t$ and $\gamma$ is the hyper-parameter of the decay rate for the moving average that has $0 \le \gamma \le 1$. $c_{i,t}$ is the exponentially moving variance of first order gradients for the $i$-th parameter at time $t$. We use $\gamma$ in Equation (3.9) as the decay rate of the exponentially moving variance. $\mu_{i,0}$ and $c_{i,0}$ are initialized as $\mu_{i,0} = 0$ and $c_{i,0} = 0$, respectively. For stable computation, $\epsilon$ is set at a small positive value. Equation (3.10) corresponds to Equation (3.7). We call the algorithm SDProp because Equation (3.10) includes the Standard Deviation $\sqrt{c_{i,t}}$. Although $c_{i,t}$ includes the bias imposed by initialization, we can remove the bias in the same way as prior work.
Notice that $\rho$ takes the same role as the learning rate $\eta$ in Equation (2.3) of RMSProp. Therefore, Equation (3.10) divides the learning rate by the square root of the centered variance $c_{i,t}$ while Equation (2.3) of RMSProp divides the learning rate by the square root of the uncentered variance $v_{i,t}$. In other words, RMSProp and its follow-up methods such as Adam adapt the learning rate by the magnitude of gradients while we adapt it by the variance of gradients. Although RMSProp and SDProp have similar updating rules, they have totally different goals as described in the previous sections. RMSProp executes Hessian-based preconditioning while SDProp executes covariance matrix based preconditioning.
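A minimal sketch of the diagonal SDProp update follows. It assumes that the per-parameter rule is the diagonal of the online covariance update: a moving average `mu` of gradients, a moving centered variance `c` that uses the deviation from the previous `mu`, and a step of `rho * grad / (sqrt(c) + eps)`. The function names and default values are assumptions for illustration. The helper `track_moments` additionally contrasts the centered statistic with an RMSProp-style uncentered one, using the same decay rate for both purely for comparison.

```python
import numpy as np


def sdprop_step(theta, grad, mu, c, rho=0.001, gamma=0.99, eps=1e-8):
    """One diagonal SDProp update under the assumptions stated above."""
    # The centered variance uses the previous moving average, so update c first.
    c = gamma * c + gamma * (1.0 - gamma) * (grad - mu) ** 2
    mu = gamma * mu + (1.0 - gamma) * grad
    theta = theta - rho * grad / (np.sqrt(c) + eps)
    return theta, mu, c


def track_moments(grads, gamma=0.9):
    """Return (centered variance c, uncentered variance v) for a scalar gradient sequence."""
    mu = c = v = 0.0
    for g in grads:
        c = gamma * c + gamma * (1.0 - gamma) * (g - mu) ** 2
        mu = gamma * mu + (1.0 - gamma) * g
        v = gamma * v + (1.0 - gamma) * g ** 2
    return c, v


constant = [1.0] * 500          # same direction in every iteration
oscillating = [1.0, -1.0] * 250  # same magnitude, alternating direction

c_const, v_const = track_moments(constant)
c_osc, v_osc = track_moments(oscillating)
```

The uncentered statistic is practically the same for both sequences because the magnitudes match, whereas the centered statistic is large only for the oscillating one. This is the sense in which the variance-based rule uses direction information. When the centered variance becomes very small, the size of the step is bounded only by `eps`.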
We performed experiments to compare SDProp to RMSProp and Adam, a state-of-the-art algorithm based on RMSProp. Adam has been shown to be a more efficient and effective approach than RMSProp or AdaDelta by integrating momentum into RMSProp. First, we show the efficiency and effectiveness of our approach by using CNN. Second, since SDProp effectively handles the oscillation described in the previous section, we evaluate SDProp by using small mini-batches, which suffer from noise in the first order gradients. Third, we show the efficiency and effectiveness of SDProp for RNN. Fourth, we demonstrate the effectiveness of SDProp for a 20-layered fully-connected neural network that is difficult to train due to many saddle points.
4.1. Efficiency and Effectiveness for CNN
and MNIST. The experiments were conducted on a 7-layered CNN with ReLU activation function. The loss function was negative log likelihood. We compared SDProp to RMSProp and Adam. In SDProp, we tried various combinations of hyper-parameters by using and . In RMSProp, we tried combinations of hyper-parameters by using and . As a result, SDProp achieves the lowest loss in the settings of , . RMSProp has the lowest loss when and . Adam achieves the lowest loss when , and . The mini-batch size was 128. The number of epochs was 50. We use the training loss to evaluate the algorithms because they optimize the training criterion.
Figure 1 shows the training losses for each dataset. On Cifar-10, Cifar-100 and SVHN, SDProp yielded lower losses than RMSProp and Adam in early epochs. On MNIST, although the training losses of SDProp and Adam both nearly reached 0.0, SDProp reduced the loss faster than Adam. SDProp needed 50% fewer training iterations than RMSProp to reach its final training loss on Cifar-10, Cifar-100 and MNIST. This suggests that our idea of covariance matrix based preconditioning is more efficient and effective than Hessian-based preconditioning in the mini-batch setting, because RMSProp and Adam approximate Hessian-based preconditioning as described in the preliminary section. Since SDProp captures the noise, it effectively reduces the loss even if the gradients are noisy. In the next experiment, we investigate the robustness of SDProp against noise by using noisy first order gradients.
4.2. Sensitivity of Mini-batch Size
The previous experimental results show that SDProp is more efficient and effective than existing methods because it handles the noise well, both in our formulation and in practice. In other words, SDProp is expected to effectively train the model even if we use small mini-batch sizes that incur noisy first order gradients. Therefore, we investigated the sensitivity of SDProp and existing methods to mini-batch size. While the main purpose of this experiment is to reveal one performance attribute of SDProp, the result also suggests that SDProp can be used on devices with scant memory that must use small mini-batches.
We compared SDProp to RMSProp and Adam using mini-batch sizes of 16, 32, 64 and 128. We used the Cifar-10 dataset for the 10-class image classification task. We used CNN as per the previous section. The hyper-parameters are also the same as the previous section; they are tuned by grid search. The number of epochs was 50.
Table 1 shows the final training accuracies. SDProp outperforms RMSProp and Adam for all mini-batch sizes examined. Specifically, although the small mini-batch size of 16 incurs very noisy first order gradients, SDProp clearly achieves effective training, unlike RMSProp and Adam. In addition, Table 1 shows that the superiority of our approach over RMSProp and Adam increases as the mini-batch size falls. For example, if the mini-batch size is 16, our approach has 8.75 percent higher accuracy than RMSProp, whereas it has 2.24 percent higher accuracy if the mini-batch size is 128. This indicates that our covariance matrix based preconditioning effectively handles the noise of first order gradients.
4.3. Efficiency and Effectiveness for RNN
We evaluated the efficiency and effectiveness of SDProp for the Recurrent Neural Network (RNN). In this experiment, we predicted the next character by using previous characters via a character-level RNN. We used a subset of the Shakespeare dataset and the source code of the Linux kernel as the datasets. The size of the internal state was 128. The pre-processing of the datasets followed that of prior work. The mini-batch size was 128. In SDProp, we tried grid search with and . As a result, SDProp used the settings of and . In RMSProp, we tried grid search with and . Finally, we used the settings of and for RMSProp. The training criterion was cross entropy. We used gradient clipping and learning rate decay. Gradient clipping is a popular approach for scaling down the gradients by manually setting a threshold; it prevents gradients from exploding in RNN training. We set the threshold to 5.0. We decayed the learning rate every tenth epoch by the factor of 0.97 for RMSProp. In SDProp, $\rho$ was decayed in the same way as the learning rate $\eta$ of RMSProp.
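The two training heuristics can be sketched as follows. This sketch assumes norm-based clipping, where the whole gradient vector is rescaled when its Euclidean norm exceeds the threshold; element-wise clamping is a different variant, and the text does not say which one was used. The decay schedule follows a literal reading of the description (multiply by 0.97 once every ten epochs). The helper names `clip_by_norm` and `decayed_rate` are hypothetical.

```python
import numpy as np


def clip_by_norm(grad, threshold=5.0):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)  # keeps the direction, shrinks the length
    return grad


def decayed_rate(base_rate, epoch, factor=0.97, every=10):
    """Step-size hyper-parameter after decaying by factor once every `every` epochs."""
    return base_rate * factor ** (epoch // every)


clipped = clip_by_norm(np.array([30.0, 40.0]))    # norm 50 is rescaled to norm 5
unchanged = clip_by_norm(np.array([3.0, 0.0]))    # norm 3 is below the threshold
rate_epoch_25 = decayed_rate(0.1, epoch=25)       # two decays have been applied
```

The same `decayed_rate` helper could be applied to either the learning rate of RMSProp or the corresponding step-size hyper-parameter of SDProp, since both play the same role in their update rules.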
Figure 2 shows the results for the Shakespeare dataset and the source code of the Linux kernel. SDProp reduces the training loss faster than RMSProp. Since SDProp effectively handles the noise induced by the mini-batch setting, it can efficiently train models other than CNN, such as RNN.
4.4. 20 Layered Fully-connected Neural Network
In this section, we describe experiments that evaluate the effectiveness of SDProp for training deep fully-connected neural networks. Prior work suggests that the number of saddle points increases exponentially with the dimension of the parameters. Since deep fully-connected networks typically have parameters with higher dimension than other models such as CNN, this optimization problem has many saddle points. This problem is challenging because SGD progresses slowly around saddle points.
We used a very deep fully-connected network with 20 hidden layers, 50 hidden units and ReLU activation functions. We used the MNIST dataset for the 10-class image classification task. This setting is the same as the one used in prior work for evaluating the effectiveness of SGD with high dimensional parameters. Note that MNIST is sufficient for our evaluation because, unlike CNN, fully-connected networks do not saturate the accuracy in our experiment. Our purpose is to evaluate the effectiveness under the setting of very high dimensional parameters. Thus, it is sufficient to evaluate effectiveness if the accuracy is not saturated. The training criterion was negative log likelihood. The mini-batch size was 128. We initialized parameters from a Gaussian with mean 0 and standard deviation 0.01. We compared SDProp to RMSProp. In SDProp, we tried the combinations of hyper-parameters by using and . In RMSProp, we tried the combinations of hyper-parameters by using and . The number of epochs was 50. Although these algorithms can be trapped around saddle points, how often this happens may depend on the initialization of the parameters. Therefore, we tried 10 runs for each of the above settings.
Table 2 lists the results for the best setting of and . It shows averages, best, worst of training accuracies for each setting. The result shows that SDProp achieves higher accuracy than RMSProp for the best setting. In addition, the difference between best and worst accuracy of SDProp is smaller than RMSProp. Since SDProp effectively handles the randomness of noise, it can reduce result uncertainty. The results show that SDProp effectively trains models that have very high dimensional parameters.
We proposed SDProp for the effective and efficient training of deep neural networks. Our approach utilizes the idea of using covariance matrix based preconditioning to effectively handle the noise present in the first order gradients. Our experiments showed that, for various datasets and models, SDProp is more efficient and effective than existing methods. In addition, SDProp achieved high accuracy even if the first order gradients were noisy.
-  Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the Gradient by a Running Average of its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
-  Matthew Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.
-  Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference in Learning Representations (ICLR), 2014.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated Adaptive Learning Rates for Non-convex Optimization. In NIPS, pages 1504–1512, 2015.
-  Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization. In NIPS, pages 2933–2941, 2014.
-  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the Difficulty of Training Recurrent Neural Networks. In ICML, pages 1310–1318, 2013.
-  Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, pages 400–407, 1951.
-  Jeffrey Elman. Finding Structure in Time. Cognitive Science, 14(2):179–211, 1990.
-  Yasuhiro Fujiwara, Yasutoshi Ida, Junya Arai, Mai Nishimura, and Sotetsu Iwamura. Fast Algorithm for the Lasso based L1-Graph Construction. In proceedings of the Very Large Database Endowment(PVLDB), 10(3):229–240, 2016.
-  Yasuhiro Fujiwara, Yasutoshi Ida, Hiroaki Shiokawa, and Sotetsu Iwamura. Fast Lasso Algorithm via Selective Coordinate Descent. In AAAI, pages 1561–1567, 2016.
-  Suvrit Sra, Sebastian Nowozin, and Stephen Wright. Optimization for Machine Learning. Mit Press, 2012.
-  Yasutoshi Ida, Takuma Nakamura, and Takashi Matsumoto. Domain-dependent/independent Topic Switching Model for Online Reviews with Numerical Ratings. In CIKM, pages 229–238, 2013.
-  Yukikatsu Fukuda, Yasutoshi Ida, Takashi Matsumoto, Naohiro Takemura, and Kaoru Sakatani. A Bayesian Algorithm for Anxiety Index Prediction based on Cerebral Blood Oxygenation in the Prefrontal Cortex Measured by Near Infrared Spectroscopy. IEEE journal of translational engineering in health and medicine, 2:1–10, 2014.
-  Hiroki Miyashita, Takuma Nakamura, Yasutoshi Ida, Takashi Matsumoto, and … Nonparametric Bayes-based Heterogeneous Drosophila Melanogaster Gene Regulatory Network Inference: T-process … International Conference on Artificial Intelligence and Applications, pages 51–58, 2013.
-  Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM review, 53(2):217–288, 2011.
-  Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images, 2009.
-  Pierre Sermanet, Sandhya Chintala, and Yann LeCun. Convolutional Neural Networks Applied to House Numbers Digit Classification. International Conference on Pattern Recognition (ICPR), pages 3288–3291, 2012.
-  Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction using Mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
-  Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and Understanding Recurrent Networks. arXiv preprint arXiv:1506.02078, 2015.
-  Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv preprint arXiv:1511.06807, 2015.