1. Introduction
Adaptive learning rate algorithms are widely used for the efficient training of deep neural networks. RMSProp [1] and its follow-on methods [2, 3] are used in many deep neural networks such as Convolutional Neural Networks (CNNs) [4] because they are easy to implement and highly memory efficient. The empirical success of RMSProp can be explained through Hessian-based preconditioning [5]. The Hessian is the matrix that represents the curvature of the loss function; Hessian-based preconditioning locally changes the curvature of the loss function. When training deep neural networks, pathological curvatures such as saddle points [6] and cliffs [7] can slow the progress of first-order gradient descent methods such as Stochastic Gradient Descent (SGD) [8]. Hessian-based preconditioning improves the condition of the curvature and thus enhances the speed of SGD. However, SGD with Hessian-based preconditioning incurs high computation cost because it generally computes the inverse of the Hessian. Since RMSProp approximates Hessian-based preconditioning by using first-order gradients [5], it achieves efficient training. In addition, RMSProp is easy to implement. Therefore, in terms of practical use, RMSProp and its variants such as AdaDelta [2] and Adam [3] are still seen as the most powerful approach to training deep neural networks.

However, the first-order gradients used in RMSProp include noise caused by stochastic optimization techniques such as the minibatch setting. In the batch setting, since the model inputs are fixed in each iteration, only parameter updates change the gradients. In the minibatch setting, on the other hand, the inputs are not fixed, so the gradients also change with the random selection of inputs in each iteration. This change can be seen as noise. Since RMSProp uses these noisy first-order gradients to approximate Hessian-based preconditioning, the approximation may be inaccurate. This indicates that the efficiency of RMSProp can be improved by effectively handling the noise in the first-order gradients.
This paper proposes a novel adaptive learning rate algorithm called SDProp. The key idea is to use covariance matrix based preconditioning instead of Hessian-based preconditioning. The covariance matrix is derived by assuming a distribution over the noise in the observed gradients. Since the distribution effectively captures the noise, SDProp can effectively capture the changes in gradients caused by the random input selection in each iteration. Interestingly, our theoretical analysis reveals that SDProp adapts the learning rate by using the directions of past gradients, while RMSProp and its variants use the magnitudes of the gradients. In experiments, we compare SDProp with RMSProp. SDProp needs 50% fewer training iterations than RMSProp to reach the final training loss for CNNs on the Cifar10, Cifar100, and MNIST datasets. In addition, SDProp outperforms Adam, a state-of-the-art algorithm based on RMSProp, on several datasets. Our approach is also more effective than RMSProp for training Recurrent Neural Networks (RNNs) [9] and very deep fully-connected neural networks.

2. Preliminary
We briefly review the background of this paper. First, we describe SGD, which is a basic algorithm in stochastic optimization such as minibatch setting. Second, we review RMSProp. Finally, we explain the relationship between Hessianbased preconditioning and RMSProp.
2.1. Stochastic Gradient Descent
Many learning algorithms aim at minimizing a loss function $f(\theta)$ with respect to a parameter vector $\theta$ [10, 11]. SGD is a popular algorithm in the minibatch setting. To minimize $f(\theta)$, SGD iteratively updates $\theta$ with a minibatch of samples as follows:

(2.1)  $\theta_{t+1,i} = \theta_{t,i} - \eta g_{t,i}$

where $\eta$ is the learning rate, $\theta_{t,i}$ is the $i$-th element of the parameter vector at time $t$, $x_t$ is the sample or minibatch at time $t$, and $g_{t,i}$ is the first-order gradient with respect to the $i$-th parameter, given by $g_{t,i} = \partial f(\theta_t; x_t) / \partial \theta_{t,i}$. SGD applies Equation (2.1) to each sample or minibatch, while Gradient Descent (GD) applies Equation (2.1) to all data in the batch setting. Although $g_{t,i}$ includes noise due to the random selection of minibatch $x_t$, SGD uses it in the training phase. Since SGD only uses a part of the data for computing $g_{t,i}$, each iteration has reduced computation cost while memory efficiency is high.
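As a concrete illustration, the update in Equation (2.1) can be sketched in a few lines of NumPy; the quadratic toy loss, dataset, and hyperparameter values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """One SGD update (Equation 2.1): theta <- theta - lr * g_t."""
    return theta - lr * grad

# Toy example (an assumption for illustration): minimize f(theta) = ||x - theta||^2
# averaged over a random minibatch; its gradient is 2 * (theta - x_batch).mean(0).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=(1000, 2))  # full dataset
theta = np.zeros(2)
for t in range(2000):
    batch = data[rng.choice(len(data), size=32)]       # random minibatch -> noisy gradient
    grad = 2.0 * (theta - batch).mean(axis=0)
    theta = sgd_step(theta, grad, lr=0.05)
```

Each iteration touches only 32 of the 1000 samples, which is the source of both the low per-iteration cost and the gradient noise discussed above; `theta` still converges to the neighborhood of the data mean.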
2.2. RMSProp
RMSProp is a popular algorithm based on SGD for training neural networks. AdaDelta and Adam are follow-up methods of RMSProp. RMSProp rapidly reduces the loss function by adapting the learning rate of SGD. The updating rule of RMSProp is as follows:

(2.2)  $v_{t,i} = \gamma v_{t-1,i} + (1 - \gamma) g_{t,i}^2$

(2.3)  $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon} g_{t,i}$

where $v_{t,i}$ is the moving average of the uncentered variance over the past first-order gradients $g_{1,i}, \ldots, g_{t,i}$, $\gamma$ is the decay rate for computing $v_{t,i}$, and $\epsilon$ is a small value for stable computation. Intuitively, RMSProp divides the learning rate $\eta$ by the magnitude $\sqrt{v_{t,i}}$ of the past first-order gradients. Therefore, if the $i$-th parameter had large gradient magnitudes in the past, RMSProp yields a small learning rate because $\sqrt{v_{t,i}}$ in Equation (2.3) is large. Empirically, this idea efficiently reduces the loss function for deep neural networks. Follow-up methods such as AdaDelta and Adam are based on this idea. For convex optimization, regret analysis can be used to explain the efficiency of these methods [3]. For non-convex optimization such as deep neural networks, the empirical success of RMSProp can be explained through Hessian-based preconditioning. We briefly review the relationship between Hessian-based preconditioning and RMSProp by following [5] in the next section.

2.3. Hessian-based Preconditioning
Some kinds of pathological curvature of the loss function slow the progress of SGD [6]. Therefore, it is important to capture the curvature in order to efficiently train deep neural networks.

Hessian-based preconditioning locally changes the function by using the Hessian $H$, which captures the curvature of the function. The Hessian is the square matrix of second-order gradients of function $f(\theta)$, represented by $H_{ij} = \partial^2 f(\theta) / \partial \theta_i \partial \theta_j$. The condition number of the Hessian estimates the extent to which the curvature is pathological. The condition number is defined as $\sigma_{\max}(H) / \sigma_{\min}(H)$, where $\sigma_{\max}(H)$ and $\sigma_{\min}(H)$ are the largest and smallest singular values of $H$, respectively. The function has less pathological curvature if the condition number is small, because in that case the function curves equally in all directions. Therefore, we can increase the efficiency of training by reducing the condition number of the Hessian [5].

Hessian-based preconditioning locally transforms an original parameter $\theta$ into another parameter $\hat{\theta}$ so that the Hessian has a small condition number. A preconditioning matrix $D$ gives transformations such as $\hat{\theta} = D^{-1} \theta$, where $\hat{\theta}$ is the transformed parameter. By using $D$, function $f(\theta)$ is transformed into function $\hat{f}(\hat{\theta}) = f(D \hat{\theta})$. If $\hat{f}(\hat{\theta})$ has a smaller condition number than $f(\theta)$, we can efficiently train a model by applying first-order gradient descent to $\hat{f}(\hat{\theta})$. The updating rule of $\hat{\theta}$ is $\hat{\theta}_{t+1} = \hat{\theta}_t - \eta \nabla \hat{f}(\hat{\theta}_t)$. Since $\nabla \hat{f}(\hat{\theta}) = D^{\mathrm{T}} \nabla f(\theta)$, we have the following form for the original parameter $\theta$:
(2.4)  $\theta_{t+1} = \theta_t - \eta D D^{\mathrm{T}} \nabla f(\theta_t)$
If $\hat{H}$ is the Hessian of the transformed function $\hat{f}(\hat{\theta})$, it is given as $\hat{H} = D^{\mathrm{T}} H D$. When $D = H^{-1/2}$, $\hat{H}$ has a smaller condition number because $\hat{H}$ is an identity matrix. In this case, Equation (2.4) corresponds to the Newton method. However, $H^{-1/2}$ exists only when $H$ is positive semidefinite. Since deep neural networks have many saddle points where the Hessian can be indefinite [6], the Newton method is unsuitable for training deep neural networks. On the other hand, the diagonal equilibration matrix of $H$ works well even if $H$ is indefinite [5]. This indicates that GD can efficiently escape from saddle points by preconditioning based on the diagonal equilibration matrix.

In RMSProp, the role of $\sqrt{v_{t,i}} + \epsilon$ in Equation (2.3) can be explained through Hessian-based preconditioning [5]. A comparison of Equation (2.4) to Equation (2.3) indicates that $1 / (\sqrt{v_{t,i}} + \epsilon)$ corresponds to the $i$-th diagonal element of the preconditioning matrix. In addition, empirical results suggest that $\sqrt{v_{t,i}}$ approximates the $i$-th diagonal element of the equilibration matrix, which can be used to efficiently train deep neural networks [5]. Thus, RMSProp can be interpreted as Hessian-based preconditioning using an approximated diagonal equilibration matrix in the minibatch setting. Therefore, since RMSProp escapes saddle points more efficiently than SGD, RMSProp and its follow-up methods achieve high efficiency.
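The effect of preconditioning with $D = H^{-1/2}$ on the condition number can be checked numerically; the ill-conditioned quadratic Hessian below is a made-up example, not taken from the paper.

```python
import numpy as np

# Made-up positive-definite Hessian of a badly conditioned quadratic loss.
h = np.diag([100.0, 1.0])
cond_before = np.linalg.cond(h)

# Preconditioning with D = H^{-1/2} (computed via eigendecomposition):
# the transformed Hessian D^T H D becomes the identity, so its condition
# number drops to 1 and Equation (2.4) reduces to the Newton method.
lam, u = np.linalg.eigh(h)
d = u @ np.diag(lam ** -0.5) @ u.T
h_hat = d.T @ h @ d
cond_after = np.linalg.cond(h_hat)
```

The condition number falls from 100 to 1, which is exactly the mechanism by which Hessian-based preconditioning accelerates first-order gradient descent.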
3. Proposed Method
We first introduce the novel preconditioning idea. Then, we derive SDProp based on this idea.
3.1. Idea
RMSProp approximates Hessian-based preconditioning by using first-order gradients, as described in the preliminary section. However, in stochastic optimization approaches such as the minibatch setting, the first-order gradients include noise because the input is randomly selected in each iteration. Since the first-order gradients in Equation (2.2) and the square roots of the uncentered variances in Equation (2.3) contain noise, it is difficult to effectively approximate Hessian-based preconditioning. In order to effectively handle the noise, we replace Hessian-based preconditioning with covariance matrix based preconditioning.
In covariance matrix based preconditioning, we assume that the first-order gradients $g_t$ follow a Gaussian distribution. This is because the field of probabilistic modeling uses Gaussian distributions to model the noise of observations [12, 13, 14, 15]. By following [12], we assume the following Gaussian distribution for the first-order gradient $g_t$:

(3.1)  $g_t \sim \mathcal{N}(\nabla f(\theta_t), \Sigma)$

where $\nabla f(\theta_t)$ is the true gradient without the noise while $g_t$ includes the noise. $\mathcal{N}(\mu, \Sigma)$ is a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$; $\Sigma$ is the covariance matrix of $g_t$, whose size is $d \times d$ for $d$ parameters. The diagonal elements of $\Sigma$ represent the magnitude of oscillation of the first-order gradients that include the noise. Specifically, letting $\Sigma_{ij}$ be the $i$-th row and $j$-th column element of $\Sigma$, $\Sigma_{ij}$ represents the covariance of the $i$-th and $j$-th first-order gradients. Therefore, if the $i$-th first-order gradient strongly correlates with the $j$-th first-order gradient, $\Sigma_{ij}$ has a large absolute value. On the other hand, $\Sigma_{ii}$ represents the variance of the $i$-th first-order gradient. Therefore, $\Sigma_{ii}$ has a large value if the first-order gradient strongly oscillates in the $i$-th dimension.

Intuitively, large oscillations in the $i$-th dimension incur high variance of the updating directions and inefficient progress in plain SGD. However, it is difficult to reduce the oscillation since it can be a result of the noise induced by the minibatch setting. How can we reduce the oscillation by using $\Sigma$? This is the motivation behind our approach; plain SGD progresses efficiently if we can control the oscillation by utilizing $\Sigma$. In this paper, we propose preconditioning of $\Sigma$ to control the oscillation. While Hessian-based preconditioning reduces the condition number of the Hessian, our preconditioning reduces the condition number of $\Sigma$ by transforming $\Sigma$ into an identity matrix. We describe our approach in the next section.
3.2. Covariance Matrix Based Preconditioning
The previous section suggests that large values in the diagonal of $\Sigma$ prevent the efficient progress of SGD. Therefore, if we can control the values in the diagonal of $\Sigma$, we improve the efficiency of SGD. Our covariance matrix based preconditioning transforms $\Sigma$ into $\sigma^2 I$, where $I$ is an identity matrix whose size is $d \times d$ and $\sigma$ is a hyperparameter that has a positive value. Since the $i$-th diagonal element of $\sigma^2 I$ represents the variance of the $i$-th first-order gradient, we can hold the variance to the constant value $\sigma^2$. If the variance is larger than $\sigma^2$, its value is reduced to $\sigma^2$. Therefore, SGD progresses efficiently if we transform $\Sigma$ into $\sigma^2 I$.

We first describe the approach used to transform $\Sigma$ into $I$ instead of $\sigma^2 I$. This is because once $\Sigma$ is transformed into $I$, it is easy to transform $I$ into $\sigma^2 I$, as we describe later. Hessian-based preconditioning transforms first-order gradients to yield $D D^{\mathrm{T}} \nabla f(\theta)$, where $D$ is a preconditioning matrix. The preconditioning matrix $D$ reduces the condition number of the Hessian as described in the preliminary section. Unlike the previous approach, we execute the preconditioning of $\Sigma$ and so use the transformation $\hat{g}_t = A g_t$. In this transformation, $\hat{g}_t$ is a transformed first-order gradient and $g_t$ is a first-order gradient as defined in Equation (3.1). Since the transformation is an affine transformation of $g_t$ generated from the Gaussian distribution in Equation (3.1), we have the following distribution of $\hat{g}_t$:

(3.2)  $\hat{g}_t \sim \mathcal{N}(A \nabla f(\theta_t), A \Sigma A^{\mathrm{T}})$

In Equation (3.2), we use the following well-known rule to transform Equation (3.1) into (3.2): if $x \sim \mathcal{N}(\mu, \Sigma)$ and $y = A x$, then $y \sim \mathcal{N}(A \mu, A \Sigma A^{\mathrm{T}})$; here $A$ is a matrix for the affine transformation and $y$ is the transformed variable. By setting $A = \Sigma^{-1/2}$ in Equation (3.2), we have the following property:
Theorem 1.
If we transform the first-order gradient $g_t$ to yield $\hat{g}_t = \Sigma^{-1/2} g_t$, we have the following Gaussian distribution:

(3.3)  $\hat{g}_t \sim \mathcal{N}(\Sigma^{-1/2} \nabla f(\theta_t), I)$

where $I$ is an identity matrix whose size is $d \times d$.
Proof.
By using eigendecomposition, we can represent $\Sigma$ as $\Sigma = U \Lambda U^{\mathrm{T}}$, where $U$ is an orthogonal matrix of size $d \times d$ and $\Lambda$ is a diagonal matrix of the eigenvalues of $\Sigma$. Since $\Sigma$ is assumed to be a positive semidefinite matrix, all eigenvalues are equal to or greater than 0. Thus, $\Sigma^{-1/2}$ can be computed as $\Sigma^{-1/2} = U \Lambda^{-1/2} U^{\mathrm{T}}$. By setting the covariance term of Equation (3.2) to $A = \Sigma^{-1/2}$, the covariance of the Gaussian distribution of $\hat{g}_t$ is represented as follows:

$\Sigma^{-1/2} \Sigma (\Sigma^{-1/2})^{\mathrm{T}} = U \Lambda^{-1/2} U^{\mathrm{T}} U \Lambda U^{\mathrm{T}} U \Lambda^{-1/2} U^{\mathrm{T}} = U \Lambda^{-1/2} \Lambda \Lambda^{-1/2} U^{\mathrm{T}} = U U^{\mathrm{T}} = I$

In the above formulation, since $U$ is an orthogonal matrix, we use $U^{\mathrm{T}} U = I$ and $U U^{\mathrm{T}} = I$. As a result, we have the distribution of Equation (3.3). ∎
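Theorem 1 can be verified numerically via the eigendecomposition used in the proof; the covariance matrix below is an illustrative, made-up example (taken to be positive definite so that $\Lambda^{-1/2}$ is well defined).

```python
import numpy as np

def inv_sqrt(sigma):
    """Sigma^{-1/2} via eigendecomposition: Sigma = U diag(lam) U^T implies
    Sigma^{-1/2} = U diag(lam^{-1/2}) U^T (all eigenvalues assumed > 0 here)."""
    lam, u = np.linalg.eigh(sigma)
    return u @ np.diag(lam ** -0.5) @ u.T

# Made-up positive-definite covariance of the gradient noise.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
a = inv_sqrt(sigma)

# Covariance of the transformed gradient g_hat = A g is A Sigma A^T = I (Theorem 1).
transformed_cov = a @ sigma @ a.T
```

Here `transformed_cov` equals the 2×2 identity up to floating-point error, matching Equation (3.3).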
The above theorem indicates that the transformation $\hat{g}_t = \Sigma^{-1/2} g_t$ results in a Gaussian distribution for $\hat{g}_t$ whose covariance matrix is the identity matrix $I$. In other words, we can control the covariance matrix to be $I$ by using $\Sigma^{-1/2} g_t$ instead of $g_t$.

Our preconditioning transforms the variance of the first-order gradients into 1 by using $\Sigma^{-1/2} g_t$. However, the transformed gradient may have an extremely large value if the variance is fixed at 1. Thus, we introduce hyperparameter $\sigma$ to generalize our preconditioning. Specifically, by using the transformation $\hat{g}_t = \sigma \Sigma^{-1/2} g_t$ instead of $\hat{g}_t = \Sigma^{-1/2} g_t$, we have the following distribution:

(3.4)  $\hat{g}_t \sim \mathcal{N}(\sigma \Sigma^{-1/2} \nabla f(\theta_t), \sigma^2 I)$

The above equation shows that $\sigma$ controls the value of the covariance matrix, while the previous transformation only gives an identity matrix as shown in Equation (3.3). We show that $\sigma$ has the same role as learning rate $\eta$ when we derive SDProp in the next section.
Since we compute the first-order gradients at each time step in SGD, we have to compute the covariance matrix incrementally, although Theorem 1 is based on the property that $\Sigma$ is a positive semidefinite matrix. In order to incrementally compute $\Sigma$ as a positive semidefinite matrix, we use the online updating rule of [12] as follows:

(3.5)  $C_t = \gamma C_{t-1} + \gamma (1 - \gamma) (g_t - m_{t-1})(g_t - m_{t-1})^{\mathrm{T}}$

(3.6)  $m_t = \gamma m_{t-1} + (1 - \gamma) g_t$

where $m_t$ is the moving average of $g_t$, $C_t$ is the estimate of the covariance matrix at time $t$, and $\gamma$ is the hyperparameter of the decay rate for the moving average that has $0 \le \gamma < 1$. $m_0$ and $C_0$ are initialized as $m_0 = 0$ and $C_0 = 0$. The above updating rule gives the following property:
Theorem 2.

$C_t$ computed by Equations (3.5) and (3.6) is a positive semidefinite matrix.
Proof.
In order to prove Theorem 2, we first prove that $(g_t - m_{t-1})(g_t - m_{t-1})^{\mathrm{T}}$ in Equation (3.5) is a positive semidefinite matrix. By setting $u_t = g_t - m_{t-1}$, we have:

$z^{\mathrm{T}} u_t u_t^{\mathrm{T}} z = (u_t^{\mathrm{T}} z)^2 \ge 0$

By the definition of positive semidefinite matrices, if a matrix $M$ satisfies $z^{\mathrm{T}} M z \ge 0$ for every nonzero column vector $z$ of real numbers, $M$ is a positive semidefinite matrix. Since the above inequality shows that $z^{\mathrm{T}} u_t u_t^{\mathrm{T}} z \ge 0$ holds, it is clear that $u_t u_t^{\mathrm{T}}$ is a positive semidefinite matrix even if $m_{t-1}$ in Equation (3.6) has any real value.

Then, we prove that $C_t$ in Equation (3.5) is a positive semidefinite matrix by mathematical induction.

Initial step: If $t = 1$, the initialization yields $m_0 = 0$ and $C_0 = 0$. Since $C_1$ is computed as $C_1 = \gamma (1 - \gamma) g_1 g_1^{\mathrm{T}}$ by using Equations (3.5) and (3.6), $C_1$ is a positive semidefinite matrix. This is because $g_1 g_1^{\mathrm{T}}$ is a positive semidefinite matrix as proved above.

Inductive step: We assume that $C_{t-1}$ is a positive semidefinite matrix. Since $C_t$ is computed as $C_t = \gamma C_{t-1} + \gamma (1 - \gamma) u_t u_t^{\mathrm{T}}$ by using Equations (3.5) and (3.6), $z^{\mathrm{T}} C_t z$ is represented as follows:

$z^{\mathrm{T}} C_t z = \gamma z^{\mathrm{T}} C_{t-1} z + \gamma (1 - \gamma) z^{\mathrm{T}} u_t u_t^{\mathrm{T}} z$

In the above equation, $z^{\mathrm{T}} C_{t-1} z \ge 0$ and $z^{\mathrm{T}} u_t u_t^{\mathrm{T}} z \ge 0$ because $C_{t-1}$ and $u_t u_t^{\mathrm{T}}$ are positive semidefinite matrices. Therefore, $C_t$ is a positive semidefinite matrix because $z^{\mathrm{T}} C_t z \ge 0$ holds in the above equation. This completes the inductive step. ∎
Thus, if we compute $C_t$ by using Equations (3.5) and (3.6), we can execute the preconditioning specified by Theorem 1.
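A minimal sketch of the incremental estimate in Equations (3.5) and (3.6), with a numerical check of the positive semidefiniteness guaranteed by Theorem 2; the stream of random "gradients" and the decay rate are illustrative assumptions.

```python
import numpy as np

def update_cov(m_prev, c_prev, g, gamma=0.99):
    """One step of Equations (3.5)-(3.6):
    C_t = gamma*C_{t-1} + gamma*(1-gamma)*(g_t - m_{t-1})(g_t - m_{t-1})^T
    m_t = gamma*m_{t-1} + (1-gamma)*g_t"""
    u = g - m_prev                                   # uses m_{t-1}, as in (3.5)
    c = gamma * c_prev + gamma * (1.0 - gamma) * np.outer(u, u)
    m = gamma * m_prev + (1.0 - gamma) * g
    return m, c

# Feed a stream of made-up noisy gradients; C_t stays positive semidefinite.
rng = np.random.default_rng(1)
m, c = np.zeros(3), np.zeros((3, 3))
for _ in range(200):
    m, c = update_cov(m, c, rng.normal(size=3))
min_eig = np.linalg.eigvalsh(c).min()                # >= 0 up to rounding
```

Note that (3.5) must be evaluated before (3.6) within a step, since it uses the previous mean $m_{t-1}$.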
Note that Hessian-based preconditioning cannot control the oscillation of the first-order gradients. This is because its transformation results in a distribution whose covariance matrix, $D D^{\mathrm{T}} \Sigma (D D^{\mathrm{T}})^{\mathrm{T}}$, is uncontrollable. In addition, since the Hessian may not be a positive semidefinite matrix, it is difficult to compute $H^{-1/2}$. Therefore, our covariance matrix based preconditioning inherently differs from Hessian-based preconditioning. Our preconditioning is more suitable than Hessian-based preconditioning for handling the oscillation triggered by the noise in the first-order gradients.
3.3. Algorithm
Since deep neural networks have a large number of parameters, the idea described in the previous section incurs a large memory consumption of $O(d^2)$, where $d$ is the number of parameters. In addition, it costs $O(d^3)$ time to compute $\Sigma^{-1/2}$ by using eigenvalue decomposition [16]. To avoid these problems, we employ a diagonal preconditioning matrix. Since this approach only needs the diagonal terms of $C_t$, the memory and computation costs are $O(d)$. Although this approach ignores the correlations between first-order gradients, it is sufficient to control the oscillation in each dimension. This is because the diagonal of $C_t$ represents the variances of the oscillations, as described in the previous section. By picking the diagonal of Equation (3.4), the updating rule is:

(3.7)  $\theta_{t+1,i} = \theta_{t,i} - \frac{\sigma}{\sqrt{C_{t,ii}}} g_{t,i}$
We rewrite this updating rule (all steps) as follows:
(3.8)  $m_{t,i} = \gamma m_{t-1,i} + (1 - \gamma) g_{t,i}$

(3.9)  $v_{t,i} = \gamma v_{t-1,i} + \gamma (1 - \gamma) (g_{t,i} - m_{t-1,i})^2$

(3.10)  $\theta_{t+1,i} = \theta_{t,i} - \frac{\sigma}{\sqrt{v_{t,i}} + \epsilon} g_{t,i}$

where $m_{t,i}$ is the moving average of the first-order gradients for the $i$-th parameter at time $t$, and $\gamma$ is the hyperparameter of the decay rate for the moving average that has $0 \le \gamma < 1$. $v_{t,i}$ is the exponentially moving variance of the first-order gradients for the $i$-th parameter at time $t$. We use $\gamma$ in Equation (3.9) as the decay rate of the exponentially moving variance. $m_{0,i}$ and $v_{0,i}$ are initialized as $m_{0,i} = 0$ and $v_{0,i} = 0$, respectively. For stable computation, $\epsilon$ is set to a small positive value. Equation (3.10) corresponds to Equation (3.7). We call the algorithm SDProp because Equation (3.10) includes the Standard Deviation $\sqrt{v_{t,i}}$. Although $v_{t,i}$ includes the bias imposed by initialization, we can remove the bias in the same way as [3].
Notice that $\sigma$ takes the same role as learning rate $\eta$ in Equation (2.3) of RMSProp. Therefore, Equation (3.10) divides the learning rate by the square root of the centered variance $v_{t,i}$, while Equation (2.3) of RMSProp divides the learning rate by the square root of the uncentered variance. In other words, RMSProp and its follow-up methods such as Adam adapt the learning rate by the magnitudes of the gradients, while we adapt it by the variance of the gradients. Although RMSProp and SDProp have similar updating rules, they have totally different goals, as described in the previous sections. RMSProp executes Hessian-based preconditioning while SDProp executes covariance matrix based preconditioning.
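The contrast between the two rules can be made concrete with per-parameter implementations; the toy quadratic objective, noise scale, and hyperparameter values below are illustrative assumptions, not the tuned values from the experiments.

```python
import numpy as np

def rmsprop_step(theta, g, v, lr=0.001, gamma=0.99, eps=1e-8):
    """Equations (2.2)-(2.3): uncentered second moment of the gradients."""
    v = gamma * v + (1.0 - gamma) * g ** 2
    theta = theta - lr / (np.sqrt(v) + eps) * g
    return theta, v

def sdprop_step(theta, g, m, v, sigma=0.001, gamma=0.99, eps=1e-8):
    """Equations (3.8)-(3.10): exponentially moving *centered* variance;
    sigma plays the role the learning rate plays in RMSProp."""
    v = gamma * v + gamma * (1.0 - gamma) * (g - m) ** 2   # uses m_{t-1}
    m = gamma * m + (1.0 - gamma) * g
    theta = theta - sigma / (np.sqrt(v) + eps) * g
    return theta, m, v

# Toy comparison on f(theta) = 0.5*theta^2 with additive gradient noise.
rng = np.random.default_rng(2)
th_r, v_r = 5.0, 0.0
th_s, m_s, v_s = 5.0, 0.0, 0.0
for _ in range(3000):
    noise = rng.normal(scale=0.1)
    th_r, v_r = rmsprop_step(th_r, th_r + noise, v_r, lr=0.01)
    th_s, m_s, v_s = sdprop_step(th_s, th_s + noise, m_s, v_s, sigma=0.01)
```

The only structural difference is the subtraction of the running mean `m` in (3.9), i.e., centered versus uncentered variance; both variants drive the toy parameter toward the optimum.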
4. Experiments
We performed experiments to compare SDProp to RMSProp and Adam, a state-of-the-art algorithm based on RMSProp. [3] shows that Adam is more efficient and effective than RMSProp and AdaDelta by integrating momentum into RMSProp. First, we show the efficiency and effectiveness of our approach by using CNNs. Second, since SDProp effectively handles the oscillation described in the previous section, we evaluate SDProp with small minibatches, which suffer from noise in the first-order gradients. Third, we show the efficiency and effectiveness of SDProp for RNNs. Fourth, we demonstrate the effectiveness of SDProp for a 20-layer fully-connected neural network, which is difficult to train due to its many saddle points.
4.1. Efficiency and Effectiveness for CNN
We investigate the efficiency and effectiveness of SDProp. We used 4 datasets for image classification: Cifar10, Cifar100 [17], SVHN [18], and MNIST. The experiments were conducted on a 7-layer CNN with the ReLU activation function. The loss function was the negative log likelihood. We compared SDProp to RMSProp and Adam. In SDProp, we tried various combinations of the hyperparameters $\sigma$ and $\gamma$ by grid search; in RMSProp, we similarly tried combinations of the learning rate $\eta$ and decay rate $\gamma$. For each of SDProp, RMSProp, and Adam, we report the setting that achieved the lowest loss. The minibatch size was 128. The number of epochs was 50. We use the training loss to evaluate the algorithms because they optimize the training criterion.
Figure 1 shows the training losses for each dataset. On Cifar10, Cifar100, and SVHN, SDProp yielded lower losses than RMSProp and Adam in early epochs. On MNIST, although the training losses of SDProp and Adam both nearly reached 0.0, SDProp reduced the loss faster than Adam. SDProp needs 50% fewer training iterations than RMSProp to reach its final training loss on Cifar10, Cifar100, and MNIST. This suggests that our covariance matrix based preconditioning is more efficient and effective than Hessian-based preconditioning in the minibatch setting, because RMSProp and Adam approximate Hessian-based preconditioning as described in the preliminary section. Since SDProp captures the noise, it effectively reduces the loss even if the gradients are noisy. In the next experiment, we investigate the performance of SDProp in terms of its effectiveness against noise by using noisy first-order gradients.
4.2. Sensitivity to Minibatch Size
Table 1: Final training accuracy (%) on Cifar10 for each minibatch size.

Method   | 16    | 32    | 64    | 128
---------|-------|-------|-------|------
RMSProp  | 81.42 | 93.10 | 94.98 | 95.07
Adam     | 83.24 | 93.57 | 95.48 | 97.12
SDProp   | 90.17 | 94.87 | 96.54 | 97.31
The previous experimental results show that SDProp is more efficient and effective than the existing methods because it handles the gradient noise well, both by design and in practice. In other words, SDProp is expected to effectively train the model even if we use small minibatch sizes, which incur noisy first-order gradients [19]. Therefore, we investigated the sensitivity of SDProp and the existing methods to minibatch size. While the main purpose of this experiment is to reveal one performance attribute of SDProp, the result suggests that SDProp can be used on devices with scant memory that must use small minibatches.
We compared SDProp to RMSProp and Adam using minibatch sizes of 16, 32, 64 and 128. We used the Cifar10 dataset for the 10class image classification task. We used CNN as per the previous section. The hyperparameters are also the same as the previous section; they are tuned by grid search. The number of epochs was 50.
Table 1 shows the final training accuracies. SDProp outperforms RMSProp and Adam for all minibatch sizes examined. Specifically, although the small minibatch size of 16 incurs very noisy first-order gradients, SDProp clearly achieves effective training, unlike RMSProp and Adam. In addition, Table 1 shows that the superiority of our approach over RMSProp and Adam increases as the minibatch size falls. For example, our approach is 8.75 percentage points more accurate than RMSProp at a minibatch size of 16, but only 2.24 points more accurate at a minibatch size of 128. This indicates that our covariance matrix based preconditioning effectively handles the noise of the first-order gradients.
4.3. Efficiency and Effectiveness for RNN
We evaluated the efficiency and effectiveness of SDProp for the Recurrent Neural Network (RNN). In this experiment, we predicted the next character from the previous characters via a character-level RNN. We used a subset of the Shakespeare dataset and the source code of the Linux kernel as datasets [20]. The size of the internal state was 128. The preprocessing of the datasets followed [20]. The minibatch size was 128. In both SDProp and RMSProp, we tuned the hyperparameters by grid search ($\sigma$ and $\gamma$ for SDProp; $\eta$ and $\gamma$ for RMSProp). The training criterion was cross entropy. We used gradient clipping and learning rate decay. Gradient clipping is a popular approach for scaling down the gradients by manually setting a threshold; it prevents gradients from exploding in RNN training [7]. We set the threshold to 5.0. We decayed the learning rate every tenth epoch by a factor of 0.97 for RMSProp, following [20]. In SDProp, $\sigma$ was decayed in the same way as the learning rate of RMSProp.

Figure 2 shows the results for the Shakespeare dataset and the Linux kernel source code. SDProp reduces the training loss faster than RMSProp. Since SDProp effectively handles the noise induced by the minibatch setting, it can efficiently train models other than CNNs, such as RNNs.
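The norm-based gradient clipping used here (threshold 5.0) can be sketched as follows; the example vectors are made up for illustration.

```python
import numpy as np

def clip_by_norm(grad, threshold=5.0):
    """Scale grad so its L2 norm is at most `threshold`, preserving direction;
    gradients already within the threshold pass through unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

small = clip_by_norm(np.array([3.0, 4.0]))    # norm 5.0: unchanged
large = clip_by_norm(np.array([30.0, 40.0]))  # norm 50.0: rescaled to norm 5.0
```

Because only the magnitude is rescaled, the update direction is preserved, which is why clipping tames exploding gradients without redirecting the descent.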
4.4. 20-Layer Fully-connected Neural Network
Table 2: Training accuracy (%) of the 20-layer fully-connected network on MNIST (average, best, and worst over 10 runs).

Method   | $\eta$ or $\sigma$ | $\gamma$ | Ave.  | Best  | Worst
---------|--------------------|----------|-------|-------|------
RMSProp  | 0.001              | 0.9      | 92.86 | 97.95 | 84.79
RMSProp  | 0.001              | 0.99     | 98.81 | 99.11 | 98.34
SDProp   | 0.001              | 0.9      | 93.77 | 97.90 | 87.57
SDProp   | 0.001              | 0.99     | 99.20 | 99.42 | 99.09
In this section, we performed experiments to evaluate the effectiveness of SDProp in training deep fully-connected neural networks. [6] suggests that the number of saddle points increases exponentially with the dimensionality of the parameters. Since deep fully-connected networks typically have parameters of higher dimensionality than other models such as CNNs, this optimization problem has many saddle points. The problem is challenging because SGD progresses slowly around saddle points [6].

We used a very deep fully-connected network with 20 hidden layers, 50 hidden units per layer, and ReLU activation functions. We used the MNIST dataset for the 10-class image classification task. This setting is the same as that used by [21] to evaluate the effectiveness of SGD with high-dimensional parameters. Note that MNIST is sufficient for our evaluation because, unlike CNNs, fully-connected networks do not saturate the accuracy in our experiment. Our purpose is to evaluate effectiveness in the setting of very high-dimensional parameters; thus, it is sufficient if the accuracy is not saturated. The training criterion was the negative log likelihood. The minibatch size was 128. We initialized the parameters from a Gaussian with mean 0 and standard deviation 0.01, following [21]. We compared SDProp to RMSProp, trying combinations of the hyperparameters $\sigma$ and $\gamma$ for SDProp and $\eta$ and $\gamma$ for RMSProp. The number of epochs was 50. Although these algorithms can be trapped around saddle points, the frequency with which this happens may depend on the parameter initialization. Therefore, we performed 10 runs for each of the above settings.
Table 2 lists the results for the best settings of $\eta$ (or $\sigma$) and $\gamma$. It shows the average, best, and worst training accuracies for each setting. The results show that SDProp achieves higher accuracy than RMSProp for the best setting. In addition, the difference between the best and worst accuracies of SDProp is smaller than that of RMSProp. Since SDProp effectively handles the randomness of the noise, it can reduce the uncertainty of the results. These results show that SDProp effectively trains models that have very high-dimensional parameters.
5. Conclusion
We proposed SDProp for the effective and efficient training of deep neural networks. Our approach uses covariance matrix based preconditioning to effectively handle the noise present in the first-order gradients. Our experiments showed that, for various datasets and models, SDProp is more efficient and effective than existing methods. In addition, SDProp achieved high accuracy even when the first-order gradients were noisy.
References

[1] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the Gradient by a Running Average of its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[2] Matthew Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.
[3] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR), 2014.
[4] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[5] Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated Adaptive Learning Rates for Non-convex Optimization. In NIPS, pages 1504–1512, 2015.
[6] Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization. In NIPS, pages 2933–2941, 2014.
[7] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the Difficulty of Training Recurrent Neural Networks. In ICML, pages 1310–1318, 2013.
[8] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[9] Jeffrey Elman. Finding Structure in Time. Cognitive Science, 14(2):179–211, 1990.
[10] Yasuhiro Fujiwara, Yasutoshi Ida, Junya Arai, Mai Nishimura, and Sotetsu Iwamura. Fast Algorithm for the Lasso based L1-Graph Construction. Proceedings of the VLDB Endowment (PVLDB), 10(3):229–240, 2016.
[11] Yasuhiro Fujiwara, Yasutoshi Ida, Hiroaki Shiokawa, and Sotetsu Iwamura. Fast Lasso Algorithm via Selective Coordinate Descent. In AAAI, pages 1561–1567, 2016.
[12] Suvrit Sra, Sebastian Nowozin, and Stephen Wright. Optimization for Machine Learning. MIT Press, 2012.
[13] Yasutoshi Ida, Takuma Nakamura, and Takashi Matsumoto. Domain-dependent/independent Topic Switching Model for Online Reviews with Numerical Ratings. In CIKM, pages 229–238, 2013.
[14] Yukikatsu Fukuda, Yasutoshi Ida, Takashi Matsumoto, Naohiro Takemura, and Kaoru Sakatani. A Bayesian Algorithm for Anxiety Index Prediction based on Cerebral Blood Oxygenation in the Prefrontal Cortex Measured by Near Infrared Spectroscopy. IEEE Journal of Translational Engineering in Health and Medicine, 2:1–10, 2014.
[15] Hiroki Miyashita, Takuma Nakamura, Yasutoshi Ida, Takashi Matsumoto, and Takashi Kaburagi. Nonparametric Bayes-based Heterogeneous Drosophila Melanogaster Gene Regulatory Network Inference: T-process Regression. In International Conference on Artificial Intelligence and Applications, pages 51–58, 2013.
[16] Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217–288, 2011.
[17] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images, 2009.
[18] Pierre Sermanet, Sandhya Chintala, and Yann LeCun. Convolutional Neural Networks Applied to House Numbers Digit Classification. In International Conference on Pattern Recognition (ICPR), pages 3288–3291, 2012.
[19] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction using Mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
[20] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and Understanding Recurrent Networks. arXiv preprint arXiv:1506.02078, 2015.
[21] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv preprint arXiv:1511.06807, 2015.