1 Introduction
Deep neural networks have been advancing the state-of-the-art performance on a number of tasks in artificial intelligence, from speech recognition (Hinton et al., 2012) and image recognition (He et al., 2016) to natural language understanding (Hochreiter & Schmidhuber, 1997). These problems are typically formulated as minimizing non-convex objectives parameterized by neural network models. The models are usually trained with stochastic gradient descent (SGD) or its variants, with the gradient information computed through back-propagation (BP) (Rumelhart et al., 1986). However, the magnitudes of gradient components often vary significantly in a neural network. Recall that one coordinate of the gradient is the directional derivative along that coordinate: it represents how a change of the weight will affect the loss, rather than how we should modify the weight to minimize the loss. Thus vanilla SGD with a single learning rate can be problematic for optimizing deep neural networks because of the inconsistent magnitudes of the gradient components. In practice, an extremely small learning rate alleviates this problem but leads to slow convergence. Momentum (Rumelhart et al., 1986; Qian, 1999; Nesterov, 2013; Sutskever et al., 2013) mitigates this problem by accumulating velocity along coordinates with small magnitudes but consistent directions and reducing the velocity along coordinates with large magnitudes but alternating directions. Adaptive learning rate algorithms (Duchi et al., 2011; Kingma & Ba, 2014) scale the coordinates of the gradient by the reciprocals of some averages of their past magnitudes, confirming from another angle that weakening the effect of the magnitudes of the gradient components can be favorable to the optimization.
We want to address this problem from another perspective. Ye et al. (2017) suggests that the magnitude inconsistency of gradient components arises mainly across layers. We can get a hint by scrutinizing the back-propagation through a fully connected layer (we omit bias terms for simplicity) with output $y \in \mathbb{R}^m$, input $x \in \mathbb{R}^n$ and weight parameter $W \in \mathbb{R}^{m\times n}$. The layer mapping is given by $y = Wx$. If $\Delta y$ is the backpropagated value on the output $y$, i.e., the vector of partial derivatives of the loss with respect to $y$, the back-propagation equation is given by

$\frac{\partial \ell}{\partial x_j} = w_j^\top \Delta y,$

where $w_j$ is the $j$-th column of $W$ (representing all the connections emanating from unit $x_j$) and $\frac{\partial \ell}{\partial x_j}$ is the backpropagated value on the layer's input computed through back-propagation. If the rows of $W$ are initialized with (roughly) unit length to preserve the forward signal, then back-propagation through this layer shrinks the magnitude of the backward signal heavily when the number of inputs is much larger than the number of outputs.
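To make this shrinkage concrete, the following numpy sketch (our illustration, not part of the paper) compares the typical per-coordinate magnitude of the backpropagated signal before and after a fully connected layer whose rows have roughly unit length; the dimensions 4096 and 64 are arbitrary choices.

# Minimal numpy sketch (ours): back-propagating through y = W x shrinks the
# per-coordinate magnitude of the backward signal when fan-in >> fan-out.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4096, 64                               # fan-in much larger than fan-out

# Rows of W have roughly unit length, which preserves the forward signal y = W x.
W = rng.standard_normal((m, n)) / np.sqrt(n)

delta_y = rng.standard_normal(m)              # backpropagated value on the output
grad_x = W.T @ delta_y                        # BP: d(loss)/d(x_j) = w_j^T delta_y

print("typical |delta_y_i|:", np.sqrt(np.mean(delta_y ** 2)))    # about 1
print("typical |d loss/d x_j|:", np.sqrt(np.mean(grad_x ** 2)))  # about sqrt(m/n), i.e. 0.125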
We suggest a principled way to overcome the problem of magnitude inconsistency of gradient components across layers in back-propagation. Specifically, we compute changes $\Delta W$ on the weight parameter and $\Delta x$ on the input, respectively, to match the error guiding signal $\Delta y$ as closely as possible. This motivates us to formulate the backward pass through the fully connected layer as solving a group of least-squares problems. Hiding technical details and making some assumptions, we propose the back-matching propagation as follows,
(1)   $\Delta x_j = \arg\min_{\delta}\ \big\| \Delta y - w_j\,\delta \big\|_2^2, \qquad j = 1,\dots,n,$

(2)   $\Delta W = \arg\min_{\delta W}\ \mathbb{E}\big\| \delta W\, x - \Delta y \big\|_2^2,$
where the expectation is over the data points in a mini batch. A direct explanation of (2) is that we want to change the weight matrix by $\Delta W$ so as to produce the desired change $\Delta y$ on the output (or to get sufficiently close to it) given the current input $x$; equation (1) can be explained in the same way. We then use $\Delta W$ to update the parameter and use $\Delta x$ as the error guiding signal to backpropagate to lower layers.
For the back-matching propagation (1), we need to compute $\|w_j\|_2^2$, which is easy since it is a scalar. For the parameter update solution (2), we need to compute the inverse $\big(\mathbb{E}[x x^\top]\big)^{-1}$. This requires a large number of matrix inverse operations, roughly as many as the number of neurons, and each inverse requires a number of flops cubic in the number of neurons in one layer. This hinders its application to large neural networks, which typically contain tens of thousands of neurons in a single layer.
Fortunately, we can work with the batch normalization (BN) technique (Ioffe & Szegedy, 2015) to circumvent this difficulty. With batch normalization, we can regard $\mathbb{E}[x x^\top]$ approximately as an identity matrix and remove the inverse in (2). Then, with some further approximation, we reduce the back-matching propagation to a layer-wise gradient adaptation strategy, which can be viewed as layer-wise adaptive learning rates when applying pure SGD. As such a layer-wise gradient adaptation strategy is built upon the regular BP process, it is easy to implement in current deep learning frameworks (Bastien et al., 2012; Abadi et al., 2016; Paszke et al., 2017; Seide & Agarwal, 2016). Moreover, this strategy also works naturally with other popular optimization techniques (momentum, adaptive algorithms such as Adam and Adagrad, weight decay) to achieve possibly higher performance in machine learning tasks. We expect this layer-wise gradient adaptation strategy to accelerate the training procedure. Surprisingly, it often also improves the test accuracy by a considerable margin in practice.

1.1 Related Works
Training neural networks with layer-wise adaptive learning rates has been proposed in several previous works. Specifically, Singh et al. (2015) suggested a layer-specific learning rate for layer $l$ that is derived from the norm of that layer's gradient, and You et al. (2017) suggested scaling the learning rate of layer $l$ by the ratio of the norm of the layer's weights to the norm of the layer's gradient, and demonstrated that this benefits large-batch training. However, the suggestions in both works come mainly from empirical experience and do not explain why the rates are set in that way.
Our paper is related to the block-diagonal second-order algorithms (Lafond et al., 2017; Zhang et al., 2017; Grosse & Martens, 2016). Specifically, Lafond et al. (2017) proposes a weight reparametrization scheme with a diagonal rescaling step-size and shows its potential advantages over batch normalization. Zhang et al. (2017) proposes a block-diagonal Hessian-free method to train neural networks and shows a faster convergence rate than first-order methods. Martens & Grosse (2015); Grosse & Martens (2016) propose the Kronecker-factored approximation (KFA) method to approximate the natural gradient using a block-diagonal or block-tridiagonal approximation of the Fisher matrix. These second-order algorithms all share a layer-wise or block-diagonal structural design, which agrees with our algorithm. However, our layer-wise adaptive learning rate comes from the perspective of back-matching propagation and is different from the second-order approximations.
Our paper is also related to Riemannian algorithms (Amari, 1998; Ollivier, 2015; Marceau-Caron & Ollivier, 2016). Specifically, Ollivier (2015) proposes preconditioning the parameter update with a backpropagated metric, and Ye et al. (2017) similarly advocates rescaling the parameter update by backpropagated quantities. In comparison, the back-matching propagation comes from a different perspective, namely that the backpropagated values should match the error guiding signal. Our layer-wise gradient adaptation strategy, which is derived from the back-matching propagation, is simpler than the Riemannian algorithms in terms of implementation and computational complexity.
2 Back-matching Propagation
In this section, we present how the back-matching propagation works for several popular types of layers. Specifically, given the backpropagated value on a layer's output, we derive the formulas for the backpropagated values on the layer's input and on the layer's parameters based on the back-matching propagation. Moreover, we compare the back-matching propagation to the regular BP.
We introduce several notations here (some have already been used in the Introduction). Let $\ell$ denote the objective (loss function). We use $y$ and $x$ to denote a layer's output and input respectively, and use $W$ to denote the layer's parameter. We use $\Delta y$ to denote the backpropagated value on the layer's output. We then use $\frac{\partial \ell}{\partial x}$ and $\frac{\partial \ell}{\partial W}$ to denote the backpropagated values computed through BP, and use $\Delta x$ and $\Delta W$ to denote the backpropagated values computed through back-matching propagation.

Let us briefly review the regular BP here. BP propagates derivatives from the top layer back to the bottom one. Suppose we are dealing with a general layer with forward mapping $y = f(x; W)$. Then the derivative of the loss with respect to a specific output component $y_i$ is $\Delta y_i$. The BP equations are given by

(3)   $\frac{\partial \ell}{\partial x_j} = \sum_i \frac{\partial y_i}{\partial x_j}\,\Delta y_i,$

(4)   $\frac{\partial \ell}{\partial W} = \mathbb{E}_z\Big[\sum_i \frac{\partial y_i}{\partial W}\,\Delta y_i\Big],$

where $z$ represents a data point and the expectation is over the data points in a mini batch.
Next we present how the back-matching propagation backpropagates through specific layers. To ease comparison, for each type of layer we first provide the BP formulas, then derive the corresponding formulas via back-matching propagation, and finally discuss the relation between the back-matching propagation and BP.
2.1 Fully Connected Layer
We first consider a fully connected layer, whose mapping function is given by (for simplicity we omit the bias term)

(5)   $y = Wx.$

Suppose the backpropagated values on the output are $\Delta y$. Following the back-propagation equations (3) and (4), we compute the backpropagated values on the input and on the weight parameter as follows,

(6)   $\frac{\partial \ell}{\partial x_j} = w_j^\top \Delta y,$

(7)   $\frac{\partial \ell}{\partial W} = \mathbb{E}_z\big[\Delta y\, x^\top\big],$

where $w_j$ is the $j$-th column of $W$.
We next derive the formulas for back-matching propagation, where we compute $\Delta W$ and $\Delta x$ that try to match the guiding signal $\Delta y$ as accurately as possible, in the sense of minimizing the squared error,

(8)   $\Delta W = \arg\min_{\delta W}\ \mathbb{E}_z\big\| \delta W\, x - \Delta y \big\|_2^2,$

(9)   $\Delta x = \arg\min_{\delta x}\ \big\| W\,\delta x - \Delta y \big\|_2^2.$
Note that by writing the matching problem as two independent problems (8) and (9), we presume that updating the parameters and propagating backward values are independent; such layer independence has also been used in block-diagonal second-order algorithms (Zhang et al., 2017; Lafond et al., 2017). Moreover, (8) is separable along the rows of $W$. Hence we obtain a bunch of (in total $m$) least-squares problems

(10)   $\Delta w^{(i)} = \arg\min_{\delta w}\ \mathbb{E}_z\big( \delta w^\top x - \Delta y_i \big)^2, \qquad i = 1,\dots,m,$

where $w^{(i)}$ is the $i$-th row of $W$ and $\Delta y_i$ represents the backpropagated values at neuron $i$ over one mini batch of data. We further assume that all $x_j$ are updated independently, based on the intuition that a neuron does not know the other neurons' on/off states, so a fair strategy for each neuron is to try its best to match the guiding signal by itself. Then (9) becomes a bunch of (in total $n$) least-squares problems

(11)   $\Delta x_j = \arg\min_{\delta}\ \big\| \Delta y - w_j\,\delta \big\|_2^2, \qquad j = 1,\dots,n,$

where $w_j$ contains the weights emanating from neuron $j$ (the column of $W$ corresponding to neuron $x_j$). We call equations (8) and (11) the back-matching propagation rule. Solving the least-squares problems (11) and (8) gives us

(12)   $\Delta x_j = \dfrac{w_j^\top \Delta y}{\|w_j\|_2^2},$

(13)   $\Delta w^{(i)} = \big(\mathbb{E}_z[x\, x^\top]\big)^{-1}\,\mathbb{E}_z\big[\Delta y_i\, x\big].$
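The closed forms (12) and (13) above are reconstructed from the surrounding text; as a sanity check under that reading, the following numpy sketch (ours) solves the per-coordinate and per-row least-squares problems with a generic solver and compares the results with the closed forms.

# Sketch (ours) checking the reconstructed back-matching solutions for a fully
# connected layer against generic least-squares solvers.
import numpy as np

rng = np.random.default_rng(1)
n, m, B = 8, 5, 64                       # input dim, output dim, mini-batch size
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, B))          # a mini batch of layer inputs
DY = rng.standard_normal((m, B))         # error guiding signal Delta y on the outputs

# (13): row-wise weight update, Delta w^(i) = (E[x x^T])^{-1} E[Delta y_i x]
Exx = X @ X.T / B
dW_closed = np.linalg.solve(Exx, X @ DY.T / B).T          # shape (m, n)
dW_lstsq = np.stack([np.linalg.lstsq(X.T, DY[i], rcond=None)[0] for i in range(m)])
print("weight update matches:", np.allclose(dW_closed, dW_lstsq))

# (12): per-coordinate input update for one data point, Delta x_j = w_j^T Delta y / ||w_j||^2
dy = DY[:, 0]
dx_closed = (W.T @ dy) / np.sum(W ** 2, axis=0)
dx_lstsq = np.array([np.linalg.lstsq(W[:, [j]], dy, rcond=None)[0][0] for j in range(n)])
print("input update matches:", np.allclose(dx_closed, dx_lstsq))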
2.2 Convolutional Layer
In this section we study the back-matching propagation through a convolutional layer. The weight parameter $W$ is an array with dimensions $F \times C \times k_w \times k_h$, where $F$ and $C$ are the numbers of output and input features respectively, and $k_w$ and $k_h$ are the width and height of the convolutional kernels. We use $y^{(f)}_s$ to denote the output at location $s$ of feature $f$ and $x^{(c)}_t$ to denote the input at location $t$ of feature $c$; the forward process is then

(16)   $y^{(f)}_s = \sum_{c=1}^{C}\sum_{p\in\mathcal K} W^{(f,c)}_{p}\, x^{(c)}_{s+p},$

where $\mathcal K$ indexes the $k_w \times k_h$ kernel offsets, and the BP is given by

(17)   $\frac{\partial \ell}{\partial x^{(c)}_t} = \sum_{f=1}^{F}\sum_{p\in\mathcal K} W^{(f,c)}_{p}\,\Delta y^{(f)}_{t-p},$

(18)   $\frac{\partial \ell}{\partial W^{(f,c)}_{p}} = \mathbb{E}_z\Big[\sum_{s}\Delta y^{(f)}_{s}\, x^{(c)}_{s+p}\Big].$
However, this form of the forward and backward process of the convolutional layer makes the derivation of back-matching propagation cumbersome. Note that the convolution operation essentially performs dot products between the convolution kernels and local regions of the input, so the forward pass of a convolutional layer can be formulated as one big matrix multiply via the im2col operation. In order to describe the back-matching process clearly, we rewrite the convolutional layer's forward and backward pass with the im2col operation. We use $W_r$ and $W_c$ to denote the weight matrices with dimensions $F\times C k_w k_h$ and $C k_w k_h \times F$, respectively, both stretched out from $W$. To mimic the convolutional operation, we rearrange the input features into a big matrix $\tilde X$ through the im2col operation: each column of $\tilde X$ is composed of the elements of $x$ that are used to compute one output location. Thus if the output $y$ has $L$ spatial locations, $\tilde X$ has dimension $C k_w k_h \times L$. Furthermore, we flatten the spatial dimensions of the output into a matrix $\tilde Y \in \mathbb{R}^{F\times L}$, whose $l$-th column $\tilde y_l$ is the output at location $l$. The forward process (16) of the convolutional layer can then be rewritten as

(19)   $\tilde Y = W_r\,\tilde X.$
Similarly, we can rewrite the regular BP (17) and (18) as

(20)   $\frac{\partial \ell}{\partial x_t} = M_t^\top\,\Delta y_{\mathcal N(t)},$

(21)   $\frac{\partial \ell}{\partial W_r} = \mathbb{E}_z\Big[\sum_{l=1}^{L}\Delta\tilde y_l\,\tilde x_l^\top\Big],$

where $M_t$ is composed of the weight components that interact with input location $t$ (it has approximately $\rho\, F k_w k_h$ elements, $\rho$ being a factor related to pooling and stride), $\Delta y_{\mathcal N(t)}$ is composed of the backpropagated values at the output locations that interact with input location $t$, and $\tilde x_l$ is the $l$-th column of $\tilde X$. With these notations, we can derive the formulas for back-matching propagation by solving the least-squares problems (11) and (8), given by

(22)   $\Delta x_t = \dfrac{M_t^\top\,\Delta y_{\mathcal N(t)}}{\|M_t\|_2^2},$

(23)   $\Delta W_r = \mathbb{E}_z\Big[\sum_{l}\Delta\tilde y_l\,\tilde x_l^\top\Big]\,\Big(\mathbb{E}_z\Big[\sum_{l}\tilde x_l\,\tilde x_l^\top\Big]\Big)^{-1}.$
We can see that the back-matching formulas (22) and (23) are the corresponding BP formulas (20) and (21) rescaled by a number or a matrix. As the convolutional layer is essentially a linear mapping, the formulas here are similar to those of the fully connected layer, although they are more involved.
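To make the im2col view concrete, here is a small sketch (ours; stride 1 and no padding for simplicity, with variable names that are not the paper's) that builds the matrix $\tilde X$ and computes the convolution as one matrix multiply, checked against a direct loop-based convolution.

# Sketch (ours): forward pass of a convolutional layer as a single matrix
# multiply via im2col, with stride 1 and no padding for simplicity.
import numpy as np

def im2col(x, kh, kw):
    # x: (C, H, W) input; returns a (C*kh*kw, H_out*W_out) matrix whose columns
    # hold the input elements used to compute one output location.
    C, H, W = x.shape
    H_out, W_out = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i:i + kh, j:j + kw]          # receptive field of output (i, j)
            cols[:, i * W_out + j] = patch.reshape(-1)
    return cols

rng = np.random.default_rng(2)
C, F, kh, kw, H, W = 3, 4, 3, 3, 8, 8
x = rng.standard_normal((C, H, W))
kernels = rng.standard_normal((F, C, kh, kw))

W_r = kernels.reshape(F, C * kh * kw)                 # stretched-out weight matrix
X_col = im2col(x, kh, kw)                             # (C*kh*kw, H_out*W_out)
y = (W_r @ X_col).reshape(F, H - kh + 1, W - kw + 1)  # forward pass as one matmul

# Direct (loop-based) convolution for comparison.
y_ref = np.zeros_like(y)
for f in range(F):
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            y_ref[f, i, j] = np.sum(kernels[f] * x[:, i:i + kh, j:j + kw])
print("im2col forward matches direct convolution:", np.allclose(y, y_ref))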
2.3 Batch Normalization Layer
Batch normalization (BN) is widely used to accelerate the training of feedforward neural networks. In practice, BN is usually inserted right before the activation function. We fix the affine transformation of batch normalization to be the identity. The BN layer mapping is then given by

(24)   $y_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}},$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the activation over the mini batch. The BP formula through the BN layer is given by (Ioffe & Szegedy, 2015)

(25)   $\dfrac{\partial \ell}{\partial x_i} = \dfrac{1}{B\sqrt{\sigma_B^2+\varepsilon}}\Big( B\,\Delta y_i - \sum_{j=1}^{B}\Delta y_j - y_i\sum_{j=1}^{B}\Delta y_j\, y_j \Big),$

where $B$ is the mini-batch size, and $\frac{\partial \ell}{\partial x_i}$ and $\Delta y_i$ are the backpropagated values on $x_i$ and $y_i$, respectively.
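The backward formula above is the standard one for a BN layer with identity affine transform; the following numpy sketch (ours) checks it against a finite-difference gradient.

# Sketch (ours): batch-normalization backward pass (identity affine) checked
# against a numerical gradient.
import numpy as np

eps = 1e-5

def bn_forward(x):
    mu, var = x.mean(), x.var()
    return (x - mu) / np.sqrt(var + eps)

def bn_backward(x, dy):
    # Backpropagated values on x given backpropagated values dy on the output.
    B = x.shape[0]                           # mini-batch size
    y = bn_forward(x)
    inv_std = 1.0 / np.sqrt(x.var() + eps)
    return inv_std / B * (B * dy - dy.sum() - y * (dy * y).sum())

rng = np.random.default_rng(3)
x = rng.standard_normal(16)                  # one activation across a mini batch
dy = rng.standard_normal(16)                 # backpropagated values on the output

# Numerical check: treat loss = dy^T bn_forward(x) and compare d loss / d x.
num = np.zeros_like(x)
h = 1e-6
for i in range(x.size):
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    num[i] = (dy @ bn_forward(xp) - dy @ bn_forward(xm)) / (2 * h)
print("analytic and numerical gradients agree:", np.allclose(bn_backward(x, dy), num, atol=1e-4))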
2.4 Rectified Linear Unit (ReLU)
We use $\sigma(u) = \max(u, 0)$ to denote the ReLU nonlinear function. The ReLU mapping is then given by

(28)   $y_j = \sigma(x_j).$

For the formula of BP, we have

(29)   $\frac{\partial \ell}{\partial x_j} = \mathbb{1}\{x_j > 0\}\,\Delta y_j.$

Following (11), we have the formula of back-matching propagation for the ReLU layer

(30)   $\Delta x_j = \arg\min_{\delta}\ \big( \Delta y_j - \sigma'(x_j)\,\delta \big)^2.$

Therefore the formula of back-matching propagation for ReLU is the same as that of BP,

(31)   $\Delta x_j = \mathbb{1}\{x_j > 0\}\,\Delta y_j.$
3 Layer-wise Adaptive Rate via Approximate Back-matching Propagation
The back-matching propagation involves a large number of matrix inverse operations, which is computationally prohibitive when training large neural networks. In this section we present how to approximate the back-matching propagation under certain assumptions, and we end up with a layer-wise adaptive rate strategy based on this approximation, which allows easy implementation in frameworks equipped with auto-differentiation.
3.1 Approximate Back-matching Propagation via BP
We first look at the formulas (13) and (23) for the back-matching propagated value on the weight parameter. Each is the gradient rescaled by the inverse of a matrix, which is prohibitive for large networks. We use batch normalization to circumvent this difficulty.

With batch normalization, we regard $\mathbb{E}_z[x\,x^\top]$ approximately as an identity matrix. From now on, we require that each intermediate layer is bonded with a batch normalization layer, except the output layer. Since BN has been widely used for accelerating the training process and improving the test accuracy, this requirement does not confine us much. Under this requirement, the back-matching propagation for the fully connected layer (13) is approximated by

(32)   $\Delta W \approx \mathbb{E}_z\big[\Delta y\, x^\top\big],$

and the back-matching propagation for the convolutional layer (23) is approximated by

(33)   $\Delta W_r \approx \dfrac{1}{S}\,\mathbb{E}_z\Big[\sum_{l}\Delta\tilde y_l\,\tilde x_l^\top\Big],$

where $S$ is the sharing factor of the convolutional layer, i.e., the number of spatial locations sharing the same kernel weight.
Next we consider the formulas (12) and (22) for the back-matching propagated values on the input. To further reduce the complexity and develop a layer-wise adaptive learning rate strategy, we assume that the rows of $W$ are homogeneous (Ba et al., 2016), i.e., they represent the same level of information and are roughly of similar magnitude. We define

$\gamma(W) := \dfrac{1}{n}\sum_{i=1}^{m}\big\|w^{(i)}\big\|_2^2,$

where $w^{(i)}$ is the $i$-th row of $W$. Then the factor $\|w_j\|_2^2$ in equation (12) can be approximated by $\gamma(W)$. Under this assumption, the back-matching propagation for the fully connected layer (12) is approximated by

(34)   $\Delta x \approx \dfrac{1}{\gamma(W)}\,\dfrac{\partial \ell}{\partial x},$

and the back-matching propagation for the convolutional layer (22) is approximated by

(35)   $\Delta x \approx \dfrac{1}{\rho\,\gamma(W_r)}\,\dfrac{\partial \ell}{\partial x},$

where $\rho$ is a factor related to pooling, stride and padding operations. We will see a detailed example in Section 3.3.

For the BN layer, we assume that the weight parameter is row homogeneous, and the back-matching propagation for the BN layer (26) is then approximated by

(36)
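Under the approximations in this section, the per-layer rescaling of the BP value can be summarized by a small helper. The sketch below is ours and follows our reading of the formulas above ($\gamma(W)$, the sharing factor, and the pooling/stride factor $\rho$); the exact expressions are illustrative rather than the paper's definitive ones, and the BN factor (36) is deliberately left out.

# Sketch (ours) of the per-layer rescaling factors implied by the approximations
# in this section; the expressions follow our reading of the text and are
# illustrative, not the paper's definitive formulas.
import numpy as np

def gamma(W):
    # gamma(W): squared Frobenius norm of W divided by the input dimension n.
    n = W.shape[1]
    return np.sum(W ** 2) / n

def backward_factor(layer_type, c_out, W=None, rho=1.0):
    # Update the backward factor when moving from a layer's output to its input.
    # c_out: factor on the layer's output; W: weight matrix (fully connected) or
    # stretched-out kernel matrix (convolution); rho: pooling/stride/padding factor.
    if layer_type == "fc":
        return c_out / gamma(W)              # our reading of approximation (34)
    if layer_type == "conv":
        return c_out / (rho * gamma(W))      # our reading of approximation (35)
    if layer_type == "relu":
        return c_out                         # (31): ReLU leaves the factor unchanged
    # The BN layer also rescales the factor (equation (36)); its form is omitted here.
    raise ValueError("factor for layer type %r not covered in this sketch" % layer_type)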
3.2 Layer-wise Adaptive Rate Strategy
Based on the approximations in Section 3.1, we are ready to derive a layer-wise adaptive learning rate strategy. We note that the approximate back-matching propagation gives a scaling factor for each layer's gradient if we backpropagate starting from the top layer. We set the initial factor of the output layer to 1, which indicates that we regard the derivative of the loss with respect to the output of the network as the desired change on the output to minimize the loss.

Then, starting from the top layer, we compute a backward factor for each layer through

(37)   $c_{\mathrm{in}} = r\, c_{\mathrm{out}},$

where $c_{\mathrm{out}}$ and $c_{\mathrm{in}}$ are the backward factors on the layer's output and input, and the ratio $r$ between the back-matching value $\Delta x$ and the BP value $\frac{\partial \ell}{\partial x}$ is given by (34), (35), (36) and (31) for the fully connected layer, the convolutional layer, the BN layer and ReLU, respectively. If a layer has parameter $W$ with gradient $\frac{\partial \ell}{\partial W}$ computed through BP, then we use $\frac{c}{s}\,\frac{\partial \ell}{\partial W}$ as the new adaptive gradient to update $W$, where $c$ is the backward factor on the output of the layer and $s$ is the sharing factor of the layer. Hence $\frac{c}{s}\,\eta$, with $\eta$ the global learning rate, can be viewed as a layer-wise adaptive learning rate when using vanilla SGD. This strategy is described in Algorithm 1.
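Algorithm 1 itself is not reproduced here; the following sketch (ours) shows the overall flow as we understand it: traverse the layers from the top down, scale each parameter's BP gradient by the backward factor on the layer's output divided by the layer's sharing factor, and update the factor before moving to the layer below. The layer bookkeeping is hypothetical, and `factor_fn` can be the illustrative `backward_factor` helper from Section 3.1.

# Sketch (ours) of the layer-wise adaptive rate strategy: scale each layer's BP
# gradient by (backward factor on its output) / (its sharing factor).
def layerwise_adaptive_gradients(layers, factor_fn):
    # layers: hypothetical top-to-bottom list of dicts with keys "name", "type",
    # "grad" (BP gradient or None), "W", "s" (sharing factor) and "rho";
    # factor_fn: updates the backward factor across one layer.
    c = 1.0                                   # initial factor on the network output
    adapted = {}
    for layer in layers:                      # traverse from the top layer downwards
        if layer.get("grad") is not None:
            adapted[layer["name"]] = (c / layer.get("s", 1.0)) * layer["grad"]
        c = factor_fn(layer["type"], c, W=layer.get("W"), rho=layer.get("rho", 1.0))
    return adapted

# With vanilla SGD, each parameter is then updated as W <- W - lr * adapted gradient.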
Our algorithm works naturally with momentum. In practice, we use the flow in Algorithm 1 to modify the gradient computed via BP and then apply the momentum update with the modified gradient. With the modified gradient given by Algorithm 1, we can also apply other adaptive strategies, e.g., Adam and Adagrad, without difficulty.
Weight decay is a widely used technique to improve the generalization of the model. Note that both weight decay and our algorithm modify the gradients of the network parameters computed through BP. In practice, we first apply the weight-decay modification and then apply our algorithm to the modified gradient, which produces better results than the other way around.
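In a framework with auto-differentiation, the strategy amounts to rewriting each parameter's gradient in place before the optimizer step. The PyTorch-style sketch below is ours: `factor_for` is a hypothetical lookup returning $c/s$ for the layer owning a parameter, and the ordering follows the discussion above (weight decay folded into the gradient first, then the layer-wise scaling, then the optimizer's momentum update).

# Sketch (ours): combining the layer-wise scaling with weight decay and momentum.
import torch

def train_step(model, loss_fn, batch, optimizer, factor_for, weight_decay=5e-4):
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                               # gradients via regular BP
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            p.grad.add_(p, alpha=weight_decay)    # weight decay first ...
            p.grad.mul_(factor_for(name))         # ... then the layer-wise factor c/s
    optimizer.step()                              # momentum is handled by the optimizer

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=global_lr, momentum=0.9)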
3.3 An Example: LeNet
We use LeNet (Figure 1) as an example. We modify the original LeNet (LeCun et al., 1998a) by inserting a batch normalization transformation before each activation layer (ReLU) and omitting all bias terms.
We next walk through the approximate back-matching propagation of LeNet and show how each layer's weight should be changed. Following the procedure of Algorithm 1, the initial backward factor is 1. Given the loss, we compute the regular gradient of each weight parameter through BP, denoted with the subscript of the layer name. We start from the top layer fc3: we scale its gradient by the current backward factor and then update the backward factor according to (37). Since the ReLU activation contains no parameters and does not change the backward factor, we move on to the BN layer. Since our BN layer does not have parameters either, we only have to update the backward factor. Then we move to layer fc2, scale its gradient by the current backward factor, and update the factor again. We continue in this way down to the bottom layer. Further details are provided in the supplemental material.
We next train LeNet with BN to classify the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). CIFAR-10 is composed of 60,000 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. We want to compare the training procedure and test accuracy of classical SGD based on BP with those of our algorithm. There are many hyperparameters and hyper-routines that can affect the learning curve significantly, and we try our best to make a fair comparison. In the first experiment, we use the CIFAR-10 dataset without augmentation and fix the same momentum for both algorithms; the two models start from the same initial point and are fed the same mini batches of data, with mini-batch size 128. The global learning rate of each algorithm is chosen to perform best in terms of test accuracy from a pool of five candidates. We fix the global learning rate throughout training and train for 200 epochs. A weight-decay term (0.0005) is optional for both methods. We note that the loss in the training curves of Figure 2 does not include the weight-decay term, no matter whether weight decay is used.

From Figure 2, we can see that without weight decay both algorithms drive the training loss to zero, but the test accuracy of regular SGD turns worse after very few epochs (about 15) and the model overfits from then on. In comparison, our algorithm achieves a much lower test error than regular SGD, and a considerable margin between our algorithm and regular SGD remains even in the overfitting phase. On the other hand, with weight decay the training losses no longer converge to zero, but our algorithm still achieves a much lower training loss than regular SGD. Weight decay improves the final test accuracy but does not improve the lowest test error reached during training with our algorithm. The weights obtained by our algorithm have rather large magnitudes (the network does not explode, as batch normalization stabilizes the forward propagation). We believe our algorithm combats overfitting differently from weight decay and could provide another way to improve generalization in certain settings where weight decay cannot.
4 Experiments
In this section, we evaluate the proposed algorithm on image classification tasks with two datasets: CIFAR-10 (Krizhevsky & Hinton, 2009) and CIFAR-100 (Krizhevsky & Hinton, 2009). CIFAR-10 was introduced in Section 3.3. The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes with 600 images per class: 500 training images and 100 test images per class. We train VGG networks (Simonyan & Zisserman, 2015) to classify these two datasets because they are widely used baselines for image classification tasks and have a feedforward architecture. We modify the VGG nets by keeping the last fully connected layer, removing the intermediate two fully connected layers, and removing all biases (we find this does not hurt accuracy on the CIFAR datasets and shortens training time due to fewer parameters). We equip each intermediate layer of the VGG nets with a batch normalization transformation right before the activation function, and the batch normalization has no trainable parameters.
Different from the setting in Section 3.3, here we train the VGG nets on randomly augmented CIFAR-10 and CIFAR-100 datasets (random flips and rotations), since such big models rapidly overfit the original datasets. We note that the training does not directly gain much benefit from augmenting the dataset; we need to decay the learning rate to learn effectively with data augmentation. In order to compare fairly, we apply the same learning rate scheduling strategy to all algorithms: multiplying the learning rate by a fixed factor every 60 epochs.
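For reference, this schedule is an ordinary step decay; the sketch below (ours) shows it in PyTorch, with the learning rate and decay factor as placeholders since the exact values are not given here.

# Sketch (ours): step learning-rate decay every 60 epochs; lr and gamma below
# are placeholders, not the values used in the paper.
import torch

model = torch.nn.Linear(10, 10)                                    # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(300):
    # ... run one training epoch here, calling optimizer.step() on each mini batch ...
    scheduler.step()                                               # decay every 60 epochs
    if (epoch + 1) % 60 == 0:
        print("epoch", epoch + 1, "lr:", scheduler.get_last_lr())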
4.1 Baseline Algorithms
We introduce several baseline algorithms and their settings.
The base algorithm is the regular SGD with Nesterov momentum and a fixed global learning rate.
The second baseline algorithm is LSALR (Singh et al., 2015), which uses a layer-specific learning rate derived from the norm of the gradient of layer $l$. The global learning rate is set to the value that achieves the best performance among a pool of candidates, and it differs from the suggestion in the original paper.
The third baseline algorithm is LARS (You et al., 2017), which scales the learning rate of layer $l$ by the ratio of the norm of the layer's weights to the norm of the layer's gradient. In our experiment, we use the global learning rate for LARS that achieves the best performance from a pool of candidates.
Noting that all these layer-wise adaptive algorithms modify the regular layer gradients computed through BP, we equip them with Nesterov momentum in the experiments. For the baseline algorithms, we apply weight decay with coefficient 5e-4 unless otherwise specified.
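For concreteness, a minimal sketch (ours) of a LARS-style layer-wise learning rate as described above: each layer's update is scaled by the ratio of its weight norm to its gradient norm. Implementations differ in details (e.g., how weight decay enters the ratio); this is only the basic trust ratio.

# Sketch (ours) of a LARS-style layer-wise learning rate: scale each layer's
# update by ||w|| / ||grad w||; details vary between implementations.
import numpy as np

def lars_update(w, grad, global_lr, eps=1e-12):
    trust_ratio = np.linalg.norm(w) / (np.linalg.norm(grad) + eps)
    return w - global_lr * trust_ratio * grad

rng = np.random.default_rng(4)
w = rng.standard_normal((64, 128))
grad = 1e-3 * rng.standard_normal((64, 128))   # a small gradient: LARS scales the step up
w_new = lars_update(w, grad, global_lr=0.01)
print("effective step norm:", np.linalg.norm(w - w_new))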
4.2 Result
We first compare the learning curves of our algorithm and vanilla SGD when training VGG11 on CIFAR-100. We apply Nesterov momentum 0.9 and weight-decay coefficient 5e-3 for both algorithms. As in Section 3.3, the two algorithms start from the same initialization and are fed the same batches of data. We set the same learning rate for both algorithms and run both for 300 epochs. We plot the learning curves in Figure 3.
From Figure 3, we can see that the learning curves of our algorithm and SGD have similar trends: both curves jump at each learning rate decay. This is expected, as our algorithm only modifies the magnitude of each layer's gradient as a whole and does not involve any further (second-order) information; moreover, we use the same hyperparameters and the same learning rate scheduling strategy for both algorithms. Scrutinizing more closely, we can see that our training loss curve is almost always lower than SGD's, and our test error fluctuates more heavily initially but ends at a considerably lower value.
Next we present the test accuracy of different VGG nets on CIFAR-100 in Table 1. For this group of experiments, we use a tuned global learning rate and weight-decay coefficient 5e-3 for our algorithm. Our algorithm achieves higher test accuracy than its competitors on all four VGG models by clear margins.
Model  VGG11  VGG13  VGG16  VGG19 

SGD  71.47  74.01  72.86  71.35 
LARS  67.26  70.21  69.90  69.52 
LSALR  70.75  73.74  72.56  70.76 
Ours  73.39  75.32  74.46  72.90 
We then present the test accuracy of different VGG nets on CIFAR-10 in Table 2. The numbers in the table are the best of five independent trials of each algorithm. We use learning rate 2e-3 and weight-decay coefficient 1e-4 for this group of experiments. We use a different learning rate from the CIFAR-100 case because for CIFAR-10 the last layer of the VGG nets has only 10 outputs, so the backpropagated values get multiplied by a large factor when following the back-matching propagation rule. Such an imbalanced mapping layer makes our algorithm behave aggressively; hence we reduce the learning rate to 2e-3 for consistent results. We can see that our algorithm achieves higher test accuracy than its competitors on almost all VGG models, with various margins.
Model  VGG11  VGG13  VGG16  VGG19 

SGD  92.63  93.90  93.72  93.66 
LARS  91.81  93.20  94.00  93.48 
LSALR  92.58  93.81  94.00  93.46 
Ours  92.69  94.08  94.22  93.98 
5 Conclusion and Discussion
In this paper we present the back-matching propagation, which provides a principled way of computing the backpropagated values on the weight parameters and on the input so that they match the error guiding signal on the output as accurately as possible. To utilize the idea of back-matching propagation when training large neural networks efficiently, we make several approximations based on intuitive understanding and reduce the back-matching propagation to the regular BP with a layer-wise adaptive learning rate strategy. It is easy to implement within current machine learning frameworks that are equipped with auto-differentiation. We test our algorithm on training feedforward neural networks and achieve favorable results over SGD.

There are several future directions following this work. In our derivation of Algorithm 1, we assume that each neuron updates its value independently from the others in the same layer. This is a strong assumption and may introduce inaccuracy in computing the backpropagated values across layers. Thus one future direction is to modify the back-matching propagation by considering the co-update of neurons in the same layer, which is closely related to the Riemannian algorithms (Ollivier, 2015) that have been introduced but are not widely used because of their complexity. Moreover, applying the idea of back-matching propagation to other architectures, such as residual networks and recurrent neural networks, is also under consideration.
References
 Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
 Amari (1998) Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Ba et al. (2016) Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Bastien et al. (2012) Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
 Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Grosse & Martens (2016) Grosse, Roger and Martens, James. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), 2016.

 He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456, 2015.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
 Lafond et al. (2017) Lafond, Jean, Vasilache, Nicolas, and Bottou, Léon. Diagonal rescaling for neural networks. arXiv preprint arXiv:1705.09319, 2017.
 LeCun et al. (1998a) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
 LeCun et al. (2015) LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436, 2015.
 LeCun et al. (1998b) LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998b.
 Marceau-Caron & Ollivier (2016) Marceau-Caron, Gaétan and Ollivier, Yann. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.
 Martens & Grosse (2015) Martens, James and Grosse, Roger. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), pp. 2408–2417, 2015.
 Nesterov (2013) Nesterov, Yurii. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Ollivier (2015) Ollivier, Yann. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015.

 Paszke et al. (2017) Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in PyTorch. 2017.
 Qian (1999) Qian, Ning. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
 Rumelhart et al. (1986) Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
 Seide & Agarwal (2016) Seide, Frank and Agarwal, Amit. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.
 Simonyan & Zisserman (2015) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 Singh et al. (2015) Singh, B., De, S., Zhang, Y., Goldstein, T., and Taylor, G. Layer-specific adaptive learning rates for deep networks. In IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 364–368, Dec 2015.
 Sutskever et al. (2013) Sutskever, Ilya, Martens, James, Dahl, George E, and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. International Conference on Machine Learning (ICML), 28:1139–1147, 2013.
 Ye et al. (2017) Ye, Chengxi, Yang, Yezhou, Fermuller, Cornelia, and Aloimonos, Yiannis. On the importance of consistency in training deep neural networks. arXiv preprint arXiv:1708.00631, 2017.
 You et al. (2017) You, Yang, Gitman, Igor, and Ginsburg, Boris. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888v3, 2017.
 Zhang et al. (2017) Zhang, Huishuai, Xiong, Caiming, Bradbury, James, and Socher, Richard. Block-diagonal Hessian-free optimization for training neural networks. arXiv preprint arXiv:1712.07296, 2017.