Deep neural networks have been advancing the state-of-the-art performance on a number of tasks in artificial intelligence, from speech recognition (Hinton et al., 2012) and image recognition (He et al., 2016) to natural language understanding (Hochreiter & Schmidhuber, 1997). These problems are typically formulated as minimizing non-convex objectives parameterized by the neural network models. Typically the models are trained with stochastic gradient descent (SGD) or its variants, and the gradient information is computed through back-propagation (BP) (Rumelhart et al., 1986).
However, the magnitudes of gradient components often vary significantly in neural networks. Recall that one coordinate of the gradient is the directional derivative along that coordinate, which represents how a change in the weight will affect the loss rather than how we should modify the weight to minimize the loss. Thus vanilla SGD with a single learning rate could be problematic for the optimization of deep neural networks because of the inconsistent magnitudes of gradient components. In practice, an extremely small learning rate alleviates this problem but leads to slow convergence. Moreover, momentum (Rumelhart et al., 1986; Qian, 1999; Nesterov, 2013; Sutskever et al., 2013) mitigates this problem by accumulating velocity along coordinates with small magnitudes but consistent directions and reducing the velocity for coordinates with large magnitudes but opposite directions. Adaptive learning rate algorithms (Duchi et al., 2011; Kingma & Ba, 2014) scale coordinates of the gradient by reciprocals of some averages of their past magnitudes, confirming from another angle that weakening the effect of the magnitudes of gradient components can be favorable to the optimization.
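As a concrete illustration of such coordinate-wise rescaling, here is a minimal Adagrad-style update sketched in NumPy; the function name and hyper-parameter values are ours, not from the cited works. Each coordinate's step is divided by the root of its accumulated squared past gradients, so coordinates with large gradient history take smaller steps.

```python
import numpy as np

def adagrad_update(w, g, cache, lr=0.1, eps=1e-8):
    # Accumulate squared gradient magnitudes per coordinate,
    # then scale each coordinate of the step by the reciprocal root.
    cache = cache + g ** 2
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```

A coordinate with a large accumulated history thus receives a much smaller effective learning rate than a fresh one, which is exactly the magnitude-weakening effect described above.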
We want to solve this problem from another perspective. Ye et al. (2017) suggests that the magnitude inconsistency of gradient components mainly occurs across layers. We can get a hint by scrutinizing the back-propagation through a fully connected layer (we omit bias terms for simplicity) which has output , input and weight parameter . The layer mapping is given by . If is the back-propagated value on the output , i.e., the partial derivatives of the loss with respect to , the back-propagation equation is given by
where is the -th column of (representing all the connections emanating from unit ) and is the back-propagated value on the layer's input computed through back-propagation. If the rows of are initialized with (roughly) unit length to preserve the forward signal, then back-propagation through this layer shrinks the magnitude of the backward signal heavily when the number of inputs is much larger than the number of outputs.
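This shrinkage can be checked numerically. The sketch below (the layer sizes are hypothetical) initializes the rows of a weight matrix with roughly unit length, as in the forward-preserving initialization described above, and compares the per-component root-mean-square magnitude of the backward signal before and after back-propagating through the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4096, 64                      # wide input, narrow output

# Rows of W have roughly unit length, preserving the forward signal scale.
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)

delta_y = rng.standard_normal(n_out)        # back-propagated value on the output
delta_x = W.T @ delta_y                     # back-propagated value on the input

rms = lambda v: np.sqrt(np.mean(v ** 2))
ratio = rms(delta_x) / rms(delta_y)         # roughly sqrt(n_out / n_in)
```

The per-component magnitude of the backward signal drops by roughly the square root of the output-to-input ratio (about 0.125 for these sizes), matching the heavy shrinkage described in the text.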
We suggest a principled way to overcome the problem of magnitude inconsistency of gradient components across layers in back-propagation. Specifically, we compute changes and on the weight parameter and on the input, respectively, to match the error guiding signal as closely as possible. This motivates us to formulate the backward pass through the fully connected layer as solving a group of least-squares problems. Deferring technical details and assumptions, we propose the back-matching propagation as follows,
where the expectation is over the data points in a mini batch. A direct explanation of (2) is that we want to change the weight matrix by to produce the desired change on the output (or something sufficiently close to it) given the current input . Equation (1) can be explained similarly. Then we use to update the parameter and use as the error guiding signal to back-propagate to lower layers.
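The least-squares view of (2) can be sketched in NumPy: over a mini batch, the matching change on the weight matrix solves the normal equations, i.e., it is the plain gradient-like term rescaled by the inverse of the input second-moment matrix. The shapes and random data below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, batch = 32, 8, 256

X = rng.standard_normal((n_in, batch))      # inputs, one column per data point
dY = rng.standard_normal((n_out, batch))    # desired changes on the outputs

# Minimize E|| dW @ x - delta_y ||^2 over the mini batch; the normal
# equations give dW = E[delta_y x^T] (E[x x^T])^{-1}.
A = dY @ X.T / batch                        # gradient-like cross term
B = X @ X.T / batch                         # input second-moment matrix
dW = A @ np.linalg.inv(B)
```

By construction, `dW` achieves a residual no worse than the unscaled term `A`, which is why the inverse factor appears in (2); removing that inverse is exactly the approximation made later with batch normalization.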
This requires a large number of matrix inverse operations, roughly as many as the number of neurons, and each inverse requires a number of flops cubic in the number of neurons in one layer. This hinders its application to large neural networks, which typically contain tens of thousands of neurons in a single layer.
Fortunately, we can work with the batch normalization (BN) technique (Ioffe & Szegedy, 2015) to circumvent this difficulty. With batch normalization, we regard
as an identity matrix approximately and remove the inverse in (2). Then with some approximation, we can reduce the back-matching propagation to a layer-wise gradient adaption strategy, which can be viewed as layer-wise adaptive learning rates when applying pure SGD. As such a layer-wise gradient adaption strategy is built upon the regular BP process, it is easy to implement in current deep learning frameworks (Bastien et al., 2012; Abadi et al., 2016; Paszke et al., 2017; Seide & Agarwal, 2016). Moreover, this strategy also works naturally with other popular optimization techniques (momentum, ada-algorithms, weight-decay) to achieve potentially higher performance on machine learning tasks. We expect this layer-wise gradient adaption strategy to accelerate the training procedure. Surprisingly, this strategy also often improves the test accuracy by a considerable margin in practice.
1.1 Related Works
Training neural networks with layer-wise adaptive learning rates has been proposed in several previous works. Specifically, Singh et al. (2015) suggested using as the learning rate for layer . You et al. (2017) suggested using as the learning rate for layer and demonstrated that this benefits large-batch training. However, the suggestions in both works mainly come from empirical experience, and neither work explains why the rate is set in that way.
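For concreteness, a layer-wise learning rate in the spirit of LARS (You et al., 2017) can be sketched as below: the global rate is scaled per layer by the ratio of the parameter norm to the gradient norm, so every layer moves by a comparable relative amount. The exact trust coefficient and normalization in the original works differ in details, so treat this form as an illustrative assumption.

```python
import numpy as np

def lars_layer_lr(w, g, eta=1e-3, eps=1e-8):
    # Per-layer rate ~ eta * ||w|| / ||g||: layers whose gradients are
    # small relative to their weights get proportionally larger steps.
    return eta * np.linalg.norm(w) / (np.linalg.norm(g) + eps)
```

Each layer's update then becomes `w -= lars_layer_lr(w, g) * g`, with one scalar rate per layer rather than one global rate.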
Our paper is related to block-diagonal second-order algorithms (Lafond et al., 2017; Zhang et al., 2017; Grosse & Martens, 2016). Specifically, Lafond et al. (2017) proposes a weight reparametrization scheme with a diagonal rescaling step-size and shows its potential advantages over batch normalization. Zhang et al. (2017) proposes a block-diagonal Hessian-free method to train neural networks and shows a faster convergence rate than first-order methods. Martens & Grosse (2015); Grosse & Martens (2016) propose the Kronecker Factored Approximation (KFA) method to approximate the natural gradient using a block-diagonal or block-tridiagonal approximation of the Fisher matrix. These second-order algorithms all share a layer-wise or block-diagonal structure, which agrees with our algorithm. However, our layer-wise adaptive learning rate comes from the perspective of back-matching propagation and is different from the second-order approximations.
Our paper is also related to Riemannian algorithms (Amari, 1998; Ollivier, 2015; Marceau-Caron & Ollivier, 2016). Specifically, Ollivier (2015) proposes using as the update for the parameter , where
is a backpropagated metric. Similarly, Ye et al. (2017) advocates using as the update of the parameter.
In comparison, the back-matching propagation comes from a different perspective that the back-propagated values should match the error guiding signal. Our layer-wise gradient adaption strategy, which is derived from back-matching propagation, is simpler than the Riemannian algorithms in terms of implementational and computational complexity.
2 Back-matching Propagation
In this section, we present how the back-matching propagation works for several popular types of layers. Specifically, we derive the formulas of the backpropagated values on the layer's input and on the layer's parameters given the backpropagated value on the layer's output, based on the back-matching propagation. Moreover, we compare the back-matching propagation to the regular BP.
We introduce several notations here (some have been used in Introduction). Let
denote the objective (loss function). We use and to denote the layer's output and input respectively and use parameter to denote the layer's parameter. We use to denote the back-propagated value on the layer's output. Then we use and to denote the back-propagated values computed through BP, and use and to denote the back-propagated values computed through back-matching propagation.
Let us briefly review the regular BP here. The BP propagates derivatives from the top layer back to the bottom one. Suppose we are dealing with a general layer which has forward mapping . Then the derivative of the loss with respect to a specific output component is . The BP equations are given by
where represents a data point and the expectation is over the data points in a mini batch.
Next we present how the back-matching propagation back-propagates through specific layers. For convenient comparison, for each type of layer we first provide the BP formula, then derive the formula via back-matching propagation, and in the end discuss the relation between the back-matching propagation and BP.
2.1 Fully Connected Layer
We first consider a fully connected layer, whose mapping function is given by (for simplicity we omit the bias term)
where is the -th column of .
We next derive the formula for back-matching propagation, where we compute and that try to match the guiding signal as accurately as possible, in the sense of minimizing square error,
Note that by writing the matching problem as two independent problems (8) and (9), we presume that updating and propagating backward values are independent; such layer independence has also been used in block-diagonal second-order algorithms (Zhang et al., 2017; Lafond et al., 2017). Moreover, (8) is separable along the rows of . Hence we obtain a group of (in total ) least-squares problems
where is the -th row of and represents the back-propagated values at neuron over one mini batch of data. We further assume all are updated independently, based on the intuition that a neuron does not know the other neurons' on/off states, and a fair strategy is for each neuron to try its best to match the guiding signal by itself. Then (9) becomes a group of (in total ) least-squares problems
where are the weights emanating from neuron (the column of corresponding to neuron ). We call equations (8) and (11) the back-matching propagation rule. Solving the least-squares problems (11) and (8) gives us:
2.2 Convolutional Layer
In this section we study the back-matching propagation through a convolutional layer. The weight parameter is an array with dimension , where and are the number of output features and the number of input features respectively, and and are the width and height of convolutional kernels. We use to denote the output at location of feature and to denote the input at location of feature , then the forward process is
and the BP is given by
However, this formulation of the forward and backward process of the convolutional layer makes the derivation of back-matching propagation complex. Note that the convolution operation essentially performs dot products between the convolution kernels and local regions of the input. The forward pass of a convolutional layer can thus be formulated as one big matrix multiply via the im2col operation. In order to describe the back-matching process clearly, we rewrite the convolutional layer's forward and backward pass with the im2col operation. We use and to represent the weight matrices with dimensions and , respectively, both of which are stretched out from . To mimic the convolutional operation, we rearrange the input features into a big matrix through the im2col operation: each column of is composed of the elements of that are used to compute one location in . Thus if has dimension , then has dimension . Furthermore, we stack the latter two dimensions of
into a tall vector, denoted as , which has dimension . The forward process (16) of the convolutional layer can be rewritten as
where is composed of weight components that interact with input location , which approximately has elements and
is a factor related with pooling and stride, andis composed of output locations that have interaction with input location . With these notations, we can derive the formula for back-matching propagation via solving the least squares problems (11) and (8), given by
We can see that the formulas (22) and (23) of back-matching propagation are the corresponding BP formulas (20) and (21) rescaled by a number or a matrix. As the convolutional layer is essentially a linear mapping, the formulas here are similar to those of the fully connected layer, although they are more involved.
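The im2col rearrangement described above can be sketched as follows; the function name, stride-1 setting, and absence of padding are our assumptions for a minimal example. Each column of the resulting matrix holds one flattened patch, so the convolution forward pass reduces to a single matrix multiply.

```python
import numpy as np

def im2col(x, k):
    # x has shape (C, H, W); gather every k-by-k patch (stride 1, no padding)
    # into one column of a (C*k*k, Ho*Wo) matrix.
    C, H, W = x.shape
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, Ho * Wo))
    idx = 0
    for i in range(Ho):
        for j in range(Wo):
            cols[:, idx] = x[:, i:i + k, j:j + k].ravel()
            idx += 1
    return cols
```

With a weight matrix of shape `(c_out, C*k*k)` stretched out from the kernel tensor, the forward pass is then just `W_mat @ im2col(x, k)`, matching the big-matrix-multiply view used in the derivation.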
2.3 Batch Normalization Layer
Batch normalization (BN) is widely used for accelerating training of feedforward neural networks. In practice, BN is usually inserted right before the activation function. We fix the affine transformation of batch normalization to be identity. Then the BN layer mapping is given by
The BP formula through the BN layer is given by (Ioffe & Szegedy, 2015),
where is the mini-batch size, and and are the backpropagated values on the quantities and , respectively.
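For reference, the forward and backward pass of this BN layer (with the affine transform fixed to identity, as above) can be sketched in NumPy as below; the function names are ours.

```python
import numpy as np

def bn_forward(x, eps=1e-5):
    # Normalize each feature over the mini batch (axis 0); identity affine.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat, (x_hat, var, eps)

def bn_backward(dy, cache):
    # BP through BN (Ioffe & Szegedy, 2015) with identity affine transform;
    # the mean terms come from differentiating through the batch statistics.
    x_hat, var, eps = cache
    return (dy - dy.mean(axis=0)
            - x_hat * (dy * x_hat).mean(axis=0)) / np.sqrt(var + eps)
```

A finite-difference check confirms that `bn_backward` matches the analytic derivative through the batch statistics.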
2.4 Rectified Linear Unit (ReLU)
We use to denote the ReLU nonlinear function. Then the ReLU mapping is given by
For the formula of BP, we have
Following (11), we have the formula of back-matching propagation for the ReLU layer
Therefore the formula of back-matching propagation for ReLU is the same as that of BP,
3 Layer-wise Adaptive Rate via Approximate Back-matching Propagation
The back-matching propagation involves a large number of matrix inverse operations, which is computationally prohibitive in training large neural networks. In this section we present how to approximate the back-matching propagation under certain assumptions, ending up with a layer-wise adaptive rate strategy based on this approximation, which allows easy implementation in frameworks equipped with auto-differentiation.
3.1 Approximate Back-matching Propagation via BP
We first look at the formulas of the back-matching propagated value on the weight parameter, (15) and (23). It is the gradient scaled by the inverse of a matrix, which is prohibitive for large networks. We use batch normalization to circumvent this difficulty.
With batch normalization, we regard as an identity matrix approximately. From now on, we require that each intermediate layer except the output layer be paired with a batch normalization layer. Since BN is already widely used for accelerating the training process and improving the test accuracy, this requirement is not very restrictive. Under this requirement, the back-matching propagation for the fully connected layer (15) is approximated by,
and the back-matching propagation for the convolutional layer (23) is approximated by,
where is the sharing factor for the convolutional layer.
Next we consider the formulas of the back-matching propagated values on the input, (14) and (22). To further reduce the complexity and develop a layer-wise adaptive learning rate strategy, we assume that is row homogeneous (Ba et al., 2016), i.e., the rows represent the same level of information and are roughly of similar magnitude. We define
and the back-matching propagation for the convolutional layer (22) is approximated by,
is a factor related to pooling, stride and padding operations. We will see a detailed example in Section 3.3.
For BN layer, we assume the weight parameter is row homogeneous and then the back-matching propagation for the BN layer (26) is approximated by,
3.2 Layer-wise Adaptive Rate Strategy
Based on the approximations in Section 3.1, we are ready to derive a layer-wise adaptive learning rate strategy. We note that the approximate back-matching propagation gives a scaling factor for each layer's gradient if we back-propagate starting from the top layer. We set the initial factor of the output layer to , which indicates that we regard the derivative of the loss with respect to the output of the network as the desired change on the output to minimize the loss.
Then starting from the top layer, we compute a backward factor for each layer through
where the relations of and are given by (34), (35), (36) and (31) for fully connected layer, convolutional layer, BN layer and ReLU, respectively. If the layer has parameter and gradient , then we use as the new adaptive gradient to update , where is the backward factor on the output of the layer and is the sharing factor of the layer. Then can be viewed as a layer-wise adaptive learning rate when using vanilla SGD. This strategy is described in Algorithm 1.
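The flow of Algorithm 1 can be sketched generically as follows. Since the concrete factor-update and gradient-rescaling rules are given by the layer-specific formulas (34), (35), (36) and (31), they are represented here by placeholder callables; every name in this sketch is an illustrative assumption, not the paper's exact implementation.

```python
class Layer:
    """Toy layer record: `grad` is the BP gradient of its parameter
    (None if the layer is parameter-free), `grad_scale` maps the incoming
    backward factor to the gradient rescaling, and `factor_update`
    propagates the backward factor to the layer below."""
    def __init__(self, grad=None, grad_scale=lambda c: c,
                 factor_update=lambda c: c):
        self.grad = grad
        self.grad_scale = grad_scale
        self.factor_update = factor_update
        self.adapted_grad = None

def layerwise_adapt(layers):
    # Walk from the output layer down, carrying the backward factor c
    # (initialized to 1 at the output); rescale each parameterized layer's
    # BP gradient, then update the factor for the layer below.
    c = 1.0
    for layer in layers:            # ordered from output to input
        if layer.grad is not None:
            layer.adapted_grad = layer.grad * layer.grad_scale(c)
        c = layer.factor_update(c)
    return layers
```

Under vanilla SGD, each layer's effective step is then its global learning rate times the accumulated factor, which is the layer-wise adaptive learning rate view described above.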
Our algorithm works with momentum naturally. In practice, we use the flow in Algorithm 1 to modify the gradient computed via BP and then apply the momentum update with the modified gradient. With the modified gradient given by Algorithm 1, we can also apply other adaptive strategies, e.g., Adam and Adagrad, without difficulty.
Weight-decay is a widely used technique to improve the generalization of the model. Note that both weight-decay and our algorithm modify the gradient of the network parameters computed through BP. In practice, we first apply the weight-decay modification and then apply our algorithm on the modified gradient, which produces better results than the other way around.
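The ordering described above fits in one line; the function name and default coefficient here are illustrative, not from the paper.

```python
def modified_grad(w, g, factor, wd=5e-4):
    # Apply the weight-decay modification to the BP gradient first,
    # then the layer-wise rescaling by the backward factor.
    return (g + wd * w) * factor
```

Swapping the two operations would rescale the data-dependent gradient but not the decay term, which is the "other way around" that performed worse in practice.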
3.3 An Example: LeNet
We use LeNet (Figure 1) as an example. We modify the original LeNet (LeCun et al., 1998a) by inserting a batch normalization transformation before each activation layer (ReLU) and omitting all the bias terms.
We next walk through the approximated back-matching propagation of the LeNet and show how each layer’s weight should be changed (). Following the procedure of Algorithm 1, we have the following initial value: . Given the loss , we can compute the normal gradient on each weight parameter through BP, which are denoted as with subscript of the layer name. We start from the top layer fc3 and compute
and update . Since the ReLU activation does not contain parameters and does not change the backward factor , we move to the BN layer. Since our BN layer does not have parameters, we only have to update the backward factor
Then we move to layer fc2 and compute
and update the backward factor
We continue doing this till the bottom layer. Further details are provided in supplemental material.
We next train LeNet with BN to classify the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). CIFAR-10 is composed of 60,000 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. We want to compare the training procedure and test accuracy of the classical SGD based on BP and of our algorithm. There are many hyper-parameters/hyper-routines that can affect the learning curve significantly, and we try our best to make a fair comparison. In the first experiment, we use the CIFAR-10 dataset without augmentation and fix the momentum to be for both algorithms; the two models start from the same initial point and pass over the same mini batches of data, with mini-batch size 128. Global learning rates are chosen to perform best in terms of test accuracy from a pool of five candidates (the pool for regular SGD is and the pool for our algorithm is ). We choose global learning rate for SGD and
for our algorithm. We fix the global learning rate through the training process and train for 200 epochs. A weight-decay term (0.0005) is optional for both methods. We note that the loss in the training curves of Figure 2 does not include the weight-decay term, regardless of whether weight-decay is used.
From Figure 2, we can see that without weight decay both algorithms can drive the training loss to zero, but the test accuracy of the regular SGD degrades after very few epochs (15) and the model overfits from then on. In comparison, our algorithm achieves a much lower test error than the regular SGD, and there is a considerable margin between our algorithm and the regular SGD even in the overfitting phase. On the other hand, with weight decay neither training loss converges to zero any more, but our algorithm achieves a much lower training loss than the regular SGD. Weight decay improves the final test accuracy but does not improve the lowest test error during the training period of our algorithm. The weights obtained by our algorithm have rather large magnitudes (the network does not explode, as batch normalization stabilizes the forward propagation). We believe our algorithm combats overfitting differently from weight decay and could provide another way to improve generalization in certain settings that cannot be achieved by weight decay.
In this section, we evaluate the proposed algorithm on image classification tasks with two datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). CIFAR-10 has been introduced in Section 3.3. The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes with 600 images per class; there are 500 training images and 100 test images per class. We train VGG networks (Simonyan & Zisserman, 2015) to classify these two datasets because they are widely used baselines for image classification tasks and are of feedforward architecture. We modify the VGG nets by keeping the last fully connected layer, removing the intermediate two fully connected layers, and removing all the biases (we find this does not hurt accuracy on the CIFAR datasets and shortens training time due to fewer parameters). We equip each intermediate layer of the VGG nets with a batch normalization transformation right before the activation function, and the batch normalization has no trainable parameters.
Unlike the setting in Section 3.3, here we train the VGG nets on randomly augmented CIFAR-10 and CIFAR-100 datasets (random flips and rotations), as such big models rapidly overfit the unaugmented datasets. We note that the training does not directly gain much benefit from augmenting the dataset alone; we need to decay the learning rate to learn effectively with data augmentation. In order to compare fairly, we apply the same learning rate scheduling strategy to all algorithms: multiplying the learning rate by a factor every 60 epochs.
4.1 Baseline Algorithms
We introduce several baseline algorithms and their settings.
The base algorithm is the regular SGD with Nesterov momentum . The learning rate is set to be .
The second baseline algorithm is LSALR (Singh et al., 2015), which uses as the learning rate for layer . The global learning rate is set to , which achieves the best performance among a pool of candidates , and is different from the suggestion () in the original paper.
The third baseline algorithm is LARS (You et al., 2017), which uses as the learning rate for layer . In our experiment, we use global learning rate for LARS, which achieves the best performance among a pool of .
Since all these layer-wise adaptive algorithms modify the regular layer gradients computed through BP, we equip them with Nesterov momentum in the experiments. For the baseline algorithms, we apply weight decay with coefficient 5e-4 unless otherwise specified.
We first compare the learning curves of our algorithm and vanilla SGD when training VGG11 on CIFAR-100. We apply Nesterov momentum 0.9 and weight decay coefficient 5e-3 for both algorithms. Similarly to Section 3.3, the two algorithms start from the same initialization and pass over the same batches of data. We set the same learning rate for both algorithms. Both algorithms are run for 300 epochs. We plot the learning curves in Figure 3.
From Figure 3, we can see that the learning curves of our algorithm and SGD have similar trends: both curves jump at each learning rate decay. This is predictable, as our algorithm only modifies the magnitude of each layer's gradient as a whole and does not involve any further (second-order) information; moreover, we use the same hyper-parameters and the same learning rate scheduling strategy for both algorithms. Scrutinizing more closely, we can see that our training loss curve is almost always lower than SGD's and that our test error fluctuates more heavily initially but ends with a considerably lower number.
Next we present the test results of different VGG nets on CIFAR-100 in Table 1. For this group of experiments, we use global learning rate and weight decay coefficient 5e-3 for our algorithm. Our algorithm achieves higher test accuracy than its competitors on all four VGG models, with considerable margins.
We then present the test accuracy of different VGG nets on CIFAR-10 in Table 2. The numbers in the table are the best of five independent trials of each algorithm. We use learning rate 2e-3 and weight decay coefficient 1e-4 for this group of experiments. We use a different learning rate from the CIFAR-100 case because for CIFAR-10 the last layer of the VGG nets has only 10 outputs, and then the backpropagated values would be multiplied by if following the back-matching propagation rule. Such an imbalanced mapping layer makes our algorithm behave aggressively, hence we reduce the learning rate to 2e-3 for consistent results. We can see that our algorithm achieves higher test accuracy than its competitors on almost all VGG models, with various margins.
5 Conclusion and Discussion
In this paper we present the back-matching propagation, which provides a principled way of computing the backpropagated values on the weight parameters and on the inputs so that they match the error guiding signal on the output as accurately as possible. To utilize the idea of back-matching propagation in training large neural networks efficiently, we make several approximations based on intuitive understanding and reduce the back-matching propagation to the regular BP with a layer-wise adaptive learning rate strategy. It is easy to implement within current machine learning frameworks that are equipped with auto-differentiation. We test our algorithm on training feedforward neural networks and achieve favorable results over SGD.
There are several future directions for this work. In our derivation of Algorithm 1, we assume that each neuron updates its values independently of the others in the same layer. This is a strong assumption and may produce inaccuracy in computing the backpropagated values across layers. Thus one future direction is to modify the back-matching propagation by considering the co-update of neurons in the same layer, which is closely related to the Riemannian algorithms (Ollivier, 2015) that have been introduced but are not widely used because of their complexity. Moreover, applying the idea of back-matching propagation to other architectures, such as residual networks and recurrent neural networks, is also under consideration.
- Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
- Amari (1998) Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
- Ba et al. (2016) Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bastien et al. (2012) Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
- Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Grosse & Martens (2016) Grosse, Roger and Martens, James. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), 2016.
- He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
- Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456, 2015.
- Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
- Lafond et al. (2017) Lafond, Jean, Vasilache, Nicolas, and Bottou, Léon. Diagonal rescaling for neural networks. arXiv preprint arXiv:1705.09319, 2017.
- LeCun et al. (1998a) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
- LeCun et al. (2015) LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436, 2015.
- LeCun et al. (1998b) LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural networks: Tricks of the trade. Springer, 1998b.
- Marceau-Caron & Ollivier (2016) Marceau-Caron, Gaétan and Ollivier, Yann. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.
- Martens & Grosse (2015) Martens, James and Grosse, Roger. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), pp. 2408–2417, 2015.
- Nesterov (2013) Nesterov, Yurii. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- Ollivier (2015) Ollivier, Yann. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015.
- Paszke et al. (2017) Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in PyTorch. 2017.
- Qian (1999) Qian, Ning. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- Rumelhart et al. (1986) Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
- Seide & Agarwal (2016) Seide, Frank and Agarwal, Amit. CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.
- Simonyan & Zisserman (2015) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- Singh et al. (2015) Singh, B., De, S., Zhang, Y., Goldstein, T., and Taylor, G. Layer-specific adaptive learning rates for deep networks. In IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 364–368, Dec 2015.
- Sutskever et al. (2013) Sutskever, Ilya, Martens, James, Dahl, George E, and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. International Conference on Machine Learning (ICML), 28:1139–1147, 2013.
- Ye et al. (2017) Ye, Chengxi, Yang, Yezhou, Fermuller, Cornelia, and Aloimonos, Yiannis. On the importance of consistency in training deep neural networks. arXiv preprint arXiv:1708.00631, 2017.
- You et al. (2017) You, Yang, Gitman, Igor, and Ginsburg, Boris. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888v3, 2017.
- Zhang et al. (2017) Zhang, Huishuai, Xiong, Caiming, Bradbury, James, and Socher, Richard. Block-diagonal hessian-free optimization for training neural networks. arXiv preprint arXiv:1712.07296, 2017.