1 Introduction
Deep neural networks have shown great success in computer vision (He et al., 2016a) and natural language processing tasks (Hochreiter and Schmidhuber, 1997). These models are typically trained using first-order optimization methods such as stochastic gradient descent (SGD) and its variants. Vanilla SGD does not incorporate any curvature information about the objective function, which results in slow convergence in certain cases. Momentum (Qian, 1999; Nesterov, 2013; Sutskever et al., 2013) or adaptive gradient-based methods (Duchi et al., 2011; Kingma and Ba, 2014) are sometimes used to rectify these issues. These adaptive methods can be seen as implicitly computing finite-difference approximations to the diagonal entries of the Hessian matrix (LeCun et al., 1998). A drawback of first-order methods in general, including adaptive ones, is that they perform best with small minibatches (Dean et al., 2012; Zhang et al., 2015; Das et al., 2016; Recht et al., 2011; Chen et al., 2016). This limits available parallelism and makes distributed training difficult. Moreover, in the distributed setting the gradients must be accumulated after each training update, and the communication among workers may become a major bottleneck. The optimization methods that scale best to the distributed setting are those that can reach convergence with few parameter updates. The weakness of first-order methods on this metric extends even to the convex case, where it was shown to result from correlation between the gradients at different data points in a minibatch, leading to "overshooting" in the direction of the correlation (Takác et al., 2013).
In the case of deep neural networks, large minibatch sizes lead to substantially increased generalization error (Keskar et al., 2016; Dinh et al., 2017). Although Goyal et al. (2016) recently trained deep ResNets on the ImageNet dataset in one hour with minibatches as large as 8192, using momentum SGD with carefully designed hyperparameters, they also showed severe performance decay for even larger minibatch sizes, which corroborates the difficulty of training with large minibatches. These difficulties motivate us to revisit second-order optimization methods, which use the Hessian or other curvature matrices to rectify the gradient direction. Second-order methods employ more information about the local structure of the loss function, as they approximate it quadratically rather than linearly, and they scale better to large minibatch sizes.
However, finding the exact minimum of a quadratic approximation to a loss function is infeasible in most deep neural networks because it involves inverting an n-by-n curvature matrix for a parameter count of n. The Hessian-free (HF) methods (Martens, 2010; Martens and Sutskever, 2011; Byrd et al., 2011, 2012) instead minimize the quadratic function that locally approximates the loss using the conjugate gradient (CG) method. This involves evaluating a sequence of curvature-vector products rather than explicitly inverting, or even computing, the curvature matrix or Hessian. The Hessian-vector product can be calculated efficiently using one forward pass and one backward pass (Pearlmutter, 1994), while other curvature-vector products have similarly efficient algorithms (Schraudolph, 2002; Martens and Sutskever, 2012). Normally, the HF method requires many hundreds of CG iterations for one update, which makes even a single optimization step fairly computationally expensive. Thus, when comparing HF to first-order methods, the benefit of fewer iterations from incorporating curvature information often does not compensate for the added computational burden.
We propose using a block-diagonal approximation to the curvature matrix to improve Hessian-free convergence properties, inspired by several results that link these two concepts for other optimization methods. Collobert (2004) argues that when training a multilayer perceptron (MLP) with one hidden layer, gradient descent converges faster with the cross-entropy loss than with mean squared error because the Hessian of the former is more nearly block-diagonal. A block-diagonal approximation of the Fisher information matrix, one kind of curvature matrix, has also been shown to improve the performance of the online natural gradient method (Le Roux et al., 2008) for training a one-layer MLP. The advantage of a block-diagonal Hessian-free method is that updates to certain subsets of the parameters are independent of the gradients for other subsets. This makes the subproblem separable and reduces the complexity of the local search space (Collobert, 2004). We hypothesize that the block-diagonal approximation of the curvature matrix may make the Hessian-free method more robust to the noise that results from using a relatively small minibatch for curvature estimation.
In the cases of Collobert (2004) and Le Roux et al. (2008), the parameter blocks for which the Hessian or Fisher matrix is treated as block-diagonal consist of all weights and biases involved in computing the activation of a single neuron in the hidden and output layers. This amounts to the claim that gradient interactions among weights that affect a single output neuron are stronger than those between weights that affect two different neurons.
In order to strike a balance between the curvature information provided by additional Hessian terms and the potential benefits of a more nearly block-diagonal curvature matrix, and to adapt the concept to more complex contemporary neural network models, we instead treat each layer or submodule of a deep neural network as a parameter block. Thus, unlike Collobert (2004) and Le Roux et al. (2008), our hypothesis becomes that gradient interactions among weights in a single layer are more useful for training than those between weights in different layers.
We now introduce our block-diagonal Hessian-free method in detail, then test this hypothesis by comparing the performance of our method on a deep autoencoder, a deep convolutional network, and a multilayer LSTM against the original Hessian-free method (Martens, 2016) and the Adam method (Kingma and Ba, 2014).
2 The Block-Diagonal Hessian-Free Method
In this section, we describe the block-diagonal HF method in detail and compare it with the original HF method (Martens, 2010; Martens and Sutskever, 2011).
Throughout the paper, we use boldface lowercase letters to denote column vectors, boldface capital letters to denote matrices or tensors, and the superscript ⊤ to denote the transpose. We denote an input sample and its label as (x, y), the output of the network as z = F(θ; x), and the loss as f(θ) = ℓ(F(θ; x), y), where θ refers to the network parameters flattened to a single vector.

2.1 The Block-Diagonal Hessian-Free Method
We first recall how second-order optimization works. For each parameter update, a second-order method finds the step Δθ that minimizes a local quadratic approximation of the objective function f at the current point θ:

\[
f(\theta + \Delta\theta) \;\approx\; f(\theta) + \nabla f(\theta)^\top \Delta\theta + \tfrac{1}{2}\, \Delta\theta^\top G\, \Delta\theta
\tag{1}
\]
where G is some curvature matrix of f at θ, such as the Hessian matrix or the generalized Gauss-Newton matrix (Martens and Sutskever, 2012). The resulting subproblem

\[
\min_{\Delta\theta} \;\; \nabla f(\theta)^\top \Delta\theta + \tfrac{1}{2}\, \Delta\theta^\top G\, \Delta\theta
\tag{2}
\]

is solved using conjugate gradient (CG), a procedure that only requires evaluating a series of matrix-vector products Gv.
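To illustrate why only matrix-vector products are needed, here is a minimal CG loop in plain Python; this is an illustrative sketch, not the paper's implementation, and the names `matvec`, `b`, and the toy 2-by-2 curvature matrix are hypothetical.

```python
# Minimal conjugate gradient that touches the curvature matrix only through
# a matrix-vector product callback, as in Hessian-free optimization.

def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Approximately solve G x = b given only the map x -> G x."""
    n = len(b)
    x = [0.0] * n                      # start from the zero vector
    r = list(b)                        # residual r = b - G x = b at start
    p = list(r)                        # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iters):
        gp = matvec(p)
        alpha = rs_old / sum(pi * gpi for pi, gpi in zip(p, gp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * gpi for ri, gpi in zip(r, gp)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Usage: solve G x = b with a 2x2 SPD "curvature" G = [[4, 1], [1, 3]].
G = [[4.0, 1.0], [1.0, 3.0]]
matvec = lambda v: [sum(G[i][j] * v[j] for j in range(2)) for i in range(2)]
x = conjugate_gradient(matvec, [1.0, 2.0])
```

Note that the loop never indexes into G directly; in the HF setting, `matvec` would be the efficient curvature-vector product routine described below, and G would never be formed.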
There exist efficient algorithms for computing these matrix-vector products given a computation-graph representation of the loss function. If the curvature matrix is the Hessian matrix, (1) is the second-order Taylor expansion, and the Hessian-vector product can be computed as the gradient of the directional derivative of the loss function in the direction of v, operations also known as the L- and R-operators, L(·) and R_v(·) respectively:

\[
H v \;=\; \mathcal{L}\big(\mathcal{R}_v\big(f(\theta)\big)\big)
\tag{3}
\]
The R-operator can be implemented as a single forward traversal of the computation graph (applying forward-mode automatic differentiation), while the L-operator requires a backward traversal (reverse-mode automatic differentiation) (Pearlmutter, 1994; Baydin et al., 2015). The Hessian-vector product can also be computed as the gradient of the dot product of a vector and the gradient; that method does not require the R-operator but has twice the computational cost.
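The quantity computed by (3) can be checked numerically: the Hessian-vector product is the derivative of the gradient along v. The sketch below (a hypothetical quadratic objective, not the paper's code) recovers Hv = Av by a central finite difference of gradients, which is what the R-operator computes exactly rather than approximately.

```python
# Numerical illustration of H v = d/dr grad f(theta + r v) |_{r=0}.
# For the quadratic f(theta) = 0.5 * theta^T A theta, the gradient is
# A theta and the Hessian is A, so differencing gradients along v
# should recover A v.

def grad(theta):
    # gradient of f(theta) = 0.5 * theta^T A theta with A = [[2, 1], [1, 4]]
    A = [[2.0, 1.0], [1.0, 4.0]]
    return [sum(A[i][j] * theta[j] for j in range(2)) for i in range(2)]

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Central finite difference of the gradient in the direction v."""
    gp = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    gm = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]

theta, v = [1.0, -1.0], [1.0, 2.0]
hv = hessian_vector_product(grad, theta, v)   # should approximate A v = [4, 9]
```

In a real implementation one would use the exact R-operator (one forward traversal) instead of finite differences, which are used here only to make the identity concrete.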
However, the objective of deep neural networks is nonconvex, and the Hessian matrix may have a mixture of positive and negative eigenvalues, which makes the optimization problem (2) unstable. It is common to use the generalized Gauss-Newton matrix (Schraudolph, 2002) as a substitute curvature matrix, as it is always positive semidefinite when the objective function can be expressed as a composition f = ℓ ∘ F of two functions with ℓ convex, a property satisfied by most training objectives. For a curvature minibatch of data S, the generalized Gauss-Newton matrix is defined as

\[
G \;=\; \frac{1}{|S|} \sum_{(x, y) \in S} J^\top H_\ell\, J
\tag{4}
\]
where J is the Jacobian matrix of derivatives of the network outputs z = F(θ; x) with respect to the parameters θ, and H_ℓ is the Hessian matrix of the objective ℓ with respect to the network outputs z. It is an approximation to the Hessian that results from dropping terms that involve second derivatives of F (Martens and Sutskever, 2012).
The Gauss-Newton vector product can also be evaluated as

\[
G v \;=\; J^\top \big( H_\ell\, (J v) \big)
\tag{5}
\]
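For intuition, the product in (5) amounts to three small multiplications and never forms G itself; the shapes below (a 2-output, 3-parameter toy Jacobian and a scaled-identity output Hessian) are hypothetical illustrations.

```python
# Sketch of the Gauss-Newton vector product G v = J^T H_L (J v):
# one Jacobian-vector product, one small multiply in output space,
# and one transposed Jacobian-vector product.

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

J   = [[1.0, 0.0, 2.0],      # Jacobian of network outputs w.r.t. parameters
       [0.0, 1.0, 1.0]]
H_L = [[2.0, 0.0],           # Hessian of the loss w.r.t. network outputs
       [0.0, 2.0]]           # (e.g. mean squared error, up to scaling)
v   = [1.0, 1.0, 1.0]

gn_v = matvec(transpose(J), matvec(H_L, matvec(J, v)))  # G v, with G unformed
```

The key point is that H_L lives in the small output space, so the expensive objects are only the two Jacobian-vector products, each of which has an efficient automatic-differentiation routine.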
In an automatic differentiation package like Theano (Al-Rfou et al., 2016), this requires one forward-mode and one reverse-mode traversal of the computation graphs of each of F and ℓ. However, it is still inefficient to solve problem (2) for a deep neural network with a large number of parameters, so we propose the block-diagonal Hessian-free method. We first split the network parameters into a set of parameter blocks. For instance, each block may contain the parameters from one layer or a group of adjacent layers. Then the subproblems corresponding to each block are solved separately, and their solutions are concatenated together to produce a single update.
Specifically, if there are B blocks in total, the parameter vector can be rewritten as θ = [θ_1; θ_2; …; θ_B]. Similarly, we split the gradient into blocks as ∇f(θ) = [g_1; g_2; …; g_B], where g_b is the vector that contains the gradient only with respect to the parameters in block b. We further split the curvature matrix G into B × B square blocks and let G_bb be the b-th diagonal block of G. Then we obtain B separate subproblems, one for each block:

\[
\min_{\Delta\theta_b} \;\; g_b^\top \Delta\theta_b + \tfrac{1}{2}\, \Delta\theta_b^\top G_{bb}\, \Delta\theta_b, \qquad b = 1, \dots, B.
\]
We solve these B subproblems separately by conjugate gradient and concatenate their solutions. Hence Δθ = [Δθ_1; Δθ_2; …; Δθ_B] is our update (see Algorithm 1).
The b-th subproblem of the block-diagonal HF method is equivalent to minimizing the overall objective (1) subject to the constraint Δθ_{b'} = 0 for all b' ≠ b, since under this constraint the second-order term involves only the diagonal block G_bb and every other entry of G is multiplied by zero. This confirms that block-diagonal HF as described above is equivalent to ordinary HF with the curvature matrix replaced by a block-diagonal approximation that includes only terms involving pairs of parameters from the same block.
The problem (2) has thus been separated into B independent subproblems, one per block, reducing the dimensionality of the search space that CG needs to consider. Although we have B subproblems to solve for one update, each subproblem is smaller and requires fewer CG iterations. Hence the total compute needs are on par with those of the HF method at the same minibatch sizes; if the independent subproblems can be executed in parallel (e.g., on multiple nodes of a distributed system), there is potential for up to a B-fold speed improvement. As we demonstrate below, block-diagonal Hessian-free achieves better performance than the HF method on deep autoencoders, multilayer LSTMs, and deep CNNs.
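The per-block update can be sketched as follows; this is an illustrative toy (two blocks of two parameters each, with direct 2-by-2 solves standing in for the truncated CG runs of the real method), and all names and numeric values are hypothetical.

```python
# Sketch of the block-diagonal update: each subproblem
# min_d g_b^T d + 0.5 d^T G_bb d is solved using only that block's
# gradient g_b and diagonal curvature block G_bb, and the per-block
# solutions are concatenated into one update.

def solve_2x2(G, b):
    # direct solve of G d = b for a 2x2 block (a CG run in the real method)
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [(G[1][1] * b[0] - G[0][1] * b[1]) / det,
            (G[0][0] * b[1] - G[1][0] * b[0]) / det]

# gradient blocks and the diagonal blocks of a 4x4 curvature matrix,
# partitioned into B = 2 parameter blocks of size 2
grad_blocks = [[1.0, 2.0], [3.0, 4.0]]
curv_blocks = [[[4.0, 1.0], [1.0, 3.0]],
               [[5.0, 0.0], [0.0, 2.0]]]

# the minimizer of each quadratic subproblem solves G_bb d = -g_b;
# off-diagonal curvature between blocks is simply ignored
update = []
for g, G_bb in zip(grad_blocks, curv_blocks):
    update += solve_2x2(G_bb, [-gi for gi in g])
```

Because each block's solve reads only its own gradient and curvature block, the loop body could run on separate workers with no communication until the final concatenation.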
2.2 Implementation Details
We partition the network parameters into blocks based on the architecture of the network. When partitioning, we try to define roughly equal-sized blocks, which allows each subproblem to make similar progress with the same number of CG iterations. We also seek to partition the network such that parameters whose gradients we expect to be strongly correlated fall in the same block. For example, in our experiments we split the autoencoder network into two blocks, one for the encoder and one for the decoder; for the multilayer LSTM, we treat each layer of recurrent cells as a block; and for the deep CNN, we divide the convolutional layers into three contiguous blocks.
When solving problem (2), we use truncated conjugate gradient (Yuan, 2000), terminating the CG iteration before reaching the local minimum. There are two reasons for this truncation. First, CG iterations are expensive, and later iterations provide diminishing improvements. More importantly, when we use minibatches to evaluate the curvature-vector product, early termination of CG keeps the update from overfitting to the specific minibatch.
One way to reduce the computational burden of the HF method is to use smaller minibatches to evaluate the curvature-vector product while still using a large minibatch to evaluate the objective and the gradient (Byrd et al., 2011, 2012; Kiros, 2013). Martens (2010) similarly implements the HF method using the full dataset to evaluate the objective and the gradient, and minibatches to calculate the curvature-vector products. This is possible because Newton-like methods are more tolerant of approximations to the Hessian than of approximations to the gradient (Byrd et al., 2011). In our implementation, the curvature minibatch is chosen to be a strict subset of the gradient minibatch, as shown in Algorithm 1.
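The minibatch layout can be sketched as follows; `draw_batches` and the batch sizes are hypothetical names used only for illustration, with the curvature minibatch drawn as a strict subset of the gradient minibatch as in Algorithm 1.

```python
# Sketch of the two-level minibatch scheme: a large minibatch for the
# objective and gradient, and a strict subset of it for the
# curvature-vector products.

import random

def draw_batches(dataset_indices, grad_bs, curv_bs):
    assert curv_bs < grad_bs          # curvature batch is a strict subset
    grad_batch = random.sample(dataset_indices, grad_bs)
    curv_batch = grad_batch[:curv_bs]
    return grad_batch, curv_batch

g, c = draw_batches(list(range(1000)), grad_bs=512, curv_bs=64)
```

Sharing samples between the two batches means the curvature is evaluated at the same data the gradient saw, at a fraction of the cost of computing curvature products over the full gradient minibatch.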
However, small minibatches inevitably make the curvature estimate deviate from the true curvature, reducing the convergence benefits of the HF method over first-order optimization (Martens and Sutskever, 2012). In practice it is not trivial to choose a minibatch size that balances accurate estimation of curvature against the computational burden (Byrd et al., 2011). The key to making Hessian-free methods, including block-diagonal Hessian-free, converge well with small curvature minibatches is to use short CG runs to combat minibatch overfitting.
Martens (2010) suggests using factored Tikhonov damping to make the HF method more stable. With damping, G + λI is used as the curvature matrix to make the curvature "more" positive definite, where λ controls the intensity of damping. We also incorporate damping in many of our experiments. For the sake of comparison, we use the same damping strength for the HF method and the block-diagonal HF method, choosing a fixed value for each experiment.
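In a matrix-free setting, damping only requires wrapping the curvature-vector product, since (G + λI)v = Gv + λv. The sketch below uses plain (unfactored) Tikhonov damping with hypothetical values; the factored variant mentioned above rescales the damping per parameter rather than uniformly.

```python
# Wrap a curvature-vector product so CG effectively sees G + lambda * I.

def damped_matvec(matvec, lam):
    def wrapped(v):
        gv = matvec(v)
        return [gvi + lam * vi for gvi, vi in zip(gv, v)]
    return wrapped

# e.g. an indefinite 2x2 "curvature" becomes positive definite once damped
G = [[1.0, 0.0], [0.0, -0.5]]
mv = damped_matvec(lambda v: [G[0][0] * v[0] + G[0][1] * v[1],
                              G[1][0] * v[0] + G[1][1] * v[1]], lam=1.0)
out = mv([1.0, 1.0])   # (G + I) applied to [1, 1]: gives [2.0, 0.5]
```

The wrapped callback can be passed to any matrix-free CG solver unchanged, which is why damping adds essentially no cost per iteration.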
Another suggestion made by Martens (2010) is to use a form of "momentum" to accelerate the HF method. Here, momentum means initializing the CG algorithm with the last CG solution scaled by some constant close to 1, rather than initializing it randomly or with the zero vector. This change often brings additional speedup with little extra computation. We apply the same fixed momentum value in all experiments.
We also adopt fixed hyperparameter settings across the experiments, rather than an adaptive schedule. One reason is that computing the statistics that control adaptive hyperparameter scheduling can cost more than the gradient and curvature-vector product evaluations, which makes the HF method even slower. Furthermore, these tricks are not independent, and it is often unclear how to adjust them to fit every scenario. Our fixed hyperparameters work well in practice across the three different neural network architectures we investigated.
3 Related Work
The Hessian matrix is indefinite for nonconvex objectives, which makes second-order methods unstable, as the local quadratic approximation becomes unbounded from below. Martens and Sutskever (2012) advocate using the generalized Gauss-Newton matrix (Schraudolph, 2002) as the curvature matrix instead, which is guaranteed to be positive semidefinite. Another way to circumvent the indefiniteness of the Hessian is to use the Fisher information matrix as the curvature matrix; this approach has been widely studied under the name "natural gradient descent" (Amari and Nagaoka, 2007; Amari, 1998; Pascanu and Bengio, 2014; Le Roux et al., 2008). In some cases these two curvature matrices are exactly equivalent (Pascanu and Bengio, 2014; Martens, 2016). It has also been argued that the negative eigenvalues of the full Hessian are helpful for finding parameters with lower energy, e.g., in the saddle-free Newton method (Dauphin et al., 2014) and in an approach that mixes the Hessian and Gauss-Newton matrices (He et al., 2016b).
Recently, Martens and Grosse (2015), Grosse and Martens (2016), and Ba et al. (2017) proposed the K-FAC method, which approximates the natural gradient using a block-diagonal or block-tridiagonal approximation to the inverse of the Fisher information matrix, and demonstrated the advantages over first-order methods of a specialized version of this optimizer tailored to deep convolutional networks. In their work, the parameters are partitioned into blocks of similar size and structure to those used in our method.
4 Experiments
We evaluate the performance of the block-diagonal HF method on three deep architectures: a deep autoencoder on the MNIST dataset, a three-layer LSTM for downsampled sequential MNIST classification, and a deep CNN based on the ResNet architecture for CIFAR-10 classification. For all three experiments, we first compare the performance of the block-diagonal HF method with that of Adam (Kingma and Ba, 2014) to demonstrate that block-diagonal Hessian-free handles large batch sizes more efficiently. We then demonstrate the advantage of the block-diagonal method over ordinary Hessian-free by comparing their performance at various curvature minibatch sizes.
Although the block-diagonal HF method needs to solve more quadratic minimization problems, each subproblem is much smaller, and the total computation time is similar to that of the HF method. We note that the independence of the CG subproblems makes the block-diagonal method particularly amenable to a distributed implementation.
We use the Lasagne deep learning framework (Dieleman et al., 2015), based on Theano (Al-Rfou et al., 2016), for our implementation of the HF and block-diagonal HF methods, as we found no other software framework that supports both convenient definition of deep neural networks and the forward-mode automatic differentiation required to implement the R-operator.
4.1 Deep Autoencoder
Our first experiment is conducted on a deep autoencoder task. The goal of a neural network autoencoder is to learn a low-dimensional representation (encoding) of data from an input distribution. The "encoder," a multilayer feedforward network, maps the input data to a low-dimensional vector representation, while the "decoder," another multilayer feedforward network, reconstructs the input data given the low-dimensional vector representation. The autoencoder is trained by minimizing the reconstruction error.
The MNIST dataset (LeCun et al., 2001) is composed of handwritten digits of size 28 × 28, with 60,000 training samples and 10,000 test samples. The pixel values of both the training and test data are rescaled to [0, 1].
Our autoencoder is composed of an encoder with three hidden layers and state sizes 784-1000-500-250-30, followed by a decoder that is the mirror image of the encoder (the autoencoder model is the same as that in Hinton et al. (2006) and Martens (2010) for easy comparison). We train with the mean squared error loss function.
For hyperparameters, we use a fixed learning rate, no damping, and a fixed maximum number of CG iterations (max_cg_iters) for both the HF and block-diagonal HF methods. For block-diagonal HF, we define two blocks: one for the encoder and one for the decoder. For Adam, we use the default setting in Lasagne with learning rate 0.001, β₁ = 0.9, β₂ = 0.999, and the default ε.
A performance comparison between Adam, HF, and block-diagonal HF is shown in Figure 1. For Adam, the number of dataset epochs needed to converge and the final achievable reconstruction error are heavily affected by the minibatch size, with a similar number of updates required for small-minibatch and large-minibatch training. Our block-diagonal HF method with a large minibatch size achieves approximately the same reconstruction error as Adam with small minibatches, while requiring an order of magnitude fewer updates to converge than Adam with either small or large minibatches. Moreover, block-diagonal Hessian-free provides consistently better reconstruction error, on both the training and test sets, than the HF method over the entire course of training. This advantage holds across different values of the curvature minibatch size.
4.2 Multilayer LSTM
Our second experiment is conducted using a three-layer stacked LSTM on the sequential MNIST classification task. The MNIST data is downsampled by average pooling. The network has three LSTM layers (Hochreiter and Schmidhuber, 1997; Gers et al., 2002) followed by a fully-connected layer applied to the final layer's last hidden state. Each LSTM has 10 hidden units with peephole connections (Gers et al., 2002).
For HF and block-diagonal HF, we use a fixed learning rate, a fixed damping strength, and a fixed maximum number of CG iterations (max_cg_iter). The block-diagonal method has three blocks, one for each LSTM layer, with the top block also containing the fully-connected layer. For Adam, we again use a learning rate of 0.001, β₁ = 0.9, β₂ = 0.999, and the default ε.
A performance comparison between block-diagonal HF, HF, and Adam is shown in Figure 2. As in the autoencoder case, the block-diagonal method with large minibatches requires far fewer updates to achieve lower training loss and better test accuracy than Adam with any minibatch size. Furthermore, compared to HF, the block-diagonal HF method requires fewer updates, achieves better minima, and exhibits less performance deterioration for small curvature minibatch sizes.
4.3 Deep Convolutional Neural Network
We also train a deep convolutional neural network (CNN) on the CIFAR-10 classification task with the three optimization methods. The CIFAR-10 dataset has 50,000 training samples and 10,000 test samples, and each sample is a 32 × 32 image with three channels. Our model is a simplified version of the ResNet architecture (He et al., 2016a). It has one convolutional layer at the bottom, followed by three residual blocks and a fully-connected layer at the top. We did not include batch normalization layers, as computing the Hessian-vector product becomes extremely slow when batch normalization layers are involved in the Theano framework. For HF and block-diagonal HF, we use a fixed learning rate, a fixed damping strength, and a fixed maximum number of CG iterations (max_cg_iter). The block-diagonal method again has three blocks, one for each residual block, with the top and bottom blocks also containing the fully-connected and convolutional layers respectively. We use the same default Adam hyperparameters as before.
The common practice of training deep CNNs using custom-tuned learning rate decay schedules does not straightforwardly extend to the second-order case. However, Grosse and Martens (2016) suggest that Polyak averaging (Polyak and Juditsky, 1992) can obviate the need for learning rate decay while still achieving high test accuracy. To ensure a fair comparison, we apply Polyak averaging with a fixed exponential decay rate when evaluating the test accuracy for all three algorithms.
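Polyak averaging in its exponential-moving-average form can be sketched as follows; the decay value and the toy iterates below are hypothetical, not the values used in our experiments.

```python
# Exponential-moving-average variant of Polyak averaging: evaluation
# uses a decayed running average of the training iterates rather than
# the raw final parameters.

def polyak_update(avg, params, decay):
    if avg is None:
        return list(params)            # initialize from the first iterate
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = None
for params in ([1.0, 0.0], [2.0, 2.0], [3.0, 4.0]):   # toy training iterates
    avg = polyak_update(avg, params, decay=0.5)
```

At test time, the model is evaluated with `avg` in place of the latest parameters, which smooths out the oscillations that a decaying learning rate would otherwise suppress.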
A performance comparison between block-diagonal Hessian-free, Hessian-free, and Adam is shown in Figure 3. Block-diagonal HF with large minibatches obtains test accuracy comparable to Adam with small ones. Furthermore, the block-diagonal method achieves slightly better training loss, higher test accuracy, and substantially more stable training than Hessian-free for three different curvature minibatch sizes.
Although not plotted in the figures, the time consumption of block-diagonal HF and that of HF are comparable in our experiments. The time per iteration of block-diagonal HF and HF is 5 to 10 times that of Adam. However, the total number of iterations of block-diagonal HF and HF is much smaller than Adam's, and both methods have the potential benefit of parallelization for large minibatches.
5 Conclusion and Discussion
We propose a block-diagonal HF method for training neural networks. This approach divides the network parameters into blocks, then solves an independent conjugate gradient subproblem for each parameter block. This extension of the original HF method reduces the number of updates needed to train several deep learning models while improving training stability and reaching better minima. Compared to first-order methods, including the popular Adam optimizer, block-diagonal HF scales significantly better to large minibatches, requiring an order of magnitude fewer updates in the large-batch regime.
Our results strengthen the claim of Collobert (2004) that "the more block-diagonal the Hessian, the easier it is to train" a neural network by showing that, in the case of Hessian-free optimization, simply ignoring off-block-diagonal curvature terms improves convergence properties.
Due to the separability of the subproblems for different parameter blocks, the block-diagonal HF method we introduce is inherently more parallelizable than the ordinary HF method. Future work can take advantage of this feature to apply the block-diagonal HF method to large-scale machine learning problems in a distributed setting.
References
 Al-Rfou et al. (2016) R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
 Amari (1998) S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Amari and Nagaoka (2007) S.-I. Amari and H. Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.
 Ba et al. (2017) J. Ba, R. Grosse, and J. Martens. Distributed second-order optimization using Kronecker-factored approximations. In International Conference on Learning Representations (ICLR), 2017.
 Baydin et al. (2015) A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.
 Byrd et al. (2011) R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
 Byrd et al. (2012) R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.
 Chen et al. (2016) J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
 Collobert (2004) R. Collobert. Large Scale Machine Learning. PhD thesis, Université de Paris VI, 2004.
 Das et al. (2016) D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.
 Dauphin et al. (2014) Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
 Dean et al. (2012) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
 Dieleman et al. (2015) S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, D. Maturana, M. Thoma, E. Battenberg, J. Kelly, J. D. Fauw, M. Heilman, D. M. de Almeida, B. McFee, H. Weideman, G. Takács, P. de Rivaz, J. Crall, G. Sanders, K. Rasul, C. Liu, G. French, and J. Degrave. Lasagne: First release., Aug. 2015. URL http://dx.doi.org/10.5281/zenodo.27878.
 Dinh et al. (2017) L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
 Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Gers et al. (2002) F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug):115–143, 2002.
 Goyal et al. (2016) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2016.
 Grosse and Martens (2016) R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), 2016.

 He et al. (2016a) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016a.
 He et al. (2016b) X. He, D. Mudigere, M. Smelyanskiy, and M. Takáč. Large scale distributed Hessian-free optimization for deep neural network. arXiv preprint arXiv:1606.00511, 2016b.
 Hinton et al. (2006) G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
 Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Keskar et al. (2016) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836.
 Kingma and Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kiros (2013) R. Kiros. Training neural networks with stochastic Hessian-free optimization. arXiv preprint arXiv:1301.3641, 2013.
 Le Roux et al. (2008) N. Le Roux, P.-A. Manzagol, and Y. Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2008.
 LeCun et al. (2001) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
 LeCun et al. (1998) Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
 Martens (2010) J. Martens. Deep learning via Hessian-free optimization. In International Conference on Machine Learning (ICML), pages 735–742, 2010.
 Martens (2016) J. Martens. Second-order optimization for neural networks. PhD thesis, University of Toronto, 2016.
 Martens and Grosse (2015) J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), pages 2408–2417, 2015.

 Martens and Sutskever (2011) J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040, 2011.
 Martens and Sutskever (2012) J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.
 Nesterov (2013) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Pascanu and Bengio (2014) R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. In International Conference on Learning Representations (ICLR), 2014.
 Pearlmutter (1994) B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural computation, 6(1):147–160, 1994.
 Polyak and Juditsky (1992) B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 Qian (1999) N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
 Recht et al. (2011) B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
 Schraudolph (2002) N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
 Sutskever et al. (2013) I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. International Conference on Machine Learning (ICML), 28:1139–1147, 2013.
 Takác et al. (2013) M. Takác, A. S. Bijral, P. Richtárik, and N. Srebro. Minibatch primal and dual methods for SVMs. In International Conference on Machine Learning (ICML), pages 1022–1030, 2013.
 Yuan (2000) Y. Yuan. On the truncated conjugate gradient method. Mathematical Programming, 87(3):561–573, 2000.
 Zhang et al. (2015) S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.