1 Introduction
As dataset sizes and neural network complexity increase, DNN training times have exploded. It is common to spend multiple weeks training an industrial-scale DNN model on a single machine with multiple GPUs. Reducing DNN training times using distributed DNN training Abadi et al. (2016) is becoming imperative to achieve fast, reasonable turnaround times for research experiments and for training production DNN models. Faster DNN training methods also reduce the cost of training.
In this work we study the limits of synchronous DNN training on popular large-scale DNN training tasks: ImageNet classification Russakovsky et al. (2015) and CIFAR100 classification. Several system-level challenges need to be resolved in order to implement an efficient large-scale synchronous distributed training algorithm. Once the throughput and latency issues have been optimized, the fundamental limitation to scaling a synchronous DNN algorithm is making effective use of an extremely large minibatch. Recent work extends SGD-based algorithms to maintain model accuracy as the minibatch size increases Akiba et al. (2017), Goyal et al. (2017), You et al. (2017), Smith et al. (2017), Shaw et al. (2018), and these methods have shown promise for training Resnet50 on ImageNet.
In this work, we take an orthogonal approach and explore an algorithm that uses large minibatches more effectively than SGD and its variants. We study the performance of various optimizers in the standard DNN training setting and propose to use the preconditioned nonlinear conjugate gradient (NLCG) method Polak & Ribière (1969), Fletcher & Reeves (1964), Nocedal & Wright (2006) for DNN training. We demonstrate how to efficiently use second order information in the preconditioner of the NLCG optimizer. To our knowledge, this is the first work to show that NLCG-based optimizers can provide better solutions than SGD-based optimizers on large-scale DNN training tasks like ImageNet, particularly for large minibatch training, which is essential for scaling up synchronous distributed training.
In Section 2, we review distributed DNN training methods and their challenges. In Section 3, we describe the stochastic preconditioned nonlinear conjugate gradient method and its application to DNN training. In Section 4 we compare NLCG and SGD based methods for training the Resnet50 model for the ImageNet classification task and training the Resnet32 model for the CIFAR100 classification task.
2 Distributed DNN training
DNN training can be distributed either by splitting the model (model parallelism) or splitting the data (data parallelism) Dean et al. (2012). In model parallelism, the neural network is split across multiple worker nodes. This is typically employed when the network is too large to fit on a single worker. More common is data parallelism, in which each worker uses different subsets of the data (minibatches) to train the same model. When a parameter server is used Dean et al. (2012), the consensus DNN weights are stored on the parameter server, while each worker keeps a copy of the DNN graph and the latest weights. At each iteration each worker samples a minibatch of training data and computes an estimate of the gradient of the loss with respect to the weights. All workers communicate their estimates to the parameter server, which updates the weights and broadcasts the updated weights to all workers.
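The synchronous variant of this loop can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation; `sync_step` and `grad_fn` are hypothetical names, with `grad_fn` standing in for backpropagation on one worker's minibatch.

```python
import numpy as np

def sync_step(weights, worker_minibatches, grad_fn, lr=0.1):
    """One synchronous data-parallel update: every worker computes a
    gradient estimate on its own minibatch, then the parameter server
    averages the estimates and applies a single update to the
    consensus weights, which are broadcast back to all workers."""
    grads = [grad_fn(weights, mb) for mb in worker_minibatches]
    avg_grad = np.mean(grads, axis=0)   # server-side averaging
    return weights - lr * avg_grad      # new consensus weights
```

The effective minibatch here is the union of all per-worker minibatches, which is exactly what makes synchronous training equivalent to large-minibatch SGD.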
2.1 Data Parallel Distributed DNN training algorithms
One can use different distributed training algorithms to apply the weight update within the data-parallel framework. In asynchronous training methods like asynchronous SGD Dean et al. (2012), the workers execute without synchronization. Each worker computes an update to the weights independently on different minibatches. The update is communicated to the parameter server, which updates the weights and communicates them back to the worker. The weight updates happen asynchronously. Since there is no explicit synchronization point, overall system throughput is high. To further reduce communication overhead, methods like Block Momentum SGD (BMSGD) Chen & Huo (2016) and Elastic Averaging SGD (EASGD) Zhang et al. (2014) have been proposed. These perform N updates on each worker before communicating an N-step weight update to the parameter server. They increase system throughput by raising the computation-to-communication ratio, but introduce additional disparity between the weights and the updates applied to them.
Synchronous DNN training Dean et al. (2012) works by explicitly synchronizing the workers. Coordination happens when all the gradient estimates (one per worker) are averaged and applied to the master model. Synchronous DNN training methods can be implemented using parameter servers Abadi et al. (2016) or techniques from High Performance Computing (HPC) like ring allreduce averaging Gibiansky (2017). With synchronous DNN training the effective minibatch size increases. A larger minibatch reduces the variance of the stochastic gradient, and this reduced variance allows larger steps to be taken (i.e., the learning rate can be increased) Goyal et al. (2017). It has been shown that for the ImageNet training task, as the number of workers increases, synchronous distributed methods tend to perform better than asynchronous methods Chen et al. (2016). In our experiments, we also observe that given a fixed number of epochs, synchronous DNN training reaches better final accuracy than its asynchronous counterparts. Hence we focus on optimization methods for scaling synchronous DNN training as the number of workers increases. We conduct our research on training the popular Resnet50 model He et al. (2015) on the ImageNet image classification task Russakovsky et al. (2015). We also test our proposed optimization algorithm on training the Resnet32 model on the CIFAR100 image classification task.
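The linear learning rate scaling rule referenced above can be stated as a one-line computation. A minimal sketch; the function name and reference values are illustrative, not from the paper.

```python
def scaled_learning_rate(base_lr, base_batch, num_workers, per_worker_batch):
    """Linear scaling rule (Goyal et al., 2017): when the effective
    minibatch grows by a factor k relative to a reference batch size,
    grow the learning rate by the same factor k."""
    effective_batch = num_workers * per_worker_batch
    return base_lr * effective_batch / base_batch
```

For example, moving from a reference batch of 256 to 32 workers with 64 samples each (effective batch 2048) scales the learning rate by 8x.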
2.2 Scaling Synchronous Distributed DNN training throughput
The work in Chen et al. (2016) shows that the throughput of parameter-server-based distributed synchronous training is limited by a straggler effect: the slowest worker or communication link dominates overall system throughput. It also suggests that, to improve system throughput and mitigate the straggler effect, the number of gradients averaged should be slightly smaller than the number of workers in the cluster.
Synchronous DNN training throughput can be further improved by using the distributed ring allreduce algorithm Sergeev & Balso (2018) from the High Performance Computing (HPC) field to average the gradients from different machines. The Horovod library Sergeev & Balso (2018), built on top of Tensorflow, provides a good implementation of ring allreduce using the MPI and NCCL libraries.
2.3 Large Minibatch Size DNN Training
Scaling up synchronous distributed training requires the ability to train with large minibatches. In synchronous distributed training, we should keep the per-worker minibatch size as high as possible to maintain high system throughput (a high compute-to-communication ratio). With a fixed minibatch size per worker, the effective minibatch size per training step grows as the number of workers increases. If we keep the number of training epochs constant, the number of training steps (weight updates) shrinks. So as we scale the number of workers to achieve lower training times, the DNN needs to train in fewer steps. Up to a certain minibatch size, this can be achieved by increasing the learning rate proportionally to the minibatch size Goyal et al. (2017). Different research groups have observed that at some problem-specific minibatch size, the accuracy of DNN training starts to degrade because training fails to converge to an optimal point Goyal et al. (2017), Shallue et al. (2018). Figure 4 shows this effect for training the Resnet50 model on the ImageNet dataset, where the optimization fails to reach baseline accuracy in 90 epochs for minibatch sizes above 32768. As suggested by Hoffer et al. (2017), we also observe that training longer with large minibatches helps to reduce training suboptimality.
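The tension above comes down to simple arithmetic: holding epochs fixed, the number of weight updates falls inversely with the effective minibatch size. A small sketch, assuming the standard ImageNet-1k training set size of 1,281,167 images in the example below; the function name is illustrative.

```python
def num_weight_updates(epochs, dataset_size, effective_batch):
    """With the number of epochs held fixed, the number of weight
    updates shrinks inversely with the effective minibatch size, so
    each update must make correspondingly more progress."""
    return epochs * dataset_size // effective_batch
```

At 90 epochs, going from an effective batch of 512 to 32768 cuts the update count by 64x, which is why the optimizer's per-step effectiveness becomes the bottleneck.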
In addition to solving the system challenges of implementing large-scale efficient synchronous DNN training, another fundamental challenge for scaling synchronous DNN training is to make effective use of an extremely large minibatch in the optimization process. Goyal et al. (2017) propose to increase the learning rate linearly with the minibatch size. You et al. (2017) propose layer-specific gradient normalization to scale to larger minibatch sizes. Smith et al. (2017) propose to progressively increase the minibatch size during training to control gradient noise levels in the early parts of the optimization process. Shaw et al. (2018) propose to use progressively larger images during training. Jia et al. (2018) propose to use mixed-precision training to scale to large minibatches.
In this work, we devise a novel preconditioned NLCG DNN training algorithm with second order information that uses the large minibatch sizes more effectively than SGD based training algorithms.
3 Nonlinear conjugate gradient method for DNN training
3.1 DNN Training Optimization Background
The objective of stochastic DNN training is to minimize the minibatch loss $L(w, B) = \frac{1}{|B|} \sum_{x \in B} \ell(w, x)$, where $w$ represents the model weights, $B$ is a minibatch, and $\ell(w, x)$ is the loss for a sample $x$. Each iteration of stochastic gradient descent proceeds as follows: First, a minibatch $B_t$ (a subset of the training data) is sampled and used to compute a gradient estimate $g_t = \nabla_w L(w_t, B_t)$, where $t$ is the current training step. The weights are then updated with the rule $w_{t+1} = w_t - \alpha g_t$, where $\alpha$ is a hyperparameter termed the learning rate. When the minibatch is small, the variance of the gradient estimate is high. For the optimization to succeed, $\alpha$ should be small.
Popular variants of SGD generalize the above rule to $w_{t+1} = w_t - \alpha d_t$, where $d_t = \phi(g_1, \ldots, g_t)$ is a function of previous gradient estimates. These methods include SGD with momentum (Momentum), RMSProp, Adam Kingma & Ba (2014), ADAGrad and others. Since they only use gradient information (the first derivative) they are referred to as first-order methods.

Newton’s method achieves faster convergence than first-order methods by leveraging information about the Hessian of the target function. It models the function as a quadratic by using a second order Taylor expansion at the current iterate, then minimizes the approximation to choose the next iterate. This results in the update vector $d_t = H_t^{-1} g_t$ with $H_t = \nabla_w^2 L(w_t)$. Observe that the computation of the update vector requires the inverse of the Hessian, or the solution of the system $H_t d_t = g_t$. A model with $n$ weights results in a Hessian with $n^2$ entries. For a DNN with millions of weights, computing, storing and solving $H_t$ is prohibitively expensive. To this end, Hessian-free optimization Martens (2010) uses Krylov-subspace methods to solve the linear system without ever forming $H_t$. Variants of these methods use linear conjugate gradient (CG), conjugate residuals (CR) and others. Importantly, these methods only require forming products of the form $H_t v$, which can be achieved with the Pearlmutter trick Pearlmutter (1994).
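The key idea, forming $H v$ without ever materializing $H$, can be illustrated with a finite-difference stand-in. Note this is not the exact Pearlmutter trick (which obtains the product exactly via automatic differentiation); the central-difference version below is a simple sketch with the same $O(n)$ memory footprint, and `hvp`/`grad_fn` are hypothetical names.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate Hessian-vector product without forming the Hessian,
    via a central difference of the gradient:
        H v ≈ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    Only two extra gradient evaluations and O(n) memory are needed."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)
```

For a quadratic loss $\frac{1}{2} w^T A w$ the gradient is linear, so this approximation recovers $A v$ up to floating-point error.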
Second order methods have been previously used for DNN training Le et al. (2011), Bollapragada et al. (2018), Martens (2010), Cho et al. (2015). However, achieving good convergence on large-scale DNN training problems like ImageNet classification has been difficult. Recently, natural-gradient-based methods like KFAC Jimmy Ba & Martens (2017) and methods using Neumann power series Krishnan et al. (2017) have shown promise in training large-scale problems like ImageNet classification.
3.2 Nonlinear Conjugate Gradient Method
The linear CG method is not only useful for solving linear systems; it is an optimization algorithm in its own right. For a quadratic problem and a given accuracy threshold, CG will typically converge much faster than gradient descent.
The NLCG method generalizes the linear CG method to nonlinear optimization problems and can also work for nonconvex problems Boyd & Vandenberghe (2004). However, it has not been explored successfully for large-scale DNN training tasks (e.g. ImageNet classification). At each step $t$, NLCG chooses a direction $d_t$ that is conjugate (H-orthogonal) to all the previous directions (i.e. $d_t^T H d_i = 0$ for all $i < t$). In contrast to the gradient descent method, using conjugate directions helps avoid exploring the same directions multiple times. It is possible to further improve convergence of the NLCG method by using second order information through the preconditioner. We hypothesize that with large minibatches, the variance of the gradients is reduced, which makes the NLCG method effective for DNN training.
We describe the stochastic preconditioned NLCG algorithm for DNN training in Algorithm 1. The overall structure is very similar to the classic preconditioned NLCG algorithm Nocedal & Wright (2006). The key differences are the introduction of an efficient quasi-Newton preconditioner and an online stochastic line search that automatically determines the step size. At each iteration, we compute the gradient of the minibatch using backpropagation. As described in Algorithm 2 and Section 3.2.1, we compute a diagonal preconditioner that estimates the curvature of the optimization problem at the current step. The diagonal preconditioner introduces second order information into the optimization process and reduces the condition number of the system matrix to speed up convergence.
After preconditioning the gradient, we compute the conjugate direction using the Polak-Ribiere (PR) Polak & Ribière (1969) or Fletcher-Reeves (FR) Fletcher & Reeves (1964) update formula to compute the $\beta$ term in Algorithm 1. Like the momentum term in the Momentum optimizer, $\beta$ determines how much of the previous direction to keep in the new update. The conjugate gradient update formulas, PR and FR, are designed to approximately keep each direction Hessian-conjugate to the previous directions. Conjugate gradient methods help in efficiently traversing optimization landscapes with narrow ill-conditioned valleys, where gradient descent can slow down.
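The direction update can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function names are hypothetical, and clipping $\beta$ at zero (the common PR+ variant, which resets to steepest descent when conjugacy is lost) is an assumption of this sketch.

```python
import numpy as np

def nlcg_direction(g_new, g_old, d_old, diag_precond):
    """Preconditioned Polak-Ribiere direction update. diag_precond holds
    the diagonal of the inverse-Hessian approximation, so preconditioning
    is an elementwise scaling of the gradient."""
    pg_new = diag_precond * g_new
    pg_old = diag_precond * g_old
    # Polak-Ribiere beta: how much of the previous direction to keep,
    # analogous to the momentum coefficient in the Momentum optimizer.
    beta = max(0.0, g_new @ (pg_new - pg_old) / (g_old @ pg_old))
    return -pg_new + beta * d_old
```

With an identity preconditioner and $\beta = 0$ this reduces to steepest descent; nonzero $\beta$ mixes in the previous direction to maintain approximate conjugacy.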
3.2.1 Quasi Newton Preconditioner
Typically, the preconditioner in the NLCG method is supposed to approximate the Hessian inverse ($H^{-1}$) of the optimization problem. In addition, for convergence guarantees, the preconditioner is supposed to be constant during the optimization process. For DNN training problems with several million optimization variables, computing the full Hessian inverse approximation matrix would be prohibitive in both runtime and memory, and it is nontrivial to compute a useful static preconditioner. In our implementation, we use a dynamic preconditioner that is updated with the quasi-Newton BFGS update Nocedal & Wright (2006) at every iteration. The BFGS method approximates the Hessian using past differences of gradients and differences of weights so as to satisfy the secant equation Nocedal & Wright (2006). Our preconditioner is limited to the diagonal of the Hessian inverse, which effectively scales the gradient of each variable individually. By computing only the diagonal, we never need to store the BFGS approximate Hessian matrix, and the inverse of a diagonal Hessian is trivial to compute. All the numerical operations involved are vector additions, subtractions, multiplications or dot products, which keeps the runtime overhead of the diagonal BFGS preconditioner low.
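A diagonal curvature estimate of this kind can be sketched from the secant condition $y \approx H s$ alone. This is an illustrative stand-in, not the paper's exact Algorithm 2: the absolute value, the floor `eps`, and the clipping range are safeguards assumed for this sketch.

```python
import numpy as np

def diag_inv_hessian(s, y, eps=1e-8, lo=1e-4, hi=1e4):
    """Diagonal curvature estimate from the secant condition y ≈ H s,
    where s = w_t - w_{t-1} and y = g_t - g_{t-1}. Applied coordinatewise
    to a diagonal matrix, the secant condition gives the ratio y_i / s_i.
    Returns the diagonal of H^{-1}, used to rescale each gradient
    coordinate individually. All operations are elementwise vector ops,
    so memory and runtime stay O(n)."""
    safe_s = np.where(np.abs(s) > eps, s, eps)     # avoid division by ~0
    diag_h = np.clip(np.abs(y / safe_s), lo, hi)   # positive, bounded curvature
    return 1.0 / diag_h                            # diagonal of H^{-1}
```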
3.2.2 Online Stochastic Line Search
Once the direction of the step is computed, we need to compute the step length. In classical NLCG, a line search procedure Nocedal & Wright (2006) is employed to find the minimum of the function along the search direction. However, in a stochastic setting, it is difficult to get a stable value of the function or gradient because of the high variance introduced by minibatches. Even if the function value is stable, computing the step length using traditional line search methods like secant line search Boyd & Vandenberghe (2004) or Armijo line search Boyd & Vandenberghe (2004) can be very expensive because several function and/or gradient evaluations are required. Instead, we compute a global step length $\alpha_g$ by following a traditional learning rate schedule used in DNN training (see Figure 1). We also compute a learning rate scale $s$, which is multiplied with the global learning rate to get the final learning rate $\alpha = s \cdot \alpha_g$. The learning rate scale $s$ is initialized to 1.0. We monitor the loss function value at each iteration. If the loss increases by more than a certain percentage (2%), we reduce $s$ by 2.5%. On the other hand, if the loss decreases or stays similar (<1% increase), we increase $s$ by 2.5%, up to a maximum of 1.0. The thresholds 1%, 2% and 2.5% are hyperparameters that depend on the variance of the loss function and may be tuned for better performance. This online stochastic line search strategy is inspired by the backtracking line search used in convex optimization.

4 Results
In this section, we detail the results of using the NLCG method to train the Resnet50 model on the ImageNet image classification task and to train the Resnet32 model on the CIFAR100 image classification task.
The ImageNet image classification task is a large-scale learning task on which deep learning models have obtained state-of-the-art accuracy. We conduct our experiments in the large minibatch scenario with 299x299 image crops, using parameter-server-based distributed DNN training. Recent work on synchronous distributed training has shown scaling results up to a minibatch size of 65536 using 1024 GPUs to obtain very fast training times. We restrict our experiments to 64 V100 GPUs, primarily due to availability constraints in our cluster. One Nvidia V100 GPU can efficiently process 64 299x299 image crops from the ImageNet training dataset, so the maximum native minibatch size we can experiment with is 4096. For batch sizes greater than 4096, we use virtual batching to simulate the effect of larger minibatches. In virtual batching, each worker uses multiple minibatches to compute gradients and averages them before communicating the update to the parameter server. The parameter server then averages the averaged gradients before applying the update to the shared weight parameters. This effectively increases the minibatch size for a fixed number of workers. Virtual batching increases the system's computation-to-communication ratio and improves distributed training throughput.
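The per-worker side of virtual batching amounts to gradient accumulation. A minimal sketch with hypothetical names; `grad_fn` stands in for backpropagation on one physical minibatch.

```python
import numpy as np

def virtual_batch_gradient(w, physical_minibatches, grad_fn):
    """Virtual batching: a worker accumulates gradients over several
    physical minibatches and ships only their average to the parameter
    server, simulating a larger minibatch and raising the
    compute-to-communication ratio."""
    acc = None
    for mb in physical_minibatches:
        g = grad_fn(w, mb)
        acc = g if acc is None else acc + g
    return acc / len(physical_minibatches)
```

The server then averages these per-worker averages, so the result is equivalent to one gradient over the full effective minibatch.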
Each step of the preconditioned NLCG optimizer does more work than a step of traditional SGD, SGD with Momentum (Momentum) or RMSProp with Momentum (RMSProp), so system throughput is expected to be lower than for traditional SGD-based training. In our experiments, a single NLCG training step is about 15% to 30% slower than natively implemented RMSProp/Momentum. Note that, due to virtual batching, this overhead is not visible in Figure 3: the cost of computing the average gradient of the virtual minibatch dominates the additional computational overhead of the NLCG optimizer.
4.1 Convergence studies
The initial learning rate is set to 0.001. We employ standard DNN training with a learning rate warmup Goyal et al. (2017) for 5 epochs, increasing the learning rate to its maximum value. For minibatch sizes less than 8192, we reduce the learning rate to 0.001 using exponential decay every 2 epochs. For minibatch sizes greater than or equal to 8192, we reduce the learning rate to 0.01 using exponential decay every 2 epochs. For minibatch sizes greater than 8192, the number of warmup epochs is increased to 15; for minibatch sizes greater than 32768, it is increased to 30. To be consistent across experiments, we keep the same learning rate schedule for all optimizers, tuned for the Momentum optimizer. With this setting, we are able to achieve state-of-the-art top-1 accuracy for the baseline SGD-based optimizers. For illustration, we show the learning rate schedule for the NLCG, RMSProp and Momentum optimizers with a batch size of 65536 in Figure 1. We use a momentum value of 0.9 for both the RMSProp and Momentum optimizers, and the standard image augmentation pipeline with 299x299 image crops.
With these settings, our baseline results for the Momentum and RMSProp optimizers are similar to those of Goyal et al. (2017), which is the closest work to our setting of traditional DNN training. We do not use techniques customized for large batch ImageNet training from recent works, such as LARS You et al. (2017), progressive batch size increases Smith et al. (2017) or progressive image size increases Shaw et al. (2018). We study the performance of the various optimizers in the standard DNN training setting.
In Figures 2 and 3, we show the training loss function convergence curves for NLCG_FR, NLCG_PR, RMSProp and Momentum optimizers at an extremely large minibatch size of 65536. We observe that at these extremely large minibatch sizes, preconditioned NLCG optimizers have better loss convergence per step. The final loss value achieved by NLCG methods is also better than Momentum or RMSProp.
In Figure 4, we show the top-1 test accuracy of Resnet50 models trained using various optimizers. We train the Resnet50 model on the ImageNet training data for 90 epochs and vary the minibatch size from 512 to 98304, concentrating on large and extremely large minibatch sizes. For batch sizes of 16384 and lower, all optimizers reach a top-1 accuracy close to or greater than 75%, with the best accuracy of 76.9% obtained by the NLCG_FR optimizer at a batch size of 1024. For comparison, we show the best baseline top-1 accuracy (76.9%) as a separate line. We observe that as the batch size increases beyond 16384, all four optimizers under study start losing top-1 test accuracy; among them, the preconditioned NLCG optimizers drop the least. This is in line with the observation in Figure 2 that the NLCG optimizers optimize the training loss better, achieving a lower final loss value. At a batch size of 65536, the difference in accuracy between NLCG_FR and Momentum is about 10.3%; at 98304, it is about 19.5%. For Resnet50 training on ImageNet with a fixed number of epochs, the NLCG optimizers provide a lower training loss at large batch sizes.
As seen in Figure 4, with large minibatches of 65536 and 98304, we are not able to get close to the baseline top-1 accuracy when training for 90 epochs. We want to understand whether this is a result of training difficulty or a generalization issue Keskar et al. (2016) caused by large batch sizes. In Figure 5, we fix the batch size to 65536 and run the training longer: 90, 150 and 180 epochs. For each experiment, we change the learning rate schedule so that the final learning rate at the end of the specified number of epochs is 0.01. We observe that at 180 epochs, all 4 optimizers get close to 75% top-1 accuracy, with the NLCG optimizers performing best. In Figure 6, we repeat the same study at an even larger batch size of 98304 and train for 90, 180 and 270 epochs. At this batch size, only the NLCG optimizers reach top-1 accuracy close to 75% by training longer. Our conclusion is similar to Hoffer et al. (2017): for extremely large minibatches, training longer reduces training suboptimality and improves test accuracy. As the batch sizes grow, the preconditioned NLCG optimizers start to dominate the Momentum and RMSProp optimizers in loss/accuracy convergence per step. Note that the ImageNet convergence/accuracy graphs are generated using the Resnet50 V2 architecture; we repeated the study with the Resnet50 V1 architecture and reached the same conclusion.
To study the performance of the NLCG optimizer on a different dataset, we repeat the experiments training the Resnet32 model on the CIFAR100 data. The results are detailed in Figures 7 and 8. We run the training for 200 epochs and report the top-1 accuracy for the NLCG_FR, NLCG_PR, Momentum, RMSProp and SGD optimizers. We run 5 experiments per data point and show the mean and standard deviation error bars. For comparison, we also show the best top-1 accuracy (68.9%) achieved by any optimizer (NLCG_FR) at any minibatch size (256). For the NLCG optimizers, we had to turn off the line search for minibatch sizes less than 2048, because the minibatch loss value was too noisy for meaningful loss monitoring. We observe that as the minibatch size increases to 16384, all optimizers start losing top-1 accuracy; NLCG_PR and NLCG_FR maintain the best top-1 accuracy when training for a fixed number of epochs (200).
In Figure 8, we fix the minibatch size to 16384 and progressively increase the number of training epochs from 200 to 500. We observe that the NLCG-based optimizers obtain the best accuracies. The standard deviations of some points are large because some runs can diverge at such a large batch size. Across these experiments, the NLCG optimizers appear to be more robust to such divergences.
5 Conclusion and Future Work
Training with large minibatches is essential to scaling up synchronous distributed DNN training. We proposed a novel NLCG-based optimizer that uses second order information to scale DNN training to extremely large minibatches. The NLCG optimizer uses conjugate gradients and an efficient diagonal preconditioner to speed up training convergence. We demonstrated on the ImageNet and CIFAR100 datasets that for large minibatches, our method outperforms existing state-of-the-art optimizers by a large margin.
The NLCG optimizer has a runtime overhead because of the extra processing required to compute the preconditioned search direction. For this work, we implemented the preconditioned NLCG optimizer in TensorFlow using several TensorFlow ops. Its runtime could be further improved by implementing the optimizer as a fused native TensorFlow op with an optimized C++ / CUDA implementation.
We have mainly focused this study on the large minibatch scenario. The NLCG method is stable in this regime because the gradients have low variance and are meaningful for computing valid preconditioners and conjugate search directions. As the minibatch becomes very small (e.g. 128), the variance of the gradients increases and the NLCG method becomes unstable (in preconditioning, conjugate search direction computation and line search). We intend to research techniques to stabilize the NLCG method in the small minibatch, high gradient variance scenario.
Acknowledgements
We would like to thank Ashish Shrivastava, Dennis De Coste, Shreyas Saxena, Santiago Akle, Russ Webb, Barry Theobald and Jerremy Holland for valuable comments for manuscript preparation.
References
 Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016. URL http://arxiv.org/abs/1603.04467.
 Akiba et al. (2017) Akiba, T., Suzuki, S., and Fukuda, K. Extremely large minibatch sgd: Training resnet50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

Bollapragada et al. (2018)
Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.J., and Tang, P. T. P.
A progressive batching lBFGS method for machine learning.
Proceedings of the 35th International Conference on Machine Learning, 80:620–629, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/bollapragada18a.html.  Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. 2004.
 Chen et al. (2016) Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous sgd. International Conference on Learning Representations Workshop Track, 2016. URL https://arxiv.org/abs/1604.00981.
 Chen & Huo (2016) Chen, K. and Huo, Q. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and block-wise model-update filtering. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016.

 Cho et al. (2015) Cho, M., Dhir, C., and Lee, J. Hessian-free optimization for learning deep multidimensional recurrent neural networks. Advances in Neural Information Processing Systems 28, pp. 883–891, 2015.
 Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. Advances in neural information processing systems, pp. 1223–1231, 2012.
 Fletcher & Reeves (1964) Fletcher, R. and Reeves, C. M. Function minimization by conjugate gradients. Comput. J., 1964.
 Gibiansky (2017) Gibiansky, A. Bringing hpc techniques to deep learning, 2017. URL http://research.baidu.com/bringinghpctechniquesdeeplearning.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.
 Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems, 2017.
 Jia et al. (2018) Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., Xie, L., Guo, Z., Yang, Y., Yu, L., Chen, T., Hu, G., Shi, S., and Chu, X. Highly scalable deep learning training system with mixedprecision: Training imagenet in four minutes. CoRR, 2018. URL https://arxiv.org/abs/1807.11205.
 Jimmy Ba & Martens (2017) Jimmy Ba, R. G. and Martens, J. Distributed secondorder optimization using kroneckerfactored approximations. Proceedings of the International Conference on Learning Representations, 2017.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On largebatch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Krishnan et al. (2017) Krishnan, S., Xiao, Y., and Saurous, R. A. Neumann optimizer: A practical optimization algorithm for deep neural networks. CoRR, abs/1712.03298, 2017. URL http://arxiv.org/abs/1712.03298.
 Le et al. (2011) Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 265–272, 2011. URL http://dl.acm.org/citation.cfm?id=3104482.3104516.
 Martens (2010) Martens, J. Deep learning via hessianfree optimization. Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 735–742, 2010. URL http://dl.acm.org/citation.cfm?id=3104322.3104416.
 Nocedal & Wright (2006) Nocedal, J. and Wright, S. J. Numerical optimization. 2006.
 Pearlmutter (1994) Pearlmutter, B. A. Fast exact multiplication by the hessian. Proceedings of the Neural Computation, 1994.
 Polak & Ribière (1969) Polak, E. and Ribière, G. Note sur la convergence de méthodes de directions conjuguées. Rev. Francaise Informat Recherche Operationelle, 1969.

 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
 Sergeev & Balso (2018) Sergeev, A. and Balso, M. D. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
 Shallue et al. (2018) Shallue, C. J., Lee, J., Antognini, J., SohlDickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. CoRR, 2018. URL https://arxiv.org/abs/1811.03600.
 Shaw et al. (2018) Shaw, A., Bulatov, Y., and Howard, J. Now anyone can train imagenet in 18 minutes. 2018. URL http://www.fast.ai/2018/08/10/fastaidiuimagenet/.
 Smith et al. (2017) Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay the learning rate, increase the batch size. CoRR, abs/1711.00489, 2017. URL http://arxiv.org/abs/1711.00489.
 You et al. (2017) You, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32k for imagenet training. CoRR, abs/1708.03888, 2017. URL http://arxiv.org/abs/1708.03888.
 Zhang et al. (2014) Zhang, S., Choromanska, A., and LeCun, Y. Deep learning with elastic averaging SGD. CoRR, abs/1412.6651, 2014. URL http://arxiv.org/abs/1412.6651.