1 Introduction
Second-order optimization techniques have been extensively explored for problems involving pathological curvature, such as deep neural network (DNN) training. In fact, [1] demonstrated the success of a second-order technique, known as Hessian-free (HF) optimization [2], with DNNs on various image recognition tasks. In addition, [3] successfully applied HF optimization with DNNs to speech recognition tasks. Other second-order methods, including L-BFGS [4] and Krylov Subspace Descent [5], have also shown great success for DNN training.
Second-order methods are particularly important for sequence training of DNNs, which provides a 10-20% relative improvement in word error rate (WER) over a cross-entropy (CE) trained DNN [6]. Because sequence training must use information from time-sequential lattices corresponding to utterances, it is performed with utterance randomization rather than frame randomization. For mini-batch stochastic gradient descent (SGD), which is often used for CE training, frame randomization performs better than utterance randomization [7]. However, because sequence training must be accomplished at the utterance level, second-order methods perform much better than SGD, as they compute the gradient over a large batch of utterances, compared to utterance mini-batch SGD [3].

At IBM Research, we employ HF optimization for sequence training [1]. One drawback of this method is that training can be very slow, requiring about 3 weeks for a 300-hour Switchboard task [3] using 64 parallel machines. There are two reasons for this. First, a great number of Krylov subspace iterations may be required within each HF iteration to approximate the solution of the Hessian-based linear system [1], [3]. Second, [3] uses a fixed amount of data across all HF iterations for both the gradient and the Krylov subspace computations. The purpose of this research is to explore algorithmic strategies that reduce the time spent in both gradient and Krylov subspace computations, both by reducing the amount of data needed for training and by reducing the number of Krylov subspace iterations.
In this paper, we exploit a specific class of Krylov subspace solvers suited to symmetric positive definite matrices, known as conjugate gradient (CG) solvers. For simplicity, we use the term “conjugate gradient” for the specific Krylov subspace technique used with the approximate Hessian. However, the algorithms we propose for reducing training time are generic and can work with any other flexible Krylov subspace solver variant.
Preconditioning, in the context of linear algebra, refers to the process of transforming a system of equations into one that can be solved more readily [8]. For example, preconditioning has been used extensively to reduce CG iterations [9]. Obtaining an appropriate preconditioner for a given problem can be challenging. First, the type of preconditioner that works best is problem specific. Second, while in principle it is possible to design preconditioning strategies that radically reduce the computational burden of the subsequent solution phase, the computational investment in obtaining such a preconditioner may offset its benefit. Thus, it is critical to find a proper balance between the computational effort invested in preconditioning and that invested in the subsequent solution phase.
For our optimization problem, it is computationally intractable to construct the Hessian explicitly. Quasi-Newton approaches construct an approximation to the Hessian (typically of low rank), and in their limited-memory versions form such approximations only implicitly. In this work, we propose using the quasi-Newton L-BFGS method [10] as a preconditioner for the CG solver. Our rationale is that while both quasi-Newton approaches and CG exploit the underlying structure of the linear(ized) system, the structural assumptions the two postulate (a low-rank approximation for quasi-Newton, a Krylov subspace for CG) are complementary; therefore a combination of both methods is typically more effective than reliance on either one alone. The reason L-BFGS is not used directly for HF optimization of DNNs is that L-BFGS only crudely approximates the curvature matrix, whereas the HF method in [1] makes the exact curvature matrix implicitly available, which allows the identification of directions with extremely low curvature.
The use of L-BFGS for preconditioning has been suggested before [11] for numerical simulations. We extend the work in [11] and demonstrate that L-BFGS serves as an effective preconditioner for CG-based HF training of DNNs on large-scale speech recognition data. Furthermore, unlike [11], which used a standard fixed CG approach, we make the important observation that non-fixed preconditioners, such as the proposed L-BFGS, cannot be used stably with standard CG iterative schemes [8]. Thus, to ensure stable and predictable convergence, we propose the use of flexible variants of CG [12]. These variants avoid the failures and breakdowns to which their standard counterparts are susceptible.
Second, we introduce a sampling strategy in which the amount of data used for gradient and CG calculations is gradually increased. Gradient-based optimization methods typically operate in one of two popular regimes [13]. Stochastic approximation methods (such as stochastic gradient descent) select a small sample size to estimate the gradient. These methods often decrease the objective function quickly during initial training iterations, but progress slowly during later iterations. On the other end of the spectrum, sample approximation techniques compute the gradient on a large sample of data. While this computation is expensive, the gradient estimates are much more reliable, and the objective function progresses well during later training iterations. In this study, we propose a hybrid method that captures the benefits of both regimes by gradually increasing the amount of sampled data used for gradient and CG calculations.
Sampling the amount of data used for gradient and CG calculations was explored in [13], which observed the variance of the batch gradient to determine the amount of data to use. Alternatively, [14] explored geometrically increasing the amount of data used for logistic regression and conditional random field problems. The benefit of this approach is that the schedule for selecting data is given ahead of time, and there is no need to compute the expensive gradient variance. In this paper, we extend the idea in [14] to HF DNN training and compare it to the sampling approach in [13].

Initial experiments are conducted on a 50-hour English Broadcast News (BN) task [3]. We find that preconditioning allows for more than a 20% speedup by reducing the number of CG iterations. Furthermore, gradient and CG sampling provide roughly an additional 20% reduction in training time. In total, combining the sampling and preconditioning ideas reduces overall training time by a factor of 1.5. We then extend the preconditioning and sampling ideas to a larger 300-hour Switchboard (SWB) task, where the proposed techniques provide more than a 2.3x speedup, with no loss in accuracy.
2 Hessian-Free Optimization
Before describing the speedups made to the Hessian-free (HF) algorithm, we briefly summarize the HF algorithm for DNN training, as described in [1]. Let $\theta$ denote the network parameters, $\mathcal{L}(\theta)$ a loss function, $\nabla \mathcal{L}(\theta)$ the gradient of the loss with respect to the parameters, $d$ a search direction, and $B(\theta)$ a matrix characterizing the curvature of the loss around $\theta$ (i.e., a Hessian approximation). The central idea in HF optimization is to iteratively form a quadratic approximation to the loss,

$$\mathcal{L}(\theta + d) \approx \mathcal{L}(\theta) + \nabla \mathcal{L}(\theta)^{\top} d + \tfrac{1}{2} d^{\top} B(\theta)\, d, \qquad (1)$$

and to minimize this approximation using Krylov subspace methods, such as conjugate gradient (CG), which access the curvature matrix only implicitly, through matrix-vector products of the form $B(\theta) d$. Such products can be computed efficiently for neural networks [15]. In the HF algorithm, the CG search is truncated based upon the relative improvement in the approximate loss. The curvature matrix is often chosen to be the Gauss-Newton matrix $G(\theta)$ [16], which may not be positive definite. To avoid breakdown of CG due to negative curvature, a positive definite approximation can be enforced by shifting the matrix with an additional damping term, $B(\theta) = G(\theta) + \lambda I$, where $\lambda$ is set via the Levenberg-Marquardt algorithm.

Our implementation of HF optimization, illustrated as pseudocode in Algorithm 1, closely follows that of [1]. Gradients are computed over all the training data. Gauss-Newton matrix-vector products are computed over a sample (about 1% of the training data) that is taken each time CG-Minimize is called. The loss, $\mathcal{L}(\theta)$, is computed over a held-out set. CG-Minimize uses CG to minimize the quadratic approximation in (1), starting from an initial search direction. This function returns a series of steps that are then used in a line search procedure. The parameter update is based on an Armijo-rule backtracking line search. Distributed computation of gradients and curvature matrix-vector products is done using a master/worker architecture [3].
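The truncated-CG core of the procedure above can be sketched in a few lines. This is a minimal illustration, not our implementation: `matvec` stands in for the damped Gauss-Newton product $B(\theta)v$, the stopping rule is a simplified version of the relative-improvement truncation, and all names are hypothetical.

```python
import numpy as np

def truncated_cg(matvec, grad, max_iters=250, rel_tol=5e-4):
    """Minimize q(d) = 0.5 * d^T B d + grad^T d with CG, where B is accessed
    only implicitly through `matvec` (e.g. damped Gauss-Newton products B*v).
    Truncates once the per-iteration relative improvement in q is small,
    a simplified form of the HF truncation rule. Returns the direction d."""
    d = np.zeros_like(grad)
    r = grad.copy()          # residual of q: r = B d + grad (d = 0 initially)
    p = -r                   # initial CG search direction
    q_prev = 0.0             # q(0) = 0
    for i in range(max_iters):
        if r @ r < 1e-20:    # residual (numerically) zero: exact solution
            break
        Bp = matvec(p)
        alpha = (r @ r) / (p @ Bp)
        d = d + alpha * p
        r_new = r + alpha * Bp
        beta = (r_new @ r_new) / (r @ r)   # standard (Fletcher-Reeves-type) beta
        p = -r_new + beta * p
        r = r_new
        # track q(d); since B d = r - grad, q(d) = 0.5 d^T (r - grad) + grad^T d
        q_val = 0.5 * (d @ (r - grad)) + grad @ d
        if i > 0 and q_prev - q_val < rel_tol * abs(q_val):
            break            # relative improvement too small: truncate
        q_prev = q_val
    return d
```

On a small symmetric positive definite system, the returned direction matches the exact Newton-like step $-B^{-1}\nabla\mathcal{L}$.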
3 Preconditioning
One of the problems with the HF technique used in [3] is that the CG algorithm used to obtain an approximate solution of the damped Gauss-Newton system requires many iterations. Figure 1 indicates that as HF training proceeds, training time per iteration is in fact dominated by CG iterations. In this section, we discuss how to reduce the number of CG iterations using preconditioning.
3.1 Motivation
Second-order optimization techniques require curvature information in order to determine a search direction of the form $d = -B^{-1} \nabla \mathcal{L}(\theta)$. In this formulation, $B$ is the Hessian approximation and $\nabla \mathcal{L}(\theta)$ the gradient of the objective function at the current HF iteration. The aforementioned CG method can be used to solve for this search direction. Specifically, we set $B = G(\theta) + \lambda I$, where $G(\theta)$ is the Gauss-Newton matrix, and solve $B d = -\nabla \mathcal{L}(\theta)$.
As mentioned above, in principle L-BFGS [10] could be used directly to optimize the HF DNN training problem. The reason it is not used is that in practice L-BFGS only crudely approximates the curvature of such systems, whereas for this problem domain HF algorithms manage to capture salient features of the curvature, and thereby identify search directions of extremely low curvature [1].
Yet the computation of each HF search direction can be computationally expensive, requiring a great number of CG iterations. Thus, the use of quasi-Newton methods for preconditioning such implicit systems is sensible, as the structural assumptions of CG and L-BFGS are complementary. In the sections below, we describe the L-BFGS algorithm and detail its use as a preconditioner for flexible CG.
3.1.1 L-BFGS Algorithm
L-BFGS is a quasi-Newton optimization method that uses a limited-memory technique to approximate the Hessian or its inverse. Specifically, instead of computing the Hessian directly, which is often a large and dense matrix, the L-BFGS algorithm stores a small number of vectors that serve as a low-rank approximation of the Hessian. The L-BFGS algorithm is outlined in Algorithm 2.
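As a concrete sketch of the limited-memory idea, the standard two-loop recursion applies the L-BFGS inverse-Hessian approximation to a vector using only the stored curvature pairs. This is a generic textbook version, assuming the common diagonal initialization $H_0 = \gamma I$; it is illustrative and not tied to the specifics of Algorithm 2.

```python
import numpy as np

def lbfgs_apply(v, s_list, y_list):
    """Two-loop recursion: apply the L-BFGS inverse-Hessian approximation H
    to a vector v, built from stored pairs s_i = x_{i+1} - x_i and
    y_i = g_{i+1} - g_i, without ever forming H explicitly."""
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = v.copy()
    alphas = []
    # first loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # initial scaling H0 = gamma * I, a standard choice from the newest pair
    s, y = s_list[-1], y_list[-1]
    gamma = (s @ y) / (y @ y)
    r = gamma * q
    # second loop: oldest pair to newest (alphas consumed in reverse)
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r
```

By construction the result satisfies the secant condition for the most recent pair, i.e. applying the approximation to $y_m$ returns $s_m$.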
3.1.2 L-BFGS as a Preconditioner
CG iterative methods can be used to solve for the search direction $d$ by minimizing the quadratic $q(d) = \tfrac{1}{2} d^{\top} B d + \nabla \mathcal{L}(\theta)^{\top} d$. Preconditioning typically involves a transformation (e.g., a change of coordinates) applied to the system of equations that converts it into one of more favorable structure. This makes the CG problem easier to solve and reduces the number of CG iterations. If we define $M$ as a preconditioner, preconditioned CG applies the transformation $M^{-1} B d = -M^{-1} \nabla \mathcal{L}(\theta)$. The preconditioner is required to be symmetric, positive definite, and fixed for all iterations. If any of these conditions is violated, the CG method may fail.
Prescribing a suitable preconditioning scheme for a given problem is challenging. First, each system has its own characteristic structure; identifying it, and determining the type of preconditioner that works best, is generally problem specific. Second, if the preconditioner is computationally expensive to obtain, the cost can offset any reduction in CG iterations, making the preconditioner not cost effective. Third, as challenging as preconditioning is in ordinary circumstances, a greater challenge is to precondition an implicit system, which cannot be accessed directly.
Previous preconditioning work for HF optimization has focused on diagonal preconditioners. [1] explored using the diagonal elements of the Fisher information matrix as a preconditioner for HF training of DNNs. Diagonal elements have a very limited ability to precondition the system, and are mainly beneficial when the matrix suffers from scaling issues. In addition, [17] explored the Jacobi preconditioner, which is computed over a batch of data just like the curvature-vector products, thus requiring the master/worker data-parallelization architecture. For our specific DNN speech problem, we found that the Jacobi preconditioner was costly to compute, offsetting the reduction in CG iterations. The L-BFGS preconditioner [11] we propose is far more powerful than diagonal preconditioners, as it improves the spectral properties of the system rather than merely tackling potential scaling issues. Furthermore, it does not require any data parallelization.
The L-BFGS preconditioner is constructed as follows. Each run of CG produces a sequence of iterates $d_i$ (i.e., within CG-Minimize in Algorithm 1) and a sequence of residuals $r_i$ [18]. From these, the vector pairs $s_i = d_{i+1} - d_i$ and $y_i = r_{i+1} - r_i$ are stored for $m$ iterations of CG, where $m$ is specified by the user. Once these statistics are saved, an L-BFGS matrix can be defined using the steps in Algorithm 2. This L-BFGS matrix is used as the preconditioner for CG.
There are a variety of methodologies for choosing which statistics to use when estimating the L-BFGS matrix. We adopt a strategy proposed in [11], namely using vectors evenly distributed throughout the CG run. This implies that our preconditioner changes across CG iterations. The requirement that the preconditioner be fixed for all iterations of CG is inconvenient, since as we obtain more L-BFGS statistics we can improve the estimate of the preconditioner. Changing the preconditioner during CG requires a flexible CG approach [12]: instead of the equivalent of the Fletcher-Reeves updating formula used in standard preconditioned CG, the Polak-Ribière variant is required [18]. This is in contrast to [11], which did not use a flexible CG approach.
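A flexible preconditioned CG step with the Polak-Ribière beta can be sketched as follows. This is an illustrative implementation under the assumption that the preconditioner is supplied as a function `precond` applying $M^{-1}$ to a residual (e.g., an L-BFGS two-loop recursion over whatever pairs have been collected so far, in which case `precond` may change between calls); the names are hypothetical.

```python
import numpy as np

def flexible_pcg(matvec, b, precond, max_iters=100, tol=1e-8):
    """Solve B x = b where `precond` (z = M^{-1} r) may vary between
    iterations. Uses the Polak-Ribiere form of beta, which remains stable
    under a changing preconditioner, instead of the Fletcher-Reeves form
    used by standard preconditioned CG."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - B x, with x = 0
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iters):
        Bp = matvec(p)
        alpha = rz / (p @ Bp)
        x = x + alpha * p
        r_new = r - alpha * Bp
        if np.linalg.norm(r_new) < tol:
            return x
        z_new = precond(r_new)
        # Polak-Ribiere (flexible) beta: uses the residual *difference*,
        # so it degrades gracefully when M changes between iterations
        beta = (z_new @ (r_new - r)) / rz
        p = z_new + beta * p
        r, z, rz = r_new, z_new, r_new @ z_new
    return x
```

With a fixed preconditioner this reduces to ordinary preconditioned CG, since the extra cross term vanishes by orthogonality.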
4 Sampling
Another problem with the HF technique used in [3] is that the gradient is computed using all the data, and CG on a fixed data sample. In this section, we explore reducing the amount of data used for the gradient and CG computations. Specifically, we explore a hybrid technique that starts with a small amount of data, as in stochastic approximation methods, and gradually increases the amount of sampled data, as in sample approximation methods. In the following sections, we detail two such hybrid methods.
4.1 Sampling From Variance Estimates
[13] proposes a method to increase the sample size based on variance estimates obtained during the computation of a batch gradient. The algorithm can be described as follows. Denote by $f(x_i; \theta)$ the output of the DNN for input $x_i$ and by $y_i$ the true output, so that a loss between predicted and true values can be defined as $\ell(f(x_i; \theta), y_i)$. The loss over the training set of size $N$ is defined as the average of the losses from the individual training examples, as shown in Equation 2:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i; \theta), y_i). \qquad (2)$$

In addition, the loss over a subset $S \subseteq \{1, \ldots, N\}$ is defined by Equation 3:

$$L_S(\theta) = \frac{1}{|S|} \sum_{i \in S} \ell(f(x_i; \theta), y_i). \qquad (3)$$

Denoting the gradients of the full and subset losses by $\nabla L(\theta)$ and $\nabla L_S(\theta)$ respectively, the algorithm ensures that a descent direction for $L_S$ at every iteration is also a descent direction for the true objective function $L$. This is expressed by Inequality 4:

$$\| \nabla L_S(\theta) - \nabla L(\theta) \|_2 \le \eta \, \| \nabla L_S(\theta) \|_2, \qquad (4)$$

where $\eta \in [0, 1)$ is a user-chosen tolerance. In practice, the left-hand side of (4) is not evaluated (the computation of $\nabla L(\theta)$ is expensive for large data sets), but is instead estimated from the variance of $\nabla L_S(\theta)$. Inequality 4 can then be approximated by the inequality

$$\frac{\| \mathrm{Var}_{i \in S}(\nabla \ell_i(\theta)) \|_1}{|S|} \le \eta^2 \, \| \nabla L_S(\theta) \|_2^2. \qquad (5)$$

If this inequality fails, a new, larger sample size is selected so that Inequality 5 is satisfied. The same dynamic selection strategy is also applied to the CG iterations.

In this paper, we explore this sampling approach within a DNN framework. Given an input utterance $u$, the gradient contribution of that utterance is the sum of the gradients of all training frames in the utterance, i.e., $g_u = \sum_{t \in u} g_t$. Computing the variance of the gradient estimate therefore requires two passes through each utterance, to compute the gradient and gradient-squared statistics. Since this makes the algorithm computationally expensive, we instead compute the average gradient per utterance, i.e., $\bar{g}_u = \frac{1}{|u|} \sum_{t \in u} g_t$. The variance statistics become the sum and sum-of-squares of $\bar{g}_u$ over all utterances in the sample, giving the utterance-level test in Equation 6, which requires only one pass through the network per utterance:

$$\frac{\| \mathrm{Var}_{u \in S}(\bar{g}_u) \|_1}{|S|} \le \eta^2 \, \| \nabla L_S(\theta) \|_2^2. \qquad (6)$$
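The one-pass utterance-level variance test can be sketched as below. This is a simplified illustration assuming the per-utterance average gradients are available as rows of an array; `eta` and the sample-size update rule follow the shape of the scheme in [13], but the exact rule used in practice may differ.

```python
import numpy as np

def variance_test(utt_grads, eta=0.1):
    """Check the utterance-level descent condition from per-utterance
    average gradients, using only one-pass sum and sum-of-squares
    statistics. Returns (passes, suggested_sample_size)."""
    S = len(utt_grads)
    g_sum = np.sum(utt_grads, axis=0)               # one-pass sum
    g_sq_sum = np.sum(np.square(utt_grads), axis=0)  # one-pass sum of squares
    g_bar = g_sum / S                                # batch gradient estimate
    # per-dimension sample variance of the per-utterance gradients
    var = (g_sq_sum - S * g_bar**2) / (S - 1)
    lhs = np.sum(np.abs(var)) / S                    # ||Var||_1 / |S|
    rhs = eta**2 * np.sum(g_bar**2)                  # eta^2 * ||g_bar||_2^2
    if lhs <= rhs:
        return True, S
    # otherwise, enlarge the sample so the inequality would hold
    return False, int(np.ceil(np.sum(np.abs(var)) / rhs))
```

When all utterance gradients agree the variance is zero and the current sample passes; when they disagree strongly relative to their mean, a larger sample size is suggested.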
4.2 Geometric Sampling
The sampling approach described above uses sampling statistics to approximate the descent condition (5), but the need to estimate the variance adds notable computational complexity to the gradient computation. In contrast, the framework discussed in [14] provides an expected guarantee of descent at each iteration, as long as the sampling errors are bounded and the bounds are decreasing. In fact, [14, Theorem 2.2] links the sampling errors directly to the expected rate of convergence. This approach does not require computing statistics along the way, and the sampling strategy used to select the sample is linked directly to the expected convergence rate in [14, 19].

[14] uses a geometrically increasing sample size. We adopt this strategy for the gradient and CG samples at each iteration. Specifically, given an initial sample size $|S_0|$, the sample size at HF iteration $k$ is given by Equation 7, where $\alpha > 1$ is a geometric factor that is tuned on a development set:

$$|S_k| = \min\left\{ N, \; \lceil |S_0| \, \alpha^{k} \rceil \right\}. \qquad (7)$$
This approach fits into the theory proposed in [14], and has the practical benefit of a priori sample size selection. The sample size can be used both for gradient and CG iteration calculations.
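A geometric schedule of this shape can be sketched in a few lines (the function name, the exact rounding, and the termination rule are illustrative assumptions):

```python
import math

def geometric_schedule(n_total, s0, alpha):
    """Yield per-iteration sample sizes min(N, ceil(s0 * alpha^k)).
    The schedule is fixed ahead of time: no variance statistics are
    needed, and the sample sizes are known before training starts."""
    k = 0
    while True:
        size = min(n_total, math.ceil(s0 * alpha ** k))
        yield size
        if size == n_total:
            return  # the full training set has been reached
        k += 1
```

For example, with 1,000 training utterances, an initial sample of 100, and a factor of 1.3, the schedule grows from 100 up to the full 1,000 and then stops, so the data fraction per iteration is known a priori.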
5 Experiments
5.1 Broadcast News
Our initial experiments are conducted on a 50-hour English Broadcast News (BN) task, with results reported on the EARS dev04f set. We use the recipe outlined in [20] to extract acoustic features. The hybrid DNN is trained using speaker-adapted VTLN+fMLLR features as input, with a context of 9 frames around the current frame. In [3], it was observed that a 5-layer DBN with 1,024 hidden units per layer and a sixth softmax layer with 2,220 output targets is an appropriate architecture for BN tasks.
We explore the behavior of preconditioning and sampling for HF training on the smaller BN task first, before moving to the larger Switchboard task. All timing experiments in this study were run on an 8-core Intel Xeon X5570 @ 2.93 GHz CPU. Matrix/vector operations for DNN training are multi-threaded using Intel MKL-BLAS. Twelve machines were exclusively reserved for HF training to obtain reliable training time estimates.
6 Results
6.1 Preconditioning
In this section, we compare CG with preconditioning and without preconditioning (noPC). For preconditioning, we explore the behavior with different numbers of statistics used to estimate the L-BFGS preconditioner, namely 16 (PC16), 32 (PC32) and 64 (PC64).
Table 1 shows the total time spent in CG, and the total number of training iterations, to achieve the same loss. In addition, Figure 2 provides a closer look at the cumulative CG time for the four methods. The figure indicates that all preconditioning methods require less time for CG, particularly as the number of HF iterations increases (and thus the number of CG iterations increases). We see that PC64 manifests a significant reduction in CG time after 30 HF iterations, but the loss also moves much more slowly for this method, as reflected in the increased number of HF iterations in Table 1. PC32 appears to be the most cost-efficient choice for the given task, both in terms of CG runtime and in terms of loss reduction, and is roughly 22% faster than the baseline method.
Method  Loss    HF Iterations  Time (min)
noPC    1.9153  39             3,492.2
PC16    1.9157  35             3,042.2
PC32    1.9150  33             2,709.5
PC64    1.9158  46             2,745.6
6.2 Gradient+CG Sampling
Next, we compare the behavior of the geometric and variance sampling methods. Sampling methods require a tradeoff between amount of data used, and the number of iterations for the training loss to converge. Using too little data for gradient and CG will require more training iterations, while using too much data will make each iteration computationally expensive.
For geometric sampling, the geometric factor $\alpha$ was tuned on a held-out set for both the gradient and CG. The selected factors allowed the best tradeoff between the reduction in the amount of training data used and training time; they correspond to seeing roughly 100% of the total data for gradient and CG calculations once roughly 50% of the total training iterations are completed. For variance sampling, the tolerance $\eta$ in Equation 6 is tuned, where a smaller $\eta$ favors a larger sample size.
Figure 3 shows the percentage of data accessed for the gradient by the geometric and variance methods, per HF iteration, for three different values of $\eta$. Notice that the variance methods access a lot of training data in the beginning, relative to the geometric method. One reason is that during the beginning of training there is little data available to obtain a reliable variance estimate, so a larger sample size is preferred. The variance method with an appropriately tuned $\eta$ provided the best tradeoff between training time and data accessed. A similar $\eta$ was also used for selecting the amount of data used for CG.
Figure 4 shows the cumulative time for gradient and CG calculation per HF iteration, for the full gradient/CG and sampling approaches, where both sampling approaches are tuned to provide best tradeoff between training time and amount of data accessed. The geometric method is quicker than the variance sampling method, particularly because it accesses less data during early training iterations, as shown in Figure 3. Overall, we find that the geometric method provides roughly a 20% reduction in training time. It is possible that a technique that starts with geometric sampling, and then switches to variance sampling once enough data is obtained for a reliable variance estimate, might provide further speedups.
6.3 Overall Speedups
In this section, we combine preconditioning and sampling to measure the overall speedup in training time for BN. Figure 5 shows the tradeoff between loss and overall training time for the baseline (no speedup) method, for preconditioning alone, and for preconditioning with gradient and CG sampling. Overall, we can see that PC+Gradient+CG sampling offers the fastest training time compared to the baseline. Table 2 shows the training time and corresponding WER for the baseline and speedup methods. Training time is reduced from 68.7 hours to 44.5 hours, roughly a 1.5x speedup, with no loss in accuracy.
Method               WER   Total Training Time (hrs)
Baseline             17.8  68.7
PC+Grad+CG Speedups  17.8  44.5
6.4 Speedups on Larger Task
After analyzing the behavior of preconditioning and sampling speedups on a smaller 50hour Broadcast News task, in this section, we explore training speed improvements on a larger 300hour Switchboard task.
6.4.1 Experimental Setup
We explore DNN performance on 300 hours of conversational American English telephony data from the Switchboard corpus. Development is done on the Hub5’00 set, while testing is done on the rt03 set, where we report performance separately on the Switchboard (SWB) and Fisher (FSH) portions of the set.
Similar to BN, the training features are speaker-adapted, using VTLN and fMLLR techniques. The input features to the DNN have an 11-frame context (±5) around the current frame. Similar to [3], the DNN has six hidden layers, each containing 2,048 sigmoidal units, and 8,260 output targets. Results with and without the HF speedups are reported after sequence training.
6.4.2 Results
Performance with the baseline and speedup HF techniques is shown in Table 3. Since using 32 L-BFGS statistics performed well on the smaller 50-hour BN task, we used the same setting on the Switchboard task for preconditioning. In addition, because of the increased amount of training data in the larger task, we found that a smaller initial sample size for the gradient and CG iteration calculations still allowed for an appropriate estimate of these statistics.
Since we use more parallel machines (i.e., 64) for SWB compared to BN, it was not possible to exclusively reserve machines for timing calculations. Therefore, training time is estimated by counting the total number of accessed data points during training, which is correlated with timing. Table 3 shows the total accessed data points for the baseline and speedup techniques. Notice that with the larger dataset, because we are able to decrease the fraction of data used for the gradient and conjugate gradient calculations, we can achieve an even larger speedup of 2.3x over the baseline, with no loss in accuracy. This suggests that even further speedups are possible as the data size grows.
Method               WER   Total Accessed Data Points
Baseline             12.5  2.26e9
PC+Grad+CG Speedups  12.5  9.95e8
7 Conclusions
In this paper, we explored using an L-BFGS preconditioner and a geometric sampling approach to accelerate HF training. We found that the two approaches combined provide roughly a 1.5x speedup on a 50-hour Broadcast News task and a 2.3x speedup on a 300-hour Switchboard task, with no loss in accuracy. We anticipate that an even larger speedup can be attained by more informed (potentially adaptive) selection of quasi-Newton statistics, as well as by applying the proposed algorithmic strategies to problems of greater scale.
References

[1] J. Martens, “Deep learning via Hessian-free optimization,” in Proc. Intl. Conf. on Machine Learning (ICML), 2010.
[2] L. Horesh, M. Schweiger, S. R. Arridge, and D. S. Holder, “Large-scale non-linear 3D reconstruction algorithms for electrical impedance tomography of the human head,” in World Congress on Medical Physics and Biomedical Engineering 2006, R. Magjarevic and J. H. Nagel, Eds., vol. 14 of IFMBE Proceedings, pp. 3862–3865, Springer Berlin Heidelberg, 2007.
[3] B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization,” in Proc. Interspeech, 2012.
[4] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large Scale Distributed Deep Networks,” in NIPS, 2012.
[5] O. Vinyals and D. Povey, “Krylov Subspace Descent for Deep Learning,” in AISTATS, 2012.
[6] B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in Proc. ICASSP, 2009, pp. 3761–3764.
[7] H. Su, G. Li, D. Yu, and F. Seide, “Error Back Propagation for Sequence Training of Context-Dependent Deep Networks for Conversational Speech Transcription,” in Proc. ICASSP, 2013.
[8] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition, SIAM, Philadelphia, PA, 1994.
[9] S. Eisenstat, “Efficient Implementation of a Class of Preconditioned Conjugate Gradient Methods,” SIAM Journal on Scientific and Statistical Computing, 1981.
[10] J. Nocedal, “Updating Quasi-Newton Matrices with Limited Storage,” Mathematics of Computation, vol. 35, pp. 773–782, 1980.
[11] J. L. Morales and J. Nocedal, “Automatic Preconditioning by Limited Memory Quasi-Newton Updating,” SIAM Journal on Optimization, 1999.
[12] Y. Notay, “Flexible Conjugate Gradients,” SIAM Journal on Scientific Computing, 2000.
[13] R. Byrd, G. M. Chin, J. Nocedal, and Y. Wu, “Sample size selection in optimization methods for machine learning,” Mathematical Programming B, 2012.
[14] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM J. Scientific Computing, vol. 34, no. 3, 2012.
[15] B. A. Pearlmutter, “Fast exact multiplication by the Hessian,” Neural Computation, vol. 6, no. 1, pp. 147–160, 1994.
[16] N. N. Schraudolph, “Fast curvature matrix-vector products for second-order gradient descent,” Neural Computation, vol. 14, pp. 1723–1738, 2002.
[17] O. Chapelle and D. Erhan, “Improved Preconditioner for Hessian-free Optimization,” in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[18] J. Shewchuk, “An Introduction to the Conjugate Gradient Method without the Agonizing Pain,” 1994.
[19] A. Aravkin, M. P. Friedlander, F. Herrmann, and T. van Leeuwen, “Robust inversion, dimensionality reduction, and randomized sampling,” Mathematical Programming, vol. 134, no. 1, pp. 101–125, 2012.
[20] H. Soltau, G. Saon, and B. Kingsbury, “The IBM Attila speech recognition toolkit,” in Proc. IEEE Workshop on Spoken Language Technology, 2010, pp. 97–102.