Second order optimization techniques have been extensively explored for problems involving pathological curvature, such as deep neural network (DNN) training problems. In fact,  demonstrated success of a second order technique, known as Hessian-free (HF) optimization , with DNNs on various image recognition tasks. In addition,  successfully applied the HF optimization technique with DNNs for speech recognition tasks. Other second order methods, including L-BFGS  and Krylov Subspace Descent , have also shown great success for DNN training.
Second order methods are particularly important for sequence-training of DNNs, which provides a 10-20% relative improvement in WER over a cross-entropy (CE) trained DNN 
. Because sequence training must use information from time-sequential lattices corresponding to utterances, sequence training is performed using utterance randomization rather than frame randomization. For mini-batch stochastic gradient descent (SGD), which is often used for CE training, frame randomization performs better than utterance randomization. However, because sequence-training must be accomplished at the utterance level, second order methods perform much better than SGD, as second order methods compute the gradient over a large batch of utterances compared to utterance mini-batch SGD .
At IBM Research, we employ HF optimization techniques for sequence training . One of the drawbacks of this method is that training can be very slow, requiring about 3 weeks for training a 300-hour Switchboard task  using 64 parallel machines. There are two reasons why training is slow. Firstly, a great number of Krylov subspace iterations may be required for a solution to approximate the Hessian within each HF iteration , . Secondly,  proposes using a fixed amount of data for all HF iterations in both the gradient and Krylov subspace iteration computations. The purpose of this research is to explore algorithmic strategies for reduction of the amount of time spent in both gradient and Krylov subspace computations, both by reducing the amount of data needed for training, as well as by reducing the number of Krylov subspace iterations.
In this paper, we exploit a specific instance of Krylov subspace solvers which are consumed to symmetric positive definite matrices, known as conjugate gradient (CG) solvers. For simplicity, we will use the term “conjugate gradient” as the specific Krylov subspace technique used to estimate the Hessian. However, the algorithms we propose for reducing training time are generic and can work with any other flexible Krylov subspace solver variant.
Preconditioning in the context of linear algebra refers to the process of transforming a system of equations into one that can be solved more readily . For example, preconditioning has been extensively used to reduce CG iterations . Obtaining an appropriate preconditioner for a given problem can be challenging. First, the type of preconditioner that works best is problem specific. Second, while in principle, it is possible to design preconditioning strategies that will reduce the computational burden of the consequent solution phase radically, the computational investment in attaining such a preconditioner might offset its benefit. Thus, it is critical to identify a proper balance between computational efforts invested in preconditioning, vs. that invested in the consequent solution phase.
For our optimization problem, it is computationally intractable to construct the Hessian explicitly. Quasi-Newton approaches construct (typically a low rank) an approximation to the Hessian, and in their limited memory versions, only form such approximations implicitly. In this work, we propose using the quasi-Newton L-BFGS method  as a preconditioner to the CG solver. Our rationale is that while both quasi-Newton approaches and CG exploit the underlying structure of the linear(ized) system, the postulated structural assumptions of both (low rank, CG) are complementary. Therefore a combination of both methods is typically more effective than dependence upon each one solely. The reason L-BFGS was not used directly for HF optimization of DNNs is that L-BFGS crudely approximates the curvature matrix, whereas the HF method in  makes implicitly available the exact curvature matrix, which allows for the identification of directions with extremely low curvature.
The use of L-BFGS for preconditioning has been suggested before  for numerical simulations. We extend upon the work in , and demonstrate that L-BFGS serves as an effective preconditioner for CG-based HF training of DNNs on large-scale speech recognition data. Furthermore, unlike  which used a typical fixed CG approach, we make here an important observation that non-fixed point preconditioners, as the proposed L-BFGS, cannot be used stablely with standard CG iterative schemes . Thus, to ensure, stable and predictable convergence, we propose here the use of flexible variants of CG methods . These variants avoid the failures and breakdowns that their standard counterparts are susceptible to.
Second, we introduce a sampling strategy in which the amount of data used for gradient and CG calculations, is gradually increased. In optimization problems, gradient-based methods typically operate within two popular regimes . Stochastic approximation methods (i.e. such as stochastic gradient descent) select a small sample size to estimate the gradient. These methods often decrease the objective function loss quickly during initial training iterations, albeit, during later iterations the movement of the objective function is slow. On the other end of the spectrum, sample approximation techniques compute the gradient on a large sample of data. While this computation is expensive, the gradient estimates are much more reliable, and the objective function progresses well during later training iterations. In this study, we propose a hybrid method that captures the benefits of both stochastic and sample approximation methods, by increasing the amount of sampled data used for gradient and CG calculations.
Sampling the amount of data used for gradient and CG calculations was explored in 
, which observed the variance of the batch gradient to determine the amount of data to use for gradient and CG calculations. Alternatively,
explored geometrically increasing the amount of data used for logistic regression and conditional random field problems. The benefit of this approach is that the schedule for selecting data is given ahead of time, and there is no need to compute the expensive gradient variance. In this paper, we extend the idea in for HF DNN training, and compare this to the sampling approach in .
Initial experiments are conducted on a 50-hr English Broadcast News (BN) task . We find that preconditioning allows for more than 20% speedup by reducing the number of CG iterations. Furthermore, we find that gradient and CG sampling provide roughly additional 20% improvement in training time. In total, by combining both sampling and preconditioning speedup ideas we were able to reduce overall training time by a factor of 1.5. Second, we extend the preconditioning and sampling ideas to a larger 300-hr Switchboard (SWB) task, where we find that the proposed techniques provide more than a 2.3x speedup, with no loss in accuracy.
2 Hessian-Free Optimization
Before describing the speedups made to the Hessian-free (HF) algorithm, we briefly summarize the HF algorithm for DNN training, as described in . Let denote the network parameters,
denote a loss function,denote the gradient of the loss with respect to the parameters, denote a search direction, and denote a matrix characterizing the curvature of the loss around (i.e., a Hessian approximation). The central idea in HF optimization is to iteratively form a quadratic approximation to the loss,
and to minimize this approximation using Krylov subspace methods, such as conjugate gradient (CG), which access the curvature matrix only implicitly through matrix-vector products of the form. Such products can be computed efficiently for neural networks . In the HF algorithm, the CG search is truncated, based upon the relative improvement in the approximate loss. The curvature matrix is often chosen to be the Gauss-Newton matrix , which may not be positive definite. To avoid breakdown of CG due to negative curvature, a positive definite approximation can be enforced by shifting the matrix using an additional damping term: , where is set via the Levenberg-Marquardt algorithm.
Our implementation of HF optimization, which is illustrated as pseudo-code in Algorithm 1, closely follows that of . Gradients are computed over all the training data. Gauss-Newton matrix-vector products are computed over a sample (about 1% of the training data) that is taken each time CG-Minimize is called. The loss, , is computed over a held-out set. CG-Minimize uses CG to minimize , starting with search direction . This function returns a series of steps that are then used in a line search procedure. The parameter update, , is based on an Armijo rule backtracking line search. Distributed computation to computer gradients and curvature matrix-vector products is done using a master/worker architecture .
One of the problems with the HF technique used in  is that CG algorithms used to obtain an approximate solution to the Hessian require many iterations. Figure 1 indicates that as HF training iterations increase, training time per iteration is in fact dominated by CG iterations. In this section, we discuss how to reduce the number of CG iterations using preconditioning.
2nd-order optimization techniques require computation of Hessian in order to determine a search direction of the form . In this formulation, is the Hessian approximation and the gradient of the objective function at the HF iteration. The aforementioned CG method can be used to solve for this search direction. Specifically, we set , where is the Gauss-Newton matrix, and solve .
As mentioned above, in principle, L-BFGS  can be used for optimization of the HF DNN training problem. The reason L-BFGS was not used for optimization of neural networks is that in practice L-BFGS crudely approximates curvature of such systems, whereas for this domain problem HF algorithms manage to capture salient features of the curvature, and thereby identify search directions of extremely low curvature .
Yet, the computation of each HF search direction can be computationally excessive, requiring a great number of CG iterations. Thus, the use of quasi-Newton methods for preconditioning such implicit systems is sensible, as the structural assumptions of CG and L-BFGS are complementary. In the section below, we describe the L-BFGS algorithm and detail using this as a preconditioner for flexible CG.
3.1.1 L-BFGS algorithm
L-BFGS is a quasi-Netwton optimization method that uses a limited memory technique to approximate the Hessian or its inverse. Specifically, instead of computing the Hessian directly, which can often be a large and dense matrix, the L-BFGS algorithm stores a small number of vectors which can be used as a low rank approximation of the Hessian. The L-BFGS algorithm is outlined below in Algorithm 2.
3.1.2 L-BFGS as a Preconditioner
CG iterative methods can be used to solve for the search direction , by minimizing the following problem . Preconditioning typically involves a process or transformation (e.g. change of coordinates) applied upon a system of equations, which in return, converts the system to of more favorable structure. Preconditioning makes the CG problem easier to solve and reduces the number of CG iterations. If we define as a preconditioner, preconditioned CG involves the following transformation to the CG problem . The preconditioner is required to be symmetric and positive definite, and fixed for all iterations. If any of these conditions are violated, the CG method may fail.
Prescription of a suitable preconditioning scheme for a given problem is challenging. First, each system has its own characteristic structure. Identification of which and respectively determining the type of preconditioner that works best is generally problem specific. Second, if the preconditioner is computationally expensive to obtain, then this will offset any reduction in CG iterations, and thus the preconditioner will not be cost effective. Third, as challenging as preconditioning is in ordinary circumstances, a greater challenge is to precondition an implicit system, that cannot be accessed directly.
Previous preconditioning work for HF optimization has focused on diagonal matrix preconditioners.  explored using the diaogonal elements of the Fisher information matrix as a preconditioner for HF training of DNNs. Using diagonal matrix elements has a very limited ability to precondition the system, and is mainly beneficial when the matrix suffers scaling issues. In addition,  explored using the Jacobi pre-conditioner, which is computed over a batch of data just like the curvature-vector products, thus requiring the master/worker data-parallelization architecture. For our specific DNN speech problem, we found that the Jacobi preconditioner was costly to compute and offset reductions in CG iterations. The L-BFGS  preconditioner we propose is far more powerful compared to diagonal matrix preconditioners as it improves the spectral properties of the system, rather than merely tackling potential scaling issues. Furthermore, it does not require any data parallelization.
The L-BFGS preconditioner is described as follows. Each iteration of CG produces a sequence of iterates (i.e., in Algorithm 1) and a sequence of residuals . Using these statistics, the vectors and are stored for iterations of CG, where is specified by the user. Once statistics are saved, an L-BFGS matrix can be defined using the steps in Algorithm 2. This L-BFGS matrix is used as the preconditioner for CG.
There are a variety of different methodologies to choose the statistics to use when estimating the L-BFGS matrix. We adopt a strategy proposed in , namely using vectors evenly distributed throughout the CG run, to estimate the L-BFGS matrix. This implies that our preconditioner changes for different CG iterations. The requirement that the preconditioner needs to be fixed for all iterations of CG is inconvenient, since as we obtain more L-BFGS statistics we can improve the estimate of the preconditioner. Changing the preconditioner for CG requires using a flexible CG approach . More specifically, instead of using the equivalent of Fletcher-Reeves updating formula for non-preconditioned CG, the Polak-Ribière variant is required . This is opposed to the approach taken in  which did not use a flexible CG approach.
Another problem with the HF technique used in  was that the gradient was computed using all data, and CG on a fixed data sample. In this section, we explore reducing the amount of data used for the gradient and CG computations. Specifically, we explore a hybrid technique that first starts with a small amount of data similar to stochastic approximation methods, and gradually increases the amount of sampled data similar to sample approximation methods. In the following section, we detail two different hybrid methods.
4.1 Sampling From Variance Estimates
 proposes a method to increase the sample size based on variance estimates obtained during the computation of a batch gradient. This algorithm can be described as follows. Denote as the output from the DNN and the true output, such that a loss between predicted and true values can be defined as . The loss over the training set of size , is defined as the sum of the losses from the individual training examples , as shown by Equation 2.
In addition, the loss over a subset is defined by Equation 3.
Denoting the gradients of the full and subset losses as and respectively, the algorithm ensures that descent made in at every iteration must admit a descent direction for the true objective function . The is expressed by Equation 4.
In practice, the quantity is not evaluated (the computation of is expensive for large data sets), but instead is estimated from the variance of . Inequality 4 can be simplified to the inequality
If this inequality fails, a new sample size is selected to satisfy Inequality 5. The same dynamic selection strategy is also applied to the CG iterations.
In this paper, we explore this sampling approach within a DNN framework. Given an input utterance , the output of the DNN is the sum of the gradients of all training frames in that utterance, i.e. . Therefore, to compute the variance of the gradient estimate, this requires two passes through each utterance to compute the gradient and gradient-squared statistics . Since this makes the algorithm computationally expensive, we compute the average gradient per utterance , i.e. . The variance statistics become the sum and sum-squared of over all utterances in the training set, as shown by Equation 6. This only requires one pass through the network per utterance.
4.2 Geometric Sampling
The sampling approach proposed above uses sampling statistics to approximate the descent condition (5), but the need to estimate the variance in (5) adds notable computational complexity to the gradient computation. In contrast, the framework discussed in  provides an expected guarantee of descent in each iteration, as long as the sampling errors
are bounded, and the bounds are decreasing. In fact, [14, Theorem 2.2] links the sampling errors directly to the expected rate of convergence. This approach does not require computing statistics along the way, and the sampling strategy used to select is linked directly to the expected convergence rate in [14, 19].
 uses a geometrically increasing sample size. We adopt this strategy for the gradient and CG iteration samples in each iteration. Specifically, given initial sample size , the sample size at each iteration is given by Equation 7 where is the geometric factor that is tuned on a development set.
This approach fits into the theory proposed in , and has the practical benefit of a priori sample size selection. The sample size can be used both for gradient and CG iteration calculations.
5.1 Broadcast News
Our initial experiments are conducted on a 50-hr English Broadcast News (BN) task and results reported on both the EARS dev04f set. We use a recipe outlined in  to extract acoustic features. The hybrid DNN is trained using speaker-adapted VTLN+fMLLR features as input, with a context of 9 frames around the current frame. In 
, it was observed that a 5-layer DBN with 1,024 hidden units per layer and a sixth softmax layer with 2,220 output targets was an appropriate architecture for BN tasks.
We explore the behavior of preconditioning and sampling for HF training on a smaller BN task first, before moving to a larger Switchboard task. All timing experiments in this study were run on an 8 core Intel Xeon X5570@2.93GHz CPU. Matrix/vector operations for DNN training are multi-threaded using Intel MKL-BLAS. 12 machines were exclusively reserved for HF training to get reliable training time estimates.
In this section, we compare CG with preconditioning and no preconditioning (noPC). For preconditioning, we explore the behavior with different number of statistics used to estimate the L-BFGS preconditioned, namely 16 (PC-16), 32 (PC-32) and 64 (PC-64).
Table 1 shows the total time spent in CG, and total number of training iterations, to achieve the same loss. In addition, Figure 2 provides a closer look at the cumulative time for CG for the 4 methods. The Figure indicates that that all preconditioning methods require less time for CG, particularly as the number of total HF iterations increases (and thus the number of CG iterations increases). We see that PC-64 manifests a significant reduction in CG time after 30 HF iterations, but this also results in the loss moving much slower for this method, as explained by increased HF iterations in Table 1. PC-32 appears to be the most cost-efficient choice for the given task, both in terms of CG iteration runtime and in terms of loss reduction, and is roughly 22% faster than the baseline method.
|Method||Loss||HF Iterations||Time (min)|
6.2 Gradient+CG Sampling
Next, we compare the behavior of the geometric and variance sampling methods. Sampling methods require a tradeoff between amount of data used, and the number of iterations for the training loss to converge. Using too little data for gradient and CG will require more training iterations, while using too much data will make each iteration computationally expensive.
For geometric sampling, the geometric factor was tuned on a held-out set for both gradient and CG. It was found that an for the gradient and for CG allowed for the best tradeoff between reduction in amount of training data used and training time. This geometric factor corresponds to seeing roughly 100% of the total data used for gradient and CG calculations when roughly 50% of the total training iterations are completed. For variance sampling, in Equation 6 is tuned, where smaller favors a larger sample size.
Figure 3 shows the percentage of data accessed for the gradient for the geometric and variance methods, per HF iteration, for three different values of . Notice that the variance methods access a lot of training data in the beginning relative to the geometric method. One reason is that during the beginning of training, there is little data available to get a reliable variance estimate, so a larger sample size is preferred. The variance method with provided the best tradeoff between training time and data accessed. A similar was also used for estimating amount of data used for CG.
Figure 4 shows the cumulative time for gradient and CG calculation per HF iteration, for the full gradient/CG and sampling approaches, where both sampling approaches are tuned to provide best tradeoff between training time and amount of data accessed. The geometric method is quicker than the variance sampling method, particularly because it accesses less data during early training iterations, as shown in Figure 3. Overall, we find that the geometric method provides roughly a 20% reduction in training time. It is possible that a technique that starts with geometric sampling, and then switches to variance sampling once enough data is obtained for a reliable variance estimate, might provide further speedups.
6.3 Overall Speedups
In this section, we combine the preconditioning and sampling to calculate the overall speedup in training time for BN. Figure 5 shows the trade-off between loss and overall training time of the baseline (no speedup) method, preconditioning, and then including gradient and CG sampling. Overall we can see that PC+Gradient+CG sampling offers the fastest training time compared to the baseline. Table 2 shows the training time and corresponding WER for the baseline and speedup methods. Training time is reduced from 68.7 hours to 44.5 hours, roughly a 1.5x speedup, with no loss in accuracy.
|Method||WER||Total Training Time (hrs)|
6.4 Speedups on Larger Task
After analyzing the behavior of preconditioning and sampling speedups on a smaller 50-hour Broadcast News task, in this section, we explore training speed improvements on a larger 300-hour Switchboard task.
6.4.1 Experimental Setup
We explore DNNs performance on 300 hours of conversational American English telephony data from the Switchboard corpus. Development is done on the Hub5’00 set, while testing is done on the rt03 set, where we report performance separately on the Switchboard (SWB) and Fisher (FSH) portions of the set.
Similar to BN, the training features are speaker-adapted, using VTLN and fMLLR techniques. The input features into the DNN have an 11-frame context () around the current frame. Similar to , the DNN has six hidden layers each containing 2,048 sigmoidal units, and 8,260 output targets. Results with and without HF speedups are reported after sequence training.
Performance with the baseline and speedup HF techniques are shown in Table 3. Since using 32 L-BFGS stats performed well for the smaller 50-hour BN task, we used the same on the Switchboard task for preconditioning. In addition, because of the increased amount of training data associated with the larger task, we found that using a smaller sample size (i.e., ) for the gradient and CG iteration calculations still allowed for an appropriate estimate of these statistics.
Since we use more parallel machines (i.e. 64) for SWB compared to BN, it was not possible to exclusively reserve machines for timing calculations. Therefore, training time is estimated by calculating total number of accessed data points for training, which is correlated to timing. Table 3 shows the total accessed data points for the baseline and speedup techniques. Notice that with a larger dataset, because we are able to decrease the fraction of data used for gradient and conjugate gradient calculations, was can achieve an even larger speedup of 2.3x over the baseline, with no loss in accuracy. This suggests that even further speedups are possible as the data size grows.
|Method||WER||Total Accessed Data Points|
In this paper, we explored using an L-BFGS pre-conditioner and geometric sampling approach to accelerate HF training. We find that both approaches combined provided roughly a 1.5x speedup over a 50-hr Broadcast News task and a 2.3x speedup on a 300-hr Switchboard task, with no loss in accuracy. We anticipate an even larger speedup to be attained by more informed selection of quasi-Newton statistics (potentially adaptive) as well as by application of the proposed algorithmic strategies upon problems of greater scale.
“Deep learning via Hessian-free optimization,”in
Proc. Intl. Conf. on Machine Learning (ICML), 2010.
-  L. Horesh, M. Schweiger, S.R. Arridge, and D.S. Holder, “Large-scale non-linear 3d reconstruction algorithms for electrical impedance tomography of the human head,” in World Congress on Medical Physics and Biomedical Engineering 2006, R. Magjarevic and J.H. Nagel, Eds., vol. 14 of IFMBE Proceedings, pp. 3862–3865. Springer Berlin Heidelberg, 2007.
-  B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization,” in Proc. Interspeech, 2012.
-  J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M.A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large Scale Distributed Deep Networks,” in NIPS, 2012.
-  O. Vinyals and D. Povey, “Krylov Subspace Descent for Deep Learning,” in AISTATS, 2012.
-  B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in Proc. ICASSP, 2009, pp. 3761–3764.
-  H. Su, G. Li, D. Yu, and F. Seide, “Error Back Propagation For Sequence Training Of Context-Dependent Deep Networks For Conversational Speech Transcription,” in Proc. ICASSP, 2013.
-  R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition, SIAM, Philadelphia, PA, 1994.
-  S. Eisenstat, “Efficient Implementation of a Class of Preconditioned Conjugate Gradient Methods,” SIAM Journal on Scientific and Statistical Computing, 1981.
-  J. Nocedal, “Updating Quasi-Newton Matrices with Limited Storage,” Mathematics of Computation, vol. 33, pp. 773–782, 1980.
-  J.L. Morales and J. Nocedal, “Automatic Preconditioning by Limited Memory Quasi-Newton Updating,” SIAM Journal on Optimization, 1999.
-  Y. Notay, “Flexible Conjugate Gradients,” SIAM Journal on Scientific Computing, 2000.
-  R. Byrd, G. M. Chin, J. Nocedal, and Y. Wu, “Sample size selection in optimization methods for machine learning,” Mathematical Programming B, 2012.
-  M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM J. Scientific Computing, vol. 34, no. 3, 2012.
-  B. A. Pearlmutter, “Fast exact multiplication by the Hessian,” Neural Computation, vol. 6, no. 1, pp. 147–160, 1994.
-  N. N. Schraudolph, “Fast curvature matrix-vector products for second-order gradient descent,” Neural Computation, vol. 14, pp. 1723–1738, 2004.
-  O. Chapelle and D. Erhan, “Improved Preconditioner for Hessian-free Optimization,” in Proc. NIPS Workshop NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
-  J. Shewchuk, “An Introduction to the Conjugate Gradient Method without the Agonizing Pain,” 1994.
-  A. Aravkin, M. P. Friedlander, F. Herrmann, and T. van Leeuwen, “Robust inversion, dimensionality reduction, and randomized sampling,” Mathematical Programming, vol. 134, no. 1, pp. 101–125, 2012.
-  H. Soltau, G. Saon, and B. Kingsbury, “The IBM Attila speech recognition toolkit,” in Proc. IEEE Workshop on Spoken Language Technology, 2010, pp. 97–102.