Papers and blogs related to distributed deep learning
We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.READ FULL TEXT VIEW PDF
Stochastic Gradient Descent (SGD) is the key learning algorithm for many...
In this work we apply model averaging to parallel training of deep neura...
Distributing Neural Network training is of particular interest for sever...
Deep learning models (DLMs) are state-of-the-art techniques in speech
Identifying the influence of training data for data cleansing can improv...
Data parallelism has become the de facto standard for training Deep Neur...
This paper presents a novel natural gradient and Hessian-free (NGHF)
Papers and blogs related to distributed deep learning
Parallel training of neural networks generally makes use of some combination of model parallelism and data parallelism (Dean et al., 2012), and the normal approach to data parallelism involves communication of model parameters for each minibatch. Here we describe our neural-net training framework which uses a different version of data parallelism: we have multiple SGD processes on separate machines, and only infrequently (every minute or so) average the model parameters and redistribute them to the individual machines. This is very effective for us for large-scale training of systems for speech recognition– but in our case it only works well when combined with an efficient implementation of natural gradient stochastic gradient descent (NG-SGD) that we have developed. We don’t attempt in this paper to develop a framework that explains why parameter averaging should work well despite non-convexity of DNNs, or why NG-SGD is so helpful. The point of this paper is to describe our methods and to establish that empirically they work well. The significance of this work is that we show that it is possible to get a linear speedup when increasing the number of GPUs, without requiring frequent data transfer (however, this only holds up to about 4 or 8 GPUs).
In Section 2 we describe our problem setting, which is Deep Neural Networks (DNNs) applied to speech recognition– although our ideas are more general than this. In Section 3 we introduce the parallel training method. In Section 4 we describe the general ideas behind our natural gradient method, although most of the technical details have been relegated to appendices. In this paper we don’t give any proofs, but we do discuss in Section 5 what we think we can and can’t be proven about our methods. Section 6 has experiments on the convergence of SGD with and without natural gradient and parallelism. We conclude in Section 7.
on each frame, but in the log-probabilitiesfor all , since we will use them as costs in a Viterbi search for a path corresponding to the most likely word sequence. The objective function for training is the sum, over all frames of training data, of the log-probability of given : . Since we are maximizing this, our use of the term SGD is of course a slight misnomer; it is gradient ascent. The supervision labels
are derived from a Viterbi alignment of a Hidden Markov Model (HMM) derived from the reference word sequence of each training utterance.
The parameter-averaging aspect of our training is quite simple. We have machines (e.g. ) each doing SGD separately with different randomized subsets of the training data, and we allow their parameters to gradually diverge. After each machine has processed a fixed number of samples (typically
), we average the parameters across all the jobs and re-distribute the result to the machines. (In practice we do this by spawning new jobs in GridEngine or in whatever job management system we are using). This is repeated until we have processed all the data for a specified number of epochs, e.g. 10.
We define an outer iteration of training as the time it takes for each job to process training examples. The number of outer iterations per epoch depends on and the quantity of training data.
We find it useful to define the effective learning rate of the parallel SGD procedure as the learning rate being used by the individual jobs, divided by the number of jobs . As we increase the number of jobs , in order to get a linear speed up we need to increase the learning rate proportional to so that the effective learning rate stays the same. The concept is that when we do the parameter averaging, the parameter update from any individual SGD job gets diluted -fold.
The reason why we set this up as parameter-averaging instead of summing the parameter changes from the respective jobs, is out of concern for stability. Imagine there is some direction in parameter space where the Hessian is large enough (and our learning rate large enough) that stochastic gradient descent nearly reaches equilibrium after processing samples. If there are, say, 4 jobs, then after processing examples and summing the parameter changes, the parameters end up not close to the equilibrium value but off in the opposite direction and 3 times farther away than at the start. It is clear that this would lead to divergence.
At this point we provide some more details of other relevant features of our SGD implementation, namely the learning rate schedule and the way we enforce a maximum parameter change to prevent divergence.
There are some other, less directly relevant issues which we have relegated to Appendix C: namely, CPU versus GPU-based SGD (C.1); data randomization issues (C.2); generalized model averaging (C.4); mixture components a.k.a. sub-classes (C.5); input data normalization (C.6); parameter initialization (C.7); sequence training (C.8); and online decoding with i-vectors (C.9).
It was found in (Senior et al., 2013) that when training DNNs for speech recognition, an exponentially decreasing learning rate works well, and we independently found the same thing. We generally use a learning rate that decreases by a factor of 10 during training, on an exponential schedule. Unless mentioned otherwise, for experiments reported here the learning rate starts at 0.01 and ends at 0.001. We specify the number of epochs in advance; it is typically a number in the range 4 to 20 (if we have more data, we train for fewer epochs).
Note that while in the experiments presented here we did not separately tune the learning rate schedule for SGD and NG-SGD (we just used values which had worked well in the past for NG-SGD), we have done extensive experiments in the past, on smaller setups, where the learning rate schedule was tuned independently; and in all circumstances we found NG-SGD to be helpful. It was not feasible to repeat those experiments on a large-scale setup.
A common pathology when doing SGD for deep learning is that during training, the parameters will suddenly start getting very large and the objective function will go to negative infinity. This is known as parameter divergence. The normal solution is to decrease the learning rate and start the training again, but this is a very inconvenient. To avoid this pathology, we modified the SGD procedure to enforce a maximum parameter change per minibatch. This limit tends to be active only early in training, particularly for layers closer to the output. We have provided further details on this in AppendixC.3.
In this section we describe our natural-gradient modification to SGD, in which we scale the gradients by a symmetric positive definite matrix that is an approximation to the inverse of the Fisher matrix.
Technically speaking, Natural Gradient means taking a step along the gradient of a Riemannian parameter surface, which follows a curving path in conventional parameter space and which is extremely hard to compute. However, previous work (Yang & Amari, 1998; Roux et al., 2007) has used the term “Natural Gradient” to describe methods like ours which use an an approximated inverse-Fisher matrix as the learning rate matrix, so we follow their precedent in calling our method “Natural Gradient”.
In SGD, the learning rate is often assumed to be a scalar , decreasing with time, and the update equation is something like
where is the objective-function gradient sampled on time (e.g. computed from a training sample or a minibatch). However, it is possible to replace this scalar with a symmetric positive definite matrix, and we can write instead:
with the matrix component of learning-rate; it is more convenient for proofs to keep separate rather than absorbing it into . It acceptable for
to be random: if we can bound the eigenvalues ofabove and below, by positive constants known in advance, and and are independently sampled given the parameter , then we can prove convergence under the same kinds of conditions as if we were using a scalar learning rate (Bottou, 1998, Sec. 4.2.2)
In general, the learning-rate matrix should not be a function of the data sample which we are currently processing, or it may prevent convergence to a local optimum. As an example of this, a matrix that was systematically smaller for a particular type of training data would clearly bias the learning by downweighting that data.
There are reasons from statistical learning theory, related to the Natural Gradient idea(Amari, 1998), why we may want to set the learning-rate matrix to the inverse of the Fisher information matrix. See for example, (Murata & Amari, 1999) and (Roux et al., 2007). The Fisher matrix is most directly defined for situations where we are learning a distribution, as opposed to classification problems such as the current one. Suppose , which may be discrete or continuous, is the variable whose distribution we are modeling, and is the probability or likelihood of given parameters , then the Fisher information matrix
is defined as the expected outer product (second moment) of the derivative of the log-probability w.r.t. the parameters, i.e. of
This derivative is called the “score” in information theory. Part of the justification for this use of the Fisher matrix is that, under certain conditions, the Fisher matrix is identical to the Hessian; and it is obvious why the inverse Hessian would be a good gradient descent direction. These conditions are quite stringent, and include that the model is correct and is at the value corresponding to the true data distribution; but even if these conditions do not apply, the Fisher information matrix is in some sense “dimensionally” the same as the Hessian– that is, it transforms the same way under changes of parameterization– so its inverse may still be a good choice of learning-rate matrix.
It is quite easy to generalize the notion of the Fisher matrix to a prediction task . We write for a data distribution that we assume to be independently known (and not a function of ). It is not hard to see that the score equals just ; since does not depend on , there is no additional term involving
. The expectation that we take when computing the Fisher matrix is taken over the joint distribution ofand . This argument also appears in (Roux et al., 2007, Section 3).
Still more generally, we can compute a quantity that is analogous to Fisher matrix for any objective function, even one that does not represent a log-probability or log-likelihood; and we will still have a matrix that transforms in the same way as the Hessian under changes of variables - i.e. its inverse may still be a reasonable choice for a learning-rate matrix.
For large-scale problems, such as DNNs for speech recognition with millions of parameters, even one inversion of the Fisher matrix is impractical because it would take time in the parameter dimension. However, it may be practical to deal with factored forms of it. There has been previous literature on this. In (Roux et al., 2007), the Fisher matrix was divided into diagonal blocks and each block was approximated by a low-rank matrix. The idea of diagonal blocks was also explored in (Bastian et al., 2011), with one block per weight matrix; our approach uses the same idea. In the unpublished manuscript (Yang & Amari, 1997) (some of the material in which was later published in (Yang & Amari, 1998)), the authors attempted to show analytically that under certain quite strong assumptions, the Fisher matrix for a single-hidden-layer neural network has the form of a Kronecker product. Although we are interested in more general networks than they considered, the Kronecker product does also appear in our factorization of the Fisher matrix.
We should note that there are ways to use Natural Gradient without factorizing the Fisher information matrix, if one is willing to accept a significantly increased time per iteration. See for example (Pascanu & Bengio, 2013), which uses a truncated Newton method to approximate multiplication by the inverse of the Fisher matrix.
Our factored form of the Fisher matrix is as follows: given a neural network with weight matrices, we divide the Fisher matrix into diagonal blocks, one for each weight matrix. Consider the ’th diagonal block of the Fisher matrix, corresponding to the parameters of a weight matrix , and assume that there is no separate bias term (we can append a 1 to the inputs and include the bias term in the weight matrix). The ’th block of the Fisher matrix is a Kronecker product of two symmetric positive definite matrices: , whose dimension is the output (row) dimension of , and , whose dimension is the input (column) dimension of . We further factorize the matrices and
as a low-rank symmetric matrix plus a multiple of the identity matrix. We write the approximated Fisher matrix in the form
where and are each factorized in the form . The order in which and appear in the Kronecker product depends on the way in which we vectorize the weight matrices– row-wise or column-wise. In practice we don’t ever deal explicitly with these Kronecker products or vectorized weight matrices in the algorithm, so this choice doesn’t matter. It is not hard to show that if the Fisher matrix can be factored this way, then its inverse can be factored the same way.
We have two different methods for estimating the factorized Fisher matrix:
Simple method: We estimate the Fisher matrix from the other samples in the minibatch we are currently training on, holding out the current sample to avoid bias. This can be done surprisingly efficiently. Details are in Appendix A.
Online method: We estimate the Fisher matrix from all previous minibatches, using a forgetting factor to downweight minibatches that are distant in time. Details are in Appendix B.
We generally use the online method as it is significantly faster on GPUs and usually seems to lead to faster learning, probably due to the less noisy estimate of the Fisher matrix. We describe the simple method because it is easier to understand and helps to motivate the online method.
Although we describe our Fisher matrix as as a Kronecker product, we do not have to explicitly construct this product in our code.
Suppose that we process the training examples one at a time. The SGD update for the ’th weight matrix is:
where is the derivative of the objective function w.r.t. output of the ’th weight matrix computed at the current sample, and
is the input that the weight matrix acts on. These quantities naturally occur in backpropagation.
In our natural gradient method, this is modified as follows:
where and are factors of the Fisher matrix. It is easy to show that this is equivalent to multiplying the parameter step by the inverse of the Fisher matrix formed from the and quantities as in Equation (2).
Rather than processing training examples one at a time, we process them in minibatches (e.g. 512 at a time). Instead of vector-valued derivatives and inputs , we now have matrices and , each row of which corresponds to one of the or quantities ( is now the index for the minibatch). The update is now as follows:
and note that unlike some authors, we don’t divide the gradients by the minibatch size– this makes it easier to tune the minibatch size and learning rate independently. The update now has the form
with the bar indicating modified and quantities. In the online version of our natural gradient method, we can write these as:
but in the simple method, because the and matrices are estimated from the other elements in the minibatch, we can’t write it this way– it is a separate matrix multiplication for each row of and – but it can still be computed efficiently; see Appendix A.
In programming terms, we can describe the interface of the core natural-gradient code as follows:
Simple method: Given a minibatch of vectors with each row being one element of the minibatch, estimate the Fisher-matrix factors by holding out each sample, do the multiplication by their inverses, and return the modified vectors .
Online method: Given a minibatch of vectors and a previous Fisher-matrix factor , compute and the updated Fisher-matrix factor .
The interface of the natural gradient code works the same with the and quantities, as with and . We call the interface above times for each minibatch: twice for each weight matrix in the network.
In both natural gradient methods, we want to prevent the Fisher-matrix multiplication from affecting the overall magnitude of the update very much, compared with the step-sizes in standard SGD. There are several reasons for this:
Early in training, the and quantities may be very small or zero, leading to huge or infinite inverse-Fisher matrices.
The conventional convergence-proof techniques require that the matrix component of the learning rate matrix should have eigenvalues bounded above and below by constants known in advance, which we cannot guarantee if we use an unmodified Fisher matrix.
Empirically, we have found that it is hard to prevent parameter divergence if we use the real, un-scaled Fisher matrix.
Our method is to scale the and quantities so that they have the same Frobenius norm as the corresponding inputs and . We will introduce notation for this in the Appendices.
This scaling introduces a slight problem for convergence proofs. The issue is that each sample can now affect the value of its own learning-rate matrix (via the scalar factor that we use to rescale the matrices). As we mentioned before, it is not permissible in general to use a per-sample learning rate that is a function of the sample itself. However, we don’t view this as a practical problem because we never use a minibatch size less than 100, so the resulting bias is tiny.
In both versions of NG-SGD, we smooth our estimates of the factors of the Fisher matrix by adding a multiple of the identity matrix before inverting them. In the simple method this is necessary because in general the Fisher matrix estimated from the minibatch will not be full rank. In the online method it is not strictly necessary because we deal with a factorization of the Fisher matrix that already contains a multiple of the unit matrix, but we found that by adding an additional multiple of the unit matrix, as for the simple method, we can improve the convergence of the SGD training. In both cases the smoothing is of the following form. If is a Fisher matrix factor estimated directly from data as the uncentered covariance of the or quantities, then instead of using as the Fisher-matrix factor or , we use instead , where
where is used to stop the smoothed from ever being exactly zero. That is, we smooth the Fisher with the identity matrix scaled by times the average diagonal element of . We found in tuning experiments that the relatively large value is suitable under a wide range of circumstances, for both the simple and online methods, and even for settings where the noise in should not be a big problem– e.g. for large minibatch sizes. Our interpretation is that when is fairly large, we are using a smaller than normal learning rate only in a few directions where the or quantities have quite high covariance, and a relatively constant learning rate in all the remaining directions.
Although this is not a theoretical paper, we would like to say what we think is, and is not, possible to prove about our methods.
If we assume that the distribution of the and quantities is Gaussian and independent (between and for a single layer, and between layers), then it should not be hard to show that the Fisher matrix has the form of (2), where the and quantities correspond to the uncentered covariances of the and quantities, and that the inverse-Fisher has the same form, with the replacing and replacing .
Of course these conditions won’t hold in practical deep learning applications, but we do believe that it’s a reasonable factorization. One could try to show this experimentally as follows, given a task. One could make a linear change of variables to make our approximated Fisher matrix equal the unit matrix, and then try to measure the eigenvalue distribution of the full Fisher matrix in the new co-ordinates. We believe that the eigenvalue distribution of the transformed Fisher matrix would probably be much more closerly centered around 1 than before the change of variables. Since our motivation for the work published here is a practical one, so we have not allocated effort towards this type of experiment.
Regarding the convergence of SGD using our factored-Fisher learning rate matrices, the most we think is easily provable is that a slightly modified form of this method would converge under similar conditions to unmodified SGD.
The smoothing with constant can give us a bound on the ratio of the largest to smallest eigenvalues of the and factors; using this together with the rescaling of Section 4.8, we can bound from above and below the eigenvalues of the rescaled and factors. By multiplying these together, we can get lower and upper bounds on the eigenvalues of the overall inverse-Fisher matrix that we use as the learning-rate matrix .
It is necessary for the Fisher matrix to be randomly chosen independent of the identity of the current sample. Unfortunately this is not quite true due to the rescaling being done at the minibatch level; we mentioned in Section 4.8 that this would be a problem for proofs. As mentioned, it would be easy to use the rescaling factor from the previous minibatch; this gives us back the independence, but at the cost of no longer having such easy bounds on the upper and lower eigenvalues of the rescaled and factors. Alternately, one could keep the algorithm as it is and try to prove instead that the parameter value we converge to will not differ very much in some sense from an optimum of the true objective function, as the minibatch size gets large.
There might be some interesting things to say about our online natural gradient method, described in Appendix B, in which estimate the uncentered covariance matrices and in a factored form as . Our online estimation of the covariance matrices involves multiplying by a weighted combination of (a) the observed covariance matrix from the current minibatch, and (b) the previous value of our factored approximation to it; it is like a matrix version of the power method (Del Corso, 1997).
Probably the analysis would have to be done initially in the steady state (i.e. assuming the parameter vector is constant). If in addition we assume infinite minibatch size so that the covariance matrix equals its expected value, we are confident that we could show that the only stable fixed point of our update equations gives us in some suitable sense the closest approximation to the covariance; and, with a little more effort, that our updates will converge with probability 1 to that best approximation.
The analysis for finite minibatch size would have to involve different methods. Because of the noise and the finite forgetting factor, we would never converge to the true value; but it might be possible to define some objective function that measures some kind of goodness of approximation, and then say something about the convergence of the distribution of that objective function.
We would like to touch on the subject of some other popular modifications of SGD, to explain why we do not use them in our experiments.
One frequently used modification to SGD is momentum (Polyak, 1964). This can be helpful in preventing parameter divergence, as momentum allows SGD to use a higher effective learning rate before parameter divergence is encountered. The original reason why none of our experiments involve momentum is that we found it quite hard to successfully incorporate momentum into multi-threaded parameter updates, needed for the CPU version of our training method; this is likely to be the reason why Downpour (Dean et al., 2012) does not use momentum. We developed other methods to prevent instability– namely, the limit on parameter change per layer per minibatch (Appendix C.3); and the natural gradient method itself.
Another popular modification of SGD is Adagrad (Hazan et al., 2007)
. This method divides the learning rate for each parameter by the standard deviation of the sum of gradients for that parameter, averaged over time (from the beginning of optimization until the present). This naturally gives thelearning rate schedule that is believed from theory to be optimal (Kushner & Yin, 2003), as well as giving separate learning rates for each diagonal element. There are two reasons why we felt that Adagrad was very unlikely to be helpful for large-scale speech recognition. Firstly, a learning rate has been found empirically be inferior to an exponentially decaying learning rate (Senior et al., 2013). Secondly, because our -norm nonlinearities (Zhang et al., 2014)
are non-saturating we don’t believe that our networks are susceptible to the kind of pathologies that would make some neurons in a layer require higher learning rates than others. This is also true between different hidden layers, due to special properties of the p-norm networks that we use here111The detailed argument in involves scale invariance of the network output w.r.t. the parameters for each layer; an invariance of the learning procedure with respect to scaling up the parameters for a layer and scaling up the learning rate at the same time; and the notion that parameters in a layer will tend to grow in size due to parameter noise, if the learning rate is too high. Essentially, we have reason to believe that as far as some directions requiring higher learning rates than others is concerned, all the interesting action for our particular type of network is “off the diagonal”– that is, it cannot be captured by a diagonal matrix. That is why we have not investigated Adagrad and why we smooth our estimates of the factors of the Fisher matrix to the identity and not to a diagonal matrix222Actually, there is another reason for this. We have previously derived an efficient online update of a factored Fisher matrix that had a low-rank plus diagonal form (work with Oriol Vinyals, not published), and the diagonal term caused the math to become very significantly more complicated..
We show experiments on a speech recognition setup called Fisher English333Linguistic Data Consortium (LDC) catalog numbers LDC2004S13, LDC2005S13, LDC2004T19 and LDC2005T19, which is English-language conversational telephone speech, sampled at 8kHz, and transcribed in a quick but relatively low-quality way. The total amount of training data is 1600 hours (only including transcribed segments, i.e. not the silent other half of the telephone conversation). We test on a held-out subset of the data, about 3.3 hours long, that we defined ourselves.
Our main results are convergence plots, but to give the reader some idea of the ultimate results in Word Error Rate, we show some results in Table 1. The Word Error Rates may seem on the high side, but this is mainly due to the difficulty of the data and the quick transcription method used on this data.
The GMM system is based on MFCC features, spliced across frames and processed with LDA+MLLT to 40-dimensional features, then adapted with feature-space MLLR (fMLLR) in both training and test time. See (Povey et al., 2011)
for an explanation of these terms and the normal system build steps. All these systems used the same phonetic context decision tree withcontext-dependent states; the GMM system had Gaussians in total.
The DNN1 system uses speaker adapted features from the GMM system, so it requires a first pass of GMM decoding and adaptation. The 40-dimensional features from GMM1 are spliced across frames of context and used as input to the DNN. DNN1 is a p-norm DNN (Zhang et al., 2014) with 5 hidden layers and p-norm (input, output) dimensions of (5000, 500) respectively, i.e. the nonlinearity reduces the dimension tenfold. We use “sub-classes” (see Section C.5 for explanation), and the number of parameters is 19.3 million. It is trained for 12 epochs with learning rate varying from 0.08 to 0.008, trained with 8 parallel jobs with online natural gradient SGD (NG-SGD). For both this and the DNN2 system, we trained with samples per outer iteration for each machine.
The DNN2 system is trained for our online decoding setup (see Appendix C.9), which is geared towards applications where reduced latency is important and audio data must be processed strictly in the order it is received. The input features are equivalent to unadapted, un-normalized 40-dimensional log-mel filterbank features, spliced for frames, plus a 100-dimensional i-vector representing speaker characteristics, extracted from only the speaker’s audio up to and including the current time. For the results shown here, we include previous utterances of the same speaker in the same conversation when computing the i-vector. Because this system is intended for real-time decoding on a single CPU, we limit the number of parameters by using only 4 hidden layers, p-norm (input, output) dimensions of (350, 3500), and sub-classes, for a total of 10.4 million parameters. It was trained using online NG-SGD with 6 parallel jobs for 5 epochs, with the learning rate decreasing exponentially from 0.01 to 0.001. All our experiments below are based on this setup.
Our server hardware is fairly typical: the majority of them are Dell PowerEdge R720 servers with two Intel Xeon E5-2680v2 CPUs having with 10 cores each, running at 2.8GHz; and with a single NVidia Tesla K10 GPU card, providing two GPUs– each GPU corresponds to a single machine in our notation, and it becomes incidental that they are co-located. We also have some similar machines with K20 GPU cards, and when reporting time taken, we report the slightly more optimistic figures obtained from running the same jobs on the faster K20 GPUs.
Our main result is in Figure 0(a) (best viewed in color), where we plot the objective function versus amount of training data processed, for our parallel training method with and without natural gradient, and with 1, 2, 4, 8 and 16 jobs. In order to keep the effective learning rate (Section 3.1) constant, we make the initial/final learning rates proportional to the number of jobs, with the default learning rates of 0.01 to 0.001 corresponding to the 6-job case.
Our natural gradient method always helps– the NG-SGD curves are all above the plain-SGD curves. Also, when using online natural-gradient, the curves shown in Figure 0(a) are close to each other up to about 4 jobs– i.e. after processing the same amount of data with different numbers of jobs we get about the same objective function; however, the 8- and 16-job runs converge a little slower. Thus, for small we are getting a linear speed up in the number of machines, because the time taken per epoch is proportional to . As gets larger than around we need more epochs to get the same improvement, so the speedup becomes sub-linear. The plot also shows that the simple and online natural gradient converge about the same (only tested with one job). We show the final Word Error Rates in Table 2; with NG-SGD, they are not very sensitive to the number of jobs.
Figure 0(b) shows the same plots as Figure 0(a) but with time as the x-axis. This is a simulated clock time, obtained by multiplying the time taken for each “outer iteration” of training, by the number of outer iterations; the actual clock time depends on queue load. The time per outer iteration was 88 seconds for plain SGD, 93 seconds for online NG-SGD, and 208 seconds for plain NG-SGD, all measured on a K20 GPU. The circles mark the end of training, after 5 epochs.
We have described an efficient Natural Gradient version of SGD training (NG-SGD). We have shown experimentally that not only does the method improve the convergence versus plain SGD, it also makes it possible for us to to use a data parallelization method where we periodically average and redistribute parameters across multiple SGD runs. This enables us to train in parallel even on machines that lack fast interconnections. Although we only show results from one setup, we are confident based on past experience that it holds true for other types of neural network (e.g. ReLUMaas et al. (2013) or sigmoid activations) and improves our final results (Word Error Rate) as well as convergence speed.
We do not have a very good explanation why our parallel training method only works when using Natural Gradient, except to say that the statements in (Pascanu & Bengio, 2013) that NG prevents large parameter steps and is more robust to reorderings of the training set, may be relevant.
We would like to thank Karel Vesely, who wrote the original “nnet1” neural network training code upon which the work here is based; Ehsan Variani and Pegah Ghahremani for their work on CUDA kernels; Hagen Soltau, Oriol Vinyals and Steven Eliuk for fruitful discussions; and many others, too numerous to mention, who have contributed to some aspect of the neural net setup and to Kaldi more generally.
The authors were supported by DARPA BOLT contract No HR0011-12-C-0015, and IARPA BABEL contract No W911NF-12-C-0015. We gratefully acknowledge the support of Cisco Systems, inc. (grant #574560) and Google, Inc. (Award 2012_R2_106, “Deep Neural Networks for Speech Recognition”), funds which were used to buy computer equipment and cloud computing time that were used in the development of these methods.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government.
Estimating an eigenvector by the power method with a random start.SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997.
Mean and Variance Adaptation Within the MLLR Framework.Computer Speech and Language, 10:249–264, 1996.
International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
Investigation of full-sequence training of deep belief networks for speech recognition.In INTERSPEECH, pp. 2846–2849, 2010.
Natural gradient descent for training multi-layer perceptrons.Unpublished (submitted to IEEE Tr. on Neural Networks), 1997.
Complexity issues in natural gradient descent method for training multilayer perceptrons.Neural Computation, 10(8):2137–2157, 1998.
In this section we describe the natural gradient method that uses the other elements of the minibatch to estimate the factors of the Fisher matrix.
As mentioned in Section 4.7, the interface can be described as follows. Given a matrix , each row of which represents one element of the minibatch (and with a number of columns corresponding to either the row or column dimension of one of the weight matrices in the network), do the inverse-Fisher multiplication for each row of and return the modified matrix .
The core of the inverse-Fisher multiplication is this: let , where is the Fisher matrix estimated from the other rows of , i.e. if is the minibatch size, then . We extend this basic idea by adding smoothing of with the identity matrix, and by scaling the output to have the same Frobenius norm as the input.
In this section we describe what we compute in our “simple” natural gradient method, without considering how to compute it efficiently. As described in Section 4.9, we smooth the Fisher matrix with the identity. Defining
where our normal settings are and , and and are the number of rows and columns respectively of , we define smoothed Fisher matrices as follows, to be applied to each row of :
For each row we will then define
and then the result of our computation will be
where the rescaling factor , intended to make sure that has the same Frobenius norm as the input , is defined as
If the denominator of the above is zero, we take to be one.
We should note that by computing the scalars and without “holding out” the current sample, we are violating the rule that the randomly sampled learning rate matrix should be independent of the current sample. However, since we always use a fairly large minibatch size (at least 100) and these are only scalar quantities, we don’t believe the small amount of “contamination” that takes place here will significantly bias the training. In fact, it might not turn out to be very difficult to modify the equations to properly hold out the current sample for these purposes, but because we don’t believe it would perceptibly affect the results, we haven’t gone to the trouble of doing this.
We now describe how we efficiently compute what we described above. Define the smoothed Fisher matrix
which is like the quantities but without holding out the current sample. Next, compute
where only differs from by not being held out from the corresonding . There are two equivalent methods to compute :
In column space:
In row space:
We derived the rather surprising row-space version of the formulation by expanding the inverted expression on the right of the column-space expression using the Morrison-Woodbury formula, and simplifying the resulting expression.
For efficiency, we choose method (i) above if the minibatch size is greater than the dimension (), and method (ii) otherwise. Our formula below for is derived by expressing each as a rank-one correction to , and computing the corresponding correction by which each row differs from the corresponding row of . It turns out that the correction is in the same direction as itself, so just becomes a scalar multiple of . Defining, for each row-index ,
and defining the scalar factor
then we can use the following efficient formula: for each row of ,
We then get the output by scaling by , as described above.
When working on CPUs with small minibatch sizes (e.g. ) and large hidden-layer dimensions (e.g. ), the computation above is very efficient, and does not comprise more than about 20% of the time of the overall backprop computation. However, when using GPUs with larger minibatch sizes (e.g. ) it can take the majority of the time. Even though it typically takes considerably less than half of the total floating point operations of the overall computation, it contains a matrix inversion, and matrix inversions are not very easy to compute on a GPU. Our “online” method which we will describe below is designed to solve this efficiency problem.
The interface of the online natural-gradient method is essentially the same as the simple method: the user provides a matrix , and we return a matrix that’s been multiplied by the inverse-Fisher and then rescaled to have the same Frobenius norm as . Again, each row of corresponds to an element of the minibatch and the column dimension corresponds to the row or column dimension of one of the weight matrices. A difference from the simple method is that the online method is “stateful”, because we maintain a running estimate of the Fisher matrix. Each time we process a minibatch, we use the Fisher matrix estimated from the previous minibatches; and we then update that estimate using the current minibatch. For a single neural net, the number of separate copies of this “state” that we need to maintain corresponds to twice the number of trainable weight matrices in the neural net: one for each of the and quantities in Equation (2).
Let the input be , where is the minibatch size (e.g. 512) and is the row or column size of the weight matrix we’re updating (e.g. 2000). We introduce a user-specified parameter which is the rank of the non-identity part of the Fisher matrix. Let the subscript correspond to the minibatch. Define
where , and will be estimated online from data; has orthonormal rows and is diagonal and nonnegative. We’ll estimate these quantities online from the data with the aim being that should be a good estimate of the covariance of the rows of the quantities.
We define to be a kind of “smoothed” version of where we add in more of the unit matrix, controlled by the parameter we’ve previously discussed (normally ):
and then the output will be:
where is computed so as to ensure that the Frobenius norm of equals that of :
or if the denominator of the above equation is 0.
Next we discuss the method we use to estimate our low-rank approximation of the uncentered covariance of the rows of the inputs . Define
as the uncentered covariance of the rows of . We introduce a user-specified “forgetting factor” (we describe how this is set in Section B.4), and we define
We will try to set to be a good low-rank approximation to . The obvious way would be to make correspond to the top eigenvalues of and to the corresponding eigenvectors, but this would be too slow. Instead we use a method inspired by the power method for finding the top eigenvalue of a matrix. On each iteration we compute
with . It is useful to think of as containing each eigenvector scaled by its corresponding eigenvalue in (of course, this is true in a precise sense only at convergence). Our update uses symmetric eigenvalue decomposition of to find these scaling factors (they are actually the square roots of the eigenvalues of ), puts them on the diagonal of , and puts the corresponding eigenvectors in the rows of . We then have to work out the correct amount of the unit-matrix to add into our factorization of the covariance matrix (i.e. set ) and subtract that amount from the diagonals of . We will give equations for this below.
Observant readers might have noted that it would seem more straightforward do do a Singular Value Decomposition (SVD) oninstead of a symmetric eigenvalue decomposition on . We do it this way for speed.
The details of our update are as follows:
so . Then do the symmetric eigenvalue decomposition
with orthogonal and diagonal. The diagonal elements of are positive; we can prove this using (which makes positive definite) and using the fact that has full row rank. We define as:
If we expand out using (27), it is easy to see that it reduces to the identity, hence has orthonormal rows. In order to make sure that has the desired covariance in the directions corresponding to the rows of , we will set
but note that at this point, is still unknown. When we say the “desired covariance”, we are ensuring that for each dimension corresponding to a row of , the value of the inner product equals that of , but this is only precisely true at convergence.
We choose in order to ensure that . This value can be worked out as:
We then let
for ; this is is to ensure that if we get a sequence of zero inputs, will not become exactly zero in its machine representation.
The previous section described what we are computing in the online natural gradient method; here we describe how to compute it efficiently. The essential idea here is to reduce the multiplication by to two multiplications by a “fat” matrix (of dimension ). Since typically is much smaller than , this is quite efficient. We also address how to efficiently keep these matrices updated, at the level of optimizing the matrix expressions. This section is mostly derivation, and will likely only be of interest to someone who is considering implementing this method. In Section B.5 below, we will summarize the algorithm we derive here.
We can write as:
where the factor of is inserted arbitrarily to simplify the update equations; a scalar factor on doesn’t matter because we will later rescale it to have the same norm as . The output of this whole process is
where, in the expression for , if the denominator is zero we take . Note: is not the same as in (21) because of the arbitrary factor of , so consider (21) to be superseded by (38). To efficiently compute (36), we apply the Woodbury matrix identity to (31), giving us
In order to reduce the number of matrix multiplies, it is useful to break the expression into two equal parts, so we define
and we will never store ; instead, we will work with and the small diagonal factors and . We can now write the following, which is where most of our computation will take place:
You may recall the symmetric matrix defined in (25), which is involved in the update of our factorization. The following expressions are going to be useful when computing it, and the first of them appears as a sub-expression of (45). For convenience we state the dimensions of these quantities below:
After we have , we can compute using a single matrix multiply as:
We can expand , defined in (24), into quantities that will be computed, as:
Using (55) we can expand , as:
and we can substitute some of the sub-expressions we defined above into this, to give:
Our strategy will be to compute the symmetric quantities and on the GPU, and transfer them to the CPU where we can then compute using the expression above – this can be done in – and then do the symmetric eigenvalue decomposition as in (26), on the CPU. We repeat the equation here for convenience:
Here, will be orthogonal, and mathematically, no element of the diagonal matrix can be less than , so we floor its diagonal to that value to prevent problems later if, due to roundoff, any element is smaller than that.
Below, we’ll say how we efficiently compute and ; for now, just assume those quantities have been computed.
We compute as follows, expanding in (29):
We can now compute and ; we floor both to to ensure they never go to exactly zero which could cause problems for our algorithm.
for a small constant (the first is taken per element). We can now compute the scalar and the diagonal matrix (we show the formula for its diagonal elements):
We never construct in memory, but instead we directly compute . We can factor it as follows:
and note that while it might seem like a factor of is missing from the second term in , in fact we use the fact that it commutes with to move it to the left, into . If we’re using a GPU, will be computed in time on the CPU and transferred to the GPU; we then compute on the GPU efficiently by scaling the rows of and adding ; then we multiply and on the GPU.
We have noticed that the invariance can sometimes be lost due to roundoff. A proper analysis of roundoff in our algorithm is not something we have time to do, but we will describe how we detect and fix this problem in practice. For speed, we only do the following operations if the diagonal matrix , has condition number greater than , or if any elements were floored as mentioned just after (58). Note: all the computations we describe in this paper were done in single precision.
We compute the symmetric matrix
where the part in parentheses is computed on the GPU and transferred to the CPU. If no element of differs by more than from the corresponding element of the unit matrix, we consider that is sufficiently orthogonal and we do nothing more. Otherwise, we do a Cholesky decomposition , compute the reorthogonalizing factor on the CPU and copy to the GPU, and do to reorthogonalize. Re-orthogonalization happens extremely rarely, and usually only if something bad has already happened such as parameter divergence.
In our implementation we don’t bother dumping the “state’ of the computation to disk so each new process reinitializes them for the first minibatch it processes. We initialize them so as to most closely approximate the covariance of the first minibatch of features. This is done by taking
and finding the top eigenvalues and eigenvectors; the rows of contain the top eigenvectors. Let be the corresponding eigenvalues, for , and we set
for , and for , we let .
We mentioned above that we have a fast way of computing the quantities and . These are needed to compute using (38), and to compute using (59). We compute these as a side effect of the fact that we need, for each row of the output, its squared norm . This will be required to enforce the “maximum parameter change” per minibatch, as described in Section 3.2.2. Suppose we’ve already computed using (53). We compute the inner products for all rows of as
using a single GPU kernel invocation. If we are updating the parameters of the Fisher-matrix factorization, then we can most efficiently obtain our desired traces as follows:
The expression for was obtained by expanding using (53), moving to the left, and recognizing sub-expressions that we have already computed. In case we are not updating the parameters of the Fisher-matrix factorization, we have no other need for so it will be more efficient to compute directly; this can of course be done in operations and does not require a matrix multiply. Once we have the scaling factor we can scale the quantities by its square, and they will equal the quantities that we’ll need for enforcing the maximum parameter change.
Most of what we have written above is geared towards operation using a GPU, but we also support operation with CPUs, where our SGD implementation is multithreaded. In this case, we have to consider the interaction with multithreaded code because of the “stateful” nature of the computation. We wanted to avoid a bottleneck where different threads wait to update the parameters sequentially. Our solution is that before doing the part of the computation where we update the parameters, we try to get a lock, and if this fails, we simply apply the fixed inverse-Fisher matrix but don’t update the Fisher-matrix parameters and so on. Since the model parameters don’t move very fast, we don’t expect that this will make any noticeable difference to the SGD convergence, and we have seen no evidence that it does.
The most important user-specified parameters for our algorithm are the rank and the constant that controls smoothing with the unit matrix. The value seems to work well over a wide variety of conditions, so we normally leave it at that value. The rank should generally increase with the dimension of the vectors we are multiplying. Our experiments here are with “p-norm” networks where the nonlinearity is dimension reducing, like maxout (Goodfellow et al., 2013), typically reducing the dimension from something like 3000 to 300. So a typical parameter matrix will increase the dimension from something like 301 to 3000 (it’s 301 instead of 300 because of the bias term). Our normal rule for ranks is to use on the input side of each matrix and on the output side. Part of the way we originally tuned this is to look at the diagonal matrices . These matrices have diagonal values , sorted on from greatest to least, and can be interpreted as the amount by which the input is scaled in a certain direction in the space. A value of close to 1 means we are strongly scaling down the input, and a value close to 0 means we are leaving it unchanged. If the last has a value of, say, 0.1, then reducing by one will be like taking a scaling factor of 0.9 applied to a gradient, and setting to 1 instead; this seems unlikely to make any difference to the SGD, as it’s like changing the learning rate in some direction from 0.9 to 1. Our final values are normally in the range 0.05 to 0.2.
Another configurable constant is the “forgetting factor” : the closer is to 1, the more rapidly we track changes in the Fisher matrix due to changes in parameters, but the more noise we will have in our estimates. Because we don’t want to have to tune when we change the minibatch size, we set it as follows. The user specifies a parameter (interpreted as an approximate number of samples to include in our estimate of the Fisher matrix), and we set
where is the minibatch size. We normally set ; we have no reason to believe that this is a very important parameter.
In order to increase the speed of the algorithm, we normally configure it so that we only actually update the parameters of the Fisher matrix every 4 minibatches, except on the first 10 minibatches in a process, when we always update them.
Here we summarize the online natural-gradient SGD method– that is, we summarize the core part of the algorithm that takes a matrix , and outputs a matrix . To understand how this fits into the bigger picture of back-propagation and SGD, see Section 4.
For this summary we will ignore issues of multithreading. Our explanation here is just for one instance of the algorithm, corresponding to the row or column dimension of one of the weight matrices; if there are weight matrices, there are separate copies of the variables we describe here.
Typical configuration variables are as follows: , (this will determine ), rank (or 80), ; and let’s define a variable that dictates the period with which we update the Fisher-matrix factors. Minibatch size is normally 128 (on CPU), or 512 (on GPU).
On , before running the steps below we have to initialize the parameters as described in Section B.3.2. Note: while in Section B.3.2 we describe how to set , and , the variables which we actually store are , , and ; to compute we need Equations (35), (43) and (44).
We have an input , and despite the notation, we do not require that be the same for all – sometimes the last minibatch we process has a smaller than normal size.
If or divides exactly, then we will be updating the factored Fisher matrix; otherwise we just apply it and don’t update. There are two slightly versions of the algorithm, depending whether we will be updating the Fisher matrix.
In either case, we first compute from and using (78), and then compute
From here the two cases begin to differ.
If we won’t be updating the Fisher matrix, then it’s simpler. The input is . We first compute . Then we compute
overwriting the input . Next, for each we compute the row-products