Deep learning success stories are predicated on large neural network models being trained using ever larger amounts of data. While the computational speed and memory available on individual computers and GPUs grows ever larger, there always will remain some problems and settings in which the amount of training data available will not fit entirely into the memory of one computer. What is more, and even for a fixed amount of data, as the number of parameters in a neural network or the complexity of the computation it performs increases, so too does the time it takes to train. Both large training data and complex networks inspire parallel training algorithms.
In this work we focus on parallel stochastic gradient descent (SGD). Like the substantial and growing body of work on this topic (e.g. non-exhaustively: Recht et al. (2011); Dean et al. (2012); McMahan & Streeter (2014); Zhang et al. (2015)) we too focus on gradients computed in parallel on “mini-batches” drawn from the training data. However, unlike most of these methods, which are asynchronous in nature, we focus instead on improving the performance of synchronous distributed SGD, very much like Chen et al. (2016), upon whose work we directly build.
The main problem in fully synchronous distributed SGD is the straggler effect. This real-world effect is caused by the small and constantly varying subset of worker nodes that, for whatever random reasons, always perform their mini-batch gradient computation slower than the rest of the concurrent workers, causing long idle times in all of the workers which have already finished. Chen et al. (2016) introduced a method of mitigating the straggler effect on wall-clock convergence rate by picking a fixed cut-off for the number of workers on which to wait before synchronously updating the parameter on a centralized parameter server. They found, as we demonstrate in this work as well, that the increased gradient computation throughput that comes from reducing idle time more than offsets the loss of a small fraction of mini-batch gradient contributions per gradient descent step.
Our work exploits this same key idea but substantially improves the way the likely number of stragglers is identified. In particular we instrument and generate training data once for a particular compute cluster then use it to train a lagged generative latent-variable time-series model that encodes the joint worker run-time behavior of all the workers in the cluster. For highly contentious clusters with poor job schedulers, such a model might reasonably be expected to learn to model latent-states that produce correlated, grouped increases in observed run-times due to resource contention. For well-engineered clusters such a model might learn that worker run-times are nearly perfectly independently and identically distributed.
Specifying such a flexible model by hand would be difficult. Also, as we will soon explain, we will need to perform real-time posterior predictive inference in said model at distributed synchronous SGD run-time to dynamically predict the straggler cut-off. For both these reasons we use the variational autoencoder loss (Kingma & Welling, 2013) to simultaneously learn not only the model (Krishnan et al., 2017) parameters but also the parameters of an amortized inference neural network (Ritchie et al., 2016; Le et al., 2017) that allows for real-time approximate predictive inference of worker run-times.
The main contributions of this paper are:
The idea of using amortized inference in a deep latent-variable time-series model to predict computer cluster worker run-times, in particular for use in a distributed synchronous gradient descent algorithm.
The dynamic cut-off distributed synchronous gradient descent algorithm itself, including in particular the approximations made to enable real-time posterior predictive inference.
The empirical verification at scale of the decrease in time to convergence that our algorithm yields when training deep neural networks.
The rest of the paper is organized as follows. Section 2 grounds our investigation in its motivation of synchronous SGD speedup. Section 3 outlines two models of compute cluster run-times used to dynamically determine a straggler cut-off. Section 4 highlights our experimental results.
2 Background and Motivation
In stochastic gradient descent, we rely on unbiased estimates of the gradient in order to update the global parameter settings. Distributed mini-batch SGD differs from serial mini-batch SGD in that the mini-batch of size $M$ is distributed to $n$ worker computers that locally compute sub-mini-batch gradients before communicating the result back to a centralized parameter server that updates the parameter using a gradient update step:
$$w_{t+1} = w_t - \frac{\alpha}{n} \sum_{j=1}^{n} g_j, \qquad g_j = \frac{n}{M} \sum_{i \in \mathcal{B}_j} \nabla_w F(y_i; w_t) \qquad (1)$$
where $w_t$ is the parameter at iteration $t$, $\mathcal{B}_j$ indexes the $M/n$ training examples $y_i$ assigned to worker $j$, $F$ is the loss function and $\alpha$ is the learning rate. Distributed SGD, as shown, uses unbiased gradient estimates, leaving $\alpha$ and $M$ as tunable hyperparameters governing the convergence properties of the algorithm, similar to the single-threaded case. Too high a learning rate causes the algorithm to diverge, while a low mini-batch size and a low $\alpha$ can both produce convergence to local minima (Hoffer et al., 2017).
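The synchronous update described above can be sketched in a few lines. This is a minimal single-process illustration, not the implementation used in the paper: the scalar least-squares loss, the helper names `grad` and `sync_sgd_step`, and the hyperparameter values are all assumptions chosen for readability, and the per-worker gradients that would be computed in parallel are computed here in a loop.

```python
import random

def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)^2 with respect to scalar w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sync_sgd_step(w, minibatch, n_workers, lr):
    """One synchronous step: shard the mini-batch, average the worker gradients."""
    size = len(minibatch) // n_workers
    shards = [minibatch[j * size:(j + 1) * size] for j in range(n_workers)]
    # In a real cluster each shard gradient is computed on a separate worker.
    worker_grads = [grad(w, shard) for shard in shards]
    return w - lr * sum(worker_grads) / n_workers

random.seed(0)
# Hypothetical noise-free data generated from y = 3x.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(1024))]
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, random.sample(data, 64), n_workers=4, lr=0.5)
```

Because the shards are of equal size, averaging the shard gradients gives the same value as the full mini-batch gradient, which is why the parameter server can simply average what the workers send back.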
2.1 Effect of Stragglers
In synchronous SGD, we can attribute low throughput, in the sense of central parameter updates per unit time, to the straggler effect that arises in real-world cluster computing scenarios with multiple workers computing in parallel. Consider Equation 1, in which each sub-mini-batch gradient $g_j$ is computed independently on a memory-isolated logical processor. Let $x_j$ be the time it takes for $g_j$ to be computed on the worker indexed by $j$, for $j = 1, \dots, n$. Distributed computers are not ideal; otherwise $x_j$ would be a constant, independent of $j$, and all workers would finish at the same time and only be idle while the parameter server aggregates the gradients and sends back the new parameters. Instead, $x_j$ is actually random. Moreover, the joint distribution of all the $x_j$’s is likely, again in real-world settings, to be non-trivially correlated owing to cluster architecture, etc. For instance most modern clusters consist of computers or graphics processing units each in turn having a small number of independent processors, so slowdowns in one logical processing unit are likely to be exhibited by others sharing the same, for instance, bus or network address. What is more, in modern operating systems, time-correlated contention is quite common, particularly in clusters under queue management systems, when, for instance, other processes, operating system or user, are concurrently executed. All this yields worker compute times that may be non-trivially correlated in both time and in “space.”
Our aim is to significantly reduce the effect of stragglers on throughput and to do so by modeling cluster worker compute times in a way that intelligently and adaptively responds to the kinds of correlated run-time variations actually observed in the real world. What we find is that doing so improves overall performance of distributed mini-batch SGD.
Our approach works by maximizing the total throughput of parameter updates during a distributed mini-batch SGD run. The basic idea, shared with Chen et al. (2016), is to predict a cutoff, $c$ (Alg. 1, line 23), for each iteration of SGD which dictates the total number of workers on which to wait before taking a gradient step in the parameter space (Alg. 1, line 29). To be concrete about why we want to do this: if the slowest straggler takes 10 seconds to finish, but the second slowest takes 8, then there is already a 20% reduction in the wall-clock time simply by setting the cutoff to $c = n - 1$, one fewer than the total number of workers $n$.
The central considerations are: what is the notion of throughput we should optimize? And how do we predict the cutoff that achieves it?
Simply optimizing overall run-time admits a trivial and unhelpful solution of setting the cutoff to zero workers. Each iteration and the overall algorithm would then take no time, but no gradients would be computed. Instead we seek to maximize the number of workers to finish in a given amount of time, i.e. the throughput $\Omega(c)$, which we define to be:
$$\Omega(c) = \frac{c}{x_{(c)}}$$
where $c$ indexes the ordered worker run-times $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. Note that, for now and throughout when clear, we will avoid indexing run-times by SGD loop iteration, although we specifically will make use of temporal correlation between worker run-times soon enough.
We define our objective to be maximizing the throughput of the system as defined above, i.e. choosing $c^{*} = \arg\max_{c} \Omega(c)$, which, as it turns out, will yield improved overall learning as a consequence of calculating and incorporating the maximum number of gradients over time.
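As a small worked illustration of this objective, the sketch below picks the cutoff for one iteration given a vector of run-times. It assumes that throughput is the number of workers waited for divided by the run-time of the slowest of them; the function name `best_cutoff` and the example run-times are hypothetical.

```python
def best_cutoff(run_times):
    """Return (c, throughput) maximizing c / x_(c) over c = 1..n."""
    ordered = sorted(run_times)
    best_c, best_tp = 0, 0.0
    for c, t in enumerate(ordered, start=1):
        tp = c / t  # workers finished per unit time if we wait for c workers
        if tp > best_tp:
            best_c, best_tp = c, tp
    return best_c, best_tp

# Three similar workers and one pronounced straggler (seconds).
run_times = [1.0, 1.1, 1.2, 3.0]
c, tp = best_cutoff(run_times)
```

Here waiting for the straggler lowers throughput, so the maximizer drops it. With a milder straggler the maximizer can still be all workers, which is why the cutoff has to be chosen from the predicted run-times rather than fixed in advance.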
Setting the cutoff optimally and dynamically requires a model which is able to learn and predict the joint run-times of all cluster workers. With such a model, we aim to make highly informed and accurate predictions about the next set of run-times per worker and consequently make a real-time optimal choice of the cutoff $c$ for the subsequent loop of sub-mini-batch gradient calculations. How we model computer cluster worker performance follows.
3.1 Modeling Computer Cluster Worker Performance
As before, let $x_j$ be the random time it takes for the sub-mini-batch gradient $g_j$ to be computed on the worker indexed by $j$. Assume that these are distributed according to some distribution $p(x)$.
3.1.1 Order Statistics
Given a set of $n$ identically $p(x)$-distributed random variables $x_1, \dots, x_n$, we wish to know the joint distribution of the $n$ sorted random variables $x_{(1)} \le \dots \le x_{(n)}$. Such quantities are known as “order statistics.” For instance, under the assumption that the run-times are independent and $p(x) = \mathcal{N}(\mu, \sigma^2)$, each order statistic $x_{(k)}$ has a marginal distribution $p(x_{(k)})$ that describes the $k$th smallest sorted run-time under $n$ independent draws from this underlying distribution.
Under the given independent and identically distributed (iid) normality assumption the distribution of each order statistic has closed form:
$$p(x_{(k)} = x) = \frac{n!}{(k-1)!\,(n-k)!} \, P(x)^{k-1} \, [1 - P(x)]^{n-k} \, p(x)$$
where $P$ is the cumulative distribution function (CDF) of $p(x)$ and $p(x) = \mathcal{N}(x; \mu, \sigma^2)$ is its density.
Note that the expected value of the upper order statistics, including the maximum, increases as the variance of the run-time distribution increases, while the average run-time does not.
Given $n$ workers, the expected average idle time for each worker if synchronizing on all completing can be derived to be:
$$\frac{1}{n} \sum_{j=1}^{n} \left( \mathbb{E}[x_{(n)}] - \mathbb{E}[x_{(j)}] \right) \approx \mathbb{E}[x_{(n)}] - \mathbb{E}[x_{(\lceil n/2 \rceil)}]$$
The latter approximation holds because the expected order statistics of iid draws from a Gaussian are symmetric around the middle order statistic; workers wait on average the difference between the longest (highest) order statistic and the middle order statistic.
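As a baseline in subsequent sections we will use a useful approximation of the expectations of order statistics under this iid normality assumption. This is known as the Elfving (1947) formula (Royston, 1982):
$$\mathbb{E}[x_{(k)}] \approx \mu + \Phi^{-1}\!\left( \frac{k - \pi/8}{n - \pi/4 + 1} \right) \sigma \qquad (3)$$
where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution, and $\mu$ and $\sigma$ are the mean and standard deviation of the run-time distribution.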
It is not known how to derive the analytic form of the joint order statistic distribution of non-Gaussian distributed correlated random variables. However a Monte Carlo approximation of the order statistics is straightforward: use a model to predict the joint distribution of the $x_j$’s, then sample, sort, and record the values of all sorted samples, and then repeat. Towards that end we will first develop a model of correlated compute times from which we will then be able to construct Monte Carlo order statistic estimates for use in determining the optimal cutoff threshold.
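The sample, sort, record, repeat procedure can be sketched as follows. The joint sampler here is a hypothetical stand-in for the learned model developed in the next section: a shared slowdown factor is used only to induce correlation between workers, and the function names and distribution parameters are assumptions for illustration.

```python
import random

def sample_joint_run_times(n_workers, rng):
    """Hypothetical stand-in for a learned joint model of worker run-times."""
    shared = rng.choice([1.0, 1.5])  # cluster-wide contention state
    return [shared * rng.lognormvariate(0.0, 0.2) for _ in range(n_workers)]

def mc_order_statistics(n_workers, n_samples, seed=0):
    """Estimate E[x_(1)], ..., E[x_(n)] by sampling, sorting, and averaging."""
    rng = random.Random(seed)
    totals = [0.0] * n_workers
    for _ in range(n_samples):
        ordered = sorted(sample_joint_run_times(n_workers, rng))
        for k, value in enumerate(ordered):
            totals[k] += value
    return [total / n_samples for total in totals]

expected = mc_order_statistics(n_workers=8, n_samples=2000)
```

The resulting list of expected ordered run-times can be fed directly into a throughput calculation to choose a cutoff, without any independence or normality assumption on the sampler.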
3.1.2 Generative Model
Before introducing the design of the generative model we use to predict worker run-times, first consider why a generative model is all but necessary here, certainly in comparison to a purely autoregressive model, for predicting run-times given a lagged window. In short we can only consider worker run-time prediction models that are extremely sample efficient to train. This is because one receives no benefit whatsoever if the predictive model should require collected training data for many thousands of distributed SGD runs before being able to use it. We also can only consider a kind of model that allows real-time prediction because it will be in the inner loop of the parameter server and used to predict at run-time how many straggling workers to ignore. Deep neural net auto-regressors satisfy the latter but not the former. Generative models satisfy the former but historically not the latter; except now deep neural net guided amortized inference in generative models does. This forms the core of our technical approach.
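We will model the time sequence of observed joint worker run-times $x_{1:T}$, where each $x_t$ collects the run-times of all workers at SGD iteration $t$, using a hidden Markov model where $z_t$ is the time-evolving unobserved latent state of the cluster. The dependency structure of our model factorizes as:
$$p_\theta(x_{1:T}, z_{1:T}) = p_\theta(z_1) \, p_\theta(x_1 \mid z_1) \prod_{t=2}^{T} p_\theta(z_t \mid z_{t-1}) \, p_\theta(x_t \mid z_t)$$
where, for reasons specific to amortizing inference, we will restrict our model to a fixed-lag window of length $\ell$. The principal model use is the accurate prediction of the next set of worker run-times from those that have come before:
$$p_\theta(x_{T+1} \mid x_{T-\ell+1:T}) = \int p_\theta(x_{T+1} \mid z_{T+1}) \, p_\theta(z_{T+1} \mid z_T) \, p_\theta(z_T \mid x_{T-\ell+1:T}) \, dz_T \, dz_{T+1} \qquad (4)$$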
3.1.3 Model Learning and Amortized Inference
With the coarse-grained model dependency defined, it remains to specify the fine-grained parameterization of the generative model, to explain how to train the model, and to show how to perform real-time approximate inference in the model.
First we use the deep linear dynamical model introduced by Krishnan et al. (2017). Namely, the transition and emission functions in our model are parametrized by neural networks:
whose specific architecture is:
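where $I$ is the identity matrix and $\mathrm{MLP}(\cdot, \mathrm{NL}_1, \dots, \mathrm{NL}_L)$ denotes an $L$-layer multilayer perceptron containing nonlinearities $\mathrm{NL}_1$, $\mathrm{NL}_2$, etc.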
Our model also utilizes the gated transition function for and :
The flexibility of such a model allows us to avoid making restrictive or inappropriate assumptions that might be quite far from the true generative model while imposing rough structural assumptions that seem appropriate like correlation over time and correlation between workers at a given time.
The remaining tasks are, given a set of training data, to learn the generative model parameters and to train an amortized inference network to perform real-time inference in said model. For this we utilized the variational autoencoder-style loss used for amortized inference in deep probabilistic programming with guide programs (Ritchie et al., 2016). The guide program structure we used is a structured left-right model:
where denotes scalar multiplication below:
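We use stochastic gradient descent to simultaneously optimize the variational evidence lower bound (ELBO) with respect to both the generative model parameters $\theta$ and the inference network parameters $\phi$:
$$\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})} \left[ \log p_\theta(x_{1:T}, z_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T}) \right]$$
where $x_{1:T}$ are the observed run-times and $z_{1:T}$ the latent states. Doing this yields an extremely useful by-product. Maximizing the ELBO also drives the KL divergence between $q_\phi(z_{1:T} \mid x_{1:T})$ and $p_\theta(z_{1:T} \mid x_{1:T})$ to be small. We will exploit this fact in our experiments to speed run-time prediction.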
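In particular we will directly approximate Equation 4 by
$$p_\theta(x_{T+1} \mid x_{T-\ell+1:T}) \approx \frac{1}{K} \sum_{k=1}^{K} p_\theta(x_{T+1} \mid z_{T+1}^{(k)}), \qquad z_{T+1}^{(k)} \sim p_\theta(z_{T+1} \mid z_T^{(k)})$$
with $z_T^{(k)}$ being the last-time-step marginal of the $k$th of $K$ samples from $q_\phi(z_{T-\ell+1:T} \mid x_{T-\ell+1:T})$, where $\ell$ is the length of the lag window.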
During training, and at test-time, we normalize the observations by dividing by twice the mean of the first fixed-lag window. In doing so, we avoid retraining the model for neural networks and batch sizes that cause longer runtimes.
In all experiments we use a twenty-timestep lag, i.e. a fixed-lag window of twenty SGD iterations.
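The normalization step can be sketched as below. This is an illustration under stated assumptions: the mean is taken over all workers and all timesteps in the first lag window, and the function name `normalize_run_times` and the toy run-times are hypothetical.

```python
def normalize_run_times(history, lag=20):
    """Divide every run-time by twice the mean of the first lag window.

    history is a list of per-iteration lists of worker run-times.
    """
    window = history[:lag]
    if len(window) < lag:
        raise ValueError("need at least one full lag window")
    flat = [x for step in window for x in step]
    scale = 2.0 * sum(flat) / len(flat)
    return [[x / scale for x in step] for step in history], scale

# Twenty iterations of three workers, followed by one slower iteration.
history = [[1.0, 1.2, 0.8]] * 20 + [[2.0, 2.4, 1.6]]
normalized, scale = normalize_run_times(history)
```

Because the scale is fixed from the first window, later slowdowns still show up as larger normalized values, while a workload that is uniformly slower from the start maps onto the same range the model was trained on.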
4.1 Predicting Worker Runtimes
To test whether our model predicts joint worker runtimes accurately enough that the approximate order statistics derived by sorting the output of the proposal network match the true ordered runtimes of workers in subsequent timesteps, we perform the same experiment on two clusters of distinctly different architectures and sizes. In particular we record worker compute times for fully synchronous SGD iterations while training a deep neural network. Using these recordings we train the generative model parameters and the proposal network, then use both to predict next-time-step worker processing times and to compute the cut-off we would use at run-time. We compare the predicted runtimes and cut-offs to the ground truth, observed and computed from the known actual next-time-step worker runtimes.
On one cluster comprised of four nodes of forty logical Intel Xeon processors, we trained a 3-Layer CNN network to do MNIST classification with a 60000:10000 training to validation data split. We use a single parameter server leaving an available 158 worker count, across which we trained using synchronous SGD and recorded each worker runtime of an iteration of SGD for 1 hour. The resulting mean and standard deviation of the worker runtimes were 1.057 and 0.393 seconds respectively.
Using these values in the Elfving formula (Eqn. 3), we find that the maximum expected runtime out of 158 independent workers would be 2.1063 seconds. This means that, on average, workers in fully synchronous SGD spend approximately 1.049 seconds idle while the longest running thread finishes its computation. During this time, a second gradient could almost be calculated, which provides some insight into the large increase in efficiency of our approach.
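The calculation above can be reproduced approximately with the standard form of the Elfving approximation, $\mathbb{E}[x_{(k)}] \approx \mu + \Phi^{-1}\big((k - \pi/8)/(n - \pi/4 + 1)\big)\,\sigma$. The sketch below uses only the Python standard library; the function name is hypothetical, and small differences from the quoted figures are to be expected from rounding of the reported mean and standard deviation.

```python
import math
from statistics import NormalDist

def elfving_expected_order_stat(k, n, mu, sigma):
    """Elfving approximation to E[x_(k)] for n iid N(mu, sigma^2) draws."""
    p = (k - math.pi / 8) / (n - math.pi / 4 + 1)
    return mu + sigma * NormalDist().inv_cdf(p)

# Worker count, mean, and standard deviation reported for the local cluster.
n, mu, sigma = 158, 1.057, 0.393
expected_max = elfving_expected_order_stat(n, n, mu, sigma)
expected_idle = expected_max - mu  # average wait if synchronizing on all workers
```

The expected maximum comes out at roughly twice the mean run-time, which is the sense in which a second gradient could almost be computed while waiting for the slowest worker.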
In Figure 2 there is clear evidence that the strong assumptions of independent and identically distributed runtimes underlying the iid normality baseline do not hold. Figure 2 clearly shows what can happen on a highly contentious cluster, which produced different levels of correlated worker runtimes throughout the hour. In order to reduce the total wait times, a model of a compute cluster is required, and in particular one that does not make unnecessary and inaccurate assumptions about the distributions from which the runtimes are drawn.
The crux of our approach is to reduce the wait time of these processes. Thus we trained our model and inference network using Adam with gradient clipping on the data collected from the small cluster. In addition, we also trained a production scale inference network with data taken from a Cray XC40 supercomputer operating on 32 KNL nodes with 68 logical cores per node. The worker counts available to us on these systems are 158 and 2175, respectively.
Both trained models display high performance on validation sets, where we test by comparing the next available vector of run-times against the predicted run-times emitted by the preceding 20 timestep sequence (see Figure 3). On our local cluster, we also discovered a set of slow workers that persisted for about 1000 iterations likely due to the contention of resources by another unrelated job that overlapped our training data collection phase. In the first mode, lasting from iterations 1 to 61, we observe a single slow compute node (the cluster in question contains 4 nodes with 40 cores each) as the main bottleneck. In the second mode, this slow machine equilibrates with the remaining 3, and our run-times are more uniform throughout the workers. We include this window in our test data and demonstrate that the model has learned both dynamical modes.
In Figure 2, we compare the throughputs achieved by our approach with the maximum and naive throughputs in our 158 worker model. Again, our validation set was chosen to include a window of interest where, at iteration 61, the compute cluster sheds a set of 40 slow workers and operates at uniform efficiency. During this transition, our inferred cutoff is only set suboptimally for a few iterations, before recovering near maximum performance after iteration 85.
4.2 Handling Censored Run-times
As described, we use the learned inference network to predict future cutoffs rather than the generative model. Because variational inference jointly learns the model parameters along with the inference network, we could theoretically use an inference algorithm such as SMC (Doucet et al., 2001) for more accurate estimates of the true posterior. However, our cutoff prediction must be done in an amortized setting, because we rely on it to be set for a gradient run prior to the updates returning from the workers. In a setting that requires fast, repeated inference, using an amortized method is often the only approach, especially in large complex models.
However, when using amortized inference, there is a practical implication of dealing with partially observed and in fact censored data. Since at run-time we are only waiting for gradients up to the cutoff, and are in fact actually killing the straggling workers, we do not have the run-time information from the straggling workers that would have finished past the cutoff. This results in censored observations, and we know that censoring occurs right at $x_{(c)}$, the run-time of the last worker admitted by the cutoff $c$. Inference in the generative model could directly be made able to deal with censored data; however, our inference network runs an RNN which was trained on fully observed run-time vectors and therefore requires fully observed input to function correctly. Because of this, we describe an effective approximate technique for imputing the missing worker runtime values.
Our practical solution is to sample a new uncensored data point for every worker whose gradients are dropped. Because we push estimates of the approximate posterior through the generative model, we have a predictive run-time distribution for the current iteration of SGD before receiving actual updates from any worker. When eventually the cutoff is reached, and the corresponding rate censor is observed, we are left with run-time distributions, which are left truncated at the censoring time $x_{(c)}$:
$$p(x_j \mid x_j > x_{(c)}) = \frac{p(x_j)\, \mathbb{1}[x_j > x_{(c)}]}{\int_{x_{(c)}}^{\infty} p(x)\, dx}$$
where we have left off the time index for clarity, $p$ is the predictive run-time distribution of the worker in question, and $x_j$ is any one of the censored worker runtime observations.
When a censored value is required, we take its corresponding predicted run-time distribution and sample from the right tail truncated distribution to get an approximate value for that missing run-time. We find that this method works well to propagate the model forward, leading to still accurate predictions.
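One simple way to draw such an imputed value is inverse-CDF sampling restricted to the part of the distribution above the censoring time. The sketch below assumes, purely for illustration, that the predictive distribution for a censored worker is Gaussian with known mean and standard deviation; the function name and the numerical values are hypothetical.

```python
import random
from statistics import NormalDist

def sample_left_truncated_normal(mu, sigma, lower, rng):
    """Draw from N(mu, sigma^2) conditioned on x > lower via the inverse CDF."""
    dist = NormalDist(mu, sigma)
    p_lower = dist.cdf(lower)              # probability mass below the censor
    u = rng.uniform(p_lower, 1.0)          # uniform over the remaining mass
    u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep inv_cdf strictly inside (0, 1)
    return max(dist.inv_cdf(u), lower)

rng = random.Random(0)
cutoff_time = 1.4  # observed run-time at which stragglers were censored
imputed = [sample_left_truncated_normal(1.0, 0.3, cutoff_time, rng) for _ in range(5)]
```

Every imputed value is at least the censoring time, so the completed run-time vector stays consistent with what was actually observed before it is passed to the inference network.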
4.3 Wall Clock Speedup
We report results for the simple MNIST example run on the 160-core computer cluster. All distributed cutoff SGD experiments were run by sampling each mini-batch with replacement. In some other distributed SGD implementations, a subset of the data is pre-partitioned onto each worker to save networking cost. However, here we cannot do that because if some workers remain inherently slow then their gradients will always be dropped as a result of maximizing throughput. All experiments use a single parameter server, which did not present a bottleneck during testing. We implement the popular asynchronous SGD algorithm, Hogwild, in order to compare against the convergence of an asynchronous method, which adds noise to the updates but may train faster in wall-clock time. Hogwild’s algorithm has the parameter server communicating with the workers at each update, while synchronous SGD allows for only small communication bandwidth to report a rate once finished. When the parameter server is able to set the cutoff, it broadcasts the list of participants to its workers as a bit array; workers who do not finish then zero their gradients, and the full array performs its update locally after sharing information in a ring all-reduce. This method also lowers the communication requirements, an optimization that is unachievable in most asynchronous implementations.
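The masked update described above can be illustrated with a single-process sketch. The collective all-reduce is simulated here by a plain loop, and dividing the summed gradient by the number of participants is an assumption made for this illustration rather than a detail stated in the text; the function name and the toy gradients are hypothetical.

```python
def masked_allreduce_mean(worker_grads, participants):
    """Average gradients over participating workers; others contribute zeros."""
    dim = len(worker_grads[0])
    total = [0.0] * dim
    for grad, keep in zip(worker_grads, participants):
        # Workers outside the cutoff zero their gradient before the reduction.
        contribution = grad if keep else [0.0] * dim
        total = [t + g for t, g in zip(total, contribution)]
    count = sum(participants)
    if count == 0:
        raise ValueError("cutoff admitted no workers")
    return [t / count for t in total]

grads = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]  # bit array broadcast by the parameter server
update = masked_allreduce_mean(grads, mask)
```

Since every worker ends up with the same reduced gradient, each can apply the update locally, and only the bit array needs to travel from the parameter server.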
Figure 4 shows that our method achieves the fastest convergence to the lowest loss among comparison methods performing synchronous SGD. Hogwild outperforms our approach in wall-clock time, but its convergence is to a higher validation loss.
We have presented an improved, faster way to do synchronous distributed gradient descent. Our primary contributions include describing how a model of worker runtimes can be used to predict order statistics that allow for a near optimal choice of straggler cutoff that maximizes gradient computation throughput.
While the focus throughout has been on vanilla SGD, it should be clear that our method and algorithm can be nearly trivially extended to most optimizers of choice so long as they are stochastic in their operation on the training set. Most methods for learning deep neural network models today fit this description, including for instance the Adam optimizer (Kingma & Ba, 2014).
We conclude with a note that our method implicitly assumes that every minibatch is of the same computational cost in expectation, which may not always be the case. Future work could be to extend the inference network further (Rezende & Mohamed, 2015) or to investigate variable length input in distributed training as in (Ergen & Kozat, 2017).
- Andrieu & Doucet (2002) Andrieu, Christophe and Doucet, Arnaud. Particle filtering for partially observed gaussian state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):827–836, 2002.
- Balles et al. (2016) Balles, Lukas, Romero, Javier, and Hennig, Philipp. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.
- Chen et al. (2016) Chen, Jianmin, Pan, Xinghao, Monga, Rajat, Bengio, Samy, and Jozefowicz, Rafal. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.
- Cho et al. (2017) Cho, Minsik, Finkler, Ulrich, Kumar, Sameer, Kung, David, Saxena, Vaibhav, and Sreedhar, Dheeraj. Powerai ddl. arXiv preprint arXiv:1708.02188, 2017.
- Codreanu et al. (2017) Codreanu, Valeriu, Podareanu, Damian, and Saletore, Vikram. Scale out for large minibatch sgd: Residual network training on imagenet-1k with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291, 2017.
- De Sa et al. (2015) De Sa, Christopher M, Zhang, Ce, Olukotun, Kunle, and Ré, Christopher. Taming the wild: A unified analysis of hogwild-style algorithms. In Advances in neural information processing systems, pp. 2674–2682, 2015.
- Dean et al. (2012) Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
- Dewar et al. (2012) Dewar, Michael, Wiggins, Chris, and Wood, Frank. Inference in hidden markov models with explicit state duration distributions. IEEE Signal Processing Letters, 19(4):235–238, 2012.
- Doucet et al. (2001) Doucet, Arnaud, De Freitas, Nando, and Gordon, Neil. An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Springer, 2001.
- Ergen & Kozat (2017) Ergen, Tolga and Kozat, Suleyman S. Online training of lstm networks in distributed systems for variable length data sequences. IEEE Transactions on Neural Networks and Learning Systems, 2017.
- Goyal et al. (2017) Goyal, Priya, Dollár, Piotr, Girshick, Ross, Noordhuis, Pieter, Wesolowski, Lukasz, Kyrola, Aapo, Tulloch, Andrew, Jia, Yangqing, and He, Kaiming. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Hoffer et al. (2017) Hoffer, Elad, Hubara, Itay, and Soudry, Daniel. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1729–1739, 2017.
- Hoffman et al. (2013) Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp. 315–323, 2013.
- Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Krishnan et al. (2017) Krishnan, Rahul G, Shalit, Uri, and Sontag, David. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.
- Le et al. (2017) Le, T.A., Baydin, A.G., and Wood, F. Inference compilation and universal probabilistic programming. In 20th International Conference on Artificial Intelligence and Statistics, 2017.
- McMahan & Streeter (2014) McMahan, Brendan and Streeter, Matthew. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems, pp. 2915–2923, 2014.
- Recht et al. (2011) Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.
- Reddi et al. (2015) Reddi, Sashank J, Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, and Smola, Alexander J. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pp. 2647–2655, 2015.
- Rezende & Mohamed (2015) Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Ritchie et al. (2016) Ritchie, Daniel, Horsfall, Paul, and Goodman, Noah D. Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735, 2016.
- Royston (1982) Royston, JP. Algorithm as 177: Expected normal order statistics (exact and approximate). Journal of the royal statistical society. Series C (Applied statistics), 31(2):161–165, 1982.
- You et al. (2017a) You, Yang, Gitman, Igor, and Ginsburg, Boris. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.
- You et al. (2017b) You, Yang, Zhang, Zhao, Hsieh, C, Demmel, James, and Keutzer, Kurt. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.
- Zhang et al. (2015) Zhang, Sixin, Choromanska, Anna E, and LeCun, Yann. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.