We study the stochastic convex optimization problem
\[ \min_{x \in \mathcal{X}} F(x) := \mathbb{E}_{\xi}[f(x; \xi)], \tag{1.1} \]
where the constraint set $\mathcal{X} \subset \mathbb{R}^d$ is a compact convex set, and $f(\cdot; \xi)$ is a convex loss for each $\xi \sim \mathbb{P}$, where $\mathbb{P}$ is an unknown probability distribution from which we can draw i.i.d. samples. Problem (1.1) is broadly important in both optimization and machine learning [18, 17, 16, 12, 6]. It should be distinguished from (and is harder than) the finite-sum optimization problems [15, 3], for which sharper results on the empirical loss are possible but not on the generalization error.
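A classic approach to solving (1.1) is stochastic gradient descent (SGD) (also called stochastic approximation). At each iteration SGD performs the update $x_{t+1} \leftarrow \Pi_{\mathcal{X}}(x_t - \alpha_t g_t)$, where $\Pi_{\mathcal{X}}$ denotes orthogonal projection onto $\mathcal{X}$, scalar $\alpha_t \ge 0$ is a suitable stepsize, and $g_t$ is an unbiased stochastic gradient such that $\mathbb{E}[g_t] \in \partial F(x_t)$.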
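The projected SGD update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constraint set is assumed to be a Euclidean ball, and the names `project_onto_ball`, `projected_sgd`, and `noisy_quadratic_grad`, the toy quadratic objective, and the $1/\sqrt{t}$ stepsize are all hypothetical choices made for the example.

```python
import numpy as np

def project_onto_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x
    return x * (radius / norm)

def projected_sgd(grad_fn, x0, stepsize_fn, num_steps, radius, rng):
    """Run x <- Proj(x - alpha_t * g_t) and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, num_steps + 1):
        g = grad_fn(x, rng)  # unbiased stochastic gradient at x
        x = project_onto_ball(x - stepsize_fn(t) * g, radius)
    return x

TARGET = np.array([0.5, -0.25])  # minimizer of the toy objective (assumed)

def noisy_quadratic_grad(x, rng):
    # Gradient of 0.5 * ||x - TARGET||^2 plus zero-mean Gaussian noise.
    return (x - TARGET) + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_final = projected_sgd(
    noisy_quadratic_grad, np.zeros(2),
    stepsize_fn=lambda t: 1.0 / np.sqrt(t),
    num_steps=2000, radius=1.0, rng=rng,
)
```

Every iterate stays inside the ball because the projection is applied after each step; the sequential loop is exactly what makes plain SGD hard to distribute.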
Although much more scalable than gradient descent, SGD is still a sequential method that cannot be immediately used for truly large-scale problems requiring distributed optimization. Indeed, distributed optimization is a central focus of real-world machine learning, and has attracted significant recent research interest, a large part of which is dedicated to scaling up SGD [1, 13, 5, 9, 7].
Our work is motivated by the need to more precisely model and exploit the delay properties of real-world cloud providers; indeed, the behavior of machines and delays in such settings are typically quite different from what one may observe on small clusters owned by individuals or small groups. In particular, cloud resources are shared by many users who run variegated tasks on them. Consequently, such an environment will invariably be more diverse in terms of availability of key resources such as CPU, disk, or network bandwidth, as compared to an environment where resources are shared by a small number of individuals. Thus, being able to accommodate for variable delays is of great value to both providers and users of large-scale cloud services.
In light of this background, we investigate delay-sensitive asynchronous SGD, focusing on adapting to the actual delays experienced rather than relying on global worst-case ‘bounded delay’ arguments that can be too pessimistic. A potential practical approach is as follows: in the beginning, the server updates parameters whenever it receives a gradient from any machine, with a weight inversely proportional to the actual delay observed. Towards the end, the server may take larger update steps whenever it gets a gradient from a machine that sends parameters infrequently, and smaller ones if it gets parameters from a machine that updates very frequently, to reduce the bias caused by the initial aggressive steps.
The key contributions of this paper are underscored by our practical motivation. In particular, we design, analyze, and investigate AdaDelay (Adaptive Delay), an asynchronous SGD algorithm that more closely follows the actual delays experienced during computation. Therefore, instead of penalizing parameter updates by using worst-case bounds on delays, AdaDelay uses step sizes that depend on the actual delays observed. While this allows the use of larger stepsizes, it requires a slightly more intricate analysis because (i) step sizes are no longer guaranteed to be monotonically decreasing; and (ii) residuals that measure progress are not independent across time as they are coupled by the delay random variable.
We validate our theoretical framework by experimenting with large-scale machine learning datasets containing over a billion points and features. The experiments reveal that our assumptions of network delay are a reasonable approximation to the actual observed delays, and that in the regime of large delays (e.g., when there are stragglers), using delay sensitive steps is very helpful toward obtaining models that more quickly converge on test accuracy; this is revealed by experiments where using AdaDelay leads to significant improvements on the test error (AUC).
A useful summary of aspects of stochastic optimization in machine learning is ; more broadly, [12, 17] are excellent references. Our focus is on distributed stochastic optimization in the asynchronous setting. The classic work  is an important reference; more recent works closest to ours are [1, 16, 7, 11]. Of particular relevance to our paper is the recent work on delay-adaptive gradient scaling in an AdaGrad-like framework . The work  claims substantial improvements under specialized settings over , a work that exploits data sparsity in a distributed asynchronous setting. Our experiments confirm ’s claims that their best learning rate is insensitive to maximum delays. However, in our experience the method of  overly smooths the optimization path, which can have adverse effects on real-world data (see Section 4).
To our knowledge, all previous works on asynchronous SGD (and its AdaGrad variants) assume monotonically diminishing step-sizes. Our analysis, although simple, shows that rather than using worst case delay bounds, using exact delays to control step sizes can be remarkably beneficial in realistic settings: for instance, when there are stragglers that can slow down progress for all the machines in a worst-case delay model.
Algorithmically, the work  is the one most related to ours; the authors of  consider using delay information to adjust the step size. However, the most important difference is that they only use the worst possible delays which might be too conservative, while AdaDelay leverages the actual delays experienced.  investigates two variants of update schemes, both of which occur with delay. But they do not exploit the actual delays either. There are some other interesting works studying specific scenarios, for example, , which focuses on the sparse data. However, our framework is more general and thus capable of covering more applications.
2 Problem Setup and Algorithm
We build on the groundwork laid by [1, 11]; like them, we also consider optimizing (1.1) under a delayed gradient model. The computational framework that we use is the parameter server , so that a central server (this server is virtual; its physical realization may involve several machines, e.g., ) maintains the global parameter, and the worker nodes compute stochastic gradients using their share of the data. The workers communicate their gradients back to the central server (in practice using various crucial communication-saving techniques implemented in the work of ), which updates the shared parameter and communicates it back.
To highlight our key ideas and avoid getting bogged down in unilluminating details, we consider only smooth stochastic optimization, i.e., in this paper. Straightforward, albeit laborious, extensions are possible to nonsmooth problems, strongly convex costs, mirror descent versions, and proximal splitting versions. Such details are relegated to a longer version of this paper.
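Assumption 2.1 (Lipschitz gradients).
The function $f(x; \xi)$ has locally $L$-Lipschitz gradients. That is,
\[ \|\nabla f(x; \xi) - \nabla f(y; \xi)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathcal{X}. \]
Assumption 2.2 (Bounded variance).
There exists a constant $\sigma < \infty$ such that
\[ \mathbb{E}_{\xi}\big[\|\nabla F(x) - \nabla f(x; \xi)\|^2\big] \le \sigma^2 \quad \text{for all } x \in \mathcal{X}. \]
Assumption 2.3 (Compact domain).
Let $x^* = \operatorname{argmin}_{x \in \mathcal{X}} F(x)$. Then,
\[ \max_{x \in \mathcal{X}} \|x - x^*\| \le R. \]
Finally, an additional assumption, also made in , is that of bounded gradients.
Assumption 2.4 (Bounded Gradient).
Let $x \in \mathcal{X}$. Then,
\[ \|\nabla f(x; \xi)\| \le G. \]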
These assumptions are typically reasonable for machine learning problems, for instance, logistic-regression losses and least-squares costs, as long as the data samples remain bounded, which is typically easy to satisfy. Exploring relaxed versions of these assumptions would also be interesting.
We denote a random delay at time-point $t$ by $\tau_t$; step sizes are denoted $\alpha(t, \tau_t)$, and delayed gradients as $g(t - \tau_t)$. For a differentiable convex function $h$, the corresponding Bregman divergence is $D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle$. For simplicity, all norms are assumed to be Euclidean. We also interchangeably use $x_t$ and $x(t)$ to refer to the same quantity.
2.1 Delay model
Assumption 2.5 (Delay).
We consider the following two practical delay models:
(A) Uniform: Here . This model is a reasonable approximation to observed delays after an initial startup time of the network. We could make a more refined assumption that for iterations , the delays are uniform on , and the analysis easily extends to handle this case; we omit it for ease of presentation. Furthermore, the analysis also extends to delays having distributions with bounded support. Therefore, it indeed captures a wide spectrum of practical models.
(B) Scaled: For each , there is a such that . Moreover, assume that are constants that do not grow with (the subscript only indicates that each is a random variable that may have a different distribution). This model allows delay processes that are richer than uniform as long as they have bounded first and second moments.
Remark: Our analysis seems general enough to cover many other delay distributions by combining our two delay models. For example, the Gaussian model (where obeys a Gaussian distribution but its support must be truncated as ) may be seen as a combination of the following: 1) When (a suitable constant), the Gaussian assumption indicates , which falls under our second delay model; 2) When , our proof technique with bounded support (same as the uniform model) applies. Of course, we believe a more refined analysis for specific delays may help tighten constants.
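To make the two delay models concrete, here is a small sampling sketch. The exact distributions are not preserved in the text above, so the specifics are assumptions: the uniform model is taken to draw integer delays from {0, ..., 2 * tau_bar}, and the scaled model is taken to set the delay to floor(theta_t * t) with theta_t drawn from a Beta distribution (which has bounded first and second moments). The function names and default parameters are hypothetical.

```python
import numpy as np

def sample_uniform_delay(rng, tau_bar):
    """Integer delay drawn uniformly from {0, 1, ..., 2 * tau_bar} (assumed range)."""
    return int(rng.integers(0, 2 * tau_bar + 1))

def sample_scaled_delay(rng, t, a=2.0, b=8.0):
    """Delay floor(theta_t * t) with theta_t ~ Beta(a, b), so theta_t lies in (0, 1).

    The Beta distribution is an illustrative choice; its first and second
    moments are constants that do not grow with t.
    """
    theta = rng.beta(a, b)
    return int(np.floor(theta * t))

rng = np.random.default_rng(1)
uniform_delays = [sample_uniform_delay(rng, tau_bar=5) for _ in range(1000)]
scaled_delays = [sample_scaled_delay(rng, t) for t in range(1, 200)]
```

Under the scaled model the delay may grow with the iteration count, which is what distinguishes it from any bounded-support model.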
Under the above delay model, we consider the following projected stochastic gradient iteration:
where the stepsize is sensitive to the actual delay observed. Iteration (2.1) generates a sequence ; the server also maintains the averaged iterate
We use stepsizes of the form , where the step offsets are chosen to be sensitive to the actual delay of the incoming gradients. We typically use
for some constant (to be chosen later). Actually, we can also consider time-varying multipliers in (3.1) (see Corollary 3.4), but initially for clarity of presentation we let be independent of . Thus, if there are no delays, then and iteration (2.1) reduces to the usual synchronous SGD. The constant
is used to trade off contributions in the error bound from the noise variance, the radius bound , and potentially bounds on gradient norms.
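The following sketch contrasts a delay-adaptive stepsize with a worst-case one. The stepsize formula itself is not preserved in the text above, so the form used here, 1 / (L + c * sqrt(t + tau)), is an assumption chosen to match the description: an offset that grows with the observed delay and that reduces to a standard synchronous schedule when the delay is zero. The function names and the toy delay sequence are hypothetical.

```python
import math

def adadelay_stepsize(t, tau, lipschitz, c):
    """Assumed form alpha(t, tau) = 1 / (L + c * sqrt(t + tau))."""
    return 1.0 / (lipschitz + c * math.sqrt(t + tau))

def worst_case_stepsize(t, tau_max, lipschitz, c):
    """Same schedule, but always charged the maximum possible delay."""
    return adadelay_stepsize(t, tau_max, lipschitz, c)

# Toy sequence of observed delays, one per iteration t = 1, 2, ...
delays = [0, 3, 1, 40, 2]
adaptive = [
    adadelay_stepsize(t, tau, lipschitz=1.0, c=1.0)
    for t, tau in enumerate(delays, start=1)
]
pessimistic = [
    worst_case_stepsize(t, max(delays), lipschitz=1.0, c=1.0)
    for t in range(1, len(delays) + 1)
]
```

In this toy run the adaptive stepsizes are never smaller than the worst-case ones, and they are not monotonically decreasing: the step after the straggler (delay 40) is larger than the straggler's own step. That non-monotonicity is the property the analysis has to handle.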
Our convergence analysis builds heavily on . The key difference is that our step sizes depend on the actual delay experienced, rather than on a fixed worst-case bound on the maximum possible network delay. These delay-dependent step sizes necessitate a slightly more intricate analysis. The primary complexity arises from being no longer independent of the actual delay . This in turn raises another difficulty, namely that are no longer monotonically decreasing, as is typically assumed in most convergence analyses of SGD. We highlight our theoretical result below; due to space limitations, all the auxiliary lemmas are moved to the appendix.
The proof begins by analyzing the difference ; Lemma A.2 (provided in the supplement) bounds this difference, ultimately leading to an inequality of the form:
The random variables , , and are in turn given by
Once we bound these in expectation, we obtain the result claimed in the theorem. In particular, Lemma A.3 bounds (3.2) under Assumption 2.5(A), while Lemma A.4 provides a bound under Assumption 2.5(B). Similarly, Lemmas A.5 and A.6 bound (3.3), while Lemma A.7 bounds (3.4). Combining these bounds we obtain the theorem. ∎
Theorem 3.1 has several implications. Corollaries 3.2 and 3.3 indicate that both our delay models share a similar convergence rate, while Corollary 3.4 shows that such results still hold even if we replace the constant with a bounded (away from zero, and from above) sequence (a setting of great practical importance). Finally, Corollary 3.5 mentions in passing a simple variant that considers for . It also highlights the known fact that for , the algorithm achieves the best theoretical convergence.
Let satisfy Assumption 2.5 (A). Then we have
Let satisfy Assumption 2.5 (B); let , , and for all . Then,
If we wish to use step size offsets where , we get a result of the form (we report only the asymptotically worst term, as this result is of limited importance).
Let with and . Then, there exists a constant such that
We now evaluate the efficiency of AdaDelay in a distributed environment using real datasets.
We collected two click-through rate datasets for evaluation, which are shown in Table 1. One is the Criteo dataset (http://labs.criteo.com/downloads/download-terabyte-click-logs/), where the first 8 days are used for training while the following 2 days are used for validation. We applied one-hot encoding to categorical and string features. The other dataset, named CTR2, was collected from a large Internet company. We sampled 100 million examples from three weeks for training, and 20 million examples from the next week for validation. We extracted 2 billion unique features using the production feature extraction module. These two datasets have comparable size, but different example-feature ratios. We adopt logistic regression as our classification model.
| dataset | training examples | test examples | unique features | non-zero entries |
|---|---|---|---|---|
| Criteo | 1.5 billion | 400 million | 360 million | 58 billion |
| CTR2 | 110 million | 20 million | 1.9 billion | 13 billion |
All experiments were carried out on a cluster with 20 machines. Most machines are equipped with dual Intel Xeon 2.40 GHz CPUs, 32 GB memory, and 1 Gbit/s Ethernet.
We compare AdaDelay with two related methods AsyncAdaGrad  and AdaptiveRevision . Their main difference lies in the choice of the learning rate at time : . Denote by the -th element of , and similarly the delayed gradient on feature . AsyncAdaGrad adopts a scaled learning rate . AdaptiveRevision takes into account actual delays by considering . It uses a non-decreasing learning rate based on . Similar to AsyncAdaGrad and AdaptiveRevision, we use a scaled learning rate in AdaDelay to better model the nonuniform sparsity of the dataset (this step size choice falls within the purview of Corollary 3.4). In other words, we set , where averages the weighted delayed gradients on feature . We follow the common practice of fixing to 1 while choosing the best by a grid search over .
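The per-feature scaled learning rate idea can be illustrated with a generic AdaGrad-style accumulator. The exact formulas of AsyncAdaGrad, AdaptiveRevision, and AdaDelay are not preserved in the text above, so this sketch implements only the standard AdaGrad rule (base rate divided by the root of the accumulated squared gradients of each feature) applied to sparse, possibly delayed gradients. The class name, the `eps` guard, and the requirement that feature ids within one push are unique are assumptions of the example.

```python
import numpy as np

class PerFeatureAdaGrad:
    """Generic AdaGrad-style server state: one weight and one accumulator per feature."""

    def __init__(self, num_features, base_rate=1.0, eps=1e-8):
        self.weights = np.zeros(num_features)
        self.sum_sq = np.zeros(num_features)  # accumulated squared gradients
        self.base_rate = base_rate
        self.eps = eps  # avoids division by zero for untouched features

    def push(self, feature_ids, grads):
        """Apply a sparse gradient; feature_ids are assumed unique within a push."""
        feature_ids = np.asarray(feature_ids)
        grads = np.asarray(grads, dtype=float)
        self.sum_sq[feature_ids] += grads ** 2
        rates = self.base_rate / (np.sqrt(self.sum_sq[feature_ids]) + self.eps)
        self.weights[feature_ids] -= rates * grads
        return rates

server = PerFeatureAdaGrad(num_features=4)
first_rates = server.push([0, 2], [1.0, -2.0])
weights_after_first = server.weights.copy()
second_rates = server.push([0], [1.0])
```

Rarely seen features keep a large rate while frequently updated ones shrink, which is why a scaled rate suits nonuniformly sparse data.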
We implemented these three methods in the parameter server framework , which is a high-performance asynchronous communication library supporting various data consistency models. There are two groups of nodes in this framework: workers and servers. Worker nodes run independently from each other. At each time, a worker first reads a minibatch of data from a distributed filesystem, and then pulls the relevant recent working set of parameters, namely the weights of the features that appear in this minibatch, from the server nodes. It next computes the gradients and then pushes these gradients to the server nodes.
The server nodes maintain the weights. For each feature, both AsyncAdaGrad and AdaDelay store the weight and the accumulated gradient, which is used to compute the scaled learning rate, while AdaptiveRevision needs two more entries for each feature.
To compute the actual delay for AdaDelay, we let the server nodes record the time when worker is pulling the weight for minibatch . Denote by the time when the server nodes are updating the weight by using the gradients of this minibatch. Then the actual delay of this minibatch can be obtained by .
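The bookkeeping described above can be sketched as a small server-side tracker. This is an illustration under assumptions, not the parameter server's actual API: time is measured as the number of updates the server has applied (rather than wall-clock time), and the class and method names are hypothetical.

```python
class DelayTracker:
    """Server-side bookkeeping of per-minibatch delays, counted in server updates."""

    def __init__(self):
        self.clock = 0        # number of updates applied so far
        self.pull_time = {}   # (worker_id, batch_id) -> clock value at pull

    def on_pull(self, worker_id, batch_id):
        """Record when a worker pulls the weights for a minibatch."""
        self.pull_time[(worker_id, batch_id)] = self.clock

    def on_push(self, worker_id, batch_id):
        """Return the delay of this minibatch's gradient, then advance the clock."""
        delay = self.clock - self.pull_time.pop((worker_id, batch_id))
        self.clock += 1
        return delay

tracker = DelayTracker()
tracker.on_pull(worker_id=0, batch_id=0)
tracker.on_pull(worker_id=1, batch_id=0)
d0 = tracker.on_push(worker_id=0, batch_id=0)  # no update happened in between
tracker.on_pull(worker_id=0, batch_id=1)
d1 = tracker.on_push(worker_id=1, batch_id=0)  # one update since its pull
d2 = tracker.on_push(worker_id=0, batch_id=1)  # one update since its pull
```

The extra state is one integer per in-flight minibatch, which is consistent with the small memory overhead reported for AdaDelay.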
AdaptiveRevision needs gradient components for each feature to calculate its learning rate. If we send over the network by following , we increase the total network communication by , which harms the system performance due to the limited network bandwidth. Instead, we store at the server node during while processing this minibatch. This incurs no extra network overhead, however, it increases the memory consumption of the server nodes.
The parameter server implements a node using an operating system process, which has its own communication and computation threads. In order to run thousands of workers on our limited hardware, we may combine several workers into a single process to reduce the system overhead.
We first visualize the actual delays observed at server nodes. As noted from Figure 1, delay is around at the early stage of the training, while the constant varies for different tasks. For example, it is close to when training the Criteo dataset with 1,600 workers, while it increases to for the CTR2 dataset with 400 workers. After the delay hits the value , which is often half of the number of workers, it behaves as a Gaussian distribution with mean , as shown in the bottom of Figure 1.
Next, we present the comparison results of these three algorithms by varying the number of workers. We use the AUC on the validation dataset as the criterion (we observed similar results using LogLoss); a 1% difference is often significant for click-through rate estimation. We set the minibatch size to and for Criteo and CTR2, respectively, to reduce the communication frequency for better system performance (probably due to the scale and the sparsity of the datasets, we observed no significant improvement in the test AUC when decreasing the minibatch size, even to 1). We search in the range and report the best results for each algorithm in Figure 2.
As can be seen, AdaptiveRevision only outperforms AsyncAdaGrad on the Criteo dataset with a large number of workers. The reason this differs from  is probably that the datasets we used are 1000 times larger than the ones reported by , and that we evaluated the algorithms in a distributed environment rather than a simulated setting; a large minibatch size is necessary in the former. However, as reported in , we also observed that AdaptiveRevision’s best learning rate is insensitive to the number of workers.
On the other hand, AdaDelay improves on AsyncAdaGrad when a large number of workers (greater than ) is used, which means the delay-adaptive learning rate takes effect when the delay can be large.
To further investigate this phenomenon, we simulated an overloaded cluster where several stragglers may produce large delays; we do this by slowing down half of the workers by a random factor in when computing gradients. The results are shown in Figure 3. As can be seen, AdaDelay consistently outperforms AsyncAdaGrad, which shows that adaptive modeling of the actual delay is better than using a constant worst case delay when the variance of the delays is large.
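The straggler experiment can be mimicked with a small event-driven simulation. This is a sketch under stated assumptions rather than the authors' setup: the range of the slowdown factor is not preserved in the text, so a hypothetical range [1, max_slowdown] is used; base compute times are drawn from [0.9, 1.1]; delays are counted in server updates; and the function name is illustrative.

```python
import heapq
import numpy as np

def simulate_delays(num_workers, num_updates, max_slowdown, rng):
    """Return the delay (in server updates) of each gradient applied by the server."""
    slow = np.arange(num_workers) < num_workers // 2  # first half are stragglers

    def compute_time(worker):
        factor = rng.uniform(1.0, max_slowdown) if slow[worker] else 1.0
        return factor * rng.uniform(0.9, 1.1)

    # Heap entries: (finish_time, worker_id, server_clock_at_pull).
    events = [(compute_time(w), w, 0) for w in range(num_workers)]
    heapq.heapify(events)
    delays = []
    for clock in range(num_updates):
        finish, worker, pulled_at = heapq.heappop(events)
        delays.append(clock - pulled_at)
        # The worker pulls the freshly updated weights and starts the next minibatch.
        heapq.heappush(events, (finish + compute_time(worker), worker, clock + 1))
    return np.array(delays)

rng = np.random.default_rng(0)
balanced = simulate_delays(num_workers=8, num_updates=2000, max_slowdown=1.0, rng=rng)
overloaded = simulate_delays(num_workers=8, num_updates=2000, max_slowdown=10.0, rng=rng)
```

With stragglers, a slow worker's gradient arrives after the fast workers have applied many updates, so the largest observed delay is far above the balanced case. A single worst-case bound would charge that delay to every update, whereas a delay-adaptive rule charges it only to the late gradients.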
Finally, we report the system performance. We first present the speedup from 1 machine to 16 machines, where each machine runs 100 workers. We observed a near-linear speedup of AdaDelay, which is shown in Figure 4. The main reason is the asynchronous updating, which removes the dependencies between worker nodes. In addition, using multiple workers within a machine can fully utilize the computational resources by hiding the overhead of reading data and communicating the parameters.
In the parameter server framework, worker nodes only need to cache one or a few data minibatches. Most memory is used by the server nodes to store the model. We summarize the server memory usage for the three algorithms compared in Table 2.
As expected, AdaDelay and AsyncAdaGrad have similar memory consumption because the extra storage needed by AdaDelay to track and compute the incurred delays is tiny. However, AdaptiveRevision doubles memory usage, because of the extra entries that it needs for each feature, and because of the cached delayed gradient .
| dataset | AsyncAdaGrad | AdaDelay | AdaptiveRevision |
|---|---|---|---|
| Criteo | 24 GB | 24 GB | 55 GB |
| CTR2 | 97 GB | 97 GB | 200 GB |
In a real distributed computing environment, there are multiple factors contributing to delay, such as CPU speed, disk I/O, and network throughput. With the inevitable and sometimes unpredictable phenomenon of delay, we considered distributed convex optimization by developing and analyzing AdaDelay, an asynchronous SGD method that tolerates stale gradients.
A key component of our work that differs from existing approaches is the use of (server-side) updates sensitive to the actual delay observed in the network. This allows us to use larger stepsizes initially, which can lead to more rapid initial convergence, and stronger ability to adapt to the environment. We discussed details of two different realistic delay models: (i) uniform (more generally, bounded support) delays, and (ii) scaled delays with constant first and second moments but not-necessarily bounded support. Under both models, we obtain theoretically optimal convergence rates.
Adapting more closely to observed delays and incorporating server-side delay sensitive gradient aggregation that combines the benefits of the adaptive revision framework  with our delayed gradient methods is an important future direction. Extension of our analysis to handle constrained convex optimization problems (without requiring a projection oracle onto the constraint set) is also an important part of future work.
- [1] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
- [2] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
- [3] Dimitri P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
- [4] John Duchi, Michael I. Jordan, and Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In NIPS 26, pages 2832–2840, 2013.
- [5] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
- [6] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
- [7] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Neural Information Processing Systems, 2009. URL http://arxiv.org/abs/0911.0491.
- [8] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Operating Systems Design and Implementation (OSDI), 2014.
- [9] Mu Li, David G. Andersen, Alex J. Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In NIPS 27, pages 19–27, 2014.
- [10] Brendan McMahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. In NIPS 27, pages 2915–2923, 2014.
- [11] A. Nedić, Dimitri P. Bertsekas, and Vivek S. Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8:381–407, 2001.
- [12] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [13] S. Sundhar Ram, A. Nedić, and Venugopal V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.
- [14] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- [15] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. CoRR, abs/1309.2388, 2013.
- [16] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, 2014.
- [17] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, 2014.
- [18] N. Srebro and A. Tewari. Stochastic Optimization for Machine Learning. ICML 2010 Tutorial, 2010.
Appendix A Technical details of the convergence analysis
We collect below some basic tools and definitions from convex analysis.
Definition A.1 (Bregman divergence).
Let $h$ be a differentiable strictly convex function. The Bregman divergence generated by $h$ is
\[ D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle. \]
Four-point identity: Bregman divergences satisfy the following four point identity:
A special case of (A.8) is the “three-point” identity
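The four-point and three-point identities can be checked numerically. The exact arrangement of equations (A.8) and (A.9) is not preserved above, so the sketch uses one standard form, $\langle \nabla h(a) - \nabla h(b), c - d \rangle = D_h(d, a) - D_h(d, b) - D_h(c, a) + D_h(c, b)$, with the three-point case obtained by setting $d = b$. The generator $h$ (negative entropy on the positive orthant) and the symbol names are assumptions of the example.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

def h(x):
    # Example generator: negative entropy, strictly convex for x > 0.
    return np.sum(x * np.log(x))

def grad_h(x):
    return np.log(x) + 1.0

rng = np.random.default_rng(0)
a, b, c, d = rng.uniform(0.1, 2.0, size=(4, 5))  # four random positive points

# Four-point identity (one standard arrangement).
lhs4 = (grad_h(a) - grad_h(b)) @ (c - d)
rhs4 = (bregman(h, grad_h, d, a) - bregman(h, grad_h, d, b)
        - bregman(h, grad_h, c, a) + bregman(h, grad_h, c, b))

# Three-point special case: set d = b, so that D_h(b, b) = 0.
lhs3 = (grad_h(a) - grad_h(b)) @ (c - b)
rhs3 = (bregman(h, grad_h, b, a) - bregman(h, grad_h, c, a)
        + bregman(h, grad_h, c, b))
```

Both sides agree up to floating-point error for any differentiable generator, since all function values cancel and only the inner-product terms remain.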
A.1 Bounding the change
We start the analysis by bounding the gap . The lemma below is just a combination of several results of . We present the details below in one place for easy reference. The impact of our delay sensitive step sizes shows up in subsequent lemmas, where we bound the individual terms that arise from Lemma A.2.
At any time-point , let the gradient error due to delays be
Then, we have the following (deterministic) bound:
Using convexity of we have
Now apply Lipschitz continuity of to the second term to obtain
To complete the proof, we bound the terms and separately below.
Choose ; then rewrite the above inequality and identity (A.9) with to get
Plugging in this bound for and collecting the terms we obtain
The next step is to take expectations over (A.11) and then further bound the resulting terms separately. Note that is independent of given (since is a function of gradients up to time ). Thus, the third term in (A.11) has zero expectation. It remains to consider expectations over the following three quantities:
Lemma A.3 bounds (A.15) under Assumption 2.5(A), while Lemma A.4 provides a bound under Assumption 2.5(B). Similarly, Lemmas A.5 and A.6 bound (A.16), while Lemma A.7 bounds (A.17). Combining these bounds we obtain the theorem.
A.2 Bounding , , and
Unlike the delay independent step sizes treated in , bounding requires some more work because depends on , which in turn breaks the monotonically decreasing nature of (we wish to avoid using a fixed worst case bound on the steps, to gain more precise insight into the impacts of being sensitive to delays), necessitating a more intricate analysis.
Let . Observe that although , it is not independent of . Thus, with
Since is not monotonically decreasing with , while upper-bounding we cannot simply discard the final term in (A.18).
When , uniformly takes on at most values
where . Given a delay , is just . Using , we have
Using nested expectations we then see that
where we dropped the terms with as they are non-positive.
Consider now the inner summation above. We have
Thus, we now consider
Summing over to , we finally obtain the upper bound
Let Assumption 2.5(B) hold. Then