We consider a fundamental problem in machine learning, stochastic convex optimization:
$$\min_{w \in W} F(w) := \mathbb{E}_{f \sim \mathcal{D}}[f(w)] \qquad (1)$$
Here, $W$ is a convex subset of $\mathbb{R}^d$ and $\mathcal{D}$ is a distribution over $L$-smooth convex functions $f: W \to \mathbb{R}$. We do not have direct access to $F$, and the distribution $\mathcal{D}$ is unknown, but we do have the ability to generate i.i.d. samples $f \sim \mathcal{D}$ through some kind of stream or oracle. In practice, each function $f$ corresponds to a new datapoint in some learning problem. Algorithms for this problem are widely applicable: for example, in logistic regression the goal is to optimize $F(w) = \mathbb{E}[\log(1 + \exp(-y\, x \cdot w))]$ when the $(x, y)$ pairs are the (feature vector, label) pairs coming from a fixed data distribution. Given a budget of $N$ oracle calls (e.g. a dataset of size $N$), we wish to find a $\hat{w} \in W$ such that $F(\hat{w}) - F(w^\star)$ (called the suboptimality) is as small as possible as fast as possible using as little memory as possible, where $w^\star \in \arg\min_{w \in W} F(w)$.
The most popular algorithm for solving (1) is Stochastic Gradient Descent (SGD), which achieves statistically optimal $O(1/\sqrt{N})$ suboptimality in $O(N)$ time and constant memory. However, in modern large-scale machine learning problems the number of data points $N$ is often gigantic, and so even the linear time-complexity of SGD becomes onerous. We need a parallel algorithm that runs in only $O(N/m)$ time using $m$ machines. We address this problem in this paper, evaluating solutions on three metrics: time complexity, space complexity, and communication complexity. Time complexity is the total time taken to process the $N$ data points. Space complexity is the amount of space required per machine. Note that in our streaming model, an algorithm that keeps only the most recently seen data point in memory is considered to run in constant memory. Communication complexity is measured in terms of the number of “rounds” of communication in which all the machines synchronize. In measuring these quantities we often suppress all constants other than those depending on $N$ and $m$, as well as all logarithmic factors.
In this paper we achieve the ideal parallelization complexity (up to a logarithmic factor) of $\tilde{O}(N/m)$ time, $O(1)$ space and $\tilde{O}(1)$ rounds of communication, so long as $m < \sqrt{N}$. Further, in contrast to much prior work, our algorithm is a reduction that enables us to generically parallelize any serial online learning algorithm that obtains a sufficiently adaptive convergence guarantee (e.g. (duchi2011adaptive; orabona2014simultaneous; cutkosky2017online)) in a black-box way. This significantly simplifies our analysis by decoupling the learning rates or other internal variables of the serial algorithm from the parallelization procedure. This technique allows our algorithm to adapt to an unknown smoothness parameter $L$ in the problem, resulting in optimal convergence guarantees without requiring tuning of learning rates. This is an important aspect of the algorithm: even prior analyses that meet the same time, space and communication costs (FrostigGKS15; lei2016less; lei2017non) require the user to input the smoothness parameter to tune a learning rate. Incorrect values for this parameter can result in failure to converge, not just slower convergence. In contrast, our algorithm automatically adapts to the true value of $L$ with no tuning. Empirically, we find that the parallelized implementation of a serial algorithm matches the performance of the serial algorithm in terms of sample-complexity, while bestowing significant runtime savings.
2 Prior Work
One popular strategy for parallelized stochastic optimization is minibatch-SGD (DekelGSX12), in which one computes $m$ gradients at a fixed point in parallel and then averages these gradients to produce a single SGD step. When $m$ is not too large compared to the variance in $\mathcal{D}$, this procedure gives a linear speedup in theory and uses constant memory. Unfortunately, minibatch-SGD obtains a communication complexity that scales as $\sqrt{N}$ (or $N^{1/4}$ for accelerated variants). In modern problems, where the number of data points $N$ is extremely large, this overhead is prohibitive. We achieve a communication complexity that is logarithmic in $N$, allowing our algorithm to be run as a near-constant number of map-reduce jobs even for very large $N$. We summarize the state of the art for some prior algorithms in Table 1.
Many prior approaches to reducing communication complexity can be broadly categorized into those that rely on Newton’s method and those that rely on the variance-reduction techniques introduced in the SVRG algorithm (Johnson013). Algorithms that use Newton’s method typically make the assumption that is a distribution over quadratic losses (ShamirS013; ZhangX15a; reddi2016aide; WangWS17), and leverage the fact that the expected Hessian is constant to compute a Newton step in parallel. Although quadratic losses are an excellent starting point, it is not clear how to generalize these approaches to arbitrary non-quadratic smooth losses such as encountered in logistic regression.
Alternative strategies stemming from SVRG work by alternating between a “batch phase” in which one computes a very accurate gradient estimate using a large batch of examples and an “SGD phase” in which one runs SGD, using the batch gradient to reduce the variance in the updates (FrostigGKS15; lei2016less; shah2016trading; harikandeh2015stopwasting). Our approach also follows this overall strategy (see Section 3 for a more detailed discussion of this procedure). However, all prior algorithms in this category make use of carefully specified learning rates in the SGD phase, while our approach makes use of any adaptive serial optimization algorithm, even ones that do not resemble SGD at all, such as (cutkosky2017online; orabona2014simultaneous). This results in a streamlined analysis and a more general final algorithm. Not only do we recover prior results, we can leverage the adaptivity of our base algorithm to obtain better results on sparse losses and to avoid any dependencies on the smoothness parameter $L$, resulting in a much more robust procedure.
The rest of this paper is organized as follows. In Section 3 we provide a high-level overview of our strategy. In Section 4 we introduce some basic facts about the analysis of adaptive algorithms using online learning, in Section 5 we sketch our intuition for combining SVRG and the online learning analysis, and in Section 6 we describe and analyze our algorithm. In Section 7 we show that the convergence rate is statistically optimal and show that a parallelized implementation achieves the stated complexities. Finally in Section 9 we give some experimental results.
|Method|Quadratic Loss|Space|Communication|Adapts to $L$|
|Newton inspired (ShamirS013; ZhangX15a; reddi2016aide; WangWS17)|Needed| | |No|
|accel. minibatch-SGD (CotterSSS11)|Not Needed| | |No|
|prior SVRG-like (FrostigGKS15; lei2016less; shah2016trading; harikandeh2015stopwasting)|Not Needed| | |No|
|This work|Not Needed| | |Yes|
3 Overview of Approach
Our overall strategy for parallelizing a serial SGD algorithm is based upon the stochastic variance-reduced gradient (SVRG) algorithm (Johnson013). SVRG is a technique for improving the sample complexity of SGD given access to a stream of i.i.d. samples (as in our setting), as well as the ability to compute exact gradients in a potentially expensive operation. The basic intuition is to use an exact gradient $\nabla F(v)$ at some “anchor point” $v$ as a kind of “hint” for what the exact gradient is at nearby points $w$. Specifically, SVRG leverages the theorem that
$$\nabla f(w) - \nabla f(v) + \nabla F(v)$$
is an unbiased estimate of $\nabla F(w)$ with variance approximately bounded by $L\,(F(v) - F(w^\star))$ (see (8) in (Johnson013)). Using this fact, the SVRG strategy is:
1. Choose an “anchor point” $v = w_0$.
2. Compute an exact gradient $\nabla F(v)$ (this is an expensive operation).
3. Perform $T$ SGD updates: $w_{t+1} = w_t - \eta\,(\nabla f_t(w_t) - \nabla f_t(v) + \nabla F(v))$ for i.i.d. samples $f_t \sim \mathcal{D}$ using the fixed anchor $v$.
4. Choose a new anchor point $v$ by averaging the $T$ SGD iterates, set $w_0 = v$ and repeat 2-4.
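The SVRG loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: it assumes a finite least-squares dataset (so that the exact gradient is computable), and the function name `svrg`, the step size `eta`, and the epoch and inner-loop lengths are hypothetical choices made for the example.

```python
import numpy as np

def svrg(X, y, eta=0.02, epochs=15, T=200, seed=0):
    """Plain SVRG on the least-squares loss f_i(w) = 0.5 * (x_i . w - y_i)^2.

    All hyperparameters here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    def grad_i(w, i):
        # Gradient of the loss on the single example i.
        return (X[i] @ w - y[i]) * X[i]

    def full_grad(w):
        # Exact gradient of the average loss (the expensive operation).
        return X.T @ (X @ w - y) / n

    v = np.zeros(d)                      # step 1: initial anchor point
    for _ in range(epochs):
        mu = full_grad(v)                # step 2: exact gradient at the anchor
        w = v.copy()
        iterates = []
        for _ in range(T):               # step 3: variance-reduced SGD updates
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(v, i) + mu
            w = w - eta * g
            iterates.append(w)
        v = np.mean(iterates, axis=0)    # step 4: new anchor = average iterate
    return v
```

Note that the step size `eta` must be chosen by hand here, which is exactly the tuning burden that the adaptive approach in this paper aims to remove.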
By reducing the suboptimality of the anchor point $v$, the variance in the gradients also decreases, producing a virtuous cycle in which optimization progress reduces noise, which allows faster optimization progress. This approach has two drawbacks that we will address. First, it requires computing the exact gradient $\nabla F(v)$, which is impossible in our stochastic optimization setup. Second, prior analyses require specific settings for the learning rate $\eta$ that incorporate the smoothness parameter $L$ and fail to converge with incorrect settings, requiring the user to manually tune $\eta$ to obtain the desired performance. To deal with the first issue, we can approximate $\nabla F(v)$ by averaging gradients over a mini-batch, which allows us to approximate SVRG’s variance-reduced gradient estimate, similar to (FrostigGKS15; lei2016less). This requires us to keep track of the errors introduced by this approximation. To deal with the second issue, we incorporate analysis techniques from online learning which allow us to replace the constant step-size SGD with any adaptive stochastic optimization algorithm in a black-box manner. This second step forms the core of our theoretical contribution, as it both simplifies analysis and allows us to adapt to $L$.
The overall roadmap for our analysis has five steps:
We model the errors introduced by approximating the anchor-point gradient by a minibatch-average as a “bias”, so that we think of our algorithm as operating on slightly biased but low-variance gradient estimates.
Focusing first only on the bias aspect, we analyze the performance of online learning algorithms with biased gradient estimates and show that so long as the bias is sufficiently small, the algorithm will still converge quickly (Section 4).
Next focusing on the variance-reduction aspect, we show that any online learning algorithm which enjoys a sufficiently adaptive convergence guarantee produces a similar “virtuous cycle” as observed with constant-step-size SGD in the analysis of SVRG, resulting in fast convergence (sketched in Section 5, proved in Appendices C and D).
Observe that the batch processing in step 3 can be done in parallel, that this step consumes the vast majority of the computation, and that it only needs to be repeated logarithmically many times (Section 7).
4 Biased Online Learning
A popular way to analyze stochastic gradient descent and related algorithms is through online learning (shalev2012online). In this framework, an algorithm repeatedly outputs vectors $w_t$ for $t = 1, 2, \dots$ in some convex space $W$, and receives gradients $g_t$ such that $\mathbb{E}[g_t] = \nabla F(w_t)$ for some convex objective function $F$. (The online learning literature often allows for adversarially generated $g_t$, but we consider only stochastically generated $g_t$ here.) Typically one attempts to bound the linearized regret:
$$R_T(w^\star) := \sum_{t=1}^T g_t \cdot (w_t - w^\star)$$
where $w^\star = \arg\min_{w \in W} F(w)$. We can apply online learning algorithms to stochastic optimization via online-to-batch conversion (cesa2004generalization), which tells us that
$$\mathbb{E}[F(\bar{w}) - F(w^\star)] \le \frac{\mathbb{E}[R_T(w^\star)]}{T}, \qquad \bar{w} = \frac{1}{T}\sum_{t=1}^T w_t.$$
Thus, an algorithm that guarantees small regret immediately guarantees convergence in stochastic optimization. Online learning algorithms typically obtain some sort of (deterministic!) guarantee like
$$R_T(w^\star) \le R(w^\star, \|g_1\|, \dots, \|g_T\|)$$
where $R$ is increasing in each $\|g_t\|$. For example, when the convex space $W$ has diameter $D$, AdaGrad (duchi2011adaptive) obtains $R(w^\star, \|g_1\|, \dots, \|g_T\|) = D\sqrt{2\sum_{t=1}^T \|g_t\|^2}$.
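To make the notion of a deterministic, gradient-dependent regret guarantee concrete, the following sketch runs projected gradient descent with a scalar AdaGrad-style step size on a ball of diameter `diameter` and compares the linearized regret against the standard bound of diameter times the square root of twice the sum of squared gradient norms. This scalar-step variant is swapped in for per-coordinate AdaGrad to keep the example short; the function names are hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def adaptive_ogd_regret(grads, diameter):
    """Run scalar-step adaptive gradient descent on a fixed gradient sequence.

    Returns (regret, bound), where regret is measured against the best fixed
    point in the ball and bound = diameter * sqrt(2 * sum_t ||g_t||^2).
    """
    radius = diameter / 2.0
    w = np.zeros(grads.shape[1])
    sum_sq = 0.0
    learner_loss = 0.0
    for g in grads:
        learner_loss += g @ w            # linearized loss of the current output
        sum_sq += g @ g
        if sum_sq > 0:
            eta = diameter / np.sqrt(2.0 * sum_sq)
            w = project_ball(w - eta * g, radius)
    # For linear losses the best fixed comparator lies on the boundary,
    # pointing against the summed gradient.
    total = grads.sum(axis=0)
    best_loss = -radius * np.linalg.norm(total)
    regret = learner_loss - best_loss
    bound = diameter * np.sqrt(2.0 * sum_sq)
    return regret, bound
```

The bound holds for every gradient sequence, not just in expectation, and it is stated without any reference to a smoothness constant; this is the property the later analysis relies on.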
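As foreshadowed in Section 3, we will need to consider the case of biased gradients. That is,
$$\mathbb{E}[g_t] = \nabla F(w_t) + b_t$$
for some unknown bias vector $b_t$. Given these biased gradients, a natural question is: to what extent does controlling the regret $R_T(w^\star)$ affect our ability to control the suboptimality $\mathbb{E}[F(\bar{w}) - F(w^\star)]$? We answer this question with the following simple result:
Proposition 1. Define $R_T(w^\star) = \sum_{t=1}^T g_t \cdot (w_t - w^\star)$ and $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$ where $\mathbb{E}[g_t] = \nabla F(w_t) + b_t$ and $w^\star = \arg\min_{w \in W} F(w)$. Then
$$\mathbb{E}[F(\bar{w}) - F(w^\star)] \le \frac{\mathbb{E}[R_T(w^\star)]}{T} + \frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[\|b_t\|\,\|w_t - w^\star\|\big].$$
If the domain $W$ has diameter $D$, then
$$\mathbb{E}[F(\bar{w}) - F(w^\star)] \le \frac{\mathbb{E}[R_T(w^\star)]}{T} + \frac{D}{T}\sum_{t=1}^T \mathbb{E}\big[\|b_t\|\big].$$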
Our main convergence results will require algorithms with regret bounds of the form or for various . This is an acceptable restriction because there are many examples of such algorithms, including AdaGrad (duchi2011adaptive), SOLO (orabona2016scale), PiSTOL (orabona2014simultaneous) and FreeRex (cutkosky2017online). Further, in Proposition 3 we show a simple trick to remove the dependence on , allowing our results to extend to unbounded domains.
5 Variance-Reduced Online Learning
In this section we sketch an argument that using variance reduction in conjunction with an online learning algorithm guaranteeing regret results in a very fast convergence of up to log factors. A similar result holds for regret guarantees like via a similar argument, which we leave to Appendix D. To do this we make use of a critical lemma of variance reduction which asserts that a variance-reduced gradient estimate of with anchor point has up to constants. This gives us the following informal result:
[Informal statement of Proposition 8] Given a point , let be an unbiased estimate of such that . Suppose are generated by an online learning algorithm with regret at most . Then
The proof is remarkably simple, and we sketch it in one line here. The full statement and proof can be found in Appendix D.
Now square both sides and use the quadratic formula to solve for . ∎
Notice that in Proposition 2, the online learning algorithm’s regret guarantee does not involve the smoothness parameter , and yet nevertheless shows up in equation (2). It is this property that will allow us to adapt to without requiring any user-supplied information.
Variance reduction allows us to generate estimates satisfying the hypothesis of Proposition 2, so that we can control our convergence rate by picking appropriate s. We want to change very few times because changing anchor points requires us to compute a high-accuracy estimate of . Thus we change only when is a power of 2 and set to be the average of the last iterates . By Jensen, this allows us to bound by , and so applying Proposition 2 we can conclude .
6 SVRG with Online Learning
With the machinery of the previous sections, we are now in a position to derive and analyze our main algorithm, presented in SVRG OL.
SVRG OL implements the procedure described in Section 3. For each of a series of rounds, we compute a batch gradient estimate for some “anchor point” . Then we run iterations of an online learning algorithm. To compute the gradient given to the online learning algorithm in response to an output point , SVRG OL approximates the variance-reduction trick of SVRG, setting for some new sample . After the iterations have elapsed, a new anchor point is chosen and the process repeats.
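The round structure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' reference implementation: the base learner `AdaptiveOGD` (projected gradient descent with a scalar AdaGrad-style step size) stands in for whichever adaptive online learner is used, the learner is assumed to keep its state across rounds, the new anchor is assumed to be the average of the round's iterates, and `draw` is a hypothetical oracle that returns the gradient function of one fresh sample.

```python
import numpy as np

class AdaptiveOGD:
    """Stand-in black-box base learner with a tuning-free adaptive step size."""

    def __init__(self, dim, diameter):
        self.w = np.zeros(dim)
        self.diameter = diameter
        self.sum_sq = 0.0

    def update(self, g):
        self.sum_sq += g @ g
        if self.sum_sq > 0:
            w = self.w - self.diameter / np.sqrt(2.0 * self.sum_sq) * g
            norm = np.linalg.norm(w)
            radius = self.diameter / 2.0
            self.w = w if norm <= radius else w * (radius / norm)

def svrg_ol(draw, dim, diameter, rounds, batch_size, serial_len):
    """Alternate a batch phase with a serial online-learning phase."""
    learner = AdaptiveOGD(dim, diameter)
    v = learner.w.copy()
    for _ in range(rounds):
        # Batch phase: estimate the gradient at the anchor from fresh samples.
        # This is the part that can be spread across machines.
        batch_grad = np.mean([draw()(v) for _ in range(batch_size)], axis=0)
        iterates = []
        # Serial phase: feed variance-reduced gradients to the base learner.
        for _ in range(serial_len):
            w = learner.w.copy()
            grad_f = draw()              # gradient function of one new sample
            learner.update(grad_f(w) - grad_f(v) + batch_grad)
            iterates.append(w)
        v = np.mean(iterates, axis=0)    # assumed rule for the next anchor
    return v
```

The base learner only ever sees a stream of gradient vectors, so any other adaptive online learner exposing the same `update` interface could be substituted without changing the outer loop.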
In this section we characterize SVRG OL’s performance when the base algorithm has a regret guarantee of . We can also perform essentially similar analysis for regret guarantees like , but we postpone this to Appendix E.
In order to analyze SVRG OL, we need to bound the error uniformly for all . This can be accomplished through an application of Hoeffding’s inequality:
Suppose that is a distribution over -Lipschitz functions. Then with probability at least , .
Suppose the online learning algorithm guarantees regret . Set for (where ). Suppose that for all . Then for ,
In particular, if is a distribution over -Lipschitz functions, then with probability at least we have for all . Further, if and has diameter , then this implies
We note that although this theorem requires a finite diameter for the second result, we present a simple technique to deal with unbounded domains and retain the same result in Appendix D.
7 Statistical and Computational Complexity
In this section we describe how to choose the batch size and epoch sizes in order to obtain optimal statistical complexity and computational complexity. The total amount of data used by SVRG OL is . If we choose , this is . Set , with some so that and . Then our Theorem 1 guarantees suboptimality , which is . This matches the optimal up to logarithmic factors and constants.
The parallelization step is simple: we parallelize the computation of by having machines compute and sum gradients for new examples each, and then averaging these sums together on one machine. Notice that this can be done with constant memory footprint by streaming the examples in - the algorithm will not make any further use of these examples so it’s safe to forget them. Then we run the steps of the inner loop in serial, which again can be done in constant memory footprint. This results in a total runtime of - a linear speedup so long as . For algorithms with regret bounds matching the conditions of Theorem 1, we get optimal convergence rates by setting , in which case our total data usage is . This yields the following calculation:
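The batch-gradient step lends itself to a map-reduce pattern, which the following sketch simulates on a single machine with a thread pool. It is only an illustration under stated assumptions: the loss is taken to be least squares, each "machine" is a worker that consumes an iterator of `(x, y)` examples, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def worker_gradient_sum(stream, v):
    """Stream over one shard, keeping only a running gradient sum and a count."""
    total = np.zeros_like(v)
    count = 0
    for x, y in stream:
        total += (x @ v - y) * x         # least-squares gradient at the anchor v
        count += 1                       # the example itself can now be forgotten
    return total, count

def parallel_batch_gradient(shards, v):
    """Map: one worker per shard. Reduce: combine the sums on the driver."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = list(pool.map(lambda s: worker_gradient_sum(s, v), shards))
    total = sum(r[0] for r in results)
    count = sum(r[1] for r in results)
    return total / count
```

Each worker holds one running vector regardless of shard size, which is why the memory footprint stays constant, and the only synchronization point is the final reduction.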
Set . Suppose the base optimizer in SVRG OL guarantees regret , and the domain has finite diameter . Let and be the total number of data points observed. Suppose we compute the batch gradients in parallel on machines with . Then for we obtain
in time , and space , and communication rounds.
8.1 Linear Losses and Dense Batch Gradients
Many losses of practical interest take the form $f(w) = \ell(w \cdot x, y)$ for some label $y$ and feature vector $x \in \mathbb{R}^d$ where $d$ is extremely large, but $x$ is $s$-sparse. These losses have the property that $\nabla f(w) = \ell'(w \cdot x, y)\,x$ is also $s$-sparse. Since $d$ can often be very large, it is extremely desirable to perform all parameter updates in $O(s)$ time rather than $O(d)$ time. This is relatively easy to accomplish for most SGD algorithms, but our strategy involves correcting the variance in $\nabla f(w)$ using a dense batch gradient, and so we are in danger of losing the significant computational speedup that comes from taking advantage of sparsity. We address this problem through an importance-sampling scheme.
Suppose the $i$th coordinate of $\nabla f(w)$ is non-zero with probability $p_i$. Given a vector $u$, let $Q(u)$ be the vector whose $i$th component is $0$ if $u_i = 0$, or $1/p_i$ if $u_i \neq 0$. Then $\mathbb{E}[Q(\nabla f(w))]$ is equal to the all-ones vector. Using this notation, and writing $\hat{g}$ for the dense batch gradient, we replace the variance-reduced gradient estimate $\nabla f(w) - \nabla f(v) + \hat{g}$ with $\nabla f(w) - \nabla f(v) + Q(\nabla f(w)) \odot \hat{g}$, where $\odot$ indicates component-wise multiplication. Since $\mathbb{E}[Q(\nabla f(w))]$ is the all-ones vector, $\mathbb{E}[Q(\nabla f(w)) \odot \hat{g}] = \hat{g}$ and so the expected value of this estimate has not changed. However, it is clear that the sparsity of the estimate is now equal to the sparsity of $\nabla f(w)$. Performing this transformation introduces additional variance into the estimate, and could slow down our convergence by a constant factor. However, we find that even with this extra variance we still see impressive speedups (see Section 9).
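The importance-sampling correction can be illustrated with a short sketch. It uses dense NumPy arrays for readability, whereas a practical implementation would operate only on the non-zero indices; the function names and the assumption that the per-coordinate frequencies `p` are known in advance are made up for the example.

```python
import numpy as np

def inverse_frequency_mask(g, p):
    """Return the vector with 1/p_i where g_i is non-zero and 0 elsewhere."""
    q = np.zeros_like(p, dtype=float)
    nonzero = g != 0
    q[nonzero] = 1.0 / p[nonzero]
    return q

def sparse_vr_gradient(grad_w, grad_v, batch_grad, p):
    """Variance-reduced gradient whose support matches the sparse sample gradients.

    grad_w, grad_v: sparse per-sample gradients at the current point and anchor.
    batch_grad:     dense batch gradient at the anchor.
    p:              probability that each coordinate of a sample gradient is non-zero.
    """
    return grad_w - grad_v + inverse_frequency_mask(grad_w, p) * batch_grad
```

Averaging `inverse_frequency_mask(g, p) * batch_grad` over samples recovers `batch_grad` on every coordinate that appears with positive frequency, which is the sense in which the rescaling leaves the expected value unchanged.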
8.2 Spark implementation
Implementing our algorithm in the Spark environment is fairly straightforward. SVRG OL switches between two phases: a batch gradient computation phase and a serial phase in which we run the online learning algorithm. The serial phase is carried out by the driver while the batch gradient is computed by executors. We initially divide the training data into approximately 100M chunks, and we use executors. Tree aggregation with depth of is used to gather the gradient from the executors, which is similar to the operation implemented by Vowpal Wabbit (VW) (agarwal2014reliable). We use asynchronous collects to move the instances used in the next serial SGD phase of SVRG OL to the driver while the batch gradient is being computed. We used feature hashing with 23 bits to limit memory consumption.
8.3 Batch sizes
Our theoretical analysis asks for exponentially increasing serial phase lengths and a batch size of . In practice we use slightly different settings. We have a constant serial phase length for all , and an increasing batch size for some constant . We usually set . The constant is motivated by observing that the requirement for exponentially increasing comes from a desire to offset potential poor performance in the first serial phase (which gives the dependence on in Theorem 1). In practice we do not expect this to be an issue. The increasing batch size is motivated by the empirical observation that earlier serial phases (when we are farther from the optimum) typically do not require as accurate a batch gradient in order to make fast progress.
|Data|# Instances (train)|# Instances (test)|Data size in GB (train)|Data size in GB (test)|# Features|Avg # feat|% positives|
|KDD10|19.2M|0.74M|0.5|0.02|29 890 095|29.34|86.06%|
|KDD12|119.7M|29.9M|1.6|0.5|54 686 452|11.0|4.44%|
|ADS SMALL|1.216B|0.356B|155.0|40.3|2 970 211|92.96|8.55%|
|ADS LARGE|5.613B|1.097B|1049.1|486.1|12 133 899|95.72|9.42%|
| |1.236B|0.994B|637.4|57.6|37 091 273|132.12|18.74%|
To verify our theoretical results, we carried out experiments on large-scale (order 100 million datapoints) public datasets, such as KDD10 and KDD12 (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), and on proprietary data (order 1 billion datapoints), such as click-prediction for ads (ChengC10) and email click-prediction datasets (anona). The main statistics of the datasets are shown in Table 2. All of these are large datasets with sparse features, and heavily imbalanced in terms of class distribution. We solved these binary classification tasks with logistic regression. We tested two well-known scalable logistic regression implementations: Spark ML 2.2.0 and Vowpal Wabbit 7.10.0 (VW) (https://github.com/JohnLangford/vowpal_wabbit/releases/tag/7.10). To optimize the logistic loss we used the L-BFGS algorithm implemented by both packages. We also tested minibatch SGD and non-adaptive SVRG implementations. However, we observe that the relationship between non-adaptive SVRG updates and the updates in our algorithm is analogous to the relationship between the updates in constant-step-size SGD and (for example) AdaGrad. Since our experiments involved sparse high-dimensional data, adaptive step sizes are very important and one should not expect these algorithms to be competitive (and indeed they were not).
First we compared SVRG OL to several non-parallelized baseline SGD optimizers on the different datasets. We plot the loss as a function of the number of datapoints processed, as well as the total runtime (Figure 1). Measuring the number of datapoints processed gives us a sense of the statistical efficiency of the algorithm and gives a metric that is independent of implementation quality details. We see that, remarkably, SVRG OL actually performs well as a function of the number of datapoints processed and so is a competitive serial algorithm even before any parallelization. Thus it is no surprise that we see significant speedups when the batch computation is parallelized.
To assess the trend of the speed-up with the size of the training data, we plotted the relative speed-up of SVRG OL versus FreeRex, which is used as the base optimizer in SVRG OL. Figure 2 shows the fraction of running time of non-parallel and parallel algorithms needed to achieve the same performance in terms of test loss. The x-axis scales with the running time of the parallel SVRG OL algorithm. The speed-up increases with training time, and thus with the number of training instances processed. This result suggests that, for large enough datasets, our method will indeed match the theoretical guarantees, although this trend is hard to verify rigorously in our test regime.
In our second experiment, we proceed to compare SVRG OL to Spark ML and VW in Table 4. These two LBFGS-based algorithms were superior in all metrics to minibatch SGD and non-adaptive SVRG algorithms and so we report only the comparison to Spark ML and VW (see Section F for full results). We measure the number of communication rounds, the total training error, the error on a held-out test set, the Area Under the Curve (AUC), and total runtime in minutes. Table 4 illustrates that SVRG OL compares well to both Spark ML and VW. Notably, SVRG OL uses dramatically fewer communication rounds. On the smaller KDD datasets, we also see much faster runtimes, possibly due to overhead costs associated with the other algorithms. It is important to note that our SVRG OL makes only one pass over the dataset, while the competition makes one pass per communication round, resulting in 100s of passes. Nevertheless, we obtain competitive final error due to SVRG OL’s high statistical efficiency.
We have presented SVRG OL, a generic stochastic optimization framework which combines adaptive online learning algorithms with variance reduction to obtain communication efficiency in parallel architectures. Our analysis significantly streamlines previous work by making black-box use of any adaptive online learning algorithm, thus disentangling the variance-reduction and serial phases of SVRG algorithms. We require only a logarithmic number of communication rounds, and we automatically adapt to an unknown smoothness parameter, yielding both fast performance and robustness to hyperparameter tuning. We developed a Spark implementation of SVRG OL and solved real large-scale sparse learning problems with performance competitive with the L-BFGS implementations in VW and Spark ML.
Appendix A Proof of Lemma 1
The assumption on implies that is bounded by and so is -subgaussian. Therefore we can apply the Hoeffding and union bounds to obtain tail bounds on :
rearranging, with probability at least , for all we have
as desired. ∎
Appendix B Proofs from Section 4
The proof follows from Cauchy-Schwarz, triangle inequality, and convexity of :
Now rearrange and apply Jensen’s inequality to recover the first line of the Proposition. The second statement follows from observing that . ∎
Proposition 1 shows that the suboptimality increases when both and become large. Although the online learning algorithm does not have the ability to control , it does have the ability to control , and so we can design a reduction to compensate for . The reduction is simple: instead of , provide the algorithm with , where is a bound such that for all , and by abuse of notation we take when . Proposition 3 below tells us that, so long as we know the bound , we can obtain an increase in suboptimality that depends only on and not .
Suppose an online learning algorithm guarantees regret , where is an increasing function of each . Then if we run on gradients , we obtain:
Observe that where , so that by convexity we have
Now observe that to obtain:
Finally, use Jensen’s inequality to conclude the Proposition. ∎
Appendix C Smooth Losses
In the following sections we consider applying an online learning algorithm to gradients of the form where each is an i.i.d. smooth convex function with and is some fixed vector. In order to leverage the structure of this kind of , we’ll need two lemmas from the literature about smooth convex functions (which can be found, for example, in Johnson013):
If is an -smooth convex function and are fixed vectors, then
Set . Then is still convex and -smooth and so that for all . Therefore we have
from which the Lemma follows. ∎
We can use Lemma 2 to show the following useful fact:
Suppose is a distribution over -smooth convex functions and . Let . Then for all we have .
From Lemma 2 we have
and now the result follows since . ∎
With these in hand, we can prove
With for some points and , we have
Observe that , so that by Lemma 3
Using this we have
Appendix D Biased Online Learning with SVRG updates
In this section we are finally prepared to analyze SVRG OL. In order to do so, we restate the algorithm as Algorithm 2. This algorithm is identical to SVRG OL, but we have introduced the additional notation so that we can write instead of . Factoring out the term and writing in this way makes clearer how we are able to apply the analysis of biased gradients to the analysis of online learning in the previous section. We analyze Algorithm 2 for online learning algorithms that obtain second-order regret guarantees, as well as ones that obtain first-order regret guarantees . Algorithms in these families include the well-known AdaGrad [duchi2011adaptive] algorithm and its unconstrained variant SOLO [orabona2016scale] (second order, ) as well as FreeRex [cutkosky2017online] and PiSTOL [orabona2014simultaneous] (first order, ). We will show that for sufficiently small , such algorithms result in such that .