Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms

09/22/2014 ∙ by Yu-Xiang Wang, et al.

We develop parallel and distributed Frank-Wolfe algorithms; the former on shared-memory machines with mini-batching, and the latter in a delayed-update framework. Whenever possible, we perform computations asynchronously, which helps attain speedups on multicore machines as well as in distributed environments. Moreover, instead of worst-case bounded delays, our methods only depend (mildly) on expected delays, allowing them to be robust to stragglers and faulty worker threads. Our algorithms assume block-separable constraints, and subsume the recent Block-Coordinate Frank-Wolfe (BCFW) method of Lacoste-Julien et al. (2013). Our analysis reveals problem-dependent quantities that govern the speedups of our methods over BCFW. We present experiments on structural SVM and Group Fused Lasso, obtaining significant speedups over competing state-of-the-art (and synchronous) methods.

1 Introduction

The classical Frank-Wolfe (FW) algorithm [13] has witnessed a huge surge of interest recently [7, 20, 21, 2]. The FW algorithm iteratively solves the problem

\min_{x \in \mathcal{M}} f(x) \qquad (1)

where $f$ is a smooth function (typically convex) and $\mathcal{M}$ is a closed convex set. The key factor that makes FW appealing is its use of a linear oracle that solves $\min_{s \in \mathcal{M}} \langle s, \nabla f(x) \rangle$, instead of a projection (quadratic) oracle that solves $\min_{s \in \mathcal{M}} \|s - x\|^2$, especially because the linear oracle can be much simpler and faster.
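
To make the contrast concrete, here is a minimal sketch (illustration only, not code from the paper) of one FW step on the probability simplex, where the linear oracle reduces to picking the coordinate with the smallest gradient entry; the quadratic objective below is a placeholder example.

```python
# Illustration only (not the paper's code): one Frank-Wolfe step on the probability
# simplex. The linear oracle is just an argmin over coordinates, whereas a projection
# oracle would need to solve a quadratic subproblem.
import numpy as np

def lmo_simplex(grad):
    """Linear oracle: argmin_{s in simplex} <s, grad> is the vertex at the smallest entry."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s

def fw_step(x, grad, k):
    """One FW step with the standard 2/(k+2) step size."""
    s = lmo_simplex(grad)
    gamma = 2.0 / (k + 2.0)
    return x + gamma * (s - x)

# Toy example (assumed objective): minimize 0.5 * ||A x - b||^2 over the simplex.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
x = np.ones(5) / 5.0
for k in range(100):
    grad = A.T @ (A @ x - b)
    x = fw_step(x, grad, k)
print("final objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```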

This appeal has motivated several new variants of basic FW, e.g., regularized FW [41, 6, 17], linearly convergent special cases [23, 16], stochastic/online versions [34, 18, 25], and a randomized block-coordinate FW [24].

But despite this progress, parallel and distributed FW variants are barely studied. In this work, we develop new parallel and distributed FW algorithms, in particular for block-separable instances of (1) that assume the form

\min_{x \in \mathcal{M}^{(1)} \times \cdots \times \mathcal{M}^{(n)}} f(x) \qquad (2)

where each $\mathcal{M}^{(i)} \subseteq \mathbb{R}^{m_i}$ ($1 \le i \le n$) is a compact convex set and $x_{(i)} \in \mathcal{M}^{(i)}$ are the coordinate blocks of $x$. This setting for FW was considered in [24], who introduced the Block-Coordinate Frank-Wolfe (Bcfw) method.

Such problems arise in many applications, notably structural SVMs [24], routing [26], group fused lasso [1, 5], trace-norm based tensor completion [29], reduced rank nonparametric regression [12], and structured submodular minimization [22], among others.

One approach to solve (2) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem [32, 36, 3]. However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes [15]), or even computationally intractable [8].

Frank-Wolfe (FW) methods excel in such scenarios as they rely only on linear oracles that solve $\min_{s \in \mathcal{M}} \langle s, \nabla f(x) \rangle$. For $\mathcal{M} = \mathcal{M}^{(1)} \times \cdots \times \mathcal{M}^{(n)}$, this oracle breaks into the $n$ independent subproblems

\min_{s_{(i)} \in \mathcal{M}^{(i)}} \; \langle s_{(i)}, \nabla_{(i)} f(x) \rangle, \quad i = 1, \ldots, n, \qquad (3)

where $\nabla_{(i)} f(x)$ denotes the gradient w.r.t. the coordinate block $x_{(i)}$. It is immediate that these subproblems can be solved in parallel (an idea dating back to at least [26]). But there is a practical impediment: updating all the coordinates at each iteration (serially or in parallel) is expensive, hampering the use of FW on big-data problems.
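
As an illustration of this decomposition (a sketch under assumptions, not the paper's code; we assume each block is a probability simplex, as in the structural SVM dual), the per-block oracles in (3) can be dispatched independently to a worker pool:

```python
# Illustration only: the linear oracle over a product domain decomposes into independent
# per-block oracles. Assumption: each block is a probability simplex.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def lmo_simplex_block(grad_block):
    """argmin over the simplex of <s, grad_block> is the vertex at the smallest entry."""
    s = np.zeros_like(grad_block)
    s[np.argmin(grad_block)] = 1.0
    return s

def solve_blocks_in_parallel(grad, blocks):
    """Solve one linear subproblem per coordinate block, each on its own worker."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda sl: lmo_simplex_block(grad[sl]), blocks))

grad = np.random.default_rng(0).standard_normal(12)
blocks = [slice(0, 4), slice(4, 8), slice(8, 12)]   # n = 3 coordinate blocks
print(solve_blocks_in_parallel(grad, blocks))
```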

This drawback is partially ameliorated by Bcfw [24], a method that randomly selects a block at each iteration and performs FW updates with it. However, this procedure is strictly sequential: it does not take advantage of modern multicore architectures or of high-performance distributed clusters.

Contributions. In light of the above, we develop scalable FW methods, and make the following main contributions:

  • Parallel and distributed block-coordinate Frank-Wolfe algorithms, henceforth both referred to as Ap-Bcfw, that allow asynchronous computation. Ap-Bcfw depends only (mildly) on the expected delay, and is therefore robust to stragglers and faulty worker threads.

  • An analysis of the primal and primal-dual convergence of Ap-Bcfw and its variants for any minibatch size and a potentially unbounded maximum delay. When the maximum delay is actually bounded, we show stronger results using max-load bounds from the load-balancing literature.

  • Insightful deterministic conditions under which minibatching provably improves the convergence rate for a class of problems (sometimes by orders of magnitude).

  • Experiments that demonstrate on real data how our algorithm solves a structural SVM problem several times faster than the state-of-the-art.

In short, our results contribute to making FW more attractive for big-data applications. To lend further perspective, we compare our methods to some closely related works below. Space limits our summary; we refer the reader to Jaggi [21], Zhang et al. [40], Lacoste-Julien et al. [24], Freund & Grigas [14] for additional notes and references.

Bcfw and Structural SVM. Our algorithm Ap-Bcfw extends and generalizes Bcfw to parallel computation using mini-batches. Our convergence analysis follows the proof structure in Lacoste-Julien et al. [24], but with different stepsizes that must be carefully chosen. Our results contain Bcfw as a special case. A large portion of Lacoste-Julien et al. [24] focuses on more explicit (and stronger) guarantees for Bcfw on the structural SVM. While we mainly focus on a more general class of problems, the particular subroutine needed by the structural SVM requires special treatment; we discuss the details in Appendix C.

Parallelization of sequential algorithms. The idea of parallelizing sequential optimization algorithms is not new. It dates back to [38] for stochastic gradient methods; more recently Richtárik & Takáč [36], Liu et al. [30], Lee et al. [27] study parallelization of BCD. The conditions under which these parallel BCD methods succeed, e.g., expected separable overapproximation (ESO), and coordinate Lipschitz conditions, bear a close resemblance to our conditions in Section 2.2, but are not the same due to differences in how solutions are updated and what subproblems arise. In particular, our conditions are affine invariant. We provide detailed comparisons to parallel coordinate descents in Appendix D.4.

Asynchronous algorithms. Asynchronous algorithms that allow delayed parameter updates have been proposed earlier for stochastic gradient descent [33] and parallel BCD [30]. We propose the first asynchronous algorithm for Frank-Wolfe. Our asynchronous scheme not only permits delayed minibatch updates, but also allows the updates for coordinate blocks within the same minibatch to have different delays; as a result, an update may not be a solution of (3) at any single iterate. In addition, we obtain a strictly better dependence on the delay parameter than our predecessors (e.g., an exponential improvement over Liu et al. [30]), possibly due to a sharper analysis.

Other related work. While preparing our manuscript, we discovered the preprint [4] which also studies distributed Frank-Wolfe. We note that [4] focuses on Lasso type problems and communication costs, and hence, is not directly comparable to our results.

Notation. We briefly summarize our notation now. The vector $x \in \mathbb{R}^m$ denotes the parameter vector, possibly split into $n$ coordinate blocks $x_{(1)}, \ldots, x_{(n)}$. For block $i$, $P_i$ is the projection matrix which projects $x$ down to $x_{(i)}$; thus $x_{(i)} = P_i x$. The adjoint operator $P_i^\top$ maps $x_{(i)}$ back to $\mathbb{R}^m$; thus $x_{[i]} = P_i^\top x_{(i)}$ is $x$ with zeros in all dimensions except those of block $i$ (note the bracketed subscript $[i]$). We denote the size of a minibatch by $\tau$, and the number of parallel workers (threads) by $T$. Unless otherwise stated, $k$ denotes the iteration/epoch counter and $\gamma$ denotes a stepsize. Finally, $C_f^{\otimes \tau}$ (and other such constants) denotes a curvature measure associated with the function $f$ and minibatch size $\tau$. Such constants are important in our analysis, and will be described in greater detail in the main text.

2 Algorithm

In this section, we develop and analyze an asynchronous parallel block-coordinate Frank-Wolfe algorithm, hereafter Ap-Bcfw, to solve (2).

Our algorithm is designed to run fully asynchronously on either a shared-memory multicore architecture or on a distributed system. For the shared-memory model, the computational work is divided amongst worker threads, each of which has access to a pool of coordinates that it may work on, as well as to the shared parameters. This setup matches the system assumptions in Niu et al. [33], Richtárik & Takáč [36], Liu et al. [30], and most modern multicore machines permit such an arrangement. On a distributed system, the parameter server [28, 9] periodically broadcasts the most recent parameter vector to each worker, and the workers keep sending updates for randomly chosen coordinate blocks after solving the corresponding subproblems. In either setting, we do not wait for slower workers or synchronize the parameters at any point of the algorithm; therefore, many updates sent by the workers may have been computed from a delayed parameter vector.

  ———————— Server node ————————
  Input: an initial feasible $x_0$, mini-batch size $\tau$, number of workers $T$.
  Broadcast $x_0$ to all workers.
  for $k = 1, 2, \ldots$ ($k$ is the iteration number) do
     1. Read from the buffer until it contains updates for $\tau$ disjoint blocks, overwriting in case of collision (we bound the probability of collisions in Appendix D.1). Denote the index set by $S$.
     2. Set the step size $\gamma_k$.
     3. Update $x_k \leftarrow x_{k-1} + \gamma_k \sum_{i \in S} (s_{[i]} - x_{k-1,[i]})$.
     4. Broadcast $x_k$ (or just the changed blocks) to the workers.
     if converged then
        Broadcast STOP signal to workers and break.
     end if
  end for
  Output: $x_k$.
  ———————— Worker nodes ————————
  a. Set the local copy of $x$ to be the received $x_0$.
  while no STOP signal received do
     if a new update is received then
        b. Update the local copy of $x$.
     end if
     c. Randomly choose a block $i \in [n]$.
     d. Calculate the partial gradient $\nabla_{(i)} f(x)$ and solve (3) for $s_{(i)}$.
     e. Send $(i, s_{(i)})$ to the server.
  end while
Algorithm 1 Ap-Bcfw: Asynchronous Parallel Block-Coordinate Frank-Wolfe (Distributed)
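
To make the server-worker interaction concrete, here is a minimal runnable sketch (illustration only, not the authors' implementation): the quadratic objective, the unit-ball blocks, and the step-size schedule $2n/(\tau k + 2n)$ are all placeholder assumptions.

```python
# Toy asynchronous server/worker loop mirroring Algorithm 1, using Python threads.
# Assumptions (illustration only, not the paper's choices): f(x) = 0.5 * ||x - b||^2,
# every block lives on a unit l2-ball (block LMO: -g/||g||), gamma_k = 2n/(tau*k + 2n).
import threading, queue, random
import numpy as np

n_blocks, block_dim, tau, n_workers = 8, 5, 2, 4
b = np.random.default_rng(0).standard_normal(n_blocks * block_dim)
x = np.zeros_like(b)                       # shared parameter vector (server-owned)
updates = queue.Queue(maxsize=1024)        # workers push (block_id, s_block) here
stop = threading.Event()

def block_slice(i):
    return slice(i * block_dim, (i + 1) * block_dim)

def worker():
    while not stop.is_set():
        i = random.randrange(n_blocks)                   # step c: pick a random block
        grad = x[block_slice(i)] - b[block_slice(i)]     # step d: possibly stale read of x
        nrm = np.linalg.norm(grad)
        s = np.zeros(block_dim) if nrm == 0 else -grad / nrm   # LMO over the unit l2-ball
        updates.put((i, s))                              # step e: send to the server

threads = [threading.Thread(target=worker, daemon=True) for _ in range(n_workers)]
for t in threads:
    t.start()

for k in range(1, 2001):                   # server loop
    batch = {}
    while len(batch) < tau:                # step 1: tau disjoint blocks, overwrite collisions
        i, s = updates.get()
        batch[i] = s
    gamma = 2.0 * n_blocks / (tau * k + 2.0 * n_blocks)  # step 2: assumed step-size schedule
    for i, s in batch.items():             # step 3: blockwise x <- x + gamma * (s - x)
        sl = block_slice(i)
        x[sl] += gamma * (s - x[sl])
stop.set()                                 # broadcast STOP
print("final objective:", 0.5 * np.linalg.norm(x - b) ** 2)
```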

The above scheme is made explicit by the pseudocode in Algorithm 1, following server-worker terminology. The shared-memory version of the pseudocode is very similar and hence deferred to the Appendix. The three most important questions pertaining to Algorithm 1 are:

  • Does it converge?

  • If so, then how fast? And how much faster is it compared to Bcfw ($\tau = 1$)?

  • How do delayed updates affect the convergence?

We answer the first two questions in Sections 2.1 and 2.2. Specifically, we show that Ap-Bcfw converges at the familiar $O(1/k)$ rate. Our analysis reveals that the speedup of Ap-Bcfw over Bcfw through parallelization is problem dependent. Intuitively, we show that the extent to which mini-batching (with batch size $\tau$) can speed up convergence depends on the average “coupling” of the objective function across different coordinate blocks. For example, we show that if $f$ has a block symmetric diagonally dominant Hessian, then Ap-Bcfw converges $\tau$ times faster than Bcfw. We address the third question in Section 2.3, where we establish convergence results that depend only mildly on the expected delay $\kappa$. The resulting bound degrades gracefully with $\kappa$ when the maximum delay is allowed to grow unboundedly, and improves further when the delay is bounded by a small constant.

2.1 Main convergence results

Before stating the results, we need to define a few quantities. The first key quantity—also key to the analysis of several other FW methods—is the notion of curvature. Since Ap-Bcfw updates a subset of coordinate blocks at a time, we define the set curvature for an index set $S \subseteq [n]$ as

C_f^{(S)} := \sup_{\substack{x \in \mathcal{M},\; s_{(S)} \in \mathcal{M}^{(S)},\; \gamma \in [0,1] \\ y = x + \gamma (s_{[S]} - x_{[S]})}} \frac{2}{\gamma^2} \Big( f(y) - f(x) - \langle y_{(S)} - x_{(S)}, \nabla_{(S)} f(x) \rangle \Big). \qquad (4)

For index sets of size $\tau$, we define the expected set curvature over a uniform choice of subsets as

C_f^{\otimes \tau} := \mathbb{E}_{S : |S| = \tau} \big[ C_f^{(S)} \big] = \binom{n}{\tau}^{-1} \sum_{S \subseteq [n],\, |S| = \tau} C_f^{(S)}. \qquad (5)

These curvature definitions are closely related to the global curvature constant $C_f$ of [21] and the coordinate curvature $C_f^{(i)}$ and product curvature $C_f^{\otimes}$ of [24]. Lemma 1 makes this relation more precise.

Lemma 1 (Curvature relations).

Suppose $S \subseteq [n]$ has cardinality $|S| = \tau$ and $i \in S$. Then,

  1. ;

  2. .

The way the average set curvature scales with is critical for bounding the amount of speedup we can expect over Bcfw; we provide a detailed analysis of this speedup in Section 2.2.

The next key object is an approximate linear minimizer. At iteration $k$, as in Jaggi [21] and Lacoste-Julien et al. [24], we also allow the core computational subroutine that solves (3) to yield only an approximate minimizer $s_{(i)}$. The approximation is quantified by an additive constant $\delta \ge 0$: for a minibatch of size $\tau$, the approximate solution must obey, in expectation,

(6)

where the expectation is taken over both the random coins in selecting the minibatch $S$ and any other source of uncertainty in this oracle call during the entire history up to step $k$. Condition (6) is strictly weaker than what is required in Jaggi [21] and Lacoste-Julien et al. [24], as we only need the approximation to hold in expectation. With definitions (5) and (6) in hand, we are ready to state our first main convergence result.

Theorem 1 (Primal Convergence).

Suppose we employ a linear minimizer that satisfies (6) when solving the subproblems (3). Then, for each $k \ge 0$, the iterates of Algorithm 1 and of its line-search variant obey

where the constant

At first glance, the curvature term in the numerator might seem bizarre, but as we will see in the next section, it can be as small as $O(1/n^2)$ for well-scaled problems. This is the scale of the constant one should keep in mind when comparing the rate to other methods, e.g., coordinate descent. Also note that, so far, this convergence result does not explicitly cover delayed updates; we analyze those separately in Section 2.3 via the approximation parameter $\delta$.

For FW methods, one can also easily obtain a convergence guarantee in an appropriate primal-dual sense. To this end, we introduce our version of the surrogate duality gap [21]; we define this as

g(x) := \max_{s \in \mathcal{M}} \; \langle x - s, \nabla f(x) \rangle. \qquad (7)

To see why (7) is actually a duality gap, note that since $f$ is convex, the linearization $f(x) + \langle s - x, \nabla f(x) \rangle$ lies below $f(s)$ for any $s \in \mathcal{M}$, so that

g(x) \;\ge\; \langle x - x^*, \nabla f(x) \rangle \;\ge\; f(x) - f(x^*).

This duality gap is obtained for “free” in batch Frank-Wolfe, but not in Bcfw or Ap-Bcfw; there, we only have an unbiased estimator of $g(x)$ computed from the sampled blocks. As $\tau$ gets large, this estimator is close to $g(x)$ with high probability (by McDiarmid's inequality), and can still be useful as a stopping criterion.
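
As a concrete illustration (a sketch under assumptions, not the paper's code), one can estimate the surrogate gap from a sampled minibatch; here we assume each block is a unit l2-ball and that scaling the minibatch sum of per-block gaps by $n/\tau$ gives an unbiased estimate of the full gap.

```python
# Illustration only: estimate the surrogate duality gap (7) from a random minibatch.
# Assumptions: each block is a unit l2-ball, and the full gap is the sum of per-block
# gaps, so scaling the minibatch sum by n/tau is assumed to give an unbiased estimate.
import numpy as np

def block_gap(x_block, grad_block):
    """Per-block gap <x_(i) - s_(i), grad_(i)> with s_(i) from the unit l2-ball LMO."""
    nrm = np.linalg.norm(grad_block)
    s = np.zeros_like(grad_block) if nrm == 0 else -grad_block / nrm
    return float(np.dot(x_block - s, grad_block))

def estimated_gap(x, grad, blocks, tau, rng):
    """Scale the sum of tau sampled per-block gaps by n/tau (assumed unbiased estimator)."""
    n = len(blocks)
    sampled = rng.choice(n, size=tau, replace=False)
    return (n / tau) * sum(block_gap(x[blocks[i]], grad[blocks[i]]) for i in sampled)

rng = np.random.default_rng(0)
blocks = [slice(4 * i, 4 * (i + 1)) for i in range(10)]    # n = 10 blocks of size 4
x, grad = rng.standard_normal(40), rng.standard_normal(40)
print(estimated_gap(x, grad, blocks, tau=3, rng=rng))
```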

Theorem 2 (Primal-Dual Convergence).

Suppose we run Algorithm 1 or its line-search variant for up to $K$ iterations. Then there exists at least one $\hat{k} \le K$ such that the expected surrogate duality gap satisfies

where the constant is as in Theorem 1 and the bound involves a weighted average of the per-iteration gaps.

Relation with FW and Bcfw: The above convergence guarantees can be thought of as an interpolation between Bcfw and batch FW. If we take $\tau = 1$, this recovers exactly the convergence guarantee for Bcfw [24, Theorem 2]; if we take $\tau = n$, we can drop the initialization term from the constant (with a small modification in the analysis) and the result reduces to the classic batch guarantee of [21].

Dependence on initialization: Unlike classic FW, the convergence rate of our method depends on the initialization. When the initial suboptimality is large relative to the curvature constant, convergence is slower by a constant factor. The same concern was raised in [24] for the case $\tau = 1$. The initialization term can actually be removed from the constant as long as the initial suboptimality is dominated by the curvature term. By Lemma 1, the expected set curvature increases with $\tau$, so this fast-convergence region becomes larger when we increase $\tau$. In addition, if we pick $\tau$ large enough, the rate of convergence is not affected by the initialization at all.

Speedup: The careful reader may have noticed the $n^2$ term in the numerator. This is undesirable as $n$ can be large (for instance, in the structural SVM $n$ is the total number of data points). The saving grace in Bcfw is that when $\tau = 1$, the expected set curvature is as small as $O(1/n^2)$ (see [24, Lemmas A1 and A2]), and it is easy to check that the dependence on $n$ is the same even for $\tau > 1$. What really matters is how much speedup one can achieve over Bcfw, and this speedup critically relies on how $C_f^{\otimes \tau}$ depends on $\tau$. Analyzing this dependence will be our main focus in the next section.

2.2 Effect of parallelism / mini-batching

To understand when mini-batching is meaningful and to quantify its speedup, we take a more careful look at the expected set curvature $C_f^{\otimes \tau}$ in this section. In particular, we analyze and present a set of insightful conditions that govern its relationship with $\tau$. The key idea is to roughly quantify how strongly different coordinate blocks interact with each other.

To begin, assume that there exists a positive semidefinite matrix $H$ such that for any $x$ and $x + \Delta$ in $\mathcal{M}$,

f(x + \Delta) \;\le\; f(x) + \langle \Delta, \nabla f(x) \rangle + \tfrac{1}{2} \Delta^\top H \Delta. \qquad (8)

The matrix $H$ may be viewed as a generalization of the gradient's Lipschitz constant (a scalar) to a matrix. For quadratic functions $f(x) = \tfrac{1}{2} x^\top A x + b^\top x$ with $A \succeq 0$, we can take $H = A$. For twice-differentiable functions, we can choose any $H$ with $H \succeq \nabla^2 f(x)$ for all $x \in \mathcal{M}$.

Since $H \in \mathbb{R}^{m \times m}$ (we write $m$ instead of $\sum_i m_i$ for brevity), we separate $H$ into $n \times n$ blocks, so that $H_{ij}$ represents the block corresponding to coordinate blocks $i$ and $j$ and we can take products of the form $x_{(i)}^\top H_{ij} x_{(j)}$. Now, we define a boundedness parameter $B_{ii}$ for every block $i$, and an incoherence condition with parameter $\mu_{ij}$ for every block-coordinate pair $i \neq j$, which bound the corresponding diagonal and off-diagonal quadratic forms of $H$ over the respective blocks.

Then, using these quantities, we obtain the following bound on the expected set-curvature.

Theorem 3.

Suppose problem (2) obeys $B$-expected boundedness and $\mu$-expected incoherence. Then,

(9)

It is clear that when the incoherence term $\mu$ is large, the expected set curvature grows quadratically with $\tau$, and when $\mu$ is close to 0, it grows only linearly with $\tau$. In other words, when the interaction between coordinate blocks is small, one gains from parallelizing block-coordinate Frank-Wolfe. This is analogous to the situation in parallel coordinate descent [36, 30], and we will compare the rate of convergence explicitly with these methods in the next section.

Remark 1.

Let us form an $n \times n$ matrix with $B_{ii}$ on the diagonal and $\mu_{ij}$ on the off-diagonal. If this matrix is symmetric diagonally dominant (SDD), i.e., the sum of absolute off-diagonal entries in each row is no greater than the diagonal entry, then $C_f^{\otimes \tau}$ is proportional to $\tau$.
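
The condition in Remark 1 is easy to check numerically; the sketch below (with hypothetical $B_{ii}$ and $\mu_{ij}$ values, illustration only) builds this matrix and tests symmetric diagonal dominance.

```python
# Numerical check of Remark 1 (illustration only, hypothetical parameters): build the
# n x n matrix with boundedness parameters B_ii on the diagonal and incoherence
# parameters mu_ij off the diagonal, and test symmetric diagonal dominance.
import numpy as np

def is_sdd(M, tol=0.0):
    """True if M is symmetric and each diagonal entry dominates its off-diagonal row sum."""
    if not np.allclose(M, M.T):
        return False
    off_diag = np.abs(M).sum(axis=1) - np.abs(np.diag(M))
    return bool(np.all(np.diag(M) + tol >= off_diag))

B = np.array([1.0, 0.8, 1.2])                  # hypothetical B_ii for n = 3 blocks
mu = np.array([[0.0, 0.3, 0.2],
               [0.3, 0.0, 0.4],
               [0.2, 0.4, 0.0]])               # hypothetical mu_ij (zero diagonal)
M = np.diag(B) + mu
print(is_sdd(M))   # True -> Remark 1 suggests the expected set curvature scales linearly in tau
```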

The above result depends on the parameters $B$ and $\mu$. We now derive specific instances of the above results for the structural SVM and the Group Fused Lasso. For the structural SVM, a simple generalization of [24, Lemmas A.1, A.2] shows that in the worst case, using $\tau > 1$ offers no gain at all. Fortunately, if we are willing to consider a more specific problem and consider the average case instead, using a larger $\tau$ does make the algorithm converge faster (and this is the case according to our experiments).

Example 1 (Structural SVM for multi-label classification (with random data)).

We describe the application to structural SVMs in detail in Appendix C (please see that section for details on the notation). Here, we describe the convergence rate for this application. According to [39], the compatibility function for multiclass classification takes a block form in which the only nonzero block is filled with the feature vector of the labeled class; this already ensures boundedness provided the feature vectors lie on a unit sphere. Suppose each class has a unique feature vector drawn randomly from a unit sphere; furthermore, for simplicity assume we always draw data points with distinct labels (this is an oversimplification, but it offers a rough rule of thumb; in practice, the relevant quantity should be in the same ballpark as our estimate here). Then, with high probability, the expected set curvature scales favorably with $\tau$, which yields an improved convergence rate in the notation of Lemmas A.1 and A.2 of [24].

This analysis suggests a good rule of thumb: choose $\tau$ to be at most the number of categories in the classification problem. If each class is a mixture of random draws from the unit sphere, then we can choose $\tau$ to be the underlying number of mixture components.

Example 2 (Group Fused Lasso).

The Group Fused Lasso aims to solve (typically with $p = 2$)

\min_{W \in \mathbb{R}^{d \times T}} \; \tfrac{1}{2} \| X - W \|_F^2 + \lambda \sum_{t=1}^{T-1} \| (W D)_{\cdot, t} \|_p \qquad (10)

where $X \in \mathbb{R}^{d \times T}$, and column $t$ of $X$ is an observed noisy $d$-dimensional feature vector at time $t$. The matrix $D$ is the differencing matrix that takes the difference of feature vectors at adjacent time points (columns). The formulation aims to recover a trend that has piecewise-constant structure. The dual of (10) is

s.t.

where $q$ is conjugate to $p$, i.e., $1/p + 1/q = 1$. This block-constrained problem fits our structure (2). For this problem, we can work out the boundedness and incoherence parameters explicitly, which yields the bound

Consequently, the rate of convergence follows from Theorem 1 with this curvature bound. In this case, batch FW has a better rate of convergence than Bcfw. (Observe that, unlike in the structural SVM formulation, the objective here is not scaled with $n$, so there is no $n$ term in the denominator to cancel the one in the numerator.)
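
To make the role of the differencing matrix concrete, here is a small sketch (illustration only, under one possible convention for $D$) showing that right-multiplication by $D$ produces the adjacent-column differences penalized in (10):

```python
# Illustration only (one possible convention for D): right-multiplying X by the
# differencing matrix yields the differences of adjacent columns, which the Group
# Fused Lasso penalizes group-wise.
import numpy as np

def differencing_matrix(T):
    """D in R^{T x (T-1)} with (X @ D)[:, t] = X[:, t+1] - X[:, t]."""
    D = np.zeros((T, T - 1))
    for t in range(T - 1):
        D[t, t], D[t + 1, t] = -1.0, 1.0
    return D

d, T = 3, 6
X = np.cumsum(np.random.default_rng(0).standard_normal((d, T)), axis=1)  # a noisy trend
D = differencing_matrix(T)
assert np.allclose(X @ D, np.diff(X, axis=1))   # adjacent-column differences
print((X @ D).shape)                            # (d, T-1): one group per potential change point
```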

2.3 Convergence with delayed updates

Due to communication delays, it routinely happens that some updates pushed by workers were computed from delayed parameters that were broadcast earlier. Dropping these updates or enforcing synchronization would create a large system overhead, especially when the minibatch size is small. Ideally, we want to simply accept the delayed updates as if they were correct, and broadcast new parameters to the workers without locking the updates. The question is: does this actually work? In this section, we model the delay of every update as i.i.d. from an unknown distribution. Under weak assumptions, we show that the effect of delayed updates can be treated as a form of approximate oracle evaluation as in (6), with a specific additive constant that depends on the expected delay and on the maximum delay parameter (when it exists), thereby establishing that the convergence results of the previous section remain valid for this variant. The results also depend on the following diameter and gradient Lipschitz constant for a chosen norm.

Theorem 4 (Delayed Updates as Approximate Oracle).

For each choice of norm, let the diameter and the gradient Lipschitz constant be defined as above. Let the delay of each update be a random variable, and let $\kappa$ be the expected delay from any worker; moreover, assume that the algorithm drops any update whose delay exceeds the admissible threshold at iteration $k$. Then, for the version of the algorithm without line search, the delayed oracle produces updates such that (6) holds with

(11)

Furthermore, if we assume that there is a maximum delay $\kappa_{\max}$ such that every update's delay is at most $\kappa_{\max}$, then (6) holds with the smaller constant

(12)

The results above imply that Ap-Bcfw (without line search) converges both in primal optimality and in duality gap, according to Theorems 1 and 2.

Note that (11) depends on the expected delay $\kappa$ rather than on the maximum delay, even as we allow the maximum delay to grow unboundedly. This allows the system to automatically handle heavy-tailed delay distributions and sporadic stragglers. When we do have a small bounded delay, we obtain the stronger bound (12), whose multiplier ranges from a constant to a mildly growing function of the maximum delay $\kappa_{\max}$. The whole expression often has a sublinear dependence on the expected delay $\kappa$: for instance, when the chosen norm is the Euclidean norm, the bound is essentially proportional to $\sqrt{\kappa}$ by Jensen's inequality. This is strictly better than Niu et al. [33], which has a quadratic dependence on the delay, and Liu et al. [30], which has an exponential dependence. Our mild dependence suggests that (12) remains of the same order even when we allow the maximum delay parameter to be quite large without significantly affecting the convergence. Note that this allows some workers to be delayed for several data passes.

Observe that when $\tau = 1$, where the result reduces to a lock-free variant of Bcfw, the additive constant becomes proportional to the product of the diameter and the gradient Lipschitz constant. This product is always greater than the corresponding curvature constant (see, e.g., [21, Appendix D]), but thanks to the flexibility of choosing the norm, the quantity under the most favorable norm is typically a small constant. For example, when $f$ is a quadratic function, the two quantities are of the same order (see Appendix D.2). For larger $\tau$, the ratio remains moderate for an appropriately chosen norm. Therefore, (11) and (12) are roughly of the same order as the curvature terms appearing in Theorem 1; for details, see our discussion in Appendix D.2.

Lastly, we remark that $\tau$ and $\kappa$ are not independent. When we increase $\tau$, we update the parameters less frequently, and $\kappa$ (measured in iterations) gets smaller. In a real distributed system with constant throughput, in terms of the number of oracle solves per second across all workers, and with an average delay that is a fixed amount of wall-clock time determined by communication, the product $\tau \kappa$ is roughly constant regardless of how $\tau$ is chosen.

3 Experiments

In this section, we experimentally demonstrate the performance gains of the three key features of our algorithm: minibatches of data, parallel workers, and asynchronous updates.

3.1 Minibatches of Data

We conduct simulations to study the effect of the mini-batch size $\tau$, where a larger $\tau$ implies a greater degree of parallelism, as each worker can solve one or more subproblems in a mini-batch. In our simulation for the structural SVM, we use a sequence-labeling task on a subset of the OCR dataset [37]. The subproblem can be solved using the Viterbi algorithm. The speedup on this dataset is shown in Figure 1(a). For this dataset, we use weighted averaging and line search throughout. We measure the speedup for a particular $\tau$ in terms of the number of epochs of Algorithm 1 required to converge, relative to $\tau = 1$, which corresponds to Bcfw. Figure 1(a) shows that Ap-Bcfw achieves linear speedup for small mini-batch sizes. Further speedup is sensitive to the convergence criterion, where more stringent thresholds lead to lower speedups. This is because large mini-batch sizes introduce errors, which reduce the progress per update; this is consistent with existing work on the effect of parameter staleness on convergence [19, 10]. It also suggests that it might be possible to use more workers initially for a large speedup and reduce parallelism as the algorithm approaches the optimum.

In our simulation for the Group Fused Lasso, we generate a piecewise-constant dataset with Gaussian noise. We use a primal suboptimality threshold as our convergence criterion. At each iteration, we solve $\tau$ subproblems (i.e., $\tau$ is the mini-batch size). Figure 1(b) shows the speedup over $\tau = 1$ (Bcfw). Similar to the structural SVM, the speedup is almost perfect for small $\tau$ but tapers off for larger $\tau$ to varying degrees depending on the convergence threshold.

Figure 1: Performance improvement with $\tau$ for (a) structural SVM on the OCR dataset [37] (n = 6251) and (b) Group Fused Lasso on a synthetic dataset (n = 100).

3.2 Shared Memory Parallel Workers

We implement Ap-Bcfw for the structural SVM in a multicore shared-memory system using the full OCR dataset. All shared-memory experiments were implemented in C++ and conducted on a 16-core machine with Intel(R) Xeon(R) CPU E5-2450 2.10GHz processors and 128G RAM. We first fix the number of workers and vary the mini-batch size $\tau$. Figure 2(a) shows the absolute convergence (i.e., the convergence per second). We note that Ap-Bcfw outperforms single-threaded Bcfw under all investigated $\tau$, showing the efficacy of parallelization. Within Ap-Bcfw, convergence improves with increasing mini-batch size up to a point, but worsens beyond it, as the error from the large mini-batch size dominates the additional computation. The optimal $\tau$ for a given number of workers depends on both the dataset (how “coupled” the coordinates are) and the system implementation (how costly synchronization is as the system scales).

Since the speedup for a given number of workers depends on $\tau$, we search for the optimal $\tau$ across multiples of the worker count to find the best speedup in each setting. Figure 2(b) shows faster convergence of Ap-Bcfw over Bcfw ($\tau = 1$) when more workers are available. It is important to note that the x-axis is wall-clock time rather than the number of epochs.

Figure 2(c) shows the speedup with a varying number of workers. Ap-Bcfw achieves near-linear speedup for smaller worker counts. The speedup curve tapers off for larger counts for two reasons: (1) more workers incur higher system overheads and thus need a larger $\tau$ to utilize the CPUs efficiently; (2) a larger $\tau$ incurs errors, as shown in Fig. 1(a). If the subproblems were more time-consuming to solve, the effect of system overhead would be reduced. We simulate harder subproblems by simply solving each of them a uniformly random number of times instead of just once. The speedup is then nearly perfect, as shown in Figure 2(d). Again, we observe that a more generous convergence threshold produces a higher speedup, suggesting that resource scheduling could be useful (e.g., allocate more CPUs initially and fewer as the algorithm converges).

Figure 2: From left: (a) Primal suboptimality vs. wall-clock time using 8 workers and various mini-batch sizes $\tau$. (b) Primal suboptimality vs. wall-clock time for a varying number of workers, with the best $\tau$ chosen for each setting separately. (c) Speedup via parallelization with the best $\tau$ chosen among multiples of the worker count. (d) The same with longer subproblems.

3.3 Performance gain with asynchronous updates

We compare Ap-Bcfw with a synchronous version of the algorithm (Sp-Bcfw) in which the server assigns subproblems to each worker, then waits for and accumulates the solutions before proceeding to the next iteration. We simulate workers of varying slow-downs in our shared-memory setup by assigning a return probability $p_i$ to each worker $i$. After solving each subproblem, worker $i$ reports the solution to the server with probability $p_i$. Thus a worker with $p_i = 0.8$ will drop 20% of its updates on average, corresponding to a 1.25x slow-down.
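
A toy sketch of this straggler simulation (illustration only, with hypothetical return probabilities): each worker solves subproblems at full speed but reports only a $p_i$ fraction of them on average.

```python
# Illustration only: each worker solves subproblems at full speed but only reports a
# solution with probability p_i, so its effective throughput is scaled by p_i.
# The probabilities below are hypothetical.
import random

def effective_updates(return_probs, solves_per_worker=1000, seed=0):
    """Count how many solved subproblems each worker actually reports to the server."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(solves_per_worker)) for p in return_probs]

# Seven full-speed workers (p = 1.0) and one straggler that drops half its updates:
print(effective_updates([1.0] * 7 + [0.5]))
```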

We use a fixed pool of workers for the experiments in this section. We first simulate the scenario with just one straggler, whose return probability we vary, while the other workers run at full speed ($p_i = 1$). Figure 3(a) shows that the average time per effective data pass (over 20 passes and 5 runs) of Ap-Bcfw stays almost unchanged as the slowdown factor of the straggler grows, whereas it increases linearly for Sp-Bcfw. This is because Ap-Bcfw relies on the average available worker processing power, while Sp-Bcfw is only as fast as the slowest worker.

Next, we simulate a heterogeneous environment where the workers have varying speeds, setting the return probabilities according to a single spread parameter. Figure 3(b) shows that Ap-Bcfw slows down by only a factor of 1.4 compared to the no-straggler case. Assuming that the server and the workers each take about half of the (wall-clock) time on average per epoch, we would expect the run time to increase by 50% if the average worker speed halves, which is the case at the extreme of our parameter range; therefore, a factor of 1.4 is reasonable. The performance of Sp-Bcfw is almost identical to that in the previous experiment, as its speed is determined by the slowest worker. Thus our experiments show that Ap-Bcfw is robust to stragglers and system heterogeneity.

Figure 3: Average time per effective data pass in asynchronous and synchronous modes for two cases: one worker is slow with a varying return probability (left); workers have return probabilities spread uniformly over a range (right). Times are normalized separately for Ap-Bcfw and Sp-Bcfw w.r.t. the setup where all workers run at full speed.

3.4 Convergence under unbounded heavy-tailed delay

In this section, we illustrate the mild effect of delay on convergence by randomly drawing an independent delay variable for each worker. For simplicity, we use $\tau = 1$ (Bcfw) on the same group fused lasso problem as in Section 3.1. We sample the delays using either a Poisson distribution or a heavy-tailed Pareto distribution (rounded to the nearest integer), with the shape and scale parameters of the Pareto distribution chosen to match a target expected delay. During the experiment, at iteration $k$, any update based on a delay greater than the threshold required by our theory is dropped. The results are shown in Figure 4. Observe that in both cases, the impact of the delay is rather mild: even with a substantial expected delay, the algorithm takes fewer than twice as many iterations to converge.
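
For reference, here is a small sketch of sampling integer-valued Poisson and Pareto delays with a matched expected delay (illustration only; the specific shape and scale values used in the paper are not reproduced here, so the numbers below are assumptions):

```python
# Illustration only (parameter values are assumptions, not the paper's): Poisson delays
# and heavy-tailed Pareto delays rounded to the nearest integer, with the same mean.
import numpy as np

rng = np.random.default_rng(0)
kappa = 5.0                                   # target expected delay (assumed)
poisson_delays = rng.poisson(lam=kappa, size=10_000)

shape = 2.5                                   # assumed Pareto shape parameter (> 1)
scale = kappa * (shape - 1) / shape           # chosen so the Pareto mean equals kappa
pareto_delays = np.rint(scale * (1 + rng.pareto(shape, size=10_000))).astype(int)

print(poisson_delays.mean(), pareto_delays.mean())   # both close to kappa
```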

Figure 4: Illustration of the convergence of Bcfw with delayed updates. On the left, the delay is sampled from a Poisson distribution; on the right, from a Pareto distribution. We run each problem until the duality gap reaches a fixed threshold.

4 Conclusion

In this paper, we propose an asynchronous parallel generalization of the block-coordinate Frank-Wolfe method [24] and provide intuitive conditions under which it has a provable speedup over Bcfw. The asynchronous updates allow our method to be robust to stragglers and node failures, as the speed of Ap-Bcfw depends on the average worker speed rather than on the slowest. We demonstrate the effectiveness of the algorithm on the structural SVM and Group Fused Lasso with both controlled simulations and real-data experiments on a multicore workstation. For the structural SVM, it leads to a speedup over the state-of-the-art Bcfw by an order of magnitude using 16 parallel processors. As a projection-free Frank-Wolfe method, we expect our algorithm to be very competitive in large-scale constrained optimization problems, especially when projections are expensive. Future work includes an analysis for the strongly convex case and, ultimately, releasing a carefully implemented software package for practitioners to deploy in big-data applications.

References

  • Alaíz et al. [2013] Alaíz, Carlos M, Barbero, Álvaro, and Dorronsoro, José R. Group fused lasso. In

    Artificial Neural Networks and Machine Learning–ICANN 2013

    , pp. 66–73. Springer, 2013.
  • Bach [2013] Bach, Francis. Conditional gradients everywhere. 2013.
  • Beck & Tetruashvili [2013] Beck, Amir and Tetruashvili, Luba. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
  • Bellet et al. [2014] Bellet, Aurélien, Liang, Yingyu, Garakani, Alireza Bagheri, Balcan, Maria-Florina, and Sha, Fei. Distributed frank-wolfe algorithm: A unified framework for communication-efficient sparse learning. CoRR, abs/1404.2644, 2014.
  • Bleakley & Vert [2011] Bleakley, Kevin and Vert, Jean-Philippe. The group fused lasso for multiple change-point detection. arXiv, 2011.
  • Bredies et al. [2009] Bredies, Kristian, Lorenz, Dirk A, and Maass, Peter. A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193, 2009.
  • Clarkson [2010] Clarkson, Kenneth L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
  • Collins et al. [2008] Collins, Michael, Globerson, Amir, Koo, Terry, Carreras, Xavier, and Bartlett, Peter L. Exponentiated gradient algorithms for conditional random fields and max-margin markov networks. JMLR, 9:1775–1822, 2008.
  • Dai et al. [2013] Dai, Wei, Wei, Jinliang, Zheng, Xun, Kim, Jin Kyu, Lee, Seunghak, Yin, Junming, Ho, Qirong, and Xing, Eric P. Petuum: A framework for iterative-convergent distributed ml. arXiv:1312.7651, 2013.
  • Dai et al. [2014] Dai, Wei, Kumar, Abhimanu, Wei, Jinliang, Ho, Qirong, Gibson, Garth, and Xing, Eric P. High-performance distributed ml at scale through parameterserver consistency models. In AAAI, 2014.
  • Fercoq & Richtárik [2015] Fercoq, Olivier and Richtárik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.
  • Foygel et al. [2012] Foygel, Rina, Horrell, Michael, Drton, Mathias, and Lafferty, John D. Nonparametric reduced rank regression. In NIPS’12, pp. 1628–1636, 2012.
  • Frank & Wolfe [1956] Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
  • Freund & Grigas [2014] Freund, Robert M. and Grigas, Paul. New analysis and results for the frank–wolfe method. Mathematical Programming, 155(1):199–230, 2014. ISSN 1436-4646.
  • Fujishige & Isotani [2011] Fujishige, Satoru and Isotani, Shigueo. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.
  • Garber & Hazan [2013] Garber, Dan and Hazan, Elad. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv:1301.4666, 2013.
  • Harchaoui et al. [2015] Harchaoui, Zaid, Juditsky, Anatoli, and Nemirovski, Arkadi. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015. ISSN 0025-5610.
  • Hazan & Kale [2012] Hazan, Elad and Kale, Satyen. Projection-free online learning. In ICML’12, 2012.
  • Ho et al. [2013] Ho, Qirong, Cipar, James, Cui, Henggang, Lee, Seunghak, Kim, Jin Kyu, Gibbons, Phillip B., Gibson, Garth A., Ganger, Greg, and Xing, Eric. More effective distributed ml via a stale synchronous parallel parameter server. In NIPS’13. 2013.
  • Jaggi [2011] Jaggi, Martin. Sparse convex optimization methods for machine learning. PhD thesis, ETH Zürich, 2011.
  • Jaggi [2013] Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML’13, pp. 427–435, 2013.
  • Jegelka et al. [2013] Jegelka, Stefanie, Bach, Francis, and Sra, Suvrit. Reflection methods for user-friendly submodular optimization. In NIPS’13, pp. 1313–1321, 2013.
  • Lacoste-Julien & Jaggi [2015] Lacoste-Julien, Simon and Jaggi, Martin. On the global linear convergence of frank-wolfe optimization variants. In NIPS’15, pp. 496–504, 2015.
  • Lacoste-Julien et al. [2013] Lacoste-Julien, Simon, Jaggi, Martin, Schmidt, Mark, and Pletscher, Patrick. Block-coordinate frank-wolfe optimization for structural svms. In ICML’13, pp. 53–61, 2013.
  • Lafond et al. [2015] Lafond, Jean, Wai, Hoi-To, and Moulines, Eric. Convergence analysis of a stochastic projection-free algorithm. arXiv:1510.01171, 2015.
  • LeBlanc et al. [1975] LeBlanc, Larry J, Morlok, Edward K, and Pierskalla, William P. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation Research, 9(5):309–318, 1975.
  • Lee et al. [2014] Lee, Seunghak, Kim, Jin Kyu, Zheng, Xun, Ho, Qirong, Gibson, Garth A, and Xing, Eric P. On model parallelization and scheduling strategies for distributed machine learning. In NIPS’14, pp. 2834–2842, 2014.
  • Li et al. [2013] Li, Mu, Zhou, Li, Yang, Zichao, Li, Aaron, Xia, Fei, Andersen, David G, and Smola, Alexander. Parameter server for distributed machine learning. In NIPS Workshop: Big Learning, 2013.
  • Liu et al. [2013] Liu, Ji, Musialski, Przemyslaw, Wonka, Peter, and Ye, Jieping. Tensor completion for estimating missing values in visual data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):208–220, 2013.
  • Liu et al. [2014] Liu, Ji, Wright, Stephen J, Ré, Christopher, and Bittorf, Victor. An asynchronous parallel stochastic coordinate descent algorithm. JMLR, 2014.
  • Mitzenmacher [2001] Mitzenmacher, Michael. The power of two choices in randomized load balancing. Parallel and Distributed Systems, IEEE Transactions on, 12(10):1094–1104, 2001.
  • Nesterov [2012] Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • Niu et al. [2011] Niu, Feng, Recht, Benjamin, Ré, Christopher, and Wright, Stephen J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv:1106.5730, 2011.
  • Ouyang & Gray [2010] Ouyang, Hua and Gray, Alexander G. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. In SDM, 2010.
  • Raab & Steger [1998] Raab, Martin and Steger, Angelika. Balls into bins - a simple and tight analysis. In Randomization and Approximation Techniques in Computer Science, pp. 159–170. Springer, 1998.
  • Richtárik & Takáč [2012] Richtárik, Peter and Takáč, Martin. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
  • Taskar et al. [2004] Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS’04, pp. 25–32. MIT Press, 2004.
  • Tsitsiklis et al. [1986] Tsitsiklis, John N, Bertsekas, Dimitri P, Athans, Michael, et al. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9), 1986.
  • Yu & Joachims [2009] Yu, Chun-Nam John and Joachims, Thorsten. Learning structural svms with latent variables. In ICML’09, pp. 1169–1176. ACM, 2009.
  • Zhang et al. [2012] Zhang, Xinhua, Yu, Yaoliang, and Schuurmans, Dale. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS’12, pp. 2915–2923, 2012.
  • Zhang et al. [2013] Zhang, Xinhua, Yu, Yao-Liang, and Schuurmans, Dale. Polar operators for structured sparse estimation. In NIPS’13, pp. 82–90, 2013.

Appendix A Convergence analysis

We provide a self-contained convergence proof in this section. The skeleton of our convergence proof follows closely that of Lacoste-Julien et al. [24] and Jaggi [21]. There are a few subtle modifications and improvements that we need to add due to our weaker definition of the approximate oracle call, which is nearly correct only in expectation. The delayed-update convergence analysis is, to the best of our knowledge, new; it uses a simple result on load balancing [31].

Note that, for clarity of presentation, we focus on the primal and primal-dual convergence of the version of the algorithm with predefined step sizes and an additive approximate subroutine; it is straightforward to extend the same analysis to the line-search variant and to multiplicative approximation.

A.1 Primal Convergence

Lemma 2.

Denote by $h_k$ the gap between the current and the optimal objective value. The iterative updates in Algorithm 1 (with an arbitrary fixed stepsize or with coordinate line search) obey

where the expectation is taken over the joint randomness up to iteration $k$.

Proof.

We introduce shorthand notation for convenience, and prove the result for Algorithm 1 first. Applying the definition of the set curvature and then the definition of the additive approximation in (6), we get

Subtracting the optimal value on both sides, we get:

Now taking the expectation over the entire history and then applying (6) and the definition of the surrogate duality gap (7), we obtain

(13)

The last inequality follows from the property of the surrogate duality gap, namely that it upper-bounds the primal suboptimality. This completes the proof of the descent lemma. ∎

Now we are ready to state the proof for Theorem 1.

Proof of Theorem 1.

We follow the proof of Theorem C.1 in Lacoste-Julien et al. [24] to prove the statement for Algorithm 1. The difference is that we use a different, carefully chosen sequence of step sizes.

Taking the prescribed step size and using the shorthand notation above, the inequality in Lemma 2 simplifies to

We now prove the claimed bound by induction on $k$. The base case is trivially true. Assuming that the claim holds up to $k$, we apply the induction hypothesis, and the above inequality reduces to

This completes the induction and hence the proof of primal convergence for Algorithm 1. ∎

A.2 Convergence of the surrogate duality gap

Proof of Theorem 2.

We closely mimic the proof of the analogous result in Lacoste-Julien et al. [24, Section C.3], and we use the same notation as in the proof of primal convergence. First, from (13) in the proof of Lemma 2, we have

Rearranging the terms, we get

(14)

The idea is that if we take an arbitrary convex combination of the per-iteration gaps, the result lies within their convex hull, namely between the minimum and the maximum, hence proving the existence claim in the theorem. Choosing suitable weights (with an appropriate normalization constant) and taking the convex combination of both sides of (14), we have

(15)

Note that the omitted term is negative, so we may simply drop it in the last line. Applying the chosen step size, we get

Plugging the above back into (15) and using the stated bound, we get

This completes the proof. ∎

Proof of Convergence with Delayed Gradient

The idea is that we are going to treat the updates calculated from the delayed gradients as an additive error and then invoke our convergence results that allow the oracle to be approximate. We will first present a lemma that we will use for the proof of Theorem 4.

Lemma 3.

Let , be a norm, ,