DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization

10/13/2017 ∙ by Lin Xiao et al. ∙ Microsoft, The University of Iowa, Carnegie Mellon University

Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.

1 Introduction

Algorithms and systems for distributed optimization are critical for solving large-scale machine learning problems, especially when the dataset cannot fit into the memory or storage of a single machine. In this paper, we consider distributed optimization problems of the form

$\min_{w \in \mathbb{R}^d} \ \left\{ \frac{1}{m}\sum_{i=1}^m f_i(X_i w) + g(w) \right\},$    (1)

where $X_i \in \mathbb{R}^{N_i \times d}$ is the local data stored at the $i$-th machine, $f_i: \mathbb{R}^{N_i} \to \mathbb{R}$ is a convex cost function associated with the linear mapping $X_i w$, and $g: \mathbb{R}^d \to \mathbb{R}$ is a convex regularization function. In addition, we assume that $g$ is separable, i.e., for some integer $n > 0$, we can write

$g(w) \;=\; \sum_{k=1}^n g_k(w_k),$    (2)

where $g_k: \mathbb{R}^{d_k} \to \mathbb{R}$, and the $w_k \in \mathbb{R}^{d_k}$ for $k = 1, \ldots, n$ are non-overlapping subvectors of $w$ with $\sum_{k=1}^n d_k = d$ (they form a partition of $w$). Many popular regularization functions in machine learning are separable, for example, $g(w) = (\lambda/2)\|w\|_2^2$ or $g(w) = \lambda\|w\|_1$ for some $\lambda > 0$.

An important special case of (1) is distributed empirical risk minimization (ERM) of linear predictors. Let $(x_1, y_1), \ldots, (x_N, y_N)$ be $N$ training examples, where each $x_j \in \mathbb{R}^d$ is a feature vector and $y_j$ is its label. The ERM problem is formulated as

$\min_{w \in \mathbb{R}^d} \ \left\{ \frac{1}{N}\sum_{j=1}^N \phi_j(x_j^T w) + g(w) \right\},$    (3)

where each $\phi_j$ is a loss function measuring the mismatch between the linear prediction $x_j^T w$ and the label $y_j$. Popular loss functions in machine learning include, e.g., for regression, the squared loss $\phi_j(t) = \frac{1}{2}(t - y_j)^2$, and for classification, the logistic loss $\phi_j(t) = \log(1 + \exp(-y_j t))$ where $y_j \in \{\pm 1\}$. In the distributed optimization setting, the $N$ examples are divided into $m$ subsets, each stored on a different machine. For $i = 1, \ldots, m$, let $\mathcal{I}_i$ denote the subset of $\{1, \ldots, N\}$ stored at machine $i$ and let $N_i = |\mathcal{I}_i|$ (they satisfy $\sum_{i=1}^m N_i = N$). Then the ERM problem (3) can be written in the form of (1) by letting $X_i$ consist of $x_j^T$ with $j \in \mathcal{I}_i$ as its rows and defining $f_i: \mathbb{R}^{N_i} \to \mathbb{R}$ as

$f_i\bigl(u_{\mathcal{I}_i}\bigr) \;=\; \frac{m}{N}\sum_{j \in \mathcal{I}_i} \phi_j(u_j),$    (4)

where $u_{\mathcal{I}_i}$ is a subvector of $u \in \mathbb{R}^N$, consisting of $u_j$ with $j \in \mathcal{I}_i$.
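
For illustration, the following Python/NumPy sketch shows one way to split the ERM data into $m$ row blocks and form the local functions $f_i$ in (4); it uses the logistic loss and synthetic data, and the function names are illustrative rather than part of any reference implementation.

```python
import numpy as np

def logistic_loss(t, y):
    # phi_j(t) = log(1 + exp(-y_j * t)) for labels y_j in {-1, +1}
    return np.log1p(np.exp(-y * t))

def make_local_losses(X, y, m):
    # Split the N examples into m row blocks I_1, ..., I_m and build the local
    # functions f_i(u_i) = (m/N) * sum_{j in I_i} phi_j(u_{i,j}) as in (4).
    N = X.shape[0]
    blocks = np.array_split(np.arange(N), m)
    X_parts = [X[I] for I in blocks]          # X_i: local data at machine i

    def f_i(i, u_i):
        return (m / N) * np.sum(logistic_loss(u_i, y[blocks[i]]))

    return X_parts, f_i

# sanity check: (1/m) sum_i f_i(X_i w) equals (1/N) sum_j phi_j(x_j^T w)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.choice([-1.0, 1.0], size=100)
X_parts, f_i = make_local_losses(X, y, m=4)
w = rng.standard_normal(20)
assert np.isclose(np.mean([f_i(i, X_parts[i] @ w) for i in range(4)]),
                  np.mean(logistic_loss(X @ w, y)))
```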

The nature of distributed algorithms and their convergence properties largely depend on the model of the communication network that connects the $m$ computing machines. A popular setting in the literature is to model the communication network as a graph, where each node can communicate (in one step) only with its neighbors connected by an edge, either synchronously or asynchronously (e.g., Bertsekas and Tsitsiklis, 1989; Nedić and Ozdaglar, 2009). The convergence rates of distributed algorithms in this setting often depend on characteristics of the graph, such as its diameter and the eigenvalues of the graph Laplacian (e.g., Xiao and Boyd, 2006; Duchi et al., 2012; Nedić et al., 2016; Scaman et al., 2017). This is often called the decentralized setting.

Another model for the communication network is centralized, where all the machines participate in synchronous, collective communication, e.g., broadcasting a vector to all $m$ machines, or computing the sum of $m$ vectors, each from a different machine (AllReduce). These collective communication protocols hide the underlying implementation details, which often involve operations on graphs. They are adopted by many popular distributed computing standards and packages, such as MPI (MPI Forum, 2012), MapReduce (Dean and Ghemawat, 2008) and Apache Spark (Zaharia et al., 2016), and are widely used in machine learning practice (e.g., Lin et al., 2014; Meng et al., 2016). In particular, collective communications are very useful for exploiting data parallelism, i.e., allowing different machines to work in parallel to improve the same model using their local datasets. A disadvantage of collective communications is their synchronization cost: faster machines, or machines with lighter computing tasks, must sit idle while waiting for the other machines to finish their tasks before they can participate in a collective communication.

One effective approach for reducing the synchronization cost is to exploit model parallelism (here "model" refers to $w$, including all optimization variables). The idea is to allow different machines to work in parallel with different versions of the full model, or with different parts of a common model, with little or no synchronization. The model partitioning approach can be very effective for solving problems with large models (large dimension $d$). Dedicated parameter servers can be set up to store and maintain different subsets of the model parameters, such as the $w_k$'s in (2), and to be responsible for coordinating their updates at different workers (Li et al., 2014; Xing et al., 2015). This requires flexible point-to-point communication.

In this paper, we develop a family of randomized algorithms that exploit simultaneous data and model parallelism. Correspondingly, we adopt a centralized communication model that supports both synchronous collective communication and asynchronous point-to-point communication. In particular, it allows any pair of machines to send/receive a message in a single step, and multiple point-to-point communications may happen in parallel in an event-driven, asynchronous manner. Such a communication model is well supported by the MPI standard. To evaluate the performance of distributed algorithms in this setting, we consider the following three measures.

  • Computation complexity: the total amount of computation, measured by the number of passes over all datasets $X_i$ for $i = 1, \ldots, m$, which can happen in parallel on different machines.

  • Communication complexity: the total amount of communication required, measured by the equivalent number of vectors in $\mathbb{R}^d$ sent or received across all machines.

  • Synchronous communication: the total number of vectors in $\mathbb{R}^d$ that require synchronous collective communication involving all $m$ machines. We single it out from the overall communication complexity as a (partial) measure of the synchronization cost.

In Section 2, we introduce the framework of our randomized algorithms, Doubly Stochastic Coordinate Optimization with Variance Reduction (DSCOVR), and summarize our theoretical results on the three measures achieved by DSCOVR. Compared with other first-order methods for distributed optimization, we show that DSCOVR may require less overall computation and communication, and less or no synchronization. Then we present the details of several DSCOVR variants and their convergence analysis in Sections 3-6. We discuss the implementation of different DSCOVR algorithms in Section 7, and present results of our numerical experiments in Section 8.

2 The DSCOVR Framework and Main Results

Figure 1: Partition of the primal variable $w$, the dual variable $\alpha$, and the data matrix $X$.

First, we derive a saddle-point formulation of the convex optimization problem (1). Let $f_i^*$ be the convex conjugate of $f_i$, i.e., $f_i^*(\alpha_i) \triangleq \sup_{u_i \in \mathbb{R}^{N_i}} \{\alpha_i^T u_i - f_i(u_i)\}$, and define

$L(w, \alpha) \;\triangleq\; g(w) + \frac{1}{m}\sum_{i=1}^m \alpha_i^T X_i w - \frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i),$    (5)

where $\alpha = [\alpha_1; \ldots; \alpha_m] \in \mathbb{R}^N$. Since both the $f_i$'s and $g$ are convex, $L(w, \alpha)$ is convex in $w$ and concave in $\alpha$. We also define a pair of primal and dual functions:

$P(w) \;=\; \max_{\alpha} L(w, \alpha) \;=\; \frac{1}{m}\sum_{i=1}^m f_i(X_i w) + g(w),$    (6)
$D(\alpha) \;=\; \min_{w} L(w, \alpha) \;=\; -\frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i) - g^*\Bigl(-\frac{1}{m}\sum_{i=1}^m X_i^T \alpha_i\Bigr),$    (7)

where $P(w)$ is exactly the objective function in (1) [1] and $g^*$ is the convex conjugate of $g$. We assume that $L$ has a saddle point $(w^\star, \alpha^\star)$, that is,

$L(w^\star, \alpha) \;\le\; L(w^\star, \alpha^\star) \;\le\; L(w, \alpha^\star), \qquad \forall\, w \in \mathbb{R}^d,\ \alpha \in \mathbb{R}^N.$

In this case, we have $w^\star = \arg\min_w P(w)$ and $\alpha^\star = \arg\max_\alpha D(\alpha)$, and $P(w^\star) = D(\alpha^\star)$.

[1] More technically, we need to assume that each $f_i$ is convex and lower semi-continuous so that $f_i^{**} = f_i$ (see, e.g., Rockafellar, 1970, Section 12). This automatically holds if $f_i$ is convex and differentiable, which we will assume later.
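
For concreteness, the expression for $D(\alpha)$ in (7) follows from a short conjugate calculation using the definitions above:

$D(\alpha) \;=\; \min_w L(w, \alpha) \;=\; \min_w \Bigl\{ g(w) + \Bigl(\frac{1}{m}\sum_{i=1}^m X_i^T \alpha_i\Bigr)^{\!T} w \Bigr\} - \frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i) \;=\; -\,g^*\Bigl(-\frac{1}{m}\sum_{i=1}^m X_i^T \alpha_i\Bigr) - \frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i),$

where the last step uses the identity $\min_w \{ g(w) + u^T w \} = -\sup_w \{ (-u)^T w - g(w) \} = -g^*(-u)$.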

The DSCOVR framework is based on solving the convex-concave saddle-point problem

$\min_{w \in \mathbb{R}^d} \;\max_{\alpha \in \mathbb{R}^N}\; L(w, \alpha).$    (8)

Since we assume that $g$ has a separable structure as in (2), we rewrite the saddle-point problem as

$\min_{w} \;\max_{\alpha}\; \Biggl\{ \sum_{k=1}^n g_k(w_k) + \frac{1}{m}\sum_{i=1}^m \sum_{k=1}^n \alpha_i^T X_{ik} w_k - \frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i) \Biggr\},$    (9)

where $X_{ik} \in \mathbb{R}^{N_i \times d_k}$ for $k = 1, \ldots, n$ are column partitions of $X_i$. For convenience, we define the following notation. First, let $X = [X_1; \ldots; X_m] \in \mathbb{R}^{N \times d}$ be the overall data matrix, obtained by stacking the $X_i$'s vertically. Conforming to the separation of $g$, we also partition $X$ into block columns $X_{:k} \in \mathbb{R}^{N \times d_k}$ for $k = 1, \ldots, n$, where each $X_{:k} = [X_{1k}; \ldots; X_{mk}]$ (stacked vertically). For consistency, we also use $X_{i:}$ to denote $X_i$ from now on. See Figure 1 for an illustration.
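
As a concrete illustration of this notation, the following NumPy sketch partitions a data matrix into the blocks $X_{ik}$ (rows by machine, columns by model block); the helper name and the toy sizes are hypothetical.

```python
import numpy as np

def block_partition(X, row_sizes, col_sizes):
    # Split X (N x d) into an m x n grid of blocks X_{ik} of shape (N_i x d_k),
    # as in (9) and Figure 1: rows by machine, columns by model block.
    row_ends = np.cumsum(row_sizes)[:-1]
    col_ends = np.cumsum(col_sizes)[:-1]
    row_blocks = np.split(X, row_ends, axis=0)            # X_{1:}, ..., X_{m:}
    return [np.split(Xi, col_ends, axis=1) for Xi in row_blocks]

# toy usage: m = 3 machines, n = 4 model blocks
X = np.arange(12 * 10).reshape(12, 10).astype(float)
blocks = block_partition(X, row_sizes=[5, 4, 3], col_sizes=[3, 3, 2, 2])
assert blocks[1][2].shape == (4, 2)                       # block X_{ik} with i=2, k=3
```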

0:  initial points $w^{(0)} \in \mathbb{R}^d$, $\alpha^{(0)} \in \mathbb{R}^N$, and step sizes $\sigma_i$ for $i = 1, \ldots, m$ and $\tau_k$ for $k = 1, \ldots, n$.
1:  for $t = 0, 1, 2, \ldots$ do
2:      pick $j \in \{1, \ldots, m\}$ and $l \in \{1, \ldots, n\}$ randomly with distributions $p$ and $q$ respectively.
3:      compute variance-reduced stochastic gradients $u_j^{(t+1)}$ and $v_l^{(t+1)}$.
4:      update primal and dual block coordinates:
$\alpha_j^{(t+1)} = \mathrm{prox}_{\sigma_j f_j^*}\bigl(\alpha_j^{(t)} + \sigma_j u_j^{(t+1)}\bigr),$    (10)
$w_l^{(t+1)} = \mathrm{prox}_{\tau_l g_l}\bigl(w_l^{(t)} - \tau_l v_l^{(t+1)}\bigr),$    (11)
        while keeping $\alpha_i^{(t+1)} = \alpha_i^{(t)}$ for $i \neq j$ and $w_k^{(t+1)} = w_k^{(t)}$ for $k \neq l$.
5:  end for
Algorithm 1 DSCOVR framework

We exploit the doubly separable structure in (9) by a doubly stochastic coordinate update algorithm outlined in Algorithm 1. Let $p = (p_1, \ldots, p_m)$ and $q = (q_1, \ldots, q_n)$ be two probability distributions. During each iteration $t$, we randomly pick an index $j \in \{1, \ldots, m\}$ with probability $p_j$, and independently pick an index $l \in \{1, \ldots, n\}$ with probability $q_l$. Then we compute two vectors $u_j^{(t+1)} \in \mathbb{R}^{N_j}$ and $v_l^{(t+1)} \in \mathbb{R}^{d_l}$ (details to be discussed later), and use them to update the block coordinates $\alpha_j$ and $w_l$ while leaving the other block coordinates unchanged. The update formulas in (10) and (11) use the proximal mappings of the (scaled) functions $f_j^*$ and $g_l$ respectively. We recall that the proximal mapping of any convex function $\phi$ is defined as

$\mathrm{prox}_{\phi}(v) \;\triangleq\; \arg\min_{x} \Bigl\{ \phi(x) + \frac{1}{2}\|x - v\|^2 \Bigr\}.$
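
For many common regularizers and conjugate loss functions, these proximal mappings have simple closed forms. The snippet below gives a few examples as a minimal sketch; the squared-loss conjugate is just one convenient instance, and none of these choices are mandated by the framework.

```python
import numpy as np

def prox_l2(v, lam, tau):
    # prox of tau*g with g(w) = (lam/2)*||w||^2: simple shrinkage
    return v / (1.0 + tau * lam)

def prox_l1(v, lam, tau):
    # prox of tau*g with g(w) = lam*||w||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - tau * lam, 0.0)

def prox_conj_squared_loss(a, b, sigma):
    # prox of sigma*f^* for f(u) = 0.5*||u - b||^2, whose conjugate is
    # f^*(a) = 0.5*||a||^2 + a^T b
    return (a - sigma * b) / (1.0 + sigma)
```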

There are several different ways to compute the vectors $u_j^{(t+1)}$ and $v_l^{(t+1)}$ in Step 3 of Algorithm 1. They should be the partial gradients or stochastic gradients of the bilinear coupling term in $L(w, \alpha)$ with respect to $\alpha_j$ and $w_l$ respectively. Let

$K(w, \alpha) \;\triangleq\; \alpha^T X w \;=\; \sum_{i=1}^m \sum_{k=1}^n \alpha_i^T X_{ik} w_k,$

which is the bilinear term in $L(w, \alpha)$ without the factor $1/m$. We can use the following partial gradients in Step 3:

$u_j^{(t+1)} = X_{j:} w^{(t)} = \sum_{k=1}^n X_{jk} w_k^{(t)}, \qquad v_l^{(t+1)} = \frac{1}{m}\sum_{i=1}^m (X_{il})^T \alpha_i^{(t)}.$    (12)

We note that the factor $1/m$ does not appear in the first equation because it multiplies both $\alpha_j^T X_{j:} w$ and $f_j^*(\alpha_j)$ in (9) and hence does not appear in updating $\alpha_j$. Another choice is to use

$u_j^{(t+1)} = \frac{1}{q_l} X_{jl} w_l^{(t)}, \qquad v_l^{(t+1)} = \frac{1}{p_j m} (X_{jl})^T \alpha_j^{(t)},$    (13)

which are unbiased stochastic partial gradients, because

$\mathbf{E}_l\bigl[u_j^{(t+1)}\bigr] = \sum_{k=1}^n q_k \frac{1}{q_k} X_{jk} w_k^{(t)} = X_{j:} w^{(t)}, \qquad \mathbf{E}_j\bigl[v_l^{(t+1)}\bigr] = \sum_{i=1}^m p_i \frac{1}{p_i m} (X_{il})^T \alpha_i^{(t)} = \frac{1}{m}\sum_{i=1}^m (X_{il})^T \alpha_i^{(t)},$

where $\mathbf{E}_l$ and $\mathbf{E}_j$ denote expectations with respect to the random indices $l$ and $j$ respectively.

It can be shown that Algorithm 1 converges to a saddle point of $L(w, \alpha)$ with either choice (12) or (13) in Step 3, provided suitable step sizes $\sigma_j$ and $\tau_l$ are used. It is expected that using the stochastic gradients in (13) leads to a slower convergence rate than applying (12). However, using (13) has the advantage of much less computation during each iteration. Specifically, it employs only one block matrix-vector multiplication for both updates, instead of the $n$ and $m$ block multiplications needed in (12).
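
To make the framework concrete, here is a sequential Python/NumPy sketch of Algorithm 1 with the stochastic gradients (13), instantiated for ridge regression ($f_i(u_i) = (m/N)\,\frac{1}{2}\|u_i - y_i\|^2$ and $g(w) = \frac{\lambda}{2}\|w\|^2$, so both proximal mappings have closed forms) and uniform sampling. The function name, data layout and step sizes are illustrative choices, not the theoretical values analyzed later.

```python
import numpy as np

def dscovr_stochastic(X_blocks, y_parts, lam, sigma, tau, iters, seed=0):
    # X_blocks[i][k] is the block X_{ik}; y_parts[i] are the labels on machine i.
    rng = np.random.default_rng(seed)
    m, n = len(X_blocks), len(X_blocks[0])
    N = sum(len(yi) for yi in y_parts)
    c = m / N                                          # scaling inside f_i, cf. (4)
    w = [np.zeros(X_blocks[0][k].shape[1]) for k in range(n)]    # primal blocks w_k
    alpha = [np.zeros(len(yi)) for yi in y_parts]                # dual blocks alpha_i
    for _ in range(iters):
        j, l = rng.integers(m), rng.integers(n)        # step 2: uniform sampling
        u_j = n * (X_blocks[j][l] @ w[l])              # (13): (1/q_l) X_{jl} w_l, q_l = 1/n
        v_l = X_blocks[j][l].T @ alpha[j]              # (13): (1/(p_j m)) X_{jl}^T alpha_j, p_j = 1/m
        a = alpha[j] + sigma * u_j                     # argument of (10)
        alpha[j] = (a - sigma * y_parts[j]) / (1.0 + sigma / c)   # prox of sigma*f_j^* (squared loss)
        v = w[l] - tau * v_l                           # argument of (11)
        w[l] = v / (1.0 + tau * lam)                   # prox of tau*g_l (ridge shrinkage)
    return np.concatenate(w), alpha
```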

More importantly, the choice in (13) is suitable for parallel and distributed computing. To see this, let $(j(t), l(t))$ denote the pair of random indices drawn at iteration $t$ (we omit the superscript $t$ to simplify notation whenever there is no confusion from the context). Suppose that for a sequence of $m$ consecutive iterations $t, t+1, \ldots, t+m-1$, there is no common index among $j(t), \ldots, j(t+m-1)$, nor among $l(t), \ldots, l(t+m-1)$; then these $m$ iterations can be done in parallel, and they produce the same updates as if they were done sequentially. Suppose there are $m$ processors or machines; then each can carry out one iteration, which includes the updates in (13) as well as (10) and (11). These iterations are independent of each other, and in fact can be done in any order, because each involves only one primal block $w_{l(t)}$ and one dual block $\alpha_{j(t)}$, for both input and output (the variables on the right and left sides of the assignments, respectively). In contrast, the inputs to the updates in (12) depend on all primal and dual blocks at the previous iteration, so they cannot be done in parallel.

Figure 2: Simultaneous data and model parallelism. At any given time, each machine is busy updating one parameter block and its own dual variable. Whenever some machine is done, it is assigned to work on a random block that is not being updated.

In practice, suppose we have $m$ machines for solving problem (9), and each holds the data matrix $X_{i:}$ in memory and maintains the dual block $\alpha_i$, for $i = 1, \ldots, m$. We assume that the number of model partitions $n$ is larger than $m$, and the $n$ model blocks $w_1, \ldots, w_n$ are stored at one or more parameter servers. In the beginning, we can randomly pick $m$ model blocks (sampling without replacement) from $\{1, \ldots, n\}$, and assign each machine to update one of them. If machine $i$ is assigned to update block $k$, then both $\alpha_i$ and $w_k$ are updated, using only the matrix $X_{ik}$; moreover, machine $i$ needs to communicate only the block $w_k$ with the parameter server that is responsible for maintaining it. Whenever one machine finishes its update, a scheduler can randomly pick another parameter block that is not currently being updated by any other machine, and assign it to the free machine. Therefore all machines can work in parallel, in an asynchronous, event-driven manner. Here an event is the completion of a block update at any machine, as illustrated in Figure 2. We will discuss the implementation details in Section 7.
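
The following toy simulation illustrates the scheduling policy just described (machines are assigned random blocks that nobody else is updating, and are reassigned as soon as they finish); the random durations and all names are made up purely for illustration.

```python
import heapq
import random

def simulate_schedule(m, n, num_updates, seed=0):
    # m machines, n parameter blocks with n larger than m; events are completions
    rng = random.Random(seed)
    free_blocks = set(range(n))
    events = []                                      # heap of (finish_time, machine, block)
    for machine in range(m):                         # initial assignment: m distinct blocks
        block = rng.choice(sorted(free_blocks))
        free_blocks.remove(block)
        heapq.heappush(events, (rng.uniform(1.0, 2.0), machine, block))
    log = []
    for _ in range(num_updates):
        t, machine, block = heapq.heappop(events)    # next machine to finish
        log.append((round(t, 2), machine, block))
        free_blocks.add(block)                       # its block becomes free again
        new_block = rng.choice(sorted(free_blocks))  # assign a block nobody is updating
        free_blocks.remove(new_block)
        heapq.heappush(events, (t + rng.uniform(1.0, 2.0), machine, new_block))
    return log

print(simulate_schedule(m=3, n=8, num_updates=10))
```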

The idea of using doubly stochastic updates for distributed optimization is not new. It has been studied by Yun et al. (2014) for solving the matrix completion problem, and by Matsushima et al. (2014) for solving the saddle-point formulation of the ERM problem. Despite their nice features for parallelization, these algorithms inherit the $O(1/\sqrt{t})$ (or $O(1/t)$ with strong convexity) sublinear convergence rate of the classical stochastic gradient method, which translates into high communication and computation cost for distributed optimization. In this paper, we propose new variants of doubly stochastic update algorithms by using variance-reduced stochastic gradients (Step 3 of Algorithm 1). More specifically, we borrow the variance-reduction techniques from SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014) to develop the DSCOVR algorithms, which enjoy fast linear rates of convergence. In the rest of this section, we summarize our theoretical results characterizing the three measures for DSCOVR: computation complexity, communication complexity, and synchronization cost. We compare them with a distributed implementation of batch first-order algorithms.

2.1 Summary of Main Results

Throughout this paper, we use $\|\cdot\|$ to denote the standard Euclidean norm for vectors. For matrices, $\|\cdot\|$ denotes the operator (spectral) norm and $\|\cdot\|_F$ denotes the Frobenius norm. We make the following assumption regarding the optimization problem (1).

Assumption 1

Each $f_i: \mathbb{R}^{N_i} \to \mathbb{R}$ is convex and differentiable, and its gradient is $(1/\gamma_i)$-Lipschitz continuous, i.e.,

$\|\nabla f_i(u) - \nabla f_i(v)\| \;\le\; \frac{1}{\gamma_i}\|u - v\|, \qquad \forall\, u, v \in \mathbb{R}^{N_i}.$    (14)

In addition, the regularization function $g$ is $\lambda$-strongly convex, i.e.,

$g(w') \;\ge\; g(w) + \xi^T(w' - w) + \frac{\lambda}{2}\|w' - w\|^2, \qquad \forall\, \xi \in \partial g(w),\ \ w, w' \in \mathbb{R}^d.$

Under Assumption 1, each $f_i^*$ is $\gamma_i$-strongly convex (see, e.g., Hiriart-Urruty and Lemaréchal, 2001, Theorem 4.2.2), and $L(w, \alpha)$ defined in (5) has a unique saddle point $(w^\star, \alpha^\star)$.

The condition (14) is often referred to as $f_i$ being $(1/\gamma_i)$-smooth. To simplify the discussion, here we assume $\gamma_i = \gamma$ for $i = 1, \ldots, m$. Under these assumptions, each composite function $f_i(X_i w)$ has a smoothness parameter $\|X_i\|^2/\gamma$ (an upper bound on the largest eigenvalue of its Hessian). Their average $\frac{1}{m}\sum_{i=1}^m f_i(X_i w)$ has a smoothness parameter $\|X\|^2/(m\gamma)$, which is no larger than the average $\frac{1}{m}\sum_{i=1}^m \|X_i\|^2/\gamma$ of the individual smoothness parameters. We define a condition number for problem (1) as the ratio between this smoothness parameter and the convexity parameter $\lambda$ of $g$:

$\kappa_{\mathrm{bat}} \;=\; \frac{\|X\|^2}{m\lambda\gamma},$    (15)

where $\|X\|$ is the spectral norm of the overall data matrix. This condition number is a key factor in characterizing the iteration complexity of batch first-order methods for solving problem (1), i.e., minimizing $P(w)$. Specifically, to find a $w$ such that $P(w) - P(w^\star) \le \epsilon$, the proximal gradient method requires $O((1 + \kappa_{\mathrm{bat}})\log(1/\epsilon))$ iterations, and its accelerated variants require $O((1 + \sqrt{\kappa_{\mathrm{bat}}})\log(1/\epsilon))$ iterations (e.g., Nesterov, 2004; Beck and Teboulle, 2009; Nesterov, 2013). Primal-dual first-order methods for solving the saddle-point problem (8) share the same complexity (Chambolle and Pock, 2011, 2015).
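
As a quick numerical illustration of (15), the snippet below computes $\kappa_{\mathrm{bat}}$ for a synthetic dataset with the logistic loss (whose second derivative is at most $1/4$, so $\gamma = 4$) and an $\ell_2$ regularizer with parameter $\lambda$; the sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 1000, 50, 4
X = rng.standard_normal((N, d)) / np.sqrt(d)   # overall data matrix (rows = examples)

gamma = 4.0      # logistic loss: gradients are (1/4)-Lipschitz, i.e. gamma = 4
lam = 1e-3       # strong convexity of g(w) = (lam/2) * ||w||^2

kappa_bat = np.linalg.norm(X, 2) ** 2 / (m * lam * gamma)   # condition number (15)
print(f"kappa_bat = {kappa_bat:.1f}")
```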

A fundamental baseline for evaluating any distributed optimization algorithm is the distributed implementation of batch first-order methods. Let us consider solving problem (1) using the proximal gradient method. During every iteration $t$, each machine receives a copy of $w^{(t)} \in \mathbb{R}^d$ from a master machine (through Broadcast), and computes the local gradient $X_i^T \nabla f_i(X_i w^{(t)}) \in \mathbb{R}^d$. Then a collective communication is invoked to compute the batch gradient $\frac{1}{m}\sum_{i=1}^m X_i^T \nabla f_i(X_i w^{(t)})$ at the master (Reduce). The master then takes a proximal gradient step, using this gradient and the proximal mapping of $g$, to compute the next iterate $w^{(t+1)}$, and broadcasts it to every machine for the next iteration. We can also use the AllReduce operation in MPI to obtain the batch gradient at each machine without a master. In either case, the total number of passes over the data is twice the number of iterations (due to matrix-vector multiplications with both $X_i$ and $X_i^T$), and the number of vectors in $\mathbb{R}^d$ sent or received across all machines is $2m$ times the number of iterations ($m$ for the Broadcast and $m$ for the Reduce; see Table 1). Moreover, all communications are collective and synchronous.
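
A minimal single-process simulation of this baseline (for the ridge-regression instance of (1), with Broadcast/Reduce replaced by ordinary Python loops) may help fix ideas; an actual implementation would use MPI collectives instead.

```python
import numpy as np

def batch_prox_gradient(X_parts, y_parts, lam, step, iters):
    # Simulated distributed proximal gradient: each "machine" i computes its local
    # gradient (m/N) X_i^T (X_i w - y_i); a Reduce averages them; the "master"
    # takes the proximal step and broadcasts the new w.
    m = len(X_parts)
    N = sum(len(yi) for yi in y_parts)
    w = np.zeros(X_parts[0].shape[1])
    for _ in range(iters):
        local = [(m / N) * Xi.T @ (Xi @ w - yi) for Xi, yi in zip(X_parts, y_parts)]
        grad = sum(local) / m                     # Reduce: average of the m local gradients
        v = w - step * grad                       # gradient step on the smooth part
        w = v / (1.0 + step * lam)                # prox of step*g with g = (lam/2)||w||^2
    return w                                      # broadcasting w starts the next iteration
```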

Algorithms                               | Computation complexity (number of passes over data) | Communication complexity (number of vectors in $\mathbb{R}^d$)
batch first-order methods                | $(1 + \kappa_{\mathrm{bat}})\log(1/\epsilon)$ | $m(1 + \kappa_{\mathrm{bat}})\log(1/\epsilon)$
DSCOVR                                   | $(1 + \kappa_{\mathrm{rand}}/m)\log(1/\epsilon)$ | $(m + \kappa_{\mathrm{rand}})\log(1/\epsilon)$
accelerated batch first-order methods    | $(1 + \sqrt{\kappa_{\mathrm{bat}}})\log(1/\epsilon)$ | $m(1 + \sqrt{\kappa_{\mathrm{bat}}})\log(1/\epsilon)$
accelerated DSCOVR                       | $(1 + \sqrt{\kappa_{\mathrm{rand}}/m})\log(1/\epsilon)$ | $(m + \sqrt{m\,\kappa_{\mathrm{rand}}})\log(1/\epsilon)$
Table 1: Computation and communication complexities of batch first-order methods and DSCOVR (for both SVRG and SAGA variants). We omit the $O(\cdot)$ notation in all entries and an extra $\log$ factor for accelerated DSCOVR algorithms.

Since DSCOVR is a family of randomized algorithms for solving the saddle-point problem (8), we would like to find a pair $(w, \alpha)$ whose (weighted) distance to the saddle point $(w^\star, \alpha^\star)$ is smaller than $\epsilon$, in expectation and with high probability. We list the communication and computation complexities of DSCOVR in Table 1, comparing them with batch first-order methods. Similar guarantees also hold for reducing the duality gap $P(w) - D(\alpha)$, where $P$ and $D$ are defined in (6) and (7) respectively.

The key quantity characterizing the complexities of DSCOVR is the condition number $\kappa_{\mathrm{rand}}$, which can be defined in several different ways. If we pick the data block $i$ and the model block $k$ with uniform distributions, i.e., $p_i = 1/m$ for $i = 1, \ldots, m$ and $q_k = 1/n$ for $k = 1, \ldots, n$, then $\kappa_{\mathrm{rand}}$ is given by

(16)

Comparing with the definition of $\kappa_{\mathrm{bat}}$ in (15), we have $\kappa_{\mathrm{rand}} \ge \kappa_{\mathrm{bat}}$. With the same uniform distributions $p$ and $q$, we can also define the alternative quantity

(17)

In this case, we also have $\kappa'_{\mathrm{rand}} \ge \kappa_{\mathrm{bat}}$. Finally, if we pick the pair $(i, k)$ with certain non-uniform distributions $p_i$ and $q_k$, then we can define

(18)

Again we have $\kappa''_{\mathrm{rand}} \ge \kappa_{\mathrm{bat}}$. We may replace $\kappa_{\mathrm{rand}}$ in Tables 1 and 2 by either $\kappa'_{\mathrm{rand}}$ or $\kappa''_{\mathrm{rand}}$, depending on the probability distributions $p$ and $q$ and on different proof techniques.

From Table 1, we observe a similar type of speed-up in computation complexity as obtained by variance reduction techniques over batch first-order algorithms for convex optimization (e.g., Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Lan and Zhou, 2015; Allen-Zhu, 2017), as well as for convex-concave saddle-point problems (Zhang and Xiao, 2017; Balamurugan and Bach, 2016). Basically, the DSCOVR algorithms offer a potential improvement over batch first-order methods by a factor of $m$ (for non-accelerated algorithms) or $\sqrt{m}$ (for accelerated algorithms), but with a worse condition number. In the worst case, the ratio between $\kappa_{\mathrm{rand}}$ and $\kappa_{\mathrm{bat}}$ may be of order $m$ or larger, thus canceling the potential improvements.

Algorithms                    | Synchronous communication (number of vectors in $\mathbb{R}^d$) | Asynchronous communication (equivalent number of vectors in $\mathbb{R}^d$)
DSCOVR-SVRG                   | $m\log(1/\epsilon)$ | $(m + \kappa_{\mathrm{rand}})\log(1/\epsilon)$
DSCOVR-SAGA                   | $m$ | $(m + \kappa_{\mathrm{rand}})\log(1/\epsilon)$
accelerated DSCOVR-SVRG       |  |
accelerated DSCOVR-SAGA       |  |
Table 2: Breakdown of the communication complexities into synchronous and asynchronous communication for the two different types of DSCOVR algorithms. We omit the $O(\cdot)$ notation and an extra $\log$ factor for accelerated DSCOVR algorithms.

More interestingly, DSCOVR also enjoys similar improvements in terms of communication complexity over batch first-order methods. In Table 2, we decompose the communication complexity of DSCOVR into synchronous and asynchronous communication. The decomposition turns out to be different depending on the variance reduction technique employed: SVRG (Johnson and Zhang, 2013) versus SAGA (Defazio et al., 2014). We note that DSCOVR-SAGA essentially requires only asynchronous communication, because the synchronous communication of $m$ vectors is only necessary for initialization with a non-zero starting point.

The comparisons in Tables 1 and 2 give us a good understanding of the complexities of different algorithms. However, these complexities are not accurate measures of their performance in practice. For example, collective communication of vectors in $\mathbb{R}^d$ can often be done in parallel over a spanning tree of the underlying communication network, and thus costs only $\log(m)$ times as much as sending a single vector (instead of $m$ times). Also, for point-to-point communication, sending one vector in $\mathbb{R}^d$ as a whole can be much faster than sending $n$ smaller vectors of total length $d$ separately. A fair comparison in terms of wall-clock time on a real-world distributed computing system requires customized, efficient implementations of the different algorithms. We will shed some light on timing comparisons with numerical experiments in Section 8.

2.2 Related Work

There is an extensive literature on distributed optimization. Many algorithms developed for machine learning adopt the centralized communication setting, due to the wide availability of supporting standards and platforms such as MPI, MapReduce and Spark (as discussed in the introduction). They include parallel implementations of the batch first-order and second-order methods (e.g., Lin et al., 2014; Chen et al., 2014; Lee et al., 2017), ADMM (Boyd et al., 2011), and distributed dual coordinate ascent (Yang, 2013; Jaggi et al., 2014; Ma et al., 2015).

For minimizing the average function $\frac{1}{m}\sum_{i=1}^m f_i(w)$, in the centralized setting and with only first-order oracles (i.e., gradients of the $f_i$'s or their conjugates), it has been shown that distributed implementations of accelerated gradient methods achieve the optimal convergence rate and communication complexity (Arjevani and Shamir, 2015; Scaman et al., 2017). The problem (1) we consider has the extra structure of composition with a linear transformation by the local data, which allows us to exploit simultaneous data and model parallelism using randomized algorithms and obtain improved communication and computation complexity.

Most work on asynchronous distributed algorithms exploits model parallelism in order to reduce the synchronization cost, especially in the setting with parameter servers (e.g., Li et al., 2014; Xing et al., 2015; Aytekin et al., 2016). In addition, the delay caused by asynchrony can be incorporated into the step size to gain practical improvements in convergence (e.g., Agarwal and Duchi, 2011; McMahan and Streeter, 2014; Sra et al., 2016), though the theoretical sublinear rates remain. There has also been much recent work on asynchronous parallel stochastic gradient and coordinate-descent algorithms for convex optimization (e.g., Recht et al., 2011; Liu et al., 2014; Shi et al., 2015; Reddi et al., 2015; Richtárik and Takáč, 2016; Peng et al., 2016). When the workloads or computing power of different machines or processors are nonuniform, such algorithms may significantly increase iteration efficiency (the number of iterations done per unit time), but often at the cost of requiring more iterations than their synchronous counterparts (due to delays and stale updates). So there is a subtle balance between iteration efficiency and iteration complexity (e.g., Hannah and Yin, 2017). Our discussion in Section 2.1 shows that DSCOVR is capable of improving both aspects.

For solving bilinear saddle-point problems with a finite-sum structure, Zhang and Xiao (2017) proposed a randomized algorithm that works with dual coordinate updates but full primal updates. Yu et al. (2015) proposed a doubly stochastic algorithm that works with both primal and dual coordinate updates based on equation (12). Both achieve accelerated linear convergence rates, but neither can be readily applied to distributed computing. In addition, Balamurugan and Bach (2016) proposed stochastic variance-reduction methods (also based on SVRG and SAGA) for solving more general convex-concave saddle-point problems. For the special case with bilinear coupling, they obtained a computation complexity similar to that of DSCOVR. However, their methods require full model updates at each iteration (even though working with only one sub-block of data), and thus are not suitable for distributed computing.

With additional assumptions and structure, such as similarity between the local cost functions at different machines or using second-order information, it is possible to obtain better communication complexity for distributed optimization; see, e.g., Shamir et al. (2014); Zhang and Xiao (2015); Reddi et al. (2016). However, these algorithms rely on much more computation at each machine for solving a local sub-problem at each iteration. With additional memory and preprocessing at each machine, Lee et al. (2015) showed that SVRG can be adapted for distributed optimization to obtain low communication complexity.

3 The DSCOVR-SVRG Algorithm

From this section to Section 6, we present several realizations of DSCOVR using different variance reduction techniques and acceleration schemes, and analyze their convergence properties. These algorithms are presented and analyzed as sequential randomized algorithms. We will discuss how to implement them for asynchronous distributed computing in Section 7.

0:  initial points $\bar{w}^{(0)}$, $\bar{\alpha}^{(0)}$, number of stages $S$ and number of iterations per stage $M$.
1:  for $s = 0, 1, \ldots, S-1$ do
2:       $\bar{u}^{(s)} = X\bar{w}^{(s)}$ and $\bar{v}^{(s)} = \frac{1}{m} X^T \bar{\alpha}^{(s)}$
3:       $w^{(0)} = \bar{w}^{(s)}$ and $\alpha^{(0)} = \bar{\alpha}^{(s)}$
4:      for $t = 0, 1, \ldots, M-1$ do
5:           pick $j \in \{1, \ldots, m\}$ and $l \in \{1, \ldots, n\}$ randomly with distributions $p$ and $q$ respectively.
6:           compute variance-reduced stochastic gradients:
$u_j^{(t+1)} = \bar{u}_j^{(s)} + \frac{1}{q_l} X_{jl}\bigl(w_l^{(t)} - \bar{w}_l^{(s)}\bigr),$    (19)
$v_l^{(t+1)} = \bar{v}_l^{(s)} + \frac{1}{p_j m} (X_{jl})^T\bigl(\alpha_j^{(t)} - \bar{\alpha}_j^{(s)}\bigr).$    (20)
7:           update primal and dual block coordinates:
$\alpha_j^{(t+1)} = \mathrm{prox}_{\sigma_j f_j^*}\bigl(\alpha_j^{(t)} + \sigma_j u_j^{(t+1)}\bigr), \qquad w_l^{(t+1)} = \mathrm{prox}_{\tau_l g_l}\bigl(w_l^{(t)} - \tau_l v_l^{(t+1)}\bigr),$
             keeping all other blocks unchanged.
8:      end for
9:       $\bar{w}^{(s+1)} = w^{(M)}$ and $\bar{\alpha}^{(s+1)} = \alpha^{(M)}$.
10:  end for
10:   output $\bar{w}^{(S)}$ and $\bar{\alpha}^{(S)}$.
Algorithm 2 DSCOVR-SVRG

Algorithm 2 is a DSCOVR algorithm that uses the technique of SVRG (Johnson and Zhang, 2013) for variance reduction. The iterations are divided into stages, and each stage has an inner loop of $M$ iterations. Each stage is initialized by a pair of vectors $\bar{w}^{(s)}$ and $\bar{\alpha}^{(s)}$, which come from either the initialization (if $s = 0$) or the last iterate of the previous stage (if $s > 0$). At the beginning of each stage, we compute the batch gradients

$\bar{u}^{(s)} = X\bar{w}^{(s)}, \qquad \bar{v}^{(s)} = \frac{1}{m} X^T \bar{\alpha}^{(s)}.$

The vectors $\bar{u}^{(s)} \in \mathbb{R}^N$ and $\bar{v}^{(s)} \in \mathbb{R}^d$ share the same partitions as $\alpha$ and $w$, respectively. Inside each stage $s$, the variance-reduced stochastic gradients are computed in (19) and (20). It is easy to check that they are unbiased. More specifically, taking the expectation of $u_j^{(t+1)}$ with respect to the random index $l$ gives

$\mathbf{E}_l\bigl[u_j^{(t+1)}\bigr] = \bar{u}_j^{(s)} + \sum_{k=1}^n q_k \frac{1}{q_k} X_{jk}\bigl(w_k^{(t)} - \bar{w}_k^{(s)}\bigr) = \bar{u}_j^{(s)} + X_{j:}\bigl(w^{(t)} - \bar{w}^{(s)}\bigr) = X_{j:} w^{(t)},$

and taking the expectation of $v_l^{(t+1)}$ with respect to the random index $j$ gives

$\mathbf{E}_j\bigl[v_l^{(t+1)}\bigr] = \bar{v}_l^{(s)} + \sum_{i=1}^m p_i \frac{1}{p_i m} (X_{il})^T\bigl(\alpha_i^{(t)} - \bar{\alpha}_i^{(s)}\bigr) = \frac{1}{m}\sum_{i=1}^m (X_{il})^T \alpha_i^{(t)}.$
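
Continuing the ridge-regression instance used earlier, the following sketch implements one stage of this scheme (the batch quantities $\bar{u}^{(s)}$, $\bar{v}^{(s)}$ followed by $M$ inner iterations with gradients of the form (19)-(20), under uniform sampling); the step sizes are again heuristic placeholders rather than the theoretical choices in (23)-(24) below.

```python
import numpy as np

def dscovr_svrg_stage(X_blocks, y_parts, w, alpha, lam, sigma, tau, M, seed=0):
    rng = np.random.default_rng(seed)
    m, n = len(X_blocks), len(X_blocks[0])
    c = m / sum(len(yi) for yi in y_parts)
    w_bar = [wk.copy() for wk in w]                          # \bar{w}^{(s)}
    a_bar = [ai.copy() for ai in alpha]                      # \bar{alpha}^{(s)}
    u_bar = [sum(X_blocks[i][k] @ w_bar[k] for k in range(n)) for i in range(m)]        # X_{i:} w_bar
    v_bar = [sum(X_blocks[i][k].T @ a_bar[i] for i in range(m)) / m for k in range(n)]  # (1/m) X_{:k}^T a_bar
    for _ in range(M):
        j, l = rng.integers(m), rng.integers(n)
        u_j = u_bar[j] + n * (X_blocks[j][l] @ (w[l] - w_bar[l]))        # (19), q_l = 1/n
        v_l = v_bar[l] + X_blocks[j][l].T @ (alpha[j] - a_bar[j])        # (20), p_j = 1/m
        a = alpha[j] + sigma * u_j
        alpha[j] = (a - sigma * y_parts[j]) / (1.0 + sigma / c)          # prox of sigma*f_j^* (squared loss)
        v = w[l] - tau * v_l
        w[l] = v / (1.0 + tau * lam)                                     # prox of tau*g_l (ridge shrinkage)
    return w, alpha
```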

In order to measure the distance of any pair of primal and dual variables to the saddle point, we define a weighted squared Euclidean norm on $\mathbb{R}^{d+N}$. Specifically, for any pair $(w, \alpha)$ with $w \in \mathbb{R}^d$ and $\alpha = [\alpha_1; \ldots; \alpha_m] \in \mathbb{R}^N$, where $\alpha_i \in \mathbb{R}^{N_i}$, we define

(21)

If $\gamma_i = \gamma$ for all $i = 1, \ldots, m$, then the weighted norm is determined by $\lambda$ and $\gamma$ alone. We have the following theorem concerning the convergence rate of Algorithm 2.

Theorem 3

Suppose Assumption 1 holds, and let $(w^\star, \alpha^\star)$ be the unique saddle point of $L(w, \alpha)$. Let $\Lambda$ be a constant that satisfies

(22)

In Algorithm 2, if we choose the step sizes as

(23)
(24)

and the number of iterations $M$ during each stage is sufficiently large, then for any $s \ge 1$,

(25)

The proof of Theorem 3 is given in Appendix A. Here we discuss how to choose the parameter $\Lambda$ to satisfy (22). For simplicity, we assume $\gamma_i = \gamma$ for all $i = 1, \ldots, m$.

  • If we let $p$ and $q$ be the uniform distributions across both rows and columns, i.e., $p_i = 1/m$ for $i = 1, \ldots, m$ and $q_k = 1/n$ for $k = 1, \ldots, n$, then we can choose $\Lambda$ in terms of $\kappa_{\mathrm{rand}}$ as defined in (16).

  • An alternative condition for $\Lambda$ to satisfy is (shown in Section A.1 of the Appendix)

    (26)

    Again using uniform sampling, we can choose $\Lambda$ in terms of the quantities defined in (17).

  • Using the condition (26), if we choose the probabilities to be proportional to the squared Frobenius norms of the data partitions, i.e.,