1 Introduction
Algorithms and systems for distributed optimization are critical for solving largescale machine learning problems, especially when the dataset cannot fit into the memory or storage of a single machine. In this paper, we consider distributed optimization problems of the form
(1) 
where is the local data stored at the th machine, is a convex cost function associated with the linear mapping , and is a convex regularization function. In addition, we assume that is separable, i.e., for some integer , we can write
(2) 
where , and for are nonoverlapping subvectors of with (they form a partition of ). Many popular regularization functions in machine learning are separable, for example, or for some .
An important special case of (1) is distributed empirical risk minimization (ERM) of linear predictors. Let be training examples, where each
is a feature vector and
is its label. The ERM problem is formulated as(3) 
where each is a loss function measuring the mismatch between the linear prediction and the label . Popular loss functions in machine learning include, e.g., for regression, the squared loss , and for classification, the logistic loss where . In the distributed optimization setting, the examples are divided into subsets, each stored on a different machine. For , let denote the subset of stored at machine and let (they satisfy ). Then the ERM problem (3) can be written in the form of (1) by letting consist of with as its rows and defining as
(4) 
where is a subvector of , consisting of with .
The nature of distributed algorithms and their convergence properties largely depend on the model of the communication network that connects the computing machines. A popular setting in the literature is to model the communication network as a graph, and each node can only communicate (in one step) with their neighbors connected by an edge, either synchronously or asynchronously (e.g., Bertsekas and Tsitsiklis, 1989; Nedić and Ozdaglar, 2009)
. The convergence rates of distributed algorithms in this setting often depend on characteristics of the graph, such as its diameter and the eigenvalues of the graph Laplacian
(e.g. Xiao and Boyd, 2006; Duchi et al., 2012; Nedić et al., 2016; Scaman et al., 2017). This is often called the decentralized setting.Another model for the communication network is centralized, where all the machines participate synchronous, collective communication, e.g., broadcasting a vector to all machines, or computing the sum of vectors, each from a different machine (AllReduce). These collective communication protocols hide the underlying implementation details, which often involve operations on graphs. They are adopted by many popular distributed computing standards and packages, such as MPI (MPI Forum, 2012), MapReduce (Dean and Ghemawat, 2008) and Aparche Spark (Zaharia et al., 2016), and are widely used in machine learning practice (e.g., Lin et al., 2014; Meng et al., 2016). In particular, collective communications are very useful for addressing data parallelism, i.e., by allowing different machines to work in parallel to improve the same model using their local dataset. A disadvantage of collective communications is their synchronization cost: faster machines or machines with less computing tasks have to become idle while waiting for other machines to finish their tasks in order to participate a collective communication.
One effective approach for reducing synchronization cost is to exploit model parallelism (here “model” refers to , including all optimization variables). The idea is to allow different machines work in parallel with different versions of the full model or different parts of a common model, with little or no synchronization. The model partitioning approach can be very effective for solving problems with large models (large dimension ). Dedicated parameter servers can be set up to store and maintain different subsets of the model parameters, such as the ’s in (2), and be responsible for coordinating their updates at different workers (Li et al., 2014; Xing et al., 2015). This requires flexible pointtopoint communication.
In this paper, we develop a family of randomized algorithms that exploit simultaneous data and model parallelism. Correspondingly, we adopt a centralized communication model that support both synchronous collective communication and asynchronous pointtopoint communication. In particular, it allows any pair of machines to send/receive a message in a single step, and multiple pointtopoint communications may happen in parallel in an eventdriven, asynchronous manner. Such a communication model is well supported by the MPI standard. To evaluate the performance of distributed algorithms in this setting, we consider the following three measures.

Computation complexity: total amount of computation, measured by the number of passes over all datasets for , which can happen in parallel on different machines.

Communication complexity: the total amount of communication required, measured by the equivalent number of vectors in sent or received across all machines.

Synchronous communication: measured by the total number of vectors in that requires synchronous collective communication involving all machines. We single it out from the overall communication complexity as a (partial) measure of the synchronization cost.
In Section 2, we introduce the framework of our randomized algorithms, Doubly Stochastic Coordinate Optimization with Variance Reduction (DSCOVR), and summarize our theoretical results on the three measures achieved by DSCOVR. Compared with other firstorder methods for distributed optimization, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. Then we present the details of several DSCOVR variants and their convergence analysis in Sections 36. We discuss the implementation of different DSCOVR algorithms in Section 7, and present results of our numerical experiments in Section 8.
2 The DSCOVR Framework and Main Results
First, we derive a saddlepoint formulation of the convex optimization problem (1). Let be the convex conjugate of , i.e., , and define
(5) 
where . Since both the ’s and are convex, is convex in and concave in . We also define a pair of primal and dual functions:
(6)  
(7) 
where is exactly the objective function in (1)^{1}^{1}1 More technically, we need to assume that each is convex and lower semicontinuous so that (see, e.g., Rockafellar, 1970, Section 12). It automatically holds if is convex and differentiable, which we will assume later. and is the convex conjugate of . We assume that has a saddle point , that is,
In this case, we have and , and .
The DSCOVR framework is based on solving the convexconcave saddlepoint problem
(8) 
Since we assume that has a separable structure as in (2), we rewrite the saddlepoint problem as
(9) 
where for are column partitions of . For convenience, we define the following notations. First, let be the overall data matrix, by stacking the ’s vertically. Conforming to the separation of , we also partition into block columns for , where each (stacked vertically). For consistency, we also use to denote from now on. See Figure 1 for an illustration.
(10)  
(11) 
We exploit the doubly separable structure in (9) by a doubly stochastic coordinate update algorithm outlined in Algorithm 1. Let and
be two probability distributions. During each iteration
, we randomly pick an index with probability , and independently pick an index with probability . Then we compute two vectors and (details to be discussed later), and use them to update the block coordinates and while leaving other block coordinates unchanged. The update formulas in (10) and (11) use the proximal mappings of the (scaled) functions and respectively. We recall that the proximal mapping for any convex function is defined asThere are several different ways to compute the vectors and in Step 3 of Algorithm 1. They should be the partial gradients or stochastic gradients of the bilinear coupling term in with respect to and respectively. Let
which is the bilinear term in without the factor . We can use the following partial gradients in Step 3:
(12) 
We note that the factor does not appear in the first equation because it multiplies both and in (9) and hence does not appear in updating . Another choice is to use
(13) 
which are unbiased stochastic partial gradients, because
where and are expectations with respect to the random indices and respectively.
It can be shown that, Algorithm 1 converges to a saddle point of with either choice (12) or (13) in Step 3, and with suitable step sizes and . It is expected that using the stochastic gradients in (13) leads to a slower convergence rate than applying (12). However, using (13) has the advantage of much less computation during each iteration. Specifically, it employs only one block matrixvector multiplication for both updates, instead of and block multiplications done in (12).
More importantly, the choice in (13) is suitable for parallel and distributed computing. To see this, let denote the pair of random indices drawn at iteration (we omit the superscript to simplify notation whenever there is no confusion from the context). Suppose for a sequence of consecutive iterations , there is no common index among , nor among , then these iterations can be done in parallel and they produce the same updates as being done sequentially. Suppose there are processors or machines, then each can carry out one iteration, which includes the updates in (13) as well as (10) and (11). These iterations are independent of each other, and in fact can be done in any order, because each only involve one primal block and one dual block , for both input and output (variables on the right and left sides of the assignments respectively). In contrast, the input for the updates in (12) depend on all primal and dual blocks at the previous iteration, thus cannot be done in parallel.
In practice, suppose we have machines for solving problem (9), and each holds the data matrix in memory and maintains the dual block , for . We assume that the number of model partitions is larger than , and the model blocks are stored at one or more parameter servers. In the beginning, we can randomly pick model blocks (sampling without replacement) from , and assign each machine to update one of them. If machine is assigned to update block , then both and are updated, using only the matrix ; moreover, it needs to communicate only the block with the parameter server that are responsible to maintain it. Whenever one machine finishes its update, a scheduler can randomly pick another parameter block that is not currently updated by other machines, and assign it to the free machine. Therefore all machines can work in parallel, in an asynchronous, eventdriven manner. Here an event is the completion of a block update at any machine, as illustrated in Figure 2. We will discuss the implementation details in Section 7.
The idea of using doubly stochastic updates for distributed optimization in not new. It has been studied by Yun et al. (2014) for solving the matrix completion problem, and by Matsushima et al. (2014) for solving the saddlepoint formulation of the ERM problem. Despite their nice features for parallelization, these algorithms inherit the (or with strong convexity) sublinear convergence rate of the classical stochastic gradient method. They translate into high communication and computation cost for distributed optimization. In this paper, we propose new variants of doubly stochastic update algorithms by using variancereduced stochastic gradients (Step 3 of Algorithm 1). More specifically, we borrow the variancereduction techniques from SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014) to develop the DSCOVR algorithms, which enjoy fast linear rates of convergence. In the rest of this section, we summarize our theoretical results characterizing the three measures for DSCOVR: computation complexity, communication complexity, and synchronization cost. We compare them with distributed implementation of batch firstorder algorithms.
2.1 Summary of Main Results
Throughout this paper, we use to denote the standard Euclidean norm for vectors. For matrices, denotes the operator (spectral) norm and denotes the Frobenius norm. We make the following assumption regarding the optimization problem (1).
Assumption 1
Each is convex and differentiable, and its gradient is Lipschitz continuous, i.e.,
(14) 
In addition, the regularization function is strongly convex, i.e.,
Under Assumption 1, each is strongly convex (see, e.g., HiriartUrruty and Lemaréchal, 2001, Theorem 4.2.2), and defined in (5) has a unique saddle point .
The condition (14) is often referred to as being smooth. To simplify discussion, here we assume for . Under these assumptions, each composite function has a smoothness parameter (upper bound on the largest eigenvalue of its Hessian). Their average has a smooth parameter , which no larger than the average of the individual smooth parameters . We define a condition number for problem (1) as the ratio between this smooth parameter and the convexity parameter of :
(15) 
where . This condition number is a key factor to characterize the iteration complexity of batch firstorder methods for solving problem (1), i.e., minimizing . Specifically, to find a such that , the proximal gradient method requires iterations, and their accelerated variants require iterations (e.g., Nesterov, 2004; Beck and Teboulle, 2009; Nesterov, 2013). Primaldual first order methods for solving the saddlepoint problem (8) share the same complexity (Chambolle and Pock, 2011, 2015).
A fundamental baseline for evaluating any distributed optimization algorithms is the distributed implementation of batch firstorder methods. Let’s consider solving problem (1) using the proximal gradient method. During every iteration , each machine receives a copy of from a master machine (through Broadcast), and computes the local gradient . Then a collective communication is invoked to compute the batch gradient at the master (Reduce). The master then takes a proximal gradient step, using and the proximal mapping of , to compute the next iterate and broadcast it to every machine for the next iteration. We can also use the AllReduce operation in MPI to obtain at each machine without a master. In either case, the total number of passes over the data is twice the number of iterations (due to matrixvector multiplications using both and ), and the number of vectors in sent/received across all machines is times the number of iterations (see Table 1). Moreover, all communications are collective and synchronous.
Algorithms  Computation complexity  Communication complexity 

(number of passes over data)  (number of vectors in )  
batch firstorder methods  
DSCOVR  
accelerated batch firstorder methods  
accelerated DSCOVR 
Since DSCOVR is a family of randomized algorithms for solving the saddlepoint problem (8), we would like to find such that holds in expectation and with high probability. We list the communication and computation complexities of DSCOVR in Table 1, comparing them with batch firstorder methods. Similar guarantees also hold for reducing the duality gap , where and are defined in (6) and (7) respectively.
The key quantity characterizing the complexities of DSCOVR is the condition number , which can be defined in several different ways. If we pick the data block and model block
with uniform distribution, i.e.,
for and for , then(16) 
Comparing the definition of in (15), we have because
With and , we can also define
(17) 
In this case, we also have because . Finally, if we pick the pair with nonuniform distribution and , then we can define
(18) 
Again we have because . We may replace in Tables 1 and 2 by either or , depending on the probability distributions and and different proof techniques.
From Table 1, we observe similar type of speedups in computation complexity, as obtained by variance reduction techniques over the batch firstorder algorithms for convex optimization (e.g., Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Lan and Zhou, 2015; AllenZhu, 2017), as well as for convexconcave saddlepoint problems (Zhang and Xiao, 2017; Balamurugan and Bach, 2016). Basically, DSCOVR algorithms have potential improvement over batch firstorder methods by a factor of (for nonaccelerated algorithms) or (for accelerated algorithms), but with a worse condition number. In the worst case, the ratio between and may be of order or larger, thus canceling the potential improvements.
Algorithms  Synchronous Communication  Asynchronous Communication 

(number of vectors in )  (equiv. number of vectors in )  
DSCOVRSVRG  
DSCOVRSAGA  
accelerated DSCOVRSVRG  
accelerated DSCOVRSAGA 
More interestingly, DSCOVR also has similar improvements in terms of communication complexity over batch firstorder methods. In Table 2, we decompose the communication complexity of DSCOVR into synchronous and asynchronous communication. The decomposition turns out to be different depending on the variance reduction techniques employed: SVRG (Johnson and Zhang, 2013) versus SAGA (Defazio et al., 2014). We note that DSCOVRSAGA essentially requires only asynchronous communication, because the synchronous communication of vectors are only necessary for initialization with nonzero starting point.
The comparisons in Table 1 and 2 give us good understanding of the complexities of different algorithms. However, these complexities are not accurate measures of their performance in practice. For example, collective communication of vectors in can often be done in parallel over a spanning tree of the underlying communication network, thus only cost times (insted of times) compared with sending only one vector. Also, for pointtopoint communication, sending one vector in altogether can be much faster than sending smaller vectors of total length separately. A fair comparison in term of wallclock time on a realworld distributed computing system requires customized, efficient implementation of different algorithms. We will shed some light on timing comparisons with numerical experiments in Section 8.
2.2 Related Work
There is an extensive literature on distributed optimization. Many algorithms developed for machine learning adopt the centralized communication setting, due to the wide availability of supporting standards and platforms such as MPI, MapReduce and Spark (as discussed in the introduction). They include parallel implementations of the batch firstorder and secondorder methods (e.g., Lin et al., 2014; Chen et al., 2014; Lee et al., 2017), ADMM (Boyd et al., 2011), and distributed dual coordinate ascent (Yang, 2013; Jaggi et al., 2014; Ma et al., 2015).
For minimizing the average function , in the centralized setting and with only firstorder oracles (i.e., gradients of ’s or their conjugates), it has been shown that distributed implementation of accelerated gradient methods achieves the optimal convergence rate and communication complexity (Arjevani and Shamir, 2015; Scaman et al., 2017). The problem (1
) we consider has the extra structure of composition with a linear transformation by the local data, which allows us to exploit simultaneous data and model parallelism using randomized algorithms and obtain improved communication and computation complexity.
Most work on asynchronous distributed algorithms exploit model parallelism in order to reduce the synchronization cost, especially in the setting with parameter servers (e.g., Li et al., 2014; Xing et al., 2015; Aytekin et al., 2016). Besides, delay caused by the asynchrony can be incorporated to the step size to gain practical improvement on convergence (e.g., Agarwal and Duchi, 2011; McMahan and Streeter, 2014; Sra et al., 2016), though the theoretical sublinear rates remain. There are also many recent work on asynchronous parallel stochastic gradient and coordinatedescent algorithms for convex optimization (e.g., Recht et al., 2011; Liu et al., 2014; Shi et al., 2015; Reddi et al., 2015; Richtárik and Takáč, 2016; Peng et al., 2016). When the workloads or computing power of different machines or processors are nonuniform, they may significantly increase iteration efficiency (number of iterations done in unit time), but often at the cost of requiring more iterations than their synchronous counterparts (due to delays and stale updates). So there is a subtle balance between iteration efficiency and iteration complexity (e.g., Hannah and Yin, 2017). Our discussions in Section 2.1 show that DSCOVR is capable of improving both aspects.
For solving bilinear saddlepoint problems with a finitesum structure, Zhang and Xiao (2017) proposed a randomized algorithm that works with dual coordinate update but full primal update. Yu et al. (2015) proposed a doubly stochastic algorithm that works with both primal and dual coordinate updates based on equation (12). Both of them achieved accelerated linear convergence rates, but neither can be readily applied to distributed computing. In addition, Balamurugan and Bach (2016) proposed stochastic variancereduction methods (also based on SVRG and SAGA) for solving more general convexconcave saddle point problems. For the special case with bilinear coupling, they obtained similar computation complexity as DSCOVR. However, their methods require full model updates at each iteration (even though working with only one subblock of data), thus are not suitable for distributed computing.
With additional assumptions and structure, such as similarity between the local cost functions at different machines or using secondorder information, it is possible to obtain better communication complexity for distributed optimization; see, e.g., Shamir et al. (2014); Zhang and Xiao (2015); Reddi et al. (2016). However, these algorithms rely on much more computation at each machine for solving a local subproblem at each iteration. With additional memory and preprocessing at each machine, Lee et al. (2015) showed that SVRG can be adapted for distributed optimization to obtain low communication complexity.
3 The DSCOVRSVRG Algorithm
From this section to Section 6, we present several realizations of DSCOVR using different variance reduction techniques and acceleration schemes, and analyze their convergence properties. These algorithms are presented and analyzed as sequential randomized algorithms. We will discuss how to implement them for asynchronous distributed computing in Section 7.
(19)  
(20) 
Algorithm 2 is a DSCOVR algorithm that uses the technique of SVRG (Johnson and Zhang, 2013) for variance reduction. The iterations are divided into stages and each stage has a inner loop. Each stage is initialized by a pair of vectors and , which come from either initialization (if ) or the last iterate of the previous stage (if ). At the beginning of each stage, we compute the batch gradients
The vectors and share the same partitions as and , respectively. Inside each stage , the variancereduced stochastic gradients are computed in (19) and (20). It is easy to check that they are unbiased. More specifically, taking expectation of with respect to the random index gives
and taking expectation of with respect to the random index gives
In order to measure the distance of any pair of primal and dual variables to the saddle point, we define a weighted squared Euclidean norm on . Specifically, for any pair where and with , we define
(21) 
If for all , then . We have the following theorem concerning the convergence rate of Algorithm 2.
Suppose Assumption 1 holds, and let be the unique saddle point of . Let be a constant that satisfies
(22) 
In Algorithm 2, if we choose the step sizes as
(23)  
(24) 
and the number of iterations during each stage satisfies , then for any ,
(25) 
The proof of Theorem 3 is given in Appendix A. Here we discuss how to choose the parameter to satisfy (22). For simplicity, we assume for all .