Algorithms and systems for distributed optimization are critical for solving large-scale machine learning problems, especially when the dataset cannot fit into the memory or storage of a single machine. In this paper, we consider distributed optimization problems of the form
$$\min_{w\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m f_i(X_i w) + g(w), \tag{1}$$
where $X_i\in\mathbb{R}^{N_i\times d}$ is the local data stored at the $i$th machine, $f_i:\mathbb{R}^{N_i}\to\mathbb{R}$ is a convex cost function associated with the linear mapping $X_i w$, and $g:\mathbb{R}^d\to\mathbb{R}$ is a convex regularization function. In addition, we assume that $g$ is separable, i.e., for some integer $n$, we can write
$$g(w) = \sum_{k=1}^n g_k(w_k), \tag{2}$$
where $g_k:\mathbb{R}^{d_k}\to\mathbb{R}$, and the $w_k\in\mathbb{R}^{d_k}$ for $k=1,\ldots,n$ are non-overlapping subvectors of $w$ with $\sum_{k=1}^n d_k=d$ (they form a partition of $w$). Many popular regularization functions in machine learning are separable, for example, $g(w)=\frac{\lambda}{2}\|w\|_2^2$ or $g(w)=\lambda\|w\|_1$ for some $\lambda>0$.
An important special case of (1) is distributed empirical risk minimization (ERM) of linear predictors. Let $(x_1,y_1),\ldots,(x_N,y_N)$ be $N$ training examples, where each $x_j\in\mathbb{R}^d$ is a feature vector and $y_j$ is its label. The ERM problem is formulated as
$$\min_{w\in\mathbb{R}^d}\ \frac{1}{N}\sum_{j=1}^N \phi_j\bigl(x_j^T w\bigr) + g(w), \tag{3}$$
where each $\phi_j$ is a loss function measuring the mismatch between the linear prediction $x_j^T w$ and the label $y_j$. Popular loss functions in machine learning include, e.g., for regression, the squared loss $\phi_j(t)=\frac{1}{2}(t-y_j)^2$, and for classification, the logistic loss $\phi_j(t)=\log(1+\exp(-y_j t))$ where $y_j\in\{\pm 1\}$. In the distributed optimization setting, the $N$ examples are divided into $m$ subsets, each stored on a different machine. For $i=1,\ldots,m$, let $\mathcal{I}_i$ denote the subset of examples stored at machine $i$ and let $N_i=|\mathcal{I}_i|$ (they satisfy $\sum_{i=1}^m N_i=N$). Then the ERM problem (3) can be written in the form of (1) by letting $X_i$ consist of $x_j^T$ with $j\in\mathcal{I}_i$ as its rows and defining $f_i:\mathbb{R}^{N_i}\to\mathbb{R}$ as
$$f_i(u_i) = \frac{m}{N}\sum_{j\in\mathcal{I}_i}\phi_j(u_j), \tag{4}$$
where $u_i\in\mathbb{R}^{N_i}$ is a subvector of $u\in\mathbb{R}^N$, consisting of the entries $u_j$ with $j\in\mathcal{I}_i$.
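As a concrete check of this reformulation, the following sketch (our illustration, with made-up data and block sizes, not from the paper) verifies numerically that the ERM objective (3) coincides with the averaged form (1) when each $f_i$ is defined as in (4), using the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 12, 5, 3                      # examples, features, machines
X = rng.standard_normal((N, d))         # rows are feature vectors x_j^T
y = rng.standard_normal(N)
w = rng.standard_normal(d)
lam = 0.1

phi = lambda t, y: 0.5 * (t - y) ** 2   # squared loss
g = lambda w: 0.5 * lam * np.dot(w, w)  # separable l2 regularizer

# Centralized ERM objective (3): (1/N) sum_j phi_j(x_j^T w) + g(w)
erm = np.mean(phi(X @ w, y)) + g(w)

# Split the rows into m local blocks X_i and define f_i as in (4):
# f_i(u_i) = (m/N) * sum_{j in I_i} phi_j(u_j)
blocks = np.array_split(np.arange(N), m)
f = lambda i, u: (m / N) * np.sum(phi(u, y[blocks[i]]))
distributed = np.mean([f(i, X[blocks[i]] @ w) for i in range(m)]) + g(w)

assert np.isclose(erm, distributed)     # the two formulations agree
```

The agreement holds for any partition of the examples, since the $(m/N)$ scaling in (4) cancels the $(1/m)$ averaging in (1).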
The nature of distributed algorithms and their convergence properties largely depend on the model of the communication network that connects the computing machines. A popular setting in the literature is to model the communication network as a graph, where each node can communicate (in one step) only with its neighbors connected by an edge, either synchronously or asynchronously (e.g., Bertsekas and Tsitsiklis, 1989; Nedić and Ozdaglar, 2009). The convergence rates of distributed algorithms in this setting often depend on characteristics of the graph, such as its diameter and the eigenvalues of the graph Laplacian (e.g., Xiao and Boyd, 2006; Duchi et al., 2012; Nedić et al., 2016; Scaman et al., 2017). This is often called the decentralized setting.
Another model for the communication network is centralized, where all the machines participate in synchronous, collective communication, e.g., broadcasting a vector to all machines, or computing the sum of vectors, each from a different machine (AllReduce). These collective communication protocols hide the underlying implementation details, which often involve operations on graphs. They are adopted by many popular distributed computing standards and packages, such as MPI (MPI Forum, 2012), MapReduce (Dean and Ghemawat, 2008) and Apache Spark (Zaharia et al., 2016), and are widely used in machine learning practice (e.g., Lin et al., 2014; Meng et al., 2016). In particular, collective communications are very useful for exploiting data parallelism, i.e., allowing different machines to work in parallel to improve the same model using their local datasets. A disadvantage of collective communications is their synchronization cost: faster machines or machines with fewer computing tasks have to become idle while waiting for other machines to finish their tasks in order to participate in a collective communication.
One effective approach for reducing synchronization cost is to exploit model parallelism (here "model" refers to $w$, including all optimization variables). The idea is to allow different machines to work in parallel with different versions of the full model, or with different parts of a common model, with little or no synchronization. The model-partitioning approach can be very effective for solving problems with large models (large dimension $d$). Dedicated parameter servers can be set up to store and maintain different subsets of the model parameters, such as the $w_k$'s in (2), and be responsible for coordinating their updates at different workers (Li et al., 2014; Xing et al., 2015). This requires flexible point-to-point communication.
In this paper, we develop a family of randomized algorithms that exploit simultaneous data and model parallelism. Correspondingly, we adopt a centralized communication model that supports both synchronous collective communication and asynchronous point-to-point communication. In particular, it allows any pair of machines to send/receive a message in a single step, and multiple point-to-point communications may happen in parallel in an event-driven, asynchronous manner. Such a communication model is well supported by the MPI standard. To evaluate the performance of distributed algorithms in this setting, we consider the following three measures.
Computation complexity: the total amount of computation, measured by the number of passes over all datasets $X_i$ for $i=1,\ldots,m$, which can happen in parallel on different machines.
Communication complexity: the total amount of communication required, measured by the equivalent number of vectors in $\mathbb{R}^d$ sent or received across all machines.
Synchronous communication: measured by the total number of vectors in $\mathbb{R}^d$ that require synchronous collective communication involving all machines. We single it out from the overall communication complexity as a (partial) measure of the synchronization cost.
In Section 2, we introduce the framework of our randomized algorithms, Doubly Stochastic Coordinate Optimization with Variance Reduction (DSCOVR), and summarize our theoretical results on the three measures achieved by DSCOVR. Compared with other first-order methods for distributed optimization, we show that DSCOVR may require less overall computation and communication, and less or no synchronization. We then present the details of several DSCOVR variants and their convergence analysis in Sections 3-6. We discuss the implementation of different DSCOVR algorithms in Section 7, and present the results of our numerical experiments in Section 8.
2 The DSCOVR Framework and Main Results
First, we derive a saddle-point formulation of the convex optimization problem (1). Let $f_i^*$ be the convex conjugate of $f_i$, i.e., $f_i^*(\alpha_i)=\sup_{u_i\in\mathbb{R}^{N_i}}\{\alpha_i^T u_i - f_i(u_i)\}$, and define
$$L(w,\alpha) \;\equiv\; \frac{1}{m}\sum_{i=1}^m \alpha_i^T X_i w \;-\; \frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i) \;+\; g(w), \tag{5}$$
where $\alpha=[\alpha_1;\ldots;\alpha_m]\in\mathbb{R}^N$. Since both the $f_i$'s and $g$ are convex, $L(w,\alpha)$ is convex in $w$ and concave in $\alpha$. We also define a pair of primal and dual functions:
$$P(w) = \max_{\alpha}\,L(w,\alpha) = \frac{1}{m}\sum_{i=1}^m f_i(X_i w) + g(w), \tag{6}$$
$$D(\alpha) = \min_{w}\,L(w,\alpha) = -\frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i) - g^*\Bigl(-\frac{1}{m}\sum_{i=1}^m X_i^T\alpha_i\Bigr), \tag{7}$$
where $P(w)$ is exactly the objective function in (1)¹ and $g^*$ is the convex conjugate of $g$. We assume that $L$ has a saddle point $(w^\star,\alpha^\star)$, that is,
$$L(w^\star,\alpha)\;\le\;L(w^\star,\alpha^\star)\;\le\;L(w,\alpha^\star),\qquad\forall\,(w,\alpha).$$
¹ More technically, we need to assume that each $f_i$ is convex and lower semi-continuous so that $f_i^{**}=f_i$ (see, e.g., Rockafellar, 1970, Section 12). This automatically holds if $f_i$ is convex and differentiable, which we will assume later.
In this case, we have $w^\star=\arg\min_w P(w)$ and $\alpha^\star=\arg\max_\alpha D(\alpha)$, and $P(w^\star)=D(\alpha^\star)$.
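For a small ridge-regression instance, where $f_i(u)=\frac{1}{2}\|u-y_i\|^2$ and $g(w)=\frac{\lambda}{2}\|w\|^2$ admit the closed-form conjugates $f_i^*(\alpha_i)=\frac{1}{2}\|\alpha_i\|^2+\alpha_i^T y_i$ and $g^*(z)=\|z\|^2/(2\lambda)$, the relation $P(w^\star)=D(\alpha^\star)$ can be verified numerically. The following sketch (synthetic data, our illustration) does exactly that, recovering the dual solution via $\alpha_i^\star=\nabla f_i(X_i w^\star)$:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, Ni, lam = 3, 4, 5, 0.2
Xs = [rng.standard_normal((Ni, d)) for _ in range(m)]
ys = [rng.standard_normal(Ni) for _ in range(m)]
X = np.vstack(Xs)
y = np.concatenate(ys)

# f_i(u) = (1/2)||u - y_i||^2  =>  f_i*(a) = (1/2)||a||^2 + a^T y_i
# g(w)   = (lam/2)||w||^2      =>  g*(z)   = ||z||^2 / (2*lam)
def P(w):
    return np.mean([0.5 * np.sum((Xi @ w - yi) ** 2)
                    for Xi, yi in zip(Xs, ys)]) + 0.5 * lam * (w @ w)

def D(a):
    blocks = np.split(a, m)
    fstar = np.mean([0.5 * ai @ ai + ai @ yi for ai, yi in zip(blocks, ys)])
    z = -X.T @ a / m                    # argument of g* in (7)
    return -fstar - z @ z / (2 * lam)

# Solve the primal problem in closed form, then recover
# alpha_i^* = grad f_i(X_i w^*) = X_i w^* - y_i.
w_star = np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)
a_star = X @ w_star - y

assert np.isclose(P(w_star), D(a_star))   # no duality gap at the saddle point
```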
The DSCOVR framework is based on solving the convex-concave saddle-point problem
$$\min_{w}\max_{\alpha}\ L(w,\alpha). \tag{8}$$
Since we assume that $g$ has a separable structure as in (2), we rewrite the saddle-point problem as
$$\min_{w}\max_{\alpha}\ \Bigl\{\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^n \alpha_i^T X_{ik} w_k \;-\; \frac{1}{m}\sum_{i=1}^m f_i^*(\alpha_i) \;+\; \sum_{k=1}^n g_k(w_k)\Bigr\}, \tag{9}$$
where the $X_{ik}\in\mathbb{R}^{N_i\times d_k}$ for $k=1,\ldots,n$ are column partitions of $X_i$. For convenience, we define the following notations. First, let $X=[X_1;\ldots;X_m]\in\mathbb{R}^{N\times d}$ be the overall data matrix, obtained by stacking the $X_i$'s vertically. Conforming to the separation of $g$, we also partition $X$ into block columns $X_{:k}\in\mathbb{R}^{N\times d_k}$ for $k=1,\ldots,n$, where each $X_{:k}=[X_{1k};\ldots;X_{mk}]$ (stacked vertically). For consistency, we also use $X_{i:}$ to denote $X_i$ from now on. See Figure 1 for an illustration.
Let $p=(p_1,\ldots,p_m)$ and $q=(q_1,\ldots,q_n)$ be two probability distributions. During each iteration $t$, we randomly pick an index $j\in\{1,\ldots,m\}$ with probability $p_j$, and independently pick an index $l\in\{1,\ldots,n\}$ with probability $q_l$. Then we compute two vectors $u_j^{(t+1)}\in\mathbb{R}^{N_j}$ and $v_l^{(t+1)}\in\mathbb{R}^{d_l}$ (details to be discussed later), and use them to update the block coordinates $\alpha_j$ and $w_l$ while leaving the other block coordinates unchanged. The update formulas in (10) and (11) use the proximal mappings of the (scaled) functions $\sigma_j f_j^*$ and $\tau_l g_l$, respectively. We recall that the proximal mapping of any convex function $\phi$ is defined as
$$\mathrm{prox}_{\phi}(v) \;\equiv\; \operatorname*{arg\,min}_{x}\Bigl\{\phi(x)+\frac{1}{2}\|x-v\|^2\Bigr\}.$$
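For the two separable regularizers mentioned in the introduction, the proximal mappings used in updates such as (11) have simple closed forms. A small illustrative sketch (the helper names are ours, not the paper's):

```python
import numpy as np

# prox_phi(v) = argmin_x { phi(x) + (1/2)||x - v||^2 }

def prox_l2(v, c):
    """prox of phi(x) = (c/2)||x||^2: shrinkage toward the origin."""
    return v / (1.0 + c)

def prox_l1(v, c):
    """prox of phi(x) = c*||x||_1: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

v = np.array([1.5, -0.2, 0.05])
print(prox_l2(v, 0.5))   # [1.0, -0.13333..., 0.03333...]
print(prox_l1(v, 0.1))   # [1.4, -0.1, 0.0]
```

Here `c` plays the role of the step-size-scaled regularization weight (e.g., $\tau_l\lambda$), so each block update costs only $O(d_l)$ arithmetic.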
There are several different ways to compute the vectors $u_j^{(t+1)}$ and $v_l^{(t+1)}$ in Step 3 of Algorithm 1. They should be the partial gradients or stochastic gradients of the bilinear coupling term in $L(w,\alpha)$ with respect to $\alpha_j$ and $w_l$, respectively. Let
$$K(w,\alpha) \;=\; \alpha^T X w \;=\; \sum_{i=1}^m\sum_{k=1}^n \alpha_i^T X_{ik} w_k,$$
which is the bilinear term in $L(w,\alpha)$ without the factor $1/m$. We can use the following partial gradients in Step 3:
$$u_j^{(t+1)} = X_{j:}\,w^{(t)}, \qquad v_l^{(t+1)} = \frac{1}{m}\,(X_{:l})^T\alpha^{(t)}. \tag{12}$$
We note that the factor $1/m$ does not appear in the first equation because it multiplies both $K(w,\alpha)$ and $f_j^*(\alpha_j)$ in (9) and hence does not affect the update of $\alpha_j$. Another choice is to use
$$u_j^{(t+1)} = \frac{1}{q_l}\,X_{jl}\,w_l^{(t)}, \qquad v_l^{(t+1)} = \frac{1}{m\,p_j}\,(X_{jl})^T\alpha_j^{(t)}, \tag{13}$$
which are unbiased stochastic partial gradients, because
$$\mathbf{E}_l\bigl[u_j^{(t+1)}\bigr]=\sum_{k=1}^n q_k\,\frac{1}{q_k}X_{jk}\,w_k^{(t)}=X_{j:}\,w^{(t)},\qquad
\mathbf{E}_j\bigl[v_l^{(t+1)}\bigr]=\sum_{i=1}^m p_i\,\frac{1}{m\,p_i}(X_{il})^T\alpha_i^{(t)}=\frac{1}{m}(X_{:l})^T\alpha^{(t)},$$
where $\mathbf{E}_l$ and $\mathbf{E}_j$ are expectations with respect to the random indices $l$ and $j$, respectively.
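The unbiasedness calculations above can be checked numerically by summing the stochastic gradients (13) over all possible values of the random index, weighted by their probabilities, and comparing with the full partial gradients (12). A small sketch with made-up block sizes (our illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4                                  # dual blocks, primal blocks
Ni, dk = [2, 3, 2], [2, 1, 3, 2]             # block sizes
N, d = sum(Ni), sum(dk)
X = rng.standard_normal((N, d))
w = rng.standard_normal(d)
alpha = rng.standard_normal(N)
p = np.full(m, 1.0 / m)                      # uniform sampling of i
q = np.full(n, 1.0 / n)                      # uniform sampling of k

rows = np.split(np.arange(N), np.cumsum(Ni)[:-1])
cols = np.split(np.arange(d), np.cumsum(dk)[:-1])

j, l = 1, 2                                  # the blocks being updated

# Full partial gradients (12)
u_full = X[rows[j]] @ w                      # X_{j:} w
v_full = X[:, cols[l]].T @ alpha / m         # (1/m) X_{:l}^T alpha

# Exact expectation of the stochastic gradients (13) over the random index
u_expect = sum(q[k] * (1 / q[k]) * X[np.ix_(rows[j], cols[k])] @ w[cols[k]]
               for k in range(n))
v_expect = sum(p[i] * (1 / (m * p[i])) * X[np.ix_(rows[i], cols[l])].T @ alpha[rows[i]]
               for i in range(m))

assert np.allclose(u_expect, u_full) and np.allclose(v_expect, v_full)
```

The same check passes for any valid distributions $p$ and $q$, since the probability weights cancel the importance-sampling factors.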
It can be shown that Algorithm 1 converges to a saddle point of $L(w,\alpha)$ with either choice (12) or (13) in Step 3, provided suitable step sizes $\sigma_j$ and $\tau_l$ are used. As expected, using the stochastic gradients in (13) leads to a slower convergence rate than using (12). However, (13) has the advantage of much less computation per iteration: it employs only one block matrix-vector multiplication for both updates, instead of the $n$ and $m$ block multiplications required in (12).
More importantly, the choice in (13) is suitable for parallel and distributed computing. To see this, let $(j(t),l(t))$ denote the pair of random indices drawn at iteration $t$ (we omit the superscript $(t)$ to simplify notation whenever there is no confusion from the context). Suppose that for a sequence of $s$ consecutive iterations $t,\ldots,t+s-1$, there is no common index among $j(t),\ldots,j(t+s-1)$, nor among $l(t),\ldots,l(t+s-1)$; then these $s$ iterations can be done in parallel and produce the same updates as if done sequentially. If there are $s$ processors or machines, each can carry out one iteration, which includes the updates in (13) as well as (10) and (11). These iterations are independent of each other, and in fact can be done in any order, because each one involves only one primal block $w_{l(t)}$ and one dual block $\alpha_{j(t)}$, for both input and output (the variables on the right and left sides of the assignments, respectively). In contrast, the inputs to the updates in (12) depend on all primal and dual blocks at the previous iteration, so they cannot be done in parallel.
In practice, suppose we have $m$ machines for solving problem (9), and each machine $i$ holds the data matrix $X_{i:}$ in memory and maintains the dual block $\alpha_i$, for $i=1,\ldots,m$. We assume that the number of model partitions $n$ is larger than $m$, and the model blocks $w_1,\ldots,w_n$ are stored at one or more parameter servers. In the beginning, we can randomly pick $m$ model blocks (sampling without replacement) from $w_1,\ldots,w_n$, and assign each machine to update one of them. If machine $i$ is assigned to update block $k$, then both $\alpha_i$ and $w_k$ are updated, using only the matrix $X_{ik}$; moreover, the machine needs to communicate only the block $w_k$ with the parameter server that is responsible for maintaining it. Whenever one machine finishes its update, a scheduler can randomly pick another parameter block that is not currently being updated by any machine, and assign it to the free machine. Therefore all machines can work in parallel, in an asynchronous, event-driven manner. Here an event is the completion of a block update at any machine, as illustrated in Figure 2. We will discuss the implementation details in Section 7.
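The event-driven assignment described above can be sketched as a toy single-process simulation (the scheduler logic here is our illustration; real implementation details are the subject of Section 7):

```python
import random

def schedule(m=3, n=8, events=10, seed=0):
    """Toy event-driven scheduler: each of m workers always holds one of
    n > m model blocks; when a worker finishes, it swaps for a free block."""
    rng = random.Random(seed)
    free = set(range(n))
    held = {}                        # worker -> block currently being updated
    for worker in range(m):          # initial assignment without replacement
        blk = rng.choice(sorted(free))
        free.remove(blk)
        held[worker] = blk
    log = []
    for _ in range(events):          # an "event" = some worker finishes
        worker = rng.randrange(m)
        free.add(held[worker])       # release the finished block
        blk = rng.choice(sorted(free))
        free.remove(blk)
        held[worker] = blk
        log.append((worker, blk))
        # invariant: no two workers ever update the same block
        assert len(set(held.values())) == m
    return log

schedule()
```

Because a block is returned to the free pool before a new one is drawn, no two machines ever touch the same $w_k$ concurrently, which is exactly the condition that made the iterations order-independent above.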
The idea of using doubly stochastic updates for distributed optimization is not new. It has been studied by Yun et al. (2014) for solving the matrix completion problem, and by Matsushima et al. (2014) for solving the saddle-point formulation of the ERM problem. Despite their nice features for parallelization, these algorithms inherit the $O(1/\sqrt{t})$ (or $O(1/t)$ with strong convexity) sublinear convergence rate of the classical stochastic gradient method, which translates into high communication and computation cost for distributed optimization. In this paper, we propose new variants of doubly stochastic update algorithms that use variance-reduced stochastic gradients (Step 3 of Algorithm 1). More specifically, we borrow the variance-reduction techniques from SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014) to develop the DSCOVR algorithms, which enjoy fast linear rates of convergence. In the rest of this section, we summarize our theoretical results characterizing the three measures for DSCOVR: computation complexity, communication complexity, and synchronization cost. We compare them with distributed implementations of batch first-order algorithms.
2.1 Summary of Main Results
Throughout this paper, we use $\|\cdot\|$ to denote the standard Euclidean norm for vectors. For matrices, $\|\cdot\|$ denotes the operator (spectral) norm and $\|\cdot\|_F$ denotes the Frobenius norm. We make the following assumption regarding the optimization problem (1).
Each $f_i$ is convex and differentiable, and its gradient is $(1/\gamma_i)$-Lipschitz continuous, i.e.,
$$\|\nabla f_i(u)-\nabla f_i(v)\|\;\le\;\frac{1}{\gamma_i}\|u-v\|,\qquad\forall\,u,v\in\mathbb{R}^{N_i}. \tag{14}$$
In addition, the regularization function $g$ is $\lambda$-strongly convex, i.e.,
$$g(w')\;\ge\;g(w)+\xi^T(w'-w)+\frac{\lambda}{2}\|w'-w\|^2,\qquad\forall\,\xi\in\partial g(w),\ \ w,w'\in\mathbb{R}^d.$$
The condition (14) is often referred to as $f_i$ being $(1/\gamma_i)$-smooth. To simplify discussion, here we assume $\gamma_i=\gamma$ for $i=1,\ldots,m$. Under these assumptions, each composite function $f_i(X_{i:}w)$ has a smoothness parameter $\|X_{i:}\|^2/\gamma$ (an upper bound on the largest eigenvalue of its Hessian). Their average $\frac{1}{m}\sum_{i=1}^m f_i(X_{i:}w)$ has a smoothness parameter $\|X\|^2/(m\gamma)$, which is no larger than the average of the individual smoothness parameters $\frac{1}{m}\sum_{i=1}^m\|X_{i:}\|^2/\gamma$. We define a condition number for problem (1) as the ratio between this smoothness parameter and the convexity parameter $\lambda$ of $g$:
$$\kappa_{\mathrm{bat}} \;=\; \frac{\|X\|^2}{m\lambda\gamma}, \tag{15}$$
where $\|X\|$ is the spectral norm of the overall data matrix $X$. This condition number is a key factor characterizing the iteration complexity of batch first-order methods for solving problem (1), i.e., minimizing $P(w)$. Specifically, to find a $w$ such that $P(w)-P(w^\star)\le\epsilon$, the proximal gradient method requires $O\bigl((1+\kappa_{\mathrm{bat}})\log(1/\epsilon)\bigr)$ iterations, and its accelerated variants require $O\bigl((1+\sqrt{\kappa_{\mathrm{bat}}})\log(1/\epsilon)\bigr)$ iterations (e.g., Nesterov, 2004; Beck and Teboulle, 2009; Nesterov, 2013). Primal-dual first-order methods for solving the saddle-point problem (8) share the same complexity (Chambolle and Pock, 2011, 2015).
A fundamental baseline for evaluating any distributed optimization algorithm is the distributed implementation of batch first-order methods. Let's consider solving problem (1) using the proximal gradient method. During every iteration $t$, each machine receives a copy of $w^{(t)}$ from a master machine (through Broadcast), and computes the local gradient $z_i^{(t)}=X_{i:}^T\nabla f_i(X_{i:}w^{(t)})\in\mathbb{R}^d$. Then a collective communication is invoked to compute the batch gradient $z^{(t)}=\frac{1}{m}\sum_{i=1}^m z_i^{(t)}$ at the master (Reduce). The master then takes a proximal gradient step, using $z^{(t)}$ and the proximal mapping of $g$, to compute the next iterate $w^{(t+1)}$, and broadcasts it to every machine for the next iteration. We can also use the AllReduce operation in MPI to obtain $z^{(t)}$ at each machine without a master. In either case, the total number of passes over the data is twice the number of iterations (due to matrix-vector multiplications using both $X_{i:}$ and $X_{i:}^T$), and the number of vectors in $\mathbb{R}^d$ sent/received across all machines is $2m$ times the number of iterations (see Table 1). Moreover, all communications are collective and synchronous.
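This baseline can be sketched in a single process by replacing the Broadcast/Reduce steps with in-memory operations. The sketch below (our illustration, with squared loss, $\ell_2$ regularization, and synthetic data) runs the proximal gradient iteration and checks that the objective decreases:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, Ni, lam = 4, 6, 5, 0.1
Xs = [rng.standard_normal((Ni, d)) for _ in range(m)]   # local data X_i
ys = [rng.standard_normal(Ni) for _ in range(m)]

def objective(w):
    # (1/m) sum_i f_i(X_i w) + g(w), squared loss and l2 regularizer
    return np.mean([0.5 * np.mean((X @ w - y) ** 2)
                    for X, y in zip(Xs, ys)]) + 0.5 * lam * (w @ w)

w, eta = np.zeros(d), 0.05
for t in range(200):
    # "Broadcast" w; machine i computes grad of f_i(X_i w); "Reduce" sums.
    local = [X.T @ (X @ w - y) / len(y) for X, y in zip(Xs, ys)]
    grad = np.mean(local, axis=0)            # stands in for MPI AllReduce
    # proximal step: for g(w) = (lam/2)||w||^2 the prox has a closed form
    w = (w - eta * grad) / (1.0 + eta * lam)

assert objective(w) < objective(np.zeros(d))
```

Each loop iteration corresponds to one round of Broadcast + Reduce in the distributed setting, i.e., $2m$ vectors in $\mathbb{R}^d$ of synchronous communication.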
| Algorithms | Computation complexity (number of passes over data) | Communication complexity (number of vectors in $\mathbb{R}^d$) |
| --- | --- | --- |
| batch first-order methods | $O\bigl((1+\kappa_{\mathrm{bat}})\log(1/\epsilon)\bigr)$ | $O\bigl(m(1+\kappa_{\mathrm{bat}})\log(1/\epsilon)\bigr)$ |
| accelerated batch first-order methods | $O\bigl((1+\sqrt{\kappa_{\mathrm{bat}}})\log(1/\epsilon)\bigr)$ | $O\bigl(m(1+\sqrt{\kappa_{\mathrm{bat}}})\log(1/\epsilon)\bigr)$ |
Since DSCOVR is a family of randomized algorithms for solving the saddle-point problem (8), we would like to find a pair $(w,\alpha)$ whose distance to the saddle point $(w^\star,\alpha^\star)$ is at most $\epsilon$, in expectation and with high probability. We list the communication and computation complexities of DSCOVR in Table 1, comparing them with batch first-order methods. Similar guarantees also hold for reducing the duality gap $P(w)-D(\alpha)$, where $P$ and $D$ are defined in (6) and (7), respectively.
The key quantity characterizing the complexities of DSCOVR is a condition number $\kappa_{\mathrm{rand}}$, which can be defined in several different ways. If we pick the data block $i$ and model block $k$ with uniform distributions, i.e., $p_i=1/m$ for $i=1,\ldots,m$ and $q_k=1/n$ for $k=1,\ldots,n$, then
$$\kappa_{\mathrm{rand}} \;=\; \frac{n\,\max_{i,k}\|X_{ik}\|^2}{\lambda\gamma}.$$
Comparing with the definition of $\kappa_{\mathrm{bat}}$ in (15), we have $\kappa_{\mathrm{bat}}\le\kappa_{\mathrm{rand}}$ because
$$\frac{\|X\|^2}{m}\;\le\;\frac{1}{m}\sum_{i=1}^m\|X_{i:}\|^2\;\le\;\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^n\|X_{ik}\|^2\;\le\;n\,\max_{i,k}\|X_{ik}\|^2.$$
Using the Frobenius norms $\|X_{i:}\|_F$ and $\|X_{:k}\|_F$ of the row and column blocks, we can also define an alternative version of $\kappa_{\mathrm{rand}}$. In this case, we again have $\kappa_{\mathrm{bat}}\le\kappa_{\mathrm{rand}}$ because $\|X\|\le\|X\|_F$. Finally, if we pick the pair $(i,k)$ with non-uniform distributions proportional to $\|X_{i:}\|_F^2$ and $\|X_{:k}\|_F^2$, then we can define a third variant of $\kappa_{\mathrm{rand}}$.
From Table 1, we observe speed-ups in computation complexity similar to those obtained by variance-reduction techniques over batch first-order algorithms for convex optimization (e.g., Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Lan and Zhou, 2015; Allen-Zhu, 2017), as well as for convex-concave saddle-point problems (Zhang and Xiao, 2017; Balamurugan and Bach, 2016). Basically, the DSCOVR algorithms can improve upon batch first-order methods by a factor of $m$ (for non-accelerated algorithms) or $\sqrt{m}$ (for accelerated algorithms), but with a worse condition number. In the worst case, the ratio between $\kappa_{\mathrm{rand}}$ and $\kappa_{\mathrm{bat}}$ may be of order $m$ or larger, thus canceling the potential improvements.
Table 2: for each DSCOVR algorithm, the synchronous communication (number of vectors in $\mathbb{R}^d$) and the asynchronous communication (equivalent number of vectors in $\mathbb{R}^d$).
More interestingly, DSCOVR achieves similar improvements in communication complexity over batch first-order methods. In Table 2, we decompose the communication complexity of DSCOVR into synchronous and asynchronous communication. The decomposition turns out to differ depending on the variance-reduction technique employed: SVRG (Johnson and Zhang, 2013) versus SAGA (Defazio et al., 2014). We note that DSCOVR-SAGA essentially requires only asynchronous communication, because synchronous communication is needed only for initialization with a non-zero starting point.
The comparisons in Tables 1 and 2 give a good understanding of the complexities of the different algorithms. However, these complexities are not accurate measures of their performance in practice. For example, collective communication of $m$ vectors in $\mathbb{R}^d$ can often be done in parallel over a spanning tree of the underlying communication network, thus costing only $O(\log m)$ times (instead of $m$ times) as much as sending a single vector. Also, for point-to-point communication, sending one vector in $\mathbb{R}^d$ all at once can be much faster than sending several smaller vectors of total length $d$ separately. A fair comparison in terms of wall-clock time on a real-world distributed computing system requires customized, efficient implementations of the different algorithms. We will shed some light on timing comparisons with numerical experiments in Section 8.
2.2 Related Work
There is an extensive literature on distributed optimization. Many algorithms developed for machine learning adopt the centralized communication setting, due to the wide availability of supporting standards and platforms such as MPI, MapReduce and Spark (as discussed in the introduction). They include parallel implementations of the batch first-order and second-order methods (e.g., Lin et al., 2014; Chen et al., 2014; Lee et al., 2017), ADMM (Boyd et al., 2011), and distributed dual coordinate ascent (Yang, 2013; Jaggi et al., 2014; Ma et al., 2015).
For minimizing the average function $\frac{1}{m}\sum_{i=1}^m f_i(w)$, in the centralized setting and with only first-order oracles (i.e., gradients of the $f_i$'s or their conjugates), it has been shown that distributed implementations of accelerated gradient methods achieve the optimal convergence rate and communication complexity (Arjevani and Shamir, 2015; Scaman et al., 2017). The problem (1) we consider has the extra structure of composition with a linear transformation by the local data, which allows us to exploit simultaneous data and model parallelism using randomized algorithms and obtain improved communication and computation complexity.
Most work on asynchronous distributed algorithms exploits model parallelism in order to reduce the synchronization cost, especially in the setting with parameter servers (e.g., Li et al., 2014; Xing et al., 2015; Aytekin et al., 2016). Besides, the delay caused by asynchrony can be incorporated into the step size to gain practical improvements in convergence (e.g., Agarwal and Duchi, 2011; McMahan and Streeter, 2014; Sra et al., 2016), though the theoretical sublinear rates remain. There is also much recent work on asynchronous parallel stochastic gradient and coordinate-descent algorithms for convex optimization (e.g., Recht et al., 2011; Liu et al., 2014; Shi et al., 2015; Reddi et al., 2015; Richtárik and Takáč, 2016; Peng et al., 2016). When the workloads or computing power of different machines or processors are nonuniform, these methods may significantly increase iteration efficiency (number of iterations done in unit time), but often at the cost of requiring more iterations than their synchronous counterparts (due to delays and stale updates). So there is a subtle balance between iteration efficiency and iteration complexity (e.g., Hannah and Yin, 2017). Our discussions in Section 2.1 show that DSCOVR is capable of improving both aspects.
For solving bilinear saddle-point problems with a finite-sum structure, Zhang and Xiao (2017) proposed a randomized algorithm that works with dual coordinate update but full primal update. Yu et al. (2015) proposed a doubly stochastic algorithm that works with both primal and dual coordinate updates based on equation (12). Both of them achieved accelerated linear convergence rates, but neither can be readily applied to distributed computing. In addition, Balamurugan and Bach (2016) proposed stochastic variance-reduction methods (also based on SVRG and SAGA) for solving more general convex-concave saddle point problems. For the special case with bilinear coupling, they obtained similar computation complexity as DSCOVR. However, their methods require full model updates at each iteration (even though working with only one sub-block of data), thus are not suitable for distributed computing.
With additional assumptions and structure, such as similarity between the local cost functions at different machines or using second-order information, it is possible to obtain better communication complexity for distributed optimization; see, e.g., Shamir et al. (2014); Zhang and Xiao (2015); Reddi et al. (2016). However, these algorithms rely on much more computation at each machine for solving a local sub-problem at each iteration. With additional memory and preprocessing at each machine, Lee et al. (2015) showed that SVRG can be adapted for distributed optimization to obtain low communication complexity.
3 The DSCOVR-SVRG Algorithm
From this section to Section 6, we present several realizations of DSCOVR using different variance reduction techniques and acceleration schemes, and analyze their convergence properties. These algorithms are presented and analyzed as sequential randomized algorithms. We will discuss how to implement them for asynchronous distributed computing in Section 7.
Algorithm 2 is a DSCOVR algorithm that uses the technique of SVRG (Johnson and Zhang, 2013) for variance reduction. The iterations are divided into stages, and each stage has an inner loop. Each stage $s$ is initialized by a pair of vectors $\bar w^{(s)}\in\mathbb{R}^d$ and $\bar\alpha^{(s)}\in\mathbb{R}^N$, which come from either the initialization (if $s=0$) or the last iterate of the previous stage (if $s>0$). At the beginning of each stage, we compute the batch gradients
$$\bar u^{(s)} = X\,\bar w^{(s)}, \qquad \bar v^{(s)} = \frac{1}{m}X^T\bar\alpha^{(s)}.$$
The vectors $\bar u^{(s)}\in\mathbb{R}^N$ and $\bar v^{(s)}\in\mathbb{R}^d$ share the same partitions as $\alpha$ and $w$, respectively. Inside each stage $s$, the variance-reduced stochastic gradients are computed in (19) and (20). It is easy to check that they are unbiased. More specifically, taking expectation of $u_j^{(t+1)}$ with respect to the random index $l$ gives
$$\mathbf{E}_l\bigl[u_j^{(t+1)}\bigr] = \bar u_j^{(s)} + \sum_{k=1}^n q_k\,\frac{1}{q_k}X_{jk}\bigl(w_k^{(t)}-\bar w_k^{(s)}\bigr) = \bar u_j^{(s)} + X_{j:}\bigl(w^{(t)}-\bar w^{(s)}\bigr) = X_{j:}\,w^{(t)},$$
and taking expectation of $v_l^{(t+1)}$ with respect to the random index $j$ gives
$$\mathbf{E}_j\bigl[v_l^{(t+1)}\bigr] = \bar v_l^{(s)} + \sum_{i=1}^m p_i\,\frac{1}{m\,p_i}(X_{il})^T\bigl(\alpha_i^{(t)}-\bar\alpha_i^{(s)}\bigr) = \frac{1}{m}(X_{:l})^T\alpha^{(t)}.$$
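The first calculation can again be verified numerically: summing the SVRG-style estimate over all choices of the random index $l$, weighted by $q_l$, recovers the exact partial gradient. A small sketch with made-up dimensions (our illustration, writing the estimate inline under the assumption $\bar u^{(s)}=X\bar w^{(s)}$ as above):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 3
Ni, dk = [2, 2, 2], [2, 2, 2]
N, d = sum(Ni), sum(dk)
X = rng.standard_normal((N, d))
rows = np.split(np.arange(N), np.cumsum(Ni)[:-1])
cols = np.split(np.arange(d), np.cumsum(dk)[:-1])
q = np.full(n, 1.0 / n)

w = rng.standard_normal(d)
w_bar = w + 0.1 * rng.standard_normal(d)     # snapshot point of the stage
j = 0
u_bar_j = X[rows[j]] @ w_bar                 # block of the batch gradient

# SVRG-style estimate: u_j = u_bar_j + (1/q_l) X_{jl} (w_l - w_bar_l);
# take its exact expectation over the random index l.
u_expect = sum(q[l] * (u_bar_j + (1 / q[l]) * X[np.ix_(rows[j], cols[l])]
                       @ (w[cols[l]] - w_bar[cols[l]]))
               for l in range(n))

assert np.allclose(u_expect, X[rows[j]] @ w)  # unbiased: E_l[u_j] = X_{j:} w
```

Note also that the closer $w^{(t)}$ is to the snapshot $\bar w^{(s)}$, the smaller the variance of the estimate, which is the mechanism behind the linear convergence of the SVRG variant.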
In order to measure the distance of any pair of primal and dual variables to the saddle point, we define a weighted squared Euclidean norm on $\mathbb{R}^{d+N}$. Specifically, for any pair $(w,\alpha)$ where $w\in\mathbb{R}^d$ and $\alpha=[\alpha_1;\ldots;\alpha_m]\in\mathbb{R}^N$ with $\alpha_i\in\mathbb{R}^{N_i}$, we define
$$\Omega(w,\alpha) \;=\; \lambda\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \gamma_i\|\alpha_i\|^2.$$
If $\gamma_i=\gamma$ for all $i=1,\ldots,m$, then $\Omega(w,\alpha)=\lambda\|w\|^2+\frac{\gamma}{m}\|\alpha\|^2$. We have the following theorem concerning the convergence rate of Algorithm 2.
Suppose Assumption 1 holds, and let $(w^\star,\alpha^\star)$ be the unique saddle point of $L(w,\alpha)$. Let $\Gamma$ be a constant satisfying a condition determined by the block norms $\|X_{ik}\|$, the sampling probabilities $p_i,q_k$, and the parameters $\lambda$ and $\gamma$.
In Algorithm 2, if we choose the step sizes $\sigma_i$ and $\tau_k$ accordingly, and the number of iterations $M$ during each stage satisfies $M\ge\log(3)\,\Gamma$, then for any $s\ge 1$,