The running time of algorithms on a distributed-memory, parallel machine depends on the number of arithmetic operations (computation) and the cost of data movement (communication). In distributed-memory settings, communication cost includes the "message size cost," i.e., the amount, $W$, of data exchanged between processors over a network (bandwidth cost), and the "synchronization cost," i.e., the number, $L$, of messages sent, where a message is used for interprocessor synchronization (latency cost). On modern parallel computer architectures, communicating data often takes much longer than performing a floating-point operation, and this gap is continuing to increase. Therefore, it is important to design algorithms that minimize communication in order to attain high performance, and recent results on communication-avoiding numerical linear algebra (CA-NLA) illustrate the potential for faster algorithms by avoiding interprocessor communication.
Our goal in this work is to extend results from CA-NLA to convex optimization to accelerate machine learning (ML) applications. In particular, we are interested in reducing communication cost for iterative convex optimization on distributed machines. Communication is a bottleneck for these methods since they often require synchronization at every iteration. Our goal in this paper is to avoid this communication for $s$ iterations, where $s$ is a tuning parameter, without altering the convergence rates or numerical stability of the existing methods. We will show how communication-avoiding techniques developed for NLA can be used to derive faster convex optimization algorithms.
We are interested in solving the class of optimization problems which take the form
$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + \lambda g(x),$$
where $f$ is a convex loss function with regularization function, $g$, and parameter $\lambda > 0$, given a data matrix $A \in \mathbb{R}^{m \times n}$ with $m$ data points and $n$ features, labels $b \in \mathbb{R}^m$, and the solution to the minimization problem, $x^* \in \mathbb{R}^n$. In particular, we would like to focus on sparse convex optimization problems: sparse proximal least-squares with sparsity-inducing regularization functions
$$g(x) = \|x\|_1 \quad \text{(Lasso)} \qquad \text{or} \qquad g(x) = \sum_i \|x_{\mathcal{G}_i}\|_2 \quad \text{(Group Lasso)},$$
where the $\mathcal{G}_i$ are disjoint blocks of $\{1, \ldots, n\}$.
We also consider support vector machines (SVMs)
$$\min_{w \in \mathbb{R}^n} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{m} \max\!\left(1 - b_i\, w^\top a_i,\; 0\right),$$
where $a_i$ is the $i$-th row of $A$, $b_i$ is the corresponding binary label ($b_i \in \{-1, +1\}$), and $C$ is a regularization parameter. We present our results for proximal least-squares using Lasso regularization, but they hold more generally for other regularization functions with well-defined proximal operators (Elastic-Nets, Group Lasso, etc.). By proximal operator we mean that the non-linearity due to the regularization function, for example with Lasso, can be defined for a constant, $\eta$, and element-wise on a vector, $v$, as:
$$S_{\eta\lambda}(v)_i = \operatorname{sign}(v_i)\,\max\!\left(|v_i| - \eta\lambda,\; 0\right), \qquad (2)$$
where $S_{\eta\lambda}$ is the well-known soft-thresholding operator for Lasso [4, 8, 9]. Proximal operators, similar to (2), can be defined for other non-linear regularizers (i.e., Elastic-Nets, Group Lasso, etc.). Lasso creates solution sparsity during the optimization process by setting elements of the solution vector, $x$, exactly to zero. SVM
introduces sparsity (by its loss function definition) since we seek a small number of support vectors which separate data belonging to different classes. The sparsity-inducing nature of both optimization problems is important (and widely used) when dealing with high-dimensional data. In the case of high-dimensional and large-scale data, it becomes equally important to parallelize the computations in order to quickly and efficiently solve these Lasso and SVM problems.
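As a concrete sketch, the element-wise soft-thresholding operator in (2) can be written as follows. The function names are ours, not from the paper's implementation, and the threshold argument stands in for the product of the step-size constant and the regularization parameter:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Scalar soft-thresholding: sign(v) * max(|v| - t, 0).
// copysign transfers the sign of v onto the shrunken magnitude.
double soft_threshold(double v, double t) {
    return std::copysign(std::max(std::fabs(v) - t, 0.0), v);
}

// Proximal operator of t*||.||_1, applied element-wise to a vector.
std::vector<double> prox_l1(const std::vector<double>& v, double t) {
    std::vector<double> out(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        out[i] = soft_threshold(v[i], t);
    return out;
}
```

Entries whose magnitude is below the threshold are set exactly to zero, which is the mechanism behind the solution sparsity discussed above.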
There exist many algorithms to solve proximal least-squares [10, 11, 12, 13, 14] and SVM problems [15, 16, 17, 18, 19]. In this paper we focus on randomized variants of accelerated and non-accelerated Coordinate Descent (CD) and Block Coordinate Descent (BCD) since they have optimal convergence rates among the class of first-order methods [11, 10, 12, 13, 19, 16].
CD is a popular ML technique for solving optimization problems which updates a single element (i.e., coordinate) of $x$ by minimizing a one-dimensional subproblem. BCD generalizes CD with a tunable block size parameter, $b$, by updating $b$ coordinates of $x$ and minimizing a $b$-dimensional subproblem. This process is repeated until a termination criterion is met. Solving the subproblem initially requires the computation of several dot-products with a subset of rows or columns of the data matrix, $A$, its transpose, and residual vectors. Once they are computed, the remaining computations are scalar and vector operations which update the solution and residual vectors (and scalar quantities) for the next iteration.
Figure 1 illustrates the computations in parallel BCD. We assume that $A$ is partitioned so that the dot-products can be computed in parallel by all processors (in this case row-partitioned). Vectors in $\mathbb{R}^m$ are similarly partitioned (in this case the residual, $r$), but vectors in $\mathbb{R}^n$ are replicated (in this case the solution, $x$). For $b > 1$ the dot-products become GEMM (GEneralized Matrix-Multiplication) computations. After the dot-product/GEMM computations, an MPI_Allreduce with summation combines each processor's contributions. Since Allreduce redundantly stores results on all processors, computing the solution to the subproblem and updating all vectors can be performed without communication. Finally, this process is repeated until a termination criterion is met. The mathematical details (and complexity) of solving the subproblem and updating vectors might vary based on the optimization problem (proximal least-squares, SVM, etc.). However, the parallel BCD methods we consider in this paper can be summarized by Figure 1. Each iteration requires synchronization; therefore, avoiding these synchronization costs could lead to faster and more scalable BCD methods for proximal least-squares and SVM.
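The per-iteration pattern of Figure 1 — local partial dot-products over each processor's block of rows, combined by an Allreduce — can be sketched as follows. This is our own illustrative serial stand-in, not the paper's code: the loop over `p` plays the role of the processors and the final accumulation plays the role of MPI_Allreduce(MPI_SUM):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Forms the b-dimensional vector g = Y^T r, where Y is the b sampled
// columns of the row-partitioned matrix A and r is the (row-partitioned)
// residual. Each simulated processor computes a partial sum over its rows;
// in the real distributed implementation the final accumulation is a single
// MPI_Allreduce with MPI_SUM that replicates g on all processors.
std::vector<double> block_gradient(
    const std::vector<std::vector<double>>& A,  // m x n, row-major
    const std::vector<double>& r,               // residual, length m
    const std::vector<int>& cols,               // b sampled column indices
    int nprocs) {
  int m = (int)A.size(), b = (int)cols.size();
  std::vector<double> g(b, 0.0);
  int rows_per = (m + nprocs - 1) / nprocs;
  for (int p = 0; p < nprocs; ++p) {            // each iteration = one "processor"
    std::vector<double> partial(b, 0.0);        // local contribution
    for (int i = p * rows_per; i < std::min(m, (p + 1) * rows_per); ++i)
      for (int j = 0; j < b; ++j)
        partial[j] += A[i][cols[j]] * r[i];
    for (int j = 0; j < b; ++j) g[j] += partial[j];  // the Allreduce step
  }
  return g;
}
```

The result is independent of how the rows are split across processors, which is why the subsequent subproblem solve can proceed redundantly on every processor without further communication.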
We present synchronization-avoiding (SA) derivations of existing accelerated and non-accelerated CD/BCD methods through mathematical re-formulation. We show that our methods avoid synchronization for $s$ iterations without altering the numerical stability or convergence behavior in exact arithmetic. We implement the SA-methods in C++ using MPI and show that they perform significantly faster and are more scalable on up to thousands of cores of a Cray XC30 supercomputer on LIBSVM datasets.
The rest of the paper is organized as follows: Section II discusses related approaches and how ours differs. Section III derives SA-Lasso and Section IV shows its experimental results. Section V derives SA-SVM and Section VI shows its experimental results. Finally, we summarize and discuss future work in Section VII.
II. Related Work
Communication-Avoiding in Linear Algebra
This paper extends the $s$-step iterative methods work in NLA and subsequent work on CA iterative linear algebra surveyed by Ballard et al. We extend these results to widely used and important problems in ML, namely proximal least-squares and SVM, thereby expanding and generalizing CA-NLA results to the field of convex optimization.
Communication-Efficient Machine Learning
Recent work along these lines, like proxCoCoA+, proposes frameworks that reduce the communication bottleneck. ProxCoCoA+ performs computation on only locally stored data and then combines contributions from all processors. The communication benefits of proxCoCoA+ are inherited from the iteration complexity of performing Newton-type steps on locally stored data.
DUAL-LOCO  introduces a framework which reduces communication between processors by communicating a small matrix of randomly projected features. While DUAL-LOCO requires just one communication round, it does not apply to proximal least-squares problems (i.e., Lasso, elastic-net, etc.) and introduces an (additive) approximation error.
CA-SVM eliminates communication in SVM by performing an initial $k$-means clustering as a pre-processing step to partition the data and subsequently training SVM classifiers locally on each processor. Communication is reduced significantly, but at the cost of accuracy. Like proxCoCoA+, CA-SVM uses a local SVM solver which can be replaced with our SA-variant.
Prior work applies a similar approach to ours and derives an SA version of SVM using Stochastic Gradient Descent (SGD) as the optimization method. Our work extends this SA approach to different methods (accelerated and non-accelerated CD/BCD) and to other non-linear optimization problems (proximal least-squares).
Devarakonda et al.  derive SA-variants of primal and dual BCD methods for L2 regularized least-squares. Our work extends their results to non-linear problems and to sparse convex optimization.
Soori et al. derive CA-variants of novel stochastic FISTA (SFISTA) and stochastic Newton (SPNM) methods for the proximal least-squares problem. They illustrate that standard loop unrolling techniques can be applied to SFISTA and SPNM to obtain communication-avoiding variants. Our work, on the other hand, solves the proximal least-squares and SVM problems with accelerated and non-accelerated BCD methods. Furthermore, our synchronization-avoiding technique is adapted from the CA-Krylov and $s$-step Krylov subspace methods work [29, 30, 31, 32].
Our SA technique derives alternate forms of existing proximal least-squares and SVM methods by re-arranging the computations to obtain $s$ solution updates per communication round. This allows us to obtain an algorithm whose convergence behavior and sequence of solution updates are equivalent to those of the original algorithm.
III. Synchronization-Avoiding LASSO
In this section, we derive an SA version of the accelerated block coordinate descent (accBCD) algorithm for the Lasso problem. The derivation of SA-accBCD (Synchronization-Avoiding accBCD) relies on unrolling the vector update recurrences $s$ times and re-arranging the updates and dependent computations to avoid synchronization.
Given a matrix $A \in \mathbb{R}^{m \times n}$ with $m$ data points and $n$ features, a vector of labels $b \in \mathbb{R}^m$, and regularization parameter $\lambda > 0$, the Lasso problem aims to find the solution $x^* \in \mathbb{R}^n$ that solves the optimization problem:
$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1.$$
This problem can be solved with many iterative algorithms. We consider the accelerated BCD (accBCD) algorithm described in [11, Algorithm 2] (see Algorithm 1 in this paper). Nesterov's acceleration is adapted to BCD through the introduction of additional vector and scalar updates. The solution vector at each iteration is implicitly computed as a combination of the auxiliary vectors, but need not be explicitly formed until termination. Aside from complications due to acceleration, the remaining computations follow the BCD steps illustrated in Figure 1.
We assume that $A$ is 1D-row partitioned so that the partial dot-products (lines 8 and 9 in Alg. 1) can be computed on each processor. Each iteration of Alg. 1 randomly selects $b$ column indices and extracts those columns from $A$, computes dot-products with them, and then computes the solution to the subproblem and updates scalars and vectors for the next iteration. We assume that $H$ iterations of the algorithm are performed (where $H$ depends on the termination criterion). Line 12 in Alg. 1 applies the soft-thresholding function defined in (2) required for Lasso. At each iteration we compute the optimal Lipschitz constant, which is the largest eigenvalue of the small $b \times b$ Gram matrix (line 10 in Alg. 1). Since this Gram matrix is replicated on all processors (after the MPI_Allreduce), line 10 does not require communication. An approximate Lipschitz constant can be used, but we compute the optimal constant for fast convergence. The recurrences in lines 10-18 of Alg. 1 can be unrolled to avoid synchronization. We begin the SA derivation by changing the loop index from $h$ to $sk + j$, where $k$ is the outer loop index, $s$ is the recurrence unrolling parameter, and $j$ is the inner loop index. Let us assume that we are at iteration $sk$ and have just computed the solution, acceleration, and residual vectors. From these, the iterates at step $sk + 1$ can be computed by unrolling the updates (we ignore scalar updates since they can be redundantly stored and computed on all processors).
By induction we can show that the solution, acceleration, and residual vectors at iteration $sk + j$ can be computed in terms of their values at iteration $sk$:
Table I: Summary of theoretical costs
|Algorithm||Ops cost (F)||Memory cost (M)||Latency cost (L)||Message Size cost (W)|
Here $\mathrm{nnz}(A)$ is the number of non-zeros (assumed uniformly distributed), $f = \mathrm{nnz}(A)/(mn)$ is the density of $A$, $P$ is the number of processors, and $s$ is the recurrence unrolling parameter. $\mathrm{nnz}(Y)$ is the number of non-zeros of the matrix $Y$ of $b$ sampled columns from $A$ at each iteration. We assume that the $b \times b$ Gram matrices computed at each iteration are dense.
for $j = 1, 2, \ldots, s$. Notice that due to the recurrence unrolling we can defer the updates to the solution, acceleration, and residual vectors for $s$ iterations. The summation in (3) computes the Gram-like matrices. Synchronization can be avoided in these computations by computing the Gram matrix once before the inner loop (the Gram matrix is symmetric, so computing just the upper/lower triangular part reduces the flops and message size by a factor of two) and redundantly storing it on all processors. Synchronization can be avoided in the summation in (4) by initializing the random number generator on all processors to the same seed. Finally, at the end of the $s$ inner loop iterations we can perform the vector updates.
The resulting Synchronization-Avoiding accBCD (SA-accBCD) algorithm is shown in Algorithm 2. Since our SA technique relies only on rearranging the computations, the convergence rate and behavior of the standard accelerated BCD algorithm (Alg. 1) are preserved (in exact arithmetic).
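To make the equivalence concrete, the following sketch compares a simplified, non-accelerated least-squares CD (our own illustration, not the paper's Algorithm 2) against its unrolled counterpart: the unrolled version computes the Gram matrix of the $s$ sampled columns once, runs $s$ inner updates without touching the residual, and applies the deferred residual update at the end, reproducing the one-update-per-iteration recurrence:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

double dot(const Vec& u, const Vec& v) {
  double s = 0.0;
  for (size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
  return s;
}

Vec column(const Mat& A, int j) {
  Vec c(A.size());
  for (size_t i = 0; i < A.size(); ++i) c[i] = A[i][j];
  return c;
}

// Standard CD: a fresh dot-product with the current residual (the
// synchronization point) and an immediate update, once per iteration.
void cd_steps(const Mat& A, Vec& x, Vec& r, const std::vector<int>& idx) {
  for (int j : idx) {
    Vec a = column(A, j);
    double delta = dot(a, r) / dot(a, a);
    x[j] += delta;
    for (size_t i = 0; i < r.size(); ++i) r[i] -= delta * a[i];
  }
}

// s-step CD: one Gram computation (G = Y^T Y, phi = Y^T r) replaces s
// rounds of dot-products; the residual is updated once at the end.
void sa_cd_steps(const Mat& A, Vec& x, Vec& r, const std::vector<int>& idx) {
  int s = (int)idx.size();
  Mat Y;  // the s sampled columns
  for (int j : idx) Y.push_back(column(A, j));
  Mat G(s, Vec(s));
  Vec phi(s);
  for (int u = 0; u < s; ++u) {
    phi[u] = dot(Y[u], r);
    for (int v = 0; v < s; ++v) G[u][v] = dot(Y[u], Y[v]);
  }
  Vec delta(s, 0.0);
  for (int j = 0; j < s; ++j) {  // inner loop: no communication needed
    double corr = 0.0;           // effect of the j earlier deferred updates
    for (int t = 0; t < j; ++t) corr += G[j][t] * delta[t];
    delta[j] = (phi[j] - corr) / G[j][j];
    x[idx[j]] += delta[j];
  }
  for (size_t i = 0; i < r.size(); ++i)  // deferred residual update
    for (int j = 0; j < s; ++j) r[i] -= delta[j] * Y[j][i];
}
```

Both routines produce the same iterates in exact arithmetic; in floating point they agree to rounding error, which mirrors the paper's claim for SA-accBCD.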
SA-accBCD computes a larger Gram matrix every $s$ iterations, which results in a computation-communication tradeoff: SA-accBCD increases the flops and message size in order to reduce the latency cost by a factor of $s$. If the latency cost is the dominant term, then SA-accBCD can attain up to an $s$-fold speedup over accBCD. In general there exists a tradeoff between $s$ and the speedups attainable. Table I summarizes the operational, storage, and communication costs of the SA and non-SA methods.
IV. Experimental Results: LASSO
Table II: Summary of datasets
In this section we present experimental results for our SA variant and explore the numerical and performance tradeoffs. The recurrence unrolling we propose requires computation of Gram-like matrices whose condition numbers may adversely affect the numerical stability of SA-accBCD. We also re-order the sequence of updates of the solution and residual vectors, which could also lead to numerical instability. We begin in Section IV-A with experiments that illustrate that SA-accBCD is numerically stable.
IV-A. Convergence Behavior
(In Table II, $\sigma_{\min}$ is the smallest singular value; $b = 1$ for accCD and CD, and $b > 1$ for accBCD and BCD.)
We explore the tradeoff between convergence behavior, block size, and $s$ (the recurrence unrolling parameter) for the SA-accBCD algorithm and compare it to the behavior of the standard accBCD algorithm. All numerical stability experiments were performed in MATLAB version R2016b on an Intel i7. The datasets used in our experiments were obtained from the LIBSVM repository and are summarized in Table II. We measure the convergence behavior by plotting the objective function value at each iteration, with a fixed regularization parameter $\lambda$ for all experiments. We report convergence vs. iterations to illustrate any differences in convergence behavior.
Figure 2 shows the convergence behavior of the datasets in Table II for several block sizes and values of $s$ for SA-accBCD. The results show that larger block sizes converge faster than $b = 1$, but at the expense of more computation (and larger message sizes in the distributed-memory setting). Comparing SA-accBCD and accBCD, we observe no numerical stability issues for any of the values of $s$ tested, for all datasets (in theory, synchronization can be avoided for any number of iterations). Table III shows the final relative objective error of the SA-methods compared to the non-SA methods. This suggests that the additional computation and message size costs, and not numerical instability, are the performance-limiting factors.
Table III: Relative objective error
IV-B. Performance Results
In this section, we present experimental results showing that the SA-methods in Section IV-A are faster than their non-SA variants. We consider the datasets in Table II, which span over- and under-determined, sparse and dense matrices, to illustrate that the speedups are independent of those factors.
We implement the algorithms in C++ using the Message Passing Interface (MPI) for high-performance, distributed-memory parallel processing. The local linear algebra computations are performed using the Intel MKL library's Sparse and Dense BLAS routines. All methods were tested on a Cray XC30 supercomputer at NERSC, which has 24 cores and 128GB of memory per node. The implementation divides the dataset row-wise; however, the SA-methods generalize to other data layout schemes. We choose row-wise since it results in the lowest per-iteration communication cost. All datasets are stored using the Compressed Sparse Row format (3-array variant). Vectors in $\mathbb{R}^n$ are replicated on all processors and vectors in $\mathbb{R}^m$ are partitioned (similar to $A$).
Convergence vs. Running Time
Figure 3 shows convergence vs. running time for the datasets in Table II. We present experiments on CD, accCD, BCD, accBCD, and their SA variants. In all plots we observe that the accelerated methods converge faster than the non-accelerated methods, and the BCD methods converge faster than the CD methods, as expected. Since the SA-methods do not alter the convergence rates but reduce the cost per iteration, they converge in less time.
For the SA methods, we plot two values of $s$: one (in blue) where we observed the best speedups and a larger value (in red) where we observed smaller speedups. This decrease in speedup for large values of $s$ is expected, since the SA-methods trade additional message size and computation for a decrease in latency cost. We observed SA-accCD speedups on news20, covtype, url, and epsilon; the corresponding SA-accBCD speedups were smaller.
Performance Scaling and Speedups
Figure 4 shows the strong scaling performance (problem size fixed while the number of processors is increased) and the breakdown of speedups. Figures 3(a)-3(d) show the strong scaling of accCD vs. SA-accCD over different ranges of processors for the datasets tested. We observe that SA-accCD is faster for all datasets and all processor ranges. Notice that the gap between accCD and SA-accCD (depicted in log-scale) increases with the number of processors.
Figures 3(e)-3(h) show the communication, computation, and total speedups attained by SA-accCD over accCD for several values of $s$. We see large communication speedups, which eventually decrease when message size costs dominate latency costs. SA-accCD also attains modest computation speedups over accCD. This is because selecting $b$ columns of $A$ and computing the entries of the Gram matrix (for SA-accCD) is more cache-efficient (uses a BLAS-3 routine) than computing individual dot-products (uses a BLAS-1 routine). Once $s$ becomes too large, we see slowdowns.
V. Synchronization-Avoiding SVM
We are given a matrix $A \in \mathbb{R}^{m \times n}$ and labels $b \in \{-1, +1\}^m$, where $b_i$ is the binary label for observation $a_i$ (the $i$-th row of $A$). Support Vector Machines (SVM) solve the optimization problem:
$$\min_{w \in \mathbb{R}^n} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{m} \ell(w;\, a_i, b_i),$$
where $\ell$ is a loss function and $C$ is the penalty parameter. In this work, we consider the two loss functions:
$$\ell_1(w;\, a_i, b_i) = \max\!\left(1 - b_i\, w^\top a_i,\; 0\right) \qquad \text{and} \qquad \ell_2(w;\, a_i, b_i) = \max\!\left(1 - b_i\, w^\top a_i,\; 0\right)^2.$$
We refer to the first as SVM-L1 and the second as SVM-L2 (consistent with prior work). Recent work has shown that both variants of SVM can be solved efficiently using dual coordinate descent. Therefore, in this work we consider the dual optimization problem:
$$\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{2}\alpha^\top \bar{Q} \alpha - \mathbf{1}^\top \alpha \quad \text{subject to } 0 \le \alpha_i \le U,$$
where $\bar{Q} = Q + D$, with $Q_{ij} = b_i b_j\, a_i^\top a_j$ and $D$ diagonal. For SVM-L1, $U = C$ and $D_{ii} = 0$, and for SVM-L2, $U = \infty$ and $D_{ii} = 1/(2C)$. The dual problem can be solved efficiently by CD and is shown in Algorithm 3. Note that, unlike Lasso, SVM requires 1D-column partitioning in order to compute the dot-products in parallel. The recurrences defined in lines 7-11, 13, and 14 of Alg. 3 can be unrolled to avoid synchronization. We begin the SA derivation by changing the loop index from $h$ to $sk + j$, where $k$ is the outer loop index, $s$ is the (tunable) recurrence unrolling parameter, and $j$ is the inner loop index. Let us assume that we are at iteration $sk$ and have just computed the vectors $\alpha_{sk}$ and $w_{sk}$. From these, the next update can be computed by
Finally, $\alpha_{sk+1}$ and $w_{sk+1}$ can be computed by
By unrolling the vector update recurrences for $\alpha$ and $w$, we can compute the iterates at steps $sk+1, \ldots, sk+s$ in terms of $\alpha_{sk}$ and $w_{sk}$. We ignore auxiliary scalar quantities in the subsequent derivations for brevity, and introduce an auxiliary scalar for notational convenience.
By induction we can show that $\alpha_{sk+j}$ can be computed in terms of $\alpha_{sk}$ and $w_{sk}$ such that
for $j = 1, 2, \ldots, s$. Due to the recurrence unrolling, we can defer the updates to $\alpha$ and $w$ for $s$ iterations. The summation in (14) adds a previous update if the coordinate chosen for update at an earlier inner iteration is the same as the current one. Synchronization can be avoided in this step by initializing the random number generator on all processors to the same seed. The summation in (15) computes the required inner-products. Synchronization can be avoided at this step by computing the Gram-like matrix upfront at the beginning of the outer loop. Note that the diagonal elements of the resulting matrix are the quantities required in the inner loop. Finally, at the end of the $s$ inner loop iterations we can perform the vector updates for $\alpha_{s(k+1)}$ and $w_{s(k+1)}$.
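The underlying dual CD update being unrolled can be sketched as a serial pass over the coordinates (our own illustrative names and data layout; the actual implementation is parallel and synchronization-avoiding). For SVM-L1, each step is a single-variable Newton step on the dual, clipped to the box $[0, C]$, with $w$ maintained as the weighted sum of the data points:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative problem container: m x n data (rows), labels in {-1,+1},
// and the penalty parameter C.
struct Svm {
  std::vector<std::vector<double>> X;
  std::vector<double> y;
  double C;
};

// One epoch of dual coordinate descent for SVM-L1: for each coordinate i,
// take a Newton step on the dual objective, clip alpha_i to [0, C], and
// update w = sum_i alpha_i * y_i * x_i incrementally.
void dual_cd_epoch(const Svm& p, std::vector<double>& alpha,
                   std::vector<double>& w) {
  for (size_t i = 0; i < p.X.size(); ++i) {
    const auto& xi = p.X[i];
    double Qii = 0.0, wx = 0.0;
    for (size_t j = 0; j < xi.size(); ++j) {
      Qii += xi[j] * xi[j];            // diagonal entry x_i^T x_i
      wx += w[j] * xi[j];
    }
    double grad = p.y[i] * wx - 1.0;   // partial gradient of the dual
    double a_new = std::min(std::max(alpha[i] - grad / Qii, 0.0), p.C);
    for (size_t j = 0; j < xi.size(); ++j)
      w[j] += (a_new - alpha[i]) * p.y[i] * xi[j];
    alpha[i] = a_new;
  }
}
```

In the SA variant derived above, the dot-products hidden in `Qii` and `wx` are precomputed for $s$ coordinates at once via the Gram-like matrix, and the updates to `alpha` and `w` are deferred to the end of the inner loop.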
The resulting SA-SVM algorithm retains the behavior of Algorithm 3: its convergence rate and sequence of solution updates do not change (in exact arithmetic). However, in floating-point arithmetic this rearrangement could lead to numerical instability, so we empirically show in Section VI that SA-SVM is numerically stable. While our SA-variants reduce the latency cost by a factor of $s$, they increase the flops and bandwidth costs (due to the $s \times s$ Gram-like matrix). Therefore, the best choice of $s$ depends on the relative algorithmic flops, bandwidth, and latency costs and their respective hardware parameters.
VI. Experimental Results: SVM
Table IV: Summary of datasets
The recurrence unrolling results in the computation of an $s \times s$ Gram matrix, whose condition number may adversely affect numerical stability. So, we begin by verifying the stability of SA-SVM through MATLAB experiments and then illustrate the speedups attainable through our approach.
We conduct numerical stability experiments in MATLAB (same platform as Sec. IV-A) and use binary classification datasets (summarized in Table IV) from the LIBSVM repository. We measure the convergence behavior by plotting the duality gap, $P(w) - D(\alpha)$, where $P(w)$ is the primal objective value (which we can compute without running the primal SVM algorithm) and $D(\alpha)$ is the dual objective value. Note that the duality gap is a stronger criterion than the relative objective error used in Sec. IV-A. Due to strong convexity, primal and dual linear SVM have the same optimal function value [19, 15]. We use the same value of $C$ for all experiments and show results for SVM-L1 and SVM-L2.
Table V: Speedups of SA-SVM-L1 over SVM-L1
|Dataset||Processors||Algorithm||Running Time (speedup)|
|news20.binary||P = 576||SVM-L1||sec.|
|rcv1.binary||P = 240||SVM-L1||sec.|
|gisette||P = 3072||SVM-L1||sec.|
Figure 5 illustrates that the SA-variants are numerically stable and converge in the same way as their non-SA variants. SVM-L2 converges faster than SVM-L1 since its loss function is smooth. Table V shows speedups of SA-SVM-L1 over SVM-L1. SVM-L1 and SVM-L2 are initialized with different scalar quantities but are otherwise identical, so we solve the (harder) SVM-L1 problem and report performance results for it. Note that we perform offline strong scaling experiments for each dataset and report the best processor count and running time combinations. We observed speedups for rcv1.binary, news20.binary, and gisette. These speedups were attained despite load balancing issues for rcv1 and news20 when transforming datasets stored row-wise on disk to 1D-column partitioned matrices in DRAM (the Lasso experiments do not have these issues). Eliminating this overhead in future work would further improve speedups and scalability, since load imbalance decreases the effective flops rate due to stragglers. Gisette is nearly dense, hence load balance was not a problem, and we were able to strong scale SA-SVM-L1 to 3072 cores.
We showed in this paper that existing work in CA-NLA can be extended to important ML problems. We derived SA variants of CD and BCD methods and illustrated that our methods are faster and more scalable without changing convergence rates. The SA-variants attain significant speedups over their non-SA counterparts and can reduce communication by large factors. While we did not explore other parallel environments, our methods would attain even greater speedups on frameworks like Spark due to their large latency costs.
-  G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, pp. 1–155, 2014.
-  S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
-  N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
-  R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B, vol. 58, no. 1, pp. 267–288, 1996.