I Introduction
Mathematical optimization is one of the main pillars of machine learning, where parameter values are computed based on observed data. Applications in many big data and largescale scientific computing domains need the solution to convex optimization problems [1]
. The performance of these optimization problems is often dominated by communication which is closely tied to the formulation of the algorithms. A popular approach to estimate parameters in convex optimization problems is solving a regularized least square problem that can solve many illconditioned systems. L1regularized least square problems are often solved using the class of
proximal methods, called iterative shrinkagethresholding algorithms (ISTA) [2]. The advantage of ISTA is in its simplicity. However, ISTA has also been recognized as a slow method in terms of convergence rate [3]. It is well known that for largescale problems first order methods are often the only practical option. Fast iterative shrinkagethresholding algorithm (FISTA) extends ISTA while preserving the computational simplicity of ISTA and has a global rate of convergence that is significantly better, both theoretically and practically [3]. Another class of widely used proximal methods are proximal Newtontype methods (PNM) that use second order information in their proximal mapping which have a superlinear rate of convergence [4]. Stochastic formulations of FISTA and PNM, abbreviated as SFISTA and SPNM, are often more efficient for processing large data [1] and thus are used as the base algorithms in this work. These methods are iterative by nature, therefore need to communicate data in each iteration. Since communication is more expensive than arithmetic, the scalability and performance of these algorithms are bound by communication costs on distributed architectures.The performance of an algorithm on modern computing architectures depends on the costs of arithmetic and data communication between different levels of memory and processors over a network. The communication cost of an algorithm is computed by adding bandwidth costs, the time required to move words, and latency costs, the time to communicate messages between processors. On modern computer architectures the cost of communication is often orders of magnitude larger than floating point operation costs and this gap is increasing [5]. Classical formulations of optimization methods are inherently communicationbound. Therefore, to achieve highperformance on modern distributed platforms, optimization methods have to be reformulated to minimize communication and data movement rather than arithmetic complexity [6].
Recent work has developed algorithms to reduce communication in a number of optimization and machine learning methods. Communication efficient optimization libraries such as [7, 8, 9] attempt to reduce communication, though they may change the convergence behavior of the algorithm. For example, CoCoA [7] uses a local solver on each machine and shares information between the solvers with highly flexible communication schemes. HOGWILD! [8]
implements stochastic gradient descent (SGD) without locking which achieves a nearly optimal convergence rate (compared to its serial counterpart) only if the optimization problem is sparse. KAVG
[9] modifies SGD by communicating every k iterations. This method changes the convergence behavior of SGD by arguing that frequent synchronization does not always lead to faster convergence. You et. al. [10]present a partitioning algorithm to efficiently divide the training set among processors for support vector machines (SVMs). Their approach works well for SVMs but does not extend to optimization problems in general. PpackSVM
[11] uses a similar approach to ours to derive an SGDbased SVM algorithm which communicates every k iterations. Our work extends this technique to proximal leastsquares problems solved by FISTA and Newtontype methods.Standard optimization algorithms are iterative and compute an update direction per iteration by communicating data. Krylov iterative solvers, frequently used to solve linear systems of equations, follow a similar pattern. Efforts to reformulate standard formulations of Krylov solvers by scientific computing practitioners has lead the developed of kstep Krylov solvers [12, 13]. These algorithms compute k basis vectors atonce by unrolling k iterations of the standard algorithm.With careful partitioning of the original data matrix [14, 15], the matrix powers kernel in the reformulated Krylov methods can be optimized to reduce communication costs by O(k) compared to the classical algorithm. kstep methods are powerful as they are arithmetically the same as the standard methods and thus often preserve convergence properties. However, data partitioning costs can be very high in the communicationavoiding implementations of kstep Krylov methods and matrices from many optimization problems are not good candidates for such partitioning. Devarakonda et. al. extend kstep methods to the block coordinate decent (BCD) methods [16]. Their implementation reduces latency costs by a factor k while increasing the bandwidth and floating point operations (flops) costs.
In this work we introduce kstep formulations for the stochastic FISTA (SFISTA) and stochastic proximal Newtontype method (SPNM) algorithms. Unlike the kstep formulations of Krylov methods which avoid communication by partitioning data—such an approach does not work for large matrices in optimization problems—we use randomized sampling to reformulate the classical algorithms and to reduce communication. Randomization enables us to control the communication and computation cost in the reformulated algorithm. Our approach reduces the latency cost without significantly increasing bandwidth costs or flops.
We introduce communicationavoiding implementations of SFISTA (CASFISTA) and SPNM (CASPNM) to reduce communication by reducing latency costs. Figure 1 shows the execution time of SFISTA for the covtype dataset. SFISTA shows poor scaling properties when the number of processors increase and demonstrates no performance gains on 64 processors vs. one processor. CASFISTA can solve the regularized least square problem by iteratively updating the optimization variable every k iterations and reduces the latency cost by a factor of k. CASFISTA and CASPNM outperform the classical algorithms without changing the convergence behavior. We control randomization by sampling from the distributed data between processors. Random sampling enables us to create subsamples of data to improve locality and arithmetic costs in each iteration. Unrolling of iterations to reformulate the standard algorithms is also made possible through random sampling.
The following summarizes the contributions of the work:

We introduce CASFISTA and CASPNM that provably reduce the latency cost of two wellknown optimization methods used to solve L1regularized least square problems by a factor of without changing the convergence behavior of the original algorithms.

The communicationavoiding algorithms achieve upto 10X speedup on the target supercomputers.

We develop a randomized sampling strategy that creates samples of data for each processor to facilitate the derivation and implementations of CASFISTA and CASPNM.

Computation and communication costs of the classical algorithms as well as the proposed communicationavoiding algorithms are presented. Our analysis shows that CASPNM and CASFISTA reduce communication costs without changing the floating points operation or bandwidth costs.
Ii Background
This section introduces the L1regularized least squares problems and algorithmic approaches for their solution. Details of the communication model used throughout the paper is also provided.
Iia Regularized Least Squares Problem
Consider the following composite additive cost optimization problem of the form
(1) 
where is a continuous, convex, possibly nonsmooth function and is a smooth and convex function, with a Lipschitzcontinuous gradient. Let be the Lipschitz constant of the gradient of . This form can represent a large class of regression problems based on different choices of and . In particular, the L1regularized least squares problem, a.k.a. the least absolute shrinkage and selection operator (LASSO) problem, is a special case where
when is the input data matrix, where rows are the features and columns are the samples, holds the labels (or observations), is the optimization variable, and is the regularization (penalty) parameter. The parameter controls the sparsity of the solution, since increasing its value magnifies the effect of the second term , which itself is minimized at . In this case, the gradient of is given by:
(2) 
LASSO problems appear frequently in machine learning applications [17] including learning directed and undirected graphical models [18], online dictionary learning [19], elasticnet regularized problems [20]
, and feature selection in classification problems and data analysis
[21, 22]. Let(3) 
be the optimal solution to the LASSO problem. The LASSO objective contains two parts, a quadratic penalty and the L1regularization term . Therefore, LASSO has regression characteristics by minimizing a least squares error and has subset selection features by shrinking the optimization variable and setting parts of it to zero.
IiB Solving LASSO
Among the most efficient methods for solving convex optimization problems of the form (1) are first order methods that use a proximal step in order to handle the possibly nonsmooth part . In particular, for the LASSO problem, the class of iterative soft thresholding algorithms, recently improved to FISTA [3], have become very popular due to their excellent performance. Each iteration of FISTA, involves computing a gradient followed by a soft thresholding operator. More recently, inspired by first order methods, proximal Newton methods have been introduced that use second order information (hessian) in their proximal part. These methods are globally convergent and could achieve a superlinear rate of convergence in the neighborhood of the optimal solution [4]. The fact that FISTA and proximal Newton methods are key algorithms for solving composite convex optimization problems motivates us to introduce communicationavoiding variants of these methods and discuss their computation and communication complexity.
IiC Communication Model
The cost of an algorithm includes arithmetic and computation. Traditionally, algorithms have been analyzed with floating point operation costs. However, communication costs are a crucial part in analyzing algorithms in largescale simulations [23]. The cost of floating point operations and communication, including bandwidth and latency, can be combined to obtain the following model:
(4) 
where is the overall execution time approximated by a linear function of F, L, and W, which are total number of floating point operations, the number of messages sent, and the number of words moved respectively. Also, , , and are machinespecific parameters that show the cost of one floating point operation, the cost of sending a message, and the cost of moving a word. Among different communication models, the LogP [24] and LogGP [25] models are often used to develop communication models for parallel architectures. The communication model used in this paper is a simplified model known as which uses , , and and shows the effect of communication and computation on the algorithm cost.
Iii Classical Stochastic algorithms
Even though there are some advantages to using batch optimization methods, multiple intuitive, practical, and theoretical reasons exist for using stochastic optimization methods. Stochastic methods can use information more efficiently than batch methods by choosing data randomly from the data matrix. In particular, for largescale machine learning problems, training sets involve a good deal of (approximate) redundant data which makes batch methods inefficient inpractice. Theoretical analysis also favors stochastic methods for many big data scenarios [1]. This section explains the SFISTA and SPNM algorithms and analyzes their computation and communication costs. The associated costs are derived under the assumption that columns of are distributed in a way that each processor has roughly the same number of nonzeros. The vector is distributed among processors while vectors with dimension such as and are replicated on all processors. Finally, we assume that which means we are dealing with many observations in the application.
Iiia Stochastic Fast Iterative ShrinkageThresholding Algorithm (SFISTA)
Computing the gradient in (2) is very expensive since the data matrix is being multiplied by its transpose. The cost of this operation can be reduced by sampling from the original matrix, leading to a randomized variant of FISTA [26, 27]. If percent of columns of are sampled, the new gradient vector is obtained by:
(5) 
where and
is a random matrix containing one nonzero per column representing samples used for computing the gradient. Therefore the generalized gradient update is:
(6) 
where is the soft thresholding operator defined as:
(7) 
where represents the ith element of a vector. FISTA accelerates the rate of generalized gradient by modifying the update rule as follows:
(8) 
where is the auxiliary variable defined as:
(9) 
and .
Algorithm I shows the SFISTA algorithm for solving the LASSO problem.
IiiB Stochastic Proximal Newton Methods (SPNM)
FISTA uses the first order information of (gradient) to update the optimization variable. However, proximal Newton methods solve the same problem with second order information (Hessian), or an approximation of it, to compute the smooth segment of the composite function (1). Proximal Newton methods achieve a superlinear convergence rate in the vicinity of the optimal solution [4]. As discussed in section IIA, since LASSO starts with an initial condition at and because the optimal solution is typically very sparse, the sequence of will be close enough to the optimal solution for different values of k. Therefore, very often proximal Newton methods achieve a superlinear convergence rate and converge faster than FISTA. Proximal Newton methods solve (1) with quadratic approximation to the smooth function in each iteration [4] as follows:
(10) 
where is the approximation of the Hessian at iteration . Since a closed form solution of (10) does not exist, a first order algorithm is often used to solve (10) and to update the optimization variable. Algorithm II shows the stochastic proximal Newton method for solving LASSO. As demonstrated, the SPNM algorithm takes a block of data, approximates the Hessian based on sampled columns, and uses a first order solver to solve (10) in lines 710 with updates on the optimization variable. Thus, SPNM can be seen as a first order solver operating on blocks of data.
1:  Input: , , , K1 and . 

2:  set 
3:  for do 
4:  Generate where 
is chosen uniformly at random  
5:  
6:  
7:  
8:  output 
1:  Input: , , , K1 and . 

2:  set 
3:  for do 
4:  Generate where 
is chosen uniformly at random  
5:  
6:  
7:  for 
8:  
9:  
10:  
11:  output 
The following theorems analyze the computation and communication costs of SFISTA and SPNM.
Theorem 1. iterations of SFISTA on processors over the critical path has following costs:
F = flops, W = words moved, L = messages and M = words of memory.
Proof. SFISTA computes the gradient in line 5 which consists of three parts. The first part multiplies by its transpose in line 5 which requires operations and communicates words and messages. Computing the second part which involves multiplying sampled and , requires operations and needs words with messages. These two operations dominate other costs of the algorithm. Finally, the algorithm computes the gradient and updates the optimization variable redundantly on processors. Computing the gradient (line 5) requires operations without any communication between processors. The vector updates need operations. Therefore, the total cost of SFISTA for iterations is flops, words, and messages. Each processor needs enough memory to store three parts of , and th of . Therefore, it needs words of memory.
Theorem 2. T iterations of SPNM on processors over the critical path has the following costs:
F = flops, W = words moved, L = messages, and M = words of memory.
Proof. SPNM solves the minimization problem using the Hessian. This requires inner iterations in order to reach an optimal solution. An analysis similar to theorem 1 can be used to prove this theorem.
The communication bottleneck of SFISTA and SPNM: Despite the fact that stochastic implementations of FISTA and proximal Newton methods are more practical for largescale machine learning applications, they do not scale well on distributed architectures (e.g. Figure 1). In each iteration of both algorithms, the gradient vector has to be communicated at line 5 with an allreduce operation which leads to expensive communication.
Iv Avoiding Communication in SFISTA and SPNM
We reformulate the SFISTA and SPNM algorithms to take ksteps atonce without communicating data. The proposed kstep formulations reduce data communication and latency costs by a factor of k without increasing bandwidth costs and significantly improve the scalability of SFISTA and SPNM on distributed platforms. This section presents our formulations for kstep SFISTA and kstep SPNM, also referred to as communicationavoiding SFISTA (CASFISTA) and SPNM (CASPNM). We will also introduce a randomized sampling strategy that leverages randomization to generate data partitions for each processor and to make the kstep derivations possible. Communication, memory, bandwidth, and latency costs for the CA algorithms are also presented.
Iva The CASFISTA and CASPNM Algorithms
Communicationavoiding SFISTA: Algorithm III shows the CASFISTA algorithm. As discussed, the communication bottleneck is at line 5 in Algorithm I, thus for the reformulation we start by unrolling the loop in line 4 of the classical algorithm for ksteps and will sample from the data matrix to compute the gradient of the objective function. As demonstrated in Algorithm III
, a random matrix is produced based on uniform distribution and is used to select columns from of
and and to compute Gram matrices and in line 6. Operations in line 6 are done in k unrolled iterations because every iteration involves a different random matrix ; without randomized sampling we could not unroll these iterations.These local Gram matrices are concatenated into matrices and in line 7 which are then broadcasted to all processors. This communication only occurs every k iterations, thus, the CASFISTA algorithm reduces latency costs. Also, sending large amounts of data at every k iteration improves bandwidth utilization. Processors do not need to communicate for the updates in lines 913 for k iterations. Gram matrices and consist of blocks of size and vectors of size respectively. At every iteration a block of and one column of is chosen and is used to compute the gradient in line 10. Each block of and comes from subsampled data and contributes to computing the gradient. The auxiliary variable is updated in line 12 and the soft thresholding operator updates . To conclude, the CASFISTA algorithm only communicates data every k iterations in line 7 which reduces the number of messages communicated between processors by O(k).
1:  Input: , , , K1, s.t 

2:  set 
3:  for do 
4:  for do 
5:  Generate where 
is chosen uniformly at random  
6:  
7:  set and 
and send them to all processors.  
8:  for do 
9:  are blocks of G 
10:  
11:  
solve the optimization using FISTA:  
12:  
13:  
14:  output 
Communicationavoiding SPNM: Similar to the CASFISTA formulation, CASPNM in Algorithm IV is formulated by unrolling iterations in line 4 of the classical SPNM algorithm. Lines 1 through 10 in Algorithm IV follow the same analysis as CASFISTA. CASPNM solves the inner subproblem inexactly in lines 1316. It uses a first order method without any communication to get an optimal solution for the inner problem. The same blocks of and are used in the subproblem until a solution is achieved after iterations. The value of from the previous iteration is used as a warm start initialization in line 13 to improve the convergence rate of the inner iterations and the overall algorithm. The CASPNM algorithm only communicates the Gram matrices at line 7 and thus the total number of messages communicated is reduced by a factor of k.
In conclusion, derivations of CASFISTA and CASPNM start from a randomized variant of the classical algorithm, enabling us to unroll the iterations while maintaining the exact arithmetic of the classical algorithms.
1:  Input: , , , K1, s.t 

2:  set 
3:  for do 
4:  for do 
5:  Generate where 
is chosen uniformly at random  
6:  
7:  set and 
and send them to all processors  
8:  for do 
9:  are block of G 
10:  
11:  
12:  *solve this minimization problem using a first order method 
13:  
14:  for 
15:  
16:  
17:  
18:  output 
IvB Using Randomized Sampling to Avoid Communication
We leverage randomized sampling in the stochastic variants of FISTA and proximal Newton methods to derive the communicationavoiding formulations. With randomized sampling, CASFISTA and CASPNM generate independent samples at each iteration. These randomly selected samples contribute to computing gradient and Hessian matrices and as a result allow us to unroll k iterations of the classical algorithms to avoid communication. We create blocks of the Gram matrices and by randomly selecting k different subset of the columns by each processor. Performing one allreduce operation every k iterations on these Gram matrices is far less expensive than doing an allreduce operation at every iteration, which enables us to avoid communication.
Algorithm V shows a pseudo implementation of CASFISTA (CASPNM follows a similar pattern). While the data matrix is distributed among processors in line 3, randomize sampling is used in line 6 to reduce the cost of the matrixmatrix and matrixvector operations in line 7, which allows us to control onnode computation and communication costs. We then stack these results in the Gram matrices in line 8 and do an allreduce operation every k iterations at line 9. Blocks of data from the Gram matrices are selected in line 11 based on the data dimensions and recurrence updates happen at line 12.
1. INPUT: k, T, b, , and training dataset X, y 
2. INITIALIZE: 
3. Distribute X columnwise on all processors so each 
processor roughly has the same amount of nonzeros 
4. for i=0,…, do 
5. for j=1,…,k do on each processor 
6. Randomly select b percent of columns of X and rows of y 
7. compute and 
8. stack the results in Gram matrices 
9. Allreduce the Gram matrices 
10. for j=1,…,k do on each processor 
11. compute gradient using blocks of Gram matrices 
12. update , 
13. RETURN 
IvC Cost of CommunicationAvoiding Algorithms
We discuss the computation, storage, and communication costs of CASFISTA and CASPNM in following theorems and show that these algorithms reduce latency costs while preserving both bandwidth and flops costs.
Theorem 3. iterations of CASFISTA on processors over the critical path has the following costs:
F = flops, W = words moved, L = messages and M = words of memory.
Proof. CASFISTA computes which requires operations, communicates words, and requires messages. Computing requires operations and communicates words with messages. Then the algorithm computes the gradient and solves the minimization problem redundantly on all processors using a soft thresholding operator which requires operations for computing the gradient and for the soft thresholding operator without any communication. The vector updates on can be done without any communication. Since the critical path occurs every k iteration then the algorithm costs flops, words and messages. Each processor needs to store , and th of . Therefore, it needs words of memory.
Theorem 4. iterations CASPNM on P processors over the critical path has following costs:
F = flops, W = words moved, L = messages and M = words.
Proof. CASPNM computes and and sends them to all processors. Each processors solves the minimization problem redundantly for iterations. At each inner iteration, the ISTA algorithm solves the minimization problem in iterations. In order to get an optimal solution, the algorithm needs to run for iterations; CASPNM requires operations for this. A analysis similar to the proof to theorem 1 proves this theorem.
Table I shows the summery of costs for the CA algorithms compared to the classical algorithms.
Algorithm  Latency cost (L)  Ops cost (F)  Memory cost (M)  Bandwidth cost (W) 

SFISTA  
CASFISTA  
SPNM  
CASPNM 
V Experimental Results
This section presents experimental setup and performance results. The system setup and datasets as well as algorithm parameter settings are discussed first. We then show the effect of parameters k and b on the algorithms’ convergence rate. Speedup and strong scaling properties of the communicationavoiding algorithms are demonstrated in the last section and are compared to the classical algorithms.
Va Methodology
Table II shows the datasets used for our experiments. The datasets are from dense and sparse machine learning applications from [28] and vary in size and sparsity. The algorithms are implemented in C/C++ using Intel MKL for (sparse/dense) BLAS routines and MPI for parallel processing. While use a dense data format for our theoretical analysis, we use the compressed sparse row format (CSR) to store the data for sparse datasets. Our experiments are conducted on the XSEDE Comet systems [29].
Selecting : should be chosen based on the prediction accuracy of the dataset and can affect convergence rates. We tune for so that our experiments have reasonable running time. The final tuned value for , for all experiments, is 0.1 for abalone and 0.01 for susy and covtype.
Stopping criteria: The stopping condition in CASFISTA and CASPNM could be set based on two criteria: (i) The algorithms can execute for a prespecified number of iterations, shown with T in Algorithms III and IV. We use this stopping criteria for our experiments on strong scaling (section VC2) since the number of operations should remain the same across all processors; (ii) The second stopping criteria is based on the distance to the optimal solution. We define a normalized distance to the optimal solution, called relative solution error obtained via , where is the optimal solution found using Templates for FirstOrder Conic Solvers (TFOCS) [30]. TFOCS is a first order method where the tolerance for its stopping criteria is . Putting a bound on the stopping criteria enforces the algorithms to run until a solution close enough to optimal is achieved. The CASFISTA and CASPNM algorithms have to be changed in line 3 to include a whileloop with the condition to exit when the relative solution error becomes smaller than a tolerance tol. The tol parameter for each dataset is chosen to provide a reasonable execution time. This stopping condition is used for the speedup experiments in section VC1.
Dataset  Row numbers  Column numbers  Percentage of nnz  Size (nnz) 

abalone  4177  8  100%  258.7KB 
susy  5M  18  25.39%  2.47GB 
covtype  581,012  54  22.12%  71.2MB 
VB Convergence Analysis
This section shows the effect of the sampling rate b on convergence. Computation costs of each processor can be significantly reduced with a smaller , however, very small sample sizes can influence stability and convergence. We also demonstrate that changing in the kstep formulations of SFISTA and SPNM does not affect their convergence and stability behavior compared to the classical algorithms since the kstep formulations are arithmetically the same as the original algorithms. The susy dataset also follows a similar analysis; we did not include its data due to space limitations.
VB1 Effect of b on Convergence
Figure 2 shows the convergence behavior of CASFISTA and CASPNM for different sampling rates. The figure shows that for the same parameters, CASPNM converges faster than CASFISTA for all datasets. Smaller sampling rates reduce the floating point operation costs on each processor and as shown in Figure 2 the convergence rate does change for large values of b. For example, we see that the residual solution error of CASFISTA follows the same pattern for both values of and for covtype. However, if very few columns of the dataset is sampled then the gradient and Hessian may not represent a correct update direction. This gets worse as we get closer to the optimal solution, as shown in Figure 2 for on both datasets. For these experiments we set k to 32; following shows the effect of on convergence.
VB2 Effect of k on Convergence
Figure 3 shows the convergence behavior of CASPNM and CASFISTA for two values of and shows the convergence rate of the classical SPNM and SFISTA algorithms. Since the CA derivations are arithmetically the same as the classical formulations, their convergence is also the same as SPNM and SFISTA. The experiments also demonstrate that changing does not affect the stability and residual solution error. We tested the convergence rate and stability behavior of the algorithms for up to and a similar trend was observed. For these experiments, is set to its best value.
VC Performance Experiments
Speedup and scaling results are presented in this section. Speedup results are obtained using the second stopping criteria and the tolerance parameter tol is set to 0.1 for all the experiments. For the scaling experiments, we use the first stopping criteria with 100 iterations to execute the same number of operations across the experiments. The largest dataset susy is executed on upto 1024 processors and the two smaller datasets abalone and covtype are executed on upto 64 and 512 processors respectively to report reasonable scalability based on size.
VC1 Speedup
Figure 4 shows the speedup for all the datasets for different combinations of nodes (P) and k for CASFISTA. All the speedups are normalized over SFISTA. At small scale, for example for and k=32 for abalone in Figure 4a a speedup of 1.79X is achieved, while for the same dataset we can get a speedup of 9.63X on 64 nodes. Increasing significantly improves the performance of abalone since CASFISTA for a larger increases the ratio of floating point operations to communication for this relatively small data set on more processors.
As shown in the figure, the performance of all datasets almost always improves when increasing the number of nodes or k. For a fixed number of processors increasing reduces latency costs and communication without changing bandwidth or arithmetic costs. Figure 5 shows the speedup results for the CASPNM algorithm and follows a similar trend to CASFISTA. Figure 6 shows the speedup of CASFISTA and CASPNM algorithms on largest number of nodes for each dataset (64 for abalone, 512 for covtype, and 1024 for susy). Increasing k reduces the number of communicated massages and latency costs. Therefore, the speedups improve as k increases.
VC2 Strong Scaling
Figure 7 (a) and Figure 7 (b) show strong scaling results for CASFISTA and CASPNM. The running times are the average over five runs for 100 iterations of the algorithm. The figures show the scaling behavior of the kstep algorithms for k=32 and compare it to SFISTA and SPNM. As shown in both figures, the classical algorithms do not scale well and their execution times increase when latency costs dominate. For example, for the abalone dataset, SFISTA’s running time increases on more than 8 processors, while CASFISTA continues to scale. The communicationavoiding methods are closer to ideal scaling. We intentionally added the data point for the covtype dataset to Figure 7 to show that the performance of the step algorithms is bounded by bandwidth. This is because increasing k and the number of processors, increases the number of words to be moved in every k iterations and performance is bounded by the bandwidth of the machine. If the processors had no bandwidth limitations then the algorithms would scale for all increasing k.
Vi Conclusion
Existing formulations of stochastic FISTA and proximal newton methods suffer from high communication costs and do not scale well on distributed platforms when operating on large datasets. We reformulate classical algorithms of SFISTA and SPNM to reduce inherent communication in their formulations. Our communicationavoiding algorithms leverage randomized sampling to overlap iterations and reduce communication costs by a factor of k. The proposed kstep algorithms do not change the classical algorithms’ convergence behavior and preserve the overall bandwidth and arithmetic costs. Our experiments show that CASFISTA and CASPNM provide up to 10X speedup for the tested datasets on distributed platforms.
References
 [1] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for largescale machine learning,” arXiv preprint arXiv:1606.04838, 2016.
 [2] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on pure and applied mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
 [3] A. Beck and M. Teboulle, “A fast iterative shrinkagethresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
 [4] J. D. Lee, Y. Sun, and M. Saunders, “Proximal newtontype methods for convex optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 827–835.
 [5] J. Dongarra, J. Hittinger, J. Bell, L. Chacon, R. Falgout, M. Heroux, P. Hovland, E. Ng, C. Webster, and S. Wild, “Applied mathematics research for exascale computing,” Lawrence Livermore National Laboratory (LLNL), Livermore, CA, Tech. Rep., 2014.
 [6] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, pp. 1–155, 2014.
 [7] V. Smith, S. Forte, C. Ma, M. Takac, M. I. Jordan, and M. Jaggi, “Cocoa: A general framework for communicationefficient distributed optimization,” arXiv preprint arXiv:1611.02189, 2016.
 [8] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lockfree approach to parallelizing stochastic gradient descent,” in Advances in neural information processing systems, 2011, pp. 693–701.
 [9] F. Zhou and G. Cong, “On the convergence properties of a step averaging stochastic gradient descent algorithm for nonconvex optimization,” arXiv preprint arXiv:1708.01012, 2017.
 [10] Y. You, J. Demmel, K. Czechowski, L. Song, and R. Vuduc, “Casvm: Communicationavoiding support vector machines on distributed systems,” in IEEE International Parallel and Distributed Processing Symposium, 2015, pp. 847–859.
 [11] Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen, “PpackSVM: Parallel primal gradient descent kernel svm,” in Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ser. ICDM ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 677–686.
 [12] A. T. Chronopoulos and C. W. Gear, “On the efficient implementation of preconditioned sstep conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel computing, vol. 11, no. 1, pp. 37–53, 1989.
 [13] A. Chronopoulos and C. Swanson, “Parallel iterative sstep methods for unsymmetric linear systems,” Parallel Computing, vol. 22, no. 5, pp. 623 – 641, 1996.
 [14] E. C. Carson, Communicationavoiding Krylov subspace methods in theory and practice. University of California, Berkeley, 2015.
 [15] E. Carson, N. Knight, and J. Demmel, “Avoiding communication in nonsymmetric lanczosbased krylov subspace methods,” SIAM Journal on Scientific Computing, vol. 35, no. 5, pp. S42–S61, 2013.
 [16] A. Devarakonda, K. Fountoulakis, J. Demmel, and M. W. Mahoney, “Avoiding communication in primal and dual block coordinate descent methods,” arXiv preprint arXiv:1612.04003, 2016.
 [17] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 [18] M. Schmidt, A. NiculescuMizil, K. Murphy et al., “Learning graphical model structure using l1regularization paths,” in AAAI, vol. 7, 2007, pp. 1278–1283.
 [19] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning for sparse coding,” in Proceedings of the 26th annual international conference on machine learning. ACM, 2009, pp. 689–696.
 [20] V. Smith, S. Forte, M. I. Jordan, and M. Jaggi, “L1regularized distributed optimization: A communicationefficient primaldual framework,” arXiv preprint arXiv:1512.04011, 2015.
 [21] S. Rendle, D. Fetterly, E. J. Shekita, and B.y. Su, “Robust largescale machine learning in the cloud,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1125–1134.

[22]
K. Koh, S.J. Kim, and S. Boyd, “An interiorpoint method for largescale l1regularized logistic regression,”
Journal of Machine learning research, vol. 8, no. Jul, pp. 1519–1555, 2007.  [23] S. H. Fuller and L. I. Millett, “Computing performance: Game over or next level?” Computer, vol. 44, no. 1, pp. 31–38, 2011.
 [24] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. Von Eicken, “Logp: Towards a realistic model of parallel computation,” in ACM Sigplan Notices, vol. 28, no. 7. ACM, 1993, pp. 1–12.
 [25] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, “Loggp: incorporating long messages into the logp model—one step closer towards a realistic model for parallel computation,” in Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures. ACM, 1995, pp. 95–105.

[26]
L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,”
SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014.  [27] A. Nitanda, “Stochastic proximal gradient descent with acceleration techniques,” in Advances in Neural Information Processing Systems, 2014, pp. 1574–1582.
 [28] C.C. Chang and C.J. Lin, “Libsvm: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, p. 27, 2011.
 [29] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson et al., “Xsede: accelerating scientific discovery,” Computing in Science & Engineering, vol. 16, no. 5, pp. 62–74, 2014.
 [30] S. R. Becker, E. J. Candès, and M. C. Grant, “Templates for convex cone problems with applications to sparse signal recovery,” Mathematical programming computation, vol. 3, no. 3, pp. 165–218, 2011.