1 Introduction
We develop randomized coordinate descent methods to solve the following composite convex problem:
(1)  F⋆ := min_{x ∈ R^p} { F(x) := f(x) + g(x) + h(Ax) },
where f : R^p → R, g : R^p → R ∪ {+∞} and h : R^m → R ∪ {+∞} are proper, closed and convex functions, and A ∈ R^{m×p} is a given matrix. The optimization template (1) covers many important applications including support vector machines, sparse model selection, logistic regression, etc. It is also convenient to formulate generic constrained convex problems by choosing an appropriate h.
Within convex optimization, coordinate descent methods have recently become increasingly popular in the literature Nesterov2012; richtarik2014iteration; richtarik2016parallel; fercoq2015accelerated; shalev2013stochastic; necoara2016parallel. These methods are particularly well-suited to solving huge-scale problems arising from machine learning applications where full matrix-vector operations are prohibitive Nesterov2012.
To our knowledge, there is no coordinate descent method for the general three-composite form (1), under the structure assumptions studied here, that has rigorous convergence guarantees. Our paper specifically fills this gap. For such a theoretical development, coordinate descent algorithms require specific assumptions on the convex optimization problem Nesterov2012; fercoq2015accelerated; necoara2016parallel. As a result, to rigorously handle the three-composite case, we assume that (i) f is smooth, (ii) g is nonsmooth but decomposable (each component has an “efficiently computable” proximal operator), and (iii) h is nonsmooth.
Our approach:
In a nutshell, we generalize fercoq2015accelerated; qu2016coordinate to the three-composite case (1). For this purpose, we combine several classical and contemporary ideas: the smoothing technique in Nesterov2005c, the efficient implementation technique in lee2013efficient; fercoq2015accelerated, the homotopy strategy in tran2015smooth, and the nonuniform coordinate selection rule in qu2016coordinate, to achieve the best known complexity estimate for this template.
Surprisingly, the combination of these ideas is achieved in a very natural and elementary primal-dual gap-based framework. However, the extension is not trivial, since it requires dealing with the composition of a nonsmooth function h and a linear operator A.
While our work has connections to the methods developed in qu2016coordinate; fercoq2013smooth; fercoq2015coordinate, it is rather distinct. First, we consider a more general problem (1) than the ones in fercoq2015accelerated; qu2016coordinate; fercoq2013smooth. Second, our method relies on Nesterov's accelerated scheme rather than a primal-dual method as in fercoq2015coordinate. Moreover, we obtain the first rigorous convergence rate guarantees, in contrast to fercoq2015coordinate. In addition, we allow any sampling distribution for choosing the coordinates.
Our contributions: We propose a new smooth primal-dual randomized coordinate descent method for solving (1), where f is smooth, g is nonsmooth, separable and has a blockwise proximal operator, and h is a general nonsmooth function. Under this structure, we show that our algorithm achieves the best known O(n/k) convergence rate, where k is the iteration count; to our knowledge, this is the first time this convergence rate has been proven for a coordinate descent algorithm applied to the three-composite template.
We instantiate our algorithm to solve special cases of (1), including the case of Lipschitz continuous h and constrained problems. We analyze the convergence rate guarantees of these variants individually and discuss the choices of sampling distributions.
Exploiting the strategy in lee2013efficient ; fercoq2015accelerated , our algorithm can be implemented in parallel by breaking up the full vector updates. We also provide a restart strategy to enhance practical performance.
Paper organization:
We review some preliminary results in Section 2. The main contribution of this paper is in Section 3 with the main algorithm and its convergence guarantee. We also present special cases of the proposed algorithm. Section 4 provides numerical evidence to illustrate the performance of our algorithms in comparison to existing methods. The proofs are deferred to the supplementary document.
2 Preliminaries
Notation:
Let [n] := {1, 2, …, n} be the set of positive integer indices. Let us decompose the variable vector x into n blocks denoted by x_i, i ∈ [n], such that each block x_i has size p_i with Σ_{i=1}^n p_i = p. We also decompose the identity matrix I_p
of R^{p×p} into n blocks as I_p = [U_1, U_2, …, U_n], where U_i ∈ R^{p×p_i} has p_i unit vectors. In this case, any vector x ∈ R^p can be written as x = Σ_{i=1}^n U_i x_i, and each block becomes x_i = U_i^⊤ x for i ∈ [n]. We define the partial gradients as ∇_i f(x) := U_i^⊤ ∇f(x) for i ∈ [n]. For a convex function f, we use dom(f) to denote its domain, f^*(y) := sup_x {⟨y, x⟩ − f(x)} to denote its Fenchel conjugate, and prox_f(x) := arg min_u { f(u) + (1/2)‖u − x‖² } to denote its proximal operator. For a convex set X, δ_X denotes its indicator function. We also need the following weighted norms:
(2)  ‖x_i‖²_{H_i} := ⟨H_i x_i, x_i⟩,  (‖y_i‖*_{H_i})² := ⟨H_i^{-1} y_i, y_i⟩.
Here, H_i ∈ R^{p_i×p_i} is a symmetric positive definite matrix, and ‖x‖²_α := Σ_{i=1}^n L_i^α ‖x_i‖² for α ∈ R and L_i ∈ (0, ∞). In addition, we use ‖·‖ to denote the standard Euclidean norm.
Formal assumptions on the template:
We require the following assumptions to tackle (1):
Assumption 1.
The functions f, g and h are all proper, closed and convex. Moreover, they satisfy:
(a) The partial gradient ∇_i f of f is Lipschitz continuous with Lipschitz constant L̂_i ∈ [0, ∞), i.e., ‖∇_i f(x + U_i d_i) − ∇_i f(x)‖ ≤ L̂_i ‖d_i‖ for all x ∈ dom(f) and d_i ∈ R^{p_i}.
(b) The function g is separable, i.e., it has the form g(x) = Σ_{i=1}^n g_i(x_i).
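To make assumption (b) concrete, consider the prototypical separable choice g(x) = λ‖x‖₁, whose blockwise proximal operator is componentwise soft-thresholding. The sketch below is our own illustration, not part of the paper:

```python
import numpy as np

def prox_l1_block(x_i, lam, step=1.0):
    """Blockwise proximal operator of g_i = lam * ||.||_1:
    prox_{step * g_i}(x_i) = argmin_u { lam*step*||u||_1 + 0.5*||u - x_i||^2 },
    which reduces to componentwise soft-thresholding."""
    t = lam * step
    return np.sign(x_i) * np.maximum(np.abs(x_i) - t, 0.0)
```

Each block can be processed independently of the others, which is exactly the structure the coordinate updates exploit.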
Now, we briefly describe the main techniques used in this paper.
Acceleration:
Acceleration techniques in convex optimization date back to the seminal work of Nesterov nesterov1983method, and they are now among the standard techniques of the field. We exploit such a scheme to achieve the best known rate for the nonsmooth template (1).
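As a reminder of the mechanism we build on, here is a minimal sketch (our own, on a generic L-smooth function; it is not Algorithm 1) of Nesterov's extrapolation step:

```python
import numpy as np

def accelerated_gradient(grad, x0, L, iters=300):
    """Nesterov's accelerated gradient method on an L-smooth convex function:
    the gradient step is taken at an extrapolated point y, which improves the
    O(1/k) rate of plain gradient descent to O(1/k^2)."""
    x, x_prev = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # momentum/extrapolation
        x_prev = x
        x = y - grad(y) / L                           # gradient step at y
    return x
```

The only change relative to plain gradient descent is the extrapolated point y; this is the momentum term that the restart strategy of Section 3.6 later resets.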
Nonuniform distribution:
We assume that i_k is a random index in [n] associated with a probability distribution q = (q_1, …, q_n)^⊤ such that
(3)  P(i_k = i) = q_i > 0, i ∈ [n], and Σ_{i=1}^n q_i = 1.
When q_i = 1/n for all i ∈ [n], we obtain the uniform distribution. Let i_0, i_1, …, i_{k−1} be i.i.d. realizations of the random index after k iterations. We define F_{k+1} := σ(i_0, i_1, …, i_{k−1}) as the σ-field generated by these realizations.
Smoothing techniques:
We can write the convex function h(u) = max_y { ⟨u, y⟩ − h^*(y) } using its Fenchel conjugate h^*. Since h in (1) is convex but possibly nonsmooth, we smooth h as
(4)  h_β(u) := max_y { ⟨u, y⟩ − h^*(y) − (β/2)‖y − ẏ‖² },
where ẏ ∈ R^m is given and β > 0 is the smoothness parameter. Moreover, the quadratic function b(y, ẏ) := (1/2)‖y − ẏ‖² is defined based on a given norm in R^m. Let us denote by y^*_β(u) the unique solution of the concave maximization problem in (4), i.e.:
(5)  y^*_β(u) := arg max_y { ⟨u, y⟩ − h^*(y) − (β/2)‖y − ẏ‖² } = prox_{β^{-1} h^*}( ẏ + β^{-1} u ),
where prox_{h^*} is the proximal operator of h^*. If we assume that h is Lipschitz continuous, or equivalently that its conjugate domain dom(h^*) is bounded, then it holds that
(6)  h_β(u) ≤ h(u) ≤ h_β(u) + (β/2) D²_{h^*},  where  D_{h^*} := max_{y ∈ dom(h^*)} ‖y − ẏ‖ < +∞.
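For intuition, take h(u) = |u| with ẏ = 0: then dom(h^*) = [−1, 1], the maximizer in (5) is a clipping operation, and h_β is the familiar Huber function. This one-dimensional sketch is our own illustration of (4)-(6):

```python
import numpy as np

def smoothed_abs(u, beta):
    """Smoothed absolute value via (4)-(5) with h*(.) the indicator of [-1, 1]
    and the prox center y_dot = 0: the maximizer is y* = clip(u/beta, -1, 1),
    and h_beta(u) = u*y* - (beta/2)*(y*)**2 is the Huber function."""
    y_star = np.clip(u / beta, -1.0, 1.0)
    return u * y_star - 0.5 * beta * y_star**2
```

One can check the bound (6) directly in this case: h_β(u) ≤ |u| ≤ h_β(u) + β/2, since here D_{h^*} = 1.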
Let us define a new smoothed function f_β(x) := f(x) + h_β(Ax). Then f_β is differentiable, and its block partial gradient
(7)  ∇_i f_β(x) = ∇_i f(x) + A_i^⊤ y^*_β(Ax)
is also Lipschitz continuous with Lipschitz constant L_i(β) := L̂_i + ‖A_i‖²/β, where L̂_i is given in Assumption 1, and A_i := A U_i is the i-th block of A.
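For instance, one natural nonuniform choice (our own sketch; the paper allows any fixed distribution satisfying (3)) samples block i with probability proportional to a power of its Lipschitz constant L_i(β):

```python
import numpy as np

def make_sampler(L, alpha=1.0, seed=0):
    """Nonuniform block sampler with q_i proportional to L[i]**alpha;
    alpha = 0 recovers the uniform distribution over the n blocks."""
    q = np.asarray(L, dtype=float) ** alpha
    q = q / q.sum()
    rng = np.random.default_rng(seed)
    return (lambda: int(rng.choice(len(q), p=q))), q

sample, q = make_sampler([1.0, 4.0, 5.0])
# q = [0.1, 0.4, 0.5]; sample() draws a block index i_k with P(i_k = i) = q[i].
```

Blocks with larger Lipschitz constants are visited more often, which tends to balance the progress made per coordinate.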
Homotopy:
In smoothing-based methods, the choice of the smoothness parameter β is critical. This choice may require knowledge of the desired accuracy, the maximum number of iterations, or the diameters of the primal and/or dual domains, as in Nesterov2005c. In order to make this choice flexible and our method applicable to constrained problems, we employ a homotopy strategy developed in tran2015smooth for deterministic algorithms, to gradually update the smoothness parameter while making sure that it converges to zero.
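Schematically (our own sketch; in the algorithm the update of β_k is coupled with the step-size sequence), the homotopy strategy replaces a fixed β with a sequence decaying to zero:

```python
def homotopy_schedule(beta1, iters):
    """Illustrative homotopy update: start from beta1 > 0 and shrink the
    smoothness parameter every iteration via beta <- beta * k/(k+1),
    which yields beta_k = beta1 / k -> 0."""
    beta, betas = beta1, []
    for k in range(1, iters + 1):
        betas.append(beta)
        beta = beta * k / (k + 1.0)
    return betas
```

Early iterations thus work with a well-conditioned surrogate, while later iterations approach the original nonsmooth problem.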
3 Smooth primaldual randomized coordinate descent
In this section, we develop a smoothing primal-dual method to solve (1). Our approach is to combine the four key techniques mentioned above: smoothing, acceleration, homotopy, and randomized coordinate descent. Similar to qu2016coordinate, we allow an arbitrary nonuniform sampling distribution, which makes it possible to design a distribution that captures the underlying structure of specific problems.
3.1 The algorithm
Algorithm 1 below smooths, accelerates, and randomizes the coordinate descent method.
From the updates of the iterates in Algorithm 1, it directly follows that one of the maintained sequences is an affine combination of the others. Therefore, it is possible to implement the algorithm without forming it explicitly.
3.2 Efficient implementation
While the basic variant in Algorithm 1 requires full vector updates at each iteration, we exploit the idea in lee2013efficient ; fercoq2015accelerated and show that we can partially update these vectors in a more efficient manner.
We present the following result which shows the equivalence between Algorithm 1 and Algorithm 2, the proof of which can be found in the supplementary document.
Proposition 3.1.
Let , and . Then, , and , for all , where , , and are defined in Algorithm 1.
According to Algorithm 2, we never need to form or update full-dimensional vectors. The only steps that require working with full vectors are computing the gradient and the dual variable. We present two special cases that are common in machine learning, in which we can compute these steps efficiently.
Remark 3.2.
Under the following assumptions, we can characterize the per-iteration complexity explicitly:
- each block of the matrix A is built from standard unit vectors, and
- h is separable.
Assuming that we store and maintain the residuals listed above, the per-iteration cost reduces to block-sized arithmetic operations. If h is partially separable as in richtarik2016parallel, then the complexity of each iteration will remain moderate.
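The residual-maintenance idea of lee2013efficient; fercoq2015accelerated can be sketched as follows (our own minimal illustration): when only block i of x changes, the product Ax is refreshed with one column operation instead of a full matrix-vector multiply.

```python
import numpy as np

def update_coordinate(A, x, r, i, new_xi):
    """Maintain the residual r = A @ x under a single-coordinate update of x.
    Cost is proportional to the number of nonzeros of column i of A, rather
    than the cost of recomputing the full product A @ x."""
    r += A[:, i] * (new_xi - x[i])   # rank-one correction of the residual
    x[i] = new_xi
    return x, r
```

This is the basic mechanism that keeps the per-iteration cost of the coordinate method proportional to a single column of A.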
3.3 Case 1: Convergence analysis of SMART-CD for Lipschitz continuous h
We provide the following main theorem, which characterizes the convergence rate of Algorithm 1.
Theorem 3.3.
In the special case where we use the uniform distribution, q_i = 1/n, the convergence rate bound reduces to one whose dominant term is proportional to n/k, where k is the iteration counter and the constant depends on the initial point and the initial smoothness parameter. This estimate shows that the convergence rate of Algorithm 1 is O(n/k), which, to the best of our knowledge, is the best known rate for this setting.
3.4 Case 2: Convergence analysis of SMART-CD for nonsmooth constrained optimization
In this section, we instantiate Algorithm 1 to solve constrained convex optimization problems with possibly nonsmooth terms in the objective. Clearly, if we choose h in (1) to be the indicator function δ_{{c}} of the singleton {c} for a given vector c ∈ R^m, then we obtain the constrained problem:
(9)  F⋆ := min_x { F(x) := f(x) + g(x) : Ax = c },
where f and g are defined as in (1), A ∈ R^{m×p}, and c ∈ R^m.
We can specify Algorithm 1 to solve this constrained problem by modifying the following two steps:
Now, we analyze the convergence of this algorithm by providing the following theorem.
Theorem 3.4.
Let {x̄^k} be the sequence generated by Algorithm 1 for solving (9) using the updates (10) and (11), and let y⋆ be an arbitrary optimal solution of the dual problem of (9). In addition, let the initial smoothness parameter β_1 > 0 and the initial step size be given. Then, we have the following estimates:
(12)
where the constants depend on β_1 and the initial points. We note that the lower bound F(x̄^k) − F⋆ ≥ −‖y⋆‖ ‖A x̄^k − c‖ always holds.
3.5 Other special cases
We consider the following special cases of Algorithm 1:
The case h ≡ 0:
In this case, we obtain an algorithm similar to the one studied in qu2016coordinate, except that we have nonuniform sampling instead of importance sampling. If the distribution is uniform, then we obtain the method in fercoq2015accelerated.
The case g ≡ 0:
In this case, we have F(x) = f(x) + h(Ax), which can handle linearly constrained problems with a smooth objective function. The proximal term can then be dropped, and the coordinate proximal gradient step, Step 8 in Algorithm 1, simplifies to a coordinate gradient step:
(13)
In addition, we replace Step 9 with
(14)
We then obtain the following results:
Corollary 3.5.
Assume that Assumption 1 holds. Let β_1 > 0, and let Steps 8 and 9 of Algorithm 1 be updated by (13) and (14), respectively. If, in addition, h is Lipschitz continuous, then we have
(15)
where D_{h^*} is defined as in (6).
If, instead of a Lipschitz continuous h, we solve the constrained problem (9) with g ≡ 0, then we have
(16)
where y⋆ is an arbitrary dual solution of (9).
3.6 Restarting SMART-CD
It is known that restarting an accelerated method significantly enhances its practical performance when the underlying problem admits a (restricted) strong convexity condition. As a result, we describe below how to restart the momentum term in Efficient SMART-CD. If the restart is injected at the k-th iteration, then we restart the algorithm with the following steps:
The first three steps of the restart procedure are for restarting the primal variable, which is classical o2015adaptive. Restarting is also suggested in tran2015smooth. The cost of this procedure is essentially equal to the cost of one iteration, as described in Remark 3.2; therefore, even restarting once every epoch does not cause a significant difference in per-iteration cost.
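As a generic illustration of the restart heuristic (our own sketch of the function-value test of o2015adaptive, not the full SMART-CD restart procedure), one resets the momentum whenever the objective increases:

```python
import numpy as np

def agd_with_restart(f, grad, x0, L, iters=500):
    """Accelerated gradient descent with the function-value restart test of
    o2015adaptive: whenever the objective goes up, reset the momentum counter,
    which recovers fast convergence under (restricted) strong convexity."""
    x, x_prev, k = x0.copy(), x0.copy(), 1
    f_best = f(x)
    for _ in range(iters):
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
        x_prev, x = x, y - grad(y) / L
        k += 1
        if f(x) > f_best:            # objective increased: restart momentum
            x_prev, k = x.copy(), 1
        else:
            f_best = f(x)
    return x
```

The restart itself costs no more than one ordinary iteration, matching the cost estimate above.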
4 Numerical evidence
We illustrate the performance of Efficient SMART-CD in brain imaging and support vector machine applications. We also include one representative example of a degenerate linear program to illustrate why the convergence rate guarantees of our algorithm matter. We compare SMART-CD with Vu-Condat-CD fercoq2015coordinate, which is a coordinate descent variant of Vu-Condat's algorithm vu2013splitting, FISTA beck2009fast, ASGARD tran2015smooth, Chambolle-Pock's primal-dual algorithm chambolle2011first, L-BFGS byrd1995limited, and SDCA shalev2013stochastic.
4.1 A degenerate linear program: Why do convergence rate guarantees matter?
We consider the following degenerate linear program studied in tran2015smooth:
(17) 
Here, the constraint is repeated several times, which makes the problem degenerate. The problem satisfies the linear constraint qualification condition, which guarantees primal-dual optimality. By choosing f, g, and h appropriately, we can fit this problem and its dual form into our template (1).
For this experiment, we fix the problem dimensions. We implement our algorithm and compare it with Vu-Condat-CD. We also combine our method with the restart strategy proposed above. We use the same mapping to fit the problem into the template of Vu-Condat-CD.
Figure 1 illustrates the convergence behavior of Vu-Condat-CD and SMART-CD. We compare primal suboptimality and feasibility in the plots. The explicit solution of the problem is used to generate the primal suboptimality plot. We observe that the degeneracy of the problem prevents Vu-Condat-CD from making any progress towards the solution, whereas SMART-CD preserves its O(1/k) rate as predicted by theory. We emphasize that the authors in fercoq2015coordinate proved almost sure convergence for Vu-Condat-CD, but they did not provide a convergence rate guarantee for this method. Since the problem is certainly non-strongly convex, restarting does not significantly improve the performance of SMART-CD.
4.2 Total variation and ℓ1 regularized least squares regression with functional MRI data
In this experiment, we consider a computational neuroscience application where prediction is done based on a sequence of functional MRI images. Since the images are high dimensional and the number of samples that can be taken is limited, TV-ℓ1 regularization is used to get stable and predictive estimation results dohmatob2014benchmarking. The convex optimization problem we solve is of the form:
(18)  min_w (1/2)‖Xw − y‖² + λ( r‖w‖_1 + (1 − r)‖Dw‖_1 ),
where D is the 3D finite difference operator used to define the total variation norm and r ∈ [0, 1] balances the two regularizers. This problem fits our template (1) with f the least-squares term, g the ℓ1 term, and h(·) = λ(1 − r)‖·‖_1 composed with the linear operator A = D.
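The linear operator D in (18) can be sketched as follows (our own illustration of an anisotropic 3D finite-difference operator; the benchmark code of dohmatob2014benchmarking has its own implementation):

```python
import numpy as np

def finite_diff_3d(w):
    """Forward finite differences of a 3D volume w along each axis; stacking
    the three outputs gives the action of the operator D in (18)."""
    return np.diff(w, axis=0), np.diff(w, axis=1), np.diff(w, axis=2)

def tv_l1_penalty(w, lam, r):
    """Composite penalty lam * ( r * ||w||_1 + (1 - r) * ||D w||_1 )."""
    dx, dy, dz = finite_diff_3d(w)
    tv = np.abs(dx).sum() + np.abs(dy).sum() + np.abs(dz).sum()
    return lam * (r * np.abs(w).sum() + (1.0 - r) * tv)
```

Since D acts by sparse local differences, its blocks mesh well with the residual-maintenance trick of Section 3.2.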
We use an fMRI dataset where the primal variable is a 3D image of the brain composed of voxels. The feature matrix X has one row per sample, each representing the brain activity for the corresponding example dohmatob2014benchmarking. We compare our algorithm with Vu-Condat's algorithm, FISTA, ASGARD, Chambolle-Pock's primal-dual algorithm, L-BFGS, and Vu-Condat-CD.
Figure 2 illustrates the convergence behavior of the algorithms for different values of the regularization parameters. The per-iteration cost of SMART-CD and Vu-Condat-CD is similar; therefore, the behavior of these two algorithms is quite similar in this experiment. Since Vu-Condat's, Chambolle-Pock's, FISTA, and ASGARD methods work with full-dimensional variables, they have slow convergence in time. L-BFGS has performance close to the coordinate descent methods.
The simulation in Figure 2 is performed using the benchmarking tool of dohmatob2014benchmarking. The algorithms are tuned for the best parameters in practice.
4.3 Linear support vector machines problem with bias
In this section, we consider an application of our algorithm to the support vector machines (SVM) problem for binary classification. Given a training set with N examples {a_1, …, a_N} ⊂ R^d and class labels b_i ∈ {−1, +1}, we define the soft margin primal support vector machines problem with bias as
(19)  min_{w, w_0} (1/2)‖w‖² + C Σ_{i=1}^N max{ 0, 1 − b_i(⟨a_i, w⟩ + w_0) }.
As is common practice, we solve its dual formulation, which is a constrained problem:
(20)  min_α (1/2)‖X^⊤ D(b) α‖² − 1^⊤ α   s.t.   b^⊤ α = 0,   0 ≤ α_i ≤ C,  i = 1, …, N,
where D(b) represents a diagonal matrix that has the class labels b_i on its diagonal and X is formed by the example vectors. If we define f, g, and h according to these terms, then we can fit this problem into our template in (9).
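For concreteness, the dual objective and constraints of (20) can be evaluated as below (our own sketch; we assume the rows of X are the example vectors a_i, and the orientation of X may differ from the paper's convention):

```python
import numpy as np

def svm_dual_objective(X, b, alpha):
    """Dual soft-margin SVM objective from (20):
    0.5 * || X^T D(b) alpha ||^2 - sum(alpha), where D(b) = diag(b)."""
    v = X.T @ (b * alpha)          # X^T D(b) alpha
    return 0.5 * (v @ v) - alpha.sum()

def svm_dual_feasible(b, alpha, C, tol=1e-9):
    """Check the box constraint 0 <= alpha <= C and the bias-induced
    linear constraint b @ alpha = 0 (handled by h in template (9))."""
    return bool(np.all(alpha >= -tol) and np.all(alpha <= C + tol)
                and abs(b @ alpha) <= tol)
```

The box constraint is separable and fits into g, while the single coupling constraint b^⊤α = 0 is exactly what the indicator h handles in the constrained template (9).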
We apply the version of SMART-CD specialized for the constrained setting from Section 3.4 and compare it with Vu-Condat-CD and SDCA. Even though SDCA is a state-of-the-art method for SVMs, it cannot handle the bias term; hence, it only applies to (20) when the linear constraint is removed. This causes SDCA not to converge to the optimal solution when there is a bias term in the problem (19). The following table summarizes the properties of the classification datasets we used.
Data Set  Training Size  Number of Features  Convergence Plot 

rcv1.binary chang2011libsvm ; lewis2004rcv1  20,242  47,236  Figure 3, plot 1 
a8a chang2011libsvm ; Lichman:2013  22,696  123  Figure 3, plot 2 
gisette chang2011libsvm ; guyon2005result  6,000  5,000  Figure 3, plot 3 
Figure 3 illustrates the performance of the algorithms for solving the dual formulation of SVM in (20). We compute the duality gap for each algorithm and present the results with epochs on the horizontal axis, since the per-iteration complexity of the algorithms is similar. As expected, SDCA gets stuck at a low accuracy since it ignores one of the constraints of the problem. We demonstrate this fact in the first experiment and then limit the comparison to SMART-CD and Vu-Condat-CD. Equipped with the restart strategy, SMART-CD shows the fastest convergence behavior due to the restricted strong convexity of (20).
5 Conclusions
Coordinate descent methods have been increasingly deployed to tackle huge-scale machine learning problems in recent years. The most notable works include Nesterov2012; richtarik2014iteration; richtarik2016parallel; fercoq2015accelerated; shalev2013stochastic; necoara2016parallel. Our method relates to several works in the literature, including fercoq2015accelerated; fercoq2013smooth; Nesterov2012; nesterov2017efficiency; qu2016coordinate; tran2015smooth. The algorithms developed in fercoq2015accelerated; richtarik2014iteration; richtarik2016parallel only considered the special case of (1) with h ≡ 0, and cannot be trivially extended to the general setting (1). Here, our algorithm can be viewed as an adaptive variant of the method developed in fercoq2015accelerated, extended to the sum of three functions. The idea of homotopy strategies relates to tran2015smooth for first-order primal-dual methods. This paper further extends such an idea to randomized coordinate descent methods for solving (1). We note that a naive application of the method developed in fercoq2015accelerated to the smoothed problem with a carefully chosen fixed smoothness parameter would result in a worse complexity bound, whereas using our homotopy strategy on the smoothness parameter, we reduce this complexity to the O(n/k) rate reported in Section 3.
With an additional strong convexity assumption on the problem template (1), it is possible to obtain a faster O(1/k²) rate by using deterministic first-order primal-dual algorithms chambolle2011first; tran2015smooth. It remains future work to incorporate strong convexity into coordinate descent methods for solving nonsmooth optimization problems with a faster convergence rate.
Acknowledgments
The work of O. Fercoq was supported by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH. The work of Q. Tran-Dinh was partly supported by NSF grant DMS-1619884, USA. The work of A. Alacaoglu and V. Cevher was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594 - time-data).
References
 (1) Y. Nesterov, “Efficiency of coordinate descent methods on hugescale optimization problems,” SIAM Journal on Optimization, vol. 22, no. 2, pp. 341–362, 2012.
 (2) P. Richtárik and M. Takáč, “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function,” Mathematical Programming, vol. 144, no. 1-2, pp. 1–38, 2014.
 (3) P. Richtárik and M. Takáč, “Parallel coordinate descent methods for big data optimization,” Mathematical Programming, vol. 156, no. 1-2, pp. 433–484, 2016.
 (4) O. Fercoq and P. Richtárik, “Accelerated, parallel, and proximal coordinate descent,” SIAM Journal on Optimization, vol. 25, no. 4, pp. 1997–2023, 2015.
 (5) S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss minimization,” Journal of Machine Learning Research, vol. 14, pp. 567–599, 2013.
 (6) I. Necoara and D. Clipici, “Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds,” SIAM J. on Optimization, vol. 26, no. 1, pp. 197–226, 2016.
 (7) Z. Qu and P. Richtárik, “Coordinate descent with arbitrary sampling I: Algorithms and complexity,” Optimization Methods and Software, vol. 31, no. 5, pp. 829–857, 2016.
 (8) Y. Nesterov, “Smooth minimization of nonsmooth functions,” Math. Prog., vol. 103, no. 1, pp. 127–152, 2005.
 (9) Q. Tran-Dinh, O. Fercoq, and V. Cevher, “A smooth primal-dual optimization framework for nonsmooth composite convex minimization,” arXiv preprint arXiv:1507.06243, 2015.
 (10) O. Fercoq and P. Richtárik, “Smooth minimization of nonsmooth functions with parallel coordinate descent methods,” arXiv preprint arXiv:1309.5885, 2013.
 (11) O. Fercoq and P. Bianchi, “A coordinate descent primal-dual algorithm with large step size and possibly nonseparable functions,” arXiv preprint arXiv:1508.04625, 2015.
 (12) Y. Nesterov and S.U. Stich, “Efficiency of the accelerated coordinate descent method on structured optimization problems,” SIAM J. on Optimization, vol. 27, no. 1, pp. 110–123, 2017.
 (13) Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence O(1/k²),” Doklady AN SSSR, vol. 269, translated as Soviet Math. Dokl., pp. 543–547, 1983.
 (14) Y. T. Lee and A. Sidford, “Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems,” in Foundations of Computer Science (FOCS), 2013 IEEE Annual Symp. on, pp. 147–156, IEEE, 2013.
 (15) B. O’Donoghue and E. Candes, “Adaptive restart for accelerated gradient schemes,” Foundations of computational mathematics, vol. 15, no. 3, pp. 715–732, 2015.
 (16) B. C. Vũ, “A splitting algorithm for dual monotone inclusions involving cocoercive operators,” Advances in Computational Mathematics, vol. 38, no. 3, pp. 667–681, 2013.
 (17) A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
 (18) A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” Journal of mathematical imaging and vision, vol. 40, no. 1, pp. 120–145, 2011.
 (19) R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208, 1995.
 (20) E. D. Dohmatob, A. Gramfort, B. Thirion, and G. Varoquaux, “Benchmarking solvers for tv leastsquares and logistic regression in brain imaging,” in Pattern Recognition in Neuroimaging, 2014 International Workshop on, pp. 1–4, IEEE, 2014.
 (21) C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, p. 27, 2011.
 (22) D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” Journal of Machine Learning Research, vol. 5, no. Apr, pp. 361–397, 2004.
 (23) M. Lichman, “UCI machine learning repository,” 2013.

 (24) I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, “Result analysis of the nips 2003 feature selection challenge,” in Advances in neural information processing systems, pp. 545–552, 2005.
 (25) P. Tseng, “On accelerated proximal gradient methods for convex-concave optimization,” Submitted to SIAM J. Optim, 2008.
Appendix A Key lemmas
The following properties are key to designing the algorithm; their proofs are very similar to the proof of (tran2015smooth, Lemma 10) using a different norm, so we omit them here. The proof of the last property follows directly by using the explicit form of the smoothed function in the corresponding special case.
Lemma A.1.
For any β > 0, the function h_β defined by (4) satisfies the following properties:
(a) h_β is convex and smooth. Its gradient ∇h_β(u) = y^*_β(u) is Lipschitz continuous with the Lipschitz constant β^{-1}.