1 Introduction
In recent years, there has been increased interest in coordinate descent (CD) methods due to their simplicity, low cost per iteration, and efficiency (Wright, 2015). Algorithms based on coordinate descent are the state of the art for many optimization problems (Nesterov, 2012; Shalev-Shwartz and Zhang, 2013b; Lin et al., 2014; Shalev-Shwartz and Zhang, 2013a, 2016; Richtarik and Takac, 2016; Fercoq and Richtárik, 2015; Nesterov and Stich, 2017). Most CD methods draw their coordinates from a fixed distribution, for instance from the uniform distribution as in uniform coordinate descent (UCD). However, significant improvements can be achieved by choosing more important coordinates more frequently (Nesterov, 2012; Nesterov and Stich, 2017; Stich et al., 2017a,b; Perekrestenko et al., 2017). In particular, we could greedily choose the 'best' coordinate at each iteration, i.e., perform steepest coordinate descent (SCD).
SCD for composite problems.
Consider a smooth quadratic function $f$. There are three natural notions of the 'best' coordinate; following standard notation (cf. Nutini et al. (2015)) we call them the Gauss-Southwell (GS) rules. One could choose (i) GSs: the steepest coordinate direction based on (sub)gradients, (ii) GSr: the coordinate which allows us to take the largest step, and (iii) GSq: the coordinate which allows us to decrease the function value the most. For our example (and in general for smooth functions), the three rules are equivalent. When we add an additional nonsmooth separable function to $f$, such as the $\ell_1$-norm, however, the three notions are no longer equivalent. The performance of greedy coordinate descent in this composite setting is not well understood, and is the focus of this work.
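To make the distinction concrete, here is a toy numeric illustration (our own, not from the paper) of the three rules on a separable quadratic plus an $\ell_1$ term, assuming coordinate-wise quadratic upper bounds; all names and values are illustrative:

```python
import math

def min_subgradient(g, x, lam):
    # minimum-norm element of the subdifferential of f + lam*|.| at x
    if x != 0:
        return g + lam * math.copysign(1.0, x)
    return math.copysign(max(abs(g) - lam, 0.0), -g)

def prox_step(g, L, x, lam):
    # proximal coordinate step: argmin_d [ g*d + (L/2)*d^2 + lam*|x + d| ]
    t = x - g / L
    return math.copysign(max(abs(t) - lam / L, 0.0), t) - x

def decrease(g, L, x, lam, d):
    # decrease of the quadratic upper bound plus the regularizer for step d
    return -(g * d + 0.5 * L * d * d + lam * (abs(x + d) - abs(x)))

x = [0.0, 0.0]       # current iterate
g = [2.0, 1.5]       # partial gradients of the smooth part
L = [100.0, 1.0]     # coordinate-wise smoothness constants
lam = 1.0

gs_s = max(range(2), key=lambda i: abs(min_subgradient(g[i], x[i], lam)))
gs_r = max(range(2), key=lambda i: abs(prox_step(g[i], L[i], x[i], lam)))
gs_q = max(range(2), key=lambda i: decrease(g[i], L[i], x[i], lam,
                                            prox_step(g[i], L[i], x[i], lam)))
print(gs_s, gs_r, gs_q)  # prints: 0 1 1
```

Here GSs prefers coordinate 0 (largest subgradient), while GSr and GSq prefer coordinate 1 (larger step and larger decrease), so the three rules genuinely disagree once the $\ell_1$ term is present.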
Iteration complexity of SCD.
If the objective decomposes into $n$ identical separable problems, then clearly SCD is identical to UCD. In all other cases, Nutini et al. (2015) give a refined analysis of SCD for smooth functions and show that it outperforms UCD. This led to a renewed interest in greedy methods (e.g. Karimi et al., 2016; You et al., 2016; Dünner et al., 2017; Song et al., 2017; Nutini et al., 2017; Stich et al., 2017a; Locatello et al., 2018; Lu et al., 2018). However, in the composite case the analysis in (Nutini et al., 2015) of SCD for all three of the rules mentioned above falls back to that of UCD, and thus fails to demonstrate any advantage of greedy methods. In fact, it is claimed there that the rate of the greedy GSs rule may even be worse than that of UCD. In this work we provide a refined analysis of SCD for a certain class of composite problems, and show that all three strategies (GSs, GSr, and GSq) converge on composite problems at a rate similar to that of SCD in the smooth case. Thus for these problems too, greedy coordinate algorithms are provably faster than UCD, other than in extreme cases.
Efficiency of SCD.
A naïve implementation of SCD would require computing the full gradient, at a cost roughly $n$ times higher than computing the single coordinate of the gradient required by UCD. This seems to negate any potential gain of SCD over UCD. The working principle behind approximate SCD methods is to trade off exactness of the greedy direction against the time spent determining the steepest direction (e.g. Stich et al., 2017a). For smooth problems, Dhillon et al. (2011) show that approximate nearest neighbor search algorithms can provide an approximate steepest descent direction in sublinear time. We build upon these ideas and extend the framework to nonsmooth composite problems, thereby capturing a significantly larger class of input problems. In particular, we show how to efficiently map the GSs rule to an instance of maximum inner product search (MIPS).
Contributions.
We analyze and advocate the use of the GSs greedy rule to compute the update direction for composite problems. Our main contributions are:


We show that on a class of composite problems, greedy coordinate methods achieve convergence rates which are very similar to those obtained for smooth functions, thereby extending the applicability of SCD. This class of problems covers several important applications, such as SVMs (in their dual formulation), Lasso regression, and regularized logistic regression, among others. With this we establish that greedy methods significantly outperform UCD also on composite problems, except in extreme cases (cf. Remark 4).
We show that both the GSs and the GSr rules achieve convergence rates which are (other than in extreme cases) faster than those of UCD. This sidesteps the negative results of Nutini et al. (2015) for these methods through a more fine-grained analysis. We also study the effect of approximate greedy directions on convergence.

Algorithmically, we show how to cast the GSs direction computation precisely as a special instance of a maximum inner product search problem (MIPS). Many standard nearest neighbor algorithms, such as Locality-Sensitive Hashing (LSH), can therefore be used to efficiently run SCD on composite optimization problems.

We perform extensive numerical experiments to study the advantages and limitations of our steepest descent rules combined with a current state-of-the-art MIPS implementation (Boytsov and Naidan, 2013).
Related Literature.
Coordinate descent, being one of the earliest known optimization methods, has a rich history (e.g. Bickley, 1941; Warga, 1963; Bertsekas and Tsitsiklis, 1989, 1991). A significant renewal of interest followed the work of Nesterov (2012), who provided a simple analysis of the convergence of UCD. In practice, many solvers (e.g. Ndiaye et al., 2015; Massias et al., 2018) combine UCD with active set heuristics, where attention is restricted to a subset of active coordinates. These methods are orthogonal to, and hence can be combined with, the greedy rules studied here. Greedy coordinate methods can also be viewed as an 'extreme' version of adaptive importance sampling (Stich et al., 2017a; Perekrestenko et al., 2017). However, unlike for greedy methods, even in the smooth case there are no easily characterized function classes for which adaptive sampling schemes or active set methods are provably faster than UCD. The work closest to ours, other than the already discussed Nutini et al. (2015), is that of Dhillon et al. (2011). The latter show a sublinear convergence rate for GSr on composite problems. They also propose a practical variant for $\ell_1$-regularized problems which essentially ignores the regularizer and is hence not guaranteed to converge.
2 Setup
We consider composite optimization problems of the structure

(1)  $\min_{x \in \mathbb{R}^n} \; F(x) := f(x) + \psi(x), \qquad \psi(x) := \sum_{i=1}^{n} \psi_i(x_i),$

where $n$ is the number of coordinates, $f$ is convex and smooth, and the $\psi_i$, $i \in [n]$, are convex and possibly nonsmooth. In this exposition, we further restrict the functions $\psi_i$ to either enforce a box constraint or an $\ell_1$ regularizer. This comprises many important problem classes, for instance dual SVM or Lasso regression, see Appendix A.3.
We further assume that the smooth component $f$ is coordinatewise $L$-smooth: for any $x \in \mathbb{R}^n$, $\gamma \in \mathbb{R}$, and $i \in [n]$,

(2)  $f(x + \gamma e_i) \le f(x) + \gamma \nabla_i f(x) + \frac{L}{2}\gamma^2.$
Sometimes we will assume that $f$ is in addition also strongly convex with respect to the $\ell_1$-norm, with constant $\mu_1 > 0$, that is,

(3)  $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_1}{2}\|y - x\|_1^2$

for any $x$ and $y$ in the domain of $f$. If $\mu$ denotes the strong convexity constant with respect to the $\ell_2$-norm, in general it holds that $\frac{\mu}{n} \le \mu_1 \le \mu$. See Nutini et al. (2015) for a detailed comparison of the two constants.
3 SCD for Nonsmooth Problems
Here we briefly recall the definitions of the GSs, GSr and GSq coordinate selection rules, and introduce the approximate GSs rule that we will consider in detail.

(4)  GSs: $\; i = \arg\max_{j \in [n]} \; \min_{s \in \partial \psi_j(x_j)} \big| \nabla_j f(x) + s \big|$

(5)  GSr: $\; i = \arg\max_{j \in [n]} \; |\gamma_j|$

(6)  GSq: $\; i = \arg\min_{j \in [n]} \; V_j$

for an iterate $x \in \mathbb{R}^n$, for standard unit vectors $e_j$. Here $V_j$ and $\gamma_j$ are defined as the minimum value and minimizer, respectively, of the one-dimensional subproblem $\min_{\gamma \in \mathbb{R}} \big[ \gamma \nabla_j f(x) + \frac{L}{2}\gamma^2 + \psi_j(x_j + \gamma) - \psi_j(x_j) \big]$. We relax the requirement for an exact steepest selection, and define an approximate GSs rule.
Definition 1 (approximate GSs).
For a given $x \in \mathbb{R}^n$, the coordinate $i$ is considered to be an approximate steepest direction for $F$ if
3.1 SCD for $\ell_1$-regularized problems
We now discuss the GSs rule for the concrete example of $\ell_1$-regularized problems, and collect some observations that we will use later to define the mapping to the S-MIPS instance. A similar discussion is included for box-constrained problems in Appendix B.
Consider $\ell_1$-regularized problems of the form

(7)  $\min_{x \in \mathbb{R}^n} \; F(x) := f(x) + \lambda \|x\|_1.$

The GSs steepest rule (4) and the update rules can be simplified for such functions. Let $\mathrm{sign}(\cdot)$ denote the sign function, and define the shrinkage operator $S_\lambda \colon \mathbb{R} \to \mathbb{R}$ as $S_\lambda(z) := \mathrm{sign}(z)\max(|z| - \lambda, 0)$. Further, for any $x \in \mathbb{R}^n$, let us define $s_\lambda(x, j)$ as

(8)  $s_\lambda(x, j) := \begin{cases} \nabla_j f(x) + \lambda\,\mathrm{sign}(x_j), & \text{if } x_j \neq 0, \\ S_\lambda\big(\nabla_j f(x)\big), & \text{if } x_j = 0. \end{cases}$

Lemma 1.
For any $x \in \mathbb{R}^n$, the GSs rule is equivalent to

(9)  $i = \arg\max_{j \in [n]} \; \big| s_\lambda(x, j) \big|.$
Our analysis of the GSs rule requires bounding the number of 'bad' steps (to be detailed in Section 4). For this, we slightly modify the update of the coordinate descent method. Note that we still always follow the GSs direction, but will sometimes not perform the standard proximal coordinate update along this direction. To update the $i$th coordinate, we either rely on the standard proximal step on the coordinate,

(10)  $v := S_{\lambda/L}\big(x_i - \tfrac{1}{L}\nabla_i f(x)\big),$

or we perform line search,

(11)  $v := \arg\min_{u \in \mathbb{R}} F\big(x + (u - x_i)e_i\big).$

Finally, the $i$th coordinate is updated as

(12)  $x_i^{+} := \begin{cases} 0, & \text{if } v\,x_i < 0, \\ v, & \text{otherwise.} \end{cases}$

Our modification or 'post-processing' step (12) ensures that the coordinate can never 'cross' the origin. This small change will later on help us bound the precise number of steps needed in our convergence rates (Sec. 4). The details are summarized in Algorithm 1.
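A minimal sketch of this safeguarded update for the $\ell_1$ case (our own illustration, assuming a single coordinatewise smoothness constant $L$; the line search variant (11) is omitted):

```python
import math

def shrink(z, t):
    # soft-thresholding (shrinkage) operator S_t(z) = sign(z) * max(|z| - t, 0)
    return math.copysign(max(abs(z) - t, 0.0), z)

def safe_coordinate_update(x_i, g_i, L, lam):
    # standard proximal coordinate step for f + lam*||.||_1 ...
    v = shrink(x_i - g_i / L, lam / L)
    # ... but if the new value lands on the other side of the origin,
    # truncate it to exactly 0: this is the 'post-processing' step that
    # later lets the analysis bound the number of bad steps
    if x_i != 0 and v * x_i < 0:
        return 0.0
    return v
```

For instance, from $x_i = 1$ with gradient $5$ (and $L = 1$, $\lambda = 0.5$) the raw proximal step would jump to $-3.5$; the safeguarded update stops at $0$ instead.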
4 Convergence Rates
In this section, we present our main convergence results. We illustrate the novelty of our results in the important $\ell_1$-regularized case: for strongly convex functions $f$, we provide the first linear rates of convergence independent of $n$ for greedy coordinate methods on $\ell_1$-regularized problems, matching the rates in the smooth case. In particular, for GSs this was conjectured to be impossible (Nutini et al., 2015, Sections H.5, H.6) (see Remark 4). We also show the sublinear convergence of the three rules in the non-strongly convex setting. Similar rates also hold for box-constrained problems.
4.1 Linear convergence for strongly convex $f$
Theorem 1.
Remark 2.
Remark 3.
All our linear convergence rates can be extended to objective functions which only satisfy the weaker condition of proximal-PL strong convexity (Karimi et al., 2016).
Remark 4.
The standard analysis (e.g. in Nesterov (2012)) of UCD gives a convergence rate of $\mathbb{E}\big[F(x_t)\big] - F^\star \le \big(1 - \tfrac{\mu}{nL}\big)^t\big(F(x_0) - F^\star\big)$.
Here $\mu$ is the strong convexity constant with respect to the $\ell_2$-norm, which satisfies $\frac{\mu}{n} \le \mu_1 \le \mu$. The left boundary $\mu_1 = \frac{\mu}{n}$ marks the worst case for SCD, resulting in convergence slower than UCD. It is shown in Nutini et al. (2015) that this occurs only in extreme examples (e.g. when $f$ consists of $n$ identical separable functions). For all other situations, when $\mu_1 > \frac{\mu}{n}$, our result shows that SCD is faster.
Remark 5.
Our analysis in terms of $\mu_1$ works for all three selection rules GSs, GSr, and GSq. In (Nutini et al., 2015, Sections H.5, H.6) it was conjectured (but not proven) that this linear convergence rate holds for GSq, but that it cannot hold for GSs or GSr. Example functions were constructed for which the single-step progress of GSs or GSr was shown to be much smaller than what the $\mu_1$ rate requires. However, these example steps are all bad steps, as we will define in the following proof sketch, whose number we show can be bounded.
We state an analogous linear rate for the box-constrained case too, but refer to Appendix B for the detailed algorithm and proof.
Theorem 2.
While the proof shares ideas with the $\ell_1$ case, there are significant differences, e.g. the division of the steps into three categories: i) good steps, which make sufficient progress; ii) bad steps, which may not make much progress but are bounded in number; and iii) cross steps, which also make sufficient progress.
Remark 6.
For the box case, the greedy methods converge faster than UCD if $\mu_1 > \frac{\mu}{n}$, as before, and if an additional condition on the problem constants holds. Typically, the corresponding term is much smaller than 1, and so the second condition is almost always satisfied. Hence we can expect greedy methods to be much faster in the box case, just as in the unconstrained smooth case. It remains unclear if this term truly affects the rate of convergence. For example, in the separated quadratic case considered in (Nutini et al., 2015, Sec. 4.1) it vanishes, and so we can ignore it in the rate (see Remark 16 in the Appendix).
Proof sketch.
While the full proofs are given in the appendix, we give here a proof sketch of the convergence of Algorithm 1 for $\ell_1$-regularized problems in the strongly convex case, as in Theorem 1.
The key idea of our technique is to partition the iterates into two sets, good and bad steps, depending on whether they make (provably) sufficient progress. We then show that the modification made to the update in (12) ensures that we do not have too many bad steps. Since Algorithm 1 is a descent method, we can focus only on the good steps to describe its convergence. The 'contradictions' to the convergence of GSs provided in (Nutini et al., 2015, Sections H.5, H.6) are in fact instances of bad steps.
The definitions of good and bad steps are explained in Fig. 1 (and formally in Def. 11). The core technical lemma below shows that in a good step, the update along the GSs direction has an alternative characterization. For the sake of simplicity, let us assume that we use the exact GSs coordinate.
Lemma 2.
Suppose that iteration $t$ of Algorithm 1 updates coordinate $i$, and that it was a good step. Then
Proof sketch. We will only examine the case $x_i \neq 0$ here for the sake of simplicity. Combining this with the assumption that iteration $t$ was a good step yields the sign conditions needed below. Further, if $x_i \neq 0$, the GSs value of coordinate $i$ is $|\nabla_i f(x) + \lambda\,\mathrm{sign}(x_i)|$.
Since $f$ is coordinatewise smooth (2),
But the GSs rule exactly maximizes the last quantity. Thus we can continue:
Recall that and so . Further for any , and so . This means that
Plugging this into our previous equation gives us the lemma. See Lemma 8 for the full proof. ∎
If $\lambda = 0$ (i.e. $F$ is smooth), Lemma 2 reduces to the 'refined analysis' of Nutini et al. (2015). We can now state the rate obtained in the strongly convex case.
Proof sketch for Theorem 1. Notice that if $x_i = 0$, the step is necessarily good by definition (see Fig. 1). Since we start at the origin $x_0 = 0$, the first time each coordinate is picked is a good step. Further, if some step is bad, this implies that the coordinate 'crosses' the origin; in this case our modified update rule (12) sets the coordinate to 0, and the next time coordinate $i$ is picked, the step is sure to be good. Thus at least half of the steps taken are good steps.
As per Lemma 2, every good step corresponds to optimizing the upper bound with the squared $\ell_1$ regularizer. We can finish the proof:
Inequality (i) follows from Karimireddy et al. (2018, Lemma 9), and (ii) from the strong convexity of $f$. Rearranging the terms above gives us the required linear rate of convergence. ∎
4.2 Sublinear convergence for general convex $f$
A sublinear convergence rate independent of $n$ for SCD can be obtained when $f$ is not strongly convex.
Theorem 3.
Suppose that $f$ is coordinatewise smooth and convex, with the nonsmooth part being an $\ell_1$ regularizer or a box constraint. Also let $X^\star$ be the set of minima of $F$, with minimum value $F^\star$. After $t$ steps of Algorithm 1 or Algorithm 2 respectively, where the coordinate is chosen using the GSs, GSr, or GSq rule,
where the rate constant involves the diameter of the level set, measured with respect to the set of minima $X^\star$.
While a similar convergence rate was known for the GSr rule (Dhillon et al., 2011), we here establish it for all three rules, even for the approximate GSs rule.
5 Maximum Inner Product Search
We now shift the focus from theoretical rates to the actual implementation. A very important observation, as pointed out by Dhillon et al. (2011), is that finding the steepest descent direction is closely related to a geometric problem. As an example consider the function $f(x) = \frac{1}{2}\|Ax - b\|^2$ for a data matrix $A = [a_1, \dots, a_n]$. The gradient takes the form $\nabla f(x) = A^\top q$ for the residual $q := Ax - b$, and thus finding the steepest coordinate direction is equal to finding the datapoint (column) with the largest (in absolute value) inner product with the query vector $q$, which a priori requires the evaluation of all $n$ scalar products. However, when we have to perform multiple similar queries (such as over the iterations of SCD), it is possible to preprocess the dataset to speed up the query time. Note that we do not require the columns to be normalized.
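In code, the observation reads as follows (a sketch under the least-squares example above; `a_cols` holds the columns of $A$, and the function name is ours):

```python
def steepest_smooth_coordinate(a_cols, q):
    # For f(x) = 0.5*||Ax - b||^2 the partial derivative along j is
    # <a_j, q> with residual q = Ax - b, so the steepest coordinate is
    # the column with the largest absolute inner product with q.
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    return max(range(len(a_cols)), key=lambda j: abs(dot(a_cols[j], q)))
```

For example, `steepest_smooth_coordinate([[1, 0], [0, 2], [1, 1]], [1, -1])` returns `1`, since the second column has the largest absolute inner product with the residual.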
For the more general set of problems we consider here, we need the following slightly stronger primitive.
Definition 7 (S-MIPS).
Given a set of $n$ $d$-dimensional points $P = \{p_1, \dots, p_n\}$, the Subset Maximum Inner Product Search or S-MIPS problem is to preprocess the set $P$ such that for any query vector $q \in \mathbb{R}^d$ and any subset $S$ of the points, the best point
$p^\star = \arg\max_{p \in S} \langle p, q \rangle$
can be computed with few scalar product evaluations.
State-of-the-art algorithms relax the exactness assumption and compute an approximate solution in time equivalent to a sublinear number of scalar product evaluations (e.g. Charikar, 2002; Lv et al., 2007; Shrivastava and Li, 2014; Neyshabur and Srebro, 2015; Andoni et al., 2015). We consciously refrain from stating more precise running times, as these depend on the actual choice of the algorithm and the parameters chosen by the user. Our approach in this paper is transparent to the actual choice of algorithm; we only show how SCD steps can be exactly cast as such instances. By employing an arbitrary solver one thus gets a sublinear-time approximate SCD update. An important caveat is that in subsequent queries we adaptively change the subset $S$ based on the solution of the previous query. Hence the known theoretical guarantees shown for LSH do not directly apply, though the practical performance does not seem to be affected by this (see Appendix Figs. 13, 17). Practical details of efficiently solving S-MIPS instances are provided in Section 7.
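As a reference point, an exact S-MIPS query is simply a restricted linear scan (illustrative sketch, not a library API; sublinear solvers replace exactly this scan):

```python
def smips_exact(points, q, subset):
    # Answer an S-MIPS query by brute force: return the index of the
    # best inner product with q among the indices listed in 'subset'.
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    return max(subset, key=lambda j: dot(points[j], q))
```

With points `[[1, 0], [0, 3], [2, 2]]` and query `[1, 1]`, the full search returns index `2`, while restricting to the subset `[0, 1]` returns index `1`.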
6 Mapping GSs to S-MIPS
We now move to our next contribution and show how the GSs rule can be efficiently implemented. We aim to cast the problem of computing the GSs update as an instance of S-MIPS, for which very fast query algorithms exist. In contrast, the GSr and GSq rules do not allow such a mapping. In this section, we will only consider objective functions of the following special structure:
(13)  $F(x) := f(Ax) + \lambda\,\psi(x), \qquad \psi(x) = \sum_{i=1}^{n} \psi_i(x_i),$ for a data matrix $A = [a_1, \dots, a_n] \in \mathbb{R}^{d \times n}$.
The usual problems such as Lasso, dual SVM, or logistic regression have such a structure (see Appendix A.3).
Difficulty of the Greedy Rules.
This section will serve to strongly motivate our choice of the GSs rule over GSr or GSq. Let us pause to examine the three greedy selection rules and compare their relative difficulty. As a warm-up, consider again the smooth function $f(x) = \frac{1}{2}\|Ax - b\|^2$ for a data matrix $A$, as introduced above in Section 5. We have observed that the steepest coordinate direction is equal to

(14)  $i = \arg\max_{j \in [n]} \big| \langle a_j, q \rangle \big| = \arg\max_{p \in \{\pm a_1, \dots, \pm a_n\}} \langle p, q \rangle.$

The formulation on the right is an instance of MIPS over the vectors $\{\pm a_j\}$. Now consider a nonsmooth problem of the form $F(x) = f(Ax) + \lambda\|x\|_1$. For simplicity, let us assume that all $x_j \neq 0$. In this case, the subgradient along coordinate $j$ is $\langle a_j, q \rangle + \lambda\,\mathrm{sign}(x_j)$, and the GSs rule is

(15)  $i = \arg\max_{j \in [n]} \big| \langle a_j, q \rangle + \lambda\,\mathrm{sign}(x_j) \big|.$
The rule (15) is clearly not much harder than (14), and can be cast as an S-MIPS problem with minor modifications (see details in Sec. 6.1).
Let $\mathrm{prox}_j(x)$ denote the proximal coordinate update along the $j$th coordinate. In our case, $\mathrm{prox}_j(x) = S_{\lambda/L}\big(x_j - \tfrac{1}{L}\nabla_j f(x)\big)$. The GSr rule can now be 'simplified' as:

(16)  $i = \arg\max_{j \in [n]} \big| \mathrm{prox}_j(x) - x_j \big|.$

It does not seem easy to cast (16) as an S-MIPS instance. It is even less likely that the GSq rule, which compares the attained one-dimensional minimum values, can be mapped to S-MIPS. This highlights the simplicity and usefulness of the GSs rule.
6.1 Mapping $\ell_1$-Regularized Problems
Here we focus on problems of the form (13) where $\psi(x) = \|x\|_1$. Again, we have $\nabla_j f(x) = \langle a_j, q \rangle$, where $q$ denotes the gradient of the outer function evaluated at $Ax$. For simplicity, let $\lambda = 1$. Then the GSs rule in (9) is

(17)  $i = \arg\max_{j \in [n]} \begin{cases} \big| \langle a_j, q \rangle + \mathrm{sign}(x_j) \big|, & \text{if } x_j \neq 0, \\ \max\big( |\langle a_j, q \rangle| - 1, \; 0 \big), & \text{if } x_j = 0. \end{cases}$
We want to map the problem of the above form to an S-MIPS instance. Define, for some constant $c > 0$, the $(d+1)$-dimensional vectors

(18)  $v_j^{\sigma\tau} := \big(\sigma a_j, \; \tau c\big), \qquad \sigma, \tau \in \{+1, -1\}, \; j \in [n],$

and form a query vector as

(19)  $\tilde q := \big(q, \; 1/c\big).$

A simple computation shows that, for the coordinates with $x_j > 0$, the problem in (17) is equivalent to
$\arg\max_{j} \max\big\{ \langle v_j^{++}, \tilde q \rangle, \; \langle v_j^{--}, \tilde q \rangle \big\},$
since $\max\{\langle a_j, q\rangle + 1, \; -\langle a_j, q\rangle - 1\} = |\langle a_j, q\rangle + 1|$. Thus by searching over a suitable subset of the vectors $\{v_j^{\sigma\tau}\}$, we can compute the GSs direction. Dealing with the cases $x_j \le 0$ goes through similar arguments, and the details are outlined in Appendix E. Here we only state the resulting mapping.
The constant $c$ in (18) and (19) is chosen to ensure that the augmented entry is of the same order of magnitude on average as the rest of the coordinates of the points. The need for $c$ only arises out of performance concerns about the underlying algorithm used to solve the S-MIPS instance. For example, $c$ has no effect if we use exact search.
Formally, define the set $V := \{v_j^{\sigma\tau} : \sigma, \tau \in \{\pm 1\}, \; j \in [n]\}$. Then at any iteration with current iterate $x$, we also define the query subset $S(x) \subseteq V$ as $S(x) := \bigcup_{j \in [n]} S_j(x)$, where

(20)  $S_j(x) := \begin{cases} \{v_j^{++}, v_j^{--}\}, & \text{if } x_j > 0, \\ \{v_j^{+-}, v_j^{-+}\}, & \text{if } x_j < 0, \\ \{v_j^{+-}, v_j^{--}\}, & \text{if } x_j = 0. \end{cases}$
Lemma 3.
The sets $S(x)$ and $S(x^{+})$ differ in at most four points, since $x$ and $x^{+}$ differ only in a single coordinate. This makes it computationally very efficient to incrementally maintain the query subset for $\ell_1$-regularized problems.
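The whole pipeline can be sketched end to end (our own simplified reconstruction with $\lambda = 1$ and an exact scan in place of an LSH index; all names are illustrative):

```python
import math

def gss_score(a_cols, q, x, j, lam=1.0):
    # |min-norm subgradient| of f(Ax) + lam*||x||_1 along coordinate j
    g = sum(u * v for u, v in zip(a_cols[j], q))
    if x[j] != 0:
        return abs(g + lam * math.copysign(1.0, x[j]))
    return max(abs(g) - lam, 0.0)

def build_points(a_cols, c=1.0):
    # fixed point set, preprocessed once: four augmented copies per column,
    # keyed by (coordinate, sign of the copy, sign of the extra entry)
    return {(j, s, t): [s * u for u in a] + [t * c]
            for j, a in enumerate(a_cols) for s in (1, -1) for t in (1, -1)}

def query_subset(x):
    # two candidate keys per coordinate, selected by the sign regime of x_j
    S = []
    for j, xj in enumerate(x):
        if xj > 0:
            S += [(j, 1, 1), (j, -1, -1)]
        elif xj < 0:
            S += [(j, 1, -1), (j, -1, 1)]
        else:
            S += [(j, 1, -1), (j, -1, -1)]
    return S

def gss_via_smips(a_cols, q, x, c=1.0):
    # augmented query (q, 1/c): the best augmented inner product over the
    # subset recovers the GSs coordinate (here with lam = 1)
    pts = build_points(a_cols, c)
    qt = list(q) + [1.0 / c]
    key = max(query_subset(x),
              key=lambda k: sum(u * v for u, v in zip(pts[k], qt)))
    return key[0]
```

When a coordinate changes its sign regime, only its two candidate keys in the subset change, while the point set itself stays fixed and can be indexed once up front.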
6.2 Mapping Box-Constrained Problems
7 Experimental Results
Our experiments focus on the standard tasks of Lasso regression, as well as SVM training (on the dual objective). We refer the reader to Appendix A.3 for definitions. Lasso regression is performed on the rcv1 dataset, while SVM is performed on the w1a and ijcnn1 datasets. All columns of the dataset (features for Lasso, datapoints for SVM) are normalized to unit length, allowing us to use the standard cosine-similarity algorithms of nmslib (Boytsov and Naidan, 2013) to efficiently solve the S-MIPS instances. Note however that our framework is applicable without any normalization, if a general MIPS solver is used instead. We use the hnsw algorithm of the nmslib library with default hyperparameter values and other parameters as in Table 1, selected by grid search. (A short overview of how to set these hyperparameters can be found at https://github.com/nmslib/nmslib/blob/master/python_bindings/parameters.md; more details, such as the meaning of these parameters, can be found in the nmslib manual (Naidan and Boytsov, 2015, pp. 61).) We exclude the time required for preprocessing of the datasets, since it is amortized over the multiple experiments run on the same dataset (say, for hyperparameter tuning). All our experiments are run on an Intel Xeon CPU E5-2680 v3 (2.50GHz, 30 MB cache) with 48 cores and 256GB RAM.
Dataset    n        d        sparsity   efS   post
rcv1,      47,236   15,564   19%        100   2
rcv1,      47,236   15,564   3%         400   2
w1a        2,477    300      -          100   0
ijcnn1     49,990   22       -          50    0
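The reduction from MIPS to cosine similarity under unit-length normalization, mentioned above, can be checked directly (a small self-contained sketch, not the experiment code):

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def normalize(v):
    nrm = math.sqrt(dot(v, v))
    return [vi / nrm for vi in v]

# Once the points have unit length, maximizing the inner product with a
# query is the same as maximizing cosine similarity, so cosine-similarity
# indexes (such as those in nmslib) can answer the MIPS queries directly.
points = [normalize(p) for p in [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]
query = [1.0, 1.0]

by_inner = max(range(len(points)), key=lambda j: dot(points[j], query))
by_cosine = max(range(len(points)),
                key=lambda j: dot(points[j], query)
                / (math.sqrt(dot(points[j], points[j]))
                   * math.sqrt(dot(query, query))))
assert by_inner == by_cosine
```

Without the normalization step the two rankings can disagree, which is why a general MIPS solver is needed in the unnormalized case.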
First we compare the practical algorithm (dhillon) of Dhillon et al. (2011), which disregards the regularization part when choosing the next coordinate, against Algorithm 1 with the GSs rule (steepest) for Lasso regression. Note that dhillon is not guaranteed to converge. To compare the selection rules without bias from the choice of library, we use exact search to answer the queries. As seen in Fig. 2, steepest significantly outperforms dhillon. In fact dhillon stagnates (though it does not diverge) once the error becomes small and the regularization term starts playing a significant role. Increasing the regularization further worsens its performance. This is understandable, since the rule used by dhillon ignores the regularizer.
Next we compare our steepest strategy (Algorithms 1 and 2 using the GSs rule), and the corresponding nearest-neighbor-based approximate versions (steepest-nn), against uniform, which picks coordinates uniformly at random. The regularization parameters are kept fixed throughout these experiments, both for Lasso and for SVM. Fig. 3 shows the clearly superior performance in terms of iterations of the steepest strategy, as well as of steepest-nn, over uniform for both the Lasso and the SVM problems. However, towards the end of the optimization, i.e. in high accuracy regimes, steepest-nn fails to find directions substantially better than uniform. This is because towards the end all components of the gradient become small, meaning that the query vector is nearly orthogonal to all points, a setting in which the employed nearest neighbor library nmslib performs poorly (Boytsov and Naidan, 2013).
Fig. 4 compares the wall-time performance of the steepest, steepest-nn, and uniform strategies, including all the overhead of finding the descent direction. In all instances, the steepest-nn algorithm is competitive with uniform at the start, compensating for the increased time per iteration by increased progress per iteration. However, towards the end, steepest-nn obtains comparable progress per iteration at a significantly larger cost, making its overall performance worse. With increasing sparsity of the solution (see Table 1 for sparsity levels), the exact steepest rule starts to outperform uniform and steepest-nn.
The wall-time experiments (Fig. 4) show that steepest-nn always gives a significant performance gain in the important early phase of optimization, but in the later phase loses out to uniform due to the query cost and the poor performance of nmslib. In practice, the recommended implementation is to use the steepest-nn algorithm in the early optimization regime, and to switch to uniform once the iteration cost outweighs the gain. In the Appendix (Fig. 13) we further investigate the poor quality of the solutions provided by nmslib.
8 Concluding Remarks
In this work we have proposed an approximate GSs selection rule for coordinate descent, and shown its convergence for several important classes of problems for the first time, furthering our understanding of steepest descent on nonsmooth problems. We have also described a new primitive, the Subset Maximum Inner Product Search (S-MIPS), and cast the GSs selection rule as an instance of it. This enables the use of fast sublinear algorithms designed for this problem to efficiently compute an approximate GSs direction.
We obtained strong empirical evidence for the superiority of the GSs rule over randomly picking coordinates on several real-world datasets, validating our theory. Further, we showed that for Lasso regression, our algorithm consistently outperforms the practical algorithm presented in (Dhillon et al., 2011). Finally, we performed extensive numerical experiments showcasing the strengths and weaknesses of current state-of-the-art libraries for computing an approximate GSs direction. As $n$ grows, the cost per iteration for nmslib remains comparable to that of UCD, while the progress made per iteration increases. This means that as problem sizes grow, GSs implemented via S-MIPS becomes an increasingly attractive approach. We also showed that when the norm of the gradient becomes small, current state-of-the-art methods struggle to find directions substantially better than uniform. Alleviating this, and leveraging the very active development of recent alternatives to LSH as subroutines for our method, is a promising direction for future work. In a different direction, since the GSs rule, as opposed to GSq or GSr, uses only local subgradient information, it might be amenable to gradient approximation schemes typically used in zeroth-order algorithms (e.g. Wibisono et al., 2012).
Acknowledgements.
We thank Ludwig Schmidt for numerous discussions on using FALCONN, and for his useful advice on setting its hyperparameters. We also thank Vyacheslav Alipov for insights about nmslib, Hadi Daneshmand for algorithmic insights, and Mikkel Thorup for discussions on using hashing schemes in practice. The feedback from many anonymous reviewers has also helped significantly improve the presentation.
References
 Andoni et al. (2015) Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., and Schmidt, L. (2015). Practical and Optimal LSH for Angular Distance. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1225–1233. Curran Associates, Inc.
 Bertsekas and Tsitsiklis (1989) Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical methods, volume 23. Prentice hall Englewood Cliffs, NJ.
 Bertsekas and Tsitsiklis (1991) Bertsekas, D. P. and Tsitsiklis, J. N. (1991). Some aspects of parallel and distributed iterative algorithms—a survey. Automatica, 27(1):3–21.
 Bickley (1941) Bickley, W. (1941). Relaxation methods in engineering science: A treatise on approximate computation.
 Boytsov and Naidan (2013) Boytsov, L. and Naidan, B. (2013). Engineering efficient and effective Non-Metric Space Library. In Brisaboa, N., Pedreira, O., and Zezula, P., editors, Similarity Search and Applications, volume 8199 of Lecture Notes in Computer Science, pages 280–293. Springer Berlin Heidelberg.

 Charikar (2002) Charikar, M. S. (2002). Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 380–388, New York, NY, USA. ACM.
 Dhillon et al. (2011) Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2011). Nearest Neighbor based Greedy Coordinate Descent. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 2160–2168. Curran Associates, Inc.
 Dünner et al. (2017) Dünner, C., Parnell, T., and Jaggi, M. (2017). Efficient use of limited-memory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems, pages 4258–4267.
 Fercoq and Richtárik (2015) Fercoq, O. and Richtárik, P. (2015). Accelerated, Parallel, and Proximal Coordinate Descent. SIAM Journal on Optimization, 25(4):1997–2023.

 Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer.
 Karimireddy et al. (2018) Karimireddy, S. P. R., Stich, S., and Jaggi, M. (2018). Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems. In International Conference on Artificial Intelligence and Statistics, pages 1204–1213.
 Lin et al. (2014) Lin, Q., Lu, Z., and Xiao, L. (2014). An Accelerated Proximal Coordinate Gradient Method. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3059–3067. Curran Associates, Inc.
 Locatello et al. (2018) Locatello, F., Raj, A., Karimireddy, S. P., Rätsch, G., Schölkopf, B., Stich, S., and Jaggi, M. (2018). On matching pursuit and coordinate descent. In International Conference on Machine Learning, pages 3204–3213.
 Lu et al. (2018) Lu, H., Freund, R. M., and Mirrokni, V. (2018). Accelerating greedy coordinate descent methods. arXiv preprint arXiv:1806.02476.
 Lv et al. (2007) Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. (2007). Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 950–961. VLDB Endowment.
 Massias et al. (2018) Massias, M., Salmon, J., and Gramfort, A. (2018). Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, pages 3321–3330.
 Naidan and Boytsov (2015) Naidan, B. and Boytsov, L. (2015). Non-metric space library manual. arXiv preprint arXiv:1508.05470.
 Ndiaye et al. (2015) Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2015). Gap safe screening rules for sparse multitask and multiclass models. In Advances in Neural Information Processing Systems, pages 811–819.
 Nesterov (2012) Nesterov, Y. (2012). Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341–362.
 Nesterov and Stich (2017) Nesterov, Y. and Stich, S. (2017). Efficiency of the Accelerated Coordinate Descent Method on Structured Optimization Problems. SIAM Journal on Optimization, 27(1):110–123.
 Neyshabur and Srebro (2015) Neyshabur, B. and Srebro, N. (2015). On Symmetric and Asymmetric LSHs for Inner Product Search. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, pages 1926–1934.
 Nutini et al. (2017) Nutini, J., Laradji, I., and Schmidt, M. (2017). Let's make block coordinate descent go fast: Faster greedy rules, message-passing, active-set complexity, and superlinear convergence. arXiv preprint arXiv:1712.08859.
 Nutini et al. (2015) Nutini, J., Schmidt, M., Laradji, I. H., Friedlander, M., and Koepke, H. (2015). Coordinate Descent Converges Faster with the GaussSouthwell Rule Than Random Selection. arXiv:1506.00552 [cs, math, stat].
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
 Perekrestenko et al. (2017) Perekrestenko, D., Cevher, V., and Jaggi, M. (2017). Faster Coordinate Descent via Adaptive Importance Sampling. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 869–877, Fort Lauderdale, FL, USA. PMLR.
 Richtarik and Takac (2016) Richtarik, P. and Takac, M. (2016). Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(12):433–484.
 ShalevShwartz and Zhang (2013a) ShalevShwartz, S. and Zhang, T. (2013a). Accelerated Minibatch Stochastic Dual Coordinate Ascent. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 378–385, USA. Curran Associates Inc.
 ShalevShwartz and Zhang (2013b) ShalevShwartz, S. and Zhang, T. (2013b). Stochastic Dual Coordinate Ascent Methods for Regularized Loss. J. Mach. Learn. Res., 14(1):567–599.
 ShalevShwartz and Zhang (2016) ShalevShwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(12):105–145.
 Shrivastava and Li (2014) Shrivastava, A. and Li, P. (2014). Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). In NIPS 2014  Advances in Neural Information Processing Systems 27, pages 2321–2329.
 Song et al. (2017) Song, C., Cui, S., Jiang, Y., and Xia, S.T. (2017). Accelerated stochastic greedy coordinate descent by soft thresholding projection onto simplex. In Advances in Neural Information Processing Systems, pages 4838–4847.
 Stich et al. (2017a) Stich, S. U., Raj, A., and Jaggi, M. (2017a). Approximate Steepest Coordinate Descent. ICML  Proceedings of the 34th International Conference on Machine Learning.
 Stich et al. (2017b) Stich, S. U., Raj, A., and Jaggi, M. (2017b). Safe Adaptive Importance Sampling. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4384–4394. Curran Associates, Inc.
 Warga (1963) Warga, J. (1963). Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593.
 Wibisono et al. (2012) Wibisono, A., Wainwright, M. J., Jordan, M. I., and Duchi, J. C. (2012). Finite sample convergence rates of zeroorder stochastic optimization methods. In Advances in Neural Information Processing Systems, pages 1439–1447.
 Wright (2015) Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3–34.
 You et al. (2016) You, Y., Lian, X., Liu, J., Yu, H.F., Dhillon, I. S., Demmel, J., and Hsieh, C.J. (2016). Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems, pages 4682–4690.
References
 Andoni et al. (2015) Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., and Schmidt, L. (2015). Practical and Optimal LSH for Angular Distance. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1225–1233. Curran Associates, Inc.
 Bertsekas and Tsitsiklis (1989) Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ.
 Bertsekas and Tsitsiklis (1991) Bertsekas, D. P. and Tsitsiklis, J. N. (1991). Some aspects of parallel and distributed iterative algorithms—a survey. Automatica, 27(1):3–21.
 Bickley (1941) Bickley, W. (1941). Relaxation methods in engineering science: A treatise on approximate computation.
 Boytsov and Naidan (2013) Boytsov, L. and Naidan, B. (2013). Engineering efficient and effective Non-Metric Space Library. In Brisaboa, N., Pedreira, O., and Zezula, P., editors, Similarity Search and Applications, volume 8199 of Lecture Notes in Computer Science, pages 280–293. Springer Berlin Heidelberg.

 Charikar (2002) Charikar, M. S. (2002). Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 380–388, New York, NY, USA. ACM.
 Dhillon et al. (2011) Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2011). Nearest Neighbor based Greedy Coordinate Descent. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 2160–2168. Curran Associates, Inc.
 Dünner et al. (2017) Dünner, C., Parnell, T., and Jaggi, M. (2017). Efficient use of limitedmemory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems, pages 4258–4267.
 Fercoq and Richtárik (2015) Fercoq, O. and Richtárik, P. (2015). Accelerated, Parallel, and Proximal Coordinate Descent. SIAM Journal on Optimization, 25(4):1997–2023.

 Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer.
 Karimireddy et al. (2018) Karimireddy, S. P. R., Stich, S., and Jaggi, M. (2018). Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems. In International Conference on Artificial Intelligence and Statistics, pages 1204–1213.
 Lin et al. (2014) Lin, Q., Lu, Z., and Xiao, L. (2014). An Accelerated Proximal Coordinate Gradient Method. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3059–3067. Curran Associates, Inc.
 Locatello et al. (2018) Locatello, F., Raj, A., Karimireddy, S. P., Rätsch, G., Schölkopf, B., Stich, S., and Jaggi, M. (2018). On matching pursuit and coordinate descent. In International Conference on Machine Learning, pages 3204–3213.
 Lu et al. (2018) Lu, H., Freund, R. M., and Mirrokni, V. (2018). Accelerating greedy coordinate descent methods. arXiv preprint arXiv:1806.02476.
 Lv et al. (2007) Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. (2007). Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 950–961. VLDB Endowment.
 Massias et al. (2018) Massias, M., Salmon, J., and Gramfort, A. (2018). Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, pages 3321–3330.
 Naidan and Boytsov (2015) Naidan, B. and Boytsov, L. (2015). Non-metric space library manual. arXiv preprint arXiv:1508.05470.
 Ndiaye et al. (2015) Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2015). Gap safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems, pages 811–819.
 Nesterov (2012) Nesterov, Y. (2012). Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341–362.
 Nesterov and Stich (2017) Nesterov, Y. and Stich, S. (2017). Efficiency of the Accelerated Coordinate Descent Method on Structured Optimization Problems. SIAM Journal on Optimization, 27(1):110–123.
 Neyshabur and Srebro (2015) Neyshabur, B. and Srebro, N. (2015). On Symmetric and Asymmetric LSHs for Inner Product Search. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, pages 1926–1934.
 Nutini et al. (2017) Nutini, J., Laradji, I., and Schmidt, M. (2017). Let’s make block coordinate descent go fast: Faster greedy rules, message-passing, active-set complexity, and superlinear convergence. arXiv preprint arXiv:1712.08859.
 Nutini et al. (2015) Nutini, J., Schmidt, M., Laradji, I. H., Friedlander, M., and Koepke, H. (2015). Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection. arXiv:1506.00552 [cs, math, stat].
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikitlearn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.
 Perekrestenko et al. (2017) Perekrestenko, D., Cevher, V., and Jaggi, M. (2017). Faster Coordinate Descent via Adaptive Importance Sampling. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 869–877, Fort Lauderdale, FL, USA. PMLR.
 Richtarik and Takac (2016) Richtárik, P. and Takáč, M. (2016). Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484.
 Shalev-Shwartz and Zhang (2013a) Shalev-Shwartz, S. and Zhang, T. (2013a). Accelerated Mini-batch Stochastic Dual Coordinate Ascent. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 378–385, USA. Curran Associates Inc.
 Shalev-Shwartz and Zhang (2013b) Shalev-Shwartz, S. and Zhang, T. (2013b). Stochastic Dual Coordinate Ascent Methods for Regularized Loss. J. Mach. Learn. Res., 14(1):567–599.
 Shalev-Shwartz and Zhang (2016) Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145.
 Shrivastava and Li (2014) Shrivastava, A. and Li, P. (2014). Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). In NIPS 2014 - Advances in Neural Information Processing Systems 27, pages 2321–2329.
 Song et al. (2017) Song, C., Cui, S., Jiang, Y., and Xia, S.T. (2017). Accelerated stochastic greedy coordinate descent by soft thresholding projection onto simplex. In Advances in Neural Information Processing Systems, pages 4838–4847.
 Stich et al. (2017a) Stich, S. U., Raj, A., and Jaggi, M. (2017a). Approximate Steepest Coordinate Descent. In ICML - Proceedings of the 34th International Conference on Machine Learning.
 Stich et al. (2017b) Stich, S. U., Raj, A., and Jaggi, M. (2017b). Safe Adaptive Importance Sampling. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4384–4394. Curran Associates, Inc.
 Warga (1963) Warga, J. (1963). Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593.
 Wibisono et al. (2012) Wibisono, A., Wainwright, M. J., Jordan, M. I., and Duchi, J. C. (2012). Finite sample convergence rates of zeroorder stochastic optimization methods. In Advances in Neural Information Processing Systems, pages 1439–1447.
 Wright (2015) Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3–34.
 You et al. (2016) You, Y., Lian, X., Liu, J., Yu, H.F., Dhillon, I. S., Demmel, J., and Hsieh, C.J. (2016). Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems, pages 4682–4690.
Appendix A Setup and Notation
In this section we state the definitions and notation that were deferred earlier. We also describe the class of functions we tackle and the applications that fit into this framework.
A.1 Function Classes
Definition 8 (coordinate-wise smoothness).
A function $f \colon \mathbb{R}^n \to \mathbb{R}$ is coordinate-wise $L$-smooth if
$$f(\mathbf{x} + \gamma \mathbf{e}_i) \le f(\mathbf{x}) + \gamma \nabla_i f(\mathbf{x}) + \tfrac{L}{2}\gamma^2$$
for any $\mathbf{x} \in \mathbb{R}^n$, $\gamma \in \mathbb{R}$, and $i \in [n]$, where $\mathbf{e}_i$ is a coordinate basis vector.
We also define strong convexity of the function $f$.
Definition 9 (strong convexity).
A function $f$ is $\mu$-strongly convex with respect to a norm $\|\cdot\|$ if
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \tfrac{\mu}{2} \|\mathbf{y} - \mathbf{x}\|^2$$
for any $\mathbf{x}$ and $\mathbf{y}$ in the domain of $f$ (note that $f$ does not necessarily need to be defined on the entire space $\mathbb{R}^n$).
We will frequently denote by $\mu_2$ the strong convexity constant corresponding to the usual Euclidean norm $\|\cdot\|_2$, and by $\mu_1$ the strong convexity constant corresponding to the $\ell_1$ norm. In general it holds that $\tfrac{\mu_2}{n} \le \mu_1 \le \mu_2$. See (Nutini et al., 2015) for a detailed comparison of the two constants.
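The relation between the two constants can be sanity-checked numerically. The sketch below (an illustration, not part of the paper's development; the matrix `H` is an arbitrary toy example) estimates $\mu_2$ and $\mu_1$ for the quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top H \mathbf{x}$ by minimizing $\mathbf{v}^\top H \mathbf{v}$ over random directions normalized in the respective norms, and checks $\mu_2/n \le \mu_1 \le \mu_2$.

```python
import random, math

# Toy symmetric positive-definite matrix (hypothetical example), n = 2.
H = [[3.0, 1.0], [1.0, 2.0]]
n = 2

def quad(v):
    # v^T H v, the curvature of f(x) = 1/2 x^T H x along direction v
    return sum(v[i] * H[i][j] * v[j] for i in range(n) for j in range(n))

random.seed(1)
mu2 = mu1 = float("inf")
for _ in range(20000):
    v = [random.gauss(0, 1) for _ in range(n)]
    l2 = math.sqrt(sum(c * c for c in v))
    l1 = sum(abs(c) for c in v)
    mu2 = min(mu2, quad(v) / l2 ** 2)  # estimate of mu w.r.t. l2 norm
    mu1 = min(mu1, quad(v) / l1 ** 2)  # estimate of mu w.r.t. l1 norm

# Since ||v||_2 <= ||v||_1 <= sqrt(n) ||v||_2, the sandwich holds sample-wise.
assert mu2 / n <= mu1 + 1e-9 <= mu2 + 1e-9
```

For this `H` the estimate of $\mu_2$ approaches the smallest eigenvalue $(5-\sqrt{5})/2 \approx 1.38$, and $\mu_1$ lands strictly between $\mu_2/2$ and $\mu_2$.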
Theorem 4.
Suppose that the function $f$ is twice-differentiable and
$$\max_i \, [\nabla^2 f(\mathbf{x})]_{ii} \le L$$
for any $\mathbf{x}$ in the domain of $f$, i.e. the maximum diagonal element of the Hessian is bounded by $L$. Then $f$ is coordinate-wise $L$-smooth. Additionally, for any $\mathbf{x}$ and $\mathbf{y}$ in the domain of $f$,
$$f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \tfrac{L}{2} \|\mathbf{y} - \mathbf{x}\|_1^2\,.$$
Proof.
By Taylor’s expansion and the intermediate value theorem, we have that for any $\mathbf{x}, \mathbf{y}$, there exists a $\lambda \in [0,1]$ such that for $\mathbf{z} = \mathbf{x} + \lambda(\mathbf{y} - \mathbf{x})$,
$$f(\mathbf{y}) = f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \tfrac{1}{2} (\mathbf{y} - \mathbf{x})^\top \nabla^2 f(\mathbf{z}) (\mathbf{y} - \mathbf{x})\,. \quad (21)$$
Now if $\mathbf{y} = \mathbf{x} + \gamma \mathbf{e}_i$ for some $\gamma \in \mathbb{R}$ and coordinate $i$, equation (21) becomes
$$f(\mathbf{x} + \gamma \mathbf{e}_i) = f(\mathbf{x}) + \gamma \nabla_i f(\mathbf{x}) + \tfrac{\gamma^2}{2} [\nabla^2 f(\mathbf{z})]_{ii}\,.$$
The first claim now follows since $L$ was defined such that $[\nabla^2 f(\mathbf{z})]_{ii} \le L$. For the second claim, consider the following optimization problem over $\mathbf{v}$,
$$\max_{\|\mathbf{v}\|_1 \le 1} \mathbf{v}^\top \nabla^2 f(\mathbf{z})\, \mathbf{v}\,. \quad (22)$$
We claim that the maximum is achieved for $\mathbf{v} = \pm \mathbf{e}_i$ for some $i$. For now assume this is true. Then we would have that
$$\max_{\|\mathbf{v}\|_1 \le 1} \mathbf{v}^\top \nabla^2 f(\mathbf{z})\, \mathbf{v} = \max_i \, [\nabla^2 f(\mathbf{z})]_{ii} \le L\,.$$
Using this result in equation (21) with $\mathbf{v} = (\mathbf{y} - \mathbf{x}) / \|\mathbf{y} - \mathbf{x}\|_1$ would prove our second claim of the theorem. Thus we need to study the optimum of (22). Since $f$ is a convex function, $\nabla^2 f(\mathbf{z})$ is positive semi-definite and so $\mathbf{v} \mapsto \mathbf{v}^\top \nabla^2 f(\mathbf{z})\, \mathbf{v}$ is also a convex function. We can now appeal to Lemma 4 for the convex set $\{\mathbf{v} : \|\mathbf{v}\|_1 \le 1\}$. The corners of this set exactly correspond to the unit directional vectors $\pm \mathbf{e}_i$. With this we finish the proof of our theorem. ∎
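Theorem 4 can be checked numerically on a small quadratic. The sketch below (the positive-definite matrix `H` is an arbitrary toy example, not from the paper) verifies the bound $f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle + \tfrac{L}{2}\|\mathbf{y}-\mathbf{x}\|_1^2$ with $L$ the maximum diagonal entry of the Hessian.

```python
# Toy symmetric, diagonally dominant (hence PSD) matrix; f(x) = 1/2 x^T H x.
H = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 1.0],
     [0.5, 1.0, 2.0]]

def f(x):
    return 0.5 * sum(x[i] * H[i][j] * x[j] for i in range(3) for j in range(3))

def grad(x):
    return [sum(H[i][j] * x[j] for j in range(3)) for i in range(3)]

L = max(H[i][i] for i in range(3))  # max diagonal entry of the Hessian

def upper_bound(x, y):
    # f(x) + <grad f(x), y - x> + L/2 * ||y - x||_1^2
    g = grad(x)
    lin = sum(g[i] * (y[i] - x[i]) for i in range(3))
    l1 = sum(abs(y[i] - x[i]) for i in range(3))
    return f(x) + lin + 0.5 * L * l1 ** 2

# Check the l1-smoothness bound of Theorem 4 on a few point pairs.
pairs = [([0.0, 0.0, 0.0], [1.0, -2.0, 0.5]),
         ([1.0, 1.0, 1.0], [-0.3, 2.0, 0.1])]
for x, y in pairs:
    assert f(y) <= upper_bound(x, y) + 1e-12
```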
Remark 10.
This result states that if we define smoothness with respect to the $\ell_1$ norm, the resulting smoothness constant is the same as the coordinate-wise smoothness constant. This is surprising since for a general convex function $f$, using the update rule
$$\mathbf{x}^{t+1} = \arg\min_{\mathbf{y}} \left[ \langle \nabla f(\mathbf{x}^t), \mathbf{y} - \mathbf{x}^t \rangle + \tfrac{L}{2} \|\mathbf{y} - \mathbf{x}^t\|_1^2 \right]$$
does not necessarily yield a coordinate update. We believe this observation (though not crucial to the current work) was not known before.
Let us prove an elementary lemma about maximizing convex functions over convex sets.
Lemma 4 (Maximum of a constrained convex function).
For any convex function $g$, the maximum over a compact convex set $\mathcal{Q}$ is achieved at a ‘corner’. Here a ‘corner’ is defined to be a point $\mathbf{z} \in \mathcal{Q}$ such that there do not exist two points $\mathbf{x}, \mathbf{y} \in \mathcal{Q}$ with $\mathbf{x} \neq \mathbf{y}$, such that $\mathbf{z} = \lambda \mathbf{x} + (1 - \lambda) \mathbf{y}$ for some $\lambda \in (0,1)$.
Proof.
Suppose that the maximum is achieved at a point $\mathbf{z}$ which is not a ‘corner’. Then let $\mathbf{x}, \mathbf{y} \in \mathcal{Q}$ be two points such that for $\lambda \in (0,1)$, we have $\mathbf{z} = \lambda \mathbf{x} + (1 - \lambda) \mathbf{y}$. Since the function $g$ is convex,
$$g(\mathbf{z}) \le \lambda g(\mathbf{x}) + (1 - \lambda) g(\mathbf{y}) \le \max\{g(\mathbf{x}), g(\mathbf{y})\}\,, \quad (23)$$
so the maximum is also achieved at $\mathbf{x}$ or $\mathbf{y}$, and repeating the argument it is achieved at a corner. ∎
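Lemma 4 can be illustrated by brute force in two dimensions: for a convex quadratic maximized over the $\ell_1$ ball, no sampled point should beat the corners $\pm\mathbf{e}_i$. The matrix `H` below is an arbitrary toy example, not from the paper.

```python
import random

# Convex quadratic g(v) = v^T H v over the l1 ball {||v||_1 <= 1} in R^2.
H = [[3.0, 1.0], [1.0, 2.0]]

def g(v):
    return sum(v[i] * H[i][j] * v[j] for i in range(2) for j in range(2))

# At the corners +-e_i the value is g(e_i) = g(-e_i) = H_ii.
corner_max = max(g([1, 0]), g([0, 1]))

random.seed(0)
for _ in range(1000):
    # Sample a random point and project it into the l1 ball if needed.
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    s = abs(a) + abs(b)
    if s > 1:
        a, b = a / s, b / s
    # Lemma 4: no point of the ball exceeds the best corner value.
    assert g([a, b]) <= corner_max + 1e-12
```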
We also assume that the proximal term $\psi$ is separable, and that each $\psi_i$ is either an $\ell_1$ regularizer or enforces a box constraint.
A.2 Proximal Coordinate Descent
As argued in the introduction, coordinate descent is the method of choice for large-scale problems of the form (1). We denote the iterates by $\mathbf{x}^t$, and a single coordinate of this vector by a subscript, i.e. $x_i^t$ for $i \in [n]$. CD methods only change one coordinate of the iterates in each iteration. That is, when coordinate $i$ is updated at iteration $t$, we have $x_j^{t+1} = x_j^t$ for $j \neq i$, and
$$x_i^{t+1} = \arg\min_{u \in \mathbb{R}} \left[ \nabla_i f(\mathbf{x}^t)(u - x_i^t) + \tfrac{L}{2}(u - x_i^t)^2 + \psi_i(u) \right]. \quad (24)$$
Combining the smoothness condition and the definition of the proximal update (24), we get that the progress made in one iteration is
$$F(\mathbf{x}^{t+1}) \le F(\mathbf{x}^t) + \min_{u \in \mathbb{R}} \left[ \nabla_i f(\mathbf{x}^t)(u - x_i^t) + \tfrac{L}{2}(u - x_i^t)^2 + \psi_i(u) - \psi_i(x_i^t) \right], \quad (25)$$
where $F := f + \psi$ denotes the composite objective.
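The update (24) can be made concrete for the Lasso, where $\psi_i(u) = \lambda|u|$ and the one-dimensional minimization has the closed-form soft-thresholding solution. The data `A`, `b` and the parameter `lam` below are toy values for illustration only; the assertion checks the progress guarantee (25) in the weaker form that no step increases the composite objective.

```python
# Toy lasso instance: minimize 1/2 ||Ax - b||^2 + lam * ||x||_1 over x in R^2.
A = [[1.0, 0.5], [0.2, 1.0], [0.3, 0.3]]
b = [1.0, 0.5, 0.2]
lam = 0.1

def residual(x):
    return [sum(A[r][j] * x[j] for j in range(2)) - b[r] for r in range(3)]

def objective(x):
    r = residual(x)
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in x)

def soft_threshold(z, t):
    # proximal map of t * |.|
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def prox_cd_step(x, i):
    # coordinate-wise smoothness constant of the smooth part along coordinate i
    Li = sum(A[r][i] ** 2 for r in range(3))
    gi = sum(A[r][i] * v for r, v in enumerate(residual(x)))  # grad_i f(x)
    x = list(x)
    x[i] = soft_threshold(x[i] - gi / Li, lam / Li)  # the update (24)
    return x

x = [0.0, 0.0]
for t in range(20):
    before = objective(x)
    x = prox_cd_step(x, t % 2)  # cyclic selection, just for illustration
    assert objective(x) <= before + 1e-12  # each step makes progress
```

The paper's selection rules (GS-s, GS-r, GS-q) differ only in which coordinate `i` is passed to the step; the update itself is unchanged.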
A.3 Applications
There are a number of relevant problems in machine learning which are of the form
$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) + \psi(\mathbf{x})\,,$$
where the nonsmooth term $\psi$ either enforces a box constraint, or is an $\ell_1$ regularizer. This class covers several important problems such as SVMs, Lasso regression, logistic regression, and elastic-net regularized problems. We use SVMs and Lasso regression as running examples for our methods.
SVM.
The loss function for training SVMs with $\lambda > 0$ as the regularization parameter can be written as
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{n} \sum_{i=1}^n \max\big(0,\, 1 - y_i \langle \mathbf{a}_i, \mathbf{w} \rangle\big)\,, \quad (26)$$
where $y_i \in \{-1, +1\}$ and $\mathbf{a}_i \in \mathbb{R}^d$ for $i \in [n]$ the training data. We can define the corresponding dual problem for (26) as
$$\max_{\boldsymbol{\alpha} \in [0,1]^n} \; \frac{1}{n} \sum_{i=1}^n \alpha_i - \frac{1}{2 \lambda n^2} \|A \boldsymbol{\alpha}\|^2\,, \quad (27)$$
where $A := [y_1 \mathbf{a}_1, \dots, y_n \mathbf{a}_n] \in \mathbb{R}^{d \times n}$ the data matrix with columns $y_i \mathbf{a}_i$ (Shalev-Shwartz and Zhang, 2013b). We can map this to (1) with $f(\boldsymbol{\alpha}) := \frac{1}{2 \lambda n^2} \|A \boldsymbol{\alpha}\|^2 - \frac{1}{n} \sum_{i=1}^n \alpha_i$, and with $\psi_i$ the box indicator function, i.e.
$$\psi_i(\alpha_i) = \begin{cases} 0 & \text{if } \alpha_i \in [0,1]\,, \\ +\infty & \text{otherwise}\,. \end{cases}$$
It is straightforward to see that the function $f$ is coordinate-wise $L$-smooth for $L = \max_i \|\mathbf{a}_i\|^2 / (\lambda n^2)$.
We map the dual variable $\boldsymbol{\alpha}$ back to the primal variable as $\mathbf{w}(\boldsymbol{\alpha}) := \frac{1}{\lambda n} A \boldsymbol{\alpha}$, and the duality gap is defined as the difference between the primal objective (26) evaluated at $\mathbf{w}(\boldsymbol{\alpha})$ and the dual objective (27) evaluated at $\boldsymbol{\alpha}$.
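As an illustration of the dual (27), the sketch below runs exact coordinate ascent on a toy dataset (four hypothetical points in $\mathbb{R}^2$) and tracks the duality gap; the coordinate update is the closed-form maximizer of the one-dimensional dual subproblem, clipped to $[0,1]$. This is an illustrative sketch, not the paper's greedy method, which differs only in how the coordinate is selected.

```python
# Toy SVM data: (a_i, y_i) pairs, hypothetical values for illustration.
lam = 0.5
pts = [([2.0, 0.5], 1), ([1.5, 1.0], 1), ([-1.0, -0.5], -1), ([-2.0, 0.2], -1)]
n = len(pts)
q = [[y * a[0], y * a[1]] for a, y in pts]  # columns y_i * a_i of A

def w_of(alpha):
    # primal mapping w(alpha) = (1 / (lam * n)) * A alpha
    v = [sum(alpha[i] * q[i][d] for i in range(n)) for d in range(2)]
    return [vd / (lam * n) for vd in v]

def dual(alpha):
    # objective of (27)
    v = [sum(alpha[i] * q[i][d] for i in range(n)) for d in range(2)]
    return sum(alpha) / n - sum(vd * vd for vd in v) / (2 * lam * n * n)

def primal(w):
    # objective of (26)
    hinge = sum(max(0.0, 1.0 - sum(qi[d] * w[d] for d in range(2))) for qi in q)
    return 0.5 * lam * sum(wd * wd for wd in w) + hinge / n

alpha = [0.0] * n
for t in range(50):
    i = t % n  # cyclic selection, for illustration only
    w = w_of(alpha)
    margin = sum(q[i][d] * w[d] for d in range(2))
    # exact maximizer of the dual along coordinate i, clipped to the box
    step = lam * n * (1.0 - margin) / sum(c * c for c in q[i])
    alpha[i] = min(1.0, max(0.0, alpha[i] + step))

gap = primal(w_of(alpha)) - dual(alpha)
assert gap >= -1e-9   # weak duality: the gap is nonnegative
assert gap < 0.2      # and it shrinks after a few passes (gap starts at 1.0)
```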