Efficient Greedy Coordinate Descent for Composite Problems

10/16/2018 · by Sai Praneeth Karimireddy, et al. · EPFL

Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, requiring up to n times fewer iterations. In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes L1-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy. Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing us to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.


1 Introduction

In recent years, there has been increased interest in coordinate descent (CD) methods due to their simplicity, low cost per iteration, and efficiency (Wright, 2015). Algorithms based on coordinate descent are the state of the art for many optimization problems (Nesterov, 2012; Shalev-Shwartz and Zhang, 2013b; Lin et al., 2014; Shalev-Shwartz and Zhang, 2013a, 2016; Richtarik and Takac, 2016; Fercoq and Richtárik, 2015; Nesterov and Stich, 2017). Most CD methods draw their coordinates from a fixed distribution, for instance from the uniform distribution as in uniform coordinate descent (UCD). However, significant improvements can be achieved by choosing more important coordinates more frequently (Nesterov, 2012; Nesterov and Stich, 2017; Stich et al., 2017a, b; Perekrestenko et al., 2017). In particular, we could greedily choose the ‘best’ coordinate at each iteration, i.e., steepest coordinate descent (SCD).

SCD for composite problems.

Consider a smooth quadratic function f. There are three natural notions of the ‘best’ coordinate (following standard notation, cf. (Nutini et al., 2015), we call them the Gauss-Southwell (GS) rules). One could choose (i) GS-s: the steepest coordinate direction based on (sub)gradients, (ii) GS-r: the coordinate which allows us to take the largest step, and (iii) GS-q: the coordinate that allows us to minimize the function value the most. For our example (and in general for smooth functions), the three rules are equivalent. When we add an additional non-smooth function to f, such as the ℓ1-penalty λ‖x‖₁, however, the three notions are no longer equivalent. The performance of greedy coordinate descent in this composite setting is not well understood, and is the focus of this work.
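To make the difference between the three rules concrete, the following small numeric sketch (our own illustration, not taken from the paper) evaluates them on a toy ℓ1-regularized quadratic, using the rule definitions recalled in Section 3; in this instance GS-s selects a different coordinate than GS-r and GS-q.

import numpy as np

# Toy composite problem F(x) = 0.5*||x - c||^2 + lam*||x||_1.
# The quadratic part is coordinate-wise 1-smooth, so L = 1.
lam, L = 1.0, 1.0
c = np.array([-0.89, 2.0, 4.5])
x = np.array([0.01, 0.0, 5.0])
grad = x - c                                   # gradient of the smooth part

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# GS-s: magnitude of the minimal-norm coordinate-wise subgradient.
gs_s = np.where(x != 0.0,
                np.abs(grad + lam * np.sign(x)),
                np.maximum(np.abs(grad) - lam, 0.0))

# Proximal coordinate step delta_i and the value of the 1-D model it attains.
delta = soft_threshold(x - grad / L, lam / L) - x
model = grad * delta + 0.5 * L * delta ** 2 + lam * (np.abs(x + delta) - np.abs(x))

gs_r = np.abs(delta)     # GS-r: coordinate allowing the largest step
gs_q = -model            # GS-q: coordinate with the largest model decrease

print(np.argmax(gs_s))   # 0  (GS-s picks a different coordinate here)
print(np.argmax(gs_r))   # 2
print(np.argmax(gs_q))   # 2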

Iteration complexity of SCD.

If the objective decomposes into identical separable problems, then clearly SCD is identical to UCD. For smooth functions, Nutini et al. (2015) give a refined analysis of SCD and show that it outperforms UCD in all but such extreme cases. This led to a renewed interest in greedy methods (e.g. (Karimi et al., 2016; You et al., 2016; Dünner et al., 2017; Song et al., 2017; Nutini et al., 2017; Stich et al., 2017a; Locatello et al., 2018; Lu et al., 2018)). However, for the composite case, the analysis in (Nutini et al., 2015) of SCD with any of the three rules mentioned earlier falls back to that of UCD. Thus it fails to demonstrate the advantage of greedy methods in the composite case. In fact, it is claimed that the rate of the GS-s greedy rule may even be worse than that of UCD. In this work we provide a refined analysis of SCD for a certain class of composite problems, and show that all three strategies (GS-s, GS-r, and GS-q) converge on composite problems at a rate similar to SCD in the smooth case. Thus, for these problems too, greedy coordinate algorithms are provably faster than UCD other than in extreme cases.

Efficiency of SCD.

A naïve implementation of SCD would require computing the full gradient, at a cost roughly n times higher than computing just one coordinate of the gradient as required by UCD. This seems to negate any potential gain of SCD over UCD. The working principle behind approximate SCD methods is to trade off the exactness of the greedy direction against the time spent to determine the steepest direction (e.g. (Stich et al., 2017a)). For smooth problems, Dhillon et al. (2011) show that approximate nearest neighbor search algorithms can be used to provide an approximate steepest descent direction in sublinear time. We build upon these ideas and extend the framework to non-smooth composite problems, thereby capturing a significantly larger class of input problems. In particular, we show how to efficiently map the GS-s rule to an instance of maximum inner product search (MIPS).

Contributions.

We analyze and advocate the use of the GS-s greedy rule to compute the update direction for composite problems. Our main contributions are:


  • We show that on a class of composite problems, greedy coordinate methods achieve convergence rates which are very similar to those obtained for smooth functions, thereby extending the applicability of SCD. This class of problems covers several important applications such as SVMs (in the dual formulation), Lasso regression, and ℓ1-regularized logistic regression, among others. With this we establish that greedy methods significantly outperform UCD also on composite problems, except in extreme cases (cf. Remark 4).

  • We show that both the GS-s and the GS-r rules achieve convergence rates which are (other than in extreme cases) faster than UCD. This sidesteps the negative results of Nutini et al. (2015) for these methods through a more fine-grained analysis. We also study the effect of approximate greedy directions on the convergence.

  • Algorithmically, we show how to precisely map the GS-s direction computation to a special instance of a maximum inner product search problem (MIPS). Many standard nearest neighbor algorithms, such as Locality Sensitive Hashing (LSH), can therefore be used to efficiently run SCD on composite optimization problems.

  • We perform extensive numerical experiments to study the advantages and limitations of our steepest descent combined with a current state-of-the-art MIPS implementation (Boytsov and Naidan, 2013).

Related Literature.

Coordinate descent, being one of the earliest known optimization methods, has a rich history (e.g. (Bickley, 1941; Warga, 1963; Bertsekas and Tsitsiklis, 1989, 1991)). A significant renewal of interest followed the work of Nesterov (2012), who provided a simple analysis of the convergence of UCD. In practice, many solvers (e.g. (Ndiaye et al., 2015; Massias et al., 2018)) combine UCD with active set heuristics, where attention is restricted to a subset of active coordinates. These methods are orthogonal to, and hence can be combined with, the greedy rules studied here. Greedy coordinate methods can also be viewed as an ‘extreme’ version of adaptive importance sampling (Stich et al., 2017a; Perekrestenko et al., 2017). However, unlike greedy methods, even in the smooth case there are no easily characterized function classes for which the adaptive sampling schemes or the active set methods are provably faster than UCD. The work closest to ours, other than the already discussed Nutini et al. (2015), is that of Dhillon et al. (2011). The latter show a sublinear convergence rate for GS-r on composite problems. They also propose a practical variant for ℓ1-regularized problems which essentially ignores the regularizer and is hence not guaranteed to converge.

2 Setup

We consider composite optimization problems of the structure

(1)    $\min_{x \in \mathbb{R}^n} F(x) := f(x) + \sum_{i=1}^n g_i(x_i),$

where n is the number of coordinates, f: ℝⁿ → ℝ is convex and smooth, and the gᵢ: ℝ → ℝ, i ∈ [n], are convex and possibly non-smooth. In this exposition, we further restrict the functions gᵢ to either enforce a box constraint or an ℓ1-regularizer. This comprises many important problem classes, for instance the dual SVM or Lasso regression, see Appendix A.3.

We further assume that the smooth component f is coordinate-wise L-smooth: for any x ∈ ℝⁿ, γ ∈ ℝ, and i ∈ [n],

(2)    $f(x + \gamma e_i) \leq f(x) + \gamma \nabla_i f(x) + \tfrac{L}{2}\gamma^2.$

Sometimes we will assume that f is in addition also strongly convex with respect to the ℓ1 norm, with constant μ₁ > 0, that is,

(3)    $f(y) \geq f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\mu_1}{2}\|y - x\|_1^2$

for any x and y in the domain of f. In general it holds that μ/n ≤ μ₁ ≤ μ, where μ denotes the strong convexity constant with respect to the ℓ2 norm. See Nutini et al. (2015) for a detailed comparison of the two constants.

3 SCD for Non-Smooth Problems

Here we briefly recall the definitions of the GS-s, GS-r and GS-q coordinate selection rules and introduce the approximate GS-s rule that we will consider in detail.

(4)    GS-s: $i \in \arg\max_{j \in [n]} \ \min_{s \in \partial g_j(x_j)} |\nabla_j f(x) + s|$
(5)    GS-r: $i \in \arg\max_{j \in [n]} \ |\delta_j|$
(6)    GS-q: $i \in \arg\min_{j \in [n]} \ \chi_j(x)$

for an iterate x ∈ ℝⁿ and standard unit vectors eⱼ, j ∈ [n]. Here χⱼ(x) and δⱼ are defined as the minimum value and minimizer, respectively, of the one-dimensional model

$\min_{\delta \in \mathbb{R}} \ \delta\,\nabla_j f(x) + \tfrac{L}{2}\delta^2 + g_j(x_j + \delta) - g_j(x_j).$

We relax the requirement for an exact steepest selection, and define an approximate GS-s rule.

Definition 1 (Θ-approximate GS-s).

For a given Θ ∈ (0, 1], the coordinate i is considered to be a Θ-approximate steepest direction for F at x if its value of the GS-s criterion in (4) is at least Θ times the maximum of that criterion over all coordinates.

3.1 SCD for ℓ1-regularized problems

We now discuss the GS-s rule for the concrete example of ℓ1-regularized problems, and collect some observations that we will use later to define the mapping to the subset MIPS instance. A similar discussion is included for box-constrained problems in Appendix B.

Consider ℓ1-regularized problems of the form

(7)    $\min_{x \in \mathbb{R}^n} F(x) := f(x) + \lambda \|x\|_1.$

The GS-s rule (4) and the corresponding update rules can be simplified for such functions. Let sign(·) denote the sign function, and define $S_\lambda$ as the shrinkage operator

$S_\lambda(s) := \operatorname{sign}(s)\,\max(|s| - \lambda,\, 0).$

Further, for any x, let us define $s_i(x)$ as

(8)    $s_i(x) := \begin{cases} \nabla_i f(x) + \lambda\operatorname{sign}(x_i), & x_i \neq 0,\\ S_\lambda(\nabla_i f(x)), & x_i = 0.\end{cases}$

Lemma 1.

For any x, the GS-s rule (4) is equivalent to

(9)    $i \in \arg\max_{j \in [n]} |s_j(x)|.$

Our analysis of the GS-s rule requires bounding the number of ‘bad’ steps (to be detailed in Section 4). For this, we will slightly modify the update of the coordinate descent method. Note that we still always follow the GS-s direction, but will sometimes not perform the standard proximal coordinate update along this direction. To update the i-th coordinate, we either rely on the standard proximal step on the coordinate,

(10)    $\eta_i := \arg\min_{\eta \in \mathbb{R}} \ \eta\,\nabla_i f(x^t) + \tfrac{L}{2}\eta^2 + \lambda|x_i^t + \eta|,$

or we perform line-search,

(11)    $\eta_i := \arg\min_{\eta \in \mathbb{R}} \ F(x^t + \eta e_i).$

Finally, the i-th coordinate is updated as

(12)    $x_i^{t+1} := \begin{cases} 0, & \text{if } x_i^t\,(x_i^t + \eta_i) < 0,\\ x_i^t + \eta_i, & \text{otherwise.}\end{cases}$

Our modification or ‘post-processing’ step (12) ensures that the coordinate can never ‘cross’ the origin. This small change will later on help us bound the precise number of steps needed in our convergence rates (Sec. 4). The details are summarized in Algorithm 1.

1:  Initialize: x^0 := 0.
2:  for t = 0, 1, 2, … do
3:     Select coordinate i as in GS-s, GS-r, or GS-q.
4:     Find η via the proximal step (10) or line-search (11).
5:     Compute x^{t+1} as in (12).
6:  end for
Algorithm 1 Steepest Coordinate Descent
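For illustration, here is a minimal Python sketch of Algorithm 1 with the exact GS-s rule (9), the proximal step (10), and the post-processing (12), instantiated for the Lasso objective 0.5*||A x - b||^2 + lam*||x||_1; the helper names and the choice of a quadratic smooth part are ours, not the paper's implementation.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def gs_s_scores(grad, x, lam):
    # GS-s scores (9): minimal-norm coordinate-wise subgradient of F.
    return np.where(x != 0.0,
                    np.abs(grad + lam * np.sign(x)),
                    np.maximum(np.abs(grad) - lam, 0.0))

def scd_l1(A, b, lam, num_steps):
    # Steepest coordinate descent (Algorithm 1, exact GS-s search) for
    # F(x) = 0.5*||A x - b||^2 + lam*||x||_1.
    n = A.shape[1]
    L = np.max(np.sum(A * A, axis=0))          # coordinate-wise smoothness constant
    x = np.zeros(n)                            # step 1: x^0 = 0
    residual = A @ x - b
    for _ in range(num_steps):
        grad = A.T @ residual                  # gradient of the smooth part
        i = int(np.argmax(gs_s_scores(grad, x, lam)))               # step 3: GS-s direction
        eta = soft_threshold(x[i] - grad[i] / L, lam / L) - x[i]    # step 4: proximal step (10)
        new_xi = 0.0 if x[i] * (x[i] + eta) < 0 else x[i] + eta     # step 5: post-processing (12)
        residual += A[:, i] * (new_xi - x[i])
        x[i] = new_xi
    return x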

4 Convergence Rates

In this section, we present our main convergence results. We illustrate the novelty of our results in the important ℓ1-regularized case: for strongly convex f, we provide the first linear rates of convergence independent of n for greedy coordinate methods on ℓ1-regularized problems, matching the rates in the smooth case. In particular, for GS-s this was conjectured to be impossible (Nutini et al., 2015, Section H.5, H.6) (see Remark 5). We also show the sublinear convergence of the three rules in the non-strongly convex setting. Similar rates also hold for box-constrained problems.

4.1 Linear convergence for strongly convex f

Theorem 1.

Consider an ℓ1-regularized optimization problem (7), with f being coordinate-wise L-smooth and μ₁-strongly convex with respect to the ℓ1 norm. After t steps of Algorithm 1, where the coordinate iₜ is chosen using either the GS-s, GS-r, or GS-q rule, the suboptimality F(xᵗ) − F* decreases at a linear rate governed by μ₁/L, independent of n.

Remark 2.

The linear convergence rate of Theorem 1 also holds for the Θ-approximate GS-s rule as in Definition 1. In this case the rate constant is additionally multiplied by a factor depending on the approximation quality Θ.

Remark 3.

All our linear convergence rates can be extended to objective functions which only satisfy the weaker condition of proximal-PL strong convexity (Karimi et al., 2016).

Remark 4.

The standard analysis (e.g. in Nesterov (2012)) of UCD gives a convergence rate of

$\mathbb{E}\,[F(x^t)] - F^\star \leq \Bigl(1 - \frac{\mu}{nL}\Bigr)^{t} \bigl(F(x^0) - F^\star\bigr).$

Here μ is the strong convexity constant with respect to the ℓ2 norm, which satisfies μ/n ≤ μ₁ ≤ μ. The left boundary, μ₁ = μ/n, marks the worst case for SCD, resulting in convergence slower than UCD. It is shown in Nutini et al. (2015) that this occurs only in extreme examples (e.g. when f consists of n identical separable functions). For all other situations, when μ₁ > μ/n, our result shows that SCD is faster.

Remark 5.

Our analysis in terms of μ₁ holds for all three selection rules GS-s, GS-r, and GS-q. In (Nutini et al., 2015, Section H.5, H.6) it was conjectured (but not proven) that this linear convergence rate holds for GS-q, but that it cannot hold for GS-s or GS-r. Example functions were constructed there for which the single-step progress of GS-s or GS-r is much smaller than what such a rate would require. However, these problematic steps are all bad steps, as we define in the following proof sketch, and we show that their number can be bounded.

We state an analogous linear rate for the box-constrained case too, but refer to Appendix B for the detailed algorithm and proof.

Theorem 2.

Suppose that f is coordinate-wise L-smooth and μ₁-strongly convex with respect to the ℓ1 norm, for problem (1) with the gᵢ encoding a box constraint. After t steps of Algorithm 2 (the box analogue of Algorithm 1), where the coordinate iₜ is chosen using the GS-s, GS-r, or GS-q rule, the suboptimality again decreases at a linear rate independent of n; the precise constants involve an additional term discussed in Remark 6.

While the proof shares ideas with the ℓ1-case, there are significant differences, e.g. the division of the steps into three categories: i) good steps, which make sufficient progress, ii) bad steps, which may not make much progress but are bounded in number, and iii) cross steps, which also make sufficient progress.

Remark 6.

For the box case, the greedy methods converge faster than UCD if μ₁ > μ/n, as before, and if a second quantity appearing in the rate is small. Typically this quantity is much smaller than 1, and so the second condition is almost always satisfied. Hence we can expect greedy methods to be much faster in the box case, just as in the unconstrained smooth case. It remains unclear whether this additional term truly affects the rate of convergence. For example, in the separable quadratic case considered in (Nutini et al., 2015, Sec. 4.1), the term vanishes and we can ignore it in the rate (see Remark 16 in the Appendix).

Proof sketch.

While the full proofs are given in the appendix, we here give a proof sketch of the convergence of Algorithm 1 for ℓ1-regularized problems in the strongly convex case, as in Theorem 1.

The key idea of our technique is to partition the iterates into two sets, good and bad steps, depending on whether they make (provably) sufficient progress. Then we show that the modification to the update we made in (12) ensures that we do not have too many bad steps. Since Algorithm 1 is a descent method, we can then focus only on the good steps to describe its convergence. The ‘contradictions’ to the convergence of GS-s provided in (Nutini et al., 2015, Section H.5, H.6) are in fact instances of bad steps.

Figure 1: The arrows represent proximal coordinate updates from different starting points. Updates which ‘cross’ or ‘end at’ the origin are bad, whereas the rest are good.

The definitions of good and bad steps are explained in Fig. 1 (and formally in Def. 11). The core technical lemma below shows that in a good step, the update along the GS-s direction has an alternative characterization. For the sake of simplicity, let us assume that we use the proximal update (10) and the exact GS-s coordinate.

Lemma 2.

Suppose that iteration t of Algorithm 1 updates coordinate i and that it was a good step. Then the update admits an alternative characterization as the minimization of a smooth quadratic upper bound with an ℓ1-squared regularizer.

Proof sketch. For the sake of simplicity we only examine one sign case here. Combining this with the assumption that iteration t was a good step fixes the signs of the current coordinate and of the update, and the GS-s rule simplifies accordingly.

Since f is coordinate-wise smooth (2), the progress of the coordinate update is bounded by the one-dimensional quadratic model. But the GS-s rule exactly maximizes the corresponding model term, so we can rewrite the bound in terms of the GS-s score. Using the sign information from the good-step assumption, this maximum can in turn be written as the minimization of a smooth upper bound with an ℓ1-squared regularizer. Plugging this into the previous bound gives us the lemma. See Lemma 8 for the full proof. ∎

If λ = 0 (i.e. F is smooth), Lemma 2 reduces to the ‘refined analysis’ of Nutini et al. (2015). We can now state the rate obtained in the strongly convex case.

Proof sketch for Theorem 1. Notice that if x_i^t = 0, the step is necessarily good by definition (see Fig. 1). Since we start at the origin x^0 = 0, the first time each coordinate is picked is a good step. Further, if some step is bad, this implies that the proximal update ‘crosses’ the origin. In this case our modified update rule (12) sets the coordinate to 0, so the next time this coordinate is picked the step is sure to be good. Thus every bad step can be charged to a subsequent good step, and at least a constant fraction of the t steps are good.

As per Lemma 2, every good step corresponds to optimizing the smooth upper bound with the ℓ1-squared regularizer. We can finish the proof by chaining the per-step progress bounds:

The first inequality follows from Karimireddy et al. (2018, Lemma 9), and the second from the ℓ1 strong convexity of f. Rearranging the terms above gives us the required linear rate of convergence. ∎

4.2 Sublinear convergence for general convex f

A sublinear convergence rate independent of n can be obtained for SCD when f is not strongly convex.

Theorem 3.

Suppose that f is coordinate-wise L-smooth and convex, and that g is an ℓ1-regularizer or a box constraint. Also let X* be the set of minima of F, with minimum value F*. After t steps of Algorithm 1 or Algorithm 2 respectively, where the coordinate iₜ is chosen using the GS-s, GS-r, or GS-q rule, the suboptimality F(xᵗ) − F* decreases at a sublinear rate of order L D²/t,

where D is the ℓ1-diameter of the level set of x⁰, measured as the largest ℓ1-distance from a point in that level set to the set of minima X*.

While a similar convergence rate was known for the GS-r rule (Dhillon et al., 2011), we here establish it for all three rules—even for the approximate GS-s.

5 Maximum Inner Product Search

We now shift the focus from the theoretical rates to the actual implementation. A very important observation, as pointed out by Dhillon et al. (2011), is that finding the steepest descent direction is closely related to a geometric problem. As an example, consider a function of the form f(x) = h(Ax) for a data matrix A with columns aᵢ and a smooth function h. The gradient takes the form ∇ᵢ f(x) = ⟨aᵢ, q⟩ for the query vector q := ∇h(Ax), and thus finding the steepest coordinate direction amounts to finding the datapoint aᵢ with the largest (in absolute value) inner product with q, which a priori requires the evaluation of all n scalar products. However, when we have to perform multiple similar queries (such as over the iterations of SCD), it is possible to pre-process the dataset to speed up the query time. Note that we do not require the columns aᵢ to be normalized.

For the more general set of problems we consider here, we need the following slightly stronger primitive.

Definition 7 (Subset MIPS).

Given a set of n d-dimensional points P = {p₁, …, pₙ} ⊂ ℝᵈ, the Subset Maximum Inner Product Search (subset MIPS) problem is to pre-process the set P such that for any query vector q ∈ ℝᵈ and any subset S ⊆ P of the points, the best point p* ∈ S, i.e.

$p^\star \in \arg\max_{p \in S} \ \langle q, p\rangle,$

can be computed with a sublinear number of scalar product evaluations.

State-of-the-art algorithms relax the exactness assumption and compute an approximate solution in time equivalent to a sublinear number of scalar product evaluations (e.g. (Charikar, 2002; Lv et al., 2007; Shrivastava and Li, 2014; Neyshabur and Srebro, 2015; Andoni et al., 2015)). We consciously refrain from stating more precise running times, as these will depend on the actual choice of the algorithm and the parameters chosen by the user. Our approach in this paper is transparent to the actual choice of algorithm; we only show how SCD steps can be exactly cast as such instances. By employing an arbitrary subset MIPS solver, one thus gets a sublinear time approximate SCD update. An important caveat is that in subsequent queries, we will adaptively change the subset based on the solution to the previous query. Hence the known theoretical guarantees shown for LSH do not directly apply, though the practical performance does not seem to be affected by this (see Appendix Fig. 13, 17). Practical details of efficiently solving subset MIPS are provided in Section 7.
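For concreteness, the following is a minimal exact (linear-scan) baseline for the subset MIPS primitive; it is our own illustrative helper, not the paper's implementation, and a sublinear index such as LSH or nmslib's hnsw would replace the scan.

import numpy as np

class ExactSubsetMIPS:
    # Exact (linear-scan) answer to a subset MIPS query:
    # argmax over an active subset S of <q, p_i>.
    def __init__(self, points):
        self.points = np.asarray(points)       # shape (n, d)

    def query(self, q, subset):
        subset = np.asarray(list(subset), dtype=int)
        scores = self.points[subset] @ q       # one scalar product per candidate
        k = int(np.argmax(scores))
        return int(subset[k]), float(scores[k])

# Usage: the active subset may change adaptively between queries.
rng = np.random.default_rng(0)
index = ExactSubsetMIPS(rng.standard_normal((1000, 20)))
best_id, best_score = index.query(rng.standard_normal(20), subset=range(0, 1000, 2))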

6 Mapping GS-s to Subset MIPS

We now move to our next contribution and show how the GS-s rule can be efficiently implemented. We aim to cast the problem of computing the GS-s update as an instance of subset MIPS, for which very fast query algorithms exist. In contrast, the GS-r and GS-q rules do not allow such a mapping. In this section, we will only consider objective functions of the following special structure:

(13)    $F(x) := h(Ax) + \sum_{i=1}^n g_i(x_i),$

for a data matrix A with columns a₁, …, aₙ ∈ ℝᵈ and a smooth function h. The usual problems such as Lasso, dual SVM, or logistic regression have such a structure (see Appendix A.3).

Difficulty of the Greedy Rules.

This section will serve to strongly motivate our choice of the GS-s rule over the GS-r or GS-q. Let us pause to examine the three greedy selection rules and compare their relative difficulty. As a warm-up, consider again the smooth function f(x) = h(Ax) for a data matrix A, as introduced above in Section 5. We have observed that the steepest coordinate direction is equal to

(14)    $i \in \arg\max_{j \in [n]} |\nabla_j f(x)| = \arg\max_{j \in [n]} |\langle a_j, \nabla h(Ax)\rangle|.$

The formulation on the right is an instance of MIPS over the vectors {±aⱼ}. Now consider a non-smooth problem of the form F(x) = f(x) + λ‖x‖₁. For simplicity, let us assume that gᵢ(xᵢ) = λ|xᵢ| and xᵢ > 0 for all i. In this case, the subgradient of the i-th coordinate is ∇ᵢ f(x) + λ and the GS-s rule is

(15)    $i \in \arg\max_{j \in [n]} \ |\langle a_j, \nabla h(Ax)\rangle + \lambda|.$

The rule (15) is clearly not much harder than (14), and can be cast as a MIPS problem with minor modifications (see details in Sec. 6.1).
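To see how a rule of the form (15) becomes a plain inner product search, here is a hedged sketch of one possible augmentation (our own construction under the simplifying assumption x > 0; the paper's general mapping in (18)-(19) additionally handles signs and zero coordinates): append one extra dimension carrying the λ-term to each point and to the query.

import numpy as np

def build_augmented_points(A, c=1.0):
    # For each column a_j of A, create +/-[a_j; c]. The winner of a MIPS
    # query then encodes both the coordinate j and the sign achieving the
    # absolute value in (15); c only rescales the extra entry.
    d, n = A.shape
    pos = np.vstack([A, c * np.ones((1, n))])       # columns [a_j; c]
    pts = np.hstack([pos, -pos])                    # 2n augmented points
    coord = np.concatenate([np.arange(n), np.arange(n)])
    return pts.T, coord                             # shape (2n, d+1)

def gs_s_via_mips(A, grad_h, lam, c=1.0):
    # GS-s coordinate for F(x) = h(Ax) + lam*||x||_1, assuming x > 0,
    # computed as a maximum inner product search over augmented points.
    pts, coord = build_augmented_points(A, c)
    query = np.concatenate([grad_h, [lam / c]])     # [grad h(Ax); lam/c]
    scores = pts @ query
    return int(coord[np.argmax(scores)])

# Sanity check against the direct rule (15).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
g = rng.standard_normal(30)
lam = 0.5
direct = int(np.argmax(np.abs(A.T @ g + lam)))
assert gs_s_via_mips(A, g, lam) == direct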

Let ηⱼ denote the proximal coordinate update along the j-th coordinate, as given by the proximal step (10). The GS-r rule can now be ‘simplified’ as

(16)    $i \in \arg\max_{j \in [n]} |\eta_j|.$

It does not seem easy to cast (16) as a MIPS instance, since each ηⱼ involves a non-linear (soft-thresholding) function of both the inner product ⟨aⱼ, ∇h(Ax)⟩ and the current iterate xⱼ. It is even less likely that the GS-q rule, which additionally involves the minimum value of the one-dimensional model, can be mapped to a MIPS instance. This highlights the simplicity and usefulness of the GS-s rule.

6.1 Mapping ℓ1-Regularized Problems

Here we focus on problems of the form (13) where gᵢ(xᵢ) = λ|xᵢ|. Again, we have ∇ᵢ f(x) = ⟨aᵢ, q⟩, where q := ∇h(Ax).

For simplicity, suppose first that xᵢ ≠ 0 for all coordinates. Then the GS-s rule in (9) is

(17)    $i \in \arg\max_{j \in [n]} \ |\langle a_j, q\rangle + \lambda\operatorname{sign}(x_j)|.$

We want to map a problem of the above form to a subset MIPS instance. Define, for some constant c > 0, augmented point vectors as in (18), obtained by appending an entry of magnitude c to the (signed) columns ±aⱼ, and form an augmented query vector from q and λ/c as in (19).

A simple computation shows that the problem in (17) is equivalent to a maximum inner product search of the query (19) over an appropriate subset of the augmented points (18).

Thus, by searching over a subset of the vectors in (18), we can compute the GS-s direction. Dealing with the case where xᵢ = 0 goes through similar arguments, and the details are outlined in Appendix E. Here we only state the resulting mapping.

The constant c in (18) and (19) is chosen to ensure that the appended entry is of the same order of magnitude, on average, as the rest of the coordinates. The need for c only arises out of performance concerns about the underlying algorithm used to solve the subset MIPS instance. For example, c has no effect if we use exact search.

Formally, we define a fixed set of augmented points built from the columns of A as in (18). Then, at any iteration t with current iterate x^t, we also define in (20) the active subset of these points, determined by the signs of the coordinates of x^t.

Lemma 3.

At any iteration t, with the active subset as defined in (20) and the query vector as in (19), the GS-s rule (9) is equivalent to a subset MIPS query of (19) over (20).

The active subsets at iterations t and t+1 differ in at most four points, since x^t and x^{t+1} differ only in a single coordinate. This makes it computationally very efficient to incrementally maintain the active subset for ℓ1-regularized problems.

6.2 Mapping Box-Constrained Problems

Using similar ideas, we demonstrate how to efficiently map problems of the form (13) where enforces box constraints, such as for the dual SVM. The detailed approach is provided in Appendix B.1.

7 Experimental Results

Our experiments focus on the standard tasks of Lasso regression, as well as SVM training (on the dual objective). We refer the reader to Appendix A.3 for definitions. Lasso regression is performed on the rcv1 dataset, while SVM is performed on the w1a and ijcnn1 datasets. All columns of the dataset (features for Lasso, datapoints for SVM) are normalized to unit length, allowing us to use the standard cosine-similarity algorithms of nmslib (Boytsov and Naidan, 2013) to efficiently solve the subset MIPS instances. Note, however, that our framework is applicable without any normalization if a general MIPS solver is used instead.

We use the hnsw algorithm of the nmslib library, with one hyper-parameter left at its default value and the others as in Table 1, selected by grid-search. (A short overview of how to set these hyper-parameters can be found at https://github.com/nmslib/nmslib/blob/master/python_bindings/parameters.md. More details, such as the meaning of these parameters, can be found in the nmslib manual (Naidan and Boytsov, 2015, pp. 61).) We exclude the time required for pre-processing of the datasets since it is amortized over the multiple experiments run on the same dataset (say, for hyper-parameter tuning). All our experiments are run on an Intel Xeon CPU E5-2680 v3 (2.50GHz, 30 MB cache) with 48 cores and 256GB RAM.

Dataset (task)    n        d        solution sparsity   efS    post
rcv1 (Lasso)      47,236   15,564   19%                 100    2
rcv1 (Lasso)      47,236   15,564   3%                  400    2
w1a (SVM)         2,477    300      -                   100    0
ijcnn1 (SVM)      49,990   22       -                   50     0
Table 1: Datasets and hyper-parameters: Lasso is run on rcv1 (two regularization settings, distinguished by the sparsity of the solution), and SVM on w1a and ijcnn1. (n, d) is the dataset size; efS and post are the nmslib query-time hyper-parameters selected by grid-search; the constant c from (18), (19) and the remaining nmslib hyper-parameter are kept fixed (the latter at its default).
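For reference, here is a hedged sketch of how such an hnsw index can be built and queried through nmslib's Python bindings; the parameter names follow nmslib, while the concrete values below are placeholders rather than the grid-searched ones of Table 1.

import numpy as np
import nmslib

# Unit-normalize the points so that ranking by cosine similarity
# coincides with ranking by inner product (as in our experiments).
rng = np.random.default_rng(0)
points = rng.standard_normal((10000, 64)).astype(np.float32)
points /= np.linalg.norm(points, axis=1, keepdims=True)

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(points)
index.createIndex({'post': 2}, print_progress=False)   # index-time parameters
index.setQueryTimeParams({'efSearch': 100})            # query-time parameter (efS)

query = rng.standard_normal(64).astype(np.float32)
ids, dists = index.knnQuery(query, k=1)                # approximate top-1 neighbor
print(int(ids[0]))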

First we compare the practical algorithm (dhillon) of Dhillon et al. (2011), which disregards the regularization part when choosing the next coordinate, against Algorithm 1 with the GS-s rule (steepest) for Lasso regression. Note that dhillon is not guaranteed to converge. To compare the selection rules without bias from the choice of the library, we perform exact search to answer the queries. As seen from Fig. 2, steepest significantly outperforms dhillon. In fact, dhillon stagnates (though it does not diverge) once the error becomes small and the regularization term starts playing a significant role. Increasing the regularization further worsens its performance. This is understandable since the rule used by dhillon ignores the regularizer.

Figure 2: Evaluating dhillon: steepest, which is based on the GS-s rule, outperforms dhillon, which quickly stagnates. With increased regularization, dhillon stagnates in even fewer iterations.

Next we compare our steepest strategy (Algorithms 1 and 2 using the GS-s rule), and the corresponding nearest-neighbor-based approximate versions (steepest-nn), against uniform, which picks coordinates uniformly at random. In all these experiments the regularization parameters are kept fixed, for Lasso as well as for SVM. Fig. 3 shows the clearly superior performance, in terms of iterations, of the steepest strategy as well as of steepest-nn over uniform, for both the Lasso and the SVM problems. However, towards the end of the optimization, i.e. in high accuracy regimes, steepest-nn fails to find directions substantially better than uniform. This is because towards the end all components of the gradient become small, meaning that the query vector is nearly orthogonal to all points, a setting in which the employed nearest neighbor library nmslib performs poorly (Boytsov and Naidan, 2013).

Figure 3: steepest as well as steepest-nn significantly outperform uniform in the number of iterations.

Fig. 4 compares the wall-time performance of the steepest, steepest-nn and uniform strategies. This includes all the overhead of finding the descent direction. In all instances, the steepest-nn algorithm is competitive with uniform at the start, compensating for the increased time per iteration by increased progress per iteration. However, towards the end, steepest-nn makes progress per iteration comparable to uniform at a significantly larger cost, making its overall performance worse. With increasing sparsity of the solution (see Table 1 for sparsity levels), the exact steepest rule starts to outperform both uniform and steepest-nn.

Wall-time experiments (Fig. 4) show that steepest-nn gives a significant performance gain in the important early phase of optimization, but in the later phase loses out to uniform due to the query cost and the poor performance of nmslib on near-orthogonal queries. In practice, the recommended implementation is therefore to use the steepest-nn algorithm in the early optimization regime, and to switch to uniform once the iteration cost outweighs the gain. In the Appendix (Fig. 13) we further investigate the poor quality of the solutions provided by nmslib in this regime.

Figure 4: steepest-nn is very competitive and sometimes outperforms uniform even in terms of wall time, especially towards the beginning. Eventually, however, uniform performs better than steepest-nn, because the nmslib algorithm performs poorly once the norm of the gradient becomes small.

Repeating our experiments with other datasets, or using FALCONN (Andoni et al., 2015), another popular library for similarity search, yielded comparable results, provided in Appendix G.

8 Concluding Remarks

In this work we have proposed a Θ-approximate GS-s selection rule for coordinate descent, and showed its convergence for several important classes of problems for the first time, furthering our understanding of steepest descent on non-smooth problems. We have also described a new primitive, the Subset Maximum Inner Product Search (subset MIPS), and cast the GS-s selection rule as an instance of it. This enables the use of fast sublinear algorithms designed for this problem to efficiently compute an approximate GS-s direction.

We obtained strong empirical evidence for the superiority of the GS-s rule over randomly picking coordinates on several real world datasets, validating our theory. Further, we showed that for Lasso regression, our algorithm consistently outperforms the practical algorithm presented in (Dhillon et al., 2011). Finally, we performed extensive numerical experiments showcasing the strengths and weaknesses of current state-of-the-art libraries for computing an approximate GS-s direction. As the problem size n grows, the cost per iteration for nmslib remains comparable to that of UCD, while the progress made per iteration increases. This means that as problem sizes grow, GS-s implemented via subset MIPS becomes an increasingly attractive approach. Further, we also showed that when the norm of the gradient becomes small, current state-of-the-art methods struggle to find directions substantially better than uniform. Alleviating this, and leveraging some of the very active recent development of alternatives to LSH as subroutines for our method, is a promising direction for future work. In a different direction, since the GS-s rule, as opposed to GS-q or GS-r, uses only local subgradient information, it might be amenable to gradient approximation schemes typically used in zeroth-order algorithms (e.g. (Wibisono et al., 2012)).

Acknowledgements.

We thank Ludwig Schmidt for numerous discussions on using FALCONN, and for his useful advice on setting its hyperparameters. We also thank Vyacheslav Alipov for insights about nmslib, Hadi Daneshmand for algorithmic insights, and Mikkel Thorup for discussions on using hashing schemes in practice. The feedback from many anonymous reviewers has also helped significantly improve the presentation.

References

  • Andoni et al. (2015) Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., and Schmidt, L. (2015). Practical and Optimal LSH for Angular Distance. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1225–1233. Curran Associates, Inc.
  • Bertsekas and Tsitsiklis (1989) Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical methods, volume 23. Prentice hall Englewood Cliffs, NJ.
  • Bertsekas and Tsitsiklis (1991) Bertsekas, D. P. and Tsitsiklis, J. N. (1991). Some aspects of parallel and distributed iterative algorithms—a survey. Automatica, 27(1):3–21.
  • Bickley (1941) Bickley, W. (1941). Relaxation methods in engineering science: A treatise on approximate computation.
  • Boytsov and Naidan (2013) Boytsov, L. and Naidan, B. (2013). Engineering efficient and effective Non-Metric Space Library. In Brisaboa, N., Pedreira, O., and Zezula, P., editors, Similarity Search and Applications, volume 8199 of Lecture Notes in Computer Science, pages 280–293. Springer Berlin Heidelberg.
  • Charikar (2002) Charikar, M. S. (2002). Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 380–388, New York, NY, USA. ACM.
  • Dhillon et al. (2011) Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2011). Nearest Neighbor based Greedy Coordinate Descent. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 2160–2168. Curran Associates, Inc.
  • Dünner et al. (2017) Dünner, C., Parnell, T., and Jaggi, M. (2017). Efficient use of limited-memory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems, pages 4258–4267.
  • Fercoq and Richtárik (2015) Fercoq, O. and Richtárik, P. (2015). Accelerated, Parallel, and Proximal Coordinate Descent. SIAM Journal on Optimization, 25(4):1997–2023.
  • Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer.
  • Karimireddy et al. (2018) Karimireddy, S. P. R., Stich, S., and Jaggi, M. (2018). Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems. In International Conference on Artificial Intelligence and Statistics, pages 1204–1213.
  • Lin et al. (2014) Lin, Q., Lu, Z., and Xiao, L. (2014). An Accelerated Proximal Coordinate Gradient Method. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3059–3067. Curran Associates, Inc.
  • Locatello et al. (2018) Locatello, F., Raj, A., Karimireddy, S. P., Rätsch, G., Schölkopf, B., Stich, S., and Jaggi, M. (2018). On matching pursuit and coordinate descent. In International Conference on Machine Learning, pages 3204–3213.
  • Lu et al. (2018) Lu, H., Freund, R. M., and Mirrokni, V. (2018). Accelerating greedy coordinate descent methods. arXiv preprint arXiv:1806.02476.
  • Lv et al. (2007) Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. (2007). Multi-probe lsh: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international conference on Very large data bases, pages 950–961. VLDB Endowment.
  • Massias et al. (2018) Massias, M., Salmon, J., and Gramfort, A. (2018). Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, pages 3321–3330.
  • Naidan and Boytsov (2015) Naidan, B. and Boytsov, L. (2015). Non-metric space library manual. arXiv preprint arXiv:1508.05470.
  • Ndiaye et al. (2015) Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2015). Gap safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems, pages 811–819.
  • Nesterov (2012) Nesterov, Y. (2012). Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341–362.
  • Nesterov and Stich (2017) Nesterov, Y. and Stich, S. (2017). Efficiency of the Accelerated Coordinate Descent Method on Structured Optimization Problems. SIAM Journal on Optimization, 27(1):110–123.
  • Neyshabur and Srebro (2015) Neyshabur, B. and Srebro, N. (2015). On Symmetric and Asymmetric LSHs for Inner Product Search. In ICML 2015 - Proceedings of the 32th International Conference on Machine Learning, pages 1926–1934.
  • Nutini et al. (2017) Nutini, J., Laradji, I., and Schmidt, M. (2017). Let’s make block coordinate descent go fast: Faster greedy rules, message-passing, active-set complexity, and superlinear convergence. arXiv preprint arXiv:1712.08859.
  • Nutini et al. (2015) Nutini, J., Schmidt, M., Laradji, I. H., Friedlander, M., and Koepke, H. (2015). Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection. arXiv:1506.00552 [cs, math, stat].
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.
  • Perekrestenko et al. (2017) Perekrestenko, D., Cevher, V., and Jaggi, M. (2017). Faster Coordinate Descent via Adaptive Importance Sampling. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 869–877, Fort Lauderdale, FL, USA. PMLR.
  • Richtarik and Takac (2016) Richtarik, P. and Takac, M. (2016). Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484.
  • Shalev-Shwartz and Zhang (2013a) Shalev-Shwartz, S. and Zhang, T. (2013a). Accelerated Mini-batch Stochastic Dual Coordinate Ascent. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 378–385, USA. Curran Associates Inc.
  • Shalev-Shwartz and Zhang (2013b) Shalev-Shwartz, S. and Zhang, T. (2013b). Stochastic Dual Coordinate Ascent Methods for Regularized Loss. J. Mach. Learn. Res., 14(1):567–599.
  • Shalev-Shwartz and Zhang (2016) Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145.
  • Shrivastava and Li (2014) Shrivastava, A. and Li, P. (2014). Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). In NIPS 2014 - Advances in Neural Information Processing Systems 27, pages 2321–2329.
  • Song et al. (2017) Song, C., Cui, S., Jiang, Y., and Xia, S.-T. (2017). Accelerated stochastic greedy coordinate descent by soft thresholding projection onto simplex. In Advances in Neural Information Processing Systems, pages 4838–4847.
  • Stich et al. (2017a) Stich, S. U., Raj, A., and Jaggi, M. (2017a). Approximate Steepest Coordinate Descent. ICML - Proceedings of the 34th International Conference on Machine Learning.
  • Stich et al. (2017b) Stich, S. U., Raj, A., and Jaggi, M. (2017b). Safe Adaptive Importance Sampling. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4384–4394. Curran Associates, Inc.
  • Warga (1963) Warga, J. (1963). Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593.
  • Wibisono et al. (2012) Wibisono, A., Wainwright, M. J., Jordan, M. I., and Duchi, J. C. (2012). Finite sample convergence rates of zero-order stochastic optimization methods. In Advances in Neural Information Processing Systems, pages 1439–1447.
  • Wright (2015) Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3–34.
  • You et al. (2016) You, Y., Lian, X., Liu, J., Yu, H.-F., Dhillon, I. S., Demmel, J., and Hsieh, C.-J. (2016). Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems, pages 4682–4690.


Appendix A Setup and Notation

In this section we go over some of the definitions and notations which we had skipped over previously. We will also describe the class of functions we tackle and applications which fit into this framework.

A.1 Function Classes

Definition 8 (coordinate-wise L-smoothness).

A function f: ℝⁿ → ℝ is coordinate-wise L-smooth if

$f(x + \gamma e_i) \leq f(x) + \gamma \nabla_i f(x) + \tfrac{L}{2}\gamma^2$

for any x ∈ ℝⁿ, γ ∈ ℝ, and i ∈ [n], where eᵢ is a coordinate basis vector.

We also define strong convexity of the function f.

Definition 9 (μ-strong convexity).

A function f is μ-strongly convex with respect to some norm ‖·‖ if

$f(y) \geq f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2$

for any x and y in the domain of f (note that f does not necessarily need to be defined on the entire space ℝⁿ).

We will frequently denote by μ the strong convexity constant corresponding to the usual Euclidean norm, and by μ₁ the strong convexity constant corresponding to the ℓ1 norm. In general it holds that μ/n ≤ μ₁ ≤ μ. See (Nutini et al., 2015) for a detailed comparison of the two constants.

Theorem 4.

Suppose that the function f is twice-differentiable and

$\max_{i \in [n]} \ [\nabla^2 f(x)]_{ii} \leq L$

for any x in the domain of f, i.e. the maximum diagonal element of the Hessian is bounded by L. Then f is coordinate-wise L-smooth. Additionally, for any x and y in the domain of f,

$f(y) \leq f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{L}{2}\|y - x\|_1^2.$

Proof.

By Taylor’s expansion and the intermediate value theorem, we have that for any x and y, there exists a point z on the segment between x and y such that

(21)    $f(y) = f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{1}{2}(y - x)^\top \nabla^2 f(z)\,(y - x).$

Now if y = x + γeᵢ for some γ ∈ ℝ and coordinate i, equation (21) becomes

$f(x + \gamma e_i) = f(x) + \gamma \nabla_i f(x) + \tfrac{\gamma^2}{2}\,[\nabla^2 f(z)]_{ii}.$

The first claim now follows since L was defined such that $[\nabla^2 f(z)]_{ii} \leq L$. For the second claim, consider the following optimization problem over u,

(22)    $\max_{\|u\|_1 \leq 1} \ u^\top \nabla^2 f(z)\, u.$

We claim that the maximum is achieved for u = ±eᵢ for some coordinate i. For now assume this is true. Then we would have that

$(y - x)^\top \nabla^2 f(z)\,(y - x) \;\leq\; \|y - x\|_1^2 \max_{\|u\|_1 \leq 1} u^\top \nabla^2 f(z)\, u \;\leq\; L\,\|y - x\|_1^2.$

Using this result in equation (21) proves our second claim of the theorem. Thus we need to study the optimum of (22). Since f is a convex function, the quadratic form $u \mapsto u^\top \nabla^2 f(z)\, u$ is also a convex function. We can now appeal to Lemma 4 for the convex set $\{u : \|u\|_1 \leq 1\}$. The corners of this set exactly correspond to the unit directional vectors ±eᵢ. With this we finish the proof of our theorem. ∎

Remark 10.

This result states that if we define smoothness with respect to the ℓ1 norm, the resulting smoothness constant is the same as the coordinate-wise smoothness constant. This is surprising since, for a general convex function, minimizing the resulting ℓ1-smoothness upper bound,

$x^{t+1} := \arg\min_{y} \ f(x^t) + \langle \nabla f(x^t),\, y - x^t\rangle + \tfrac{L}{2}\|y - x^t\|_1^2,$

does not necessarily yield a coordinate update. We believe this observation (though not crucial to the current work) was not known before.

Let us prove an elementary lemma about maximizing convex functions over convex sets.

Lemma 4 (Maximum of a constrained convex function).

For any convex function h, the maximum over a compact convex set Q is achieved at a ‘corner’. Here a ‘corner’ is defined to be a point z ∈ Q such that there do not exist two distinct points u, v ∈ Q with z = θu + (1 − θ)v for some θ ∈ (0, 1).

Proof.

Suppose that the maximum is achieved at a point z which is not a ‘corner’. Then let u and v be two points such that for some θ ∈ (0, 1) we have z = θu + (1 − θ)v. Since the function h is convex,

(23)    $h(z) \leq \theta\, h(u) + (1 - \theta)\, h(v) \leq \max\{h(u), h(v)\},$

so the maximum is also achieved at u or v, and repeating the argument we may continue until reaching a corner.

We also assume that the proximal term g is such that each gᵢ either is an ℓ1-penalty or enforces a box constraint.

A.2 Proximal Coordinate Descent

As argued in the introduction, coordinate descent is the method of choice for large scale problems of the form (1). We denote the iterates by x^t, and a single coordinate of this vector by a subscript, x^t_i, for i ∈ [n]. CD methods only change one coordinate of the iterate in each iteration. That is, when coordinate i is updated at iteration t, we have x^{t+1}_j = x^t_j for j ≠ i, and

(24)    $x_i^{t+1} := x_i^t + \arg\min_{\eta \in \mathbb{R}} \ \eta\,\nabla_i f(x^t) + \tfrac{L}{2}\eta^2 + g_i(x_i^t + \eta).$

Combining the smoothness condition (2) and the definition of the proximal update (24), the progress made satisfies

(25)    $F(x^{t+1}) \leq F(x^t) + \min_{\eta \in \mathbb{R}} \Bigl\{ \eta\,\nabla_i f(x^t) + \tfrac{L}{2}\eta^2 + g_i(x_i^t + \eta) - g_i(x_i^t) \Bigr\}.$

A.3 Applications

There are a number of relevant problems in machine learning which are of the form (1), where the non-smooth term either enforces a box constraint or is an ℓ1-regularizer. This class covers several important problems such as SVMs, Lasso regression, logistic regression and elastic net regularized problems. We use SVMs and Lasso regression as running examples for our methods.

SVM.

The loss function for training SVMs with λ > 0 as the regularization parameter can be written as

(26)    $\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max\bigl(0,\, 1 - y_i \langle a_i, w\rangle\bigr),$

where (aᵢ, yᵢ) ∈ ℝᵈ × {±1}, for i ∈ [n], is the training data. We can define the corresponding dual problem for (26) as

(27)    $\min_{x \in [0,1]^n} \ \frac{1}{2\lambda n^2}\|Ax\|^2 - \frac{1}{n}\sum_{i=1}^n x_i,$

where A is the data matrix with the label-scaled datapoints yᵢaᵢ as columns (Shalev-Shwartz and Zhang, 2013b). We can map this to (1) with f the smooth part of (27) and gᵢ the box indicator function, i.e. gᵢ(xᵢ) = 0 if xᵢ ∈ [0, 1] and +∞ otherwise.

It is straightforward to see that the function f is coordinate-wise L-smooth for L = maxᵢ ‖aᵢ‖² / (λn²).
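As a small illustration of the box-constrained coordinate update used for the dual SVM, the following sketch performs one clipped proximal coordinate step on (27); the objective scaling follows the reconstruction above and should be treated as illustrative.

import numpy as np

def box_prox_coordinate_step(grad_i, x_i, L_i, lo=0.0, hi=1.0):
    # Proximal coordinate step for g_i = indicator of [lo, hi]:
    # a gradient step of size 1/L_i, clipped back into the box.
    return np.clip(x_i - grad_i / L_i, lo, hi) - x_i

def dual_svm_coordinate_update(A, x, i, lam):
    # One coordinate update on f(x) = ||A x||^2 / (2*lam*n^2) - sum(x)/n,
    # with x constrained to [0,1]^n; A has label-scaled datapoints as columns.
    n = A.shape[1]
    grad_i = (A[:, i] @ (A @ x)) / (lam * n ** 2) - 1.0 / n
    L_i = (A[:, i] @ A[:, i]) / (lam * n ** 2)
    x[i] += box_prox_coordinate_step(grad_i, x[i], L_i)
    return x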

We map the dual variable back to the primal variable as and the duality gap defined as