Feature Clustering for Accelerating Parallel Coordinate Descent

December 17, 2012 · Chad Scherrer et al. · PNNL and University of Michigan

Large-scale L1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for L1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale L1-regularized problems.


1 Introduction

Consider the $\ell_1$-regularized loss minimization problem

$$\min_{w \in \mathbb{R}^p} \;\sum_{i=1}^{n} \ell\big(y_i,\, x_i^\top w\big) \;+\; \lambda\,\|w\|_1 \qquad (1)$$

where $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times p}$ is the design matrix,

$w \in \mathbb{R}^p$ is a weight vector to be estimated, and the loss function

$\ell$ is such that $\ell(y, \cdot)$ is a convex differentiable function for each $y$. This formulation includes $\ell_1$-regularized least squares (Lasso) (when $\ell(y,t) = \tfrac{1}{2}(y-t)^2$) and

$\ell_1$-regularized logistic regression (when

$\ell(y,t) = \log(1 + e^{-yt})$). In recent years, coordinate descent (CD) algorithms have been shown to be efficient for this class of problems [Friedman et al., 2007; Wu and Lange, 2008; Shalev-Shwartz and Tewari, 2011; Bradley et al., 2011].
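Since coordinate descent works with one weight at a time, the relevant quantity is the partial derivative of the smooth part of (1) with respect to a single coordinate; spelling it out for the two losses above is a routine computation (our notation):

$$\frac{\partial}{\partial w_j}\left(\sum_{i=1}^{n}\ell(y_i,\, x_i^\top w)\right) \;=\; \sum_{i=1}^{n}\ell'(y_i,\, x_i^\top w)\,X_{ij},$$

where $\ell'$ denotes differentiation in the second argument. For squared loss this is $\sum_i (x_i^\top w - y_i)\,X_{ij}$; for logistic loss it is $\sum_i \frac{-y_i\,X_{ij}}{1 + e^{y_i x_i^\top w}}$. Only the examples with $X_{ij} \ne 0$ contribute, so the cost of evaluating one coordinate scales with the number of nonzeros in column $j$ of the design matrix.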

Motivated by the need to solve large-scale $\ell_1$-regularized problems, researchers have begun to explore parallel algorithms. For instance, Bradley et al. [2011] developed the Shotgun algorithm. More recently, Scherrer et al. [2012] have developed “GenCD”, a generic framework for expressing parallel coordinate descent algorithms. Special cases of GenCD include Greedy CD [Li and Osher, 2009; Dhillon et al., 2011], the Shotgun algorithm of Bradley et al. [2011], and Thread-Greedy CD [Scherrer et al., 2012].

In fact, the connection between these three special cases of GenCD is much deeper, and more fundamental, than is obvious under the GenCD abstraction. As our first contribution, we describe a general randomized block-greedy algorithm that includes all three as special cases. The block-greedy algorithm has two parameters: $B$, the total number of feature blocks, and $P$, the size of the random subset of blocks that is chosen at every time step. For each of these $P$ blocks, we greedily choose, in parallel, a single feature weight to be updated.

Second, we present a non-asymptotic convergence rate analysis for the randomized block-greedy coordinate descent algorithms for general values of $B \le p$ (the number of blocks cannot exceed the number $p$ of features) and $P \le B$. This result therefore applies to stochastic CD, greedy CD, Shotgun, and thread-greedy CD. Indeed, we build on the analysis and insights in all of these previous works. Our general convergence result, and in particular its instantiation to thread-greedy CD, is novel.

Third, based on the convergence rate analysis for block-greedy, we optimize a certain “block spectral radius” associated with the design matrix. This parameter is a direct generalization of a similar spectral parameter that appears in the analysis of Shotgun. We show that the block spectral radius can be upper bounded by the maximum inner product (or correlation if features are mean zero) between features in distinct blocks. This motivates the use of correlation-based feature clustering to accelerate the convergence of the thread-greedy algorithm.

Finally, we conduct an experimental study using a simple clustering heuristic. We observe dramatic acceleration due to clustering for smaller values of the regularization parameter, and we identify characteristics that require particularly close attention for heavily regularized problems and that can be improved upon in future work.

2 Block-Greedy Coordinate Descent

Scherrer et al. [2012] describe “GenCD”, a generic framework for parallel coordinate descent algorithms, in which a parallel coordinate descent algorithm is determined by specifying a select step and an accept step. At each iteration, the features chosen by select are evaluated, and a proposed increment is generated for each corresponding feature weight. Using this, the accept step then determines which of the proposed updates are actually applied.

Figure 1: The design space of block-greedy coordinate descent.

In these terms, we consider the block-greedy algorithm that takes as part of the input a partition of the features into blocks. Given this, each iteration selects features corresponding to a set of randomly selected blocks, and accepts a single feature from each block, based on an estimate of the resulting reduction in the objective function.

The pseudocode for the randomized block-greedy coordinate descent is given by Algorithm 1. The algorithm can be applied to any function of the form $F(w) = f(w) + h(w)$, where $f$ is smooth and convex, and $h$ is convex and separable across coordinates. Our objective function (1) satisfies these conditions. The greedy step chooses a feature within a block for which the guaranteed descent in the objective function (if that feature alone were updated) is maximized. This descent is quantified by the magnitude of the proposed increment $\eta_j$, which is defined precisely in the next section. To arrive at a heuristic understanding, it is best to think of $|\eta_j|$ as being proportional to the absolute value of the $j$th entry of the gradient of the smooth part $f$. In fact, if the regularization parameter $\lambda$ is zero (no regularization), then this heuristic is exact.

  Parameters: B (no. of blocks) and P (degree of parallelism)
  while not converged do
     Select a random subset S of size P from the B available blocks
     Set J to be the set of features in the selected blocks
     Propose increment η_j for each j ∈ J // in parallel
     Accept j_b = argmax_{j in block b} |η_j| for each selected block b ∈ S
     Update weight w_{j_b} ← w_{j_b} + η_{j_b} for all b ∈ S // in parallel
Algorithm 1 Block-Greedy Coordinate Descent
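To make the structure of Algorithm 1 concrete, here is a minimal NumPy sketch of block-greedy coordinate descent for the Lasso instance of (1) (squared loss plus $\lambda\|w\|_1$). The function names, the dense-matrix representation, and the per-coordinate curvature bounds are our own simplifications for illustration, not the authors' implementation.

  import numpy as np

  def soft_threshold(z, t):
      # Proximal operator of t*|.|
      return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

  def block_greedy_cd(X, y, blocks, lam, P, n_iters=100, seed=0):
      # Sketch of block-greedy CD for the Lasso instance of (1):
      #   minimize 0.5*||X w - y||^2 + lam*||w||_1
      # blocks: list of integer index arrays partitioning {0, ..., p-1}
      # P: number of blocks selected (conceptually in parallel) per iteration
      rng = np.random.default_rng(seed)
      n, p = X.shape
      w = np.zeros(p)
      residual = X @ w - y                     # maintained incrementally
      curv = (X ** 2).sum(axis=0)              # per-coordinate curvature bounds
      B = len(blocks)
      for _ in range(n_iters):
          chosen = rng.choice(B, size=P, replace=False)
          updates = []
          for b in chosen:                     # "select" + greedy "accept"
              best = None
              for j in blocks[b]:
                  if curv[j] == 0.0:
                      continue
                  g = X[:, j] @ residual       # gradient of the smooth part
                  eta = soft_threshold(w[j] - g / curv[j], lam / curv[j]) - w[j]
                  # guaranteed decrease from the second order upper bound
                  gain = g * eta + 0.5 * curv[j] * eta ** 2 \
                         + lam * (abs(w[j] + eta) - abs(w[j]))
                  if best is None or gain < best[2]:
                      best = (j, eta, gain)
              if best is not None and best[2] < 0.0:
                  updates.append(best[:2])
          for j, eta in updates:               # apply accepted updates
              w[j] += eta
              residual += eta * X[:, j]
      return w

For example, one might form B equal-sized blocks with blocks = np.array_split(np.arange(p), B) and call block_greedy_cd(X, y, blocks, lam=0.1, P=B) to mimic the thread-greedy setting described below.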

The two parameters, $B$ and $P$, of the block-greedy CD algorithm have the ranges $1 \le B \le p$ and $1 \le P \le B$. Setting these to specific values gives many existing algorithms. For instance, when $B = p$, each feature is a block on its own. Then, setting $P = 1$ amounts to randomly choosing a single coordinate and updating it, which gives us the stochastic CD algorithm of Shalev-Shwartz and Tewari [2011]. Shotgun [Bradley et al., 2011] is obtained when $B$ is still $p$ but $P > 1$. Another extreme is the case when all the features constitute a single block, that is, $B = 1$. Then block-greedy CD is a deterministic algorithm and becomes the greedy CD algorithm of Li and Osher [2009] and Dhillon et al. [2011]. Finally, we can choose non-trivial values of $B$ that lie strictly between $1$ and $p$. When this is the case, and we choose to update all blocks in parallel each time ($P = B$), we arrive at the thread-greedy algorithm of Scherrer et al. [2012]. Figure 1 shows a schematic representation of the parameterization of these special cases.

3 Convergence Analysis

Of course, there is no reason to expect block-greedy CD to converge for all values of $B$ and $P$. In this section, we give a sufficient condition for convergence and derive a convergence rate assuming this condition.

Bradley et al. [2011] express the convergence criterion for the Shotgun algorithm in terms of the spectral radius (maximal eigenvalue) $\rho$ of $X^\top X$. For block-greedy, the corresponding quantity is a bit more complicated. We define

$$\rho_B \;=\; \max_{M \in \mathcal{M}_B} \rho(M),$$

where $\mathcal{M}_B$ is the set of all $B \times B$ submatrices that we can obtain from $X^\top X$ by selecting exactly one index from each of the $B$ blocks. The intuition is that if features from different blocks are almost orthogonal, then the matrices in $\mathcal{M}_B$ will be close to the identity and will therefore have small spectral radius, so $\rho_B$ will be small. Highly correlated features within a block do not increase $\rho_B$.
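Computing $\rho_B$ exactly requires maximizing over every one-feature-per-block selection, which is intractable for realistic $p$ and $B$; the short sketch below (our own construction, not from the paper) estimates a lower bound on $\rho_B$ by sampling such selections at random.

  import numpy as np

  def sample_block_spectral_radius(X, blocks, n_samples=200, seed=0):
      # Monte Carlo lower bound on rho_B: sample submatrices of X^T X obtained
      # by choosing one (unit-normalized) feature from each block.
      rng = np.random.default_rng(seed)
      Xn = X / np.maximum(np.linalg.norm(X, axis=0), 1e-12)
      best = 0.0
      for _ in range(n_samples):
          idx = [rng.choice(b) for b in blocks]    # one feature per block
          M = Xn[:, idx].T @ Xn[:, idx]            # B x B submatrix of X^T X
          best = max(best, float(np.linalg.eigvalsh(M).max()))
      return best                                   # a lower bound on rho_B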

As we said above, we will assume that we are minimizing a “smooth plus separable” convex function $F(w) = f(w) + h(w)$, where the convex differentiable function $f$ satisfies a second order upper bound

$$f(w + \Delta) \;\le\; f(w) + \nabla f(w)^\top \Delta + \frac{\beta}{2}\,\Delta^\top X^\top X\,\Delta .$$

In our case, this inequality will hold as soon as $\ell''(y, t) \le \beta$ for any $y, t$ (where differentiation is w.r.t. the second argument $t$). The function $h$ is separable across coordinates: $h(w) = \sum_{j=1}^{p} h_j(w_j)$. The function $\lambda\,\|w\|_1$ is clearly separable.

The quantity $\eta_j$ appearing in Algorithm 1 serves to quantify the guaranteed descent (based on the second order upper bound) if feature $j$ alone is updated, and is obtained as the solution of the one-dimensional minimization problem

$$\eta_j \;=\; \operatorname*{argmin}_{\eta}\; g_j\,\eta + \frac{\beta}{2}\,\eta^2 + \lambda\,|w_j + \eta| - \lambda\,|w_j| .$$

Note that if there is no regularization, then $\eta_j$ is simply $-g_j/\beta$ (if we denote $\nabla_j f(w)$ by $g_j$ for brevity). In the general case, by first order optimality conditions for the above one-dimensional convex optimization problem, we have $g_j + \beta\,\eta_j + \nu_j = 0$, where $\nu_j$ is a subgradient of $\lambda|\cdot|$ at $w_j + \eta_j$. That is, $\nu_j \in \lambda\,\partial|w_j + \eta_j|$. By convexity of $\lambda|\cdot|$, this implies that $g_j\,\eta_j + \lambda\,|w_j + \eta_j| - \lambda\,|w_j| \le -\beta\,\eta_j^2$ for any $j$.
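For reference, the one-dimensional problem defining $\eta_j$ has the standard closed-form solution via soft-thresholding (a routine computation, stated in our notation rather than taken from the paper):

$$\eta_j \;=\; S_{\lambda/\beta}\!\left(w_j - \frac{g_j}{\beta}\right) - w_j, \qquad S_t(z) = \operatorname{sign}(z)\,\max(|z| - t,\, 0),$$

which reduces to $\eta_j = -g_j/\beta$ when $\lambda = 0$, consistent with the unregularized case above.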

Theorem 1.

Let $P$ be chosen so that

$$\epsilon \;=\; \frac{(P-1)(\rho_B - 1)}{B - 1}$$

is less than $1$. Suppose the randomized block-greedy coordinate descent algorithm is run on a smooth plus separable convex function $F = f + \lambda\|\cdot\|_1$ to produce the iterates $w_1, w_2, \ldots$. Then the expected accuracy after $k$ steps is bounded as

$$\mathbb{E}\big[F(w_k)\big] - F(w^\star) \;\le\; \frac{C\, p\, R^2}{(1-\epsilon)\, P\, k} .$$

Here the constant $C$ only depends on (Lipschitz and smoothness constants of) the function $f$, $R$ is an upper bound on the norms $\|w_k - w^\star\|_{1,2}$ (the block “1-2” norm defined in the proof below), and $w^\star$ is any minimizer of $F$.

Proof.

We first calculate the expected change in the objective function, following the Shotgun analysis. We will use $F'$ to denote $F(w')$, where $w' = w + \Delta w$ is the next iterate (similarly for $w'$, etc.). Let $j_b$ denote the feature greedily chosen in block $b$, let $S$ be the random set of $P$ selected blocks, and recall $g_j = \nabla_j f(w)$. Using the second order upper bound,

$$F' - F \;\le\; \sum_{b \in S}\Big(g_{j_b}\eta_{j_b} + \lambda|w_{j_b}+\eta_{j_b}| - \lambda|w_{j_b}|\Big) \;+\; \frac{\beta}{2}\sum_{b, b' \in S}\eta_{j_b}\,\eta_{j_{b'}}\,(X^\top X)_{j_b j_{b'}} .$$

Define the $B \times B$ matrix $M$ (that depends on the current iterate $w$) with entries $M_{b b'} = (X^\top X)_{j_b j_{b'}}$. Then, using the fact that a given block is selected with probability $P/B$ and a given pair of distinct blocks with probability $\frac{P(P-1)}{B(B-1)}$, we continue

$$\mathbb{E}[F' - F] \;\le\; \frac{P}{B}\sum_{b=1}^{B}\Big(g_{j_b}\eta_{j_b} + \lambda|w_{j_b}+\eta_{j_b}| - \lambda|w_{j_b}| + \frac{\beta}{2}\eta_{j_b}^2\Big) \;+\; \frac{\beta}{2}\,\frac{P(P-1)}{B(B-1)}\sum_{b \ne b'}\eta_{j_b}\,\eta_{j_{b'}}\,M_{b b'} .$$

Above (with some abuse of notation), $\eta$, $g$ and $\nu$ are length-$B$ vectors with components $\eta_{j_b}$, $g_{j_b}$ and $\nu_{j_b}$ respectively. By definition of $\rho_B$, we have $\eta^\top M \eta \le \rho_B\,\|\eta\|_2^2$, and hence $\sum_{b \ne b'}\eta_{j_b}\eta_{j_{b'}}M_{bb'} \le (\rho_B - 1)\,\|\eta\|_2^2$ since the features are normalized so that $M_{bb} = 1$. So, we continue

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P}{2B}\,\|\eta\|_2^2 \;+\; \frac{\beta}{2}\,\frac{P(P-1)}{B(B-1)}\,(\rho_B - 1)\,\|\eta\|_2^2 ,$$

where we used $g_{j}\eta_{j} + \lambda|w_{j}+\eta_{j}| - \lambda|w_{j}| \le -\beta\,\eta_{j}^2$. Simplifying we get

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P}{2B}\,(1-\epsilon)\,\|\eta\|_2^2 , \qquad \text{where } \epsilon = \frac{(P-1)(\rho_B - 1)}{B-1}$$

should be less than $1$.

Now note that

$$\|\eta\|_2^2 \;\ge\; \max_b \eta_{j_b}^2 \;\ge\; \frac{B}{p}\,\max_b \sum_{j \in b}\eta_j^2 \;=\; \frac{B}{p}\,\|\tilde\eta\|_{\infty,2}^2 ,$$

where $\tilde\eta \in \mathbb{R}^p$ is the vector of proposals for all $p$ features and the “infinity-2” norm of a $p$-vector is, by definition, as follows: take the $\ell_2$ norm within a block and take the $\ell_\infty$ of the resulting values. Note that in the second step above, we moved from the $B$-length vector $\eta$ to the length-$p$ vector $\tilde\eta$; this step uses the fact that each block contains $p/B$ features and that the greedily chosen proposal has the largest magnitude within its block.

This gives us

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P (1-\epsilon)}{2p}\,\|\tilde\eta\|_{\infty,2}^2 .$$

For the rest of the proof, assume $\lambda = 0$. In that case $\tilde\eta = -\nabla f(w)/\beta$. Thus, convexity and the fact that the dual norm of the “infinity-2” norm is the “1-2” norm (take the $\ell_2$ norm within a block and sum the resulting values), give

$$F(w) - F(w^\star) \;\le\; \langle \nabla f(w),\, w - w^\star\rangle \;\le\; \|\nabla f(w)\|_{\infty,2}\,\|w - w^\star\|_{1,2} \;=\; \beta\,\|\tilde\eta\|_{\infty,2}\,\|w - w^\star\|_{1,2} .$$

Putting the last two inequalities together gives (for any upper bound $R$ on the norms $\|w_k - w^\star\|_{1,2}$)

$$\mathbb{E}[F' - F] \;\le\; -\frac{P(1-\epsilon)}{2\,\beta\, p\, R^2}\,\big(F(w) - F(w^\star)\big)^2 .$$

Defining the accuracy $\delta_k = F(w_k) - F(w^\star)$, we translate the above into the recurrence

$$\mathbb{E}[\delta_{k+1}] \;\le\; \mathbb{E}[\delta_k] - \frac{P(1-\epsilon)}{2\,\beta\, p\, R^2}\,\mathbb{E}[\delta_k^2],$$

and by Jensen’s inequality we have $\mathbb{E}[\delta_k^2] \ge (\mathbb{E}[\delta_k])^2$ and therefore

$$\mathbb{E}[\delta_{k+1}] \;\le\; \mathbb{E}[\delta_k] - \frac{P(1-\epsilon)}{2\,\beta\, p\, R^2}\,(\mathbb{E}[\delta_k])^2,$$

which solves to (up to a universal constant factor)

$$\mathbb{E}[\delta_k] \;\le\; \frac{\beta\, p\, R^2}{(1-\epsilon)\, P\, k} .$$

Even when $\lambda > 0$, we can still relate $\|\tilde\eta\|_{\infty,2}$ to $F(w) - F(w^\star)$, but the argument is a little more involved. We refer the reader to the supplementary material (Appendix A) for more details. ∎

In particular, consider the case where all blocks are updated in parallel as in the thread-greedy coordinate descent algorithm of Scherrer et al. [2012]. Then $P = B$ and there is no randomness in the algorithm, yielding the following corollary.

Corollary 2.

Suppose the block-greedy coordinate descent algorithm with $P = B$ (thread-greedy) is run on a smooth plus separable convex function to produce the iterates $w_1, w_2, \ldots$. If $\rho_B < 2$, then

$$F(w_k) - F(w^\star) \;\le\; \frac{C\, p\, R^2}{(2 - \rho_B)\, B\, k} .$$
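The condition in Corollary 2 follows from a one-line substitution into the quantity $\epsilon$ of Theorem 1:

$$\epsilon\,\Big|_{P=B} \;=\; \frac{(B-1)(\rho_B - 1)}{B-1} \;=\; \rho_B - 1,$$

so requiring $\epsilon < 1$ is exactly requiring $\rho_B < 2$, and $1-\epsilon = 2 - \rho_B$ in the bound.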

4 Feature Clustering

The convergence analysis of Section 3 shows that we need to minimize the block spectral radius $\rho_B$. Directly finding a clustering that minimizes $\rho_B$ is a computationally daunting task. Even with equal-sized blocks, the number of possible partitions is $p!\,/\!\left(\left((p/B)!\right)^{B} B!\right)$. In the absence of an efficient search strategy for this enormous space, we find it convenient to work instead in terms of the inner products of features from distinct blocks. The following proposition makes the connection between these approaches precise.

Proposition 3.

Let $M \in \mathbb{R}^{B \times B}$ be positive semidefinite, with $M_{bb} = 1$ for all $b$, and $|M_{b b'}| \le \delta$ for $b \ne b'$. Then the spectral radius of $M$ has the upper bound

$$\rho(M) \;\le\; 1 + (B-1)\,\delta .$$

Proof.

Let $v$

be the eigenvector corresponding to the largest eigenvalue $\rho(M)$ of

$M$, scaled so that its entry of largest magnitude is $v_b = 1$. Then

$$\rho(M) \;=\; (M v)_b \;=\; M_{bb}\,v_b + \sum_{b' \ne b} M_{b b'}\,v_{b'} \;\le\; 1 + (B-1)\,\delta ,$$

since $|v_{b'}| \le 1$ and $|M_{b b'}| \le \delta$ for $b' \ne b$. ∎

Proposition 3 tells us that we can partition the features into clusters using a heuristic approach that strives to minimize the maximum absolute inner product between pairs of features (columns of the design matrix) that lie in different blocks.
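As a quick numerical sanity check of Proposition 3 (ours, not part of the original paper), one can generate random Gram matrices with unit diagonal and verify the bound directly:

  import numpy as np

  def check_proposition3(B=8, trials=1000, seed=0):
      # Verify rho(M) <= 1 + (B-1)*delta on random Gram matrices M = A^T A
      # with unit-norm columns (so M is PSD with unit diagonal).
      rng = np.random.default_rng(seed)
      for _ in range(trials):
          A = rng.standard_normal((50, B))
          A /= np.linalg.norm(A, axis=0)           # unit-norm columns
          M = A.T @ A
          delta = np.abs(M - np.eye(B)).max()      # max |M_{bb'}| over b != b'
          rho = np.linalg.eigvalsh(M).max()
          assert rho <= 1 + (B - 1) * delta + 1e-10
      return True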

4.1 Clustering Heuristic

Given $p$ features and $B$ blocks, we wish to distribute the features evenly among the blocks, attempting to minimize the absolute inner product between features across blocks. Moreover, we require an approach that is efficient, since any time spent clustering could instead have been used for iterations of the main algorithm. We describe a simple heuristic that builds uniform-sized clusters of features.

To construct a given block, we select a feature as a “seed”, and assign the $p/B - 1$ nearest features to the seed (in terms of absolute inner product) to the same block. Because inner products with very sparse features result in a large number of zeros, we choose at each step the most dense unassigned feature as the seed. Algorithm 2 provides a detailed description. This heuristic requires the computation of $O(pB)$ inner products. In practice it is very fast—less than three seconds for even the large KDDA dataset.

  
  U ← {1, …, p}   // unassigned features
  for b ← 1 to B do
     s ← the feature in U with the most nonzeros
     for j ∈ U do // in parallel
        c_j ← |⟨X_s, X_j⟩|
     N ← {s} ∪ {the p/B − 1 features in U \ {s} yielding the largest values of c_j}
     block_b ← N
     U ← U \ N
  return block_1, …, block_B
Algorithm 2 A heuristic for clustering features into blocks, based on correlation
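A rough Python rendering of Algorithm 2 follows, assuming a sparse design matrix in SciPy's CSC format; the tie-breaking and data structures in the authors' implementation may well differ.

  import numpy as np
  import scipy.sparse as sp

  def cluster_features(X, B):
      # Greedy correlation-based clustering of the p columns of X into B
      # roughly equal-sized blocks (sketch of Algorithm 2).
      X = sp.csc_matrix(X)
      p = X.shape[1]
      k = p // B                                   # target features per block
      density = np.diff(X.indptr)                  # nonzeros per column
      unassigned = set(range(p))
      blocks = []
      for b in range(B):
          if b == B - 1:                           # last block takes the rest
              blocks.append(np.array(sorted(unassigned)))
              break
          cand = np.array(sorted(unassigned))
          seed = cand[np.argmax(density[cand])]    # densest unassigned feature
          # absolute inner products of the seed with all unassigned features
          sims = np.abs((X[:, cand].T @ X[:, seed]).toarray().ravel())
          sims[cand == seed] = np.inf              # keep the seed itself
          nearest = cand[np.argsort(-sims)[:k]]    # k most correlated (incl. seed)
          blocks.append(nearest)
          unassigned -= set(nearest.tolist())
      return blocks

The returned list of index arrays can be passed directly to a block-greedy solver such as the sketch shown after Algorithm 1.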

5 Experimental Setup

Platform: All our experiments are conducted on a multicore system comprising four sockets and multiple banks of memory. Each socket holds an AMD Opteron processor codenamed Magny-Cours, a multichip processor with two multicore chips on a single die. Each chip has a three-level memory hierarchy: L1 (data) and L2 caches that are private to each core, and an L3 cache that is shared among the cores of the chip. Each chip is also linked to its own memory bank through independent memory controllers, and the aggregate system memory can be globally addressed from each core. The four sockets are interconnected using HyperTransport-3 technology. (Further details on the AMD Opteron can be found at http://www.amd.com/us/products/embedded/processors/opteron/Pages/opteron-6100-series.aspx.)

Datasets: A variety of datasets were chosen for experimentation (all available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/); these are summarized in Table 1. We consider four datasets. News20 contains about 20,000 UseNet postings from 20 newsgroups. The data was gathered by Ken Lang at Carnegie Mellon University circa 1995. Reuters is the RCV1-v2/LYRL2004 Reuters text data described by Lewis et al. [2004]. In this term-document matrix, each example is a training document, and each feature is a term. Nonzero values of the matrix correspond to term frequencies that are transformed using a standard tf-idf normalization. RealSim consists of UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real auto racing, and real aviation. The data was gathered by Andrew McCallum while at Just Research circa 1997. We consider classifying real vs. simulated data, irrespective of auto/aviation. KDDA represents data from the KDD Cup 2010 challenge on educational data mining. The data is a processed version of the training set of the first problem, algebra_2008_2009, provided by the challenge winner from National Taiwan University. These four inputs cover a broad spectrum of sizes and structural properties.

Name # Features # Samples # Nonzeros Source
News20 Keerthi and DeCoste [2005]
Reuters Lewis et al. [2004]
RealSim RealSim
KDDA Lo et al. [2011]
Table 1: Summary of input characteristics.

Implementation: For the current work, our empirical results focus on thread-greedy coordinate descent with 32 blocks. At each iteration, a given thread must step through the nonzeros of each of its features to compute the proposed increment (the $\eta_j$ of Section 3) and the estimated benefit of choosing that feature. Once this is complete, the thread (without waiting) enters the line search phase, where it remains until the updates of all threads change by less than the specified tolerance. Finally, all updates are performed concurrently. We use OpenMP's atomic directive to maintain consistency.

Testing framework

We compare the effect of clustering to randomization (i.e., features randomly assigned to blocks) for a variety of values of the regularization parameter $\lambda$. To test the effect of clustering for very sparse weights, we first let $\lambda$ be the largest power of ten that leads to any nonzero weight estimates; this is followed by the next three consecutive smaller powers of ten. For each run, we measure the regularized expected loss and the number of nonzeros at one-second intervals. The times required for clustering and randomization are negligible, and we do not report them here.
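A sketch of this sweep is below; fit_thread_greedy stands in for whichever solver is being tested and is a hypothetical placeholder, not an interface from the paper.

  import numpy as np

  def lambda_grid(X, y, fit_thread_greedy, start_power=2, min_power=-10):
      # Find the largest power of ten that yields any nonzero weights,
      # then return it together with the next three smaller powers of ten.
      for k in range(start_power, min_power, -1):
          lam = 10.0 ** k
          w = fit_thread_greedy(X, y, lam)        # hypothetical solver call
          if np.count_nonzero(w) > 0:
              return [10.0 ** (k - i) for i in range(4)]
      raise ValueError("no power of ten produced nonzero weights")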

6 Results

Figure 2 shows the regularized expected loss (top) and number of nonzeros (bottom) for several values of the regularization parameter $\lambda$. Black and red curves indicate randomly permuted features and clustered features, respectively. The starting value of $\lambda$ was the same for all datasets except KDDA, which required a different starting value in order to yield any nonzero weights.

In the upper plots, within a color, the order of the 4 curves, top to bottom, corresponds to successively decreasing values of $\lambda$. Note that a larger value of $\lambda$ results in a sparser solution with greater regularized expected loss and a smaller number of nonzeros. Thus, for each subfigure of Figure 2, the order of the curves in the lower plot is reversed from that of the upper plot.

(a) News20; (b) Reuters; (c) RealSim; (d) KDDA.
Figure 2: Convergence results. For each dataset, we show the regularized expected loss (top) and number of nonzeros (bottom), using powers of ten as regularization parameters. Results for randomized features are shown in black, and those for clustered features are shown in red. Note that the allowed running time for KDDA was ten times that of the other datasets.

Overall, results across datasets are very consistent. For large values of $\lambda$, the simple clustering heuristic results in slower convergence, while for smaller values of $\lambda$ we see considerable benefit. Due to space limitations, we choose a single dataset for which to explore results in greater detail.

Of the datasets we tested, Reuters might reasonably lead to the greatest concern. Like the other datasets, Reuters shows slow convergence with clustered features for large $\lambda$ and fast convergence for small $\lambda$. However, Reuters is particularly interesting because for one intermediate value of $\lambda$, clustered features seem to provide an initial benefit that does not last; after about 250 seconds the clustered run is overtaken by the run with randomized features.

                        largest λ              2nd largest λ          3rd largest λ
                     Randomized  Clustered   Randomized  Clustered   Randomized  Clustered
Active blocks             32         6           32         32           32         32
Iterations per second    153       12.9         152        12.3         136        12.3
NNZ @ 1K sec             184        215          797       8592         1248      19473
Objective @ 1K sec      0.472      0.591        0.264      0.321        0.206      0.136
NNZ @ 10K iter            74        203           82       8812          110      19919
Objective @ 10K iter    0.570      0.593        0.515      0.328        0.472      0.141
Table 2: The effect of feature clustering, for Reuters, for the three largest values of the regularization parameter λ (decreasing from left to right).

Table 2 gives a more detailed summary of the results for Reuters, for the three largest values of . The first row of this table gives the number of active blocks, by which we mean the number of blocks containing any nonzeros. For an inactive block, the corresponding thread repeatedly confirms that all weights remain zero without contributing to convergence.

In the most regularized case, clustering results in only six active blocks, while in the other cases every block is active. Thus, in this case, the features corresponding to nonzero weights are colocated within these few blocks, severely limiting the advantage of parallel updates.

In the second row, we see that for randomized features, the algorithm is able to get through over ten times as many iterations per second. To see why, note that the amount of work for a given thread is a linear function of the number of nonzeros over all of the features in its block. Thus, the block with the greatest number of nonzeros serves as a bottleneck.
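Since a thread's per-iteration work is proportional to the nonzeros summed over its block, a simple diagnostic (our sketch, not code from the paper) is to tabulate that load per block; the most heavily loaded block sets the iteration rate.

  import numpy as np
  import scipy.sparse as sp

  def block_loads(X, blocks):
      # Number of nonzeros summed over the features (columns) in each block.
      X = sp.csc_matrix(X)
      col_nnz = np.diff(X.indptr)                  # nonzeros per column
      return np.array([col_nnz[b].sum() for b in blocks])

  # e.g. loads = block_loads(X, blocks); loads.max() / loads.mean() quantifies imbalance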

The middle two rows of Table 2 summarize the state of each run after 1000 seconds. Note that for this test, randomized features result in faster convergence for the two largest values of $\lambda$.

For comparison, the final two rows of Table 2 provide a similar summary based instead on the number of iterations. In these terms, clustering is advantageous for all but the largest value of $\lambda$.

(a) Block density; (b) Regularized expected loss; (c) Number of nonzeros.
Figure 3: A closer look at performance characteristics for Reuters.

Figure 3 shows the source of this problem. First, Figure 3(a) shows the number of nonzeros across all features in each of the 32 blocks. Clearly the simple heuristic results in poor load-balancing. For comparison, Figures 3(b) and 3(c) show convergence rates as a function of the number of iterations.

7 Conclusion

We have presented convergence results for a family of randomized coordinate descent algorithms that we call block-greedy coordinate descent. This family includes Greedy CD, Thread-Greedy CD, Shotgun, and Stochastic CD. We have shown that convergence depends on $\rho_B$, the maximal spectral radius over submatrices of $X^\top X$ resulting from the choice of one feature from each block.

Even though a simple clustering heuristic helps for smaller values of the regularization parameter, our results also show the importance of considering issues of load-balancing and the distribution of weights for heavily-regularized problems.

A clear next goal in this work is the development of a clustering heuristic that produces well load-balanced blocks and, for heavily regularized problems, distributes the nonzero weights evenly across blocks, while maintaining good computational efficiency.

Acknowledgments

The authors are grateful for the helpful suggestions of Ken Jarman, Joseph Manzano, and our anonymous reviewers.

Funding for this work was provided by the Center for Adaptive Supercomputing Software - MultiThreaded Architectures (CASS-MT) at the U.S. Department of Energy's Pacific Northwest National Laboratory. PNNL is operated by Battelle Memorial Institute under Contract DE-AC06-76RL01830.

References

  • Friedman et al. [2007] J Friedman, T Hastie, H Höfling, and R Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
  • Wu and Lange [2008] T Wu and K Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2:224–244, 2008.
  • Shalev-Shwartz and Tewari [2011] S Shalev-Shwartz and A Tewari. Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.
  • Bradley et al. [2011] J K Bradley, A Kyrola, D Bickson, and C Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. In Proceedings of the 28th International Conference on Machine Learning, pages 321–328, 2011.
  • Scherrer et al. [2012] C Scherrer, A Tewari, M Halappanavar, and D Haglin. Scaling up Parallel Coordinate Descent Algorithms. In International Conference on Machine Learning, 2012.
  • Li and Osher [2009] Y Li and S Osher. Coordinate Descent Optimization for ℓ1 Minimization with Application to Compressed Sensing; a Greedy Algorithm Solving the Unconstrained Problem. Inverse Problems and Imaging, 3:487–503, 2009.
  • Dhillon et al. [2011] I S Dhillon, P Ravikumar, and A Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems 24, pages 2160–2168, 2011.
  • Lewis et al. [2004] D Lewis, Y Yang, T Rose, and F Li. RCV1 : A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361–397, 2004.
  • Keerthi and DeCoste [2005] S S Keerthi and D DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
  • RealSim. Document classification data gathered by Andrew McCallum, circa 1997. URL: http://people.cs.umass.edu/~mccallum/data.html.
  • Lo et al. [2011] Hung-Yi Lo, Kai-Wei Chang, Shang-Tse Chen, Tsung-Hsien Chiang, Chun-Sung Ferng, Cho-Jui Hsieh, Yi-Kuang Ko, Tsung-Ting Kuo, Hung-Che Lai, Ken-Yi Lin, Chia-Hsuan Wang, Hsiang-Fu Yu, Chih-Jen Lin, Hsuan-Tien Lin, and Shou de Lin. Feature engineering and classifier ensemble for KDD Cup 2010, 2011. To appear in JMLR Workshop and Conference Proceedings.

Appendix A Complete convergence analysis in the regularized case

Basic setup: We are minimizing a function of the form $F(w) = f(w) + h(w)$, where $f$ is a convex differentiable function that satisfies a second order upper bound

$$f(w + \Delta) \;\le\; f(w) + \nabla f(w)^\top \Delta + \frac{\beta}{2}\,\Delta^\top X^\top X\,\Delta ,$$

and $h$ is convex (and possibly non-differentiable) and separable across coordinates:

$$h(w) \;=\; \sum_{j=1}^{p} h_j(w_j) .$$

In our case $X \in \mathbb{R}^{n \times p}$ is the design matrix. If the columns of

$X$ are zero mean and normalized, then the entries of

$X^\top X$ measure the correlation between features; in particular, the diagonal entries equal $1$. Also, $h_j(w_j) = \lambda\,|w_j|$.

Divide the $p$ features into $B$ blocks of $p/B$ features each. The algorithm we analyze is block-greedy, a direct generalization of Shotgun ($B = p$ in the Shotgun case). In the regularized case, the block-greedy algorithm is:

For $P$ randomly chosen blocks, in parallel do

  • Within each chosen block $b$, find $j_b$ such that $|\eta_{j_b}|$ is maximum and update $w_{j_b} \leftarrow w_{j_b} + \eta_{j_b}$

Endfor

Here $\eta_j$ serves to quantify the guaranteed descent (based on the second order upper bound) if feature $j$ alone is updated, and solves the one-dimensional problem

$$\eta_j \;=\; \operatorname*{argmin}_{\eta}\; g_j\,\eta + \frac{\beta}{2}\,\eta^2 + \lambda\,|w_j + \eta| - \lambda\,|w_j| , \qquad g_j = \nabla_j f(w).$$

Note that if there is no regularization, then $\eta_j = -g_j/\beta$, and this is the case we analyzed in the main body of the paper. In the general case, by first order optimality conditions for the above one-dimensional convex optimization problem, we have

$$g_j + \beta\,\eta_j + \nu_j \;=\; 0 ,$$

where $\nu_j$ is a subgradient of $\lambda|\cdot|$ at $w_j + \eta_j$. That is, $\nu_j \in \lambda\,\partial|w_j + \eta_j|$. By convexity of $\lambda|\cdot|$, this implies that

$$g_j\,\eta_j + \lambda\,|w_j + \eta_j| - \lambda\,|w_j| \;\le\; -\beta\,\eta_j^2$$

for any $j$.

We first calculate the expected change in the objective function, following the Shotgun analysis. We will use $F'$ to denote $F(w')$, where $w' = w + \Delta w$ is the next iterate (similarly for $\eta$, $\nu$). Let $j_b$ denote the feature greedily chosen in block $b$, and let $S$ be the random set of $P$ selected blocks. Using the second order upper bound,

$$F' - F \;\le\; \sum_{b \in S}\Big(g_{j_b}\eta_{j_b} + \lambda|w_{j_b}+\eta_{j_b}| - \lambda|w_{j_b}|\Big) \;+\; \frac{\beta}{2}\sum_{b, b' \in S}\eta_{j_b}\,\eta_{j_{b'}}\,(X^\top X)_{j_b j_{b'}} .$$

Define the $B \times B$ matrix $M$ (which depends on the current iterate) with entries $M_{b b'} = (X^\top X)_{j_b j_{b'}}$. Then, using the fact that a given block is selected with probability $P/B$ and a given pair of distinct blocks with probability $\frac{P(P-1)}{B(B-1)}$, we continue

$$\mathbb{E}[F' - F] \;\le\; \frac{P}{B}\sum_{b=1}^{B}\Big(g_{j_b}\eta_{j_b} + \lambda|w_{j_b}+\eta_{j_b}| - \lambda|w_{j_b}| + \frac{\beta}{2}\eta_{j_b}^2\Big) \;+\; \frac{\beta}{2}\,\frac{P(P-1)}{B(B-1)}\sum_{b \ne b'}\eta_{j_b}\,\eta_{j_{b'}}\,M_{b b'} .$$

Above (with some abuse of notation), $\eta$, $g$ and $\nu$ are length-$B$ vectors with components $\eta_{j_b}$, $g_{j_b}$ and $\nu_{j_b}$ respectively.

Our generalization of Shotgun's parameter $\rho$ is

$$\rho_B \;=\; \max_{M \in \mathcal{M}_B}\rho(M),$$

where $\mathcal{M}_B$ is the set of all $B \times B$ submatrices obtainable from $X^\top X$ by selecting exactly one index from each of the $B$ blocks.

So, we continue

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P}{2B}\,\|\eta\|_2^2 \;+\; \frac{\beta}{2}\,\frac{P(P-1)}{B(B-1)}\,(\rho_B - 1)\,\|\eta\|_2^2 ,$$

where we used $\eta^\top M \eta \le \rho_B\,\|\eta\|_2^2$, $M_{bb} = 1$ (normalized features), and $g_{j}\eta_{j} + \lambda|w_{j}+\eta_{j}| - \lambda|w_{j}| \le -\beta\,\eta_{j}^2$.

Simplifying we get

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P}{2B}\,(1-\epsilon)\,\|\eta\|_2^2 , \qquad \text{where } \epsilon = \frac{(P-1)(\rho_B - 1)}{B-1}$$

should be less than $1$.

Now note that

$$\|\eta\|_2^2 \;\ge\; \max_b \eta_{j_b}^2 \;\ge\; \frac{B}{p}\,\max_b \sum_{j \in b}\eta_j^2 \;=\; \frac{B}{p}\,\|\tilde\eta\|_{\infty,2}^2 ,$$

where $\tilde\eta \in \mathbb{R}^p$ is the vector of proposals for all $p$ features and the “infinity-2” norm of a $p$-vector is, by definition, as follows: take the $\ell_2$ norm within a block and take the $\ell_\infty$ of the resulting values. Note that in the second step above, we moved from the $B$-length vector $\eta$ to the length-$p$ vector $\tilde\eta$; this uses the fact that each block contains $p/B$ features and that the greedily chosen proposal has the largest magnitude within its block.

This gives us

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P (1-\epsilon)}{2p}\,\|\tilde\eta\|_{\infty,2}^2 . \qquad (2)$$

From the results in Dhillon et al. [2011] we know that $F(w) - F(w^\star) \le C\,\|\tilde\eta\|_{\infty,2}$, where the constant $C$ depends on the function (e.g., its smoothness and Lipschitz constants) and the maximum value that $\|w_k - w^\star\|_{1,2}$ can take over the course of the algorithm. Because $\|\tilde\eta\|_{\infty,2}^2 \ge \big(F(w) - F(w^\star)\big)^2/C^2$, plugging this into (2), we get

$$\mathbb{E}[F' - F] \;\le\; -\frac{\beta P (1-\epsilon)}{2\,p\,C^2}\,\big(F(w) - F(w^\star)\big)^2 .$$

Defining the accuracy $\delta_k = F(w_k) - F(w^\star)$, we translate the above into the recurrence

$$\mathbb{E}[\delta_{k+1}] \;\le\; \mathbb{E}[\delta_k] - \frac{\beta P(1-\epsilon)}{2\,p\,C^2}\,\mathbb{E}[\delta_k^2],$$

and by Jensen’s inequality we have $\mathbb{E}[\delta_k^2] \ge (\mathbb{E}[\delta_k])^2$ and therefore

$$\mathbb{E}[\delta_{k+1}] \;\le\; \mathbb{E}[\delta_k] - \frac{\beta P(1-\epsilon)}{2\,p\,C^2}\,(\mathbb{E}[\delta_k])^2,$$

which solves to (up to a universal constant factor)

$$\mathbb{E}[\delta_k] \;\le\; \frac{p\,C^2}{\beta\,(1-\epsilon)\,P\,k} .$$