Parallel coordinate descent for the Adaboost problem

10/07/2013, by Olivier Fercoq et al.

We design a randomised parallel version of Adaboost based on previous studies on parallel coordinate descent. The algorithm uses the fact that the logarithm of the exponential loss is a function with coordinate-wise Lipschitz continuous gradient, in order to define the step lengths. We provide the proof of convergence for this randomised Adaboost algorithm and a theoretical parallelisation speedup factor. We finally provide numerical examples on learning problems of various sizes that show that the algorithm is competitive with concurrent approaches, especially for large scale problems.

I Introduction

The Adaboost algorithm, introduced by Freund and Schapire [1], is a widely used classification algorithm. Its goal is to combine many weak hypotheses with high error rate to generate a single strong hypothesis with very low error. The algorithm is equivalent to the minimisation of the exponential loss by the greedy coordinate descent method [2]. At each iteration, it selects the classifier with the largest error and updates its weight in order to decrease this error as much as possible. The weights of the other classifiers are left unchanged.
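As an illustration of this coordinate descent view, the following minimal Python sketch (an illustration only, not the implementation used in this paper) performs one greedy step on the exponential loss; it assumes the data are encoded in a margin matrix A whose entry (j, i) is the label of example j multiplied by the value of weak hypothesis i on that example.

import numpy as np
from scipy.optimize import minimize_scalar

def greedy_adaboost_step(A, x):
    # One step of Adaboost viewed as greedy coordinate descent on the
    # exponential loss F(x) = (1/m) * sum_j exp(-(A x)_j).
    m = A.shape[0]
    w = np.exp(-A @ x) / m            # current weights of the examples
    grad = -A.T @ w                   # full gradient of F at x
    i = int(np.argmax(np.abs(grad)))  # weak hypothesis with the largest |partial derivative|
    phi = lambda t: np.mean(np.exp(-(A @ x + t * A[:, i])))  # loss along coordinate i
    x = x.copy()
    x[i] += minimize_scalar(phi).x    # exact line search on that single weight
    return x

When the weak hypotheses take values in {-1, +1}, this line search has the familiar closed form (1/2) log((1 - err) / err), where err is the weighted error of the selected hypothesis.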

Adaboost has found a large number of applications; to name a few, we may cite face detection and recognition [3], cancer detection by interpretation of radiographies [4], and gene-gene interaction detection [5].

The original algorithm is intrinsically sequential, but several parallel versions have been developed.

Collins, Schapire and Singer [6] give a version of the algorithm where all the coordinates are updated in parallel. They prove the convergence of this fully parallel coordinate descent under the assumption that the 1-norm of each row of the feature matrix is smaller than 1. One may relax this assumption by requiring instead that every element of the matrix has absolute value smaller than 1 and by dividing the step length by the maximum number of nonzero elements in a row.
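The safeguard of dividing the step length by the maximum number of nonzero elements per row can be illustrated as follows. The sketch is ours and is written for the logarithm of the exponential loss introduced in Section II rather than for the exact update of [6]; the coordinate-wise Lipschitz bounds max_j A[j, i]^2 are our own choice, obtained from the standard Hessian bound for the log-sum-exp function, and A is taken to be a dense NumPy array for simplicity.

import numpy as np
from scipy.special import softmax

def fully_parallel_step(A, x):
    # All coordinates of f(x) = log((1/m) * sum_j exp(-(A x)_j)) move simultaneously;
    # the step is divided by omega, the maximum number of nonzero elements in a row
    # of A, to keep the simultaneous update safe.
    omega = int((A != 0).sum(axis=1).max())
    p = softmax(-A @ x)                            # weights proportional to exp(-(A x)_j)
    grad = -A.T @ p                                # gradient of f at x
    L = np.maximum((A ** 2).max(axis=0), 1e-12)    # coordinate-wise Lipschitz bounds for f
    return x - grad / (omega * L)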

In the context of support vector machines, Mukherjee et al. [7] interpreted the fully parallel coordinate descent method as a gradient method and designed an accelerated algorithm using the accelerated gradient method. The same approach is possible for Adaboost and we give numerical experiments in Section V.
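A generic version of this accelerated full-gradient approach, applied to the logarithm of the exponential loss defined in Section II, can be sketched as follows. This is not the algorithm of [7] but a plain Nesterov accelerated gradient method, and the Lipschitz constant L = ||A||_2^2 is our own bound (the Hessian of the log-sum-exp function is dominated by the identity).

import numpy as np
from scipy.special import softmax

def accelerated_gradient(A, x0, iters=200):
    # Nesterov's accelerated gradient on f(x) = log((1/m) * sum_j exp(-(A x)_j)).
    L = np.linalg.norm(A, 2) ** 2       # spectral norm squared; estimated by power iteration in practice
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        grad = -A.T @ softmax(-A @ y)                     # gradient at the extrapolated point
        x_new = y - grad / L                              # plain gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum schedule
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)     # extrapolation
        x, t = x_new, t_new
    return x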

Another approach to parallelisation is proposed in [8]: the authors keep the Adaboost algorithm unchanged but parallelise its inner operations.

Finally, [9] proposed to solve a different problem that gives a result similar to Adaboost but can be solved by a parallel algorithm. However, they need to initialise the algorithm with iterations of the sequential Adaboost and they only give empirical evidence that the number of iterations required is small.

Palit and Reddy [10] first partition the coordinates so that each processor gets a subset of the data. Each processor solves the Adaboost problem on its part of the data and the results are then merged. The algorithm requires very little communication between processors, but the authors only provide a proof of convergence in the case of two processors.

In this paper, we propose a new parallel version of Adaboost based on recent work on parallel coordinate descent. In [11], Richtárik and Takáč introduced a general parallel coordinate descent method for the minimisation of composite functions, that is, sums of a convex, partially separable function with a coordinate-wise Lipschitz continuous gradient and a convex, nonsmooth, separable regulariser, e.g. the 1-norm. They provided convergence results for this algorithm together with a theoretical parallelisation speedup factor. The best speedups are obtained for randomised coordinate descent methods, in which the coordinates to update are chosen according to a random sampling. They showed in [12] that this algorithm is very well suited to the resolution of Support Vector Machine problems.

The exponential loss does not fit in this framework because it does not have a Lipschitz continuous gradient. However, Fercoq and Richtárik [13] showed that the parallel coordinate descent method can also be applied to nonsmooth functions with a max-structure, the so-called Nesterov separable functions.

We show in Theorem 1 that the logarithm of the exponential loss is Nesterov separable, which allows us to define the parallel coordinate descent method for the Adaboost problem (Algorithm 2). Then, we prove the convergence of the algorithm (Theorem 2) and give its iteration complexity, building on the iteration complexity of classical Adaboost [14, 15]. Finally, we provide numerical examples on learning problems of various sizes.

II The Adaboost problem

Let be a matrix of features and be a vector of labels. We denote by the matrix such that

In this paper, we may accept . We will write the coordinates as indices and the sequences as superscripts. Hence is the coordinate of the element of the vector valued sequence .

The Adaboost problem is the minimisation of the exponential loss [16]:

(1)

Let be the following empirical risk function

We denote the optimal value of the Adaboost problem (1) by

It will be convenient to consider the following equivalent objective function with Lipschitz gradient

and its associated Adaboost problem

(2)

As the logarithm is monotone, problems (1) and (2) are equivalent. Moreover both are convex optimisation problems. This version of the Adaboost problem has a nice dual problem involving the entropy function [17] and the Lipschitz continuity of the gradient of will be useful to define the Parallel Coordinate Descent Method.
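For concreteness, the two objective functions can be written in a few lines of Python. The encoding of the data in a single margin matrix A, with A[j, i] equal to the label of example j times the value of weak hypothesis i on that example, is an assumption made for this sketch.

import numpy as np
from scipy.special import logsumexp

def exp_loss(A, x):
    # Exponential loss of problem (1): (1/m) * sum_j exp(-(A x)_j).
    return np.exp(-A @ x).mean()

def log_exp_loss(A, x):
    # Logarithm of the exponential loss, problem (2), evaluated stably with logsumexp.
    return logsumexp(-A @ x) - np.log(A.shape[0])

Both functions have the same minimisers; only the second one has a Lipschitz continuous gradient.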

III Parallel coordinate descent

III-A General case

In this section, we present the Parallel Coordinate Descent Method introduced by Richtárik and Takáč in [11].

For , we denote by the norm such that .

At each iteration of the parallel coordinate descent method, one needs to select which coordinates will be updated. One may choose the coordinates in a given deterministic way, but it is convenient to randomise this choice. Several samplings, i.e., laws for randomly choosing subsets of the variables, are considered in [11].

We will focus in this paper on the nice sampling. It corresponds to the case where we have a fixed number of processors updating as many coordinates in parallel, and where each subset of coordinates of that size has the same probability of being selected.

A good approximation of the nice sampling is the independent sampling, where each processor selects the coordinate it will update according to a uniform law, independently of the others.
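The two samplings are easy to generate; the following sketch (with function names of our choosing) makes the difference explicit.

import numpy as np

def nice_sampling(n, tau, rng):
    # A uniformly random subset of {0, ..., n-1} of fixed cardinality tau.
    return rng.choice(n, size=tau, replace=False)

def independent_sampling(n, tau, rng):
    # Each of the tau processors draws its coordinate uniformly and independently of
    # the others; collisions are possible, so fewer than tau distinct coordinates
    # may be updated.
    return np.unique(rng.integers(0, n, size=tau))

rng = np.random.default_rng(0)
print(sorted(nice_sampling(20, 4, rng)), sorted(independent_sampling(20, 4, rng)))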

The choice of the sampling has consequences on the complexity estimates. More precisely, the parallel coordinate descent method relies on the concept of Expected Separable Overapproximation (ESO) to compute the updates, and the ESO depends on the sampling. Given a vector and a subset of coordinates, we will also use the vector that agrees with it on the selected coordinates and is zero elsewhere.

Definition 1 ([11]).

Let , and be a sampling. We say that admits a -Expected Separable Overapproximation with respect to if for all ,

(3)

We denote ESO for simplicity.

As the overapproximation is separable, one can find its minimiser by solving independent one-dimensional optimisation problems, one per coordinate. In fact, we do not even need to compute the coordinates of the minimiser that are not used afterwards, i.e. we only compute those of the sampled coordinates.

  Compute and such that .
  for  do
     Randomly generate following sampling .
     Compute , where minimises the overapproximation (3).
     
     if  then
        
     end if
  end for
Algorithm 1 Parallel Coordinate Descent Method [11]
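Since the listing above lost its mathematical symbols during extraction, the following Python paraphrase of Algorithm 1 may help fix ideas. It is restricted to a smooth objective; grad_coord, lipschitz and beta are placeholder names for the quantities computed in the first line of the listing, and the stopping test is omitted.

import numpy as np

def parallel_cd(grad_coord, n, lipschitz, beta, tau, iters, x0, rng):
    # Generic parallel coordinate descent: at each iteration a random subset of tau
    # coordinates is drawn and each selected coordinate minimises its own separable
    # quadratic overapproximation, i.e. takes the step -grad_i / (beta * lipschitz[i]).
    x = x0.copy()
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)                        # nice sampling
        steps = {i: -grad_coord(x, i) / (beta * lipschitz[i]) for i in S}
        for i, h in steps.items():                                        # applied in parallel in practice
            x[i] += h
    return x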

The convergence properties of the Parallel Coordinate Descent Method (Algorithm 1) have been studied for quite general classes of functions, namely partially separable functions [11] and Nesterov separable functions [13]. The addition of a separable regulariser like the -norm or box constraints was also considered. However, in all cases, the analysis assumes that there exists a minimiser, which is not true in general for the Adaboost problem (1).

III-B Adaboost problem

In the following, we specialise the Parallel Coordinate Descent Method to the Adaboost problem (2). We begin by giving an ESO for the logarithm of the objective function.

Theorem 1.

Let be the maximum number of nonzero elements in a row of the matrix, that is

Let us denote

( if ) and

The function has a coordinate-wise Lipschitz gradient with constants such that

and if one chooses a -nice sampling , then

Proof:

By [18], Section 4.4, we know that can be written as

where is the simplex of and is 1-strongly convex on for the 1-norm.

This shows that is Nesterov separable of degree in the sense of [13], and so the Lipschitz constants are given by Theorem 2 in [13]. Moreover, by Theorem 6 in [13], if one chooses a -nice sampling , ESO. ∎
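For the reader's convenience, the max-structure invoked in this proof can be written out. The display below is a reconstruction of the standard conjugacy between the log-sum-exp function and the entropy; the symbols A, x, mu and Delta_m are introduced here because the original notation was lost in extraction:

\log\Big( \tfrac{1}{m} \sum_{j=1}^{m} e^{-(Ax)_j} \Big)
  = \max_{\mu \in \Delta_m} \langle \mu, -Ax \rangle - \sum_{j=1}^{m} \mu_j \log(m \mu_j),
\qquad
\Delta_m = \Big\{ \mu \in \mathbb{R}^m_{\ge 0} : \sum_{j=1}^{m} \mu_j = 1 \Big\}.

The entropy-like term on the right-hand side is 1-strongly convex on the simplex for the 1-norm (by Pinsker's inequality), which is the property used in the proof.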

We can now state the Parallel Coordinate Descent Method for the Adaboost problem (2), since the minimiser of the ESO for and a -nice sampling at is given by such that for all , .

  Compute and as in Theorem 1.
  for  do
     Randomly generate following sampling .
     for  do in parallel
        
        
     end for
     if  then
        
     end if
  end for
Algorithm 2 Parallel Adaboost
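As for Algorithm 1, a Python paraphrase of Algorithm 2 is given below to compensate for the symbols lost in extraction. The ESO parameter of Theorem 1 is simply passed as an argument beta, L[i] = max_j A[j, i]^2 plays the role of the coordinate-wise Lipschitz constants, A is a dense NumPy array, and the stopping test of the listing is omitted.

import numpy as np
from scipy.special import softmax

def parallel_adaboost(A, beta, tau, iters, rng):
    # Parallel coordinate descent on f(x) = log((1/m) * sum_j exp(-(A x)_j)).
    m, n = A.shape
    x = np.zeros(n)
    r = np.zeros(m)                              # residuals r = A x, kept up to date
    L = np.maximum((A ** 2).max(axis=0), 1e-12)  # coordinate-wise Lipschitz constants
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)   # nice sampling of tau coordinates
        p = softmax(-r)                              # p_j proportional to exp(-(A x)_j)
        for i in S:                                  # executed in parallel on tau processors
            h = (A[:, i] @ p) / (beta * L[i])        # step -grad_i f(x) / (beta * L[i])
            x[i] += h
            r += h * A[:, i]                         # keep the residuals consistent with x
    return x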

III-C Computational issues

The computation of these quantities is easy and can be done with a single read of the data.

For , we shall first compute for . Note that

There are divisions of integers and the multiplication of these terms. Paired as in the last expression, none of the terms to multiply is bigger than and with a reshuffling of the terms before the multiplication, one can easily get a numerically stable way of computing . Then, we just need to perform simple sums and comparisons with 1.

The gradient of is given by

where for all ,

To compute the gradient, one stores the residuals and updates them at each iteration, as well as the function values. If we start from the zero vector, no excessively large numbers appear in the computation. The value of the function can be updated in parallel through a reduction clause and used both for the computation of the gradient and for the stopping test in Algorithm 2.
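The bookkeeping described above can be illustrated with a column-wise sparse storage of the matrix; the sketch below is our own illustration of the idea, not the authors' OpenMP implementation.

import numpy as np
from scipy.sparse import csc_matrix
from scipy.special import logsumexp

def apply_updates(A_csc, r, updates):
    # 'updates' is a list of (coordinate, step) pairs.  Because A is stored column-wise
    # (CSC), updating coordinate i only touches the rows where column i is nonzero, and
    # the residuals r = A x are maintained incrementally instead of being recomputed.
    for i, h in updates:
        start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
        r[A_csc.indices[start:end]] += h * A_csc.data[start:end]
    # the sum over the examples is a natural parallel reduction
    # (an OpenMP reduction clause in a C implementation)
    logF = logsumexp(-r) - np.log(len(r))
    return r, logF

Starting from the zero vector gives zero residuals, so no large exponential appears in the first iterations.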

IV Convergence of parallel coordinate descent for the Adaboost problem

The proof of convergence follows the lines of [15], with additional technicalities due to the randomisation of the sampling and the introduction of the logarithm. For conciseness, we give the proof and the precise definition of the parameters in the appendix.

Theorem 2 gives a bound on the number of iterations needed for the Parallel Coordinate Descent Method (Algorithm 2) to return, with high probability, an -solution to the Adaboost problem (1).

Theorem 2.

Suppose . Partition the rows of into and , and suppose the axes of are ordered so that . Set to be the tightest axis-aligned rectangle such that , and . Then is compact, , has modulus of strong convexity over and .

Using these terms, choose an initial point , an accuracy , , a confidence level and iteration counter

Then the iterate of the Parallel Coordinate Descent Method (Algorithm 2) applied to with a -nice sampling is an -solution to the original Adaboost problem (1) with probability at least :

The iteration complexity is in , like for the classical Adaboost algorithm [14, 15]. The theorem also gives the theoretical parallelisation speedup factor of the method, which is equal to , where is given in Theorem 1 and is always smaller than . This means that, when one neglects communication costs, the algorithm is faster with processors than with one processor. The value of can be significantly smaller than when using a -nice sampling. For instance for the experiment on the URL reputation dataset, when .
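The exact expression of the speedup factor depends on the constant of Theorem 1, whose formula was lost in this extraction. As a hedged stand-in, the sketch below uses the ESO parameter given in [11] for the nice sampling and partially separable functions, which may differ from the constant actually derived in the paper; the numbers in the example are illustrative only and are not taken from the datasets of Section V.

def speedup_factor(n, omega, tau):
    # Theoretical parallelisation speedup tau / beta, with the ESO parameter
    # beta = 1 + (tau - 1) * (omega - 1) / max(1, n - 1) of [11] used as a stand-in
    # for the constant of Theorem 1.
    beta = 1.0 + (tau - 1) * (omega - 1) / max(1, n - 1)
    return tau / beta

# many coordinates and sparse rows give a nearly ideal speedup with 16 processors
print(speedup_factor(n=3_000_000, omega=500, tau=16))   # approximately 15.96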

We now give the convergence results in the case of weak learnability and attainability.

Proposition 1.

If , choosing

guarantees

Proposition 2.

If , choosing and

guarantees

More iterations are needed than with the greedy update, but here we do not need to find the coordinate of the gradient with the largest absolute value, which saves computational effort. However, in both cases, the parameters involved are not easily computable. Hence, the main interest of these convergence results is to show that convergence holds at the stated rates for any initial point.

V Numerical experiments

In this section, we compare the Parallel Coordinate Descent Method with three other algorithms available for the resolution of the Adaboost problem. We will not consider algorithms that solve a different problem like the ones presented in [9, 10].

We run our experiments on two freely available datasets: w8a [19] and URL reputation [20]. The w8a dataset is of medium scale; its feature matrix is sparse, but some rows have many nonzero elements. The URL reputation dataset is of large scale, with many more examples and features. We used 16 processors on a computer with Intel Xeon processors at 2.6 GHz and 128 GB RAM.

We give in Figures 1 and 2 the value of the objective function at each iteration for:

  • an asynchronous version of Parallel Coordinate Descent with independent sampling (Algorithm 2), based on the freely available code of [11]; the independent sampling is a good approximation of the nice sampling,

  • the fully parallel coordinate descent method [6],

  • the accelerated version of the fully parallel coordinate descent method [7],

  • the classical Adaboost algorithm (greedy coordinate descent); we performed the search for the largest absolute value of the gradient in parallel.

Fig. 1: Comparison of algorithms for the resolution of the Adaboost problem on the w8a dataset with 16 processors. Dotted line (green): fully parallel coordinate descent. Solid line (cyan): greedy coordinate descent. Solid line with crosses (red): Parallel Coordinate Descent with independent sampling. Dash-dotted line (blue): accelerated gradient.
Fig. 2: Comparison of algorithms for the resolution of the Adaboost problem on the URL reputation dataset with 16 processors (same colours as in Figure 1).

In both cases, Parallel Coordinate Descent with independent sampling is faster than the fully parallel coordinate descent because it benefits from larger steps. It is also faster than greedy coordinate descent: the latter needs to compute the whole gradient at each iteration, while only one directional derivative is actually used.

On the medium scale dataset w8a (Figure 1), the accelerated gradient is the fastest algorithm: it reaches a high accuracy within the 20 seconds allocated. However, for the large scale dataset URL reputation (Figure 2), Parallel Coordinate Descent is the fastest algorithm. It benefits from larger steps and, unlike the other algorithms, the computational cost of one of its iterations increases only moderately with the size of the problem.

We can see in Figure 3 that increasing the number of processors indeed accelerates the method and that the parallelisation speedup is nearly linear: the time needed to reach the objective value -1.8 decreases when more processors are used.

Fig. 3: Performance of the smoothed parallel coordinate descent method on the Adaboost problem for the URL reputation dataset; each curve corresponds to a different number of processors.

VI Conclusion

We showed in this paper that randomised parallel coordinate descent, developed in the general framework of the minimisation of composite, partially separable functions, is well suited to solving the Adaboost problem. We showed that the iteration complexity of the algorithm matches that of classical Adaboost and gave a computable formula for the theoretical parallelisation speedup factor. The numerical experiments demonstrate the efficiency of parallel coordinate descent with independent sampling, especially for large scale problems. Indeed, each directional derivative computed is actually used to update the optimisation variable, while at the same time the step lengths are rather large. The step lengths are controlled by the inverse of the ESO parameter, and they decrease more slowly than the number of processors increases. Hence, parallel coordinate descent combines qualities of the greedy coordinate descent and of the fully parallel coordinate descent, so that for large scale problems it outperforms the previously available algorithms.

References

  • [1] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
  • [2] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine learning, vol. 37, no. 3, pp. 297–336, 1999.
  • [3] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. 1–511.
  • [4] D. Zinovev, J. Furst, and D. Raicu, “Building an ensemble of probabilistic classifiers for lung nodule interpretation,” in 10th International Conference on Machine Learning and Applications and Workshops (ICMLA), vol. 2, 2011, pp. 155–161.
  • [5] A. Assareh, L. G. Volkert, and J. Li, “Interaction trees: Optimizing ensembles of decision trees for gene-gene interaction detections,” in 11th International Conference on Machine Learning and Applications (ICMLA), vol. 1, 2012, pp. 616–621.
  • [6] M. Collins, R. E. Schapire, and Y. Singer, “Logistic regression, adaboost and bregman distances,” Machine Learning, vol. 48, no. 1-3, pp. 253–285, 2002.
  • [7] I. Mukherjee, K. Canini, R. Frongillo, and Y. Singer, “Parallel boosting with momentum,” 2013.
  • [8] K. Zeng, Y. Tang, and F. Liu, “Parallelization of adaboost algorithm through hybrid mpi/openmp and transactional memory,” in 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP). IEEE, 2011, pp. 94–100.
  • [9] S. Merler, B. Caprile, and C. Furlanello, “Parallelizing adaboost by weights dynamics,” Computational Statistics & Data Analysis, vol. 51, no. 5, pp. 2487–2498, 2007.
  • [10] I. Palit and C. K. Reddy, “Scalable and parallel boosting with mapreduce,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904–1916, 2012.
  • [11] P. Richtárik and M. Takáč, “Parallel coordinate descent methods for big data optimization problems,” arXiv:1212.0873, November 2012.
  • [12] M. Takác, A. Bijral, P. Richtárik, and N. Srebro, “Mini-batch primal and dual methods for SVMs,” in 30th International Conference on Machine Learning, 2013.
  • [13] O. Fercoq and P. Richtárik, “Smooth minimization of nonsmooth functions by parallel coordinate descent,” 2013, arXiv:1309.5885.
  • [14] I. Mukherjee, C. Rudin, and R. E. Schapire, “The rate of convergence of adaboost,” arXiv preprint arXiv:1106.6024, 2011.
  • [15] M. Telgarsky, “A primal-dual convergence analysis of boosting,” The Journal of Machine Learning Research, vol. 13, pp. 561–606, 2012.
  • [16] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors),” The annals of statistics, vol. 28, no. 2, pp. 337–407, 2000.
  • [17] C. Shen and H. Li, “On the dual formulation of boosting algorithms,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 12, pp. 2216–2231, 2010.
  • [18] Y. Nesterov, “Smooth minimization of nonsmooth functions,” Mathematical Programming, vol. 103, pp. 127–152, 2005.
  • [19] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. MIT Press, 1999, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
  • [20] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious urls: an application of large-scale online learning,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 681–688, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
  • [21] P. Richtárik and M. Takáč, “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function,” Mathematical Programming (doi: 10.1007/s10107-012-0614-z), preprint: April/July 2011.

Appendix: Proof of the iteration complexity

Let be the distance from point to set in the -norm:

We will denote by an arbitrary element of .

will denote the expectation conditional on the previous choices of coordinates.

Definition 2 ([15]).

denotes the hard core of : the collection of examples which receive positive weight under some dual feasible point, a distribution upon which no weak learner is correlated with the labels. Symbolically,

We shall partition by rows into two matrices and , where has rows corresponding to , and .

Proposition 3.

Let us denote the iterate of Parallel Adaboost (Algorithm 2). For any compact set , let

If , then

where is defined in Theorem 1.

Proof:

Let be such that for all . Then we have where .

The stopping criterion guarantees that for all , :

But by Theorem 1, the th iterate of Parallel Adaboost, , satisfies

which implies by definition of and that

Proposition 4.

Let and a compact set such that be given. Then is strongly convex over and taking to be the modulus of strong convexity, for any ,

Proof:

The optimisation problem

attains its minimum by compactness of the feasible set and continuity of the objective function.

where

so by Cauchy-Schwarz, and there is equality if and only if for all . Hence, the objective is zero if and only if for all , which would imply for all and . We conclude that the optimal value, which is the modulus of strong convexity , is positive.

For the second part of the proposition, we remark that is a linear space, so and we can replace in the proof of Lemma 6.8 in [15] by to get

We continue by noting that, as , if , then :

The last inequality uses the fact that the difference of two norms is at most the norm of their difference. The result follows by definition of . ∎

Proof:

We follow the lines of Telgarsky's proof [15]. The main differences are the use of the 2-norm instead of the infinity norm in the definition of the problem-dependent quantities, and the stochastic sampling instead of a deterministic choice of coordinates.

Theorem 5.9 in [15] is still valid in our context, so that and the form of gives . Thus,

For the left term,

where we used the fact that (Theorem 5.9 in [15]). Hence

For the right term, as in [15], using the fact that the objective values never increase with Parallel Adaboost,

Let . As the level sets of and are equal, one can reuse the argument on 0-coercivity in [15] for . Hence there exists an axis-aligned rectangle containing and such that . Moreover, by Proposition 4,

We can now merge both estimates as in [15], using Lemma G.2 of [15]:

Let . Combining this with Proposition 3, we get

Theorem 1 in [21] with , implies that, given and , if

then

But for ,

We take and we get the result. ∎

We also have all the tools to state the convergence results in the case of weak learnability (Proposition 1) and attainability (Proposition 2): the proofs are easy adaptations of [15].