I Introduction
The Adaboost algorithm, introduced by Freund and Schapire [1], is a widely used classification algorithm. Its goal is to combine many weak hypotheses with high error rates into a single strong hypothesis with very low error. The algorithm is equivalent to the minimisation of the exponential loss by the greedy coordinate descent method [2]. At each iteration, it selects the classifier whose weight update decreases the loss the most and updates that single weight accordingly. The weights of the other classifiers are left unchanged.
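To make the coordinate descent view concrete, the greedy update can be sketched as follows. This is an illustrative sketch, not the paper's algorithm or notation: we assume a margin matrix `A` with entries `A[j, i] = y_j * h_i(x_j)` in {-1, +1}, and use the classical Adaboost step length for ±1-valued hypotheses.

```python
import numpy as np

def adaboost_greedy(A, T=50):
    """Greedy coordinate descent on the exponential loss.

    A is an m-by-n margin matrix with entries A[j, i] = y_j * h_i(x_j)
    in {-1, +1} (illustrative names, not the paper's notation).
    """
    m, n = A.shape
    lam = np.zeros(n)            # weights of the weak hypotheses
    for _ in range(T):
        w = np.exp(-A @ lam)     # unnormalised example weights
        w /= w.sum()
        corr = A.T @ w           # edge (correlation) of each hypothesis
        i = np.argmax(np.abs(corr))
        r = corr[i]              # edge of the selected hypothesis
        # classical Adaboost step length; only coordinate i changes
        lam[i] += 0.5 * np.log((1 + r) / (1 - r))
    return lam
```

Only the selected coordinate of `lam` changes at each iteration, which is exactly the greedy coordinate descent interpretation of [2].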
Adaboost has found a large number of applications; to name a few, we may cite face detection and recognition [3], cancer detection by interpretation of radiographs [4], and gene-gene interaction detection [5]. The original algorithm is intrinsically sequential, but several parallel versions have been developed.
Collins, Schapire and Singer [6] give a version of the algorithm where all the coordinates are updated in parallel. They prove the convergence of this fully parallel coordinate descent under the assumption that the 1-norm of each row of the feature matrix is smaller than 1. One may relax this assumption by requiring instead that every element of the matrix have absolute value smaller than 1 and by dividing the step length by the maximum number of nonzero elements in a row, which we will denote .
In the context of support vector machines, Mukherjee et al. [7] interpreted the fully parallel coordinate descent method as a gradient method and designed an accelerated algorithm using accelerated gradient techniques. The same approach is possible for Adaboost, and we give numerical experiments in Section V. Another approach to parallelisation is proposed in [8]: the authors keep the Adaboost algorithm unchanged but parallelise the inner operations.
Finally, [9] proposed to solve a different problem that gives a result similar to Adaboost but can be solved by a parallel algorithm. However, they need to initialise the algorithm with iterations of the sequential Adaboost and they only give empirical evidence that the number of iterations required is small.
Palit and Reddy [10] first partition the coordinates so that each processor gets a subset of the data. Each processor solves the Adaboost problem on its part of the data and the results are then merged. The algorithm involves very little communication between processors, but the authors only provided a proof of convergence in the case of two processors.
In this paper, we propose a new parallel version of Adaboost based on recent work on parallel coordinate descent. In [11], Richtárik and Takáč introduced a general parallel coordinate descent method for the minimisation of composite functions of the form , where is convex, partially separable of degree and has a coordinate-wise Lipschitz gradient, and is a convex, nonsmooth and separable function, e.g. the norm. They provided convergence results for this algorithm together with a theoretical parallelisation speedup factor. They obtained the best speedups for randomised coordinate descent methods, which means that the coordinates are chosen according to a random sampling of . They showed [12] that this algorithm is very well suited to the resolution of Support Vector Machine problems.
The exponential loss does not fit in this framework because it does not have a Lipschitz gradient. However, Fercoq and Richtárik [13] showed that the parallel coordinate descent method can also be applied, in the context of nonsmooth functions with max-structure, to so-called Nesterov separable functions.
We show in Theorem 1 that the logarithm of the exponential loss is Nesterov separable, which allows us to define the parallel coordinate descent method for the Adaboost problem (Algorithm 2). Then, we prove the convergence of the algorithm (Theorem 2) and give its iteration complexity, building on the iteration complexity of the classical Adaboost [14, 15]. Finally, we provide numerical examples on learning problems of various sizes.
II The Adaboost problem
Let be a matrix of features and be a vector of labels. We denote by the matrix such that
In this paper, we may accept . We will write the coordinates as indices and the sequence indices as superscripts. Hence is the coordinate of the element of the vector-valued sequence .
The Adaboost problem is the minimisation of the exponential loss [16]:
(1) 
Let be the following empirical risk function
We denote the optimal value of the Adaboost problem (1) by
It will be convenient to consider the following equivalent objective function with Lipschitz gradient
and its associated Adaboost problem
(2) 
As the logarithm is monotone, problems (1) and (2) are equivalent. Moreover both are convex optimisation problems. This version of the Adaboost problem has a nice dual problem involving the entropy function [17] and the Lipschitz continuity of the gradient of will be useful to define the Parallel Coordinate Descent Method.
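The equivalence of problems (1) and (2) can be checked numerically; moreover, evaluating the logarithm of the loss via the log-sum-exp trick avoids overflow for large margins. This is an illustrative sketch with assumed names (`A`, `lam`), not the paper's code:

```python
import numpy as np

def exp_loss(A, lam):
    """Exponential loss: sum_j exp(-(A lam)_j)."""
    return np.exp(-A @ lam).sum()

def log_exp_loss(A, lam):
    """Logarithm of the exponential loss, computed stably.

    Minimising this is equivalent to minimising exp_loss (the log is
    monotone), but this function has a Lipschitz-continuous gradient,
    which the exponential loss itself lacks.
    """
    z = -A @ lam
    zmax = z.max()
    # log-sum-exp trick: factor out the largest exponent
    return zmax + np.log(np.exp(z - zmax).sum())
```

For moderate margins the two agree up to the logarithm; for large margins `exp_loss` overflows while `log_exp_loss` stays finite.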
III Parallel coordinate descent
III-A General case
In this section, we present the Parallel Coordinate Descent Method introduced by Richtárik and Takáč in [11].
For , we denote by the norm such that .
At each iteration of the parallel coordinate descent method, one needs to select which coordinates will be updated. One may choose the coordinates to update in a given deterministic way but it is convenient to randomise this choice of variables. Several samplings, i.e., laws for randomly choosing subsets of variables of , are considered in [11].
We will focus in this paper on the nice sampling . It corresponds to the case where we have processors updating coordinates in parallel and each subset of with coordinates has the same probability of being selected:
A good approximation of the nice sampling for is the independent sampling where each processor selects the coordinate it will update following a uniform law, independently of the others.
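The two samplings can be sketched as follows. This is an illustrative sketch with our own names; `tau` plays the role of the number of processors:

```python
import numpy as np

def tau_nice_sample(n, tau, rng):
    """Nice sampling: a uniformly random subset of {0,...,n-1}
    of exactly tau coordinates."""
    return set(rng.choice(n, size=tau, replace=False))

def independent_sample(n, tau, rng):
    """Independent sampling: each of the tau processors picks a
    coordinate uniformly at random, independently of the others.
    Collisions are possible, so the set may have fewer than tau
    elements."""
    return set(rng.integers(0, n, size=tau))
```

When `tau` is much smaller than `n`, collisions are rare, which is why the independent sampling is a good approximation of the nice sampling.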
The choice of the sampling has consequences on the complexity estimates. More precisely, the parallel coordinate descent method relies on the concept of Expected Separable Overapproximation (ESO) to compute the updates, and the ESO depends on the sampling. We denote here by the vector of such that if and otherwise.

Definition 1 ([11]).
Let , and be a sampling. We say that admits an Expected Separable Overapproximation with respect to if for all ,
(3) 
We denote ESO for simplicity.
As the overapproximation is separable, one can find a minimiser with respect to by solving independent optimisation problems that return for . In fact, we do not even need to compute the coordinates of that are not needed afterwards, i.e. we only compute for .
The convergence properties of the Parallel Coordinate Descent Method (Algorithm 1) have been studied for quite general classes of functions, namely partially separable functions [11] and Nesterov separable functions [13]. The addition of a separable regulariser like the norm or box constraints was also considered. However, in all cases, the analysis assumes that there exists a minimiser, which is not true in general for the Adaboost problem (1).
III-B Adaboost problem
In the following, we specialise the Parallel Coordinate Descent Method to the Adaboost problem (2). We begin by giving an ESO for the logarithm of the objective function.
Theorem 1.
Let be the maximum number of nonzero elements in a row of the matrix , that is
Let us denote
( if ) and
The function has a coordinate-wise Lipschitz gradient with constants such that
and if one chooses a nice sampling , then
Proof:
By [18], Section 4.4, we know that can be written as
where is the simplex of and is 1-strongly convex on for the 1-norm.
We can now state the Parallel Coordinate Descent Method for the Adaboost problem (2), since the minimiser of the ESO for and a nice sampling at is given by such that for all , .
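A generic step of the method can be sketched as below. This is a sketch of a generic quadratic-ESO update with assumed names (`beta`, `w`, a gradient oracle); the paper's Algorithm 2 uses the specific constants given in Theorem 1, which may differ:

```python
import numpy as np

def pcdm_step(x, grad, beta, w, S):
    """One update of a parallel coordinate descent method under a
    quadratic ESO with parameters (beta, w).

    Minimising the separable overapproximation at x gives, for each
    sampled coordinate i, the step h_i = -grad[i] / (beta * w[i]);
    coordinates outside the sampled set S are left unchanged.
    The coordinates in S can be updated by independent processors.
    """
    x = x.copy()
    for i in S:
        x[i] -= grad[i] / (beta * w[i])
    return x
```

On a simple smooth objective such as `0.5 * ||x||^2` (gradient `x`, `beta = 1`, `w = 1`), a step simply zeroes the sampled coordinates.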
III-C Computational issues
Computation of is easy and can be done with a single read of the data.
For , we shall first compute for . Note that
There are divisions of integers and the multiplication of these terms. Paired as in the last expression, none of the terms to multiply is bigger than and with a reshuffling of the terms before the multiplication, one can easily get a numerically stable way of computing . Then, we just need to perform simple sums and comparisons with 1.
The gradient of is given by
where for all ,
To compute the gradient, one stores the residuals and updates them at each iteration, as well as the function values . If we start with , no large numbers appear in . The value of the function can be updated in parallel by a reduction clause and used both for the computation of the gradient and for the test in Algorithm 2.
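The residual update described above can be sketched as follows. Names are illustrative; in practice, with a sparse feature matrix, the columns would be stored in compressed form so that each processor touches only the nonzero entries of its column:

```python
import numpy as np

def update_residuals(A, z, h, S):
    """Maintain the residual vector z = A x across a coordinate update.

    After the step x[i] += h[i] for each i in S, the residual changes
    only through the columns of A indexed by S, so each processor can
    add its own column contribution instead of recomputing A x from
    scratch.
    """
    z = z.copy()
    for i in S:
        z += A[:, i] * h[i]
    return z
```

With a sparse matrix, each column contribution costs only as many operations as the column has nonzero entries, which is why the per-iteration cost grows only moderately with the problem size.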
IV Convergence of parallel coordinate descent for the Adaboost problem
The proof of convergence follows the lines of [15] with additional technicalities due to the randomisation of the samplings and the introduction of the logarithm. For conciseness of this paper, we give the proof and the precise definition of the parameters in the appendix.
Theorem 2 gives a bound on the number of iterations needed for the Parallel Coordinate Descent Method (Algorithm 2) to return, with high probability, an solution to the Adaboost problem (1).
Theorem 2.
Suppose . Partition the rows of into and , and suppose the axes of are ordered so that . Set to be the tightest axis-aligned rectangle such that , and . Then is compact, , has modulus of strong convexity over and .
The iteration complexity is in , as for the classical Adaboost algorithm [14, 15]. The theorem also gives the theoretical parallelisation speedup factor of the method, which is equal to , where is given in Theorem 1 and is always smaller than . This means that, when one neglects communication costs, the algorithm is faster with processors than with one processor. The value of can be significantly smaller than when using a nice sampling. For instance, for the experiment on the URL reputation dataset, when .
We now give the convergence results in the case of weak learnability and attainability.
Proposition 1.
If , choosing
grants
Proposition 2.
If , choosing and
grants
More iterations are needed than with the greedy update, but here we do not need to find the coordinate of the gradient with the largest absolute value, which saves computational effort. However, in both cases, the parameters and are not easily computable. Hence, the main interest of these convergence results is to show that convergence holds in (or ) for any initial point.
V Numerical experiments
In this section, we compare the Parallel Coordinate Descent Method with three other algorithms available for solving the Adaboost problem. We will not consider algorithms that solve a different problem, such as those presented in [9, 10].
We run our experiments on two freely available datasets: w8a [19] and URL reputation [20]. The w8a dataset is of medium scale: it has examples and features. The feature matrix is sparse but some rows have many nonzero elements, so that . The URL reputation dataset is large scale: examples and features. The maximum number of nonzero elements in a row is . We used 16 processors on a computer with Intel Xeon processors at 2.6 GHz and 128 GB of RAM.
In both cases, Parallel Coordinate Descent with independent sampling is faster than the fully parallel coordinate descent because it benefits from larger steps (the ratio is ). It is also faster than the greedy coordinate descent: the greedy method needs to compute the whole gradient at each iteration even though only one directional derivative is actually used.
On the medium scale dataset w8a (Figure 1), the accelerated gradient is the fastest algorithm: it reaches high accuracy in the 20 seconds allocated. However, for the large scale dataset URL reputation (Figure 2), Parallel Coordinate Descent is the fastest algorithm. It benefits from larger steps () but also, unlike the other algorithms, the computational complexity of one iteration only moderately increases when the size of the problem increases.
We can see in Figure 3 that increasing the number of processors indeed accelerates the method and that the parallelisation speedup is nearly linear: the time needed to reach the value 1.8 decreases when more processors are used.
VI Conclusion
We showed in this paper that randomised parallel coordinate descent, developed in the general framework of the minimisation of composite, partially separable functions, is well suited to solving the Adaboost problem. We showed that the iteration complexity of the algorithm is of the order and gave a computable formula for the theoretical parallelisation speedup factor. The numerical experiments demonstrate the efficiency of parallel coordinate descent with independent sampling, especially for large scale problems. Indeed, every directional derivative that is computed is actually used to update the optimisation variable, while at the same time the step lengths are rather large. The step lengths are controlled by the inverse of the parameter and they decrease more slowly than the number of processors increases. Hence, parallel coordinate descent combines the qualities of the greedy coordinate descent and of the fully parallel coordinate descent, so that for large scale problems it outperforms any previously available algorithm.
References
 [1] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
 [2] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
 [3] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. 1–511.
 [4] D. Zinovev, J. Furst, and D. Raicu, “Building an ensemble of probabilistic classifiers for lung nodule interpretation,” in 10th International Conference on Machine Learning and Applications and Workshops (ICMLA), vol. 2, 2011, pp. 155–161.
 [5] A. Assareh, L. G. Volkert, and J. Li, “Interaction trees: Optimizing ensembles of decision trees for gene-gene interaction detections,” in 11th International Conference on Machine Learning and Applications (ICMLA), vol. 1, 2012, pp. 616–621.
 [6] M. Collins, R. E. Schapire, and Y. Singer, “Logistic regression, adaboost and bregman distances,” Machine Learning, vol. 48, no. 1–3, pp. 253–285, 2002.
 [7] I. Mukherjee, K. Canini, R. Frongillo, and Y. Singer, “Parallel boosting with momentum,” 2013.
 [8] K. Zeng, Y. Tang, and F. Liu, “Parallelization of adaboost algorithm through hybrid mpi/openmp and transactional memory,” in 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP). IEEE, 2011, pp. 94–100.
 [9] S. Merler, B. Caprile, and C. Furlanello, “Parallelizing adaboost by weights dynamics,” Computational Statistics & Data Analysis, vol. 51, no. 5, pp. 2487–2498, 2007.
 [10] I. Palit and C. K. Reddy, “Scalable and parallel boosting with mapreduce,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904–1916, 2012.
 [11] P. Richtárik and M. Takáč, “Parallel coordinate descent methods for big data optimization problems,” arXiv:1212.0873, November 2012.
 [12] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro, “Mini-batch primal and dual methods for SVMs,” in 30th International Conference on Machine Learning, 2013.
 [13] O. Fercoq and P. Richtárik, “Smooth minimization of nonsmooth functions by parallel coordinate descent,” 2013, arXiv:1309.5885.
 [14] I. Mukherjee, C. Rudin, and R. E. Schapire, “The rate of convergence of adaboost,” arXiv preprint arXiv:1106.6024, 2011.
 [15] M. Telgarsky, “A primal-dual convergence analysis of boosting,” The Journal of Machine Learning Research, vol. 13, pp. 561–606, 2012.
 [16] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors),” The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
 [17] C. Shen and H. Li, “On the dual formulation of boosting algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2216–2231, 2010.
 [18] Y. Nesterov, “Smooth minimization of non-smooth functions,” Mathematical Programming, vol. 103, pp. 127–152, 2005.
 [19] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. MIT Press, 1999, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
 [20] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious urls: an application of large-scale online learning,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 681–688, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
 [21] P. Richtárik and M. Takáč, “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function,” Mathematical Programming (doi: 10.1007/s10107-012-0614-z), preprint: April/July 2011.
Appendix: Proof of the iteration complexity
Let be the distance from point to set in the norm:
We will denote by an arbitrary element of .
will denote the expectation conditioned on the previous choices of coordinates .
Definition 2 ([15]).
denotes the hard core of : the collection of examples which receive positive weight under some dual feasible point, i.e. a distribution for which no weak learner is correlated with the labels. Symbolically,
We shall partition by rows into two matrices and , where has rows corresponding to , and .
Proposition 3.
Proof:
Let be such that for all . Then we have where .
The stopping criterion guarantees that for all , :
But by Theorem 1, the th iterate of Parallel Adaboost, , satisfies
which implies by definition of and that
∎
Proposition 4.
Let and a compact set such that be given. Then is strongly convex over and taking to be the modulus of strong convexity, for any ,
Proof:
The optimisation problem
attains its minimum by compactness of the feasible set and continuity of the objective function.
where
so by Cauchy–Schwarz, and there is equality if and only if for all . Hence, the objective is zero if and only if for all , which would imply for all and . We conclude that the optimal value, which is the modulus of strong convexity , is positive.
For the second part of the proposition, we remark that is a linear space, so and we can replace in the proof of Lemma 6.8 in [15] by to get
We continue by noting that, as , if , then :
The last inequality uses the fact that the difference of norms is smaller than the norm of the difference. The result follows by definition of . ∎
Proof:
We follow the lines of Telgarsky's proof [15]. The main differences are the 2-norm instead of the infinity norm in the definition of the problem-dependent quantities and the stochastic sampling instead of a deterministic one.
Theorem 5.9 in [15] is still valid in our context, so that and the form of gives . Thus,
For the right term, as in [15], using the fact that the objective values never increase with Parallel Adaboost,
Let . As the level sets of and are equal, one can reuse the argument on 0-coercivity in [15] for . Hence there exists an axis-aligned rectangle containing and such that . Moreover, by Proposition 4,
Let . Combining this with Proposition 3, we get
Theorem 1 in [21] with , implies that, given and , if
then
But for ,
We take and we get the result. ∎