1 Introduction
Within supervised learning, the goal of multilabel classification is to train models to annotate objects with a subset of labels taken from a set of candidate labels. Typical applications include the determination of topics addressed in a text document, the identification of object categories present within an image, or the prediction of biological properties of a gene. In many applications, the number of candidate labels may be very large, ranging from hundreds to hundreds of thousands
[2] and often even exceeding the sample size [12]. The very large scale nature of the output space in such problems poses both statistical and computational challenges that need to be specifically addressed. A simple approach to solve multilabel classification problems, called binary relevance, is to independently train a binary classifier for each label. Several more complex schemes have however been proposed to take into account the dependencies between the labels (see, e.g.
[28, 19, 9, 33, 10, 39]). In the context of tree-based methods, one way is to train multi-output trees [4, 18, 23], i.e. trees that can predict multiple outputs at once. With respect to single-output trees [7], the score measure used in multi-output trees to choose splits is taken as the sum of the individual scores corresponding to the different labels (e.g., variance reduction) and each leaf is labeled with a vector of values, each coding for the probability of presence of one label. With respect to binary relevance, the multi-output tree approach has the advantage of building a single model for all labels. It can thus potentially take into account label dependencies and reduce memory requirements for the storage of the models. An extensive experimental comparison
[25] shows that this approach compares favorably with other approaches, including non tree-based methods, both in terms of accuracy and computing times. In addition, multi-output trees inherit all intrinsic advantages of tree-based methods, such as robustness to irrelevant features, interpretability through feature importance scores, or fast computations of predictions, that make them very attractive to address multilabel problems. The computational complexity of learning multi-output trees is however similar to that of the binary relevance method. Both approaches are indeed $O(pdn\log n)$, where $p$ is the number of input features, $d$ the number of candidate output labels, and $n$ the sample size; this is a limiting factor when dealing with large sets of candidate labels.

One generic approach to reduce computational complexity is to apply some compression technique prior to the training stage to reduce the number of outputs to a number $m$ much smaller than the total number $d$ of labels. A model can then be trained to make predictions in the compressed output space and a prediction in the original label space can be obtained by decoding the compressed prediction. As multilabel vectors are typically very sparse, one can expect a drastic dimensionality reduction by using appropriate compression techniques. This idea has been explored for example in [19] using compressed sensing, and in [10] using Bloom filters, in both cases using regularized linear models as base learners. This approach obviously reduces computing times for training the model. At the prediction stage however, the predicted compressed output needs to be decoded, which adds computational cost and can also introduce further decoding errors.
In this paper, we explore the use of random output space projections for large-scale multilabel classification in the context of tree-based ensemble methods. We first explore the idea proposed for linear models in [19] with random forests: a (single) random projection of the multilabel vector onto an $m$-dimensional random subspace is computed and then a multi-output random forest is grown based on score computations using the projected outputs. We exploit however the fact that the approximation provided by a tree ensemble is a weighted average of output vectors from the training sample to avoid the decoding stage: at training time all leaf labels are directly computed in the original multilabel space. We show theoretically and empirically that when $m$ is large enough, ensembles grown on such random output spaces are equivalent to ensembles grown on the original output space. When $d$ is large enough compared to $m$, this idea hence may reduce computing times at the learning stage without affecting accuracy and computational complexity of predictions.
Next, we propose to exploit the randomization inherent to the projection of the output space as a way to obtain randomized trees in the context of ensemble methods: each tree in the ensemble is thus grown from a different, randomly projected subspace of dimension $m$. As previously, labels at leaf nodes are directly computed in the original output space to avoid the decoding step. We show, theoretically, that this idea can lead to better accuracy than the first idea and, empirically, that best results are obtained on many problems with very low values of $m$, which leads to significant computing time reductions at the learning stage. In addition, we study the interaction between input randomization (à la Random Forests) and output randomization (through random projections), showing that there is an interest, both in terms of predictive performance and in terms of computing times, to optimally combine these two ways of randomization. All in all, the proposed approach constitutes a very attractive way to address large-scale multilabel problems with tree-based ensemble methods.
The rest of the paper is structured as follows: Section 2 reviews properties of multioutput tree ensembles and of random projections; Section 3 presents the proposed algorithms and their theoretical properties; Section 4 provides the empirical validations, whereas Section 5 discusses our work and provides further research directions.
2 Background
We denote by $\mathcal{X}$ an input space and by $\mathcal{Y}$ an output space; without loss of generality, we suppose that $\mathcal{X} = \mathbb{R}^p$ (where $p$ denotes the number of input features) and that $\mathcal{Y} = \mathbb{R}^d$ (where $d$ is the dimension of the output space). We denote by $P_{\mathcal{X},\mathcal{Y}}$ the joint (unknown) sampling density over $\mathcal{X} \times \mathcal{Y}$.
Given a learning sample $\left((x^i, y^i) \in (\mathcal{X} \times \mathcal{Y})\right)_{i=1}^{n}$ of $n$ observations in the form of input-output pairs, a supervised learning task is defined as searching for a function $f^* : \mathcal{X} \to \mathcal{Y}$ in a hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ that minimizes the expectation of some loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ over the joint distribution of input / output pairs:
$$f^* \in \arg\min_{f \in \mathcal{H}} E_{P_{\mathcal{X},\mathcal{Y}}}\left\{\ell(f(x), y)\right\}.$$
NOTATIONS: Superscript indices ($x^i$, $y^i$) denote the (input, output) vectors of the $i$-th observation. Subscript indices (e.g. $x_j$, $y_j$) denote components of vectors.
2.1 Multi-output tree ensembles
A classification or a regression tree [7] is built using all the input-output pairs as follows: for each node at which the subsample size is greater than or equal to a pre-pruning parameter $n_{\min}$, the best split is chosen among the $p$ input features combined with the selection of an optimal cut point. The best sample split $s^*$ of the local subsample $\mathcal{L}$ maximizes the average reduction of impurity
$$\Delta I(\mathcal{L}, s) = I(\mathcal{L}) - \frac{|\mathcal{L}_l|}{|\mathcal{L}|}\, I(\mathcal{L}_l) - \frac{|\mathcal{L}_r|}{|\mathcal{L}|}\, I(\mathcal{L}_r) \qquad (1)$$
where $\mathcal{L}_l$ and $\mathcal{L}_r$ denote the subsamples sent to the left and right successor nodes by the split $s$.
Finally, leaf statistics are obtained by aggregating the outputs of the samples reaching that leaf.
In this paper, for multi-output trees, we use the sum of the variances of the $d$ dimensions of the output vector as the impurity measure $I(\cdot)$. It can be computed by (see Appendix A, in the supplementary material^{1}^{1}1 static.ajoly.org/files/ecml2014supplementary.pdf)
$$\mathrm{Var}(\mathcal{L}) = \frac{1}{|\mathcal{L}|} \sum_{i \in \mathcal{L}} \left\| y^i - \frac{1}{|\mathcal{L}|} \sum_{j \in \mathcal{L}} y^j \right\|^2 \qquad (2)$$
$$\phantom{\mathrm{Var}(\mathcal{L})} = \frac{1}{2\,|\mathcal{L}|^2} \sum_{i \in \mathcal{L}} \sum_{j \in \mathcal{L}} \left\| y^i - y^j \right\|^2 \qquad (3)$$
Furthermore, we compute the vectors of output statistics by component-wise averaging. Notice that, when the outputs are vectors of binary class labels (i.e. $\mathcal{Y} = \{0, 1\}^d$), as in multilabel classification, the variance reduces to the so-called Gini index, and the leaf statistics then estimate a vector of conditional probabilities $\left(P(y_j = 1 \mid x \in \text{leaf})\right)_{j=1}^{d}$, from which a prediction can be made by thresholding.

Tree-based ensemble methods build an ensemble of $t$ randomized trees. Unseen samples are then predicted by aggregating the predictions of all $t$ trees. Random Forests [6] build each tree on a bootstrap copy of the learning sample [6] and by optimising the split at each node over a locally generated random subset of size $k$ among the $p$ input features. Extra Trees [17] use the complete learning sample and optimize the split over a random subset of size $k$ of the $p$ features combined with a random selection of cut points. Setting the parameter $k$ to the number of input features allows to filter out irrelevant features; larger $n_{\min}$ yields simpler trees possibly at the price of higher bias, and the higher $t$ the smaller the variance of the resulting predictor.
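As an illustration of the impurity measure described above, here is a minimal NumPy sketch; the function names and the toy node are ours, not part of the paper's implementation:

```python
import numpy as np

def multioutput_variance(Y):
    """Impurity I(L): sum over the d output dimensions of the empirical
    variance of that dimension in the node's subsample."""
    return float(np.sum(np.var(Y, axis=0)))

def impurity_reduction(Y, left_mask):
    """Average variance reduction of a candidate split of subsample Y."""
    n = len(Y)
    Y_left, Y_right = Y[left_mask], Y[~left_mask]
    return (multioutput_variance(Y)
            - len(Y_left) / n * multioutput_variance(Y_left)
            - len(Y_right) / n * multioutput_variance(Y_right))

# Toy multi-label node: 4 observations, 3 candidate binary labels.
Y = np.array([[1, 0, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

# For 0/1 labels, each per-dimension variance p(1-p) is a Gini index:
# here I(L) = 0.25 + 0.25 + 0.25 = 0.75.
impurity = multioutput_variance(Y)

# A split isolating the first two observations is pure on labels 0 and 1,
# leaving only label 2's variance (0.25) in each child.
gain = impurity_reduction(Y, np.array([True, True, False, False]))
```

Note that componentwise averaging of these 0/1 outputs at a leaf directly yields the vector of conditional label probabilities mentioned in the text.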
2.2 Random projections
In this paper we apply the idea of random projections to samples of vectors of the output space $\mathcal{Y} = \mathbb{R}^d$. With this in mind, we recall the Johnson-Lindenstrauss lemma (reduced to linear maps), while using our notations.
Lemma 1
Johnson-Lindenstrauss lemma [20]. Given $\epsilon > 0$ and an integer $n$, let $m$ be a positive integer such that $m \geq 4 \left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \ln n$. For any sample $(y^i)_{i=1}^{n}$ of $n$ points in $\mathbb{R}^d$, there exists a matrix $\Phi \in \mathbb{R}^{m \times d}$ such that for all $i, j \in \{1, \dots, n\}$:
$$(1 - \epsilon) \left\| y^i - y^j \right\|^2 \leq \left\| \Phi y^i - \Phi y^j \right\|^2 \leq (1 + \epsilon) \left\| y^i - y^j \right\|^2 \qquad (4)$$
Moreover, when $m$ is sufficiently large, several random matrices satisfy (4) with high probability. In particular, we can consider Gaussian matrices, whose elements are drawn i.i.d. in $\mathcal{N}(0, 1/m)$, as well as (sparse) Rademacher matrices, whose elements are drawn in $\left\{-\sqrt{s/m},\ 0,\ +\sqrt{s/m}\right\}$ with probabilities $\left\{\frac{1}{2s},\ 1 - \frac{1}{s},\ \frac{1}{2s}\right\}$, where $s$ controls the sparsity of $\Phi$ [1, 24].
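For illustration, both families of matrices are easy to generate with NumPy. The sketch below (our own helper name and arbitrary dimensions) also checks the distance-preservation condition (4) empirically on one pair of points:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 400  # original and projected output dimensions (arbitrary)

# Gaussian projection: entries drawn i.i.d. from N(0, 1/m).
phi_gauss = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))

def sparse_rademacher(m, d, s, rng):
    """(Sparse) Rademacher matrix: entries +sqrt(s/m) or -sqrt(s/m) with
    probability 1/(2s) each, and 0 otherwise (s = 1 gives the dense
    Rademacher case; larger s gives sparser matrices)."""
    signs = rng.choice([-1.0, 0.0, 1.0], size=(m, d),
                       p=[1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)])
    return np.sqrt(s / m) * signs

phi_rad = sparse_rademacher(m, d, s=3, rng=rng)

# Empirical check of (4) on one pair: the squared distance between two
# projected points should be close to the original squared distance.
Y = rng.random((2, d))
orig_dist2 = np.sum((Y[0] - Y[1]) ** 2)
proj_dist2 = np.sum((phi_gauss @ (Y[0] - Y[1])) ** 2)
distortion = proj_dist2 / orig_dist2  # close to 1 for large m
```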
Notice that if some $\Phi$ satisfies (4) for the whole learning sample, it obviously satisfies (4) for any subsample that could reach a node during regression tree growing. On the other hand, since we are not concerned in this paper with the ‘reconstruction’ problem, we do not need to make any sparsity assumption ‘à la compressed sensing’.
3 Methods
We first present how we propose to exploit random projections to reduce the computational burden of learning single multioutput trees in very highdimensional output spaces. Then we present and compare two ways to exploit this idea with ensembles of trees. Subsection 3.3 analyses these two ways from the bias/variance point of view.
3.1 Multi-output regression trees in randomly projected output spaces
The multi-output single tree algorithm described in Section 2 requires the computation of the sum of variances in (2) at each tree node and for each candidate split. When the output space is very high-dimensional, this computation constitutes the main computational bottleneck of the algorithm. We thus propose to approximate variance computations by using random projections of the output space. The multi-output regression tree algorithm is modified as follows (denoting by $\mathcal{L} = ((x^i, y^i))_{i=1}^{n}$ the learning sample):

1. First, a projection matrix $\Phi$ of dimension $m \times d$ is randomly generated.

2. A new dataset $\mathcal{L}_m = ((x^i, \Phi y^i))_{i=1}^{n}$ is constructed by projecting each learning sample output using the projection matrix $\Phi$.

3. A tree (structure) $\mathcal{T}$ is grown using the projected learning sample $\mathcal{L}_m$.

4. Predictions $\hat{y}$ at each leaf of $\mathcal{T}$ are computed using the corresponding outputs in the original output space.
The resulting tree is exploited in the standard way to make predictions: an input vector is propagated through the tree until it reaches a leaf from which a prediction in the original output space is directly retrieved.
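The steps above can be sketched in NumPy as follows. To keep the example short, the "tree" is reduced to a single split and the split search is naive; this is an illustration of the procedure under our own toy dimensions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def best_split(X, Yp):
    """Exhaustive search for the split maximizing variance reduction,
    with all scores computed on the PROJECTED outputs Yp."""
    total = np.sum(np.var(Yp, axis=0))
    best = (0, 0.0, -np.inf)
    for f in range(X.shape[1]):
        xs = np.sort(X[:, f])
        for c in (xs[:-1] + xs[1:]) / 2:  # midpoints as candidate cuts
            left = X[:, f] <= c
            if not left.any() or left.all():
                continue
            score = (total
                     - left.mean() * np.sum(np.var(Yp[left], axis=0))
                     - (~left).mean() * np.sum(np.var(Yp[~left], axis=0)))
            if score > best[2]:
                best = (f, c, score)
    return best[0], best[1]

# Step 1: random projection matrix of dimension m x d.
n, p, d, m = 40, 3, 50, 10
X = rng.random((n, p))
Y = (rng.random((n, d)) < 0.1).astype(float)  # sparse multi-label outputs
phi = rng.normal(0.0, 1.0 / np.sqrt(m), (m, d))

# Step 2: project the learning-sample outputs.
Yp = Y @ phi.T

# Step 3: grow the tree (here: a single split, for brevity) on (X, Yp).
f, c = best_split(X, Yp)

# Step 4: label the leaves with averages in the ORIGINAL output space,
# so no decoding is ever needed at prediction time.
left = X[:, f] <= c
leaf = {True: Y[left].mean(axis=0), False: Y[~left].mean(axis=0)}

# Prediction: propagate an input to a leaf, read off a vector in R^d.
x_new = rng.random(p)
y_hat = leaf[bool(x_new[f] <= c)]
```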
If $\Phi$ satisfies (4), the following theorem shows that the variance computed in the projected subspace is an $\epsilon$-approximation of the variance computed over the original space.
Theorem 3.1
Given $\epsilon > 0$, a sample $(y^i)_{i=1}^{n}$ of $n$ points $y^i \in \mathbb{R}^d$, and a projection matrix $\Phi \in \mathbb{R}^{m \times d}$ such that condition (4) holds for all $i, j \in \{1, \dots, n\}$, we have also:
$$(1 - \epsilon)\, \mathrm{Var}\!\left((y^i)_{i=1}^{n}\right) \leq \mathrm{Var}\!\left((\Phi y^i)_{i=1}^{n}\right) \leq (1 + \epsilon)\, \mathrm{Var}\!\left((y^i)_{i=1}^{n}\right) \qquad (5)$$
Proof
See Appendix B, supplementary material.
As a consequence, any split score approximated from the randomly projected output space will be $\epsilon$-close to the unprojected score in any subsample of the complete learning sample. Thus, if condition (4) is satisfied for a sufficiently small $\epsilon$, then the tree grown from the projected data will be identical to the tree grown from the original data^{2}^{2}2Strictly speaking, this is only the case when the optimum scores of test splits as computed over the original output space are isolated, i.e. when there is only one single best split, no tie.
For a given size $m$ of the projection subspace, the complexity is reduced from $O(d)$ to $O(m)$ for the computation of one split score, and thus from $O(pdn\log n)$ to $O(pmn\log n)$ for the construction of one full (balanced) tree, where one can expect $m$ to be much smaller than $d$ and at worst of $O(\epsilon^{-2}\log n)$. The whole procedure requires to generate the projection matrix and to project the training data. These two steps are respectively $O(md)$ and $O(nmd)$, but they can often be significantly accelerated by exploiting the sparsity of the projection matrix and/or of the original output data, and they are carried out only once before growing the tree.
All in all, this means that when $d$ is sufficiently large, the random projection approach may allow us to significantly reduce tree building complexity from $O(pdn\log n)$ to $O(pmn\log n)$ (plus the one-time projection cost), without impact on predictive accuracy (see Section 4 for empirical results).
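As a quick numerical sanity check of Theorem 3.1 (with our own toy dimensions): the total variance computed on the $m$ projected outputs stays close to the variance computed on the $d$ original ones, while each evaluation touches only $m$ components:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 200, 2000, 300  # arbitrary sizes for illustration

Y = (rng.random((n, d)) < 0.05).astype(float)    # sparse binary label vectors
phi = rng.normal(0.0, 1.0 / np.sqrt(m), (m, d))  # Gaussian projection

var_orig = np.sum(np.var(Y, axis=0))          # each evaluation is O(nd)
var_proj = np.sum(np.var(Y @ phi.T, axis=0))  # each evaluation is O(nm)
rel_err = abs(var_proj - var_orig) / var_orig  # small for large enough m
```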
3.2 Exploitation in the context of tree ensembles
The idea developed in the previous section can be directly exploited in the context of ensembles of randomized multi-output regression trees. Instead of building a single tree from the projected learning sample, one can grow a randomized ensemble of them. This “shared subspace” algorithm is described in pseudo-code in Algorithm 1.
Another idea is to exploit the random projections so as to introduce a novel kind of diversity among the different trees of an ensemble. Instead of building all the trees of the ensemble from a same shared output-space projection, one could instead grow each tree in the ensemble from a different output-space projection. Algorithm 2 implements this idea in pseudo-code. The randomization introduced by the output space projection can of course be combined with any existing randomization scheme to grow ensembles of trees. In this paper, we will consider the combination of random projections with the randomizations already introduced in Random Forests and Extra Trees. The interplay between these different randomizations will be discussed theoretically in the next subsection by a bias/variance analysis, and empirically in Section 4. Note that while, when looking at single trees or shared ensembles, the size $m$ of the projected subspace should not be too small so that condition (4) is satisfied, the optimal value of $m$ when projections are randomized at each tree is likely to be smaller, as suggested by the bias/variance analysis in the next subsection.
From the computational point of view, the main difference between these two ways of transposing random-output projections to ensembles of trees is that in the case of Algorithm 2, the generation of the projection matrix and the computation of projected outputs are carried out $t$ times, while this is done only once in the case of Algorithm 1. These aspects will be empirically evaluated in Section 4.
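The contrast between the two variants can be sketched as follows. The "trees" are reduced to single splits and the other randomizations (bootstrapping, input-feature sampling) are omitted, so this is only a structural illustration of shared versus per-tree projections, with our own helper names and toy sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

def grow_tree(X, Y, phi):
    """One depth-1 'tree': split scores are computed on the projected
    outputs Y @ phi.T, but leaves store means of the ORIGINAL outputs."""
    Yp = Y @ phi.T
    total = np.sum(np.var(Yp, axis=0))
    best = None
    for f in range(X.shape[1]):
        xs = np.sort(X[:, f])
        c = (xs[len(xs) // 2 - 1] + xs[len(xs) // 2]) / 2  # median cut
        left = X[:, f] <= c
        gain = (total
                - left.mean() * np.sum(np.var(Yp[left], axis=0))
                - (~left).mean() * np.sum(np.var(Yp[~left], axis=0)))
        if best is None or gain > best[0]:
            best = (gain, f, c)
    _, f, c = best
    left = X[:, f] <= c
    mu = {True: Y[left].mean(axis=0), False: Y[~left].mean(axis=0)}
    return lambda x: mu[bool(x[f] <= c)]

def ensemble(X, Y, m, t, shared, rng):
    """shared=True mimics Algorithm 1 (one projection for all t trees);
    shared=False mimics Algorithm 2 (a fresh projection per tree)."""
    d = Y.shape[1]
    phi = rng.normal(0.0, 1.0 / np.sqrt(m), (m, d))
    trees = []
    for _ in range(t):
        if not shared:
            phi = rng.normal(0.0, 1.0 / np.sqrt(m), (m, d))
        trees.append(grow_tree(X, Y, phi))
    return lambda x: np.mean([tree(x) for tree in trees], axis=0)

n, p, d = 60, 4, 30
X = rng.random((n, p))
Y = (rng.random((n, d)) < 0.2).astype(float)
f1 = ensemble(X, Y, m=5, t=10, shared=True, rng=rng)   # Algorithm 1
f2 = ensemble(X, Y, m=5, t=10, shared=False, rng=rng)  # Algorithm 2
y_hat = f2(rng.random(p))
```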
3.3 Bias/variance analysis
In this subsection, we adapt the bias/variance analysis carried out in [17] to take into account random output projections. The details of the derivations are reported in Appendix C (supplementary material).
Let us denote by $f_{\mathcal{L},\phi,\epsilon}(\cdot)$ a single multi-output tree obtained from a learning sample $\mathcal{L}$ and a projection matrix $\phi$ (below we use $\phi$ to denote the corresponding random variable), where $\epsilon$ is the value of a random variable capturing the random perturbation scheme used to build this tree (e.g., bootstrapping and/or random input space selection). The square error of this model at some point $x \in \mathcal{X}$ is defined by:
$$Err\!\left(f_{\mathcal{L},\phi,\epsilon}(x)\right) \stackrel{\text{def}}{=} E_{Y|x}\!\left\{\left\|Y - f_{\mathcal{L},\phi,\epsilon}(x)\right\|^2\right\},$$
and its average can be decomposed into its residual error, (squared) bias, and variance terms:
$$E_{\mathcal{L},\phi,\epsilon}\!\left\{Err\!\left(f_{\mathcal{L},\phi,\epsilon}(x)\right)\right\} = \sigma_R^2(x) + B^2(x) + V(x),$$
where the variance term $V(x)$ can be further decomposed as the sum of the following three terms:
$$V(x) = V_{\mathcal{L}}(x) + V_{Algo}(x) + V_{Proj}(x),$$
that measure errors due to the randomness of, respectively, the learning sample, the tree algorithm, and the output space projection (Appendix C, supplementary material).
Approximations computed respectively by Algorithms 1 and 2 take the following forms:
$$f_1(x) = \frac{1}{t} \sum_{i=1}^{t} f_{\mathcal{L},\phi,\epsilon_i}(x) \qquad f_2(x) = \frac{1}{t} \sum_{i=1}^{t} f_{\mathcal{L},\phi_i,\epsilon_i}(x),$$
where $(\epsilon_1, \dots, \epsilon_t)$ and $(\phi_1, \dots, \phi_t)$ are vectors of i.i.d. values of the random variables $\epsilon$ and $\phi$, respectively.
We are interested in comparing the average errors of these two algorithms, where the average is taken over all random parameters (including the learning sample). We show (Appendix C) that these can be decomposed as follows:
$$E\!\left\{Err\!\left(f_1(x)\right)\right\} = \sigma_R^2(x) + B^2(x) + V_{\mathcal{L}}(x) + \frac{V_{Algo}(x)}{t} + V_{Proj}(x),$$
$$E\!\left\{Err\!\left(f_2(x)\right)\right\} = \sigma_R^2(x) + B^2(x) + V_{\mathcal{L}}(x) + \frac{V_{Algo}(x) + V_{Proj}(x)}{t}.$$
From this result, it is hence clear that Algorithm 2 cannot be worse, on average, than Algorithm 1. If the additional computational burden needed to generate a different random projection for each tree is not problematic, then Algorithm 2 should always be preferred to Algorithm 1.
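A small simulation makes the mechanism visible: if we caricature each tree's prediction as the target plus projection noise, averaging $t$ trees leaves the noise variance untouched when the projection is shared, but divides it by $t$ when the projection is redrawn per tree. The noise model and sizes here are ours, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
t, trials = 20, 2000

# Each tree's prediction deviates from the mean prediction by projection
# noise of variance V_proj = 1 (tree- and sample-level noise omitted).
shared = np.empty(trials)
fresh = np.empty(trials)
for i in range(trials):
    eps = rng.normal()                      # Algorithm 1: one draw shared by all t trees
    shared[i] = np.mean([eps] * t)          # averaging identical noise changes nothing
    fresh[i] = np.mean(rng.normal(size=t))  # Algorithm 2: a fresh draw per tree

var_shared = shared.var()  # close to V_proj = 1
var_fresh = fresh.var()    # close to V_proj / t = 0.05
```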
For a fixed level of tree randomization (a fixed distribution of $\epsilon$), whether the additional randomization brought by random projections could be beneficial in terms of predictive performance remains an open question that will be addressed empirically in the next section. Nevertheless, with respect to an ensemble grown from the original output space, one can expect that the output projections will always increase the bias term, since they disturb the algorithm in its objective of reducing the errors on the learning sample. For small values of $m$, the average error will therefore decrease (with a sufficiently large number of trees) only if the increase in bias is compensated by a decrease of variance.
The value of $m$, the dimension of the projected subspace, that will lead to the best tradeoff between bias and variance will hence depend both on the level of tree randomization and on the learning problem. The more (resp. less) tree randomization, the higher (resp. the lower) the optimal value of $m$ could be, since both randomizations affect bias and variance in the same direction.
4 Experiments
4.1 Accuracy assessment protocol
We assess the accuracy of the predictors for multilabel classification on a test sample (TS) by the “Label Ranking Average Precision (LRAP)” [25], expressed by
$$\mathrm{LRAP}(\hat{f}) = \frac{1}{|TS|} \sum_{i \in TS} \frac{1}{|y^i|} \sum_{j\,:\, y^i_j = 1} \frac{\left|\mathcal{L}^i_j(y^i)\right|}{\left|\mathcal{L}^i_j(\mathbf{1}_d)\right|} \qquad (6)$$
where $\hat{f}_k(x^i)$ is the probability (or the score) associated to the label $k$ by the learnt model applied to $x^i$, $\mathbf{1}_d$ is a $d$-dimensional row vector of ones, and
$$\mathcal{L}^i_j(q) = \left\{ k : q_k = 1 \text{ and } \hat{f}_k(x^i) \geq \hat{f}_j(x^i) \right\}.$$
Test samples without any relevant labels (i.e. with $|y^i| = 0$) were discarded prior to computing the average precision. The best possible average precision is thus 1. Notice that we use indifferently the notation $|\cdot|$ to express the cardinality of a set or the $l_1$ norm of a vector.
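For concreteness, here is a direct NumPy transcription of this measure (it matches the semantics of scikit-learn's label ranking average precision); the function name and toy arrays are ours:

```python
import numpy as np

def lrap(Y_true, scores):
    """Label Ranking Average Precision: for each relevant label, the
    fraction of labels ranked at or above it (by the model's scores)
    that are themselves relevant, averaged over labels then samples."""
    vals = []
    for y, s in zip(Y_true, scores):
        rel = np.flatnonzero(y)
        if rel.size == 0:
            continue  # samples with no relevant label are discarded
        prec = []
        for j in rel:
            at_or_above = s >= s[j]
            prec.append(np.sum(at_or_above & (y > 0)) / np.sum(at_or_above))
        vals.append(np.mean(prec))
    return float(np.mean(vals))

Y = np.array([[1, 0, 0], [0, 0, 1]])
S = np.array([[0.75, 0.5, 1.0],   # relevant label ranked 2nd of 3 -> 1/2
              [1.0, 0.2, 0.1]])   # relevant label ranked 3rd of 3 -> 1/3
score = lrap(Y, S)                # (1/2 + 1/3) / 2 = 5/12
```

A perfect ranking (all relevant labels scored above all irrelevant ones) gives a value of 1.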
(Figure 1, caption fragment:) The curves represent average values (and standard deviations) obtained from 10 applications of the randomised algorithms over a same single learning/testing split.

Table 1: Characteristics of the 24 datasets: learning sample size $n_{LS}$, test sample size $n_{TS}$, number of input features $p$, and number of candidate labels $d$.

Name  $n_{LS}$  $n_{TS}$  $p$  $d$
emotions  391  202  72  6  
scene  1211  1196  2407  6  
yeast  1500  917  103  14  
tmc2017  21519  7077  49060  22  
genbase  463  199  1186  27  
reuters  2500  5000  19769  34  
medical  333  645  1449  45  
enron  1123  579  1001  53  
mediamill  30993  12914  120  101  
YeastGO  2310  1155  5930  132  
bibtex  4880  2515  1836  159  
CAL500  376  126  68  174  
WIPO  1352  358  74435  188  
EURLex (subj.)  19348  10cv  5000  201  
bookmarks  65892  21964  2150  208  
diatoms  2065  1054  371  359  
corel5k  4500  500  499  374  
EURLex (dir.)  19348  10cv  5000  412  
SCOPGO  6507  3336  2003  465  
delicious  12920  3185  500  983  
druginteraction  1396  466  660  1554  
proteininteraction  1165  389  876  1862  
ExpressionGO  2485  551  1288  2717  
EURLex (desc.)  19348  10cv  5000  3993 
4.2 Effect of the size of the Gaussian output space
To illustrate the behaviour of our algorithms, we first focus on the “Delicious” dataset [32], which has a large number of labels ($d = 983$), of input features ($p = 500$), and of training ($n_{LS} = 12920$) and testing ($n_{TS} = 3185$) samples.
The left part of Figure 1 shows, when Gaussian output-space projections are combined with the standard CART algorithm building a single tree, how the precision converges (cf. Theorem 3.1) when $m$ increases towards $d$. We observe that in this case convergence is reached around $m \approx 200$, at the expense of a slight decrease of accuracy, so that a compression factor of about 5 is possible with respect to the original output dimension $d = 983$.
The right part of Figure 1 shows, on the same dataset, how the method behaves when combined with Random Forests. Let us first notice that the Random Forests grown on the original output space (green line) are significantly more accurate than the single trees, their accuracy being almost twice as high. We also observe that Algorithm 2 (orange curve) converges much more rapidly than Algorithm 1 (blue curve), needing far fewer components to converge, and slightly outperforms the Random Forest grown on the original output space. These results are in accordance with the analysis of Section 3.3, showing that Algorithm 2 cannot be inferior to Algorithm 1. In the rest of this paper we will therefore focus on Algorithm 2.
4.3 Systematic analysis over 24 datasets
To assess our methods, we have collected 24 different multilabel classification datasets from the literature (see Section D of the supplementary material for more information and bibliographic references to these datasets), covering a broad spectrum of application domains and ranges of the output dimension $d$ (see Table 1). For 21 of the datasets, we made experiments where the dataset is split randomly into a learning set of size $n_{LS}$ and a test set of size $n_{TS}$, repeated 10 times (to get average precisions and standard deviations), and for 3 of them we used a ten-fold cross-validation scheme (see Table 1).
Table 1 shows our results on the 24 multilabel datasets, comparing Random Forests learnt on the original output space with those learnt by Algorithm 2 combined with Gaussian subspaces of size $m = 1$ and $m = \ln d$^{3}^{3}3$\ln d$ is rounded to the nearest integer value; in Table 1 the values of $m = \ln d$ vary between 2 for $d = 6$ and 8 for $d = 3993$.. In these experiments, the three parameters $k$, $n_{\min}$ and $t$ of Random Forests are set respectively to $\sqrt{p}$ and $1$ (default values, see [17]) and $100$ (reasonable computing budget). Each model is learnt ten times on a different shuffled train/testing split, except for the 3 EURLex datasets where we kept the original 10 folds of cross-validation.
We observe that for all datasets (except maybe SCOPGO), taking $m = \ln d$ leads to an average precision similar to that of standard Random Forests, i.e. no difference superior to one standard deviation of the error. On 11 datasets, we see that $m = 1$ already yields a similar average precision (values not underlined in the $m = 1$ column). For the 13 remaining datasets, increasing $m$ to $\ln d$ significantly decreases the gap with the Random Forests baseline, and 3 more datasets reach this baseline. We also observe that on several datasets, such as “Druginteraction” and “SCOPGO”, better performance on the Gaussian subspace is attained with high output randomization ($m = 1$) than with $m = \ln d$. We thus conclude that the optimal level of output randomization (i.e. the optimal value of the ratio $m/d$) which maximizes accuracy performances is dataset dependent.
While our method is intended for tasks with very high-dimensional output spaces, we notice that even with relatively small numbers of labels, its accuracy remains comparable to the baseline, provided $m$ is suitably chosen.
To complete the analysis, Appendix F considers the same experiments with a different base learner (Extra Trees of [17]), showing very similar trends.
4.4 Input vs output space randomization
We study in this section the interaction of the additional randomization of the output space with that concerning the input space already built in the Random Forest method.
To this end, we consider the “Druginteraction” dataset ($p = 660$ input features and $d = 1554$ output labels [38]), and we study the effect of the parameter $k$, controlling the input space randomization of the Random Forest method, combined with the randomization of the output space by Gaussian projections, controlled by the parameter $m$. Figure 2 shows the evolution of the accuracy for growing values of $k$ (i.e. decreasing strength of the input space randomization), for three different, quite low values of $m$. We observe that Random Forests learned on a very low-dimensional Gaussian subspace (red, blue and pink curves) yield essentially better performances than Random Forests on the original output space, and also that their behaviour with respect to the parameter $k$ is quite different. On this dataset, the output-space randomisation makes the method completely immune to the ‘overfitting’ phenomenon observed for high values of $k$ with the baseline method (green curve).
We refer the reader to a similar study on the “Delicious” dataset given in Appendix E (supplementary material), which shows that the interaction between $k$ and $m$ may differ from one dataset to another. It is thus advisable to jointly optimize the values of $k$ and $m$, so as to maximise the tradeoff between accuracy and computing times in a problem and algorithm specific way.
4.5 Alternative output dimension reduction techniques
In this section, we study Algorithm 2 when it is combined with alternative outputspace dimensionality reduction techniques. We focus again on the “Delicious” dataset, but similar trends could be observed on other datasets.
Figure 3(a) first compares Gaussian random projections with two other dense projections: Rademacher matrices with $s = 1$ (cf. Section 2.2) and compression matrices obtained by subsampling (without replacement) Hadamard matrices [8]. We observe that Rademacher and subsample-Hadamard subspaces behave very similarly to Gaussian random projections.
In a second step, we compare Gaussian random projections with two (very) sparse projections: first, sparse Rademacher subspaces obtained by setting the sparsity parameter $s$ so as to select respectively about 33% and 2% of the original outputs to compute each component, and second, subsampled identity subspaces, similar to [34], where each of the $m$ selected components corresponds to a randomly chosen original label; the latter also preserve sparsity. Sparse projections are very interesting from a computational point of view, as they require far fewer operations to compute the projections, but the number of components required for condition (4) to be satisfied is typically higher than for dense projections [24, 8]. Figure 3(b) compares these three projection methods with standard Random Forests on the “Delicious” dataset. All three projection methods converge to plain Random Forests as the number of components increases, but their behaviour at low values of $m$ is very different. The sparser Rademacher variant converges faster than the denser one and, interestingly, reaches its optimum at a low $m$, where it improves over the Random Forests baseline. Random output subspaces converge more slowly, but they lead to a notable improvement of the score over baseline Random Forests. This suggests that, although their theoretical guarantees are weaker, sparse projections actually provide on this problem a better bias/variance tradeoff than dense ones when used in the context of Algorithm 2.
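A sketch of the subsampled-identity construction (helper name and dimensions are ours) makes the computational appeal concrete: the projection matrix has a single nonzero per row, and the projected outputs stay sparse:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 500, 25  # arbitrary sizes for illustration

def subsampled_identity(m, d, rng):
    """Each of the m rows is a distinct, scaled row of the d x d identity:
    the 'projection' simply keeps m randomly chosen original labels,
    which also preserves the sparsity of the output vectors."""
    idx = rng.choice(d, size=m, replace=False)
    phi = np.zeros((m, d))
    phi[np.arange(m), idx] = np.sqrt(d / m)
    return phi

phi = subsampled_identity(m, d, rng)

Y = (rng.random((100, d)) < 0.05).astype(float)  # sparse multi-label outputs
Yp = Y @ phi.T

# Sparsity pays twice: phi has only m nonzeros (versus m*d for a dense
# Gaussian matrix), and the projected outputs remain sparse as well.
density_phi = np.count_nonzero(phi) / phi.size
density_Yp = np.count_nonzero(Yp) / Yp.size
```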
Another popular dimension reduction technique is principal component analysis (PCA). In Figure 3(c), we repeat the same experiment to compare PCA with Gaussian random projections. For PCA, the curve is generated by adding components in decreasing order of eigenvalues, according to their contribution to the explanation of the output-space variance. We observe that this approach is far less effective than the random projection techniques studied previously.
4.6 Learning stage computing times
Our implementation of the learning algorithms is based on the scikit-learn Python package, version 0.14-dev [27]. To fix ideas about computing times, we report those obtained on the “Delicious” dataset on a Mac Pro 4.1 with a dual Quad-Core Intel Xeon processor at 2.26 GHz. Matrix operations, such as random projections, are performed with the BLAS and LAPACK from the Mac OS X Accelerate framework. Reported times are obtained by summing the user and sys times of the time UNIX utility.
The reported timings correspond to the following operations: (i) load the dataset in memory, (ii) execute the algorithm. All methods use the same code to build trees. In these conditions, learning a Random Forest on the original output space (with the parameter settings of Section 4.3) takes 3348 s, while learning the same model on Gaussian output spaces of increasing sizes requires respectively 236 s, 311 s and 1088 s. Generating a Gaussian subspace and projecting the output data of the training samples takes between 0.07 s and 1 s, depending on the subspace size. The time needed to compute the projections is thus negligible with respect to the time needed for the tree construction.
We see that a speedup of an order of magnitude could be obtained, while at the same time preserving accuracy with respect to the baseline Random Forests method. Equivalently, for a fixed computing time budget, randomly projecting the output space allows to build more trees and thus to improve predictive performances with respect to standard Random Forests.
5 Conclusions
This paper explores the use of random output space projections combined with tree-based ensemble methods to address large-scale multilabel classification problems. We study two algorithmic variants that either build a tree-based ensemble model on a single shared random subspace or build each tree in the ensemble on a newly drawn random subspace. The second approach is shown theoretically and empirically to always outperform the first in terms of accuracy. Experiments on 24 datasets show that on most problems, using Gaussian projections allows a very drastic reduction of the size of the output space, and therefore of computing times, without affecting accuracy. Remarkably, we also show that by jointly adjusting the level of input and output randomization and choosing the projection method appropriately, one can also improve predictive performance over standard Random Forests, while still very significantly reducing computing times. As future work, it would be very interesting to propose efficient techniques to automatically adjust these parameters, so as to reach the best tradeoff between accuracy and computing times on a given problem.
To the best of our knowledge, our work is the first to study random output projections in the context of multi-output tree-based ensemble methods. The possibility with these methods to relabel tree leaves with predictions in the original output space makes this combination very attractive. Indeed, unlike similar works with linear models [19, 10], our approach relies only on the Johnson-Lindenstrauss lemma, and not on any output sparsity assumption, and it does not require the use of any output reconstruction method. Besides multilabel classification, we would like to test our method on other, not necessarily sparse, multi-output prediction problems.
Acknowledgements.
Arnaud Joly is research fellow of the FNRS, Belgium. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office.
References
 [1] Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 671–687 (2003)
 [2] Agrawal, R., Gupta, A., Prabhu, Y., Varma, M.: Multilabel learning with millions of labels: recommending advertiser bid phrases for web pages. In: Proceedings of the 22nd international conference on World Wide Web. pp. 13–24. International World Wide Web Conferences Steering Committee (2013)
 [3] Barutcuoglu, Z., Schapire, R.E., Troyanskaya, O.G.: Hierarchical multilabel prediction of gene function. Bioinformatics 22(7), 830–836 (2006)
 [4] Blockeel, H., De Raedt, L., Ramon, J.: Top-down induction of clustering trees. In: Proceedings of ICML 1998. pp. 55–63 (1998)
 [5] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004)
 [6] Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
 [7] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth (1984)
 [8] Candes, E.J., Plan, Y.: A probabilistic and ripless theory of compressed sensing. Information Theory, IEEE Transactions on 57(11), 7235–7254 (2011)
 [9] Cheng, W., Hüllermeier, E., Dembczynski, K.J.: Bayes optimal multilabel classification via probabilistic classifier chains. In: Proceedings of the 27th international conference on machine learning (ICML10). pp. 279–286 (2010)
 [10] Cisse, M.M., Usunier, N., Artières, T., Gallinari, P.: Robust bloom filters for large multilabel classification tasks. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 1851–1859 (2013)
 [11] Clare, A.: Machine learning and data mining for yeast functional genomics. Ph.D. thesis, University of Wales Aberystwyth, Aberystwyth, Wales, UK (2003)

 [12] Dekel, O., Shamir, O.: Multiclass-multilabel classification with more classes than examples. In: International Conference on Artificial Intelligence and Statistics. pp. 137–144 (2010)
 [13] Dimitrovski, I., Kocev, D., Loskovska, S., Džeroski, S.: Hierarchical classification of diatom images using ensembles of predictive clustering trees. Ecological Informatics 7(1), 19–29 (2012)
 [14] Diplaris, S., Tsoumakas, G., Mitkas, P.A., Vlahavas, I.: Protein classification with multiple algorithms. In: Advances in Informatics, pp. 448–456. Springer (2005)

 [15] Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.A.: Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In: Computer Vision — ECCV 2002, pp. 97–112. Springer (2002)
 [16] Elisseeff, A., Weston, J.: A kernel method for multilabelled classification. In: Advances in neural information processing systems. pp. 681–687 (2001)
 [17] Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 63(1), 3–42 (2006)
 [18] Geurts, P., Wehenkel, L., d'Alché-Buc, F.: Kernelizing the output of tree-based methods. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 345–352. ACM (2006)
 [19] Hsu, D., Kakade, S., Langford, J., Zhang, T.: Multilabel prediction via compressed sensing. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems 22, pp. 772–780 (2009)
 [20] Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 189–206 (1984)
 [21] Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for automated tag suggestion. In: Proceedings of the ECML/PKDD (2008)
 [22] Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification research. In: Machine Learning: ECML 2004, pp. 217–226. Springer (2004)
 [23] Kocev, D., Vens, C., Struyf, J., Dzeroski, S.: Tree ensembles for predicting structured outputs. Pattern Recognition 46(3), 817–833 (2013)
 [24] Li, P., Hastie, T.J., Church, K.W.: Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 287–296. ACM (2006)
 [25] Madjarov, G., Kocev, D., Gjorgjevikj, D., Dzeroski, S.: An extensive experimental comparison of methods for multilabel learning. Pattern Recognition 45(9), 3084–3104 (2012)
 [26] Mencía, E.L., Fürnkranz, J.: Efficient multilabel classification algorithms for large-scale problems in the legal domain. In: Semantic Processing of Legal Texts, pp. 192–215. Springer (2010)
 [27] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12, 2825–2830 (2011)
 [28] Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multilabel classification. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 5782, pp. 254–269. Springer Berlin Heidelberg (2009)
 [29] Rousu, J., Saunders, C., Szedmak, S., Shawe-Taylor, J.: Learning hierarchical multicategory text classification models. In: Proceedings of the 22nd International Conference on Machine Learning. pp. 744–751. ACM (2005)
 [30] Snoek, C.G.M., Worring, M., Van Gemert, J.C., Geusebroek, J.M., Smeulders, A.W.M.: The challenge problem for automated detection of 101 semantic concepts in multimedia. In: Proceedings of the 14th annual ACM international conference on Multimedia. pp. 421–430. ACM (2006)
 [31] Srivastava, A.N., Zane-Ulman, B.: Discovering recurring anomalies in text reports regarding complex space systems. In: Aerospace Conference, 2005 IEEE. pp. 3853–3862. IEEE (2005)
 [32] Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in domains with large number of labels. In: Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08). pp. 30–44 (2008)
 [33] Tsoumakas, G., Katakis, I., Vlahavas, I.P.: Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 23(7), 1079–1089 (2011)
 [34] Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble method for multilabel classification. In: Machine Learning: ECML 2007, pp. 406–417. Springer (2007)
 [35] Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.: Multilabel classification of music into emotions. In: ISMIR 2008: Proceedings of the 9th International Conference on Music Information Retrieval. p. 325. Lulu.com (2008)
 [36] Turnbull, D., Barrington, L., Torres, D., Lanckriet, G.: Semantic annotation and retrieval of music and sound effects. Audio, Speech, and Language Processing, IEEE Transactions on 16(2), 467–476 (2008)

 [37] Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for hierarchical multilabel classification. Machine Learning 73(2), 185–214 (2008)
 [38] Yamanishi, Y., Pauwels, E., Saigo, H., Stoven, V.: Extracting sets of chemical substructures and protein domains governing drug-target interactions. Journal of Chemical Information and Modeling 51(5), 1183–1194 (2011)
 [39] Zhou, T., Tao, D.: Multilabel subspace ensemble. In: International Conference on Artificial Intelligence and Statistics. pp. 1444–1452 (2012)
Appendix 0.A Proof of equation (3)
The sum of the variances of $n$ observations $y_1, \ldots, y_n \in \mathbb{R}^d$ drawn from a random vector can be interpreted as a (normalized) sum of squared Euclidean distances between the pairs of observations:

(a.1)  $\frac{1}{n}\sum_{i=1}^{n}\|y_i - \bar{y}\|^2 \;=\; \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|y_i - y_j\|^2, \qquad \text{where } \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$

Proof

Expanding the right-hand side gives

(a.2)  $\frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|y_i - y_j\|^2 = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\|y_i\|^2 - 2\langle y_i, y_j\rangle + \|y_j\|^2\right)$

(a.3)  $= \frac{1}{2n}\sum_{i=1}^{n}\|y_i\|^2 - \frac{1}{n^2}\Big\langle \sum_{i=1}^{n} y_i,\; \sum_{j=1}^{n} y_j \Big\rangle + \frac{1}{2n}\sum_{j=1}^{n}\|y_j\|^2$

(a.4)  $= \frac{1}{n}\sum_{i=1}^{n}\|y_i\|^2 - \|\bar{y}\|^2.$

Expanding the left-hand side yields the same quantity:

(a.5)  $\frac{1}{n}\sum_{i=1}^{n}\|y_i - \bar{y}\|^2 = \frac{1}{n}\sum_{i=1}^{n}\|y_i\|^2 - 2\langle \bar{y}, \bar{y}\rangle + \|\bar{y}\|^2 = \frac{1}{n}\sum_{i=1}^{n}\|y_i\|^2 - \|\bar{y}\|^2,$

which proves (a.1). ∎
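As a sanity check, this identity can be verified numerically. The following sketch uses arbitrary synthetic data (sizes and seed are illustrative):

```python
import numpy as np

# Numerical check of identity (a.1): the sum of per-component variances of n
# observations equals the normalized sum of squared pairwise distances.
rng = np.random.default_rng(0)
n, d = 30, 5
Y = rng.normal(size=(n, d))

lhs = np.sum(Y.var(axis=0))  # (1/n) * sum_i ||y_i - mean(y)||^2, summed over components
D = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # squared pairwise distances
rhs = D.sum() / (2 * n ** 2)

assert np.isclose(lhs, rhs)
```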
Appendix 0.B Proof of Theorem 1
Theorem 1. Given $\epsilon > 0$, a sample $(y_i)_{i=1}^{n}$ of $n$ points in $\mathbb{R}^d$, and a projection matrix $\Phi \in \mathbb{R}^{m \times d}$ such that for all pairs $i, j \in \{1, \ldots, n\}$ the Johnson-Lindenstrauss lemma holds, i.e.

$(1-\epsilon)\|y_i - y_j\|^2 \;\leq\; \|\Phi y_i - \Phi y_j\|^2 \;\leq\; (1+\epsilon)\|y_i - y_j\|^2,$

we have also:

(a.11)  $(1-\epsilon)\,\mathrm{Var}\big((y_i)_{i=1}^{n}\big) \;\leq\; \mathrm{Var}\big((\Phi y_i)_{i=1}^{n}\big) \;\leq\; (1+\epsilon)\,\mathrm{Var}\big((y_i)_{i=1}^{n}\big),$

where $\mathrm{Var}$ denotes the sum of the componentwise variances. Indeed, by equation (a.1) the total variance is a normalized sum of squared pairwise distances, each of which is preserved by $\Phi$ up to a factor $(1 \pm \epsilon)$.
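A quick numerical illustration of this variance-preservation property (the Gaussian projection, sample sizes, and seed here are illustrative choices): we compute the worst pairwise distortion $\epsilon$ actually achieved by one draw of $\Phi$, and check that the total variance of the projected sample falls within the corresponding $(1 \pm \epsilon)$ bounds.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, m = 50, 100, 20
Y = rng.normal(size=(n, d))
phi = rng.normal(size=(m, d)) / np.sqrt(m)  # Gaussian random projection matrix
Z = Y @ phi.T

# worst pairwise distortion achieved by this particular draw of phi
ratios = [np.sum((Z[i] - Z[j]) ** 2) / np.sum((Y[i] - Y[j]) ** 2)
          for i, j in combinations(range(n), 2)]
eps = max(max(ratios) - 1.0, 1.0 - min(ratios))

# total variances before and after projection
var_y = np.sum(Y.var(axis=0))
var_z = np.sum(Z.var(axis=0))
assert (1 - eps) * var_y <= var_z <= (1 + eps) * var_y
```

Since the total variance is an average of the pairwise squared distances, the final inequality holds deterministically once `eps` is computed from the realized distortions.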
Appendix 0.C Bias/variance analysis
In this appendix, we adapt the bias/variance analysis of randomized supervised learning algorithms carried out in [17] to assess the effect of random output projections in the context of the two algorithms studied in our paper.
0.C.1 Single random trees.
Let us denote by $f(x; ls, \phi, \epsilon)$ a single multioutput (random) tree obtained from a learning sample $ls$ and a projection matrix $\phi$ (below we use $\phi$ to denote the corresponding random variable), where $\epsilon$ is the value of a random variable capturing the random perturbation scheme used to build this tree (e.g., bootstrapping and/or random input space selection). The square error of this model at some point $x$ is defined by:

$\mathrm{Err}(f(x; ls, \phi, \epsilon)) = E_{Y|x}\{(Y - f(x; ls, \phi, \epsilon))^2\}.$

The average of this square error can be decomposed as follows:

$E_{ls,\phi,\epsilon}\{\mathrm{Err}(f(x; ls, \phi, \epsilon))\} = \sigma^2_R(x) + B^2(x) + V(x),$

where

$\sigma^2_R(x) = E_{Y|x}\{(Y - E_{Y|x}\{Y\})^2\}, \qquad B^2(x) = \big(E_{Y|x}\{Y\} - E_{ls,\phi,\epsilon}\{f(x; ls, \phi, \epsilon)\}\big)^2,$

and

$V(x) = \mathrm{Var}_{ls,\phi,\epsilon}\{f(x; ls, \phi, \epsilon)\}.$

The three terms of this decomposition are respectively the residual error, the bias, and the variance of this estimator (at $x$).

The variance term can be further decomposed as follows using the law of total variance:

(a.13)  $V(x) = \mathrm{Var}_{ls}\{E_{\phi,\epsilon|ls}\{f(x; ls, \phi, \epsilon)\}\} + E_{ls}\{\mathrm{Var}_{\phi,\epsilon|ls}\{f(x; ls, \phi, \epsilon)\}\}.$

The first term is the variance due to the learning sample randomization and the second term is the average variance (over $ls$) due to both the random forest randomization and the random output projection. By using the law of total variance a second time, the second term of (a.13) can be further decomposed as follows:

(a.14)  $E_{ls}\{\mathrm{Var}_{\phi,\epsilon|ls}\{f\}\} = E_{ls}\{\mathrm{Var}_{\phi|ls}\{E_{\epsilon|ls,\phi}\{f\}\}\} + E_{ls}\{E_{\phi|ls}\{\mathrm{Var}_{\epsilon|ls,\phi}\{f\}\}\}.$

The first term of this decomposition is the variance due to the random choice of a projection and the second term is the average variance due to the random forest randomization. Note that all these terms are non-negative. In what follows, we will denote these three terms respectively $V_{LS}(x)$, $V_{\phi}(x)$, and $V_{\epsilon}(x)$. We thus have:

$V(x) = V_{LS}(x) + V_{\phi}(x) + V_{\epsilon}(x),$

with

$V_{LS}(x) = \mathrm{Var}_{ls}\{E_{\phi,\epsilon|ls}\{f\}\}, \quad V_{\phi}(x) = E_{ls}\{\mathrm{Var}_{\phi|ls}\{E_{\epsilon|ls,\phi}\{f\}\}\}, \quad V_{\epsilon}(x) = E_{ls}\{E_{\phi|ls}\{\mathrm{Var}_{\epsilon|ls,\phi}\{f\}\}\}.$

0.C.2 Ensembles of random trees.

When the random projection is fixed for all trees in the ensemble (Algorithm 1), the algorithm computes an approximation, denoted $f_1(x; ls, \phi, \epsilon^t)$, that takes the following form:

$f_1(x; ls, \phi, \epsilon^t) = \frac{1}{t}\sum_{i=1}^{t} f(x; ls, \phi, \epsilon_i),$

where $\epsilon^t = (\epsilon_1, \ldots, \epsilon_t)$ is a vector of i.i.d. values of the random variable $\epsilon$. When a different random projection is chosen for each tree (Algorithm 2), the algorithm computes an approximation, denoted by $f_2(x; ls, \phi^t, \epsilon^t)$, of the following form:

$f_2(x; ls, \phi^t, \epsilon^t) = \frac{1}{t}\sum_{i=1}^{t} f(x; ls, \phi_i, \epsilon_i),$

where $\phi^t = (\phi_1, \ldots, \phi_t)$ is also a vector of i.i.d. random projection matrices.

We would like to compare the average errors of these two algorithms with the average error of the original single-tree method, where the average is taken for all algorithms over their random parameters (which include the learning sample).

Given that all trees are grown independently of each other, one can show that the average models corresponding to each algorithm are equal:

$E_{ls,\phi,\epsilon^t}\{f_1(x; ls, \phi, \epsilon^t)\} = E_{ls,\phi^t,\epsilon^t}\{f_2(x; ls, \phi^t, \epsilon^t)\} = E_{ls,\phi,\epsilon}\{f(x; ls, \phi, \epsilon)\}.$

They thus all have the exact same bias (and residual error) and differ only in their variance.

Using the same argument, the first term of the variance decomposition in (a.13), i.e. $V_{LS}(x)$, is the same for all three algorithms since:

$E_{\phi,\epsilon^t|ls}\{f_1(x; ls, \phi, \epsilon^t)\} = E_{\phi^t,\epsilon^t|ls}\{f_2(x; ls, \phi^t, \epsilon^t)\} = E_{\phi,\epsilon|ls}\{f(x; ls, \phi, \epsilon)\}.$

Their variances thus only differ in the second term of (a.13).

Again, because of the conditional independence of the ensemble terms given the learning sample $ls$ and the projection matrix $\phi$, Algorithm 1, which keeps the output projection fixed for all trees, is such that

$E_{\epsilon^t|ls,\phi}\{f_1(x; ls, \phi, \epsilon^t)\} = E_{\epsilon|ls,\phi}\{f(x; ls, \phi, \epsilon)\}$

and

$\mathrm{Var}_{\epsilon^t|ls,\phi}\{f_1(x; ls, \phi, \epsilon^t)\} = \frac{1}{t}\,\mathrm{Var}_{\epsilon|ls,\phi}\{f(x; ls, \phi, \epsilon)\}.$

It thus divides the second term of (a.14) by the number $t$ of ensemble terms. Algorithm 2, on the other hand, is such that:

$\mathrm{Var}_{\phi^t,\epsilon^t|ls}\{f_2(x; ls, \phi^t, \epsilon^t)\} = \frac{1}{t}\,\mathrm{Var}_{\phi,\epsilon|ls}\{f(x; ls, \phi, \epsilon)\},$

and thus divides the second term of (a.13) by $t$.

Putting all these results together, one gets:

$V_1(x) = V_{LS}(x) + V_{\phi}(x) + \frac{V_{\epsilon}(x)}{t}, \qquad V_2(x) = V_{LS}(x) + \frac{V_{\phi}(x) + V_{\epsilon}(x)}{t}.$

Given that all terms are non-negative, this result clearly shows that Algorithm 2 cannot be worse than Algorithm 1.
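The claim that Algorithm 2 cannot be worse than Algorithm 1 can be illustrated with a toy simulation. Here the "tree" prediction is replaced by a simple additive function of two independent noise sources (an illustrative assumption, not the paper's tree models), so the two variance components are known in closed form:

```python
import numpy as np

# Toy model: f(phi, eps) = phi + eps with independent phi, eps ~ N(0, 1),
# so the projection variance and the tree-randomization variance are both 1
# (and there is no learning-sample variance in this toy setting).
rng = np.random.default_rng(0)
t, reps = 10, 100_000

# Algorithm 1: one shared "projection" phi per ensemble, fresh eps per tree.
# Expected ensemble variance: 1 + 1/t (only the eps term is divided by t).
phi = rng.normal(size=(reps, 1))
eps = rng.normal(size=(reps, t))
alg1 = (phi + eps).mean(axis=1)

# Algorithm 2: a fresh "projection" for every tree.
# Expected ensemble variance: 2/t (both terms are divided by t).
phi2 = rng.normal(size=(reps, t))
eps2 = rng.normal(size=(reps, t))
alg2 = (phi2 + eps2).mean(axis=1)
```

With `t = 10`, the simulated variances come out near 1.1 and 0.2 respectively, matching the closed-form decomposition and the conclusion that resampling the projection per tree strictly reduces variance.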
Appendix 0.D Description of the datasets
Experiments are performed on several multilabel datasets: the yeast [16] dataset in the biology domain; the corel5k [15] and scene [5] datasets in the image domain; the emotions [35] and CAL500 [36] datasets in the music domain; the bibtex [21], bookmarks [21], delicious [32], enron [22], EURLex (subject matters, directory codes and eurovoc descriptors) [26], genbase [14], medical (the medical dataset comes from the Computational Medicine Center's 2007 medical natural language processing challenge, http://computationalmedicine.org/challenge/previous) and tmc2007 [31] datasets in the text domain; and the mediamill [30] dataset in the video domain.

Several hierarchical classification tasks are also studied, to increase the diversity in the number of labels, and are treated as multilabel classification tasks: each node of the hierarchy is treated as one label. Nodes of the hierarchy that never occurred in the training or testing set were removed. The reuters [29] and WIPO [29] datasets are from the text domain, the diatoms [13] dataset is from the image domain, and SCOPGO [11], YeastGO [3] and ExpressionGO [37] are from the biology domain. Missing values in the ExpressionGO dataset were inferred using the median for continuous features and the most frequent value for categorical features, computed over the entire dataset. The inference of a drug-protein interaction network [38] is also considered, either using the drugs to infer the interactions with the proteins (drug-interaction) or using the proteins to infer the interactions with the drugs (protein-interaction).
These datasets were selected to cover a wide range of numbers of output labels. Their basic characteristics are summarized in Table 2, where "10cv" indicates that 10-fold cross-validation was used instead of a fixed train/test split. For more information on a particular dataset, please refer to the corresponding paper.
Datasets  Training set size  Testing set size  Number of features  Number of labels

emotions  391  202  72  6 
scene  1211  1196  2407  6 
yeast  1500  917  103  14 
tmc2007  21519  7077  49060  22 
genbase  463  199  1186  27 
reuters  2500  5000  19769  34 
medical  333  645  1449  45 
enron  1123  579  1001  53 
mediamill  30993  12914  120  101 
YeastGO  2310  1155  5930  132 
bibtex  4880  2515  1836  159 
CAL500  376  126  68  174 
WIPO  1352  358  74435  188 
EURLex (subject matters)  19348  10cv  5000  201 
bookmarks  65892  21964  2150  208 
diatoms  2065  1054  371  359 
corel5k  4500  500  499  374 
EURLex (directory codes)  19348  10cv  5000  412 
SCOPGO  6507  3336  2003  465 
delicious  12920  3185  500  983 
drug-interaction  1396  466  660  1554 
protein-interaction  1165  389  876  1862 
ExpressionGO  2485  551  1288  2717 
EURLex (eurovoc descriptors)  19348  10cv  5000  3993 
Appendix 0.E Experiments with Extra trees
In this section, we carry out experiments combining Gaussian random projections with the Extra Trees method of [17] (see Section 2.1 of the paper for a very brief description of this method). Results on 23 datasets are compiled in Table 3. (Results on the "EURLex" datasets were not available at the time of submitting the paper; they will be added in the final version of this appendix.)
As with Random Forests, we observe that for all 23 datasets a sufficiently large number of projected outputs leads to an average precision similar to that of standard Extra Trees, i.e. with no difference greater than one standard deviation of the error. This is already the case at a smaller number of projections for 12 datasets, and at an intermediate number for 4 more. Interestingly, on 3 datasets at each of two of these projection sizes, the increased randomization brought by the projections actually improves average precision with respect to standard Extra Trees (bold values in Table 3).
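For readers who wish to reproduce this kind of experiment, here is a minimal sketch assuming scikit-learn [27]; the synthetic data, sizes, and the `predict` helper are illustrative, not the exact experimental protocol:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Grow Extra Trees on a Gaussian random projection of the output space, then
# relabel each leaf with the mean of the ORIGINAL label vectors it contains.
rng = np.random.RandomState(0)
n, p, d, m = 300, 10, 40, 5
X = rng.rand(n, p)
Y = (rng.rand(n, d) < 0.1).astype(float)  # sparse binary label matrix

phi = rng.normal(size=(m, d)) / np.sqrt(m)  # Gaussian projection matrix
ens = ExtraTreesRegressor(n_estimators=20, random_state=0)
ens.fit(X, Y @ phi.T)  # trees are grown on the m projected outputs

leaves_train = ens.apply(X)  # (n, n_trees) leaf index of each training sample

def predict(X_new):
    """Average, over trees, the mean original label vector of the reached leaf."""
    leaves_new = ens.apply(X_new)
    preds = np.zeros((len(X_new), d))
    for t in range(ens.n_estimators):
        for leaf in np.unique(leaves_new[:, t]):
            mask = leaves_new[:, t] == leaf
            preds[mask] += Y[leaves_train[:, t] == leaf].mean(axis=0)
    return preds / ens.n_estimators

probas = predict(X[:5])  # per-label scores in [0, 1]
```

Because each leaf is relabeled with averages of the original label vectors, predictions live directly in the original output space and no decoding step is required.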
Datasets  Extra trees  Extra trees on Gaussian subspace  

emotions  
scene  
yeast 