Multi-Label Learning with Deep Forest

11/15/2019 ∙ by Liang Yang, et al. ∙ Nanjing University

In multi-label learning, each instance is associated with multiple labels, and the crucial task is how to leverage label correlations when building models. Deep neural network methods usually jointly embed the feature and label information into a latent space to exploit label correlations. However, the success of these methods highly depends on the precise choice of model depth. Deep forest is a recent deep learning framework based on tree-model ensembles, which does not rely on backpropagation. We argue that the advantages of deep forest models make them well suited for solving multi-label problems. Therefore, we design the Multi-Label Deep Forest (MLDF) method with two mechanisms: measure-aware feature reuse and measure-aware layer growth. The measure-aware feature reuse mechanism reuses the good representation from the previous layer, guided by confidence. The measure-aware layer growth mechanism ensures that MLDF gradually increases the model complexity according to the performance measure. MLDF handles two challenging problems at the same time: one is restricting the model complexity to ease the overfitting issue; the other is optimizing the performance measure on the user's demand, since there are many different measures in multi-label evaluation. Experiments show that our proposal not only beats the compared methods over six measures on benchmark datasets but also enjoys label correlation discovery and other desired properties in multi-label learning.




1 Introduction

In multi-label learning, each example is associated with multiple labels simultaneously, and the task of multi-label learning is to predict the set of relevant labels for an unseen instance. Multi-label learning has been widely applied to diverse problems like text categorization [Zhou and El-Gohary, 2015], scene classification [Zhao et al., 2015], functional genomics [Zhang and Zhou, 2006], video categorization [Ray et al., 2018], chemicals classification [Cheng et al., 2017], etc. Multi-label learning has drawn increasing research attention because such tasks are omnipresent in real-world problems [Tsoumakas et al., 2010].

By transforming the multi-label learning problem into independent binary classification problems, one for each label, Binary Relevance [Tsoumakas and Katakis, 2007] is a straightforward method that has been widely used in practice. Though it aims to make full use of high-performance traditional single-label classifiers, it leads to high computational cost when the label space is large. Besides, such a method neglects the fact that information about one label may be helpful for learning other related labels, which limits the prediction performance. Investigating correlations among labels has been demonstrated to be the key to improving the performance of multi-label learning. As a result, more and more multi-label learning methods that aim to explore and exploit label correlations have been proposed [Tsoumakas et al., 2010; Wang et al., 2016; Zhang et al., 2018a].

Different from traditional multi-label methods, deep neural network models usually try to learn a new feature space and employ a multi-label classifier on top. Among the first to utilize network architectures, BP-MLL [Zhang and Zhou, 2006] not only treats each output node as a binary classification task but also exploits label correlations through the architecture itself. Later, a comparably simple neural network approach building upon BP-MLL was proposed, replacing the pairwise ranking loss with cross-entropy loss and using deep neural network techniques [Nam et al., 2014]; it achieves good results on large-scale text classification. However, deep neural models usually require a huge amount of training data, and thus they are generally not suitable for small-scale datasets.

Realizing that the essence of deep learning lies in layer-by-layer processing, in-model feature transformation, and sufficient model complexity, Zhou and Feng proposed deep forest [Zhou and Feng, 2018]. Deep forest is an ensemble deep model built on decision trees that does not use backpropagation in the training process. With its cascade structure, deep forest is able to do representation learning similarly to deep neural models, and it is much easier to train since it has fewer hyper-parameters. It has achieved excellent performance on a broad range of tasks, such as large-scale financial fraud detection [Zhang et al., 2018b] and image and text reconstruction [Feng and Zhou, 2018]. Though deep forest has been found useful in traditional classification tasks [Zhou and Feng, 2018], the potential of applying it to multi-label learning had not been noticed before our work.

The success of deep forest mainly comes from the layer-by-layer feature transformation in an ensemble way [Zhou, 2012]. On the other hand, the key point in multi-label learning is how to use label correlations. Inspired by these two facts, we propose the Multi-Label Deep Forest (MLDF) method. Briefly speaking, MLDF uses different multi-label tree methods as the building blocks in deep forest, and label correlations can be exploited via layer-by-layer representation learning. Because evaluation in multi-label learning is more complicated than in traditional classification tasks, a number of performance measures have been proposed [Schapire and Singer, 2000]. It has been noticed that different users have different demands and that an algorithm usually performs differently on different measures [Wu and Zhou, 2017]. In order to achieve better performance on a specific measure, we propose two mechanisms: measure-aware feature reuse and measure-aware layer growth. The measure-aware feature reuse mechanism, inspired by confidence screening [Pang et al., 2018], reuses the good representation from the previous layer. The measure-aware layer growth mechanism aims to control the model complexity under various performance measures. The main contributions of this paper are summarized as follows:

  • We first introduce deep forest to multi-label learning. Thanks to the proposed cascade structure and the two measure-aware mechanisms, our MLDF method can handle two challenging problems in multi-label learning at the same time: optimizing different performance measures on user demand, and reducing the overfitting that is often observed in deep neural multi-label models when label correlations are exploited through a large number of layers.

  • Our extensive experiments show that MLDF achieves the best performance on 9 benchmark datasets over 6 multi-label measures. Besides, the two mechanisms are confirmed to be necessary in MLDF. Furthermore, investigative experiments demonstrate that our proposal enjoys high flexibility in applying various base tree models and resistance to overfitting.

The remainder of the paper is organized as follows. Section 2 introduces some preliminaries. Section 3 formally describes our MLDF method and two measure-aware mechanisms. Section 4 reports the experimental results on benchmarks and investigative experiments. Finally, we conclude the paper in Section 5.

2 Preliminaries

In this section, we introduce various performance measures, followed by the tree-based multi-label methods that form the basis of Multi-Label Deep Forest.

2.1 Multi-label performance measures

The multi-label classification task is to derive a function from the training set D = {(x_i, y_i)}, i = 1, …, m, where each y_i is a vector over l labels. Suppose the multi-label learning model first produces a real-valued function f, which can be viewed as the confidence of relevance of the labels. The multi-label classifier can be induced from f by thresholding.
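As a quick illustration of inducing a classifier from real-valued confidences by thresholding, here is a minimal sketch; the 0.5 cutoff and the function name are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def threshold_predict(scores, tau=0.5):
    """Induce a binary multi-label prediction from real-valued
    confidence scores by thresholding each entry at tau."""
    return (np.asarray(scores) >= tau).astype(int)

# Two instances, three labels: each row of `scores` holds the
# confidences f produces, each row of `pred` the induced label set.
scores = np.array([[0.9, 0.4, 0.7],
                   [0.2, 0.8, 0.1]])
pred = threshold_predict(scores)
```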

There are many performance measures for multi-label learning; six widely-used evaluation measures are employed in this paper [Wu and Zhou, 2017]. Table 1 shows the formulation of these measures, with Y being the true label matrix, y_i the i-th row of the label matrix, and '1' ('0') denoting a relevant (irrelevant) label. Hamming loss and macro-AUC are label-based measures, while one-error, coverage, ranking loss and average precision are instance-based measures [Zhang and Zhou, 2014]. Here h_ij denotes the confidence score of the i-th instance on the j-th label, ŷ_ij the predicted result of the i-th instance on the j-th label, and rank_ij the rank of the j-th label when the i-th instance's confidence scores are sorted in descending order. For example, given an instance's confidence scores, thresholding at 0.5 yields the predicted label vector ŷ_i, from which the ranks and prediction results follow.

Measure Formulation
hamming loss (↓)  (1/ml) Σ_{i=1..m} Σ_{j=1..l} [[ŷ_ij ≠ y_ij]]
one-error (↓)  (1/m) Σ_{i=1..m} [[y_{i, argmax_j h_ij} ≠ 1]]
coverage (↓)  (1/m) Σ_{i=1..m} (1/l)(max_{j: y_ij=1} rank_ij − 1)
ranking loss (↓)  (1/m) Σ_{i=1..m} |{(u,v) : h_iu ≤ h_iv, y_iu = 1, y_iv = 0}| / (|{j : y_ij = 1}| · |{j : y_ij = 0}|)
average precision (↑)  (1/m) Σ_{i=1..m} (1/|{j : y_ij = 1}|) Σ_{j: y_ij=1} |{k : y_ik = 1, rank_ik ≤ rank_ij}| / rank_ij
macro-AUC (↑)  (1/l) Σ_{j=1..l} |{(a,b) : h_aj ≥ h_bj, y_aj = 1, y_bj = 0}| / (|{i : y_ij = 1}| · |{i : y_ij = 0}|)
Table 1: Definitions of the six multi-label performance measures; '↓' means the lower the better, '↑' means the higher the better.
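The definitions in Table 1 can be made concrete with a small numpy sketch; the three functions below follow the standard formulations of hamming loss, ranking loss, and average precision (the helper names are ours, not the paper's).

```python
import numpy as np

def hamming_loss(Y, P):
    # Fraction of instance-label pairs that are misclassified.
    return np.mean(Y != P)

def ranking_loss(Y, H):
    # Average fraction of (relevant, irrelevant) label pairs ordered
    # wrongly by the confidence scores H.
    m, l = Y.shape
    total = 0.0
    for i in range(m):
        pos = np.where(Y[i] == 1)[0]
        neg = np.where(Y[i] == 0)[0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        wrong = sum(H[i, u] <= H[i, v] for u in pos for v in neg)
        total += wrong / (len(pos) * len(neg))
    return total / m

def average_precision(Y, H):
    # For each relevant label, the fraction of labels ranked at or above
    # it that are also relevant, averaged over instances.
    m, l = Y.shape
    total = 0.0
    for i in range(m):
        order = np.argsort(-H[i])          # descending by confidence
        rank = np.empty(l, dtype=int)
        rank[order] = np.arange(1, l + 1)  # rank 1 = most confident
        pos = np.where(Y[i] == 1)[0]
        if len(pos) == 0:
            continue
        prec = [np.sum((Y[i] == 1) & (rank <= rank[j])) / rank[j] for j in pos]
        total += np.mean(prec)
    return total / m
```

For Y = [[1, 0, 1]] and H = [[0.9, 0.8, 0.2]], thresholding H at 0.5 misclassifies two of three bits (hamming loss 2/3), one of the two relevant-irrelevant pairs is mis-ordered (ranking loss 0.5), and average precision is (1 + 2/3)/2 = 5/6.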

2.2 Tree-based multi-label methods

Tree-based multi-label methods, such as ML-C4.5 [Clare and King, 2001] and PCT [Blockeel et al., 1998], are adapted from multi-class tree methods. ML-C4.5, adapted from C4.5, allows multiple labels in the leaves of the tree; its entropy calculation is modified by summing the entropies over all labels. Predictive Clustering Trees (PCT) recursively partition all samples into smaller clusters while moving down the tree. In the testing process, the leaf node of a multi-label tree returns a vector of probabilities that a sample belongs to each class, obtained by counting the proportions of the different classes among the training examples at the leaf node where the concerned instance falls. Thus, the information kept in each leaf node is the probability that the instance owns each label.

The ability of a single tree is limited, but an ensemble of trees greatly improves performance. Random Forest of Predictive Clustering Trees (RF-PCT) [Kocev et al., 2013] and Random Forest of ML-C4.5 (RFML-C4.5) [Madjarov et al., 2012] are ensembles that use PCT and ML-C4.5 as base classifiers respectively. As in random forest, these forests use bagging and choose different feature subsets to obtain diversity among the base classifiers. Given a test instance, the forest produces an estimate of the label distribution by averaging the results across all trees.

3 The Proposed Method

In this section, we propose a deep forest method for multi-label learning. Firstly, we introduce the framework of Multi-Label Deep Forest (MLDF). Then, we discuss two proposed mechanisms: measure-aware feature reuse and measure-aware layer growth.

Figure 1: The framework of Multi-Label Deep Forest (MLDF). Each layer ensembles two different forests (the black above and the blue below).

3.1 The framework

Figure 1 illustrates the framework of MLDF. Different multi-label forests (the black forests above and the blue forests below) are ensembled in each layer of MLDF. From layer t, we can obtain the representation H^t. The measure-aware feature reuse part receives H^t and updates it to G^t by reusing the representation learned in layer t−1 under the guidance of the performance on different measures. Then the new representation G^t (the blue one) is concatenated with the raw input features (the red one) and goes into the next layer.

In MLDF, each layer is an ensemble of forests. To enhance the performance of the ensemble, we consider different tree-growing methods to encourage diversity, which is crucial to the success of ensemble methods [Zhou, 2012]. In traditional multi-class problems, extremely randomized trees [Geurts et al., 2006], which pick one split point of each feature at random, are used in gcForest [Zhou and Feng, 2017]. For multi-label learning problems, we can adapt this kind of method by changing the way nodes are split when generating trees. In MLDF, we take RF-PCT [Kocev et al., 2013] as the forest block and use two different node-generating methods across the forests: one considers all possible split points of each feature, which is RF-PCT; the other considers one random split point per feature [Kocev and Ceci, 2015], which we name ERF-PCT. Of course, other multi-label tree methods, such as RFML-C4.5, can also be embedded into each layer, which will be discussed in Section 4.
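A single layer can be sketched as follows. Note this is only an analogy: scikit-learn's RandomForestClassifier and ExtraTreesClassifier stand in for RF-PCT and ERF-PCT (they are not PCT implementations), but they exhibit the same "all split points" vs. "random split point" contrast and natively handle multi-label targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Toy multi-label data: 100 instances, 10 features, 3 labels.
rng = np.random.RandomState(0)
X = rng.rand(100, 10)
Y = (X[:, :3] > 0.5).astype(int)  # synthetic labels, for illustration only

# One layer: two forests grown with different split strategies.
forests = [RandomForestClassifier(n_estimators=40, random_state=0),
           ExtraTreesClassifier(n_estimators=40, random_state=0)]

def layer_representation(forests, X):
    """Concatenate each forest's per-label positive-class probabilities."""
    parts = []
    for f in forests:
        proba = f.predict_proba(X)  # list of (n, 2) arrays, one per label
        parts.append(np.column_stack([p[:, 1] for p in proba]))
    return np.hstack(parts)         # shape (n, n_forests * n_labels)

for f in forests:
    f.fit(X, Y)
H = layer_representation(forests, X)  # shape (100, 2 * 3)
```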

As aforementioned in Section 2.2, given an instance, the forests will produce an estimation of label distributions, which can be viewed as the confidence of the instance belonging to each label. The representation learned in each layer will adopt measure-aware feature reuse and be input to the next layer with raw input features. The real-valued representation with rich labeling information will be input to the next layer to facilitate MLDF to take better advantage of label correlations [Belanger et al., 2017].

The predicting process can be summarized as follows. As shown in Figure 1, assume the forests have been fitted well. Firstly, we pre-process the instances into a standard feature matrix X. Secondly, the instance matrix passes the first layer, and we get the representation H^1. By adopting measure-aware feature reuse, we get G^1. Then we concatenate G^1 with the raw input features and put them into the next layer. After multiple layers, we get the final prediction.
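The forward pass above can be sketched in a few lines. Everything here is a data-flow illustration with hypothetical callables: `layers` stands for the fitted per-layer predictors and `reuse` for the measure-aware feature reuse step; neither name is the paper's API.

```python
import numpy as np

def mldf_predict(X, layers, reuse):
    """Sketch of the MLDF forward pass: each layer maps the current
    features to label confidences H, feature reuse blends H with the
    previous representation G, and [X, G] feeds the next layer."""
    G = None
    feats = X
    for layer in layers:
        H = layer(feats)                 # layer's label-confidence output
        G = H if G is None else reuse(H, G)
        feats = np.hstack([X, G])        # concatenate with raw features
    return G                             # final confidences

# Dummy components to show the data flow: 4 instances, 3 labels.
X = np.ones((4, 5))
layers = [lambda F: np.full((F.shape[0], 3), 0.6),
          lambda F: np.full((F.shape[0], 3), 0.8)]
reuse = lambda H, G_prev: H              # identity: no reuse
scores = mldf_predict(X, layers, reuse)
```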

Measure Confidence
hamming loss  (1/m) Σ_{i=1..m} max(h_ij, 1 − h_ij)  (computed per label j)
ranking loss  Σ_{k=0..l} Π_{j≤k} h_i(j) · Π_{j>k} (1 − h_i(j))  (computed per instance i; the probability of zero ranking loss, with h_i(1) ≥ … ≥ h_i(l))
average precision  defined in a similar way over the sorted confidence scores of each instance
Table 2: Confidence computing methods for the multi-label measures; the row h_i· (or column h_·j) is sorted in descending order.
0:  Input: measure m, forests' output H^t, previous representation G^{t−1}, threshold θ^t.
0:  Output: new representation G^t.
1:  Initialize matrix G^t = H^t.
2:  if measure m is label-based then
3:     for j = 1 to l do
4:        compute confidence α_j on the j-th column of H^t
5:        Update the j-th column of G^t to that of G^{t−1} when α_j < θ^t.
6:     end for
7:  end if
8:  if measure m is instance-based then
9:     for i = 1 to m do
10:        compute confidence α_i on the i-th row of H^t
11:        Update the i-th row of G^t to that of G^{t−1} when α_i < θ^t.
12:     end for
13:  end if
Algorithm 1 Measure-aware feature reuse
0:  Input: measure m, forests' output H^t, ground-truth Y, previous performance p on m.
0:  Output: threshold θ^t.
1:  Initialize confidence set S = ∅.
2:  if measure m is label-based then
3:     for j = 1 to l do
4:        compute confidence α_j on the j-th column of H^t.
5:        compute measure p_j on the j-th column of (H^t, Y).
6:        S = S ∪ {α_j} when p_j is worse than p.
7:     end for
8:  end if
9:  if measure m is instance-based then
10:     for i = 1 to m do
11:        compute confidence α_i on the i-th row of H^t.
12:        compute measure p_i on the i-th row of (H^t, Y).
13:        S = S ∪ {α_i} when p_i is worse than p.
14:     end for
15:  end if
16:  θ^t = Compute threshold on S.
Algorithm 2 Determine threshold

3.2 Measure-aware feature reuse

The split criterion of PCT is not directly related to the performance measure, and the representation generated by each layer is the same regardless of the measure. Therefore we propose the measure-aware feature reuse mechanism to enhance the representation under the guidance of different measures. The key idea of measure-aware feature reuse is to partially reuse the good representation from the previous layer in the current layer if the confidence on the current layer is lower than a threshold determined in training, which can improve the measure performance. Therefore, the challenge lies in defining the confidence for the specific multi-label measure on demand. Inspired by the confidence computing method for the multi-class classification problem [Pang et al., 2018], we design confidence computing methods for the different multi-label measures.

Hamming loss cares about the correctness of a single bit, one-error cares about the element closest to 1, and the other measures care about the ranking permutation on each row or column. Therefore, the crucial step is designing reasonable methods to compute the confidence for the various measures. Table 2 summarizes the computing methods, obtained by considering the real meaning of each measure. Matrix H is the average of the forests' outputs, and the element h_ij represents the predicted probability that instance i is relevant to label j. For convenience, we arrange the elements of each row (column) of H in descending order when the measure is instance-based (label-based). Specifically, for hamming loss, we compute for each bit the larger of the confidences that the bit is positive or negative, i.e., max(h_ij, 1 − h_ij). For ranking loss, we compute the probability that the ranking loss is zero, which means all positive labels are ranked ahead of all negative labels. For example, if the sorted prediction vector has length 4, there are 5 possible ground-truth patterns leading to zero ranking loss: (1,1,1,1), (1,1,1,0), (1,1,0,0), (1,0,0,0), and (0,0,0,0). Therefore, we can get the confidence by summing the probabilities of these five cases. The confidences of macro-AUC and average precision are defined in a similar way.
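The zero-ranking-loss probability can be computed by summing over the "prefix" patterns of the sorted scores. The sketch below assumes each entry of h is an independent probability that the corresponding label is relevant; that independence assumption is ours, made to keep the example self-contained.

```python
import numpy as np

def ranking_loss_confidence(h):
    """Probability that the ground truth incurs zero ranking loss.
    Zero ranking loss happens exactly when the relevant labels form a
    prefix of the confidence-sorted order, so we sum the probability of
    every prefix pattern (l + 1 of them for l labels)."""
    h = np.sort(np.asarray(h, dtype=float))[::-1]  # descending order
    l = len(h)
    total = 0.0
    for k in range(l + 1):  # the top-k labels are the relevant ones
        total += np.prod(h[:k]) * np.prod(1.0 - h[k:])
    return total
```

For l = 4 the loop covers exactly the five prefix patterns mentioned above; for h = (0.5, 0.5) the three patterns (0,0), (1,0), (1,1) each have probability 0.25, giving confidence 0.75.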

Algorithm 1 summarizes the process of measure-aware feature reuse. Due to the difference between label-based measures and instance-based measures [Wu and Zhou, 2017], we need to deal with them separately. Specifically, the label-based measures compute the confidence on each column of the representation and the instance-based measures compute it on each row. After computing the confidence, we keep the previous layer's representation wherever the confidence is below the threshold, and update the new representation with these kept entries.
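The column/row update of Algorithm 1 amounts to the following numpy sketch; the function and parameter names are illustrative, and the hamming-style confidence used in the example follows the max(h, 1 − h) rule from Table 2.

```python
import numpy as np

def feature_reuse(H_cur, G_prev, confidence, theta, label_based=True):
    """Sketch of measure-aware feature reuse: wherever the confidence is
    below the threshold theta, keep the previous layer's representation.
    `confidence` maps one column (label-based) or one row
    (instance-based) of confidences to a scalar."""
    G = H_cur.copy()
    if label_based:
        for j in range(G.shape[1]):            # per label
            if confidence(H_cur[:, j]) < theta:
                G[:, j] = G_prev[:, j]
    else:
        for i in range(G.shape[0]):            # per instance
            if confidence(H_cur[i, :]) < theta:
                G[i, :] = G_prev[i, :]
    return G

H = np.array([[0.9, 0.5],
              [0.8, 0.5]])
G_prev = np.zeros((2, 2))
conf = lambda v: np.mean(np.maximum(v, 1 - v))  # hamming-style confidence
# Column 0 has confidence 0.85 (kept); column 1 has 0.5 (reused).
G = feature_reuse(H, G_prev, conf, theta=0.7)
```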

The whole process of measure-aware feature reuse does not rely on true labels. We judge the goodness of a representation by a threshold determined during the training process. As Algorithm 2 shows, we save the confidence into the set S if the evaluated performance measure gets worse on the corresponding column (row). Then, the threshold is determined based on S; simply, we can take the average of S. Because the meaning of the confidence is consistent with the measure, the threshold can be effectively utilized in measure-aware feature reuse.
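Taking the average of the collected confidences, as suggested above, gives a one-liner for the threshold; the signature here is a hypothetical rendering of Algorithm 2's output step.

```python
import numpy as np

def determine_threshold(confidences, performances, prev_perf,
                        lower_is_better=True):
    """Sketch of Algorithm 2's final step: collect the confidences of
    the columns/rows whose measure got worse than the previous layer's
    performance, and average them (0.0 if nothing got worse)."""
    if lower_is_better:
        worse = [c for c, p in zip(confidences, performances) if p > prev_perf]
    else:
        worse = [c for c, p in zip(confidences, performances) if p < prev_perf]
    return float(np.mean(worse)) if worse else 0.0
```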

3.3 Measure-aware layer growth

0:  Input: maximal depth T, measure m, training data D.
0:  Output: model set M, threshold set Θ and final layer index t*.
1:  Initialize parameters: performance in each layer p[1:T], best performance on the training set p*, the initial threshold θ^0 = 0, the best-performance layer index t* = 0, the model set M = ∅.
2:  for t = 1 to T do
3:     Train the forests in layer t and get classifier L_t.
4:     Predict H^t.
5:     θ^t = Determine threshold (Algorithm 2) when t > 1.
6:     G^t = measure-aware feature reuse (Algorithm 1).
7:     Compute performance p_t on measure m with G^t.
8:     if p_t is better than p* then
9:        Update best performance p* = p_t.
10:        Update the layer index of best performance t* = t.
11:     else if p* is not updated in recent 3 layers then
12:        break
13:     end if
14:     Add L_t to model set: M = M ∪ {L_t}.
15:  end for
16:  Keep layers 1 to t* in the model set M and drop the others.
Algorithm 3 Measure-aware layer growth

Though measure-aware feature reuse can effectively enhance the representation guided by various measures, that mechanism cannot influence the layer growth or reduce the risk of overfitting that may occur in the training process. In order to reduce overfitting and control the model complexity, we propose the measure-aware layer growth mechanism.

If we used the same data to fit the forests and to predict directly, the risk of overfitting would increase [Platt et al., 1999]. MLDF uses k-fold cross-validation to alleviate this issue: for each fold, we train the forests on the training examples in the other folds and make predictions on the current fold. The layer's representation is generated by concatenating the predictions from each forest.
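The out-of-fold scheme can be sketched as below. A scikit-learn random forest stands in for RF-PCT (an analogy, not a PCT implementation), and every instance's representation comes from forests that never saw it during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def cv_representation(X, Y, n_splits=5, seed=0):
    """Out-of-fold layer representation: for each fold, fit a forest on
    the other folds and predict the current fold."""
    n, l = X.shape[0], Y.shape[1]
    H = np.zeros((n, l))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in kf.split(X):
        f = RandomForestClassifier(n_estimators=20, random_state=seed)
        f.fit(X[tr], Y[tr])
        proba = f.predict_proba(X[te])  # one (n_te, 2) array per label
        H[te] = np.column_stack([p[:, 1] for p in proba])
    return H
```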

MLDF is built layer by layer. Algorithm 3 summarizes the procedure of measure-aware layer growth used in training MLDF. The inputs are the maximal depth of layers T, the evaluation measure m, and the training data D. If the model has grown T layers, the training process will stop. In general, we choose one RF-PCT and one ERF-PCT in each layer and randomly select a subset of features as candidates in each forest. All parameters for training the forests in MLDF, such as the number of forests and the depth of trees, are pre-determined before training. As we hope each layer to learn a different representation, we can let the maximum depth of the trees grow with the layer index, and likewise the number of trees, both of which can be set in advance. In the initialization step, the performance vector p, which records the performance value on the training data at each layer, should be initialized according to the chosen measure. At each layer, we first fit the forests (line 3) and get the representation (line 4). Then we determine the threshold (line 5) and generate the new representation by measure-aware feature reuse (line 6). Finally, we add the layer to the model set (line 14).

The layer growth is measure-aware. After fitting one layer, we compute the evaluation measure. When the measure has not improved in the recent three layers (line 11), MLDF stops growing. At the same time, the layer index t* of the best performance on the training set is recorded, which is useful in prediction. According to Occam's razor, we prefer a simpler model when the performance is similar: once there is no clear improvement in performance, the model set up to layer t* is kept, and the layers after t* are dropped.
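The stopping rule is a patience-style early-stopping loop. The sketch below uses hypothetical callbacks `train_layer(t)` and `evaluate(t)` in place of the actual forest training and measure computation, and assumes higher scores are better for simplicity.

```python
def grow_layers(train_layer, evaluate, max_depth=20, patience=3):
    """Sketch of measure-aware layer growth: keep adding layers while
    the training-set measure improves, stop after `patience` layers
    without improvement, and return only the layers up to the best one."""
    best_perf, best_t, layers = float("-inf"), 0, []
    for t in range(1, max_depth + 1):
        layers.append(train_layer(t))
        perf = evaluate(t)
        if perf > best_perf:
            best_perf, best_t = perf, t
        elif t - best_t >= patience:   # no improvement for `patience` layers
            break
    return layers[:best_t]             # drop layers after the best one
```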

Different measures represent different user demands. We can set m to the required measure according to the situation and thereby obtain a model tailored to the specified measure. On one hand, the model complexity can be controlled by the measure-aware layer growth. On the other hand, compared to methods like PCT that cannot explicitly take the performance measure into consideration during training, MLDF is more flexible and able to achieve better performance.

4 Experiments

In this section, we conduct experiments with MLDF on different multi-label classification benchmark datasets. Our goal is to validate that MLDF can achieve the best performance on different measures and that the two measure-aware mechanisms are necessary. Moreover, we show the advantages of MLDF through more detailed experiments from various aspects.

4.1 Dataset and configuration

We choose 9 multi-label classification benchmark datasets from different application domains and with different scales. Table 3 presents the basic statistics of these datasets; all are taken from a public repository of multi-label datasets. The datasets vary in size: from 502 up to 43970 examples, from 68 up to 5000 features, and from 5 up to 201 labels. They are roughly organized in ascending order of the number of examples m, with five of them being regular-scale (m < 5000) and four of them being large-scale (m > 5000). For all experiments conducted below, 50% of the examples are randomly sampled without replacement to form the training set, and the remaining 50% are used as the test set.

Six evaluation measures widely used in multi-label learning [Wu and Zhou, 2017] are employed in this paper: hamming loss, one-error, coverage, ranking loss, average precision, and macro-AUC. Note that coverage is normalized by the number of labels, so all the evaluation measures vary within [0, 1].

Hyper-parameters of MLDF are set as follows. We set the maximal number of layers (T) to 20 and instantiate m with each of the six measures discussed above, which means we obtain a different model for each measure. We take one RF-PCT and one ERF-PCT in each layer and use 5-fold cross-validation to avoid overfitting. In the first layer, we take 40 trees in each forest, and then add 20 more trees per layer until the number of trees reaches 100, which enables MLDF to learn different representations. Similarly, we set the maximum tree depth to 3 in the first layer and increase it by 3 with each additional layer.

Dataset Domain m d l
CAL500 music 502 68 174
enron text 1702 1001 53
image images 2000 294 5
scene images 2407 1196 6
yeast biology 2417 103 14
corel16k-s1 images 13766 500 153
corel16k-s2 images 13761 500 164
eurlex-sm text 19348 5000 201
mediamill multimedia 43970 120 101
Table 3: Descriptions of the datasets in terms of the domain (Domain), number of examples (m), number of features (d) and number of labels (l).
Algorithm CAL500 enron image scene yeast corel16k-s1 corel16k-s2 eurlex-sm mediamill
hamming loss (↓)
MLDF 0.136±0.001 0.046±0.000 0.148±0.003 0.082±0.002 0.190±0.003 0.018±0.000 0.017±0.000 0.006±0.001 0.027±0.001
RF-PCT 0.137±0.001 0.049±0.001 0.156±0.002 0.096±0.001 0.196±0.002 0.019±0.001 0.018±0.001 0.008±0.001 0.028±0.001
DBPNN 0.169±0.001 0.075±0.001 0.264±0.001 0.260±0.001 0.220±0.001 0.029±0.001 0.026±0.001 0.007±0.001 0.031±0.001
MLFE 0.141±0.002 0.047±0.001 0.162±0.006 0.084±0.002 0.203±0.002 0.019±0.001 0.018±0.001 0.007±0.001 0.029±0.001
ECC 0.182±0.005 0.056±0.001 0.218±0.027 0.096±0.003 0.207±0.003 0.030±0.001 0.018±0.001 0.010±0.001 0.035±0.001
RAKEL 0.138±0.002 0.058±0.001 0.173±0.004 0.096±0.004 0.202±0.003 0.020±0.001 0.019±0.001 0.007±0.001 0.031±0.001
one-error (↓)
MLDF 0.122±0.009 0.216±0.009 0.239±0.008 0.188±0.005 0.223±0.010 0.640±0.003 0.639±0.004 0.138±0.001 0.147±0.005
RF-PCT 0.122±0.010 0.231±0.011 0.258±0.005 0.215±0.010 0.247±0.008 0.723±0.002 0.721±0.006 0.270±0.006 0.150±0.002
DBPNN 0.116±0.013 0.490±0.012 0.505±0.012 0.690±0.003 0.247±0.004 0.740±0.004 0.697±0.004 0.460±0.015 0.200±0.003
MLFE 0.133±0.010 0.232±0.003 0.265±0.008 0.201±0.005 0.245±0.010 0.680±0.005 0.665±0.004 0.345±0.010 0.151±0.002
ECC 0.137±0.021 0.293±0.008 0.408±0.069 0.247±0.010 0.244±0.009 0.706±0.006 0.712±0.005 0.346±0.007 0.150±0.005
RAKEL 0.286±0.039 0.412±0.016 0.312±0.010 0.247±0.009 0.251±0.008 0.886±0.007 0.897±0.006 0.447±0.016 0.181±0.002
coverage (↓)
MLDF 0.274±0.005 0.128±0.001 0.159±0.004 0.741±0.006 0.434±0.004 0.064±0.003 0.262±0.002 0.223±0.003 0.066±0.001
RF-PCT 0.310±0.002 0.133±0.001 0.170±0.004 0.756±0.007 0.436±0.007 0.073±0.004 0.321±0.002 0.223±0.007 0.058±0.001
DBPNN 0.372±0.002 0.575±0.003 0.187±0.006 0.784±0.002 0.458±0.003 0.084±0.004 0.370±0.002 0.292±0.006 0.552±0.011
MLFE 0.366±0.001 0.172±0.001 0.168±0.006 0.758±0.008 0.461±0.008 0.080±0.008 0.368±0.002 0.237±0.007 0.085±0.002
ECC 0.436±0.002 0.467±0.009 0.229±0.034 0.806±0.016 0.464±0.005 0.084±0.002 0.446±0.003 0.349±0.014 0.386±0.010
RAKEL 0.666±0.001 0.560±0.002 0.209±0.009 0.971±0.001 0.558±0.006 0.104±0.003 0.667±0.002 0.523±0.008 0.543±0.012
ranking loss (↓)
MLDF 0.176±0.002 0.077±0.001 0.129±0.005 0.059±0.004 0.160±0.006 0.143±0.002 0.138±0.002 0.014±0.001 0.034±0.001
RF-PCT 0.178±0.002 0.079±0.001 0.142±0.004 0.070±0.004 0.164±0.008 0.165±0.001 0.142±0.001 0.029±0.001 0.035±0.001
DBPNN 0.185±0.002 0.126±0.007 0.278±0.005 0.277±0.005 0.187±0.001 0.154±0.002 0.148±0.002 0.396±0.011 0.230±0.001
MLFE 0.185±0.003 0.082±0.008 0.148±0.007 0.065±0.004 0.174±0.006 0.189±0.002 0.188±0.001 0.034±0.002 0.046±0.001
ECC 0.204±0.008 0.133±0.004 0.224±0.043 0.085±0.003 0.186±0.003 0.233±0.002 0.229±0.001 0.263±0.007 0.179±0.008
RAKEL 0.444±0.005 0.241±0.005 0.196±0.008 0.107±0.003 0.245±0.004 0.414±0.002 0.418±0.001 0.388±0.011 0.222±0.001
average precision (↑)
MLDF 0.512±0.003 0.696±0.004 0.842±0.005 0.891±0.008 0.770±0.005 0.347±0.002 0.342±0.004 0.840±0.002 0.732±0.007
RF-PCT 0.512±0.006 0.685±0.002 0.829±0.003 0.873±0.006 0.758±0.008 0.293±0.002 0.287±0.002 0.726±0.004 0.729±0.001
DBPNN 0.495±0.002 0.500±0.007 0.672±0.006 0.563±0.004 0.738±0.002 0.289±0.002 0.299±0.002 0.427±0.013 0.502±0.002
MLFE 0.488±0.006 0.688±0.009 0.817±0.010 0.882±0.005 0.759±0.005 0.319±0.001 0.317±0.001 0.853±0.007 0.728±0.001
ECC 0.482±0.008 0.651±0.006 0.739±0.043 0.853±0.005 0.752±0.006 0.282±0.003 0.276±0.002 0.572±0.007 0.597±0.014
RAKEL 0.353±0.006 0.539±0.006 0.788±0.006 0.843±0.005 0.720±0.005 0.103±0.003 0.092±0.003 0.440±0.013 0.521±0.001
macro-AUC (↑)
MLDF 0.568±0.006 0.742±0.014 0.885±0.003 0.956±0.003 0.732±0.010 0.728±0.001 0.737±0.007 0.930±0.002 0.842±0.002
RF-PCT 0.555±0.004 0.729±0.012 0.875±0.005 0.947±0.002 0.723±0.012 0.712±0.004 0.719±0.005 0.904±0.007 0.835±0.002
DBPNN 0.499±0.001 0.679±0.010 0.746±0.006 0.704±0.005 0.627±0.004 0.699±0.002 0.708±0.003 0.589±0.005 0.510±0.001
MLFE 0.547±0.006 0.656±0.010 0.841±0.006 0.944±0.004 0.705±0.005 0.651±0.006 0.662±0.002 0.853±0.003 0.799±0.002
ECC 0.507±0.005 0.646±0.008 0.807±0.030 0.931±0.004 0.646±0.003 0.627±0.004 0.633±0.002 0.624±0.004 0.524±0.001
RAKEL 0.547±0.007 0.596±0.007 0.803±0.005 0.884±0.004 0.614±0.003 0.523±0.001 0.525±0.001 0.591±0.006 0.513±0.001
Table 4: Predictive performance (mean±standard deviation) of each comparing method on the nine datasets. Significantly better (worse) performance of MLDF against a comparing method is determined by a paired t-test at the 95% significance level. '↓' ('↑') means the smaller (larger) the value, the better the performance.

4.2 Performance comparison

We compare MLDF to the following five contenders: a) RF-PCT [Kocev et al., 2013], b) DBPNN [Hinton and Salakhutdinov, 2006; Read et al., 2016], c) MLFE [Zhang et al., 2018a], d) RAKEL [Tsoumakas and Vlahavas, 2007] and e) ECC [Read et al., 2011]. Among them, DBPNN is the representative of DNN methods; RAKEL, ECC, and RF-PCT are representatives of multi-label ensemble methods; MLFE is a method that utilizes the structural information in the feature space to enrich the labeling information. Parameter settings of the compared methods are as follows. For RF-PCT, we set the number of trees to 100. For DBPNN, we conduct the experiments with Meka [Read et al., 2016], set the base classifier to the logistic function, and keep the other hyper-parameters as recommended in Meka. For MLFE, we keep the same settings as suggested in [Zhang et al., 2018a]. The ensemble size of ECC is set to 100 to accommodate a sufficient number of classifier chains, and the ensemble size of RAKEL is set as suggested in the literature. The base learner of ECC and RAKEL is an SVM with a linear kernel. For fairness, all methods use 5-fold cross-validation.

We run each algorithm ten times. The mean metric value and the standard deviation across the 10 training/testing trials are recorded for the comparative study. Table 4 reports the detailed experimental results of the comparing algorithms. MLDF achieves the optimal (lowest) average rank in terms of each evaluation measure. On the 9 benchmark datasets, across all the evaluation measures, MLDF ranks 1st in 98.46% of cases and 2nd in 1.54% of cases. On the six measures, MLDF ranks 1st in 100.00%, 96.29%, 96.29%, 100.00%, 98.15%, and 100.00% of cases respectively. To summarize, MLDF achieves the best performance against other well-established contenders across extensive benchmark datasets on various evaluation measures, which validates the effectiveness of MLDF.

4.3 Influence of measure-aware feature reuse

The measure-aware feature reuse mechanism aims to reuse the good representation of the previous layer according to the confidence: when the confidence is lower than the threshold, we reuse the representation from the previous layer. Skipping line 5 of Algorithm 3 and keeping the threshold at 0 for all layers is equivalent to not using the measure-aware feature reuse mechanism in any layer. Figure 2 shows the comparison between using and not using the mechanism on CAL500, yeast, corel16k-s1 and corel16k-s2. The six radii represent the six different measures; the outermost part of the hexagon represents the performance of MLDF, the center represents the performance of RF-PCT, and the darker hexagon represents the performance of MLDF without the mechanism. The area of MLDF is larger than that of MLDF without the mechanism, which indicates that the measure-aware feature reuse mechanism works on these datasets. Furthermore, the area of MLDF without the measure-aware feature reuse mechanism gets smaller as the size of the dataset increases, which confirms that the mechanism is especially helpful on larger data.

Figure 2: The performance comparison on CAL500, yeast, corel16k-s1 and corel16k-s2. The light hexagon represents the performance of MLDF, the darker one the performance of MLDF without the measure-aware feature reuse mechanism. A larger area means better performance.

4.4 Effect of measure-aware layer growth

We conduct experiments on the yeast dataset to show the effect of the measure-aware layer growth mechanism. Specifically, when MLDF reaches its final layer index t*, we keep adding layers to observe whether the mechanism is effective. The number of trees in RF-PCT is 100, and 5-fold cross-validation is used.

Figure 3: The test performance of MLDF in each layer on the six measures on the yeast dataset. The triangle indicates the performance of RF-PCT, and the circle marks the final layer index t*.

Figure 3 shows the test performance curves of MLDF on the yeast dataset for the 6 measures. The red circle marks the final layer t* returned by Algorithm 3. For RF-PCT, we use the same number of trees as in the final layer of MLDF (the red circle); the triangle indicates its performance, and it can be seen as a one-layer MLDF. In Figure 3, the performance of MLDF becomes better as the model goes deeper, and our algorithm stops almost exactly at the position with the best performance, which demonstrates the effectiveness of our stopping mechanism. The performance of MLDF (the circle) is better than that of RF-PCT (the triangle), although they have the same number of trees. Moreover, MLDF guided by different measures converges at different layers. This indicates that the measure-aware layer growth mechanism is effective.

4.5 Label correlations exploitation

dataset hamming loss one-error coverage ranking loss average precision macro-AUC
CAL500 0.137±0.001 (5/0/0) 0.120±0.018 (4/0/1) 0.738±0.004 (5/0/0) 0.176±0.002 (5/0/0) 0.511±0.005 (4/0/1) 0.569±0.007 (5/0/0)
enron 0.047±0.001 (5/0/0) 0.225±0.007 (5/0/0) 0.224±0.005 (4/0/1) 0.079±0.004 (5/0/0) 0.691±0.003 (5/0/0) 0.738±0.005 (5/0/0)
image 0.146±0.004 (5/0/0) 0.243±0.016 (5/0/0) 0.157±0.008 (5/0/0) 0.130±0.010 (5/0/0) 0.841±0.010 (5/0/0) 0.886±0.010 (5/0/0)
scene 0.083±0.003 (5/0/0) 0.192±0.007 (5/0/0) 0.063±0.002 (5/0/0) 0.059±0.002 (5/0/0) 0.889±0.004 (5/0/0) 0.956±0.003 (5/0/0)
yeast 0.190±0.003 (5/0/0) 0.225±0.009 (5/0/0) 0.434±0.006 (5/0/0) 0.159±0.002 (5/0/0) 0.769±0.003 (5/0/0) 0.733±0.006 (5/0/0)
corel16k-s1 0.019±0.001 (5/0/0) 0.726±0.002 (2/0/3) 0.316±0.012 (5/0/0) 0.163±0.007 (4/0/1) 0.295±0.002 (4/0/1) 0.695±0.002 (3/0/2)
corel16k-s2 0.018±0.001 (5/0/0) 0.727±0.001 (1/0/4) 0.313±0.016 (4/0/1) 0.160±0.009 (3/0/2) 0.284±0.007 (2/0/3) 0.697±0.011 (3/0/2)
eurlex-sm 0.007±0.001 (5/0/0) 0.201±0.012 (5/0/0) 0.043±0.001 (5/0/0) 0.021±0.001 (5/0/0) 0.784±0.008 (4/0/1) 0.900±0.002 (4/0/1)
mediamill 0.027±0.001 (5/0/0) 0.144±0.002 (5/0/0) 0.128±0.002 (5/0/0) 0.035±0.001 (5/0/0) 0.725±0.010 (3/0/2) 0.843±0.004 (5/0/0)
sum of score 45/0/0 37/0/8 43/0/2 42/0/3 37/0/8 40/0/5
Table 5: Predictive performance (mean ± standard deviation) on all datasets. The triple in parentheses indicates the times that MLDF is significantly superior/equal/inferior to the comparing methods on the criterion, based on paired t-test at the 95% significance level.

Intuitively, the cascade structure enables MLDF to utilize label correlations. We therefore design a dedicated experiment to probe how label correlations are exploited. Our layer-wise method gradually considers more complex label correlations by utilizing the lower-layer label representation in higher-layer modelling. Here we deliberately delete a specific label from the first-layer representation, then train the second layer and check the influence of that deletion. If an accuracy decrease on label B is observed after deleting label A, we consider the two labels correlated. The relative normalized decrease indicates the strength of the correlation, and we show the result on the scene dataset in Figure 4. As shown in Figure 4, labels "Sunset" and "Leaf" strongly affect each other, since removing either one's information sharply degrades the performance on the other. Besides, label "Beach" is highly correlated with label "Urban", since the two sometimes co-occur in the scene dataset [Boutell et al., 2004]. This indicates that MLDF utilizes correlations between labels in its inner layers to achieve better performance.
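The deletion experiment above can be sketched as follows. All names here are illustrative assumptions: `model.predict` stands in for the second layer mapping an augmented representation to per-label predictions, and `metric` is any per-label score (higher is better).

```python
import numpy as np

def probe_label_correlation(model, X_repr, y, label_a, metric):
    """Sketch of the deletion probe: zero out label_a's column in the
    first-layer representation, re-evaluate the next layer, and report
    the relative normalized per-label score decrease as the strength of
    correlation between label_a and every other label.
    """
    base = metric(y, model.predict(X_repr))            # per-label scores
    corrupted = X_repr.copy()
    corrupted[:, label_a] = 0.0                        # delete label A's information
    degraded = metric(y, model.predict(corrupted))
    decrease = base - degraded                         # score drop per label
    return decrease / np.maximum(np.abs(base), 1e-12)  # relative normalized decrease
```

A large entry at position B in the returned vector then suggests that label B's prediction relies on label A's first-layer output, i.e., the two labels are correlated.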

4.6 Flexibility

In the previous experiments, RF-PCT and ERF-PCT are the forest blocks in MLDF, which achieve the best performance. A natural question is whether the forest block can be replaced by other multi-label tree-based methods in MLDF, and how the performance would change. To investigate this, we use one RFML-C4.5 and one ERFML-C4.5 in MLDF. To ensure fairness, we keep all other configurations the same as those in Section 4.2.

Table 5 shows the results of MLDF based on RFML-C4.5. Comparing with the results in Table 4, we count the times that MLDF wins/draws/loses. MLDF wins 100.0%, 82.2%, 95.6%, 93.3%, 82.2%, and 88.9% of the 45 comparisons on the six evaluation measures, respectively. MLDF based on RFML-C4.5 thus also achieves the best performance among all compared methods. Hence, whether based on RF-PCT or RFML-C4.5, MLDF achieves the best performance, indicating that MLDF has good flexibility.
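The pluggability can be sketched as follows. The code is purely illustrative: RF-PCT and RFML-C4.5 have no standard library implementation, so a dummy forest stands in for any block that exposes the fit/predict interface the cascade relies on.

```python
class DummyForest:
    """Stand-in for a multi-label forest block (e.g., RF-PCT or
    RFML-C4.5); only the interface matters to the cascade."""
    def __init__(self, n_trees=100):
        self.n_trees = n_trees

    def fit(self, X, y):
        return self

    def predict_proba(self, X):
        # placeholder per-label probabilities for a 2-label problem
        return [[0.5, 0.5] for _ in X]

# Registry of interchangeable blocks: the cascade depends only on the
# fit/predict_proba interface, so new tree methods plug in freely.
FOREST_BLOCKS = {
    "rf-pct": lambda: DummyForest(n_trees=100),
    "rfml-c4.5": lambda: DummyForest(n_trees=100),
}

def build_cascade(block_name, n_layers):
    factory = FOREST_BLOCKS[block_name]
    return [factory() for _ in range(n_layers)]
```

This separation of cascade logic from the forest block is what lets MLDF swap RF-PCT for RFML-C4.5 without touching the two measure-aware mechanisms.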

Figure 4: Effect of deleting each label's representation information. The scene dataset has six labels (top-to-bottom, left-to-right): "Beach", "Leaf", "Sunset", "Field", "Urban" and "Mountain".

5 Conclusion

In this paper, we introduce the deep forest framework to multi-label learning for the first time and propose Multi-Label Deep Forest (MLDF). Thanks to the two measure-aware mechanisms, measure-aware feature reuse and measure-aware layer growth, our proposal can optimize different multi-label measures on user demand, reduce the risk of overfitting, and achieve the best results on a variety of benchmark datasets.

In the future, the efficiency of MLDF could be further improved by reusing some components during forest training. Although our method can exploit high-order label correlations, we only demonstrate second-order correlations in the experiments; we will seek a way to interpret how high-order correlations are used. Furthermore, we plan to embed extreme multi-label tree methods such as FastXML [Prabhu and Varma, 2014] into MLDF and test its performance on extreme-scale multi-label problems.


  • Belanger et al. [2017] D. Belanger, B. Yang, and A. McCallum. End-to-End Learning for Structured Prediction Energy Networks. In ICML, pages 429–439, 2017.
  • Blockeel et al. [1998] H. Blockeel, L. D. Raedt, and J. Ramon. Top-Down Induction of Clustering Trees. In ICML, pages 55–63, 1998.
  • Boutell et al. [2004] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
  • Cheng et al. [2017] X. Cheng, S.-G. Zhao, X. Xuan, and K.-C. Chou. iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals. Oncotarget, 8(35):58494, 2017.
  • Clare and King [2001] A. Clare and R. D. King. Knowledge discovery in multi-label phenotype data. In PKDD, pages 42–53, 2001.
  • Feng and Zhou [2018] J. Feng and Z.-H. Zhou. Autoencoder by Forest. In AAAI, pages 2967–2973, 2018.
  • Geurts et al. [2006] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
  • Hinton and Salakhutdinov [2006] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • Kocev and Ceci [2015] D. Kocev and M. Ceci. Ensembles of extremely randomized trees for multi-target regression. In Discovery Science, pages 86–100, 2015.
  • Kocev et al. [2013] D. Kocev, C. Vens, J. Struyf, and S. Džeroski. Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3):817–833, 2013.
  • Madjarov et al. [2012] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9):3084–3104, 2012.
  • Nam et al. [2014] J. Nam, J. Kim, E. Loza Mencía, I. Gurevych, and J. Fürnkranz. Large-scale multi-label text classification - revisiting neural networks. In ECML, pages 437–452, 2014.
  • Pang et al. [2018] M. Pang, K.-M. Ting, P. Zhao, and Z.-H. Zhou. Improving Deep Forest by Confidence Screening. In ICDM, pages 1194–1199, 2018.
  • Platt et al. [1999] J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
  • Prabhu and Varma [2014] Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, pages 263–272, 2014.
  • Ray et al. [2018] J. Ray, H. Wang, D. Tran, Y. Wang, M. Feiszli, L. Torresani, and M. Paluri. Scenes-Objects-Actions: A Multi-Task, Multi-Label Video Dataset. In ECCV, pages 660–676, 2018.
  • Read et al. [2011] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
  • Read et al. [2016] J. Read, P. Reutemann, B. Pfahringer, and G. Holmes. MEKA: A multi-label/multi-target extension to Weka. Journal of Machine Learning Research, 17(21):1–5, 2016.
  • Schapire and Singer [2000] R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine learning, 39(2-3):135–168, 2000.
  • Tsoumakas and Katakis [2007] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
  • Tsoumakas and Vlahavas [2007] G. Tsoumakas and I. P. Vlahavas. Random k -labelsets: An ensemble method for multilabel classification. In ECML, pages 406–417, 2007.
  • Tsoumakas et al. [2010] G. Tsoumakas, I. Katakis, and I. P. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. 2010.
  • Wang et al. [2016] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. CNN-RNN: A unified framework for multi-label image classification. In CVPR, pages 2285–2294, 2016.
  • Wu and Zhou [2017] X.-Z. Wu and Z.-H. Zhou. A Unified View of Multi-Label Performance Measures. In ICML, pages 3780–3788, 2017.
  • Zhang and Zhou [2006] M.-L. Zhang and Z.-H. Zhou. Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338–1351, 2006.
  • Zhang and Zhou [2014] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
  • Zhang et al. [2018a] Q.-W. Zhang, Y. Zhong, and M.-L. Zhang. Feature-induced labeling information enrichment for multi-label learning. In AAAI, pages 4446–4453, 2018a.
  • Zhang et al. [2018b] Y.-L. Zhang, J. Zhou, W. Zheng, J. Feng, L. Li, Z. Liu, M. Li, Z. Zhang, C. Chen, X. Li, and Z.-H. Zhou. Distributed Deep Forest and its Application to Automatic Detection of Cash-out Fraud. CoRR, abs/1805.04234, 2018b.
  • Zhao et al. [2015] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pages 1556–1564, 2015.
  • Zhou and El-Gohary [2015] P. Zhou and N. El-Gohary. Ontology-based multilabel text classification of construction regulatory documents. Journal of Computing in Civil Engineering, 30(4):04015058, 2015.
  • Zhou [2012] Z.-H. Zhou. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.
  • Zhou and Feng [2017] Z.-H. Zhou and J. Feng. Deep Forest: Towards An Alternative to Deep Neural Networks. In IJCAI, pages 3553–3559, 2017.
  • Zhou and Feng [2018] Z.-H. Zhou and J. Feng. Deep Forest. National Science Review, page nwy108, 2018.