Partial Multi-label Learning with Label and Feature Collaboration

03/17/2020 ∙ by Tingting Yu, et al. ∙ Southwest University 7

Partial multi-label learning (PML) models the scenario where each training instance is annotated with a set of candidate labels, of which only some are relevant. The PML problem is practical in real-world scenarios, since it is often difficult, or even impossible, to obtain precisely labeled samples. Several PML solutions have been proposed to reduce the risk of being misled by the irrelevant labels concealed in the candidate labels, but they generally rely on the smoothness assumption in the feature space or the low-rank assumption in the label space, while ignoring the negative information between features and labels. Specifically, if two instances have largely overlapped candidate labels, their ground-truth labels should be similar irrespective of their feature similarity; if they are dissimilar in both the feature and candidate label spaces, their ground-truth labels should be dissimilar. To achieve a credible predictor on PML data, we propose a novel approach called PML-LFC (Partial Multi-label Learning with Label and Feature Collaboration). PML-LFC estimates the confidence values of relevant labels for each instance using the similarity from both the label and feature spaces, and trains the desired predictor with the estimated confidence values. PML-LFC obtains the predictor and the latent label matrix in a mutually reinforcing manner via a unified model, and develops an alternating optimization procedure to optimize them. Extensive empirical study on both synthetic and real-world datasets demonstrates the superiority of PML-LFC.







1 Introduction

Multi-label learning (MLL) deals with the scenario where each instance is annotated with a set of discrete non-exclusive labels [29, 6]. Recent years have witnessed increasing research on and application of MLL in various domains, such as image annotation [20], cybersecurity [7], gene functional annotation [24], and so on. Most MLL methods implicitly assume that each training example is precisely annotated with all of its relevant labels. However, it is difficult and costly to obtain fully annotated training examples in most real-world MLL tasks [17]. Therefore, recent MLL methods focus not only on how to assign a set of appropriate labels to unlabeled examples using label correlations [27, 31], but also on replenishing missing labels for incompletely labeled samples [14, 20, 16].

Existing MLL solutions still overlook another fact that naturally arises in real-world scenarios. For example, the image in Fig. 1 was crowd-annotated by workers with 'Seaside', 'Sky', 'Sandbeach', 'Cloud', 'Tree', 'People', 'Sunset' and 'Ship'. Among these labels, the first five are relevant and the last three are irrelevant to the image. Obviously, the training procedure is prone to be misled by irrelevant labels concealed in the candidate labels of training samples. To address this difficulty, some pioneering works term learning from such training data with irrelevant labels Partial Multi-label Learning (PML) [21, 23], and proposed several PML approaches [19, 5, 13] to identify the irrelevant labels concealed in the candidate labels of annotated samples, and to achieve a predictor robust (or less prone) to irrelevant labels in the training data.

Figure 1: An exemplary partial multi-label learning scenario. The image is annotated with eight candidate labels, only the first five (in red) are relevant, and the last three (in black) are irrelevant.

However, contemporary approaches mainly focus either on the smoothness assumption that (dis)similar instances should have (dis)similar ground-truth labels [21, 19, 5], or on the low-rank assumption that the ground-truth label matrix should be low-rank [23, 13]. Neither assumption handles well the case of two instances that share no candidate label yet have high feature similarity, or two instances with some overlapped candidate labels but low feature similarity. As a result, the smoothness-based methods [21, 19, 5] ignore the negative information carried by two instances with high (low) feature similarity but low (high) semantic similarity in the candidate labels. In other words, these methods do not make good collaborative use of the information from the label and feature spaces for PML. For example, in Fig. 2(a) and Fig. 2(d), two instances have not only a high (low) feature similarity, but also a high (low) semantic similarity due to largely overlapped (non-overlapped) candidate labels. From the setting of PML, these two instances are likely to have overlapped (non-overlapped) ground-truth labels. On the other hand, if two instances share no candidate label (i.e., zero semantic similarity), their ground-truth labels should be non-overlapped (as Fig. 2(b) shows). Besides, if two instances have a low feature similarity but a high semantic similarity (as Fig. 2(c) shows), their ground-truth labels can overlap to some extent.

(a) Two instances with both high feature and semantic similarity.
(b) Two instances with high feature similarity but with low (zero) semantic similarity.
(c) Two instances with low (moderate) feature similarity but with high semantic similarity.
(d) Two instances with low feature similarity and low semantic similarity.
Figure 2: The feature and label information of PML data in four scenarios. The semantic similarity is computed on the candidate labels of instances; the feature similarity is computed on their feature vectors. Ground-truth labels are highlighted in red, while other candidate labels are in grey.

Given these observations, we introduce Partial Multi-label Learning with Label and Feature Collaboration (PML-LFC). PML-LFC first learns a linear predictor with respect to a latent ground-truth label matrix, and induces a low-rank constraint on the coefficient matrix of the predictor to account for the label correlations of multi-label data. Then, it computes the feature similarity between instances and the semantic similarity between their candidate label vectors, and forces the inner-product similarity of the latent label vectors to be consistent with the feature and semantic similarities. In this way, both the label and feature information are collaboratively used to induce the latent label vectors, and the four scenarios illustrated in Fig. 2 are jointly modelled. PML-LFC finally obtains the predictor and the latent label matrix in a mutually reinforcing manner via a unified model, and develops an alternating optimization procedure to optimize them.

The main contributions of this paper are summarized as follows:

(i) We introduce PML-LFC to jointly leverage the label and feature information of partial multi-label data to induce a credible multi-label classifier. Existing PML solutions either isolate the usage of label and feature information, or ignore the negative information that two instances with high (low) feature similarity may have a low (high) semantic similarity in the candidate labels.

(ii) PML-LFC unifies the predictor training on PML data and the latent label matrix exploration in a single objective, and introduces an alternating optimization procedure to jointly optimize the predictor and the latent label matrix in a mutually beneficial manner.
(iii) Empirical study on public multi-label datasets shows that PML-LFC significantly outperforms related and competitive methods: fPML [23], PML-LRS [13], DRAMA [19], PARTICLE [5], and two classical multi-label classifiers (RankSVM [4] and ML-KNN [28]). In addition, the collaboration between labels and features contributes to the improved performance.

The remainder of this paper is organized as follows. Section 2 clarifies the difference between our problem and multi-label learning and partial-label learning, and then reviews the latest PML methods. Section 3 elaborates on PML-LFC and its optimization procedure. The experimental setup and results are provided and analyzed in Section 4. Conclusions and future work are summarized in Section 5.

2 Related Work

PML is different from multi-label crowd consensus learning [9, 17, 25], which aims to obtain high-quality consensus annotations of repeatedly annotated instances; PML does not assume such repeated annotations of the same instances. PML is unlike multi-label weak-label learning [14, 16], which focuses on learning from annotated training data with incomplete (missing) labels. PML is also different from the popular partial-label learning (PLL) [3, 26], which assumes only one label from the candidate labels of a sample is the ground truth and aims to induce a multi-class predictor that assigns one label to an unseen sample; PLL can be viewed as a degenerate version of PML. We observe that PML is more difficult than the typical MLL and PLL problems, since the ground-truth labels of samples are not directly accessible for training the predictor, and a set of discrete non-exclusive labels must be carefully assigned. To be self-contained and keep the reader informed, we give a brief review of popular PML solutions.

[21] introduced two PML approaches (PML-fp and PML-lc) to elicit the ground-truth labels by minimizing the confidence-weighted ranking loss between candidate and non-candidate labels. PML-fp focuses on the feature information of training data, while PML-lc focuses on the label correlations. To mitigate the negative impact of irrelevant labels in the training phase, [5] proposed a two-stage approach (PARTICLE), which first elicits credible labels via iterative label propagation, and then uses the elicited labels to induce a multi-label classifier with virtual label splitting (PARTICLE-VLS) or maximum a posteriori reasoning (PARTICLE-MAP). [19] introduced another two-stage PML approach (DRAMA) that first estimates a confidence value for each label, utilizing the feature manifold to indicate how likely a label is to be correct, and then induces a gradient boosting model to fit the label confidences, exploring the label correlations with the previously elicited labels in each boosting round. Due to the isolation between label elicitation and classifier training, the elicited labels may not be compatible with the classifier.

[23] introduced a feature-induced PML solution (fPML), which coordinately factorizes the observed sample-label association matrix and the sample-feature matrix into low-rank matrices, and then reconstructs the sample-label matrix to identify irrelevant labels. At the same time, fPML optimizes a compatible predictor based on the reconstructed sample-label matrix. Similarly, [13] assumed the observed label matrix is the combination of a low-rank ground-truth label matrix and a sparse irrelevant label matrix, and introduced a solution called PML-LRS.

These state-of-the-art PML solutions mainly focus either on the feature manifold assumption that similar instances have similar labels, or on the assumption that the ground-truth label matrix is low-rank. To some extent, they still isolate the joint effect of features and labels for effective partial multi-label learning. The (latent) labels of instances depend on the features of these instances [30, 10], and the semantic similarity derived from the label sets of multi-label instances is positively correlated with the feature similarity between them [18, 24, 15]. Both the label and feature information of multi-label data should be well accounted for in effective learning on PML data. Given that, we introduce PML-LFC to collaboratively use the feature and label information, as detailed in the next section.

3 Proposed method

Let X = R^d denote the d-dimensional feature space, and Y = {l_1, l_2, ..., l_q} denote the label space with q distinct labels. Given a PML dataset D = {(x_i, Y_i)}_{i=1}^n, where x_i ∈ R^d is the feature vector of the i-th sample and Y_i ⊆ Y is the set of candidate labels currently annotated to x_i, the key characteristic of PML is that only a subset of labels Ỹ_i ⊆ Y_i are the ground-truth labels of x_i, while the others (Y_i \ Ỹ_i) are irrelevant to x_i. However, Ỹ_i is not directly accessible to the predictor. The target of PML is to induce a multi-label classifier from D. A naive PML solution is to divide the PML problem into q binary sub-problems and then adopt a noisy-label-resistant learning algorithm [12, 11]. But this naive solution generally suffers from the label sparsity of multi-label data, where each instance is typically annotated with only a few labels of the whole label space and each label is annotated to only a small portion of instances; moreover, it disregards the correlations between labels. Another straightforward solution is to take the candidate labels as ground-truth labels and apply off-the-shelf MLL algorithms [6] to train the predictor. However, such a predictor will be seriously misled by the false positive labels among the candidate labels.

To bypass the lack of known ground-truth labels of training instances, we take P ∈ [0,1]^{n×q} as the latent label confidence matrix, where P_ij reflects the confidence of the j-th label being a ground-truth label of the i-th instance. Unlike existing two-stage approaches [5, 19], which first estimate the credible labels and then train the predictor on the estimated labels, we integrate the estimation of the label confidence matrix and the predictor learning into a unified framework as follows:

min_{W,P} L(XW, P) + λ·Ω(W) + β·R(P)    (1)

where L(·,·) denotes the loss function, Ω(W) controls the complexity of the prediction model, R(P) is the regularization term for the label confidence matrix P, and λ and β are the trade-off parameters for the last two terms. In this unified formulation, the model is learned from the confidence label matrix P rather than from the original noisy label matrix Y. Therefore, the key is how to obtain a reliable confidence matrix P.

In this paper, we propose to train the predictor with the widely-used least-squares loss to fit the confidence label matrix:

L(XW, P) = ||XW − P||_F^2    (2)

where W ∈ R^{d×q} is the coefficient matrix of the predictor and X ∈ R^{n×d} stacks the feature vectors of the training instances. It is recognized that the labels of multi-label instances are correlated, so the label matrix of the instances should be low-rank [22, 23]. Given that, we instantiate the regularization on the predictor as a low-rank constraint on W:

Ω(W) = rank(W)    (3)
The main bottleneck of the PML problem is the lack of the ground-truth labels of training instances. To overcome this bottleneck, most efforts operate in the feature space based on the assumption that similar (dissimilar) instances have similar (dissimilar) label assignments. They adopt manifold regularization [1] derived from the feature similarity to refine the labels of PML data, and then induce a predictor on the refined labels [5, 19]. Some efforts work in the label space using the knowledge that the latent ground-truth label matrix should be low-rank [13, 23], or that the relevant labels of an instance are hidden in the candidate label set and should be ranked ahead of irrelevant ones outside the candidate set [21]. Although these efforts leverage the feature information to identify the relevant/irrelevant labels of training instances to some extent, they are still inclined to assign similar labels to two instances with high feature similarity but without any common candidate label. From the definition of PML, the ground-truth labels of training instances are hidden in the collected candidate label set. In other words, if two annotated instances do not share any candidate label, then there is no overlap between their ground-truth label sets (as Fig. 2(b) shows). Besides, these methods also prefer to assign different label sets to two instances without sufficiently large feature similarity but with largely overlapped candidate labels (as shown in Fig. 2(c)). In summary, contemporary PML approaches do not sufficiently use the negative information that two instances with high (low) feature similarity may have a low (high) semantic similarity in the candidate labels, since they do not collaboratively use the feature and label information in a coherent way.

To remedy this issue, we specify the last term in Eq. (1) as follows:

R(P) = ||PP^T − S^f ∘ S^l||_F^2,  s.t. P ≥ 0, P·1_q = 1_n    (4)

where S^f_ij represents the feature similarity between x_i and x_j, S^l_ij reflects the semantic similarity derived from the candidate labels of these two instances, and ∘ is the Hadamard product. The first constraint guarantees that each candidate label has a non-negative confidence value, and the second restricts each confidence value to [0,1] with the confidence values of each instance summing to 1. We can find that: (i) if two instances have high values of both S^f_ij and S^l_ij, then their confidence vectors p_i and p_j should be close to each other; (ii) if two instances have a large (or moderate) value of S^l_ij, then p_i and p_j can still overlap to some extent; (iii) if two instances have a zero (or low) value of S^l_ij and a low value of S^f_ij, then p_i and p_j should not overlap. Our minimization of Eq. (4) jointly considers the above cases. In contrast, contemporary PML methods ignore the semantic similarity between instances and do not make effective use of the negative information in the last two cases. We remark that, given the existence of irrelevant labels of training instances, it is not easy to quantify and leverage the important label correlations for partial multi-label learning. The semantic similarity quantifies the similarity between instances based on the pattern that two (or more) labels co-annotate the same instances, and this pattern is transferred to the latent confidence label matrix P. In addition, this pattern transfer is coordinated by the low-rank constraint on the coefficient matrix W and by the feature similarity, which alleviates the negative impact of irrelevant labels on quantifying the semantic similarity. In this way, the information sources from the feature and label spaces are jointly used to guide the latent label matrix learning, which rewards a credible multi-label predictor.

Here, we initialize the label confidence matrix P as:

P_ij = 1/|Y_i| if l_j ∈ Y_i, and P_ij = 0 otherwise    (5)
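As a concrete reference, this initialization can be sketched in a few lines of NumPy; a minimal sketch assuming, as the sum-to-one constraint suggests, that confidence is spread uniformly over each instance's candidate labels:

```python
import numpy as np

def init_confidence(Y):
    """Initialize the label confidence matrix P from the candidate label
    matrix Y (n x q, Y[i, j] = 1 iff label j is a candidate of instance i):
    spread confidence uniformly over each instance's candidate labels."""
    Y = np.asarray(Y, dtype=float)
    return Y / Y.sum(axis=1, keepdims=True)

P0 = init_confidence(np.array([[1, 1, 0, 0],
                               [0, 1, 1, 1]]))
```

Each row of P0 then sums to one, with zero confidence on non-candidate labels.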
To quantify the feature similarity between instances, we adopt the widely-used Gaussian heat kernel similarity as follows:

S^f_ij = exp(−||x_i − x_j||^2 / (2σ^2))    (6)

where σ denotes the kernel width and is set empirically. Clearly, 0 < S^f_ij < 1 when no two instances are identical. We remark that other similarity metrics can also be adopted here; our choice of the Gaussian heat kernel is for its simplicity and wide application.
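A small NumPy sketch of the pairwise Gaussian kernel; the default kernel width below (the mean pairwise distance) is a common heuristic and our assumption, since the text only states that the width is set empirically:

```python
import numpy as np

def feature_similarity(X, sigma=None):
    """Gaussian heat kernel similarity S^f between all instance pairs."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances, clipped at 0 for numerical safety.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    if sigma is None:
        d = np.sqrt(d2)
        sigma = d[np.triu_indices_from(d, k=1)].mean()  # heuristic width
    return np.exp(-d2 / (2 * sigma ** 2))
```

The resulting matrix is symmetric with ones on the diagonal and values in (0, 1] elsewhere.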

Diverse similarity metrics can be adopted to quantify the semantic similarity between multi-label instances [18, 24]; here we use the cosine similarity as follows:

S^l_ij = (y_i^T y_j) / (||y_i|| · ||y_j||)    (7)

where y_i ∈ {0,1}^q is the candidate-label indicator vector of x_i, with y_ij = 1 if the j-th label is annotated to x_i and y_ij = 0 otherwise. Obviously, S^l_ij ∈ [0,1]: it takes a large value when two instances have a large portion of overlapped candidate labels, a moderate value when they share some overlapped candidate labels, and zero when they do not share any candidate label.
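In code, the semantic similarity is just row-normalized candidate-label vectors multiplied together (a minimal sketch):

```python
import numpy as np

def semantic_similarity(Y):
    """Cosine similarity S^l between candidate-label indicator vectors.
    Instances with no shared candidate label get similarity 0."""
    Y = np.asarray(Y, dtype=float)
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Yn = Y / np.maximum(norms, 1e-12)  # guard against empty label sets
    return Yn @ Yn.T
```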

Based on the above analysis, we can instantiate PML-LFC as follows:

min_{W,P} ||XW − P||_F^2 + λ·rank(W) + β·||PP^T − S^f ∘ S^l||_F^2
s.t. P ≥ 0, P·1_q = 1_n    (8)

The problem above can be further rewritten as follows:

min_{W,P} ||XW − P||_F^2 + λ·rank(W) + β·||PP^T − S^f ∘ S^l||_F^2
s.t. P ≥ 0, P·1_q = 1_n, P = P ∘ A    (9)

where A_ij = 1 if l_j ∈ Y_i and A_ij = 0 otherwise, and ∘ is the Hadamard product. However, the rank function in Eq. (9) is hard to optimize, so the nuclear norm is adopted as a surrogate of the rank function. Therefore, Eq. (9) is reformulated as follows:

min_{W,P} ||XW − P||_F^2 + λ·||W||_* + β·||PP^T − S^f ∘ S^l||_F^2
s.t. P ≥ 0, P·1_q = 1_n, P = P ∘ A    (10)
3.1 Optimization

The optimization problem in Eq. (10) is non-convex with respect to W and P jointly, so we apply an alternating optimization procedure to approximate them: we alternately optimize one variable while fixing the other as a constant. The detailed procedure is presented below.

Update W: With P fixed, Eq. (10) with respect to W is equivalent to the following problem:

min_W ||XW − P||_F^2 + λ·||W||_*    (11)
The minimization of Eq. (11) is a trace-norm minimization problem, which is time-consuming. To reduce the computation time, we use the Accelerated Gradient Descent (AGD) algorithm [8] to optimize W, as summarized in Algorithm 1.

In particular, the smooth part f(W), its gradient, and the proximal operator used in Algorithm 1 are defined as follows:

f(W) = ||XW − P||_F^2,  ∇f(W) = 2X^T(XW − P),
prox_{λ/L}(W) = U·max(Σ − (λ/L)I, 0)·V^T, with W = UΣV^T the SVD of W

where L is the Lipschitz constant of ∇f(W).
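To make the W-step concrete, here is a hedged sketch of one proximal-gradient iteration for Eq. (11): a gradient step on the least-squares term followed by singular value thresholding (the proximal operator of the nuclear norm). This is the plain, non-accelerated variant; Algorithm 1 additionally maintains the AGD momentum sequence.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the prox of tau * ||.||_* at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def w_step(W, X, P, lam, L):
    """One proximal gradient step for min_W ||XW - P||_F^2 + lam*||W||_*,
    with step size 1/L (L at least the Lipschitz constant 2*||X^T X||_2)."""
    grad = 2.0 * X.T @ (X @ W - P)  # gradient of the smooth part
    return svt(W - grad / L, lam / L)
```

Iterating this step with a valid step size monotonically decreases the objective of Eq. (11).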
Update P: With W fixed, Eq. (10) with respect to P reduces to:

min_P ||XW − P||_F^2 + β·||PP^T − S^f ∘ S^l||_F^2,  s.t. P ≥ 0, P·1_q = 1_n, P = P ∘ A    (16)

By introducing the Lagrange multiplier μ for the sum-to-one constraint, Eq. (16) is equivalent to:

Φ(P, μ) = ||XW − P||_F^2 + β·||PP^T − S^f ∘ S^l||_F^2 + μ^T(P·1_q − 1_n)    (17)

where 1_n (1_q) is the n- (q-)dimensional column vector with all ones. The gradient with respect to P is:

∇_P Φ = 2(P − XW) + 4β(PP^T − S^f ∘ S^l)P + μ·1_q^T    (18)
We can use the Karush-Kuhn-Tucker (KKT) conditions [2] for the nonnegativity of P:

[∇_P Φ]_ij · P_ij = 0    (19)

Let ∇ = ∇⁺ − ∇⁻, where ∇⁺_ij = (|∇_ij| + ∇_ij)/2 and ∇⁻_ij = (|∇_ij| − ∇_ij)/2 for any matrix ∇. Eq. (19) can then be rewritten as:

(∇⁺ − ∇⁻)_ij · P_ij = 0    (20)
Eq. (20) leads to the following multiplicative update formula:

P_ij ← P_ij · (∇⁻_ij / ∇⁺_ij)    (21)
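A hedged NumPy sketch of the P-step: compute the gradient, split it into elementwise positive and negative parts, and apply the multiplicative update. The gradient below assumes only the least-squares and similarity-consistency terms (the full gradient also carries the Lagrangian term for the sum-to-one constraint, which we approximate here by renormalizing the rows afterwards):

```python
import numpy as np

def p_step(P, X, W, S, beta, eps=1e-12):
    """One multiplicative KKT-style update of the confidence matrix P,
    assuming the objective ||XW - P||_F^2 + beta * ||P P^T - S||_F^2."""
    grad = 2.0 * (P - X @ W) + 4.0 * beta * (P @ P.T - S) @ P
    grad_pos = (np.abs(grad) + grad) / 2.0  # elementwise positive part
    grad_neg = (np.abs(grad) - grad) / 2.0  # elementwise negative part
    P = P * grad_neg / np.maximum(grad_pos, eps)
    # Row-normalize as a cheap stand-in for the sum-to-one constraint.
    return P / np.maximum(P.sum(axis=1, keepdims=True), eps)
```

Because the update is multiplicative, entries initialized at zero (non-candidate labels) stay at zero.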
Algorithm 2 summarizes the pseudo-code of PML-LFC. We observe that PML-LFC needs at most ten iterations to converge on the datasets used.

Input: X, P, λ.
Output: W.

1: Initialize W_0 = W_1, b_0 = b_1 = 1, the Lipschitz constant L, t = 1
2: repeat
3:  Set Q_t = W_t + ((b_{t−1} − 1)/b_t)·(W_t − W_{t−1})
4:  Compute ∇f(Q_t) = 2X^T(XQ_t − P)
5:  Set W_{t+1} = prox_{λ/L}(Q_t − ∇f(Q_t)/L)
6:  Set b_{t+1} = (1 + sqrt(4b_t^2 + 1))/2 and update t = t + 1
7: until convergence
Algorithm 1 Optimization of W

Input:
      X: instance-feature matrix;
      Y: instance-label association matrix;
      λ, β: scalar input parameters.
Output:
      Prediction coefficients W.

1: Initialize P by Eq. (5);
2: repeat
3:  Seek the optimal W by optimizing Eq. (11) with Algorithm 1;
4:  Fix W, update P by optimizing Eq. (16);
5: until convergence or the allowed number of iterations is reached
Algorithm 2 PML-LFC: Partial Multi-label Learning with Label and Feature Collaboration
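Putting the pieces together, a compact end-to-end toy implementation of the alternating scheme in Algorithm 2 might look as follows. All instantiations (least-squares loss, nuclear-norm proximal W-step, multiplicative P-step, row renormalization for the sum-to-one constraint, and a fixed kernel width) are our assumptions for illustration, not the authors' code:

```python
import numpy as np

def pml_lfc(X, Y, lam=1.0, beta=1.0, n_iter=10, sigma=1.0):
    """Toy sketch of PML-LFC (Algorithm 2).
    X: n x d feature matrix; Y: n x q candidate-label matrix (0/1)."""
    X = np.asarray(X, float)
    Y = np.asarray(Y, float)
    n, q = Y.shape
    # Feature similarity (Gaussian kernel) and semantic similarity (cosine).
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    Sf = np.exp(-d2 / (2 * sigma ** 2))
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    S = Sf * (Yn @ Yn.T)                        # Hadamard combination
    P = Y / Y.sum(axis=1, keepdims=True)        # uniform init over candidates
    W = np.zeros((X.shape[1], q))
    L = 2 * np.linalg.norm(X.T @ X, 2) + 1e-12  # Lipschitz constant
    for _ in range(n_iter):
        # W-step: proximal gradient on ||XW - P||_F^2 + lam * ||W||_*.
        for _ in range(20):
            G = W - 2 * X.T @ (X @ W - P) / L
            U, s, Vt = np.linalg.svd(G, full_matrices=False)
            W = U @ np.diag(np.maximum(s - lam / L, 0.0)) @ Vt
        # P-step: multiplicative KKT-style update, kept on candidate labels.
        grad = 2 * (P - X @ W) + 4 * beta * (P @ P.T - S) @ P
        gp = (np.abs(grad) + grad) / 2
        gn = (np.abs(grad) - grad) / 2
        P = P * gn / np.maximum(gp, 1e-12)
        P = P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)
    return W, P
```

At prediction time, scores for an unseen instance x are given by x @ W.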

4 Experiments

4.1 Experimental Setup

Datasets: For a quantitative performance evaluation, five synthetic and three real-world PML datasets are used in the experiments. Table 1 summarizes the characteristics of these datasets. To create a synthetic PML dataset, we take the current labels of instances as the ground-truth ones. For each instance, we then randomly insert irrelevant labels, whose number is r times the number of its ground-truth labels, and we vary the noisy label ratio r in the range of {10%, 50%, 100%, 200%}. All the datasets are randomly partitioned into 80% for training and the remaining 20% for testing. We repeat all the experiments 10 times independently and report the average results with standard deviations.
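The synthetic-corruption protocol can be sketched as follows (a hedged reading of the setup: for each instance we add r times as many randomly chosen irrelevant labels as it has ground-truth labels):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noisy_labels(Y_true, r, rng=rng):
    """Turn a ground-truth label matrix into PML candidate labels by
    randomly inserting irrelevant labels: for each instance, flip on
    round(r * #ground-truth-labels) of its currently negative labels
    (r is the noisy-label ratio, e.g. 0.5 for 50%)."""
    Y = np.asarray(Y_true).copy()
    for i in range(Y.shape[0]):
        neg = np.flatnonzero(Y[i] == 0)
        k = min(len(neg), int(round(r * Y[i].sum())))
        if k > 0:
            Y[i, rng.choice(neg, size=k, replace=False)] = 1
    return Y
```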

Comparing methods: Four representative PML algorithms, fPML [23], PML-LRS [13], DRAMA [19] and PARTICLE-VLS [5], are used as comparing methods. DRAMA and PARTICLE-VLS mainly utilize the feature similarity between instances, while fPML and PML-LRS build on the low-rank assumption of the label matrix; fPML additionally explores the coherence between the label and feature data matrices. In addition, two representative MLL solutions (ML-KNN [28] and RankSVM [4]) are included as baselines for comparative analysis; these two methods directly take the candidate labels as ground truths to train their predictors. For the comparing methods, parameter configurations are fixed or optimized following the suggestions in the original codes or papers. For our PML-LFC, we fix λ = 10 and β = 10; the parameter sensitivity of λ and β is analyzed later.
Evaluation metrics: For a comprehensive performance evaluation and comparison, we adopt five widely-used multi-label evaluation metrics: hamming loss (HammLoss), one-error (OneError), coverage (Coverage), ranking loss (RankLoss) and average precision (AvgPrec). The formal definitions of these metrics can be found in [29, 6]. Note that coverage is normalized by the number of distinct labels, so it ranges in [0,1]. The larger the value of average precision, the better the performance; the opposite holds for the other four metrics.
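For reference, two of the five metrics are easy to state in code; the normalization of coverage by the number of distinct labels q (so it falls in [0, 1]) is shown explicitly. These are minimal sketches of the standard definitions, not the exact formulas from [29, 6]:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are misclassified."""
    return np.mean(Y_true != Y_pred)

def coverage(Y_true, scores):
    """Normalized coverage: how far, on average, one must go down the
    score-ranked label list to cover all relevant labels of an instance,
    divided by the number of distinct labels q."""
    n, q = scores.shape
    order = np.argsort(-scores, axis=1)          # best label first
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, q + 1)
    worst = np.where(Y_true == 1, ranks, 0).max(axis=1)
    return np.mean((worst - 1) / q)
```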

Data set Instances Features Labels avgGLs Noise
slashdot 3782 1079 22 1.181 -
scene 2407 294 6 1.074 -
enron 1702 1001 53 3.378 -
medical 978 1449 45 1.245 -
Corel5k 5000 499 374 0.245 -
YeastBP 6139 6139 217 5.537 2385
YeastCC 6139 6139 50 1.348 260
YeastMF 6139 6139 39 1.005 234
Table 1: Characteristics of the experimental datasets. 'Instances' is the number of examples, 'Features' is the number of features, 'Labels' is the number of distinct labels, 'avgGLs' is the average number of ground-truth labels per instance, and 'Noise' is the number of noisy labels of the dataset.

4.2 Results and Analysis

Table 2 reports the detailed experimental results of the six comparing algorithms with a noisy label ratio of 50%; similar observations hold for the other noisy label ratios. The first stage of DRAMA and PARTICLE-VLS utilizes the feature similarity to optimize the ground-truth label confidence in different ways. However, because the features of the three real-world PML datasets are protein-protein interaction networks, we directly use the network structure to optimize the ground-truth label confidence matrix in the first stage of the respective algorithms. In the second stage, PARTICLE-VLS introduces a virtual-label technique to transform the problem into multiple binary training sets, which results in a class-imbalance problem and causes computation exceptions on the sparse biological network data. For this reason, its results on the last three datasets cannot be reported. Due to the page limit, we summarize the win/tie/loss counts of our method versus each comparing method in 23 cases (five synthetic datasets × four ratios of noisy labels, plus three real-world PML datasets) across five evaluation metrics in Table 3.

HammLoss 0.078±0.005 0.184±0.006 0.044±0.000 0.043±0.000 0.052±0.000 0.053±0.001 0.073±0.000
RankLoss 0.161±0.002 0.053±0.000 0.153±0.006 0.127±0.006 0.118±0.000 0.305±0.032 0.110±0.007
OneError 0.534±0.005 0.680±0.015 0.446±0.012 0.480±0.013 0.404±0.001 0.769±0.074 0.393±0.028
Coverage 0.182±0.022 0.120±0.006 0.165±0.007 0.139±0.003 0.133±0.001 0.305±0.031 0.128±0.007
AvgPrec 0.582±0.017 0.472±0.011 0.639±0.007 0.627±0.007 0.686±0.001 0.375±0.036 0.696±0.010
HammLoss 0.272±0.012 0.110±0.013 0.148±0.005 0.167±0.001 0.121±0.000 0.123±0.017 0.146±0.003
RankLoss 0.259±0.015 0.097±0.008 0.124±0.011 0.145±0.005 0.118±0.002 0.110±0.020 0.094±0.004
OneError 0.553±0.009 0.260±0.009 0.314±0.027 0.362±0.009 0.265±0.003 0.251±0.044 0.258±0.007
Coverage 0.232±0.017 0.109±0.011 0.118±0.009 0.136±0.006 0.114±0.001 0.097±0.018 0.093±0.002
AvgPrec 0.635±0.038 0.838±0.016 0.804±0.016 0.774±0.005 0.830±0.001 0.828±0.033 0.843±0.005
HammLoss 0.109±0.006 0.108±0.006 0.060±0.001 0.104±0.002 0.068±0.001 0.064±0.008 0.051±0.001
RankLoss 0.189±0.037 0.054±0.000 0.145±0.009 0.197±0.009 0.143±0.002 0.238±0.037 0.099±0.008
OneError 0.476±0.047 0.323±0.032 0.326±0.036 0.416±0.030 0.260±0.004 0.453±0.102 0.254±0.013
Coverage 0.481±0.038 0.285±0.005 0.369±0.014 0.331±0.016 0.354±0.002 0.451±0.071 0.284±0.010
AvgPrec 0.504±0.053 0.611±0.019 0.613±0.015 0.659±0.008 0.613±0.002 0.466±0.088 0.683±0.009
HammLoss 0.482±0.008 0.070±0.009 0.343±0.034 0.022±0.002 0.015±0.000 0.021±0.001 0.024±0.000
RankLoss 0.018±0.003 0.042±0.006 0.075±0.027 0.046±0.005 0.036±0.003 0.113±0.021 0.036±0.005
OneError 0.169±0.004 0.270±0.020 0.420±0.013 0.216±0.008 0.193±0.008 0.220±0.082 0.199±0.013
Coverage 0.276±0.025 0.095±0.011 0.114±0.027 0.065±0.010 0.058±0.001 0.116±0.020 0.052±0.009
AvgPrec 0.854±0.024 0.766±0.015 0.665±0.018 0.831±0.007 0.839±0.007 0.730±0.022 0.834±0.012
HammLoss 0.081±0.007 0.161±0.005 0.051±0.009 0.010±0.000 0.554±0.000 0.019±0.000 0.010±0.000
RankLoss 0.281±0.006 0.134±0.000 0.063±0.005 0.210±0.008 0.277±0.001 0.326±0.056 0.120±0.006
OneError 0.802±0.007 0.740±0.010 0.639±0.017 0.649±0.008 0.801±0.002 0.855±0.073 0.631±0.010
Coverage 0.391±0.007 0.372±0.010 0.403±0.007 0.470±0.017 0.539±0.003 0.547±0.041 0.281±0.013
AvgPrec 0.292±0.008 0.230±0.003 0.393±0.006 0.286±0.005 0.199±0.008 0.144±0.052 0.312±0.002
HammLoss 0.316±0.005 0.329±0.012 0.071±0.004 0.062±0.000 0.024±0.000
RankLoss 0.025±0.000 0.331±0.007 0.208±0.009 0.161±0.000 0.143±0.002
OneError 0.757±0.008 0.743±0.013 0.682±0.004 0.796±0.002 0.523±0.013
Coverage 0.407±0.010 0.374±0.008 0.312±0.005 0.295±0.002 0.281±0.012
AvgPrec 0.232±0.007 0.242±0.011 0.394±0.012 0.214±0.001 0.411±0.012
HammLoss 0.046±0.008 0.318±0.016 0.351±0.012 0.093±0.005 0.071±0.000 0.027±0.000
RankLoss 0.188±0.004 0.026±0.000 0.308±0.009 0.179±0.007 0.178±0.000 0.173±0.008
OneError 0.555±0.004 0.639±0.018 0.658±0.014 0.524±0.007 0.832±0.003 0.448±0.014
Coverage 0.107±0.009 0.173±0.010 0.150±0.007 0.112±0.004 0.111±0.002 0.103±0.003
AvgPrec 0.516±0.010 0.398±0.018 0.386±0.012 0.535±0.009 0.193±0.002 0.590±0.014
HammLoss 0.055±0.005 0.338±0.004 0.348±0.004 0.044±0.008 0.077±0.001 0.026±0.000
RankLoss 0.253±0.009 0.025±0.000 0.386±0.008 0.269±0.006 0.251±0.000 0.243±0.012
OneError 0.681±0.010 0.785±0.005 0.761±0.012 0.693±0.009 0.878±0.001 0.661±0.017
Coverage 0.123±0.008 0.172±0.006 0.168±0.007 0.124±0.003 0.137±0.001 0.121±0.005
AvgPrec 0.421±0.008 0.330±0.006 0.302±0.010 0.442±0.009 0.160±0.000 0.457±0.011
Table 2: Experimental results on different datasets with 50% noisy labels. Statistical superiority/inferiority of PML-LFC over each comparing method is assessed by pairwise t-test at the 95% significance level.
Metric PML-LFC against
HammLoss 21/0/2 17/2/4 18/2/3 16/3/4 16/2/5 15/3/5
RankLoss 20/1/2 16/2/5 22/1/0 19/1/3 18/2/3 22/0/1
OneError 22/0/1 23/0/0 23/0/0 21/0/2 19/0/4 20/0/3
Coverage 21/1/1 23/0/0 22/0/1 23/0/0 21/1/1 20/0/3
AvgPrec 22/0/1 23/0/0 20/1/2 19/1/3 19/0/4 21/0/2
Total (Win/Tie/Lose) 106/2/7 102/4/9 105/4/6 98/5/12 93/5/17 98/3/14
Table 3: Win/Tie/Lose counts (pairwise t-test at 95% significance level) of PML-LFC against each comparing algorithm with different ratios of noisy labels {10%, 50%, 100%, 200%} on different datasets across five evaluation criteria.

Based on the results in Tables 2 and 3, we can observe the following: (i) On the real-world PML datasets YeastBP, YeastCC and YeastMF, PML-LFC achieves the best performance in most cases, except against ML-KNN on ranking loss. (ii) On the synthetic datasets, PML-LFC frequently outperforms the other methods and only slightly loses to RankSVM and DRAMA on the medical dataset. (iii) Out of 115 statistical tests, PML-LFC achieves significantly better results than the popular PML methods PML-LRS, fPML, DRAMA and PARTICLE-VLS in 91.30%, 85.22%, 80.87% and 85.22% of cases, respectively. PML-LFC also significantly outperforms the two classical MLL approaches RankSVM and ML-KNN in 92.17% and 88.70% of cases, respectively, which confirms the necessity of accounting for irrelevant labels in PML training data. PML-LFC outperforms PML-LRS in most cases because PML-LRS mainly operates in the label space. fPML is similar to PML-LRS but uses feature information to guide the low-rank label matrix approximation; as a result, it sometimes obtains results similar to PML-LFC. PML-LFC also performs better than DRAMA and PARTICLE-VLS, which mainly use information from the feature space. Another cause of the superiority of PML-LFC is that the other comparing methods do not make concrete use of the negative information between the label and feature spaces. From these results, we conclude that PML-LFC accounts well for the negative information between features and labels for effective partial multi-label learning.

4.3 Further Analysis

We perform an ablation study to further examine the effectiveness of PML-LFC. For this purpose, we introduce three variants of PML-LFC, namely PML-LFC(oF), PML-LFC(oL) and PML-LFC(nJ). PML-LFC(oF) only uses the feature similarity, PML-LFC(oL) only uses the semantic similarity, and PML-LFC(nJ) does not jointly optimize the latent label matrix and the predictor in a unified objective function: it first optimizes the latent label matrix and then the multi-label predictor. Fig. 3 shows the results of these variants and PML-LFC on the slashdot dataset. All the experimental settings are the same as in the previous section.

Figure 3: The performance of PML-LFC and its degenerated variants on the slashdot dataset. For the first four evaluation metrics, the lower the value, the better the performance is. For AvgPrec, the higher the value, the better the performance is.

From Fig. 3, we find that PML-LFC has the lowest HammLoss, RankLoss, OneError and Coverage, and the highest AvgPrec among the four comparing methods. Neither the feature similarity nor the semantic similarity alone induces a multi-label predictor comparable with PML-LFC. In addition, PML-LFC(oF) and PML-LFC(oL) have similar performance, which indicates that both the feature and label information can be used to induce a multi-label predictor. PML-LFC leverages both the label and feature information, so it induces a less error-prone multi-label classifier and achieves better classification performance than these two variants. PML-LFC(nJ) has the lowest performance across the five evaluation metrics, which corroborates the disadvantage of isolating the confidence matrix learning from the multi-label predictor training. This study further confirms that both the feature and label information of multi-label data should be appropriately used for effective partial multi-label learning, and that our alternating optimization procedure has a mutually reinforcing effect on the predictor and the latent label matrix.

To investigate the sensitivity of the two trade-off parameters, one weighting the low-rank label-correlation term and the other weighting the feature and semantic similarity term, we vary each of them in the range {0.001, 0.01, 0.1, 1, 10, 100} for PML-LFC on the medical dataset. The experimental results (measured by the five evaluation metrics) are shown in Fig. 4; the results on the other datasets give similar observations. From Fig. 4(a), we observe that PML-LFC achieves the best performance when the label-correlation parameter takes an intermediate value. This observation suggests that it is necessary to account for the low-rank label correlation in partial multi-label learning. When this parameter is too large or too small, the label correlation is overweighted or underweighted, and the performance degrades. From Fig. 4(b), we see that PML-LFC also achieves the best performance at an intermediate value of the similarity parameter. When it is too small, the feature similarity and semantic similarity of multi-label instances are not well accounted for, which leads to poor performance. When it is too large (i.e., 100), PML-LFC also performs poorly, as it excessively overweights the feature and semantic similarity while underweighting the prediction model. Based on this analysis, we adopt the best-performing values of both parameters for the experiments.

Figure 4: Results of PML-LFC under different input values of the two trade-off parameters: (a) performance vs. the label-correlation parameter; (b) performance vs. the similarity parameter.
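The parameter sweep above amounts to a grid search over the two trade-off parameters. A minimal sketch is given below; `evaluate` is a hypothetical stand-in that in a real run would train PML-LFC at the given setting and return a validation score (e.g., AvgPrec), while the dummy score here merely peaks at a moderate setting for illustration.

```python
import itertools
import numpy as np

def evaluate(alpha, beta):
    # Hypothetical stand-in for "train PML-LFC with (alpha, beta) and
    # return a validation score"; this dummy peaks at alpha=1, beta=10.
    return -abs(np.log10(alpha)) - abs(np.log10(beta) - 1)

grid = [0.001, 0.01, 0.1, 1, 10, 100]
# Evaluate every (alpha, beta) pair and keep the best-scoring one.
best = max(itertools.product(grid, grid), key=lambda ab: evaluate(*ab))
```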

5 Conclusions

We investigated the partial multi-label learning problem and proposed an approach called PML-LFC, which leverages both the feature and the label information for effective multi-label classification. PML-LFC takes into account the negative information between the labels and features of partial multi-label data. Extensive experiments on PML datasets from different domains demonstrate the effectiveness of PML-LFC. In future work, we plan to incorporate abundant unlabeled data for effective extreme partial multi-label learning with a large label space.


  • [1] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 7(11), 2399–2434 (2006)
  • [2] Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge university press (2004)
  • [3] Cour, T., Sapp, B., Taskar, B.: Learning from partial labels. JMLR 12(5), 1501–1536 (2011)
  • [4] Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: NeurIPS. pp. 681–687 (2002)
  • [5] Fang, J.P., Zhang, M.L.: Partial multi-label learning via credible label elicitation. In: AAAI. pp. 3518–3525 (2019)
  • [6] Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM Computing Surveys 47(3),  52 (2015)
  • [7] Han, Y., Sun, G., Shen, Y., Zhang, X.: Multi-label learning with highly incomplete data via collaborative embedding. In: KDD. pp. 1494–1503 (2018)
  • [8] Ji, S., Ye, J.: An accelerated gradient method for trace norm minimization. In: ICML. pp. 457–464 (2009)
  • [9] Li, S.Y., Jiang, Y., Chawla, N.V., Zhou, Z.H.: Multi-label learning from crowds. TKDE 31(7), 1369–1382 (2019)
  • [10] Li, Y.F., Hu, J.H., Jiang, Y., Zhou, Z.H.: Towards discovering what patterns trigger what labels. In: AAAI. pp. 1012–1018 (2012)
  • [11] Liu, T., Tao, D.: Classification with noisy labels by importance reweighting. TPAMI 38(3), 447–461 (2016)
  • [12] Natarajan, N., Dhillon, I.S., Ravikumar, P.K., Tewari, A.: Learning with noisy labels. In: NeurIPS. pp. 1196–1204 (2013)
  • [13] Sun, L., Feng, S., Wang, T., Lang, C., Jin, Y.: Partial multi-label learning by low-rank and sparse decomposition. In: AAAI. pp. 5016–5023 (2019)
  • [14] Sun, Y.Y., Zhang, Y., Zhou, Z.H.: Multi-label learning with weak label. In: AAAI. pp. 593–598 (2010)
  • [15] Tan, Q., Liu, Y., Chen, X., Yu, G.: Multi-label classification based on low rank representation for image annotation. Remote Sensing 9(2),  109 (2017)
  • [16] Tan, Q., Yu, G., Domeniconi, C., Wang, J., Zhang, Z.: Incomplete multi-view weak-label learning. In: IJCAI. pp. 2703–2709 (2018)
  • [17] Tu, J., Yu, G., Domeniconi, C., Wang, J., Xiao, G., Guo, M.: Multi-label answer aggregation based on joint matrix factorization. In: ICDM. pp. 517–526 (2018)
  • [18] Wang, C., Yan, S., Zhang, L., Zhang, H.J.: Multi-label sparse coding for automatic image annotation. In: CVPR. pp. 1643–1650 (2009)
  • [19] Wang, H., Liu, W., Zhao, Y., Zhang, C., Hu, T., Chen, G.: Discriminative and correlative partial multi-label learning. In: IJCAI. pp. 2703–2709 (2019)
  • [20] Wu, B., Jia, F., Liu, W., Ghanem, B., Lyu, S.: Multi-label learning with missing labels using mixed dependency graphs. IJCV 126(8), 875–896 (2018)
  • [21] Xie, M.K., Huang, S.J.: Partial multi-label learning. In: AAAI. pp. 4302–4309 (2018)
  • [22] Xu, L., Wang, Z., Shen, Z., Wang, Y., Chen, E.: Learning low-rank label correlations for multi-label classification with missing labels. In: ICDM. pp. 1067–1072 (2014)
  • [23] Yu, G., Chen, X., Domeniconi, C., Wang, J., Li, Z., Zhang, Z., Wu, X.: Feature-induced partial multi-label learning. In: ICDM. pp. 1398–1403 (2018)
  • [24] Yu, G., Fu, G., Wang, J., Zhu, H.: Predicting protein function via semantic integration of multiple networks. TCBB 13(2), 220–232 (2016)
  • [25] Zhang, J., Wu, X.: Multi-label inference for crowdsourcing. In: KDD. pp. 2738–2747 (2018)
  • [26] Zhang, M.L., Yu, F., Tang, C.Z.: Disambiguation-free partial label learning. TKDE 29(10), 2155–2167 (2017)
  • [27] Zhang, M.L., Zhang, K.: Multi-label learning by exploiting label dependency. In: KDD. pp. 999–1008 (2010)
  • [28] Zhang, M.L., Zhou, Z.H.: ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7), 2038–2048 (2007)
  • [29] Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. TKDE 26(8), 1819–1837 (2014)
  • [30] Zhang, Y., Zhou, Z.H.: Multilabel dimensionality reduction via dependence maximization. TKDD 4(3),  14 (2010)
  • [31] Zhu, Y., Kwok, J.T., Zhou, Z.H.: Multi-label learning with global and local label correlation. TKDE 30(6), 1081–1094 (2017)