1 Introduction
Multi-label learning (MLL) deals with the scenario where each instance is annotated with a set of discrete, non-exclusive labels [29, 6]. Recent years have witnessed increasing research on and application of MLL in various domains, such as image annotation [20], cybersecurity [7], and gene functional annotation [24]. Most MLL methods implicitly assume that each training example is precisely annotated with all of its relevant labels. However, it is difficult and costly to obtain fully annotated training examples in most real-world MLL tasks [17]. Therefore, recent MLL methods focus not only on assigning a set of appropriate labels to unlabeled examples using label correlations [27, 31], but also on replenishing missing labels for incompletely labeled samples [14, 20, 16].
Existing MLL solutions still overlook another fact that naturally arises in real-world scenarios. For example, the image in Fig. 1 was crowd-annotated by workers with 'Seaside', 'Sky', 'Sandbeach', 'Cloud', 'Tree', 'People', 'Sunset' and 'Ship'. Among these labels, the first five are relevant and the last three are irrelevant to the image. Obviously, the training procedure is prone to be misled by irrelevant labels concealed in the candidate labels of training samples. To combat this difficulty, pioneering work termed learning from such training data with irrelevant labels Partial Multi-label Learning (PML) [21, 23], and several PML approaches [19, 5, 13] have been proposed to identify the irrelevant labels concealed in the candidate labels of annotated samples, and to achieve a predictor that is robust (or less prone) to the irrelevant labels of training data.
However, contemporary approaches mainly build either on the smoothness assumption that (dis)similar instances should have (dis)similar ground-truth labels [21, 19, 5], or on the low-rank assumption that the ground-truth label matrix should be low-rank [23, 13]. These two assumptions cannot handle well two instances that share no candidate label yet have a high feature similarity, or two instances with overlapping candidate labels but a low feature similarity. As a result, the smoothness-based methods [21, 19, 5] ignore the negative information carried by two instances with high (low) feature similarity but low (high) semantic similarity between their candidate labels. In other words, these methods do not make good collaborative use of the information from the label and feature spaces for PML. For example, in Fig. 2(a) and Fig. 2(d), two instances have not only a high (low) feature similarity but also a high (low) semantic similarity due to largely overlapped (non-overlapped) candidate labels. From the setting of PML, these two instances are likely to have overlapped (non-overlapped) ground-truth labels. On the other hand, if two instances share no candidate label (zero semantic similarity), their ground-truth labels should be non-overlapped (as Fig. 2(b) shows). Besides, if two instances have a low feature similarity but a high semantic similarity (as Fig. 2(c) shows), their ground-truth labels can still overlap to some extent.
Fig. 2: The feature and label information of PML data in four scenarios. The semantic similarity is computed on the candidate labels of instances; the feature similarity is computed on their feature vectors. Ground-truth labels are highlighted in red, while other candidate labels are in grey.
Given these observations, we introduce Partial Multi-label Learning with Label and Feature Collaboration (PML-LFC). PML-LFC first learns a linear predictor with respect to a latent ground-truth label matrix, and induces a low-rank constraint on the coefficient matrix of the predictor to account for the label correlations of multi-label data. It then computes the feature similarity between instances and the semantic similarity between their candidate label vectors, and forces the inner-product similarity of the latent label vectors to be consistent with both the feature similarity and the semantic similarity. In this way, the label and feature information are collaboratively used to induce the latent label vectors, and the four scenarios illustrated in Fig. 2 are jointly modelled. PML-LFC finally obtains the predictor and the latent label matrix in a mutually reinforcing manner via a unified model, optimized by an alternating optimization procedure.
The main contributions of this paper are summarized as follows:
(i) We introduce PML-LFC to jointly leverage the label and feature information of partial multi-label data to induce a credible multi-label classifier, whereas existing PML solutions use the label and feature information in isolation, or ignore the negative information carried by two instances with high (low) feature similarity but low (high) semantic similarity between their candidate labels.
(ii) PML-LFC couples the predictor training on PML data and the latent label matrix exploration in a unified objective, and introduces an alternating optimization procedure to jointly optimize the predictor and the latent label matrix in a mutually beneficial manner.
(iii) Empirical study on public multi-label datasets shows that PML-LFC significantly outperforms related and competitive methods: fPML [23], PML-LRS [13], DRAMA [19], PARTICLE-VLS [5], and two classical multi-label classifiers (RankSVM [4] and ML-KNN [28]). In addition, the collaboration between labels and features contributes to the improved performance.
The remainder of this paper is organized as follows. Section 2 clarifies the difference between our problem and multi-label learning and partial-label learning, and then reviews the latest PML methods. Section 3 elaborates on PML-LFC and its optimization procedure. The experimental setup and results are provided and analyzed in Section 4. Conclusions and future work are summarized in Section 5.
2 Related Work
PML differs from multi-label crowd consensus learning [9, 17, 25], which aims to obtain high-quality consensus annotations for repeatedly annotated instances; PML has no such repeated annotations of the same instances. PML is also unlike multi-label weak-label learning [14, 16], which focuses on learning from training data annotated with incomplete (missing) labels. PML is further different from the popular partial-label learning (PLL) [3, 26], which assumes that only one label among the candidate labels of a sample is the ground truth and aims to induce a multi-class predictor that assigns a single label to each unseen sample; PLL can be viewed as a degenerate version of PML. We observe that PML is more difficult than the typical MLL and PLL problems, since the ground-truth labels of samples are not directly accessible for training the predictor, and a set of discrete, non-exclusive labels must be carefully assigned. To be self-contained and keep the reader informed, we give a brief review of popular PML solutions.
[21] introduced two PML approaches (PML-fp and PML-lc) that elicit the ground-truth labels by minimizing a confidence-weighted ranking loss between candidate and non-candidate labels. PML-fp focuses on utilizing the feature information of the training data, while PML-lc focuses on the label correlations. To mitigate the negative impact of irrelevant labels in the training phase, [5] proposed a two-stage approach (PARTICLE), which first elicits credible labels via iterative label propagation, and then takes the elicited labels to induce a multi-label classifier with virtual label splitting (PARTICLE-VLS) or maximum a posteriori reasoning (PARTICLE-MAP). [19] introduced another two-stage PML approach (DRAMA) that first estimates a confidence value for each label, using the feature manifold to indicate how likely a label is correct, and then induces a gradient boosting model to fit the label confidences, exploring the label correlations with the previously elicited labels in each boosting round. Due to the isolation between label elicitation and classifier training, the elicited labels may not be compatible with the classifier.
[23] introduced a feature-induced PML solution (fPML), which coordinately factorizes the observed sample-label association matrix and the sample-feature matrix into low-rank matrices, and then reconstructs the sample-label matrix to identify irrelevant labels. At the same time, fPML optimizes a compatible predictor based on the reconstructed sample-label matrix. Similarly, [13] assumed that the observed label matrix is the linear combination of a low-rank ground-truth label matrix and a sparse irrelevant label matrix, and introduced a solution called PML-LRS. These state-of-the-art PML solutions mainly focus either on using the feature manifold (similar instances have similar labels) or on the low-rankness of the ground-truth label matrix; to some extent, they still ignore the joint effect of features and labels for effective partial multi-label learning. The (latent) labels of instances depend on the features of those instances [30, 10], and the semantic similarity derived from the label sets of multi-label instances is positively correlated with the feature similarity between them [18, 24, 15]. Both the label and feature information of multi-label data should be well accounted for to learn effectively on PML data. Given that, we introduce PML-LFC to collaboratively use the feature and label information, as detailed in the next section.
3 Proposed method
Let 𝒳 denote the d-dimensional feature space, and 𝒴 denote the label space with q distinct labels. Given a PML dataset D = {(x_i, Y_i)}_{i=1}^n, where x_i ∈ 𝒳 is the feature vector of the i-th sample and Y_i ⊆ 𝒴 is the set of candidate labels currently annotated to x_i, the key characteristic of PML is that only a subset Ỹ_i ⊆ Y_i are the ground-truth labels of x_i, while the others (Y_i \ Ỹ_i) are irrelevant to x_i. However, Ỹ_i is not directly accessible to the predictor. The target of PML is to induce a multi-label classifier f: 𝒳 → 2^𝒴 from D. A naive PML solution is to divide the PML problem into q binary subproblems, and then adopt a noisy-label-resistant learning algorithm [12, 11]. However, this naive solution generally suffers from the label sparsity of multi-label data, where each instance is typically annotated with only a few labels of the whole label space, and each label is annotated to only a small portion of instances. Moreover, it disregards the correlations between labels. Another straightforward solution is to take the candidate labels as ground-truth labels and then apply off-the-shelf MLL algorithms [6] to train the predictor. However, the predictor would then be seriously misled by the false positive labels among the candidate labels.
To bypass the lack of known ground-truth labels of training instances, we take P ∈ [0,1]^{n×q} as the latent label confidence matrix, where P_{ij} reflects the confidence of the j-th label being a ground-truth label of the i-th instance. Unlike existing two-stage approaches [5, 19], which first estimate the credible labels and then train the predictor on the estimated labels, we integrate the estimation of the label confidence matrix and the predictor learning into a unified framework as follows:
(1)  min_{W,P} L(XW, P) + λ1·Ω(W) + λ2·R(P)

where L(·,·) denotes the loss function, Ω(W) controls the complexity of the prediction model, R(P) is the regularization term for the label confidence matrix P, and λ1 and λ2 are the trade-off parameters for the last two terms. In this unified formulation, the model is learned from the confidence label matrix P rather than the original noisy candidate label matrix. Therefore, the key is how to obtain a reliable confidence matrix P. In this paper, we propose to train the predictor with the widely-used least squares loss to fit the confidence label matrix as follows:
(2)  L(XW, P) = ||XW − P||_F^2
where W ∈ ℝ^{d×q} is the coefficient matrix of the predictor and X ∈ ℝ^{n×d} stacks the feature vectors of the n training instances. It is recognized that the labels of multi-label instances are correlated, and that the label matrix of the instances should therefore be low-rank [22, 23]. Given that, we instantiate the regularization on the predictor as a low-rank constraint on W as follows:
(3)  Ω(W) = rank(W)
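As a minimal sketch of the least squares fitting step in Eq. (2), the snippet below fits W to a given confidence matrix. Note one simplification: the low-rank penalty of Eq. (3) is handled later via the nuclear norm, so here we substitute a Frobenius penalty (an assumption, not the paper's objective) so that a closed-form solution exists.

```python
import numpy as np

def fit_predictor(X, P, lam=1.0):
    """Fit W minimizing ||XW - P||_F^2 + lam * ||W||_F^2.

    Simplified sketch: a Frobenius penalty replaces the paper's
    low-rank constraint so W has the closed form
    (X^T X + lam*I)^{-1} X^T P.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ P)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))     # n=50 instances, d=8 features
P = rng.uniform(size=(50, 4))    # confidence matrix over q=4 labels
W = fit_predictor(X, P)
```

At the returned W, the gradient X^T(XW − P) + λW vanishes, which is a quick way to sanity-check the solver.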
The main bottleneck of the PML problem is the lack of ground-truth labels for training instances. To overcome it, most efforts operate in the feature space, based on the assumption that similar (dissimilar) instances have similar (dissimilar) label assignments; they adopt manifold regularization [1] derived from the feature similarity to refine the labels of PML data, and then induce a predictor on the refined labels [5, 19]. Other efforts work in the label space, using the knowledge that the latent ground-truth label matrix should be low-rank [13, 23], or that the relevant labels of an instance are hidden in its candidate label set and should be ranked ahead of the irrelevant ones outside that set [21]. Although these efforts leverage the feature information to identify the relevant/irrelevant labels of training instances to some extent, they are still inclined to assign similar labels to two instances that have a high feature similarity but no common candidate label. From the definition of PML, the ground-truth labels of a training instance are hidden in its collected candidate label set. In other words, if two annotated instances do not share any candidate label, then there is no overlap between their ground-truth label sets (as Fig. 2(b) shows). Likewise, these efforts prefer to assign different label sets to two instances that lack a sufficiently large feature similarity but have largely overlapped candidate labels (as Fig. 2(c) shows). In summary, contemporary PML approaches do not sufficiently use the negative information carried by instance pairs with high (low) feature similarity but low (high) semantic similarity between their candidate labels, since they do not collaboratively use the feature and label information in a coherent way.
To remedy this issue, we specify the last term R(P) in Eq. (1) as follows:
(4) 
where s^f_{ij} represents the feature similarity between x_i and x_j, and s^l_{ij} reflects the semantic similarity derived from the candidate labels of these two instances. The first constraint guarantees that each candidate label has a nonnegative confidence value; the second restricts the confidence values to [0,1] and forces them to sum to 1. We can find that: (i) if two instances have high values of both s^f_{ij} and s^l_{ij}, then the latent label vectors p_i and p_j should be close to each other; (ii) if two instances have a large (or moderate) value of s^l_{ij}, then p_i and p_j can still overlap to some extent; (iii) if two instances have a zero (or low) value of s^l_{ij} and a low value of s^f_{ij}, then p_i and p_j should not overlap. Our minimization of Eq. (4) jointly considers the above cases. In contrast, contemporary PML methods ignore the semantic similarity between instances and make no effective use of the negative information in the last two cases. We remark that, given the existence of irrelevant labels on training instances, it is not easy to quantify and leverage the important label correlations for partial multi-label learning. The semantic similarity helps here: it quantifies the similarity between instances based on the pattern that two (or more) labels co-annotate the same instances, and this pattern is transferred to the latent confidence label matrix P. In addition, this transfer is coordinated by the low-rank constraint on the coefficient matrix W and by the feature similarity, which alleviates the negative impact of irrelevant labels on quantifying the semantic similarity. In this way, the information sources from the feature and label spaces are jointly used to guide the learning of the latent label matrix, which yields a credible multi-label predictor.
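The collaboration idea can be sketched as code. The penalty below is a hypothetical instantiation, not the paper's exact Eq. (4): it pushes the inner-product similarity of latent label vectors toward a target that combines the feature similarity Sf and the semantic similarity Sl, which reproduces the three cases above (high/high pulls vectors together, zero/low pushes them apart).

```python
import numpy as np

def collaboration_penalty(P, Sf, Sl):
    """Hypothetical sketch of the label-feature collaboration term.

    G holds the inner products <p_i, p_j> of latent label vectors;
    the target combines feature (Sf) and semantic (Sl) similarity.
    The paper's actual Eq. (4) may take a different form.
    """
    G = P @ P.T        # inner-product similarity of latent labels
    T = Sf * Sl        # combined similarity target
    return float(np.sum((G - T) ** 2))
```

When the inner products already match the combined similarity, the penalty is zero; mismatches in either direction (cases (ii) and (iii) above) are charged quadratically.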
Here, we initialize the label confidence matrix as:
(5) 
To quantify the feature similarity between instances, we adopt the widely-used Gaussian heat kernel as follows:
(6)  s^f_{ij} = exp(−||x_i − x_j||^2 / (2σ^2))
where σ denotes the kernel width, which is set empirically. Clearly, s^f_{ij} < 1 when no two instances are identical. We remark that other similarity metrics could also be adopted here; we choose the Gaussian heat kernel for its simplicity and wide application.
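A vectorized sketch of the Gaussian heat kernel in Eq. (6), computing all pairwise similarities at once (the squared-distance expansion ||x_i − x_j||^2 = ||x_i||^2 + ||x_j||^2 − 2 x_i·x_j avoids an explicit double loop):

```python
import numpy as np

def feature_similarity(X, sigma=1.0):
    """Pairwise Gaussian heat kernel:
    s^f_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)   # guard tiny negatives from rounding
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

The result is symmetric with ones on the diagonal, and off-diagonal entries strictly below 1 whenever no two rows of X coincide, matching the remark above.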
Diverse similarity metrics can also be adopted to quantify the semantic similarity between multi-label instances [18, 24]; here we use the cosine similarity as follows:
(7)  s^l_{ij} = ⟨y_i, y_j⟩ / (||y_i|| · ||y_j||)
where y_i ∈ {0,1}^q is the label indicator vector of x_i, with y_{ij} = 1 if the j-th label is annotated to x_i and y_{ij} = 0 otherwise. Obviously, s^l_{ij} ∈ [0,1]: it is large when two instances share a large portion of candidate labels, moderate when they share some candidate labels, and zero when they share no candidate label.
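The cosine semantic similarity of Eq. (7) can be computed for all pairs by row-normalizing the indicator matrix:

```python
import numpy as np

def semantic_similarity(Y):
    """Pairwise cosine similarity between candidate-label indicator
    vectors: s^l_ij = <y_i, y_j> / (||y_i|| * ||y_j||)."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Yn = Y / np.maximum(norms, 1e-12)   # guard empty label sets
    return Yn @ Yn.T
```

For instance, two samples sharing one of their (one- and two-element) candidate sets get similarity 1/√2, while disjoint candidate sets get exactly zero, the case that drives observation (iii) above.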
Based on the above analysis, we can instantiate PML-LFC as follows:
(8) 
The problem above can be further rewritten as follows:
(9) 
where the binary mask defined from the candidate label sets confines the confidence values to the candidate labels, and ⊙ denotes the Hadamard product. However, the rank function in Eq. (9) is hard to optimize directly; the nuclear norm ||W||_* is the standard convex surrogate for it. Therefore, Eq. (9) is reformulated as follows:
(10) 
3.1 Optimization
Since the optimization problem in Eq. (10) is non-convex with respect to W and P jointly, we apply an alternating optimization procedure to approximately solve it: we optimize one variable while fixing the other as a constant. The detailed procedure is presented below.
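The alternating scheme has a simple generic skeleton; the two subproblem solvers are plugged in as callables (hypothetical names, for illustration only):

```python
def alternating_optimize(update_W, update_P, W0, P0, n_iter=10):
    """Skeleton of alternating optimization: solve for one variable
    with the other held fixed, and iterate. The paper reports
    convergence within about ten rounds on its datasets."""
    W, P = W0, P0
    for _ in range(n_iter):
        W = update_W(P)   # W-subproblem with P fixed
        P = update_P(W)   # P-subproblem with W fixed
    return W, P
```

When each subproblem is solved exactly, the objective is non-increasing across rounds, which is why a small fixed iteration budget usually suffices.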
Update W: With P fixed, Eq. (10) with respect to W is equivalent to the following problem:
(11)  min_W ||XW − P||_F^2 + λ1·||W||_*
The minimization of Eq. (11) is a trace-norm minimization problem, which is time-consuming to solve directly. To reduce the computation time, we use the Accelerated Gradient Descent (AGD) algorithm [8] to optimize W, as summarized in Algorithm 1.
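The core building block of such trace-norm solvers (including the AGD method of [8]) is singular value thresholding, the proximal operator of τ·||·||_*; a minimal sketch:

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: prox of tau * ||.||_* .
    Shrinks each singular value by tau and discards those that
    fall below zero, yielding a (possibly) lower-rank matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Inside a proximal/accelerated gradient loop, one alternates a gradient step on the smooth part ||XW − P||_F^2 with this shrinkage step; the shrinkage is what drives W toward low rank.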
Update P: With W fixed, Eq. (10) with respect to P reduces to:
(16) 
By introducing the Lagrange multiplier Λ, Eq. (16) is equivalent to:
(17) 
where 1_n (1_q) denotes the n-dimensional (q-dimensional) all-ones column vector. The gradient with respect to P is:
(18) 
We can then use the Karush-Kuhn-Tucker (KKT) conditions [2] for the nonnegativity of P, which give:
(19) 
where the two all-ones matrices are of the appropriate dimensions. Let A = A^+ − A^−, where A^+_{ij} = (|A_{ij}| + A_{ij})/2 and A^−_{ij} = (|A_{ij}| − A_{ij})/2 for any matrix A; then Eq. (19) can be rewritten as:
(20) 
Eq. (20) leads to the following update formula:
(21) 
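The update for P must keep the confidence matrix in its feasible set (nonnegative entries, each row summing to 1). As a simple surrogate for the KKT-derived update in Eq. (21) — an assumption for illustration, not the paper's exact formula — one can clip and renormalize:

```python
import numpy as np

def project_confidences(P):
    """Enforce the constraints on the confidence matrix P:
    nonnegative entries and rows summing to 1 (hence values in
    [0, 1]). A simple projection surrogate for the KKT-based
    update, which maintains the same feasible set."""
    P = np.maximum(P, 0.0)
    s = P.sum(axis=1, keepdims=True)
    return P / np.maximum(s, 1e-12)
```

Applying this after each gradient step on the P-subproblem keeps every row of P a valid confidence distribution over the labels.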
Algorithm 2 summarizes the pseudo-code of PML-LFC. We observe that PML-LFC needs at most ten iterations to converge on the datasets used in our experiments.
4 Experiments
4.1 Experimental Setup
Dataset: For quantitative performance evaluation, five synthetic and three real-world PML datasets are collected for the experiments. Table 1 summarizes the characteristics of these datasets. To create a synthetic PML dataset, we take the current labels of the instances as ground-truth ones. For each instance, we then randomly insert irrelevant labels in proportion to the number of its ground-truth labels, varying this noisy label ratio over four values. All the datasets are randomly partitioned into 80% for training and the remaining 20% for testing. We repeat all the experiments 10 times independently and report the average results with standard deviations.
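The synthetic-PML protocol above can be sketched as follows (function and parameter names are illustrative; r is the noisy label ratio):

```python
import numpy as np

def make_partial_labels(Y, r, seed=0):
    """Build a synthetic PML label matrix from a clean one: for each
    instance, flip on roughly r times as many irrelevant labels as it
    has ground-truth labels (capped by the labels available)."""
    rng = np.random.default_rng(seed)
    Y = Y.copy()
    for i in range(Y.shape[0]):
        zeros = np.flatnonzero(Y[i] == 0)           # currently irrelevant
        k = min(len(zeros), int(round(r * Y[i].sum())))
        if k > 0:
            Y[i, rng.choice(zeros, size=k, replace=False)] = 1
    return Y
```

All original ground-truth labels remain candidates, so the construction matches the PML setting where the ground truth is hidden inside the candidate set.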
Compared methods: Four representative PML algorithms, including fPML [23], PML-LRS [13], DRAMA [19] and PARTICLE-VLS [5], are used for comparison. DRAMA and PARTICLE-VLS mainly utilize the feature similarity between instances, while fPML and PML-LRS build on the low-rank assumption of the label matrix; fPML additionally explores the coherence between the label and feature data matrices. In addition, two representative MLL solutions (ML-KNN [28] and RankSVM [4]) are included as baselines; these two directly take the candidate labels as ground truths to train their predictors. For the compared methods, parameter configurations are fixed or optimized following the suggestions in the original codes or papers. For our PML-LFC, we fix λ1 = 10 and λ2 = 10; the sensitivity of these parameters is analyzed later.
Evaluation metrics: For a comprehensive performance evaluation and comparison, we adopt five widely-used multi-label evaluation metrics: hamming loss (HammLoss), one-error (OneError), coverage (Coverage), ranking loss (RankLoss) and average precision (AvgPrec). The formal definitions of these metrics can be found in [29, 6]. Note that coverage is normalized by the number of distinct labels, so it ranges in [0,1]. For average precision, the larger the value, the better the performance; the opposite holds for the other four metrics.

Data set  Instances  Features  Labels  avgGLs  Noise
slashdot  3782  1079  22  1.181   
scene  2407  294  6  1.074   
enron  1702  1001  53  3.378   
medical  978  1449  45  1.245   
Corel5k  5000  499  374  0.245   
YeastBP  6139  6139  217  5.537  2385 
YeastCC  6139  6139  50  1.348  260 
YeastMF  6139  6139  39  1.005  234 
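Two of the adopted metrics are simple enough to sketch directly (see [29, 6] for the formal definitions of all five):

```python
import numpy as np

def hamming_loss(Y, Yhat):
    """Fraction of misclassified instance-label pairs."""
    return float(np.mean(Y != Yhat))

def one_error(Y, scores):
    """Fraction of instances whose top-ranked label is irrelevant."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(Y[np.arange(len(top)), top] == 0))
```

Hamming loss compares thresholded predictions entry-wise, while one-error (like coverage and ranking loss) is computed from the predicted score ranking.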
4.2 Results and Analysis
Table 2 reports the detailed experimental results of the compared algorithms with a noisy label ratio of 50%; similar observations hold for the other noisy label ratios. The first stage of DRAMA and PARTICLE-VLS utilizes the feature similarity to optimize the ground-truth label confidences in different ways. However, because the features of the three real-world PML datasets are protein-protein interaction networks, we directly use the network structure to optimize the ground-truth label confidence matrix in the first stage of the respective algorithms. In its second stage, PARTICLE-VLS introduces a virtual label technique to transform the problem into multiple binary training sets, which results in class imbalance and causes computation exceptions on the sparse biological network data. For this reason, its results on the last three datasets cannot be reported. Due to the page limit, we summarize the win/tie/loss counts of our method versus each compared method over 23 cases (five datasets with four ratios of noisy labels, plus three real-world PML datasets) across the five evaluation metrics in Table 3.
Metric  RankSVM  ML-KNN  PML-LRS  fPML  DRAMA  PARTICLE-VLS  PML-LFC
slashdot
HammLoss  0.078±0.005  0.184±0.006  0.044±0.000  0.043±0.000  0.052±0.000  0.053±0.001  0.073±0.000
RankLoss  0.161±0.002  0.053±0.000  0.153±0.006  0.127±0.006  0.118±0.000  0.305±0.032  0.110±0.007
OneError  0.534±0.005  0.680±0.015  0.446±0.012  0.480±0.013  0.404±0.001  0.769±0.074  0.393±0.028
Coverage  0.182±0.022  0.120±0.006  0.165±0.007  0.139±0.003  0.133±0.001  0.305±0.031  0.128±0.007
AvgPrec  0.582±0.017  0.472±0.011  0.639±0.007  0.627±0.007  0.686±0.001  0.375±0.036  0.696±0.010
scene
HammLoss  0.272±0.012  0.110±0.013  0.148±0.005  0.167±0.001  0.121±0.000  0.123±0.017  0.146±0.003
RankLoss  0.259±0.015  0.097±0.008  0.124±0.011  0.145±0.005  0.118±0.002  0.110±0.020  0.094±0.004
OneError  0.553±0.009  0.260±0.009  0.314±0.027  0.362±0.009  0.265±0.003  0.251±0.044  0.258±0.007
Coverage  0.232±0.017  0.109±0.011  0.118±0.009  0.136±0.006  0.114±0.001  0.097±0.018  0.093±0.002
AvgPrec  0.635±0.038  0.838±0.016  0.804±0.016  0.774±0.005  0.830±0.001  0.828±0.033  0.843±0.005
enron
HammLoss  0.109±0.006  0.108±0.006  0.060±0.001  0.104±0.002  0.068±0.001  0.064±0.008  0.051±0.001
RankLoss  0.189±0.037  0.054±0.000  0.145±0.009  0.197±0.009  0.143±0.002  0.238±0.037  0.099±0.008
OneError  0.476±0.047  0.323±0.032  0.326±0.036  0.416±0.030  0.260±0.004  0.453±0.102  0.254±0.013
Coverage  0.481±0.038  0.285±0.005  0.369±0.014  0.331±0.016  0.354±0.002  0.451±0.071  0.284±0.010
AvgPrec  0.504±0.053  0.611±0.019  0.613±0.015  0.659±0.008  0.613±0.002  0.466±0.088  0.683±0.009
medical
HammLoss  0.482±0.008  0.070±0.009  0.343±0.034  0.022±0.002  0.015±0.000  0.021±0.001  0.024±0.000
RankLoss  0.018±0.003  0.042±0.006  0.075±0.027  0.046±0.005  0.036±0.003  0.113±0.021  0.036±0.005
OneError  0.169±0.004  0.270±0.020  0.420±0.013  0.216±0.008  0.193±0.008  0.220±0.082  0.199±0.013
Coverage  0.276±0.025  0.095±0.011  0.114±0.027  0.065±0.010  0.058±0.001  0.116±0.020  0.052±0.009
AvgPrec  0.854±0.024  0.766±0.015  0.665±0.018  0.831±0.007  0.839±0.007  0.730±0.022  0.834±0.012
Corel5k
HammLoss  0.081±0.007  0.161±0.005  0.051±0.009  0.010±0.000  0.554±0.000  0.019±0.000  0.010±0.000
RankLoss  0.281±0.006  0.134±0.000  0.063±0.005  0.210±0.008  0.277±0.001  0.326±0.056  0.120±0.006
OneError  0.802±0.007  0.740±0.010  0.639±0.017  0.649±0.008  0.801±0.002  0.855±0.073  0.631±0.010
Coverage  0.391±0.007  0.372±0.010  0.403±0.007  0.470±0.017  0.539±0.003  0.547±0.041  0.281±0.013
AvgPrec  0.292±0.008  0.230±0.003  0.393±0.006  0.286±0.005  0.199±0.008  0.144±0.052  0.312±0.002
YeastBP
HammLoss  –  0.316±0.005  0.329±0.012  0.071±0.004  0.062±0.000  –  0.024±0.000
RankLoss  –  0.025±0.000  0.331±0.007  0.208±0.009  0.161±0.000  –  0.143±0.002
OneError  –  0.757±0.008  0.743±0.013  0.682±0.004  0.796±0.002  –  0.523±0.013
Coverage  –  0.407±0.010  0.374±0.008  0.312±0.005  0.295±0.002  –  0.281±0.012
AvgPrec  –  0.232±0.007  0.242±0.011  0.394±0.012  0.214±0.001  –  0.411±0.012
YeastCC
HammLoss  0.046±0.008  0.318±0.016  0.351±0.012  0.093±0.005  0.071±0.000  –  0.027±0.000
RankLoss  0.188±0.004  0.026±0.000  0.308±0.009  0.179±0.007  0.178±0.000  –  0.173±0.008
OneError  0.555±0.004  0.639±0.018  0.658±0.014  0.524±0.007  0.832±0.003  –  0.448±0.014
Coverage  0.107±0.009  0.173±0.010  0.150±0.007  0.112±0.004  0.111±0.002  –  0.103±0.003
AvgPrec  0.516±0.010  0.398±0.018  0.386±0.012  0.535±0.009  0.193±0.002  –  0.590±0.014
YeastMF
HammLoss  0.055±0.005  0.338±0.004  0.348±0.004  0.044±0.008  0.077±0.001  –  0.026±0.000
RankLoss  0.253±0.009  0.025±0.000  0.386±0.008  0.269±0.006  0.251±0.000  –  0.243±0.012
OneError  0.681±0.010  0.785±0.005  0.761±0.012  0.693±0.009  0.878±0.001  –  0.661±0.017
Coverage  0.123±0.008  0.172±0.006  0.168±0.007  0.124±0.003  0.137±0.001  –  0.121±0.005
AvgPrec  0.421±0.008  0.330±0.006  0.302±0.010  0.442±0.009  0.160±0.000  –  0.457±0.011
Metric  PML-LFC against
  RankSVM  ML-KNN  PML-LRS  fPML  DRAMA  PARTICLE-VLS
HammLoss  21/0/2  17/2/4  18/2/3  16/3/4  16/2/5  15/3/5
RankLoss  20/1/2  16/2/5  22/1/0  19/1/3  18/2/3  22/0/1
OneError  22/0/1  23/0/0  23/0/0  21/0/2  19/0/4  20/0/3
Coverage  21/1/1  23/0/0  22/0/1  23/0/0  21/1/1  20/0/3
AvgPrec  22/0/1  23/0/0  20/1/2  19/1/3  19/0/4  21/0/2
Total (Win/Tie/Loss)  106/2/7  102/4/9  105/4/6  98/5/12  93/5/17  98/3/14
Based on the results in Tables 2 and 3, we can observe the following: (i) On the real-world PML datasets YeastBP, YeastCC and YeastMF, PML-LFC achieves the best performance in most cases, except against ML-KNN on ranking loss. (ii) On the synthetic datasets, PML-LFC frequently outperforms the other methods and only slightly loses to RankSVM and DRAMA on the medical dataset. (iii) Out of 115 statistical tests, PML-LFC achieves better results than the popular PML methods PML-LRS, fPML, DRAMA and PARTICLE-VLS in 91.30%, 85.22%, 80.87% and 85.22% of cases, respectively. PML-LFC also significantly outperforms the two classical MLL approaches RankSVM and ML-KNN in 92.17% and 88.70% of cases, respectively. This proves the necessity of accounting for irrelevant labels in PML training data. PML-LFC outperforms PML-LRS in most cases because PML-LRS mainly operates in the label space. fPML is similar to PML-LRS but uses feature information to guide the low-rank label matrix approximation; as a result, it sometimes obtains results similar to PML-LFC. PML-LFC also performs better than DRAMA and PARTICLE-VLS, which mainly use information from the feature space. Another cause of the superiority of PML-LFC is that the compared methods make no concrete use of the negative information between the label and feature spaces. From these results, we conclude that PML-LFC accounts well for the negative information between features and labels for effective partial multi-label learning.
4.3 Further Analysis
We perform an ablation study to further examine the effectiveness of PML-LFC. For this purpose, we introduce three variants, namely PML-LFC(oF), PML-LFC(oL) and PML-LFC(nJ). PML-LFC(oF) only uses the feature similarity, and PML-LFC(oL) only uses the semantic similarity. PML-LFC(nJ) does not jointly optimize the latent label matrix and the predictor in a unified objective function; it first optimizes the latent label matrix and then the multi-label predictor. Fig. 3 shows the results of these variants and PML-LFC on the slashdot dataset. All other experimental settings are the same as in the previous section.
From Fig. 3, we find that PML-LFC has the lowest HammLoss, RankLoss, OneError and Coverage, and the highest AvgPrec, among the four compared methods. Neither the feature similarity nor the semantic similarity alone induces a multi-label predictor comparable with PML-LFC. In addition, PML-LFC(oF) and PML-LFC(oL) perform similarly to each other, which indicates that both the feature and the label information can be used to induce a multi-label predictor. PML-LFC leverages both, and thus induces a less error-prone multi-label classifier and achieves better classification performance than these two variants. PML-LFC(nJ) has the lowest performance across the five evaluation metrics, which corroborates the disadvantage of isolating the confidence matrix learning from the multi-label predictor training. This study further confirms that both the feature and label information of multi-label data should be appropriately used for effective partial multi-label learning, and that our alternating optimization procedure has a mutually reinforcing effect on the predictor and the latent label matrix.
To investigate the sensitivity of λ1 and λ2, we vary both in the range {0.001, 0.01, 0.1, 1, 10, 100} for PML-LFC on the medical dataset. The experimental results, measured by the five evaluation metrics, are shown in Fig. 4; the results on other datasets give similar observations. From Fig. 4(a), we observe that PML-LFC achieves the best performance when λ1 = 10. This suggests that it is necessary to consider the low-rank label correlation for partial multi-label learning. When λ1 is too small or too large, the label correlation is under- or over-weighted, and the performance degrades. From Fig. 4(b), we see that PML-LFC achieves the best performance when λ2 = 10. When λ2 is too small, the feature and semantic similarity of multi-label instances are not well accounted for, which leads to poor performance. When λ2 is too large (i.e., 100), PML-LFC also performs poorly, as it excessively overweights the feature and semantic similarity while underweighting the prediction model. Based on this analysis, we adopt λ1 = 10 and λ2 = 10 in our experiments.
5 Conclusions
We investigated the partial multi-label learning problem and proposed an approach called PML-LFC, which leverages both the feature and label information for effective multi-label classification. PML-LFC takes into account the negative information between the labels and features of partial multi-label data. Extensive experiments on PML datasets from different domains demonstrate the effectiveness of PML-LFC. We plan to incorporate abundant unlabeled data for effective extreme partial multi-label learning with large label spaces.
References
 [1] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 7(11), 2399–2434 (2006)
 [2] Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge university press (2004)
 [3] Cour, T., Sapp, B., Taskar, B.: Learning from partial labels. JMLR 12(5), 1501–1536 (2011)
 [4] Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: NeurIPS. pp. 681–687 (2002)
 [5] Fang, J.P., Zhang, M.L.: Partial multi-label learning via credible label elicitation. In: AAAI. pp. 3518–3525 (2019)
 [6] Gibaja, E., Ventura, S.: A tutorial on multi-label learning. ACM Computing Surveys 47(3), 52 (2015)
 [7] Han, Y., Sun, G., Shen, Y., Zhang, X.: Multi-label learning with highly incomplete data via collaborative embedding. In: KDD. pp. 1494–1503 (2018)
 [8] Ji, S., Ye, J.: An accelerated gradient method for trace norm minimization. In: ICML. pp. 457–464 (2009)
 [9] Li, S.Y., Jiang, Y., Chawla, N.V., Zhou, Z.H.: Multi-label learning from crowds. TKDE 31(7), 1369–1382 (2019)
 [10] Li, Y.F., Hu, J.H., Jiang, Y., Zhou, Z.H.: Towards discovering what patterns trigger what labels. In: AAAI. pp. 1012–1018 (2012)
 [11] Liu, T., Tao, D.: Classification with noisy labels by importance reweighting. TPAMI 38(3), 447–461 (2016)
 [12] Natarajan, N., Dhillon, I.S., Ravikumar, P.K., Tewari, A.: Learning with noisy labels. In: NeurIPS. pp. 1196–1204 (2013)
 [13] Sun, L., Feng, S., Wang, T., Lang, C., Jin, Y.: Partial multi-label learning by low-rank and sparse decomposition. In: AAAI. pp. 5016–5023 (2019)
 [14] Sun, Y.Y., Zhang, Y., Zhou, Z.H.: Multi-label learning with weak label. In: AAAI. pp. 593–598 (2010)
 [15] Tan, Q., Liu, Y., Chen, X., Yu, G.: Multi-label classification based on low rank representation for image annotation. Remote Sensing 9(2), 109 (2017)
 [16] Tan, Q., Yu, G., Domeniconi, C., Wang, J., Zhang, Z.: Incomplete multi-view weak-label learning. In: IJCAI. pp. 2703–2709 (2018)
 [17] Tu, J., Yu, G., Domeniconi, C., Wang, J., Xiao, G., Guo, M.: Multi-label answer aggregation based on joint matrix factorization. In: ICDM. pp. 517–526 (2018)
 [18] Wang, C., Yan, S., Zhang, L., Zhang, H.J.: Multi-label sparse coding for automatic image annotation. In: CVPR. pp. 1643–1650 (2009)
 [19] Wang, H., Liu, W., Zhao, Y., Zhang, C., Hu, T., Chen, G.: Discriminative and correlative partial multi-label learning. In: IJCAI. pp. 2703–2709 (2019)
 [20] Wu, B., Jia, F., Liu, W., Ghanem, B., Lyu, S.: Multi-label learning with missing labels using mixed dependency graphs. IJCV 126(8), 875–896 (2018)
 [21] Xie, M.K., Huang, S.J.: Partial multi-label learning. In: AAAI. pp. 4302–4309 (2018)
 [22] Xu, L., Wang, Z., Shen, Z., Wang, Y., Chen, E.: Learning low-rank label correlations for multi-label classification with missing labels. In: ICDM. pp. 1067–1072 (2014)
 [23] Yu, G., Chen, X., Domeniconi, C., Wang, J., Li, Z., Zhang, Z., Wu, X.: Feature-induced partial multi-label learning. In: ICDM. pp. 1398–1403 (2018)
 [24] Yu, G., Fu, G., Wang, J., Zhu, H.: Predicting protein function via semantic integration of multiple networks. TCBB 13(2), 220–232 (2016)
 [25] Zhang, J., Wu, X.: Multi-label inference for crowdsourcing. In: KDD. pp. 2738–2747 (2018)
 [26] Zhang, M.L., Yu, F., Tang, C.Z.: Disambiguation-free partial label learning. TKDE 29(10), 2155–2167 (2017)
 [27] Zhang, M.L., Zhang, K.: Multi-label learning by exploiting label dependency. In: KDD. pp. 999–1008 (2010)
 [28] Zhang, M.L., Zhou, Z.H.: ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7), 2038–2048 (2007)
 [29] Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. TKDE 26(8), 1819–1837 (2014)
 [30] Zhang, Y., Zhou, Z.H.: Multi-label dimensionality reduction via dependence maximization. TKDD 4(3), 14 (2010)
 [31] Zhu, Y., Kwok, J.T., Zhou, Z.H.: Multi-label learning with global and local label correlation. TKDE 30(6), 1081–1094 (2017)