1. Introduction
As a weakly supervised machine learning framework, partial label learning (PLL)^{1} learns from ambiguous data in which the ground-truth label is concealed within the corresponding candidate label set and is not directly accessible to the learning algorithm (Cour et al., 2011; Wang and Zhang, 2018; Wu and Zhang, 2019; Zhang et al., 2016). ^{1}In some literature, partial label learning is also called superset label learning (Liu and Dietterich, 2014), soft label learning (Oukhellou et al., 2009), or ambiguous label learning (Chen et al., 2014).
In recent years, owing to its excellent performance on data with partial labels, PLL has been widely used in many real-world scenarios. For instance, in online object annotation (Figure 1 A), given object annotations from varying users, one can treat the objects as instances and the annotations as candidate labels, while the correct correspondence between instances and ground-truth labels is unknown (Liu and Dietterich, 2012; Zhou and Gu, 2017). In automatic face naming (Figure 1 B), given multi-figure images and the corresponding text descriptions, each face detected in an image can be regarded as an instance and the names extracted from the text description as candidate labels; similarly, the actual correspondence between faces and names is not available (Chen et al., 2018). Partial label learning techniques can precisely identify the actual instance-label correspondence and make predictions for unseen examples. In addition, partial label learning has also been widely used in other real-world applications, including web mining (Feng and An, 2019a; Luo and Orabona, 2010), multimedia content analysis (Chen et al., 2015; Cour et al., 2011; Wang and Zhang, 2019; Zeng et al., 2013), facial age estimation (Feng and An, 2019b; Wang et al., 2019), ecoinformatics (Zhou et al., 2016), etc.
To accomplish the task of learning from partial label data, an intuitive strategy is disambiguation, i.e. trying to identify the ground-truth label among its corresponding candidate labels. Following this strategy, existing PLL frameworks can be roughly grouped into two categories: average-based frameworks and identification-based frameworks. In the average-based framework, each candidate label is assumed to contribute equally to the learning model, and the ground-truth label is obtained by averaging the outputs over all candidate labels (Chen et al., 2017; Cour et al., 2011; Hullermeier and Beringer, 2005; Zhang and Yu, 2015). In the identification-based framework, the ground-truth label is considered a latent variable and is often refined in an iterative manner (Jin and Ghahramani, 2003; Nguyen and Caruana, 2008; Yu and Zhang, 2015; Yu and Zhang, 2017). Most of these methods focus mainly on identifying the unique ground-truth label, while the different contributions of the other false candidate labels are not taken into consideration. Correspondingly, the pairwise ranking information among candidate labels and the candidate label relevance across the whole training data set are also regrettably ignored, which may render these methods suboptimal.
In light of this observation, in this paper we propose a novel PLL approach called HERA, which simultaneously incorporates the HeterogEneous loss and the spaRse and low-rAnk scheme to estimate the label confidence of each instance while training the model. Specifically, we formalize the different labeling confidence levels as a latent label confidence vector, which is estimated by minimizing a heterogeneous loss that integrates the strengths of both the pairwise ranking loss and the pointwise reconstruction loss to provide informative identification information for model training. Note that, different from existing label ranking schemes, the employed pairwise label ranking does not refer only to the difference between the unique ground-truth label and all other false candidate labels, but covers the labeling confidence ranking of any two candidate labels, which makes it possible to clearly distinguish the different contributions of the candidate labels.
Moreover, in order to explore the global candidate label relevance across the whole training data set, motivated by (Chiang et al., 2018), (Liu et al., 2013), (Xu et al., 2013) and (Yu et al., 2014), we incorporate the sparse and low-rank scheme into our framework and assume that the observed candidate label matrix can be well approximated by a decomposition into an ideal label matrix and a noisy label matrix. Different from existing multi-label learning approaches that encourage the ideal label matrix to be low-rank and the noisy label matrix to be sparse, in our framework we pursue a ground-truth label matrix with sparse structure and constrain the false candidate label matrix with a low-rank property. The sparse structure of the ground-truth label matrix originates from the nature of PLL that each instance corresponds to only one ground-truth label, while the remaining noisy label matrix tends to be low-rank since instances with the same false candidate labels often recur statistically in many real-world PLL scenarios (Figure 2). Following this, we formalize our framework with a matrix decomposition constraint and separately regularize the ground-truth matrix and the false candidate label matrix via sparse and low-rank procedures, which not only intuitively exploits the candidate label relevance information of the whole training data but also prevents the algorithm from falling into trivial solutions. Afterwards, we optimize the label confidence matrix and the learning model in an alternating iterative manner. Experimental results on artificial as well as real-world PL data sets empirically show the effectiveness of the proposed method.
In summary, our main contributions lie in the following aspects:

To the best of our knowledge, this is the first time the pairwise ranking loss and the pointwise reconstruction loss have been integrated simultaneously (as a heterogeneous loss) into a partial label learning framework, which provides informative label identification information for model training.

Different from the widely used margin-based PLL methods that only consider the margin between the unique ground-truth label and all other false labels, the employed pairwise ranking loss makes full use of the difference between any two candidate labels and improves the label identification ability of the learning model.

Different from existing multi-label learning approaches that encourage the ideal label matrix to be low-rank and the noisy label matrix to be sparse, in our framework we pursue a ground-truth label matrix with sparse structure and constrain the false candidate label matrix with a low-rank property. Such regularization of the label matrix well satisfies the statistical constraints of the PLL problem (Figure 2).
The rest of this paper is organized as follows. We first give a brief introduction to partial label learning in Section 2. Then, we present the technical details of the proposed HERA algorithm and describe its optimization procedure in Section 3. Afterwards, in Section 4, we compare the proposed method with existing state-of-the-art methods on both artificial and real-world data sets. Finally, we conduct experimental analysis and conclude the paper.
2. Related Work
As a weakly supervised learning framework, partial label learning aims to learn from training data with implicit labeling information. An intuitive strategy to formalize such a framework is disambiguation, i.e. identifying the ground-truth label from the candidate label set. Generally speaking, existing PLL methods can be roughly grouped into the following three strategies:
2.1. Average Disambiguation Strategy (ADS)
ADS-based PLL methods often treat each candidate label equally and make predictions for unseen instances by averaging the modeling outputs over all candidate labels. Following this strategy, Hullermeier et al. (Hullermeier and Beringer, 2005) adopt an instance-based model to predict labels for unseen instances. Cour et al. (Cour et al., 2011) identify the ground-truth label from the ambiguous label set by averaging the modeling outputs over all candidate labels. Zhang et al. (Zhang and Yu, 2015) and Chen et al. (Chen et al., 2017) also adopt instance-based models and make predictions via nearest-neighbor weighted voting and a minimum-error reconstruction criterion, respectively. Recently, Tang and Zhang (Tang and Zhang, 2017) utilize the boosting learning technique and improve the disambiguation model by simultaneously adapting the weights of training examples and the confidences of candidate labels. Intuitively, the above methods are clear and easy to implement, but they share a common shortcoming: the output of the correct label can be overwhelmed by the outputs of the other false positive labels, which may decrease the robustness of the learning models.
2.2. Identification Disambiguation Strategy (IDS)
Different from averaging the outputs over all candidate labels, IDS-based PLL methods aim to directly identify the ground-truth label from its corresponding candidate label set. Following this strategy, most existing PLL methods first regard the ground-truth label as a latent variable and then refine the model parameters iteratively by utilizing specific criteria, such as the maximum likelihood criterion (Grandvalet and Bengio, 2004; Jin and Ghahramani, 2003; Liu and Dietterich, 2012) or the maximum margin criterion (Nguyen and Caruana, 2008; Yu and Zhang, 2017). However, these methods mainly focus on the difference between the unique ground-truth label and all other false candidate labels, while the confidence of each candidate label being the ground-truth label tends to be regrettably ignored. In recent years, aware of this shortcoming, some attempts try to improve the disambiguation ability of the model by exploiting the labeling confidences of different candidate labels, where either a label distribution or a label enhancement procedure is employed in the corresponding PLL framework to improve the learning model (Feng and An, 2018; Xu et al., 2019).
2.3. DisambiguationFree Strategy (DFS)
More recently, different from the above two disambiguation-based strategies, some methods aim to learn from PL data by fitting the data to existing learning techniques instead of performing disambiguation. Zhang et al. (Zhang et al., 2017) propose a disambiguation-free algorithm named PL-ECOC, which utilizes an Error-Correcting Output Codes (ECOC) coding matrix (Dietterich and Bakiri, 1994) and transforms the PLL problem into a binary learning problem. Xuan et al. (Xuan and Zhang, 2018) propose another disambiguation-free algorithm called PALOC, which enables binary decomposition of PLL data in a more concise manner without relying on extra manipulations. Experiments have empirically demonstrated that such DFS-based methods can achieve competitive performance against the two kinds of disambiguation-based methods above.
3. The Proposed Method
Formally speaking, we denote the d-dimensional input space as X and the output space as Y = {1, 2, ..., q} with q class labels. PLL aims to learn a classifier f: X -> Y from the PL training data with n instances, where the ground-truth label of each instance is not directly accessible during the training phase. For convenience of description, we denote X as the training instance matrix and Y as the n-by-q candidate label matrix, where Y_{ij} = 1 indicates that the j-th label is a candidate label of instance x_i, and Y_{ij} = 0 otherwise. Besides, P is denoted as the labeling confidence matrix, where P_{ij} represents the labeling confidence of the j-th label for x_i.

3.1. Formulation
As described in Section 2, the disambiguation strategy has been widely employed in many PLL frameworks. However, existing methods suffer from a common weakness: the learning framework is sensitive to the false positive labels that co-occur with the ground-truth label, and the different labeling confidence levels of candidate labels are regrettably ignored during training. In this paper, to alleviate this weakness, we regard the ground-truth label matrix as a latent labeling confidence matrix and propose a novel unified framework that estimates this latent matrix while simultaneously training the desired model, which is defined as OP (1):
where the first term is the loss function, the second controls the complexity of the model f, the third serves as the regularization that guarantees an optimal estimation of the labeling confidence matrix P, and two trade-off parameters control the balance of the three terms.

To satisfy the particular structure of the PLL framework, most existing methods either utilize the max-margin hinge loss as a ranking loss to distinguish among candidate labels, following (Yu and Zhang, 2017), or employ the least-squares loss as a reconstruction loss for model training, following (Feng and An, 2018); a single loss can only exploit limited disambiguation information from the training data and decreases the robustness of the learning model. Being aware of this, in order to leverage more informative disambiguation information and improve the robustness of the learning model, we integrate the above two kinds of loss function into a unified heterogeneous loss:
(1) 
where the two parts separately represent the pairwise ranking loss and the pointwise reconstruction loss. Specifically, the confidence gap between the j-th and k-th candidate labels of an instance measures their label confidence consistency, while the corresponding difference of modeling outputs describes the output consistency. Intuitively, if the confidence gap between the j-th and k-th labels is large, the ranking weight tends to be large, which encourages the output violation to be small; in contrast, if the gap is small, the weight tends to be small, which permits a larger violation. Besides, W is the weight matrix of the classifier f, a per-instance constant is used for normalization, and a trade-off parameter balances the two loss functions. Note that such a cost-sensitive pairwise ranking loss can clearly distinguish the difference between any two labels, which differs from the max-margin hinge ranking loss that only aims to identify the unique correct label.
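To make the weighting idea concrete, the heterogeneous loss for a single instance can be sketched in NumPy. The exact expression in Eq (1) is not reproduced above, so the hinge-style penalty, the normalization, and the parameter `gamma` below are illustrative assumptions, not the paper's definitive formula:

```python
import numpy as np

def heterogeneous_loss(scores, P, Y, gamma=1.0):
    """Toy heterogeneous loss for one instance.

    scores : (q,) model outputs f(x) for each label
    P      : (q,) estimated label confidences
    Y      : (q,) 0/1 candidate-label indicator
    gamma  : ranking/reconstruction trade-off (an assumed parameter)
    """
    cand = np.flatnonzero(Y)
    rank = 0.0
    for j in cand:
        for k in cand:
            if j == k:
                continue
            gap = P[j] - P[k]                                  # confidence gap of two candidates
            rank += max(0.0, -gap * (scores[j] - scores[k]))   # penalize outputs ordered against the gap
    rank /= max(len(cand) * (len(cand) - 1), 1)                # per-instance normalization
    recon = np.sum((scores - P) ** 2)                          # pointwise reconstruction term
    return rank + gamma * recon
```

Outputs that agree with the confidence ranking (large confidence implies large score) incur no ranking penalty, while reversed outputs are penalized in proportion to the confidence gap, mirroring the cost-sensitive behavior described above.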
Moreover, to optimally estimate the labeling confidence matrix P, the global correlations among the candidate labels are incorporated into the framework. To achieve this, we treat the ground-truth labels as outliers and decompose the observed candidate label matrix as Y = P + E, where P is the sparse component, which captures the PLL constraint that each instance corresponds to only one ground-truth label, and E is the low-rank component, which depicts the statistical label correlations among different training instances:

(2)
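A tiny synthetic example illustrates the assumed decomposition: a sparse ground-truth part with exactly one label per instance plus a low-rank noise part in which the same false candidates recur across instances. The specific matrices here are fabricated purely for illustration:

```python
import numpy as np

# Toy picture of the decomposition Y = P + E: the observed candidate
# label matrix splits into a sparse ground-truth part P (exactly one
# label per instance) and a low-rank noise part E (the same false
# candidate labels recur across instances).
P = np.zeros((4, 5))
P[[0, 1, 2, 3], [0, 2, 0, 2]] = 1.0               # unique ground-truth labels
E = np.tile([0., 0., 0., 1., 1.], (4, 1))         # recurring false candidates
Y = P + E                                          # observed candidate matrix
print(np.linalg.matrix_rank(E))                    # 1 -> noise part is low-rank
print(int((P != 0).sum(axis=1).max()))             # 1 -> one nonzero per row (sparse)
```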
In addition, in order to control the model complexity, we adopt the widely used squared Frobenius norm of W. The final framework of HERA can then be formulated as the following optimization problem OP (2):
where the l1 norm and the nuclear norm separately provide good surrogates for sparse and low-rank representation.
In summary, in the objective function of OP (2), the first term is the heterogeneous loss, which integrates both the pairwise ranking loss and the pointwise reconstruction loss into a unified loss function and provides informative identification information for model training. The second term controls the complexity of the learning model and prevents it from overfitting during training. The third term serves as the regularization that decomposes the candidate label matrix into a ground-truth label matrix and a noisy label matrix and separately regularizes them with sparse and low-rank constraints, which leads the model to an optimal estimation of the labeling confidence matrix. Overall, the whole HERA framework utilizes much more global and local structural information from both the feature and label spaces, and encourages the learning model to be effective and robust.
Afterwards, considering that OP (2) is an optimization problem with multiple variables and is therefore difficult to solve directly, we employ an alternating optimization procedure to update W, P and E iteratively. The details are given in the following subsection.
3.2. Optimization
To optimize the objective function conveniently, we convert OP (2) into the following equivalent problem OP (3):
Intuitively, OP (3) is a constrained optimization problem, which can be solved using the Augmented Lagrange Multiplier (ALM) technique. We write its ALM form as OP (4):
where Tr(·) denotes the trace of a matrix, M and N are the Lagrange multiplier matrices, and the remaining scalars are the penalty parameters.
Obviously, for each of the four matrices to be solved in OP (4), the objective function is convex when the remaining three matrices are kept fixed. Thus, OP (4) can be solved iteratively via the following steps:
Step 1: Calculate W. Fixing the other variables, we can calculate W by minimizing the following objective function:
(3) 
Since the loss function in Eq (3) is differentiable, W can be optimized via the standard gradient descent algorithm. In each iteration, W is updated via:
(4)  
where a fixed step size is used for gradient descent, and H is an indicator matrix of the same size as W whose entries mark the active terms of the loss (and are zero otherwise).
Step 2: Calculate P. Fixing the other variables, the subproblem with respect to P is simplified as follows:
(5) 
Similar to Eq (3), the subgradient of Eq (5) is easy to obtain, and we also employ the gradient descent procedure to solve it. Specifically,
(6)  
Here, an n-dimensional all-ones vector is used for aggregation, and the indicator matrix marks the active entries analogously to H (and is zero otherwise). In addition, to satisfy the constraints on P, we project the updated P onto the feasible set after each gradient step.
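Since the exact constraint set for P is not spelled out above, one plausible sketch of the feasibility projection is the following; the choices of clipping to non-negative values, restricting support to candidate labels, and row-normalizing are our own assumptions:

```python
import numpy as np

def project_confidences(P, Y):
    """Hedged sketch of the feasibility projection for P: confidences are
    clipped to be non-negative, restricted to candidate labels (Y == 1),
    and renormalized so each row sums to one. The paper's exact
    constraint set may differ."""
    P = np.clip(P, 0.0, None) * Y        # non-negative, candidates only
    s = P.sum(axis=1, keepdims=True)
    s[s == 0] = 1.0                      # guard against all-zero rows
    return P / s
```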
Step 3: Calculate J. By fixing the other variables, the subproblem with respect to J is reformulated as:
(7) 
According to (Zhu et al., 2010), Eq (7) has a closed-form solution, and the variable J can be updated as follows:
(8) 
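The l1-regularized subproblem of Eq (7) admits the well-known element-wise soft-thresholding solution; we assume the update in Eq (8) takes this familiar shrinkage form (the threshold `tau` stands in for the paper's concrete penalty-dependent value):

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise soft-thresholding: the standard closed-form proximal
    operator of the l1 norm. Entries are shrunk toward zero by tau and
    zeroed out when their magnitude is below tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)
```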
Step 4: Calculate E. Fixing all the other variables that are irrelevant to E, we have:
(9) 
Similarly, Eq (9) also has a closed-form solution, and the variable E can be optimized via singular value thresholding applied to the singular value decomposition of the corresponding residual term. Meanwhile, the updated E is also projected onto the feasible set to satisfy its constraints. Finally, the Lagrange multiplier matrices M and N and the penalty parameters are updated based on the Linearized Alternating Direction Method (LADM):
(10) 
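The nuclear-norm subproblem of Step 4 has the standard singular value thresholding solution, which we assume matches the E-update described above (the threshold `tau` again stands in for the concrete penalty-dependent value):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: soft-threshold the singular values of X.
    This is the closed-form proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```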
During the entire optimization process, we first initialize the required variables and then repeat the above steps until the algorithm converges. Finally, we make predictions for unseen instances following the steps of Section 3.3. Algorithm 1 summarizes the whole optimization process of HERA.
3.3. Prediction
In this section, we utilize a hybrid label prediction scheme based on nearest neighbors to make predictions for unseen instances, where prior instance-similarity information and the modeling outputs jointly contribute to the classification accuracy.
Specifically, we construct a similarity matrix to characterize the similarity between each unseen instance and its nearest neighbors in the training set. Meanwhile, we concatenate the labeling confidence matrix P and the modeling output into a unified matrix. Thereafter, we obtain the final predicted label via the following label propagation scheme:
(11) 
where the updated labeling matrix yields the final predicted label by maximizing the outputs of the prediction model, i.e. taking the maximum over the row corresponding to the unseen instance.
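The hybrid prediction idea can be sketched as follows. Blending the confidence vectors of the k nearest training instances with the model's own output is in the spirit of Section 3.3, but the inverse-distance weights and the mixing parameter `alpha` are our assumptions rather than the paper's exact propagation scheme:

```python
import numpy as np

def predict_label(x, X_train, F_train, scores, k=3, alpha=0.5):
    """Illustrative nearest-neighbor prediction: combine a similarity-
    weighted average of neighbor confidence vectors with the model's own
    output for x, then take the arg-max label.

    X_train : (n, d) training instances
    F_train : (n, q) per-instance label confidences
    scores  : (q,) model outputs for the unseen instance x
    """
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                # k nearest neighbors
    w = 1.0 / (d[nn] + 1e-8)              # inverse-distance similarity
    w /= w.sum()                          # normalized weights
    vote = w @ F_train[nn]                # weighted neighbor confidences
    out = alpha * vote + (1.0 - alpha) * scores
    return int(np.argmax(out))
```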
4. Experiments
4.1. Experimental Setup
To evaluate the performance of the proposed HERA method, we conduct experiments on nine controlled UCI data sets and six real-world data sets. (1) Controlled UCI data sets. Under different configurations of three controlling parameters (i.e. p, r and ε), the nine UCI data sets generate 252 artificial partial-label data sets (Chen et al., 2014; Cour et al., 2011). Here, p is the proportion of partial-label examples, r is the number of false candidate labels, and ε is the co-occurring probability between one coupling candidate label and the ground-truth label. (2) Real-World (RW) data sets. These data sets are collected from the following four task domains: (A) Facial Age Estimation [FGNET]; (B) Image Classification [MSRCv2]; (C) Bird Song Classification [BirdSong]; (D) Automatic Face Naming [Lost] [Soccer Player] [Yahoo! News]. Tables 1 and 2 separately summarize the characteristics of the UCI and real-world data sets, including the number of examples (EXP*), the number of features (FEA*), the total number of class labels (CL*) and the average number of candidate labels (AVGCL*).

Controlled UCI data sets  EXP*  FEA*  CL*  CONFIGURATIONS

Glass  214  9  6  
Ecoli  336  7  8  
Dermatology  366  34  6  
Vehicle  846  18  4  
Segment  2310  18  7  
Abalone  4177  7  29  
Letter  5000  16  26  
Satimage  6345  36  7  
Pendigits  10992  16  10 
RW data sets  EXP*  FEA*  CL*  AVGCL*  TASK DOMAIN 

Lost  1122  108  16  2.33  Automatic Face Naming (Cour et al., 2011) 
MSRCv2  1758  48  23  3.16  Image Classification (Liu and Dietterich, 2012) 
FGNET  1002  262  99  7.48  Facial Age Estimation (Panis and Lanitis, 2014) 
BirdSong  4998  38  13  2.18  Bird Song Classification (Briggs et al., 2012) 
Soccer Player  17472  279  171  2.09  Automatic Face Naming (Zeng et al., 2013) 
Yahoo! News  22991  163  219  1.91  Automatic Face Naming (Guillaumin et al., 2010) 
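The controlled-UCI generating protocol described above (parameters p, r, ε) can be sketched as follows. The description fixes only the roles of p, r and ε, so details such as how the coupling label is chosen are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_partial_labels(y, q, p=0.5, r=1, eps=0.0):
    """Sketch of the controlled-UCI protocol: with probability p an example
    becomes partially labeled by adding r false candidate labels; with
    probability eps a fixed "coupling" label is the co-occurring false
    candidate. y holds ground-truth labels in {0, ..., q-1}."""
    n = len(y)
    Y = np.zeros((n, q), dtype=int)
    Y[np.arange(n), y] = 1                       # ground-truth is always a candidate
    for i in range(n):
        if rng.random() >= p:                    # stays an ordinary (fully labeled) example
            continue
        false = [c for c in range(q) if c != y[i]]
        extra = set()
        if rng.random() < eps:
            extra.add(false[0])                  # assumed fixed coupling label
        while len(extra) < r:
            extra.add(int(rng.choice(false)))    # fill up to r false candidates
        Y[i, list(extra)] = 1
    return Y
```

With p = 1 and r = 2, every example receives exactly three candidate labels: the ground truth plus two distinct false candidates.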
Meanwhile, we employ six state-of-the-art PLL methods from the three categories for comparative studies: PL-SVM, PL-KNN, CLPL, LSB-CMM, PL-ECOC and PALOC, whose parameters are configured according to the suggestions in the respective literature.
PL-SVM (Nguyen and Caruana, 2008): An identification-based disambiguation partial label learning algorithm, which learns from partial label data by utilizing the maximum-margin strategy. [suggested configuration: ];

PL-KNN (Hullermeier and Beringer, 2005): Based on the nearest-neighbor strategy, it makes predictions by averaging the outputs of the learning model. [suggested configuration: ];

CLPL (Cour et al., 2011): A convex optimization partial label learning method, which makes predictions for unseen instances by adopting the averaging-based disambiguation strategy. [suggested configuration: SVM with squared hinge loss];

LSB-CMM (Liu and Dietterich, 2012): A maximum-likelihood-based partial label learning method utilizing the identification-based disambiguation strategy. [suggested configuration: q mixture components];

PL-ECOC (Zhang et al., 2017): Based on a coding-decoding procedure, it learns from partial label data in a disambiguation-free manner. [suggested configuration: ];

PALOC (Xuan and Zhang, 2018): An approach that adapts the one-vs-one decomposition strategy to learn from PL examples. [suggested configuration: ];
Before conducting the experiments, we introduce the parameter values employed in our framework. Specifically, the balancing parameters are selected via cross-validation, and the initial values of the remaining variables are set empirically. After initializing these variables, we adopt ten-fold cross-validation to train the model and report the classification accuracy on each data set.
4.2. Experimental Results
In our paper, the experimental results of the comparing methods are obtained by running the source codes provided by the authors, where the model parameters are configured according to the suggestions in the respective literature.
Data set  PL-KNN  PL-SVM  LSB-CMM  CLPL  PL-ECOC  PALOC  SUM 

Glass  23/3/2  26/2/0  25/3/0  27/1/0  20/8/0  13/11/4  134/28/6 
Segment  16/12/0  21/7/0  27/1/0  28/0/0  1/26/1  6/21/1  99/67/2 
Vehicle  28/0/0  25/3/0  5/14/9  13/10/5  2/6/20  12/12/4  85/45/38 
Letter  28/0/0  28/0/0  28/0/0  28/0/0  28/0/0  21/7/0  161/7/0 
Satimage  1/27/0  28/0/0  21/7/0  28/0/0  1/27/1  9/14/5  87/75/6 
Abalone  22/6/0  27/1/0  11/16/1  25/3/0  5/20/3  0/24/4  90/70/8 
Ecoli  28/0/0  22/6/0  28/0/0  28/0/0  28/0/0  21/7/0  155/13/0 
Dermatology  21/7/0  21/7/0  22/6/0  21/7/0  21/7/0  16/12/0  112/46/0 
Pendigits  2/26/0  28/0/0  28/0/0  28/0/0  1/27/0  0/27/1  87/80/1 
SUM  169/81/2  226/26/0  199/47/10  230/21/5  100/121/25  102/135/19   
Lost  MSRCv2  BirdSong  SoccerPlayer  FGNET  Yahoo! News  

HERA  0.712±0.035  0.510±0.027  0.695±0.021  0.539±0.015  0.076±0.017  0.646±0.005 
PL-SVM  0.729±0.056  0.482±0.027  0.663±0.018  0.497±0.004  0.063±0.010  0.636±0.010 
CLPL  0.742±0.024  0.413±0.020  0.632±0.009  0.368±0.004  0.063±0.017  0.462±0.009 
PL-KNN  0.424±0.030  0.448±0.012  0.614±0.009  0.443±0.004  0.039±0.008  0.457±0.010 
LSB-CMM  0.707±0.019  0.456±0.008  0.717±0.015  0.525±0.006  0.059±0.008  0.642±0.007 
PL-ECOC  0.703±0.052  0.505±0.027  0.740±0.016  0.537±0.020  0.040±0.018  0.610±0.008 
PALOC  0.629±0.056  0.479±0.042  0.711±0.016  0.537±0.015  0.065±0.019  0.625±0.005 
4.2.1. Controlled UCI data sets
We compare HERA with all the above comparing methods on nine controlled UCI data sets. Figures 3-5 illustrate the classification accuracy of each comparing method as p increases from 0.1 to 0.7 with a step size of 0.1; together with the ground-truth label, the remaining labels of each candidate label set are randomly chosen from the label space. Figure 6 illustrates the classification accuracy of each comparing algorithm as ε increases from 0.1 to 0.7 with a step size of 0.1. Table 3 summarizes the win/tie/loss counts between HERA and the other comparing methods. The 252 (9 data sets × 28 configurations) statistical comparisons show that HERA achieves either superior or comparable performance against the six comparing methods:

Among the comparing methods, HERA achieves superior performance against PL-SVM, LSB-CMM and CLPL in most cases (over 80% of cases). Compared with PL-KNN, PL-ECOC and PALOC, it also achieves superior or comparable performance in 99.2%, 86.3% and 92.5% of cases, respectively. These results demonstrate that HERA has a stronger disambiguation capacity than methods based on the various disambiguation strategies as well as the disambiguation-free strategy.

Among the UCI data sets, HERA outperforms all comparing methods on the Ecoli and Letter data sets. On the other controlled UCI data sets, it is also superior or comparable to all the comparing state-of-the-art methods. Specifically, the average classification accuracy of HERA is 15.8% higher than PL-KNN on the Ecoli data set and 12.8% higher than LSB-CMM on the Dermatology data set. Meanwhile, compared with CLPL and PL-ECOC, it achieves 14.3% and 5.7% higher classification accuracy on the Pendigits and Glass data sets, respectively.

Overall, among the 252 statistical comparisons and 1512 experimental results, HERA outperforms the other comparing methods in 1026 cases and achieves comparable performance in 431 cases (96% of cases in total), which strongly demonstrates the effectiveness of the proposed HERA algorithm.
4.2.2. RealWorld (RW) data sets
We compare HERA with all the above comparing algorithms on the RW data sets. The comparison results are reported in Table 4, where the recorded results are based on ten-fold cross-validation. According to Table 4, it is clear that HERA performs better than most comparing PLL algorithms on these RW data sets.

From the view of the comparing methods, HERA achieves superior or comparable performance against all comparing state-of-the-art methods. In particular, compared with the ADS-based methods, the classification accuracy of HERA is 28.5% higher than PL-KNN on the Lost data set and 9.5% higher than CLPL on the MSRCv2 data set. Compared with the IDS-based methods, HERA obtains 2.2% higher classification accuracy than PL-SVM on the BirdSong data set. Besides, compared with the DFS-based methods PL-ECOC and PALOC, it achieves 3.6% and 1.1% higher performance on the Yahoo! News and FGNET data sets, respectively.

From the view of the employed data sets, HERA also shows good disambiguation ability on all RW data sets. Specifically, the classification accuracy of HERA is higher than that of all other comparing methods on the MSRCv2, FGNET, SoccerPlayer and Yahoo! News data sets. On the Lost data set, it also outperforms 4/6 of the comparing methods, and even on the BirdSong data set it outperforms half of them.

Note that, although the proposed HERA does not achieve the best performance on all real-world data sets, it outperforms each comparing method on at least 4/6 data sets, and the improvement is also relatively significant. Therefore, the effectiveness of our proposed algorithm is demonstrated.
4.2.3. Summary
The two series of experiments above demonstrate the effectiveness of HERA, and we attribute the success to the superiority of the heterogeneous loss and the particular use of the sparse and low-rank scheme, i.e. simultaneously integrating the pairwise ranking loss and the pointwise reconstruction loss into a unified framework, and regularizing the ground-truth label matrix with a sparse structure and the noisy label matrix with a low-rank constraint. In summary, during the learning stage, the proposed HERA not only utilizes the different contribution of each candidate label but also takes the candidate label relevance of the whole training data into consideration, which jointly improves the disambiguation ability of the model. As expected, the experimental results demonstrate the effectiveness of our method.
4.3. Further Analysis
4.3.1. Sensitivity Analysis
We study the sensitivity of HERA with respect to its four employed parameters. Figure 7 shows the performance of HERA under different parameter configurations on the Lost and MSRCv2 data sets. From Figure 7, we can see that the balancing parameter usually has a great influence on the performance of the proposed framework; faced with different data sets, we set it via cross-validation. The other parameters typically follow the optimal configurations, with minor adjustments on different data sets.
4.3.2. Convergence Analysis
We conduct a convergence analysis of HERA on the Lost and MSRCv2 data sets. Specifically, in Figure 8, each group of subfigures separately illustrates the convergence curves of the sub-optimization process of the model parameter W (left), the sub-optimization process of the confidence matrix P (center), and the whole optimization process of HERA (right). We can observe that each objective value gradually decreases to 0 as the number of iterations increases. Therefore, the convergence of HERA is demonstrated.
5. Conclusion
In this paper, we have proposed a novel PLL method named HERA, which simultaneously integrates the strengths of both the pairwise ranking loss and the pointwise reconstruction loss to provide informative label identification information for learning the desired model. Meanwhile, a sparse and low-rank scheme is embedded into the proposed framework to exploit the labeling confidence information of the false candidate labels and to prevent the optimization problem from falling into trivial solutions. Extensive experiments have empirically demonstrated that our proposed method achieves superior or comparable performance against other state-of-the-art methods.
References
 Briggs et al. (2012) F. Briggs, X. Fern, and R. Raich. 2012. Rankloss support instance machines for MIML instance annotation. In ACM SIGKDD international conference on Knowledge discovery and data mining. 534–542.
 Chen et al. (2015) C. Chen, V. Patel, and R. Chellappa. 2015. Matrix completion for resolving label ambiguity. In Computer Vision and Pattern Recognition. 4110–4118.
 Chen et al. (2018) C. Chen, V. Patel, and R. Chellappa. 2018. Learning from ambiguously labeled face images. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1653–1667.
 Chen et al. (2017) G. Chen, T. Liu, Y. Tang, Y. Jian, Y. Jie, and D. Tao. 2017. A Regularization Approach for Instance-Based Superset Label Learning. IEEE Transactions on Cybernetics (2017), 1–12.
 Chen et al. (2014) Y. Chen, V. Patel, R. Chellappa, and P. Phillips. 2014. Ambiguously Labeled Learning Using Dictionaries. IEEE Transactions on Information Forensics and Security (2014), 2076–2088.
 Chiang et al. (2018) K. Chiang, I. Dhillon, and C. Hsieh. 2018. Using side information to reliably learn low-rank matrices from missing and corrupted observations. The Journal of Machine Learning Research (2018), 3005–3039.
 Cour et al. (2011) T. Cour, B. Sapp, and B. Taskar. 2011. Learning from Partial Labels. IEEE Transactions on Knowledge and Data Engineering (2011), 1501–1536.
 Dietterich and Bakiri (1994) T. Dietterich and G. Bakiri. 1994. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (1994), 263–286.
 Feng and An (2018) L. Feng and B. An. 2018. Leveraging Latent Label Distributions for Partial Label Learning. In International Joint Conference on Artificial Intelligence. 2107–2113.
 Feng and An (2019a) L. Feng and B. An. 2019a. Partial Label Learning by Semantic Difference Maximization. In International Joint Conference on Artificial Intelligence. in press.
 Feng and An (2019b) L. Feng and B. An. 2019b. Partial Label Learning with Self-Guided Retraining. In AAAI Conference on Artificial Intelligence. in press.
 Grandvalet and Bengio (2004) Y. Grandvalet and Y. Bengio. 2004. Learning from Partial Labels with Minimum Entropy. Cirano Working Papers (2004), 512–517.
 Guillaumin et al. (2010) M. Guillaumin, J. Verbeek, and C. Schmid. 2010. Multiple instance metric learning from automatically labeled bags of faces. In European Conference on Computer Vision. 634–647.
 Hullermeier and Beringer (2005) E. Hullermeier and J. Beringer. 2005. Learning from Ambiguously Labeled Examples. International Symposium on Intelligent Data Analysis (2005), 168–179.
 Jin and Ghahramani (2003) R. Jin and Z. Ghahramani. 2003. Learning with multiple labels. In Advances in Neural Information Processing Systems. 921–928.
 Liu et al. (2013) G. Liu, C. Lin, S. Yan, J. Sun, J. Yu, and Y. Ma. 2013. Robust Recovery of Subspace Structures by LowRank Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013), 171–184.
 Liu and Dietterich (2012) L. Liu and T. Dietterich. 2012. A conditional multinomial mixture model for superset label learning. In Advances in Neural Information Processing Systems. 548–556.
 Liu and Dietterich (2014) L. Liu and T. Dietterich. 2014. Learnability of the superset label learning problem. In International Conference on Machine Learning. 1629–1637.
 Luo and Orabona (2010) J. Luo and F. Orabona. 2010. Learning from candidate labeling sets. In Advances in Neural Information Processing Systems. 1504–1512.
 Nguyen and Caruana (2008) N. Nguyen and R. Caruana. 2008. Classification with partial labels. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 551–559.
 Oukhellou et al. (2009) L. Oukhellou, T. Denoeux, and P. Aknin. 2009. Learning from partially supervised data using mixture models and belief functions. Pattern Recognition (2009), 334–348.
 Panis and Lanitis (2014) G. Panis and A. Lanitis. 2014. An overview of research activities in facial age estimation using the FG-NET aging database. In European Conference on Computer Vision. 737–750.
 Tang and Zhang (2017) C. Tang and M. Zhang. 2017. Confidence-Rated Discriminative Partial Label Learning. In AAAI Conference on Artificial Intelligence. 2611–2617.
 Wang and Zhang (2019) D. Wang and M. Zhang. 2019. Adaptive graph guided disambiguation for partial label learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. in press.
 Wang and Zhang (2018) J. Wang and M. Zhang. 2018. Towards mitigating the classimbalance problem for partial label learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2427–2436.
 Wang et al. (2019) Q. Wang, Y. Li, and Z. Zhou. 2019. Partial Label Learning with Unlabeled Data. In International Joint Conference on Artificial Intelligence. in press.
 Wu and Zhang (2019) J. Wu and M. Zhang. 2019. Disambiguation enabled linear discriminant analysis for partial label dimensionality reduction. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. in press.
 Xu et al. (2013) M. Xu, R. Jin, and Z. Zhou. 2013. Speedup matrix completion with side information: Application to multilabel learning. In Advances in Neural Information Processing Systems. 2301–2309.
 Xu et al. (2019) N. Xu, J. Lv, and X. Geng. 2019. Partial Label Learning via Label Enhancement. In AAAI Conference on Artificial Intelligence. in press.
 Xuan and Zhang (2018) W. Xuan and M. Zhang. 2018. Towards Enabling Binary Decomposition for Partial Label Learning. In International Joint Conference on Artificial Intelligence. 2427–2434.
 Yu and Zhang (2015) F. Yu and M. Zhang. 2015. Maximum margin partial label learning. In Asian Conference on Machine Learning. 96–111.
 Yu and Zhang (2017) F. Yu and M. Zhang. 2017. Maximum margin partial label learning. Machine Learning (2017), 573–593.
 Yu et al. (2014) H. Yu, P. Jain, P. Kar, and I. Dhillon. 2014. Large-scale multi-label learning with missing labels. In International Conference on Machine Learning. 593–601.
 Zeng et al. (2013) Z. Zeng, S. Xiao, K. Jia, T. Chan, S. Gao, D. Xu, and Y. Ma. 2013. Learning by associating ambiguously labeled images. In IEEE Conference on Computer Vision and Pattern Recognition. 708–715.
 Zhang and Yu (2015) M. Zhang and F. Yu. 2015. Solving the partial label learning problem: an instancebased approach. In International Joint Conference on Artificial Intelligence. 4048–4054.
 Zhang et al. (2017) M. Zhang, F. Yu, and C. Tang. 2017. Disambiguation-free partial label learning. IEEE Transactions on Knowledge and Data Engineering (2017), 2155–2167.
 Zhang et al. (2016) M. Zhang, B. Zhou, and X. Liu. 2016. Partial label learning via feature-aware disambiguation. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1335–1344.
 Zhou and Gu (2017) Y. Zhou and H. Gu. 2017. Geometric Mean Metric Learning for Partial Label Data. Neurocomputing (2017), 394–402.
 Zhou et al. (2016) Y. Zhou, J. He, and H. Gu. 2016. Partial Label Learning via Gaussian Processes. IEEE Transactions on Cybernetics (2016), 4443–4450.
 Zhu et al. (2010) G. Zhu, S. Yan, and Y. Ma. 2010. Image tag refinement towards low-rank, content-tag prior and error sparsity. In ACM International Conference on Multimedia. 461–470.