1 Introduction
As a weakly supervised machine learning framework, partial label learning learns from ambiguous labeling information where each training example corresponds to a candidate label set, among which only one label is the ground truth [4] [5] [6]. (In some literature, partial label learning is also called superset label learning [1], ambiguous label learning [2] or soft label learning [3].) During training, the correct label of each example is concealed in its candidate label set and is not directly accessible to the learning algorithm.
In many real-world scenarios, data with explicit labeling information (a unique and correct label) is far scarcer than data with implicit labeling information (redundant labels). When faced with such ambiguous data, the conventional supervised learning framework, which assumes one label per instance, cannot learn from it accurately. Partial Label Learning (PLL) provides an effective way to cope with such data and has been widely used in many real-world scenarios. For example, in online annotation (Figure 1 (A)), users with varying knowledge and cultural backgrounds tend to annotate the same image with different labels. To learn from such an ambiguously annotated collection, it is necessary to find the correspondence between each image and its ground-truth label. In naming faces (Figure 1 (B)), given a multi-figure image and its corresponding text description, the resulting set of images is ambiguously labeled if more than one name appears in the description. In other words, the specific correspondences between the faces and their names are unknown. Beyond these common scenarios, PLL has also achieved competitive performance in many other applications, such as multimedia content analysis [7] [8] [9] [10], facial age estimation [11], web mining [12] and ecoinformatics [13].
The key to learning from Partial-Label (PL) data is disambiguation, which needs to fully exploit the valuable information in ambiguous PL training data and obtain the correct assignments between the training INStances and their CandiDate Labels (INSCDL). Recently, the Identification-based Disambiguation Strategy (IDS) has been widely used in many PLL frameworks owing to its competitive performance in alleviating the interference of false positive labels [13] [14] [15] [16] [17] [18]. Among existing IDS-based PLL methods, some are combined with off-the-shelf learning schemes to identify the ground-truth label in an iterative manner, such as maximum likelihood [13] [14] [15] and maximum margin [16] [17] [19]. Others explore the instance relationship in the ambiguous training data and directly disambiguate the candidate label sets [18]. Although both kinds of PLL methods have obtained desirable performance in many real-world scenarios, they still suffer from some common defects. For the instance relationship, they only consider the similarity of nearest-neighbor instances while ignoring the similarity among other instances and the dissimilarity among all instances, so the model output for an unseen instance can be overwhelmed by the outputs of its negative nearest instances. For the instance-label assignments, they usually use an iterative propagation procedure to obtain the target labels implicitly, but they neither explicitly describe the existing INSCDL assignment relationship nor take the co-occurrence possibility of varying instance-label assignments into consideration to directly identify the optimal assignments. As a result, the algorithm may lose sight of direct instance-label assignments and pay excessive attention to the instance relationship.
To overcome the above shortcomings, in this paper we reinterpret the task of PLL as a matching selection problem and incorporate the instance relationship and the co-occurrence possibility of varying instance-label assignments into the same framework, which yields a novel solution for the PLL problem. Specifically, we regard the INSCDL correspondences as instance-label matchings, so that the task of PLL can be reformulated as an instance-label matching selection problem (Figure 2), i.e. identifying the correct matching relationship between INStances and their Ground-Truth Labels (INSGTL). The goal of the PLL problem then becomes solving the matching selection problem and obtaining the optimal instance-label assignments. Graph Matching (GM) provides an effective solution for such problems and, owing to its excellent performance in exploiting the structural information of training data, it has been widely used in many real-world applications [20] [21] [22] [23] [24]. Inspired by this, we incorporate the GM scheme into the PLL matching selection problem and propose a novel PLL framework named Graph Matching based Partial Label Learning (GMPLL). Note that existing graph matching algorithms are formulated with a one-to-one constraint, which is not fully in accordance with the original PLL task, where one label can correspond to several instances. Thus, we extend the one-to-one constraint to a many-to-one constraint and propose a many-to-one probabilistic matching algorithm so that our method fits the original PLL problem. Furthermore, when establishing the proposed framework, an affinity matrix is predetermined to describe the consistency relationship between varying INSCDL assignments, and both the similarity and the dissimilarity of instances are incorporated into this matrix. This predetermined knowledge contributes to the subsequent learning process and guides the algorithm toward the optimal solution. Moreover, to improve the prediction accuracy on test instances, we integrate the minimum error reconstruction scheme and the graph matching scheme into a unified framework and propose a relaxed GM prediction algorithm: each unseen instance is first assigned a candidate label set via minimum error reconstruction from its neighbor instances, and the predicted label is then selected from the maximum-confidence candidate labels via the graph matching strategy. Experimental results demonstrate that it can obtain higher classification accuracy than other prediction algorithms.
In summary, our main contributions lie in the following three aspects:

Firstly, we reinterpret the conventional PLL problem and formulate the task of PLL as a matching selection problem. To the best of our knowledge, this is the first time that the PLL problem has been regarded as a matching selection problem, and accordingly we propose a novel GM-based PLL framework (GMPLL), where the instance relationship and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration.

Secondly, we extend the conventional graph matching algorithm with a one-to-one constraint to a probabilistic matching algorithm with a many-to-one constraint, which guarantees that the proposed method fits the original task of PLL.

Finally, we propose a relaxed GM prediction algorithm, which incorporates the graph matching scheme and the minimum error reconstruction scheme into the same framework to improve the classification accuracy.
The rest of the paper first gives a brief introduction to PLL, then presents the technical details of the proposed GMPLL algorithm and the comparative experiments with existing state-of-the-art methods. Finally, we conduct experimental analysis and conclude the paper.
2 Related Work
Partial label learning, as a weakly supervised learning framework, focuses on problems where the labeling information of the data is excessively redundant. An intuitive way to cope with this issue is disambiguation, and existing strategies are roughly grouped into three categories: Averaging Disambiguation Strategy (ADS), Identification Disambiguation Strategy (IDS) and Disambiguation-Free Strategy (DFS).
2.1 Averaging Disambiguation Strategy (ADS)
ADS-based methods usually assume that each candidate label contributes equally to the learning model, and they make predictions for unseen instances by averaging the outputs from all candidate labels. Following this strategy, Hüllermeier et al. and Chen et al. adopt an instance-based model and disambiguate the ground-truth label by averaging the outputs of the nearest neighbors [25] [26]. Yu et al. utilize the minimum error reconstruction criterion and obtain the predicted label by maximizing the confidence of the nearest neighbors' weighted-voting result [18]. Similarly, Tang et al. incorporate boosting into their framework and improve the disambiguation classifier by adapting the weights of training examples and the ground-truth confidence of candidate labels [27]. Moreover, to further improve disambiguation effectiveness, Zhang et al. facilitate the training process by taking the local topological information of the feature space into consideration [11]. The above PLL methods are clear and easy to implement, but they share a critical shortcoming: the output of the ground-truth label is overwhelmed by the outputs of the false positive labels, which negatively affects the disambiguation of the ground-truth label.
2.2 Identification Disambiguation Strategy (IDS)
To overcome the shortcomings of ADS, IDS-based PLL methods are proposed to directly disambiguate the candidate label set. This strategy aims to build a direct mapping from the instance space to the label space and to accurately identify the ground-truth label of each training instance. Existing PLL algorithms following this strategy often first view the ground-truth label as a latent variable, identified as , and then refine the model parameter iteratively by an Expectation-Maximization (EM) procedure [14]. Among these methods, some incorporate the maximum likelihood criterion and obtain the optimal label by maximizing the outputs of candidate labels [2] [13] [14] [28] [29] [30]. Others utilize the maximum margin criterion and identify the ground-truth label by maximizing the margin between the outputs of candidate labels and those of the non-candidate labels [16] [17]. Experimental results demonstrate that IDS-based methods achieve superior or comparable performance relative to ADS-based methods.
2.3 Disambiguation-Free Strategy (DFS)
Recently, in contrast to the two disambiguation-based PLL strategies above, some attempts have been made to learn from PL data by fitting the data to off-the-shelf learning techniques, so that predictions for unseen instances can be made directly without disambiguating the candidate label sets of the training instances. Following this strategy, Zhang et al. propose a disambiguation-free algorithm named PLECOC [31], which utilizes an Error-Correcting Output Codes (ECOC) coding matrix [32] and transforms the PLL problem into a binary learning problem. Wu et al. propose another disambiguation-free algorithm called PALOC [33], which enables binary decomposition of PL data in a more concise manner without relying on extra manipulations such as a coding matrix. Experimental results empirically demonstrate that DFS-based algorithms can achieve performance comparable with the disambiguation-based PLL methods.
Although the above methods have achieved good performance on the PLL problem, they still suffer from some common shortcomings: they consider neither the similarity of instances beyond the nearest neighbors nor the dissimilarity between instances. Therefore, in this paper, we utilize the GM scheme and propose a novel partial label learning framework called GMPLL, where instance similarity and dissimilarity are simultaneously incorporated into the framework to improve the performance of disambiguation. The details of the framework are introduced in the following section.
3 The GMPLL Method
Formally speaking, we denote the dimensional input space as , and the output space as with class labels. PLL aims to learn a classifier from the PL training data , where the instance is described as a dimensional feature vector, the candidate label set is associated with the instance and represents the number of candidate labels for instance . Meanwhile, we denote as the ground-truth label assignments for training instances, where each corresponding to is not directly accessible to the algorithm.
3.1 Formulation
GMPLL is a novel PLL framework based on the GM scheme, which aims to exploit valuable information from ambiguous PL data and establish an accurate assignment relationship between the instance space and the label space . To make the proposed method easier to understand, we illustrate GMPLL as a GM structure (Figure 3) before the detailed introduction.
As depicted in Figure 3, the instance space and the label space are formulated as two different undirected graphs of size , where , and , . The nodes in the two graphs represent the instances and labels respectively, while the edges encode their similarities. The goal of GMPLL is to establish the node correspondence between and .
Here, we first denote as the adjacency matrix for each graph , where . encodes the instance similarity, which is calculated by normalizing the popular cosine metric,
(1) 
and encodes the label similarity,
(2) 
where the similarity of different labels is set to 0 because, as an inherent characteristic of the PLL problem, the prior pairwise label relationship is always missing. Note that if the label relationship is available as prior knowledge, the proposed GMPLL can easily be extended to incorporate it.
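As a rough illustration of the two adjacency matrices described above, the sketch below builds an instance-similarity matrix from cosine similarity and a label-similarity matrix whose off-diagonal entries are 0. The exact normalization of Eq (1) is not reproduced here: rescaling the cosine value from [-1, 1] to [0, 1] via (1 + cos) / 2 is an assumption, as are the function names and the value 1 on the diagonal of the label matrix.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def instance_similarity(X):
    """Assumed normalization: map the cosine value from [-1, 1] to [0, 1]."""
    m = len(X)
    return [[(1.0 + cosine(X[i], X[j])) / 2.0 for j in range(m)] for i in range(m)]


def label_similarity(q):
    """Different labels get similarity 0; the diagonal value 1 is an assumption."""
    return [[1.0 if a == b else 0.0 for b in range(q)] for a in range(q)]


X = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
A_inst = instance_similarity(X)
A_label = label_similarity(3)
```

With this toy input, opposite vectors receive similarity 0 and identical vectors receive similarity 1, so every entry stays in [0, 1].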
Then, we define to describe the graph node correspondences between and , where represents that label is assigned to instance , and otherwise. Among these correspondences , a large number are not worth considering, since label is not contained in the candidate label set of instance . Accordingly, we exclude the assignments between instances and their non-candidate labels, and obtain the row-wise vectorized replica , where each element of p is defined as:
(3) 
here , , , and the value of represents the confidence of instance assigned with its th candidate label.
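The vectorization step described above can be sketched as follows: only (instance, candidate label) pairs are kept, in row-wise order, and each kept pair owns one entry of p. The helper name `vectorize_candidates` is hypothetical, and starting every instance with a uniform confidence over its candidates is an assumption rather than the definition in Eq (3).

```python
def vectorize_candidates(candidate_sets):
    """Return (pairs, p) with one entry per (instance, candidate label) pair.

    Non-candidate labels are skipped entirely, so len(p) equals the total
    number of candidate labels over all instances. The uniform starting
    confidence 1 / |S_i| is an assumption made for this sketch.
    """
    pairs, p = [], []
    for i, labels in enumerate(candidate_sets):
        ordered = sorted(labels)
        for label in ordered:
            pairs.append((i, label))
            p.append(1.0 / len(ordered))
    return pairs, p


candidate_sets = [{0, 2}, {1}, {0, 1, 2}]
pairs, p = vectorize_candidates(candidate_sets)
```

Keeping `pairs` next to `p` makes it possible to map the s-th entry of p back to its instance and candidate label later on.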
Afterwards, the correspondence of INSCDL can be obtained by solving the optimization problem OP (1)
where measures the pairwise consistency between instance edge and label edge , which can also be regarded as the pairwise consistency between assignment and assignment . Motivated by recent studies [24] [34] [35], we further formulate the OP (1) in a more general pairwise compatibility form OP (2):
where is the affinity matrix that will be introduced in the following subsection Generation of Affinity Matrix K. And the optimization details of OP (2) will also be exhibited in the following Section 3.2.
3.1.1 Generation of Affinity Matrix K
Affinity Matrix is defined to describe the matching consistency, and each element represents the INSCDL correspondence between and , i.e.
(4) 
here , represents the value of the s-th element of p, i.e. the INSCDL correspondence between the th instance and its th candidate label .
By encoding prior knowledge into the learning framework, the affinity matrix can carry valuable information extracted from PL training data, including both the similarity and dissimilarity between instances and the INSCDL mapping relationship. Thus, we initialize the affinity matrix K as follows
(5) 
It is worth noting that, compared with the conventional PLL methods based on nearest neighbor scheme, the proposed framework contributes more prior knowledge to the learning process:

It utilizes the similarity information from more training instances instead of only from the nearest neighbors.

It not only utilizes the instance similarity but also takes the dissimilarity between instances into consideration. In particular, as shown in Eq (5), with a higher degree of similarity between two instances ( and ), the value of will be higher, i.e., the ground-truth labels ( and ) of the two instances are more likely to lie in the intersection of their candidate label sets. Conversely, with a lower degree of similarity between and , and are more likely to lie outside the intersection of their candidate label sets.
After initializing the affinity matrix K, we take the issue of class imbalance with respect to training data into consideration, and incorporate the number of instance candidate labels as a bias into the generation of affinity matrix:
(6) 
here is the weight parameter, is the indicator function such that iff is true, and otherwise. To reduce noise and alleviate the computational complexity, we increase the sparsity of the affinity matrix K and set if , where is the threshold parameter and it will be analyzed in Section 5.1.
At this point, the prior knowledge has been encoded into the affinity matrix, and it can provide good guidance for the subsequent learning process.
3.2 Optimization
In this section, we extend the probabilistic graph matching scheme from [36] and derive a probabilistic graph matching partial label learning algorithm. The core of the proposed algorithm is based on the observation that we can use the solution of the spectral matching algorithm [37] to refine the estimate of the affinity matrix K and then solve a new assignment problem based on the refined matrix K. Namely, we can attenuate the affinities corresponding to matches with small matching probabilities and thus prune the affinity matrix K. In the same vein, we aim to adaptively increase the entries in K corresponding to assignments with high matching probabilities.
Concretely, we relax the first constraint of OP (2) to and interpret p as matching probabilities . Then, the affinity matrix K can be further interpreted as joint matching probabilities . Afterwards, we refine K and p in an iterative manner, where each iteration consists of two steps: estimating the mapping confidence p and refining the affinity matrix K. In the former step, we relax the one-to-one constraint of [37] to a many-to-one constraint to accommodate the fact that multiple instances may correspond to the same label. In the latter step, we follow [36] so that the refinement of K admits an analytic interpretation and provable convergence.
Hence, we minimize the objective function OP (3)
where is the assignment probability and represents the conditional assignment probability that is the probability of assignment when is valid. In our scheme, the and need to be updated simultaneously.
Specifically, in iteration t, we denote the estimation of by and by , respectively. Then, we update by
(7) 
where represents the joint probability which is the joint probability of assignment and assignment .
Different from the one-to-one constraint of the conventional GM problem, the framework of GMPLL is formulated with a many-to-one constraint. Thus, we introduce the constraint =1, and can be normalized as:
(8) 
Next, we refine the conditional assignment probability by
(9) 
During the entire process of optimization, we first initialize the required variables, and then repeat the above steps until the algorithm converges. Finally, we get the assigned label for each training example. The whole training algorithm of GMPLL is summarized in Algorithm 1.
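To make the many-to-one normalization concrete, here is a deliberately simplified sketch. It is not a reproduction of Eqs (7) to (9): it keeps K fixed (the refinement of K following [36] is omitted), uses a plain power-iteration style update p ← Kp, and renormalizes the confidences within each instance so that they sum to 1. The function names, the uniform initialization and the fixed iteration count are all assumptions.

```python
def refine_assignments(K, pairs, n_iter=50):
    """Iterate p <- K p, renormalizing within each instance (many-to-one).

    K:     square, non-negative affinity matrix (list of lists), indexed like pairs.
    pairs: pairs[s] = (instance, candidate_label) for the s-th entry of p.
    K stays fixed here; the paper's refinement of K is NOT reproduced.
    """
    n = len(pairs)
    counts = {}
    for inst, _ in pairs:
        counts[inst] = counts.get(inst, 0) + 1
    # Assumed initialization: uniform confidence over each candidate set.
    p = [1.0 / counts[inst] for inst, _ in pairs]
    for _ in range(n_iter):
        scores = [sum(K[s][t] * p[t] for t in range(n)) for s in range(n)]
        totals = {}
        for (inst, _), value in zip(pairs, scores):
            totals[inst] = totals.get(inst, 0.0) + value
        # Each instance's confidences sum to 1; one label may still win for
        # several instances, which is what the many-to-one constraint allows.
        p = [
            value / totals[inst] if totals[inst] > 0 else 1.0 / counts[inst]
            for (inst, _), value in zip(pairs, scores)
        ]
    return p


def decode(pairs, p):
    """Pick, for every instance, the candidate label with the highest confidence."""
    best = {}
    for (inst, label), value in zip(pairs, p):
        if inst not in best or value > best[inst][1]:
            best[inst] = (label, value)
    return {inst: label for inst, (label, _) in best.items()}


pairs = [(0, "a"), (0, "b"), (1, "a")]
K = [
    [1.0, 0.0, 2.0],  # (0, "a") is strongly compatible with (1, "a")
    [0.0, 1.0, 0.0],  # (0, "b") is supported only by itself
    [2.0, 0.0, 1.0],
]
p = refine_assignments(K, pairs)
assigned = decode(pairs, p)
```

In this toy case both instances end up with label "a", which a one-to-one matching would forbid.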
3.3 Prediction
In the label prediction stage for unseen instances, we propose a graph matching based PLL prediction algorithm, which takes both the similarity reconstruction scheme and the GM scheme into consideration. The details of the prediction algorithm are introduced as follows.
We first merge the training instances and test instances into one large instance set, and then calculate a new instance-similarity matrix following Eq (1). Afterwards, we assign a candidate label set to each test instance according to the weighted-voting results of its nearest neighbor instances , where the weights are calculated via the minimum error reconstruction scheme OP (4):
here, is an element of w and .
Based on the weighted-voting results, we obtain the confidence of each candidate label assigned to , and then rank these labels by confidence in descending order. Afterwards, we select the maximum-confidence labels to constitute the candidate label set for . This completes the construction of the candidate label set for each unseen instance.
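The voting and selection step above can be sketched as follows, assuming the reconstruction weights have already been obtained (solving OP (4) itself is not shown) and that each neighbor contributes a single label. The function name `top_candidates`, the argument `kappa` for the number of retained labels and the tie-breaking by label id are assumptions of this sketch.

```python
def top_candidates(neighbor_labels, weights, kappa):
    """Weighted vote over the neighbors' labels, then keep the kappa best labels.

    neighbor_labels[j] is the label contributed by the j-th neighbor and
    weights[j] its reconstruction weight (assumed to be given). Ties are
    broken by the smaller label id, which is an arbitrary choice.
    """
    scores = {}
    for label, weight in zip(neighbor_labels, weights):
        scores[label] = scores.get(label, 0.0) + weight
    ranked = sorted(scores, key=lambda lab: (-scores[lab], lab))
    return ranked[:kappa], scores


neighbor_labels = [2, 0, 2, 1]
weights = [0.4, 0.3, 0.2, 0.1]
candidates, scores = top_candidates(neighbor_labels, weights, kappa=2)
```

Here label 2 collects the weights 0.4 and 0.2, label 0 collects 0.3, and label 1 collects 0.1, so the two retained candidates are labels 2 and 0.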
When the value of equals the total number of candidate-label categories , the prediction model degenerates into disambiguation over all candidate labels, which is common in existing methods. In contrast, if only one label is retained (), the label with the maximum probability is taken as the ground-truth label, which is the same as [18]. The larger the value of , the higher the probability that the ground-truth label is contained in the candidate label set, but the more false labels are drawn in, which can decrease the effectiveness of the model. Conversely, the smaller the value of , the fewer false labels are contained in the candidate label set, but the ground-truth label may then be excluded from the candidate label set.
Based on the above analysis, we can conclude that the total number of class labels (CL*) and the average number of class labels (AVGCL*) for each instance have significant influence on the selecting of the number of assigned candidate labels . Concretely, on one hand, more class labels means more noise class labels, thus we tend to assign with a smaller value to avoid the negative effect of these noise labels when CL* is larger. On the other hand, the average number of class labels can represent the average number of positive labels, thus we tend to choose larger when AVGCL* is larger. At this point, we can calculate the by the following formula:
(10) 
here is the integer function, which represents the rounding operation for .
Finally, once the above operations are completed, we follow the idea of Algorithm 1 to rebuild the affinity matrix and utilize the GM scheme to recover the correct mapping between test instances and their ground-truth labels.
4 Experiments
4.1 Experimental Setup
To verify the effectiveness of the proposed GMPLL method, we conduct experiments on nine controlled UCI data sets and six real-world data sets:
(1) Controlled UCI data sets. Under specified configurations of two controlling parameters (i.e. p and r), the nine UCI data sets generate 189 (9 × 21) artificial partial-label data sets [2] [38]. Here, p is the proportion of instances with partial labeling and r is the number of candidate labels other than the ground-truth label. Table I summarizes the characteristics of the nine UCI data sets, including the number of examples (EXP*), the number of features (FEA*), the total number of class labels (CL*) and their common configurations (CONFIGURATIONS).
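Table I: Characteristics of the nine controlled UCI data sets.
| UCI data sets | EXP* | FEA* | CL* | CONFIGURATIONS |
| --- | --- | --- | --- | --- |
| Glass | 214 | 10 | 7 | |
| Ecoli | 336 | 7 | 8 | |
| Dermatology | 364 | 23 | 6 | |
| Vehicle | 846 | 18 | 4 | |
| Segment | 2310 | 18 | 7 | |
| Abalone | 4177 | 7 | 29 | |
| Letter | 5000 | 16 | 26 | |
| Satimage | 6345 | 36 | 7 | |
| Pendigits | 10992 | 16 | 10 | |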
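The controlled setting described above, with p as the proportion of partially labeled instances and r as the number of extra candidate labels, can be sketched as follows. The function name `make_partial_labels`, the rounding of p times the number of examples, the uniform sampling of false labels and the fixed seed are assumptions of this sketch, not details taken from [2] [38].

```python
import random


def make_partial_labels(y, q, p, r, seed=0):
    """Turn ground-truth labels y (ints in 0..q-1) into candidate label sets.

    A fraction p of the instances receives r additional false labels drawn
    uniformly without replacement; the others keep only the ground-truth
    label. Requires r <= q - 1.
    """
    rng = random.Random(seed)
    m = len(y)
    chosen = set(rng.sample(range(m), int(round(p * m))))
    candidate_sets = []
    for i, truth in enumerate(y):
        labels = {truth}
        if i in chosen:
            others = [c for c in range(q) if c != truth]
            labels.update(rng.sample(others, r))
        candidate_sets.append(labels)
    return candidate_sets


y = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
S = make_partial_labels(y, q=4, p=0.5, r=2)
```

Every candidate set produced this way still contains its ground-truth label, which is the defining property of partial-label data.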
(2) Real-World (RW) data sets. These data sets are collected from the following four task domains: (A) Facial Age Estimation: human faces are represented as instances, and the ages annotated by ten crowdsourced labelers together with the ground-truth ages are regarded as candidate labels; (B) Automatic Face Naming: human faces cropped from images or videos are represented as instances, and each candidate label set is composed of the names extracted from the corresponding captions or subtitles; (C) Object Classification: image segmentations constitute the instance space, and the objects appearing within the same image constitute the candidate label sets; (D) Bird Song Classification: singing syllables of the birds are represented as instances, while bird species jointly singing during a 10-second period are regarded as candidate labels. Table II summarizes the characteristics of the above real-world data sets, including the number of examples (EXP*), the number of features (FEA*), the total number of class labels (CL*), the average number of class labels (AVGCL*) and their task domains (TASK DOMAIN).
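Table II: Characteristics of the real-world data sets.
| RW data sets | EXP* | FEA* | CL* | AVGCL* | TASK DOMAIN |
| --- | --- | --- | --- | --- | --- |
| Lost | 1122 | 108 | 16 | 2.33 | Automatic Face Naming [38] |
| MSRCv2 | 1758 | 48 | 23 | 3.16 | Image Classification [39] |
| FGNET | 1002 | 262 | 99 | 7.48 | Facial Age Estimation [40] |
| Soccer Player | 17472 | 279 | 171 | 2.09 | Automatic Face Naming [7] |
| Yahoo! News | 22991 | 163 | 219 | 1.91 | Automatic Face Naming [41] |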
Table III: Win/tie/loss counts of GMPLL against each comparing method on the controlled UCI data sets.
| Data set | PLKNN | PLSVM | LSBCMM | CLPL | M3PL | PLLEAF | PLECOC | IPAL | sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glass | 19/2/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 7/4/10 | 21/0/0 | 19/2/0 | 150/8/10 |
| segment | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 16/5/0 | 21/0/0 | 163/5/0 |
| vehicle | 21/0/0 | 21/0/0 | 17/3/1 | 18/0/3 | 19/2/0 | 8/7/6 | 7/5/9 | 21/0/0 | 132/17/19 |
| letter | 14/7/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 5/16/0 | 15/6/0 | 139/29/0 |
| satimage | 19/2/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 20/1/0 | 15/6/0 | 19/2/0 | 157/11/0 |
| abalone | 21/0/0 | 21/0/0 | 0/0/21 | 0/10/11 | 21/0/0 | 0/0/21 | 0/0/21 | 21/0/0 | 84/10/74 |
| ecoli | 12/9/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 1/13/7 | 21/0/0 | 11/10/0 | 129/32/7 |
| dermatology | 14/7/0 | 21/0/0 | 21/0/0 | 21/0/0 | 6/14/1 | 0/14/7 | 21/0/0 | 13/8/0 | 117/43/8 |
| pendigits | 2/19/0 | 21/0/0 | 21/0/0 | 21/0/0 | 21/0/0 | 9/12/0 | 11/10/0 | 1/20/0 | 107/61/0 |
| sum | 163/22/4 | 178/6/5 | 142/0/42 | 148/14/27 | 153/15/21 | 79/44/66 | 110/37/42 | 125/55/9 | |
Meanwhile, we employ four classical (PLSVM, PLKNN, CLPL, LSBCMM) and four state-of-the-art (M3PL, PLLEAF, PLECOC, IPAL) partial label learning algorithms that are based on different disambiguation strategies for comparative studies. We partially use the open-source code from Zhang Minling's homepage (http://cse.seu.edu.cn/PersonalPage/zhangml/). The parameters of each method are configured following the suggestions in the respective literature:

PLSVM [16]: Based on IDS, it obtains the predicted label by incorporating the maximum margin scheme. [suggested configuration: ];

PLKNN [25]: Based on ADS, it obtains the predicted label by averaging the outputs of the nearest neighbors. [suggested configuration: =10];

CLPL [38]: A convex optimization partial-label learning method based on ADS. [suggested configuration: SVM with hinge loss];

LSBCMM [13]: Based on IDS, it makes predictions by calculating the maximum-likelihood value of the model given unseen instances as input. [suggested configuration: q mixture components];

M3PL [17]: Originating from PLSVM, it is also based on the maximum-margin strategy, and it obtains the predicted label by calculating the maximum values of the model outputs. [suggested configuration: ];

PLLEAF [11]: A partial-label learning method via feature-aware disambiguation. [suggested configuration: =10, , ];

IPAL [18]: It disambiguates the candidate label set by taking the instance similarity into consideration. [suggested configuration: =10];

PLECOC [31]: Based on a coding-decoding procedure, it learns from partial-label training examples in a disambiguation-free manner. [suggested configuration: the codeword length ];
Before conducting the experiments, we give the ranges of the required variables. During the training phase, the threshold variable is set among to exploit the most valuable similarity and dissimilarity information, and the coefficient parameter is chosen from to balance the effect of the number of varying label categories. During the test phase, inspired by [18], we empirically set for nearest neighbor instances to complete the candidate label set of each unseen instance, and the size of the label set is empirically set to more than 1 so that the ground-truth label can be included in the assigned candidate label set. After initializing the above variables, we adopt ten-fold cross-validation to train the model and report the average classification accuracy on each data set.
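Table IV: Inductive accuracy (mean ± std) of each comparing method on the real-world data sets.
| Method | Lost | MSRCv2 | Yahoo! News | BirdSong | SoccerPlayer | FGNET |
| --- | --- | --- | --- | --- | --- | --- |
| GMPLL | 0.737 ± 0.043 | 0.530 ± 0.019 | 0.629 ± 0.007 | 0.663 ± 0.010 | 0.549 ± 0.009 | 0.065 ± 0.021 |
| PLSVM | 0.639 ± 0.056 | 0.417 ± 0.027 | 0.636 ± 0.018 | 0.662 ± 0.032 | 0.430 ± 0.004 | 0.058 ± 0.010 |
| CLPL | 0.670 ± 0.024 | 0.375 ± 0.020 | 0.462 ± 0.009 | 0.632 ± 0.017 | 0.347 ± 0.004 | 0.047 ± 0.017 |
| PLKNN | 0.332 ± 0.030 | 0.417 ± 0.012 | 0.457 ± 0.009 | 0.614 ± 0.024 | 0.494 ± 0.004 | 0.037 ± 0.008 |
| LSBCMM | 0.591 ± 0.019 | 0.431 ± 0.008 | 0.648 ± 0.015 | 0.717 ± 0.024 | 0.506 ± 0.006 | 0.056 ± 0.008 |
| M3PL | 0.732 ± 0.035 | 0.521 ± 0.030 | 0.655 ± 0.010 | 0.709 ± 0.010 | 0.446 ± 0.013 | 0.037 ± 0.025 |
| PLLEAF | 0.664 ± 0.020 | 0.459 ± 0.013 | 0.597 ± 0.012 | 0.706 ± 0.012 | 0.515 ± 0.004 | 0.072 ± 0.010 |
| IPAL | 0.726 ± 0.041 | 0.523 ± 0.025 | 0.667 ± 0.014 | 0.708 ± 0.014 | 0.547 ± 0.014 | 0.057 ± 0.023 |
| PLECOC | 0.703 ± 0.052 | 0.505 ± 0.027 | 0.662 ± 0.010 | 0.740 ± 0.016 | 0.537 ± 0.020 | 0.040 ± 0.018 |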
4.2 Experimental Results
Since the two kinds of data sets have different origins (the nine UCI data sets are constructed manually, while the six RW data sets come from real-world scenarios), we conduct two series of experiments to evaluate the proposed method, and the experimental results are presented in the following two subsections. The experimental results of the comparing algorithms come from two sources: results we obtained by running the source code provided by the authors, and results reported in the respective literature.
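Table V: Transductive accuracy (mean ± std) of each comparing method on the real-world data sets.
| Method | Lost | MSRCv2 | Yahoo! News | BirdSong | SoccerPlayer | FGNET |
| --- | --- | --- | --- | --- | --- | --- |
| GMPLL | 0.881 ± 0.005 | 0.770 ± 0.013 | 0.705 ± 0.612 | 0.834 ± 0.010 | 0.668 ± 0.003 | 0.186 ± 0.021 |
| PLSVM | 0.887 ± 0.012 | 0.653 ± 0.024 | 0.871 ± 0.002 | 0.825 ± 0.012 | 0.688 ± 0.014 | 0.136 ± 0.021 |
| CLPL | 0.894 ± 0.005 | 0.656 ± 0.010 | 0.834 ± 0.002 | 0.822 ± 0.004 | 0.680 ± 0.010 | 0.158 ± 0.018 |
| PLKNN | 0.615 ± 0.036 | 0.616 ± 0.006 | 0.692 ± 0.010 | 0.772 ± 0.021 | 0.492 ± 0.015 | 0.173 ± 0.017 |
| LSBCMM | 0.721 ± 0.010 | 0.524 ± 0.007 | 0.872 ± 0.001 | 0.716 ± 0.014 | 0.704 ± 0.002 | 0.138 ± 0.019 |
| M3PL | 0.860 ± 0.006 | 0.732 ± 0.025 | 0.870 ± 0.002 | 0.855 ± 0.030 | 0.761 ± 0.010 | 0.127 ± 0.013 |
| PLLEAF | 0.809 ± 0.022 | 0.645 ± 0.015 | 0.827 ± 0.002 | 0.882 ± 0.014 | 0.702 ± 0.003 | 0.148 ± 0.009 |
| IPAL | 0.840 ± 0.041 | 0.714 ± 0.015 | 0.823 ± 0.008 | 0.833 ± 0.030 | 0.673 ± 0.014 | 0.158 ± 0.024 |
| PLECOC | 0.851 ± 0.013 | 0.555 ± 0.030 | 0.862 ± 0.007 | 0.886 ± 0.014 | 0.671 ± 0.003 | 0.132 ± 0.019 |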
4.2.1 Controlled UCI data sets
Figures 4 to 6 illustrate the classification accuracy of each comparing method on the nine controlled data sets as increases from to with step size . Together with the ground-truth label, the class labels are randomly chosen from to constitute the rest of each candidate label set, where . Table III summarizes the win/tie/loss counts between GMPLL and the other comparing methods. The 189 (9 data sets × 21 configurations) statistical comparisons show that GMPLL achieves either superior or comparable performance against the eight comparing methods, which is embodied in the following aspects:

Among the comparing methods, GMPLL achieves superior performance against PLKNN, PLSVM, LSBCMM, CLPL and M3PL in most cases. Compared with PLLEAF, PLECOC and IPAL, it achieves superior or comparable performance in 65.08%, 77.78% and 95.23% of cases, respectively. These results demonstrate that the proposed method has a stronger disambiguation capacity than methods based on other disambiguation strategies as well as the disambiguation-free strategy.

Compared with the methods that directly establish INSGTL assignments, GMPLL achieves superior performance on most data sets. For example, the average classification accuracy of GMPLL is 11.2% higher than that of M3PL on the Glass data set and 29.5% higher than that of PLSVM on the Satimage data set. GMPLL also has higher or comparable classification accuracy compared with the state-of-the-art methods on the other controlled UCI data sets. We attribute this success to its ability to utilize the co-occurrence possibility of varying instance-label assignments to obtain accurate INSGTL assignments.

Compared with the methods that utilize instance similarity, GMPLL also achieves competitive performance. In terms of average classification accuracy, GMPLL is 1.2% higher than IPAL on the Segment data set and 1.4% higher than PLLEAF on the Letter data set. In terms of the Max-Min range of classification accuracy, GMPLL varies by only 0.84% on the Glass data set, whereas all other methods vary by more than 1%. Moreover, the standard deviation of the classification accuracy of GMPLL is lower than that of the other comparing methods on most data sets. These results indicate the advantage of the proposed method over other instance-similarity based methods.
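As a rough illustration of the controlled setting described at the start of this subsection, the sketch below builds candidate label sets by adding randomly chosen false-positive labels to the ground-truth label. The parameter names `p` (fraction of examples that become partially labeled) and `r` (number of extra labels per such example) follow a convention common in the PLL literature and are assumptions here, as is the helper name `make_candidate_sets`; this is not the authors' code.

```python
import random

def make_candidate_sets(true_labels, num_classes, p, r, seed=0):
    """Turn single labels into candidate label sets (hypothetical helper).

    p: assumed fraction of examples that receive extra labels.
    r: assumed number of false-positive labels added to each such example.
    """
    if not 0 <= r < num_classes:
        raise ValueError("r must be smaller than the number of classes")
    rng = random.Random(seed)
    candidate_sets = []
    for y in true_labels:
        candidates = {y}  # the ground-truth label is always kept
        if rng.random() < p:
            # draw r distinct labels from the classes other than y
            others = [c for c in range(num_classes) if c != y]
            candidates.update(rng.sample(others, r))
        candidate_sets.append(candidates)
    return candidate_sets

labels = [0, 2, 1, 3, 2]
sets_ = make_candidate_sets(labels, num_classes=4, p=1.0, r=2)
```

With `p=1.0` every example receives exactly `r` extra labels, and with `p=0.0` the data reduce to ordinary single-label supervision.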
4.2.2 Real-world (RW) data sets
We compare GMPLL with all of the above comparing algorithms on the real-world data sets. The inductive accuracy and transductive accuracy are reported in Table IV and Table V, respectively, where the recorded results are based on ten-fold cross-validation.
The transductive classification accuracy reflects the disambiguation capacity of a PLL method in recovering the ground-truth labeling information from the candidate label sets, while the inductive classification accuracy reflects its capacity to predict the ground-truth label of unseen examples. According to Table IV and Table V, GMPLL performs better than most comparing PLL algorithms on these RW data sets. Its superiority is reflected in the following aspects:

As shown in Table IV, GMPLL significantly outperforms all comparing methods on the Lost, MSRCv2 and SoccerPlayer data sets. In particular, compared with the classical methods, the classification accuracy of the proposed method is 40.5% higher than that of PLKNN on the Lost data set and 20.2% higher than that of CLPL on the SoccerPlayer data set. Even compared with the state-of-the-art methods, it is 2.5% higher than PLECOC on MSRCv2 and 1.1% higher than IPAL on the Lost data set.

GMPLL also achieves competitive performance on the other RW data sets. On the FGNET data set, GMPLL outperforms all comparing methods except PLLEAF, to which it is only 0.7% inferior. On the Yahoo! News data set, however, GMPLL clearly outperforms PLLEAF, with a classification accuracy 3.4% higher. Moreover, GMPLL outperforms CLPL and PLKNN on all six RW data sets, and it exceeds the other comparing methods on over four of the six RW data sets. These experimental results demonstrate the superiority of GMPLL.

As shown in Table V, GMPLL shows significantly superior disambiguation ability on the Lost, MSRCv2 and FGNET data sets and competitive disambiguation ability on the BirdSong and SoccerPlayer data sets, which demonstrates the strength of the GM scheme in disambiguation. On the Yahoo! News data set, GMPLL is inferior to some state-of-the-art comparing methods. Even so, it still achieves superior or comparable performance against the other comparing methods when making predictions for unseen instances, which demonstrates the strength of the GM scheme in prediction. In summary, the experimental results demonstrate the effectiveness of the proposed GMPLL algorithm.

We notice that the performance of GMPLL is inferior to most comparing methods on the Yahoo! News data set, which we attribute to the low intra-class instance similarity. In particular, over 8440 examples come from two categories, and for over 65% of these examples the intra-class instance similarity is less than 0.60. Such low intra-class instance similarity may reduce the effectiveness of the proposed method.
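The distinction between the two accuracies reported in Table IV and Table V can be made concrete with a small sketch. The label lists below are hypothetical toy values, not results from the paper: transductive accuracy scores the labels recovered for the training examples, while inductive accuracy scores the labels predicted for held-out examples.

```python
def accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(truth):
        raise ValueError("length mismatch")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Hypothetical toy data: labels recovered for training examples
# (disambiguation) and labels predicted for held-out examples.
train_truth = [0, 1, 2, 1, 0]
train_recovered = [0, 1, 2, 2, 0]
test_truth = [1, 0, 2, 2]
test_predicted = [1, 0, 1, 2]

transductive_acc = accuracy(train_recovered, train_truth)  # 4 of 5 correct
inductive_acc = accuracy(test_predicted, test_truth)       # 3 of 4 correct
```

In a ten-fold cross-validation, both quantities would be computed once per fold and then averaged.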
4.2.3 Summary
The two series of experiments above demonstrate the effectiveness of GMPLL. We attribute this success to the GM scheme, i.e. incorporating both the instance relationship and the co-occurrence possibility of varying instance-label assignments into the same framework. Concretely, the instance relationship, especially the instance dissimilarity, alleviates the effect of similar instances with different labels and prevents the output of an instance from being overwhelmed by those of its negative nearest instances. For the instance-label assignments, the co-occurrence possibility leads the algorithm to pay more attention to matching selection and reduces its dependence on the instance relationship. The two schemes jointly improve the effectiveness and robustness of the proposed method.
5 Further Analysis
5.1 Parameter Sensitivity
The proposed method learns from the PL examples by utilizing two important parameters, i.e. (threshold parameter) and (the number of candidate labels assigned to unseen instances). Figure 7 and Figure 8 respectively illustrate how GMPLL performs under different and configurations. We analyze the sensitivity of GMPLL to each parameter in the following subsections.
5.1.1 The threshold parameter
The threshold parameter controls the percentage of prior knowledge incorporated into the learning framework. More prior knowledge is added into the framework when is small, while less prior knowledge contributes to the learning process when becomes larger. On the other hand, a small draws more noise into the learning framework and a large loses more valuable information, both of which have negative effects on the learning model. For each data set, we select the threshold parameter among via cross-validation, and the selected values are shown in Table VI.
Data set  Lost  MSRCv2  FGNET  BirdSong  SoccerPlayer  Yahoo! News
Threshold parameter  0.6  0.6  0.8  0.5  0.3  0.7
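Selecting the threshold by cross-validation, as described above, amounts to a small grid search. The following is a minimal sketch under stated assumptions: the grid values, the helper `select_threshold` and the stand-in scoring function `toy_evaluate` are all hypothetical, and a real run would replace `toy_evaluate` with training and validating the model on each fold.

```python
def select_threshold(grid, folds, evaluate):
    """Return the grid value with the best mean validation accuracy.

    evaluate(threshold, fold) is a caller-supplied function (hypothetical here)
    that trains on the other folds and returns the accuracy on `fold`.
    """
    best_value, best_score = None, float("-inf")
    for value in grid:
        mean_score = sum(evaluate(value, f) for f in folds) / len(folds)
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value, best_score

# Toy stand-in for a real train/validate routine: accuracy peaks at 0.6.
def toy_evaluate(threshold, fold):
    return 1.0 - abs(threshold - 0.6) - 0.01 * fold

grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # assumed grid
best, score = select_threshold(grid, folds=range(10), evaluate=toy_evaluate)
```

Averaging over folds before comparing grid values keeps a single lucky fold from deciding the threshold.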
5.1.2 The number of candidate labels for unseen instances
As mentioned above, the percentage of candidate labels assigned to unseen instances has great influence on the prediction for unseen instances. According to the analysis in Section 3.3, we take both the total number of class labels (CL*) and the average number of class labels (AVGCL*) into consideration, and then utilize Eq (10) to obtain the number of assigned labels . To validate Eq (10) empirically, we conduct experiments under different configurations and report the comparison results in Figure 8.
As shown in Figure 8, with the increase of , the classification accuracy of GMPLL first increases and then decreases. This behavior is intuitive: a smaller means that fewer noisy labels need to be removed, but the ground-truth label is less likely to be contained in the candidate label set; a larger means that the ground-truth label is more likely to be contained in the candidate label set, but more noisy labels tend to be drawn into it. Table VII compares, for each RW data set, the empirically optimal number of assigned candidate labels with the value calculated by Eq (10). As shown in Table VII, except for the FGNET data set, the empirically optimal number of candidate labels is basically identical to the value calculated by Eq (10).
Data set  Lost  MSRCv2  FGNET  BirdSong  SoccerPlayer  Yahoo! News 

3  3  4  3  2  2  
3  4  1  4  2  1 
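The agreement claimed for Table VII can be checked directly from its two rows. The sketch below copies the values from the table and reports the per-data-set gap; it deliberately makes no assumption about which row is the empirical optimum and which comes from Eq (10), since the absolute difference is the same either way. The variable names are illustrative.

```python
datasets = ["Lost", "MSRCv2", "FGNET", "BirdSong", "SoccerPlayer", "Yahoo! News"]
row_a = [3, 3, 4, 3, 2, 2]  # first row of Table VII
row_b = [3, 4, 1, 4, 2, 1]  # second row of Table VII

# Absolute gap between the two rows for each data set.
gaps = {name: abs(a - b) for name, a, b in zip(datasets, row_a, row_b)}
largest = max(gaps, key=gaps.get)
within_one = [name for name, g in gaps.items() if g <= 1]

for name, g in gaps.items():
    print(f"{name:12s} gap = {g}")
```

The two rows differ by at most one label on every data set except FGNET, where the gap is three.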
5.2 Time Consumption
Although we have adopted strategies to reduce the computational complexity of the proposed algorithm, the time consumption of the prediction model is still longer than that of some comparing methods on some large-scale data sets. Nonetheless, this time consumption is acceptable for the PLL problem. Specifically, on most UCI data sets, the time consumption is no more than 30 seconds, and on some small-scale or medium-scale RW data sets it is no more than 20 seconds. Moreover, although the time consumption of the prediction model is longer than that of some comparing methods, the total running time (training time plus testing time) is reasonable and sometimes even less than that of some state-of-the-art PLL methods, such as PLLEAF. According to our experimental results, the running time of the proposed method is no more than 1.5h on all RW data sets, which is only 1/10 of that of PLLEAF. Table VIII reports the total running time and testing time of the proposed algorithm on both UCI and RW data sets, measured in a Matlab environment on an Intel E5-2650 CPU.
Data set  Lost  MSRCv2  FGNET  BirdSong  SoccerPlayer 

running time  37.046s  127.818s  198.160s  281.765s  3271.877s 
testing time  0.837s  1.431s  1.254s  21.879s  422.012s 
Data set  Yahoo! News  glass  segment  satimage  vehicle 
running time  8612.220s  2.080s  80.095s  236.724s  7.138s 
testing time  1025.886s  0.204s  5.901s  29.574s  0.743s 
Data set  letter  abalone  ecoli  dermatology  pendigits 
running time  312.502s  268.547s  1.916s  2.924s  116.202s 
testing time  28.344s  21.380s  0.287s  0.334s  11.538s 
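To read Table VIII more easily, the running times can be converted to hours and the testing time expressed as a share of the total. The sketch below does this for the six RW data sets, with the values copied from the table; the dictionary layout and names are illustrative only.

```python
# (running time, testing time) in seconds, copied from Table VIII.
times = {
    "Lost": (37.046, 0.837),
    "MSRCv2": (127.818, 1.431),
    "FGNET": (198.160, 1.254),
    "BirdSong": (281.765, 21.879),
    "SoccerPlayer": (3271.877, 422.012),
    "Yahoo! News": (8612.220, 1025.886),
}

summary = {
    name: {"hours": run / 3600.0, "test_share": test / run}
    for name, (run, test) in times.items()
}

for name, s in summary.items():
    print(f"{name:12s} {s['hours']:.3f} h  testing {100 * s['test_share']:.1f}% of total")
```

The same conversion applies unchanged to the UCI rows of the table.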
6 Conclusion
In this paper, we have proposed a novel graph-matching based partial label learning method, GMPLL. To the best of our knowledge, this is the first work to reformulate the PLL problem as a graph matching structure. By incorporating prior knowledge and establishing INSCDL assignments, the proposed GMPLL algorithm effectively contributes valuable information to the learning model. Extensive experiments have demonstrated the effectiveness of the proposed method. In the future, we will explore other knowledge from PL data and improve the denoising method to further improve the effectiveness and robustness of the model.
References
 [1] L. Liu and T. Dietterich, “Learnability of the superset label learning problem,” in International Conference on Machine Learning, 2014, pp. 1629–1637.
 [2] Y. Chen, V. Patel, R. Chellappa, and P. Phillips, “Ambiguously labeled learning using dictionaries,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2076–2088, 2014.
 [3] L. Oukhellou, T. Denœux, and P. Aknin, “Learning from partially supervised data using mixture models and belief functions,” Pattern Recognition, vol. 42, no. 3, pp. 334–348, 2009.
 [4] J. Wang and M.-L. Zhang, “Towards mitigating the class-imbalance problem for partial label learning,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2427–2436.

 [5] Y.-C. Chen, V. M. Patel, J. K. Pillai, R. Chellappa, and P. J. Phillips, “Dictionary learning from ambiguously labeled data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 353–360.
 [6] M.-L. Zhang, “Disambiguation-free partial label learning,” in Proceedings of the 2014 SIAM International Conference on Data Mining, 2014, pp. 37–45.
 [7] Z. Zeng, S. Xiao, K. Jia, T. Chan, S. Gao, D. Xu, and Y. Ma, “Learning by associating ambiguously labeled images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 708–715.

 [8] M. Xie and S. Huang, “Partial multi-label learning,” in AAAI Conference on Artificial Intelligence, 2018, pp. 1–1.
 [9] C. H. Chen, V. M. Patel, and R. Chellappa, “Learning from ambiguously labeled face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2018.
 [10] T. Cour, B. Sapp, C. Jordan, and B. Taskar, “Learning from ambiguously labeled images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 919–926.
 [11] M. Zhang, B. Zhou, and X. Liu, “Partial label learning via feature-aware disambiguation,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1335–1344.
 [12] J. Luo and F. Orabona, “Learning from candidate labeling sets,” in Advances in Neural Information Processing Systems, 2010, pp. 1504–1512.
 [13] L. Liu and T. G. Dietterich, “A conditional multinomial mixture model for superset label learning,” in Advances in Neural Information Processing Systems, 2012, pp. 548–556.
 [14] R. Jin and Z. Ghahramani, “Learning with multiple labels,” in Advances in Neural Information Processing Systems, 2003, pp. 921–928.
 [15] L. Feng and B. An, “Leveraging latent label distributions for partial label learning,” in International Joint Conference on Artificial Intelligence, 2018, pp. 2107–2113.
 [16] N. Nguyen and R. Caruana, “Classification with partial labels,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 551–559.
 [17] F. Yu and M. Zhang, “Maximum margin partial label learning,” Machine Learning, vol. 106, no. 4, pp. 573–593, 2017.
 [18] M. Zhang and F. Yu, “Solving the partial label learning problem: an instance-based approach,” in International Joint Conference on Artificial Intelligence, 2015, pp. 4048–4054.
 [19] F. Yu and M.L. Zhang, “Maximum margin partial label learning,” in Asian Conference on Machine Learning, 2016, pp. 96–111.
 [20] M. Chertok and Y. Keller, “Spectral symmetry analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 7, pp. 1227–1238, 2009.
 [21] A. Egozi, Y. Keller, and H. Guterman, “Improving shape retrieval by spectral matching and meta similarity,” IEEE Transactions on Image Processing, vol. 19, no. 5, pp. 1319–1327, 2010.
 [22] J. H. Hays, M. Leordeanu, A. A. Efros, and Y. Liu, “Discovering texture regularity via higher-order matching,” in European Conference on Computer Vision, 2006, pp. 522–535.
 [23] T. Wang and H. Ling, “Gracker: A graph-based planar object tracker,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1494–1501, 2018.
 [24] T. Wang, H. Ling, C. Lang, and S. Feng, “Graph matching with adaptive and branching path following,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 1, pp. 1–1, 2017.
 [25] E. Hüllermeier and J. Beringer, “Learning from ambiguously labeled examples,” International Symposium on Intelligent Data Analysis, vol. 10, no. 5, pp. 168–179, 2005.
 [26] G. Chen, T. Liu, Y. Tang, Y. Jian, Y. Jie, and D. Tao, “A regularization approach for instancebased superset label learning,” IEEE Transactions on Cybernetics, vol. 48, no. 3, pp. 967–978, 2017.
 [27] C. Tang and M. Zhang, “Confidence-rated discriminative partial label learning,” in AAAI Conference on Artificial Intelligence, 2017, pp. 2611–2617.
 [28] Y. Grandvalet and Y. Bengio, “Learning from partial labels with minimum entropy,” Cirano Working Papers, pp. 512–517, 2004.
 [29] Y. Zhou, J. He, and H. Gu, “Partial label learning via gaussian processes,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4443–4450, 2016.
 [30] P. Vannoorenberghe and P. Smets, “Partially supervised learning by a credal em approach,” in European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, 2005, pp. 956–967.
 [31] M. Zhang, F. Yu, and C. Tang, “Disambiguation-free partial label learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2155–2167, 2017.
 [32] T. G. Dietterich and G. Bakiri, “Solving multiclass learning problems via error-correcting output codes,” Journal of Artificial Intelligence Research, vol. 2, pp. 263–286, 1995.
 [33] X. Wu and M.-L. Zhang, “Towards enabling binary decomposition for partial label learning,” in International Joint Conference on Artificial Intelligence, 2018, pp. 2868–2874.
 [34] M. Cho, J. Lee, and K. M. Lee, “Reweighted random walks for graph matching,” in European Conference on Computer Vision, 2010, pp. 492–505.
 [35] Z.-Y. Liu and H. Qiao, “GNCCP: graduated nonconvexity and concavity procedure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1258–1267, 2014.
 [36] A. Egozi, Y. Keller, and H. Guterman, “A probabilistic approach to spectral graph matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 18–27, 2013.
 [37] M. Leordeanu and M. Hebert, “A spectral technique for correspondence problems using pairwise constraints,” in International Conference on Computer Vision, 2005, pp. 1482–1489.
 [38] T. Cour, B. Sapp, and B. Taskar, “Learning from partial labels,” Journal of Machine Learning Research, vol. 12, pp. 1501–1536, 2011.
 [39] F. Briggs, X. Fern, and R. Raich, “Rank-loss support instance machines for MIML instance annotation,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 534–542.
 [40] G. Panis and A. Lanitis, “An overview of research activities in facial age estimation using the fgnet aging database,” Journal of American History, vol. 5, no. 2, pp. 37–46, 2016.
 [41] M. Guillaumin, J. Verbeek, and C. Schmid, “Multiple instance metric learning from automatically labeled bags of faces,” in European Conference on Computer Vision, 2010, pp. 634–647.