1 Introduction
Partial label (PL) learning is a weakly supervised learning problem with ambiguous labels
[10, 21], where each training instance is assigned a set of candidate labels, among which only one is the true label. Since it is typically difficult and costly to annotate instances precisely, the task of partial label learning naturally arises in many real-world learning scenarios, including automatic face naming [10, 21] and web mining [13]. As the true label information is hidden in the candidate label set, the main challenge of PL learning lies in identifying the ground-truth labels among the candidate noise labels. Intuitively, one basic strategy to tackle partial label learning is to perform label disambiguation. There are two main groups of disambiguation-based PL approaches: the average-based approaches and the identification-based approaches. For average-based disambiguation, each candidate label is treated equally in model induction and the final prediction is made by averaging the modeling outputs of all the candidate labels [10, 3, 23]. Without differentiating the informative true labels from the irrelevant noise labels, such simple averaging methods in general cannot produce satisfactory performance. Hence recent studies mostly focus on identification-based disambiguation methods. Many identification-based disambiguation methods treat the ground-truth labels as latent variables and identify the true labels by employing iterative label refining procedures [11, 22, 16]. For example, the work in [5]
tries to estimate the latent label distribution with iterative label propagation and then induces a prediction model by fitting the learned latent label distribution. Another work in
[12] exploits a self-training strategy to induce label confidence values and learn classifiers in an alternating manner by minimizing the squared loss between the model predictions and the learned label confidence matrix. However, these methods suffer from the cumulative errors induced in either the separate label distribution estimation steps or the error-prone label confidence estimation process. Moreover, all these methods share a common drawback: they implicitly assume random noise in the label space, that is, they assume the noise labels are randomly distributed. However, in real-world problems the appearance of noise labels usually depends on the target true labels. For example, when the object contained in an image is a “computer”, a noise label “TV” could be added due to a recognition mistake or image ambiguity, but it is less likely that the object is annotated as “lamp” or “curtain”, while the probability of getting noise labels such as “tree” or “bike” is even smaller.
In this paper, we propose a novel multi-level adversarial generative model, MGPLL, for partial label learning. The MGPLL model comprises conditional data generators at both the label level and the feature level. The noise label generator directly models non-random appearances of noise labels conditioned on the true label by adversarially matching the candidate label observations, while the data feature generator models the data samples conditioned on the corresponding true labels by adversarially matching the observed data sample distribution. Moreover, a prediction network is incorporated to predict the denoised true label of each instance from its input features, which, together with the data feature generator, forms inverse mappings between labels and features. The learning of the overall model corresponds to a minimax adversarial game, which simultaneously identifies the true labels of the training instances from both the observed data features and the observed candidate labels, while inducing an accurate prediction network that maps input feature vectors to (denoised) true label vectors. To the best of our knowledge, this is the first work that exploits multi-level generative models to model non-random noise labels for partial label learning. We conduct extensive experiments on real-world and synthesized PL datasets. The empirical results show that the proposed MGPLL achieves state-of-the-art PL performance.
2 Related Work
Partial label learning is a weakly supervised classification problem arising in many real-world domains, where the true label of each training instance is hidden within a given candidate label set. The PL setting is different from noisy label learning, where the ground-truth labels of some instances are replaced by noise labels. The key to PL learning lies in how to disambiguate the candidate labels. Existing disambiguation-based approaches mainly follow two strategies: the average-based strategy and the identification-based strategy.
The average-based strategy assumes that each candidate label contributes equally to model training, and then averages the outputs over all the candidate labels for the final prediction. Following such a strategy, the discriminative learning methods [3, 23] distinguish the averaged model outputs based on all the candidate labels from the averaged outputs based on all the non-candidate labels. The instance-based learning methods [10, 7], on the other hand, predict the label of a test instance by averaging the candidate labels of its neighbors. The simple average-based strategy however cannot produce satisfactory performance, since it fails to take the differences among the candidate labels into account.
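As a concrete illustration of the instance-based averaging strategy, the following minimal sketch (function and variable names are hypothetical, not from any of the cited methods) predicts by letting the candidate labels of the k nearest neighbors vote with equal weight:

```python
import numpy as np

def knn_average_predict(X_train, S_train, x_test, k=3):
    """Average-based disambiguation sketch: predict the label of a test
    instance by averaging (voting over) the candidate label vectors of its
    k nearest training neighbors. S_train is an (n, L) 0/1 candidate label
    indicator matrix; ties are broken by the smallest label index."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = S_train[nearest].sum(axis=0)  # each candidate label counts equally
    return int(np.argmax(votes))

# Toy example: two clusters; each candidate set holds the true label plus noise.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
S = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]])
pred = knn_average_predict(X, S, np.array([0.05, 0.0]), k=2)
```

Because every candidate label votes with the same weight, a frequently co-occurring noise label can outvote the true one, which is exactly the weakness noted above.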
By considering the differences between candidate labels, the identification-based strategy has gained increasing attention due to its effectiveness in handling the candidate labels with discrimination. Many existing approaches following this strategy treat the ground-truth labels as latent variables. The latent variables and model parameters are then refined via EM procedures which optimize an objective function based on a maximum likelihood criterion [11] or a maximum margin criterion [14]. Recently, some researchers proposed to learn the label confidence value of each candidate label by leveraging the topological information in the feature space and achieved promising results [22, 5, 20, 6, 18]. One previous work in [5] attempts to estimate the latent label distribution using iterative label propagation along the topological structure extracted based on the local consistency assumption, i.e., nearby instances are supposed to have similar label distributions. Then, a prediction model is induced from the learned latent label distribution. However, the latent label distribution estimation can be impaired by the cumulative error induced in the propagation process, which can consequently degrade the partial label learning performance. Another work in [12] tries to refine the label confidence values with a self-training strategy and induces the prediction model over the refined label confidence scores by alternating optimization. However, due to the nature of alternating optimization, the estimation error on confidence values can negatively impact the coupled partial label predictor. Moreover, all these existing methods have assumed random label noise by default, which however does not hold in many real-world learning scenarios. This paper presents the first work that explicitly models non-random noise labels for partial label learning.
3 The Proposed Approach
Given a partial label training set $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{s}_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector for the $i$-th instance, and $\mathbf{s}_i \in \{0,1\}^{L}$ denotes the candidate label indicator vector associated with $\mathbf{x}_i$, which has multiple 1 values corresponding to the ground-truth label and the noise labels, the task is to learn a good multi-class prediction model. In real-world scenarios, the irrelevant noise labels are typically not presented in a random manner, but rather correlated with the ground-truth label. In this section, we present a novel multi-level generative model for partial label learning, MGPLL. The model is illustrated in Figure 1. It models non-random noise labels using an adversarial conditional noise label generator $G_n$ with a corresponding discriminator $D_n$, and builds connections between the denoised label vectors and instance features using a label-conditioned feature-level generator $G_x$ and a label prediction network $F$. The overall model learning problem corresponds to a minimax adversarial game, which conducts multi-level generator learning by matching the observed data in both the feature and label spaces, while strengthening the correspondence between features and labels to induce an accurate multi-class prediction model. Below we present the details of the two-level generation, the prediction network, and the overall learning problem.
3.1 Conditional Noise Label Generation
The key challenge of partial label learning lies in the fact that the ground-truth label is hidden among noise labels in the given candidate label set. As aforementioned, in real-world partial label learning problems, the appearances of noise labels are typically not random, but rather correlated with the ground-truth labels. Hence we propose a conditional noise label generation model to model the appearances of the target-label dependent noise labels and match the observed candidate label distribution in the training data through adversarial learning.
Specifically, given a noise value $\mathbf{z}$ sampled from a uniform distribution $P_z$ and a one-hot label indicator vector $\mathbf{y}$ sampled from a multinomial distribution $P_y$, we use a noise label generator $G_n(\mathbf{z}, \mathbf{y})$ to generate a noise label vector $\mathbf{n}$ conditioned on the true label $\mathbf{y}$, which can be combined with $\mathbf{y}$ to form a generated candidate label vector $\tilde{\mathbf{s}}$, such that $\tilde{\mathbf{s}} = \mathbf{y} \oplus \mathbf{n}$. We then adopt the adversarial learning principle to learn such a noise label generation model by introducing a discriminator $D_n$, which is a two-class classifier and predicts how likely a given label vector comes from the real data instead of the generated data. By adopting the adversarial loss of the Wasserstein Generative Adversarial Network (WGAN) [1], our adversarial learning problem can be formulated as the following minimax optimization problem:

(1)  $\min_{G_n} \max_{D_n}\; \mathbb{E}_{\mathbf{s} \sim P_{\mathcal{D}}}\big[D_n(\mathbf{s})\big] - \mathbb{E}_{\mathbf{z} \sim P_z,\, \mathbf{y} \sim P_y}\big[D_n(\mathbf{y} \oplus G_n(\mathbf{z}, \mathbf{y}))\big]$
Here the discriminator $D_n$ attempts to maximally distinguish the generated candidate label vectors from the observed candidate label indicator vectors in the real training data, while the generator $G_n$ tries to generate noise label vectors, and hence candidate label vectors, that are similar to the real data in order to maximally confuse the discriminator. By playing a minimax game between the generator and the discriminator, the adversarial learning is expected to induce a generator such that the generated candidate label distribution matches the observed candidate label distribution in the training data [8]. We adopt the training loss of WGAN here, as WGAN can overcome the mode collapse problem and offers improved learning stability compared to the standard GAN [1].
Note that although the proposed generator is designed to model true-label dependent noise labels, it can be easily modified to model random noise label distributions by simply dropping the label vector input, i.e., using $G_n(\mathbf{z})$.
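To make the label-level generation concrete, the following minimal numpy sketch replaces the learned generator and critic networks with hand-written stand-ins (the co-occurrence table, shapes, and function names are illustrative assumptions, not the paper's architecture). It shows how label-dependent noise vectors are combined with a true label into a generated candidate vector, and how a WGAN-style critic objective is scored:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4  # number of classes (toy setting)

def noise_label_generator(z, y):
    """Stand-in for the noise label generator: map noise z and a one-hot
    true label y to per-label noise probabilities. Row c of the table gives
    the probability of each label appearing as noise when the true label is
    c -- i.e., the noise is correlated with the target label, not random."""
    co_occur = np.array([[0.0, 0.6, 0.1, 0.1],
                         [0.5, 0.0, 0.2, 0.1],
                         [0.1, 0.2, 0.0, 0.5],
                         [0.1, 0.1, 0.6, 0.0]])
    c = int(np.argmax(y))
    return np.clip(co_occur[c] + 0.05 * z, 0.0, 1.0)

def candidate_vector(y, noise_probs):
    """Combine the true label with sampled noise labels into a generated
    candidate label indicator vector (the fake counterpart of the real one)."""
    n = (rng.random(L) < noise_probs).astype(float)
    return np.clip(y + n, 0.0, 1.0)

def critic_loss(critic, real_batch, fake_batch):
    """WGAN critic objective: E[D(real)] - E[D(fake)], maximized over D."""
    return critic(real_batch).mean() - critic(fake_batch).mean()

y = np.eye(L)[0]
s_fake = candidate_vector(y, noise_label_generator(rng.uniform(size=L), y))
```

In the actual model both the generator and the critic are neural networks trained by gradient descent; the sketch only exposes the data flow of the adversarial matching.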
3.2 Prediction Network
The ultimate goal of partial label learning is to learn an accurate prediction network $F$. To train a good predictor, we need to obtain denoised labels on the training data. For a candidate label indicator vector $\mathbf{s}$, if the noise label indicator vector $\mathbf{n}$ is given, one can simply perform label denoising as follows to obtain the corresponding true label vector $\mathbf{y}$:
(2)  $\mathbf{y} = \mathbf{s} \ominus \mathbf{n}$

Here the operator “$\ominus$” is introduced to generalize the standard minus “$-$” operator to the non-ideal case, where the noise label indicator vector is not properly contained in the candidate label indicator vector.
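As a concrete illustration, the clamped subtraction below is one way to realize the $\ominus$ operator (the clamp at zero is an assumption of this sketch, chosen so a spurious noise label cannot produce a negative entry), together with the mean squared error later used for the classification loss:

```python
import numpy as np

def denoise(s, n):
    """Generalized subtraction s (-) n: remove predicted noise labels from the
    candidate indicator vector, clamping entries so a noise label that is not
    actually in the candidate set cannot drive an entry negative."""
    return np.clip(s - n, 0.0, 1.0)

s = np.array([1.0, 1.0, 0.0, 1.0])  # candidate labels: {0, 1, 3}
n = np.array([0.0, 1.0, 1.0, 0.0])  # predicted noise: {1, 2}; label 2 is spurious
y = denoise(s, n)                   # denoised target: {0, 3}

# Illustrative soft predictions from the prediction network, scored against
# the denoised target with a mean squared error, as in the classification loss.
f_out = np.array([0.8, 0.1, 0.05, 0.6])
cls_loss = float(np.mean((f_out - y) ** 2))
```

With these values the denoised vector keeps labels 0 and 3 and drops both predicted noise labels, including the spurious one.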
The generator $G_n$ presented in the previous section provides a mechanism to generate noise labels and denoise candidate label sets, but requires the true target label vector as input. We can use the outputs of the prediction network to approximate the target true label vectors of the training data for candidate label denoising, with $\mathbf{n}_i = G_n(\mathbf{z}_i, F(\mathbf{x}_i))$, while using the denoised labels as the prediction targets for $F$. Specifically, with the noise label generator $G_n$ and predictor $F$, we can perform partial label learning by minimizing the following classification loss on the training data $\mathcal{D}$:

(3)  $\mathcal{L}_c(F, G_n) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(F(\mathbf{x}_i),\; \mathbf{s}_i \ominus G_n(\mathbf{z}_i, F(\mathbf{x}_i))\big)$
Although in the ideal case the output vectors of $G_n$ and $F$ would be indicator label vectors, it is error-prone and difficult for neural networks to output discrete values. To pursue more reliable predictions and avoid overconfident outputs, $G_n$ and $F$ predict the probability of each class label being a noise label and a ground-truth label, respectively. Hence the loss function $\ell(\cdot, \cdot)$ in Eq.(3) above denotes a mean squared error loss between the predicted probability of each label being the true label (through $F$) and its confidence of being a ground-truth label (through $G_n$).

3.3 Conditional Feature-Level Data Generation
With the noise label generation model and the prediction network above, the observed training data in both the label and feature spaces are exploited to recognize the true labels and induce good prediction models. Next, we incorporate a conditional data generator at the feature level to map (denoised) label vectors in the label space into instances in the feature space, aiming to further strengthen the mapping relations between data samples and the corresponding labels, enhance label denoising, and hence improve partial label learning performance. Specifically, given a noise value $\mathbf{z}$ sampled from a uniform distribution and a one-hot label vector $\mathbf{y}$ sampled from a multinomial distribution, the generator $G_x(\mathbf{z}, \mathbf{y})$ generates an instance in the feature space corresponding to label $\mathbf{y}$. Hence, given the training label vectors denoised with $G_n$, the data generator $G_x$ is expected to regenerate the corresponding training instances in the feature space. This assumption can be captured using the following generation loss:
(4)  $\mathcal{L}_g(G_x) = \frac{1}{n} \sum_{i=1}^{n} \ell_x\big(G_x(\mathbf{z}_i, \hat{\mathbf{y}}_i),\; \mathbf{x}_i\big)$

where $\hat{\mathbf{y}}_i$ denotes the denoised label vector for the $i$-th training instance, and $\ell_x$ is a mean squared error loss function.
Moreover, by introducing a discriminator $D_x$, which predicts how likely a given instance is real, we can deploy an adversarial learning scheme to learn the generator $G_x$ through the following minimax optimization problem with the WGAN loss:

(5)  $\min_{G_x} \max_{D_x}\; \mathbb{E}_{\mathbf{x} \sim P_{\mathcal{D}}}\big[D_x(\mathbf{x})\big] - \mathbb{E}_{\mathbf{z} \sim P_z,\, \mathbf{y} \sim P_y}\big[D_x(G_x(\mathbf{z}, \mathbf{y}))\big]$
By playing a minimax game between $G_x$ and $D_x$, this adversarial learning is expected to induce a generator that can generate samples with the same distribution as the observed training instances. Hence the mapping relation from label vectors to samples induced by $G_x$ should also hold on the real training data, and should be consistent with the inverse mapping from samples to label vectors through the prediction network $F$. Therefore, we can further consider an auxiliary classification loss on the generated data:
(6)  $\mathcal{L}_{aux}(G_x, F) = \mathbb{E}_{\mathbf{z} \sim P_z,\, \mathbf{y} \sim P_y}\big[\ell_{ce}\big(F(G_x(\mathbf{z}, \mathbf{y})),\; \mathbf{y}\big)\big]$

where $\ell_{ce}$ can be a cross-entropy loss between the label prediction probability vector and the sampled true label indicator vector.
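The two feature-level losses can be illustrated with a small numpy sketch; all tensors and names here are toy stand-ins for the generator and predictor outputs, not the model's actual networks:

```python
import numpy as np

def generation_loss(x_generated, x_real):
    """Eq.(4)-style mean squared error between the regenerated instance
    (from the denoised label) and the real training instance."""
    return float(np.mean((x_generated - x_real) ** 2))

def aux_classification_loss(pred_probs, y_onehot, eps=1e-12):
    """Eq.(6)-style cross-entropy between the predictor's output on a
    generated sample and the sampled true label used to generate it."""
    return float(-np.sum(y_onehot * np.log(pred_probs + eps)))

x_real = np.array([0.2, -0.1, 0.4])
x_gen = np.array([0.1, 0.0, 0.5])   # feature generator output (illustrative)
y = np.array([0.0, 1.0, 0.0])       # sampled true label
p = np.array([0.2, 0.7, 0.1])       # predictor's softmax output on x_gen
g_loss = generation_loss(x_gen, x_real)
a_loss = aux_classification_loss(p, y)
```

The generation loss anchors the generator to the real instances, while the auxiliary loss pushes the predictor and generator toward mutually consistent label-feature mappings.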
3.4 Learning the MGPLL Model
By integrating the classification loss in Eq.(3), the adversarial losses in Eq.(1) and Eq.(5), the generation loss in Eq.(4) and the auxiliary classification loss in Eq.(6) together, MGPLL learning can be formulated as the following minimax optimization problem:

(7)  $\min_{G_n, G_x, F} \max_{D_n, D_x}\; \mathcal{L}_c + \alpha\, \mathcal{L}^{n}_{adv} + \beta\, \mathcal{L}^{x}_{adv} + \gamma\, (\mathcal{L}_g + \mathcal{L}_{aux})$

where $\mathcal{L}^{n}_{adv}$ and $\mathcal{L}^{x}_{adv}$ denote the adversarial objectives in Eq.(1) and Eq.(5), respectively.
Here $\alpha$, $\beta$ and $\gamma$ are trade-off hyperparameters. The learning of the overall model corresponds to a minimax adversarial game. We develop a batch-based stochastic gradient descent algorithm to solve it by conducting minimization over $\{G_n, G_x, F\}$ and maximization over $\{D_n, D_x\}$ alternately. The overall training algorithm is outlined in Algorithm 1.

Input: $\mathcal{D}$: the PL training set;
$\alpha$, $\beta$, $\gamma$: the trade-off hyperparameters;
$c$: the clipping parameter;
$m$: the minibatch size.
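A skeleton of this alternating optimization, with all network updates replaced by placeholders, might look as follows; the critic-steps-per-generator-step and the clipping of discriminator weights follow the WGAN convention, while the specific values and names are assumptions of the sketch:

```python
import numpy as np

def train_mgpll(batches, n_critic=5, clip_c=0.01):
    """Skeleton of the alternating minimax training loop. The discriminators
    are updated n_critic times per generator/predictor step, with their
    weights clipped to [-clip_c, clip_c] as in WGAN; real code would
    backpropagate through the five component networks instead of the
    placeholder arithmetic used here."""
    log = []
    disc_weights = np.zeros(3)  # placeholder for {D_n, D_x} parameters
    for step, batch in enumerate(batches):
        for _ in range(n_critic):
            # maximize over {D_n, D_x}: ascend the critic losses, then clip
            disc_weights += 0.1 * np.sign(batch.mean())  # placeholder update
            disc_weights = np.clip(disc_weights, -clip_c, clip_c)
        # minimize over {G_n, G_x, F}: descend the combined objective
        log.append(("update_generators_and_predictor", step))
    return disc_weights, log

batches = [np.array([0.5, -0.2]), np.array([0.1, 0.3])]
w, log = train_mgpll(batches)
```

The clipping step is what keeps the WGAN critics Lipschitz-bounded; everything else in the sketch is scaffolding around the two alternating phases.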
4 Experiment
4.1 Datasets
We conducted experiments on both controlled synthetic PL datasets and a number of real-world PL datasets.
The synthetic datasets are generated from four UCI datasets, ecoli, vehicle, segment and satimage, which have 8, 4, 7 and 7 classes, and 336, 846, 2,310 and 6,345 examples, respectively. From each UCI dataset, we generated synthetic PL datasets using three controlling parameters $p$, $r$ and $\epsilon$, following the controlling protocol in previous works [19, 20, 12]. Among the three parameters, $p$ controls the proportion of instances that have noise candidate labels, $r$ controls the number of false positive labels, and $\epsilon$ controls the probability of a specific false positive label co-occurring with the true label. Under different parameter configurations, multiple PL variants can be generated from each UCI dataset. In particular, we considered two settings. In the first setting, we consider random noise labels with the following three groups of configurations: (I) $r=1$, $p \in \{0.1, 0.2, \ldots, 0.7\}$; (II) $r=2$, $p \in \{0.1, 0.2, \ldots, 0.7\}$; (III) $r=3$, $p \in \{0.1, 0.2, \ldots, 0.7\}$. In the second setting, we consider the target label-dependent noise labels with the following configuration: (IV) $p=1$, $r=1$, $\epsilon \in \{0.1, 0.2, \ldots, 0.7\}$. In total, this provides us 112 (28 configurations × 4 UCI datasets) synthetic PL datasets.
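For illustration, a minimal sketch of this corruption protocol under the label-dependent setting (IV) could look as follows; the coupled-label map and helper names are assumptions of the sketch, not the exact protocol implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_candidates(y_true, num_classes, p=1.0, r=1, eps=0.3, coupled=None):
    """Controlled PL corruption sketch: with probability p an instance
    receives r extra candidate labels. Under the label-dependent setting,
    each extra label is the fixed coupled label of the true class with
    probability eps, otherwise a uniformly random incorrect label.
    coupled[c] maps a class to its co-occurring noise label."""
    s = np.zeros(num_classes)
    s[y_true] = 1.0
    if rng.random() < p:
        for _ in range(r):
            if coupled is not None and rng.random() < eps:
                s[coupled[y_true]] = 1.0
            else:
                others = [c for c in range(num_classes) if c != y_true]
                s[rng.choice(others)] = 1.0
    return s

coupled = {0: 1, 1: 0, 2: 3, 3: 2}  # e.g. the "computer" <-> "TV" coupling
s = make_candidates(2, 4, p=1.0, r=1, eps=0.7, coupled=coupled)
```

Setting eps close to 1 makes the coupled label dominate the noise, which is the regime where random-noise assumptions break down.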
Dataset  #Example  #Feature  #Class  avg.#CLs 

FGNET  1,002  262  78  7.48 
Lost  1,122  108  16  2.23 
MSRCv2  1,758  48  23  3.16 
BirdSong  4,998  38  13  2.18 
Yahoo! News  22,991  163  219  1.91 
Table 1: Characteristics of the real-world PL datasets.
We used five real-world PL datasets that are collected from several application domains, including FGNET [15] for facial age estimation, Lost [3] and Yahoo! News [9] for automatic face naming in images or videos, MSRCv2 [4] for object classification, and BirdSong [2] for bird song classification. The characteristics of the real-world PL datasets are summarized in Table 1.
4.2 Comparison Methods
We compared the proposed MGPLL approach with the following PL methods, each configured with the suggested parameters according to the respective literature:


PLKNN [10]: An instance-based method which makes predictions by averaging the candidate labeling information of neighboring instances.

PLSVM [14]: A maximum-margin based method which maximizes the classification margin between candidate and non-candidate class labels.

CLPL [3]: A convex optimization based method for partial label learning.

PALOC [19]: An ensemble method which trains multiple binary classifiers with the one-vs-one decomposition strategy and makes predictions by consulting all binary classifiers.

SURE [12]: A self-training based method which learns a confidence matrix of candidate labels with a maximum infinity norm regularization and trains the prediction model over the learned label confidence matrix.
4.3 Implementation Details
The proposed MGPLL model has five component networks, all of which are designed as multi-layer perceptrons with Leaky ReLU activation for the middle layers. The noise label generator $G_n$ is a four-layer network with sigmoid activation in the output layer. The conditional data generator $G_x$ is a five-layer network with tanh activation in the output layer, while batch normalization is deployed in its middle three layers. The predictor $F$ is a three-layer network with softmax activation in the output layer. Both the noise label discriminator $D_n$ and the data discriminator $D_x$ are three-layer networks without activation in the output layer. The RMSProp [17] optimizer is used in our implementation and the minibatch size $m$ is set to 32. We selected the hyperparameters $\alpha$, $\beta$ and $\gamma$ from {0.001, 0.01, 0.1, 1, 10} based on the classification loss value in the training objective function; that is, we chose the values that lead to the smallest training loss.

MGPLL vs –

SURE  PALOC  CLPL  PLSVM 
varying $p$ ($r=1$)  18/7/3  22/6/0  24/4/0  24/4/0 
varying $p$ ($r=2$)  16/9/3  19/9/0  21/7/0  22/6/0 
varying $p$ ($r=3$)  14/12/2  18/10/0  20/8/0  23/5/0 
varying $\epsilon$  15/13/0  18/10/0  18/10/0  21/7/0 
Total  63/41/8  77/35/0  83/29/0  90/22/0 
Table 2: Win/tie/loss counts of pairwise t-test (at 0.05 significance level) between MGPLL and each comparison approach.
4.4 Results on Synthetic PL Datasets
We conducted experiments on two types of synthetic PL datasets generated from the UCI datasets, with random noise labels and target label-dependent noise labels, respectively. For each PL dataset, ten-fold cross-validation is performed and the average test accuracy results are recorded. First, we study the comparison results over the PL datasets with target label-dependent noise labels under PL configuration setting IV. In this setting, a specific label is selected as the coupled label that co-occurs with the ground-truth label with probability $\epsilon$, and any other label can be randomly chosen as a noise label otherwise. Figure 2 presents the comparison results for configuration setting IV, where $\epsilon$ increases from 0.1 to 0.7 with $p=1$ and $r=1$. From Figure 2 we can see that the proposed MGPLL produces promising results. It consistently outperforms the other methods across different $\epsilon$ values on three datasets, while achieving remarkable gains on segment and satimage. We also conducted experiments on the PL datasets with random noise labels produced under PL configuration settings I, II and III, where MGPLL (with the noise label generator $G_n(\mathbf{z})$) achieves similar positive comparison results. Due to the limitation of space, instead of including the comparison figures, we summarize the comparison results below with statistical significance tests.
To statistically study the significance of the performance gains achieved by MGPLL over the other comparison methods, we conducted pairwise t-tests at 0.05 significance level based on the ten-fold cross-validation results over all the 112 synthetic PL datasets obtained across the different configuration settings. The detailed win/tie/loss counts between MGPLL and each comparison method are reported in Table 2, from which we have the following observations: (1) MGPLL achieves superior or at least comparable performance over PALOC, CLPL, and PLSVM in all cases, which is not easy given that the comparison methods have different strengths across different datasets. (2) MGPLL significantly outperforms PALOC, CLPL, and PLSVM in 68.7%, 74.1%, and 80.3% of the cases respectively, and produces ties in the remaining cases. (3) MGPLL significantly outperforms SURE in 56.2% of the cases, achieves comparable performance in 36.6%, and is outperformed by SURE in only the remaining 7.1% of the cases. (4) On the PL datasets with target label-dependent noise labels, MGPLL significantly outperforms SURE, PALOC, CLPL, and PLSVM in 53.5%, 64.2%, 64.2%, and 75.0% of the cases, respectively. (5) It is worth noting that MGPLL is never significantly outperformed by any comparison method on the datasets with label-dependent noise labels. In summary, these results on the controlled PL datasets clearly demonstrate the effectiveness of MGPLL for partial label learning under different settings.
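The win/tie/loss determination can be sketched as a paired t-test over the ten cross-validation folds; the implementation below uses the two-tailed critical value for df = 9 instead of a p-value routine, and the accuracy lists are purely illustrative:

```python
import numpy as np

def paired_ttest_win(acc_a, acc_b, t_crit=2.262):
    """Paired t-test sketch for ten-fold CV accuracies: returns 'win',
    'loss', or 'tie' for method A vs. method B at the 0.05 level
    (t_crit = 2.262 is the two-tailed critical value for df = 9)."""
    d = np.asarray(acc_a) - np.asarray(acc_b)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"

# illustrative fold accuracies for two methods
a = [0.80, 0.82, 0.81, 0.83, 0.80, 0.82, 0.81, 0.84, 0.82, 0.81]
b = [0.74, 0.75, 0.73, 0.76, 0.74, 0.75, 0.73, 0.77, 0.75, 0.74]
result = paired_ttest_win(a, b)
```

Counting one such comparison per dataset and configuration yields the win/tie/loss totals of Table 2.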
Dataset  MGPLL  SURE  PALOC  CLPL  PLSVM  PLKNN 

FGNET  0.079±0.024  0.068±0.032  0.064±0.019  0.063±0.027  0.063±0.029  0.038±0.025 
FGNET(MAE3)  0.468±0.027  0.458±0.024  0.435±0.018  0.458±0.022  0.356±0.022  0.269±0.045 
FGNET(MAE5)  0.626±0.022  0.615±0.019  0.609±0.043  0.596±0.017  0.479±0.016  0.438±0.053 
Lost  0.798±0.033  0.780±0.036  0.629±0.056  0.742±0.038  0.729±0.042  0.424±0.036 
MSRCv2  0.533±0.021  0.481±0.036  0.479±0.042  0.413±0.041  0.461±0.046  0.448±0.037 
BirdSong  0.748±0.020  0.728±0.024  0.711±0.016  0.632±0.019  0.660±0.037  0.614±0.021 
Yahoo! News  0.678±0.008  0.644±0.015  0.625±0.005  0.462±0.009  0.629±0.010  0.457±0.004 
Table 3: Test accuracy (mean±std) of each comparison method on the real-world PL datasets.
Dataset  MGPLL  CLS-w/o-advn  CLS-w/o-advx  CLS-w/o-g  CLS-w/o-aux  CLS 

FGNET  0.079±0.024  0.061±0.024  0.072±0.020  0.068±0.029  0.076±0.022  0.057±0.016 
FGNET(MAE3)  0.468±0.027  0.430±0.029  0.451±0.032  0.436±0.038  0.456±0.033  0.420±0.420 
FGNET(MAE5)  0.626±0.022  0.583±0.055  0.605±0.031  0.590±0.045  0.612±0.044  0.570±0.034 
Lost  0.798±0.033  0.623±0.037  0.754±0.032  0.687±0.026  0.782±0.043  0.609±0.040 
MSRCv2  0.533±0.021  0.472±0.030  0.480±0.038  0.497±0.031  0.526±0.036  0.450±0.037 
BirdSong  0.748±0.020  0.728±0.010  0.732±0.011  0.716±0.011  0.742±0.024  0.674±0.016 
Yahoo! News  0.678±0.008  0.645±0.008  0.675±0.009  0.648±0.014  0.671±0.012  0.610±0.015 
Table 4: Test accuracy (mean±std) of MGPLL and its ablation variants on the real-world PL datasets.
4.5 Results on RealWorld PL Datasets
We compared the proposed MGPLL method with the comparison methods on five real-world PL datasets. For each dataset, ten-fold cross-validation is conducted, and the mean test accuracy as well as the standard deviation results are reported in Table 3. Moreover, a statistical pairwise t-test at 0.05 significance level is conducted to compare MGPLL with each comparison method based on the results of ten-fold cross-validation; the significance results are indicated in Table 3 as well. Note that the average number of candidate labels (avg.#CLs) of the FGNET dataset is quite large, which causes poor performance for all the comparison methods. For better evaluation of this facial age estimation task, we employ the conventional mean absolute error (MAE) [23] to conduct two extra experiments: two extra test accuracies are reported on the FGNET dataset, where a test sample is considered to be correctly predicted if the difference between the predicted age and the ground-truth age is less than 3 years (MAE3) or 5 years (MAE5). From Table 3 we have the following observations: (1) Compared with all the five PL methods, MGPLL consistently produces the best results on all the datasets, with remarkable performance gains in many cases. For example, MGPLL outperforms the best alternative comparison method by 5.2%, 3.4% and 2.0% on MSRCv2, Yahoo! News and BirdSong, respectively. (2) Out of the total 35 comparison cases (5 comparison methods × 7 datasets), MGPLL significantly outperforms all the comparison methods in 77.1% of the cases, and achieves competitive performance in the remaining 22.9% of the cases. (3) It is worth noting that the performance of MGPLL is never significantly inferior to any of the comparison methods. These results on the real-world PL datasets again validate the efficacy of the proposed method.

4.6 Ablation Study
The objective function of MGPLL contains five loss terms: the classification loss, the adversarial loss at the label level, the adversarial loss at the feature level, the generation loss and the auxiliary classification loss. To assess the contribution of each part, we conducted an ablation study by comparing MGPLL with the following ablation variants: (1) CLS-w/o-advn, which drops the adversarial loss at the label level; (2) CLS-w/o-advx, which drops the adversarial loss at the feature level; (3) CLS-w/o-g, which drops the generation loss; (4) CLS-w/o-aux, which drops the auxiliary classification loss; and (5) CLS, which only uses the classification loss by dropping all the other loss terms. The comparison results are reported in Table 4. We can see that, compared to the full model, all five variants produce inferior results in general and suffer performance degradations to different degrees. This demonstrates that the different components of MGPLL all contribute to the proposed model to some extent. From Table 4, we can also see that the variant CLS-w/o-advn has a relatively large performance degradation from dropping the adversarial loss at the label level, while the variant CLS-w/o-aux has only a small performance degradation from dropping the auxiliary classification loss. This makes sense, as by dropping the adversarial loss for learning the noise label generator, the generator can produce poor noise label predictions and seriously impair the label denoising of the MGPLL model. This suggests that our non-random noise label generation through adversarial learning is a very effective and important component of MGPLL. For CLS-w/o-aux, as we already have the classification loss on real data, it is reasonable that the auxiliary classification loss on generated data can help but is not critical. Overall, the ablation study results suggest that the proposed MGPLL is effective.
5 Conclusion
In this paper, we proposed a novel multi-level generative model, MGPLL, for partial label learning. MGPLL uses a conditional label-level generator to model target-label dependent non-random noise label appearances, which directly performs candidate label denoising, while using a conditional feature-level generator to generate data samples from denoised label vectors. Moreover, a prediction network is incorporated to predict the denoised true label of each instance from its input features, which, together with the data feature generator, forms inverse mappings between labels and features. The adversarial learning of the overall model simultaneously identifies the true labels of the training instances from both the observed data features and the observed candidate labels, while inducing an accurate prediction network that maps input feature vectors to (denoised) true label vectors. We conducted extensive experiments on real-world and synthesized PL datasets. The proposed MGPLL model demonstrates state-of-the-art PL performance.
References

[1] (2017) Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning (ICML).
[2] (2012) Rank-loss support instance machines for MIML instance annotation. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[3] (2011) Learning from partial labels. Journal of Machine Learning Research 12(May), pp. 1501–1536.
[4] (1994) Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, pp. 263–286.
[5] (2018) Leveraging latent label distributions for partial label learning. In International Joint Conference on Artificial Intelligence (IJCAI).
[6] (2019) Partial label learning by semantic difference maximization. In International Joint Conference on Artificial Intelligence (IJCAI).
[7] (2018) A regularization approach for instance-based superset label learning. IEEE Transactions on Cybernetics 48(3), pp. 967–978.
[8] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).
[9] (2010) Multiple instance metric learning from automatically labeled bags of faces. In European Conference on Computer Vision (ECCV).
[10] (2006) Learning from ambiguously labeled examples. Intelligent Data Analysis 10(5), pp. 419–439.
[11] (2003) Learning with multiple labels. In Advances in Neural Information Processing Systems (NeurIPS).
[12] (2019) Partial label learning with self-guided retraining. In AAAI Conference on Artificial Intelligence (AAAI).
[13] (2010) Learning from candidate labeling sets. In Advances in Neural Information Processing Systems (NeurIPS).
[14] (2008) Classification with partial labels. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[15] (2014) An overview of research activities in facial age estimation using the FG-NET aging database. In European Conference on Computer Vision (ECCV).
[16] (2017) Confidence-rated discriminative partial label learning. In AAAI Conference on Artificial Intelligence (AAAI).
[17] (2012) Lecture 6.5 RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2), pp. 26–31.
[18] (2019) Adaptive graph guided disambiguation for partial label learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 83–91.
[19] (2018) Towards enabling binary decomposition for partial label learning. In International Joint Conference on Artificial Intelligence (IJCAI).
[20] (2019) Partial label learning via label enhancement. In AAAI Conference on Artificial Intelligence (AAAI).
[21] (2013) Learning by associating ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] (2015) Solving the partial label learning problem: an instance-based approach. In International Joint Conference on Artificial Intelligence (IJCAI).
[23] (2016) Partial label learning via feature-aware disambiguation. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).