1 Introduction
Overparameterized deep neural networks owe much of their popularity to their ability to (nearly) perfectly memorize large numbers of training examples, and this memorization is known to decrease the generalization error Feldman (2020). On the other hand, scaling up the acquisition of training examples inevitably introduces non-fully supervised data annotation, a typical example of which is the partial label Nguyen and Caruana (2008); Cour et al. (2011); Zhang et al. (2016, 2017); Feng and An (2018); Xu et al. (2019); Yao et al. (2020b); Lv et al. (2020); Feng et al. (2020b); Wen et al. (2021): a partial label for an instance is a set of candidate labels in which a fixed but unknown candidate is the true label. Partial label learning (PLL) trains multiclass classifiers from instances that are associated with partial labels. When PLL resorts to deep learning, some technique is therefore needed to prevent the network from memorizing the false candidate labels; unfortunately, empirical evidence has shown that general-purpose regularization cannot achieve that goal Lv et al. (2021).
A large number of deep PLL algorithms have recently emerged that design regularizers Yao et al. (2020a, b); Lyu et al. (2022) or network architectures Wang et al. (2022) for PLL data. Further, some PLL works provide theoretical guarantees while keeping their methods compatible with deep networks Lv et al. (2020); Feng et al. (2020b); Wen et al. (2021); Wu and Sugiyama (2021). We observe that these existing theoretical works focus on the instance-independent setting, where the generation process of partial labels is homogeneous across training examples. With an explicit formulation of the generation process, the asymptotic consistency Mohri et al. (2018) of the methods, namely whether the classifier learned from partial labels approximates the Bayes optimal classifier, can be analyzed.
However, the instance-independent process cannot model the real world well, since data labeling is prone to different levels of error in tasks of varying difficulty. Intuitively, instance-dependent (ID) partial labels are more realistic, as poor-quality or ambiguous instances are more difficult to label with an exact true label. Although the instance-independent setting has been extensively studied and has enabled many practical and theoretical advances in PLL, relatively little attention has been paid to the practically relevant setting of ID partial labels. Very recently, one solution has been proposed Xu et al. (2021) that learns directly from ID partial labels; nevertheless, it is still unclear in theory whether the learned classifier is good. Motivated by the above observations, we set out to investigate ID PLL with the aim of proposing a learning approach that is model-independent and of theoretically explaining when and why the proposed method works.
In this paper, we propose PrOgressive Purification (Pop), a theoretically grounded PLL framework for ID partial labels. Specifically, we use the observed partial labels to pretrain a randomly initialized classifier (deep network) for several epochs, and then we update both the partial labels and the classifier for the remaining epochs. In each epoch, we purify each partial label by removing the candidate labels that the current classifier is highly confident are incorrect, and we then train the classifier with the purified partial labels in the next epoch. As a consequence, the false candidate labels are gradually sifted out and the classification performance of the classifier improves. We justify Pop and outline the main contributions below:

We propose a novel approach named Pop for the ID PLL problem, which purifies the partial labels and refines the classifier iteratively. Extensive experiments validate the effectiveness of Pop.

We prove that Pop is guaranteed to enlarge the region where the model is reliable at a promising rate and to eventually approximate the Bayes optimal classifier under mild assumptions. The proof does not rely on the instance-independent assumption. To the best of our knowledge, this is the first theoretically guaranteed approach for the general ID PLL problem.

Pop is flexible with respect to losses, so that losses designed for instance-independent PLL problems can be embedded directly. We empirically show that such embedding allows advanced PLL losses to be applied to the ID problem and to achieve state-of-the-art learning performance.
2 Related Work
In this section, we briefly go through the seminal works in PLL, focusing on the theoretical works and discussing the underlying assumptions behind them.
Non-deep PLL. There have been substantial non-deep PLL algorithms since the pioneering work of Jin and Ghahramani (2003). From a practical standpoint, they have been studied along two different research routes: the identification-based strategy and the average-based strategy. The identification-based strategy purifies each partial label and extracts the true label heuristically in the training phase, so as to identify the true labels Chen et al. (2014); Zhang et al. (2016); Tang and Zhang (2017); Feng and An (2019); Xu et al. (2019). In contrast, the average-based strategy treats all candidates equally Hüllermeier and Beringer (2006); Cour et al. (2011); Zhang and Yu (2015). On the theoretical side, Liu and Dietterich (2012) analyzed the learnability of PLL under a small ambiguity degree condition, which ensures that classification errors on any instance have a probability of being detected. Cour et al. (2011) proposed a consistent approach under the small ambiguity degree condition and a dominance assumption on the data distribution (Proposition 5 in Cour et al. (2011)). Liu and Dietterich (2012) proposed a Logistic Stick-Breaking Conditional Multinomial Model to portray the mapping between instances and true labels while assuming that the generation of the partial label is independent of the instance itself. It should be noted that the vast majority of non-deep PLL works have only empirically verified the performance of their algorithms on small data sets, without formalizing a statistical model for the PLL problem, let alone analyzing theoretically when and why the algorithms work.
Deep PLL. In recent years, deep learning has been applied to PLL and has greatly advanced its practical application. Yao et al. (2020a, b) and Lv et al. (2020) proposed learning objectives that are compatible with stochastic optimization and can thus be implemented by deep networks. Soon after, Feng et al. (2020b) formalized the first generation process for PLL. They assumed that, given the latent true label, the probability of all incorrect labels being added into the candidate label set is uniform and independent of the instance. Thanks to the uniform generation process, they proposed two provably consistent algorithms. Wen et al. (2021) extended the uniform process to the class-dependent case, but still kept the instance-independent assumption unchanged. In addition, a new paradigm called complementary label learning Ishida et al. (2017); Yu et al. (2018); Ishida et al. (2019); Feng et al. (2020a) has been proposed that learns from instances equipped with a complementary label. A complementary label specifies the classes to which the instance does not belong, so it can be considered an inverted PLL problem. However, all of these works made the instance-independent assumption when analyzing statistical consistency. Wu and Sugiyama (2021) proposed a framework that unifies the formalization of multiple generation processes under the instance-independent assumption.
Very recently, researchers have begun to notice a more general setting, ID PLL. Learning with ID partial labels is challenging, and instance-independent approaches cannot handle the ID PLL problem directly. Specifically, the theoretical approaches mentioned above mainly use the loss correction technique, which corrects the prediction or the loss of the classifier using prior or estimated knowledge of the data generation process, i.e., a set of parameters controlling the probability of generating incorrect candidate labels, often called the transition matrix Patrini et al. (2017). The transition matrix can be fixed in the instance-independent setting since it does not need to include instance-level information, a condition that does not hold in ID PLL. Furthermore, it is ill-posed to estimate the transition matrix by exploiting only partially labeled data, i.e., the transition matrix is unidentifiable Xia et al. (2020). Therefore, new methods are needed to tackle this issue. Xu et al. (2021) introduced a solution that infers the latent label posterior via variational inference methods Blei et al. (2017); nevertheless, its effectiveness can hardly be guaranteed. In this paper, we propose Pop for the ID PLL problem and theoretically prove that the learned classifier approximates the Bayes optimal classifier well.
3 Proposed Method
3.1 Preliminaries
First of all, we briefly introduce some necessary notations. Consider a multiclass classification problem of classes. Let be the dimensional instance space and be the label space with class labels. In supervised learning, let be the underlying “clean” distribution generating from which i.i.d. samples are drawn.
In PLL, there is a partial label space and the PLL training set is sampled independently and identically from a “corrupted” density over . It is generally assumed that and have the same marginal distribution of instances . The generation process of partial labels can thus be formalized as . We define the probability that, given the instance and its class label , label is included in its partial label as the flipping probability:
The key assumption in PLL is that the latent true label of an instance is always one of its candidate labels, i.e., .
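The key assumption above can be made concrete with a small sketch. It represents each partial label as a Python set of class indices and checks that the true label lies inside it; the helper names and the toy data are hypothetical and serve only as an illustration, not as the authors' code.

```python
from typing import List, Set


def is_valid_partial_label(candidates: Set[int], true_label: int) -> bool:
    """PLL's key assumption: the latent true label is among the candidates."""
    return true_label in candidates


def average_candidate_count(partial_labels: List[Set[int]]) -> float:
    """Average number of candidate labels per instance (avg. #CLs)."""
    return sum(len(s) for s in partial_labels) / len(partial_labels)


# Toy data (hypothetical): three instances over the four classes {0, 1, 2, 3}.
partial_labels = [{0, 2}, {1}, {1, 2, 3}]
true_labels = [0, 1, 3]

all_valid = all(
    is_valid_partial_label(s, y) for s, y in zip(partial_labels, true_labels)
)
avg_cls = average_candidate_count(partial_labels)
```

A partial label with a single candidate, such as `{1}` above, is an ordinary supervised label, so the representation covers both cases.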
We consider deep models, aided by an inverse link function Reid and Williamson (2010) where denotes the dimensional simplex (for example, the softmax), as the learning model in this paper. The goal of supervised multiclass classification and of PLL is then the same: a scoring function that can make correct predictions on unseen inputs. Typically, the classifier takes the form:
The Bayes optimal classifier (learned using supervised data) is the one that minimizes the risk w.r.t. the 0-1 loss (or some classification-calibrated loss Bartlett et al. (2006)), i.e.,
In this case, the scoring function recovers the class-posterior probabilities, i.e., . When the available supervision is a partial label, the PLL risk under w.r.t. a suitable PLL loss is defined as
Minimizing induces the classifier and it is desirable that the minimizer approach . In addition, let be the class label with the second highest posterior probability among all labels.
3.2 Overview
In the latter part of this section, we introduce the concept of a pure level set, the region where the model is reliable. We prove that, given a tiny reliable region, one can progressively enlarge this region and improve the model at a sufficient rate by disambiguating the partial labels. Motivated by these theoretical results, we propose Pop, an approach that progressively purifies the partial labels by removing false candidate labels, so that the learned classifier eventually approximates the Bayes optimal classifier.
Pop uses the observed partial labels to pretrain a randomly initialized classifier for several epochs, and then updates both the partial labels and the classifier for the remaining epochs. We start with a warm-up period, in which we train the predictive model with a well-defined PLL loss Lv et al. (2020). This allows us to attain a reasonable predictive model before it starts fitting incorrect labels Zhang et al. After the warm-up period, we iteratively purify each partial label by removing the candidate labels that the current classifier is highly confident are incorrect, and we then train the classifier with the purified partial labels in the next epoch. After the model has been fully trained, it can make predictions for unseen instances.
3.3 Theoretical Analysis
We assume that the hypothesis class is sufficiently complex (a condition that deep networks can meet), such that the approximation error equals zero, i.e., . Then the gap between the learned classifier and the Bayes optimal classifier is determined by the ambiguity caused by partial labels. Besides, for two instances and that satisfy , the indicator function equals 1 if the more confident point is inconsistent with the Bayes classifier. Then, the gap between and should be controlled by the risk at point . Therefore, we assume that there exist constants , , such that for ,
(1) 
In addition, for the probability density function of the margin , we assume that there exist constants , such that . Then, the worst-case density-imbalance ratio is denoted by .
Motivated by the pure level set in binary classification Zhang et al. (2021), we define the pure level set in PLL, i.e., the region where the model is reliable:
Definition 1
(Pure level set). A set is pure for if for all .
Assume that there exists a boundary for all which satisfies and , we have
(2) 
which means that there is a tiny region where the model is reliable.
Let be the new boundary and . As the probability density function of the margin is bounded by , we have the following result for that satisfies (more details can be found in Appendix A.1):
(3)  
Combining Eq. (1) and Eq. (3), there is
(4) 
Denote by the label with the highest posterior probability for the current prediction. If , we have (more details can be found in Appendix A.2):
(5) 
which means that the label is an incorrect label. Therefore, we can remove the label from the candidate label set to disambiguate the partial label, and then refine the learning model with the less ambiguous partial label. In this way, we move one step forward by trusting the model within the tiny reliable region, as stated in the following theorem.
Theorem 1
Assume that there exists an boundary with for an such that for all which satisfy , . For each and and , if , we move out label from the candidate label set and then update the candidate label set as . Then the new classifier is trained on the updated data with the new distribution . Let be the minimum boundary that is pure for for . Then, we have
The detailed proof can be found in Appendix A.1. Theorem 1 shows that the purified region would be enlarged by at least a constant factor with the given purification strategy.
As the flipping probability of the incorrect label in the ID generation process is related to its posterior probability, we assume that there exists a constant such that:
(6) 
Then we prove that if there exists a pure level set for an initialized model, our proposed approach can purify incorrect labels and the classifier will finally match the Bayes optimal classifier after sufficient rounds under a reasonable hyperparameter setting .
Theorem 2
For any flipping probability of each incorrect label , define . And for a given function there exists a level set which is pure for . If one runs purification in Theorem 1 starting with and the initialization: (1) , (2) , (3), then we have:
The proof of Theorem 2 is provided in Appendix A.3. Theorem 2 shows that the classifier can be guaranteed to eventually approximate the Bayes optimal classifier.
According to Theorems 1 and 2, we could progressively purify the observed partial labels by moving out the incorrect candidate labels with the continuously strict bound, and subsequently train an effective classifier with the purified labels. Therefore, the proposed method Pop in the following subsection is designed along these lines.
3.4 The Pop Method
3.4.1 Warm-up Training
We start with a warm-up period, in which the classifier is trained on the original partial labels so that it attains reasonable outputs before fitting label noise Zhang et al. The predictive model is trained on partially labeled examples by minimizing the following PLL loss function Lv et al. (2020):
(7)
Here, is the cross-entropy loss and the weight is initialized with uniform weights:
(8) 
and is then updated using the current predictions, putting slightly more weight on the more probable labels Lv et al. (2020):
(9) 
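The warm-up weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: following the description attributed to Lv et al. (2020), it assumes that the weights start uniform over the candidate labels, that they are later renormalized from the predicted probabilities restricted to the candidate set, and that the loss is a weighted sum of per-label cross-entropy terms. All function names are hypothetical.

```python
import math
from typing import List, Set


def uniform_weights(candidates: Set[int], num_classes: int) -> List[float]:
    """Initial weights: 1/|S| on each candidate label, 0 elsewhere (assumed form)."""
    return [1.0 / len(candidates) if j in candidates else 0.0
            for j in range(num_classes)]


def update_weights(probs: List[float], candidates: Set[int]) -> List[float]:
    """Renormalize the predicted probabilities over the candidate set (assumed form)."""
    total = sum(probs[j] for j in candidates)
    return [probs[j] / total if j in candidates else 0.0
            for j in range(len(probs))]


def weighted_cross_entropy(probs: List[float], weights: List[float]) -> float:
    """Weighted sum of per-label cross-entropy terms; zero-weight labels are skipped."""
    return -sum(w * math.log(p) for w, p in zip(weights, probs) if w > 0.0)


# Toy example (hypothetical numbers): four classes, candidates {0, 2}.
candidates = {0, 2}
probs = [0.5, 0.25, 0.125, 0.125]   # softmax output of the current model
w0 = uniform_weights(candidates, num_classes=4)
w1 = update_weights(probs, candidates)
loss = weighted_cross_entropy(probs, w1)
```

After the update, label 0 carries more weight than label 2 because the model currently assigns it a higher probability, which matches the idea of putting more weight on the more probable candidates.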
3.4.2 Progressive Purification
After the warm-up period, the classifier can be employed for purification. According to Theorem 1, we can progressively remove incorrect candidate labels under a progressively stricter bound, and subsequently train an effective classifier with the purified labels. Specifically, we set a high threshold and calculate the difference for each candidate label. If a label for satisfies , we remove it from the candidate label set and update the candidate label set.
If no partial label is purified, we decrease the threshold and continue the purification to improve the training of the model. In this way, the incorrect candidate labels are progressively removed from the partial labels round by round, and the performance of the classifier is continuously improved. According to Theorem 2, the learned classifier will eventually be consistent with the Bayes optimal classifier. The algorithmic description of Pop is shown in Algorithm 1.
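One purification round, as described above, might look like the sketch below. Several details are assumptions of this illustration rather than statements from the text: the purification criterion is taken to be the gap between the highest predicted probability and the candidate's probability reaching the threshold, the threshold is decreased multiplicatively by a hypothetical `decay` factor, and a safeguard that never empties a candidate set is added. Model retraining between rounds is omitted.

```python
from typing import List, Set, Tuple


def purify(probs: List[float], candidates: Set[int], threshold: float) -> Set[int]:
    """Drop candidates whose probability trails the top prediction by >= threshold."""
    top = max(probs)  # confidence of the currently predicted class
    kept = {j for j in candidates if top - probs[j] < threshold}
    # Safeguard (an assumption of this sketch): never return an empty candidate set.
    return kept if kept else set(candidates)


def purification_round(
    all_probs: List[List[float]],
    partial_labels: List[Set[int]],
    threshold: float,
    decay: float = 0.2,  # hypothetical schedule for lowering the threshold
) -> Tuple[List[Set[int]], float]:
    """Purify every partial label; lower the threshold if nothing was removed."""
    new_labels = [purify(p, s, threshold) for p, s in zip(all_probs, partial_labels)]
    if new_labels == partial_labels:
        threshold *= decay
    return new_labels, threshold


# Toy example (hypothetical numbers); in practice the probabilities would be
# recomputed by the retrained classifier before each round.
probs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25]]
labels = [{0, 1}, {0, 1, 2}]
labels_1, thr_1 = purification_round(probs, labels, threshold=0.5)
labels_2, thr_2 = purification_round(probs, labels_1, thr_1)
labels_3, thr_3 = purification_round(probs, labels_2, thr_2)
```

In this toy run, the first round removes one confident false candidate, the second round removes nothing and therefore lowers the threshold, and the third round removes a further candidate under the lower threshold.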
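Table 1: Characteristics of the benchmark datasets corrupted by the ID generation process.
Dataset | #Train | #Test | #Features | #Class Labels | avg. #CLs
CUB-200 | 3619 | 2414 | 150,258 | 200 | 8.71
Fashion-MNIST | 60,000 | 10,000 | 784 | 10 | 4.61
Kuzushiji-MNIST | 60,000 | 10,000 | 784 | 10 | 4.34
CIFAR-10 | 50,000 | 10,000 | 3,072 | 10 | 2.74
CIFAR-100 | 50,000 | 10,000 | 3,072 | 100 | 5.50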
Table 2: Characteristics of the real-world PLL datasets.
Dataset | #Train | #Test | #Features | #Class Labels | avg. #CLs | Task Domain
Lost | 898 | 224 | 108 | 16 | 2.23 | automatic face naming Cour et al. (2011)
MSRCv2 | 1,406 | 352 | 48 | 23 | 3.16 | object classification Liu and Dietterich (2012)
BirdSong | 3,998 | 1,000 | 38 | 13 | 2.18 | bird song classification Briggs et al. (2012)
Soccer Player | 13,978 | 3,494 | 279 | 171 | 2.09 | automatic face naming Zeng et al. (2013)
Yahoo! News | 18,393 | 4,598 | 163 | 219 | 1.91 | automatic face naming Guillaumin et al. (2010)
4 Experiments
4.1 Datasets
We adopt five widely used benchmark datasets: CUB-200 Welinder et al. (2010), Kuzushiji-MNIST Clanuwat et al. (2018), Fashion-MNIST Xiao et al. (2017), CIFAR-10 Krizhevsky and Hinton (2009), and CIFAR-100 Krizhevsky and Hinton (2009). These datasets are manually corrupted into ID partially labeled versions. Specifically, we set the flipping probability of each incorrect label corresponding to an instance by using the confidence prediction of a neural network trained using supervised data parameterized by Xu et al. (2021). The flipping probability , where is the set of all labels except for the true label of . The average number of candidate labels (avg. #CLs) for each benchmark dataset corrupted by the ID generation process is recorded in Table 1.
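The ID corruption procedure described above can be sketched as follows. The exact formula is not reproduced in this text, so the sketch makes a loudly stated assumption: the flipping probability of each incorrect label is the clean network's confidence for that label divided by the largest confidence among the incorrect labels, which is our reading of the procedure attributed to Xu et al. (2021). The function names and the toy confidences are hypothetical.

```python
import random
from typing import List, Set


def flipping_probabilities(confidences: List[float], true_label: int) -> List[float]:
    """Assumed rule: confidence of label j divided by the max confidence over
    incorrect labels; the true label is always included (probability 1).
    Assumes strictly positive confidences, e.g. softmax outputs."""
    incorrect = [j for j in range(len(confidences)) if j != true_label]
    max_conf = max(confidences[j] for j in incorrect)
    return [1.0 if j == true_label else confidences[j] / max_conf
            for j in range(len(confidences))]


def sample_partial_label(
    confidences: List[float], true_label: int, rng: random.Random
) -> Set[int]:
    """Draw each label into the candidate set independently with its flipping probability."""
    xi = flipping_probabilities(confidences, true_label)
    return {j for j, p in enumerate(xi) if rng.random() < p}


# Toy example (hypothetical confidences of a network trained on clean labels).
confidences = [0.7, 0.2, 0.1]
xi = flipping_probabilities(confidences, true_label=0)
candidate_set = sample_partial_label(confidences, true_label=0, rng=random.Random(0))
```

Under this assumed rule, the most confusable incorrect label is always added to the candidate set, so every generated partial label contains at least two candidates.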
In addition, five real-world PLL datasets collected from different application domains are used, including Lost Cour et al. (2011), Soccer Player Zeng et al. (2013), Yahoo! News Guillaumin et al. (2010), MSRCv2 Liu and Dietterich (2012), and BirdSong Briggs et al. (2012). The average number of candidate labels (avg. #CLs) for each real-world PLL dataset is recorded in Table 2.
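Table 3: Classification accuracy (mean±std) of each approach on the benchmark datasets corrupted by the ID generation process.
Method | CUB-200 | Kuzushiji-MNIST | Fashion-MNIST | CIFAR-10 | CIFAR-100
Pop | 45.68±0.12% | 88.70±0.02% | 87.62±0.04% | 79.00±0.28% | 57.68±0.14%
Valen | 0.59% | 87.95±0.08% | 87.20±0.18% | 77.71±0.35% | 55.60±0.24%
Proden | 40.66±0.11% | 87.60±0.23% | 87.21±0.11% | 76.77±0.63% | 55.12±0.12%
RC | 41.86±0.71% | 87.25±0.06% | 87.06±0.14% | 76.49±0.52% | 55.18±0.70%
CC | 22.16±0.55% | 83.31±0.07% | 86.01±0.13% | 72.87±0.82% | 55.56±0.23%
Lw | 19.00±0.16% | 84.46±0.22% | 86.25±0.01% | 46.77±0.66% | 48.00±0.16%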
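Table 4: Classification accuracy (mean±std) of each approach on the real-world PLL datasets.
Method | Lost | BirdSong | MSRCv2 | Soccer Player | Yahoo! News
Pop | 78.57±0.45% | 74.47±0.36% | 45.86±0.28% | 54.48±0.10% | 66.38±0.07%
Valen | 76.87±0.86% | 73.39±0.26% | 49.97±0.43% | 55.81±0.10% | 66.26±0.13%
Proden | 76.47±0.25% | 73.44±0.12% | 45.10±0.16% | 54.05±0.15% | 66.14±0.10%
RC | 76.26±0.46% | 69.33±0.32% | 49.47±0.43% | 56.02±0.59% | 63.51±0.20%
CC | 63.54±0.25% | 69.90±0.58% | 41.50±0.44% | 49.07±0.36% | 54.86±0.48%
Lw | 73.13±0.32% | 51.45±0.26% | 49.85±0.49% | 50.24±0.45% | 48.21±0.29%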
4.2 Baselines
The performance of Pop is compared against five deep PLL approaches:

Proden Lv et al. (2020): A progressive identification approach which approximately minimizes a risk estimator and identifies the true labels in a seamless manner;

RC Feng et al. (2020b): A risk-consistent approach which employs the loss correction strategy to establish the true risk by only using the partially labeled data;

CC Feng et al. (2020b): A classifier-consistent approach which also uses the loss correction strategy to learn the classifier that approaches the optimal one;

Valen Xu et al. (2021): An ID PLL approach which recovers the latent label distribution via variational inference methods;

Lw Wen et al. (2021): A risk-consistent approach which proposes a leveraged weighted loss to trade off the losses on candidate labels and non-candidate ones.
For all the deep approaches, we used the same training/validation setting, models, and optimizer for fair comparisons. Specifically, a 3-layer MLP is trained on Kuzushiji-MNIST and Fashion-MNIST, the 32-layer ResNet He et al. (2016) is trained on CIFAR-10, the 12-layer ConvNet Han et al. (2018) is trained on CIFAR-100, a pretrained ResNet18 and a pretrained ResNet34 are trained on FLOWER102 and CUB-200, and the linear model is trained on the real-world PLL datasets, respectively. The hyperparameters are selected so as to maximize the accuracy on a validation set (10% of the training set). We run 5 trials on the benchmark datasets and the real-world PLL datasets. The mean accuracy as well as the standard deviation are recorded for all compared approaches. All the compared methods are implemented with PyTorch.
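The reporting convention described above (mean accuracy and standard deviation over 5 trials) can be illustrated with a short sketch. The trial accuracies below are made-up numbers, the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics
from typing import List


def summarize_trials(accuracies: List[float]) -> str:
    """Format per-trial accuracies (in percent) as 'mean±std%'."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation; an assumption of this sketch.
    std = statistics.stdev(accuracies)
    return f"{mean:.2f}±{std:.2f}%"


# Hypothetical accuracies from 5 trials of one method on one dataset.
trial_accuracies = [79.0, 79.2, 78.8, 79.1, 78.9]
summary = summarize_trials(trial_accuracies)
```

Swapping `statistics.stdev` for `statistics.pstdev` would give the population standard deviation instead, which is slightly smaller for the same five trials.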
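Table 5: Classification accuracy (mean±std) of each instance-independent PLL method and its variant integrated with Pop on the benchmark datasets corrupted by the ID generation process.
Method | CUB-200 | Kuzushiji-MNIST | Fashion-MNIST | CIFAR-10 | CIFAR-100
Proden | 40.66±0.11% | 87.60±0.23% | 87.21±0.11% | 76.77±0.63% | 55.12±0.12%
Proden+Pop | 45.68±0.12% | 88.70±0.02% | 87.62±0.04% | 79.00±0.28% | 57.68±0.14%
RC | 41.86±0.71% | 87.25±0.06% | 87.06±0.14% | 76.49±0.52% | 55.18±0.70%
RC+Pop | 46.02±0.66% | 87.78±0.09% | 87.45±0.05% | 78.89±0.17% | 57.66±0.11%
CC | 22.16±0.55% | 83.31±0.07% | 86.01±0.13% | 72.87±0.82% | 55.56±0.23%
CC+Pop | 22.74±0.19% | 83.98±0.10% | 86.32±0.06% | 77.03±0.58% | 56.18±0.06%
Lw | 19.00±0.16% | 84.46±0.22% | 86.25±0.01% | 46.77±0.66% | 48.00±0.16%
Lw+Pop | 19.55±0.11% | 84.71±0.07% | 86.40±0.05% | 48.54±0.04% | 49.61±0.27%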
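Table 6: Classification accuracy (mean±std) of each instance-independent PLL method and its variant integrated with Pop on the real-world PLL datasets.
Method | Lost | BirdSong | MSRCv2 | Soccer Player | Yahoo! News
Proden | 76.47±0.25% | 73.44±0.12% | 45.10±0.16% | 54.05±0.15% | 66.14±0.10%
Proden+Pop | 78.57±0.45% | 74.47±0.36% | 45.86±0.28% | 54.48±0.10% | 66.38±0.07%
RC | 76.26±0.46% | 69.33±0.32% | 49.47±0.43% | 56.02±0.59% | 63.51±0.20%
RC+Pop | 78.56±0.45% | 70.77±0.26% | 51.18±0.59% | 56.49±0.03% | 63.86±0.22%
CC | 63.54±0.25% | 69.90±0.58% | 41.50±0.44% | 49.07±0.36% | 54.86±0.48%
CC+Pop | 65.47±0.93% | 71.50±0.06% | 43.21±0.43% | 49.36±0.02% | 55.22±0.05%
Lw | 73.13±0.32% | 51.45±0.26% | 49.85±0.49% | 50.24±0.45% | 48.21±0.29%
Lw+Pop | 75.30±0.26% | 52.35±0.26% | 52.42±0.86% | 50.94±0.47% | 48.6±0.12%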
4.3 Experimental Results
Table 3 and Table 4 report the classification accuracy of each approach on the benchmark datasets corrupted by the ID generation process and on the real-world PLL datasets, respectively. The best results are highlighted in bold. We observe that Pop achieves the best performance in most cases and that its advantage over the compared approaches is stable as the number of candidate labels varies.
In addition, to analyze the purified region in Theorem 1, we employ the confidence predictions of (the network in Section 4.1) as the posterior and plot the curve of the estimated purified region in every epoch on Lost in Figure 2. Although the estimated purified region may not be accurate enough, the curve shows a continuously increasing trend for the purified region.
4.4 Further Analysis
As the Pop framework is flexible with respect to the loss function, we integrate the proposed method with previous methods for instance-independent PLL, including Proden, RC, CC and Lw. In this subsection, we empirically show that these methods achieve better performance after being integrated with Pop.
Table 5 and Table 6 report the classification accuracy of each method for instance-independent PLL and of its variant integrated with Pop on the benchmark datasets corrupted by the ID generation process and on the real-world datasets, respectively. As shown in Table 5 and Table 6, the approaches integrated with Pop, including Proden+Pop, RC+Pop, CC+Pop and Lw+Pop, outperform the original methods, which validates the usefulness of the Pop framework for improving performance in ID PLL.
Figure 3 illustrates how the variants integrated with Pop perform under different hyperparameter configurations on CIFAR-10; similar observations are made on other data sets. The hyperparameter sensitivity on other datasets can be found in Appendix A.4. As shown in Figure 3, the performance of the variants integrated with Pop is relatively stable across a broad range of each hyperparameter. This property is desirable, as it allows the Pop framework to achieve robust classification performance.
5 Conclusion
In this paper, we study the problem of partial label learning and propose a novel approach, Pop. We consider ID partial label learning and propose a theoretically guaranteed approach, which trains the classifier with progressive purification of the candidate labels and is theoretically guaranteed to eventually approximate the Bayes optimal classifier for ID PLL. Specifically, we first purify the candidate labels of the data for which the classifier's predictions are highly confident. Next, the model is improved using the purified labels. We then continue alternating between purification and model training until convergence. In addition, we prove that the model improves at a sufficient rate through the iterations and is eventually consistent with the Bayes optimal classifier. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed method.
If PLL methods become very effective, the need for exactly annotated data would be significantly reduced. As a result, the employment of data annotators might be decreased which could lead to a negative societal impact.
References
 [1] (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 (473), pp. 138?156. Cited by: §3.1.
 [2] (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §2.
 [3] (2012) Rankloss support instance machines for miml instance annotation. In Proceedings of 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12), pp. 534–542. Cited by: Table 2, §4.1, §6.4.
 [4] (2014) Ambiguously labeled learning using dictionaries. IEEE Transactions on Information Forensics and Security 9 (12), pp. 2076–2088. Cited by: §2.
 [5] (2018) Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718. Cited by: §4.1, §6.4.

[6]
(2011)
Learning from partial labels.
Journal of Machine Learning Research
12 (5), pp. 1501–1536. Cited by: §1, §2, Table 2, §4.1, §6.4. 
[7]
(2020)
Does learning require memorization? a short tale about a long tail.
In
Proccedings of the 52nd Annual ACM Symposium on Theory of Computing (STOC’20)
, pp. 954–959. Cited by: §1. 
[8]
(2018)
Leveraging latent label distributions for partial label learning..
In
Proceedings of 27th International Joint Conference on Artificial Intelligence (IJCAI’18)
, pp. 2107–2113. Cited by: §1.  [9] (2019) Partial label learning with selfguided retraining. In Proceedings of 33rd AAAI Conference on Artificial Intelligence (AAAI’19), pp. 3542–3549. Cited by: §2.
 [10] (2020) Learning with multiple complementary labels. In Proceedings of 37th International Conference on Machine Learning (ICML’20), pp. 3072–3081. Cited by: §2.
 [11] (2020) Provably consistent partiallabel learning. In Advances in Neural Information Processing Systems 33 (NeurIPS’20), pp. 10948–10960. Cited by: §1, §1, §2, 2nd item, 3rd item.

[12]
(2010)
Multiple instance metric learning from automatically labeled bags of faces.
In
Proceedings of 11th European Conference on Computer Vision (ECCV’10)
, Vol. 6311, pp. 634–647. Cited by: Table 2, §4.1, §6.4.  [13] (2018) Coteaching: robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems 31 (NeurIPS’18), pp. 8527–8537. Cited by: §4.2.

[14]
(2016)
Deep residual learning for image recognition.
In
Proceedings of 29th IEEE conference on Computer Vision and Pattern Recognition (CVPR’16)
, pp. 770–778. Cited by: §4.2.  [15] (2006) Learning from ambiguously labeled examples. Intelligent Data Analysis 10 (5), pp. 419–439. Cited by: §2.
 [16] (2017) Learning from complementary labels. In Advances in Neural Information Processing Systems 30 (NeurIPS’17), pp. 5639–5649. Cited by: §2.
 [17] (2019) Complementarylabel learning for arbitrary losses and models. In Proceedings of 36th International Conference on Machine Learning (ICML’19), pp. 2971–2980. Cited by: §2.
 [18] (2003) Learning with multiple labels. In Advances in Neural Information Processing Systems 16 (NeurIPS’03), pp. 921–928. Cited by: §2.
 [19] (2009) Learning multiple layers of features from tiny images. Cited by: §4.1, §6.4.
 [20] (2012) A conditional multinomial mixture model for superset label learning. In Advances in Neural Information Processing Systems 25 (NIPS’12), pp. 548–556. Cited by: §2, Table 2, §4.1, §6.4.
 [21] (2021) On the robustness of average losses for partiallabel learning. arXiv preprint arXiv:2106.06152. Cited by: §1.
 [22] (2020) Progressive identification of true labels for partiallabel learning. In Proceedings of 37th International Conference on Machine Learning (ICML’20), pp. 6500–6510. Cited by: §1, §1, §2, §3.2, §3.4.1, 1st item.
 [23] (2022) Partial label learning by semantic difference maximization. In Proceedings of 31st International Joint Conference on Artificial Intelligence, Cited by: §1.
 [24] (2018) Foundations of machine learning. MIT press. Cited by: §1.
 [25] (2008) Classification with partial labels. In Proceedings of 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’08), pp. 551–559. Cited by: §1.
 [26] (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), pp. 1944–1952. Cited by: §2.
 [27] (2010) Composite binary losses. The Journal of Machine Learning Research 11, pp. 2387–2422. Cited by: §3.1.
 [28] (2017) Confidencerated discriminative partial label learning. In Proceedings of 31st AAAI Conference on Artificial Intelligence (AAAI’17), pp. 2611–2617. Cited by: §2.
 [29] (2022) PiCO: contrastive label disambiguation for partial label learning. In Proceedings of 10th International Conference on Learning Representations (ICLR’22), Cited by: §1.
 [30] (2010) Caltechucsd birds 200. Cited by: §4.1, §6.4.
 [31] (2021) Leveraged weighted loss for partial label learning. In Proceedings of 36th International Conference on Machine Learning (ICML’21), pp. 11091–11100. Cited by: §1, §1, §2, 5th item.
 [32] (2021) Learning with proper partial labels. arXiv preprint arXiv:2112.12303. Cited by: §1, §2.
 [33] (2020) Part-dependent label noise: towards instance-dependent label noise. In Advances in Neural Information Processing Systems 33 (NeurIPS’20), pp. 7597–7610. Cited by: §2.
 [34] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1, §6.4.
 [35] (2019) Partial label learning via label enhancement. In Proceedings of 33rd AAAI Conference on Artificial Intelligence (AAAI’19), pp. 5557–5564. Cited by: §1, §2.
 [36] (2021) Instance-dependent partial label learning. In Advances in Neural Information Processing Systems 34 (NeurIPS’21), Cited by: §1, §2, §4.1.
 [37] (2020) Deep discriminative CNN with temporal ensembling for ambiguously-labeled image classification. In Proceedings of 34th AAAI Conference on Artificial Intelligence (AAAI’20), pp. 12669–12676. Cited by: §1, §2, 4th item.
 [38] (2020) Network cooperation with progressive disambiguation for partial label learning. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), pp. 471–488. Cited by: §1, §1, §2.
 [39] (2018) Learning with biased complementary labels. In Proceedings of 15th European Conference on Computer Vision (ECCV’18), pp. 68–83. Cited by: §2.
 [40] (2013) Learning by associating ambiguously labeled images. In Proceedings of 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), pp. 708–715. Cited by: Table 2, §4.1, §6.4.
 [41] (2017) Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, Toulon, France. Cited by: §3.2, §3.4.1.
 [42] (2017) Disambiguation-free partial label learning. IEEE Transactions on Knowledge and Data Engineering 29 (10), pp. 2155–2167. Cited by: §1.
 [43] (2015) Solving the partial label learning problem: an instance-based approach. In Proceedings of 24th International Joint Conference on Artificial Intelligence (IJCAI’15), pp. 4048–4054. Cited by: §2.
 [44] (2016) Partial label learning via feature-aware disambiguation. In Proceedings of 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), pp. 1335–1344. Cited by: §1, §2.
 [45] (2021) Learning with feature-dependent label noise: a progressive approach. In Proceedings of 9th International Conference on Learning Representations (ICLR’21), Cited by: §3.3.
6 Appendix
6.1 Proofs of Theorem 1
Assume that there exists a boundary for all which satisfies and . Then we have
(10) 
Let be the new boundary and . As the probability density function of the margin is bounded by , we have the following result for that satisfies (footnote ‡: details of Eq. (3) in the paper submission):
(11)  
Since holds, we can further relax Eq. (11) as follows:
(12)  
Then the assumption that the gap between and is controlled by the risk at point implies:
(13)  
Hence, for s.t. , according to Eq. (13) we have
(14)  
which means that has the same label as , and thus the level set is pure for . Meanwhile, the choice of ensures that
(15)  
This completes the proof of Theorem 1.
6.2 Details of Eq. (5)
If , according to Eq. (13) we have:
(16)  
6.3 Proofs of Theorem 2
To begin with, we prove that there exists at least one level set that is pure for . Considering that satisfies , we have . Under the assumption , it suffices to satisfy to ensure that has the same prediction as when . Since we have , by choosing one can ensure that the initial has a pure level set.
Then, in the remaining iterations, we ensure that the level set stays pure. We decrease by a reasonable factor to avoid incurring too many corrupted labels while ensuring enough progress in label purification, i.e., , such that in the level set we have . This condition ensures the correctness of flipping when . The purified region cannot be improved once , since there is no guarantee that has a label consistent with when and . To get the largest purified region, we can set . Since the probability density function of the margin is bounded by , we have:
(17)  
Then .
It remains to bound the total number of rounds , which follows from the fact that each round of label flipping improves the purified region by a factor of :
(18)  
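The round-counting argument above can be illustrated with a minimal sketch. It assumes only what the proof states qualitatively: the threshold is decreased by a constant factor each round until it reaches a floor. The function name `threshold_schedule` and the parameters `e_start`, `e_end`, and `shrink` are hypothetical placeholders; the constants used in the theorem are not reproduced here.

```python
def threshold_schedule(e_start, e_end, shrink):
    """List the thresholds visited by a geometric decay.

    Hypothetical illustration: e_start, e_end and shrink stand in for the
    initial threshold, the smallest useful threshold, and the per-round
    decrease factor. They are assumptions, not the paper's constants.
    """
    if not 0 < e_end <= e_start:
        raise ValueError("need 0 < e_end <= e_start")
    if not 0 < shrink < 1:
        raise ValueError("need 0 < shrink < 1")
    schedule = []
    e = e_start
    # One entry per purification round; stop once the threshold falls below the floor.
    while e >= e_end:
        schedule.append(e)
        e *= shrink
    return schedule


# Halving the threshold from 1.0 down to 0.125 takes four rounds.
rounds = threshold_schedule(1.0, 0.125, 0.5)
print(rounds, len(rounds))
```

Because the threshold shrinks geometrically, the number of rounds in this sketch grows only logarithmically in the ratio `e_start / e_end`, which matches the flavor of the bound on the total number of rounds.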