1 Introduction
Weakly supervised data (Zhou, 2017) commonly appears in machine learning tasks due to the high cost of collecting a large amount of strongly supervised data. Examples include incomplete supervision, i.e., only a small subset of the training data is given with labels; inexact supervision, i.e., only coarse-grained labels are given; and inaccurate supervision, i.e., the given labels are not always ground-truth (Zhou, 2017). Learning with weakly supervised data plays an important role in various applications, including text categorization (Joachims, 1999), semantic segmentation (Vezhnevets et al., 2012), bioinformatics (Kasabov & Pang, 2004; Huang et al., 2014), medical care (Wang et al., 2017), etc. A large number of efforts have been devoted to weakly supervised learning (WSL) (Zhou, 2017; Chapelle et al., 2006; Settles, 2012; Frénay & Verleysen, 2014), e.g., semi-supervised learning methods (Chapelle et al., 2006), label noise learning methods (Frénay & Verleysen, 2014), etc. Existing methods can roughly be categorized into two classes from the objective perspective: one maximizes the performance gain based on some distribution assumption, and the other maintains performance safeness based on worst-case analysis.

The first objective of WSL focuses on maximizing performance by making assumptions on the true label structure. For example, to maximize learning performance on unlabeled data, transductive support vector machines (TSVMs) (Joachims, 1999) adopt the low-density assumption (Chapelle & Zien, 2005), i.e., the decision boundary should lie in a low-density region. Label propagation algorithms (Zhu et al., 2003) spread soft labels from a few labeled nodes to the whole graph based on manifold assumptions (Zhou et al., 2004). To maximize learning performance with noisy labeled data, one can try to improve the label quality by assuming that mislabeled items can be detected by measures such as classification confidence (Sun et al., 2007), model complexity (Gamberger et al., 2000), or influence functions (Koh & Liang, 2017).

It is generally expected that, in comparison with a supervised algorithm that uses only labeled data, WSL can help improve performance by exploiting additional weakly supervised data. However, it is noteworthy that a bad match between problem structure and model assumption can lead to degradation in learning performance. The fact that WSL does not always help has been observed by multiple researchers (Cozman et al., 2003; Chawla et al., 2005; Chapelle et al., 2006; Van Hulse & Khoshgoftaar, 2009; Frénay & Verleysen, 2014; Li & Zhou, 2015), who pointed out that there are cases in which the use of weakly supervised data may degenerate the performance, making it even worse after WSL.
Therefore, the other objective of WSL, which focuses on maintaining safeness, has attracted significant attention in recent years. (Li & Zhou, 2011, 2015) build safe semi-supervised SVMs by optimizing the worst-case performance gain given a set of candidate low-density separators. (Balsubramani & Freund, 2015) propose to learn a safe prediction given that the ground-truth label assignment is restricted to a specific candidate set. (Guo & Li, 2018) propose a general safe WSL formulation given that the ground-truth can be constructed from a set of base learners, and they optimize the worst-case performance gain. These methods alleviate the performance degeneration problem of WSL by analyzing the worst case of the ground-truth, but such efforts typically yield conservative performance improvement.
However, in many practical tasks we need not only a strong performance improvement but also a guarantee of safeness; that is, we need to take into account both the best case and the worst case of learning performance. Both are indispensable: failing either one is undesirable in real applications and makes the results of weakly supervised learning difficult to rely on.
In this work, we consider the question of how to optimize label quality for WSL such that both the best-case and the worst-case performance are taken into account. Figure 1 illustrates the motivation for this work. Specifically, let $P_{\mathrm{SL}}$ be the performance derived from baseline supervised strategies without using weakly supervised data, and let $P_{\mathrm{WSL}}$ be the performance derived from weakly supervised learning with a correct data assumption. How can we optimize the training set labels such that the learned performance is close to $P_{\mathrm{WSL}}$, while it is never worse than $P_{\mathrm{SL}}$?
In view of this issue, this paper proposes a simple and novel weakly supervised learning framework, PGS (maximize Performance Gain while maintaining Safeness). Our framework uses a small amount of validation data to explicitly guide label quality optimization. Because a validation set is a good approximation for describing the generalization risk, it can provide reliable and powerful guidance for both performance improvement and performance safeness. At the same time, the validation set can effectively avoid the performance problems caused by incorrect data distribution assumptions. This new perspective on weakly supervised learning can be formalized as a novel bilevel optimization (Bard, 2013), where one optimization problem is nested within another, and can be effectively addressed for both convex and non-convex situations. Extensive experimental results clearly confirm that the new framework achieves impressive performance. In summary, our contributions in this work are as follows:

We propose a new weakly supervised learning framework that tries to maximize the performance gain and maintain safeness at the same time.

We propose a reliable solution by formulating the problem as a bilevel optimization with an effective algorithm, justified by a large number of experiments.
In the following, we first review several related WSL frameworks and then present the PGS framework. Next, we report the experiments. Finally, we conclude the paper.
2 Two WSL Frameworks and Limitations
Notations and Setup  Consider a prediction problem from some input space $\mathcal{X}$ (e.g., images) to an output space $\mathcal{Y}$ (e.g., labels). We are given a weakly supervised training data set $D = \{(x_i, \hat{y}_i)\}_{i=1}^{n}$, where the labels $\hat{y}_i$ may be missing or corrupted. Assume the true data distribution is $\mathcal{D}$. Let $h_{\mathrm{SL}}$ be the model derived from baseline supervised learning (SL) methods without WSL, and let $h_{\mathrm{WSL}}$ be the model derived from WSL methods.
Maximize Performance Gain  The first branch of WSL aims to maximize the gain, i.e.,
$$\max \; P_{\mathcal{D}}(h_{\mathrm{WSL}}) - P_{\mathcal{D}}(h_{\mathrm{SL}}) \quad \text{s.t.} \; h_{\mathrm{WSL}} \text{ is learned from the weakly supervised data}, \tag{1}$$
where $P_{\mathcal{D}}(h)$ is the performance of model $h$ on the true data distribution $\mathcal{D}$, which is unknown. It is assumed that the higher the performance, the better.
These methods (Chapelle & Zien, 2005; Sun et al., 2007; Gamberger et al., 2000; Koh & Liang, 2017) make assumptions on the true label structure and aim to maximize the best-case performance, where the best case means the true labels follow the assumptions of these WSL methods. However, mismatched assumptions occur in real applications and cause the performance degeneration problem (Zhou, 2017).
Maintain Performance Safeness  The second branch of WSL aims to maintain safeness, i.e.,
$$\max_{h_{\mathrm{WSL}}} \; \min_{y \in \mathcal{C}} \; P_{y}(h_{\mathrm{WSL}}) - P_{y}(h_{\mathrm{SL}}), \tag{2}$$
where $\mathcal{C}$ is a candidate set for the unknown ground-truth label assignment and $P_{y}(h)$ is the performance of $h$ when $y$ is taken as the ground-truth.
These methods (Li & Zhou, 2011, 2015; Balsubramani & Freund, 2015; Guo & Li, 2018) maximize the worst-case performance gain to maintain safeness, i.e., they consider the true label in the worst case. Such algorithms are guaranteed not to perform worse than supervised learning methods, but they typically yield conservative performance improvement.
3 The PGS Framework
3.1 General Formulation
The proposed PGS framework optimizes label quality for weakly supervised data by taking both the best-case and the worst-case performance into account. PGS maximizes the learning performance of the model trained on the optimized weakly supervised data; meanwhile, PGS requires that the worst-case performance of the model will not be worse than that of the model trained on the raw labels. Thus, the PGS framework can be generally formulated as:
$$\max_{\tilde{y}} \; P_{\mathcal{D}}(h_{\tilde{y}}) \quad \text{s.t.} \; P_{\mathcal{D}}(h_{\tilde{y}}) \ge P_{\mathcal{D}}(h_{\hat{y}}) \; \text{in the worst case}, \tag{3}$$
where $\tilde{y}$ denotes the optimized labels, $h_{\tilde{y}}$ is the model trained on them, and $h_{\hat{y}}$ is the model trained on the raw labels $\hat{y}$.
Figure 2 illustrates the PGS framework.
3.2 Determine missing or corrupted labels robustly
In reality, the true label is unavailable. To overcome this issue, PGS introduces a small clean unbiased validation set of size $m$ to approximate the true data distribution. The effectiveness of a similar idea has recently been demonstrated in hyperparameter tuning (Maclaurin et al., 2015; Ravi & Larochelle, 2017; Ren et al., 2018a; Franceschi et al., 2017, 2018), meta-learning (Ren et al., 2018b), few-shot learning (Ravi & Larochelle, 2017), and training set debugging (Cadamuro & Zhu, 2016; Zhang et al., 2018), because it is usually acceptable to manually check a small amount of training data as the validation set, and the performance on a clean unbiased validation set approximates the generalization performance well.
We adopt a bootstrap strategy to create multiple validation sets, in order to determine the missing or corrupted training set labels robustly. On one hand, PGS optimizes the validation performance of the model trained on the optimized labels to maximize the performance gain. On the other hand, PGS optimizes the performance gain against the original labels on all validation sets simultaneously to maintain safeness:
$$\max_{\tilde{y}} \; P_{V}(h_{\tilde{y}}) \quad \text{s.t.} \; P_{V_k}(h_{\tilde{y}}) - P_{V_k}(h_{\hat{y}}) \ge 0, \;\; k = 1, \dots, K, \tag{4}$$
where $V$ is the validation set, $V_1, \dots, V_K$ are its bootstrap resamples, and $P_{V_k}(h)$ is the performance of $h$ on $V_k$.
In this paper, the worst case is defined as the worst performance gain over the validation sets. However, it is worth mentioning that in real applications there are multiple flexible ways to define the worst-case performance to maintain safeness. For example, in fine-grained classification, if we require that model performance not degrade on some specific classes, the validation sets can be composed of examples from these classes. For multi-view/multi-modal data, we can require that model performance does not decrease on any view/modality. We can also optimize multiple performance measures simultaneously and require that no performance measure decreases.
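As a concrete illustration, the bootstrap validation sets and the worst-case gain they define can be sketched as follows. This is a minimal sketch: the function names and the accuracy-based gain are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def make_bootstrap_val_sets(X_val, y_val, k=3, seed=0):
    """Create k bootstrap resamples of the small clean validation set."""
    rng = np.random.default_rng(seed)
    n = len(y_val)
    idx_sets = [rng.integers(0, n, size=n) for _ in range(k)]
    return [(X_val[i], y_val[i]) for i in idx_sets]

def worst_case_gain(acc_fn, model_new, model_old, val_sets):
    """Worst performance gain of model_new over model_old across the validation
    sets; safeness requires this quantity to be non-negative."""
    return min(acc_fn(model_new, X, y) - acc_fn(model_old, X, y)
               for X, y in val_sets)
```

For the flexible worst-case definitions mentioned above, `val_sets` would simply be replaced by class-restricted subsets, per-view splits, or one entry per performance measure.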
Maximize Gain  PGS adopts the validation performance as an objective to optimize the training labels. By optimizing validation performance, PGS can leverage useful information from a large training set and still converge to an appropriate distribution favored by the validation set. A clean unbiased validation set is a good approximation of the true data distribution; thus, the proposed PGS framework can maximize the performance gain and improve generalization performance.
Maintain Safeness  PGS explicitly requires that, in the worst case, the model trained on the optimized data is no worse than the model trained on the original data, i.e., it does not have the performance degeneration problem. In this paper, we adopt the bootstrap strategy to create multiple validation sets and define the worst case as the worst performance gain over these validation sets. As mentioned previously, in real applications the worst-case performance can be defined flexibly. Therefore, the PGS framework can maintain safeness in various scenarios.
3.3 Handling instance weights and label correction
Optimizing the label quality involves two operations. The first is to identify harmful instances: a weight $w_i$ is assigned to each training instance, where a higher weight corresponds to higher quality. A harmful instance can then be relabeled, or simply kept at a low weight to decrease its influence. To relabel an instance, we estimate a label transition quantity $\delta_i$ for it. With these two operations, the empirical risk minimization procedure for two fundamental machine learning tasks, classification and regression, can be written as:
$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} w_i \, \ell\big(f_\theta(x_i), \, \hat{y}_i + \delta_i\big), \quad \delta_i \in \mathbb{R}^{C}, \tag{5}$$
$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} w_i \big(f_\theta(x_i) - (\hat{y}_i + \delta_i)\big)^2, \quad \delta_i \in \mathbb{R}, \tag{6}$$
where $C$ is the number of classes, $w_i$ is the weight, and $\delta_i$ is the label transition quantity, a $C$-dimensional vector for the classification task and a scalar for the regression task. The model trained on the raw data is recovered when $w_i = 1$ and $\delta_i = 0$ for all $i$.
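A minimal sketch of the weighted, label-corrected classification loss described above. The function names and the additive soft-target form are our assumptions; the paper's surrogate may differ in detail.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def weighted_corrected_ce(logits, y_onehot, w, delta):
    """Weighted cross-entropy against corrected soft targets y + delta.

    w     : per-instance quality weights, shape (n,)
    delta : label transition quantities, shape (n, C)
    Setting w = 1 and delta = 0 recovers the ordinary loss on the raw labels.
    """
    target = y_onehot + delta                                   # corrected (soft) label
    per_example = -(target * np.log(softmax(logits) + 1e-12)).sum(axis=1)
    return float((w * per_example).mean())
```

A harmful instance can thus be handled either by pushing its `w` toward zero or by moving probability mass through `delta` toward a more plausible class.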
In order to reduce the complexity of the search space and utilize the information in the original training data, PGS constrains the distance between the optimized label $\tilde{y}_i = \hat{y}_i + \delta_i$ and the original label $\hat{y}_i$. For classification, this distance is measured by $\|\delta_i\|_1$; for regression, by $|\delta_i|$.
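One standard way to enforce such a distance constraint during gradient-based optimization (our assumption; the paper does not spell out the mechanism) is a Euclidean projection of $\delta_i$ onto an L1 ball after each update step:

```python
import numpy as np

def project_l1_ball(delta, eps):
    """Project delta onto {d : ||d||_1 <= eps}, keeping the optimized label
    close to the original one."""
    if np.abs(delta).sum() <= eps:
        return delta.copy()                       # already feasible
    u = np.sort(np.abs(delta))[::-1]              # magnitudes, descending
    css = np.cumsum(u)
    # largest index rho with u[rho] > (css[rho] - eps) / (rho + 1)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - eps)[0][-1]
    tau = (css[rho] - eps) / (rho + 1.0)          # soft-threshold level
    return np.sign(delta) * np.maximum(np.abs(delta) - tau, 0.0)
```

The projection shrinks small components of `delta` to exactly zero, so most labels stay untouched and only a sparse subset is corrected.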
In practice, the performance of a model can be replaced with a surrogate loss function. Specifically, for a classification problem, our objective is as follows:
$$\min_{w, \delta} \; \mathcal{L}_{V}\big(\theta^*(w, \delta)\big) \quad \text{s.t.} \; \mathcal{L}_{V_k}\big(\theta^*(w, \delta)\big) \le \mathcal{L}_{V_k}(\theta^*_{0}), \;\; k = 1, \dots, K, \qquad \theta^*(w, \delta) = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} w_i \, \ell\big(f_\theta(x_i), \hat{y}_i + \delta_i\big), \tag{7}$$
where $\mathcal{L}_{V_k}(\theta^*_0)$ is the validation loss of the model trained on the original data, i.e., with $w_i = 1$ and $\delta_i = 0$ for all $i$.
For the regression task, the model training procedure is analogous, with the classification loss $\ell$ replaced by the squared loss and the constraint on $\delta_i$ adapted accordingly.
3.4 Gradient Method for Bi-Level Optimization
In this section, we discuss the optimization strategies for PGS. The optimization of the regression problem is similar to the classification one; thus, we only discuss the detailed procedure for the classification task.
The problem in Section 3.3 is a bilevel optimization problem (Bard, 2013), where one optimization problem is nested within another. The lower-level optimization finds an empirical risk minimizer given the training set, whereas the upper-level optimization minimizes the validation loss given the learned model. For simplicity of writing, we denote the upper-level objective as $F(\theta, w, \delta)$ and the lower-level objective as $f(\theta, w, \delta)$.
It is difficult to optimize the upper-level objective function directly because, in general, there is no closed-form expression for the optimal $\theta^*$. The classical approaches for solving bilevel optimization problems can be categorized into single-level reduction methods, descent methods, trust-region methods, and evolutionary methods (Sinha et al., 2018). In this paper, we adopt two of the most popular methods, for the convex and non-convex situations respectively.
Optimization for Convexity  If the empirical loss function is differentiable and strictly convex, the lower-level optimization problem can be replaced with its Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004):
$$\nabla_\theta f(\theta, w, \delta) = 0. \tag{8}$$
The solution to Eq. (8) defines an implicit function $\theta^*(w, \delta)$. Then we can adopt the implicit function theorem to estimate how $\theta^*$ varies in $w$ and $\delta$:
$$\frac{\partial \theta^*}{\partial w} = - \big(\nabla_\theta^2 f\big)^{-1} \, \nabla_w \nabla_\theta f, \tag{9}$$
$$\frac{\partial \theta^*}{\partial \delta} = - \big(\nabla_\theta^2 f\big)^{-1} \, \nabla_\delta \nabla_\theta f. \tag{10}$$
This tells us how $\theta^*$ changes with respect to an infinitesimal change to $w$ and $\delta$. Since the upper-level objective depends on $w$ and $\delta$ only through $\theta^*$, we can apply the chain rule to get the gradient of the whole optimization problem w.r.t. $w$ and $\delta$:
$$\nabla_w F = - \big(\nabla_w \nabla_\theta f\big)^{\!\top} \big(\nabla_\theta^2 f\big)^{-1} \nabla_\theta F, \tag{11}$$
$$\nabla_\delta F = - \big(\nabla_\delta \nabla_\theta f\big)^{\!\top} \big(\nabla_\theta^2 f\big)^{-1} \nabla_\theta F. \tag{12}$$
In practice, materializing and inverting the Hessian is unappealing; the alternative is to compute $v = \big(\nabla_\theta^2 f\big)^{-1} \nabla_\theta F$ as the solution to the linear system $\nabla_\theta^2 f \, v = \nabla_\theta F$. When the Hessian is positive definite, the linear system can be solved conveniently with conjugate gradient, which only requires matrix-vector products; that is, we do not have to materialize the Hessian.
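The implicit-gradient recipe can be made concrete on a toy strictly convex inner problem. In the sketch below (hypothetical names; weighted ridge regression stands in for the lower-level loss), we form the stationarity system, solve $\nabla_\theta^2 f \, v = \nabla_\theta F$, and assemble the hypergradient with respect to the instance weights $w$:

```python
import numpy as np

def train_ridge(X, y, w, lam=0.1):
    """Lower level: theta* = argmin_theta sum_i w_i (x_i'theta - y_i)^2 + lam ||theta||^2."""
    H = 2 * X.T @ (w[:, None] * X) + 2 * lam * np.eye(X.shape[1])  # Hessian of f
    b = 2 * X.T @ (w * y)
    return np.linalg.solve(H, b), H

def hypergradient_w(X, y, Xv, yv, w, lam=0.1):
    """d(validation loss)/dw via the implicit function theorem:
    grad_w F = -(grad_w grad_theta f)^T H^{-1} grad_theta F."""
    theta, H = train_ridge(X, y, w, lam)
    g = 2 * Xv.T @ (Xv @ theta - yv)   # grad_theta F (validation squared loss)
    v = np.linalg.solve(H, g)          # solve H v = g (conjugate gradient at scale)
    r = X @ theta - y                  # residuals; d(grad_theta f)/dw_i = 2 r_i x_i
    return -2 * r * (X @ v)
```

Here the linear solve is exact; with a large model one would replace `np.linalg.solve` with conjugate gradient driven by Hessian-vector products, as discussed above.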
Optimization for Non-Convexity  However, in general, we cannot use the implicit function theorem to obtain the optimal $\theta^*$, because setting the derivative to zero only yields a stationary point for non-convex functions. In general cases, we adopt gradient descent methods (or one of their variants such as momentum, RMSProp, Adam, etc.) to solve for the optimal $\theta^*$ approximately. Specifically, the training procedure can be written as:
$$\theta_t = \theta_{t-1} - \eta \, \nabla_\theta f(\theta_{t-1}, w, \delta), \quad t = 1, \dots, T, \tag{13}$$
where $\eta$ is the learning rate. We replace the lower-level problem with this dynamical procedure and place it in the objective in the Lagrangian form:
$$\mathcal{L}(\theta, \lambda, w, \delta) = F(\theta_T) + \sum_{t=1}^{T} \lambda_t^\top \big( \theta_t - \theta_{t-1} + \eta \, \nabla_\theta f(\theta_{t-1}, w, \delta) \big), \tag{14}$$
where $\lambda_t$ is the Lagrange multiplier associated with the $t$-th step of the dynamical process. The partial derivatives of the Lagrangian are given by:
$$\nabla_{\theta_T} \mathcal{L} = \nabla F(\theta_T) + \lambda_T, \tag{15}$$
$$\nabla_{\theta_t} \mathcal{L} = \lambda_t - A_{t+1}^\top \lambda_{t+1}, \quad t = 1, \dots, T-1, \tag{16}$$
$$\nabla_{\lambda_t} \mathcal{L} = \theta_t - \theta_{t-1} + \eta \, \nabla_\theta f(\theta_{t-1}, w, \delta), \tag{17}$$
$$\nabla_{w} \mathcal{L} = \sum_{t=1}^{T} B_t^\top \lambda_t, \tag{18}$$
$$\nabla_{\delta} \mathcal{L} = \sum_{t=1}^{T} C_t^\top \lambda_t, \tag{19}$$
where
$$A_t = I - \eta \, \nabla_\theta^2 f(\theta_{t-1}, w, \delta), \tag{20}$$
$$B_t = \eta \, \nabla_w \nabla_\theta f(\theta_{t-1}, w, \delta), \tag{21}$$
$$C_t = \eta \, \nabla_\delta \nabla_\theta f(\theta_{t-1}, w, \delta). \tag{22}$$
By setting each derivative to zero we obtain:
$$\lambda_T = -\nabla F(\theta_T), \tag{23}$$
$$\lambda_t = A_{t+1}^\top \lambda_{t+1}, \quad t = T-1, \dots, 1. \tag{24}$$
Then, the whole problem can be solved with gradient methods. Algorithm 1 and 2 summarize the pseudo code of these two optimization strategies.
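For a scalar toy model, the forward unrolling (Eq. 13) and the backward multiplier recursion (Eqs. 23-24) can be sketched as follows. This is a minimal sketch under our own notation: a weighted least-squares loss stands in for the non-convex training loss, and the resulting hypergradient can be checked against finite differences.

```python
import numpy as np

def grad_theta(theta, w, X, y):
    """grad_theta f for the scalar model f = sum_i w_i (x_i * theta - y_i)^2."""
    return 2.0 * np.sum(w * X * (X * theta - y))

def unrolled_hypergradient(w, X, y, Xv, yv, eta=0.01, T=100, theta0=0.0):
    # forward pass: T gradient-descent steps on the weighted training loss
    thetas = [theta0]
    for _ in range(T):
        thetas.append(thetas[-1] - eta * grad_theta(thetas[-1], w, X, y))
    # backward pass: lambda_T = -F'(theta_T), then lambda_{t-1} = A_t * lambda_t,
    # accumulating grad_w = sum_t B_t * lambda_t along the way
    lam = -2.0 * np.sum(Xv * (Xv * thetas[-1] - yv))
    grad_w = np.zeros_like(w)
    for t in range(T, 0, -1):
        th = thetas[t - 1]
        B_t = eta * 2.0 * X * (X * th - y)         # eta * d(grad_theta f)/dw
        grad_w += B_t * lam
        A_t = 1.0 - eta * 2.0 * np.sum(w * X**2)   # I - eta * Hessian
        lam = A_t * lam
    return thetas[-1], grad_w
```

Because the backward pass differentiates the T-step procedure itself rather than the exact minimizer, it remains valid even when the training loss is non-convex.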
Convergence  For convex situations, the lower-level optimization problem is replaced with its KKT conditions, and the overall bilevel optimization problem reduces to a single-level optimization problem. Therefore, the optimization enjoys the same convergence properties as gradient methods for single-level optimization.
For non-convex situations, we adopt an approximation procedure with respect to the original bilevel problem. It is necessary to analyze the convergence of this algorithm; the proof is given in the appendix.
Theorem 1 (Convergence). Suppose the empirical loss function is Lipschitz smooth. Let $\theta^*$ be the optimal solution to the lower-level optimization problem; then as $T \to \infty$, we have $\theta_T \to \theta^*$.
We stress that this assumption is very natural and satisfied by many loss functions of practical interest. For example, for classification, the logistic loss is a Lipschitz smooth function. A similar argument applies to the squared loss in regression.
Table 1: Classification accuracy (%) on MNIST for label noise learning.
Method | Baseline | Validation Only | REED | SModel | SafeW | PGS
Unbiased validation set | 77.4±0.45 | 76.2±0.13 | 78.6±0.21 | 76.2±0.45 | 77.9±0.34 | 83.4±0.19
Biased validation set | 76.7±0.47 | 69.1±0.20 | 78.2±0.30 | 75.7±0.41 | 76.4±0.42 | 80.3±0.39
Table 2: Classification accuracy (%) on MNIST for semi-supervised learning.
Method | Baseline | Π-Model | Mean Teacher | Pseudo-Labeling | SafeW | PGS
Unbiased validation set | 80.3±0.43 | 83.6±0.36 | 84.9±0.40 | 82.5±0.66 | 82.2±0.28 | 86.5±0.33
Biased validation set | 79.8±0.41 | 82.8±0.49 | 84.1±0.41 | 82.0±0.53 | 81.6±0.36 | 84.1±0.29
4 Experiments
In order to validate the effectiveness of the proposed method, extensive experiments are conducted on a broad range of datasets covering diverse domains, including the standard MNIST and CIFAR benchmarks for image classification and six UCI datasets for regression tasks. Both unbiased and biased validation sets are considered, and two WSL tasks, label noise learning and semi-supervised learning, are used for comparison.
4.1 Experimental Setup
For label noise learning, PGS is compared with the following methods. REED (Reed et al., 2014): a bootstrapping technique where the training target is a convex combination of the model prediction and the label. SModel (Goldberger & Ben-Reuven, 2017): it adds a fully connected softmax layer after the regular classification output layer to model the noise transition matrix. SafeW (Guo & Li, 2018): it uses an ensemble strategy to generate a prediction that maintains the safeness of weakly supervised learning. In addition, we compare with two simple baselines. Baseline: it combines the noisy data and validation data as the training set to train a model. Validation Only: it uses only the validation data as the training set to train a model.

For semi-supervised learning, PGS is compared with the following methods. Π-Model (Laine & Aila, 2017): it adds a loss term which encourages the distance between the network's outputs for different passes of unlabeled data through the network to be small. Mean Teacher (Tarvainen & Valpola, 2017): a stable version of the Π-Model, which sets the target to predictions made using an exponential moving average of parameters from previous training steps. Pseudo-Labeling (Lee, 2013): it produces pseudo-labels for unlabeled data using the prediction function itself over the course of training. In addition, we compare with SafeW and the Baseline method.
For the REED method, we use the best parameters reported in (Reed et al., 2014). For the SModel method, the transition weight is set to a smoothed identity matrix. The Baseline method, REED, and SModel are adopted as the base learners in SafeW. For PGS, a two-layer neural network is employed as the model. Gradient descent and Adam are used to optimize the lower-level and the upper-level objectives, respectively. The numbers of lower-level and upper-level iterations are set to 500 and 20 for all experiments.

For the Π-Model, Mean Teacher, and Pseudo-Labeling methods, we adopt the same model structure and optimization method as PGS. The Π-Model, Mean Teacher, and Baseline methods are adopted as the base learners in SafeW. For PGS, the weight and label transition quantity for labeled data are known, and we only optimize these values for unlabeled data.
To make sure that our method does not have the privilege of training on more data, all compared methods combine the validation data with the raw training data as a new training set.
4.2 MNIST Handwritten Digit Recognition Task
MNIST (LeCun et al., 1998) is a standard dataset for handwritten digit classification. We select a subset of the 28×28 images and split it into four parts: a training set with 7,000 examples, a validation set with 1,000 examples, a hyper-validation set with 1,000 examples to monitor training progress and tune hyperparameters, and a test set. We also investigate the impact of biased validation data by subsampling a class-imbalanced validation set. Specifically, the ratio between images belonging to classes 0-4 and classes 5-9 is shifted from 1:1 to 1:3 in the biased validation set, and 3 validation sets are generated with the bootstrap strategy.
Label Noise Learning Results  For label noise learning, we add uniform flip noise to the training set, meaning that each label can uniformly flip to any other label class; this setting has been the most studied in the literature (Frénay & Verleysen, 2014). The classification accuracy under the given noise ratio is summarized in Table 1. All results are averaged over 5 runs with different random splits of the datasets.
From the comparison results in Table 1, our method achieves the best performance on both biased and unbiased validation sets among all the methods. Methods that only maximize performance, such as SModel, can suffer performance degradation. The overall performance improvement of safeness-only methods (such as SafeW) is limited. Our approach is never inferior to the baseline approach and achieves the largest performance improvement. Using the validation set alone is not effective, because the amount of validation data is small and it is difficult to train a good model with it. The above results demonstrate that PGS is not equivalent to the baseline method, which simply trains a model on the validation and training data. In contrast, PGS utilizes the validation data to improve the label quality of the training set and thus improves the final performance.
Semi-Supervised Learning Results  For semi-supervised learning, only part of the training data is labeled while the rest is unlabeled. Similar to label noise learning, the classification accuracy is summarized in Table 2, and results are averaged over 5 runs with different random splits.
Results in Table 2 show that PGS also achieves good performance on semi-supervised learning. The key difference between semi-supervised learning and label noise learning is that in semi-supervised learning we know which instances have high-quality labels; thus, PGS achieves better performance even with fewer trusted labeled training instances.
4.3 CIFAR10 Image Classification Task
CIFAR-10 (Krizhevsky & Hinton, 2009) is a benchmark for image classification. The dataset consists of natural 32×32-pixel images in 10 categories. We again subsample a set of images and split it into four parts: a training set with 7,000 examples, a validation set with 1,000 examples, a hyper-validation set with 1,000 examples, and a test set. To construct a biased validation set, the ratio between the top five classes and the bottom five classes is shifted to 1:3, and we create 3 validation sets using the bootstrap strategy.
Label Noise Learning Results  The classification accuracy for label noise learning with uniform flip noise is reported in Table 3. All results are averaged over 5 runs with different random splits of the datasets. The results in Table 3 show that PGS again obtains the largest performance improvement on label noise learning and does not suffer from the performance degradation problem.
Table 3: Classification accuracy (%) on CIFAR-10 for label noise learning.
Method | Baseline | Validation Only | REED | SModel | SafeW | PGS
Unbiased validation set | 58.3±0.43 | 49.5±0.24 | 64.5±0.39 | 63.3±0.87 | 62.0±0.42 | 66.3±0.13
Biased validation set | 56.8±0.47 | 42.3±0.20 | 64.3±0.41 | 62.9±0.67 | 61.5±0.44 | 64.5±0.33
Table 4: Classification accuracy (%) on CIFAR-10 for semi-supervised learning.
Method | Baseline | Π-Model | Mean Teacher | Pseudo-Labeling | SafeW | PGS
Unbiased validation set | 60.7±0.38 | 64.4±0.53 | 65.7±0.28 | 62.1±0.50 | 63.5±0.39 | 68.8±0.33
Biased validation set | 59.0±0.38 | 63.6±0.49 | 64.9±0.25 | 61.8±0.43 | 62.5±0.41 | 65.7±0.40
Semi-Supervised Learning Results  The classification accuracy for semi-supervised learning with partially labeled data is reported in Table 4. All results are averaged over 5 runs with different random splits of the datasets. The results in Table 4 show that PGS also achieves highly competitive performance compared with all other methods on semi-supervised learning.
Table 5: Mean squared error on six UCI regression datasets.
Label Noise Learning
Dataset | Baseline | SVR | PGS
abalone | .017±.004 | .143±.033 | .010±.002
bodyfat | .150±.016 | .102±.013 | .087±.010
cpusmall | .005±.001 | .003±.001 | .002±.001
mg | .033±.004 | .030±.005 | .026±.004
mpg | .081±.011 | .074±.009 | .021±.005
space_ga | .100±.014 | .093±.010 | .021±.006
Semi-Supervised Learning
Dataset | Baseline | SelfLS | PGS
abalone | .015±.003 | .012±.020 | .011±.002
bodyfat | .089±.023 | .062±.019 | .053±.011
cpusmall | .006±.001 | .003±.001 | .002±.001
mg | .039±.004 | .031±.006 | .025±.008
mpg | .084±.013 | .040±.011 | .029±.003
space_ga | .060±.009 | .031±.010 | .019±.004
4.4 Datasets for Regression Task
We also conduct experiments on six regression datasets to verify the effectiveness of our proposal on regression tasks. The datasets cover diverse domains, including physical measurements (abalone), health (bodyfat), economics (cadata), activity recognition (mpg), etc. We split each dataset into four parts: training set, validation set, hyper-validation set, and test set. We normalize all features and labels into $[0, 1]$. For label noise learning, we add Gaussian noise to the training set, and for semi-supervised learning, we select 40% of the training set as labeled data.
Since few works focus on weakly supervised regression tasks, for label noise learning we only compare with the baseline method, which uses the training set and validation set simultaneously, and a noise-robust regression method, Support Vector Regression (SVR). For semi-supervised learning, we compare with the baseline method and the SelfLS method, an extension of the supervised least squares method based on self-training. For PGS, we use linear regression as the model, i.e., $f_\theta(x) = \theta^\top x$.

Label Noise Learning Results  We report the mean squared error for these UCI datasets in Table 5. Each result is averaged over 10 runs under a 50% noise ratio. From Table 5, we can see that our proposal achieves the largest improvement, even when SVR suffers from the performance degeneration problem. This demonstrates the effectiveness of PGS for the label noise regression task.
4.5 Label Quality Improvement
To further test the effectiveness of label quality improvement, we compare PGS with two methods: Influence Function (Koh & Liang, 2017), which corrects labels by perturbing a training point and counting the changes to the prediction; and Nearest Neighbor, which recommends the label of the closest validation example when asked for a suggested label correction. These two methods are commonly used baselines for mislabel correction (Zhang et al., 2018).
From Table 6, we can see that PGS dominates the compared methods, which further demonstrates the effectiveness of our proposal in label quality improvement.
Table 6: Classification accuracy (%) of label correction methods under varying noise ratios.
Noisy Ratio | Influence Function | Nearest Neighbor | PGS
10% | 54.17 | 65.71 | 68.21
20% | 56.56 | 65.79 | 68.06
30% | 58.32 | 65.21 | 67.64
40% | 60.16 | 66.81 | 68.33
50% | 61.15 | 65.05 | 67.97
60% | 60.15 | 64.81 | 66.91
4.6 Validation Set Size
It is meaningful to study the impact of the validation set size. We investigate how the classification accuracy varies as the size of the validation set increases from 100 to 1,000 on the MNIST dataset. We run PGS on both label noise learning with a 50% noise ratio and semi-supervised learning with 40% labeled data. The accuracy improvement over the baseline method is plotted in Figure 3.
From Figure 3, we can see that once the validation set reaches a fairly small size (such as 100 samples), the proposal already achieves a considerable performance improvement. Moreover, the experiment shows that increasing the size of the validation set continues to improve performance significantly, indicating further potential of our proposal.
5 Conclusion
This paper presents a new WSL framework. Compared with previous works, it considers both 1) performance maximization and 2) performance safeness. These two requirements are usually indispensable in many practical tasks. We use a small-scale validation set to guide label quality optimization toward an effective solution, and the effectiveness is clearly verified by extensive experimental results. We believe that the new framework opens a door for reliable weakly supervised learning. Many follow-up works, such as integration of an adversarial mechanism, can be further studied.
References

Balsubramani & Freund (2015)
Balsubramani, A. and Freund, Y.
Optimally combining classifiers using unlabeled data.
In Proceedings of the 28th Conference on Learning Theory, pp. 211–225, 2015.  Bard (2013) Bard, J. F. Practical bilevel optimization: algorithms and applications. Springer Science & Business Media, 2013.
 Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
 Cadamuro & Zhu (2016) Cadamuro, G. and Zhu, X. Debugging machine learning models. In ICML Workshop on Reliable Machine Learning in the Wild, 2016.

Chapelle & Zien (2005)
Chapelle, O. and Zien, A.
Semisupervised classification by low density separation.
In
Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics
, pp. 57–64, 2005.  Chapelle et al. (2006) Chapelle, O., Scholkopf, B., and Zien, A. Semisupervised learning. MIT Press, 2006.
 Chawla et al. (2005) Chawla, N. V. et al. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research, pp. 331–366, 2005.
 Cozman et al. (2003) Cozman, F. G., Cohen, I., and Cirelo, M. C. Semisupervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning, pp. 99–106, 2003.

Franceschi et al. (2017)
Franceschi, L., Donini, M., Frasconi, P., and Pontil, M.
Forward and reverse gradientbased hyperparameter optimization.
In Proceedings of the 34th International Conference on Machine Learning, pp. 1165–1173, 2017.  Franceschi et al. (2018) Franceschi, L., Frasconi, P., Salzo, S., and Pontil, M. Bilevel programming for hyperparameter optimization and metalearning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1563–1572, 2018.
Frénay, B. and Verleysen, M. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, pp. 845–869, 2014.
Gamberger, D., Lavrac, N., and Dzeroski, S. Noise detection and elimination in data preprocessing: experiments in medical domains. Applied Artificial Intelligence, pp. 205–223, 2000.
Goldberger, J. and Ben-Reuven, E. Training deep neural networks using a noise adaptation layer. In Proceedings of the 5th International Conference on Learning Representations, 2017.
Guo, L.-Z. and Li, Y.-F. A general formulation for safely exploiting weakly supervised data. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 3126–3133, 2018.
Huang, F., Ahuja, A., Downey, D., Yang, Y., Guo, Y., and Yates, A. Learning representations for weakly supervised natural language processing tasks. Computational Linguistics, pp. 85–120, 2014.
Joachims, T. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pp. 200–209, 1999.
Kasabov, N. and Pang, S. Transductive support vector machines and applications in bioinformatics for promoter recognition. Neural Information Processing - Letters and Reviews, pp. 1–6, 2004.
Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894, 2017.
Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp. 2278–2324, 1998.
Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, pp. 2–8, 2013.
Li, Y.-F. and Zhou, Z.-H. Towards making unlabeled data never hurt. In Proceedings of the 28th International Conference on Machine Learning, pp. 1081–1088, 2011.
Li, Y.-F. and Zhou, Z.-H. Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 175–188, 2015.
Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2113–2122, 2015.
Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations, 2018a.
Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4331–4340, 2018b.
Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, pp. 1–114, 2012.
Sinha, A., Malo, P., and Deb, K. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, pp. 276–295, 2018.
Sun, J.-W., Zhao, F.-Y., Wang, C.-J., and Chen, S.-F. Identifying and correcting mislabeled training instances. In Future Generation Communication and Networking, pp. 244–250, 2007.
Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204, 2017.
Van Hulse, J. and Khoshgoftaar, T. Knowledge discovery from imbalanced and noisy data. IEEE Transactions on Knowledge and Data Engineering, pp. 1513–1542, 2009.
Vezhnevets, A., Ferrari, V., and Buhmann, J. M. Weakly supervised structured output learning for semantic segmentation. In Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition, pp. 845–852, 2012.
Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., and Summers, R. M. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, pp. 3462–3471, 2017.
Zhang, X., Zhu, X., and Wright, S. J. Training set debugging using trusted items. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 4482–4489, 2018.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pp. 321–328, 2004.
Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, pp. 44–53, 2017.
Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pp. 912–919, 2003.
A. Proof of Theorem 1
Theorem 1.
(Convergence). Suppose the empirical loss function $\mathcal{L}$ is Lipschitz continuous. Let $w^*$ be the optimal solution to the lower-level optimization problem; then as $w_T \to w^*$, we have $\mathcal{L}(w_T) \to \mathcal{L}(w^*)$.
Proof.
Since $\mathcal{L}$ is Lipschitz continuous, there exists a constant $C > 0$ such that for every $w_1$ and every $w_2$,
$$|\mathcal{L}(w_1) - \mathcal{L}(w_2)| \le C\,\|w_1 - w_2\|.$$
Let $\epsilon > 0$ be arbitrary and set $\delta = \epsilon / C$.
With $\|w_T - w^*\| \le \delta$, it is obvious that $C\,\|w_T - w^*\| \le \epsilon$, thus we have that $|\mathcal{L}(w_T) - \mathcal{L}(w^*)| \le \epsilon$.
Using the continuity of $\mathcal{L}$ together with $w_T \to w^*$, we have $\lim_{T \to \infty} \mathcal{L}(w_T) = \mathcal{L}(w^*)$.
Therefore, we have
$$\lim_{T \to \infty} |\mathcal{L}(w_T) - \mathcal{L}(w^*)| = 0,$$
which completes the proof. ∎
B. Impact of Iteration Number
We investigate how the number of iterations used in the optimization dynamics affects the quality of the solution. The result is plotted in Figure 4, where the x-axis is the number of iterations for optimizing the training loss, the y-axis is the number of iterations for optimizing the validation loss, and the z-axis is the classification accuracy.
From Figure 4, we can see that as the number of lower-level iterations increases, we obtain a better initialization for the upper-level optimization problem, so that the upper-level optimization needs only a small number of iterations (e.g., 3–5) to reach a quite good solution.
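The interplay between lower-level and upper-level iteration counts can be illustrated with a minimal, self-contained sketch. This is a toy example with quadratic losses, not the paper's actual model: `lower_level` runs T gradient steps on a hypothetical training loss parameterized by a hyperparameter `lam`, and `upper_level` runs S steps on a hypothetical validation loss using a finite-difference gradient estimate; all functions, losses, and step sizes here are illustrative assumptions.

```python
def lower_level(lam, T, lr=0.1):
    """Run T gradient steps on a toy training loss L_tr(w) = (w - lam)^2."""
    w = 0.0
    for _ in range(T):
        w -= lr * 2.0 * (w - lam)   # gradient of (w - lam)^2 w.r.t. w
    return w

def upper_level(S, T, lr=0.3, eps=1e-4):
    """Run S finite-difference steps on a toy validation loss
    L_val(lam) = (w_T(lam) - 1)^2 with respect to the hyperparameter lam."""
    lam = 0.0
    for _ in range(S):
        val = lambda l: (lower_level(l, T) - 1.0) ** 2
        # Central finite-difference estimate of dL_val/dlam.
        grad = (val(lam + eps) - val(lam - eps)) / (2.0 * eps)
        lam -= lr * grad
    return lam

# With enough lower-level iterations (T = 50), a handful of upper-level
# steps (S = 5) already brings the validation loss close to zero.
lam = upper_level(S=5, T=50)
w = lower_level(lam, T=50)
```

In this sketch, a large T makes the inner solution track the hyperparameter accurately, which mirrors the observation above: better lower-level solutions let a few upper-level iterations suffice.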