Weakly supervised data (Zhou, 2017) commonly appears in machine learning tasks because collecting a large amount of strongly supervised data is costly. Examples include incomplete supervision, where only a small subset of the training data is labeled; inexact supervision, where only coarse-grained labels are given; and inaccurate supervision, where the given labels are not always ground-truth (Zhou, 2017). Learning with weakly supervised data plays an important role in various applications, including text categorization (Joachims, 1999), semantic segmentation (Vezhnevets et al., 2012), bioinformatics (Kasabov & Pang, 2004), medical care (Wang et al., 2017), etc.
Many weakly supervised learning (WSL) methods have been proposed, e.g., semi-supervised learning methods (Chapelle et al., 2006), label noise learning methods (Frénay & Verleysen, 2014), etc. From the perspective of their objective, existing methods can roughly be categorized into two classes. One maximizes the performance gain based on some distributional assumption; the other maintains performance safeness based on worst-case analysis.
The first objective of WSL focuses on maximizing the performance by making assumptions on the true label structure. For example, to maximize learning performance on unlabeled data, transductive support vector machines (TSVMs) (Joachims, 1999) adopt the low-density assumption (Chapelle & Zien, 2005), i.e., the decision boundary should lie in a low-density region. Label propagation algorithms (Zhu et al., 2003) spread soft labels from a few labeled nodes to the whole graph based on the manifold assumption (Zhou et al., 2004). To maximize learning performance with noisily labeled data, one can try to improve the label quality by assuming that mislabeled items can be detected by measures such as classification confidence (Sun et al., 2007), model complexity (Gamberger et al., 2000), and influence functions (Koh & Liang, 2017).
It is generally expected that, in comparison with a supervised algorithm that uses only labeled data, WSL can improve the performance by using additional weakly supervised data. However, a mismatch between the problem structure and the model assumption can degrade the learning performance. The fact that WSL does not always help has been observed by multiple researchers (Cozman et al., 2003; Chawla et al., 2005; Chapelle et al., 2006; Van Hulse & Khoshgoftaar, 2009; Frénay & Verleysen, 2014; Li & Zhou, 2015), who pointed out that there are cases in which using the weakly supervised data makes the performance even worse than not using it.
Therefore, the other objective of WSL, which focuses on maintaining safeness, has attracted significant attention in recent years. Li & Zhou (2011, 2015) build safe semi-supervised SVMs by optimizing the worst-case performance gain given a set of candidate low-density separators. Balsubramani & Freund (2015) propose to learn a safe prediction given that the ground-truth label assignment is restricted to a specific candidate set. Guo & Li (2018) propose a general safe WSL formulation, assuming that the ground-truth can be constructed from a set of base learners, and optimize the worst-case performance gain. These methods alleviate the performance degeneration problem of WSL by analyzing the worst case of the ground-truth, but they typically yield only conservative performance improvement.
However, many practical tasks require both a strong performance improvement and safeness; that is, both the best-case and the worst-case learning performance must be taken into account. Both are indispensable: falling short on either one makes the results of weakly supervised learning difficult to rely on in real applications.
In this work, we consider the question of how to optimize label quality for WSL such that both the best-case and the worst-case performance are considered. Figure 1 illustrates the motivation for this work. Specifically, consider two reference points: the performance of a baseline supervised strategy that does not use weakly supervised data, and the performance of weakly supervised learning under a correct data assumption. The question is how to optimize the training set labels such that the learned performance comes close to the latter while never being worse than the former.
In view of this issue, this paper proposes a simple and novel weakly supervised learning framework, PGS (maximize Performance Gain while maintaining Safeness). Our framework uses a small amount of validation data to explicitly guide label quality optimization. Because the validation set is a good approximation for describing the generalization risk, it can provide reliable guidance for both performance improvement and performance safeness. At the same time, the validation set effectively avoids the performance problems caused by incorrect data distribution assumptions. This new perspective on weakly supervised learning can be formalized as a bi-level optimization problem (Bard, 2013), where one optimization problem is nested within another, and it can be effectively addressed in both convex and non-convex situations. Experimental results on classification and regression tasks confirm the effectiveness of the new framework. In summary, our contributions in this work are as follows:
- We propose a new weakly supervised learning framework that tries to maximize the performance gain and maintain safeness at the same time.
- We propose a reliable solution by formulating the problem as a bi-level optimization problem with an effective algorithm, and we justify it with extensive experiments.
In the following, we first review two related WSL frameworks and then present the PGS framework. Next, we report the experimental results. Finally, we conclude the paper.
2 Two WSL Frameworks and Limitations
Notations and Setup Consider a prediction problem from some input space (e.g., images) to an output space (e.g., labels). We are given a weakly supervised training data set where . Assuming that the true data distribution is . Let be the model derived from baseline supervised learning (SL) methods without WSL, i.e., and be the model derived from WSL methods, i.e., .
Maximize Performance Gain The first branch of WSL aims to maximize gain, i.e.,
where is the performance of model on true data distribution which is unknown. It is assumed that the higher the performance, the better.
These methods (Chapelle & Zien, 2005; Sun et al., 2007; Gamberger et al., 2000; Koh & Liang, 2017) make assumptions on the true label structure and aim to maximize the best-case performance, where best-case means that the true label follows the assumptions of these WSL methods. However, mismatched assumptions occur in real applications and cause the performance degeneration problem (Zhou, 2017).
Maintain Performance Safeness The second branch of WSL aims to maintain safeness, i.e.,
These methods (Li & Zhou, 2011, 2015; Balsubramani & Freund, 2015; Guo & Li, 2018) maximize the worst-case performance gain to maintain safeness, i.e., they consider the true label in the worst case. The resulting algorithms perform better than supervised learning methods but yield only conservative performance improvement.
3 The PGS Framework
3.1 General Formulation
The proposed PGS framework optimizes label quality for weakly supervised data by taking both the best-case and the worst-case performance into account. PGS maximizes the learning performance of the model trained on the optimized weakly supervised data while requiring that, in the worst case, this model is not worse than the model trained on the raw labels. Thus, the PGS framework can be generally formulated as:
Figure 2 illustrates the PGS framework.
3.2 Determining Missing or Corrupted Labels Robustly
In reality, the true label is unavailable. To overcome this issue, PGS introduces a small, clean, unbiased validation set to approximate the true data distribution. The effectiveness of a similar idea has recently been demonstrated in hyper-parameter tuning (Maclaurin et al., 2015; Ravi & Larochelle, 2017; Ren et al., 2018a; Franceschi et al., 2017, 2018), meta-learning (Ren et al., 2018b), few-shot learning (Ravi & Larochelle, 2017), and training set debugging (Cadamuro & Zhu, 2016; Zhang et al., 2018). The reason is that it is usually acceptable to manually check a small amount of training data to form the validation set, and the performance on a clean unbiased validation set approximates the generalization performance well.
We adopt a bootstrap strategy to create multiple validation sets so that the missing or corrupted training set labels are determined robustly. On the one hand, PGS optimizes the validation performance of the model trained on the optimized labels to maximize the performance gain. On the other hand, PGS optimizes the performance gain against the original labels on all validation sets simultaneously to maintain safeness.
In this paper, the worst case is defined as the worst performance gain over the validation sets. However, it is worth mentioning that in real applications there are multiple flexible ways to define the worst-case performance to maintain safeness. For example, in fine-grained classification, if the model performance must not degrade on some specific classes, the validation sets can be composed of examples from these classes. For multi-view or multi-modal data, we can require that the model performance does not decrease on any view or modality. We can also optimize multiple performance measures simultaneously and require that the performance does not decrease on any of them.
Maximize Gain PGS adopts the validation performance as an objective to optimize the training label. By optimizing validation performance, the PGS can leverage useful information from a large training set, and still converge to an appropriate distribution favored by the validation set. A clean unbiased validation set is a good approximation of the true data distribution, thus, the proposed PGS framework can maximize performance gain and improve the generalization performance.
Maintain Safeness PGS explicitly requires that, in the worst case, the model trained on the optimized data is not worse than the model trained on the original data, i.e., it does not suffer from the performance degeneration problem. In this paper, we adopt the bootstrap strategy to create multiple validation sets and define the worst case as the worst performance gain over these validation sets. As mentioned previously, the definition of the worst-case performance is flexible in real applications. Therefore, the PGS framework can maintain safeness in various scenarios.
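The bootstrap-based worst-case check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `bootstrap_indices` and `worst_case_gain` are hypothetical, and the gain is measured here as the reduction in mean validation loss, which is an assumption about how performance is scored.

```python
import numpy as np

def bootstrap_indices(n, num_sets, rng):
    """Draw `num_sets` bootstrap resamples (with replacement) of range(n)."""
    return [rng.integers(0, n, size=n) for _ in range(num_sets)]

def worst_case_gain(loss_original, loss_optimized, index_sets):
    """Smallest reduction in mean validation loss over all bootstrap sets.

    loss_original / loss_optimized hold per-example validation losses of the
    model trained on the raw labels and on the optimized labels.
    A non-negative value means no bootstrap validation set got worse.
    """
    gains = [loss_original[idx].mean() - loss_optimized[idx].mean()
             for idx in index_sets]
    return min(gains)
```

A safeness constraint in this spirit would accept an update of the training labels only if `worst_case_gain(...)` stays non-negative on every bootstrap validation set.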
3.3 Handling Instance Weights and Label Correction
Optimizing the label quality includes two operations. The first is to determine the harmful instances. A weight is assigned to each training instance, where a higher weight corresponds to higher quality. The second is to handle each harmful instance, which can either be relabeled or simply kept at a low weight to decrease its influence. To relabel an instance, we need to estimate a label transition quantity for it.
With these two operations, the empirical risk minimization procedure for two fundamental machine learning tasks, classification and regression, can be written as:
where is the number of classes, is the weight, for classification task and for regression task is the label transition quantity. The model trained on the raw data can be recovered when , for classification task and , for regression task.
In order to reduce the complexity of the search space and utilize the information of original training data, PGS constrains the distance between the optimized label and the original label . For classification, the way to measure the distance between and is the value of , for regression, the way to measure the distance between and is the value of .
In practice, the performance of a model can be replaced with a surrogate loss function. Specifically, for a classification problem, our objective is as follows:
where is the validation loss of the model trained on the original data, .
For regression task, the model training procedure can be replaced with and .
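To make the two operations concrete for regression, the following minimal sketch fits a linear model with per-instance weights and per-instance label corrections. It is written under stated assumptions rather than taken from the paper: the label correction is assumed to be additive, the loss is a weighted squared loss, and the function name `fit_reweighted_regression` and the tiny `ridge` term are hypothetical additions.

```python
import numpy as np

def fit_reweighted_regression(X, y, weights, label_shift, ridge=1e-8):
    """Minimize sum_i weights[i] * (x_i @ theta - (y[i] + label_shift[i]))**2.

    `ridge` is a tiny L2 term (an assumption, not from the paper) that keeps
    the normal equations well-posed when some weights are zero.
    """
    corrected = y + label_shift                       # relabeled targets
    A = X.T @ (weights[:, None] * X) + ridge * np.eye(X.shape[1])
    b = X.T @ (weights * corrected)
    return np.linalg.solve(A, b)
```

With all weights equal to one and all label shifts equal to zero, the sketch reduces to ordinary least squares on the raw data. A corrupted target can be neutralized either by setting its weight to zero or by shifting its label back.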
3.4 Gradient Method for Bi-Level Optimization
In this section, we discuss the optimization strategies for PGS. The optimization for the regression problem is similar to that for classification, so we only discuss the detailed procedure for the classification task.
Eq.(3.3) is a bi-level optimization problem (Bard, 2013), where one optimization problem is nested within another problem. The lower-level optimization is to find an empirical risk minimizer model given the training set whereas the upper-level optimization is to minimize the validation loss given the learned model. For the writing simplicity, we denote the upper-level objective as and the lower-level objective as .
It is difficult to optimize the upper-level objective function directly because, in general, there is no closed-form expression for the optimal lower-level solution. Classical approaches for solving bi-level optimization problems can be categorized as single-level reduction methods, descent methods, trust-region methods, and evolutionary methods (Sinha et al., 2018). In this paper, we adopt two of the most popular methods, one for the convex and one for the non-convex situation.
Optimization for Convexity If the empirical loss function is differentiable and strictly convex, the lower-level optimization problem can be replaced with its Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004):
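The solution to this condition defines the model parameters as an implicit function of the instance weights and the label transition quantities. We can then adopt the implicit function theorem to estimate how the parameters vary with these two quantities: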
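This tells us how the model parameters change with respect to an infinitesimal change in the instance weights and the label transition quantities. Now, we can apply the chain rule to get the gradient of the whole optimization problem w.r.t. the weights and the label transition quantities: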
In practice, explicitly inverting the Hessian is expensive. Instead, the product of the inverse Hessian with a vector can be computed as the solution to a linear system. When the Hessian is positive definite, this linear system can be solved conveniently using only matrix-vector products; that is, we do not have to materialize the Hessian.
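The implicit-gradient computation can be illustrated on a small strictly convex problem. The sketch below is not the paper's classification objective: it assumes a weighted ridge regression as the lower-level problem, a mean squared validation loss as the upper-level objective, and differentiates only with respect to the instance weights. The names `fit_weighted_ridge`, `conjugate_gradient` and `implicit_hypergradient` are hypothetical. The linear system is solved by conjugate gradient, which touches the Hessian only through Hessian-vector products.

```python
import numpy as np

def fit_weighted_ridge(X, y, w, lam):
    """Lower level: argmin_theta 0.5*sum_i w_i (x_i@theta - y_i)^2 + 0.5*lam*|theta|^2."""
    H = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
    return np.linalg.solve(H, X.T @ (w * y))

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A given only x -> A @ x."""
    x = np.zeros_like(b)
    r = b.copy()                       # residual b - A @ x for x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def implicit_hypergradient(X, y, w, lam, X_val, y_val):
    """Gradient of 0.5*mean((X_val@theta*(w) - y_val)^2) with respect to w."""
    theta = fit_weighted_ridge(X, y, w, lam)
    grad_val = X_val.T @ (X_val @ theta - y_val) / len(y_val)
    # Hessian-vector product of the lower-level objective; H is never formed here.
    hvp = lambda p: X.T @ (w * (X @ p)) + lam * p
    v = conjugate_gradient(hvp, grad_val)            # v = H^{-1} grad_val
    residual = X @ theta - y
    # d theta / d w_i = -H^{-1} x_i * residual_i, then apply the chain rule.
    return -residual * (X @ v)
```

The returned vector can be fed to any first-order optimizer for the upper-level problem; instances whose entry is positive would have their weight decreased by a gradient step.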
Optimization for Non-Convexity However, in general, we cannot use the implicit function theorem to obtain the optimal lower-level solution, because for non-convex functions setting the derivative to zero only identifies a stationary point, which may be a saddle point. In such cases, we adopt gradient descent (or one of its variants such as momentum, RMSProp, or Adam) to solve for the optimal solution approximately. Specifically, the training procedure can be written as:
where is the learning rate. We replace the lower-level problem with the dynamical procedure, and place them in the objective in the Lagrangian form:
where is the Lagrange multipliers associated with the -step of the dynamical process. The partial derivatives of the Lagrangian are given by:
By setting each derivative to zero we can obtain that,
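The backward recursion implied by these stationarity conditions can be sketched in code. The example below is an illustration under assumptions, not the paper's algorithm: the lower level is plain gradient descent on a weighted squared loss for a linear model, the upper level is a mean squared validation loss, and only the instance weights are differentiated. The names `train_gd` and `unrolled_hypergradient` are hypothetical; `alpha` plays the role of the Lagrange multipliers.

```python
import numpy as np

def train_gd(X, y, w, lr, steps):
    """Lower level: `steps` gradient-descent updates on (1/2n)*sum_i w_i r_i^2."""
    n, d = X.shape
    theta = np.zeros(d)
    trajectory = []                          # theta_0, ..., theta_{T-1}
    for _ in range(steps):
        trajectory.append(theta.copy())
        theta = theta - lr * (X.T @ (w * (X @ theta - y)) / n)
    return theta, trajectory

def unrolled_hypergradient(X, y, w, X_val, y_val, lr=0.1, steps=50):
    """Gradient of the validation loss at theta_T with respect to w."""
    n = X.shape[0]
    theta, trajectory = train_gd(X, y, w, lr, steps)
    # Multiplier of the last step: gradient of the validation loss at theta_T.
    alpha = X_val.T @ (X_val @ theta - y_val) / len(y_val)
    grad_w = np.zeros(n)
    for theta_prev in reversed(trajectory):
        residual = X @ theta_prev - y
        # Direct dependence of this update step on w, weighted by the multiplier.
        grad_w -= lr * residual * (X @ alpha) / n
        # Propagate the multiplier one step back: alpha <- (I - lr * H) alpha.
        alpha = alpha - lr * (X.T @ (w * (X @ alpha)) / n)
    return theta, grad_w
```

The forward pass stores the whole trajectory, so memory grows linearly with the number of lower-level steps; this is the usual trade-off of reverse-mode differentiation through an optimizer.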
Convergence For convexity situations, the lower-level optimization problem is replaced with its KKT conditions and the overall bi-level optimization problem is reduced to a single-level optimization problem. Therefore, the optimization problem enjoys the same convergence properties as the gradient methods for single-level optimization.
For non-convex situations, we adopt an approximation procedure with respect to the original bi-level problem. We therefore analyze the convergence of this algorithm; the proof is given in the appendix.
(Convergence). Suppose the empirical loss function is Lipschitz continuous. Let be the optimal solution to the lower-level optimization problem, then as , we have .
We stress that these assumptions are natural and are satisfied by many loss functions of practical interest. For example, for classification, the logistic loss is Lipschitz smooth. A similar argument applies to the square loss in regression.
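Table 1: Classification accuracy (%) on MNIST for label noise learning.
| Unbiased validation set | 77.4 ± 0.45 | 76.2 ± 0.13 | 78.6 ± 0.21 | 77.9 ± 0.34 | 83.4 ± 0.19 |
| Biased validation set | 76.7 ± 0.47 | 69.1 ± 0.20 | 78.2 ± 0.30 | 76.4 ± 0.42 | 80.3 ± 0.39 |

Table 2: Classification accuracy (%) on MNIST for semi-supervised learning.
| Unbiased validation set | 80.3 ± 0.43 | 83.6 ± 0.36 | 84.9 ± 0.40 | 82.5 ± 0.66 | 82.2 ± 0.28 | 86.5 ± 0.33 |
| Biased validation set | 79.8 ± 0.41 | 82.8 ± 0.49 | 84.1 ± 0.41 | 82.0 ± 0.53 | 81.6 ± 0.36 | 84.1 ± 0.29 |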
4 Experiments
To validate the effectiveness of the proposed method, we conduct extensive experiments on data sets from diverse domains, including the standard MNIST and CIFAR benchmarks for image classification and six UCI datasets for regression. Both unbiased and biased validation sets are considered, and two WSL settings, label noise learning and semi-supervised learning, are used for comparison.
4.1 Experimental Setup
For label noise learning, PGS is compared with the following methods. REED (Reed et al., 2014): a bootstrapping technique where the training target is a convex combination of the model prediction and the label. S-Model (Goldberger & Ben-Reuven, 2017): it adds a fully connected softmax layer after the regular classification output layer to model the noise transition matrix. SafeW (Guo & Li, 2018): it uses an ensemble strategy to generate a prediction that maintains the safeness of weakly supervised learning. In addition, we compare with two simple baselines. Baseline: it combines the noisy data and the validation data as the training set to train a model. Validation only: it uses only the validation data as the training set to train a model.
For semi-supervised learning, PGS is compared with the following methods. Π-Model (Laine & Aila, 2017): it adds a loss term that encourages the distance between the network's outputs for different passes of unlabeled data through the network to be small. Mean Teacher (Tarvainen & Valpola, 2017): it is a stable version of the Π-Model, which sets the target to predictions made using an exponential moving average of parameters from previous training steps. Pseudo-Labeling (Lee, 2013): it produces pseudo-labels for unlabeled data using the prediction function itself over the course of training. In addition, we compare with SafeW and the baseline method Baseline.
For the REED method, we use the best parameter reported in (Reed et al., 2014). For the S-Model method, the transition weight is set to a smoothed identity matrix. The baseline method, REED, and S-Model are adopted as the base learners in SafeW. For PGS, a two-layer neural network is employed as the model. Gradient descent and Adam are used to optimize the lower-level and the upper-level objective, respectively. The numbers of iterations for the lower-level and upper-level optimization are set to 500 and 20, respectively, for all experiments.
For the Π-Model, Mean Teacher, and Pseudo-Labeling methods, we adopt the same model structure and optimization method as PGS. The Π-Model, Mean Teacher, and Baseline methods are adopted as the base learners in SafeW. For PGS, the weight and label transition quantity for labeled data are known, and we only optimize the values for unlabeled data.
To make sure that our method does not have the privilege of training on more data, all compared methods combine the validation data with raw training data as a new training set.
4.2 MNIST Handwritten Digit Recognition Task
MNIST (LeCun et al., 1998) is a standard dataset for handwritten digit classification. We select a subset of its images and split it into four parts: a training set with 7,000 examples, a validation set with 1,000 examples, a hyper-validation set with 1,000 examples used to monitor training progress and tune hyper-parameters, and a test set. We also investigate the impact of biased validation data by subsampling a class-imbalanced validation set. Specifically, the ratio between images belonging to classes 0-4 and classes 5-9 is shifted from 1:1 to 1:3 in the biased validation set, and 3 validation sets are generated with the bootstrap strategy.
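One way to draw such a class-imbalanced validation set is sketched below. The function name `subsample_biased_validation` is hypothetical, and sampling without replacement within each class group is an assumption; the text only specifies the 1:3 ratio between classes 0-4 and classes 5-9.

```python
import numpy as np

def subsample_biased_validation(labels, size, rng, ratio=(1, 3)):
    """Pick `size` indices so that classes 0-4 and classes 5-9 appear in `ratio`."""
    labels = np.asarray(labels)
    low = np.flatnonzero(labels <= 4)        # candidates from classes 0-4
    high = np.flatnonzero(labels >= 5)       # candidates from classes 5-9
    n_low = size * ratio[0] // sum(ratio)
    n_high = size - n_low
    chosen = np.concatenate([
        rng.choice(low, size=n_low, replace=False),
        rng.choice(high, size=n_high, replace=False),
    ])
    return rng.permutation(chosen)           # shuffle so the groups are mixed
```

For a validation set of 1,000 examples and the default ratio, this selects 250 images from classes 0-4 and 750 images from classes 5-9.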
Label Noise Learning Results For label noise learning, we add uniform flip noise to the training set, meaning that a label of any class can flip uniformly to any other class; this is the most studied noise type in the literature (Frénay & Verleysen, 2014). Classification accuracy under label noise is summarized in Table 1. All results are averaged over 5 runs with different random splits of the dataset.
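Uniform flip noise of this kind can be injected as in the following sketch. The helper name `add_uniform_flip_noise` is hypothetical, and flipping exactly a `noise_ratio` fraction of the labels (rather than flipping each label independently with that probability) is an assumption.

```python
import numpy as np

def add_uniform_flip_noise(labels, noise_ratio, num_classes, rng):
    """Flip a `noise_ratio` fraction of labels to a different class, chosen uniformly."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n = len(labels)
    num_flip = int(round(noise_ratio * n))
    flip_idx = rng.choice(n, size=num_flip, replace=False)
    # An offset in 1..num_classes-1 always lands on a different class,
    # and every other class is equally likely.
    offsets = rng.integers(1, num_classes, size=num_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx
```

The returned `flip_idx` records which instances were corrupted, which is convenient when evaluating how well a method recovers the original labels.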
From the comparison results in Table 1, our method achieves the best performance among all methods on both the biased and the unbiased validation sets. Methods that only maximize performance, such as S-Model, can suffer performance degradation. The overall performance improvement of safeness-only methods (such as SafeW) is limited. Our approach is not inferior to the baseline approach and achieves the maximum performance improvement. Using the validation sets alone is not effective, because the amount of validation data is small and it is difficult to train a good model on it. The above results demonstrate that PGS is not equivalent to the baseline methods, which simply train a model on the validation and training data. In contrast, PGS utilizes the validation data to improve the label quality of the training set and thus improves the final performance.
Semi-Supervised Learning Results For semi-supervised learning, only a fraction of the training data is labeled while the rest is unlabeled. As for label noise learning, the classification accuracy is summarized in Table 2, and results are averaged over 5 runs with different random splits.
Results in Table 2 show that PGS also achieves good performance on semi-supervised learning. The key difference between semi-supervised learning and label noise learning is that in semi-supervised learning we know which instances have high-quality labels; thus, PGS achieves better performance with fewer trusted labeled training instances.
4.3 CIFAR-10 Image Classification Task
CIFAR-10 (LeCun et al., 1998) is a benchmark for image classification. The dataset consists of natural images in 10 categories. We again subsample a set of images and split it into four parts: a training set with 7,000 examples, a validation set with 1,000 examples, a hyper-validation set with 1,000 examples, and a test set. To construct a biased validation set, the ratio between the top five classes and the bottom five classes is shifted to 1:3, and we create 3 validation sets using the bootstrap strategy.
Label Noise Learning Results The classification accuracy for label noise learning with uniform flip noise is summarized in Table 3. All results are averaged over 5 runs with different random splits of the dataset. The results in Table 3 show that PGS also obtains the maximum performance improvement on label noise learning and does not suffer from the performance degradation problem.
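Table 3: Classification accuracy (%) on CIFAR-10 for label noise learning.
| Unbiased validation set | 58.3 ± 0.43 | 49.5 ± 0.24 | 64.5 ± 0.39 | 63.3 ± 0.87 | 62.0 ± 0.42 | 66.3 ± 0.13 |
| Biased validation set | 56.8 ± 0.47 | 42.3 ± 0.20 | 64.3 ± 0.41 | 62.9 ± 0.67 | 61.5 ± 0.44 | 64.5 ± 0.33 |

Table 4: Classification accuracy (%) on CIFAR-10 for semi-supervised learning.
| Unbiased validation set | 60.7 ± 0.38 | 64.4 ± 0.53 | 65.7 ± 0.28 | 62.1 ± 0.50 | 63.5 ± 0.39 | 68.8 ± 0.33 |
| Biased validation set | 59.0 ± 0.38 | 63.6 ± 0.49 | 64.9 ± 0.25 | 61.8 ± 0.43 | 62.5 ± 0.41 | 65.7 ± 0.40 |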
Semi-Supervised Learning Results The classification accuracy for semi-supervised learning is reported in Table 4. All results are averaged over 5 runs with different random splits of the dataset. The results in Table 4 show that PGS is also highly competitive with all compared methods on semi-supervised learning.
Table 5: Mean squared error on the regression datasets.
| Label Noise Learning |
| abalone | .017 ± .004 | .010 ± .002 |
| bodyfat | .150 ± .016 | .102 ± .013 | .087 ± .010 |
| cpusmall | .005 ± .001 | .003 ± .001 | .002 ± .001 |
| mg | .033 ± .004 | .030 ± .005 | .026 ± .004 |
| mpg | .081 ± .011 | .074 ± .009 | .021 ± .005 |
| space_ga | .100 ± .014 | .093 ± .010 | .021 ± .006 |
| Semi-Supervised Learning |
| abalone | .015 ± .003 | .012 ± .020 | .011 ± .002 |
| bodyfat | .089 ± .023 | .062 ± .019 | .053 ± .011 |
| cpusmall | .006 ± .001 | .003 ± .001 | .002 ± .001 |
| mg | .039 ± .004 | .031 ± .006 | .025 ± .008 |
| mpg | .084 ± .013 | .040 ± .011 | .029 ± .003 |
| space_ga | .060 ± .009 | .031 ± .010 | .019 ± .004 |
4.4 Datasets for Regression Task
We also conduct experiments on six regression datasets to verify the effectiveness of our proposal on regression tasks. The datasets cover diverse domains including physical measurements (abalone), health (bodyfat), economics (cadata), activity recognition (mpg), etc. We split each dataset into four parts: training set, validation set, hyper-validation set, and test set. All features and labels are normalized. For label noise learning, we add Gaussian noise to the training set, and for semi-supervised learning, we select 40% of the training set as the labeled data.
Since few works focus on weakly supervised regression tasks, for label noise learning we only compare with the baseline method, which uses the training set and validation set simultaneously, and a noise-robust regression method, Support Vector Regression (SVR). For semi-supervised learning, we compare with the baseline method and the Self-LS method, which is an extension of the supervised least squares method based on self-training. For PGS, we use linear regression as the model.
Label Noise Learning Results We report the mean squared error for these UCI datasets in Table 5. Each result is averaged over 10 runs under a 50% noise ratio. From Table 5, we can see that our proposal achieves the largest improvement, even when SVR suffers from the performance degeneration problem. These results demonstrate the effectiveness of PGS for label noise regression tasks.
4.5 Label Quality Improvement
To further test the effectiveness of label quality improvement, we compare PGS with two methods: Influence function (Koh & Liang, 2017), which corrects labels by perturbing a training point and measuring the change in prediction; and Nearest Neighbor, which recommends the label of the closest validation example when asked for a suggested label correction. These two methods are commonly used as baselines for mislabel correction (Zhang et al., 2018).
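The Nearest Neighbor baseline is simple enough to sketch directly. The function name `nearest_neighbor_suggestion` is hypothetical, and the use of Euclidean distance in the raw feature space is an assumption, since the text does not state the metric.

```python
import numpy as np

def nearest_neighbor_suggestion(X_train, X_val, y_val):
    """For each training point, suggest the label of its closest validation point."""
    # Pairwise squared Euclidean distances, shape (n_train, n_val).
    diffs = X_train[:, None, :] - X_val[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    return y_val[sq_dist.argmin(axis=1)]
```

A training label would then be flagged for correction whenever it disagrees with the suggested label.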
From Table 6, we can see that PGS dominates the compared methods, which further demonstrates the effectiveness of our proposal in label quality improvement.
4.6 Validation Set Size
We also study the impact of the validation set size. We investigate how the classification accuracy varies as the size of the validation set increases from 100 to 1000 on the MNIST dataset. We run PGS on both label noise learning with a 50% noise ratio and semi-supervised learning with 40% labeled data. The accuracy improvement against the baseline method is plotted in Figure 3.
From Figure 3, we can see that once the validation set reaches a small size (such as 100 samples), the proposal already achieves a clear performance improvement. Moreover, the experiment shows that increasing the size of the validation set continues to improve the performance significantly.
5 Conclusion
This paper presents a new WSL framework. Compared with previous works, it considers both 1) performance maximization and 2) performance safeness. Both requirements are indispensable in many practical tasks. We use a small-scale validation set to guide label quality optimization, and the effectiveness of the resulting solution is verified by extensive experimental results. We believe that the new framework opens a door to reliable weakly supervised learning. Follow-up directions, such as integrating an adversarial mechanism, can be studied further.
References
- Balsubramani & Freund (2015) Balsubramani, A. and Freund, Y. Optimally combining classifiers using unlabeled data. In Proceedings of the 28th Conference on Learning Theory, pp. 211–225, 2015.
- Bard (2013) Bard, J. F. Practical bilevel optimization: algorithms and applications. Springer Science & Business Media, 2013.
- Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
- Cadamuro & Zhu (2016) Cadamuro, G. and Zhu, X. Debugging machine learning models. In ICML Workshop on Reliable Machine Learning in the Wild, 2016.
- Chapelle & Zien (2005) Chapelle, O. and Zien, A. Semi-supervised classification by low density separation. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 57–64, 2005.
- Chapelle et al. (2006) Chapelle, O., Scholkopf, B., and Zien, A. Semi-supervised learning. MIT Press, 2006.
- Chawla et al. (2005) Chawla, N. V. et al. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research, pp. 331–366, 2005.
- Cozman et al. (2003) Cozman, F. G., Cohen, I., and Cirelo, M. C. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning, pp. 99–106, 2003.
- Franceschi et al. (2017) Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 1165–1173, 2017.
- Franceschi et al. (2018) Franceschi, L., Frasconi, P., Salzo, S., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1563–1572, 2018.
- Frénay & Verleysen (2014) Frénay, B. and Verleysen, M. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, pp. 845–869, 2014.
- Gamberger et al. (2000) Gamberger, D., Lavrac, N., and Dzeroski, S. Noise detection and elimination in data preprocessing: experiments in medical domains. Applied Artificial Intelligence, pp. 205–223, 2000.
- Goldberger & Ben-Reuven (2017) Goldberger, J. and Ben-Reuven, E. Training deep neural-networks using a noise adaptation layer. In Proceedings of the 5th International Conference on Learning Representations, 2017.
- Guo & Li (2018) Guo, L.-Z. and Li, Y.-F. A general formulation for safely exploiting weakly supervised data. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 3126–3133, 2018.
- Huang et al. (2014) Huang, F., Ahuja, A., Downey, D., Yang, Y., Guo, Y., and Yates, A. Learning representations for weakly supervised natural language processing tasks. Computational Linguistics, pp. 85–120, 2014.
- Joachims (1999) Joachims, T. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pp. 200–209, 1999.
- Kasabov & Pang (2004) Kasabov, N. and Pang, S. Transductive support vector machines and applications in bioinformatics for promoter recognition. Neural Information Processing-Letters and Reviews, pp. 1–6, 2004.
- Koh & Liang (2017) Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894, 2017.
- Laine & Aila (2017) Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
- Lee (2013) Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, pp. 2–8, 2013.
- Li & Zhou (2011) Li, Y.-F. and Zhou, Z.-H. Towards making unlabeled data never hurt. In Proceedings of the 28th International Conference on Machine Learning, pp. 1081–1088, 2011.
- Li & Zhou (2015) Li, Y.-F. and Zhou, Z.-H. Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 175–188, 2015.
- Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2113–2122, 2015.
- Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
- Reed et al. (2014) Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.
- Ren et al. (2018a) Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations, 2018a.
- Ren et al. (2018b) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4331–4340, 2018b.
- Settles (2012) Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, pp. 1–114, 2012.
- Sinha et al. (2018) Sinha, A., Malo, P., and Deb, K. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, pp. 276–295, 2018.
- Sun et al. (2007) Sun, J.-w., Zhao, F.-y., Wang, C.-j., and Chen, S.-f. Identifying and correcting mislabeled training instances. In Future Generation Communication and Networking, pp. 244–250, 2007.
- Tarvainen & Valpola (2017) Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204, 2017.
- Van Hulse & Khoshgoftaar (2009) Van Hulse, J. and Khoshgoftaar, T. Knowledge discovery from imbalanced and noisy data. IEEE Transactions on Knowledge and Data Engineering, pp. 1513–1542, 2009.
- Vezhnevets et al. (2012) Vezhnevets, A., Ferrari, V., and Buhmann, J. M. Weakly supervised structured output learning for semantic segmentation. In
- Wang et al. (2017) Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., and Summers, R. M. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, pp. 3462–3471, 2017.
- Zhang et al. (2018) Zhang, X., Zhu, X., and Wright, S. J. Training set debugging using trusted items. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 4482–4489, 2018.
- Zhou et al. (2004) Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pp. 321–328, 2004.
- Zhou (2017) Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, pp. 44–53, 2017.
- Zhu et al. (2003) Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pp. 912–919, 2003.
A. Proof of Theorem 1
(Convergence). Suppose the empirical loss function is Lipschitz continuous. Let be the optimal solution to the lower-level optimization problem, then as , we have .
Since the is Lipschitz continuous, there exists such that for every and every ,
With , it is obvious that , thus we have that .
Using the continuity of , we have ,
Therefore, we have
which completes the proof. ∎
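The chain of reasoning in this proof follows the standard Lipschitz argument, which can be written out in generic notation. The symbols below are assumptions chosen for illustration and are not necessarily those used in the theorem: $f$ stands for the loss, $K$ for its Lipschitz constant, $\theta_T$ for the lower-level iterate after $T$ steps, and $\theta^{*}$ for the lower-level optimal solution, with the iterates assumed to converge to it.

```latex
% Assumed notation (illustrative only): f = loss, K = Lipschitz constant,
% \theta_T = iterate after T lower-level steps, \theta^{*} = lower-level optimum.
\begin{align*}
|f(\theta) - f(\theta')| &\le K \,\|\theta - \theta'\|
  \quad \text{for all } \theta, \theta', \\
\|\theta_T - \theta^{*}\| &\to 0 \quad \text{as } T \to \infty, \\
\Longrightarrow \quad |f(\theta_T) - f(\theta^{*})| &\le K \,\|\theta_T - \theta^{*}\| \to 0 .
\end{align*}
```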
B. Impact of Iteration Number
We investigate how the number of iterations in the optimization dynamics affects the quality of the solution. The result is plotted in Figure 4, where the x-axis is the number of iterations used to optimize the training loss, the y-axis is the number of iterations used to optimize the validation loss, and the z-axis is the classification accuracy.
From Figure 4, we can see that more lower-level iterations yield a better initialization for the upper-level optimization problem, and that the upper-level optimization then needs only a small number of iterations (e.g., 35) to reach a good solution.
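The interplay between the two iteration counts can be illustrated on a toy problem. The following is a minimal sketch, not the model or the experimental setup of this paper: it assumes a one-dimensional quadratic training loss with a single regularization weight `lam` as the upper-level variable, a quadratic validation loss, and a hypergradient obtained by differentiating through the unrolled lower-level gradient steps. All names (`inner_solve`, `val_loss`, `bilevel`) and constants are hypothetical.

```python
def inner_solve(lam, num_inner, lr=0.1, a=2.0):
    """Lower level: run num_inner gradient steps on the toy training loss
    0.5 * (theta - a)**2 + 0.5 * lam * theta**2, starting from theta = 0.
    Also propagates d(theta)/d(lam) through the unrolled updates."""
    theta, dtheta = 0.0, 0.0
    for _ in range(num_inner):
        grad = (theta - a) + lam * theta
        # Derivative of the update rule w.r.t. lam (uses theta before the step).
        dtheta = dtheta * (1.0 - lr * (1.0 + lam)) - lr * theta
        theta = theta - lr * grad
    return theta, dtheta


def val_loss(theta, b=1.0):
    """Upper level: toy validation loss with target b."""
    return 0.5 * (theta - b) ** 2


def bilevel(num_inner, num_outer, outer_lr=0.5, b=1.0):
    """Alternate num_outer hypergradient steps on lam, each preceded by
    num_inner lower-level steps. Returns (lam, final validation loss)."""
    lam = 0.0
    for _ in range(num_outer):
        theta, dtheta = inner_solve(lam, num_inner)
        hypergrad = (theta - b) * dtheta      # chain rule through theta(lam)
        lam = max(0.0, lam - outer_lr * hypergrad)  # keep lam non-negative
    theta, _ = inner_solve(lam, num_inner)
    return lam, val_loss(theta, b)


if __name__ == "__main__":
    for num_inner in (2, 10, 200):
        lam, loss = bilevel(num_inner, num_outer=5)
        print(f"inner={num_inner:3d}  lam={lam:.4f}  val_loss={loss:.6f}")
```

In this toy setting, a lower-level solve that is truncated too early leaves the upper-level step with a poor estimate of the solution and its sensitivity, whereas a well-converged lower level lets a handful of upper-level steps suffice, which is qualitatively the trend described above.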