Learning with deep neural networks has enjoyed huge empirical success in recent years across a wide variety of tasks, from image processing to speech recognition, and from language modeling to recommender systems (Goodfellow et al., 2016). However, this success relies heavily on the availability of large, well-annotated data sets, which are rarely available in real-world applications. Instead, what we face in practice are large data sets collected from crowd-sourcing platforms or crawled from the Internet, which therefore contain many corrupted labels (Li et al., 2017b; Patrini et al., 2017). Moreover, due to their vast learning capacity, deep networks will eventually over-fit these corrupted labels, leading to poor predictive performance, which can be worse than that obtained from simple models (Zhang et al., 2016; Arpit et al., 2017).
focus on estimating the noise transition matrix. However, the noise transition matrix is hard to estimate accurately, especially when the number of classes is large (Han et al., 2018). To avoid estimating the noise transition matrix, a promising direction is to train networks only on selected instances that are more likely to be clean (Jiang et al., 2018; Han et al., 2018; Ma et al., 2018; Yu et al., 2019). Intuitively, as the training data becomes less noisy, better performance can be obtained. Among these works, the representative methods are MentorNet (Jiang et al., 2018) and Co-teaching (Han et al., 2018; Yu et al., 2019). Specifically, MentorNet pre-trains an extra network, which is then used to select clean instances to guide the training. When clean validation data is not available, MentorNet has to use a predefined curriculum (Bengio et al., 2009). Co-teaching simultaneously maintains two networks with identical architectures during training, and in each mini-batch of data, each network is updated using the other network's small-loss instances. Empirical results demonstrate that under both extreme and low-level noise, Co-teaching achieves much better performance than MentorNet.
is the crux. Memorization happens widely across deep network architectures, e.g., the multilayer perceptron (MLP) and the convolutional neural network (CNN): they all tend to learn easy patterns first and then over-fit the (possibly noisy) training data. Due to this effect, sample-selection methods can learn correct patterns at an early stage and then use the obtained discriminative ability to filter out corrupted instances in subsequent training epochs (Han et al., 2018; Chen et al., 2019). However, these methods are hard to tune for good performance. The problem is that it is difficult to control exactly how many instances should be filtered out during different stages of training, and trivial attempts can easily lead to performance even worse than that of standard deep networks (Han et al., 2018). Some recent endeavors seek to evade this problem by integrating auxiliary information, e.g., a small clean subset is used in (Ren et al., 2018), and knowledge graphs are utilized in (Li et al., 2017b).
In this paper, motivated by the success of automated machine learning (AutoML) on designing data-dependent models (Hutter et al., 2018; Yao et al., 2018), we propose to exploit memorization effects automatically using AutoML techniques. Contributions are summarized as follows:
First, to gain an in-depth understanding of why it is difficult to exploit the memorization effect, we examine its behavior from the perspective of practical usage. We find that, while there exist general patterns in how memorization occurs during training, it is hard to quantify to what extent the effect will happen. In particular, memorization can be affected by many factors, e.g., the data sets used, the noise types, the network architectures, and the choice of optimizer.
To make good use of AutoML techniques, we then derive an expressive search space for exploiting memorization based on the above observations, i.e., the curve of how many instances should be sampled during training should be similar to the inverse of the learning curve on the validation set. Such a space is not too large, since it has only a few variables, allowing subsequent algorithms to converge quickly to promising candidates. At the same time, it is not too small, as it covers all necessary functions, i.e., not just a specific function used in the past, but also many functions that may be considered in the future.
Then, to design an efficient algorithm, we first show the difficulty of obtaining gradients in our space and the failure of weight-sharing in the presence of corrupted labels. These observations motivate us to take a probabilistic view of the search problem and adopt natural gradient descent (Amari, 1998; Pascanu and Bengio, 2013) for optimization. The designed algorithm effectively addresses the above problems and is significantly faster than other popular search algorithms.
Finally, we conduct extensive experiments on both synthetic and benchmark data sets, under various settings and with different network architectures. These experiments demonstrate that the proposed method is not only much more efficient than existing AutoML algorithms, but also achieves much better performance than state-of-the-art sample-selection approaches designed by humans. Besides, we visualize and explain the searched functions, which can also help design better rules for controlling memorization effects in the future.
2 Related work
2.1 Learning from Noisy Labels
The mainstream research focuses on class-conditional noise (CCN) (Angluin and Laird, 1988), where the label corruption is independent of the features. Generally, recent methods for handling the CCN model can be classified into three categories. The first is based on estimating the transition matrix, which tries to capture how correct labels flip into wrong ones (Sukhbaatar et al., 2015; Reed et al., 2015; Patrini et al., 2017; Ghosh et al., 2017). These methods then use the estimated matrix to correct gradients or losses during training. However, they are fragile under heavy noise and unable to handle many classes (Han et al., 2018). The second is the regularization approach (Miyato et al., 2016; Laine and Aila, 2017; Tarvainen and Valpola, 2017). Although the regularization approach can achieve satisfying performance, it is still incomplete, since (Jiang et al., 2018) shows that it only delays over-fitting rather than avoiding it, i.e., given enough training time, the model can still fit the noisy data completely. Thus, much domain knowledge is required to determine the appropriate number of training epochs to prevent over-fitting. The last is the sample-selection approach, which attempts to reduce the negative effects of noisy labels by selecting clean instances during training. Recent state-of-the-art methods are also built on the sample-selection approach (Jiang et al., 2018; Han et al., 2018; Malach and Shalev-Shwartz, 2017; Yu et al., 2019).
A promising criterion for selecting “clean instances” is to pick instances that have relatively small losses in each mini-batch (Jiang et al., 2018; Han et al., 2018). The fundamental property behind these methods is the memorization effect of deep networks (Zhang et al., 2016; Arpit et al., 2017): deep networks learn simple patterns first and only then start to over-fit. This effect helps classifiers build up discriminative ability in the early stage, making clean instances more likely to have smaller losses than corrupted ones. The general framework of the sample-selection approach is given in Algorithm 1. Specifically, some small-loss instances are selected from the mini-batch in step 5. These “clean” instances are then used to update the network parameters in step 6. The sampling rate in step 8, which controls how many instances are kept in each epoch, is the most important hyper-parameter, as it explicitly exploits the memorization effect.
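As a concrete illustration, the small-loss selection in steps 5–6 of Algorithm 1 can be sketched as follows; the function name and the toy losses are ours, not the paper's:

```python
def select_small_loss(batch_losses, keep_ratio):
    """Return the indices of the `keep_ratio` fraction of instances
    with the smallest losses in a mini-batch (treated as "clean")."""
    n_keep = max(1, int(len(batch_losses) * keep_ratio))
    order = sorted(range(len(batch_losses)), key=lambda i: batch_losses[i])
    return order[:n_keep]

# Toy mini-batch: instances with corrupted labels tend to incur larger losses.
losses = [0.1, 2.3, 0.2, 1.9, 0.15, 0.3]
clean_idx = select_small_loss(losses, keep_ratio=0.5)  # -> [0, 4, 2]
```

Only the selected indices are then used for the gradient update, so the large-loss instances (indices 1 and 3 above), which are more likely corrupted, never touch the weights.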
However, it is hard to determine exactly what proportion of small-loss samples should be selected in each epoch (Jiang et al., 2018; Ren et al., 2018). As will be discussed in Section 3.1, due to various practical usage issues, it is hard to quantify to what extent the memorization effect will happen. Thus, the performance obtained from existing solutions is far from desired, and we are motivated to solve this issue with AutoML.
2.2 Automated Machine Learning (AutoML)
Automated machine learning (AutoML) (Hutter et al., 2018; Yao et al., 2018) has recently exhibited its power in easing the use of machine learning models and in designing better ones. Basically, AutoML can be regarded as a black-box optimization problem in which we must efficiently and effectively search for hyper-parameters or designs of the underlying learning models, as evaluated on the validation set.
Regarding the success of AutoML, there are two important perspectives (Feurer et al., 2015; Zoph and Le, 2017; Xie and Yuille, 2017; Bender et al., 2018; Yao et al., 2019a, b): 1). Search space: the search space is domain-specific. First, it needs to be general enough, meaning it should cover existing models as special cases. This also helps experts better understand the limitations of existing models and thus facilitates future research. However, the space cannot be too general, otherwise searching in it becomes too expensive. 2). Search algorithm: optimization problems in AutoML are usually black-box. Unlike in convex optimization, there are no universal and efficient optimization tools. Once the search space is determined, domain knowledge should also be exploited in the design of the search algorithm, so that good candidates in the space can be identified efficiently.
There are two types of search algorithms popularly used in the AutoML literature. The first is derivative-free optimization, which is usually used for searching in a general search space, e.g., reinforcement learning (Zoph and Le, 2017; Baker et al., 2017), genetic programming (Escalante et al., 2009; Xie and Yuille, 2017), and Bayesian optimization (Feurer et al., 2015; Snoek et al., 2012). More recently, one-shot gradient-based methods, which alternately update parameters and hyper-parameters, have been developed as more efficient replacements for derivative-free methods on some domain-specific search spaces, e.g., the supernet in neural architecture search (Bender et al., 2018; Liu et al., 2019; Akimoto et al., 2019; Xie et al., 2018). These methods have two critical requirements: gradients w.r.t. the hyper-parameters can be computed, and parameters can be shared among different hyper-parameters.
3 The Proposed Method
Here, we first take a closer look at why it is difficult to exploit the memorization effect (Section 3.1). This also helps us identify key observations on how memorization happens. These observations subsequently enable us to design an expressive but compact search space (Section 3.2), and motivate us to use the natural gradient method (Amari, 1998; Ollivier et al., 2017), which can generate gradients in the parameterized space, for efficient optimization (Section 3.3).
3.1 Key Observations
As discussed in Section 2.2, an important challenge in designing search spaces is to balance the size (or dimension) of the search space against its expressive ability. An overly constrained search space may not contain candidates with satisfying performance, whereas an overly large space is difficult to search effectively. All this requires an in-depth understanding of the memorization effect. Thus, we are motivated to look at factors that can affect memorization in the practical usage of deep networks, and to seek patterns in the resulting influences. Specifically, we examine memorization when the data sets, architectures, or optimizers are changed. Results are shown in Figure 1. From these figures, we observe:
Cornerstone of space design: there exists a general pattern across all cases, i.e., every model's test accuracy first increases and then decreases.
This general pattern is consistent with that in the literature (Zhang et al., 2016; Arpit et al., 2017; Tanaka et al., 2018; Han et al., 2018). It is also the key domain knowledge for designing an expressive and compact search space, which makes the subsequent search possible. However, the more important observation is:
Need for AutoML: the curve can be significantly affected by these factors. When the peak will appear (i.e., when the model stops learning from simple patterns and starts to over-fit), and to what extent the performance will drop from the peak (i.e., over-fitting on noisy labels), are both hard to quantify.
This observation shows great variety in the appearance of the memorization effect, which further motivates the need to exploit the effect automatically. Besides, it is hard to know in advance what the learning curve will look like, and thus impossible to manually design the sampling rule before learning is performed.
3.2 Search Space
Recall that in step 8 of Algorithm 1, the sampling rate controls how many instances are kept in each mini-batch, and we want to exploit the memorization effect through its variation. Based on the first empirical observation in Section 3.1, our design should first satisfy:
Curvature: the sampling rate should be similar to the inverse of the learning curve. In other words, it should first drop, then (possibly) rise.
The reason behind this constraint is as follows: since the learning curve represents the model's accuracy, when it rises, we should drop more large-loss samples, as a large loss is more likely the result of a corrupted label than of the model's misclassification. When the learning curve falls, we should drop fewer samples to help the model learn more.
Range: the sampling rate should lie in [0, 1] and equal 1 at the start of training.
Since the sampling rate denotes the proportion of selected instances, it naturally lies in [0, 1]. Besides, at the beginning of training, dropping small-loss samples is the same as dropping samples randomly, and we need to pass a sufficient number of instances so that the model can learn patterns.
As the sampling rate can itself be seen as a function that takes the epoch as input and outputs a scalar, we can use some “basis functions” to construct a complicated schedule step by step. To make it first decrease and then possibly increase, we use two types of functions, decreasing terms and increasing terms, to construct each candidate. We choose some simple monotone functions as the decreasing and increasing terms, as shown in Figure 2, and the final schedule can be written as:
where the hyper-parameters control the shape of each term (for example, the parameters listed in the table) and the weights combine the terms into the final schedule.
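As a sketch of how such a schedule could be assembled, the following uses one exponentially decreasing term and one saturating increasing term; these particular basis functions, and the constants in the usage example, are illustrative assumptions rather than the paper's exact table:

```python
import math

def make_sampling_schedule(alpha, beta, w_dec, w_inc):
    """Combine one decreasing and one increasing monotone term into a
    sampling rate over epochs t, clipped into [0, 1].
    alpha/beta shape the terms; w_dec/w_inc weight them."""
    def R(t):
        dec = math.exp(-alpha * t)        # decreasing term
        inc = 1.0 - math.exp(-beta * t)   # increasing term
        return min(1.0, max(0.0, w_dec * dec + w_inc * inc))
    return R

R = make_sampling_schedule(alpha=0.5, beta=0.01, w_dec=1.0, w_inc=0.3)
# R starts at 1 (use almost all data early), first drops, then slowly rises.
```

With these weights the schedule satisfies both requirements above: it equals 1 at the start, stays in [0, 1], and first decreases before (possibly) increasing.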
Let a clean validation set be given, and let Algorithm 1 be run with network parameters and a sampling rule parameterized as in equation 1; the validation performance of the resulting model can then be measured. Thus, the memorization effect can be automatically exploited by solving the following problem:
Then, the optimal curve is derived from the searched hyper-parameters.
3.3 Search Algorithm
Here, we first discuss the problems of using one-shot gradient-based methods in our setting (Section 3.3.1). These problems motivate us to design an efficient search algorithm based on the natural gradient (Section 3.3.2).
3.3.1 Issues of Existing Algorithms
Gradient-based methods need the chain rule to obtain the gradient w.r.t. the hyper-parameters from the gradient w.r.t. the network weights. This condition does not hold for our problem, since our hyper-parameters control how many samples are used to update the weights; thus, the required gradient cannot be computed here. Besides, the learned parameters cannot be shared across different hyper-parameter settings. In previous methods, the hyper-parameters are not coupled with the training process, e.g., network architectures (Liu et al., 2019; Akimoto et al., 2019) and regularization (Luketina et al., 2016). Here, however, the sampling rule heavily influences the training process, as also shown in our Figure 1. Thus, one-shot gradient-based methods cannot be applied.
3.3.2 Proposed Algorithm
The analysis above demonstrates that only derivative-free methods are applicable to our problem, and these can be slow. Here, we find that the natural gradient (NG) (Amari, 1998; Pascanu and Bengio, 2013; Ollivier et al., 2017) can be used instead. The key point is that NG can still benefit from gradient descent, but without computing gradients w.r.t. the hyper-parameters or sharing parameters among them.
The basic idea of NG is as follows: instead of directly optimizing the hyper-parameters, we consider a random distribution over them, parameterized by its own distribution parameters, and maximize the expected validation performance under this distribution, i.e.,
To optimize w.r.t. the distribution parameters, NG performs the update
where the scalar is the step-size, the matrix is the Fisher information at the current distribution parameters, and
First, parameters are not shared between candidates, as each evaluation uses a full model trained as in equation 4. Second, by sampling candidates from the given distribution in each iteration, equation 6 needs only the validation performance (without computing hyper-gradients). Besides, this gradient has been proven to give the steepest-ascent direction in the probabilistic space spanned by the distribution (Theorem 1 in (Amari, 1998)).
The last question is how to choose the distribution, which may significantly influence the algorithm's convergence behavior. Fortunately, NG exhibits strong robustness against different choices of the distribution. The reason is that the Fisher matrix, which encodes a second-order approximation to the Kullback-Leibler divergence, can accurately capture the curvature introduced by various parameterizations. Thus, NG descent is parameterization-invariant, has good generalization ability, and can moreover be regarded as a second-order method in the distribution space (Ollivier et al., 2017). The proposed search algorithm is given in Algorithm 2. As will be shown in the experiments, all these properties make NG an ideal search algorithm here, and it can be faster than other popular AutoML approaches.
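To make the idea concrete, here is a minimal sketch of such a search loop, under the assumption of a one-dimensional Gaussian search distribution with fixed variance, for which the Fisher information is 1/sigma^2 and the natural-gradient update on the mean reduces to an average of score-weighted deviations; the toy score function stands in for validation performance and is not the paper's setup:

```python
import random

def ng_search(score, mu, sigma=0.3, lr=0.5, n_samples=30, n_iters=80, seed=0):
    """Natural-gradient search over one hyper-parameter.
    Only the scores of sampled candidates are needed (no hyper-gradients,
    no weight-sharing): grad_mu log N(x; mu, sigma^2) = (x - mu) / sigma^2,
    and preconditioning by the inverse Fisher information (sigma^2)
    leaves the natural gradient E[score(x) * (x - mu)]."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        scores = [score(x) for x in xs]
        baseline = sum(scores) / len(scores)  # variance reduction
        nat_grad = sum((s - baseline) * (x - mu)
                       for s, x in zip(scores, xs)) / len(xs)
        mu += lr * nat_grad
    return mu

# Toy "validation performance" peaked at 0.7; the search recovers it
# using nothing but black-box evaluations.
best = ng_search(lambda x: -(x - 0.7) ** 2, mu=0.0)
```

Note that the loop never differentiates `score` and never reuses state between candidates, which is exactly why NG sidesteps the two failures discussed in Section 3.3.1.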
4 Experiments
We implement all experiments using PyTorch 0.4.1 on a GTX 1080 Ti GPU.
4.1 Experiments on synthetic data
In this section, we demonstrate the superiority of the proposed search space and search algorithm on synthetic data. The ground-truth is shown in Figure 3(a); it is an example curve satisfying the two requirements in Section 3.2.
4.1.1 Search space comparison
The goal here is to approximate the ground-truth curve in Figure 3(a); the target is to minimize the RMSE between the estimated function and the ground-truth. Three different search spaces are compared: 1). Full space: there is no constraint on the space, and the estimate at each time stamp is searched independently; 2). Co-teaching's space in equation 3, where its hyper-parameters need to be estimated; and 3). The proposed space in equation 1, which encodes our observations in Section 3.1. Random search (Bergstra and Bengio, 2012) is performed in all three spaces.
Results are in Figure 3(b). We can see that placing no constraints on the search space leads to disastrous performance and an extremely slow convergence rate. Comparing our space with Co-teaching's, we see that ours can approximate the target better due to its larger degree of freedom.
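The gap between the spaces can be reproduced in miniature: with the same random-search budget, a few-parameter space shaped like the target reaches a far lower RMSE than an unconstrained per-epoch space. All curves and parameter ranges below are toy choices of ours, not the paper's exact setup:

```python
import math
import random

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def random_search(sample, target, n_trials=200, seed=0):
    """Keep the sampled candidate curve closest (in RMSE) to the target."""
    rng = random.Random(seed)
    best_err = float("inf")
    for _ in range(n_trials):
        best_err = min(best_err, rmse(sample(rng), target))
    return best_err

T = 100  # number of time stamps (epochs)
target = [max(0.2, math.exp(-0.05 * t)) + 0.002 * max(0, t - 50) for t in range(T)]

# 3-parameter constrained space (first drops, then rises) ...
def small_space(rng):
    a = rng.uniform(0.0, 0.2)
    floor = rng.uniform(0.0, 0.5)
    slope = rng.uniform(0.0, 0.01)
    return [max(floor, math.exp(-a * t)) + slope * max(0, t - 50) for t in range(T)]

# ... versus a 100-dimensional unconstrained "full" space.
def full_space(rng):
    return [rng.uniform(0.0, 1.5) for _ in range(T)]

err_small = random_search(small_space, target)
err_full = random_search(full_space, target)
# Under the same budget, err_small comes out far below err_full.
```

This mirrors the Figure 3(b) comparison: the constrained space trades away useless degrees of freedom for a dramatically easier search problem.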
4.1.2 Search algorithm comparison
We first test the proposed method's efficiency against other hyper-parameter optimization (HPO) algorithms on a synthetic problem: given a pre-defined target schedule, we search for a candidate with the lowest squared loss to the target. We compare the proposed method with random search (Bergstra and Bengio, 2012) and Bayesian optimization (Kandasamy et al., 2019), two popular methods for hyper-parameter optimization. Results are in Figure 3(c); they demonstrate that the proposed natural gradient approach has the fastest convergence rate and finds the best candidate, with the lowest RMSE to the target.
4.2 Experiments on benchmark data sets
We verify the efficiency of our approach on three benchmark data sets, i.e., MNIST, CIFAR-10 and CIFAR-100. These data sets are popularly used for evaluating learning with noisy labels (Zhang et al., 2016; Arpit et al., 2017; Jiang et al., 2018; Han et al., 2018). Following (Patrini et al., 2017; Han et al., 2018; Chen et al., 2019), we manually corrupt these data sets with two types of noise. (1) Symmetry flipping: part of the correct labels are flipped uniformly at random to the other classes; (2) Pair flipping: a simulation of fine-grained classification with noisy labels, where labelers may make mistakes only within very similar classes.
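The two corruption schemes can be sketched as follows, assuming the common convention that pair flipping sends class i to class i+1 (modulo the number of classes); the function and constants are ours, for illustration only:

```python
import random

def corrupt_labels(labels, n_classes, noise, kind="symmetric", seed=0):
    """Flip each label with probability `noise`.
    'symmetric': flip uniformly to one of the other classes.
    'pair': flip to the next class, mimicking confusion between
    very similar (fine-grained) classes."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise:
            if kind == "symmetric":
                y = rng.choice([c for c in range(n_classes) if c != y])
            else:  # pair flipping
                y = (y + 1) % n_classes
        noisy.append(y)
    return noisy

clean = [i % 10 for i in range(1000)]
noisy = corrupt_labels(clean, n_classes=10, noise=0.45, kind="pair")
# Roughly 45% of the labels are now shifted to the neighboring class.
```

Pair flipping is the harder setting: the noise is concentrated on one confusable class rather than spread over all of them, so the corrupted labels look far more plausible.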
We use the same network architecture as in (Yu et al., 2019). To measure performance, following (Patrini et al., 2017; Han et al., 2018; Chen et al., 2019), we use the test accuracy, i.e., the fraction of test instances that are predicted correctly. Intuitively, higher test accuracy means the algorithm is more robust to label noise.
4.2.1 Comparison on Learning Performance
To show that better accuracy can be achieved by automatically exploiting the memorization effect, we compare the proposed method with 1). MentorNet (Jiang et al., 2018); 2). Co-teaching (Han et al., 2018); 3). Co-teaching+ (Yu et al., 2019), which enhances Co-teaching with disagreement on predictions; 4). Decoupling (Malach and Shalev-Shwartz, 2017); 5). F-correction (Patrini et al., 2017); and 6). as a simple baseline, the standard deep network trained directly on the noisy data sets (abbreviated as Standard). As an example usage, the proposed method is combined with Co-teaching, i.e., Co-teaching is run with the searched sampling schedule. Figure 4 shows the comparison with these human-designed methods. The proposed method outperforms existing methods by a large margin, especially in the noisier cases (i.e., symmetric-50% and pair-45%). Besides, the proposed method not only beats Co-teaching, due to better exploitation of the memorization effect, but also beats Co-teaching+, which further filters the small-loss instances in Co-teaching by checking disagreement on the predicted labels. This demonstrates the importance and benefits of searching for a proper sampling schedule.
4.2.2 Case Study on Sampling Rate
To understand why the proposed method obtains much higher test accuracy, we plot the searched schedules in Figure 5. We can see that all methods eventually drop more large-loss instances than the ground-truth noise level. The reason is intuitive: a large-loss instance usually also has a larger gradient, so a single instance with a wrong label can have much more influence than several clean small-loss instances. Thus, we may want to drop more samples eventually. However, this is not an easy task: as shown in Table 8 of (Han et al., 2018), simply dropping more instances can significantly decrease the test accuracy, so the searched curve is of great importance. Another interesting observation is that the non-monotonicity allowed in Section 3.2 turns out not to be useful. This might be because monotone curves can already balance learning and over-fitting well here.
4.3 Comparison with HPO methods
Finally, in this section, we compare the proposed natural gradient (NG) algorithm with 1). random search (Bergstra and Bengio, 2012) and 2). Hyperband (Li et al., 2017a). Note that Bayesian optimization is slower than Hyperband and is thus not compared (Hyperband cannot be used in Section 4.1.2, as there are no inner loops there). Besides, reinforcement learning (RL) (Zoph and Le, 2017) is not compared, as the search problem is not a multi-step one; genetic programming (Xie and Yuille, 2017) is not considered either, as the search space is continuous. Figure 6 compares the proposed method with random search and Hyperband. From the figure, we can see that the natural gradient converges faster than the other HPO methods on this problem. The proposed method also finds better schedules under different data sets and noise settings.
5 Conclusion
In this paper, motivated by the difficulty of knowing to what extent the memorization effect of deep networks will happen, we propose to exploit memorization with automated machine learning (AutoML) techniques. This is done by first designing an expressive but compact search space, based on observed general patterns of memorization, and then designing a natural-gradient-based search algorithm, which overcomes the non-differentiability of the problem and the failure of parameter-sharing. Extensive experiments on both synthetic and benchmark data sets demonstrate that the proposed method is not only much more efficient than existing AutoML algorithms, but also achieves much better performance than state-of-the-art sample-selection approaches.
- Adaptive stochastic natural gradient method for one-shot neural architecture search. In ICML, pp. 171–180.
- Natural gradient works efficiently in learning. Neural Computation 10(2), pp. 251–276.
- Learning from noisy examples. Machine Learning 2(4), pp. 343–370.
- A closer look at memorization in deep networks. In ICML, pp. 233–242.
- Designing neural network architectures using reinforcement learning. In ICLR.
- Understanding and simplifying one-shot architecture search. In ICML, pp. 549–558.
- Curriculum learning. In ICML, pp. 41–48.
- Random search for hyper-parameter optimization. JMLR 13(Feb), pp. 281–305.
- Understanding and utilizing deep neural networks trained with noisy labels. In ICML, pp. 1062–1070.
- Particle swarm model selection. JMLR 10(Feb), pp. 405–440.
- Efficient and robust automated machine learning. In NeurIPS, pp. 2962–2970.
- Robust loss functions under label noise for deep neural networks. In AAAI, pp. 1919–1925.
- Deep learning. MIT Press.
- Co-teaching: robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8527–8537.
- Automated machine learning: methods, systems, challenges. Springer. In press, available at http://automl.org/book.
- MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pp. 2309–2318.
- Tuning hyperparameters without grad students: scalable and robust Bayesian optimisation with Dragonfly. arXiv preprint arXiv:1903.06694.
- Temporal ensembling for semi-supervised learning. In ICLR.
- Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18(1), pp. 6765–6816.
- Learning from noisy labels with distillation. In ICCV, pp. 1910–1918.
- DARTS: differentiable architecture search. In ICLR.
- Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, pp. 2952–2960.
- Dimensionality-driven learning with noisy labels. In ICML, pp. 3361–3370.
- Decoupling “when to update” from “how to update”. In NIPS, pp. 960–970.
- Virtual adversarial training: a regularization method for supervised and semi-supervised learning. In ICLR.
- Information-geometric optimization algorithms: a unifying picture via invariance principles. JMLR 18(1), pp. 564–628.
- Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.
- Making deep neural networks robust to label noise: a loss correction approach. In CVPR, pp. 2233–2241.
- Training deep neural networks on noisy labels with bootstrapping. In ICLR Workshop.
- Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340.
- Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959.
- Training convolutional networks with noisy labels. In ICLR Workshop.
- Joint optimization framework for learning with noisy labels. In CVPR, pp. 5552–5560.
- Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS.
- Genetic CNN. In ICCV, pp. 1388–1397.
- SNAS: stochastic neural architecture search. In ICLR.
- Searching for interaction functions in collaborative filtering. arXiv preprint arXiv:1906.12091.
- Taking human out of learning applications: a survey on automated machine learning. arXiv preprint arXiv:1810.13306.
- Differentiable neural architecture search via proximal iterations. arXiv preprint arXiv:1905.13577.
- How does disagreement help generalization against label corruption?. In ICML, pp. 7164–7173.
- Understanding deep learning requires rethinking generalization. In ICLR.
- Neural architecture search with reinforcement learning. In ICLR.