ChoiceNet: Robust Learning by Revealing Output Correlations

05/16/2018 ∙ by Sungjoon Choi, et al. ∙ Kakao Corp.

In this paper, we focus on the supervised learning problem with corrupted training data. We assume that the training dataset is generated from a mixture of a target distribution and other unknown distributions. We estimate the quality of each data instance by revealing the correlation between the generating distribution and the target distribution. To this end, we present a novel framework, referred to here as ChoiceNet, that can robustly infer the target distribution in the presence of inconsistent data. We demonstrate that the proposed framework is applicable to both classification and regression tasks. ChoiceNet is extensively evaluated in comprehensive experiments, where we show that it consistently outperforms existing baseline methods in the handling of noisy data. In particular, ChoiceNet is successfully applied to autonomous driving tasks, where it learns a safe driving policy from a dataset of mixed quality. In the classification task, we apply the proposed method to the CIFAR-10 dataset, and it shows superior performance in terms of robustness to noisy labels.


1 Introduction

Training a deep neural network requires immense amounts of training data, which are often collected using crowdsourcing methods such as Amazon's Mechanical Turk (AMT). In practice, however, crowd-sourced labels are often noisy Bi et al. (2014). Furthermore, deep neural networks are vulnerable to over-fitting on noisy training data in that they are capable of memorizing an entire dataset even with inconsistent labels, leading to poor generalization performance Zhang et al. (2016).

Assuming that a training dataset is generated from a mixture of a target distribution and other distributions, we address this problem through the principled idea of revealing the correlation between the target distribution and the other distributions. We present a framework for robust learning which is applicable to arbitrary neural network architectures, such as convolutional neural networks He et al. (2016a) or recurrent neural networks Chung et al. (2014). We call this framework ChoiceNet.

Throughout this paper, we aim to address the following questions:

  1. How can we measure the quality of training data in a principled manner?

  2. In the presence of inconsistent outputs, how can we infer the target distribution in a scalable manner?

Traditionally, noisy outputs are handled by modeling additive random distributions, often leading to robust loss functions Hampel et al. (2011). However, we argue that these approaches are too restrictive when handling severe outliers or inconsistencies in the datasets. To address the first question, we leverage the concept of correlation. Precisely, we measure the quality of training data using the correlation between the target distribution and the data-generating distribution. However, estimating the correct correlation requires access to the target distribution, whereas learning the correct target distribution requires knowing the correlation between the distributions, making this a chicken-and-egg problem. To address the second question, we simultaneously estimate the target distribution and the correlation in an end-to-end manner using stochastic gradient descent methods, in this case Adam Kingma and Ba (2014), to achieve scalability.

The cornerstone of the proposed method is a mixture of correlated density network (MCDN) block. First, we present a Cholesky transform method for sampling the weights of a neural network that enables us to model correlated outputs. We also present an effective regularizer for training ChoiceNet. To the best of our knowledge, this represents the first approach to simultaneously infer the target distribution and the output correlations using a neural network in an end-to-end manner.

Revealing output correlations was proposed in earlier work Bonilla et al. (2008), in which a multi-task Gaussian process prediction (MTGPP) model is presented. In particular, MTGPP uses correlated Gaussian processes to model multiple tasks by learning a free-form cross-covariance matrix. However, due to its multi-task learning setting, it is not suitable for learning a single target function. In other work Choi et al. (2016), a leverage optimization method which optimizes the leverage of each demonstration is proposed. Unlike the former study Bonilla et al. (2008), the latter Choi et al. (2016) focuses on inferring a single expert policy by incorporating a sparsity constraint, assuming that most demonstrations are collected from a skillful, consistent expert.

ChoiceNet is initially applied to a synthetic regression task, where we demonstrate its robustness to extreme outliers and its ability to distinguish the target distribution from noise distributions. We then apply it to an autonomous driving scenario in which the driving demonstrations are collected from both safe and careless drivers, and show that it can robustly learn a safe and stable driving policy. Subsequently, we move on to classification tasks using the MNIST and CIFAR-10 datasets. We show that the proposed method outperforms existing baseline methods in terms of robustness with regard to the handling of noisy labels.

2 Related Work

Recently, robustness in deep learning has been actively studied Fawzi et al. (2017) as deep neural networks are being applied to diverse real-world applications such as autonomous driving Paden et al. (2016) or medical diagnosis Gulshan et al. (2016), where a simple malfunction can have catastrophic results AP and REUTERS (2016). Perhaps the most actively studied area regarding robustness in deep learning is the modeling of, and defense against, adversarial attacks in the input domain Aung et al. (2017); Sinha et al. (2017); Carlini and Wagner (2017); Papernot et al. (2016). Adversarial examples are intentionally designed inputs that cause incorrect predictions in learned models by adding a small perturbation that is scarcely recognizable by humans Goodfellow et al. (2014). While this is a substantially important research direction, we focus on noise in the outputs, e.g., outliers from different distributions or random labels.

A number of studies Bekker and Goldberger (2016); Patrini et al. (2017); Goldberger and Ben-Reuven (2017); Jindal et al. (2016); Liu et al. (2017) deal with the problems that arise when handling noisy labels in the training dataset, as massive datasets such as the ImageNet dataset Deng et al. (2009) are often built largely through crowdsourcing and may thus contain inaccurate and inconsistent labels Bi et al. (2014). To deal with noisy labels, an earlier study Bekker and Goldberger (2016) proposed an extra layer for modeling output noise. Later work Jindal et al. (2016) extended this approach by adding an additional noise adaptation layer with aggressive dropout regularization. A similar method was then proposed Patrini et al. (2017) which initially estimates the label corruption matrix with a learned classifier and uses the corruption matrix to fine-tune the classifier. Other research Jiang et al. (2017) concentrated on training an additional neural network, referred to as MentorNet, which assigns a weight to each instance of training data to supervise the training of a base network, termed StudentNet, in order to overcome over-fitting to corrupted training data. One final study of note Rolnick et al. (2017) analyzed the intrinsic robustness of deep neural network models to massive label noise and empirically showed that a larger batch size with a lower learning rate can be beneficial with regard to robustness. Motivated by that work Rolnick et al. (2017), we train ChoiceNet with a large batch size and a low learning rate.

Unlike the previous methods, which only require noisy training datasets, some works Li et al. (2017); Malach and Shalev-Shwartz (2017); Hendrycks et al. (2018); Veit et al. (2017) additionally require a small clean dataset. A gold-loss correction method was presented Hendrycks et al. (2018); it initially learns a label corruption matrix using a small clean dataset and then uses the corruption matrix to retrain a corrected classifier. A label-cleaning network has also been proposed Veit et al. (2017); it corrects noisy labels in the training dataset by leveraging information from a small clean dataset.

Adding small label noise during training is known to be beneficial, as it can be regarded as an effective regularization method Lee (2013); Goodfellow et al. (2016). Similar methods have been proposed to tackle noisy outputs. A bootstrapping method Reed et al. (2014) was proposed which trains a neural network with a convex combination of the output of the current network and the noisy target. Other researchers Xie et al. (2016) proposed DisturbLabel, a simple method which randomly replaces a percentage of the labels with incorrect values in each iteration. Mixing both input and output data has also been proposed Tokozume et al. (2018); Zhang et al. (2017); one study Zhang et al. (2017) considered the image recognition problem under label noise and the other Tokozume et al. (2018) focused on a sound recognition problem.

Modeling the correlations of output training data has been actively studied in the context of Gaussian processes Rasmussen (2006). MTGPP Bonilla et al. (2008), which models the correlations of multiple tasks via Gaussian process regression, was also proposed. Due to its multi-task setting, however, Bonilla et al. (2008) is not suitable for robust regression tasks. Other researchers Choi et al. (2016) proposed a robust learning-from-demonstration method using a sparsity-constrained leverage optimization method which estimates the correlation between training outputs. Unlike the former study Bonilla et al. (2008), the latter Choi et al. (2016) can robustly recover the expert policy function. While our problem setting is similar to that of the latter study Choi et al. (2016), we propose end-to-end learning of both the target distribution and the correlation of each training datum, offering a clear advantage in terms of scalability. The aforementioned study Choi et al. (2016) also requires the design of a proper kernel structure, which is not suitable for high-dimensional inputs and classification problems.

3 ChoiceNet

In this section, we introduce the foundational theory and the model architecture of ChoiceNet. ChoiceNet consists of a base network and a mixture of correlated density network (MCDN) block. Section 3.1 justifies the reparameterization trick for correlated samples. Subsequently, we present the mechanism of ChoiceNet in Section 3.2 and the loss functions of ChoiceNet for regression and classification tasks in Section 3.3.

3.1 Reparameterization Trick for Correlated Sampling

We introduce fundamental theorems which lead to the Cholesky transform $\mathcal{T}$ for given random variables $W$ and $Z$. We apply this transform to random matrices $\mathbf{W}$ and $\mathbf{Z}$, which serve as weight matrices for prediction and a supplementary role, respectively. The proof of each theorem can be found in the Appendix.

Theorem 1.

Let $W$ and $Z$ be uncorrelated random variables such that

$$\mathbb{E}[W] = \mathbb{E}[Z] = 0, \quad \operatorname{Var}(W) = \operatorname{Var}(Z) = 1. \tag{1}$$

For a given $\rho \in [-1, 1]$, set

$$\widetilde{W} = \rho W + \sqrt{1 - \rho^2}\, Z. \tag{2}$$

Then $\operatorname{corr}(W, \widetilde{W}) = \rho$.

Theorem 2.

Assume the same conditions as in Theorem 1 and define $\widetilde{W}$ as in (2). For given functions $\mu$ and $\sigma > 0$, set $\widehat{W} = \mu + \sigma \widetilde{W}$. Then $\operatorname{corr}(W, \widehat{W}) = \rho$.

Due to the above theorem, correlation is invariant to mean translation and variance dilation. We now define the key operation of the MCDN block, named the Cholesky transform.

Definition.

For $\rho \in [-1, 1]$, we define the Cholesky transform $\mathcal{T}$ as follows:

$$\mathcal{T}(\rho, W, Z; \mu, \sigma) = \mu + \sigma \big( \rho W + \sqrt{1 - \rho^2}\, Z \big). \tag{3}$$

Here, $\mathcal{T}$ is a function for given parameters $\mu$ and $\sigma$. By plugging the random variables $W$ and $Z$ into $\mathcal{T}$, we obtain a new random variable correlated with $W$. This makes it possible to use the reparameterization trick Kingma (2017); Kingma and Welling (2013) to learn the parameters $\rho$, $\mu$, and $\sigma$. Indeed, $\mathcal{T}(\rho, W, Z; \mu, \sigma) = \mu + \sigma \widetilde{W}$ according to (2). Thus, by applying Theorem 2 to $\mathcal{T}$ with $\mu$ and $\sigma$, we reach the following result.

Corollary.

Under the same conditions as in Theorem 1, let

$$\widehat{W} = \mathcal{T}(\rho, W, Z; \mu, \sigma).$$

Then $\operatorname{corr}(W, \widehat{W}) = \rho$.

The aforementioned corollary implies that the random variable $\widehat{W}$ has a correlation of $\rho$ with $W$. The following theorem further states that the correlation between random matrices is invariant to an affine transform. This justifies using the Cholesky transform to generate the weight matrices in the MCDN block.

Theorem 3.

Let $\rho \in [-1, 1]$ and $d_1, d_2 \in \mathbb{N}$. Suppose random matrices $\mathbf{W}, \mathbf{Z} \in \mathbb{R}^{d_1 \times d_2}$ are given such that, for every $(i, j)$,

$$\mathbb{E}[\mathbf{W}_{ij}] = \mathbb{E}[\mathbf{Z}_{ij}] = 0, \quad \operatorname{Var}(\mathbf{W}_{ij}) = \operatorname{Var}(\mathbf{Z}_{ij}) = 1, \tag{4}$$

and

$$\operatorname{corr}(\mathbf{W}_{ij}, \mathbf{Z}_{ij}) = 0. \tag{5}$$

Given $\mu_{ij}$ and $\sigma_{ij} > 0$, set $\widetilde{\mathbf{W}}_{ij} = \mathcal{T}(\rho, \mathbf{W}_{ij}, \mathbf{Z}_{ij}; \mu_{ij}, \sigma_{ij})$ for each $(i, j)$. Then the elementwise correlation between $\mathbf{W}$ and $\widetilde{\mathbf{W}}$ equals $\rho$, i.e.,

$$\operatorname{corr}(\mathbf{W}_{ij}, \widetilde{\mathbf{W}}_{ij}) = \rho \quad \text{for every } (i, j).$$
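To make the transform concrete, the following is a minimal NumPy sketch of correlated weight sampling via (3); the variable names and constants are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Cholesky transform (Eq. 3); names and constants are
# illustrative assumptions, not the authors' implementation.
import numpy as np

def cholesky_transform(rho, w, z, mu, sigma):
    """Return mu + sigma * (rho * w + sqrt(1 - rho^2) * z).

    w and z are standardized (zero-mean, unit-variance) uncorrelated samples;
    the result has correlation rho with w (Theorems 1-3).
    """
    return mu + sigma * (rho * w + np.sqrt(1.0 - rho ** 2) * z)

rng = np.random.default_rng(0)
n, d1, d2 = 100_000, 4, 3              # Monte Carlo samples, weight-matrix shape
w = rng.standard_normal((n, d1, d2))   # "prediction" noise W
z = rng.standard_normal((n, d1, d2))   # supplementary noise Z
w_tilde = cholesky_transform(0.9, w, z, mu=0.5, sigma=2.0)

# The empirical elementwise correlation should be close to rho = 0.9.
print(np.corrcoef(w[:, 0, 0], w_tilde[:, 0, 0])[0, 1])
```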

3.2 Model Architecture

Figure 1: Model Architecture of ChoiceNet

In this section, we describe the model architecture and the mechanism of ChoiceNet. In the following, $\sigma_0$ is a constant indicating the expected measurement noise and $\phi$ is a bounded function, e.g., a hyperbolic tangent; $d_x$ and $d_y$ denote the dimensions of a feature vector $\mathbf{x}$ and an output $\mathbf{y}$, respectively, and $K$ is the number of mixtures. $\rho_1$ is a fixed constant whose value is close to $1$.

ChoiceNet is a twofold architecture comprising (a) a base network and (b) an MCDN block (see Figure 1). The base network extracts features from a given dataset. The MCDN block then estimates the densities of the data-generating distributions. Contrary to the mixture density network (MDN), during the density estimation process the MCDN block samples correlated weights using the Cholesky transform. Consequently, the MCDN block is able to generate the correlated mean vectors $\mu_1, \ldots, \mu_K$. The overall mechanism of ChoiceNet can be summarized as follows:

modules (mixture parameters) → Cholesky transform (correlated weight sampling) → outputs (correlated mean vectors)

By Theorem 3, $\operatorname{corr}(\mu_1, \mu_k) = \rho_k$ for each $k$, and the output density is modeled via the correlated mean vectors. Note that the conditional variance of a sampled weight given $\mathbf{W}$, which is proportional to $1 - \rho_k^2$, vanishes when $\rho_k = \pm 1$. Furthermore, as we apply Gaussian distributions in the Cholesky transform, the influence of uninformative or independent data, whose correlations are close to $0$, is attenuated as their variances increase Kendall and Gal (2017).
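To make the mechanism concrete, the following is a minimal PyTorch sketch of an MCDN-style head. It is a simplification under our own assumptions: the Cholesky transform is applied in the output space to produce the correlated mean vectors directly (rather than sampling full correlated weight matrices), the first mixture is pinned to $\rho_1$, and all layer names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCDNHead(nn.Module):
    """Illustrative mixture-of-correlated-densities head (not the official code)."""

    def __init__(self, feat_dim, out_dim, k=5, rho_1=0.95):
        super().__init__()
        self.k, self.rho_1 = k, rho_1
        self.pi = nn.Linear(feat_dim, k)             # mixture probabilities
        self.rho = nn.Linear(feat_dim, k - 1)        # correlations of mixtures 2..K
        self.mu = nn.Linear(feat_dim, out_dim)       # mean used by the transform
        self.log_sig = nn.Linear(feat_dim, out_dim)  # scale used by the transform
        self.sig_y = nn.Linear(feat_dim, k)          # per-mixture output variances

    def forward(self, h):
        pi = F.softmax(self.pi(h), dim=-1)
        # Mixture 1 is pinned to a correlation close to 1; others lie in (-1, 1).
        rho = torch.cat([torch.full_like(pi[:, :1], self.rho_1),
                         torch.tanh(self.rho(h))], dim=-1)
        mu, sig = self.mu(h), self.log_sig(h).exp()
        w = torch.randn_like(mu)          # shared standardized "target" noise W
        mus = []
        for j in range(self.k):           # Cholesky transform, Eq. (3)
            z = torch.randn_like(mu)      # supplementary noise Z
            r = rho[:, j:j + 1]
            mus.append(mu + sig * (r * w + torch.sqrt(1.0 - r ** 2) * z))
        return pi, rho, torch.stack(mus, dim=1), F.softplus(self.sig_y(h))
```

Sharing the noise `w` across mixtures is what makes every mean vector correlated with the same underlying target sample, matching the elementwise statement of Theorem 3.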

3.3 Training Objectives

Denote a training dataset by $\mathcal{D} = \{ (\mathbf{x}_i, \mathbf{y}_i) \}_{i=1}^{N}$. We consider both regression and classification tasks.

Regression

For the regression task, we employ both an $\ell$-loss on the first mixture mean and the standard MDN loss Bishop (1994); Choi et al. (2017); Christopher (2016):

$$\mathcal{L}_{\text{reg}} = \sum_{i=1}^{N} \left[ -\lambda_1 \log \left( \sum_{k=1}^{K} \pi_k(\mathbf{x}_i)\, \mathcal{N}\big( \mathbf{y}_i \mid \mu_k(\mathbf{x}_i), \sigma_k^2(\mathbf{x}_i) \big) \right) + \lambda_2 \left\| \mathbf{y}_i - \mu_1(\mathbf{x}_i) \right\| \right] \tag{6}$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters and $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ is the density of a multivariate Gaussian:

$$\mathcal{N}(\mathbf{y} \mid \mu, \sigma^2 I) = \frac{1}{(2 \pi \sigma^2)^{d_y / 2}} \exp\!\left( -\frac{ \| \mathbf{y} - \mu \|^2 }{ 2 \sigma^2 } \right).$$

We also add weight decay and the following Kullback-Leibler regularizer to (6):

$$\mathcal{L}_{\mathrm{KL}} = \sum_{i=1}^{N} D_{\mathrm{KL}}\big( \operatorname{softmax}(\rho(\mathbf{x}_i)) \,\big\|\, \pi(\mathbf{x}_i) \big) \tag{7}$$

The above KL regularizer encourages the mixture components with strong correlations to have high mixture probabilities. This guidance is useful since ChoiceNet uses the mean vector $\mu_1$ of the first mixture component at the inference stage.
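Below is a minimal PyTorch sketch of the regression objective as reconstructed above; the weighting scheme, the $\ell_1$ choice for the $\ell$-loss, and the exact form of the KL term are our assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def choicenet_regression_loss(pi, rho, mus, sig_y, y,
                              lam1=1.0, lam2=1.0, lam_kl=0.1, eps=1e-8):
    """pi, rho, sig_y: (B, K); mus: (B, K, D); y: (B, D)."""
    d = y.shape[-1]
    # Standard MDN negative log-likelihood with isotropic Gaussian components.
    sq = ((y.unsqueeze(1) - mus) ** 2).sum(-1)                   # (B, K)
    log_norm = -0.5 * d * torch.log(torch.tensor(2.0 * torch.pi))
    log_comp = log_norm - d * torch.log(sig_y) - 0.5 * sq / sig_y ** 2
    nll = -torch.logsumexp(torch.log(pi + eps) + log_comp, dim=-1).mean()
    # l-loss on the first ("target") mixture mean; l1 is a stand-in choice here.
    rec = (y - mus[:, 0]).abs().sum(-1).mean()
    # KL regularizer (Eq. 7): push mixture weights toward correlated components.
    target = F.softmax(rho, dim=-1)
    kl = (target * (torch.log(target + eps) - torch.log(pi + eps))).sum(-1).mean()
    return lam1 * nll + lam2 * rec + lam_kl * kl
```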

Classification

In the classification task, we suppose each $\mathbf{y}_i$ is a $d_y$-dimensional one-hot vector. Unlike in the regression task, (6) is not appropriate for classification. We instead employ the following loss function:

(8)

where $\lambda$ is a hyper-parameter. As in the regression task, we also use both (7) and weight decay.
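Since the exact form of (8) is not recoverable from this copy, the following PyTorch sketch is only one plausible instantiation under our own assumptions: a mixture-weighted cross-entropy over the correlated heads plus a $\lambda$-weighted cross-entropy on the first head, mirroring the structure of (6).

```python
import torch

def choicenet_classification_loss(pi, mus, y_onehot, lam=0.1):
    """Hypothetical mixture classification loss (Eq. 8 is elided in this copy).

    pi: (B, K) mixture weights; mus: (B, K, C) per-mixture logits;
    y_onehot: (B, C) one-hot labels.
    """
    log_p = torch.log_softmax(mus, dim=-1)                        # (B, K, C)
    # Mixture-weighted cross-entropy over the K correlated heads.
    ce = -(pi.unsqueeze(-1) * log_p * y_onehot.unsqueeze(1)).sum((1, 2)).mean()
    # lam-weighted cross-entropy on the first ("target") head.
    ce1 = -(log_p[:, 0] * y_onehot).sum(-1).mean()
    return ce + lam * ce1
```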

4 Experiments

4.1 Regression Tasks

We conduct two regression experiments: 1) a synthetic scenario where the training dataset contains outliers sampled from other distributions and 2) a track driving scenario where the driving demonstrations are collected from two different driving modes.

Synthetic Example

We first apply ChoiceNet to a simple one-dimensional regression problem of fitting a reference target function, as shown in Figure 5. ChoiceNet is compared with a naive multilayer perceptron (MLP) and a mixture density network (MDN) with five mixtures, where all networks have two hidden layers with ReLU activation functions. Gaussian process regression (GPR) Rasmussen (2006), leveraged Gaussian process regression (LGPR) with leverage optimization Choi et al. (2016), and robust Gaussian process regression (RGPR) with an infinite Gaussian process mixture model Rasmussen and Ghahramani (2002) are also compared. For the GP-based methods, we use a squared-exponential kernel function whose hyper-parameters are determined using a simple median trick Dai et al. (2014), which selects the length parameter of a kernel function to be the median of all pairwise distances between training data. To evaluate performance on corrupted datasets, we randomly replace a fraction of the original target values with outliers whose output values are sampled uniformly from a fixed range. We vary the outlier rate from clean to extremely noisy settings.

Table 1 illustrates the RMSEs (root mean square errors) between the reference target function and the fitted results of ChoiceNet and the other compared methods. Given an intact training dataset, all the methods show stable performance with uniformly low RMSEs. Given heavily corrupted training datasets, however, only ChoiceNet successfully fits the target function, whereas the other methods fail, as shown in Figure 5.

Table 1: RMSE of compared methods (ChoiceNet, MDN, MLP, GPR, LGPR, RGPR) on the synthetic toy example at varying outlier rates.
Figure 2: Reference function and fitting results of compared methods on different outlier rates.
Figure 3: Fitting results on datasets with (a) flipped function and (c) uniform corruptions. Resulting correlations of two components with (b) flipped function and (d) uniform corruptions.

To further inspect whether ChoiceNet can distinguish between the target distribution and noise distributions, we train ChoiceNet on two datasets. In particular, we use the same target function and replace a portion of the output values within a fixed input interval using two different corruptions: one sampled uniformly and the other drawn from a flipped target function. For this experiment, we use two mixture components for better visualization. As shown in Figure 3(a) and 3(c), ChoiceNet successfully fits the target function. The correlations of the second component decrease as outliers are introduced, as shown in Figure 3(b) and 3(d). Surprisingly, when the target and noise distributions are negatively correlated (the flipped-function case), the correlations of the second component approach $-1$, as depicted in Figure 3(b). Contrarily, for the uniform corruption case, the correlations of the second component remain in a small interval around zero. We argue that this clearly shows the capability of ChoiceNet to distinguish the target distribution from noise distributions.

Autonomous Driving Experiment

In this experiment, we apply ChoiceNet to an autonomous driving scenario in a simulated environment. In particular, the tested methods are asked to learn a policy from driving demonstrations collected in both safe and careless driving modes. We use the same set of methods as in the previous task. The policy function is defined as a mapping from four-dimensional input features, consisting of three frontal distances to the left, center, and right lanes and the lane deviation distance from the center of the lane, to the desired heading. Once the desired heading is computed, the angular velocity of the car is computed from it, and the directional velocity is fixed. The driving demonstrations are collected from keyboard inputs by human users. The objective of this experiment is to assess performance on a training set generated from two different distributions. We would like to note that this task does not have a reference target function in that all demonstrations are collected manually. Hence, we evaluate the performance of the compared methods by running the trained policies on a straight track with randomly deployed static cars.

Table 2 and Table 3 show the collision rates and the RMS lane deviation distances of the tested methods, respectively, where the statistics are computed from independent runs on the straight lane with randomly placed static cars, as shown in Figure 7. ChoiceNet clearly outperforms the compared methods in terms of both safety (low collision rates) and stability (low RMS lane deviation distances).

Figure 4: Resulting trajectories of compared methods trained with mixed demonstrations. (best viewed in color).
Table 2: Collision rates of compared methods (ChoiceNet, MDN, MLP, GPR, LGPR, RGPR) on straight lanes.
Table 3: Root mean square lane deviation distances (m) of compared methods (ChoiceNet, MDN, MLP, GPR, LGPR, RGPR) on straight lanes.

4.2 Classification Tasks

We conduct classification experiments on the MNIST and CIFAR-10 datasets to evaluate the performance of ChoiceNet on corrupted labels. To generate noisy datasets, we follow the setting in Zhang et al. (2017), which randomly shuffles a percentage of the labels in the dataset. (In this corrupted-label setting, for a given corruption probability $p$ and $10$ classes, the expected ratio of correct labels is $(1 - p) + p/10$. Additional experiments replacing a percentage of labels with random labels and with a fixed label can be found in the Appendix.) We vary the corruption probability from 50% to 95% for the MNIST dataset and from 20% to 80% for the CIFAR-10 dataset and compare the median accuracies after five runs for each configuration.
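A minimal NumPy sketch of this random-shuffle corruption follows; the helper name and constants are our own.

```python
import numpy as np

def shuffle_labels(labels, p, num_classes=10, rng=None):
    """Replace a fraction p of labels with uniformly drawn labels.

    With 10 classes, the expected ratio of correct labels is (1 - p) + p/10,
    since a shuffled label lands on the true class with probability 1/10.
    """
    rng = rng or np.random.default_rng()
    labels = labels.copy()
    idx = rng.random(len(labels)) < p
    labels[idx] = rng.integers(0, num_classes, size=int(idx.sum()))
    return labels

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=60_000)   # stand-in for MNIST training labels
y_noisy = shuffle_labels(y, p=0.8, rng=rng)
print((y == y_noisy).mean())           # ~0.28 = (1 - 0.8) + 0.8/10
```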

For the MNIST experiments, we construct two networks: a network with two residual blocks He et al. (2016b) of convolutional layers followed by a fully-connected layer with 10 output neurons (ConvNet), and a network with the same two residual blocks followed by an MCDN block (ChoiceNet). We train each network with a fixed learning rate for a fixed number of epochs.

For the CIFAR experiments, we adopt WideResNet (WRN) Zagoruyko and Komodakis (2016). To construct ChoiceNet, we replace the last layer of WideResNet with an MCDN block, where the modules consist of two fully-connected layers with ReLU activation functions. We train each network with a large minibatch size, beginning with a learning rate that decays in steps during training. We apply random horizontal flips and random crops with 4-pixel padding and use weight decay for the baseline network as in He et al. (2016b). However, to train ChoiceNet, we reduce the weight decay rate and apply gradient clipping. We also lower the learning rate for the first epoch to stabilize training.

In both the MNIST and CIFAR-10 experiments, we also compare ChoiceNet with Mixup Zhang et al. (2017), which, to the best of our knowledge, shows the state-of-the-art performance on noisy labels. We set the mixing parameter $\alpha$ of Mixup for the baseline network as suggested in the original paper and use a different value of $\alpha$ for ChoiceNet.
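For reference, a minimal sketch of Mixup as described in Zhang et al. (2017); the $\alpha$ value below is a placeholder, since the exact values used here are not shown in this copy.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mixup (Zhang et al., 2017): convex combinations of input/label pairs.

    y1 and y2 are one-hot label vectors; alpha is a placeholder value.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```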


Corruption Configuration Best Last
50% ConvNet 97.1 95.9
ConvNet+Mixup 98.0 97.8
ChoiceNet 99.1 99.0
80% ConvNet 90.6 79.0
ConvNet+Mixup 95.3 95.1
ChoiceNet 98.3 98.3
90% ConvNet 76.1 54.1
ConvNet+Mixup 78.6 42.4
ChoiceNet 95.9 95.2
95% ConvNet 50.2 31.3
ConvNet+Mixup 53.2 26.6
ChoiceNet 84.5 66.0
Table 4: Test accuracies on the MNIST dataset with corrupted labels.

Corruption Configuration Best Last
20% WRN (WideResNet) 88.5 85.3
CN (ChoiceNet) 90.7 90.3
WRN + Mixup 92.9 92.3
CN + Mixup 92.5 92.3
50% WRN 79.7 59.3
CN 85.9 84.6
WRN + Mixup 87.3 83.1
CN + Mixup 88.4 87.9
80% WRN 67.8 27.4
CN 69.8 65.2
WRN + Mixup 72.1 62.9
CN + Mixup 76.1 75.4
Table 5: Test accuracies on the CIFAR-10 datasets with corrupted labels

The classification results on the MNIST and CIFAR-10 datasets are shown in Table 4 and Table 5, respectively. In the MNIST experiments, ChoiceNet consistently outperforms ConvNet and ConvNet+Mixup by a significant margin, and the difference between the accuracies of ChoiceNet and the others becomes clearer as the corruption probability increases. In particular, the best test accuracy of ChoiceNet stays above 84% even when 95% of the training labels are randomly shuffled.

In the CIFAR-10 experiments, ChoiceNet outperforms WideResNet and maintains an accuracy above 65% even when 80% of the labels are shuffled, whereas the last accuracy of WideResNet drops below 30%. When we inspect the training accuracies on the 80%-shuffled set, WideResNet tends to overfit to (memorize) the noisy labels, whereas ChoiceNet does not. Detailed learning curves can be found in the Appendix. When trained with Mixup, both networks become robust to noisy labels to some extent. However, the results of the two networks still show significant differences except for the 20%-corruption experiments, in which both show similar accuracies. Interestingly, when ChoiceNet and Mixup are combined, the model achieves a high accuracy of 75.4% even on the 80%-shuffled dataset. We also note that ChoiceNet (without Mixup) outperforms WideResNet+Mixup in terms of last accuracies when the corruption ratio is 50% or higher.

5 Conclusion

In this paper, we have presented ChoiceNet, which can robustly learn a target distribution given noisy training data. The keystone of ChoiceNet is the mixture of correlated density network block, which can estimate the densities of data distributions using a set of correlated mean functions. We have demonstrated that ChoiceNet can robustly infer the target distribution from corrupted training data in the following tasks: regression with synthetic data, autonomous driving, and MNIST and CIFAR-10 image classification. Our experiments verify that ChoiceNet outperforms existing methods in the handling of noisy data.

Selecting proper hyper-parameters, including the optimal number of mixture components, is a compelling topic for the practical usage of ChoiceNet. Furthermore, one could use ChoiceNet for active learning by evaluating the quality of each training datum through the lens of correlations. We leave these as important questions for future work.

References

Appendix A Proof of Theorems in Section 3.1


Proof of Theorem 1.

Since $W$ and $Z$ are uncorrelated, we have

$$\mathbb{E}[W Z] = \mathbb{E}[W]\, \mathbb{E}[Z]. \tag{9}$$

By (1), we directly obtain

$$\mathbb{E}[\widetilde{W}] = \rho\, \mathbb{E}[W] + \sqrt{1 - \rho^2}\, \mathbb{E}[Z] = 0.$$

Also, by (1) and (9),

$$\operatorname{Var}(\widetilde{W}) = \rho^2 \operatorname{Var}(W) + (1 - \rho^2) \operatorname{Var}(Z) = 1.$$

Similarly,

$$\operatorname{Cov}(W, \widetilde{W}) = \mathbb{E}[W \widetilde{W}] = \rho\, \mathbb{E}[W^2] + \sqrt{1 - \rho^2}\, \mathbb{E}[W Z] = \rho.$$

Therefore

$$\operatorname{corr}(W, \widetilde{W}) = \frac{\operatorname{Cov}(W, \widetilde{W})}{\sqrt{\operatorname{Var}(W) \operatorname{Var}(\widetilde{W})}} = \rho.$$

The theorem is proved. ∎


Proof of Theorem 2.

Note that

$$\widehat{W} = \mu + \sigma \widetilde{W}, \quad \text{so} \quad \operatorname{Cov}(W, \widehat{W}) = \sigma \operatorname{Cov}(W, \widetilde{W}), \quad \operatorname{Var}(\widehat{W}) = \sigma^2 \operatorname{Var}(\widetilde{W}).$$

Therefore, by Theorem 1,

$$\operatorname{corr}(W, \widehat{W}) = \frac{\sigma \operatorname{Cov}(W, \widetilde{W})}{\sqrt{\operatorname{Var}(W)\, \sigma^2 \operatorname{Var}(\widetilde{W})}} = \operatorname{corr}(W, \widetilde{W}).$$

Hence

$$\operatorname{corr}(W, \widehat{W}) = \rho.$$

The theorem is proved. ∎


Proof of Theorem 3.

First we prove that, for every $i$ and $j$,

$$\operatorname{Cov}(\mathbf{W}_{ij}, \widetilde{\mathbf{W}}_{ij}) = \rho\, \sigma_{ij}. \tag{10}$$

Note that

$$\widetilde{\mathbf{W}}_{ij} = \mu_{ij} + \sigma_{ij} \big( \rho \mathbf{W}_{ij} + \sqrt{1 - \rho^2}\, \mathbf{Z}_{ij} \big).$$

By (4) and (5),

$$\operatorname{Cov}(\mathbf{W}_{ij}, \widetilde{\mathbf{W}}_{ij}) = \sigma_{ij} \big( \rho\, \mathbb{E}[\mathbf{W}_{ij}^2] + \sqrt{1 - \rho^2}\, \mathbb{E}[\mathbf{W}_{ij} \mathbf{Z}_{ij}] \big) = \rho\, \sigma_{ij},$$

so (10) is proved. Next we prove

$$\operatorname{Var}(\widetilde{\mathbf{W}}_{ij}) = \sigma_{ij}^2. \tag{11}$$

Observe that

$$\operatorname{Var}(\widetilde{\mathbf{W}}_{ij}) = \sigma_{ij}^2 \big( \rho^2 \operatorname{Var}(\mathbf{W}_{ij}) + (1 - \rho^2) \operatorname{Var}(\mathbf{Z}_{ij}) \big) = \sigma_{ij}^2.$$

Hence (11) is proved. Therefore, by (10) and (11),

$$\operatorname{corr}(\mathbf{W}_{ij}, \widetilde{\mathbf{W}}_{ij}) = \frac{\rho\, \sigma_{ij}}{\sqrt{\operatorname{Var}(\mathbf{W}_{ij})}\, \sigma_{ij}} = \rho.$$

The theorem is proved. ∎

Remark.

Recall the definition of the Cholesky transform: for $\rho \in [-1, 1]$,

$$\mathcal{T}(\rho, W, Z; \mu, \sigma) = \mu + \sigma \big( \rho W + \sqrt{1 - \rho^2}\, Z \big). \tag{12}$$

Note that we do not assume that $W$ and $Z$ follow particular distributions; hence, all of the above theorems hold for a general class of random variables. Additionally, by Theorem 2 and (12), $\mathcal{T}$ has the following $\rho$-dependent behaviors: when $\rho = \pm 1$, $\mathcal{T}(\rho, W, Z; \mu, \sigma) = \mu \pm \sigma W$ is a deterministic affine function of $W$, whereas when $\rho = 0$, $\mathcal{T}(0, W, Z; \mu, \sigma) = \mu + \sigma Z$ is uncorrelated with $W$.

Thus, strongly correlated weights, i.e., $\rho \approx \pm 1$, provide predictions with confidence, while uncorrelated weights encompass uncertainty. These different behaviors of the weights perform regularization and preclude over-fitting caused by bad data, since uncorrelated and negatively correlated weights absorb vague patterns and outlier patterns, respectively.

Appendix B Experiments

B.1 Regression Tasks

B.1.1 Synthetic Example

We provide more fitting results for the synthetic example in Figure 5. Given an intact dataset, all compared methods robustly fit the given training data. However, the other methods fail to correctly fit the underlying target function given corrupted data. When the outlier rate becomes extreme, all tested methods fail to fit.

Figure 5: Reference function and fitting results of compared methods at different outlier rates.

B.1.2 Autonomous Driving Experiment

Here, we describe the features used for the autonomous driving experiments. As stated in the manuscript, we use a four-dimensional feature consisting of the lane deviation distance of the ego car and three frontal distances to the closest cars in the left, center, and right lanes, as shown in Figure 6. We upper-bound the frontal distances. Figures 7(a) and 7(b) illustrate manually collected trajectories of the safe driving mode and the careless driving mode, respectively.

Figure 6: Descriptions of the features of the ego (red) car used in the autonomous driving experiments.
Figure 7: Manually collected trajectories of (a) safe driving mode and (b) careless driving mode. (best viewed in color).

B.2 Classification Tasks

B.2.1 MNIST

Here, we present additional experimental results on the MNIST dataset for the following three scenarios:

  1. Biased label experiments, where we randomly assign a percentage of the training labels to a single fixed label.

  2. Random shuffle experiments, where we randomly replace a percentage of the training labels with labels drawn from the uniform multinomial distribution.

  3. Random permutation experiments, where we replace a percentage of the labels based on a label permutation matrix, following the random permutation in Reed et al. [2014].

The best and final accuracies on the intact test dataset for the biased label experiments are shown in Table 6. At all corruption rates, ChoiceNet achieves the best performance compared to the two baseline methods. The learning curves of the biased label experiments are depicted in Figure 8. In particular, we observe unstable learning curves in the test accuracies of ConvNet and Mixup. As the training accuracies of these methods show stable learning behaviors, this can be interpreted as the networks simply memorizing the noisy labels. On the contrary, the learning curves of ChoiceNet show stable behaviors, which clearly indicates the robustness of the proposed method.

The experimental results and learning curves of the random shuffle experiments are shown in Table 7 and Figure 9. The convolutional neural networks trained with Mixup show robust learning behaviors when up to 80% of the training labels are uniformly shuffled. However, given extremely noisy datasets (90% and 95%), the test accuracies of the baseline methods decrease as the number of epochs increases. ChoiceNet shows outstanding robustness to the noisy datasets in that its test accuracy does not drop even after many epochs for the cases where the corruption rate is below 95%. For the 95% case, however, over-fitting occurred in all methods.

Table 8 and Figure 10 illustrate the results of the random permutation experiments. Specifically, we change the labels of randomly selected training data using a permutation rule following Reed et al. [2014]. We argue that this setting is more arduous than the random shuffle case in that we intentionally change the labels based on predefined permutation rules.

Corruption Configuration Best Last
25% ConvNet 95.4 89.5
ConvNet+Mixup 97.2 96.8
ChoiceNet 99.2 99.2
40% ConvNet 86.3 76.9
ConvNet+Mixup 87.2 87.2
ChoiceNet 98.2 97.6
45% ConvNet 76.1 69.8
ConvNet+Mixup 74.7 74.7
ChoiceNet 94.7 89.0
47% ConvNet 72.5 64.4
ConvNet+Mixup 69.2 68.2
ChoiceNet 88.5 80.0
Table 6: Test accuracies on the MNIST dataset with biased labels.
Corruption Configuration Best Last
50% ConvNet 97.1 95.9
ConvNet+Mixup 98.0 97.8
ChoiceNet 99.1 99.0
80% ConvNet 90.6 79.0
ConvNet+Mixup 95.3 95.1
ChoiceNet 98.3 98.3
90% ConvNet 76.1 54.1
ConvNet+Mixup 78.6 42.4
ChoiceNet 95.9 95.2
95% ConvNet 50.2 31.3
ConvNet+Mixup 53.2 26.6
ChoiceNet 84.5 66.0
Table 7: Test accuracies on the MNIST dataset with corrupted labels.
Corruption Configuration Best Last
25% ConvNet 94.4 92.2
ConvNet+Mixup 97.6 97.6
ChoiceNet 99.2 99.2
40% ConvNet 77.9 71.8
ConvNet+Mixup 84.0 83.0
ChoiceNet 99.2 98.8
45% ConvNet 68.0 61.4
ConvNet+Mixup 68.9 55.8
ChoiceNet 98.0 97.1
47% ConvNet 58.2 53.9
ConvNet+Mixup 60.2 53.4
ChoiceNet 92.5 86.1
Table 8: Test accuracies on the MNIST dataset with randomly permuted labels.
Figure 8: Learning curves of compared methods on random bias experiments using MNIST with different noise levels.
Figure 9: Learning curves of compared methods on random shuffle experiments using MNIST with different noise levels.
Figure 10: Learning curves of compared methods on random permutation experiments using MNIST with different noise levels.

B.2.2 CIFAR-10

Here, we present detailed learning curves of the CIFAR-10 experiments in Figure 11, varying the noise level from 20% to 80% following the configurations in Zhang et al. [2017].

Figure 11: Learning curves of compared methods on CIFAR-10 experiments with different noise levels.