1 Introduction
Training a deep neural network requires immense amounts of training data, which are often collected using crowdsourcing methods such as Amazon’s Mechanical Turk (AMT). In practice, however, crowdsourced labels are often noisy Bi et al. (2014). Furthermore, deep neural networks are vulnerable to overfitting on noisy training data: they are capable of memorizing the entire dataset even with inconsistent labels, which leads to poor generalization performance Zhang et al. (2016).
Assuming that a training dataset is generated from a mixture of a target distribution and other distributions, we address this problem through the principled idea of revealing the correlation between the target distribution and the other distributions. We present a framework for robust learning that is applicable to arbitrary neural network architectures, such as convolutional neural networks He et al. (2016a) or recurrent neural networks Chung et al. (2014). We call this framework ChoiceNet.
Throughout this paper, we aim to address the following questions:

How can we measure the quality of training data in a principled manner?

In the presence of inconsistent outputs, how can we infer the target distribution in a scalable manner?
Traditionally, noisy outputs are handled by modeling additive random distributions, often leading to robust loss functions Hampel et al. (2011). However, we argue that these approaches are too restrictive when handling severe outliers or inconsistencies in the datasets. To address the first question, we leverage the concept of correlation. Precisely, we measure the quality of training data using the correlation between the target distribution and the data-generating distribution. However, estimating the correct correlation requires access to the target distribution, whereas learning the correct target distribution requires knowing the correlation between the distributions, making it a chicken-and-egg problem. To address the second question, we simultaneously estimate the target distribution and the correlation in an end-to-end manner using stochastic gradient descent methods, in this case Adam Kingma and Ba (2014), to achieve scalability.
The cornerstone of the proposed method is a mixture of correlated density network (MCDN) block. First, we present a Cholesky transform method for sampling the weights of a neural network that enables us to model correlated outputs. We also present an effective regularizer to train ChoiceNet. To the best of our knowledge, this is the first approach to simultaneously infer the target distribution and the output correlations using a neural network in an end-to-end manner.
Revealing output correlations was proposed in earlier work Bonilla et al. (2008), which introduced a multi-task Gaussian process prediction (MTGPP) model. In particular, MTGPP used correlated Gaussian processes to model multiple tasks by learning a free-form cross-covariance matrix. However, due to the multi-task learning setting, it is not suitable for learning a single target function. In other work Choi et al. (2016), a leverage optimization method which optimizes the leverage of each demonstration was proposed. Unlike the former study Bonilla et al. (2008), the latter Choi et al. (2016) focused on inferring a single expert policy by incorporating a sparsity constraint, under the assumption that most demonstrations are collected from a skillful, consistent expert.
ChoiceNet is first applied to a synthetic regression task, where we demonstrate its robustness to extreme outliers and its ability to distinguish the target distribution from noise distributions. We then apply it to an autonomous driving scenario in which the driving demonstrations are collected from both safe and careless drivers and show that it can robustly learn a safe and stable driving policy. Subsequently, we move on to classification tasks using the MNIST and CIFAR-10 datasets. We show that the proposed method outperforms existing baseline methods in terms of robustness to noisy labels.
2 Related Work
Recently, robustness in deep learning has been actively studied Fawzi et al. (2017) as deep neural networks are being applied to diverse real-world tasks such as autonomous driving Paden et al. (2016) or medical diagnosis Gulshan et al. (2016), where a simple malfunction can have catastrophic results AP and REUTERS (2016). Perhaps the most actively studied area regarding robustness in deep learning is the modeling of and defense against adversarial attacks in the input domain Aung et al. (2017); Sinha et al. (2017); Carlini and Wagner (2017); Papernot et al. (2016). Adversarial examples are intentionally designed inputs that cause incorrect predictions in learned models by adding a small perturbation that is scarcely recognized by humans Goodfellow et al. (2014). While this is an important research direction, we focus on noise in the outputs, e.g., outliers from different distributions or random labels.
A number of studies Bekker and Goldberger (2016); Patrini et al. (2017); Goldberger and Ben-Reuven (2017); Jindal et al. (2016); Liu et al. (2017) deal with the problems that arise when handling noisy labels in the training dataset, since massive datasets such as ImageNet Deng et al. (2009) are often collected through crowdsourcing and may thus contain inaccurate and inconsistent labels Bi et al. (2014). To deal with noisy labels, an earlier study Bekker and Goldberger (2016) proposed an extra layer for modeling output noise. Later work Jindal et al. (2016) extended this approach by adding an additional noise adaptation layer with aggressive dropout regularization. A similar method Patrini et al. (2017) first estimated the label corruption matrix with a learned classifier and then used the corruption matrix to fine-tune the classifier. Other research Jiang et al. (2017) concentrated on training an additional neural network, referred to as MentorNet, which assigns a weight to each training instance to supervise the training of a base network, termed StudentNet, and thereby overcome overfitting to corrupted training data. One final study of note Rolnick et al. (2017) analyzed the intrinsic robustness of deep neural network models to massive label noise and empirically showed that a larger batch size with a lower learning rate can be beneficial for robustness. Motivated by that work, we train ChoiceNet with a large batch size and a low learning rate.
Unlike the previous methods, which only require noisy training datasets, some works Li et al. (2017); Malach and Shalev-Shwartz (2017); Hendrycks et al. (2018); Veit et al. (2017) require a small clean dataset. A gold-loss correction method was presented in Hendrycks et al. (2018); it first learns a label corruption matrix using a small clean dataset and then uses the corruption matrix to retrain a corrected classifier. A label-cleaning network has also been proposed Veit et al. (2017). It corrects noisy labels in the training dataset by leveraging information from a small clean dataset.
Adding a small amount of label noise during training is known to be beneficial, as it can be regarded as an effective regularization method Lee (2013); Goodfellow et al. (2016). Similar methods have been proposed to tackle noisy outputs. A bootstrapping method Reed et al. (2014) trains a neural network with a convex combination of the output of the current network and the noisy target. Other researchers Xie et al. (2016) proposed DisturbLabel, a simple method which randomly replaces a percentage of the labels with incorrect values at each iteration. Mixing both input and output data was also proposed Tokozume et al. (2018); Zhang et al. (2017). One study Zhang et al. (2017) considered the image recognition problem under label noise, and the other Tokozume et al. (2018) focused on a sound recognition problem.
Modeling correlations of output training data has been actively studied in the context of Gaussian processes Rasmussen (2006). MTGPP Bonilla et al. (2008), which models the correlations of multiple tasks via Gaussian process regression, was also proposed. Due to the multi-task setting, however, MTGPP is not suitable for robust regression tasks. Other researchers Choi et al. (2016) proposed a robust learning-from-demonstration method using sparse constrained leverage optimization, which estimates the correlation between training outputs. Unlike the former study Bonilla et al. (2008), the latter Choi et al. (2016) can robustly recover the expert policy function. While our problem setting is similar to that of Choi et al. (2016), we propose end-to-end learning of both the target distribution and the correlation of each training example, thus offering a clear advantage in terms of scalability. The method of Choi et al. (2016) also requires the design of a proper kernel structure, which makes it unsuitable for high-dimensional inputs and classification problems.
3 ChoiceNet
In this section, we introduce the theoretical foundation and the model architecture of ChoiceNet. ChoiceNet consists of a base network and a mixture of correlated density network (MCDN) block. Section 3.1 justifies the reparameterization trick for correlated samples. Subsequently, we present the mechanism of ChoiceNet in Section 3.2 and the loss functions for regression and classification tasks in Section 3.3.
3.1 Reparameterization Trick for Correlated Sampling
We introduce fundamental theorems which lead to the Cholesky transform for given random variables . We apply this transform to random matrices and which serve as weight matrices for prediction and in a supplementary role, respectively. The proof of each theorem can be found in the Appendix.
Theorem 1.
Let and be uncorrelated random variables such that
(1) 
For a given , set
(2) 
Then
Theorem 2.
Due to the above theorem, correlation is invariant to mean translation and variance dilation. Now we define the key operation of the MCDN block, named the Cholesky transform.
Definition.
For , we define Cholesky transform as follows
(3) 
Here is a function for given parameters . By plugging random variables in , we obtain a new random variable correlated with . This makes it possible to use the reparameterization trick Kingma (2017); Kingma and Welling (2013) to learn parameters , and . Indeed, according to (2). Thus by applying Theorem 2 to with and , we reach the following result.
Corollary.
The above corollary implies that the random variable has a correlation with . The following theorem further states that a correlation between random matrices is invariant to an affine transform. This justifies using the Cholesky transform to generate weight matrices in the MCDN block.
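The idea behind correlated sampling can be illustrated with the generic two-variable Cholesky construction. The sketch below is not Eq. (3) itself, whose exact form is not reproduced here; it assumes a source variable `w` with known mean and standard deviation and an independent zero-mean, unit-variance noise `z`. The function name `cholesky_correlate` and the parameters `mu_new` and `sigma_new` are hypothetical.

```python
import numpy as np

def cholesky_correlate(w, z, rho, mu_w, sigma_w, mu_new=0.0, sigma_new=1.0):
    """Build a variable whose correlation with w is rho (generic sketch).

    Assumes z is independent of w with zero mean and unit variance.
    [rho, sqrt(1 - rho^2)] is the second row of the Cholesky factor of
    the 2x2 correlation matrix [[1, rho], [rho, 1]].
    """
    w_std = (w - mu_w) / sigma_w                      # standardize w
    t = rho * w_std + np.sqrt(1.0 - rho ** 2) * z     # unit variance, corr(t, w) = rho
    # Shifting and positively scaling t does not change its correlation with w.
    return mu_new + sigma_new * t

rng = np.random.default_rng(0)
n = 200_000
w = rng.normal(loc=2.0, scale=3.0, size=n)
z = rng.standard_normal(n)
rho = -0.7
w_tilde = cholesky_correlate(w, z, rho, mu_w=2.0, sigma_w=3.0,
                             mu_new=1.0, sigma_new=0.5)
empirical_rho = np.corrcoef(w, w_tilde)[0, 1]
print(f"target rho = {rho}, empirical rho = {empirical_rho:.3f}")
```

Because the output is a deterministic, differentiable function of `rho` and of the location and scale parameters given the samples `w` and `z`, gradients can flow to these parameters, which is what makes a reparameterization-style estimator possible.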
Theorem 3.
Let . For , random matrices are given such that for every ,
(4) 
and
(5) 
Given , set for each . Then an elementwise correlation between and equals i.e.
equivalently,
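A quick numerical check can make this kind of invariance concrete. The sketch below assumes one simple setting that is our own choice, not necessarily the theorem's exact hypotheses: the entries of two weight matrices are independent across positions, share a common variance within each matrix, and are correlated with coefficient `rho` only at matching positions. Under these assumptions, projecting both matrices onto the same fixed feature vector `h` yields outputs whose elementwise correlation is again `rho`. All names and numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, out_dim, feat_dim = 20_000, 3, 5
rho = 0.6
sigma_1, sigma_k = 0.8, 1.5   # assumed per-matrix standard deviations

# Independent standard normal noise for every matrix entry and sample.
e1 = rng.standard_normal((n_samples, out_dim, feat_dim))
e2 = rng.standard_normal((n_samples, out_dim, feat_dim))

# Elementwise-correlated weight samples with arbitrary means and scales.
W1 = 0.3 + sigma_1 * e1
Wk = -0.2 + sigma_k * (rho * e1 + np.sqrt(1.0 - rho ** 2) * e2)

h = rng.standard_normal(feat_dim)   # fixed feature vector
y1 = W1 @ h                         # shape (n_samples, out_dim)
yk = Wk @ h

# Correlation between matching output coordinates, estimated over samples.
corr_per_output = np.array(
    [np.corrcoef(y1[:, i], yk[:, i])[0, 1] for i in range(out_dim)]
)
print(corr_per_output)
```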
3.2 Model Architecture
In this section, we describe the model architecture and the mechanism of ChoiceNet. In the following, is a constant indicating expected measurement noise and is a bounded function, e.g., a hyperbolic tangent. , and where and denote the dimensions of a feature vector and output , respectively, and is the number of mixtures. is a fixed constant whose value is close to .
ChoiceNet is a twofold architecture: (a) a base network and (b) an MCDN block (see Figure 1). The base network extracts features for a given dataset. Then the MCDN block estimates the densities of the data-generating distributions through . In contrast to the mixture density network (MDN), the MCDN block samples correlated weights using the Cholesky transform during the density estimation process. Consequently, the MCDN block is able to generate the correlated mean vectors . The overall mechanism of ChoiceNet can be summarized as follows:
Modules  
Cholesky Transform  
Outputs 
By Theorem 3, for each
and the output density is modeled via correlated mean vectors. Note that both and are minimized, when
. Furthermore, as we apply Gaussian distributions for the Cholesky transform, the influence of uninformative or independent data, whose correlations are close to 0, is attenuated as their variances increase Kendall and Gal (2017).
3.3 Training Objectives
Denote a training dataset by . We consider both regression and classification tasks.
Regression
For the regression task, we employ both loss and the standard MDN loss Bishop (1994); Choi et al. (2017); Christopher (2016);
(6) 
where and are hyperparameters and is the density of multivariate Gaussian:
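As a rough illustration of the standard MDN term only, the sketch below evaluates the negative log-likelihood of a Gaussian mixture in a numerically stable way. It does not reproduce Eq. (6): the additional loss term and the hyperparameters are omitted, and the diagonal covariance, the array shapes, and the function names are assumptions made for this example.

```python
import numpy as np

def log_gaussian_diag(y, mu, var):
    """Log density of N(y; mu, diag(var)), summed over the last (output) axis."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var, axis=-1)

def mdn_nll(y, pi, mu, var):
    """Mean negative log-likelihood of a K-component Gaussian mixture.

    y:   (N, D) targets
    pi:  (N, K) mixture probabilities, each row sums to 1
    mu:  (N, K, D) component means
    var: (N, K, D) component variances (diagonal covariance assumed)
    """
    log_comp = np.log(pi) + log_gaussian_diag(y[:, None, :], mu, var)  # (N, K)
    # Log-sum-exp over components, shifted by the row maximum for stability.
    m = log_comp.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(-log_mix.mean())

# Two identical standard-normal components evaluated at y = 0.
y = np.zeros((4, 1))
pi = np.full((4, 2), 0.5)
mu = np.zeros((4, 2, 1))
var = np.ones((4, 2, 1))
nll = mdn_nll(y, pi, mu, var)
print(nll)
```

With identical components the mixture collapses to a single standard normal, so the value equals 0.5 log(2π), which gives a simple sanity check for the implementation.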
We also add weight decay and the following KullbackLeibler regularizer to (6)
(7) 
The above KL regularizer encourages the mixture components with strong correlations to have high mixture probabilities. This guidance is useful since ChoiceNet uses the mean vector
of the first mixture component at the inference stage.
Classification
4 Experiments
4.1 Regression Tasks
We conduct two regression experiments: 1) a synthetic scenario where the training dataset contains outliers sampled from other distributions and 2) a track driving scenario where the driving demonstrations are collected from two different driving modes.
Synthetic Example
We first apply ChoiceNet to a simple one-dimensional regression problem of fitting where as shown in Figure 5. ChoiceNet is compared with a naive multilayer perceptron (MLP) and a mixture density network (MDN) with five mixtures, where all networks have two hidden layers with nodes and a ReLU activation function. Gaussian process regression (GPR) Rasmussen (2006), leveraged Gaussian process regression (LGPR) with leverage optimization Choi et al. (2016), and robust Gaussian process regression (RGPR) with an infinite Gaussian process mixture model Rasmussen and Ghahramani (2002) are also compared. For the GP-based methods, we use a squared-exponential kernel function, and the hyperparameters are determined using a simple median trick Dai et al. (2014), which selects the length parameter of a kernel function to be the median of all pairwise distances between training data. To evaluate performance on corrupted datasets, we randomly replace the original target values with outliers whose output values are uniformly sampled from to . We vary the outlier rates from (clean) to (extremely noisy).
Table 1 reports the RMSEs (root mean square errors) between the reference target function and the fitted results of ChoiceNet and the other compared methods. Given an intact training dataset, all the methods show stable performance in that the RMSEs are all below . Given training datasets whose outlier rates exceed , however, only ChoiceNet successfully fits the target function, whereas the other methods fail, as shown in Figure 5.
Outliers  ChoiceNet  MDN  MLP  GPR  LGPR  RGPR 
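The corruption and evaluation protocol of this experiment can be sketched as follows. The target function, the outlier range, and the outlier rate used below are placeholders chosen for illustration, since the exact values are not restated here; the helper names `inject_outliers`, `rmse`, and `median_trick` are likewise hypothetical.

```python
import numpy as np

def inject_outliers(y, rate, low, high, rng):
    """Replace a fraction `rate` of the targets with uniform outliers in [low, high)."""
    y_noisy = y.copy()
    n_out = int(round(rate * len(y)))
    idx = rng.choice(len(y), size=n_out, replace=False)
    y_noisy[idx] = rng.uniform(low, high, size=n_out)
    return y_noisy, idx

def rmse(y_ref, y_fit):
    """Root mean square error between a reference and a fitted curve."""
    return float(np.sqrt(np.mean((np.asarray(y_ref) - np.asarray(y_fit)) ** 2)))

def median_trick(x):
    """Kernel length parameter: median of all pairwise distances between inputs."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(x), k=1)     # each unordered pair once
    return float(np.median(dist[upper]))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 100)
y_clean = np.sin(x)                          # placeholder target, not the paper's
y_noisy, outlier_idx = inject_outliers(y_clean, rate=0.3, low=-1.0, high=3.0, rng=rng)
print(len(outlier_idx), rmse(y_clean, y_noisy), median_trick(x))
```

Note that the RMSE is always computed against the clean reference function rather than the corrupted targets, so a method that fits the outliers is penalized.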

To further inspect whether ChoiceNet can distinguish between the target distribution and noise distributions, we train ChoiceNet on two datasets. In particular, we use the same target function and replace of the output values whose input values are within to using two different corruptions: one uniformly sampled from to and the other from a flipped target function. For this experiment, we set for better visualization. As shown in Figure 3 and 3, ChoiceNet successfully fits the target function. The correlations of the second component decrease as outliers are introduced, as shown in Figure 3 and 3. Surprisingly, when the target and noise distributions are negatively correlated (the flipped function case), the correlations of the second component become as depicted in Figure 3. In contrast, for the uniform corruption case, the correlations of the second component are within and . We argue that this clearly shows the capability of ChoiceNet to distinguish the target distribution from noisy distributions.
Autonomous Driving Experiment
In this experiment, we apply ChoiceNet to an autonomous driving scenario in a simulated environment. In particular, the tested methods are asked to learn a policy from driving demonstrations collected from both safe and careless driving modes. We use the same set of methods used for the previous task. The policy function is defined as a mapping from four-dimensional input features, consisting of three frontal distances to the left, center, and right lanes and the lane deviation distance from the center of the lane, to the desired heading. Once the desired heading is computed, the angular velocity of the car is computed by and the directional velocity is fixed to . The driving demonstrations are collected from keyboard inputs by human users. The objective of this experiment is to assess performance on a training set generated from two different distributions. Note that this task does not have a reference target function, as all demonstrations are collected manually. Hence, we evaluate the compared methods by running the trained policies on a straight track with randomly placed static cars.
Table 2 and Table 3 indicate collision rates and RMS lane deviation distances of the tested methods, respectively, where the statistics are computed from independent runs on the straight lane by randomly placing static cars as shown in Figure 7. ChoiceNet clearly outperforms compared methods in terms of both safety (low collision rates) and stability (low RMS lane deviation distances).
Outliers  ChoiceNet  MDN  MLP  GPR  LGPR  RGPR 

Outliers  ChoiceNet  MDN  MLP  GPR  LGPR  RGPR 

4.2 Classification Tasks
We conduct classification experiments on the MNIST and CIFAR-10 datasets to evaluate the performance of ChoiceNet on corrupted labels. To generate noisy datasets, we follow the setting in Zhang et al. (2017), which randomly shuffles a percentage of the labels in the dataset. (In the corrupted label setting, for a given corruption probability , the expected ratio of correct labels is . Additional experiments in which a percentage of the labels is replaced with random labels or with a fixed label can be found in the Appendix.) We vary the corruption probabilities from to for the MNIST dataset and from to for the CIFAR-10 dataset and compare median accuracies over five runs for each configuration.
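One way to realize such a shuffle is sketched below, under the assumption that a fraction `p` of the examples is selected uniformly at random and their labels are permuted among themselves; the function name `shuffle_labels` and the toy label set are hypothetical. For balanced classes, a permuted label coincides with the original one with probability of roughly 1/C, so by our own calculation under these assumptions about 1 - p + p/C of the labels remain correct for C classes.

```python
import numpy as np

def shuffle_labels(labels, p, rng):
    """Permute the labels of a randomly chosen fraction p of the examples."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_sel = int(round(p * len(labels)))
    idx = rng.choice(len(labels), size=n_sel, replace=False)
    # Reassign the selected labels among the selected positions.
    noisy[idx] = labels[rng.permutation(idx)]
    return noisy

rng = np.random.default_rng(0)
num_classes = 10
labels = np.repeat(np.arange(num_classes), 10_000)   # balanced toy labels
p = 0.5
noisy = shuffle_labels(labels, p, rng)
correct_ratio = float(np.mean(noisy == labels))
expected_ratio = 1.0 - p + p / num_classes
print(correct_ratio, expected_ratio)
```

Because the labels are only permuted, the class histogram of the corrupted set is identical to that of the clean set, which distinguishes this scheme from replacing labels with independent random draws.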
For the MNIST experiments, we construct two networks: a network with two residual blocks He et al. (2016b) with convolutional layers followed by a fully connected layer with output neurons (ConvNet) and a network with the same two residual blocks followed by an MCDN block (ChoiceNet). We train each network for epochs with a fixed learning rate of .
For the CIFAR experiments, we adopt WideResNet (WRN) Zagoruyko and Komodakis (2016) with layers and a widening factor of . To construct ChoiceNet, we replace the last layer of WideResNet with an MCDN block. We set , , , and modules consist of two fully connected layers with hidden units and a ReLU activation function. We train each network for epochs with a minibatch size of . We begin with a learning rate of , and it decays by after and epochs. We apply random horizontal flip and random crop with 4-pixel padding and use a weight decay of for the baseline network as in He et al. (2016b). However, to train ChoiceNet, we reduce the weight decay rate to and apply gradient clipping at . We also lower the learning rate to for the first epoch to stabilize training.
On both the MNIST and CIFAR-10 experiments, we also compare ChoiceNet with Mixup Zhang et al. (2017), which, to the best of our knowledge, shows state-of-the-art performance on noisy labels. We set the parameter of Mixup to be for the baseline network as suggested in the original paper. For ChoiceNet, we set to be .
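For reference, the Mixup baseline trains on convex combinations of pairs of examples and of their labels, with the mixing coefficient drawn from a Beta distribution. The following is a minimal NumPy sketch of one mixed minibatch; the function name `mixup_batch`, the batch shapes, and the value of `alpha` are illustrative assumptions and are not the settings used in the experiments.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each example with a randomly paired example from the same batch.

    lam ~ Beta(alpha, alpha) is shared by the whole batch in this sketch.
    """
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 32, 3))              # toy image batch
y = np.eye(10)[rng.integers(0, 10, size=8)]          # one-hot toy labels
x_mix, y_mix, lam = mixup_batch(x, y, alpha=1.0, rng=rng)
print(x_mix.shape, y_mix.sum(axis=1), lam)
```

The mixed labels remain valid probability vectors, so the same cross-entropy loss can be applied without modification.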
Corruption  Configuration  Best  Last
50%  ConvNet  95.4  89.5
50%  ConvNet+Mixup  97.2  96.8
50%  ChoiceNet  99.2  99.2
80%  ConvNet  86.3  76.9
80%  ConvNet+Mixup  87.2  87.2
80%  ChoiceNet  98.2  97.6
90%  ConvNet  76.1  69.8
90%  ConvNet+Mixup  74.7  74.7
90%  ChoiceNet  94.7  89.0
95%  ConvNet  72.5  64.4
95%  ConvNet+Mixup  69.2  68.2
95%  ChoiceNet  88.5  80.0
Corruption  Configuration  Best  Last
20%  WRN (WideResNet)  88.5  85.3
20%  CN (ChoiceNet)  90.7  90.3
20%  WRN + Mixup  92.9  92.3
20%  CN + Mixup  92.5  92.3
50%  WRN  79.7  59.3
50%  CN  85.9  84.6
50%  WRN + Mixup  87.3  83.1
50%  CN + Mixup  88.4  87.9
80%  WRN  67.8  27.4
80%  CN  69.8  65.2
80%  WRN + Mixup  72.1  62.9
80%  CN + Mixup  76.1  75.4
The classification results on the MNIST dataset and the CIFAR dataset are shown in Table 4 and Table 5, respectively. In the MNIST experiments, ChoiceNet consistently outperforms ConvNet and ConvNet+Mixup by a significant margin, and the difference between the accuracies of ChoiceNet and the others becomes clearer as the corruption probability increases. In particular, the best test accuracy of ChoiceNet reaches even when of the training labels are randomly shuffled.
In the CIFAR-10 experiments, ChoiceNet outperforms WideResNet and achieves an accuracy over even when of the labels are shuffled, whereas the accuracy of WideResNet drops below . When we inspect the training accuracies on the shuffled set, WideResNet tends to overfit to (memorize) the noisy labels and shows training accuracy. In contrast, ChoiceNet shows . Detailed learning curves can be found in the Appendix. When trained with Mixup, both networks become robust to noisy labels to some extent. However, the results of the two networks still show significant differences except for the corrupted experiments, on which both of them show similar accuracies. Interestingly, when ChoiceNet and Mixup are combined, the resulting model achieves a high accuracy of even on the shuffled dataset. We also note that ChoiceNet (without Mixup) outperforms WideResNet+Mixup in terms of the last accuracies when the corruption ratio is over .
5 Conclusion
In this paper, we have presented ChoiceNet, which can robustly learn a target distribution given noisy training data. The keystone of ChoiceNet is the mixture of correlated density network block, which can estimate the densities of data distributions using a set of correlated mean functions. We have demonstrated that ChoiceNet can robustly infer the target distribution from corrupted training data in the following tasks: regression with synthetic data, autonomous driving, and MNIST and CIFAR-10 image classification. Our experiments show that ChoiceNet outperforms the compared methods in handling noisy data.
Selecting proper hyperparameters, including the optimal number of mixture components, is a compelling topic for the practical usage of ChoiceNet. Furthermore, one can use ChoiceNet for active learning by evaluating the quality of each training example through the lens of correlations. We leave these as important questions for future work.
References
 AP and REUTERS [2016] AP and REUTERS. Tesla working on ’improvements’ to its autopilot radar changes after Model S owner became the first self-driving fatality, June 2016. URL https://goo.gl/XkzzQd.
 Aung et al. [2017] A. M. Aung, Y. Fadila, R. Gondokaryono, and L. Gonzalez. Building robust deep neural networks for road sign detection. arXiv preprint arXiv:1712.09327, 2017.
 Bekker and Goldberger [2016] A. J. Bekker and J. Goldberger. Training deep neural-networks based on unreliable labels. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2682–2686. IEEE, 2016.
 Bi et al. [2014] W. Bi, L. Wang, J. T. Kwok, and Z. Tu. Learning to predict from crowdsourced data. In UAI, pages 82–91, 2014.
 Bishop [1994] C. M. Bishop. Mixture density networks. 1994.
 Bonilla et al. [2008] E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In Proc. of the Advances in Neural Information Processing Systems, pages 153–160, 2008.
 Carlini and Wagner [2017] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57. IEEE, 2017.
 Choi et al. [2016] S. Choi, K. Lee, and S. Oh. Robust learning from demonstration using leveraged Gaussian processes and sparse constrained optimization. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2016.
 Choi et al. [2017] S. Choi, K. Lee, S. Lim, and S. Oh. Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. arXiv preprint arXiv:1709.02249, 2017.
 Christopher [2016] M. B. Christopher. PATTERN RECOGNITION AND MACHINE LEARNING. SpringerVerlag New York, 2016.
 Chung et al. [2014] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Dai et al. [2014] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.F. F. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. In Proc. of the Advances in Neural Information Processing Systems, pages 3041–3049, 2014.

 Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
 Fawzi et al. [2017] A. Fawzi, S. M. M. Dezfooli, and P. Frossard. A geometric perspective on the robustness of deep networks. IEEE Signal Processing Magazine, 2017.
 Goldberger and Ben-Reuven [2017] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In Proc. of International Conference on Learning Representations, 2017.
 Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
 Goodfellow et al. [2014] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Gulshan et al. [2016] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Journal of the American Medical Association, 316(22):2402–2410, 2016.
 Hampel et al. [2011] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
 He et al. [2016a] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of the IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016a.
 He et al. [2016b] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016b.
 Hendrycks et al. [2018] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. arXiv preprint arXiv:1802.05300, 2018.
 Jiang et al. [2017] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. Mentornet: Regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.
 Jindal et al. [2016] I. Jindal, M. Nokleby, and X. Chen. Learning deep networks from noisy labels with dropout regularization. In Proc. of IEEE International Conference on Data Mining, pages 967–972. IEEE, 2016.
 Kendall and Gal [2017] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5580–5590, 2017.
 Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma [2017] D. P. Kingma. Variational inference & deep learning: A new synthesis. University of Amsterdam, 2017.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

 Lee [2013] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
 Li et al. [2017] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and J. Li. Learning from noisy labels with distillation. arXiv preprint arXiv:1703.02391, 2017.
 Liu et al. [2017] X. Liu, S. Li, M. Kan, S. Shan, and X. Chen. Self-error-correcting convolutional neural network for learning with noisy labels. In Proc. of IEEE International Conference on Automatic Face & Gesture Recognition, pages 111–117. IEEE, 2017.
 Malach and Shalev-Shwartz [2017] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pages 961–971, 2017.
 Paden et al. [2016] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles, 1(1):33–55, 2016.
 Papernot et al. [2016] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, pages 372–387. IEEE, 2016.
 Patrini et al. [2017] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: a loss correction approach. In Proc. of the Conference on Computer Vision and Pattern Recognition, volume 1050, page 22, 2017.
 Rasmussen [2006] C. E. Rasmussen. Gaussian processes for machine learning. 2006.
 Rasmussen and Ghahramani [2002] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of gaussian process experts. In Advances in Neural Information Processing Systems, pages 881–888, 2002.
 Reed et al. [2014] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
 Rolnick et al. [2017] D. Rolnick, A. Veit, S. Belongie, and N. Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
 Sinha et al. [2017] A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
 Tokozume et al. [2018] Y. Tokozume, Y. Ushiku, and T. Harada. Learning from between-class examples for deep sound recognition. In Proc. of International Conference on Learning Representations, 2018.
 Veit et al. [2017] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In Conference on Computer Vision and Pattern Recognition, 2017.
 Xie et al. [2016] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel: Regularizing cnn on the loss layer. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
 Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
 Zhang et al. [2016] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In Proc. of International Conference on Learning Representations, 2016.
 Zhang et al. [2017] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proc. of International Conference on Learning Representations, 2017.
Appendix A Proof of Theorems in Section 3.1
See 1
Proof of Theorem 1.
See 2
See 3
Proof of Theorem 3.
Remark.
Recall the definition of Cholesky transform: for
(12) 
Note that we do not assume that and follow any particular distributions. Hence all of the above theorems hold for a general class of random variables. Additionally, by Theorem 2 and (12), has the following dependent behaviors:
Thus strongly correlated weights, i.e., , provide predictions with confidence, while uncorrelated weights encompass uncertainty. These different behaviors of the weights act as regularization and prevent overfitting caused by bad data, since uncorrelated and negatively correlated weights absorb vague and outlier patterns, respectively.
Appendix B Experiments
B.1 Regression Tasks
B.1.1 Synthetic Example
We provide more fitting results for the synthetic example in Figure 5. Given an intact dataset, all compared methods robustly fit the given training data. However, other methods fail to correctly fit the underlying target function given corrupted data. When the outlier rate exceeds all tested methods fail to fit.
B.1.2 Autonomous Driving Experiment
Here, we describe the features used for the autonomous driving experiments. As stated in the manuscript, we use a four-dimensional feature consisting of the lane deviation distance of the ego car and three frontal distances to the closest car in the left, center, and right lanes, as shown in Figure 6. We upper-bound the frontal distance to . Figure 7 illustrates manually collected trajectories of a safe driving mode and a careless driving mode.
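The feature construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `driving_feature`, the argument names, and the example numbers are hypothetical, and the upper bound on the frontal distances is left as a required argument because its value is not given here.

```python
import numpy as np


def driving_feature(lane_deviation, d_left, d_center, d_right, max_distance):
    """Build the four-dimensional feature described in the text.

    max_distance is the upper bound applied to the three frontal distances;
    its value is not given here, so it is left as a required argument.
    """
    frontal = np.minimum(
        np.array([d_left, d_center, d_right], dtype=float), max_distance
    )
    return np.concatenate(([float(lane_deviation)], frontal))


# Hypothetical numbers, in arbitrary units, purely for illustration.
feature = driving_feature(0.3, 12.0, 85.0, 40.0, max_distance=50.0)
print(feature)
```

Clipping the frontal distances keeps the feature bounded when a lane is empty or the closest car is far away.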
B.2 Classification Tasks
B.2.1 MNIST
Here, we present additional experimental results on the MNIST dataset under the following three scenarios:

Biased label experiments, where we randomly assign a given percentage of the training labels to label .

Random shuffle experiments, where we randomly replace a given percentage of the training labels with labels drawn from a uniform multinomial distribution.

Random permutation experiments, where we replace a given percentage of the labels according to a label permutation matrix, following the random permutation in Reed et al. [2014].
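The three corruption scenarios above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the `target_label` argument, and the cyclic-shift permutation are hypothetical placeholders (in particular, the permutation is not the rule used in Reed et al. [2014]).

```python
import numpy as np


def _pick(rng, n, rate):
    """Indices of round(rate * n) distinct samples chosen uniformly at random."""
    return rng.choice(n, size=int(round(rate * n)), replace=False)


def biased_labels(labels, rate, target_label, rng):
    """Assign a fraction `rate` of the labels to one fixed class."""
    out = labels.copy()
    out[_pick(rng, len(labels), rate)] = target_label
    return out


def shuffled_labels(labels, rate, num_classes, rng):
    """Replace a fraction `rate` of the labels with uniform random classes."""
    out = labels.copy()
    idx = _pick(rng, len(labels), rate)
    out[idx] = rng.integers(0, num_classes, size=len(idx))
    return out


def permuted_labels(labels, rate, permutation, rng):
    """Map a fraction `rate` of the labels through a fixed class permutation."""
    out = labels.copy()
    idx = _pick(rng, len(labels), rate)
    out[idx] = np.asarray(permutation)[labels[idx]]
    return out


rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)
# Placeholder permutation (cyclic shift); not the rule from Reed et al. [2014].
cyclic = (np.arange(10) + 1) % 10
noisy_biased = biased_labels(clean, 0.25, target_label=0, rng=rng)
noisy_shuffled = shuffled_labels(clean, 0.5, num_classes=10, rng=rng)
noisy_permuted = permuted_labels(clean, 0.4, cyclic, rng)
```

Note that in the biased and random shuffle cases a replaced label may coincide with the original one, so the effective fraction of wrong labels can be lower than the nominal rate. With a permutation that has no fixed points, every selected label is changed.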
The best and final accuracies on the intact test dataset for the biased label experiments are shown in Table 6. At all corruption rates, ChoiceNet achieves the best performance compared to the two baseline methods. The learning curves of the biased label experiments are depicted in Figure 8. In particular, we observe unstable test-accuracy curves for ConvNet and Mixup. Since the training accuracies of these methods remain stable, this suggests that the networks are simply memorizing the noisy labels. In contrast, the learning curves of ChoiceNet are stable, which indicates the robustness of the proposed method.
The experimental results and learning curves of the random shuffle experiments are shown in Table 7 and Figure 9. The convolutional neural networks trained with Mixup show robust learning behavior when of the training labels are uniformly shuffled. However, given an extremely noisy dataset ( and ), the test accuracies of the baseline methods decrease as the number of epochs increases. ChoiceNet shows outstanding robustness to the noisy dataset in that its test accuracies do not drop even after epochs when the corruption rate is below . For the case, however, overfitting occurs in all methods.
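To make the comparison between best and final accuracies concrete, the following sketch computes the drop from the best to the last test accuracy, using the 90% corruption rows of the random shuffle results reported in this appendix. The dictionary layout and the helper name `accuracy_drop` are illustrative assumptions.

```python
# (best, last) test accuracies in percent at 90% corruption (random shuffle).
results_90 = {
    "ConvNet": (76.1, 54.1),
    "ConvNet+Mixup": (78.6, 42.4),
    "ChoiceNet": (95.9, 95.2),
}


def accuracy_drop(best, last):
    """Percentage points lost between the best and the final epoch."""
    return round(best - last, 1)


drops = {name: accuracy_drop(*pair) for name, pair in results_90.items()}
for name, drop in drops.items():
    print(f"{name}: {drop} points")
```

A large drop indicates that test accuracy degrades as training continues, which is the overfitting pattern discussed above, while a drop close to zero indicates a stable learning curve.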
Table 8 and Figure 10 show the results of the random permutation experiments. Specifically, we change the labels of randomly selected training data using a permutation rule: following Reed et al. [2014]. We argue that this setting is more challenging than the random shuffle case in that we intentionally change the labels based on a predefined permutation rule.
Table 6: Biased label experiments on MNIST (test accuracy, %).
Corruption  Configuration  Best  Last
25%         ConvNet        95.4  89.5
25%         ConvNet+Mixup  97.2  96.8
25%         ChoiceNet      99.2  99.2
40%         ConvNet        86.3  76.9
40%         ConvNet+Mixup  87.2  87.2
40%         ChoiceNet      98.2  97.6
45%         ConvNet        76.1  69.8
45%         ConvNet+Mixup  74.7  74.7
45%         ChoiceNet      94.7  89.0
47%         ConvNet        72.5  64.4
47%         ConvNet+Mixup  69.2  68.2
47%         ChoiceNet      88.5  80.0
Table 7: Random shuffle experiments on MNIST (test accuracy, %).
Corruption  Configuration  Best  Last
50%         ConvNet        97.1  95.9
50%         ConvNet+Mixup  98.0  97.8
50%         ChoiceNet      99.1  99.0
80%         ConvNet        90.6  79.0
80%         ConvNet+Mixup  95.3  95.1
80%         ChoiceNet      98.3  98.3
90%         ConvNet        76.1  54.1
90%         ConvNet+Mixup  78.6  42.4
90%         ChoiceNet      95.9  95.2
95%         ConvNet        50.2  31.3
95%         ConvNet+Mixup  53.2  26.6
95%         ChoiceNet      84.5  66.0
Table 8: Random permutation experiments on MNIST (test accuracy, %).
Corruption  Configuration  Best  Last
25%         ConvNet        94.4  92.2
25%         ConvNet+Mixup  97.6  97.6
25%         ChoiceNet      99.2  99.2
40%         ConvNet        77.9  71.8
40%         ConvNet+Mixup  84.0  83.0
40%         ChoiceNet      99.2  98.8
45%         ConvNet        68.0  61.4
45%         ConvNet+Mixup  68.9  55.8
45%         ChoiceNet      98.0  97.1
47%         ConvNet        58.2  53.9
47%         ConvNet+Mixup  60.2  53.4
47%         ChoiceNet      92.5  86.1
B.2.2 CIFAR10
Figure 11 presents detailed learning curves of the CIFAR10 experiments while varying the noise level from to , following the configurations in Zhang et al. [2017].