Deep learning is regularly used in safety-critical applications. For example, deep learning is used in the object-recognition systems of autonomous cars, where malfunction may lead to severe injury or death. It has been shown that data corruption can have dramatic effects on such critical deep-learning pipelines Akhtar and Mian (2018); Yuan et al. (2019); Kurakin et al. (2016a); Wang and Yu (2019); Sharif et al. (2016); keen security lab (2019); Kurakin et al. (2016b). This insight has sparked research on robust deep learning based on, for example, adversarial training Madry et al. (2017); Kos and Song (2017); Papernot et al. (2015); Tramèr et al. (2017); Salman et al. (2019), sensitivity analysis Wang et al. (2018), or noise correction Patrini et al. (2017); Yi and Wu (2019); Arazo et al. (2019).
Research on robust deep learning usually focuses on “adversarial attacks,” that is, intentional data corruptions designed to cause failures of specific pipelines. In contrast, the fact that data is often of poor quality much more generally has received little attention. But low-quality data is very common, simply because data used for deep learning is rarely collected based on rigid experimental designs but rather amassed from whatever resources are available Roh et al. (2021); Friedrich et al. (2020). Among the few papers that consider such corruptions are Barron (2019); Belagiannis et al. (2015); Jiang et al. (2018); Wang et al. (2016); Lederer (2020), who replace the standard loss functions, such as the squared-error and soft-max cross-entropy loss functions, by Lipschitz-continuous alternatives, such as Huber loss functions. But there is much room for improvement, especially because the existing methods do not make efficient use of the uncorrupted samples in the data.
In this paper, we devise a novel approach to deep learning with “randomly” corrupted data. The inspiration is the very recent line of research on median-of-means Lugosi and Mendelson (2019a, b); Lecué et al. (2020) and Le Cam’s procedure Le Cam (1973, 1986); Lecué and Lerasle (2017a, b). We especially mimic Lecué and Lerasle (2017b) in defining parameter updates in a min-max fashion. We show that this approach indeed outmatches other approaches on simulated and real-world data.
Our three main contributions are:
We introduce a robust training scheme for neural networks that incorporates the median-of-means principle through Le Cam’s procedure.
We show that our approach can be implemented by using a simple gradient-based algorithm.
We demonstrate that our approach outperforms standard deep-learning pipelines across different levels of corruption.
Outline of the paper
In Section 2, we state the problem and give a step-by-step derivation of the DeepMoM estimator (Definition 1). In Section 3, we highlight similarities and differences to other approaches. In Section 4.1, we establish a stochastic-gradient algorithm to compute our estimators (Algorithm 1). In Sections 4.2 and 4.3, we demonstrate that our approach rivals or outmatches traditional training schemes based on least-squares and on cross-entropy both for simulated corrupted and uncorrupted data. In Section 5, we demonstrate the same on real data. Finally, in Section 6, we summarize the results of this work and conclude.
2 Framework and estimator
We first introduce the statistical framework and our corresponding estimator. We consider data with such that
where is the unknown data-generating function, and
are the stochastic error vectors. In particular, each is an input of the system and the corresponding output. The data is partitioned into two parts: the first part comprises the informative samples; the second part comprises the problematic samples (such as corrupted samples, irrespective of the source of the corruption). The two parts are indexed by and , respectively. Of course, the sets and are unknown in practice (otherwise, one could simply remove the problematic samples). In brief, we consider a standard deep-learning setup, with the only exception that we make an explicit distinction between “good” and “bad” samples.
Our general goal is then, as usual, to approximate the data-generating function defined in (1). But our specific goal is to take into account the fact that there may be problematic samples. Our candidate functions are feed-forward neural networks of the form
indexed by the parameter spaces
where is the number of layers, are the (finite-dimensional) weight matrices with and , are bias parameters, andfor , , and Lederer (2021).
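As a minimal sketch of the network class in (2), the following NumPy code evaluates a feed-forward ReLU network at a single input. The function names `relu` and `forward`, the list-of-pairs parameter layout, and the choice of a linear output layer are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def relu(z):
    """ReLU activation, applied elementwise."""
    return np.maximum(z, 0.0)

def forward(params, x):
    """Evaluate a feed-forward ReLU network at the input vector x.

    params is a list of (weight matrix, bias vector) pairs, ordered from
    the input layer to the output layer (an assumed layout).
    """
    h = np.asarray(x, dtype=float)
    for weights, bias in params[:-1]:
        h = relu(weights @ h + bias)   # hidden layers
    weights, bias = params[-1]
    return weights @ h + bias          # linear output layer (assumption)

# Hypothetical example: input dimension 3, one hidden layer of width 4,
# output dimension 2.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
output = forward(params, np.ones(3))
```

The parameter list plays the role of the weight matrices and bias parameters that index the network class.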
Neural networks, such as those in (2), are typically fitted to data by minimizing a sum of loss functions: . The two standard loss functions for regression problems () and classification problems () are the squared-error (SE) loss
and the soft-max cross entropy (SCE) loss
respectively. It is well known that such loss functions are efficient on benign data but sensitive to heavy-tailed data, corrupted samples, and so forth Huber (1964).
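The two standard losses can be sketched per sample as follows. The function names are hypothetical, and the scaling (no factor 1/2 in the squared error, natural logarithm in the cross entropy) is an assumption that may differ from the paper's convention.

```python
import numpy as np

def squared_error_loss(prediction, target):
    """Squared Euclidean distance between prediction and target."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(diff ** 2))

def softmax_cross_entropy_loss(scores, one_hot_target):
    """Cross entropy between the soft-max of the scores and a one-hot label."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)                       # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))   # log soft-max
    return float(-np.sum(np.asarray(one_hot_target) * log_probs))
```

Both functions grow without bound in the size of the error, which is one way to see why a few corrupted samples can dominate an average of such losses.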
We want to keep those loss functions’ efficiency on benign samples, but, at the same time, avoid their failure in the presence of problematic samples. We achieve this by a median-of-means (MoM) approach inspired by Lecué and Lerasle (2017b). The details of the approach are mathematically intricate, but the general idea is simple. We thus describe the general idea first and then formally define the estimator afterward. The approach can roughly be formulated in terms of three-step updates:
Partition the data into blocks of samples.
On each block, calculate the empirical mean of the loss increment with respect to two separate sets of parameters and the standard loss function ( in regression and in classification).
Use the block that corresponds to the median of the empirical means in Step 2 to update the parameters.
Go back to Step 1 until convergence.
Let us now be more formal. In the first step, we consider blocks , that are an equipartition of , which means the blocks have equal cardinalities , that cover the entire index set . In practice, we set , where and , if does not divide .
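The block construction in the first step can be sketched as below. The helper name `make_blocks`, the random shuffling, and the handling of the case where the number of blocks does not divide the sample size (block sizes that differ by at most one) are assumptions, since the exact rule is not recoverable here.

```python
import numpy as np

def make_blocks(n_samples, n_blocks, rng):
    """Randomly split the indices 0, ..., n_samples - 1 into n_blocks blocks.

    Assumption: if n_blocks does not divide n_samples, the block sizes
    differ by at most one, and every index still belongs to exactly one block.
    """
    if not 1 <= n_blocks <= n_samples:
        raise ValueError("need 1 <= n_blocks <= n_samples")
    shuffled = rng.permutation(n_samples)
    return np.array_split(shuffled, n_blocks)

blocks = make_blocks(n_samples=10, n_blocks=3, rng=np.random.default_rng(0))
```

In this example, ten indices are split into one block of four and two blocks of three indices.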
Given , the quantities in Step 2 are defined by
and we denote by an
-quantile of the set
in particular, in Step 3, we compute the empirical median-of-means in Step 2 by defining
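Steps 2 and 3 can be sketched as follows, given per-sample losses for two parameter sets. The names `block_means` and `median_block` are hypothetical, and picking the lower middle block for an even number of blocks is an assumption made so that the median is always attained by an actual block.

```python
import numpy as np

def block_means(losses_first, losses_second, blocks):
    """Mean of the per-sample loss increments on each block (Step 2)."""
    increments = (np.asarray(losses_first, dtype=float)
                  - np.asarray(losses_second, dtype=float))
    return np.array([increments[block].mean() for block in blocks])

def median_block(losses_first, losses_second, blocks):
    """Return (index of the median block, median-of-means value) (Step 3).

    Assumption: for an even number of blocks, the lower of the two middle
    blocks is used.
    """
    means = block_means(losses_first, losses_second, blocks)
    order = np.argsort(means)
    k = order[(len(means) - 1) // 2]
    return int(k), float(means[k])
```

Returning the block index in addition to the value is a design choice: gradient-based updates need to know which samples form the median block.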
Our estimator is then the solution of the min-max problem of the increment tests defined in the following. (The estimator can also be seen as a generalization of the standard least-squares/cross-entropy approaches—see Appendix C.)
Definition 1 (DeepMoM).
For and given blocks described in the above, we define
The rationale is as follows: on the one hand, using least-squares/cross-entropy on each block ensures efficient use of the “good” samples; on the other hand, using the median over the blocks removes the corruptions and, therefore, ensures robustness toward the “bad” samples.
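A toy calculation illustrates this rationale. The numbers are made up for illustration, and the helper `median_of_means` uses consecutive blocks for simplicity, which is an assumption rather than the paper's construction.

```python
import numpy as np

def median_of_means(values, n_blocks):
    """Median of the block-wise means (consecutive blocks, for simplicity)."""
    blocks = np.array_split(np.asarray(values, dtype=float), n_blocks)
    return float(np.median([block.mean() for block in blocks]))

losses = np.ones(12)
losses[0] = 1000.0   # a single corrupted sample

plain_mean = float(losses.mean())                     # pulled far away from 1
robust_value = median_of_means(losses, n_blocks=3)    # only one block is contaminated
```

The corrupted sample contaminates one of the three block means, so the median over the blocks is still determined by clean samples, whereas the plain mean is not.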
3 Related literature
We now take a moment to highlight relationships with other approaches as well as differences from those approaches. Since problematic samples are the rule rather than an exception in deep learning, the sensitivity of the standard loss functions has sparked much research interest. In regression settings, for example, Barron (2019); Belagiannis et al. (2015); Jiang et al. (2018); Wang et al. (2016); Lederer (2020) have replaced the squared-error loss by the absolute-deviation loss , which generates estimators for the empirical median, or the Huber loss function Huber (1964); Huber and Ronchetti (2009); Hampel et al. (2011)
where is a tuning parameter that determines the robustness of the function. In classification settings, for example, Goodfellow et al. (2015); Madry et al. (2019); Wong and Kolter (2018) have added an penalty on the parameters. Changing the loss function in those ways can make the estimators robust toward the problematic data, but it also forfeits the efficiency of the standard loss functions in using the informative data. In contrast, our approach offers robustness with respect to the “bad” samples but also efficient use of the “good” samples.
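For reference, the two robust regression losses mentioned above can be sketched as functions of the residual. The Huber loss below follows the classical textbook form with tuning parameter `k`; the scaling is an assumption and may differ from the convention used in this paper.

```python
import numpy as np

def absolute_deviation_loss(residual):
    """Absolute-deviation loss of a residual."""
    return np.abs(np.asarray(residual, dtype=float))

def huber_loss(residual, k):
    """Classical Huber loss: quadratic for |residual| <= k, linear beyond."""
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= k, 0.5 * r ** 2, k * (r - 0.5 * k))
```

Small values of `k` make the loss behave like the absolute deviation, and large values make it behave like the squared error.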
Another related topic is adversarial attacks. Adversarial attacks are intentional corruptions of the data with the goal of exploiting weaknesses of specific deep-learning pipelines Kurakin et al. (2016b); Goodfellow et al. (2018); Brückner et al. (2012); Su et al. (2019); Athalye et al. (2018). Hence, papers on adversarial attacks and our paper both study data corruption. However, the perspectives on corruptions are very different: while the literature on adversarial attacks has a notion of a “mean-spirited opponent,” we have a much more general notion of “good” and “bad” samples. The adversarial-attack perspective is much more common in deep learning, but our view is much more common in the sciences more generally. The different notions also lead to different methods: methods in the context of adversarial attacks concern specific types of attacks and pipelines, while our method can be seen as a way to render deep learning more robust in general. It would be misleading to include methods from the adversarial-attack literature in our numerical comparisons (they do not perform well simply because they are typically designed for very specific types of attacks and pipelines), but one could use our method in adversarial-attack frameworks. To avoid digression, we defer such studies to future work.
The following papers are related on a more general level: He et al. (2017) highlights that the combination of non-robust methods does not lead to a robust method. Carlini and Wagner (2017) shows that even the detection of problematic input is very difficult. Xu and Mannor (2010) introduces a notion of algorithmic robustness to study differences between training errors and expected errors. Tramèr and Boneh (2019) states that an ensemble of two robust methods, each of which is robust to a different form of perturbation, may be robust to neither. Tsipras et al. (2019) demonstrates that there exists a trade-off between a model’s standard accuracy and its robustness to adversarial perturbations.
4 Algorithm and numerical analysis
In this section, we devise an algorithm for computing the DeepMoM estimator of Definition 1. We then corroborate in simulations that our estimator is both robust toward corruptions and efficient in using benign data.
It turns out that can be computed with standard optimization techniques. In particular, we can apply stochastic-gradient steps. The only minor challenge is that the estimator involves both a minimum and a supremum, but this can be addressed by using two updates in each optimization step: one update to make progress with respect to the minimum, and one update to make progress with respect to the supremum.
descents in its first arguments and ascents in its second arguments . Hence, we are concerned with the gradients of
for fixed .
For simplicity, the gradient of an objective function with respect to at a point is denoted by , and the derivatives of the activation functions are denoted by . (In line with the usual conventions, the derivative of the ReLU function at zero is set to zero.)
The above computations are then the basis for our computation of in Algorithm 1. In that algorithm, we set
Throughout the paper, the batch size is , the maximum number of iterations is , and the stopping criterion is .
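The two-update idea can be sketched for a deliberately simple case. The code below is not Algorithm 1: it assumes a linear model with squared-error loss instead of a neural network, draws fresh random blocks in every step, and uses the hypothetical names `mom_minmax_step`, `theta` (minimizing parameter), and `gamma` (maximizing parameter).

```python
import numpy as np

def mom_minmax_step(theta, gamma, x, y, n_blocks, step_size, rng):
    """One descent-ascent update for a linear model with squared-error loss."""
    n = len(y)
    blocks = np.array_split(rng.permutation(n), n_blocks)   # Step 1: blocks

    # Step 2: block means of the loss increments loss(theta) - loss(gamma).
    loss_theta = (x @ theta - y) ** 2
    loss_gamma = (x @ gamma - y) ** 2
    means = np.array([(loss_theta[b] - loss_gamma[b]).mean() for b in blocks])

    # Step 3: the block that attains the (lower) median of the block means.
    median = blocks[np.argsort(means)[(n_blocks - 1) // 2]]
    xb, yb = x[median], y[median]

    # Gradients of the mean loss on the median block.
    grad_theta = 2.0 * xb.T @ (xb @ theta - yb) / len(yb)
    grad_gamma = 2.0 * xb.T @ (xb @ gamma - yb) / len(yb)

    # The objective on the median block is mean loss(theta) - mean loss(gamma):
    # descent in theta subtracts grad_theta; ascent in gamma adds the gradient
    # of the objective with respect to gamma, which is -grad_gamma.
    return theta - step_size * grad_theta, gamma - step_size * grad_gamma
```

In this sketch, both parameters end up taking a gradient step on their own loss over the same median block; what couples them is the choice of that block.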
4.2 Numerical analysis for regression data
We now consider regression data and show that our approach can indeed outmatch other approaches, such as vanilla squared-error, absolute-deviation, and Huber estimation.
We consider a uniform width . The elements of the input vectors are i.i.d. sampled from a standard Gaussian distribution and then normalized such that for all . The elements of the true weight matrices and true bias parameters are i.i.d. sampled from a uniform distribution on . The stochastic noise variables are i.i.d. sampled from a centered Gaussian distribution such that the empirical signal-to-noise ratio equals .
We corrupt the data in three different ways. Recall that and denote the sets of informative samples and corrupted samples (outliers), respectively.
Corrupted outputs (outliers): The noise variables for outliers are replaced by i.i.d. samples from a uniform distribution on . This means that the corresponding outputs are subject to heavy yet bounded corruptions.
Corrupted outputs (everywhere): All noise variables are replaced by i.i.d. samples from a Student’s t-distribution with degrees of freedom. This means that all outputs are subject to unbounded corruptions.
Corrupted inputs: The elements of the input vectors for outliers receive (after computing ) an additional perturbation that is i.i.d. sampled from a standard Gaussian distribution. This means that the analyst gets to see corrupted versions of the input.
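The three corruption mechanisms can be sketched as follows. The function names are hypothetical, and the interval bounds `low` and `high` as well as the degrees of freedom `df` are placeholders to be set by the user; they are not the values used in the experiments.

```python
import numpy as np

def corrupt_noise_outliers(noise, outliers, low, high, rng):
    """Replace the noise of the outliers by uniform samples on [low, high)."""
    corrupted = np.array(noise, dtype=float)          # works on a copy
    corrupted[outliers] = rng.uniform(low, high, size=len(outliers))
    return corrupted

def heavy_tailed_noise(n_samples, df, rng):
    """Noise for all samples from a Student's t-distribution with df degrees of freedom."""
    return rng.standard_t(df, size=n_samples)

def corrupt_inputs(x, outliers, rng):
    """Add standard Gaussian perturbations to the input vectors of the outliers."""
    corrupted = np.array(x, dtype=float)              # works on a copy
    corrupted[outliers] += rng.standard_normal(size=corrupted[outliers].shape)
    return corrupted
```

In the first two cases, the corrupted noise is added to the noise-free outputs; in the third case, the outputs are computed from the clean inputs, and only the inputs handed to the analyst are perturbed.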
data sets are generated as described above. The first half of the samples in each data set are assigned to training and the remaining half of the samples to testing. For each estimator , the average of the generalization error is computed and re-scaled with respect to the approach with informative data.
The contenders are and the absolute-error, Huber, and squared-error loss functions. For convenience, we denote by , , and the estimators obtained by minimizing , , and on training data, respectively.
Besides, we consider a sequence of estimators , where , and we define as
We further consider a sequence of Huber estimators , where are the q-th quantile of with , and we define as
Results and conclusions
The results for different settings are summarized in Tables 1–3. First, we observe that , least-squares, and Huber estimators behave very similarly in the uncorrupted case () and for mildly corrupted outputs (). But once the corruptions are more substantial, our approach clearly outperforms the other approaches. In general, we conclude that is efficient on benign data and robust on problematic data.
4.3 Numerical analysis for classification data
We now consider multi-class classification problems and demonstrate that the DeepMoM estimator in Definition 1 outperforms soft-max cross-entropy estimation in terms of prediction accuracy.
We consider a spiral data set with consisting of five classes with identical cardinalities that span the entire index set . We denote by the Hadamard product between two vectors and by the normal distribution with mean . For each class , we have , and , where the elements of and are given by and for and . The data is visualized in Figure 2.
To fit this data, we consider a two-layer ReLU network with uniform width . Each element of the input vectors is then divided by the maximum among them such that for all and .
We corrupt the spiral data in the following two ways.
Corrupted labels: The problematic labels for are shuffled to other class labels.
Corrupted inputs: The elements of the input vectors for outliers are subjected to an additional perturbation, as described in Section 4.2.
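The label corruption can be sketched as below. The helper name `shuffle_labels`, the integer encoding of the classes as 0, ..., `n_classes` - 1, and the uniform choice among the other classes are assumptions.

```python
import numpy as np

def shuffle_labels(labels, outliers, n_classes, rng):
    """Replace the labels of the outliers by randomly chosen other class labels."""
    corrupted = np.array(labels, copy=True)
    # A shift in {1, ..., n_classes - 1} guarantees that the label changes.
    shifts = rng.integers(1, n_classes, size=len(outliers))
    corrupted[outliers] = (corrupted[outliers] + shifts) % n_classes
    return corrupted
```

Samples outside the outlier set keep their original labels.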
The first half of the samples is used for training, and the other half is used for testing. For the estimator , which is computed by minimizing the mean of on training data, the generalization accuracy is calculated, where is the indicator function defined by
We define the quantity as in Section 4.2 with the considered number of blocks .
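The generalization accuracy described above amounts to the share of test samples for which the predicted class matches the label. A minimal sketch, assuming integer class labels and a score matrix with one row per sample (the name `accuracy` is hypothetical):

```python
import numpy as np

def accuracy(scores, labels):
    """Share of samples whose highest-scoring class equals the true label."""
    predicted = np.argmax(np.asarray(scores), axis=1)
    return float(np.mean(predicted == np.asarray(labels)))
```

Taking the mean of the Boolean comparisons plays the role of averaging the indicator function over the test samples.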
Results and conclusions
Table 4 summarizes the results for various settings. To begin, we see that the DeepMoM and estimators behave very similarly in the uncorrupted case and in the cases of slightly corrupted labels () and slightly corrupted inputs (). However, when the corruptions are more serious, our DeepMoM estimator clearly outperforms the other approaches. In general, we conclude that is efficient on benign data and robust on problematic data.
We now illustrate the potential of DeepMoM in practice. We demonstrate that DeepMoM can withstand data corruptions considerably better than usual approaches.
Application in regression data
The first application is the prediction of the critical temperature of a superconductor. Superconductor materials have a wide range of practical uses. Because of their frictionless property, superconducting wires and electrical systems, for example, have the potential to transport and deliver electricity without any energy loss in the energy industry. This frictionless property, however, occurs only when the ambient temperature is at or below the critical temperature of the superconductor. As a result, the estimation of a superconductor’s critical temperature has baffled scientists since the discovery of superconductivity.
The data Hamidieh (2018); Dua and Graff (2017) contains samples and features extracted from the superconductor’s chemical formula Hamidieh (2018). In this example, we first select samples randomly for training and keep the remaining samples for testing. The normalization of the input data is the same as described in Section 4.2. We fit this data with a two-layer ReLU network with 50 neurons in the first hidden layer and 5 neurons in the second one.
Application in classification data
The second application is the classification of handwritten digit images for the values within from the well-known MNIST data set LeCun et al. (1998); Lecun et al. This data contains samples for training, samples for testing, and pixels as features. The preprocessing of the input data and the network settings are the same as described in Section 4.2.
Our aim for these two types of applications is to validate the DeepMoM method on original data with some corruptions and compare the results to other robust competitors for deep learning, as specified in Section 4.2 and Section 4.3, respectively.
Our new approach to training the parameters of neural networks is robust against corrupted samples and yet leverages informative samples efficiently. We have confirmed these properties numerically in Sections 4.2, 4.3, and 5. The approach can, therefore, be used as a general substitute for basic least-squares-type or cross-entropy-type approaches.
We have restricted ourselves to feed-forward neural networks with ReLU activation, but there are no obstacles to applying our approach more generally, for example, to convolutional networks or other activation functions. However, to keep the paper clear and concise, we defer a detailed analysis of in other deep-learning frameworks to future work.
Similarly, we model corruption by uniform or heavy-tailed random perturbations of the inputs or outputs or by randomly swapping labels, but, of course, one can conceive a plethora of different ways to corrupt data.
In sum, given modern data’s limitations and our approach’s ability to make efficient use of such data, we believe that our contribution can have a substantial impact on deep-learning practice.
We thank Guillaume Lecué, Timothé Mathieu, Mahsa Taheri, and Fang Xie for their insightful input and suggestions.
- Akhtar and Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. arXiv:1801.00553. Cited by: §1.
- Arazo et al. (2019) Unsupervised label noise modeling and loss correction. Proceedings of the 36th international conference on machine learning 97, pp. 312–321. Cited by: §1.
- Athalye et al. (2018) Synthesizing robust adversarial examples. arXiv:1707.07397. Cited by: §3.
- Barron (2019) A general and adaptive robust loss function. arXiv:1701.03077. Cited by: §1, §3.
- Belagiannis et al. (2015) Robust optimization for deep regression. arXiv:1505.06606. Cited by: §1, §3.
- Brückner et al. (2012) Static prediction games for adversarial learning problems. Journal of machine learning research 13 (85), pp. 2617–2654. Cited by: §3.
- Carlini and Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. arXiv:1705.07263. Cited by: §3.
- Dua and Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §5.
- Friedrich et al. (2020) Is there a role for statistics in artificial intelligence?. arXiv:2009.09070. Cited by: §1.
- Goodfellow et al. (2018) Making machine learning robust against adversarial inputs: such inputs distort how machine-learning based systems are able to function in the world as it is. Communications of the ACM 61 (7), pp. 56–66. Cited by: §3.
- Goodfellow et al. (2015) Explaining and harnessing adversarial examples. arXiv:1412.6572. Cited by: §3.
- Hamidieh (2018) A data-driven statistical model for predicting the critical temperature of a superconductor. Computational materials science 154, pp. 346–354. Cited by: §5.
- Hampel et al. (2011) Robust statistics: the approach based on influence functions. Wiley. Cited by: §3.
- He et al. (2017) Adversarial example defenses: ensembles of weak defenses are not strong. arXiv:1706.04701. Cited by: §3.
- Huber and Ronchetti (2009) Robust statistics. Wiley. Cited by: §3.
- Huber (1964) Robust estimation of a location parameter. Annals of mathematical statistics 35 (1), pp. 73–101. Cited by: §2, §3.
- Jiang et al. (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv:1712.05055. Cited by: §1, §3.
- keen security lab (2019) Experimental security research of tesla autopilot. Tencent keen security lab. Cited by: §1.
- Kos and Song (2017) Delving into adversarial attacks on deep policies. arXiv:1705.06452. Cited by: §1.
- Kurakin et al. (2016a) Adversarial examples in the physical world. arXiv:1607.02533. Cited by: §1.
- Kurakin et al. (2016b) Adversarial machine learning at scale. arXiv:1611.01236. Cited by: §1, §3.
- Le Cam (1973) Convergence of estimates under dimensionality restrictions. The annals of statistics 1 (1), pp. 38–53. Cited by: §1.
- Le Cam (1986) Sums of independent random variables. Asymptotic methods in statistical decision theory, Springer series in statistics, pp. 399–456. Cited by: §1.
- Lecué et al. (2020) Robust classification via MOM minimization. Machine learning 109 (8), pp. 1635–1665. Cited by: §1.
- Lecué and Lerasle (2017a) Learning from mom’s principles: le cam’s approach. arXiv:1701.01961. Cited by: §1.
- Lecué and Lerasle (2017b) Robust machine learning by median-of-means: theory and practice. arXiv:1711.10306. Cited by: §1, §2.
- LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §5.
- Lecun et al. The MNIST database. MNIST handwritten digit database. Cited by: §5.
- Lederer (2020) Risk bounds for robust deep learning. arXiv:2009.06202. Cited by: §1, §3.
- Lederer (2021) Activation functions in artificial neural networks: a systematic overview. arXiv:2101.09957. Cited by: §2.
- Liu and Belkin (2019) Accelerating SGD with momentum for over-parameterized learning. arXiv:1810.13395. Cited by: Appendix B.
- Lugosi and Mendelson (2019a) Regularization, sparse recovery, and median-of-means tournaments. Bernoulli 25 (3). Cited by: §1.
- Lugosi and Mendelson (2019b) Risk minimization by median-of-means tournaments. Journal of the European mathematical society 22 (3), pp. 925–965. Cited by: §1.
- Madry et al. (2017) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083. Cited by: §1.
- Madry et al. (2019) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083. Cited by: §3.
- Papernot et al. (2015) Distillation as a defense to adversarial perturbations against deep neural networks. arXiv:1511.04508. Cited by: §1.
- Patrini et al. (2017) Making deep neural networks robust to label noise: a loss correction approach. arXiv. Cited by: §1.
- R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing. Cited by: Appendix D.
- Roh et al. (2021) A survey on data collection for machine learning: a big data - AI integration perspective. IEEE transactions on knowledge and data engineering 33 (4), pp. 1328–1347. Cited by: §1.
- Rumelhart et al. (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536. Cited by: Appendix A.
- Salman et al. (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv:1906.04584. Cited by: §1.
- Sharif et al. (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. Cited by: §1.
- Su et al. (2019) One pixel attack for fooling deep neural networks. IEEE transactions on evolutionary computation 23 (5), pp. 828–841. Cited by: §3.
- Sun (2019) Optimization for deep learning: theory and algorithms. arXiv:1912.08957. Cited by: Appendix B.
- Tramèr and Boneh (2019) Adversarial training and robustness for multiple perturbations. arXiv:1904.13000. Cited by: §3.
- Tramèr et al. (2017) Ensemble adversarial training: attacks and defenses. arXiv:1705.07204. Cited by: §1.
- Tsipras et al. (2019) Robustness may be at odds with accuracy. Cited by: §3.
- Wang and Yu (2019) A direct approach to robust deep learning using adversarial networks. arXiv:1905.09591. Cited by: §1.
- Wang et al. (2018) Towards robust deep neural networks. arXiv:1810.11726. Cited by: §1.
- Wang et al. (2016) Studying very low resolution recognition using deep networks. arXiv:1601.04153. Cited by: §1, §3.
- Wong and Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv:1711.00851. Cited by: §3.
- Xu and Mannor (2010) Robustness and generalization. arXiv:1005.2243. Cited by: §3.
- Yi and Wu (2019) Probabilistic end-to-end noise correction for learning with noisy labels. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Cited by: §1.
- Yuan et al. (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems 30 (9), pp. 2805–2824. Cited by: §1.
Appendix A Gradients
Given , we can find—see the definition of the empirical median-of-means in (6)—indexes (which depend on , respectively) such that and