1 Introduction
Deep learning is regularly used in safety-critical applications. For example, deep learning is used in the object-recognition systems of autonomous cars, where malfunction may lead to severe injury or death. It has been shown that data corruption can have dramatic effects on such critical deep-learning pipelines Akhtar and Mian (2018); Yuan et al. (2019); Kurakin et al. (2016a); Wang and Yu (2019); Sharif et al. (2016); Keen Security Lab (2019); Kurakin et al. (2016b). This insight has sparked research on robust deep learning based on, for example, adversarial training Madry et al. (2017); Kos and Song (2017); Papernot et al. (2015); Tramèr et al. (2017); Salman et al. (2019), sensitivity analysis Wang et al. (2018), or noise correction Patrini et al. (2017); Yi and Wu (2019); Arazo et al. (2019).
Research on robust deep learning usually focuses on “adversarial attacks,” that is, intentional data corruptions designed to cause failures of specific pipelines. In contrast, the more general fact that data is often of poor quality has received little attention. But low-quality data is very common, simply because data used for deep learning is rarely collected based on rigid experimental designs but rather amassed from whatever resources are available Roh et al. (2021); Friedrich et al. (2020). Among the few papers that consider such corruptions are Barron (2019); Belagiannis et al. (2015); Jiang et al. (2018); Wang et al. (2016); Lederer (2020), who replace the standard loss functions, such as the squared-error and softmax cross-entropy loss functions, by Lipschitz-continuous alternatives, such as Huber loss functions. But there is much room for improvement, especially because the existing methods do not make efficient use of the uncorrupted samples in the data.
In this paper, we devise a novel approach to deep learning with “randomly” corrupted data. The inspiration is the very recent line of research on median-of-means Lugosi and Mendelson (2019a, b); Lecué et al. (2020) and Le Cam’s procedure Le Cam (1973, 1986); Lecué and Lerasle (2017a, b). We especially mimic Lecué and Lerasle (2017b) in defining parameter updates in a min-max fashion. We show that this approach indeed outmatches other approaches on simulated and real-world data.

Our three main contributions are:

We introduce a robust training scheme for neural networks that incorporates the median-of-means principle through Le Cam’s procedure.

We show that our approach can be implemented by using a simple gradient-based algorithm.

We demonstrate that our approach outperforms standard deep-learning pipelines across different levels of corruption.
Outline of the paper
In Section 2, we state the problem and give a step-by-step derivation of the DeepMoM estimator (Definition 1). In Section 3, we highlight similarities and differences to other approaches. In Section 4.1, we establish a stochastic-gradient algorithm to compute our estimators (Algorithm 1). In Sections 4.2 and 4.3, we demonstrate that our approach rivals or outmatches traditional training schemes based on least-squares and on cross-entropy, both for corrupted and for uncorrupted simulated data. In Section 5, we demonstrate the same on real data. Finally, in Section 6, we summarize the results of this work and conclude.
2 Framework and estimator
We first introduce the statistical framework and our corresponding estimator. We consider data with such that
(1) 
where is the unknown data-generating function, and
are the stochastic error vectors. In particular, each
is an input of the system and the corresponding output. The data is partitioned into two parts: the first part comprises the informative samples; the second part comprises the problematic samples (such as corrupted samples, irrespective of the source of the corruption). The two parts are indexed by and , respectively. Of course, the sets and are unknown in practice (otherwise, one could simply remove the problematic samples). In brief, we consider a standard deep-learning setup, with the only exception that we make an explicit distinction between “good” and “bad” samples.

Our general goal is then, as usual, to approximate the data-generating function defined in (1). But our specific goal is to take into account the fact that there may be problematic samples. Our candidate functions are feedforward neural networks
of the form
(2) 
indexed by the parameter spaces
and
where is the number of layers, are the (finite-dimensional) weight matrices with and , are bias parameters, and
are the activation functions. For ease of exposition, we concentrate on ReLU activation, that is,
for , , and Lederer (2021).

Neural networks, such as those in (2), are typically fitted to data by minimizing the sum of the loss function over the samples: . The two standard loss functions for regression problems () and classification problems () are the squared-error (SE) loss
(3) 
and the softmax cross-entropy (SCE) loss
(4) 
respectively. It is well known that such loss functions are efficient on benign data but sensitive to heavy-tailed data, corrupted samples, and so forth Huber (1964).
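To make the setup concrete, the following is a minimal sketch of a feedforward ReLU network together with the two standard loss functions. The function names, the use of NumPy, the choice of a purely affine output layer, and the absence of a scaling factor in the squared-error loss are assumptions made for illustration; they are not taken from the displays (2)–(4).

```python
import numpy as np

def relu(z):
    """ReLU activation, applied elementwise."""
    return np.maximum(z, 0.0)

def network(x, weights, biases):
    """Feedforward network: affine map plus ReLU in every hidden layer,
    followed by an affine output layer (assumed form)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

def squared_error(y, y_hat):
    """Squared-error loss for one sample (no 1/2 factor; an assumption)."""
    return float(np.sum((y - y_hat) ** 2))

def softmax_cross_entropy(y, y_hat):
    """Softmax cross-entropy for one sample.
    y is a one-hot label vector; y_hat holds the raw network outputs."""
    shifted = y_hat - np.max(y_hat)  # shift for numerical stability
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-np.sum(y * log_softmax))
```

A single large residual dominates the sum of either loss, which is the sensitivity to problematic samples mentioned above.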
We want to keep those loss functions’ efficiency on benign samples, but, at the same time, avoid their failure in the presence of problematic samples. We achieve this by a median-of-means approach () inspired by Lecué and Lerasle (2017b). The details of the approach are mathematically intricate, but the general idea is simple. We thus describe the general idea first and then formally define the estimator afterward. The approach can roughly be formulated in terms of three-step updates:

Partition the data into blocks of samples.

On each block, calculate the empirical mean of the loss increment with respect to two separate sets of parameters and the standard loss function ( in regression and in classification).

Use the block that corresponds to the median of the empirical means in Step 2 to update the parameters.

Go back to Step 1 until convergence.
Let us now be more formal. In the first step, we consider blocks , that are an equipartition of , which means that the blocks have equal cardinalities and cover the entire index set . In practice, we set , where and , if does not divide .
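The partition step can be sketched as follows. The helper name `equipartition`, the random shuffling, and in particular the rule of dropping leftover indices when the number of blocks does not divide the sample size are assumptions for illustration; the text above prescribes its own rule for that case.

```python
import numpy as np

def equipartition(n, num_blocks, rng=None):
    """Split the indices 0, ..., n-1 into num_blocks blocks of equal size.

    Assumption: if num_blocks does not divide n, the leftover indices
    are dropped for this partition.
    """
    rng = np.random.default_rng() if rng is None else rng
    block_size = n // num_blocks
    perm = rng.permutation(n)[: block_size * num_blocks]
    # Row j of the result holds the indices of block j.
    return perm.reshape(num_blocks, block_size)
```

Drawing a fresh random partition in every update prevents a fixed set of corrupted samples from always sharing blocks with the same informative samples.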
Given , the quantities in Step 2 are defined by
(5) 
and we denote by an
quantile of the set
in particular, in Step 3, we compute the empirical median-of-means in Step 2 by defining
(6) 
Our estimator is then the solution of the min-max problem of the increment tests defined in the following. (The estimator can also be seen as a generalization of the standard least-squares/cross-entropy approaches; see Appendix C.)
Definition 1 (DeepMoM).
For and given blocks described in the above, we define
The rationale is as follows: on the one hand, using least-squares/cross-entropy on each block ensures efficient use of the “good” samples; on the other hand, using the median over the blocks removes the corruptions and, therefore, ensures robustness toward the “bad” samples.
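The block-wise computation behind this rationale can be illustrated with a short sketch. It takes the per-sample losses of two parameter sets as given and returns the median of the block means of their differences; the function name and the convention of taking the upper median for an even number of blocks are assumptions.

```python
import numpy as np

def mom_increment(loss_a, loss_b, blocks):
    """Median over blocks of the block-wise mean loss increment.

    loss_a, loss_b: per-sample losses of two parameter sets (1-D arrays).
    blocks: integer array of shape (num_blocks, block_size) with sample indices.
    Returns the median value and the index of the block that attains it.
    """
    increments = loss_a - loss_b
    block_means = increments[blocks].mean(axis=1)
    order = np.argsort(block_means)
    median_block = order[len(order) // 2]  # upper median if the count is even
    return float(block_means[median_block]), int(median_block)

# Two clean blocks and one block with grossly corrupted samples:
loss_a = np.array([1.0, 1.0, 2.0, 2.0, 100.0, 100.0])
loss_b = np.zeros(6)
blocks = np.array([[0, 1], [2, 3], [4, 5]])
value, block = mom_increment(loss_a, loss_b, blocks)
```

In this toy example the plain mean of the increments is about 34.3, while the median of the block means is 2.0 and is attained on a clean block.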
3 Related literature
We now take a moment to highlight relationships with other approaches as well as differences to those approaches. Since problematic samples are the rule rather than an exception in deep learning, the sensitivity of the standard loss functions has sparked much research interest. In regression settings, for example,
Barron (2019); Belagiannis et al. (2015); Jiang et al. (2018); Wang et al. (2016); Lederer (2020) have replaced the squared-error loss by the absolute-deviation loss , which generates estimators for the empirical median, or the Huber loss function Huber (1964); Huber and Ronchetti (2009); Hampel et al. (2011) where is a tuning parameter that determines the robustness of the function. In classification settings, for example, Goodfellow et al. (2015); Madry et al. (2019); Wong and Kolter (2018) have added an penalty on the parameters. Changing the loss function in those ways can make the estimators robust toward the problematic data, but it also forfeits the efficiency of the standard loss functions in using the informative data. In contrast, our approach offers robustness with respect to the “bad” samples but also efficient use of the “good” samples.
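For reference, the two robust regression losses mentioned above can be sketched as follows. The Huber loss is written in its common textbook form (quadratic for residuals up to the threshold `k`, linear beyond it); the exact scaling used in the cited papers may differ, so the form below is an assumption.

```python
import numpy as np

def absolute_deviation(y, y_hat):
    """Absolute-deviation loss, summed over the output coordinates."""
    return float(np.sum(np.abs(y - y_hat)))

def huber(y, y_hat, k):
    """Huber loss with threshold k (textbook form, assumed):
    0.5 * r**2 for |r| <= k and k * (|r| - 0.5 * k) otherwise."""
    r = np.abs(y - y_hat)
    quadratic = 0.5 * r ** 2
    linear = k * (r - 0.5 * k)
    return float(np.sum(np.where(r <= k, quadratic, linear)))
```

Both losses grow only linearly in large residuals, which explains their robustness, but they also treat small residuals differently from the squared-error loss unless `k` is large.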
Another related topic is adversarial attacks. Adversarial attacks are intentional corruptions of the data with the goal of exploiting weaknesses of specific deep-learning pipelines Kurakin et al. (2016b); Goodfellow et al. (2018); Brückner et al. (2012); Su et al. (2019); Athalye et al. (2018). Hence, both the papers on adversarial attacks and our paper study data corruption. However, the perspectives on corruptions are very different: while the literature on adversarial attacks has a notion of a “mean-spirited opponent,” we have a much broader notion of “good” and “bad” samples. The adversarial-attack perspective is much more common in deep learning, but our view is much more common in the sciences more generally. The different notions also lead to different methods: methods in the context of adversarial attacks concern specific types of attacks and pipelines, while our method can be seen as a way to render deep learning more robust in general. It would be misleading to include methods from the adversarial-attack literature in our numerical comparisons (they do not perform well simply because they are typically designed for very specific types of attacks and pipelines), but one could use our method in adversarial-attack frameworks. To avoid digression, we defer such studies to future work.
The following papers are related on a more general level: He et al. (2017) highlight that the combination of non-robust methods does not lead to a robust method. Carlini and Wagner (2017) show that even the detection of problematic input is very difficult. Xu and Mannor (2010) introduce a notion of algorithmic robustness to study differences between training errors and expected errors. Tramèr and Boneh (2019) state that an ensemble of two robust methods, each of which is robust to a different form of perturbation, may be robust to neither. Tsipras et al. (2019) demonstrate that there exists a trade-off between a model’s standard accuracy and its robustness to adversarial perturbations.
4 Algorithm and numerical analysis
In this section, we devise an algorithm for computing the DeepMoM estimator of Definition 1. We then corroborate in simulations that our estimator is both robust toward corruptions and efficient in using benign data.
4.1 Algorithm
It turns out that can be computed with standard optimization techniques. In particular, we can apply stochastic-gradient steps. The only minor challenge is that the estimator involves both a minimum and a supremum, but this can be addressed by using two updates in each optimization step: one update to make progress with respect to the minimum, and one update to make progress with respect to the supremum.
To be more formal, we want to compute the estimator of Definition 1 for given blocks on data defined in Section 2. This amounts to finding updates such that our objective function
descends in its first arguments and ascends in its second arguments . Hence, we are concerned with the gradients of
(7) 
and
(8) 
for fixed .
For simplicity, the gradient of an objective function with respect to at a point is denoted by , and the derivatives of the activation functions are denoted by . (In line with the usual conventions, the derivative of the ReLU function at zero is set to zero.)
The above computations are then the basis for our computation of in Algorithm 1. In that algorithm, we set
for .
Throughout the paper, the batch size is , the maximum number of iterations is , and the stopping criterion is .
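The two-update scheme described in this section can be sketched in a few lines. The sketch below is not Algorithm 1: to stay short it replaces the neural network by a linear model with squared-error loss, uses a fixed number of iterations instead of a stopping criterion, and redraws the blocks in every iteration. The function names, the step size, and the initialization are likewise assumptions.

```python
import numpy as np

def median_block(X, y, theta, gamma, blocks):
    """Indices of the block whose mean loss increment is the median."""
    inc = (y - X @ theta) ** 2 - (y - X @ gamma) ** 2
    means = inc[blocks].mean(axis=1)
    return blocks[np.argsort(means)[len(blocks) // 2]]

def grad_mse(X, y, theta):
    """Gradient of the mean squared error of a linear model."""
    return -2.0 * X.T @ (y - X @ theta) / len(y)

def deepmom_linear(X, y, num_blocks, lr=0.05, iters=300, seed=0):
    """Min-max median-of-means updates for a linear model (illustration only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    size = n // num_blocks
    theta = np.zeros(d)                   # parameters of the minimizing player
    gamma = 0.1 * rng.standard_normal(d)  # parameters of the maximizing player
    for _ in range(iters):
        blocks = rng.permutation(n)[: size * num_blocks].reshape(num_blocks, size)
        # Update 1: descent step for theta on the median block.
        idx = median_block(X, y, theta, gamma, blocks)
        theta = theta - lr * grad_mse(X[idx], y[idx], theta)
        # Update 2: ascent step for gamma on the increment, which equals
        # a descent step on gamma's own loss over the median block.
        idx = median_block(X, y, theta, gamma, blocks)
        gamma = gamma - lr * grad_mse(X[idx], y[idx], gamma)
    return theta

# Hypothetical toy data: 200 samples, 10 of them with grossly corrupted outputs.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
truth = np.array([1.0, -2.0])
y = X @ truth + 0.1 * rng.standard_normal(200)
y[:10] += 1000.0
estimate = deepmom_linear(X, y, num_blocks=41)
```

The number of blocks in the toy example is chosen larger than twice the number of corrupted samples, so that the majority of blocks is clean and the median block is typically uncorrupted.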
4.2 Numerical analysis for regression data
We now consider regression data and show that our approach can indeed outmatch other approaches, such as vanilla squared-error, absolute-deviation, and Huber estimation.
General setup
We consider a uniform width . The elements of the input vectors
are i.i.d. sampled from a standard Gaussian distribution and then normalized such that
for all . The elements of the true weight matrices and true bias parameters are i.i.d. sampled from a uniform distribution on
. The stochastic noise variables are i.i.d. sampled from a centered Gaussian distribution such that the empirical signal-to-noise ratio equals
.
Data corruptions
We corrupt the data in three different ways. Recall that and
denote the sets of informative samples and corrupted samples (outliers), respectively.
Corrupted outputs (outliers): The noise variables for outliers are replaced by i.i.d. samples from a uniform distribution on . This means that the corresponding outputs are subject to heavy yet bounded corruptions.
Corrupted outputs (everywhere): All noise variables are replaced by i.i.d. samples from a Student’s t-distribution with
degrees of freedom. This means that all outputs
are subject to unbounded corruptions.
Corrupted inputs: The elements of the input vectors for outliers receive (after computing ) an additional perturbation that is i.i.d. sampled from a standard Gaussian distribution. This means that the analyst gets to see corrupted versions of the input.
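The three corruption mechanisms can be sketched as follows. The function names are hypothetical, and the bounds of the uniform distribution and the degrees of freedom are left as parameters because their values are not fixed in this sketch; `signal` denotes the noiseless outputs and `outliers` an index array of the corrupted samples.

```python
import numpy as np

def corrupt_outlier_outputs(signal, noise, outliers, low, high, rng):
    """Replace the noise of the outliers by uniform noise on [low, high)."""
    noise = noise.copy()
    noise[outliers] = rng.uniform(low, high, size=noise[outliers].shape)
    return signal + noise

def corrupt_all_outputs(signal, df, rng):
    """Replace all noise variables by Student's t noise with df degrees of freedom."""
    return signal + rng.standard_t(df, size=signal.shape)

def corrupt_inputs(X, outliers, rng):
    """Add standard Gaussian perturbations to the inputs of the outliers.
    The outputs are assumed to have been computed from the clean inputs."""
    X = X.copy()
    X[outliers] += rng.standard_normal(X[outliers].shape)
    return X
```

All three helpers return new arrays and leave their arguments untouched, so the same clean data set can be reused across corruption levels.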
Error quantification
data sets are generated as described above. The first half of the samples in each data set are assigned to training and the remaining half of the samples to testing. For each estimator , the average of the generalization error is computed and rescaled with respect to the approach with informative data.
The contenders are , absolute error, Huber, and the squared error loss functions. For convenience, we denote , , and as the estimators obtained by minimizing , , and on training data, respectively.
Besides, we consider a sequence of estimators , where , and we define as
We further consider a sequence of Huber estimators , where are the qth quantile of with , and we define as
Results and conclusions
The results for different settings are summarized in Tables 1–3. First, we observe that , least-squares, and Huber estimators behave very similarly in the uncorrupted case () and for mildly corrupted outputs (). But once the corruptions are more substantial, our approach clearly outperforms the other approaches. In general, we conclude that is efficient on benign data and robust on problematic data.
4.3 Numerical analysis for classification data
We now consider multiclass problems and demonstrate that the DeepMoM estimator in Definition 1 outperforms softmax cross-entropy estimation in terms of prediction accuracy.
General setup
We consider a spiral data set with consisting of five classes with identical cardinalities that span the entire index set . We denote the symbol as the Hadamard product between two vectors and
as the normal distribution with mean
. For each class , we have, and , where the elements of and are given by and for and . The data is visualized in Figure 2.
To fit this data, we consider a two-layer ReLU network with uniform width . Each element of the input vectors is then divided by the maximum among them such that for all and .
Data corruptions
We corrupt the spiral data using the two methods mentioned below.
Corrupted labels: The problematic labels for are shuffled to other class labels.
Corrupted inputs: The elements of the input vectors for outliers are subjected to an additional perturbation, as prescribed under Section 4.2.
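The label corruption can be sketched as follows. The helper name is hypothetical, labels are assumed to be integers in `0, ..., num_classes - 1`, and the replacement class is drawn uniformly from the other classes, which is one possible reading of “shuffled to other class labels.”

```python
import numpy as np

def corrupt_labels(labels, outliers, num_classes, rng):
    """Move the labels of the outliers to a different class.

    A shift between 1 and num_classes - 1 is added modulo num_classes,
    so a corrupted label never equals the original label.
    """
    labels = labels.copy()
    shift = rng.integers(1, num_classes, size=len(outliers))
    labels[outliers] = (labels[outliers] + shift) % num_classes
    return labels
```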
Error quantification
The first half of the samples will be used for training, while the other half will be used for testing. For the estimator , which is computed by minimizing the mean of on training data, the generalization accuracy is calculated, where is the indicator function defined by
We define the quantity as in Section 4.2 with the considered number of blocks .
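The generalization accuracy amounts to the fraction of test samples whose predicted class matches the label. A minimal sketch, assuming that the predicted class is the index of the largest network output and that labels are integer class indices:

```python
import numpy as np

def accuracy(scores, labels):
    """Fraction of samples for which the arg-max of the scores equals the label.

    scores: array of shape (num_samples, num_classes) with network outputs.
    labels: integer array of shape (num_samples,).
    """
    predictions = np.argmax(scores, axis=1)
    return float(np.mean(predictions == labels))
```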
Results and conclusions
Table 4 summarizes the results for various settings. To begin, we see that DeepMoM and estimators behave very similarly in the uncorrupted case and in the cases of slightly corrupted labels () and slightly corrupted inputs (). However, when the corruptions are more serious, our DeepMoM estimator clearly outperforms the other approaches. In general, we conclude that is efficient on benign data and robust on problematic data.
5 Applications
We now illustrate the potential of DeepMoM in practice. We demonstrate that DeepMoM can withstand data corruptions considerably better than usual approaches.
Application in regression data
The first application is the prediction of the critical temperature of a superconductor. Superconductor materials have a wide range of practical uses. Because of their frictionless property, superconducting wires and electrical systems, for example, have the potential to transport and deliver electricity without any energy loss in the energy industry. This frictionless property, however, occurs only when the ambient temperature is at or below the critical temperature of the superconductor. As a result, the estimation of a superconductor’s critical temperature has baffled scientists since the discovery of superconductivity.
The data Hamidieh (2018); Dua and Graff (2017) contains samples and features extracted from the superconductor’s chemical formula Hamidieh (2018). In this example, we first select samples randomly for training and keep the remaining samples for testing. The normalization of the input data is the same as described in Section 4.2. We fit this data by considering a two-layer ReLU network with 50 neurons in the first hidden layer and 5 neurons in the second one.
Application in classification data
The second application is the classification of handwritten digit images for the values within
from the well-known MNIST data set
LeCun et al. (1998); Lecun et al. This data contains samples for training, samples for testing, and pixels as features. The preprocessing of the input data and the network settings are the same as we described in Section 4.2.
Our aim for these two types of applications is to validate the DeepMoM method on original data with some corruptions and compare the results to other robust competitors for deep learning, as specified in Section 4.2 and Section 4.3, respectively.


6 Discussion
Our new approach to training the parameters of neural networks is robust against corrupted samples and yet leverages informative samples efficiently. We have confirmed these properties numerically in Sections 4.2, 4.3, and 5. The approach can, therefore, be used as a general substitute for basic least-squares-type or cross-entropy-type approaches.
We have restricted ourselves to feedforward neural networks with ReLU activation, but there are no obstacles to applying our approach more generally, for example, to convolutional networks or other activation functions. However, to keep the paper clear and concise, we defer a detailed analysis of in other deep-learning frameworks to future work.
Similarly, we model corruption by uniform or heavy-tailed random perturbations of the inputs or outputs or by randomly swapping labels, but, of course, one can conceive a plethora of different ways to corrupt data.
In sum, given modern data’s limitations and our approach’s ability to make efficient use of such data, we believe that our contribution can have a substantial impact on deep-learning practice.
Acknowledgments
We thank Guillaume Lecué, Timothé Mathieu, Mahsa Taheri, and Fang Xie for their insightful input and suggestions.
References

[1] Akhtar and Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. arXiv:1801.00553.
[2] Arazo et al. (2019) Unsupervised label noise modeling and loss correction. Proceedings of the 36th international conference on machine learning 97, pp. 312–321.
[3] Athalye et al. (2018) Synthesizing robust adversarial examples. arXiv:1707.07397.
[4] Barron (2019) A general and adaptive robust loss function. arXiv:1701.03077.
[5] Belagiannis et al. (2015) Robust optimization for deep regression. arXiv:1505.06606.
[6] Brückner et al. (2012) Static prediction games for adversarial learning problems. Journal of machine learning research 13 (85), pp. 2617–2654.
[7] Carlini and Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. arXiv:1705.07263.
[8] Dua and Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[9] Friedrich et al. (2020) Is there a role for statistics in artificial intelligence? arXiv:2009.09070.
[10] Goodfellow et al. (2018) Making machine learning robust against adversarial inputs: such inputs distort how machine-learning based systems are able to function in the world as it is. Communications of the ACM 61 (7), pp. 56–66.
[11] Goodfellow et al. (2015) Explaining and harnessing adversarial examples. arXiv:1412.6572.
[12] Hamidieh (2018) A data-driven statistical model for predicting the critical temperature of a superconductor. Computational materials science 154, pp. 346–354.
[13] Hampel et al. (2011) Robust statistics: the approach based on influence functions. Wiley.
[14] He et al. (2017) Adversarial example defenses: ensembles of weak defenses are not strong. arXiv:1706.04701.
[15] Huber and Ronchetti (2009) Robust statistics. Wiley.
[16] Huber (1964) Robust estimation of a location parameter. Annals of mathematical statistics 35 (1), pp. 73–101.
[17] Jiang et al. (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv:1712.05055.
[18] Keen Security Lab (2019) Experimental security research of Tesla Autopilot. Tencent Keen Security Lab.
[19] Kos and Song (2017) Delving into adversarial attacks on deep policies. arXiv:1705.06452.
[20] Kurakin et al. (2016a) Adversarial examples in the physical world. arXiv:1607.02533.
[21] Kurakin et al. (2016b) Adversarial machine learning at scale. arXiv:1611.01236.
[22] Le Cam (1973) Convergence of estimates under dimensionality restrictions. The annals of statistics 1 (1), pp. 38–53.
[23] Le Cam (1986) Sums of independent random variables. Asymptotic methods in statistical decision theory, Springer series in statistics, pp. 399–456.
[24] Lecué et al. (2020) Robust classification via MOM minimization. Machine learning 109 (8), pp. 1635–1665.
[25] Lecué and Lerasle (2017a) Learning from MOM’s principles: Le Cam’s approach. arXiv:1701.01961.
[26] Lecué and Lerasle (2017b) Robust machine learning by median-of-means: theory and practice. arXiv:1711.10306.
[27] LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE.
[28] Lecun et al. The MNIST database. MNIST handwritten digit database.
[29] Lederer (2020) Risk bounds for robust deep learning. arXiv:2009.06202.
[30] Lederer (2021) Activation functions in artificial neural networks: a systematic overview. arXiv:2101.09957.
[31] Liu and Belkin (2019) Accelerating SGD with momentum for over-parameterized learning. arXiv:1810.13395.
[32] Lugosi and Mendelson (2019a) Regularization, sparse recovery, and median-of-means tournaments. Bernoulli 25 (3).
[33] Lugosi and Mendelson (2019b) Risk minimization by median-of-means tournaments. Journal of the European mathematical society 22 (3), pp. 925–965.
[34] Madry et al. (2017) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083.
[35] Madry et al. (2019) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083.
[36] Papernot et al. (2015) Distillation as a defense to adversarial perturbations against deep neural networks. arXiv:1511.04508.
[37] Patrini et al. (2017) Making deep neural networks robust to label noise: a loss correction approach. arXiv.
[38] R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing.
[39] Roh et al. (2021) A survey on data collection for machine learning: a big data - AI integration perspective. IEEE transactions on knowledge and data engineering 33 (4), pp. 1328–1347.
[40] Rumelhart et al. (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536.
[41] Salman et al. (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv:1906.04584.
[42] Sharif et al. (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. Proceedings of the 2016 ACM SIGSAC conference on computer and communications security.
[43] Su et al. (2019) One pixel attack for fooling deep neural networks. IEEE transactions on evolutionary computation 23 (5), pp. 828–841.
[44] Sun (2019) Optimization for deep learning: theory and algorithms. arXiv:1912.08957.
[45] Tramèr and Boneh (2019) Adversarial training and robustness for multiple perturbations. arXiv:1904.13000.
[46] Tramèr et al. (2017) Ensemble adversarial training: attacks and defenses. arXiv:1705.07204.
[47] Tsipras et al. (2019) Robustness may be at odds with accuracy.
[48] Wang and Yu (2019) A direct approach to robust deep learning using adversarial networks. arXiv:1905.09591.
[49] Wang et al. (2018) Towards robust deep neural networks. arXiv:1810.11726.
[50] Wang et al. (2016) Studying very low resolution recognition using deep networks. arXiv:1601.04153.
[51] Wong and Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv:1711.00851.
[52] Xu and Mannor (2010) Robustness and generalization. arXiv:1005.2243.
[53] Yi and Wu (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[54] Yuan et al. (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems 30 (9), pp. 2805–2824.
Appendix A Gradients
Given , we can find (see the definition of the empirical median-of-means in (6)) indexes (which depend on , respectively) such that and