DeepMoM: Robust Deep Learning With Median-of-Means

by Shih-Ting Huang et al.
Ruhr University Bochum

Data used in deep learning is notoriously problematic. For example, data are usually combined from diverse sources, rarely cleaned and vetted thoroughly, and sometimes corrupted on purpose. Intentional corruption that targets the weak spots of algorithms has been studied extensively under the label of "adversarial attacks." In contrast, the arguably much more common case of corruption that reflects the limited quality of data has been studied much less. Such "random" corruptions are due to measurement errors, unreliable sources, convenience sampling, and so forth. These kinds of corruption are common in deep learning, because data are rarely collected according to strict protocols – in strong contrast to the formalized data collection in some parts of classical statistics. This paper concerns such corruption. We introduce an approach motivated by very recent insights into median-of-means and Le Cam's principle, we show that the approach can be readily implemented, and we demonstrate that it performs very well in practice. In conclusion, we believe that our approach is a very promising alternative to standard parameter training based on least-squares and cross-entropy loss.





1 Introduction

Deep learning is regularly used in safety-critical applications. For example, deep learning is used in the object-recognition systems of autonomous cars, where malfunction may lead to severe injury or death. It has been shown that data corruption can have dramatic effects on such critical deep-learning pipelines Akhtar and Mian (2018); Yuan et al. (2019); Kurakin et al. (2016a); Wang and Yu (2019); Sharif et al. (2016); keen security lab (2019); Kurakin et al. (2016b). This insight has sparked research on robust deep learning based on, for example, adversarial training Madry et al. (2017); Kos and Song (2017); Papernot et al. (2015); Tramèr et al. (2017); Salman et al. (2019), sensitivity analysis Wang et al. (2018), or noise correction Patrini et al. (2017); Yi and Wu (2019); Arazo et al. (2019).

Research on robust deep learning usually focuses on “adversarial attacks,” that is, intentional data corruptions designed to cause failures of specific pipelines. In contrast, the fact that data is often of poor quality much more generally has received little attention. But low-quality data is very common, simply because data used for deep learning is rarely collected based on rigid experimental designs but rather amassed from whatever resources are available Roh et al. (2021); Friedrich et al. (2020). Among the few papers that consider such corruptions are Barron (2019); Belagiannis et al. (2015); Jiang et al. (2018); Wang et al. (2016); Lederer (2020), who replace the standard loss functions, such as the squared-error and soft-max cross-entropy loss functions, by Lipschitz-continuous alternatives, such as the Huber loss function. But there is much room for improvement, especially because the existing methods do not make efficient use of the uncorrupted samples in the data.

In this paper, we devise a novel approach to deep learning with “randomly” corrupted data. The inspiration is the very recent line of research on median-of-means Lugosi and Mendelson (2019a, b); Lecué et al. (2020) and Le Cam’s procedure Le Cam (1973, 1986) in linear regression Lecué and Lerasle (2017a, b). We especially mimic Lecué and Lerasle (2017b) in defining parameter updates in a min-max fashion. We show that this approach indeed outmatches other approaches on simulated and real-world data.

Our three main contributions are:

  • We introduce a robust training scheme for neural networks that incorporates the median-of-means principle through Le Cam’s procedure.

  • We show that our approach can be implemented by using a simple gradient-based algorithm.

  • We demonstrate that our approach outperforms standard deep-learning pipelines across different levels of corruption.

Outline of the paper

In Section 2, we state the problem and give a step-by-step derivation of the DeepMoM estimator (Definition 1). In Section 3, we highlight similarities and differences to other approaches. In Section 4.1, we establish a stochastic-gradient algorithm to compute our estimators (Algorithm 1). In Sections 4.2 and 4.3, we demonstrate that our approach rivals or outmatches traditional training schemes based on least-squares and on cross-entropy, both for corrupted and uncorrupted simulated data. In Section 5, we demonstrate the same on real data. Finally, in Section 6, we summarize the results of this work and conclude.

2 Framework and estimator

We first introduce the statistical framework and our corresponding estimator. We consider data $(x_1, y_1), \dots, (x_n, y_n)$ with inputs $x_i$ and outputs $y_i$ such that

$$y_i = g^*[x_i] + u_i \quad \text{for } i \in \{1, \dots, n\}, \qquad (1)$$

where $g^*$ is the unknown data-generating function and $u_1, \dots, u_n$ are the stochastic error variables. In particular, each $x_i$ is an input of the system and $y_i$ the corresponding output. The data is partitioned into two parts: the first part comprises the informative samples; the second part comprises the problematic samples (such as corrupted samples, irrespective of the source of the corruption). The two parts are indexed by $\mathcal{I}$ and $\mathcal{O}$, respectively. Of course, the sets $\mathcal{I}$ and $\mathcal{O}$ are unknown in practice (otherwise, one could simply remove the problematic samples). In brief, we consider a standard deep-learning setup—with the only exception that we make an explicit distinction between “good” and “bad” samples.

Our general goal is then, as usual, to approximate the data-generating function defined in (1). But our specific goal is to take into account the fact that there may be problematic samples. Our candidate functions are feed-forward neural networks of the form

$$g_\theta[x] := W^L f^{L-1}\bigl[\cdots f^1[W^1 x + b^1]\cdots\bigr] + b^L, \qquad (2)$$

indexed by the parameters $\theta = (W^1, b^1, \dots, W^L, b^L)$, where $L$ is the number of layers, $W^1, \dots, W^L$ are the (finite-dimensional) weight matrices, $b^1, \dots, b^L$ are the bias parameters, and $f^1, \dots, f^{L-1}$ are the activation functions. For ease of exposition, we concentrate on ReLU activation, that is, $f^j[z] := \max\{z, 0\}$ applied coordinate-wise Lederer (2021).

Neural networks, such as those in (2), are typically fitted to data by minimizing a sum of loss functions $\sum_{i=1}^n \ell[y_i, g_\theta[x_i]]$. The two standard loss functions for regression problems and classification problems are the squared-error (SE) loss

$$\ell^{\mathrm{SE}}[y, g_\theta[x]] := \bigl(y - g_\theta[x]\bigr)^2$$

and the soft-max cross-entropy (SCE) loss

$$\ell^{\mathrm{SCE}}[y, g_\theta[x]] := -\sum_{k} y_k \log\biggl[\frac{e^{(g_\theta[x])_k}}{\sum_{k'} e^{(g_\theta[x])_{k'}}}\biggr],$$

respectively. It is well known that such loss functions are efficient on benign data but sensitive to heavy-tailed data, corrupted samples, and so forth Huber (1964).
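For concreteness, the two standard losses can be sketched as follows; this is a minimal NumPy illustration, and the function names and one-hot label encoding are our own conventions, not the paper's.

```python
import numpy as np

def se_loss(y, y_hat):
    """Squared-error loss for a single regression output."""
    return float((y - y_hat) ** 2)

def sce_loss(y_onehot, logits):
    """Soft-max cross-entropy loss for a single classification output."""
    z = logits - np.max(logits)                  # stabilize the soft-max
    log_softmax = z - np.log(np.sum(np.exp(z)))  # log of the soft-max vector
    return float(-np.dot(y_onehot, log_softmax))
```

Both losses grow without bound in the residual, which is exactly the sensitivity to outliers discussed above.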

We want to keep those loss functions’ efficiency on benign samples but, at the same time, avoid their failure in the presence of problematic samples. We achieve this by a median-of-means (MoM) approach inspired by Lecué and Lerasle (2017b). The details of the approach are mathematically intricate, but the general idea is simple. We thus describe the general idea first and formally define the estimator afterward. The approach can roughly be formulated in terms of the following update steps:

  • Partition the data into blocks of samples.

  • On each block, calculate the empirical mean of the loss increment with respect to two separate sets of parameters and the standard loss function (squared error in regression and soft-max cross entropy in classification).

  • Use the block that corresponds to the median of the empirical means in Step 2 to update the parameters.

  • Go back to Step 1 until convergence.

Let us now be more formal. In the first step, we consider blocks $B_1, \dots, B_K$ that form an equipartition of $\{1, \dots, n\}$, which means the blocks have equal cardinalities $|B_1| = \dots = |B_K|$ and cover the entire index set. If $K$ does not divide $n$, the blocks are taken as equal in size as possible.
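Step 1 and the median selection in Step 3 can be sketched as follows; this is a minimal NumPy illustration, and the helper names are ours:

```python
import numpy as np

def equipartition(n, K, rng):
    """Step 1: randomly split the indices {0, ..., n-1} into K blocks
    of (as close as possible to) equal size."""
    return np.array_split(rng.permutation(n), K)

def median_block(per_sample_values, blocks):
    """Step 3: return the block whose empirical mean is the median
    of the K block means."""
    means = [per_sample_values[b].mean() for b in blocks]
    order = np.argsort(means)
    return blocks[order[(len(blocks) - 1) // 2]]
```

Because only the median block contributes to the update, a minority of blocks contaminated by extreme samples is simply ignored.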

Given two sets of parameters $\theta$ and $\theta'$, the quantities in Step 2 are defined as the block-wise mean loss increments

$$\ell_{B_k}[\theta, \theta'] := \frac{1}{|B_k|} \sum_{i \in B_k} \Bigl(\ell\bigl[y_i, g_\theta[x_i]\bigr] - \ell\bigl[y_i, g_{\theta'}[x_i]\bigr]\Bigr),$$

and we denote by $\mathrm{med}$ a $1/2$-quantile (that is, a median) of the set $\{\ell_{B_1}[\theta, \theta'], \dots, \ell_{B_K}[\theta, \theta']\}$. In particular, in Step 3, we compute the empirical median-of-means in Step 2 by defining

$$\mathrm{MoM}_K[\theta, \theta'] := \mathrm{med}\bigl\{\ell_{B_1}[\theta, \theta'], \dots, \ell_{B_K}[\theta, \theta']\bigr\}. \qquad (6)$$
Our estimator is then the solution of the min-max problem of the increment tests defined in the following. (The estimator can also be seen as a generalization of the standard least-squares/cross-entropy approaches—see Appendix C.)

Definition 1 (DeepMoM).

For $K \in \{1, \dots, n\}$ and given blocks $B_1, \dots, B_K$ as described above, we define the DeepMoM estimator

$$\hat{\theta}^{\mathrm{MoM}} \in \operatorname*{argmin}_{\theta} \, \sup_{\theta'} \, \mathrm{MoM}_K[\theta, \theta'].$$
The rationale is as follows: on the one hand, using least-squares/cross-entropy on each block ensures efficient use of the “good” samples; on the other hand, taking the median over the blocks removes the corruptions and, therefore, ensures robustness toward the “bad” samples.
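The increment test underlying Definition 1 can be sketched numerically; here, per-sample loss values stand in for the losses of the two parameter settings, and the function name is ours:

```python
import numpy as np

def mom_increment(losses_theta, losses_theta_prime, blocks):
    """Empirical median over blocks of the block-wise mean loss
    increments between two parameter settings."""
    diffs = losses_theta - losses_theta_prime
    return float(np.median([diffs[b].mean() for b in blocks]))
```

Minimizing this quantity in the first argument while maximizing it in the second drives the parameters toward a setting whose losses beat any challenger on a majority of blocks.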

3 Related literature

We now take a moment to highlight relationships with, as well as differences from, other approaches. Since problematic samples are the rule rather than the exception in deep learning, the sensitivity of the standard loss functions has sparked much research interest. In regression settings, for example, Barron (2019); Belagiannis et al. (2015); Jiang et al. (2018); Wang et al. (2016); Lederer (2020) have replaced the squared-error loss by the absolute-deviation loss, which generates estimators for the empirical median, or by the Huber loss function Huber (1964); Huber and Ronchetti (2009); Hampel et al. (2011)

$$\ell^{\mathrm{H}}[y, g_\theta[x]] := \begin{cases} \tfrac{1}{2}\bigl(y - g_\theta[x]\bigr)^2 & \text{if } |y - g_\theta[x]| \le \delta, \\ \delta\,|y - g_\theta[x]| - \tfrac{\delta^2}{2} & \text{otherwise}, \end{cases}$$

where $\delta > 0$ is a tuning parameter that determines the robustness of the function. In classification settings, for example, Goodfellow et al. (2015); Madry et al. (2019); Wong and Kolter (2018) have added a penalty on the parameters. Changing the loss function in those ways can make the estimators robust toward the problematic data, but it also forfeits the efficiency of the standard loss functions in using the informative data. In contrast, our approach offers robustness with respect to the “bad” samples but also efficient use of the “good” samples.
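For comparison, the Huber loss mentioned above interpolates between the squared-error and absolute-deviation losses; a minimal sketch (the parameter name `delta` is ours):

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond,
    so large outliers contribute only linearly to the objective."""
    a = np.abs(residual)
    return np.where(a <= delta,
                    0.5 * residual ** 2,
                    delta * (a - 0.5 * delta))
```

The linear tail bounds each sample's gradient, which is the source of the robustness, but it also discards curvature information on benign samples.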

Another related topic is adversarial attacks. Adversarial attacks are intentional corruptions of the data with the goal of exploiting weaknesses of specific deep-learning pipelines Kurakin et al. (2016b); Goodfellow et al. (2018); Brückner et al. (2012); Su et al. (2019); Athalye et al. (2018). Hence, papers on adversarial attacks and our paper both study data corruption. However, the perspectives on corruptions are very different: while the literature on adversarial attacks has a notion of a “mean-spirited opponent,” we have a much more general notion of “good” and “bad” samples. The adversarial-attack perspective is much more common in deep learning, but our view is much more common in the sciences more generally. The different notions also lead to different methods: methods in the context of adversarial attacks concern specific types of attacks and pipelines, while our method can be seen as a way to render deep learning more robust in general. It would be misleading to include methods from the adversarial-attack literature in our numerical comparisons (they typically do not perform well beyond the specific types of attacks and pipelines they are designed for), but one could use our method in adversarial-attack frameworks. To avoid digression, we defer such studies to future work.

The following papers are related on a more general level: He et al. (2017) highlights that the combination of non-robust methods does not lead to a robust method. Carlini and Wagner (2017) shows that even the detection of problematic input is very difficult. Xu and Mannor (2010) introduces a notion of algorithmic robustness to study differences between training errors and expected errors. Tramèr and Boneh (2019) states that an ensemble of two robust methods, each of which is robust to a different form of perturbation, may be robust to neither. Tsipras et al. (2019) demonstrates that there exists a trade-off between a model’s standard accuracy and its robustness to adversarial perturbations.

4 Algorithm and numerical analysis

In this section, we devise an algorithm for computing the DeepMoM estimator of Definition 1. We then corroborate in simulations that our estimator is both robust toward corruptions as well as efficient in using benign data.

4.1 Algorithm

It turns out that the DeepMoM estimator can be computed with standard optimization techniques. In particular, we can apply stochastic-gradient steps. The only minor challenge is that the estimator involves both a minimum and a supremum, but this can be addressed by using two updates in each optimization step: one update to make progress with respect to the minimum, and one update to make progress with respect to the supremum.

To be more formal, we want to compute the estimator of Definition 1 for given blocks on the data defined in Section 2. This amounts to finding updates such that our objective function $\mathrm{MoM}_K[\theta, \theta']$ descends in its first argument $\theta$ and ascends in its second argument $\theta'$. Hence, we are concerned with the gradients of

$$\theta \mapsto \mathrm{MoM}_K[\theta, \theta'] \quad \text{for fixed } \theta' \qquad (7)$$

and

$$\theta' \mapsto \mathrm{MoM}_K[\theta, \theta'] \quad \text{for fixed } \theta. \qquad (8)$$

For simplicity, we use the standard notation for the gradient of an objective function with respect to the parameters and for the derivatives of the activation functions. (In line with the usual conventions, the derivative of the ReLU function at zero is set to zero.) The computation of the gradients of (7) and (8) is deferred to Appendix A.

The above computations are then the basis for our computation of the DeepMoM estimator in Algorithm 1.

  Input: data, number of blocks, initial parameters, maximum number of iterations, stopping criterion, batch size, and learning rate.
  Output: the DeepMoM estimator of Definition 1.
  While the maximum number of iterations is not reached:
    1. Randomly select a batch of data points.
    2. Generate blocks for the selected data.
    3. Update the gradients for the first argument.
    4. First stopping criterion: break if the update of the first argument falls below the threshold.
    5. Update the gradients for the second argument.
    6. Second stopping criterion: break if the update of the second argument falls below the threshold.
Algorithm 1: stochastic gradient-based algorithm for DeepMoM

Throughout the paper, the batch size, the maximum number of iterations, and the stopping criterion are kept fixed.

Algorithm 1 provides a stochastic approximation method for standard gradient-descent optimization of the empirical-median-of-means function formulated in Display (6). A mathematical result on the convergence of the algorithm is provided in Appendix B.
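For illustration, the alternating updates can be sketched on a toy linear model with squared-error loss. This is our own simplification of Algorithm 1, not the paper's implementation: there is no batching, the iteration count is fixed, and the gradient formulas are specific to the linear model. Note that ascending the objective in the second argument amounts to descending that argument's own loss.

```python
import numpy as np

def deepmom_sgd(X, y, K=3, lr=0.1, iters=300, seed=0):
    """Alternating min-max updates in the spirit of Algorithm 1 for a
    toy linear model f[x] = x @ theta with squared-error loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)      # first argument (minimized)
    theta_p = np.zeros(d)    # second argument (maximized over)

    def median_block(a, b):
        # Re-partition, then pick the block whose mean loss increment
        # between the two parameter settings is the empirical median.
        blocks = np.array_split(rng.permutation(n), K)
        inc = (y - X @ a) ** 2 - (y - X @ b) ** 2
        means = np.array([inc[blk].mean() for blk in blocks])
        return blocks[np.argsort(means)[(K - 1) // 2]]

    for _ in range(iters):
        # descent step for theta on the median block's mean squared loss
        b = median_block(theta, theta_p)
        theta -= lr * (-2.0 / len(b)) * X[b].T @ (y[b] - X[b] @ theta)
        # ascent step for theta_p: increasing the objective means
        # decreasing theta_p's loss, so theta_p also takes a descent step
        b = median_block(theta, theta_p)
        theta_p -= lr * (-2.0 / len(b)) * X[b].T @ (y[b] - X[b] @ theta_p)
    return theta
```

On clean data, every block's loss is minimized by the same parameters, so the iteration behaves like ordinary stochastic gradient descent; the median selection only matters once some blocks are contaminated.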

4.2 Numerical analysis for regression data

We now consider regression data and show that our approach can indeed outmatch other approaches, such as vanilla squared-error, absolute-deviation, and Huber estimation.

General setup

We consider a uniform width across the hidden layers. The elements of the input vectors are i.i.d. sampled from a standard Gaussian distribution and then normalized. The elements of the true weight matrices and true bias parameters are i.i.d. sampled from a uniform distribution. The stochastic noise variables are i.i.d. sampled from a centered Gaussian distribution whose variance is calibrated to a fixed empirical signal-to-noise ratio.
Data corruptions

We corrupt the data in three different ways. Recall that $\mathcal{I}$ and $\mathcal{O}$ denote the sets of informative samples and corrupted samples (outliers), respectively.

Corrupted outputs (outliers): The noise variables for outliers are replaced by i.i.d. samples from a uniform distribution. This means that the corresponding outputs are subject to heavy yet bounded corruptions.

Corrupted outputs (everywhere): All noise variables are replaced by i.i.d. samples from a Student’s t-distribution with few degrees of freedom. This means that all outputs are subject to unbounded corruptions.

Corrupted inputs: The elements of the input vectors for outliers receive an additional perturbation (added after computing the outputs) that is i.i.d. sampled from a standard Gaussian distribution. This means that the analyst gets to see corrupted versions of the inputs.

Figure 1: a comparison of the averaged squared loss of least-squares and DeepMoM (as in Algorithm 1) as a function of the gradient updates.

Table 1: DeepMoM outperforms the absolute-deviation, Huber, and least-squares estimators in robustness for all settings. Scaled means of prediction errors (DeepMoM | absolute deviation | Huber | least squares) at increasing levels of corruption:
  Corrupted outputs (uniform distribution): 1.00 | 1.58 | 1.29 | 1.60; 1.22 | 12.09 | 7.36 | 17.42; 1.48 | 26.81 | 33.69 | 72.21; 2.47 | 80.34 | 68.76 | 121.58
  Corrupted outputs (t-distribution): df 10: 1.16 | 1.40 | 1.34 | 1.93; df 1: 1.11 | 1.38 | 1.32 | 1.83
  Corrupted inputs: 1.26 | 1.46 | 1.77 | 1.61; 1.27 | 1.38 | 1.38 | 1.80; 1.37 | 1.66 | 1.90 | 1.75

Table 2: the analogous comparison in a second setting; DeepMoM again outperforms its contenders in robustness for all settings:
  Corrupted outputs (uniform distribution): 1.00 | 2.04 | 1.71 | 3.62; 1.58 | 19.85 | 10.66 | 31.66; 1.73 | 63.69 | 70.71 | 116.85; 2.06 | 159.17 | 133.71 | 229.14
  Corrupted outputs (t-distribution): df 10: 1.05 | 1.72 | 1.75 | 2.83; df 1: 1.99 | 2.18 | 1.81 | 3.93
  Corrupted inputs: 1.63 | 2.37 | 1.97 | 2.70; 1.75 | 2.08 | 2.53 | 3.21; 1.86 | 2.22 | 2.63 | 3.78

Table 3: the analogous comparison in a third setting; DeepMoM again outperforms its contenders in robustness for all settings:
  Corrupted outputs (uniform distribution): 1.00 | 1.74 | 1.53 | 2.05; 1.05 | 14.34 | 10.84 | 14.45; 1.39 | 44.80 | 44.29 | 44.58; 1.95 | 78.19 | 95.02 | 82.56
  Corrupted outputs (t-distribution): df 10: 0.97 | 1.58 | 1.70 | 2.44; df 1: 0.99 | 1.68 | 1.78 | 2.05
  Corrupted inputs: 1.01 | 1.60 | 1.76 | 1.99; 1.07 | 1.66 | 1.58 | 1.87; 1.13 | 1.58 | 1.38 | 2.13

Error quantification

Several data sets are generated as described above. The first half of the samples in each data set is assigned to training and the remaining half to testing. For each estimator, the average generalization error is computed and re-scaled with respect to the DeepMoM approach on fully informative data.

The contenders are the absolute-deviation, Huber, and squared-error loss functions, and we denote the corresponding estimators obtained by minimizing these losses on the training data accordingly. Besides, we consider a sequence of DeepMoM estimators over a grid of block numbers, and we further consider a sequence of Huber estimators whose tuning parameters are set to quantiles of the absolute residuals.

Results and conclusions

The results for the different settings are summarized in Tables 1–3. First, we observe that the DeepMoM, least-squares, and Huber estimators behave very similarly in the uncorrupted case and for mildly corrupted outputs. But once the corruptions are more substantial, our approach clearly outperforms the other approaches. In general, we conclude that DeepMoM is efficient on benign data and robust on problematic data.

4.3 Numerical analysis for classification data

We now consider multi-class problems and demonstrate that the DeepMoM estimator in Definition 1 outperforms soft-max cross-entropy estimation in terms of prediction accuracies.

General setup

We consider a spiral data set consisting of five classes with identical cardinalities that span the entire index set. We denote by $\odot$ the Hadamard product between two vectors and by $\mathcal{N}[\mu, \sigma]$ the normal distribution with mean $\mu$ and standard deviation $\sigma$. For each class, the two-dimensional inputs follow a spiral arc whose radius increases linearly and whose angle is perturbed by Gaussian noise. The data is visualized in Figure 2.

To fit this data, we consider a two-layer ReLU network with uniform width. Each element of the input vectors is then divided by the maximum absolute value among them so that the inputs are scaled to a common range.
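The exact constants of the spiral construction are not reproduced here; a hedged sketch of such a five-class spiral (all parameter choices below are our own illustrative assumptions) is:

```python
import numpy as np

def make_spiral(n_per_class=100, n_classes=5, noise=0.2, seed=0):
    """Two-dimensional spiral data: each class traces an arc whose
    radius grows linearly while the angle is jittered by Gaussian
    noise (constants here are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_classes):
        r = np.linspace(0.0, 1.0, n_per_class)          # growing radius
        t = (np.linspace(c * 4.0, (c + 1) * 4.0, n_per_class)
             + noise * rng.standard_normal(n_per_class))  # jittered angle
        X.append(np.column_stack([r * np.sin(t), r * np.cos(t)]))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)
```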

Figure 2: the two-dimensional spiral data set with five classes.

Table 4: DeepMoM outperforms soft-max cross-entropy in robustness for all settings. Prediction accuracies (DeepMoM vs. cross-entropy) at increasing levels of corruption:
  Corrupted outputs (shuffled labels): 95.6% vs. 95.6%; 95.2% vs. 94.2%; 88.0% vs. 79.6%; 85.6% vs. 73.2%
  Corrupted inputs: 95.6% vs. 95.6%; 95.6% vs. 95.6%; 93.8% vs. 90.2%

Data corruptions

We corrupt the spiral data in the two ways described below.

Corrupted labels: The labels of the problematic samples are shuffled to other class labels.

Corrupted inputs: The elements of the input vectors for outliers are subjected to an additional perturbation, as described in Section 4.2.

Error quantification

The first half of the samples is used for training and the other half for testing. For each estimator, computed by minimizing the corresponding loss on the training data, the generalization accuracy is calculated as the mean of the indicator that the predicted label equals the true label. We select the number of blocks as in Section 4.2.
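The accuracy computation amounts to averaging the indicator of agreement; a one-line sketch (the function name is ours):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Generalization accuracy: the mean of the indicator that the
    predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```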

Results and conclusions

Table 4 summarizes the results for the various settings. To begin, we see that the DeepMoM and cross-entropy estimators behave very similarly in the uncorrupted case and for slightly corrupted labels and inputs. However, once the corruptions are more serious, our DeepMoM estimator clearly outperforms its contender. In general, we conclude that DeepMoM is efficient on benign data and robust on problematic data.

5 Applications

We now illustrate the potential of DeepMoM in practice. We demonstrate that DeepMoM can withstand data corruptions considerably better than usual approaches.

Application in regression data

The first application is the prediction of the critical temperature of a superconductor. Superconductor materials have a wide range of practical uses. Because they conduct electricity without resistance, superconducting wires and electrical systems have the potential to transport and deliver electricity without any energy loss. This resistance-free property, however, occurs only when the ambient temperature is at or below the critical temperature of the superconductor. As a result, estimating a superconductor’s critical temperature has challenged scientists since the discovery of superconductivity.

The data Hamidieh (2018); Dua and Graff (2017) contains features extracted from each superconductor’s chemical formula Hamidieh (2018). In this example, we first select samples randomly for training and keep the remaining samples for testing. The normalization of the input data is the same as described in Section 4.2. We fit this data by considering a two-layer ReLU network with 50 neurons in the first hidden layer and 5 neurons in the second.

Application in classification data

The second application is the classification of handwritten digit images from the well-known MNIST data set LeCun et al. (1998); Lecun et al. This data contains separate training and testing samples with pixel intensities as features. The preprocessing of the input data and the network settings are the same as described in Section 4.2.

Our aim in both applications is to validate the DeepMoM method on original data with some corruptions and to compare the results to the other robust contenders for deep learning specified in Sections 4.2 and 4.3, respectively.

Table 5: DeepMoM outperforms its contenders in robustness for all settings.
  (a) Superconductor data set: scaled means of prediction errors for corrupted outputs (uniform distribution; t-distribution with df 10 (1.25) and df 1 (1.36)) and corrupted inputs.
  (b) MNIST data set: prediction accuracies for corrupted outputs (shuffled labels) and corrupted inputs.

The prediction results of these two applications are presented in Tables 5(a) and 5(b), respectively. These results again suggest that the DeepMoM estimator provides higher accuracies than its contenders.

6 Discussion

Our new approach to training the parameters of neural networks is robust against corrupted samples and yet leverages informative samples efficiently. We have confirmed these properties numerically in Sections 4.2, 4.3, and 5. The approach can, therefore, be used as a general substitute for basic least-squares-type or cross-entropy-type approaches.

We have restricted ourselves to feed-forward neural networks with ReLU activation, but there are no obstacles to applying our approach more generally, for example, to convolutional networks or other activation functions. However, to keep the paper clear and concise, we defer a detailed analysis of DeepMoM in other deep-learning frameworks to future work.

Similarly, we model corruption by uniform or heavy-tailed random perturbations of the inputs or outputs or by randomly swapping labels, but, of course, one can conceive a plethora of different ways to corrupt data.

In sum, given modern data’s limitations and our approach’s ability to make efficient use of such data, we believe that our contribution can have a substantial impact on deep-learning practice.


We thank Guillaume Lecué, Timothé Mathieu, Mahsa Taheri, and Fang Xie for their insightful inputs and suggestions.


  • [1] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. arXiv:1801.00553. Cited by: §1.
  • [2] E. Arazo, D. Ortego, P. Albert, N. O’Connor, and K. Mcguinness (2019) Unsupervised label noise modeling and loss correction. Proceedings of the 36th international conference on machine learning 97, pp. 312–321. Cited by: §1.
  • [3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2018) Synthesizing robust adversarial examples. arXiv:1707.07397. Cited by: §3.
  • [4] J. Barron (2019) A general and adaptive robust loss function. arXiv:1701.03077. Cited by: §1, §3.
  • [5] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab (2015) Robust optimization for deep regression. arXiv:1505.06606. Cited by: §1, §3.
  • [6] M. Brückner, C. Kanzow, and T. Scheffer (2012) Static prediction games for adversarial learning problems. Journal of machine learning research 13 (85), pp. 2617–2654. Cited by: §3.
  • [7] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. arXiv:1705.07263. Cited by: §3.
  • [8] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §5.
  • [9] S. Friedrich, G. Antes, S. Behr, H. Binder, W. Brannath, F. Dumpert, K. Ickstadt, H. Kestler, J. Lederer, H. Leitgöb, M. Pauly, A. Steland, A. Wilhelm, and T. Friede (2020) Is there a role for statistics in artificial intelligence? arXiv:2009.09070. Cited by: §1.
  • [10] I. Goodfellow, P. McDaniel, and N. Papernot (2018) Making machine learning robust against adversarial inputs: such inputs distort how machine-learning based systems are able to function in the world as it is. Communications of the ACM 61 (7), pp. 56–66. Cited by: §3.
  • [11] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. arXiv:1412.6572. Cited by: §3.
  • [12] K. Hamidieh (2018) A data-driven statistical model for predicting the critical temperature of a superconductor. Computational materials science 154, pp. 346–354. Cited by: §5.
  • [13] F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel (2011) Robust statistics: the approach based on influence functions. Wiley. Cited by: §3.
  • [14] W. He, J. Wei, X. Chen, N. Carlini, and D. Song (2017) Adversarial example defenses: ensembles of weak defenses are not strong. arXiv:1706.04701. Cited by: §3.
  • [15] P. Huber and E. Ronchetti (2009) Robust statistics. Wiley. Cited by: §3.
  • [16] P. Huber (1964) Robust estimation of a location parameter. Annals of mathematical statistics 35 (1), pp. 73–101. Cited by: §2, §3.
  • [17] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and F.-F. Li (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv:1712.05055. Cited by: §1, §3.
  • [18] Tencent Keen Security Lab (2019) Experimental security research of Tesla Autopilot. Cited by: §1.
  • [19] J. Kos and D. Song (2017) Delving into adversarial attacks on deep policies. arXiv:1705.06452. Cited by: §1.
  • [20] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv:1607.02533. Cited by: §1.
  • [21] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv:1611.01236. Cited by: §1, §3.
  • [22] L. Le Cam (1973) Convergence of estimates under dimensionality restrictions. The annals of statistics 1 (1), pp. 38–53. Cited by: §1.
  • [23] L. Le Cam (1986) Sums of independent random variables. In Asymptotic methods in statistical decision theory, Springer series in statistics, pp. 399–456. Cited by: §1.
  • [24] G. Lecué, M. Lerasle, and T. Mathieu (2020) Robust classification via MOM minimization. Machine learning 109 (8), pp. 1635–1665. Cited by: §1.
  • [25] G. Lecué and M. Lerasle (2017) Learning from mom’s principles: le cam’s approach. arXiv:1701.01961. Cited by: §1.
  • [26] G. Lecué and M. Lerasle (2017) Robust machine learning by median-of-means: theory and practice. arXiv:1711.10306. Cited by: §1, §2.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §5.
  • [28] Y. Lecun, C. Cortes, and C. Burges The MNIST database. MNIST handwritten digit database. External Links: Link Cited by: §5.
  • [29] J. Lederer (2020) Risk bounds for robust deep learning. arXiv:2009.06202. Cited by: §1, §3.
  • [30] J. Lederer (2021) Activation functions in artificial neural networks: a systematic overview. arXiv:2101.09957. Cited by: §2.
  • [31] C. Liu and M. Belkin (2019) Accelerating SGD with momentum for over-parameterized learning. arXiv:1810.13395. Cited by: Appendix B.
  • [32] G. Lugosi and S. Mendelson (2019) Regularization, sparse recovery, and median-of-means tournaments. Bernoulli 25 (3). Cited by: §1.
  • [33] G. Lugosi and S. Mendelson (2019) Risk minimization by median-of-means tournaments. Journal of the European mathematical society 22 (3), pp. 925–965. Cited by: §1.
  • [34] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083. Cited by: §1.
  • [35] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2019) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083. Cited by: §3.
  • [36] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2015) Distillation as a defense to adversarial perturbations against deep neural networks. arXiv:1511.04508. Cited by: §1.
  • [37] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. arXiv. Cited by: §1.
  • [38] R core team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing. Cited by: Appendix D.
  • [39] Y. Roh, G. Heo, and S. Whang (2021) A survey on data collection for machine learning: a big data - AI integration perspective. IEEE transactions on knowledge and data engineering 33 (4), pp. 1328–1347. Cited by: §1.
  • [40] D. Rumelhart, G. Hinton, and R. Williams (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536. Cited by: Appendix A.
  • [41] H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv:1906.04584. Cited by: §1.
  • [42] M. Sharif, S. Bhagavatula, L. Bauer, and M. Reiter (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. Cited by: §1.
  • [43] J. Su, D. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE transactions on evolutionary computation 23 (5), pp. 828–841. Cited by: §3.
  • [44] R. Sun (2019) Optimization for deep learning: theory and algorithms. arXiv:1912.08957. Cited by: Appendix B.
  • [45] F. Tramèr and D. Boneh (2019) Adversarial training and robustness for multiple perturbations. arXiv:1904.13000. Cited by: §3.
  • [46] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv:1705.07204. Cited by: §1.
  • [47] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. Cited by: §3.
  • [48] H. Wang and C.-N. Yu (2019) A direct approach to robust deep learning using adversarial networks. arXiv:1905.09591. Cited by: §1.
  • [49] T. Wang, Y. Gu, D. Mehta, X. Zhao, and E. Bernal (2018) Towards robust deep neural networks. arXiv:1810.11726. Cited by: §1.
  • [50] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. Huang (2016) Studying very low resolution recognition using deep networks. arXiv:1601.04153. Cited by: §1, §3.
  • [51] E. Wong and J. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv:1711.00851. Cited by: §3.
  • [52] H. Xu and S. Mannor (2010) Robustness and generalization. arXiv:1005.2243. Cited by: §3.
  • [53] K. Yi and J. Wu (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Cited by: §1.
  • [54] X. Yuan, P. He, Q. Zhu, and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems 30 (9), pp. 2805–2824. Cited by: §1.

Appendix A Gradients

Given , we can find—see the definition of the empirical median-of-means in (6)—indexes (which depend on , respectively) such that and