Guarantees on learning depth-2 neural networks under a data-poisoning attack

05/04/2020 ∙ by Anirbit Mukherjee, et al. ∙ Johns Hopkins University

In recent times many state-of-the-art machine learning models have been shown to be fragile to adversarial attacks. In this work we attempt to build our theoretical understanding of adversarially robust learning with neural nets. We demonstrate a specific class of neural networks of finite size and a non-gradient stochastic algorithm which tries to recover the weights of the net generating the realizable true labels in the presence of an oracle doing a bounded amount of malicious additive distortion to the labels. We prove (nearly optimal) trade-offs among the magnitude of the adversarial attack, the accuracy and the confidence achieved by the proposed algorithm.




I Introduction

The seminal paper [35] was among the first to highlight a key vulnerability of state-of-the-art network architectures like GoogLeNet: adding small imperceptible adversarial noise to test data can dramatically impact the performance of the network. In these cases, despite the vulnerability of the predictive models to the distorted input, human observers are still able to correctly classify the adversarially corrupted data. In the last few years, adversarial attack experiments of the above kind have been improved upon and replicated on several state-of-the-art neural network implementations [13, 27, 3, 14]. This phenomenon has also resulted in new adversarial defences being proposed to counter the attacks. Such empirical observations have been systematically reviewed in [1, 28].

Here we quickly review the conventional mathematical setup of adversarial risk that is typically used to quantify the robust performance of predictors. Suppose $\mathcal{Z}$ is the measure space where the data lives, $\mathcal{H}$ is the hypothesis space where the predictor is being searched, and suppose the loss function is a mapping $\ell : \mathcal{H} \times \mathcal{Z} \rightarrow \mathbb{R}^+$. Then an adversarial attack is a set map $\mathbf{A} : \mathcal{Z} \rightarrow 2^{\mathcal{Z}}$. If $\mathcal{D}$ is the data distribution/probability measure on $\mathcal{Z}$, then given an adversarial attack $\mathbf{A}$, the adversarial risk of a hypothesis $h \in \mathcal{H}$ is defined as,

$$\mathcal{R}(h ; \ell, \mathbf{A}) := \mathbb{E}_{z \sim \mathcal{D}} \left[ \sup_{\tilde{z} \in \mathbf{A}(z)} \ell(h, \tilde{z}) \right]$$

The adversarial learning task is to find $\arg\min_{h \in \mathcal{H}} \mathcal{R}(h ; \ell, \mathbf{A})$. Any hypothesis with low adversarial risk will be stable to attacks of type $\mathbf{A}$ on the test data. This optimization formulation of adversarial robustness has been extensively explored in recent years: multiple attack strategies have been systematically catalogued in [7, 22, 33], computational hardness of finding an adversarial risk minimizing hypothesis has been analyzed in [4, 6, 32, 24], the issue of certifying adversarial robustness of a given predictor has been analyzed in [30, 29] and bounds on the Rademacher complexity of adversarial risk have been explored in [40, 17].
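As a rough illustration of the adversarial risk just described, the following sketch estimates it empirically by averaging, over samples, the worst loss inside each sample's attack set. The function names, the toy linear predictor, and the finite three-point attack set are all assumptions made for this example; a real attack set would typically be a continuum and the inner maximization would need an optimizer.

```python
import numpy as np

def empirical_adversarial_risk(h, loss, attack, samples):
    """Average, over the samples, of the worst loss within each sample's attack set."""
    worst = [max(loss(h, z_adv) for z_adv in attack(z)) for z in samples]
    return float(np.mean(worst))

# Hypothetical toy instance: a 1-D linear predictor h(x) = a * x with squared loss.
def squared_loss(a, z):
    x, y = z
    return (a * x - y) ** 2

def shift_attack(z, radius=0.5):
    # Finite stand-in for an attack set: the clean point plus two shifted inputs.
    x, y = z
    return [(x, y), (x - radius, y), (x + radius, y)]

samples = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0)]

# The predictor a = 2 fits the clean data exactly, yet its adversarial risk is positive.
clean_risk = float(np.mean([squared_loss(2.0, z) for z in samples]))
adv_risk = empirical_adversarial_risk(2.0, squared_loss, shift_attack, samples)
```

Here the clean risk vanishes while every sample admits a shifted input with unit loss, which is the gap between standard and adversarial risk that the formulation is meant to capture.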

In contrast we are aware of only one attempt at provable training of a neural network in the above model of adversarial risk, [10]. This result as it stands only applies when the net is asymptotically large/in the NTK regime. The authors here were able to establish that (deterministic) gradient descent can minimize the adversarial empirical risk on asymptotically large networks.

Thus it still remains an open challenge to demonstrate an example of provably robust training when (a) the neural network is realistically large/finite and (b) the algorithm has the common structure of being iterative and stochastic in nature. In this work we take a few steps towards this goal by working in a framework inspired by the causative attack model [34, 2]. The learning task in this attack model is framed as a game between a defender who seeks to learn and an attacker who aims to prevent this. In a typical scenario the defender draws a finite number of samples from the true distribution (say $S_c$) and for some $\epsilon \in (0,1)$ the attacker mixes into the training data a set $S_p$ of (maybe adaptively) corrupted training samples such that $|S_p| = \epsilon |S_c|$. Now the defender has to train on the set $S_c \cup S_p$. We note that our model differs from this causative attack model because we allow for an arbitrary fraction (including all) of the training data to be corrupted by the specific kind of adversarial attack that we allow. Also, our adversary corrupts the training data in an online fashion, which is sometimes also called the data-poisoning attack [36], [41], [19].

We also note that we consider the case of regression on nets as opposed to the more frequently done empirical studies on adversarially robust classification by neural nets. To the best of our knowledge, previous work on robust regression has been limited to linear functions and has either considered corruptions that are limited to a small subset [15], [37] of the input space/feature predictors or made structural assumptions on the corruption. Despite the substantial progress with understanding robust linear regression [38], [39], [8], [5], [23], [21], [20], [25], the corresponding questions have remained open for even simple neural networks.

Our first key step is to make a careful choice of the neural network class to work with, as given in Definition 1. Secondly, for the optimization algorithm, we draw inspiration from the different versions of the iterative stochastic non-gradient Tron algorithms analyzed in the past, [31], [26], [9], [16], [18], [12], [11]. We generalize this class of algorithms to a form as given in Algorithm 1 and we run it in the presence of an adversarial oracle that we instantiate which is free to make any additive perturbation (within a constant norm constraint) to the true output generated by a network of the same architecture as the one being trained on. Beyond boundedness, we impose no other distributional constraint on how the adversary additively distorts the true label.

In our main result, theorem II.1, we show that against the above adversary this algorithm, while trying to recover the original net’s parameters, achieves a certain trade-off between accuracy, confidence and the maximum perturbation that the adversary is allowed to make. Further, in section II-A we will explain that this trade-off is nearly optimal in the worst case.

I-A The Mathematical Setup

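As alluded to earlier we move away from the adversarial risk framework. We work in the supervised learning framework where we observe (input, output) data pairs $z = (x, y) \in \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ where $\mathcal{X}$ and $\mathcal{Y}$ are measure spaces. Let $\mathcal{D}$ be the distribution over the measure space $\mathcal{Z}$ and let $\mathcal{D}_x$ be the marginal distribution over the input space $\mathcal{X}$. If $\mathcal{H}$ is the hypothesis space for our learning task then let $\ell : \mathcal{H} \times \mathcal{Z} \rightarrow \mathbb{R}^+$ be a loss function. We model an adversarial oracle as a map $\mathbf{O}_{\mathbf{A}} : \mathcal{Z} \rightarrow \mathcal{Z}$ that corrupts the data with the intention of making the learning task harder. The learner observes only the corrupted data $z_{\mathbf{A}} := \mathbf{O}_{\mathbf{A}}(z)$ where $z \sim \mathcal{D}$ and aims to find a hypothesis $h \in \mathcal{H}$ that minimizes the true risk,

$$\mathcal{R}(h) := \mathbb{E}_{z \sim \mathcal{D}} \left[ \ell(h, z) \right]$$

In this work, we specifically consider the instance when $\mathcal{X} = \mathbb{R}^n$, $\mathcal{Y} = \mathbb{R}$, $\ell$ is the square loss function and the hypothesis space $\mathcal{H}$ is the class of depth-2, width-$k$ neural networks defined as follows,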

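Definition 1 (Single Filter Neural Nets of Depth-2 and Width-k).

Given a set of $k$ sensing matrices $\mathcal{A} = \{A_i \in \mathbb{R}^{r \times n} \mid i = 1, \ldots, k\}$, an $\alpha$-leaky activation mapping $\sigma : \mathbb{R} \ni y \mapsto y \mathbf{1}_{y \geq 0} + \alpha y \mathbf{1}_{y < 0} \in \mathbb{R}$, and a filter space $\mathcal{W} \subseteq \mathbb{R}^r$, we define the function class $\mathcal{F}_{k,\alpha,\mathcal{A},\mathcal{W}}$ as,

$$\mathcal{F}_{k,\alpha,\mathcal{A},\mathcal{W}} := \left\{ f_w : \mathbb{R}^n \ni x \mapsto \frac{1}{k} \sum_{i=1}^{k} \sigma\left( w^\top A_i x \right) \in \mathbb{R} \;\middle|\; w \in \mathcal{W} \right\}$$

Note that the above class of neural networks encompasses the following common instances: (a) single gates, obtained with $k = 1$ and $A_1$ the identity, and (b) Depth-2, Width-$k$ Convolutional Neural Nets, when the sensing matrices are such that each has exactly one 1 in each row, at most one 1 in each column, and all other entries zero.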
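To make the two special cases concrete, here is a minimal NumPy sketch of a single-filter net. It assumes the net averages the outputs of the $k$ gates, each applying an $\alpha$-leaky activation to the filter's inner product with one sensed view of the input; the averaging, the function names, and the example matrices are assumptions of this sketch.

```python
import numpy as np

def leaky_relu(y, alpha):
    # alpha-leaky activation: identity on y >= 0, slope alpha on y < 0.
    return np.where(y >= 0, y, alpha * y)

def single_filter_net(w, x, sensing_matrices, alpha):
    """Assumed form: average over the k gates of sigma(w^T A_i x)."""
    pre_activations = np.array([w @ (A @ x) for A in sensing_matrices])
    return float(np.mean(leaky_relu(pre_activations, alpha)))

x = np.array([1.0, -2.0, 0.5, 3.0])

# (a) A single gate: one sensing matrix, the identity.
w_gate = np.array([1.0, 1.0, 0.0, -1.0])
gate_out = single_filter_net(w_gate, x, [np.eye(4)], alpha=0.1)

# (b) A convolution-like net: each sensing matrix selects one patch of the input
# (exactly one 1 per row, at most one 1 per column), and the filter is shared.
patch_1 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
patch_2 = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
w_filter = np.array([1.0, -1.0])
conv_out = single_filter_net(w_filter, x, [patch_1, patch_2], alpha=0.1)
```

In case (b) the two patches give pre-activations 3 and -2.5, so the shared filter sees non-overlapping windows of the input exactly as a stride-2 convolution would.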

Optimization with an Adversarial Oracle

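We assume that there exists a true filter $w_* \in \mathcal{W}$ such that the adversarial oracle acts as $\mathbb{R}^n \ni x \mapsto f_{w_*}(x) + \xi_x \in \mathbb{R}$, while $|\xi_x| \leq \theta$ for some fixed $\theta$, and our risk minimization optimization problem can be stated as: $\min_{w \in \mathcal{W}} \mathbb{E}_{x \sim \mathcal{D}} \left[ \left( f_{w_*}(x) - f_w(x) \right)^2 \right]$, where one only has access to the corrupted outputs as defined above. Towards being able to solve the described optimization problem we make the following assumptions about the data distribution $\mathcal{D}$,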

  • Assumptions 1.1 : Parity Symmetry

    We assume that the input distribution $\mathcal{D}$ is symmetric under the parity transformation, i.e. if $x$ is a random variable such that $x \sim \mathcal{D}$ then we would also have $-x \sim \mathcal{D}$.

  • Assumptions 1.2 : Finiteness of certain expectations

    The following expectations under are assumed to be finite,

Note that in the above the adversarial oracle is free to design however cleverly (maybe even as a function of all the data seen so far and ) while obeying the norm bound. For ease of notation we define, , and,

We now state a stochastic non-gradient algorithm, Neuro-Tron inspired by [16],[18],[11],

Input: Sampling access to the marginal input distribution on .
Input: Access to adversarially corrupted output when queried with
Input: Access to the output of any for any and input.
Input: A sensing matrix and an arbitrarily chosen starting point of
for  do
     Sample and query the (adversarial) oracle with it.
     The oracle replies back with
     Form the Tron-gradient,
end for
Algorithm 1 Neuro-Tron (multi-gate, single filter, stochastic)
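The loop in Algorithm 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' reference implementation: it assumes the Tron-gradient is the label residual times `M @ x`, that the iterate is updated by adding `step_size` times that vector, and that the net averages its gates. The single-gate setup, the Gaussian inputs, the step size, and the particular bounded adversary (`theta * sign(x[0])`) are hypothetical choices made only for the demonstration.

```python
import numpy as np

def leaky_relu(y, alpha):
    return np.where(y >= 0, y, alpha * y)

def net(w, x, sensing_matrices, alpha):
    # Assumed form of the single-filter net: average of the k gate outputs.
    return float(np.mean([leaky_relu(w @ (A @ x), alpha) for A in sensing_matrices]))

def neuro_tron(oracle, sample_input, sensing_matrices, M, w_init, alpha, step_size, num_steps):
    w = w_init.copy()
    for _ in range(num_steps):
        x = sample_input()                       # sample from the input distribution
        y = oracle(x)                            # possibly corrupted label
        residual = y - net(w, x, sensing_matrices, alpha)
        g = residual * (M @ x)                   # assumed form of the Tron-gradient
        w = w + step_size * g                    # assumed update rule
    return w

rng = np.random.default_rng(0)
n, alpha, theta = 5, 0.1, 0.1
A = [np.eye(n)]                                  # single gate
M = np.eye(n)
w_star = rng.normal(size=n)

def sample():
    return rng.normal(size=n)                    # parity symmetric inputs

def clean_oracle(x):
    return net(w_star, x, A, alpha)

def noisy_oracle(x):
    # Hypothetical bounded adversary: shifts every label by at most theta.
    return net(w_star, x, A, alpha) + theta * np.sign(x[0])

w_clean = neuro_tron(clean_oracle, sample, A, M, np.zeros(n), alpha, 0.05, 3000)
w_noisy = neuro_tron(noisy_oracle, sample, A, M, np.zeros(n), alpha, 0.05, 3000)
err_clean = np.linalg.norm(w_clean - w_star)
err_noisy = np.linalg.norm(w_noisy - w_star)
```

Note that the update never differentiates the activation, which is what makes the method "non-gradient". With faithful labels the iterate approaches the true filter, while under the bounded adversary it settles at a distance that scales with the corruption budget.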

II The main result

Theorem II.1.

Suppose that Assumptions are satisfied and .  
Case I : Realizable and no noise, .  
Suppose the oracle returns faithful output i.e it sends when queried with .  
Then if we choose step size in Algorithm 1 as,

for all desired accuracy parameter and failure probability , for we have that,

Case II : Realizable with bounded adversarial corruption on labels, .  
Suppose that when queried with , the oracle returns an adversarially corrupted output where . Define . Suppose the distribution , matrix in Algorithm 1, noise bound , target accuracy and target confidence are such that,


Then if we choose step size in Algorithm 1 as,

Then for,

we have,


(a) If and is full rank i.e rank then in the above we can always choose Also if is PD then is also a valid choice. (b) It can be easily seen that the term occurring in the expression of above, “” is positive because of the lower bound imposed on the parameter (c) The above theorem essentially gives a worst-case trade-off between (the accuracy ) and (the confidence) that can be achieved when training against a fixed adversary additively corrupting the true output generated by the network by at most . In subsection II-A we will show that this worst-case trade-off is nearly optimal.

To develop some feel for the constraint imposed by equation 1 in the main theorem we look at a situation where the input data is being sampled from the normal distribution and the true output is computed by a single gate neural network.

Lemma II.2.

(Provable training for Gaussian distributions and single-gate with an adversarial oracle). Suppose and . For the choice of in Algorithm 1, the constraint in theorem II.1 can be written as,


Here , and . We invoke standard results about the Gaussian distribution to see,

Therefore, . Thus the condition obtained in theorem II.1 results in equation 2. ∎

Hence we conclude that if we want to defend against an adversary with a fixed corruption budget of with a desired accuracy of and failure probability of then a sufficient condition (a safe data distribution) is that the data distribution is with being an increasing function of the data dimension in an appropriate way such that the RHS of equation 2 remains fixed.

II-A Demonstrating The Near Optimality Of The Guarantees Of Theorem II.1

We recall that Case I of theorem II.1 shows that Algorithm 1 recovers the true filter when it has access to clean/exactly realizable data. For a given true filter consider another value for the filter and suppose that for some . It is easy to imagine cases where the supremum in the RHS exists, such as when is compactly supported. Now in this situation equation 1 says that . Hence proving optimality of the guarantee is equivalent to showing the existence of an attack within this bound for which the best accuracy possible nearly saturates this lower bound.

Now note that this choice of allows for the adversarial oracle to be such that when queried at it replies back with where . Hence the data the algorithm receives will be such that it can be exactly realized with the filter choice being . Hence the realizable case analysis of theorem II.1 will apply showing that the algorithm’s iterates are converging in high probability to . Hence the error incurred is such that .
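The attack in this paragraph can be simulated directly. The sketch below is a hypothetical instance: a single leaky gate, inputs uniform on a cube (compactly supported and parity symmetric), a "fake" filter `w_fake` near the true one, and a Tron-style update with the identity in place of the matrix from Algorithm 1, all of which are assumptions of this example. The oracle's additive distortion is exactly the difference between the fake and true nets' outputs, so the data looks perfectly realizable by the fake filter.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 4, 0.1

def gate(w, x):
    # Single alpha-leaky gate.
    z = w @ x
    return z if z >= 0 else alpha * z

w_star = np.array([1.0, -1.0, 0.5, 0.0])              # true filter
w_fake = w_star + np.array([0.2, 0.0, 0.0, 0.0])      # filter the adversary impersonates

def attacking_oracle(x):
    xi = gate(w_fake, x) - gate(w_star, x)            # additive label distortion
    return gate(w_star, x) + xi, xi

w = np.zeros(n)
max_distortion = 0.0
for _ in range(3000):
    x = rng.uniform(-1.0, 1.0, size=n)                # compactly supported, parity symmetric
    y, xi = attacking_oracle(x)
    max_distortion = max(max_distortion, abs(xi))
    w = w + 0.3 * (y - gate(w, x)) * x                # assumed Tron-style update, M = identity

dist_to_fake = np.linalg.norm(w - w_fake)
dist_to_true = np.linalg.norm(w - w_star)
```

Because the gate is 1-Lipschitz and the inputs lie in the unit cube, the distortion never exceeds 0.2 here, yet the iterates are pulled to the fake filter and the recovery error cannot fall below the gap between the two filters.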

Now consider an instantiation of the above attack happening with for and being a single gate i.e . It is easy to imagine cases where is such that is finite and it also satisfies Assumptions 1.1 and 1.2. Further, this choice of is valid since the following holds,

Thus the above setup invoked on training a gate with inputs being sampled from as above while the labels are being additively corrupted by at most demonstrates a case where the worst case accuracy guarantee of is optimal up to a constant . We note that this argument also implies the near optimality of equation 1 for any algorithm defending against this attack which also has the property of recovering the parameters correctly when the labels are exactly realizable.

III Analyzing Tron Dynamics (Proof of Theorem II.1)


Let the training data sampled till the $t$-th iteration be $S_t := \{(x_1, y_1), \ldots, (x_t, y_t)\}$. We shall overload the notation to also denote by $S_t$ the sigma algebra generated by the first $t$ samples. We recall that the weight vector update at iteration $t$ is $w_{t+1} = w_t + \eta g_t$ and this gives us,


Conditioned on , is determined while and are random and dependent on the random choice of . Now we compute the following conditional expectation,

In the first term above we can remove the activation by recalling an identity proven in [11], which we have reproduced here as lemma B.1. Thus we get,


In the above steps we have invoked the definition of that we had defined earlier. Now we bound the second term in the RHS of equation III as follows,


In the above lines we have invoked lemma B.2 twice to upperbound the term, and we have defined, . Now we take total expectations of both sides of equations III and III recalling that the conditional expectation of functions of w.r.t are random variables which are independent of the powers of . Then we substitute the resulting expressions into the RHS of equation III invoking the definitions of and to get,


Case I : Realizable, .

Here the recursion above simplifies to,


Let . Thus, for all ,

Recalling that , we can verify that the choice of step size given in the theorem is and the assumption on ensures that for this , . Therefore, for , we have

The conclusion now follows from Markov’s inequality.
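The Markov step can be spelled out generically. In the display below, $w_T$, $w_*$, $\epsilon$ and $\delta$ are assumed names for the final iterate, the true filter, the target accuracy and the failure probability, and the second inequality holds whenever the recursion has driven the expected squared distance below $\epsilon^2 \delta$.

```latex
\mathbb{P}\left[ \|w_T - w_*\|^2 > \epsilon^2 \right]
\;\leq\; \frac{\mathbb{E}\left[ \|w_T - w_*\|^2 \right]}{\epsilon^2}
\;\leq\; \frac{\epsilon^2 \delta}{\epsilon^2}
\;=\; \delta
```

Equivalently, with probability at least $1 - \delta$ the final iterate is within distance $\epsilon$ of the true filter.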

Case II : Realizable + Adversarial Noise, .

Note that the linear term in equation III is a unique complication that is introduced here because of the absence of distributional assumptions on the noise in the labels. We can now upper bound the linear term using the AM-GM inequality as follows, which also helps decouple the adversarial noise terms from the distance to the optimum,

Thus equation III becomes,

Let us define , , , and . Then the dynamics of the algorithm is given by,

We note that the above is of the same form as lemma A.1 in the Appendix with . We invoke the lemma with such that equation 1 holds. This, along with the bound on the noise, ensures that , as required by lemma A.1. The chosen value of in the theorem follows from the sufficient condition specified for in lemma A.1.

Recalling the definition of as given in the theorem statement we can see that and hence we can read off from lemma A.1 that at the value of as specified in the theorem statement we have,

and the needed high probability guarantee follows by Markov's inequality. ∎
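The proof removes the activation using an identity for parity symmetric inputs. A plausible form of that identity, assumed here rather than quoted from lemma B.1, is that for an $\alpha$-leaky activation $\sigma$ and any vectors $a, b$, $\mathbb{E}[\sigma(a^\top x)\, b^\top x] = \frac{1+\alpha}{2} \mathbb{E}[(a^\top x)(b^\top x)]$. The sketch below checks this on an empirical distribution that is made exactly parity symmetric by including $-x$ for every sampled $x$; the sample distribution and the vectors are arbitrary choices for the check.

```python
import numpy as np

def leaky_relu(z, alpha):
    return np.where(z >= 0, z, alpha * z)

rng = np.random.default_rng(0)
n, alpha = 3, 0.25

# Any base sample works; stacking it with its negation makes the
# empirical distribution exactly symmetric under x -> -x.
half = rng.uniform(-1.0, 1.0, size=(1000, n)) ** 3
xs = np.vstack([half, -half])

a = np.array([0.7, -1.2, 0.4])
b = np.array([-0.3, 0.5, 2.0])

# Left side: expectation with the activation in place.
lhs = np.mean(leaky_relu(xs @ a, alpha) * (xs @ b))
# Right side: the same expectation with the activation replaced by a constant factor.
rhs = 0.5 * (1 + alpha) * np.mean((xs @ a) * (xs @ b))
```

The two sides agree up to floating-point rounding because the leaky activation splits into a linear part with slope $(1+\alpha)/2$ and an even part, and the even part times the odd factor $b^\top x$ averages to zero under parity symmetry.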

IV Conclusion

To the best of our knowledge, in this paper we have provided the first demonstration of a class of provably robustly learnable finitely large neural networks. That is, along with the neural network class, we have given a class of adversarial oracles supplying additively corrupted labels and a corresponding stochastic algorithm which, up to a certain accuracy and confidence, performs supervised learning on our network in the presence of this malicious oracle corrupting the realizable true labels. We have also established why our guarantees are nearly optimal.

There are a number of exciting open questions that now open up from here. Firstly, it remains to broaden the scope of such proofs to more complicated neural networks and to more creative adversaries which can, say, do more difficult distortions to the labels or even corrupt the samples from the data distribution. It will also be interesting to characterize the information theoretic limits of accuracy and confidence trade-offs as a function of the corruption budget of the adversary, even while staying within the setup of theorem II.1.


We would like to thank Amitabh Basu for extensive discussions on various parts of this paper. The first author would like to thank the MINDS Data Science Fellowship of JHU for supporting this work. The first author would also like to acknowledge the extensive discussions on this topic with Anup Rao, Sridhar Mahadevan, Pan Xu and Wenlong Mou when he was interning at Adobe, San Jose during summer


  • [1] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §I.
  • [2] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar (2010) The security of machine learning. Machine Learning 81 (2), pp. 121–148. Cited by: §I.
  • [3] V. Behzadan and A. Munir (2017) Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pp. 262–275. Cited by: §I.
  • [4] S. Bubeck, E. Price, and I. Razenshteyn (2018) Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204. Cited by: §I.
  • [5] Y. Chen, C. Caramanis, and S. Mannor (2013) Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pp. III–774–III–782. Cited by: §I.
  • [6] A. Degwekar, P. Nakkiran, and V. Vaikuntanathan (2019) Computational limitations in robust classification and win-win results. arXiv preprint arXiv:1902.01086. Cited by: §I.
  • [7] Z. Dou, S. J. Osher, and B. Wang (2018) Mathematical analysis of adversarial attacks. arXiv preprint arXiv:1811.06492. Cited by: §I.
  • [8] J. Feng, H. Xu, S. Mannor, and S. Yan (2014) Robust logistic regression and classification. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, Cambridge, MA, USA, pp. 253–261. Cited by: §I.
  • [9] Y. Freund and R. E. Schapire (1999) Large margin classification using the perceptron algorithm. Machine learning 37 (3), pp. 277–296. Cited by: §I.
  • [10] R. Gao, T. Cai, H. Li, C. Hsieh, L. Wang, and J. D. Lee (2019) Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pp. 13009–13020. Cited by: §I.
  • [11] S. Goel, A. Klivans, and R. Meka (2018) Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547. Cited by: Lemma B.1, §I-A, §I, §III.
  • [12] S. Goel and A. Klivans (2017) Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010. Cited by: §I.
  • [13] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §I.
  • [14] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel (2017)