# Trust but Verify: An Information-Theoretic Explanation for the Adversarial Fragility of Machine Learning Systems, and a General Defense against Adversarial Attacks

Deep-learning based classification algorithms have been shown to be susceptible to adversarial attacks: minor changes to the input of classifiers can dramatically change their outputs, while being imperceptible to humans. In this paper, we present a simple hypothesis about a feature compression property of artificial intelligence (AI) classifiers and present theoretical arguments to show that this hypothesis successfully accounts for the observed fragility of AI classifiers to small adversarial perturbations. Drawing on ideas from information and coding theory, we propose a general class of defenses for detecting classifier errors caused by abnormally small input perturbations. We further show theoretical guarantees for the performance of this detection method. We present experimental results with (a) a voice recognition system, and (b) a digit recognition system using the MNIST database, to demonstrate the effectiveness of the proposed defense methods. The ideas in this paper are motivated by a simple analogy between AI classifiers and the standard Shannon model of a communication system.

## 1 Introduction

Recent advances in machine learning have led to the invention of complex classification systems that are very successful in detecting features in datasets such as images, hand-written texts, or audio recordings. However, recent works have also discovered what appears to be a universal property of AI classifiers: vulnerability to small adversarial perturbations. Specifically, we know that it is possible to design “adversarial attacks” that manipulate the output of AI classifiers arbitrarily by making small carefully-chosen modifications to the input. Many such successful attacks only require imperceptibly small perturbations of the inputs, which makes these attacks almost undetectable. Thus AI classifiers exhibit two seemingly contradictory properties: (a) high classification accuracy even in very noisy conditions, and (b) high sensitivity to very small adversarial perturbations. In this paper, we will use the term “adversarial fragility” to refer to this property (b).

The importance of the adversarial fragility problem is widely recognized in the AI community, and there now exists a vast and growing literature studying this property; see e.g. [1, 55, 25] for comprehensive surveys. These studies, however, have not yet resulted in a consensus on two important questions: (a) a theoretical explanation for adversarial fragility, and (b) a general and systematic defense against adversarial attacks. Instead, we currently have multiple competing theoretical explanations, multiple defense strategies based on both theoretical and heuristic ideas, and many methods for generating adversarial examples for AI classifiers. Theoretical hypotheses from the literature include (a) quasi-linearity/smoothness of the decision function in AI classifiers [22], (b) high curvature of the decision boundary [20], and (c) closeness of the classification boundary to the data sub-manifold [49]. Defenses against adversarial attacks have also evolved from early methods using gradient masking [38] to more sophisticated recent methods such as adversarial training, where an AI system is specifically subjected to adversarial attacks as part of its training process [50, 40]. These new defenses in turn motivate the development of more sophisticated attacks [8] in an ongoing arms race.

In this paper, we show that adversarial fragility is an unavoidable consequence of a simple “feature compression” hypothesis about AI classifiers. This hypothesis is illustrated in Fig. 3: we assume that the output of AI classifiers is a function of a highly compressed version of the input. More precisely, we assume that the output of AI classifiers is a function of an intermediate set of variables of much smaller dimension than the input. The intuition behind this hypothesis is as follows. AI classifiers typically take high-dimensional inputs, e.g. image pixels or audio samples, and produce a discrete label as output. The input signals (a) contain a great deal of redundancy, and (b) depend on a large number of irrelevant variables that are unrelated to the output labels. Efficient classifiers, therefore, often remove a large amount of redundant and/or irrelevant information from the inputs before making a classification decision. We show in this paper that adversarial fragility is an immediate and necessary consequence of this “feature compression” property.

Certain types of AI systems can be shown to satisfy the feature compression property simply as a consequence of their structure. For instance, AI classifiers for the MNIST dataset [29] typically feature a final layer in the neural network architecture that consists of a softmax over a 10-dimensional real-valued vector corresponding to the 10 different label values; this amounts to a substantial dimension reduction from the 784-dimensional ($28 \times 28$) pixel vector at the input. More generally, there is some empirical evidence showing that AI classifiers actively compress their inputs during their training process [46]. Under such compression, the distance between two data samples in the high-dimensional input space is typically much larger than the distance between their low-dimensional representations. This can allow small perturbations in the high-dimensional space to cause low-dimensional representations in one decision region to move to another decision region. Please see Figure 1 for an illustration.

Our proposed explanation of adversarial fragility also immediately leads to an obvious and very powerful defense: if we enhance a classifier with a generative model that at least partially “decompresses” the classifier’s output, and compare it with the raw input signal, it becomes easy to check when adversarial attacks produce classifier outputs that are inconsistent with their inputs. Interestingly, while our theory is novel, other researchers have recently developed defenses for AI classifiers against adversarial attacks that are consistent with our proposed approach [21, 44].

### 1.1 Related Works

Ever since Szegedy et al. pointed out the vulnerability of deep learning models in [48], the community has witnessed a large volume of work on this topic, from the angle of either attackers or defenders. From the attackers’ side, various types of attack methods have been proposed for different scenarios, ranging from white-box attacks, where the attackers know everything about the deep learning system such as its structure and weights, hyperparameters and training data [48, 22, 42, 28, 36, 39, 10, 13], to black-box attacks, where the attackers know nothing about the system structure or parameters and only have query access to the system [38, 6, 13, 16, 23, 35]. Although the existence of adversarial samples was originally pointed out in image classification tasks, the methods for generating adversarial samples have since been applied to various applications such as text classification [30], object detection [52, 58], speech recognition [11], and autonomous driving [5].

From the defenders’ side, recently proposed methods for improving the safety of deep learning systems include [40, 9, 26, 34, 50, 4, 21, 27, 43, 45, 53, 56, 2, 3, 12, 18, 19, 31, 32, 37, 47, 57]. Most of these methods fall broadly into the following classes: (1) adversarial training, where adversarial samples are used for retraining the deep learning systems [48, 22, 50, 45, 3]; (2) gradient masking, where the deep learning system is designed to have an extremely flat loss function landscape with respect to perturbations of the input samples [40, 4]; (3) feature discretization, where the features of samples (both benign and adversarial) are discretized before they are fed to the deep learning systems [37, 57]; (4) generative model based approaches, where we find a sample from the distribution of benign samples to approximate an arbitrary given sample, and then use the approximation as input to the deep learning systems [26, 47, 21, 43, 31, 32, 18].

The vulnerability of deep learning systems and its ubiquity have raised security concerns about such systems, and the community has been attempting to explain the vulnerability phenomenon [48, 22, 36, 33, 20, 49, 17, 41, 54], either informally or rigorously. In [48], Szegedy et al. argued that adversarial samples are low-probability elements within the whole sample space, and are therefore much less likely than benign samples to be drawn into a training or testing data set. As a result, deep learning classifiers cannot learn these adversarial samples and can easily make wrong decisions on them. Moreover, since these low-probability samples are scattered around the training or testing samples, the samples in a training or testing data set can be slightly perturbed to obtain adversarial samples. In [22], Goodfellow et al. proposed a linearity argument for explaining the existence of adversarial samples, which motivated them to develop the fast gradient sign method (FGSM) for generating adversarial samples. Later on, some first attempts from the theoretical side were made in [49, 20, 41]. A boundary tilting argument was proposed by Tanay et al. in [49] to explain the fragility of linear classifiers, and they established conditions under which the fragility of classifiers can be avoided. In [20], Fawzi et al. investigated the adversarial attack problem by analyzing the curvature properties of the classifiers’ decision boundary. The most recent work on explaining the vulnerability of deep learning classifiers was done by Romano et al. in [41], where they assumed a sparse representation model for the input of a deep learning classifier.

In this paper, based on the feature compression properties of deep learning systems, we propose a new rigorous theoretical understanding of the adversarial phenomena. Our explanation is distinct from previous work. Compared with [48, 22], which are empirical, our results are more rigorous. The results in [49] are applicable to linear classifiers, while our explanation holds for both linear and nonlinear classifiers. In [20], the authors exploited a curvature condition on the decision boundary of the classifiers, while we only utilize the fact that the classifiers compress high-dimensional inputs to low-dimensional latent codes before they make any decisions. Our results are also different from [41], which required the inputs to satisfy a sparse representation model, while we do not need this assumption. Our theoretical explanation applies to both targeted and untargeted attacks, and is based on a very intuitive and ubiquitous assumption, i.e., the feature compression property.

In [21], the authors used class-dependent image reconstructions based on capsule networks to detect the presence of adversarial attacks. The method in [21] is similar in spirit to our work: both approaches try to “decompress” from the classifier output (or from outputs of hidden layers) to reconstruct higher-dimensional signals, in order to detect whether an adversarial attack is present. Compared with [21], our trust-but-verify framework is inspired by information and coding theory, and comes with theoretical performance guarantees. After we independently worked on experiments with our trust-but-verify methods for the MNIST dataset, we learned of the work [44], which proposed an optimization-based image reconstruction approach via generative models to perform robust classification for the MNIST dataset. The approach in [44] is close to one of our trust-but-verify approaches (see Section 4.2.2) for the MNIST dataset. Compared with [44], this paper has several differences: a) the trust-but-verify approaches were inspired by information and coding theory and come with corresponding theoretical performance guarantees; b) the trust-but-verify approaches that are based on optimization can be more general, and can be used to reconstruct functions of the higher-dimensional signals rather than the full signals themselves (please see Section 2.1); c) the trust-but-verify approach is more computationally efficient than the method of [44], since it does not require solving an optimization problem for every class (10 optimization problems for MNIST); and d) the trust-but-verify approaches do not have to solve optimization problems to perform signal reconstruction, as with, for example, the pixel regeneration network (Section 4.2.1) for MNIST.

Notations: Within this paper, we denote the set $\{1, 2, \ldots, N\}$ by $[N]$, and the cardinality of a set $S$ by $|S|$. For a vector $x \in \mathbb{R}^N$, we use $x_S$ to refer to the sub-vector of $x$ with entries specified by the set $S$.

## 2 Problem Statement

An AI classifier can be defined as a system that takes a high-dimensional vector as input and maps it to a discrete set of labels. As an example, a voice-recognition AI takes as input a time series containing the samples of an audio signal and outputs a string representing a sentence in English (or other spoken language). More concretely, consider Fig. 2 which explores a simple analogy between an AI classification system and a digital communication system.

The purpose of the AI system in Fig. 2 is to estimate the state of the world $u \in \mathcal{U}$, where the set $\mathcal{U}$ of all possible world states is assumed to be finite and is enumerated as $\mathcal{U} = \{u_1, u_2, \ldots, u_{|\mathcal{U}|}\}$, where $|\mathcal{U}|$ is the size of $\mathcal{U}$. The input to the AI classifier is a noisy version $y$ of the signal $x$, where $x$ depends on $u$ and on a number of extraneous parameters $v$. Note that the state $u_i$ is uniquely determined by its index or “label” $i$. The output of the AI classifier is a state estimate $\hat{u}$, or equivalently, its label.

The AI classifier in Fig. 2 is clearly analogous to a communication decoder: it looks at a set of noisy observations and attempts to decide which out of a set of possible input signals was originally “transmitted” over the “channel”, which in the AI system models all signal impairments such as distortion, random noise and hostile attackers.

The “Signal Synthesis” block in the AI system maps input features into an observable signal $x = f(u, v)$. In the abstract model of Fig. 2, the synthesis function $f$ is deterministic, with all random effects absorbed into the “channel” without loss of generality. Note that while the encoder in the communication system is under the control of its designers, the signal synthesis in an AI system is determined by physical laws and is not in our control. However, the most important difference between communication and AI systems is the presence of the nuisance parameters $v$. For instance, in a voice recognition system, the input features consist of the text being spoken ($u$) and also a very large number of other characteristics ($v$) of the speaker’s voice such as pitch, accent, dialect, loudness, emotion etc., which together determine the mapping from a text to an audio signal. Thus there are a very large number of different “codewords” $f(u_i, v)$ that encode the same label $i$. Let us define the “codeword set” for label $i$:

$$\mathcal{X}_i \doteq \{c \in \mathbb{R}^N : \exists v,\ c = f(u_i, v)\}. \tag{1}$$

We assume that the codeword sets satisfy

$$\min_{i, j,\ i \neq j}\ \min_{c_i \in \mathcal{X}_i,\ c_j \in \mathcal{X}_j} \|c_i - c_j\| \ge 2 r_0 \tag{2}$$

for some $r_0 > 0$, where $\|\cdot\|$ represents the $\ell_2$ norm. In other words, all valid codewords corresponding to different labels are separated by at least a distance $2 r_0$. In the voice recognition example, under this assumption audio signals corresponding to two different sentences must sound different. This guarantees the existence of the ideal classifier, defined as the function that maps every valid codeword $f(u_i, v)$ to its label $i$. By definition, the ideal classifier maps any valid input signal to the correct label in the absence of noise.

Fig. 3 shows an abstract model of a classifier that is constrained to make final classification decisions based on only a compressed version $h(y)$ of $y$. Specifically, we assume that there exists a compression function $h: \mathbb{R}^N \to \mathbb{R}^M$, where $M \ll N$, such that the classifier output can be written as $g(h(y))$, where $g$ is a decision function. We define the “compressed codeword sets” as

$$\mathcal{Z}_i \doteq \{z \in \mathbb{R}^M : \exists v,\ h(f(u_i, v)) = z\}.$$

We will assume that the sets $\mathcal{Z}_i$ are disjoint, so that the compression map $h$ preserves the information in $y$ about the label $i$.

We will show that a classifier constrained to use only $h(y)$ for decoding, even if designed optimally, can retain its robustness to random noise, but is necessarily vulnerable to adversarial attacks that are significantly smaller in magnitude. By contrast, uncompressed classifiers can be robust to both random and worst-case noise. In other words, we show that adversarial fragility can be explained as an artifact of feature compression in decoders. We will detail our analysis in Section 3.

### 2.1 Trust but Verify: an Information Theory Inspired Defense against Adversarial Attacks

We propose a general class of defenses, inspired by methods in information and coding theory, to detect the presence of adversarial attacks. To motivate our proposed defense methods, consider a communication system where a decoder decodes the received signal $y$ to message label $j$. To test the correctness of the decoded label $j$, the receiver can “decompress” label $j$ by re-encoding it to its corresponding codeword $c_j$, and check whether the input $y$ and the output $j$ are consistent under the communication channel model. For example, in decoding using typical sequences [14], the decoder checks whether the codeword $c_j$ is jointly typical with the received signal $y$, namely whether the pair $(c_j, y)$ follows the typical statistical behavior of the channel model. Sphere decoding algorithms for multiple-input multiple-output (MIMO) wireless communications check the statistics of the residual noise (specifically, whether the noise vector is bounded within a sphere) to provide a maximum-likelihood certificate for the decoded result (or the decoded label) [51, 24].

Similarly for AI classifiers, we propose to check whether the classification result, label $j$, is consistent with the input signal $y$, in order to detect adversarial attacks. Generally, we compute a consistency score between label $j$ and $y$: the lower the score, the more consistent $y$ and label $j$ are with each other. Specifically, suppose the classifier outputs label $j$ for input signal $y$. Define $c_j(y)$ as

$$c_j(y) \doteq \arg\min_{c \in \mathcal{X}_j} \|y - c\|, \tag{3}$$

and define

$$d_j(y) \doteq \|y - c_j(y)\|. \tag{4}$$

If we observe that $d_j(y)$ (namely the consistency score) is abnormally large, this means that the observed signal $y$ is far from any valid codeword with label $j$, and we conclude that label $j$ is inconsistent with the observation $y$. This, however, requires a feasible method for calculating $c_j(y)$ for a label $j$ and signal $y$. When, for a label $j$, there is one unique codeword corresponding to $j$, we can easily evaluate (4) and thus determine whether label $j$ is consistent with input $y$. However, as noted earlier, in AI classification problems a label does not uniquely correspond to a single codeword; instead there is a large codeword set $\mathcal{X}_j$ associated with each label $j$, corresponding to different values of the nuisance parameters $v$. In this case, evaluating (4) needs a conditional generative model mapping label $j$ to all the possible corresponding codewords in $\mathcal{X}_j$, over which we perform the optimization (3) to obtain (4). Under mild assumptions on the encoding function $f$, we can provide theoretical guarantees for a detector assuming a well-functioning generative model.
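The check in (3) and (4) can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption: the codeword set $\mathcal{X}_j$ is replaced by a finite array of sampled codewords (for instance, samples drawn from a conditional generative model). The function names `consistency_score` and `is_consistent`, the threshold, and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def consistency_score(y, codewords_j):
    """Return (d_j(y), c_j(y)) over a finite sample of the codeword set X_j.

    codewords_j has shape (K, N), one candidate codeword per row. Using a
    finite sample is an assumption made for this sketch only.
    """
    dists = np.linalg.norm(codewords_j - y, axis=1)  # ||y - c|| for every c
    k = int(np.argmin(dists))                        # minimizer of (3)
    return float(dists[k]), codewords_j[k]           # value of (4), c_j(y)

def is_consistent(y, codewords_j, threshold):
    """Accept label j only if y is within `threshold` of some codeword."""
    d, _ = consistency_score(y, codewords_j)
    return d <= threshold

# Toy usage: two sampled codewords for label j in R^3.
X_j = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
y_near = np.array([0.9, 1.0, 1.1])   # close to the second codeword
y_far = np.array([5.0, 5.0, 5.0])    # far from both codewords
print(is_consistent(y_near, X_j, 0.5), is_consistent(y_far, X_j, 0.5))
```

In practice the threshold would have to be calibrated against the noise level that benign inputs exhibit; the value used here is arbitrary.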

Solving (3), however, can be computationally expensive since there can be a vast number of codewords corresponding to label $j$. To reduce this computational cost, we further propose a more general, and sometimes more computationally efficient, approach for checking the consistency between label $j$ and input signal $y$. We consider two functions $p(\cdot)$ and $ta(\cdot)$, each mapping $\mathbb{R}^N$ to a real vector space whose dimension is a positive integer. Our approach checks the consistency between label $j$, $p(y)$ and $ta(y)$. Here $j$ and $p(y)$ serve as prior information for the codewords, and, conditioning on them, we try to “predict” $ta(y)$.

We compute a consistency score between label $j$, $p(y)$ and $ta(y)$: the lower the score, the more consistent label $j$, $p(y)$ and $ta(y)$ are with each other. One example of such a score is given by the following optimization problem. We define $c_j(y, p(y), ta(y))$ as

$$c_j(y, p(y), ta(y)) \doteq \arg\min_{c \in \mathcal{X}_j,\ p(c) \in \mathcal{N}(p(y))} \|ta(y) - ta(c)\|, \tag{5}$$

where $\mathcal{N}(p(y))$ means a neighborhood of $p(y)$. We further define

$$d_j(y, p(y), ta(y)) \doteq \|ta(y) - ta(c_j(y, p(y), ta(y)))\|. \tag{6}$$

Similarly, if we observe that $d_j(y, p(y), ta(y))$ (namely the consistency score) is abnormally large, the observed signal $y$ is far from any valid codeword with label $j$, and we conclude that label $j$ is inconsistent with the observation $y$. Compared with (4), the upshot of this approach is that there can be a unique codeword, or a much smaller set of codewords $c$, satisfying $p(c)$ being in the neighborhood of $p(y)$. Namely, assuming label $j$ is correct, there is often a sufficiently accurate prediction of $ta(y)$ based on $p(y)$. Suppose that the original signal $x$ belongs to label $i$. Then we would pick functions $p(\cdot)$ and $ta(\cdot)$ such that, for different labels $i$ and $j$,

$$\min_{i, j,\ i \neq j}\ \min_{c_i \in \mathcal{X}_i,\ c_j \in \mathcal{X}_j,\ p(c_j) \in \mathcal{N}(p(c_i))} \|ta(c_i) - ta(c_j)\| \ge 2 r_1, \tag{7}$$

where $r_1$ is a constant. The criterion (7) means that, even though a classifier can be fooled into classifying $y$ to label $j$, a prediction of $ta(y)$ conditioned on $p(y)$ and label $j$ will be dramatically different from $ta(y)$, thus leading to the detection of the adversarial attack.
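The conditional check in (5) and (6) can be illustrated in the same style. The sketch below is hypothetical: it again assumes a finite sample of the codeword set, takes the neighborhood to be an $\ell_2$ ball of a chosen radius, and uses two made-up coordinate-selection functions `p_fn` and `ta_fn` (the first two entries are "observed", the rest are "predicted"). None of these choices come from the paper.

```python
import numpy as np

def p_fn(v):
    """Hypothetical p(.): the part of the signal used as prior information."""
    return v[:2]

def ta_fn(v):
    """Hypothetical ta(.): the part of the signal we try to predict."""
    return v[2:]

def conditional_score(y, codewords_j, radius):
    """Score (6) over a finite sample of X_j.

    Only codewords c with ||p(c) - p(y)|| <= radius are admissible, which is
    one possible reading of "p(c) in N(p(y))". Returns inf if none qualifies.
    """
    best = np.inf
    for c in codewords_j:
        if np.linalg.norm(p_fn(c) - p_fn(y)) <= radius:
            best = min(best, float(np.linalg.norm(ta_fn(y) - ta_fn(c))))
    return best

# Toy usage: three sampled codewords for label j in R^4.
X_j = np.array([[0.0, 0.0, 0.0, 0.0],
                [1.0, 1.0, 1.0, 1.0],
                [1.0, 1.0, 5.0, 5.0]])
y_ok = np.array([1.0, 1.0, 1.2, 1.0])    # prior part matches rows 1 and 2
y_bad = np.array([3.0, 3.0, 0.0, 0.0])   # prior part matches no codeword
print(conditional_score(y_ok, X_j, 0.1), conditional_score(y_bad, X_j, 0.1))
```

Because the neighborhood constraint filters the candidates first, the inner minimization runs over far fewer codewords than (3), which is the efficiency argument made in the text.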

## 3 Theoretical Analysis

In this section, we perform a theoretical analysis of the effects of adversarial attacks and random noise on AI classifiers. We assume that, for a signal $y$, an ideal classifier will classify $y$ to label $i$ if there exists a codeword $c = f(u_i, v)$ for label $i$ and a certain $v$ such that $\|y - c\| < r$, where $r$ is a constant.

Without loss of generality, we consider a signal $x$ which an ideal classifier will classify to label $1$. We further assume that the closest codeword to $x$ is $c_1 = f(u_1, v)$ for label $1$ and a certain $v$. For any $i \neq 1$, we also define $c_i$ as the codeword with label $i$ that is closest to $x$:

$$c_i \doteq \arg\min_{c \in \mathcal{X}_i} \|x - c\|.$$

We define the sets $S_1$ and $S_i$ as the spheres of radius $r$ around $c_1$ and $c_i$ respectively, namely

$$S_1 \doteq \{b \in \mathbb{R}^N : \|b - c_1\| < r\}$$

and

$$S_i \doteq \{b \in \mathbb{R}^N : \|b - c_i\| < r\}.$$

We assume that $x \in S_1$. For simplicity of analysis, we assume that, for a vector $y \in \mathbb{R}^N$, the classifier outputs label $i$ if and only if $h(y) = h(t)$ for a certain $t \in S_i$.

We consider the problem of finding the smallest targeted perturbation $w$ in magnitude which fools the decoder into outputting label $i \neq 1$. Formally, for any $i \neq 1$, we define the minimum perturbation size $d_i(x)$ needed for target label $i$ as:

$$d_i(x) \doteq \min_{w \in \mathbb{R}^N,\ g(h(x+w)) = i} \|w\|. \tag{8}$$

Let us define a quantity $d(x, t)$, which we term the “effective distance between $x$ and $t$ with respect to function $h$”, as

$$d(x, t) = \min_{w \in \mathbb{R}^N,\ h(x+w) = h(t)} \|w\|.$$

Then for any vector $t \in S_i$, a perturbation with $h(x + w) = h(t)$ makes the classifier output label $i$, so we can use (8) to upper bound the smallest required perturbation size:

$$d_i(x) \le \min_{t \in S_i} d(x, t).$$

For an $\epsilon > 0$ and $l > 0$, we say a classifier has $(\epsilon, l)$-robustness at signal $x$ if

$$P\big(g(h(x+w)) = g(h(x))\big) \ge 1 - \epsilon,$$

where $w$ is sampled uniformly at random on a sphere of radius $l$ (defined for some given norm, which we take to be the $\ell_2$ norm throughout this paper), and $P$ means probability. In the following, we will show that for a small $\epsilon$, compressed classifiers can still have $(\epsilon, l)$-robustness for $l$ much larger than $d_i(x)$, namely the classifier can tolerate large random perturbations while being vulnerable to much smaller adversarial attacks.
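The robustness definition above lends itself to a simple Monte Carlo estimate. The sketch below is illustrative only: `estimate_robustness` and `toy_classifier` are hypothetical names, and the toy classifier (the sign of the first coordinate) merely stands in for the composition $g(h(\cdot))$. Perturbations are drawn uniformly on the $\ell_2$ sphere by normalizing Gaussian vectors.

```python
import numpy as np

def estimate_robustness(classify, x, l, n_trials=2000, seed=0):
    """Estimate P(classify(x + w) == classify(x)) for w uniform on the
    l2 sphere of radius l. A classifier has (eps, l)-robustness at x when
    this probability is at least 1 - eps."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_trials, x.size))
    w = l * g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on sphere
    base = classify(x)
    unchanged = sum(classify(x + wi) == base for wi in w)
    return unchanged / n_trials

def toy_classifier(v):
    """Hypothetical stand-in for g(h(.)): depends on one coordinate only."""
    return int(v[0] > 0)

x = np.zeros(100)
x[0] = 1.0
print(estimate_robustness(toy_classifier, x, l=0.5))
print(estimate_robustness(toy_classifier, x, l=3.0))
```

For this toy classifier an adversary only needs a perturbation of norm 1 along the first coordinate to flip the label, while random perturbations of norm 3 in 100 dimensions rarely do so, which mirrors the gap the section goes on to quantify.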

### 3.1 Classifiers with Linear Compression Functions

We first consider the special case where the compression function $h$ is linear, namely $h(x) = Ax$ with $A \in \mathbb{R}^{M \times N}$ and $M \ll N$. While this may not be a reasonable model for practical AI systems, the analysis of linear compression functions yields analytical insights that generalize to nonlinear $h$, as we show later.

###### Theorem 1.

Let $x \in \mathbb{R}^N$ be the input to a classifier, which makes decisions based on the compression function $h(x) = Ax$, where the elements of $A \in \mathbb{R}^{M \times N}$ ($M \ll N$) are i.i.d. following the standard Gaussian distribution $\mathcal{N}(0, 1)$. Let $A S_i \doteq \{Ab : b \in S_i\}$ be the compressed image of $S_i$. Then the following statements hold for arbitrary $\epsilon > 0$, $i \neq 1$, and a big enough $M$.

1) With high probability (over the distribution of $A$), an attacker can design a targeted adversarial attack $w$ with

$$\|w\|_2 \le \sqrt{1+\epsilon}\, \sqrt{\frac{M}{N}}\, \|c_i - x\|_2 - r$$

such that the classifier is fooled into classifying the signal $x + w$ into label $i$. Moreover, with high probability (over the distribution of $A$), an attacker can design an (untargeted) adversarial perturbation $w$ with

$$\|w\|_2 \le r - \sqrt{1-\epsilon}\, \sqrt{\frac{M}{N}}\, \|x - c_1\|_2$$

such that the classifier will not classify $x + w$ into label $1$.

2) Suppose that $w$ is sampled uniformly at random from a sphere of radius $l$ in $\mathbb{R}^N$. With high probability (over the distribution of $A$ and $w$), if

$$l < \sqrt{\frac{1-\epsilon}{1+\epsilon}}\, \|c_i - x\|_2 - \frac{r}{\sqrt{1+\epsilon}\, \sqrt{M/N}},$$

the classifier will not classify $x + w$ into label $i$. Moreover, with high probability (over the distribution of $A$ and $w$), if

$$l < (1-\epsilon) \sqrt{\frac{N}{M}} \sqrt{r^2 - \frac{M}{N} \|x - c_1\|_2^2},$$

the classifier still classifies $x + w$ into label $1$ correctly.

3) Let $w$ represent a successful adversarial perturbation, i.e. the classifier outputs target label $i \neq 1$ for the input $x + w$. Then as long as

$$\|w\|_2 < \|c_i - x\|_2 - r,$$

our adversarial detection approach will be able to detect the attack.

###### Proof.

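1) We first look at the targeted attack case. For linear decision statistics,

$$d(x, t) = \min_{w \in \mathbb{R}^N,\ A(x+w) = A(t)} \|w\|.$$

By solving this optimization problem, we know the optimal $w$ is given by $w = A^{\dagger} A (t - x)$, where $A^{\dagger}$ is the Moore-Penrose inverse of $A$. We can see that $w$ is nothing but the projection of $t - x$ onto the row space of $A$. We denote the projection matrix as $P = A^{\dagger} A$. Then the smallest magnitude of an effective adversarial perturbation is upper bounded by

$$\min_{t \in S_i} d(x, t) = \min_{t \in S_i} \|A^{\dagger} A (t - x)\| = \min_{t \in S_i} \|P(t - x)\|.$$

For $t \in S_i$, we have

$$\|P(t - x)\| = \|P(c_i - x) + P(t - c_i)\| \ge \|P(c_i - x)\| - \|P(t - c_i)\|.$$

One can show that, when $\|P(c_i - x)\| \ge r$, we can always achieve the equality, namely

$$\min_{t \in S_i} \|P(t - x)\| = \|P(c_i - x)\| - r.$$

Now we evaluate $\|P(c_i - x)\|$. Suppose that the elements of $A$ are i.i.d. and follow the standard zero-mean Gaussian distribution $\mathcal{N}(0, 1)$; then the random projection $P$ is uniformly sampled from the Grassmannian of $M$-dimensional subspaces of $\mathbb{R}^N$. We can see that the distribution of $\|P(c_i - x)\|$ is the same as the distribution of the magnitude of the first $M$ elements of $\|c_i - x\|\, g / \|g\|$, where $g \in \mathbb{R}^N$ is a vector with its elements being i.i.d. following the standard Gaussian distribution $\mathcal{N}(0, 1)$. From the concentration of measure [15], for any positive $\epsilon < 1$,

$$P\left(\|P(c_i - x)\| \le \sqrt{1-\epsilon}\, \|c_i - x\| \sqrt{\frac{M}{N}}\right) \le e^{-\frac{M \epsilon^2}{4}}, \qquad P\left(\|P(c_i - x)\| \ge \sqrt{1+\epsilon}\, \|c_i - x\| \sqrt{\frac{M}{N}}\right) \le e^{-\frac{M \epsilon^2}{12}}.$$

Then when $M$ is big enough,

$$\min_{t \in S_i} \|P(t - x)\| \le \sqrt{1+\epsilon}\, \sqrt{\frac{M}{N}}\, \|c_i - x\| - r$$

with high probability, for arbitrary $\epsilon > 0$.

Now let us look at what perturbation $w$ we need such that $A(x + w)$ is not in $A S_1$. One can show that $A(x + w)$ is outside $A S_1$ if and only if $\|P(x + w - c_1)\| \ge r$. Then by the triangle inequality, the attacker can take an attack $w$ with $\|w\| = r - \|P(x - c_1)\|$, which is no bigger than $r - \sqrt{1-\epsilon}\, \sqrt{M/N}\, \|x - c_1\|$ with high probability, for arbitrary $\epsilon > 0$ and big enough $M$.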
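The targeted attack constructed in part 1 of the proof can be checked numerically. The following sketch is illustrative only: the dimensions, the radius, and the random stand-ins for the clean signal and the target codeword are arbitrary choices, not values from the paper. It builds the minimum-norm perturbation with the Moore-Penrose inverse and compares its size with the $\sqrt{M/N}\,\|c_i - x\| - r$ scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, r = 2000, 20, 0.1
A = rng.standard_normal((M, N))      # linear compression h(x) = A x
A_pinv = np.linalg.pinv(A)           # Moore-Penrose inverse
x = rng.standard_normal(N)           # stand-in for the clean signal
c_i = rng.standard_normal(N)         # stand-in for the closest target codeword

# Projection of (c_i - x) onto the row space of A: P v = A^+ A v.
v = c_i - x
Pv = A_pinv @ (A @ v)

# Pick the point t at distance r from c_i (on the boundary of the ball)
# whose projected offset from x is smallest.
t = c_i - r * Pv / np.linalg.norm(Pv)

# Minimum-norm perturbation with A (x + w) = A t.
w = A_pinv @ (A @ (t - x))

attack_norm = np.linalg.norm(w)
predicted = np.sqrt(M / N) * np.linalg.norm(v) - r
print(attack_norm, predicted, np.linalg.norm(v))
```

With $M/N = 0.01$, the perturbation that moves the compressed representation onto that of $t$ is roughly ten times smaller than the distance $\|c_i - x\|$ in the input space.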

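2) The perturbed signal $x + w$ will not fool the classifier into label $i$ if and only if $h(x + w) \neq h(t)$ for all $t \in S_i$. If $h(x) = Ax$, “$h(x + w) \neq h(t)$, $\forall t \in S_i$” is equivalent to “$A(x + w - t) \neq 0$, $\forall t \in S_i$”, which is in turn equivalent to “$\|P(x + w - t)\| > 0$, $\forall t \in S_i$”, where $P$ is the projection onto the row space of $A$. Assuming that $w$ is sampled uniformly at random from a sphere in $\mathbb{R}^N$ of radius $l$, then

$$\|P(x + w - t)\| = \|P(c_i - x) + P(t - c_i) - Pw\| \ge \|P(c_i - x)\| - \|P(t - c_i)\| - \|Pw\|.$$

From the concentration inequality,

$$P\left(\|Pw\| \ge \sqrt{1+\epsilon}\, \|w\| \sqrt{\frac{M}{N}}\right) \le e^{-\frac{M(\epsilon^2/2 - \epsilon^3/3)}{2}}.$$

Thus if $M$ is big enough, with high probability,

$$\|P(x + w - t)\| \ge \sqrt{1-\epsilon}\, \|c_i - x\|_2 \sqrt{\frac{M}{N}} - r - \sqrt{1+\epsilon}\, \|w\| \sqrt{\frac{M}{N}}.$$

If

$$\|w\| < \sqrt{\frac{1-\epsilon}{1+\epsilon}}\, \|c_i - x\|_2 - \frac{r}{\sqrt{1+\epsilon}\, \sqrt{M/N}},$$

the right-hand side is positive for every $t \in S_i$, so the classifier will not classify $x + w$ into label $i$.

Now let us look at what magnitude we need for a random perturbation $w$ such that $A(x + w)$ is in $A S_1$ with high probability. We know $A(x + w)$ is in $A S_1$ if and only if $\|P(x + w - c_1)\| < r$. Through a large deviation analysis, one can show that, for any $\epsilon > 0$ and big enough $M$, $\|P(x + w - c_1)\|$ is smaller than $(1+\epsilon) \sqrt{M/N} \sqrt{\|x - c_1\|^2 + \|w\|^2}$ and bigger than $(1-\epsilon) \sqrt{M/N} \sqrt{\|x - c_1\|^2 + \|w\|^2}$ with high probability. Thus, for an arbitrary $\epsilon > 0$, if $l < (1-\epsilon) \sqrt{N/M} \sqrt{r^2 - \frac{M}{N} \|x - c_1\|^2}$, then $\|P(x + w - c_1)\| < r$ with high probability, implying that the AI classifier still classifies $x + w$ into label $1$ correctly.

3) Suppose that an AI classifier classifies the input signal $x + w$ into label $i$. We propose to check whether $x + w$ lies within distance $r$ of some codeword in $\mathcal{X}_i$. In our model, the signal $x + w$ is consistent with label $i$ only if $\|x + w - c\| < r$ for some codeword $c \in \mathcal{X}_i$. Let us take any codeword $c \in \mathcal{X}_i$. We show that when $\|w\| < \|c - x\| - r$, we can always detect the adversarial attack if the AI classifier misclassifies $x + w$ to that codeword corresponding to label $i$. In fact, $\|x + w - c\| \ge \|x - c\| - \|w\|$, which is larger than $r$.

We note that $\|w\| < \|c_i - x\| - r$ means $\|w\| < \|c - x\| - r$ for every codeword $c \in \mathcal{X}_i$, since $c_i$ is the codeword in $\mathcal{X}_i$ closest to $x$. This implies that the adversarial attack detection technique can detect that $x + w$ is at more than distance $r$ from every codeword in $\mathcal{X}_i$.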

### 3.2 Nonlinear Decision Statistics in AI Classifiers

In this subsection, we show that an AI classifier using nonlinear compressed decision statistics $h(x)$ is significantly more vulnerable to adversarial attacks than to random perturbations. We will quantify the gap between how much a random perturbation and a well-designed adversarial attack affect $h(x)$.

###### Theorem 2.

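Let us assume that the nonlinear function $h: \mathbb{R}^N \to \mathbb{R}^M$ is differentiable at $x$. For $\epsilon > 0$, we define

$$\alpha(\epsilon) = \max_{\|w\| \le \epsilon} \|h(x + w) - h(x)\|,$$

and

$$\beta(o, \epsilon) = \|h(x + \epsilon o) - h(x)\|,$$

where $o$ is sampled uniformly at random from the unit sphere. Then

$$\lim_{\epsilon \to 0} \frac{\alpha(\epsilon)}{E_o\{\beta(o, \epsilon)\}} \ge \sqrt{\frac{N}{M}},$$

where $E_o$ means expectation over the distribution of $o$. If we assume that the entries of the Jacobian matrix $\nabla h(x) \in \mathbb{R}^{M \times N}$ are i.i.d. following the standard Gaussian distribution $\mathcal{N}(0, 1)$, then, when $N$ is big enough, with high probability,

$$\lim_{\epsilon \to 0} \frac{\alpha(\epsilon)}{E_o\{\beta(o, \epsilon)\}} \ge (1 - \delta) \sqrt{\frac{N + M}{M}}$$

for any $\delta > 0$.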
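The ratio in Theorem 2 can be estimated numerically for a linear map, for which the limit equals the largest singular value of the Jacobian divided by the mean norm of the Jacobian applied to a random unit vector. The sketch below is only an illustration: the sizes and the randomly drawn stand-in Jacobian `J` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 400, 10
J = rng.standard_normal((M, N))   # stand-in for the Jacobian of h at x

# Worst-case gain: the largest singular value of J.
sigma_max = np.linalg.svd(J, compute_uv=False)[0]

# Average gain over directions drawn uniformly from the unit sphere.
g = rng.standard_normal((5000, N))
o = g / np.linalg.norm(g, axis=1, keepdims=True)
mean_random_gain = np.linalg.norm(o @ J.T, axis=1).mean()

ratio = sigma_max / mean_random_gain
print(ratio, np.sqrt(N / M))
```

The worst-case direction moves the compressed features several times further than a typical random direction of the same norm, in line with the $\sqrt{N/M}$ lower bound.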

Before we proceed, we introduce some technical lemmas which are used to establish the gap quantification in Theorem 2.

###### Lemma 3.

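(Section III in [7]) For a random matrix $F \in \mathbb{R}^{M \times N}$ with every entry being an i.i.d. random variable distributed according to the Gaussian distribution $\mathcal{N}(0, 1/M)$, we have

$$P\left(\sigma_{\max}(F) > 1 + \sqrt{\frac{N}{M}} + o(1) + t\right) \le e^{-M t^2 / 2}, \quad \forall t > 0,$$

and

$$P\left(\sigma_{\min}(F) < 1 - \sqrt{\frac{N}{M}} + o(1) - t\right) \le e^{-M t^2 / 2}, \quad \forall t > 0,$$

where $\sigma_{\max}(F)$ is the maximal singular value of $F$, $\sigma_{\min}(F)$ is the smallest singular value of $F$, and $o(1)$ is a small term tending to zero as $M \to \infty$.

From Lemma 3, we see that for a random matrix $F \in \mathbb{R}^{M \times N}$ with all entries i.i.d. distributed according to the standard Gaussian distribution $\mathcal{N}(0, 1)$, the scaled matrix $\frac{1}{\sqrt{N}} F^T$ will satisfy, for all $t > 0$,

$$P\left(\frac{1}{\sqrt{N}} \sigma_{\max}(F) > 1 + \sqrt{\frac{M}{N}} + o(1) + t\right) \le e^{-N t^2 / 2},$$

and

$$P\left(\frac{1}{\sqrt{N}} \sigma_{\min}(F) < 1 - \sqrt{\frac{M}{N}} + o(1) - t\right) \le e^{-N t^2 / 2}, \tag{9}$$

since

$$\sigma_i\left(\frac{1}{\sqrt{N}} F^T\right) = \frac{1}{\sqrt{N}} \sigma_i(F),$$

where $\sigma_i(F)$ is the $i$-th largest singular value of $F$.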
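As a quick sanity check on the scaled singular-value bounds, one can draw a single Gaussian matrix and compare its scaled singular values with $1 \pm \sqrt{M/N}$. This is an illustration only; the sizes and the slack value are arbitrary assumptions, and a single draw does not verify a probabilistic statement.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 2000
F = rng.standard_normal((M, N))   # i.i.d. standard Gaussian entries

# Singular values of F^T / sqrt(N), which equal those of F divided by sqrt(N).
s = np.linalg.svd(F, compute_uv=False) / np.sqrt(N)

lower = 1 - np.sqrt(M / N)
upper = 1 + np.sqrt(M / N)
print(s.min(), lower, s.max(), upper)
```

For these sizes the extreme scaled singular values sit close to the two edges, so the whole spectrum lies in a narrow band around 1.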

###### Proof.

(of Theorem 2) From the Taylor expansion, we know

 h(x+w)=h(x)+∇h(x)w+⎡⎢ ⎢ ⎢⎣o(∥w∥22)⋮o(∥w∥22)⎤⎥ ⎥ ⎥⎦ (10)

and

 h(x+ϵo)=h(x)+ϵ∇h(x)o+⎡⎢ ⎢ ⎢⎣o(ϵ2∥o∥22)⋮o(ϵ2∥o∥22)⎤⎥ ⎥ ⎥⎦, (11)

where and as . Thus

 limϵ→0α(ϵ)Eo{β(o,ϵ)} =σmax(∇h(x))Eo{∥∇h(x)o∥}, (12)

where is the maximal singular value of . Here the random vector is obtained by first sampling each entry i.i.d. from the standard Gaussian distribution, and then normalizing the magnitude, i.e.,

 o=1∥g∥g, (13)

where all entries of are i.i.d. distributed according to the standard Gaussian distribution.

We first consider the deterministic . Let the SVD of be where , and . Then from the convexity of and Jensen’s inequality, we have

 Eo{∥∇h(x)o∥} ≤√Eo{∥∇h(x)o∥2} =√Eo{∥UΣV∗o∥2} =√Eo{∥ΣV∗o∥2} ≤√σ2max(∇h(x))Eo{∥V∗o∥2} =σmax(∇h(x))  ⎷Eo⎧⎨⎩∑Mi=1(g′i)2∑Nj=1(gj)2⎫⎬⎭ =σmax(∇h(x))√MN,

where is a Gaussian random vector after rotating the Gaussian random vector by .

Actually, each element of is which is a standard Gaussian random variable where is the complex conjugate of . We have . We can find a matrix such that is unitary. When acts on a standard Gaussian random vector , we will get

$$g' := [V \; Q]^* g = \begin{bmatrix} g'_1 \\ g'_2 \\ \vdots \\ g'_N \end{bmatrix},$$

and

$$\|g\| = \|g'\|.$$

Then

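$$E_o\left\{\frac{\sum_{i=1}^{M} (g'_i)^2}{\sum_{j=1}^{N} (g_j)^2}\right\} = E_o\left\{\frac{\sum_{i=1}^{M} (g'_i)^2}{\sum_{j=1}^{N} (g'_j)^2}\right\} = \sum_{i=1}^{M} E_o\{r_i\},$$

where $r_i = \dfrac{(g'_i)^2}{\sum_{j=1}^{N} (g'_j)^2}$.

Since $\sum_{i=1}^{N} r_i = 1$, then from the symmetry of the $r_i$, we have

$$E_o\{r_i\} = \frac{1}{N}.$$

This gives

$$E_o\left\{\frac{\sum_{i=1}^{M} (g'_i)^2}{\sum_{j=1}^{N} (g_j)^2}\right\} = \frac{M}{N}.$$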

Thus, combining (3.2), we get

$$\lim_{\epsilon \to 0} \frac{\alpha(\epsilon)}{E_o\{\beta(o, \epsilon)\}} \ge \sqrt{\frac{N}{M}}.$$
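The deterministic case can be checked numerically. The sketch below is only an illustration under stated assumptions: it builds a hypothetical fixed matrix `A` standing in for $\nabla h(x)$ (with $U = I$, arbitrary singular values between 1 and 3, and a random orthonormal $V$), then estimates $E_o\{\|V^* o\|^2\}$ and $E_o\{\|\nabla h(x)\, o\|\}$ by Monte Carlo. All sizes, seeds, and function names are made up for the example.

```python
import numpy as np

def sphere_samples(num_samples, N, rng):
    """Draw vectors o = g / ||g|| with g standard Gaussian in R^N, as in Eq. (13)."""
    g = rng.standard_normal((num_samples, N))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def check_deterministic_case(M=5, N=200, num_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Reduced QR gives an N x M matrix V with orthonormal columns.
    V, _ = np.linalg.qr(rng.standard_normal((N, M)))
    sigma = np.linspace(1.0, 3.0, M)      # assumed singular values, sigma_max = 3
    A = np.diag(sigma) @ V.T              # A = U Sigma V^T with U = I (real case)
    o = sphere_samples(num_samples, N, rng)
    energy_fraction = np.mean(np.sum((o @ V) ** 2, axis=1))  # estimates E ||V^T o||^2
    mean_norm = np.mean(np.linalg.norm(o @ A.T, axis=1))     # estimates E ||A o||
    return energy_fraction, mean_norm, sigma.max()

M, N = 5, 200
energy_fraction, mean_norm, sigma_max = check_deterministic_case(M, N)
print(f"E||V^T o||^2 ~ {energy_fraction:.4f}, M/N = {M / N:.4f}")
print(f"sigma_max / E||A o|| ~ {sigma_max / mean_norm:.2f}, sqrt(N/M) = {np.sqrt(N / M):.2f}")
```

The first printed pair should agree closely, and the estimated ratio should be at least $\sqrt{N/M}$, matching the inequality above.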

We now consider the case where the entries of $\nabla h(x)$ are i.i.d. according to the standard Gaussian distribution $\mathcal{N}(0, 1)$. From Lemma 3, we have with high probability that, for $\delta > 0$,

$$\sigma_{\max}(\nabla h(x)) \ge (1 - \delta)\left(\sqrt{N} + \sqrt{M}\right). \tag{14}$$

Since the Gaussian random vector is rotationally invariant, without loss of generality, we take $o$ as

$$o = [1 \; 0 \; \cdots \; 0]^T \in \mathbb{R}^N. \tag{15}$$

Then $\nabla h(x)\, o$ is a vector with all entries i.i.d. according to the standard Gaussian distribution. From the convexity of $x^2$ and Jensen’s inequality, we have

$$E_{\nabla h(x)}\{\|\nabla h(x)\, o\|\} \le \sqrt{E_{\nabla h(x)}\{\|\nabla h(x)\, o\|^2\}} = \sqrt{M}.$$

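For the norm function $f(x) = \|x\|$, since

$$|f(x) - f(y)| = \big|\, \|x\| - \|y\| \,\big| \le \|x - y\|,$$

the function $f$ is Lipschitz continuous with Lipschitz constant $1$. Then for $\|\nabla h(x)\, o\|$, we have for every $t > 0$,

$$P\left(\big|\, \|\nabla h(x)\, o\| - E_{\nabla h(x)}[\|\nabla h(x)\, o\|] \,\big| \ge t\right) \le 2e^{-t^2/2}.$$

This means that $\|\nabla h(x)\, o\|$ is concentrated around $E_{\nabla h(x)}[\|\nabla h(x)\, o\|]$, which is at most $\sqrt{M}$.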
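The concentration step can be illustrated with a small simulation. With $o$ fixed as in (15), $\nabla h(x)\, o$ is simply the first column of the Gaussian matrix, so it suffices to sample standard Gaussian vectors in $\mathbb{R}^M$. The sketch below assumes `M = 25` and `t = 2` purely for illustration and uses the sample mean in place of the exact expectation.

```python
import numpy as np

def norm_concentration(M=25, t=2.0, num_samples=20000, seed=0):
    """Compare the empirical tail of | ||c|| - mean ||c|| | with the bound 2 exp(-t^2 / 2)."""
    rng = np.random.default_rng(seed)
    # Each row plays the role of grad_h(x) @ o for o = e_1: a standard Gaussian vector in R^M.
    columns = rng.standard_normal((num_samples, M))
    norms = np.linalg.norm(columns, axis=1)
    mean_norm = norms.mean()  # sample mean used as a stand-in for the expectation
    empirical_tail = np.mean(np.abs(norms - mean_norm) >= t)
    bound = 2.0 * np.exp(-t ** 2 / 2.0)
    return mean_norm, empirical_tail, bound

M = 25
mean_norm, empirical_tail, bound = norm_concentration(M=M)
print(f"mean norm ~ {mean_norm:.3f}, sqrt(M) = {np.sqrt(M):.3f}")
print(f"empirical tail = {empirical_tail:.4f}, bound = {bound:.4f}")
```

The sample mean should fall slightly below $\sqrt{M}$, and the empirical tail probability should stay below the Gaussian concentration bound.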

Thus, combining (3.2) and (14), we have with high probability

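$$\lim_{\epsilon \to 0} \frac{\alpha(\epsilon)}{E_o\{\beta(o, \epsilon)\}} \ge (1 - \delta)\frac{\sqrt{N} + \sqrt{M}}{\sqrt{M}} \ge (1 - \delta)\frac{\sqrt{M + N}}{\sqrt{M}} = (1 - \delta)\sqrt{\frac{M + N}{M}}.$$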
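To see the size of this gap for a concrete Gaussian Jacobian, the following sketch estimates the ratio $\sigma_{\max}(\nabla h(x)) / E_o\{\|\nabla h(x)\, o\|\}$ by Monte Carlo and compares it with $\sqrt{(M + N)/M}$. The matrix `G` is a hypothetical stand-in for $\nabla h(x)$, and the sizes `M = 10`, `N = 1000` and the sample count are assumptions chosen only to keep the example fast.

```python
import numpy as np

def gaussian_jacobian_ratio(M=10, N=1000, num_samples=2000, seed=0):
    """Estimate sigma_max(G) / E_o ||G o|| for a standard Gaussian G of shape (M, N)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((M, N))  # hypothetical Gaussian Jacobian
    sigma_max = np.linalg.svd(G, compute_uv=False)[0]  # singular values come sorted, largest first
    g = rng.standard_normal((num_samples, N))
    o = g / np.linalg.norm(g, axis=1, keepdims=True)   # random directions on the unit sphere
    mean_norm = np.linalg.norm(o @ G.T, axis=1).mean() # Monte Carlo estimate of E_o ||G o||
    return sigma_max / mean_norm

M, N = 10, 1000
ratio = gaussian_jacobian_ratio(M, N)
reference = np.sqrt((M + N) / M)
print(f"estimated ratio = {ratio:.2f}, sqrt((M + N) / M) = {reference:.2f}")
```

With $M \ll N$ the estimated ratio is large, which is the regime in which a worst-case perturbation is far more effective than a random one of the same size.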

## 4 Experimental Results

We now present two sets of experimental results to demonstrate the efficacy of our proposed defense against adversarial attacks.

### 4.1 Speech Recognition

Our first set of experiments was based on DeepSpeech (https://github.com/mozilla/DeepSpeech), a popular voice recognition AI classifier. The experimental setup is illustrated in Fig. 4; a visual comparison with the abstract model in Fig. 2 shows how the various functional blocks are implemented.

The experiment (https://github.com/Hui-Xie/AdversarialDefense) consisted of choosing sentences randomly from the classic 19th-century novel “A Tale of Two Cities.” A Linux text-to-speech (T2S) program, Pico2wave, converted each chosen sentence into a female voice wave file. The use of a T2S system for generating the source audio signal (instead of human-spoken audio) effectively allows us to hold all the “irrelevant” variables constant, and thus renders the signal synthesis block in Fig. 2 a deterministic function of just the input label.

Let denote the samples of this source audio signal. This audio signal is played over a PC speaker and recorded by a USB microphone on another PC. Let denote the samples of this recorded wave file. The audio playback and recording was performed in a quiet room with no audible echoes or distortions, so this “channel” can be approximately modeled as a simple AWGN channel: , where is a scalar representing audio signal attenuation and is random background noise. In our experiment, the was approximately dB.

We input $y$ into a voice recognition system, specifically the Mozilla implementation DeepSpeech V0.1.1, which is based on TensorFlow. The 10 sentences used are listed in Table 1. We then used Nicholas Carlini’s adversarial attack Python script (https://nicholas.carlini.com/code) with DeepSpeech (V0.1.1), which relies on gradient back-propagation, to generate a targeted adversarial audio signal: the clean signal plus a small adversarial perturbation that causes the DeepSpeech voice recognition system to predict a completely different sentence. Thus, we have a “clean” audio signal and a “targeted corrupted” adversarial audio signal that upon playback is effectively indistinguishable from the clean one, but successfully fools DeepSpeech into outputting a different target sentence. In our experiment, the power of  over the adversarial perturbation  was approximately  dB.

We then implemented a version of our proposed defense to detect whether the output of DeepSpeech is wrong, whether due to noise or adversarial attacks. For this purpose, we fed the decoded text output of the DeepSpeech system into the same T2S software, Pico2wave, to generate a reconstructed female voice wave file, denoted by $\hat{x}$. We then performed a simple cross-correlation of a portion of the reconstructed signal $\hat{x}$ (representing approximately  reconstruction of the original number of samples in ) with the input signal $y$ to the DeepSpeech classifier:

$$\rho_{\max}(\hat{x}, y) = \max_{m} \left| \sum_{n} \hat{x}[n]\, y[n - m] \right|, \tag{16}$$

where $y[n]$ denotes the $n$-th entry of $y$, and similarly for $\hat{x}$. If $\rho_{\max}(\hat{x}, y)$ is smaller than a threshold (0.4), we declare that the speech recognition classification is wrong. The logic behind this test is as follows. When the input signal is $y$, i.e. the non-adversarially-perturbed signal, DeepSpeech outputs the correct label, which results in $\hat{x}$ matching the source signal $x$. Since $y$ is just a noisy version of $x$, it will be highly correlated with $\hat{x}$. On the other hand, for the adversarially perturbed input, the reconstructed signal $\hat{x}$ is completely different from $x$ and can therefore be expected to be practically uncorrelated with that input.
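The detection rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it does not call Pico2wave or DeepSpeech, and it uses synthetic random signals as stand-ins for the audio. It also assumes the score in (16) is normalized by the energies of both signals so that it lies in $[0, 1]$, which is one way to make the 0.4 threshold meaningful; the text does not state the normalization. The function names are hypothetical.

```python
import numpy as np

def max_abs_cross_correlation(x_hat, y):
    """Max over all lags m of |sum_n x_hat[n] y[n - m]|, divided by ||x_hat|| * ||y|| (assumed normalization)."""
    x_hat = np.asarray(x_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    scale = np.linalg.norm(x_hat) * np.linalg.norm(y)
    if scale == 0.0:
        return 0.0
    corr = np.correlate(x_hat, y, mode="full")  # cross-correlation at every integer lag
    return float(np.max(np.abs(corr)) / scale)

def output_is_suspect(x_hat, y, threshold=0.4):
    """Flag the classifier output as wrong when the reconstruction barely correlates with the input."""
    return max_abs_cross_correlation(x_hat, y) < threshold

# Synthetic stand-ins (assumptions): white noise plays the role of the audio signals.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)                  # source signal
y = np.concatenate([np.zeros(37), 0.5 * x])    # delayed, attenuated copy of x
y = y + 0.05 * rng.standard_normal(y.size)     # plus background noise
x_hat_correct = x                              # reconstruction when the label is correct
x_hat_wrong = rng.standard_normal(2000)        # reconstruction of an unrelated sentence

clean_score = max_abs_cross_correlation(x_hat_correct, y)
attack_score = max_abs_cross_correlation(x_hat_wrong, y)
print(f"correct label score = {clean_score:.3f}, wrong label score = {attack_score:.3f}")
```

Taking the maximum over lags makes the score insensitive to the unknown delay between playback and recording, and the energy normalization makes it insensitive to attenuation.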