An Information-Theoretic Explanation for the Adversarial Fragility of AI Classifiers

01/27/2019 ∙ by Hui Xie, et al. ∙ 0

We present a simple hypothesis about a compression property of artificial intelligence (AI) classifiers and present theoretical arguments to show that this hypothesis successfully accounts for the observed fragility of AI classifiers to small adversarial perturbations. We also propose a new method for detecting when small input perturbations cause classifier errors, and show theoretical guarantees for the performance of this detection method. We present experimental results with a voice recognition system to demonstrate this method. The ideas in this paper are motivated by a simple analogy between AI classifiers and the standard Shannon model of a communication system.






I Introduction

Recent advances in machine learning have led to the invention of complex classification systems that are very successful in detecting features in datasets such as images, hand-written text, or audio. However, recent works have also discovered what appears to be a universal property of AI classifiers: vulnerability to small adversarial perturbations. Specifically, we know that it is possible to design “adversarial attacks” that manipulate the output of AI classifiers arbitrarily by making small, carefully chosen modifications to the input. Many such successful attacks only require imperceptibly small perturbations of the inputs, which makes these attacks almost undetectable. Thus AI classifiers exhibit two seemingly contradictory properties: (a) high classification accuracy even in very noisy conditions, and (b) high sensitivity to very small adversarial perturbations. In this paper, we will use the term “adversarial fragility” to refer to property (b).

The importance of the adversarial fragility problem is widely recognized in the AI community, and there now exists a vast and growing literature studying this property; see e.g. [1] for a comprehensive survey. This work, however, has not yet resulted in a consensus on two important questions: (a) a theoretical explanation for adversarial fragility, and (b) a general and systematic defense against adversarial attacks. Instead, we currently have multiple competing theoretical explanations, multiple defense strategies based on both theoretical and heuristic ideas, and many methods for generating adversarial examples for AI classifiers. Theoretical hypotheses from the literature include (a) quasi-linearity/smoothness of the decision function in AI classifiers [2], (b) high curvature of the decision boundary [3] and (c) closeness of the classification boundary to the data sub-manifold [4]. Defenses against adversarial attacks have also evolved from early methods using gradient masking [5] to more sophisticated recent methods such as adversarial training, where an AI system is specifically subjected to adversarial attacks as part of its training process [6], and defensive distillation [7]. These new defenses in turn motivate the development of more sophisticated attacks [8] in an ongoing arms race.

In this paper, we show that adversarial fragility is an unavoidable consequence of a simple “compression” hypothesis about AI classifiers. This hypothesis is illustrated in Fig. 2: we assume that the output of AI classifiers is a function of a highly compressed version of the input. More precisely, we assume that the output of AI classifiers is a function of an intermediate set of variables of much smaller dimension than the input. The intuition behind this hypothesis is as follows. AI classifiers typically take high-dimensional inputs, e.g. image pixels or audio samples, and produce a discrete label as output. The input signals (a) contain a great deal of redundancy, and (b) depend on a large number of irrelevant variables that are unrelated to the output labels. Efficient classifiers, therefore, must remove a large amount of redundant and/or irrelevant information from the inputs before making a classification decision. Indeed, a classifier that generalizes well must, by definition, be insensitive to as many non-essential input features as possible. We show in this paper that adversarial fragility is an immediate and necessary consequence of this “compression” property.

Certain types of AI systems can be shown to satisfy the compression property simply as a consequence of their structure. For instance, AI classifiers for the MNIST dataset [9] typically feature a final layer in the neural network architecture that consists of a softmax over a 10-dimensional real-valued vector corresponding to the 10 different label values; this amounts to a substantial dimension reduction from the 784-dimensional (28 × 28) pixel vector at the inputs. More generally, there is some empirical evidence showing that AI classifiers actively compress their inputs during their training process [10].

Our proposed explanation of adversarial fragility also immediately leads to an obvious and very powerful defense: if we enhance a classifier with a generative model that at least partially “decompresses” the classifier’s output, and compare it with the raw input signal, it becomes easy to check when adversarial attacks produce classifier outputs that are inconsistent with their inputs. While we present some simple experimental results to validate our theory, our focus here is on the theoretical ideas; the important and challenging problem of designing good generative models to implement the proposed defense for general AI classification systems is deferred to future work. Interestingly, while our theory is novel, other researchers have recently developed defenses for AI classifiers against adversarial attacks that are consistent with our proposed approach [11, 12].

II Problem Statement

Fig. 1: Top: standard abstract model of a communication system; Bottom: abstract model of an AI classifier system.

An AI classifier can be defined as a system that takes a high-dimensional vector as input and maps it to a discrete set of labels. As an example, a voice-recognition AI takes as input a time series containing the samples of an audio signal and outputs a string representing a sentence in English (or other spoken language). More concretely, consider Fig. 1 which explores a simple analogy between an AI classification system and a digital communication system.

The purpose of the AI system in Fig. 1 is to estimate the state of the world $u \in \mathcal{U}$, where the set $\mathcal{U}$ of all possible world states is assumed to be finite and is enumerated as $\mathcal{U} = \{u_1, u_2, \dots, u_K\}$, where $K$ is the size of $\mathcal{U}$. The input to the AI classifier is a noisy version $y \in \mathbb{R}^N$ of a signal $x \in \mathbb{R}^N$, which depends on $u$ and on a number of extraneous parameters $v$. Note that the state $u_i$ is uniquely determined by its index or “label” $i$. The output of the AI classifier is a state estimate $\hat{u}$, or equivalently, its label.

The AI classifier in Fig. 1 is clearly analogous to a communication decoder: it looks at a set of noisy observations and attempts to decide which out of a set of possible input signals was originally “transmitted” over the “channel”, which in the AI system models all signal impairments such as distortion, random noise and hostile attackers.

The “Signal Synthesis” block in the AI system maps input features $(u, v)$ into an observable signal $x = f(u, v)$. In the abstract model of Fig. 1, the synthesis function $f$ is deterministic, with all random effects being absorbed into the “channel” without loss of generality. Note that while the encoder in the communication system is under the control of its designers, the signal synthesis in an AI system is determined by physical laws and is not in our control. However, the most important difference between communication and AI systems is the presence of the nuisance parameters $v$. For instance, in a voice recognition system, the input features consist of the text being spoken ($u$) and also a very large number of other characteristics ($v$) of the speaker’s voice such as pitch, accent, dialect, loudness, emotion etc., which together determine the mapping from a text to an audio signal. Thus there are a very large number of different “codewords” $x = f(u_i, v)$ that encode the same label $i$. Let us define the “codeword set” for label $i$:

$$\mathcal{X}_i = \{ f(u_i, v) : v \in \mathcal{V} \},$$

where $\mathcal{V}$ denotes the set of all possible values of the nuisance parameters.
We assume that the codeword sets satisfy:

$$\min_{i \neq j} \; \min_{x \in \mathcal{X}_i,\, x' \in \mathcal{X}_j} \|x - x'\| \geq d_0$$

for some $d_0 > 0$, where $\|\cdot\|$ represents the $\ell_2$ norm. In other words, all valid codewords corresponding to different labels are separated by at least a distance $d_0$. In the voice recognition example, under this assumption audio signals corresponding to two different sentences must sound different. This guarantees the existence of the ideal classifier, defined as the function $q^*$ that satisfies $q^*(x) = i$ for every $x \in \mathcal{X}_i$. By definition, the ideal classifier maps any valid input signal to the correct label in the absence of noise.
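As a toy illustration of codeword sets, their separation, and an ideal classifier, the following sketch uses two small finite sets in two dimensions. The sets, the labels and the helper names are hypothetical and chosen only for illustration; real codeword sets are far larger and are not available in closed form.

```python
import numpy as np

# Hypothetical toy codeword sets: each label maps to several codewords,
# one per value of the nuisance parameter.
codeword_sets = {
    0: np.array([[0.0, 0.0], [0.0, 1.0]]),
    1: np.array([[5.0, 0.0], [5.0, 1.0]]),
}

def min_separation(sets):
    """Smallest l2 distance between codewords that carry different labels."""
    best = np.inf
    labels = list(sets)
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            xa, xb = sets[labels[a]], sets[labels[b]]
            # Pairwise distances between every codeword in xa and every one in xb.
            d = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=-1)
            best = min(best, d.min())
    return float(best)

def nearest_codeword_label(y, sets):
    """Label of the codeword closest to y (ideal on noise-free codewords)."""
    return min(sets, key=lambda i: np.linalg.norm(sets[i] - y, axis=1).min())

print(min_separation(codeword_sets))
print(nearest_codeword_label(np.array([4.6, 0.8]), codeword_sets))
```

Because the sets are separated by a positive distance, every valid codeword is closer to its own set than to any other, which is what makes the nearest-codeword rule well defined on clean inputs.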

Fig. 2: AI classifier using an information compression process and its analogy with a communication decoder

Fig. 2 shows an abstract model of a classifier that is constrained to make final classification decisions based on only a compressed version of $y$. Specifically, we assume that there exists a compression function $g: \mathbb{R}^N \to \mathbb{R}^M$, where $M \ll N$, such that the classifier output can be written as $h(g(y))$, where $h$ is a decision function. We define the “compressed codeword sets” as $\mathcal{Z}_i = \{ g(x) : x \in \mathcal{X}_i \}$. We will assume that the sets $\mathcal{Z}_i$ are disjoint so that the compression map $g$ preserves the information in $x$ about the label $i$.
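The structure of such a compressed classifier can be sketched in a few lines. This is a minimal illustration under stated assumptions: the compression is taken to be a random linear map, the decision function is nearest compressed codeword, and the dimensions, the codewords and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 10              # input and compressed dimensions (illustrative values)

# Hypothetical codewords, one per label, well separated in the input space.
codewords = rng.normal(size=(3, N)) * 10.0

A = rng.normal(size=(M, N))  # assumed linear compression map

def g(y):
    """Compression function: keeps only M numbers derived from the N-dimensional input."""
    return A @ y

compressed = np.array([g(x) for x in codewords])  # one compressed codeword per label

def h(z):
    """Decision function: label of the nearest compressed codeword."""
    return int(np.argmin(np.linalg.norm(compressed - z, axis=1)))

def classify(y):
    """Compressed classifier: the decision depends on y only through g(y)."""
    return h(g(y))

y = codewords[2] + rng.normal(size=N)   # codeword 2 plus random noise
print(classify(y))
```

The point of the sketch is only the factorization of the classifier into a compression step followed by a decision step; any information about the input that the compression discards is invisible to the decision.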

We will show that a classifier constrained to use only $g(y)$ for decoding, even if designed optimally, can retain its robustness to random noise, but is necessarily vulnerable to adversarial attacks that are significantly smaller in magnitude. By contrast, uncompressed classifiers can be robust to both random and worst-case noise. In other words, we show that adversarial fragility can be explained as an artifact of compression or dimension reduction in decoders.

Our method for detecting adversarial attacks is based on the idea of at least partially “decompressing” the output of the classifier and checking it for consistency against the raw observations $y$. Specifically, suppose the classifier outputs label $j$ for input signal $y$. Define $\hat{x}_j(y)$ as:

$$\hat{x}_j(y) = \arg\min_{x \in \mathcal{X}_j} \|x - y\|.$$

If we observe that $\|\hat{x}_j(y) - y\|$ is abnormally large, this means that the observed signal $y$ is far from any valid codeword with label $j$, and we conclude that label $j$ is inconsistent with the observations $y$. This, however, requires a feasible method for calculating $\hat{x}_j(y)$ for a label $j$ and signal $y$. This is basically a denoising operation that outputs a noise-free codeword given a label and a noisy observation $y$. Generative models are capable of performing such a denoising operation. We do not discuss the design of such models in this paper; instead we will show that under mild assumptions on the encoding function $f$, we can provide theoretical guarantees on a detector assuming a well-functioning generative model.
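The consistency check described above can be sketched as follows. This is only an illustration: the exhaustive search over a small finite codeword set stands in for the generative model, and the toy sets, the threshold value and the function names are assumptions, not part of the paper.

```python
import numpy as np

def decompress(label, y, codeword_sets):
    """Closest valid codeword with the given label (stand-in for a generative model)."""
    candidates = codeword_sets[label]
    return candidates[np.argmin(np.linalg.norm(candidates - y, axis=1))]

def is_consistent(label, y, codeword_sets, threshold):
    """Accept the label only if y lies within `threshold` of a codeword with that label."""
    return bool(np.linalg.norm(decompress(label, y, codeword_sets) - y) <= threshold)

# Hypothetical toy codeword sets for two labels.
codeword_sets = {
    0: np.array([[0.0, 0.0], [0.0, 1.0]]),
    1: np.array([[5.0, 0.0], [5.0, 1.0]]),
}
y = np.array([0.2, 0.9])   # observation close to a label-0 codeword

print(is_consistent(0, y, codeword_sets, threshold=1.0))  # label agrees with y
print(is_consistent(1, y, codeword_sets, threshold=1.0))  # label contradicts y
```

In a realistic system the search over codewords is not feasible, which is why a generative model conditioned on the label is needed to approximate the closest codeword.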

III Theoretical Analysis

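In this section, we consider a signal $y \in \mathbb{R}^N$, which can be a noisy version of a codeword. Without loss of generality, we assume that an ideal classifier will classify $y$ to label $1$, and assume that the closest codeword to $y$ is $x_1 = f(u_1, v)$ for $u_1$ and a certain $v$. For any $i$, we also define $x_i$ as the codeword with label $i$ that is closest to $y$: $x_i = \arg\min_{x \in \mathcal{X}_i} \|x - y\|$. We define the sets $S_1$ and $S_i$ as the spheres of size $d$ around $g(x_1)$ and $g(x_i)$ respectively, namely $S_1 = \{z : \|z - g(x_1)\| \leq d\}$ and $S_i = \{z : \|z - g(x_i)\| \leq d\}$. We assume that $g(y) \in S_1$. For simplicity of analysis, we assume that, for a vector $y$, the classifier outputs label $i$ if and only if $g(y) \in S_i$ for a certain $d$.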

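We consider the problem of finding the smallest targeted perturbation $w$ in magnitude which fools the decoder into outputting label $i$. Formally, for any $i \neq 1$, we define the minimum perturbation size $d_i(y)$ needed for target label $i$ as:

$$d_i(y) = \min_{w} \|w\| \quad \text{subject to} \quad g(y + w) \in S_i.$$

Let us define a quantity $d_g(x_1, x_i)$, which we term the “effective distance between $x_1$ and $x_i$ with respect to the function $g$”, as $d_g(x_1, x_i) = \min_w \|w\|$ subject to $g(x_1 + w) = g(x_i)$. Then for any vector $y$, we can use (4) to upper bound the smallest required perturbation size $d_i(y)$.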

For an $\epsilon > 0$ and $l > 0$, we say a classifier has $(\epsilon, l)$-robustness at signal $y$ if $P(g(y + w) \notin S_1) \leq \epsilon$, where $w$ is randomly sampled uniformly on a sphere of radius $l$ (defined for some given norm, which we will take to be the $\ell_2$ norm throughout this paper), and $P$ means probability. In the following, we will show that for a small $\epsilon$, compressed classifiers can still have $(\epsilon, l)$-robustness for $l \gg d_i(y)$, namely the classifier can tolerate large random perturbations while being vulnerable to much smaller adversarial attacks.
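Robustness to random perturbations of a given radius can be estimated empirically by Monte Carlo sampling. The sketch below is a hedged illustration: the classifier (which thresholds a single coordinate), the trial count and the function names are hypothetical, and the estimate is a sample frequency rather than the exact probability in the definition.

```python
import numpy as np

def sample_on_sphere(rng, n, radius):
    """Draw a point uniformly from the l2 sphere of the given radius in R^n."""
    w = rng.normal(size=n)
    return radius * w / np.linalg.norm(w)

def estimate_failure_rate(classify, y, radius, trials=200, seed=0):
    """Fraction of random perturbations of norm `radius` that change the label of y."""
    rng = np.random.default_rng(seed)
    clean = classify(y)
    flips = sum(
        classify(y + sample_on_sphere(rng, y.size, radius)) != clean
        for _ in range(trials)
    )
    return flips / trials

def first_coordinate_classifier(y):
    """Hypothetical classifier that looks at the first coordinate only."""
    return int(y[0] > 0.5)

y = np.zeros(100)
print(estimate_failure_rate(first_coordinate_classifier, y, radius=0.1))
print(estimate_failure_rate(first_coordinate_classifier, y, radius=50.0))
```

For this toy classifier a perturbation of norm just above 0.5 along the first coordinate always changes the label, while a random perturbation needs a much larger norm to do so with appreciable probability, since only a small share of its energy falls on that coordinate.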

III-A Classifiers with Linear Compression Functions

We first consider the special case where the compression function is linear, namely $g(y) = Ay$ with $A \in \mathbb{R}^{M \times N}$. While this may not be a reasonable model for practical AI systems, analysis of linear compression functions will yield analytical insights that generalize to nonlinear $g$, as we show later.

Theorem 1

Let $y$ be the input to a classifier, which makes decisions based on the compression function $g(y) = Ay$, where the elements of $A \in \mathbb{R}^{M \times N}$ ($M \ll N$) are i.i.d. following the standard Gaussian distribution $\mathcal{N}(0, 1)$. Let $g(x_i) = A x_i$ be the compressed image of $x_i$. Then the following statements hold for arbitrary $\epsilon > 0$, $i \neq 1$, and a big enough $M$.
1) With high probability (over the distribution of ), an attacker can design a targeted adversarial attack with such that the classifier is fooled into classifying the signal into label . Moreover, with high probability (over the distribution of ), an attacker can design an (untargeted) adversarial perturbation with such that the classifier will not classify into label .
2) Suppose that is randomly uniformly sampled from a sphere of radius in . With high probability (over the distribution of and ), if , the classifier will not classify into label . Moreover, with high probability (over the distribution of and ), if , the classifier still classifies the into label correctly.
3) Let represent a successful adversarial perturbation i.e. the classifier outputs target label for the input . Then as long as , our adversarial detection approach will be able to detect the attack.

Proof: 1) We first look at the targeted attack case. For linear decision statistics, . Solving this optimization problem, we know the optimal is given by where is the Moore-Penrose inverse of . We can see that is nothing but the projection of onto the row space of . We denote the projection matrix as . Then the smallest magnitude of an effective adversarial perturbation is upper bounded by

For , we have . One can show that, when , we can always achieve the equality, namely .

Now we evaluate . Suppose that ’s elements are i.i.d., and follow the standard zero-mean Gaussian distribution , then the random projection is uniformly sampled from the Grassmannian . We can see that the distribution of is the same as the distribution of the magnitude of the first elements of , where is a vector with its elements being i.i.d. following the standard Gaussian distribution . From the concentration of measure, for any positive ,

Then when is big enough, with high probability, for arbitrary .

Now let us look at what perturbation we need such that is not in . One can show that is outside if and only if, . Then by the triangular inequality, the attacker can take an attack with , which is no bigger than with high probability, for arbitrary and big enough .

2) If and only if , , will not fool the classifier into label . If , “, ” is equivalent to “, ”, which is in turn equivalent to “, ”, where is the projection onto the row space of . Assuming that is uniformly randomly sampled from a sphere in of radius , then

From the concentration inequality, . Thus if is big enough, with high probability, . If , .

Now let us look at what magnitude we need for a random perturbation such that is in with high probability. We know is in if and only if, . Through a large deviation analysis, one can show that, for any and big enough , is smaller than and bigger than with high probability. Thus, for an arbitrary , if , with high probability, implying the AI classifier still classifies the into Class correctly.

3) Suppose that an AI classifier classifies the input signal into label . We propose to check whether belongs to . In our model, the signal belongs to only if . Let us take any codeword . We show that when , we can always detect the adversarial attack if the AI classifier misclassifies to that codeword corresponding to label . In fact, , which is no smaller than .

We note that means for every codeword , thus implying that the adversary attack detection technique can detect that is at more than distance from every codeword from .
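The projection argument used in part 1 of the proof can be checked numerically. The sketch below is not the paper's experiment: the dimensions, the random input and the random target codeword are assumptions chosen for illustration. It computes the minimum-norm perturbation that makes the compressed input coincide with a compressed target, using the Moore-Penrose inverse, and compares its relative norm with $\sqrt{M/N}$, the standard concentration value for the norm of a random $M$-dimensional projection of a fixed vector in $N$ dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 20                      # illustrative dimensions with M much smaller than N
A = rng.normal(size=(M, N))          # Gaussian linear compression matrix

y = rng.normal(size=N)               # current input (hypothetical)
x_target = rng.normal(size=N)        # hypothetical codeword of the target label
delta = x_target - y

# Minimum-norm perturbation w with A @ (y + w) == A @ x_target:
# the projection of delta onto the row space of A.
w_adv = np.linalg.pinv(A) @ (A @ delta)

ratio = np.linalg.norm(w_adv) / np.linalg.norm(delta)
print(ratio, np.sqrt(M / N))         # the two numbers should be of similar size

# A random perturbation of the same norm moves the compressed statistic far less.
w_rand = rng.normal(size=N)
w_rand *= np.linalg.norm(w_adv) / np.linalg.norm(w_rand)
relative_effect = np.linalg.norm(A @ w_rand) / np.linalg.norm(A @ w_adv)
print(relative_effect)
```

The adversarial perturbation spends all of its energy in the row space of the compression matrix, whereas a random perturbation of equal norm spreads its energy over all $N$ dimensions, most of which the compression discards.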


III-B Nonlinear Decision Statistics in AI Classifiers

In this subsection, we show that an AI classifier using nonlinear compressed decision statistics is significantly more vulnerable to adversarial attacks than to random perturbations. We will quantify the gap between how much a random perturbation and a well-designed adversarial attack affect $g(y)$; the full proof is in [13].

Theorem 2

Let us assume that the nonlinear function is differentiable at . For , we define , and , where is uniformly randomly sampled from a unit sphere. Then , where means expectation over the distribution of . If we assume that the entries of the Jacobian matrix are i.i.d. distributed following the standard Gaussian distribution , then, when is big enough, with high probability, for any , .
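A first-order view gives some intuition for this gap. Near a point where the compression function is differentiable, a small perturbation changes the compressed statistic by approximately the Jacobian times the perturbation, so the worst-case effect of a unit-norm perturbation is the largest singular value of the Jacobian, while the root-mean-square effect over random unit directions is its Frobenius norm divided by $\sqrt{N}$. The sketch below compares the two for a Gaussian Jacobian; the dimensions and variable names are assumptions, and this is an illustration rather than the proof given in [13].

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 20                      # illustrative dimensions
J = rng.normal(size=(M, N))          # assumed Jacobian with i.i.d. Gaussian entries

# Worst case over unit-norm perturbations: the largest singular value of J.
worst_case_gain = np.linalg.norm(J, 2)

# Typical case: root-mean-square of ||J w|| over uniformly random unit vectors w,
# which equals the Frobenius norm of J divided by sqrt(N).
rms_random_gain = np.linalg.norm(J, "fro") / np.sqrt(N)

gap = worst_case_gain / rms_random_gain
print(gap, np.sqrt(N / M))
```

For these dimensions the worst-case direction has roughly an order of magnitude more effect on the compressed statistic than a typical random direction, and the gap grows as the compression becomes more aggressive.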

IV Experimental Results

Fig. 3: System modules for detecting adversarial attacks using correlation coefficients

We performed a series of experiments to test and illustrate our proposed defense for a popular voice recognition AI classifier, DeepSpeech. The experimental setup is illustrated in Fig. 3; a visual comparison with the abstract model in Fig. 1 shows how the various functional blocks are implemented in our experiment.

The experiment consisted of choosing sentences randomly from the classic 19th-century novel “A Tale of Two Cities.” A Linux text-to-speech (T2S) program, Pico2wave, converted a chosen sentence into a female voice wave file. The use of a T2S system for generating the source audio signal (instead of human-spoken audio) effectively allows us to hold all the “irrelevant” variables constant, and thus renders the signal synthesis block in Fig. 1 a deterministic function of just the input label.

Fig. 4: The change of the cross-correlation coefficients $\rho$. Blue circles indicate $\rho$ between input signals without adversarial attack and their corresponding reconstructed signals from decoded labels, and red triangles indicate $\rho$ between input signals with adversarial attacks and their corresponding reconstructed signals. Low blue circles indicate that DeepSpeech misrecognized several characters even though no adversarial attack was present.

Let $x$ denote the samples of this source audio signal. This audio signal is played over a PC speaker and recorded by a USB microphone on another PC. Let $y$ denote the samples of this recorded wave file. The audio playback and recording were performed in a quiet room with no audible echoes or distortions, so this “channel” can be approximately modeled as a simple AWGN channel: $y = \alpha x + n$, where $\alpha$ is a scalar representing audio signal attenuation and $n$ is random background noise. In our experiment, the SNR was approximately dB.
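The attenuation-plus-noise channel model and the resulting signal-to-noise ratio can be simulated in a few lines. All numerical values below (sample rate, tone frequency, attenuation, noise level) are assumptions for illustration and are not the values measured in the experiment; a pure tone stands in for the speech signal.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                                   # assumed sample rate in Hz
t = np.arange(fs) / fs                       # one second of samples
x = np.sin(2 * np.pi * 440 * t)              # stand-in for the source audio samples

alpha = 0.5                                  # assumed attenuation factor
noise = 0.01 * rng.normal(size=x.size)       # assumed background noise level
y = alpha * x + noise                        # recorded signal under the AWGN model

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, computed from average powers."""
    return 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))

print(snr_db(alpha * x, noise))
```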

We input $y$ into a voice recognition system, specifically, the Mozilla implementation DeepSpeech V0.1.1 based on TensorFlow. The 10 sentences are listed in detail in a table in our full paper [13]. We then used Nicholas Carlini’s adversarial attack Python script with DeepSpeech (V0.1.1) through gradient back-propagation to generate a targeted adversarial audio signal $y_a = y + w$, where $w$ is a small adversarial perturbation that causes the DeepSpeech voice recognition system to predict a completely different sentence. Thus, we have a “clean” audio signal $y$, and a “targeted corrupted” adversarial audio signal $y_a$ that upon playback is effectively indistinguishable from $y$, but successfully fools DeepSpeech into outputting a different target sentence. In our experiment, the power of $y$ over the adversarial perturbation $w$ was approximately dB.

We then implemented a version of our proposed defense to detect whether the output of DeepSpeech is wrong, whether due to noise or adversarial attacks. For this purpose, we fed the decoded text output of the DeepSpeech system into the same T2S software, Pico2wave, to generate a reconstructed female voice wave file, denoted by $\hat{x}$. We then performed a simple cross-correlation of a portion of the reconstructed signal (representing approximately reconstruction of the original number of samples in ) with the input signal to the DeepSpeech classifier, yielding a correlation coefficient $\rho$. If $\rho$ is smaller than a threshold (0.4), we declare that the speech recognition classification is wrong. The logic behind this test is as follows. When the input signal is $y$, i.e. the signal without adversarial perturbation, DeepSpeech successfully outputs the correct label, which results in $\hat{x} = x$. Since $y$ is just a noisy version of $x$, it will be highly correlated with $\hat{x}$. On the other hand, for the adversarially perturbed input $y_a$, the reconstructed signal $\hat{x}$ is completely different from $x$ and therefore can be expected to be practically uncorrelated with $y_a$.
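A minimal sketch of the correlation test is given below. It assumes the two signals are already time-aligned and compares them at zero lag over their overlapping portion; the exact correlation formula and alignment procedure used in the experiment are not reproduced here, and the synthetic signals and function names are hypothetical. Only the threshold of 0.4 is taken from the text.

```python
import numpy as np

def correlation_coefficient(a, b):
    """Normalized cross-correlation at zero lag between two equal-length signals."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def output_looks_wrong(recorded, reconstructed, threshold=0.4):
    """Flag the decoded text when the resynthesized audio does not match the input."""
    n = min(len(recorded), len(reconstructed))   # compare the overlapping portion
    return correlation_coefficient(recorded[:n], reconstructed[:n]) < threshold

rng = np.random.default_rng(0)
clean = rng.normal(size=4000)                    # stand-in for resynthesized audio
recorded = 0.5 * clean + 0.1 * rng.normal(size=4000)
unrelated = rng.normal(size=4000)                # resynthesis of a different sentence

print(output_looks_wrong(recorded, clean))       # consistent pair
print(output_looks_wrong(recorded, unrelated))   # inconsistent pair
```

With real audio, the reconstructed and recorded signals would first need to be aligned in time, for example by searching over lags for the peak of the cross-correlation, before a coefficient like this is meaningful.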

Fig. 4 shows the cross-correlation $\rho$ for sets of recorded signals (a) with and (b) without adversarial perturbations (red triangles and blue circles respectively in Fig. 4). The adversarial perturbations all successfully fool the DeepSpeech AI into outputting the target text “he travels the fastest who travels alone”. We see that the observed correlations for the adversarial signals are always very small, and are therefore successfully detected by our correlation test. Interestingly, some of the non-adversarial signals yield low correlations as well, but this is because DeepSpeech cannot decode perfectly even when there are no adversarial attacks present.