Input Validation for Neural Networks via Runtime Local Robustness Verification

02/09/2020 ∙ by Jiangchao Liu, et al. ∙ 0

Local robustness verification can verify that a neural network is robust with respect to any perturbation of a specific input within a certain distance. We call this distance the robustness radius. We observe that the robustness radii of correctly classified inputs are much larger than those of misclassified inputs, which include adversarial examples, especially those from strong adversarial attacks. Another observation is that the robustness radii of correctly classified inputs often follow a normal distribution. Based on these two observations, we propose to validate inputs for neural networks via runtime local robustness verification. Experiments show that our approach can protect neural networks from adversarial examples and improve their accuracies.




1 Introduction

Despite the tremendous success (LeCun et al., 2015) of deep neural networks in recent years, their applications in safety critical areas are still concerning.

Can we trust neural networks? This question arose when people found it hard to explain or interpret neural networks (Castelvecchi, 2016) and drew more attention since the discovery of adversarial examples (Szegedy et al., 2014). An adversarial example is an input that is obtained by adding a small, imperceptible perturbation to a valid input (i.e., correctly classified input), and that is designed to be misclassified. Recent studies (Ilyas et al., 2019) demonstrate that adversarial examples are features that widely exist in common datasets, thus can hardly be avoided. This means neural networks inherently lack robustness and are vulnerable to malicious attacks.

A considerable number of works (Yuan et al., 2019) have been proposed to improve the robustness of neural networks against adversarial examples. One method is adversarial training (Goodfellow et al., 2014; Kurakin et al., 2016), which feeds adversarial examples to neural networks in the training stage. Adversarial training works well on the types of adversarial examples considered in the training dataset, but provides no guarantee on other types. Some works (Bradshaw et al., 2017; Abbasi and Gagné, 2017) focus on designing robust architectures of neural networks. However, similarly to adversarial training, these methods do not guarantee robustness on all adversarial examples.

One promising solution is formal verification, which can prove that a network satisfies some formally defined specifications. To give a formal specification on robustness, we first define a network as $f: \mathbb{R}^n \to C$, where $\mathbb{R}^n$ is a vector space of inputs (e.g., images) and $C$ is a set of class labels. Then we define the robustness radius (Wang et al., 2017) of a network $f$ on an input $x$ as

$$R(f, x) = \max \{ \delta \mid \forall x' .\ \|x' - x\|_p \le \delta \Rightarrow f(x') = f(x) \}$$

where $\|\cdot\|_p$ means $L_p$ norm distance. Robustness radius measures the region in which a network is robust against perturbations. Another equivalent definition is minimal distortion (Weng et al., 2018) (i.e., the minimal distance required to craft an adversarial example). We prefer to use the term robustness radius since it is defined from a defensive perspective. With robustness radius, we can define the global robustness property of a network $f$ as

$$\forall x \in \mathbb{R}^n .\ f(x) = O(x) \Rightarrow R(f, x) \ge \delta$$

where $\delta$ is a user-provided threshold and $O$ denotes the oracle on the classification of $x$. Note that $O$ outputs $\bot$ if an input in $\mathbb{R}^n$ is not classified into any class in $C$. This property ensures that for any input that can be recognized by humans and is correctly classified, the neural network is robust to any perturbation to some extent (i.e., $\delta$ in $L_p$ norm distance). Unfortunately, the global robustness property following this definition can hardly be verified because of the huge input space and the absence of the oracle $O$. Some researchers tried to verify a global robustness property of a weaker definition (Katz et al., 2017) (without $O$), but only succeeded on very small networks (i.e., consisting of a few dozen neurons).
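To make the notion of robustness radius concrete, here is a minimal sketch for a case where it has a closed form. It is an illustration only and not part of the experiments: for a binary linear classifier that predicts the sign of w·x + b, no perturbation with L∞ norm strictly below |w·x + b| / ‖w‖₁ can change the prediction. The function name and the toy weights are assumptions made for this example.

```python
def linf_radius_linear(w, b, x):
    """L-infinity distance from x to the decision boundary w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    l1_norm = sum(abs(wi) for wi in w)
    if l1_norm == 0:
        raise ValueError("w must be non-zero")
    return abs(score) / l1_norm


# Toy classifier and input (made-up values).
w, b = [2.0, -1.0], 0.5
x = [1.0, 1.0]
radius = linf_radius_linear(w, b, x)  # |2 - 1 + 0.5| / (2 + 1)

# The worst-case perturbation moves each coordinate against the sign of its weight.
delta = 0.49
worst = [xi - delta * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]
worst_score = sum(wi * xi for wi, xi in zip(w, worst)) + b
```

For deep networks no such closed form is available, which is why a verifier is needed to bound the radius.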

Given the difficulties in verifying global robustness properties, many researchers turned to local robustness properties, i.e., $\forall x \in X .\ R(f, x) \ge \delta$. Instead of the whole input space, a local robustness property only considers a set of inputs (denoted as $X$ in the formula), for instance, the training dataset. Various techniques (Huang et al., 2017; Gehr et al., 2018; Singh et al., 2018b; Ehlers, 2017) have been successfully applied in this kind of verification. However, local robustness properties are currently only used to evaluate the robustness of a given network or a defense technique, since they do not provide any guarantee for robustness on inputs outside of the set $X$.

Can we trust neural networks on a specific runtime input? Although this question is a compromise to the sad fact that the global robustness properties can hardly be guaranteed, it is still practically useful if we can know whether a neural network gives the expected output on an input at runtime. Adversarial detection (Lu et al., 2017; Grosse et al., 2017) rejects inputs that are suspected of being adversarial examples based on the characteristics observed on known adversarial attacks. Input reconstruction (Meng and Chen, 2017) tries to transform adversarial examples to the inputs that can be correctly classified. Runtime verification (Desai et al., 2018) checks whether an output satisfies some safety specifications at runtime and drops the output if not (traditional software is used as backup). This method, however, needs to know the constraints on outputs, which is not the case in tasks like image classification.

In this paper, we propose to validate inputs at runtime in a new way, i.e., via local robustness verification, which can compute the robustness radius of any input (as opposed to correctly classified inputs only). We utilize the robustness radius as the characteristic of inputs that distinguishes correctly classified inputs from misclassified (possibly adversarial) inputs. Although it is known that adversarial examples themselves are often not robust to small perturbations (Luo et al., 2018; Wang et al., 2018a), to the best of our knowledge, we are the first to validate inputs by observing the robustness radius. To be specific, we have two observations. The first is that the average robustness radius of valid inputs (i.e., correctly classified inputs) is much larger than that of misclassified inputs, no matter whether adversarial or not. To be formal, given a neural network $f$ and a set of inputs $X$ at runtime (which may include adversarial examples), let $X_v = \{x \in X \mid f(x) = O(x)\}$ and $X_m = \{x \in X \mid f(x) \ne O(x)\}$, then we have

$$\frac{\sum_{x \in X_v} R(f, x)}{|X_v|} \gg \frac{\sum_{x \in X_m} R(f, x)}{|X_m|} \qquad (1)$$

where $|\cdot|$ denotes cardinality and $\gg$ denotes “much larger than”. Note that we only consider inputs that can be classified into labels, which excludes randomly generated inputs mapping to no label (i.e., mapped by $O$ to $\bot$). We believe that this assumption is practical. Our experiments show that Equation 1 holds on adversarial examples from all attacks we have tried, especially on those strong attacks which seek the smallest perturbations.

Another observation is that the robustness radii of valid inputs (i.e., $X_v$) follow a normal distribution.

Based on these two observations, we propose a new way of validating inputs for neural networks. It can reject both adversarial examples and misclassified clean data (i.e., without crafted adversarial examples). Thus it not only protects neural networks from adversarial attacks, but also improves their accuracies. More importantly, this way does not need knowledge of the classification scenario and is not specific to any attack. We have conducted experiments on Feedforward Neural Networks (FNN) and Convolutional Neural Networks (CNN) with three representative attacks, i.e., FGSM (fast, white-box) (Goodfellow et al., 2014), C&W (strong, white-box) (Carlini and Wagner, 2017), and HOP (i.e., Hopskipjump, black-box) (Chen et al., 2019). The results demonstrate the effectiveness of our method. To be more specific, on a random CNN for MNIST (LeCun et al., 1998), our method can reject 75% of misclassified natural inputs, 95% and 100% of FGSM adversarial examples with different parameters respectively, 100% of C&W adversarial examples and 100% of HOP adversarial examples, with only a 3% false alarm rate.

It is worth mentioning that the two observations are valid not only on exact robustness radius computed by complete verification, but also on under-approximated robustness radius computed by incomplete verification, which is fast enough to be deployed at runtime.

We make the following contributions:

  • We observed that, on FNNs and CNNs, the average robustness radius of the valid inputs is much larger than that of the misclassified inputs (no matter whether adversarial or not);

  • We observed that, on most FNNs and CNNs, the robustness radii of the valid inputs follow a normal distribution;

  • Based on these two observations, we propose a new input validation method based on local robustness verification (which currently is only used to evaluate the robustness of a given network in existing work, as opposed to validate inputs), which can protect neural networks from adversarial examples, especially from strong attacks, and improve their accuracies on clean data.

2 Observation on Robustness Radii of Inputs from Different Categories

In this section, we show our observation on the robustness radii of valid (i.e., correctly classified) data, misclassified clean data and adversarial examples.

2.1 Background and Experimental Setup

Local Robustness Verification. Local robustness properties ensure that a neural network is immune to adversarial examples on a set of inputs $X$ within $\delta$ in $L_p$ norm distance. To prove it, we only need to prove that, for given $x \in X$ and $\delta$,

$$\forall x' .\ \|x' - x\|_p \le \delta \Rightarrow f(x') = f(x) \qquad (2)$$

In this paper, we only consider the case $p = \infty$. Current verifiers for this property can be categorized as complete and incomplete. Complete verifiers can give an exact answer on whether the property is satisfied. Most complete verifiers are based on Mixed Integer Linear Programming (MILP) (Dutta et al., 2018; Fischetti and Jo, 2018) or Satisfiability Modulo Theories (SMT) (Ehlers, 2017; Katz et al., 2017). The underlying problem is NP-hard, so these methods can hardly be applied to large networks.

Incomplete verifiers only provide conservative answers, that is, they could return unknown even if the property holds. Thus incomplete verifiers usually can only verify an under-approximation of robustness radius. Typical incomplete verification methods on neural networks include symbolic intervals (Wang et al., 2018b) and abstract interpretation (Singh et al., 2019). These methods are much more scalable than complete ones.
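The conservative nature of incomplete verification can be illustrated with plain interval arithmetic, which is a much coarser technique than the symbolic intervals and abstract domains cited above and is swapped in here only because it fits in a few lines. The sketch below propagates an L∞ box through a small ReLU network and reports robustness only when the bounds prove it; the function names, the network encoding and the toy weights are all assumptions made for this example.

```python
def box_bounds(layers, x, delta):
    """Bounds on the logits for all inputs within delta of x in L-infinity norm."""
    lo = [v - delta for v in x]
    hi = [v + delta for v in x]
    for i, (W, b) in enumerate(layers):
        new_lo, new_hi = [], []
        for row, bias in zip(W, b):
            # A non-negative weight carries lower bounds to lower bounds;
            # a negative weight swaps the roles of the two bounds.
            new_lo.append(bias + sum(w * (l if w >= 0 else h) for w, l, h in zip(row, lo, hi)))
            new_hi.append(bias + sum(w * (h if w >= 0 else l) for w, l, h in zip(row, lo, hi)))
        lo, hi = new_lo, new_hi
        if i < len(layers) - 1:  # ReLU after every layer except the last
            lo = [max(0.0, v) for v in lo]
            hi = [max(0.0, v) for v in hi]
    return lo, hi


def is_robust_interval(layers, x, delta):
    """True only if the bounds prove that the predicted class cannot change."""
    logits, _ = box_bounds(layers, x, 0.0)  # delta = 0 gives the exact logits
    pred = max(range(len(logits)), key=logits.__getitem__)
    lo, hi = box_bounds(layers, x, delta)
    return all(lo[pred] > hi[k] for k in range(len(hi)) if k != pred)


# Toy 2-2-2 network (made-up weights): logits = [relu(x0 - x1), relu(x1 - x0)].
layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
          ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
x = [1.0, 0.0]
proved_small = is_robust_interval(layers, x, 0.2)
proved_large = is_robust_interval(layers, x, 0.6)
```

A `False` result here means only that robustness could not be proven, which is exactly why radii computed this way are under-approximations.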

Experimental Setup. We take MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009) as our input datasets and use IBM’s Adversarial Robustness Toolbox (Nicolae et al., 2018) to generate FGSM, C&W, and HOP adversarial examples with default parameters, except for FGSM where we set $\epsilon$ (i.e., a parameter (Goodfellow et al., 2014)) as 0.1 (by default) and 0.05 (which is stronger) respectively.

We use ERAN (Singh et al., 2018a) as the verifier, which supports both complete and incomplete robustness verification. ERAN does not compute the robustness radius directly, but can judge whether the robustness radius is larger than a given $\delta$ (i.e., the network is robust on all inputs that are within $\delta$ in $L_\infty$ norm distance of $x$, as in Equation 2). We refer to this query as the robustness check on $(f, x, \delta)$. Note that ERAN supports two versions of the robustness check: the complete one and the incomplete one. Applying binary search on the complete (resp. incomplete) robustness check, we can find a value close enough to the robustness radius (resp. an under-approximation of the robustness radius). This algorithm is described in Algorithm 1. In the following, we will call the value computed with complete verification the (asymptotically) exact robustness radius, and that with incomplete verification the approximate robustness radius. All experiments are conducted on Ubuntu 18.04 running on a desktop with an Intel i9-9900K CPU and 32GB of memory.

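  Input: network f, input x, big value M, tolerance t
  low := 0, high := M
  while high - low > t do
     mid := (low + high) / 2
     if the robustness check on (f, x, mid) succeeds then
        low := mid
     else
        high := mid
     end if
  end while
  Output: low
Algorithm 1 Computation of robustness radius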
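The binary search described in the text can be sketched as follows. This is a minimal sketch rather than the modified ERAN code: the verifier is passed in as a hypothetical callable `is_robust(delta)` that already closes over the network and the input, and `big` and `tol` stand for the big value and the tolerance.

```python
def robustness_radius(is_robust, big, tol):
    """Binary search for the largest delta (up to tol) that the check accepts."""
    low, high = 0.0, big
    while high - low > tol:
        mid = (low + high) / 2
        if is_robust(mid):
            low = mid      # robustness proven at mid: the radius is at least mid
        else:
            high = mid     # not proven at mid: search below
    return low


# Toy stand-in for a verifier: pretend robustness is provable below 0.0373.
toy_check = lambda delta: delta < 0.0373
radius = robustness_radius(toy_check, big=0.256, tol=0.001)
```

The number of verifier calls grows with log2(big / tol), so the cost of one radius computation is a small multiple of the cost of a single check.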

2.2 Observation on Exact Robustness Radius

ERAN combines abstract interpretation, linear programming and MILP to completely verify a network. To make the verification terminate in a reasonable time, we trained a small FNN (denoted as FNN-MNIST) on MNIST (its accuracy is listed in Table 1), which consists of 5 layers: the input layer, three fully connected layers with 30 neurons each, and one output layer.

We run ERAN with the RefineZono (Singh et al., 2018b) domain, and set the big value as 0.256 and the tolerance as 0.001 in Algorithm 1. We computed the robustness radii of the first 100 samples from each of the following six categories in the MNIST test dataset:

  • samples that can be correctly classified by the network

  • samples that are misclassified by the network

  • adversarial examples from successful FGSM attack with $\epsilon = 0.1$

  • adversarial examples from successful FGSM attack with $\epsilon = 0.05$

  • adversarial examples from successful C&W attack

  • adversarial examples from successful HOP attack

Figure 1: The numbers of inputs which have a larger robustness radius than a given value. Panels: (a) exact robustness radius on FNN-MNIST; (b) approximate robustness radius on FNN-MNIST; (c) approximate robustness radius on CNN-MNIST; (d) approximate robustness radius on CNN-CIFAR.

Figure 1(a) shows the number of inputs whose exact robustness radii are above a given value (i.e., the x-axis). We can see that the robustness radii of valid inputs are much larger than those of the other categories of data, especially the adversarial examples from the strong white-box C&W attack and the black-box HOP attack. The robustness radii of adversarial examples from the FGSM attack with $\epsilon = 0.1$ are significantly larger than those with $\epsilon = 0.05$.

Our experiments suggest that we can use the robustness radius to evaluate to what extent we should trust the output of a neural network on a given input. By setting a threshold and rejecting any input whose robustness radius is lower, we can protect the neural network from adversarial examples and improve its accuracy. However, complete verification is time-consuming. In our experiments, each call to the robustness check takes 11s on average, even though our network contains only 100 neurons. It seems that complete verification can hardly be deployed at runtime, especially considering that the running time of complete verification increases exponentially with the number of neurons.

2.3 Observation on Approximate Robustness Radius

Incomplete verification usually runs much faster than complete verification and has the potential to be deployed at runtime. However, Algorithm 1 with incomplete verification can only give an approximate robustness radius. We would like to know (1) whether approximate robustness radius from incomplete verification is close to the exact robustness radius; (2) whether the approximate robustness radii of valid inputs are significantly larger than that of the misclassified inputs. Actually the second question is more important since it decides whether we can use incomplete verification to validate inputs.

Observation on FNN. We utilize ERAN with the DeepZono domain (Singh et al., 2018a) (which is incomplete) to compute the approximate robustness radii of the same inputs on the network FNN-MNIST. The results are shown in Figure 1(b). Comparing Figure 1(a) and Figure 1(b), we can see that the values of approximate and exact robustness radii of the same inputs are very close (comparing the x-axis). In fact, the approximate robustness radii (except those equal to 0) of all inputs are between 44% and 100% of their exact robustness radii. Moreover, we can see that, similar to exact robustness radii, the approximate robustness radii of valid inputs are significantly larger than those of misclassified inputs. This means that we can utilize the approximate robustness radius to protect neural networks. Moreover, each call to the robustness check in incomplete verification costs less than 1s on the given network, and has polynomial time complexity with respect to the number of neurons, which means it has the potential to be deployed at runtime.

Observation on CNN. We have also conducted experiments on Convolutional Neural Networks. They are significantly larger than the network FNN-MNIST, and complete verification methods can hardly compute robustness radius in a reasonable time. Thus we only tried incomplete verification. Our experiments on CNN are conducted on two datasets: MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009).

We trained a CNN (denoted as CNN-MNIST) on MNIST of 7 layers: the input layer, a convolutional layer with 6 filters, a max-pooling layer, a convolutional layer with 16 filters, a max-pooling layer, a fully connected layer of 128 neurons and an output layer with 10 labels. The accuracy is 98.62%.

As in Section 2.2, we utilize ERAN with the DeepZono domain (Singh et al., 2018a) to compute the approximate robustness radii of the first 100 inputs from each of the six categories. Figure 1(c) shows the results on CNN-MNIST. We can see that the computed approximate robustness radii of all inputs are much smaller than those computed on the small network FNN-MNIST. We do not know whether the approximate robustness radii are close to the exact robustness radii, which we cannot get even after several days of computation. However, most importantly, the characteristics of the approximate robustness radii of inputs of different categories are the same as those of the exact robustness radii. That is, the approximate robustness radii of valid inputs (i.e., the red line) are much larger than those of other inputs. In fact, if we set the threshold as 0.01, we can reject 75% of misclassified clean data, 95% of FGSM adversarial examples where $\epsilon = 0.1$, 100% of FGSM adversarial examples where $\epsilon = 0.05$, 100% of C&W adversarial examples, and 100% of HOP adversarial examples, and only 3% of valid inputs (see Table 2).

We trained a LeNet-5 (LeCun et al., 1998) CNN (denoted as CNN-CIFAR in this paper) on CIFAR10 of 8 layers: the input layer, a convolutional layer with 6 filters, a max-pooling layer, a convolutional layer with 16 filters, a max-pooling layer, two fully connected layers of 120 and 84 neurons respectively, and an output layer with 10 labels. Its accuracy is listed in Table 1.

Figure 1(d) shows the results of the first 100 inputs of each category in the CIFAR10 test dataset. The approximate robustness radii of the valid inputs are still significantly larger than those of misclassified inputs and adversarial examples from C&W and HOP attacks, but are almost indistinguishable from those of FGSM adversarial examples. We believe the reason is that the accuracy of the network is so low that it leaves big “holes” for adversarial examples in the input space.

To validate our observations, we trained more FNNs and CNNs of various structures and conducted the same measurements. Table 1 shows the results. In the table, for each network, we show its training dataset (column Dataset) and its network structure (column Network), where () describes FNN-MNIST, () describes CNN-MNIST and () describes CNN-CIFAR. For each network structure, we adopted different activation functions (column Activation). The table also shows the accuracy on the test dataset (column Acc.) and the average approximate robustness radii of the first 10 inputs (we chose 10 because we believe that 10 is enough to compare the average values, and generating adversarial examples with attacks such as HOP can be very time-consuming) in the test dataset of six categories: correctly classified inputs (column Valid), misclassified inputs (column Mis.), adversarial examples from FGSM attacks with $\epsilon = 0.1$, FGSM attacks with $\epsilon = 0.05$, C&W attacks and HOP attacks. The average running time of each call to the robustness check is also recorded in column Time(s). The column P-value will be explained later (see Section 4). From the table, we can see that our observation is valid on all trained networks. Our experiments can be easily reproduced since we only use open source tools with a little modification (e.g., Algorithm 1). The modified code and all trained networks in this paper have been uploaded to an online repository, even though we believe that people can easily reproduce our experiments with their own trained networks.

Dataset Network Activation Acc. Valid Mis. FGSM(ε=0.1) FGSM(ε=0.05) C&W HOP P-value Time(s)
MNIST ReLU 95.82 0.0227 0.0056 0.0091 0.0066 0.0020 0.0009 0.033 0.242
MNIST ReLU 96.49 0.0183 0.0069 0.0087 0.0065 0.0020 0.0009 0.026 0.239
MNIST ReLU 96.55 0.0194 0.0069 0.0078 0.0048 0.0027 0.0007 0.001 0.241
MNIST ReLU 96.80 0.0176 0.0047 0.0090 0.0068 0.0013 0.0009 0.140 0.143
MNIST ReLU 97.49 0.0162 0.0070 0.0080 0.0054 0.0023 0.0010 0.916 0.357
MNIST ReLU 97.37 0.0143 0.0076 0.0057 0.0051 0.0019 0.0010 0.091 0.441
MNIST ReLU 97.61 0.0127 0.0057 0.0047 0.0043 0.0013 0.0011 0.082 0.517
MNIST ReLU 97.70 0.0101 0.0054 0.0038 0.0041 0.0027 0.0010 0.001 1.746
MNIST Sigmoid 95.26 0.0145 0.0071 0.0074 0.0050 0.0020 0.0011 0.001 0.251
MNIST Sigmoid 96.15 0.0130 0.0053 0.0071 0.0050 0.0015 0.0010 0.003 0.156
MNIST Sigmoid 97.11 0.0115 0.0057 0.0047 0.0043 0.0017 0.0008 0.142 0.408
MNIST Sigmoid 96.86 0.0101 0.0057 0.0041 0.0042 0.0007 0.0008 0.126 0.551
MNIST Sigmoid 96.30 0.0102 0.0023 0.0050 0.0024 0.0011 0.0008 0.492 0.670
MNIST Sigmoid 96.90 0.0086 0.0052 0.0042 0.0033 0.0018 0.0007 0.034 2.720

MNIST Tanh 96.25 0.0081 0.0046 0.0030 0.0020 0.0016 0.0008 0.001 0.255
MNIST Tanh 96.98 0.0070 0.0032 0.0034 0.0034 0.0008 0.0007 0.001 0.156
MNIST Tanh 97.42 0.0059 0.0024 0.0034 0.0026 0.0011 0.0007 0.713 0.408
MNIST Tanh 97.80 0.0046 0.0026 0.0017 0.0021 0.0013 0.0008 0.307 0.550
MNIST Tanh 97.62 0.0035 0.0020 0.0024 0.0016 0.0008 0.0007 0.022 0.681
MNIST Tanh 97.52 0.0028 0.0015 0.0015 0.0010 0.0007 0.0007 0.142 2.736

MNIST ReLU 98.62 0.0175 0.0064 0.0055 0.0032 0.0021 0.0011 0.828 0.232
MNIST ReLU 98.29 0.0188 0.0061 0.0049 0.0040 0.0034 0.0019 0.492 0.253
MNIST ReLU 98.45 0.0174 0.0065 0.0051 0.0029 0.0030 0.0019 0.622 0.213
MNIST ReLU 97.45 0.0239 0.0092 0.0089 0.0080 0.0025 0.0021 0.001 0.073
MNIST ReLU 98.99 0.0170 0.0059 0.0051 0.0032 0.0027 0.0019 0.333 0.754
MNIST Sigmoid 98.45 0.0178 0.0045 0.0053 0.0045 0.0017 0.0018 0.773 0.344
MNIST Sigmoid 97.78 0.0249 0.0081 0.0102 0.0060 0.0037 0.0022 0.068 0.119
MNIST Sigmoid 98.85 0.0172 0.0061 0.0038 0.0059 0.0016 0.0019 0.522 1.318
MNIST Tanh 98.73 0.0039 0.0013 0.0022 0.0015 0.0006 0.0006 0.188 0.369
MNIST Tanh 98.56 0.0103 0.0036 0.0042 0.0024 0.0014 0.0012 0.314 0.124
MNIST Tanh 99.11 0.0037 0.0017 0.0024 0.0013 0.0004 0.0006 0.001 1.488
CIFAR10 ReLU 73.66 0.0017 0.0009 0.0017 0.0016 0.0003 0.0002 0.005 1.680
CIFAR10 Sigmoid 65.30 0.0025 0.0014 0.0029 0.0022 0.0005 0.0004 0.839 2.418
CIFAR10 Tanh 70.68 0.0008 0.0005 0.0009 0.0007 0.0002 0.0002 0.390 2.443

Table 1: The average robustness radius and p-value of different networks
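To read the table in terms of Equation 1, it can help to look at ratios rather than raw radii. The short sketch below copies two rows of Table 1 (the ReLU network with 98.62% accuracy on MNIST and the ReLU network on CIFAR10) and divides the average radius of valid inputs by that of each other category. The dictionary layout and the function name are assumptions made for this illustration; the two FGSM columns are labeled by position.

```python
# Average approximate robustness radii copied from two rows of Table 1.
rows = {
    "MNIST ReLU 98.62": {
        "Valid": 0.0175, "Mis.": 0.0064, "FGSM (1st col.)": 0.0055,
        "FGSM (2nd col.)": 0.0032, "C&W": 0.0021, "HOP": 0.0011,
    },
    "CIFAR10 ReLU 73.66": {
        "Valid": 0.0017, "Mis.": 0.0009, "FGSM (1st col.)": 0.0017,
        "FGSM (2nd col.)": 0.0016, "C&W": 0.0003, "HOP": 0.0002,
    },
}


def valid_ratio(row):
    """Ratio of the valid-input average to the average of every other category."""
    return {name: row["Valid"] / value for name, value in row.items() if name != "Valid"}


ratios = {network: valid_ratio(row) for network, row in rows.items()}
for network, r in ratios.items():
    print(network, {name: round(v, 2) for name, v in r.items()})
```

On the MNIST row every ratio is well above 1, whereas on the CIFAR10 row the FGSM ratios are close to 1, which matches the remark that FGSM examples are almost indistinguishable from valid inputs on CNN-CIFAR.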

There are CNNs (Hu et al., 2018; Xie et al., 2017) that have high accuracies on CIFAR10. But these networks usually adopt layers other than fully-connected, convolutional and max-pooling layers and are out of the scope of this paper and beyond the ability of current local robustness verifiers (Katz et al., 2017; Ehlers, 2017; Wang et al., 2018b; Dutta et al., 2018; Singh et al., 2018a).

Figure 2: The numbers of inputs which have a larger minimal distortion than a given value
Figure 3: Minimal distortion from CLEVER and exact/approximate robustness radius from ERAN

We also tried another tool, CLEVER (Weng et al., 2018), which estimates the minimal distortion needed to craft adversarial examples (which should be equal to the robustness radius). Figure 2 illustrates the minimal distortions estimated on the first 100 inputs of the six categories on the networks FNN-MNIST and CNN-MNIST. We can see that the difference between the minimal distortions of valid inputs and those of the other categories is less significant than that of the robustness radii in Figure 1. Figure 3 shows the histogram of the density of the minimal distortion from CLEVER and the exact/approximate robustness radii from ERAN of the first 100 valid inputs on the network FNN-MNIST. Compared with approximate robustness radii, both the values and the distribution of minimal distortions are much farther from the exact robustness radii.

3 Input Validation with Observation I

Based on our first observation, we can design an algorithm to validate the inputs of a neural network at runtime to protect it from adversarial examples and improve its accuracy.

A naive idea is to set a threshold and reject all inputs whose approximate robustness radii are below it. However, it is non-trivial to choose the threshold. One solution is to set the threshold according to the maximal false alarm rate that can be tolerated, which depends on the application. A ROC curve plots the true alarm rate against the false alarm rate at various threshold settings. Figure 4 shows the ROC curves of the networks CNN-MNIST and CNN-CIFAR on the first 100 inputs from each category. The result on MNIST is good on all kinds of adversarial examples. However, on CIFAR10, our method is not very helpful on FGSM attacks. The reason, we believe, is that the accuracy of our CNN on CIFAR10 is not high enough.
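Choosing a threshold from a tolerated false alarm rate can be sketched as below. This is an illustration under stated assumptions, not the procedure used to produce Figure 4: an input is rejected when its radius is strictly below the threshold, candidate thresholds are the observed radii themselves, and the radii in the example are made-up numbers.

```python
def roc_points(valid_radii, invalid_radii):
    """Return (threshold, false_alarm_rate, true_alarm_rate) per candidate threshold."""
    points = []
    for th in sorted(set(valid_radii) | set(invalid_radii)):
        false_alarm = sum(r < th for r in valid_radii) / len(valid_radii)
        true_alarm = sum(r < th for r in invalid_radii) / len(invalid_radii)
        points.append((th, false_alarm, true_alarm))
    return points


def pick_threshold(valid_radii, invalid_radii, max_false_alarm):
    """Best-detecting ROC point whose false alarm rate stays within the budget."""
    feasible = [p for p in roc_points(valid_radii, invalid_radii)
                if p[1] <= max_false_alarm]
    # Prefer a higher true alarm rate, then a lower false alarm rate.
    return max(feasible, key=lambda p: (p[2], -p[1]))


# Made-up radii for illustration only.
valid = [0.012, 0.015, 0.018, 0.020, 0.009]
invalid = [0.001, 0.002, 0.004, 0.010, 0.006]

strict = pick_threshold(valid, invalid, max_false_alarm=0.0)
relaxed = pick_threshold(valid, invalid, max_false_alarm=0.2)
```

With these toy numbers, allowing one valid input in five to be rejected is enough to reject every invalid input, while a zero false alarm budget leaves one invalid input undetected.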

Until now, we have only studied the first 100 inputs in each category. In Table 2, we show the effect of different choices of thresholds on the network CNN-MNIST on the first 100 and on 100 random inputs in each category. To be more specific, we show, for different thresholds (column Th.), the percentage of rejected valid inputs (column Vic.), the percentage of rejected misclassified inputs (column W), and the percentage of rejected adversarial examples from the FGSM attack with $\epsilon = 0.1$ (column F ($\epsilon = 0.1$)), the FGSM attack with $\epsilon = 0.05$ (column F ($\epsilon = 0.05$)), the C&W attack and the HOP attack. The results on the first 100 inputs and on the 100 random inputs are on the left and right sides of "/" respectively in each cell. This table shows that the observations on the first 100 inputs of each category are also valid on the whole test dataset.

Figure 4: ROC curves on MNIST and CIFAR. Panels: (a) ROC curve on MNIST; (b) ROC curve on CIFAR.

The benefit of validation by threshold is that once the threshold is decided, we just need to call the robustness check once to test whether the approximate robustness radius of an input is above the threshold.
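A runtime guard built on this idea could look like the following sketch. Both callables are hypothetical stand-ins: `predict` is the classifier and `is_robust(x, delta)` is a single robustness check at the chosen threshold; the toy one-dimensional versions exist only to make the example runnable.

```python
def guarded_predict(predict, is_robust, x, threshold):
    """Return the prediction for x, or None if x fails the robustness check."""
    if not is_robust(x, threshold):
        return None  # rejected: robustness within the threshold was not proven
    return predict(x)


# Toy 1-D stand-ins: the classifier is sign(x) and the exact radius is |x|.
predict = lambda x: 1 if x >= 0 else -1
is_robust = lambda x, delta: abs(x) > delta

far = guarded_predict(predict, is_robust, 0.5, threshold=0.1)
near = guarded_predict(predict, is_robust, -0.05, threshold=0.1)
```

Here the input far from the decision boundary is classified, while the one close to it is rejected, and no binary search is needed.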

4 Distribution of The Approximate Robustness Radius of Valid Inputs

Th. Vic. W F (ε=0.1) F (ε=0.05) C&W HOP
0.002 0/0 14/14 17/13 27/24 48/47 91/99
0.004 0/0 28/29 42/38 57/56 77/79 100/100
0.006 0/0 49/48 62/59 87/79 89/89 100/100
0.008 2/1 66/69 88/79 100/95 96/99 100/100
0.010 3/4 75/80 95/97 100/99 100/100 100/100
0.012 8/7 89/90 99/100 100/100 100/100 100/100
0.014 13/16 97/95 100/100 100/100 100/100 100/100
0.016 24/28 100/100 100/100 100/100 100/100 100/100
Table 2: The rejection rates with different thresholds

One thing concerns us: if the attackers have knowledge of our neural network and our detection method, they can generate adversarial examples with large approximate robustness radii on purpose (even though we believe that such adversarial examples can hardly be found on a neural network with high accuracy). To avoid this, we study further whether the approximate robustness radii of valid inputs follow a certain distribution. If they do, then the attackers not only need to generate adversarial examples with large enough robustness radii, but also need to make sure that such robustness radii follow a certain distribution, which is much harder. Observing Figure 3, one thought is that the exact/approximate robustness radii follow a normal distribution. To test that, we compute the approximate robustness radii of the first 100 valid inputs for all networks in Table 1. We test whether they follow a normal distribution by D’Agostino and Pearson’s test (D’Agostino and Pearson, 1973), which returns a p-value (shown in column P-value). In statistics, a p-value (or probability value) is the probability of obtaining results at least as extreme as the observed results of a test, assuming that the null hypothesis is correct. If the p-value is larger than the chosen significance level, then the radii are believed to follow a normal distribution. We can see that 25 networks follow normal distributions, but 9 do not. We cannot give a conclusion on what factors make the difference, but it seems that a medium size network with high accuracy usually enjoys this property. It is worth mentioning that the exact robustness radii of the FNN-MNIST network do not pass D’Agostino and Pearson’s test either, the same as its approximate robustness radii, even though they look like a normal distribution in Figure 3.
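For readers unfamiliar with the test, its statistic can be summarized as follows. This is the standard textbook form, stated here as background rather than taken from the original text: $g_1$ and $g_2$ denote the sample skewness and kurtosis, and $Z_1$, $Z_2$ are the transformations that make each approximately standard normal under the null hypothesis of normality.

```latex
K^2 = Z_1(g_1)^2 + Z_2(g_2)^2,
\qquad
K^2 \;\overset{\text{approx.}}{\sim}\; \chi^2_2 \quad \text{under the null hypothesis},
\qquad
p = \Pr\!\left[\chi^2_2 \ge K^2\right] = e^{-K^2/2}.
```

A small p-value therefore indicates that the skewness, the kurtosis, or both deviate from what a normal sample of that size would be expected to show.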

5 Input Validation with Observation II

If the approximate robustness radii of the valid inputs on a network follow a normal distribution, we can utilize this to improve our input validation method.

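  Input: network f
   Q: a queue (the size of which is n) of valid inputs
  repeat
     Input: input x
     append x to the end of Q
     if the validity check on Q fails then
        delete x (the last element) from Q
     else
        delete the first element from Q
     end if
  until END OF INPUT
Algorithm 2 Validation by distribution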

The new algorithm is shown in Algorithm 2. It maintains a sliding window (Rebbapragada et al., 2009) of fixed size containing the inputs believed to be valid. When a new input comes, the algorithm checks with a validity function whether it breaks the original normal distribution. If it does, then this new input is deleted from the window; otherwise the first element is deleted, thus the window slides. The design of the validity function is heuristic, and the one we propose combines our two observations. It returns true if the approximate robustness radius of the last input is larger than a threshold (from Observation I) or the p-value of the new sliding window does not drop sharply from the last one (from Observation II). However, we do not have a principled method to decide its two parameters (the radius threshold and the tolerated p-value drop); on the network CNN-MNIST, we set them, together with the window size, by hand. To test this algorithm, we take 100 valid inputs, 1 misclassified input (because the accuracy is 98.62%), and 100 adversarial examples from each of the four types of attacks (400 in total) as the inputs. If all the inputs come in sequence, our algorithm can reject all adversarial examples and the misclassified input, with only 3 valid inputs rejected. However, if we shuffle the inputs randomly, the average numbers of rejected valid inputs and adversarial examples are 5 and 28 respectively (over 10 runs). Actually, in both cases, the first condition (the radius threshold) accepts 87 valid inputs and rejects all invalid inputs. The second condition accounts for the other accepted valid inputs and the false positives.
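The sliding-window validation can be sketched as follows. Several details are assumptions, since the text only describes the validity function informally: "does not drop sharply" is modeled as the new p-value being at least `max_drop` times the previous one, the "new sliding window" is taken to be the old window plus the new input, and `radius_of` and `pvalue_of` are hypothetical callables for the radius computation and the normality test.

```python
from collections import deque


def validate_stream(inputs, window, radius_of, pvalue_of, threshold, max_drop):
    """Yield (x, accepted) per input; `window` holds inputs believed to be valid."""
    window = deque(window)
    last_p = pvalue_of([radius_of(v) for v in window])
    for x in inputs:
        candidate = list(window) + [x]
        new_p = pvalue_of([radius_of(v) for v in candidate])
        # Observation I: large radius. Observation II: p-value does not drop sharply.
        accepted = radius_of(x) > threshold or new_p >= max_drop * last_p
        if accepted:
            window.append(x)
            window.popleft()  # drop the oldest element, so the window slides
            last_p = pvalue_of([radius_of(v) for v in window])
        yield x, accepted


# Toy stand-ins: inputs are their own radii, and the "p-value" is a made-up
# score that shrinks as the spread of the radii grows (not a real normality test).
radius_of = lambda x: x
toy_pvalue = lambda radii: 1.0 / (1.0 + 100.0 * (max(radii) - min(radii)))

results = list(validate_stream(
    inputs=[0.011, 0.001, 0.03],
    window=[0.010, 0.011, 0.012],
    radius_of=radius_of,
    pvalue_of=toy_pvalue,
    threshold=0.02,
    max_drop=0.9,
))
decisions = [accepted for _, accepted in results]
```

In a real deployment `pvalue_of` could wrap a normality test such as `scipy.stats.normaltest(radii).pvalue`, which is based on D’Agostino and Pearson’s test, and `radius_of` would run the binary search over the incomplete verifier.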

The disadvantage of validation by distribution is that it needs to compute the approximate robustness radius with Algorithm 1, which requires several iterative calls to the robustness check and therefore takes more time. However, the time complexity of the robustness check for incomplete verification is polynomial with respect to the number of neurons, and the potential of its speed is far from fully explored (e.g., the GPU is not utilized).

6 Related Work

Some researchers focus on finding new adversarial attacks. According to whether the attackers have full knowledge of the target neural networks, adversarial attacks can be divided into two types: white-box attacks and black-box attacks. Most adversarial attacks, including the first one (i.e., L-BFGS (Szegedy et al., 2014)), are white-box attacks. White-box attacks can be fast (e.g., FGSM (Goodfellow et al., 2014)) or strong (i.e., finding adversarial examples close to the original inputs, e.g., DeepFool (Moosavi-Dezfooli et al., 2016), C&W (Carlini and Wagner, 2017), Ground-truth attack (Carlini et al., 2018)). Black-box attacks usually need more computational power (e.g., Hopskipjump attack (Chen et al., 2019), ZOO (Chen et al., 2017)). Because of transferability (Papernot et al., 2016), white-box attacks can be transformed into black-box attacks.

There are many countermeasures for adversarial attacks, among which verification and adversarial detection are most related to our work.

Verification methods check whether a neural network satisfies a given formally defined property before it is deployed. Such properties include safety constraints (Katz et al., 2017) and robustness (Ehlers, 2017; Dutta et al., 2018). However, due to the non-linearity of activation functions, complete verification is NP-hard, and thus can hardly scale. Incomplete verification sacrifices the ability to falsify a property so as to gain performance. Current incomplete verifiers (Wang et al., 2018b; Singh et al., 2019) can deal with neural networks of thousands of neurons in seconds. However, both verifications can only prove local robustness properties (Huang et al., 2017), rather than global robustness properties. Thus these verifiers can only give metrics on evaluating how robust a neural network is, rather than proving that a neural network is robust.

Adversarial detection methods make use of characteristics of adversarial examples. Feinman et al. (2017) found that the uncertainty of adversarial examples is higher than that of clean data, and utilized a Bayesian neural network to estimate it. Song et al. (2017) found that the distribution of adversarial examples differs from that of clean data. Compared to their methods, which focus on the inputs, our method computes the accumulative gradient information of the neural network in the regions around the inputs. Wang et al. (2018a) proposed to detect adversarial examples by mutation testing, based on the belief that they are not robust to mutations. Their method shares a similar intuition with ours, namely that adversarial examples must be corner cases in the input space. However, we utilize local robustness verification, which takes the whole region around an input into account, instead of testing, which considers only some points near an input. Lu et al. (2017) distinguish adversarial examples from clean data by thresholds on their values at each ReLU layer. Henzinger et al. (2019) proposed to detect novel inputs by observing the hidden layers, i.e., whether their values fall outside the ranges seen during training. Because these works are not open source and their results are often reported as graphs (such as ROC curves), a fair comparison with them is difficult. However, judging from the results in their papers, our method is comparable to (if not better than) theirs, especially on strong attacks. Our method utilizes only the value of the robustness radius to validate inputs, and can also be seen as an anomaly detection technique (Chandola et al., 2009).

7 Conclusion

The exact/approximate robustness radius reflects the accumulative gradient information of the neural network in the region around an input. We believe that adversarial examples often lie in regions with high accumulative gradients. Based on this belief, we observed the exact/approximate robustness radii of valid inputs and of misclassified (possibly adversarial) inputs. We found that (1) the exact/approximate robustness radii of valid inputs are much larger than those of misclassified inputs and adversarial examples; (2) the approximate robustness radii of valid inputs can follow a normal distribution. Based on these two observations, we proposed a new way to validate inputs. Our experiments showed that the method is very effective in improving the accuracy of neural networks and protecting them from adversarial examples. Moreover, we believe that even if attackers know our method, they can hardly attack the protected neural networks, because doing so requires generating adversarial examples whose approximate robustness radii are large enough and follow a normal distribution.
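
The two observations suggest a simple acceptance check, sketched below under stated assumptions. The decision rule (reject an input whose radius falls in the lower tail of a normal distribution fitted to the radii of valid inputs), the function names, the `alpha` cutoff, and the sample radii are all illustrative choices for this sketch. They are not the paper's exact procedure or its experimental data.

```python
from statistics import NormalDist
from typing import Sequence


def fit_valid_radii(radii: Sequence[float]) -> NormalDist:
    """Fit a normal distribution to robustness radii of known valid inputs."""
    return NormalDist.from_samples(radii)


def is_accepted(radius: float, valid_dist: NormalDist, alpha: float = 0.05) -> bool:
    """Accept an input unless its radius lies in the lower `alpha` tail.

    A small radius is treated as suspicious because misclassified and
    adversarial inputs are observed to have much smaller radii.
    """
    return valid_dist.cdf(radius) >= alpha


# Made-up radii for illustration only (not measurements from the paper).
valid_radii = [0.028, 0.030, 0.032, 0.031, 0.029, 0.030, 0.033, 0.027]
valid_dist = fit_valid_radii(valid_radii)

typical_ok = is_accepted(0.030, valid_dist)     # radius near the fitted mean
suspicious_ok = is_accepted(0.020, valid_dist)  # radius far below the mean
```

With these toy values, the input whose radius sits at the fitted mean is accepted and the one far below it is rejected. A one-sided test is used here because only unusually small radii are treated as evidence of misclassification.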

References

  • M. Abbasi and C. Gagné (2017) Robustness to adversarial examples through an ensemble of specialists. arXiv preprint arXiv:1702.06856. Cited by: §1.
  • J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani (2017) Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476. Cited by: §1.
  • N. Carlini, G. Katz, C. Barrett, and D. L. Dill (2018) Ground-truth adversarial examples. Cited by: §6.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1, §6.
  • D. Castelvecchi (2016) Can we open the black box of AI?. Nature News 538 (7623), pp. 20. Cited by: §1.
  • V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §6.
  • J. Chen, M. I. Jordan, and M. J. Wainwright (2019) Hopskipjumpattack: a query-efficient decision-based attack. arXiv preprint arXiv:1904.02144. Cited by: §1, §6.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. Cited by: §6.
  • R. D’Agostino and E. S. Pearson (1973) Tests for departure from normality. Empirical results for the distributions of b2 and √b1. Biometrika 60 (3), pp. 613–622. Cited by: §4.
  • A. Desai, S. Ghosh, S. A. Seshia, N. Shankar, and A. Tiwari (2018) SOTER: programming safe robotics system using runtime assurance. arXiv preprint arXiv:1808.07921. Cited by: §1.
  • S. Dutta, S. Jha, S. Sankaranarayanan, and A. Tiwari (2018) Output range analysis for deep feedforward neural networks. In NASA Formal Methods Symposium, pp. 121–138. Cited by: §2.1, §2.3, §6.
  • R. Ehlers (2017) Formal verification of piece-wise linear feed-forward neural networks. In Automated Technology for Verification and Analysis, D. D’Souza and K. Narayan Kumar (Eds.), Cham, pp. 269–286. External Links: ISBN 978-3-319-68167-2 Cited by: §1, §2.1, §2.3, §6.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §6.
  • M. Fischetti and J. Jo (2018) Deep neural networks and mixed integer linear optimization. Constraints 23 (3), pp. 296–309. Cited by: §2.1.
  • T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev (2018) Ai2: safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §2.1, §6.
  • K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §1.
  • T. A. Henzinger, A. Lukina, and C. Schilling (2019) Outside the box: abstraction-based monitoring of neural networks. arXiv preprint arXiv:1911.09032. Cited by: §6.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §2.3.
  • X. Huang, M. Kwiatkowska, S. Wang, and M. Wu (2017) Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pp. 3–29. Cited by: §1, §6.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175. Cited by: §1.
  • G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: an efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Cited by: §1, §2.1, §2.3, §6.
  • A. Krizhevsky, V. Nair, and G. Hinton (2009) CIFAR-10 (Canadian Institute for Advanced Research). Cited by: §2.1, §2.3.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §2.1, §2.3, §2.3.
  • J. Lu, T. Issaranon, and D. Forsyth (2017) Safetynet: detecting and rejecting adversarial examples robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 446–454. Cited by: §1, §6.
  • B. Luo, Y. Liu, L. Wei, and Q. Xu (2018) Towards imperceptible and robust adversarial example attacks against neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.
  • D. Meng and H. Chen (2017) MagNet: a two-pronged defense against adversarial examples. Cited by: §1.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §6.
  • M. Nicolae, M. Sinn, M. N. Tran, B. Buesser, A. Rawat, M. Wistuba, V. Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I. Molloy, and B. Edwards (2018) Adversarial robustness toolbox v1.0.1. CoRR 1807.01069. Cited by: §2.1.
  • N. Papernot, P. McDaniel, and I. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277. Cited by: §6.
  • U. Rebbapragada, P. Protopapas, C. E. Brodley, and C. Alcock (2009) Finding anomalous periodic time series. Machine learning 74 (3), pp. 281–313. Cited by: §5.
  • G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. Vechev (2018a) Fast and effective robustness certification. In Advances in Neural Information Processing Systems, pp. 10802–10813. Cited by: §2.1, §2.3, §2.3, §2.3.
  • G. Singh, T. Gehr, M. Püschel, and M. Vechev (2018b) Boosting robustness certification of neural networks. Cited by: §1, §2.2.
  • G. Singh, T. Gehr, M. Püschel, and M. Vechev (2019) An abstract domain for certifying neural networks. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 41. Cited by: §2.1, §6.
  • Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2017) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §6.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. Cited by: §1, §6.
  • J. Wang, J. Sun, P. Zhang, and X. Wang (2018a) Detecting adversarial samples for deep neural networks through mutation testing. arXiv preprint arXiv:1805.05010. Cited by: §1, §6.
  • S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana (2018b) Formal security analysis of neural networks using symbolic intervals. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1599–1614. Cited by: §2.1, §2.3, §6.
  • Y. Wang, S. Jha, and K. Chaudhuri (2017) Analyzing the robustness of nearest neighbors to adversarial examples. arXiv preprint arXiv:1706.03922. Cited by: §1.
  • T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel (2018) Evaluating the robustness of neural networks: an extreme value theory approach. arXiv preprint arXiv:1801.10578. Cited by: §1, §2.3.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §2.3.
  • X. Yuan, P. He, Q. Zhu, and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems. Cited by: §1.