1 Background and Related Works
Over the years, the development of numerous machine learning (ML) frameworks, along with the availability of vast amounts of data, has made the task of training DNNs quite simple. Despite this, researchers have only scratched the surface of such networks, which are still treated as black-box entities. The possibility of unveiling hidden properties and behaviors of DNNs has started a real arms race. Since the discovery of adversarial examples, an ever-increasing number of researchers have attempted to design new, more powerful attacks capable of defeating the current state-of-the-art defenses. In this section we describe some of the most important attacks and defensive strategies present in the literature.
1.1 Adversarial Attacks
The development of an attack method requires profound knowledge of the intrinsic properties of DNN systems. Adversarial examples are, indeed, the result of an optimization algorithm over the input domain, and the possibility of fine-tuning such an optimization procedure opens up a variety of different scenarios.
Formally, it is possible to define the task of generating a malicious input x̃ out of a legitimate input x as the act of finding the perturbation r such that

x̃ = x + r (1)

leads to a different classification result, i.e. f(x̃) ≠ f(x), where f denotes the classification function. The perturbation r is generally the result of an optimization algorithm.
The Fast Gradient Sign Method (FGSM) attack was originally designed to speed up the generation process . It does so by performing a single step along the direction of the gradients w.r.t. the loss function calculated on a generic clean input. Equation (2) shows the computation of the perturbation:

r = ε · sign(∇_x J(x, y)), (2)

where y is the ground-truth label associated with input x and J represents the loss function. The magnitude of the step is the same for every pixel and is controlled by the step-size parameter ε.
FGSM is one of the earliest attack techniques ever developed. It is very fast compared to other methods but produces more aggressive perturbations, i.e., the malicious perturbations are often visible to the human eye. However, it is interesting to notice how such a greedy technique can effectively undermine the stability of a neural network. This is also confirmed by the manifold variants which succeeded FGSM, attempting to improve the efficiency of the method [9, 16, 19, 4, 10].
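As a minimal illustration (not the networks or datasets considered in this work), the single-step computation of Eq. (2) can be sketched on a toy logistic model, for which the input gradient of the cross-entropy loss is available in closed form:

```python
import numpy as np

def fgsm(x, y, w, eps):
    """One-step FGSM against a toy logistic model p(y=1|x) = sigmoid(w @ x).

    Every input component moves by the same magnitude eps, in the
    direction (sign of the input gradient of the loss) that increases
    the loss, matching Eq. (2).
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x)))   # model confidence for class 1
    grad = (p - y) * w                   # closed-form dJ/dx for cross-entropy
    return x + eps * np.sign(grad)

# a clean input correctly classified as class 1 (w @ x > 0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.8, -0.5, 0.3])
x_adv = fgsm(x, y=1.0, w=w, eps=0.6)
print(w @ x > 0, w @ x_adv > 0)  # decision on clean vs adversarial input
```

With a large enough step size, the single greedy step is sufficient to flip the decision of this linear model, mirroring how FGSM undermines much deeper networks.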
Other authors tried to refine the algorithm to achieve the smallest adversarial perturbation (in terms of magnitude). Moosavi-Dezfooli et al.  believed that the geometrical representation of DNN models is a powerful tool for understanding the nature of such malicious samples. They leveraged hyperplane theory to represent the decision boundaries imposed by a network in the high-dimensional input domain. Finally, they created DeepFool (DF), an iterative procedure that achieves the smallest perturbation moving a natural input just beyond its closest decision boundary. Compared with FGSM, it produces much finer perturbation masks, at the expense of higher computational time. Moreover, since it pushes the input just beyond the closest decision boundary, it does not allow accurate control over the effectiveness of the attack – sometimes it results in uncertain malicious samples because of their close proximity to the decision boundary.
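For intuition, the DeepFool projection can be sketched for a hypothetical affine binary classifier, the simplest case in which the closest decision boundary is a single hyperplane (the names and the overshoot value below are illustrative; for a deep network the loop re-linearizes around each new point):

```python
import numpy as np

def deepfool_binary(x, w, b, overshoot=0.02, max_iter=50):
    """DeepFool-style minimal perturbation for an affine binary
    classifier f(x) = w @ x + b.

    Each iteration projects the current point just beyond the closest
    point on the decision hyperplane; for an affine model a single
    projection is enough to cross the boundary.
    """
    x_adv = x.copy()
    orig_side = np.sign(w @ x + b)
    for _ in range(max_iter):
        f = w @ x_adv + b
        if np.sign(f) != orig_side:       # boundary crossed: stop
            break
        r = -f / (w @ w) * w              # closest point on the hyperplane
        x_adv = x_adv + (1 + overshoot) * r
    return x_adv

w, b = np.array([1.0, -1.0]), -0.5
x = np.array([2.0, 0.5])                  # classified on the positive side
x_adv = deepfool_binary(x, w, b)
print(np.sign(w @ x + b), np.sign(w @ x_adv + b))
```

The resulting perturbation is the minimal-norm step to the hyperplane (scaled by the small overshoot), which is why DeepFool samples sit barely beyond the decision boundary.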
Carlini and Wagner  presented an effective attack technique to be used as a benchmark against new attacks/defenses. They approached the adversarial generation process mathematically, providing new ways to address the optimization problem. The resulting attack method is an iterative process named after its inventors, Carlini&Wagner (C&W). Compared to DF, it has the advantage of providing the attacker with several hyper-parameters to tune in order to achieve adversarial samples with a broad variety of properties. C&W is currently considered one of the state-of-the-art attack techniques, as confirmed by the broad variety of works present in the literature [20, 12, 3].
Finally, it is worth introducing the definition of two common attack settings: white-box and black-box. The former assumes an attacker who has full access to the classifier, including the backpropagation signal, while the latter assumes that the attacker does not have access to the attacked model.
1.2 Defensive Strategies
The ability to reliably defend against adversarial inputs is of paramount importance to enable massive deployment of DNN-based solutions, especially in sensitive areas. Over the years, researchers have developed various approaches which can be divided into two main families: proactive and reactive. Strategies of the former type aim at making the defended network more robust during the design and/or training process, whilst approaches from the latter family are deployed after the classifier has already been trained, i.e. without tampering with the design/training process.
Some researchers believe that learning from adversarial examples increases DNN robustness without significantly harming performance. Goodfellow et al.  proposed to introduce a factor accounting for adversarial examples into the loss function used during training. Adversarial training of a network is a proactive solution: it requires generating malicious inputs live – for the currently learned network configuration – and uses the loss produced to perform the next optimization step. In this way, one can increase the model's generalization power to adversarially manipulated inputs. This technique has proven to successfully increase robustness at the expense of increased training time. In addition, adversarial training confers improved generalization power, as reported in  and .
Xu et al. , instead, realized that it is possible to lower the degrees of freedom available to an attacker by reducing the input space dimensionality. They proposed to detect malicious samples by using distortions to squeeze the pixel domain. The resulting system, named Feature Squeezing (FS), acts as a reactive defense: when the classifier receives a generic input, it computes the squeezed version and compares the classifier outcomes of both the original and squeezed inputs. Then, a thresholding operation decides whether the input is adversarial or legitimate. FS possesses the advantage of being a low-cost, attack-agnostic technique which affects neither the model architecture nor the training procedure. On the other hand, it requires querying the classifier multiple times, thus resulting in higher inference time.
Liao et al.  considered the possibility of reconstructing the original, legitimate input from an adversarial one. They proposed to clean the classifier's input by means of a data-driven pre-processing entity. The devised denoiser was trained exploiting information from higher levels of the defended network, hence the name High-level representation Guided Denoiser (HGD). The authors showed that their proposal efficiently solves the adversarial issue with good transferability properties. Unfortunately, when deployed along with the classifier, it creates a new DL system which is very prone to attack in white-box settings. Moreover, it is not attack-agnostic, since it requires adversarial samples to train the denoiser.
2 Our Method
In this section, we provide a description of our defense approach. We believe that the deployment of an adversarial examples detector represents the most interesting solution to defend DNN models. Indeed, this external module does not affect the classifier’s performance and can be activated only when required. Therefore, we designed a detector that exploits the robustness of DNN models to greedy distortions.
As in , we first noticed that, after being distorted, adversarial examples manifest different statistics than legitimate ones. Since it is well known that DNNs are invariant to several distortions, we thought it could be possible to exploit such distortions to build a meaningful signature for the detection task. In our work we propose to concatenate the probability vectors output by the classifier when queried with the distorted replicas of the received input.
The extracted signature is then compared with a reference vector in order to assess the input nature. We decided to use class-representative vectors: in particular, per-class first-order statistics, so that legitimate samples belonging to a given class must exhibit statistics similar to the class-representative vector. The reference vectors are computed on the same set used to train the network: the detector extracts the signature vector for every training sample and then averages across samples from the same class. Equation (3) contains the formal definition of the described procedure:

s(x) = [f(d_1(x)), …, f(d_D(x))],   v_c = (1/|X_c|) ∑_{x ∈ X_c} s(x),   c = 1, …, C, (3)

where f is the function learned by the network, {d_1, …, d_D} the set of distortions used, s the function building the signature, |X_c| the dimensionality operator returning the number of samples in the set X_c of training samples from class c, and C and D the total number of classes and distortions, respectively.
We use the class predicted on the original input to select the class-specific vector. Fig. 1 shows the framework of the detector described.
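The signature extraction and per-class reference computation of Eq. (3) can be sketched as follows; the stand-in classifier and the toy distortions below are purely illustrative assumptions, not the models or distortions used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(x):
    """Stand-in classifier returning a 3-class softmax vector."""
    logits = np.array([x.sum(), x.mean(), x.max()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

distortions = [
    lambda x: np.round(x * 7) / 7,   # toy bit-depth reduction
    lambda x: x * 0 + x.mean(),      # crude global smoothing stand-in
]

def signature(x):
    """Concatenate the probability vectors produced on each distorted
    replica of the input: the s(.) of Eq. (3)."""
    return np.concatenate([classify(d(x)) for d in distortions])

def reference_vectors(train_x, train_y, n_classes):
    """Per-class first-order statistics: average the signatures of all
    training samples sharing the same label."""
    refs = np.zeros((n_classes, len(signature(train_x[0]))))
    for c in range(n_classes):
        sigs = [signature(x) for x, y in zip(train_x, train_y) if y == c]
        refs[c] = np.mean(sigs, axis=0)
    return refs

train_x = [rng.random(8) for _ in range(20)]
train_y = [i % 3 for i in range(20)]
refs = reference_vectors(train_x, train_y, n_classes=3)
print(refs.shape)  # one reference vector of length D * C per class
```

Each reference vector has length D·C (number of distortions times number of classes), since it concatenates one probability vector per distortion.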
The scheme described so far forces adversarial examples to produce signatures which are orthogonal to the corresponding reference vector (for a proper choice of the distortions). This discovery led us to choose the normalized projection score (a.k.a. cosine similarity) as metric in the comparison, hence quantifying the similarity between signature and class-representative statistics. Equation (4) formally defines the computation:

score(x) = ⟨s(x), v_ĉ⟩ / (‖s(x)‖_2 · ‖v_ĉ‖_2), (4)

where ĉ is the predicted class for the original input, ‖·‖_2 is the ℓ_2-norm and s – for brevity – the function building the signature in Eq. (3). Technically, it computes the cosine of the angle between the two vectors under comparison.
The higher the score is, the more aligned the two vectors are, and the more likely the input is to be legitimate.
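The score of Eq. (4) and the final thresholding step reduce to a few lines; the function names and the threshold value below are illustrative:

```python
import numpy as np

def projection_score(sig, ref):
    """Normalized projection score of Eq. (4): the cosine of the angle
    between the signature and the class-representative vector."""
    return (sig @ ref) / (np.linalg.norm(sig) * np.linalg.norm(ref))

def is_legitimate(sig, ref, tau=0.5):
    """Thresholding step: accept inputs whose signature stays aligned
    with the reference vector of the predicted class."""
    return projection_score(sig, ref) >= tau

aligned = projection_score(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthog = projection_score(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
print(aligned, orthog)  # aligned vectors score 1.0, orthogonal ones 0.0
```

Note that the score is invariant to the magnitude of the vectors, so only the shape of the signature relative to the class statistics matters.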
Fig. 2 contains the histogram produced by our detector on the GTSRB dataset . Legitimate samples get scores close to 1, whilst adversarial ones are almost all close to orthogonal (scores near 0). This proves the validity of our detection scheme.
Before delving into the results section, it is worth discussing the distortions used in our work. In order to enable a fair comparison with previous works, we borrowed some of the distortions from , namely median filtering and bit-depth reduction. The former consists of a simple spatial smoothing operation, replacing each pixel intensity with the median value of its neighbors, whilst the latter re-quantizes the intensity values with a reduced number of bits – common camera sensors acquire images with 8-bit resolution per channel. In addition, we devised a new distortion which converts RGB images to the corresponding gray-scale version. In order to maintain the model's information flow, the distortion rearranges the input as expected by the classifier: several copies of the single-channel gray-scale image are stacked on top of each other to generate a 3-channel input.¹ We will refer to the described distortion as gray-scale.
¹The distortion was tested only on RGB color-space representations.
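For [0,1]-valued RGB images, the three distortions could be implemented roughly as follows; this is a pure-NumPy sketch, and the kernel size, bit depth, and luminance weights are illustrative choices rather than the exact settings used in the experiments:

```python
import numpy as np

def bit_depth_reduce(img, bits):
    """Re-quantize [0,1] intensities to `bits` bits per channel."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def median_smooth(img, k=3):
    """Per-channel k x k median filtering with reflect padding."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    out = np.empty_like(img)
    h, w, _ = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k], axis=(0, 1))
    return out

def gray_scale(img):
    """Convert RGB to luminance, then stack three copies so the
    classifier still receives a 3-channel input."""
    lum = img @ np.array([0.299, 0.587, 0.114])  # BT.601-style weights
    return np.repeat(lum[..., None], 3, axis=2)

img = np.random.default_rng(2).random((8, 8, 3))
g = gray_scale(img)
b = bit_depth_reduce(img, 3)
m = median_smooth(img)
print(g.shape, b.shape, m.shape)
```

All three distortions preserve the input shape, which is what allows the distorted replicas to be fed straight back into the unmodified classifier.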
3 Experimental Results
In this section we describe the experimental setup used and report the results of the experiments carried out.
Our proposal was tested on two different datasets: GTSRB  and CIFAR10 . The former contains a collection of 43 traffic signs organized in approximately 40k training and 13k test samples. The samples come in RGB format with different shapes; every image is resized to 46x46x3 before being fed to the classifier. The latter consists of 10 different classes and provides 50k training and 10k test samples. The images come in RGB format of shape 32x32x3.
A critical point during the evaluation of any defensive strategy is the selection of the attack methods. In this work we decided to focus on three different attacks for different reasons:
DF represents a very strong technique, able to generate adversarial inputs with imperceptible perturbations.
C&W is the current state-of-the-art attack for benchmarking new defenses; beyond producing invisible perturbations, it allows the attacker to control the effectiveness of the attack through the customizable confidence parameter κ.
FGSM produces very aggressive perturbations, sometimes visible to humans; however, it is of interest to study the detector behavior against such kind of malicious inputs.
The adversarial sets were generated in both white- and black-box attack settings (crafted from a substitute model trained on the same dataset). Table 1 reports the performance of the trained models on both legitimate and adversarial test-sets. The number beside C&W specifies the value of the confidence parameter – e.g. C&W9 has κ = 9 – while the number beside FGSM indicates the magnitude of the step-size parameter ε. The substitute models were devised to have more capacity than the corresponding victim.
In terms of results, we compare our method with Feature Squeezing (FS), which represents the closest state-of-the-art solution . Indeed, it shares with our method the property of being a low-cost, model-free technique, thus enabling a fair comparison.
For each experiment, we chose to extract the detection ROC curves and provide the comparison as AUC scores in order to be independent of the threshold level. The ROC curves were computed pairing the attack-set with an equal number of correctly predicted samples from the original test-set.
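The AUC can be computed without explicitly tracing the ROC curve, using its rank interpretation: the probability that a randomly chosen legitimate sample scores higher than a randomly chosen adversarial one. A sketch (the score values below are made up for illustration):

```python
import numpy as np

def auc_score(scores_legit, scores_adv):
    """Threshold-independent AUC: the fraction of (legitimate,
    adversarial) pairs ranked correctly by the detector's score,
    with ties counted as half. Equivalent to the area under the
    ROC curve."""
    s_l = np.asarray(scores_legit)[:, None]
    s_a = np.asarray(scores_adv)[None, :]
    return np.mean(s_l > s_a) + 0.5 * np.mean(s_l == s_a)

print(auc_score([0.9, 0.8, 0.95], [0.1, 0.2, 0.85]))  # 8 of 9 pairs correct
```

A score of 1.0 means perfect separation at some threshold, while 0.5 corresponds to a detector no better than chance.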
Unless otherwise stated, every detector was tested using both of the distortions found in the literature, namely median filtering and bit-depth reduction.
3.1 White-box Results
Fig. 3 shows the ROC curves for both FS and our detector on the CIFAR10 dataset. The improvement introduced by our method is very clear: its curves pass closer to the top-left corner, indicating better detection at every operating threshold.
Table 2 reports the AUC scores produced for all the white-box attack-sets. It confirms that our detector outperforms FS on the strongest attack-sets, i.e. C&W and DF, whilst FS performs as well as our method (or better) when tested against the FGSM attacks. However, the latter aspect does not represent a serious issue, as shown in section 3.2.
Furthermore, bearing Table 1 in mind, it is interesting to see how the effectiveness of FS significantly drops when the classifier to defend comes with more uncertainty (i.e. the CIFAR10 model), suggesting that our detector is more robust to changes in performance of the defended model.
3.2 Compatibility with Adversarial Training
The previous subsection showed that our detector is highly effective against attacks which produce fine perturbations. Simultaneously, it showed a lack of efficiency in detecting more aggressive attacks. Luckily, the literature is full of solutions capable of tackling such malicious inputs.
In this section we compare the performance of FS and our detector when deployed along with another defensive strategy: adversarial training . In more detail, the classifier to defend was fine-tuned for a few epochs, substituting half of the samples in each mini-batch with the corresponding FGSM4 adversarial version crafted against the current network configuration. Every adversarial attack-set was then computed again on the final, adversarially trained version of the network.
Table 3 contains the performance of the models with and without adversarial training. As expected, the adversarially trained models are significantly more robust to FGSM attacks, suggesting that the networks learned the heavy pattern of the adversarial perturbations. On the other hand, we notice a drawback in the classifier's performance, manifesting as a slight drop in accuracy. Careful tuning of the adversarial training process can prevent this effect.
Table 4 reports the AUC scores produced by both FS and our detector, deployed with and without adversarial training. We notice that adversarially training the classifier helps the detection task, especially on the FGSM attack-sets. Moreover, the results show that our approach comes with better compatibility than FS, benefiting the most from adversarial training. Last but not least, it is important to consider the slight accuracy drop of the adversarially trained networks during our analysis, because it accounts for the poorer results against both the C&W and DF attack-sets – as shown in section 3.1.
3.3 Distortions Configuration
In this section we provide an analysis of the detector behavior under different configurations of distortions. We tested both FS and our detector with a single distortion (using those introduced in section 2), with two distortions, and with all three distortions.
The results are presented in Table 5. It is clear that our detector benefits from the use of several distortions: the configurations which use multiple distortions nearly always obtain the best scores. The same does not hold for FS, which achieves its peak performance in the single-distortion settings. These insights suggest that our detector is able to build a meaningful signature for the detection task, hence combining the effects of multiple distortions better than FS does.
Focusing on the single-distortion configurations, we notice that the new gray-scale distortion performs nicely, outperforming bit-depth reduction on the FGSM4 attack-set. This proves that it represents a valid distortion, especially when used in multiple-distortion configurations. Finally, we find that the effectiveness of each distortion is problem-dependent, as certified by the different results achieved on the CIFAR10 and GTSRB datasets.
3.4 Black-box Results
The much lower number of samples which successfully fool the models in black-box settings does not allow us to use the AUC scores as performance quantifiers. Thus, following , we decided to set the threshold such that it rejects only 5% of the legitimate test samples and to collect the detection rates. Table 6 reports the results. It is clear that our detector outperforms FS even in black-box settings. We believe that the use of per-class statistics provides the main contribution in this setting, forcing the reference vector to be well-shaped regardless of the effectiveness of the malicious input.
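The fixed-false-positive-rate evaluation can be sketched as follows; the score values are made up for illustration:

```python
import numpy as np

def detection_rate_at_fpr(scores_legit, scores_adv, fpr=0.05):
    """Pick the threshold that rejects only `fpr` of the legitimate
    samples (scores BELOW the threshold are flagged as adversarial),
    then report the fraction of adversarial samples detected."""
    tau = np.quantile(scores_legit, fpr)   # e.g. 5th percentile of legit scores
    return np.mean(np.asarray(scores_adv) < tau)

legit = np.linspace(0.5, 1.0, 100)   # well-aligned legitimate scores
adv = np.zeros(10)                   # near-orthogonal adversarial scores
print(detection_rate_at_fpr(legit, adv))
```

Fixing the false-positive rate makes the detection rates of different detectors directly comparable even when too few adversarial samples survive to estimate a full ROC curve.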
4 Conclusions
The approach devised results in a cost-effective, reliable detector showing an impressive ability to identify adversarial samples. Compared to FS, our method:
performs better on a broad variety of classifiers.
is more compatible with other defensive techniques such as adversarial training.
combines in a better way the effects coming from multiple distortions.
performs better in both white and black-box settings.
Beyond the experiments and the development presented in this work, there are multiple directions for further improvement.
As suggested in section 3.2, it would be interesting to explore the parameter space of the adversarial training defensive technique, finding out the configuration which benefits the most from our detector. Furthermore, it would be intriguing to see how the proposed detector reacts when combined with other defensive strategies.
Another promising direction consists of merging various distortions into a single one (using less aggressive settings), achieving a distortion able to deal with most of the attack techniques. This would reduce the number of queries to the model, thus reducing the overall inference time.
Finally, we think that the normalized projection score might be used directly as an indicator of the input nature (it is close to 0 for adversarial samples, whilst close to 1 for legitimate ones), multiplying the classification output directly rather than undergoing a thresholding operation.
References
- Carlini and Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57.
- (2015) DeepDriving: learning affordance for direct perception in autonomous driving. CoRR abs/1505.00256.
- (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In AISec@CCS, pp. 15–26.
- (2018) Boosting adversarial attacks with momentum. In CVPR, pp. 9185–9193.
- (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115–118.
- Goodfellow et al. (2015) Explaining and harnessing adversarial examples. In ICLR.
- (2015) Learning with a strong adversary. CoRR abs/1511.03034.
- (2009) Learning multiple layers of features from tiny images. Technical report.
- (2017) Adversarial examples in the physical world. In ICLR Workshops.
- (2017) Adversarial machine learning at scale. In ICLR.
- Liao et al. (2018) Defense against adversarial attacks using high-level representation guided denoiser. In CVPR, pp. 1778–1787.
- (2017) MagNet: a two-pronged defense against adversarial examples. In ACM CCS, pp. 135–147.
- Moosavi-Dezfooli et al. (2016) DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, pp. 2574–2582.
- (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In CVPR, pp. 427–436.
- (2016) Emulating human conversations using convolutional neural network-based IR. CoRR abs/1606.07056.
- (2016) Adversarial diversity and hard positive generation. In CVPR Workshops, pp. 410–417.
- (2011) The German traffic sign recognition benchmark: a multi-class classification competition. In IJCNN, pp. 1453–1460.
- (2014) Intriguing properties of neural networks. In ICLR.
- (2018) Ensemble adversarial training: attacks and defenses. In ICLR.
- Xu et al. (2018) Feature squeezing: detecting adversarial examples in deep neural networks. In NDSS.