Deep Neural Rejection against Adversarial Examples

10/01/2019 ∙ by Angelo Sotgiu, et al. ∙ Università di Cagliari

Despite the impressive performance reported by deep neural networks in different application domains, they remain largely vulnerable to adversarial examples, i.e., input samples that are carefully perturbed to cause misclassification at test time. In this work, we propose a deep neural rejection mechanism to detect adversarial examples, based on the idea of rejecting samples that exhibit anomalous feature representations at different network layers. With respect to competing approaches, our method does not require generating adversarial examples at training time, and it is less computationally demanding. To properly evaluate our method, we define an adaptive white-box attack that is aware of the defense mechanism and aims to bypass it. Under this worst-case setting, we empirically show that our approach outperforms previously-proposed methods that detect adversarial examples by only analyzing the feature representation provided by the output network layer.




I Introduction

Despite their impressive performance on a variety of tasks, it has been known for more than a decade that machine-learning algorithms can be misled by different adversarial attacks, staged either at training or at test time [1, 2]. After the first attacks proposed against linear classifiers in 2004 [3, 4], Biggio et al. [5, 6] were the first to show that nonlinear machine-learning algorithms, including support vector machines (SVMs) and neural networks, can be misled by gradient-based optimization attacks [1]. Nevertheless, such vulnerabilities of learning algorithms became extremely popular only after Szegedy et al. [7, 8] demonstrated that even deep learning algorithms exhibiting superhuman performance on image classification tasks suffer from the same problems. They showed that slightly manipulating the pixels of an input image can be sufficient to induce deep neural networks to misclassify its content. Such attacks have since been popularized under the name of adversarial examples [7, 6, 2].

Since the seminal work by Szegedy [7, 8], many defense methods have been proposed to mitigate the threat of adversarial examples. Most of the proposed defenses have been shown to be ineffective against more sophisticated attacks (i.e., attacks that are aware of the defense mechanism), leaving the problem of defending neural networks against adversarial examples still open. According to [2], the most promising defenses can be broadly categorized into two families. The first includes approaches based on robust optimization and game-theoretical models [9, 10, 11]. These approaches, which also encompass adversarial training [8], explicitly model the interactions between the classifier and the attacker to learn robust classifiers. The underlying idea is to incorporate knowledge of potential attacks during training. The second family of defenses (complementary to the first) is based on the idea of rejecting samples that exhibit an outlying behavior with respect to unperturbed training data [12, 13, 14, 15, 16].

In this work, we focus on defenses based on rejection mechanisms and try to improve their effectiveness. In fact, it has been shown that relying only upon the feature representation learned by the last network layer to reject adversarial examples is not sufficient [12, 13]. In particular, adversarial examples become indistinguishable from samples of the target class at such a high representation level even for small input perturbations. To overcome this issue, we propose a defense mechanism, named Deep Neural Rejection (DNR), based on analyzing the representations of input samples at different network layers, and on rejecting samples that exhibit anomalous behavior with respect to that observed on the training data at those layers (Sect. II). With respect to similar approaches based on analyzing different network layers [15, 14], our defense does not require generating adversarial examples during training, and it is thus less computationally demanding.

We evaluate our defense against an adaptive white-box attacker that is aware of the defense mechanism and tries to bypass it. To this end, we propose a novel gradient-based attack that accounts for the rejection mechanism and aims to craft adversarial examples that avoid it (Sect. III).

It is worth remarking here that correctly evaluating a defense mechanism is a crucial point when proposing novel defenses against adversarial examples [17, 2]. The majority of previous work proposing defense methods against adversarial examples has only evaluated such defenses against existing attacks, rather than against an ad-hoc attack crafted specifically against the proposed defense (see, e.g., [18, 15, 19] and all the other re-evaluated defenses in [17, 20]). The problem with these black-box and gray-box evaluations, in which the attack is essentially unaware of the defense mechanism, is that they are overly optimistic. It has indeed been shown afterwards that such defenses can be easily bypassed by simple modifications to the attack algorithm [17, 21, 20]. For instance, many defenses have been found to perform gradient obfuscation, i.e., they learn functions which are harder to optimize for gradient-based attacks; however, they can be easily bypassed by constructing a smoother, differentiable approximation of their function, e.g., by learning a surrogate model [6, 2, 22, 23, 24, 25] or by replacing network layers which obfuscate gradients with smoother mappings [17, 21, 20]. In our case, an attack that is unaware of the defense mechanism may tend to craft adversarial examples in areas of the input space which are assigned to the rejection class; thus, such attacks, as well as previously-proposed ones, may rarely bypass our defense. For this reason, we believe that our adaptive white-box attack, along with the security evaluation methodology adopted in this work, provides another significant contribution to the state of the art on properly evaluating defenses against adversarial examples.

The security evaluation methodology advocated in [26, 2, 17], which we also adopt in this work, consists of evaluating the accuracy of the system against attacks crafted with an increasing amount of perturbation. The corresponding security evaluation curve [2] shows how gracefully the performance decreases as the attack grows in strength, up to the point where the defense reaches zero accuracy. This is an important phenomenon to observe, since any defense against test-time evasion attacks has to fail when the perturbation is sufficiently large (or, even better, unbounded); in fact, in the unbounded case, the attacker can ideally replace the source sample with any other sample from another class [17]. If accuracy under attack does not reach zero for very large perturbations, it may be that the attack algorithm fails to find a good optimum (i.e., a good adversarial example), which in turn means that we are probably providing an optimistic evaluation of the defense. As suggested in [17], the purpose of a security evaluation should not be to show which attacks the defense withstands, but rather to show when the defense fails. If only very large perturbations that substantially compromise the content of the input samples and their nature (i.e., their true label) are required to break the defense, then the defense mechanism can be considered sufficiently robust. Another relevant point is to show that this breakdown occurs at a larger perturbation than that exhibited by competing defenses, i.e., that the proposed defense is more robust than previously-proposed ones.

The empirical evaluation reported in Sect. IV, using both MNIST handwritten digits and CIFAR10 images, provides consistent results with the aforementioned aspects. First, it shows that our adaptive white-box attack is able to break our defensive method only at very large perturbations (which significantly alter the initial content of the input samples, requiring the attack to mimic the target class). Second, it shows that our method outperforms competing rejection mechanisms which only leverage the deep representation learned at the output network layer.

We conclude the paper by discussing related work (Sect. V), and the main contributions of this work along with promising future research directions (Sect. VI).

II Deep Neural Rejection

The underlying idea of our DNR method is to estimate the distribution of unperturbed training points at different network layers, and to reject anomalous samples that may be incurred at test time, including adversarial examples. The architecture of DNR is shown in Fig. 1.


Before delving into the details of our method, let us introduce some notation. We denote the prediction function of a deep neural network with f : X → Y, where X ⊆ ℝ^d is the d-dimensional space of input samples (e.g., image pixels) and Y ⊆ ℝ^c is the space of the output predictions (i.e., the estimated confidence values for each class), c being the number of classes. If we assume that the network consists of m layers, then the prediction function can be rewritten to make this explicit as f(x) = f_m(z_{m−1}; w_m), with z_k = f_k(z_{k−1}; w_k) for k = 1, …, m−1 and z_0 = x, where f_m and f_1 denote the mapping function learned respectively by the output and the input layer, and w_m, …, w_1 are their weight parameters (learned during training).

For our defense mechanism to work, one has first to select a set of network layers; e.g., in Fig. 1 we select the representations z_{m−1}, z_{m−2} and z_{m−3} provided by the three outermost layers. Then, on each of the selected representations z_k, DNR learns an SVM with the RBF kernel, trying to correctly predict the input sample. The confidence values on the c classes provided by this classifier are then concatenated with those provided by the other base SVMs, and used to train a combiner, again using an RBF SVM (validation samples should be used to train the combiner here, to avoid overfitting, as suggested by stacked generalization [27]). The combiner will output predictions s_1(x), …, s_c(x) for the c classes, but will reject a sample if the maximum confidence score is not higher than a rejection threshold θ. This decision rule can be compactly represented as c*(x) = argmax_{k=0,1,…,c} s_k(x), where we define an additional, constant output s_0(x) = θ for the rejection class. According to this rule, if s_0 is the highest value in the set, the sample is rejected; otherwise, it is assigned to the class exhibiting the largest confidence value.
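The scheme above can be sketched with scikit-learn as an illustrative stand-in for the authors' SecML-based implementation. Random arrays play the role of the layer representations z_k (in practice, these would be extracted from a trained network), and all names in this snippet (base_svms, predict_with_reject, etc.) are ours, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the representations z_k extracted at three network
# layers; in practice these come from forward hooks on the trained DNN.
rng = np.random.default_rng(0)
n, n_classes = 300, 3
y = rng.integers(0, n_classes, size=n)
layer_reprs = [rng.normal(size=(n, d)) + y[:, None] for d in (64, 32, 16)]

# One RBF SVM per selected layer; decision_function returns per-class scores.
base_svms = [SVC(kernel="rbf", gamma="scale").fit(z, y) for z in layer_reprs]

# Combiner: an RBF SVM trained on the concatenated per-layer scores
# (validation data should be used here to avoid overfitting).
stacked = np.hstack([svm.decision_function(z)
                     for svm, z in zip(base_svms, layer_reprs)])
combiner = SVC(kernel="rbf", gamma="scale").fit(stacked, y)

def predict_with_reject(layer_feats, theta=0.0):
    """Return the predicted class, or -1 (reject) when the maximum
    combiner score s_k(x) does not exceed the threshold theta."""
    s = np.hstack([svm.decision_function(z)
                   for svm, z in zip(base_svms, layer_feats)])
    out = combiner.decision_function(s)
    pred = out.argmax(axis=1)
    pred[out.max(axis=1) <= theta] = -1  # rejection class
    return pred

preds = predict_with_reject(layer_reprs, theta=0.0)
```

Raising theta trades accuracy on unperturbed data for a larger reject region around the training classes.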

As proposed in [12], we use an RBF SVM here to ensure that the confidence values s_k(x), for each given class, decrease as x moves further away from regions of the feature space that are densely populated by training samples of that class. This property, named compact abating probability in open-set problems [28, 13], is desirable to easily implement a distance-based rejection mechanism as the one required in our case to detect outlying samples. With respect to [12], we train this combiner on top of other base classifiers rather than only on the representation learned by the last network layer, to further improve the detection of adversarial examples. For this reason, in the following we refer to the approach by Melis et al. [12], which rejects samples based only on their representation at the last layer, as Neural Rejection (NR), and to ours, which also exploits representations from other layers, as Deep Neural Rejection (DNR).

Fig. 1: Architecture of Deep Neural Rejection (DNR). DNR considers different network layers and learns an SVM with the RBF kernel on each of their representations. The outputs of these SVMs are then combined using another RBF SVM, which provides the prediction scores s_1(x), …, s_c(x) for each class. This classifier rejects samples if the maximum score is not higher than the rejection threshold θ. This decision rule can be equivalently represented as c*(x) = argmax_{k=0,1,…,c} s_k(x), if we consider rejection as an additional class with s_0(x) = θ.

III Attacking Deep Neural Rejection

To properly evaluate the security, or adversarial robustness, of rejection-based defenses against adaptive white-box adversarial examples, we propose the following. Given a source sample x and a maximum-allowed ε-sized perturbation, the attacker can optimize a defense-aware adversarial example x' by solving the following constrained optimization problem:

min_{x'} Ω(x') = max_{k ∈ {0, y}} s_k(x') − max_{j ∉ {0, y}} s_j(x'),  s.t. ‖x' − x‖_p ≤ ε,  (1)

where ‖x' − x‖_p ≤ ε is an ℓ_p-norm constraint (typical norms used for crafting adversarial examples are ℓ_1, ℓ_2 and ℓ_∞, for which efficient projection algorithms exist [29]), y is the true class, and 0 is the rejection class. In practice, the attacker minimizes the output on either the true or the reject class, to avoid the sample being either correctly classified or rejected, while maximizing the output on the closest competing class (excluding the aforementioned two) to achieve (untargeted) evasion. This amounts to performing a strong maximum-confidence evasion attack (rather than searching for a minimum-distance adversarial example); we refer the reader to [6, 2, 24] for a more detailed discussion of this topic. While we focus here on untargeted (error-generic) attacks, our formulation can be extended to account for targeted (error-specific) evasion, as also done in [12].

The optimization problem in Eq. (1) can be solved through a standard projected gradient descent (PGD) algorithm, as given in Algorithm 1.

Algorithm 1 PGD-based Maximum-Confidence Adversarial Examples
Input: x, the input sample; η, the step size; Π, a projection operator onto the ℓ_p-norm constraint ‖x' − x‖_p ≤ ε; t > 0, a small positive number to ensure convergence.
Output: x', the adversarial example.
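A minimal NumPy sketch of such a projected-gradient attack, under simplifying assumptions: the DNR scores are replaced by a toy differentiable linear model so gradients are analytic, the reject "class" contributes only its constant score θ, a fixed number of fixed-size steps is used (no convergence check on t), and the ℓ_2 ball is used for the projection Π. All names here are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 10, 3
W = rng.normal(size=(n_classes, d))
b = rng.normal(size=n_classes)
theta = 0.0  # rejection threshold: the reject "class" has constant score theta

def scores(x):
    # Toy differentiable per-class scores standing in for the DNR combiner.
    return W @ x + b

def objective_and_grad(x, y):
    """Omega(x) = max(s_y(x), theta) - max_{j != y} s_j(x), and its gradient.
    Minimising Omega drives x away from both the true class and the reject
    region, towards the closest competing class."""
    s = scores(x)
    j_star = max((j for j in range(n_classes) if j != y), key=lambda j: s[j])
    # theta is constant, so the first term's gradient is W[y] only
    # while s_y >= theta, and zero otherwise.
    grad = (W[y] if s[y] >= theta else np.zeros(d)) - W[j_star]
    return max(s[y], theta) - s[j_star], grad

def project_l2(x, x0, eps):
    # Euclidean projection onto the ball ||x - x0||_2 <= eps.
    delta = x - x0
    norm = np.linalg.norm(delta)
    return x0 + delta * eps / norm if norm > eps else x

def pgd_attack(x0, y, eps=2.0, eta=0.1, steps=200):
    x = x0.copy()
    for _ in range(steps):
        _, g = objective_and_grad(x, y)
        x = project_l2(x - eta * g, x0, eps)  # descend, then project
    return x

x0 = rng.normal(size=d)
y_true = int(np.argmax(scores(x0)))  # start from a correctly classified sample
x_adv = pgd_attack(x0, y_true)
```

After the loop, the objective has been driven down while the perturbation stays inside the ε-ball, mirroring Eq. (1).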

In Fig. 2 we report an example on a bi-dimensional toy problem to show how our defense-aware attack works against a rejection-based defense mechanism.

Fig. 2: Our defense-aware attack against an RBF SVM with rejection, on a three-class bi-dimensional classification problem. The initial sample and the adversarial example are represented respectively as a red hexagon and a green star, while the ℓ_2-norm perturbation constraint ‖x' − x‖_2 ≤ ε is shown as a black circle. The left plot shows the decision region of each class, along with the reject region (in white). The right plot shows the values of the attack objective Ω(x') (in colors), which correctly enforces our attack to avoid the reject region.

IV Experimental Analysis

In this section, we evaluate the security of the proposed DNR method against adaptive, defense-aware adversarial examples (our code is implemented using the SecML library [30]). We consider two common computer-vision benchmarks for this task, i.e., handwritten digit recognition (MNIST data) and image classification (CIFAR10 data). Our goal is to investigate whether and to what extent DNR can improve security against adversarial examples, in particular compared to the previously-proposed neural rejection (NR) defense, which only leverages the feature representation learned at the last network layer to reject adversarial examples [12].


IV-A Experimental Setup

We discuss here the experimental setup used to evaluate our defense mechanism.

Datasets. As mentioned before, we run experiments on MNIST and CIFAR10 data. MNIST handwritten digit data consists of 60,000 training and 10,000 test gray-scale 28x28 images. CIFAR10 consists of 50,000 training and 10,000 test RGB 32x32 images. We normalized the images of both datasets in [0,1] by simply dividing the input pixel values by 255.

Train-test splits. We average the results on five different runs. In each run, we consider 10,000 training samples and 1,000 test samples, randomly drawn from the corresponding datasets. The deep neural networks (DNNs) used in our experiments are pre-trained on a separate split of 30,000 and 40,000 training samples, respectively for MNIST and CIFAR10.

Classifiers. We compare the DNR approach (which here implements rejection based on the representations learned by three different network layers) against an undefended DNN (without any rejection mechanism) and against the NR defense by Melis et al. [12] (which implements rejection on top of the representation learned by the output network layer). For the MNIST dataset, the undefended DNN uses the same architecture suggested by Carlini and Wagner [21]. For CIFAR10, instead, we considered a lightweight network that, despite its size, achieves good performance. The two architectures are shown in Table I, whereas Table II shows the parameters that we used to train them. The three layers considered by the DNR classifier are the last three layers for the network trained on MNIST, and the last layer plus the last batch-norm layer and the second-to-last max-pooling layer for the one trained on CIFAR10 (chosen to obtain a reasonable number of features).

Layer Type Dimension
Conv. + ReLU 32 filters (3x3)
Conv. + ReLU 32 filters (3x3)
Max Pooling 2x2
Conv. + ReLU 64 filters (3x3)
Conv. + ReLU 64 filters (3x3)
Max Pooling 2x2
Fully Connected + ReLU 200 units
Fully Connected + ReLU 200 units
Softmax 10 units
TABLE I: Model architecture for the MNIST (left) and CIFAR10 (right) networks
Layer Type Dimension
Conv. + Batch Norm. + ReLU 64 filters (3x3)
Conv. + Batch Norm. + ReLU 64 filters (3x3)
Max Pooling + Dropout 2x2
Conv. + Batch Norm. + ReLU 128 filters (3x3)
Conv. + Batch Norm. + ReLU 128 filters (3x3)
Max Pooling + Dropout 2x2
Conv. + Batch Norm. + ReLU 256 filters (3x3)
Conv. + Batch Norm. + ReLU 256 filters (3x3)
Max Pooling + Dropout 2x2
Conv. + Batch Norm. + ReLU 512 filters (3x3)
Max Pooling + Dropout 2x2
Fully Connected 512 units
Softmax 10 units
Parameter MNIST Model CIFAR10 Model
Learning Rate 0.1 0.01
Momentum 0.9 0.9
Dropout 0.5 (see Table I)
Batch Size 128 100
Epochs 50 75
TABLE II: Parameters used to train the MNIST and CIFAR10 networks

Security Evaluation. We compare these classifiers in terms of their security evaluation curves [2], reporting classification accuracy against attacks with an increasing perturbation size ε, used to perturb all test samples. We consider adversarial examples to be correctly classified if they are assigned either to the rejection class or to their original class (which typically happens when the perturbation is too small to cause a misclassification). For DNR and NR, we also report the rejection rates, to highlight the fraction of correct rejections (i.e., rejected adversarial examples) and wrong rejections (i.e., rejected unperturbed test samples).
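The evaluation protocol above (an attacked sample counts as correctly handled if it is rejected or still assigned to its original class) can be sketched as follows; the predict and attack callables and the toy 1-d classifier are hypothetical stand-ins for the real models and PGD attack.

```python
import numpy as np

REJECT = -1

def security_curve(predict, attack, X, y, eps_values):
    """Accuracy under attack for increasing perturbation size eps.
    An adversarial example counts as correctly handled if it is
    rejected or still assigned to its original class."""
    accs = []
    for eps in eps_values:
        X_adv = np.stack([attack(x, yi, eps) for x, yi in zip(X, y)])
        preds = predict(X_adv)
        ok = (preds == y) | (preds == REJECT)
        accs.append(ok.mean())
    return np.array(accs)

# Toy usage: a 1-d "classifier" that rejects far from the origin.
def predict(X):
    p = (X[:, 0] > 0).astype(int)
    p[np.abs(X[:, 0]) > 3] = REJECT  # distance-based rejection
    return p

def attack(x, yi, eps):
    # Push the sample across the decision boundary by eps.
    return x + (eps if yi == 0 else -eps)

X = np.array([[-1.0], [1.0], [-2.0], [2.0]])
y = np.array([0, 1, 0, 1])
curve = security_curve(predict, attack, X, y, [0.0, 1.5, 5.0])
```

Plotting curve against eps_values yields the security evaluation curve discussed above.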

Parameter setting. We use 5-fold cross-validation to select the hyperparameters that maximize classification accuracy on unperturbed training data. We set the rejection threshold θ of NR and DNR with the same procedure, such that less than 5% of the training samples are rejected on MNIST, and less than 10% on CIFAR10.
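One simple way to implement this thresholding step, assuming (as a sketch, not necessarily the authors' exact procedure) that θ is chosen as a percentile of the combiner's maximum scores on unperturbed training data:

```python
import numpy as np

def choose_threshold(max_scores, max_reject_rate):
    """Pick the rejection threshold theta such that at most
    `max_reject_rate` of the (unperturbed) samples fall at or below it.
    A sample is rejected when its maximum combiner score is <= theta."""
    return np.quantile(max_scores, max_reject_rate)

rng = np.random.default_rng(0)
max_scores = rng.normal(loc=1.0, size=10_000)  # stand-in for combiner scores
theta = choose_threshold(max_scores, 0.05)     # reject at most ~5% (MNIST)
reject_rate = (max_scores <= theta).mean()
```

The same call with 0.10 would reproduce the CIFAR10 setting.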

IV-B Experimental Results

Fig. 3: Security evaluation curves for MNIST (left) and CIFAR10 (right) data, reporting mean accuracy (and standard deviation) against ε-sized attacks.

The results are reported in Fig. 3. In the absence of attack (ε = 0), the undefended DNNs slightly outperform NR and DNR, since the latter reject some legitimate samples. Under attack (ε > 0), however, both defenses outperform the undefended DNNs, and DNR in particular significantly outperforms NR, exhibiting a much more graceful decrease in performance. Interestingly, on MNIST, NR tends to reject more samples than DNR at larger ε, yet its accuracy is lower. The reason is that the adversarial examples that are not rejected by NR are already misclassified at lower ε, whereas, even when such samples are not rejected by DNR, they are still correctly assigned to their true classes. This means that DNR provides tighter boundaries around the training classes than NR, thanks to the exploitation of lower-level network representations, which makes the corresponding defended classifier more difficult to evade. In particular, the adversarial perturbations required to evade DNR are not only typically larger in magnitude, but also more structured, mimicking more closely samples belonging to different classes. This effect is quite evident for the MNIST digits, as shown in Fig. 4, and to a lesser extent also for the CIFAR10 images, as shown in Fig. 5. For instance, see how the bird image is altered to resemble the shape of an airplane: the background pixels are manipulated to represent the airplane cockpit and its tail. Notably, the truck image is also modified to better resemble the shape of a car. The perturbations crafted to evade DNR are more structured because they are required to mimic the target class not only at the output layer, but at the lower representations too.

Fig. 4: Adversarial examples computed on the MNIST data to evade the DNN, NR and DNR. The source image is reported on the left, followed by the (magnified) adversarial perturbation crafted against each classifier with a smaller (top) and a larger (bottom) perturbation size ε, and the resulting adversarial examples. We remind the reader that the attacks considered in this work are untargeted, i.e., they succeed when the attack sample is not correctly assigned to its true class. Note how DNR forces the attacks to better mimic samples from different classes, while remaining able to reject even strongly-perturbed adversarial examples.
Fig. 5: Adversarial examples computed on the CIFAR10 dataset, adding a perturbation crafted with a smaller (top) and a larger (bottom) perturbation size ε. See the caption of Fig. 4 for further details.

V Related Work

Different approaches have been recently proposed to perform rejection of samples that are outside of the training data distribution [31, 32, 33]. For example, Thulasidasan et al. [31] and Geifman and El-Yaniv [32] have proposed novel loss functions accounting for rejection of inputs on which the classifier is not sufficiently confident. Geifman and El-Yaniv [33] have proposed a method that allows the system designer to set the desired risk level by adding a rejection mechanism to a pre-trained neural network architecture. These approaches have, however, not been originally tested against adversarial examples, and it is thus of interest to assess their performance under attack in future work, also in comparison to our proposal.

Even if the majority of approaches implementing rejection or abstaining classifiers have not considered the problem of defending against adversarial examples, some recent work has explored this direction too [13, 12]. Nevertheless, with respect to the approach proposed in this work, these methods only consider the output of the last network layer and perform rejection based solely on that feature representation. In particular, Bendale and Boult [13] have proposed a rejection mechanism based on reducing the open-set risk in the feature space of the activation vectors extracted from the last network layer, while Melis et al. [12] have applied a threshold on the output of an RBF SVM classifier. Despite these differences, the rationale of the two approaches is quite similar and resembles the older idea of distance-based rejection.

Few approaches have considered a multi-layer detection scheme similar to that envisioned in our work [15, 34, 14, 35, 16]. However, most of these approaches require generating adversarial examples at training time, which is computationally intensive, especially for high-dimensional problems and large datasets [15, 34, 14, 35]. Finding a methodology to tune the hyperparameters for generating the attack samples is also an open research challenge. Finally, even though the DkNN approach by Papernot and McDaniel [16] does not require generating adversarial examples at training time, it requires computing the distance of each test sample to all the training points at different network-layer representations, which again raises scalability issues for high-dimensional problems and large datasets.

VI Conclusions and Future Work

We have proposed deep neural rejection (DNR), i.e., a multi-layer rejection mechanism that, differently from other state-of-the-art rejection approaches against adversarial examples, does not require generating adversarial examples at training time, and is less computationally demanding. Our approach can be applied to pre-trained network architectures to implement a defense against adversarial attacks at test time. The base classifiers and the combiner used in our DNR approach are trained separately. As future work, it would be interesting to perform an end-to-end training of the proposed classifier, similarly to the approaches proposed in [31] and [32]. Another research direction may be that of testing our defense against training-time poisoning attacks [5, 36, 37, 38, 2].

Availability of data and materials

The MNIST and CIFAR10 datasets analysed in this work are publicly available [39, 40].


Acknowledgements

This work was partly supported by the PRIN 2017 project RexLearn, funded by the Italian Ministry of Education, University and Research (grant no. 2017TWNMH2); and by the EU H2020 project ALOHA, under the European Union’s Horizon 2020 research and innovation programme (grant no. 780788).


References

  • [1] A. D. Joseph, B. Nelson, B. I. P. Rubinstein, and J. Tygar, Adversarial Machine Learning. Cambridge University Press, 2018.
  • [2] B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,” Pattern Recognition, vol. 84, pp. 317–331, 2018.
  • [3] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma, “Adversarial classification,” in Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), (Seattle), pp. 99–108, 2004.
  • [4] D. Lowd and C. Meek, “Good word attacks on statistical spam filters,” in Second Conference on Email and Anti-Spam (CEAS), (Mountain View, CA, USA), 2005.
  • [5] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” in 29th Int’l Conf. on Machine Learning (J. Langford and J. Pineau, eds.), pp. 1807–1814, Omnipress, 2012.
  • [6] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, “Evasion attacks against machine learning at test time,” in Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Part III (H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, eds.), vol. 8190 of LNCS, pp. 387–402, Springer Berlin Heidelberg, 2013.
  • [7] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
  • [8] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
  • [9] A. Globerson and S. T. Roweis, “Nightmare at test time: robust learning by feature deletion,” in Proceedings of the 23rd International Conference on Machine Learning (W. W. Cohen and A. Moore, eds.), vol. 148, pp. 353–360, ACM, 2006.
  • [10] M. Brückner, C. Kanzow, and T. Scheffer, “Static prediction games for adversarial learning problems,” J. Mach. Learn. Res., vol. 13, pp. 2617–2654, September 2012.
  • [11] S. Rota Bulò, B. Biggio, I. Pillai, M. Pelillo, and F. Roli, “Randomized prediction games for adversarial machine learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 11, pp. 2466–2478, 2017.
  • [12] M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli, “Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid,” in ICCVW Vision in Practice on Autonomous Robots (ViPAR), pp. 751–759, IEEE, 2017.
  • [13] A. Bendale and T. E. Boult, “Towards open set deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572, 2016.
  • [14] F. Crecchi, D. Bacciu, and B. Biggio, “Detecting adversarial examples through nonlinear dimensionality reduction,” in ESANN ’19, In press.
  • [15] J. Lu, T. Issaranon, and D. Forsyth, “Safetynet: Detecting and rejecting adversarial examples robustly,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [16] N. Papernot and P. D. McDaniel, “Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning,” CoRR, vol. abs/1803.04765, 2018.
  • [17] A. Athalye, N. Carlini, and D. A. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in ICML, vol. 80 of JMLR Workshop and Conference Proceedings, pp. 274–283, 2018.
  • [18] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597, May 2016.
  • [19] D. Meng and H. Chen, “MagNet: a two-pronged defense against adversarial examples,” in 24th ACM Conf. Computer and Comm. Sec. (CCS), 2017.
  • [20] N. Carlini and D. A. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in 10th ACM Workshop on Artificial Intelligence and Security (B. M. Thuraisingham, B. Biggio, D. M. Freeman, B. Miller, and A. Sinha, eds.), AISec ’17, (New York, NY, USA), pp. 3–14, ACM, 2017.
  • [21] N. Carlini and D. A. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy, pp. 39–57, IEEE Computer Society, 2017.
  • [22] P. Russu, A. Demontis, B. Biggio, G. Fumera, and F. Roli, “Secure kernel machines against evasion attacks,” in 9th ACM Workshop on Artificial Intelligence and Security, AISec ’16, (New York, NY, USA), pp. 59–69, ACM, 2016.
  • [23] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, (New York, NY, USA), pp. 506–519, ACM, 2017.
  • [24] A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru, and F. Roli, “Why do adversarial attacks transfer? Explaining transferability of evasion and poisoning attacks,” in 28th USENIX Security Symposium (USENIX Security 19), USENIX Association, 2019.
  • [25] M. Melis, D. Maiorca, B. Biggio, G. Giacinto, and F. Roli, “Explaining black-box android malware detection,” in 26th European Signal Processing Conf., EUSIPCO, (Rome, Italy), pp. 524–528, IEEE, 2018.
  • [26] B. Biggio, G. Fumera, and F. Roli, “Security evaluation of pattern classifiers under attack,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 984–996, April 2014.
  • [27] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp. 241–259, 1992.
  • [28] W. Scheirer, L. Jain, and T. Boult, “Probability models for open set recognition,” IEEE Trans. Patt. An. Mach. Intell., vol. 36, no. 11, pp. 2317–2324, 2014.
  • [29] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient projections onto the l1-ball for learning in high dimensions,” in Proceedings of the 25th International Conference on Machine Learning, ICML ’08, (New York, NY, USA), pp. 272–279, ACM, 2008.
  • [30] M. Melis, A. Demontis, M. Pintor, and B. Biggio, “secml: A Python library for secure and explainable machine learning,” 2019.
  • [31] S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof, “Knows when it doesn’t know: Deep abstaining classifiers,” 2019.
  • [32] Y. Geifman and R. El-Yaniv, “SelectiveNet: A deep neural network with an integrated reject option,” in 36th Int’l Conf. on Machine Learning (ICML), 2019.
  • [33] Y. Geifman and R. El-Yaniv, “Selective classification for deep neural networks,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4878–4887, Curran Associates, Inc., 2017.
  • [34] F. Carrara, R. Becarelli, R. Caldelli, F. Falchi, and G. Amato, “Adversarial examples detection in features distance spaces,” in The European Conference on Computer Vision (ECCV) Workshops, September 2018.
  • [35] T. Pang, C. Du, and J. Zhu, “Max-mahalanobis linear discriminant analysis networks,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4013–4022, 2018.
  • [36] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li, “Manipulating machine learning: Poisoning attacks and countermeasures for regression learning,” in IEEE Symposium on Security and Privacy, SP ’18, pp. 931–947, IEEE CS, 2018.
  • [37] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli, “Is feature selection secure against training data poisoning?,” in JMLR W&CP - Proc. 32nd Int’l Conf. Mach. Learning (ICML) (F. Bach and D. Blei, eds.), vol. 37, pp. 1689–1698, 2015.
  • [38] S. Mei and X. Zhu, “Using machine teaching to identify optimal training-set attacks on machine learners,” in 29th AAAI Conf. Artificial Intelligence (AAAI ’15), 2015.
  • [39] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, Nov. 1998.
  • [40] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” Tech. Rep., University of Toronto, 2009.