Better the Devil you Know: An Analysis of Evasion Attacks using Out-of-Distribution Adversarial Examples

05/05/2019 ∙ by Vikash Sehwag, et al. ∙ Princeton University

A large body of recent work has investigated the phenomenon of evasion attacks using adversarial examples for deep learning systems, where the addition of norm-bounded perturbations to the test inputs leads to incorrect output classification. Previous work has investigated this phenomenon in closed-world systems where training and test inputs follow a pre-specified distribution. However, real-world implementations of deep learning applications, such as autonomous driving and content classification, are likely to operate in the open-world environment. In this paper, we demonstrate the success of open-world evasion attacks, where adversarial examples are generated from out-of-distribution inputs (OOD adversarial examples). In our study, we use 11 state-of-the-art neural network models trained on 3 image datasets of varying complexity. We first demonstrate that state-of-the-art detectors for out-of-distribution data are not robust against OOD adversarial examples. We then consider 5 known defenses for adversarial examples, including state-of-the-art robust training methods, and show that against these defenses, OOD adversarial examples can achieve up to 4× higher target success rates compared to adversarial examples generated from in-distribution data. We also take a quantitative look at how open-world evasion attacks may affect real-world systems. Finally, we present the first steps towards a robust open-world machine learning system.




1. Introduction

Input images | Classification | Mean output confidence | Mitigation using robust training | Mitigation using OOD detectors
In-distribution, unmodified | Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck | 0.99 | N.A. | N.A.
In-distribution, adversarial | Cat | 1.00 | | N.A.
OOD, unmodified | Truck, Horse, Truck, Truck, Ship, Ship, Truck, Frog, Bird, Airplane | 0.53 | X |
OOD, adversarial | Cat | 1.00 | X | X

Table 1. Distinguishing between the key properties of unmodified and adversarial in-distribution and out-of-distribution (OOD) images for CIFAR-10 trained models. In an open-world environment, the adversary can construct adversarial examples using OOD inputs, called OOD adversarial examples. In contrast to non-modified OOD inputs, OOD adversarial examples achieve high confidence targeted classification. Mitigation strategies such as robust training and OOD detection are largely ineffective against OOD adversarial examples.

Machine learning (ML), spurred by the advent of deep neural networks, has become ubiquitous due to its impressive performance in domains as varied as image recognition (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014), natural language and speech processing (Collobert et al., 2011; Hinton et al., 2012; Deng et al., 2013), game-playing (Silver et al., 2017; Brown and Sandholm, 2017; Moravčík et al., 2017) and aircraft collision avoidance (Julian et al., 2016). However, its ubiquity provides adversaries with both opportunities and incentives to develop strategic approaches to fool machine learning systems during both the training (poisoning attacks) (Biggio et al., 2012; Rubinstein et al., 2009; Mozaffari-Kermani et al., 2015; Jagielski et al., 2018) and test (evasion attacks) (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016a; Moosavi-Dezfooli et al., 2016, 2017; Carlini and Wagner, 2017b) phases. This paper’s focus is evasion attacks, which have been proposed against supervised ML algorithms used for image classification (Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b; Papernot et al., 2016a; Chen et al., 2018b), object detection (Xie et al., 2017b; Lu et al., 2017; Chen et al., 2018a), image segmentation (Fischer et al., 2017; Arnab et al., 2018), speech recognition (Carlini and Wagner, 2018; Yuan et al., 2018) as well as other tasks (Cisse et al., 2017; Kantchelian et al., 2016; Grosse et al., 2017b; Xu et al., 2016); generative models for image data (Kos et al., 2017), and even reinforcement learning algorithms (Kos and Song, 2017; Huang et al., 2017). Successful instantiations of these attacks have been demonstrated in black-box (Szegedy et al., 2014; Papernot et al., 2016; Papernot et al., 2017; Liu et al., 2017; Brendel et al., 2018; Bhagoji et al., 2018; Chen et al., 2017) as well as physical settings (Sharif et al., 2016; Kurakin et al., 2016; Evtimov et al., 2018; Sitawarin et al., 2018).

The vast majority of these attacks consider adversarial examples generated from inputs taken from the training or test data (referred to as in-distribution examples henceforth). In this paper, we consider the threat posed by evasion attacks in the open-world learning model, where the ML system is expected to handle inputs that come from the same input space as the in-distribution data, but may arise from completely different distributions, referred to as out-of-distribution (OOD) data. Motivated by the fact that real-world applications of machine learning are likely to operate in the open-world environment, this problem has been studied widely in the context of out-of-distribution detection, anomaly detection, and selective prediction (Hendrycks and Gimpel, 2017; Liang et al., 2018; Lee et al., 2018a; DeVries and Taylor, 2018; Jiang et al., 2018; Lakshminarayanan et al., 2017; Lee et al., 2018b; Zong et al., 2018; Liu et al., 2018; Ruff et al., 2018; Chalapathy et al., 2018; Hendrycks et al., 2018; NIP, 2018; Dhamija et al., 2018; Bendale and Boult, 2015; Bendale and Boult, 2015; Yoshihashi et al., 2018; Günther et al., 2017; El-Yaniv and Wiener, 2010; Chang and Lippmann, 1994; Geifman and El-Yaniv, 2017). Indeed, there have been many recent attempts at extending state-of-the-art deep learning systems to the open-world environment (Bendale and Boult, 2015; Bendale and Boult, 2015; Yoshihashi et al., 2018; Günther et al., 2017). However, we note that work on the open-world learning model has not considered the presence of adversaries, while work on adversarial examples has been restricted to generating them from in-distribution data.

We close this gap by considering adversaries operating in the open-world learning setting that carry out evasion attacks by modifying OOD data (compliant with the environment/application constraints, if any) with the aim of inducing high-confidence targeted misclassification. To carry out evasion attacks on open-world learning models, we introduce out-of-distribution (OOD) adversarial examples: adversarial examples created by perturbing an OOD input to be classified as a target class, where an OOD input is an arbitrary input drawn from a distribution different from the training/test distribution. We demonstrate that OOD adversarial examples are able to bypass current OOD detectors as well as defenses against adversarial examples (Table 1). Under the taxonomy of attacks on ML systems laid out by Huang et al. (Huang et al., 2011), OOD adversarial examples are a form of exploratory integrity attacks with the intent of targeted misclassification of input data.

Intuitively, we would expect a well-behaved classifier to classify an OOD adversarial example with low confidence, since it has never encountered an example from that portion of the input space before. This intuition underlies the design of state-of-the-art OOD detectors. However, OOD adversarial examples constitute an integrity violation for classifiers as well as OOD detectors, as they induce high-confidence targeted misclassification in state-of-the-art classifiers, which is unwanted system behavior. While previous work (Goodfellow et al., 2015; Nguyen et al., 2015; Sharif et al., 2016; Sitawarin et al., 2018) has hinted at the possibility of generating adversarial examples without the use of in-distribution data, we are the first to rigorously examine OOD adversarial examples and their impact on ML classifiers, including those secured with state-of-the-art defenses, as well as OOD detectors.

1.1. Contributions

We now detail our contributions in this paper.

Introduction and analysis of open-world evasion attacks: We introduce open-world evasion attacks and propose OOD adversarial examples to carry out these attacks. We study their impact on state-of-the-art classifiers, both defended and undefended, as well as on OOD detectors. We consider 11 different neural networks trained across 3 benchmark image datasets of varying complexity. For each dataset used for training, we use 5 OOD datasets for the generation of OOD adversarial examples in order to determine the effect of the choice of OOD data on robustness.

Evading state-of-the-art OOD detectors: We evaluate the robustness of the state-of-the-art OOD detection mechanisms ODIN (Liang et al., 2018) and Confidence-calibrated classifiers (Lee et al., 2018a) against OOD adversarial examples. Our results show that the OOD detectors, which can detect close to 85% of non-modified OOD inputs, fail to detect a significant percentage of OOD adversarial examples (up to 99.8%).

Bypassing state-of-the-art defenses for in-distribution attacks: Although state-of-the-art defenses such as iterative adversarial training (Madry et al., 2018) and provably robust training with the convex outer polytope (Kolter and Wong, 2018) are promising approaches for the mitigation of in-distribution attacks, their performance significantly degrades with the use of OOD adversarial examples. We demonstrate that OOD adversarial examples can achieve a significantly higher target success rate (up to 4× greater) than that of adversarial examples generated from in-distribution data. Further, we demonstrate that OOD adversarial examples are able to evade adversarial example detectors such as feature squeezing (Xu et al., 2018) and MagNet (Meng and Chen, 2017), with close to 100% success rate (similar to in-distribution adversarial examples). We also show this for the Adversarial Logit Pairing defense (Kannan et al., 2018).

OOD adversarial examples in the real world: We demonstrate the success of OOD adversarial examples in real-world settings by targeting a content moderation service provided by Clarifai (Clarifai, 2019). We also show how physical OOD adversarial examples can be used to fool traffic sign classification systems.

Towards robust open-world learning: We explore if it is possible to increase the robustness of defended models by including a small number of OOD adversarial examples during robust training. Our results show that such an increase in robustness, even against OOD datasets excluded in training, is possible.

Overall, we demonstrate that considering the threats posed to open-world learning models is imperative for the secure deployment of ML systems in practice.

2. Background and Related Work

In this section, we present the background and related work on adversarial examples generated from in-distribution data, defenses against evasion attacks, and state-of-the-art detection methods for unmodified OOD data.

2.1. Supervised classification

Let $\mathcal{X}$ be a space of examples and let $\mathcal{Y}$ be a finite set of classes. A classifier is a function $f: \mathcal{X} \to \mathcal{Y}$. Let $\Delta(\mathcal{Y})$ be the set of probability distributions over $\mathcal{Y}$. In our setting, a classifier is always derived from a function $g: \mathcal{X} \to \Delta(\mathcal{Y})$ that provides confidence information, i.e. $f(x) = \arg\max_{i \in \mathcal{Y}} g(x)_i$. In particular, for DNNs, the outputs of the penultimate layer of a neural network $f$, representing the output of the network computed sequentially over all preceding layers, are known as the logits. We represent the logits as a vector $\phi^f(x) \in \mathbb{R}^{|\mathcal{Y}|}$. The classifier is trained by minimizing the empirical loss $\frac{1}{n} \sum_{i=1}^{n} \ell_g(x_i, y_i)$ over $n$ samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ (training set), where $\ell_g(\cdot, \cdot)$ is a loss function such as the cross-entropy loss (Goodfellow et al., 2016) that depends on the output confidence function $g$ (Murphy, 2012). The training set is drawn from a distribution $P^{in}_{X,Y}$ over the domain $\mathcal{X} \times \mathcal{Y}$. The marginal distribution over the space of examples $\mathcal{X}$ is represented as $P^{in}_{X}$. These samples usually represent an application-specific set of concepts that the classifier is being trained for.
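The relationship between logits, the confidence function, the classifier, and the empirical loss can be illustrated with a minimal NumPy sketch. It assumes a softmax confidence function and the cross-entropy loss; the logit values and labels are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    """Confidence function: map logits of shape (n, k) to rows of probabilities."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classify(logits):
    """Classifier: pick the class with the largest confidence."""
    return softmax(logits).argmax(axis=1)

def empirical_cross_entropy(logits, labels):
    """Empirical loss: cross-entropy averaged over the n samples."""
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

# Hypothetical logits for three samples and three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2, 1])

predictions = classify(logits)
loss = empirical_cross_entropy(logits, labels)
```

The third sample has tied logits, so its confidence vector is uniform and the prediction falls back to the first class; this is the kind of low-confidence output that confidence-based OOD detectors rely on.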

2.2. Evasion attacks

We focus on test phase or evasion attacks.

2.2.1. Adversarial examples generated from in-distribution data

Evasion attacks have been demonstrated to be highly successful for a number of classifiers (Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016; Papernot et al., 2016a; Carlini and Wagner, 2017b; Kurakin et al., 2016; Evtimov et al., 2018; Sitawarin et al., 2018; Suciu et al., 2018; Wang et al., 2018d). All of these aim to modify benign, in-distribution examples by adding an imperceptible perturbation to them such that the modified examples are adversarial (Szegedy et al., 2014). The adversary’s aim is to ensure that these adversarial examples are misclassified by the ML system into a targeted class (targeted attack), or into any class other than the ground truth class (untargeted attack). We focus entirely on targeted attacks since these are more realistic from an attacker’s perspective and are strictly harder to carry out than untargeted attacks. To generate a successful targeted adversarial example $\tilde{x}$ for class $T$ starting from a benign example $x$ for a classifier $f$, the following optimization problem must be solved:

$$f(\tilde{x}) = T \quad \text{s.t.} \quad d(\tilde{x}, x) \le \epsilon, \tag{1}$$

where $d(\cdot, \cdot)$ is an appropriate distance metric for inputs from the input domain, used to model imperceptibility-based adversarial constraints (Goodfellow et al., 2015; Carlini and Wagner, 2017b). The distance metric imposes an $\epsilon$-ball constraint on the perturbation. The optimization problem in Eq. 1 is combinatorial and thus difficult to solve. In practice, a relaxed version using an appropriate adversarial loss function derived from the confidence function is used and solved with an iterative optimization technique (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b; Madry et al., 2018; Athalye et al., 2018; Moosavi-Dezfooli et al., 2016). Details of the state-of-the-art attack methods we use are in Section 4.2.
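As an illustration of the relaxed, iterative approach, the following sketch runs a targeted projected-gradient attack against a toy linear softmax classifier. It is a minimal example under stated assumptions, not the attack configuration used in the paper: the linear model `W`, `b`, the L-infinity distance, and the values of `eps`, `step`, and `iters` are all hypothetical, and the loop returns the best iterate seen so that the target loss never ends up higher than where it started.

```python
import numpy as np

def softmax(z):
    """Confidence vector for a single logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def target_loss(x, W, b, target):
    """Cross-entropy of the toy linear model's output against the target class."""
    return -np.log(softmax(W @ x + b)[target])

def targeted_pgd(x, W, b, target, eps=0.1, step=0.01, iters=40):
    """Iteratively lower the target-class loss while staying within an
    L-infinity ball of radius eps around x and inside the [0, 1] pixel range."""
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    x_adv = x.copy()
    best, best_loss = x.copy(), target_loss(x, W, b, target)
    for _ in range(iters):
        probs = softmax(W @ x_adv + b)
        grad = W.T @ (probs - onehot)             # gradient of the loss w.r.t. the input
        x_adv = x_adv - step * np.sign(grad)      # signed step toward the target class
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
        loss = target_loss(x_adv, W, b, target)
        if loss < best_loss:                      # remember the best iterate seen
            best, best_loss = x_adv.copy(), loss
    return best

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 20))        # hypothetical weights: 3 classes, 20 input features
b = np.zeros(3)
x = rng.uniform(0.0, 1.0, size=20)  # stand-in for a benign input
x_adv = targeted_pgd(x, W, b, target=2)
```

The projection step is what enforces the distance constraint of Eq. 1; the equality constraint on the predicted class is replaced by a differentiable loss, which is the relaxation the text refers to.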

2.2.2. Real-world attacks

Work on adversarial examples has also examined the threat they pose in real-world settings. One line of work has been to analyze attacks on black-box models hidden behind APIs (Bhagoji et al., 2018; Chen et al., 2017; Ilyas et al., 2018; Brendel et al., 2018), where access to the internals of a model is often unavailable due to privacy concerns. In these settings, gradient estimation techniques or other black-box optimization techniques such as particle swarm optimization (Bhagoji et al., 2018) are used to generate adversarial examples. Further, the possibility of using physically realized adversarial examples (Anonymous, 2018; Evtimov et al., 2018; Athalye et al., 2017) to attack ML systems has also been considered. Since the image capture process introduces artifacts related to brightness, scaling and rotation, these have to be considered in the adversarial example generation process, which is done using the expectation over transformations method (Athalye et al., 2017). We examine both of these attacks for OOD adversarial examples in Section 5.4.
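The idea behind expectation over transformations can be sketched as averaging the attack gradient over randomly sampled capture conditions. The example below is a simplified illustration, not the pipeline used in the cited work: it assumes a toy linear softmax model and models each transformation as a hypothetical contrast and brightness change $t(x) = a x + c$; scaling and rotation are left out.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_transforms(rng, n):
    """Draw n hypothetical capture conditions as (contrast, brightness) pairs."""
    return [(rng.uniform(0.7, 1.3), rng.uniform(-0.1, 0.1)) for _ in range(n)]

def expected_loss(x, W, b, target, transforms):
    """Average target-class cross-entropy over the sampled transformations."""
    losses = [-np.log(softmax(W @ (a * x + c) + b)[target]) for a, c in transforms]
    return float(np.mean(losses))

def eot_gradient(x, W, b, target, transforms):
    """Gradient of expected_loss with respect to the input x."""
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    grad = np.zeros_like(x)
    for a, c in transforms:
        probs = softmax(W @ (a * x + c) + b)
        grad += a * (W.T @ (probs - onehot))  # chain rule through t(x) = a * x + c
    return grad / len(transforms)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))          # hypothetical weights: 3 classes, 8 input features
b = np.zeros(3)
x = rng.uniform(0.2, 0.8, size=8)
transforms = sample_transforms(rng, 16)
g = eot_gradient(x, W, b, target=1, transforms=transforms)
```

An attack loop would use `g` in place of the single-input gradient, so that the resulting example lowers the target loss on average across the sampled conditions rather than for one particular capture.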

2.2.3. Threat models

Based on the degree of access the adversary has to the classifier under attack, different threat models can be defined. The two threat models we consider in this paper are:
White-box: In this threat model, the adversary has full access to the classifier under attack, including any possible defenses that may have been employed, as well as the training and test data (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b; Athalye et al., 2018). This is the threat model under which the evaluation of defenses has been recommended in the literature (Carlini and Wagner, 2017b; Athalye et al., 2018) in order to avoid reliance on ‘security through obscurity’ and is the primary threat model used in this paper.
Black-box with query access: This is a more restrictive threat model, where the adversary is only assumed to have access to the output probability distribution for any input (Dang et al., 2017; Narodytska and Kasiviswanathan, 2016; Chen et al., 2017; Bhagoji et al., 2018; Ilyas et al., 2018). No access to either the model structure and parameters or the training and test data is assumed. This is a secondary threat model we use to demonstrate attacks on cloud-based classifiers (Clarifai, 2019) (Section 5.4).

2.3. Defenses against evasion attacks

In order to defend against evasion attacks, there are broadly two approaches used in the literature as noted in the review by Papernot et al. (Papernot et al., 2016b). The first, namely robust training, seeks to embed resilience into the classifier during training by modifying the standard loss used during the training process to one that accounts for the presence of an adversary at test time (Goodfellow et al., 2015; Tramèr et al., 2018; Madry et al., 2018; Kolter and Wong, 2018; Raghunathan et al., 2018). The second approach adds additional pre-processing steps at the input in order to detect or defend against adversarial examples (Bhagoji et al., 2017; Meng and Chen, 2017; Xu et al., 2018; Guo et al., 2017; Xie et al., 2017a; Samangouei et al., 2018; Song et al., 2017).

2.3.1. Robust training

Robust training can be performed either in a heuristic manner through the use of adversarial training (Goodfellow et al., 2015; Kurakin et al., 2017; Tramèr et al., 2018; Madry et al., 2018; Kannan et al., 2018) or in a provable manner with the use of neural network verification (Ehlers, 2017; Gowal et al., 2018; Wang et al., 2018c, b, a; Weng et al., 2018; Xiao et al., 2018; Gehr et al., 2018). Since exact verification of state-of-the-art neural networks is inefficient due to their size and complexity, convex relaxations are used for training provably robust networks (Kolter and Wong, 2018; Wong et al., 2018; Raghunathan et al., 2018; Sinha et al., 2018).

Adversarial training: These heuristic methods defend against adversarial examples by modifying the loss function such that it incorporates both clean and adversarial inputs:

$$\tilde{\ell}_g(x, y) = \alpha \, \ell_g(x, y) + (1 - \alpha) \, \ell_g(\tilde{x}, y),$$

where $y$ is the true label of the sample $x$, $\tilde{x}$ is an adversarial example generated from $x$, and $\alpha$ weights the clean and adversarial terms. In this paper, we consider the robustness of networks trained using iterative adversarial training (Madry et al., 2018), which uses adversarial examples generated using Projected Gradient Descent (PGD). This method has been shown to be empirically robust to adaptive white-box adversaries using adversarial examples generated from in-distribution data which use the same norm (Athalye et al., 2018) for models trained on the MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky et al., 2014) datasets.

Provable robustness using convex relaxations: We focus on the approach of Kolter and Wong (Kolter and Wong, 2018; Wong et al., 2018), which scales to neural networks on the CIFAR-10 dataset (Krizhevsky et al., 2014). They aim to certify robustness in an $\epsilon$-ball around any point in the input space by upper bounding the adversarial loss with a convex surrogate. They find a convex outer approximation of the activations in a neural network that can be reached with a perturbation bounded within an $\epsilon$-ball and show that an efficient linear program can be used to minimize the worst case loss over this region. In Section 5.2, we show reduced effectiveness of these two robust training based defenses for OOD adversarial examples.

2.3.2. Adversarial Example Detectors and secondary defenses

Seeking to exploit the difference in the properties of adversarial and unmodified inputs for in-distribution data, a large number of adversarial example detectors have been proposed (Meng and Chen, 2017; Xu et al., 2018; Guo et al., 2017; Xie et al., 2017a; Grosse et al., 2017a; Samangouei et al., 2018; Song et al., 2017). We consider two of the most promising detectors, namely feature squeezing (Xu et al., 2018) and MagNet (Meng and Chen, 2017). Both methods use input pre-processing in order to distinguish between benign and adversarial examples, but feature squeezing performs detection solely based on classifier outputs, while MagNet can perform detection at both the input and output. Unfortunately, these methods are not robust against adaptive white-box adversaries generating adversarial examples from in-distribution data (He et al., 2017; Carlini and Wagner, 2017a; Athalye et al., 2018). In Section 5.3, we show that this lack of robustness persists for OOD adversarial examples.

Iterative adversarial training proposed by Madry et al. (Madry et al., 2018) does not converge for ImageNet-scale models (Madry et al., 2018). The Adversarial Logit Pairing (ALP) (Kannan et al., 2018) defense claimed to provide robustness for ImageNet-scale models by adding an additional loss term proportional to the distance between the logits $\phi^f(x)$ and $\phi^f(\tilde{x})$ during training, with $\tilde{x}$ generated using PGD. However, it was shown that simply increasing the number of PGD iterations used to generate adversarial examples from in-distribution data reduced the additional robustness to a negligible amount (Engstrom et al., 2018). In Section 5.2, we show that this lack of robustness persists for OOD adversarial examples.

2.4. Open-world Deep learning

The closed-world approach to deep learning, described in Section 2.1, operates using the assumption that both training and test data are drawn from the same application-specific distribution $P^{in}_{X,Y}$. However, in a real-world environment, ML systems are expected to encounter data at test time that is not drawn from $P^{in}_{X,Y}$ but belongs to the same input space, i.e. they encounter samples that are out-of-distribution (OOD). This leads to the open-world learning model. Thus, in order to extend supervised learning algorithms to the open-world learning model, it is critical to enable them to reject out-of-distribution inputs. The importance of this learning model is highlighted by the fact that a number of security and safety-critical applications such as biometric authentication, intrusion detection, autonomous driving, and medical diagnosis are natural settings for the use of open-world machine learning (Günther et al., 2017; Chalapathy and Chawla, 2019; Ramanagopal et al., 2018; Litjens et al., 2017).

2.4.1. Out-of-Distribution data

To design and evaluate the success of an open-world ML approach, it is first critical to define out-of-distribution data. Existing work on open-world machine learning (Bendale and Boult, 2015; Dhamija et al., 2018; Bendale and Boult, 2015) defines an example as OOD if it is drawn from a marginal distribution $P^{out}_{X}$ (over $\mathcal{X}$, the input feature space) which is different from the in-distribution marginal $P^{in}_{X}$ and has a label set that is disjoint from that of in-distribution data. As a concrete example, consider a classifier trained on the CIFAR-10 image dataset (Krizhevsky and Hinton, 2009). This dataset only has 10 output classes and does not include classes for digits such as ‘3’ or ‘7’ like the MNIST (LeCun, 1998) dataset or ‘mushroom’ or ‘building’ like the ImageNet dataset (Deng et al., 2009). Thus, these datasets can act as a source of OOD data.

2.4.2. OOD detectors

Here, we only review recent approaches to OOD detection that scale to DNNs used for image classification. Hendrycks and Gimpel (Hendrycks and Gimpel, 2017) proposed a method for detecting OOD inputs for neural networks which uses a threshold for the output confidence vector to classify an input as in/out-distribution. This method relies on the assumption that the classifier will tend to have higher confidence values for in-distribution examples than OOD examples. An input is classified as being OOD if its output confidence value is smaller than a certain learned threshold. In this work, we evaluate the state-of-the-art OOD detectors for DNNs proposed by Liang et al. (Liang et al., 2018) (ODIN) and Lee et al. (Lee et al., 2018a), which also use output thresholding for OOD detection but significantly improve upon the baseline approach of Hendrycks and Gimpel (Hendrycks and Gimpel, 2017). The ODIN detector uses temperature scaling and input pre-processing to improve detection rates. Lee et al. (Lee et al., 2018a) propose a modification to the training procedure to ensure that the neural network outputs a confidence vector which has probabilities uniformly distributed over classes for OOD inputs. However, as OOD inputs are unavailable at training time, they generate synthetic data, using a modified Generative Adversarial Network (GAN), which lies on the boundary between classes to function as OOD data.
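Confidence thresholding, the mechanism shared by these detectors, can be sketched in a few lines. The example below is an illustrative simplification: it implements only the maximum-confidence test with an optional temperature parameter, and omits ODIN's input pre-processing and the training changes of Lee et al. The logits and the threshold value of 0.9 are hypothetical.

```python
import numpy as np

def max_confidence(logits, temperature=1.0):
    """Largest softmax probability per row, after dividing logits by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p = p / p.sum(axis=1, keepdims=True)
    return p.max(axis=1)

def flag_ood(logits, threshold, temperature=1.0):
    """True where an input is rejected as OOD (confidence below the threshold)."""
    return max_confidence(logits, temperature) < threshold

# Hypothetical logits: a peaked row (in-distribution-like) and a flat row (OOD-like).
logits = np.array([[6.0, 0.5, 0.2],
                   [1.0, 0.9, 1.1]])
flags = flag_ood(logits, threshold=0.9)
```

With these values the peaked row is accepted and the flat row is rejected. In practice the threshold would be chosen on held-out data rather than fixed by hand.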

Summary and implications: Most previous work on adversarial examples has used data from the training or test datasets as starting points for the generation of adversarial examples. However, given the importance of open-world machine learning for the deployment of machine learning systems in practice, this choice fails to consider an important attack vector. On the other hand, previous work dealing with the open-world learning paradigm (Section 2.4) has largely not considered the possibility that this data could be adversarial, in addition to being OOD. In the next section, we demonstrate how we combine the two paradigms of open-world learning and evasion attacks.

3. Open-world Evasion Attacks

Behavior on data type (example images of OOD adversarial examples not shown):

Defense Type | Defense Name | In-distribution adversarial (white-box, adaptive) | Out-of-distribution adversarial (white-box, adaptive)
None | Undefended | Not robust (Madry et al., 2018; Carlini and Wagner, 2017b); Rate: 100.0, Conf: 1.00 | Not robust; Rate: 100.0, Conf: 1.00 (ImageNet)
Robust Training | Iterative adv. training (Madry et al., 2018) | Somewhat robust (Madry et al., 2018); Rate: 22.9, Conf: 0.81 | Not robust; Rate: 87.9, Conf: 0.86 (Gaussian Noise)
Robust Training | Convex polytope relaxation (Kolter and Wong, 2018) | Provably robust (Kolter and Wong, 2018); Rate: 15.1, Conf: 0.41 | Somewhat robust; Rate: 29.1, Conf: 0.32 (Gaussian Noise)
OOD Detection | ODIN (Liang et al., 2018) | N.A. | Not robust; Rate: 81.6, Conf: 0.97 (Internet Photographs)
OOD Detection | Confidence-calibrated (Lee et al., 2018a) | N.A. | Somewhat robust; Rate: 47.1, Conf: 0.99 (VOC12)

Table 2. Summary of results on CIFAR-10 trained models. Novel results and empirical conclusions from this paper are in bold. The last column shows the successful OOD adversarial examples for each defense, where each of them is generated to be classified as the target class by the network.

Deployed ML systems must be able to robustly handle inputs which are drawn from distributions other than those used for training/testing. We thus define open-world evasion attacks, which can use arbitrary points from the input space to generate out-of-distribution adversarial examples, making our work the first to combine the paradigms of adversarial examples and open-world learning. We then analyze how they are effective in bypassing OOD detectors and defenses for in-distribution adversarial examples, including adversarial example detectors, making them a potent threat in the open-world machine learning model. These results are summarized in Table 2. Finally, we examine how robustness against OOD adversarial examples can be achieved.

3.1. OOD adversarial examples

In the open-world learning model, an adversary can generate adversarial examples using OOD data, and is not restricted to in-distribution data. In order to carry out an evasion attack in this setting, the adversary generates an OOD adversarial example starting from an OOD example $x_{OOD}$.

Definition 1 (OOD adversarial examples).

An OOD adversarial example $\tilde{x}_{OOD}$ is generated from an OOD example $x_{OOD}$ drawn from the OOD distribution $P^{out}_{X}$ by adding a perturbation $\delta$, i.e. $\tilde{x}_{OOD} = x_{OOD} + \delta$, with the aim of inducing classification in a target class $T$, i.e. $f(\tilde{x}_{OOD}) = T$.

Any attack method used to generate adversarial examples starting from in-distribution data (Szegedy et al., 2014; Goodfellow et al., 2015; Chen et al., 2018b; Athalye et al., 2018), even against defended classifiers, can be used to generate OOD adversarial examples. These attack methods are typically calibrated to ensure that adversarial examples are misclassified with high confidence (see Section 4 on our design choices). Next, we highlight the importance of constructing OOD adversarial examples to fool classifiers by discussing the limitations of directly using unmodified/benign OOD data.

Limitations of unmodified OOD data for evasion attacks: While unmodified OOD data already represents a concern for the deployment of ML classifiers (Section 2.4), we now discuss why it is severely limited from an adversarial perspective. First, with unmodified OOD data, the typical output confidence values are small, while an attacker aims for high-confidence targeted misclassification. Second, the attacker will have no control over the target class reached by the unmodified OOD example. Finally, due to the low typical output confidence values of unmodified OOD examples, they can easily be detected by state-of-the-art OOD detectors (Section 2.4.2), which rely on confidence thresholding. We provide quantitative evidence for this discussion by comparing the behavior of state-of-the-art classifiers on unmodified OOD data and OOD adversarial examples (details in Appendix B).

3.1.1. Evading OOD detectors

OOD detectors are an essential component of any open-world learning system and OOD adversarial examples should be able to bypass them in order to be successful. State-of-the-art OOD detectors mark inputs which have confidence values below a certain threshold as being OOD. The intuition is that a classifier is more likely to be highly confident on in-distribution data. Recall that when generating adversarial examples, the adversary aims to ensure high-confidence misclassification in the desired target class. The goal of an adversary seeking to generate high-confidence targeted OOD adversarial examples will align with that of an adversary aiming to bypass an OOD detector. In other words, OOD adversarial examples that achieve high-confidence targeted misclassification also bypass OOD detectors. Our empirical results in Section 5.1 demonstrate this conclusively, with OOD adversarial examples inducing high false negative rates in the OOD detectors, which mark them as being in-distribution.
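The alignment between the two goals can be made concrete with a toy example: an input on which the model is maximally uncertain is flagged by a confidence-threshold detector, while a bounded targeted perturbation of the same input is not. Everything in this sketch is hypothetical (the hand-built linear model `W`, the flat gray input, the threshold of 0.9, and the attack parameters); it illustrates the mechanism and is not a reproduction of the experiments in Section 5.1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical 3-class linear model: each class reads two of six input features.
W = 5.0 * np.array([[1, 1, 0, 0, 0, 0],
                    [0, 0, 1, 1, 0, 0],
                    [0, 0, 0, 0, 1, 1]], dtype=float)
THRESHOLD = 0.9  # assumed detector threshold on the maximum confidence

def is_flagged_ood(x):
    """Confidence-threshold detector: reject inputs with low maximum confidence."""
    return softmax(W @ x).max() < THRESHOLD

def ood_targeted_attack(x, target, eps=0.3, step=0.05, iters=20):
    """Signed-gradient steps toward `target`, projected into an L-infinity eps-ball."""
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    x_adv = x.copy()
    for _ in range(iters):
        grad = W.T @ (softmax(W @ x_adv) - onehot)  # cross-entropy gradient w.r.t. input
        x_adv = np.clip(x_adv - step * np.sign(grad), x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

x_ood = np.full(6, 0.5)  # flat gray input on which all logits tie
x_adv = ood_targeted_attack(x_ood, target=2)

flagged_before = is_flagged_ood(x_ood)  # unmodified input: confidence 1/3, rejected
flagged_after = is_flagged_ood(x_adv)   # perturbed input: high confidence, accepted
```

The attack never references the detector; raising the target-class confidence is enough to cross the threshold, which is the alignment of objectives described above.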

3.1.2. Evading robust training based defenses

Robustly trained neural networks (Madry et al., 2018; Kolter and Wong, 2018; Raghunathan et al., 2018) (recall Section 2.3.1) incorporate the attack strategy into the training process. Since the training and test data are drawn in an i.i.d. fashion, the resulting neural networks are robust at test time as well. However, these networks may not be able to provide robustness if the attack strategy were to be changed. In particular, we change the starting point for the generation of adversarial examples, and since the training process for these robust defenses does not account for the possibility of OOD data being encountered at test time, they remain vulnerable to OOD adversarial examples. We find that for defenses based on robust training, OOD adversarial examples are able to increase the targeted success rate by up to 4× (Sections 5.2.2 and 5.2.3). This finding illustrates the potent threat of open-world evasion attacks, which must be addressed for secure deployment of ML models in practice. We further demonstrate that adversarial example detectors such as MagNet and Feature Squeezing can be similarly bypassed by incorporating the metrics and pre-processing they use into the attack objective for OOD adversarial examples.

3.1.3. Real-world attacks

Since the aim of the open-world threat model is to elucidate the wider range of possible threats to a deployed ML model than previously considered, we analyze the possibility of realizing OOD adversarial examples in the following real-world settings:

  1. Physical attacks: We consider attacks on a traffic sign recognition system where an adversary uses custom signs and logos in the environment as a source of OOD adversarial examples, since the classifier has only been trained on traffic signs. In a physical setting, there is the additional challenge of ensuring that the OOD adversarial examples remain adversarial in spite of environmental factors such as lighting and angle. We ensure this by incorporating random lighting, angle and re-sizing transformations into the OOD adversarial example generation process (Athalye et al., 2017; Anonymous, 2018; Evtimov et al., 2018).

  2. Query-limited black-box attacks: We use OOD adversarial examples to carry out a Denial of Service style attack on a content moderation model provided by Clarifai (Clarifai, 2019), by classifying clearly unobjectionable content as objectionable with high confidence. Since we only have query-access to the model being attacked, the model gradients usually needed to generate adversarial examples (see Section 4.2) have to be estimated. This is done using the finite difference method with random grouping based query-reduction (Bhagoji et al., 2018).

Our results in Section 5.4 show that OOD adversarial examples remain effective in these settings and are able to successfully attack content moderation and traffic sign recognition systems.
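The gradient estimation step used in the query-limited setting can be illustrated with a small sketch. It shows a two-sided finite-difference estimate in which coordinates are perturbed in random groups to reduce the number of queries. The function name `estimate_gradient`, the parameters `delta` and `group_size`, and the choice to assign each coordinate the average slope of its group are assumptions made for illustration; they are not taken from Bhagoji et al. (2018).

```python
import random


def estimate_gradient(loss_fn, x, delta=0.01, group_size=4, seed=0):
    """Two-sided finite-difference gradient estimate with random grouping.

    Coordinates are shuffled and split into groups. All coordinates in a group
    are perturbed together, so the query count drops from 2 * len(x) to about
    2 * len(x) / group_size. Each coordinate receives the average slope of its
    group (an assumption of this sketch).
    """
    rng = random.Random(seed)
    indices = list(range(len(x)))
    rng.shuffle(indices)

    grad = [0.0] * len(x)
    queries = 0
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        x_plus, x_minus = list(x), list(x)
        for i in group:
            x_plus[i] += delta
            x_minus[i] -= delta
        # Two queries to the black-box model per group.
        slope = (loss_fn(x_plus) - loss_fn(x_minus)) / (2.0 * delta)
        queries += 2
        for i in group:
            grad[i] = slope / len(group)
    return grad, queries
```

Here `loss_fn` stands in for a function that queries the remote model and returns a scalar loss. The estimate is exact only when the true partial derivatives within a group are equal; otherwise grouping trades accuracy for fewer queries.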

3.2. Towards Robust Open-World Deep Learning

A robust open-world deep learning system is expected to satisfy the following two properties: (i) It should have high accuracy in detecting unmodified and adversarial OOD inputs; (ii) It should have high accuracy in classifying unmodified and adversarial in-distribution inputs. To move towards a robust open-world deep learning system, we take inspiration from previous work on selective prediction (El-Yaniv and Wiener, 2010; Chang and Lippmann, 1994; Geifman and El-Yaniv, 2017; McCoyd and Wagner, 2018) which augments classifiers for in-distribution data with an additional class (referred to as background class) so they can be extended to open-world learning environment and detect OOD inputs. Further, since iterative adversarial training (Madry et al., 2018) enables the robust classification of in-distribution adversarial examples, we can intuit that a similar approach may provide robustness to OOD adversarial examples. Thus, we examine a hybrid approach where we use iterative adversarial training to ensure robust classification of OOD data, both unmodified and adversarial, to the background class. Similar to other OOD detection approaches (Liang et al., 2018; Lee et al., 2018a; Hendrycks and Gimpel, 2017; Hendrycks et al., 2018), selective prediction is semi-supervised, i.e. it assumes access to a small subset of OOD data at training time. We note that since all of these state-of-the-art approaches consider the detection of specific (multiple) OOD datasets, we follow the same methodology for robust OOD classification. To achieve robust classification in the open-world environment, we perform iterative adversarial training with the following loss function:


$$\mathcal{L}(P_{in}, P_{out}) = \mathbb{E}_{(x_{in}, y) \sim P_{in}}\left[\tilde{\ell}(x_{in}, y)\right] + \mathbb{E}_{x_{out} \sim P_{out}}\left[\tilde{\ell}(x_{out}, y_b)\right]$$

where $x_{in}$ is an in-distribution sample, $x_{out}$ is an OOD sample, $y$ is the true label for sample $x_{in}$ and $y_b$ is the background class. $\tilde{\ell}(\cdot, \cdot)$ refers to the robust loss used in adversarial training (Eq. 2). The question now arises: to what extent does this formulation satisfy the two desired properties of a robust open-world learning system? In particular, we examine whether the following goals are feasible using small subsets of OOD data: i) robust classification of a single OOD dataset, ii) generalization of robustness to multiple OOD datasets while training with a single one, and iii) simultaneous robustness to multiple OOD datasets while training with data from all of them. Again, we emphasize that these must be achieved while maintaining high accuracy on in-distribution data. Our evaluation in Section 6 answers these questions in the affirmative. For example, we observe that a subset as small as 0.5% of the total number of samples from an OOD dataset can significantly enhance robustness against OOD adversarial examples.
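As a rough illustration of the hybrid objective described above, the sketch below combines a robust loss on labeled in-distribution samples with the same robust loss on OOD samples that are all assigned the background class. The callable `robust_loss`, the function name `open_world_loss`, and the equal weighting of the two terms are assumptions of this sketch, not details confirmed by the text.

```python
def open_world_loss(robust_loss, in_batch, ood_batch, background_class):
    """Mean robust loss on in-distribution pairs (x, y) plus the mean robust
    loss on OOD inputs, each labelled with the background class.

    `robust_loss(x, label)` is a hypothetical callable returning the
    adversarial-training loss for one sample.
    """
    in_term = sum(robust_loss(x, y) for x, y in in_batch) / len(in_batch)
    ood_term = sum(robust_loss(x, background_class) for x in ood_batch) / len(ood_batch)
    # Equal weighting of the two terms is an assumption of this sketch.
    return in_term + ood_term
```

In practice `robust_loss` would first run an inner attack (for instance PGD) on the sample and then evaluate the classification loss on the perturbed input; the classifier would have one output more than the number of in-distribution classes.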

4. Design Choices for Open-World Evasion Attacks

In this section, we present and examine the design choices we make to carry out our experiments on both evaluation and training of classifiers in the open-world model. In particular, we discuss the types of datasets, attack methods, models and metrics we consider.

4.1. Datasets

We consider 3 publicly available datasets for image classification as sources of in-distribution data for training. These are MNIST (LeCun, 1998), CIFAR-10 (Krizhevsky et al., 2014), and ImageNet (ILSVRC12 release) (Deng et al., 2009). When one of the above datasets is used for training, all the other datasets we consider are used as a source of OOD data. We consider two types of OOD data: i) semantically meaningful OOD data and ii) noise OOD data.

Semantically meaningful OOD data: Datasets such as MNIST, CIFAR-10 and ImageNet are semantically meaningful as the images they contain generally have concepts recognizable to humans. To further explore the space of semantically meaningful OOD data, we also consider the VOC12 (Everingham et al., 2010) dataset as a source of OOD data. Further, to emulate the type of data that may be encountered in an open-world setting, we construct an Internet Photographs dataset by gathering 10,000 natural images from the internet using the Picsum service (Picsum Authors, 2019). To avoid any ambiguity over examples from different datasets that contain similar semantic information, we ensure the label set for semantically meaningful OOD examples is distinct from that of the in-distribution dataset. Additional details of these datasets are in Appendix A.

Noise OOD data: By definition, OOD data does not have to contain recognizable concepts. Thus, we construct a Gaussian Noise dataset consisting of 10,000 images, for each of which the pixel values are sampled from a Gaussian distribution. This dataset is very different from the image datasets considered before as it consists of random points in the input space which have no relation to any naturally occurring images. In settings where inputs to an ML classifier are not checked for any sort of semantics, this dataset is a viable input and thus must be analyzed when checking for robustness.
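A noise dataset of this kind could be generated along the following lines. The mean, standard deviation, image size, and the rounding and clipping to [0, 255] are assumptions of this sketch; the text only states that pixel values are sampled from a Gaussian distribution.

```python
import random


def gaussian_noise_dataset(n_images, height=32, width=32, channels=3,
                           mean=127.5, std=50.0, seed=0):
    """Return n_images nested lists of shape height x width x channels.

    Every pixel value is an independent Gaussian sample, rounded to an integer
    and clipped to the valid 8-bit range [0, 255]. The default mean and std
    are illustrative choices, not values taken from the paper.
    """
    rng = random.Random(seed)

    def pixel():
        return min(255, max(0, round(rng.gauss(mean, std))))

    return [[[[pixel() for _ in range(channels)]
              for _ in range(width)]
             for _ in range(height)]
            for _ in range(n_images)]
```

Calling `gaussian_noise_dataset(10000)` would produce a set matching the size described above for 32×32 colour images; a fixed seed keeps the set reproducible across experiments.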

4.2. OOD Evasion Attack Methods

Iterative optimization based methods have been found to be the most effective for solving the relaxed version of Eq. 1 (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b; Madry et al., 2018; Athalye et al., 2018; Moosavi-Dezfooli et al., 2016) to generate successful in-distribution adversarial examples regardless of the choice of distance function (Madry et al., 2018; Athalye et al., 2018). Thus, we exclusively use these to generate OOD adversarial examples.

4.2.1. Loss functions and optimizers

We use two commonly used loss functions to generate OOD adversarial examples. The first is the standard cross-entropy loss (Goodfellow et al., 2016). The second is the logit loss from Carlini and Wagner (Carlini and Wagner, 2017b):


$$\ell_{CW}(x, T) = \max\left(\max_{i \neq T} Z(x)_i - Z(x)_T,\; -\kappa\right)$$

where $Z(x)$ denotes the logits of the classifier on input $x$, $T$ is the target label, and $\kappa$ is a confidence parameter that can be adjusted to control the strength of the adversarial example.

Optimizer choice: The loss is minimized using Projected Gradient Descent (PGD), which we choose for its state-of-the-art attack performance (Madry et al., 2018; Athalye et al., 2018). PGD iteratively minimizes the loss while projecting the obtained minimizer onto the constraint set $\mathcal{H}$ (a subset of the input space depending on $\epsilon$) for $k$ iterations,

$$x^{t+1} = \Pi_{\mathcal{H}}\left(x^{t} - \alpha\,\mathrm{sign}\left(\nabla_{x^{t}}\, \ell(x^{t}, T)\right)\right), \quad t = 0, \dots, k-1,$$

where $\Pi_{\mathcal{H}}$ is a projection operator, $\alpha$ is the step size, $x^{0}$ is the unmodified starting point, and $x^{k}$ is the final adversarial example. This formulation allows us to work with fixed constraint sets $\mathcal{H}$, which we use throughout to compare across datasets, attacks and defenses. As noted by Athalye et al. (Athalye et al., 2018), the exact choice of the optimizer does not matter much in practice, as long as an iterative optimization procedure is used. We refer to the use of PGD with the cross-entropy loss as PGD-xent and with the logit loss from Carlini and Wagner as PGD-CW. We run PGD for - iterations for every attack. Attacks to generate OOD adversarial examples usually converge within 100 iterations. If they do not converge in 100 iterations, we monitor the loss and stop once it has plateaued. The step size is a hyper-parameter that is varied in each experiment depending on the maximum possible perturbation and attack performance.

Note. While most of our experiments are for adaptive white-box adversaries, in Section 5.4 we use query-based black-box attacks. These also rely on PGD but, lacking access to the true gradient, use estimated gradients instead.
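The iterative procedure can be made concrete with a toy sketch of a targeted PGD-xent attack. It assumes a linear softmax classifier with weight matrix `W` so that the gradient can be written in closed form, an L-infinity constraint set around the starting point, and inputs in [0, 1]; all names here are illustrative, and the experiments in the paper instead attack deep networks with automatic differentiation.

```python
import math


def logits(W, x):
    """Logits of a linear classifier: one dot product per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def xent_grad(W, x, target):
    """Gradient of -log p_target with respect to x for the linear model."""
    p = softmax(logits(W, x))
    err = [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def pgd_targeted(W, x0, target, eps, alpha, steps):
    """Targeted PGD: signed gradient descent on the loss toward `target`,
    projected after every step onto the L-infinity ball of radius eps around
    x0 and onto the valid input range [0, 1]."""
    x = list(x0)
    for _ in range(steps):
        g = xent_grad(W, x, target)
        x = [xi - alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x
```

For example, with `W = [[1.0, 0.0], [0.0, 1.0]]` and `x0 = [0.6, 0.4]`, calling `pgd_targeted(W, x0, target=1, eps=0.2, alpha=0.05, steps=10)` moves the input to roughly `[0.4, 0.6]`, which the toy model assigns to the target class. Note that `x0` can be any point in the input space, including an OOD input.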

4.2.2. Choosing targets for attacks

We only consider targeted attacks for two reasons. First, they are strictly harder than non-targeted attacks for the adversary (Carlini and Wagner, 2017b). Second, unmodified OOD examples have no ground truth labels, which raises difficulties in defining untargeted attacks and comparing them to the in-distribution case. We use two methods for target label selection for our attacks: i) random: this refers to the selection of a random target label from a set that excludes the current predicted label for an unmodified example (referred to as ‘rand’); ii) least likely: this refers to the selection of the label with the least value in the output confidence vector for a given input (referred to as ‘LL’).
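The two target selection rules can be sketched as follows. The function name `select_target` and the representation of the model output as a plain list of confidences are assumptions made for illustration.

```python
import random


def select_target(confidences, mode, rng=random):
    """Pick a target label from the output confidence vector of an
    unmodified input.

    mode == "rand": uniformly random label other than the predicted one.
    mode == "LL":   label with the lowest output confidence.
    """
    labels = range(len(confidences))
    if mode == "rand":
        predicted = max(labels, key=confidences.__getitem__)
        return rng.choice([k for k in labels if k != predicted])
    if mode == "LL":
        return min(labels, key=confidences.__getitem__)
    raise ValueError("mode must be 'rand' or 'LL'")
```

Because both rules depend only on the model output, they apply unchanged to OOD inputs, which have no ground-truth label.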

4.2.3. Overcoming gradient obfuscation

Defenses such as feature squeezing (Xu et al., 2018) obfuscate gradients, which can cause PGD-based attacks to fail due to the presence of uninformative gradients (Athalye et al., 2018; Uesato et al., 2018). In particular, Athalye et al. (Athalye et al., 2018) demonstrated that these obfuscated gradients are caused simply due to complex input pre-processing which can be approximated by simpler, differentiable transformations through which gradients are able to pass. We adopt this approach for our experiments in Section 5.3.

4.2.4. Distance constraints

In this paper, we use the $L_\infty$ perturbation constraint for most of our attacks, except for the attack on feature squeezing (Xu et al., 2018), where the $L_\infty$ metric cannot be used due to bit depth reduction and thus the $L_2$ perturbation is adopted instead. These metrics are widely used to generate in-distribution adversarial examples because examples $x$ and $x'$ that are $\epsilon$-close in these distance metrics are also visually similar (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b). For any vector $z$, these metrics are: i) $\|z\|_\infty = \max_i |z_i|$, i.e. the absolute value of the largest component of the vector; ii) $\|z\|_2 = \sqrt{\sum_i z_i^2}$, i.e. the square root of the sum of the squared values of the components.

Why use distance constraints for OOD adversarial examples?: There are two main reasons why we use distance constraints to generate OOD adversarial examples. The first reason, which applies only to semantically meaningful OOD data, is to model the content in the input that the adversary wishes to preserve, in spite of it being OOD. In other words, the starting point itself models a constraint on the adversary, which may arise from the environment (see Section 5.4.2 for an example for traffic sign recognition systems) or from the need to prevent the OOD adversarial example from having undesirable artifacts, e.g. turning a non-objectionable image into an objectionable one (see Section 5.4.1). The second reason, which applies to both semantically meaningful and noise OOD data, stems from the need to measure the spread of successful OOD adversarial examples in the input space. Previous work has measured the spread of adversarial examples around the training and test data in terms of distance constraints and found that for undefended models adversarial examples are present close to their starting points. On the other hand, while the use of robust training defenses makes it challenging to find in-distribution adversarial examples within a given constraint set, we show that for OOD data, successful OOD adversarial examples can be found in small balls around unmodified starting points for both undefended and defended models. We note that for noise OOD data it is possible to relax distance constraints to generate OOD adversarial examples, which will lead to higher attack success rates. In Section 5.4.2, we also demonstrate open-world evasion attacks using custom signs on traffic sign recognition systems that do not restrict the perturbation budget.
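For concreteness, the two distance metrics and the corresponding projections onto a ball of radius `eps` around a starting point `x0` can be written as below. The function names are illustrative and vectors are represented as plain lists.

```python
import math


def linf_norm(v):
    """Absolute value of the largest component."""
    return max(abs(c) for c in v)


def l2_norm(v):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))


def project_linf(x, x0, eps):
    """Clip every coordinate of x into [x0_i - eps, x0_i + eps]."""
    return [min(max(a, b - eps), b + eps) for a, b in zip(x, x0)]


def project_l2(x, x0, eps):
    """Rescale the perturbation x - x0 so that its L2 norm is at most eps."""
    delta = [a - b for a, b in zip(x, x0)]
    norm = l2_norm(delta)
    if norm <= eps:
        return list(x)
    scale = eps / norm
    return [b + scale * d for b, d in zip(x0, delta)]
```

For the perturbation `[3.0, -4.0]`, `linf_norm` gives 4.0 and `l2_norm` gives 5.0, which illustrates that the same perturbation can satisfy a budget under one metric and violate it under the other.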

4.3. Metrics and Models

4.3.1. Evaluation metrics

We consider the following metrics to measure the performance and robustness of image classifiers:
Target success rate: This is the percentage of adversarial examples classified as the desired target, which measures model robustness.
Classification accuracy: This is the percentage of in-distribution test data for which the predicted label matches the ground-truth label. It is not reported for OOD data as it has no ground truth labels.
Mean classification confidence: This is the mean value of the output probability corresponding to the predicted label of the classifier on correctly classified inputs in the case of benign in-distribution data. For adversarial examples, both in-distribution and OOD, it is the mean value of the output probability corresponding to the target label for successful adversarial examples. The confidence values lie in [0, 1].
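The three metrics can be computed from the classifier's output probability vectors as in the sketch below. The function names and the list-based data layout are assumptions made for illustration.

```python
def predicted_label(probs):
    """Index of the largest output probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def target_success_rate(prob_vectors, targets):
    """Percentage of adversarial examples classified as their target label."""
    hits = sum(predicted_label(p) == t for p, t in zip(prob_vectors, targets))
    return 100.0 * hits / len(targets)


def classification_accuracy(prob_vectors, labels):
    """Percentage of in-distribution inputs whose prediction matches the label."""
    hits = sum(predicted_label(p) == y for p, y in zip(prob_vectors, labels))
    return 100.0 * hits / len(labels)


def mean_confidence(prob_vectors, reference_labels):
    """Mean output probability of the reference label, taken only over inputs
    predicted as that label. Use true labels for benign data (correctly
    classified inputs) and target labels for adversarial examples (successful
    attacks)."""
    matched = [p[r] for p, r in zip(prob_vectors, reference_labels)
               if predicted_label(p) == r]
    return sum(matched) / len(matched) if matched else 0.0
```

Restricting the confidence average to matched inputs mirrors the definitions above: failed attacks and misclassified benign inputs do not contribute to the mean.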

4.3.2. Models

We experiment with three robust training defenses (Iterative adversarial training (Madry et al., 2018), Adversarial logit pairing (Kannan et al., 2018), and robust training with convex outer polytope (Kolter and Wong, 2018)), two adversarial example detectors (Feature Squeezing (Xu et al., 2018) and MagNet (Meng and Chen, 2017)), and two OOD detectors (ODIN (Liang et al., 2018) and Confidence calibrated classifiers (Lee et al., 2018a)). These works use multiple different neural network models, each trained on the MNIST, CIFAR-10, or ImageNet datasets. We show the architecture and performance of the 11 DNNs from these works used in this paper in Table 3. We use multiple models for each dataset to carefully compare with previous work. All the DNNs we consider are convolutional neural networks as these lead to the state-of-the-art performance on image classification (ima, 2016). To characterize performance on in-distribution data, we report both the classification accuracy and mean classification confidence on these models. Following the convention in previous work (Madry et al., 2018; Carlini and Wagner, 2016; Kolter and Wong, 2018), we report the perturbation budget for models trained on the MNIST dataset using the [0,1] scale instead of [0,255]. All experiments are run on a GPU cluster with 8 Nvidia P100 GPUs using mainly Python-based Tensorflow (Abadi et al., 2015) to implement and run DNNs along with the Numpy and Scipy packages for other mathematical operations. We also use Pytorch (Paszke et al., 2017) in cases when previous work that we build on has used it. All our code and data will be publicly released for the purposes of reproducible research.

Dataset                              Model                                                               Classification accuracy (%)   Mean confidence
MNIST (LeCun, 1998)                  4-layer CNN (Madry et al., 2018)*                                   98.8                          0.98
                                     4-layer CNN (Kolter and Wong, 2018)                                 98.2                          0.98
                                     7-layer CNN (Carlini and Wagner, 2017b)                             99.4                          0.99
CIFAR-10 (Krizhevsky et al., 2014)   Wide Residual net (WRN-28-10) (Zagoruyko and Komodakis, 2016)       95.1                          0.98
                                     WRN-28-10-A (Zagoruyko and Komodakis, 2016; Madry et al., 2018)*    87.2                          0.93
                                     WRN-28-1 (Wong et al., 2018)                                        66.2                          0.57
                                     DenseNet (Huang et al., 2016)                                       95.2                          0.98
                                     VGG13 (Simonyan and Zisserman, 2014)                                80.1                          0.94
                                     All Convolution Net (Springenberg et al., 2014)                     85.7                          0.96
ImageNet (Deng et al., 2009)         MobileNet (Howard et al., 2017)                                     70.4                          0.71
                                     ResNet-v2-50 (Kannan et al., 2018; He et al., 2016),                60.5                          0.28
                                     for 64×64 size images

  • Iterative adversarial training from Madry et al. (Madry et al., 2018).

  • Robust training with convex polytope relaxation from Wong et al. (Wong et al., 2018).

  • Adversarial logit pairing (ALP) from Kannan et al. (Kannan et al., 2018).

Table 3. Deep neural networks used for each dataset in this work. The top-1 accuracy and confidence values are calculated using the test set of the respective datasets.

5. Results

In this section, we present the experimental results. We first evaluate the robustness of OOD detectors to OOD adversarial examples. Next, we analyze the performance of state-of-the-art defenses against adversarial examples in the open-world learning model. We also evaluate adversarial detectors and secondary defenses in this setup. Finally, we demonstrate the relevance of open-world evasion attacks with real-world attacks using OOD adversarial examples.

5.1. OOD detectors are not robust

In open-world deep learning, the first step is to detect out-of-distribution inputs. Previous work has demonstrated that unmodified OOD inputs to a DNN can be detected with high probability by using an additional detector (Liang et al., 2018; Lee et al., 2018a) (recall Section 2.4.2). In this section, we evaluate the robustness of these OOD detection approaches to OOD adversarial examples. Specifically, we evaluate two state-of-the-art OOD detection methods, ODIN (Liang et al., 2018) and confidence calibrated classification (Lee et al., 2018a), on trained CIFAR-10 models.

OOD detector setup. The success of an OOD detector is measured by the False Negative Rate (FNR), which represents the fraction of OOD inputs the detector fails to detect. The threshold values reported in (Liang et al., 2018; Lee et al., 2018a) are calibrated such that the True Negative Rate (TNR), i.e., the fraction of in-distribution inputs the detector classifies as non-OOD, is equal to 95%. For both detectors, the PGD-xent attack with 100 iterations and random target label selection is used.

Summary of results. While OOD detectors achieve good detection performance on unmodified OOD data, these detectors can be evaded with OOD adversarial examples. The key reason for the failure to detect OOD adversarial examples is that both ODIN (Liang et al., 2018) and confidence calibrated classification (Lee et al., 2018a) rely on low output confidence values to distinguish OOD examples. As we will discuss next in Section 5.2 (Table 5), our results show that OOD adversarial examples can achieve target output classification with confidence equal to (and often higher than) that of in-distribution images.
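The evaluation protocol for a confidence-based detector can be sketched as follows. The sketch assumes the detector assigns each input a scalar score where higher means "more in-distribution" (for example, the maximum softmax probability); the function names and the exact tie-handling at the threshold are assumptions of this illustration.

```python
import math


def threshold_at_tnr(in_scores, tnr=0.95):
    """Largest threshold such that at least a fraction `tnr` of the
    in-distribution scores are greater than or equal to it."""
    ordered = sorted(in_scores, reverse=True)
    k = math.ceil(round(tnr * len(ordered), 9))
    return ordered[k - 1]


def false_negative_rate(ood_scores, threshold):
    """Percentage of OOD inputs scored at or above the threshold, i.e. OOD
    inputs that the detector fails to flag."""
    missed = sum(s >= threshold for s in ood_scores)
    return 100.0 * missed / len(ood_scores)
```

An OOD adversarial example that drives the classifier's confidence for its target label above this threshold is counted as a false negative, which is why high-confidence targeted attacks translate directly into a high FNR.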

5.1.1. Effect of OOD attacks on ODIN

We use the code and pre-trained DenseNet (Huang et al., 2016) model on the CIFAR-10 dataset from Liang et al. (Liang et al., 2018). For consistency, we follow Liang et al. (Liang et al., 2018) and use a temperature scaling of 1000 and a perturbation budget for input pre-processing of 0.0014.

OOD adversarial examples can evade the ODIN detector successfully. We first test the performance of ODIN with multiple unmodified OOD datasets. As expected, ODIN achieves more than 78% detection accuracy for all unmodified OOD datasets. However, the detection rate of ODIN drastically decreases with OOD adversarial examples (Table 4). For OOD adversarial examples generated from the ImageNet dataset with ε of 16, ODIN misses 97.4% of the inputs. Excluding Gaussian noise, the mean success rate in evading ODIN is 98.3% across the other OOD datasets with ε = 16. For OOD adversarial examples generated from Gaussian noise, we observe that the success in evading ODIN depends on the neural network model. As highlighted in Table 4, a DenseNet model leads to poor success of OOD adversarial examples from Gaussian noise. However, for a WRN-28-10 model, we found that the OOD adversarial examples generated from Gaussian noise achieve an FNR close to 100%.

OOD dataset            ODIN (Liang et al., 2018)     Confidence calibrated classifier (Lee et al., 2018a)
                       ε = 8.0       ε = 16.0        ε = 8.0       ε = 16.0
ImageNet               68.8          97.4            46.4          47.5
VOC12                  74.4          97.4            47.1          47.2
Internet Photographs   81.6          98.7            42.5          45.4
MNIST                  72.6          99.8            4.6           5.2
Gaussian Noise         0             4.2             20.9          21.9

Table 4. False Negative Rate (FNR, in %) of the ODIN (Liang et al., 2018) and confidence calibrated classifier (Lee et al., 2018a) approaches for OOD adversarial examples. The results are reported with the respective models for each detector trained on the CIFAR-10 dataset. The TNR of each detector on the in-distribution dataset is 95%. These results show that a high percentage of OOD adversarial examples can evade OOD detectors. The maximum values among all OOD datasets are highlighted in bold.

5.1.2. Effect of OOD attacks on confidence calibrated classifiers

For consistency with the prior work of Lee et al. (Lee et al., 2018a), we use a similar model (VGG13) and training procedure. We also validate the results from Lee et al. (Lee et al., 2018a) by evaluating this detector on unmodified OOD datasets.

Up to 47.5% of OOD adversarial examples can bypass the detection approach based on confidence calibrated classifiers. In our baseline experiments with unmodified OOD datasets, we found that the confidence-calibrated classifier has good detection performance; for example, the unmodified ImageNet and VOC12 datasets are detected with high accuracy. However, the detection performance degrades significantly for adversarial examples generated from OOD datasets other than MNIST (Table 4). For example, when ε equals 16, more than 45% of the adversarial examples generated from the ImageNet, VOC12, and Internet Photographs datasets are missed by the detector. However, in comparison to ODIN (Liang et al., 2018), the gradient-based attacks on this detector fail to achieve an FNR close to 100%. We observe that even with an unconstrained adversarial attack the FNR does not approach 100%. We speculate that this behavior might be due to non-informative gradients presented by the model at the input. It should be noted that first-order attack approaches which can succeed in the presence of obfuscated gradients (Athalye et al., 2018) are not applicable here, because the gradients are obfuscated by the model itself rather than by an additional input-processing step.

5.2. Fooling robustly trained models

In this section, we first evaluate the robustness of baseline, undefended models (trained with natural training) to OOD adversarial examples. Next, we evaluate robustly trained models with OOD adversarial examples. For robust training, we consider the two state-of-the-art approaches discussed in Section 2.3.1 which are iterative adversarial training (Madry et al., 2018) and robust training using the convex polytope relaxation (Kolter and Wong, 2018). For undefended models, we show that OOD adversarial examples achieve target success rates almost identical to adversarial examples generated from in-distribution data. We further show that robustly trained models are much less robust against OOD adversarial examples.

(a) Lack of robustness of natural training.
(b) Lack of robustness of iterative adversarial training (Madry et al., 2018).
Figure 1. Target success rate of adversarial examples generated from different datasets for the state-of-the-art WRN-28-10 (Zagoruyko and Komodakis, 2016) model trained on CIFAR-10 (Krizhevsky et al., 2014). The PGD-xent attack is used (Section 4.2) with ε ($L_\infty$ perturbation budget) up to 8 and random target label selection. Though iterative adversarial training (Madry et al., 2018) (with ε = 8) improves robustness for in-distribution data (CIFAR-10), OOD adversarial examples are up to 4× as successful as those generated from in-distribution data.
Each cell reports target success rate (%) / mean confidence of successful adversarial examples. The "clean" rows report the mean confidence on unmodified inputs.

Defense                                    Iterative adversarial training    Adversarial logit pairing   Convex polytope relaxation
                                           (Madry et al., 2018)              (Kannan et al., 2018)       (Kolter and Wong, 2018)
Training data                              MNIST           CIFAR-10          ImageNet                    MNIST           CIFAR-10
(norm, ε)                                  (L∞, 0.3)       (L∞, 8)           (L∞, 16)                    (L∞, 0.1)       (L∞, 8)

Test data              Attack     Target
MNIST                  clean               0.98            0.81              0.27                        0.97            0.48
                       PGD-xent   rand     1.5 / 0.76      5.1 / 0.60        98.8 / 0.96                 0.6 / 0.64      3.9 / 0.37
                       PGD-xent   LL       0.0 / 0.0       0.0 / 0.0         99.4 / 0.95                 0.0 / 0.0       0 / 0
                       PGD-CW     rand     1.9 / 0.71      5.7 / 0.54        97.0 / 0.91                 1.2 / 0.67      5.1 / 0.33
                       PGD-CW     LL       0.0 / 0.0       0.0 / 0.0         96.6 / 0.88                 0.0 / 0.0       0 / 0
CIFAR-10               clean               0.77            0.93              0.14                        0.88            0.27
                       PGD-xent   rand     97.6 / 0.99     22.9 / 0.81       100.0 / 0.99                67.2 / 0.97     15.1 / 0.41
                       PGD-xent   LL       95.3 / 0.99     5.1 / 0.69        100.0 / 0.99                30.3 / 0.93     0.4 / 0.15
                       PGD-CW     rand     97.6 / 0.99     22.3 / 0.76       99.9 / 0.96                 61.5 / 0.97     16.4 / 0.35
                       PGD-CW     LL       93.8 / 0.98     4.4 / 0.60        99.5 / 0.95                 27.6 / 0.91     0.4 / 0.15
ImageNet               clean               0.79            0.74              0.30                        0.88            0.40
                       PGD-xent   rand     97.2 / 0.99     44.9 / 0.78       99.4 / 0.98                 72.1 / 0.97     23.4 / 0.36
                       PGD-xent   LL       94.7 / 0.98     4.9 / 0.64        98.9 / 0.98                 33.6 / 0.93     0.8 / 0.20
                       PGD-CW     rand     97.2 / 0.99     39.5 / 0.76       98.6 / 0.90                 68.9 / 0.97     28.4 / 0.33
                       PGD-CW     LL       93.9 / 0.97     4.6 / 58.3        93.9 / 84.5                 30.4 / 89.1     0.8 / 0.17
VOC12                  clean               0.76            0.75              0.19                        0.89            0.38
                       PGD-xent   rand     99.3 / 0.99     54.9 / 0.79       99.9 / 0.99                 68.7 / 0.97     26.4 / 0.36
                       PGD-xent   LL       97.5 / 0.98     11.3 / 0.61       99.8 / 0.99                 29.9 / 0.91     0.9 / 0.21
                       PGD-CW     rand     98.5 / 0.99     50.3 / 0.76       99.7 / 0.95                 63.7 / 0.96     27.0 / 0.31
                       PGD-CW     LL       96.0 / 0.98     9.9 / 0.55        98.4 / 0.92                 26.0 / 0.89     1.2 / 0.16
Internet Photographs   clean               0.74            0.74              0.17                        0.89            0.43
                       PGD-xent   rand     95.4 / 0.99     46.3 / 0.80       100.0 / 0.99                58.4 / 0.97     19.8 / 0.35
                       PGD-xent   LL       92.1 / 0.98     12.2 / 0.34       99.6 / 0.98                 26.8 / 0.93     0.5 / 0.24
                       PGD-CW     rand     95.4 / 0.98     47.5 / 0.77       99.8 / 0.96                 55.5 / 0.95     22.4 / 0.33
                       PGD-CW     LL       91.2 / 0.96     11.4 / 0.62       99.0 / 0.93                 23.3 / 0.91     0.6 / 0.19
Gaussian Noise         clean               0.92            0.52              0.41                        0.91            0.31
                       PGD-xent   rand     100 / 1.00      87.9 / 0.86       48.5 / 0.45                 79.0 / 0.99     29.1 / 0.32
                       PGD-xent   LL       100 / 1.00      7.1 / 0.39        0 / 0                       0 / 0           0 / 0
                       PGD-CW     rand     100.0 / 1.00    87.3 / 0.81       47.7 / 0.29                 79.7 / 0.97     28.4 / 0.30
                       PGD-CW     LL       100 / 1.00      4.9 / 0.31        0 / 0                       0 / 0           0 / 0

Table 5. Target success rate of adversarial examples generated from different datasets for models trained with iterative adversarial training (Madry et al., 2018), adversarial logit pairing (ALP) (Kannan et al., 2018), and convex polytope relaxation (Kolter and Wong, 2018). The norm used to generate adversarial examples is listed along with the training dataset. The maximum target success by the adversarial examples for every model is highlighted in bold. The results for in-distribution data are highlighted in italics.

5.2.1. Baseline models are highly vulnerable to OOD adversarial examples

Fig. 1(a) shows the target success rate of adversarial examples generated from different OOD datasets for the WRN-28-10 network (Table 3) trained on the CIFAR-10 dataset. The x-axis represents the maximum perturbation for the PGD-xent attack used to generate the adversarial examples. The results show that, similar to adversarial examples generated from in-distribution images, OOD adversarial examples also achieve a high target success rate. For example, the target success rate increases rapidly to 100% for both in- and out-of-distribution data.

5.2.2. OOD attacks on adversarially trained models

We now evaluate the robustness of adversarially trained models (iterative adversarial training (Madry et al., 2018)) to OOD adversarial examples. Previous work (Goodfellow et al., 2015; Tramèr et al., 2018; Madry et al., 2018; Kannan et al., 2018) has shown that adversarial training can significantly increase model robustness against adversarial examples generated from in-distribution data. For example, for the WRN-28-10 network trained on CIFAR-10, adversarial training reduces the target success rate from 100% to 22.9% for the PGD-xent attack with ε equal to 8, as shown in Figure 1(b).

Experimental details: For the MNIST and CIFAR-10 datasets, we use the iterative adversarial training approach proposed in Madry et al. (Madry et al., 2018). The models for the MNIST and CIFAR-10 datasets in this experiment are the 4-layer CNN (Madry et al., 2018) and WRN-28-10-A (Table 3), respectively. Each model is adversarially trained with an $L_\infty$ perturbation budget (ε) equal to 0.3 and 8, respectively.

OOD adversarial examples generated from OOD datasets (except MNIST) achieve high target success rates for multiple adversarially trained models and datasets: Fig. 1(b) shows the target success rate of OOD adversarial examples for a WRN-28-10 network, trained on the CIFAR-10 dataset using adversarial training, with different perturbation budgets (ε) for the PGD-xent attack. It shows that though adversarial training improves robustness for the in-distribution dataset (CIFAR-10), OOD adversarial examples can achieve up to 4× higher target success rate compared to adversarial examples generated from in-distribution images. Table 5 presents the detailed results for different datasets, attacks, and label selection. When using the VOC12 dataset to generate OOD adversarial examples with PGD-xent and random targets, the target success rate rises from 1.5% to 99.3% for the MNIST model and from 22.9% to 54.9% for the CIFAR-10 model, compared to in-distribution attacks. The mean classification confidence for OOD adversarial examples is also competitive with adversarial examples generated from in-distribution data and typically higher than that of unmodified OOD data. However, achieving the least likely target label (LL) is a significantly harder objective for evasion attacks with both in- and out-of-distribution data. For each dataset, we observe that PGD-CW is relatively less successful than the PGD-xent attack.

5.2.3. OOD attacks on robust training with the convex polytope relaxation

Robust training using convex polytope relaxation from Wong et al. (Kolter and Wong, 2018; Wong et al., 2018) provides a provable upper bound on the adversarial error and thus on the target success rate.

Experiment details: We use the models from Wong et al. (Kolter and Wong, 2018; Wong et al., 2018) for the MNIST and CIFAR-10 datasets. The corresponding models are the 4-layer CNN (Kolter and Wong, 2018) and WRN-28-1, respectively (Table 3). These models achieve the state-of-the-art minimum provable adversarial error with a perturbation budget (ε) equal to 0.1 and 2 for MNIST and CIFAR-10 data, respectively. With a more complex dataset such as CIFAR-10, this defense is limited to the use of a small perturbation budget in training to avoid reducing the performance on unmodified inputs. For example, training the WRN-28-1 network using convex polytope relaxation on the CIFAR-10 dataset with a more realistic ε of 8 achieves only 27.1% classification accuracy on unmodified CIFAR-10 test data. Given that the defense approach cannot simultaneously maintain high benign accuracy for CIFAR-10 while using a more realistic ε equal to 8, we make the design choice of using ε equal to 2 in training the CIFAR-10 model and ε equal to 8 for adversarial example generation. On the other hand, for a simpler dataset such as MNIST, we continue to use the same ε equal to 0.1 for both training and adversarial example generation.

Robust training with convex polytope relaxation lacks robustness to OOD adversarial examples: Table 5 shows the experimental results for the MNIST and CIFAR-10 models using the provable defense approach of convex polytope relaxation (Kolter and Wong, 2018; Wong et al., 2018). Though this approach significantly improves the robustness to in-distribution adversarial examples, it lacks robustness to OOD adversarial examples. For the model trained on MNIST, the target success rate increases from 0.6% to 72.1% by using ImageNet as a source of OOD data with ε = 0.1 and the PGD-xent attack. Similar success rates are observed with other OOD datasets for this model. For the model trained on the CIFAR-10 dataset, the target success rate increases from 15.1% to 29.1% with the use of OOD adversarial examples generated from Gaussian noise. The relatively poor performance of adversarial examples for the CIFAR-10 model could be due to the poor classification accuracy of this model, which achieves only 66.2% classification accuracy on the CIFAR-10 images. We argue that the principle behind this defense is not robust to OOD adversarial examples, as demonstrated by OOD attacks on the provably trained MNIST model.

5.2.4. Discussion: Impact of OOD dataset

In this subsection, we further discuss the influence of dataset selection and robust learning on the success of evasion attacks. We observe in Table 5 that the target success rate is affected by the choice of the OOD dataset. In particular, we observe that the target success rate for MNIST dataset is significantly lower than both in-distribution and other OOD datasets. We speculate that this behavior could arise due to the specific semantic structure of MNIST images. Nevertheless, we emphasize that the threat posed by OOD adversarial examples still persists, since adversarial examples from multiple other OOD datasets achieve high target success rates.

5.3. Evading Adversarial Example Detectors and Secondary Defenses

5.3.1. Adversarial detectors.

Previous work (He et al., 2017; Carlini and Wagner, 2017a) has shown that both adversarial detectors based on Feature Squeezing (Xu et al., 2018) or MagNet (Meng and Chen, 2017) approach can be evaded with adaptive white-box attacks accounting for the detector mechanism to generate adversarial examples from in-distribution data. Our results show that similar to in-distribution data, these adversarial detectors don’t provide robustness to adversarial examples generated from OOD data. For feature squeezing, we also show that OOD data requires a smaller perturbation budget than in-distribution data for a similar target success rate of corresponding adversarial examples. We further show that OOD adversarial examples can achieve up to target success rate in presence of MagNet on a model trained on the CIFAR-10 dataset. We provide detailed results in Appendix C.

5.3.2. Adversarial logit pairing.

Iterative adversarial training (Madry et al., 2018) works well for small-scale datasets but does not provide robustness on ImageNet. Adversarial logit pairing (ALP) (Kannan et al., 2018) extends this approach to provide robustness on ImageNet. However, ALP loses robustness against in-distribution adversarial examples when the number of attack iterations is increased (Engstrom et al., 2018). We show that this vulnerability also exists for OOD adversarial examples (Table 5). We use ResNet-v2-50 (Table 3) and a perturbation budget (ε) of 16.
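The role of the attack iteration count can be illustrated with a minimal sketch of a targeted iterative attack under an L∞ perturbation budget. This is not the authors' implementation: the toy linear softmax classifier, the weight matrix `W`, the step size `alpha`, and the assumption that the budget is expressed on the 0 to 255 pixel scale are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def targeted_linf_attack(W, x0, target, eps=16.0, alpha=2.0, steps=10):
    """Iterative targeted attack on a toy linear softmax model (logits = W @ x).

    Hypothetical illustration: W, alpha and the 0-255 pixel scale are assumptions.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    for _ in range(steps):
        p = softmax(W @ x)
        # Gradient of log p[target] with respect to x for a linear model.
        grad = W[target] - p @ W
        # Signed ascent step on the target log-probability.
        x = x + alpha * np.sign(grad)
        # Project back into the eps-ball around x0, then into the valid pixel range.
        x = np.clip(x, x0 - eps, x0 + eps)
        x = np.clip(x, 0.0, 255.0)
    return x
```

The starting point `x0` can be any input, in-distribution or not; the projection step only constrains how far the result may move from it. The `steps` argument corresponds to what the text calls the number of attack iterations.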

5.4. Towards real-world attacks

In this section, we demonstrate how OOD adversarial examples may affect real-world ML systems.

5.4.1. Attacking Content Moderation systems

To achieve a low false positive rate (FPR), ML classifiers deployed in the real world are also expected to detect OOD inputs. This is because a high FPR can significantly affect the performance (Murphy, 2017) and cost (pon, 2008) of the service provided by these models. For example, the Metropolitan Police in London are attempting to use computer vision to detect nudity in photographs, but a high FPR is reportedly occurring due to the prevalence of desert scenes as wallpapers etc. (Murphy, 2017). This example represents an inadvertent denial-of-service (DoS) attack, where a large number of false positives affects the effectiveness of the automated content moderation system. We use OOD adversarial examples to carry out a similar DoS attack on Clarifai’s content moderation model (Clarifai, 2019), by causing clearly unobjectionable content to be classified as objectionable with high confidence. In a real deployment, a deluge of such data will force human content moderators to spend time reviewing safe content. Sybil attacks (Yu et al., 2006; Danezis and Mittal, 2009) can enable attackers to create a plethora of fake accounts and upload large amounts of OOD adversarial examples.

Fooling the Clarifai model: Using the query-based black-box attacks proposed by Bhagoji et al. (Bhagoji et al., 2018), we construct OOD adversarial examples for Clarifai’s content moderation model. The model provides confidence scores for the input image belonging to the 5 classes ‘safe’, ‘suggestive’, ‘explicit’, ‘gore’ and ‘drugs’, and is accessible through an API. We use 10 images each from the MNIST, Fashion-MNIST (Xiao et al., 2017) and Gaussian Noise datasets to generate OOD adversarial examples. All of these images are initially classified as ‘safe’ by the model. In Figure 2, we show a representative attack example. Our attack successfully generates OOD adversarial examples for the 4 classes apart from ‘safe’ for all 30 images with 3000 queries on average and a mean target confidence of 0.7.
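The idea behind a query-based black-box attack can be sketched as follows: when only confidence scores are observable, the gradient of the target-class log-confidence can be approximated with finite differences over queries. This is a minimal sketch under stated assumptions, not the method of the cited work: `query_fn` is a hypothetical stand-in for any scoring interface, the input is assumed to be a flattened vector, and the query-reduction techniques that make such attacks practical on real images are not reproduced here.

```python
import numpy as np

def estimate_gradient(query_fn, x, target, delta=1e-3):
    """Estimate d log p[target] / dx with two-sided finite differences.

    query_fn is a hypothetical black box mapping a 1-D input vector to class
    confidence scores; only its outputs are used, never its parameters.
    """
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = delta
        hi = np.log(query_fn(x + step)[target])
        lo = np.log(query_fn(x - step)[target])
        grad[i] = (hi - lo) / (2.0 * delta)  # two queries per coordinate
    return grad
```

This naive estimator spends two queries per input coordinate, so its cost grows linearly with the input dimension.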

5.4.2. Physically-realizable attacks on traffic signs

Traffic sign recognition systems are intended for operation in the real world, where they encounter arbitrary objects in the environment. These objects can be modified to become physical OOD adversarial examples which are detected and classified with high confidence as traffic signs. We demonstrate attack success with both imperceptible perturbations in an OOD logo attack and unconstrained perturbations within a mask in a custom sign attack. OOD adversarial examples in Figure 3 are detected and classified with high confidence as traffic signs (by a CNN with 98.5% accuracy on test data) over a range of physical conditions when printed out. The targeted attack success rate is 95.2% for the custom sign attack. Details of these attacks and further results are in our short workshop paper (Anonymous, 2018).

Figure 2. OOD adversarial examples against Clarifai’s Content Moderation model. Left: original image, classified as ‘safe’ with a confidence of 0.96. Right: adversarial example with , classified as ‘explicit’ with a confidence of 0.9.
Figure 3. OOD adversarial examples for a traffic sign recognition pipeline. These adversarial examples are classified as the desired target traffic sign with high confidence under a variety of physical conditions when printed out.

6. Towards Robust Open-world Deep Learning

In this section, we present experimental results for our proposed hybrid combination of iterative adversarial training and selective prediction to enhance classifier robustness against OOD adversarial examples. We experiment with an image classification task, with the CIFAR-10 dataset as in-distribution data and multiple other datasets as the source of out-of-distribution images.

Setup: In this experiment, we use the WRN-28-10 model with the hyper-parameters from Madry et al. (Madry et al., 2018) for iterative adversarial training and CIFAR-10 as the in-distribution dataset (Table II). To train a robust classifier with a background class, we use 5,000 images from one of the MNIST, ImageNet, VOC12, and Internet Photographs datasets. An image is flagged as OOD if it is classified into the background class. We select random target labels from the CIFAR-10 classes as the desired target class (excluding the predicted class for the unmodified input), and compute the target success rate. This metric allows us to capture the adversary’s success at both evading detection and achieving targeted misclassification with OOD adversarial examples. We use 1000 test images and a perturbation budget (ε) of 8.
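The evaluation metric described in the setup can be sketched as below. This is a minimal illustration rather than the authors' evaluation code: the function names are hypothetical, and the background class is assumed to be encoded as the extra index `num_classes` (10 for CIFAR-10).

```python
import numpy as np

def sample_target_labels(pred_clean, num_classes, rng):
    """Pick one random in-distribution target per input, excluding the class
    predicted for the unmodified input (hypothetical helper)."""
    targets = np.empty(len(pred_clean), dtype=int)
    for i, p in enumerate(pred_clean):
        # If p is the background index, every in-distribution class is a candidate.
        candidates = [c for c in range(num_classes) if c != p]
        targets[i] = rng.choice(candidates)
    return targets

def target_success_rate(pred_adv, targets):
    """Percentage of adversarial inputs classified as their chosen target.

    Targets are always in-distribution classes, so an input assigned to the
    background class (that is, detected as OOD) counts as a failed attack.
    """
    pred_adv = np.asarray(pred_adv)
    targets = np.asarray(targets)
    return 100.0 * np.mean(pred_adv == targets)
```

Because a background-class prediction can never match a target, a single number reflects both failure modes of the attacker: being detected as OOD and landing on the wrong in-distribution class.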

Figure 4. Target success rate of OOD adversarial examples (lower is better). The classifier is adversarially trained on in-distribution inputs along with a small subset of OOD data. It shows that datasets such as VOC12 and ImageNet provide a high inter-dataset and intra-dataset generalization for detection of adversarial OOD inputs.

Small subset of OOD data can enable robust detection. With each of the OOD datasets, we observe that using only 5,000 training images leads to a significant decrease in the success of adversarial attacks from these datasets (Fig. 4). For example, after adversarial training with only 5,000 training images (out of 1.2 million) from ImageNet, the target success rate of OOD adversarial examples from the ImageNet dataset decreases from 44.9% to 7.4%.

Robust training with one OOD dataset can generalize to multiple other OOD datasets. From Figure 4, we observe that adversarial training with one dataset also decreases the attack success of OOD adversarial examples from other datasets. This effect is significant for feature-rich datasets such as ImageNet and VOC12. For example, using VOC12 for training reduces the target success rate of OOD adversarial examples from ImageNet from 44.9% to 15.8%.

Multiple OOD datasets can be combined for robust detection: By including 5,000 images from each of the four OOD datasets in robust training, we demonstrate that a single network can learn robust detection for each of them. We observe that the combination of all datasets is constructive, as the best results are achieved when multiple datasets are used in training.

Small impact on in-distribution performance: Robust training with OOD data has a small impact on classifier performance for in-distribution data. The maximum decrease in benign classification accuracy is 1% (for a single OOD dataset) and 3.1% (for all four OOD datasets). The robust accuracy remains largely unchanged. Detailed results are in Table 9 in Appendix D.

Discussion and limitations. Our results highlight that it is feasible to robustly classify one or multiple OOD datasets along with in-distribution data using a semi-supervised learning approach. However, the key challenge in this domain is to achieve robustness against all OOD inputs. As a first step, our approach and results motivate the design of robust unsupervised OOD detectors for deep learning. They also highlight the need for rigorous evaluation methods to determine the robustness of an open-world learning system against all possible adversarial examples.

7. Conclusion

In this paper, we investigated evasion attacks in the open-world learning model and defined OOD adversarial examples, which represent a new attack vector on ML models used in practice. We found that existing OOD detectors are insufficient to deal with this threat. Further, assumptions regarding the source of adversarial examples, namely, in-distribution data, have led to tailored defenses. We showed that these state-of-the-art defenses exhibit increased vulnerability to OOD adversarial examples, which makes their deployment challenging. With these findings in mind, we took a first step at countering OOD adversarial examples using adversarial training with background class augmented classifiers. We now urge the community to consider the exploration of strong defenses against open-world evasion attacks.


  • pon (2008) 2008. (2008). [Online; accessed 10-November-2018].
  • ima (2016) 2016. Classification datasets results. (2016).
  • NIP (2018) 2018. A loss framework for calibrated anomaly detection. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 1494–1504.
  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). Software available from
  • Anonymous (2018) Anonymous. 2018. Rogue Signs: Deceiving Traffic Sign Recognition with Malicious Ads and Logos. In 1st Deep Learning and Security Workshop (IEEE S&P 2018) (DLS 2018).
  • Arnab et al. (2018) Anurag Arnab, Ondrej Miksik, and Philip H. S. Torr. 2018. On the Robustness of Semantic Segmentation Models to Adversarial Attacks. In CVPR.
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML.
  • Athalye et al. (2017) Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2017. Synthesizing Robust Adversarial Examples. CoRR abs/1707.07397 (2017). arXiv:1707.07397
  • Bendale and Boult (2015) A. Bendale and T. Boult. 2015. Towards Open Set Deep Networks. ArXiv e-prints (Nov. 2015). arXiv:cs.CV/1511.06233
  • Bendale and Boult (2015) Abhijit Bendale and Terrance Boult. 2015. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1893–1902.
  • Bhagoji et al. (2017) Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. 2017. Dimensionality Reduction as a Defense against Evasion Attacks on Machine Learning Classifiers. arXiv preprint arXiv:1704.02654 (2017).
  • Bhagoji et al. (2018) Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. 2018. Practical Black-box Attacks on Deep Neural Networks using Efficient Query Mechanisms. In ECCV.
  • Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 387–402.
  • Biggio et al. (2012) Battista Biggio, Blaine Nelson, and Pavel Laskov. 2012. Poisoning Attacks against Support Vector Machines. In Proceedings of the 29th International Conference on Machine Learning (ICML-12). 1807–1814.
  • Brendel et al. (2018) Wieland Brendel, Jonas Rauber, and Matthias Bethge. 2018. Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models. In ICLR.
  • Brown and Sandholm (2017) Noam Brown and Tuomas Sandholm. 2017. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science (2017), eaao1733.
  • Carlini and Wagner (2017a) Nicholas Carlini and David Wagner. 2017a. MagNet and “Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. arXiv preprint arXiv:1711.08478 (2017).
  • Carlini and Wagner (2017b) Nicholas Carlini and David Wagner. 2017b. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on. IEEE, 39–57.
  • Carlini and Wagner (2018) Nicholas Carlini and David Wagner. 2018. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. In DLS (IEEE SP).
  • Carlini and Wagner (2016) Nicholas Carlini and David A. Wagner. 2016. Towards Evaluating the Robustness of Neural Networks. CoRR abs/1608.04644 (2016). arXiv:1608.04644
  • Chalapathy and Chawla (2019) Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep Learning for Anomaly Detection: A Survey. arXiv preprint arXiv:1901.03407 (2019).
  • Chalapathy et al. (2018) Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. 2018. Anomaly Detection using One-Class Neural Networks. arXiv preprint arXiv:1802.06360 (2018).
  • Chang and Lippmann (1994) Eric I Chang and Richard P Lippmann. 1994. Figure of merit training for detection and spotting. In Advances in Neural Information Processing Systems. 1019–1026.
  • Chen et al. (2018b) Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2018b. EAD: elastic-net attacks to deep neural networks via adversarial examples. In AAAI.
  • Chen et al. (2017) Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. 2017. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 15–26.
  • Chen et al. (2018a) Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Chau. 2018a. Robust Physical Adversarial Attack on Faster R-CNN Object Detector. arXiv preprint arXiv:1804.05810 (2018).
  • Cisse et al. (2017) Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. 2017. Houdini: Fooling deep structured prediction models. In NIPS.
  • Clarifai (2019) Clarifai 2019. Clarifai | Image & Video Recognition API. (2019).
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
  • Danezis and Mittal (2009) George Danezis and Prateek Mittal. 2009. SybilInfer: Detecting Sybil Nodes using Social Networks.. In NDSS. San Diego, CA, 1–15.
  • Dang et al. (2017) Hung Dang, Huang Yue, and Ee-Chien Chang. 2017. Evading Classifiers by Morphing in the Dark. ACM CCS.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 248–255.
  • Deng et al. (2013) Li Deng, Geoffrey Hinton, and Brian Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: An overview. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 8599–8603.
  • DeVries and Taylor (2018) Terrance DeVries and Graham W Taylor. 2018. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv preprint arXiv:1802.04865 (2018).
  • Dhamija et al. (2018) Akshay Raj Dhamija, Manuel Günther, and Terrance Boult. 2018. Reducing Network Agnostophobia. In Advances in Neural Information Processing Systems. 9175–9186.
  • Ehlers (2017) Ruediger Ehlers. 2017. Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis. Springer, 269–286.
  • El-Yaniv and Wiener (2010) Ran El-Yaniv and Yair Wiener. 2010. On the foundations of noise-free selective classification. Journal of Machine Learning Research 11, May (2010), 1605–1641.
  • Engstrom et al. (2018) Logan Engstrom, Andrew Ilyas, and Anish Athalye. 2018. Evaluating and Understanding the Robustness of Adversarial Logit Pairing. arXiv preprint arXiv:1807.10272 (2018).
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
  • Evtimov et al. (2018) Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. 2018. Robust Physical-World Attacks on Machine Learning Models. In CVPR.
  • Fischer et al. (2017) Volker Fischer, Mummadi Chaithanya Kumar, Jan Hendrik Metzen, and Thomas Brox. 2017. Adversarial examples for semantic image segmentation. In ICLR Workshop.
  • Gehr et al. (2018) Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. 2018. AI2: Safety and robustness certification of neural networks with abstract interpretation. In Security and Privacy (SP), 2018 IEEE Symposium on.
  • Geifman and El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. In Advances in neural information processing systems. 4878–4887.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
  • Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
  • Gowal et al. (2018) Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Timothy Mann, and Pushmeet Kohli. 2018. On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models. arXiv preprint arXiv:1810.12715 (2018).
  • Grosse et al. (2017a) Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. 2017a. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280 (2017).
  • Grosse et al. (2017b) Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. 2017b. Adversarial examples for malware detection. In European Symposium on Research in Computer Security. Springer, 62–79.
  • Günther et al. (2017) Manuel Günther, Steve Cruz, Ethan M Rudd, and Terrance E Boult. 2017. Toward open-set face recognition. In Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. IEEE.
  • Guo et al. (2017) Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117 (2017).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • He et al. (2017) Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. 2017. Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong. arXiv preprint arXiv:1706.04701 (2017).
  • Hendrycks and Gimpel (2017) D. Hendrycks and K. Gimpel. 2017. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. (2017).
  • Hendrycks et al. (2018) Dan Hendrycks, Mantas Mazeika, and Thomas G Dietterich. 2018. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606 (2018).
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
  • Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs/1704.04861 (2017). arXiv:1704.04861
  • Huang et al. (2016) Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993 (2016). arXiv:1608.06993
  • Huang et al. (2011) Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. 2011. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and Artificial Intelligence. ACM, 43–58.
  • Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. 2017. Adversarial attacks on neural network policies. In ICLR.
  • Ilyas et al. (2018) Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. 2018. Black-box Adversarial Attacks with Limited Queries and Information. arXiv preprint arXiv:1804.08598 (2018).
  • Jagielski et al. (2018) Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. 2018. Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning. In IEEE Security and Privacy.
  • Jiang et al. (2018) Heinrich Jiang, Been Kim, and Maya Gupta. 2018. To Trust Or Not To Trust A Classifier. arXiv preprint arXiv:1805.11783 (2018).
  • Julian et al. (2016) Kyle D Julian, Jessica Lopez, Jeffrey S Brush, Michael P Owen, and Mykel J Kochenderfer. 2016. Policy compression for aircraft collision avoidance systems. In Digital Avionics Systems Conference (DASC), 2016 IEEE/AIAA 35th. IEEE, 1–10.
  • Kannan et al. (2018) Harini Kannan, Alexey Kurakin, and Ian Goodfellow. 2018. Adversarial Logit Pairing. arXiv preprint arXiv:1803.06373 (2018).
  • Kantchelian et al. (2016) Alex Kantchelian, JD Tygar, and Anthony D Joseph. 2016. Evasion and Hardening of Tree Ensemble Classifiers. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kolter and Wong (2018) J Zico Kolter and Eric Wong. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML.
  • Kos et al. (2017) Jernej Kos, Ian Fischer, and Dawn Song. 2017. Adversarial examples for generative models. arXiv preprint arXiv:1702.06832 (2017).
  • Kos and Song (2017) Jernej Kos and Dawn Song. 2017. Delving into adversarial attacks on deep policies. In ICLR Workshop.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. (2009).
  • Krizhevsky et al. (2014) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 2014. The CIFAR-10 dataset. online: http://www.cs.toronto.edu/~kriz/cifar.html (2014).
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016).
  • Kurakin et al. (2017) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial machine learning at scale. In ICLR.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems. 6402–6413.
  • LeCun (1998) Yann LeCun. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998).
  • Lee et al. (2018a) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. 2018a. Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples. In International Conference on Learning Representations.
  • Lee et al. (2018b) Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018b. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. arXiv preprint arXiv:1807.03888 (2018).
  • Liang et al. (2018) S. Liang, Y. Li, and R. Srikant. 2018. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations (ICLR).
  • Litjens et al. (2017) Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. 2017. A survey on deep learning in medical image analysis. Medical image analysis 42 (2017), 60–88.
  • Liu et al. (2018) Si Liu, Risheek Garrepalli, Thomas G Dietterich, Alan Fern, and Dan Hendrycks. 2018. Open category detection with PAC guarantees. arXiv preprint arXiv:1808.00529 (2018).
  • Liu et al. (2017) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into Transferable Adversarial Examples and Black-box Attacks. In ICLR.
  • Lu et al. (2017) Jiajun Lu, Hussein Sibai, and Evan Fabry. 2017. Adversarial Examples that Fool Detectors. arXiv preprint arXiv:1712.02494 (2017).
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR.
  • McCoyd and Wagner (2018) Michael McCoyd and David Wagner. 2018. Background Class Defense Against Adversarial Examples. In 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 96–102.
  • Meng and Chen (2017) Dongyu Meng and Hao Chen. 2017. Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 135–147.
  • Moosavi-Dezfooli et al. (2017) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In CVPR.
  • Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR.
  • Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. 2017. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 6337 (2017), 508–513.
  • Mozaffari-Kermani et al. (2015) Mehran Mozaffari-Kermani, Susmita Sur-Kolay, Anand Raghunathan, and Niraj K Jha. 2015. Systematic poisoning attacks on and defenses for machine learning in healthcare. IEEE journal of biomedical and health informatics 19, 6 (2015), 1893–1905.
  • Murphy (2012) Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. MIT press.
  • Murphy (2017) Margi Murphy. 2017. Artificial intelligence will detect child abuse images to save police from trauma. (2017).
  • Narodytska and Kasiviswanathan (2016) Nina Narodytska and Shiva Prasad Kasiviswanathan. 2016. Simple Black-Box Adversarial Perturbations for Deep Networks. arXiv preprint arXiv:1612.06299 (2016).
  • Nguyen et al. (2015) Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 427–436.
  • Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016. Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. arXiv preprint arXiv:1605.07277 (2016).
  • Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. 2017. Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. In Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security.
  • Papernot et al. (2016a) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016a. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 372–387.
  • Papernot et al. (2016b) Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. 2016b. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814 (2016).
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Picsum Authors (2019) Picsum Authors. 2019. Picsum Random image generator. "". (2019).
  • Raghunathan et al. (2018) Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. 2018. Certified defenses against adversarial examples. In ICLR.
  • Ramanagopal et al. (2018) Manikandasriram Srinivasan Ramanagopal, Cyrus Anderson, Ram Vasudevan, and Matthew Johnson-Roberson. 2018. Failing to learn: autonomously identifying perception failures for self-driving cars. IEEE Robotics and Automation Letters 3, 4 (2018), 3860–3867.
  • Rubinstein et al. (2009) Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shing-hon Lau, Satish Rao, Nina Taft, and JD Tygar. 2009. Stealthy poisoning attacks on PCA-based anomaly detectors. ACM SIGMETRICS Performance Evaluation Review 37, 2 (2009), 73–74.
  • Ruff et al. (2018) Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning. 4393–4402.
  • Samangouei et al. (2018) Pouya Samangouei, Maya Kabkab, and Rama Chellappa. 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605 (2018).
  • Sharif et al. (2016) Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. 2016. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1528–1540.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature 550, 7676 (2017), 354.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. 2018. Certifying some distributional robustness with principled adversarial training. (2018).
  • Sitawarin et al. (2018) Chawin Sitawarin, Arjun Nitin Bhagoji, Arsalan Mosenia, Prateek Mittal, and Mung Chiang. 2018. Rogue Signs: Deceiving Traffic Sign Recognition with Malicious Ads and Logos. In DLS (IEEE SP).
  • Song et al. (2017) Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766 (2017).
  • Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. 2014. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014).
  • Suciu et al. (2018) Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras. 2018. When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks. In 27th USENIX Security Symposium (USENIX Security 18). USENIX Association, Baltimore, MD, 1299–1316.
  • Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
  • Tramèr et al. (2018) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. 2018. Ensemble Adversarial Training: Attacks and Defenses. In ICLR.
  • Uesato et al. (2018) Jonathan Uesato, Brendan O’Donoghue, Aaron van den Oord, and Pushmeet Kohli. 2018. Adversarial risk and the dangers of evaluating against weak attacks. In ICML.
  • Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of neural networks using dropconnect. In International Conference on Machine Learning. 1058–1066.
  • Wang et al. (2018d) Bolun Wang, Yuanshun Yao, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2018d. With great training comes great vulnerability: practical attacks against transfer learning. In USENIX Security 18. 1281–1297.
  • Wang et al. (2018a) Shiqi Wang, Yizheng Chen, Ahmed Abdou, and Suman Jana. 2018a. MixTrain: Scalable Training of Formally Robust Neural Networks. arXiv preprint arXiv:1811.02625 (2018).
  • Wang et al. (2018b) Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. 2018b. Efficient Formal Safety Analysis of Neural Networks. arXiv preprint arXiv:1809.08098 (2018).
  • Wang et al. (2018c) S Wang, K Pei, J Whitehouse, J Yang, and S Jana. 2018c. Formal Security Analysis of Neural Networks using Symbolic Intervals. In USENIX Security 18.
  • Weng et al. (2018) Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S Dhillon, and Luca Daniel. 2018. Towards Fast Computation of Certified Robustness for ReLU Networks. arXiv preprint arXiv:1804.09699 (2018).
  • Wong et al. (2018) Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. 2018. Scaling provable adversarial defenses. arXiv preprint arXiv:1805.12514 (2018).
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
  • Xiao et al. (2018) Kai Y Xiao, Vincent Tjeng, Nur Muhammad Shafiullah, and Aleksander Madry. 2018. Training for faster adversarial robustness verification via inducing relu stability. arXiv preprint arXiv:1809.03008 (2018).
  • Xie et al. (2017a) Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. 2017a. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991 (2017).
  • Xie et al. (2017b) Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. 2017b. Adversarial examples for semantic segmentation and object detection. In International Conference on Computer Vision. IEEE.
  • Xu et al. (2018) Weilin Xu, David Evans, and Yanjun Qi. 2018. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. In NDSS.
  • Xu et al. (2016) Weilin Xu, Yanjun Qi, and David Evans. 2016. Automatically evading classifiers. In Proceedings of the 2016 Network and Distributed Systems Symposium.
  • Yamada et al. (2018) Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. 2018. ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018).
  • Yoshihashi et al. (2018) Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. 2018. Classification-Reconstruction Learning for Open-Set Recognition. arXiv preprint arXiv:1812.04246 (2018).
  • Yu et al. (2006) Haifeng Yu, Michael Kaminsky, Phillip B Gibbons, and Abraham Flaxman. 2006. Sybilguard: defending against sybil attacks via social networks. In ACM SIGCOMM Computer Communication Review, Vol. 36. ACM, 267–278.
  • Yuan et al. (2018) Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A Gunter. 2018. CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition. In USENIX Security.
  • Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016).
  • Zong et al. (2018) Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. 2018. Deep autoencoding gaussian mixture model for unsupervised anomaly detection.

  • Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2017. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012 (2017).


Appendix A Additional details for the datasets

Table 6 provides the details of the datasets used in this work. In particular, it provides the state-of-the-art top-1 classification accuracy for each dataset, which can be used to compare with the performance of models used in the paper.

| Dataset | Number of classes | Training images (thousands) | Test/validation images (thousands) | Top-1 accuracy (%) |
|---|---|---|---|---|
| MNIST | 10 | 60 | 10 | 99.79 (Wan et al., 2013) |
| CIFAR-10 | 10 | 50 | 10 | 97.69 (Yamada et al., 2018) |
| ImageNet | 1000 | 1281 | 50 | 82.7 (Zoph et al., 2017) |

Table 6. Details of the three datasets used as sources of in-distribution images in this work.

We also use the VOC12 dataset as a source of OOD images. In addition, the two datasets we constructed as sources of OOD data are detailed below:

  1. Internet photographs: We sample random natural images from the Picsum random image generator (Picsum Authors, 2019), which is a source of real-world photographs. These images have dimensions between 300×300 and 800×800 pixels.

  2. Gaussian noise: We synthesize 10,000 random images using a Gaussian distribution for each pixel value. We set the mean (μ) of the Gaussian distribution equal to … and the standard deviation (σ) to 50. We avoid a very low or very high standard deviation, as these lead to a very sharp or a nearly uniform distribution, respectively, over the valid pixel range ([0, 255]).
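The Gaussian-noise construction can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name, the clipping and rounding step, and the image size are assumptions, and the mean is left as a required argument because its value is not given here.

```python
import numpy as np

def gaussian_noise_images(n_images, height, width, mean, std=50.0, channels=3, seed=0):
    """Sample images with i.i.d. Gaussian pixel values, clipped to [0, 255]."""
    rng = np.random.default_rng(seed)
    pixels = rng.normal(loc=mean, scale=std,
                        size=(n_images, height, width, channels))
    # Round and clip to the valid 8-bit pixel range (an assumed post-processing step).
    return np.clip(np.rint(pixels), 0, 255).astype(np.uint8)

# Hypothetical usage: mean=127.5 (mid-range) and 32x32 images are placeholders,
# and only 8 images are drawn here to keep the example light.
images = gaussian_noise_images(8, 32, 32, mean=127.5)
```

A small standard deviation concentrates almost all pixels near the mean, while a very large one pushes most samples outside [0, 255], so clipping would produce mostly saturated pixels.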

| OOD data | Metric | MNIST | ImageNet | VOC12 | Internet Photographs | Gaussian Noise |
|---|---|---|---|---|---|---|
| Unmodified | min | 0.00 | 0.01 | 0.02 | 0.00 | 0.00 |
| | max | 0.62 | 0.29 | 0.17 | 0.20 | 0.97 |
| Adversarial | min | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| | max | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

Table 7. Minimum and maximum of the expected confidence over all output classes of a WRN-28-10 model (trained on the CIFAR-10 dataset) for unmodified and adversarial OOD inputs. The high minimum and maximum values for adversarial inputs demonstrate that OOD adversarial examples can achieve target classification to any output class with high confidence.

Appendix B Comparing the impact of unmodified and adversarial OOD data

Since the classifier has never encountered inputs drawn from the OOD distribution, it is unclear how it will behave on them. An empirical analysis of the behavior of state-of-the-art classifiers on unmodified OOD data motivates the need, from an adversarial perspective, to generate OOD adversarial examples.

Methodology: We use a wide residual network (WRN-28-10) trained on the CIFAR-10 dataset to illustrate the differences between unmodified and adversarial OOD data. The adversary first selects a target class and then samples 1,000 images from an OOD dataset, all of which should be classified as the target class. The adversarial counterparts of these images are generated using the PGD attack with norm constraints (Madry et al., 2018). Further details of the attacks, models, and datasets are in Section 4. For each set of examples sharing a target class, we compute the expected confidence in that class. We then report the minimum and maximum of this expectation over all targets in Table 7.

Unmodified OOD data: For unmodified OOD images, the adversary has no control over the target classification or the output confidence. For example, with unmodified MNIST images as OOD inputs, one class is rarely predicted (resulting in a minimum expected confidence of 0), while another class is predicted with an average confidence of 0.62. This indicates that while some targets can be met confidently with unmodified OOD data, not all outputs are equally likely for a given OOD dataset, and some targets may not be reachable with unmodified OOD data.

High-confidence targeted misclassification: In contrast, for OOD adversarial examples any desired target class can be achieved with high confidence, as both the minimum and maximum values are high. This holds across OOD datasets.
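The minimum and maximum reported in Table 7 can be computed as in the sketch below. It assumes that the expected confidence for a target is the mean softmax probability the model assigns to that target over the sampled images; the function name and the toy probabilities are hypothetical.

```python
import numpy as np

def min_max_expected_confidence(probs_by_target):
    """Return (min, max) over targets of the mean confidence in the target class.

    probs_by_target maps a target class index t to an array of shape
    (n_images, n_classes) holding the model's softmax outputs for the
    images that are meant to be classified as t.
    """
    expected = [float(np.mean(p[:, t])) for t, p in sorted(probs_by_target.items())]
    return min(expected), max(expected)

# Toy softmax outputs for a 3-class model (illustrative values only).
toy = {
    0: np.array([[0.9, 0.05, 0.05], [0.7, 0.2, 0.1]]),  # mean confidence 0.8
    1: np.array([[0.5, 0.5, 0.0]]),                      # mean confidence 0.5
    2: np.array([[0.0, 0.0, 1.0]]),                      # mean confidence 1.0
}
lowest, highest = min_max_expected_confidence(toy)
```

A low minimum means at least one target class is effectively unreachable for that input source, which is the pattern the unmodified rows of Table 7 show.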

Appendix C Detailed Results: Adversarial Example Detectors

In this section, we evaluate OOD attacks against state-of-the-art adversarial example detectors, including feature squeezing (Xu et al., 2018) and MagNet (Meng and Chen, 2017) (recall Section 2.3.2). The target success rate in the presence of these detectors refers to the percentage of adversarial examples that both evade the detector and are classified as the target label.

Summary of results. Previous work (He et al., 2017; Carlini and Wagner, 2017a) has shown that an adaptive adversary can evade these detectors with adversarial examples generated from in-distribution data. Our results show that, as with in-distribution data, these detectors do not provide robustness to adversarial examples generated from OOD data. For feature squeezing, we also show that OOD data requires a smaller perturbation budget than in-distribution data for a similar target success rate. We further show that OOD adversarial examples can achieve up to 97.3% target success rate in the presence of MagNet on a model trained on the CIFAR-10 dataset (Table 8).

| Test \ Train | Feature squeezing (Xu et al., 2018): MNIST | Feature squeezing: CIFAR-10 | Feature squeezing: ImageNet | MagNet (Meng and Chen, 2017): MNIST | MagNet: CIFAR-10 |
|---|---|---|---|---|---|
| MNIST | 98.1 | 100.0 | 3.12 | 32.3 | 18.0 |
| CIFAR-10 | 99.2 | 100.0 | 96.1 | 0.7 | 90.1 |
| ImageNet | 99.3 | 100.0 | 85.1 | 0.8 | 92.5 |
| VOC12 | 99.3 | 100.0 | 96.0 | 0.8 | 96.9 |
| Internet Photographs | 98.1 | 100.0 | 85.1 | 1.2 | 97.3 |
| Gaussian Noise | 100.0 | 100.0 | 25.0 | 0.0 | 93.9 |

Table 8. Target success rate (%) for adversarial examples with random target labels from different datasets in the presence of adversarial detectors, including feature squeezing (Xu et al., 2018) and MagNet (Meng and Chen, 2017). Similar to adversarial examples generated from in-distribution inputs, OOD adversarial examples are able to evade the adversarial detectors with a high success rate (with the exception of MagNet for models trained on the MNIST dataset, which is explained in Section C.2).
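The target success rate in the presence of a detector, as defined at the start of this appendix, counts an example only when it both evades the detector and reaches the target label. A minimal sketch, with hypothetical argument names and toy values:

```python
import numpy as np

def target_success_rate(predicted, targets, flagged):
    """Percentage of examples classified as the target AND not flagged by the detector."""
    predicted = np.asarray(predicted)
    targets = np.asarray(targets)
    flagged = np.asarray(flagged, dtype=bool)
    success = (predicted == targets) & ~flagged
    return 100.0 * float(success.mean())

# Toy example: the second input hits its target but is detected,
# and the third input evades detection but misses its target.
rate = target_success_rate(
    predicted=[3, 3, 1, 7],
    targets=[3, 3, 2, 7],
    flagged=[False, True, False, False],
)
```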

C.1. Feature Squeezing

Experimental details. We use the joint squeezers (bit depth reduction, non-local smoothing, and median filtering) recommended by Xu et al. (2018) for the MNIST, CIFAR-10, and ImageNet datasets. The models for CIFAR-10 and ImageNet are WRN-28-10 and MobileNet, respectively; the MNIST model is listed in Table 3. We generate the perturbation with the Adam (Kingma and Ba, 2014) solver and the straight-through-estimator approach from Athalye et al. (2018). We report the target success rate of adversarial examples for feature squeezing in Table 8. Based on these results, the following two conclusions can be drawn:

Adversarial examples generated from both in-distribution and out-of-distribution images achieve a high success rate: For all target models, we observe that the adaptive adversary can achieve a high success rate (above 95% in most cases; see Table 8) for both OOD and in-distribution adversarial examples. This is because feature squeezing uses a non-adversarially trained model, which lacks robustness once the adversary can compute informative gradients in the presence of the squeezers.

Most OOD adversarial examples require less input perturbation: In addition to achieving a high target success rate, adversarial examples generated from OOD datasets also require less input perturbation. For networks trained on the MNIST dataset, the mean norm of the input perturbations for the MNIST and CIFAR-10 datasets is 5.67 and 2.12, respectively. Similarly, for the CIFAR-10 trained model, the mean perturbation required for CIFAR-10 and ImageNet images is 1.24 and 1.00, respectively.
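To make the detector concrete, the sketch below shows one squeezer (bit depth reduction) and a detection score in the style of feature squeezing: the largest L1 distance between the model's prediction on an input and on its squeezed versions. It assumes inputs scaled to [0, 1]; `toy_predict` is a hypothetical stand-in for a real classifier, and the decision threshold is not shown.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Quantize inputs in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.rint(x * levels) / levels

def squeezing_score(predict, x, squeezers):
    """Largest L1 distance between predictions on x and on each squeezed x."""
    base = predict(x)
    return max(float(np.abs(base - predict(s(x))).sum()) for s in squeezers)

def toy_predict(x):
    # Hypothetical two-class "model": confidence follows the mean intensity.
    m = float(np.mean(x))
    return np.array([m, 1.0 - m])

squeezers = [lambda x: reduce_bit_depth(x, 1)]
score_gray = squeezing_score(toy_predict, np.full(4, 0.4), squeezers)
score_binary = squeezing_score(toy_predict, np.array([0.0, 1.0, 1.0, 0.0]), squeezers)
```

An input is flagged when its score exceeds a threshold chosen on clean data. Because rounding has zero gradient almost everywhere, an adaptive attack needs a gradient approximation such as the straight-through estimator mentioned above.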

C.2. MagNet

We follow the code released by Meng and Chen (2017) to train the MNIST and CIFAR-10 classifiers and the corresponding autoencoders. The model for CIFAR-10 is the All Convolution Net (Springenberg et al., 2014); both models are listed in Table 3. We consider an adaptive adversary that incorporates the distance between the input and its projection by the autoencoder into the loss function. The key results for MagNet, presented in Table 8, are:

OOD adversarial examples can achieve a high target success rate for the model trained on CIFAR-10. As shown in Table 8, with PGD-xent attacks and random target labels, OOD adversarial examples from every dataset except MNIST achieve a target success rate of at least 92.5% in the presence of MagNet for the model trained on the CIFAR-10 dataset. The target success rate of adversarial examples generated from in-distribution data is 90.1%.

OOD adversarial examples have reduced effectiveness for MNIST models. For the model trained on the MNIST dataset, we observe that the distance between the autoencoder's projection and the original OOD input tends to be around 10 times the threshold used by MagNet. This makes it hard, even for the adaptive adversary, to reduce the distance while achieving the target classification for OOD adversarial examples. We observe that additional pre-processing, such as moving the pixels of input images close to zero (i.e., similar to the background of most MNIST images), can increase the target success rate from 0 to up to 12%.
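The adaptive adversary's objective can be sketched as a classification loss toward the target plus a weighted reconstruction distance. This is an illustrative form only: the L2 distance, the `weight` parameter, and the toy classifier and autoencoder are assumptions, not the exact loss used in the experiments.

```python
import numpy as np

def adaptive_loss(x_adv, target, classify, autoencode, weight=1.0):
    """Cross-entropy toward `target` plus weighted input-to-reconstruction distance."""
    probs = classify(x_adv)
    xent = -np.log(probs[target] + 1e-12)          # push the prediction to the target
    recon = np.linalg.norm((x_adv - autoencode(x_adv)).ravel())  # stay near the autoencoder's projection
    return float(xent + weight * recon)

# Hypothetical stand-ins for the classifier and the autoencoder.
classify = lambda x: np.array([0.5, 0.5])
identity_ae = lambda x: x                  # reconstructs the input perfectly
zero_ae = lambda x: np.zeros_like(x)       # reconstructs everything as zeros

x = np.array([3.0, 4.0])
loss_close = adaptive_loss(x, target=0, classify=classify, autoencode=identity_ae)
loss_far = adaptive_loss(x, target=0, classify=classify, autoencode=zero_ae)
```

When the reconstruction distance of the starting input is far above the detector's threshold, as reported for OOD inputs on MNIST models, the second term dominates and the attack has little budget left to change the classification.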

Appendix D Additional results: Robust open-world machine learning

To train the classifier with a background class, we use 5,000 images from one of the following datasets: MNIST, ImageNet, VOC12, and Random Photographs. We include only 5,000 images to avoid data bias, since each class in the CIFAR-10 dataset has 5,000 images. To include multiple OOD datasets, we add one background class for each. Figure 5 shows the success of the classifier in rejecting out-of-distribution inputs: datasets such as VOC12 and ImageNet provide high inter-dataset and intra-dataset generalization for the detection of unmodified OOD inputs. Extensive experimental results for different attack methods and datasets are presented in Table 9.
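The background-class construction can be sketched as follows: each OOD dataset contributes a fixed number of images under its own extra label, and rejection accuracy is the share of OOD inputs assigned to a background label. The function names are hypothetical, and counting a prediction of any background class as a rejection is an assumption.

```python
import numpy as np

def add_background_classes(x_in, y_in, ood_sets, n_in_classes=10, per_set=5000, seed=0):
    """Append `per_set` images from each OOD set, one new background label per set."""
    rng = np.random.default_rng(seed)
    xs, ys = [x_in], [y_in]
    for k, x_ood in enumerate(ood_sets):
        idx = rng.choice(len(x_ood), size=min(per_set, len(x_ood)), replace=False)
        xs.append(x_ood[idx])
        ys.append(np.full(len(idx), n_in_classes + k, dtype=y_in.dtype))
    return np.concatenate(xs), np.concatenate(ys)

def rejection_accuracy(predicted, background_labels):
    """Percentage of OOD inputs classified into any background class."""
    return 100.0 * float(np.isin(predicted, list(background_labels)).mean())

# Toy data: 20 in-distribution samples and two small OOD sets.
x_in = np.zeros((20, 4))
y_in = np.arange(20) % 10
ood_a, ood_b = np.ones((8, 4)), np.full((6, 4), 2.0)
x_all, y_all = add_background_classes(x_in, y_in, [ood_a, ood_b], per_set=5)
acc = rejection_accuracy(np.array([10, 3, 11, 10]), {10, 11})
```

Capping each background class at the size of an in-distribution class keeps the augmented training set balanced, which is the stated reason for using 5,000 OOD images.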

Figure 5. Classification accuracy of non-modified OOD inputs (higher is better). The classification accuracy represents the percentage of OOD inputs classified to the background class.
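| OOD dataset for training | | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll | rand | ll |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | 1 | 1 | 0 | 1 | 0 | 1.3 | 0 | 1.3 | 0 | 3.2 | 0 | 2.6 | 0 | 3.5 | 0 | 3.5 | 0 | 3.6 | 0 | 4.5 | 0 | 11.0 | 0 | 8.4 | 0 |
| | 8 | 5.1 | 0 | 5.7 | 0 | 22.9 | 5.1 | 22.3 | 4.4 | 44.9 | 4.9 | 39.5 | 4.6 | 54.9 | 11.5 | 50.3 | 9.9 | 46.3 | 12.2 | 47.5 | 11.4 | 88.6 | 6.1 | 87.8 | 4.6 |
| | 16 | 11.9 | 0 | 12.5 | 0 | 65.6 | 48.6 | 60.4 | 42.7 | 83.8 | 53.9 | 80.3 | 46.9 | 91 | 74.2 | 87.8 | 66.9 | 83.9 | 63.7 | 82.1 | 57.1 | 100 | 100 | 100 | 100 |
| MNIST | 1 | 0 | 0 | 0 | 0 | 1.4 | 0 | 1.6 | 0 | 5.1 | 0 | 3.9 | 0 | 4.3 | 0 | 3.8 | 0 | 3.6 | 0 | 3.3 | 0 | 3.5 | 0 | 5.4 | 0 |
| | 8 | 0 | 0 | 0 | 0 | 24.4 | 1.7 | 22.3 | 1.4 | 42.9 | 1.2 | 41.3 | 1 | 51.2 | 2.4 | 50.9 | 1.9 | 45.4 | 3.5 | 44.6 | 3.5 | 72.2 | 0 | 73.5 | 0 |
| | 16 | 0 | 0 | 0 | 0 | 65 | 28.5 | 61.6 | 24.8 | 82.9 | 17.7 | 81.7 | 14.6 | 90.4 | 33 | 89 | 27.3 | 82.3 | 38.3 | 80.3 | 31.6 | 100 | 0 | 100 | 0 |
| ImageNet | 1 | 0.6 | 0 | 0.7 | 0 | 0.7 | 0 | 1 | 0 | 0 | 0 | 0.2 | 0 | 2.1 | 0 | 1.2 | 0 | 1.8 | 0 | 1.7 | 0 | 0 | 0 | 0 | 0 |
| | 8 | 5.5 | 0 | 4.6 | 0 | 21.2 | 4.3 | 23.9 | 3.2 | 7.4 | 0.5 | 8.3 | 0.5 | 36.3 | 4 | 34.7 | 3 | 34.7 | 5.8 | 30.5 | 5 | 0 | 0 | 0 | 0 |
| | 16 | 13 | 0 | 13.4 | 0 | 63.7 | 47.3 | 60.3 | 40.7 | 39.1 | 13.6 | 36.7 | 12.3 | 84.8 | 56.6 | 80.2 | 48.3 | 77.2 | 52.5 | 72.2 | 44.3 | 5.4 | 0 | 6 | 0 |
| VOC12 | 1 | 0.5 | 0 | 0.4 | 0 | 0.5 | 0 | 1.4 | 0 | 0.6 | 0 | 1 | 0 | 1.1 | 0 | 1.3 | 0 | 1.8 | 0 | 2.6 | 0 | 0 | 0 | 0 | 0 |
| | 8 | 5.1 | 0 | 5.2 | 0 | 23.6 | 3.7 | 21.9 | 3.7 | 15.8 | 1.2 | 17.8 | 3.7 | 32.2 | 2.6 | 31.8 | 2.6 | 33.1 | 7 | 34.4 | 5.3 | 24 | 0 | 24.7 | 0 |
| | 16 | 12.4 | 0 | 13.9 | 0 | 65.5 | 46.4 | 64.8 | 46.7 | 54.5 | 22 | 56.8 | 21.9 | 80.7 | 52 | 80.3 | 52 | 75.5 | 49.5 | 79.9 | 51.5 | 100 | 100 | 100 | 100 |
| | 1 | 1 | 0 | 0.7 | 0 | 1 | 0 | 1.5 | 0 | 2.5 | 0 | 2.9 | 0 | 3.2 | 0 | 2.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 8 | 6.6 | 0 | 5.8 | 0 | 23.2 | 3.3 | 20.9 | 2.4 | 39.1 | 4 | 36.8 | 3.3 | 54.5 | 8.7 | 51.1 | 7.8 | 0.9 | 0.4 | 0.7 | 0.4 | 42.3 | 24 | 45.5 | 16.4 |
| | 16 | 16.4 | 0 | 13.8 | 0 | 63.8 | 42.6 | 61.1 | 36 | 81 | 49.6 | 76.6 | 41.6 | 90.8 | 69.9 | 87.1 | 62.5 | 38.5 | 30.2 | 36.6 | 29.3 | 100 | 100 | 100 | 100 |
| | 1 | 0 | 0 | 0.7 | 0 | 1.1 | 0 | 1.4 | 0 | 0.2 | 0 | 0.2 | 0 | 1 | 0 | 0.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 8 | 0 | 0 | 0 | 0 | 21.1 | 1.2 | 20.2 | 1 | 8.6 | 0.2 | 10.5 | 0.2 | 31.6 | 1 | 31 | 1 | 1.3 | 0.6 | 1.4 | 0.6 | 0 | 0 | 0 | 0 |
| | 16 | 0 | 0 | 0 | 0 | 65.6 | 24.7 | 63.7 | 19 | 49 | 9.6 | 45.9 | 7.8 | 80.2 | 28.3 | 77.1 | 23.5 | 43.7 | 11.4 | 41.3 | 10.1 | 51.4 | 2.1 | 46.9 | 0.2 |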
Table 9. Adversarial training with a model augmented with additional background classes to discriminate between in-distribution and out-of-distribution data. The target success rate is reported with the PGD-xent attack and random target label selection. Including a subset of an OOD dataset in training significantly decreases the target success rate of adversarial examples generated from that dataset. Surprisingly, this robustness can generalize to other OOD datasets that are not included in training. In addition, robustness to multiple datasets can be achieved by including a subset of each in the training phase.