Deep neural networks are vulnerable to adversarial examples: small input changes that dramatically alter model output. We propose Neural Fingerprinting, a simple yet effective method to detect adversarial examples by verifying whether model behavior is consistent with a set of secret fingerprints, inspired by the use of biometric and cryptographic signatures. The benefits of our method are that 1) it is fast, 2) it is prohibitively expensive for an attacker to reverse-engineer which fingerprints were used, and 3) it does not assume knowledge of the adversary. In this work, we pose a formal framework to analyze fingerprints under various threat models, and characterize Neural Fingerprinting for linear models. For complex neural networks, we empirically demonstrate that Neural Fingerprinting significantly improves on state-of-the-art detection mechanisms by detecting the strongest known adversarial attacks with 98-100% AUC-ROC on the MNIST, CIFAR-10 and MiniImagenet (20 classes) datasets. In particular, the detection accuracy of Neural Fingerprinting generalizes well to unseen test data under various black- and whitebox threat models, and is robust over a wide range of hyperparameters and choices of fingerprints.
Deep neural networks are highly effective pattern-recognition models for many applications, such as computer vision, speech recognition and sequential decision-making. However, neural networks are vulnerable to adversarial examples: an attacker can add small perturbations to input data that maximally change the model's output (Szegedy et al., 2013; Goodfellow et al., 2014). Hence, a key challenge is how to make neural networks reliable and robust for large-scale applications in noisy environments or mission-critical settings, such as autonomous vehicles.
To make neural networks robust against adversarial examples, we propose Neural Fingerprinting (NeuralFP): a fast, secure and effective method to detect adversarial examples.
The key intuition for NeuralFP is that we can encode secret fingerprint patterns into the behavior of a neural network around the input data. This pattern characterizes the network’s expected behavior around real data and can thus be used to reject fake data, where the model outputs are not consistent with the expected fingerprint outputs. This process is shown in Figure 1.
This approach is highly effective: encoding fingerprints is simple to implement during training, and evaluating fingerprints is computationally cheap. Neural Fingerprinting is also secure: to craft a successful adversarial example, an attacker would have to find input perturbations that significantly change the network's output without violating the secret fingerprints. It is, however, computationally and statistically expensive for an attacker to reverse-engineer the secret fingerprints used, and random attacks have a low probability of success. Furthermore, Neural Fingerprinting does not require knowledge of the adversary's attack method, and differs from state-of-the-art methods (Meng & Chen, 2017; Xingjun Ma, 2018) that detect adversarial examples using auxiliary classifiers.
In this work, we theoretically characterize the feasibility and security of NeuralFP, and experimentally validate that NeuralFP achieves almost perfect detection AUC scores against state-of-the-art adversarial attacks on various datasets. To summarize, our key contributions are:
We present NeuralFP: a simple and secure method to detect adversarial examples that does not rely on knowledge of the attack mechanism.
We describe a formal framework to characterize the hardness of reverse-engineering fingerprint patterns and characterize the effectiveness of NeuralFP for linear classification.
We empirically demonstrate that NeuralFP achieves state-of-the-art near-perfect AUC-scores against the strongest known adversarial attacks. In particular, we show that NeuralFP correctly distinguishes between unseen test data and adversarial examples.
We empirically show that the performance of NeuralFP is robust to the choice of fingerprints and is effective for a wide range of choices of fingerprints and hyperparameters.
We also show that NeuralFP can be robust even in the whitebox-attack setting, where an adaptive attacker has knowledge of the fingerprint data.
Source code is available at https://github.com/StephanZheng/neural-fingerprinting.
We consider supervised classification, where we aim to learn a model $f_\theta$ from labeled data $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \{0, 1\}^K$ is a 1-hot label vector ($K$ classes). The model predicts class probabilities $f_\theta(x) \in [0, 1]^K$ and can be learned via a loss function, e.g. cross-entropy loss. Formally, this data is generated by sampling from a data-generating distribution $p^*(x, y)$, which characterizes "real" data ($p^*(x, y) > 0$).
Adversarial attacks produce perturbations that exploit the behavior of a neural network at an input point $x$, in particular when $x$ is a high-dimensional vector. An adversarial perturbation $\eta$ causes a large change in model output, i.e. for $\hat{x} = x + \eta$:
$$\left\| f_\theta(\hat{x}) - f_\theta(x) \right\| \gg 0,$$
such that e.g. the class predicted by the model changes:
$$\arg\max_k f_\theta(\hat{x})_k \neq \arg\max_k f_\theta(x)_k.$$
We define such data to be "fake", i.e. formally $p^*\!\left(\hat{x}, \arg\max_k f_\theta(\hat{x})_k\right) = 0$.
For instance, the Fast-Gradient Sign Method (Goodfellow et al., 2014) perturbs $x$ to $\hat{x}$ along the gradient of the loss:
$$\hat{x} = x + \epsilon \, \mathrm{sign}\!\left( \nabla_x L(f_\theta(x), y) \right).$$
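The gradient-sign step above can be sketched in a few lines; the toy linear model, its squared-error loss, and the budget `eps` below are illustrative choices, not the paper's setup.

```python
import numpy as np

def fgm_perturb(x, grad, eps=0.1):
    """Fast-Gradient Sign step: move x by eps along the sign of the
    loss gradient at x (a sketch with an illustrative budget eps)."""
    return x + eps * np.sign(grad)

# Toy example: gradient of a squared-error loss for a linear model w.x
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.4, -0.1])
target = 1.0
pred = w @ x                      # model output
grad = 2 * (pred - target) * w    # dL/dx for L = (pred - target)^2
x_adv = fgm_perturb(x, grad, eps=0.1)
```

Note that the perturbation size per component is exactly `eps`, so the attack stays inside an $\ell_\infty$ ball of radius `eps` around `x`.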
Our goal is to defend neural networks by robustly detecting adversarial examples. Specifically, we introduce NeuralFP: a method that detects whether an input-output pair is consistent with the data distribution (“real”), or is adversarial (“fake”). This algorithm is summarized in Algorithm 1 and Figure 1.
The key idea of NeuralFP is to detect adversarial inputs by checking whether the network output at specific points around $x$ closely resembles a set of fingerprints that can be chosen by the defender. These chosen outputs are embedded into the network during training. Formally, a fingerprint is a pair $(\Delta x, \Delta y)$ of an input perturbation and an associated output perturbation.
For $K$-class classification, we define a set of fingerprints:
$$\mathcal{F} = \left\{ \left( \Delta x^i, \Delta y^{i, j} \right) \right\}_{i = 1, \ldots, N; \; j = 1, \ldots, K},$$
where $\Delta y^{i, j}$ is the fingerprint output for class $j$. Here, the $\Delta x^i$ ($\Delta y^{i, j}$) are input (output) perturbations that are chosen by the defender. Note that across classes $j$, we use the same input directions $\Delta x^i$.
NeuralFP acts as follows: it classifies a new input $x$ as real if the change in model output $f(x + \Delta x^i) - f(x)$ is close to the $\Delta y^{i, j}$ for some class $j$, for all $i = 1, \ldots, N$. Here, we use a comparison function $D$ and threshold $\tau$ to define the level of agreement required, i.e. we declare $x$ real when
$$\min_j D(x, j) < \tau.$$
As the comparison function $D$, we take:
$$D(x, j) = \frac{1}{N} \sum_{i = 1}^{N} \left\| \left( F(x + \Delta x^i) - F(x) \right) - \Delta y^{i, j} \right\|_2,$$
where $F(x)$ are normalized logits. Hence, NeuralFP is defined by the data:
$$\mathrm{NFP} = \left( \mathcal{F}, D, \tau \right).$$
Once a defender has constructed a set of desired fingerprints (7), the chosen fingerprints can be embedded in the network's response by adding a fingerprint regression loss during training. We elaborate on how to choose fingerprints hereafter. Given a classification model (1) with logits $F(x)$, the fingerprint loss is:
$$L_{\mathrm{fp}}(x, y; \theta) = \sum_{i = 1}^{N} \left\| \left( F(x + \Delta x^i) - F(x) \right) - \Delta y^{i, k} \right\|_2^2,$$
where $k$ is the ground-truth class for example $(x, y)$ and the $\Delta y^{i, k}$ are the fingerprint outputs. Note that we only train on the fingerprints for the ground-truth class. The total training objective then is:
$$L(x, y; \theta) = L_{\mathrm{task}}(x, y; \theta) + \alpha \, L_{\mathrm{fp}}(x, y; \theta),$$
where $L_{\mathrm{task}}$ is a loss function for the task (e.g. cross-entropy loss for classification) and $\alpha$ a positive scalar. In the following we use a fixed $\alpha$; in practice, we choose $\alpha$ such that it balances the task and fingerprint losses.
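The fingerprint regression loss can be sketched directly from its definition; this is a framework-agnostic numpy sketch (in practice it would be written in an autodiff framework so its gradient flows into $\theta$), and the identity "logit" function in the test setup is purely illustrative.

```python
import numpy as np

def fingerprint_loss(F, x, k, dxs, dys):
    """Fingerprint regression loss (sketch): squared deviation between
    the observed changes in logits F along each direction dxs[i] and
    the fingerprint outputs dys[i, k] of the ground-truth class k."""
    base = F(x)
    return float(sum(np.sum((F(x + dx) - base - dys[i, k]) ** 2)
                     for i, dx in enumerate(dxs)))

# Per-example training objective: L_task + alpha * L_fp, with alpha a
# positive scalar balancing the two terms (a tuning choice).
```

When the model's responses already match the ground-truth class fingerprints exactly, the loss is zero; any mismatch contributes its squared norm.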
Evaluating fingerprints requires extra computation, as Algorithm 1 requires $N$ extra forward passes to compute the differences $F(x + \Delta x^i) - F(x)$. A straightforward implementation is to check (8) iteratively for all classes, and stop whenever an agreement is seen or all classes have been exhausted. However, this operation can in principle be parallelized and performed in minibatches for real-time applications.
There are several threat models, with increasing levels of attacker knowledge of NFP and the model $f_\theta$, under which we can analyze the security and effectiveness of NeuralFP. Throughout, we assume that the attacker has access to the model parameters $\theta$ and can query the model and its derivatives without limitation.
We can then characterize the security and feasibility of NeuralFP in both the whitebox-attack (attacker has perfect knowledge of NFP) and blackbox-attack setting (attacker knows nothing about NFP but is aware of the model weights). For example, a physical instance of a whitebox-attack occurs when the attacker has access to the compute instance where the forward passes are executed (e.g. the model is queried on a local device). In this case, it should be assumed that the attacker can get access to the precise fingerprints that are used by the defender, e.g. by reading the raw memory state. A blackbox-attack might occur when the internal state of the compute instance is shielded, e.g. when is queried in the cloud.
Additionally, we define the notion of whitebox-defense (the defender has knowledge of the attacker’s strategy) and blackbox-defense (defender has no knowledge about attacker).
If the attacker has full knowledge of the fingerprint data NFP, a key question is how vulnerable the model is, i.e. how many examples $\hat{x}$ can an attacker find that will be falsely flagged as "real"?
Firstly, we will demonstrate in Theorem 1 that for Support Vector Machines (binary classification with linear models), we can characterize the region of inputs that will be classified as "real" for a given set of fingerprints, and to what extent this corresponds with the support of the data distribution $p^*$.
In the blackbox-attack setting, the attacker has no knowledge of NFP, i.e. does not know the metric $D$, the threshold $\tau$, or the (number of) fingerprints $\mathcal{F}$ ($N$). An attacker would 1) have to find the fingerprints used by the defender and then 2) construct adversarial examples that are compatible with those fingerprints.
First, it is intuitive to see that reverse-engineering the fingerprints $\mathcal{F}$ is combinatorially and statistically hard. Consider a simple case where only the $\Delta x^i$ are unknown, and the attacker knows that each fingerprint direction is discrete, i.e. each component takes one of a small number of values (say $\{-\epsilon, \epsilon\}$). Then the attacker would have to search over combinatorially many candidate directions ($2^d$ per direction) to find the subset that satisfies the detection criterion in equation (8). A defender can shrink this space of feasible fingerprints by setting the threshold level $\tau$. Hence, we conjecture that reverse-engineering the set of feasible fingerprints for a general model class is exponentially hard, but leave a full proof for future research.
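A quick back-of-the-envelope calculation illustrates the blow-up; the values of $d$ and $N$ below are illustrative (MNIST dimensionality and a small number of directions).

```python
# Size of the naive search space over discrete fingerprint directions:
# if each of the d components of a direction takes a value in
# {-eps, +eps}, one direction already has 2**d candidates, and N
# independent directions give (2**d)**N candidates in total.
d, N = 784, 10                          # e.g. MNIST inputs, 10 directions
candidates_per_direction = 2 ** d
total_candidates = candidates_per_direction ** N
digits = len(str(total_candidates))     # decimal digits of the count
```

Even for a single direction the count has hundreds of digits, so exhaustive enumeration is hopeless; the statistical difficulty of testing each candidate against the detector compounds this.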
Secondly, a successful attacker needs to construct $\hat{x}$ that maximizes the task loss while satisfying the fingerprint check:
$$\max_{\hat{x}} \; L_{\mathrm{task}}(\hat{x}, y; \theta) \quad \text{s.t.} \quad \min_j D(\hat{x}, j) < \tau,$$
ensuring that the model outputs are $\tau$-close to the fingerprints for some class $j$. This poses a difficult constrained optimization problem, where the feasible set of $\hat{x}$ is non-convex. Hence, in general it is hard for a random attacker to succeed, i.e. find a feasible solution.
We now investigate the effect of fingerprints and to what extent fingerprints characterize data distributions. In particular, given fingerprint data NFP as in (12), can we characterize which inputs will be classified as "real", how that depends on the number of fingerprints $N$, and how efficient fingerprints are?
To do so, we first consider fingerprints for Support Vector Machines (SVMs), i.e. binary classification with linear models:
$$f(x) = \mathrm{sign}\!\left( w^\top x + b \right),$$
on inputs $x \in \mathbb{R}^d$ (e.g. $d = 784$ for MNIST). The binary classifier defines a hyperplane $w^\top x + b = 0$, which aims to separate positive ($y = +1$) from negative ($y = -1$) examples. We will assume that the data-generating distribution for the positive and negative examples is perfectly separated by a hyperplane, defined by the unit normal $w$. We define the minimal and maximal distance from the examples to the hyperplane along $w$ as $d_{\min}$ and $d_{\max}$.
In this setting, the set of $\hat{x}$ classified as "real" by fingerprints is determined by the geometry of the hyperplane, where for detection we measure the exact change in predicted class (i.e. exact matching). Theorem 1 below then characterizes fingerprints for SVMs:
These fingerprints will detect adversarial perturbations as "fake" for which one of the following conditions holds:
This choice of $\Delta x$s is optimal: these fingerprints minimize the set of inputs that can be misclassified as "real" using a first-order metric $D$.
We illustrate this proof for two fingerprints in Figure 2. Consider any perturbation $\eta$ that is positively aligned with $w$ and has $\|\eta\| < d_{\min}$. Then for any negative example $x_-$ (except for the support vectors that lie exactly at distance $d_{\min}$ from the hyperplane), adding the perturbation does not change the class prediction:
$$f(x_- + \eta) = f(x_-).$$
The fingerprint in (18) is an example of such an $\eta$. However, if $\|\eta\|$ is large enough, that is:
$$\|\eta\| > d_{\max}$$
(e.g. the fingerprint in (19)), then for all negative examples the class prediction will always change (except for the $x_-$ that lie exactly at distance $d_{\max}$ from the hyperplane):
$$f(x_- + \eta) \neq f(x_-).$$
Note that if $\eta$ has a component along $w$ smaller (or larger) than these bounds, it will exclude fewer (more) examples, e.g. those that lie closer to (farther from) the hyperplane. Similar observations hold for fingerprints (20) and (21) and the positive examples $x_+$. Hence, it follows that for any $\hat{x}$ that lies too close to the hyperplane (closer than $d_{\min}$), or too far (farther than $d_{\max}$), the model output after adding the four fingerprints will never perfectly correspond to their behavior on examples from the data distribution. For instance, for any $\hat{x}$ that is closer than $d_{\min}$ to the hyperplane, (21) will always cause a change in class, while none was expected. Similar observations hold for the other regions in (22). Since the SVM is translation invariant parallel to the hyperplane, the fingerprints can only distinguish examples based on their distance perpendicular to the hyperplane. Hence, this choice of $\Delta x$s is optimal. ∎
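The geometry of the argument can be checked numerically on a toy separable dataset; the hyperplane, points, and distances below are invented for illustration only.

```python
import numpy as np

# Toy SVM setting from Theorem 1: hyperplane w.x + b = 0 with unit
# normal w; d_min / d_max are the nearest / farthest distances of the
# negative examples from the hyperplane along w.
w = np.array([1.0, 0.0]); b = 0.0                          # x0 = 0 plane
neg = np.array([[-1.0, 0.3], [-2.0, -0.5], [-3.0, 1.0]])   # negatives
dists = -(neg @ w + b)                    # distances along w (all > 0)
d_min, d_max = dists.min(), dists.max()

def pred(x):
    """SVM class prediction sign(w.x + b)."""
    return np.sign(x @ w + b)

# A perturbation along w smaller than d_min flips no negative example;
# one larger than d_max flips every negative example.
eta_small = 0.9 * d_min * w
eta_big = 1.1 * d_max * w
```

The two assertions in the accompanying check mirror the two cases in the proof: below $d_{\min}$ no class prediction changes, above $d_{\max}$ all of them do.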
Note that Theorem 1 by itself does not prevent attacks parallel to the decision boundary: an adversary could in principle add a (large) perturbation that pushes a negative example across the SVM decision boundary to a region where the data distribution has no support ($p^*(x) = 0$), but which is classified as positive by the SVM. However, we can further prevent attacks that stray far from the data distribution, e.g. that go too far in a direction parallel to the hyperplane, by checking the distance to the nearest example in the dataset. This would essentially restrict the adversary to perturbations of limited magnitude, i.e. $\|\eta\| \leq \epsilon$, although such a nearest-neighbor check could be computationally expensive.
In addition, Theorem 1 assumed that the data is perfectly separable. If this is not the case and there are misclassified examples, then under exact matching the fingerprints would not detect adversarial perturbations correctly for those misclassified examples, and would flag the examples themselves as fake. However, we observed empirically that this problem is ameliorated when using soft matching and nonlinear models, although characterizing this theoretically is an interesting question for future research.
For the general setting, e.g. for nonlinear models, Theorem 1 can be extended if the data is (locally) separable in some feature space $\phi(x)$, so that we can write a general (local) model as
$$f(x) = \mathrm{sign}\!\left( w^\top \phi(x) + b \right).$$
In this case, the fingerprints can be defined analogously, and it is straightforward to lift the analysis of Theorem 1 to this setting as well. However, depending on the feature space chosen, such an analysis might only be applicable to a local region of the input space and require more complex fingerprints.
When applying NeuralFP to complex models such as deep neural networks, a key challenge is how to choose the fingerprints such that the region of inputs that are classified as “real” corresponds as much as possible with the support of the data distribution .
Theorem 1 characterizes fingerprints for SVMs, where for detection we used the exact change in predicted class. In practice, however, using exact class changes is challenging: the feature space can be highly complex (e.g. for deep neural networks), making the geometry of the decision boundaries of $f$ intractable to describe. NeuralFP therefore utilizes a softer notion of fingerprint matching, checking whether the model outputs match (a pattern of) changes in logits as in equation (10), which can be learned by the model during training.
In this work, we focus on deep neural networks that have high model capacity, such that fitting almost arbitrary fingerprint patterns accurately is a feasible goal. In fact, Zhang et al. (2016) have shown that neural networks can fit arbitrary random or complex patterns of prediction labels.
In particular, this motivates a general and straightforward method to construct the fingerprints: randomly sample the directions $\Delta x^i$, and encourage constancy and/or changes toward another class (cf. the fingerprints for positive examples in the SVM case). For example, a simple choice of fingerprint outputs that increases the class-$j$ probability along $\Delta x^i$ uses (scaled) 1-hot vectors $e_j$, although more complex choices are feasible as well. We will show in Section 3.3 that the performance of NeuralFP is robust over a wide range of fingerprints $\mathcal{F}$.
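This construction can be sketched as follows; the magnitudes `eps` and `alpha` and the helper name are illustrative assumptions, not the paper's hyperparameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_fingerprints(n_dirs, d, n_classes, eps=0.05, alpha=1.0):
    """Randomly sampled fingerprints (sketch): input directions with
    components uniform in [-eps, eps], and scaled 1-hot output
    patterns e_j that push probability mass toward class j along
    every direction. eps, alpha are illustrative magnitudes."""
    dxs = rng.uniform(-eps, eps, size=(n_dirs, d))
    dys = np.zeros((n_dirs, n_classes, n_classes))
    for j in range(n_classes):
        dys[:, j, j] = alpha          # scaled 1-hot vector e_j
    return dxs, dys

dxs, dys = make_fingerprints(n_dirs=5, d=784, n_classes=10)
```

Because the directions are random and secret, a defender can regenerate a fresh set cheaply, which is part of what makes reverse-engineering expensive.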
We now empirically validate the effectiveness of NeuralFP, as well as analyze its behavior and robustness (code for experiments: https://github.com/StephanZheng/neural-fingerprinting). Our goal is to answer the following questions:
How well does NeuralFP distinguish between normal and adversarial examples?
How sensitive and robust is NeuralFP to changes in hyperparameters?
How do the fingerprints of normal and adversarial examples differ?
How robust is NeuralFP to an attacker that has full knowledge of the fingerprints?
Does NeuralFP scale to high-dimensional inputs?
To do so, we report the AUC-ROC performance of NeuralFP on the MNIST, CIFAR-10 and MiniImagenet-20 datasets against four state-of-the-art adversarial attacks, using LID (Xingjun Ma, 2018), the strongest known detection-based defense, as the baseline. We compare against LID in both the blackbox-defense and whitebox-defense settings (in the latter, LID has knowledge of the attack mechanism and the LID classifier has been trained on FGM examples).
We test on the following attacks:
Fast Gradient Method (FGM) (Goodfellow et al., 2014) and Basic Iterative Method (BIM) (Kurakin et al., 2016) are both gradient-based attacks, with BIM being an iterative variant of FGM. We consider both BIM-a (iterates until misclassification is achieved) and BIM-b (iterates a fixed number of times (50)).
Jacobian-based Saliency Map Attack (JSMA) (Papernot et al., 2015) iteratively perturbs two pixels at a time, based on a saliency map.
For each dataset, we consider a randomly sampled pre-test-set of test-set images, and discard misclassified pre-test images. For the test-set of remaining images, we generate adversarial perturbations by applying each of the above mentioned attacks. We report AUC-ROC on sets composed in equal measures of the test-set and test-set adversarial examples with varying threshold . See the appendix for details regarding all model architectures used and construction of MiniImagenet-20.
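The AUC-ROC metric used throughout can be computed directly from detection scores; this brute-force sketch (quadratic in the set sizes, unlike library implementations) treats the fingerprint loss as the score, with fake examples expected to score higher.

```python
def auc_roc(scores_real, scores_fake):
    """AUC-ROC from detection scores (sketch): the probability that a
    randomly drawn fake example scores higher than a randomly drawn
    real one, counting ties as half. Equivalent to the area under the
    ROC curve traced out by sweeping the threshold tau."""
    wins = sum((f > r) + 0.5 * (f == r)
               for r in scores_real for f in scores_fake)
    return wins / (len(scores_real) * len(scores_fake))
```

A score of 1.0 means the threshold sweep can separate real from fake perfectly; 0.5 is chance level.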
We trained a 5-layer ConvNet to test-accuracy. The set of are chosen at random, with each pixel perturbation chosen uniformly in . For each , if is of label-class , is chosen to be such that and , with . The AUC-ROC for the best-performing fingerprints (best and ) using grid-search is reported in Table 1. We see that NeuralFP achieves near-perfect detection with AUC-ROC of across all attacks.
For CIFAR-10, we trained a 7-layer ConvNet (similar to (Carlini & Wagner, 2016)). The $\Delta x^i$ and $\Delta y^{i,j}$ are chosen similarly as for MNIST. Table 1 shows that, on average across attacks, NeuralFP outperforms the LID-blackbox defense by 11.77% and the LID-whitebox defense by 8%. Even in comparison with LID-whitebox, NeuralFP is competitive (CW-$L_2$) or outperforms LID (other attacks).
To illustrate the scalability of NeuralFP, we also evaluated on MiniImagenet (Vinyals et al., 2016) with 20 classes randomly chosen from the 100 MiniImageNet classes. For this, we trained an AlexNet-style network on 10,600 images (not downsampled). We generated test-set adversarial examples using BIM-b with 50 steps and FGM. Here, NeuralFP achieves near-perfect AUC-ROC on both attacks, similar to the results for MNIST and CIFAR-10.
In Figure 5 we visualize the fingerprint-loss for test and adversarial examples. The fingerprint-loss differs significantly for most test and adversarial examples, resulting in the AUC-ROC scores being close to 100%.
Next, we study the effect of the hyperparameters $N$ (the number of fingerprint directions) and $\epsilon$ (the fingerprint magnitude) on the AUC-ROC for MNIST and CIFAR-10. Figure 3 shows that NeuralFP performs well across a wide range of hyperparameters and is robust to variation in them. With increasing $\epsilon$, the AUC-ROC for CW-$L_2$ decreases. A possible explanation is that CW-$L_2$ produces smaller adversarial perturbations than other attacks, and for larger $\epsilon$, fingerprints are less sensitive to those small adversarial perturbations. However, the degradation in performance is not substantial as we increase $\epsilon$ over an order of magnitude. With increasing $N$, we see that the AUC-ROC generally increases across attacks. We conjecture that a larger number of fingerprints is sensitive to perturbations in a larger number of directions, and hence results in better detection.
Figure 4 depicts the mean and standard deviation of AUC-ROC scores for 32 sets of randomly chosen fingerprints and varying hyperparameters for CIFAR-10. The mean AUC-ROC is high for all attacks, with a small standard deviation across fingerprint sets. This indicates that the performance of NeuralFP is not very sensitive to the chosen fingerprint directions, and shows that neural networks can learn complex patterns along arbitrary directions.
Furthermore, the test accuracy with NeuralFP on CIFAR-10 and MNIST is similar to the accuracy obtained when training the same models without fingerprints, illustrating that fingerprinting does not degrade prediction performance on the test set, while the high AUC-ROC indicates that the fingerprints generalize well to the test set, but not to adversarial examples.
Instead of simple 1-hot $\Delta y$, we can encode more complex fingerprints. For instance, we trained a network on CIFAR-10 using random $\Delta y$: the output pattern is a random variable that is resampled for each fingerprint, making it prohibitively hard for a brute-force attacker to guess. For this NFP, we achieve high AUC-ROC across attacks without extensive tuning.
We further considered an adaptive attacker that has knowledge of the predetermined fingerprints and model weights, similar to (Carlini & Wagner, 2017a). Here, the adaptive attacker (Adaptive-CW-$L_2$) tries to find an adversarial example that also minimizes the fingerprint loss, attacking a CIFAR-10 model trained with NeuralFP. To this end, the CW-$L_2$ objective is modified as:
$$\min_{\eta} \; \| \eta \|_2 + c \cdot \left( J(x + \eta, y) + L_{\mathrm{fp}}(x + \eta, y; \theta) \right),$$
where $y$ is the label vector, $c$ is a scalar found through a bisection search, $L_{\mathrm{fp}}$ is the fingerprint loss we trained on, and $J$ is an objective encouraging misclassification. Under this threat model, NeuralFP achieves an AUC-ROC of 98.79% against Adaptive-CW-$L_2$ on a set of unseen test samples (1024 pre-test) and the corresponding adversarial examples. In contrast to other defenses that are vulnerable to adaptive attacks (Carlini & Wagner, 2017a), we find that NeuralFP is robust even under this whitebox-attack threat model.
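The adaptive objective can be sketched as a plain function of the perturbation; the callables `misclass_obj` and `fp_loss` and the fixed `c` below are illustrative placeholders (in the actual attack, $c$ would be found by bisection and the objective minimized by gradient descent).

```python
import numpy as np

def adaptive_objective(eta, x, misclass_obj, fp_loss, c=1.0):
    """Adaptive whitebox objective (sketch): perturbation norm plus c
    times a term that both encourages misclassification and drives the
    fingerprint loss down, so the adversarial point also mimics the
    expected fingerprint responses."""
    x_adv = x + eta
    return float(np.sum(eta ** 2) + c * (misclass_obj(x_adv) + fp_loss(x_adv)))
```

With both penalty terms zeroed out, the objective reduces to the squared perturbation norm, making the trade-off explicit: any fingerprint mismatch or classification failure must be bought with a larger perturbation.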
Several forms of defense against adversarial examples have been proposed, including adversarial training, detection, and reconstructing images using adversarial networks (Meng & Chen, 2017). However, (Carlini & Wagner, 2017a, b) showed that many defenses remain vulnerable. (Madry et al., 2017) employ robust-optimization techniques to minimize the maximal loss the adversary can achieve through first-order attacks. (Raghunathan et al., 2018; Kolter & Wong, 2017) train on convex relaxations of the network to maximize robustness. Although these works are complementary to NeuralFP, they do not scale very well. Several other recent defenses attempt to make robust predictions by relying on randomization (Cihang Xie, 2018), introducing non-differentiable non-linearities (Jacob Buckman, 2018), or relying on Generative Adversarial Networks (Yang Song, 2018; Pouya Samangouei, 2018) for denoising images. Instead, we focus on detecting adversarial attacks.
(Xingjun Ma, 2018) detect adversarial samples using an auxiliary logistic regression classifier, which is trained on an expansion-based measure, local intrinsic dimensionality (LID). A similar approach to detection is based on Kernel Density (KD) and Bayesian Uncertainty (BU) estimates using artifacts from pre-trained networks (Feinman et al., 2017). In contrast with these methods, NeuralFP encodes information into the network response during training, and does not depend on auxiliary detectors.
Our experiments suggest that NeuralFP is an effective method for safeguarding against the strongest known adversarial attacks. However, there is room for improvement, as we do not achieve 100% detection rates. An interesting line of future work is to explore whether total detection can be achieved via principled approaches to choosing the NFP. Although empirical evidence suggests NeuralFP is effective, an open question is whether stronger attacks can be developed that fool NeuralFP, or whether it can be proved that NeuralFP is invulnerable to adversarial perturbations. Other interesting avenues include using NeuralFP for robust prediction and extending it to domains beyond image processing that have been shown to be vulnerable to adversarial attacks.
This work is supported in part by NSF grants #1564330, #1637598, #1545126; STARnet, a Semiconductor Research Corporation program, sponsored by MARCO and DARPA; and gifts from Bloomberg and Northrop Grumman. The authors would like to thank Xingjun Ma for providing the relevant baseline numbers for comparison.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv e-prints, June 2017.
For MNIST, we use the model described in Table 3.
For CIFAR-10, we use the model described in Table 4.
For MiniImagenet-20, we use a model similar to AlexNet, described in Table 2.
We use images from the following 20 ImageNet classes for our experiments: