Defending Against Adversarial Examples with K-Nearest Neighbor

by   Chawin Sitawarin, et al.
berkeley college

Robustness is an increasingly important property of machine learning models as they become more and more prevalent. We propose a defense against adversarial examples based on a k-nearest neighbor (kNN) on the intermediate activation of neural networks. Our scheme surpasses state-of-the-art defenses on MNIST and CIFAR-10 against l2-perturbation by a significant margin. With our models, the mean perturbation norm required to fool our MNIST model is 3.07 and 2.30 on CIFAR-10. Additionally, we propose a simple certifiable lower bound on the l2-norm of the adversarial perturbation using a more specific version of our scheme, a 1-NN on representations learned by a Lipschitz network. Our model provides a nontrivial average lower bound of the perturbation norm, comparable to other schemes on MNIST with similar clean accuracy.



1 Introduction


Given adequate data and compute power, neural networks have demonstrated their potential to surpass human-level performance on various benchmarks such as image classification [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2014], playing complex games [Silver et al.2017b, Silver et al.2017a, Mnih et al.2013], controlling driverless vehicles [Chen et al.2015, Bojarski et al.2016], and medical imaging [Litjens et al.2017]. Nonetheless, it is well known that neural networks and other machine learning classifiers still have a number of flaws, one of which is their excessive sensitivity to small perturbations (i.e. adversarial examples) [Biggio et al.2013, Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2015, Moosavi-Dezfooli, Fawzi, and Frossard2015, Nguyen, Yosinski, and Clune2015].

Figure 1: Illustration of our setup: the kNN search is done on the intermediate output of a neural network instead of the input space.

We propose that kNN on representations learned by neural networks can serve as a simple yet strong defense against adversarial examples, surpassing state-of-the-art defenses in an l2-norm setting on MNIST and CIFAR-10. Our model is illustrated in Figure 1. On MNIST, our best model requires an average perturbation size of 3.07 in order to reduce the accuracy to zero, surpassing the state-of-the-art by 0.77 (but with a modest drop of 1.7% on clean accuracy). Our best model on CIFAR-10 also outperforms the state-of-the-art by a large margin, increasing the mean adversarial perturbation to 2.30. Our results also suggest that replacing the final linear fully-connected layer in a network with a kNN consistently results in a more robust classifier.

In fact, there are multiple pieces of evidence for the robustness of kNN in adversarial settings from both theoretical perspectives [Wang, Jha, and Chaudhuri2018, Khoury and Hadfield-Menell2019] and empirical analyses [Papernot, McDaniel, and Goodfellow2016, Papernot and McDaniel2018, Schott et al.2019, Dubey et al.2019]. Despite its potential, kNN is known to perform poorly on high-dimensional data such as most real-world image datasets. Therefore, by doing the neighbor search on representations learned by neural nets, we hope to obtain the robustness benefit of kNN as well as the flexibility and the performance of neural nets. We explore different choices of networks to use as the feature extractor across various models and training methods.

Furthermore, we demonstrate that a 1-NN on intermediate outputs of a Lipschitz network can serve as a certifiable defense where the lower bound of the l2-norm of the perturbation required to change the classification of a given input can be computed. With similar accuracy to the heuristic defense, our certifiable defense is able to provide a reasonable lower bound (0.86) comparable to those of previous works. It can certify that the classification of 80% of the test samples cannot be changed by a perturbation with an l2-norm of 0.5 or less, or 36% with 1 or less. We hope that our scheme suggests a possibility of defending against adversarial examples with a form of similarity search as well as sheds some light on architecture designs for robust neural networks. The code we use for all of the experiments can be found at

2 Background and Related Work

2.1 Adversarial Examples

Adversarial examples are a type of evasion attack against machine learning models. Most adversarial examples on deep neural networks are generated by adding a very small perturbation to legitimate samples [Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2015]. Previous works propose algorithms for finding such a perturbation within some lp-norm ball, which can be formulated as solving the following optimization problem:

$$\max_{\delta \,:\, \|\delta\|_p \le \epsilon} L(x + \delta)$$

where $L$ is some loss function associated with the correct prediction of a clean sample $x$ by the target neural network. The lp-norm constraint is treated as a proxy for imperceptibility of the noise.

2.2 Robustness of k-Nearest Neighbors

The kNN classifier is a popular non-parametric classifier that predicts the label of an input by finding its k nearest neighbors in some distance metric such as Euclidean or cosine distance and taking a majority vote from the labels of the neighbors. While kNN has been widely used and well studied for a long time, it has barely been investigated in adversarial settings.

To the best of our knowledge, only the following four works directly study the adversarial robustness of kNN. Amsaleg et al. prove that under certain assumptions, the robustness of kNN is correlated with the intrinsic dimension of the data. Wang et al. provide a lower bound on the value of k required for the robustness of kNN to approach that of the Bayes optimal classifier [Wang, Jha, and Chaudhuri2018]. Sitawarin & Wagner propose an attack on kNN formulated as an optimization problem. Most recently, Khoury and Hadfield-Menell claim, under certain assumptions, that kNN is naturally robust because the Voronoi cells extend in directions orthogonal to the data manifold, which are believed to be exploited by adversarial examples.

While the previous works provide some evidence suggesting the robustness of plain kNN on the metric, the main drawbacks of kNN are its scalability and its low accuracy on most real-world datasets. The scalability problem is well studied and can be addressed with approximate algorithms or optimized data structures. In this work, we suggest that a kNN on representations of neural networks, as opposed to directly on the input, can achieve relatively high accuracy while maintaining its robustness property.

2.3 Deep k-Nearest Neighbors

DkNN, proposed by Papernot & McDaniel, is a scheme that can be applied to any deep learning model, offering interpretability and robustness through a nearest neighbor search in each of the deep representation layers. The scheme also demonstrates the possibility of detecting adversarial examples and out-of-distribution samples. Soft nearest neighbor (SNN) loss, a regularizer proposed in the follow-up work that encourages entanglement between samples of different classes in the representation space, can increase the detection accuracy [Frosst, Papernot, and Hinton2019].

One may question the scalability of DkNN on large-scale datasets in terms of both performance and accuracy. Dubey et al. [Dubey et al.2019] apply kNN on the representation space of ImageNet models and achieve both reasonable accuracy and an improvement in adversarial robustness. They employ a fast but approximate nearest-neighbor search, which makes the system feasible on a dataset with more than one billion images. However, in their work, the kNN is not used to obtain the final prediction but only to search for the neighbors. Instead, the final output is obtained by taking a weighted average of the softmax of the neighbors.

Furthermore, none of the three previous works explores representations from networks other than those trained with cross-entropy loss. In fact, our work suggests that unsupervised representations as well as adversarial training can yield higher robustness than a vanilla CNN trained with cross-entropy loss.

3 k-Nearest Neighbor on Representations

While general trends suggest that robust networks usually suffer from low clean accuracy, we believe that kNN is a simple scheme that can push this trade-off curve further. Robust networks are insensitive to small changes in the input, but consequently, they cannot express an abrupt change in the output. This makes them unable to separate two samples from different classes that are very close together, resulting in low accuracy.

On the other hand, kNN does not suffer as much from such a constraint since it can attain high accuracy as long as a distance metric in the input space is meaningful or the data exhibit some local structure. However, this assumption is not necessarily true for real-world datasets. Therefore, we aim to rely on neural networks as feature extractors that robustly map inputs to representations satisfying this assumption, and then use a kNN to make the classification on top of the representations.

Our system is simple. After training, the weights of the network are frozen and the network is used as a feature extractor. All the training samples are passed through the network to obtain the representations at a specified layer, which are then stored for the kNN part. At test time, features are extracted from an input and treated as a query to the kNN, which returns the indices of the k nearest representations in the training set.

We build the kNN part on a Python library called Faiss [Johnson, Douze, and Jégou2017], which implements many algorithms for similarity search. Unless stated otherwise, we use Euclidean distance and choose k to be 75 for all of the models as suggested by Papernot & McDaniel. We use an exact search as we require an accurate ordering and exact distances for a relatively large k. We explore two different settings where this scheme can be used: (1) as a heuristic defense (Sections 4 and 5), and (2) as a certifiable defense (Section 6).
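The pipeline described above can be sketched in a few lines. This is a minimal illustration only: it uses a brute-force exact search from the Python standard library instead of Faiss, and `toy_extractor`, `build_index`, and `knn_predict` are hypothetical names standing in for the frozen network and the search index.

```python
import math
from collections import Counter

def toy_extractor(x):
    # Hypothetical stand-in for the frozen network up to the chosen layer.
    return [x[0] + x[1], x[0] - x[1]]

def build_index(train_inputs, extractor):
    # Pass every training sample through the extractor and store the features.
    return [extractor(x) for x in train_inputs]

def knn_predict(index, train_labels, query, extractor, k=75):
    # Exact search: rank all stored representations by Euclidean distance.
    q = extractor(query)
    ranked = sorted(range(len(index)), key=lambda i: math.dist(index[i], q))
    neighbors = ranked[:k]
    # Majority vote over the labels of the k nearest representations.
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0], neighbors

train_inputs = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]
index = build_index(train_inputs, toy_extractor)
label, neighbors = knn_predict(index, train_labels, (0.2, 0.1), toy_extractor, k=3)
```

Returning the neighbor indices alongside the label mirrors the description above, where the kNN query yields indices into the stored training representations.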

4 Heuristic Defense on MNIST

4.1 Experimental Setup

We evaluate the robustness and the accuracy of our defense on representations learned by seven different models as well as the adversarially trained version of some of the models. The seven models include four supervised models: (1) vanilla CNN (Basic Network), (2) network trained with soft nearest neighbor loss (SNN loss) [Frosst, Papernot, and Hinton2019], (3) network trained with input mixup [Zhang et al.2018], and (4) network trained with manifold mixup [Verma et al.2019]. The other three are unsupervised models: (5) autoencoder [Ballard1987], (6) VAE [Kingma and Welling2014], and (7) rotation prediction [Gidaris, Singh, and Komodakis2018]. We experiment with a wide variety of models since we are interested in the robustness of the features learned by networks trained on various objectives.

Evaluating the SNN loss is a natural choice as it appears to improve the original DkNN [Frosst, Papernot, and Hinton2019]. We experiment with a few choices of the constant that weights the SNN loss term (both positive and negative) and report only the best one, which encourages disentanglement between samples of different classes. For the mixup loss, the network is trained on a linear interpolation of a pair of inputs (in the input space and/or the feature space) and their labels. The authors argue that it improves the interpolation behavior of the representation, and it has shown some improvement in adversarial robustness [Zhang et al.2018, Verma et al.2019]. Here, the manifold mixup model uses mixup on all the layers including the input.
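As a small illustration of the interpolation used by mixup, the sketch below mixes one pair of inputs and their one-hot labels with the same coefficient. The function name `mixup_pair` is hypothetical; drawing the coefficient from a Beta(alpha, alpha) distribution follows the usual mixup recipe, and the value of `alpha` here is only a placeholder, not a setting taken from this paper.

```python
import random

def mixup_pair(x1, y1, x2, y2, lam):
    # Convex combination of the two inputs and of their one-hot labels.
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Deterministic example with a fixed mixing coefficient.
mixed_x, mixed_y = mixup_pair([0.0, 2.0], [1.0, 0.0], [4.0, 6.0], [0.0, 1.0], 0.25)

# In training, the coefficient is typically sampled per batch.
alpha = 1.0  # placeholder value (assumption)
lam = random.betavariate(alpha, alpha)
```

For manifold mixup the same interpolation would be applied to hidden activations at a chosen layer rather than to the raw inputs.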

In addition to supervised models, we are interested in unsupervised learning as it is not well studied in the context of adversarial examples. With no access to labels, unsupervised models may learn very different sets of features from the supervised ones, which only rely on features that help discriminate labels of the input. Arguably, representations learned by unsupervised methods such as generative models may contain more information about the input as they have to reconstruct it accurately rather than simply predict its class. The redundancy or the extra information may provide more robustness against adversarial examples, which usually rely on making small changes to some parts of the input.

For the unsupervised methods, the autoencoder is trained with MSE reconstruction loss in the pixel space. The VAE is trained by optimizing the ELBO, and the output is treated as Bernoulli random variables [Kingma and Welling2014]. The rotation prediction is a self-supervised method that trains a model to recognize the orientation of the input [Gidaris, Singh, and Komodakis2018]. Here, we use four rotations (0, 90, 180, and 270 degrees). Surprisingly, the rotation network we train predicts the rotation with an accuracy of 99.29% on the test set, even though the rotations of some digits, such as ‘1’, ‘6’, ‘8’, and ‘9’, are difficult to distinguish. We suspect that there are some subtle clues the network is able to pick up and rely on to make the prediction.
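The rotation pretext task can be illustrated by how its training pairs are built: each image yields four copies, rotated by 0, 90, 180, and 270 degrees, labeled with the rotation index. This is a sketch on plain nested lists; `rot90` and `make_rotation_examples` are hypothetical helper names.

```python
def rot90(img):
    # Rotate a 2-D image (list of rows) by 90 degrees counter-clockwise.
    return [list(row) for row in zip(*img)][::-1]

def make_rotation_examples(img):
    # Return (rotated_image, rotation_label) pairs for the four rotations.
    examples = []
    current = [list(row) for row in img]
    for label in range(4):
        examples.append((current, label))
        current = rot90(current)
    return examples

examples = make_rotation_examples([[1, 2], [3, 4]])
```

The network is then trained to predict the label (0 to 3) from the rotated image, with no access to the digit class.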

For fair comparisons, we reimplement all the models, both supervised and unsupervised, with the same architecture used in Papernot & McDaniel. We also experiment with different model-specific hyperparameters such as a constant that balances two loss functions in SNN loss and Mixup techniques, but we only report the ones that yield the best robustness with comparable accuracy. For all of the models, we take the representation from the third convolutional layer as we find that it offers a good balance between clean accuracy and robustness.

4.2 Evaluations

Finding an optimal attack on kNN is intractable for a large k. Therefore, we rely on a heuristic from Sitawarin & Wagner [Sitawarin and Wagner2019] to find the minimal l2-perturbation. The attack approximates kNN as a differentiable function and solves the resulting optimization problem using gradient descent. We modify the original code slightly to work with Euclidean distance and remove the threshold function as we do not observe any noticeable difference without it. We refer the readers to the original paper for more details on the attack and to Appendix A for our hyperparameters of the attack.

4.3 Main Results

Models           Clean Acc.   Mean l2-norm   Acc. at   Acc. at   Acc. at
Basic Network    0.9878       1.4684         0.7820    0.1363    0.0098
SNN Loss         0.9919       1.2529         0.6747    0.0619    0.0035
Input Mixup      0.9887       0.6051         0.0295    0.0087    0.0087
Manifold Mixup   0.9917       0.9749         0.3942    0.0042    0.0018
Table 1: Robustness and clean accuracy of all the networks trained on MNIST (without kNN), excluding the ones that involve adversarial training. Here, the adversarial examples are generated by CW attack (500 iterations, 10 binary searches, initial of 10, learning rate of 0.1).
Models                        Clean Acc.   Mean l2-norm   Acc. at   Acc. at   Acc. at
ABS [Schott et al.2019]       0.990        2.3            -         -         -
L2NNN [Qian and Wegman2019]   0.982        -              -         -         0.244
kNN ()                        0.9457       3.1389         0.8907    0.7675    0.5579
Basic Network                 0.9870       2.1054         0.9103    0.5014    0.1144
SNN Loss                      0.9940       1.7131         0.8558    0.2661    0.0547
Input Mixup                   0.9812       2.0682         0.8133    0.5007    0.1729
Manifold Mixup                0.9838       2.2276         0.9020    0.5272    0.1790
Autoencoder                   0.9509       3.0717         0.8855    0.7449    0.5336
VAE                           0.9680       2.0859         0.8201    0.4876    0.1674
Rotation                      0.9129       2.2157         0.7532    0.4830    0.1990
Table 2: Robustness and clean accuracy of the kNN on representations learned by all of the networks trained on MNIST, except for those that involve adversarial training (middle: supervised models, bottom: unsupervised models). The top two rows show the results from state-of-the-art defenses, taken directly from the original papers.

Table 2 displays the main results, comparing the robustness and accuracy of all seven models. The mean is the average l2-norm of the perturbation required to change the classification of the kNN over the entire test set. A larger norm suggests that the representation is less sensitive to adversarial perturbation and potentially contains robust features. The robustness and clean accuracy of the same models but without the kNN are also included in Table 1 for comparison. Notably, the kNN appears to increase the robustness of all the supervised models at the cost of a small drop in clean accuracy.

On average, kNN on supervised representations has a higher clean accuracy than kNN on unsupervised features. This is expected because supervised models, which utilize the label signal, are trained for a discriminative task, and so samples from the same class have to activate a similar set of features. This is not necessarily the case for unsupervised tasks as the representations can be more complex and are not necessarily clustered by class.

The model trained with SNN loss achieves the highest accuracy since the loss encourages clustering of samples in the same class, but it also has the lowest robustness, potentially because the clustering-focused mapping makes the network particularly sensitive. The kNN on the pixel space is the most robust but also the least accurate, suggesting a recurring trend of the trade-off between clean accuracy and robustness. Compared to the basic model, the two mixup models do not yield any substantial change in robustness.

Perhaps surprisingly, features learned by the autoencoder appear to be the most robust among the networks, surpassing all of the supervised models. Since the VAE exhibits much less robustness, it is unlikely that the bottleneck architecture plays an important role. Although having a much lower accuracy, the plain kNN as well as the kNN on the autoencoder are more robust than the state-of-the-art defenses, ABS and L2NNN, by a significant margin.

It is likely that the attack does not find adversarial examples with the smallest perturbation, but this is the only attack, to the best of our knowledge, that reliably and efficiently finds adversarial examples with close to a 100% success rate. In the next section, we improve the robustness of some of the models further by combining them with adversarial training and evaluate their robustness in more detail.

4.4 kNN on Adversarially Trained Models

Models            Clean Acc.   Mean l2-norm   Acc. at   Acc. at   Acc. at
l2-Adv (no kNN)   0.9470       2.9060         0.8918    0.7575    0.4764
l∞-Adv            0.9653       2.0*           -         -         -
l2-Adv            0.9726       3.0378         0.9387    0.8027    0.5095
l2-Adv-Rot        0.9716       2.9973         0.9658    0.8302    0.4571
l2-Adv-AE         0.9641       3.0198         0.9323    0.8112    0.4899
Table 3: Robustness and clean accuracy of the networks trained with adversarial training. (*) The attack we use struggles to find adversarial examples for l∞-Adv, most likely due to the gradient obfuscation problem. With a boundary-based attack, we manage to find a much smaller adversarial perturbation of about 2.0 averaged over the first 100 samples in the test set.
Figure 2: A comparison between the l2-norm of the adversarial perturbation generated by the gradient attack [Sitawarin and Wagner2019] and by the boundary attack [Brendel, Rauber, and Bethge2018] on each of the first hundred samples in the test set (Left: autoencoder, Right: l2-Adv). The boundary attack is run twice with two sets of hyperparameters; the better of the two is chosen. The red line indicates y = x, i.e. the points where the two attacks find adversarial perturbations with the same norm.
Figure 3: Adversarial examples on different models on MNIST (left) and CIFAR-10 (right). Adversarial examples on models without the kNN part are generated by CW attack, and those with kNN are generated by the gradient attack. On the CIFAR-10 models, we also show the adversarial perturbation scaled to the range . We omit adversarial examples on ResNet (both with and without kNN) since the perturbation is essentially imperceptible.

Adversarial retraining is one of a few methods which effectively improve adversarial robustness and has been shown to encourage networks to learn more robust features [Madry et al.2017, Athalye, Carlini, and Wagner2018]. We pick a subset of the networks in Section 4.3 and adversarially train them with the original objective. For the l∞-adversarial training, we use 40 steps of size 0.01 with a maximum norm of 0.3, and 40 steps of size 0.1 for the l2 version, except for the rotation model which uses 20 steps of size 0.05. The results are reported in Table 3.
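For readers unfamiliar with the iterative procedure behind adversarial training, one step of an l2 projected-gradient attack can be sketched as follows. This is a generic illustration, not the authors' training code: the function names are hypothetical, the gradient of the loss is assumed to be supplied by the caller, and the radius `eps` is a free parameter because the l2 budget is not stated in the text above. Clipping to the valid pixel range is omitted for brevity.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def project_l2(delta, eps):
    # Project a perturbation back onto the l2 ball of radius eps.
    n = l2_norm(delta)
    if n <= eps:
        return list(delta)
    return [d * eps / n for d in delta]

def pgd_l2_step(x, x_adv, grad, step_size, eps):
    # Move along the normalized loss gradient, then project around x.
    g = l2_norm(grad)
    if g == 0:
        return list(x_adv)
    moved = [xa + step_size * gi / g for xa, gi in zip(x_adv, grad)]
    delta = project_l2([m - xi for m, xi in zip(moved, x)], eps)
    return [xi + d for xi, d in zip(x, delta)]

# One step of size 0.1 from the clean point, with a placeholder budget of 1.0.
out = pgd_l2_step([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], step_size=0.1, eps=1.0)
```

Repeating this step the stated number of times (for example 40 steps of size 0.1) and training on the resulting points gives the usual adversarial training loop.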

Initially, l∞-Adv appears very robust to the attack. We suspect that this is caused by the gradient obfuscation problem, where a gradient-based attack fails to find good adversarial examples [Athalye, Carlini, and Wagner2018]. As a result, we evaluate some of the models with a boundary-based attack which only relies on hard labels [Brendel, Rauber, and Bethge2018]. We use the official implementation in Foolbox [Rauber, Brendel, and Bethge2017]. Still, the attack is extremely slow, so we only manage to run the attack on the first hundred samples in the test set. The boundary attack does not always succeed, but we find that the mean perturbation norm is about 2.0 on l∞-Adv, which is much smaller than the ones found by the gradient attack. Nonetheless, the other models do not appear to have the same problem as the two attacks find adversarial examples of a comparable size. Figure 2 shows plots comparing the perturbation norm of the two attacks on two models. The gradient attack strictly performs better than the boundary attack on the autoencoder and roughly the same on average on l2-Adv.

All of the models, except for l∞-Adv, exhibit similarly strong robustness. l2-Adv is significantly more robust than its original version (Basic Network with kNN) at the expense of clean accuracy. Interestingly, with the kNN, both of the unsupervised models, l2-Adv-Rot and l2-Adv-AE, achieve higher accuracy with similar or improved robustness. This is unexpected as the unsupervised models still do not have access to the labels. Also, it is well known that supervised adversarial training generally reduces accuracy on benign samples.

We suspect that the adversarial training forces the model to learn robust features which are also more likely shared between samples from the same class. Consequently, they cluster more in the representation space of the adversarially trained models, increasing both the robustness and the accuracy of the kNN. On the other hand, for example, the rotation model without adversarial training predicts the rotation with a surprisingly high accuracy (where it should not as we mentioned in Section 4.1), suggesting that it potentially learns trivial and non-robust features. As a result, it has poor robustness as reported in Section 4.3.

It is also important to note that l2-Adv with the kNN has higher clean accuracy than l2-Adv without one. This contradicts the trend on the normally trained models, which have lower accuracy when combined with kNN. This observation shows that kNN does not always reduce accuracy on benign samples and helps support our intuition in Section 3 that kNN allows access to a better robustness-accuracy trade-off curve than relying on adversarial training alone. We hypothesize that adversarially trained models (with a sufficiently large perturbation budget) suffer from a limited class of functions they can represent, as high sensitivity is heavily penalized by the adversarial loss. So in some cases such as this one, kNN can simultaneously improve the robustness and the clean accuracy.

4.5 Analysis of Robustness via Local Intrinsic Dimension and Sensitivity

Model         LID     Norm of Jacobian
Input         12.87   1*
Basic Model   8.51    26.24
SNN Loss      7.00    37.83
VAE           9.38    2.12
Autoencoder   14.48   1.47
l∞-Adv        6.58    9.45
l2-Adv        6.47    1.76
l2-Adv-AE     9.74    0.29
Table 4: LID and spectral norm of the Jacobian matrix from the input to the features of the third layer of the networks. Similarly to previous works, we use the Maximum Likelihood Estimator to approximate LID [Amsaleg et al.2017, Ma et al.2018]. Both LID and Jacobian norm are calculated and averaged over the first thousand samples in the test set. (*) “Input” means MNIST in the pixel space without passing through any network so, equivalently, it has a Jacobian norm of 1.

To better explain the robustness of different representations, we attempt to attribute it to two characteristics: (1) local intrinsic dimension (LID) and (2) sensitivity. First, the robustness of 1-NN has been analyzed and shown to depend on the LID of the input space [Amsaleg et al.2017]. Specifically, Amsaleg et al. show that as the LID of a given input approaches infinity, the size of the perturbation required to change its k-th neighbor into the first neighbor tends to zero. Loosely speaking, data that cluster have smaller LID.
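The Maximum Likelihood Estimator of LID commonly used in this line of work is computed from the distances r_1 <= ... <= r_k of a point to its k nearest neighbors as LID = -1 / mean(log(r_i / r_k)). A minimal sketch follows; the function name `lid_mle` and the choice of `k` are illustrative assumptions, and zero distances (the query itself) are skipped.

```python
import math

def lid_mle(query, reference, k):
    # Distances from the query to the reference set, nearest first.
    dists = sorted(math.dist(query, r) for r in reference)
    # Drop zero distances (e.g. the query itself) and keep the k nearest.
    dists = [d for d in dists if d > 0][:k]
    r_max = dists[-1]
    mean_log = sum(math.log(d / r_max) for d in dists) / len(dists)
    # Undefined (division by zero) when all k distances are identical.
    return -1.0 / mean_log

lid = lid_mle([0.0], [[1.0], [2.0], [4.0]], k=3)
```

Tightly clustered neighbors at widely varying radii give a small estimate, which matches the remark above that data that cluster have smaller LID.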

Sensitivity plays a more straightforward role in the robustness: models that are less sensitive to a small change in the input should also be more robust to adversarial examples. However, it is unclear how to measure sensitivity for our task. For simplicity, we consider local sensitivity measured by the spectral norm of the Jacobian of the representation at the layer of interest with respect to the input. Combining the two metrics, we expect that a representation with small LID and small spectral norm will be more robust than one with large LID and large spectral norm.
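The spectral norm of a Jacobian is its largest singular value, which can be approximated by power iteration on J^T J. The sketch below works on a Jacobian given as a plain list of rows; how the Jacobian itself is obtained (for instance by automatic differentiation) is outside this example, and `spectral_norm` is a hypothetical helper name.

```python
import math

def spectral_norm(J, iters=100):
    # Power iteration on J^T J; converges to the top right singular vector
    # unless the start vector happens to be orthogonal to it.
    n_rows, n_cols = len(J), len(J[0])
    v = [1.0] * n_cols
    for _ in range(iters):
        u = [sum(J[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(J[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return 0.0
        v = [x / norm for x in w]
    # The largest singular value is the length of J v for the unit vector v.
    u = [sum(J[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))

sigma = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
identity_sigma = spectral_norm([[1.0, 0.0], [0.0, 1.0]])
```

The identity case corresponds to the “Input” row of Table 4, where no network is applied and the Jacobian norm is 1.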

Table 4 shows LID and sensitivity of a subset of the representations. There are several interesting observations:

  1. Except for the VAE, the models with a small Jacobian norm (i.e. “Input,” AE, l2-Adv, l2-Adv-AE) are very robust.

  2. As expected, supervised models have smaller LID, and adversarial training reduces the sensitivity.

  3. While the LID of SNN Loss is small, the model is also very sensitive. This might explain why it is the least robust. The SNN loss, which encourages clustering of samples in the same class, leads to low LID (since the neighbors are dense) but also high sensitivity.

  4. l∞-Adv does not have a particularly small LID or Jacobian norm, suggesting that it is not as robust, which is consistent with our evaluation with the boundary-based attack.

These metrics can serve as a sanity-check for the empirical results as well as give us a better idea of what contributes to the robustness, but they are far from being an accurate measurement of the adversarial robustness. Nonetheless, with this intuition, one might find representation learning that directly encourages these components (i.e. small LID and sensitivity) to be useful in future directions.

5 Heuristic Defense on CIFAR-10

Models                        Clean Acc.   Mean l2-norm   Acc. at   Acc. at   Acc. at
L2NNN [Qian and Wegman2019]   0.772        -              -         -         0.204
ResNet (no kNN)               0.9299       0.2124         0.0332    0.0018    0.0006
l2-Adv (no kNN)               0.8045       1.2400         0.6847    0.4924    0.2705
ResNet                        0.9301       0.1429         0.0012    0.0001    0.0001
l2-Adv                        0.7945       2.2970         0.7447    0.6370    0.5151
Table 5: Robustness and clean accuracy of the two networks trained on CIFAR-10 and the kNNs on their representations. For comparison, the first row shows the accuracy of L2NNN on CIFAR-10 taken directly from the original paper [Qian and Wegman2019].

We attempt to extend our scheme to a more complex dataset like CIFAR-10. We evaluate representations of two models: a pre-activation ResNet [He et al.2016] with 20 layers and its adversarially trained version (l2-Adv). We use 8 steps of adversarial training with a step size of 0.05. Both models are trained with the Adam optimizer with a learning rate of 1e-3 and a batch size of 128. We also experimented with other unsupervised models, similarly to the MNIST experiments. However, the accuracy of kNN on most of their representations is too low to consider. For datasets larger than MNIST, the representation may need to be either trained or fine-tuned in a supervised manner in order to reach a comparable accuracy on kNN.

As shown in Table 5, adversarial training still provides a significant improvement in robustness, again with some drop in clean accuracy. When combined with kNN, the adversarially trained model becomes more robust, which aligns with the results on MNIST. However, the vanilla ResNet unexpectedly becomes less robust. The l2-Adv model has a slightly higher accuracy and is significantly more robust compared to the state-of-the-art L2NNN. This result suggests that with an appropriate choice of the representation, kNN still strengthens the network with little-to-no drop in clean accuracy even on a larger dataset. Note that we only use a relatively small ResNet-20 to demonstrate the improvement gained from kNN, while it is highly likely that larger networks will achieve even higher robustness and accuracy.

6 Certifiable Defense

Layers                                  Clean Acc.   Avg. Cert.   at 0.5   at 1     at 1.5   at 2
ABS [Schott et al.2019]                 0.99         0.69         -        -        -        -
LMT [Tsuzuku, Sato, and Sugiyama2018]   0.95         1.02         -        -        -        -
Input                                   0.9683       0.9567       0.8102   0.4059   0.0977   0.0064
relu1                                   0.9701       0.8576       0.8000   0.3574   0.0663   0.0034
relu2                                   0.9723       0.7821       0.7648   0.2677   0.0379   0.0005
relu3                                   0.9745       0.4295       0.3509   0.0013   0.0      0.0
bottleneck                              0.9752       0.3909       0.2668   0.0001   0.0      0.0
Table 6: Clean accuracy and robustness certificate provided by our scheme: 1-NN on the Lipschitz autoencoder. The first two rows include results from the related works for comparison. The last four columns show the fraction of test samples whose classification the adversary cannot change given four different budgets (0.5, 1, 1.5, and 2).

In addition to the heuristic defense, we propose a novel construction of a certifiable defense based on our scheme with some specific components: a 1-nearest neighbor (1-NN) on features from a Lipschitz network. A Lipschitz network, i.e. a network that is a Lipschitz function, is one of several frameworks that allow computation of a lower bound of the norm of the adversarial perturbation required to change the classification of a given input. In other words, a given input is robust to any perturbation with a norm smaller than the bound. In most of the previous works, to increase adversarial robustness, Lipschitz networks are trained to maximize a margin $M(x)$ between the logit of the correct class and the second largest logit, i.e.

$$M(x) = f(x)_y - \max_{i \neq y} f(x)_i$$

where $f(x)$ is the logits of input $x$ and $y$ is its correct class. Then, the lower bound of the perturbation norm required to change the classification of a given input is given as $M(x)/(\sqrt{2}L)$ for the l2-norm or $M(x)/(2L)$ for the l∞-norm, where $L$ is the Lipschitz constant [Tsuzuku, Sato, and Sugiyama2018, Huster, Chiang, and Chadha2018, Qian and Wegman2019, Anil, Lucas, and Grosse2018].

Using a similar notion, we can define the margin for 1-NN as the difference between the distance from an input $x$ to the nearest neighbor of a wrong class and the distance to the nearest neighbor of the correct class, i.e. the margin

$$m(x) = \min_{x' \in X_{j \neq y}} \|x - x'\|_2 - \min_{x' \in X_y} \|x - x'\|_2$$

where $y$ is the true label of $x$, and $X_i$ is a set of training samples from class $i$ (so $X_{j \neq y}$ contains the training samples of all classes other than $y$). It follows that the lower bound of the l2-norm of the perturbation needed to change the classification of 1-NN is $m(x)/2$, assuming that $m(x) > 0$ (otherwise, $x$ is already misclassified). Note that the bound is very simplistic and only tight when $x$ is a convex combination of its nearest correct-class neighbor and its nearest wrong-class neighbor. In fact, an optimal perturbation for 1-NN (minimal Euclidean distance from $x$ to the edge of the nearest Voronoi cell of a different class) can be computed by solving a quadratic program. Nonetheless, it is still expensive to solve for a large number of constraints. While we believe that $k > 1$ provides more robustness, it is not clear how to find the Voronoi cells efficiently or to provide the bound for a large $k$. We leave this direction to future works.
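The 1-NN certificate described above (half the gap between the distance to the nearest wrong-class sample and the distance to the nearest correct-class sample) is straightforward to compute. The sketch below is illustrative only; `one_nn_certificate` is a hypothetical name, and returning 0 for an already misclassified input is a convention chosen here.

```python
import math

def one_nn_certificate(x, y, train_x, train_y):
    # Distance to the nearest training sample of the true class y.
    d_same = min(math.dist(x, t) for t, l in zip(train_x, train_y) if l == y)
    # Distance to the nearest training sample of any other class.
    d_other = min(math.dist(x, t) for t, l in zip(train_x, train_y) if l != y)
    margin = d_other - d_same
    # Certified l2 radius: half the margin, or 0 if x is already misclassified.
    return max(margin, 0.0) / 2.0

train_x = [[1.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
train_y = [0, 1, 1]
bound = one_nn_certificate([0.0, 0.0], 0, train_x, train_y)
```

Here the nearest correct-class sample is at distance 1 and the nearest wrong-class sample at distance 3, so no perturbation with an l2-norm below 1 can change the 1-NN prediction.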

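Now returning to our scheme, we consider the case where the input to the 1-NN is a representation from a Lipschitz network with a Lipschitz constant of 1 in the $\ell_2$-norm. The same bound, in fact, still applies to the $\ell_2$-norm of the perturbation $\delta$ in the pixel space: $\|\delta\|_2 \geq M_{NN}(x)/2$, but $M_{NN}$ is now defined on the representation space instead of the input space: $M_{NN}(x) = \min_{i \neq y} \min_{x' \in X_i} \|f_l(x) - f_l(x')\|_2 - \min_{x' \in X_y} \|f_l(x) - f_l(x')\|_2$, where $f_l(x)$ is the output at the $l$-th layer of the network given input $x$, and $y$ and $X_i$ are as above. We refer to Appendix B for the proof.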

6.1 Experimental Setup

To train a Lipschitz network, we use the same method as Qian and Wegman [Qian and Wegman 2019]. We implement an autoencoder that has the same architecture as the one in Section 4 but with the encoder now being a Lipschitz network. We choose the autoencoder since it performs well as a heuristic defense and is compatible with our loss function below. Through a number of experiments, we find that training the autoencoder with the MSE reconstruction loss and an additional regularization on the bottleneck layer that directly maximizes the distance between samples from different classes consistently yields good clean accuracy and robustness on MNIST. The loss function of our network can be written as:

where $\hat{x}$ is the output of the autoencoder, and $z$ is the output at the bottleneck layer. $B$ is the set of samples in the same batch as $x$ that have a different label, and $\tau$ is a threshold. The reconstruction loss (first term) is needed to cluster samples from the same class together, which affects the accuracy, while the regularization (second term) encourages the representations of samples from different classes to be some distance apart, up to the threshold.
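To make the two terms concrete, here is a minimal NumPy sketch of a loss of this shape. The exact form is an assumption: it uses a hinge penalty that is active only when two differently labeled samples are closer than the threshold, a weight `lam`, and an average over pairs, none of which are confirmed by the text above. The name `batch_loss` is hypothetical.

```python
import numpy as np

def batch_loss(x, x_hat, z, labels, tau, lam=1.0):
    """MSE reconstruction plus a hinge penalty on bottleneck distances
    between samples with different labels (assumed form, see lead-in)."""
    x, x_hat, z = (np.asarray(a, dtype=float) for a in (x, x_hat, z))
    labels = np.asarray(labels)

    # First term: squared reconstruction error, averaged over the batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))

    # Pairwise Euclidean distances between bottleneck representations.
    diff = z[:, None, :] - z[None, :, :]
    dist = np.linalg.norm(diff, axis=2)

    # Second term: only pairs with different labels are penalized, and only
    # while they are closer than tau; beyond tau the penalty is zero.
    different = labels[:, None] != labels[None, :]
    hinge = np.maximum(0.0, tau - dist) * different
    n_pairs = max(int(different.sum()), 1)
    reg = hinge.sum() / n_pairs

    return recon + lam * reg

# Two samples with perfect reconstruction and bottleneck distance 3.
x = np.zeros((2, 4))
z = np.array([[0.0, 0.0], [3.0, 0.0]])
print(batch_loss(x, x, z, labels=[0, 1], tau=5.0))  # pair closer than tau
print(batch_loss(x, x, z, labels=[0, 1], tau=2.0))  # pair already far enough
```

Capping the penalty at the threshold keeps the regularizer from pushing classes apart without limit, which would otherwise compete with the reconstruction term.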

For each of the test samples $x$, we compute the margin $M_{NN}(x)$ and the lower bound of the perturbation norm, $M_{NN}(x)/2$, using different layers of the encoder.

6.2 Results

We report clean accuracy and the robustness certificates provided by our scheme on each of the layers of the network (Table 6). Layers closer to the input have lower accuracy but a larger lower bound. The “Input” row in the table is simply a 1-NN on the input space, which provides a very large mean lower bound of 1.8 while achieving reasonable accuracy. The average lower bound is nontrivial, considering that the mean $\ell_2$-norm of the perturbation used to fool networks without a defense is around 1.5.

In comparison with related works, our scheme achieves higher accuracy with a slightly smaller bound than LMT [Tsuzuku, Sato, and Sugiyama 2018] and a larger bound with lower accuracy compared to ABS [Schott et al. 2019]. Using the first layer, we can guarantee that the classification of 80% of the test samples cannot be changed by any perturbation with an $\ell_2$-norm less than 0.5. Nonetheless, on the bottleneck layer, we can only certify up to 27%. The bound becomes smaller for the deeper layers because our network is not perfectly $\ell_2$-norm preserving. Consequently, the distance to the nearest samples of a wrong class, as well as the margin, diminishes as the input is passed through the network.
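A statement such as "a given fraction of test samples is certified at radius 0.5" can be computed from per-sample margins as in the sketch below. The margins listed are made-up illustrative values, not results from our experiments, and `certified_fraction` is a hypothetical helper name.

```python
import numpy as np

def certified_fraction(margins, radius):
    """Fraction of samples that are correctly classified (margin > 0) and
    whose certificate margin / 2 is at least `radius`."""
    margins = np.asarray(margins, dtype=float)
    certified = (margins > 0) & (margins / 2.0 >= radius)
    return float(np.mean(certified))

# Illustrative margins for five test samples; the negative one is misclassified.
margins = [2.0, 1.0, 0.4, -0.3, 1.6]
print(certified_fraction(margins, radius=0.5))
```

Sweeping `radius` over a grid yields a certified-accuracy curve, which makes the trade-off between layers easier to compare than a single mean bound.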

We also attack our certifiable defense (relu1) with the heuristic attack and obtain an average perturbation norm of 1.82, which is twice as large as the average lower bound given by the 1-NN. The gap is due to both the looseness of our lower bound and the fact that the heuristic attack does not find the optimal perturbation. Nevertheless, this suggests that our scheme already produces a nontrivial certificate using only a naive lower bound.

Due to the flexibility of the scheme, by using a different choice of networks and hyperparameters, one can easily obtain a model with a higher accuracy but a smaller bound or vice versa. Nonetheless, a better construction of Lipschitz and $\ell_2$-norm preserving networks is needed for our scheme to scale to larger datasets.

7 Conclusion

We propose a scheme combining kNN and robust representation learning as a defense against adversarial examples on MNIST and CIFAR-10. Our method pushes the accuracy-robustness trade-off curve further and is straightforward to construct. It is a general and flexible framework that provides clear directions to improve upon. Namely, one can improve it by learning more robust representations or by finding an adversarially robust variant of kNN. The two problems can be pursued independently, and we believe that they are both interesting on their own. We hope that our scheme will also inspire future architecture designs for robust neural networks that implicitly or explicitly impose a similarity search.


We would like to thank Nicolas Papernot for constructive feedback and insights on DkNN.


Appendix A Gradient-Based Attack on kNN

In summary, the attack operates by adding a perturbation $\delta$ to the input $x$ such that its representation, $f(x + \delta)$, moves closer to the representations of a group of nearest training instances from a different class. This heuristic can be formulated as a constrained optimization problem as follows.

The box constraint ensures that the perturbed input lies in the feasible input region, which in this case is between 0 and 1 for pixel values. The optimization can be formulated as a Lagrangian, so we can binary search for the Lagrangian constant that yields the minimal perturbation. The optimization is solved with the Adam optimizer.
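The outer binary search over the Lagrangian constant can be sketched as below. This is a generic skeleton under stated assumptions, not our exact implementation: `run_attack` is a hypothetical callable that runs the inner optimizer for a fixed constant and reports success and the perturbation norm, and a larger constant is assumed to weight the attack objective more heavily. The toy attack at the end only serves to exercise the search.

```python
def binary_search_constant(run_attack, c_init=1e-3, steps=10):
    """Search for the constant that gives the smallest successful perturbation.

    run_attack(c) -> (success: bool, perturbation_norm: float)
    """
    lo, hi = 0.0, None          # hi stays None until the first success
    c = c_init
    best_c, best_norm = None, float("inf")
    for _ in range(steps):
        success, norm = run_attack(c)
        if success:
            if norm < best_norm:
                best_c, best_norm = c, norm
            hi = c                      # attack works: try a smaller constant
            c = (lo + hi) / 2.0
        else:
            lo = c                      # attack fails: a larger constant is needed
            c = c * 10.0 if hi is None else (lo + hi) / 2.0
    return best_c, best_norm

def toy_attack(c):
    """Stand-in for the inner optimizer: succeeds once c reaches 0.05, and
    the perturbation norm grows with c."""
    return c >= 0.05, 1.0 + c

best_c, best_norm = binary_search_constant(toy_attack, c_init=1e-3, steps=10)
print(best_c, best_norm)
```

Growing the constant tenfold until the first success, then bisecting, keeps the number of inner optimization runs equal to the number of search steps.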

For most of the models on MNIST, we use , the initial Lagrangian constant of 1e-3, a learning rate of 1e-1, a maximum of 500 iterations, and 10 binary search steps. For adversarially trained models or more robust models like the autoencoder, we have to increase the initial constant to 1e-1 and the number of iterations to 1000 in order to achieve a near-100% attack success rate. For CIFAR-10, the initial constant is set to 1e-5 for the vanilla ResNet and to 1e-3 for the adversarially trained model. We use a learning rate of 1e-2, 500 iterations, and again, 10 binary search steps.

Appendix B Proof on Certifiable Defense

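Here, we refer to Lipschitz networks as neural networks that are Lipschitz functions with a constant $L$ in some $\ell_p$-norm. More precisely, let $f: \mathbb{R}^n \to \mathbb{R}^m$ be the neural network function that maps inputs to some representation (e.g. logits or intermediate layers) with dimension $m$. Now $f$ is $L$-Lipschitz if

$$\|f(x_1) - f(x_2)\|_p \leq L \|x_1 - x_2\|_p \quad \forall x_1, x_2 \in \mathbb{R}^n. \tag{2}$$

Equation 2 can be rearranged by replacing $x_1$ with an input $x$ and $x_2$ with $x + \delta$, where $\delta$ is a perturbation:

$$\|f(x + \delta) - f(x)\|_p \leq L \|\delta\|_p. \tag{3}$$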

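Now we consider a particular case where $p = 2$ and $L = 1$, and define $z$ as $f(x)$. Let $d_i$ be the distance function that maps a representation $z$ to the distance between $z$ and $z_i = f(x_i)$, where $x_i$ are the training samples, i.e.

$$d_i(z) = \|z - z_i\|_2.$$

We want to show a simple bound on $|d_i(f(x + \delta)) - d_i(f(x))|$. By the reverse triangle inequality,

$$|d_i(f(x + \delta)) - d_i(f(x))| = \big|\, \|f(x + \delta) - z_i\|_2 - \|f(x) - z_i\|_2 \,\big| \leq \|f(x + \delta) - f(x)\|_2.$$

Since from Equation 3, we know that $\|f(x + \delta) - f(x)\|_2 \leq \|\delta\|_2$, we get the bound:

$$|d_i(f(x + \delta)) - d_i(f(x))| \leq \|\delta\|_2.$$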

For 1-NN, the classification of $x$ is simply the class of $x_j$ where $j = \arg\min_i d_i(z)$. So in order to change the classification of $x$, one must reduce the margin $M_{NN}(x) = \min_{i \in I} d_i(z) - d_j(z)$ to zero, where $I$ is the set of indices of training samples that are not from the same class as $x$. In the worst case, an adversary can do so by decreasing $\min_{i \in I} d_i(z)$ by $\|\delta\|_2$ and increasing $d_j(z)$ by $\|\delta\|_2$ simultaneously, which can only be achieved when $z$ is a convex combination of $z_j$ and the nearest neighbor of a different class. This implies that no $\delta$ with $\|\delta\|_2 < M_{NN}(x)/2$ can change the class of $x$.

The other way to arrive at this bound is to notice that $M_{NN}$ is 2-Lipschitz with respect to $x$, since it is the difference of two 1-Lipschitz functions. So similarly, we have $|M_{NN}(x + \delta) - M_{NN}(x)| \leq 2\|\delta\|_2$. To summarize, our lower bound of the perturbation required to change the prediction of a 1-NN classifier is

$$\|\delta\|_2 \geq \frac{M_{NN}(x)}{2} = \frac{1}{2} \Big( \min_{i \in I} d_i(f(x)) - d_j(f(x)) \Big).$$

Or in other words, there exists no $\delta$ with $\|\delta\|_2 < M_{NN}(x)/2$ such that $x$ and $x + \delta$ are classified to different labels. This lower bound is similar to, but not the same as, the margin in logits for the Lipschitz networks proposed by previous works. We believe that Lipschitz networks can learn a larger margin while achieving a high accuracy with 1-NN. We omit the case where $k > 1$ here as the bound is much more complicated to compute and potentially looser.
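The argument can be checked numerically with a small sketch. The feature map below is a hypothetical stand-in for a Lipschitz network: a random linear layer rescaled to spectral norm 1 followed by ReLU, which is 1-Lipschitz in the l2-norm. The margin here is computed relative to the predicted label rather than the true label, and all data are random. Perturbations drawn strictly inside the certified radius should never change the 1-NN prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-Lipschitz feature map: linear layer with spectral norm 1,
# followed by ReLU (itself 1-Lipschitz).
W = rng.normal(size=(8, 5))
W /= np.linalg.norm(W, ord=2)

def features(x):
    return np.maximum(x @ W.T, 0.0)

def predict_and_margin(x, train_z, train_y):
    """1-NN label on features, and the margin to the nearest sample
    carrying a different label."""
    d = np.linalg.norm(train_z - features(x), axis=1)
    j = int(np.argmin(d))
    margin = d[train_y != train_y[j]].min() - d[j]
    return train_y[j], margin

train_x = rng.normal(size=(40, 5))
train_y = np.arange(40) % 2          # two classes, both guaranteed present
train_z = features(train_x)

x = rng.normal(size=5)
label, margin = predict_and_margin(x, train_z, train_y)
radius = margin / 2.0                # certified l2 radius in input space

flips = 0
for _ in range(1000):
    delta = rng.normal(size=5)
    # Rescale so that the norm of delta is strictly below the radius.
    delta *= rng.uniform(0.0, 0.999) * radius / np.linalg.norm(delta)
    new_label, _ = predict_and_margin(x + delta, train_z, train_y)
    flips += int(new_label != label)

print(radius, flips)
```

Random sampling cannot prove the bound, but a single flip inside the radius would indicate that the feature map is not actually 1-Lipschitz or that the margin was computed incorrectly.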