1 Introduction
Preprint.

Given adequate data and compute power, neural networks have demonstrated their potential to surpass human-level performance on various benchmarks such as image classification [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2014], playing complex games [Silver et al.2017b, Silver et al.2017a, Mnih et al.2013], controlling driverless vehicles [Chen et al.2015, Bojarski et al.2016], and medical imaging [Litjens et al.2017]. Nonetheless, it is well known that neural networks and other machine learning classifiers still have a number of flaws, one of which is their excessive sensitivity to small perturbations (i.e., adversarial examples) [Biggio et al.2013, Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2015, MoosaviDezfooli, Fawzi, and Frossard2015, Nguyen, Yosinski, and Clune2015].

We propose that kNN on representations learned by neural networks can serve as a simple yet strong defense against adversarial examples, surpassing state-of-the-art defenses in an $\ell_2$-norm setting on MNIST and CIFAR-10. Our model is illustrated in Figure 1. On MNIST, our best model requires an average perturbation size of 3.07 in order to reduce the accuracy to zero, surpassing the state of the art by 0.77 (but with a modest drop of 1.7% in clean accuracy). Our best model on CIFAR-10 also outperforms the state of the art by a large margin, increasing the mean adversarial perturbation to 2.30. Our results also suggest that replacing the final linear fully-connected layer in a network with a kNN consistently results in a more robust classifier.
In fact, there is evidence for the robustness of kNN in adversarial settings from both theoretical perspectives [Wang, Jha, and Chaudhuri2018, Khoury and HadfieldMenell2019] and empirical analyses [Papernot, McDaniel, and Goodfellow2016, Papernot and McDaniel2018, Schott et al.2019, Dubey et al.2019]. Despite its potential, kNN is known to perform poorly on high-dimensional data such as most real-world image datasets. Therefore, by doing the neighbor search on representations learned by neural nets, we hope to obtain the robustness benefit of kNN as well as the flexibility and the performance of neural nets. We explore different choices of networks to use as the feature extractor across various models and training methods.
Furthermore, we demonstrate that a 1NN on intermediate outputs of a Lipschitz network can serve as a certifiable defense, for which a lower bound on the $\ell_2$ norm of the perturbation required to change the classification of a given input can be computed. With accuracy similar to that of the heuristic defense, our certifiable defense provides a reasonable lower bound (0.86), comparable to those of previous works. It can certify that the classification of 80% of the test samples cannot be changed by a perturbation with an $\ell_2$ norm of 0.5 or less, or 36% with 1 or less. We hope that our scheme suggests the possibility of defending against adversarial examples with a form of similarity search and sheds some light on architecture designs for robust neural networks. The code we use for all of the experiments can be found at https://github.com/chawins/knndefense.

2 Background and Related Work
2.1 Adversarial Examples
Adversarial examples are a type of evasion attack against machine learning models. Most adversarial examples on deep neural networks are generated by adding a very small perturbation to legitimate samples [Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2015]. Previous works propose algorithms for finding such a perturbation within some norm ball, which can be formulated as solving the following optimization problem:

$$\max_{\delta \,:\, \|\delta\|_p \le \epsilon} \mathcal{L}(x + \delta) \tag{1}$$

where $\mathcal{L}$ is some loss function associated with the correct prediction of a clean sample $x$ by the target neural network, $\delta$ is the perturbation, and $\epsilon$ is the radius of the norm ball. The norm constraint is treated as a proxy for the imperceptibility of the noise.

2.2 Robustness of k-Nearest Neighbors
The kNN classifier is a popular nonparametric classifier that predicts the label of an input by finding its $k$ nearest neighbors under some distance metric, such as Euclidean or cosine distance, and taking a majority vote over the labels of the neighbors. While kNN has been widely used and well studied for a long time, it has barely been investigated in adversarial settings.
To the best of our knowledge, only the following four works directly study the adversarial robustness of kNN. Amsaleg et al. prove that, under certain assumptions, the robustness of kNN is correlated with the intrinsic dimension of the data. Wang et al. provide a lower bound on the value of $k$ required for the robustness of kNN to approach that of the Bayes optimal classifier [Wang, Jha, and Chaudhuri2018]. Sitawarin & Wagner propose an attack on kNN formulated as an optimization problem. Most recently, Khoury and HadfieldMenell claim, under certain assumptions, that kNN is naturally robust because the Voronoi cells extend in directions orthogonal to the data manifold, which are believed to be exploited by adversarial examples.
While the previous works provide some evidence suggesting the robustness of plain kNN under the $\ell_2$ metric, the main drawbacks of kNN are its scalability and its low accuracy on most real-world datasets. The scalability problem is well studied and can be addressed with approximate algorithms or optimized data structures. In this work, we suggest that a kNN on representations of neural networks, as opposed to directly on the input, can achieve relatively high accuracy while maintaining its robustness property.
2.3 Deep k-Nearest Neighbors
DkNN, proposed by Papernot & McDaniel, is a scheme that can be applied to any deep learning model, offering interpretability and robustness through a nearest neighbor search in each of the deep representation layers. The scheme also demonstrates the possibility of detecting adversarial examples and out-of-distribution samples. Soft nearest neighbor (SNN) loss, a regularization proposed in the follow-up work that encourages entanglement between samples of different classes in the representation space, can increase the detection accuracy [Frosst, Papernot, and Hinton2019].

One may question the scalability of DkNN on large-scale datasets in terms of both performance and accuracy. Dubey et al. [Dubey et al.2019] apply kNN on the representation space of ImageNet models and achieve both reasonable accuracy and an improvement in adversarial robustness. They employ a fast but approximate nearest-neighbor search, which makes the system feasible on a dataset with more than one billion images. However, in their work, the kNN is not used to obtain the final prediction but only to search for the neighbors. Instead, the final output is obtained by taking a weighted average of the softmax outputs of the neighbors.
Furthermore, none of the three previous works explores representations from networks other than those trained with cross-entropy loss. In fact, our work suggests that unsupervised representations as well as adversarial training can yield higher robustness than a vanilla CNN trained with cross-entropy loss.
3 k-Nearest Neighbor on Representations
While general trends suggest that robust networks usually suffer from low clean accuracy, we believe that kNN is a simple scheme that can push this trade-off curve further. Robust networks are insensitive to small changes in the input, but consequently, they cannot express an abrupt change in the output. This makes them unable to separate two samples from different classes that are very close together, resulting in low accuracy.
On the other hand, kNN does not suffer as much from such a constraint since it can attain high accuracy as long as the distance metric in the input space is meaningful or the data exhibit some local structure. However, this assumption is not necessarily true for real-world datasets. Therefore, we rely on neural networks as feature extractors to robustly map inputs to representations that satisfy this assumption and then use a kNN to make the classification on top of the representations.
Our system is simple. After training, the weights of the network are frozen, and the network is used as a feature extractor. All the training samples are passed through the network to obtain the representations at a specified layer, which are then stored for the kNN part. At test time, features are extracted from an input and treated as a query to the kNN, which returns the indices of the $k$ nearest representations in the training set.
We build the kNN part on Faiss [Johnson, Douze, and Jégou2017], a library with a Python interface that implements many algorithms for similarity search. Unless stated otherwise, we use Euclidean distance and choose $k$ to be 75 for all of the models, as suggested by Papernot & McDaniel. We use an exact search as we require an accurate ordering and exact distances for a relatively large $k$. We explore two different settings where this scheme can be used: (1) as a heuristic defense (Sections 4 and 5), and (2) as a certifiable defense (Section 6).
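The pipeline described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch: the class name `RepresentationKNN`, the `extract` callable standing in for the frozen network, and the brute-force search are assumptions made for the example, whereas the experiments themselves use Faiss for the exact search.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean (l2) distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


class RepresentationKNN:
    """Brute-force kNN on features from a frozen feature extractor (hypothetical helper)."""

    def __init__(self, extract, k=75):
        self.extract = extract  # assumed: maps an input to a feature vector
        self.k = k
        self.features = []
        self.labels = []

    def fit(self, inputs, labels):
        # Pass all training samples through the frozen network once and store the features.
        self.features = [self.extract(x) for x in inputs]
        self.labels = list(labels)

    def neighbors(self, x):
        # Exact search: indices of the k nearest stored representations.
        query = self.extract(x)
        order = sorted(range(len(self.features)),
                       key=lambda i: euclidean(query, self.features[i]))
        return order[:self.k]

    def predict(self, x):
        # Majority vote over the labels of the k nearest neighbors.
        votes = Counter(self.labels[i] for i in self.neighbors(x))
        return votes.most_common(1)[0][0]


# Toy usage with a stand-in "network" that simply rescales its input.
toy_extract = lambda x: [0.5 * v for v in x]
model = RepresentationKNN(toy_extract, k=3)
model.fit([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], [0, 0, 0, 1, 1, 1])
```

Sorting all stored features is linear-logarithmic in the training set size per query, which is fine for a toy example but is exactly the cost that a dedicated similarity-search library is meant to reduce.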
4 Heuristic Defense on MNIST
4.1 Experimental Setup
We evaluate the robustness and the accuracy of our defense on representations learned by seven different models as well as the adversarially trained versions of some of the models. The seven models include four supervised models: (1) vanilla CNN (Basic Network), (2) network trained with soft nearest neighbor loss (SNN loss) [Frosst, Papernot, and Hinton2019], (3) network trained with input mixup [Zhang et al.2018], and (4) network trained with manifold mixup [Verma et al.2019]. The other three are unsupervised models: (5) autoencoder [Ballard1987], (6) VAE [Kingma and Welling2014], and (7) rotation prediction [Gidaris, Singh, and Komodakis2018]. We experiment with a wide variety of models since we are interested in the robustness of the features learned by networks trained on various objectives.

Evaluating the SNN loss is a natural choice as it appears to improve the original DkNN [Frosst, Papernot, and Hinton2019]. We experiment with a few choices of the SNN loss weight (both positive and negative) and report only the best one, which encourages disentanglement between samples of different classes. For the mixup loss, the network is trained with a linear interpolation of a pair of inputs (in the input space and/or the feature space) and their labels. The authors argue that it improves the interpolation behavior of the representation and show some improvement in adversarial robustness [Zhang et al.2018, Verma et al.2019]. Here, the manifold mixup model uses mixup on all the layers including the input.

In addition to supervised models, we are interested in unsupervised learning as it is not well studied in the context of adversarial examples. With no access to labels, unsupervised models may learn very different sets of features from the supervised ones, which only rely on features that help discriminate the labels of the input. Arguably, representations learned by unsupervised methods such as generative models may contain more information about the input as they have to reconstruct it accurately rather than simply predict its class. The redundancy or the extra information may provide more robustness against adversarial examples, which usually rely on making small changes to some parts of the input.
For the unsupervised methods, the autoencoder is trained with MSE reconstruction loss in the pixel space. The VAE is trained by optimizing the ELBO, and the output is treated as Bernoulli random variables [Kingma and Welling2014]. Rotation prediction is a self-supervised method that trains a model to recognize the orientation of the input [Gidaris, Singh, and Komodakis2018]. Here, we use four rotations (0, 90, 180, and 270 degrees). Surprisingly, the rotation network we train predicts the rotation with an accuracy of 99.29% on the test set, even though the rotations of some digits, such as ‘1’, ‘6’, ‘8’, and ‘9’, are difficult to distinguish. We suspect that there are some subtle clues that the network is able to pick up and rely on to make the prediction.

For fair comparisons, we reimplement all the models, both supervised and unsupervised, with the same architecture used in Papernot & McDaniel. We also experiment with different model-specific hyperparameters, such as the constant that balances the two loss functions in the SNN loss and in the mixup techniques, but we only report the ones that yield the best robustness with comparable accuracy. For all of the models, we take the representation from the third convolutional layer as we find that it offers a good balance between clean accuracy and robustness.
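To make the mixup objective mentioned earlier in this section concrete, the following sketch interpolates one pair of inputs and their one-hot labels. The function names are hypothetical, and drawing the interpolation weight from a Beta(alpha, alpha) distribution follows the original mixup paper rather than anything specified here.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Linearly interpolate two inputs and their one-hot labels with weight lam."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix


def sample_lambda(alpha, rng=random):
    # Assumption: lam ~ Beta(alpha, alpha), as in the mixup paper.
    return rng.betavariate(alpha, alpha)


# Toy usage: mix a class-0 sample with a class-1 sample using lam = 0.25.
x_mix, y_mix = mixup_pair([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], lam=0.25)
```

Input mixup applies this interpolation to the raw inputs, while manifold mixup applies the same rule to hidden representations at a chosen layer.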
4.2 Evaluations
Finding an optimal attack on kNN is intractable for a large $k$. Therefore, we rely on a heuristic from Sitawarin & Wagner [Sitawarin and Wagner2019] to find the minimal perturbation. The attack approximates kNN as a differentiable function and solves the resulting optimization problem using gradient descent. We modify the original code slightly to work with Euclidean distance and remove the threshold function as we do not observe any noticeable difference without it. We refer the readers to the original paper for more details on the attack and to Appendix A for our hyperparameters of the attack.
4.3 Main Results
Table 1: Clean accuracy and robustness of the supervised models without kNN.
Models  Clean Acc.  Mean Norm  Acc. at  Acc. at  Acc. at
Basic Network  0.9878  1.4684  0.7820  0.1363  0.0098
SNN Loss  0.9919  1.2529  0.6747  0.0619  0.0035
Input Mixup  0.9887  0.6051  0.0295  0.0087  0.0087
Manifold Mixup  0.9917  0.9749  0.3942  0.0042  0.0018

Table 2: Clean accuracy and robustness of kNN on the pixel space and on the representations of the seven models, compared with previous defenses.
Models  Clean Acc.  Mean Norm  Acc. at  Acc. at  Acc. at
ABS [Schott et al.2019]  0.990  2.3  -  -  -
L2NNN [Qian and Wegman2019]  0.982  -  -  -  0.244
kNN (pixel space)  0.9457  3.1389  0.8907  0.7675  0.5579
Basic Network  0.9870  2.1054  0.9103  0.5014  0.1144
SNN Loss  0.9940  1.7131  0.8558  0.2661  0.0547
Input Mixup  0.9812  2.0682  0.8133  0.5007  0.1729
Manifold Mixup  0.9838  2.2276  0.9020  0.5272  0.1790
Autoencoder  0.9509  3.0717  0.8855  0.7449  0.5336
VAE  0.9680  2.0859  0.8201  0.4876  0.1674
Rotation  0.9129  2.2157  0.7532  0.4830  0.1990
Table 2 displays the main results, comparing the robustness and accuracy of all seven models. The mean norm is the average $\ell_2$ norm of the perturbation required to change the classification of the kNN over the entire test set. A larger norm suggests that the representation is less sensitive to adversarial perturbation and potentially contains robust features. The robustness and clean accuracy of the same models without the kNN are also included in Table 1 for comparison. Notably, the kNN appears to increase the robustness of all the supervised models at the cost of a small drop in clean accuracy.
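As a rough illustration of how such summary statistics can be computed from per-sample attack results, consider the sketch below. The function name, the example thresholds, and the convention that an already-misclassified sample counts as a perturbation of norm zero are assumptions made for the example, not details taken from the evaluation code.

```python
def robustness_summary(correct, adv_norms, thresholds):
    """Summarize per-sample attack results.

    correct[i]   -- whether clean sample i is classified correctly
    adv_norms[i] -- l2 norm of the smallest adversarial perturbation found for sample i
    """
    n = len(correct)
    # Assumption: an already-misclassified sample needs no perturbation (norm 0).
    norms = [r if ok else 0.0 for ok, r in zip(correct, adv_norms)]
    summary = {
        "clean_acc": sum(correct) / n,
        "mean_norm": sum(norms) / n,
    }
    for eps in thresholds:
        # A sample stays correct at eps if no perturbation of norm <= eps was found.
        summary[f"acc_at_{eps}"] = sum(r > eps for r in norms) / n
    return summary


stats = robustness_summary(
    correct=[True, True, True, False],
    adv_norms=[0.5, 2.0, 3.5, 1.0],
    thresholds=[1.0, 3.0],  # hypothetical thresholds
)
```

Because the attack is heuristic, the norms it finds are upper bounds on the true minimal perturbation, so the resulting accuracies are upper bounds on the true robust accuracy.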
On average, kNN on supervised representations has a higher clean accuracy than kNN on unsupervised features. This is expected because supervised models, which utilize the label signal, are trained for a discriminative task, and so samples from the same class have to activate a similar set of features. This is not necessarily the case for unsupervised tasks as the representations can be more complex and are not necessarily clustered by class.
The model trained with SNN loss achieves the highest accuracy since the loss encourages clustering of samples in the same class, but it also has the lowest robustness, potentially because the clustering-focused mapping makes the network particularly sensitive. The kNN on the pixel space is the most robust but also the least accurate, suggesting a recurring trade-off between clean accuracy and robustness. Compared to the basic model, the two mixup models do not yield any substantial change in robustness.
Perhaps surprisingly, features learned by the autoencoder appear to be the most robust among the networks, surpassing all of the supervised models. Since the VAE exhibits much less robustness, it is unlikely that the bottleneck architecture plays an important role. Despite having a much lower accuracy, the plain kNN as well as the kNN on the autoencoder are more robust than the state-of-the-art defenses, ABS and L2NNN, by a significant margin.
It is likely that the attack does not find adversarial examples with the smallest perturbation, but this is the only attack, to the best of our knowledge, that reliably and efficiently finds adversarial examples with close to a 100% success rate. In the next section, we improve the robustness of some of the models further by combining them with adversarial training and evaluate their robustness in more detail.
4.4 kNN on Adversarially Trained Models
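Table 3: Clean accuracy and robustness of the adversarially trained models. (*) Mean norm found by the boundary attack.
Models  Clean Acc.  Mean Norm  Acc. at  Acc. at  Acc. at
$\ell_2$-Adv (no kNN)  0.9470  2.9060  0.8918  0.7575  0.4764
$\ell_\infty$-Adv  0.9653  2.0*  -  -  -
$\ell_2$-Adv  0.9726  3.0378  0.9387  0.8027  0.5095
AdvRot  0.9716  2.9973  0.9658  0.8302  0.4571
AdvAE  0.9641  3.0198  0.9323  0.8112  0.4899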
Adversarial retraining is one of a few methods that effectively improve adversarial robustness, and it has been shown to encourage networks to learn more robust features [Madry et al.2017, Athalye, Carlini, and Wagner2018]. We pick a subset of the networks in Section 4.3 and adversarially train them with the original objective. For the adversarial training, we use 40 steps of size 0.01 with a maximum $\ell_\infty$ norm of 0.3, and 40 steps of size 0.1 for the $\ell_2$ version, except for the rotation model, which uses 20 steps of size 0.05. The results are reported in Table 3.
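The inner loop of adversarial training, which crafts the perturbed samples that the network is then trained on, can be sketched as projected gradient ascent. The example below is a toy under stated assumptions: it attacks a linear classifier with a logistic loss so that the gradient is available in closed form, it uses a signed-gradient step with projection onto an $\ell_\infty$ ball, and its default step count, step size, and radius simply mirror the numbers quoted above. It is not the training code used for the experiments.

```python
import math


def pgd_linf(x, y, w, b, eps=0.3, step=0.01, n_steps=40):
    """Projected gradient ascent on the logistic loss of a toy linear model.

    x: input vector, y: label in {-1, +1}, (w, b): linear classifier sign(w.x + b).
    """
    x_adv = list(x)
    for _ in range(n_steps):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
        # Gradient of log(1 + exp(-margin)) w.r.t. x is -y * w * sigmoid(-margin).
        coeff = -y / (1.0 + math.exp(margin))
        grad = [coeff * wi for wi in w]
        # Signed-gradient ascent step on the loss.
        x_adv = [xa + step * math.copysign(1.0, g) if g != 0 else xa
                 for xa, g in zip(x_adv, grad)]
        # Projection back onto the l_inf ball of radius eps around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Toy usage: the clean margin is 3.0; the attack pushes it down within the eps-ball.
x_clean, w_toy = [1.0, -1.0], [2.0, -1.0]
x_adv = pgd_linf(x_clean, 1, w_toy, 0.0)
```

For image data one would also clip `x_adv` to the valid pixel range after each step; that detail is omitted here to keep the sketch short.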
Initially, $\ell_\infty$-Adv appears very robust to the attack. We suspect that this is caused by the gradient obfuscation problem, where a gradient-based attack fails to find good adversarial examples [Athalye, Carlini, and Wagner2018]. As a result, we evaluate some of the models with a boundary-based attack, which only relies on hard labels [Brendel, Rauber, and Bethge2018]. We use the official implementation in Foolbox [Rauber, Brendel, and Bethge2017]. Still, the attack is extremely slow, so we only manage to run it on the first hundred samples in the test set. The boundary attack does not always succeed, but we find that the mean perturbation norm is about 2.0 on $\ell_\infty$-Adv, which is much smaller than the ones found by the gradient attack. Nonetheless, the other models do not appear to have the same problem as the two attacks find adversarial examples of a comparable size. Figure 2 shows plots comparing the perturbation norm of the two attacks on two models. The gradient attack performs strictly better than the boundary attack on the autoencoder and roughly the same on average on $\ell_2$-Adv.
All of the models, except for $\ell_\infty$-Adv, exhibit similarly strong robustness. $\ell_2$-Adv is significantly more robust than its original version (Basic Network with kNN) at the cost of some clean accuracy. Interestingly, with the kNN, both of the unsupervised models, AdvRot and AdvAE, achieve higher accuracy with similar or improved robustness. This is unexpected as the unsupervised models still do not have access to the labels. Also, it is well known that supervised adversarial training generally reduces accuracy on benign samples.
We suspect that the adversarial training forces the model to learn robust features, which are also more likely to be shared between samples from the same class. Consequently, these samples cluster more in the representation space of the adversarially trained models, increasing both the robustness and the accuracy of the kNN. On the other hand, the rotation model without adversarial training, for example, predicts the rotation with a surprisingly high accuracy (where it should not, as we mentioned in Section 4.1), suggesting that it potentially learns trivial and non-robust features. As a result, it has poor robustness as reported in Section 4.3.
It is also important to note that $\ell_2$-Adv with the kNN has higher clean accuracy than $\ell_2$-Adv without one. This contradicts the trend on the normally trained models, which have lower accuracy when combined with kNN. This observation shows that kNN does not always reduce accuracy on benign samples and helps support our intuition in Section 3 that kNN allows access to a better robustness-accuracy trade-off curve than relying on adversarial training alone. We hypothesize that adversarially trained models (with a sufficiently large perturbation budget) suffer from the limited class of functions they can represent, as high sensitivity is heavily penalized by the adversarial loss. So in some cases such as this one, kNN can simultaneously improve the robustness and the clean accuracy.
4.5 Analysis of Robustness via Local Intrinsic Dimension and Sensitivity
Model  LID  Norm of Jacobian
Input  12.87  1*
Basic Model  8.51  26.24
SNN Loss  7.00  37.83
VAE  9.38  2.12
Autoencoder  14.48  1.47
$\ell_\infty$-Adv  6.58  9.45
$\ell_2$-Adv  6.47  1.76
AdvAE  9.74  0.29
Table 4: LID and spectral norm of the Jacobian matrix from the input to the features of the third layer of the networks. Similarly to previous works, we use the Maximum Likelihood Estimator to approximate LID [Amsaleg et al.2017, Ma et al.2018]. Both LID and the Jacobian norm are calculated and averaged over the first thousand samples in the test set. (*) “Input” means MNIST in the pixel space without passing through any network, so, equivalently, it has a Jacobian norm of 1.

To better explain the robustness of different representations, we attempt to attribute it to two characteristics: (1) local intrinsic dimension (LID) and (2) sensitivity. First, the robustness of 1NN has been analyzed and shown to depend on the LID of the input space [Amsaleg et al.2017]. Specifically, Amsaleg et al. show that as the LID of a given input approaches infinity, the size of the perturbation required to change its $k$-th neighbor into the first neighbor tends to zero. Loosely speaking, data that cluster have smaller LID.
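For reference, the maximum likelihood estimate of LID commonly used in the cited works takes the distances $r_1 \le \dots \le r_k$ from a query to its $k$ nearest neighbors and returns $-\big(\tfrac{1}{k}\sum_{i} \log (r_i / r_k)\big)^{-1}$. The sketch below implements that formula; the function name is hypothetical, and it assumes strictly positive distances that are not all equal.

```python
import math


def lid_mle(distances):
    """Maximum likelihood estimate of LID from distances to the k nearest neighbors.

    distances: strictly positive distances from a query to its k nearest neighbors.
    """
    r = sorted(distances)
    r_max = r[-1]
    # Average log-ratio of each neighbor distance to the largest one (always <= 0).
    mean_log_ratio = sum(math.log(ri / r_max) for ri in r) / len(r)
    return -1.0 / mean_log_ratio


# Sanity check: if the i-th of k neighbors lies at distance (i / k) ** (1 / d),
# as expected for points spread uniformly in d dimensions, the estimate is close to d.
k, d = 1000, 5
estimate = lid_mle([(i / k) ** (1.0 / d) for i in range(1, k + 1)])
```

In this setting the distances would be measured in the representation space of the layer of interest and the estimate averaged over many query samples.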
Sensitivity plays a more straightforward role in robustness: models that are less sensitive to a small change in the input should also be more robust to adversarial examples. However, it is unclear how to measure sensitivity for our task. For simplicity, we consider local sensitivity measured by the spectral norm of the Jacobian of the representation at the layer of interest with respect to the input. Combining the two metrics, we expect that a representation with small LID and small spectral norm will be more robust than one with large LID and large spectral norm.
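The spectral norm of a Jacobian is its largest singular value, which can be approximated by power iteration. The following sketch assumes the Jacobian is already available as a dense matrix (a list of rows); for a real network it would typically be obtained by automatic differentiation, which is outside the scope of this example. The function name and the seeded random initialization are choices made for the illustration.

```python
import math
import random


def spectral_norm(jacobian, n_iters=200, seed=0):
    """Largest singular value of a matrix (list of rows) via power iteration on J^T J."""
    rng = random.Random(seed)
    n_rows, n_cols = len(jacobian), len(jacobian[0])
    # Random unit start vector, so it is unlikely to be orthogonal to the top direction.
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    sigma = 0.0
    for _ in range(n_iters):
        # u = J v, and ||u|| is the current estimate of the largest singular value.
        u = [sum(jacobian[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        # w = J^T u, renormalized to give the next iterate.
        w = [sum(jacobian[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            break
        v = [x / norm_w for x in w]
    return sigma


# The identity map (no network) has spectral norm 1.
identity_norm = spectral_norm([[1.0, 0.0], [0.0, 1.0]])
```

A value of 1 for the identity matches the convention used for the “Input” row in Table 4.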
Table 4 shows the LID and sensitivity of a subset of the representations. There are several interesting observations:

Except for the VAE, the models with a small Jacobian norm (i.e., “Input,” AE, $\ell_2$-Adv, AdvAE) are very robust.

As expected, supervised models have smaller LID, and adversarial training reduces the sensitivity.

While the LID of SNN Loss is small, the model is also very sensitive. This might explain why it is the least robust. The SNN loss, which encourages clustering of samples in the same class, leads to low LID (since the neighbors are dense) but also high sensitivity.

$\ell_\infty$-Adv does not have a particularly small LID or Jacobian norm, suggesting that it is not as robust, which confirms our evaluation with the boundary-based attack.
These metrics can serve as a sanity check for the empirical results as well as give us a better idea of what contributes to the robustness, but they are far from being an accurate measurement of adversarial robustness. Nonetheless, with this intuition, one might find representation learning that directly encourages these components (i.e., small LID and low sensitivity) to be a useful future direction.
5 Heuristic Defense on CIFAR-10
Table 5: Clean accuracy and robustness on CIFAR-10.
Models  Clean Acc.  Mean Norm  Acc. at  Acc. at  Acc. at
L2NNN [Qian and Wegman2019]  0.772  -  -  -  0.204
ResNet (no kNN)  0.9299  0.2124  0.0332  0.0018  0.0006
Adv (no kNN)  0.8045  1.2400  0.6847  0.4924  0.2705
ResNet  0.9301  0.1429  0.0012  0.0001  0.0001
Adv  0.7945  2.2970  0.7447  0.6370  0.5151
We attempt to extend our scheme to a more complex dataset, CIFAR-10. We evaluate representations of two models: a pre-activation ResNet [He et al.2016] with 20 layers and its adversarially trained version (Adv). We use 8 steps of adversarial training with a step size of 0.05. Both models are trained with the Adam optimizer with a learning rate of 1e-3 and a batch size of 128. We also experimented with other unsupervised models, similarly to the MNIST experiments. However, the accuracy of kNN on most of their representations is too low to consider. For datasets larger than MNIST, the representation may need to be either trained or fine-tuned in a supervised manner in order to reach a comparable accuracy with kNN.
As shown in Table 5, adversarial training still provides a significant improvement in robustness, again with some drop in clean accuracy. When combined with kNN, the adversarially trained model becomes more robust, which aligns with the results on MNIST. However, the vanilla ResNet unexpectedly becomes less robust. Adv with kNN has a slightly higher accuracy and is significantly more robust than the state-of-the-art L2NNN. This result suggests that, with an appropriate choice of the representation, kNN still strengthens the network with little-to-no drop in clean accuracy even on a larger dataset. Note that we only use a relatively small ResNet-20 to demonstrate the improvement gained from kNN, while it is highly likely that larger networks will achieve even higher robustness and accuracy.
6 Certifiable Defense
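Table 6: Clean accuracy and robustness certificates of 1NN on each layer of the Lipschitz network. Of the unlabeled columns, the first two give the fraction of test samples certified against perturbations with norms of 0.5 and 1, respectively.
Layers  Clean Acc.  Avg. Cert.
ABS [Schott et al.2019]  0.99  0.69  -  -  -  -
LMT [Tsuzuku, Sato, and Sugiyama2018]  0.95  1.02  -  -  -  -
Input  0.9683  0.9567  0.8102  0.4059  0.0977  0.0064
relu1  0.9701  0.8576  0.8000  0.3574  0.0663  0.0034
relu2  0.9723  0.7821  0.7648  0.2677  0.0379  0.0005
relu3  0.9745  0.4295  0.3509  0.0013  0.0  0.0
bottleneck  0.9752  0.3909  0.2668  0.0001  0.0  0.0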
In addition to the heuristic defense, we propose a novel construction of a certifiable defense based on our scheme with some specific components: a 1-nearest neighbor (1NN) on features from a Lipschitz network. A Lipschitz network, i.e., a network that is a Lipschitz function, is one of several frameworks that allow computation of a lower bound on the norm of the adversarial perturbation required to change the classification of a given input. In other words, a given input is robust to any perturbation with a norm smaller than the bound. In most of the previous works, to increase adversarial robustness, Lipschitz networks are trained to maximize a margin $m$ between the logit of the correct class and the second largest logit, i.e.,

$$m = f(x)_y - \max_{j \ne y} f(x)_j$$

where $f(x)$ is the logits of input $x$ and $y$ is its correct class. Then, the lower bound on the perturbation norm required to change the classification of a given input is given as $m / (\sqrt{2} L)$ for the $\ell_2$ norm or $m / (2L)$ for the $\ell_\infty$ norm, where $L$ is the Lipschitz constant [Tsuzuku, Sato, and Sugiyama2018, Huster, Chiang, and Chadha2018, Qian and Wegman2019, Anil, Lucas, and Grosse2018].

Using a similar notion, we can define the margin for 1NN as the difference between the distance from an input to the nearest neighbor of a wrong class and the distance to the nearest neighbor of the correct class, i.e., the margin

$$m = \min_{y' \ne y} \min_{x' \in X_{y'}} \|x - x'\|_2 - \min_{x' \in X_y} \|x - x'\|_2$$

where $y$ is the true label of $x$, and $X_c$ is the set of training samples from class $c$. It follows that the lower bound on the $\ell_2$ norm of the perturbation needed to change the classification of 1NN is $m/2$, assuming that $m > 0$ (otherwise, $x$ is already misclassified). Note that the bound is very simplistic and only tight when $x$ is a convex combination of its nearest neighbor of the correct class and its nearest neighbor of a wrong class. In fact, an optimal perturbation for 1NN (the minimal Euclidean distance from $x$ to the edge of the nearest Voronoi cell of a different class) can be computed by solving a quadratic program. Nonetheless, it is still expensive to solve for a large number of constraints. While we believe that $k > 1$ provides more robustness, it is not clear how to find the Voronoi cells efficiently or to provide the bound for a large $k$. We leave this direction to future works.
Now returning to our scheme, we consider the case where the input to 1NN is a representation from a Lipschitz network with a Lipschitz constant of 1 in the $\ell_2$ norm. The same bound, in fact, still applies to the norm of the perturbation $\delta$ in the pixel space: $\|\delta\|_2 \ge m/2$, but $m$ is now defined on the representation space instead of the input space:

$$m = \min_{y' \ne y} \min_{x' \in X_{y'}} \|f_l(x) - f_l(x')\|_2 - \min_{x' \in X_y} \|f_l(x) - f_l(x')\|_2$$

where $f_l(x)$ is the output at the $l$-th layer of the network given input $x$. We refer to Appendix B for the proof.
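The certificate for a single sample follows from the triangle inequality: a perturbation of norm $r$ moves the representation of a 1-Lipschitz network by at most $r$, so it can shrink the distance to the nearest wrong-class neighbor by at most $r$ and grow the distance to the nearest correct-class neighbor by at most $r$, which means the prediction cannot flip while $2r$ is below the margin. A minimal sketch of this computation is given below; the function names are hypothetical, and the features are assumed to be outputs of a network that is 1-Lipschitz in the $\ell_2$ norm (or raw inputs).

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def certified_radius_1nn(feature, label, train_features, train_labels):
    """Lower bound (half the margin) on the l2 perturbation needed to flip a 1NN prediction.

    Returns 0.0 when the margin is not positive, i.e. the sample is already misclassified.
    """
    d_same = min(l2(feature, f) for f, y in zip(train_features, train_labels) if y == label)
    d_other = min(l2(feature, f) for f, y in zip(train_features, train_labels) if y != label)
    margin = d_other - d_same
    return max(margin, 0.0) / 2.0


# Toy usage: the query lies on the segment between the two training points,
# which is the case in which the bound is tight.
train_x, train_y = [[0.0, 0.0], [4.0, 0.0]], [0, 1]
radius = certified_radius_1nn([1.0, 0.0], 0, train_x, train_y)
```

Averaging this radius over the test set gives an average certificate, and counting the samples whose radius exceeds a given norm gives a certified accuracy at that norm.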
6.1 Experimental Setup
To train a Lipschitz network, we use the same method as Qian & Wegman [Qian and Wegman2019]. We implement an autoencoder that has the same architecture as the one in Section 4 but with the encoder now being a Lipschitz network. We choose the autoencoder since it performs well as a heuristic defense and is compatible with our loss function below. Through a number of experiments, we find that training the autoencoder with the MSE reconstruction loss and an additional regularization on the bottleneck layer which directly maximizes the distance between samples from different classes consistently yields good clean accuracy and robustness on MNIST. The loss function of our network can be written as:
where is the output of the autoencoder, and is the output at the bottleneck layer. is a set of samples in the same batch as that have a different label, and is a threshold. The reconstruction loss (first term) is needed to cluster samples from the same class together, which affects the accuracy, while the regularization (second term) encourages the representations to be some distance apart up to the threshold.
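A loss with the two ingredients described above can be sketched as follows. This is an illustration under explicit assumptions rather than the exact objective used in the experiments: the hinge form of the penalty, the averaging over the batch, the weighting coefficient `lam`, and all function and argument names are choices made for the example.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def autoencoder_loss(x, x_hat, z, z_others, tau, lam=1.0):
    """MSE reconstruction loss plus a hinge penalty on close different-class codes.

    x, x_hat  -- input and its reconstruction by the autoencoder
    z         -- bottleneck output for x
    z_others  -- bottleneck outputs of same-batch samples with a different label
    tau       -- distance threshold; lam is a hypothetical weighting coefficient
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    if not z_others:
        return recon
    # Only pairs closer than tau are penalized, so codes are pushed apart up to the threshold.
    penalty = sum(max(0.0, tau - l2(z, zo)) for zo in z_others) / len(z_others)
    return recon + lam * penalty


# Toy usage: perfect reconstruction, one different-class code farther than tau
# and one closer than tau.
loss = autoencoder_loss([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [[3.0, 4.0], [0.0, 1.0]], tau=2.0)
```

Capping the penalty at the threshold keeps the regularizer from dominating the reconstruction term once different-class codes are already well separated.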
For each test sample, we compute the margin and the corresponding lower bound on the perturbation norm using different layers of the encoder.
6.2 Results
We report clean accuracy and the robustness certificates provided by our scheme on each of the layers of the network (Table 6). Layers closer to the input have lower accuracy but a larger lower bound. The “Input” row in the table is simply a 1NN on the input space, which provides a very large mean lower bound of 1.8 while achieving reasonable accuracy. The average lower bound is nontrivial, considering that the mean norm of the perturbation used to fool networks without a defense is around 1.5.
In comparison with the related works, our scheme achieves higher accuracy with a slightly smaller bound than LMT [Tsuzuku, Sato, and Sugiyama2018] and a larger bound with lower accuracy compared to ABS [Schott et al.2019]. Using the first layer, we can guarantee that the classification of 80% of the test samples cannot be changed by any perturbation with an $\ell_2$ norm less than 0.5. Nonetheless, on the bottleneck layer, we can only certify up to 27%. The bound becomes smaller for the deeper layers because our network is not perfectly norm preserving. Consequently, the distance to the nearest samples of a wrong class, as well as the margin, diminishes as the input is passed through the network.
We also attack our certifiable defense (relu1) with the heuristic attack and obtain an average perturbation norm of 1.82, which is twice as large as the average lower bound given by the 1NN. The gap is due to both the looseness of our lower bound and the fact that the heuristic attack does not find the optimal perturbation. Nevertheless, this suggests that our scheme already produces a nontrivial certificate using only a naive lower bound.
Due to the flexibility of the scheme, by using a different choice of networks and hyperparameters, one can easily obtain a model with a higher accuracy but a smaller bound or vice versa. Nonetheless, a better construction of Lipschitz and norm-preserving networks is needed for our scheme to scale to larger datasets.
7 Conclusion
We propose a scheme combining kNN and robust representation learning as a defense against adversarial examples on MNIST and CIFAR-10. Our method pushes the accuracy-robustness trade-off curve further and is straightforward to construct. It is a general and flexible framework that provides clear directions for improvement. Namely, one can improve it by learning more robust representations or by finding an adversarially robust variant of kNN. The two problems can be pursued independently, and we believe that they are both interesting on their own. We hope that our scheme will also inspire future architecture designs for robust neural networks that implicitly or explicitly impose a similarity search.
Acknowledgement
We would like to thank Nicolas Papernot for constructive feedback and insights on DkNN.
References
 [Amsaleg et al.2017] Amsaleg, L.; Bailey, J.; Barbe, D.; Erfani, S.; Houle, M. E.; Nguyen, V.; and Radovanović, M. 2017. The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality. In 2017 IEEE Workshop on Information Forensics and Security (WIFS), 1–6.
 [Anil, Lucas, and Grosse2018] Anil, C.; Lucas, J.; and Grosse, R. 2018. Sorting out lipschitz function approximation. CoRR abs/1811.05381.
 [Athalye, Carlini, and Wagner2018] Athalye, A.; Carlini, N.; and Wagner, D. A. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR abs/1802.00420.

 [Ballard1987] Ballard, D. H. 1987. Modular learning in neural networks. In Proceedings of the Sixth National Conference on Artificial Intelligence - Volume 1, AAAI’87, 279–284. AAAI Press.
 [Biggio et al.2013] Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Blockeel, H.; Kersting, K.; Nijssen, S.; and Železný, F., eds., Machine Learning and Knowledge Discovery in Databases, 387–402. Berlin, Heidelberg: Springer Berlin Heidelberg.
 [Bojarski et al.2016] Bojarski, M.; Testa, D. D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L. D.; Monfort, M.; Muller, U.; Zhang, J.; Zhang, X.; Zhao, J.; and Zieba, K. 2016. End to end learning for self-driving cars. CoRR abs/1604.07316.
 [Brendel, Rauber, and Bethge2018] Brendel, W.; Rauber, J.; and Bethge, M. 2018. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations.

 [Chen et al.2015] Chen, C.; Seff, A.; Kornhauser, A.; and Xiao, J. 2015. Deepdriving: Learning affordance for direct perception in autonomous driving. In The IEEE International Conference on Computer Vision (ICCV).
 [Dubey et al.2019] Dubey, A.; van der Maaten, L.; Yalniz, Z.; Li, Y.; and Mahajan, D. K. 2019. Defense against adversarial images using web-scale nearest-neighbor search. CoRR abs/1903.01612.
 [Frosst, Papernot, and Hinton2019] Frosst, N.; Papernot, N.; and Hinton, G. 2019. Analyzing and improving representations with the soft nearest neighbor loss. CoRR abs/1902.01889.
 [Gidaris, Singh, and Komodakis2018] Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. CoRR abs/1803.07728.
 [Goodfellow, Shlens, and Szegedy2015] Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. CoRR abs/1603.05027.
 [Huster, Chiang, and Chadha2018] Huster, T.; Chiang, C. J.; and Chadha, R. 2018. Limitations of the lipschitz constant as a defense against adversarial examples. CoRR abs/1807.09705.
 [Johnson, Douze, and Jégou2017] Johnson, J.; Douze, M.; and Jégou, H. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
 [Khoury and HadfieldMenell2019] Khoury, M., and Hadfield-Menell, D. 2019. Adversarial training with Voronoi constraints. CoRR abs/1905.01019.
 [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. CoRR abs/1312.6114.

 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
 [Litjens et al.2017] Litjens, G. J. S.; Kooi, T.; Bejnordi, B. E.; Setio, A. A. A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J. A. W. M.; van Ginneken, B.; and Sánchez, C. I. 2017. A survey on deep learning in medical image analysis. CoRR abs/1702.05747.
 [Ma et al.2018] Ma, X.; Li, B.; Wang, Y.; Erfani, S. M.; Wijewickrema, S. N. R.; Houle, M. E.; Schoenebeck, G.; Song, D.; and Bailey, J. 2018. Characterizing adversarial subspaces using local intrinsic dimensionality. CoRR abs/1801.02613.
 [Madry et al.2017] Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083.
 [Mnih et al.2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing atari with deep reinforcement learning. CoRR abs/1312.5602.
 [MoosaviDezfooli, Fawzi, and Frossard2015] Moosavi-Dezfooli, S.-M.; Fawzi, A.; and Frossard, P. 2015. Deepfool: a simple and accurate method to fool deep neural networks. arXiv preprint arXiv:1511.04599.

 [Nguyen, Yosinski, and Clune2015] Nguyen, A.; Yosinski, J.; and Clune, J. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 427–436. IEEE.
 [Papernot and McDaniel2018] Papernot, N., and McDaniel, P. D. 2018. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. CoRR abs/1803.04765.
 [Papernot, McDaniel, and Goodfellow2016] Papernot, N.; McDaniel, P. D.; and Goodfellow, I. J. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277.
 [Qian and Wegman2019] Qian, H., and Wegman, M. N. 2019. L2-nonexpansive neural networks. In International Conference on Learning Representations.
 [Rauber, Brendel, and Bethge2017] Rauber, J.; Brendel, W.; and Bethge, M. 2017. Foolbox: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
 [Schott et al.2019] Schott, L.; Rauber, J.; Bethge, M.; and Brendel, W. 2019. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations.
 [Silver et al.2017a] Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T. P.; Simonyan, K.; and Hassabis, D. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR abs/1712.01815.
 [Silver et al.2017b] Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of go without human knowledge. Nature 550(7676):354.
 [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [Sitawarin and Wagner2019] Sitawarin, C., and Wagner, D. 2019. On the robustness of deep k-nearest neighbors. CoRR abs/1903.08333.
 [Szegedy et al.2013] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199.
 [Tsuzuku, Sato, and Sugiyama2018] Tsuzuku, Y.; Sato, I.; and Sugiyama, M. 2018. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 6541–6550.
 [Verma et al.2019] Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Courville, A.; Mitliagkas, I.; and Bengio, Y. 2019. Manifold mixup: Learning better representations by interpolating hidden states.
 [Wang, Jha, and Chaudhuri2018] Wang, Y.; Jha, S.; and Chaudhuri, K. 2018. Analyzing the robustness of nearest neighbors to adversarial examples. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5133–5142. Stockholmsmässan, Stockholm Sweden: PMLR.
 [Zhang et al.2018] Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.
Appendix A Gradient-Based Attack on kNN
In summary, the attack operates by adding a perturbation to the input such that its representation, , moves closer to representations of a nearest group of training instances from a different class ( for ). This heuristic can be formulated as a constrained optimization problem as the following.
The box constraint ensures that the perturbed input lies in the feasible input region, which in this case is between 0 and 1 for pixel values. The optimization can be written in Lagrangian form, so we binary search for the Lagrangian constant that yields the minimal perturbation. The optimization is solved with the Adam optimizer.
For most of the models on MNIST, we use , the initial Lagrangian constant of 1e-3, a learning rate of 1e-1, a maximum of 500 iterations, and 10 binary search steps. For adversarially trained models or more robust models like the autoencoder, we have to increase to 1e-1 and the number of iterations to 1000 in order to achieve a near-100% attack success rate. For CIFAR10, is set to 1e-5 for the vanilla ResNet and to 1e-3 for adversarial training. We use a learning rate of 1e-2, 500 iterations, and again, 10 binary search steps.
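The binary search over the Lagrangian constant can be illustrated with a toy sketch. The following is not the attack used in our experiments. It assumes an identity feature map in two dimensions, a single guide sample from another class, a squared-distance penalty objective $\|\delta\|_2^2 + c\,\|f(x+\delta) - g\|_2^2$, and plain gradient descent in place of Adam. All function names and the step size are hypothetical, and the step size is only tuned for this toy problem; the 500 iterations and 10 binary search steps mirror the settings above.

```python
import math


def nearest_label(z, train):
    """Return the label of the training representation closest to z (1NN)."""
    return min(train, key=lambda item: math.dist(z, item[0]))[1]


def run_attack(x, guide, c, steps=500, lr=0.1):
    """Minimize ||delta||^2 + c * ||f(x + delta) - guide||^2 with f = identity.

    Plain gradient descent stands in for Adam. After every step, x + delta is
    clipped back into the [0, 1] box.
    """
    delta = [0.0] * len(x)
    for _ in range(steps):
        grad = [2.0 * d + 2.0 * c * (xi + d - gi)
                for xi, d, gi in zip(x, delta, guide)]
        delta = [d - lr * g for d, g in zip(delta, grad)]
        # Projection onto the box constraint 0 <= x + delta <= 1.
        delta = [min(1.0, max(0.0, xi + d)) - xi for xi, d in zip(x, delta)]
    return delta


def binary_search_attack(x, label, guide, train, c_init=1e-3, search_steps=10):
    """Search for a small constant c whose solution flips the 1NN label."""
    lo, hi, c = 0.0, None, c_init
    best = None
    for _ in range(search_steps):
        delta = run_attack(x, guide, c)
        adv = [xi + d for xi, d in zip(x, delta)]
        if nearest_label(adv, train) != label:
            # Success: keep the smallest perturbation, then try a smaller c.
            if best is None or math.hypot(*delta) < math.hypot(*best):
                best = delta
            hi = c
            c = (lo + hi) / 2.0
        else:
            # Failure: weight the distance to the guide sample more heavily.
            lo = c
            c = c * 10.0 if hi is None else (lo + hi) / 2.0
    return best


# Toy data: one training point per class, and a class-0 input to attack.
train = [((0.2, 0.2), 0), ((0.8, 0.8), 1)]
delta = binary_search_attack((0.3, 0.3), 0, (0.8, 0.8), train)
print(math.hypot(*delta))
```

In this toy setting the smallest perturbation that reaches the 1NN decision boundary has norm $0.2\sqrt{2} \approx 0.283$, and the search returns a perturbation only slightly larger than that.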
Appendix B Proof on Certifiable Defense
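Here, we refer to Lipschitz networks as neural networks that are Lipschitz functions with a constant $L$ in some norm. More precisely, let $f: \mathbb{R}^d \to \mathbb{R}^{d'}$ be the neural network function that maps inputs to some representation (e.g. logits or intermediate layers) with dimension $d'$. Now $f$ is $L$-Lipschitz in the $\ell_p$ norm if
(2) $\|f(x_1) - f(x_2)\|_p \le L \, \|x_1 - x_2\|_p$ for all $x_1, x_2 \in \mathbb{R}^d$
Equation 2 can be rearranged by replacing $x_1$ with an input $x$ and $x_2$ with $x + \delta$ where $\delta$ is a perturbation:
(3) $\|f(x + \delta) - f(x)\|_p \le L \, \|\delta\|_p$
Now we consider a particular case where $p = 2$ and $L = 1$, and define $z$ as $f(x)$. Let $D_i$ be the distance function that maps a representation $z$ to the distance between $z$ and $z_i = f(x_i)$, where $x_i$ are the training samples, i.e.
$D_i(z) = \|z - z_i\|_2$
We want to show a simple bound on $D_i(f(x + \delta))$. By the reverse triangle inequality,
$|D_i(f(x + \delta)) - D_i(f(x))| \le \|f(x + \delta) - f(x)\|_2$
Since from Equation 3, we know that $\|f(x + \delta) - f(x)\|_2 \le \|\delta\|_2$, we get the bound:
(4) $D_i(f(x)) - \|\delta\|_2 \le D_i(f(x + \delta)) \le D_i(f(x)) + \|\delta\|_2$
For 1NN, the classification of $x$ is simply the class of $x_{i^*}$ where $i^* = \arg\min_i D_i(f(x))$. So in order to change the classification of $x$, one must reduce the margin $\min_{j \in S} D_j(f(x)) - D_{i^*}(f(x))$ to zero, where $S$ is the set of indices of training samples that are not from the same class as $x_{i^*}$. In the worst case, an adversary can do so by decreasing $\min_{j \in S} D_j$ by $\|\delta\|_2$ and increasing $D_{i^*}$ by $\|\delta\|_2$ simultaneously, which can only be achieved when $f(x)$ is a convex combination of $z_{i^*}$ and the nearest neighbor of a different class. This implies that if $\|\delta\|_2 < \frac{1}{2} \left( \min_{j \in S} D_j(f(x)) - D_{i^*}(f(x)) \right)$ then there is no $\delta$ that will change the class of $x$.
The other way to arrive at this bound is to notice that the margin $\min_{j \in S} D_j(f(x)) - D_{i^*}(f(x))$ is 2-Lipschitz with respect to $x$. So similarly, a perturbation $\delta$ can change the margin by at most $2 \|\delta\|_2$. To summarize, our lower bound of the perturbation required to change the prediction of a 1NN classifier is
(5) $\|\delta\|_2 \ge \frac{1}{2} \left( \min_{j \in S} D_j(f(x)) - D_{i^*}(f(x)) \right)$
Or in other words, there exists no $\delta$ with $\|\delta\|_2 < \frac{1}{2} \left( \min_{j \in S} D_j(f(x)) - D_{i^*}(f(x)) \right)$ such that $x$ and $x + \delta$ are classified to different labels. This lower bound is similar to, but not the same as, the margin in logits for the Lipschitz networks proposed by previous works. We believe that Lipschitz networks can learn a larger margin while achieving a high accuracy with 1NN. We omit the case where $k > 1$ here as the bound is much more complicated to compute and potentially looser.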
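As a small worked example of this lower bound, the sketch below computes the 1NN prediction and the certified radius, taken as half the gap between the distance to the nearest training representation of any other class and the distance to the nearest neighbor. It assumes the representations come from a feature map that is 1-Lipschitz in the $\ell_2$ norm; the function name and the toy data are hypothetical.

```python
import math


def certified_radius_1nn(z, train):
    """Return (prediction, lower bound on ||delta||_2 needed to change it).

    z: representation of the test input.
    train: list of (representation, label) pairs for the training set.
    Assumes the feature map is 1-Lipschitz in the l2 norm.
    """
    dists = [(math.dist(z, zi), yi) for zi, yi in train]
    d_near, pred = min(dists)  # nearest neighbor and its class
    # Distance to the nearest training sample of any other class.
    d_other = min(d for d, y in dists if y != pred)
    return pred, (d_other - d_near) / 2.0


# Toy data: one training representation per class.
train = [((0.2, 0.2), 0), ((0.8, 0.8), 1)]
pred, radius = certified_radius_1nn((0.3, 0.3), train)
print(pred, radius)
```

Here the test representation lies on the segment between the two training points, which is the worst case described above, so the bound $0.2\sqrt{2} \approx 0.283$ coincides with the true distance to the decision boundary in representation space.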