The security of deep learning models has gained tremendous attention, considering that they are the backbone techniques behind various applications such as image recognition and translation [14, 9, 10, 27]. Nonetheless, prior works mainly focus on higher accuracy and ignore robustness, even though adversaries can significantly degrade performance with small perturbations [16, 4, 15, 1, 29, 5, 19], limiting the domains in which neural networks can be used, such as secure payment with face recognition and self-driving.
To handle this problem, effective and generic defenses for deep models have been proposed, such as adversarial training, data augmentation, and distillation [20, 24, 2, 8]. K-nearest neighbor (kNN) based methods [21, 25, 6] are among the most challenging kinds of defenses to attack, since they are non-parametric and can block the gradients needed to guide the generation of adversaries. For example, based on kNN, the Deep k-Nearest Neighbors algorithm (DkNN) defends against adversarial examples by ensembling kNN classifiers over features extracted from each layer, which can effectively resist adversaries generated by FGSM, BIM, and CW attacks.
This work studies the problem of attacking kNN classifiers and evaluating their robustness. Previous attempts at attacking kNN models either apply gradient-based attacks to continuous substitute models of kNN or use heuristics. For example, Papernot et al. proposed to employ a differentiable substitute for attacking a 1-NN model, which is not applicable to kNN models with large k; Sitawarin et al. proposed heuristic methods to find adversarial targets and then used gradient-based optimization to keep the perturbations smallest. However, the resulting distortions are still not small enough to be imperceptible.
In this paper, we propose a new adversarial attack called AdvKNN to attack kNN and DkNN classifiers with small distortions. First, we design a deep kNN block (DkNNB) to approximate the output of kNN classifiers; it is differentiable and can therefore guide the generation of adversarial examples with small distortion. To make the method more robust against DkNN, which summarizes the k nearest neighbors of each layer to obtain the final decision instead of taking the maximum probability as a classical classifier does, we propose a new consistency learning (CL) scheme for the probability distribution of the k nearest neighbors rather than for labels only. Combining DkNNB and CL with simple attacks such as FGSM and BIM, we find that both kNN and DkNN are vulnerable to adversarial examples with small perturbations. Under the $L_\infty$ norm, our method reduces the accuracy of DkNN on MNIST to only 5.71% with mean distortion 1.4909, while Sitawarin et al. achieved 17.44% with distortion 3.476.
The main contributions of this paper are as follows:
1) We propose a deep kNN block to approximate the output probability distribution of the k nearest neighbors, which is differentiable and thus can provide gradients to attack kNN and DkNN models with small distortions.
2) We propose a new consistency learning for distributions instead of classification, which makes our method more effective and robust against distribution-based defenses.
3) We evaluate our method on kNN and DkNN models, showing that the proposed AdvKNN outperforms prior attacks with a higher attack success rate and smaller mean distortion. Besides, we show that the credibility scores of DkNN models are not effective for detecting our attacks.
2 Background and Related Work
In this paper, we focus on adversarial examples in the classification task based on deep neural networks (DNN). An adversarial example is a type of evasion attack against a DNN at test time that aims to find a small perturbation to fool the model, defined as follows:

$$\min_{\delta} D(x, x+\delta) \quad \text{s.t.} \quad f(x+\delta) \neq y,$$

where $D$ is the distance metric, $f$ is the DNN model, $y$ is the true label of $x$, and $\delta$ is the perturbation. Adversaries are generated by attack methods, while defense algorithms are designed to resist them.
2.1 Classical Attacks
Fast Gradient Sign Method (FGSM). FGSM is one of the most classical attack methods. It is designed primarily for efficiency and optimized with the $L_\infty$ distance metric, which controls the maximum absolute value of the perturbation on any single pixel. Given an image $x$ with true label $y$, FGSM sets

$$x^{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(x, y)\right),$$

where $J$ is the training loss and $\epsilon$ is chosen to be sufficiently small so as to be imperceptible.
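As a concrete illustration, the single FGSM step can be sketched in a few lines of NumPy. The gradient here is a toy array standing in for the model's loss gradient, and the pixel range $[0, 1]$ is an assumption:

```python
import numpy as np

def fgsm(x, grad, eps=0.3):
    """Fast Gradient Sign Method: one step of size eps along the sign
    of the loss gradient, bounding the L-infinity perturbation.
    `grad` is the gradient of the loss w.r.t. the input x."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)  # keep pixels in a valid range

# toy example: a 2x2 "image" and an arbitrary stand-in gradient
x = np.array([[0.2, 0.8], [0.5, 0.1]])
g = np.array([[1.0, -2.0], [0.3, 0.0]])
x_adv = fgsm(x, g, eps=0.1)
```

Only the sign of the gradient matters, so every perturbed pixel moves by exactly $\pm\epsilon$ (or stays put where the gradient is zero).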
Basic Iterative Method (BIM). BIM is a refinement of FGSM. It takes multiple smaller steps of size $\alpha$ in the direction of the gradient sign, and during each step the result is clipped into the $\epsilon$-ball around $x$ instead of taking one single step of size $\epsilon$. Specifically, $x^{adv}_0 = x$, while

$$x^{adv}_{n+1} = \mathrm{Clip}_{x,\epsilon}\left( x^{adv}_n + \alpha \cdot \mathrm{sign}\left(\nabla_x J(x^{adv}_n, y)\right) \right),$$

where $x^{adv}_n$ is the generated adversarial example of input $x$ after step $n$.
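The iterative update above can be sketched as follows; `grad_fn`, the step sizes, and the constant-gradient toy example are illustrative stand-ins, not the paper's setup:

```python
import numpy as np

def bim(x, grad_fn, eps=0.3, alpha=0.05, steps=10):
    """Basic Iterative Method: repeated small FGSM steps of size alpha,
    clipping the accumulated perturbation into the eps-ball around x
    after every step."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # valid pixel range
    return x_adv

# toy gradient function: constant positive gradient everywhere
x = np.array([0.2, 0.5])
x_adv = bim(x, lambda z: np.ones_like(z), eps=0.1, alpha=0.05, steps=10)
```

With a constant gradient the iterate simply walks to the boundary of the $\epsilon$-ball and stays there, which is exactly the clipping behavior BIM adds over a single large FGSM step.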
2.2 KNN-based Defenses
Fig. 1 illustrates the smallest perturbation needed to attack DNN and kNN based classifiers. As shown in Fig. 1, only a small perturbation is needed to cross the decision boundary when attacking a normal DNN classifier. To make models more robust, kNN-based methods have been proposed. The kNN classifier is a popular non-parametric classifier that predicts the label of an input by finding its nearest neighbors under some distance metric, such as Euclidean or cosine distance, and taking a majority vote over the labels of the neighbors. The perturbation needed to fool a kNN classifier is much larger than for a normal DNN classifier, which makes attacks more difficult.
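A minimal sketch of such a kNN classifier under Euclidean distance (the toy two-class data and the choice of `k` are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(query, X_train, y_train, k=5):
    """Plain kNN: compute Euclidean distances to all training points,
    then take a majority vote among the labels of the k nearest."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy two-class data: one cluster near the origin, one near (5, 5)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
```

There are no trainable parameters here, which is precisely why gradient-based attacks cannot be applied to this classifier directly.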
DkNN is a more robust kNN-based algorithm, which integrates the predicted k nearest neighbors of all layers. Denote $p_l^c(x)$ as the probability of class $c$ predicted by kNN at layer $l$ for input $x$; then the final prediction of DkNN is

$$\hat{y} = \arg\max_{c \in \{1,\dots,C\}} \sum_{l=1}^{L} p_l^c(x),$$

where $C$ and $L$ are the numbers of classes and layers respectively. In addition to the final prediction, DkNN proposes a metric called credibility to measure the consistency of the k nearest neighbors across layers: the higher an adversary's credibility, the more confidently the model treats it as a clean sample. The credibility is computed by counting the k nearest neighbors of each layer that come from classes other than the majority, and this score is compared to the scores obtained when classifying samples from a held-out calibration set.
$$\mathrm{cred}(x) = \frac{1}{N}\,\bigl|\{\, i : \alpha(x_i, y_i) \ge \alpha(x, \hat{y}) \,\}\bigr|,$$

where $\alpha(\cdot,\cdot)$ denotes the layer-wise count of non-majority neighbors described above, and $(x_i, y_i)$ are the $i$-th sample and its corresponding true label in the calibration set, which contains $N$ samples.
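The layer-wise aggregation and the calibration-based credibility score can be sketched as follows; the helper names (`dknn_predict`, `nonconformity`, `credibility`) and the toy inputs are ours for illustration, not the paper's notation:

```python
import numpy as np

def dknn_predict(layer_probs):
    """DkNN decision: sum the per-layer kNN class distributions
    (shape: L layers x C classes) and take the argmax over classes."""
    return int(np.argmax(np.asarray(layer_probs).sum(axis=0)))

def nonconformity(neighbor_labels_per_layer, label):
    """Count, across all layers, the nearest neighbors whose label
    disagrees with the candidate label (higher = less conforming)."""
    return sum(int(np.sum(labels != label))
               for labels in neighbor_labels_per_layer)

def credibility(neighbor_labels_per_layer, label, calib_scores):
    """Empirical p-value: the fraction of calibration nonconformity
    scores at least as large as the input's own score."""
    a = nonconformity(neighbor_labels_per_layer, label)
    return float(np.mean(np.asarray(calib_scores) >= a))

# toy example: two layers, three classes; the layers disagree and the
# summed distribution decides the final label
probs = [[0.6, 0.4, 0.0],
         [0.1, 0.5, 0.4]]
nbrs = [np.array([1, 1, 0]), np.array([1, 1, 1])]  # neighbor labels per layer
```

A low credibility means many calibration samples were at least as "conforming" as the input, which is how DkNN flags suspicious inputs.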
To evaluate the robustness of such kNN-based methods, Sitawarin et al. argued that the minimum adversarial perturbation has to lie on the straight line connecting the input and the center of training instances belonging to a different class; once the target center is chosen, they use a gradient-based method to find the smallest perturbation. However, as shown in Fig. 1, the optimal perturbation may not lie on the line connecting two points, but at the point that crosses the kNN decision boundary. Besides, their method finds adversaries in input space instead of feature space and is therefore not computationally efficient.
3 AdvKNN: Adversarial Attack on kNN
This paper focuses on attacking kNN and DkNN classifiers. Our objective function can be denoted as:

$$\min_{\delta} D(x, x+\delta) \quad \text{s.t.} \quad g(x+\delta) \neq y,$$

where $g$ denotes the kNN (or DkNN) classifier.
We detail our method below by introducing the DkNNB and consistency learning for distributions.
Notation. Denote the predicted distribution of an input $x$ as

$$P(x) = [p_1(x), p_2(x), \dots, p_C(x)],$$

where $p_c(x)$ is the probability of $x$ belonging to class $c$ and $\hat{y} = \arg\max_c p_c(x)$ is the predicted label of $x$.
DkNNB. Let us start by precisely defining kNN. Assume that we are given a query item $q$, a database of candidate items $\{x_i\}$ with indices $i \in \{1, \dots, N\}$ for matching, and a distance metric $d(\cdot,\cdot)$ between pairs of items. Supposing that $q$ is not in the database, $d$ yields a ranking of the database items according to their distance to the query. Let $\pi$ be a permutation that sorts the database items by increasing distance to $q$:

$$d(q, x_{\pi(1)}) \le d(q, x_{\pi(2)}) \le \dots \le d(q, x_{\pi(N)}).$$

The kNN of $q$ are then given by the set of the first $k$ items:

$$\mathcal{N}_k(q) = \{x_{\pi(1)}, \dots, x_{\pi(k)}\}.$$
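The sort-based selection just described amounts to an argsort over distances; a small sketch with an assumed toy database:

```python
import numpy as np

def knn_indices(q, database, k):
    """Sort the database items by distance to the query and return the
    indices of the first k, i.e. the permutation pi restricted to its
    first k positions. np.argsort is a hard, non-differentiable
    ranking operation."""
    dists = np.linalg.norm(database - q, axis=1)
    pi = np.argsort(dists)  # permutation sorting items by distance
    return pi[:k]

# toy database of 2-D points; the query sits at the origin
db = np.array([[0., 0.], [3., 0.], [1., 0.], [2., 0.]])
idx = knn_indices(np.array([0., 0.]), db, k=2)
```

The output is a set of discrete indices: infinitesimally perturbing the query usually changes nothing, and occasionally flips an index, so no useful gradient flows through this step.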
The kNN selection is deterministic but not differentiable. This effectively prevents deriving gradients to guide the generation of adversaries with small perturbations for methods such as FGSM and BIM. We aim to approximate the prediction of kNN with a neural network, which can offer gradients for optimization.
We focus on the white-box threat model for attacks on both kNN and DkNN, which means the attackers have access to the training set and all parameters of the target models. Since the training set and k effectively serve as parameters of kNN classifiers (they are used during inference), we assume they are known to attackers, as in Sitawarin et al.
To find the decision boundary of kNN classifiers, we propose a deep kNN block (DkNNB). Specifically, the DkNNB is a small neural network that aims to approximate the output of the k nearest neighbors of an input $x$. The illustration is presented in Fig. 2.
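Since Section 4 states the DkNNB is implemented with a fully connected layer, a minimal NumPy sketch might look as follows; the layer sizes, initialization, and softmax output are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class DkNNB:
    """A minimal sketch of the deep kNN block: one fully connected
    layer mapping a feature vector to an estimated k-nearest-neighbor
    class distribution. Sizes and initialization are illustrative."""
    def __init__(self, feat_dim, n_classes):
        self.W = rng.normal(0.0, 0.1, (feat_dim, n_classes))
        self.b = np.zeros(n_classes)

    def forward(self, feats):
        return softmax(feats @ self.W + self.b)

block = DkNNB(feat_dim=8, n_classes=10)
p_hat = block.forward(rng.normal(size=(4, 8)))  # batch of 4 feature vectors
```

Because the block is an ordinary differentiable network, gradients of any attack loss can flow through it back to the input features, which is exactly what the hard kNN selection prevents.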
Suppose the distribution predicted by kNN and the estimated distribution inferred by DkNNB are $P(x)$ and $\hat{P}(x)$ respectively. For a general kNN classification model, its prediction is a label $\tilde{y}$, and the corresponding distribution $P(x)$ is a one-hot vector:

$$p_c(x) = \begin{cases} 1, & c = \tilde{y} \\ 0, & \text{otherwise.} \end{cases}$$

Then the loss for optimizing DkNNB towards the kNN classifier is

$$L_{cls} = -\sum_{c=1}^{C} p_c(x) \log \hat{p}_c(x).$$

Then the derivative with respect to $\hat{p}_c(x)$ is:

$$\frac{\partial L_{cls}}{\partial \hat{p}_c(x)} = -\frac{p_c(x)}{\hat{p}_c(x)},$$

which is non-zero only for the true class $c = \tilde{y}$.
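The one-hot cross-entropy loss and its gradient can be written out directly; the small `eps` is a numerical-stability constant we add for illustration:

```python
import numpy as np

def ce_loss(p, p_hat, eps=1e-12):
    """Cross-entropy between the kNN target distribution p and the
    DkNNB estimate p_hat. With a one-hot p, only the true-class term
    is non-zero."""
    return float(-np.sum(p * np.log(p_hat + eps)))

def ce_grad(p, p_hat, eps=1e-12):
    """dL/dp_hat: non-zero only where p is non-zero, i.e. only the
    true class receives any gradient signal."""
    return -p / (p_hat + eps)

p = np.array([0.0, 1.0, 0.0])      # one-hot kNN label
p_hat = np.array([0.2, 0.5, 0.3])  # DkNNB estimate
g = ce_grad(p, p_hat)
```

The gradient vector is zero everywhere except the true class, which illustrates why this loss alone cannot match the full neighbor distribution.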
Consistency learning for distribution. As detailed in Section 2.2, DkNN aggregates the probability distributions of the outputs. However, as shown above, learning with the classification loss only penalizes the true class, which is not optimal for approximating kNN classification. Besides, the targets of both kNN and the DNN are the correct classification labels, so learning only the output labels of kNN would push the DkNNB toward the same parameters as the DNN classifier. To overcome these problems and further improve attack performance, we propose to learn from the output distribution rather than the classification alone. Specifically, we define a new consistency learning (CL) loss $L_{CL}$ over the k-neighbor distribution to guide the optimization of DkNNB.
With $\lambda$ as a hyperparameter, the final loss is:

$$L = L_{cls} + \lambda L_{CL}.$$
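A hedged sketch of the combined objective: since the exact form of the CL loss is not reproduced here, we substitute an illustrative squared-error consistency term over the full distribution, so this is a stand-in rather than the paper's definition:

```python
import numpy as np

def cl_loss(p_knn, p_hat):
    """An ASSUMED distribution-consistency term (squared error over
    the full k-neighbor distribution); the paper's exact CL loss may
    differ, this only illustrates how the two terms combine."""
    return float(np.sum((p_knn - p_hat) ** 2))

def total_loss(p_onehot, p_knn, p_hat, lam=0.3):
    """Final objective: classification loss plus a lambda-weighted
    consistency loss on the soft kNN neighbor distribution."""
    cls = float(-np.sum(p_onehot * np.log(p_hat + 1e-12)))
    return cls + lam * cl_loss(p_knn, p_hat)

p_onehot = np.array([0.0, 1.0, 0.0])  # hard kNN label
p_knn = np.array([0.2, 0.6, 0.2])     # soft k-neighbor distribution
p_hat = np.array([0.2, 0.6, 0.2])     # DkNNB output matching it
```

When the DkNNB output matches the neighbor distribution exactly, the consistency term vanishes and only the classification term remains.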
4.1 Datasets and Experimental Setting
Baselines & Datasets. To demonstrate the effectiveness of the proposed AdvKNN, we evaluate our method on three datasets: MNIST, SVHN, and Fashion-MNIST. For each dataset, 750 samples (75 from each class) from the testing split are held out as the calibration set. We reimplement DkNN from Papernot et al. with the same parameters as Sitawarin et al., including the base classifier network architecture and the value of $k$. The proposed method can be applied to any gradient-based attack algorithm. We choose FGSM and BIM, optimized under the $L_\infty$ norm, as the baseline attacks; they are also used in the DkNN paper for robustness evaluation. Likewise, all hyperparameters are set the same as in the DkNN paper.
The proposed DkNNB is implemented with a fully connected layer. The hyperparameter $\lambda$ is set to 0.3 for all three datasets. Unless otherwise mentioned, kNN is conducted on the last convolution layer. The classification accuracies of the backbone network (DNN), kNN, DkNN, and DkNNB are shown in Table 1.
4.2 Results and Discussion
Fig. 3 shows a clean sample and its adversarial version generated by our method, along with their five nearest neighbors at each layer. In the left column, the majority of the 5 neighbors in each layer belong to the same class as the input, and the final predictions of kNN and DkNN are correct. However, as shown in the right column, the majority of the neighbors of the adversarial example are of an incorrect class. Both kNN and DkNN are successfully fooled by adversaries with imperceptible distortions.
Metrics. We employ kNN accuracy, DkNN accuracy, mean distortion, and mean credibility to evaluate the effectiveness of our method. Lower kNN accuracy, lower DkNN accuracy, lower mean distortion, and higher credibility mean better attack performance.
Comparison with state-of-the-art methods. Table 2 reports the results with the DkNNB added to the third layer, showing the performance of attack methods before and after combination with the proposed method. With the DkNNB, both FGSM and BIM degrade the classification accuracy of kNN and DkNN by a large margin in most cases; combining with CL improves performance further.
Fig. 4 shows how the accuracies of kNN and DkNN change with mean distortion. The purple plus marker is the best attack result on DkNN reported by Sitawarin et al. It can be seen that our method achieves the same accuracy drop with much lower distortion. In Fig. 6, we show the distribution of credibility for clean samples and for adversarial examples generated by the proposed AdvKNN under different mean distortions. It can be observed that filtering out adversaries requires sacrificing substantial accuracy on clean samples: with a credibility score threshold of 0.5, 71.42% of adversarial examples generated with mean distortion 0.3 can be detected, but 47.54% of clean samples would be filtered out too. Besides, the mean credibility of the method proposed by Sitawarin et al. is 0.1037 with mean distortion 3.476, while ours is 0.4608 with distortion 3.005, which indicates that our attack is more difficult to detect. The generated adversarial examples are shown in Fig. 6.
To evaluate the transferability of the generated adversaries, i.e., whether they can successfully attack other models, we further test performance on a LeNet5 trained with clean samples of MNIST. Table 3 compares the classification accuracy under different attack methods. Adversaries generated with the proposed method perform better than those from FGSM and BIM, indicating improved adversarial example transferability.
Ablation study. As shown in Fig. 4, the proposed DkNNB is effective when attached to any layer. Layer 3 performs best in terms of kNN accuracy, which is reasonable because kNN is conducted on layer 3. Adversaries generated with the DkNNB attached to layer 2 also perform well, indicating transferability between layers. Besides, we evaluate the attack performance under different values of $k$. As shown in Fig. 7, the proposed method consistently outperforms FGSM and BIM in terms of kNN and DkNN attack success rates.
In this paper, we put forward a new method for attacking kNN and DkNN models to evaluate their robustness against adversarial examples, which can be easily combined with existing gradient-based attacks. Specifically, we proposed a differentiable deep kNN block to approximate the distribution of the k nearest neighbors, which provides estimated gradients to guide adversary generation. Besides, we proposed a new consistency learning scheme for attack robustness against distribution-based defense models. We conducted extensive experiments demonstrating the effectiveness of each part of the proposed algorithm, as well as its superior overall performance.
References

- (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248.
- (2017) Mitigating evasion attacks to deep neural networks via region-based classification. In Proceedings of the 33rd Annual Computer Security Applications Conference, pp. 278–287.
- (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
- (2018) Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185–9193.
- (2019) Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4312–4321.
- (2019) Defense against adversarial images using web-scale nearest-neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8767–8776.
- (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
- (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582.
- (2011) Reading digits in natural images with unsupervised feature learning.
- (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint.
- (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
- (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597.
- (2018) Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765.
- (2014) Face recognition methods & applications. arXiv preprint arXiv:1403.0485.
- (2009) K-nearest neighbor. Scholarpedia 4 (2), pp. 1883.
- (2019) Harnessing the vulnerability of latent layers in adversarially trained models. arXiv preprint arXiv:1905.05186.
- (2019) Defending against adversarial examples with k-nearest neighbor. arXiv preprint arXiv:1906.09525.
- (2019) On the robustness of deep k-nearest neighbors. arXiv preprint arXiv:1903.08333.
- (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint.
- (2019) Adversarial examples for non-parametric methods: attacks, defenses and large sample limits. arXiv preprint arXiv:1906.03310.