MetaAdvDet: Towards Robust Detection of Evolving Adversarial Attacks

08/06/2019 ∙ by Chen Ma, et al. ∙ Shanghai University, Inc. Tsinghua University

Deep neural networks (DNNs) are vulnerable to adversarial attacks, which maliciously add human-imperceptible perturbations to images and thus lead to incorrect predictions. Existing studies have proposed various methods to detect adversarial attacks. However, new attack methods keep evolving and yield new adversarial examples that bypass the existing detectors. Training a detector requires tens of thousands of samples, while new attacks evolve much more frequently than such high-cost data collection can keep up with. As a result, samples of newly evolved attacks remain small in scale. To solve this few-shot problem with evolving attacks, we propose a meta-learning based robust detection method that detects new adversarial attacks with limited examples. Specifically, the learning consists of a double-network framework: a task-dedicated network and a master network, which alternately learn the detection capability for either seen attacks or new attacks. To validate the effectiveness of our approach, we construct benchmarks with few-shot-fashion protocols based on three conventional datasets, i.e. CIFAR-10, MNIST and Fashion-MNIST. Comprehensive experiments on these benchmarks verify the superiority of our approach over traditional adversarial attack detection methods.




1. Introduction

The evolving adversarial attacks threaten deep convolutional neural networks (DNNs) by adding human-imperceptible perturbations to clean images, which leads to incorrect predictions. Various defense methods have been proposed for detecting attacks; they distinguish adversarial images from real images by capturing the features of DNNs under attack (Ma et al., 2018; Bhagoji et al., 2017; Tian et al., 2018; Xu et al., 2018). However, new attack methods keep evolving and yield new adversarial examples that bypass existing detectors. For example, the C&W attack (Carlini and Wagner, 2017) was proposed to circumvent all detection techniques existing at that time. Certain detection techniques have been proposed to detect new attacks (Sorin et al., 2002; Dathathri et al., 2018), and these techniques are promising. However, most of them need tens of thousands of training examples, which is infeasible in practice: new attacks evolve much faster than high-cost data collection, which results in a few-shot learning problem with evolving attacks. This issue keeps the detection of adversarial examples challenging.

Therefore, we study how to tackle such a few-shot learning problem, and propose a meta-learning based training approach with a learning-to-learn strategy. It focuses on learning to detect a new attack from one or a few instances of that attack. We name our approach MetaAdvDet, short for Meta-learning Adversarial Detection. To this end, the approach is equipped with a double-network framework for learning from tasks, where a task is defined as a small data collection containing real examples and adversarial examples of a randomly chosen attack type. The purpose of introducing tasks is to simulate new attack scenarios. To better learn from tasks, MetaAdvDet uses one network to focus on learning individual tasks, and the other network to learn the general detection strategy over multiple tasks. Fig. 1 illustrates the training procedure of one mini-batch; more details are described in Sec. 3.2. Each task is divided into a support set and a query set, which are used for learning the basic detection capability on old attacks and for minimizing the test error on new attacks, respectively. After training, the framework efficiently detects new attacks by fine-tuning on limited examples. In contrast, DNN based methods that use the traditional training approach perform much worse than ours in detecting new attacks.

To comprehensively validate detection techniques in terms of evolving attacks, we propose evaluations along the following dimensions to validate the superiority of our approach in the few-shot problem.

Cross-adversary Dimension. To assess the capability of detecting new types of attacks in the test set with few-shot samples.

Cross-domain Dimension. To assess the capability of detecting all attacks across different domains with few-shot samples.

Cross-architecture Dimension. To assess the capability of detecting adversarial examples that are generated by attacking a classifier with a new architecture.

White-box Attack Dimension. To assess the capability of detecting white-box attacks with few-shot samples.

To validate the effectiveness of our approach along these dimensions, we propose benchmarks with the few-shot-fashion protocol on three conventional datasets, i.e. CIFAR-10, MNIST and Fashion-MNIST. The benchmarks include adversarial examples generated by various types of attacks, and they define the partition of the train set and test set to simulate the scenario of detecting evolving attacks.

In experiments, we compare our approach with end-to-end state-of-the-art methods on these benchmarks, and the results show that our approach surpasses the existing methods by a large margin.

We summarize the main contributions below:

(1) To the best of our knowledge, we are the first to define the adversarial attack detection problem as a few-shot learning problem of detecting evolving new attacks.

(2) We propose a meta-learning based approach, MetaAdvDet, which is equipped with a double-network framework with the learning-to-learn strategy for detecting evolving attacks. Benefiting from the learning-to-learn strategy, our approach achieves high performance in detecting new attacks.

(3) To comprehensively validate our approach in terms of evolving attacks, we construct benchmarks with the few-shot-fashion protocol on three datasets, i.e. CIFAR-10, MNIST and Fashion-MNIST. The benchmarks define the partition of the train set and test set to simulate the scenario of testing against evolving attacks. We believe the proposed benchmarks will be useful for future research on defending against evolving attacks.

2. Background

Many attempts have been made to detect or defend against adversarial attacks. We first introduce the defense techniques, and then the meta-learning techniques that are related to our work.

2.1. Defense Techniques

An adversary (attack algorithm) generates adversarial examples that make the classifier output incorrect predictions. Many techniques have been proposed to defend against adversarial attacks, and they generally fall into two categories.

The first category attempts to build a robust model that classifies adversarial examples correctly (Papernot et al., 2016c; Akhtar et al., 2018; Song et al., 2018; Liao et al., 2018). However, certain new attacks (Chen et al., 2017; Li et al., 2019) are deliberately designed to exploit the weaknesses of these methods and circumvent the defense. For example, Athalye et al. (Athalye et al., 2018) identify obfuscated gradients, a kind of gradient masking that leads to a false sense of security in defenses. Based on their findings, they propose new attacks that circumvent 7 of 9 defenses relying on obfuscated gradients.

Due to this difficulty, the second category of defense techniques turns to distinguishing adversarial examples from real ones, in order to improve security and detect malicious users. This category is referred to as adversarial attack detection. Unlike the first category, adversarial detection does not need to classify adversarial images correctly, but only to identify them. Essentially, a detector is a binary classifier trained on real and adversarial examples. Based on this idea, certain detection techniques (Carrara et al., 2017; Sorin et al., 2002; Metzen et al., 2017) build a subnet classifier to capture the hidden-layer features of adversarial examples. Other methods include (1) capturing the difference in a DNN's output between real and adversarial images when certain transformations are applied to the input images (Tian et al., 2018; Xu et al., 2018; Dathathri et al., 2018; Bhagoji et al., 2017), (2) utilizing the intrinsic dimensionality of adversarial regions (Ma et al., 2018), (3) employing a new loss function to encourage the DNN to learn a more distinguishable representation (Pang et al., 2018; Wan et al., 2018), (4) using statistical tests (Grosse et al., 2017), and (5) using the capsule network (Frosst et al., 2018).

However, high-cost data collection cannot keep up with the evolution frequency of attacks, which makes training detectors for new attacks difficult. For example, when a new attack first appears without published source code, most defenders have insufficient examples to train a detector. This situation makes the issue of detecting evolving attacks highly urgent. We categorize this issue as a new defense problem: a few-shot learning problem of detecting evolving attacks.

2.2. Meta-Learning

The few-shot learning problem (Vinyals et al., 2016; Snell et al., 2017), defined as learning from few samples, has been studied for a long time. Meta-learning techniques (Finn et al., 2017; Li et al., 2017; Erin Grant and Griffiths, 2018; Mishra et al., 2018; Jamal and Qi, 2019) are promising for addressing the few-shot learning problem; they usually train a meta-learner on a distribution of few-shot tasks so that it can generalize and perform well on unseen tasks. Model-agnostic meta-learning (MAML) (Finn et al., 2017) is a typical meta-learning approach, which learns an internal representation that is broadly suitable for many tasks. It learns a weight initialization such that, after updating on the support set, the model performs well on the query set. To update the weights more efficiently, Meta-SGD (Li et al., 2017) makes the meta-learner learn not only the weight initialization but also the update direction and learning rate. For a better understanding of this field, we introduce the terminology of meta-learning below.

Task: A meta-learning model (meta-learner) should be trained over a variety of tasks and optimized for the best performance on the task distribution, including potentially unseen tasks. The concept of “task” in this paper is entirely different from the one in “multi-task learning”; it is only a way of partitioning the data on which the meta-learner is trained.

Support&query set: Each task is split into two subsets: the support set for learning the basic classification on old tasks, and the query set for training in the train stage or testing in the test stage. It should be emphasized that the support set and query set of the same task have the same data distribution.

Way: a class in each task that the meta-learner wishes to discriminate. The number of ways may be specified arbitrarily and does not need to equal the number of ground-truth classes.

Shot: the number of samples in each way of the support set. For example, an $N$-way, $K$-shot classification task includes a support set with $K$ labeled examples for each of the $N$ classes.
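To make the way/shot terminology concrete, here is a minimal Python sketch of sampling an $N$-way, $K$-shot task. It is not code from MetaAdvDet: the helper name `sample_task`, the `(example, class_label)` pool format and the per-way query size `q_per_way` are illustrative assumptions. Way indices are task-local, which is why the number of ways need not match the number of ground-truth classes.

```python
import random
from collections import defaultdict


def sample_task(pool, n_way, k_shot, q_per_way, rng):
    """Sample one N-way, K-shot task (hypothetical helper).

    pool: list of (example, class_label) pairs.
    Returns (support, query) as lists of (example, way_index) pairs, where
    way_index in 0..n_way-1 is a task-local label, not the original class id.
    """
    by_class = defaultdict(list)
    for example, label in pool:
        by_class[label].append(example)
    # A class can only become a way if it has enough examples for both subsets.
    eligible = sorted(c for c, xs in by_class.items() if len(xs) >= k_shot + q_per_way)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient examples")
    support, query = [], []
    for way, cls in enumerate(rng.sample(eligible, n_way)):
        picked = rng.sample(by_class[cls], k_shot + q_per_way)
        support += [(x, way) for x in picked[:k_shot]]   # K shots per way
        query += [(x, way) for x in picked[k_shot:]]     # disjoint query examples
    return support, query


rng = random.Random(0)
pool = [(f"class{c}_img{i}", c) for c in range(10) for i in range(20)]
support, query = sample_task(pool, n_way=5, k_shot=1, q_per_way=3, rng=rng)
print(len(support), len(query))  # 5 15
```

Sampling support and query examples from one shuffled draw per class keeps the two subsets disjoint while giving them the same distribution, as the definition of support&query set requires.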

Based on the spirit of meta-learning, we propose a training method with a double-network framework and introduce a double-update scheme for achieving fast adaptation. Experiments show the superiority of our approach in detecting new attacks.

3. Approach

3.1. Overview

Evolving adversarial attacks are hard to detect because there are insufficient new adversarial examples for training the detector, which results in the few-shot learning problem. One of the keys to solving this problem is to exploit meta-learning techniques. Typical meta-learning methods (e.g. MAML (Finn et al., 2017)) are trained to learn the task distribution. Because the categories of data in each task are randomly chosen, the meta-learner acquires the capability to adapt quickly to unseen data types by learning these tasks. To model attack detection in a meta-learning style framework, we collect various types of attacks and organize the adversarial example dataset into multiple tasks (Fig. 2). Each task is a small data collection with a randomly chosen attack, representing one attacking scenario; the large number of tasks lets the meta-learner experience various attacking scenarios, so that it can adapt to new attacks rapidly. Our approach is equipped with a double-network framework with a learning-to-learn strategy, which focuses on learning how to learn new tasks faster by reusing previous experience, rather than considering new tasks in isolation. Specifically, one network of our framework focuses on learning from individual tasks (named the task-dedicated network $T$), while the other network (named the master network $M$) updates its parameters based on the gradients accumulated from $T$, to learn a general strategy over all tasks (Fig. 1). This double-network framework leads to a double-update scheme corresponding to the two networks.

Figure 2. The details of constructing tasks for training and testing the meta-learner (including the master network $M$ and the task-dedicated network $T$). Each task (support&query set) is sampled independently from the dataset, and each mini-batch consists of multiple tasks. The training employs a double-update scheme: an inner update and an outer update. In the inner update, $T$ learns on the support set; in the outer update, $M$ is updated with the gradients of $T$ accumulated on the query sets (as described in Fig. 1 and Sec. 3.2). The learned $M$ can be used to detect new attacks.

Fig. 2 shows the details of constructing tasks for training and testing the meta-learner, and Fig. 1 demonstrates the training procedure for one mini-batch; the detailed steps are shown in Algorithm 1.

3.2. Learning MetaAdvDet

As mentioned earlier, the learning-to-learn strategy is proposed to learn new attacks by reusing previous experience of detecting old attacks. Following the typical setting of meta-learning, all training data are organized into tasks, and each task is divided into two subsets: the support set for learning the basic capability of detecting old attacks, and the query set, which acts as a surrogate for new attacks so as to achieve rapid adaptation in detecting the new attacks of the test set. To learn the tasks, the meta-learner includes a double-network framework, i.e. the master network $M$ and a task-dedicated network $T$, which is cloned from $M$ to learn from individual tasks. $T$ updates its parameters on each task's support set and then calculates its gradient on the query set, which is accumulated to update $M$'s parameters (Fig. 1). Before the next task is learned, $M$'s parameters are copied again to overwrite those of $T$. Both $M$ and $T$ output classification probabilities to distinguish real and adversarial examples, corresponding to the two-way configuration. The two-way configuration stipulates that one of the ways uses real examples in all tasks. Two options are considered, i.e. the randomized-way setting, whose two-way labels are shuffled in each task, and the fixed-way setting, which uses label 1 for real examples and label 0 for adversarial examples in all tasks. We compare the effect of these two options in Sec. 5.4.
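The two-way task construction and the two labeling options can be sketched as follows. This is a hedged illustration rather than the authors' code: `build_detection_task`, its arguments and the coin flip used for the randomized-way setting are assumptions; only the label convention (1 = real, 0 = adversarial in the fixed-way setting) comes from the text.

```python
import random


def build_detection_task(real, adv_by_attack, shots, query_per_way, randomized, rng):
    """Build one two-way detection task (hypothetical helper).

    real: list of real examples.
    adv_by_attack: dict mapping an attack name to its adversarial examples.
    One way always holds real examples; the other holds examples of a single,
    randomly chosen attack, so each task simulates one attacking scenario.
    """
    attack = rng.choice(sorted(adv_by_attack))
    real_label, adv_label = 1, 0                 # fixed-way setting
    if randomized and rng.random() < 0.5:        # randomized-way: shuffle labels per task
        real_label, adv_label = 0, 1
    n = shots + query_per_way
    real_xs = rng.sample(real, n)
    adv_xs = rng.sample(adv_by_attack[attack], n)
    support = [(x, real_label) for x in real_xs[:shots]] + [(x, adv_label) for x in adv_xs[:shots]]
    query = [(x, real_label) for x in real_xs[shots:]] + [(x, adv_label) for x in adv_xs[shots:]]
    rng.shuffle(support)
    rng.shuffle(query)
    return attack, support, query


rng = random.Random(1)
real = [f"real{i}" for i in range(100)]
adv = {name: [f"{name}{i}" for i in range(100)] for name in ("FGSM", "PGD", "C&W")}
attack, support, query = build_detection_task(real, adv, shots=1, query_per_way=35,
                                               randomized=False, rng=rng)
print(attack, len(support), len(query))
```

With `shots=1` and `query_per_way=35`, the task has a 2-example support set and a 70-example query set, matching the 1-shot, 70-example training query setting of Tab. 6.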

Algorithm 1 shows the training procedure, and Fig. 1 shows the details of the training procedure for one mini-batch. At the beginning of learning task $\tau_i$ (the subscript $i$ denotes the task index), the task-dedicated network $T$ copies all the parameters $\theta$ of the master network $M$ into its own parameters $\theta'$. Then, the inner update step updates $\theta'$ by using the support set of $\tau_i$ for multiple iterations. Lines 8 and 9 of Algorithm 1 demonstrate this step, which is the same as supervised learning in a traditional DNN: we directly feed the input images to $T$ and use gradient descent to update its parameters based on the classification ground truth. Unlike existing methods that apply transformations to input images (Tian et al., 2018; Xu et al., 2018), our approach applies no transformation to the input images in this step. Finally, the meta-learner acquires rapid adaptation capability by minimizing the test error on new data, which is the role played by the outer update step on the query set. More specifically, we calculate the cross entropy loss on the query set of task $\tau_i$ to obtain the gradient w.r.t. $\theta'$; this gradient is accumulated over all tasks and finally sent to $M$. Because $M$ and $T$ share the same network structure and $T$ starts from $M$'s parameters, the accumulated gradient can be used to update the parameters of $M$, namely $\theta \leftarrow \theta - \beta \sum_{i} \nabla_{\theta'_i} \mathcal{L}^{\text{query}}_{\tau_i}(f_{\theta'_i})$, where $\theta'_i$ denotes $T$'s parameters after the inner update on $\tau_i$ and $\beta$ is the outer-update learning rate. Thus, $M$ updates $\theta$ to learn the strategy over the multi-task distribution.

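Input: master network $M$ and its parameters $\theta$, task-dedicated network $T$ and its parameters $\theta'$, the feed-forward function $f$ of $T$, max iterations $N_{iter}$, inner-update learning rate $\alpha$, outer-update learning rate $\beta$, inner-update iterations $K$, the multi-task format dataset $\mathcal{D}$, cross entropy loss function $\mathcal{L}$.
Output: the learned network $M$
1: for $t = 1$ to $N_{iter}$ do
2:     sample $n$ tasks $\{\tau_i\}_{i=1}^{n}$ from $\mathcal{D}$
3:     for $i = 1$ to $n$ do
4:         $\mathcal{D}^{spt}_i, \mathcal{D}^{qry}_i \leftarrow$ support set and query set of $\tau_i$
5:         $\theta' \leftarrow \theta$   ▷ copy parameters from $M$ to $T$
6:         ▷ $\mathcal{D}^{qry}_i$ will be used in the outer update
7:         for $k = 1$ to $K$ do
8:             Calculate $\nabla_{\theta'}\mathcal{L}(f_{\theta'})$ by using $\mathcal{D}^{spt}_i$
9:             $\theta' \leftarrow \theta' - \alpha \nabla_{\theta'}\mathcal{L}(f_{\theta'})$   ▷ inner update
10:        end for
11:        $g_i \leftarrow \nabla_{\theta'}\mathcal{L}(f_{\theta'})$ by using $\mathcal{D}^{qry}_i$
12:    end for
13:    $\theta \leftarrow \theta - \beta \sum_{i=1}^{n} g_i$   ▷ outer update
14: end for
Algorithm 1 MetaAdvDet training procedure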
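To make the double-update scheme concrete, here is a minimal, dependency-free Python sketch of the training loop. It is not the authors' implementation: a logistic-regression detector stands in for the conv-3 backbone, and the 1-D toy tasks (with a per-task shift imitating a different attack) are invented for illustration. As in the description of the outer update, the query-set gradient taken w.r.t. θ' is applied directly to θ, i.e. a first-order approximation in the style of first-order MAML, and gradients are summed rather than averaged over tasks.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_ce(params, data):
    """Mean binary cross-entropy gradient for a logistic-regression detector.

    params = [w_1, ..., w_d, b]; data = list of (features, label), label 1 = real.
    """
    g = [0.0] * len(params)
    for x, y in data:
        z = sum(w * xi for w, xi in zip(params, x)) + params[-1]
        err = sigmoid(z) - y              # dL/dz for sigmoid + cross-entropy
        for j, xi in enumerate(x):
            g[j] += err * xi
        g[-1] += err
    return [gj / len(data) for gj in g]


def meta_train(theta, sample_tasks, iterations, inner_steps, alpha, beta):
    """First-order sketch of Algorithm 1; sample_tasks() returns [(support, query), ...]."""
    for _ in range(iterations):
        accumulated = [0.0] * len(theta)
        for support, query in sample_tasks():
            theta_prime = list(theta)                       # copy M's parameters into T
            for _ in range(inner_steps):                    # inner update on the support set
                g = grad_ce(theta_prime, support)
                theta_prime = [p - alpha * gj for p, gj in zip(theta_prime, g)]
            g_query = grad_ce(theta_prime, query)           # gradient w.r.t. theta' on the query set
            accumulated = [a + gj for a, gj in zip(accumulated, g_query)]
        theta = [p - beta * a for p, a in zip(theta, accumulated)]  # outer update of M
    return theta


rng = random.Random(0)


def draw(label, shift):
    # Toy 1-D feature: real examples near +1, adversarial ones near -1 + shift.
    mean = 1.0 if label == 1 else -1.0 + shift
    return ([rng.gauss(mean, 0.3)], label)


def toy_tasks(n_tasks=4, shots=1, query_per_way=5):
    tasks = []
    for _ in range(n_tasks):
        shift = rng.uniform(-0.5, 0.5)    # a different "attack" per task
        support = [draw(1, shift) for _ in range(shots)] + [draw(0, shift) for _ in range(shots)]
        query = [draw(1, shift) for _ in range(query_per_way)] + [draw(0, shift) for _ in range(query_per_way)]
        tasks.append((support, query))
    return tasks


theta = meta_train([0.0, 0.0], toy_tasks, iterations=50, inner_steps=3, alpha=0.1, beta=0.01)
print(theta)
```

Summing the per-task query gradients is why a smaller outer learning rate is natural; averaging instead would only rescale β.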

Following the popular few-shot-fashion testing procedure (Ravi and Larochelle, 2017), the evaluation requires that each method be evaluated on all test tasks. Algorithm 2 shows the testing procedure. The few-shot-fashion testing procedure should include a fine-tune step using few-shot examples, as shown in line 6 of Algorithm 2. In experiments, we adopt a general binary DNN classifier as the baseline for comparison. The DNN uses a single network for training, whereas MetaAdvDet uses a double-network framework to obtain the learning-to-learn strategy. The experiments demonstrate the superiority of our approach in detecting new attacks (Sec. 5.5).

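Input: master network $M$ and its learned parameters $\theta$, task-dedicated network $T$ and its parameters $\theta'$, the feed-forward function $f$ of $T$, fine-tune iterations $K'$, learning rate $\alpha$, test tasks $\{\tau_i\}_{i=1}^{n}$ obtained by reorganizing the test set, cross entropy loss $\mathcal{L}$, ground truth $y_i$ of the query set of each task.
Output: the average F1 score over all tasks
1: for $i = 1$ to $n$ do   ▷ iterate over all test tasks
2:     $\mathcal{D}^{spt}_i, \mathcal{D}^{qry}_i \leftarrow$ support set and query set of $\tau_i$
3:     $\theta' \leftarrow \theta$   ▷ copy parameters to ensure each task is tested independently
4:     for $k = 1$ to $K'$ do
5:         Calculate $\nabla_{\theta'}\mathcal{L}(f_{\theta'})$ by using $\mathcal{D}^{spt}_i$
6:         $\theta' \leftarrow \theta' - \alpha \nabla_{\theta'}\mathcal{L}(f_{\theta'})$   ▷ fine-tune step
7:     end for
8:     $\hat{y}_i \leftarrow f_{\theta'}(\mathcal{D}^{qry}_i)$   ▷ get prediction of query set of task $\tau_i$
9:     $F1_i \leftarrow$ F1 score of $\hat{y}_i$ against $y_i$
10: end for
11: $F1 \leftarrow \frac{1}{n}\sum_{i=1}^{n} F1_i$
12: return $F1$
Algorithm 2 MetaAdvDet testing procedure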
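The testing procedure can likewise be sketched in plain Python. A toy logistic-regression detector stands in for the real network, and the names `meta_test`, `f1_real` and the example task are hypothetical. The sketch shows the two points the procedure stresses: each task starts from its own copy of θ, so fine-tuning on one task never leaks into another, and the reported number is the mean of per-task F1 scores.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(params, x):
    return sum(w * xi for w, xi in zip(params, x)) + params[-1]


def grad_ce(params, data):
    """Mean cross-entropy gradient of a logistic-regression detector (label 1 = real)."""
    g = [0.0] * len(params)
    for x, y in data:
        err = sigmoid(logit(params, x)) - y
        for j, xi in enumerate(x):
            g[j] += err * xi
        g[-1] += err
    return [gj / len(data) for gj in g]


def f1_real(y_true, y_pred):
    # F1 with "real" (label 1) as the positive class: 2TP / (2TP + FP + FN).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def meta_test(theta, test_tasks, finetune_steps, alpha):
    """Fine-tune a fresh copy of theta per task, then score that task's query set."""
    scores = []
    for support, query in test_tasks:
        theta_prime = list(theta)                  # copy so tasks are tested independently
        for _ in range(finetune_steps):            # fine-tune step on the support set
            g = grad_ce(theta_prime, support)
            theta_prime = [p - alpha * gj for p, gj in zip(theta_prime, g)]
        y_pred = [1 if sigmoid(logit(theta_prime, x)) >= 0.5 else 0 for x, _ in query]
        y_true = [y for _, y in query]
        scores.append(f1_real(y_true, y_pred))
    return sum(scores) / len(scores)               # average F1 over all tasks


theta = [2.0, 0.0]
task = ([([1.0], 1), ([-1.0], 0)],
        [([0.8], 1), ([1.2], 1), ([-0.9], 0), ([-1.1], 0)])
print(meta_test(theta, [task], finetune_steps=5, alpha=0.1))  # 1.0
print(theta)  # [2.0, 0.0]: the master parameters are left untouched
```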

The F1 score of the query set is adopted as the metric for evaluating the performance of detection techniques (Sec. 5.2). Note that the F1 score is calculated on individual tasks, and the final F1 score is obtained by averaging the F1 scores of all tasks, which follows the few-shot-fashion testing procedure of MiniImagenet (Ravi and Larochelle, 2017); this step is shown in line 11 of Algorithm 2. All compared methods should use this metric and include the fine-tune step for a fair comparison.

4. Proposed Benchmark

4.1. Adversarial Example Datasets Construction

To validate the effectiveness of our approach, we construct adversarial example datasets based on conventional datasets. The built datasets use fifteen adversaries to generate examples from CIFAR-10 (Krizhevsky, 2009), MNIST (LeCun and Cortes, 2010) and Fashion-MNIST (Xiao et al., 2017), and are named AdvCIFAR, AdvMNIST and AdvFashionMNIST respectively. To train detectors that distinguish real examples from adversarial examples, each dataset includes an additional real-example category whose data are directly transferred from the original dataset (i.e. CIFAR-10 etc.). All fifteen types of adversarial examples are generated with the CleverHans library (Papernot et al., 2018) by attacking classifiers with three architectures for each adversary, namely a 4 conv-layers network (conv-4), ResNet-10 (He et al., 2016) and ResNet-18 (He et al., 2016). Note that the MI-FGSM, BIM and PGD attacks adopt the $L_\infty$ norm version, while the C&W and Deepfool attacks adopt the $L_2$ norm version. These choices are based on the attack success rate. In addition, the adversarial examples of the L-BFGS attack (Szegedy et al., 2014) are used as the validation set. The BPDA attack (Athalye et al., 2018), which exploits the obfuscated gradients of defenses, is not used, because our approach does not rely on obfuscated gradients. The statistics of the adversarial example datasets are shown in Tab. 1.
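The datasets are generated with CleverHans; the sketch below does not use that library. It only illustrates what the simplest listed adversary, FGSM, computes: a single step of size ε along the sign of the loss gradient w.r.t. the input. The toy logistic-regression classifier, the [0, 1] pixel range, the function name and all values are assumptions for illustration.

```python
import math


def fgsm_logistic(x, y, w, b, eps, lo=0.0, hi=1.0):
    """One-step FGSM against p(y=1|x) = sigmoid(w.x + b) (illustrative only).

    For binary cross-entropy the input gradient is (p - y) * w, so each pixel
    moves by eps in the sign of that gradient, then is clipped to [lo, hi].
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [min(hi, max(lo, xi + eps * s)) for xi, s in zip(x, sign)]


w, b = [1.0, -2.0, 0.5], 0.0
x = [0.5, 0.5, 0.5]
x_adv = fgsm_logistic(x, y=1, w=w, b=b, eps=0.1)
print([round(v, 6) for v in x_adv])  # [0.4, 0.6, 0.4]
```

Because only the sign of the gradient is used, every pixel changes by at most ε, which is what makes FGSM an $L_\infty$-bounded attack.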

adversary AdvCIFAR (train, test) AdvMNIST (train, test) AdvFashionMNIST (train, test)
FGSM (Goodfellow et al., [n. d.])
MI-FGSM (Dong et al., 2018)
BIM (Kurakin et al., 2017)
PGD (Madry et al., 2018)
C&W (Carlini and Wagner, 2017)
jsma (Papernot et al., 2016b)
EAD (Chen et al., 2018)
SPSA (Uesato et al., 2018)
Spatial Transformation (Xiao et al., 2018)
VAT (Miyato et al., 2016)
semantic (Hosseini et al., 2017)
MaxConfidence (Goodfellow et al., 2019)
Deepfool (Moosavi-Dezfooli et al., 2016)
NewtonFool (Jang et al., 2017)
L-BFGS (Szegedy et al., 2014)(validation set)
Table 1. Our adversarial example datasets contain examples generated by attacking different architectures, including a 4 conv-layers network (conv-4), ResNet-10 and ResNet-18. This table lists the number of adversarial examples generated by successfully attacking the conv-4 network.

4.2. Cross-Adversary Benchmark

To validate the effectiveness of detection techniques in detecting new attacks, we configure the train set and test set to contain no common type of adversarial examples, which simulates this situation. To this end, the attacks are grouped by category, and we propose the cross-adversary benchmark, which assigns different adversary groups to the train set and test set.

Train Adversary Group | Test Adversary Group | Validation | Train&Test
FGSM, MI-FGSM, BIM, PGD, C&W, jsma, SPSA, VAT, MaxConfidence | EAD, semantic, Deepfool, Spatial Transformation, NewtonFool | L-BFGS | same domain
Table 2. The definition of adversary groups in the cross-adversary benchmark.

Tab. 2 shows the adversary groups of the cross-adversary benchmark. The grouping principles of this benchmark are: (1) each adversary is assigned to one group only; (2) similar adversaries are assigned to the same group. For example, the MI-FGSM adversary is a modification of FGSM, so the two are similar and we put them into one group. Based on this benchmark, the train set and test set must not include attacks of the same group simultaneously. Note that in this benchmark, the detectors are trained on the train-set examples of the train group's adversaries (e.g. the train set of FGSM in Tab. 1). Similarly, the detectors are evaluated on the test-set examples of the test group's adversaries.
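The grouping of Tab. 2 and the two splitting rules can be expressed as a small configuration plus a check. This is an illustrative sketch; the constant and function names and the `(attack, split, example)` record format are assumptions, and the adversary names are spelled as in Tab. 1.

```python
TRAIN_GROUP = {"FGSM", "MI-FGSM", "BIM", "PGD", "C&W", "jsma", "SPSA", "VAT", "MaxConfidence"}
TEST_GROUP = {"EAD", "semantic", "Deepfool", "Spatial Transformation", "NewtonFool"}
VALIDATION_GROUP = {"L-BFGS"}


def check_groups(*groups):
    """Principle (1): every adversary belongs to exactly one group."""
    seen = set()
    for group in groups:
        overlap = seen & group
        if overlap:
            raise ValueError(f"adversaries assigned to more than one group: {sorted(overlap)}")
        seen |= group
    return seen


def cross_adversary_split(examples, train_group, test_group):
    """examples: iterable of (attack, split, example) with split in {"train", "test"}.

    Detectors are trained on the train split of train-group attacks and
    evaluated on the test split of test-group attacks.
    """
    train = [ex for attack, split, ex in examples if attack in train_group and split == "train"]
    test = [ex for attack, split, ex in examples if attack in test_group and split == "test"]
    return train, test


all_adversaries = check_groups(TRAIN_GROUP, TEST_GROUP, VALIDATION_GROUP)
print(len(all_adversaries))  # 15
```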

4.3. Cross-Domain Benchmark

Protocol Train Domain Test Domain Attack Types Test Shots
#1 AdvMNIST AdvFashionMNIST all attacks 1-shot, 5-shot
#2 AdvFashionMNIST AdvMNIST all attacks 1-shot, 5-shot
Table 3. The cross-domain benchmark consists of 2 protocols on the AdvMNIST and AdvFashionMNIST.

The concept of a domain refers to an adversarial example dataset, e.g. AdvMNIST. Since different domains have different data distributions, the cross-domain benchmark is more challenging. To evaluate the capability of detecting adversarial examples generated from a new domain, the detectors are trained on one domain (Train Domain) and tested on the other domain (Test Domain). In this benchmark, we focus on the transferability between two datasets, namely AdvMNIST and AdvFashionMNIST, as listed in Tab. 3. Note that in this benchmark, all types of attacks are used to train the detector.

4.4. Cross-Architecture Benchmark

Protocol Train Arch Test Arch Attack Types Test Shots Train&Test
#1 ResNet-10 ResNet-18 all attacks 1-shot, 5-shot same domain
#2 ResNet-18 ResNet-10 all attacks 1-shot, 5-shot same domain
#3 conv-4 ResNet-10 all attacks 1-shot, 5-shot same domain
#4 ResNet-10 conv-4 all attacks 1-shot, 5-shot same domain
Table 4. The cross-architecture benchmark consists of 4 protocols, in which the examples of the train set and test set are generated by attacking different networks.

Existing studies show that adversarial examples generated by attacking one architecture can fool another architecture (Papernot et al., 2016a; Liu et al., 2017). To validate the detection capability in this situation, this benchmark stipulates that the train set and test set include adversarial examples that come from attacking different architectures. For example, the detector is trained on adversarial examples generated by attacking a classifier with the conv-4 network (Train Arch), but tested on those generated by attacking ResNet-10 (Test Arch). Tab. 4 shows the details of this benchmark; all types of attacks are used to train the detectors. Three architectures are used, namely conv-4, ResNet-10 and ResNet-18. Note that the concept of architecture in this benchmark only refers to the classifier's backbone used during adversarial example generation, and is not related to the detector model.

4.5. White-box Attack Benchmark

The white-box attack means that the adversary has full knowledge of both the image classifier and the detector. In other words, the adversary needs to fool both the classifier and the detector simultaneously, which makes the attack more challenging to defend against. We use the targeted iterative FGSM (I-FGSM) (Kurakin et al., 2017) and C&W (Carlini and Wagner, 2017) attacks to simulate white-box attacks with the method presented by Carlini and Wagner (Carlini and Wagner, 2017). The basic idea is to construct a new model that combines the original classifier and the detector. If the original classifier has $N$ output labels, the new model outputs $N+1$ labels, with the last label indicating whether the input is an adversarial example. More specifically, let us denote the new model as $G$, which combines the classifier $F$ and the detector $D$. $F$'s output logits are denoted as $Z_F(x)$, $D$'s output (the probability that $x$ is adversarial) as $D(x)$, and $G$'s output logits as $Z_G(x)$. $Z_G(x)$ is constructed using the following formula:

$$Z_G(x)_i = \begin{cases} Z_F(x)_i, & \text{if } i \le N, \\ \left(D(x) + 0.5\right) \cdot \max_{j} Z_F(x)_j, & \text{if } i = N + 1. \end{cases}$$

It is easy to see that when an input is detected as an adversarial example by $D$, $D(x)$ is larger than 0.5, so $Z_G(x)_{N+1}$ is larger than $Z_G(x)_i$ for all $i \le N$ (assuming $\max_j Z_F(x)_j > 0$). If an input is detected as a real example, $G$ assigns it the same label as $F$ does. In this way, the new model $G$ combines $F$ and $D$.
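A minimal Python sketch of the combined model's logits, assuming the extra logit takes the form $(D(x) + 0.5) \cdot \max_j Z_F(x)_j$, which reproduces the 0.5 threshold described above. The function names are hypothetical, the detector output is taken to be the probability of the adversarial class, and the largest classifier logit is assumed positive (with a negative maximum, scaling by a factor above 1 would lower rather than raise the extra logit).

```python
def combined_logits(z_f, d_adv):
    """Z_G(x) built from classifier logits Z_F(x) (length N) and detector output D(x).

    d_adv is the detector's probability that x is adversarial. The extra
    (N+1)-th logit exceeds max_j Z_F(x)_j exactly when d_adv > 0.5, provided
    that the largest classifier logit is positive.
    """
    top = max(z_f)
    return list(z_f) + [(d_adv + 0.5) * top]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


z_f = [2.0, 1.0, 0.5]                            # N = 3 classes
print(argmax(combined_logits(z_f, d_adv=0.9)))   # 3, i.e. label N+1 ("adversarial")
print(argmax(combined_logits(z_f, d_adv=0.1)))   # 0, the same label as F
```

A targeted attack on $G$ must therefore push a wrong class above both the true class and the extra "adversarial" logit, which is exactly the requirement to fool $F$ and $D$ at once.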

Now, we can use the targeted I-FGSM or C&W adversary to attack this new model $G$ to generate white-box adversarial examples. The target label is chosen so that $F$ classifies the example incorrectly while the example bypasses the detector $D$. In MetaAdvDet, $D$ is the learned master network $M$, which is the network being attacked. Although the white-box attack causes misclassification, MetaAdvDet can benefit from the learning-to-learn strategy to recover the correct prediction when limited white-box examples are provided, following the steps in Algorithm 2.

5. Experiment

5.1. Experiment Setting

index module parameter configuration
1 conv-layer 3×3 kernel, channel = 64, pad = 0
2 batch normalization momentum = 1
3 ReLU -
4 max pooling 2×2 pooling, stride = 2
Table 5. The modules of one block in the conv-3 backbone of MetaAdvDet and other compared methods. The conv-3 backbone consists of 3 such blocks in total, and the last block connects to a fully-connected layer to output a vector with two probabilities.
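Table 5 fixes enough of each block to work out the feature size that reaches the fully-connected layer. The arithmetic below is a sketch assuming square inputs (32×32 for CIFAR-10, 28×28 for MNIST and Fashion-MNIST), floor-mode pooling, and that the last block's output is flattened directly into the fully-connected layer; batch normalization and ReLU do not change the spatial size. The function names are hypothetical.

```python
def conv_out(size, kernel=3, pad=0, stride=1):
    # Spatial size after a convolution (floor division).
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    # Spatial size after max pooling without ceil mode.
    return (size - kernel) // stride + 1


def conv3_feature_dim(in_size, blocks=3, channels=64):
    """Flattened feature size entering the fully-connected layer of conv-3."""
    size = in_size
    for _ in range(blocks):
        size = pool_out(conv_out(size))   # conv 3x3 (pad 0) -> BN -> ReLU -> 2x2 max pool
    return channels * size * size


print(conv3_feature_dim(32))  # CIFAR-10: 32->30->15->13->6->4->2, so 64*2*2 = 256
print(conv3_feature_dim(28))  # MNIST / Fashion-MNIST: 28->26->13->11->5->3->1, so 64
```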

name default value description
shots 1 number of examples in a way, MetaAdvDet should set the same shots in both training and testing.
ways 2 alias of class number, data of the same way come from using the same adversary to attack the same category’s images.
train query set size 70 number of examples of a query set in training.
test query set size 30 number of examples of a query set in testing.
task number 30 number of tasks in each mini-batch.
inner update times 12 iteration times of inner update during training
fine-tune times 20 iteration times of fine-tune during testing.
total tasks 20000 total tasks in the constructed tasks.
inner learning rate 0.001 learning rate of inner update.
outer learning rate 0.0001 learning rate of outer update.
dataset AdvCIFAR the dataset for ablation study
backbone conv-3 the backbone of MetaAdvDet & compared methods
benchmark cross-adversary the benchmark for ablation study
Table 6. The default parameters configuration which is used in the ablation study in Sec. 5.4, and also used in other comparative experiments of Sec. 5.5, Sec. 5.6, Sec. 5.7 and Sec. 5.8.

In the construction of tasks, we set the total number of tasks to 20,000, which covers all the samples of the original datasets. During the learning process, 30 tasks are randomly chosen from these tasks to form each mini-batch. In each task, the two-way setting is applied, which makes MetaAdvDet a binary classifier that distinguishes adversarial examples from real examples, as shown in Fig. 2. The inner-update learning rate is set to 0.001 empirically. Because the gradients of all tasks are summed in the outer update of Algorithm 1, the outer-update learning rate is set to 0.0001, which is 10 times smaller than the inner-update learning rate. The number of training epochs is set to 4, because we observe that the F1 score on the validation set is stable after 4 epochs. The query set size used for the outer update is set to 70 for two ways, i.e. 35 samples in each way. The number of fine-tune iterations is set to 20, which reaches stable performance (Fig. 4). The full parameter configuration, set empirically on the validation set, is shown in Tab. 6.
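Collecting the defaults in one place makes the derived quantities in the paragraph above explicit. This is a sketch using a frozen dataclass with hypothetical field names; the even split of the test query set between the two ways is an assumption carried over from the training setting, which the text states explicitly only for training (35 per way).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaAdvDetConfig:
    """Defaults of Tab. 6 plus the epoch count from the text (field names are hypothetical)."""
    shots: int = 1
    ways: int = 2
    train_query_size: int = 70
    test_query_size: int = 30
    tasks_per_batch: int = 30
    inner_update_times: int = 12
    finetune_times: int = 20
    total_tasks: int = 20000
    inner_lr: float = 0.001
    outer_lr: float = 0.0001
    epochs: int = 4

    def __post_init__(self):
        # Query sets are assumed to be split evenly between the real and adversarial ways.
        for size in (self.train_query_size, self.test_query_size):
            if size % self.ways:
                raise ValueError(f"query size {size} is not divisible by {self.ways} ways")

    @property
    def train_query_per_way(self):
        return self.train_query_size // self.ways

    @property
    def test_query_per_way(self):
        return self.test_query_size // self.ways

    @property
    def lr_ratio(self):
        # The outer update sums gradients over tasks, hence the smaller outer rate.
        return self.inner_lr / self.outer_lr


cfg = MetaAdvDetConfig()
print(cfg.train_query_per_way, cfg.test_query_per_way)  # 35 15
print(round(cfg.lr_ratio, 6))  # 10.0
```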

5.2. Evaluation Metric

Our evaluation protocol requires all compared methods to be evaluated on 1000 test tasks that cover all test samples. To quantify the detection performance of all detection methods, we adopt the F1 score, following Liang et al. (Liang et al., 2018) and Sabokrou et al. (Sabokrou et al., 2018). It is defined as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$

We use label 1 to represent real examples and 0 to represent adversarial examples, so TP is the number of correctly detected real examples, FN is the number of real examples incorrectly detected as adversarial examples, and FP is the number of adversarial examples detected as real examples. Note that the final F1 score is obtained by averaging the F1 scores of all tasks (Algorithm 2).
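The per-task averaging matters: the mean of per-task F1 scores is generally not the F1 of all query predictions pooled together. A short sketch with hypothetical function names; treating an undefined precision or recall as 0 is an assumed convention that the text does not specify.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 with real examples (label 1) as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0   # convention: 0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_task_f1(tasks):
    """Average of per-task F1 scores, as in the few-shot testing procedure."""
    return sum(f1_score(y_true, y_pred) for y_true, y_pred in tasks) / len(tasks)


tasks = [([1, 0], [1, 0]),   # task A: both examples correct -> F1 = 1
         ([1, 0], [0, 0])]   # task B: the real example is missed -> F1 = 0
print(mean_task_f1(tasks))   # 0.5
pooled = f1_score([1, 0, 1, 0], [1, 0, 0, 0])
print(round(pooled, 4))      # 0.6667, differs from the per-task average
```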

5.3. Compared Methods

The compared state-of-the-art methods are selected according to two principles: (1) to comply with the few-shot-fashion benchmarks, a compared method must be an end-to-end learning approach that can be fine-tuned in the test stage; (2) a compared method must be able to detect new attacks, so that detection techniques can be evaluated and compared in terms of evolving adversarial attacks. Based on these principles, MetaAdvDet is compared with an image rotation transformation based detector, named TransformDet (Tian et al., 2018), and a detection technique based on secret fingerprints, named NeuralFP (Dathathri et al., 2018). NeuralFP is trained for 100 epochs on each dataset, and TransformDet is trained for 10 epochs on each dataset. In the fine-tune step, because NeuralFP is trained on real examples, we extract the real examples from the support set for fine-tuning. In addition, NeuralFP obtains its F1 score by determining the best threshold for each task. We configure these methods following their original settings (Tian et al., 2018; Dathathri et al., 2018) (Tab. 7).

Method | Train Set
DNN | train set of adversarial example datasets (e.g. AdvCIFAR etc.)
DNN (balanced) | train set of adversarial example datasets, down-sampled to keep classes balanced
NeuralFP (Dathathri et al., 2018) | real examples of the train set in the original dataset (e.g. CIFAR-10 etc.)
TransformDet (Tian et al., 2018) | train set of adversarial example datasets, down-sampled if necessary
MetaAdvDet | constructed tasks of the train set in adversarial example datasets
Test Set (shared by all methods): constructed tasks of the test set in adversarial example datasets; each task contains a support set and a query set, and performance is evaluated on the query set.
Validation Set (shared by all methods): constructed tasks of the validation set in adversarial example datasets; each task contains a support set and a query set, and performance is evaluated on the query set.
Table 7. The configuration of the train, validation and test sets of all compared methods on the proposed benchmarks.

We adopt a binary neural network classifier as the baseline, denoted as DNN. DNN is trained on all data of the adversarial example dataset, and its backbone is the same as that of MetaAdvDet, a 3-conv-layer network (Tab. 5). Because the adversarial example datasets are highly class-imbalanced, containing many more adversarial examples than real ones, we also train a second DNN on class-balanced data obtained by down-sampling, denoted as DNN (balanced). The dataset configurations of the different methods are listed in Tab. 7.
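Down-sampling to balance real and adversarial examples can be sketched as random undersampling of the larger classes. The text does not state whether the balanced subset is drawn once or re-drawn during training; this sketch, built around the hypothetical function `downsample_balanced`, draws it once with a fixed seed.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def downsample_balanced(samples: Sequence[T], labels: Sequence[int],
                        seed: int = 0) -> Tuple[List[T], List[int]]:
    """Randomly drop examples from the larger classes until every class has
    the size of the smallest one (label 1 = real, 0 = adversarial)."""
    by_class: Dict[int, List[T]] = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_min = min(len(xs) for xs in by_class.values())
    rng = random.Random(seed)  # fixed seed -> reproducible subset
    out_x: List[T] = []
    out_y: List[int] = []
    for y in sorted(by_class):
        # sample without replacement so no example is duplicated
        for x in rng.sample(by_class[y], n_min):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

Undersampling discards most adversarial examples, which is the trade-off DNN (balanced) makes in exchange for an unbiased class prior.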

5.4. Ablation Study

To inspect the effect of each key parameter, we conduct controlled experiments on AdvCIFAR, varying one parameter at a time while keeping the others fixed at the values listed in Tab. 6. Fig. 3 and Fig. 4 show the results on the cross-adversary benchmark.

(a) train query set size study
(b) task number study
Figure 3. Ablation study results for the training query set size and the number of tasks in a training mini-batch.
(a) shots study
(b) fine-tune iterations study
Figure 4. Ablation study results of shots and fine-tune iterations. MetaAdvDet outperforms the baseline DNN and DNN (balanced) by a large margin.

From Fig. 4, the following conclusions can be drawn:

1) MetaAdvDet outperforms DNN with only a few fine-tuning iterations; with a single fine-tuning iteration, MetaAdvDet already surpasses DNN at every number of fine-tuning iterations tested (Fig. 4(b)).

2) Training on class-balanced data helps DNN (balanced) improve over DNN under few-shot fine-tuning (Fig. 4).

| Shots | Fixed-way | Randomized-way |
|---|---|---|
| 1 | 0.686 | 0.616 |
| 5 | 0.787 | 0.772 |

Table 8. F1 scores of the fixed-way and randomized-way settings on AdvCIFAR. In the randomized-way setting, the labels of the two ways are shuffled in each task; in the fixed-way setting, label 1 always denotes real examples and label 0 adversarial examples.

Tab. 8 compares the F1 scores of the fixed-way and randomized-way label assignments on AdvCIFAR. The fixed-way setting outperforms the randomized-way setting for both 1-shot and 5-shot tasks, so we use the fixed-way setting in the following experiments.
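The difference between the two label assignments can be illustrated with a minimal 2-way task sampler. This is a sketch, not the authors' implementation: it assumes one way holds real examples and the other adversarial examples, that support and query examples are drawn without overlap, and that `sample_task`, `k_shot` and `n_query` are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[object, int]  # (input, label)


def sample_task(real: Sequence[object], adv: Sequence[object], k_shot: int,
                n_query: int, randomized_way: bool, rng: random.Random
                ) -> Tuple[List[Example], List[Example]]:
    """Build one 2-way task: a k-shot support set and a query set.

    fixed-way:      real -> label 1, adversarial -> label 0 in every task
    randomized-way: the two labels are shuffled independently per task
    """
    labels = [1, 0]
    if randomized_way:
        rng.shuffle(labels)
    real_label, adv_label = labels
    support: List[Example] = []
    query: List[Example] = []
    for pool, label in ((real, real_label), (adv, adv_label)):
        # draw support and query together so the two sets are disjoint
        picked = rng.sample(list(pool), k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query
```

With `randomized_way=False`, every task maps real examples to label 1, so the detector can carry a consistent meaning of each output across tasks; with shuffling, that meaning must be re-inferred from the support set of each task.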

5.5. Cross-Adversary Benchmark Result

To compare our approach with other state-of-the-art methods on the cross-adversary benchmark, we collect the results of TransformDet (Tian et al., 2018), NeuralFP (Dathathri et al., 2018), the baseline DNN and DNN (balanced). Tab. 9 shows that MetaAdvDet outperforms the baselines and the other methods in every setting except 1-shot detection on AdvCIFAR, where NeuralFP scores slightly higher (0.698 vs. 0.685). We therefore conclude that MetaAdvDet achieves high performance in detecting new attacks given only a few examples of each attack.

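Our approach is particularly effective at detecting attacks whose appearance differs substantially from that of the training attacks, such as Spatial Transformation (Xiao et al., 2018). Tab. 10 shows the results for three representative attacks. On Spatial Transformation and semantic attacks, MetaAdvDet achieves the best scores in both the 1-shot and 5-shot settings; on NewtonFool, it trails NeuralFP and TransformDet in the 1-shot setting but achieves the best 5-shot score.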

| Dataset | Method | F1 score (1-shot) | F1 score (5-shot) |
|---|---|---|---|
| AdvCIFAR | DNN | 0.495 | 0.639 |
| | DNN (balanced) | 0.536 | 0.643 |
| | NeuralFP (Dathathri et al., 2018) | 0.698 | 0.700 |
| | TransformDet (Tian et al., 2018) | 0.662 | 0.697 |
| | MetaAdvDet (ours) | 0.685 | 0.791 |
| AdvMNIST | DNN | 0.812 | 0.852 |
| | DNN (balanced) | 0.797 | 0.808 |
| | NeuralFP (Dathathri et al., 2018) | 0.780 | 0.906 |
| | TransformDet (Tian et al., 2018) | 0.840 | 0.904 |
| | MetaAdvDet (ours) | 0.987 | 0.993 |
| AdvFashionMNIST | DNN | 0.782 | 0.885 |
| | DNN (balanced) | 0.744 | 0.850 |
| | NeuralFP (Dathathri et al., 2018) | 0.798 | 0.817 |
| | TransformDet (Tian et al., 2018) | 0.712 | 0.879 |
| | MetaAdvDet (ours) | 0.848 | 0.944 |

Table 9. F1 scores on the cross-adversary benchmark, using adversarial examples generated by attacking the classifier with the conv-4 architecture.

| Dataset | Adversary | Method | F1 score (1-shot) | F1 score (5-shot) |
|---|---|---|---|---|
| AdvCIFAR | Spatial Transformation (Xiao et al., 2018) | DNN | 0.498 | 0.599 |
| | | DNN (balanced) | 0.529 | 0.589 |
| | | NeuralFP (Dathathri et al., 2018) | 0.708 | 0.696 |
| | | TransformDet (Tian et al., 2018) | 0.633 | 0.660 |
| | | MetaAdvDet (ours) | 0.811 | 0.920 |
| | semantic (Hosseini et al., 2017) | DNN | 0.488 | 0.644 |
| | | DNN (balanced) | 0.529 | 0.657 |
| | | NeuralFP (Dathathri et al., 2018) | 0.698 | 0.700 |
| | | TransformDet (Tian et al., 2018) | 0.662 | 0.688 |
| | | MetaAdvDet (ours) | 0.763 | 0.855 |
| | NewtonFool (Jang et al., 2017) | DNN | 0.511 | 0.664 |
| | | DNN (balanced) | 0.542 | 0.670 |
| | | NeuralFP (Dathathri et al., 2018) | 0.696 | 0.696 |
| | | TransformDet (Tian et al., 2018) | 0.658 | 0.716 |
| | | MetaAdvDet (ours) | 0.647 | 0.735 |

Table 10. F1 score of representative adversaries on the AdvCIFAR dataset, cross-adversary benchmark.

5.6. Cross-Domain Benchmark Result

| Train domain | Test domain | Method | F1 score (1-shot) | F1 score (5-shot) |
|---|---|---|---|---|
| AdvMNIST | AdvFashionMNIST | DNN (balanced) | 0.698 | 0.813 |
| | | NeuralFP (Dathathri et al., 2018) | 0.748 | 0.811 |
| | | TransformDet (Tian et al., 2018) | 0.664 | 0.808 |
| | | MetaAdvDet (ours) | 0.799 | 0.870 |
| AdvFashionMNIST | AdvMNIST | DNN (balanced) | 0.950 | 0.977 |
| | | NeuralFP (Dathathri et al., 2018) | 0.775 | 0.836 |
| | | TransformDet (Tian et al., 2018) | 0.934 | 0.940 |
| | | MetaAdvDet (ours) | 0.956 | 0.981 |

Table 11. F1 scores on the cross-domain benchmark, evaluated on adversarial examples generated by attacking the conv-4 network. All types of attacks are used to train the detectors.

In the cross-domain benchmark, models are trained on one domain and tested on the test set of the other domain (Sec. 4.3). We use DNN (balanced) instead of DNN in this benchmark because all types of attacks are used for training, which would make the training data of DNN highly imbalanced. Tab. 11 shows that MetaAdvDet achieves the best F1 scores in both directions, and that its advantage is largest on the harder test domain: when training on AdvMNIST and testing on AdvFashionMNIST, MetaAdvDet outperforms DNN (balanced) by 0.101 in 1-shot F1 score (0.799 vs. 0.698).

| Dataset | Train arch | Test arch | Method | F1 score (1-shot) | F1 score (5-shot) |
|---|---|---|---|---|---|
| AdvCIFAR | ResNet-10 | ResNet-18 | NeuralFP (Dathathri et al., 2018) | 0.713 | 0.709 |
| | | | TransformDet (Tian et al., 2018) | 0.758 | 0.880 |
| | | | DNN (balanced) | 0.702 | 0.768 |
| | | | MetaAdvDet (ours) | 0.832 | 0.902 |
| | ResNet-18 | ResNet-10 | NeuralFP (Dathathri et al., 2018) | 0.712 | 0.703 |
| | | | TransformDet (Tian et al., 2018) | 0.788 | 0.874 |
| | | | DNN (balanced) | 0.711 | 0.752 |
| | | | MetaAdvDet (ours) | 0.840 | 0.889 |
| | conv-4 | ResNet-10 | NeuralFP (Dathathri et al., 2018) | 0.712 | 0.703 |
| | | | TransformDet (Tian et al., 2018) | 0.763 | 0.868 |
| | | | DNN (balanced) | 0.723 | 0.779 |
| | | | MetaAdvDet (ours) | 0.835 | 0.885 |
| | ResNet-10 | conv-4 | NeuralFP (Dathathri et al., 2018) | 0.709 | 0.702 |
| | | | TransformDet (Tian et al., 2018) | 0.766 | 0.885 |
| | | | DNN (balanced) | 0.739 | 0.790 |
| | | | MetaAdvDet (ours) | 0.854 | 0.918 |
| AdvMNIST | ResNet-10 | ResNet-18 | NeuralFP (Dathathri et al., 2018) | 0.906 | 0.882 |
| | | | TransformDet (Tian et al., 2018) | 0.973 | 0.988 |
| | | | DNN (balanced) | 0.943 | 0.972 |
| | | | MetaAdvDet (ours) | 0.984 | 0.993 |
| | ResNet-18 | ResNet-10 | NeuralFP (Dathathri et al., 2018) | 0.894 | 0.738 |
| | | | TransformDet (Tian et al., 2018) | 0.967 | 0.990 |
| | | | DNN (balanced) | 0.912 | 0.953 |
| | | | MetaAdvDet (ours) | 0.981 | 0.991 |
| | conv-4 | ResNet-10 | NeuralFP (Dathathri et al., 2018) | 0.894 | 0.738 |
| | | | TransformDet (Tian et al., 2018) | 0.972 | 0.985 |
| | | | DNN (balanced) | 0.897 | 0.959 |
| | | | MetaAdvDet (ours) | 0.963 | 0.983 |
| | ResNet-10 | conv-4 | NeuralFP (Dathathri et al., 2018) | 0.917 | 0.961 |
| | | | TransformDet (Tian et al., 2018) | 0.984 | 0.992 |
| | | | DNN (balanced) | 0.958 | 0.978 |
| | | | MetaAdvDet (ours) | 0.990 | 0.996 |
| AdvFashionMNIST | ResNet-10 | ResNet-18 | NeuralFP (Dathathri et al., 2018) | 0.813 | 0.856 |
| | | | TransformDet (Tian et al., 2018) | 0.936 | 0.974 |
| | | | DNN (balanced) | 0.848 | 0.932 |
| | | | MetaAdvDet (ours) | 0.960 | 0.979 |
| | ResNet-18 | ResNet-10 | NeuralFP (Dathathri et al., 2018) | 0.820 | 0.838 |
| | | | TransformDet (Tian et al., 2018) | 0.935 | 0.972 |
| | | | DNN (balanced) | 0.829 | 0.918 |
| | | | MetaAdvDet (ours) | 0.957 | 0.976 |
| | conv-4 | ResNet-10 | NeuralFP (Dathathri et al., 2018) | 0.820 | 0.838 |
| | | | TransformDet (Tian et al., 2018) | 0.946 | 0.970 |
| | | | DNN (balanced) | 0.920 | 0.968 |
| | | | MetaAdvDet (ours) | 0.946 | 0.975 |
| | ResNet-10 | conv-4 | NeuralFP (Dathathri et al., 2018) | 0.817 | 0.911 |
| | | | TransformDet (Tian et al., 2018) | 0.945 | 0.979 |
| | | | DNN (balanced) | 0.886 | 0.945 |
| | | | MetaAdvDet (ours) | 0.967 | 0.982 |

Table 12. F1 scores on the cross-architecture benchmark.

5.7. Cross-Architecture Benchmark Result

Tab. 12 shows the results of the cross-architecture benchmark. Because NeuralFP is trained only on real examples, which do not depend on the attacked architecture, the same NeuralFP model is tested on the examples of each test architecture (Test Arch). MetaAdvDet outperforms the other methods in nearly all train and test architecture combinations; TransformDet matches or slightly exceeds it only in some settings that train on conv-4 and test on ResNet-10. These results demonstrate the advantage of MetaAdvDet on the cross-architecture benchmark.

5.8. White-box Attack Benchmark Result

In Tab. 13, we present the detection performance on the white-box attack benchmark. We omit NeuralFP (Dathathri et al., 2018) because it detects attacks by thresholding rather than by classification, so it cannot be used with the white-box attack method of Sec. 4.5. Tab. 13 shows that MetaAdvDet detects white-box attacks effectively even when only one white-box example is provided, achieving the best F1 score in every setting except the I-FGSM attack on AdvCIFAR, where TransformDet scores higher. Note that in MetaAdvDet the white-box attack targets the master network of the meta-learner, whereas in the other methods it targets the detector itself.

| Dataset | Method | I-FGSM (1-shot) | I-FGSM (5-shot) | C&W (1-shot) | C&W (5-shot) |
|---|---|---|---|---|---|
| AdvCIFAR | DNN (balanced) | 0.466 | 0.537 | 0.459 | 0.527 |
| | TransformDet (Tian et al., 2018) | 0.593 | 0.728 | 0.443 | 0.502 |
| | MetaAdvDet (ours) | 0.553 | 0.633 | 0.548 | 0.607 |
| AdvMNIST | DNN (balanced) | 0.857 | 0.956 | 0.814 | 0.913 |
| | TransformDet (Tian et al., 2018) | 0.864 | 0.952 | 0.775 | 0.893 |
| | MetaAdvDet (ours) | 0.968 | 0.994 | 0.920 | 0.990 |
| AdvFashionMNIST | DNN (balanced) | 0.745 | 0.890 | 0.726 | 0.853 |
| | TransformDet (Tian et al., 2018) | 0.837 | 0.920 | 0.747 | 0.853 |
| | MetaAdvDet (ours) | 0.849 | 0.963 | 0.882 | 0.967 |

Table 13. F1 scores on the white-box attack benchmark.

5.9. Inference Time

| Method | DNN | NeuralFP (Dathathri et al., 2018) | TransformDet (Tian et al., 2018) | MetaAdvDet (ours) |
|---|---|---|---|---|
| Inference time (ms) | | | | |

Table 14. The inference time (ms) of all methods.

Tab. 14 reports the inference time (excluding fine-tuning steps) of all methods, measured in milliseconds on one NVIDIA GeForce GTX 1080 Ti GPU. MetaAdvDet's inference time is comparable to that of DNN, because both methods use the same network architecture and the same feed-forward procedure at inference. In contrast, TransformDet applies multiple transformations to the input image, which increases its inference time, and NeuralFP tests multiple thresholds to determine the best one for detection, which significantly increases its inference time.
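The text does not describe how the timings were collected. A generic wall-clock harness such as the following averages over repeated calls after a warm-up; it is a sketch, and `mean_latency_ms` is a hypothetical helper. For GPU models, the callable must block until the result is ready, because kernel launches are asynchronous; otherwise the measured time is too low.

```python
import time
from typing import Callable


def mean_latency_ms(infer: Callable[[], object], n_warmup: int = 10,
                    n_runs: int = 100) -> float:
    """Average wall-clock time of one inference call, in milliseconds.

    Warm-up calls are excluded so that one-off costs (lazy initialization,
    caching) do not inflate the average. For asynchronous GPU execution,
    `infer` itself must wait for the device to finish.
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()  # monotonic, high-resolution clock
    for _ in range(n_runs):
        infer()
    return (time.perf_counter() - start) * 1000.0 / n_runs
```

Timing a detector that performs several forward passes per input (as TransformDet does with its transformations) in the same harness shows its latency growing with the number of passes, while a single-pass detector such as DNN stays at the cost of one forward pass.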

6. Conclusion

In this paper, we present a meta-learning based approach for detecting evolving adversarial attacks with limited examples. To this end, the approach is equipped with a double-network framework, which includes a task-dedicated network and a master network that learn from individual tasks and from the task distribution, respectively. In this way, the approach achieves rapid adaptation to new attacks. The experimental results show that: (1) NeuralFP obtains lower F1 scores than MetaAdvDet in nearly all settings of the different benchmarks (Tab. 9, Tab. 11 and Tab. 12), indicating that NeuralFP, which is trained only on real examples, cannot detect evolving attacks effectively; (2) the lowest results are obtained on the AdvCIFAR dataset (Tab. 9, Tab. 12 and Tab. 13), suggesting that adversarial examples generated on CIFAR-10 are more difficult to detect; (3) MetaAdvDet performs well on the cross-adversary (Tab. 9), cross-domain (Tab. 11), cross-architecture (Tab. 12) and white-box attack (Tab. 13) benchmarks, showing that MetaAdvDet is well suited to detecting evolving attacks with limited examples.