Tensorflow implementation of Meta Adversarial Training for Adversarial Patch Attacks on Tiny ImageNet.
Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly cause the addition of universal perturbations to inputs of a model that can fool it in a variety of contexts. Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal perturbations is computationally expensive since the optimal universal perturbations depend on the model weights which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal perturbations along with model training. MAT requires little extra computation while continuously adapting a large set of perturbations to the current model. We present results for universal patch and universal perturbation attacks on image classification and traffic-light detection. MAT considerably increases robustness against universal patch attacks compared to prior work.
Deep learning is currently the most promising method for open-world perception tasks such as in automated driving and robotics. However, the use in safety-critical domains is questionable, since a lack of robustness of deep learning-based perception has been demonstrated (Szegedy et al., 2014; Goodfellow et al., 2015; Metzen et al., 2017; Hendrycks & Dietterich, 2019).
Physical-world adversarial attacks (Kurakin et al., 2017; Athalye et al., 2018; Braunegg et al., 2020) are one of the most problematic failures in the robustness of deep learning. Examples of such attacks are fooling models for traffic sign recognition (Chen et al., 2018; Eykholt et al., 2018a, b; Huang et al., 2019), … (…, 2016, 2017), optical flow estimation (Ranjan et al., 2019), person detection (Thys et al., 2019; Wu et al., 2020b; Xu et al., 2020), and LiDAR perception (Cao et al., 2019). In this work, we focus on two subsets of these physical-world attacks: local ones, which place a printed pattern in a scene that does not overlap with the target object (Lee & Kolter, 2019; Huang et al., 2019), and global ones, which attach a mainly translucent sticker on the lens of a camera (Li et al., 2019). Note that these physical-world attacks have corresponding digital-domain attacks, in which the attacker directly modifies the signal after it was received by the sensor and before it is processed by the model. The digital-domain counterpart of the adversarial camera sticker is a type of universal adversarial perturbation (Moosavi-Dezfooli et al., 2017), while the digital adversarial patch attack (Brown et al., 2017; Anonymous, 2020) corresponds to physical patch attacks (Lee & Kolter, 2019; Huang et al., 2019).
We focus on increasing robustness against digital-domain attacks. Digital-domain attacks are strictly stronger than the corresponding physical-world attacks since they give the attacker complete control over the change of the signal. In contrast, physical-world attacks need to be invariant under effects such as scale, rotation, object position, and lighting conditions, which the attacker cannot control. Therefore, a system robust against digital-domain attacks is also robust against the corresponding physical-world attacks.
Currently, the most promising method for increasing robustness against adversarial attacks is adversarial training (Goodfellow et al., 2015; Madry et al., 2018). Adversarial training simulates an adversarial attack for every mini-batch and trains the model to become robust against such an attack. Adversarial training against digital-domain universal perturbations or patches is complicated by the fact that these attacks are computationally much more expensive than image-dependent adversarial attacks, and existing approaches for speeding up adversarial training (Shafahi et al., 2019; Zhang et al., 2019; Zheng et al., 2020) are not directly applicable. Existing approaches for tailoring adversarial training to universal perturbations or patches either refrain from simulating attacks in every mini-batch (Moosavi-Dezfooli et al., 2017; Hayes & Danezis, 2018; Perolat et al., 2018), which bears the risk that the model easily overfits these fixed or rarely updated universal perturbations, or use proxy attacks that are computationally cheaper, such as “universal adversarial training” (UAT) (Shafahi et al., 2018) and “shared adversarial training” (SAT) (Mummadi et al., 2019). Proxy-attack approaches face the challenge of balancing the implicit trade-off between simulating universal perturbation attacks accurately and keeping the computational cost of the proxy attacks small.
We propose meta adversarial training (MAT), which falls into the category of proxy attacks (code is available at http://github.com/boschresearch/meta-adversarial-training). MAT combines adversarial training with meta-learning. We summarize the key novel contributions of MAT and refer to Section 3 for details:
MAT amortizes the cost of computing universal perturbations by sharing information about optimal perturbations over consecutive steps of model training, which reduces the cost of generating strong approximations of universal perturbations considerably. In contrast to UAT (Shafahi et al., 2018), MAT uses meta-learning for sharing of information rather than joint training, which empirically generates stronger perturbations and a more robust model.
MAT meta-learns a large set of perturbations concurrently. While a model easily overfits a single perturbation, even if it changes as in UAT, overfitting is much less likely for a larger set of perturbations such as those generated with MAT.
MAT encourages diversity of the generated perturbations by assigning random but fixed target classes and step sizes to each perturbation during meta-learning. This prevents many perturbations from focusing on exploiting the same vulnerability of a model.
We perform an extensive empirical evaluation and ablation study of MAT on image classification and traffic-light detection tasks against a variety of attacks to show the robustness of MAT against universal patches and perturbations (see Section 4). We refer to Figure 1 for an illustration of MAT for universal patch attacks against traffic light detection.
We review work on generating universal perturbations, defending against them, and meta-learning.
Adversarial perturbations are changes to the input that are crafted with the intention of fooling a model’s prediction on the input. Universal perturbations are a special case in which one perturbation needs to be effective on the majority of samples from the input distribution. Most work focuses on small additive perturbations that are bounded by some $\ell_p$-norm constraint. For example, Moosavi-Dezfooli et al. (2017) proposed the first approach by extending the DeepFool algorithm (Moosavi-Dezfooli et al., 2016). Similarly, Metzen et al. (2017) extended the iterative fast gradient sign method (Kurakin et al., 2017) for generating universal perturbations on semantic image segmentation. Mopuri et al. (2017, 2018) presented data-independent attacks and Hayes & Danezis (2018) proposed using a generative model for learning a diverse distribution of universal perturbations. Li et al. (2019) presented a physical-world attack in which a translucent sticker is placed on the lens of a camera, which adds a universal perturbation to the image taken by the camera, and showed that this can fool an image classification system.
Other types of universal perturbations are so-called adversarial patches (Brown et al., 2017). In these universal patch attacks, the adversary can arbitrarily modify a small part of the image, typically a connected rectangular area, while leaving the remaining part of the image unchanged. Following Athalye et al. (2018), randomizing conditions such as location, rotation, scale, and lighting during the attack can make the universal patch sufficiently effective to fool the model when it is printed out and placed in the physical world. Later work has generalized these physical-world attacks to object detection (Lee & Kolter, 2019; Huang et al., 2019) and optical flow estimation (Ranjan et al., 2019).
First works for defending against universal perturbations are based on training a model against a fixed or slowly updated set/distribution of universal perturbations: Moosavi-Dezfooli et al. (2017) precompute a set of universal perturbations that are used during training, Hayes & Danezis (2018) learn a generative model of universal perturbations, and Perolat et al. (2018) build a slowly increasing set of universal perturbations concurrently with model training. A shortcoming of these approaches is that the model might overfit the fixed or slowly changing distribution of universal perturbations. However, re-computing universal perturbations in every mini-batch from scratch is prohibitively expensive. To address this issue, SAT (Mummadi et al., 2019) trains a model against so-called shared perturbations. These shared perturbations do not have to be universal but only need to fool the model on a fixed subset of the batch. However, since the shared perturbations are recomputed in every mini-batch, SAT assumes that a few gradient steps are sufficient to find strong perturbations from random initialization. In contrast, our method meta-learns strong initial perturbations. In UAT (Shafahi et al., 2018), training the neural network’s weights and updating a single universal perturbation happen concurrently, which scales to large datasets. However, our experiments in Section 4 indicate that a single incrementally and slowly updated perturbation is not sufficiently strong and diverse for making a model robust against all possible universal perturbations. Instead, our method meta-learns a large and diverse collection of perturbations during training.
For defending against adversarial patches, Chiang et al. (2020) proposed an approach of extending interval-bound propagation (Gowal et al., 2019) to the patch threat model. While this allows certification of robustness, it only scales to tiny patches and reduces clean accuracy considerably. Wu et al. (2020a) proposed the “defense against occlusion attack”, which applies adversarial training to inputs perturbed with input-dependent adversarial patches placed at specific positions determined, for example, by the input gradient magnitude. Since they generate patches from scratch, they require an expensive optimization of the patch for every training batch. Moreover, robustness against stronger attacks such as those proposed in Section 4.1 remains unclear. Saha et al. (2019) hypothesize that vulnerability of object detectors against adversarial patches stems from contextual reasoning. Accordingly, they propose Grad-defense which penalizes strong dependence of object detections on their context in a data-driven manner, where dependence is determined by Grad-CAM (Selvaraju et al., 2019). Lastly, some non-adversarial data augmentation techniques resemble the universal adversarial patch scenario: they add a Gaussian noise patch (Lopes et al., 2019) or a patch from a different image (CutMix) (Yun et al., 2019) to each input. CutMix is conceptually very similar to the out-of-context defense (Saha et al., 2019). However, as demonstrated in our experiments in Section 4, even though these approaches increase robustness against occlusions, they are unlikely to increase robustness against universal patch attacks.
Gradient-based meta-learning methods such as MAML (Finn et al., 2017) or REPTILE (Nichol et al., 2018) allow learning initial parameters for a class of optimization tasks, so that one can find close-to-optimal parameters on a novel task from the distribution with a small number of gradient steps. Moreover, meta-learning can also be used to learn the task optimizer itself such as by Xiong & Hsieh (2020) in the context of adversarial training. While it is common to meta-learn initial weights for neural networks, we propose that these algorithms can also be used to meta-learn initial values for universal perturbations. In this work, we combine REPTILE with adversarial training because of the low computational overhead of REPTILE; however, in principle other gradient-based meta-learning methods could also be used as part of our method.
In this section, we propose a novel combination of adversarial training with meta-learning that trains models to be robust against universal perturbations.
Let $\mathcal{D}$ be a distribution over $d$-dimensional datapoints $x \in \mathbb{R}^d$ and corresponding labels $y$, $\theta$ the model parameters to be optimized, and $L$ a loss function. Moreover, let $\Pi$ be the set of allowed perturbations and $f(\pi, x, y, \xi)$ be a function that applies a perturbation $\pi \in \Pi$ to a datapoint $x$, potentially dependent on the label $y$ and some randomness $\xi$. For universal perturbations (Moosavi-Dezfooli et al., 2017), one may choose $\Pi = \{\pi \in \mathbb{R}^d : \|\pi\|_p \le \epsilon\}$ and $f(\pi, x, y, \xi) = x + \pi$ for some small $\epsilon > 0$ and norm $p$. Alternatively, for universal patch attacks (Brown et al., 2017), we may define a mask $m \in \{0, 1\}^d$, set $\Pi = [0, 1]^d$, and let $f(\pi, x, y, \xi) = (1 - \mathcal{T}_\xi(m)) \odot x + \mathcal{T}_\xi(m) \odot \mathcal{T}_\xi(\pi)$, with $\mathcal{T}_\xi$ being a sequence of stochastic transformations and $\xi$ some randomness governing the stochasticity of the transformations. That is, each patch and mask are consistently transformed with $\mathcal{T}_\xi$, e.g., translated, scaled, and rotated, and the transformed patch is applied to the input where the transformed mask is 1, while the input remains unchanged otherwise.
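To make the two choices of the perturbation function concrete, here is a minimal pure-Python sketch on single-channel images stored as nested lists. It assumes an l_inf bound for the additive case, pixel values in [0, 1], and a transformation family restricted to random integer translations (scaling and rotation are omitted); the helper names and the clamping of the patch to the image border are illustrative assumptions, not part of the original method.

```python
import random


def clip(v, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, v))


def apply_universal_perturbation(pi, x, eps):
    """f(pi, x, y, xi) = x + pi for an l_inf-bounded pi (p = inf is an assumption).

    pi and x are H x W nested lists; pi is projected onto [-eps, eps] and the
    result is clipped to the assumed valid pixel range [0, 1]."""
    return [[clip(xv + clip(pv, -eps, eps), 0.0, 1.0) for xv, pv in zip(xr, pr)]
            for xr, pr in zip(x, pi)]


def apply_patch(patch, x, max_shift, rng):
    """f(pi, x, y, xi) = (1 - T(m)) * x + T(m) * T(pi) with T a random translation.

    Only integer translations of a square patch around the image center are
    modeled; the mask m is implicitly 1 inside the translated patch, 0 elsewhere."""
    h, w, p = len(x), len(x[0]), len(patch)
    dy = rng.randint(-max_shift, max_shift)  # randomness xi: vertical shift
    dx = rng.randint(-max_shift, max_shift)  # randomness xi: horizontal shift
    # Top-left corner of the shifted patch, clamped so the patch stays inside the image.
    top = clip((h - p) // 2 + dy, 0, h - p)
    left = clip((w - p) // 2 + dx, 0, w - p)
    out = [row[:] for row in x]  # input remains unchanged where the mask is 0
    for i in range(p):
        for j in range(p):
            out[top + i][left + j] = patch[i][j]  # mask is 1: take the patch value
    return out
```

With a 64x64 image and a 24x24 patch, the clamping in this sketch keeps the whole patch inside the image even for the largest shifts.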
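Following Mummadi et al. (2019), we define the universal adversarial risk as

$$\mathcal{R}_{\text{uadv}}(\theta) = \max_{\pi \in \Pi} \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \mathbb{E}_{\xi}\, L\big(f(\pi, x, y, \xi), y; \theta\big) \qquad (1)$$

where we drop the explicit dependence of $\mathcal{R}_{\text{uadv}}$ on $\Pi$, $f$, and $L$. Generally, we are interested in finding model parameters $\theta^*$ that minimize the universal adversarial risk, denoted as $\theta^* = \arg\min_\theta \mathcal{R}_{\text{uadv}}(\theta)$. This corresponds to the standard min-max saddle point formulation of adversarial training introduced by Madry et al. (2018), where we incrementally update the model parameters $\theta_t$ by computing $\nabla_\theta \mathcal{R}_{\text{uadv}}(\theta_t)$ based on $\pi^*_t = \arg\max_{\pi \in \Pi} \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \mathbb{E}_{\xi}\, L(f(\pi, x, y, \xi), y; \theta_t)$ (or more precisely an approximation of $\pi^*_t$). However, in contrast to standard adversarial training, the inner maximization problem is optimized over an expected value with respect to the data distribution $\mathcal{D}$ and potential randomness $\xi$, making it more expensive to solve (even approximately). As the optimal $\pi^*_t$ of the inner maximization at step $t$ of the outer minimization depends on the parameter value $\theta_t$, this maximization over $\pi$ needs to be repeated in every step of the outer minimization, making the direct minimization of $\mathcal{R}_{\text{uadv}}$ intractable.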
Existing work has addressed this in different ways. One approach (Moosavi-Dezfooli et al., 2017; Hayes & Danezis, 2018; Perolat et al., 2018) relaxes the explicit dependence of $\pi^*$ on $\theta$ and computes a set or distribution over $\pi$ for some parameter checkpoints of $\theta$, and then applies these perturbations to the model while updating its parameters $\theta$. One shortcoming is that the outer minimization of $\theta$ can converge to a value $\hat{\theta}$ for which the precomputed set or distribution over $\pi$ loses its effectiveness, even though there are still other choices of $\pi$ which are effective for $\hat{\theta}$. Another approach is proposed by Mummadi et al. (2019): they replace the distribution $\mathcal{D}$ in the inner maximization with the current batch of the outer minimization. The effectiveness of this procedure hinges on the ability to efficiently approximate this inner maximization with few gradient steps. In summary, the main challenge of using adversarial training to increase model robustness against universal perturbations is efficiently approximating $\pi^*_t$ for the current $\theta_t$ in every step of the outer minimization.
In contrast with the aforementioned approaches and similar to UAT (Shafahi et al., 2018), we exploit the property that one step of the outer minimization only applies a small change to $\theta$; thus, for consecutive steps $t$ and $t+1$ of the outer minimization, the resulting inner maximization problems for finding $\pi^*_t$ and $\pi^*_{t+1}$ are closely related (Zheng et al., 2020). UAT exploits this property by initializing the inner maximization at $t+1$ with the (approximate) solution for $t$ and performing a single gradient step on a single batch in the inner maximization at $t+1$. A potential shortcoming of this method is that it uses only a single gradient step and thus implements joint training of the parameters $\theta$ and the perturbation $\pi$, which does not allow capturing higher-order derivatives of the loss function (Nichol et al., 2018) and may therefore learn suboptimal initial parameters.
In order to address this shortcoming, we propose to meta-learn initial values for universal perturbations by approaching the optimization problems with gradient-based meta-learning: in parallel to updating $\theta_t$ in the outer minimization, we meta-learn an initialization $\pi^{\text{meta}}_t$, which we refer to as the “meta-perturbation” at time step $t$ of the outer minimization. This allows approximating the solution $\pi^*_t$ of the inner optimization problem with few gradient steps. More precisely, we use the REPTILE (Nichol et al., 2018) meta-learning algorithm with the iterative fast gradient sign method (I-FGSM) (Kurakin et al., 2017) as task learner. In the inner maximization, we employ $K$ iterations of I-FGSM with the following update: $\pi^{(0)}_t = \pi^{\text{meta}}_t$ and $\pi^{(k+1)}_t = \mathrm{Proj}_\Pi\big[\pi^{(k)}_t + \alpha\, \mathrm{sgn}\big(\nabla_\pi L(f(\pi^{(k)}_t, x_t, y_t, \xi_t), y_t; \theta_t)\big)\big]$, where $k$ indexes the inner maximization iterations, $\mathrm{Proj}_\Pi$ denotes projection onto the set $\Pi$, and $\alpha$ is the step size of I-FGSM. The key difference compared to standard I-FGSM and PGD (Madry et al., 2018) is that the initialization $\pi^{(0)}_t$ is neither constant nor randomly sampled but meta-learned.
The resulting perturbation $\pi^{(K)}_t$ is used two-fold: first, it is used by the REPTILE meta-learner for updating $\pi^{\text{meta}}_t$ with the update $\pi^{\text{meta}}_{t+1} = \pi^{\text{meta}}_t + \beta\,(\pi^{(K)}_t - \pi^{\text{meta}}_t)$, where $\beta$ is the learning rate of REPTILE. Second, $\pi^{(K)}_t$ is used in the next step of the outer minimization as an approximation of the optimal $\pi^*_t$ for the sample $(x_t, y_t)$ with randomness $\xi_t$. Learning the universal perturbation in UAT can be seen as a special case of our procedure for $K = 1$ and $\beta = 1$.
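The inner maximization and the REPTILE step can be sketched as follows for a flat perturbation vector under an l_inf constraint. This is a simplified illustration, not the released TensorFlow code: the loss gradient is passed in as a hypothetical callback `grad_fn`, and the toy linear loss in the usage lines is chosen only so that the result is easy to check. Setting `k_steps=1` and `beta=1.0` reduces the procedure to the single-step joint update of UAT.

```python
def sign(v):
    """Elementwise sign as used by (I-)FGSM: -1, 0, or +1."""
    return (v > 0) - (v < 0)


def project_linf(pi, eps):
    """Projection onto the assumed perturbation set {pi : ||pi||_inf <= eps}."""
    return [max(-eps, min(eps, p)) for p in pi]


def ifgsm_from_meta(meta_pi, grad_fn, alpha, eps, k_steps):
    """K steps of I-FGSM that start from the meta-perturbation instead of zero/random."""
    pi = list(meta_pi)  # pi^(0) = meta-perturbation
    for _ in range(k_steps):
        g = grad_fn(pi)  # gradient of the loss on the single sample (x_t, y_t, xi_t)
        pi = project_linf([p + alpha * sign(gi) for p, gi in zip(pi, g)], eps)
    return pi  # pi^(K)


def reptile_update(meta_pi, adapted_pi, beta):
    """REPTILE: move the meta-perturbation towards the adapted one by a factor beta."""
    return [m + beta * (a - m) for m, a in zip(meta_pi, adapted_pi)]


# Toy usage: loss L(pi) = w . pi, so the gradient is the constant vector w.
w = [1.0, -2.0, 0.5]
meta = [0.0, 0.0, 0.0]
adapted = ifgsm_from_meta(meta, lambda pi: w, alpha=0.03, eps=0.1, k_steps=5)
meta = reptile_update(meta, adapted, beta=0.5)
```

In MAT, `adapted` would also be used for the model update, while `meta` carries information to the next step.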
We estimate the expected loss in the I-FGSM task learner based on a single sample $(x_t, y_t, \xi_t)$. Moreover, we use the same sample in all $K$ steps of I-FGSM at time $t$ as well as in the outer minimization step of updating $\theta_t$. This reduces variance and makes computation more efficient, but at the cost of a biased estimate of $\pi^*_t$: I-FGSM will converge to a $\pi^{(K)}_t$ that is overfit to $(x_t, y_t)$ and $\xi_t$. Compared to a perturbation optimized over the entire distributions of $(x, y)$ and $\xi$, $\pi^{(K)}_t$ will incur a higher loss on the sample. Nevertheless, since we typically choose a small number of I-FGSM steps $K$, we expect only weak overfitting, and we expect the gains from reduced variance to more than compensate for the increased bias.
While the procedure proposed in Section 3.2 allows for meta-learning of a single meta-perturbation $\pi^{\text{meta}}$, one such meta-perturbation can easily get trapped in a local optimum from which gradient-based meta-learning cannot easily escape. For instance, in a classification task with $C$ classes, there will likely be at least $C$ locally optimal perturbations. More precisely, for each class, there is at least one optimum, which corresponds to the perturbation that maximizes the model’s prediction of this class. Hence, we propose a meta-learning approach that learns not just a single meta-perturbation but an entire set of $N$ meta-perturbations, where $N$ should be chosen proportional to $C$. For each sample, we select one of these meta-perturbations, which is used for initializing I-FGSM and later updated by REPTILE.
However, meta-learning a set of meta-perturbations in this way with the same optimizer and objective will not automatically result in a diverse set of meta-perturbations. For instance, many of the perturbations might focus on exploiting similar weaknesses of a model, such as triggering a misclassification into the same class. Moreover, using the same task learner with the same step size might result in perturbations with similar properties. To alleviate this problem, we encourage diversity of the generated set of meta-perturbations: for every meta-perturbation $\pi^{\text{meta}}_i$, we randomly assign a target class $y^{\text{target}}_i$ and perform a targeted I-FGSM attack. This prevents many perturbations from converging to similar patterns that fool the model into predicting the same target. Moreover, we also assign a randomly chosen fixed step size $\alpha_i$ for I-FGSM to every meta-perturbation. Larger step sizes correspond to meta-perturbations that explore the space of allowed perturbations more globally, while smaller step sizes result in more fine-grained attacks. We empirically evaluate the effectiveness of these heuristics in Section 4.
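A targeted variant of I-FGSM descends on the loss of the assigned target class instead of ascending on the loss of the true label. The sketch below assumes a toy linear classifier with softmax cross-entropy, for which the input gradient has the closed form W^T (softmax - onehot(target)); the function name and the toy model are hypothetical, and `alpha` plays the role of the fixed per-meta-perturbation step size.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def targeted_ifgsm(W, x, pi0, target, alpha, eps, k_steps):
    """Targeted I-FGSM on a toy linear classifier with logits = W (x + pi)."""
    pi = list(pi0)
    for _ in range(k_steps):
        logits = [sum(w * (xi + p) for w, xi, p in zip(row, x, pi)) for row in W]
        probs = softmax(logits)
        # d CE(target) / d input = W^T (softmax - onehot(target))
        delta = [pc - (1.0 if c == target else 0.0) for c, pc in enumerate(probs)]
        grad = [sum(W[c][d] * delta[c] for c in range(len(W))) for d in range(len(x))]
        # Step *against* the gradient to make the target class more likely,
        # then project back onto the l_inf ball of radius eps.
        pi = [max(-eps, min(eps, p - alpha * ((g > 0) - (g < 0)))) for p, g in zip(pi, grad)]
    return pi
```

Because each meta-perturbation keeps its own target and step size, different meta-perturbations are pushed towards different classes at different granularities.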
We summarize the proposed meta adversarial training (MAT) in Algorithm 1. The function INIT (see Algorithm 2 in the appendix) is responsible for initializing the set $\Pi^{\text{meta}}$ consisting of $N$ meta-perturbations $\pi^{\text{meta}}_i$ along with corresponding targets $y^{\text{target}}_i$ and step sizes $\alpha_i$. For classification tasks, we select the target as one of the $C$ classes in a round-robin fashion. Moreover, we select the step size log-uniformly from a fixed interval. We initialize the meta-perturbations either by sampling uniformly at random from $\Pi$ or by (sub-sampling) an actual datapoint, which corresponds to an on-manifold initialization akin to CutMix (Yun et al., 2019). This data initialization was concurrently proposed by Yang et al. (2020b), and Yang et al. (2020a) found that such texture patches can be adversarial even without further optimization. The function SELECT (see Algorithm 3 in the appendix) samples $s$ trials of $\pi^{\text{meta}}_i$ uniformly at random from $\Pi^{\text{meta}}$ and returns, for the sample $(x, y)$ with randomness $\xi$, the trial which maximizes the loss $L(f(\pi^{\text{meta}}_i, x, y, \xi), y; \theta)$.
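INIT and SELECT could look roughly like the following. The dictionary layout, the flat vector representation of perturbations, and the `loss_fn` callback (standing in for the loss of the current sample under a given perturbation) are assumptions made for illustration; data initialization here simply copies a datapoint that is assumed to already lie in the allowed set, without the sub-sampling mentioned above.

```python
import math
import random


def init_meta_perturbations(n, num_classes, alpha_range, pi_bounds, dim, rng, data=None):
    """Hypothetical INIT: n meta-perturbations with round-robin targets and
    log-uniform step sizes; uniform random init in a box, or data init."""
    lo_a, hi_a = alpha_range
    lo_p, hi_p = pi_bounds
    metas = []
    for i in range(n):
        target = i % num_classes  # round-robin assignment of target classes
        # Log-uniform step size: uniform in log-space, then exponentiate.
        alpha = math.exp(rng.uniform(math.log(lo_a), math.log(hi_a)))
        if data is None:
            pi = [rng.uniform(lo_p, hi_p) for _ in range(dim)]  # uniform random init
        else:
            pi = list(rng.choice(data))  # data init; assumes the datapoint lies in Pi
        metas.append({"pi": pi, "target": target, "alpha": alpha})
    return metas


def select(metas, loss_fn, s, rng):
    """Hypothetical SELECT: draw s indices uniformly, return the one with the highest loss."""
    trials = [rng.randrange(len(metas)) for _ in range(s)]
    return max(trials, key=lambda idx: loss_fn(metas[idx]["pi"]))
```

Returning an index rather than a copy lets the later REPTILE step update the selected meta-perturbation in place.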
Lines 8-11 of Algorithm 1 present the core of MAT, consisting of (i) inner maximization with I-FGSM of a perturbation $\pi$ that was initialized from a meta-perturbation (Line 8), (ii) a step of outer minimization of $\theta$ with an optimizer like SGD on a pair of perturbed input $f(\pi, x, y, \xi)$ and corresponding label $y$ (Line 10), and (iii) the meta-learning update of the respective meta-perturbation with REPTILE (Line 11). While Algorithm 1 shows the procedure for a batch size of one, it can easily be run with larger batch sizes. The only required change is that REPTILE-based meta-learning must handle the situation where the same meta-perturbation is selected and optimized for several elements in a batch. In this case, the meta-learning update becomes $\pi^{\text{meta}}_i \leftarrow \pi^{\text{meta}}_i + \frac{\beta}{|\mathcal{I}_i|} \sum_{j \in \mathcal{I}_i} (\pi^{(K)}_j - \pi^{\text{meta}}_i)$, where $\mathcal{I}_i$ indexes the perturbations $\pi^{(K)}_j$ in the batch that were initialized with the same meta-perturbation $\pi^{\text{meta}}_i$.
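For batches, one way to realize the grouped REPTILE step is to average the displacements of all perturbations that share a meta-perturbation and then apply a single update with the REPTILE learning rate. Averaging (rather than summing) is an assumption of this sketch, as are the function and argument names below.

```python
from collections import defaultdict


def batched_reptile_update(metas, assignments, adapted, beta):
    """metas: list of meta-perturbation vectors.
    assignments[b]: index of the meta-perturbation used for batch element b.
    adapted[b]: perturbation pi^(K) obtained for batch element b."""
    groups = defaultdict(list)
    for idx, pi in zip(assignments, adapted):
        groups[idx].append(pi)  # collect all perturbations sharing one initializer
    new_metas = [list(m) for m in metas]  # unselected meta-perturbations stay unchanged
    for idx, pis in groups.items():
        m = metas[idx]
        # Mean displacement of the adapted perturbations from their shared initializer.
        mean_delta = [sum(pi[d] - m[d] for pi in pis) / len(pis) for d in range(len(m))]
        new_metas[idx] = [md + beta * dd for md, dd in zip(m, mean_delta)]
    return new_metas
```

With a batch size of one, each group has a single member and this reduces to the plain REPTILE update.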
We briefly summarize the main advantages of MAT compared to prior work: as opposed to UAT, MAT meta-learns a diverse set of meta-perturbations with I-FGSM concurrently to model training rather than jointly training model parameters and a single perturbation with FGSM. Compared to SAT (Mummadi et al., 2019), MAT does not treat every inner maximization problem independently but meta-learns strong initializers, allowing MAT to find stronger perturbations with no more computational cost than standard adversarial training (see Section A). In contrast to the work of Moosavi-Dezfooli et al. (2017); Hayes & Danezis (2018); Perolat et al. (2018), MAT computes novel perturbations in every iteration of model training (outer minimization). We would also like to note that MAT meta-learns perturbations but not model weights and thus results in a standard trained model that does not require test-time adaptation.
We evaluate the performance of MAT, ablated versions of MAT, and baselines. We present results for universal patch attacks on image classification on Tiny ImageNet (results for universal perturbation attacks are provided in Section C.1.2). We choose Tiny ImageNet because its resolution of 64x64 pixels allows attacks with relatively large patches. Moreover, it facilitates an evaluation based on an extensive grid search of attack configurations at reasonable computational cost. In addition, we present results for a universal patch attack on an object detection task on the Bosch Small Traffic Lights Dataset (Behrendt & Novak, 2017). For a detailed description of the implementation of the experiments, we refer to Section B in the appendix. In order to compare different methods for increasing robustness against universal patches, a reliable way of evaluating their robustness is required, which we outline next.
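We outline strong attacks for reliably evaluating the robustness of trained models against universal perturbations and patches. Importantly, we do not use the meta-learned perturbations, as this might result in a biased robustness evaluation. Instead, we extend PGD (Madry et al., 2018) in a similar way as Mummadi et al. (2019) by rewriting $\mathcal{R}_{\text{uadv}}$ from Equation (1) as $\mathcal{R}_{\text{uadv}}(\theta) = \max_{\pi \in \Pi} \hat{L}(\pi)$ with

$$\hat{L}(\pi) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \mathbb{E}_{\xi}\, L\big(f(\pi, x, y, \xi), y; \theta\big)$$

and then use the estimate

$$\widehat{\nabla}_\pi \hat{L}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\pi L\big(f(\pi, x_i, y_i, \xi_i), y_i; \theta\big)$$

based on $n$ samples $(x_i, y_i) \sim \mathcal{D}$ and $\xi_i$. Finally, we define stochastic projected gradient descent (S-PGD) as $\pi^{(0)} \sim \mathcal{U}(\Pi)$ and $\pi^{(k+1)} = \mathrm{Proj}_\Pi\big[\pi^{(k)} + \alpha\, \mathrm{sgn}\big(\widehat{\nabla}_\pi \hat{L}(\pi^{(k)})\big)\big]$. Note that S-PGD uses different samples $(x_i, y_i, \xi_i)$ in every step when estimating $\widehat{\nabla}_\pi \hat{L}$.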
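A minimal S-PGD loop is sketched below for an l_inf-bounded perturbation. Two callbacks are assumed: `sample_batch` draws a fresh batch of samples in every step, and `batch_grad` returns the batch-averaged gradient estimate. Momentum, which is part of the grid search in the experiments, is omitted for brevity, so this is a simplified illustration rather than the evaluation code; the toy helpers at the bottom are hypothetical.

```python
import random


def s_pgd(pi0, sample_batch, batch_grad, alpha, eps, k_steps, rng):
    """Stochastic PGD: signed ascent steps on a fresh-batch gradient estimate."""
    pi = list(pi0)
    for _ in range(k_steps):
        batch = sample_batch(rng)  # new samples (x_i, y_i, xi_i) in every step
        g = batch_grad(pi, batch)  # (1/n) * sum of per-sample gradients w.r.t. pi
        pi = [max(-eps, min(eps, p + alpha * ((gi > 0) - (gi < 0)))) for p, gi in zip(pi, g)]
    return pi


# Toy setting: each sample i contributes the loss w_i . pi with a noisy w_i,
# so the per-sample gradient is w_i and the batch gradient is the mean of w_i.
def sample_batch(rng, n=8):
    return [[1.0 + rng.uniform(-0.5, 0.5), -1.0 + rng.uniform(-0.5, 0.5)] for _ in range(n)]


def batch_grad(pi, batch):
    return [sum(w[d] for w in batch) / len(batch) for d in range(len(pi))]
```

Drawing a new batch in every step is what distinguishes this from the single-sample I-FGSM used inside MAT.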
In general, S-PGD will converge to local optima; namely, $\pi^{(K)}$ obtained after $K$ steps of S-PGD will not necessarily be the global maximizer of $\hat{L}$. To account for this, we propose three extensions of S-PGD. Firstly, since the initialization $\pi^{(0)}$ will generally affect the quality of $\pi^{(K)}$, we propose an alternative initialization akin to CutMix (Yun et al., 2019), where we initialize $\pi^{(0)}$ based on a datapoint $x \sim \mathcal{D}$. For universal patch attacks, we downsample or crop $x$ to the patch size, whereas for universal perturbation attacks, we scale its intensity range such that $\pi^{(0)} \in \Pi$. This initialization becomes even more effective if we sample many datapoints $x$ and select for initializing $\pi^{(0)}$ the one which maximizes $\hat{L}$. We denote this initialization as data initialization.
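Data initialization can be sketched as follows. The affine map from [0, 1] pixel intensities to [-eps, eps] is one plausible way of scaling the intensity range for universal perturbations and is an assumption, as is the center crop used for patches (downsampling is not shown); `loss_fn` is a hypothetical callback estimating the attack loss of a candidate initialization.

```python
def data_init_perturbation(x, eps):
    """Map pixel intensities in [0, 1] linearly to [-eps, eps] (assumed scaling)."""
    return [eps * (2.0 * v - 1.0) for v in x]


def center_crop(img, p):
    """Crop a p x p patch from the center of an H x W image given as nested lists."""
    top = (len(img) - p) // 2
    left = (len(img[0]) - p) // 2
    return [row[left:left + p] for row in img[top:top + p]]


def best_data_init(candidates, eps, loss_fn):
    """Among several datapoints, keep the initialization with the largest estimated loss."""
    inits = [data_init_perturbation(x, eps) for x in candidates]
    return max(inits, key=loss_fn)
```

Scoring many candidates costs only forward passes, which is typically cheap compared with the subsequent S-PGD iterations.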
Secondly, we take inspiration from recently proposed low-frequency attacks (Guo et al., 2019; Sharma et al., 2019): we modify the process of adding a perturbation $\pi$ to an input from $f(\pi, x, y, \xi)$ to $f(\mathcal{F}_c(\pi), x, y, \xi)$, where $\mathcal{F}_c$ denotes a low-pass filter with cutoff frequency $c$. To implement it, we follow Jo & Bengio (2017) and create a centered radial mask with radius $c$. The patch is transformed into frequency space and multiplied by the radial mask. The result is transformed back to image space, which yields the patch to be applied to the image. While this makes the attack weaker in principle, since only low-frequency perturbations are possible, we observe that in practice it can lead to a better-behaved optimization problem and result in S-PGD converging to stronger perturbations.
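The radial low-pass filter can be illustrated with a naive 2D discrete Fourier transform on a small single-channel square patch. Measuring the distance of each frequency from the zero frequency with signed indices is equivalent to building a centered mask after an fftshift. This pure-Python version is O(n^4) and only meant for small patches; a practical implementation would use an FFT routine, and the function names are hypothetical.

```python
import cmath
import math


def dft2(a, inverse=False):
    """Naive 2D DFT of a square n x n nested list (inverse is normalized by n^2)."""
    n = len(a)
    sgn = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0j
            for i in range(n):
                for j in range(n):
                    s += a[i][j] * cmath.exp(sgn * 2j * math.pi * (u * i + v * j) / n)
            out[u][v] = s / (n * n) if inverse else s
    return out


def signed_freq(k, n):
    """Signed frequency index, i.e. the offset from the center after an fftshift."""
    return k if k <= n // 2 else k - n


def low_pass(patch, radius):
    """Keep only frequencies inside a centered circle of the given radius."""
    n = len(patch)
    spec = dft2(patch)
    for u in range(n):
        for v in range(n):
            if math.hypot(signed_freq(u, n), signed_freq(v, n)) > radius:
                spec[u][v] = 0j  # radial mask is 0 outside the cutoff
    # Back to image space; the imaginary part is numerical noise for real inputs.
    return [[z.real for z in row] for row in dft2(spec, inverse=True)]
```

For example, an 8x8 checkerboard lives entirely at the highest frequency and is removed by a small radius, while a constant patch passes through unchanged.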
Thirdly, we perform a transfer attack, in which we run an attack after every epoch of model training. We initialize $\pi^{(0)}$ with one of the perturbations $\pi^{(K)}$ found in previous epochs, namely the one that maximizes $\hat{L}$. After every $E$ epochs, we run an additional S-PGD attack from a randomly initialized $\pi^{(0)}$. This transfer attack helps identify cases where universal perturbations found in early epochs remain effective against the model but are no longer found in later epochs when running S-PGD attacks from random or data initialization.
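The bookkeeping of the transfer attack might look like the sketch below. `run_s_pgd`, `estimate_loss`, and `random_init` are hypothetical callbacks standing in for an S-PGD run against the checkpoint of a given epoch, a loss estimate of a perturbation on that checkpoint, and a random initialization; model training between epochs is not shown.

```python
def transfer_attack(num_epochs, every_e, run_s_pgd, estimate_loss, random_init):
    """Collect the perturbations found after each training epoch."""
    pool = []
    for epoch in range(1, num_epochs + 1):
        # Model training for this epoch is assumed to happen before this point.
        if pool:
            # Transfer: start from the stored perturbation that is strongest on this checkpoint.
            init = max(pool, key=lambda pi: estimate_loss(pi, epoch))
            pool.append(run_s_pgd(init, epoch))
        if not pool or epoch % every_e == 0:
            # First epoch, and every E epochs: an additional run from a random initialization.
            pool.append(run_s_pgd(random_init(), epoch))
    return pool
```

Keeping every perturbation in the pool is what allows a patch discovered early in training to be reused once later attacks from scratch no longer find it.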
We evaluate robustness against universal patches of size 24x24 pixels that cover approximately 14% of the image. Patches are randomly translated from the center of the image by at most 26 pixels.
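As a quick check of the stated coverage on 64x64 Tiny ImageNet images:

```latex
\frac{24 \cdot 24}{64 \cdot 64} = \frac{576}{4096} = \frac{9}{64} \approx 0.141 \approx 14\%
```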
We train every model for 75 epochs with SGD, an initial learning rate of 0.033, a cosine decay learning rate schedule, momentum 0.9, and a batch size of 128. We use a ResNet (He et al., 2016), train it from scratch, and follow Xie & Yuille (2020) by replacing batch normalization with group normalization (Wu & He, 2019) and weight standardization (Qiao et al., 2019). We use $K$ iterations of I-FGSM in AT (Madry et al., 2018), SAT (Mummadi et al., 2019), and MAT. For UAT (Shafahi et al., 2018), we use … following their recommendation. For SAT, we use a sharedness of 128. We note that all adversarial training baselines were trained against patch attacks.
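The learning rate schedule can be written as a half-cosine from the initial rate of 0.033 down to zero over the 75 training epochs. Whether the decay is applied per epoch or per step, and whether it ends exactly at zero, is not specified in the text, so both are assumptions of this sketch.

```python
import math


def cosine_decay_lr(epoch, total_epochs=75, base_lr=0.033):
    """Half-cosine decay from base_lr at epoch 0 to 0 at total_epochs.

    Fractional epochs are allowed, so the same function can drive a per-step schedule."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```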
For every setting, we perform 5 independent runs. We evaluate the robustness against 2500-step S-PGD with a batch size of 64 and (i) random initialization, (ii) data initialization (data samples resized to the patch size), and (iii) a low-frequency filter, as well as against the transfer attack (see Section 4.1). For the S-PGD settings, we perform a grid search (see Table 4) over step sizes and momentum independently for every trained model and report the minimal accuracy. Finally, we report the minimal accuracy across all attacks.
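The reporting protocol reduces to two nested minima: the minimum over the S-PGD hyperparameter grid for each attack, then the minimum over attacks. The nested dictionary layout below is a hypothetical choice for illustration.

```python
def robust_accuracy(results):
    """results: {attack_name: {hyperparameter_tuple: accuracy}}.

    Returns the worst case over the grid for each attack and the overall worst case."""
    per_attack = {name: min(grid.values()) for name, grid in results.items()}
    return per_attack, min(per_attack.values())
```

Taking the minimum rather than the mean makes the reported number an upper bound on what any single tested attack configuration achieves against the model.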
|Method|
|---|
|CutMix (Yun et al., 2019)|
|PatchUniform (Lopes et al., 2019)|
|AT (Madry et al., 2018)|
|SAT (Mummadi et al., 2019)|
|UAT (Shafahi et al., 2018)|
Results are summarized in Table 1 (more details can be found in Section C.1). We observe that a model trained with standard empirical risk minimization offers no robustness against any of the evaluated attacks. Similarly, PatchUniform (akin to Lopes et al. (2019) but with uniform rather than Gaussian noise) and CutMix (Yun et al., 2019) augmentation do not offer robustness against strong patch attacks. Among the baseline adversarial defenses (Madry et al., 2018; Mummadi et al., 2019; Shafahi et al., 2018), none exhibits more than trivial robustness. UAT provides relatively high robustness against S-PGD with random initialization, even though we perform an extensive grid search over the hyperparameters of S-PGD. However, this robustness does not carry over to other attack variants. Closer inspection of AT and SAT (see Figure 4 in the appendix) shows that for these methods, too, only very specific combinations of learning rate and momentum allow the discovery of effective patches from random initialization. In summary, since S-PGD with random initialization and a single step size and momentum value is often used as the default evaluation, we suspect that some prior work might have overestimated the robustness of existing methods.
In contrast, MAT (full) with standard parameters (INIT with data initialization and $N$ meta-perturbations, targeted attacks, $s$ trials in SELECT, $K$ iterations in I-FGSM, and REPTILE learning rate $\beta$) shows high robustness against all attack variants. When ablating MAT, choosing a random initialization in INIT is most problematic: it results in overfitting to randomly initialized attacks that is similar to, but less severe than, that of UAT. Also, ablating towards joint training ($K = 1$ and $\beta = 1$) deteriorates performance relative to meta-learning in MAT (full). In addition, enforcing diversity in meta-perturbations via targeted attacks in MAT (full) yields a small increase in robustness compared to untargeted learning of meta-perturbations. Finally, taking the worst case over $s$ samples in SELECT outperforms random sampling ($s = 1$).
We observed that baseline adversarial training procedures like AT, SAT, and UAT are very sensitive to the step size $\alpha$ used in the inner maximization. Although we have tuned their step sizes to some extent, there might be some $\alpha$ with which their performance would improve. MAT, however, does not require tuning $\alpha$ at all, since it uses a different step size for every meta-perturbation and thus covers a broad range of step sizes automatically. Most importantly, MAT increases robustness without degrading clean performance. In fact, MAT acts as an effective regularizer that reduces overfitting compared to standard training and achieves the strongest clean performance among all methods, surpassing standard training by 4 percentage points. Even against the strongest patches, MAT loses only 2 percentage points of accuracy relative to standard training on clean data, despite the relatively large patch size. We provide illustrations of patches in Section D and show in Section A that MAT has a computational cost similar to that of standard adversarial training.
We evaluate the robustness of a traffic light detector based on YOLOv3 (Redmon & Farhadi, 2018). This attack scenario is a good proxy for physical-world attacks on automated driving systems since traffic light detection crucially relies on camera-based perception. We add 64x64 patches to 1280x704 images, covering 0.45% of the image, and apply random translations from the center by up to … pixels. We train the models with ADAM (learning rate …) for 15 epochs with batch size 1. We replace batch normalization with group normalization and weight standardization.
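The stated patch coverage follows directly from the image and patch sizes:

```latex
\frac{64 \cdot 64}{1280 \cdot 704} = \frac{4096}{901120} = \frac{1}{220} \approx 0.0045 \approx 0.45\%
```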
We evaluate the effectiveness of universal patch attacks using two metrics: mean Average Precision (mAP) and mean recall over classes at a fixed confidence threshold. While mAP captures both non-existent detections caused by the patch (false positives) and correct detections missed by the model (false negatives), mean recall focuses only on the latter. In other words, recall captures “blindness” attacks (Saha et al., 2019), which could be more dangerous in real-life scenarios. We set the confidence threshold to 0.3, the non-maximum suppression threshold to 0.1, and the IoU threshold for evaluating true positives to 0.1.
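To make the recall metric concrete, here is a simplified sketch of mean recall over classes. It assumes that detections count only when their score lies strictly above the 0.3 confidence threshold, that a ground-truth box is recalled if any such detection of its class overlaps it with IoU of at least 0.1, and that one detection may match several ground-truth boxes. It is not a full detection-evaluation protocol, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_recall(detections, ground_truth, conf_thresh=0.3, iou_thresh=0.1):
    """Mean over classes of the fraction of ground-truth boxes hit by a confident detection.

    detections: list of (cls, score, box); ground_truth: list of (cls, box).
    Simplification: a detection may match several ground-truth boxes."""
    recalls = []
    for cls in sorted({c for c, _ in ground_truth}):
        gts = [box for c, box in ground_truth if c == cls]
        dets = [box for c, score, box in detections if c == cls and score > conf_thresh]
        hits = sum(any(iou(g, d) >= iou_thresh for d in dets) for g in gts)
        recalls.append(hits / len(gts))
    return sum(recalls) / len(recalls) if recalls else 0.0

gt = [("red", (0, 0, 10, 10)), ("green", (20, 20, 30, 30))]
dets = [("red", 0.9, (1, 1, 11, 11)), ("green", 0.2, (20, 20, 30, 30))]
print(mean_recall(dets, gt))  # 0.5: the green detection falls below the confidence threshold
```

A blindness attack succeeds in this metric by pushing correct detections below the confidence threshold, as happens to the green light in the example; added false positives do not change recall at all.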
For attacks with S-PGD, we run 4000 steps with a batch size of 4, fix momentum to 0.9, and perform a grid search (see Table 6) over step sizes and the cutoff frequency of the optional low-pass filter. We also conduct a grid search over three options for the loss maximized by the attacker: the standard loss also used for model training, the standard loss minus the objectness loss as proposed by Saha et al. (2019), and the standard loss ignoring all false positives. The last variant is well suited for “blindness” attacks since it accentuates false negatives. For MAT (default), we use random initialization in INIT, 10 meta-perturbations, random sampling (a single sample) in SELECT, iterations in I-FGSM, and a REPTILE learning rate of 0.25.
Table 2 summarizes the results (more details can be found in Section C.2). In general, there are no systematic differences between training methods on clean data. Standard training suffers a drop in mean recall when a universal patch is added. Thus, standard training is likely susceptible to physical-world blindness attacks, which could, for example, cause the model to ignore real traffic lights. In contrast, UAT and all variants of MAT are very robust against the tested blindness attacks, even in the digital domain. In terms of mAP, UAT suffers a considerable drop for patch attacks initialized with data crops and with a low-frequency filter. A similar but weaker effect can be observed for MAT with the default configuration. Both methods therefore detect non-existent traffic lights on the patch or in its vicinity. Interestingly, false positive detections often resemble traffic lights (see Figure 1 and Section E). Despite this resemblance, a human would not be fooled by these patches. Likewise, MAT with data initialization in INIT is not fooled: its high mAP indicates that it is very robust against additional false positives. When ablating MAT, we observe that its mAP deteriorates as the configuration approaches UAT ( and ). We conclude that all aspects of MAT are essential for achieving maximal robustness.
We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning that increases model robustness against universal perturbations and patches with little computational overhead. Moreover, we show that prior work, which was assumed to be robust, can be fooled by stronger attacks. In contrast, MAT remains robust against all evaluated attacks. Our results show that further research into attacks, or alternatively scaling up certification procedures (Chiang et al., 2020), is required for reliably evaluating robustness against universal perturbations. On the other hand, our results also indicate that physical-world attacks will become considerably more difficult against models trained with MAT.
International Joint Conferences on Artificial Intelligence Organization (IJCAI), 2020. submitted.
16th European Conference on Computer Vision (ECCV), 2020.
Computer Vision and Pattern Recognition (CVPR), 2018b.
PatchAttack: A Black-box Texture-based Attack with Reinforcement Learning. In 16th European Conference on Computer Vision (ECCV), 2020a.
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
The computational cost of MAT is dominated by the number of forward and backward passes through the network per iteration of model training (the outer loop). Adversarial training (AT) with $K$-step PGD incurs a cost of $K+1$ per iteration: $K$ for generating the perturbation and $1$ for one step of model training. Similarly, the cost of MAT (with REPTILE as in our experiments) for the inner maximization and one step of outer minimization is $K+1$. Additionally, selecting a meta-perturbation from $s$ samples in Algorithm 3 incurs $s$ extra forward passes for $s > 1$. The additional cost is zero for $s = 1$, because the meta-perturbation is sampled randomly in this case and its loss need not be computed. The cost of REPTILE itself is negligible, because it is a simple convex combination. Therefore, the costs of MAT ($s = 1$) and AT are comparable, and MAT ($s = 1$) clearly outperforms AT in Table 1. Also for a small $s$, such as the one used for MAT (full) in Table 1, MAT with REPTILE does not incur a considerably higher cost than AT for the same $K$. The key point is that, owing to the better initialization from the set of meta-perturbations, a small $K$ can be chosen for MAT, whereas $K$ would need to be very large for PGD to create equally strong attacks.
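The pass counting described above can be written as a tiny helper. This sketch treats one forward+backward pass as the cost unit, counts the loss evaluations of SELECT separately as forward-only passes, and ignores the negligible cost of REPTILE. The names `mat_passes` and `at_passes` and the symbols `k` (inner iterations) and `s` (SELECT samples) are hypothetical.

```python
def mat_passes(k, s=1):
    """Rough per-iteration pass count for MAT with REPTILE.

    k: I-FGSM steps in the inner maximization (one forward+backward pass each).
    s: number of candidates SELECT evaluates; each costs one forward pass when s > 1.
    Returns (forward_backward_passes, extra_forward_passes)."""
    forward_backward = k + 1           # k inner steps + 1 model-update step
    extra_forward = s if s > 1 else 0  # s == 1 means uniform sampling, no loss evaluation
    return forward_backward, extra_forward

def at_passes(k):
    """K-step PGD adversarial training: k attack steps + 1 model-update step, no selection."""
    return k + 1, 0

print(mat_passes(5), mat_passes(5, s=5), at_passes(5))  # (6, 0) (6, 5) (6, 0)
```

Since forward-only passes are cheaper than forward+backward passes, the extra SELECT cost for a small `s` stays modest relative to the `k + 1` units shared with AT.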
We present details on subprocedure INIT in Algorithm 2. Two relevant parameters of INIT are described below.
As described in Section 3.2, meta adversarial training (MAT) meta-learns a set of meta-perturbations , where indexes . Similar to the attack initialization described in Section 4.1, these meta-perturbations can be initialized in INIT in two ways as follows:
Random initialization: sampling randomly from a uniform distribution over the space of allowed perturbations.
Data initialization: this initialization sub-samples actual data points from the training dataset and corresponds to an on-manifold initialization that follows the data distribution. To generate universal patches, we downsample or crop the data points. To create universal perturbations, we scale the intensity of the data points to the range of .
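Both INIT options can be sketched for patches as follows. The sketch assumes patches stored as nested H x W x C lists with values in [0, 1] and implements data initialization only by cropping; the downsampling variant and the intensity rescaling used for universal perturbations are omitted. All function names are hypothetical.

```python
import random

def random_init(height, width, channels, rng):
    """Random initialization: every value uniform in [0, 1] (assumed valid patch range)."""
    return [[[rng.random() for _ in range(channels)] for _ in range(width)] for _ in range(height)]

def data_init(image, height, width, rng):
    """Data initialization: copy a random height x width crop of an H x W x C training image."""
    y0 = rng.randint(0, len(image) - height)
    x0 = rng.randint(0, len(image[0]) - width)
    return [[list(pixel) for pixel in row[x0:x0 + width]] for row in image[y0:y0 + height]]

def init_meta_patches(n, height, width, images, mode, seed=0):
    """INIT sketch: create n meta-patches with the chosen initialization mode."""
    rng = random.Random(seed)
    if mode == "random":
        return [random_init(height, width, 3, rng) for _ in range(n)]
    if mode == "data":
        return [data_init(rng.choice(images), height, width, rng) for _ in range(n)]
    raise ValueError(f"unknown initialization mode: {mode}")
```

For example, `init_meta_patches(1000, 24, 24, train_images, "data")` would produce 24x24 crops matching the Tiny ImageNet patch size, where `train_images` is a hypothetical list of 64x64x3 images.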
The number of meta-perturbations is chosen roughly proportional to the number of classes of the dataset, for both classification and object detection tasks. We choose 1000 for Tiny ImageNet, which has 200 classes. For the Bosch Small Traffic Lights Dataset, we choose 10 because the dataset has only 4 classes.
We present details on the subprocedure SELECT in Algorithm 3. We note that the special choice $s = 1$, where $s$ is the number of sampled candidates, corresponds to uniform random sampling of a meta-perturbation (and its corresponding target class, step size, and randomness). For $s > 1$, SELECT requires $s$ additional evaluations of the loss function (and thus forward passes through the model), since the sample with the maximal loss is selected.
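A minimal sketch of SELECT, assuming candidates are drawn uniformly with replacement and `loss_fn` stands in for the model loss on the current batch with the candidate applied. The returned index would also identify the meta-perturbation's target class and step size. The names are hypothetical.

```python
import random

def select(meta_perturbations, loss_fn, s, rng):
    """Return the index of the meta-perturbation used for the next training step.

    s == 1: plain uniform sampling; the loss is never evaluated.
    s > 1: draw s candidates uniformly (with replacement) and keep the one with the
    highest loss, costing one forward pass per candidate."""
    n = len(meta_perturbations)
    if s == 1:
        return rng.randrange(n)
    candidates = [rng.randrange(n) for _ in range(s)]
    return max(candidates, key=lambda i: loss_fn(meta_perturbations[i]))
```

Because `max` evaluates the key once per candidate, the sketch performs exactly `s` loss evaluations for `s > 1` and none for `s == 1`, matching the cost accounting of the appendix.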
To evaluate the robustness of a model trained with MAT against universal patch attacks and universal perturbation attacks, we compare its performance with other training approaches: Standard, CutMix (Yun et al., 2019), PatchUniform, adversarial training (AT) (Madry et al., 2018), shared adversarial training (SAT) (Mummadi et al., 2019), and universal adversarial training (UAT) (Shafahi et al., 2018). Note that this evaluation is an ablation study of MAT; that is, we configure MAT to mimic each training approach. Detailed configurations are shown in Table 3 for universal patches and in Table 7 for universal perturbations.
The ResNet model we use contains 4 residual stacks, each consisting of 2 residual blocks. The stacks have 64, 128, 256, and 512 channels and spatial resolutions of 64x64, 32x32, 16x16, and 8x8, respectively. We use ReLU as the activation function. Each convolutional layer has a stride of 1, a kernel size of 3, group normalization with weight standardization, SAME padding, “he_normal” kernel initialization, and a weight decay of on the kernel weights.
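Weight standardization and group normalization can be illustrated in plain Python. This is a sketch for intuition only, not the TensorFlow layers used in the experiments: the number of groups is not stated in the text and is a free parameter here, the learnable scale and offset of group normalization are omitted, and all names are hypothetical.

```python
import math

def _normalize(values, eps):
    """Shift and scale a flat list of numbers to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def standardize_kernel(filters, eps=1e-5):
    """Weight standardization: normalize each output filter, flattened over
    kernel height, kernel width, and input channels."""
    return [_normalize(f, eps) for f in filters]

def group_norm(channels, num_groups, eps=1e-5):
    """Group normalization for one sample without learnable scale/offset.

    channels: list of C lists, each holding the H*W activations of one channel.
    Statistics are shared by the C // num_groups channels of each group."""
    if len(channels) % num_groups:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        flat = _normalize([v for ch in group for v in ch], eps)
        out.extend(flat[i * len(ch):(i + 1) * len(ch)] for i, ch in enumerate(group))
    return out
```

Unlike batch normalization, both operations compute their statistics without reference to other examples in the batch, which matters for the traffic-light detector trained with batch size 1.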
Each model is trained with 24x24 pixel patches applied to the 64x64 pixel input images; that is, a patch covers approximately 14% of the image. Patches are randomly translated from the center of the image by up to 26 pixels during training. We train each model with SGD for 75 epochs with an initial learning rate of 0.033, a cosine decay learning rate scheduler, momentum of 0.9, and a batch size of 128. For each setting, we perform 5 independent runs with 5 different seeds. Details regarding the adversarial training procedures are shown in Table 3. The most crucial parameter for AT, SAT, and UAT is the step size of I-FGSM. For AT, we follow MAT and sample the step size per datapoint randomly from a log-uniform distribution over . Since UAT and SAT only update a single patch per batch, this random sampling strategy is not feasible on a per-batch level. Instead, we use a fixed ; more specifically, we use for SAT such that I-FGSM can reach any value in in K=5 iterations. Since UAT updates the perturbation iteratively over the batches, a smaller value for is feasible here and we employ . We did not tune these choices for extensively but note that MAT does not require any tuning of the step size .
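The interplay of I-FGSM and REPTILE can be sketched on a toy linear loss: I-FGSM adapts a copy of the selected meta-perturbation with sign-gradient ascent steps, and REPTILE then moves the stored meta-perturbation toward the adapted one. The linear loss, the [0, 1] value range, and the function names are illustrative assumptions. In MAT, the gradient comes from the model on the current batch (including the targeted objective), and the model weights are also updated on the perturbed batch; both are omitted here.

```python
def sign(x):
    return (x > 0) - (x < 0)

def ifgsm(delta, grad_fn, step_size, iterations, low=0.0, high=1.0):
    """Iterative FGSM ascent: step along the sign of the gradient, then clip to [low, high]."""
    delta = list(delta)
    for _ in range(iterations):
        grad = grad_fn(delta)
        delta = [min(high, max(low, d + step_size * sign(g))) for d, g in zip(delta, grad)]
    return delta

def reptile_update(meta, adapted, lr):
    """REPTILE step: convex combination moving the meta-perturbation toward the adapted one."""
    return [(1 - lr) * m + lr * a for m, a in zip(meta, adapted)]

# Toy linear loss L(delta) = sum(w_i * delta_i), so its gradient is simply w.
w = [1.0, -1.0]
meta = [0.5, 0.5]
adapted = ifgsm(meta, lambda d: w, step_size=0.1, iterations=5)
meta = reptile_update(meta, adapted, lr=0.25)
print(adapted, meta)  # adapted is about [1.0, 0.0], meta about [0.625, 0.375]
```

With `lr=0` the stored meta-perturbation never changes, and with `lr=1` it is simply replaced by the adapted perturbation; intermediate values such as 0.25 blend the two.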
As described in Section 4.1, we propose strong attacks for reliably evaluating the robustness of trained models against universal patch attacks and universal perturbation attacks by optimizing the perturbations using S-PGD. We propose two initialization methods for S-PGD. Additionally, we utilize the low-frequency attack described above. The S-PGD step size is exponentially decayed with a total decay factor of 0.01. To evaluate each model's robustness, we perform S-PGD attacks over the parameter grid given in Table 4. The attack results can be found in Subsection C.1.
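One way to realize an exponential decay with a fixed total decay factor is sketched below. It assumes a smooth per-step decay (rather than a staircase schedule) and uses the total decay of 0.01 and the 2500 iterations from Table 4; the update rule of S-PGD itself is not shown, and `decayed_step_size` is a hypothetical name.

```python
def decayed_step_size(initial, step, total_steps, total_decay=0.01):
    """Exponentially decayed step size: initial at step 0, initial * total_decay at the final step."""
    return initial * total_decay ** (step / total_steps)

schedule = [decayed_step_size(0.01, t, 2500) for t in (0, 1250, 2500)]
print(schedule)  # roughly [0.01, 0.001, 0.0001]
```

Halfway through the attack, the step size has shrunk by the square root of the total decay, that is, by a factor of 10.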
|worst over samples||1||1||1||5|
|REPTILE learning rate||0||0||1||0.25|
|number of meta-patches||1||1000|

| Parameter | Values |
|---|---|
| Patch initialization | random, data |
| Step size | 0.0001, 0.00033, 0.001, 0.0033, 0.01, 0.033, 0.1 |
| Momentum | 0, 0.9, 0.99 |
| Cutoff frequency | off, 12 |
| Number of iterations (S-PGD) | 2500 |
| Total step size decay | 0.01 |
We describe the experimental details for training robust traffic light detectors against universal patch attacks.
For each training procedure, we train a Yolo V3 model (Redmon & Farhadi, 2018) from scratch on the Bosch Small Traffic Lights Dataset (Behrendt & Novak, 2017). The model has three network outputs on each scale, as implemented in the original paper. For each DarkNet convolutional layer, we replace batch normalization with group normalization and use weight standardization. To interpret the network outputs of Yolo V3, we set the confidence threshold to 0.3; that is, only predictions with an objectness score above 0.3 count as valid predictions. The non-maximum suppression threshold is set to 0.1, meaning that we prune predictions whose bounding boxes overlap with an IoU above 0.1.
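The thresholding and pruning steps can be sketched as greedy non-maximum suppression. This version is class-agnostic and operates on (score, box) pairs, which is an assumption: many Yolo V3 implementations apply NMS per class. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, conf_thresh=0.3, iou_thresh=0.1):
    """Greedy NMS over (score, box) pairs: drop low-confidence boxes, then keep boxes
    in descending score order unless they overlap a kept box with IoU above iou_thresh."""
    confident = sorted((d for d in detections if d[0] > conf_thresh),
                       key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in confident:
        if all(iou(box, other) <= iou_thresh for _, other in kept):
            kept.append((score, box))
    return kept

dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)),
        (0.7, (50, 50, 60, 60)), (0.2, (100, 100, 110, 110))]
print(nms(dets))  # [(0.9, (0, 0, 10, 10)), (0.7, (50, 50, 60, 60))]
```

A threshold as low as 0.1 suppresses almost any overlapping box, which suits small, rarely overlapping objects such as traffic lights.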
Each model is trained with 64x64 pixel patches applied to input images resized to 1280x704; both width and height of the resized images are multiples of 32 because a grid cell's size is 32x32. A patch therefore covers 0.45% of the image. Patches are randomly translated from the center of the image by up to (512, 282) pixels during training, and we ensure that translated patches do not overlap with any ground-truth traffic-light annotation. We train the model with ADAM for 15 epochs with an initial learning rate of 0.0001, a cosine decay learning rate scheduler, and a batch size of 1. We compare the robustness of UAT and MAT variants against universal patch attacks. The configuration details are shown in Table 5.
As described in Section 4.3, to evaluate the effectiveness of the universal patches as well as the robustness of the model, we use two metrics: mean Average Precision (mAP) and mean recall over classes, with an IoU threshold of 0.1 for determining true positives between predicted bounding boxes and the ground truth. For generating universal patches, we use S-PGD with 4000 steps and a batch size of 4. To find the strongest patches, we perform a grid search over step sizes (0.1, 0.01, 0.001, and 0.0001) and the cutoff frequency of an optional low-pass filter, with momentum fixed to 0.9. In addition, we conduct a grid search over three options for the loss maximized by the attacker: (1) the standard loss that is also used during training, (2) the standard loss minus the objectness loss, and (3) the standard loss ignoring all false positives. As with the initialization approaches in Section B.3, the perturbations for these attacks are initialized in two different ways: randomly or from a cropped image of the Bosch Small Traffic Lights Dataset. Each configuration of the grid search is a unique combination of an initialization, a step size, a cutoff frequency, and a loss. The parameter grid is summarized in Table 6. More result details can be found in Section C.2.
| SETTING | UAT | MAT (default) | MAT (+data) | MAT (=1) | MAT (=1, =1) |
|---|---|---|---|---|---|
| worst over samples | 1 | 1 | 1 | 1 | 1 |
| REPTILE learning rate | 1.0 | 0.25 | 0.25 | 0.25 | 0.25 |
| number of patches | 1 | 10 | 10 | 10 | 1 |

| Parameter | Values |
|---|---|
| Patch initialization | random (RI), data crop (DI) |
| Number of steps (S-PGD) | 4000 |
| Step size | 0.1, 0.01, 0.001, 0.0001 |
| Total step size decay | 0.01 |
| Cutoff frequency | 25, 50, 100, 250 |
| Loss | standard, no objectness loss, ignoring false positives |
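The attack grid for the traffic-light detector can be enumerated with `itertools.product`. This sketch assumes every listed cutoff frequency is searched (if an unfiltered variant is also run, it would be added to that list) and merges in the settings held fixed across the search. The dictionary keys and function name are hypothetical; robustness would then be reported against the strongest resulting configuration.

```python
from itertools import product

# Searched hyperparameters (Table 6); every listed cutoff frequency is assumed to be used.
GRID = {
    "init": ["random", "data crop"],
    "step_size": [0.1, 0.01, 0.001, 0.0001],
    "cutoff_frequency": [25, 50, 100, 250],
    "loss": ["standard", "no objectness loss", "ignoring false positives"],
}
# Settings held fixed across the search (from the S-PGD description).
FIXED = {"steps": 4000, "batch_size": 4, "momentum": 0.9, "total_step_size_decay": 0.01}

def attack_configs(grid=GRID, fixed=FIXED):
    """Yield one S-PGD configuration per combination of the searched values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}

configs = list(attack_configs())
print(len(configs))  # 96
```

Under these assumptions, each model is attacked with 2 x 4 x 4 x 3 = 96 configurations.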
In Figure 2, we compare learning curves of MAT models against the transfer attack (see Section 4.1) between different settings during training. The left plot shows that initializing the meta-perturbations via data initialization leads to higher universal adversarial accuracy compared to random initialization. The middle plot shows that the model trained with targeted meta-perturbations is more robust than the model trained with untargeted meta-perturbations, because targeted meta-perturbations allow for a greater diversity. The right plot shows results of randomly choosing a patch (), selecting the worst patch from samples, and selecting the worst patch from samples. Training the model with more than one sample () improves the model’s robustness but robustness saturates for while larger increases computational cost.
Figure 3 shows learning curves of ablated versions of MAT against the transfer attack. In accordance with Table 1, training with a larger number of meta-perturbations , more iterations in I-FGSM, and with a REPTILE learning rate smaller than 1.0 consistently improves robustness.
While Table 1 shows the worst accuracy of each setting against all attacks of the grid search, Figure 4 summarizes the accuracy of all attacks of the grid search in a box plot. Each value is averaged over 5 independent runs with 5 different seeds for each training procedure. Each model is evaluated against three patch attack procedures: data, random, and low frequency. Configurations with large variance indicate that the model might appear to be robust if the hyperparameters of the attack are chosen badly. This effect is particularly pronounced for AT and SAT against the randomly initialized S-PGD attack, where only very few attack configurations are able to strongly degrade performance.
Moreover, the results show that MAT is the only model robust against the data-initialized attacks: none of the attacks reduce MAT's accuracy below 0.5, regardless of the initialization method. As discussed before, attacks with data initialization are more effective than those with random initialization, and attacks employing a low-frequency filter are the most effective against MAT with random initialization (MATr). Nevertheless, MATr still shows stronger robustness than all other approaches except MAT.
We present an evaluation analogous to that in Section 4.2 for universal perturbation attacks. We use the same dataset, neural architecture, and training pipeline, but train the models specifically against universal perturbation attacks instead of universal patch attacks. We allow universal perturbations with . Following the training configuration shown in Table 7, we train ResNet V1 models with four training approaches: Standard, AT, UAT, and MAT. We do not present results for SAT (Mummadi et al., 2019), since we have not found a stable hyperparameter configuration for this setting; however, the results of Mummadi et al. (2019) indicate that SAT should perform slightly better than AT when configured appropriately. We evaluate the robustness of the models against the same attacks as for patch attacks.
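Unlike a patch, a universal perturbation is added to every pixel of every input. A minimal sketch, assuming an L-infinity bound (the norm and radius are elided in the text above, so both the norm and `eps=0.1` are illustrative assumptions) and inputs flattened to lists with values in [0, 1]; the function names are hypothetical.

```python
def project_linf(delta, eps):
    """Project a flat universal perturbation onto the L-infinity ball of radius eps."""
    return [max(-eps, min(eps, d)) for d in delta]

def apply_universal(inputs, delta):
    """Add the same perturbation to every (flattened) input and clip to [0, 1]."""
    return [[min(1.0, max(0.0, x + d)) for x, d in zip(image, delta)] for image in inputs]

delta = project_linf([0.3, -0.05, -0.2], eps=0.1)
batch = apply_universal([[0.95, 0.5, 0.05], [0.0, 0.2, 0.4]], delta)
print(delta, batch)
```

The projection is applied to the perturbation once, while the final clipping depends on each input, so perturbed inputs stay valid images even where the shared perturbation would push them out of range.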
|worst over samples||1||1||5|
|REPTILE learning rate||0||1||0.25|
|number of meta-perturbations||1||1000|
|I-FGSM step-size||[0.0001, 0.02]||0.01||[0.0001, 0.02]|
|AT (Madry et al., 2018)|
|UAT (Shafahi et al., 2018)|
The evaluation results are summarized in Table 8. Compared with the universal patch results in Table 1, we notice a few interesting differences. First, clean accuracy is degraded for all variants of adversarial training compared to standard training, which indicates a trade-off between clean performance and robustness in this threat model. Second, in contrast to standard training, AT and UAT achieve non-trivial gains in robustness, whereas their robustness did not improve against the universal patches addressed in Section 4.2. Third, the accuracy of UAT in Table 8 shows that UAT overfits less strongly to the randomly initialized S-PGD attack than in the universal patch setting of Table 1. Despite these differences, MAT considerably outperforms all other methods in terms of robustness in this setting as well.
Figure 5 shows the learning curves of those training approaches against the transfer attack for generating universal perturbations. Notably, while MAT is less robust in the early phase of training, it reaches a significantly higher level of robustness in the end.
Figure 6 shows the box plot corresponding to Table 8. The accuracy of MAT models is above 0.4 for all three attacks and shows little variance. In contrast, UAT and AT are robust against certain attack configurations but against an optimally configured attack, accuracy degrades to 0.25 or less. This shows that evaluating robustness reliably requires a strong set of attacks and their well-tuned hyperparameters.
We show the box plots corresponding to Table 2 in Figure 7 for recall and mAP, respectively. Notably, only the recall of the standard model can be reduced considerably (meaning true positives can be hidden), and this requires an appropriately configured attack. Interestingly, low-frequency attacks are not effective at reducing the recall of any model. In contrast, low-frequency attacks are the most effective ones for reducing mAP, that is, for causing false positive detections. While randomly initialized S-PGD does not reduce the mAP of any model besides the standard model, low-frequency attacks with many different configurations considerably reduce the mAP of most models (except MAT + data). S-PGD from data initialization can be effective, but in most cases it fails to reduce the mAP of any model other than the standard model.
We illustrate universal patch attacks on models trained on Tiny ImageNet in Figure 8. Note that these are the strongest patches found against these models during the grid search. Oftentimes, the generated patch resembles the target class: examples for this are the low-frequency attack on a standard model (fooling it to mistake a chimpanzee for a police van), the random initialization attack against the SAT model (fooling it to mistake the chimpanzee for a ladybug), the data initialization attack against the SAT model (fooling it to mistake the chimpanzee for an orange), or the low-frequency attack against the UAT model (fooling it into classifying the input as a fire salamander based on the characteristic texture of the patch). While these misclassifications can be explained, a human would very likely still classify the inputs as chimpanzees. Attacks on MAT (full) fail to generate interpretable patches; however, transferring patches generated for other models (such as the shown ones) to MAT does not cause misclassifications either.
We illustrate universal patch attacks on models trained on Bosch Small Traffic Lights Dataset in Figure 9. Note that these are the strongest patches found against these models during the grid search in terms of the mAP. These patches often invoke high confidence false detections. However, MAT with data initialization does not show any false positives.
For the patches found for MAT (Data Init), we show the progress of the patches during the attack in Figure 10. Similarly, Figure 11 shows the patches' evolution during an attack on the standard model. Note that patches converge fairly quickly; that is, running attacks longer would not make them stronger. Moreover, all three patches for MAT converge to a red-cyan pattern, and the patches for data and random initialization exhibit very similar patterns. This indicates that this pattern is a local optimum of the attack objective with a large basin of attraction. However, as Figure 9 shows, it does not actually fool the model. Finally, Figure 12 shows the training of the patch shown in Figure 1.
We illustrate universal perturbation attacks on models trained on Tiny ImageNet in Figure 13. Note that these are the strongest perturbations found against these models during the grid search.