JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks

04/07/2019
by   N. Benjamin Erichson, et al.
berkeley college

It has been demonstrated that very simple attacks can fool highly sophisticated neural network architectures. In particular, so-called adversarial examples, constructed from perturbations of input data that are small or imperceptible to humans but lead to different predictions, may lead to an enormous risk in certain critical applications. In light of this, there has been a great deal of work on developing adversarial training strategies to improve model robustness. These training strategies are very expensive, in both human and computational time. To complement these approaches, we propose a very simple and inexpensive strategy which can be used to "retrofit" a previously-trained network to improve its resilience to adversarial attacks. More concretely, we propose a new activation function, the JumpReLU, which, when used in place of a ReLU in an already-trained model, leads to a trade-off between predictive accuracy and robustness. This trade-off is controlled by the jump size, a hyper-parameter which can be tuned during the validation stage. Our empirical results demonstrate that this increases model robustness, protecting against adversarial attacks with substantially increased levels of perturbations. This is accomplished simply by retrofitting existing networks with our JumpReLU activation function, without the need for retraining the model. Additionally, we demonstrate that adversarially trained (robust) models can greatly benefit from retrofitting.

1. Introduction

As machine learning methods become more integrated into a wide range of technologies, there is a greater demand for robustness, in addition to the usual efficiency and high-quality prediction, in machine learning algorithms. Deep neural networks, in particular, are ubiquitous in many technologies that shape the modern world [20, 12], but it has been shown that even the most sophisticated network architectures can easily be perturbed and fooled by simple and imperceptible attacks. For instance, single pixel changes which are undetectable to the human eye can fool neural networks into making erroneous predictions. These adversarial attacks can reveal important fragilities of modern neural networks [45, 13, 24], and they can reveal flaws in network training and design which pose security risks [19]. Partly due to this, evaluating and improving the robustness of neural networks is an active area of research. Due to the unpredictable and sometimes imperceptible nature of adversarial attacks, however, it can be difficult to test and evaluate network robustness comprehensively. See, e.g., Figure 1, which provides a visual illustration of how a relatively small adversarial perturbation can lead to incorrect classification.

Figure 1. Adversarial examples are constructed by perturbing a clean example with a small amount of non-random noise in order to fool a classifier. Often, an imperceptible amount of noise is sufficient to fool a model (top row). The JumpReLU improves the robustness, i.e., a higher level of noise is required to fool the retrofitted model (bottom row). (Panels: clean example, adversarial perturbation, adversarial example.)

Most work in this area focuses on training, e.g., developing adversarial training strategies to improve model robustness. These training strategies are very expensive, in both human and computational time. For example, a single training run can be expensive, and typically many training runs are needed, as the analyst “fiddles with” parameters and hyper-parameters.

Figure 2. Simplified illustration of a neural network architecture using ReLU activation functions. JumpReLU can be activated by setting the jump value (threshold value) $\kappa$ larger than zero to increase the resilience to adversarial attacks. (One could use different values of $\kappa$ for different layers, but we did not observe that to help.)

Motivated by this observation, we propose a complementary approach to improve the robustness of the model to the risk of adversarial attacks. The rectified linear unit (ReLU) will be our focus here since it is the most widely-used and studied activation function in the context of adversarial attacks (but we expect that the same idea can be applied more generally). For networks trained with ReLUs, our method will replace the ReLU with what we call a JumpReLU function, a variant of the standard ReLU that has a jump discontinuity. See Figure 2 for an illustration of the basic method. The jump discontinuity in the JumpReLU function has the potential to dampen the effect of adversarial perturbations, and we will “retrofit” existing ReLU-based networks by replacing the ReLU with the JumpReLU, as a defense strategy, to reduce the risk of adversarial attacks. The magnitude of the jump is a parameter which controls the trade-off between predictive accuracy and robustness, and it can be chosen in the validation stage, i.e., without the need to retrain the network.

In more detail, our contributions are the following:

  • We introduce and propose the JumpReLU activation function, a novel rectified linear unit with a small jump discontinuity, in order to improve the robustness of trained neural networks.

  • We show that the JumpReLU activation function can be used to “retrofit” already deployed, i.e., pre-trained, neural networks—without the need to perform an expensive retraining of the original network. Our empirical results show that using the JumpReLU in this way leads to networks that are resilient to substantially increased levels of perturbations, when defending classic convolutional networks and modern residual networks. We also show that JumpReLU can be used to enhance adversarially trained (robust) models.

  • We show that the popular Deep Fool method requires increased noise levels by a factor of about to achieve nearly 100 percent fooling rates for the retrofitted model on CIFAR10. We show that these increased noise levels are indeed critical, i.e., the detection rate of adversarial examples is substantially increased when using an additional add-on detector.

  • The magnitude of the jump is an additional hyper-parameter in the JumpReLU activation function that provides a trade-off between predictive accuracy and robustness. This single parameter can be efficiently tuned during the validation stage, i.e., without the need for network retraining.

In summary, the JumpReLU activation function improves the model robustness to adversarial perturbations, while attaining a “good” accuracy for clean examples. Further, the impact on the architecture is minimal and does not affect the inference time of the network.

2. Related work

Adversarial examples are an emerging threat for many machine learning tasks. Szegedy et al. [45] discovered that neural networks are particularly susceptible to such adversarial examples. This can lead to problems in safety- and security-critical applications such as medical imaging, surveillance, autonomous driving, and voice command recognition. Due to its importance, adversarial learning has become an intense area of research, posing a cat-and-mouse game between attackers and defenders.

Indeed, there is currently a lack of theory to explain why deep learning is so sensitive to this form of attack. Early work hypothesized that the highly non-linear characteristics of neural networks and the tendency toward almost perfect interpolation of the training data are the reasons for this phenomenon. Tanay and Griffin [46] argued that the adversarial strength is related to the level of regularization and that the effect of adversarial examples can be mitigated by using a proper level of regularization. In contrast, Goodfellow et al. [13] demonstrated that the linear structure in deep networks with respect to their inputs is sufficient to craft adversarial examples.

Let us assume that $x$ denotes an input such as an image. The problem of crafting an adversarial example requires finding an additive perturbation $\Delta x$, so that $\hat{x}$, which is constructed as

$$\hat{x} = x + \Delta x \tag{1}$$

fools a specific model $F$ under attack. The minimal perturbation with respect to a $p$-norm can be obtained by using an optimization-based strategy which aims to minimize

$$\|\Delta x\|_p = \|\hat{x} - x\|_p \tag{2}$$

so that the perturbed example $\hat{x}$ is misclassified.

Note that the perturbation used to construct adversarial examples needs to be small enough to be unnoticeable for humans or add-on detection algorithms. Intuitively, the average minimum perturbation which is required to fool a given model yields a plausible metric to characterize the robustness of a model [35]. Hence, we can quantify the robustness for a trained model $F$ as

$$\rho_F := \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \frac{\|\Delta x\|_p}{\|x\|_p} \right] \tag{3}$$

where the input-target pairs $(x, y)$ are drawn from distribution $\mathcal{D}$, and $\Delta x$ is the minimal perturbation that is needed to fool the model $F$.

2.1. Attack strategies

There are broadly two types of attacks: targeted and non-targeted attacks. Targeted attacks aim to craft adversarial examples which fool a model to predict a specific class label. Non-targeted attacks have a weaker objective, i.e., simply to classify an adversarial example incorrectly.

Independent of the type, attack strategies can be categorized broadly into two families of threat models. Black-box attacks aim to craft adversarial examples without any prior knowledge about the target model [44, 42, 9, 10]. White-box attacks, in contrast, require comprehensive prior knowledge about the target model. There are several popular white-box attacks for computer vision applications [45, 13, 24, 33, 19, 32, 36]. A slightly weaker form are gray-box attacks, which take advantage of partial knowledge about the target model.

The following (non-targeted) attack methods are particularly relevant for our results.

  • First, the Fast Gradient Sign Method (FGSM) [13], which crafts adversarial perturbations $\Delta x$ by using the sign of the gradient of the loss function $L$ with respect to the clean input image $x$. Let us assume that the true label of $x$ is $y$. Then, the adversarial example is constructed as

    $$\hat{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(x, y)\right) \tag{4}$$

    where $\epsilon$ controls the magnitude of the perturbation. Here, the operator $\mathrm{sign}$ is an element-wise function, extracting the sign of a real number.

    Relatedly, the iterative variant IFGSM [19] constructs adversarial examples using $k$ iterations, starting from $\hat{x}^{(0)} = x$,

    $$\hat{x}^{(k)} = \mathrm{clip}_{x, \epsilon}\left(\hat{x}^{(k-1)} + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(\hat{x}^{(k-1)}, y)\right)\right) \tag{5}$$

    where $\mathrm{clip}_{x, \epsilon}$ is an element-wise clipping function. This approach is essentially a projected gradient descent (PGD) method used to craft adversarial examples [28].

  • Second, the Deep Fool (DF) method, which is another iterative method for constructing adversarial examples [33]. The DF method first approximates the model under consideration as a linear decision boundary, and then seeks the smallest perturbation needed to push an input image over that boundary. DF can minimize the loss function using either the $L_\infty$ or $L_2$ norm.

  • Third, the recently introduced trust region (TR) based attack method [48]. In [48], the authors show that this TR method performs similarly to the Carlini and Wagner (CW) [8] attack method, but is more efficient in terms of the computational resources required to construct the adversarial examples.
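
The first two attacks in the list above can be sketched on a toy model. The following is a minimal illustration only, assuming a logistic-regression classifier with a binary cross-entropy loss so that the input gradient has a closed form; the function names, the separate step size `step`, and the `overshoot` factor are hypothetical choices for this sketch and are not taken from the paper or its code.

```python
import math


def sign(v):
    """Element-wise sign helper: returns -1, 0, or 1."""
    return (v > 0) - (v < 0)


def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    return 1 if logit(x, w, b) > 0.0 else 0


def loss_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return [(p - y) * wi for wi in w]


def fgsm(x, y, w, b, eps):
    """Single signed-gradient step of size eps (FGSM)."""
    g = loss_gradient(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def iterative_fgsm(x, y, w, b, eps, step, n_iter):
    """Repeated signed-gradient steps, clipped to the eps-box around x."""
    x_adv = list(x)
    for _ in range(n_iter):
        g = loss_gradient(x_adv, y, w, b)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Element-wise clipping keeps every coordinate within eps of the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


def deepfool_linear_step(x, w, b, overshoot=0.02):
    """Smallest L2 move across a linear boundary, plus a small overshoot (assumed)."""
    norm_sq = sum(wi * wi for wi in w)
    scale = -(1.0 + overshoot) * logit(x, w, b) / norm_sq
    return [xi + scale * wi for xi, wi in zip(x, w)]


# Toy usage: a clean point with label 1 that all three attacks push across the boundary.
w, b = [1.0, -2.0], 0.0
x, y = [0.3, 0.1], 1
x_fgsm = fgsm(x, y, w, b, eps=0.1)
x_iter = iterative_fgsm(x, y, w, b, eps=0.1, step=0.05, n_iter=4)
x_df = deepfool_linear_step(x, w, b)
```

For a genuinely linear model the Deep Fool step is exact, which is why a single step suffices here; for a deep network the linearization would have to be repeated.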

2.2. Defense strategies

Small perturbations are often imperceptible for both humans and the predictive models, making the design of counterattacks a non-trivial task. Commonly used techniques for preventing overfitting (e.g., including weight decay and dropout layers and then retraining) do not robustify the model against adversarial examples. Akhtar and Mian [2] segment modern defense strategies into three categories.

The first category includes strategies which rely on specialized add-on (external) models which are used to defend the actual network [1, 22, 43, 47, 29].

The second category includes defense strategies which modify the network architecture in order to increase the robustness [15, 40, 35, 34, 16]. Closely related to our work, Zantedeschi et al. [50] recently proposed a bounded ReLU activation function as an efficient defense against adversarial attacks. Their motivation is to dampen large signals to prevent accumulation of the adversarial perturbation over layers as a signal propagates forward, using the function $f(z) = \min(\max(0, z), t)$ with an upper bound $t$.

The third category aims to modify the input data for the training and validation stage in order to improve the robustness of the model [31, 51, 16, 4, 26, 23, 37].

A drawback of most state-of-the-art defense strategies is that they involve modifying the network architecture. Such strategies require that the new network is re-trained or that new specialized models are trained from scratch. This retraining is expensive in both human and computation time. Further, specialized external models can require considerable effort to be deployed and often increase the need of computational resources and inference time.

3. Jump rectified linear unit (JumpReLU)

The rectified linear unit (ReLU) and its variants have arguably emerged as the most popular activation functions for applications in the realm of computer vision. The ReLU activation function has beneficial numerical properties, and also has sparsity promoting properties [11]. Indeed, sparsity is a widely used concept in statistics and signal processing [17]. For a given input $x$ and an arbitrary function $f$, the ReLU function can be defined as the positive part of the filter output $z = f(x)$ as

$$R(z) := \max(z, 0), \tag{6}$$

illustrated in Figure 3(a). The ReLU function is also known as the ramp function, which has several other interesting definitions. For instance, we can define the ReLU function as

$$R(z) := z \, H(z), \tag{7}$$

where $H$ is the discrete Heaviside unit step function

$$H(z) := \begin{cases} 0 & \text{if } z < 0, \\ 1 & \text{if } z \geq 0. \end{cases} \tag{8}$$

Alternatively, the logistic function can be used for smooth approximation of the Heaviside step function

$$H(z) \approx \frac{1}{1 + \exp(-2 \beta z)}. \tag{9}$$

Intriguingly, this smooth approximation resembles the Swish activation function [38], which is defined as

$$S(z) := \frac{z}{1 + \exp(-\beta z)}. \tag{10}$$
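
The relationship between these definitions can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the factor of 2 in the logistic approximation is an assumed parameterization (with it, multiplying the smooth step by $z$ gives a Swish function with a rescaled $\beta$).

```python
import math


def heaviside(z):
    """Discrete unit step: 0 for negative inputs, 1 otherwise."""
    return 1.0 if z >= 0 else 0.0


def relu_via_heaviside(z):
    """Ramp function written as z * H(z)."""
    return z * heaviside(z)


def logistic_step(z, beta):
    """Smooth approximation of the unit step (assumed 2*beta parameterization)."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * z))


def swish(z, beta=1.0):
    """Swish activation: z times the logistic function of beta * z."""
    return z / (1.0 + math.exp(-beta * z))


# The smooth step sharpens toward the discrete step as beta grows.
for beta in (1.0, 5.0, 50.0):
    print(beta, logistic_step(0.1, beta), heaviside(0.1))

# z * logistic_step(z, beta) coincides with swish(z, 2 * beta).
z = 1.5
print(z * logistic_step(z, 0.5), swish(z, 1.0))
```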

The ReLU activation function works extremely well in practice. However, a fixed threshold value of zero seems arbitrary. Thus, it seems reasonable to crop activation functions so that they turn on only for inputs greater than or equal to the jump value $\kappa$. In this case, sub-threshold signals are suppressed, while significant signals are allowed to pass.

We introduce the JumpReLU function which suppresses signals of small magnitude and negative sign

$$J(z) := z \, H(z - \kappa) = \begin{cases} 0 & \text{if } z < \kappa, \\ z & \text{if } z \geq \kappa, \end{cases} \tag{11}$$

illustrated in Figure 3(b). This activation function introduces a jump discontinuity, yielding piece-wise continuous functions. While this idea can likely be transferred to other activation functions, we restrict our focus to the family of discrete ReLU activation functions.

Glorot et al. [11] note that too much sparsity can negatively affect the predictive accuracy. Indeed, this might be an issue during the training stage; however, a fine-tuned jump value can improve the robustness of the model during the validation stage by introducing an extra amount of sparsity. The tuning parameter $\kappa$ can be used to control the trade-off between predictive accuracy and robustness of the model. Importantly, JumpReLU can be used to retrofit previously trained networks in order to mitigate the risk of being fooled by adversarial examples.

Note that the jump value can be tuned cheaply during the validation stage once the network is trained, i.e., without the need for expensive retraining of the model.
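
As a minimal sketch of the activation described in this section, the function below passes a signal only if it reaches the jump value. The name `jump_relu`, the scalar interface, and the boundary convention (inputs equal to the jump value pass) are assumptions made for illustration; the authors' research code may differ in these details.

```python
def relu(z):
    """Standard ReLU: the positive part of z."""
    return max(z, 0.0)


def jump_relu(z, kappa=0.0):
    """JumpReLU sketch: return z if it reaches the jump value kappa, else 0.

    kappa must be non-negative; kappa = 0 reproduces the standard ReLU.
    """
    if kappa < 0:
        raise ValueError("kappa must be non-negative")
    return z if z >= kappa else 0.0


# Small positive signals are suppressed once kappa > 0; larger ones pass unchanged.
signals = [-1.0, 0.2, 0.5, 2.0]
print([relu(z) for z in signals])
print([jump_relu(z, kappa=0.5) for z in signals])
```

Because the function has no trainable parameters, swapping it in for the ReLU of a trained network changes only the forward pass.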

(a) ReLU and Swish (dashed).

(b) JumpReLU activation function.
Figure 3. The rectified linear unit is the most widely studied activation function in the context of adversarial attacks, illustrated in (a). In addition, its smooth approximation (Swish) is shown, with . The JumpReLU activation function (b) introduces robustness and an additional amount of sparsity, controlled via the jump value (threshold value) $\kappa$. In other words, JumpReLU suppresses small positive signals.

4. Experiments

We first outline the setup which we use to evaluate the performance of the proposed JumpReLU activation function. (Research code is available at https://github.com/erichson/JumpReLU.) We restrict our evaluation to MNIST and CIFAR10, since these are the two standard datasets which are most widely used in the literature to study adversarial attacks.

  • The MNIST dataset [21] provides 28×28 gray-scale image patches for 10 classes (digits), comprising 60,000 instances for training, and 10,000 examples for validation. For our experiments we use a LeNet5 architecture with an additional dropout layer, which we denote as LeNetLike.

  • The CIFAR10 dataset [18] provides 32×32 RGB image patches for 10 classes, comprising 50,000 instances for training, and 10,000 examples for validation. For our CIFAR10 experiments we use a simple AlexLike architecture proposed by [8]; a wide residual network (WideResNet) architecture [49] of depth 30 and with width factor ; and a MobileNetV2 architecture which uses inverted residuals and linear bottlenecks [41].

We aim to match the experimental setup for creating adversarial examples as closely as possible to prior work. Thus, we follow the setup proposed by Madry et al. [28] and Buckman et al. [5]. More concretely, we use for all MNIST experiments and steps for iterative attacks; for experiments on CIFAR10 we use the same , and steps for PGD and Deep Fool attacks. For the trust region attack method we use steps. Note that these values are chosen by following the assumption that an attacker aims to construct adversarial examples which have imperceptible perturbations. Further, we assume that the attacker has only a limited budget of computational resources at their disposal.

We also evaluate the effectiveness of the JumpReLU for adversarially trained networks. Training the model with adversarial examples drastically increases the robustness with respect to the specific attack method used to generate the adversarial training examples [13]. Here, we use the FGSM method to craft examples for adversarial training with for MNIST, and for CIFAR10. Unlike Madry et al. [28], we perform robust training with mixed batches composed of both clean and adversarial examples. This leads to an improved accuracy on clean examples, while being slightly less robust. The specific ratio of the numbers of clean to adversarial examples can be seen as a tuning parameter, which may depend on the application.
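
The mixed-batch idea can be sketched as follows. This is an illustrative sketch only: `make_mixed_batch`, the `adv_fraction` parameter, and the `craft_adversarial` callback are hypothetical names, and the actual ratio and crafting routine used in the experiments are not specified here.

```python
import random


def make_mixed_batch(batch, craft_adversarial, adv_fraction, rng):
    """Replace a fraction of the (input, label) pairs with adversarial versions.

    batch:             list of (x, y) pairs
    craft_adversarial: callable (x, y) -> perturbed x, e.g. an FGSM routine
    adv_fraction:      share of the batch to perturb, between 0 and 1
    rng:               random.Random instance, for reproducible sampling
    """
    n_adv = round(adv_fraction * len(batch))
    adv_indices = set(rng.sample(range(len(batch)), n_adv))
    mixed = []
    for i, (x, y) in enumerate(batch):
        if i in adv_indices:
            mixed.append((craft_adversarial(x, y), y))  # label stays unchanged
        else:
            mixed.append((x, y))
    return mixed


# Toy usage with scalar inputs and a stand-in "attack" that shifts the input.
toy_batch = [(float(i), i % 2) for i in range(10)]
mixed = make_mixed_batch(toy_batch, lambda x, y: x + 100.0, 0.5, random.Random(0))
```

Raising `adv_fraction` shifts the batch toward adversarial examples, which corresponds to the tuning parameter mentioned above.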

4.1. Results

In the following, we compare the performance of JumpReLU to the standard ReLU activation function for both gray-box and white-box attack scenarios. For each scenario, we consider three different iterative attack methods: the Projected Gradient Descent (PGD) method, the Deep Fool (DF) method using both the $L_\infty$ norm (denoted as DF∞) and the $L_2$ norm (denoted as DF2), as well as the Trust Region (TR) attack method. (We use the TR method as a surrogate for the more popular Carlini and Wagner (CW) [8] attack method. This is because the CW method requires enormous amounts of computational resources to construct adversarial examples. For instance, it takes about one hour to construct adversarial examples for CIFAR10 using the CW method, despite using a state-of-the-art GPU and the implementation provided by [39]. Yao et al. [48] show that the TR method requires similar average and worst case perturbation magnitudes as the CW method does in order to attack a specific network.)

4.1.1. Gray-box attack scenario

We start our evaluation by considering the gray-box attack scenario. In this “vanilla” flavored setting, we assume that the adversary has only partial knowledge about the model. Here, the adversary has full access to the ReLU network to construct adversarial examples, but it has no information about the JumpReLU setting during inference time. In other words, the ReLU network is used as a source network to craft adversarial examples which are then used to attack the JumpReLU network. We present results for models trained on clean data only (base) and adversarially trained models (robust). Table 1 shows a summary of results for MNIST and CIFAR10 using different network architectures. The positive benefits of JumpReLU are pronounced, while the loss of accuracy on clean examples is moderate.

First, Tab. 1(a) shows for MNIST that the retrofitted models have a substantially increased resilience to gray-box attacks. Especially the adversarial examples, which are crafted using the PGD method, turn out to be ineffective for fooling both the retrofitted base and robust (highlighted in gray) models. Further, we can see that the JumpReLU increases the resilience to the DF and TR attack methods.

Next, Tables 1(b), 1(c), and 1(d) show results for CIFAR10. Clearly, the more complex residual networks (Tab. 1(c) and Tab. 1(d)) appear to be more vulnerable than the simpler AlexLike network (Tab. 1(b)). The JumpReLU is able to prevent the PGD gray-box attack on the AlexLike network, whereas the stand-alone JumpReLU is insufficient to defend the base WideResNet and MobileNetV2. Still, JumpReLU is able to substantially increase the robustness with respect to the Deep Fool and Trust Region attacks.

Surprisingly, the JumpReLU is able to substantially improve the resilience of robustly trained models. In case of the PGD gray-box attack, the retrofitted model improves the accuracy from 60.43% to 70.25% for the WideResNet (Tab. 1(c)) and from 53.98% to 66.37% for the MobileNetV2 (Tab. 1(d)). Indeed, this demonstrates the flexibility of our approach and shows that retrofitting is not limited to weak models only.

While we see that the adversarially trained models are more robust with respect to the specific attack method used for training, it can also be seen that such models provide no significant protection for other attack methods. In contrast, our defense strategy based on the JumpReLU is agnostic to specific attack methods, i.e., we improve the robustness with respect to all attacks considered here. Note we could further increase the jump value for the robust models, in order to increase the robustness to the Deep Fool and TR attack method. However, this comes with the price of sacrificing slightly more accuracy on clean data.

Appendix A provides additional results for the gray-box attack scenario, showing that the crafted adversarial examples are “unidirectional,” in the sense that adversarial examples crafted by using source models which have a low jump value can be used to attack models which have a higher jump value, but not vice versa.
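
The gray-box protocol described in this section (craft on the ReLU source model, evaluate on the JumpReLU target) can be sketched as follows. This is a toy illustration under stated assumptions: a tiny two-unit network with hand-picked weights, a single signed-gradient step as the attack, and hypothetical function names. It shows the evaluation procedure only and is not meant to reproduce the reported numbers.

```python
def sign(v):
    return (v > 0) - (v < 0)


def activation(z, kappa):
    """JumpReLU with jump value kappa; kappa = 0 gives the ReLU source model."""
    return z if (z > 0.0 and z >= kappa) else 0.0


def forward(x, params, kappa):
    """Return hidden pre-activations and the output logit of a one-hidden-layer net."""
    W, b, v, c = params
    pre = [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    logit = sum(vj * activation(p, kappa) for vj, p in zip(v, pre)) + c
    return pre, logit


def predict(x, params, kappa):
    return 1 if forward(x, params, kappa)[1] > 0.0 else 0


def craft_on_source(x, y, params, eps):
    """One signed-gradient step computed on the ReLU source model (kappa = 0)."""
    W, b, v, c = params
    pre, _ = forward(x, params, 0.0)
    grad = [0.0] * len(x)
    for row, p, vj in zip(W, pre, v):
        if p > 0.0:  # only active ReLU units contribute to the logit gradient
            for i, wi in enumerate(row):
                grad[i] += vj * wi
    direction = -1 if y == 1 else 1  # push the logit away from the true class
    return [xi + eps * direction * sign(gi) for xi, gi in zip(x, grad)]


def gray_box_accuracy(data, params, eps, kappa_target):
    """Accuracy of the target model on examples crafted against the source model."""
    hits = 0
    for x, y in data:
        x_adv = craft_on_source(x, y, params, eps)
        hits += predict(x_adv, params, kappa_target) == y
    return hits / len(data)


# Toy usage: hand-picked weights and two labeled points (purely illustrative).
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, -1.0], 0.0)
data = [([1.0, 0.8], 1), ([0.2, 1.0], 0)]
acc_source = gray_box_accuracy(data, params, eps=0.15, kappa_target=0.0)
acc_target = gray_box_accuracy(data, params, eps=0.15, kappa_target=0.5)
```

The key point of the protocol is that `craft_on_source` never sees `kappa_target`, which mirrors the assumption that the adversary has no information about the JumpReLU setting at inference time.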

Model              Accuracy  PGD     DF∞     DF2     TR
ReLU (Base)*       99.55%    66.69%  0.0%    0.0%    0.0%
JumpReLU (Base)    99.53%    91.65%  81.39%  58.93%  58.90%
ReLU (Robust)*     99.50%    91.39%  0.0%    0.0%    0.0%
JumpReLU (Robust)  99.47%    97.07%  70.84%  45.17%  53.24%
(a) Results for LeNetLike network (MNIST).

Model              Accuracy  PGD     DF∞     DF2     TR
ReLU (Base)*       89.46%    6.38%   0.0%    0.0%    0.0%
JumpReLU (Base)    87.52%    45.75%  61.82%  60.55%  53.08%
ReLU (Robust)*     87.93%    51.88%  0.0%    0.0%    0.0%
JumpReLU (Robust)  86.19%    67.65%  52.28%  46.9%   51.52%
(b) Results for AlexLike network (CIFAR10).

Model              Accuracy  PGD     DF∞     DF2     TR
ReLU (Base)*       94.31%    0.0%    0.0%    0.0%    0.0%
JumpReLU (Base)    92.58%    0.39%   37.33%  40.21%  45.90%
ReLU (Robust)*     93.72%    60.43%  0.0%    0.0%    0.0%
JumpReLU (Robust)  93.01%    70.25%  28.62%  26.11%  35.33%
(c) Results for WideResNet (CIFAR10).

Model              Accuracy  PGD     DF∞     DF2     TR
ReLU (Base)*       92.07%    0.0%    0.0%    0.0%    0.0%
JumpReLU (Base)    90.43%    0.54%   40.69%  41.61%  43.18%
ReLU (Robust)*     91.69%    53.98%  0.0%    0.0%    0.0%
JumpReLU (Robust)  90.12%    66.37%  37.31%  35.45%  40.4%
(d) Results for MobileNetV2 (CIFAR10).
Table 1. Summary of results for gray-box attacks. The numbers indicate the accuracy, i.e., the percentage of correctly classified instances (higher numbers indicate better robustness). Here, the ReLU network (indicated by a ‘*’) is used as the source model to generate adversarial examples.

Model              Accuracy  PGD     DF∞      DF2      TR
ReLU (Base)        99.55%    66.69%  (17.9%)  (21.8%)  (18.9%)
JumpReLU (Base)    99.53%    83.21%  (34.1%)  (44.9%)  (25.0%)
ReLU (Robust)      99.50%    91.39%  (28.4%)  (31.4%)  (24.7%)
JumpReLU (Robust)  99.47%    94.36%  (46.6%)  (53.3%)  (32.8%)
Madry [28]         98.80%    93.20%  -        -        -
Vanilla [5]        99.03%    91.36%  -        -        -
One-hot [5]        99.01%    93.77%  -        -        -
Thermo [5]         99.23%    93.70%  -        -        -
(a) Results for LeNetLike network (MNIST).

Model              Accuracy  PGD     DF∞      DF2      TR
ReLU (Base)        89.46%    6.38%   (1.2%)   (1.5%)   (1.3%)
JumpReLU (Base)    87.52%    18.56%  (9.80%)  (10.6%)  (1.7%)
ReLU (Robust)      87.93%    51.88%  (3.6%)   (4.2%)   (3.6%)
JumpReLU (Robust)  86.19%    56.70%  (13.2%)  (14.1%)  (4.3%)
(b) Results for AlexLike network (CIFAR10).

Model              Accuracy  PGD     DF∞      DF2      TR
ReLU (Base)        94.31%    0.37%   (1.4%)   (1.8%)   (1.3%)
JumpReLU (Base)    92.58%    0.95%   (14.3%)  (18.5%)  (1.9%)
ReLU (Robust)      93.72%    60.43%  (6.4%)   (7.5%)   (4.8%)
JumpReLU (Robust)  93.01%    67.89%  (44.4%)  (43.8%)  (6.1%)
Madry [28]         87.3%     50.0%   -        -        -
Vanilla [5]        87.16%    34.71%  -        -        -
One-hot [5]        92.19%    58.96%  -        -        -
Thermo [5]         92.32%    65.67%  -        -        -
(c) Results for WideResNet (CIFAR10).

Model              Accuracy  PGD     DF∞      DF2      TR
ReLU (Base)        92.07%    0.74%   (0.7%)   (0.9%)   (0.7%)
JumpReLU (Base)    91.10%    0.92%   (5.3%)   (6.8%)   (1.0%)
ReLU (Robust)      91.69%    53.98%  (4.7%)   (5.3%)   (4.1%)
JumpReLU (Robust)  90.12%    59.66%  (62.6%)  (51.4%)  (4.9%)
(d) Results for MobileNetV2 (CIFAR10).
Table 2. Summary of results for white-box attacks. The numbers indicate the accuracy, i.e., the percentage of correctly classified instances (higher numbers indicate better robustness). The Deep Fool method is able to fool all instances using only iterations, hence we show here the average minimum perturbations in parentheses. The best performance in each category is highlighted in bold letters.

4.1.2. White-box attack scenario

We next consider the more challenging white-box attack scenario. Here, the adversary has full knowledge about the model under attack, and it can access its gradients. This is the more important scenario in practice, where it is highly likely that the attacker has access to the model.

Table 2 summarizes the results for the different datasets and architectures under consideration. Again, we see some considerable improvements for the retrofitted models; in particular, the retrofitted robustly trained WideResNet (Tab. 2(c)) and MobileNetV2 (Tab. 2(d)) excel. The performance of JumpReLU is even competitive in comparison to more sophisticated techniques such as one-hot and thermometer encoding (the authors provide only scores for the FGSM and PGD attack method) [5]. In case of the PGD white-box attack, our retrofitted model (robust) achieves 94.36% accuracy for MNIST (Tab. 2(a)), whereas one-hot encoding achieves only 93.77%. The defense performance is also competitive for the WideResNet, where we achieve about 68% accuracy compared to the thermometer method which achieves 65.67%. Note that Buckman et al. [5] also present results for models trained with adversarial examples which outperform the results shown here. Nevertheless, these models have a lower accuracy for clean data.

Again, we want to stress the fact that JumpReLU does not require that the model is re-trained from scratch. We can simply select a suitable jump value during the validation stage. The choice of the jump value thereby depends on the desired trade-off between accuracy and robustness, i.e., large jump values improve the robustness, while decreasing the accuracy on clean examples. We also considered comparing with the bounded ReLU method [50], but our preliminary results showed a poor performance of this defense method. A weak performance of the bounded ReLU is also reported by [7].

The adversarially trained (robust) models provide a good defense against the PGD attack (Appendix B contextualizes how the JumpReLU forces the PGD attack method to use more iterations in order to craft adversarial attacks). Yet, Deep Fool is able to fool all instances in the test set using only iterations, and TR using iterations. At first glance, this performance seems to be undesirable. We can see, however, that Deep Fool requires substantially increased average minimum perturbations in order to achieve such a high fool rate. The numbers in parentheses in Table 2 indicate the average minimum perturbations which are needed to achieve a nearly 100 percent fooling rate. These numbers provide a measure for the empirical robustness of the model, which we compute by using the following plug-in estimator

$$\hat{\rho}_F := \frac{1}{n} \sum_{i=1}^{n} \frac{\|\Delta x_i\|_p}{\|x_i\|_p} \tag{12}$$

with . Here, we compute the relative perturbations, rather than absolute perturbations. This is because the relative measure provides a more intuitive interpretation, i.e., the numbers reflect the average percentage of changed information in the adversarial examples.
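
A plug-in estimate of this kind, i.e., the mean over examples of the perturbation norm divided by the norm of the clean input, can be sketched in a few lines. The function names are hypothetical, and the default choice of the 2-norm is an assumption made for the example rather than a setting taken from the experiments.

```python
def p_norm(v, p):
    """p-norm of a flat list of numbers; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(t) for t in v)
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def empirical_robustness(clean, adversarial, p=2):
    """Average relative size of the perturbations that fooled the model.

    clean, adversarial: equally long lists of flat inputs, where adversarial[i]
    is the (minimal) adversarial example found for clean[i].
    """
    ratios = []
    for x, x_adv in zip(clean, adversarial):
        delta = [a - c for a, c in zip(x_adv, x)]
        ratios.append(p_norm(delta, p) / p_norm(x, p))
    return sum(ratios) / len(ratios)


# Toy usage: both perturbations are 10% of the clean input's 2-norm.
clean = [[3.0, 4.0], [1.0, 0.0]]
adversarial = [[3.0, 4.5], [1.0, 0.1]]
rho_hat = empirical_robustness(clean, adversarial)
```

Multiplying the result by 100 gives a percentage that can be read in the same way as the parenthesized entries of Table 2.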

The numbers show that the retrofitted models feature an improved robustness, while maintaining a “good” predictive accuracy for clean examples. For MNIST, the noise levels need to be increased by a factor of about in order to achieve a 100 percent fooling rate. Here, we set the jump value to . For CIFAR10, we achieve a strong resilience to the Deep Fool attacks, i.e., the noise levels are required to be increased by a factor of to to achieve a successful attack.

Clearly, we can see that the TR method is a stronger attack than Deep Fool. However, we are still able to achieve an improved resilience to this strong attack. For instance, the TR attack requires average minimum perturbations of about 6.1% to attack the retrofitted robust WideResNet (Tab. 2(c)) and about 4.9% to attack the retrofitted robust MobileNetV2 (Tab. 2(d)). These high levels of perturbations are critical in the sense that they are no longer imperceptible for humans, i.e., they render the attack less useful in practice.

4.1.3. Performance trade-offs

(a) LeNetLike network (MNIST)
(b) AlexLike network (CIFAR10)
(c) WideResNet (CIFAR10)
(d) MobileNetV2 (CIFAR10)
Figure 4. JumpReLU performance trade-offs for MNIST and CIFAR10. The left axis shows the average predictive accuracy for clean examples for varying values of the jump value. The right axis shows the average minimum perturbations required to construct adversarial examples which achieve a nearly 100 percent fooling rate.

As mentioned, the JumpReLU activation function provides a trade-off between robustness and classification accuracy. The user can control this trade-off in a post-training stage by tuning the jump value $\kappa$, where $\kappa = 0$ recovers the ReLU activation function. Of course, the user needs to decide how much accuracy on clean data they are willing to sacrifice in order to buy more robustness. However, this sacrifice is common to most robustification strategies. For instance, for adversarial training one must choose the ratio between clean and adversarial examples used for training, where a higher ratio of adversarial to clean examples improves the robustness while decreasing the predictive accuracy. Thus, the decision of a “good” jump value is application dependent.
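
One simple way to pick the jump value on validation data is sketched below: sweep a list of candidates and keep the largest one whose clean accuracy stays within a chosen budget of the ReLU baseline. The selection rule, the `max_drop` budget, and the one-dimensional toy classifier are all assumptions made for this illustration; the paper leaves the choice of a suitable jump value to the application.

```python
def jump_relu(z, kappa):
    """Pass z only if it reaches the jump value kappa (kappa = 0 is the ReLU)."""
    return z if z >= kappa else 0.0


def toy_predict(x, kappa):
    """Toy classifier: predict class 1 if the activated feature is positive."""
    return 1 if jump_relu(x, kappa) > 0.0 else 0


def accuracy(data, kappa):
    correct = sum(1 for x, y in data if toy_predict(x, kappa) == y)
    return correct / len(data)


def select_jump_value(data, candidates, max_drop):
    """Largest candidate whose clean accuracy is within max_drop of the baseline."""
    baseline = accuracy(data, 0.0)
    feasible = [k for k in candidates if baseline - accuracy(data, k) <= max_drop]
    return max(feasible) if feasible else 0.0


# Toy validation set of (feature, label) pairs; no retraining is involved.
validation = [(-1.0, 0), (-0.5, 0), (0.2, 1), (0.6, 1), (1.5, 1), (2.0, 1)]
best_kappa = select_jump_value(validation, [0.0, 0.4, 1.0], max_drop=0.2)
```

In practice the accuracy function would evaluate the retrofitted network on held-out data, and the budget would be weighed against a robustness measure such as the average minimum perturbation.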

Figure 4 shows this trade-off for different network architectures. We see that the jump value is positively correlated with the level of perturbation which is required in order to achieve a 100 percent fooling rate. Choosing larger jump values increases the robustness of the model, while sacrificing only a slight amount of predictive accuracy. It can be seen that larger jump values only marginally affect the accuracy of the LeNetLike network on clean examples, while the other networks are more sensitive.

4.1.4. Visual results

The interested reader may ask whether the increased adversarial perturbations are of any practical significance. To address this question, we show some visual results which illustrate the magnitude of the effect. Recall the aim of the adversary is to construct unobtrusive adversarial examples.

Figure 5 shows both clean and adversarial examples for the MNIST dataset, which are crafted by the Deep Fool algorithm. Clearly, the adversarial examples which are needed to fool the retrofitted LeNetLike network are visually distinct from those examples which are sufficient to fool the unprotected model. We also show the corresponding perturbation patterns, i.e., the absolute pixel-wise difference between the clean and adversarial examples, to better illustrate the difference. Note that we use a “reds” color scheme here: white indicates no perturbations, light red indicates very small perturbations, and dark red indicates large perturbations.
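
The perturbation pattern described above can be computed in a few lines. The sketch below treats an image as a flat list of pixel intensities. The pixel-wise absolute difference follows the description in the text, whereas the relative noise level (the L2 norm of the perturbation divided by the L2 norm of the clean image) is an assumption about how such levels might be normalized. Both function names are hypothetical.

```python
import math


def perturbation_pattern(clean, adversarial):
    """Absolute pixel-wise difference between clean and adversarial image."""
    return [abs(a - c) for c, a in zip(clean, adversarial)]


def relative_noise_level(clean, adversarial):
    """Assumed normalization: ||adversarial - clean||_2 / ||clean||_2."""
    diff_norm = math.sqrt(sum((a - c) ** 2 for c, a in zip(clean, adversarial)))
    clean_norm = math.sqrt(sum(c ** 2 for c in clean))
    return diff_norm / clean_norm


clean = [0.0, 0.5, 1.0]        # toy "image" with three pixels
adversarial = [0.0, 0.6, 0.9]  # toy adversarial version of the same image

print(perturbation_pattern(clean, adversarial))
print(round(relative_noise_level(clean, adversarial), 4))
```

Pixels with a pattern value of zero correspond to the white regions in the figures, and larger values to darker shades of red.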

Next, Figure 6 shows visual results for the CIFAR10 dataset. It is well known that models for this dataset are highly vulnerable, i.e., very small perturbations are already sufficient for a successful attack. Indeed, the minimal perturbations which are needed to fool the unprotected network (here we show results for the baseline AlexLike network) are nearly imperceptible by visual inspection. In contrast, the adversarial examples crafted to attack the retrofitted model show distinct perturbation patterns, and one can recognize that the examples were altered.

In summary, the visual results put the previously presented relative noise levels into perspective, and they show that average minimum perturbations of about to are clearly visible. Thus, it can be concluded that the JumpReLU is an effective strategy for improving the model robustness.

4.1.5. Adversarial detection

As a proof of concept, we demonstrate that the increased minimum perturbations required to attack the retrofitted model can help to improve the discriminative power of add-on adversarial detectors. While adversarial perturbations are often visually imperceptible to humans, add-on detectors aim to discriminate between clean and adversarial examples using inputs from intermediate feature representations of a model. Indeed, these specifically trained detectors have been shown to be highly effective for detecting adversarial examples [30, 14, 25]. Yet, there is also work which shows that adversarial detectors can be fooled (bypassed) if the attacker is aware of their presence [6]. However, such attacks need to be more sophisticated than the commonly used attack methods.

We follow the work by Ma et al. [27], who use the idea of Local Intrinsic Dimensionality (LID) to characterize adversarial subspaces. The idea is that clean and adversarial examples show distinct patterns, so that the LID characteristics make it possible to discriminate between such examples.
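
For intuition, a widely used way to estimate LID is the maximum-likelihood estimator based on the distances from a point to its k nearest neighbors. The sketch below implements that estimator in plain Python. Whether the detector used here relies on exactly this form is an assumption, and the function name `lid_mle` as well as the synthetic distances are hypothetical.

```python
import math


def lid_mle(neighbor_distances):
    """Maximum-likelihood LID estimate from k nearest-neighbor distances.

    Computes -1 / mean(log(r_i / r_k)), where r_k is the largest distance.
    Distances must be positive and not all equal.
    """
    r = sorted(neighbor_distances)
    r_max = r[-1]
    mean_log_ratio = sum(math.log(d / r_max) for d in r) / len(r)
    return -1.0 / mean_log_ratio


k = 20
# Synthetic distances: the i-th neighbor of a point in a d-dimensional
# uniform cloud sits at a radius roughly proportional to (i / k) ** (1 / d).
dist_1d = [i / k for i in range(1, k + 1)]
dist_2d = [(i / k) ** 0.5 for i in range(1, k + 1)]

print(lid_mle(dist_1d))
print(lid_mle(dist_2d))  # about twice the 1-D estimate
```

The estimate grows with the dimension of the local neighborhood, which is the property a LID-based detector exploits when adversarial examples occupy regions with unusual local structure.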

Intuitively, adversarial examples which show increased perturbation patterns should feature more extreme LID characteristics. Hence, a potential application of the JumpReLU is to combine it with an LID-based detector. Table 3 shows the area under the receiver operating characteristic curve (AUC) as a measure of the discriminative power between clean and adversarial examples. Indeed, the results show that the combination with JumpReLU improves the detection performance for CIFAR10.
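
As a reminder of what the AUC measures, it equals the probability that a randomly chosen adversarial example receives a higher detector score than a randomly chosen clean example, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The function name `auc` and the detector scores are hypothetical toy values, not outputs of the LID detector.

```python
def auc(clean_scores, adversarial_scores):
    """Area under the ROC curve via pairwise comparison of detector scores.

    Counts the fraction of (adversarial, clean) pairs in which the
    adversarial example scores higher; ties contribute one half.
    """
    wins = 0.0
    for a in adversarial_scores:
        for c in clean_scores:
            if a > c:
                wins += 1.0
            elif a == c:
                wins += 0.5
    return wins / (len(adversarial_scores) * len(clean_scores))


# Toy detector scores (illustrative only).
clean_scores = [0.10, 0.40, 0.35]
adversarial_scores = [0.80, 0.30, 0.50]

print(round(100 * auc(clean_scores, adversarial_scores), 2))  # 77.78
```

A value of 50 corresponds to chance-level discrimination and 100 to perfect separation, which puts the AUC scores in Table 3 into context.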

Model            PGD     DF      DF      TR
LID + ReLU       72.54   73.41   72.93   72.47
LID + JumpReLU   74.25   78.24   75.84   74.71
Table 3. AUC scores as a measure of the discriminative power between clean and adversarial examples using LID characteristics. Here we compare ReLU and JumpReLU.
(a) Clean examples which are used for training. (b) Adversarial examples to fool the model without defense. (c) Perturbation patterns to fool the model without defense. (d) Adversarial examples to fool the retrofitted model. (e) Perturbation patterns to fool the retrofitted model.
Figure 5. Visual results for MNIST to verify the effect of the JumpReLU defense strategy against the DF attack. Noticeably higher levels of perturbations are required in order to successfully attack the retrofitted network. Subfigures (c) and (e) show the corresponding perturbation patterns.
(a) Clean examples which are used for training. (b) Adversarial examples to fool the model without defense. (c) Perturbation patterns to fool the model without defense. (d) Adversarial examples to fool the retrofitted model. (e) Perturbation patterns to fool the retrofitted model.
Figure 6. Visual results for CIFAR10 to verify the effect of the JumpReLU defense strategy against the DF attack. By visual inspection, it can be seen that the DF attack requires noticeably higher levels of perturbations in order to successfully attack the retrofitted network.

5. Conclusion

We have proposed a new activation function, the JumpReLU, which, when used in place of a ReLU in an already-trained model, leads to a trade-off between predictive accuracy and robustness. This trade-off is controlled by a parameter, the jump size, which can be tuned during the validation stage. That is, no additional training of the pre-trained model is needed when the JumpReLU function is used. (Of course, if one wanted to perform additional expensive training, then one could do so.) Our experimental results show that this simple and inexpensive strategy improves the resilience of previously-trained networks to adversarial attacks. Appendix C explores an extension of the JumpReLU. Randomness as a resource to improve model robustness has been demonstrated before within the defense literature. Motivated by this observation, we introduce the randomized JumpReLU and show that a small amount of randomness can help to improve the model robustness even further.

Limitations of our approach are standard for current adversarial defense methods, in that stand-alone methods do not guarantee a holistic protection and that sufficiently high levels of perturbation will be able to break the defense. That being said, JumpReLU can easily be used as a stand-alone approach to “retrofit” previously-trained networks, improving their robustness, and it can also be used to support other more complex defense strategies.

Appendix A A second look at the gray-box attack properties of the JumpReLU

(a) Transferability for MNIST.
(b) Transferability for CIFAR10.
Figure 7. Gray-box attack matrix for different jump values. Each cell indicates the predictive accuracy of a model retrofitted with the jump value (target), which is being attacked using adversarial examples generated by a model with jump value (source). Higher cell values indicate better robustness.

We present an extended set of results for the gray-box attack scenario. Specifically, we study the situation where the adversary has full access to a retrofitted model (which has a fixed jump value) in order to construct adversarial examples, but the adversary has no information about the jump value of the target network during inference time.

Here, the adversarial examples are crafted by using the projected gradient descent (PGD) attack method. Figure 7 shows the efficiency of a non-targeted attack on networks using different jump values. Note that we run the attack with a large number of iterations, enough so that the crafted adversarial examples achieve a nearly 100 percent fool rate for the source model.
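
To recall how such an attack proceeds, the following is a minimal sketch of a non-targeted PGD attack under an L-infinity budget. It is not the experimental setup used here: the target is a toy logistic-regression classifier rather than a deep network, and the budget `eps`, the step size `step`, the iteration count, and all function names are assumptions chosen for illustration.

```python
import math


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    return 1 if logit(w, b, x) > 0 else 0


def pgd_linf(w, b, x, y, eps, step, iters):
    """Non-targeted PGD on a toy logistic-regression model.

    Each iteration takes a signed gradient-ascent step on the
    cross-entropy loss, projects back onto the eps-ball around x,
    and clips to the valid pixel range [0, 1].
    """
    x_adv = list(x)
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-logit(w, b, x_adv)))
        grad = [(p - y) * wi for wi in w]  # d(loss)/d(x) for this model
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            v = x_adv[i] + step * sign
            v = min(max(v, x[i] - eps), x[i] + eps)  # projection step
            x_adv[i] = min(max(v, 0.0), 1.0)         # keep pixels valid
    return x_adv


w, b = [1.0, -1.0], 0.0
x, y = [0.6, 0.4], 1  # clean example, correctly classified as class 1

x_adv = pgd_linf(w, b, x, y, eps=0.2, step=0.05, iters=10)
print(predict(w, b, x), predict(w, b, x_adv))  # 1 0
```

In this toy case the attack flips the prediction while staying inside the perturbation budget. Against a more robust model, more iterations or a larger budget would be needed to achieve the same effect.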

We see that the attack is unidirectional, i.e., adversarial examples crafted by using source models which have a low jump value can be used to attack models which have a higher jump value. However, retrofitted models which have a low jump value are resilient to adversarial examples generated by source models which have a large jump value. Thus, one could robustify the network by using a large jump size for evaluating the gradient, while using a smaller jump size for inference. Of course, this is a somewhat pathological setup, designed to illustrate and validate properties of the method, yet these results reveal some interesting behavioral properties of the JumpReLU.

Appendix B Accuracy vs number of iterations

Iterative attack methods can be computationally demanding if a large number of iterations is required to craft strong adversarial examples. Of course, it is an easy task to break any defense with unlimited time and computational resources. However, it is the aim of an attacker to design efficient attack strategies (i.e., fast generation of examples which have minimal perturbations), while the defender aims to make models more robust to these attacks (i.e., force the attacker to increase the average minimal perturbations which are needed to fool the model).

Figure 8 shows the accuracy versus the number of iterations for the PGD attack. Attacking the retrofitted model requires a larger number of iterations in order to achieve the same fool rate as for the unprotected network. This is important because a larger number of iterations requires more computational resources and more computational time. To put the numbers into perspective, it takes about minutes to run iterations to attack the unprotected WideResNet. In contrast, it takes about minutes to run iterations to attack the retrofitted model.

(a) LeNetLike network (MNIST).
(b) AlexLike (CIFAR10).
(c) Robust WideResNet (CIFAR10).
(d) Robust MobileNetV2 (CIFAR10).
Figure 8. Strength of the PGD attack for increasing numbers of iterations. The PGD method requires a large number of iterations to craft strong adversarial examples. The JumpReLU increases the model robustness, i.e., the fooling rate is reduced for a fixed number of iterations.

Appendix C Randomized JumpReLU

Several state-of-the-art defense strategies rely on randomness as a resource for improving model robustness. Here, we explore whether a randomized version of the JumpReLU can help to further improve the model robustness.

More concretely, the randomized JumpReLU selects a random jump value in a specified range for every forward pass. The underlying idea is that this approach leads to obfuscated gradients. It has been convincingly demonstrated that obfuscated gradients do not guarantee safety [3]. Nevertheless, our aim is to evaluate whether the randomized JumpReLU leads to increased average minimal perturbations. Table 4 shows the results for the white-box attack scenario. For comparison, we also show the results for the deterministic JumpReLU. Here, we chose the jump value so that the retrofitted models, using the deterministic and randomized JumpReLU, have roughly the same accuracy for clean data.

First, we note that the average minimal perturbations are increased in all situations, especially those for the TR attack method. For instance, the TR attack needs to increase the average minimal perturbations from to to attack the robust WideResNet, yet it achieves only a fool rate of . We see a similar behavior for the MobileNetV2 architecture. There are substantial gains in terms of the model robustness, and this renders the TR attack useless, since adversarial examples featuring such large perturbation patterns are easy to detect. Second, we see that in many instances the different attack methods fail to achieve a nearly 100 percent fool rate despite the increased perturbations.

This leads to the conclusion that randomness can indeed help to improve the robustness. However, the drawback is that this approach requires a second tuning parameter, because we sample the jump value from a uniform distribution with support . For our experiments, we simply set . We have not explored different settings and leave this open as a future research direction.
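
The sampling step can be sketched as follows. The code draws one jump value per forward pass from a uniform distribution and then applies it as a threshold. The bounds `low` and `high` are placeholders, since the support used in the experiments is not reproduced here, and the function name `randomized_jump_relu_layer` is hypothetical. As before, the sketch assumes that activations at or below the jump value are suppressed.

```python
import random


def randomized_jump_relu_layer(activations, low, high, rng=random):
    """Apply a JumpReLU whose jump value is redrawn on every call.

    low, high: bounds of the uniform distribution (placeholder values;
               the range used in the experiments is not assumed here)
    rng: source of randomness, e.g. random.Random(seed) for repeatability
    """
    jump_value = rng.uniform(low, high)  # one draw per forward pass
    return [a if a > jump_value else 0.0 for a in activations]


rng = random.Random(0)
pre_activations = [-1.0, 0.05, 0.5]
for _ in range(3):
    print(randomized_jump_relu_layer(pre_activations, 0.0, 0.1, rng))
```

Activations far from the sampled range behave deterministically, while those inside it (0.05 in this toy input) are kept or suppressed depending on the draw, which is what makes the gradient seen by an attacker vary between passes.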

Model           Jump value   Accuracy   PGD      DF              DF              TR
ReLU (Base)     0.00         99.55%     66.69%   0.00% (17.9%)   0.00% (21.8%)   0.00% (18.9%)
JumpReLU (D)    1.00         99.53%     83.21%   0.00% (34.1%)   0.00% (44.9%)   0.00% (25.0%)
JumpReLU (R)    1.00         99.57%     83.49%   5.37% (36.2%)   2.64% (47.6%)   9.61% (37.0%)
ReLU (Robust)   0.00         99.50%     91.39%   0.00% (28.4%)   0.0% (31.4%)    0.00% (24.7%)
JumpReLU (D)    1.00         99.47%     94.36%   0.00% (46.6%)   0.00% (53.3%)   0.00% (32.8%)
JumpReLU (R)    1.00         99.47%     95.17%   5.89% (51.0%)   1.39% (52.8%)   8.08% (44.8%)
(a) Results for LeNetLike network (MNIST).
Model           Jump value   Accuracy   PGD      DF              DF              TR
ReLU (Base)     0.00         89.46%     6.38%    0.00% (1.2%)    0.00% (1.5%)    0.00% (1.3%)
JumpReLU (D)    0.40         87.52%     18.56%   0.00% (9.8%)    0.00% (10.6%)   0.00% (1.7%)
JumpReLU (R)    0.50         88.20%     20.13%   0.00% (7.1%)    0.00% (7.9%)    3.66% (13.1%)
ReLU (Robust)   0.00         87.93%     51.88%   0.00% (3.6%)    0.00% (4.2%)    0.00% (3.6%)
JumpReLU (D)    0.40         86.19%     56.70%   0.00% (13.2%)   0.00% (14.1%)   0.00% (4.3%)
JumpReLU (R)    0.50         86.15%     61.03%   0.00% (13.6%)   1.18% (14.6%)   15.79% (19.7%)
(b) Results for AlexLike network (CIFAR10).
Model           Jump value   Accuracy   PGD      DF              DF              TR
ReLU (Base)     0.00         94.31%     0.37%    0.00% (1.4%)    0.00% (1.8%)    0.00% (1.3%)
JumpReLU (D)    0.07         92.58%     0.95%    0.00% (14.3%)   0.00% (18.5%)   0.00% (1.9%)
JumpReLU (R)    0.09         92.53%     13.40%   0.00% (14.7%)   0.00% (18.3%)   7.53% (5.9%)
ReLU (Robust)   0.00         93.72%     60.43%   0.00% (6.4%)    0.00% (7.5%)    0.00% (4.8%)
JumpReLU (D)    0.07         93.01%     67.89%   0.00% (44.4%)   0.00% (43.8%)   0.00% (6.1%)
JumpReLU (R)    0.09         93.07%     71.82%   0.00% (44.5%)   0.00% (43.8%)   29.27% (18.1%)
(c) Results for WideResNet (CIFAR10).
Model           Jump value   Accuracy   PGD      DF              DF              TR
ReLU (Base)     0.00         92.07%     0.74%    0.00% (0.7%)    0.00% (0.9%)    0.00% (0.7%)
JumpReLU (D)    0.06         91.10%     0.92%    0.00% (5.3%)    0.00% (6.8%)    0.00% (1.0%)
JumpReLU (R)    0.08         90.37%     5.07%    1.36% (8.1%)    1.59% (9.8%)    2.28% (5.2%)
ReLU (Robust)   0.00         91.69%     53.98%   0.00% (4.7%)    0.00% (5.3%)    0.00% (4.1%)
JumpReLU (D)    0.06         90.12%     59.66%   0.00% (62.6%)   0.00% (51.4%)   0.00% (4.9%)
JumpReLU (R)    0.08         90.16%     68.98%   1.43% (65.6%)   1.68% (53.3%)   7.96% (25.8%)
(d) Results for MobileNetV2 (CIFAR10).
Table 4. Summary of results for white-box attacks with randomized JumpReLU. Here (D) denotes the deterministic and (R) denotes the randomized JumpReLU. The numbers indicate the accuracy, i.e., the percentage of correctly classified instances (higher numbers indicate better robustness). In addition, we show the average minimum perturbations in parentheses. The best performance is highlighted in bold letters.

Acknowledgments

We would like to acknowledge ARO, DARPA, NSF, and ONR for providing partial support for this work. We would also like to acknowledge Amazon for providing AWS credits for this project.

References