Denoised Internal Models: a Brain-Inspired Autoencoder against Adversarial Attacks

11/21/2021
by   Kaiyuan Liu, et al.

Despite its great success, deep learning suffers from a severe lack of robustness: deep neural networks are vulnerable to adversarial attacks, even the simplest ones. Inspired by recent advances in brain science, we propose the Denoised Internal Models (DIM), a novel generative autoencoder-based model to tackle this challenge. Simulating the pipeline for visual signal processing in the human brain, DIM adopts a two-stage approach. In the first stage, DIM uses a denoiser to reduce the noise and the dimensionality of the inputs, reflecting the information pre-processing in the thalamus. Inspired by the sparse coding of memory-related traces in the primary visual cortex, the second stage produces a set of internal models, one for each category. We evaluate DIM over 42 adversarial attacks, showing that DIM effectively defends against all of them and outperforms the SOTA on overall robustness.



1 Introduction

The great advances in deep learning (DL) techniques bring us a large number of sophisticated models that approach human-level performance on a broad spectrum of tasks, such as image classification (LeCun et al., 1989; He et al., 2016; Krizhevsky et al., 2012; Szegedy et al., 2016), speech recognition (Amodei et al., 2016; Xiong et al., 2016), and natural language processing (Vaswani et al., 2017; Devlin et al., 2019; Yang et al., 2019b; Gu et al., 2018). Despite this success, deep neural network (DNN) models are vulnerable to adversarial attacks (Szegedy et al., 2014; Biggio et al., 2013; Goodfellow et al., 2014b): by adding human-unrecognizable perturbations, the predictions of the underlying network model can be completely altered (Biggio and Roli, 2018; Goodfellow et al., 2014a; Moosavi-Dezfooli et al., 2016; Athalye and Carlini, 2018). The human brain, in contrast, treated as an information processing system, enjoys remarkably high robustness (Xu and Vaziri-Pashkam, 2021; Athalye et al., 2018b). A question naturally arises: can knowledge about the working mechanism of the human brain help us improve the adversarial robustness of DNN models (Casamassima et al., 2021; Huang et al., 2019)?

Biological systems remain an illuminating source for engineering design. Two famous examples are the perceptron model (Rosenblatt, 1958) and the Rectified Linear Unit (ReLU) activation function (Agarap, 2019). Likewise, the recurrent neural network (RNN) architecture (Elman, 1990) has its origin in the study of how to process time-series data, like natural language. In this work, we draw inspiration from the visual signal processing paradigm of the human brain and propose a novel model to address the robustness issue in image classification.

Figure 1: From the visual signal processing in human brain (A) to the Denoised Internal Models (B).

With the recent progress in neuroscience, we now better understand the information processing pipeline of the human visual system. Two brain areas are involved in this pipeline: the thalamus and the primary visual cortex (Cudeiro and Sillito, 2006). Visual signals from the retina travel to the Lateral Geniculate Nucleus (LGN) of the thalamus before reaching the primary visual cortex (Derrington et al., 1984). The LGN is specialized in handling visual information, helping to process different kinds of stimuli. Moreover, some vertebrates, like zebrafish, have no visual cortex but still possess a neural structure similar to the thalamus to receive and process visual signals (O'Connor et al., 2002). This fact highlights the importance of such an information pre-processing module in biological visual systems. The primary visual cortex is one of the best-studied brain areas; it displays a complex six-layer structure and provides excellent pattern recognition capacities. An important finding (Xie et al., 2014) reveals that Layer 2/3 of the primary visual cortex contains so-called engram cells (Tonegawa et al., 2015), which activate only for specific stimuli and related ones, such as pictures of Jennifer Aniston (Quiroga et al., 2005). In other words, the concepts corresponding to those stimuli are encoded sparsely through the engram cells (McGaugh, 2000; Guan et al., 2016). Furthermore, artificial activation of the engram cells induces the corresponding memory retrieval (Liu et al., 2012, 2014). These discoveries suggest that there could be internal generative models for different kinds of concepts in the human brain.

Simulating the pipeline described above, we propose the Denoised Internal Models (DIM) (see Figure 1 B), which consists of a global denoising network (a.k.a. denoiser) and a set of generative autoencoders, one for each category. The denoiser pre-processes the input data, similar to what the LGN does. The autoencoders can be regarded as internal models for specific concepts, mimicking the function of engram cells in the primary visual cortex. For a comprehensive evaluation of DIM's robustness, we conduct our experiments on MNIST (Lecun et al., 1998), running DIM against 42 attacks from the foolbox v3.2.1 package (Rauber et al., 2017) and comparing its performance to SOTA models. The results show that DIM outperforms the SOTA models on overall robustness and has the most stable performance across the 42 attacks.

2 Related Works

Figure 2: The training phase (A) and the inference phase (B) of DIM.

2.1 Adversarial Attacks

A large number of adversarial attacks have been proposed recently (Tramèr et al., 2018; Rony et al., 2019; Rauber and Bethge, 2020; Hosseini et al., 2017; Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016; Brendel et al., 2020). From the viewpoint of the attacker's knowledge about the model, these attacks can be divided into white-box and black-box ones. The former (Rony et al., 2019; Rauber and Bethge, 2020; Moosavi-Dezfooli et al., 2016; Brendel et al., 2020) have access to all model information, including the network architecture, parameters, and learning mechanisms. The latter (Brendel et al., 2018) have limited or no knowledge of the model but can interact with it through inputs and outputs. Typically, white-box attacks are harder to defend against.

Viewed from the norm type, i.e., the distance measure of the adversarial perturbations, major existing adversarial attacks fall into four categories: ℓ0 attacks (Schott et al., 2019), ℓ1 attacks (Hosseini et al., 2017; Brendel et al., 2020), ℓ2 attacks (Rony et al., 2019; Rauber and Bethge, 2020; Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016), and ℓ∞ attacks (Moosavi-Dezfooli et al., 2016; Brendel et al., 2020).

2.2 Defense Methods

Many defense approaches have been proposed to tackle the robustness challenge in deep learning (Yin et al., 2019; Pang et al., 2019b; Hu et al., 2019; Verma and Swami, 2019; Bafna et al., 2018; Pang et al., 2019a). Roughly, there are four major types.

2.2.1 Adversarial Training

Proposed by Madry et al. (2018), adversarial training is one of the most popular defense methods and can withstand strong attacks. It follows the simple idea of training the model on generated adversarial samples.
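The core idea can be sketched in a few lines. The toy below attacks a hypothetical logistic classifier (our illustrative stand-in, not the paper's setup) with an FGSM-style step; adversarial training simply feeds such crafted samples back into the training loop in place of the clean ones.

```python
import numpy as np

def fgsm_example(w, b, x, y, eps):
    """FGSM-style adversarial example against a toy logistic classifier:
    take a single step of size eps along the sign of the loss gradient."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted probability of class 1
    grad_x = (p - y) * w                     # d(cross-entropy)/dx for this model
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

def xent(w, b, x, y):
    """Binary cross-entropy of the same toy classifier."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Adversarial training replaces (x, y) with (fgsm_example(w, b, x, y, eps), y)
# at every step of the usual training loop.
```

By construction the crafted sample stays within an ℓ∞ ball of radius eps around the clean input while increasing the loss.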

2.2.2 Randomization

This approach (Vaishnavi et al., 2020; Vincent et al., 2010) randomizes the input layer or some intermediate layers to neutralize the adversarial perturbations and help to protect the underlying model.

2.2.3 Gradient Masking

This approach mainly defends against gradient-based attacks by building a model with no useful gradient (Xiao et al., 2019); that is, the gradients of the model outputs with respect to its inputs are almost zero. However, it turns out such a method fails to work in practice (Athalye et al., 2018a).

2.2.4 Generative Models

This approach exploits a generative model, typically a GAN (Samangouei et al., 2018) or an autoencoder (Cintas et al., 2020; Meng and Chen, 2017), to project the high-dimensional inputs onto a low-dimensional manifold. It is generally believed that this reduces the risk of overfitting and improves adversarial robustness (Jang et al., 2019).

Within this type, we would like to mention the ABS model (Schott et al., 2019). Starting from a different point, it also arrived at a design that uses an individual generative model for each category in the dataset.

3 Model

3.1 Biological Inspiration

In this subsection, we take a closer look at the visual signal processing pipeline in the human brain. As depicted in Figure 1 (A), visual perceptual information streams are first received and processed by the LGN in the thalamus before being projected to the primary visual cortex (O'Connor et al., 2002).

The cell bodies in the LGN are arranged in a six-layer structure, where the inner two layers are called the magnocellular layers and the outer four the parvocellular layers (Brodal, 2004). Previous studies (White et al., 2009) reveal that the parvocellular layers are sensitive to color and perceive a high level of detail. The magnocellular layers, on the other hand, are highly sensitive to motion while insensitive to color and detail. In this way, the LGN pre-processes different kinds of visual information.

The primary visual cortex has a complex hierarchical structure of six layers. Roughly speaking, Layer 4 handles the input signals from the LGN, and Layer 5 sends outputs to other regions of the brain. Upon receiving inputs, Layer 4 sends strong signals directly to Layer 2/3 for processing (Markram et al., 2015). Recent advances in neuroscience surprisingly find that contextual information and memory components are sparsely encoded in Layer 2; namely, only a distinct population of neurons in Layer 2 responds to a specific kind of context, and these populations are spatially separated (Xie et al., 2014). Such spatial sparsity reflects typical engram-cell behavior. As mentioned in the introduction, artificial activation of the engram cells induces memory retrieval, suggesting that internal models operate inside the primary visual cortex.

It is worth emphasizing that we only consider a functional-level analogy to the brain's visual signal processing system rather than replicating its precise connections and structure. We regard the LGN as a pre-processor that distills input signals, separating different kinds of information. The primary visual cortex, in turn, corresponds to a set of internal generative models. In this way, we abstract the human visual signal processing system as a two-stage model and base our model design on it.

3.2 Denoised Internal Models

Based on the above abstraction, we propose the two-stage model Denoised Internal Models (DIM). The corresponding schematic diagram is shown in Figure 1 (B). In the first stage, we seek a global denoiser that filters the "true" signals out of the input images, analogous to the function of the LGN in the thalamus. The basic idea is that adversarial perturbations are generally semantically meaningless and can be effectively treated as noise in the raw images. The second stage consists of a set of internal generative models, which operate in a dichotomous sense. Each internal model only accepts images from a distinct category and rejects images from other categories. Upon acceptance, the internal model outputs a reconstructed image; if the input is rejected, it returns a black image. In this way, our model reflects the engram-cell behavior in the primary visual cortex.

One of the main targets of this paper is to evaluate whether a functional level analogy to the brain’s visual signal processing system helps improve the adversarial robustness rather than focusing on specific algorithms. Hence, we have kept our network architecture simple to avoid complexities during evaluation. Details about the architecture and parameter settings can be found in the appendix.

3.2.1 Denoiser

There exist different methods to recover the raw image from a noisy one (Yang et al., 2019a; Candès and Recht, 2009; Chatterjee, 2015; Chen and Chi, 2018). We adopt a simple autoencoder as the denoising network. In the training phase, we add noise to the images in the training dataset. These noisy images serve as the inputs to the denoiser, with the original ones as the learning targets. The model is trained with the mean-square error (MSE) loss to minimize the mean reconstruction error,

L_dn = (1/N) ∑_{i=1}^{N} ‖D(x_i + δ_i) − x_i‖₂²,    (1)

where D denotes the function of the denoiser, x_i the i-th clean training image, δ_i the added noise, and ‖·‖₂ the ℓ2 norm.
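As a sanity check on the denoising objective, here is a minimal numpy sketch of the loss; the noise level, batch size, and image shape are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser_loss(D, xs, noise_std=0.3):
    """MSE loss of Eq. (1): the denoiser D sees the noisy image x + delta
    and is trained to reproduce the clean image x."""
    deltas = rng.normal(0.0, noise_std, size=xs.shape)
    recon = D(xs + deltas)
    return np.mean(np.sum((recon - xs) ** 2, axis=1))

# With an identity "denoiser" the loss equals the mean squared norm of the
# injected noise (about noise_std**2 * n_pixels), so any trained denoiser
# should do better than this baseline.
xs = rng.random((64, 784))                # 64 flattened 28x28 "images"
baseline = denoiser_loss(lambda z: z, xs)
```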

3.2.2 Internal Models

We train an autoencoder as the internal generative model for each category in the dataset. Ideally, the input, the bottleneck, and the output of the autoencoders are analogous to the roles of neurons in Layer 4, Layer 2/3, and Layer 5 of the primary visual cortex, respectively. The internal models receive the outputs of the denoiser as their inputs, to which we add noise to reflect the randomness in the brain's neural activities. The k-th autoencoder in the internal models is trained with the following loss

L_k = (1/N) ∑_{i=1}^{N} ‖A_k(x̂_i + δ_i) − 1[y_i = k] · x̂_i‖₂²,    (2)

where the indicator 1[y_i = k] equals 1 if x_i belongs to category k and 0 otherwise (so the learning target for other categories is a black image), A_k denotes the function of the k-th autoencoder, x̂_i = D(x_i) is the denoised input, and δ_i indicates the added noise. We choose this loss function to encourage the engram-cell behavior of the autoencoders. As a result of this behavior, it is natural to perform inference based on the relative output intensities of the different autoencoders. Specifically, for each input image x, we estimate its relative intensity from the k-th autoencoder as

ρ_k(x) = ‖A_k(D(x))‖₂.    (3)

The prediction on x by DIM is then

ŷ(x) = argmax_k ρ_k(x),    (4)

where k ∈ {0, 1, …, K − 1} and K is the number of categories.
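The per-category loss and the intensity-based inference rule can be sketched as follows. This is a numpy toy: `A_k`, `denoiser`, and the noise level are placeholders for the trained networks and the paper's actual settings.

```python
import numpy as np

def internal_loss(A_k, xs, ys, k, noise_std=0.1, seed=1):
    """Loss of Eq. (2) for the k-th autoencoder A_k: reconstruct the
    (denoised) input for category k, and output a black image (all
    zeros) for every other category."""
    delta = np.random.default_rng(seed).normal(0.0, noise_std, xs.shape)
    targets = np.where((ys == k)[:, None], xs, 0.0)   # black image = zeros
    outs = A_k(xs + delta)
    return np.mean(np.sum((outs - targets) ** 2, axis=1))

def predict(denoiser, autoencoders, x):
    """Eqs. (3)-(4): denoise the input, then pick the category whose
    autoencoder responds with the largest output intensity."""
    x_hat = denoiser(x)
    intensities = [np.linalg.norm(A(x_hat)) for A in autoencoders]
    return int(np.argmax(intensities))
```

Because a rejecting autoencoder outputs a near-black image, its intensity is small, and the argmax naturally selects the accepting one.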

Finally, we also consider a variation of the DIM model in the inference phase, which includes two binarization operations, one applied to the input images and the other applied to the outputs of the denoiser. We refer to this variation as biDIM hereafter.
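The binarization step of biDIM can be sketched in one line; the 0.5 threshold for grayscale pixels in [0, 1] is our assumption for illustration, not a value stated here.

```python
import numpy as np

def binarize(x, threshold=0.5):
    """Input binarization as used by biDIM: snap each pixel to {0, 1}.
    Any perturbation that does not push a pixel across the threshold
    is wiped out entirely."""
    return (x >= threshold).astype(x.dtype)
```

This is why binarization is particularly effective against small ℓ∞ perturbations: a bounded perturbation that stays on the same side of the threshold leaves the binarized image unchanged.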

4 Experiments

CNN biCNN Madry biABS ABS biDIM DIM
ℓ2-metric
DDNAttack 15% 71% 94% 85% 84% 92% 93%
PGD 30% 76% 96% 86% 88% 93% 94%
BasicIterativeAttack 17% 67% 95% 83% 83% 93% 94%
FastGradientAttack (FGM) 55% 92% 97% 94% 86% 94% 95%
DeepFoolAttack 21% 21% 95% 49% 83% 75% 89%
CarliniWagnerAttack 13% 10% 83% 45% 84% 51% 74%
BrendelBethgeAttack 12% 8% 50% 48% 93% 57% 71%
BoundaryAttack 19% 62% 54% 93% 90% 80% 80%
All attacks 1.1/9% 0.9/7% 1.4/41% 1.3/41% 2.2/83% 1.4/45% 1.9/66%
ℓ∞-metric
PGD 0% 73% 95% 88% 11% 89% 85%
BasicIterativeAttack 0% 70% 96% 83% 8% 89% 82%
FastGradientAttack (FGSM) 7% 78% 96% 86% 38% 90% 89%
DeepFoolAttack 0% 83% 95% 86% 7% 91% 78%
BrendelBethgeAttack 2% 81% 94% 89% 11% 88% 9%
All attacks 0.08/0% 0.36/69% 0.34/93% 0.42/82% 0.22/3% 0.49/78% 0.2/8%
ℓ0-metric
Pointwise 25% 43% 2% 82% 76% 53% 59%
All attacks 8/25% 11/43% 4/2% 26/82% 19/76% 13/53% 14/59%
ℓ1-metric
BrendelBethgeAttack 11% 4% 16% 48% 89% 65% 65%
All attacks 5/11% 3/4% 4/16% 8/47% 19/89% 13/65% 11/65%
Minimal Accuracy 0% 4% 2% 41% 3% 45% 8%
Table 1: Results for different kinds of models under different adversarial attacks, arranged according to the distance metric. Each entry shows the accuracy of the model within a fixed perturbation threshold for the corresponding norm. For each norm, we also summarize all attacks of that type, reporting both the median adversarial distance over all samples (left value) and the overall accuracy (right value). The last row shows the minimal accuracy of each model across all the attacks. The best results for the overall performance are shown in bold. Due to the space limit, we only display 15 important attacks out of the total 42 in this table and leave the full results to the appendix.

To evaluate the adversarial robustness of our model, we compare DIM and biDIM against two SOTA methods: adversarial training (Madry et al., 2018) and the analysis-by-synthesis (ABS) model, as well as its variation with binarized inputs (biABS) (Schott et al., 2019). We also include a vanilla convolutional neural network (CNN) model and its variation with input binarization (biCNN) as baseline models. The DIM models are implemented with relatively simple neural networks. Consequently, their clean accuracy is 96%, while the other models reach 99%. Despite this disadvantage, biDIM still beats the SOTA on overall robustness. We discuss more details in the next section.

In our experiments, we applied almost all attacks available in foolbox v3.2.1 against all models. These consist of 22 ℓ2 attacks, 12 ℓ∞ attacks, 6 ℓ1 attacks, and 2 ℓ0 attacks (we use foolbox v2.4.0 for the Pointwise attack since it is not available in foolbox v3.2.1). The most effective ones are those based on model gradients and those based on the prediction boundary. More specifically, the gradient-based attacks include the DeepFool Attack (Moosavi-Dezfooli et al., 2016), the Basic Iterative Method (BIM) Attack (Kurakin et al., 2016), and the Carlini&Wagner Attack (Carlini and Wagner, 2017). They exploit the gradients at the raw input images to find directions leading to wrong predictions. The boundary attacks rely on the model decision. Starting from adversarial samples with a relatively large perturbation size, these attacks search towards the corresponding raw input images along the boundary between the adversarial and non-adversarial regions. Within this type are the decision-based Boundary Attack (Brendel et al., 2018) and the BrendelBethge Attacks in the different norms (Brendel et al., 2020). Our experiments also cover other types of attacks, such as the additive random noise attacks, including the Gaussian Noise Attack, the Uniform Noise Attack, and their variations.

In practice, overall robustness is more important than robustness under any single attack, since adversaries will not restrict themselves to a specific attack. To reflect the overall robustness, we summarize the experimental results within each norm and leave the full results for individual attacks to the appendix. More concretely, we count a sample as successfully attacked as long as at least one attack finds an adversarial image for it; accordingly, the perturbation size of the sample is the minimum across all successful attacks.
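This worst-case aggregation rule can be sketched as a numpy toy, with np.inf marking an attack that failed on a given sample:

```python
import numpy as np

def worst_case_accuracy(pert_sizes, eps):
    """Overall robustness under an attack suite. `pert_sizes` has shape
    (n_attacks, n_samples); entry [a, i] is the perturbation size attack a
    needed on sample i (np.inf if the attack failed). A sample survives
    only if EVERY attack needs a perturbation larger than eps."""
    best = np.min(pert_sizes, axis=0)     # strongest attack per sample
    return float(np.mean(best > eps))
```

Taking the minimum over attacks before thresholding is what makes the summary a worst-case, rather than a per-attack average, measure.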

CNN singleIM Internal Models Dn-singleIM DIM biDIM
ℓ2-metric
DDNAttack 15% 83% 91% 87% 93% 92%
PGDAttack 30% 89% 95% 89% 94% 93%
BasicIterativeAttack 17% 88% 94% 90% 94% 93%
FastGradientAttack (FGM) 55% 89% 95% 90% 95% 94%
DeepFoolAttack 21% 71% 83% 82% 89% 75%
CarliniWagnerAttack 13% 54% 66% 68% 74% 51%
BrendelBethgeAttack 12% 61% 58% 70% 71% 57%
BoundaryAttack 19% 65% 67% 75% 80% 80%
All attacks 9% 52% 51% 65% 66% 45%
ℓ∞-metric
PGDAttack 0% 49% 70% 72% 85% 89%
BasicIterativeAttack 0% 54% 61% 72% 82% 89%
FastGradientAttack (FGSM) 7% 64% 78% 79% 89% 90%
DeepFoolAttack 0% 44% 61% 66% 78% 91%
BrendelBethgeAttack 2% 2% 1% 6% 9% 88%
All Attacks 0% 2% 0% 6% 8% 78%
ℓ0-metric
Pointwise 25% 54% 50% 58% 59% 53%
All attacks 25% 54% 50% 58% 59% 53%
ℓ1-metric
BrendelBethgeAttack 11% 61% 57% 65% 65% 65%
All attacks 11% 61% 57% 65% 65% 65%
Table 2: Results of the ablation study, covering six models: the vanilla CNN, the single-head Internal Model (singleIM), the Internal Models without the denoiser, the single-head Internal Model with the denoiser (Dn-singleIM), DIM, and biDIM. The remaining settings are the same as those in Table 1.

Table 1 reports each model's accuracy within a given perturbation bound for each norm. It has been recognized that a model's accuracy under bounded adversarial perturbations is often biased (Schott et al., 2019); nonetheless, we report it for completeness. The median adversarial perturbation size, on the other hand, reflects the perturbation at which the model achieves 50% accuracy. It is hardly affected by outliers and hence summarizes the distribution of adversarial perturbations better. We therefore also report the median perturbation size for each model in all four norm cases (the values before the slash). Note that clean samples that are already misclassified are counted as adversarial samples with a perturbation size of 0, and failed attacks are assigned a perturbation size of ∞.
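A sketch of the median computation under these conventions (numpy toy; `best_pert` is the per-sample minimum over all attacks, with np.inf already marking samples no attack could flip):

```python
import numpy as np

def median_adv_distance(best_pert, clean_correct):
    """Median adversarial perturbation size. Clean samples that are already
    misclassified (clean_correct == False) count as adversarial with
    distance 0; failed attacks keep distance np.inf in `best_pert`."""
    d = np.where(clean_correct, best_pert, 0.0)
    return float(np.median(d))
```

Because np.inf entries sit at the top of the sorted distances, they inflate the mean but leave the median well-defined, which is exactly the robustness-to-outliers property described above.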

For a better understanding of how each component in DIM affects the adversarial robustness, we further carry out an ablation study as well as investigate the latent representations of autoencoders in the internal models.

The ablation study involves six models. Starting from the vanilla CNN as the baseline, we first consider two ablations of DIM: the stand-alone Internal Models (IM) without the denoiser and the single-head internal model (singleIM). The single-head internal model combines a single encoder with a set of decoders, one for each category in the dataset. We then extend the single-head internal model by adding the denoiser to it. Finally, we compare the performance of these ablations to DIM and biDIM. The results are summarized in Table 2.

In Figure 3, we visualize the clustering of latent representations in all the ten latent spaces by applying the tSNE method to reduce the dimension of the latent representations.

5 Results and Discussion

The last row of Table 1 shows the minimal accuracy of a model against all the 42 attacks. Higher minimal accuracy indicates more stable performance in robustness under different types of adversarial attacks. From the table, we find biDIM achieves the highest minimal accuracy.

Our model closely simulates the brain's visual signal processing pipeline while being implemented with relatively simple neural networks for the denoiser and the internal models. Its stable robustness suggests that drawing inspiration from bio-systems could be a promising direction for tackling the robustness issue in deep learning.

For a more detailed comparison with the SOTA methods:

  • For ℓ2 attacks, ABS has the highest accuracy, while biDIM outperforms biABS in both accuracy and median perturbation size. Madry achieves good performance except for two boundary attacks, the Boundary Attack and BBA, which considerably degrade its overall robustness.

  • For ℓ∞ attacks, Madry has the best accuracy, since its adversarial training is based on the ℓ∞ norm. We find that the overall accuracies of ABS and DIM both decrease rapidly. The individual accuracy of DIM only drops in the BBA case, while the performance of ABS deteriorates on all five listed attacks. On the other hand, both biDIM and biABS retain decent robustness under ℓ∞ attacks with the help of input binarization. It is worth noting that biDIM has the largest median adversarial perturbation size.

  • Under ℓ0 attacks, Madry suffers a significant decrease in accuracy, becoming even worse than the baseline methods. On the contrary, DIM/biDIM and ABS/biABS still show moderate performance, especially biABS, which has the highest accuracy.

  • For ℓ1 attacks, no model performs particularly poorly. ABS stands out in this case, while biDIM outperforms biABS, and both are much better than Madry.

In summary, biDIM has the most stable performance over all kinds of attacks in our experiments, even though it may be inferior to other SOTA methods under specific circumstances. More importantly, biDIM achieves the highest minimal accuracy, indicating that it is not only stable but also a competitive defense method against all types of adversarial attacks.

We would also like to remark that inference with DIM (biDIM) is much faster than with ABS (biABS), which may take several seconds for a single forward pass. This heavy time cost severely restricts the extensibility of the ABS model.

Figure 3: The clustering of the latent representations of the ten autoencoders in the Internal Models. Different colors correspond to different categories (digits) in the MNIST dataset.

In Table 2, we first compare our DIM model with its three ablations: the single-head Internal Model (singleIM), the stand-alone Internal Models, and the singleIM with the denoiser (Dn-singleIM). A vanilla CNN is included as the baseline. We find that extending from the singleIM to the Internal Models generally increases the accuracy, indicating better robustness. Further, models with the denoiser outperform those without in almost all cases; interestingly, the increase in accuracy is most evident for ℓ∞ attacks. The comparison between DIM and biDIM shows that input binarization is a very effective method against ℓ∞ attacks. However, our results suggest that it often degrades the performance under ℓ2 attacks.

Finally, we study the clustering of the latent representations, which might provide clues about how the internal models work. For the sake of visualization, we apply the tSNE algorithm (van der Maaten and Hinton, 2008) to map the original 10-dimensional latent representations into 2-dimensional ones. The results are shown in Figure 3.
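The dimensionality-reduction step can be sketched with scikit-learn's tSNE implementation. The random data below is a stand-in for the autoencoders' latent codes, and the perplexity is chosen for the small sample size, not taken from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

# Map 10-dimensional latent codes down to 2-D for plotting, as in Figure 3.
rng = np.random.default_rng(0)
latents = rng.normal(size=(30, 10))          # 30 stand-in codes, dim 10
embedded = TSNE(n_components=2, perplexity=5.0,
                init="random", random_state=0).fit_transform(latents)
```

Each row of `embedded` is the 2-D position of one latent code; coloring the points by their category label reproduces the kind of cluster plot shown in Figure 3.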

Figure 3 shows that, for the k-th autoencoder, the distribution of the representations of the k-th category is well centralized and clearly separated from the others. The representations of the other categories, in contrast, are scattered over the latent space and show no apparent pattern. This behavior reflects how we train the internal models: we require each autoencoder to react only to images from the corresponding category and to return a black image otherwise.

6 Conclusion

In this work, we proposed the Denoised Internal Models (DIM), a novel two-stage model closely following the human brain's visual signal processing paradigm. The model is carefully designed to mimic the cooperation between the thalamus and the primary visual cortex. Moreover, instead of a single complex model, we adopt a set of simple internal generative models to encode the different categories in the dataset. This reflects the sparse coding based on the engram cells in Layer 2/3 of the primary visual cortex. Recent progress in neuroscience suggests that such engram-based generative models may serve as the basis of robust cognitive function in the human brain.

To comprehensively evaluate the robustness of our model, we conducted extensive experiments across a broad spectrum of adversarial attacks. The results (see Table 1) demonstrate that DIM, and especially its variation biDIM, achieves stable and competitive performance over all kinds of attacks. biDIM beats the SOTA methods, adversarial training and ABS, on the overall performance across the 42 attacks in our experiments. Further investigation shows that the clusters corresponding to different categories are well separated in the latent spaces of the internal models, which provides some clues about the good robustness of DIM. The present work is an initial attempt to integrate the brain's working mechanisms into model design in deep learning. In future work, we will explore more sophisticated realizations of the internal models and extend our model to real-world datasets. May the brain guide our way.

References

  • A. F. Agarap (2019) Deep learning using rectified linear units (relu). External Links: 1803.08375 Cited by: §1.
  • D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182. Cited by: §1.
  • A. Athalye, N. Carlini, and D. Wagner (2018a) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International conference on machine learning, pp. 274–283. Cited by: §2.2.3.
  • A. Athalye and N. Carlini (2018) On the robustness of the cvpr 2018 white-box adversarial example defenses. arXiv preprint arXiv:1804.03286. Cited by: §1.
  • A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2018b) Synthesizing robust adversarial examples. In International conference on machine learning, pp. 284–293. Cited by: §1.
  • M. Bafna, J. Murtagh, and N. Vyas (2018) Thwarting adversarial examples: an L0-robust sparse Fourier transform. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10096–10106. Cited by: §2.2.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §1.
  • B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §1.
  • W. Brendel, J. Rauber, M. Kümmerer, I. Ustyuzhaninov, and M. Bethge (2020) Accurate, reliable and fast robustness evaluation. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 12817–12827. Cited by: §2.1, §2.1, §4.
  • W. Brendel, J. Rauber, and M. Bethge (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In International Conference on Learning Representations, Cited by: §2.1, §4.
  • P. Brodal (2004) The central nervous system: structure and function. Oxford University Press. Cited by: §3.1.
  • E. J. Candès and B. Recht (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 (6), pp. 717. External Links: Document, ISBN 1615-3383, Link Cited by: §3.2.1.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §2.1, §2.1, §4.
  • E. Casamassima, A. Herbert, and C. Merkel (2021) Exploring cnn features in the context of adversarial robustness and human perception. In Applications of Machine Learning 2021, Vol. 11843, pp. 1184313. Cited by: §1.
  • S. Chatterjee (2015) Matrix estimation by universal singular value thresholding. The Annals of Statistics 43 (1). External Links: ISSN 0090-5364, Link, Document Cited by: §3.2.1.
  • Y. Chen and Y. Chi (2018) Harnessing structures in big data via guaranteed low-rank matrix estimation: recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine 35 (4), pp. 14–31. Cited by: §3.2.1.
  • C. Cintas, S. Speakman, V. Akinwande, W. Ogallo, K. Weldemariam, S. Sridharan, and E. McFowland (2020) Detecting adversarial attacks via subset scanning of autoencoder activations and reconstruction error. Cited by: §2.2.4.
  • J. Cudeiro and A. M. Sillito (2006) Looking back: corticothalamic feedback and early visual processing. Trends in neurosciences 29 (6), pp. 298–306. Cited by: §1.
  • A. M. Derrington, J. Krauskopf, and P. Lennie (1984) Chromatic mechanisms in lateral geniculate nucleus of macaque.. The Journal of physiology 357 (1), pp. 241–265. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211. External Links: Document, Link, https://onlinelibrary.wiley.com/doi/pdf/10.1207/s15516709cog1402_1 Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014a) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014b) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §1.
  • J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, et al. (2018) Recent advances in convolutional neural networks. Pattern Recognition 77, pp. 354–377. Cited by: §1.
  • J. Guan, J. Jiang, H. Xie, and K. Liu (2016) How does the sparse memory “engram” neurons encode the memory of a spatial–temporal event?. Frontiers in neural circuits 10, pp. 61. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • H. Hosseini, B. Xiao, M. Jaiswal, and R. Poovendran (2017) On the limitation of convolutional neural networks in recognizing negative images. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 352–358. Cited by: §2.1, §2.1.
  • S. Hu, T. Yu, C. Guo, W. Chao, and K. Q. Weinberger (2019) A new defense against adversarial images: turning a weakness into a strength. Advances in Neural Information Processing Systems 32. Cited by: §2.2.
  • Y. Huang, S. Dai, T. Nguyen, P. Bao, D. Y. Tsao, R. G. Baraniuk, and A. Anandkumar (2019) Brain-inspired robust vision using convolutional neural networks with feedback. Cited by: §1.
  • U. Jang, S. Jha, and S. Jha (2019) On the need for topology-aware generative models for manifold-based defenses. In International Conference on Learning Representations, Cited by: §2.2.4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §1.
  • A. Kurakin, I. Goodfellow, S. Bengio, et al. (2016) Adversarial examples in the physical world. Cited by: §4.
  • Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document Cited by: §1.
  • X. Liu, S. Ramirez, P. T. Pang, C. B. Puryear, A. Govindarajan, K. Deisseroth, and S. Tonegawa (2012) Optogenetic stimulation of a hippocampal engram activates fear memory recall. Nature 484 (7394), pp. 381–385. Cited by: §1.
  • X. Liu, S. Ramirez, and S. Tonegawa (2014) Inception of a false memory by optogenetic manipulation of a hippocampal memory engram. Philosophical Transactions of the Royal Society B: Biological Sciences 369 (1633), pp. 20130142. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §A.1.2, §2.2.1, §4.
  • H. Markram, E. Muller, S. Ramaswamy, M. W. Reimann, M. Abdellah, C. A. Sanchez, A. Ailamaki, L. Alonso-Nanclares, N. Antille, S. Arsever, et al. (2015) Reconstruction and simulation of neocortical microcircuitry. Cell 163 (2), pp. 456–492. Cited by: §3.1.
  • J. L. McGaugh (2000) Memory–a century of consolidation. Science 287 (5451), pp. 248–251. Cited by: §1.
  • D. Meng and H. Chen (2017) Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp. 135–147. Cited by: §2.2.4.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1, §2.1, §2.1, §4.
  • D. H. O’Connor, M. M. Fukui, M. A. Pinsk, and S. Kastner (2002) Attention modulates responses in the human lateral geniculate nucleus. Nature neuroscience 5 (11), pp. 1203–1209. Cited by: §1, §3.1.
  • T. Pang, K. Xu, Y. Dong, C. Du, N. Chen, and J. Zhu (2019a) Rethinking softmax cross-entropy loss for adversarial robustness. In International Conference on Learning Representations, Cited by: §2.2.
  • T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu (2019b) Improving adversarial robustness via promoting ensemble diversity. In International Conference on Machine Learning, pp. 4970–4979. Cited by: §2.2.
  • R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried (2005) Invariant visual representation by single neurons in the human brain. Nature 435 (7045), pp. 1102–1107. Cited by: §1.
  • J. Rauber and M. Bethge (2020) Fast differentiable clipping-aware normalization and rescaling. arXiv preprint arXiv:2007.07677. Cited by: §2.1, §2.1.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: §1.
  • J. Rony, L. G. Hafemann, L. S. Oliveira, I. B. Ayed, R. Sabourin, and E. Granger (2019) Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4322–4330. Cited by: §2.1, §2.1.
  • F. Rosenblatt (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review 65 (6), pp. 386—408. External Links: Document, ISSN 0033-295X, Link Cited by: §1.
  • P. Samangouei, M. Kabkab, and R. Chellappa (2018) Defense-gan: protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, Cited by: §2.2.4.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2019) Towards the first adversarially robust neural network model on mnist. In Seventh International Conference on Learning Representations (ICLR 2019), pp. 1–16. Cited by: §A.1.3, §2.1, §2.2.4, §4, §4.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Cited by: §1.
  • S. Tonegawa, X. Liu, S. Ramirez, and R. Redondo (2015) Memory engram cells have come of age. Neuron 87 (5), pp. 918–931. Cited by: §1.
  • F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations, Cited by: §2.1.
  • P. Vaishnavi, T. Cong, K. Eykholt, A. Prakash, and A. Rahmati (2020) Can attention masks improve adversarial robustness?. In International Workshop on Engineering Dependable and Secure Machine Learning Systems, pp. 14–22. Cited by: §2.2.2.
  • L.J.P. van der Maaten and G.E. Hinton (2008) Visualizing high-dimensional data using t-sne. Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • G. Verma and A. Swami (2019) Error correcting output codes improve probability estimation and adversarial robustness of deep neural networks. Advances in Neural Information Processing Systems 32, pp. 8646–8656. Cited by: §2.2.
  • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. Manzagol, and L. Bottou (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (12). Cited by: §2.2.2.
  • B. J. White, S. E. Boehnke, R. A. Marino, L. Itti, and D. P. Munoz (2009) Color-related signals in the primate superior colliculus. Journal of Neuroscience 29 (39), pp. 12159–12166. Cited by: §3.1.
  • C. Xiao, P. Zhong, and C. Zheng (2019) Enhancing adversarial defense by k-winners-take-all. arXiv preprint arXiv:1905.10510. Cited by: §2.2.3.
  • H. Xie, Y. Liu, Y. Zhu, X. Ding, Y. Yang, and J. Guan (2014) In vivo imaging of immediate early gene expression reveals layer-specific memory traces in the mammalian brain. Proceedings of the National Academy of Sciences 111 (7), pp. 2788–2793. Cited by: §1, §3.1.
  • W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig (2016) Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256. Cited by: §1.
  • Y. Xu and M. Vaziri-Pashkam (2021) Limits to visual representational correspondence between convolutional neural networks and the human brain. Nature communications 12 (1), pp. 1–16. Cited by: §1.
  • Y. Yang, G. Zhang, Z. Xu, and D. Katabi (2019a) ME-net: towards effective adversarial robustness with matrix estimation. In ICML, Cited by: §3.2.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019b) Xlnet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32. Cited by: §1.
  • X. Yin, S. Kolouri, and G. K. Rohde (2019) Adversarial example detection and classification with asymmetrical adversarial training. arXiv preprint arXiv:1905.11475. Cited by: §2.2.

Appendix A Appendix

CNN biCNN Madry biABS ABS biDIM DIM
ℓ2-metric
ContrastReductionAttack 99% 98% 99% 99% 99% 95% 96%
DDNAttack 15% 71% 94% 85% 84% 92% 93%
PGD 30% 76% 96% 86% 88% 93% 94%
BasicIterativeAttack 17% 67% 95% 83% 83% 93% 94%
FastGradientAttack (FGM) 55% 92% 97% 94% 86% 94% 95%
AdditiveGaussianNoiseAttack (GN) 99% 98% 98% 99% 99% 95% 96%
AdditiveUniformNoiseAttack (UN) 99% 98% 99% 99% 99% 96% 96%
ClippingAwareGN 99% 98% 98% 99% 99% 96% 96%
ClippingAwareUN 99% 98% 99% 99% 99% 96% 96%
RepeatedGN 99% 97% 98% 97% 98% 92% 95%
RepeatedUN 99% 98% 98% 97% 98% 93% 95%
ClippingAwareRepeatedGN 99% 97% 98% 97% 98% 92% 95%
ClippingAwareRepeatedUN 98% 97% 98% 97% 98% 92% 95%
DeepFoolAttack 21% 21% 95% 49% 83% 75% 89%
InversionAttack 99% 98% 99% 99% 99% 95% 96%
BinarySearchContrastReductionAttack 99% 98% 99% 99% 99% 95% 96%
LinearSearchContrastReductionAttack 99% 98% 99% 99% 99% 95% 96%
GaussianBlurAttack 99% 98% 98% 98% 99% 95% 96%
CarliniWagnerAttack 13% 10% 83% 45% 84% 51% 74%
BrendelBethgeAttack 12% 8% 50% 48% 93% 57% 71%
BoundaryAttack 19% 62% 54% 93% 90% 80% 80%
All attacks 9% 7% 41% 41% 83% 45% 66%
ℓ∞-metric
PGD 0% 73% 95% 88% 11% 89% 85%
BasicIterativeAttack 0% 70% 96% 83% 8% 89% 82%
FastGradientAttack (FGSM) 7% 78% 96% 86% 38% 90% 89%
AdditiveUniformNoiseAttack 96% 98% 99% 98% 99% 96% 96%
RepeatedAdditiveUniformNoiseAttack 83% 95% 97% 96% 97% 89% 93%
DeepFoolAttack 0% 83% 95% 86% 7% 91% 78%
InversionAttack 28% 98% 98% 99% 76% 95% 95%
BinarySearchContrastReductionAttack 28% 98% 98% 98% 82% 94% 94%
LinearSearchContrastReductionAttack 28% 98% 98% 98% 82% 94% 94%
GaussianBlurAttack 97% 97% 98% 97% 98% 93% 95%
LinearSearchBlendedUniformNoiseAttack 67% 98% 98% 98% 98% 93% 95%
BrendelBethgeAttack 2% 81% 94% 89% 11% 88% 9%
All attacks 0% 69% 93% 82% 3% 78% 8%
ℓ0-metric
SaltAndPepperAttack 93% 93% 73% 97% 98% 90% 93%
Pointwise 25% 43% 2% 82% 76% 53% 59%
All attacks 25% 43% 2% 82% 76% 53% 59%
ℓ1-metric
InversionAttack 99% 98% 99% 99% 99% 95% 96%
BinarySearchContrastReductionAttack 99% 98% 99% 99% 99% 95% 96%
LinearSearchContrastReductionAttack 99% 98% 99% 99% 99% 95% 96%
GaussianBlurAttack 99% 98% 98% 99% 99% 95% 96%
LinearSearchBlendedUniformNoiseAttack 99% 98% 99% 99% 99% 95% 96%
BrendelBethgeAttack 11% 4% 16% 48% 89% 65% 65%
All attacks 11% 4% 16% 47% 89% 65% 65%
Table 3: Results for different kinds of models under 42 different adversarial attacks, arranged according to distance metrics. Almost all attacks in Foolbox v3.2.1 are included. Each entry shows the accuracy of the model under the respective distance threshold for the corresponding norm. For each norm, we also summarize all attacks of that type, calculating the overall accuracy.
CNN singleIM IM Dn-singleIM DIM biDIM
ℓ2-metric
ContrastReductionAttack 99% 95% 96% 95% 96% 95%
DDNAttack 15% 83% 91% 87% 93% 92%
PGD 30% 89% 95% 89% 94% 93%
BasicIterativeAttack 17% 88% 94% 90% 94% 93%
FastGradientAttack (FGM) 55% 89% 95% 90% 95% 94%
AdditiveGaussianNoiseAttack (GN) 99% 95% 96% 94% 96% 95%
AdditiveUniformNoiseAttack (UN) 99% 95% 96% 94% 96% 96%
ClippingAwareGN 99% 95% 96% 95% 96% 96%
ClippingAwareAdditiveUN 99% 95% 96% 94% 96% 96%
RepeatedGN 99% 93% 96% 93% 95% 92%
RepeatedUN 99% 93% 95% 93% 95% 93%
ClippingAwareRepeatedGN 99% 93% 95% 92% 95% 92%
ClippingAwareRepeatedUN 98% 93% 95% 93% 95% 92%
DeepFoolAttack 21% 71% 83% 82% 89% 75%
InversionAttack 99% 95% 96% 95% 96% 95%
BinarySearchContrastReductionAttack 99% 94% 96% 94% 96% 95%
LinearSearchContrastReductionAttack 99% 94% 96% 94% 96% 95%
GaussianBlurAttack 99% 94% 96% 93% 96% 95%
CarliniWagnerAttack 13% 54% 66% 68% 74% 51%
BrendelBethgeAttack 12% 61% 58% 70% 71% 57%
BoundaryAttack 19% 65% 67% 75% 80% 80%
All attacks 9% 52% 51% 65% 66% 45%
ℓ∞-metric
PGD 0% 49% 70% 72% 85% 89%
BasicIterativeAttack 0% 54% 61% 72% 82% 89%
FastGradientAttack (FGSM) 7% 64% 78% 79% 89% 90%
AdditiveUniformNoiseAttack 96% 95% 96% 95% 96% 96%
RepeatedAdditiveUniformNoiseAttack 83% 90% 93% 91% 93% 89%
DeepFoolAttack 0% 44% 61% 66% 78% 91%
InversionAttack 28% 96% 95% 92% 95% 95%
BinarySearchContrastReductionAttack 28% 93% 94% 91% 94% 94%
LinearSearchContrastReductionAttack 28% 93% 94% 91% 94% 94%
GaussianBlurAttack 97% 92% 94% 93% 95% 93%
LinearSearchBlendedUniformNoiseAttack 67% 94% 95% 93% 95% 93%
BrendelBethgeAttack 2% 2% 1% 6% 9% 88%
All attacks 0% 2% 0% 6% 8% 78%
ℓ0-metric
SaltAndPepperAttack 93% 90% 92% 91% 93% 90%
Pointwise 25% 54% 50% 58% 59% 53%
All attacks 25% 54% 50% 58% 59% 53%
ℓ1-metric
InversionAttack 99% 95% 96% 95% 96% 95%
BinarySearchContrastReductionAttack 99% 94% 96% 94% 96% 95%
LinearSearchContrastReductionAttack 99% 94% 96% 94% 96% 95%
GaussianBlurAttack 99% 94% 96% 94% 96% 95%
LinearSearchBlendedUniformNoiseAttack 99% 94% 96% 94% 96% 95%
BrendelBethgeAttack 11% 61% 57% 65% 65% 65%
All attacks 11% 61% 57% 65% 65% 65%
Table 4: Results of the ablation study under 42 different adversarial attacks, arranged according to distance metrics. There are six models: the vanilla CNN, the single-head Internal Model (singleIM), the Internal Model without the denoiser (IM), the single-head Internal Model with the denoiser (Dn-singleIM), DIM, and biDIM.

A.1 Model training details

A.1.1 Hyperparameters and training details for DIM

In DIM, we train one denoiser and 10 internal models separately. The denoiser consists of a fully-connected encoder with 5 layers of widths [784, 560, 280, 140, 70], where the last layer uses a linear activation and the others use ReLU, and a fully-connected decoder with 5 layers of widths [70, 140, 280, 560, 784], where the last layer uses Tanh and the others use ReLU. Each internal model consists of a fully-connected encoder with 5 layers of widths [784, 256, 64, 12, 10], where the last layer uses a linear activation and the others use ReLU, and a fully-connected decoder with 5 layers of widths [10, 12, 64, 256, 784], where the last layer uses Tanh and the others use ReLU. Two types of noise are added to the input at each stage: one drawn uniformly at random for each pixel, and one that, for every pixel dimension, increases the value by 1 with probability 1/12 and decreases it by 1 with probability 1/12. Both the denoiser and the internal models are trained with the Adam optimizer. In addition, when training the internal models, we also randomly adjust the image brightness by a multiplicative factor after adding the noise.
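As a minimal sketch, the two noise types described above can be written as follows. The range of the uniform noise is not specified here, so `uniform_eps` is a placeholder value, and pixels are assumed to lie on a 0–255 scale (where a step of 1 is the smallest change):

```python
import numpy as np

def augment(image, uniform_eps=8.0, step_p=1 / 12, rng=None):
    """Apply the two training noises to an image on a 0-255 scale.
    `uniform_eps` is an illustrative placeholder, not a value from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    # first noise: drawn uniformly at random, independently per pixel
    noisy = image + rng.uniform(-uniform_eps, uniform_eps, size=image.shape)
    # second noise: +1 with probability 1/12, -1 with probability 1/12,
    # applied to every pixel dimension
    u = rng.random(image.shape)
    noisy = noisy + (u < step_p) - ((u >= step_p) & (u < 2 * step_p))
    return np.clip(noisy, 0.0, 255.0)
```

The brightness adjustment used for the internal models would be a further multiplication by a random factor after this step.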

A.1.2 Hyperparameters and training details for Madry

We adopt the same network architecture as in Madry et al. (2018), which includes two convolutional layers, two pooling layers, and two fully connected layers. We implement the model in PyTorch and perform adversarial training with the same settings as in the original paper.

A.1.3 Hyperparameters and training details for the CNN and ABS models

For the CNN and the ABS/biABS cases, we load the pre-trained models provided by Schott et al. (2019). The CNN model has 4 convolutional layers with kernel sizes [5, 4, 3, 5]. The ABS/biABS model contains 10 variational autoencoders, one for each category in the dataset.

A.2 Attack details

To apply gradient-based attacks to the models with input binarization, we exploit a transfer-attack-like procedure. Specifically, a sigmoid function is used as a differentiable substitute for the hard binarization, and we then attack this proxy model directly with Foolbox v3.2.1. The sigmoid has a scale parameter that controls how steeply the function rises from 0 to 1; for each attack on a binary model, we attack the model 5 times with different values of this parameter. Finally, we apply a fine-tuning procedure to all generated adversarial samples. To be concrete, if a pixel value in the adversarial image differs from its original value, we project the value to 0 or 0.5 as long as it retains the same result under binarization.
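A minimal sketch of the differentiable proxy, assuming a binarization threshold of 0.5; the scale values shown are illustrative, not the ones used in the experiments:

```python
import numpy as np

def soft_binarize(x, scale=100.0, threshold=0.5):
    """Differentiable stand-in for hard input binarization: a sigmoid
    centered at the threshold. Larger `scale` makes the transition from
    0 to 1 steeper, approaching the hard step function."""
    return 1.0 / (1.0 + np.exp(-scale * (x - threshold)))

def hard_binarize(x, threshold=0.5):
    """The non-differentiable operation the sigmoid approximates."""
    return (np.asarray(x) >= threshold).astype(float)
```

Gradient-based attacks are then run against the network with `soft_binarize` in place of `hard_binarize`, so that gradients can flow through the input transformation.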

Note that this fine-tuning procedure also applies to non-gradient-based attacks on binary models; in this case, however, the transfer-attack-like procedure is no longer needed.
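The projection step can be sketched as follows, assuming values at or above the 0.5 threshold binarize to 1 (the threshold convention is not stated explicitly in the source):

```python
import numpy as np

def finetune(adv, orig, threshold=0.5):
    """Snap every perturbed pixel to 0 or 0.5, whichever preserves its
    binarized value, shrinking the perturbation without changing the
    model's binarized input."""
    out = adv.copy()
    changed = adv != orig
    ones = adv >= threshold
    out[changed & ~ones] = 0.0        # binarizes to 0 -> project to 0
    out[changed & ones] = threshold   # binarizes to 1 -> project to 0.5
    return out
```

Since the binarized image is unchanged, the fine-tuned sample remains adversarial whenever the original adversarial sample was.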

For normal (i.e., non-binary) models, we use the default settings of the attacks in the Foolbox v3.2.1 package.