Representation of White- and Black-Box Adversarial Examples in Deep Neural Networks and Humans: A Functional Magnetic Resonance Imaging Study

05/07/2019 ∙ by Chihye Han, et al. ∙ 0

The recent success of brain-inspired deep neural networks (DNNs) in solving complex, high-level visual tasks has led to rising expectations for their potential to match the human visual system. However, DNNs exhibit idiosyncrasies that suggest their visual representation and processing might be substantially different from human vision. One limitation of DNNs is that they are vulnerable to adversarial examples, input images to which subtle, carefully designed noise is added to fool a machine classifier. The robustness of the human visual system against adversarial examples is potentially of great importance, as it could uncover a key mechanistic feature that machine vision is yet to incorporate. In this study, we compare the visual representations of white- and black-box adversarial examples in DNNs and humans by leveraging functional magnetic resonance imaging (fMRI). We find a small but significant difference in representation patterns for different (i.e. white- versus black-box) types of adversarial examples for both humans and DNNs. However, unlike that of DNNs, human performance on categorical judgment is not degraded by noise of either type. These results suggest that adversarial examples may be differentially represented in the human visual system, but are unable to affect the perceptual experience.






I Introduction

State-of-the-art machine vision systems based on deep neural networks (DNNs) achieve remarkable performance in high-level visual tasks such as object recognition [1, 2, 3]. However, existing DNNs are vulnerable to adversarial examples [4, 5], which are generated by adding subtle noises that lead a machine classifier, but not a human observer, to misidentify the target image.

While adversarial inputs are a serious security threat for DNNs, they have a negligible influence on humans. Motivated by the difference in degree of robustness, the present work examines the representation of adversarial examples in DNNs and humans.

Specifically, our contributions are as follows:

  • We obtain feature representations of adversarial examples in the human visual cortex with fMRI and compute their similarity to feature representations produced by hierarchical layers of a DNN using representational similarity analysis [6].

  • Along with white-box examples that exploit access to a DNN structure, adversarial examples with Gaussian black-box noises are presented to humans and a DNN in parallel to investigate their respective neural responses to structured and random noise.

  • Concurrently with fMRI measurements, categorization decision performance is recorded in humans to examine whether behavioral judgment and neural representation align for different types of adversaries.

The rest of the paper is organized as follows: In Section II, background and prior works are introduced on comparing human and DNN visual systems and adversarial examples. Section III describes methods for fMRI and behavioral experiments. Section IV shows experimental results and analysis. Finally, Section V discusses implications of the results.

II Background and Related Work

II-A Comparison of Human and DNN Visual Processing

DNNs, especially convolutional neural networks (CNNs), have recently achieved human-level performance in various visual tasks [1, 2, 3]. Like their biologically inspired neural network predecessors [7, 8, 9], CNNs share key structural similarities with the ventral visual pathway of the biological brain, including neural receptive fields and hierarchical cortical organization. More importantly, successful CNN variants have been shown to exhibit surprising similarities to humans in terms of visual representation and behavior. Neuroimaging studies reported that features from higher layers of DNNs can accurately predict fMRI data from human inferior temporal (IT) cortex and cell recording data from monkey IT, indicating that higher layers of DNNs have acquired underlying representations similar to those of primate IT for visual object recognition [10, 11]. Representations from DNNs can also be adapted to reliably model human judgment patterns in letter and image recognition, shape sensitivity, and categorical similarity [12, 13, 14, 15]. Findings that DNNs can closely predict aspects of biological visual processing have suggested their usefulness as a model for biological vision [16, 17].

Despite this promise, DNNs exhibit considerable discrepancies from biological vision. Contrary to the initially reported similarities, [18] showed that the representations of DNNs are significantly non-predictive of primate IT data on an individual image level. Furthermore, DNNs are significantly more susceptible than humans to image distortions such as additive noise, contrast reduction, and reversed brightness [19, 20, 21]. This suggests that DNNs are far less robust than humans, especially in impoverished settings. Finally, differences in psychophysical judgments between humans and DNNs [22, 23] suggest that the underlying visual processing might be substantially different.

II-B Adversarial Examples

Adversarial examples demonstrate the shortcomings of DNNs in the extreme. They are images modified to fool a machine classifier by adding malicious noise to the original [4, 5]. A canonical example is shown in Fig. 1, where an image of a panda is misclassified as a gibbon after human-imperceptible noise is added. The generation of an adversarial example can be formally stated as follows, where $x^{adv}$ is the adversarial example from the original image $x$, $\eta$ is the perturbation noise, and $\epsilon$ is the noise level:

$$x^{adv} = x + \epsilon \cdot \eta \tag{1}$$


The perturbation noise is constructed by an optimization process that maximizes misclassification of the target image. Such attacks exploit access to the structure of the neural network they aim to fool and are thus considered white-box attacks (see Generating Adversarial Stimuli in the Methods section for details). However, [4, 24, 25] showed that adversarial examples created for one network transfer to other networks with similar structures, enabling black-box attacks. In fact, [25] showed that an adversarial example optimized against multiple networks is more likely to fool another arbitrary network. Arbitrary noise such as Gaussian blur or salt-and-pepper noise, which is inherently independent of any specific network architecture, can by definition serve as a black-box attack when added to an image until it is misclassified. Adversarial examples also transfer to the real world when captured with cameras or other sensors, despite substantial transformations caused by lighting and camera properties [26].

Fig. 1: A canonical adversarial example adapted from [5]. An adversarial example is created by generating adversarial noise clipped to be human-imperceptible and adding it to an image of a panda, which is then classified as a gibbon by a deep neural network. See the main text for more detail on adversarial example generation.

While adversarial examples pose a serious security concern for machine classifiers, they are known to have limited impact on humans. [27] examined the representation of an image of adversarial noise (not of adversarial example) in humans by an fMRI experiment, showing that hierarchical representations of the adversarial noise in humans are increasingly less similar to those of DNN going from low to high layers in the visual cortex. This reaffirms the notion that adversarial noise contains structure meaningful for visual processing of DNNs, but not humans. On the other hand, a psychophysics experiment of [28] suggested that humans, too, are fooled by adversarial examples if exposed to them briefly, i.e. {71, 63} ms. It was observed that adversarial examples effective for humans tended to entail visually identifiable modulations in texture, contrast, and edge information, in line with previous accounts that adversarial perturbations sometimes induce semantically meaningful features that are relevant to the target class [29].

Given these seemingly equivocal reports of adversarial effects on humans, it is essential to consider the distinction between visual and perceptual representations. Contrary to our subjective impression, our initial sensory representation and final perceptual awareness can well be discrepant. For example, categorical representation in the human IT departs from human judgments: human categorical judgments, but neither human nor monkey IT representations obtained by fMRI, reflect human-related sub-categorization within the animate class into human vs. nonhuman animals and within the inanimate class into natural vs. artificial objects [30]. More relevantly, [31] report that the ventral visual pathway representation measured by fMRI is more prominently guided by animal appearance than by animacy, while the reverse is true for human judgment and DNN representation.

In the present work, both visual representations and perceptual performance are considered as we examine fMRI and behavioral patterns of human observers in response to adversarial examples. In addition, effects of white-box and black-box noises are examined symmetrically in humans and their machine counterparts to elucidate whether the neural and behavioral responses to adversarial examples are specific to adversarial noise, as opposed to arbitrary, random noise.

III Methods

III-A Stimulus Image

Stimuli presented to human subjects and a DNN model were adapted from [32, 33]. The original human fMRI experiment consisted of presenting 96 color images (175 × 175 pixels) of categorical real-world objects, including animates (faces or bodies of human and nonhuman animals) and inanimates (natural and artificial objects). The time constraint posed by the need to repeat fMRI experiments for several experimental conditions motivated us to exclude the 'human body' and 'nonhuman body' subcategories from our experiment, leaving only 12 images of human faces and 12 images of nonhuman faces in the animate category. Twelve images each of natural and artificial objects in the inanimate category were selected to match the per-category image count, based on non-ambiguity (e.g. selecting a single-object image over a scene). The final stimuli consisted of 48 images (see Table I), resized to 224 × 224 pixels to enable the use of a DNN pre-trained with images of the same size.

Category  | Class              | Instances
Animate   | Human face         | 12
Animate   | Animal face        | 12
Inanimate | Natural objects    | 12
Inanimate | Artificial objects | 12
TABLE I: Stimulus Image Set

III-B Generating Adversarial Stimuli

For our DNN model, we used the PyTorch implementation of the VGG19 network to produce adversarial examples and to compute DNN features. The VGG19 model consists of sixteen convolutional layers and three fully connected layers. The network was pre-trained with 1.2 million labelled images of 1000 categories from ImageNet.


We generated adversarial images with two types of white-box attacks (with respect to the DNN): Projected Gradient Descent (PGD) and Carlini and Wagner (C&W) attacks. With both white-box attacks, we designated the target class to be the gibbon class in ImageNet ("368: 'gibbon, Hylobates lar'"). For black-box adversarial images, we generated Gaussian noise not designed to deceive a specific target network. With all attacks, we verified that the top-1 classification result successfully changed to the target class for all 48 stimulus images. A sample image of the generated stimuli in each adversarial condition is provided in Table II.


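PGD builds on a gradient-based attack referred to as the fast gradient sign method (FGSM) [35], in which the adversarial noise is determined by the gradient of the loss $L$ between the predicted output, $\hat{y}$, and the ground truth, $y$, as follows, with $x$ the original image and $\epsilon$ the noise level:

$$x^{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(\hat{y}, y)\right) \tag{2}$$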


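PGD is a multi-step variant of FGSM: it finds the adversarial perturbation by applying the same update as FGSM, but iteratively. The algorithm starts from a random $\epsilon$-uniform perturbation, clipped to the pixel value range [0, 255]. At each step $t$, the image is moved toward the target decision region by subtracting, with step size $\alpha$, the sign of the gradient of the loss between $\hat{y}$ and the target label $y_{target}$:

$$x^{adv}_{t+1} = x^{adv}_t - \alpha \cdot \mathrm{sign}\left(\nabla_x L(\hat{y}, y_{target})\right) \tag{3}$$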


We chose of 1 and of with steps, but stopped early when the prediction matched the target.
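The single-step and iterative updates described above can be illustrated with a minimal sketch. It is not the authors' implementation: a toy linear softmax classifier (hypothetical weights `W`, `b`) stands in for VGG19 so that the input gradient has a closed form, the random start is omitted for determinism, and the projection into an eps-ball around the original image is the standard PGD step assumed here. The function names `fgsm` and `targeted_pgd` are illustrative only.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    # Toy linear model standing in for a deep network (assumption).
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def predict(W, b, x):
    z = logits(W, b, x)
    return max(range(len(z)), key=lambda k: z[k])

def loss_grad(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x: W^T (p - onehot)."""
    p = softmax(logits(W, b, x))
    err = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(err[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(W, b, x, true_label, eps):
    """One untargeted step: add eps times the sign of the loss gradient."""
    g = loss_grad(W, b, x, true_label)
    return [clip(xi + eps * sign(gi), 0.0, 255.0) for xi, gi in zip(x, g)]

def targeted_pgd(W, b, x, target, eps, alpha, steps):
    """Iterated targeted steps with early stopping once the target is predicted."""
    x_adv = list(x)
    for _ in range(steps):
        if predict(W, b, x_adv) == target:
            break
        g = loss_grad(W, b, x_adv, target)
        # Subtract the gradient sign to decrease the loss toward the target class.
        x_adv = [xa - alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project into the eps-ball around x, then into the valid pixel range.
        x_adv = [clip(clip(xa, xi - eps, xi + eps), 0.0, 255.0)
                 for xa, xi in zip(x_adv, x)]
    return x_adv
```

With a real network, the closed-form `loss_grad` would be replaced by automatic differentiation; the update and projection logic would stay the same.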


The C&W attack [36] is a strong optimization-based attack in which the adversarial noise is defined with learnable parameters optimized by Adam [37]. We used the loss function suggested by the original paper as follows:

$$f(x^{adv}) = \max\left(\max_{i \neq t} Z(x^{adv})_i - Z(x^{adv})_t,\ -\kappa\right) \tag{4}$$

$Z(x)$ is the logit output of the network given an input $x$, $t$ is the target class, and $\kappa$ is the parameter that controls the confidence of finding an adversarial example. We minimized equation (4) using PGD, and we chose of with iterations and learning rate of .
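As a small illustration of the objective, the sketch below evaluates the targeted margin loss commonly cited from the C&W paper on a plain list of logits. It assumes that form of the loss, since the equation itself is not recoverable from this text; the function name `cw_margin_loss` and the default `kappa` are hypothetical.

```python
def cw_margin_loss(logits, target, kappa=0.0):
    """Targeted margin loss: max(max_{i != t} Z_i - Z_t, -kappa).

    The loss is positive while some non-target logit exceeds the target
    logit, and it saturates at -kappa once the target leads by kappa.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)
```

Minimizing this quantity over the perturbation pushes the target logit above all others; a larger `kappa` demands a wider margin before the loss stops decreasing.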


For the black-box attack, we added Gaussian noise with mean and standard deviation of and , respectively. We empirically determined the standard deviation by visually comparing the resulting noise intensity with that of the other noise types.
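A minimal sketch of the black-box perturbation follows. The mean and standard deviation used in the study are not recoverable from this text, so they are left as parameters; clipping to the [0, 255] pixel range and the helper name `add_gaussian_noise` are assumptions.

```python
import random

def add_gaussian_noise(pixels, mean, std, seed=None):
    """Add i.i.d. Gaussian noise to a flat list of pixel values.

    The result is clipped to [0, 255] (an assumption; the text only
    specifies that range for PGD). A seed makes the noise reproducible.
    """
    rng = random.Random(seed)
    return [min(255.0, max(0.0, p + rng.gauss(mean, std))) for p in pixels]
```

Unlike the gradient-based attacks, this noise does not depend on any network, which is what makes it a black-box perturbation in the sense used here.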

Condition (a) Clean (b) PGD (c) C&W (d) Gaussian
TABLE II: Example Stimulus Image

III-C fMRI Experiment

III-C1 Participants

Fourteen healthy subjects were recruited for the study (3 females, mean age 23.68, range 21-30). All subjects had normal or corrected-to-normal visual acuity of 20/40 or above and no neurological or psychiatric history. Subjects provided written informed consent regarding their participation. Experiments were in compliance with the safety guidelines for MRI research and approved by the Institutional Review Board for research involving human subjects at Korea Advanced Institute of Science and Technology.

III-C2 MRI Acquisition

Experiments were performed with a 12-channel 3T MR scanner (Siemens Magnetom Verio, Germany). The functional images were acquired with a T2*-weighted gradient recalled echo-planar imaging (EPI) sequence (TR, 2,000 ms; TE, 30 ms; flip angle, 90°; FOV, 64 × 64 mm; voxel size, 3 × 3 × 3 mm; number of slices, 36). Upon completion of functional imaging, T1-weighted magnetization-prepared rapid-acquisition gradient echo (MPRAGE) images were acquired for normalization purposes (TR, 1,800 ms; TE, 2.52 ms; FA, 9°; FOV, 256 × 256 mm; voxel size, 1 × 1 × 1 mm).

Subjects were briefed on MR safety and experimental procedures and guided through a practice run of the behavioral task (see below, Experimental Design and Tasks) before entering the scanner. They held a button press handle in each hand for the behavioral task throughout the experiment. Experimental stimuli were presented with MR-compatible video goggles (Nordic Neuro Lab, Norway).

Fig. 2: Experimental design. Each subject completed 8 runs, each of which consisted of 60 trials (48 visual stimuli in one of 4 adversary conditions and 12 blank trials). Subjects were instructed to perform a categorical one-back task, pressing the button if consecutive visual stimuli belonged to the same semantic category.

III-C3 Experimental Design and Tasks

The experimental task was programmed with PsychoPy v2.0 for Windows [38]. Stimulus images were presented foveally for a duration of 300 ms. The stimulus onset asynchrony (SOA) was 2 s. Null trials were randomly inserted in each run, producing the effect of jittered interstimulus intervals. The resulting SOA for image stimulus trials ranged from 2 s to 12 s. A centered fixation cross appeared throughout the runs.

Each subject completed a total of eight runs, each lasting 4 min 12 s, all presented on the same day (total duration, 33 min 36 s). Each run belonged to one of four conditions: (1) Clean (unattacked), (2) PGD, (3) C&W, or (4) Gaussian noise attacked images. Each condition constituted a separate run such that a single run consisted of images from the same condition. Each condition was presented twice in a pseudo-randomly assigned order. Each run presented each of the 48 images (visual angle, 9°) exactly once in random order, along with 12 randomly interspersed blank null trials showing a gray background only. Each run contained 6 s of pre- and post-rest (3 volumes each). Subjects were encouraged to take a break between runs.

Subjects were instructed to fixate on the fixation cross throughout the experiment and to perform a categorical one-back task. In this task, subjects pressed the button with the thumb of their preferred hand if the presented image stimulus belonged to the same category as the previous one (Fig. 2). The category used as the basis of the decision was animate (human and animal faces) versus inanimate (natural or artificial objects). Thus, subjects pressed the button if two consecutive stimuli both belonged to the animate (or both to the inanimate) class. The button responses were recorded for response accuracy, sensitivity, specificity, and latency, which were respectively calculated as follows:

$$\mathrm{accuracy} = \frac{TP + TN}{P + N} \tag{5}$$
$$\mathrm{sensitivity} = \frac{TP}{P} \tag{6}$$
$$\mathrm{specificity} = \frac{TN}{N} \tag{7}$$
$$\mathrm{latency} = \frac{1}{TP} \sum \left(t^{TP}_{response} - t^{TP}_{onset}\right) \tag{8}$$

$TP$, $TN$, $FP$, and $FN$ indicate the numbers of true positives, true negatives, false positives, and false negatives, respectively. $P = TP + FN$ is the total number of positive cases, and $N = TN + FP$ is the total number of negative cases. $t^{TP}_{response}$ and $t^{TP}_{onset}$ refer to the time of response and the time of stimulus onset for a true positive instance, respectively. One-way analyses of variance (ANOVAs) were performed to detect significant effects of the noise type on each of these measures.
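To make the four behavioral measures concrete, here is a minimal sketch that derives one-back targets from a category sequence and scores button presses against them. The data layout (one boolean press and one latency per scored trial) and the function names are assumptions, and null trials are ignored for simplicity.

```python
def one_back_targets(categories):
    """True where a stimulus shares its category with the one just before it.

    The first stimulus has no predecessor, so the result has one entry fewer
    than the input sequence.
    """
    return [cur == prev for prev, cur in zip(categories, categories[1:])]

def behavioral_metrics(targets, pressed, latencies):
    """Accuracy, sensitivity, specificity, and mean true-positive latency.

    latencies[i] is response time minus stimulus onset in seconds, or None
    when no button was pressed. Assumes at least one positive, one negative,
    and one true-positive trial, otherwise a division by zero occurs.
    """
    pairs = list(zip(targets, pressed))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    pos, neg = tp + fn, tn + fp
    hit_latencies = [l for (t, p), l in zip(pairs, latencies) if t and p]
    return {
        "accuracy": (tp + tn) / (pos + neg),
        "sensitivity": tp / pos,
        "specificity": tn / neg,
        "latency": sum(hit_latencies) / len(hit_latencies),
    }
```

For example, the sequence animate, animate, inanimate, inanimate, animate contains two target trials (the second and fourth stimuli) and two non-target trials.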

III-C4 Data Preprocessing

fMRI data preprocessing was performed using Statistical Parametric Mapping (SPM12, Wellcome Trust Centre for Neuroimaging, London, UK). The first three volumes of each run were discarded automatically during the scanning process for magnetic field stabilization. We performed a rigid body transform motion correction across runs in each subject using the middle volume as a reference. Functional images were directly normalized to the Montreal Neurological Institute (MNI) template (East Asian brains). The normalized images were rewritten at 3 mm isotropic voxels. No spatial smoothing was applied, as recommended for representational similarity analysis.

Model | No. | Anatomical region | ROI | Abbr.
Human | 1 | Gyrus fusiformis | {FG1 FG2 FG3 FG4} | FG
Human | 2 | hOC1 | {hOC1} | hOC1
Human | 3 | hOC2 | {hOC2} | hOC2
Human | 4 | Ventral extrastriate cortex | {hOC3d hOC3v} | hOC3d/4d
Human | 5 | Dorsal extrastriate cortex | {hOC4d hOC4v} | hOC3v/4v
Human | 6 | Lateral occipital cortex | {hOC4la hOC4lp} | hOC4l
DNN | 1 | - | {conv1_1 conv1_2} | conv1
DNN | 2 | - | {conv2_1 conv2_2} | conv2
DNN | 3 | - | {conv3_1 conv3_2 conv3_3 conv3_4} | conv3
DNN | 4 | - | {conv4_1 conv4_2 conv4_3 conv4_4} | conv4
DNN | 5 | - | {conv5_1 conv5_2 conv5_3 conv5_4} | conv5
DNN | 6 | - | {fc1} | fc1
DNN | 7 | - | {fc2} | fc2
DNN | 8 | - | {fc3} | fc3
TABLE III: Regions of Interest (ROI) Definition for Human and DNN

III-C5 Regions of Interest (ROI) Definition

Beta maps were extracted from normalized functional volumes for regions of interest (ROIs). ROIs were generated based on anatomical probability maps provided by the SPM Anatomy Toolbox [39]. A total of 12 maps including V1-4 (i.e. hOC1-4) and fusiform gyrus (i.e. FG) were chosen to represent the visual area (see Table III). The number of voxels per mask ranged from 316 to 3331. MarsBaR [40] was used to produce masks from the probability maps and to extract the masked beta maps. For beta map extraction, we masked functional images obtained 6 s (3 volumes) after the stimulus onset to account for hemodynamic delays. Each of these beta vectors was taken as the neural representation for an image stimulus in each visual area. Extracted beta vectors $\beta_{r,c}$ from ROI $r$ and condition $c$ were further normalized with their mean $\mu_{r,c}$ and standard deviation $\sigma_{r,c}$ before being used as input for the representational similarity analysis:

$$\hat{\beta}_{r,c} = \frac{\beta_{r,c} - \mu_{r,c}}{\sigma_{r,c}} \tag{9}$$
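The normalization step can be sketched as a plain z-score. Which values the mean and standard deviation are pooled over is not stated in this text, so the sketch simply normalizes whatever list of beta values it is given; the use of the population standard deviation and the name `zscore` are assumptions.

```python
from statistics import fmean, pstdev

def zscore(beta):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = fmean(beta)
    sigma = pstdev(beta)
    return [(b - mu) / sigma for b in beta]
```

After this step the values have zero mean and unit variance, which puts beta values from different ROIs and conditions on a common scale.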


III-D Representational Similarity Analysis

Representational similarity analysis [6] is a framework that enables comparisons of representations from different modalities, e.g. computational models and fMRI patterns, by comparing the dissimilarity patterns of the representations. Representations from different modalities are compared by first constructing representational dissimilarity matrices (RDMs). An RDM is a square symmetric matrix that contains pairwise (dis)similarity values between the response patterns of all stimulus pairs. Using the representational similarity analysis toolbox [41], we constructed an RDM for every ROI (shown in Table III) × condition {Clean, PGD, C&W, Gaussian} × subject {Human 1, 2, …, 14, DNN}. Given the 48 stimulus images used in our experiment, each RDM was a 48 × 48 matrix containing dissimilarity values between the response patterns (fMRI or DNN features) elicited by two stimuli. The dissimilarity measure was 1 minus the Pearson correlation.
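The RDM construction can be sketched in a few lines. This is an illustration rather than the toolbox code: `patterns` is a hypothetical list with one response vector (voxels or DNN units) per stimulus, and the diagonal is set to zero by convention.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def rdm(patterns):
    """Square symmetric matrix of 1 - Pearson r between all stimulus pairs."""
    n = len(patterns)
    return [[0.0 if i == j else 1.0 - pearson(patterns[i], patterns[j])
             for j in range(n)] for i in range(n)]
```

With 48 stimuli the result is a 48 × 48 matrix whose entries range from 0 (identical pattern shape) to 2 (perfectly anti-correlated patterns).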

We then constructed the second-level correlation matrix of RDMs by computing the pairwise similarities of individual RDMs, which visually demonstrates the relatedness of brain and DNN representation patterns from each ROI and condition. The similarity measure was Kendall's rank correlation coefficient $\tau$.
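A minimal sketch of the second-level comparison is given below. It compares the upper triangles of two RDMs, since the matrices are symmetric with an uninformative diagonal. The text does not state which Kendall variant was used; the tau-a form shown here is an assumption, and the function names are illustrative.

```python
def upper_triangle(m):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def kendall_tau_a(x, y):
    """(concordant - discordant) pairs divided by the total number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def rdm_similarity(a, b):
    """Rank correlation between the dissimilarity patterns of two RDMs."""
    return kendall_tau_a(upper_triangle(a), upper_triangle(b))
```

Applying `rdm_similarity` to every pair of RDMs yields the kind of second-level matrix described above, with 1 on the diagonal.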

Finally, we performed statistical inference with a one-sided signed-rank test to assess the degree of relatedness between RDMs. This procedure can be used, for example, to test whether a given computational model (’candidate RDM’) explains some brain representation (’reference RDM’) better than others, cf. [10]. In our experiment, we set the subject-average human RDMs in response to the clean stimuli as the reference RDM and related them to other brain RDMs or to DNN RDMs. We repeated the process in reverse, with the ROI average of DNN RDMs in response to clean stimuli as the reference RDM.
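One way such a test could be run is sketched below: an exact one-sided Wilcoxon signed-rank test on a list of per-subject correlation values, asking whether they tend to exceed zero. Treating the subject-wise correlations with the reference RDM as the test input is an assumption about the procedure, and the function name is hypothetical.

```python
def signed_rank_one_sided(values):
    """Exact one-sided signed-rank test of H1: values tend to be above zero.

    Returns (W_plus, p), where W_plus is the sum of ranks of the positive
    values. Zeros are dropped and tied magnitudes receive midranks. Ranks
    are doubled internally so that midranks stay integers.
    """
    d = [v for v in values if v != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the midrank of positions i..j
        i = j + 1
    w_plus2 = sum(r for r, v in zip(ranks2, d) if v > 0)
    # Count, for every achievable rank sum, how many sign assignments give it.
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    p = sum(counts[w_plus2:]) / 2 ** n
    return w_plus2 / 2, p
```

With 14 subjects, the smallest attainable one-sided p-value is 1 / 2^14, reached when all subject-wise correlations are positive.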

IV Results

Fig. 3: The correlation matrix of subject-average human and DNN RDMs. Each cell represents a pairwise similarity between two RDMs computed with Kendall’s rank correlation coefficient in the range [-1, 1].

IV-A Comparison of Visual Representations

Fig. 3 shows the correlation matrix of subject-average human and DNN RDMs produced from the representational similarity analysis. Each cell represents a Kendall’s rank correlation coefficient between two RDMs, with each RDM containing response patterns for 48 stimuli (not shown). Each row or column represents correlations between RDMs from a single ROI and the other RDMs. Correlations for the same stimuli condition are grouped together, forming square regions for high within-condition correlations. Correlations between the identical RDMs fill the diagonal with the value of 1.

Visual inspection of the correlation matrix reveals that human RDMs have relatively small within-group correlations (upper left quadrant) compared to DNN RDMs (lower right quadrant), reflecting the difference in noise levels (not corrected).

Within-group correlations in DNNs exhibit distinctive adversary effects reflected by condition-dependent representations in the fully connected layers (fc1-3): Features from fully connected layers of DNNs in response to clean stimuli show a moderate similarity to features of Gaussian-attacked stimuli, but not with those of PGD- or C&W-attacked stimuli.

Correlations between human and DNN RDMs (lower left quadrant) also show varying patterns by noise type: Human RDMs of clean stimuli (1st column) show small positive correlations with DNN RDMs from convolutional layers (conv1-5), but negative correlations with the higher fc1-3 layers, especially for PGD and C&W adversarial conditions; Human RDMs of PGD adversary (2nd column) show more positive correlations with DNN RDMs compared to other human conditions; Human RDMs of C&W and Gaussian adversary (3rd and 4th columns) also show moderate positive correlations with DNN, less in convolutional layers than in fully connected layers.

The correlation matrix is further supplemented with statistical inference results, shown in Fig. 4. Here, stars represent significant correlations (p < 0.05), and the gray boxes represent noise ceilings.

Fig. 4: Kendall-τ correlations between (a) human RDMs of adversarial images and human RDMs of clean images (reference), (b) human RDMs and DNN RDMs of clean images (reference), (c) DNN RDMs of adversarial images and DNN RDMs of clean images (reference), (d) DNN RDMs and human RDMs of clean images (reference).

Fig. 4(a) shows that the ROI- and subject-average human RDM from the clean condition has small negative correlations with human RDMs from other conditions, all correlations significant. From this, it seems that the human representations of the clean images are significantly different from all the noise conditions, but whether it is so to a different degree by noise type is hard to determine due to the small magnitude of correlations among human RDMs caused by low signal-to-noise ratio of fMRI.

A between-group comparison in Fig. 4(b) shows that there is a condition-dependent difference in humans by relating each human RDM to the reference DNN RDM in the clean condition, averaged across 8 ROIs. The significance test shows that human RDMs from the white-box adversary conditions (PGD in FG, hOC1, hOC2, hOC4la/4lp, hOC3d/4d, hOC3v/4v and C&W in hOC1, hOC2, FG), but neither clean nor Gaussian conditions, have significant positive correlations with the reference DNN RDMs from the clean condition.

For comparison to the DNN representations, Fig. 4(c) shows the relatedness of all DNN RDMs with the same average DNN RDMs from the clean condition as above, where all correlations are significant. Here, representations from conv3/4/1/2 of C&W and PGD were most similar to those of the reference, followed by conv4/5, fc1-3, fc3 of Gaussian noise. Fully connected layers of PGD and C&W were more dissimilar to the reference than any others.

Lastly, in Fig. 4(d), the reference was the average of human RDMs in the clean condition, showing the relatedness of each DNN RDM to the human clean condition reference. The fc3 RDM of the clean condition shows a significant overlap with the reference, followed by conv3 of other conditions as well as fc3 of Gaussian noise. As noted from the correlation matrix, all fully connected layers of the white-box adversary conditions had negative correlations with the human clean reference.

Subject | Accuracy: Clean PGD C&W Gaussian | Specificity: Clean PGD C&W Gaussian | Sensitivity: Clean PGD C&W Gaussian
1 0.989 0.968 1.000 1.000 0.957 0.981 0.958 1.000 1.000 0.941 1.000 0.978
2 0.947 0.979 0.978 0.980 0.979 0.909 0.980 1.000 0.989 0.977 0.980 0.977
3 0.989 0.979 0.978 1.000 1.000 0.976 1.000 1.000 0.989 1.000 1.000 0.950
4 1.000 1.000 1.000 1.000 0.989 1.000 1.000 0.968 0.979 0.976 1.000 1.000
5 0.989 1.000 1.000 0.984 0.989 0.982 1.000 1.000 1.000 1.000 1.000 1.000
6 0.947 0.968 0.963 0.979 0.968 0.944 0.938 1.000 0.979 0.957 0.950 1.000
7 0.989 0.989 1.000 1.000 0.989 0.975 0.976 1.000 1.000 0.981 1.000 1.000
8 1.000 0.947 0.941 1.000 0.989 1.000 0.889 1.000 0.979 0.978 1.000 0.983
9 0.915 0.989 0.981 1.000 0.968 0.923 0.981 1.000 0.989 0.925 0.905 1.000
10 1.000 1.000 0.833 1.000 1.000 1.000 1.000 0.848 0.840 1.000 1.000 1.000
11 0.989 0.989 0.978 1.000 1.000 0.978 1.000 1.000 0.989 1.000 1.000 0.974
12 0.926 0.883 0.946 0.938 0.947 0.932 0.950 0.921 0.936 0.957 0.920 0.833
13 0.968 1.000 0.942 1.000 0.989 0.967 1.000 1.000 0.968 0.981 0.971 1.000
14 1.000 0.989 0.979 1.000 0.989 1.000 1.000 1.000 0.989 0.982 1.000 0.980
Mean 0.975 0.977 0.966 0.992 0.983 0.969 0.977 0.981 0.973 0.975 0.980 0.977
Var 0.001 0.001 0.002 0.000 0.000 0.001 0.001 0.002 0.002 0.000 0.001 0.002
TABLE IV: Categorical Judgment Performance in Accuracy (Acc), Specificity (Spe), and Sensitivity (Sen)
Subject Clean PGD C&W Gaussian
1 0.766 0.805 0.721 0.790
2 0.810 0.842 0.952 0.744
3 0.714 0.721 0.744 0.793
4 0.477 0.514 0.514 0.493
5 0.627 0.533 0.565 0.771
6 0.580 0.696 0.583 0.551
7 0.511 0.539 0.509 0.527
8 0.596 0.586 0.513 0.563
9 0.616 0.569 0.597 0.609
10 0.598 0.683 0.643 0.672
11 0.670 0.648 0.655 0.677
12 0.904 0.822 0.880 0.819
13 0.683 0.770 0.747 0.810
14 0.745 0.782 0.631 0.6
Mean 0.664 0.679 0.661 0.674
Var 0.012 0.012 0.016 0.011
TABLE V: Categorical Judgment Performance in Response Latency (In Seconds)

IV-B Behavioral Performance

Table IV reports the categorical judgment performance for different conditions in each human subject. The average accuracies were 97.5%, 97.7%, 96.6%, and 99.2% for the Clean, PGD, C&W, and Gaussian conditions, respectively. There was no significant difference in categorical accuracy among the four conditions [F(3, 52)=0.232, p=0.874]. Other performance measures also showed no statistical difference, with average specificities of 98.3%, 96.9%, 97.7%, and 98.1% [F(3, 52)=0.331, p=0.803], and average sensitivities of 97.3%, 97.5%, 98.0%, and 97.7% [F(3, 52)=0.424, p=0.736].

Table V reports the true positive response latency for each condition. The average response latencies were 0.664 s, 0.679 s, 0.661 s, and 0.674 s for clean, PGD, C&W, and Gaussian conditions, respectively. No statistical difference was observed [F(3, 52)=0.0689, p=0.976].
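The degrees of freedom in the reported statistics follow from the design: four conditions give 4 − 1 = 3 between-group degrees of freedom, and 4 × 14 = 56 observations give 56 − 4 = 52 within-group degrees of freedom. A minimal sketch of the one-way ANOVA F statistic is shown below; the function name and the input layout (one list of per-subject values per condition) are assumptions.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

An F value well below 1, as reported for all four behavioral measures, means the condition means differ less than the spread among subjects within a condition would predict by chance.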

V Discussion

Our experimental results show that the presence of adversarial noise, regardless of type, had no effect on categorical judgments in human observers. However, in the visual representational space, different types of noise produced distinct patterns for both humans and the DNN. In the DNN, the white-box adversarial attacks PGD and C&W resulted in strongly disrupted patterns in the final, classifying layers (fc1-3), while Gaussian noise had qualitatively different, that is, weaker but more global, effects across all layers. The effects of adversarial noise were not as pronounced in human fMRI; however, between-group comparison to DNN features revealed that fMRI data from different noise conditions had distinctive similarity patterns. Notably, neural representations of white-box attacked images, but not of clean or Gaussian-noise images, had a significant resemblance to the DNN representations. Adversarially induced neural representations also differed in layer-specific response patterns.

Overall, the results indicate that neural processing in the early visual cortex may represent adversarial noise differently, but humans are unaware of it at the perceptual level and, as a result, unaffected at the behavioral level. A potential reason is that the human visual pathway, unlike its machine counterpart, incorporates a correction mechanism located higher in the visual pathway that counters the adversarial effect.

Future work should investigate the role of higher visual areas such as IT in the robust perceptual representation against adversaries. Also, the possibility that gradient-based or other structured noises may be represented differently from random noise by the brain as suggested here should be explored further. Finally, efforts should be made toward building a computational model of the brain that successfully accounts for its representational and behavioral patterns as it could provide a basis for building fundamentally more robust machine vision.


This work was supported by the Engineering Research Center of Excellence (ERC) Program supported by National Research Foundation (NRF), Korean Ministry of Science & ICT (MSIT) (Grant No. NRF-2017R1A5A1014708) and by the ICT R&D program of MSIP/IITP [R-20161130-004520, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].