Relationship between manifold smoothness and adversarial vulnerability in deep learning with local errors

07/04/2020 ∙ by Zijian Jiang, et al. ∙ Sun Yat-sen University

Artificial neural networks can achieve impressive performance, and even outperform humans in some specific tasks. Nevertheless, unlike biological brains, artificial neural networks are vulnerable to tiny perturbations in sensory input under various kinds of adversarial attacks. It is therefore necessary to study the origin of this adversarial vulnerability. Here, we establish a fundamental relationship between the geometry of hidden representations (the manifold perspective) and the generalization capability of deep networks. For this purpose, we choose a deep neural network trained by local errors, and then analyze emergent properties of the trained networks through the manifold dimensionality, the manifold smoothness, and the generalization capability. To explore the effects of adversarial examples, we consider independent Gaussian noise attacks and fast-gradient-sign-method (FGSM) attacks. Our study reveals that a high generalization accuracy requires a relatively fast power-law decay of the eigen-spectrum of hidden representations. Under Gaussian attacks, the relationship between generalization accuracy and power-law exponent is monotonic, while a non-monotonic behavior is observed for FGSM attacks. Our empirical study provides a route towards a mechanistic interpretation of adversarial vulnerability.




I Introduction

Artificial deep neural networks have achieved state-of-the-art performance in many domains such as pattern recognition and even natural language processing [1]. However, deep neural networks suffer from adversarial attacks [2, 3], i.e., they can make an incorrect classification with high confidence when the input image is slightly modified yet maintains its class label. In contrast, for humans and other animals, the decision making systems in the brain are quite robust to imperceptible pixel perturbations in the sensory inputs [4]. This immediately raises a fundamental question: what is the origin of the adversarial vulnerability of artificial neural networks? To address this question, we can first gain some insights from recent experimental observations of biological neural networks.

A recent investigation of recorded population activity in the visual cortex of awake mice revealed a power-law behavior in the principal component spectrum of the population responses [5], i.e., the $n$-th largest principal component (PC) variance scales as $n^{-\alpha}$, where $\alpha$ is the exponent of the power law. In this analysis, the exponent is always slightly greater than one for all input natural-image stimuli, reflecting an intrinsic property of smooth coding in biological neural networks. It can be proved that when the exponent is smaller than $1 + 2/d$, where $d$ is the manifold dimension of the stimulus set, the neural coding manifold must be fractal [5], and thus slightly modified inputs may cause extensive changes in outputs. In other words, an encoding with a slow decay of population variances would capture fine details of sensory inputs, rather than an abstract concept summarizing the inputs. In the fast-decay case, the population coding occurs on a smooth and differentiable manifold, and the dominant variance in the eigen-spectrum captures key features of the object identity. Thus, the coding is robust, even under adversarial attacks. Inspired by this recent study, we ask whether the power-law behavior exists in the eigen-spectrum of the correlated hidden neural activity in deep neural networks. Our goal is to clarify the possible fundamental relationship between classification accuracy, the decay rate of activity variances, manifold dimensionality and adversarial attacks of different nature.

As a trade-off between biological realism and theoretical tractability, we consider a special type of deep neural network, trained with a local cost function at each layer [6]. Moreover, this kind of training offers us the opportunity to look at the aforementioned fundamental relationship at each layer. The input signal is transferred by trainable feedforward weights, while the error is propagated back to adjust the feedforward weights via random quenched weights connecting the classifier at each layer. The learning is therefore guided by the target at each layer, and layered representations are created by this hierarchical learning. These layered representations offer us the neural activity space for the study of the above fundamental relationship.

Figure 1: Schematic illustration of deep neural networks trained with local errors. FC is shorthand for the fully-connected module, which performs a non-linear processing. RC refers to the random classifier module, which projects the output of FC to a 10-dimensional decision space. The red and blue arrows indicate forward and backward propagations, respectively. For the forward propagation, the input image is transferred through the FCs in a layer-wise manner. The FC at each layer produces an activation $h$, which serves as the input to the next layer, or is further weighted by the quenched random weights ($r$) of an RC. After the random projection, the cross-entropy cost ($E$) is evaluated. The classification error is back-propagated to adjust the trainable network parameters, through gradients of the cost function with respect to the feedforward weights ($w$) between consecutive layers.

II Model

The deep neural network under investigation is shown in Fig. 1. We train a network of $L+1$ layers (numbered from $0$) to classify the MNIST handwritten digits [7]. The first layer serves as an input layer, receiving a 784-dimensional vector from the MNIST dataset. These sensory inputs are then transferred to the next layer in a feedforward manner. The $l$-th ($l>0$) layer has $n_l$ neurons connected to a random classifier of a fixed size $10$, corresponding to the number of digit classes. The random classifier means that the connection weight to the classifier is pre-sampled from a zero-mean Gaussian distribution. The depth $L$ and each layer's width $n_l$ are flexible according to our settings. The strength of the connection, or the weight, between neuron $i$ at layer $l$ and neuron $j$ at the upper layer $l+1$ is denoted by $w^{l}_{ij}$, which is trainable. Similarly, at layer $l$, the weight between neuron $j$ at this layer and neuron $k$ at the neighboring random classifier is specified by $r^{l}_{jk}$, which is quenched.

For the forward propagation, e.g., at layer $l$, we have an input $h^{l-1}$ and the pre-activation is given by $z^{l}_j = \sum_i w^{l-1}_{ij} h^{l-1}_i$, which simply sums the inputs according to the corresponding weights. To ensure that the pre-activation is of order one, a scaling factor (such as $1/\sqrt{n_{l-1}}$) is often applied to the pre-activation. However, the scaling is not necessary in our task. A non-linear transfer function $f$ is then applied to the pre-activation to obtain an activation defined by $h^{l}_j = f(z^{l}_j)$. Here, we use the rectified linear unit (ReLU) function as the non-linear function, i.e., $f(z) = \max(0, z)$. The input of the next layer is the activation $h^{l}$ of layer $l$, except for the first layer where the input is a 784-dimensional vector characterizing one handwritten digit. Meanwhile, the activation of the current layer is also fed to the random classifier, resulting in a classification score vector $y^{l}$ whose components are $y^{l}_k = \sum_j r^{l}_{jk} h^{l}_j$. The score vector can be transformed into a classification probability by applying the softmax function, i.e., $p^{l}_k = e^{y^{l}_k} / \sum_{k'} e^{y^{l}_{k'}}$. $p^{l}_k$ can be understood as the probability of the input-label prediction. To specify the cost function, we first define $q$ as the one-hot representation for the label of an image. More precisely, $q_k = \delta_{k,t}$, where $t$ is the digit number of the input image. Finally, the local error function at layer $l$ is given by $E^{l} = -\sum_k q_k \ln p^{l}_k$. The local error is nothing but the cross entropy between $q$ and $p^{l}$. The forward propagation process can be summarized as follows:

$$
\begin{aligned}
z^{l}_j &= \sum_i w^{l-1}_{ij} h^{l-1}_i, \\
h^{l}_j &= \max(0, z^{l}_j), \\
y^{l}_k &= \sum_j r^{l}_{jk} h^{l}_j, \\
p^{l}_k &= \frac{e^{y^{l}_k}}{\sum_{k'} e^{y^{l}_{k'}}}, \\
E^{l} &= -\sum_k q_k \ln p^{l}_k.
\end{aligned}
\tag{1}
$$
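
The forward propagation at a single layer can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `local_forward`, the hidden width of 200, the random input, and the $1/\sqrt{\text{fan-in}}$ weight scale are all assumptions made for the example.

```python
import numpy as np

def local_forward(h_prev, W, R, q):
    """One layer trained with a local error (hypothetical helper).

    h_prev: input to the layer, shape (n_prev,)
    W:      trainable feedforward weights, shape (n_prev, n)
    R:      quenched random classifier weights, shape (n, 10)
    q:      one-hot label, shape (10,)
    """
    z = h_prev @ W                    # pre-activation
    h = np.maximum(z, 0.0)            # ReLU activation
    y = h @ R                         # classifier scores
    e = np.exp(y - y.max())           # shift scores for numerical stability
    p = e / e.sum()                   # softmax probabilities
    E = -np.sum(q * np.log(p))        # local cross-entropy error
    return z, h, y, p, E

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 784, 200, 10      # 200 is an assumed width
x = rng.random(n_in)                          # stand-in for one flattened digit
W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_hidden))
R = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, n_classes))
q = np.eye(n_classes)[3]                      # pretend the label is digit 3
z, h, y, p, E = local_forward(x, W, R, q)
print(p.round(3), E)
```

Stacking such layers, with `h` of one layer passed as `h_prev` of the next, gives one local error per layer.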


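The local cost function is minimized when $p^{l}_k = q_k$ for every $k$. The minimization is achieved by the gradient descent method. The gradient of the local error with respect to the weight of the feedforward layer can be calculated by applying the chain rule, given by:

$$
\frac{\partial E^{l}}{\partial w^{l-1}_{ij}} = \sum_k \frac{\partial E^{l}}{\partial y^{l}_k} \frac{\partial y^{l}_k}{\partial h^{l}_j} \frac{\partial h^{l}_j}{\partial z^{l}_j} \frac{\partial z^{l}_j}{\partial w^{l-1}_{ij}} = \sum_k \left(p^{l}_k - q_k\right) r^{l}_{jk}\, f'(z^{l}_j)\, h^{l-1}_i,
\tag{2}
$$

where $p^{l}_k$ is the softmax output of the classifier at layer $l$, $q_k$ the one-hot label, $r^{l}_{jk}$ the quenched classifier weight, $f'$ the derivative of the transfer function, $z^{l}_j$ the pre-activation, and $h^{l-1}_i$ the input to layer $l$.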
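
As a sanity check on the chain-rule expression, the analytic gradient can be compared against central finite differences on a tiny random layer. The sketch below is illustrative only; the helper names `local_loss` and `local_grad`, the layer sizes, and the step size are assumptions, not taken from the paper.

```python
import numpy as np

def local_loss(W, R, h_prev, q):
    """Local cross-entropy of one ReLU layer with a fixed random classifier R."""
    z = h_prev @ W
    h = np.maximum(z, 0.0)
    y = h @ R
    e = np.exp(y - y.max())
    p = e / e.sum()
    return -np.sum(q * np.log(p)), (z, p)

def local_grad(W, R, h_prev, q):
    """Analytic gradient dE/dW obtained from the chain rule."""
    _, (z, p) = local_loss(W, R, h_prev, q)
    delta_y = p - q                    # dE/dy_k for softmax + cross-entropy
    delta_h = R @ delta_y              # dE/dh_j = sum_k r_jk (p_k - q_k)
    delta_z = delta_h * (z > 0)        # ReLU derivative is 1 where z > 0
    return np.outer(h_prev, delta_z)   # dE/dw_ij = h_prev_i * dE/dz_j

rng = np.random.default_rng(1)
h_prev = rng.random(6)
W = rng.normal(size=(6, 5))
R = rng.normal(size=(5, 3))            # quenched: never updated
q = np.eye(3)[1]

G = local_grad(W, R, h_prev, q)

# Central finite differences, one weight at a time.
step = 1e-6
G_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += step
        Wm = W.copy(); Wm[i, j] -= step
        G_num[i, j] = (local_loss(Wp, R, h_prev, q)[0]
                       - local_loss(Wm, R, h_prev, q)[0]) / (2 * step)
max_err = np.max(np.abs(G - G_num))

# One plain gradient-descent step on the trainable weights only.
eta = 0.01                             # assumed learning rate
W_new = W - eta * G
```

Only `W` is updated; the classifier weights `R` enter the gradient but stay fixed, which is what makes the error signal local to the layer.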


Then, all trainable weights are updated by following $w^{l-1}_{ij}(t+1) = w^{l-1}_{ij}(t) - \eta\, \partial E^{l} / \partial w^{l-1}_{ij}$, where $t$ and $\eta$ indicate the learning step and the learning rate, respectively. In our settings, $r^{l}_{jk}$ is a pre-fixed random Gaussian variable. In practice, we use the mini-batch-based stochastic gradient descent method. The entire training dataset is divided into hundreds of mini-batches; after sweeping through all mini-batches, an epoch of learning is completed. The learning is then tested on an unseen (test) dataset. For the standard MNIST dataset, the training set has 60000 images, and the test set has 10000 images. We can conclude whether a test image is correctly classified at layer $l$ by comparing the position of the maximum component of the classifier score vector $y^{l}$ with that of the one-hot label $q$, i.e., the image is correctly classified at layer $l$ if $\arg\max_k y^{l}_k = \arg\max_k q_k$. The test accuracy is then estimated by the number of correctly classified images divided by the total number of images in the test set.
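
The per-layer accuracy criterion amounts to an argmax comparison. The short sketch below uses a made-up three-class score matrix purely for illustration; the function name `layer_accuracy` is hypothetical.

```python
import numpy as np

def layer_accuracy(scores, labels):
    """Fraction of images whose top-scoring class matches the label.

    scores: classifier scores at one layer, shape (num_images, num_classes)
    labels: integer class labels, shape (num_images,)
    """
    predictions = np.argmax(scores, axis=1)
    return float(np.mean(predictions == labels))

# Toy example with 4 images and 3 classes (values are arbitrary).
scores = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2, 0.3],
                   [0.0, 0.1, 0.9],
                   [2.0, 3.0, 1.0]])
labels = np.array([1, 0, 2, 0])
acc = layer_accuracy(scores, labels)   # the last image is misclassified
print(acc)                             # 0.75
```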

After learning, the input ensemble can be transferred throughout the network in a layer-wise manner. Then, at each layer, the activity statistics can be analyzed through the eigen-spectrum of the correlation matrix (or covariance matrix). We use principal component analysis (PCA) to obtain the eigen-spectrum, which gives variances along orthogonal directions in descending order. For each input image, the population output of the $n_l$ neurons at layer $l$ can be thought of as a point in the $n_l$-dimensional activation space. It then follows that, for $P$ input images, the outputs can be seen as a cloud of $P$ points. The PCA first finds the direction with maximal variance of the cloud, then chooses the second direction orthogonal to the first one, and so on. Finally, the PCA identifies $n_l$ orthogonal directions and the corresponding $n_l$ variances. In our current setting, the $n_l$ eigenvalues of the covariance matrix of the neural manifold give these variances. Arranging the eigenvalues in descending order leads to the eigen-spectrum, whose behavior is analyzed in the next section.
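
The eigen-spectrum computation described above can be sketched as follows. The activations here are synthetic (Gaussian with prescribed variances), used only to exercise the code; the function name `eigen_spectrum` and the sizes `P` and `n` are assumptions for the example.

```python
import numpy as np

def eigen_spectrum(activations):
    """Eigenvalues of the covariance matrix, in descending order.

    activations: array of shape (P, n), one row per input image.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activations.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)          # returned in ascending order
    return np.clip(eigvals[::-1], 0.0, None)   # descending, tiny negatives clipped

rng = np.random.default_rng(2)
P, n = 2000, 50
true_var = np.arange(1, n + 1) ** -1.5         # prescribed power-law variances
H = rng.normal(size=(P, n)) * np.sqrt(true_var)

spectrum = eigen_spectrum(H)
explained = spectrum / spectrum.sum()          # fraction of variance per PC
print(explained[:10].sum())                    # share carried by the first ten PCs
```

Since the covariance matrix is symmetric, `np.linalg.eigvalsh` is sufficient; the eigenvectors are not needed when only the spectrum is analyzed.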

To consider the effects of adversarial examples, we add perturbations to the original input $h^{l-1}$ at each layer. The original input is obtained by the layer-wise propagation of an image from the test set. We consider two kinds of additive perturbations: one is Gaussian white noise and the other is the fast gradient sign method (FGSM) noise [8, 9], representing black-box attacks and white-box attacks, respectively. Each component of the white noise is an i.i.d. random number drawn from a zero-mean Gaussian distribution with different variances (attack/noise strength), and each component of the FGSM noise is taken from the gradient of the local cost function at each layer with respect to the immediate input of this layer. These types of perturbations are given as follows:

$$
\tilde{h}^{l-1} = h^{l-1} + \epsilon\, \xi \quad \text{(Gaussian)}, \qquad
\tilde{h}^{l-1} = h^{l-1} + \epsilon\, \mathrm{sign}\!\left(\nabla_{h^{l-1}} E^{l}\right) \quad \text{(FGSM)},
\tag{3}
$$

where $\xi$ has i.i.d. standard Gaussian components, $\epsilon$ denotes the perturbation magnitude, and $\nabla_{h^{l-1}} E^{l}$ is the gradient of the local cost at layer $l$ with respect to its immediate input. In fact, the FGSM attack can be thought of as an $\ell_\infty$ norm ball attack around the original input image.
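
Both perturbations can be sketched for a single ReLU layer with a fixed random classifier. This is a minimal illustration under assumptions: the names `gaussian_attack` and `fgsm_attack`, the layer sizes, and the choice of scaling a standard Gaussian by the attack strength are all hypothetical and not taken from the authors' code.

```python
import numpy as np

def gaussian_attack(h_in, eps, rng):
    """Black-box attack: add i.i.d. zero-mean Gaussian noise scaled by eps."""
    return h_in + eps * rng.standard_normal(h_in.shape)

def fgsm_attack(h_in, W, R, q, eps):
    """White-box attack: step of size eps along the sign of dE/d(h_in)."""
    z = h_in @ W
    h = np.maximum(z, 0.0)
    y = h @ R
    e = np.exp(y - y.max())
    p = e / e.sum()
    grad_z = (R @ (p - q)) * (z > 0)   # back through classifier and ReLU
    grad_in = W @ grad_z               # gradient w.r.t. the layer input
    return h_in + eps * np.sign(grad_in)

rng = np.random.default_rng(3)
h_in = rng.random(20)                  # stand-in for the input of one layer
W = rng.normal(size=(20, 8))
R = rng.normal(size=(8, 10))
q = np.eye(10)[7]
eps = 0.1

h_gauss = gaussian_attack(h_in, eps, rng)
h_fgsm = fgsm_attack(h_in, W, R, q, eps)
linf = np.max(np.abs(h_fgsm - h_in))   # never exceeds eps
```

Every component of the FGSM perturbation has magnitude at most `eps`, which is why the attack stays inside an $\ell_\infty$ ball of radius `eps`, whereas the Gaussian perturbation is isotropic and unbounded per component.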

Figure 2: Typical trajectories of training and test error rates versus training epoch. Lines indicate the training error rate, and symbols indicate the test error rate. The network width of each layer is fixed (except for the input layer). The learning rate starts from an initial value and is multiplied by a constant factor every ten epochs.

III Results and Discussion

In this section, we apply our model to clarify the possible fundamental relationship between classification accuracy, the decay rate of activity variances, manifold dimensionality and adversarial attacks of different nature.

III.1 Test error decreases with depth

We first show that the deep supervised learning in our current setting works. Fig. 2 shows that, during training, the training error decreases while the test accuracy increases (before early stopping). As the layer goes deeper, the test accuracy grows until saturation, apart from a slight deterioration at the deepest layers. This behavior makes the model an ideal candidate for investigating the emergent properties of the layered intermediate representations after learning, without and with adversarial attacks. Next, we will study in detail how the test accuracy is related to the power-law exponent, how the test accuracy is related to the attack strength, and how the dimensionality of the layered representation changes with the exponent, under zero, weak, and strong adversarial attacks.

Figure 3: Eigen-spectrum of layer-dependent correlated activities and the power-law behavior of dominant PC dimensions. (a) The typical eigen-spectrum of deep networks trained with local errors. Log-log scales are used. The inset is the eigen-spectrum measured in the visual cortex of mice (taken from Ref. [5]). (b) An example of extracting the power-law behavior at the fifth layer in (a). A linear fit to the first ten principal components is shown on the log-log scale.

III.2 Power-law decay of dominant eigenvalues of the activity correlation matrix

A typical eigen-spectrum of our current deep learning model is given in Fig. 3. Because the eigen-spectrum is displayed on a log-log scale, the slope of a linear fit to the spectrum gives the power-law exponent $\alpha$ (the exponent is the negative of the slope). We use only the first ten principal components to estimate $\alpha$, rather than all of them, for the following two reasons: (i) a waterfall phenomenon appears in the spectrum beyond the leading dimensions, and is more evident at higher layers; (ii) the first ten dimensions explain most of the total variance, and thus they capture the key information about the geometry of the representation manifold. The waterfall phenomenon in the eigen-spectrum can occur multiple times, especially for deeper layers [Fig. 3 (a)], which is distinct from what is observed in biological neural networks [see the inset of Fig. 3 (a)]. This implies that the artificial deep networks may capture fine details of stimuli in a hierarchical manner. A typical example of obtaining the power-law exponent is shown in Fig. 3 (b) for the fifth layer. When the stimulus size is chosen to be large enough (as is done throughout the paper), the fluctuation of the estimated exponent due to stimulus selection can be neglected.
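
Extracting the exponent from the leading part of a spectrum is a linear regression in log-log coordinates. The sketch below applies it to an exact synthetic power law so the recovered value can be checked; the function name `fit_power_law_exponent` and the synthetic spectrum are assumptions for illustration.

```python
import numpy as np

def fit_power_law_exponent(spectrum, n_fit=10):
    """Estimate alpha from the first n_fit eigenvalues (descending order).

    Fits log(lambda_n) = -alpha * log(n) + c by least squares.
    """
    ranks = np.arange(1, n_fit + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(spectrum[:n_fit]), 1)
    return -slope

ranks = np.arange(1, 201)
spectrum = 3.0 * ranks ** -1.3         # synthetic spectrum with alpha = 1.3
alpha = fit_power_law_exponent(spectrum)
print(round(alpha, 6))                 # 1.3
```

Restricting the fit to `n_fit=10` mirrors the choice made in the text; on real spectra the estimate depends on this cutoff, so it is worth reporting alongside the exponent.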

Figure 4: Effects of network width on test accuracy and power-law exponent $\alpha$. (a) Test accuracy versus layer. Error bars are estimated over independently trained models. (b) $\alpha$ versus layer. Error bars are also estimated over independently trained models.

III.3 Effects of layer width on test accuracy and power-law exponent

We then explore the effects of layer width on both test accuracy and power-law exponent. As shown in Fig. 4 (a), the test accuracy becomes more stable with increasing layer width. This is indicated by the example of a small layer width, which shows a large fluctuation of the test accuracy, especially at deeper layers. We conclude that a few hundred neurons at each layer are sufficient for accurate learning.

The power-law exponent shows a similar behavior: the estimated exponent fluctuates less as the layer width increases. This result also shows that the exponent grows with depth; the deeper the layer is, the larger the exponent becomes. A larger exponent suggests that the manifold is smoother, because the dominant variance decays fast, leaving little room for encoding irrelevant features of the stimulus ensemble. This may indicate that depth in hierarchical learning is important for capturing key characteristics of sensory inputs.

Figure 5: The power-law exponent $\alpha$ versus test accuracy of the manifold. $\alpha$ grows with depth, while the test accuracy has a turnover at layer 2, and then decreases by a very small margin. Error bars are estimated over independently trained models.

III.4 Relationship between test accuracy and power-law exponent

In the following simulations, we fix the number of neurons at each layer. The relationship between $\alpha$ and the test accuracy is plotted in Fig. 5, with error bars indicating the standard errors across independent trainings. The typical value of $\alpha$ grows with depth, implying that the coding manifold becomes smoother at higher layers. The clear separation of the $\alpha$ values between the earlier layers (e.g., the exponent of the first layer is less than one) and the deeper layers may indicate a transition from non-smooth coding (a slow decay of the eigen-spectrum) to smooth coding. Interestingly, the test accuracy does not keep increasing with depth: it first increases and then drops by a relatively small margin. This suggests that too much smoothness harms the generalization ability of the network, although the impact is limited.

Figure 6: Relationship between test accuracy and power-law exponent $\alpha$ when the input test data is attacked by independent Gaussian white noise. Error bars are estimated over independently trained models. (a) Accuracy versus $\epsilon$, where $\epsilon$ is the attack amplitude. (b) $\alpha$ versus $\epsilon$. (c) Accuracy versus $\alpha$ over different values of $\epsilon$. Different symbol colors refer to different layers. The red arrow points to the direction along which $\epsilon$ increases in fixed increments. For increasing $\epsilon$, the relationship between $\alpha$ and accuracy in the first three layers is well described by a linear function with layer-dependent slopes and high linear fitting coefficients. Beyond the third layer, the linear relationship is not evident. For the sake of visibility, we enlarge the deeper-layer region in (d). A turning point appears. Above this point, the manifold seems to become smooth, and the exponent becomes stable even against stronger black-box attacks [see also (b)].

III.5 Properties of the model under black-box attacks

We first consider the additive Gaussian white noise perturbation to the input representation at each layer. This kind of perturbation is also called a black-box attack, because it does not require access to the training details of the deep learning model, including the architecture and the loss function. Under this attack, the test accuracy decreases as the perturbation magnitude increases. As expected from the above results on manifold smoothness at deeper layers, deeper layers are more robust against black-box attacks of increasing perturbation magnitude [Fig. 6 (a, b)]. The $\epsilon$-dependence of the power-law exponent shows that the adversarial robustness correlates with the manifold smoothness at deeper layers, which highlights the fundamental relationship between adversarial vulnerability and the fractal manifold [5].

Inspecting carefully the behavior of $\alpha$ as a function of the test accuracy, we can identify two interesting regimes separated by a turning point [Fig. 6 (c)]. Below the turning point, increasing $\epsilon$ results in a clear decrease in both $\alpha$ and the accuracy; moreover, for the first three layers, a linear relationship with a high linear fitting coefficient can be identified. The underlying mechanism is closely related to the monotonic behavior of the accuracy and the exponent as functions of the attack strength. Let us define their respective functions as $\mathrm{acc} = f(\epsilon)$ and $\alpha = g(\epsilon)$. Then, because $f$ is monotonic and hence invertible, a simple transformation leads to $\alpha = g(f^{-1}(\mathrm{acc}))$, which suggests that a particular choice of $f$ and $g$ (e.g., linear functions) allows for the observed linear relationship. Therefore, how the earlier layers are affected by adversarial examples of increasing strength plays a vital role in shaping the linear regime. Above the turning point, the linear fit fails, and a non-linear relationship takes over. One key reason is that deeper layers are more robust against adversarial examples in our current setting.
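
To make the composition argument concrete, consider the simplest case in which both the accuracy and the exponent decrease linearly with the attack strength. The coefficients $a_0, a_1, b_0, b_1$ below are hypothetical and introduced only for this worked example; they are not fitted values from the paper.

```latex
% Assume (for illustration) linear responses with a_1 > 0 and b_1 > 0:
\mathrm{acc} = f(\epsilon) = a_0 - a_1 \epsilon,
\qquad
\alpha = g(\epsilon) = b_0 - b_1 \epsilon .

% Invert f and substitute into g:
\epsilon = f^{-1}(\mathrm{acc}) = \frac{a_0 - \mathrm{acc}}{a_1},
\qquad
\alpha = g\!\left(f^{-1}(\mathrm{acc})\right)
       = \left(b_0 - \frac{b_1}{a_1}\, a_0\right) + \frac{b_1}{a_1}\, \mathrm{acc} .
```

Under this assumption the exponent is a linear function of the accuracy with positive slope $b_1/a_1$, so a layer whose exponent drops quickly relative to its accuracy shows a steeper line. Any curvature in $f$ or $g$ breaks the linearity.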

Figure 7: Relationship between test accuracy and exponent $\alpha$ under the FGSM attack. Error bars are estimated over independently trained models. (a) Accuracy changes with $\epsilon$. (b) $\alpha$ changes with $\epsilon$. (c) Accuracy versus $\alpha$ over different attack magnitudes; $\epsilon$ increases in fixed increments. The plot shows a non-monotonic behavior different from that of black-box attacks in Fig. 6 (c).

Figure 8: Relationship between dimensionality $D$ and power-law exponent $\alpha$. (a) $D(\alpha)$ estimated from the integral approximation and in the thermodynamic limit. $N$ is the layer width. (b) $D(\alpha)$ under the Gaussian white noise attack. The dimensionality and the exponent are estimated directly from the layered representations given the immediate perturbed input for each layer [Eq. (4)]. We show three typical cases of attack: no noise, small noise and strong noise. For each case, we plot eight results corresponding to eight layers. The green dashed line is the theoretical prediction [Eq. (5)], evaluated at the layer width $N$ used in the simulations. Error bars are estimated over independently trained models. (c) $D(\alpha)$ under the FGSM attack. The theoretical curve (dashed line) is computed from Eq. (5) in the same way. Error bars are estimated over independently trained models.

III.6 Properties of the model under white-box attacks

We then consider the white-box attack, i.e., the FGSM additive noise perturbation; the results are summarized in Fig. 7. The FGSM attack is much stronger than the black-box attack, as expected from the fact that both the loss function and the network architecture are used by the FGSM attack. The first few layers display evident adversarial vulnerability, or completely fail to identify the correct class of the input images, while the last deeper layers still show adversarial robustness to some extent. These deeper layers also maintain a relatively high value of $\alpha$, although strong adversarial examples strongly reduce the manifold smoothness, especially for layers next to the earlier layers. Under the FGSM attack, the relationship between $\alpha$ and the accuracy becomes complicated, and the linear-nonlinear separation that occurs under black-box attacks disappears. However, a high accuracy still implies a high value of $\alpha$. In particular, when the white-box perturbation is weak, the system reduces the manifold smoothness by a small margin to get a high accuracy [Fig. 7 (c)]. This can be interpreted as follows. Supported by the trained weights (no adversarial training), the manifold formed by the adversarial examples takes into account more details (or some special directions pointing to the decision boundary) of the adversarial examples. Under the FGSM attack, specific pixels whose $\ell_\infty$-norm perturbations strongly affect the loss function are used to flip the decision output. In this sense, a competition (or trade-off) between the accuracy and the manifold smoothness captured by $\alpha$ is present. This may explain why there exists a peak in Fig. 7 (c). The peak also appears in Fig. 7 (b), and the two types of peaks have a one-to-one correspondence. Note that black-box attacks have no such property.

III.7 Relationship between manifold linear dimensionality and power-law exponent

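The linear dimensionality of a manifold formed by data/representations can be thought of as a first approximation of the intrinsic geometry of a manifold [10, 11], defined as follows:

$$
D = \frac{\left(\sum_{i=1}^{N} \lambda_i\right)^2}{\sum_{i=1}^{N} \lambda_i^2},
\tag{4}
$$

where $\{\lambda_i\}$ is the eigen-spectrum of the covariance matrix. Supposing that the eigen-spectrum has a power-law decay $\lambda_n \propto n^{-\alpha}$ as the PC dimension $n$ increases, we simplify the dimensionality equation as follows:

$$
D(\alpha) = \frac{\left(\sum_{n=1}^{N} n^{-\alpha}\right)^2}{\sum_{n=1}^{N} n^{-2\alpha}}
\simeq \frac{(2\alpha - 1)\left(1 - N^{1-\alpha}\right)^2}{(\alpha - 1)^2 \left(1 - N^{1-2\alpha}\right)},
\tag{5}
$$

where $N$ denotes the layer width, and the approximation $\sum_{n=1}^{N} n^{-\alpha} \simeq \int_{1}^{N} x^{-\alpha}\, dx$ is used to get the second equality. In the thermodynamic limit ($N \to \infty$, with $\alpha > 1$), $D = \zeta(\alpha)^2 / \zeta(2\alpha)$, where $\zeta$ is the Riemann zeta function. Note that for a small value of $N$, the theoretical prediction of $D(\alpha)$ in Fig. 8 (b,c) is obtained by using the first equality (sum).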
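
The three ways of evaluating the dimensionality of a power-law spectrum (direct sum, integral approximation, and large-width limit) can be compared numerically. The sketch below is a plain-Python illustration; the function names are hypothetical, and the integral is taken from 1 to the layer width, which is an assumption about the exact form of the approximation.

```python
def dim_from_spectrum(eigvals):
    """Participation-ratio dimensionality: (sum lambda)^2 / sum lambda^2."""
    return sum(eigvals) ** 2 / sum(v * v for v in eigvals)

def dim_sum(alpha, N):
    """Dimensionality of an exact power-law spectrum n^-alpha, n = 1..N."""
    return dim_from_spectrum([n ** -alpha for n in range(1, N + 1)])

def dim_integral(alpha, N):
    """Integral approximation; requires alpha not equal to 0.5 or 1."""
    i1 = (N ** (1 - alpha) - 1) / (1 - alpha)          # integral of x^-alpha
    i2 = (N ** (1 - 2 * alpha) - 1) / (1 - 2 * alpha)  # integral of x^-2alpha
    return i1 ** 2 / i2

# A flat spectrum uses every direction equally, so D equals the width.
print(dim_from_spectrum([1.0] * 7))                    # 7.0

# For alpha = 2 the large-N limit is zeta(2)^2 / zeta(4) = 2.5.
for N in (10, 100, 100000):
    print(N, dim_sum(2.0, N), dim_integral(2.0, N))
```

For $\alpha = 2$ the direct sum approaches $2.5$ while the integral approximation approaches $3$, which illustrates why the sum is preferable at small widths and why only rough agreement with the integral form should be expected.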

Results are shown in Fig. 8. The theoretical prediction agrees roughly with simulations under zero, weak and strong attacks of both black-box and white-box types. This shows that it is reasonable to use the power-law decay of the eigen-spectrum, estimated from the first few dominant dimensions, to study the relationship between the manifold geometry and the adversarial vulnerability of artificial neural networks, as also supported by the non-trivial properties of this relationship reported above. Note that when the network width increases, a deviation may be observed due to the waterfall phenomenon in the eigen-spectrum (see Fig. 3).

IV Conclusion

In this work, we study the fundamental relationship between adversarial robustness and the manifold smoothness characterized by the power-law exponent of the eigen-spectrum. The eigen-spectrum is obtained from the correlated neural activity on the representation manifold. We choose deep supervised learning with local errors as our target deep learning model, because it allows each layered representation to be analyzed in terms of both test accuracy and manifold smoothness. We then reveal that deeper layers have a larger value of $\alpha$, and are thereby more robust against both black-box and white-box attacks. In particular, we find a turning point in the exponent-accuracy curve under black-box attacks, separating linear and non-linear relationships. This turning point also signals a qualitative change in the geometric properties of the manifold. Under white-box attacks, the exponent-accuracy behavior becomes more complicated, because the isotropy of Gaussian white noise perturbations no longer holds; given normally trained weights (no adversarial training), this imposes a trade-off between manifold smoothness and test accuracy.

All in all, although our study does not provide precise mechanisms underlying adversarial vulnerability, our empirical results are expected to offer some intuitive arguments about the fundamental relationship between generalization capability and the intrinsic properties of representation manifolds inside deep neural networks, encouraging future mechanistic studies towards the final goal of aligning machine perception and human perception [4].

This research was supported by the start-up budget 74130-18831109 of the 100-talent program of Sun Yat-sen University, and the NSFC (Grant No. 11805284).


  • [1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, 2016.
  • [2] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017.
  • [3] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5):828–841, 2019.
  • [4] Zhenglong Zhou and Chaz Firestone. Humans can decipher adversarial images. Nature Communications, 10(1):1334, 2019.
  • [5] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini, and Kenneth D. Harris. High-dimensional geometry of population responses in visual cortex. Nature, 571(7765):361–365, 2019.
  • [6] Hesham Mostafa, Vishwajith Ramesh, and Gert Cauwenberghs. Deep supervised learning using local errors. Frontiers in Neuroscience, 12:608, 2018.
  • [7] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
  • [8] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR 2014 : International Conference on Learning Representations (ICLR) 2014, 2014.
  • [9] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR 2015 : International Conference on Learning Representations 2015, 2015.
  • [10] Haiping Huang. Mechanisms of dimensionality reduction and decorrelation in deep neural networks. Phys. Rev. E, 98:062313, 2018.
  • [11] Jianwen Zhou and Haiping Huang. Weakly-correlated synapses promote dimension reduction in deep neural networks. arXiv:2006.11569, 2020.