Exploiting the Relationship Between Kendall's Rank Correlation and Cosine Similarity for Attribution Protection

by   Fan Wang, et al.
Nanyang Technological University

Model attributions are important in deep neural networks as they aid practitioners in understanding the models, but recent studies reveal that attributions can be easily perturbed by adding imperceptible noise to the input. The non-differentiable Kendall's rank correlation is a key performance index for attribution protection. In this paper, we first show that the expected Kendall's rank correlation is positively correlated to cosine similarity and then indicate that the direction of attribution is the key to attribution robustness. Based on these findings, we explore the vector space of attribution to explain the shortcomings of attribution defense methods using ℓ_p norm and propose integrated gradient regularizer (IGR), which maximizes the cosine similarity between natural and perturbed attributions. Our analysis further exposes that IGR encourages neurons with the same activation states for natural samples and the corresponding perturbed samples, which is shown to induce robustness to gradient-based attribution methods. Our experiments on different models and datasets confirm our analysis on attribution protection and demonstrate a decent improvement in adversarial robustness.




1 Introduction

In deep neural networks (DNNs), especially DNNs for image classification problems, the model attributions explain and measure the relative impact of each feature on the final prediction. When DNNs are applied to security-sensitive tasks such as medical imaging (Singh et al., 2020a) and autonomous driving (Lorente et al., 2021), it is important for practitioners to understand and reliably interpret the mechanism behind the outputs. Therefore, with the rapid development of network architectures, the attribution robustness is becoming even more crucial.

Although numerous attribution methods have been proposed in recent studies (Simonyan et al., 2014; Zeiler & Fergus, 2014; Shrikumar et al., 2017; Sundararajan et al., 2017; Zintgraf et al., 2017), it has been pointed out that they are vulnerable to attribution attacks. Different from standard adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015; Kurakin et al., 2017; Papernot et al., 2016a; Carlini & Wagner, 2017) that focus on misleading classifiers into incorrect outputs, Ghorbani et al. (2019) show that it is possible to generate visually indistinguishable images whose attributions differ significantly while the predicted label stays the same. Dombrowski et al. (2019) focus on a targeted attack that manipulates the attributions toward any predefined target attribution while keeping the model outputs unchanged. Common adversarial defense mechanisms such as adversarial training (Madry et al., 2018) and distillation (Papernot et al., 2016b) are not able to tackle attribution attacks; instead, researchers have turned their focus to the attribution itself.

Figure 1: A visualization of integrated gradients of perturbed images by restricting distance. The three perturbed attributions (the bottom images of the 2nd–4th columns) have the same distance () to the original attribution (the bottom image of the first column). While distance remains unchanged, Kendall’s rank correlations are not guaranteed to be close. However, the cosine similarities reflect the changes of the Kendall’s rank coefficients.

The differences between natural and perturbed attributions are usually measured by Kendall’s rank correlation (Kendall, 1948), which reflects the ordinal importance among features, i.e., the proportion of order alignment of attributions between original and perturbed images. A straightforward practice to protect the attributions against such adversaries is therefore to maximize their Kendall’s rank correlation. Since Kendall’s rank correlation is not differentiable, in practice it is replaced by differentiable alternatives, such as ℓ_p-distance regularizers (Chen et al., 2019; Boopathy et al., 2020). However, ℓ_p-distance regularizers are not ideal surrogates for Kendall’s rank correlation. As shown in Fig. 1, we found that, given a fixed ℓ_p-distance between original and perturbed attributions, their Kendall’s rank correlations can be drastically different, which indicates that ℓ_p-distance is unstable as a measure of attribution similarity. Although there are also non-ℓ_p based regularizers, such as using Pearson’s correlation as the surrogate measurement of Kendall’s rank correlation (Ivankay et al., 2020), most of them do not provide theoretical justification to understand their performance.

In this paper, we discover that cosine similarity, as a measurement emphasizing the angle between two vectors, is consistent with Kendall’s rank correlation. We present a theorem stating that cosine similarity is positively correlated with the expected Kendall’s rank correlation. Based on this angular perspective, we then explain the shortcomings of ℓ_p-norm based attribution robustness methods and propose the integrated gradients regularizer (IGR), an attribution robustness training regularizer that optimizes the cosine similarity between natural and perturbed attributions. Our further analysis shows that optimizing cosine similarity encourages neurons to keep the same activation states, which induces attribution robustness for gradient-based attribution methods. The contributions of this work are summarized as follows:

  • We theoretically show that, under certain assumptions, Kendall’s rank correlation between two vectors is positively correlated to their cosine similarity.

  • We characterize a novel geometric perspective related to the angles between attribution vectors that explains the connection between adversarial robustness and attribution robustness for attribution methods fulfilling the axiom of completeness (Sundararajan et al., 2017).

  • Under the angular perspective, we propose integrated gradients regularizer (IGR) to robustly train neural networks. Our method is proved to encourage neurons with the same activation states for natural and corresponding perturbed images, which induces attribution robustness.

  • The experimental results show that the proposed IGR regularizer can be embedded into adversarial training methods to improve their performance in terms of both attribution and adversarial robustness and outperform the state-of-the-art attribution protection methods.

The remainder of this paper is organized as follows. We first introduce the notations and previous related works in Section 2. The content starts with the theorem disclosing the relationship between Kendall’s rank correlation and cosine similarity in Section 3. Based on that, we discuss the vector space of attribution in Section 4 and describe the proposed IGR as well as its property regarding neuron activations and its connection back to attribution robustness in Section 5. Section 6 presents our experimental results and the paper concludes in Section 7.

2 Preliminaries and Related Work

Let $(\mathbf{x}, y)$ denote data points sampled from the distribution $\mathcal{D}$, where $\mathbf{x}$ are input data and $y$ are labels. A non-bold version $x_i$ denotes the $i$-th feature of vector $\mathbf{x}$, and the capitalized version $X$ denotes a random variable. A classifier $f$ is the mapping from input space to the logits parameterized by $\theta$, where $f_i(\mathbf{x})$ is the $i$-th entry of $f(\mathbf{x})$, and the classification result of input $\mathbf{x}$ is given by the index of the maximum logit, $\arg\max_i f_i(\mathbf{x})$.

2.1 Attribution Methods

Model attribution studies the importance that the input features contribute towards the final result. The most widely used attribution methods include perturbation-based methods (Zeiler & Fergus, 2014; Zintgraf et al., 2017) and backpropagation-based methods (Bach et al., 2015), the latter including gradient-based attribution methods (Simonyan et al., 2014; Shrikumar et al., 2017). In particular, integrated gradients (IG) (Sundararajan et al., 2017), one of the gradient-based methods, computes the attribution as the line integral of gradients along the straight path from a baseline image $\mathbf{x}'$ to the input image $\mathbf{x}$, weighted by their difference, i.e.,

$$\mathrm{IG}_i(\mathbf{x}, \mathbf{x}') = (x_i - x'_i) \int_0^1 \frac{\partial f_y\big(\mathbf{x}' + \alpha(\mathbf{x} - \mathbf{x}')\big)}{\partial x_i}\, d\alpha,$$

where $f_y$ denotes the model output for class $y$. IG satisfies the axiom of completeness, which guarantees $\sum_i \mathrm{IG}_i(\mathbf{x}, \mathbf{x}') = f_y(\mathbf{x}) - f_y(\mathbf{x}')$. We omit the baseline image in the later parts of this paper, and it is chosen to be a black image, i.e., $\mathbf{x}' = \mathbf{0}$, if not specifically stated.
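As a small illustration of the integrated gradients definition and of the completeness axiom, the following sketch approximates the path integral with a midpoint Riemann sum and compares the sum of attributions with the change in model output. The toy model `f`, its weights `W` and `v`, and the step count are hypothetical choices made for this example only; they are not the networks used in the paper.

```python
import numpy as np

def f(x, W, v):
    """Toy scalar output v . tanh(W x); a hypothetical stand-in for f_y."""
    return float(v @ np.tanh(W @ x))

def grad_f(x, W, v):
    """Analytic gradient of f with respect to the input x."""
    h = np.tanh(W @ x)
    return W.T @ (v * (1.0 - h ** 2))

def integrated_gradients(x, baseline, W, v, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += grad_f(baseline + a * (x - baseline), W, v)
    avg_grad /= steps
    # Weight the averaged gradient by the input-baseline difference.
    return (x - baseline) * avg_grad

rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(5, 8))
v = rng.normal(size=5)
x = rng.normal(size=8)
baseline = np.zeros_like(x)  # all-zero ("black image") baseline

ig = integrated_gradients(x, baseline, W, v)
# Completeness: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(ig.sum() - (f(x, W, v) - f(baseline, W, v)))
print(completeness_gap)
```

The gap shrinks as `steps` grows, since the only source of error in this sketch is the numerical integration.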

2.2 Attribution Robustness

Recent studies reveal a vulnerability of neural networks: similar to adversarial examples, natural images with imperceptible perturbations added can have significantly different attributions while their classification results remain unchanged (Ghorbani et al., 2019). Heo et al. (2019) manipulate the model parameters instead of the input images to disturb attributions while retaining high classification accuracy. Dombrowski et al. (2019) mount a targeted attack that changes the original attributions to any predefined attributions and give a theoretical explanation of this phenomenon.

Engstrom et al. (2019) point out that robust optimization enhances model representations and interpretability. Chen et al. (2019) and Boopathy et al. (2020) use the ℓ_p-norm to constrain the distance between attributions of natural and perturbed images. Sarkar et al. (2021) propose a contrastive regularizer that encourages a skewed distribution for the true-class attribution and a uniform one for the negative-class attribution. Ivankay et al. (2020) directly optimize the Pearson correlation coefficient, a differentiable surrogate of Kendall’s rank correlation. Singh et al. (2020b) use a triplet loss to minimize an upper bound of the attribution distortion. Although the previous techniques present promising results, none of them exploits the angle between attributions explicitly for attribution protection. The method introduced in this work leverages the relationship between Kendall’s rank correlation and cosine similarity, with theoretical support, for attribution robustness.

3 Kendall’s Rank Correlation and Cosine Similarity

Figure 2: (a) Visualization of Kendall’s rank correlation and cosine similarity using simulated data. Given a fixed vector with dimension of 10,000, one thousand random vectors are sampled and their corresponding and with are calculated and plotted. The positive correlation can be observed and is later proved by Theorem 1. (b) Visualization of relations between Kendall’s rank correlation and cosine similarity () and Pearson’s correlation coefficient () generated from the real images of Fashion-MNIST. It is noticed that is more concordant with Kendall’s rank correlation than that of . The concordance is measured by Goodman and Kruskal’s gamma (), and it is found that cosine similarity is a more suitable substitute for Kendall’s rank correlation as while . (c) 2D illustration of comparison of attribution trained by -norm and cosine similarity. The axes are two dimensions of attribution. The solid ball and dashed ball represent two networks. Solid ball represents the untrained attribution surface and dashed ball is the trained surface .

Kendall’s rank correlation, often denoted by $\tau$, is a measurement of the ordinal relationship between two quantities, where two quantities have a higher $\tau$ when they have more concordant pairs. Formally, Kendall’s rank correlation between two vectors $\mathbf{a}$ and $\mathbf{b}$ can be explicitly computed by

$$\tau(\mathbf{a}, \mathbf{b}) = \frac{2}{d(d-1)} \sum_{i<j} \operatorname{sign}(a_i - a_j)\,\operatorname{sign}(b_i - b_j),$$

where $d$ is the dimension of the vectors. As Kendall’s rank correlation is an important metric to quantify the differences between perturbed and natural attributions, we begin by presenting the relationship between Kendall’s rank correlation and cosine similarity.
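For concreteness, Kendall’s rank correlation can be evaluated directly as the normalized sum, over all pairs of entries, of the product of the signs of the pairwise differences in the two vectors. The sketch below is a plain quadratic-time NumPy version; the function name `kendall_tau` and the small example vectors are chosen here for illustration.

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall's rank correlation via the pairwise sign formula (O(d^2))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = a.size
    sa = np.sign(a[:, None] - a[None, :])
    sb = np.sign(b[:, None] - b[None, :])
    # The full d x d sum counts every unordered pair twice; the diagonal is zero.
    return float((sa * sb).sum() / (d * (d - 1)))

a = np.array([0.1, 0.4, 0.2, 0.9])
tau_same = kendall_tau(a, 3.0 * a)                           # same ordering
tau_reversed = kendall_tau(a, -a)                            # reversed ordering
tau_swap = kendall_tau(a, np.array([0.1, 0.2, 0.4, 0.9]))    # one discordant pair
print(tau_same, tau_reversed, tau_swap)
```

Because only the signs of pairwise differences enter the formula, the value is piecewise constant in the entries, which is why it cannot be optimized directly with gradient descent.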

Enhancing attribution robustness is equivalent to forcing the perturbed attribution to have a higher Kendall’s rank correlation with the original one. However, as Kendall’s rank correlation is not differentiable, it is difficult to optimize directly. It is therefore necessary to find an alternative that either approximates Kendall’s rank correlation or behaves consistently with it. The following theorem states that cosine similarity is an appropriate replacement as a regularization term since it is positively related to Kendall’s rank correlation (Figure 2(a)).

Theorem 1.

Given a random vector where , and two arbitrary vectors with the same dimension, that , assume that there exists a sequence with and , where the vectors satisfy the condition that , and each can be induced from its previous vector through one of the following two operations,

  1. arbitrarily exchanging two entries of

  2. multiplying one entry in by

Then Kendall’s rank correlations of with and have the property that .

The full proof can be found in Appendix A.1. In the scenario of attribution robustness, under the above assumption, we regard as the natural attribution , and and as two perturbed attributions and . If the perturbed attribution has a greater cosine similarity with natural attribution, then their expected Kendall’s rank correlation is also greater. Explicitly speaking, if , then . This theorem provides a theoretical foundation that supports using cosine similarity for attribution protection because it directly links to Kendall’s rank correlation.
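The positive relationship can also be observed numerically. The sketch below is only an illustration in the spirit of the simulated-data plot, not a reproduction of the theorem’s assumptions: it fixes one Gaussian vector, generates perturbed copies by mixing it with fresh Gaussian noise (a sampling scheme assumed here for convenience), and measures how strongly cosine similarity and Kendall’s rank correlation co-vary across the samples.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kendall_tau(a, b):
    """Kendall's rank correlation via pairwise signs (O(d^2))."""
    d = a.size
    sa = np.sign(a[:, None] - a[None, :])
    sb = np.sign(b[:, None] - b[None, :])
    return float((sa * sb).sum() / (d * (d - 1)))

rng = np.random.default_rng(0)
d, n = 200, 300
a = rng.normal(size=d)  # stand-in for the natural attribution

cos_vals, tau_vals = [], []
for _ in range(n):
    w = rng.uniform(0.0, 1.0)  # how much of `a` survives the perturbation
    b = w * a + (1.0 - w) * rng.normal(size=d)
    cos_vals.append(cosine(a, b))
    tau_vals.append(kendall_tau(a, b))

# Pearson correlation between the two similarity measures across samples.
corr = float(np.corrcoef(cos_vals, tau_vals)[0, 1])
print(round(corr, 3))
```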

4 Characterization of Geometric Perspective on Attributions

In the last section, we have indicated the relationship between cosine similarity and Kendall’s rank correlation. In this section, we use this relationship to explain (i) the drawbacks of attribution protections based on the ℓ_p-norm, e.g., in Chen et al. (2019), where $\mathbf{x}$ is a natural sample and $\tilde{\mathbf{x}}$ is a perturbed sample; (ii) the inappropriateness of standard adversarial training for attribution protection; and (iii) the limitation of the cosine similarity, i.e., for standard adversarial protection. In this discussion, the attribution method is assumed to fulfill the axiom of completeness, i.e., and and are considered as vectors. Without loss of generality, we assume and use -norm as the illustration.

Figure 2(c) shows a two-dimensional projection for ease of illustration, where each two-dimensional point represents an attribution of an input. Higher-dimensional cases can be extended in a similar manner. In Figure 2(c), is the original attribution of and the others are its perturbed counterparts. Given two attributions, and , where but , according to Theorem 1, is likely larger than , implying that the attribution of is likely closer to the attribution of than that of . This explains the results in Figure 1 and point (i), i.e., the drawbacks of attribution protection based on the ℓ_p-norm.
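Point (i) can be illustrated with a short numerical sketch. Two perturbed vectors are placed at exactly the same Euclidean distance from a reference vector, once along the reference direction and once orthogonal to it, and their cosine similarity and Kendall’s rank correlation with the reference are compared. The use of the ℓ_2 distance, the random reference vector, and the radius are assumptions made for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kendall_tau(a, b):
    """Kendall's rank correlation via pairwise signs (O(d^2))."""
    d = a.size
    sa = np.sign(a[:, None] - a[None, :])
    sb = np.sign(b[:, None] - b[None, :])
    return float((sa * sb).sum() / (d * (d - 1)))

rng = np.random.default_rng(1)
d = 50
a = np.abs(rng.normal(size=d)) + 0.1   # hypothetical natural attribution
r = 0.5 * np.linalg.norm(a)            # shared perturbation radius

# Perturbation 1: move along the direction of a (a pure rescaling).
b1 = a + r * a / np.linalg.norm(a)

# Perturbation 2: move orthogonally to a by the same distance.
u = rng.normal(size=d)
u -= (u @ a) / (a @ a) * a             # remove the component along a
b2 = a + r * u / np.linalg.norm(u)

dist1, dist2 = np.linalg.norm(b1 - a), np.linalg.norm(b2 - a)
cos1, cos2 = cosine(a, b1), cosine(a, b2)
tau1, tau2 = kendall_tau(a, b1), kendall_tau(a, b2)
print(dist1, dist2, cos1, cos2, tau1, tau2)
```

Both perturbations have the same distance to the reference, yet the rescaled vector keeps the feature ranking intact while the orthogonal one does not, and the cosine similarity separates the two cases.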

Before discussing point (ii), it is worth mentioning that, if for all , . In other words, is the lower bound of . The standard adversarial training maximizes , where is an adversarial example. In Figure 2(c), has large , implying that is also large. In other words, the classification label of is well protected. However, standard adversarial training does not explicitly optimize the angle between and . It implies that can be small and can attack the attribution successfully. It should be mentioned that adversarial training does improve attribution robustness because it smooths the decision surface (Wang et al., 2020), although it is not the most ideal approach. This explains point (ii).
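One way to read the lower bound used in this argument is through the triangle inequality. The following is a sketch under two assumptions: the attribution, written $A(\mathbf{x})$ here purely for illustration, satisfies completeness, and the model output at the baseline is zero.

```latex
\|A(\mathbf{x})\|_1
  = \sum_i |A_i(\mathbf{x})|
  \;\ge\; \Big|\sum_i A_i(\mathbf{x})\Big|
  = |f_y(\mathbf{x})| ,
\qquad \text{with equality when all } A_i(\mathbf{x}) \text{ share the same sign.}
```

Under these assumptions, a large output on the true class forces a large attribution magnitude, but it places no constraint on the direction of the attribution vector.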

Point (iii) can be explained similarly. Since the cosine similarity, or , does not necessarily enlarge the magnitude of , it cannot improve network robustness against standard adversarial attack. Figure 2(c) shows two networks (the dashed circle and solid circle). , which implies that the two networks perform the same on attribution protection, but , meaning is more robust against the standard adversarial attack from , while is more vulnerable.

To protect against both attribution attacks and adversarial attacks, in the following section the proposed IGR is optimized jointly with an adversarial loss: the former minimizes the angle between attribution vectors to perform attribution protection, and the latter maximizes their magnitude to offer standard adversarial protection.

5 Integrated Gradients Regularizer (IGR)

Based on the above analysis, in this section, we introduce the integrated gradients regularizer (IGR), which regularizes the cosine similarity between natural and perturbed attributions.

5.1 IGR Robust Training Objective

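Since Kendall’s rank correlation and cosine similarity are positively related, we suggest improving attribution robustness, especially for integrated gradients, by maximizing the cosine similarity between natural and perturbed attributions, or equivalently, minimizing $1 - \cos(\mathrm{IG}(\mathbf{x}), \mathrm{IG}(\tilde{\mathbf{x}}))$. Therefore, we propose the following training objective function incorporating the IGR:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{adv}} + \lambda \left(1 - \cos\big(\mathrm{IG}(\mathbf{x}), \mathrm{IG}(\tilde{\mathbf{x}})\big)\right), \tag{3}$$

where $\mathcal{L}_{\mathrm{adv}}$ is a standard loss function used in robust training, and $\lambda$ is a hyper-parameter. We will later show in Section 6 that $\mathcal{L}_{\mathrm{adv}}$ can be chosen from existing loss functions in robust training, and that IGR further improves robustness over the baseline methods.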
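The regularizer itself is simple to evaluate once the two attribution maps are available. The sketch below computes one minus the cosine similarity of the flattened maps and adds it, weighted by a hyper-parameter, to a given robust-training loss value. The function names, the `eps` stabilizer, and the example numbers are hypothetical; in actual training the attributions would be produced by an automatic differentiation framework so that the penalty can be backpropagated, which this NumPy version does not do.

```python
import numpy as np

def igr_penalty(ig_nat, ig_adv, eps=1e-12):
    """One minus the cosine similarity between flattened attribution maps."""
    a = np.ravel(ig_nat).astype(float)
    b = np.ravel(ig_adv).astype(float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - cos)

def total_loss(adv_loss, ig_nat, ig_adv, lam=1.0):
    """Robust-training loss value plus the weighted regularizer."""
    return adv_loss + lam * igr_penalty(ig_nat, ig_adv)

ig_nat = np.array([[0.2, -0.1], [0.4, 0.3]])
same_direction = igr_penalty(ig_nat, 5.0 * ig_nat)   # only the magnitude changes
opposite = igr_penalty(ig_nat, -ig_nat)              # direction flipped
loss = total_loss(0.7, ig_nat, -ig_nat, lam=0.5)
print(same_direction, opposite, loss)
```

The penalty ignores the magnitude of the attributions and depends only on their direction, which is why it is paired with an adversarial loss rather than used alone.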

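In practice, the integral inside the IG definition can be numerically computed by its Riemann sum, and IG is approximated by

$$\mathrm{IG}_i(\mathbf{x}) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f_y\big(\mathbf{x}' + \tfrac{k}{m}(\mathbf{x} - \mathbf{x}')\big)}{\partial x_i}, \tag{4}$$

where $\mathbf{x}'$ is the baseline image and $m$ is the number of steps in the approximation.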
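Similar to adversarial training, optimizing the objective function in Eq. (3) requires a perturbed example $\tilde{\mathbf{x}}$ that maximally diverts its IG from the original counterpart. Such examples can be found by maximizing the proposed regularizer within the norm ball $\mathcal{B}_\epsilon(\mathbf{x})$ of radius $\epsilon$ around $\mathbf{x}$, i.e.,

$$\tilde{\mathbf{x}} = \arg\max_{\mathbf{x}^* \in \mathcal{B}_\epsilon(\mathbf{x})} \left(1 - \cos\big(\mathrm{IG}(\mathbf{x}), \mathrm{IG}(\mathbf{x}^*)\big)\right). \tag{5}$$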
It is noticed that computing the adversarial loss itself relies on , which can be obtained from adversarial attacks, i.e., . Thus, here we reuse these in IGR to avoid repeatedly using gradient descent methods to find the optimum in Eq. (5) and speed up the training. For example, if is the standard adversarial training loss function, we directly use the adversarial examples generated from PGD attack.

The use of Pearson’s correlation regularizer in Ivankay et al. (2020) as the replacement of Kendall’s rank correlation is the closest method to ours. Ivankay et al. (2020) suggest that Pearson’s correlation regularizer keeps the ranking of features constant. However, that statement is not supported by any theoretical justification, whereas we give a theorem showing that cosine similarity is positively related to Kendall’s rank correlation. Besides, as shown in Figure 2(b), cosine similarity is a better substitute, as there are more concordant pairs between Kendall’s rank correlation and cosine similarity in the real data compared with Pearson’s correlation. Additional comparisons are given in Appendix B.3. Furthermore, the experimental results also support that our method performs better in attribution robustness (see Section 6).

5.2 IGR Induces More Consistent Neuron Activation States

An interesting discovery about IGR is related to neuron activations. We found that neurons in ReLU networks trained with IGR more often share the same activation state for a natural sample and its corresponding perturbed sample. For deep networks with ReLU activations, if the pre-ReLU value is positive (negative) for the natural sample, the probability of the pre-ReLU value being positive (negative) for the corresponding perturbed sample increases when the network is trained with IGR. To analyse this phenomenon, a single-layer neural network with ReLU activation is studied. The results from this single-layer neural network can be extended to deep networks by stacking multiple layers.

Recall that is an input image, and the network function is parameterized by , where is the column vector of , is the -th entry of matrix and is the -th entry of vector , i.e., . Then, the following proposition holds.

Proposition 1.

Given a single-layer neural network with ReLU activation, and with the above parameterization, if, for all , and

are all independent and identically distributed random variables following Gaussian distributions, i.e.,

and , and two input images are low contrast, i.e., and , then

The proof can be found in Appendix A.2. The right-hand side of Eq. (1) is called the activation consistency of natural and perturbed samples. For the sake of convenience, let the event be and be . The right-hand side of Eq. (1) can be rewritten as . Since has the upper bound that , it is obvious that


and the equality holds when event happens with the same probability as event , i.e., ; alternatively, the equality can hold when . In other words, maximizing Eq. (1) encourages that the network activates the same set of neurons for and , or deactivates the same set of neurons for and .

Although Eq. (1) indicates that encourages that and activate the same set of neurons, it does not show any relationship with attribution protection. In the following discussion, we consider two single-layer networks, , where , to reveal the relationship between activation consistency and attribution protection. Each network is parameterized by . We denote the activation states of each network for input as an -dimensional binary vector , where


Similarly, the activation state for is . We partition the indices set of vector according to its element’s value into and such that when and when . The Proposition 2 given below reveals the relationship between activation consistency and attribution protection.

Proposition 2.

For two single-layer networks as defined above, where , given fixed and , if the cardinality and for all , where and is the sample mean defined as


then the following inequality holds


The full proof can be found in Appendix A.3. The Proposition 2 states that if the network has more neurons with the same activation states for and than and for all , then the expected difference between and is smaller than that between and . Since the gradient-based attribution methods are based on gradient, if and are more similar, the corresponding attributions would also be more similar.
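The link between activation states and input gradients can be made concrete for a single-layer ReLU network. In the sketch below the network is assumed to have the form f(x) = v · relu(W x), which is a parameterization chosen here for illustration and may differ from the one in the proposition. For this form the input gradient depends on the input only through the binary activation-state vector, so two inputs with identical states have identical gradients.

```python
import numpy as np

def activation_state(W, x):
    """Binary vector: 1 where the pre-ReLU value is positive, else 0."""
    return (W @ x > 0).astype(float)

def input_gradient(W, v, x):
    """Gradient of f(x) = v . relu(W x) with respect to x."""
    return W.T @ (v * activation_state(W, x))

def state_agreement(W, x, x_tilde):
    """Fraction of neurons with the same activation state for both inputs."""
    return float(np.mean(activation_state(W, x) == activation_state(W, x_tilde)))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))
v = rng.normal(size=64)
x = rng.normal(size=16)

x_small = x + 1e-6 * rng.normal(size=16)   # tiny perturbation
x_large = x + 1.0 * rng.normal(size=16)    # large perturbation

agree_small = state_agreement(W, x, x_small)
agree_large = state_agreement(W, x, x_large)
g = input_gradient(W, v, x)
grad_diff_small = float(np.linalg.norm(g - input_gradient(W, v, x_small)))
grad_diff_large = float(np.linalg.norm(g - input_gradient(W, v, x_large)))
print(agree_small, agree_large, grad_diff_small, grad_diff_large)
```

The agreement fraction used here is a simple proxy for how consistent the activation states are; it is not claimed to be the exact activation-consistency quantity defined in Proposition 1.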

6 Experiments and Results

6.1 Experimental Configurations

We evaluate the performance of IGR on different datasets, including MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009). For MNIST and Fashion-MNIST, we use a network consisting of four convolutional layers followed by three fully connected layers. The model is trained by the Adam optimizer (Kingma & Ba, 2014) with learning rate for 90 epochs. For CIFAR-10, we train a ResNet-18 (He et al., 2016) for 120 epochs using SGD (Sutskever et al., 2013) with initial learning rate 0.1, momentum 0.9 and weight decay . The learning rate decays by at the 75th and 90th epoch. All the experiments are run on NVIDIA GeForce RTX 3090.

As discussed in Section 5, IGR is applied with state-of-the-art adversarial training methods: standard adversarial training (AT)(Madry et al., 2018), TRADES (Zhang et al., 2019) and MART (Wang et al., 2019). , and in Table 1 are the objective functions of these methods, and are regarded as in Eq. (3). In Table 1, CE denotes the cross-entropy loss and KL denotes the KL-divergence. BCE is a boosted cross-entropy (see details in Wang et al. (2019)). Note that both AT and MART generate adversarial examples by maximizing the CE loss, while TRADES maximizes the KL-divergence regularizer. Following the baseline methods, we directly leverage the perturbed examples generated by their original techniques to compute the integrated gradients, as well as the IGR, instead of generating our own ones using Eq. (5). Moreover, to ensure fair comparisons, we keep the hyper-parameters the same for models with or without IGR.

Model Loss function
AT ()
Table 1: A summary of loss functions used in AT, TRADES and MART, and added with IGR

6.2 Evaluation on Attribution Robustness

                                  MNIST               Fashion-MNIST       CIFAR-10
Model                             top-k     Kendall   top-k     Kendall   top-k     Kendall
Standard                          32.21%    0.0955    42.83%    0.1884    46.71%    0.1662
IG-NORM (Chen et al., 2019)       36.13%    0.1562    51.84%    0.3446    74.49%    0.5811
IG-SUM-NORM (Chen et al., 2019)   41.53%    0.2240    57.27%    0.4097    78.70%    0.6901
AdvAAT (Ivankay et al., 2020)     51.74%    0.3791    73.62%    0.5810    72.11%    0.5484
ART (Singh et al., 2020b)         30.38%    0.1439    31.71%    0.2079    70.44%    0.6875
SSR (Wang et al., 2020)           38.77%    0.1650    60.40%    0.4321    71.20%    0.5498
AT (Madry et al., 2018)           34.35%    0.1846    32.00%    0.1516    72.21%    0.5578
AT+IGR                            33.40%    0.1582    53.36%    0.3750    73.37%    0.5775
TRADES (Zhang et al., 2019)       36.37%    0.2127    57.01%    0.2582    78.28%    0.6903
TRADES+IGR                        56.13%    0.4537    80.62%    0.6565    80.26%    0.6940
MART (Wang et al., 2019)          32.50%    0.1261    58.57%    0.4262    76.11%    0.6192
MART+IGR                          37.34%    0.1854    57.97%    0.4317    76.56%    0.6328
Table 2: Attribution robustness of models trained by different defense methods under IFIA (top-k).

To evaluate our method under attribution attack, the iterative feature importance attack (IFIA) using top-k intersection as the dissimilarity function (top-k) (Ghorbani et al., 2019) is adopted. IFIA generates perturbations by iteratively maximizing the dissimilarity function that measures the changes between attributions of images, while keeping the classification results unchanged. In this experiment, we perform 200-step IFIA as in Chen et al. (2019). For MNIST and Fashion-MNIST, we choose and the perturbation size . For CIFAR-10, and . Two metrics are chosen to evaluate the performance under attribution attack as in Chen et al. (2019): top-k intersection and Kendall’s rank correlation, where top-k intersection counts the proportion of pixels that coincide in the most important features. For both metrics, a higher number indicates that the model is more robust under the attack.
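The top-k intersection metric can be sketched in a few lines: take the indices of the k largest attribution values in each map and report the fraction that the two sets share. The function name and the toy vectors are chosen for this example, and ranking by raw attribution value (rather than, say, absolute value) is an assumption about the convention.

```python
import numpy as np

def topk_intersection(a, b, k):
    """Fraction of the k largest-valued features shared by a and b."""
    top_a = set(np.argsort(np.ravel(a))[-k:].tolist())
    top_b = set(np.argsort(np.ravel(b))[-k:].tolist())
    return len(top_a & top_b) / k

a = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.0])
b = np.array([0.8, 0.9, 0.1, 0.0, 0.7, 0.2])
score_same = topk_intersection(a, a, k=3)   # identical maps share all top features
score_ab = topk_intersection(a, b, k=3)     # two of the three top features coincide
print(score_same, score_ab)
```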

For comparison, the attribution protection methods IG-NORM and IG-SUM-NORM (Chen et al., 2019), Smooth Surface Regularization (SSR) (Wang et al., 2020), Attributional Robustness Training (ART) (Singh et al., 2020b) and Adversarial Attributional Training with robust training loss (AdvAAT) (Ivankay et al., 2020) are implemented and evaluated on all the datasets. A natural model trained with the cross-entropy loss (Standard) is also included as a baseline. The details of these baseline methods are briefly introduced in Appendix B.2.

From the results in Table 2, we observe the following phenomena. (i) Compared with the baseline methods (AT, TRADES and MART), models trained with IGR outperform their counterparts in terms of both top-k intersection and Kendall’s rank correlation in most settings; the exceptions are AT+IGR on MNIST and the top-k intersection of MART+IGR on Fashion-MNIST. (ii) Adversarial defense methods themselves also help attribution protection, with the clearest improvements over standard cross-entropy training on Fashion-MNIST and CIFAR-10. (iii) Compared with other attribution protection methods, standard adversarial defense methods, including AT, TRADES and MART, are weaker in attribution robustness; however, they achieve comparable or even stronger attribution protection when trained with IGR. (iv) TRADES itself has the best attribution protection among the models without IGR in most cases, and IGR provides the most significant boost on TRADES.

A visualization of attribution robustness is also presented in Figure 3. It is observed that the attribution of the baseline model is easily corrupted. For the model trained with IGR, although the magnitudes of IG differ from those of the original images, the directions remain nearly identical, which is also aligned with human visual perception.

Figure 3: IGR improves attribution robustness. The first column is the IG of the original image. The last two columns are IG of adversarial examples on a baseline model and the baseline model trained with IGR. Both baseline and baseline+IGR models make the correct classifications, while only baseline+IGR protects the model interpretations.
         MNIST                           Fashion-MNIST                   CIFAR-10
Model    Natural FGSM    PGD20   CW      Natural FGSM    PGD20   CW      Natural FGSM    PGD20   CW
AT       99.43   99.39   99.25   99.24   89.75   78.92   74.69   74.65   73.09   69.42   37.56   45.28
+IGR     99.51   99.45   99.32   99.32   80.98   79.20   76.79   76.31   73.69   70.40   38.21   46.70
TRADES   99.40   99.36   99.21   99.19   78.82   77.66   75.94   75.58   81.33   79.15   55.02   52.40
+IGR     99.40   99.40   99.26   99.24   80.61   79.05   76.89   76.44   81.65   79.54   54.65   52.43
MART     99.39   99.29   99.09   99.08   79.43   79.36   77.91   77.49   78.97   77.19   56.05   50.99
+IGR     99.51   99.39   99.28   99.24   81.51   82.13   79.93   79.01   79.27   77.35   56.47   51.11
Table 3: Adversarial accuracy (%) of CNN trained by different defense methods on MNIST, Fashion-MNIST and CIFAR-10.

6.3 Evaluation on White-Box Adversarial Robustness

Figure 4: Activation consistency of baseline models and IGR models on (a) MNIST, (b) Fashion-MNIST and (c) CIFAR-10.

To evaluate the performance of IGR on adversarial robustness, the trained defense models are evaluated under white-box adversarial attacks, including FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018) and CW (Carlini & Wagner, 2017), where all the attacks have the information of the entire models, including architectures and parameters. The numbers are reported in both natural accuracy (Natural) and adversarial accuracies under FGSM, PGD with 20 and 40 steps (PGD20, PGD40), and CW attacks. The maximum allowable perturbations are chosen to be for MNIST and Fashion-MNIST, and for CIFAR-10. The white-box adversarial accuracy results, as well as natural accuracy are reported in Table 3.

As shown in Table 3, defense methods trained with IGR achieve higher accuracies than their corresponding baseline methods under all three types of attacks, except for TRADES+IGR under PGD20. In the meantime, classification accuracies on natural images are also improved in seven out of nine evaluations. This suggests that training with IGR improves adversarial accuracy without sacrificing natural accuracy. Although IGR is designed for attribution protection, these improvements can be considered a side effect of IGR.

6.4 Evaluation on Activation Consistency

This section reports the experimental results that verify the claim in Section 5.2 — IGR encourages that the network activates the same set of neurons for natural and perturbed samples and . During the experiments, all the pre-activation values are recorded and used to compute the proportion of nonnegative values. Thus, the activation consistency defined on the right-hand side of Eq. (1) can be numerically computed.

Figure 4 compares the activation consistency of the baselines and the corresponding models trained with IGR. It is noticed that, for all the datasets, the activation consistency of models trained with IGR is consistently greater than that of the corresponding baseline models, which verifies our theory in Section 5.2. Moreover, as reported in Table 2, the improvements of AT+IGR on MNIST and MART+IGR on Fashion-MNIST are not as significant as the others. This is also reflected in the activation consistency, which improves only slightly from 0.40 to 0.43 for AT+IGR on MNIST and from 0.67 to 0.69 for MART+IGR on Fashion-MNIST, while TRADES+IGR, which boosts attribution robustness the most, also increases the most in activation consistency.

7 Conclusions

In order to leverage the non-differentiable Kendall’s rank correlation for attribution protection, this paper starts with a theorem indicating the positive correlation between cosine similarity and Kendall’s rank correlation. We then introduce a geometric perspective to explain the shortcomings of ℓ_p-norm based attribution defense methods and propose the integrated gradients regularizer to improve attribution robustness. It is discovered that IGR encourages networks to activate the same set of neurons for natural and perturbed samples, which explains the rationale behind its performance against attribution attacks. Finally, experiments show that IGR can be combined with adversarial objective functions, which simultaneously minimizes the angle between attribution vectors for attribution robustness and maximizes their magnitude to offer standard adversarial protection.


  • Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
  • Boopathy et al. (2020) Akhilan Boopathy, Sijia Liu, Gaoyuan Zhang, Cynthia Liu, Pin-Yu Chen, Shiyu Chang, and Luca Daniel. Proper network interpretability helps adversarial robustness in classification. In International Conference on Machine Learning, pp. 1014–1023. PMLR, 2020.
  • Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE Computer Society, 2017.
  • Chen et al. (2019) Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, and Somesh Jha. Robust attribution regularization. In Advances in Neural Information Processing Systems, pp. 14300–14310, 2019.
  • Dombrowski et al. (2019) Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp. 13589–13600, 2019.
  • Engstrom et al. (2019) Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019.
  • Ghorbani et al. (2019) Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3681–3688, 2019.
  • Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Heo et al. (2019) Juyeon Heo, Sunghwan Joo, and Taesup Moon. Fooling neural network interpretations via adversarial model manipulation. In Advances in Neural Information Processing Systems, pp. 2925–2936, 2019.
  • Ivankay et al. (2020) Adam Ivankay, Ivan Girardi, Chiara Marchiori, and Pascal Frossard. Far: A general framework for attributional robustness. arXiv preprint arXiv:2010.07393, 2020.
  • Kendall (1948) Maurice George Kendall. Rank correlation methods. 1948.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • Kurakin et al. (2017) Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJm4T4Kgx.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Lorente et al. (2021) Maria Paz Sesmero Lorente, Elena Magán Lopez, Laura Alvarez Florez, Agapito Ledezma Espino, José Antonio Iglesias Martínez, and Araceli Sanchis de Miguel. Explaining deep learning-based driver models. Applied Sciences, 11(8):3321, 2021.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Papernot et al. (2016a) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pp. 372–387. IEEE, 2016a.
  • Papernot et al. (2016b) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pp. 582–597. IEEE, 2016b.
  • Sarkar et al. (2021) Anindya Sarkar, Anirban Sarkar, and Vineeth N Balasubramanian. Enhanced regularizers for attributional robustness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 2532–2540, 2021.
  • Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145–3153. PMLR, 2017.
  • Simonyan et al. (2014) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014.
  • Singh et al. (2020a) Amitojdeep Singh, Sourya Sengupta, and Vasudevan Lakshminarayanan. Explainable deep learning models in medical image analysis. Journal of Imaging, 6(6):52, 2020a.
  • Singh et al. (2020b) Mayank Singh, Nupur Kumari, Puneet Mangla, Abhishek Sinha, Vineeth N Balasubramanian, and Balaji Krishnamurthy. Attributional robustness training using input-gradient spatial alignment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pp. 515–533. Springer, 2020b.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328. PMLR, 2017.
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. PMLR, 2013.
  • Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • Wang et al. (2019) Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019.
  • Wang et al. (2020) Zifan Wang, Haofan Wang, Shakul Ramkumar, Piotr Mardziel, Matt Fredrikson, and Anupam Datta. Smoothed geometry for robust attribution. In Advances in Neural Information Processing Systems, volume 33, pp. 13623–13634. Curran Associates, Inc., 2020.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Zeiler & Fergus (2014) Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
  • Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. PMLR, 2019.
  • Zintgraf et al. (2017) Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. In International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017. URL https://openreview.net/forum?id=BJ5UeU9xx.

Appendix A Proofs

A.1 Proof of Theorem 1

See Theorem 1 in the main text.


To prove this theorem, we show that the property holds for a single operation, which indicates that each of the above operations preserves the order of Kendall’s rank correlation. The case of several successive operations then follows by mathematical induction.

Since and are two arbitrary vectors, it is safe to fix that . To analyse the cosine similarities and Kendall’s rank correlation, can be assumed to be in descending order, i.e., , since the order of and can be changed correspondingly without affecting the cosine similarities and Kendall’s rank correlation. Formally, we show in the following proof that for a random vector

following exponential distribution and an arbitrary vector

, if the cosine similarities satisfy that , then their corresponding Kendall’s rank correlations have the property that , where is generated from by (1) exchanging two entries and (2) scalar multiplications.

(1) Preservation under exchanging

Following the assumption, we consider a random vector where , and another vector . We define the new vector that is produced by arbitrarily exchanging two entries in . Suppose we exchange the -th and -th entry in , where , then . We further assume both and are normalized, i.e., .

Now if we consider the cosine similarity and the assumption that , we then have


which can be simplified as


For Kendall’s rank correlation, we denote that , and . We notice that the difference between and only occurs when and are involved. We can write down the explicit expression of


Since , we then have




(2) Preservation under scalar multiplication

We use the assumptions mentioned before that , and . Without loss of generality, we multiply by a scalar , such that , i.e., .

To compare and , it is noticed that, under our assumptions, only the sign of is needed as other terms in are equal for and ,


Under the condition that the cosine similarity , we have


Note that . For simplicity, we denote that , and it is obvious that , where is close to 1 when is large,


which can be relaxed as




where . Thus, we want to show that


We consider two cases when and . In the case that , it is obvious that


If ,


Thus, after combining the above two cases, we have


which concludes our proof.
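
The exchange case can also be checked numerically. The sketch below is only a sanity check under stated assumptions, not part of the proof: it draws an exponentially distributed vector sorted in descending order and a second vector with Gaussian entries (a hypothetical choice), swaps two entries of the second vector, and counts how often the change in cosine similarity and the change in Kendall’s rank correlation have opposite signs. Kendall’s rank correlation is computed in its tie-free form, and all function names are illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def kendall_tau(a, b):
    """Tie-free Kendall's tau: mean of sign(a_i - a_j) * sign(b_i - b_j) over i < j."""
    upper = np.triu_indices(len(a), k=1)
    sign_a = np.sign(a[:, None] - a[None, :])[upper]
    sign_b = np.sign(b[:, None] - b[None, :])[upper]
    return float(np.mean(sign_a * sign_b))


def swap_entries(b, i, j):
    """Return a copy of b with the i-th and j-th entries exchanged."""
    out = b.copy()
    out[[i, j]] = out[[j, i]]
    return out


def count_order_violations(n=8, trials=2000, seed=0):
    """Count trials where cosine similarity and Kendall's tau move in opposite directions."""
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(trials):
        a = np.sort(rng.exponential(size=n))[::-1]  # random vector, descending order
        b = rng.normal(size=n)                      # arbitrary vector (assumed Gaussian here)
        i, j = sorted(rng.choice(n, size=2, replace=False))
        b_swapped = swap_entries(b, i, j)
        d_cos = cosine_similarity(a, b) - cosine_similarity(a, b_swapped)
        d_tau = kendall_tau(a, b) - kendall_tau(a, b_swapped)
        if d_cos * d_tau < 0:
            violations += 1
    return violations


print(count_order_violations())
```

With continuous random entries, ties occur with probability zero, so the returned count should be zero: a swap that lowers the cosine similarity also lowers Kendall’s rank correlation, and vice versa.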

A.2 Proof of Proposition 1

See Proposition 1 in the main text.


Recall that is an input image, and the network function is parameterized by , where is the column vector of , is the -th entry of matrix and is the -th entry of vector , i.e., .
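
To make this setting concrete, the sketch below instantiates a small network of this kind and computes its gradient and integrated gradients (IG). It assumes a bias-free two-layer ReLU network with scalar output, f(x) = v^T ReLU(W^T x), an all-zero baseline, and a midpoint Riemann sum for the path integral; the parameter shapes, function names, and random data are hypothetical and may not match the exact parameterization used in the proof.

```python
import numpy as np


def network(x, W, v):
    """Assumed bias-free two-layer ReLU network: f(x) = v^T relu(W^T x)."""
    return float(v @ np.maximum(W.T @ x, 0.0))


def gradient(x, W, v):
    """Input gradient: sum_j v_j * 1[w_j^T x >= 0] * w_j, with w_j the j-th column of W."""
    active = (W.T @ x >= 0.0).astype(float)
    return W @ (v * active)


def integrated_gradients(x, W, v, baseline=None, steps=64):
    """IG approximated with a midpoint Riemann sum along the straight path baseline -> x."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([gradient(baseline + a * (x - baseline), W, v) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d, m = 16, 32                                # input dimension, hidden width (toy sizes)
W = rng.normal(size=(d, m))
v = rng.normal(size=m)
x = rng.uniform(size=d)                      # toy "image"
x_perturbed = x + 0.01 * rng.normal(size=d)  # small hypothetical perturbation

ig = integrated_gradients(x, W, v)
ig_perturbed = integrated_gradients(x_perturbed, W, v)
print(cosine_similarity(ig, ig_perturbed))
```

Because the network has no bias and the baseline is zero, every point on the interpolation path is a positive multiple of x, so the activation pattern, and hence the gradient, does not change along the path. IG then reduces to x multiplied elementwise by the gradient at x.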

Following the above notations, we first write the function as


where denotes the indicator function, and its gradient

We assume the bias terms are zeros without loss of generality, i.e., , and approximate the cosine similarity of IG using the low contrast assumption that as


Since is close to 0 in high dimensional space when , we approximate the above expression as


Notice that the indicator function is integrated only along the interpolation path, which does not affect the signs of the pre-activations of the natural and perturbed samples, i.e., the activation states. This implies that the activation states are the same for all samples on the path from the baseline to the corresponding image. Thus, we can write the cosine similarity as


Since and are independent random variables following Gaussian distributions, i.e., and , when is sufficiently large, we have


The cosine similarity is then transformed into expectations


Based on the assumption on Gaussian distribution, we have and , and