Deep learning models have achieved superhuman performance in a range of activities from image recognition to complex games (lecun2015deep; silver2017mastering). Unfortunately, these gains have come at the expense of model interpretability, with massive, overparametrized models being used to achieve state-of-the-art results. This is a salient problem when deep learning is applied to domains such as healthcare (miotto2018deep), criminal justice (li2018deep), and finance (huang2020deep), where a prediction needs to be explainable to the user, leading to a surge in interest in tools that can illuminate the underlying decision making process of deep learning models.
Besides being inherently black-box in nature, deep learning models have also been shown to be vulnerable to adversarial attacks where small perturbations to model input result in dramatic changes to model output (szegedy2013intriguing). This phenomenon is concerning when deep learning tools are deployed in safety-critical environments. A range of approaches have been developed to improve a model’s robustness to adversarial attack (silva2020opportunities), including the use of explainability methods to detect adversarial examples (zhang2021ggcad; wang2021adversarial). But if explainability methods are an important component in a machine learning system, then the robustness of these methods is nearly as important as the robustness of the model itself. In this paper we explore the vulnerability of concept-based interpretability methods, that is, methods that interrogate a model and its decisions based on a concept.
Concept-based interpretability methods usually rely on a user-provided collection of positive examples (tokens) of a concept. While this flexibility makes these methods an attractive approach for understanding deep learning models, it also introduces vulnerabilities, since corrupted tokens can result in misleading interpretations of a model’s decisions. We describe a threat model for concept-based interpretability methods, outlining adversary goals, knowledge, and capabilities. Then we introduce a family of attacks fitting this threat model, which we call token pushing (TP) attacks. These learn small perturbations that, when added to tokens of a concept, result in remarkably different output for the interpretability method. Specifically, we optimize our perturbations so that when they are added to a token, they significantly change a model’s internal representation of the input.
We test TP attacks against two popular concept-based interpretability methods: Testing with Concept Activation Vectors (TCAV) (kim2018interpretability) and Faceted Feature Visualization (FFV) (goh2021multimodal). While TCAV and FFV are similar in that they are both concept-based, their output is quite different. TCAV quantifies the extent to which a concept is important to a model’s prediction for a specific input dataset, while FFV visualizes how individual neurons represent specific aspects of a concept. We show that TP attacks are effective for both TCAV and FFV. For example, a TP attack causes TCAV to give output indicating that stripes are not an important feature to the class ‘zebra’. Similarly, a TP attack can radically change the feature visualizations generated by FFV (Figure 3).
We evaluate TP attacks on pretrained ImageNet models (deng2009imagenet; marcel2010torchvision) using the Describable Textures Dataset (cimpoi14describing) for concept tokens. Through our experiments we show that a TP attack does not require the adversary to know what interpretability method is being used: the same perturbations that cause TCAV to fail also cause FFV to fail. Finally, TP attacks possess moderate transferability, meaning that as long as a surrogate model is available, they can be applied even when the defender's model architecture is unknown.
In summary, our contributions in this paper include the following.
Formalization of an adversarial threat model for concept-based interpretability methods.
Introduction of TP attacks which perturb examples of a concept in such a way that concept-based interpretability methods give misleading output.
Demonstration of the effectiveness of TP attacks on two concept-based interpretability methods, TCAV and FFV.
2 Related work and background
Interpretability methods: Because of the size and complexity of modern deep learning architectures, skill is required to extract interpretations of how these models make decisions. Established methods range from those that focus on highlighting the importance of individual input features to those that can give clues to the importance of specific neurons to a particular class. Popular examples of interpretability methods that focus on input feature importance include saliency map methods (selvaraju2017grad; sundararajan2017axiomatic; ribeiro2016model) which rely on gradients to identify those input features (pixels in an image for example) whose change is most likely to change the network’s prediction.
Concept-based interpretability methods focus on decomposing the hidden layers of deep neural networks with respect to human-understandable concepts. One of the best known approaches in this direction involves the use of concept activation vectors (CAVs) (kim2018interpretability), which we describe in detail in the next section. Work that is related to or extends these ideas includes (zhou2018interpretable; graziani2018regression; graziani2019improved).
Feature visualization is a set of interpretability techniques (szegedy2014intriguing; mahendran2015understanding; wei2015understanding; nguyen2016multifaceted) concerned with optimizing model input so that it activates some specific node or set of nodes within the network. However, a challenge arises when one tries to analyze ‘poly-semantic neurons’ (olah2018the), neurons that activate for several conceptually distinct ideas. For example, a neuron that fires for both a boat and a cat leg is poly-semantic. Interpretability methods have imposed priors to disambiguate neurons by clustering the training images (wei2015understanding; nguyen2016multifaceted) or the hidden layer activations (carter2019activation) and using the average of the cluster as a coarse-grained image prior, parameterizing the feature visualization image with a learned GAN (nguyen2016synthesizing), or using a diversity term in the feature visualization objective (wei2015understanding).
Robustness of interpretability methods: This is not the first work that has shown that interpretability methods can be brittle. Saliency methods have been shown to produce output maps that appear to point to semantically meaningful content even when they are extracted from untrained models, indicating that these methods may sometimes simply function as edge detectors (adebayo2018local). From a more adversarial perspective, a number of works have shown that saliency methods are vulnerable to small perturbations made to either an input image or to the model itself that cause the model to offer radically different interpretations (heo2019fooling; ghorbani2019interpretation; viering2019manipulate; subramanya2019fooling). To our knowledge, this is the first work that shows that concept-based interpretability methods are also vulnerable to adversarial attack.
2.1 TCAV and linear interpretability
In this section we describe the method of testing with concept activation vectors (TCAV) (kim2018interpretability). Let $f: X \to \mathbb{R}^d$ be a neural network which is composed of $n$ layers and designed for the task of classifying whether a given input $x \in X$ belongs to one of $d$ different classes. Write $f_l: X \to \mathbb{R}^{m_l}$ for the composition of the first $l$ layers, and let $h_l: \mathbb{R}^{m_l} \to \mathbb{R}^d$ be the composition of the last $n - l$ layers of the network, so that $f(x) = h_l(f_l(x))$ for any $x \in X$. Let $C$ be a concept for which we have a set of positive examples (tokens) $P_C$ and negative examples $N$, both belonging to $X$. These are represented in the $l$th layer of $f$ as the points $f_l(P_C) = \{f_l(x) : x \in P_C\}$ and $f_l(N) = \{f_l(x) : x \in N\}$ respectively. One can apply a binary linear classifier to separate these two sets of points, resulting in a hyperplane in $\mathbb{R}^{m_l}$. This hyperplane can be represented by two unit normal vectors. We choose the one, $v_C^l$, that points into the region corresponding to the points $f_l(P_C)$. The vector $v_C^l$ is called the concept activation vector in layer $l$ associated with concept $C$. One can think of $v_C^l$ as the vector that points toward $C$-ness in the $l$th layer of the network.
Let $h_{l,k}$ denote the $k$th output coordinate of $h_l$, corresponding to class $k$. In the classification setting, $h_{l,k}(f_l(x))$ then represents the model’s confidence that input $x$ belongs to class $k$. To better understand the extent to which concept $C$ influences the model’s confidence of $x$ belonging to class $k$ we compute:
$$S_{C,k,l}(x) = \nabla h_{l,k}(f_l(x)) \cdot v_C^l.$$
A positive value of $S_{C,k,l}(x)$ indicates that increasing $C$-ness of $x$ makes the model more confident that $x$ belongs to class $k$. The magnitude TCAV score for a dataset $D$ is computed from the values $S_{C,k,l}(x)$ for $x \in D_k$, where $D_k$ is the subset of $D$ consisting of all instances predicted as belonging to class $k$. We compare the concept magnitude with the TCAV magnitudes for random images in the layer, and use a standard two-sided $t$-test to test for significance. We can also compute the relative TCAV score, which replaces the set of negative natural images in $N$ with images representing a distinct concept. An example experiment and attack using the relative TCAV score is in the Appendix.
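As a rough illustration of the pipeline described in this section, the following sketch fits a linear probe on toy hidden-layer activations, takes its unit normal as the concept activation vector, and computes directional derivatives of a class logit along it. All names (`train_linear_probe`, `sensitivity`, `sign_count_score`) and the toy activations and gradients are assumptions made for illustration. The aggregate shown is the sign-count TCAV score of kim2018interpretability (the fraction of positive directional derivatives), not the magnitude variant used in this paper.

```python
import math
import random

def train_linear_probe(pos, neg, steps=500, lr=0.1, seed=0):
    """Logistic regression trained by SGD on hidden-layer activations.
    Returns the unit normal of the separating hyperplane, oriented toward `pos`."""
    rng = random.Random(seed)
    w, b = [0.0] * len(pos[0]), 0.0
    data = [(a, 1.0) for a in pos] + [(a, 0.0) for a in neg]
    for _ in range(steps):
        a, y = rng.choice(data)
        z = sum(wi * ai for wi, ai in zip(w, a)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        w = [wi - lr * (p - y) * ai for wi, ai in zip(w, a)]
        b -= lr * (p - y)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def sensitivity(grad_logit, cav):
    """Directional derivative of the class logit along the concept vector."""
    return sum(g * v for g, v in zip(grad_logit, cav))

def sign_count_score(sensitivities):
    """Fraction of inputs whose directional derivative is positive."""
    return sum(s > 0 for s in sensitivities) / len(sensitivities)

# Toy layer-l activations for concept tokens and for negative examples.
pos_acts = [[2.0, 0.5], [2.5, -0.5], [1.5, 0.0]]
neg_acts = [[-2.0, 0.5], [-2.5, -0.5], [-1.5, 0.0]]
cav = train_linear_probe(pos_acts, neg_acts)

# Hypothetical gradients of the class-k logit at the activations of four inputs.
class_grads = [[1.0, 0.2], [0.5, -0.1], [-1.0, 0.1], [0.8, 0.0]]
scores = [sensitivity(g, cav) for g in class_grads]
tcav = sign_count_score(scores)
```

In a real setting the activations and gradients would come from the network itself (for example via autograd), and the probe would be whatever linear classifier the TCAV implementation uses.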
2.2 Faceted Feature Visualization
goh2021multimodal introduced a new concept-based feature visualization objective for neuron-level interpretability, Faceted Feature Visualization (FFV). The objective disambiguates poly-semantic neurons by imposing a prior towards a linear concept in the optimization objective. FFV also utilizes a set of positive and negative examples of a concept ($P_C$ and $N$ respectively). Similar to the TCAV method, one trains a binary linear classifier on the image of these two sets under the layer-$l$ feature map $f_l$ to obtain a concept vector $v_C^l$. To visualize output that tends to activate a neuron at layer $l$, position $i$, while at the same time steering the visualization toward a specific context, the authors solve the following optimization problem:
where $\odot$ is the Hadamard product. Note that the first term helps find inputs $x$ which result in a strong activation of the chosen neuron, while the second term finds $x$ whose representation tends to point in the direction of $v_C^l$.
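To make the two terms concrete, here is a small sketch of a faceted objective of the form we understand goh2021multimodal to use: the neuron's activation plus the concept vector dotted with the Hadamard product of the hidden activations and the gradient of the neuron with respect to those activations. This form is an assumption on our part, and the function name `faceted_objective`, the linear readout `u`, and all numbers are hypothetical.

```python
def faceted_objective(acts, neuron_value, neuron_grad, cav):
    """neuron_value + cav . (acts * neuron_grad), with * taken elementwise."""
    facet_term = sum(v * a * g for v, a, g in zip(cav, acts, neuron_grad))
    return neuron_value + facet_term

# Toy setup: the target neuron is a linear readout u . acts of the hidden layer,
# so its gradient with respect to the hidden activations is simply u.
acts = [1.0, 2.0, 0.5]
u = [0.5, -1.0, 2.0]
cav = [0.0, 1.0, 0.0]          # concept direction in the hidden layer

neuron_value = sum(ui * ai for ui, ai in zip(u, acts))
objective = faceted_objective(acts, neuron_value, u, cav)
```

In practice the input image would be optimized by gradient ascent on this quantity, and a corrupted concept vector changes the second term for every candidate image.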
3 Adversarial attacks on interpretability
In this section we describe a family of adversarial attacks on concept-based model interpretability methods. An adversarial attack (goodfellow1) on a model $f$ is a small perturbation $\delta$ that, when applied to a specific input $x$, results in large changes to model prediction $f(x + \delta)$. The meaning of ‘small’ is usually specified by a metric such as an $\ell_p$-norm and can either be a hard or soft constraint. In this work we use projected gradient descent (PGD) (madry2018towards) to construct our attacks, though other optimization approaches could also be used.
We frame the notion of a concept-based interpretability method abstractly in order to better understand its attack surface. We view such a method as a map that takes (1) a model, (2) positive tokens of the concept toward which we would like to steer our interpretation, (3) negative tokens of the concept, and (4) an interpretation target which will be the focus of the interpretation. We call the output of an interpretability method an interpretation object. An interpretation object might be a single scalar value (as in the case of TCAV), or it may be an image (as in the case of FFV). In all cases, an interpretation object is designed to help the user better understand a model’s decision-making process. Thus, we can understand an interpretability method as a function $I: \mathcal{M} \times \mathcal{P} \times \mathcal{N} \times \mathcal{T} \to \mathcal{O}$, where $\mathcal{M}$ is the collection of models that can be interpreted, $\mathcal{P}$ is the space of all possible positive token sets, $\mathcal{N}$ is the space of all possible negative token sets, $\mathcal{T}$ is the space of interpretation targets, and $\mathcal{O}$ is the space of interpretation objects that the method produces. We note that in the case of TCAV, the interpretation target is a dataset of examples of some class $k$, while the interpretation target of FFV is a specific node position in the model.
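The abstraction above can be written down as a type signature. The sketch below is only an illustration: the alias names and `toy_method` are hypothetical, and the toy method is a deliberately trivial stand-in that returns a scalar interpretation object, not TCAV or FFV.

```python
from typing import Any, Callable, Sequence

Model = Callable[[Any], Any]
TokenSet = Sequence[Any]
InterpretationTarget = Any
InterpretationObject = Any

# An interpretability method takes (model, positive tokens, negative tokens,
# interpretation target) and returns an interpretation object.
InterpretabilityMethod = Callable[
    [Model, TokenSet, TokenSet, InterpretationTarget], InterpretationObject
]

def toy_method(model, positives, negatives, target):
    """Hypothetical scalar-valued method: the fraction of target inputs whose
    model output lies closer to the positive tokens' mean output than to the
    negative tokens' mean output."""
    pos_mean = sum(model(x) for x in positives) / len(positives)
    neg_mean = sum(model(x) for x in negatives) / len(negatives)
    closer = [abs(model(x) - pos_mean) < abs(model(x) - neg_mean) for x in target]
    return sum(closer) / len(closer)

method: InterpretabilityMethod = toy_method
result = method(lambda x: 2 * x, [1, 2, 3], [-1, -2, -3], [0.5, 1.5, -1.0, 2.0])
```

Viewed this way, each of the four arguments is a separate place where an adversary could intervene; the attack studied here touches only the positive token set.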
3.1 A threat model for attacks on concept-based interpretability methods
Following a suggestion given in (carlini2019evaluating), we state the threat model that we will consider in this paper. Since we will only be considering images as input in our experiments, we specialize to that setting here. Otherwise, we use the formalism that we developed above. Specifically, we assume there exists an interpretability method $I$, a model $f$, a set of positive image tokens $P_C$, a set of negative image tokens $N$, and an interpretation target $T$. We also assume a function $d: \mathcal{O} \times \mathcal{O} \to \mathbb{R}_{\ge 0}$ that quantitatively captures meaningful difference between interpretation objects.
Adversary’s goal: The adversary’s goal is to find perturbations $\{\delta_x\}_{x \in P_C}$ such that $P_C' = \{x + \delta_x : x \in P_C\}$ maximizes the value of $d(I(f, P_C, N, T), I(f, P_C', N, T))$. That is, the change from $P_C$ to $P_C'$ maximizes the difference in interpretation as measured by $d$. In order to avoid detection, each $\delta_x$ is subject to the constraint $\|\delta_x\| \le \epsilon$, for some fixed $\epsilon > 0$.
Adversary knowledge and capabilities:
1. In this paper we assume that the adversary has read and write access to the tokens $P_C$ either before or after they have been collected.
2. We do not assume that the adversary has access to the interpretation target $T$ (either the dataset of examples predicted as belonging to class $k$ in the case of TCAV or the specific neuron position that is being targeted in the case of FFV). We do assume that the adversary knows the hidden layer $l$ that is being targeted for both TCAV and FFV.
3. We assume that the adversary has read access to at least a surrogate model trained on the same dataset as $f$. We do not assume that this surrogate model needs to have the same architecture as $f$.
4. Finally, we do not assume that the adversary knows the interpretability method $I$ that will be used.
The adversary’s goal is framed in terms of a function $d$ that depends on the specific interpretability method. This might seem to be in conflict with assumption 4, which says that the adversary does not have knowledge of the interpretability method being used. In fact, we show that TP attacks, which we propose below, work for the choices of $d$ specific to both TCAV and FFV simultaneously, by optimizing for an objective function that disrupts the fundamental mechanism by which TCAV, FFV, and other concept-based interpretability methods work.
3.2 Attacking tokens of a concept
In this section we introduce the token pushing (TP) attack. The basic idea is simple: we find perturbations that significantly alter a model’s internal representation of the concept tokens $P_C$. Using the notation developed in 3.1, we assume that the adversary has access to a copy of the defender’s model (or a surrogate model) $f$, the hidden layer $l$ that the interpretation method will use, and write access to the set of tokens $P_C$ that represent a concept $C$.
The perturbation added to each element $x$ in $P_C$ shifts its hidden representation in layer $l$ so that it no longer correlates with concept $C$. In order to find a point that can guide this shift, the first step is for the adversary to choose some collection of images $U$ that are unrelated to $C$. The adversary calculates the centroid of $f_l(U)$, which we denote by $c_U$ and which will serve as a representative of “unrelatedness” to $C$. Then for each $x \in P_C$, the adversary uses PGD to compute
$$\delta_x = \operatorname{argmin}_{\|\delta\| \le \epsilon} \, \| f_l(x + \delta) - c_U \|. \qquad (3)$$
This is related to the hidden layer attack that is described in (wang2018great). A schematic of the TP attack can be found in Figure 1. An example of the perturbation can be found in Figure 7 in the Appendix.
In Section 4, we show that in spite of the fact that Equation 3 is the interpretability objective of neither TCAV nor FFV, it is still effective when applied to either method. In fact, objective function 3 makes the TP attack more flexible since it acts against the underlying mechanism common to both these and other interpretability methods: the spatial proximity of hidden representations of input that are semantically related. This means that the adversary does not need to know the specific interpretability method that the defender is using. This also means that the adversary does not need access to the interpretation target, as they would if they were to optimize against the interpretability objective directly.
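The token pushing procedure can be sketched on a toy problem as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden layer is a fixed linear map `W` so the gradient can be written by hand, the perturbation budget is an L-infinity ball of radius `eps`, the distance is squared Euclidean, and the names `tp_attack`, `centroid`, and `c_U` as well as all numeric values are hypothetical. Only the use of PGD with 20 steps comes from the text.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def tp_attack(W, x, target, eps, step, n_steps=20):
    """Projected gradient descent on ||W(x + delta) - target||^2 over the
    L-infinity ball of radius eps (signed-gradient steps, then clipping)."""
    delta = [0.0] * len(x)
    for _ in range(n_steps):
        hidden = matvec(W, [xi + di for xi, di in zip(x, delta)])
        resid = [hi - ti for hi, ti in zip(hidden, target)]
        # Gradient with respect to delta is 2 * W^T * resid.
        grad = [2.0 * sum(row[j] * r for row, r in zip(W, resid))
                for j in range(len(x))]
        signs = [(g > 0) - (g < 0) for g in grad]
        # Descent step followed by projection back into the eps-ball.
        delta = [max(-eps, min(eps, d - step * s)) for d, s in zip(delta, signs)]
    return delta

W = [[1.0, 0.5], [0.0, 1.0]]                           # toy "first l layers"
token = [1.0, 1.0]                                     # one concept token
unrelated = [[-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]]   # images unrelated to the concept
c_U = centroid([matvec(W, u) for u in unrelated])      # centroid in hidden space

delta = tp_attack(W, token, c_U, eps=0.1, step=0.02)
before = sq_dist(matvec(W, token), c_U)
after = sq_dist(matvec(W, [t + d for t, d in zip(token, delta)]), c_U)
```

With a real network the hand-written gradient would be replaced by automatic differentiation through the first $l$ layers; nothing in the loop refers to TCAV or FFV, which is the point made in the paragraph above.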
To better understand the effectiveness of the methods proposed in Section 3.2, we apply our attacks to TCAV and FFV in the setting where these are used to interrogate an InceptionV1 model (szegedy2015going) that has been trained on ImageNet-1k (deng2009imagenet). We choose InceptionV1 because it is a model commonly used in the interpretability literature (kim2018interpretability; olah2020an) and choose ImageNet-1k since it is easy to obtain high-quality weights for this model/dataset combination. In our case, we used the pretrained weights from Torchvision (marcel2010torchvision). The token sets that we used to capture concepts come from the Describable Textures Dataset (DTD) (cimpoi14describing). We perform all PGD attacks with a fixed perturbation budget and 20 steps. We use the Captum (kokhlikyan2020captum) implementation of TCAV with a linear classifier trained via stochastic gradient descent with regularization. See the Appendix for more experiment details.
[Table 1: row labels "Baseline TCAV (no attack)" and "PGD attack on" (targeted layer); numerical entries not recoverable.]
To test the TP attack against TCAV, we choose two concept/class pairs with straightforward associations: ‘stripes’/‘zebra’ and ‘honeycombed’/‘honeycomb’. We select several sets of randomly chosen images from ImageNet which do not intersect, to serve as negative sets. The same negative sets are used for both concept/class pairs. We also fix a set of unrelated images that are also randomly sampled from ImageNet. Finally, we choose random sets of images from the classes ‘stripes’ and ‘honeycombed’ from DTD to serve as positive tokens. The interpretation target is a collection of images which the InceptionV1 model predicts as ‘zebra’ or ‘honeycomb’ respectively.
For each layer of InceptionV1 we run the TP attack against the ‘stripes’ and ‘honeycombed’ token sets. For each of the resulting pairs of clean and attacked token sets and each layer of InceptionV1, we then apply TCAV once for each negative set, calculating the difference in magnitude TCAV score between the clean and the attacked tokens. Numerical results for ‘stripes’/‘zebra’ can be found in Table 1. The larger the value, the greater the change in magnitude TCAV score before and after the attack. Plots of the raw TCAV magnitude scores for both the clean positive tokens and the attacked positive tokens (where each attack targeted a different layer of InceptionV1) are found in Figure 2. Sample stripe images before and after the attack can be found in the Appendix.
We include confidence intervals for each layer based on the different negative sets. The point of this is to verify that the result does not depend on having the “right” negative examples. To test that the attack perturbations work for reasons other than the fact that they are perturbations, we also apply TCAV to positive token sets to which we have added random Gaussian noise with the per-channel mean and standard deviation of the PGD logit attack. Finally, note that we also include the results showing what happens when a TP attack targets a different hidden layer than the one TCAV is being applied to (these are the off-diagonal entries of Table 1).
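The Gaussian-noise control described above can be sketched as follows: measure the per-channel mean and standard deviation of an adversarial perturbation, then sample noise with matching statistics. The function names and the toy three-channel perturbation are hypothetical, and each channel is represented as a flat list of pixel offsets for simplicity.

```python
import random
import statistics

def channel_stats(perturbation):
    """Per-channel (mean, standard deviation) of a perturbation.
    `perturbation` is a list of channels, each a flat list of pixel offsets."""
    return [(statistics.fmean(ch), statistics.pstdev(ch)) for ch in perturbation]

def matched_gaussian_noise(perturbation, seed=0):
    """Sample Gaussian noise with the same shape and per-channel statistics."""
    rng = random.Random(seed)
    stats = channel_stats(perturbation)
    return [[rng.gauss(mu, sigma) for _ in ch]
            for ch, (mu, sigma) in zip(perturbation, stats)]

# Toy 3-channel perturbation with four "pixels" per channel.
adv = [[0.1, -0.1, 0.1, -0.1],
       [0.02, 0.04, 0.02, 0.04],
       [0.0, 0.0, 0.0, 0.0]]
stats = channel_stats(adv)
noise = matched_gaussian_noise(adv)
```

Adding `noise` instead of `adv` to each token gives a control whose size is comparable to the attack but whose direction is random.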
We evaluate the TP attack on FFV by performing feature visualizations for InceptionV1 on every channel neuron for the layers mixed3a, mixed3b, mixed4a, and mixed4b. We use the feature visualization objective in equation 2, and compare feature visualizations made with clean concept images, concept images with Gaussian noise, and a concept set with perturbations created by PGD on the respective hidden layer with equation 3. We give an example of FFV and the attack in the layer mixed4d in Figure 3.
We quantitatively test the effectiveness of the attack on FFV by using a variant of the Fréchet Inception Distance (FID) (heusel2017gans) as a measure of the distance between feature visualizations. Namely, we compare feature visualizations created with a channel objective (i.e., using only the first term in equation 2), faceted feature visualizations created with two sets of random images with clean stripe concepts, faceted feature visualizations with a set of stripe concepts with a perturbation created via targeting the layer mixed3b with equation 3, and faceted feature visualizations where we add Gaussian noise to the stripe concept set. The FID score is calculated across layers for every channel neuron in InceptionV1 layers mixed3a (256 channels), mixed3b (480 channels), mixed4a (512 channels), and mixed4b (512 channels), and is shown in Figure 4. We use a PyTorch implementation of FID (Seitzer2020FID) and use the second block of InceptionV3 as the visual similarity encoder (due to the smaller dataset size).
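For intuition about what a Fréchet-type distance measures, the sketch below computes the Fréchet distance between two sets of feature vectors under the simplifying assumption that each set is modeled by a Gaussian with diagonal covariance. In that case the distance reduces to the squared difference of means plus the squared difference of per-dimension standard deviations. The full FID uses full covariance matrices and a matrix square root, so this is a simplification rather than the metric used in the experiments; the function name and data are hypothetical.

```python
import statistics

def frechet_distance_diag(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets,
    assuming diagonal covariance (a simplification of FID)."""
    total = 0.0
    for dim_a, dim_b in zip(zip(*feats_a), zip(*feats_b)):
        mu_a, mu_b = statistics.fmean(dim_a), statistics.fmean(dim_b)
        sd_a, sd_b = statistics.pstdev(dim_a), statistics.pstdev(dim_b)
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total

# Toy encoder features for two sets of visualizations.
clean = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
shifted = [[x + 3.0, y] for x, y in clean]   # same spread, mean moved by 3

d_same = frechet_distance_diag(clean, clean)
d_shift = frechet_distance_diag(clean, shifted)
```

Comparing the distance between two clean runs with the distance between a clean run and an attacked run, as done in the experiments, separates run-to-run variation from the effect of the attack.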
Our results show that TP attacks effectively change the output of both TCAV and FFV from the baseline interpretability results. For TCAV, we can consistently lower the TCAV magnitude score that indicates the relative importance of a concept to an output class. In Table 1, we measure the TCAV magnitude score on four early layers of InceptionV1. For each run and layer, we take the average difference between the TCAV magnitude score for the striped concept set and a random concept set over 20 sets of random images. We note that, unsurprisingly, attack success tends to increase when the layer that an attack was developed for and the layer TCAV is being applied to are the same. However, we also find that the attack often remains effective even when these are not the same. For example, in Table 1, the attack targeting the layer ‘mixed4b’ is successful across all layers examined. We also observe this in Figure 2, where an attack targeting layer mixed4b for the honeycombed concept set effectively modifies the TCAV magnitude of honeycomb to honeycombed for the four layers examined.
For FFV, we can observe the TP attack effectiveness from the visual differences between 1) a channel feature visualization (i.e., a feature visualization that optimizes the first term in equation 2), 2) the faceted feature visualization with a clean concept set, and 3) the faceted feature visualization with a perturbed concept set. We give three such examples separately using the striped, dotted, and zig-zagged concept sets in Figure 3. We use FID as a measure of visual difference, and test the effectiveness of the TP attack on FFV for the 1,760 channel neurons in the InceptionV1 layers ‘mixed3a’, ‘mixed3b’, ‘mixed4a’, and ‘mixed4b’. We use the striped concept set and perform two separate FFV visualizations for each neuron with different sets of negative concept set images. Figure 4 shows that the FID score between the two separate clean FFV runs is smaller than the FID scores between the TP attack and each of the clean FFV runs. The significantly larger FID scores suggest that the TP attack modifies the FFV output more than the variation between runs. This, along with visualizations such as Figure 3, suggests that a TP attack can drastically change the semantic meaning associated with the feature visualizations produced by FFV.
Finally, we find that both the TCAV magnitudes (Table 1) and the FFV FID scores (Figure 4) are susceptible to Gaussian noise added to the concept set. This suggests that, even independent of adversarial attacks, concept-based interpretability methods are brittle. This brittleness likely means that these methods are also vulnerable to natural distribution shifts in data, e.g., between the concept set and training images. We see a need for research into robust interpretability methods.
As noted in 3.1, the knowledge required for an adversary to implement an attack is decreased significantly if they do not need to know the specific model being used by the defender. We therefore test the transferability of the TP attack by applying TCAV to an ImageNet-trained ResNet-18, before and after its concept tokens have been attacked using TP perturbations developed for an InceptionV1 model. We again consider the concept/class pair stripes/zebra with the same positive tokens, negative sets, and interpretation target that were used to generate Table 1. We compute the TCAV magnitude score for stripes/zebra for each of the four residual blocks in ResNet-18. Figure 5 compares a baseline score with scores for TP attacks applied to different layers.
Strikingly, we find that TP attacks targeting any of the layers of InceptionV1 result in significant decreases in TCAV magnitude score when applied to the first block of ResNet-18. We also see less significant decreases in TCAV magnitude score when TCAV is applied to Block 3 (with the exception of the ‘mixed3b’ TP attack, which actually increases the TCAV score). The transfer TP attack does not seem to be effective against Block 2 and Block 4. We believe effectiveness against the first block may be the result of similar base features being extracted in both networks. This of course does not explain why we thereafter see a decrease in magnitude score in Block 3. These results point toward the TP attack being moderately transferable, especially when TCAV is being applied to earlier layers of the defender’s model.
In this work we chose two concept-based interpretability methods to test TP attacks on. While TCAV and FFV capture some of the diversity of such methods, they do not capture the full breadth. In particular, it would be useful to understand how TP attacks behave when they are applied to other types of feature visualization methods, namely those that average over a large number of images or activations (nguyen2016multifaceted; carter2019activation) to build a visualization. Further, while we only consider image classification models, TCAV is agnostic to modality. Evaluating interpretability method brittleness in other critical modalities such as NLP would give a more complete picture of these methods’ vulnerabilities. Finally, the attack model that we focus on only considers perturbations to concept tokens. To fully understand the attack surfaces of concept-based interpretability methods it would make sense to look at attacks on the other inputs to the methods: the model itself, negative examples, and the interpretation targets. As a limited example, an adversarial attack may be designed to be ‘triggered’ for only certain token and dataset interpretation target combinations.
In this work we show that concept-based interpretability methods, like much of the deep learning modeling pipeline, are vulnerable to adversarial attacks. By subtly changing the examples of a concept that a user wishes to use to interrogate a model, an adversary can induce radically different interpretations. The attacks we describe are general enough that they work for multiple interpretability methods without modification (FFV and TCAV). We hope that the results of this paper will promote better security practices, not only around the model pipeline itself, but also around the method that is being used to interpret the model.
7 Reproducibility statement
In the interest of making our results reproducible and easy to build upon, we make our codebase available to the public, including our implementations of the centroid PGD attack and the faceted feature visualization we used. We also include attack and evaluation scripts with sensible defaults and examples. Finally, we provide the data used throughout this paper, including our feature visualizations. The entire codebase will be made available in a public GitHub repository once the anonymous review period has completed.
Appendix A
A.1 Experiment details
To run TCAV, FFV, and our attacks, we use PyTorch with an NVIDIA Tesla T4 GPU provided with Google Colab Pro as well as a single NVIDIA Tesla P100 GPU.
A.2 TP attack on relative TCAV
Here, we give an example experiment showing that TP attacks are also effective for a variant of TCAV that uses relative TCAV scores. The results in Figure 6 use concept sets for stripes, zig-zags, and polka-dots of 35 images each. Perturbations are made on the striped concept set using the final logit layer, towards an unrelated class (the ‘toilet tissue’ ImageNet-1k class).