
Explaining Explanations: Axiomatic Feature Interactions for Deep Networks

by Joseph D. Janizek, et al.

Recent work has shown great promise in explaining neural network behavior. In particular, feature attribution methods explain which features were most important to a model's prediction on a given input. However, for many tasks, simply knowing which features were important to a model's prediction may not provide enough insight to understand model behavior. The interactions between features within the model may better help us understand not only the model, but also why certain features are more important than others. In this work, we present Integrated Hessians, an extension of Integrated Gradients that explains pairwise feature interactions in neural networks. Integrated Hessians overcomes several theoretical limitations of previous methods to explain interactions, and unlike such previous methods is not limited to a specific architecture or class of neural network. We apply Integrated Hessians to a variety of neural networks trained on language data, biological data, astronomy data, and medical data, and gain new insight into model behavior in each domain.






1 Introduction and Prior Work

Deep neural networks are among the most popular classes of machine learning models. They have achieved state-of-the-art performance in problem domains ranging from natural language processing (Devlin et al., 2018) to image recognition (He et al., 2016). They have even outperformed other non-linear model types on structured tabular data (Shavitt and Segal, 2018). Although neural networks have traditionally been difficult to interpret compared to simpler model classes, gaining a better understanding of their predictions is desirable for a variety of reasons. To the extent that these algorithms are used in automated decisions impacting humans, explanations may be legally required (Selbst and Powles, 2017). When used in high-stakes applications, like diagnostic radiology, it is essential to ensure that models are making safe decisions for the right reasons (Geis et al., 2019). During model development, interpretability methods may generally help debug undesirable model behavior (Sundararajan et al., 2017).

Feature attribution: There have been a large number of recent approaches to interpreting deep neural networks, ranging from methods that aim to distill complex models into simpler models (Tan et al., 2018; Wu et al., 2018; Puri et al., 2017), to methods that aim to identify the most important concepts learned by a network (Kim et al., 2017; Olah et al., 2018, 2017; Fong and Vedaldi, 2018; Erhan et al., 2009; Mahendran and Vedaldi, 2015). One of the best-studied sets of approaches is known as feature attribution methods (Binder et al., 2016; Shrikumar et al., 2017; Lundberg and Lee, 2017; Ribeiro et al., 2016). These approaches explain a model's prediction by assigning credit to each input feature based on how much it influenced that prediction. Although these approaches help practitioners understand which features are important, they do not explain why certain features are important or how features interact in a model. To develop a richer understanding of model behavior, it is therefore desirable to explain not only feature attributions but also feature interactions. For example, in Figure 1, we show that word-level interactions can help us distinguish why deeper, more expressive neural networks outperform simpler ones on language tasks.

Feature interaction: There are several existing methods that explain feature interactions in neural networks. For example, Cui et al. (2019) propose a method to explain global interactions in Bayesian Neural Networks (BNNs). After jointly training a linear model to learn main effects with a BNN to learn interactions, they detect learned interactions by finding pairs of features that on average (across many samples) have large-magnitude terms in the input Hessian of the BNN. Tsang et al. (2017) propose a framework called Neural Interaction Detection, which detects statistical interactions between features by examining the weight matrices of feed-forward neural networks. In the domain of deep learning for genomics, Greenside et al. (2018) propose an approach called Deep Feature Interaction Maps. This approach detects interactions between a source and target feature by calculating the change in the first-order feature attribution of the source feature when all other features are held constant but the target feature is changed.

Limitations of Prior Approaches: Recent work has shown that attempting to quantitatively or qualitatively compare methods to explain machine learning models can lead to misleading and unreliable results (Tomsett et al., 2019; Buçinca et al., 2020). Instead, we attempt to draw contrast with previous approaches from a theoretical and practical perspective.

While previous approaches have taken important steps towards understanding feature interaction in neural networks, they all suffer from practical limitations. Each of the above methods is applicable only to models trained on certain types of data, or only to certain model architectures. Neural Interaction Detection applies only to feed-forward neural network architectures, and cannot be used on networks with convolutions, recurrent units, or self-attention. The approach suggested by Cui et al. (2019) is limited in that it requires the use of Bayesian Neural Networks; it is unclear how to apply the method to standard neural networks. Deep Feature Interaction Maps work only when the input features of a model take a small number of discrete values (such as the genomic data used as input in their paper).

In addition to architecture and data limitations, existing methods to detect interactions do not satisfy the common-sense axioms that have been proposed for feature attribution methods (Sundararajan et al., 2017; Lundberg and Lee, 2017). This failure to ground interactions in existing theory leaves these previous approaches provably unable to find interactions when interactions are present or, more generally, prone to finding counter-intuitive interactions (see Section 4).

Our Contributions:

(1) We propose an approach to quantifying pairwise feature interactions that can be applied to any neural network architecture; (2) We identify several common-sense axioms that feature-level interactions should satisfy and show that our proposed method satisfies them; (3) We provide a principled way to compute interactions in ReLU-based networks, which are piece-wise linear and have zero second derivatives; (4) We demonstrate the utility of our method by showing the insights it reveals into real production-type models and real datasets.

Figure 1:

Interactions help us understand why certain models perform better than others. Here, we examine interactions on the sentence “this movie was not bad.” We compare two models trained to do sentiment analysis on the Stanford Sentiment dataset: a pre-trained transformer, DistilBERT (left), which predicts positive with 98.2% confidence, and a convolutional neural network trained from scratch (right), which predicts negative with 97.6% confidence. The transformer picks up on negation patterns: “not bad” has a positive interaction, despite the word “bad” being negative. The CNN mostly picks up on negative interactions like “movie not” and “movie bad”.

2 Explaining Explanations with Integrated Hessians

To derive our feature interaction values, we start by considering Integrated Gradients, proposed by Sundararajan et al. (2017). We represent our model as a function $f: \mathbb{R}^d \to \mathbb{R}$. (In the case of multi-output models, such as multi-class classification problems, we assume the function is indexed into the correct output class.) For a function $f$, the Integrated Gradients attribution for the $i$-th feature is defined as:

$$\phi_i(x) = (x_i - x_i') \times \int_0^1 \frac{\partial f(x' + \alpha (x - x'))}{\partial x_i}\, d\alpha \qquad (1)$$

where $x$ is the sample we would like to explain and $x'$ is some uninformative baseline value. Although $f$ is often a neural network, the only requirement in order to compute attribution values is that $f$ be differentiable along the path from $x'$ to $x$. Our key insight is that the Integrated Gradients attribution for a differentiable model is itself a differentiable function $\phi_i: \mathbb{R}^d \to \mathbb{R}$. This means that we can apply Integrated Gradients to itself in order to explain how much feature $j$ impacted the importance of feature $i$:

$$\Gamma_{i,j}(x) = \phi_j(\phi_i(x)) \qquad (2)$$

For $i \neq j$, we can derive that:

$$\Gamma_{i,j}(x) = (x_i - x_i')(x_j - x_j') \times \int_0^1 \int_0^1 \alpha\beta\, \frac{\partial^2 f(x' + \alpha\beta (x - x'))}{\partial x_i \partial x_j}\, d\beta\, d\alpha \qquad (3)$$

In the case of $i = j$, the formula has an additional first-order term. We leave the derivation and the full formula to the appendix. In this view, we can interpret $\Gamma_{i,j}(x)$ as the explanation of the importance of feature $i$ in terms of the input value of feature $j$.
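To make the composition $\Gamma_{i,j}(x) = \phi_j(\phi_i(x))$ concrete, here is a minimal numerical sketch (an illustration, not the authors' released implementation): Integrated Gradients is approximated with a midpoint Riemann sum and central-difference gradients, and Integrated Hessians is obtained by applying that same routine to each attribution function $\phi_i$. The toy model $f(x) = x_1 x_2$ and all function names are assumptions of this sketch.

```python
import numpy as np

def integrated_gradients(g, x, baseline, steps=50):
    """Midpoint-rule approximation of Integrated Gradients.

    Gradients of g are taken by central differences, so the same routine
    can be applied to the model f or to one of its attribution functions."""
    x = np.asarray(x, float)
    baseline = np.asarray(baseline, float)
    d = x.size
    grads = np.zeros(d)
    eps = 1e-5
    for alpha in (np.arange(steps) + 0.5) / steps:
        point = baseline + alpha * (x - baseline)
        for i in range(d):
            step = np.zeros(d)
            step[i] = eps
            grads[i] += (g(point + step) - g(point - step)) / (2 * eps)
    return (x - baseline) * grads / steps

def integrated_hessians(f, x, baseline, steps=50):
    """Gamma[i, j]: Integrated Gradients applied to the map z -> phi_i(z)."""
    d = np.asarray(x).size
    gamma = np.zeros((d, d))
    for i in range(d):
        phi_i = lambda z, i=i: integrated_gradients(f, z, baseline, steps)[i]
        gamma[i] = integrated_gradients(phi_i, x, baseline, steps)
    return gamma

f = lambda z: z[0] * z[1]            # toy model with a pure interaction
gamma = integrated_hessians(f, [2.0, 3.0], [0.0, 0.0], steps=20)
```

For this toy model with baseline $(0, 0)$, the closed-form off-diagonal interaction is $x_1 x_2 / 4 = 1.5$, and the entries of the interaction matrix sum to $f(x) - f(x') = 6$, illustrating the interaction completeness property discussed in the next subsection.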

2.1 Fundamental Axioms for Interaction Values

Sundararajan et al. (2017) showed that, among other theoretical properties, Integrated Gradients satisfies the completeness axiom, which states:

$$\sum_{i=1}^{d} \phi_i(x) = f(x) - f(x') \qquad (4)$$

Although we leave the derivation to the appendix, we can show the following two equalities, which are immediate consequences of completeness:

$$\sum_i \sum_j \Gamma_{i,j}(x) = f(x) - f(x') \qquad (5)$$

$$\Gamma_{i,i}(x) = \phi_i(x) - \sum_{j \neq i} \Gamma_{i,j}(x) \qquad (6)$$

We call equation 5 the interaction completeness axiom: the sum of the terms $\Gamma_{i,j}(x)$ adds up to the difference between the output of $f$ at $x$ and at the baseline $x'$. This axiom lends itself to another natural interpretation of $\Gamma_{i,j}(x)$: as the interaction between features $i$ and $j$. That is, it represents the contribution that the pair of features $i$ and $j$ together add to the output $f(x)$. Satisfying interaction completeness is important because it demonstrates a relationship between model output and interaction values. Without this axiom, it is unclear how to interpret the scale of interactions.

Equation 6 provides a way to interpret the self-interaction term $\Gamma_{i,i}(x)$: it is the main effect of feature $i$ after interactions with all other features have been subtracted away. We note that equation 6 also implies the following intuitive property about the main effect: if $\Gamma_{i,j}(x) = 0$ for all $j \neq i$, or in the degenerate case where $i$ is the only feature, we have $\Gamma_{i,i}(x) = \phi_i(x)$. We call this the self completeness axiom. Satisfying self-completeness is important because it provides a guarantee that the main effect of feature $i$ equals its feature attribution value if that feature interacts with no other features.

Integrated Gradients satisfies several other common-sense axioms such as sensitivity and linearity. We discuss generalizations of these axioms to interaction values (interaction sensitivity and interaction linearity) and demonstrate that Integrated Hessians satisfies them in the appendix. We do observe one additional property here, which we call interaction symmetry: for any $i$ and $j$, we have $\Gamma_{i,j}(x) = \Gamma_{j,i}(x)$. It is straightforward to show that existing neural networks and their activation functions have continuous second partial derivatives, which implies that Integrated Hessians satisfies interaction symmetry.

We discuss the special case of the ReLU activation function in Section 3.

2.2 On Computing Integrated Hessians

In practice we compute the Integrated Hessians values by a discrete sum approximation of the integral, similar to how Integrated Gradients is computed (Sundararajan et al., 2017). The original paper uses the completeness axiom to determine convergence of attribution values. Similarly, we can use the interaction completeness axiom to determine convergence of the discrete sum approximation. We find that anywhere between 50 and 300 discrete steps suffice to approximate the double integral in most cases. We leave the exact approximation formulas and a further discussion of convergence to the appendix and to Section 3.
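This convergence behavior can be sketched with a toy example (an illustration under assumptions, not the paper's exact approximation formulas): the off-diagonal double integral is approximated on a $k \times k$ midpoint grid of interpolation points, using the assumed function $f(z) = z_1 z_2^2$, whose cross-partial and closed-form interaction value are easy to compute by hand. The approximation error shrinks as $k$ grows.

```python
import numpy as np

def gamma_offdiag(hess_ij, x, baseline, i, j, steps):
    """k x k midpoint-grid approximation of the off-diagonal double integral
    defining Integrated Hessians. hess_ij evaluates d^2 f / dx_i dx_j."""
    x = np.asarray(x, float)
    baseline = np.asarray(baseline, float)
    grid = (np.arange(steps) + 0.5) / steps
    total = 0.0
    for a in grid:
        for b in grid:
            z = baseline + a * b * (x - baseline)
            total += a * b * hess_ij(z)
    scale = (x[i] - baseline[i]) * (x[j] - baseline[j])
    return scale * total / steps ** 2

# f(z) = z1 * z2^2 has d^2 f / dz1 dz2 = 2 * z2. At x = (1, 3) with a zero
# baseline, the closed-form interaction value is 2 * x1 * x2^2 / 9 = 2.0.
hess_12 = lambda z: 2.0 * z[1]
estimates = {k: gamma_offdiag(hess_12, [1.0, 3.0], [0.0, 0.0], 0, 1, k)
             for k in (4, 16, 64)}
```

With this polynomial, a handful of steps already lands close to the true value, consistent with the 50 to 300 steps reported above sufficing in most cases.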

2.3 Baselines and Expected Hessians

Several existing feature attribution methods have pointed out the need to explain a prediction relative to a baseline value representing a lack of information (Sundararajan et al., 2017; Shrikumar et al., 2017; Binder et al., 2016; Lundberg and Lee, 2017). However, more recent work has pointed out that choosing a single baseline value to represent lack of information can be challenging in certain domains (Kindermans et al., 2019; Kapishnikov et al., 2019; Sundararajan and Taly, 2018; Ancona et al., 2017; Fong and Vedaldi, 2017; Sturmfels et al., 2020). As an alternative, Erion et al. (2019) proposed an extension of Integrated Gradients called Expected Gradients, which samples many baseline inputs from the training set. It is defined as:

$$\phi_i(x) = \mathbb{E}_{x' \sim D,\, \alpha \sim U(0,1)} \left[ (x_i - x_i') \times \frac{\partial f(x' + \alpha (x - x'))}{\partial x_i} \right] \qquad (7)$$

where the expectation is over both $x'$ drawn from the training distribution $D$ and $\alpha \sim U(0,1)$. We can apply Expected Gradients to itself to get Expected Hessians:

$$\Gamma_{i,j}(x) = \mathbb{E}_{x' \sim D,\, \alpha, \beta \sim U(0,1)} \left[ (x_i - x_i')(x_j - x_j') \times \alpha\beta\, \frac{\partial^2 f(x' + \alpha\beta (x - x'))}{\partial x_i \partial x_j} \right] \qquad (8)$$

where the expectation is over $x' \sim D$, $\alpha \sim U(0,1)$, and $\beta \sim U(0,1)$. In our experiments, we use Integrated Hessians where there is a natural baseline: for language models (Section 5.1) we use the zero embedding as a baseline, and for gene expression data (Section 5.3) we use the zero vector. On our tabular data examples (Sections 5.2 and 5.4) we use Expected Hessians. We further discuss the hyper-parameters used to compute interactions in the appendix.
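The Expected Gradients idea of averaging the Integrated Gradients integrand over data-drawn baselines can be sketched as follows (a simplified illustration: a tiny background set is enumerated exhaustively rather than sampled, and the linear toy model is an assumption of this sketch, not from the paper):

```python
import numpy as np

def expected_gradients(grad_f, x, background, steps=50):
    """Sketch of Expected Gradients: average the Integrated Gradients
    integrand over baselines taken from the data. Here the background set
    is enumerated exhaustively (not sampled) so the result is deterministic."""
    x = np.asarray(x, float)
    attr = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    for baseline in np.asarray(background, float):
        for alpha in alphas:
            point = baseline + alpha * (x - baseline)
            attr += (x - baseline) * grad_f(point)
    return attr / (len(background) * steps)

# Linear toy model f(z) = 3*z1 + 2*z2, whose gradient is constant.
grad_f = lambda z: np.array([3.0, 2.0])
background = [[0.0, 0.0], [1.0, 1.0]]
attr = expected_gradients(grad_f, [2.0, 2.0], background)
```

For this linear model, the attributions sum to the difference between $f(x)$ and the average model output over the background, the analogue of completeness for a baseline distribution.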

3 Smoothing ReLU Networks

One major limitation that has not been discussed in previous approaches to interaction detection in neural networks is related to the fact that many popular neural network architectures use the ReLU activation function, $\mathrm{ReLU}(x) = \max(0, x)$. Neural networks that use ReLU are piecewise linear and have second partial derivatives equal to zero wherever they are defined. Previous second-order approaches therefore fail to detect any interaction in ReLU-based networks.

Fortunately, the ReLU activation function has a smooth approximation, the SoftPlus function:

$$\mathrm{softplus}_\beta(x) = \frac{1}{\beta} \log\left(1 + e^{\beta x}\right) \qquad (9)$$

SoftPlus more closely approximates the ReLU function as $\beta$ increases, and has well-defined higher-order derivatives. Furthermore, Dombrowski et al. (2019) have proved that both the outputs and first-order feature attributions of a model are minimally perturbed when ReLU activations are replaced by SoftPlus activations in a trained network. Therefore, one important insight of our approach is that Integrated Hessians for a trained ReLU network can be obtained by first replacing the ReLU activations with SoftPlus activations, then calculating the interactions. We note that no re-training is necessary for this approach.

In addition to being twice differentiable and allowing us to calculate interaction values in ReLU networks, replacing ReLU with SoftPlus leads to other desirable behavior for calculating interaction values. We show that smoothing a neural network (decreasing the value of $\beta$ in the SoftPlus activation function) lets us accurately approximate the Integrated Hessians value with fewer gradient calls.
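The approximation behavior of SoftPlus can be checked numerically (a small sketch; the worst-case gap of $\log(2)/\beta$ at $x = 0$ follows directly from the SoftPlus formula):

```python
import numpy as np

def softplus(x, beta):
    """Numerically stable softplus_beta(x) = (1/beta) * log(1 + exp(beta*x))."""
    return np.logaddexp(0.0, beta * x) / beta

def relu(x):
    return np.maximum(0.0, x)

xs = np.linspace(-5.0, 5.0, 1001)
# The largest deviation from ReLU occurs at x = 0, where softplus_beta(0) =
# log(2)/beta, so the approximation tightens as beta grows.
gaps = {beta: np.max(np.abs(softplus(xs, beta) - relu(xs)))
        for beta in (1.0, 5.0, 25.0)}
```

Swapping activations in a trained network amounts to replacing each ReLU unit with this function at a chosen $\beta$; no weights change, which is why no re-training is needed.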

Theorem 1. For a one-layer neural network with softplus non-linearity $\mathrm{softplus}_\beta$ and $d$ input features, we can bound the number of interpolation points $k$ needed to approximate the Integrated Hessians to a given error tolerance; the bound grows with $\beta$ and shrinks as the network is smoothed.

Proof of Theorem 1 is contained in the appendix. In addition to the proof for the single-layer case, we also show empirically in the appendix that many-layered neural networks display the same property. Finally, we note the intuition behind these results: as we replace ReLU with SoftPlus, the decision surface of the network is smoothed (see Figure 2 and Dombrowski et al. (2019)). Once the network has been smoothed by SoftPlus replacement, the gradients tend to point in more similar directions along the path from reference to foreground sample.

Figure 2: Replacing ReLU activations with SoftPlus activations (with small $\beta$) smooths the decision surface of a neural network: gradients tend to be more homogeneous along the integration path. Orange arrows show the gradient vectors at each point along the path from the reference (green x) to the input (blue dots).

4 Explanation of XOR function

Figure 3: (Left) The output surface for a neural network with sigmoid activations representing an XOR function. (Right, upper) The Integrated Gradients feature attributions for two samples, the first where both features are turned off, and the second where both features are turned on. Both have the same output, and both get identical Integrated Gradients attributions. (Right, lower) The Integrated Hessians feature interactions for the same two samples. We see that the Integrated Hessians values, but not the Integrated Gradients values, differentiate the two samples. The sample where both features are turned off has Integrated Hessians values of 0 for all pairs of features. The sample where both features are turned on has significant negative interactions between the two features.

To understand why feature interactions can be more informative than feature attributions, we can first consider the explanations for the XOR function. This illustration also highlights a weakness of any method for detecting interactions that uses the Hessian without integrating over a path. Our data consists of two binary features, and the function learned by our model is the exclusive OR function – it has a large magnitude output when either one feature or the other is on, but a low magnitude output when both features are either on or off (see Figure 3, left).

We use the point (0,0) as a baseline. The Integrated Gradients feature attributions for the states where both features are off (0,0) and both features are on (1,1) are identical: both features get 0 attribution in both cases (see Figure 3, right upper). When we use Integrated Hessians to get all of the pairwise interactions between features for the two samples, we see that the samples now have different explanations in terms of interactions (see Figure 3, right lower). The first datapoint, (0,0), which is identical to the baseline, has all 0 interaction values. The second point, (1,1), has a negative interaction between the two features and positive self-interactions. The interactions are therefore usefully able to distinguish between (0,0), which has an output of 0 because it is identical to the baseline, and (1,1), which has an output of 0 because both features are on: on their own, each feature should increase the model output, but in interaction with each other they cancel out the positive effects and drive the model's output back to the baseline.
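The numbers in this example can be checked by hand using a smooth surrogate of the XOR network (an assumption of this sketch: the polynomial $f(x_1, x_2) = x_1 + x_2 - 2 x_1 x_2$, which matches XOR on $\{0,1\}^2$ and has constant cross-partial $-2$, rather than the sigmoid network of Figure 3):

```python
# Surrogate XOR model: f(x1, x2) = x1 + x2 - 2*x1*x2 agrees with XOR on
# {0,1}^2 and has constant cross-partial d2f/dx1dx2 = -2.
x1, x2 = 1.0, 1.0                     # the sample where both features are on

# Integrated Gradients with baseline (0, 0):
# phi_1 = x1 * integral_0^1 (1 - 2*a*x2) da = x1 * (1 - x2)
phi_1 = x1 * (1.0 - x2)               # attribution vanishes at (1, 1)

# Off-diagonal Integrated Hessians: the cross-partial is the constant -2,
# and the integral of a*b over the unit square is 1/4.
gamma_12 = x1 * x2 * (-2.0) * 0.25    # negative interaction

# Main effect via the self-completeness relation (equation 6 in the text):
gamma_11 = phi_1 - gamma_12           # positive self-interaction

# By symmetry of f, Gamma_22 = Gamma_11 and Gamma_21 = Gamma_12, so the
# four terms sum to f(1,1) - f(0,0) = 0.
total = gamma_11 + gamma_12 + gamma_12 + gamma_11
```

The surrogate reproduces the qualitative picture in Figure 3: zero attributions at (1,1), a negative cross-interaction, positive self-interactions, and a total that matches the zero output difference.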

This example also illustrates a problem with methods like Cui et al. (2019), which identify global interactions between features $i$ and $j$ by measuring the average magnitude of the $(i,j)$-th element of the Hessian over all samples. Looking at Figure 3, we can see that the input gradients and input Hessians have completely flattened (saturated) at all points on the data manifold, and consequently the average magnitude of the $(i,j)$-th element of the Hessian over all samples will be 0. By integrating between the baseline and the samples, Integrated Hessians is capable of correctly detecting the negative interaction between the two features.

5 Applications of Integrated Hessians

In this section, we outline four different use cases of Integrated Hessians on real-world data in order to demonstrate its broad applicability.

5.1 NLP

In the past decade, neural networks have been the go-to model for language tasks, from convolutional (Kim, 2014) to recurrent (Sundermeyer et al., 2012). More recently, attention mechanisms (Vaswani et al., 2017) and large, pre-trained transformer architectures (Peters et al., 2018; Devlin et al., 2018) have achieved state of the art performance on a wide variety of tasks.

Some previous work has suggested looking at the internal weights of the attention mechanisms in models that use attention (Ghaeini et al., 2018; Lee et al., 2017; Lin et al., 2017; Wang et al., 2016). However, more recent work has suggested that looking at attention weights may not be a reliable way to interpret models with attention layers (Serrano and Smith, 2019; Jain and Wallace, 2019; Brunner et al., 2019). To overcome this, feature attributions have been applied to text classification models to understand which words most impacted the classification (Liu and Avci, 2019; Lai et al., 2019). However, these methods do not explain how words interact with their surrounding context.

We download pre-trained weights for DistilBERT (Sanh et al., 2019) from the HuggingFace Transformers library (Wolf et al., 2019). We fine-tune the model on the Stanford Sentiment Treebank dataset (Socher et al., 2013), in which the task is to predict whether a movie review expresses positive or negative sentiment. After 3 epochs of fine-tuning, DistilBERT achieves a validation accuracy of 0.9071 (0.9054 TPR / 0.9089 TNR).

This performance does not represent state of the art, nor is sentiment analysis representative of the full complexity of existing language tasks. However, our focus in this paper is on explanation and this task is easy to fine-tune without needing to extensively search over hyper-parameters. We leave further fine-tuning details to the appendix.

In Figure 4, we show interactions generated by Integrated Hessians and attributions generated by Integrated Gradients on an example drawn from the validation set. The figure demonstrates that DistilBERT has learned intuitive interactions that would not be revealed from feature attributions alone. For example, a word like “painfully,” which might have a negative connotation on its own, has a large positive interaction with the word “funny” in the phrase “painfully funny.” We include more examples in the appendix.

In Figure 1, we demonstrate how interactions can help us understand one reason why a fine-tuned DistilBERT model outperforms a simpler model: a convolutional neural network (CNN) that gets an accuracy of 0.82 on the validation set. DistilBERT picks up on positive interactions between negation words (“not”) and negative adjectives (“bad”) that a CNN fails to fully capture.

Finally, in Figure 5, we use interaction values to reveal saturation effects: many negative adjectives describing the same noun interact positively. Although this may seem counter-intuitive at first, it reflects the structure of language. If a phrase has only one negative adjective, that adjective stands out as the word that makes the phrase negative. At some point, however, describing a noun with more and more negative adjectives makes any individual negative adjective less important towards classifying that phrase as negative.

Figure 4: Interactions in text reveal learned patterns such as the phrase "painfully funny" having positive interaction despite the word "painfully" having negative attribution. These interactions are not evident from attributions alone.
Figure 5: Interactions help us reveal an unintuitive pattern in language models: saturation. Although the word "movie" interacts negatively with all negative modifying adjectives, those negative adjectives themselves all interact positively. The more negative adjectives there are in the sentence, the less each individual negative adjective matters towards the overall classification of the sentence.

5.2 Heart disease prediction

Here we aggregate interactions learned from many samples in a clinical dataset and use the interactions to reveal global patterns. We examine the Cleveland heart disease dataset (Detrano et al., 1989; Das et al., 2009). After preprocessing, the dataset contains 298 patients with 13 associated features, including demographic information like age and gender and clinical measurements such as systolic blood pressure and serum cholesterol. The task is to predict whether or not a patient has coronary artery disease.

We split the data into 238 patients for training (of which 109 have coronary artery disease) and 60 for testing (of which 28 have coronary artery disease). A two-layer neural network with softplus activation achieves a test accuracy of 0.87 (0.82 TPR and 0.91 TNR) and a test AUROC of 0.92. We did not extensively search over hyper-parameters for this task, as our goal is not state of the art performance but rather model interpretation.

In Figure 6, we examine the interactions with a feature describing the number of major coronary arteries with calcium accumulation (0 to 3), as determined by cardiac cinefluoroscopy (Detrano et al., 1986). Previous research has shown that this technique is a reliable way to gauge calcium build-up in major blood vessels, and serves as a strong predictor of coronary artery disease (Detrano et al., 1986; Bartel et al., 1974; Liu et al., 2015). Our model correctly learns that more coronary arteries with evidence of calcification indicate increased risk of disease. Additionally, Integrated Hessians reveals that our model learns a negative interaction between the number of coronary arteries with calcium accumulation and female gender. This supports the well-known phenomenon of under-recognition of heart disease in women – at the same levels of cardiac risk factors, women are less likely to have clinically manifest coronary artery disease (Maas and Appelman, 2010).

In addition to including more details about training procedure and dataset preprocessing, we show additional interactions and attributions in the appendix.

Figure 6: Left: Expected Gradients feature importance of the number of major vessels with accumulation of calcium as indicated by cardiac cinefluoroscopy. More vessels with calcium build-up indicates increased risk. Right: Expected Hessians feature interactions between patient gender and the number of major vessels containing calcium. When the Expected Hessians interactions are aggregated across the dataset, they reveal that our model has learned that women with calcium deposition in one coronary artery are less likely than men to be diagnosed with coronary artery disease.

5.3 Drug combination response prediction

In the domain of anti-cancer drug combination response prediction, plotting Integrated Hessians helps us to glean biological insights into the process we are modeling. We consider one of the largest publicly-available datasets measuring drug combination response in acute myeloid leukemia (Tyner et al., 2018). Each one of 12,362 samples consists of the measured response of a 2-drug pair tested in the cancer cells of a patient. The input features are split between features describing the drug combinations (binary labels of individual drugs and their molecular targets) and features describing the cancerous cells (RNA-seq expression levels for all genes). Following Hao et al. (2018), we first learn an embedding of biological pathways from individual gene expression levels in order to have a more interpretable model. Following Preuer et al. (2017), the drug features and pathway features are then input into a simple feed-forward neural network (see appendix for more details).

Looking at only the first-order attributions, we see that the presence or absence of the drug Venetoclax in the drug combination is the most important feature (see Figure 7, top left). We can also easily see that first-order explanations are inadequate in this case – while the presence of Venetoclax is generally predictive of a more responsive drug combination, the amount of positive response to Venetoclax is predicted to vary across samples.

Integrated Hessians gives us the insight that some of this variability can be attributed to the drug that Venetoclax is combined with. We can see that the model has learned a strong negative interaction between Venetoclax and Artemisinin (see Figure 7, top right). Biological interactions are known to occur between anti-cancer drugs, and are an area of great clinical interest due to their potential therapeutic effects. Using additional data not directly available to the model, we can determine the ground truth as to which patients actually had positive and negative biological interactions between Venetoclax and Artemisinin (see appendix for details of calculation). We see that the Integrated Hessians interaction values are significantly more negative in the group with real negative biological interactions.

Finally, we can gain insight into the variability in the interaction values between Venetoclax and Artemisinin by plotting them against the expression level of a pathway containing cancer genes (see Figure 7, bottom). We see that patients with higher expression of this pathway tend to have a more negative interaction (sub-additive response) than patients with lower expression of this pathway. Integrated Hessians helps us understand the interactions between drugs in our model, as well as what genetic factors influence this interaction.

Figure 7: Top Left: Integrated Gradients values for Venetoclax. Top Right: Venetoclax interactions with Artemisinin across all samples. Bottom: Venetoclax and Artemisinin interaction is driven by expression of genes in cancer samples.

5.4 Pulsar star prediction

In this section, we use a physics dataset to confirm that a model has learned a global pattern that is visible in the training data. We utilize the HTRU2 dataset, curated by Lyon et al. (2016) and originally gathered by Keith et al. (2010). The task is to predict whether a particular signal measured by a radio telescope originates from a pulsar star or was generated by radio frequency interference (e.g. background noise). The features include statistical descriptors of measurements made from the radio telescope. The dataset contains 16,259 examples generated through radio frequency interference and 1,639 examples that are pulsars. We leave a more complete description of the data and its features to the appendix, and refer the reader to Lyon (2016) for a more complete background on pulsar star detection.

We split the data into 14,318 training examples (1,365 are pulsars) and 3,580 testing examples (274 are pulsars). We train a two-layer, softplus neural network and achieve a held out test accuracy of 0.98 (0.86 TPR and 0.99 TNR). In Figure 8, we examine the interaction between two key features in the dataset: kurtosis of the integrated profile, which we abbreviate as kurtosis (IP), and standard deviation of the dispersion-measure signal-to-noise ratio curve, which we abbreviate as standard deviation (DM-SNR).

The bottom of Figure 8 shows that kurtosis (IP) is a highly predictive feature, while standard deviation (DM-SNR) is less predictive. However, in the range where kurtosis (IP) is roughly between 0 and 2, standard deviation (DM-SNR) helps distinguish pulsars from a concentration of negative samples at standard deviation (DM-SNR) values around 40 and below. We can verify that the model we have trained correctly learns this interaction. By plotting the interaction values learned by the model against the value of kurtosis (IP), we can see a peak positive interaction for points in the indicated range with high standard deviation (DM-SNR). Interaction values show us that the model has successfully learned the expected pattern: standard deviation (DM-SNR) has the most discriminative power when kurtosis (IP) is in the indicated range.

Figure 8: Top Left: Attributions to kurtosis (IP) generated by expected gradients. Top Right: The model learns a peak positive interaction when kurtosis (IP) is in the range [0, 2]. Bottom: A plot of the training data along the axes of the two aforementioned features, colored by class label. Although kurtosis (IP) seems to be the more predictive feature, in the highlighted band the standard deviation (DM-SNR) provides useful additional information: larger standard deviation (DM-SNR) implies higher likelihood of being a pulsar star.

6 Additional Related Work

Although we covered the bulk of the related work in Section 1, we mention some additional related work here. Detecting interactions - when two or more variables have a combined effect not equal to their additive effect - has a long history in statistics (Southwood, 1978), economics (Balli and Sørensen, 2013), and game theory (Grabisch and Roubens, 1999). In the context of machine learning, many methods have been proposed to learn interactions from data, e.g. using additive models (Coull et al., 2001; Lou et al., 2013), the group lasso (Lim and Hastie, 2015), or trees (Sorokina et al., 2008). We view these approaches as tangential to our work: they aim to learn new models to detect specific, pairwise interactions; we propose a method to explain interactions in deep networks that have already been trained. Doing so is especially important in domains where neural networks achieve state-of-the-art performance, like natural language.

We are unaware of any work to explain interactions in neural networks other than those works mentioned in Section 1 (Cui et al., 2019; Tsang et al., 2017; Greenside et al., 2018). Parallel to our work, Lundberg et al. (2020) propose an extension of feature attribution methods to feature interactions for tree-based models which satisfies similar properties to the ones we propose.

7 Conclusion

In this work we propose a novel method to explain feature interactions in neural networks. The interaction values we propose have two natural interpretations: (1) as the combined effect of two features on the output of a model, and (2) as the explanation of one feature's importance in terms of another. Our method provably satisfies common-sense axioms that previous methods do not - and unlike such previous methods, it places no requirement on neural network architecture, class, or data type.

Additionally, we demonstrate how to glean interactions from neural networks trained with the ReLU activation function, whose second derivative is zero everywhere it is defined. In accordance with recent work, we show why replacing the ReLU activation function with the softplus activation function at explanation time is both intuitive and efficient.

Finally, we perform extensive experiments to reveal the utility of our method, from understanding performance gaps between model classes to discovering patterns a model has learned on high-dimensional data. We conclude that although feature attribution methods provide valuable insight into model behavior, such methods by no means end the discussion on interpretability. Rather, they encourage further work toward a deeper understanding of model behavior.


  • M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2017) Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv preprint arXiv:1711.06104. Cited by: §2.3.
  • H. O. Balli and B. E. Sørensen (2013) Interaction effects in econometrics. Empirical Economics 45 (1), pp. 583–603. Cited by: §6.
  • A. G. Bartel, J. T. Chen, R. H. Peter, V. S. Behar, Y. Kong, and R. G. Lester (1974) The significance of coronary calcification detected by fluoroscopy: a report of 360 patients. Circulation 49 (6), pp. 1247–1253. Cited by: §5.2.
  • A. Binder, G. Montavon, S. Lapuschkin, K. Müller, and W. Samek (2016) Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks, pp. 63–71. Cited by: §1, §2.3.
  • G. Brunner, Y. Liu, D. Pascual, O. Richter, M. Ciaramita, and R. Wattenhofer (2019) On identifiability in transformers. arXiv preprint arXiv:1908.04211. Cited by: §5.1.
  • Z. Buçinca, P. Lin, K. Z. Gajos, and E. L. Glassman (2020) Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems. arXiv preprint arXiv:2001.08298. Cited by: §1.
  • B. A. Coull, D. Ruppert, and M. Wand (2001) Simple incorporation of interactions into additive models. Biometrics 57 (2), pp. 539–545. Cited by: §6.
  • D. Croft, A. F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M. Gillespie, M. R. Kamdar, et al. (2014) The reactome pathway knowledgebase. Nucleic acids research 42 (D1), pp. D472–D477. Cited by: §E.3.
  • T. Cui, P. Marttinen, and S. Kaski (2019) Recovering pairwise interactions using neural networks. arXiv preprint arXiv:1901.08361. Cited by: §1, §1, §4, §6.
  • R. Das, I. Turkoglu, and A. Sengur (2009) Effective diagnosis of heart disease through neural networks ensembles. Expert systems with applications 36 (4), pp. 7675–7680. Cited by: §5.2.
  • R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, and V. Froelicher (1989) International application of a new probability algorithm for the diagnosis of coronary artery disease. The American journal of cardiology 64 (5), pp. 304–310. Cited by: §D.1, §5.2.
  • R. Detrano, E. E. Salcedo, R. E. Hobbs, and J. Yiannikas (1986) Cardiac cinefluoroscopy as an inexpensive aid in the diagnosis of coronary artery disease. The American Journal of Cardiology 57 (13), pp. 1041 – 1046. External Links: ISSN 0002-9149, Document, Link Cited by: §5.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §5.1.
  • A. Dombrowski, M. Alber, C. Anders, M. Ackermann, K. Müller, and P. Kessel (2019) Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp. 13567–13578. Cited by: §3, §3.
  • D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: §1.
  • G. Erion, J. D. Janizek, P. Sturmfels, S. Lundberg, and S. Lee (2019) Learning explainable models using attribution priors. arXiv preprint arXiv:1906.10670. Cited by: §2.3.
  • R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437. Cited by: §2.3.
  • R. Fong and A. Vedaldi (2018) Net2vec: quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8730–8738. Cited by: §1.
  • J. R. Geis, A. P. Brady, C. C. Wu, J. Spencer, E. Ranschaert, J. L. Jaremko, S. G. Langer, A. Borondy Kitts, J. Birch, W. F. Shields, et al. (2019) Ethics of artificial intelligence in radiology: summary of the joint European and North American multisociety statement. Radiology 293 (2), pp. 436–440. Cited by: §1.
  • R. Ghaeini, X. Z. Fern, and P. Tadepalli (2018) Interpreting recurrent and attention-based neural models: a case study on natural language inference. arXiv preprint arXiv:1808.03894. Cited by: §5.1.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §B.2.
  • M. Grabisch and M. Roubens (1999) An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of game theory 28 (4), pp. 547–565. Cited by: §6.
  • P. Greenside, T. Shimko, P. Fordyce, and A. Kundaje (2018) Discovering epistatic feature interactions from neural network models of regulatory dna sequences. Bioinformatics 34 (17), pp. i629–i637. Cited by: §1, §6.
  • J. Hao, Y. Kim, T. Kim, and M. Kang (2018) PASNet: pathway-associated sparse deep neural network for prognosis prediction from high-throughput data. BMC bioinformatics 19 (1), pp. 1–13. Cited by: §E.3, §5.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • S. Jain and B. C. Wallace (2019) Attention is not explanation. arXiv preprint arXiv:1902.10186. Cited by: §5.1.
  • M. Kanehisa et al. (2002) The kegg database. In Novartis Foundation Symposium, pp. 91–100. Cited by: §E.3.
  • A. Kapishnikov, T. Bolukbasi, F. Viégas, and M. Terry (2019) Segment integrated gradients: better attributions through regions. arXiv preprint arXiv:1906.02825. Cited by: §2.3.
  • M. Keith, A. Jameson, W. Van Straten, M. Bailes, S. Johnston, M. Kramer, A. Possenti, S. Bates, N. Bhat, M. Burgay, et al. (2010) The high time resolution universe pulsar survey–i. system configuration and initial discoveries. Monthly Notices of the Royal Astronomical Society 409 (2), pp. 619–627. Cited by: §5.4.
  • B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres (2017) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279. Cited by: §1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §5.1.
  • P. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim (2019) The (un) reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Cited by: §2.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §C.1, §C.2, §E.3.
  • V. Lai, J. Z. Cai, and C. Tan (2019) Many faces of feature importance: comparing built-in and post-hoc feature importance in text classification. arXiv preprint arXiv:1910.08534. Cited by: §5.1.
  • J. Lee, J. Shin, and J. Kim (2017) Interactive visualization and manipulation of attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 121–126. Cited by: §5.1.
  • J. T. Leek and J. D. Storey (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLOS Genetics 3 (9), pp. 1–12. External Links: Link, Document Cited by: §E.2.
  • M. Lim and T. Hastie (2015) Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics 24 (3), pp. 627–654. Cited by: §6.
  • Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: §5.1.
  • F. Liu and B. Avci (2019) Incorporating priors with feature attribution on text classification. arXiv preprint arXiv:1906.08286. Cited by: §5.1.
  • W. Liu, Y. Zhang, C. Yu, Q. Ji, M. Cai, Y. Zhao, and Y. Zhou (2015) Current understanding of coronary artery calcification. Journal of geriatric cardiology: JGC 12 (6), pp. 668. Cited by: §5.2.
  • Y. Lou, R. Caruana, J. Gehrke, and G. Hooker (2013) Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 623–631. Cited by: §6.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in neural information processing systems, pp. 4765–4774. Cited by: §1, §1, §2.3.
  • S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S. Lee (2020) From local explanations to global understanding with explainable ai for trees. Nature Machine Intelligence 2 (1), pp. 56–67. External Links: Document, ISBN 2522-5839, Link Cited by: §6.
  • R. J. Lyon, B. Stappers, S. Cooper, J. Brooke, and J. Knowles (2016) Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society 459 (1), pp. 1104–1123. Cited by: §5.4.
  • R. J. Lyon (2016) Why are pulsars hard to find?. Ph.D. Thesis, The University of Manchester (United Kingdom). Cited by: §5.4.
  • A. H. Maas and Y. E. Appelman (2010) Gender differences in coronary heart disease. Netherlands Heart Journal 18 (12), pp. 598–603. Cited by: §5.2.
  • A. Mahendran and A. Vedaldi (2015) Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196. Cited by: §1.
  • A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, pp. 621. External Links: Link Cited by: §E.2.
  • D. Nishimura (2001) BioCarta. Biotech Software & Internet Report: The Computer Software Journal for Scientists 2 (3), pp. 117–120. Cited by: §E.3.
  • C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill 2 (11), pp. e7. Cited by: §1.
  • C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev (2018) The building blocks of interpretability. Distill 3 (3), pp. e10. Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §5.1.
  • K. Preuer, R. P. I. Lewis, S. Hochreiter, A. Bender, K. C. Bulusu, and G. Klambauer (2017) DeepSynergy: predicting anti-cancer drug synergy with Deep Learning. Bioinformatics 34 (9), pp. 1538–1546. External Links: ISSN 1367-4803, Document, Link, Cited by: §E.3, §5.3.
  • N. Puri, P. Gupta, P. Agarwal, S. Verma, and B. Krishnamurthy (2017) Magix: model agnostic globally interpretable explanations. arXiv preprint arXiv:1706.07160. Cited by: §1.
  • P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §C.3.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §1.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §C.1, §5.1.
  • A. D. Selbst and J. Powles (2017) Meaningful information and the right to explanation. International Data Privacy Law 7 (4), pp. 233–242. External Links: ISSN 2044-3994, Document, Link, Cited by: §1.
  • S. Serrano and N. A. Smith (2019) Is attention interpretable?. arXiv preprint arXiv:1906.03731. Cited by: §5.1.
  • I. Shavitt and E. Segal (2018) Regularization learning networks: deep learning for tabular datasets. In Advances in Neural Information Processing Systems, pp. 1379–1389. Cited by: §1.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3145–3153. Cited by: §1, §2.3.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §C.1, §5.1.
  • D. Sorokina, R. Caruana, M. Riedewald, and D. Fink (2008) Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th international conference on Machine learning, pp. 1000–1007. Cited by: §6.
  • K. E. Southwood (1978) Substantive theory and statistical interaction: five models. American Journal of Sociology 83 (5), pp. 1154–1203. Cited by: §6.
  • P. Sturmfels, S. Lundberg, and S. Lee (2020) Visualizing the impact of feature attribution baselines. Distill 5 (1), pp. e22. Cited by: §B.1, §2.3.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §A.3, §B.1, §B.1, §C.3, Explaining Explanations: Axiomatic Feature Interactions for Deep Networks, §1, §1, §2.1, §2.2, §2.3, §2.
  • M. Sundararajan and A. Taly (2018) A note about: local explanation methods for deep neural networks lack sensitivity to parameter values. arXiv preprint arXiv:1806.04205. Cited by: §2.3.
  • M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, Cited by: §5.1.
  • I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §D.2, §F.2.
  • S. Tan, R. Caruana, G. Hooker, P. Koch, and A. Gordo (2018) Learning global additive explanations for neural nets using model distillation. arXiv preprint arXiv:1801.08640. Cited by: §1.
  • R. Tomsett, D. Harborne, S. Chakraborty, P. Gurram, and A. Preece (2019) Sanity checks for saliency metrics. External Links: 1912.01451 Cited by: §1.
  • M. Tsang, D. Cheng, and Y. Liu (2017) Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977. Cited by: §A.2.3, §1, §6.
  • J. W. Tyner, C. E. Tognon, D. Bottomly, B. Wilmot, S. E. Kurtz, S. L. Savage, N. Long, A. R. Schultz, E. Traer, M. Abel, et al. (2018) Functional genomic landscape of acute myeloid leukaemia. Nature 562 (7728), pp. 526–531. Cited by: §E.1, §5.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §5.1.
  • Y. Wang, M. Huang, X. Zhu, and L. Zhao (2016) Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606–615. Cited by: §5.1.
  • D. S. Wishart, Y. D. Feunang, A. C. Guo, E. J. Lo, A. Marcu, J. R. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda, et al. (2018) DrugBank 5.0: a major update to the drugbank database for 2018. Nucleic acids research 46 (D1), pp. D1074–D1082. Cited by: §E.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §C.1, §5.1.
  • M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-Velez (2018) Beyond sparsity: tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.


Appendix A Deriving Interaction Values

In this section, we expand the derivation of the Integrated Hessians formula, further discuss relevant axioms that interaction values should satisfy, and describe how we compute Integrated Hessians in practice.

a.1 The Integrated Hessians Formula

Here we derive the formula for Integrated Hessians from its definition: Γ_{i,j}(x) = φ_j(φ_i(x)), the Integrated Gradients attribution of feature j applied to the function x ↦ φ_i(x). We start by expanding out φ_i(x) using the definition of Integrated Gradients:

φ_i(x) = (x_i − x'_i) ∫₀¹ ∂f(x' + α(x − x'))/∂x_i dα

We consider the function f(x' + α(x − x')), and we first assume that

∂/∂x_j ∫₀¹ ∂f(x' + α(x − x'))/∂x_i dα = ∫₀¹ α ∂²f(x' + α(x − x'))/∂x_j ∂x_i dα

where we've assumed that the function satisfies the conditions for the Leibniz integral rule (so that integration and differentiation are interchangeable). These conditions require that the first and second derivatives of f are continuous over the integration region, and that the bounds of integration are constant with respect to x_j. It's easy to see that the bounds of integration are constant with respect to x_j. It is also straightforward to see that common neural network activation functions - for example softplus(x) = log(1 + eˣ), tanh(x), or x·Φ(x), where Φ is the cumulative distribution function of the normal distribution - have continuous first and second partial derivatives, which implies compositions of these functions have continuous first and second partial derivatives as well. Although this is not the case with the ReLU activation function, we discuss replacing it with softplus in the main text.

We can proceed by plugging the interchanged integral above into the original definition of Γ_{i,j}(x), which for i ≠ j yields:

Γ_{i,j}(x) = (x_i − x'_i)(x_j − x'_j) ∫₀¹ ∫₀¹ αβ ∂²f(x' + αβ(x − x'))/∂x_i ∂x_j dβ dα

where all we've done is re-arrange terms.

Deriving Γ_{i,i}(x) proceeds similarly:

∂φ_i/∂x_i evaluated at x' + β(x − x') equals ∫₀¹ ∂f(x' + αβ(x − x'))/∂x_i dα + β(x_i − x'_i) ∫₀¹ α ∂²f(x' + αβ(x − x'))/∂x_i² dα

using the product and chain rules. After similar re-arrangement, we can arrive at:

Γ_{i,i}(x) = (x_i − x'_i)² ∫₀¹ ∫₀¹ αβ ∂²f(x' + αβ(x − x'))/∂x_i² dβ dα + (x_i − x'_i) ∫₀¹ ∫₀¹ ∂f(x' + αβ(x − x'))/∂x_i dβ dα

The derivation for Expected Hessians simply observes that the integrals over α and β can be viewed as integrating over the product of two uniform distributions, e.g.

Γ_{i,j}(x) = E[ αβ (x_i − x'_i)(x_j − x'_j) ∂²f(x' + αβ(x − x'))/∂x_i ∂x_j ],  with x' ∼ D and α, β ∼ U(0, 1),

where D represents the data distribution and U(0, 1) the uniform distribution on the unit interval.
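The continuity requirement on the activation function can be sanity-checked numerically. The following is a minimal sketch (our own illustration, not code from the paper) that uses finite differences to confirm that softplus_β carries a nonzero, continuous second derivative (equal to β/4 at the origin), while ReLU's second derivative is zero wherever it is defined:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x, beta=1.0):
    # softplus_beta(x) = (1/beta) * log(1 + exp(beta * x)); smooth approximation to ReLU
    return np.log1p(np.exp(beta * x)) / beta

def second_derivative(f, x, h=1e-4):
    # central finite-difference estimate of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

# ReLU's second derivative is zero everywhere it is defined
assert abs(second_derivative(relu, 1.0)) < 1e-6
assert abs(second_derivative(relu, -1.0)) < 1e-6

# softplus carries curvature: its analytic second derivative at 0 is beta/4
for beta in (0.5, 1.0, 5.0):
    est = second_derivative(lambda x: softplus(x, beta), 0.0)
    assert abs(est - beta / 4.0) < 1e-3
```

Because ReLU's second derivative vanishes almost everywhere, the Hessian of a ReLU network carries no usable interaction signal, which is exactly why the softplus substitution matters.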

a.2 Axioms Satisfied by Integrated Hessians

a.2.1 Self and Interaction Completeness

The proofs that Integrated Hessians satisfies the axioms of interaction completeness and self completeness are straightforward, but we include the step-by-step derivation here. First, we note that φ_i(x') = 0 for any i because (x'_i − x'_i) = 0. Then, by completeness of Integrated Gradients applied to the function φ_i, we have that:

Σ_j Γ_{i,j}(x) = φ_i(x) − φ_i(x') = φ_i(x) − 0

Re-arrangement gives us the self completeness axiom:

Σ_j Γ_{i,j}(x) = φ_i(x)

Since Integrated Gradients satisfies completeness, we have:

Σ_i φ_i(x) = f(x) − f(x')

Making the appropriate substitution from the self completeness equation shows the interaction completeness axiom:

Σ_i Σ_j Γ_{i,j}(x) = f(x) − f(x')
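As a concrete illustration of both axioms (a worked example we add here; the choice of toy function is ours), take f(x₁, x₂) = x₁x₂ with the all-zeros baseline x' = 0:

```latex
\text{Off-diagonal: } \frac{\partial^2 f}{\partial x_1 \partial x_2} = 1
  \implies \Gamma_{1,2}(x) = \Gamma_{2,1}(x)
  = x_1 x_2 \int_0^1\!\!\int_0^1 \alpha\beta \, d\beta \, d\alpha = \frac{x_1 x_2}{4}. \\
\text{Diagonal: } \frac{\partial^2 f}{\partial x_i^2} = 0, \text{ so only the first-order term survives, e.g. }
  \Gamma_{1,1}(x) = x_1 \int_0^1\!\!\int_0^1 \alpha\beta\, x_2 \, d\beta \, d\alpha
  = \frac{x_1 x_2}{4} = \Gamma_{2,2}(x). \\
\text{Self completeness: } \Gamma_{1,1}(x) + \Gamma_{1,2}(x) = \frac{x_1 x_2}{2}
  = x_1 \int_0^1 \alpha x_2 \, d\alpha = \phi_1(x). \\
\text{Interaction completeness: } \sum_i \sum_j \Gamma_{i,j}(x)
  = 4 \cdot \frac{x_1 x_2}{4} = x_1 x_2 = f(x) - f(x').
```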
a.2.2 Sensitivity

Integrated Gradients satisfies an axiom called sensitivity, which can be phrased as follows. Given an input x and a baseline x', if x_k = x'_k for all k except i, where x_i ≠ x'_i, and if f(x) ≠ f(x'), then φ_i(x) ≠ 0. Specifically, by completeness we know that φ_i(x) = f(x) − f(x'). Intuitively, this is saying that if only one feature differs between the baseline and the input and changing that feature changes the output, then the amount the output changes should be equal to the importance of that feature.

We can extend this axiom to the interaction case by considering the case when two features differ from the baseline. We call this axiom interaction sensitivity, and it can be described as follows. If an input x and a baseline x' are equal everywhere except x_i ≠ x'_i and x_j ≠ x'_j, and if f(x) ≠ f(x'), then Γ_{i,i}(x) + Γ_{i,j}(x) + Γ_{j,i}(x) + Γ_{j,j}(x) = f(x) − f(x'), and Γ_{k,ℓ}(x) = 0 for all other pairs (k, ℓ). Intuitively, this says that if the only features that differ from the baseline are i and j, then the difference in the output must be solely attributable to the main effects of i and j plus the interaction between them. This axiom holds simply by applying interaction completeness and observing that Γ_{k,ℓ}(x) = 0 if x_k = x'_k or x_ℓ = x'_ℓ.

a.2.3 Implementation Invariance

The implementation invariance axiom, described in the original paper, states the following. For two models f and g such that f(x) = g(x) for all x, we have Γ^f_{i,j}(x) = Γ^g_{i,j}(x) for all features i, j and all points x, regardless of how f and g are implemented. Although it seems trivial, this axiom does not necessarily hold for attribution methods that use the implementation or structure of the network in order to generate attributions. Critically, this axiom also does not hold for the interaction method proposed by Tsang et al. (2017), which examines the first layer of a feed-forward neural network: two networks may represent exactly the same function but differ greatly in their first layer.

This axiom is trivially seen to hold for Integrated Hessians, since it holds for Integrated Gradients. It is desirable because, without it, attributions and interactions may encode information about unimportant aspects of model structure rather than the actual decision surface of the model.

a.2.4 Linearity

Integrated Gradients satisfies an axiom called linearity, which can be described as follows. Given two networks f and g, consider the output of the weighted ensemble of the two networks, h(x) = a·f(x) + b·g(x). Then the attribution of the weighted ensemble equals the weighted sum of attributions: φ^h_i(x) = a·φ^f_i(x) + b·φ^g_i(x) for all features i and samples x. This axiom is desirable because it preserves linearity within a network, and allows easy computation of attributions for network ensembles.

We can generalize linearity to interactions using the interaction linearity axiom: Γ^h_{i,j}(x) = a·Γ^f_{i,j}(x) + b·Γ^g_{i,j}(x) for any i, j and all points x. Given that Γ^h_{i,j} is a composition of linear functions of the network - two applications of Integrated Gradients, each of which is linear in the network - it is itself a linear function of the parameterized networks f and g, and therefore Integrated Hessians satisfies interaction linearity.

a.2.5 Symmetry-Preserving

We say that two features i and j are symmetric with respect to f if swapping them does not change the output of f anywhere. That is, f(…, x_i, …, x_j, …) = f(…, x_j, …, x_i, …). The original paper shows that Integrated Gradients is symmetry-preserving: if i and j are symmetric with respect to f, and if x_i = x_j and x'_i = x'_j for some input x and baseline x', then φ_i(x) = φ_j(x). We can make the appropriate generalization to interaction values: if the same conditions as above hold, then Γ_{i,k}(x) = Γ_{j,k}(x) for any feature k. This axiom holds since, again, Integrated Hessians is a composition of Integrated Gradients computations, each of which is symmetry-preserving. This axiom is desirable because it says that if two features are functionally equivalent to a model, then they must interact the same way with respect to that model.

a.3 Approximating Integrated Hessians in Practice

To compute Integrated Gradients in practice, Sundararajan et al. (2017) introduce the following discrete sum approximation:

φ_i(x) ≈ (x_i − x'_i) · (1/k) Σ_{ℓ=1}^{k} ∂f(x' + (ℓ/k)(x − x'))/∂x_i

where k is the number of points used to approximate the integral. To compute Integrated Hessians, we introduce a similar discrete sum approximation:

Γ_{i,j}(x) ≈ (x_i − x'_i)(x_j − x'_j) · (1/m²) Σ_{k=1}^{m} Σ_{ℓ=1}^{m} (k/m)(ℓ/m) ∂²f(x' + (k/m)(ℓ/m)(x − x'))/∂x_i ∂x_j

Typically, it is easiest to compute this quantity when both inner sums use the same number of points m, so that the total number of samples m² is a perfect square - however, when a non-square number of samples is desired, we can generate sample points from the product distribution of two uniform distributions such that their number is the smallest perfect square above the desired number of samples, and index the sorted samples appropriately to get the desired number. The above formula omits the first-order term in Γ_{i,i}, but it can be computed using the same principle.
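To make the double-sum approximation concrete, here is a minimal sketch (our illustration; the toy function f(x) = x₁x₂ and all names are ours, not the paper's code) that implements it with analytic derivatives, including the first-order main-effect term on the diagonal, and checks interaction completeness:

```python
import numpy as np

def f(x):                  # toy model: f(x) = x1 * x2
    return x[0] * x[1]

def grad_f(x):             # analytic gradient of the toy model
    return np.array([x[1], x[0]])

def hess_f(x):             # analytic Hessian (constant for this toy model)
    return np.array([[0.0, 1.0], [1.0, 0.0]])

def integrated_hessians(x, baseline, m=100):
    """Discrete double-sum approximation of Integrated Hessians.

    Returns a (d, d) matrix whose (i, j) entry approximates Gamma_{i,j}(x);
    the first-order main-effect term is added on the diagonal.
    """
    d = len(x)
    diff = x - baseline
    gamma = np.zeros((d, d))
    for k in range(1, m + 1):
        for l in range(1, m + 1):
            ab = (k / m) * (l / m)             # the alpha * beta product
            point = baseline + ab * diff
            gamma += ab * np.outer(diff, diff) * hess_f(point) / m**2
            gamma += np.diag(diff * grad_f(point)) / m**2   # first-order term
    return gamma

x = np.array([1.5, -2.0])
baseline = np.zeros(2)
gamma = integrated_hessians(x, baseline)

# Interaction completeness: all terms sum to f(x) - f(baseline) = -3
# (the right-Riemann bias shrinks as m grows)
assert abs(gamma.sum() - (f(x) - f(baseline))) < 0.1
assert abs(gamma[0, 1] - (-0.75)) < 0.05   # exact Gamma_{1,2} = x1*x2/4
```

For a real network, `grad_f` and `hess_f` would be supplied by automatic differentiation rather than written by hand.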

Expected Hessians has a similar, if slightly simpler, form:

Γ_{i,j}(x) ≈ (1/n) Σ_{s=1}^{n} α_s β_s (x_i − x'_{s,i})(x_j − x'_{s,j}) ∂²f(x'_s + α_s β_s (x − x'_s))/∂x_i ∂x_j

where (x'_s, α_s, β_s) is the s-th sample drawn from the product of the data distribution and two uniform distributions. We find that, in general, fewer than 300 samples are required for any given problem to approximately satisfy interaction completeness. For most problems a far smaller number suffices (e.g. around 50), although this is model- and data-dependent: larger models and higher-dimensional data generally require more samples than smaller models and lower-dimensional data.
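A sampling-based sketch of this estimator (again our own toy illustration, with f(x) = x₁x₂, hand-written derivatives, and a two-point stand-in for the data distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x[..., 0] * x[..., 1]       # toy model, supports batches

x = np.array([1.5, -2.0])
background = np.array([[0.0, 0.0], [1.0, 1.0]])   # stand-in data distribution D

n = 200_000
base = background[rng.integers(0, len(background), size=n)]   # x'_s ~ D
a = rng.uniform(size=n)                                       # alpha_s
b = rng.uniform(size=n)                                       # beta_s
diff = x - base
point = base + (a * b)[:, None] * diff                        # x'_s + ab(x - x'_s)

# For f(x) = x1*x2: the mixed second derivative is 1 and grad f = (x2, x1)
offdiag = a * b * diff[:, 0] * diff[:, 1]        # samples of Gamma_{1,2} (= Gamma_{2,1})
diag = diff[:, 0] * point[:, 1] + diff[:, 1] * point[:, 0]   # first-order main effects
total = 2 * offdiag + diag

# Completeness now holds against the *expected* baseline output:
# sum of interactions ~= f(x) - E_D[f(x')]
assert abs(total.mean() - (f(x) - f(background).mean())) < 0.05
```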

Appendix B Effects of Smoothing ReLU Networks

b.1 Proof of Theorem 1

Theorem 2.

For a one-layer neural network with softplus non-linearity, softplus_β(x) = (1/β) log(1 + exp(βx)), and d input features, we can bound the number of interpolation points needed to approximate the Integrated Hessians to a given error tolerance by a quantity that grows with the smoothing parameter β and the number of features d, made precise in the proof below.


As pointed out in Sundararajan et al. (2017) and Sturmfels et al. (2020), completeness can be used to assess the convergence of the approximation. We first show that decreasing β improves convergence for Integrated Gradients. In order to accurately calculate the Integrated Gradients value for a feature i, we want to be able to bound the error between the approximate computation and the exact value. The exact value is given as:

φ_i(x) = (x_i − x'_i) ∫₀¹ ∂f(x' + α(x − x'))/∂x_i dα
To simplify notation, we can define the partial derivative that we want to integrate over in the i-th coordinate as h(α):

h(α) = ∂f(x' + α(x − x'))/∂x_i
Since the single-layer neural network with softplus activation is monotonic along the path (assume without loss of generality that h is increasing), the true integral can be lower-bounded by the left Riemann sum L_k:

L_k = (1/k) Σ_{ℓ=0}^{k−1} h(ℓ/k) ≤ ∫₀¹ h(α) dα

and can likewise be upper-bounded by the right Riemann sum R_k:

∫₀¹ h(α) dα ≤ R_k = (1/k) Σ_{ℓ=1}^{k} h(ℓ/k)

We can then bound the magnitude of the error between the Riemann sum and the true integral by the difference between the right and left sums:

|∫₀¹ h(α) dα − L_k| ≤ R_k − L_k = (1/k)(h(1) − h(0))
By the mean value theorem, we know that for some α* ∈ (0, 1), h(1) − h(0) = h'(α*). Therefore:

R_k − L_k ≤ (1/k) max_{α ∈ [0, 1]} |h'(α)|

Rewriting in terms of the original function, we have:

h'(α) = Σ_j (x_j − x'_j) ∂²f(x' + α(x − x'))/∂x_j ∂x_i
We can then consider the gradient vector of h, whose coordinates are the second partial derivatives of the one-layer softplus network f(x) = softplus_β(w·x):

∂²f(x)/∂x_i ∂x_j = β w_i w_j σ(β w·x)(1 − σ(β w·x))

where each coordinate is maximized at the all-zeros input vector (where w·x = 0), and takes a maximum value of β w_i w_j / 4. We can therefore bound the error in convergence as:

|∫₀¹ h(α) dα − L_k| ≤ (β/4k) |Σ_j w_i w_j (x_j − x'_j)|

Ignoring the dependency on path length and the magnitude of the weights of the neural network, we see that:

error = O(β/k), i.e., the number of interpolation points k needed to reach error ε is O(β/ε)
This demonstrates that the number of interpolation points necessary to achieve a set error rate decreases as the activation function is smoothed (as the value of β is decreased). While this proof only bounds the error in the approximation of the integral for a single feature, we get the error in completeness by multiplying by an additional factor of d, the number of features.

We can extend the same proof to the Integrated Hessians values. We first consider the error for estimating the off-diagonal terms Γ_{i,j}, i ≠ j. The true value we are trying to approximate is given as:

Γ_{i,j}(x) = (x_i − x'_i)(x_j − x'_j) ∫₀¹ ∫₀¹ αβ ∂²f(x' + αβ(x − x'))/∂x_i ∂x_j dβ dα

For the sake of notation, we can say g(α, β) = αβ ∂²f(x' + αβ(x − x'))/∂x_i ∂x_j. Assuming that we are integrating from the all-zeros baseline as suggested in Sundararajan et al. (2017), since g is monotonic along each axis of integration from the 0 baseline, we can again bound the error in the double integral by the magnitude of the difference in the left and right Riemann sums:

|∫₀¹ ∫₀¹ g(α, β) dβ dα − L_{m,m}| ≤ |R_{m,m} − L_{m,m}|

We can then again use monotonicity over the interval to bound this difference by the variation of g along each axis, which gives us:

|R_{m,m} − L_{m,m}| ≤ (1/m) max_β |g(1, β) − g(0, β)| + (1/m) max_α |g(α, 1) − g(α, 0)|

By the mean value theorem, we know that for some α* ∈ (0, 1), g(1, β) − g(0, β) = ∂g/∂α(α*, β), and analogously along β. Substituting gives us:

|R_{m,m} − L_{m,m}| ≤ (1/m) max_{(α, β) ∈ [0,1]²} ( |∂g/∂α| + |∂g/∂β| )
We can then consider the elements of the gradient vector of g. Differentiating ∂²f/∂x_i ∂x_j = β w_i w_j σ(β w·x)(1 − σ(β w·x)) once more produces, up to weight factors, the function:

β² σ(β w·x)(1 − σ(β w·x))(1 − 2σ(β w·x))

For the univariate version of each coordinate, we can maximize the function by taking the derivative with respect to x and setting it equal to 0. Writing s = σ(βx), this derivative is proportional to:

1 − 6s + 6s²

We can see that this equation holds only when 1 − 6s + 6s² = 0, and can solve it by finding the roots of this quadratic equation, which occur when σ(βx) = (3 ± √3)/6. When we plug that back in, we find the absolute value of the function in that coordinate takes a maximum value of β²/(6√3). Therefore, for a given set of fixed weights of the network, we can see that the coordinate-wise maximum magnitude of the derivatives of g scales with β², and that the number of interpolation points necessary to reach a desired level of error in approximating the double integral decreases as β is decreased. Again ignoring the fixed weights and path length, the number of interpolation points necessary is bounded by:

m = O(β²/ε)
For the Γ_{i,i} terms (main effect terms), the error will have another additive contribution. This is because there is an added term to the main effect equal to:

(x_i − x'_i) ∫₀¹ ∫₀¹ ∂f(x' + αβ(x − x'))/∂x_i dβ dα

When we bound the error in this approximate integral by the difference between the double left sum and double right sum, we get a bound governed by the maximum magnitude of the second derivative, which up to weight factors is:

β/4

Following the exact same steps as in the single-integral argument above, we can then show that the bound on the error of the on-diagonal terms will have an additional term that is O(β/m). Due to the axiom of interaction completeness, the error bound of the entire convergence can be obtained by adding up all of the individual terms, incurring another factor of d² (one term per pair of features) in the bound.

b.2 Empirically Improves Convergence

In addition to theoretically analyzing the effects of smoothing the activation functions of a single-layer neural network on the convergence of the approximate calculation of Integrated Gradients and Integrated Hessians, we also wanted to empirically analyze the same phenomenon in deeper networks. We assessed this by creating two networks: one with 5 hidden layers of 50 nodes, and a second with 10 hidden layers of 50 nodes. These networks were then randomly initialized using the Xavier uniform initialization scheme (Glorot and Bengio, 2010). We created 10 samples to explain, each with 100 features drawn at random from the standard normal distribution. To evaluate the convergence of our approximate Integrated Hessians values, we plot the interaction completeness error (the difference between the sum of the Integrated Hessians values and the difference of the function output at a sample and the function output at the zeros baseline). We plot the completeness error as a fraction of the magnitude of the function output. As we decrease the value of β, we smooth the activations, and we can see that the number of interpolation points required to converge decreases (see Figure 9 and Figure 10). We note that the randomly initialized weights of each network are held constant and the only thing changed is the value of β in the activation function.

Figure 9: 5-Layer Network Results. Interaction completeness error (difference between model output and sum of Integrated Hessians values) decreases more quickly with the number of interpolation points as the smoothing parameter of the softplus activation function is decreased (i.e., as the function is smoothed). Results are averaged over 10 samples with 100 features for a neural network with 5 hidden layers of 50 nodes each.
Figure 10: 10-Layer Network Results. Decreasing the smoothing parameter has an even more dramatic effect on convergence when the network is deeper than in the 5-layer case. Again, this result shows that the interaction completeness error decreases more quickly with the number of interpolation points as the activation function is smoothed. Results are averaged over 10 samples with 100 features for a neural network with 10 hidden layers of 50 nodes each.

Appendix C Details on the Sentiment Analysis Task

c.1 Fine-Tuning DistilBERT

As mentioned in the main text, we download pre-trained weights for DistilBERT, a pre-trained language model introduced in Sanh et al. (2019), from the HuggingFace Transformers library (Wolf et al., 2019). We fine-tune the model on the Stanford Sentiment Treebank dataset introduced by Socher et al. (2013). We fine-tune for 3 epochs with a batch size of 32, a learning rate of 0.00003, a max sequence length of 128 tokens, and the Adam optimizer (Kingma and Ba, 2014). We tokenize using HuggingFace's built-in tokenizer, which operates in an uncased fashion. We did not search over these hyper-parameters; rather, they are the defaults suggested for fine-tuning in the HuggingFace repository, and we found that they worked adequately for our purposes.

c.2 Training a CNN

The convolutional neural network that we compare to in the main text is trained from scratch on the same dataset. We randomly initialize 32-dimensional embeddings and use a max sequence length of 52. First, we apply dropout to the embeddings with a dropout rate of 0.5. The network itself is composed of 1D convolutions with 32 filters of size 3 and 32 filters of size 8. Each filter size is applied separately to the embedding layer, after which max pooling with a stride of 2 is applied; the outputs of both convolutions are then concatenated and fed through a dropout layer with a dropout rate of 0.5 during training. A hidden layer of size 50 follows the dropout, finally followed by a linear layer generating a scalar prediction to which the sigmoid function is applied.
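A sketch of this architecture in PyTorch. The layer sizes follow the description above; details the text leaves open, such as the pooling window, the activation after the convolutions, the hidden-layer activation, and the vocabulary size, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Sketch of the sentiment CNN described above."""
    def __init__(self, vocab_size=10000, emb_dim=32, seq_len=52):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.emb_dropout = nn.Dropout(0.5)
        # Two 1D convolutions over the embedded sequence: 32 filters each,
        # with kernel sizes 3 and 8, applied separately to the embeddings.
        self.conv3 = nn.Conv1d(emb_dim, 32, kernel_size=3)
        self.conv8 = nn.Conv1d(emb_dim, 32, kernel_size=8)
        self.dropout = nn.Dropout(0.5)
        # Sequence lengths after convolution and stride-2 max pooling
        # (pooling window of 2 is an assumption).
        len3 = (seq_len - 3 + 1 - 2) // 2 + 1
        len8 = (seq_len - 8 + 1 - 2) // 2 + 1
        self.hidden = nn.Linear(32 * (len3 + len8), 50)
        self.out = nn.Linear(50, 1)

    def forward(self, tokens):  # tokens: (batch, seq_len) integer ids
        e = self.emb_dropout(self.embedding(tokens)).transpose(1, 2)
        p3 = F.max_pool1d(F.relu(self.conv3(e)), kernel_size=2, stride=2)
        p8 = F.max_pool1d(F.relu(self.conv8(e)), kernel_size=2, stride=2)
        # Concatenate the flattened outputs of both convolution branches.
        h = self.dropout(torch.cat([p3.flatten(1), p8.flatten(1)], dim=1))
        return torch.sigmoid(self.out(F.relu(self.hidden(h))))

model = TextCNN()
scores = model(torch.randint(0, 10000, (4, 52)))  # (batch, 1), values in (0, 1)
```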

We train with a batch size of 128 for 2 epochs and use a learning rate of 0.001. We optimize using the Adam algorithm with the default hyper-parameters (Kingma and Ba, 2014). Since this model was not pre-trained on a large language corpus and lacks the expressive power of a deep transformer, it is unable to capture patterns like negation that a fine-tuned DistilBERT does.

c.3 Generating Attributions and Interactions

In order to generate attributions and interactions, we use Integrated Gradients and Integrated Hessians with the zero embedding (the embedding produced by the all-zeros vector, which normally encodes the padding token) as a baseline. Because embedding layers are not differentiable, we generate attributions and interactions to the word embeddings and then sum over the embedding dimension to get word-level attributions and interactions, as done in Sundararajan et al. (2017). When computing attributions and interactions, we use 256 background samples. Because DistilBERT uses the GeLU activation function (Ramachandran et al., 2017), which has continuous first and second partial derivatives, there is no need to use the softplus replacement. When we plot interactions, we omit the main-effect terms in order to better visualize the interactions between words.

c.4 Additional Examples of Interactions

Here we include additional examples of interactions learned on the sentiment analysis task. First we expand upon the idea of saturation in natural language, displayed in Figure 11. We display interactions learned by a fine-tuned DistilBERT on the following sentences: “a bad movie” (negative with 0.9981 confidence), “a bad, terrible movie” (negative with 0.9983 confidence), “a bad, terrible, awful movie” (negative with 0.9984 confidence) and “a bad, terrible, awful, horrible movie” (negative with 0.9984 confidence). The confidence of the network saturates: the network output only gets so negative before it begins to flatten, even as the number of negative adjectives in the sentence increases. A sensible network should therefore spread the same amount of credit (because the attributions sum to the saturated output) across a larger number of negative words, which is exactly what DistilBERT does. This means that each word gets less negative attribution than it would if it were on its own. Thus, the negative words have positive interaction effects, which is exactly what we see in the figure.
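The saturation argument can be made concrete with a toy saturating model: a sigmoid over additive per-word scores, an assumption made purely for illustration (it is not the fine-tuned DistilBERT). The pairwise interaction between two negative adjectives, measured as a mixed finite difference, comes out positive even though each word's main effect is negative.

```python
import math

def toy_model(n_negative_adjectives):
    """Toy saturating sentiment model: each negative adjective contributes
    an additive score of -2, and the output is squashed by a sigmoid.
    This is an illustrative assumption, not the fine-tuned model."""
    return 1.0 / (1.0 + math.exp(2.0 * n_negative_adjectives))

# Pairwise interaction of two negative adjectives via a mixed finite
# difference: f(both) - f(only first) - f(only second) + f(neither).
interaction = toy_model(2) - 2 * toy_model(1) + toy_model(0)

# Main effect of a single negative adjective relative to the baseline.
main_effect = toy_model(1) - toy_model(0)
```

Because the sigmoid flattens as the score becomes more negative, the second adjective moves the output less when the first is already present, so the mixed difference (the interaction) is positive while each main effect is negative.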

In Figure 12, we give another example of the full interaction matrix on a sentence from the validation set. In Figure 13, we give an example of how explaining the importance of a particular word can reveal whether that word is important because of its main effect or because of its surrounding context. We show additional examples from the validation set in Figures 14, 15, 16, 17 and 18. We note that while some interactions make intuitive sense to humans (“better suited” being negative or “good script” being positive), there are many other examples of interactions that are less intuitive. These interactions may indicate that the Stanford Sentiment Treebank dataset does not fully capture the expressive power of language (e.g., it does not have enough samples to represent all of the possible interactions in language), or they may indicate that the model has learned higher-order effects that cannot be explained by pairwise interactions alone.

Figure 11: The effects of increasing saturation. As we add more negative adjectives to describe the word “movie”, those negative adjectives interact more and more positively, even though those words interact negatively with the word they are describing. This is because each individual negative adjective has less impact on the overall negativity of the sentence the more negative adjectives there are.
Figure 12: An example from the Stanford Sentiment Treebank validation set. Interactions highlight intuitive patterns in text, such as phrases like “beautifully observed” and “miraculously unsentimental” being strongly positive interactions.
Figure 13: An example from the Stanford Sentiment Treebank validation set. This example shows that the word “nonprofessional” has a main effect that is negative, but the surrounding context outweighs the main effect and makes the overall attribution positive.
Figure 14: An example from the Stanford Sentiment Treebank validation set. This example highlights the phrase “better suited” being strongly negative, which is very intuitive. However, some of the other interactions are less intuitive and may indicate a lack of training data or higher-order interactions beyond word pairs.
Figure 15: Another example from the Stanford Sentiment Treebank validation set. Notice that the most strongly interacting pairs are not necessarily adjacent words, e.g. “artfully” and “technical”.
Figure 16: An example from the Stanford Sentiment Treebank validation set. Interestingly, the phrase “a good” has a negative interaction. This may indicate saturation effects, higher-order effects, or that the model has simply learned an unintuitive pattern.
Figure 17: An example from the Stanford Sentiment Treebank validation set. This example shows some very intuitive negative interactions among the words of the phrase “there's little to love”. Interestingly, “this english” has a positive interaction: perhaps the dataset has a bias toward English movies?
Figure 18: An example from the Stanford Sentiment Treebank validation set. This one also shows intuitive patterns, e.g. “quite compelling” being strongly positive.

Appendix D Details on the Heart Disease Prediction Task

d.1 Data Description

As mentioned in the main text, the dataset we use has 13 associated features. The list of features, which we reproduce here, is from Detrano et al. (1989), the original paper introducing the dataset:

  1. Age of patient (54.5 ± 9.0 years)

  2. Gender (202 male, 96 female)

  3. Resting systolic blood pressure (131.6 ± 17.7 mm Hg)

  4. Cholesterol (246.9 ± 51.9 mg/dl)

  5. Whether or not a patient’s fasting blood sugar was above 120 mg/dl (44 yes)

  6. Maximum heart rate achieved during exercise (149.5 ± 23.0 bpm)

  7. Whether or not a patient has exercise-induced angina (98 yes)

  8. Exercise-induced ST-segment depression (1.05 ± 1.16 mm)

  9. Number of major vessels appearing to contain calcium as revealed by cinefluoroscopy (175 patients with 0, 65 with 1, 38 with 2, 20 with 3)

  10. Type of chest pain a patient experienced, if any (49 experienced typical anginal pain, 84 experienced atypical anginal pain, 23 experienced non-anginal pain and 142 patients experienced no chest pain)

  11. Slope of peak exercise ST segment (21 patients had upsloping segments, 138 had flat segments, 139 had downsloping segments)

  12. Whether or not a patient had thallium defects as revealed by scintigraphy (2 patients with no information available, 18 with fixed defects, 115 with reversible defects and 163 with no defects)

  13. Classification of resting electrocardiogram (146 with a normal resting ECG, 148 with an ST-T wave abnormality, and 4 with probable or definite left ventricular hypertrophy)

d.2 Training Details

On this task, we use a two-layer neural network with 128 and 64 hidden units, respectively, with a softplus activation after each layer. We optimize using gradient descent (processing the entire training set in a single batch) with an initial learning rate of 0.1 that decays exponentially at a rate of 0.99 after each epoch, and we use Nesterov momentum (Sutskever et al., 2013). After training for 200 epochs, the network achieves a held-out accuracy of 0.8667, with a 0.8214 true positive rate and a 0.9062 true negative rate. We note that the hyper-parameters chosen here were not carefully tuned on a validation set; they were simply ones that converged to reasonable performance on the training set. Our focus is not state-of-the-art prediction or comparing model performances, but rather interpreting the patterns a reasonable model learns.
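The learning-rate schedule described above can be written as a one-line function (the decay is multiplicative per epoch):

```python
def learning_rate(epoch, initial=0.1, decay=0.99):
    """Exponentially decayed learning rate: the initial rate of 0.1 is
    multiplied by 0.99 after each epoch."""
    return initial * decay ** epoch

# Rates over the 200 training epochs; strictly decreasing.
schedule = [learning_rate(e) for e in range(200)]
```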

d.3 Generating Attributions and Interactions

To generate attributions and interactions for this dataset, we use Expected Gradients and Expected Hessians with the training set as the background distribution. We use 200 samples to compute both attributions and interactions; this number is likely larger than necessary, but was cheap to compute given the small size of the dataset.
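A minimal sketch of Expected Gradients on a toy differentiable function (not the trained heart-disease model): for each background sample we draw a random interpolation coefficient and average the rescaled gradients. For a linear model the estimate satisfies completeness exactly, which the snippet checks.

```python
import numpy as np

def expected_gradients(f_grad, x, background, rng):
    """Sketch of Expected Gradients: for each background sample x', draw an
    interpolation coefficient alpha ~ U(0, 1) and accumulate
    (x - x') * grad f(x' + alpha * (x - x')).  (The full method also samples
    x' at random; here we take one pass over the whole background set.)"""
    attr = np.zeros_like(x)
    for xp in background:
        alpha = rng.uniform()
        attr += (x - xp) * f_grad(xp + alpha * (x - xp))
    return attr / len(background)

# Toy linear model f(x) = w . x with constant gradient w; for linear models
# Expected Gradients satisfies completeness exactly:
#   sum_i attr_i = f(x) - mean_background f(x').
w = np.array([1.0, -2.0, 0.5])
f = lambda z: w @ z
f_grad = lambda z: w
rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 1.0, 1.0])
attr = expected_gradients(f_grad, x, background, rng)
completeness_gap = attr.sum() - (f(x) - np.mean([f(b) for b in background]))
```

For non-linear models the same estimator is a Monte Carlo approximation, and the completeness gap shrinks as the number of sampled (background, alpha) pairs grows.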

d.4 Additional Examples of Attributions and Interactions

Figure 19 shows which features were most important for predicting heart disease, aggregated over the entire dataset, as well as the trend of the importance values. Interestingly, the model learns some unintuitive trends: if a patient does not experience chest pain, they are predicted to be at higher risk of heart disease than if they experience anginal chest pain! This could indicate problems with the way certain features were encoded, or perhaps dataset bias. Figure 20 demonstrates an interaction learned by the network between maximum heart rate achieved and gender, and Figure 21 demonstrates an interaction between exercise-induced ST-segment depression and the number of major vessels appearing to contain calcium.

Figure 19: A summary of which features were most important towards predicting heart disease. A positive attribution value indicates increased risk of heart disease (negative value indicates decreased risk of heart disease). The features are ordered by largest mean absolute magnitude over the dataset. For binary features, high (yellow) indicates true while low (green) indicates false. For example, for the feature “experienced atypical angina”, yellow means the patient did experience atypical angina and green means the patient did not.
Figure 20: An interaction learned by the model between maximum heart rate achieved during exercise and the gender of the patient. In general, achieving a higher heart rate during exercise indicates lower risk of heart disease, but the model learns that this pattern is stronger for men than for women.
Figure 21: An interaction learned between ST-segment depression and the number of major vessels appearing to contain calcium. The interaction seems to indicate that if a patient has many vessels appearing to contain calcium, then ST-segment depression matters less for driving risk, probably because the number of major vessels containing calcium becomes the main risk driver.

Appendix E Details for anti-cancer drug combination response prediction

e.1 Data description

As mentioned in the main text, our dataset consists of 12,362 samples. Each sample consists of the measured response of a 2-drug pair tested in the cancer cells of a patient (Tyner et al., 2018). The 2-drug combination is described with both a drug identity indicator and a drug target indicator. For each sample, the drug identity indicator is a vector in which each element represents one of the 46 anti-cancer drugs present in the data, taking a value of 0 if the corresponding drug is not present in the combination and a value of 1 if it is. Therefore, for each sample, this vector will have 44 elements equal to 0 and 2 elements equal to 1. This is the most compact possible representation for the 2-drug combinations. The drug target indicator is a vector in which each element represents one of the 112 unique molecular targets of the anti-cancer drugs in the dataset. Each entry in this vector is equal to 0 if neither drug targets the given molecule, 1 if one of the drugs in the combination targets the given molecule, and 2 if both drugs target the molecule. The targets were compiled using the information available on DrugBank (Wishart et al., 2018). The ex vivo samples of each patient’s cancer were described using gene expression levels for each gene in the transcriptome, as measured by RNA-seq. Before training, the data was split into two parts: 80% of samples were used for model training, and the remaining 20% were used as a held-out validation set to determine when the model had been trained for a sufficient number of epochs.
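The two indicator encodings can be sketched as follows. The drug names and target map below are hypothetical placeholders (the real data has 46 drugs and 112 targets); only the encoding scheme matches the description above.

```python
import numpy as np

# Hypothetical drug list and drug-to-target map, for illustration only.
drugs = ["drugA", "drugB", "drugC", "drugD"]
targets = ["T1", "T2", "T3"]
drug_targets = {"drugA": {"T1"}, "drugB": {"T1", "T2"},
                "drugC": {"T3"}, "drugD": set()}

def encode_combination(drug1, drug2):
    """Build the drug identity indicator (1 where a drug is in the
    combination, 0 elsewhere) and the drug target indicator (number of
    drugs in the combination hitting each target: 0, 1, or 2)."""
    identity = np.array([int(d in (drug1, drug2)) for d in drugs])
    target = np.array([int(t in drug_targets[drug1]) +
                       int(t in drug_targets[drug2]) for t in targets])
    return identity, target

identity, target = encode_combination("drugA", "drugB")
```

Here exactly two entries of the identity vector are 1, and a target hit by both drugs (T1 above) receives the value 2.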

e.2 RNA-seq preprocessing

The cancerous cells in each sample were described using RNA-seq data: measurements of the expression level of each gene in the sample. We describe here the preprocessing steps used to remove batch effects while preserving biological signal. We first converted raw transcript counts to fragments per kilobase of exon model per million mapped reads (FPKM), a measure that is known to better reflect the molar amount of each transcript in the original sample than raw counts. FPKM accounts for this by normalizing the counts for different genes according to the length of the transcript, as well as for the total number of reads included in the sample (Mortazavi et al., 2008). The equation for FPKM is given as:

FPKM = (X × 10^9) / (L × N)

where X is the vector containing the number of raw counts for a particular transcript across all samples, L is the effective length of that transcript, and N represents the total number of mapped reads in each sample. After converting raw counts to FPKM, we opt to consider only the protein-coding part of the transcriptome by removing all non-protein-coding transcripts from the dataset. Protein-coding transcripts were determined according to the list provided by the HUGO Gene Nomenclature Committee. In addition to non-protein-coding transcripts, we also removed any transcript that was not observed in a sufficient fraction of samples. Transcripts were then transformed and standardized to zero mean and unit variance. Finally, the ComBat tool (a robust empirical Bayes regression implemented as part of the sva R package) was used to correct for batch effects (Leek and Storey, 2007).
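The FPKM conversion above can be sketched in NumPy (the counts and lengths here are toy values; the real pipeline operates on the full transcriptome):

```python
import numpy as np

def fpkm(counts, effective_length):
    """Convert a transcripts x samples matrix of raw read counts to FPKM:
    counts are scaled by 1e9 and divided by the transcript's effective
    length (in bases) and by the total mapped reads in each sample."""
    total = counts.sum(axis=0, keepdims=True)          # reads per sample
    return counts * 1e9 / (effective_length[:, None] * total)

# Two transcripts (rows) across two samples (columns).
counts = np.array([[100., 200.],
                   [300., 400.]])
length = np.array([1000., 2000.])  # effective lengths in bases
mat = fpkm(counts, length)
```

The 10^9 factor combines the per-kilobase (10^3) and per-million-reads (10^6) normalizations from the equation above.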

e.3 Model and training description

Figure 22: Neural network architecture for anti-cancer drug combination response prediction. We learn an embedding from all RNA-seq gene expression features to KEGG pathways by sparsely connecting the inputs only to the nodes corresponding to the pathways of which they are members. When we calculate feature attributions and interactions, we attribute to the layer that contains the raw drug inputs and the learned pathway embeddings (layer boxed with a dashed blue line).

To model the data, we combined the successful approaches of Preuer et al. (2017) and Hao et al. (2018). Our network architecture is a simple feed-forward network (Figure 22), as in Preuer et al. (2017), with two hidden layers of 500 and 250 nodes respectively, both with Tanh activations. In order to improve performance and interpretability, we followed Hao et al. (2018) in learning a pathway-level embedding of the gene expression data. The RNA-seq data was sparsely connected to a layer of nodes in which each node corresponds to a single pathway from KEGG, BioCarta, or Reactome (Kanehisa and others, 2002; Nishimura, 2001; Croft et al., 2014). We made this embedding non-linear by following the sparse connections with a Tanh activation function. The non-linear pathway embeddings were then concatenated to the drug identity indicators and the drug target indicators, and these served as inputs to the densely connected layers. We trained the network to minimize a mean squared error loss, using the Adam optimizer in PyTorch (Kingma and Ba, 2014) with default hyperparameters. We stopped training when the mean squared error on the held-out validation set failed to improve over 10 epochs. For the sake of easier calculation and more human-intuitive attribution, we attribute the model’s output to the layer with the pathway embedding and drug inputs, rather than to the raw RNA-seq features and drug inputs (see
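The sparse gene-to-pathway connection described above can be sketched as a masked linear map followed by a Tanh non-linearity. The membership matrix below is hypothetical (the real model uses KEGG/BioCarta/Reactome memberships and is implemented in PyTorch); only the masking pattern illustrates the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_pathways = 6, 2

# Binary membership mask: entry (p, g) is 1 iff gene g belongs to pathway p.
membership = np.zeros((n_pathways, n_genes))
membership[0, [0, 1, 2]] = 1.0        # pathway 0 contains genes 0-2
membership[1, [2, 3, 4, 5]] = 1.0     # pathway 1 contains genes 2-5

# Dense weights, masked so each pathway node only sees its member genes.
W = rng.normal(size=(n_pathways, n_genes))

def pathway_embedding(expression):
    """Non-linear pathway embedding: masked linear map followed by Tanh."""
    return np.tanh((W * membership) @ expression)

z = pathway_embedding(rng.normal(size=n_genes))  # one value per pathway
```

In the full model, this pathway vector is concatenated with the drug identity and drug target indicators before the densely connected layers.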