1 Introduction
State-of-the-art machine learning models, such as deep neural networks, are exceptional at modeling complex dependencies in structured data, such as text (Vaswani et al., 2017; Tai et al., 2015), images (He et al., 2016; Huang et al., 2017), and DNA sequences (Alipanahi et al., 2015; Zeng et al., 2016). However, there has been no clear explanation of what types of dependencies are captured by the black-box models that perform so well (Ribeiro et al., 2018; Murdoch et al., 2018).
In this paper, we make one of the first attempts at solving this important problem by interpreting two forms of structure, i.e., context-dependent representations and context-free representations. A context-dependent representation is one in which a model’s prediction depends specifically on the data instance (such as a sentence or an image). To illustrate the concept, consider an example in image analysis: a yellow round object can be identified as the sun or the moon given its context, either a bright blue sky or a dark night. A context-free representation is one that behaves similarly independent of instances (i.e., global behaviors). In a hypothetical task of classifying sentiment in sentences, each sentence carries a very different meaning, but when “not” and “bad” depend on each other, their sentiment contribution is almost always positive, i.e., the structure is context-free.
To investigate context-dependent and context-free structure, we borrow existing definitions from interpretable machine learning (Ribeiro et al., 2016; Kim et al., 2018). A context-dependent interpretation is a local interpretation of the dependencies at or within the vicinity of a single data instance. Conversely, a context-free interpretation is a global interpretation of how those dependencies behave in a model irrespective of data instances. In this work, we study a key form of dependency: an interaction relationship between the prediction and input features. Interactions can describe arbitrarily complex relationships between these variables and are commonly captured by state-of-the-art models like deep neural networks (Tsang et al., 2017; Murdoch et al., 2018). Interactions which are context-dependent or context-free are therefore local or global interactions, respectively.
We propose Mahé, a framework for explaining the context-dependent and context-free structures of any complex prediction model, with a focus on explaining neural networks. The context-dependent explanations build on recent work on local interpretations (Ribeiro et al., 2016; Murdoch et al., 2018; Singh et al., 2018). Specifically, Mahé takes as input a model to explain and a data instance, and returns a hierarchical explanation, a format proposed by Singh et al. (2018) to show local group-variable relationships used in predictions (Figure 1). To provide context-free explanations, Mahé generalizes those context-dependent interactions with consistent behavior in a model and determines whether a local representation in the model is responsible for the global behavior. In this case, Mahé takes as input a model and representative data corresponding to an interaction of interest and returns whether or not that interaction is context-free. We conduct experiments on both synthetic datasets and real-world application datasets, which show that Mahé’s context-dependent explanations can significantly outperform state-of-the-art methods for local interaction interpretation, and that Mahé is capable of successfully finding context-free explanations of interactions. In addition, we identify promising cases where the methodology for context-free explanations can successfully edit models. Our contributions are as follows: 1) Mahé improves context-dependent explanations in terms of interaction detection, fitting performance, and model-agnostic generality, compared to state-of-the-art methods for local interaction interpretation, 2) Mahé is the first to provide context-free explanations of interactions in deep learning models, and 3) Mahé provides a promising direction for modifying context-free interactions in deep learning models without significant performance degradation.
2 Related Works
Attribution Interpretability: A common form of interpretation is feature attribution, which is concerned with how features of a data instance contribute to a model output. Within this category, there are two distinct approaches: additive and sensitivity attribution. Additive attribution interprets how much each feature contributes to the model output when these contributions are summed. In contrast, sensitivity attribution interprets how sensitive a model output is to changes in features. Examples of additive attribution techniques include LIME (Ribeiro et al., 2016) and CD (Murdoch et al., 2018). Examples of sensitivity attribution methods include Integrated Gradients (Sundararajan et al., 2017), DeepLIFT (Shrikumar et al., 2017), and SmoothGrad (Smilkov et al., 2017). Unlike previous approaches, Mahé provides additive attribution interpretations that consist of non-additive groups of variables (interactions) in addition to the normal additive contributions of each variable.
Interaction Interpretability: An interaction in its generic form is a non-additive effect between features on an outcome variable. Only recently has there been development in interpreting non-additive interactions, even though they are often learned by complex machine learning models. The difficulty in interpreting non-additive interactions stems from their lack of an exact functional identity compared to, for example, a multiplicative interaction. Methods that exist to interpret non-additive interactions are NID (Tsang et al., 2017) and Additive Groves (Sorokina et al., 2008). In contrast, many more methods exist to interpret specific interactions, namely multiplicative ones. Notable methods include CD (Murdoch et al., 2018), TreeShap (Lundberg & Lee, 2017), and GLMs with multiplicative interactions (Purushotham et al., 2014). Unlike previous methods, our approach provides local interpretations of the more challenging non-additive interaction.
Locally Interpretable Model-Agnostic Explanations (LIME): LIME (Ribeiro et al., 2016) is a very popular type of model interpretation. Its popularity comes from its use of additive attribution interpretations to explain the output of any prediction model. The original and most popular version of LIME uses a linear model to approximate model predictions in the local vicinity of a data instance. Since its introduction, variants of LIME have been proposed, for example Anchors (Ribeiro et al., 2018) and LIME-SUP (Hu et al., 2018). While Anchors generates a form of context-free explanation, its method of selecting fully representative features for a prediction does not consider interactions. For example, Anchors assumes that (not, bad) “virtually guarantees” a sentiment prediction to be positive, whereas in Mahé this is not necessarily true; only their interaction is positive (see Table 6 for an example). LIME-SUP touches upon interactions but does not study their interpretation.
3 Interaction Explanations
Let $f(\cdot)$ be a target function (model) of interest, e.g. a classifier, and let $\phi(\cdot)$ be a local approximation of $f$ that is interpretable, in contrast to $f$. A common choice for $\phi$ is a linear model, which is interpretable in each linear term. Namely, for a data instance $\mathbf{x} \in \mathbb{R}^p$, weights $\mathbf{w} \in \mathbb{R}^p$, and bias $b$, interpretations are given by $w_i x_i$, known as additive attributions (Lundberg & Lee, 2017), from
$$\phi(\mathbf{x}) = \sum_{i=1}^{p} w_i x_i + b. \tag{1}$$
Given a set of data points that are infinitesimally close or local to , a linear approximation of will accurately fit to the functional surface of at the data instance, such that . Because it is possible in such scenarios that , there must be some nonzero distances between and to obtain informative attribution scores. LIME, as it was originally proposed, uses a linear approximation as above where samples are generated in a nonzero local vicinity of (Ribeiro et al., 2016). The drawback of linear LIME is that there is often an error .
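The linear local approximation described above can be sketched as a weighted least-squares fit. This is only a minimal illustration under stated assumptions, not the LIME library's implementation: the names `linear_lime`, `f`, `x`, and `sigma`, the Gaussian perturbation scheme, and the two toy target functions are all hypothetical choices made for this example.

```python
import numpy as np

def linear_lime(f, x, sigma=0.5, n_samples=2000, seed=0):
    """Fit a weighted linear surrogate to f around x.

    Returns (w, b, attributions, weighted_mse). All names and the
    Gaussian sampling scheme are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    # Sample points in a nonzero local vicinity of x.
    X = x + sigma * rng.standard_normal((n_samples, x.size))
    y = f(X)
    # Gaussian kernel on Euclidean distance gives the sample weights.
    dist = np.linalg.norm(X - x, axis=1)
    weights = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    # Weighted least squares: scale rows by sqrt(weight); last column is the bias.
    sqrt_w = np.sqrt(weights)
    A = np.hstack([X, np.ones((n_samples, 1))]) * sqrt_w[:, None]
    coef, *_ = np.linalg.lstsq(A, y * sqrt_w, rcond=None)
    w, b = coef[:-1], coef[-1]
    residuals = X @ w + b - y
    weighted_mse = float(np.average(residuals ** 2, weights=weights))
    return w, b, w * x, weighted_mse  # w * x are the additive attributions

def f_linear(X):
    # Toy target that a linear surrogate can fit exactly.
    return 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0

def f_product(X):
    # Toy target with a multiplicative interaction; a linear fit leaves error.
    return X[:, 0] * X[:, 1]
```

Running `linear_lime(f_product, np.array([1.0, 1.0]))` leaves a clearly nonzero weighted error in the vicinity, whereas the same call on `f_linear` does not, which is the drawback of the linear surrogate noted above.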
For complex models , the functional surface at can be nonlinear. Because consists of with distance from , a closer fit to in its nonlinear vicinity, i.e. , can be achieved with the following generalization of Eq. 1:
$$\phi(\mathbf{x}) = \sum_{i=1}^{p} g_i(x_i) + b, \tag{2}$$
where $g_i$ can be any function, for example one that is arbitrarily nonlinear. This function is called a generalized additive model (GAM) (Hastie & Tibshirani, 1990), and now attribution scores can be given by $g_i(x_i)$ for each feature $i$. For the purposes of interpreting individual feature attribution, the GAM may be enough. However, if we would like broader explanations, we can also obtain non-additive attributions or interactions between variables (Lou et al., 2013), which can provide an even better fit to the complex local vicinity. Expanding Eq. 2 with interactions yields:
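$$\phi_K(\mathbf{x}) = \sum_{i=1}^{p} g_i(x_i) + \sum_{k=1}^{K} g'_k(\mathbf{x}_{\mathcal{I}_k}) + b, \tag{3}$$
where $g'_k$ can again be any function, $\mathbf{x}_{\mathcal{I}_k}$ are interacting variables corresponding to the variable indices $\mathcal{I}_k$, and $\{\mathcal{I}_k\}_{k=1}^{K}$ is a set of $K$ interactions. Attribution scores are now generated from both $g_i$ and $g'_k$. In this paper, we learn $g_i$ and $g'_k$ using Multilayer Perceptrons (MLPs). Either $\phi$ or $\phi_K$ can be converted to classification by applying a sigmoid function.
Adding non-additive interactions $\mathcal{I}_k$ that are truly present in the local vicinity increases the representational capacity of $\phi_K$. $\mathcal{I}$ corresponds to non-additive interacting features if and only if $g'$ (Eq. 3) cannot be decomposed into a sum of $|\mathcal{I}|$ arbitrary subfunctions $\delta_i$, each not depending on a corresponding interacting variable (Tsang et al., 2017), i.e.
$$g'(\mathbf{x}_{\mathcal{I}}) \neq \sum_{i \in \mathcal{I}} \delta_i(\mathbf{x}_{\mathcal{I} \setminus \{i\}}).$$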
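For the pairwise case, the non-additivity condition above can be probed numerically. The sketch below is an illustrative check, not the detection method used in this paper: it relies on the fact that any two-variable function of the form d1(u) + d2(v) has a zero mixed difference, so a nonzero mixed difference proves the function is non-additive. The function names and the sampling range are assumptions.

```python
import numpy as np

def mixed_difference(g, a, a2, b, b2):
    """g(a,b) - g(a,b2) - g(a2,b) + g(a2,b2).

    This is exactly zero whenever g(u, v) = d1(u) + d2(v).
    """
    return g(a, b) - g(a, b2) - g(a2, b) + g(a2, b2)

def looks_non_additive(g, n_trials=100, tol=1e-8, seed=0):
    """True if some random mixed difference is nonzero (proof of non-additivity).

    A False result is only evidence of additivity on the sampled points.
    """
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1.0, 1.0, size=(n_trials, 4))
    return any(abs(mixed_difference(g, *p)) > tol for p in points)

def additive(u, v):
    return np.sin(u) + v ** 2   # decomposes into d1(u) + d2(v)

def multiplicative(u, v):
    return u * v                # mixed difference is (a - a2) * (b - b2)
```

Interactions of more than two variables need the general decomposition test stated above; this pairwise check is only the simplest instance.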
4 Mahé Framework
In this section, we introduce our Mahé framework, which can provide context-dependent and context-free explanations of interactions. To provide context-dependent explanations, we propose to use a two-step procedure that first identifies what variables interact locally, then learns a model of interactions (as Eq. 3) to provide a local interaction score at the data instance in question. The procedure of first detecting interactions then building non-additive models for them has been studied previously (Lou et al., 2013; Tsang et al., 2017); however, previous works have not focused on using the same non-additive models to provide local interaction attribution scores, which enable us to visualize interactions of any size as demonstrated later in §5.2.3.
4.1 Context-Dependent Explanations
Local Interaction Detection: To perform interaction detection on samples in the local vicinity of data instance , we first sample points in the neighborhood of with a maximum neighborhood distance under a distance metric . While the choice of depends on the feature type(s) of , we always set , i.e. one standard deviation from the mean of a Gaussian weighted sampling kernel. When all features are continuous, neighborhood points are sampled with mean and to generate , , where is a normal distribution truncated at . When features are categorical, they are converted to a one-hot binary representation. For of binary features, we sample each point around by first selecting a number of random features to flip (or perturb) from a uniform distribution between and . The max number of flips is derived from for a distance metric that is usually cosine distance (Ribeiro et al., 2016). Distances between local samples and are then weighted by a Gaussian kernel to become sample weights (e.g. the frequency each sample appears in the sampled dataset). For context-dependent explanations, the exact choice of depends on the stability and interaction orders of explanations. The interaction orders may become too large and uninformative because the local vicinity area covers too much complex representation from . Thus we recommend tuning to the task at hand.
Footnote 1: In cases where features are a mixture of continuous and one-hot categorical variables, a way of sampling points is to adapt the approach for binary features to handle the mixture of feature types (Ribeiro et al., 2016). The main difference now is that continuous features are drawn from a uniform distribution truncated at and are standard scaled to have similar magnitudes as the binary features. Since continuous features are present, can be distance, then a Gaussian kernel can be applied to sample distances as before.
Our framework is flexible to any interaction detection method that applies to the dataset . Since we seek to detect non-additive interactions, we use the neural interaction detection (NID) framework (Tsang et al., 2017), which interprets learned neural network weights to obtain interactions. To the best of our knowledge, this detection method is the only polynomial-time algorithm that accurately ranks any-order non-additive interactions after training one model, compared to alternative methods that must train an exponential number of models. The basic idea of NID is to interpret an MLP’s accurate representation of data to accurately identify the statistical interactions present in this data. Because MLPs learn interactions at nonlinear activation functions, NID performs feature interaction detection by tracing high-strength regularized weights from features to common hidden units. In particular, NID efficiently detects any-order interactions by first assuming each first layer hidden unit in a trained MLP captures at most one interaction, then NID greedily identifies these interactions and their strengths through a 2D traversal over the MLP’s input weight matrix, . The result is that instead of testing for interactions by training models, now only models and tests are needed.
In addition to its efficiency, applying NID to our framework Mahé has several advantages. One is the universal approximation capabilities of MLPs (Hornik, 1991), allowing them to approximate arbitrary interacting functions in the potentially complex local vicinity of . Another advantage is the independence of features in the sampled points of . Normally, interaction detection methods cannot identify high interaction strengths involving a feature that is correlated with others because interaction signals spread and weaken among correlated variables (Sorokina et al., 2008). Without facing correlations, NID can focus more on interpreting the data-generating function, the target model . One disadvantage of our application of NID is the curse of dimensionality for MLPs when is large (e.g. ) (Theodoridis et al., 2008), which is oftentimes the case for images. In general, large input dimensions should be reduced as much as possible to avoid overfitting. For images, is normally reduced in model-agnostic explanation methods by using segmented aggregations of pixels called superpixels as features (Ribeiro et al., 2016; Lundberg & Lee, 2017; Ribeiro et al., 2018).
Hierarchical Interaction Attributions: Upon obtaining an interaction ranking from NID, GAMs with interactions (Eq. 3) can be learned for different top interactions ranked by their strengths (Tsang et al., 2017). In the Mahé framework, there are different levels of a hierarchical explanation which constitutes our context-dependent explanation, where is the number of levels with interaction explanations, and at the last level. When presenting the hierarchy such as Figure 1 Step 3, the first level shows the additive attributions of individual features from by a trained in Eqs. 1 or 2, such as the explanation from linear LIME. Subsequently, the parameters of are frozen before interaction models are added to construct in Eq. 3. The next levels of the hierarchy can be presented as either the interaction attribution of as in Figure 1 or those of (Eq. 3), where at each level is increased and either or are (re)trained. Interaction models are trained on the residual of to maintain consistent univariate explanations and to prevent degeneracy in univariate functions from overlapping interaction functions. Since is trained at each hierarchical level on , the fit of each can also be explained via predictive performance, such as performance in Figure 1 Step 3. The stopping criteria for the number of hierarchical levels can depend on the predictive performance or user preference.
4.2 Context-Free Explanations
In order to provide context-free explanations, we propose determining whether the local interactions assumed to be context-dependent in §4.1 can generalize to explain global behavior in . To this end, we first define ideal conditions for which a generic local explanation can generalize. For choosing distance metric and sampling points in the local vicinity of , please refer to §4.1 and our considerations for generalizing explanations at the end of this section.
Definition 1 (Generalizing Local Explanations).
Let be the model output we wish to explain, and be the data domain of . Let a local explanation of at be some explanation that is true for and depends on samples that are only in the local vicinity of , i.e. provided a distance metric and distance . The local explanation is a global explanation if the following two conditions are met: 1) Explanation is true for at all data samples in , including samples outside the local vicinity of , i.e. all samples satisfying . 2) There exists a sample and a local modification to (modifying in the vicinity ) that changes for all samples in while still meeting condition 1).
For example, consider a simple linear regression model we wish to explain, . Let its local explanation be the feature attributions and . This local explanation is a global explanation because 1) for all values of and , the feature attributions are still and , and 2) if any of the weights are changed, e.g. , the attribution explanation will change, but the feature attributions are still and for all values of and .
Our context-free explanation of interaction is: whenever local interaction exists, its attribution will in general have the same polarity (or sign). Since it is impossible to empirically prove that a local explanation is true for all data instances globally (via Definition 1), this work is focused on providing evidence of context-free interactions. This evidence can be obtained by checking whether our explanation is consistent with the two conditions from Definition 1 for the interaction of interest : 1) For representative data instances in the domain of , if local interaction exists, does it always have the same attribution polarity? The representative data instances should be separated from each other at an average distance beyond . 2) Can local interaction at a single data instance be used to negate ’s attribution polarity for all representative data instances where exists?
The advantage of checking the response of to local modification is determining if consistent explanations across data instances are more than just coincidence. This is especially important when only a limited number of data instances are available to test on. We propose to modify an interaction attribution of the model’s output at data instance by utilizing a trained model of interaction , where (Eq. 3). Let be a modified version of . We can then define a modified form of Eq. 3:
(4) 
Without retraining , we use and the same local vicinity in to generate a new dataset . Finally, we can modify the interaction attribution of by fine-tuning on dataset . In this paper, we modify interactions by negating them: , where negates the interaction attribution with a specified magnitude .
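One way to picture the dataset-generation step is sketched below. It rests on an assumed reading of the text: the modified surrogate keeps the univariate part of Eq. 3 and swaps the interaction term for its negation scaled by a magnitude `c`. The function name `modified_targets`, its arguments, and that exact form are hypothetical, and the subsequent fine-tuning of the target model is not shown.

```python
import numpy as np

def modified_targets(univariate_part, g_prime, X, interaction, c=1.0):
    """Targets from a surrogate before and after negating one interaction.

    univariate_part: callable returning the additive part (plus bias) for X.
    g_prime: callable giving the trained interaction model's output.
    interaction: list of column indices of the interacting features.
    c: assumed magnitude of the negation.
    """
    g_vals = g_prime(X[:, interaction])
    original = univariate_part(X) + g_vals        # surrogate with g'
    modified = univariate_part(X) - c * g_vals    # surrogate with -c * g'
    return original, modified

# Hypothetical stand-ins for trained surrogate components.
def univariate(X):
    return 0.5 * X[:, 0] + 0.1

def product_interaction(Z):
    return Z[:, 0] * Z[:, 1]

X_local = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]])
y_before, y_after = modified_targets(univariate, product_interaction, X_local, [1, 2])
```

The pairs `(X_local, y_after)` would then play the role of the new dataset on which the target model is fine-tuned.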
How can modifying a local interaction affect interactions outside its local vicinity? This would suggest that the manifold hypothesis is true for ’s representations of these interactions (Figure 2). The manifold hypothesis states that similar data lie near a low-dimensional manifold in a high-dimensional space (Turk & Pentland, 1991; Lee et al., 2003; Cayton, 2005). Studies have suggested that the hypothesis applies to the data representations learned by neural networks (Rifai et al., 2011; Basri & Jacobs, 2016). The hypothesis is frequently used to visualize how deep networks represent data clusters (Maaten & Hinton, 2008; LeCun et al., 2015), and it has been applied to representations of interactions (Reed et al., 2014), but not for neural networks.
Part of our objective is to generalize our explanation as much as possible. In the case of language-related tasks, we additionally generalize based on our meaning of a local interaction and the distance metric we use, . In this paper, local interactions for language tasks do not have word interactions fixed to specific positions; instead, these interactions are only defined by the words themselves (the interaction values) and their positional order. For example, the (“not”, “bad”) interaction would match in the sentences: “this is not bad” and “this does not seem that bad”. For comparing texts and measuring vicinity sizes, we use edit distance (Levenshtein, 1966), which allows us to compare sentences with different word counts. Although we define distance metrics for each domain (§5.1), we found that our results were not very sensitive to the exact choice of valid distance metric.
Footnote 2: Unfortunately, for image-related tasks, we could not generalize our definition of local interactions despite the translation invariance of deep convnets.
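The two ingredients described for language tasks, order-based interaction matching and edit distance, can be sketched as follows. Both functions are illustrative assumptions: in particular, the sketch computes edit distance over word tokens, which is one plausible reading of the text and not a confirmed detail.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (lists of words)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def matches_interaction(tokens, interaction):
    """True if the interaction words occur in `tokens` in order, gaps allowed."""
    remaining = iter(tokens)
    # `word in remaining` consumes the iterator up to the first match,
    # so later words must appear after earlier ones.
    return all(word in remaining for word in interaction)

s1 = "this is not bad".split()
s2 = "this does not seem that bad".split()
both_match = matches_interaction(s1, ("not", "bad")) and matches_interaction(s2, ("not", "bad"))
distance = edit_distance(s1, s2)
```

Because matching ignores absolute positions, the same interaction can be collected from sentences of different lengths, and the edit distance then measures how far apart those sentences are.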
5 Experiments
5.1 Experimental Setup
We evaluate the effectiveness of Mahé first on synthetic data and then on four real-world datasets. To evaluate context-dependent explanations of Mahé, we first evaluate the accuracy of Mahé at local interaction detection and modeling on the outputs of complex base models trained on synthetic ground truth interactions. We compare Mahé to ShapTree (Lundberg et al., 2018), ACD-MLP (Singh et al., 2018), and ACD-LSTM (Murdoch et al., 2018; Singh et al., 2018), which are local interaction modeling baselines for the respective models they explain: XGBoost (Chen & Guestrin, 2016), multilayer perceptrons (MLP), and long short-term memory networks (LSTM) (Hochreiter & Schmidhuber, 1997). Synthetic datasets have features (Table 2).
models  dataset  average 
DNA-CNN  MYC-DNA  
Sentiment-LSTM  SST  
ResNet-152  ImageNet ‘14  
Transformer  WikiText-103 
In all other experiments, we study Mahé’s explanations of state-of-the-art level models trained on real-world datasets. The state-of-the-art models are: 1) DNA-CNN, a 2-layer 1D convolutional neural network (CNN) trained on MYC-DNA binding data (Mordelet et al., 2013; Yang et al., 2013; Alipanahi et al., 2015; Zeng et al., 2016; Wang et al., 2018), 2) Sentiment-LSTM, a 2-layer bidirectional LSTM trained on the Stanford Sentiment Treebank (SST) (Socher et al., 2013; Tai et al., 2015), 3) ResNet-152, an image classifier pretrained on ImageNet ‘14 (Russakovsky et al., 2015; He et al., 2016), and 4) Transformer, a machine translation model pretrained on WMT14 En-Fr (Vaswani et al., 2017; Ott et al., 2018). Avg. for our context-dependent evaluations, similar to our context-free tests, are shown in Table 1.
Footnote 3: The motif and flanking regions of DNA sequences in the training set are shuffled to simulate unalignment.
The following hyperparameters are used in our experiments. We use local-vicinity samples in for synthetic experiments and samples for experiments explaining models of real-world datasets, with  train-validation-test splits to train and evaluate Mahé. The distance metrics for vicinity size are: distance for synthetic experiments, cosine distance for DNA-CNN and ResNet-152, and edit distance for Sentiment-LSTM and Transformer. We use on-off superpixel and word approaches to binary feature representation for explaining ResNet-152 and Sentiment-LSTM respectively (Ribeiro et al., 2016; Lundberg & Lee, 2017), and the other experiments for real-world datasets use perturbation distributions that randomly perturb features to belong to the same categories of original features, as in (Ribeiro et al., 2018). The superpixel segmenter we use is quick-shift (Vedaldi & Soatto, 2008; Ribeiro et al., 2016).
For the hyperparameters of the neural networks in Mahé, we use MLPs with  first-to-last hidden layer sizes to perform interaction detection in the NID framework (Tsang et al., 2017). These MLPs are trained with regularization . The learning rate used is always except for Transformer experiments, whose learning rate of helped with interaction detection under highly unbalanced output classes. The MLP-based interaction models in the GAM (Eq. 3) always have architectures of . They are trained with regularization of and learning rate of . Because learning GAMs can be slow, we make a linear approximation of the univariate functions in Eq. 3, such that . This approximation also allows us to make direct comparisons between Mahé and linear LIME, since is exactly the linear part (Eq. 1). All neural networks train with early stopping, and Level is decided where validation performance does not improve more than with a patience of levels. ranges from to in our experiments.
5.2 Context-Dependent Explanations
5.2.1 Synthetic Experiments
In order to evaluate Mahé’s context-dependent explanations, we first compare them to state-of-the-art methods for local interaction interpretation. A standard way to evaluate the accuracy of interaction detection and modeling methods has been to experiment on synthetic data because ground truth interactions are generally unknown in real-world data (Hooker, 2004; Sorokina et al., 2008; Lou et al., 2013; Tsang et al., 2017). Similar to Hooker (2007), we evaluate interactions in a subset region of a synthetic function domain. We generate synthetic data using functions (Table 2) with continuous features uniformly distributed between to , train complex base models (as specified in §5.1) on this data, and run different local interaction interpretation methods on trials of data instances at randomly sampled locations on the synthetic function domain. Between trials, base models with different random initializations are trained to evaluate the stability of each interpretation method. We evaluate how well each method fits to interactions by first assuming the true interacting variables are known, then computing the Mean Squared Error (MSE) between the predicted interaction attribution of each interpretation method and the ground truth at uniformly drawn locations within the local vicinity of a data instance, averaged over all randomly sampled data instances and trials (Figure 3a). We also evaluate the interaction detection performance of each method by comparing the average R-precision (Manning et al., 2008) of their interaction rankings across the same sampled data instances (Figure 3b). R-precision is the percentage of the top items in a ranking that are correct out of , the number of correct items. Since only ever have ground truth interaction, is always . Compared to ShapTree, ACD-MLP, and ACD-LSTM, the Mahé framework is the only one capable of detection and fitting, and it is the only model-agnostic approach.
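R-precision as described above can be computed in a few lines. This is a minimal sketch of the standard definition; the function name and the representation of an interaction as a `frozenset` of feature indices are assumptions of the example.

```python
def r_precision(ranking, relevant):
    """Fraction of the top-R ranked items that are correct, where R = len(relevant)."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-precision is undefined without ground-truth items")
    hits = sum(1 for item in ranking[:r] if item in relevant)
    return hits / r

# Interactions are represented as frozensets of feature indices (an assumption).
ground_truth = [frozenset({0, 1})]
good_ranking = [frozenset({0, 1}), frozenset({2, 3})]
bad_ranking = [frozenset({2, 3}), frozenset({0, 1})]
```

With a single ground-truth interaction, the score reduces to whether the top-ranked interaction is the correct one.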
5.2.2 Evaluating on Real-World Data
In this section, we demonstrate our approaches to evaluating Mahé’s context-dependent explanations on real-world data. We first evaluate the prediction performance of Mahé on the test set of as interactions are added in Eq. 3, i.e. increases. For a given value of , we run Mahé times on each of randomly selected data instances from the test sets associated with DNA-CNN, Sentiment-LSTM, and ResNet-152. For Transformer, performance is examined on a specific grammar (cet) translation, to be detailed in §5.3. The local vicinity samples and model initializations in Mahé are randomized in every trial. We select the that gives the worst performance for Mahé at in each base model, out of , , , and , where is the average pairwise distance between data instances in respective test sets. Results are shown in Table 3 for starting from , which is linear LIME, and increasing to the last hierarchical level .
DNA-CNN  Sentiment-LSTM  ResNet-152  Transformer  
linear LIME  
Mahé  
Mahé 
Figure 4: Example of explanations that Mechanical Turk users choose from for a sentiment analysis task. (a) is linear LIME, (b) is Mahé. LIME explanations are shown as positive and negative contributions of each feature (word) to the prediction, and Mahé explanations are shown similarly with one of the contributions belonging to a single interaction or group of words.
An alternative approach to evaluating Mahé is to ask: given LIME and Mahé explanations, would human evaluators prefer the Mahé explanations? We recruit a total of Amazon Mechanical Turk users to participate in comparing explanations of Sentiment-LSTM predictions. While the presented LIME explanations are standard, we adjust Mahé to only show the interaction and merge its attribution with subsumed features’ attributions to make the difference between LIME and Mahé subtle (Figure 4). We present evaluators with explanations for randomly selected test sentences under the main condition that these sentences must have at least one detected interaction, which is the case for of sentences. In total, there are explanations for sentences, each of which is examined by evaluators, and a majority vote of their preference is taken. Each evaluator is only allowed to pick between explanations for a maximum of sentences. Please see Appendix B for additional conditions used to select sentences for evaluators and more examples like Figure 4. The result of this experiment is that the majority of preferred explanations (, ) is with interactions, supporting their inclusion in hierarchical explanations.
5.2.3 Hierarchical Explanations



Interaction  %cet  %cet  
(this, event)  
(this, article)  
(this, incident)  
(this, album)  
(this, arrangement)  
(that, afternoon)  
(this, location)  
(this, effect) 
Sample English-French Translations 


English  This event took place on 10 August 2008.  
Fr. before  Cet événement a eu lieu le 10 Mars 2008.  
Fr. after  Cette rencontre a eu lieu le 10 Mars 2008.  
English  This incident made it into the music video.  
Fr. before  Cet incident a été intégré dans le vidéo musical.  
Fr. after  C’est pas mal du tout ca!  
English  The initial language of this article was French.  
Fr. before  La langue initiale de cet article était le Français.  
Fr. after  La langue originale du présent article était le Français. 
Examples of context-dependent hierarchical explanations for ResNet-152, Sentiment-LSTM, and Transformer are shown in Figure 6, Table 6, and Appendix E respectively after page 9. For the image explanations in Figure 6, superpixels belonging to the same entity often interact to support its prediction. One interesting exception is shown in Figure 6 (d): water is not detected as an important interaction with buffalo in the prediction of water buffalo. This could be due to various reasons. For example, water may not be a discriminatory feature because there is a mix of training images of water buffalo in ImageNet with and without water. The same is true for related classes like bison. Explanations may also appear unintuitive when a model misbehaves. Therefore, quantitative validations, such as the predictive performance of adding interactions in each hierarchical level (e.g. scores in Figure 6), can be critical for trusting explanations.
5.3 Context-Free Explanations
In this section, we show examples of context-free explanations of interactions found by Mahé. We first study the context-free interactions learned by Sentiment-LSTM. To have enough sentences for this evaluation, we use data from IMDB movie reviews (Maas et al., 2011) in addition to the test set of SST. Based on our results (Figure 5), we observe that the polarities of certain local interactions are almost always the same, where the words of matching interactions can be separated by any number of words in between. To ensure that this global behavior is not a coincidence, we modify local interaction behavior in Sentiment-LSTM to check for a global change in this behavior (§4.2). As a result, when the model’s local interaction attribution at a single data instance is negated, the attribution almost always takes the opposite sign for the rest of the sentences.
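The polarity check described above amounts to measuring how often an interaction's attribution shares one sign across instances, before and after modification. The sketch below illustrates that bookkeeping only; the function name is hypothetical and the attribution values are made-up placeholders, not results from the experiments.

```python
import numpy as np

def polarity_agreement(attributions):
    """Return (majority_sign, fraction of nonzero attributions with that sign)."""
    signs = np.sign(np.asarray(attributions, dtype=float))
    signs = signs[signs != 0]            # ignore exactly-zero attributions
    if signs.size == 0:
        return 0, 0.0
    majority = 1 if np.sum(signs) >= 0 else -1
    return majority, float(np.mean(signs == majority))

# Placeholder attribution values for one interaction across four sentences.
before = [0.4, 0.1, 0.3, -0.05]
after = [-0.3, -0.2, -0.1, -0.4]
sign_before, agree_before = polarity_agreement(before)
sign_after, agree_after = polarity_agreement(after)
```

A high agreement with one sign before modification, followed by a high agreement with the opposite sign afterwards, is the pattern that would count as evidence for a context-free interaction.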
A notable insight about Sentiment-LSTM is that it appears to represent (too, bad) and (only, worse) as globally positive sentiments, and Mahé’s modification in large part rectifies this misbehavior (Figure 5). The modifications to Sentiment-LSTM only cause an average reduction of test accuracy, indicating that the original learned representation stays largely intact. Results for are shown with the average pairwise edit distance between sentences being . Words in detected interactions are separated by words on average.
Next, we study the possibility of identifying contextfree interactions in Transformer on a known form of interaction in EnglishtoFrench translations: translations into a special French word for “this” or “that”, cet, which only appears when the noun it modifies begins with a vowel. Some examples of cet interactions are (this, event), (this, article), and (this, incident), whose nouns have the same starting vowels in French. For our explanation task, the presence of cet in a translation is used as a binary prediction variable for local interaction extraction. To minimize the sources of cet, we limit original sentence lengths to 15 words, and we perform translations on WikiText103 (Merity et al., 2016) to evaluate on enough sentences. The results of contextfree experiments on cet interactions of adjacent English words are shown in Table 5. The interactions always have positive polarities towards cet, and after modifying Transformer at a single data instance for a given interaction, its polarity almost always becomes negative, just like the contextfree interactions in SentimentLSTM. Examples of new translations from the modified Transformer are shown in the “after” rows in Table 5, where cet now disappears from the translations. The test BLEU score of Transformer only decreases by an average percent difference of from modification, which is done through differentiating the max value of cet output neurons over all translated words. Results for , are shown.
Experiments on DNACNN and ResNet152 show similar results at fixed interaction positions (§4.2). For DNACNN, out of the times a 6way interaction of the CACGTG motif (Sharon et al., 2008) was detected in the test set, every detection yielded a positive attribution polarity towards DNAprotein affinity, and the same was true after modifying the model in the opposite polarity (cosine distance , ). For ResNet152, contextfree interactions are also found (cosine distance , ). However, because superpixels are used, the interactions found may contain artifacts caused by superpixel segmenters, yielding less intuitive interactions (see Appendix A).
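The binary prediction variable used for the cet experiments above can be sketched as a simple token test on the translated sentence, together with the 15-word limit on source sentences. The regular-expression tokenizer and the function names are illustrative assumptions, not the preprocessing used in the experiments.

```python
import re

# Lowercase French letters, so that "cette" and "cet" are distinct tokens.
_WORD = re.compile(r"[a-zàâçéèêëîïôûùüÿœæ]+")

def contains_cet(translation):
    """Binary target: 1 if the token "cet" occurs in the French translation, else 0."""
    tokens = _WORD.findall(translation.lower())
    return int("cet" in tokens)

def within_length(source_sentence, max_words=15):
    """Keep only English source sentences of at most `max_words` words."""
    return len(source_sentence.split()) <= max_words

label = contains_cet("Cet article a été mis à jour.")
```

Matching whole tokens matters here: a substring test would also fire on cette, which is a different word and would add spurious positives to the prediction variable.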
5.4 Limitations
Although Mahé obtains accurate local interactions on synthetic data using NID, there is no guarantee that NID finds correct interactions. Mahé faces common issues of modelagnostic perturbation methods in interpreting highdimensional feature spaces, choice of perturbation distribution, and speed (Ribeiro et al., 2016, 2018). Finally, an exhaustive search is used for contextfree explanations.
6 Conclusion
In this work, we proposed Mahé, a modelagnostic framework for providing contextdependent and contextfree explanations of local interactions. Mahé has demonstrated the capability of outperforming existing approaches to local interaction interpretation and has shown that local interactions can be contextfree. In future work, we wish to make the process of finding contextfree interactions more efficient, and study to what extent model behavior can be changed by editing its interactions or univariate effects. Finally, we would like to study the interpretations provided by Mahé more closely to find new insights into structured data.
Method  Level  Fit  Hierarchical Explanation  Max magn. 
linear LIME  1  0.621  the film is really not so much bad as bland  0.744 
Mahé  2  0.751  not, bad  
Mahé  3  0.916  not, bad, bland  
Mahé  4  0.926  film, not, bad, bland  0.119 
linear LIME  1  a very average science fiction film  
Mahé  2  science, fiction  
Mahé  3  a, average  
Mahé  4  a, very, average  
linear LIME  1  a charming romantic comedy that is by far the lightest dogme film and among the most enjoyable  
Mahé  2  charming, enjoyable  
Mahé  3  charming, lightest, enjoyable  0.072 
References
 Alipanahi et al. (2015) Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of dnaand rnabinding proteins by deep learning. Nature biotechnology, 33(8):831, 2015.
 Basri & Jacobs (2016) Ronen Basri and David Jacobs. Efficient representation of lowdimensional manifolds using deep networks. arXiv preprint arXiv:1602.04723, 2016.
 Bien et al. (2013) Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.
 Cayton (2005) Lawrence Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12(117):1, 2005.
 Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794. ACM, 2016.
 Hastie & Tibshirani (1990) Trevor J Hastie and Robert J Tibshirani. Generalized additive models, 1990.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Hooker (2004) Giles Hooker. Discovering additive structure in black box functions. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 575–580. ACM, 2004.
 Hooker (2007) Giles Hooker. Generalized functional anova diagnostics for highdimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16(3):709–732, 2007.
 Hornik (1991) Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
 Hu et al. (2018) Linwei Hu, Jie Chen, Vijayan N Nair, and Agus Sudjianto. Locally interpretable models and effects based on supervised partitioning (limesup). arXiv preprint arXiv:1806.00663, 2018.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, pp. 3, 2017.
 Kim et al. (2018) Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International Conference on Machine Learning, pp. 2673–2682, 2018.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
 Lee et al. (2003) KuangChih Lee, Jeffrey Ho, MingHsuan Yang, and David Kriegman. Videobased face recognition using probabilistic appearance manifolds. In Computer vision and pattern recognition, 2003. proceedings. 2003 ieee computer society conference on, volume 1, pp. I–I. IEEE, 2003.
 Levenshtein (1966) Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pp. 707–710, 1966.
 Lou et al. (2013) Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 623–631. ACM, 2013.
 Lundberg & Lee (2017) Scott M Lundberg and SuIn Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.
 Lundberg et al. (2018) Scott M Lundberg, Gabriel G Erion, and SuIn Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
 Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P111015.
 Maaten & Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using tsne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.
 Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Mordelet et al. (2013) Fantine Mordelet, John Horton, Alexander J Hartemink, Barbara E Engelhardt, and Raluca Gordân. Stability selection for regressionbased models of transcription factor–dna binding specificity. Bioinformatics, 29(13):i117–i125, 2013.
 Murdoch et al. (2018) W James Murdoch, Peter J Liu, and Bin Yu. Beyond word importance: Contextual decomposition to extract interactions from lstms. arXiv preprint arXiv:1801.05453, 2018.
 Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018.
 Purushotham et al. (2014) Sanjay Purushotham, Martin Renqiang Min, CC Jay Kuo, and Rachel Ostroff. Factorized sparse learning models with interpretable high order feature interactions. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 552–561. ACM, 2014.
 Reed et al. (2014) Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pp. 1431–1439, 2014.
 Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. ACM, 2016.
 Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: Highprecision modelagnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.
 Rifai et al. (2011) Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pp. 2294–2302, 2011.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Sharon et al. (2008) Eilon Sharon, Shai Lubliner, and Eran Segal. A featurebased approach to modeling protein–dna interactions. PLoS computational biology, 4(8):e1000154, 2008.
 Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
 Singh et al. (2018) Chandan Singh, W James Murdoch, and Bin Yu. Hierarchical interpretations for neural network predictions. arXiv preprint arXiv:1806.05337, 2018.
 Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
 Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.
 Sorokina et al. (2008) Daria Sorokina, Rich Caruana, Mirek Riedewald, and Daniel Fink. Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th international conference on Machine learning, pp. 1000–1007. ACM, 2008.
 Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328, 2017.
 Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from treestructured long shortterm memory networks. arXiv preprint arXiv:1503.00075, 2015.
 Theodoridis et al. (2008) Sergios Theodoridis, Konstantinos Koutroumbas, et al. Pattern recognition. IEEE Transactions on Neural Networks, 19(2):376, 2008.
 Tsang et al. (2017) Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977, 2017.
 Turk & Pentland (1991) Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Vedaldi & Soatto (2008) Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pp. 705–718. Springer, 2008.
 Wang et al. (2018) Meng Wang, Cheng Tai, Weinan E, and Liping Wei. Define: deep convolutional neural networks accurately quantify intensities of transcription factordna binding and facilitate evaluation of functional noncoding variants. Nucleic acids research, 46(11):e69–e69, 2018.
 Yang et al. (2013) Lin Yang, Tianyin Zhou, Iris Dror, Anthony Mathelier, Wyeth W Wasserman, Raluca Gordân, and Remo Rohs. Tfbsshape: a motif database for dna shape features of transcription factor binding sites. Nucleic acids research, 42(D1):D148–D155, 2013.
 Zeng et al. (2016) Haoyang Zeng, Matthew D Edwards, Ge Liu, and David K Gifford. Convolutional neural network architectures for predicting dna–protein binding. Bioinformatics, 32(12):i121–i127, 2016.
Supplementary Materials
Appendix A ContextFree explanations in ResNet152
Appendix B Further details of Mechanical Turk experiment
Besides requiring detected interactions, several other conditions were used to choose sentences for Mechanical Turk evaluators. We ensure that there is a significant attribution difference between LIME and Mahé by only choosing among sentences that have a polarity difference between Mahé’s interaction and LIME’s corresponding linear attributions. To reduce ambiguities of uninterpretable explanations arising from a misbehaving model (an issue also faced by Sundararajan et al. (2017) in interpretation evaluation), we only show explanations of sentences that the model classified correctly. We also attempt to limit the effort that evaluators need to analyze explanations by only showing sentences with  words, with uniform representation of each sentence length.
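The selection conditions above can be summarized in a short sketch. The `Candidate` record, the way LIME's "corresponding" attribution is aggregated (a sum over the interacting words), and the length bounds and per-length sample size are all hypothetical and only serve to make the filtering logic concrete.

```python
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Candidate:
    words: list                     # tokenized sentence
    predicted_label: int            # label predicted by the model
    true_label: int                 # ground-truth label
    interaction: tuple              # indices of the interacting words
    interaction_attribution: float  # attribution of the interaction
    linear_attributions: list       # per-word linear (LIME-style) attributions

def sign(value):
    return (value > 0) - (value < 0)

def is_eligible(c, min_words, max_words):
    """Correctly classified, within the length bounds, and with opposite polarities."""
    # Assumption: the linear attribution "corresponding" to an interaction is
    # the sum of the linear attributions of its words.
    lime_total = sum(c.linear_attributions[i] for i in c.interaction)
    opposite = sign(c.interaction_attribution) * sign(lime_total) < 0
    correct = c.predicted_label == c.true_label
    return opposite and correct and min_words <= len(c.words) <= max_words

def select_uniform_by_length(candidates, min_words, max_words, per_length, seed=0):
    """Sample up to `per_length` eligible sentences for every sentence length."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for c in candidates:
        if is_eligible(c, min_words, max_words):
            by_length[len(c.words)].append(c)
    chosen = []
    for length in sorted(by_length):
        group = by_length[length]
        chosen.extend(rng.sample(group, min(per_length, len(group))))
    return chosen
```

Sampling the same number of sentences per length is one simple way to obtain a uniform representation of sentence lengths; other balancing schemes would serve the same purpose.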
An example of the interface that evaluators select from is shown in Figure 8. Figure 9 shows randomly selected examples that evaluators analyze. The visualization tool for presenting additive attribution explanations is graciously provided by the official code repository of LIME (https://github.com/marcotcr/lime).

Appendix C Runtime
Figures 10 and 11 show runtimes of contextdependent and contextfree explanations using Mahé. All experiments were conducted on Intel Xeon  GHz CPUs and Nvidia Ti GPUs. Experiments with MLPs were run on CPUs and inference/retraining of DNACNN, SentimentLSTM, ResNet152, and Transformer were run on GPUs.
Appendix D Comparisons to Baselines for ContextFree Explanations
Appendix E Hierarchical explanations of cet interactions in Transformer
Method  Level  Fit  Hierarchical Explanation  Max magn. 
linear LIME  0.782  this article was last updated on substance in august 2012  0.657  
Mahé  0.948  this, article  3.707  
linear LIME  0.696  this effect takes part in making lead slightly less reactive chemically  0.643  
Mahé  0.96  this, effect  2.459  
linear LIME  0.734  the population size of this bird has not yet been quantified or estimated  0.605  
Mahé  0.926  this, bird  1.211 
Appendix F Experiments with large number of features
We performed experiments on the accuracy and runtime of the MLP used for interaction detection (via NID) on datasets with large number of features. We generate synthetic data of samples and features with randomly generated pairwise interactions of using the following equation (Purushotham et al., 2014):
where is the instance of the design matrix , is the instance of the response variable , contains the weights of pairwise interactions, contains the weights of main effects, and . was generated as a sum of rank one matrices, . is normally distributed with mean and variance . Both and are sparse vectors of nonzero density and are normally distributed with mean and variance . was set to be .
We found that in low settings, i.e. , only needed to be at least to recover  pairwise interactions at AUC. Increasing to still required , but performance stability significantly improved between and for detecting  interactions. When k, we could not detect interactions at and did not study further due to large training time. In general, increasing by an order of magnitude at fixed required 49x more runtime. As a rough estimate, increasing by an order of magnitude at fixed required x more runtime. There is high variance in the runtime associated with increasing because of the early stopping used.
Based on our experiments, we recommend limiting to be under , so that model training can complete in under seconds. Once interaction detection via NID is done, the extracted interaction sets tend to be much smaller than , and (Eq. 3) for each interaction is likely to train faster than the original MLP with inputs. We note that identifying interactions in high dimensional input spaces like images and image models is an interesting and challenging research problem and is left for future work.
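To make the data-generating process of this appendix concrete, the sketch below draws a response as a sum of sparse main effects and pairwise interactions whose weight matrix is a sum of rank one matrices. All symbol names (`beta`, `factors`, `num_rank_one`), the unit-variance normal draws, the optional noise term, and the numeric settings in the usage line are assumptions chosen for illustration; they are not the values used in these experiments.

```python
import random

def sparse_normal_vector(p, density, rng):
    """Length-p vector: each entry is N(0, 1) with probability `density`, else 0."""
    return [rng.gauss(0.0, 1.0) if rng.random() < density else 0.0 for _ in range(p)]

def generate_dataset(n, p, num_rank_one, density, noise_std=0.0, seed=0):
    """Draw n samples with y = beta.x + x^T W x (+ optional noise)."""
    rng = random.Random(seed)
    beta = sparse_normal_vector(p, density, rng)  # main-effect weights
    # W is never materialized: it is the sum of a_k a_k^T over these factors.
    factors = [sparse_normal_vector(p, density, rng) for _ in range(num_rank_one)]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        main = sum(b * v for b, v in zip(beta, x))
        # x^T W x with W = sum_k a_k a_k^T equals sum_k (a_k . x)^2.
        pairwise = sum(sum(a * v for a, v in zip(a_k, x)) ** 2 for a_k in factors)
        X.append(x)
        y.append(main + pairwise + rng.gauss(0.0, noise_std))
    return X, y, beta, factors

def interacting_pairs(factors):
    """Feature pairs (i, j), i < j, whose off-diagonal weight in W is nonzero."""
    p = len(factors[0])
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if sum(a[i] * a[j] for a in factors) != 0.0]

# Hypothetical sizes, far smaller than those studied in this appendix.
X, y, beta, factors = generate_dataset(n=100, p=20, num_rank_one=3, density=0.2)
ground_truth = interacting_pairs(factors)
```

The list returned by `interacting_pairs` can serve as ground truth when scoring a detector's ranking of feature pairs, for example with AUC.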
Appendix G More examples of interactions with consistent polarities in SentimentLSTM
Interaction  num samples  percent polarity 
(not, good)  negative  
(falls, flat)  negative  
(not, funny)  negative  
(not, miss)  positive  
(still, love)  positive  
(bad, worst)  positive  
(never, off)  positive 