Figure 1: During testing, we fit decision trees to our model and an unregularized model on molecule property prediction at the same local neighborhood such that the functional approximations are comparable in AUC (because the scale is not crucial). The split criterion on each node is based on the existence of a complete chemical substructure in Morgan fingerprints (Rogers & Hahn, 2010). The color of each Morgan fingerprint simply reflects the radius of the fingerprint.
Modern machine learning tasks are increasingly complex, requiring flexible models with large numbers of parameters such as deep networks (Silver et al., 2016; Vaswani et al., 2017; Huang et al., 2017). Such modeling gains often come at the cost of transparency or interpretability. This is particularly problematic when predictions are fed into decision-critical applications such as medicine where the ability to verify predictions may be just as important as the raw predictive power.
It seems plausible to guide a flexible neural network towards a complex yet well-understood (i.e., transparent) functional class. For example, in realizing Wasserstein-1 distance (Arjovsky et al., 2017), the discriminator should be limited to 1-Lipschitz functions. A strict adherence to a complex, global functional class is not the only way to achieve transparency. For example, linearity is a desirable characteristic for transparency but is sensible to enforce only locally. We offer therefore a new notion of transparency – functional transparency – where the goal is to guide models to adopt a desirable local behavior yet allowing them to be more flexible globally. Note that functional transparency should be established only approximately in many cases since, e.g., strict local linearity implies global linearity.
Previous approaches to interpretability have mainly focused on models that operate on fixed-size data, such as scalar features (Lakkaraju et al., 2016) or image prediction (Selvaraju et al., 2016; Mahendran & Vedaldi, 2015). The emphasis has been on feature relevance or selection (Ribeiro et al., 2016). Recent methods do address some of the challenges in sequential data (Lei et al., 2016; Arras et al., 2017), primarily in NLP tasks where the input sequence is discrete. Interpretability for continuous temporal data (Al-Shedivat et al., 2017; Wu et al., 2018a) or graph structures remains largely unexplored.
We develop a novel approach to transparency that is naturally suited for structured data. At the core of our approach is a game-theoretic definition of transparency. This is set up as a two-player co-operative game between a predictor and a witness. The predictor remains a complex model whereas the witness is chosen from a simple transparent family. Transparency arises from the fact that the predictor is encouraged to exemplify simple behavior as captured by the witness in each local region while remaining globally powerful. The approach differs from global regularization of models towards interpretability (Wu et al., 2018a), models that are constructed a priori to be interpretable, either architecturally or in terms of the function class (Al-Shedivat et al., 2017; Lei et al., 2016), or from post-hoc explanations of black-box methods via local perturbations (Ribeiro et al., 2016; Alvarez-Melis & Jaakkola, 2017). Our models are guided towards functional transparency during learning.
As an illustration, we contrast our approach with methods that seek to obtain interpretable explanations after the fact (e.g., Ribeiro et al., 2016). Explanations derived after training can be misleading if they do not match the functional behavior of the model. For example, Figure 1 shows local decision tree approximations for two models: our model trained with such local witnesses (a, left), and an unregularized model (b, right). The trees are constructed to achieve the same level of approximation. The tree for the unregularized model filters only one sample in each split, and thus lacks the generality needed to explain the (local) behavior. This phenomenon is related to the unstable explanations that arise with already trained models (Alvarez-Melis & Jaakkola, 2018b; Ghorbani et al., 2019).
The game-theoretic approach is very flexible in terms of models and scenarios. We therefore illustrate the approach across a few novel scenarios: explaining graph convolutional models using decision trees, revealing local functional variation of a deep sequence model, and exemplifying decision rules for the encoder in unsupervised graph representation learning. Our main contributions are:
- A novel game-theoretic approach to transparency, applicable to a wide range of prediction models, architectures, and local transparency classes, without requiring differentiability.
- An analysis of the effective size of the local regions, and the establishment of equilibria pertaining to different game formulations.
- An illustration of deep models across several tasks, ranging from chemical property prediction and physical component modeling to molecule representation learning.
2 Related Work
The role of transparency is to expose the inner workings of an algorithm (Citron & Pasquale, 2014; Pasquale, 2015), such as decision-making systems. This is timely for state-of-the-art machine learning models that are typically over-parameterized (Silver et al., 2016; He et al., 2016) and therefore effectively black-box models. An uncontrolled model is also liable to various attacks (Goodfellow et al., 2014).
Our goal is to regularize a complex deep model so that it exhibits a desired local behavior. The approach confers an approximate operational guarantee rather than interpretability directly. In contrast, examples of archetypal interpretable models include linear classifiers, decision trees (Quinlan, 2014), and decision sets (Lakkaraju et al., 2016); recent approaches also guide complex models towards highlighting pieces of input used for prediction (Lei et al., 2016), grounding explanations via graphical models (Al-Shedivat et al., 2017), or generalizing linear models while maintaining interpretability (Alvarez-Melis & Jaakkola, 2018a). A model conforming to a known functional behavior, at least locally, as in our approach, is not necessarily itself human-interpretable. The approximate guarantee we offer is that the complex model indeed follows such a behavior, and we also quantify to what extent this guarantee is achieved.
Previous work on approximating a functional class via neural networks can be roughly divided into two types: parametrization-based and regularization-based methods. Works in the first category seek self-evident adherence to a functional class; examples include maintaining Lipschitz continuity via weight clipping (Arjovsky et al., 2017), orthogonal transformation via scaled Cayley transform of skew-symmetric matrices (Helfrich et al., 2017), and “stable” recurrent networks via spectral norm projection on the transition matrix (Miller & Hardt, 2018).
A softer approach is to introduce a regularization problem that encourages neural networks to match properties of the functional class. Such a regularization problem might come in the form of a gradient penalty as used in several variants of GAN (Gulrajani et al., 2017; Bellemare et al., 2017; Mroueh et al., 2018) under the framework of integral probability metrics (IPM) (Müller, 1997), layer-wise regularization of transformation matrices (Cisse et al., 2017) towards Parseval tightness (Kovačević et al., 2008) for robustness, and recent adversarial approaches to learn representations for certain independence statements (Ganin et al., 2016; Zhao et al., 2017). Typically, a tailored regularization problem is introduced for each functional class. Our work follows this general theme in the sense of casting the overall problem as a regularization problem. However, we focus on transparency, and our approach, a general co-operative game, is quite different. Our methodology is applicable to any choice of (local) functional class without any architectural restrictions on the deep model whose behavior is sculpted. The optimization of functional deviation in the game must remain tractable, of course.
In this work, given a dataset , we learn an (unrestricted) predictive function together with a transparent – and usually simpler – function defined over a functional class . We refer to functions and as the predictor and the witness, respectively, throughout the paper. Note that we need not make any assumptions on the functional class , instead allowing a flexible class of predictors. In contrast, the family of witnesses is strictly constrained to be a transparent functional set, such as the set of linear functions or decision trees. We assume to have a deviation function such that , which measures discrepancy between two elements in and can be used to optimize and . To simplify the notation, we define . We introduce our game-theoretic framework in §3.1, analyze it in §3.2, and instantiate the framework with concrete models in §4.
3.1 Game-Theoretic Transparency
There are many ways to use a witness function to guide the predictor by means of discrepancy measures. However, since the witness functions can be weak, such as linear functions, we cannot expect a reasonable predictor to agree with them globally. Instead, we make a slight generalization to enforce this criterion only locally, over different sets of neighborhoods. To this end, we define local transparency by measuring how close is to the family over a local neighborhood around an observed point . One straightforward instantiation of such a neighborhood in the temporal domain is simply a local window of points . Our resulting local discrepancy measure is
The summation can be replaced by an integral when a continuous neighborhood is used. The minimizing witness function, , is indexed by the point around which it is estimated; depending on the function , the minimizing witness can change from one neighborhood to another. If we view the minimization problem game-theoretically, is the best response strategy of the local witness around .
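As a minimal sketch of the local discrepancy and the best-response witness, assume a scalar input, a squared-error deviation, a linear witness with a bias term, and a neighborhood given by an index window. These choices, and the names `local_discrepancy`, `center`, and `radius`, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def local_discrepancy(x, f_vals, center, radius):
    """Fit the best-response linear witness on one window and return
    its parameters with the mean squared deviation from the predictor."""
    lo, hi = max(0, center - radius), min(len(x), center + radius + 1)
    xs, fs = x[lo:hi], f_vals[lo:hi]
    # Design matrix [x, 1]: a linear witness with bias (an assumption).
    A = np.column_stack([xs, np.ones_like(xs)])
    theta, *_ = np.linalg.lstsq(A, fs, rcond=None)
    residual = fs - A @ theta
    return theta, float(np.mean(residual ** 2))

x = np.linspace(0.0, 1.0, 21)
f_linear = 2.0 * x + 1.0   # already inside the witness family
f_curved = x ** 2          # not linear on any window

_, dev_linear = local_discrepancy(x, f_linear, center=10, radius=3)
_, dev_curved = local_discrepancy(x, f_curved, center=10, radius=3)
print(dev_linear, dev_curved)
```

A predictor that is linear on the window has essentially zero discrepancy, while a curved one leaves a positive residual that a regularizer can act on.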
The local discrepancy measure can be incorporated into an overall estimation criterion in many ways so as to guide the predictor towards the desired functional form. This guidance can be offered as a uniform constraint with a permissible -margin, as an additive symmetric penalty, or defined asymmetrically as a game theoretic penalty where the information sets for the predictor and the witness are no longer identical. We consider each of these in turn.
Uniform criterion. A straightforward formulation is to confine to remain within a margin of the best fitting witness for every local neighborhood. Assume that a primal loss is given for a learning task. The criterion imposes the -margin constraint uniformly as
We assume that the optimal with respect to each constraint may be efficiently found due to the simplicity of and the regularity of . We also assume that the partial derivatives with respect to , for fixed witnesses, can be computed straightforwardly under sufficiently regular in a Lagrangian form. In this case, we can solve for , local witnesses, and the Lagrange multipliers using the mirror-prox algorithm (Nemirovski, 2004).
The hard constraints in the uniform criterion will lead to strict transparency guarantees. However, the effect may be undesirable in some cases where the observed data (hence the predictor) do not agree with the witness in all places. The resulting loss of performance may be too severe. As an alternative, we can enforce the agreement with local witnesses to be small in aggregate across neighborhoods.
Symmetric game. We define an additive, unconstrained, symmetric criterion to smoothly trade off between performance and transparency. The resulting objective is
To illustrate the above idea, we generate a synthetic dataset to show a neighborhood in Figure 1(a) with an unconstrained piecewise linear predictor in Figure 1(b). Clearly, does not agree with a linear witness within this neighborhood. However, when we solve for together with a linear witness as in Figure 1(c), the resulting function has a small residual deviation from , more strongly adhering to the linear functional class while still closely tracking the observed data. Figure 1(d) shows the flexibility of our framework where a very different functional behavior can be induced by changing the functional class for the witness.
Asymmetric game. Solving the symmetric criterion can be computationally inefficient since the predictor is guided by its deviation from each of the local witnesses on all points within each of the local neighborhoods. Moreover, the predictor value at any point is subject to potentially conflicting regularization terms across the neighborhoods, which is undesirable. The inner summation in Eq. (3) may involve different sizes of neighborhoods (e.g., end-point boundary cases), which makes it more challenging to parallelize the computation.
We would like to impose even functional regularization at every based on how much the value deviates from the witness associated with the local region . This approach leads to an asymmetric co-operative formulation, where the information sets for the predictor and local witnesses differ. Specifically, the local best-response witness is chosen to minimize the local discrepancy as in Eq. (1), and thus depends on values within the whole region; in contrast, the predictor only receives feedback in terms of the resulting deviation at , only seeing . From the point of view of the predictor , the best response strategy is obtained by minimizing
To train the proposed method, we perform alternating updates for and on their respective criteria.
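The alternating updates can be sketched on a toy problem. The sketch assumes an unconstrained predictor with one free value per training point, squared-error loss and deviation, a scalar input, and linear witnesses with bias fit on index windows. Under these assumptions the predictor step has a closed form. The names `fit_witness_at`, `asymmetric_game`, `lam`, and `radius` are hypothetical and not taken from the paper.

```python
import numpy as np

def fit_witness_at(x, f_vals, i, radius):
    """Value at x[i] of the least-squares line fit to f on the window around i."""
    lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
    slope, intercept = np.polyfit(x[lo:hi], f_vals[lo:hi], deg=1)
    return slope * x[i] + intercept

def witness_values(x, f_vals, radius):
    return np.array([fit_witness_at(x, f_vals, i, radius) for i in range(len(x))])

def asymmetric_game(x, y, lam, radius, n_rounds=50):
    f_vals = y.copy()  # unconstrained predictor: one free value per point
    for _ in range(n_rounds):
        # Witness step: each local witness best-responds on its whole window.
        g_center = witness_values(x, f_vals, radius)
        # Predictor step: f only sees the deviation at the window center.
        # Minimizer of (f - y)^2 + lam * (f - g)^2 with the witness held fixed.
        f_vals = (y + lam * g_center) / (1.0 + lam)
    deviation = float(np.mean((f_vals - witness_values(x, f_vals, radius)) ** 2))
    return f_vals, deviation

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=x.size)

_, dev_free = asymmetric_game(x, y, lam=0.0, radius=3)   # no regularization
_, dev_game = asymmetric_game(x, y, lam=1.0, radius=3)
print(dev_free, dev_game)
```

With `lam=0.0` the predictor simply interpolates the labels. A positive coefficient pulls each value towards its local witness and lowers the measured deviation.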
3.2 Analysis
We consider here the effectiveness of regularization in relation to the neighborhood size and establish fixed point equations for the predictor under the three estimation criteria. For simplicity, we assume and , but the results are generalizable to our examples in §4. All the proofs are in Appendix A.
Neighborhood size. The formulation involves a key trade-off between the size of the region where the function should be simple and the overall accuracy achieved by the predictor. When the neighborhood is too small, local witnesses become perfect, inducing no regularization on . Thus the size of the region is a key parameter. A neighborhood size is sufficient if the witness class cannot readily overfit values within the neighborhood. Formally,
We say that a neighborhood size is effective for if for any we can find s.t.
A trivial example is when is the constant class, a neighborhood size is effective if . Note that the neighborhood in the above definition can be any finite collection of points . For example, the points in the neighborhood induced by a temporal window need not remain in a small -norm ball.
For linear models and decision trees, we have
is the tight lower bound on the effective neighborhood size for the linear class.
is the tight lower bound on the effective neighborhood size for the decision tree class with depth bounded by .
When the sample sizes within the neighborhoods fall below such bounds, regularization can still be useful if the witness class is not uniformly flexible or if the algorithm for finding the witness is limited (e.g., greedy algorithm for decision trees).
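The effect of neighborhood size can be checked numerically on a toy case. The sketch below assumes a scalar input, a linear witness with bias, and a squared-error deviation. It does not reproduce the bounds stated above, and only illustrates that a neighborhood that is too small lets the witness fit the predictor perfectly. The helper name `min_linear_deviation` is hypothetical.

```python
import numpy as np

def min_linear_deviation(xs, fs):
    """Smallest mean squared deviation achievable by a line with bias."""
    A = np.column_stack([xs, np.ones_like(xs)])
    theta, *_ = np.linalg.lstsq(A, fs, rcond=None)
    return float(np.mean((fs - A @ theta) ** 2))

def predictor(x):
    return x ** 2  # a nonlinear predictor used only for illustration

two_points = np.array([0.0, 1.0])
three_points = np.array([0.0, 0.5, 1.0])

dev_two = min_linear_deviation(two_points, predictor(two_points))
dev_three = min_linear_deviation(three_points, predictor(three_points))
print(dev_two, dev_three)
```

With two points the line interpolates the predictor, so the deviation vanishes and induces no regularization. With three points the deviation is strictly positive.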
Equilibrium solutions. The symmetric game constitutes a standard minimization problem, but the existence or uniqueness of equilibria under the asymmetric game are not obvious. Our main results in this section make the following assumptions.
(A1) the predictor is unconstrained.
(A2) both the loss and deviation are squared errors.
We note that (A3) and (A4) are not technically necessary but simplify the presentation. We denote the predictor in the uniform criterion (Eq. (2)), the symmetric game (Eq. (3)), and the asymmetric game (Eq. (4)) as , , and , respectively. We use to denote the neighborhood (), and to denote the vector. denotes the pseudo-inverse of . Then we have
If (A1-5) hold and the witness is in the linear family, the optimal satisfies
and the optimal , at every equilibrium, is the fixed point
The equilibrium in the linear class is not unique when the witness is not fully determined in a neighborhood due to degeneracy. To avoid these cases, we can use ridge regression to obtain a stable equilibrium (also proved in Appendix A).
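The role of ridge regression can be illustrated with a degenerate neighborhood. The sketch assumes a linear witness without bias, a squared-error deviation, and a hypothetical penalty strength `alpha`. It shows only that the ridge solution is unique when ordinary least squares is not, and is not the construction used in the proof.

```python
import numpy as np

def ridge_witness(X, f_vals, alpha=1e-3):
    """Closed-form ridge solution (X^T X + alpha I)^{-1} X^T f."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ f_vals)

# Degenerate neighborhood: the two feature columns are identical, so any
# weights with theta[0] + theta[1] = 2 fit the predictor values exactly.
col = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([col, col])
f_vals = 2.0 * col

rank = int(np.linalg.matrix_rank(X))
theta = ridge_witness(X, f_vals)
print(rank, theta)
```

The penalty selects the symmetric solution among the infinitely many least-squares fits, which keeps the witness stable from one training round to the next.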
A special case of Theorem 2 is when , which effectively yields the equilibrium result for the constant class; we found it particularly useful for understanding the similarity between the two games in this scenario. Concretely, each becomes equivalent to . As a result, the solutions for both the symmetric and asymmetric games induce the optimal predictors as recursive convolutional averaging of neighboring points with the same decay rate , while the convolutional kernel evolves twice as fast in the symmetric game as in the asymmetric game.
Next, we show that the hard uniform constraint criterion yields a very different equilibrium.
If (A1-5) hold and the witness is in the linear family, the optimal satisfies
for , where
A noticeable difference from the games is that, under uniform criterion, the optimal predictor may faithfully output the actual label if the functional constraint is satisfied, while the functional constraints are translated into a “convolutional” operator in the games.
Efficient computation. We also analyze ways of accelerating the computation required for solving the symmetric game. An equivalent criterion is given by
If is squared error, is differentiable, is sub-differentiable, and (A4-5) hold, then
where , induces the same equilibrium as the symmetric game.
The result is useful when training on GPU and is solved analytically on CPU. Compared to a for-loop to handle different neighborhood sizes for Eq. (3) on the GPU, computing a summarized feedback as in Lemma 4 on CPU is more efficient (and easier to implement).
Discussion. We investigated here discrete neighborhoods, which are also suitable for structured data as in the experiments. The method itself can be generalized to continuous neighborhoods with an additional difficulty: the exact computation and minimization of functional deviation between the predictor and the witness in such a neighborhood is in general intractable. We may apply results from learning theory (e.g., Shamir, 2015) to bound the (generalization) gap between the deviation computed from finite samples of the continuous neighborhood and the actual deviation under a uniform probability measure.
4.1 Conditional Sequence Generation
The basic idea of co-operative modeling extends naturally to conditional sequence generation over longer periods. Broadly, the mechanism allows us to inspect the temporal progression of sequences on a longer term basis.
Given an observation sequence , the goal is to estimate probability over future events , typically done via maximum likelihood. For brevity, we use to denote . We model the conditional distribution of given as a multivariate Gaussian distribution with mean and covariance , both parametrized as recurrent neural networks. Each local witness model is estimated based on the neighborhood with respect to the mean function . A natural choice would be a -order Markov autoregressive (AR) model with an deviation loss as:
The AR model admits an analytical solution similar to linear regression.
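The analytical solution can be sketched as an ordinary least-squares problem over lagged values. The sketch assumes a scalar series (the paper's mean function is multivariate), a squared-error deviation, and a bias term. The function name `fit_ar_witness` is hypothetical.

```python
import numpy as np

def fit_ar_witness(series, order):
    """Least-squares AR(order) fit with bias on a scalar series.

    Returns the lag coefficients (most recent lag first) and the bias.
    """
    rows = [series[t - order:t][::-1] for t in range(order, len(series))]
    A = np.column_stack([np.array(rows), np.ones(len(rows))])
    targets = series[order:]
    theta, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return theta[:-1], float(theta[-1])

# Synthetic window generated by a known, noiseless AR(2) process.
values = [1.0, 2.0]
for _ in range(10):
    values.append(0.5 * values[-1] - 0.3 * values[-2] + 0.1)
series = np.array(values)

coeffs, bias = fit_ar_witness(series, order=2)
print(coeffs, bias)
```

In the game, the series passed to the fit would be the predictor's mean values inside one neighborhood, so the recovered coefficients describe the local behavior of the predictor rather than of the raw data.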
4.2 Chemical Property Prediction
The models discussed in §3 can also be instantiated on highly structured data, such as molecules. These are usually represented as a graph whose nodes encode the atom types and edges encode the chemical bonds. Such a representation enables the use of recent graph convolutional networks (GCNs) (Dai et al., 2016; Lei et al., 2017) as the predictor . As it is hard to realize a simple explanation on the raw graph representation, we exploit an alternative data representation for the witness model; we leverage depth-bounded decision trees that take as input Morgan fingerprints (Rogers & Hahn, 2010), which are vector representations for the binary existence of chemical substructures in a molecule (e.g., the nodes in Fig. 1).
The neighborhood includes molecules with Tanimoto similarity greater than , automatically constructed through matched molecular pair analysis (Griffen et al., 2011). Here we use a multi-label binary classification task as an example, and adopt a cross-entropy loss for each label axis for simplicity. At each neighborhood , we construct a witness decision tree that minimizes the total variation (TV) from the predictor as
We note that Eq. (6) is an upper bound and efficient alternative to fitting a tree for each label axis independently.
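The similarity criterion behind the neighborhoods can be sketched on binary fingerprints stored as sets of active bit indices. This is a brute-force illustration only: the paper builds neighborhoods through matched molecular pair analysis, and the threshold value `0.4`, the toy fingerprints, and the function names are assumptions made for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of active bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def neighborhood(center, fingerprints, threshold):
    """Indices of fingerprints whose similarity to `center` exceeds the threshold."""
    return [i for i, fp in enumerate(fingerprints)
            if tanimoto(center, fp) > threshold]

# Toy fingerprints; real Morgan fingerprints would come from a chemistry toolkit.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 2, 3, 4}]
neighbors = neighborhood(fingerprints[0], fingerprints, threshold=0.4)
print(neighbors)
```

The witness tree for the first molecule would then be fit only on the molecules indexed by `neighbors`.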
4.3 Molecule Representation Learning
Our approach can be further applied to learn transparent latent graph representations by variational autoencoders (VAEs) (Kingma & Welling, 2013; Jin et al., 2018). Concretely, given a molecular graph , the VAE encoder outputs the approximated posterior over the latent space, where is the continuous representation of molecule . Following common practice, is restricted to be diagonal. The VAE decoder then reconstructs the molecule from its probabilistic encoding . Our goal here is to guide the behavior of the neural encoder such that the derivation of (probabilistic) can be locally explained by a decision tree.
We adopt the same setting for the witness function and neighborhoods as in §4.2, except that the local decision tree now outputs a joint normal distribution with parameters. To train the encoder, we extend the original VAE objective with a local deviation loss defined on the KL divergence between the VAE posterior and witness posterior at each neighborhood as
The VAE is trained to maximize . For ease of implementation, we asymmetrically estimate each decision tree with mean squared error between the vectors and .
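For diagonal Gaussians the KL divergence used as the deviation has a standard closed form, sketched below. Which distribution plays the first argument (encoder posterior or witness posterior) is not fixed here and is an assumption left to the caller. The function name is hypothetical.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances.

    Each argument is a sequence with one entry per latent dimension;
    var_* hold variances, not standard deviations.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

same = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
print(same, shifted)
```

Because the divergence decomposes over dimensions, the deviation can be averaged over a neighborhood without any matrix inversions.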
We conduct experiments on chemical and time-series datasets. Due to the lack of existing works for explaining structured data, we adopt an ablation setting – comparing our approach (Game) versus an unregularized model (Deep) – and focus on measuring the transparency. We use subscripts to denote specific versions of the Game models. Note that we only fit the local witnesses to the Deep model during testing for evaluation. Unless otherwise noted, the reported results are based on the testing set.
5.1 Molecule Property Prediction
|(the higher the better)||AUC||0.742||0.824||0.818|
|(the higher the better)||0.959||0.967||0.922|
We conduct experiments on molecular toxicity prediction on the Tox21 dataset from the MoleculeNet benchmark (Wu et al., 2018b), which contains 12 binary labels and molecules. The labels are very unbalanced; the fraction of the positive label is between and among the 12 labels. We use GCN as the predictor and decision trees as the witnesses as in §4.2. The neighborhood sizes of about of the molecules are larger than , whose median and maximum are and , respectively. Since each neighborhood has a different size , we set the maximum tree depth as for each neighborhood, which ensures that the corresponding size is effective for (see Definition 1). More details are in Appendix B.
Evaluation Measures: For all the measures, the results are averaged across the label axes.
(1) Performance: For the predictor, we compare its predictions with respect to the labels in AUC, denoted as AUC. As each local witness also realizes a function of , it is also evaluated against the labels in AUC, denoted as AUC.
(2) Transparency: As labels are unavailable for testing data in practice, it is more realistic to measure the similarity between the predictor and the local witnesses to understand the validity of the explanations derived from the decision trees . Using TV to measure transparency, as in training, is not ideal, since the predictor probability can be scaled arbitrarily to minimize the TV from decision trees without affecting performance. To this end, we generalize the AUC criterion for continuous labels for references and predictions as
The proposed score has the same pairwise interpretation as AUC, recovers AUC when is binary, and is normalized to . Locally, we measure the criterion for the local witnesses with respect to the predictor in each testing neighborhood as the local deviation, where the average result is denoted as . Globally, the criterion is also validated among the testing data, denoted as .
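The exact formula is not reproduced in this text, so the following is only one pairwise score consistent with the stated properties: it has a pairwise interpretation, lies in the unit interval, and reduces to AUC for binary references. The tie handling and the name `pairwise_agreement` are assumptions.

```python
def pairwise_agreement(references, predictions):
    """Fraction of pairs ordered by the references that the predictions
    order the same way; prediction ties count as one half (an assumption)."""
    concordant, total = 0.0, 0
    for i, ref_i in enumerate(references):
        for j, ref_j in enumerate(references):
            if ref_i > ref_j:
                total += 1
                if predictions[i] > predictions[j]:
                    concordant += 1.0
                elif predictions[i] == predictions[j]:
                    concordant += 0.5
    return concordant / total if total else float("nan")

# Binary references: the score coincides with the usual AUC.
binary_score = pairwise_agreement([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Continuous references, e.g. predictor outputs compared with witness outputs.
continuous_score = pairwise_agreement([0.1, 0.5, 0.9], [1.0, 2.0, 3.0])
print(binary_score, continuous_score)
```

Because only the ordering of the predictions matters, the score is unaffected by rescaling the predictor's probabilities, which is the concern raised about TV.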
The results with the uniform and symmetric criteria are shown in Table 1. A baseline vanilla decision tree, with depth tuned between and , yields 0.617 in . Compared to , the local deviation in is marginally improved due to the strict constraint at the cost of severe performance loss. We investigate the behaviors in training neighborhoods and find that exhibits a tiny fraction of high deviation losses, allowing the model to behave more flexibly than the strictly constrained (see Figure 5 in Appendix B). In terms of performance, our model is superior to the Deep model in both the predictor and local witnesses. When comparing the witnesses to the predictor, locally and globally, the Game models significantly improve the transparency from the Deep model. The local deviation should be interpreted relatively since the tree depth inherently prevents local overfitting.
We visualize the resulting witness trees in Figure 1 under the same transparency constraint: for a local neighborhood, we grow the witness tree for the Deep model until the local transparency in is comparable to the model. For explaining the same molecule, the tree for the Deep model is deeper and extremely unbalanced. Since a Morgan fingerprint encodes the existence of a substructure of molecule graphs, an unbalanced tree focusing on the left branch (non-existence of a substructure) does not capture much generality. Hence, the explanation of the Deep model does not provide as much insight as our model.
Here we analyze the tree depth constraint for the witness model: a shallower tree is easier to interpret, but makes it more challenging to establish transparency due to its restricted complexity. To this end, we revise the depth constraint to during training and testing, and vary . All the resulting Game models outperform the Deep models in AUC, and we report the transparency score in terms of in Table 2. Even when , the witness trees in our Game model still represent the predictor more faithfully than those in the Deep model with .
5.2 Physical Component Modeling
We next validate our approach on a physical component modeling task with the bearing dataset from NASA (Lee et al., 2016), which records 4-channel acceleration data on 4 co-located bearings. We divide the sequence into disjoint subsequences, resulting in subsequences. Since the dataset exhibits high frequency periods of 5 points and low frequency periods of 20 points, we use the first points in an sequence to forecast the next . We parametrize and jointly by stacking layer of CNN, LSTM, and fully connected layers. We set the neighborhood radius to such that the witnesses are fit with completely different data for the beginning and the end of the sequence. The Markov order is set to to ensure the effectiveness of the neighborhood sizes. More details are in Appendix C.
Evaluation involves three different types of errors: 1) ‘error’ is the root mean squared error (RMSE) between greedy autoregressive generation and the ground truth, 2) ‘deviation’ is RMSE between the predictor and the witness , and 3) ‘TV’ is the average total variation of witness parameters between every two consecutive time points. Since the deviation and error are both computed on the same space in RMSE, the two measures are readily comparable. For testing, the witnesses are estimated based on the autoregressive generative trajectories.
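Two of these measures can be sketched directly. The sketch assumes that total variation is the L1 norm of the difference between witness parameter vectors at consecutive time points, averaged over time. The text does not state the norm, so this choice and the function names are assumptions.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays of equal shape."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def average_total_variation(params):
    """Mean L1 change of the witness parameters between consecutive steps.

    `params` has one row of witness parameters per time point.
    """
    params = np.asarray(params, dtype=float)
    step_changes = np.sum(np.abs(np.diff(params, axis=0)), axis=1)
    return float(np.mean(step_changes))

error = rmse([0.0, 0.0], [3.0, 4.0])
tv = average_total_variation([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
print(error, tv)
```

A low value of `tv` indicates that neighboring witnesses share nearly the same parameters, which corresponds to the stable functional patterns discussed for the Game model.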
We present the results in Table 3 to study the impact of the game coefficient and the symmetry of the games. The trends in the measures are quite monotonic on : with an increasing , the model gradually operates toward the AR family with lower deviation and TV but higher error. When , the Game models are more accurate than the Deep model () due to the regularization effect. Given the same hyper-parameters, marginally lower deviation in the symmetric game than in the asymmetric game confirms our analysis about the similarity between the two. In practice, the asymmetric game is more efficient and substantially easier to implement than the symmetric game. Indeed, the training time is sequences/second for the asymmetric game, and sequences/second for the symmetric game. If we use the formula in Lemma 4, the symmetric game can be accelerated to sequences/second, but the formula does not generalize to other deviation losses.
We visualize the witnesses with their parameters along the autoregressive generative trajectories in Figure 3. The stable functional patterns of the Game model as reflected by , before and after the point, highlight not only close local alignments of the predictor and the AR family (being constant vectors across columns) but also flexible variation of functional properties on the predictor across regions. In contrast, the Deep model yields unstable linear coefficients, and relies more on offsets/biases than the Game model, while the linear weights are more useful for grounding the coordinate relevance for interpretability. Finally, we remark that despite the uninterpretable nature of temporal signals, the functional pattern reflected by the linear weights as shown here yields a simple medium to understand its behavior. Due to space limitation, the additional analysis and visualization are included in Appendix C.
5.3 Molecule Representation Learning
Finally, we validate our approach on learning representations for molecules with VAEs, where we use the junction tree VAE (Jin et al., 2018) as an example. Here the encoders of VAEs, with and without the guidance of local decision trees as in §4.3, are denoted as Game and Deep, respectively. The models are trained on the ZINC dataset (Sterling & Irwin, 2015) containing 1.5M molecules, and evaluated on a test set with 20K molecules. We measure the performance in terms of the evidence lower bound (ELBO) over the test set. Here we consider two scenarios: the ELBO using the raw latent representations from the original neural encoder, and using the interpreted latent representations generated by locally fitted decision trees. The average deviation loss in KL divergence , defined in §4.3, over the testing neighborhoods is also evaluated.
The results are shown in Table 4. Our Game model performs consistently better under all the metrics. Figure 4 shows an example of how our decision tree explains the local neighborhood of a molecule. We found that most of the substructures selected by the decision tree occur in the side chains outside of the Bemis-Murcko scaffold (Bemis & Murcko, 1996). This shows that the variation in the latent representation mostly reflects the local changes in the molecules, which is expected since changes in the scaffold typically lead to global changes such as chemical property changes.
| Model | ELBO (neural encoder) | ELBO (decision tree) | deviation () |
We propose a novel game-theoretic approach to learning transparent models on structured data. The game articulates how the predictor model’s fitting can be traded off against agreeing locally with a transparent witness. This work opens up many avenues for future work, from theoretical analysis of the games to a multi-player setting.
The work was funded in part by a grant from Siemens Corporation and in part by an MIT-IBM grant on deep rationalization.
- Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
- Al-Shedivat et al. (2017) Al-Shedivat, M., Dubey, A., and Xing, E. P. Contextual explanation networks. arXiv preprint arXiv:1705.10301, 2017.
- Alvarez-Melis & Jaakkola (2018a) Alvarez-Melis, D. and Jaakkola, T. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pp. 7786–7795, 2018a.
- Alvarez-Melis & Jaakkola (2017) Alvarez-Melis, D. and Jaakkola, T. S. A causal framework for explaining the predictions of black-box sequence-to-sequence models. Proceedings of EMNLP, 2017.
- Alvarez-Melis & Jaakkola (2018b) Alvarez-Melis, D. and Jaakkola, T. S. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018b.
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- Arras et al. (2017) Arras, L., Horn, F., Montavon, G., Müller, K.-R., and Samek, W. "What is relevant in a text document?": An interpretable machine learning approach. PloS one, 12(8):e0181142, 2017.
- Bellemare et al. (2017) Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
- Bemis & Murcko (1996) Bemis, G. W. and Murcko, M. A. The properties of known drugs. 1. molecular frameworks. Journal of medicinal chemistry, 39(15):2887–2893, 1996.
- Cisse et al. (2017) Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.
- Citron & Pasquale (2014) Citron, D. K. and Pasquale, F. The scored society: due process for automated predictions. Wash. L. Rev., 89:1, 2014.
- Dai et al. (2016) Dai, H., Dai, B., and Song, L. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711, 2016.
- Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
- Ghorbani et al. (2019) Ghorbani, A., Abid, A., and Zou, J. Interpretation of neural networks is fragile. AAAI, 2019.
- Goodfellow et al. (2014) Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Griffen et al. (2011) Griffen, E., Leach, A. G., Robb, G. R., and Warner, D. J. Matched molecular pairs as a medicinal chemistry tool: miniperspective. Journal of medicinal chemistry, 54(22):7739–7750, 2011.
- Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In
- Helfrich et al. (2017) Helfrich, K., Willmott, D., and Ye, Q. Orthogonal recurrent neural networks with scaled cayley transform. arXiv preprint arXiv:1707.09520, 2017.
- Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, volume 1, pp. 3, 2017.
- Jin et al. (2018) Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kovačević et al. (2008) Kovačević, J., Chebira, A., et al. An introduction to frames. Foundations and Trends® in Signal Processing, 2(1):1–94, 2008.
- Lakkaraju et al. (2016) Lakkaraju, H., Bach, S. H., and Leskovec, J. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675–1684. ACM, 2016.
- Lee et al. (2018) Lee, G.-H., Alvarez-Melis, D., and Jaakkola, T. S. Game-theoretic interpretability for temporal modeling. The 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018) at ICML 2018, 2018. URL https://arxiv.org/pdf/1807.00130.pdf.
- Lee et al. (2016) Lee, J., Qiu, H., Yu, G., Lin, J., and Rexnord Technical Services (2007). IMS, U. o. C. Bearing data set. NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field, CA, 7(8), 2016.
- Lei et al. (2016) Lei, T., Barzilay, R., and Jaakkola, T. Rationalizing Neural Predictions. In EMNLP 2016, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 107–117, 2016. URL http://arxiv.org/abs/1606.04155.
- Lei et al. (2017) Lei, T., Jin, W., Barzilay, R., and Jaakkola, T. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037, 2017.
- Mahendran & Vedaldi (2015) Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
- Miller & Hardt (2018) Miller, J. and Hardt, M. When recurrent models don’t need to be recurrent. arXiv preprint arXiv:1805.10369, 2018.
- Mroueh et al. (2018) Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. Sobolev gan. International Conference on Learning Representations, 2018.
- Müller (1997) Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
- Nemirovski (2004) Nemirovski, A. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- Pasquale (2015) Pasquale, F. The black box society: The secret algorithms that control money and information. Harvard University Press, 2015.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
- Quinlan (2014) Quinlan, J. R. C4.5: Programs for machine learning. Elsevier, 2014.
- Ribeiro et al. (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939778. URL http://arxiv.org/abs/1602.04938 http://doi.acm.org/10.1145/2939672.2939778.
- Rogers & Hahn (2010) Rogers, D. and Hahn, M. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
- Selvaraju et al. (2016) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. https://arxiv.org/abs/1610.02391v3, 7(8), 2016.
- Shamir (2015) Shamir, O. The sample complexity of learning linear predictors with the squared loss. The Journal of Machine Learning Research, 16(1):3475–3486, 2015.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
- Sterling & Irwin (2015) Sterling, T. and Irwin, J. J. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
- Wu et al. (2018a) Wu, M., Hughes, M. C., Parbhoo, S., Zazzi, M., Roth, V., and Doshi-Velez, F. Beyond sparsity: Tree regularization of deep models for interpretability. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018a. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16285.
- Wu et al. (2018b) Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018b.
- Zhao et al. (2017) Zhao, M., Yue, S., Katabi, D., Jaakkola, T. S., and Bianchi, M. T. Learning sleep stages from radio signals: a conditional adversarial architecture. In International Conference on Machine Learning, pp. 4100–4109, 2017.
Appendix A Proofs
Our main results in this section make the following assumptions.
(A1) the predictor is unconstrained.
(A2) both the loss and deviation are squared errors.
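We note that (A3) and (A4) are not technically necessary but simplify the presentation. We denote the predictor in the uniform criterion (Eq. (2)), the symmetric game (Eq. (3)), and the asymmetric game (Eq. (4)) as $f_U$, $f_S$, and $f_A$, respectively. We use $X_i$ to denote the neighborhood $\mathcal{N}_i$ ($X_i = [x_j]_{x_j \in \mathcal{N}_i}^\top \in \mathbb{R}^{m \times d}$), and $f(X_i)$ to denote the vector $[f(x_j)]_{x_j \in \mathcal{N}_i}^\top \in \mathbb{R}^{m}$. $X_i^\dagger$ denotes the pseudo-inverse of $X_i$. Then we have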
If (A1-5) hold and the witness is in the linear family, the optimal satisfies
and the optimal , at every equilibrium, is the fixed point
We first re-write the symmetric criterion explicitly as a game:
where is the best response strategy from the local witness.
Since the predictor is unconstrained and the objective is convex in it, we can treat each of its outputs as a distinct variable, and set the derivative to zero to find its optimum:
where . Note that we only have to collect witnesses that are relevant to for the first equality, and the second equality is due to (A4). On the other hand, the objective for in the asymmetric game is:
The corresponding optimum is:
For both games, the objective for each local witness can be described as:
Then Eq. (10) is an optimal witness at .
and we note that every optimal witness has the same values on
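As a worked illustration of the first-order argument above, the following sketch reconstructs the predictor's optimum in the symmetric game. It is a derivation under stated assumptions, not a restatement of the original equations: we assume the symmetric objective charges each training point $x_i$ the squared loss $(f(x_i) - y_i)^2$ plus $\lambda/m$ times the squared deviation between $f$ and the witness $\hat{g}_{x_j}$ on every neighborhood $\mathcal{N}_j$ that contains $x_i$, that all neighborhoods have size $m$, and that $x_i \in \mathcal{N}_j$ exactly when $x_j \in \mathcal{N}_i$. The symbols $\lambda$, $m$, $\mathcal{N}_i$, and $\hat{g}_{x_j}$ are notation chosen for this sketch.

```latex
\begin{align*}
J_i\bigl(f(x_i)\bigr)
  &= \bigl(f(x_i) - y_i\bigr)^2
   + \frac{\lambda}{m} \sum_{j :\, x_i \in \mathcal{N}_j}
     \bigl(f(x_i) - \hat{g}_{x_j}(x_i)\bigr)^2, \\
0 &= \frac{\partial J_i}{\partial f(x_i)}
   = 2\bigl(f(x_i) - y_i\bigr)
   + \frac{2\lambda}{m} \sum_{j :\, x_i \in \mathcal{N}_j}
     \bigl(f(x_i) - \hat{g}_{x_j}(x_i)\bigr), \\
f^*(x_i)
  &= \frac{y_i + \frac{\lambda}{m} \sum_{j :\, x_i \in \mathcal{N}_j} \hat{g}_{x_j}(x_i)}
          {1 + \frac{\lambda}{m}\,\bigl|\{j : x_i \in \mathcal{N}_j\}\bigr|}
   = \frac{1}{1 + \lambda}
     \Bigl[\, y_i + \frac{\lambda}{m} \sum_{x_j \in \mathcal{N}_i} \hat{g}_{x_j}(x_i) \Bigr].
\end{align*}
```

The first expression for $f^*(x_i)$ only collects the witnesses whose neighborhoods contain $x_i$; the last equality uses the symmetry and equal-size assumptions, under which exactly $m$ neighborhoods contain $x_i$.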
Note that the equilibrium for the linear class is not unique when the solution of Eq. (9) is not unique: there may be infinitely many optimal solutions for the witness in a neighborhood due to degeneracy. In this case, Theorem 2 adopts the minimum-norm solution given by the pseudo-inverse in Eq. (10). Alternatively, one may use ridge regression to establish a strongly convex objective for the witness and thereby ensure a unique solution, where the objective for the witness is rewritten as
with a positive regularization coefficient.
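The degeneracy discussed above, and the two ways of resolving it, can be sketched numerically. The example assumes a linear witness without a bias term, fitted by least squares to the predictor's values on one neighborhood, and an L2 penalty on the witness weights for the ridge variant. The helper names, the toy data, and the penalty form are illustrative assumptions, since the rewritten objective is not reproduced here.

```python
import numpy as np


def min_norm_witness(X, f_vals):
    """Minimum-norm least-squares weights via the pseudo-inverse."""
    return np.linalg.pinv(X) @ f_vals


def ridge_witness(X, f_vals, reg):
    """Unique minimizer of ||X w - f||^2 + reg * ||w||^2 for reg > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ f_vals)


# Degenerate neighborhood: two identical feature columns, so infinitely
# many weight vectors fit the predictor's values equally well.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
f_vals = np.array([2.0, 4.0, 6.0])

w_pinv = min_norm_witness(X, f_vals)            # minimum-norm choice
w_other = np.array([2.0, 0.0])                  # another exact fit
w_ridge = ridge_witness(X, f_vals, reg=1e-6)    # unique ridge solution

# Different optimal witnesses still agree on the neighborhood itself.
same_values = np.allclose(X @ w_pinv, X @ w_other)
```

In this toy case the optimal weight vectors differ, yet they produce identical values on the neighborhood, and the ridge solution with a small penalty stays close to the minimum-norm one.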
If (A1-5) hold and the witness is in the linear family, the optimal satisfies
for , where
The objective for the uniform criterion is:
Our strategy is to temporarily treat each witness as a fixed function, and then replace it with its best response strategy.
Since the predictor is unconstrained (in capacity), we can treat each of its outputs as a distinct variable for optimization. For each such variable, we first filter its relevant criteria:
For any feasible , we can further rewrite the constraint of with respect to each as:
Collectively, we can fold all the upper bounds of as
All the lower bounds can be folded similarly.
Finally, since the objective for is simply a squared error with an interval constraint, evidently if satisfies the lower bounds and upper bounds, then . If
then we have
Otherwise, we have
If each witness is in the linear class, Eq. (13) is an optimal solution.
and we note that every optimal witness has the same values on .
Since the optimal witness depends functionally on the predictor, to obtain the optimal predictor we combine our previous result with the optimal witness such that the optimality conditions for the predictor and the witness are both satisfied. Finally, we have
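The case analysis in the proof above amounts to projecting the label onto an interval. A minimal sketch, assuming the folded lower and upper bounds for one predictor output have already been computed (their formulas are not reproduced here, and `lower`, `upper`, and the function name are hypothetical):

```python
def constrained_minimizer(y, lower, upper):
    """Minimize (v - y)**2 over v subject to lower <= v <= upper."""
    if lower > upper:
        raise ValueError("infeasible interval: lower exceeds upper")
    if y > upper:
        return upper   # label above the interval: the upper bound is active
    if y < lower:
        return lower   # label below the interval: the lower bound is active
    return y           # label is feasible: the constraint is inactive
```

When the label already satisfies both bounds, the constraint has no effect and the predictor simply fits the label; otherwise the optimum sits on the violated bound, because the squared error grows monotonically with distance from the label.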