Gaussian Process Regression with Local Explanation

07/03/2020 · Yuya Yoshikawa, et al.

Gaussian process regression (GPR) is a fundamental model used in machine learning. Owing to its accurate predictions with uncertainty and its versatility in handling various data structures via kernels, GPR has been successfully used in various applications. However, in GPR, how the features of an input contribute to its prediction cannot be interpreted. Herein, we propose GPR with local explanation, which reveals the feature contributions to the prediction of each sample while maintaining the predictive performance of GPR. In the proposed model, both the prediction and explanation for each sample are performed using an easy-to-interpret locally linear model. The weight vector of the locally linear model is assumed to be generated from multivariate Gaussian process priors. The hyperparameters of the proposed model are estimated by maximizing the marginal likelihood. For a new test sample, the proposed model can predict the values of its target variable and weight vector, as well as their uncertainties, in a closed form. Experimental results on various benchmark datasets verify that the proposed model can achieve predictive performance comparable to that of GPR and superior to that of existing interpretable models, and that it achieves higher interpretability than those models, both quantitatively and qualitatively.


1 Introduction

Gaussian processes (GPs) have been well studied for constructing probabilistic models as priors of nonlinear functions in the machine learning (ML) community. They have demonstrated great success in various problem settings, such as regression Rasmussen (2003); Wilson et al. (2012), classification Rasmussen (2003); Csató et al. (2000), time-series forecasting Roberts et al. (2013), and black-box optimization Snoek et al. (2012). A fundamental model built on GPs is Gaussian process regression (GPR) Rasmussen (2003); owing to its high predictive performance and its versatility in using various data structures via kernels, it has been used not only in the ML community but also in various other research areas, such as finance Gonzalvez et al. (2019), geostatistics Camps-Valls et al. (2016), material science Zhang et al. (2020), and medical science Cheng et al. (2017); Futoma (2018).

GPR is defined on an infinite-dimensional feature space via kernel functions. Therefore, it requires the values of kernels defined on pairs of samples, i.e., a Gram matrix, as an input, rather than the samples themselves. Owing to the nonlinearity of the kernel, GPR enables nonlinear predictions; however, unlike linear regression models, it cannot explain which features contribute to its predictions. Therefore, users of GPR cannot judge whether the predictions are reasonable and made through fair decisions.

For the interpretability of ML, several methodologies have been proposed that explain which features contribute to the outputs of prediction models, including GPR; in this case, the prediction models are regarded as black boxes Molnar (2019); Chen et al. (2018). Representative methods are local interpretable model-agnostic explanations (LIME) Ribeiro et al. (2016) and Shapley additive explanations (SHAP) Lundberg and Lee (2017), which approximate the prediction for each test sample by a locally linear explanation model. By observing the weights of the learned explanation model, the feature contributions to the prediction can be understood. However, these methods have some limitations. First, because the forms of the prediction and explanation models differ, it is unclear whether the estimated feature contributions reflect those of the prediction model. Furthermore, because the explanation model is learned separately for each test sample, similar samples may not receive consistent explanations.

Figure 1: Example of an explanation for a prediction by GPX for a sample from the Boston housing dataset Harrison Jr and Rubinfeld (1978). We provide further examples and feature descriptions in Appendix A of the supplementary material.

To overcome the aforementioned limitations, we propose a novel framework for GP-based regression models, Gaussian process regression with local explanation, called GPX, which reveals the feature contributions to the prediction for each sample while maintaining the predictive performance of GPR. In GPX, both the prediction and explanation for each sample are performed using an easy-to-interpret locally linear model; therefore, no gap exists between the prediction and explanation. The weight vector of the locally linear model is assumed to be generated from multivariate GP priors Álvarez et al. (2012). As the multivariate GP priors have a covariance function defined via kernels on the samples, GPX ensures that similar samples have similar weights. The hyperparameters of GPX are estimated by maximizing the marginal likelihood, in which the weight vectors for all the training samples are integrated out. For a test sample, the predictive distributions of the target variable and the weight vector, together with their uncertainties, are obtained in closed form. The explanation for the predicted target variable is provided using the estimated weight vector with its uncertainty, as shown in Figure 1. Depicting the explanation with uncertainty helps users of GPX judge the reliability of the predicted weights.

In our experiments, we evaluated GPX both qualitatively and quantitatively in terms of predictive performance and interpretability on various benchmark datasets. The experimental results show that 1) GPX can achieve predictive errors comparable to those of GPR and lower than those of existing interpretable methods, 2) it can outperform model-agnostic interpretable methods and locally linear methods in terms of three interpretability measurements, and 3) the feature contributions produced by GPX are appropriate.

2 Related Work

Linear regression models are simple types of interpretable models. A number of studies introducing various regularizations have been conducted to produce methods such as ridge regression Hoerl and Kennard (1970) and the lasso Tibshirani (1996). These models have global weight vectors that are shared across all samples. In kernel methods, automatic relevance determination (ARD), which considers the global relevance of each feature contained in kernel functions, is adopted Neal (2012); Wipf and Nagarajan (2008). The above approaches have a significant issue in explainability: when a feature has an inconsistent weight/relevance, for example, when the weight/relevance of a feature changes under the influence of another feature, global weights/relevances cannot capture this behavior.

On the other hand, some locally linear models for regression have been proposed, such as the network lasso Hallac et al. (2015) and the localized lasso Yamada et al. (2017), which have a weight vector for each sample. Therefore, these methods can avoid the drawbacks of globally linear models. To obtain this benefit, we focus on generating predictions with explanations using locally linear models. In the network and localized lasso, the weights of locally linear models are estimated via optimization with network-based regularization, where the network must be defined on samples in advance. If the network is not provided, as is the case in standard regression problems, we can construct a $k$-nearest-neighbor graph of samples to create a network, where $k$ is a hyperparameter that must be optimized via cross-validation. Meanwhile, GPX can estimate weights and their uncertainties without constructing graphs by assuming that the weights are determined by functions generated from GPs.

With regard to research on deep neural networks (DNNs), a number of studies have been conducted on making the predictions generated by DNNs interpretable Chen et al. (2019); Arras et al. (2017); Ying et al. (2019). Some of these studies have developed methods that make interpretable predictions by generating locally linear models for each sample using DNNs Melis and Jaakkola (2018); Schwab et al. (2019); Yoshikawa and Iwata (2020). These concepts inspired our study, but we formalize our model without DNNs. To the best of our knowledge, our study is the first to develop a GP-based regression model with local explanations. Unlike DNN-based locally linear models, GPX is beneficial for tasks in which GPR can be used advantageously, such as Bayesian optimization Snoek et al. (2012); Golovin et al. (2017).

3 Proposed Model

In this section, we describe the proposed model, i.e., Gaussian process regression with local explanation, called GPX.

We consider a scalar-valued regression problem. Suppose that training data $\mathcal{D} = \{(\mathbf{s}_n, \mathbf{x}_n, y_n)\}_{n=1}^{N}$ containing $N$ samples is provided. $\mathbf{s}_n \in \mathcal{S}$ is an original input representing the $n$-th sample, where $\mathcal{S}$ is an original input space. Although a typical representation for $\mathbf{s}_n$ is a vector, it can be any data representation on which kernel functions are defined, such as graphs Vishwanathan et al. (2010) and sets Muandet et al. (2012); Yoshikawa et al. (2014). $y_n \in \mathbb{R}$ is a target variable for the sample. $\mathbf{x}_n \in \mathbb{R}^{D}$ is a $D$-dimensional vector of simplified representation for $\mathbf{s}_n$. Because GPX explains the prediction via the simplified representation, the meaning of each dimension of $\mathbf{x}_n$ should be easily understood by humans, e.g., tabular data and bag-of-words representations for text. $\mathbf{x}_n$ is an optional input; therefore, if $\mathbf{s}_n$ can be used as a simplified representation, one can define $\mathbf{x}_n = \mathbf{s}_n$. Let us denote $\mathbf{S} = \{\mathbf{s}_n\}_{n=1}^{N}$, $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)^\top$, and $\mathbf{y} = (y_1, \dots, y_N)^\top$.

In GPX, both the prediction of target variables and their explanations are performed via easy-to-interpret locally linear models, i.e., target variable $y_n$ for the $n$-th sample is assumed to be obtained using locally linear function $f_n$, defined as follows:

$y_n = f_n(\mathbf{x}_n) = \mathbf{w}_n^\top \mathbf{x}_n + \epsilon_n,$   (1)

where $\mathbf{w}_n$ is a $D$-dimensional weight vector for the $n$-th sample, and $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$ is a Gaussian noise with variance $\sigma^2$. Here, the explanation for the $n$-th sample is obtained using either weight vector $\mathbf{w}_n$ or feature contributions $\mathbf{w}_n \circ \mathbf{x}_n$, where $\circ$ denotes the element-wise product.
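As an illustrative sketch of Eq. (1) (not the authors' reference implementation; the function name and values are our own), the prediction and the per-feature contributions for one sample can be computed as follows:

```python
import numpy as np

# Minimal sketch of Eq. (1): prediction and explanation share one model.
def local_linear_predict(w_n, x_n):
    """Return the prediction and per-feature contributions for one sample."""
    contributions = w_n * x_n          # feature contributions w_n ∘ x_n
    prediction = contributions.sum()   # noise-free part of y_n = w_n^T x_n
    return prediction, contributions

w_n = np.array([0.5, -1.2, 0.0])       # illustrative weight vector
x_n = np.array([1.0, 2.0, 3.0])        # illustrative simplified input
y_hat, contrib = local_linear_predict(w_n, x_n)
```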

Estimating $\{\mathbf{w}_n\}_{n=1}^{N}$ without any constraints is an ill-posed problem because the number of free parameters in $\mathbf{W}$, $ND$, is larger than the number of target variables, $N$. To avoid this problem in GPX, we assume that the functions determining $\mathbf{w}_n$ are generated from a multivariate GP. More specifically, weight vector $\mathbf{w}_n$ for the $n$-th sample is obtained as follows:

$\mathbf{w}_n = \mathbf{g}(\mathbf{s}_n) + \mathbf{e}_n,$   (2)

where $\mathbf{e}_n \sim \mathcal{N}(\mathbf{0}, \eta^2 \mathbf{I}_D)$ is a $D$-dimensional Gaussian noise with variance $\eta^2$, and $\mathbf{I}_D$ is an identity matrix of order $D$. Here, vector-valued function $\mathbf{g}(\cdot) = (g_1(\cdot), \dots, g_D(\cdot))^\top$ is a function that determines the weight vector for each sample, and each element $g_d$ of $\mathbf{g}$ is generated from a univariate GP independently, as follows:

$g_d(\cdot) \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot, \cdot; \boldsymbol{\theta})\big), \quad d = 1, 2, \dots, D,$   (3)

where $m(\cdot)$ is the mean function, and $k(\cdot, \cdot; \boldsymbol{\theta})$ is the covariance function with set of parameters $\boldsymbol{\theta}$. We use a zero mean function for $m(\cdot)$, as standard regularizers such as $\ell_1$ and $\ell_2$ assume zero-centered weights. Covariance function $k$ is a kernel function defined on two inputs $\mathbf{s}$ and $\mathbf{s}'$. For example, one can use a scaled RBF kernel with parameters $\boldsymbol{\theta} = (\alpha, \gamma)$ as the kernel function when $\mathbf{s}, \mathbf{s}'$ are vectors, defined as follows:

$k(\mathbf{s}, \mathbf{s}'; \boldsymbol{\theta}) = \alpha \exp\left(-\frac{\|\mathbf{s} - \mathbf{s}'\|_2^2}{2\gamma^2}\right).$   (4)

By using $\mathbf{g}$ generated as such, GPX ensures that two similar samples, i.e., those having a large kernel value, have similar weight vectors.
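A minimal NumPy sketch of the scaled RBF kernel in Eq. (4), assuming vector inputs and our own parameter names `alpha` and `gamma`:

```python
import numpy as np

def scaled_rbf_gram(S1, S2, alpha=1.0, gamma=1.0):
    """Gram matrix with K[i, j] = alpha * exp(-||s_i - s_j||^2 / (2 gamma^2))."""
    sq_dists = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-sq_dists / (2.0 * gamma ** 2))
```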

We let $\mathbf{W} = (\mathbf{w}_1, \dots, \mathbf{w}_N)^\top$ and $\mathbf{G} = (\mathbf{g}(\mathbf{s}_1), \dots, \mathbf{g}(\mathbf{s}_N))^\top$. Based on the generative process above, the joint distribution of GPX is written as follows:

$p(\mathbf{y}, \mathbf{W}, \mathbf{G} \mid \mathbf{X}, \mathbf{S}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{w}_n^\top \mathbf{x}_n, \sigma^2)\; p(\mathbf{W} \mid \mathbf{G})\; p(\mathbf{G} \mid \mathbf{S}),$   (5)
$p(\mathbf{W} \mid \mathbf{G}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{w}_n \mid \mathbf{g}_n, \eta^2 \mathbf{I}_D),$   (6)
$p(\mathbf{G} \mid \mathbf{S}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{g}_{:,d} \mid \mathbf{0}, \mathbf{K}),$   (7)

where $\mathbf{g}_n$ and $\mathbf{g}_{:,d}$ denote the $n$-th row and $d$-th column vectors of $\mathbf{G}$, respectively, and $\mathbf{K}$ is a Gram matrix of order $N$ whose $(n, m)$-th element is $k(\mathbf{s}_n, \mathbf{s}_m; \boldsymbol{\theta})$, which is identical to the input required by GPR.
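For concreteness, the generative process in Eqs. (1)-(7) can be simulated as in the following sketch (a toy setting with $\mathbf{X} = \mathbf{S}$ and unit kernel parameters; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
S = rng.normal(size=(N, D))     # original inputs; here we take X = S
X = S

# N x N Gram matrix (RBF kernel with unit scale and bandwidth)
sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

eta2, sigma2 = 0.1, 0.01
# Eq. (7): each weight dimension d has g_{:,d} ~ N(0, K)
G = rng.multivariate_normal(np.zeros(N), K, size=D).T            # N x D
W = G + np.sqrt(eta2) * rng.normal(size=(N, D))                  # Eq. (2)
y = (W * X).sum(axis=1) + np.sqrt(sigma2) * rng.normal(size=N)   # Eq. (1)
```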

4 Training and Prediction

In this section, we describe the derivation of the marginal likelihood of GPX, the hyperparameter estimation for GPX, and the derivation of the predictive distributions of target variables and weight vectors for test samples.

Marginal likelihood.   To ease the derivation of the marginal likelihood, we first modify the formulation of the joint distribution (5), while maintaining its mathematical meaning, as follows:

$p(\mathbf{y}, \bar{\mathbf{w}}, \bar{\mathbf{g}} \mid \mathbf{X}, \mathbf{S}) = \mathcal{N}(\mathbf{y} \mid \bar{\mathbf{X}}\bar{\mathbf{w}}, \sigma^2 \mathbf{I}_N)\, \mathcal{N}(\bar{\mathbf{w}} \mid \bar{\mathbf{g}}, \eta^2 \mathbf{I}_{ND})\, \mathcal{N}(\bar{\mathbf{g}} \mid \mathbf{0}, \bar{\mathbf{K}}),$   (8)

where $\bar{\mathbf{w}} = \mathrm{vec}(\mathbf{W})$, $\bar{\mathbf{g}} = \mathrm{vec}(\mathbf{G})$, $\bar{\mathbf{K}}$ is a block diagonal matrix of order $ND$ whose $d$-th block is $\mathbf{K}$, and $\mathrm{vec}(\cdot)$ is a function that flattens the input matrix in a column-major order. Here, $\bar{\mathbf{X}} = [\mathrm{diag}(\mathbf{x}_{:,1}), \dots, \mathrm{diag}(\mathbf{x}_{:,D})]$, where $\mathrm{diag}(\cdot)$ is a diagonal matrix whose diagonal elements possess the values of the input vector, and $\mathbf{x}_{:,d}$ is the $d$-th column of $\mathbf{X}$. In (5), $D$ functions that output $N$-dimensional column vectors in $\mathbf{G}$ are generated from GPs; however, in (8), this is rewritten such that a single function that outputs an $ND$-dimensional flattened vector $\bar{\mathbf{g}}$ is generated from a single GP. Consequently, the likelihood of target variables $\mathbf{y}$ can be rewritten as a single multivariate normal distribution.

Subsequently, we derive the marginal likelihood by integrating out $\bar{\mathbf{w}}$ and $\bar{\mathbf{g}}$ in (8). Owing to the properties of normal distributions, it can be obtained analytically, as follows:

$p(\mathbf{y} \mid \mathbf{X}, \mathbf{S}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{C}),$   (9)

where $\mathbf{C} = \bar{\mathbf{X}}(\bar{\mathbf{K}} + \eta^2 \mathbf{I}_{ND})\bar{\mathbf{X}}^\top + \sigma^2 \mathbf{I}_N$.
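From the definitions of $\bar{\mathbf{X}}$ and $\bar{\mathbf{K}}$, the covariance $\mathbf{C}$ has entries $(K_{nm} + \eta^2\delta_{nm})\,\mathbf{x}_n^\top\mathbf{x}_m + \sigma^2\delta_{nm}$, i.e., an element-wise product of the regularized Gram matrix and $\mathbf{X}\mathbf{X}^\top$. A sketch of the log marginal likelihood under this equivalent form (illustrative code, not the authors' implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(y, X, K, sigma2, eta2):
    """Log of Eq. (9): y ~ N(0, C), C = (K + eta2*I) ∘ (X X^T) + sigma2*I."""
    N = len(y)
    C = (K + eta2 * np.eye(N)) * (X @ X.T) + sigma2 * np.eye(N)
    return multivariate_normal(mean=np.zeros(N), cov=C).logpdf(y)
```

In practice, this quantity is maximized with a gradient-based optimizer such as L-BFGS, as described next.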

Hyperparameter estimation.   If $k$ is differentiable with respect to $\boldsymbol{\theta}$, all the hyperparameters, i.e., $\sigma^2$, $\eta^2$, and $\boldsymbol{\theta}$, can be estimated by maximizing the logarithm of the marginal likelihood (9) with respect to them on the training data using gradient-based optimization methods, e.g., L-BFGS Liu and Nocedal (1989).

Predictive distributions.   For a new test sample $(\mathbf{s}_*, \mathbf{x}_*)$, our goal is to infer the predictive distributions of target variable $y_*$ and weight vector $\mathbf{w}_*$. First, the predictive distribution of $y_*$ is obtained similarly as in standard GPR, as follows:

$p(y_* \mid \mathbf{s}_*, \mathbf{x}_*, \mathcal{D}) = \mathcal{N}\big(y_* \mid \mathbf{c}_*^\top \mathbf{C}^{-1}\mathbf{y},\; c_{**} - \mathbf{c}_*^\top \mathbf{C}^{-1}\mathbf{c}_*\big),$   (10)

where $\mathbf{c}_*$ is an $N$-dimensional vector whose $n$-th element is $k(\mathbf{s}_n, \mathbf{s}_*; \boldsymbol{\theta})\,\mathbf{x}_n^\top \mathbf{x}_*$, and $c_{**} = \big(k(\mathbf{s}_*, \mathbf{s}_*; \boldsymbol{\theta}) + \eta^2\big)\mathbf{x}_*^\top \mathbf{x}_* + \sigma^2$.
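A sketch of Eq. (10) under the same element-wise form of $\mathbf{C}$ (the helper names and the use of `np.linalg.solve` instead of explicit inversion are our choices):

```python
import numpy as np

def predict_target(y, X, K, x_star, k_star, k_ss, sigma2, eta2):
    """Predictive mean/variance of y_*; k_star[n] = k(s_n, s_*), k_ss = k(s_*, s_*)."""
    N = len(y)
    C = (K + eta2 * np.eye(N)) * (X @ X.T) + sigma2 * np.eye(N)
    c_star = k_star * (X @ x_star)        # n-th element: k(s_n, s_*) x_n^T x_*
    c_ss = (k_ss + eta2) * (x_star @ x_star) + sigma2
    solved = np.linalg.solve(C, c_star)   # C^{-1} c_*
    return solved @ y, c_ss - c_star @ solved
```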

Second, the predictive distribution of $\mathbf{w}_*$ is obtained by solving the following integral:

$p(\mathbf{w}_* \mid \mathbf{s}_*, \mathcal{D}) = \int p(\mathbf{w}_* \mid \bar{\mathbf{w}}, \mathbf{s}_*, \mathbf{S})\, p(\bar{\mathbf{w}} \mid \mathcal{D})\, d\bar{\mathbf{w}}$   (11)
$= \int \mathcal{N}\big(\mathbf{w}_* \mid \bar{\mathbf{K}}_*^\top (\bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND})^{-1} \bar{\mathbf{w}},\; (k_{**} + \eta^2)\mathbf{I}_D - \bar{\mathbf{K}}_*^\top (\bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND})^{-1} \bar{\mathbf{K}}_*\big)\, p(\bar{\mathbf{w}} \mid \mathcal{D})\, d\bar{\mathbf{w}},$   (12)

where we define $\mathbf{k}_* = \big(k(\mathbf{s}_n, \mathbf{s}_*; \boldsymbol{\theta})\big)_{n=1}^{N}$, $k_{**} = k(\mathbf{s}_*, \mathbf{s}_*; \boldsymbol{\theta})$, $p(\bar{\mathbf{w}} \mid \mathcal{D})$ as the posterior distribution of $\bar{\mathbf{w}}$, and $\bar{\mathbf{K}}_*$ as follows: $\bar{\mathbf{K}}_*$ is a $D$-by-$D$ block matrix, where each block is an $N$-by-$1$ matrix, and the $(d, d)$-block of the block matrix is $\mathbf{k}_*$ for $d = 1, \dots, D$, while the other blocks are zero matrices. Solving the integral analytically according to the properties of normal distributions, we obtain

$p(\mathbf{w}_* \mid \mathbf{s}_*, \mathcal{D}) = \mathcal{N}\big(\mathbf{w}_* \mid \bar{\mathbf{K}}_*^\top \bar{\mathbf{X}}^\top \mathbf{C}^{-1} \mathbf{y},\; (k_{**} + \eta^2)\mathbf{I}_D - \bar{\mathbf{K}}_*^\top \bar{\mathbf{X}}^\top \mathbf{C}^{-1} \bar{\mathbf{X}} \bar{\mathbf{K}}_*\big).$   (13)

We provide the detailed derivation of the predictive distributions (10) and (13) in Appendix B of the supplementary material.

Accordingly, the marginal likelihood (9) and the predictive distribution for $y_*$ (10) are similar to those of GPR, except that GPX can additionally obtain the predictive distribution for $\mathbf{w}_*$ (13). Since GPX can be used with the same input as GPR if $\mathbf{x} = \mathbf{s}$, it can be employed in existing ML models in place of GPR.
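Analogously, Eq. (13) admits a compact sketch: the $N \times D$ matrix $\bar{\mathbf{X}}\bar{\mathbf{K}}_*$ has entries $x_{n,d}\,k(\mathbf{s}_n, \mathbf{s}_*)$, so the weight posterior needs only one solve against $\mathbf{C}$ (again an illustrative sketch, not the reference implementation):

```python
import numpy as np

def predict_weights(y, X, K, k_star, k_ss, sigma2, eta2):
    """Predictive mean/covariance of the weight vector w_* in Eq. (13)."""
    N, D = X.shape
    C = (K + eta2 * np.eye(N)) * (X @ X.T) + sigma2 * np.eye(N)
    M = X * k_star[:, None]               # X_bar K_bar_*: entries x_{n,d} k(s_n, s_*)
    solved = np.linalg.solve(C, M)        # C^{-1} M
    mean = solved.T @ y                   # K_bar_*^T X_bar^T C^{-1} y
    cov = (k_ss + eta2) * np.eye(D) - M.T @ solved
    return mean, cov
```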

Computational efficiency.   As with ordinary GPR, the computational cost of GPX is dominated by inverse computations. The computation of the predictive distributions requires inverting $\bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND}$, a square matrix of order $ND$. However, because this matrix is block diagonal and every diagonal block comprises $\mathbf{K} + \eta^2\mathbf{I}_N$, a square matrix of order $N$, its inverse can be obtained by inverting $\mathbf{K} + \eta^2\mathbf{I}_N$ only once. The remaining inverse matrix, $\mathbf{C}^{-1}$, is of order $N$. Therefore, all the inverse matrices appearing in GPX can be obtained using a naive implementation with a computational complexity of $\mathcal{O}(N^3)$, which is the same as that of GPR. To further reduce the computational cost, efficient computation methods for GPR, such as the inducing variable method Titsias (2009) and KISS-GP Wilson and Nickisch (2015), can be used for GPX. In addition, because $\bar{\mathbf{X}}$ and $\bar{\mathbf{K}}_*$ are sparse matrices, one can obtain the predictive distributions efficiently using libraries for sparse matrix computation.
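The block-diagonal argument can be checked numerically; the following sketch (with arbitrary illustrative sizes) verifies that inverting $\bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND}$ reduces to a single order-$N$ inversion:

```python
import numpy as np

N, D, eta2 = 40, 5, 0.1
rng = np.random.default_rng(1)
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)               # a well-conditioned Gram-like matrix

K_bar = np.kron(np.eye(D), K)             # ND x ND block diagonal, D copies of K
inv_big = np.linalg.inv(K_bar + eta2 * np.eye(N * D))                 # O(N^3 D^3)
inv_small = np.kron(np.eye(D), np.linalg.inv(K + eta2 * np.eye(N)))   # O(N^3)
assert np.allclose(inv_big, inv_small)
```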

5 Experiments

In this section, we demonstrate the effectiveness of the proposed model, GPX, quantitatively and qualitatively by comparing it with various interpretable models. In the quantitative evaluation, we assessed the models from the following perspectives:

  • Accuracy: How accurate is the prediction of the interpretable model?

  • Faithfulness: Are feature contributions indicative of “true” importance?

  • Sufficiency: Do the $K$ most important features reflect the prediction?

  • Stability: How consistent are the explanations for similar or neighboring examples?

In addition, we qualitatively evaluated whether the feature contributions produced by the models were appropriate by visualizing them.

All the experiments were performed on a computer with an Intel Xeon Gold 6132 2.6 GHz CPU with 16 cores and 120 GB of main memory.

5.1 Preparation

Datasets.

We used eight datasets from the UCI machine learning repository Dua and Graff (2017), referred to as Digits [27], Abalone [1], Diabetes [11], Boston Harrison Jr and Rubinfeld (1978), Fish [31], Wine [43], Paper [28], and Drug [12] in our experiments. We provide the details of the datasets in Appendix C of the supplementary material. The Digits dataset is originally a classification dataset for recognizing handwritten digits from 0 to 9. To use it in a regression problem, we transformed the labels into target variables of scalar values, i.e., the target variables for the labels from 0 to 4 were $-1$, and those for the remaining labels were $+1$. For the Paper and Drug datasets, whose samples are represented as sentences, the original input $\mathbf{s}$ and the simplified input $\mathbf{x}$ differed, i.e., we used the 512-dimensional sentence vectors obtained using Sentence Transformers Reimers and Gurevych (2020) as $\mathbf{s}$, while we used bag-of-words binary vectors for the sentences as $\mathbf{x}$. Each of the remaining datasets had identical $\mathbf{s}$ and $\mathbf{x}$. In all the datasets, the values of $\mathbf{x}$ and $y$ were standardized before training and prediction. For the quantitative evaluation on each dataset, we report average scores over five experiments performed on different training/test splits, where the training set was 80% of the entire dataset and the remainder was the test set.
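As a sketch of this preprocessing for the Digits dataset (using scikit-learn's copy of the data; the exact standardization details are our assumption):

```python
import numpy as np
from sklearn.datasets import load_digits

X, labels = load_digits(return_X_y=True)
y = np.where(labels <= 4, -1.0, 1.0)   # digits 0-4 -> -1, digits 5-9 -> +1

# Standardize features and targets before training and prediction.
X = (X - X.mean(axis=0)) / np.maximum(X.std(axis=0), 1e-12)
y = (y - y.mean()) / y.std()
```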

GPX setup.   In GPX, we consistently used the scaled RBF kernel defined in (4). The hyperparameters of GPX were estimated based on the method described in Section 4, where the noise variances $\sigma^2$ and $\eta^2$ and kernel scale $\alpha$ were initialized to fixed values. In addition, we initialized bandwidth parameter $\gamma$ using the median heuristic Garreau et al. (2017).
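A common form of the median heuristic, as a sketch (the paper defers details to Garreau et al. (2017)): set the bandwidth to the median pairwise distance between training inputs.

```python
import numpy as np

def median_heuristic(S):
    """Median of pairwise Euclidean distances, used to initialize gamma."""
    dists = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(-1))
    return np.median(dists[np.triu_indices(len(S), k=1)])
```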

Comparing methods.   We compared GPX with several methods with globally or locally linear weights that can be used as interpretable feature contributions for predictions. Lasso Tibshirani (1996) and Ridge Hoerl and Kennard (1970) are standard linear regression models with $\ell_1$ and $\ell_2$ regularizers, respectively, where their weights are globally shared across all samples. The network lasso ("Network" for short) is a locally linear model that regularizes the weights of nodes such that neighboring nodes in a network have similar weights Hallac et al. (2015). In our case, each node represents a sample, and the network is a $k$-nearest-neighbor graph on the samples based on the cosine similarity of $\mathbf{x}$. The localized lasso ("Localized" for short) is an extension of the network lasso; it can estimate the sparse and exclusive weights of each sample by further incorporating an exclusive regularizer into the network lasso Yamada et al. (2017). Furthermore, to compare model-agnostic interpretable methods with GPX, we used LIME Ribeiro et al. (2016) and Kernel SHAP Lundberg and Lee (2017), which produce a locally linear model for each test sample to explain the prediction by a black-box prediction model. For a fair comparison, we used GPR as the prediction model. The hyperparameters of GPR were estimated similarly as for GPX, whereas those of the remaining comparing methods were optimized by grid search. We provide a detailed description of the comparing methods in Appendix D of the supplementary material.

5.2 Results

Dataset | GPX (ours) | Lasso | Ridge | Localized | Network | GPR
Digits | 0.078 ± 0.010 | 0.399 ± 0.028 | 0.398 ± 0.024 | 0.135 ± 0.042 | 0.163 ± 0.033 | 0.074 ± 0.008
Abalone | 0.428 ± 0.036 | 0.477 ± 0.052 | 0.477 ± 0.053 | 0.519 ± 0.023 | 0.534 ± 0.016 | 0.427 ± 0.034
Diabetes | 0.493 ± 0.041 | 0.504 ± 0.039 | 0.503 ± 0.040 | 0.610 ± 0.062 | 0.667 ± 0.091 | 0.490 ± 0.048
Boston | 0.116 ± 0.053 | 0.293 ± 0.078 | 0.284 ± 0.081 | 0.208 ± 0.070 | 0.233 ± 0.062 | 0.116 ± 0.052
Fish | 0.370 ± 0.061 | 0.437 ± 0.055 | 0.437 ± 0.055 | 0.479 ± 0.051 | 0.523 ± 0.054 | 0.375 ± 0.066
Wine | 0.579 ± 0.048 | 0.723 ± 0.039 | 0.712 ± 0.044 | NA | NA | 0.605 ± 0.046
Paper | 0.806 ± 0.054 | 0.821 ± 0.047 | 0.936 ± 0.057 | 0.981 ± 0.058 | 0.919 ± 0.088 | 0.762 ± 0.087
Drug | 0.835 ± 0.027 | 0.875 ± 0.037 | 0.911 ± 0.036 | NA | NA | 0.844 ± 0.033
Table 1: Average and standard deviation of mean squared errors (MSEs) for the predictions of target variables on each dataset (lower is better). Bold typeface indicates the best value on each dataset, whereas underlined values are not significantly different from the best according to a paired t-test. "NA" indicates that the performance could not be measured owing to memory or computational time limitations. As GPR is not an interpretable model, we include its performance in this table only to show that the performances of GPX and GPR are similar.

Accuracy.

First, we demonstrate the predictive performances of GPX and the comparing methods in Table 1. We found that GPX achieved the lowest predictive errors on all the datasets, compared with the globally or locally linear models. In addition, we found that the predictive errors of GPX and GPR were comparable on all the datasets. This result indicates that GPR can be replaced by GPX to achieve similar predictive performances.

Dataset | GPX (ours) | GPR+LIME | GPR+SHAP | Localized | Network
Digits | 0.888 ± 0.003 | 0.384 ± 0.038 | 0.651 ± 0.013 | 0.300 ± 0.029 | 0.352 ± 0.021
Abalone | 0.898 ± 0.017 | 0.775 ± 0.024 | NA | 0.432 ± 0.042 | 0.497 ± 0.033
Diabetes | 0.966 ± 0.008 | 0.844 ± 0.027 | 0.928 ± 0.010 | 0.340 ± 0.074 | 0.365 ± 0.086
Boston | 0.898 ± 0.026 | 0.693 ± 0.035 | 0.869 ± 0.009 | 0.562 ± 0.075 | 0.525 ± 0.039
Fish | 0.902 ± 0.016 | 0.672 ± 0.043 | 0.826 ± 0.033 | 0.480 ± 0.046 | 0.394 ± 0.090
Table 2: Average and standard deviation of faithfulness scores on each dataset (higher is better). This table can be interpreted similarly to Table 1. For the Wine dataset, we could not measure the scores of the methods except for GPX owing to computational time limitations. For the Paper and Drug datasets, we did not evaluate the scores because changes in $\mathbf{x}$ cannot be reflected in $\mathbf{s}$.

Faithfulness.   Assessing the correctness of the estimated contribution of each feature to a prediction requires a reference "true" contribution for comparison. As this is rarely available, a typical approach for measuring the faithfulness of the contributions produced by interpretable models is to rely on a proxy notion of contribution: observing the effect of removing features on the model's prediction. Following previous studies Melis and Jaakkola (2018); Bhatt et al. (2020), we computed the faithfulness score by removing features one by one, measuring the differences between the original predictions and the predictions from the inputs without the removed features, and calculating the correlation between these differences and the contributions of the removed features.
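A sketch of this faithfulness computation for a single sample, assuming feature removal is emulated by zeroing (one common choice; the exact removal operator is not specified here):

```python
import numpy as np

def faithfulness(predict, x, contributions):
    """Correlation between per-feature prediction drops and contributions."""
    base = predict(x)
    diffs = []
    for d in range(len(x)):
        x_removed = x.copy()
        x_removed[d] = 0.0                # "remove" feature d
        diffs.append(base - predict(x_removed))
    return np.corrcoef(np.array(diffs), contributions)[0, 1]
```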

Table 2 shows the faithfulness scores of GPX and the comparing methods. Here, we denote the results of LIME and Kernel SHAP using GPR as the black-box prediction model by GPR+LIME and GPR+SHAP, respectively. We found that GPX achieved the best faithfulness scores on all the datasets. As GPX predicts and explains using a single locally linear model for each test sample, when removing a feature from the input, the contribution of the feature is subtracted from the prediction directly. Meanwhile, because GPR+LIME and GPR+SHAP have different prediction and explanation models, a gap may exist between the estimated contribution in the explanation model and the latent contribution in the prediction. Because the predictions by GPX and GPR were performed using similar calculations, their faithfulness differences were likely due to the gap.


Figure 2: Average sufficiency scores on the Digits, Diabetes, and Boston datasets (panels (a)-(c), respectively; lower is better). The filled area on each line indicates its standard deviation.

Sufficiency.   In general, inputs contain many irrelevant features that do not contribute to the predictions, and discovering the important features among all the features is difficult for users of the models. Therefore, a desirable property of interpretable models is that they assign high contributions only to the important features that substantially affect the predictions. To quantify how well each method satisfies this property, we define the sufficiency score at $K$, where $K$ is the number of important features. In particular, the sufficiency score at $K$ was computed by identifying the $K$ most important features in descending order of the absolute values of their estimated contributions, predicting from the inputs having only those important features, and comparing the result against the original predictions. Because the number of important features varies with the sample and dataset, we evaluated the score at multiple values of $K$.
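A sketch of the sufficiency score at a given $K$, again assuming that discarded features are zeroed out:

```python
import numpy as np

def sufficiency_at_k(predict, x, contributions, K):
    """Squared gap between the full prediction and the top-K-features prediction."""
    top_k = np.argsort(-np.abs(contributions))[:K]
    x_top = np.zeros_like(x)
    x_top[top_k] = x[top_k]
    return (predict(x) - predict(x_top)) ** 2   # lower is better
```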

Figure 2 shows the sufficiency scores of GPX and the comparing methods. We found that GPX outperformed the others on the Digits dataset, whereas GPX, GPR+LIME, and GPR+SHAP produced the best sufficiency scores on the Diabetes dataset. These results indicate that GPX appropriately assigned high contributions to the important features. On the Boston dataset, we found that GPX was slightly inferior to the localized lasso at small $K$, although GPX outperformed it at larger $K$. This was because the localized lasso has a regularizer that induces sparse weights. This result suggests that GPX could be further improved by employing a mechanism for generating sparse weights.

Dataset | GPX (ours) | GPR+LIME | GPR+SHAP | Localized | Network
Digits | 0.034 ± 0.011 | 0.122 ± 0.027 | 0.073 ± 0.019 | 0.117 ± 0.065 | 0.133 ± 0.095
Abalone | 0.338 ± 0.056 | 2.820 ± 1.445 | NA | 3.094 ± 1.783 | 5.994 ± 3.383
Diabetes | 0.015 ± 0.002 | 0.482 ± 0.084 | 0.196 ± 0.030 | 0.758 ± 0.581 | 1.271 ± 0.261
Boston | 0.117 ± 0.038 | 0.545 ± 0.234 | 0.258 ± 0.087 | 0.233 ± 0.268 | 0.455 ± 0.670
Fish | 0.222 ± 0.068 | 0.814 ± 0.384 | 0.398 ± 0.095 | 1.414 ± 1.318 | 1.994 ± 1.923
Paper | 0.001 ± 0.000 | 0.074 ± 0.014 | 0.041 ± 0.019 | 0.014 ± 0.006 | 0.223 ± 0.045
Table 3: Average and standard deviation of stability scores on each dataset (lower is better). This table can be interpreted similarly to Table 1. For the Wine and Drug datasets, owing to computational time limitations, we could not measure the scores of the methods except for GPX.

Stability.   To generate meaningful explanations, interpretable methods must be robust against local perturbations of the input, as explanations that are sensitive to slight changes in the input may be regarded as inconsistent by users. In particular, flexible models such as locally linear models might be sensitive to such changes in pursuit of better predictions. As with Alvarez-Melis and Jaakkola Melis and Jaakkola (2018), we used the following local-Lipschitz-style quantity for measuring the stability of the estimated weights for test sample $\mathbf{x}_i$:

$L(\mathbf{x}_i) = \max_{\mathbf{x}_j \in \mathcal{N}_k(\mathbf{x}_i)} \dfrac{\|\hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j\|_2}{\|\mathbf{x}_i - \mathbf{x}_j\|_2},$   (14)

where $\mathcal{N}_k(\mathbf{x}_i)$ denotes the neighboring samples of $\mathbf{x}_i$ within the set of test samples; $\hat{\mathbf{w}}_i$ and $\hat{\mathbf{w}}_j$ are the estimated weights associated with test samples $\mathbf{x}_i$ and $\mathbf{x}_j$, respectively; and $k$ is a parameter that determines the neighboring samples, which we fixed in our experiments. Intuitively, this quantity is small when the estimated weights for a sample and its neighboring samples are similar. Subsequently, we computed the stability score on a dataset by averaging the quantity (14) over all the test samples in the dataset.
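A sketch of Eq. (14) for one test sample, with neighbors taken as the $k$ nearest test samples in Euclidean distance ($k = 5$ is illustrative):

```python
import numpy as np

def stability(W_hat, X_test, i, k=5):
    """Max ratio of weight change to input change over the k nearest neighbors."""
    dists = np.linalg.norm(X_test - X_test[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]   # exclude the sample itself
    return max(np.linalg.norm(W_hat[i] - W_hat[j]) / max(dists[j], 1e-12)
               for j in neighbors)
```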

Table 3 shows the stability scores on each dataset. We found that GPX achieved the best stability scores on all the datasets. For GPR+LIME and GPR+SHAP, the stability scores were worse than those of GPX, although the predictive powers of GPX and GPR were comparable. This is likely because LIME and Kernel SHAP estimate the weights independently across the test samples. For the localized and network lasso, the variance of the stability score over the datasets was large.

Figure 3: Examples of estimated weights of each model on the Digits dataset. The upper row shows the weights for a sample with digit two ($y = -1$), whereas the bottom one displays those for a sample with digit six ($y = +1$). Red and blue denote positive and negative weights, respectively, and their color strengths represent their magnitudes.

Qualitative comparison.   Finally, we qualitatively compared the weights estimated by GPX and the comparing methods on the Digits dataset, for which the appropriate contributions to predictions are apparent. For this comparison, we rescaled the inputs $\mathbf{x}$ and $\mathbf{s}$ to a common range.

Figure 3 shows the estimated weights for two samples. We provide the results for all the digits in Appendix E of the supplementary material. On this dataset, appropriate weights can be obtained by assigning weights having the same sign as the target variable to black pixels. We found that the methods other than GPX and the localized lasso could not estimate reasonable weights. Meanwhile, the weights estimated by GPX and the localized lasso were appropriate, although they exhibited different characteristics, i.e., dense weights from GPX versus sparse ones from the localized lasso. The task determines which explanation is better; however, as showing important regions rather than individual pixels is meaningful for images, the weights estimated by GPX would be easier to interpret on the Digits dataset. Furthermore, the degree of sparsity in the localized lasso can be changed via a hyperparameter; if the value of the hyperparameter is zero, the localized lasso is identical to the network lasso. However, because the weights estimated by the network lasso were inappropriate, those estimated by GPX cannot be mimicked by the localized lasso.

6 Conclusion

We proposed a GP-based regression model with sample-wise explanations. The proposed model assumes that each sample has a locally linear model, which is used for both prediction and explanation, and that the weight vector of the locally linear model is generated from multivariate GP priors. The hyperparameters of the proposed model were estimated by maximizing the marginal likelihood, in which all the weight vectors were integrated out. Subsequently, for a test sample, the proposed model predicted its target variable and weight vector with uncertainty. In the experiments, we confirmed that the proposed model outperformed the existing globally and locally linear models and achieved performance comparable to that of standard GPR in terms of predictive performance. We confirmed that the proposed model was superior to the existing methods, including model-agnostic interpretable methods, in terms of three interpretability measurements. Subsequently, we showed that the feature weights estimated by the proposed model were appropriate as explanations.

In future studies, we will confirm the effectiveness of the proposed model by applying its concept to various problems in which GPs have been successfully used, such as time-series forecasting and black-box optimization. In addition, we will extend the proposed model for further improvements in interpretability, e.g., by employing a mechanism that induces sparsity in the weight vectors.

Broader Impact

The proposed model could be applied to a wide range of applications in which Gaussian process regression has been used, such as finance Gonzalvez et al. (2019), geostatistics Camps-Valls et al. (2016), material science Zhang et al. (2020) and medical science Cheng et al. (2017); Futoma (2018). The proposed model could be used to make a non-linear prediction with an explanation for an individual sample, e.g., company, country, material object, and patient in these applications.

Using the proposed model brings several benefits. For example, the explanation provides opportunities for users of the proposed model to judge whether a prediction is reasonable and whether it was made through a fair decision. Furthermore, the uncertainties in the prediction and explanation that the proposed model estimates help improve the correctness of such judgments. Consequently, if the users feel that an explanation is unreasonable or unfair, they could fix the model or the training data to avoid such a false explanation next time. On the other hand, the proposed model could pose a risk precisely because of its increased predictability and explainability, i.e., when users unduly trust the proposed model or ignore large uncertainties, they could trust the prediction and the explanation even when these are wrong; consequently, they could make wrong, unfair, or biased decisions.

Evaluating the reasonability of an explanation requires expert knowledge of the application domain, which is rarely available in general. Therefore, we encourage research that investigates whether the explanations produced by the proposed model are reasonable based on expert knowledge in the applications. Regarding the aforementioned risk, we encourage research that investigates the influence of explanations produced by interpretable models, including the proposed model, on the trustworthiness of the models and on users' decision making.

References

  • [1] Abalone data set. Note: https://archive.ics.uci.edu/ml/datasets/Abalone Cited by: Appendix C, §5.1.
  • M. A. Álvarez, L. Rosasco, and N. D. Lawrence (2012) Kernels for vector-valued functions: a review. Foundations and Trends® in Machine Learning 4 (3), pp. 195–266. Cited by: §1.
  • L. Arras, F. Horn, G. Montavon, K. Müller, and W. Samek (2017) " What is relevant in a text document?": an interpretable machine learning approach. PloS one 12 (8). Cited by: §2.
  • U. Bhatt, A. Weller, and J. M. Moura (2020) Evaluating and aggregating feature-based model explanations. arXiv preprint arXiv:2005.00631. Cited by: §5.2.
  • C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: Appendix B, Appendix B.
  • G. Camps-Valls, J. Verrelst, J. Munoz-Mari, V. Laparra, F. Mateo-Jimenez, and J. Gomez-Dans (2016) A survey on gaussian processes for earth-observation data analysis: a comprehensive investigation. IEEE Geoscience and Remote Sensing Magazine 4 (2), pp. 58–78. Cited by: §1, Broader Impact.
  • C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su (2019) This looks like that: deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems, pp. 8928–8939. Cited by: §2.
  • J. Chen, L. Song, M. Wainwright, and M. Jordan (2018) Learning to explain: an information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pp. 883–892. Cited by: §1.
  • L. Cheng, G. Darnell, B. Dumitrascu, C. Chivers, M. E. Draugelis, K. Li, and B. E. Engelhardt (2017) Sparse multi-output gaussian processes for medical time series prediction. arXiv preprint arXiv:1703.09112. Cited by: §1, Broader Impact.
  • L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther (2000) Efficient approaches to gaussian process classification. In Advances in neural information processing systems, pp. 251–257. Cited by: §1.
  • [11] Diabetes data set. Note: https://archive.ics.uci.edu/ml/datasets/diabetes Cited by: Appendix C, §5.1.
  • [12] Drug review dataset (druglib.com) data set. Note: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29# Cited by: Appendix C, §5.1.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Appendix C, §5.1.
  • J. Futoma (2018) Gaussian process-based models for clinical time series in healthcare. Ph.D. Thesis, Duke University. Cited by: §1, Broader Impact.
  • D. Garreau, W. Jitkrittum, and M. Kanagawa (2017) Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269. Cited by: §5.1.
  • D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley (2017) Google vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1487–1495. Cited by: §2.
  • J. Gonzalvez, E. Lezmi, T. Roncalli, and J. Xu (2019) Financial applications of gaussian processes and bayesian optimization. arXiv preprint arXiv:1903.04841. Cited by: §1, Broader Impact.
  • D. Hallac, J. Leskovec, and S. Boyd (2015) Network Lasso: Clustering and Optimization in Large Graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 387–396. External Links: ISBN 978-1-4503-3664-2, Document Cited by: §2, §5.1.
  • D. Harrison Jr and D. L. Rubinfeld (1978) Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management 5, pp. 81–102. Cited by: Appendix A, Appendix C, Figure 1, §5.1.
  • A. E. Hoerl and R. W. Kennard (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §2, §5.1.
  • D. C. Liu and J. Nocedal (1989) On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (1-3), pp. 503–528. Cited by: §4.
  • S. M. Lundberg and S. Lee (2017) A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. Cited by: §1, §5.1.
  • D. A. Melis and T. Jaakkola (2018) Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pp. 7775–7784. Cited by: §2, §5.2, §5.2.
  • C. Molnar (2019) Interpretable machine learning. Note: https://christophm.github.io/interpretable-ml-book/ Cited by: §1.
  • K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf (2012) Learning from distributions via support measure machines. In Advances in neural information processing systems, pp. 10–18. Cited by: §3.
  • R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §2.
  • [27] Optical recognition of handwritten digits data set. Note: https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits Cited by: Appendix C, §5.1.
  • [28] Paper reviews data set. Note: https://archive.ics.uci.edu/ml/datasets/Paper+Reviews Cited by: Appendix C, §5.1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: Appendix C, Appendix D.
  • K. B. Petersen and M. S. Pedersen (2012) The matrix cookbook. Technical University of Denmark. Note: Version: November 15, 2012 External Links: Review Matrix Cookbook, Link Cited by: Appendix B, Appendix B.
  • [31] QSAR fish toxicity data set. Note: https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity Cited by: Appendix C, §5.1.
  • C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: Appendix B, Appendix D, §1.
  • N. Reimers and I. Gurevych (2020) Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813. Note: https://github.com/UKPLab/sentence-transformers Cited by: Appendix C, §5.1.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 1135–1144. External Links: ISBN 978-1-4503-4232-2, Document Cited by: §1, §5.1.
  • S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain (2013) Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371 (1984), pp. 20110550. Cited by: §1.
  • P. Schwab, D. Miladinovic, and W. Karlen (2019) Granger-causal attentive mixtures of experts: Learning important features with neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4846–4853. Cited by: §2.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1, §2.
  • R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §2, §5.1.
  • M. Titsias (2009) Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574. Cited by: §4.
  • S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt (2010) Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242. Cited by: §3.
  • A. G. Wilson, D. A. Knowles, and Z. Ghahramani (2012) Gaussian process regression networks. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1139–1146. Cited by: §1.
  • A. Wilson and H. Nickisch (2015) Kernel interpolation for scalable structured gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775–1784. Cited by: §4.
  • [43] Wine quality data set. Note: https://archive.ics.uci.edu/ml/datasets/wine+quality Cited by: Appendix C, §5.1.
  • D. P. Wipf and S. S. Nagarajan (2008) A new view of automatic relevance determination. In Advances in neural information processing systems, pp. 1625–1632. Cited by: §2.
  • M. Yamada, T. Koh, T. Iwata, J. Shawe-Taylor, and S. Kaski (2017) Localized Lasso for High-Dimensional Regression. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, pp. 325–333. Cited by: §2, §5.1.
  • Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec (2019) Gnnexplainer: generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, pp. 9240–9251. Cited by: §2.
  • Y. Yoshikawa, T. Iwata, and H. Sawada (2014) Latent support measure machines for bag-of-words data classification. In Advances in neural information processing systems, pp. 1961–1969. Cited by: §3.
  • Y. Yoshikawa and T. Iwata (2020) Neural generators of sparse local linear models for achieving both accuracy and interpretability. arXiv preprint arXiv:2003.06441. Cited by: §2.
  • Y. Zhang, D. W. Apley, and W. Chen (2020) Bayesian optimization for materials design with mixed quantitative and qualitative variables. Scientific Reports 10 (1), pp. 1–13. Cited by: §1, Broader Impact.

Appendix A Feature Description and Additional Examples for the Boston Housing Dataset

Feature name Description
CRIM Per capita crime rate by town
ZN Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS Proportion of non-retail business acres per town.
CHAS Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX Nitric oxides concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV Median value of owner-occupied homes in $1000’s
Table 4: Feature names and their descriptions for the Boston housing dataset.

The Boston housing dataset, referred to as “Boston” in our experiments, contains information collected by the U.S. Census Service regarding housing in the area of Boston, Massachusetts [19] and is used for predicting house prices based on the information. Table 4 lists the names of the features and their descriptions for the Boston housing dataset.

Figure 4 presents four examples of feature contributions estimated by GPX. We found that each of these examples has different feature contributions, although some of the features, such as "AGE" and "DIS," had consistently positive or negative contributions, respectively.

Appendix B Detailed Derivation of Predictive Distributions

In this appendix, we describe the derivation of the predictive distributions in detail. For a new test sample $(\mathbf{s}_*, \mathbf{x}_*)$, our goal is to infer the predictive distributions of the target variable $y_*$ and weight vector $\mathbf{w}_*$.

Predictive distribution of $y_*$.   The predictive distribution of $y_*$ is obtained similarly to standard GPR [32]. In Section 4, we demonstrated that the marginal distribution of training target variables for GPX is defined as

$p(\mathbf{y} \mid \mathbf{X}, \mathbf{S}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{C}),$   (15)

where $\mathbf{C} = \bar{\mathbf{X}}(\bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND})\bar{\mathbf{X}}^\top + \sigma^2\mathbf{I}_N$. According to (15), the joint marginal distribution of $\mathbf{y}$ and $y_*$ is defined as

$p\left(\begin{bmatrix}\mathbf{y} \\ y_*\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\mathbf{y} \\ y_*\end{bmatrix} \,\middle|\, \mathbf{0}, \begin{bmatrix}\mathbf{C} & \mathbf{c}_* \\ \mathbf{c}_*^\top & c_{**}\end{bmatrix}\right),$   (16)

where $\mathbf{c}_*$ is an $N$-dimensional vector whose $n$-th element is $k(\mathbf{s}_n, \mathbf{s}_*; \boldsymbol{\theta})\,\mathbf{x}_n^\top\mathbf{x}_*$, and $c_{**} = \big(k(\mathbf{s}_*, \mathbf{s}_*; \boldsymbol{\theta}) + \eta^2\big)\mathbf{x}_*^\top\mathbf{x}_* + \sigma^2$. The predictive distribution of $y_*$ is the conditional distribution of $y_*$ given $\mathbf{y}$ with the training and testing inputs. Therefore, it can be obtained by applying the formula of conditional distributions for normal distributions [30, Eq. (354)] to (16) as follows:

$p(y_* \mid \mathbf{s}_*, \mathbf{x}_*, \mathcal{D}) = \mathcal{N}\big(y_* \mid \mathbf{c}_*^\top\mathbf{C}^{-1}\mathbf{y},\; c_{**} - \mathbf{c}_*^\top\mathbf{C}^{-1}\mathbf{c}_*\big).$   (17)


Figure 4: Examples of feature contributions estimated by GPX for the Boston housing dataset. Here, the error bars denote the standard deviations of the feature contributions.

Predictive distribution of $\mathbf{w}_*$.   The predictive distribution of $\mathbf{w}_*$ can be obtained by solving the following equation:

$p(\mathbf{w}_* \mid \mathbf{s}_*, \mathcal{D}) = \int p(\mathbf{w}_* \mid \mathbf{W}, \mathbf{s}_*, \mathbf{S})\, p(\mathbf{W} \mid \mathcal{D})\, d\mathbf{W},$   (18)

where the first integrand is the conditional distribution of $\mathbf{w}_*$ and the second integrand is the posterior distribution of $\mathbf{W}$. The conditional distribution of $\mathbf{w}_*$ is derived similarly to the conditional distribution of $y_*$ (17). The distribution of $\mathbf{W}$ in which the functions $\{g_d\}_{d=1}^{D}$ are integrated out is given by

$p(\mathbf{W} \mid \mathbf{S}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{w}_{:,d} \mid \mathbf{0}, \mathbf{K} + \eta^2\mathbf{I}_N),$   (19)

where $\mathbf{w}_{:,d}$ is the $d$-th column vector of $\mathbf{W}$. According to this, the joint distribution of $\mathbf{w}_{:,d}$ and the $d$-th element $w_{*,d}$ of $\mathbf{w}_*$ is defined as

$p\left(\begin{bmatrix}\mathbf{w}_{:,d} \\ w_{*,d}\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\mathbf{w}_{:,d} \\ w_{*,d}\end{bmatrix} \,\middle|\, \mathbf{0}, \begin{bmatrix}\mathbf{K} + \eta^2\mathbf{I}_N & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} + \eta^2\end{bmatrix}\right),$   (20)

where we let $\mathbf{k}_* = \big(k(\mathbf{s}_n, \mathbf{s}_*; \boldsymbol{\theta})\big)_{n=1}^{N}$ and $k_{**} = k(\mathbf{s}_*, \mathbf{s}_*; \boldsymbol{\theta})$. Subsequently, we can obtain the conditional distribution of $w_{*,d}$ by applying the formula of conditional distributions for normal distributions [30, Eq. (354)] to (20) as follows:

$p(w_{*,d} \mid \mathbf{w}_{:,d}) = \mathcal{N}\big(w_{*,d} \mid \mathbf{k}_*^\top(\mathbf{K} + \eta^2\mathbf{I}_N)^{-1}\mathbf{w}_{:,d},\; k_{**} + \eta^2 - \mathbf{k}_*^\top(\mathbf{K} + \eta^2\mathbf{I}_N)^{-1}\mathbf{k}_*\big).$   (21)

Here, we can rewrite (21) as a single $D$-dimensional multivariate normal distribution as follows:

$p(\mathbf{w}_* \mid \mathbf{W}, \mathbf{s}_*, \mathbf{S}) = \mathcal{N}\big(\mathbf{w}_* \mid \bar{\mathbf{K}}_*^\top\bar{\boldsymbol{\Sigma}}^{-1}\mathrm{vec}(\mathbf{W}),\; (k_{**} + \eta^2)\mathbf{I}_D - \bar{\mathbf{K}}_*^\top\bar{\boldsymbol{\Sigma}}^{-1}\bar{\mathbf{K}}_*\big),$   (22)

where $\bar{\boldsymbol{\Sigma}} = \bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND}$, $\bar{\mathbf{K}}$ is a block diagonal matrix of order $ND$ whose $d$-th block is $\mathbf{K}$, and $\mathrm{vec}(\cdot)$ is a function that flattens the input matrix in column-major order. $\bar{\mathbf{K}}_*$ is a $D$-by-$D$ block matrix, where each block is an $N$-by-$1$ matrix and the $(d, d)$-block of the block matrix is $\mathbf{k}_*$ for $d = 1, \dots, D$, while the other blocks are zero matrices. By letting $\bar{\mathbf{w}} = \mathrm{vec}(\mathbf{W})$, we obtain

$p(\mathbf{w}_* \mid \bar{\mathbf{w}}, \mathbf{s}_*, \mathbf{S}) = \mathcal{N}\big(\mathbf{w}_* \mid \bar{\mathbf{K}}_*^\top\bar{\boldsymbol{\Sigma}}^{-1}\bar{\mathbf{w}},\; (k_{**} + \eta^2)\mathbf{I}_D - \bar{\mathbf{K}}_*^\top\bar{\boldsymbol{\Sigma}}^{-1}\bar{\mathbf{K}}_*\big).$   (23)

To derive the posterior distribution of $\bar{\mathbf{w}}$, $p(\bar{\mathbf{w}} \mid \mathcal{D})$, we first consider the joint distribution of $\mathbf{y}$ and $\bar{\mathbf{w}}$. This distribution is straightforwardly obtained as

$p(\mathbf{y}, \bar{\mathbf{w}} \mid \mathbf{X}, \mathbf{S}) = \mathcal{N}(\mathbf{y} \mid \bar{\mathbf{X}}\bar{\mathbf{w}}, \sigma^2\mathbf{I}_N)\, \mathcal{N}(\bar{\mathbf{w}} \mid \mathbf{0}, \bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND}),$   (24)

which can be rewritten as

$p(\bar{\mathbf{w}}) = \mathcal{N}(\bar{\mathbf{w}} \mid \mathbf{0}, \bar{\boldsymbol{\Sigma}}), \qquad p(\mathbf{y} \mid \bar{\mathbf{w}}) = \mathcal{N}(\mathbf{y} \mid \bar{\mathbf{X}}\bar{\mathbf{w}}, \sigma^2\mathbf{I}_N),$   (25)

where $\bar{\boldsymbol{\Sigma}} = \bar{\mathbf{K}} + \eta^2\mathbf{I}_{ND}$. By applying the formula of conditional distributions of normal distributions [5, Eqs. (2.113)–(2.117)] to (25), we can obtain

$p(\bar{\mathbf{w}} \mid \mathcal{D}) = \mathcal{N}\big(\bar{\mathbf{w}} \mid \sigma^{-2}\mathbf{M}\bar{\mathbf{X}}^\top\mathbf{y},\; \mathbf{M}\big),$   (26)

where we let $\mathbf{M} = (\bar{\boldsymbol{\Sigma}}^{-1} + \sigma^{-2}\bar{\mathbf{X}}^\top\bar{\mathbf{X}})^{-1}$. Here, the computation of $\mathbf{M}$ requires inverting a square matrix of order $ND$ with a computational complexity of $\mathcal{O}(N^3D^3)$. By using the Woodbury identity [30, Eq. (156)] to compute this inversion efficiently, we can transform $\mathbf{M}$ into $\bar{\boldsymbol{\Sigma}} - \bar{\boldsymbol{\Sigma}}\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\bar{\mathbf{X}}\bar{\boldsymbol{\Sigma}}$, which requires inverting a matrix of order $N$, namely $\mathbf{C}$. Consequently, we obtain

$p(\bar{\mathbf{w}} \mid \mathcal{D}) = \mathcal{N}\big(\bar{\mathbf{w}} \mid \bar{\boldsymbol{\Sigma}}\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\mathbf{y},\; \bar{\boldsymbol{\Sigma}} - \bar{\boldsymbol{\Sigma}}\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\bar{\mathbf{X}}\bar{\boldsymbol{\Sigma}}\big).$   (27)

From (23) and (27), one can see that (18) can be represented by the following equation:

$p(\mathbf{w}_* \mid \mathbf{s}_*, \mathcal{D}) = \int \mathcal{N}\big(\mathbf{w}_* \mid \bar{\mathbf{K}}_*^\top\bar{\boldsymbol{\Sigma}}^{-1}\bar{\mathbf{w}},\; (k_{**} + \eta^2)\mathbf{I}_D - \bar{\mathbf{K}}_*^\top\bar{\boldsymbol{\Sigma}}^{-1}\bar{\mathbf{K}}_*\big)\, \mathcal{N}\big(\bar{\mathbf{w}} \mid \bar{\boldsymbol{\Sigma}}\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\mathbf{y},\; \bar{\boldsymbol{\Sigma}} - \bar{\boldsymbol{\Sigma}}\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\bar{\mathbf{X}}\bar{\boldsymbol{\Sigma}}\big)\, d\bar{\mathbf{w}}.$   (28)

This integral can be obtained in a closed form, as shown in [5, Eqs. (2.113)–(2.117)]. Therefore, we can obtain the predictive distribution of $\mathbf{w}_*$ as follows:

$p(\mathbf{w}_* \mid \mathbf{s}_*, \mathcal{D}) = \mathcal{N}\big(\mathbf{w}_* \mid \bar{\mathbf{K}}_*^\top\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\mathbf{y},\; (k_{**} + \eta^2)\mathbf{I}_D - \bar{\mathbf{K}}_*^\top\bar{\mathbf{X}}^\top\mathbf{C}^{-1}\bar{\mathbf{X}}\bar{\mathbf{K}}_*\big).$   (29)

Appendix C Specification of Datasets

Dataset | #Samples | #Features
Digits | 1,797 | 64
Abalone | 4,177 | 10
Diabetes | 442 | 10
Boston | 506 | 13
Fish | 908 | 6
Wine | 6,497 | 11
Paper | 399 | 2,990
Drug | 3,989 | 2,429
Table 5: Specification of datasets.

We considered eight datasets from the UCI machine learning repository [13], which were referred to as Digits [27], Abalone [1], Diabetes [11], Boston [19], Fish [31], Wine [43], Paper [28], and Drug [12] in our experiments.

The first six datasets consist of tabular data. We treated the original inputs and simplified inputs identically in our experiments. The Digits dataset was originally developed as a classification dataset for recognizing handwritten digits from zero to nine. As described in Section 5.1, we used this dataset for a regression problem by transforming the digit labels into binary values of $+1$ or $-1$. Here, we used only the testing set from the original Digits dataset because that is how scikit-learn [29] distributes this dataset. The Abalone dataset is for predicting the age of abalone based on physical measurements. The Diabetes dataset is for predicting the onset of diabetes based on diagnostic measures. The Boston dataset is for predicting house prices, as described in Appendix A. The Fish dataset is for predicting acute aquatic toxicity toward the fish Pimephales promelas for a set of chemicals. The Wine dataset is for predicting the quality of white and red wines based on physicochemical tests. The remaining two datasets are text datasets. The Paper dataset is for predicting evaluation scores for papers based on review texts written mainly in Spanish. The Drug dataset is a drug review dataset for predicting 10-star ratings for drugs based on patient review texts. For each of these two datasets, $\mathbf{s}$ and $\mathbf{x}$ are different. Specifically, we used the 512-dimensional sentence vectors obtained using sentence transformers [33] as $\mathbf{s}$ and used bag-of-words binary vectors of the sentences as $\mathbf{x}$, where the cutoff frequencies for words were set to two and five for the Paper and Drug datasets, respectively. Table 5 lists the number of samples and the number of features in each dataset.

Appendix D Detailed Description of Comparing Methods

In this appendix, we describe the implementation and hyperparameter search methods used for comparing methods.

We implemented GPR using PyTorch v1.5.0 (https://pytorch.org/). All hyperparameters for GPR were estimated by maximizing the marginal likelihood [32], where we initialized the hyperparameters to the same values as those for GPX. For Lasso and Ridge, we used the implementations provided by scikit-learn [29]. The hyperparameters that regularize the strengths of the $\ell_1$ and $\ell_2$ regularizers in Lasso and Ridge, respectively, were optimized through a grid search using functions provided by scikit-learn (i.e., sklearn.linear_model.LassoCV and sklearn.linear_model.RidgeCV) with the default options. The search range for the Lasso hyperparameter was limited to 100 grid points such that the ratio of its minimum value to its maximum value was capped at 0.001, while that for Ridge was limited to a fixed range. For the localized lasso, we used the original implementation written in Python (https://riken-yamada.github.io/localizedlasso.html). The hyperparameters and their search ranges for the localized lasso are the strength of the network regularization, the strength of the exclusive regularizer, and $k$ for the $k$-nearest-neighbor graph. These hyperparameters were optimized through a grid search. The network lasso is a special case of the localized lasso: if the exclusive regularization strength of the localized lasso is zero, then the localized lasso is identical to the network lasso. Therefore, we used the implementation of the localized lasso with this strength set to zero for the network lasso. The hyperparameter search for the network lasso was the same as that for the localized lasso, except for this setting. For LIME and Kernel SHAP, we used the original implementations (LIME: https://github.com/marcotcr/lime, Kernel SHAP: https://github.com/slundberg/shap).

Appendix E Additional Results for the Digits Dataset

Table 6 shows the computational times of each method on the Digits dataset. First, the training times of GPX and GPR were much the same, and GPX was significantly faster than the localized and network lasso. Since the localized and network lasso require a hyperparameter search, their actual training times were about 48 and 12 times longer than the times shown in the table, respectively. Second, the prediction time of GPX was significantly faster than that of GPR+LIME/SHAP. In total, GPX was the fastest among the comparing methods.

Figure 5 presents additional examples of estimated weights for the Digits dataset. We found that the weights estimated by GPX were appropriately assigned such that the regions of black pixels have weights with the same signs as those of the target variables.

In terms of the stability of explanations, estimated weights for the same digit should be similar. Figure 6 presents three examples of estimated weights for the digit two. We found that GPX estimated similar weights for all three examples.

Phase | GPX | GPR+LIME | GPR+SHAP | Localized | Network
Training | 5.74 | 5.53 | 5.76 | 105.84 | 106.16
Prediction | 24.57 | 155.54 | 2643.06 | 1.59 | 1.61
Total | 30.31 | 161.07 | 2648.82 | 107.43 | 107.77
Table 6: Training, prediction and total times (seconds) of each method on the Digits dataset. Here, the training times of GPR+LIME/SHAP are those of GPR, while the prediction times of GPR+LIME/SHAP are those of producing explanations for all the test samples.
Figure 5: Examples of estimated weights for digits ranging from zero to nine for the Digits dataset. The five upper rows present the weights of samples with digits from zero to four ($y = -1$), whereas the five bottom rows present those for samples with digits from five to nine ($y = +1$). Red and blue denote positive and negative weights, respectively, and their color strengths represent their magnitudes.
Figure 6: Different examples of estimated weights for digit two for the Digits dataset.