1 Introduction
Gaussian processes (GPs) have been well studied for constructing probabilistic models as priors over nonlinear functions in the machine learning (ML) community. They have demonstrated great success in various problem settings, such as regression Rasmussen (2003); Wilson et al. (2012), classification Rasmussen (2003); Csató et al. (2000), time-series forecasting Roberts et al. (2013), and black-box optimization Snoek et al. (2012). A fundamental GP-based model is Gaussian process regression (GPR) Rasmussen (2003); owing to its high predictive performance and its versatility in handling various data structures via kernels, it has been used not only in the ML community, but also in various other research areas, such as finance Gonzalvez et al. (2019), geostatistics Camps-Valls et al. (2016), materials science Zhang et al. (2020), and medical science Cheng et al. (2017); Futoma (2018).
GPR is defined on an infinite-dimensional feature space via kernel functions. Therefore, it requires the values of kernels defined on pairs of samples, i.e., a Gram matrix, as an input, rather than the samples themselves. Owing to the nonlinearity of the kernel, GPR enables nonlinear predictions; however, unlike linear regression models, it cannot explain which features contribute to its predictions. Therefore, users of GPR cannot judge whether the predictions are reasonable and made fairly.
For the interpretability of ML, several methodologies have been proposed that explain which features contribute to the outputs of prediction models, including GPR; in this case, the prediction models are regarded as black boxes Molnar (2019); Chen et al. (2018). Representative methods are local interpretable model-agnostic explanations (LIME) Ribeiro et al. (2016) and Shapley additive explanations (SHAP) Lundberg and Lee (2017), which approximate the prediction for each test sample by a locally linear explanation model. By observing the weights of the learned explanation model, the feature contributions to the prediction can be understood. However, these methods have some limitations. First, because the forms of the prediction and explanation models differ, it is unclear whether the estimated feature contributions reflect those of the prediction model. Furthermore, because the explanation model is learned on each test sample independently, it may not produce consistent explanations for similar samples.
To overcome the aforementioned limitations, we propose a novel framework for GP-based regression models, Gaussian process regression with local explanation, called GPX, which reveals the feature contributions to the prediction for each sample while maintaining the predictive performance of GPR. In GPX, both the prediction and explanation for each sample are performed using an easy-to-interpret locally linear model. Therefore, no gap exists between the prediction and explanation. The weight vector of the locally linear model is assumed to be generated from multivariate GP priors Álvarez et al. (2012). As the multivariate GP priors have a covariance function defined as kernels on the samples, GPX ensures that similar samples have similar weights. The hyperparameters of GPX are estimated by maximizing the marginal likelihood, in which the weight vectors for all the training samples are integrated out. For a test sample, the predictions with their uncertainties of the target variable and weight vector are obtained by computing their predictive distributions. The explanation for the predicted target variable is provided using the estimated weight vector with uncertainty, as shown in Figure 1. Depicting the explanation with uncertainty helps users of GPX judge the reasonableness of the predicted weights.
In experiments, we evaluated GPX both qualitatively and quantitatively in terms of predictive performance and interpretability on various benchmark datasets. The experimental results show that 1) GPX can achieve predictive errors comparable to GPR and lower errors than existing interpretable methods, 2) it can outperform model-agnostic interpretable methods and locally linear methods in terms of three interpretability measurements, and 3) the feature contributions produced by GPX are appropriate.
2 Related Work
Linear regression models are simple types of interpretable models. A number of studies introducing various regularizations have been conducted to produce methods such as the ridge regression
Hoerl and Kennard (1970) and lasso Tibshirani (1996) methods. These models have global weight vectors that are shared across all samples. In kernel methods, automatic relevance determination (ARD), which considers the global relevance of each feature contained in kernel functions, is adopted Neal (2012); Wipf and Nagarajan (2008). The above approaches have a significant issue in explainability: global weights/relevances cannot cope with a feature whose weight/relevance is inconsistent, for example, one whose weight/relevance changes under the influence of another feature. On the other hand, some locally linear models for regression have been proposed, such as the network lasso Hallac et al. (2015) and the localized lasso Yamada et al. (2017), which have a weight vector for each sample and can therefore avoid this drawback of globally linear models. To obtain this benefit, we focus on generating predictions with explanations using locally linear models. In the network and localized lasso, the weights of the locally linear models are estimated via optimization with network-based regularization, where the network must be defined on the samples in advance. If no network is provided, as assumed in standard regression problems, one can construct a nearest-neighbor graph of the samples to create a network, where the number of neighbors is a hyperparameter that must be optimized via cross-validation. Meanwhile, GPX can estimate weights and their uncertainties without constructing graphs by assuming that the weights are determined by functions generated from GPs.
With regard to research on deep neural networks (DNNs), a number of studies have been conducted on making predictions generated by DNNs interpretable
Chen et al. (2019); Arras et al. (2017); Ying et al. (2019). Some of these studies have developed methods that make interpretable predictions by generating locally linear models for each sample using DNNs Melis and Jaakkola (2018); Schwab et al. (2019); Yoshikawa and Iwata (2020). These concepts inspired our study, but we formalize our model without DNNs. To the best of our knowledge, our study is the first to develop a GP-based regression model with local explanations. Unlike DNN-based locally linear models, GPX is beneficial for tasks in which GPR can be used advantageously, such as Bayesian optimization Snoek et al. (2012); Golovin et al. (2017).

3 Proposed Model
In this section, we describe the proposed model, i.e., Gaussian process regression with local explanation, called GPX.
We consider a scalar-valued regression problem. Suppose that training data $\mathcal{D} = \{(\bar{\mathbf{x}}_n, \mathbf{x}_n, y_n)\}_{n=1}^{N}$ containing $N$ samples is provided. $\bar{\mathbf{x}}_n \in \bar{\mathcal{X}}$ is an original input representing the $n$-th sample, where $\bar{\mathcal{X}}$ is an original input space. Although a typical representation for $\bar{\mathbf{x}}_n$ is a vector, it can be any data representation on which kernel functions are defined, such as graphs Vishwanathan et al. (2010) and sets Muandet et al. (2012); Yoshikawa et al. (2014). $y_n \in \mathbb{R}$ is a target variable for the sample. $\mathbf{x}_n \in \mathbb{R}^{D}$ is a $D$-dimensional vector of simplified representation for $\bar{\mathbf{x}}_n$. Because GPX explains the prediction via the simplified representation, the meaning of each dimension of $\mathbf{x}_n$ should be easily understood by humans, e.g., tabular data and bag-of-words representations for text. $\mathbf{x}_n$ is an optional input; that is, if $\bar{\mathbf{x}}_n$ can be used as a simplified representation, one can define $\mathbf{x}_n = \bar{\mathbf{x}}_n$. Let us denote $\bar{\mathbf{X}} = (\bar{\mathbf{x}}_1, \dots, \bar{\mathbf{x}}_N)$, $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)^\top$, and $\mathbf{y} = (y_1, \dots, y_N)^\top$.
In GPX, both the prediction of target variables and their explanations are performed via easy-to-interpret locally linear models; i.e., target variable $y_n$ for the $n$-th sample is assumed to be obtained using a locally linear function, defined as follows:

$$ y_n = \mathbf{w}_n^\top \mathbf{x}_n + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), \tag{1} $$

where $\mathbf{w}_n$ is a $D$-dimensional weight vector for the $n$-th sample, and $\epsilon$ is a Gaussian noise with variance $\sigma^2$.
Here, the explanation for the $n$-th sample is obtained using either weight vector $\mathbf{w}_n$ or feature contributions $\mathbf{w}_n \odot \mathbf{x}_n$, where $\odot$ denotes the elementwise product. Estimating the weights without any constraints is an ill-posed problem because the number of free parameters, $ND$, is larger than the number $N$ of target variables. To avoid this problem in GPX, we assume that the functions determining the weights are generated from a multivariate GP. More specifically, weight vector $\mathbf{w}_n$ for the $n$-th sample is obtained as follows:

$$ \mathbf{w}_n = \mathbf{g}(\bar{\mathbf{x}}_n) + \boldsymbol{\eta}, \qquad \boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \rho^2 \mathbf{I}_D), \tag{2} $$

where $\boldsymbol{\eta}$ is a $D$-dimensional Gaussian noise with variance $\rho^2$, and $\mathbf{I}_D$ is an identity matrix of order $D$. Here, vector-valued function $\mathbf{g}(\cdot) = (g_1(\cdot), \dots, g_D(\cdot))^\top$ determines the weight vector for each sample, and each element $g_d$ of $\mathbf{g}$ is generated from a univariate GP independently, as follows:

$$ g_d \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot, \cdot; \boldsymbol{\theta})\big). \tag{3} $$
$m(\cdot)$ is the mean function, and $k(\cdot, \cdot; \boldsymbol{\theta})$ is the covariance function with set of parameters $\boldsymbol{\theta}$. We use the zero mean function for $m$, which, like standard regularizers such as the $\ell_2$ penalty, encourages the weights to shrink toward zero. Covariance function $k$ is a kernel function defined on two inputs $\bar{\mathbf{x}}$ and $\bar{\mathbf{x}}'$. For example, one can use a scaled RBF kernel with parameters $\boldsymbol{\theta} = (\alpha, \beta)$ as the kernel function when the original inputs are vectors, defined as follows:

$$ k(\bar{\mathbf{x}}, \bar{\mathbf{x}}'; \boldsymbol{\theta}) = \alpha \exp\left( -\frac{\lVert \bar{\mathbf{x}} - \bar{\mathbf{x}}' \rVert_2^2}{2\beta^2} \right). \tag{4} $$
By using weight vectors generated as such, GPX ensures that two similar samples, i.e., those having a large kernel value, have similar weight vectors.
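To make the generative process (1)–(4) concrete, the following NumPy sketch samples weights and targets from it. The sample sizes, noise variances, and the median-heuristic bandwidth below are illustrative choices, not values from the paper; here the original inputs equal the simplified ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples with D-dimensional simplified inputs (x_bar = x).
N, D = 50, 3
X = rng.normal(size=(N, D))

# Scaled RBF kernel (4); alpha is the scale, beta the bandwidth.
# beta is set by the median heuristic: the median of pairwise distances.
dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
alpha, beta = 1.0, np.median(dists[dists > 0])
K = alpha * np.exp(-dists ** 2 / (2 * beta ** 2))

sigma2, rho2 = 0.1, 0.1  # noise variances in (1) and (2)

# (3): each dimension of g is drawn from an independent zero-mean GP.
G = rng.multivariate_normal(np.zeros(N), K, size=D).T        # N x D
# (2): per-sample weight vectors w_n = g(x_bar_n) + eta.
W = G + rng.normal(scale=np.sqrt(rho2), size=(N, D))
# (1): locally linear predictions y_n = w_n^T x_n + eps.
y = (W * X).sum(axis=1) + rng.normal(scale=np.sqrt(sigma2), size=N)
```

Each row of `W` is both the local predictor and the local explanation for its sample, which is the property GPX exploits.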
We let $\mathbf{W} = (\mathbf{w}_1, \dots, \mathbf{w}_N)^\top$ and $\mathbf{G} = (\mathbf{g}(\bar{\mathbf{x}}_1), \dots, \mathbf{g}(\bar{\mathbf{x}}_N))^\top$. Based on the generative process above, the joint distribution of GPX is written as follows:

$$ p(\mathbf{y}, \mathbf{W}, \mathbf{G} \mid \bar{\mathbf{X}}, \mathbf{X}) = p(\mathbf{y} \mid \mathbf{W}, \mathbf{X})\, p(\mathbf{W} \mid \mathbf{G})\, p(\mathbf{G} \mid \bar{\mathbf{X}}), \tag{5} $$

$$ p(\mathbf{y} \mid \mathbf{W}, \mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{w}_n^\top \mathbf{x}_n, \sigma^2), \qquad p(\mathbf{W} \mid \mathbf{G}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{w}_n \mid \mathbf{G}_{n,:}, \rho^2 \mathbf{I}_D), \tag{6} $$

$$ p(\mathbf{G} \mid \bar{\mathbf{X}}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{G}_{:,d} \mid \mathbf{0}, \mathbf{K}), \tag{7} $$

where $\mathbf{G}_{n,:}$ and $\mathbf{G}_{:,d}$ denote the $n$-th row and $d$-th column vectors of $\mathbf{G}$, respectively, and $\mathbf{K}$ is a Gram matrix of order $N$ whose $(n, n')$-th entry is $k(\bar{\mathbf{x}}_n, \bar{\mathbf{x}}_{n'}; \boldsymbol{\theta})$, which is identical to the input required in GPR.
4 Training and Prediction
In this section, we describe the derivation of the marginal likelihood of GPX, the hyperparameter estimation for GPX, and the derivation of the predictive distributions of target variables and weight vectors for test samples.
Marginal likelihood. To ease the derivation of the marginal likelihood, we first rewrite the joint distribution (5), while maintaining its mathematical meaning, as follows:

$$ p(\mathbf{y}, \mathbf{w}, \mathbf{g} \mid \bar{\mathbf{X}}, \mathbf{X}) = \mathcal{N}(\mathbf{y} \mid \tilde{\mathbf{X}}\mathbf{w}, \sigma^2 \mathbf{I}_N)\, \mathcal{N}(\mathbf{w} \mid \mathbf{g}, \rho^2 \mathbf{I}_{ND})\, \mathcal{N}(\mathbf{g} \mid \mathbf{0}, \mathbf{K} \otimes \mathbf{I}_D), \tag{8} $$

where $\mathbf{w} = \mathrm{vec}(\mathbf{W}^\top)$, $\mathbf{g} = \mathrm{vec}(\mathbf{G}^\top)$, $\tilde{\mathbf{X}}$ is an $N \times ND$ block diagonal matrix whose $n$-th diagonal block is $\mathbf{x}_n^\top$, and $\mathrm{vec}(\cdot)$ is a function that flattens the input matrix in column-major order. Below, $\mathrm{diag}(\cdot)$ denotes a diagonal matrix whose diagonal elements possess the values of the input vector. In (5), functions that output $D$-dimensional column vectors are generated from GPs; in (8), this is rewritten such that a single function that outputs an $ND$-dimensional flattened vector is generated from a single GP. Consequently, the likelihood of target variables $\mathbf{y}$ can be rewritten as a single multivariate normal distribution.

Subsequently, we derive the marginal likelihood by integrating out $\mathbf{w}$ and $\mathbf{g}$ in (8). Owing to the properties of normal distributions, it can be obtained analytically, as follows:

$$ p(\mathbf{y} \mid \bar{\mathbf{X}}, \mathbf{X}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{C}), \tag{9} $$

where $\mathbf{C} = \tilde{\mathbf{X}} (\mathbf{K} \otimes \mathbf{I}_D) \tilde{\mathbf{X}}^\top + \rho^2\, \mathrm{diag}(\lVert \mathbf{x}_1 \rVert_2^2, \dots, \lVert \mathbf{x}_N \rVert_2^2) + \sigma^2 \mathbf{I}_N$, using $\tilde{\mathbf{X}}\tilde{\mathbf{X}}^\top = \mathrm{diag}(\lVert \mathbf{x}_1 \rVert_2^2, \dots, \lVert \mathbf{x}_N \rVert_2^2)$.
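Under this reconstruction, the marginal likelihood can be evaluated without ever forming an $ND \times ND$ matrix, since $\tilde{\mathbf{X}} (\mathbf{K} \otimes \mathbf{I}_D) \tilde{\mathbf{X}}^\top$ equals the elementwise product $\mathbf{K} \circ (\mathbf{X}\mathbf{X}^\top)$. The helper below is a minimal sketch; the function name and toy inputs are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gpx_log_marginal_likelihood(K, X, y, sigma2, rho2):
    """Log of (9): y ~ N(0, C), with the ND-dimensional weights
    integrated out analytically.

    Uses the identity X_tilde (K kron I_D) X_tilde^T = K * (X X^T)
    (elementwise product), so only N x N matrices are formed.
    """
    N = X.shape[0]
    C = (K * (X @ X.T)                      # GP-prior part of the covariance
         + rho2 * np.diag((X ** 2).sum(1))  # weight noise: rho^2 ||x_n||^2
         + sigma2 * np.eye(N))              # observation noise
    return multivariate_normal(mean=np.zeros(N), cov=C).logpdf(y)
```

Maximizing this quantity with respect to the hyperparameters corresponds to the estimation procedure described in the next paragraph.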
Hyperparameter estimation. If $k$ is differentiable with respect to $\boldsymbol{\theta}$, all the hyperparameters, i.e., $\sigma^2$, $\rho^2$, and $\boldsymbol{\theta}$, can be estimated by maximizing the logarithm of the marginal likelihood (9) with respect to them on the training data using gradient-based optimization methods, e.g., L-BFGS Liu and Nocedal (1989).
Predictive distributions. For a new test sample $(\bar{\mathbf{x}}_*, \mathbf{x}_*)$, our goal is to infer the predictive distributions of target variable $y_*$ and weight vector $\mathbf{w}_*$. First, the predictive distribution of $y_*$ is obtained similarly as in standard GPR, as follows:

$$ p(y_* \mid \bar{\mathbf{x}}_*, \mathbf{x}_*, \mathcal{D}) = \mathcal{N}\left( y_* \mid \mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{y},\; c_* - \mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{k}_* \right), \tag{10} $$

where $\mathbf{k}_*$ is the $N$-dimensional vector whose $n$-th entry is $k(\bar{\mathbf{x}}_*, \bar{\mathbf{x}}_n; \boldsymbol{\theta})\, \mathbf{x}_*^\top \mathbf{x}_n$, and $c_* = \left( k(\bar{\mathbf{x}}_*, \bar{\mathbf{x}}_*; \boldsymbol{\theta}) + \rho^2 \right) \lVert \mathbf{x}_* \rVert_2^2 + \sigma^2$.
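A minimal NumPy sketch of (10) as reconstructed above; the function and argument names are ours, and `C`, `k_star`, `c_star` follow the definitions in the text.

```python
import numpy as np

def gpx_predict_y(Xtr, ytr, xs, kvec, kss, C, sigma2, rho2):
    """Predictive mean and variance of y_* following (10).

    Xtr  : N x D matrix of simplified training inputs
    xs   : D-dimensional simplified test input x_*
    kvec : kernel values k(x_bar_*, x_bar_n) against the N training samples
    kss  : k(x_bar_*, x_bar_*)
    C    : the N x N covariance matrix from the marginal likelihood (9)
    """
    k_star = kvec * (Xtr @ xs)                  # n-th entry: k(.,.) x_*^T x_n
    c_star = (kss + rho2) * (xs @ xs) + sigma2  # prior variance of y_*
    mean = k_star @ np.linalg.solve(C, ytr)
    var = c_star - k_star @ np.linalg.solve(C, k_star)
    return mean, var
```

As in GPR, only linear solves against the $N \times N$ matrix $\mathbf{C}$ are needed.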
Second, the predictive distribution of $\mathbf{w}_*$ is obtained by solving the following integral:

$$ p(\mathbf{w}_* \mid \bar{\mathbf{x}}_*, \mathcal{D}) = \int p(\mathbf{w}_* \mid \mathbf{g}_*)\, p(\mathbf{g}_* \mid \mathbf{y}, \bar{\mathbf{X}}, \mathbf{X})\, d\mathbf{g}_*, \tag{11, 12} $$

where $\mathbf{g}_* = \mathbf{g}(\bar{\mathbf{x}}_*)$, and $\tilde{\mathbf{K}}_*$ below denotes the $D \times N$ matrix whose $n$-th column is $k(\bar{\mathbf{x}}_*, \bar{\mathbf{x}}_n; \boldsymbol{\theta})\, \mathbf{x}_n$. Solving the integral analytically according to the properties of normal distributions, we obtain

$$ p(\mathbf{w}_* \mid \bar{\mathbf{x}}_*, \mathcal{D}) = \mathcal{N}\left( \mathbf{w}_* \mid \tilde{\mathbf{K}}_* \mathbf{C}^{-1} \mathbf{y},\; \left( k(\bar{\mathbf{x}}_*, \bar{\mathbf{x}}_*; \boldsymbol{\theta}) + \rho^2 \right) \mathbf{I}_D - \tilde{\mathbf{K}}_* \mathbf{C}^{-1} \tilde{\mathbf{K}}_*^\top \right). \tag{13} $$
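A sketch of (13) in the same notation; $\tilde{\mathbf{K}}_*$ and the function name are our reconstruction. Internal consistency with (10) holds because $y_* = \mathbf{w}_*^\top \mathbf{x}_* + \epsilon$.

```python
import numpy as np

def gpx_predict_w(Xtr, ytr, kvec, kss, C, rho2):
    """Predictive mean and covariance of the weight vector w_* per (13).

    Kt is the D x N matrix whose n-th column is k(x_bar_*, x_bar_n) x_n;
    only linear solves against the N x N matrix C are required.
    """
    Kt = kvec[None, :] * Xtr.T           # D x N
    A = np.linalg.solve(C, Kt.T)         # C^{-1} Kt^T, shape N x D
    mean = A.T @ ytr                     # Kt C^{-1} y
    cov = (kss + rho2) * np.eye(Xtr.shape[1]) - Kt @ A
    return mean, cov
```

The diagonal of the returned covariance gives the per-feature uncertainty that GPX displays alongside the explanation.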
We provide the detailed derivations of predictive distributions (10) and (13) in Appendix B of the supplementary material.

Accordingly, the marginal likelihood (9) and the predictive distribution (10) for the target variable are similar to those of GPR, except that GPX can additionally obtain the predictive distribution (13) for the weight vector. Since GPX can be used with the same input as GPR if $\mathbf{x}_n = \bar{\mathbf{x}}_n$, it can be employed in existing ML models in place of GPR.
Computational efficiency. As with ordinary GPR, the computational cost of GPX is dominated by inverse computations. Computing the predictive distributions naively involves inverting a square matrix of order $ND$. However, because that matrix is, up to a reordering, block diagonal with every diagonal block being the same square matrix of order $N$, its inverse can be obtained by inverting that block only once. The remaining inverse matrix, $\mathbf{C}^{-1}$, is of order $N$. Therefore, all the inverse matrices appearing in GPX can be obtained using a naive implementation with a computational complexity of $O(N^3)$, which is the same as in GPR. To significantly reduce the computational cost, efficient computation methods for GPR, such as the inducing variable method Titsias (2009) and KISS-GP Wilson and Nickisch (2015), can be used for GPX. In addition, because $\tilde{\mathbf{X}}$ and $\mathbf{K} \otimes \mathbf{I}_D$ are sparse matrices, one can obtain the predictive distributions efficiently using libraries for sparse matrix computation.
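The structure that makes this cheap can be checked numerically. Assuming the reconstruction above, the quadratic form over the $ND \times ND$ Kronecker matrix collapses to an elementwise product of $N \times N$ matrices, and the Kronecker inverse reduces to a single $N \times N$ inverse:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
N, D = 6, 4
X = rng.normal(size=(N, D))
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)  # a well-conditioned Gram-like matrix

# X_tilde: N x ND block diagonal matrix whose n-th diagonal block is x_n^T.
X_tilde = block_diag(*[x[None, :] for x in X])

# The quadratic form over the ND x ND Kronecker matrix collapses to an
# elementwise (Hadamard) product of two N x N matrices.
lhs = X_tilde @ np.kron(K, np.eye(D)) @ X_tilde.T
rhs = K * (X @ X.T)
assert np.allclose(lhs, rhs)

# Inverting K kron I_D only requires inverting the N x N matrix K once.
assert np.allclose(np.linalg.inv(np.kron(K, np.eye(D))),
                   np.kron(np.linalg.inv(K), np.eye(D)))
```

Both assertions pass for any valid Gram matrix, which is why the overall cost stays at $O(N^3)$ rather than $O((ND)^3)$.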
5 Experiments
In this section, we demonstrate the effectiveness of the proposed model, GPX, quantitatively and qualitatively by comparing it with various interpretable models. In the quantitative evaluation, we assessed the models from the following perspectives:

- Accuracy: How accurate are the predictions of the interpretable model?
- Faithfulness: Are the estimated feature contributions indicative of "true" importance?
- Sufficiency: Do the most important features account for the prediction?
- Stability: How consistent are the explanations for similar or neighboring samples?
In addition, we qualitatively evaluated whether the feature contributions produced by the models were appropriate by visualizing them.
All the experiments were done on a computer with an Intel Xeon Gold 6132 2.6 GHz CPU with 16 cores and 120 GB of main memory.
5.1 Preparation
Datasets.
We used eight datasets from the UCI machine learning repository Dua and Graff (2017), referred to as Digits [27], Abalone [1], Diabetes [11], Boston Harrison Jr and Rubinfeld (1978), Fish [31], Wine [43], Paper [28], and Drug [12] in our experiments. We provide the details of the datasets in Appendix C of the supplementary material. The Digits dataset is originally a classification dataset for recognizing handwritten digits from 0 to 9. To use it as a regression problem, we transformed the labels into scalar-valued target variables, assigning one value to the labels from 0 to 4 and another to the remaining labels. For the Paper and Drug datasets, whose samples are sentences, the original input $\bar{\mathbf{x}}$ and the simplified input $\mathbf{x}$ differed; i.e., we used the 512-dimensional sentence vectors obtained using Sentence Transformers Reimers and Gurevych (2020) as $\bar{\mathbf{x}}$, while we used bag-of-words binary vectors for the sentences as $\mathbf{x}$. Each of the remaining datasets had $\bar{\mathbf{x}} = \mathbf{x}$. In all the datasets, the values of $\mathbf{x}$ and $y$ were standardized before training and prediction. For the quantitative evaluation on each dataset, we report average scores over five runs with different training/test splits, where the training set was 80% of the entire dataset and the remainder was the test set.
GPX setup. In GPX, we consistently used the scaled RBF kernel defined in (4). The hyperparameters of GPX were estimated using the method described in Section 4, with $\sigma^2$, $\rho^2$, and $\alpha$ initialized to fixed values, and bandwidth parameter $\beta$ initialized using the median heuristic
Garreau et al. (2017).

Comparing methods. We compared GPX with several methods having globally or locally linear weights that can be used as interpretable feature contributions for predictions. Lasso Tibshirani (1996) and Ridge Hoerl and Kennard (1970) are standard linear regression models with $\ell_1$ and $\ell_2$ regularizers, respectively, whose weights are globally shared across all samples. The network lasso ("Network" for short) is a locally linear model that regularizes the weights of nodes such that neighboring nodes in a network have similar weights Hallac et al. (2015). In our case, each node represents a sample, and the network is a nearest-neighbor graph on the samples based on the cosine similarity of $\bar{\mathbf{x}}$. The localized lasso ("Localized" for short) is an extension of the network lasso; it can estimate sparse and exclusive weights for each sample by further incorporating an $\ell_{1,2}$ regularizer into the network lasso Yamada et al. (2017). Furthermore, to compare model-agnostic interpretable methods with GPX, we used LIME Ribeiro et al. (2016) and Kernel SHAP Lundberg and Lee (2017), which produce a locally linear model for each test sample to explain the prediction of a black-box prediction model. For a fair comparison, we used GPR as the prediction model. The hyperparameters of GPR were estimated similarly as for GPX, whereas those of the remaining comparing methods were optimized by grid search. We provide a detailed description of the comparing methods in Appendix D of the supplementary material.

5.2 Results
GPX (ours)  Lasso  Ridge  Localized  Network  GPR  

Digits  0.078 0.010  0.399 0.028  0.398 0.024  0.135 0.042  0.163 0.033  0.074 0.008 
Abalone  0.428 0.036  0.477 0.052  0.477 0.053  0.519 0.023  0.534 0.016  0.427 0.034 
Diabetes  0.493 0.041  0.504 0.039  0.503 0.040  0.610 0.062  0.667 0.091  0.490 0.048 
Boston  0.116 0.053  0.293 0.078  0.284 0.081  0.208 0.070  0.233 0.062  0.116 0.052 
Fish  0.370 0.061  0.437 0.055  0.437 0.055  0.479 0.051  0.523 0.054  0.375 0.066 
Wine  0.579 0.048  0.723 0.039  0.712 0.044  NA  NA  0.605 0.046 
Paper  0.806 0.054  0.821 0.047  0.936 0.057  0.981 0.058  0.919 0.088  0.762 0.087 
Drug  0.835 0.027  0.875 0.037  0.911 0.036  NA  NA  0.844 0.033 
Table 1: Average and standard deviation of mean squared errors (MSEs) for the predictions of target variables on each dataset (lower is better). Bold typeface indicates the best value on each dataset, whereas an underlined value is the best under a paired t-test at the chosen significance level. "NA" indicates that the performance could not be measured owing to memory or computational time limitations. As GPR is not an interpretable model, its performance is included in this table only to show that GPX and GPR perform similarly.

Accuracy. First, we demonstrate the predictive performance of GPX and the comparing methods in Table 1. We found that GPX achieved the lowest predictive errors on all the datasets compared with the globally or locally linear models. In addition, the predictive errors of GPX and GPR were comparable on all the datasets. This result indicates that GPR can be replaced by GPX while achieving similar predictive performance.
GPX (ours)  GPR+LIME  GPR+SHAP  Localized  Network  

Digits  0.888 0.003  0.384 0.038  0.651 0.013  0.300 0.029  0.352 0.021 
Abalone  0.898 0.017  0.775 0.024  NA  0.432 0.042  0.497 0.033 
Diabetes  0.966 0.008  0.844 0.027  0.928 0.010  0.340 0.074  0.365 0.086 
Boston  0.898 0.026  0.693 0.035  0.869 0.009  0.562 0.075  0.525 0.039 
Fish  0.902 0.016  0.672 0.043  0.826 0.033  0.480 0.046  0.394 0.090 
Faithfulness. Assessing the correctness of the estimated contribution of each feature to a prediction requires a reference "true" contribution for comparison. As this is rarely available, a typical approach for measuring the faithfulness of the contributions produced by interpretable models is to rely on a proxy notion of contribution: observing the effect of removing features on the model's prediction. Following previous studies Melis and Jaakkola (2018); Bhatt et al. (2020), we computed the faithfulness score by removing features one by one, measuring the differences between the original predictions and the predictions from the inputs without the removed features, and computing the correlation between these differences and the contributions of the removed features.
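The faithfulness computation described above can be sketched as follows; the zero removal baseline and the function names are our assumptions, not specifics from the paper.

```python
import numpy as np

def faithfulness_score(predict, x, contributions, baseline=0.0):
    """Correlation between each feature's estimated contribution and the
    drop in the prediction when that feature is removed.

    predict       : callable mapping a 1-D input to a scalar prediction
    contributions : estimated per-feature contributions (e.g., w * x)
    baseline      : value a "removed" feature is replaced with
    """
    original = predict(x)
    drops = []
    for d in range(len(x)):
        x_removed = x.copy()
        x_removed[d] = baseline
        drops.append(original - predict(x_removed))
    return np.corrcoef(np.asarray(drops), contributions)[0, 1]
```

For a locally linear model such as GPX, removing feature $d$ changes the prediction by exactly $w_d x_d$, so the score is 1 by construction, which matches the argument made in the following paragraph.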
Table 2 shows the faithfulness scores of GPX and the comparing methods. Here, we denote the results of LIME and Kernel SHAP using GPR as the black-box prediction model by GPR+LIME and GPR+SHAP, respectively. We found that GPX achieved the best faithfulness scores on all the datasets. As GPX predicts and explains using a single locally linear model for each test sample, when a feature is removed from the input, its contribution is subtracted from the prediction directly. Meanwhile, because GPR+LIME and GPR+SHAP have different prediction and explanation models, a gap may exist between the estimated contribution in the explanation model and the latent contribution in the prediction. Because the predictions by GPX and GPR were performed using similar calculations, their faithfulness differences were likely due to this gap.
Sufficiency. In general, inputs contain many irrelevant features that do not contribute to the predictions, and discovering the important ones among all the features is difficult for users of the models. Therefore, a desirable property of interpretable models is that they assign high contributions only to the features that substantially affect the predictions. To quantify how well each method satisfies this property, we define the sufficiency score at $K$, where $K$ is the number of important features. In particular, the sufficiency score at $K$ was computed by identifying the $K$ important features in descending order of the absolute values of their estimated contributions, predicting from inputs having only these important features, and comparing the results against the original predictions. Because the number of important features varied according to the sample and dataset, we evaluated the score at multiple values of $K$.
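One plausible realization of the sufficiency score at $K$, assuming removed features are zeroed out and the per-sample score is the absolute prediction gap; the paper's exact comparison may differ.

```python
import numpy as np

def sufficiency_score(predict, x, contributions, n_keep):
    """Gap between the original prediction and the prediction made from
    only the top-K features ranked by absolute estimated contribution;
    smaller means the selected features are more sufficient."""
    top = np.argsort(-np.abs(contributions))[:n_keep]
    x_top = np.zeros_like(x)
    x_top[top] = x[top]
    return abs(predict(x) - predict(x_top))
```

For a locally linear model, the gap equals the absolute sum of the dropped contributions, so sparse weight vectors tend to score well at small $K$.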
Figure 2 shows the sufficiency scores of GPX and the comparing methods. We found that GPX outperformed the others on the Digits dataset, whereas GPX, GPR+LIME, and GPR+SHAP produced the best sufficiency scores on the Diabetes dataset. These results indicate that GPX appropriately assigned high contributions to the important features. On the Boston dataset, we found that GPX was slightly inferior to the localized lasso at some values of $K$, although GPX outperformed it at the others. This was because the localized lasso has a regularizer that induces sparse weights. This result suggests that GPX could be further improved by incorporating a mechanism for generating sparse weights.
GPX (ours)  GPR+LIME  GPR+SHAP  Localized  Network  

Digits  0.034 0.011  0.122 0.027  0.073 0.019  0.117 0.065  0.133 0.095 
Abalone  0.338 0.056  2.820 1.445  NA  3.094 1.783  5.994 3.383 
Diabetes  0.015 0.002  0.482 0.084  0.196 0.030  0.758 0.581  1.271 0.261 
Boston  0.117 0.038  0.545 0.234  0.258 0.087  0.233 0.268  0.455 0.670 
Fish  0.222 0.068  0.814 0.384  0.398 0.095  1.414 1.318  1.994 1.923 
Paper  0.001 0.000  0.074 0.014  0.041 0.019  0.014 0.006  0.223 0.045 
Stability. To generate meaningful explanations, interpretable methods must be robust against local perturbations of the input, as explanations that are sensitive to slight changes in the input may be regarded as inconsistent by users. In particular, flexible models such as locally linear models might be sensitive to such changes when pursuing better predictions. Following Alvarez-Melis and Jaakkola Melis and Jaakkola (2018), we used the following quantity to measure the stability of the estimated weights for test sample $\bar{\mathbf{x}}_i$:

$$ \frac{1}{D\, \lvert \mathcal{N}_\epsilon(\bar{\mathbf{x}}_i) \rvert} \sum_{\bar{\mathbf{x}}_j \in \mathcal{N}_\epsilon(\bar{\mathbf{x}}_i)} \lVert \hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j \rVert_2^2, \tag{14} $$

where $\mathcal{N}_\epsilon(\bar{\mathbf{x}}_i)$ is the set of test samples within distance $\epsilon$ of $\bar{\mathbf{x}}_i$; $\hat{\mathbf{w}}_i$ and $\hat{\mathbf{w}}_j$ are the estimated weights associated with test samples $\bar{\mathbf{x}}_i$ and $\bar{\mathbf{x}}_j$, respectively; $\epsilon$ is a parameter that determines the neighboring samples; and $D$ is the dimensionality of the weights. We fixed the value of $\epsilon$ in our experiments. Intuitively, this quantity is small when the estimated weights for a sample and its neighboring samples are similar, so lower scores indicate more stable explanations. Subsequently, we computed the stability score on a dataset by averaging the quantity (14) over all the test samples in the dataset.
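A sketch of the stability computation as we have reconstructed it; the neighborhood rule and the normalization are our assumptions.

```python
import numpy as np

def stability_score(X_test, W_hat, eps):
    """Average, over test samples, of the normalized squared distance
    between a sample's estimated weights and those of its eps-neighbors;
    lower values indicate more consistent explanations."""
    n, d = W_hat.shape
    scores = []
    for i in range(n):
        dists = np.linalg.norm(X_test - X_test[i], axis=1)
        neighbors = np.where((dists > 0) & (dists < eps))[0]
        if len(neighbors) == 0:
            continue
        sq = np.linalg.norm(W_hat[neighbors] - W_hat[i], axis=1) ** 2
        scores.append(sq.mean() / d)
    return float(np.mean(scores))
```

The score is exactly zero when all neighboring samples share identical weights, as with a globally linear model.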
Table 3 shows the stability scores on each dataset. We found that GPX achieved the best stability scores on all the datasets. For GPR+LIME and GPR+SHAP, we found that the stability scores were worse than those of GPX, although the predictive powers of GPX and GPR were comparable. This is likely because LIME and Kernel SHAP estimate the weights independently across the test samples. For the localized and network lasso, we found that the variance of the stability score across the datasets was large.
Qualitative comparison. Finally, we qualitatively compared the weights estimated by GPX and the comparing methods on the Digits dataset, for which the appropriate contributions to the predictions were apparent. For this comparison, we rescaled the inputs $\bar{\mathbf{x}}$ and $\mathbf{x}$ to lie within a fixed range.
Figure 3 shows the estimated weights on two samples. We provide the results for all the digits in Appendix E of the supplementary material. On this dataset, appropriate weights can be obtained by assigning weights having the same sign as the target variable to black pixels. We found that the methods other than GPX and the localized lasso could not estimate reasonable weights. Meanwhile, the weights estimated by GPX and the localized lasso were appropriate, although they exhibited different characteristics, i.e., dense weights for GPX versus sparse ones for the localized lasso. The task determines which explanation is better; however, as highlighting important regions rather than individual pixels is meaningful for images, the weights estimated by GPX would be easier to interpret on the Digits dataset. Furthermore, the degree of sparsity in the localized lasso can be changed via a hyperparameter; if its value is zero, the localized lasso is identical to the network lasso. However, because the weights estimated by the network lasso were inappropriate, those obtained by GPX cannot be mimicked by the localized lasso.
6 Conclusion
We proposed a GP-based regression model with sample-wise explanations. The proposed model assumes that each sample has a locally linear model, which is used for both prediction and explanation, and that the weight vectors of the locally linear models are generated from multivariate GP priors. The hyperparameters of the proposed model were estimated by maximizing the marginal likelihood, in which all the weight vectors were integrated out. Subsequently, for a test sample, the proposed model predicted its target variable and weight vector with uncertainty. In the experiments, we confirmed that the proposed model outperformed the existing globally and locally linear models and achieved performance comparable to standard GPR in terms of predictive performance. We confirmed that the proposed model was superior to the existing methods, including model-agnostic interpretable methods, in terms of three interpretability measurements. Subsequently, we showed that the feature weights estimated by the proposed model were appropriate as explanations.
In future studies, we will confirm the effectiveness of the proposed model by applying its concept to various problems in which GPs have been successfully used, such as time-series forecasting and black-box optimization. In addition, we will extend the proposed model for further improvements in interpretability, e.g., by incorporating a mechanism that induces sparsity in the weight vectors.
Broader Impact
The proposed model could be applied to a wide range of applications in which Gaussian process regression has been used, such as finance Gonzalvez et al. (2019), geostatistics Camps-Valls et al. (2016), materials science Zhang et al. (2020), and medical science Cheng et al. (2017); Futoma (2018). The proposed model could be used to make a nonlinear prediction with an explanation for an individual sample, e.g., a company, country, material object, or patient in these applications.
Using the proposed model brings several benefits. For example, the explanation provides users of the proposed model the opportunity to judge whether a prediction is reasonable and whether it was made fairly. Furthermore, the uncertainties in the prediction and explanation that the proposed model estimates help improve the correctness of this judgment. Consequently, if users find an explanation unreasonable or unfair, they can fix the model or the training data to avoid such a false explanation in the future. On the other hand, the proposed model could introduce risks precisely because of its predictability and explainability; i.e., when users unduly trust the proposed model or ignore large uncertainties, they may accept the prediction and the explanation even when these are wrong, and consequently make wrong, unfair, or biased decisions.
Evaluating the reasonableness of an explanation requires expert knowledge of the application, which is rarely available in general. Therefore, we encourage research investigating whether the explanations produced by the proposed model are reasonable based on expert knowledge in each application. Regarding the aforementioned risk, we encourage research investigating the influence of explanations produced by interpretable models, including the proposed model, on the trustworthiness of the models and on users' decision making.
References
 [1] Abalone data set. Note: https://archive.ics.uci.edu/ml/datasets/Abalone Cited by: Appendix C, §5.1.
 Kernels for vector-valued functions: a review. Foundations and Trends® in Machine Learning 4 (3), pp. 195–266. Cited by: §1.
 "What is relevant in a text document?": An interpretable machine learning approach. PLoS ONE 12 (8). Cited by: §2.
 Evaluating and aggregating featurebased model explanations. arXiv preprint arXiv:2005.00631. Cited by: §5.2.
 Pattern recognition and machine learning. Springer. Cited by: Appendix B, Appendix B.
 A survey on gaussian processes for earthobservation data analysis: a comprehensive investigation. IEEE Geoscience and Remote Sensing Magazine 4 (2), pp. 58–78. Cited by: §1, Broader Impact.

 This looks like that: deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems, pp. 8928–8939. Cited by: §2.
 Learning to explain: an information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pp. 883–892. Cited by: §1.
 Sparse multioutput gaussian processes for medical time series prediction. arXiv preprint arXiv:1703.09112. Cited by: §1, Broader Impact.
 Efficient approaches to gaussian process classification. In Advances in neural information processing systems, pp. 251–257. Cited by: §1.
 [11] Diabetes data set. Note: https://archive.ics.uci.edu/ml/datasets/diabetes Cited by: Appendix C, §5.1.
 [12] Drug review dataset (druglib.com) data set. Note: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29# Cited by: Appendix C, §5.1.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Appendix C, §5.1.
 Gaussian processbased models for clinical time series in healthcare. Ph.D. Thesis, Duke University. Cited by: §1, Broader Impact.
 Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269. Cited by: §5.1.
 Google vizier: a service for blackbox optimization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1487–1495. Cited by: §2.
 Financial applications of gaussian processes and bayesian optimization. arXiv preprint arXiv:1903.04841. Cited by: §1, Broader Impact.
 Network Lasso: Clustering and Optimization in Large Graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 387–396. External Links: ISBN 9781450336642, Document Cited by: §2, §5.1.
 Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management 5, pp. 81–102. Cited by: Appendix A, Appendix C, Figure 1, §5.1.
 Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §2, §5.1.
 On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (13), pp. 503–528. Cited by: §4.
 A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. Cited by: §1, §5.1.
 Towards robust interpretability with selfexplaining neural networks. In Advances in Neural Information Processing Systems, pp. 7775–7784. Cited by: §2, §5.2, §5.2.
 Interpretable machine learning. Note: https://christophm.github.io/interpretablemlbook/ Cited by: §1.
 Learning from distributions via support measure machines. In Advances in neural information processing systems, pp. 10–18. Cited by: §3.
 Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §2.
 [27] Optical recognition of handwritten digits data set. Note: https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits Cited by: Appendix C, §5.1.
 [28] Paper reviews data set. Note: https://archive.ics.uci.edu/ml/datasets/Paper+Reviews Cited by: Appendix C, §5.1.
 Scikitlearn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: Appendix C, Appendix D.
 The matrix cookbook. Technical University of Denmark. Note: Version: November 15, 2012 External Links: Review Matrix Cookbook, Link Cited by: Appendix B, Appendix B.
 [31] QSAR fish toxicity data set. Note: https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity Cited by: Appendix C, §5.1.
 Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: Appendix B, Appendix D, §1.
 Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813. Note: https://github.com/UKPLab/sentencetransformers Cited by: Appendix C, §5.1.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1135–1144. External Links: ISBN 9781450342322, Document Cited by: §1, §5.1.  Gaussian processes for timeseries modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371 (1984), pp. 20110550. Cited by: §1.

Grangercausal attentive mixtures of experts: Learning important features with neural networks.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, Vol. 33, pp. 4846–4853. Cited by: §2.  Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1, §2.
 Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §2, §5.1.
 Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574. Cited by: §4.
 Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242. Cited by: §3.
 Gaussian process regression networks. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1139–1146. Cited by: §1.

Kernel interpolation for scalable structured gaussian processes (kissgp)
. In International Conference on Machine Learning, pp. 1775–1784. Cited by: §4.  [43] Wine quality data set. Note: https://archive.ics.uci.edu/ml/datasets/wine+quality Cited by: Appendix C, §5.1.
 A new view of automatic relevance determination. In Advances in neural information processing systems, pp. 1625–1632. Cited by: §2.
 Localized Lasso for HighDimensional Regression. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, pp. 325–333. Cited by: §2, §5.1.
 Gnnexplainer: generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, pp. 9240–9251. Cited by: §2.
 Latent support measure machines for bagofwords data classification. In Advances in neural information processing systems, pp. 1961–1969. Cited by: §3.
 Neural generators of sparse local linear models for achieving both accuracy and interpretability. arXiv preprint arXiv:2003.06441. Cited by: §2.
 Bayesian optimization for materials design with mixed quantitative and qualitative variables. Scientific Reports 10 (1), pp. 1–13. Cited by: §1, Broader Impact.
Appendix A Feature Description and Additional Examples for the Boston Housing Dataset
Feature name | Description
CRIM | Per capita crime rate by town
ZN | Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS | Proportion of non-retail business acres per town
CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX | Nitric oxides concentration (parts per 10 million)
RM | Average number of rooms per dwelling
AGE | Proportion of owner-occupied units built prior to 1940
DIS | Weighted distances to five Boston employment centers
RAD | Index of accessibility to radial highways
TAX | Full-value property-tax rate per $10,000
PTRATIO | Pupil-teacher ratio by town
B | 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
LSTAT | Percentage of lower-status population
MEDV | Median value of owner-occupied homes in $1000s
The Boston housing dataset, referred to as “Boston” in our experiments, contains information collected by the U.S. Census Service regarding housing in the area of Boston, Massachusetts [19] and is used for predicting house prices based on the information. Table 4 lists the names of the features and their descriptions for the Boston housing dataset.
Figure 4 presents four examples of feature contributions estimated by GPX. We found that the feature contributions differed across these examples, although some features, such as "AGE" and "DIS," had consistently positive and negative contributions, respectively.
Appendix B Detailed Derivation of Predictive Distributions
In this appendix, we describe the derivation of the predictive distributions in detail. For a new test sample x*, our goal is to infer the predictive distributions of its target variable y* and weight vector w*.
Predictive distribution of y*. The predictive distribution of y* is obtained similarly to that of standard GPR [32]. In Section 4, we demonstrated that the marginal distribution of the training target variables for GPX is defined as
(15) 
where . According to (15), the joint marginal distribution of the training targets and y* is defined as
(16) 
where , and . The predictive distribution of y* is the conditional distribution of y* given the training targets, conditioned on the training and test inputs. Therefore, it can be obtained by applying the formula for conditional distributions of normal distributions [30, Eq. (354)] to (16) as follows:
(17) 
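For reference, the conditioning formula invoked here [30, Eq. (354)] takes the following general form: for jointly Gaussian variables a and b,

```latex
\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix},
\begin{pmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{C}^\top & \mathbf{B} \end{pmatrix}
\right)
\;\Longrightarrow\;
p(\mathbf{a} \mid \mathbf{b})
= \mathcal{N}\!\left(
\boldsymbol{\mu}_a + \mathbf{C}\mathbf{B}^{-1}(\mathbf{b} - \boldsymbol{\mu}_b),\;
\mathbf{A} - \mathbf{C}\mathbf{B}^{-1}\mathbf{C}^\top
\right).
```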
Predictive distribution of w*. The predictive distribution of w* can be obtained by solving the following equation:
(18) 
where the first integrand is the conditional distribution of w* and the second integrand is the posterior distribution of the training weights. The conditional distribution of w* is derived similarly to that of y* in (17). The distribution of the training weights in which the weight functions are integrated out is given by
(19) 
where is the th column vector of . According to this, the joint distribution of the training and test weights is defined as
(20) 
where we let and . Subsequently, we can obtain the conditional distribution of w* by applying the formula for conditional distributions of normal distributions [30, Eq. (354)] to (20) as follows:
(21) 
Here, we can rewrite (21) as a single multivariate normal distribution as follows:
(22) 
where is a block diagonal matrix of order whose block is , is a function that flattens the input matrix in column-major order, and . is an by block matrix in which each block is an by matrix; the and block of the block matrix is for , while the other blocks are zero matrices. By letting , we obtain
(23) 
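For concreteness, column-major vectorization stacks the columns of the input matrix, e.g., for a 2-by-2 matrix:

```latex
\operatorname{vec}\!\begin{pmatrix} a & c \\ b & d \end{pmatrix}
= \begin{pmatrix} a & b & c & d \end{pmatrix}^{\!\top}.
```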
To derive the posterior distribution of the training weights, we first consider the joint distribution of the training weights and training targets. This distribution is obtained straightforwardly as
(24) 
which can be rewritten as
(25) 
where . By applying the formula of conditional distributions of normal distributions [5, Eqs. (2.113)–(2.117)] to (25), we can obtain
(26) 
where we let . Here, the computation of requires inverting a square matrix of order with a computational complexity of . By using the Woodbury identity [30, Eq. (156)] to compute this inversion efficiently, we can transform into , which requires inverting a matrix of order , . Consequently, we obtain
(27) 
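For reference, the Woodbury identity used in the last step is, in its general form [30, Eq. (156)]:

```latex
(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1}
= \mathbf{A}^{-1}
- \mathbf{A}^{-1}\mathbf{U}
\left(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}
\mathbf{V}\mathbf{A}^{-1},
```

which trades the inversion of a large matrix on the left-hand side for the inversion of a smaller matrix inside the parentheses.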
Appendix C Specification of Datasets
Dataset | Samples | Features
Digits | 1,797 | 64
Abalone | 4,177 | 10
Diabetes | 442 | 10
Boston | 506 | 13
Fish | 908 | 6
Wine | 6,497 | 11
Paper | 399 | 2,990
Drug | 3,989 | 2,429
We considered eight datasets from the UCI machine learning repository [13], which were referred to as Digits [27], Abalone [1], Diabetes [11], Boston [19], Fish [31], Wine [43], Paper [28], and Drug [12] in our experiments.
The first six datasets consist of tabular data, and we treated the original inputs and simplified inputs identically in our experiments. The Digits dataset was originally developed as a classification dataset for recognizing handwritten digits from zero to nine. As described in Section 5.1, we used it for a regression problem by transforming the digit labels into binary values of 1 or −1. Here, we used only the testing set of the original Digits dataset because that is the portion distributed by scikit-learn [29]. The Abalone dataset is for predicting the age of abalone from physical measurements. The Diabetes dataset is for predicting the onset of diabetes from diagnostic measures. The Boston dataset is for predicting house prices, as described in Appendix A. The Fish dataset is for predicting acute aquatic toxicity toward the fish Pimephales promelas for a set of chemicals. The Wine dataset is for predicting the quality of white and red wines from physicochemical tests.

The remaining two datasets are text datasets. The Paper dataset is for predicting evaluation scores of papers from review texts written mainly in Spanish. The Drug dataset is a drug review dataset for predicting 10-star ratings of drugs from patient review texts. For these two datasets, the original inputs and simplified inputs are different. Specifically, we used the 512-dimensional sentence vectors obtained using sentence transformers [33] as the original inputs and bag-of-words binary vectors of the sentences as the simplified inputs, where the cutoff frequencies for words were set to two and five for the Paper and Drug datasets, respectively. Table 5 lists the number of samples and the number of features in each dataset.
Appendix D Detailed Description of Comparing Methods
In this appendix, we describe the implementation and hyperparameter search procedures for the compared methods.
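As a concrete illustration of the grid searches used for the Lasso and Ridge baselines below, the following sketch uses scikit-learn's LassoCV and RidgeCV on synthetic data; the data and the Ridge grid are our own stand-ins, while LassoCV's defaults (n_alphas=100, eps=1e-3) correspond to a 100-point grid whose minimum-to-maximum ratio is 0.001:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(100, 13)  # synthetic stand-in for a tabular dataset
y = X @ rng.randn(13) + 0.1 * rng.randn(100)

# LassoCV: 100 grid points with alpha_min / alpha_max = 1e-3 (defaults).
lasso = LassoCV(n_alphas=100, eps=1e-3, cv=5).fit(X, y)

# RidgeCV: grid search over an explicit (assumed) range of strengths.
ridge = RidgeCV(alphas=np.logspace(-4, 4, 100), cv=5).fit(X, y)

print(lasso.alpha_, ridge.alpha_)
```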
We implemented GPR using PyTorch v1.5.0 (https://pytorch.org/). All hyperparameters for GPR were estimated by maximizing the marginal likelihood [32], where we initialized the hyperparameters to the same values as those for GPX. For Lasso and Ridge, we used the implementations provided by scikit-learn [29]. The hyperparameters that control the strengths of the ℓ1 and ℓ2 regularizers in Lasso and Ridge, respectively, were optimized through a grid search using functions provided by scikit-learn (i.e., sklearn.linear_model.LassoCV and sklearn.linear_model.RidgeCV) with the default options. The search range for the Lasso hyperparameter was limited to 100 grid points such that the ratio of its minimum value to its maximum value was capped at 0.001, while that for Ridge was limited to within a range of . For the localized lasso, we used the original implementation written in Python (https://riken-yamada.github.io/localizedlasso.html). The hyperparameters and their search ranges for the localized lasso are the strength of the network regularization, the strength of the sparsity regularizer, and the number of neighbors for the nearest-neighbor graph; these hyperparameters were optimized through a grid search. The network lasso is a special case of the localized lasso: if the strength of the sparsity regularizer in the localized lasso is zero, the localized lasso is identical to the network lasso. Therefore, for the network lasso, we used the implementation of the localized lasso with that strength set to zero. The hyperparameter search for the network lasso was the same as that for the localized lasso, except for this setting. For LIME and Kernel SHAP, we used the original implementations (LIME: https://github.com/marcotcr/lime, Kernel SHAP: https://github.com/slundberg/shap).

Appendix E Additional Results for the Digits Dataset
Table 6 shows the computational times of each method on the Digits dataset. First, the training times of GPX and GPR were nearly the same, and GPX was significantly faster to train than the localized and network lasso. Because the localized and network lasso require a hyperparameter search, their actual training times were about 48 and 12 times longer, respectively, than the times shown in the table. Second, the prediction time of GPX was significantly shorter than those of GPR+LIME and GPR+SHAP. In total, GPX was the fastest among the compared methods.
Figure 5 presents additional examples of estimated weights for the Digits dataset. We found that the weights estimated by GPX were appropriately assigned such that the regions of black pixels have weights with the same signs as those of the target variables.
In terms of the stability of explanations, estimated weights for the same digit should be similar. Figure 6 presents three examples of estimated weights for the digit two. We found that GPX estimated similar weights for all three examples.
 | GPX | GPR+LIME | GPR+SHAP | Localized | Network
Training | 5.74 | 5.53 | 5.76 | 105.84 | 106.16
Prediction | 24.57 | 155.54 | 2643.06 | 1.59 | 1.61
Total | 30.31 | 161.07 | 2648.82 | 107.43 | 107.77