1 Introduction
As the field of interpretable and explainable Machine Learning is getting more and more traction, machine learning methods whose reasoning and decision processes were once incomprehensible to human users are finally starting to become transparent. While a general solution to the challenge of humanly tangible, yet sufficiently complex models still seems to be far off in the future, recent developments have yielded promising results. The primary advantages of interpretable models are, as noted in DoshiVelez and Kim (2017), (scientific) understanding on the one hand and on the other hand safety, especially operational and ethical safety.
In regards to the former, expressing a complex learning problem in a form that humans can make sense of also presents the chance to let human prior knowledge augment the learning process. Besides the well established field of imitation learning
^{1}^{1}1see for example Hussein et al. (2017) for an overview, Bayesian methods are another apparent candidate for such endeavor. The strongest points for the Bayesian route are  first  the ability to express expert knowledge even before any data is available. Second, Bayesian statistics is embedded in a rigorous mathematical foundation that allows to derive theoretical results in a deductive manner. In practical terms, this can be particularly valuable when observational data are sparse or highly expensive to obtain or generate.
To give a concrete motivational example, consider a simple regression problem where the relation between input and target features is given by, leaving aside potential noise, a linear function. If a reasonably informed expert is aware of this relation and simultaneously able to suitably articulate this prior knowledge to a machine learning model, we can expect the performance of such model to improve over an uninformed counterpart.
While Bayesian methods are commonly praised for their ability to deal with the presence of expert knowledge, the complexity of modern Bayesian models only permits the expression of very general prior beliefs. Consider the case of Gaussian Process (GP) models as arguably the figurehead of Bayesian nonparametrics. The choice of the GP kernel function allows, in theory, to express certain functional prior assumptions. Due to the above mentioned complexity issue however, it is fairly common to use some variation of the squared exponential (SE) kernel per default. On the one hand, generic prior distributions as implied by the SEkernel might be the only reasonable choice to describe a complex problem on a global scale. On the other hand though, prior information about more granular properties of the target function is completely discarded under these circumstances.
This brings us back to the initial consideration of leveraging transparent machine learning models in order to mitigate this limitation. For this purpose, we will take advantage of the common approach to have a model learn a globally complex function while being able to locally, for a given instance, decompose a model’s decision into the contribution of each feature. Such procedure has been proposed for Neural Networks in particular by
AlvarezMelis and Jaakkola (2018) who coined the term selfexplaining models. We will directly transfer their idea to variational approximations for GPs as introduced in Titsias (2009); Hensman et al. (2013) and exploit the resulting structure of the variational posterior. The proposed model is both selfexplaining as well and at the same time extends the possibilities to express prior knowledge in the context of GPs.2 Transparent Machine Learning
In regards to transparency in machine learning, terms like interpretable Machine Learning or
explainable Artificial Intelligence
(XAI) have become quite widespread and popular. However, actual definitions of such terms still vary from author to author. In our context, we use the following definitions of interpretation and explanation from Montavon et al. (2018):Definition 1
An interpretation is the mapping of an abstract concept into a domain that the human can make sense of. An explanation is the collection of features of the interpretable domain, that have contributed for a given example to produce a decision.
Besides images and text being interpretable domains as noted in Montavon et al. (2018)
, we note that reasonably sized mathematical or statistical models can be considered as being interpretable as well. Take for example the standard linear regression model
(1) 
and the wellknown interpretation of each coefficient being the average marginal effect of the corresponding variable or feature. Unless such model contains dozens of relevant variables, a human with sufficient domain knowledge about the problem being modelled can then easily make sense of the resulting qualitative implications. It can also be seen that a linear model provides a globally applicable explanation for a given example, i.e. the contribution of each feature is always the same, independently of an example’s location in its domain .
On the other hand, it is obvious that the plain linear model is unable to deal with complex problems in a satisfactory manner, yet problems of high complexity are particularly relevant in machine learning. In order to solve this rather severe shortcoming, a straightforward extension of (1) are socalled varying coefficient models as first introduced by Hastie and Tibshirani (1993). Here, the parameters are themselves functions of some covariates :
(2) 
where denotes the th feature column of . For our purpose, we usually have . In order to obtain a sufficiently flexible model from (2), using a universal approximator like Neural Networks for the is an obvious choice and proposed in particular by AlvarezMelis and Jaakkola (2018) under the umbrella term selfexplaining neural networks (SENN). Although the latter propose an even more general model, the definition in (2) as a special case shall suffice for our means. Hence, our model of interest is defined as follows:
(3) 
with each being the th of
output neurons of a standard feedforward Neural Network. Replacing the neural network by
GP regression models, we arrive at a selfexplaining Bayesian variant which was introduced by Yoshikawa and Iwata (2020) under the term GPX. In order to proceed, we now provide a brief recap on GP models and variational approximations in the following before exposing our main contributions.3 Gaussian Processes
The building blocks of GPs, see Rasmussen (2003), are a prior distribution over functions, , and a likelihood . Using Bayes’ law, we are interested in a posterior distribution obtained as
(4) 
The prior distribution is a Gaussian Process, fully specified by , typically , and covariance kernel function :
(5) 
We assume the input domain for to be a bounded subset of the real numbers, . A common choice for is the ARDkernel
(6) 
where is a diagonal matrix with entries in and . For , (6) is equivalent to an SEkernel. We denote by the positive semidefinite GramMatrix, obtained as , the th row of training input matrix containing observations in total, and write for the GramMatrix over .
Provided that , i.e. observations are i.i.d. univariate Gaussian conditioned on , it is possible to directly calculate a corresponding posterior distribution for new inputs as
(7) 
where , , ;
is the identity matrix with according dimension.
In order to make GPs feasible for large datasets, the work of Titsias (2009); Hensman et al. (2013, 2015) developed and refined Sparse Variational Gaussian Processes (SVGPs). SVGPs, introduce a set of so called inducing locations and corresponding inducing variables . The resulting posterior distribution, , is then approximated through a variational distribution  usually  by maximizing the evidence lower bound (ELBO):
(8) 
where we obtain by evaluating the GP prior distribution at inducing locations
. Following standard results for Gaussian random variables, it can also be shown that for marginal
evaluated at arbitrary and with , we have(9) 
4 Selfexplaining variational posterior distributions
Instead of formulating the prior GP model and subsequently deriving its variational approximation, we will proceed the opposite way by formulating the general structure of our variational distribution first. Using (3) as a starting point for our model, an obvious adaption can be achieved by replacing the neural network with GPs, each modeling one corresponding varying coefficient. This is in direct relation to Yoshikawa and Iwata (2020) who construct a selfexplaining GP prior in this manner, coined GPX. Such prior can be shown to yield a closed form posterior distribution of the same structure. Hence, part of our work can be seen as a variational extension to their method. As will be seen however, our method allows for extensions whose relation to the former is not as obvious. We will refer to our method as SEVGP  Selfexplaining variational Gaussian Process from now on.
The SEVGP is composed as follows  we have:

independent sets of inducing variables at inducing locations , , each corresponding to a separate GP

independent GPs, , with inducing locations as defined in 1.

The actual target process, , evaluated at arbitrary input matrix and constructed as
(10) with the th column of and denoting elementwise multiplication.
We can formally write the joint probability density via the conditional distributions
(11) 
where we summarized for convenience. Also we wrote instead of to stress that (11) is the target structure for the variational distribution for now.
Choosing as in the standard SVGP model, we obtain in correspondence to (9):
(12) 
Defining , we can derive the marginal distribution of the target GP as
(13) 
From the construction of our model it also follows that , i.e. relates to only via . We therefore conclude that
(14) 
where .
In addition, it is straightforward to see that under (14) we have
(15) 
Equations (12), (13) and (14) now allow to construct different prior distributions and hence express different prior beliefs. We remind the reader that the paramount goal of all three approaches is expression of meaningful prior belief on the one hand and transparency of the result on the other hand. While the latter has been exemplified in this section, the former is achieved by three different interpretations of the proposed variational posterior. We provide a tabular overview of all three variants in Appendix A.
4.1 As a variational extension for GPX
As stated above, the most obvious interpretation is as a sparse variational approximation of a GPX model. There are two alternative ways to implement this variant:

Have both prior and variational process structured as in (13) and perform variational inference for SVGPs as usual.

Treat each varying coefficient GP separately, i.e. each GP has its own set of inducing points and variables
As the former case would be trivial and not help us incorporating any meaningful prior knowledge about the coefficients, we focus on the latter. This case can be easily derived by constructing the prior conditioned on its realizations at , , as in (14), i.e.
(16) 
Adding now , the joint prior distribution over is
(17) 
The variational distribution is slightly modified to match the standard SVGP structure per process :
(18) 
(19) 
Equation (19) then results in the following ELBO  the derivation of this result and all subsequent ones can be found in the appendix:
(20) 
It should be clear that (20) also implies that we can use batch sampling methods in order to apply our method to large datasets.
This rather obvious SVGP extension to GPX allows to incorporate prior knowledge about each of the varying coefficients or respectively each feature’s contribution individually.
4.2 As a variational approximation for an arbitrary GP
In order to allow for general functional prior knowledge, we now use (13) directly and approximate an arbitrary GP posterior . As a crucial distinction to plain SVGPs, we allow the covariance functions of and to be different while keeping the structure of selfexplaining as before.
Hence, the KLobjective for variational inference in this case becomes
(21) 
Since (21) denotes a KLdivergence between two stochastic processes, we cannot proceed as in the finite dimensional case as noted in Sun et al. (2019). As a result, we cannot obtain a usual ELBO either but derive a functional evidence lower bound (fELBO) instead:
(22) 
with the finite dimensional evaluation of over the union set consisting of training observations and so called augmentation points sampled uniformly from . Also, denotes the evaluation of the finite dimensional covariance matrix of evaluated at
As long as the component kernel functions of are flexible enough, it is possible to approximate a large variety of prior functions with this approach. To stress the difference of this approach to standard SVGPs, the resulting variational posterior allows for casebased explanations for each instance predicted.
4.3 As a variational approximation for a GP with additional priors over coefficients
This last variant can be interpreted as a combination of the other two approaches. In a corresponding usecase scenario we might have prior knowledge available on both the functional form over all input variables and, additionally, specific prior knowledge about the individual contribution of certain features. To embed this idea into our framework, we make the following adjustments to the prior and variational distributions from (17) and (18):
The modified prior distribution is now structured as
(23) 
the modified variational distribution as
(24) 
As the resulting ELBO formula facilitates the explanation of our reasoning behind (23) and (24), we first state the resulting variational objective. For the KLdivergence we have
(25) 
As it turns out, we require a functional lower bound in this case as well:
(26) 
denotes the evaluation of the finite dimensional covariance matrix of at .
Assuming independence between and in the prior distribution while keeping them dependent in the variational approximation requires some explanation. Presuming that the true datagenerating function is unlikely to follow an additive structure as postulated in (16), there is no reason to split into additive components in the general case. Hence, we treat as an independent process, irrespective of any supplemental and corresponding inducing variables. In correspondence to the construction discussed at the beginning, this implies
(27) 
in accordance with (23). The discussed prior distribution is hence rather a convenience in order to influence both as a whole and each individually while at the same time conserving the transparent structure of the variational distribution.
5 Experiments
In order to empirically evaluate our method, we first conducted an experiment on posterior soundness to validate for a simple example the properties we discussed in section 4. The choice of coefficient priors and full process prior should be reflected accordingly in the corresponding posteriors of each variant. As a second experiment, we compared the results of SEVGP with SENN on standard regression datasets.
More implementation details can be found in Appendix C.
5.1 Posterior soundness
We sampled input data uniformly from and generated the target feature as . For the varying coefficient prior, (, variants and ) and variational (, all three variants) processes a with zero mean. As a kernel function a, we used a summation kernel of a constant kernel and an SEkernel and the only trainable parameter. While the former corresponds to the prior belief of the varying coefficient being constant, the latter exemplifies potential deviation from constancy. The tradeoff between both assumptions can be steered via . For our experiments, we set .
For variants , we posed a secondorder polynomial GP prior over the full process,  i.e. our prior belief matches the data generating process. As can be seen in Figure 1, the posterior means for these variants approach the true mean function more quickly. In the larger sample case, the posterior means under the polynomial prior better fit the true mean outside the range of observed training examples. As for the difference between variants and , we observed that the additional prior in yielded a regularizing effect on the posterior distribution.
Variational posterior predictive mean functions for variants
(green), (blue), (purple) for samples (red dots) of size (left) and (right) with true mean function (red) .5.2 SENNcomparison
In order to evaluate the practical applicability of our approach, we compared it against a SENN as in AlvarezMelis and Jaakkola (2018) on four UCI datasets^{2}^{2}2Boston Housing, Concrete Slump, Red Whine, White Whine, see Dua and Graff (2017). It should be stressed that our aim was not to show that our method provides better results in general but rather that its performance is comparable to SENN. Also, we only included variant in this evaluation given that we had no sensible prior knowledge about available for these datasets.
We ran 10fold crossvalidation per dataset and evaluated both meansquared error (MSE) and coefficient stability, i.e. explanation coherence for neighboring datapoints. To evaluate both measures for the GP models, we used the posterior mean function. In regards to stability, we averaged over all training examples the following stability measure per instance :
(28) 
SEVGP (4.1) (ours)  

MSE  Stability  MSE  Stability  
Boston  
Concrete  
Wine red  
Wine white 
MSE and coefficient stability for SENN and SEVGP (variant 4.1) posterior mean; average and standard deviation over 10fold cross validation
with
the stacked vector of coefficients at
derived either from the GP posterior mean or the SENN output neurons. denotes the set of the training instances with closest to ^{3}^{3}3Yoshikawa and Iwata (2020) propose to use all training instances in the neighorhood offor their stability measure. For our datasets this lead to issues due to outliers, hence we resorted to this variant.
. As a kernel function for the SEVGP coefficient priors, we used the same kernel as in 5.1 but replaced the SEkernel by an ARDkernel (6) and set to allow for more variability in the coefficients.Table 1 shows that SEVGP achieves comparable performance to SENN in terms of MSE and, particularly, stability.
6 Related work
The results of AlvarezMelis and Jaakkola (2018); Yoshikawa and Iwata (2020); Guhaniyogi et al. (2020) directly inspired our approach from an explainability and transparency point of view. The overarching, general theme of our present work however revolves around the question of how to make prior knowledge available to complex Machine Learning models. Niyogi et al. (1998); Ferranti et al. (2017); von Rueden et al. (2019); Yang and Ren (2020) all discuss the potentially beneficial role of expert and domain knowledge in Machine Learning, yet either mention Bayesian methods only briefly or not at all. Nevertheless, Bayesian nonparametrics have already been applied successfully in countless classical statistical modeling problems with an emphasis on incorporating prior knowledge  see Gelman et al. (2013) for a variety of examples.
Recent work on functional variational inference as discussed particularly in Sun et al. (2019); Burt et al. (2020) could be a fruitful step towards a synthesis of meaningful prior models and modern Machine Learning architectures. On another note, Knoblauch et al. (2019) reinterpret variational inference as a mere optimization problem where the KLdivergences in an evidence lower bound are merely seen as regularization terms. This view is apparently related to our derivations in 4.3 where we explicitly dragged along the inducing variables in order to influence the posterior distribution based on prior knowledge about the coefficients and the overall function simultaneously.
7 Limitations and discussion
The main challenge we currently see for our method is its scalability to more complex problems. In particular, image classification tasks could potentially benefit from more expressive prior distributions, especially when training datasets are difficult to populate or diversify. This also distinguishes the proposed model from its neural counterpart in AlvarezMelis and Jaakkola (2018)
which can be scaled quite efficiently. Nevertheless, the sparse variational setup will likely allow for further improvements in terms of scalability. With this in mind, we are confident that our method can be extended to computer vision tasks in the future.
On a broader scale, the general possibilities to interweave Bayesian methods with transparent machine learning might be far from being exhausted with our contribution. As the field of interactive and humanintheloop machine learning  see Fails and Olsen Jr (2003); Holzinger (2016)  is getting more traction, such methods could be used for transparent humanmachine feedback loops as proposed for example by Teso and Kersting (2019).
8 Broader impact
The potential societal impact that we hope to contribute to with this work is twofold: First, the selfexplaining structure of our approach allows for transparency about which features or variables contributed to a given outcome. This allows to check whether a prediction is driven by any form of model bias and unfairness at runtime. On the flipside, we haven’t worked out any global guarantees for fairness and unbiasedness in this work.
The second possible impact of our method is the ability to reduce the chance of unfairness and biasedness from the start through the choice of prior distributions on the varying coefficients. By using for example a sufficiently positive prior mean function for a coefficient, we can guide the corresponding posterior coefficient to take on positive values with high probability as well, resulting in a positive effect for a given feature. However, there are again no guarantees and the resulting predictions would still need to be validated.
On the negative side of impact, we see no actively harmful potential of our method. However, a user might falsely presume these properties to be present. A subsequent negative impact would thus be highly dependent on the specific usecase at hand and could occur with any method whenever such methodological misunderstanding prevails. To prevent such issues, any user of our method should be made aware of these shortcomings.
References
 Towards robust interpretability with selfexplaining neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 38, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 7786–7795. External Links: Link Cited by: §1, §2, §5.2, §6, §7.
 Julia: a fresh approach to numerical computing. SIAM review 59 (1), pp. 65–98. Cited by: Appendix C.
 Understanding variational inference in functionspace. CoRR abs/2011.09421. External Links: Link, 2011.09421 Cited by: §6.
 Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §1.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: footnote 2.
 Interactive machine learning. In Proceedings of the 8th international conference on Intelligent user interfaces, pp. 39–45. Cited by: §7.
 The value of prior knowledge in machine learning of complex network systems. Bioinform. 33 (22), pp. 3610–3618. External Links: Link, Document Cited by: §6.
 Bayesian data analysis. CRC press. Cited by: §6.
 Distributed bayesian varying coefficient modeling using a gaussian process prior. arXiv preprint arXiv:2006.00783. Cited by: §6.
 Varyingcoefficient models. Journal of the Royal Statistical Society: Series B (Methodological) 55 (4), pp. 757–779. Cited by: §2.
 Scalable variational gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 912, 2015, G. Lebanon and S. V. N. Vishwanathan (Eds.), JMLR Workshop and Conference Proceedings, Vol. 38. External Links: Link Cited by: §3.
 Gaussian processes for big data. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 1115, 2013, A. Nicholson and P. Smyth (Eds.), External Links: Link Cited by: §1, §3.
 Interactive machine learning for health informatics: when do we need the humanintheloop?. Brain Informatics 3 (2), pp. 119–131. External Links: Link, Document Cited by: §7.
 Imitation learning: A survey of learning methods. ACM Comput. Surv. 50 (2), pp. 21:1–21:35. External Links: Link, Document Cited by: footnote 1.
 Don’t unroll adjoint: differentiating ssaform programs. CoRR abs/1810.07951. External Links: Link, 1810.07951 Cited by: Appendix C.
 Jupyter notebooksa publishing format for reproducible computational workflows.. Vol. 2016. Cited by: Appendix C.
 Generalized variational inference: three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063. Cited by: §6.
 Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 73, pp. 1–15. External Links: Link, Document Cited by: §2, §2.
 Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86 (11), pp. 2196–2209. Cited by: §6.
 Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, ML Summer Schools 2003, Canberra, Australia, February 214, 2003, Tübingen, Germany, August 416, 2003, Revised Lectures, O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Lecture Notes in Computer Science, Vol. 3176, pp. 63–71. External Links: Link, Document Cited by: §3.
 Functional variational bayesian neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 69, 2019, External Links: Link Cited by: Appendix B, §C.1, §4.2, §6.
 Explanatory interactive machine learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA, January 2728, 2019, V. Conitzer, G. K. Hadfield, and S. Vallor (Eds.), pp. 239–245. External Links: Link, Document Cited by: §7.
 Variational learning of inducing variables in sparse gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 1618, 2009, D. A. V. Dyk and M. Welling (Eds.), JMLR Proceedings, Vol. 5, pp. 567–574. External Links: Link Cited by: §1, §3.
 Informed machine learning–a taxonomy and survey of integrating knowledge into learning systems. arXiv preprint arXiv:1903.12394. Cited by: §6.
 A quantitative perspective on values of domain knowledge for machine learning. arXiv preprint arXiv:2011.08450. Cited by: §6.
 Gaussian process regression with local explanation. CoRR abs/2007.01669. External Links: Link, 2007.01669 Cited by: §2, §4, §6, footnote 3.
Appendix A Overview over prior structures
Variant  Lower Bound  Implied prior knowledge 

4.1  Knowledge about individual variables/coefficients  
4.2  Knowledge about general functional relation  
4.3  Knowledge about general functional relation and individual variables/coefficients 
Appendix B Derivations
We first state the following wellknown result for the expected sum of loglikelihoods of i.i.d univariate Gaussians with respect to a common mean
vector with multivariate Gaussian distribution:
(29) 
In order to approximate a stochastic process posterior through variational process , the following theorem from Sun et al. [2019] allows to construct a lower bound via the finite dimensional measurement set :
Theorem 1
If contains all training inputs , then
We refer to the above work for the proof of this theorem.
b.1 Derivation of
b.2 Derivation of
is directly obtained from Theorem 1 by plugging in our GP definitions for and by applying (29).
b.3 Derivation of
We follow the origina proof for Theorem 1 and show that the proposed is indeed a lower bound for under the stated assumptions. We first state the following result which we will prove later on:
This allows us to write the corresponding as
where we summarized and used the fact that for observations the augmentation points are irrelevant, hence holds.
where used the fact that by construction (23), hence and therefore .
We next derive the closed form expression for :