1 Introduction
Identifying relevant features and interactions from complex data sets and models remains a topic of active research. This is a fundamental problem with important applications in many scientific disciplines. Often the goal is to improve understanding of the model, but the identified features and interactions can also be used to build a more interpretable surrogate model.
For models that can implicitly capture nonlinear effects and interactions, the typical approach is to accumulate local relevance estimates based on predictions made at the training inputs or permutations thereof. Many local measures use the partial derivative of the model’s prediction with respect to the features as the basis for measuring relevance (Guyon et al., 2002). Since the average derivative can vary from positive to negative, most approaches use absolute or squared derivatives (Ruck et al., 1990; Dorizzi, 1996; Czernichow, 1996; Refenes and Zapranis, 1999; Leray and Gallinari, 1999; Sundararajan et al., 2017). However, the derivative of the mean prediction fails to take the uncertainty of the prediction into account. The average predictive comparison of Gelman and Pardoe (2007) also integrates over the posterior uncertainty, but is prohibitively expensive in practice. Paananen et al. (2019) take changes in predictive uncertainty into account, but rely on finite differences that are prone to numerical errors.

Local values of the mean prediction or its derivative are also often used for identifying interacting variable pairs in complex models. For example, Friedman et al. (2008) and Greenwell et al. (2018) use partial dependence functions (Friedman, 2001) to construct statistics that measure the strength of pairwise interactions. Cui et al. (2019) use the Hessian of the prediction of a neural network as a local measure of interaction strength. The approach of Crawford et al. (2019) takes the predictive uncertainty into account, but only computes the relevance of variables without separating main effects and interactions. The individual conditional expectation plots of Goldstein et al. (2015) can also be used to identify interactions, but they rely on visualization and do not quantify the strength of the interactions.

The contributions of this work are summarized as follows. First, we present a novel method for assessing the predictive relevance of input variables and their interactions. Instead of using the first partial derivatives or the cross-derivatives of the mean prediction of a model, we differentiate the Kullback–Leibler divergence from one predictive distribution to another. We derive measures analogous to these derivatives that take into account the uncertainty in the model’s predictions. Second, we show that the first derivative of the Kullback–Leibler divergence of the predictive distribution is an analytical generalization of a previous finite difference method (Paananen et al., 2019). We also discuss why a similar finite difference equivalent is not viable for the cross-derivative of the divergence measure that we use to identify interacting variable pairs.
We demonstrate the effectiveness of the proposed methods by applying Gaussian process models to simulated and real benchmark datasets with both nonlinearities and interactions. Based on the empirical evaluation, our methods are competitive with alternative methods in terms of both computational cost and effectiveness in identifying the relevant main effects and interactions.
2 Differentiating Probability Dissimilarity Measures
After observing data D = (X, y), consisting of a vector of outputs y and a matrix of inputs X, let us denote the posterior predictive distribution of our model at an input point x* as

\[ p(\tilde{y} \,|\, x^*, D) = p(\tilde{y} \,|\, \theta(x^*)). \]

Here, θ(x*) is the vector of parameters of the distribution, which depends on x*, X, and y. To keep the notation concise, we denote the posterior predictive distribution at input x* simply as p*. Throughout, we assume that all input variables are centered and normalized.
In order to assess the predictive relevance of input variables, we are interested in estimating how much the predictions of a model change when the value of one input variable changes. For models that are linear with respect to the input variables, the regression coefficient of each variable is a global measure of relevance. For more flexible models, the absolute value of the derivative of the mean prediction is an analogous measure of relevance locally. In this section, we derive a local relevance measure that takes into account changes in both the mean prediction and its uncertainty through the posterior predictive distribution. It is a natural way to include uncertainty about the model parameters into the task of identifying relevant variables and interactions. Even for simple models such as a linear regression model, the variance of the predictive distribution is not constant when there is uncertainty about the model parameters.
The change in the predictive distribution due to perturbations of an input variable can be formalized as the derivative of a dissimilarity measure between two predictive distributions with respect to the input variable. In contrast to simply assessing the derivative of the mean prediction, this approach takes changes in predictive uncertainty into account in a principled manner. Our dissimilarity measure of choice is the Kullback–Leibler divergence, but other measures could be used as well. Mathematically, we are thus evaluating the partial derivatives of the Kullback–Leibler divergence from the predictive distribution at x* to the predictive distribution at x*′ with respect to the i’th input variable when the points x* and x*′ coincide and the two predictive distributions are equal. Because the Kullback–Leibler divergence has a global minimum of zero when x*′ = x*, the first derivative with respect to an input variable is zero at that point. We thus use the square root of the second derivative to measure the local sensitivity of the predictive distribution to changes in an input variable.
The square root of the Kullback–Leibler divergence is a natural scale to use for two reasons. First, it brings the derivative to the same scale as the absolute derivative of the mean prediction, which is often used as a measure of local relevance (Ruck et al., 1990; Dorizzi, 1996; Refenes and Zapranis, 1999). For example, for a model with a Gaussian predictive distribution, the square root of the derivative of the divergence measure is proportional to the absolute derivative of the mean prediction instead of its square. Second, when the two predictive distributions are equal, the derivatives of the Kullback–Leibler divergence are equal to the derivatives of the Jensen–Shannon divergence, the square root of which defines a metric on the space of probability distributions. A similar scaling of the Kullback–Leibler divergence has been used by Simpson et al. (2017) for constructing prior distributions.

Based on the predictive distribution of our model, we define the KL-diff local measure of the predictive relevance of a single variable x_i at the input point x* as
\[
\text{KL-diff}_i(x^*) \;=\; \sqrt{\sum_{k,l} \frac{\partial^2 \mathrm{KL}\!\left[p^* \,\middle\|\, p^{*\prime}\right]}{\partial \theta'_k\, \partial \theta'_l}\,\Bigg|_{x^{*\prime} = x^*} \frac{\partial \theta'_k}{\partial x^{*\prime}_i}\, \frac{\partial \theta'_l}{\partial x^{*\prime}_i}} \tag{1}
\]

where KL[p* ∥ p*′] denotes the Kullback–Leibler divergence from the predictive distribution p* to p*′, and θ′ are the parameters of p*′.
The KL-diff measure in equation (1) involves two kinds of partial derivatives: (i) the second derivatives of the Kullback–Leibler divergence with respect to the parameters of the predictive distribution, and (ii) the first derivatives of the parameters with respect to the input variable x_i. These are obtained as follows. (i) The curvature of the Kullback–Leibler divergence is represented by the Fisher information: for a probability distribution parameterized by θ, the second partial derivatives of the Kullback–Leibler divergence with respect to the elements of θ are given by the Fisher information matrix of the distribution. (ii) The partial derivatives of the parameters with respect to the input variable depend on the specific choice of model. Thus, the only requirements for using the proposed method are that the predictive distribution of the model is available analytically (either the exact distribution or an approximation) and that the derivatives (ii) are available. Fortunately, there is a large number of models where these conditions are fulfilled. In the supplementary material, we derive the terms (i) and (ii) for Gaussian process models with different likelihoods.
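As a concrete check of this chain rule, the sketch below evaluates the measure for a hypothetical one-dimensional model with a Gaussian predictive distribution. The mean and standard deviation functions are invented for illustration; the Fisher information of a Gaussian in the (mean, standard deviation) parameterization is diag(1/σ², 2/σ²).

```python
import numpy as np

# Hypothetical Gaussian predictive whose mean and standard deviation
# both depend on a scalar input x (purely illustrative functions).
def predictive_params(x):
    mu = np.sin(2.0 * x)           # predictive mean
    sd = 0.3 + 0.1 * x**2          # predictive standard deviation
    return mu, sd

def param_grads(x):
    dmu = 2.0 * np.cos(2.0 * x)    # d mu / dx
    dsd = 0.2 * x                  # d sd / dx
    return np.array([dmu, dsd])

def kl_diff(x):
    """sqrt(J^T F J): J = parameter gradients, F = Fisher information."""
    _, sd = predictive_params(x)
    # Fisher information of N(mu, sd^2) in the (mu, sd) parameterization.
    F = np.diag([1.0 / sd**2, 2.0 / sd**2])
    J = param_grads(x)
    return np.sqrt(J @ F @ J)

print(kl_diff(0.3))
```

The same value is obtained by numerically taking the second derivative of the Kullback–Leibler divergence between the two predictive distributions directly, which is what equation (1) expresses.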
2.1 Illustrative Example
To gain intuition about the roles of the different components in equation (1), we analyze a Bayesian linear regression model as a toy example. For a Gaussian likelihood, the observations of the target variable are modelled as

\[ y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \]

where β are the regression coefficients and σ² is the noise variance. If the prior is an improper uniform prior on (β, log σ), integrating over the uncertainty about all parameters makes the posterior predictive distribution at a new point x* (treated as a row vector) a Student-t distribution,

\[ \tilde{y} \,|\, x^*, X, y \;\sim\; t_{\nu}\!\left( x^* \hat{\beta},\; \sigma^{*2} \right), \qquad \sigma^{*2} = s^2 \left( 1 + x^* (X^\top X)^{-1} x^{*\top} \right). \]

Here, ν = n − p represents the degrees of freedom, s² is the residual variance estimate, and n and p are the number of observations and input variables, respectively. β̂ = (XᵀX)⁻¹Xᵀy are the maximum likelihood estimates of the regression coefficients. Now, the derivatives of the three parameters of the predictive distribution with respect to the input variable x_i are

\[ \frac{\partial \nu}{\partial x^*_i} = 0, \qquad \frac{\partial \mu^*}{\partial x^*_i} = \hat{\beta}_i, \qquad \frac{\partial \sigma^*}{\partial x^*_i} = \frac{s^2 \left( (X^\top X)^{-1} x^{*\top} \right)_i}{\sigma^*}, \]

where μ* = x*β̂. The nonzero second derivatives of the Kullback–Leibler divergence are given by the Fisher information matrix of the Student-t distribution:

\[ \mathcal{I}_{\mu\mu} = \frac{\nu + 1}{(\nu + 3)\,\sigma^{*2}}, \qquad \mathcal{I}_{\sigma\sigma} = \frac{2\nu}{(\nu + 3)\,\sigma^{*2}}. \]

The KL-diff local measure of the predictive relevance of variable x_i from equation (1) thus evaluates to

\[ \text{KL-diff}_i(x^*) = \sqrt{ \frac{\nu+1}{\nu+3}\, \frac{\hat{\beta}_i^2}{\sigma^{*2}} \;+\; \frac{2\nu}{\nu+3}\, \frac{1}{\sigma^{*2}} \left( \frac{\partial \sigma^*}{\partial x^*_i} \right)^{\!2} }. \tag{2} \]
The two terms have the following roles in the local relevance measure of input variable x_i at a point x*. In the absence of the second term, the measure would be proportional to the absolute value of the maximum likelihood estimate of the regression coefficient, |β̂_i|, divided by the standard deviation of the predictive distribution. The first term thus measures the absolute derivative of the mean prediction, but input points with more uncertainty are given less weight. The second term in equation (2) also quantifies the amount of uncertainty in the posterior predictive distribution, but in a different way. Even if β̂_i for some variable were exactly zero, the second term is nonzero as long as there is uncertainty about the model parameters that causes the predictive uncertainty to vary in the input space.

To visualize the roles of the two terms in equation (2), we simulated 10 observations from a linear model with two input variables, one with a nonzero true regression coefficient and one whose true coefficient is zero. The input variables are independent and normally distributed with zero mean and standard deviation one. Based on the posterior predictive distribution, the relevances for both variables given by equation (2) are shown in Figure 1 as a function of the corresponding input variable. The red lines show the contribution of the terms for the relevant variable, where we see that observations at the edges of the data are given less relevance, and that the second term contributes very little. The blue lines represent the relevance values of the irrelevant variable, where the first term is now very small because the estimate for the regression coefficient is small. In this case, the second term in equation (2) increases the relevance measure because there is still a significant amount of uncertainty about the model parameters. The KL-diff method is thus conservative in the sense that it amplifies the relevance measure for variables that seem irrelevant when there is still a lot of uncertainty.
When the number of observations increases, the posterior becomes more concentrated, the contribution of the second term vanishes, and the relevance measures of both variables approach a constant determined by the corresponding regression coefficient. Note that the KL-diff measure is not attempting to estimate the values of the regression coefficients, but rather to compare the input variables to each other. Using a different prior distribution on the regression coefficients would change the predictive distribution and thus also the KL-diff relevance measure. We leave investigating the influence of the prior as future research.
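The quantities in this example are straightforward to compute numerically. The sketch below is a minimal illustration under stated assumptions: hypothetical true coefficients (1 and 0, matching one relevant and one irrelevant variable), an assumed noise standard deviation of 0.3, and the standard Fisher information terms of the Student-t distribution, (ν+1)/((ν+3)σ*²) for the location and 2ν/((ν+3)σ*²) for the scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 observations, two inputs, one relevant and one
# irrelevant (assumed true coefficients 1 and 0, assumed noise sd 0.3).
n, p = 10, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0]) + 0.3 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                  # maximum likelihood estimates
nu = n - p                                    # degrees of freedom
s2 = np.sum((y - X @ beta_hat) ** 2) / nu     # residual variance estimate

def kl_diff_linear(x):
    """Local relevance of each input at row vector x (the two-term form)."""
    v = XtX_inv @ x
    var_star = s2 * (1.0 + x @ v)             # squared predictive scale
    dsd = s2 * v / np.sqrt(var_star)          # d sigma* / d x_i
    term_mean = (nu + 1) / (nu + 3) * beta_hat**2 / var_star
    term_sd = 2 * nu / (nu + 3) * dsd**2 / var_star
    return np.sqrt(term_mean + term_sd)

print(kl_diff_linear(X[0]))    # per-variable local relevance at one point
```

Averaging this local measure over all training rows gives the global relevance of each variable, as discussed in Section 2.3.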
For other likelihoods or models, the different terms in equation (1) may not have as clear interpretations as in the linear regression model above. For example, in a binary classification task, the posterior predictive distribution can be considered a Bernoulli distribution, which has only a single parameter. Nevertheless, the principle of taking into account the uncertainty about the model parameters still holds.
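In the Bernoulli case, the Fisher information is 1/(p(1−p)), so the measure reduces to |∂p/∂x| / √(p(1−p)). A minimal sketch with a hypothetical logistic predictive probability:

```python
import numpy as np

# Hypothetical logistic predictive probability p(x) for a binary classifier.
# Fisher information of Bernoulli(p) is 1 / (p (1 - p)), so the relevance
# measure reduces to |dp/dx| / sqrt(p (1 - p)).
def p(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def dp(x):
    s = p(x)
    return 2.0 * s * (1.0 - s)        # derivative of the logistic above

def kl_diff_bernoulli(x):
    s = p(x)
    return abs(dp(x)) / np.sqrt(s * (1.0 - s))

print(kl_diff_bernoulli(0.5))
```

Note that the division by √(p(1−p)) downweights points where the classifier is confident, which is exactly the uncertainty weighting described in the text.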
As the number of observations approaches infinity and the model posterior concentrates to a point, the predictive uncertainty in the linear regression model approaches a nonzero constant. Thus, in the limit, the KL-diff local relevance measure for the linear model is proportional to the corresponding regression coefficient irrespective of the input point of evaluation. In the supplementary material, we show that the regression coefficient plays a significant role also in the asymptotic results for other generalized linear models. Equation (1) is thus a very natural way of evaluating the predictive relevance of input variables that also generalizes to nonlinear and nonparametric models.
2.2 Identifying Pairwise Interactions
Efficiently and accurately identifying interaction effects between input variables is important for many applications. Based on a complex model, the identified interactions can, for example, be used to build a more interpretable surrogate model. Because the number of variable pairs increases quadratically with the number of variables, it is crucial to be able to identify the interactions with a single model without having to explicitly fit multiple models when assessing the interactions.
Just as the partial derivatives of a model’s mean prediction with respect to one input variable can be used as a local measure of the relevance of that variable, the cross-derivatives with respect to two input variables measure the strength of their joint interaction effect. For example, Cui et al. (2019) use the cross-derivatives of a neural network’s prediction directly, while Friedman et al. (2008) estimate the interactions implicitly. By computing the cross-derivatives of the Kullback–Leibler divergence of predictive distributions, we construct an analogous measure of interaction strength that takes the predictive uncertainty into account.
We define the pairwise KL-diff local measure for the strength of the interaction between variables x_i and x_j as

\[
\text{KL-diff}_{ij}(x^*) \;=\; \sqrt{\sum_{k,l} \frac{\partial^2 \mathrm{KL}\!\left[p^* \,\middle\|\, p^{*\prime}\right]}{\partial \theta'_k\, \partial \theta'_l}\,\Bigg|_{x^{*\prime} = x^*} \frac{\partial^2 \theta'_k}{\partial x^{*\prime}_i \partial x^{*\prime}_j}\, \frac{\partial^2 \theta'_l}{\partial x^{*\prime}_i \partial x^{*\prime}_j}} \tag{3}
\]

where we have omitted higher-order derivatives and non-cross-derivative terms. This measure has the same form as the KL-diff measure in equation (1), but the partial derivatives of the parameters of the predictive distribution are now cross-derivatives. The interpretation of the terms is similar to KL-diff, and the second derivatives of the Kullback–Leibler divergence similarly weight the local measures based on the predictive uncertainty.
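As a minimal sketch of this pairwise measure, assume a hypothetical Gaussian predictive distribution whose mean contains an interaction term and whose standard deviation is constant, so only the mean contributes to the Fisher quadratic form. Cross-derivatives are approximated here by finite differences, although automatic differentiation would typically be used.

```python
import numpy as np

# Hypothetical Gaussian predictive: mean with an interaction, constant sd.
def predictive_params(x):                 # x is a length-2 input
    mu = x[0] + x[1] + 1.5 * x[0] * x[1]  # illustrative mean with interaction
    sd = 0.4                              # constant predictive sd for brevity
    return mu, sd

def cross_grad_mu(x, i, j, h=1e-4):
    """Cross-derivative d^2 mu / dx_i dx_j via central finite differences."""
    def f(a, b):
        z = x.copy(); z[i] += a; z[j] += b
        return predictive_params(z)[0]
    return (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4.0 * h**2)

def kl_diff_pair(x, i, j):
    _, sd = predictive_params(x)
    # Only the mean varies, so the Fisher quadratic form reduces to
    # (d^2 mu / dx_i dx_j)^2 / sd^2; take the square root.
    return abs(cross_grad_mu(x, i, j)) / sd

x = np.array([0.2, -0.5])
print(kl_diff_pair(x, 0, 1))   # ≈ 1.5 / 0.4 = 3.75
```

The interaction coefficient 1.5 is recovered directly, scaled by the predictive uncertainty.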
2.3 Global Measures of Predictive Relevance
For assessing the global predictive relevance of input variables or pairs of variables, it is natural to integrate the local KL-diff and pairwise KL-diff measures over the probability distribution of the inputs. As an approximation, we use the average over all of the training points. Using the global measures, we can rank the input variables or variable pairs. These rankings can be used to directly select a smaller set of variables and interactions for a surrogate model, or as a starting point for another variable selection method.
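As a sketch, the global measure is simply the empirical mean of the local measure over the training inputs, after which variables can be ranked. The local measure below is a hypothetical stand-in, not an actual KL-diff computation.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))

# Stand-in local relevance per variable (in practice: KL-diff at point x).
def local_relevance(x):
    return np.array([abs(2.0 * x[0]), abs(np.cos(x[1])), 0.1])

# Global measure: average over training inputs, then rank the variables.
glob = np.mean([local_relevance(x) for x in X_train], axis=0)
ranking = np.argsort(-glob)
print(glob, ranking)
```

The same averaging applies unchanged to the pairwise measure, with one value per variable pair instead of per variable.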
When identifying the relevance of interactions, one could also evaluate the local relevance measures in different permutations of the training inputs, as done by Friedman et al. (2008) and Greenwell et al. (2018). However, this may be inefficient and inaccurate if the input variables are not independent. It is also possible to assess relevance in some particular subsection of the inputs or outside the training data.
2.4 Applicability
The requirements for using the proposed KL-diff and pairwise KL-diff methods are that we must have an analytical representation of the predictive distribution conditioned on the input variables, and that the derivatives of its parameters with respect to the inputs must be available. For the latter, computational tools such as automatic differentiation can be used (Baydin et al., 2018). There are no restrictions on the support of the predictive distribution, and the methods are thus applicable to many learning tasks.
Because the proposed methods measure the relevance of input variables locally, they are most useful for nonlinear and complex models. For example, Gaussian process models are suitable because they can represent flexible functions with nonlinearities and interactions while providing good uncertainty quantification (O’Hagan, 1978; MacKay, 1998; Neal, 1998; Rasmussen and Williams, 2006). In the supplementary material, we show the derivatives required by the KL-diff and pairwise KL-diff methods for Gaussian processes with commonly used likelihoods.
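For Gaussian processes these input derivatives are analytic. As a sketch under assumed, fixed hyperparameters (rather than optimized ones), the gradient of the GP predictive mean with a squared exponential kernel follows from ∂k(x, x_n)/∂x_d = −(x_d − x_{n,d})/ℓ² · k(x, x_n); only the mean gradient is shown here, with the variance derivatives omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy GP regression; hyperparameters are fixed by assumption, not optimized.
n, d = 30, 2
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
ell, sf2, sn2 = 1.0, 1.0, 0.01                   # lengthscale, signal, noise

def k(a, B):
    """Squared exponential kernel between point a and rows of B."""
    return sf2 * np.exp(-0.5 * np.sum((a - B) ** 2, axis=1) / ell**2)

K = sf2 * np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / ell**2)
alpha = np.linalg.solve(K + sn2 * np.eye(n), y)  # (K + sn2 I)^{-1} y

def dmu(x):
    """Analytic gradient of the GP predictive mean k(x, X) @ alpha."""
    kx = k(x, X)
    # d k(x, x_n) / d x_d = -(x_d - x_{n,d}) / ell^2 * k(x, x_n)
    return -((x - X) / ell**2 * kx[:, None]).T @ alpha

x0 = np.zeros(d)
grad = dmu(x0)
print(grad)   # first component should dominate: y depends only on X[:, 0]
```

These gradients are then combined with the Fisher information of the predictive distribution exactly as in equation (1).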
Paananen et al. (2019) use a finite-difference-like method that evaluates the Kullback–Leibler divergence of predictive distributions when the input variables are perturbed. In the supplementary material, we show that their method can be derived as a finite difference approximation to the KL-diff measure. Using KL-diff avoids numerical errors related to finite differences and is easier to use because the selection of the perturbation size is avoided. For an appropriately chosen perturbation, the two approaches produce practically identical results. The finite difference approximation is more general, however, as it requires only an analytical representation of the predictive distribution but not the derivatives of its parameters with respect to the inputs.
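The equivalence is easy to see numerically. Near x, the divergence behaves like KL ≈ ½·λ·δ², where λ is the second derivative in equation (1), so a finite-difference counterpart in the spirit of Paananen et al. (2019) (exact scaling conventions may differ) recovers the analytic measure as δ → 0. A sketch with a hypothetical Gaussian predictive:

```python
import numpy as np

# Hypothetical Gaussian predictive distribution (illustrative functions).
def predictive(x):
    return np.sin(2.0 * x), 0.3 + 0.1 * x**2     # mean, sd

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5

def kl_diff_analytic(x):
    _, sd = predictive(x)
    dmu, dsd = 2.0 * np.cos(2.0 * x), 0.2 * x    # parameter derivatives
    # Fisher quadratic form for a Gaussian: dmu^2/sd^2 + 2 dsd^2/sd^2.
    return np.sqrt(dmu**2 / sd**2 + 2.0 * dsd**2 / sd**2)

x, delta = 0.3, 1e-3
m0, s0 = predictive(x)
m1, s1 = predictive(x + delta)
fd = np.sqrt(2.0 * kl_gauss(m0, s0, m1, s1)) / delta

print(fd, kl_diff_analytic(x))   # nearly identical for small delta
```

For larger δ the finite-difference value drifts away from the analytic one, which is the perturbation-size sensitivity that the analytic derivative avoids.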
For the pairwise KL-diff method, it is not possible to derive an equivalent finite difference approximation because there is no second-order Kullback–Leibler divergence. Riihimäki et al. (2010) perturb two input variables at a time with a unit-length perturbation and measure the change in predictions by the Kullback–Leibler divergence. For an infinitesimal perturbation, this would be equivalent to a directional derivative instead of the cross-derivative used in the pairwise KL-diff method.
Due to their generality, the KL-diff and pairwise KL-diff methods could in principle be applied to any binary classifier where the classification probability is available and its gradients are computable, such as a neural network. However, because the methods weight observations based on their classification probability, the results may be misleading if the model lacks proper uncertainty quantification. We thus recommend using the methods for probabilistic models where the uncertainty is properly taken into consideration. Due to the conservative property discussed in Section 2.1, they are most useful when the number of observations is relatively small and there is a lot of uncertainty about the parameters of the model.

The added computational expense of the KL-diff methods compared to just differentiating the mean prediction depends on the model used. For many models, the cost is insignificant compared to the cost of inference.
3 Experiments
In this section, we demonstrate the variable relevance assessment methods discussed in Section 2 with both simulated and real data. The focus of the experiments is on assessing the performance of the pairwise interaction measure. In the supplement, we also analyze the accuracy of the KL method of Paananen et al. (2019), which uses a finite difference approximation, and show that for a well-chosen perturbation size it is practically equivalent to the KL-diff method for individual variables.
In all examples, we use Gaussian process models for demonstrating the methods and comparing them to alternatives. To be fully Bayesian, one should integrate over the posterior distribution of the hyperparameters of the Gaussian process model. Here, we instead optimize the hyperparameters by maximizing the marginal likelihood, which enables representing the predictive distribution in analytical form. This is a commonly used approach and works well when the number of observations is much larger than the number of features. We use the squared exponential covariance function (supplementary equation (5)), which generates smooth functions and can model interactions between all input variables.

We compare the KL-diff single-variable relevance assessment method to two alternative methods. The first is automatic relevance determination (ARD) (MacKay, 1994; Neal, 1995; Williams and Rasmussen, 1996; Seeger, 2000), a model-based approach that ranks the variables based on their individual lengthscale hyperparameters, such that a variable with a short lengthscale is assigned high relevance. The second baseline method is “RelATive cEntrality” (RATE), which defines a measure analogous to the regression coefficient that applies to nonlinear models (Crawford et al., 2019).
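The ARD ranking rule is simple enough to state as a one-line sketch; the lengthscale values below are hypothetical optimized hyperparameters, not taken from any experiment.

```python
import numpy as np

# ARD-style ranking sketch: shorter lengthscale -> higher relevance.
lengthscales = np.array([0.5, 2.0, 10.0])   # hypothetical optimized values
relevance = 1.0 / lengthscales              # inverse lengthscale as relevance
ranking = np.argsort(-relevance)            # most relevant variable first
print(ranking)
```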
We compare the pairwise KL-diff interaction relevance method to two alternative methods that rely on local relevance measures. The first is the H statistic of Friedman et al. (2008), which makes predictions for each variable pair. The second is from Greenwell et al. (2018) and computes the standard deviation of partial dependence functions to evaluate the strength of interactions. We follow the authors’ recommendation to evaluate the partial dependence functions at the pairwise deciles of the two input variables. Both methods thus need a much larger number of predictions than pairwise KL-diff, which uses only the predictions at the training points for each interaction.

3.1 Simulated Individual and Pairwise Effects
In the first experiment, we consider a toy example with 12 input variables, each independently normally distributed with zero mean and equal standard deviation. The target variable is constructed as a sum of 8 additive terms depending on a single input variable and 3 terms depending on two input variables. The terms depending on a single variable have a range of linear and nonlinear dependencies on the inputs, and the interaction terms are products of two input variables. The model is as follows:
(4)  
where the interacting variables form three pairs. The scaling factors are chosen such that the expected variance of each of the terms is equal, and the frequencies of the sinusoidal terms are chosen such that the slowest is almost linear and the fastest has one period over the interval covering most of the inputs.
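As an illustration of this structure, the sketch below generates data of the described form. The specific single-variable functions, the interaction pair indices, and the noise level are stand-ins chosen to match the description (one pair where both variables have main effects, one where only one does, and one where neither does), not the exact terms of equation (4); the variance-equalizing scaling factors are also omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 12
X = rng.normal(size=(n, d))          # independent normal inputs

# Stand-in single-variable terms with increasing frequency (near-linear
# through clearly nonlinear), for the first 8 variables.
main = sum(np.sin((j + 1) * 0.5 * X[:, j]) for j in range(8))

# Stand-in interaction pairs: both / one / neither variable has a main
# effect; variable index 8 (i.e., "variable 9") is left totally irrelevant.
inter = X[:, 0] * X[:, 1] + X[:, 7] * X[:, 9] + X[:, 10] * X[:, 11]

y = main + inter + 0.5 * rng.normal(size=n)   # assumed noise level

print(X.shape, y.shape)
```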
We generated 400 observations and modelled the data using a Gaussian process with a Gaussian observation model. The global single-variable and interaction relevance values of all 12 input variables were computed by averaging over the predictions at the training inputs. Based on 20 random simulations of the data, the means of the single-variable relevances are shown in Figure 2 together with uncertainty intervals of the means. Variables 1 to 8 have equally strong main effects in the data, whereas variables 9 to 12 do not have a main effect. Three variable pairs have interaction effects, and only variable 9 is totally irrelevant.
Figure 2 shows that all three methods correctly identify variable 9 as irrelevant. Looking at variables 2, 3, 5, 7, and 8, we see that ARD favours the nonlinear main effects over the linear effects; KL-diff does as well, but to a much smaller extent. This was also reported by Paananen et al. (2019). Both methods also give higher relevance values when a variable has both a main effect and an interaction. For RATE, the variance between the different training data sets is very large, leading to highly variable rankings. For variables 10 to 12, which have interaction effects but no main effects, RATE gives almost zero relevance, but ARD and KL-diff consistently identify them as relevant.
The averages and uncertainty intervals of the pairwise relevance values are shown in Figure 3. The markers with a green background depict the three pairs that have a true interaction effect in the data. Black represents the pairwise KL-diff method, orange the H statistic (HS), and purple the partial dependence variation method (PD) of Greenwell et al. (2018). The figure shows that all three methods consistently separate the true interactions from the non-interacting pairs of variables, but KL-diff is the only method that gives them roughly equal relevance values. The H statistic exaggerates the strength of the third interacting pair, where neither variable has a main effect, which was also noted in Greenwell et al. (2018). Conversely, the PD method overemphasizes the strength of the leftmost pair, both of whose variables have main effects in addition to the interaction effect. All methods are fairly consistent between different simulated data sets.
In this simulated example, all the input variables are independent. Because both HS and PD evaluate predictions at permutations of the training data, they may not work reliably when the input variables are dependent. This issue does not concern the pairwise KL-diff method, as it is evaluated only at the actual training data.
In this example, we evaluated the relevance of all variable pairs. However, if the number of variables is large, computational resources can be saved by first evaluating the individual relevances using the KL-diff method and then evaluating only the pairs of the most relevant variables. As Figure 2 shows, the KL-diff method identifies variables 10 to 12 as relevant when they have an interaction effect even though they do not have a main effect.
To study how many observations the different methods require to reliably detect the true interactions in the data, we generate data from the same model with a range of different numbers of observations. In Figure 4, we plot the relevance values of six variable pairs averaged over 50 simulations. The solid lines represent pairs with a true interaction, and the dotted lines are pairs without an interaction. In the left plot, both variables of each pair have a main effect. In the middle plot, only one variable of each pair has a main effect, and in the right plot neither variable has a main effect. For each of the 50 simulations, the interaction relevance values are scaled so that the maximum given by each method is one. Thus, the ideal value is 1 for the solid lines and 0 for the dotted lines.
Figure 4 shows that when the number of observations is small, all methods perform roughly equally. However, as the number of observations increases, pairwise KL-diff is the only method that correctly identifies that the relevances of the three true interactions (solid lines) are equal. The HS method overemphasizes the variable pair where neither variable has a main effect (right plot), whereas the PD method overemphasizes the variable pair where both variables have a main effect (left plot).
3.2 Benchmark Data Set
In this experiment, we evaluate how well the pairwise KL-diff method identifies interaction effects from a real data set. We use the Bike sharing data set from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.html), where the target variable is the hourly number of bike uses from a bicycle rental system. There are 6 input variables representing weather and date: feeling temperature, humidity, wind speed, hour, weekday, and a binary variable indicating whether the day is a working day. We use only observations from February, which results in a total of 1339 hourly observations from 2 years. We model the problem using a Gaussian process model with a Poisson likelihood for the number of hourly bike uses. The derivatives needed by the KL-diff methods for this model are presented in the supplementary material.

When evaluating the pairwise interactions using the full data, KL-diff identifies a strong interaction between hour and feeling temperature as well as between hour and weekday. To evaluate the plausibility of the interactions identified by the different methods, we compare the out-of-sample predictive performance of models with explicit interaction terms chosen based on the identified interactions. We compare the performance of the models using cross-validation with 10 random splits into training and test sets of sizes 500 and 839, using log predictive density as the utility function. For each training set, the Gaussian process model with full interactions is fitted, and the pairwise interactions are identified with KL-diff, HS, and PD. Based on these, a model with 1, 2, or 3 interaction terms is fitted again, and its predictive performance is evaluated on the test data. For comparison, we also evaluate the model with no interactions. The mean log predictive densities (MLPDs) across the test splits, together with uncertainty intervals of the means, are shown in Figure 5.
The figure shows that modelling only the three strongest pairwise interactions increases the out-of-sample predictive performance to the level of the model with all interactions. The KL-diff method identifies more relevant interactions than the competing methods, as seen in the better predictive performance of the models with 1 to 3 interactions.
4 Conclusions
In this work, we proposed a method for assessing the predictive relevance of individual input variables and interacting variable pairs using local derivatives of the Kullback–Leibler divergence between predictive distributions. Using both simulated and real datasets, we demonstrated that the method can reliably identify main effects as well as interactions in nonlinear models for complex datasets. By utilizing the predictive uncertainty of a model, the method is comparable in performance to alternative methods that use only the mean prediction, while requiring fewer predictions. We also showed that the single-variable KL-diff method can be used to preselect the variable pairs whose interactions are then evaluated with the pairwise measure, avoiding an exhaustive assessment of all pairs.
We considered only the Kullback–Leibler divergence and showed that it has a conservative property that inflates small local relevance estimates when there is a lot of uncertainty in the model’s predictions. We leave the investigation of other divergence measures as future research. Another possible future direction is the search for interactions between observation groups, which would be useful for building multilevel models.
5 Acknowledgements
We thank Alejandro Catalina and Kunal Ghosh for helpful comments to improve the manuscript.
References
 Baydin et al. (2018) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153).
 Crawford et al. (2019) Crawford, L., Flaxman, S. R., Runcie, D. E., West, M., et al. (2019). Variable prioritization in nonlinear black box methods: A genetic association case study. The Annals of Applied Statistics, 13(2):958–989.
 Cui et al. (2019) Cui, T., Marttinen, P., and Kaski, S. (2019). Recovering pairwise interactions using neural networks. arXiv preprint arXiv:1901.08361.
 Czernichow (1996) Czernichow, T. (1996). Architecture selection through statistical sensitivity analysis. In International Conference on Artificial Neural Networks, pages 179–184. Springer.
 Dorizzi (1996) Dorizzi, B. (1996). Variable selection using generalized RBF networks: Application to the forecast of the French T-bonds. Proceedings of IEEE-IMACS'96, Lille, France.
 Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
 Friedman et al. (2008) Friedman, J. H., Popescu, B. E., et al. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954.
 Gelman and Pardoe (2007) Gelman, A. and Pardoe, I. (2007). Average predictive comparisons for models with nonlinearity, interactions, and variance components. Sociological Methodology, 37(1):23–51.
 Goldstein et al. (2015) Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65.
 Greenwell et al. (2018) Greenwell, B. M., Boehmke, B. C., and McCarthy, A. J. (2018). A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755.
 Guyon et al. (2002) Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3):389–422.
 Leray and Gallinari (1999) Leray, P. and Gallinari, P. (1999). Feature selection with neural networks. Behaviormetrika, 26(1):145–166.
 MacKay (1994) MacKay, D. J. (1994). Bayesian nonlinear modeling for the prediction competition. ASHRAE Transactions, 100(2):1053–1062.
 MacKay (1998) MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, pages 133–166. Springer Verlag.
 Minka (2001) Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology.
 Neal (1995) Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.
 Neal (1998) Neal, R. (1998). Regression and classification using Gaussian process priors (with discussion). In Bernardo, J., Berger, J., Dawid, A., and Smith, A., editors, Bayesian Statistics, volume 6, pages 475–501. Oxford University Press.
 O'Hagan (1978) O'Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society: Series B (Methodological), 40(1):1–24.
 Opper and Winther (2000) Opper, M. and Winther, O. (2000). Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655–2684.
 Paananen et al. (2019) Paananen, T., Piironen, J., Andersen, M. R., and Vehtari, A. (2019). Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution. In Proceedings of Machine Learning Research, volume 89, pages 1743–1752.
 Rasmussen (2003) Rasmussen, C. (2003). Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. Bayesian Statistics, 7:651–659.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge.
 Refenes and Zapranis (1999) Refenes, A.-P. and Zapranis, A. (1999). Neural model identification, variable selection and model adequacy. Journal of Forecasting, 18(5):299–332.
 Riihimäki et al. (2010) Riihimäki, J., Sund, R., and Vehtari, A. (2010). Analysing the length of care episode after hip fracture: a nonparametric and a parametric Bayesian approach. Health Care Management Science, 13(2):170–181.
 Ruck et al. (1990) Ruck, D. W., Rogers, S. K., and Kabrisky, M. (1990). Feature selection using a multilayer perceptron. Journal of Neural Network Computing, 2(2):40–48.
 Seeger (2000) Seeger, M. (2000). Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in Neural Information Processing Systems, pages 603–609.
 Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., Sørbye, S. H., et al. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28.
 Solak et al. (2003) Solak, E., Murray-Smith, R., Leithead, W. E., Leith, D. J., and Rasmussen, C. E. (2003). Derivative observations in Gaussian process models of dynamic systems. In Advances in Neural Information Processing Systems, pages 1057–1064.
 Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3319–3328. JMLR.org.
 Williams and Barber (1998) Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.
 Williams and Rasmussen (1996) Williams, C. K. and Rasmussen, C. E. (1996). Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520.
Supplementary Material
Finite Difference Approximation of the Kullback–Leibler Divergence
Consider two probability distributions $p_{\theta}$ and $p_{\theta'}$, parameterized by the vectors $\theta$ and $\theta'$, respectively. Keeping $\theta$ constant, let us make a second-order Taylor approximation of the Kullback–Leibler divergence between the distributions in the neighbourhood around $\theta' = \theta$:
$$\mathrm{KL}(p_{\theta} \,\|\, p_{\theta'}) \approx \mathrm{KL}(p_{\theta} \,\|\, p_{\theta'})\big|_{\theta'=\theta} + (\theta'-\theta)^{\top} \nabla_{\theta'}\,\mathrm{KL}(p_{\theta} \,\|\, p_{\theta'})\big|_{\theta'=\theta} + \tfrac{1}{2}(\theta'-\theta)^{\top} F(\theta)\,(\theta'-\theta) + O\big(\|\theta'-\theta\|^{3}\big).$$
The first two terms are zero, because the Kullback–Leibler divergence attains its minimum value of zero at $\theta' = \theta$; the Hessian of the divergence at this point is the Fisher information matrix $F(\theta)$. Dropping the zero terms and the third-order remainder, we are left with the approximation
$$\mathrm{KL}(p_{\theta} \,\|\, p_{\theta'}) \approx \tfrac{1}{2}(\theta'-\theta)^{\top} F(\theta)\,(\theta'-\theta).$$
If the distributions $p_{\theta}$ and $p_{\theta'}$ are predictive distributions, then the parameters $\theta$ and $\theta'$ depend on the input point $\mathbf{x}$, i.e. $\theta = \theta(\mathbf{x})$. When only one input variable, $x_d$, is varied, an infinitesimal change in the parameters can be written as
$$\mathrm{d}\theta = \frac{\partial \theta(\mathbf{x})}{\partial x_d}\,\mathrm{d}x_d.$$
Thus, for a perturbation $\Delta$ in $x_d$, we get
$$\mathrm{KL}\big(p_{\theta(\mathbf{x})} \,\|\, p_{\theta(\mathbf{x}+\Delta \mathbf{e}_d)}\big) \approx \frac{1}{2}\,\Delta^{2}\, \frac{\partial \theta}{\partial x_d}^{\!\top} F(\theta)\, \frac{\partial \theta}{\partial x_d}.$$
Rearranging the terms gives the approximate equivalence
$$\frac{2\,\mathrm{KL}\big(p_{\theta(\mathbf{x})} \,\|\, p_{\theta(\mathbf{x}+\Delta \mathbf{e}_d)}\big)}{\Delta^{2}} \approx \frac{\partial \theta}{\partial x_d}^{\!\top} F(\theta)\, \frac{\partial \theta}{\partial x_d}.$$
Based on the chain rule of differentiation, the right-hand side is equal to the second derivative of the divergence with respect to the perturbed input,
$$\frac{\partial \theta}{\partial x_d}^{\!\top} F(\theta)\, \frac{\partial \theta}{\partial x_d} = \frac{\partial^{2}}{\partial x_d'^{2}}\,\mathrm{KL}\big(p_{\theta(\mathbf{x})} \,\|\, p_{\theta(\mathbf{x}')}\big)\Big|_{\mathbf{x}'=\mathbf{x}}.$$
Finally, taking the square root gives the approximate equivalence
$$\frac{\sqrt{2\,\mathrm{KL}\big(p_{\theta(\mathbf{x})} \,\|\, p_{\theta(\mathbf{x}+\Delta \mathbf{e}_d)}\big)}}{|\Delta|} \approx \sqrt{\frac{\partial^{2}}{\partial x_d'^{2}}\,\mathrm{KL}\big(p_{\theta(\mathbf{x})} \,\|\, p_{\theta(\mathbf{x}')}\big)\Big|_{\mathbf{x}'=\mathbf{x}}}\,,$$
where the left-hand side is the finite-difference KL method of Paananen et al. (2019) and the right-hand side is the KL-diff measure.
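The approximate equivalence can be checked numerically for a toy Gaussian predictive distribution. In the sketch below, the mean and variance functions are arbitrary illustrative choices (not from the paper); the analytical KL-diff, computed from the Fisher-information quadratic form, should closely match the finite-difference quantity for a small perturbation.

```python
import math

def kl_normal(m0, v0, m1, v1):
    # KL( N(m0, v0) || N(m1, v1) ) for univariate Gaussians
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

# hypothetical predictive mean and variance as functions of a scalar input
mu, dmu = lambda x: math.sin(x), lambda x: math.cos(x)
var, dvar = lambda x: 0.5 + 0.1 * x ** 2, lambda x: 0.2 * x

def kl_diff(x):
    # sqrt of the Fisher-information quadratic form; for a Gaussian,
    # F = diag(1/sigma^2, 1/(2 sigma^4)) in the (mean, variance) parameterization
    v = var(x)
    return math.sqrt(dmu(x) ** 2 / v + dvar(x) ** 2 / (2.0 * v ** 2))

def kl_fd(x, delta=1e-4):
    # finite-difference method: sqrt(2 KL(p(x) || p(x + delta))) / delta
    kl = kl_normal(mu(x), var(x), mu(x + delta), var(x + delta))
    return math.sqrt(2.0 * kl) / delta

print(kl_diff(0.7), kl_fd(0.7))  # the two should agree to several decimals
```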
Asymptotic Results for Generalized Linear Models
In Section 2.1 of the main paper, we discussed that for a Bayesian linear regression model, the predictive distribution becomes constant as a function of the input variables when the number of observations goes to infinity. In this section, we derive the analogous asymptotic result for the logistic regression model.
Logistic Regression Model
The predictive distribution of a logistic regression model is a Bernoulli distribution. In the asymptotic limit, the posterior of the regression coefficients concentrates to a point $\boldsymbol{\beta}$, and the "success probability" parameter as a function of the input variables is the logistic function
$$\mu(\mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\beta}^{\top}\mathbf{x})}.$$
The derivative of $\mu(\mathbf{x})$ with respect to $x_d$ is
$$\frac{\partial \mu(\mathbf{x})}{\partial x_d} = \beta_d\, \mu(\mathbf{x})\big(1 - \mu(\mathbf{x})\big).$$
The second derivative of the Kullback–Leibler divergence with respect to $\mu$ is the Fisher information of the Bernoulli distribution,
$$F(\mu) = \frac{1}{\mu(1-\mu)}.$$
In the limit when the number of observations goes to infinity, the KL-diff measure thus evaluates to
$$\sqrt{F(\mu)\left(\frac{\partial \mu}{\partial x_d}\right)^{2}} = |\beta_d|\,\sqrt{\mu(\mathbf{x})\big(1 - \mu(\mathbf{x})\big)}.$$
Again we see that the relevance measure is proportional to the absolute value of the regression coefficient. In addition, due to the logistic link function, the local relevance measure is higher for points close to the decision boundary compared to points further away.
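A small numerical check of this asymptotic result, using an arbitrary illustrative coefficient value: the closed-form measure should agree with the finite-difference version and should be largest near the decision boundary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_bernoulli(p, q):
    # KL divergence from Bernoulli(p) to Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

beta = 1.5  # hypothetical regression coefficient (illustrative value)

def kl_diff(x):
    # asymptotic KL-diff for logistic regression: |beta| * sqrt(mu (1 - mu))
    mu = sigmoid(beta * x)
    return abs(beta) * math.sqrt(mu * (1 - mu))

def kl_fd(x, delta=1e-4):
    # finite-difference counterpart: sqrt(2 KL) / delta
    p, q = sigmoid(beta * x), sigmoid(beta * (x + delta))
    return math.sqrt(2.0 * kl_bernoulli(p, q)) / delta

print(kl_diff(0.4), kl_fd(0.4))     # should agree closely
print(kl_diff(0.0) > kl_diff(2.0))  # relevance peaks at the decision boundary
```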
Derivatives of Gaussian Process Predictive Distribution Parameters
Gaussian process models are widely used in supervised learning, where the task is to predict an output $y$ from a $D$-dimensional input $\mathbf{x}$. The types of functions the Gaussian process can represent are determined by its covariance function, which is a key decision made during modelling. The covariance function $k(\mathbf{x}_i, \mathbf{x}_j)$ defines the covariance between the function values at the input points $\mathbf{x}_i$ and $\mathbf{x}_j$. We assume the Gaussian process has a zero mean, in which case the joint distribution of the latent output values $\mathbf{f}$ at the training points is
$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}),$$
where $\mathbf{K}$ is the covariance matrix between the latent function values at the training inputs such that $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
In this work, we use the squared exponential covariance function
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^{2} \exp\left(-\sum_{d=1}^{D} \frac{(x_d - x'_d)^{2}}{2\, l_d^{2}}\right). \quad (5)$$
Here, the hyperparameter $\sigma_f^{2}$ determines the overall variability of the functions, and $l_1, \ldots, l_D$ are the length-scales of each input dimension. By defining an observation model that links the observations to the latent values of the Gaussian process, the model can be used for inference and prediction in many supervised learning tasks.
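A minimal implementation of the covariance function in equation (5); the default unit hyperparameters are illustrative choices.

```python
import math

def se_kernel(x, x_prime, sigma_f=1.0, lengthscales=None):
    # squared exponential covariance of equation (5):
    # k(x, x') = sigma_f^2 * exp(-sum_d (x_d - x'_d)^2 / (2 l_d^2))
    if lengthscales is None:
        lengthscales = [1.0] * len(x)
    s = sum((a - b) ** 2 / (2.0 * l ** 2)
            for a, b, l in zip(x, x_prime, lengthscales))
    return sigma_f ** 2 * math.exp(-s)

# the covariance equals sigma_f^2 at zero distance and decays with distance,
# at a per-dimension rate controlled by the length-scales
print(se_kernel([0.0, 0.0], [0.0, 0.0]))
print(se_kernel([0.0, 0.0], [1.0, 0.0], lengthscales=[2.0, 0.5]))
```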
For example, in regression with an assumption of Gaussian noise, the posterior distribution of the latent value for a new input point $\mathbf{x}_*$ is a univariate normal distribution with mean and variance
$$\mu(\mathbf{x}_*) = \mathbf{k}_*^{\top} \big(\mathbf{K} + \sigma^{2}\mathbf{I}\big)^{-1} \mathbf{y}, \qquad \sigma^{2}(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top} \big(\mathbf{K} + \sigma^{2}\mathbf{I}\big)^{-1} \mathbf{k}_*, \quad (6)$$
where $\mathbf{k}_* = \big(k(\mathbf{x}_*, \mathbf{x}_1), \ldots, k(\mathbf{x}_*, \mathbf{x}_n)\big)^{\top}$, $\sigma^{2}$ is the noise variance, $\mathbf{I}$ is the identity matrix, and $\mathbf{y}$ is the vector of training outputs. For many other observation models, the posterior of the latent values is not Gaussian, but it is commonplace to approximate it with a Gaussian distribution during inference, and many methods have been developed for doing so
(Williams and Barber, 1998; Opper and Winther, 2000; Minka, 2001; Rasmussen and Williams, 2006). The variable relevance assessment thus depends implicitly on the posterior approximation, as does any method that uses the predictive distribution.
Differentiating Gaussian Processes
We assume that the posterior distribution of the latent values is Gaussian. Because differentiation is a linear operation, the derivatives of the parameters of a Gaussian process posterior distribution with respect to the input variables are available in closed form (Solak et al., 2003; Rasmussen, 2003). For example, for the Gaussian observation model, the derivatives of the mean and variance of the predictive distribution in equation (6) with respect to the input variable $x_d$ at the point $\mathbf{x}_*$ are given as
$$\frac{\partial \mu(\mathbf{x}_*)}{\partial x_{*,d}} = \frac{\partial \mathbf{k}_*^{\top}}{\partial x_{*,d}} \big(\mathbf{K} + \sigma^{2}\mathbf{I}\big)^{-1} \mathbf{y}, \qquad \frac{\partial \sigma^{2}(\mathbf{x}_*)}{\partial x_{*,d}} = -2\, \frac{\partial \mathbf{k}_*^{\top}}{\partial x_{*,d}} \big(\mathbf{K} + \sigma^{2}\mathbf{I}\big)^{-1} \mathbf{k}_*.$$
For the squared exponential covariance function in equation (5), the partial derivatives with respect to the input variable $x_d$ are
$$\frac{\partial k(\mathbf{x}, \mathbf{x}')}{\partial x_d} = -\frac{x_d - x'_d}{l_d^{2}}\, k(\mathbf{x}, \mathbf{x}').$$
The second derivatives are
$$\frac{\partial^{2} k(\mathbf{x}, \mathbf{x}')}{\partial x_d\, \partial x_e} = \left(\frac{(x_d - x'_d)(x_e - x'_e)}{l_d^{2}\, l_e^{2}} - \frac{\delta_{de}}{l_d^{2}}\right) k(\mathbf{x}, \mathbf{x}'),$$
where $\delta_{de}$ is the Kronecker delta.
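The closed-form derivative of the posterior mean can be checked against a finite difference. The sketch below uses a one-dimensional toy data set with a single shared length-scale; the data and hyperparameter values are illustrative assumptions, not from the paper.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    # squared exponential covariance (eq. 5) with one shared length-scale
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * ell ** 2))

# toy 1-D training data (illustrative)
X = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.sin(X)
sigma_n, ell = 0.1, 1.0
K = se_kernel(X, X, ell)
alpha = np.linalg.solve(K + sigma_n ** 2 * np.eye(len(X)), y)

def post_mean(x_star):
    # posterior mean of eq. (6): k_*^T (K + sigma^2 I)^{-1} y
    return se_kernel(np.array([x_star]), X, ell)[0] @ alpha

def post_mean_deriv(x_star):
    # closed form via dk(x*, x_i)/dx* = -(x* - x_i)/ell^2 * k(x*, x_i)
    k_star = se_kernel(np.array([x_star]), X, ell)[0]
    return (-(x_star - X) / ell ** 2 * k_star) @ alpha

x0, h = 0.3, 1e-6
fd = (post_mean(x0 + h) - post_mean(x0 - h)) / (2.0 * h)
print(post_mean_deriv(x0), fd)  # should agree to several decimals
```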
For the KLdiff measure, we need derivatives with respect to the parameters of the predictive distribution and not the posterior of the latent values. However, for many observation models these are obtained as a function of the derivatives of the latent values. In this section, we derive the derivatives for some commonly used observation models.
Regression with Gaussian Observation Model
In regression problems, it is commonly assumed that the noise has a Gaussian distribution. For a Gaussian observation model, the predictive distribution for a new observation $y_*$ at a single input point $\mathbf{x}_*$ is a normal distribution, which we will denote
$$p(y_* \,|\, \mathbf{x}_*) = \mathcal{N}\big(y_* \,|\, \mu(\mathbf{x}_*),\, \sigma^{2}(\mathbf{x}_*) + \sigma^{2}\big),$$
where $\mu(\mathbf{x}_*)$ and $\sigma^{2}(\mathbf{x}_*)$ are the mean and variance of the posterior distribution of the latent values at $\mathbf{x}_*$, and $\sigma^{2}$ is the noise variance. Now, the derivatives of the predictive mean $\mu_* = \mu(\mathbf{x}_*)$ and variance $\sigma_*^{2} = \sigma^{2}(\mathbf{x}_*) + \sigma^{2}$ with respect to the input variable $x_d$ are simply
$$\frac{\partial \mu_*}{\partial x_{*,d}} = \frac{\partial \mu(\mathbf{x}_*)}{\partial x_{*,d}}, \qquad \frac{\partial \sigma_*^{2}}{\partial x_{*,d}} = \frac{\partial \sigma^{2}(\mathbf{x}_*)}{\partial x_{*,d}}.$$
The second derivatives of the Kullback–Leibler divergence are given by the Fisher information matrix of the normal distribution,
$$F\big(\mu_*, \sigma_*^{2}\big) = \begin{pmatrix} 1/\sigma_*^{2} & 0 \\ 0 & 1/\big(2\sigma_*^{4}\big) \end{pmatrix}.$$
Thus, the KL-diff measure takes the form
$$\sqrt{\frac{1}{\sigma_*^{2}} \left(\frac{\partial \mu_*}{\partial x_{*,d}}\right)^{2} + \frac{1}{2\sigma_*^{4}} \left(\frac{\partial \sigma_*^{2}}{\partial x_{*,d}}\right)^{2}}.$$
Here, the first term is proportional to the slope of the mean prediction scaled by the predictive uncertainty, as with the linear regression model discussed in Section 2.1 of the main paper.
Binary Classification
For binary classification problems, the predictive distribution is a Bernoulli distribution with only one parameter, the probability of positive classification. This is obtained by squashing the latent Gaussian process function through a link function and integrating over the posterior of the latent function values. Two commonly used link functions for Gaussian process classification are the logit and the probit. The probit link function has the benefit that the predictive distribution has an analytical formula when the posterior distribution of the latent values is approximated with a Gaussian. Using a probit link function, the predictive probability thus has the approximate analytical form
$$\pi(\mathbf{x}_*) \approx \Phi\left(\frac{\mu(\mathbf{x}_*)}{\sqrt{1 + \sigma^{2}(\mathbf{x}_*)}}\right),$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution. Now, the derivative of $\pi(\mathbf{x}_*)$ with respect to $x_d$ is
$$\frac{\partial \pi(\mathbf{x}_*)}{\partial x_{*,d}} = \phi\left(\frac{\mu(\mathbf{x}_*)}{\sqrt{1 + \sigma^{2}(\mathbf{x}_*)}}\right) \left(\frac{1}{\sqrt{1 + \sigma^{2}(\mathbf{x}_*)}}\, \frac{\partial \mu(\mathbf{x}_*)}{\partial x_{*,d}} - \frac{\mu(\mathbf{x}_*)}{2\big(1 + \sigma^{2}(\mathbf{x}_*)\big)^{3/2}}\, \frac{\partial \sigma^{2}(\mathbf{x}_*)}{\partial x_{*,d}}\right),$$
where $\phi$ is the density function of the standard normal distribution.
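As a check of this chain-rule derivative, the sketch below compares it to a central finite difference for hypothetical latent mean and variance functions (illustrative choices, not from the paper).

```python
import math

Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))           # standard normal CDF
phi = lambda t: math.exp(-0.5 * t ** 2) / math.sqrt(2.0 * math.pi)   # its density

# hypothetical latent mean and variance as functions of a scalar input
mu, dmu = lambda x: x, lambda x: 1.0
var, dvar = lambda x: 0.5 + 0.1 * x ** 2, lambda x: 0.2 * x

def pred_prob(x):
    # probit predictive probability: Phi(mu / sqrt(1 + sigma^2))
    return Phi(mu(x) / math.sqrt(1.0 + var(x)))

def pred_prob_deriv(x):
    # chain rule with z = mu * (1 + sigma^2)^(-1/2)
    s = 1.0 + var(x)
    z = mu(x) / math.sqrt(s)
    dz = dmu(x) / math.sqrt(s) - mu(x) * dvar(x) / (2.0 * s ** 1.5)
    return phi(z) * dz

x0, h = 0.6, 1e-5
fd = (pred_prob(x0 + h) - pred_prob(x0 - h)) / (2.0 * h)
print(pred_prob_deriv(x0), fd)  # should agree to several decimals
```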