Ranking variables and interactions using predictive uncertainty measures

October 17, 2019 · Topi Paananen et al.

For complex nonlinear supervised learning models, assessing the relevance of input variables or their interactions is not straightforward due to the lack of a direct measure of relevance, such as the regression coefficients in generalized linear models. One can assess the relevance of input variables locally by using the mean prediction or its derivative, but this disregards the predictive uncertainty. In this work, we present a Bayesian method for identifying relevant input variables with main effects and interactions by differentiating the Kullback-Leibler divergence of predictive distributions. The method averages over local measures of relevance and has a conservative property that takes into account the uncertainty in the predictive distribution. Our empirical results on simulated and real data sets with nonlinearities demonstrate accurate and efficient identification of relevant main effects and interactions compared to alternative methods.


1 Introduction

Identifying relevant features and interactions from complex data sets and models remains a topic of active research. This is a fundamental problem with important applications in many scientific disciplines. Often the goal is to improve understanding of the model, but the identified features and interactions can also be used to build a more interpretable surrogate model.

For models that can implicitly model nonlinear effects and interactions, the typical approach is to accumulate local relevance estimates based on predictions made at the training inputs or their permutations. Many local measures use the partial derivative of the model's prediction with respect to the features as the basis for measuring relevance (Guyon et al., 2002). Since the average derivative can vary from positive to negative, most approaches use absolute or squared derivatives (Ruck et al., 1990; Dorizzi, 1996; Czernichow, 1996; Refenes and Zapranis, 1999; Leray and Gallinari, 1999; Sundararajan et al., 2017). However, the derivative of the mean prediction fails to take into account the uncertainty of the prediction. The average predictive comparison of Gelman and Pardoe (2007) also integrates over the posterior uncertainty, but is prohibitively expensive in practice. Paananen et al. (2019) take into account changes in predictive uncertainty, but rely on finite differences that are prone to numerical errors.

Local values of the mean prediction or its derivative are often used also for identifying interacting variable pairs from complex models. For example, Friedman et al. (2008) and Greenwell et al. (2018) use partial dependence functions (Friedman, 2001) to construct statistics that measure the strength of pairwise interactions. Cui et al. (2019) use the Hessian of the prediction of a neural network as a local measure of interaction strength. The approach of Crawford et al. (2019) takes into account the predictive uncertainty, but only computes the relevance of variables without separating the main effects and interactions. The individual conditional expectation plots of Goldstein et al. (2015) can also be used to identify interactions, but they rely on visualization and do not provide quantification of the strength of the interactions.

The contributions of this work are summarized as follows. First, we present a novel method for assessing the predictive relevance of input variables and their interactions. Instead of using the first partial derivatives or the cross-derivatives of the mean prediction of a model, we differentiate the Kullback-Leibler divergence from one predictive distribution to another. We derive measures analogous to the derivatives that take into account the uncertainty in the model's predictions. Second, we show that the first derivative of the Kullback-Leibler divergence of the predictive distribution is an analytical generalization of a previous finite difference method (Paananen et al., 2019). We also show why a similar finite difference equivalent is not viable for the cross-derivative of the divergence measure that we use to identify interacting variable pairs.

We demonstrate the effectiveness of the proposed methods by applying Gaussian process models to simulated and real benchmark datasets with both nonlinearities and interactions. Based on the empirical evaluation, our methods are competitive with alternative methods in terms of both computational cost and effectiveness in identifying the relevant main effects and interactions.

2 Differentiating Probability Dissimilarity Measures

After observing data $\mathcal{D} = (\mathbf{y}, X)$ consisting of a vector of outputs $\mathbf{y}$ and a matrix of inputs $X$, let us denote the posterior predictive distribution of our model $\mathcal{M}$ at some input point $\mathbf{x}^*$ as

$$p(\tilde{y} \mid \mathbf{x}^*, \mathcal{D}, \mathcal{M}) = p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}^*)).$$

Here, $\boldsymbol{\theta}(\mathbf{x}^*)$ is the vector of parameters of the distribution that depends on $\mathbf{x}^*$, $\mathcal{D}$, and $\mathcal{M}$. To keep the notation concise, we denote the posterior predictive distribution at input $\mathbf{x}^*$ simply as $p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}^*))$. Here, we assume that all input variables are centered and normalized.

In order to assess the predictive relevance of input variables, we are interested in estimating how much the predictions of a model change when the value of one input variable changes. For models that are linear with respect to the input variables, the regression coefficient of each variable is a global measure of relevance. For more flexible models, the absolute value of the derivative of the mean prediction is an analogous measure of relevance locally. In this section, we derive a local relevance measure that takes into account changes in both the mean prediction and its uncertainty through the posterior predictive distribution. It is a natural way to include uncertainty about the model parameters into the task of identifying relevant variables and interactions. Even for simple models such as a linear regression model, the variance of the predictive distribution is not constant when there is uncertainty about the model parameters.

The change in the predictive distribution due to perturbations of an input variable can be formalized as the derivative of a dissimilarity measure between two predictive distributions with respect to the input variable. Contrary to simply assessing the derivative of the mean prediction, this approach takes changes in predictive uncertainty into account in a principled manner. Our dissimilarity measure of choice is the Kullback-Leibler divergence, but other measures could be used as well. Mathematically, we are thus evaluating the partial derivatives of the Kullback-Leibler divergence from the predictive distribution at $\mathbf{x}^*$ to the predictive distribution at $\mathbf{x}'$ with respect to the $i$'th input variable $x_i'$ when the points $\mathbf{x}^*$ and $\mathbf{x}'$ coincide and the two predictive distributions are equal. Because the Kullback-Leibler divergence has a global minimum of zero when $\mathbf{x}' = \mathbf{x}^*$, the first derivative with respect to input variable $x_i'$ is zero. We thus use the square root of the second derivative to measure the local sensitivity of the predictive distribution to changes in an input variable.

The square root of the Kullback-Leibler divergence is a natural scale to use for two reasons. First, it brings the derivative to the same scale as the absolute derivative of the mean prediction, which is often used as a measure of local relevance (Ruck et al., 1990; Dorizzi, 1996; Refenes and Zapranis, 1999). For example, for a model with a Gaussian predictive distribution, the square root of the derivative of the divergence measure is proportional to the absolute derivative of the mean prediction instead of its square. Second, when the two predictive distributions are equal, the derivatives of the Kullback-Leibler divergence are equal to the derivatives of the Jensen-Shannon divergence, the square root of which defines a metric on the space of probability distributions. A similar scaling of the Kullback-Leibler divergence has been used by Simpson et al. (2017) for constructing prior distributions.

Based on the predictive distribution of our model $\mathcal{M}$, we define the KL-diff local measure of the predictive relevance of a single variable $x_i$ at the input point $\mathbf{x}^*$ as

$$\text{KL-diff}_i(\mathbf{x}^*) = \sqrt{\left.\frac{\partial^2}{\partial x_i'^{\,2}}\,\mathrm{KL}\!\left[\,p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}^*))\,\middle\|\,p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}'))\,\right]\right|_{\mathbf{x}'=\mathbf{x}^*}} = \sqrt{\sum_{k,l} \frac{\partial \theta_k}{\partial x_i^*}\,\mathcal{I}_{kl}(\boldsymbol{\theta})\,\frac{\partial \theta_l}{\partial x_i^*}}, \qquad (1)$$

where $\mathrm{KL}[\,p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}^*))\,\|\,p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}'))\,]$ denotes the Kullback-Leibler divergence from the predictive distribution $p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}^*))$ to $p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}'))$, and $\boldsymbol{\theta}(\mathbf{x}')$ are the parameters of $p(\tilde{y} \mid \boldsymbol{\theta}(\mathbf{x}'))$.

The KL-diff measure in equation (1) has two kinds of partial derivatives: (i) second derivatives of the Kullback-Leibler divergence with respect to the parameters of the predictive distribution, and (ii) first derivatives of the parameters with respect to the input variable $x_i$. These are obtained as follows: (i) The curvature of the Kullback-Leibler divergence is represented by the Fisher information. Thus, for a probability distribution parameterized by $\boldsymbol{\theta}$, the second partial derivatives of the Kullback-Leibler divergence with respect to each $\theta_k$ are given by the Fisher information matrix $\mathcal{I}(\boldsymbol{\theta})$ of the distribution. (ii) The partial derivatives of the parameters with respect to input variable $x_i$ depend on the specific choice of model. Thus, the only requirements for using the proposed method are that the predictive distribution of the model must be available analytically (either the exact distribution or an approximation) and that the derivatives (ii) must be available. Fortunately, there are a large number of models where these conditions are fulfilled. In the supplementary material, we derive the terms (i) and (ii) for Gaussian process models with different likelihoods.
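
To make the structure of equation (1) concrete, the following is a minimal sketch of the computation, assuming the model supplies the Fisher information matrix of its predictive distribution and the derivatives of its parameters; all names are illustrative rather than taken from any published implementation.

```python
import numpy as np

def kl_diff(dtheta_dxi, fisher_info):
    """Local KL-diff relevance of one input variable at one input point.

    dtheta_dxi  : (K,) array, derivatives of the K predictive parameters
                  with respect to input variable x_i at the evaluation point.
    fisher_info : (K, K) Fisher information matrix of the predictive
                  distribution, evaluated at the same point.
    """
    # Chain rule: the second derivative of the KL divergence w.r.t. x_i
    # equals dtheta' I(theta) dtheta; take the square root for the measure.
    return np.sqrt(dtheta_dxi @ fisher_info @ dtheta_dxi)

# Example: a Gaussian predictive distribution with parameters (mu, sigma);
# the Fisher information of N(mu, sigma^2) is diag(1/sigma^2, 2/sigma^2).
sigma = 0.5
I = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
print(kl_diff(np.array([0.8, 0.1]), I))  # local relevance at this point
```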

2.1 Illustrative Example

To gain intuition about the role of the different components in equation (1), we will analyze a Bayesian linear regression model as a toy example. For a Gaussian likelihood, the observations of the target variable are modelled as

$$y = \mathbf{x}\boldsymbol{\beta} + \varepsilon, \qquad \varepsilon \sim \mathrm{N}(0, \sigma^2),$$

where $\boldsymbol{\beta}$ are the regression coefficients and $\sigma^2$ is the noise variance. If the prior is an improper uniform prior on $(\boldsymbol{\beta}, \log \sigma)$, integrating over the uncertainty about all parameters makes the posterior predictive distribution at a new point $\mathbf{x}^*$ (treated as a row vector) a Student-$t$ distribution,

$$p(\tilde{y} \mid \mathbf{x}^*, \mathcal{D}) = t_{\nu}\!\left(\mathbf{x}^*\hat{\boldsymbol{\beta}},\; s^2\left(1 + \mathbf{x}^*(X^\top X)^{-1}\mathbf{x}^{*\top}\right)\right).$$

Here, $\nu = n - p$ represents the degrees of freedom, and $n$ and $p$ are the number of observations and input variables, respectively. $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top \mathbf{y}$ are the maximum likelihood estimates of the regression coefficients, and $s^2$ is the residual estimate of the noise variance. Now, the derivatives of the three parameters $(\nu, \mu, \sigma)$ of the predictive distribution with respect to the input variable $x_i^*$ are

$$\frac{\partial \nu}{\partial x_i^*} = 0, \qquad \frac{\partial \mu(\mathbf{x}^*)}{\partial x_i^*} = \hat{\beta}_i, \qquad \frac{\partial \sigma(\mathbf{x}^*)}{\partial x_i^*} = \frac{s^2 \left[(X^\top X)^{-1}\mathbf{x}^{*\top}\right]_i}{\sigma(\mathbf{x}^*)},$$

where $\mu(\mathbf{x}^*) = \mathbf{x}^*\hat{\boldsymbol{\beta}}$ and $\sigma(\mathbf{x}^*)^2 = s^2(1 + \mathbf{x}^*(X^\top X)^{-1}\mathbf{x}^{*\top})$ denote the location and scale of the predictive distribution. The nonzero second derivatives of the Kullback-Leibler divergence are given by the Fisher information matrix of the Student-$t$ distribution:

$$\mathcal{I}_{\mu\mu} = \frac{\nu + 1}{(\nu + 3)\,\sigma^2}, \qquad \mathcal{I}_{\sigma\sigma} = \frac{2\nu}{(\nu + 3)\,\sigma^2}.$$

The KL-diff local measure of the predictive relevance of variable $x_i$ from equation (1) thus evaluates to

$$\text{KL-diff}_i(\mathbf{x}^*) = \sqrt{\frac{\nu + 1}{\nu + 3}\,\frac{\hat{\beta}_i^2}{\sigma(\mathbf{x}^*)^2} + \frac{2\nu}{\nu + 3}\,\frac{1}{\sigma(\mathbf{x}^*)^2}\left(\frac{\partial \sigma(\mathbf{x}^*)}{\partial x_i^*}\right)^2}. \qquad (2)$$

The two terms have the following roles in the local relevance measure of input variable $x_i$ at a point $\mathbf{x}^*$. In the absence of the second term, the measure would be proportional to the absolute value of the maximum likelihood estimate of the regression coefficient, $|\hat{\beta}_i|$, divided by the standard deviation of the predictive distribution. The first term thus measures the absolute derivative of the mean prediction, but input points with more uncertainty are given less weight. The second term in equation (2) also quantifies the amount of uncertainty in the posterior predictive distribution, but in a different way. Even if $\hat{\beta}_i$ for some variable were exactly zero, the second term is nonzero as long as there is uncertainty about the model parameters that causes the predictive uncertainty to vary in the input space.
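
As a hedged numerical sketch of equation (2), the following simulates a small regression data set and evaluates both terms; the data, random seed, and helper names are illustrative assumptions, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 2
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
nu = n - p                                   # degrees of freedom
s2 = np.sum((y - X @ beta_hat) ** 2) / nu    # residual noise variance estimate

def kl_diff_linreg(x_star, i):
    """Equation (2): local relevance of variable i at row vector x_star."""
    sigma = np.sqrt(s2 * (1.0 + x_star @ XtX_inv @ x_star))
    dsigma_dxi = s2 * (XtX_inv @ x_star)[i] / sigma   # d sigma / d x_i
    term1 = (nu + 1) / (nu + 3) * beta_hat[i] ** 2 / sigma**2
    term2 = 2 * nu / (nu + 3) * dsigma_dxi**2 / sigma**2
    return np.sqrt(term1 + term2)

print([kl_diff_linreg(X[0], i) for i in range(p)])
```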

To visualize the roles of the two terms in equation (2), we simulated 10 observations from a linear model with two input variables $x_1$ and $x_2$ whose true regression coefficients are $\beta_1 = 1$ and $\beta_2 = 0$. The input variables are independent and normally distributed with zero mean and standard deviation one, and Gaussian noise is added to the observations. Based on the posterior predictive distribution, the relevances for both variables given by equation (2) are shown in Figure 1 as a function of the corresponding input variable.

Figure 1: For a simulated linear model with true regression coefficients 1 and 0, the local relevance measure given by equation (2) as a function of the variables. Red represents the relevant variable ($x_1$) and blue depicts the irrelevant variable ($x_2$). The dashed and dotted lines show the contributions of the two terms in equation (2).

The red lines show the contribution of the terms for the relevant variable $x_1$, where we see that observations at the edges of the data are given less relevance, and that the second term contributes very little. The blue lines represent the relevance values of the irrelevant variable, where the first term is now very small because the estimate of the regression coefficient, $\hat{\beta}_2$, is small. In this case, the second term in equation (2) increases the relevance measure because there is still a significant amount of uncertainty about the model parameters. The KL-diff method is thus conservative in the sense that it amplifies the relevance measure for variables that seem irrelevant when there is still a lot of uncertainty.

When the number of observations increases, the posterior becomes more concentrated, the contribution of the second term begins to vanish, and the relevance measures of both variables approach the constant $|\beta_i|/\sigma$. Note that the KL-diff measure is not attempting to estimate the values of the regression coefficients, but rather to compare the input variables to each other. Using a different prior distribution on the regression coefficients would change the predictive distribution and thus also the KL-diff relevance measure. We leave investigating the influence of the prior as future research.

For other likelihoods or models, the different terms in equation (1) may not have such clear interpretations as with the linear regression model above. For example, in a binary classification task, the posterior predictive distribution can be considered a Bernoulli distribution, which has only a single parameter. Nevertheless, the principle of taking into account the uncertainty about the model parameters still holds.

As the number of observations approaches infinity, if the model posterior concentrates to a point, the predictive uncertainty in the linear regression model approaches a non-zero constant. Thus, in the limit, the KL-diff local relevance measure for the linear model is proportional to $|\hat{\beta}_i|$ irrespective of the point of evaluation. In the supplementary material, we show that the regression coefficient plays a significant role also in the asymptotic results for other generalized linear models. Equation (1) is thus a very natural way of evaluating the predictive relevance of input variables that also generalizes to nonlinear and nonparametric models.

2.2 Identifying Pairwise Interactions

Efficiently and accurately identifying interaction effects between input variables is important for many applications. Based on a complex model, the identified interactions can, for example, be used to build a more interpretable surrogate model. Because the number of variable pairs grows quadratically with the number of variables, it is crucial to be able to identify the interactions from a single fitted model, without explicitly fitting separate models for each candidate pair.

Just as the partial derivatives of a model's mean prediction with respect to one input variable can be used as a local measure of the relevance of that variable, the cross-derivatives with respect to two input variables measure the strength of their joint interaction effect. For example, Cui et al. (2019) use the cross-derivatives of a neural network's prediction directly, while Friedman et al. (2008) estimate the interactions implicitly. By computing the cross-derivatives of the Kullback-Leibler divergence of predictive distributions, we construct an analogous measure of interaction strength that takes the predictive uncertainty into account.

We define the KL-diff$^2$ local measure for the strength of the interaction between variables $x_i$ and $x_j$ as

$$\text{KL-diff}^2_{ij}(\mathbf{x}^*) = \sqrt{\sum_{k,l} \frac{\partial^2 \theta_k}{\partial x_i^* \partial x_j^*}\,\mathcal{I}_{kl}(\boldsymbol{\theta})\,\frac{\partial^2 \theta_l}{\partial x_i^* \partial x_j^*}}, \qquad (3)$$

where we omit higher order derivatives and non-cross-derivative terms. This measure has the same form as the KL-diff measure, but the partial derivatives of the parameters of the predictive distribution are now cross-derivatives. The interpretation of the terms is similar to KL-diff, and the second derivatives of the Kullback-Leibler divergence similarly weight the local measures based on the predictive uncertainty.
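
The sketch from Section 2 carries over directly, under the same illustrative assumptions: only the parameter derivatives change, from first derivatives to cross-derivatives.

```python
import numpy as np

def kl_diff2(d2theta_dxidxj, fisher_info):
    """Local interaction strength of variables (i, j) as in equation (3).

    d2theta_dxidxj : (K,) array of cross-derivatives d^2 theta_k / (dx_i dx_j).
    fisher_info    : (K, K) Fisher information of the predictive distribution.
    """
    return np.sqrt(d2theta_dxidxj @ fisher_info @ d2theta_dxidxj)
```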

2.3 Global Measures of Predictive Relevance

For assessing the global predictive relevance of input variables or pairs of variables, it is natural to integrate the local relevance measures KL-diff and KL-diff$^2$ over the probability distribution of the inputs. As an approximation of this we use the average over all of the training points. Using the global measures, we can rank the input variables or variable pairs, as sketched below. These rankings can be used to directly select a smaller set of variables and interactions to use for a surrogate model, or they can be used as a starting point for another variable selection method.
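
A minimal sketch of the global ranking, assuming a hypothetical local_relevance(x, i) helper that evaluates KL-diff or KL-diff$^2$ at one training point:

```python
import numpy as np

def rank_variables(X_train, local_relevance):
    # relevances[n, i] = local measure of variable i at training point n
    relevances = np.array(
        [[local_relevance(x, i) for i in range(X_train.shape[1])]
         for x in X_train])
    global_relevance = relevances.mean(axis=0)  # average over training points
    return np.argsort(global_relevance)[::-1]   # most relevant first
```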

When identifying the relevance of interactions, one could also evaluate the local relevance measures at different permutations of the training inputs, as done by Friedman et al. (2008) and Greenwell et al. (2018). However, this may be inefficient and inaccurate if the input variables are not independent. It is also possible to assess relevance in a particular subregion of the input space or outside the training data.

2.4 Applicability

The requirements for using the proposed KL-diff and KL-diff$^2$ methods are that we must have an analytical representation of the predictive distribution conditioned on the input variables, and that the derivatives of its parameters with respect to the inputs must be available. For the latter, computational tools such as automatic differentiation can be used (Baydin et al., 2018). There are no restrictions on the support of the predictive distribution, and the methods are thus applicable to many learning tasks.

Because the proposed methods measure the relevance of input variables locally, they are most useful for nonlinear and complex models. For example, Gaussian process models are suitable because they can represent flexible functions with nonlinearities and interactions, but still have good uncertainty quantification (O’Hagan, 1978; MacKay, 1998; Neal, 1998; Rasmussen and Williams, 2006). In the supplementary material we show the derivatives required by the KL-diff and KL-diff$^2$ methods for Gaussian processes and commonly used likelihoods.

Paananen et al. (2019) use a finite-difference-like method that evaluates the Kullback-Leibler divergence of predictive distributions when the input variables are perturbed. In the supplementary material, we show that their method can be derived as a finite difference approximation to the KL-diff measure. Using KL-diff avoids the numerical errors associated with finite differences and removes the need to select a perturbation size. For an appropriately chosen perturbation, the two formulations produce practically identical results. The finite difference approximation is more general, however, as it requires only an analytical representation of the predictive distribution, not the derivatives of its parameters with respect to the inputs.

For the KL-diff$^2$ method, it is not possible to derive an equivalent finite difference approximation, because no single Kullback-Leibler divergence between two perturbed predictive distributions isolates the cross-derivative term. Riihimäki et al. (2010) perturb two input variables at a time with a unit length perturbation and measure the change in predictions by the Kullback-Leibler divergence. For an infinitesimal perturbation this would be equivalent to a directional derivative rather than the cross-derivative used in the KL-diff$^2$ method.

Due to their generality, the KL-diff and KL-diff$^2$ methods could in principle be applied to any binary classifier where the classification probability is available and gradients are computable, such as a neural network. However, because the methods weight observations based on their classification probability, the results may be misleading if the model lacks proper uncertainty quantification. We thus recommend using the methods for probabilistic models where the uncertainty is properly taken into consideration. Due to the conservative property discussed in Section 2.1, they are most useful when the number of observations is relatively small and there is a lot of uncertainty about the parameters of the model.

The added computational expense of the KL-diff and KL-diff$^2$ methods compared to just differentiating the mean prediction depends on the model used. For many models, the cost is insignificant compared to the cost of inference.

3 Experiments

In this section, we demonstrate the variable relevance assessment methods discussed in Section 2 on both simulated and real data. The focus of the experiments is on assessing the performance of the pairwise interaction method KL-diff$^2$. In the supplement, we also analyze the accuracy of the KL method of Paananen et al. (2019), which uses a finite difference approximation, and show that for a well-chosen perturbation size it is practically equivalent to the KL-diff method for individual variables.

In all examples, we use Gaussian process models for demonstrating the methods and comparing them to alternatives. To be fully Bayesian, one should integrate over the posterior distribution of the hyperparameters of the Gaussian process model. Here, we instead optimize the hyperparameters by maximizing the marginal likelihood, which enables representing the predictive distribution in analytical form. This is a commonly used approach and works well when the number of observations is much larger than the number of features. We use the squared exponential covariance function (supplementary equation (5)), which generates smooth functions and can model interactions between all input variables.

We compare the KL-diff single variable relevance assessment method to two alternative methods. The first method is automatic relevance determination (ARD) (MacKay, 1994; Neal, 1995; Williams and Rasmussen, 1996; Seeger, 2000). This method is a model-based approach that ranks the variables based on their individual length-scale hyperparameters such that a variable with a short length-scale is assigned high relevance. The second baseline method is “RelATive cEntrality” (RATE), which defines a measure analogous to the regression coefficient that applies to nonlinear models (Crawford et al., 2019).

We compare the interaction relevance KL-diff$^2$ method to two alternative methods that rely on local relevance measures. The first method is the $H$-statistic of Friedman et al. (2008), which makes predictions at permutations of the training points for each variable pair. The second method is from Greenwell et al. (2018) and computes the standard deviation of partial dependence functions to evaluate the strength of interactions. We follow the recommendation of the authors to evaluate the partial dependence functions at the pairwise deciles of the two input variables. Both methods thus need a much larger number of predictions than KL-diff$^2$, which uses only the predictions at the training points for each interaction.

3.1 Simulated Individual and Pairwise Effects

In the first experiment we consider a toy example with 12 input variables, each independently normally distributed with zero mean and equal standard deviations. The target variable is constructed as a sum of 8 additive terms depending on a single input variable and 3 terms depending on two input variables. The terms depending on a single variable have a range of linear and nonlinear dependencies on the inputs, and the interaction terms are products of two input variables. The model is as follows:

(4)

where the three interacting pairs are chosen such that in one pair both variables also have main effects, in one pair only one variable does, and in one pair neither does. The scaling factors are chosen such that the expected variance of each of the terms is equal. As most inputs fall in a bounded interval around zero, the frequencies of the sinusoidal terms are chosen such that one is almost linear over this interval and another has one full period in it.

We generated 400 observations and modelled the data using a Gaussian process with a Gaussian observation model. The global single variable and interaction relevance values of all 12 input variables were computed by averaging over the predictions at the training inputs. Based on 20 random simulations of the data, the means of the single variable relevances are shown in Figure 2 together with uncertainty intervals of the means. Variables 1 to 8 have equally strong main effects in the data, whereas variables 9 to 12 do not have a main effect. Three variable pairs have interaction effects, and only variable 9 is completely irrelevant.

Figure 2 shows that all three methods correctly identify variable 9 as irrelevant. Looking at variables 2, 3, 5, 7, and 8, we see that ARD favours the nonlinear main effects over the linear effects, and KL-diff does as well, but to a much smaller extent. This was also reported by Paananen et al. (2019). Both methods also give higher relevance values when a variable has both a main effect and an interaction. For RATE, the variance between the different training data sets is very large, leading to highly variable rankings in different data sets. For variables 10 to 12, which have interaction effects but no main effects, RATE gives almost zero relevance, but ARD and KL-diff consistently identify them as relevant.

Figure 2: Single variable relevance values of all 12 input variables in the simulated model from equation (4). The number of observations is 400. In each simulation, the relevance values are scaled so that the maximum relevance given by each method is one, and the error bars represent uncertainty intervals of the means.

The averages and uncertainty intervals of the pairwise relevance values are shown in Figure 3. The markers with a green background depict the three pairs that have a true interaction effect in the data. Black represents the KL-diff$^2$ method, orange depicts the $H$-statistic (HS), and purple depicts the partial dependence variation method (PD) of Greenwell et al. (2018). The figure shows that all three methods consistently separate the true interactions from the non-interacting pairs of variables, but KL-diff$^2$ is the only method that gives them roughly equal relevance values. The $H$-statistic exaggerates the strength of the third interacting pair, where neither variable has a main effect, which was also noted by Greenwell et al. (2018). Conversely, the PD method over-emphasizes the strength of the leftmost pair, both of whose variables have main effects in addition to the interaction effect. All methods are fairly consistent between different simulated data sets.

Figure 3: The relevance values of each pair of input variables in the simulated model from equation (4). Markers with a green background depict variable pairs with a real pairwise interaction effect in the data. The error bars represent uncertainty intervals for the mean from 20 simulated data sets. In each simulation, the relevance values are scaled so that the maximum relevance given by each method is one.

In this simulated example, all the input variables are independent. Because both HS and PD evaluate predictions at different permutations of the training data, they may not work reliably when the input variables are not independent. This issue does not concern the KL-diff$^2$ method, as it is evaluated only at the actual training data.

In this example, we evaluated the relevance of all variable pairs. However, if the number of variables is large, computational resources can be saved by first evaluating the individual relevances using the KL-diff method, and then evaluating only the pairs of the most relevant variables. As Figure 2 shows, the KL-diff method identifies variables 10 to 12 as relevant when they have an interaction effect even though they do not have a main effect.

To study how many observations the different methods require to reliably detect the true interactions in the data, we generate data from the same model with varying numbers of observations. In Figure 4, we plot the relevance values of six variable pairs averaged over 50 simulations. The solid lines represent pairs with a true interaction, and the dotted lines are pairs without an interaction. In the left plot, both variables in the pairs have a main effect. In the middle plot, only one variable in the pairs has a main effect, and in the right plot neither variable has a main effect. For each of the 50 simulations, the interaction relevance values are scaled so that the maximum given by each method is one. Thus, the ideal value is 1 for the solid lines and 0 for the dotted lines.

Figure 4: The interaction relevance values given by the KL-diff, HS, and PD methods to six variable pairs for data sets with different numbers of observations. The solid lines represent pairs with a true interaction, and dotted lines are pairs without an interaction. In the left plot, both variables in the pairs have a main effect. In the middle plot, only one variable in the pairs has a main effect, and in the right plot neither variable has a main effect. In all plots, the ideal values would be 1 and 0 for the solid and dotted lines, respectively. The error bars represent uncertainty intervals for the mean from 50 simulated data sets.

Figure 4 shows that when the number of observations is small, all methods perform roughly equally. However, as the number of observations increases, KL-diff$^2$ is the only method that correctly identifies that the relevances of the three true interactions (solid lines) are equal. The HS method over-emphasizes the variable pair where neither variable has a main effect (right plot), whereas the PD method over-emphasizes the variable pair where both variables have a main effect (left plot).

3.2 Benchmark Data Set

In this experiment we evaluate how well the KL-diff$^2$ method identifies interaction effects from a real data set. We use the Bike sharing data set from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.html), where the target variable is the hourly number of bike uses from a bicycle rental system. There are 6 input variables representing weather and date: feeling temperature, humidity, wind speed, hour, weekday, and a binary variable indicating whether the day is a working day or not. We only use observations from February, which results in a total of 1339 hourly observations from 2 years. We model the problem using a Gaussian process model with a Poisson likelihood for the number of hourly bike uses. The derivatives needed by KL-diff$^2$ for this model are presented in the supplementary material.

When evaluating the pairwise interactions using the full data, KL-diff$^2$ identifies a strong interaction between hour and feeling temperature as well as between hour and weekday. To evaluate the plausibility of the interactions identified by the different methods, we compare the out-of-sample predictive performance of models with explicit interaction terms chosen based on the identified interactions. We compare the performance of the models using cross-validation with 10 random splits into training and test sets of sizes 500 and 839, with log predictive density as the utility function. For each training set, the Gaussian process model with full interactions is fitted, and the pairwise interactions are identified with KL-diff$^2$, HS, and PD. Based on these, a model with 1, 2, or 3 interaction terms is fitted again, and its predictive performance is evaluated on the test data. For comparison, we also evaluate the model with no interactions. The mean log predictive densities (MLPDs) across the different test splits as well as uncertainty intervals of the means are shown in Figure 5.

Figure 5: Mean log predictive densities (MLPDs) on independent test sets from the Bike sharing data for Gaussian process models with different numbers of interactions. With each method, the interactions were identified from each training data set using the model with all interactions. The error bars represent uncertainty intervals for the means from 10 different train-test splits.

The figure shows that modelling only the three strongest pairwise interactions increases the out-of-sample predictive performance to the level of the model with all interactions. The KL-diff$^2$ method identifies more relevant interactions than the competing methods, as seen in the better predictive performance of the models with 1 to 3 interactions.

4 Conclusions

In this work we proposed a method for assessing the predictive relevance of individual input variables and interacting variable pairs by using local derivatives of the Kullback-Leibler divergence between predictive distributions. Using both simulated and real datasets, we demonstrated that the method can reliably identify main effects as well as interactions in nonlinear models for complex datasets. By utilizing the predictive uncertainty of the model, the method matches the performance of alternative methods that use only the mean prediction while requiring fewer predictions. We also showed that the KL-diff measure can be used to pre-select the variable pairs whose interactions are then evaluated with KL-diff$^2$, avoiding an exhaustive assessment of all pairs.

We only considered the Kullback-Leibler divergence and showed that it has a conservative property that inflates small local relevance estimates when there is a lot of uncertainty in the model's predictions. We leave the investigation of other divergence measures as future research. Another possible future direction is searching for interactions between groups of observations, which would be useful for building multilevel models.

5 Acknowledgements

We thank Alejandro Catalina and Kunal Ghosh for helpful comments that improved the manuscript.

References

  • Baydin et al. (2018) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153).
  • Crawford et al. (2019) Crawford, L., Flaxman, S. R., Runcie, D. E., West, M., et al. (2019). Variable prioritization in nonlinear black box methods: A genetic association case study. The Annals of Applied Statistics, 13(2):958–989.
  • Cui et al. (2019) Cui, T., Marttinen, P., and Kaski, S. (2019). Recovering pairwise interactions using neural networks. arXiv preprint arXiv:1901.08361.
  • Czernichow (1996) Czernichow, T. (1996). Architecture selection through statistical sensitivity analysis. In International Conference on Artificial Neural Networks, pages 179–184. Springer.
  • Dorizzi (1996) Dorizzi, B. (1996). Variable selection using generalized RBF networks: Application to the forecast of the French T-bonds. Proceedings of IEEE-IMACS’96, Lille, France.
  • Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.
  • Friedman et al. (2008) Friedman, J. H., Popescu, B. E., et al. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954.
  • Gelman and Pardoe (2007) Gelman, A. and Pardoe, I. (2007). Average predictive comparisons for models with nonlinearity, interactions, and variance components. Sociological Methodology, 37(1):23–51.
  • Goldstein et al. (2015) Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65.
  • Greenwell et al. (2018) Greenwell, B. M., Boehmke, B. C., and McCarthy, A. J. (2018). A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755.
  • Guyon et al. (2002) Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422.
  • Leray and Gallinari (1999) Leray, P. and Gallinari, P. (1999). Feature selection with neural networks. Behaviormetrika, 26(1):145–166.
  • MacKay (1994) MacKay, D. J. (1994). Bayesian nonlinear modeling for the prediction competition. ASHRAE transactions, 100(2):1053–1062.
  • MacKay (1998) MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, J., editor, Neural Networks and Machine Learning, pages 133–166. Springer Verlag.
  • Minka (2001) Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology.
  • Neal (1998) Neal, R. (1998). Regression and classification using Gaussian process priors (with discussion). In Bernardo, J., Berger, J., Dawid, A., and Smith, A., editors, Bayesian Statistics, volume 6, pages 475–501. Oxford University Press.
  • Neal (1995) Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.
  • O’Hagan (1978) O’Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society: Series B (Methodological), 40(1):1–24.
  • Opper and Winther (2000) Opper, M. and Winther, O. (2000). Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655–2684.
  • Paananen et al. (2019) Paananen, T., Piironen, J., Andersen, M. R., and Vehtari, A. (2019). Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution. volume 89 of Proceedings of Machine Learning Research, pages 1743–1752.
  • Rasmussen (2003) Rasmussen, C. (2003). Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. Bayesian Statistics, 7:651–659.
  • Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA.
  • Refenes and Zapranis (1999) Refenes, A.-P. and Zapranis, A. (1999). Neural model identification, variable selection and model adequacy. Journal of Forecasting, 18(5):299–332.
  • Riihimäki et al. (2010) Riihimäki, J., Sund, R., and Vehtari, A. (2010). Analysing the length of care episode after hip fracture: a nonparametric and a parametric Bayesian approach. Health care management science, 13(2):170–181.
  • Ruck et al. (1990) Ruck, D. W., Rogers, S. K., and Kabrisky, M. (1990). Feature selection using a multilayer perceptron. Journal of Neural Network Computing, 2(2):40–48.
  • Seeger (2000) Seeger, M. (2000). Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in neural information processing systems, pages 603–609.
  • Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., Sørbye, S. H., et al. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28.
  • Solak et al. (2003) Solak, E., Murray-Smith, R., Leithead, W. E., Leith, D. J., and Rasmussen, C. E. (2003). Derivative observations in Gaussian process models of dynamic systems. In Advances in neural information processing systems, pages 1057–1064.
  • Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3319–3328. JMLR.org.
  • Williams and Barber (1998) Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.
  • Williams and Rasmussen (1996) Williams, C. K. and Rasmussen, C. E. (1996). Gaussian processes for regression. In Advances in neural information processing systems, pages 514–520.


Supplementary Material

Finite Difference Approximation of the Kullback-Leibler Divergence

Consider two probability distributions $p_{\boldsymbol{\theta}}$ and $p_{\boldsymbol{\theta}'}$, parameterized by vectors $\boldsymbol{\theta}$ and $\boldsymbol{\theta}'$, respectively. Keeping $\boldsymbol{\theta}$ constant, let us make a second-order approximation of the Kullback-Leibler divergence between the distributions in the neighbourhood around $\boldsymbol{\theta}' = \boldsymbol{\theta}$:

$$\mathrm{KL}[\,p_{\boldsymbol{\theta}} \,\|\, p_{\boldsymbol{\theta}'}\,] \approx \mathrm{KL}[\,p_{\boldsymbol{\theta}} \,\|\, p_{\boldsymbol{\theta}}\,] + (\boldsymbol{\theta}' - \boldsymbol{\theta})^\top \nabla_{\boldsymbol{\theta}'}\, \mathrm{KL}[\,p_{\boldsymbol{\theta}} \,\|\, p_{\boldsymbol{\theta}'}\,]\big|_{\boldsymbol{\theta}'=\boldsymbol{\theta}} + \tfrac{1}{2}\, (\boldsymbol{\theta}' - \boldsymbol{\theta})^\top \mathcal{I}(\boldsymbol{\theta})\, (\boldsymbol{\theta}' - \boldsymbol{\theta}) + \mathcal{O}\!\left(\|\boldsymbol{\theta}' - \boldsymbol{\theta}\|^3\right).$$

The first two terms are zero, because the Kullback-Leibler divergence obtains its minimum value of zero at $\boldsymbol{\theta}' = \boldsymbol{\theta}$. Dropping them and the third degree term, we are left with the approximation

$$\mathrm{KL}[\,p_{\boldsymbol{\theta}} \,\|\, p_{\boldsymbol{\theta}'}\,] \approx \tfrac{1}{2}\, (\boldsymbol{\theta}' - \boldsymbol{\theta})^\top \mathcal{I}(\boldsymbol{\theta})\, (\boldsymbol{\theta}' - \boldsymbol{\theta}).$$

If the distributions $p_{\boldsymbol{\theta}}$ and $p_{\boldsymbol{\theta}'}$ are predictive distributions, then the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\theta}'$ depend on the input point $\mathbf{x}$, i.e. $\boldsymbol{\theta} = \boldsymbol{\theta}(\mathbf{x})$. When only one input variable, $x_i$, is varied by a small amount $\Delta x_i$, an infinitesimal change in the parameters can be written as

$$\boldsymbol{\theta}' - \boldsymbol{\theta} \approx \frac{\partial \boldsymbol{\theta}(\mathbf{x})}{\partial x_i}\, \Delta x_i.$$

Thus we get

$$\mathrm{KL}\!\left[\,p_{\boldsymbol{\theta}(\mathbf{x})} \,\middle\|\, p_{\boldsymbol{\theta}(\mathbf{x} + \Delta x_i \mathbf{e}_i)}\,\right] \approx \tfrac{1}{2}\, \Delta x_i^2\, \frac{\partial \boldsymbol{\theta}}{\partial x_i}^{\!\top} \mathcal{I}(\boldsymbol{\theta})\, \frac{\partial \boldsymbol{\theta}}{\partial x_i}.$$

Rearranging the terms gives the approximate equivalence

$$\frac{2\, \mathrm{KL}\!\left[\,p_{\boldsymbol{\theta}(\mathbf{x})} \,\middle\|\, p_{\boldsymbol{\theta}(\mathbf{x} + \Delta x_i \mathbf{e}_i)}\,\right]}{\Delta x_i^2} \approx \frac{\partial \boldsymbol{\theta}}{\partial x_i}^{\!\top} \mathcal{I}(\boldsymbol{\theta})\, \frac{\partial \boldsymbol{\theta}}{\partial x_i}.$$

Based on the chain rule of differentiation, the right-hand side is equal to

$$\frac{\partial^2}{\partial x_i'^{\,2}}\, \mathrm{KL}\!\left[\,p_{\boldsymbol{\theta}(\mathbf{x})} \,\middle\|\, p_{\boldsymbol{\theta}(\mathbf{x}')}\,\right]\Big|_{\mathbf{x}'=\mathbf{x}}.$$

Finally, taking the square root gives the approximate equivalence

$$\frac{\sqrt{2\, \mathrm{KL}\!\left[\,p_{\boldsymbol{\theta}(\mathbf{x})} \,\middle\|\, p_{\boldsymbol{\theta}(\mathbf{x} + \Delta x_i \mathbf{e}_i)}\,\right]}}{|\Delta x_i|} \approx \text{KL-diff}_i(\mathbf{x}),$$

where the left hand side is the finite difference KL method of Paananen et al. (2019) and the right hand side is the KL-diff measure.
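
The equivalence can be checked numerically. The following sketch uses an assumed toy Gaussian predictive distribution with hand-picked mean and scale functions; the closed-form Gaussian Kullback-Leibler divergence is standard.

```python
import numpy as np

# Toy predictive distribution p(y|x) = N(mu(x), sigma(x)^2) and its derivatives.
mu = lambda x: 2.0 * x
sigma = lambda x: 1.0 + 0.5 * x**2
dmu = lambda x: 2.0
dsigma = lambda x: x

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

x, h = 0.7, 1e-4
finite_diff = np.sqrt(2 * kl_gauss(mu(x), sigma(x), mu(x + h), sigma(x + h))) / h
# Analytic KL-diff from the Fisher information of N(mu, sigma^2).
analytic = np.sqrt(dmu(x) ** 2 / sigma(x) ** 2 + 2 * dsigma(x) ** 2 / sigma(x) ** 2)
print(finite_diff, analytic)  # agree to several decimals for small h
```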

Asymptotic Results for Generalized Linear Models

In Section 2.1 in the main paper, we discussed that for a Bayesian linear regression model, the predictive distribution becomes constant as a function of the input variables when the number of observations goes to infinity. In this section we analyze the asymptotic results of the logistic regression model.

Logistic Regression Model

The predictive distribution of a logistic regression model is a Bernoulli distribution. In the asymptotic limit, the posterior of the regression coefficients concentrates to a point $\hat{\boldsymbol{\beta}}$, and the “success probability” parameter as a function of the input variables is the logistic function

$$\rho(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{x}\hat{\boldsymbol{\beta}})}.$$

The derivative of $\rho$ with respect to $x_i$ is

$$\frac{\partial \rho(\mathbf{x})}{\partial x_i} = \hat{\beta}_i\, \rho(\mathbf{x})\,\big(1 - \rho(\mathbf{x})\big).$$

The second derivative of the Kullback-Leibler divergence with respect to $\rho$ is the Fisher information of the Bernoulli distribution,

$$\mathcal{I}(\rho) = \frac{1}{\rho\,(1 - \rho)}.$$

In the limit when the number of observations goes to infinity, the KL-diff measure thus evaluates to

$$\text{KL-diff}_i(\mathbf{x}) = \sqrt{\frac{\big(\hat{\beta}_i\, \rho(\mathbf{x})(1 - \rho(\mathbf{x}))\big)^2}{\rho(\mathbf{x})\,(1 - \rho(\mathbf{x}))}} = |\hat{\beta}_i|\, \sqrt{\rho(\mathbf{x})\,\big(1 - \rho(\mathbf{x})\big)}.$$

Again we see that the relevance measure is proportional to the absolute value of the regression coefficient. In addition, due to the logistic link function, the local relevance measure is higher for points close to the decision boundary compared to points further away.
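
A small sketch of this asymptotic measure, with illustrative coefficients:

```python
import numpy as np

beta_hat = np.array([1.5, -0.5])  # assumed limiting coefficients

def kl_diff_logistic(x, i):
    rho = 1.0 / (1.0 + np.exp(-x @ beta_hat))       # predictive probability
    return np.abs(beta_hat[i]) * np.sqrt(rho * (1.0 - rho))

print(kl_diff_logistic(np.array([0.0, 0.0]), 0))  # at the boundary: 0.75
print(kl_diff_logistic(np.array([3.0, 0.0]), 0))  # far from it: smaller
```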

Derivatives of Gaussian Process Predictive Distribution Parameters

Gaussian process models are widely used in supervised learning, where the task is to predict an output $y$ from a $D$-dimensional input $\mathbf{x}$. The type of functions the Gaussian process can represent is determined by its covariance function, which is a key decision made during modelling. The covariance function $k(\mathbf{x}, \mathbf{x}')$ defines the covariance between the function values at the input points $\mathbf{x}$ and $\mathbf{x}'$. We assume the Gaussian process has a zero mean, in which case the joint distribution of the latent output values $\mathbf{f}$ at the training points is

$$p(\mathbf{f} \mid X) = \mathrm{N}(\mathbf{f} \mid \mathbf{0}, K),$$

where $K$ is the covariance matrix between the latent function values at the training inputs such that $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.

In this work, we use the squared exponential covariance function

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2}\right). \qquad (5)$$

Here, the hyperparameter $\sigma_f^2$ determines the overall variability of the functions, and $\ell_1, \ldots, \ell_D$ are the length-scales of each input dimension. By defining an observation model that links the observations to the latent values of the Gaussian process, the model can be used for inference and predictions in many supervised learning tasks.
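
For reference, equation (5) as a function (a sketch; hyperparameter names are illustrative):

```python
import numpy as np

def sq_exp_cov(x1, x2, sigma_f, lengthscales):
    """k(x1, x2) = sigma_f^2 exp( -1/2 sum_d (x1_d - x2_d)^2 / l_d^2 )."""
    r2 = np.sum((x1 - x2) ** 2 / np.asarray(lengthscales) ** 2)
    return sigma_f**2 * np.exp(-0.5 * r2)
```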

For example, in regression with an assumption of Gaussian noise, the posterior distribution of latent values $f^*$ for a new input point $\mathbf{x}^*$ is a univariate normal distribution with mean and variance

$$\mathbb{E}[f^*] = \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad \mathrm{Var}[f^*] = k(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{k}_*, \qquad (6)$$

where $\sigma^2$ is the noise variance, $I$ is the identity matrix, $\mathbf{y}$ is the vector of training outputs, and $\mathbf{k}_*$ is the vector of covariances between $\mathbf{x}^*$ and the training inputs. For many other observation models, the posterior of latent values is not Gaussian, but it is commonplace to approximate it with a Gaussian distribution during inference, and many methods have been developed for doing so (Williams and Barber, 1998; Opper and Winther, 2000; Minka, 2001; Rasmussen and Williams, 2006). The variable relevance assessment thus depends implicitly on the posterior approximation, as does any method that uses the predictive distribution.

Differentiating Gaussian Processes

We assume that the posterior distribution of latent values is Gaussian. Because differentiation is a linear operation, the derivatives of the parameters of a Gaussian process posterior distribution with respect to input variables are available in closed form (Solak et al., 2003; Rasmussen, 2003). For example, for the Gaussian observation model, the derivatives of the mean and variance of the predictive distribution in equation (6) with respect to the input variable $x_i^*$ at point $\mathbf{x}^*$ are given as

$$\frac{\partial\, \mathbb{E}[f^*]}{\partial x_i^*} = \frac{\partial \mathbf{k}_*^\top}{\partial x_i^*} (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad \frac{\partial\, \mathrm{Var}[f^*]}{\partial x_i^*} = \frac{\partial k(\mathbf{x}^*, \mathbf{x}^*)}{\partial x_i^*} - 2\, \frac{\partial \mathbf{k}_*^\top}{\partial x_i^*} (K + \sigma^2 I)^{-1} \mathbf{k}_*.$$

For the squared exponential covariance function in equation (5), the partial derivatives with respect to the input variable $x_i^*$ are

$$\frac{\partial k(\mathbf{x}^*, \mathbf{x}_j)}{\partial x_i^*} = -\frac{x_i^* - x_{j,i}}{\ell_i^2}\, k(\mathbf{x}^*, \mathbf{x}_j),$$

and $\partial k(\mathbf{x}^*, \mathbf{x}^*)/\partial x_i^* = 0$. The second derivatives are

$$\frac{\partial^2 k(\mathbf{x}^*, \mathbf{x}_j)}{\partial x_i^* \partial x_k^*} = \left(\frac{(x_i^* - x_{j,i})(x_k^* - x_{j,k})}{\ell_i^2\, \ell_k^2} - \frac{\delta_{ik}}{\ell_i^2}\right) k(\mathbf{x}^*, \mathbf{x}_j),$$

where $\delta_{ik}$ is the Kronecker delta.
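
The following sketch implements these derivatives for the squared exponential covariance, building on the sq_exp_cov function above; it is an illustrative implementation, not the authors' code.

```python
import numpy as np

def gp_mean_var_derivs(x_star, X, y, sigma_f, ls, sigma_n, i):
    """d mean / d x_i and d variance / d x_i of the latent GP at x_star.

    X is an (n, D) array of training inputs, y the (n,) training outputs.
    """
    ls = np.asarray(ls, dtype=float)
    K = np.array([[sq_exp_cov(a, b, sigma_f, ls) for b in X] for a in X])
    k_star = np.array([sq_exp_cov(x_star, xj, sigma_f, ls) for xj in X])
    A = np.linalg.inv(K + sigma_n**2 * np.eye(len(X)))
    # d k(x*, x_j) / d x_i for the squared exponential covariance
    dk = -(x_star[i] - X[:, i]) / ls[i] ** 2 * k_star
    dmean = dk @ A @ y
    dvar = -2.0 * dk @ A @ k_star   # k(x*, x*) is constant for this kernel
    return dmean, dvar
```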

For the KL-diff measure, we need the derivatives of the parameters of the predictive distribution, not those of the posterior of the latent values. However, for many observation models the former are obtained as a function of the derivatives of the latent values. In this section, we derive the derivatives for some commonly used observation models.

Regression with Gaussian Observation Model

In regression problems, it is commonly assumed that the noise has a Gaussian distribution. For a Gaussian observation model, the predictive distribution for a new observation at a single input point $\mathbf{x}^*$ is a normal distribution, which we will denote

$$p(\tilde{y} \mid \mathbf{x}^*, \mathcal{D}) = \mathrm{N}\!\left(\mu(\mathbf{x}^*),\; \sigma(\mathbf{x}^*)^2\right) = \mathrm{N}\!\left(\mathbb{E}[f^*],\; \mathrm{Var}[f^*] + \sigma_n^2\right),$$

where $\mathbb{E}[f^*]$ and $\mathrm{Var}[f^*]$ are the mean and variance of the posterior distribution of latent values at $\mathbf{x}^*$, and $\sigma_n^2$ is the noise variance. Now, the derivatives of $\mu$ and $\sigma$ with respect to input variable $x_i^*$ are simply

$$\frac{\partial \mu(\mathbf{x}^*)}{\partial x_i^*} = \frac{\partial\, \mathbb{E}[f^*]}{\partial x_i^*}, \qquad \frac{\partial \sigma(\mathbf{x}^*)}{\partial x_i^*} = \frac{1}{2\,\sigma(\mathbf{x}^*)}\, \frac{\partial\, \mathrm{Var}[f^*]}{\partial x_i^*}.$$

The second derivatives of the Kullback-Leibler divergence are given by the Fisher information matrix of the normal distribution,

$$\mathcal{I}_{\mu\mu} = \frac{1}{\sigma^2}, \qquad \mathcal{I}_{\sigma\sigma} = \frac{2}{\sigma^2}.$$

Thus, the KL-diff measure takes the form

$$\text{KL-diff}_i(\mathbf{x}^*) = \sqrt{\frac{1}{\sigma(\mathbf{x}^*)^2} \left(\frac{\partial \mu(\mathbf{x}^*)}{\partial x_i^*}\right)^2 + \frac{2}{\sigma(\mathbf{x}^*)^2} \left(\frac{\partial \sigma(\mathbf{x}^*)}{\partial x_i^*}\right)^2}.$$

Here, the first term is proportional to the slope of the mean prediction scaled by the predictive uncertainty, as with the linear regression model discussed in Section 2.1 of the main paper. The KL-diff$^2$ measure evaluates to

$$\text{KL-diff}^2_{ij}(\mathbf{x}^*) = \sqrt{\frac{1}{\sigma(\mathbf{x}^*)^2} \left(\frac{\partial^2 \mu(\mathbf{x}^*)}{\partial x_i^* \partial x_j^*}\right)^2 + \frac{2}{\sigma(\mathbf{x}^*)^2} \left(\frac{\partial^2 \sigma(\mathbf{x}^*)}{\partial x_i^* \partial x_j^*}\right)^2}.$$
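
Combining the pieces gives a sketch of the full KL-diff computation for this observation model, reusing the helpers from the previous sketches; names and setup remain illustrative assumptions.

```python
import numpy as np

def kl_diff_gp(x_star, X, y, sigma_f, ls, sigma_n, i):
    """KL-diff for a GP with Gaussian observation noise at point x_star."""
    K = np.array([[sq_exp_cov(a, b, sigma_f, ls) for b in X] for a in X])
    k_star = np.array([sq_exp_cov(x_star, xj, sigma_f, ls) for xj in X])
    A = np.linalg.inv(K + sigma_n**2 * np.eye(len(X)))
    var_f = sq_exp_cov(x_star, x_star, sigma_f, ls) - k_star @ A @ k_star
    s = np.sqrt(var_f + sigma_n**2)          # predictive standard deviation
    dmean, dvar = gp_mean_var_derivs(x_star, X, y, sigma_f, ls, sigma_n, i)
    ds = dvar / (2.0 * s)                    # d sigma / d x_i via chain rule
    # Fisher information of N(mu, sigma^2) w.r.t. (mu, sigma): diag(1/s^2, 2/s^2)
    return np.sqrt(dmean**2 / s**2 + 2.0 * ds**2 / s**2)
```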

Binary Classification

For binary classification problems, the predictive distribution is a Bernoulli distribution with only one parameter, the probability of positive classification. This is obtained by squashing the latent Gaussian process function through a link function and integrating over the posterior of the latent function values. Two commonly used link functions for Gaussian process classification are the logit and the probit. The probit link function has the benefit that the predictive distribution has an analytical formula when the posterior distribution of latent values is approximated with a Gaussian. Using a probit link function, the predictive probability thus has the approximate analytical form

$$\pi(\mathbf{x}^*) = \Phi\!\left(\frac{\mathbb{E}[f^*]}{\sqrt{1 + \mathrm{Var}[f^*]}}\right),$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.

Now, the derivatives of $\pi$ with respect to $x_i^*$ follow from the chain rule:

$$\frac{\partial \pi(\mathbf{x}^*)}{\partial x_i^*} = \phi\!\left(\frac{\mathbb{E}[f^*]}{\sqrt{1 + \mathrm{Var}[f^*]}}\right) \left(\frac{1}{\sqrt{1 + \mathrm{Var}[f^*]}}\, \frac{\partial\, \mathbb{E}[f^*]}{\partial x_i^*} - \frac{\mathbb{E}[f^*]}{2\,\big(1 + \mathrm{Var}[f^*]\big)^{3/2}}\, \frac{\partial\, \mathrm{Var}[f^*]}{\partial x_i^*}\right),$$

where $\phi$ is the standard normal density.
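
A sketch of this computation, assuming the latent mean, variance, and their derivatives are available from the expressions above:

```python
import numpy as np
from scipy.stats import norm

def probit_prob_and_deriv(mu, var, dmu, dvar):
    """pi = Phi( mu / sqrt(1 + var) ) and d pi / d x_i via the chain rule."""
    t = np.sqrt(1.0 + var)
    z = mu / t
    dz = dmu / t - 0.5 * mu * dvar / t**3
    return norm.cdf(z), norm.pdf(z) * dz
```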