DeepAI
Log In Sign Up

Correcting Model Bias with Sparse Implicit Processes

Model selection in machine learning (ML) is a crucial part of the Bayesian learning procedure. Model choice may impose strong biases on the resulting predictions, which can hinder the performance of methods such as Bayesian neural networks and neural samplers. On the other hand, newly proposed approaches for Bayesian ML exploit features of approximate inference in function space with implicit stochastic processes (a generalization of Gaussian processes). The approach of Sparse Implicit Processes (SIP) is particularly successful in this regard, since it is fully trainable and achieves flexible predictions. Here, we expand on the original experiments to show that SIP is capable of correcting model bias when the data generating mechanism differs strongly from the one implied by the model. We use synthetic datasets to show that SIP is capable of providing predictive distributions that reflect the data better than the exact predictions of the initial, but wrongly assumed model.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

06/14/2022

Deep Variational Implicit Processes

Implicit processes (IPs) are a generalization of Gaussian processes (GPs...
10/14/2021

Sparse Implicit Processes for Approximate Inference

Implicit Processes (IPs) are flexible priors that can describe models su...
06/06/2018

Variational Implicit Processes

This paper introduces the variational implicit processes (VIPs), a Bayes...
10/12/2021

Uncertainty-based out-of-distribution detection requires suitable function space priors

The need to avoid confident predictions on unfamiliar data has sparked i...
06/27/2022

Distributional Gaussian Processes Layers for Out-of-Distribution Detection

Machine learning models deployed on medical imaging tasks must be equipp...

1 Introduction

Approximate Bayesian inference techniques extend the usual point-wise predictions of already successful methods,

e.g. neural networks (NNs) Blei et al. (2017), to full predictive distributions. These techniques are seen as fast, general and approachable ways for obtaining more complete information about the predictions obtained Blundell et al. (2015a); Gal (2016); Gal and Ghahramani (2016)

. The estimated uncertainty of the predictions can account for different sources of uncertainty present in the modelling process. Moreover, this extra information can play a crucial role in sensitive real-world scenarios, where decisions may be taken depending on the certainty of the predictions

McAllister et al. (2017). However, these estimates will always be subject to the selection of the proper model for the data and the task at hand.

In Bayesian modelling, one uses probability distributions to assign degrees of belief to the parameter values of these models

Graves (2011); Blundell et al. (2015b). Nonetheless, Bayesian learning presents several intrinsic difficulties in this context, e.g. the choice of meaningful priors when using complex systems such as Bayesian NNs (BNNs), as well as dealing with the intractability of the calculations required. Usually, these issues are avoided by resorting to approximations of the original posterior distribution. However, this also induces certain biases that can have potentially decisive effects on the resulting predictive distribution. Moreover, the popularity of bigger models further complicates the learning process due to the numerous symmetries and strong dependencies that arise in parameter-space Sun et al. (2019); Coker et al. (2021). These factors may compromise the performance of regular approximation techniques.

Function-space inference has become a prominent approach to solve some of the intrinsic issues of approximate inference in parameter-space. Performing inference in function-space simplifies the issues related to the complexity of the optimization space. In particular, approaches based on implicit stochastic processes (IPs) have gathered a lot of attention Ma et al. (2019); Sun et al. (2019). Until recently, the most advanced methods were unable to simultaneously yield flexible predictive distributions and a fully trainable framework capable of tuning all of its parameters. The first method to do both things while remaining based on IPs is Sparse Implicit Processes (SIP) Santana et al. (2022). Additionally, and maybe surprisingly, SIP’s predictive distribution seems capable of correcting the model bias imposed by the selection of a given prior. In general, if the selected prior does not suit the data at hand, the exact predictive distribution can become quite different from the data distribution. Here we show that SIP is capable of providing predictive distributions that closely resemble the data, even if the underlying prior model may not be the best choice. We test out this feature by comparing SIP with other function-based models. We show that, due to the changes in the approximate inference objective function, SIP’s predictive distribution is much more robust to the model choice than other approaches. This leads to better generalization properties, capturing complex patterns in the predictive distribution.

2 Background

Before we explore the model bias correction in SIP, let us briefly introduce the key ideas behind this framework.

Approximate inference in parameter space

Consider some data , a prior distribution over the model parameters and a likelihood model . Let distribution approximate the exact posterior resulting from the Bayes rule. In variational inference, is obtained maximizing the evidence lower bound (ELBO) Jordan et al. (1999):

(1)

where

represents the Kullback-Leibler divergence between

and . Usually, is assumed to be parametric for complex models such as Bayesian NNs (BNNs), imposing also independence among the components of Blundell et al. (2015b); Graves (2011).

An implicit model can be used for i.e., , with some random noise Mescheder et al. (2017); Santana and Hernández-Lobato (2020). If is high-dimensional and the model defining is expressive enough, can emulate almost any function. However, since lacks a closed-form density, the KL term becomes intractable. A solution here is rewriting it as the result of an auxiliary classification problem. The optimal value of this auxiliary problem is precisely Mescheder et al. (2017).

Finally, instead of minimizing the regular KL-divergence between and the posterior, other approaches employ the more general -divergences, which include the KL-divergence as a particular case Hernández-Lobato et al. (2016). This has shown to overall improve the final results Ma et al. (2019); Santana and Hernández-Lobato (2020); Wenzel et al. (2020)

Inference with Implicit Processes

Implicit processes (IPs) can be seen as a general framework over different stochastic processes, e.g.

Gaussian processes can be understood as a particular case of IPs. An IP is defined as a collection of random variables

such that the joint distribution of any finite set of evaluations

is determined by the generative process:

(2)

where is some random variable that summarizes the randomness, represents the parameters of the process and is the input space Ma et al. (2019). We define to indicate that is sampled from the corresponding IP with parameters , using samples from (denoted as ). This definition of IPs enables many models to be described with the same language, e.g. BNNs, warped GPs, neural samplers (NS), and deep GPs, among others Snelson et al. (2004); Ma et al. (2019); Damianou and Lawrence (2013).

Conducting approximate inference with IPs has become a promising research topic with increasing interest Ma et al. (2019); Sun et al. (2019). However, the complexity of dealing with IPs leads to important issues in the formulation of such methods. The approach in Ma et al. (2019) relies on approximating the marginal likelihood of the prior IP by the marginal likelihood of a GP. This leads to Gaussian predictions, which may lack flexibility in complex situations. The same issue arises in Ma and Hernández-Lobato (2021). On the other hand, the method in Sun et al. (2019) is not capable of updating the prior model parameters according to the data due to the usage of a spectral gradient estimator. This can result in important performance losses, since specifying a meaningful prior in complex models such as BNNs is a non-trivial task Knoblauch et al. (2019).

Sparse Implicit Processes (SIP)

In Santana et al. (2022), SIP is described as being the first general-purposed IP-based method that remains fully trainable and with flexible predictive distributions. This is achieved through an inducing point approximation in the likes of sparse GP approximations Titsias (2009). Scalability is therefore ensured by conducting inference with number of inducing points. Denoting as the set of inducing points, the IP values at these input locations is . The exact posterior is then approximated by

(3)

where are the parameters of the implicit distribution , and are the parameters of the IP prior. Crucially, is approximated by a Monte Carlo GP approximation as in Ma et al. (2019). The resulting functional-ELBO is:

(4)

Similar to (1), the first term can be estimated by MC sampling, but the KL term lacks closed-form solution. To estimate the exact log-ratio, an auxiliary classification problem is used, making , being

the classifier’s optimal value

Mescheder et al. (2017).

SIP is capable of tuning its prior parameters by using the symmetrized KL-divergence Santana et al. (2022). This leads to a better fit of the prior to data, even though the final objective is no longer a lower bound on the original objective, as in Ma et al. (2019). Moreover, SIP resorts to -divergence energy function, present in other methods Hernández-Lobato et al. (2016); Ma et al. (2019); Santana and Hernández-Lobato (2020), leaving as final objective

(5)

Finally, in (3) is approximated by the conditional of a GP with the same covariance and mean function as the prior IP, as in Ma et al. (2019), where the mean and covariances are empirically estimated. Predictions in new locations are given via Monte Carlo using samples. Using , the predictions are

(6)

The resulting distribution is a mixture of Gaussians. Using multiple samples, this enables SIP to reproduce flexible predictive distributions that need not be Gaussian, as can be seen in the original experiments Santana et al. (2022).

In a general sense, the first term in the r.h.s. of (5) forces predictions to explain the observed data, while the second term makes predictions similar to the prior in regions where no data points are available. Here, the approximations required in SIP can be understood in a similar fashion to the recent work in Knoblauch et al. (2019). It is the combination of the posterior approximation and the approximate objective function which allows SIP to produce flexible predictions. Moreover, and perhaps surprisingly, SIP seems capable of correcting the imposed model bias by the selection of a prior not suited to the data at hand. The original experiments in Santana et al. (2022)

hinted at this: the functions generated using the BNN priors are NN samples contaminated with Gaussian noise, which are normally distributed. This means that, through exact Bayesian inference, using such prior models one should not obtain bimodal or heteroscedastic predictive distributions. This is shown in

Santana et al. (2022) through the results of HMC on synthetic data. On the other hand, SIP seems capable of correcting this model bias, providing more flexible predictive distributions than the ones obtained through exact Bayesian inference. We explore this feature in the following experiments.

Bimodal data Heteroscedastic data
Exact GP SIP Exact GP SIP
RMSE 4.988 0.003 4.994 0.004 1.340 0.002 1.342 0.002
NLL 3.026 0.001 2.093 0.003 1.714 0.001 1.449 0.001
CRPS 2.860 0.089 2.429 0.003 0.727 0.033 0.678 0.001
Table 1: Results for training the system in different instances of the bimodal and heterocedastic datasets.
Figure 1: Results for the experiments comparing an exact GP model and SIP with the same GP prior with bimodal data (a) and heterocedastic data (b). For both experiments we show, clockwise and beginning from the top-left: the original data (blue), samples from the GP prior of SIP, samples from SIP’s predictive distribution, and the exact GP predictive distribution (orange). Best seen in color.

3 Experiments

In the following tests we compare SIP’s predictions with those of regular GPs. We study the behavior of SIP when forcing certain model biases in the implementation. By employing a GP prior, the associated posterior distribution will be that of a GP, which is normally distributed and can be obtained in an exact manner. Therefore, the GP predictions will serve as ground truth for the expected behavior of the exact posterior distribution. Using a GP prior in SIP, the exact GP posterior distribution is also available. However, if SIP maintains the behavior shown by Santana et al. (2022), we expect it to avoid this model-formulation bias and output non-Gaussian predictions if the data presented is also clearly non-Gaussian. For this comparison, the exact GP model makes use of a white and a RBF kernel using the GP regression module in sklearn

. In SIP we also use a GP prior implemented in the form of a wide 1-layer BNN with 500 units and cosine activation functions, as described in

Rahimi and Recht (2007). The weights are initialized using a standard Gaussian, while the biases are sampled from

(only for the hidden layer). The kernel length and amplitude parameters are initialized to 1. Finally, SIP’s posterior IP is parametrized using a 2-layer BNN with 50 units per layer, leaky-ReLU activations and 50 inducing points.

Following the original results in Santana et al. (2022), we expect SIP to compensate for this model bias in the formulation and output flexible predictions. To show this, we employ both an exact GP model and SIP in the same two synthetic cases present in the original article, one with bimodal data and another with heteroscedastic data. To measure the performance of each method in both datasets, we generate the data 20 different times following the same equations with different random seeds. We train both methods in each generated dataset and measure the performance metrics to average across the 20 different instances.

Bimodal dataset

To generate the bimodal data we first sample 1000 values for from . Then, we obtain one of the two possible

values with probability

. Fixing , these two values are:

In Fig.1

a we plot the the bimodal data in the top-left figure (in blue) and the resulting predictions for both SIP and the exact GP model. In the bottom-left we have the predictions of the exact GP model, which, as expected, follow a Gaussian behavior centered in the mean of the original data. The shaded orange regions represent two standard deviations from the GP predictive mean, marked here by the dark orange line in the middle. On the other hand, in the right-side figures we see SIP’s results. In the top figure we plot samples from the GP prior used for SIP (the same as for the exact GP). Note that these samples do not follow the data in the same fashion of the original article since they are simply samples from the trained prior GP, although the associated hyperparameters are fitted to the data. Finally, in the bottom-right we have samples from the predictive distribution of SIP. This distribution is clearly bimodal, which differs strongly from the predictive distribution of the exact GP, which would be the result from the process of exact inference. SIP’s predictions follow closely the original data, achieving a more representative predictive distribution that is not available with exact inference from the initial GP prior imposed. This model bias is, in practice, being corrected by the flexibility of the approximate inference method in SIP, which follows the original behavior observed in

Santana et al. (2022). Interestingly, each function sampled from SIP is located alongside one mode. It is through the multiple samples where this flexible behavior is obtained.

Heteroscedastic dataset

To obtain the heteroscedastic data, we again sample values for from and then obtain the respective value as

(7)

The results for the heterocestic case are included in Fig.1

b following the same layout as in the bimodal case. Again, the exact GP is unable to produce heteroscedastic predictive distributions. On the other hand, the variance in SIP’s predictive distribution clearly depends on the location in the

-axis for the predictions, following the same heteroscedastic pattern present in the original data. In this case, we can also see that the sampled functions from the prior GP in SIP are much smoother, pointing to a different fit for the hyperparameters of the GP prior employed.

Overall, we see that the predictive distribution of SIP is far more representative of the original data than the resulting distribution from the GP. These results are also supported by the error metrics in Table 1

. Here, the RMSE for both methods are almost equivalent, especially when taking into account their respective standard errors. This can be understood directly as both of their predictive means lay in the same vicinity. However, in terms of the negative log-likelihood and the

continous ranked probability score Gneiting and Raftery (2007) SIP is clearly superior. These results can be explained by the fact that the predictive mean of both methods almost coincide due to the data employed, while the predictive distribution of SIP is much closer to the original data than the one of the GP. This suggests that SIP successfully avoids the overly-simplistic Gaussian prediction bias imposed by the GP prior, which would also result in Gaussian predictions. Therefore, through its more general and flexible formulation, SIP is capable of bypassing formulation biases that may hinder the predictions and captures otherwise neglected key features present in the data.

4 Conclusions

We have shown that SIP avoids formulation bias whenever the data strongly deviates from the restrictions imposed by the selected model. We extend the results in the original contribution Santana et al. (2022) by using a GP prior, making the exact posterior available, which is also Gaussian. However, if the data differs from the expected exact GP predictions, SIP’s framework provides enough flexibility so that its predictive distribution can turn out not necessarily Gaussian, even when using the aforementioned GP prior, which is a wrong assumption. In fact, SIP’s predictive distribution can be much more representative of the original data, reflecting complex features that would be otherwise neglected, e.g. bimodality and heteroscedasticity. This complements the results in the original paper, where it is also shown that HMC is not capable of capturing these complex patterns in the data. Further work will include studying the robustness of function-space-based methods to adversarial examples and a more detailed analysis of the inference process conducted Knoblauch et al. (2019), as well as applications of SIP to complex real-world situations.

Acknowledgements

Authors gratefully acknowledge the use of the facilities of Centro de Computacion Cientifica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, PID2019-106827GB-I00. SRS acknowledges the BBVA Foundation project and the Trustonomy project, which have received funding from the European Community’s Horizon 2020 research and innovation programme under grant agreement No 812003. BZ has been supported by the Programa Atracción de Talento de la Comunidad de Madrid under grant n. 2017- T2/TIC-5455, from the Comunidad de Madrid/UAM “Proyecto de Jóvenes Investigadores” grant n. SI1/PJI/2019-00294, as well as from Spanish “Proyectos de I+D de Generación de Conocimiento” via grants PGC2018-096646-A-I00 and PGC2018-095161-B-I00. BZ finally acknowledge the support from Generalitat Valenciana through the plan GenT program (CIDEGENT/2020/055).

References

  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: A review for statisticians. Journal of the American statistical Association 112 (518), pp. 859–877. Cited by: §1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015a) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015b) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §1, §2.
  • B. Coker, W. Pan, and F. Doshi-Velez (2021) Wide mean-field variational Bayesian neural networks ignore the data. arXiv preprint arXiv:2106.07052. Cited by: §1.
  • A. Damianou and N. D. Lawrence (2013) Deep gaussian processes. In Artificial intelligence and statistics, pp. 207–215. Cited by: §2.
  • Y. Gal and Z. Ghahramani (2016)

    Dropout as a Bayesian approximation: representing model uncertainty in deep learning

    .
    In international conference on machine learning, pp. 1050–1059. Cited by: §1.
  • Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, PhD thesis, University of Cambridge. Cited by: §1.
  • T. Gneiting and A. E. Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association 102 (477), pp. 359–378. Cited by: §3.
  • A. Graves (2011) Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356. Cited by: §1, §2.
  • J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. D. Bui, and R. E. Turner (2016) Black-box -divergence minimization. In International Conference on Machine Learning, pp. 1511–1520. Cited by: §2, §2.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine learning 37, pp. 183–233. Cited by: §2.
  • J. Knoblauch, J. Jewson, and T. Damoulas (2019) Generalized variational inference: three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063. Cited by: §2, §2, §4.
  • C. Ma and J. M. Hernández-Lobato (2021) Functional variational inference based on stochastic process generators. In Advances in Neural Information Processing Systems, Cited by: §2.
  • C. Ma, Y. Li, and J. M. Hernández-Lobato (2019) Variational implicit processes. In International Conference on Machine Learning, pp. 4222–4233. Cited by: §1, §2, §2, §2, §2, §2, §2.
  • R. McAllister, Y. Gal, A. Kendall, M. Van Der Wilk, A. Shah, R. Cipolla, and A. Weller (2017) Concrete problems for autonomous vehicle safety: advantages of Bayesian deep learning. Cited by: §1.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017)

    Adversarial Variational Bayes: unifying variational autoencoders and generative adversarial networks

    .
    In International Conference on Machine Learning, pp. 2391–2400. Cited by: §2, §2.
  • A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. Advances in neural information processing systems 20. Cited by: §3.
  • S. R. Santana and D. Hernández-Lobato (2020) Adversarial -divergence minimization for Bayesian approximate inference. Neurocomputing. Cited by: §2, §2, §2.
  • S. R. Santana, B. Zaldivar, and D. Hernández-Lobato (2022) Function-space inference with sparse implicit processes. Cited by: §1, §2, §2, §2, §2, §3, §3, §3, §4.
  • E. Snelson, C. E. Rasmussen, and Z. Ghahramani (2004) Warped Gaussian processes. Advances in neural information processing systems 16, pp. 337–344. Cited by: §2.
  • S. Sun, G. Zhang, J. Shi, and R. Grosse (2019) Functional variational Bayesian neural networks. In Internatinal Conference on Learning Representations, Cited by: §1, §1, §2.
  • M. Titsias (2009) Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pp. 567–574. Cited by: §2.
  • F. Wenzel, K. Roth, B. Veeling, J. Swiatkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin (2020) How good is the Bayes posterior in deep neural networks really?. In International Conference on Machine Learning, pp. 10248–10259. Cited by: §2.