1 Introduction
Approximate Bayesian inference techniques extend the usual pointwise predictions of already successful methods, e.g. neural networks (NNs) (Blei et al., 2017), to full predictive distributions. These techniques are seen as fast, general and approachable ways of obtaining more complete information about the predictions (Blundell et al., 2015a; Gal, 2016; Gal and Ghahramani, 2016). The estimated uncertainty can account for the different sources of uncertainty present in the modelling process. Moreover, this extra information can play a crucial role in sensitive real-world scenarios, where decisions may be taken depending on the certainty of the predictions (McAllister et al., 2017). However, these estimates will always be subject to the selection of a proper model for the data and the task at hand.
In Bayesian modelling, one uses probability distributions to assign degrees of belief to the parameter values of these models (Graves, 2011; Blundell et al., 2015b). Nonetheless, Bayesian learning presents several intrinsic difficulties in this context, e.g. the choice of meaningful priors when using complex systems such as Bayesian NNs (BNNs), as well as dealing with the intractability of the required computations. Usually, these issues are avoided by resorting to approximations of the original posterior distribution. However, this also induces certain biases that can have potentially decisive effects on the resulting predictive distribution. Moreover, the popularity of bigger models further complicates the learning process due to the numerous symmetries and strong dependencies that arise in parameter space (Sun et al., 2019; Coker et al., 2021). These factors may compromise the performance of regular approximation techniques.
Function-space inference has become a prominent approach to solve some of the intrinsic issues of approximate inference in parameter space, since it sidesteps the complexity of the parameter-space optimization landscape. In particular, approaches based on implicit stochastic processes (IPs) have gathered a lot of attention (Ma et al., 2019; Sun et al., 2019). Until recently, the most advanced methods were unable to simultaneously yield flexible predictive distributions and a fully trainable framework capable of tuning all of its parameters. The first IP-based method to do both is Sparse Implicit Processes (SIP) (Santana et al., 2022). Additionally, and maybe surprisingly, SIP's predictive distribution seems capable of correcting the model bias imposed by the selection of a given prior. In general, if the selected prior does not suit the data at hand, the exact predictive distribution can become quite different from the data distribution. Here we show that SIP is capable of providing predictive distributions that closely resemble the data, even if the underlying prior model is not the best choice.
We test this feature by comparing SIP with other function-based models. We show that, due to the changes in the approximate inference objective function, SIP's predictive distribution is much more robust to the model choice than other approaches. This leads to better generalization properties, capturing complex patterns in the predictive distribution.
2 Background
Before we explore the model bias correction in SIP, let us briefly introduce the key ideas behind this framework.
Approximate inference in parameter space
Consider some data $\mathbf{y} = (y_1, \ldots, y_N)$, a prior distribution $p(\boldsymbol{\theta})$ over the model parameters $\boldsymbol{\theta}$ and a likelihood model $p(\mathbf{y}|\boldsymbol{\theta})$. Let the distribution $q(\boldsymbol{\theta})$ approximate the exact posterior $p(\boldsymbol{\theta}|\mathbf{y})$ resulting from Bayes' rule. In variational inference, $q$ is obtained by maximizing the evidence lower bound (ELBO) (Jordan et al., 1999):

$\mathcal{L}(q) = \mathbb{E}_{q(\boldsymbol{\theta})}\left[\log p(\mathbf{y}|\boldsymbol{\theta})\right] - \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta}))\,, \qquad (1)$

where $\text{KL}(q \,\|\, p)$ represents the Kullback-Leibler divergence between $q$ and $p$. Usually, $q$ is assumed to be parametric for complex models such as Bayesian NNs (BNNs), imposing also independence among the components of $\boldsymbol{\theta}$ (Blundell et al., 2015b; Graves, 2011).
An implicit model can be used for $q$, i.e., $\boldsymbol{\theta} = g_{\boldsymbol{\phi}}(\boldsymbol{\epsilon})$, with $\boldsymbol{\epsilon}$ some random noise (Mescheder et al., 2017; Santana and Hernández-Lobato, 2020). If $\boldsymbol{\epsilon}$ is high-dimensional and the model defining $g_{\boldsymbol{\phi}}$ is expressive enough, $q$ can emulate almost any distribution. However, since $q$ lacks a closed-form density, the KL term becomes intractable. A solution is rewriting the log-ratio $\log[q(\boldsymbol{\theta})/p(\boldsymbol{\theta})]$ as the result of an auxiliary classification problem: the optimal value of this auxiliary problem is precisely that log-ratio (Mescheder et al., 2017).
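As an illustration of this classifier trick, the following minimal sketch (a hypothetical setup with two known 1-D Gaussians, not the architecture of the papers cited) trains a logistic classifier to separate samples of $q$ from samples of $p$; at the optimum, its logit recovers the log density ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the two distributions whose log-ratio we want:
# q = N(1, 1) (plays the role of the implicit approximation),
# p = N(0, 1) (plays the role of the prior).
# True log ratio: log q(x) - log p(x) = x - 0.5.
xq = rng.normal(1.0, 1.0, size=5000)
xp = rng.normal(0.0, 1.0, size=5000)

x = np.concatenate([xq, xp])
y = np.concatenate([np.ones_like(xq), np.zeros_like(xp)])  # label 1 = "from q"

# Logistic classifier T(x) = w*x + b trained by gradient descent on the
# binary cross-entropy. At the optimum, T(x) approximates log q(x) - log p(x).
w, b = 0.0, 0.0
for _ in range(2000):
    logits = w * x + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad_logits = probs - y          # d(cross-entropy)/d(logits)
    w -= 0.1 * np.mean(grad_logits * x)
    b -= 0.1 * np.mean(grad_logits)

print(w, b)  # approximately 1.0 and -0.5, matching log q(x) - log p(x) = x - 0.5
```

The same idea carries over to high-dimensional parameters: only samples from the two distributions are needed, never their densities.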
Finally, instead of minimizing the regular KL divergence between $q$ and the posterior, other approaches employ the more general $\alpha$-divergences, which include the KL divergence as a particular case (Hernández-Lobato et al., 2016). This has been shown to improve the overall results (Ma et al., 2019; Santana and Hernández-Lobato, 2020; Wenzel et al., 2020).
Inference with Implicit Processes
Implicit processes (IPs) can be seen as a general framework covering different stochastic processes; e.g., Gaussian processes can be understood as a particular case of IPs. An IP is defined as a collection of random variables $f(\cdot)$ such that the joint distribution of any finite set of evaluations $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N)$ is determined by the generative process:

$\mathbf{z} \sim p(\mathbf{z})\,, \qquad f(\mathbf{x}_i) = g_{\boldsymbol{\theta}}(\mathbf{x}_i, \mathbf{z})\,, \quad \mathbf{x}_i \in \mathcal{X}\,, \qquad (2)$

where $\mathbf{z}$ is some random variable that summarizes the randomness, $\boldsymbol{\theta}$ represents the parameters of the process and $\mathcal{X}$ is the input space (Ma et al., 2019). We write $f(\cdot) \sim \mathcal{IP}(g_{\boldsymbol{\theta}}(\cdot, \cdot), p(\mathbf{z}))$ to indicate that $f$ is sampled from the corresponding IP with parameters $\boldsymbol{\theta}$, using samples $\mathbf{z} \sim p(\mathbf{z})$. This definition of IPs enables many models to be described with the same language, e.g. BNNs, warped GPs, neural samplers (NS), and deep GPs, among others (Snelson et al., 2004; Ma et al., 2019; Damianou and Lawrence, 2013).
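The generative view in (2) can be sketched with a toy implicit process. Here a one-hidden-layer network with Gaussian weights plays the role of $g_{\boldsymbol{\theta}}$, and drawing $\mathbf{z}$ (the weights) once yields an entire function; these are illustrative choices, not the models used in the references:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ip_function(theta_scale=1.0, hidden=20):
    """Draw z ~ p(z) (here, the Gaussian weights of a one-hidden-layer NN)
    and return the corresponding function f(x) = g_theta(x, z)."""
    w1 = rng.normal(0.0, theta_scale, size=(1, hidden))
    b1 = rng.normal(0.0, theta_scale, size=hidden)
    w2 = rng.normal(0.0, theta_scale / np.sqrt(hidden), size=(hidden, 1))
    def f(x):  # x: array of shape (N, 1)
        return np.tanh(x @ w1 + b1) @ w2
    return f

# Each call produces one sample path of the IP, evaluated on a grid.
x = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
fs = np.stack([sample_ip_function()(x) for _ in range(100)])
print(fs.shape)  # (100, 50, 1): 100 function draws at 50 input locations
```

Swapping the network for any other sampler of functions (a warped GP, a deep GP, a neural sampler) leaves the interface unchanged, which is what makes the IP abstraction useful.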
Conducting approximate inference with IPs has become a promising research topic with increasing interest (Ma et al., 2019; Sun et al., 2019). However, the complexity of dealing with IPs leads to important issues in the formulation of such methods. The approach in Ma et al. (2019) relies on approximating the marginal likelihood of the prior IP by that of a GP. This leads to Gaussian predictions, which may lack flexibility in complex situations. The same issue arises in Ma and Hernández-Lobato (2021). On the other hand, the method in Sun et al. (2019) is not capable of updating the prior model parameters according to the data due to the usage of a spectral gradient estimator. This can result in important performance losses, since specifying a meaningful prior in complex models such as BNNs is a non-trivial task (Knoblauch et al., 2019).
Sparse Implicit Processes (SIP)
In Santana et al. (2022), SIP is described as the first general-purpose IP-based method that remains fully trainable while providing flexible predictive distributions. This is achieved through an inducing-point approximation in the style of sparse GP approximations (Titsias, 2009). Scalability is ensured by conducting inference with $M \ll N$ inducing points. Denoting $\overline{\mathbf{X}}$ as the set of inducing-point locations, the IP values at these locations are $\mathbf{u}$. The exact posterior is then approximated by

$q(f) = \int p(f|\mathbf{u})\, q_{\boldsymbol{\phi}}(\mathbf{u})\, d\mathbf{u}\,, \qquad (3)$

where $\boldsymbol{\phi}$ are the parameters of the implicit distribution $q_{\boldsymbol{\phi}}(\mathbf{u})$, and $\boldsymbol{\theta}$ are the parameters of the IP prior. Crucially, $p(f|\mathbf{u})$ is approximated by a Monte Carlo GP approximation as in Ma et al. (2019). The resulting functional ELBO is:

$\mathcal{L}(q) = \sum_{i=1}^{N} \mathbb{E}_{q}\left[\log p(y_i|f(\mathbf{x}_i))\right] - \text{KL}(q_{\boldsymbol{\phi}}(\mathbf{u}) \,\|\, p(\mathbf{u}))\,. \qquad (4)$
Similar to (1), the first term can be estimated by MC sampling, but the KL term lacks a closed-form solution. To estimate the exact log-ratio, an auxiliary classification problem is used, making $\log[q_{\boldsymbol{\phi}}(\mathbf{u})/p(\mathbf{u})] \approx T_{\omega^\star}(\mathbf{u})$, with $T_{\omega^\star}$ the classifier's optimal value (Mescheder et al., 2017).
SIP is capable of tuning its prior parameters by using the symmetrized KL divergence (Santana et al., 2022). This leads to a better fit of the prior to the data, even though the final objective is no longer a lower bound on the original objective, as it is in Ma et al. (2019). Moreover, SIP resorts to an $\alpha$-divergence energy function, present in other methods (Hernández-Lobato et al., 2016; Ma et al., 2019; Santana and Hernández-Lobato, 2020), leaving as final objective

$\mathcal{L}(q) = \sum_{i=1}^{N} \frac{1}{\alpha}\log \mathbb{E}_{q}\left[p(y_i|f(\mathbf{x}_i))^{\alpha}\right] - \text{KL}^{\text{sym}}(q_{\boldsymbol{\phi}}(\mathbf{u}) \,\|\, p(\mathbf{u}))\,. \qquad (5)$
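The per-point $\alpha$-divergence energy term, $\frac{1}{\alpha}\log\mathbb{E}_q[p(y_i|f(\mathbf{x}_i))^\alpha]$, can be estimated from $S$ posterior function samples. Below is a numerically stable numpy sketch assuming a Gaussian likelihood; the $\sigma$ and $\alpha$ values are illustrative, not those used by SIP:

```python
import numpy as np

def gaussian_loglik(y, f, sigma):
    """Log-density of y under N(f, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (y - f) ** 2 / sigma**2

def alpha_energy_term(y, f_samples, sigma=0.1, alpha=0.5):
    """MC estimate of (1/alpha) * log E_q[p(y|f)^alpha] for one data point,
    computed in log space via the log-sum-exp trick.
    f_samples: S function values f_s(x_i) drawn from the approximation."""
    log_p = gaussian_loglik(y, f_samples, sigma)   # shape (S,)
    a = alpha * log_p
    m = a.max()
    log_mean = m + np.log(np.mean(np.exp(a - m)))  # log of the MC average
    return log_mean / alpha

rng = np.random.default_rng(2)
f_s = rng.normal(0.0, 0.05, size=64)   # hypothetical posterior samples at x_i
val = alpha_energy_term(0.0, f_s)
print(val)
```

As $\alpha \to 1$ the term approaches $\log \mathbb{E}_q[p(y_i|f(\mathbf{x}_i))]$; smaller $\alpha$ values interpolate towards the standard variational data term.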
Finally, $p(f|\mathbf{u})$ in (3) is approximated by the conditional distribution of a GP with the same mean and covariance functions as the prior IP, as in Ma et al. (2019), where the means and covariances are empirically estimated. Predictions at new locations $\mathbf{x}^\star$ are made via Monte Carlo using $S$ samples. Using $\mathbf{u}_s \sim q_{\boldsymbol{\phi}}(\mathbf{u})$, the predictions are

$p(y^\star|\mathbf{x}^\star, \mathbf{y}) \approx \frac{1}{S}\sum_{s=1}^{S} \int p(y^\star|f^\star)\, p(f^\star|\mathbf{u}_s)\, df^\star\,. \qquad (6)$

The resulting distribution is a mixture of Gaussians. By using multiple samples, SIP can reproduce flexible predictive distributions that need not be Gaussian, as can be seen in the original experiments (Santana et al., 2022).
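A minimal sketch of why such a mixture-of-Gaussians predictive distribution is more expressive than a single Gaussian; the component means and variances below are hypothetical stand-ins for the per-sample GP conditionals:

```python
import numpy as np

def mixture_predictive_logpdf(y, means, variances):
    """Log-density of an equally weighted mixture of S Gaussians,
    one component per posterior sample u_s (log-sum-exp for stability)."""
    means, variances = np.asarray(means), np.asarray(variances)
    log_comp = (-0.5 * np.log(2 * np.pi * variances)
                - 0.5 * (y - means) ** 2 / variances)
    m = log_comp.max()
    return m + np.log(np.mean(np.exp(log_comp - m)))

# Two well-separated components produce a bimodal predictive density,
# something a single Gaussian cannot represent.
means, var = np.array([-2.0, 2.0]), np.array([0.1, 0.1])
at_modes = mixture_predictive_logpdf(-2.0, means, var)
between = mixture_predictive_logpdf(0.0, means, var)
print(at_modes > between)  # True: density peaks at the modes, dips between them
```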
In a general sense, the first term on the r.h.s. of (5) forces predictions to explain the observed data, while the second term makes predictions similar to the prior in regions where no data points are available. The approximations required in SIP can be understood in a similar fashion to the recent work in Knoblauch et al. (2019). It is the combination of the posterior approximation and the approximate objective function that allows SIP to produce flexible predictions. Moreover, and perhaps surprisingly, SIP seems capable of correcting the model bias imposed by the selection of a prior not suited to the data at hand. The original experiments in Santana et al. (2022) hinted at this: the functions generated using the BNN priors are NN samples contaminated with Gaussian noise, which are normally distributed. This means that, through exact Bayesian inference, one should not obtain bimodal or heteroscedastic predictive distributions from such prior models. This is shown in Santana et al. (2022) through the results of HMC on synthetic data. On the other hand, SIP seems capable of correcting this model bias, providing more flexible predictive distributions than the ones obtained through exact Bayesian inference. We explore this feature in the following experiments.

Table 1: Average test results (with standard errors) for the exact GP and SIP on the bimodal and heteroscedastic datasets.

              Bimodal data                     Heteroscedastic data
        Exact GP         SIP               Exact GP         SIP
RMSE    4.988 ± 0.003    4.994 ± 0.004     1.340 ± 0.002    1.342 ± 0.002
NLL     3.026 ± 0.001    2.093 ± 0.003     1.714 ± 0.001    1.449 ± 0.001
CRPS    2.860 ± 0.089    2.429 ± 0.003     0.727 ± 0.033    0.678 ± 0.001
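The CRPS values reported in Table 1 can be estimated directly from predictive samples using the identity CRPS(F, y) = E|X − y| − ½E|X − X′|, with X, X′ independent draws from the forecast F; a numpy sketch:

```python
import numpy as np

def crps_from_samples(y, samples):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the predictive distribution.
    Lower is better; it reduces to the absolute error for point forecasts."""
    samples = np.asarray(samples)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(3)
xs = rng.normal(0.0, 1.0, size=2000)   # draws from a standard Gaussian forecast
# Closed form for a N(0, 1) forecast at y = 0: (sqrt(2) - 1) / sqrt(pi) ~ 0.2337.
est = crps_from_samples(0.0, xs)
print(est)
```

This estimator works for any sample-based predictive distribution, which is why CRPS is a natural metric for comparing the Gaussian GP predictions with SIP's mixture predictions.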
3 Experiments
In the following tests we compare SIP's predictions with those of regular GPs. We study the behavior of SIP when forcing certain model biases in the implementation. By employing a GP prior, the associated posterior distribution is also that of a GP, which is normally distributed and can be obtained in an exact manner. Therefore, the GP predictions serve as ground truth for the expected behavior of the exact posterior distribution. Using a GP prior in SIP, the exact GP posterior distribution is also available. However, if SIP maintains the behavior shown by Santana et al. (2022), we expect it to avoid this model-formulation bias and output non-Gaussian predictions if the data presented is also clearly non-Gaussian. For this comparison, the exact GP model uses a white kernel plus an RBF kernel, implemented with the GP regression module in sklearn. In SIP we also use a GP prior, implemented in the form of a wide one-layer BNN with 500 units and cosine activation functions, as described in Rahimi and Recht (2007). The weights are initialized using a standard Gaussian, while the biases of the hidden layer are sampled from a uniform distribution on $[0, 2\pi]$. The kernel length-scale and amplitude parameters are initialized to 1. Finally, SIP's posterior IP is parametrized using a 2-layer BNN with 50 units per layer and leaky-ReLU activations, and 50 inducing points are used.
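A sketch of this random-feature construction (Rahimi and Recht, 2007): each draw of the weights below yields one function from an approximate GP prior with an RBF kernel, with length-scale and amplitude set to 1 as in our initialization (the unit count is kept at 500, but this is otherwise a simplified, 1-D illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_rff_gp_function(lengthscale=1.0, amplitude=1.0, units=500):
    """One draw from an approximate RBF-kernel GP prior, built as a wide
    one-layer BNN with cosine activations (random Fourier features)."""
    w = rng.normal(0.0, 1.0 / lengthscale, size=units)   # input weights
    b = rng.uniform(0.0, 2 * np.pi, size=units)          # hidden biases
    a = rng.normal(0.0, 1.0, size=units)                 # output weights
    def f(x):  # x: array of shape (N,)
        phi = np.sqrt(2.0 / units) * np.cos(np.outer(x, w) + b)  # (N, units)
        return amplitude * phi @ a
    return f

x = np.linspace(-3.0, 3.0, 100)
fs = np.stack([sample_rff_gp_function()(x) for _ in range(200)])  # (200, 100)
# The pointwise variance of the draws should be close to amplitude^2 = 1,
# matching the RBF kernel value k(x, x).
print(fs.shape, fs.var())
```

With 500 units the feature expansion is wide enough that the induced function distribution is very close to the target GP, so SIP and the exact GP start from essentially the same prior.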
Following the original results in Santana et al. (2022), we expect SIP to compensate for this model bias in the formulation and output flexible predictions. To show this, we apply both an exact GP model and SIP to the same two synthetic cases present in the original article, one with bimodal data and another with heteroscedastic data. To measure the performance of each method on both datasets, we generate the data 20 times following the same equations with different random seeds. We train both methods on each generated dataset and average the performance metrics across the 20 instances.
Bimodal dataset
To generate the bimodal data we first sample 1000 values for $x$. Then, for each $x$, $y$ is obtained from one of two possible expressions, each selected with a fixed probability, which produces the two modes of the data.
In Fig. 1a we plot the bimodal data in the top-left figure (in blue) and the resulting predictions for both SIP and the exact GP model. In the bottom-left we have the predictions of the exact GP model, which, as expected, follow a Gaussian behavior centered on the mean of the original data. The shaded orange regions represent two standard deviations from the GP predictive mean, marked here by the dark orange line in the middle. On the other hand, the right-side figures show SIP's results. In the top figure we plot samples from the GP prior used for SIP (the same as for the exact GP). Note that these samples do not follow the data in the same fashion as in the original article, since they are simply samples from the trained prior GP, although the associated hyperparameters are fitted to the data. Finally, in the bottom-right we have samples from the predictive distribution of SIP. This distribution is clearly bimodal, differing strongly from the predictive distribution of the exact GP, which is the result of exact inference. SIP's predictions closely follow the original data, achieving a more representative predictive distribution that is not available through exact inference with the imposed GP prior. This model bias is, in practice, corrected by the flexibility of the approximate inference method in SIP, which follows the original behavior observed in Santana et al. (2022). Interestingly, each function sampled from SIP is located alongside one mode; it is through the multiple samples that this flexible behavior is obtained.
Heteroscedastic dataset
To obtain the heteroscedastic data, we again sample values for $x$ and then obtain the respective $y$ value as
(7) 
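Since only the qualitative structure of (7) matters for the discussion here, the following generator is a hypothetical stand-in with the same pattern, namely an input-dependent noise scale; the specific functions and constants are illustrative, not the ones used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_heteroscedastic(n=1000):
    """Hypothetical heteroscedastic generator: y = f(x) + g(x) * eps,
    where the noise standard deviation g grows with |x|."""
    x = rng.uniform(-4.0, 4.0, size=n)
    noise_scale = 0.1 + 0.5 * np.abs(x)      # assumed input-dependent std
    y = np.sin(x) + noise_scale * rng.normal(size=n)
    return x, y

x, y = make_heteroscedastic()
# The noise variance should be larger far from the origin than near it,
# which is exactly the pattern a homoscedastic GP cannot capture.
inner = y[np.abs(x) < 1.0].var()
outer = y[np.abs(x) > 3.0].var()
print(inner < outer)
```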
The results for the heteroscedastic case are included in Fig. 1b, following the same layout as in the bimodal case. Again, the exact GP is unable to produce heteroscedastic predictive distributions. On the other hand, the variance of SIP's predictive distribution clearly depends on the location along the $x$ axis, following the same heteroscedastic pattern present in the original data. In this case, we can also see that the functions sampled from the prior GP in SIP are much smoother, pointing to a different fit for the hyperparameters of the GP prior employed.
Overall, we see that the predictive distribution of SIP is far more representative of the original data than the resulting distribution from the GP. These results are also supported by the error metrics in Table 1. There, the RMSE of both methods is almost equivalent, especially when taking into account the respective standard errors, which is expected since both predictive means lie in the same vicinity. However, in terms of the negative log-likelihood (NLL) and the continuous ranked probability score (CRPS) (Gneiting and Raftery, 2007), SIP is clearly superior. These results can be explained by the fact that the predictive means of both methods almost coincide given the data employed, while the predictive distribution of SIP is much closer to the original data than that of the GP. This suggests that SIP successfully avoids the overly simplistic Gaussian prediction bias imposed by the GP prior, which would otherwise also result in Gaussian predictions. Therefore, through its more general and flexible formulation, SIP is capable of bypassing formulation biases that may hinder the predictions, capturing otherwise neglected key features present in the data.
4 Conclusions
We have shown that SIP avoids formulation bias whenever the data strongly deviates from the restrictions imposed by the selected model. We extend the results in the original contribution (Santana et al., 2022) by using a GP prior, which makes the exact posterior available and Gaussian. However, if the data differs from the expected exact GP predictions, SIP's framework provides enough flexibility for its predictive distribution to be non-Gaussian, even when using the aforementioned GP prior, which is then a wrong modelling assumption. In fact, SIP's predictive distribution can be much more representative of the original data, reflecting complex features that would otherwise be neglected, e.g. bimodality and heteroscedasticity. This complements the results in the original paper, where it is also shown that HMC is not capable of capturing these complex patterns in the data. Further work will include studying the robustness of function-space methods to adversarial examples and a more detailed analysis of the inference process conducted (Knoblauch et al., 2019), as well as applications of SIP to complex real-world situations.
Acknowledgements
The authors gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from the Spanish Plan Nacional I+D+i, PID2019-106827GB-I00. SRS acknowledges the BBVA Foundation project and the Trustonomy project, which has received funding from the European Community's Horizon 2020 research and innovation programme under grant agreement No. 812003. BZ has been supported by the Programa Atracción de Talento de la Comunidad de Madrid under grant no. 2017-T2/TIC-5455, by the Comunidad de Madrid/UAM "Proyecto de Jóvenes Investigadores" grant no. SI1/PJI/2019-00294, as well as by the Spanish "Proyectos de I+D de Generación de Conocimiento" via grants PGC2018-096646-A-I00 and PGC2018-095161-B-I00. BZ finally acknowledges the support from Generalitat Valenciana through the plan GenT program (CIDEGENT/2020/055).
References
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015a). Weight uncertainty in neural networks. In International Conference on Machine Learning, pp. 1613–1622.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015b). Weight uncertainty in neural networks. In International Conference on Machine Learning, pp. 1613–1622.
Coker, B., Bruinsma, W. P., Burt, D. R., Pan, W., and Doshi-Velez, F. (2021). Wide mean-field variational Bayesian neural networks ignore the data. arXiv preprint arXiv:2106.07052.
Damianou, A. and Lawrence, N. D. (2013). Deep Gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
Gal, Y. (2016). Uncertainty in deep learning. Ph.D. thesis, University of Cambridge.
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378.
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356.
Hernández-Lobato, J. M., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and Turner, R. E. (2016). Black-box α-divergence minimization. In International Conference on Machine Learning, pp. 1511–1520.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning 37, pp. 183–233.
Knoblauch, J., Jewson, J., and Damoulas, T. (2019). Generalized variational inference: three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063.
Ma, C. and Hernández-Lobato, J. M. (2021). Functional variational inference based on stochastic process generators. In Advances in Neural Information Processing Systems.
Ma, C., Li, Y., and Hernández-Lobato, J. M. (2019). Variational implicit processes. In International Conference on Machine Learning, pp. 4222–4233.
McAllister, R., Gal, Y., Kendall, A., van der Wilk, M., Shah, A., Cipolla, R., and Weller, A. (2017). Concrete problems for autonomous vehicle safety: advantages of Bayesian deep learning.
Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning, pp. 2391–2400.
Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20.
Santana, S. R. and Hernández-Lobato, D. (2020). Adversarial α-divergence minimization for Bayesian approximate inference. Neurocomputing.
Santana, S. R., Zaldívar, B., and Hernández-Lobato, D. (2022). Function-space inference with sparse implicit processes.
Snelson, E., Rasmussen, C. E., and Ghahramani, Z. (2004). Warped Gaussian processes. In Advances in Neural Information Processing Systems 16, pp. 337–344.
Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. In International Conference on Learning Representations.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574.
Wenzel, F., Roth, K., Veeling, B. S., Świątkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., and Nowozin, S. (2020). How good is the Bayes posterior in deep neural networks really? In International Conference on Machine Learning, pp. 10248–10259.