Generalized Bayesian Regression and Model Learning

11/26/2019
by Tony Tohme, et al.
MIT

We propose a generalized Bayesian regression and model learning tool based on the "Bayesian Validation Metric" (BVM) proposed in [1], called BVM model learning. This method performs Bayesian regression based on a user's definition of model-data agreement and allows for model selection on any type of data distribution, unlike Bayesian and standard regression techniques, which fail in some cases. We show that BVM model learning is capable of representing and combining Bayesian and standard regression techniques in a single framework, thereby generalizing these methods. This tool thus offers new insights into the interpretation of the predictive envelopes in Bayesian and standard regression while giving the modeler more control over these envelopes.


1 Introduction

A central problem in engineering, science, statistics, and machine learning involves describing, representing, and understanding data through the use of models. Nonprobabilistic methods, such as parametric model regression, nonparametric neural networks, and support vector machines (SVM) [Bishop2006], are able to tackle these types of problems efficiently. In Bayesian probability theory [Mackaybook, Sivia, mhalgo], Bayesian model testing and maximum likelihood methods provide probabilistic features (e.g., mean, covariance, distribution) for the parameters we aim to estimate, based on prior knowledge (the prior distribution) and the uncertainty of the data. Bayesian model testing, which uses Bayesian parameter regression, has been shown to be successful for signal detection, light sensor characterization [knuth], exoplanet detection [Placek1], extra-solar planet detection [Placek], laser peening processes [Park], time series [timeseries], astronomical data analyses [FH08], and cosmology and particle physics [FH09].

We believe that the efficacy of both parametric Bayesian and standard regression can be improved. Bayesian regression calculates the Bayesian evidence, which is the probability that the model could have produced the observed, usually noisy or uncertain, data. If this probability is nonzero, one can proceed to calculate posterior model parameter probabilities using Bayes' Theorem. In practice, there are models and parameters that may be of interest to the user for which Bayesian regression fails to regress and produce posterior parameter distributions, as highlighted in Figure 1. For some of the instances where Bayesian regression fails to provide a solution, standard regression may actually succeed, but usually with some measure of expected error. How this error can translate into parameter and model uncertainty in the presence of certain or uncertain data is a problem that is largely omitted in the literature except for a few analytic cases.

(a) Bayesian regression works.
(b) Bayesian regression fails.
Figure 1: Illustrative example of success and failure cases of Bayesian regression.

Figure 1(a) shows normally distributed data (an infinite tail data distribution). In this case, parametric Bayesian regression finds a linear model that sits in low probability regions of the data. Figure 1(b) shows uniformly distributed data (a truncated data distribution). In this case, Bayesian regression cannot find a linear model solution because no linear model can pass through each data distribution simultaneously – the model given the data is regarded as impossible. Standard regression methods can provide linear model solutions here despite the model lying in a zero probability region of the data. Although this solution may be considered "wrong" because it is not supported by the data, it successfully provides useful information to the modeler (an increasing trend). Our generalized Bayesian regression method can find solutions in either case by regressing with respect to more general definitions of model-data agreement, as will be discussed in Section 3.1.

The "Bayesian Validation Metric" (BVM) is a general model validation and testing tool. This method was proposed and implemented in [Vanslette2019]. The BVM is capable of representing all of the standard validation metrics (square error, statistical hypothesis testing, Bayesian model testing, etc.) as special cases. Using BVM model testing, the BVM selects models according to user defined definitions of agreement between the model output and the observed data, usually within a specific tolerance. It was found that the BVM is able to generalize the Bayesian model testing framework, which allowed the problem of model validation to be expressed in a single framework.


In this article, we represent Bayesian and standard regression techniques within the BVM framework. By learning model parameters with the BVM, we are able to estimate and construct model parameter distributions for any type of data distribution (Gaussian, uniform, completely certain), which addresses the concerns raised in Figure 1. We show how BVM model learning allows us better control over the predictive envelopes of the model in question. This construction gives us additional insight into the meaning of the predictive envelopes of Bayesian regression.

We have found that the BVM model learning technique we propose shares a few mathematical features with Approximate Bayesian Computation (ABC) methods - also known as likelihood-free techniques - which have been widely studied over the past 15 years [Beaumont2025, ABCMethods]. However, ABC is used to approximate intractable likelihoods, while BVM model learning solves the problem of Bayesian regression for different types of data distributions using likelihoods that are modified by a user's definition of agreement between the data and the model.

The remainder of the article is organized as follows. In Section 2, we review the positives and negatives of Bayesian model testing and standard regression techniques. In Section 3, we derive our theoretical solutions for BVM model selection (or learning) for different types of data distributions and user modified definitions of model-data agreement. Section 4 presents a simulation application using BVM model learning on a nonlinear heuristic model, along with a compound predictive envelope example.

2 Background and Motivation

In this section, we will review Bayesian and standard regression and discuss their positives and negatives.

2.1 Standard Regression

We start by introducing some of the notation used in this paper. Suppose we are given the independent input (feature) vector $x$, its corresponding dependent observed data (labels) $y$, and a model output function $f(x; \theta)$, where $\theta$ represents the vector of model parameters we want to estimate.

The goal of regression is to find $\theta$ such that one can predict the value of $y$ for a new value of $x$ using the regressed parameters $\theta^{*}$. To do this, one constructs an objective (loss, cost) function $L(\theta)$ whose purpose is to rank how well the model function represents the data. We define standard regression as the process of solving the optimization problem $\theta^{*} = \arg\min_{\theta} L(\theta)$ such that one obtains a regressed model solution $f(x; \theta^{*})$ over the entire input space.
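As a concrete illustration of this optimization view of standard regression, the following is a minimal sketch (our own illustration, not code from the paper) that fits a linear model $f(x;\theta) = \theta_0 + \theta_1 x$ by minimizing a squared-error objective; the data arrays and the choice of a linear model are assumptions made only for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical observed data (assumed for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.9, 3.2, 3.8])

def f(x, theta):
    """Linear model f(x; theta) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x

def loss(theta):
    """Squared-error objective L(theta) ranking model-data fit."""
    return np.sum((f(x, theta) - y) ** 2)

# Standard regression: theta* = argmin_theta L(theta).
result = minimize(loss, x0=np.zeros(2))
theta_star = result.x
print("Regressed parameters:", theta_star)
```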

In practice, the model and/or the objective function can be poorly posed for the prediction of new data points given the observed data. If the model function is too complex (like a deep neural network), then standard regression methods may result in a solution that is deemed to have overfit the data and to have accidentally captured unintended features that may not be conducive to the data set as a whole. Overfitting can only be noticed once one tries to generalize the regressed function to new data and observes that it does not perform well. As overfitting can only result if the model function varies greatly, it is common practice to add a regularization term to the objective function, weighted by a hyperparameter that mitigates the strength of the regularization; its effect leads to regularized regressed model solutions having a reduced variance at the cost of an increased bias. The problem of underfitting is usually solved by using more complex models and then regularizing them appropriately.

Standard regression type methods have several positive and negative attributes. These methods are relatively easy to implement and can approximate model parameters even for reasonably high dimensional data and model parameter spaces (given one is not concerned with parameter or data uncertainty). When the data is uncertain, standard regression methods can generate parameter estimations along with their variances and covariances (analytically for simple cases, and by iteratively regressing randomly sampled uncertain data using a bootstrap approach in others). However, regularizing the objective function introduces bias, in which the parameter estimations change (i.e., become biased estimators), the parameter variances become reduced, and the model's predictive envelope becomes narrower and less representative of the data. Although this is not a problem for nonparametric models in which the parameters do not have physical interpretations, we find that regularization is problematic for parametric models because these parameters often represent physical quantities, e.g., the predicted mass of an exoplanet, the predicted circuit resistance due to the addition of an electrical load, or the predicted stiffness of a beam. It seems more natural that larger acceptable training errors should be correlated with an increase in the variance of a parameter rather than a decrease, because one is admitting that the model is not perfect. Currently, regularization causes parameters to become biased as well as "more certain" because the variance of the regressed model is reduced.

Finally, other than using generalization error type estimates (via training and testing error statistics), these methods do not offer any other means of model selection in which one could easily include prior knowledge in a principled way.

2.2 Bayesian Regression and Model Testing

In this section, we present Bayesian model testing (BMT) and regression while introducing some probability notation to be used throughout the paper. In Bayesian regression, rather than performing regression to learn the model parameters, one performs regression to learn the posterior probability distribution of the model parameters. That is, one estimates the posterior probability of a set (vector) of parameters $\theta$ in a model (or hypothesis) $M$ given the data $(x, y)$ (where $x$ is the model input vector) and the prior probability of the parameters $p(\theta \mid M)$. The defining equation of Bayesian regression is the learning of the posterior parameter distribution from the prior via Bayes' Rule,

$$p(\theta \mid y, x, M) = \frac{p(y \mid x, \theta, M)\, p(\theta \mid M)}{p(y \mid x, M)}. \qquad (1)$$

In Bayesian model testing and regression, these probabilities are named as follows:

$p(\theta \mid y, x, M)$ is the posterior probability distribution of the parameters,
$p(y \mid x, \theta, M)$ is the likelihood function,
$p(\theta \mid M)$ is the prior probability,
$p(y \mid x, M)$ is the marginal likelihood or Bayesian evidence.
Here, the likelihood $p(y \mid x, \theta, M)$ represents the probability that the model output values $f(x; \theta)$ are equal to the uncertain data values represented by $y$, given the model and the data.

After learning the posterior distribution of the model parameters, we can evaluate the predictive distribution defined by:

$$p(\hat{y} \mid \hat{x}, y, x, M) = \int p(\hat{y} \mid \hat{x}, \theta, M)\, p(\theta \mid y, x, M)\, d\theta. \qquad (2)$$

To perform Bayesian regression, one must calculate the evidence, which is the marginal likelihood over $\theta$,

$$p(y \mid x, M) = \int p(y \mid x, \theta, M)\, p(\theta \mid M)\, d\theta. \qquad (3)$$
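To make the role of the evidence concrete, here is a minimal sketch (an assumption of this write-up, not code from the paper) that approximates the integral in (3) by simple Monte Carlo: draw parameter samples from the prior and average the likelihood. The Gaussian prior, Gaussian noise model, and toy data are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and linear model (illustrative assumptions).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.9, 3.2, 3.8])
sigma = 0.5  # assumed data noise standard deviation

def likelihood(theta):
    """p(y | x, theta, M) for an assumed Gaussian noise model."""
    residual = y - (theta[0] + theta[1] * x)
    return np.exp(-0.5 * np.sum(residual ** 2) / sigma ** 2) / \
           (np.sqrt(2 * np.pi) * sigma) ** len(y)

# Monte Carlo estimate of the evidence: average the likelihood
# over samples drawn from the prior p(theta | M) = N(0, 2^2 I).
theta_samples = rng.normal(0.0, 2.0, size=(10000, 2))
evidence = np.mean([likelihood(t) for t in theta_samples])
print("Estimated Bayesian evidence:", evidence)
```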

In [Vanslette2019] (a derivation of equation (4) using the BVM is given in Appendix A), the likelihood of the parameters in BMT is expressed as,

$$p(y \mid x, \theta, M) = \int_{\hat{y},\, y'} \rho(y')\, \delta(\hat{y} - y')\, p(\hat{y} \mid x, \theta, M)\; d\hat{y}\, dy', \qquad (4)$$

where $\rho$ denotes the probability density of the uncertain data. We are explicit in stating that the model output values $\hat{y}$ and the data values $y'$ exist in independent subspaces. This encodes that the model functions (and their outputs) do not perturb the observed data (or their uncertainties), and vice-versa, without an explicit connection. We connect the values together for the purpose of data modeling via $\delta(\hat{y} - y')$ in BMT (where $\delta(\cdot)$ is the Dirac delta function), which allows for data driven modeling.

After solving for the model parameter values, rather than selecting the model with the lowest estimated generalization error as is done in standard regression, one instead uses BMT to select the model with the highest probability given the data. That is, for two Bayesian regressed models $M_1$ and $M_2$, BMT uses the Bayes ratio to rank the data informed posterior model probabilities. It can be expressed several ways using Bayes' Rule,

$$R = \frac{p(M_1 \mid y, x)}{p(M_2 \mid y, x)} = \frac{p(y \mid x, M_1)\, p(M_1)}{p(y \mid x, M_2)\, p(M_2)}.$$

If there is no reason to suspect that one model is more probable than another prior to observing the data, we may set the ratio of the prior probabilities of the models to unity, $p(M_1)/p(M_2) = 1$, a priori. In this case one gets,

$$R = \frac{p(y \mid x, M_1)}{p(y \mid x, M_2)} \equiv B,$$

where $B$ denotes the Bayes factor and is the ratio of model evidences. The Bayes factor is usually more accessible than the full Bayes ratio, so it is usually used for model selection (see the sketch following this list for a numerical illustration):
If $B > 1$, then the probability of $M_1$ given the observed data is higher than the probability of $M_2$ given the data. In this case, we select model $M_1$.
If $B < 1$, then the probability of $M_2$ given the observed data is higher than the probability of $M_1$ given the data. In this case, we select model $M_2$.
If $B = 1$, then the probability of $M_1$ given the observed data is equal to the probability of $M_2$ given the data. In this case, both models are equally good or bad.
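The following minimal sketch (illustrative only; the two evidence values are made-up numbers, not results from the paper) shows how the Bayes factor would be used to choose between two models once their evidences have been estimated, e.g. by Monte Carlo as in the earlier sketch.

```python
# Hypothetical evidences p(y | x, M1) and p(y | x, M2) (assumed values).
evidence_m1 = 3.2e-4
evidence_m2 = 1.1e-4

# With equal prior model probabilities, the Bayes ratio
# reduces to the Bayes factor B = Z1 / Z2.
bayes_factor = evidence_m1 / evidence_m2

if bayes_factor > 1:
    print("Select model M1 (B = %.2f)" % bayes_factor)
elif bayes_factor < 1:
    print("Select model M2 (B = %.2f)" % bayes_factor)
else:
    print("Both models are equally supported by the data.")
```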

Bayesian regression has several positive and negative attributes. As a byproduct, Bayesian regression can perform model selection in a principled way that allows one to incorporate prior knowledge into the selection process using BMT. Because Bayesian regression requires regressing probability distributions rather than just single model predictions, it can become intractable to calculate in general if the number of dimensions is large (as would standard regression if uncertainty is taken into account). Regularization in Bayesian regression is interpreted as coming from the uncertainty of the data and the uncertainty present in the prior parameters [Bishop2006], which we view as being a potential drawback. If one wants to change the regularization, it would require changing either of these uncertainties, or both, "artificially", because one would be tuning their prior probabilities after regression, which is a bit anti-Bayesian. Similar to standard regression, regularization can again lead to an unnatural reduction of the posterior variances of the parameters for parametric models.

Further, we highlight some technical gaps found in Bayesian regression and model testing. Although almost all instances of Bayesian regression or model testing use data probability distributions that have infinite tails, truncated (or bounded) data probability density functions (pdfs) are realistic in practice too. We find that truncated data pdfs are potentially problematic for Bayesian regression. In the extreme case of completely certain data, Bayesian regression methods usually do not terminate because the Bayesian evidence is zero in (1), since there are no possible combinations of parameter values that could exactly fit the data. This problem may also arise if the data uncertainties are bounded. In principle, standard regression methods can produce a solution regardless of the form of the data pdf. We give explicit examples of the likelihoods below:

Infinite Tail Data Distributions
Data distributions with infinite tails result in likelihoods with infinite tails in (4). Some examples of infinite tail data distributions are the Gaussian, Student-t, Laplace, canonical, and Poisson distributions. For example, Gaussian distributed data (see Figure 1(a)), where $d$ is the vector of observed data values and $\Sigma$ is the covariance matrix, results in an infinite tailed likelihood function,

$$p(y \mid x, \theta, M) \propto \exp\!\Big(-\tfrac{1}{2}\big(f(x;\theta) - d\big)^{T} \Sigma^{-1} \big(f(x;\theta) - d\big)\Big). \qquad (5)$$

Since the likelihood has an infinite tail, the predicted model response has probabilistic flexibility around its corresponding data point because it is uncertain. Even far from the data, Bayesian regression is capable of estimating the posterior probability distributions of the model parameters in question because the likelihood values are nonzero.

Truncated Tail Data Distributions
Data distributions with truncated tails lead to truncated likelihoods in (4). For example, if the uncertain data is bounded to a region and is uniformly distributed (see Figure 1(b)), with $y_i \sim \mathcal{U}(a_i, b_i)$ for each instance $i$, then the likelihood function is,

$$p(y \mid x, \theta, M) = \prod_{i=1}^{N} \frac{1}{b_i - a_i}\, \mathbb{1}\big\{a_i \le f(x_i; \theta) \le b_i\big\}, \qquad (6)$$

where $\mathbb{1}\{\cdot\}$ is the indicator function. In other words, for the likelihood to be nonzero, the predicted model response at $x_i$ must lie within the interval $[a_i, b_i]$ for all $i$ simultaneously. The function space defined by the model and uncertain parameters is constrained by the data. This can make the probability of estimating a regressed posterior probability distribution of the model parameters very small, and in some cases impossible, because the likelihood may evaluate to zero for almost all combinations of parameter values.
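As a rough numerical illustration of this constraint (our own sketch, using made-up bounds and a linear model), the code below evaluates the truncated likelihood in (6): the product is nonzero only if the model passes through every interval $[a_i, b_i]$ at once, so random parameter draws almost always return zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical uniform data bounds [a_i, b_i] at inputs x_i (illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0])
a = np.array([-0.2, 0.7, 1.6, 2.6])
b = np.array([0.2, 1.1, 2.0, 3.0])

def truncated_likelihood(theta):
    """Equation (6): nonzero only if f(x_i; theta) lies in [a_i, b_i] for all i."""
    f = theta[0] + theta[1] * x  # linear model
    inside = np.all((f >= a) & (f <= b))
    return np.prod(1.0 / (b - a)) if inside else 0.0

# Most random parameter draws give zero likelihood.
draws = rng.normal(0.0, 2.0, size=(5000, 2))
nonzero = sum(truncated_likelihood(t) > 0 for t in draws)
print(f"{nonzero} of {len(draws)} prior draws have nonzero likelihood")
```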

This point is exaggerated if the data is completely certain or deterministic, $y = d$, because the likelihood function becomes

$$p(y \mid x, \theta, M) = \prod_{i=1}^{N} \delta\big(f(x_i; \theta) - d_i\big). \qquad (7)$$

In this case, the model output and the observed data only agree if their values are exactly equal at all points, which in most cases is only possible if we overfit the data or the model is perfect. Thus, Bayesian regression will usually fail in this case, or if it succeeds, it only produces singular posterior distributions of the model parameters. When Bayesian regression fails to regress, the Bayesian evidence is zero, which, although correct (the model does not support/fit the data), may not be the most useful type of answer for the modeler. It seems reasonable that a modeler would want both the benefits of Bayesian and standard regression simultaneously.

2.3 BVM Model Testing

We present the Bayesian Validation Metric (BVM) proposed in [Vanslette2019] and derive the BVM factor, which is analogous to the Bayes factor. The BVM represents model to data validation in a general way using a user definable probability of agreement,

$$p(A) = \int_{z_m,\, z_d} \mathbb{1}_{B(z_m,\, z_d)}\; p(z_m, z_d)\; dz_m\, dz_d, \qquad (8)$$

where $z_m$ and $z_d$ are the model and data comparison quantities, respectively. The "agreement kernel" $\mathbb{1}_{B(z_m, z_d)}$ is the indicator function of a user defined Boolean function, $B(z_m, z_d)$, that defines the context of what is meant by "model to data agreement" by being true when $(z_m, z_d)$ agree and false otherwise. For simplicity, we will assume $z_m$ and $z_d$ are the model outputs and data, respectively.

The BVM model testing framework was shown to generalize BMT, where the probability of agreement plays the role of the evidence,

$$p(A \mid B, M) = \int p(A \mid \theta, B, M)\, p(\theta \mid M)\, d\theta, \qquad (9)$$

where $p(A \mid B, M)$ and $p(A \mid \theta, B, M)$ are the BVM evidence and likelihood, respectively, that have been modified by a user's definition of model-data agreement $B$. Analogous to the Bayesian model testing framework, we can perform BVM model testing between two models $M_1$ and $M_2$ using the probability of agreement defined above as follows,

$$\frac{p(M_1 \mid A, B)}{p(M_2 \mid A, B)} = \frac{p(A \mid B, M_1)\, p(M_1)}{p(A \mid B, M_2)\, p(M_2)},$$

where $p(M_1)/p(M_2)$ is the ratio of prior probabilities of $M_1$ and $M_2$, which can often be set to unity, $p(M_1)/p(M_2) = 1$. In this case, we get

$$\frac{p(M_1 \mid A, B)}{p(M_2 \mid A, B)} = \frac{p(A \mid B, M_1)}{p(A \mid B, M_2)},$$

where the left hand side denotes the BVM ratio and the right hand side denotes the BVM factor, which is analogous to the Bayes factor.

3 Generalized Regression

This section introduces what we call BVM regression, which generalizes Bayesian and standard regression. This method is able to produce posterior parameter distributions and predictive envelopes for any data distribution, include prior knowledge about the model parameters (if there is any), and regularize parameter solutions in a way that parameter uncertainty increases rather than decreases (as was discussed in Section 2.1).

3.1 BVM regression

BVM regression is defined as the learning of a posterior distribution of the parameters, given the agreement $A$ and the Boolean function $B$, from the prior via Bayes' Rule,

$$p(\theta \mid A, B, M) = \frac{p(A \mid \theta, B, M)\, p(\theta \mid M)}{p(A \mid B, M)}. \qquad (10)$$

After learning the posterior distribution of the model parameters, we can evaluate the predictive distribution defined by:

$$p(\hat{y} \mid \hat{x}, A, B, M) = \int p(\hat{y} \mid \hat{x}, \theta, M)\, p(\theta \mid A, B, M)\, d\theta. \qquad (11)$$

Performing BVM regression requires evaluating the BVM probability of agreement. At the beginning of Appendix B, we give a derivation showing that (9) can be written as,

$$p(A \mid \theta, B, M) = \int \mathbb{1}_{B(f(x;\theta),\, y)}\; \rho(y)\; dy, \qquad (12)$$

which is analogous to (4) in form and derivation, and where the comparison values are the model outputs $f(x;\theta)$ and the data $y$.

BVM regression can reproduce both Bayesian and standard regression as special cases. When the data and model outputs must be exactly equal to agree with one another, the BVM produces BMT as a special case [Vanslette2019] and the regression solutions are given in Appendix A. We find that the Boolean function that reproduces standard regression is defined to be true iff $\theta$ minimizes the objective function $L(\theta)$. This only gives nonsingular posterior parameter distributions and predictive model envelopes if the data is uncertain.

If the objective function is convex, then we have a single minimum, which results in one vector of parameters that makes the Boolean true. However, when the cost function is non-convex, multiple parameter vectors corresponding to different local minima lead to a true Boolean and may be accepted. This results in multiple regressed solutions for the regression problem and approximates the posterior parameter distribution (analogous to the accepted parameter samples of an MCMC chain). Marginalizing over the parameters leads to the predictive posterior model distribution as in (11). Finding the predictive model output average is analogous to the results obtained with ensemble methods in machine learning [ensemble].

Because the BVM can reproduce these special cases and generate new ones by extending, combining, and modulating Boolean agreement functions, BVM regression may be seen as a generalized regression method.

Due to the flexibility of the BVM framework, there are many possible definitions of agreement that the user can define. Table 1 below contains some of these definitions, and a short code sketch of them follows the table.

Agreement Boolean | Definition
Exact agreement Boolean | true iff $f(x_i;\theta) = y_i$ for all $i$
$\epsilon$-Boolean | true iff $|f(x_i;\theta) - y_i| \le \epsilon_i$ for all $i$
Average square error Boolean | true iff $\frac{1}{N}\sum_{i=1}^{N}\big(f(x_i;\theta) - y_i\big)^2 \le \epsilon$
Compound Boolean | true iff the average square error condition holds and a specified fraction of the data lie within the model's confidence interval
Table 1: Some examples of agreement Boolean functions.
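To ground these definitions, the following is a minimal sketch (our own illustration, not the authors' code) of how the first three agreement Booleans of Table 1 can be written as small predicate functions of the model outputs f and the data y; the compound Boolean is sketched later in Section 4.2.2.

```python
import numpy as np

def exact_boolean(f, y):
    """True iff the model output equals the data at every instance."""
    return np.array_equal(f, y)

def eps_boolean(f, y, eps):
    """Epsilon-Boolean: true iff |f_i - y_i| <= eps_i for all i."""
    return bool(np.all(np.abs(f - y) <= eps))

def avg_error_boolean(f, y, eps):
    """True iff the average square error is at most the threshold eps."""
    return bool(np.mean((f - y) ** 2) <= eps)
```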

To address the concerns we raised about Bayesian and standard regression depicted in Figure 1, consider using the $\epsilon$-Boolean with an agreement kernel,

$$\mathbb{1}_{B_\epsilon} = \mathbb{1}\big\{|f(x_i;\theta) - y_i| \le \epsilon_i \ \text{for all } i\big\},$$

where increasing $\epsilon_i$ represents the modeler being more tolerant of training errors at instance $i$. For simplicity, we assume that $\epsilon_i = \epsilon$ for all $i$. Utilizing this BVM definition allows us to solve the truncated tail problem in Bayesian regression in a simple way; details are given in Appendix B (a complete derivation for the infinite tail Gaussian data distribution is given in Appendix B.1).

Truncated Tail Solution Summary
Let the data be known to have the truncated pdf $y_i \sim \mathcal{U}(a_i, b_i)$ for each instance $i$. By using the $\epsilon$-Boolean, we introduce leniency into the regression in that the model no longer needs to exactly pass through all of the data distributions simultaneously to count as a "fit". This produces likelihood functions such as,

$$p(A \mid \theta, B_\epsilon) = \prod_{i=1}^{N} \frac{\beta_i - \alpha_i}{b_i - a_i}, \qquad (13)$$

where $\alpha_i$ and $\beta_i$ are defined by the boundaries of the intersection of the data uncertainty and the model's tolerance $\epsilon$,

$$\alpha_i = \max\big(a_i,\; f(x_i;\theta) - \epsilon\big), \qquad \beta_i = \min\big(b_i,\; f(x_i;\theta) + \epsilon\big),$$

with the factor taken to be zero when the intersection is empty ($\beta_i < \alpha_i$). An illustration of how the BVM works with truncated data distributions is shown in Figure 2 below. For example, at instance $i$, the interval $[\alpha_i, \beta_i]$ is found by intersecting the intervals $[a_i, b_i]$ and $[f(x_i;\theta) - \epsilon,\, f(x_i;\theta) + \epsilon]$. Note that this applies to all instances $i$. In this case, the likelihood is non-zero, resulting in a non-zero evidence (eq. (12)). Thus, given this agreement definition, the probability of finding a model given the truncated data is non-zero.

Figure 2: Using BVM results in a non-zero probability of finding a model given the observed truncated data.

Now, if we consider the special case where the data is completely certain, i.e. deterministic, $y = d$, then the likelihood function is

$$p(A \mid \theta, B_\epsilon) = \prod_{i=1}^{N} \mathbb{1}\big\{|f(x_i;\theta) - d_i| \le \epsilon\big\}, \qquad (14)$$

which can be seen as a relaxed general form of the delta function adopted in Bayesian model testing (recovered in the limit $\epsilon \to 0$). It implies that the model output must be within $\epsilon$ of the observed measurements in order for them to agree. An analogous $\epsilon$-Boolean solution exists for standard regression, which leads to nonsingular parameter distributions whether the regression is regularized or not.
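Below is a minimal sketch (our own illustration, under the same assumed notation) of the BVM likelihoods in (13) and (14): the truncated-data case multiplies the normalized lengths of the interval intersections, and the certain-data case reduces to an indicator of the epsilon tolerance.

```python
import numpy as np

def bvm_likelihood_truncated(f, a, b, eps):
    """Equation (13): product over i of (beta_i - alpha_i) / (b_i - a_i),
    where [alpha_i, beta_i] is the intersection of the data interval
    [a_i, b_i] with the tolerance interval [f_i - eps, f_i + eps]."""
    alpha = np.maximum(a, f - eps)
    beta = np.minimum(b, f + eps)
    lengths = np.clip(beta - alpha, 0.0, None)  # empty intersection -> 0
    return np.prod(lengths / (b - a))

def bvm_likelihood_certain(f, d, eps):
    """Equation (14): 1 if every model output is within eps of the
    (completely certain) observation, 0 otherwise."""
    return float(np.all(np.abs(f - d) <= eps))
```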

Tolerant agreement as a new kind of regularization
The purpose of regularization is to better represent one's expectations of unobserved data using the chosen model or model class. Using BVM regression and nonzero agreement tolerances (e.g. $\epsilon$ in the $\epsilon$-Boolean), we can broaden the model's prediction envelope to better represent our expectations of the data. Increasing agreement tolerances naturally increases the posterior variance of the parameters, which differs from standard regularization methods and can be used to avoid conceptual issues of interpreting regularized physical parameters. It should also be noted that this is done without changing the prior distributions of the parameters or the given probability distributions of the data. This becomes a useful feature in our first example.

4 Implementation and Example

4.1 Computing the BVM Evidence

Like the Bayesian evidence, the BVM evidence is computationally expensive to calculate when one has many parameters to learn. Several approaches have been adopted to address this problem. Markov Chain Monte Carlo (MCMC) is a computational technique used for Bayesian methods that has been widely studied and improved [Hasting, Metropolis, Marzouk, mcmcacc, bayesmcmc, Neal], as it is considered an indispensable tool for Bayesian inference. Other techniques include the Nested Sampling method [FH08, Skilling2004] and the MultiNest algorithm [FH09]. Throughout the discussion, we will use MCMC to compute the BVM evidence and run our simulations.

Changing the agreement tolerance affects the acceptance rate in the MCMC loop: a larger tolerance implies a higher acceptance rate, meaning that more "candidate" samples are accepted and hence a wider posterior parameter distribution, while a lower acceptance rate implies fewer samples being accepted and hence a narrower posterior distribution of the model parameters. Therefore, the agreement tolerance directly affects the posterior inferences of the model parameters, as we will show in Section 4.2.1.
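The sketch below shows one way (our own illustration, not the authors' implementation) a Metropolis-style MCMC loop can use a BVM likelihood such as (14); the proposal scale, prior, and tolerance handling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_bvm(log_prior, bvm_likelihood, theta0, n_steps=5000, step=0.1):
    """Metropolis sampler whose target is prior(theta) * p(A | theta, B)."""
    theta = np.asarray(theta0, dtype=float)
    like = bvm_likelihood(theta)
    chain = [theta.copy()]
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=theta.shape)
        like_prop = bvm_likelihood(prop)
        # Acceptance ratio; zero-likelihood proposals are always rejected.
        log_ratio = (np.log(like_prop + 1e-300) + log_prior(prop)
                     - np.log(like + 1e-300) - log_prior(theta))
        if np.log(rng.uniform()) < log_ratio:
            theta, like = prop, like_prop
        chain.append(theta.copy())
    return np.array(chain)
```

With an indicator likelihood such as (14), the ratio simply rejects any proposal that violates the tolerance, so increasing $\epsilon$ raises the acceptance rate and widens the sampled posterior, matching the behavior described above.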

4.2 BVM Regression Examples

4.2.1 Exploratory Example 1

We consider the case study investigated in [Brown2002] using a bacterial growth model. The data is obtained by operating a continuous flow biological reactor at steady-state conditions. The observations are as follows.

S (mg/L COD): 28, 55, 83, 110, 138, 225, 375
μ (1/h): 0.053, 0.060, 0.112, 0.105, 0.099, 0.122, 0.125
Table 2: The observations we aim to fit.

where $\mu$ is the growth rate at substrate concentration $S$. We replicate the results found in [Brown2002] using the nonlinear Monod model to fit the data,

$$\mu = \mu_{\max} \frac{S}{K_s + S},$$

where $\mu_{\max}$ is the maximum growth rate (1/h), and $K_s$ is the saturation constant (mg/L COD).
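For concreteness, the sketch below (our own illustration) encodes the observations of Table 2 and the Monod model so that the BVM likelihoods and the MCMC sampler sketched earlier could be applied to the parameters $(\mu_{\max}, K_s)$; the tolerance value shown is an arbitrary placeholder.

```python
import numpy as np

# Observations from Table 2: substrate concentration S (mg/L COD)
# and measured growth rate mu (1/h).
S = np.array([28.0, 55.0, 83.0, 110.0, 138.0, 225.0, 375.0])
mu_obs = np.array([0.053, 0.060, 0.112, 0.105, 0.099, 0.122, 0.125])

def monod(S, theta):
    """Monod model: mu = mu_max * S / (K_s + S), theta = (mu_max, K_s)."""
    mu_max, K_s = theta
    return mu_max * S / (K_s + S)

def bvm_likelihood(theta, eps=0.01):
    """BVM likelihood (14) for completely certain data with tolerance eps."""
    return float(np.all(np.abs(monod(S, theta) - mu_obs) <= eps))
```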

We run MCMC on the likelihoods derived in (13) and (14), corresponding to the different types of data distributions discussed above: normal or Gaussian distribution with infinite tails, bounded or truncated uniform distribution, and completely certain observation points. We find that BVM regression is able to construct posterior inferences of the model parameters for all of these data measurement distributions, unlike Bayesian model testing and standard regression techniques, which fail at this task for truncated and completely certain data, as shown in Table 3:

Data Distribution | BVM regression | Standard regression | Bayesian regression
Infinite Tail | ✓ | ✓ | ✓
Truncated Tail | ✓ | ✗ | ✗
Completely Certain | ✓ | ✗ | ✗
Table 3: Ability of the three approaches to produce posterior parameter distributions and predictive envelopes for different types of data distributions. BVM model selection is capable of producing posterior distributions of the model parameters for any type of data distribution.

Using the BVM regressed parameter distributions, we can make predictions of $\mu$ for new values of $S$, as in (11). In addition, instead of just computing a point estimate of the fit, we should also study the predictive posterior distribution of the model (also called the envelope). As an illustration of the predictive posterior distribution of our BVM regression model, we plot the predictive envelopes of the nonlinear Monod model described above, treating the data as completely certain and using a fixed tolerance $\epsilon$.

Figure 3: Predictive envelopes of the model in the absence of data uncertainty.

The black curve shows the predicted response, which is the model fit calculated using the mean of the values of the parameters $\mu_{\max}$ and $K_s$ in the chain. The gray shaded areas correspond to 50%, 90%, 95%, and 99% predictive posterior regions (computed from the model fit for a randomly selected subset of the chain). In other words, the gray regions span 0.675, 1.645, 2, and 3 standard deviations on either side of the mean response, respectively. We will leave the interpretation of the predictive envelopes for our compound predictive envelope example in Section 4.2.2.

The value of the tolerance $\epsilon$ chosen affects the shape of the predictive envelope and the model parameter distributions. A smaller tolerance implies stricter agreement conditions between the model response and the observed data, which results in less uncertainty in the predictive posterior distribution and a narrower envelope. On the other hand, a larger tolerance implies a more flexible validation condition and results in more uncertainty in the predictive distribution, a wider envelope, and less predictive power. Thus, increasing $\epsilon$ can always result in finding a model given the data. To avoid getting very wide envelopes relative to the spread of the data, we start with a very small $\epsilon$ in the MCMC. We then keep increasing $\epsilon$ until the MCMC algorithm starts achieving a reasonably small acceptance rate for the new candidates in the chain.

Since this model has just two adaptive parameters, namely $\mu_{\max}$ and $K_s$, we can plot the prior and posterior distributions directly in parameter space. We explore the dependence between the parameter posterior distributions and the value of the tolerance $\epsilon$. Figure 4 shows the results of BVM learning for this model as the value of $\epsilon$ is decreased. For comparison, the optimal parameter values $\mu_{\max}^{*}$ and $K_s^{*}$ computed using standard regression are shown by a yellow cross in the plots in the first row.

Figure 4: Illustration of BVM learning for the Monod model for decreasing values of $\epsilon$. The first row shows the prior/posterior parameter distribution in $(\mu_{\max}, K_s)$ space. The data points are shown by blue circles in the second row. The first column corresponds to the situation before any data points are observed and shows a plot of the prior distribution in parameter space together with six samples of the model response (red lines) in which the values of $\mu_{\max}$ and $K_s$ are randomly drawn from the prior. In the second, third, and fourth columns, we see the situation after running our BVM learning using MCMC with successively smaller values of the tolerance $\epsilon$. The posterior has now been influenced by the agreement tolerance $\epsilon$, which gives a relatively compact posterior distribution. Samples from this posterior distribution lead to the functions shown in red in the second row.

As Figure 4 shows, the smaller the tolerance, the narrower and sharper the posterior distribution of the parameters, the closer the red lines get to each other, and the lower the uncertainty. This explains the shape of the predictive envelopes previously discussed. Thus, by varying $\epsilon$, one can tune the model posterior distributions to be more or less representative of the data. Note that there is no solution when $\epsilon$ goes below a certain value.

4.2.2 Exploratory Example 2

After showing how the BVM can perform regression on any type of data distribution to generate posterior parameter distributions and predictive model envelopes, we will focus on how the user can choose the Boolean function to define the model-data agreement.

In this example, we will use the compound Boolean presented in [Vanslette2019]. In this case, the definition of agreement requires the model to pass an average square error threshold of $\epsilon$ as well as a check of the probabilistic model configuration. The latter states that a specified fraction of the uncertain observations (data) should lie inside the model's confidence interval. Note that we impose the tolerance $\epsilon$ to prevent the scenario where all of the data lying within an overly wide confidence interval is marked as "agreeing". This compound Boolean is true iff

$$\frac{1}{N}\sum_{i=1}^{N}\big(f(x_i;\theta) - y_i\big)^{2} \le \epsilon \quad \text{and} \quad y_i \in \mathrm{CI}_i \ \text{for the required fraction of the } N \text{ instances},$$

where $N$ is the number of data points in the vector $y$, and $\mathrm{CI}_i$ is the model's confidence interval at instance $i$ (see the last row of Table 1). Note that, although this compound Boolean seems to be complex, it is relatively easy to code and implement, as shown in the sketch below.
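A minimal sketch of this compound Boolean (our own illustration; the 95% interval convention, the required fraction, and the use of model-output quantiles for the confidence interval are assumptions made for the example):

```python
import numpy as np

def compound_boolean(model_samples, y, eps, frac=0.95):
    """True iff (i) the mean-response average square error is <= eps and
    (ii) at least `frac` of the observations lie inside the model's
    confidence interval, taken here as the 2.5%-97.5% quantiles of the
    model output samples at each instance."""
    mean_response = model_samples.mean(axis=0)
    avg_sq_error = np.mean((mean_response - y) ** 2)
    lower = np.quantile(model_samples, 0.025, axis=0)
    upper = np.quantile(model_samples, 0.975, axis=0)
    inside_fraction = np.mean((y >= lower) & (y <= upper))
    return (avg_sq_error <= eps) and (inside_fraction >= frac)
```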

The BVM probability of agreement in this case can be expressed as,

$$p(A \mid \theta, B) = \int \mathbb{1}_{B(f(x;\theta),\, y)}\; \rho(y)\; dy.$$

Note that this likelihood can be expressed as an expectation value over the data distribution,

$$p(A \mid \theta, B) = \mathbb{E}_{y \sim \rho}\Big[\mathbb{1}_{B(f(x;\theta),\, y)}\Big], \qquad (15)$$

where $y$ denotes a data vector drawn randomly from the probability distribution $\rho$ of the uncertain data. This allows us to approximate the integral using a statistical method like Monte Carlo (MC). In this example, we approximate this expectation using MC sampling.
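The following sketch (our own illustration, reusing the hypothetical `compound_boolean` above) approximates the expectation in (15) by drawing data vectors from an assumed Gaussian measurement-uncertainty distribution around the observations and averaging the indicator.

```python
import numpy as np

rng = np.random.default_rng(3)

def prob_agreement_mc(model_samples, y_obs, y_std, eps, n_mc=1000):
    """Monte Carlo estimate of p(A | theta, B) as in (15): the fraction of
    sampled data vectors for which the compound Boolean evaluates to True."""
    hits = 0
    for _ in range(n_mc):
        y_draw = y_obs + y_std * rng.normal(size=y_obs.shape)
        hits += compound_boolean(model_samples, y_draw, eps)
    return hits / n_mc
```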

We implement the compound Boolean and show its ability to combine and quantify the average error as well as the probabilistic model representation of the uncertain data observations.

We consider data generated using a heuristic nonlinear function with added noise, where the aleatoric stochastic uncertainty due to the system's randomness takes different values over different ranges of the input. We also represent the presence of epistemic measurement uncertainty in the data with an additional normal distribution about each data point.

In order to solve this example, we consider a non-linear parametric model whose parameters are assigned a prior distribution. We run the MCMC algorithm for 5000 iterations with burn-in, first using Bayesian regression, and then using the derived likelihood in (15) with an average square error threshold $\epsilon$. In both cases, we assume the model parameters to be normally distributed around prior mean values with fixed standard deviations. The results are shown in Figure 5.

Figure 5: A) Bayesian regression under an infinite tail data distribution. Note that the confidence interval is very narrow and the standard regression method produces a nearly identical result. B) BVM regression using the compound Boolean. In this case, the confidence region is much wider and represents the data more accurately. Note that this probabilistic model passes both agreement conditions imposed by the compound Boolean. More specifically, starting with a very small $\epsilon$ in the MCMC, we keep increasing $\epsilon$ until the second element of the compound Boolean is naturally satisfied.

The BVM regression framework offers new insights into the interpretation of the predictive envelopes of Bayesian and standard regression. It is clear in Figure 5A) that the Bayesian and standard regression methods generate predictive envelopes that would not accurately predict new target points. Rather, these envelopes quantify the uncertainties in the least-squares solution due to the presence of data uncertainty. By being careful in how we define the model-data agreement (as in Figure 5B)), we were able to construct predictive envelopes that satisfy our desire to represent new target points probabilistically. In other words, using the BVM regression framework gives the user more control over the predictive envelopes and what they mean.

5 Conclusions

Using the BVM, we can perform regression and model learning for any type of data distribution and generate posterior parameter distributions and predictive model envelopes. This is accomplished through a BVM framework with user defined model-data agreement kernels, as presented in [Vanslette2019]. By adding tolerance within the agreement kernel, we were able to better represent expectations of the unmeasured data, similar to a regularization procedure. The BVM regression framework proved its potential in offering new insights into the interpretation of the predictive envelopes of Bayesian and standard regression, hence providing the user with more freedom and control over the predictive envelopes and their meaning. We found that Bayesian and standard regression are special cases of BVM regression, as the BVM can combine and generalize their features. For this reason, we find BVM regression to be a generalized regression and model learning tool. This allows us to address several potential shortcomings in Bayesian and standard regression methods. In future work, we will consider BVM learning for non-parametric models.

Acknowledgments

This work was supported by the Center for Complex Engineering Systems (CCES) at King Abdulaziz City for Science and Technology (KACST) and the Massachusetts Institute of Technology (MIT). We would like to thank all the researchers in the CCES.

References

Appendix A Bayesian Model Testing

We derive equations (4) – (7) mentioned in section 2.2. In the Bayesian model testing framework, the model output and the observed data are defined to agree only if their values are exactly equal. Thus, Bayesian model testing is a special case of the BVM where the agreement kernel is equal to the Kronecker delta function (exact agreement) with continuous indices, $\delta_{\hat{y},\, y}$. Since Bayesian model testing deals with probability densities, we have the following expression for the probability density of agreement (8):

$$p(A \mid \theta, M) = \int_{\hat{y},\, y} \delta_{\hat{y},\, y}\; p(\hat{y} \mid x, \theta, M)\, \rho(y)\; d\hat{y}\, dy.$$

The Kronecker delta and the Dirac delta functions are related as follows,

$$\delta_{\hat{y},\, y} = \delta(\hat{y} - y)\, d\hat{y}.$$

Thus, the probability density of agreement becomes,

$$p(A \mid \theta, M) = \int_{\hat{y},\, y} \rho(y)\, \delta(\hat{y} - y)\, p(\hat{y} \mid x, \theta, M)\; d\hat{y}\, dy,$$

which is equation (4) derived in section 2.2.

A.1 Normally Distributed Data

If we assume the data to be normally distributed, we get,

$$\rho(y) = \frac{1}{\sqrt{(2\pi)^{N} |\Sigma|}} \exp\!\Big(-\tfrac{1}{2}(y - d)^{T} \Sigma^{-1} (y - d)\Big),$$

where $N$ is the dimension of the training data set, $\Sigma$ is the covariance matrix, and $d$ is the vector of observed data values.
Therefore, using (4), we have,

$$p(A \mid \theta, M) = \rho\big(f(x;\theta)\big) \propto \exp\!\Big(-\tfrac{1}{2}\big(f(x;\theta) - d\big)^{T} \Sigma^{-1} \big(f(x;\theta) - d\big)\Big).$$

Therefore, the likelihood function to be used in the MCMC algorithm is

$$\mathcal{L}(\theta) \propto \exp\!\Big(-\tfrac{1}{2}\big(f(x;\theta) - d\big)^{T} \Sigma^{-1} \big(f(x;\theta) - d\big)\Big), \qquad (16)$$

which is equation (5) presented in section 2.2.

A.2 Uniformly Distributed Data

We first note that

(A.1)

Now, we assume the data to be uniformly distributed,

$$y_i \sim \mathcal{U}(a_i, b_i), \qquad i = 1, \dots, N.$$

Then, the probability density becomes:

$$\rho(y) = \prod_{i=1}^{N} \frac{1}{b_i - a_i}\, \mathbb{1}\big\{a_i \le y_i \le b_i\big\}. \qquad (A.2)$$

Notice that we can generalize to any bounded probability density function (pdf). Therefore, using (4), we have,

$$p(A \mid \theta, M) = \rho\big(f(x;\theta)\big) = \prod_{i=1}^{N} \frac{1}{b_i - a_i}\, \mathbb{1}\big\{a_i \le f(x_i;\theta) \le b_i\big\}.$$

Therefore, the likelihood function to be used in the MCMC algorithm is

$$\mathcal{L}(\theta) = \prod_{i=1}^{N} \frac{1}{b_i - a_i}\, \mathbb{1}\big\{a_i \le f(x_i;\theta) \le b_i\big\}, \qquad (17)$$

which is equation (6) presented in section 2.2.

A.3 Completely Certain Data

If we consider the data to be completely certain, i.e. deterministic, $y = d$, then the probability density becomes $\rho(y) = \prod_{i=1}^{N} \delta(y_i - d_i)$, and thus, using (4), we have,

$$p(A \mid \theta, M) = \rho\big(f(x;\theta)\big) = \prod_{i=1}^{N} \delta\big(f(x_i;\theta) - d_i\big).$$

Therefore, the likelihood function to be used in the MCMC algorithm is

$$\mathcal{L}(\theta) = \prod_{i=1}^{N} \delta\big(f(x_i;\theta) - d_i\big), \qquad (18)$$

which is equation (7) presented in section 2.2.

Appendix B BVM Model Selection

We derive equations (12) – (14) presented in section 3.1. We show how we can apply Bayesian model selection on any data distribution using the BVM probability of agreement. Starting from the original definition of the probability of agreement (8), we have:

$$p(A \mid \theta, B, M) = \int_{\hat{y},\, y} \mathbb{1}_{B(\hat{y},\, y)}\; p(\hat{y} \mid x, \theta, M)\, \rho(y)\; d\hat{y}\, dy = \int \mathbb{1}_{B(f(x;\theta),\, y)}\; \rho(y)\; dy, \qquad (19)$$

which is equation (12) derived in section 3.1. From (A.1), we know that


The probability density can be expressed as:


We also note that the compound Boolean in question can be expressed as:


Thus, we rewrite the BVM probability of agreement as follows:


We will use the Boolean indicator defined as:

$$\mathbb{1}_{B_\epsilon} = \mathbb{1}\big\{|f(x_i;\theta) - y_i| \le \epsilon_i \ \text{for all } i\big\},$$

where $\epsilon_i = \epsilon$ for all $i$.
Then, the indicator function can be rewritten as: