# Output-weighted optimal sampling for Bayesian regression and rare event statistics using few samples

For many important problems the quantity of interest (or output) is an unknown function of the parameter space (or input), which is a random vector with known statistics. Since the dependence of the output on this random vector is unknown, the challenge is to identify its statistics, using the minimum number of function evaluations. This is a problem that can been seen in the context of active learning or optimal experimental design. We employ Bayesian regression to represent the derived model uncertainty due to finite and small number of input-output pairs. In this context we evaluate existing methods for optimal sample selection, such as model error minimization and mutual information maximization. We show that the commonly employed criteria in the literature do not take into account the output values of the existing input-output pairs. To overcome this deficiency we introduce a new criterion that explicitly takes into account the values of the output for the existing samples and adaptively selects inputs from regions or dimensions of the parameter space which have important contribution to the output. The new method allows for application to a large number of input variables, paving the way for optimal experimental design in very high-dimensions.

## Authors

• 8 publications
• ### Output-Weighted Importance Sampling for Bayesian Experimental Design and Uncertainty Quantification

We introduce a class of acquisition functions for sample selection that ...
06/22/2020 ∙ by Antoine Blanchard, et al. ∙ 0

• ### Active Learning for Regression Using Greedy Sampling

Regression problems are pervasive in real-world applications. Generally ...
08/08/2018 ∙ by Dongrui Wu, et al. ∙ 0

• ### An Upgrading Algorithm with Optimal Power Law

Consider a channel W along with a given input distribution P_X. In certa...
04/02/2020 ∙ by Or Ordentlich, et al. ∙ 0

• ### A sequential sampling strategy for extreme event statistics in nonlinear dynamical systems

We develop a method for the evaluation of extreme event statistics assoc...
04/19/2018 ∙ by Mustafa A. Mohamad, et al. ∙ 2

• ### Sequential Computer Experimental Design for Estimating an Extreme Probability or Quantile

A computer code can simulate a system's propagation of variation from ra...
08/14/2019 ∙ by Hao Chen, et al. ∙ 0

• ### Fast multi-output relevance vector regression

This paper aims to decrease the time complexity of multi-output relevanc...
04/17/2017 ∙ by Youngmin Ha, et al. ∙ 0

• ### CLAIMED: A CLAssification-Incorporated Minimum Energy Design to explore a multivariate response surface with feasibility constraints

Motivated by the problem of optimization of force-field systems in physi...
06/09/2020 ∙ by Yao Song, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

For a wide range of problems in engineering and science it is essential to quantify the statistics of specific quantities of interest (or output) that depend on uncertain parameters (or input) with known statistical characteristics. The main obstacle towards this goal is that this dependence is not known a priori and numerical or physical experiments need to be performed in order to specify it. If the problem at hand allows for the generation of many input-output pairs then one can employ standard regression methods to machine learn the input-output map over the support of the input parameters and subsequently compute the statistics of the output.

However, for several problems of interest it is not possible to simulate even a moderate size of input parameters. In this case it is critical to choose the input samples carefully so that they provide the best possible information for the output of interest [3, 6, 1]. A class of problems that belong in this family is the probabilistic quantification of extreme or rare events rising from high dimensional complex systems such as turbulence [13, 5, 16, 2], networks [17], waves [4, 10, 8], and materials [18]. Of course the considered setup is not limited to extreme or rare events but it is also relevant for any problem where the aim is to quantify the input-output relationship with very few but carefully selected data points.

The described setup is a typical example of an optimal experimental design or active learning problem [3]

. Specifically, we will assume that we have already a sequence of input-output data and our goal will be to sequentially identify the next most informative input or experiment that one should perform in order to achieve fastest possible convergence for the output statistics. The problem has been studied extensively using criteria relying on mutual information theory or the Kullback–Leibler divergence (see e.g.

[12]). More recently another criterion was introduced focusing on the rapid convergence of the output statistics [11]. A common characteristics of these methods is the large computational cost associated with the resulted optimization problem that constrains applicability to low-dimensional input or parameter spaces.

The first objective of this work is to understand some fundamental limitations of popular selection criteria widely used for optimal experimental design (beyond the large computational cost). Specifically, we will examine how well these criteria distinguish and promote the input parameters that have the most important influence to the quantities of interest. The second objective is the formulation of a new, output-weighted selection approach that explicitly takes into account, beyond the uncertainty of each input parameter, each effect on the output variables, i.e. the quantities of interest. This is an important characteristic as it is often the case that a small number of input parameters controls a specific quantity of interest. The philosophy of the developed criterion is to exploit the existing samples in order to estimate which input parameters are the most influential for the input and then bias the sampling process using this information.

Beyond its intuitive and controllable character on selecting input samples according to their effect to the output statistics, the new criterion has a numerically tractable form which allows for easy computation of each value and gradient. The latter property allows for the employment of gradient optimization methods and therefore the applicability of the approach even in high-dimensional input spaces. We demonstrate ideas through several examples ranging from linear to nonlinear maps with low and high dimensional input spaces. In particular, we show that the important dependencies of given quantities of interest can be identified and quantified using a very small number of input-output pairs, allowing also for quantification of rare event statistics with minimal computational cost.

## 2 Setup

Let the input vector denote the set of parameters or system variables and

be the output vector describing the quantities of interest. The input vector can be thought as high-dimensional with known statistics described by the probability density function (pdf)

that corresponds to mean value and covariance (or correlation ). In what follows we will use

to denote pdf and an index will be used only if the random variable is not automatically implied by the argument.

A map from the input to the output variables, , exists and our aim is to approximate the statistics of the output, , using the smallest possible number of evaluations of the map . We will assume that we have already obtained some input-output pairs which we employ in order to optimize the selection of the next input that one should evaluate. This problem can be seen as an optimal experimental design problem where the experimental parameters that one is optimizing coincide with the random input parameters. All the results/methods presented in this work can be formulated in the standard setup of optimal experimental design in a straightforward way.

The first step of the approach is to employ a Bayesian regression model to represent the map

. Out choice of the Bayesian framework is dictated by our need to have a priori estimates for the model error, as those will be employed for the sample selection criteria. For simplicity we will present our results for linear regression models, although the extension for regression schemes with nonlinear basis functions or Gaussian process regression schemes is straightforward. We formulate a linear regression model with an input vector

that multiplies a coefficient vector to produce an output vector , with Gaussian noise added:

 y=Ax+ee∼ N(0,V)p(y|x,A,V)=N(Ax,V) (1)

The basic setup involves a given data set of pairs . We set and .

For the matrix we assume a Gaussian prior with mean and covariance for the columns, having by dimension, and for the rows, having by dimension. This has the form:

 p(A) ∼N(M,V,K) =|K|d/2|2πV|m/2exp(−12tr((A−M)TV−1(A−M)K)). (2)

Then one can obtain the posterior for the matrix [14, 9]

 p(A|D,V) ∼N(SyxS−1xx,V,Sxx) (3)

where,

 Sxx=XXT+KSyx=YXT+MK (4)

Essentially, is the data correlation of the sample input points . We choose and , where is an empirical parameter that will be optimized. Therefore, the above relations take the form

 Sxx =XXT+αI Syx =YXT

Based on the above we obtain the probability distribution function for new inputs

:

 p(y|x,D,V)=N(SyxS−1xxx,V(1+c)),c=xTS−1xxx. (5)

Then we can obtain an estimate for the probability density function of the output as

 (6)

It is important to emphasize that the output is random due to two sources: i) the uncertainty of the input vector , and ii) the uncertainty due to the model error expressed by the term . The latter is directly related to the choice of data-points and the goal is to choose these points in such a way so that the statistics of converge most rapidly.

The most notable property for this model is the fact that the model error is independent of the expected output value of the system. This fact holds true also for Gaussian Process regression (GPR) schemes or regression models that use nonlinear basis functions. This property will have very important consequences when it comes to the optimal input sample selection for the modeling of the input-output relation.

### 2.1 Properties of the data correlation Sxx

One can compute the eigenvectors and eigenvalues of the data correlation matrix,

,

 ^R=[^r1|...|^rm]∈Rm×m \ and \ σ2i, i=1,...,m.

By applying a linear transformation to the

eigendirections, we have

 xTS−1xxx=m∑i=1χ2iσ2i

Thus, the eigendirections of indicate the principal directions of maximum confidence for the linear model. The eigenvalues quantify this confidence: the larger the eigenvalue the slower the uncertainty increases (quadratically) as increases. For a new, arbitrary point, , added to the family of points we will have . By direct computation we obtain

 S′xx=N∑i=1xixTi=Sxx+hhT. (7)

If the new point belongs to the eigendirection, , where then the new data correlation will be,

 S′xx=Sxx+κ2rjrTj

It can be easily checked that under this assumption the new matrix will have the same eigenvectors. Moreover the eigenvalue will be , while all other eigenvalues will remain invariant. Therefore, by adding one more data-point along a principal direction will increase the confidence along this direction by the magnitude of this new point.

The larger the magnitude of any point we add, the larger its impact on the covariance. One can trivially increase the magnitude of the new points but this does not offer any real benefit. Moreover, in a typical realistic scenario there will be magnitude constraints. To avoid this ambiguity, typical of linear regression problems, we will fix the magnitude of the input points, i.e. , so that we can assess the direction of the new points, without being influenced by the magnitude. For nonlinear problems the input points should be chosen from a compact set, typically defined by the mechanics of the specific problem.

## 3 Fundamental limitations of standard optimal experimental design criteria

Here we consider two popular criteria that can be employed for the selection of the next most informative input sample . The first one is based on the minimization of the model error expressed by the parameter (eq. (5)), while the second one is the Kullback–Leibler divergence or equivalently the maximization of the mutual information between input and output variables, which is the standard approach in the optimal experimental design literature [3].

We hypothesize a new input point . As the corresponding output is not a priori known we will assume that it is given by the mean regression model, . The new pairs of data points will be . Under this setup the new model error will be given by , where the new data correlation matrix is given by (7). In addition, the mean estimate of the new model will remain invariant since,

 S′yxS′−1xx=[Syx+SyxS−1xxhhT][Sxx+hhT]−1=SyxS−1xx[Sxx+hhT][Sxx+hhT]−1=SyxS−1xx. (8)

### 3.1 Minimization of the mean model error

The first approach we will employ is to select by minimizing the mean value of the uncertainty parameter (eq. (5)). Using standard expressions for quadratic forms of a random variable [15] we obtain a closed expression, valid for any input distribution. More specifically, we will have:

 μc =E[xTS′−1xxx]=tr[S′−1xxCxx]+μTxS′−1xxμx=tr[S′−1xxRxx]. (9)

Moreover for the case of Gaussian input we also obtain [15]

 σ2c =Var[xTS′−1xxx]=2tr[S′−1xxCxxS′−1xxCxx]+4μTxS′−1xxCxxS′−1xxμx (10)

We note that the model uncertainty depends only on the statistics of the input (expressed through the covariance ) and the samples (expressed through the constant (i.e. non-dependent on ) matrix ). In other words, the matrix and the output distribution play no role on the mean model uncertainty.

To understand the mechanics of selecting input samples using the mean model error we assume that is diagonal with eigenvalues arranged with increasing order. We also assume that samples are collected only along the principal directions of the input covariance. In this case the quantity that is minimized takes the form

 μc(h)=tr[S′−1xxRxx]=d∑i=1σ2i+μ2xini+δik,   hi=δik∈Sm−1,

where denotes the number of samples in the direction. One should choose or equivalently according to value of the derivative of . In particular,

 hi=δij,  j=argmink(−σ2k+μ2xkn2k).

If all directions have been sampled with equal number (e.g. each of the directions have ), sampling will continue with the most uncertain direction. After sufficient sampling in this direction, the addition of a new sample will cause smaller effect than sampling the next most important direction and this is when the scheme will change sampling direction. This behavior guarantees that the scheme will never get ‘trapped’ in one direction. It will continuously evolve, as more samples in one direction lead to very small eigenvalue of along this direction, and therefore sampling along another input direction will cause bigger contribution to the trace.

It is clear that sampling based on the uncertainty parameter searches only in directions with important uncertainty, while the impact of each input variable is completely neglected. Therefore, even directions that have zero effect on the output variable will still be sampled as long as they are uncertain.

### 3.2 Maximization of the mutual information

An alternative approach for the selection of a new sample, , is maximizing the entropy transfer or mutual information between the input and output variables, when a new sample is added [3]:

 I(x,y|D′,V)=Ex+Ey|D′−Ex,y|D′. (11)

This is also equivalent with maximizing the mean value of the Kullback–Leibler divergence [6]

 Ey[DKL[p(x|y,D′)||p(x)]] =∫y∫xp(x|y,D′,V)logp(x|y,D′,V)p(x)dxp(y|D′,V)dy =∫y∫xp(x,y|D′,V)logp(x,y|D′,V)p(x)p(y|D′,V)dxdy =I(x,y|D′,V).

We first compute the entropy of :

 Ex,y(h) =∬p(y,x|D′,V)logp(y,x|D′,V)dxdy =∬p(y|x,D′,V)p(x)logp(y|x,D′,V)dxdy+∬p(y|x,D′,V)p(x)logp(x)dxdy =∫Ey|x(x;h) p(x)dx+∫p(x)logp(x)dx =Ex[Ey|x(x;h)]+Ex.

We focus on computing the first term on the right hand side. For the linear regression model, the conditional error follows a Gaussian distribution. From standard expressions about the entropy of a multivariate Gaussian we have

 Ey|x(x;h)=12log(1+c)d|2πeV|=d2log(1+c(x;h))+12log|2πeV|.

Therefore,

 Ex[Ey|x(x;h)]=d2Ex[log(1+c(x;h))]+12log|2πeV|.

In the general case, we cannot compute the entropy of the output, conditional on . To this end, the mutual information of the input and output, conditioned on takes the form

 I(x,y|D′,V)=Ey(h)−d2Ex[log(1+c(x;h))]−12log|2πeV|. (12)

This expression is valid for any input distribution and relies only on the assumption of Bayesian linear regression. To compute the involved terms, one has to perform a Monte-Carlo or importance sampling approach, even for linear regression models and Gaussian inputs. This, of course, limits the applicability of the approach to very low-dimensional input spaces.

#### 3.2.1 Gaussian approximation of the output

To overcome this computational obstacle one can consider an analytical approximation of the mutual entropy, assuming Gaussian statistics for the output. This assumption is not true in general, even for Gaussian input, because of the multiplication of the (Gaussian) uncertain model parameters (matrix ) with the the Gaussian input (vector ).

We focus on the computation of the entropy of the output , so that we can derive an expression for the mutual information. We will approximate the pdf for through its second order statistics. Given that the input variable is Gaussian and the exact model is linear the Gaussian approximation for the output is asymptotically accurate. Still, it will help us to obtain an understanding of how the criterion works to select new samples.

We express the covariance of the output variable using the law of total variance

 Cyy(D′,V) =Ex[Cyy|x(D′,V)]+% Cov[Ey(y|x,D′,V)]. (13)

The first term is the average of the updated conditional covariance of the output variables and it is capturing the regression error. The second term expresses the covariance due to the uncertainty of the input variable , as measured by the estimated regression model using the input data in .

As we pointed out earlier the mean model using either or remains invariant. Therefore, we have

 Cyy(D′,V)=V(1+μc(h))+SyxS−1xxCxxS−1xxSTyx.

In this way we have the approximated entropy of the output variable using a Gaussian approximation, which is also an upper bound for any other non-Gaussian distribution with the same second order statistics

 Ey(h) =12log|V(1+μc(h))+SyxS−1xxCxxS−1xxSTyx|+d2log(2πe).

Therefore, we have the second-order statistics approximation of the mutual information in terms of the new sample , denoted as :

 IG(x,y|D′,V)=12log|V(1+μc(h))+SyxS−1xxCxxS−1xxSTyx|−12log|V|−d2Ex[log(1+c(x;h))]. (14)

We observe that the second-order approximation of the mutual information criterion has minimial dependence on the output samples . Specifically, (14) depends on the uncertainty parameter

and its statistical moments, as well as the term

. However, the latter, is not coupled with the new hypothetical point and to this end the minimization of this criterion does not guarantee that the output values will be taken into account in a meaningful way. Instead, the selection of the new sample, depends primarily on the minimizing , always under the constraint , a process that depends exclusively on the current samples and the statistics of the input .

Therefore, regions of associated with large or important values of the output are not emphasized by this sampling approach and the emphasis is given in regions that minimize the mean model error . We note that these conclusions are valid for the second-order approximation of the mutual information criterion. However, the non-approximated (full) form of the criterion, eq. (12), is hard to compute and most importantly, it has no analytical expressions for its gradient with respect to and thus is not practical for high-dimensional inputs.

### 3.3 Nonlinear basis regression

Similar conclusions can be made for the case where one utilizes nonlinear basis functions. In this case we assume that the input points ‘live’ within a compact set Specifically, let the input be expressed as a function of another input where the input value has distribution and be a compact set. One can choose a set of basis functions

 x=ϕ(z). (15)

In this case the distribution of the output values will be given by :

 p(y|z,D,V)=N(SyϕS−1ϕϕϕ(z),V(1+c)),c=ϕ(z)TS−1ϕϕϕ(z). (16)

The mean of the model uncertainty parameter will become

 μc=tr[S−1ϕϕCϕϕ]+μTϕS−1ϕϕμϕ=tr[S−1ϕϕRϕϕ], (17)

where

 Sϕϕ=N∑i=1ϕ(zi)ϕ(zi)T  and   S′ϕϕ=Sϕϕ+ϕ(h)ϕ(h)T,

and

 μϕ=∫ϕ(z)fz(z)dz    and    Cϕϕ=∫(ϕ(z)−μϕ)(ϕ(z)−μϕ)Tfz(z)dz.

Following the same steps as we did for the linear model, we will have, first for the conditional entropy (assuming that the model noise in the nonlinear case is Gaussian)

 Ey|z(z;h)=12log(1+c)d|2πeV|=d2log(1+c(ϕ(z);h))+12log|2πeV|.

Therefore,

 Ez[Ey|z(z;h)]=d2Ez[log(1+c(ϕ(z);h)]+12log|2πeV|.

The exact expression for the mutual information for the nonlinear case will be:

 I(x,y|D′,V)=Ey−d2Ez[log(1+c(ϕ(z);h)]−12log|2πeV|. (18)

To perform, the second-order statistical approximation for the entropy , we follow the same steps as for the linear model case to obtain

 IG(x,y|D′,V)=12log|V(1+μc(h))+SyϕS−1ϕϕCϕϕS−1ϕϕSTyϕ|−12log|V|−d2Ez[log(1+c(ϕ(z);h)]. (19)

The sampling strategy is more complicated in this case due to the nonlinearity of the basis elements. However, even in the present setup the sampling depends exclusively on the statistics of the input variable and the form of the basis elements . The measured output values of the modeled process do not enter explicitly into the optimization procedure for the next sample, in the same fashion with the linear model.

## 4 Optimal sample selection considering the output values

We saw that selecting input samples based on either the mean model error or the mutual information does not take into account the output values of the existing samples. Here we formulate a new criterion that explicitly accounts for the outcome samples. Our goal is to compute samples that accelerate the convergence of the output statistics, expressed by the probability density function . To measure how well this convergence has occurred we are going to rely on the distance between the probability density function of the mean model

 y0=SyxS−1xxx (20)

and the perturbed model along the most important direction of the model uncertainty (dominant eigenvector of ), denoted as :

 y+=SyxS−1xxx+βrV(1+xTS′−1xxx), (21)

where is a small scaling factor. The corresponding probability density functions, and will differ only due to errors of the Bayesian regression, which vary as changes. It is therefore meaningful to select the next sample that will minimize their distance. Moreover, as we are interested on capturing the probability density function equally well in regions of low and large probability we will consider the difference between the logarithms. Specifically, we define

 DLog1(y+∥y0;h)=∫Sy|logpy+(y)−logpy0(y)|dy (22)

where

is a finite domain over which we are interested to define the criterion. Note that the latter has to be finite in order to have a bounded value for the distance. It can be chosen so that it contains several standard deviations of the output process. The defined criterion focuses exactly on our goal which is the convergence of the output statistics, while the logarithm guarantees that even low probability regions have converged as well. This criterion for selecting samples was first defined in

[11] and was shown that it results in a very effective strategy for sampling processes related to extreme events. However, it is also associated with a very expensive optimization problem that has to be solved in order to minimize this distance. Apart of the cost, its complicated form does not allow for the application of gradient methods for optimization and therefore it is practical only for low-dimensional input spaces where non-gradient methods can be applied.

We note that for bounded probability density functions

 DKL(y0∥y+;Sy)=∫Sy(logpy+(y)−logpy0(y))py0(y)dy⩽κDLog1(y+∥y0), (23)

where is a constant. To this end, the criterion based on the difference of the logarithms is more conservative (i.e. harder to minimize) compared with the Kullback–Leibler divergence (defined over the same domain).

In [11] it was also shown that for small the above distance takes the asymptotic form

 DLog1(y+∥y0;h)≃β∫∣∣ddsE[σ2y(x;h)⋅1y0(x)=s]∣∣py0(s)ds, (24)

where,

 σ2y(x;h)=tr(Var[y|x,D′])=tr(V)(1+c(x;h)),

is the conditional variance (on ) if the output is scalar or the trace of the output conditional covariance matrix in the general case, while is the mean model from the input-output data collected so far. Using standard inequalities for the derivatives of differential functions one can bound the derivative in (24). Specifically, if the function has uniformly bounded second derivative (with respect to a hypothetical new point), and has not zeros or singular points, there exists a constant such that ([7], Theorem 3.13, p. 109)

 ∫∣∣ddsE[σ2y(x)⋅1y0(x)=s]∣∣py0(s)ds⩽κ0∫E[σ2y(x)⋅1y0(x)=s]py0(s)ds.

Moreover,

 ∫E[σ2y(x)⋅1y0(x)=s]py0(s)ds =E[σ2y(x)py0(y0(x))]=∫px(x)py0(y0(x))σ2y(x)dx.

Based on this we define the output-weighted model error criterion

 Q[σ2y]=∫px(x)py0(y0(x))σ2y(x;h)dx. (25)

Because of the inequality (4) we can conclude that converge of also implies convergence of the metric . However, the criterion is much easier to compute compared with and it can be employed even in high-dimensional input spaces. With the modified criterion the output data and their pdf is taken into account explicitly. In particular, the conditional variance (or uncertainty) of the model at each input point is weighted by the probability of the input at this point, , as well as the inverse of the estimated probability of the output at the same input point, .

The term in the denominator comes as a result of considering the distance between the logarithms in (22). If we had started with

, we would have cancellation of this important term. Note that a relevant approach, based on the heuristic superposition of the outcome and the mutual information criterion was presented in

[19]

. However, there is no clear way how the two terms should be weighted or in what sense the outcome can be superimposed to the information content. We emphasize that the presented framework is not restricted to linear regression problems and it can also be applied to Bayesian deep learning problems (a task that will not be considered in this work). In addition, we have not made any assumption for the distribution of the input

.

#### A simple demonstration

To illustrate the properties of the new criterion we consider the map

 T(x)=0.1x1−0.5x2, \ where \ x∼N(0,I).

Note that the variable is more important than in determining the value of the output, given that the two input variables have the same variance. It is therefore intuitive to require more accuracy for the second direction. However, the information distance or entropy based criteria take into account only the input variable statistics to select the next sample, in which case, both directions will have equal importance. This is illustrated in Figure 1, where we present contours of the exact map, , as well as, of the input pdf, . We also present the contours of the output pdf conditional on the input, (bottom left), and the weight that is used in the criterion . Clearly, relying on the sampling criterion that uses only information about the input will not be able to approximate the map in the most important directions. On the other hand, we observe that the weight used in the criterion takes into account explicitly the importance of both the input variable statistics but also the information that one has estimated so far from the input-output samples. Here we used the exact map to demonstrate the weight function but in a realistic scenario the estimated mean model will be used to approximate the output pdf.

### 4.1 Approximation of the criterion for symmetric output pdf

Our efforts will now focus on the efficient approximation of the criterion . The first step of the approximation focuses on the denominator. This term introduces the dependence to the output data and acts as a weight to put more emphasis on regions associated with large deviations of from its mean. We will approximate the weight, , by a quadratic function that optimally represents it over the region of interest . Therefore, for the scalar case we will have

 p−1y0(y)≃p1+p2(y−μy)2, (26)

where are constants chosen so that the above expression approximates the inverse of the output pdf optimally over the region of interest. We use this expression into the definition of (eq. (25)) and obtain the approximation

 Q[σ2y] ≃p1∫p(x)σ2y(x)dx+p2∫(y0(x)−μy0)2p(x)σ2y(x)dx. (27)

Note that the first term does not depend on the output values but only on the input process. It is essentially the same term that appears in the entropy based criteria. The second term however depends explicitly on the deviation of the output process from its mean and therefore on the output data. Specifically, it has large values in regions of where the output process has important deviations from its mean, essentially promoting the sampling of these regions. The two constants provide the relative weight between the two contributions. They are computed for a Gaussian approximation of the output pdf in Appendix B. For the case where the pdf

is expected to have important skewness, i.e. asymmetry around its mean, a linear term can be included in the expansion of

, so that this asymmetry is reflected in the sampling process.

### 4.2 Linear regression with Gaussian input

For the case of liner regression the first term in the criterion (27) will take the form

 ∫p(x)σ2y(x;h)dx=σ2V(1+μc(h))=σ2V(1+tr[S′−1xxRxx]), (28)

where we have considered the case of a scalar output with . The second term of the criterion (27) will take the form

 1σ2V∫(y0(x)−μy0)2p(x)σ2y(x;h)dx =∫[(SyxS−1xx(x−μx)]2(1+xTS′−1xxx)p(x)dx =c0+∫(x−μx)TS−1xxSTyxSyxS−1xx(x−μx)xTS′−1xxxp(x)dx.

where is a constant that does not depend on

 c0=∫[(SyxS−1xx(x−μx)]2p(x)dx=tr[S−1xxSTyxSyxS−1xxCxx].

We observe that the second term depends on fourth order moments of the input process but also on the output values of the samples . This term can be computed in a closed form for the case of Gaussian input. Specifically,

 ∫(x−μx)TS−1xxSTyxSyxS−1xx(x−μx)xTS′−1xxxp(x)dx =∫x′TS−1xxSTyxSyxS−1xxx′x′TS′−1xxx′p(x′)dx′ +∫x′TS−1xxSTyxSyxS−1xxx′x′TS′−1xxμxp(x′)dx′ +∫x′TS−1xxSTyxSyxS−1xxx′μTxS′−1xxx′p(x′)dx′ +∫x′TS−1xxSTyxSyxS−1xxx′μTxS′−1xxμxp(x′)dx′,

where and is the zero-mean translation pdf of the original one. The second and third term on the right hand side vanish as they consist of third order central moments of a Gaussian random variable. For the first term we employ a theorem for the covariance of quadratic forms which gives for two symmetric matrices, and [15]:

 cov(xTAx,xTBx)=2tr(ACxxBCxx)+4μxACxxBμx

Therefore,

 E[xTAxxTBx]=2tr(ACxxBCxx)+4μxACxxBμx−tr(ACxx)tr(BCxx).

From this equation, it follows,

 ∫x′TS−1xxSTyxSyxS−1xxx′x′TS′−1xxx′p(x′)dx′ =2tr[S−1xxSTyxSyxS−1xxCxxS′−1xxCxx] −c0tr[S′−1xxCxx]

In addition, the last term becomes

 ∫x′TS−1xxSTyxSyxS−1xxx′p(x′)dx′=μTxS′−1xxμxtr[S−1xxSTyxSyxS−1xxCxx]=c0μTxS′−1xxμx.

We collect all the computed terms and obtain

 Q(h)1σ2V=p1(1+tr[S′−1xxCxx]+μTxS′−1xxμx)+p2c0(1+μTxS′−1xxμx−tr[S′−1xxCxx])+2p2tr[S−1xxSTyxSyxS−1xxCxxS′−1xxCxx]. (29)

This is the form of the criterion under the assumption of Gaussian input for the case of linear regression. For the case of zero mean input it becomes

 Q(h)1σ2V =(p1−p2c0)tr[S′−1xxCxx]+2p2tr[S−1xxSTyxSyxS−1xxCxxS′−1xxCxx]+const. (30)

The coefficients are determined using the output pdf of the estimated model through the samples (eq. (26)), i.e. the pdf of . Note that the exact form of the output pdf, used in the criterion, is not important at this stage as it only defines the weights of the criterion . For a Gaussian approximation of the output process the coefficients are given in Appendix B.

### 4.3 Nonlinear regression with Gaussian input

For the case of regression with non-linear basis the first term in the criterion (27) will take the form

 ∫p(z)σ2y(z;h)dz=σ2V(1+μc(h))=σ2V(1+tr[S′−1ϕϕCϕϕ]+μTϕS′−1ϕϕμϕ), (31)

where we have considered the case of a scalar output with . The second term of the criterion (27) will take the form