# A Survey of Constrained Gaussian Process Regression: Approaches and Implementation Challenges

Gaussian process regression is a popular Bayesian framework for surrogate modeling of expensive data sources. As part of a broader effort in scientific machine learning, many recent works have incorporated physical constraints or other a priori information within Gaussian process regression to supplement limited data and regularize the behavior of the model. We provide an overview and survey of several classes of Gaussian process constraints, including positivity or bound constraints, monotonicity and convexity constraints, differential equation constraints provided by linear PDEs, and boundary condition constraints. We compare the strategies behind each approach as well as the differences in implementation, concluding with a discussion of the computational challenges introduced by constraints.


## 1 Introduction

There has been a tremendous surge in the development and application of machine learning models in recent years due to their flexibility and capability to represent trends in complex systems [30]. The parameters of a machine learning model can often be calibrated, with sufficient data, to give high-fidelity representations of the underlying process [61, 35, 21, 22]. It is now feasible to construct deep learning models over datasets of tens of thousands to millions of data points with modern computational resources [16]. In many scientific applications, however, a large amount of training data may not be available. Unlike data from internet or text searches, computational and physical experiments are typically extremely expensive. Moreover, even if ample data exists, the machine learning model may yield behaviors that are inconsistent with physical expectations when queried in an extrapolatory regime.

To aid and improve the process of building machine learning models for scientific applications, it is desirable to have a framework that allows the incorporation of physical principles and other a priori information to supplement the limited data and regularize the behavior of the model. Such a framework is often referred to as “physics-constrained” machine learning within the scientific computing community [59, 55, 45, 9, 40, 41, 35]. Karpatne et al. [37] provide a taxonomy for theory-guided data science, with the goal of incorporating scientific consistency in the learning of generalizable models. The information used to constrain models can be simple, such as known range or positivity constraints, shape constraints, or monotonicity constraints that the machine learning model must satisfy. The constraints can also be more complex; for example, they can encode knowledge of the underlying data-generating process in the form of a partial differential equation.

To highlight the interest in “physics-informed” machine learning, there is a curated bibliography maintained by Constantine et al. [13] that has over 200 references to papers involving scientific machine learning. Additionally, several recent conferences highlight this interest [5, 83, 44, 81, 52].

Much of the existing research in physics-informed machine learning has focused on incorporating constraints in neural networks [41, 35], often through the use of objective/loss functions which penalize constraint violation [63, 60, 50, 14, 51]. Other works have focused on incorporating prior knowledge using Bayesian inference that expresses the data-generating process as dependent on a set of parameters, the initial distribution of which is determined by the available information, e.g., functional constraints [87, 33]. Unlike deterministic learning approaches, the predictions made using approximations trained with Bayesian inference are accompanied by probabilistic estimates of uncertainty/error.

Within the Bayesian regression framework, Gaussian processes (GPs) are popular for constructing “surrogates” or “emulators” of data sources that are very expensive to query. The use of GPs in a regression framework to predict a set of function values is called Gaussian Process Regression (GPR). An accurate GPR model can often be constructed using only a relatively small number of training data (e.g., tens to hundreds), consisting of pairs of input parameters and corresponding response values. Once constructed, the GPR can be thought of as a machine-learned metamodel and used to provide fast, cheap function evaluations for the purposes of prediction, sensitivity analysis, uncertainty quantification, calibration, and optimization. GP regression models are constructed with data obtained from computational simulation [27] or field data; in geostatistics, the process of applying Gaussian processes to field data has been used for decades and is frequently referred to as kriging [12].

In this survey we focus on the use of constrained GPRs that honor or incorporate a wide variety of physical constraints [67, 15, 32, 75, 61, 90, 42, 2]. Specifically, we focus on the following topics, after a short review of Gaussian process regression in Section 2. Section 3 presents an overview and a classification of constraints according to how the constraint is enforced during the construction of a GP. Section 4 discusses bound constraints, in which the GP prediction may be required to be positive, for example, or the prediction may be required to fall between upper and lower bounds. Section 5 discusses monotonicity and related convexity constraints. Constraints may also be more tightly integrated with the underlying physics: the GP can be constrained to satisfy linear operator constraints which represent physical laws expressed as partial differential equations (PDEs). This is discussed in Section 6. Section 7 discusses intrinsic boundary condition constraints. We review several different approaches for enforcing each of these constraint types. Finally, Section 8 is a compendium of computational details for implementing the constraints of Sections 4–7, together with a summary of computational strategies for improving GPR and brief commentary about the challenges of applying these strategies for the constrained GPs considered here.

The taxonomy we present is formulated to enable practitioners to easily query this overview for information on the specific constraint(s) they may be interested in. For approaches that enforce different constraints but have significant overlap in methodology, references are made between sections to the prerequisite subsection where the technical basis of an approach is first discussed in detail. This is done, for example, when discussing spline-based approaches which are used for both bound constraints in Section 4 and monotonicity constraints in Section 5.

Not all physical constraints can be neatly divided into the categories that we focus on in Sections 4–7. For example, with a view toward computer vision, Salzmann and Urtasun [70] considered GPR for pose estimation under rigid (constant angle and length) and non-rigid (constant length) constraints between points. They proved that linear equality constraints, if satisfied by all the data vectors, are satisfied by the posterior mean predictor of a GP. Then, at the cost of squaring the input dimension, they translated quadratic length constraints into such linear constraints for pairwise products of the input variables. In another example, Frankel et al. [20] applied GPR to predict the behavior of hyperelastic materials, in which the stress-stretch constitutive relation naturally exhibits rotational invariance. The rotational invariance was enforced by deriving a finite expansion of the Cauchy stress tensor in powers of the Finger tensor that satisfies the rotational invariance by virtue of its structure, and GPR was performed for the coefficients of the expansion. We mention these examples to illustrate that physical constraints are varied, and in some cases the method to enforce them can depend highly on the specific nature of the constraint.

Even within the selected categories represented by Sections 4–7, the literature on constrained Gaussian processes is extensive and expanding rapidly. Consequently, we cannot provide a complete survey of every instance of constrained GPR. Rather, we strive to discuss main areas of research within the field. The goal is to aid readers in selecting methods appropriate for their applications and enable further exploration of the literature. We present selected implementation details and numerical examples, giving references to the original works for further details. Many of the authors of these works have developed codebases and released them publicly. Finally, we remark that we have adopted consistent notation (established in Section 2) for GPR that does not always follow the notation of the original works exactly.

## 2 Gaussian Process Regression

This section provides an overview of unconstrained Gaussian process regression. As mentioned previously, Gaussian process models, or simply Gaussian processes, are popular because they can be used in a regression framework to approximate complicated nonlinear functions with probabilistic estimates of the uncertainty. Seminal work discussing the use of GPs as surrogate models for computational science and engineering applications include the papers of Sacks et al. [69] and Santner et al. [71] and the book by Rasmussen and Williams [64].

A Gaussian process can be viewed as a distribution over a set of functions. A random draw or sample from a GP is a realization from the set of admissible functions. Specifically, a Gaussian process is a collection of random variables for which, given any finite set of inputs $x_1, \dots, x_N$, the collection $f(x_1), \dots, f(x_N)$ has a joint multivariate Gaussian distribution. A GP is completely defined by its mean and covariance functions, which generate the mean vectors and covariance matrices of these finite-dimensional multivariate normals. Assumptions such as smoothness of samples, stationarity, and sparsity are used to construct the mean and covariance of the GP prior, and then Bayes’ rule is used to condition the prior on observational/simulation data.

The prediction of a Gaussian process with mean function $m(x)$ and covariance function $k(x, x')$ is a random vector $\mathbf{f}$ such that

$$p(\mathbf{f} \mid X) = \mathcal{N}\big(\mathbf{f};\, m(X),\, K(X,X)\big), \qquad (1)$$

where $m(X)$ denotes the vector with entries $m(x_i)$ and $K(X,X)$ denotes the matrix with entries $k(x_i, x_j)$. The multivariate normal probability distribution with mean vector $\mathbf{m}$ and covariance matrix $K$ has the form

$$\mathcal{N}(\mathbf{f};\, \mathbf{m}, K) = \frac{1}{(2\pi)^{N/2}\, |K|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{f} - \mathbf{m})^\top K^{-1} (\mathbf{f} - \mathbf{m}) \right). \qquad (2)$$

The covariance kernel function of a Gaussian process must be symmetric and positive semidefinite. Denoting the individual components of the vector $x_i$ as $x_{i\ell}$, where $\ell = 1, \dots, d$, the squared exponential kernel

$$k(x_i, x_j) = \eta^2 \exp\left[ -\frac{1}{2} \sum_{\ell=1}^{d} \left( \frac{x_{i\ell} - x_{j\ell}}{\rho_\ell} \right)^2 \right] \qquad (3)$$

is popular, but many other covariance kernels are available. The choice of a covariance kernel can have a profound impact on the GP predictions [18, 64], and several approaches to constraining GPs that we survey rely on construction of a covariance kernel specific to the constraint.
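For concreteness, the kernel matrix for (3) can be assembled in a few lines of NumPy. This is an illustrative sketch in the notation above ($\eta$ as `eta`, $\rho_\ell$ as `rho`), not code from any of the surveyed works:

```python
import numpy as np

def sq_exp_kernel(X1, X2, eta=1.0, rho=1.0):
    """Squared-exponential kernel matrix with entries k(x_i, x_j) from Eq. (3).

    X1: (n, d) array, X2: (m, d) array; rho may be a scalar or a
    length-d array of per-dimension length scales.
    """
    X1 = np.atleast_2d(X1) / rho
    X2 = np.atleast_2d(X2) / rho
    # Squared Euclidean distances between all pairs of (scaled) points.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return eta**2 * np.exp(-0.5 * sq_dists)

X = np.linspace(0.0, 1.0, 5)[:, None]
K = sq_exp_kernel(X, X, eta=1.0, rho=0.3)
```

The resulting matrix is symmetric with diagonal entries $\eta^2$, and (up to round-off) positive semidefinite, as required of a covariance kernel.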

The distribution (1), determined by the covariance kernel $k$ and the mean $m$, is referred to as the prior of the GP. If the error or noise relating the actual observations $\mathbf{y}$ collected at the set of inputs $X$ to the GP prediction $\mathbf{f}$ is assumed to be Gaussian, then the probability of observing the data given the GP prior is given by

$$p(\mathbf{y} \mid X, \mathbf{f}) = \mathcal{N}(\mathbf{y};\, \mathbf{f},\, \sigma^2 I_N). \qquad (4)$$

Here, $I_N$ denotes the $N \times N$ identity matrix. The distribution $p(\mathbf{y} \mid X, \mathbf{f})$ is referred to as the likelihood of the GP, and the Gaussian likelihood (4) is by far the most common. As discussed in Section 4.1.2, specific non-Gaussian likelihood functions can be used to enforce certain types of constraints.

The parameters in the covariance kernel function of a GP are referred to as hyperparameters of the GP. We denote them by $\theta$. For the squared exponential kernel (3), the aggregate vector of hyperparameters is $\theta = (\eta, \rho_1, \dots, \rho_d, \sigma)$, where we have included the likelihood/noise parameter $\sigma$ from (4) as a hyperparameter. In general, finding the best hyperparameters to fit the data is an important step of GPR known as training. From now on, we explicitly denote the dependence on $\theta$ of the likelihood in (4) and the prior in (1), writing these as $p(\mathbf{y} \mid X, \mathbf{f}, \theta)$ and $p(\mathbf{f} \mid X, \theta)$, respectively. The marginal likelihood is given by

$$p(\mathbf{y} \mid X, \theta) = \int p(\mathbf{y} \mid X, \mathbf{f}, \theta)\, p(\mathbf{f} \mid X, \theta)\, d\mathbf{f} \qquad (5)$$

and the log-marginal-likelihood for a GP with a zero-mean prior ($m = 0$) can be written [53, 64] as

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^\top \big( K(X,X) + \sigma^2 I_N \big)^{-1} \mathbf{y} - \frac{1}{2} \log \big| K(X,X) + \sigma^2 I_N \big| - \frac{N}{2} \log 2\pi. \qquad (6)$$

Formula (6), derived from (5), (1), and (2), is a function of the hyperparameters $\theta$ present in the kernel $k$, and can be optimized to give the most likely values of the hyperparameters given the data. This is known as maximum likelihood estimation (MLE) of the hyperparameters.
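A numerically stable way to evaluate (6) uses a Cholesky factorization of $K + \sigma^2 I_N$ rather than an explicit inverse or determinant. The sketch below (an illustration in our notation, with the squared-exponential kernel hard-coded) is a minimal example, not a reference implementation from the surveyed works:

```python
import numpy as np

def log_marginal_likelihood(X, y, eta, rho, sigma):
    """Log-marginal-likelihood (6) for a zero-mean GP with SE kernel (3)."""
    diffs = (X[:, None, :] - X[None, :, :]) / rho
    K = eta**2 * np.exp(-0.5 * np.sum(diffs**2, axis=-1))
    Ky = K + sigma**2 * np.eye(len(y))
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # -(1/2) log|Ky|
            - 0.5 * len(y) * np.log(2.0 * np.pi))

X = np.linspace(0.0, 1.0, 6)[:, None]
y = np.sin(3.0 * X[:, 0])
lml = log_marginal_likelihood(X, y, eta=1.0, rho=0.3, sigma=0.1)
```

In practice, MLE training would maximize this quantity over $\theta$, e.g., by passing its negative to a generic optimizer such as `scipy.optimize.minimize`.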

Once the hyperparameters of the GPR have been chosen, the posterior of the GP is given by Bayes’ rule,

$$p(\mathbf{f} \mid X, \mathbf{y}, \theta) = \frac{p(\mathbf{f} \mid X, \theta)\, p(\mathbf{y} \mid X, \mathbf{f}, \theta)}{p(\mathbf{y} \mid X, \theta)}. \qquad (7)$$

Given the prior (1) and the Gaussian likelihood (4), the prediction $f_*$ of a GPR at a new point $x_*$ can be calculated [64] as

$$p(f_* \mid \mathbf{y}, X, x_*, \theta) = \mathcal{N}\Big( k(x_*, X) \big( K(X,X) + \sigma^2 I_N \big)^{-1} \mathbf{y},\;\; k(x_*, x_*) - k(x_*, X) \big( K(X,X) + \sigma^2 I_N \big)^{-1} \big[ k(x_*, X) \big]^\top \Big). \qquad (8)$$

Note that the mean of this Gaussian posterior is the mean estimate of the predicted function value at $x_*$, and the variance is the estimated prediction variance of the same quantity.
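The posterior predictive (8) translates directly into NumPy. The following is a minimal illustrative sketch for a zero-mean GP with the squared-exponential kernel; the function name and default hyperparameter values are our own, not from any surveyed codebase:

```python
import numpy as np

def gp_predict(X, y, Xs, eta=1.0, rho=0.2, sigma=0.1):
    """Posterior mean and pointwise variance (8) for a zero-mean GP."""
    def k(A, B):
        d = (A[:, None, :] - B[None, :, :]) / rho
        return eta**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))
    Ky = k(X, X) + sigma**2 * np.eye(len(y))
    Ks = k(Xs, X)                                   # k(x_*, X)
    mean = Ks @ np.linalg.solve(Ky, y)
    # diag of k(x_*, x_*) - k(x_*, X) Ky^{-1} k(x_*, X)^T
    var = np.diag(k(Xs, Xs)) - np.sum(Ks * np.linalg.solve(Ky, Ks.T).T, axis=1)
    return mean, var

X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
mean, var = gp_predict(X, y, X, sigma=1e-3)
```

With a small noise level, the posterior mean nearly interpolates the training data and the predictive variance collapses near the training inputs, which is the behavior described above.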

We now preface some of the computational issues of inference in GPR that will be important for the constrained case. Firstly, when the GP likelihood is Gaussian, the posterior (8) is also Gaussian, thus it can be computed exactly and sampling from the posterior is simple. This is generally not the case when the likelihood is not Gaussian. The same issue arises if the distribution (8) is directly replaced by a non-Gaussian distribution in the course of enforcing constraints – by truncation, for example. Next, inversion of $K(X,X) + \sigma^2 I_N$, which scales as $\mathcal{O}(N^3)$, is an omnipresent issue for inference. This poor scaling to large data is compounded by the fact that increased data tends to rapidly increase the condition number of this matrix (see Section 8.3.1). Finally, optimizing the hyperparameters of the GP involves the nonconvex objective function (6); both this function and its derivatives are potentially costly and unstable to compute for the reasons just mentioned. These issues arise for conventional GPR, but throughout Sections 4–7 we shall see that constraining the GP can make them more severe. Therefore, we review potential strategies for dealing with them in Section 8.

## 3 Strategies for Constraints

There are many ways to constrain a Gaussian process model. The difficulty with applying constraints to a GP is that a constraint typically calls for a condition to hold globally – that is, for all points in an interval – for all realizations or predictions of the process. A priori, this amounts to an infinite set of point constraints for an infinite dimensional sample space of functions. This raises a numerical feasibility issue, which each method circumvents in some way. Some methods relax the global constraints to constraints at a finite set of “virtual” points; others transform the output of the GP to guarantee the predictions satisfy constraints, or construct a sample space of predictions in which every realization satisfies the constraints. This distinction should be kept in mind when surveying constrained GPs. For example, the methods in Sections 4.1, 4.3, and 6.2 enforce constraints globally. The methods in Sections 4.2 and 6.2 enforce the constraint at scattered auxiliary data points, be this a result of introducing virtual data points for constraints, incomplete knowledge, or spatial variability.

Strategies for enforcing constraints are apparent from the review of GPR in Section 2, which covers the posterior prediction, the likelihood function for the observations, the kernel prior, and the data involved in GPR. Some methods, such as the warping method of Section 4.1, simply apply a transformation to the output of GPR, so the transformed output satisfies the constraint. This transformation is essentially independent of the other components of GPR. One can instead introduce the constraints at the prediction stage, replacing the distribution (8), by augmenting the data with a discrete set of virtual points in the domain and predicting from the GP given the data and the knowledge that the constraint holds at the virtual points. An example of this is in Section 4.2. Next, the likelihood provides another opportunity to enforce constraints: one can replace the Gaussian likelihood (4) by a likelihood function such that the constraints are satisfied by the observations regardless of the output of the latent GP.

A different strategy is to design a covariance kernel for the prior (1) of the Gaussian process which enforces the constraint. Several of the methods discussed in this survey involve regression with an appropriate joint GP, defined by the constraint, which uses a “four block” covariance kernel incorporating the constraint in some of the blocks. This is the strategy used for the linear PDE constraints in Section 6.1. Such methods are based on derivations of linear transformations of GPs. These types of kernels can be combined with other strategies for constraints, such as the monotonicity constraints of Section 5.1, which use a four block covariance kernel (for $f$ and its derivative $f'$) within a likelihood approach.

Considering Gaussian processes as distributions over functions, another strategy is to consider a function space defined by a certain representation such that a global constraint can be translated into a finite set of constraints, e.g. on the coefficients of a spline expansion in Sections 4.3 and 5.3. Or a representation can be sought such that every element of the sample space satisfies the constraint before the Gaussian process (the distribution) is even introduced. The latter approach is taken in Sections 6.2 and 7; in these cases, this strategy amounts to deriving a specific kernel function related to the representation.

Finally, data provides an opportunity to constrain Gaussian processes implicitly. Some approaches involve proving that, if the data fed into a GP through the posterior formula (7) satisfies the constraint, then the GP predictions satisfy the constraint – either exactly, as for linear and quadratic equality constraints of Salzmann and Urtasun [70], or within a certain error, as in linear PDE constraints discussed in Section 6.3. These results consider properties of GPs that form the basis of such algorithms.

## 4 Bound Constraints

Bound constraints of the form $a \le f(x) \le b$ over some region of interest arise naturally in many applications. For example, regression over chemical concentration data should enforce that predicted values lie between $0$ and $1$ [66]. Bound constraints also include, as a special case, nonnegativity constraints ($f(x) \ge 0$). In this section we present three approaches for enforcing bound constraints.

### 4.1 Transformed Output and Likelihood

The most direct way to impose bound constraints on a Gaussian process involves modifying the output of the regression. One way to do this is to transform the output of the GP using a “warping” function which satisfies the bounds. The second way is to replace the Gaussian likelihood (4) by a non-Gaussian likelihood that satisfies the bounds, which is then used to obtain a posterior formula for predicting observations. The paper by Jensen et al. [32] provides an overview and comparison of these two methods, which we review below. For the subsequent discussion, we assume that we have a set of observations $z_i$ that satisfy the bound constraint $a \le z_i \le b$.

#### 4.1.1 Warping Functions

Warping functions are used to transform bounded observations $z_i$ to unbounded observations $u_i$. The unbounded field together with the observations $u_i$ are then treated with a traditional GP model using the steps outlined in Section 2. The probit function $\Phi^{-1}$, the inverse cumulative distribution function of a standard normal random variable, is commonly used as a warping function [32]. The probit function transforms bounded values $z_i \in (0, 1)$ to unbounded values via

$$u_i = \Phi^{-1}(z_i). \qquad (9)$$

The probit function is popular because, when the $z_i$ are uniformly distributed on $(0, 1)$, the transformed values will be draws from a standard normal distribution with zero mean and unit variance. For a discussion of alternative warping functions we refer the reader to Snelson et al. [74].
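The probit warp and its inverse are available directly in SciPy as `norm.ppf` and `norm.cdf`. A minimal sketch for observations bounded in $(0, 1)$ (the data values below are made up for illustration; for a general interval $(a, b)$, one would first rescale $z \mapsto (z - a)/(b - a)$):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical bounded observations in (0, 1).
z = np.array([0.12, 0.35, 0.50, 0.81, 0.97])

u = norm.ppf(z)       # probit warp (9): unbounded values suitable for standard GPR
z_back = norm.cdf(u)  # inverse warp maps GP predictions back into (0, 1)
```

Applying `norm.cdf` to any GP prediction of the warped field guarantees the back-transformed value lies in $(0, 1)$, which is the point of the method.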

#### 4.1.2 Likelihood formulations

In addition to using warping functions, bound constraints can also be enforced using non-Gaussian likelihood functions that are constructed to produce GP observations which satisfy the constraints. Given a general non-Gaussian likelihood, the posterior distribution of GPR predictions is given by (7). Unlike the posterior in (8), the posterior in this case is no longer guaranteed to be Gaussian. There are a number of parametric distribution functions with finite support that can be used for the likelihood function to constrain the GP model. Jensen et al. [32] suggest either a truncated Gaussian (see Section 8.1) or the beta distribution scaled appropriately to the interval $[a, b]$. Their results show that the beta distribution generally performs better.

Unlike the warping method of Section 4.1.1, with either a truncated Gaussian likelihood or a beta likelihood, the posterior (7) is not analytically tractable. Jensen et al. [32] compare two schemes for approximate inference and prediction using bounded likelihood functions: the Laplace approximation and expectation propagation. These approaches both use a multivariate Gaussian approximation of the posterior, but solve for the governing posterior distribution in different ways.

### 4.2 Discrete Constraints using Truncated Gaussian Distributions

By noting that a Gaussian process (1) is always trained and evaluated at a finite set of points, global constraints over an interval can be approximated by constraints at a finite set of auxiliary or “virtual” points $x_1, \dots, x_{N_c}$. This approach, introduced by Da Veiga and Marrel [15], requires constructing an unconstrained GP and then, over the virtual points, transforming this GP to a truncated multivariate Gaussian distribution

$$\mathcal{TN}(\mathbf{z};\, \mu, \Sigma, a, b) = \begin{cases} \dfrac{\mathcal{N}(\mathbf{z};\, \mu, \Sigma)}{P(a \le \mathbf{z} \le b)}, & \text{for } a \le \mathbf{z} \le b, \\[2pt] 0, & \text{otherwise,} \end{cases} \qquad (10)$$

as a postprocessing step.

More specifically, Da Veiga and Marrel [15] construct an approximation which is conditioned on a truncated multivariate Gaussian distribution at the auxiliary points. We point out how this approach affects the mean posterior predictions of the GP. The unconstrained mean predictor is conditioned on the data $\mathbf{y}$:

$$\mathbb{E}\left[ f(x_*) \,\middle|\, f(X) = \mathbf{y} \right]. \qquad (11)$$

This setup is augmented by a fixed, finite set of discrete points $x_1, \dots, x_{N_c}$, and the predictor (11) is replaced by the predictor

$$\mathbb{E}\left[ f(x_*) \,\middle|\, f(X) = \mathbf{y} \text{ and } a \le f(x_i) \le b \text{ for all } i = 1, 2, \dots, N_c \right]. \qquad (12)$$

Whereas $f(x_*)$ is normally distributed in the unconstrained case (11), in the constrained case (12) it is distributed according to the truncated multivariate normal (10).

In a few special cases, the mean and covariance of the truncated distribution can be derived analytically. In one dimension, the mean at a single prediction point, $\mathbb{E}(z_i \mid a \le z_i \le b)$, is the unconstrained mean plus a factor which incorporates the change in the probability mass of the Gaussian distribution to reflect the truncation:

$$\mathbb{E}(z_i \mid a \le z_i \le b) = \mu + \sigma\, \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}, \qquad (13)$$

where $\alpha = (a - \mu)/\sigma$, $\beta = (b - \mu)/\sigma$, and $\phi$ and $\Phi$ are the probability density function and cumulative distribution function of a univariate standard normal distribution, respectively. In general, sampling and computing the moments of (10) is computationally demanding. Da Veiga and Marrel [15] estimate moments empirically using an expensive rejection sampling procedure, based on a modified Gibbs sampler, to generate samples that honor the truncation bounds. We discuss the computational challenge of estimating the moments further in Section 8.1.

In contrast to the warping approach (Section 4.1) or the spline approach (Section 4.3), which maintain a global enforcement of the constraints, the bounds in (12) can depend on the location, $a_i \le f(x_i) \le b_i$, representing different bounds in different regions of the input domain (see Section 4 of [15] for an example). A downside of the approach described here is that it is unclear how many virtual points are needed to approximately constrain the GP globally with a prespecified level of confidence; some studies with increasing $N_c$ are presented by Da Veiga and Marrel [15]. However, if the number of points can be chosen adequately, this approach can be used to enforce not only bound constraints but also monotonicity and convexity constraints [15]; see Section 5 for more details. These types of constraints can also include linear transformations of a Gaussian process [1].
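The univariate truncated mean (13) is easy to check numerically against SciPy's `truncnorm` distribution (which is parameterized by the standardized bounds $\alpha$ and $\beta$). This is a sketch to illustrate the formula, with arbitrary example values for $\mu$, $\sigma$, $a$, and $b$:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncated_mean(mu, sigma, a, b):
    """Mean (13) of a univariate Gaussian N(mu, sigma^2) truncated to [a, b]."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    return mu + sigma * (norm.pdf(alpha) - norm.pdf(beta)) \
                       / (norm.cdf(beta) - norm.cdf(alpha))

mu, sigma, a, b = 0.3, 1.2, 0.0, 1.0
m = truncated_mean(mu, sigma, a, b)
# Reference value: scipy's truncnorm takes the standardized bounds.
m_ref = truncnorm.mean((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
```

The truncated mean necessarily lies inside $[a, b]$, whereas the unconstrained mean $\mu$ need not.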

### 4.3 Splines

Maatouk and Bay [47] present a constrained Gaussian process formulation involving splines, where they place a multivariate Gaussian prior on a class of spline functions. The constraints are incorporated through constraints on the coefficients of the spline functions. To avoid the difficulty of enforcing a bound constraint globally on an interval for all predictions, the approach in Section 4.2 enforced constraints only at a finite set of points. In contrast, the approach taken by Maatouk and Bay [47] is to instead consider a spline interpolant whose finite set of knot values are governed by a GP. This reduces the infinite-dimensional GP to a finite-dimensional one, for which the distributions of the knot values (i.e., the coefficients of the spline expansion) must be inferred. By using a set of piecewise linear splines that form a partition of unity, this approach guarantees that the set of all values between neighboring knots are bounded between the values of the knots. Thus if the knot values satisfy prescribed bound or monotonicity constraints, then so must all values in between them; that is, the global constraints are satisfied if the finite-dimensional constraints are. The problem then reduces to sampling the knot values from a truncated multivariate normal.

#### 4.3.1 GPR for spline coefficients

We first discuss the spline formulation in one input dimension, and without loss of generality assume that the process being modeled is restricted to the domain $[0, 1]$. Let $h$ be the standard tent function, i.e., the piecewise linear spline function defined by

$$h(x) = \max(1 - |x|,\, 0), \qquad (14)$$

and define the locations of the knots to be $x_i = i/M$ for $i = 0, 1, \dots, M$, with $M + 1$ total spline functions. Then for any set of spline basis coefficients $\xi_i$, the function representation is given by

$$f(x) = \sum_{i=0}^{M} \xi_i\, h\big(M(x - x_i)\big) = \sum_{i=0}^{M} \xi_i\, h_i(x). \qquad (15)$$

This function representation gives a piecewise linear interpolant of the point values $(x_i, \xi_i)$ for all $x \in [0, 1]$.

The crux of the spline approach to GPR lies in the following argument. Suppose we are given a set of $N$ data points $y_j$ at unique locations $x_j \in [0, 1]$. Define the $N \times (M+1)$ matrix $A$ such that

$$A_{ji} = h_i(x_j). \qquad (16)$$

Then any set of spline coefficients $\xi$ that satisfy the equation

$$A \xi = \mathbf{y} \qquad (17)$$

will interpolate the data exactly. Clearly, solutions to this system of equations will exist only if $A$ has full row rank $N$, which requires that any given spline basis function spans no more than two data points. Intuitively, this is because a linear function is only guaranteed to interpolate two points locally. Supposing that we make $M$ large enough to satisfy this condition, we can find multiple solutions to the system (17).
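The tent basis, the design matrix (16), and a solution of the interpolation system (17) can be sketched in a few lines. The data values below are made up for illustration; `lstsq` returns the minimum-norm coefficient vector, which is one of the many solutions when $A$ has full row rank:

```python
import numpy as np

def tent(x):
    """Standard tent function h(x) = max(1 - |x|, 0) from Eq. (14)."""
    return np.maximum(1.0 - np.abs(x), 0.0)

def design_matrix(x_data, M):
    """N x (M+1) matrix with entries h_i(x_j) = h(M (x_j - i/M))."""
    knots = np.arange(M + 1) / M
    return tent(M * (x_data[:, None] - knots[None, :]))

# Hypothetical data: each knot interval contains at most one point,
# so A has full row rank and A xi = y is solvable.
x_data = np.array([0.11, 0.42, 0.63, 0.87])
y = np.sin(2.0 * np.pi * x_data)
A = design_matrix(x_data, M=10)
xi, *_ = np.linalg.lstsq(A, y, rcond=None)  # one solution of A xi = y
```

Because the tent functions form a partition of unity on $[0, 1]$, each row of $A$ sums to one, which is the property that lets knot-value constraints control all intermediate values.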

We now assume the knot values $\xi$ to be governed by a Gaussian process with covariance function $k$ and corresponding covariance matrix $K$ over the knots. Because a linear function of a GP is also a GP, the values of $\mathbf{y}$ and $\xi$ are governed jointly [47, 42] by a GP prior of the form

$$\begin{bmatrix} \mathbf{y} \\ \xi \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} A K A^\top & A K \\ K A^\top & K \end{bmatrix} \right), \qquad (18)$$

where each entry of the covariance matrix is understood to be a matrix block. Upon observation of the data $\mathbf{y}$, the conditional distribution of the knot values subject to $A\xi = \mathbf{y}$ is given by

$$p\left( \xi \,\middle|\, \mathbf{y} = A\xi \right) = \mathcal{N}\left( \xi;\; K A^\top (A K A^\top)^{-1} \mathbf{y},\; K - K A^\top (A K A^\top)^{-1} A K \right). \qquad (19)$$

This formula is similar to that proposed by Wilson and Nickisch [88], in which a GP is interpolated to a regular grid design to take advantage of fast linear algebra. In this case, we are now interested in evaluating the distribution further conditioned on the inequality constraints given by

$$p\left( \xi \,\middle|\, \mathbf{y} = A\xi,\; \xi \in C \right) = \mathcal{TN}\left( \xi;\; K A^\top (A K A^\top)^{-1} \mathbf{y},\; K - K A^\top (A K A^\top)^{-1} A K,\; C \right), \qquad (20)$$

where $C$ is the set of coefficients satisfying the constraints (e.g., $a \le \xi_i \le b$ for bound constraints), and the truncated normal distribution is defined and discussed in Section 8.1. We illustrate bound constrained GPs using this approach in Figure 1. We discuss monotonicity constraints using this approach in Section 5.3 and constrained MLE estimation of the hyperparameters in Section 8.2. Several constraint types can be combined in this approach, in which case $C$ in (20) is a convex set defined by a set of linear inequalities in $\xi$ [42].

#### 4.3.2 Sampling

Just as for the discrete constraint method discussed in Section 4.2, sampling from the truncated normal distribution for the spline coefficients introduces a new computational challenge into the GPR framework. While we discuss this in more detail and for several dimensions in Section 8.1, we give a cursory discussion here following Maatouk and Bay [47]. We consider one dimension and a one-sided bound constraint on the coefficients $\xi$.

The original method of Maatouk and Bay [47] was to use a rejection sampling approach by sampling from the untruncated distribution with a mean shifted to the mode (or maximum a posteriori point) of the true posterior. That is, one first solves the problem

$$\xi^* = \operatorname*{arg\,min}_{\xi}\; (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) \qquad (21)$$

subject to the bound constraints $\xi \in C$, where $\mu$ and $\Sigma$ are the mean and covariance of the untruncated conditional distribution (19). This is a convex quadratic program (assuming the covariance matrix is not too ill-conditioned) and may be solved efficiently. One then draws samples from $\mathcal{N}(\xi^*, \Sigma)$ and accepts or rejects the samples based on an inequality condition, described in more detail in Maatouk and Bay [46]. This is a simple approach, but it does not perform well at larger scale. The probability of rejecting any sample increases exponentially with the number of splines $M$. Furthermore, imprecision in the mode evaluation from the optimization process can lead to a deterioration of acceptance (for example, if the computed mode only satisfies monotonicity constraints to within some solver tolerance). Other approaches to sampling from the truncated multivariate normal rely on Markov chain Monte Carlo methods, and are discussed in Section 8.1.
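To make the rejection idea concrete, the sketch below implements the naive variant: draw from the untruncated Gaussian and keep only draws satisfying a nonnegativity constraint. This is exact but wasteful in high dimensions; it omits the mode-shifted proposal and acceptance condition of Maatouk and Bay [46], which improve the acceptance rate. All names and parameter values are illustrative:

```python
import numpy as np

def sample_truncated(mu, Sigma, n_samples, lower=0.0, seed=None):
    """Naive rejection sampler for N(mu, Sigma) truncated to xi >= lower.

    Draws from the untruncated Gaussian and discards draws violating the
    constraint; the acceptance probability shrinks rapidly as the
    dimension (number of spline coefficients) grows.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(mu)))  # jitter for safety
    accepted = []
    while len(accepted) < n_samples:
        xi = mu + L @ rng.standard_normal(len(mu))
        if np.all(xi >= lower):
            accepted.append(xi)
    return np.array(accepted)

mu = np.full(3, 1.0)        # untruncated mean, well inside the feasible set
Sigma = 0.25 * np.eye(3)
samples = sample_truncated(mu, Sigma, 200, seed=0)
```

Every accepted sample satisfies the constraint by construction, and the empirical mean sits slightly above the untruncated mean, consistent with the one-dimensional formula (13).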

#### 4.3.3 Multidimensional Setting

The extension of the spline approach to higher dimensions is straightforward. The spline knots must be arranged in a regular grid with points in each dimension , and the process is restricted to a hypercube domain of size for number of dimensions . Under this restriction, the underlying function may be approximated with the spline expansion

 f(x)=d∏j=1Mj∑i=0ξijh(Mj(x−xij))=d∏j=1Mj∑i=0ξijhij(x). (22)

for knot locations and coefficients . The inference process from this representation proceeds as before, as any set of observed data values can be expressed as for the appropriately defined matrix , and the coefficient values may be inferred with a multivariate truncated normal distribution.

The primary issue with the multidimensional extension is the increase in cost. The spline approach suffers from the curse of dimensionality, since the number of spline coefficients that must be inferred scales as $M^d$ with $M$ knots per dimension, leading to $\mathcal{O}(M^{3d})$ scaling of the inference cost. This cost is further complicated by the fact that the spline formulation requires enough spline coefficients to guarantee interpolation through the data points in all dimensions. Some potential methods for addressing this computational complexity are discussed later in this work. The need for efficient sampling schemes is also heightened in the multidimensional setting, as the acceptance ratio of the simple rejection sampler discussed in Section 4.3.2 decreases as the dimensionality (i.e., the number of coefficients to infer) increases. This is partially addressed by the Gibbs sampling schemes referred to above, but those schemes also begin to lose efficiency as the size of the problem increases; for other approaches, see Section 8.1.

## 5 Monotonicity Constraints

Monotonicity constraints are an important class of “shape constraints” which are frequently required in a variety of applications. For example, Maatouk [48] applied monotonicity-constrained GPR to the output of the Los Alamos National Laboratory “Lady Godiva” nuclear reactor, which is known to be monotonic with respect to the density and radius of the spherical uranium core. Kelly and Rice [39] considered monotonic Bayesian modeling of medical dose-response curves, as did Brezger and Steiner [8] for predicting sales from various prices of consumer goods.

Roughly speaking, given a method to enforce bound constraints, monotonicity constraints can be enforced by utilizing this method to enforce nonnegativity of the derivative $f'$ of the Gaussian process in a “co-kriging” setup for the joint GP $(f, f')$. Indeed, many of the works reviewed in Section 4 considered bound, monotonicity, and convexity constraints under the general heading of “linear inequality constraints” [47, 15]. As a result, some of the methods below are based on techniques reviewed in Section 4, and we frequently refer to that section.

### 5.1 Constrained Likelihood with Derivative Information

The work of Riihimäki and Vehtari [67] enforces monotonicity of a Gaussian process using a probit model for the likelihood of the derivative observations. Probit models are often used in classification problems or binary regression when one wants to predict the probability that a particular sample belongs to a certain class (0 or 1) [64]. Here it is used to generate a probability that the derivative is positive (class 1) or not (class 0). Monotonicity is obtained if the derivative class at all the selected points is 1.

Using the probit model, the likelihood for a particular derivative observation is given in terms of the standard normal cumulative distribution function (this likelihood is the inverse of the probit function used for warping output in (9): it maps a value from $\mathbb{R}$ to $[0,1]$, representing the probability that the value is in class 1, which translates to monotonicity for this application):

$$\Phi(z) = \int_{-\infty}^{z} p_{\mathcal{N}}(t \mid 0, 1)\, dt \qquad (23)$$

where $p_{\mathcal{N}}(t \mid 0, 1)$ is the probability density function of the standard normal distribution. This likelihood is used within an expanded GPR framework that incorporates derivatives and constraints. As part of this formulation, the original GP covariance matrix, representing the covariance between data points, is extended to a “four block” covariance matrix. The full covariance matrix is composed of matrices involving the covariance between function values, the covariance between derivative values, and the covariance between function values and derivative values.

Following Riihimäki and Vehtari [67], our goal is to enforce the $d_i$-th partial derivative of $f$ at $x_i$ to be nonnegative, i.e.

$$\frac{\partial f}{\partial x_{d_i}}(x_i) \ge 0, \qquad (24)$$

at a finite set of $m$ “operating” or virtual points $X_m = \{x_i\}_{i=1}^{m}$. Using the shorthand notation

$$f'_i = \frac{\partial f}{\partial x_{d_i}}(x_i), \quad \text{and} \quad f' = \left[\frac{\partial f}{\partial x_{d_1}}(x_1)\; \ldots\; \frac{\partial f}{\partial x_{d_m}}(x_m)\right]^\top = \left[f'_1\; \ldots\; f'_m\right]^\top \qquad (25)$$

and denoting an observation of $f'_i$ by $y'_i$ (Riihimäki and Vehtari [67] use the notation $m$ rather than $y'$ for observations of $f'$), we can write

$$p\left(y'_i \mid f'_i\right) = \Phi\!\left(f'_i\, \frac{1}{\nu}\right). \qquad (26)$$

Here $\Phi$ is the cumulative distribution function of the standard normal distribution, and (26) approaches a step function as $\nu \to 0$. Note that for small $\nu$ the likelihood function in (26) forces the likelihood to be close to zero (for non-monotonicity) or one (for monotonicity) in most cases. Riihimäki and Vehtari [67] point out that (26) is more tolerant of error than a step function, and use a fixed small value of $\nu$ throughout their article.
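A small numerical sketch of this likelihood (assuming SciPy is available; the function name is ours) shows how shrinking $\nu$ sharpens the soft probit toward a hard 0/1 constraint.

```python
import numpy as np
from scipy.stats import norm

def derivative_likelihood(f_prime, nu):
    """Probit likelihood Phi(f'/nu) of (26): the probability that a
    derivative value belongs to the 'monotone' class 1."""
    return norm.cdf(f_prime / nu)

# A soft probit (large nu) tolerates slightly negative derivatives,
# while a small nu behaves like a hard 0/1 step function.
soft = derivative_likelihood(np.array([-0.5, 0.0, 0.5]), nu=1.0)
hard = derivative_likelihood(np.array([-0.5, 0.0, 0.5]), nu=1e-6)
```

With `nu=1.0` the three probabilities change gradually with the derivative value; with `nu=1e-6` they are effectively 0, 1/2, and 1.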

The joint prior is now given by:

$$p(f, f' \mid X, X_m) = \mathcal{N}\left(f_{\text{joint}} \mid 0, K_{\text{joint}}\right) \qquad (27)$$

where

$$f_{\text{joint}} = \begin{bmatrix} f \\ f' \end{bmatrix} \quad \text{and} \quad K_{\text{joint}} = \begin{bmatrix} K_{f,f} & K_{f,f'} \\ K_{f',f} & K_{f',f'} \end{bmatrix}. \qquad (28)$$

Here, the matrix $K_{f,f}$ denotes the standard covariance matrix for the GP assembled over the data locations $X$, as in Section 2, where $k$ denotes the covariance function of $f$. The matrix $K_{f',f'}$ in (28) denotes the covariance matrix between the values of the specified partial derivatives of $f$ at the operating points $X_m$:

$$[K_{f',f'}]_{ij} = \operatorname{cov}\left(f'_i, f'_j\right) = \operatorname{cov}\!\left(\frac{\partial f}{\partial x_{d_i}}(x_i), \frac{\partial f}{\partial x_{d_j}}(x_j)\right), \quad 1 \le i, j \le m. \qquad (29)$$

Riihimäki and Vehtari [67] show that the derivative process $f'$ is a GP with covariance kernel

$$\frac{\partial}{\partial x_{d_i}} \frac{\partial}{\partial x'_{d_j}} k(x, x'), \qquad (30)$$

so that

$$[K_{f',f'}]_{ij} = \frac{\partial^2 k}{\partial x_{d_i} \partial x'_{d_j}}(x_i, x_j), \quad 1 \le i, j \le m. \qquad (31)$$

This result is a special case of a linear transformation of a GP; see Section 6.1 for more details. By the same general derivation in that section, the matrix $K_{f,f'}$ represents the covariance between $f$ and $f'$, and is given by

$$[K_{f,f'}]_{ij} = \frac{\partial k}{\partial x'_{d_j}}(x_i, x_j), \quad 1 \le i \le n, \; 1 \le j \le m, \qquad (32)$$

and the matrix $K_{f',f} = K_{f,f'}^\top$ represents the covariance between $f'$ and $f$.
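As a concrete sketch, for the one-dimensional squared-exponential kernel these derivative blocks have closed forms, which the snippet below implements; the function names and the unit hyperparameters are our own illustrative choices, and both formulas can be verified against central finite differences of `k`.

```python
import numpy as np

def k(x, xp, ell=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * (x - xp) ** 2 / ell**2)

def k_f_fprime(x, xp, ell=1.0):
    """Cross-covariance cov(f(x), f'(x')) = dk/dx', as in (32)."""
    return (x - xp) / ell**2 * k(x, xp, ell)

def k_fprime_fprime(x, xp, ell=1.0):
    """Derivative covariance cov(f'(x), f'(x')) = d^2 k / dx dx', as in (31)."""
    return (1.0 / ell**2 - (x - xp) ** 2 / ell**4) * k(x, xp, ell)
```

These are the entries assembled into the blocks $K_{f,f'}$ and $K_{f',f'}$ of (28).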

Putting this all together, we have the posterior probability of the joint distribution incorporating the derivative information.

$$p(f, f' \mid y, y') = \frac{1}{Z}\, p(f, f' \mid X, X_m)\, p(y \mid f)\, p(y' \mid f') \qquad (33)$$

where $Z$ is a normalizing constant. This distribution is analytically intractable because of the non-Gaussian likelihood for the derivative components. Riihimäki and Vehtari [67] approximate (33) using expectation propagation; we used an MCMC approach to calculate the posterior distribution. This approach is illustrated for an example in Figure 2; this example is particularly challenging because the data are non-monotonic, but there is a requirement that the GP be monotonic.

We describe the MCMC approach that we used for (33). As before, $f^*$ and $y^*$ denote the estimates of these quantities at a new prediction point $x^*$:

$$p(y^* \mid x^*, x, y) = \int p(y^* \mid f^*)\, p(f^* \mid x^*, x, y)\, df^*, \qquad (34)$$
$$p(f^* \mid x^*, x, y) = \iint p(f^* \mid x^*, f, f')\, p(f, f' \mid x, y)\, df\, df'. \qquad (35)$$

Since $p(f, f' \mid x, y)$ was computed as samples from MCMC, we can approximate the posterior of $f^*$ as

$$p(f^* \mid x^*, x, y) = \mathbb{E}_{f, f' \sim p(f, f' \mid x, y)}\left[p(f^* \mid x^*, f, f')\right] \approx \frac{1}{N} \sum_{i=1}^{N} p(f^* \mid f_i, f'_i). \qquad (36)$$

The MCMC samples outlined in Equation 36 are generated over the joint vector $(f, f')$. This could be a large vector, indicating a large latent function space, which may pose a challenge to MCMC. Our experience indicates that one must start the MCMC sampling at a good initial point, obtained either by finding the maximum a posteriori (MAP) point or by using a surrogate or interpolation to find a feasible initial point for $(f, f')$.

Note that this approach leads to the question of how to select the number and placement of the operating points $X_m$. Riihimäki and Vehtari [67] point out that grid-based methods suffer from the curse of dimensionality, and that a more efficient strategy may be to successively add new operating points to $X_m$ by computing where derivatives of the GP are most likely to be negative for the current choice of $X_m$. We did not find much discussion of the placement of virtual points for this method or for the discrete constraint method in Section 4.2. The issue of optimal placement of the virtual points could be addressed with some of the low-rank methods discussed in Section 8.3.

### 5.2 Monotonicity using Truncated Gaussian Distributions

The approach discussed in Section 4.2 for bounding Gaussian processes at a finite set of virtual points can naturally be extended to enforce monotonicity constraints. Specifically, by treating the partial derivatives as GPs with covariance kernel functions given by (30), monotonicity constraints of the same form as (24) can be enforced at a discrete set of $N_c$ virtual points, i.e.

$$\frac{\partial f}{\partial x_{d_i}}(x_i) \ge 0, \quad i = 1, \ldots, N_c. \qquad (37)$$

This uses the joint Gaussian process with covariance matrix given by (28). Then, given data $y$ for $f$, Da Veiga and Marrel [15] replace the unconstrained predictor (11) by the predictor

$$\mathbb{E}\left[f(x^*) \;\middle|\; f(X) = y \;\text{ and }\; 0 \le \frac{\partial f}{\partial x_{d_i}}(x_i) \text{ for all } i = 1, 2, \ldots, N_c\right]. \qquad (38)$$

This is analogous to the predictor (12) used for bound constraints. As a postprocessing step, per (38), the Gaussian distribution of the derivative values over the virtual points is replaced by the corresponding truncated multivariate normal distribution; this distribution is discussed more in Section 8.1.

### 5.3 Monotonic splines

The spline formulation, presented in Section 4.3 to globally enforce a bound constraint, may be extended easily to enforce monotonicity constraints or other linear inequalities. For example, if $C$ is a first-order (backward or forward) finite difference matrix relating neighboring spline coefficients, then monotonicity is enforced globally by sampling the coefficient values $\xi$ at the knots subject to the constraint

$$C\xi \ge 0; \qquad (39)$$

see López-Lopera et al. [42] or Maatouk and Bay [47]. This inequality is also used in the rejection sampler of Section 4.3.2 as a constraint when identifying the MAP estimate, to increase the sampling efficiency. Bound and monotonicity constraints can be enforced simultaneously by requiring both $\xi \ge 0$ and $C\xi \ge 0$ in the sampling, though the acceptance ratio drops substantially with combined constraints.
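The constraint matrix in (39) is straightforward to assemble; the following sketch (with names of our own choosing) builds the forward-difference version and checks feasibility of a coefficient vector.

```python
import numpy as np

def first_difference_matrix(m):
    """(m-1) x m forward-difference matrix C with (C xi)_k = xi_{k+1} - xi_k,
    so C xi >= 0 states that the spline coefficients are nondecreasing."""
    C = np.zeros((m - 1, m))
    C[np.arange(m - 1), np.arange(m - 1)] = -1.0
    C[np.arange(m - 1), np.arange(1, m)] = 1.0
    return C

xi = np.array([0.0, 0.2, 0.5, 0.9])   # a nondecreasing coefficient vector
C = first_difference_matrix(len(xi))
assert np.all(C @ xi >= 0)            # feasible for the constraint (39)
```

Because the hat-basis expansion is piecewise linear, nondecreasing coefficients yield a globally nondecreasing function.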

### 5.4 Convexity

The sections above illustrated how a method for bound constraints can be used with first derivatives of a Gaussian process to enforce $f' \ge 0$, and thereby monotonicity of the GP $f$, either globally as in Section 5.3 or at a finite set of virtual points as in Sections 5.1 and 5.2. Similar nonnegativity constraints can be applied to higher derivatives of $f$ as well. In one dimension, this can be used to enforce convexity via the constraint

$$\frac{\partial^2 f}{\partial x^2} \ge 0, \qquad (40)$$

treating the left-hand side as a GP with covariance kernel

$$\frac{\partial^4 k}{\partial x^2 \partial x'^2}(x, x'). \qquad (41)$$

Although monotonicity can be enforced in arbitrary dimensions, convexity presents a challenge in dimensions greater than one, since it cannot be expressed as a simple linear inequality involving the derivatives of $f$ as in (40). As Da Veiga and Marrel [15] point out, enforcing convexity in higher dimensions requires that (40) be replaced by the condition that the Hessian of $f$ be positive semidefinite. Sylvester’s criterion yields the equivalent condition that each leading principal minor determinant of the Hessian be positive. Such inequality constraints involve polynomials in the partial derivatives of $f$. As polynomial functions of GPs are no longer GPs, the bound constraint methods in Section 4 no longer apply.

While higher-dimensional convexity constraints are outside the scope of this survey, several references we have mentioned discuss the implementation of convexity-constrained Gaussian processes in greater detail. Da Veiga and Marrel [15] discuss how convexity in one dimension of the form (40) can be enforced at virtual points using the (partially) truncated multinormal distribution, in a way analogous to Section 5.2, while convexity in two dimensions can be enforced using the elliptically truncated multinormal distribution. Maatouk and Bay [47] and López-Lopera et al. [42] point out that for the spline basis considered in Section 4.3, convexity in one dimension amounts to requiring that the successive differences of the values at the spline knots be increasing, i.e.

$$\xi_{k+1} - \xi_k \ge \xi_k - \xi_{k-1} \;\text{ for all } k. \qquad (42)$$

This is equivalent to requiring that the second-order finite differences be nonnegative. This can also easily be applied in higher dimensions to guarantee that the second partial derivatives are nonnegative globally, although this does not imply convexity.
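Condition (42) is easy to check via second-order differences; a minimal sketch (the helper name is ours):

```python
import numpy as np

def is_discretely_convex(xi):
    """Check (42): second-order differences xi_{k+1} - 2 xi_k + xi_{k-1}
    of the spline coefficients are all nonnegative."""
    return bool(np.all(np.diff(np.asarray(xi), n=2) >= 0))

assert is_discretely_convex([0.0, 0.1, 0.4, 0.9])        # widening gaps
assert not is_discretely_convex([0.0, 0.5, 0.7, 0.75])   # shrinking gaps
```

In a sampler, the same condition appears as a linear inequality $C_2 \xi \ge 0$ with $C_2$ a second-order difference matrix, analogous to (39).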

## 6 Differential Equation Constraints

Gaussian processes may be constrained to satisfy linear operator constraints of the form

$$\mathcal{L}u = f \qquad (43)$$

given data on $u$ and $f$. When $\mathcal{L}$ is a linear partial differential operator of the form

$$\mathcal{L} = \sum_{\alpha} C_\alpha(x) \frac{\partial^\alpha}{\partial x^\alpha}, \quad \alpha = (\alpha_1, \ldots, \alpha_d), \quad \frac{\partial^\alpha}{\partial x^\alpha} = \frac{\partial^{\alpha_1}}{\partial x_1^{\alpha_1}} \frac{\partial^{\alpha_2}}{\partial x_2^{\alpha_2}} \cdots \frac{\partial^{\alpha_d}}{\partial x_d^{\alpha_d}}, \qquad (44)$$

the equation (43) can be used to constrain GP predictions to satisfy known physical laws expressed as linear partial differential equations. In this section we survey methods to constrain GPs with PDE constraints of the form (43).

### 6.1 Block Covariance Kernel

The principle behind the approach of Raissi et al. [61] is that if $u$ is a GP with mean function $m(x)$ and covariance kernel $k(x, x')$,

$$u \sim \mathcal{GP}\left(m(x), k(x, x')\right) \qquad (45)$$

and if $x$ and $x'$ belong to the domain of $\mathcal{L}$, then $\mathcal{L}_x \mathcal{L}_{x'} k(x, x')$ defines a valid covariance kernel for a GP with mean function $\mathcal{L}_x m(x)$. This Gaussian process is denoted $\mathcal{L}u$:

$$\mathcal{L}u \sim \mathcal{GP}\left(\mathcal{L}_x m(x), \mathcal{L}_x \mathcal{L}_{x'} k(x, x')\right). \qquad (46)$$

Note from (44) that the operator $\mathcal{L}$ takes as input a function of a single variable $x$. When applying $\mathcal{L}$ to a function of two variables such as $k(x, x')$, we use a subscript as in (46) to denote the application of $\mathcal{L}$ in the indicated variable, i.e., considering the input to $\mathcal{L}$ as a function of the indicated variable only. Note that a special case of this, for $\mathcal{L}$ a first-order partial derivative, appeared in Section 5.1.

The notation “$\mathcal{L}u$” for the GP (46) is suggested by noting that if one could apply $\mathcal{L}$ to the samples of the GP $u$, then the mean of the resulting stochastic process would indeed be given by

$$\operatorname{mean}\left(\mathcal{L}[u](x)\right) = \mathbb{E}\left[\mathcal{L}[u](x)\right] = \mathcal{L}\,\mathbb{E}[u(x)] = \mathcal{L}m(x) \qquad (47)$$

and the covariance by

$$\operatorname{cov}\left(\mathcal{L}[u](x), \mathcal{L}[u](x')\right) = \mathcal{L}_x \mathcal{L}_{x'} \operatorname{cov}\left(u(x), u(x')\right) = \mathcal{L}_x \mathcal{L}_{x'} k(x, x'). \qquad (48)$$

This justification is formal, as in general the samples of the process defined by (46) cannot be identified as $\mathcal{L}$ applied to the samples of $u$ [36, 17]; a rigorous interpretation involves the posterior predictions and reproducing kernel Hilbert spaces of the processes $u$ and $\mathcal{L}u$ [72, 6].

If scattered measurements of the source term $f$ in (43) are available at domain points $X_2$, then these can be used to train and obtain predictions for $f$ from the GP (46) in the standard way. If, in addition, measurements of $u$ are available at domain points $X_1$, a GP co-kriging procedure can be used. In this setting, physics knowledge of the form (43) enters via the data for $f$ and can be used to improve prediction accuracy and reduce variance of the GPR of $u$. The co-kriging procedure requires forming the joint Gaussian process for $(u, f)$. Similarly to the derivative case considered in Section 5.1, the covariance matrix of the resulting GP is a four-block matrix assembled from the covariance matrix of the GP (45) for the solution $u$, the covariance of the GP (46) for the forcing function, and the cross terms. Given the covariance kernel $k$ for $u$, the covariance kernel of this joint GP is

$$k\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} x'_1 \\ x'_2 \end{bmatrix}\right) = \begin{bmatrix} k(x_1, x'_1) & \mathcal{L}_{x'} k(x_1, x'_2) \\ \mathcal{L}_x k(x_2, x'_1) & \mathcal{L}_x \mathcal{L}_{x'} k(x_2, x'_2) \end{bmatrix} = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}. \qquad (49)$$

The covariance between $u$ and $f$ is given by $\mathcal{L}_{x'} k$ in the upper right block of the kernel and can be justified by a calculation similar to (48); see Raissi et al. [61]. Similarly, the covariance between $f$ and $u$ is represented by $\mathcal{L}_x k$ in the bottom left block of the kernel. In this notation, the joint Gaussian process for $(u, f)$ is then

$$\begin{bmatrix} u(X_1) \\ f(X_2) \end{bmatrix} \sim \mathcal{GP}\!\left(\begin{bmatrix} m(X_1) \\ \mathcal{L}m(X_2) \end{bmatrix}, \begin{bmatrix} K_{11}(X_1, X_1) & K_{12}(X_1, X_2) \\ K_{21}(X_2, X_1) & K_{22}(X_2, X_2) \end{bmatrix}\right), \qquad (50)$$

where $f = \mathcal{L}u$.
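As a concrete sketch of the four-block structure in (50), consider $\mathcal{L} = d/dx$ in one dimension with a squared-exponential kernel, for which all blocks have closed forms; the function names and unit hyperparameters below are our own illustrative choices.

```python
import numpy as np

def se(X, Xp, ell=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    return np.exp(-0.5 * (X[:, None] - Xp[None, :]) ** 2 / ell**2)

def joint_covariance(Xu, Xf, ell=1.0):
    """Assemble the four-block covariance of (50) for L = d/dx acting on
    a GP with a squared-exponential kernel (closed-form derivative blocks)."""
    K11 = se(Xu, Xu, ell)                                      # cov(u, u)
    d_uf = Xu[:, None] - Xf[None, :]
    K12 = d_uf / ell**2 * se(Xu, Xf, ell)                      # L_{x'} k
    K21 = K12.T                                                # L_x k
    d_ff = Xf[:, None] - Xf[None, :]
    K22 = (1.0 / ell**2 - d_ff**2 / ell**4) * se(Xf, Xf, ell)  # L_x L_{x'} k
    return np.block([[K11, K12], [K21, K22]])

K = joint_covariance(np.linspace(0, 1, 4), np.linspace(0, 1, 3))
assert np.allclose(K, K.T)                   # a valid covariance is symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10  # and positive semidefinite
```

The symmetry and positive-semidefiniteness checks illustrate why (49) defines a valid joint kernel, a property that numerical approximations of the derivative blocks can easily spoil.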

Given data $(X_u, y_u)$ for $u$ and $(X_f, y_f)$ for $f$, the GP kernel hyperparameters $\theta$ may be trained by assembling the four-block covariance matrix in (50) with $X_1 = X_u$ and $X_2 = X_f$,

$$K_{\text{data}} = \begin{bmatrix} K_{11}(X_u, X_u) & K_{12}(X_u, X_f) \\ K_{21}(X_f, X_u) & K_{22}(X_f, X_f) \end{bmatrix} \qquad (51)$$

and minimizing the negative log-marginal-likelihood

$$-\log p(y_u, y_f \mid X_u, X_f, \theta) = \frac{1}{2}(y - m)^\top K_{\text{data}}^{-1}(y - m) + \frac{1}{2}\log\left|K_{\text{data}}\right| + \frac{N}{2}\log(2\pi), \quad \text{with } y = \begin{bmatrix} y_u \\ y_f \end{bmatrix} \text{ and } m = \begin{bmatrix} m(X_u) \\ \mathcal{L}m(X_f) \end{bmatrix}. \qquad (52)$$
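Numerically, (52) is usually evaluated via a Cholesky factorization rather than an explicit inverse and determinant; the following is a minimal sketch that assumes a zero mean unless one is supplied (the function name and jitter value are our own choices).

```python
import numpy as np

def neg_log_marginal_likelihood(K, y, m=None, jitter=1e-10):
    """Negative log marginal likelihood (52) for y ~ N(m, K),
    evaluated stably with a Cholesky factorization of K."""
    m = np.zeros_like(y) if m is None else m
    L = np.linalg.cholesky(K + jitter * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - m))  # K^{-1}(y - m)
    return (0.5 * (y - m) @ alpha
            + np.sum(np.log(np.diag(L)))       # = 0.5 * log|K|
            + 0.5 * len(y) * np.log(2 * np.pi))
```

Minimizing this quantity over the kernel and noise hyperparameters is then a standard optimization problem, though the conditioning issues discussed below become more severe for the four-block matrix.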

In the presence of noise on measurements of $u$ and $f$, a standard approach analogous to the Gaussian likelihood (4) is to introduce two noise hyperparameters $\sigma_u^2$ and $\sigma_f^2$ and replace the four-block covariance matrix (51) by

$$\begin{bmatrix} K_{11}(X_u, X_u) + \sigma_u^2 I_{N_u} & K_{12}(X_u, X_f) \\ K_{21}(X_f, X_u) & K_{22}(X_f, X_f) + \sigma_f^2 I_{N_f} \end{bmatrix}. \qquad (53)$$

The inclusion of the additional terms depending on $\sigma_u^2$ and $\sigma_f^2$ corresponds to an assumption of uncorrelated white noise on the measurements of $u$ and $f$, i.e.,

$$Y_u = u(X_u) + \epsilon_u, \quad Y_f = f(X_f) + \epsilon_f, \qquad (54)$$

with $\epsilon_u \sim \mathcal{N}(0, \sigma_u^2 I_{N_u})$ and $\epsilon_f \sim \mathcal{N}(0, \sigma_f^2 I_{N_f})$ independent, given $N_u$ data points for $u$ and $N_f$ data points for $f$.

The implementation of the constrained Gaussian process kernel (49) for constraints of the form (43) raises several computational problems. The first is the computation of $\mathcal{L}_x k$ and $\mathcal{L}_x \mathcal{L}_{x'} k$. The ideal scenario is that in which $k$ has an analytical formula and $\mathcal{L}$ is a linear differential operator, so that these expressions can be computed in closed form by hand or with symbolic computational software such as Mathematica. This was the approach used for the examples in Raissi et al. [61], Raissi et al. [62], and Raissi and Karniadakis [59], including the heat equation, Burgers' equation, the Korteweg-de Vries equation, and the Navier-Stokes equations. The nonlinear PDEs listed here were treated using an appropriate linearization. An example of the kernel being parametrized by a neural network (which allows derivatives to be computed using backpropagation) was also considered in Raissi et al. [62] for Burgers' equation.

Closed-form expressions for the covariance kernel (49) greatly simplify the implementation compared to numerical approximation of $\mathcal{L}_x k$ and $\mathcal{L}_x \mathcal{L}_{x'} k$ using finite differences or series expansions. As the size of the dataset, and therefore the size of the covariance matrix (49), increases, our numerical experiments suggest that any numerical errors in the approximation of the action of $\mathcal{L}$ rapidly lead to ill-conditioning of the covariance matrix. This in turn can lead to artifacts in the predictions or failure of maximum likelihood estimation with the constrained GP. Ill-conditioning can be reduced by adding an ad hoc regularization on the diagonal of (49), at the cost of reducing the accuracy of the regression, potentially negating the benefit of the added constraint. For more general constraints of the form (43), depending on the form of $k$ or $\mathcal{L}$, numerical methods may be unavoidable. For example, in Raissi et al. [61] and Gulian et al. [29], fractional-order PDE constraints (amounting to $\mathcal{L}$ being a nonlocal integral operator with singular kernel) were considered. For these constraints, the kernel blocks $\mathcal{L}_x k$ and $\mathcal{L}_x \mathcal{L}_{x'} k$ had no closed formula. To approximate these terms, a series expansion was used in Raissi et al. [61], and in Gulian et al. [29] a numerical method was developed involving Fourier space representations of $\mathcal{L}_x k$ and $\mathcal{L}_x \mathcal{L}_{x'} k$ with Gaussian quadrature for Fourier transform inversion.

A second problem is that the formulation (50) requires enforcing the constraint (43) at discrete points of . Therefore, even if we have complete knowledge of the constraining equation (43