Gaussian processes (GPs) are used in a great variety of real-world problems as stochastic emulators in fields such as biology, finance and robotics (Rasmussen and Williams, 2005; Murphy, 2012). In the latter, they can be used for emulating the dynamics of robots when experiments become expensive (e.g. time consuming or highly costly) (Rasmussen and Williams, 2005).
Imposing inequality constraints (e.g. boundedness, monotonicity, convexity) into GP emulators can lead to more realistic profiles guided by the physics of data (Golchi et al., 2015; Maatouk and Bay, 2017; López-Lopera et al., 2018). Some applications where constrained GP emulators have been successfully used are computer networking (monotonicity) (Golchi et al., 2015), econometrics (positivity or monotonicity) (Cousin et al., 2016), and nuclear safety criticality assessment (positivity and monotonicity) (López-Lopera et al., 2018).
, an approximation of GP emulators based on (first-order) regression splines is introduced in order to satisfy general sets of linear inequality constraints. Because of the piecewise linearity of the finite-dimensional approximation used there, the inequalities are satisfied everywhere in the input space. Furthermore, they prove that the resulting posterior distribution conditioned on both observations and inequality constraints is truncated Gaussian-distributed. Finally, it was shown in(Bay et al., 2016) that the resulting posterior mode converges uniformly to the thin plate spline solution when the number of knots of the spline goes to infinity.
Since the posterior is a truncated GP, its distribution cannot be computed in closed-form, but it can be approximated via Monte Carlo (MC) or Markov chain Monte Carlo (MCMC) (Maatouk and Bay, 2017; López-Lopera et al., 2018). Several MC and MCMC samplers have been tested in (López-Lopera et al., 2018), leading to emulators that perform well for one or two dimensional input spaces. Starting from the claim that allowing noisy observations could yield less constrained sample spaces for samplers, here we develop the corresponding approximation of constrained GP emulators when adding noise. Moreover, (constrained) GP emulators for observations that are truly noisy are also of interest for practical implementations. We test the efficiency of various MC and MCMC samplers under 1D toy examples where models without observation noise yield impractical sampling routines. We also show that, in a 5D monotonic example, our framework can be used in higher dimensions for thousands of observations, providing high-quality effective sample sizes within reasonable running times.
This paper is organised as follows. In Section 2, we introduce the finite-dimensional approximation of GP emulators with linear inequality constraints and noisy observations. In Section 3, we apply our framework to 1D synthetic examples where the consideration of noise-free observations is unworkable. We also test our framework on a 5D monotonic example with thousands of noisy observations. Finally, in Section 4, we highlight the conclusions, as well as potential future works.
2 Gaussian Process Emulators with Linear Inequality Constraints and Noisy Observations
In this paper, we aim at satisfying linear inequality constraints on Gaussian process (GP) emulators when observations are considered noisy. As an example, Figure 1 shows three types of GP emulators with training points at , , , and different inequality conditions. We used a squared exponential (SE) covariance function,
. We fixed the variance parameterand length-scale parameter . We set a noise variance to be equal to 0.5% of the variance parameter . One can observe that different types of (constrained) Gaussian priors (top) yield different GP emulators (bottom) for the same training data. One can also note that the interpolation constraints are relaxed due to the noise effect, and that the inequality constraints are still satisfied everywhere.
Next, we formally introduce the corresponding model to obtain constrained GP emulators with linear inequality constrains and noisy observations as in Figure 1.
2.1 Finite-Dimensional Approximation of Gaussian Process Emulators with Noisy Observations
Let be a centred GP on with arbitrary covariance function . Consider , with space . For simplicity, assume a spline with an equispaced set of knots such that for , with . This assumption can be relaxed for non-equispaced designs of knots as in (Larson and Bengzon, 2013), leading to similar developments as obtained in this paper but with slight differences when imposing the inequality constraints (e.g. convexity condition). In contrast to (López-Lopera et al., 2018), here we consider noisy observations for . Define as a stochastic emulator consisting of the piecewise-linear interpolation of at knots ,
where , for , with noise variance , and are hat basis functions given by
We assume that are independent, and independent of . One must note that the benefit of considering noisy observations in (2.1) is that, due to the “relaxation” of the interpolation conditions, the number of knots does not have to be larger than the number of interpolation points (assumption required in (López-Lopera et al., 2018) for the interpolation of noise-free observations). Then, for , the finite representation in (2.1) would lead to less expensive procedures since the cost of the MC and MCMC procedures below grow with the value of rather than (see Sections 2.2 and 2.3).
2.2 Imposing Linear Inequality Constraints
Now, assume that also satisfies inequality constraints everywhere in the input space (e.g. boundedness, monotonicity, convexity), i.e.
where is a convex set of functions defined by some inequality conditions. Then, the benefit of using (2.1) is that, for many constraint sets , satisfying is equivalent to satisfying only a finite number of inequality constraints at the knots (Maatouk and Bay, 2017), i.e.
with for , and a convex set on . As an example, we evaluate a GP with bounded trajectories , the convex set can be defined by . In this paper, we consider the case where is composed by a set of linear inequalities of the form
where the ’s encode the linear operations, the ’s and ’s represent the lower and upper bounds, respectively. One can note that the convex set is a particular case of where if and zero otherwise, and with bounds , , for .
). One can observe that the vectoris a centred Gaussian vector with covariance matrix . Denote , , , the matrix defined by , and the vector of noisy observations at points . Then, the distribution of conditioned on , with , is given by (Rasmussen and Williams, 2005)
Since the inequality constraints are on , one can first show that the posterior distribution of conditioned on and is truncated Gaussian-distributed (see, e.g., López-Lopera et al., 2018, for further discussion when noise-free observations are considered), i.e.
Notice that the inequality constraints are encoded in the posterior mean , the posterior covariance , and the bounds . Moreover, one must also highlight that by considering noisy observations, due to the “relaxation” of the interpolation conditions, inequality constraints can be imposed also when the observations do not fulfil the inequalities.
Finally, the truncated Gaussian distribution in (2.8) has not a closed-form expression but it can be approximated via MC or MCMC. Hence, samples of can be recovered from samples of , by solving a linear system (López-Lopera et al., 2018). As discussed in (López-Lopera et al., 2018), the number of inequalities are usually larger than the number of knots for many convex sets . If we further assume that , and that , then the solution of the linear system exists and is unique (see (López-Lopera et al., 2018) for a further discussion). Therefore, samples of can be obtained from samples of , with the formula for . The implementation of the GP emulator is summarised in Algorithm 1.
2.3 Maximum a Posteriori Estimate via Quadratic Programming
In practice, the posterior mode (maximum a posteriori estimate, MAP) of (2.8) can be used as a point estimate of unobserved quantities (Rasmussen and Williams, 2005), and as a starting state of MCMC samplers (Murphy, 2012). Let
be the posterior mode that maximises the probability density function (pdf) ofconditioned on and . Then, maximising the pdf in (2.8) is equivalent to maximise the quadratic problem
with conditional mean and conditional covariance as in (2.7). By maximising (2.9), we are looking for the most likely vector satisfying both the interpolation and inequality constraints (Bishop, 2007). One must highlight that the posterior mode of (2.8) converges uniformly to the spline solution when the number of knots (Maatouk and Bay, 2017; Bay et al., 2016). Finally, the optimisation problem in (2.9) is equivalent to
which can be solved via quadratic programming (Goldfarb and Idnani, 1982).
2.4 Extension to Higher Dimensions
The GP emulator of Section 2 can be extended to dimensional input spaces by tensorisation (see, e.g., Maatouk and Bay, 2017; López-Lopera et al., 2018, for a further discussion on imposing inequality constraints for ). Consider with input space , and a set of knots per dimension . Then, the GP emulator is given by
subject to and where , , and for ; , and are hat basis functions as defined in (2.2). We assume that are independent, independent of . Then, following a similar procedure as in Section 2, Algorithm 1 can be used with a centred Gaussian vector with an arbitrary covariance matrix .
Notice that having less knots than observations can have a great impact since the MC and MCMC samplers will then be performed in low dimensional spaces when . Moreover, for the case , the inversion of the matrix can be computed more efficiently through the matrix inversion lemma (Press et al., 1992), reducing the computational complexity to the inversion of an full-rank matrix. Therefore, the computation of the conditional distribution in (2.7) and the estimation of the covariance parameter can be achieved faster.
3 Numerical Experiments
The codes were implemented in the R programming language, based on the open source package lineqGPR (López-Lopera, 2018). This package is based on previous R software developments produced by the Dice (Deep Inside Computer Experiments) and ReDice Consortiums (e.g. DiceKriging, Roustant et al. (2012), DiceDesign, Dupuy et al. (2015), kergp, Deville et al. (2015)), but incorporating some structures of classic libraries for GP regression modelling from other platforms (e.g. the GPmat toolbox from MATLAB, and the GPy library from Python).
lineqGPR also contains implementations of different samplers for the approximation of truncated (multivariate) Gaussian distribution. Samplers are based on recent contributions on efficient MC and MCMC inference methods. Table 1 summarise some properties of the different MC and MCMC samplers used in this paper (see, e.g., Maatouk and Bay, 2016; Botev, 2017; Taylor and Benjamini, 2017; Pakman and Paninski, 2014, for a further discussion). Codes were executed on a single core of an Intel® CoreTM i7-6700HQ CPU.
|Previous R Implementations||constrKriging||TruncatedNormal||tmvtnorm||tmg|
3.1 1D Toy Example under Boundedness Constraint
Here, we use the GP framework introduced in Section 2 for emulating bounded trajectories with constant . We aim at analysing the resulting constrained GP emulator when noise-free or noisy observations are considered. The dataset is : , , , , and . We use a Matérn 5/2 covariance function,
with . We fix the variance parameter leading to highly variable trajectories. The lengthscale parameter and the noise variance are estimated via maximum likelihood (ML).
The effect of different bounds on the constrained GP emulators can be seen in Figure 2. There, we set for having emulations with high-quality of resolution, and we generated constrained emulations via RSM (Maatouk and Bay, 2016). One can observe that, since interpolation conditions were relaxed due to the influence of the noise variance , the prediction intervals are wider when bounds become closer to the observations. For the case , the noise-free GP emulator yielded costly procedures due to a small acceptance rate equal to . In contrast, when noisy observations were assumed, emulations were more likely to be accepted leading to an acceptance rate equal to .
Now, we assess all the MC and MCMC methods from Table 1 for the approximation of the truncated Gaussian posterior distribution in (2.8). We considered the examples from Figure 2. For the MCMC samplers, we used the posterior mode solution from (2.10) as the starting state of the Markov chains. This initialises the chains in a high probability region. Therefore, only few emulations have been “burned” in order to have samples that appeared to be independent of the starting state. Here, we only burned the first 100 emulations. We evaluated the performance of both MC and MCMC samplers in terms of the effective sample size (ESS):
where is the size of the sample path, and is the sample autocorrelation with lag . The ESS indicator gives an intuition on how many emulations of the sample path can be considered independent (Gong and Flegal, 2016). In order to obtain non-negative sample autocorrelations , we used the convex sequence estimator proposed in (Geyer, 1992). We then computed the ESS of each coordinate of , i.e. for
, and we evaluated the quantilesover the resulting values. The sample size has been chosen to be larger than the minimum ESS required to obtain a proper estimation of the vector (Gong and Flegal, 2016). Finally, we tested the efficiency of each sampler by computing the time normalised ESS (TN-ESS) (Lan and Shahbaba, 2016) at (worst case): .
Table 2 displays the performance indicators obtained for each samplers from Table 1. Firstly, one can observe that RSM yielded the most expensive procedures due to its high rejection rate when sampling the constrained trajectories from the posterior mode. In particular, for , and assuming noise-free observations, the prohibitively small acceptance rate of RSM leaded to costly procedures (about 7 hours) making it impractical. Secondly, although the Gibbs sampler needs to discard intermediate samples (thinning effect), it provided accurate ESS values within a moderate running time (with effective sampling rates of ). Thirdly, due to the high acceptance rates obtained by ExpT, and good exploratory behaviour of the exact HMC, both samplers provided much more efficient TN-ESS values compared to their competitors, generating thousands of effective emulations each second. Finally, as we expected, the performance of some samplers were improved when adding a noise. For RSM, due to the relaxation of the interpolation conditions, we noted that emulations were more likely to be accepted leading quicker routines: more than 150 times faster with noise (see Table 2, ).
|Bounds||Method||Without noise variance||With noise variance|
|CPU Time||ESS||TN-ESS||CPU Time||ESS||TN-ESS|
|[-1.0, 1.0]||RSM||61.30||(0.97, 1.00, 1.00)||0.02||57.64||(0.91, 1.00, 1.00)||0.02|
|ExpT||2.30||(0.98, 1.00, 1.00)||0.43||2.83||(0.96, 1.00, 1.00)||0.34|
|Gibbs||19.70||(0.84, 0.86, 0.91)||0.04||21.18||(0.75, 0.84, 0.91)||0.04|
|HMC||1.89||(0.95, 0.99, 1.00)||0.50||1.92||(0.94, 0.99, 1.00)||0.49|
|[-0.75, 0.75]||RSM||63.59||(1.00, 1.00, 1.00)||0.02||48.66||(0.95, 0.99, 1.00)||0.02|
|ExpT||3.22||(0.96, 0.99, 1.00)||0.30||3.24||(0.98, 1.00, 1.00)||0.30|
|Gibbs||20.20||(0.83, 0.86, 0.91)||0.04||18.23||(0.74, 0.84, 0.93)||0.04|
|HMC||1.46||(0.94, 1.00, 1.00)||0.64||1.28||(0.94, 0.97, 1.00)||0.73|
|[-0.6, 0.6]||RSM||242.34||(0.94, 0.97, 1.00)||0||101.20||(0.96, 1.00, 1.00)||0.01|
|ExpT||2.94||(0.94, 1.00, 1.00)||0.32||2.80||(0.98, 1.00, 1.00)||0.35|
|Gibbs||18.89||(0.80, 0.83, 0.94)||0.04||18.90||(0.77, 0.84, 0.92)||0.04|
|HMC||1.72||(0.92, 0.99, 1.00)||0.53||1.68||(0.93, 0.96, 1.00)||0.55|
|[-0.5, 0.5]||RSM||25512.77||(0.98, 1.00, 1.00)||0||157.06||(0.96, 0.99, 1.00)||0.01|
|ExpT||2.50||(0.99, 1.00, 1.00)||0.40||2.69||(0.97, 1.00, 1.00)||0.36|
|HMC||6.20||(0.86, 0.90, 0.98)||0.14||2.14||(0.52, 0.85, 0.97)||0.24|
3.2 1D Toy Example under Multiple Constraints
In (López-Lopera et al., 2018), numerical implementations were limited to noise-free observations that fulfilled the inequality constraints. In this example, we test the case when noisy observations do not necessarily satisfy the inequalities.
Consider the sigmoid function given by
We evaluated (3.2) at random values of3.2) exhibits both boundedness and non-decreasing conditions, we added those constraints into the GP emulator using the convex set
Hence, the MC and MCMC samplers will be performed on (number of inequality conditions). As a covariance function, we used a SE kernel, and we estimated the parameters via ML.
Unlike (López-Lopera et al., 2018), there is no need here to satisfy the condition , due to the noise. Therefore, the finite approximation of Section 2 can be seen as a surrogate model of standard GP emulators for . Figure 3 shows the performance of the constrained emulators via HMC for . For smaller values of , the GP emulator runs fast but with a low quality of resolution of the approximation. For example, for and , because of the linearity assumption between knots, the predictive mean presents breakpoints at the knots. On the other hand, the GP emulator yields smoother (constrained) emulations as increases (). In particular, one can observe that for , the emulator leads to a good trade-off between quality of resolution and running time (13.5 times faster than for ).
3.3 Toy Example in Higher Dimensions
In (López-Lopera et al., 2018), since the finite approximation of GPs was introduced aiming to interpolate a given number of (noise-free) observations , it was strictly necessary to have more knots than observations (). Moreover, the interpolation conditions yielded more restricted domains where the MC and MCMC samplers could not be performed efficiently. Therefore, the extension to high dimensions was challenging there (limited to ). Here, we show that, by considering noisy observations, constrained GP emulators can be performed in higher dimensions and for a high number of observations within a reasonable running time.
Consider the 5D target function given by
with . One can observe that is non-decreasing with respect to all its input variables for . Although some computations could be potentially simplified given that is a (first-order) additive function, we do not take advantage of that aiming to test our framework on a fully 5D example. We evaluated on a maximin Latin-hypercube design over at 2000 locations. A maximin Latin hypercube design is a space-filling design consisting in the iterative maximisation of the distance between two closest design points from a random Latin hypercube design. We used the simulated annealing routine maximinSA_LHS from the R package DiceDesign (Dupuy et al., 2015).
As in Section 3.2, we contaminated the resulting observations with an additive Gaussian white noise with an standard deviation equal to of the range of
. As covariance function, we used the tensor product of 1D SE kernels,
with . We estimated the parameters via ML. We set the number of knots as a trade-off between high-quality of resolution in the approximation and CPU running time. Thus, we fixed the number of knots per dimension to five, except for the third and fifth dimensions (i.e. for ). For the third dimension, we let due to the linearity of the target function with respect to . For the fifth dimension, we fixed since varies more through this dimension compared to the other ones. Hence, this leads to a total number of knots.
Figure 4 shows the predictions obtained by the constrained GP emulator under monotonicity conditions. The proposed emulator was obtained via HMC, with a CPU running time of 44.8 minutes for generating monotonic trajectories. The ESS values are equal to at quantiles , providing an effective sample rate of at quantile . Besides the efficient and fast performance of HMC, one can observe from Figure 4 that the GP emulator was able to capture the non-decreasing dynamics of the target function everywhere.
Finally, we evaluated the quality of the predictions using the criterion. This criterion is given by , where SMSE is the standardised mean squared error (Rasmussen and Williams, 2005), and is equal to one if the predictive mean is equal to the test data and lower than one otherwise. We tested both the predictive mean and mode, obtaining values equal to and (respectively). Figure 5 plots the predictive mean and the predictive mode vs the true (noise-free) observations. Since the predictive mode of (2.10) can be obtained much faster than the predictive mean (requiring only a couple of seconds), one may suggest it as an accurate point prediction of the conditional process.
We have introduced a constrained GP emulator with linear inequality conditions and noisy observations. By relaxing the interpolation of observations through a noise effect, MC and MCMC samplers are performed in less restrictive sample spaces. This leads to faster emulators while preserving high effective sampling rates. As seen in the experiments, the Hamiltonian Monte Carlo sampler from (Pakman and Paninski, 2014) usually outperformed its competitors, providing much more efficient effective sample rates in high dimensional sample spaces.
Since there is no need of having more knots than observations (), the computational complexity of MC and MCMC samplers is independent of . Therefore, since the samplers are performed on , they can be used for large values of by letting . As shown in the 5D monotonic example, effective monotone emulations can be obtained within reasonable running times (about tens of minutes).
Despite the improvements obtained here for scaling the monotonic GP emulator in higher dimensions, its tensor structure makes it impractical for tens of input variables. We believe that this limitation could be mitigated by using other types of designs of the knots (e.g. sparse designs). In addition, supplementary assumptions on the nature of the target function can also be made to reduce the dimensionality of the sample spaces where MC and MCMC samplers are performed (e.g. additivity).
This research was conducted within the frame of the Chair in Applied Mathematics OQUAIDO, gathering partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (CNRS, Ecole Centrale de Lyon, Mines Saint-Etienne, University of Grenoble, University of Nice, University of Toulouse) around advanced methods for Computer Experiments.
- Bay et al. (2016) Bay, X., Grammont, L., and Maatouk, H. (2016). Generalization of the Kimeldorf-Wahba correspondence for constrained interpolation. Electronic journal of statistics , 10(1):1580–1595.
- Bishop (2007) Bishop, C. M. (2007). Pattern Recognition And Machine Learning (Information Science And Statistics). Springer.
- Botev (2017) Botev, Z. I. (2017). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):125–148.
- Cousin et al. (2016) Cousin, A., Maatouk, H., and Rullière, D. (2016). Kriging of financial term-structures. European Journal of Operational Research, 255(2):631–648.
- Deville et al. (2015) Deville, Y., Ginsbourger, D., Roustant, O., and Durrande, N. (2015). kergp: Gaussian Process Models with Customised Covariance Kernels. https://cran.r-project.org/web/packages/kergp/index.html.
- Dupuy et al. (2015) Dupuy, D., Helbert, C., and Franco, J. (2015). DiceDesign and DiceEval: Two R packages for design and analysis of computer experiments. Journal of Statistical Software, 65(i11).
- Geyer (1992) Geyer, C. J. (1992). Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483.
- Golchi et al. (2015) Golchi, S., Bingham, D. R., Chipman, H., and Campbell, D. A. (2015). Monotone emulation of computer experiments. SIAM/ASA Journal on Uncertainty Quantification, 3(1):370–392.
- Goldfarb and Idnani (1982) Goldfarb, D. and Idnani, A. (1982). Dual and primal-dual methods for solving strictly convex quadratic programs. In Hennart, J. P., editor, Numerical Analysis, pages 226–239, Berlin, Heidelberg. Springer Berlin Heidelberg.
- Gong and Flegal (2016) Gong, L. and Flegal, J. M. (2016). A practical sequential stopping rule for high-dimensional Markov Chain Monte Carlo. Journal of Computational and Graphical Statistics, 25(3):684–700.
Lan and Shahbaba (2016)
Lan, S. and Shahbaba, B. (2016).
Sampling Constrained Probability Distributions Using Spherical Augmentation, pages 25–71. Springer International Publishing, Cham.
- Larson and Bengzon (2013) Larson, M. G. and Bengzon, F. (2013). The Finite Element Method: Theory, Implementation, and Applications. Springer Publishing Company, Incorporated.
- López-Lopera (2018) López-Lopera, A. F. (2018). lineqGPR: Gaussian process regression models with linear inequality constraints. https://cran.r-project.org/web/packages/lineqGPR/index.html.
- López-Lopera et al. (2018) López-Lopera, A. F., Bachoc, F., Durrande, N., and Roustant, O. (2018). Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA Journal on Uncertainty Quantification, 6(3):1224–1255.
Maatouk and Bay (2016)
Maatouk, H. and Bay, X. (2016).
A New Rejection Sampling Method for Truncated Multivariate Gaussian Random Variables Restricted to Convex Sets, pages 521–530. Springer International Publishing, Cham.
- Maatouk and Bay (2017) Maatouk, H. and Bay, X. (2017). Gaussian process emulators for computer experiments with inequality constraints. Mathematical Geosciences, 49(5):557–582.
- Murphy (2012) Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective (Adaptive Computation And Machine Learning Series). The MIT Press.
- Pakman and Paninski (2014) Pakman, A. and Paninski, L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.
- Press et al. (1992) Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C (2Nd Ed.): The Art of Scientific Computing. Cambridge University Press, New York, NY, USA.
- Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Roustant et al. (2012) Roustant, O., Ginsbourger, D., Deville, Y., et al. (2012). DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. Journal of Statistical Software, 51(i01).
- Taylor and Benjamini (2017) Taylor, J. and Benjamini, Y. (2017). RestrictedMVN: multivariate normal restricted by affine constraints. https://cran.r-project.org/web/packages/restrictedMVN/index.html. [Online; 02-Feb-2017].