Bayesian learning of orthogonal embeddings for multi-fidelity Gaussian Processes

08/05/2020 ∙ by Panagiotis Tsilifis, et al. ∙ General Electric 7

We present a Bayesian approach to identify optimal transformations that map model input points to low dimensional latent variables. The "projection" mapping consists of an orthonormal matrix that is considered a priori unknown and needs to be inferred jointly with the GP parameters, conditioned on the available training data. The proposed Bayesian inference scheme relies on a two-step iterative algorithm that samples from the marginal posteriors of the GP parameters and the projection matrix respectively, both using Markov Chain Monte Carlo (MCMC) sampling. In order to take into account the orthogonality constraints imposed on the orthonormal projection matrix, a Geodesic Monte Carlo sampling algorithm is employed, that is suitable for exploiting probability measures on manifolds. We extend the proposed framework to multi-fidelity models using GPs including the scenarios of training multiple outputs together. We validate our framework on three synthetic problems with a known lower-dimensional subspace. The benefits of our proposed framework, are illustrated on the computationally challenging three-dimensional aerodynamic optimization of a last-stage blade for an industrial gas turbine, where we study the effect of an 85-dimensional airfoil shape parameterization on two output quantities of interest, specifically on the aerodynamic efficiency and the degree of reaction.



There are no comments yet.


page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In addressing reliability and design optimization challenges in modern engineering and manufacturing processes, the need for an increasing accuracy in the predictive capabilities of the analysis codes has been of paramount importance. Taking into account the presence of aleatoric uncertainties resulting from the severe complexity of the physical system under investigation and the limited knowledge available to the experimenter, simulation-based uncertainty quantification (UQ) tasks can result in computational costs that are unacceptable in practice smith_uq ; lemaitre ; koch_size .

Metamodeling techniques wang_review have gained increasing popularity in recent years, as they provide a way to alleviate the computational burden associated with quantifying uncertainties in computer model outputs and inverse modeling morokoff ; tarantola . The key idea is to replace the expensive computer solver with a surrogate that is cheap to evaluate and at the same time it maintains high predictive accuracy. These metamodels can consist of functional representations of the forward solver such as Polynomial Chaos ghanem_spanos ; lemaitre_knio ; xiu_karniadakis , kernel-based methods cristianini ; cortes

, relevance vector machines

bilionis_rvm ; tsilifis_rvm

and neural networks

tripathy_nn ; zhu or they can be nonparametric models such as Gaussian Processes bilionis_mogp ; chen_jcp ; perdikaris_scicomp .

Gaussian Process (GP) regression techniques allow us to quantify the epistemic uncertainty associated with the limited number or training data points by providing explicit Bayesian posterior updates on the error bars ohagan ; ohagan_quad

. Recent works have demonstrated their wide applicability for UQ analysis and probabilistic machine learning and have been exhaustively used for predicting solutions to differential equations, including local adaptations for detecting discontinuities

bilionis_local , discovering governing equations raissi_gp and constructing latent variable models for solving inverse problems atkinson

. Challenges associated with efficiently training an accurate GP model, include acquiring the training dataset necessary to guarantee the desired predictive performance and tuning the model’s hyperparameters

rasmussen . Collecting the data often requires a significant number of model simulations that increases exponentially as a function of the dimensionality of the input parameters and can easily become prohibitive due to computational budget limitations. On the other hand, training GPs using massive datasets, when available, poses numerical challenges that can result in poor performance. These are mainly associated with repeatedly performing covariance matrix inversions that can be extremely inaccurate and computer memory-demanding. Applying sparse techniques herbrich and batch optimization hensman have only partially managed to address these issues.

An alternative approach involves leveraging a cheaper computer solver that provides with low-cost, yet less accurate observations to be used for training the GP, leading to formulations where the observable depends on more that one covariates. Such approaches, although already known to the geostatistics community as co-kriging cressie , were formalized within the context of multi-fidelity simulations in the pioneering work of Kennedy & O’Hagan kennedyohagan . The auto-regressive scheme presented therein, decomposes the expensive simulation code as a sum of its cheap approximation and a discrepancy term, both modeled as independent Gaussian Processes. Techniques for training the model have been proposed forrester and involve learning the degree of correlation between the codes, while a model calibration variation of the scheme is also available kennedyohagan_calib . Similarly to standard GP regression in single-fidelity settings, training the auto-regressive scheme includes inverting large, ill-conditioned covariance matrices, leading to numerical inaccuracies. Le Gratiet & Garnier le_gratiet_doe offered an elegant solution to the problem that consists of decoupling the information stemming from different levels of fidelity and thus simplifying the covariance matrix structure. These ideas found wide applicability in discovering high dimensional response surfaces arising from dynamical systems paris_rspa , model inversion paris_rsif , multi-fidelity Bayesian optimization pang and uncertainty propagation in the big-data regime paris_siam .

From a different perspective, an efficient strategy for constructing a GP metamodel would involve addressing input dimensionality by means of reducing the number of variables in the input design space. Standard ways of doing so include sensitivity analysis saltelli

and unsupervised learning methods that explore correlations of the input variables, such as PCA

pearson , kernel PCA ma or even truncating Karhunen-Loève expansions ghanem_spanos . More sophisticated approaches to dimensionality reduction in the context of GPs include applying embeddings on the input design space in order to obtain a low dimensional latent variable space. Several criteria for discovering linear embeddings have been proposed in the literature. Constantine in his seminal work constantine_AS ; constantine_book , used a gradient-based approach to exploit what is termed Active Subscape (AS). The key step in this approach is to map the input space using a matrix derived by an orthogonal decomposition of the covariance of the gradient vector of the observable quantity. Intuitively the approach is sound as it explores the directions along which the output quantity exhibits most of its variability. The major limitation however is that it is impractical in black-box simulations where gradient information is hardly available. Gradient-free approaches that learn orthogonal tripathy_bilionis or arbitrary garnett linear mappings on-the-fly, have also been developed. In these works, the matrices were inferred using MLE or the Laplace approximation of the posterior of the linear mapping, respectively. At last, attempts to reduce the dimensionality using nonlinear mappings have been successfully applied for Bayesian optimization moriconi , where the forward mapping to the latent variables is modeled using a feed-forward neural net while the original input reconstruction mapping is carried out using multi-output GPS.

The ultimate goal in this work is to combine dimensionality reduction principles with multi-fidelity simulations so as to manifest their benefits in a single context. We seek to simultaneously address numerical issues and computational limitations arising in the presence of high input dimensionality and expensive data acquisition procedures. We thus present a unified framework where the classic auto-regressive Gaussian Process (ARGP) scheme can be built on a linearly embedded subspace that can be learned while training the model. To the authors best knowledge, such an approach is presented for the first time. Specifically, we build on the Kennedy-O’Hagan autoregessive scheme for modeling expensive computer outputs using a multi-fidelity source of information. Regardless of the quality of their output as an approximation of the high fidelity solver, we assume that the different fidelity codes are highly correlated and thus they exhibit similar dependence on the input variables, thus intuitively it is natural to assume that a common linear embedding can be applied to all GPs at different levels of fidelity. A training procedure for tuning the hyperparameters of the model is presented where the ARGP parameters are estimated using either MLE or Bayesian methods and the orthogonal embedding is inferred using a fully Bayesian approach that relies on the Geodesic Monte Carlo algorithm

byrne that has been particularly tailored for sampling orthogonal matrices tsilifis_rspa . Such an approach to learn linear embeddings is novel in the context of dimensionality reduction for GPs and clearly prevails previously presented techniques tripathy_bilionis ; garnett

in that the full posterior of the orthogonal matrix can be exploited as opposed to a Laplace approximations while it relieves us of the computational burden caused by high-dimensional gradient-based optimization of the likelihood.

We structure this paper as follows: Section 2.1 presents the basic elements of the classic autoregressive Gaussian Process scheme, Sec. 2.2 presents the prior setting and predictive distributions and Sec. 2.3 discusses the model training approach. Section 3.1 then formulates the ARGP model defined on a linearly embedded subspace using an orthogonal projection, that is trained using the proposed algorithm presented in Sec. 3.2. Our numerical examples include two toy problems with known embeddings that are learned using observations from three levels of fidelity (Sec. 4.1 & 4.2) and a challenging three-dimensional airfoil optimization problem with a 85-dimensional input space where observations are available from a high- and a low-fidelity simulator.

2 Recursive Multi-fidelity Gaussian Processes

2.1 Autoregressive Gaussian Process model

We consider the following scenario where a hierarchy of computer codes is available, say , indexed by input vector where is the design space consisting of all feasible inputs. The codes are in order of increasing fidelity from the cheapest one , to the most accurate one, . The key assumption as was first stated by Kennedy & O’Hagan kennedyohagan for any two consecutive levels of code and , is that given observation of the low fidelity code at , nothing more can be learnt about by observing at any . This translates to the Markov property


for any

. Based on this principle, we write the relation between any two consecutive codes using the autoregressive model


where “” denotes statistical independence and therefore the discrepancy between successive levels is characterized by an independent correction term and by the scaling coefficient which captures the correlation between the models, as it satisfies


As suggested in kennedyohagan , in what follows we assume for simplicity that is constant, although dependence on has been modeled using regression functions and was shown to be worthwhile le_gratiet_juq ; le_gratiet_doe . At last accounts for measurement noise in the obserbations at level that is assumed to be independent of both and .

2.2 Prior and predictive distributions

Prior to observing any outputs of the multi-fidelity codes, we assume that and are Gaussian Processes


where , are scaling factors and are covariance kernel functions. For the measurement noise we take


The kernel functions are modeled using the squared exponential kernel


where are the lengthscales along the dimensions. The choice of the covariance kernel depends primarily on the prior belief about the smoothness of the response surface (Ch.4, rasmussen ).

Let now be the experimental design set at level that consists of input points in , and be the observations of the -th code . We denote with the parameters introduced at level , so that and , and . For a fixed set of parameter values by stacking all observations to form a vector , we write the predictive distribution of the highest level of code at any new set of test points , for , as


where the predictive mean is given by


and the predictive variance is


In the above expressions we have


where the diagonal block matrices are given by


where , is the correlation matrix with entries , . The off-diagonal blocks are written


with . Analogously, is the correlation matrix with entries , . Further, the vector is defined as , where


and and . At last, the variance is defined as


for .

2.3 Estimating the model parameters

In order for the above predictive distribution to be of practical use, it is crucial to train the model by means of finding the optimal set of parameters . Pursuing a fully Bayesian approach to model training, although robust, it can be computationally challenging as the number of levels, and consequently the dimension of , increases. It can be therefore preferable in such cases to resort to more efficient strategies, such as maximum likelihood estimation (MLE). For the sake of completeness, we present below both approaches, that we use interchangeably in our numerical examples.

2.3.1 Maximum-likelihood estimation

Here we explore the possibility of training the model using maximum likelihood estimation (MLE). More specifically we seek to minimize the negative log-likelihood, that is to identify such that




writing to emphasize the dependence of the covariance matrix on the parameters. Minimization of can be performed using standard gradient based algorithms. The gradient of with respect to any of its arguments is given by


Note that a convenient simplification applies in the special case considered in kennedyohagan ; le_gratiet_juq ; le_gratiet_doe , where the design points corresponding to the observations are nested, that is . By expanding the likelihood using conditional probabilities and making use of the Markov property in eq. (1) we write


thus the autoregressive model can be trained by solving distinct optimization problems with respect to the parameters , corresponding to the different levels of fidelity . At an arbitrary level , the log-likelihood is written


After differentiating with respect to the components of and setting equal to zero, one can derive the maximum-likelihood estimates




where , , is the set of observations from code corresponding only to the design points in . Further, is a vector of length , filled with ones and is the indicator function that is one in , and zero otherwise. At last, the two estimates given above are dependent on the lengthscales and the noise variance . Those can be estimated by maximizing the concentrated restricted log-likelihood function


2.3.2 Markov Chain Monte Carlo sampling

A fully Bayesian updating strategy of the model parameters can be carried out using Markov Chain Monte Carlo (MCMC) sampling in order to generate samples of the posterior distribution of . The latter is written as


where the likelihood distribution is with given in eq. (16). The prior distributions of parameters

corresponding to lengthscales at various levels of fidelity are modeled using independent Beta distributions and those corresponding to the variances are modeled using independent inverse gamma distributions. In our implementations we use a Metropolis-Hastings algorithm

gelman that allows jumps in order to fully explore possible multimodal behavior.

3 Multi-fidelity Gaussian Processes on low-dimensional embeddings

3.1 Dimensionality reduction using projection matrices

As discussed in the introduction, the main goal in this paper is to learn a response surface on a high dimensional input space within a multi-fidelity context, i.e. by leveraging observations from low accuracy simulators that are cheaper to evaluate. Several shortcomings can make the learning process problematic in the presence of high dimensions and large datasets. For instance, the large number of parameters in the case of anisotropic kernels can result in poor performance of the MCMC algorithm or convergence of the MLE procedure to suboptimal solutions. Below, we develop an dimensionality reduction framework where the ARGP model is trained on a low dimensional input space that is the result of a linear embedding of the original space. We assume throughout this work that the target function can be described or be well-approximated by a function , defined in a d-dimensional space , where , such that


Here, is assumed to be a orthonormal matrix that maps the original design space to . The choice of orthonormality is made so that the columns of form a basis on and therefore the matrix itself is a projection from to . That further implies that once is identified, any other set of basis vectors forms a projection that can describe the same approximation of .

Next, it is important to assume that all the lower fidelity codes can be approximated by similar “link” functions defined on the same low dimensional space , that is using the same projection function . We therefore assume that for each


for functions , . Thus, we can define the autoregressive model (2) on as


where and .

The idea behind this formulation is that, by identifying such that the above approximations are accurate, the multi-fidelity output quantify of interest is described as a function of a low dimensional input, thus, it becomes simpler to characterize its predictive distribution. Applying the same projection at all levels of fidelity, practically means that all codes exhibit most of their variability within the same “active” subspace, as it was termed by Constantine constantine_AS ; constantine_book . Although this might seem as a strong assumption, in fact, considering that different fidelity codes are typically highly correlated as they simulate the same physical process at different levels of accuracy, the assumption is fairly plausible. Furthermore, it is worth pointing out that the ARGP model (26) where the low dimensional spaces are defined using different , at different levels, does no longer honor the Markov property (1). We do not pursue further such a scenario in this work.

Assigning the same prior distributions as in eq. (4) for the reduced dimensionality ARGP introduced above, results in the same posterior expressions given in eqs. (8)-(9), where the covariance matrices describe the correlations of the training points in . By denoting the projections of all design points where the model outputs are observed, as , , we can rewrite the covariance kernels and as functions of the high dimensional inputs in the original space , with entries


respectively. By incorporating these expressions in the likelihood and posterior distributions, it becomes clear that can be considered as an additional set of model parameters that needs to be inferred from observations, in a similar manner as in the single-fidelity setting presented in tripathy_bilionis .

3.2 Simultaneous autoregressive GP training and embedding learning

As highlighted above, training the autoregressive GP model and identifying the low-dimensional design space requires learning the model parameters and the rotation matrix simultaneously. To do so, we propose a two-step iterative procedure that iterates between tuning the parameters for a fixed and updating while keeping the parameters fixed. Such algorithms have been used in the past for dimensionality reduction purposes within the context of GP regression tripathy_bilionis ; garnett and Polynomial Chaos adaptations tsilifis_rspa ; tsilifis_cs and have demonstrated great potential. To further justify the choice of updating scheme, we can write the Bayesian posterior of the joint parameters given observations , as


Assuming that sampling from the marginal posteriors of and conditional on each other, a Gibbs sampler would consist of generating a Markov chain that eventually converges to the joint posterior above chaspari , therefore generating a chain of and samples in such a fashion will ultimately explore .

In order to carry out such a sampling scheme, we employ the following strategy: When is given, the updating step is performed using the methods presented in Section 2.3. For updating while is kept fixed, we use the Geodesic Monte Carlo method that samples from target distributions defined on embedded manifolds, as it is described in Byrne & Girolami byrne . In our case, is a matrix that takes values on the Stiefel manifold of orthonormal -frames on muirhead , and the target density is the marginal posterior . The iterative procedure is summarized in Algorithm 1. The details of Geodesic Monte Carlo sampling are presented in the next section.

Require : Design input sets , observations , initial guess , assign priors on or initialize to .
       Run MCMC or MLE optimization (Sec. 2.3.2-2.3.1)
       Run Geodesic MC Algorithm 2 with target density
until relative change in Hamiltonian function (30) is less than tolerance .
Algorithm 1 Two-step iterative update of and

3.2.1 Geodesic Monte Carlo

The Geodesic Monte Carlo algorithm developed in byrne is at its core a Hamiltonian Monte Carlo (HMC) sampling technique girolami_hmc defined on a Riemannian manifold embedded in . A Hamiltonian function is defined that describes the dynamics of a spatial variable and is characterized by its target density. Next, the Hamiltonian flows are simulated through numerical integration in order to propose new samples that are to be accepted or rejected, based on a Metropolis-Hastings step hitchcock . In our case, we are working on the Stiefel manifold that is defined as


for , where the special cases and correspond to the set of all orthonormal square matrices and the ()-dimensional hypersphere on , respectively. The Hamiltonian corresponding to our target posterior density is defined as


where is an auxiliary velocity variable and the dynamics of are described by


The key difference of the sampling strategy proposed in byrne from traditional HMC algorithms is that instead of relying on symplectic integrators neal to simulate the Hamiltonian flow, we take advantage of the fact that the dynamics of the kinetic term describe a flow over a geodesic curve that are explicitly known for certain manifolds, therefore numerical integration applies only on the potential term, thus improving accuracy and performance.

In summary, at the -th step of the algorithm, a random velocity vector is proposed that is tangent on the manifold at the previously accepted step , and it specifies the direction along which the Hamiltonian is going to move. Then the two terms and are integrated over time by first updating for a time step , followed by updating for using the known geodesic curve formula, and then is integrated again for . The procedure is repeated until a user-defined final time is reached. At last, the resulting “spatial” coordinate will be accepted in the chain with probability


The full mathematical details of the integrator scheme for the Hamiltonian flow and the geodesic formulas on are given in A. Algorithm 2 describes one complete accept-reject step of the above procedure. As a prior on , we use a Matrix-Langevin (mL) distribution chikuse whose density function is given by


where is the normalizing constant that is parametrized by the matrix . Details on the geometric interpretation of the mL density and how to tune the prior parameters are given in B. Note that updating by integrating the potential term requires computing the gradient of the marginal log-posterior distribution . The gradient of the marginal log-posterior becomes


where the likelihood gradient is given by (17) and the gradient of consists of gradients of block matrices , . Those involve differentiating that have -th entries


At last, the log-prior gradient is .

Initialize : Choose integration period , time step and sample .
At the -th step assume :
for  to  do
       Update by following the geodesic flows (48)-(49) for a time interval
end for
if   then
end if
Algorithm 2 Geodesic Monte Carlo algorithm byrne

4 Numerical examples

For all numerical examples presented in this section we have used the following settings:

  1. The number of time steps and the step-size used in the integration of the Hamiltonian in Algorithm 2 are set to and respectively. These values have been carefully selected after numerical experimentation with the algorithm and they provide a moderate acceptance rate while keeping running times to reasonably low levels. Intuitively, a highly accurate integration resulting in high acceptance rate would require a large value for and very small but it would slow down the algorithm significantly. On the other hand, coarse integration using large and small would speed up the algorithm but would reduce the acceptance rate. Another characteristic of the algorithm that manifests in high dimensions for both and is that for an arbitrary initial point the acceptance rate during the first few iterations is significantly low, as a result of the standard Gaussian proposal from which we sample , leading to completely random directions along with the Hamiltonian flow evolves. To improve performance of the algorithm, we change the mL prior at each iteration by setting its parameter . Each iteration stops when one single sample is accepted and we reduce the step size to after successive rejections in order to refine the Hamiltonian integration and reset on the next iteration. We stop sampling as soon as the first sample is accepted.

  2. To train the ARGP using MCMC sampling, we assign beta priors with parameters on the lengthscales and inverse Gamma priors with parameters and on the variance parameters and respectively, for . To proceed with sampling , the hyperparameters are then fixed to the median values of the full MCMC chain after 200 samples are accepted. The MCMC algorithm is run using General Electric Global Research Center’s in-house Bayesian modeling toolbox GEBHM ghosh2020advances . To train the ARGP using MLE, the log-likelihood or the restricted log-likelihoods are maximized using the BFGS algorithm byrd .

4.1 Academic example 1: Three-fidelity model with known 1-dimensional embedding

We consider the following three levels of code


where so that is a scalar variable and the link functions are given by


For this example we take and we generate randomly, by fixing the random seed in order to ensure reproducibility. For our numerical experiments, the projection matrix is fixed to

Figure 1: Academic example 1. Training data along with the 1-dimensional representation of , and and the GP posterior mean mean using only data.

We generate synthetic data that consists of observations obtained on nested design points and from the low, intermediate and high fidelity codes respectively, where , and

. The observations are corrupted by Gaussian noise at all levels of fidelity with their standard deviations being equal to

, and , corresponding to , and respectively. Fig. 1 shows the full training data set along with the true 1-dimensional link functions , and . To motivate our study, a single-level Gaussian process regression is performed on the high-fidelity data points that are available. The predictive mean, depicted with solid blue line, matches the training points but clearly is unable to capture the true model’s fluctuations due to the absence of a sufficient amount of data, let alone the fact that the true and the resulting 1-dimensional representation is hypothetically not available to the experimenter. That essentially means that a dimensionality reduction approach performed using only the high fidelity dataset, as proposed in tripathy_bilionis would provide such a predictive mean only in the best case scenario, where would be recovered exactly. In addition. even learning this rotation matrix would be challenging due to the limited data availability and convergence of the two step algorithm would be expected to be extremely slow. Lastly, it is needless to say that gradient-based methods such as constantine_AS would simply fail dramatically due to the poor gradient Monte Carlo estimate using only data points. On the other hand, in a multi-fidelity GP regression setting in the original 10-dimensional design space, the available data might fail to provide meaningful inference results, while training the model becomes again challenging due to the large number of parameters to be inferred, including different lengthscales that are present in the anisotropic kernels, as well as the repeated use of Cholesky decomposition for inverting large covariance matrices.

We run Algorithm 1 for this particular setting and after only eleven iterations we obtain the converged rotation. Fig. 2 (left) shows the drawn samples during iterations 5-11. It can be seen that the values are in full agreement with - which is a valid rotation since the representation of through its link function is invariant under reflections about the origin. Fig. 2 (right) shows the plots of the predictive mean and the true link function from eq. . The excellent agreement between the two is apparent. At last, Table 1 shows the estimated model parameter values . For comparison we display the value of the single-level GP regression shown in Fig. 1. As one can conclude from the plot, the single-level GP overestimates the lengthscale and interprets the data discrepancies as observation noise (). On the contrary, the obtained ARGP gives high fidelity lengthscale while the noise variances for all fidelities are almost negligible. By capturing accurately the correlations between the different levels and (recall the true values and ), the autoregressive model leverages the low fidelity data effectively and captures the model’s fluctuations even in areas such as the interval where high fidelity training points are fully absent.

Figure 2: Academic example 1. Left: Entry values of the posterior samples and the true values of and . Right: Comparison of the multi-fidelity predictive mean and the true link function .
GP 5.223 11.601 4.017
ARGP -0.508 7.022 -8.297 1.955 3.296 11.518 -17.868 1.242 1.700 13.107 -20.123
Table 1: Academic example 1. Final maximum likelihood estimates of the model parameters.

4.2 Academic example 2: Three-fidelity model with known 2-dimensional embedding

A three-level autoregressive Gaussian Process is presented in this example where the low dimensional embedding is now 2D and the additional challenge of identifying the dimensionality of is also explored. Let be a fixed projection matrix with columns and . The three levels of code are given by the functions


We consider again to be the dimensionality of and the projection matrix is fixed to


For this example we generate data again from nested design points , and where this time , and . Observations are contaminated with Gaussian noise whose standard deviation is equal to , and at the corresponding levels.

Figure 3: Academic example 2. Left: Comparison of samples versus values of columns of true . Right: Predictive mean and 3-standard deviation confidence bands of the 1-dimensional adapted MFGP along with 250 test data projected in the embedded space.
d=1 2.005 0.651 -3.082 2.419 0.327 4.901 -6.903 2.505 2.510 8.807 -21.089
d=2 (2.67, 4.40) 4.932 -15.004 0.123 (2.96, 0.26) 4.756 -8.888 -0.529 (0.49, -0.31) 4.083 -18.466
Table 2: Academic example 2. Final maximum likelihood estimates of the model parameters.

We run again Algorithm 1, first for and then for and we report our results below. For the case, the algorithm converged after roughly 12 iterations and, as expected, the trained model fails to capture a suitable embedded space that could honor all training data points and further represent the high fidelity code as a 1-dimensional function. Fig. 3 (left), compares the posterior values of accepted while running the Geodesic MC algorithm, with the values of the two columns of the true used to generate the training data, Fig. 3 (right) shows the 1-dimensional predictive mean of the posterior ARGP with 3-standard-deviation-wide confidence bands along with test data points. One can observe that the entries of the inferred tend mostly towards the values of the 2nd column of the true . This can be intuitively explained from the fact that the terms including appear to be more dominant in the overall expression of and particularly the low fidelity model has small impact on the high fidelity code. At the same time, the predictive capabilities of the posterior model clearly fail to span the regions where the additional test data might be observed, indicating that a higher dimensional embedding should be learned. In the case, the situation improves significantly. As can be seen in Fig. 4, top row, the sampled posterior values of and are in agreement with the true values of in both columns. The samples have been obtained during iterations 10-20. Furthermore, the bottom row graphs show the 2-dimensional predictive mean of the train ARGP model along with test data points and a 45-degree line plot comparing observations versus predictions on the same data points. The overall predictive performance of the model is in agreement with the true model output.

Figure 4: Academic example 2. Top: Comparison of posterior samples of and columns of versus the nominal values used to generate the training data. Bottom left: Predictive posterior mean of the trained MFGP model as a function of its two arguments , along with test observations generated from the true model . Bottom right: Prediction at the same test points versus observations. Perfect predictions would fall on the line depicted in black solid color.

5 Airfoil shape optimization problem

The aerodynamic optimization problem analyzed in this section is the full 3D design of a typical large last stage blade (LSB) of an industrial gas turbine (IGT). As the largest rotating component in the turbine, last stage blades are one of the most mechanically challenged components in an IGT and often determine the total power output of the machine. Increasing push for larger and hotter turbines to drive down cost of electricity has resulted in increasingly challenging LSB designs. With deference typically skewed towards durability, this has generally resulted in greater aerodynamic compromises that negatively impact blade efficiency. This also drives longer design cycles as designers incrementally search for acceptable aero-mechanical solutions. In this type of design, aerodynamic assessments of power, efficiency, and flow capacity are based on expensive high-fidelity computational fluid dynamics (CFD) simulations. Each design iteration requires one or more CFD simulations, depending on the number of inner aerodynamic iterations required to satisfy the cycle’s flow capacity requirements. A full 3D optimization would enable the enhancement of turbine performance. However, the process normally implies a large computational cost because of the high dimensionality of the design space.

Figure 5: 9HA Industrial Gas Turbine ( Last stage blades are visible on the right end of the engine.
Figure 6: Last stage blade parametrization of 2D sections.

In the current work, turbine performance is evaluated using steady-state RANS CFD at two different fidelity levels; a fast running coarse mesh for broader design space exploration, and a slower running fine mesh for accuracy refinement. The 3D airfoil is parametrized using the approach illustrated in Figure 6, which shows the airfoil parametrization at one section on the left, and a view of the full 3D rotor on the right. The airfoil surface is constructed with 7 2D profiles at different spanwise locations from hub to tip. Each airfoil section is characterized by 12 independent parameters. The ranges for these parameters are selected to provide a wide design space while respecting geometrical constraints. The 12 parameters, as sketched in Figure 6, include the stagger angle, leading and trailing edge metal angles, leading edge diameter, suction and pressure side wedge angles, leading edge and trailing edge metal angles, and additional parameters to control the airfoil curvature between the leading and trailing edges.

The 2D sections are aligned relative to each other in circumferential and axial space by aligning the section centers of gravity (CG) along a radial line through the hub section CG (referred to as the stacking line). After the section CGs are aligned, one additional parameter, referred to as the airfoil lean angle, is applied to reorient the stacking line relative to the radial direction. Surfaces are fit through the seven stacked sections to create the full, continuous, 3D airfoil definition. Eighty five total parameters are therefore required to define the complete 3D airfoil shape.

The parameters are expressed as offsets from a baseline, requiring that each section starts from an appropriate reference design. An in-house software package tailored specifically for turbomachinery design uses the parameters described above to create the airfoil coordinates, and these coordinates are then transformed into a full 3D CAD model of the rotor blade for the CFD grid generation. A 3D structured mesh is built using a commercially available software package. Grid templates are built from the baseline geometry and used consistently for all the cases throughout the optimization. With this approach, all the grids have similar refinement and quality metrics. To simulate the full stage, the upstream stator is included in the CFD calculation for each case.The design of the vane and the vane mesh are not altered through the optimization.

The 3D CFD analysis is performed using GE’s in-house CFD solver TACOMA, a 2nd-order accurate (in time and space), finite-volume, block-structured, compressible flow solver seeley ; ren . The steady Reynolds-Averaged Navier-Stokes (RANS) calculations are solved with a mixing plane between rotating and stationary components. Source terms are included at various locations along the endwalls to simulate the injected cooling, leakage, and purge flows. The two objectives are to maximize the aerodynamic efficiency while matching the baseline degree of reaction. For this problem, with the full stage modeled, the degree of reaction can be calculated using standard turbine definitions.

  1. Aerodynamic efficiency: The efficiency is calculated as the ratio of mechanical power and isentropic power. All inputs required to produce the efficiency value are available from the output of the CFD simulation. Design preference is to maximize efficiency.

  2. Pseudo-reaction: In a stage calculation, the degree of reaction indicates the split of flow acceleration between stator and rotor. The degree of reaction can be calculated based on the full stage CFD, so the flow quantities between stator and rotor are extracted from the numerical results for each case. This objective is monitored to ensure that the changes in airfoil shape do not significantly affect the turbine operating condition. The design preference of this objective is to be as close as possible to the baseline value.

Figure 7: Airfoil shape optimization. Observed outputs from the two-fidelity solvers to be used to construct the ARGP metamodel.

Multiple quantities are extracted and processed from the CFD, like ideal Mach number at various spanwise locations. Radial profiles of flow quantities, like pressure and flow angle, are obtained to characterize the quality of flow field propagating from the turbine to the downstream exhaust diffuser, which has not been modeled here. Additional objectives are formulated based on the desirability of the turbine exit flow profile. Diffusion rate, diffusion ratio, and shock intensity are assessed for each case to provide additional selection criteria. The resulting simulated data consists of 160 shape configuration points are generated to be run on the coarse grid and another 120 for the fine grid that have successfully converged to CFD solutions. The two main objectives, efficiency and degree of reaction, are plotted in Figure 7, as variations from each respective baseline. The cases assessed with a low fidelity grid are represented with gray circles, while the cases assessed with high fidelity are represented with blue diamonds. Given the small number of points compared to the number of parameters, a conclusive Pareto front has not yet been identified.

For this problem, we fix the correlation coefficient between the two models to since the different level of accuracy of the two solvers varies as a result of different mesh discretization, unlike the previous examples where higher fidelity codes resulted from scaling the low fidelity ones. We split the available high fidelity observations to 75 training points and 45 test points while all low fidelity observations are used in training the ARGP using MCMC. As expected, Algorithm 1 converges slower than in the previous synthetic examples due to the higher dimensionality. Figs. 8 & 9 show the comparison of the ARGP mean predictions ( 2-standard deviations) for the two quantities of interest vs the test observations for reduced dimensionalities . Clearly the fit improves significantly as increases and reported residual mean square error (RMSE) value simply confirms this fact by reducing to as low as & respectively. The quality of fit on the test data for can serve as a stopping criterion when testing for different reduced dimensionalities and no need for further exploration on is required. To further support this claim we report values of the loglikelihodd function and the Bayesian Information criterion (BIC) bishop . The latter is a standard criterion used for model selection and is given by the loglikelihood expression penalized by a term that depends on the number of training data points and model parameters. Specifically,


where is the number of model parameters to be inferred and is the maximum likelihood estimate. The largest values of BIC typically indicate the most favorable model. We plot the maximum log-likelihood and the BIC values as a function of in Fig. 10. We observe that the log-likelihood becomes almost constant, indicating that no further improvement in terms of fitting the training set can be achieved. The BIC on the other hand drops at , indicating that the number of parameters (penalty term) becomes significantly large, thus the trade-off between data-fit improvement and number of parameters to be inferred has already a negative trend.

Figure 8: Airfoil shape optimization. Test observations of aerodynamic efficiency vs ARGP predictions for .
Figure 9: Airfoil shape optimization. Test observations of pseudo-reaction vs ARGP predictions for .
Figure 10: Airfoil shape optimization. Maximum log-likelihood and Bayesian Information criterion values for reduced dimension .

At last, the display the sample values of the 1st column of , denoted as for both QoIs in Fig. 11. The values corresponding to the same input dimension in general do not seem to be similar, indicating that the importance of each input parameter on the different QoIs varies. Most importantly, very few of the entries are near zero, thus only a small number of the parameters in the original input space are negligible with no effect in the output QoIs. This allows us to conclude that the reduction from to is not the result of simply discarding unimportant parameters but that our algorithm was able to reveal the low-dimensional linear embedding that captures the the models observed behavior accurately.

Figure 11: Airfoil shape optimization. Entries of the 1st column of the projection matrix corresponding to the two QoIs.

6 Conclusions

We have thus far presented a methodology to exploit low dimensional subspaces for input dimensionality reduction and have demonstrated it’s applicability to uncertainty propagation problems using multi-fidelity simulations. This was achieved by employing an autoregressive Gaussian Process scheme with a linear embedding that is modeled using an orthogonal matrix that maps the original input variable to its latent counterpart. We treat the matrix as an additional set of hyperparameters and we learn the mapping jointly with training the ARGP by employing a two-step algorithm that train the ARGP using standard MLE and MCMC techniques while it updates the orthogonal matrix using the proposed Geodesic Monte Carlo sampling. We validated our method on synthetic examples with known linear embedding and we further utilised it to study the problem of 3D airfoil optimization by building a low dimensional responce surface to understand the impact of airfoil shape parameters on the particular quantities of interest.

Our work continues on the findings of constantine_AS ; tripathy_bilionis towards the development of a surrogate that enables a fully Bayesian exploration of the active subspace. Quoting Tripathy & Bilionis tripathy_bilionis “the big challenge is the construction of proposals that force to remain on the Stiefel manifold such approaches would open the way for more robust AS dimensionality selection”. We have achieved a Bayesian treatment that relieves us from such headaches by tailoring our HMC sampling method byrne particularly on the Stiefel manifold. In addition, the capability to leverage information from computer codes of a varying degree of accuracy and cost is another novel step that opens new directions in multi-fidelity simulations for uncertainty quantification, including multi-output Gaussian Processes and multi-objective Bayesian optimization. Such topics are to be explored in future research.


The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0001204. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Appendix A Mathematical background on Geodesic Monte Carlo

a.1 Hamiltonian dynamics on an embedded manifold

Consider the problem of sampling from a probability density defined on a manifold that is isometrically embedded in

and equipped with a metric tensor

, serving as the isometry between the two spaces, thus, preserving the distance. In order to employ a Hamiltonian Monte Carlo algorithm neal , the Hamiltonian function is formed


and its dynamics are described by


Integrating the Hamiltonian on a Euclidean space typically involves a Stormer-Verlet leapfrog scheme bishop ; hairer which in our case requires inverting the metric tensor . To avoid numerical drawbacks associated with such operations, Byrne & Girolami byrne suggest splitting to two distinct Hamiltonians, namely with dynamics