1 Motivation
Multifidelity modeling proves extremely useful in solving inverse problems, for instance. Inverse problems are ubiquitous in science. In general, the response of a system is modeled as a function $f(x)$. The goal of model inversion is to find a parameter setting $x$ that matches a target response $y^*$. In other words, we are solving the optimization problem
$$\min_{x} \left\| f(x) - y^* \right\|$$
for some suitable norm. In practice, $x$ is often a high-dimensional vector and $f$ is a complex, nonlinear, and expensive-to-compute map. These factors render the solution of the optimization problem very challenging and motivate the use of surrogate models as a remedy for obtaining inexpensive samples of $f$ at unobserved locations. To this end, a surrogate model acts as an intermediate agent that is trained on available realizations of $f$, and is then able to perform accurate predictions of the response at a new set of inputs. The multifidelity framework can be employed to build efficient surrogate models of $f$. Our Deep Multifidelity GP algorithm is most useful when the function $f$ is very complicated, involves discontinuities, and when the correlation structures between different levels of fidelity take discontinuous non-functional forms.
2 Introduction
Using deep neural networks, we build a multifidelity model that is immune to discontinuities. We employ Gaussian processes (GPs) (see [5]), a nonparametric Bayesian regression technique. Gaussian process regression is a very popular and useful tool for approximating an objective function given some of its observations. It corresponds to a particular class of surrogate models that assumes the response of the complex system is a realization of a Gaussian process. In particular, we are interested in Manifold Gaussian Processes [1], which are capable of capturing discontinuities. A Manifold GP is equivalent to jointly learning a data transformation into a feature space followed by regular GP regression, and the model profits from standard GP properties. We show that the well-known classical multifidelity Gaussian process model (AR(1) Cokriging) [4] is a special case of our method. Multifidelity modeling is most useful when low-fidelity versions of a complex system are available; they may be less accurate but are computationally cheaper.
For the sake of clarity of presentation, we focus only on two levels of fidelity. However, our method can be readily generalized to multiple levels of fidelity. In the following, we assume that we have access to data with two levels of fidelity, $\{(\mathbf{x}_l, \mathbf{y}_l)\}$ and $\{(\mathbf{x}_h, \mathbf{y}_h)\}$, where $\mathbf{y}_h$ has the higher level of fidelity. We use $n_h$ to denote the number of observations in $\mathbf{y}_h$ and $n_l$ to denote the sample size of $\mathbf{y}_l$. The main assumption is that $n_h \ll n_l$. This reflects the fact that high-fidelity data are scarce, since they are generated by an accurate but costly process. The low-fidelity data, on the other hand, are less accurate and cheap to generate, and hence are abundant.
As for the notation, we employ the following convention: a boldface letter such as $\mathbf{x}$ is used to denote data, while a non-boldface letter such as $x$ is used to denote either a vector or a scalar; which one will be clear from the context.
3 Deep Multifidelity Gaussian Processes
A simple way to explain the main idea of this work is to consider the following structure:
$$\begin{bmatrix} f_l(x) \\ f_h(x) \end{bmatrix} \sim \mathcal{GP}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_l\!\left(h(x), h(x')\right) & \rho\, k_l\!\left(h(x), h(x')\right) \\ \rho\, k_l\!\left(h(x), h(x')\right) & \rho^2 k_l\!\left(h(x), h(x')\right) + k_h\!\left(h(x), h(x')\right) \end{bmatrix} \right), \quad (1)$$
where $h(x)$ is a deterministic data transformation. The high-fidelity system is modeled by $f_h(x)$ and the low-fidelity one by $f_l(x)$. We use $\mathcal{GP}$ to denote a Gaussian process. This approach can use any deterministic parametric data transformation $h(x)$. However, we focus on multi-layer neural networks
$$h(x) = \left(h_L \circ \cdots \circ h_2 \circ h_1\right)(x),$$
where each layer $\ell$ of the network performs the transformation
$$h_\ell(z) = \sigma\!\left(w_\ell z + b_\ell\right),$$
with $\sigma$ being the transfer function, $w_\ell$ the weights, and $b_\ell$ the bias of the layer. We use $(w, b)$ to denote the parameters of the neural network. Moreover, $\theta_l$ and $\theta_h$ denote the hyperparameters of the covariance functions $k_l$ and $k_h$, respectively. The parameters of the model are therefore given by $\left(w, b, \rho, \theta_l, \theta_h\right)$.
It should be noted that the AR(1) Cokriging model of [4] is a special case of our model, in the sense that AR(1) Cokriging corresponds to the identity transformation $h(x) = x$.
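A multi-layer feature map of this form can be sketched as follows. The two-layer architecture, the helper names, and the random initialization are hypothetical; the sketch only illustrates the layer-wise transformation $\sigma(w_\ell z + b_\ell)$ with a Sigmoid transfer function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(x, weights, biases):
    """Multi-layer transformation h(x): each layer computes sigma(W z + b)."""
    z = x
    for W, b in zip(weights, biases):
        z = sigmoid(z @ W + b)
    return z

# Hypothetical architecture: 1-D input -> 20 hidden units -> 2-D feature space
rng = np.random.default_rng(0)
weights = [rng.standard_normal((1, 20)), rng.standard_normal((20, 2))]
biases = [rng.standard_normal(20), rng.standard_normal(2)]

x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
h = feature_map(x, weights, biases)  # shape (5, 2): one feature vector per input
```

The GP layers $f_l$ and $f_h$ then operate on the outputs of `feature_map` rather than on $x$ directly.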
3.1 AR(1) Cokriging
In [4], the authors consider the autoregressive model
$$f_h(x) = \rho\, f_l(x) + \delta(x),$$
where $f_l(x)$ and $\delta(x)$ are two independent Gaussian processes with
$$f_l(x) \sim \mathcal{GP}\!\left(0, k_l(x, x')\right)$$
and
$$\delta(x) \sim \mathcal{GP}\!\left(0, k_h(x, x')\right).$$
Therefore,
$$\operatorname{cov}\!\left(f_l(x), f_h(x')\right) = \rho\, k_l(x, x')$$
and
$$\begin{bmatrix} f_l(x) \\ f_h(x) \end{bmatrix} \sim \mathcal{GP}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_l(x, x') & \rho\, k_l(x, x') \\ \rho\, k_l(x, x') & \rho^2 k_l(x, x') + k_h(x, x') \end{bmatrix} \right), \quad (2)$$
which is a special case of (1) with $h(x) = x$. The importance of $\rho$ is evident from (2): if $\rho = 0$, the high-fidelity and low-fidelity models are fully decoupled, and combining them yields no improvement in the predictions.
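The block covariance implied by the autoregressive model can be assembled directly. The following sketch, with an assumed squared exponential kernel and our own function names, also makes the decoupling at $\rho = 0$ concrete.

```python
import numpy as np

def k_se(x, xp, ell=0.3, sf=1.0):
    """Squared exponential kernel on scalar inputs (assumed choice)."""
    d = x[:, None] - xp[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def ar1_joint_cov(xl, xh, rho, kl=k_se, kh=k_se):
    """Block covariance of [f_l(x_l); f_h(x_h)] implied by f_h = rho*f_l + delta."""
    K_ll = kl(xl, xl)                       # cov(f_l, f_l)
    K_lh = rho * kl(xl, xh)                 # cov(f_l, f_h)
    K_hh = rho**2 * kl(xh, xh) + kh(xh, xh)  # cov(f_h, f_h)
    return np.block([[K_ll, K_lh], [K_lh.T, K_hh]])

xl = np.linspace(0.0, 1.0, 4)
xh = np.array([0.2, 0.8])
K = ar1_joint_cov(xl, xh, rho=0.0)
# With rho = 0 the off-diagonal blocks vanish: the two fidelities decouple.
```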
4 Prediction
The Deep Multifidelity Gaussian Process structure (1) can be equivalently written in the following compact form of a multivariate Gaussian process:
$$g(x) \sim \mathcal{GP}\!\left(0, \begin{bmatrix} k_{ll} & k_{lh} \\ k_{hl} & k_{hh} \end{bmatrix}\right), \quad (3)$$
with $k_{ll}(x, x') = k_l\!\left(h(x), h(x')\right)$, $k_{lh}(x, x') = k_{hl}(x', x) = \rho\, k_l\!\left(h(x), h(x')\right)$, and $k_{hh}(x, x') = \rho^2 k_l\!\left(h(x), h(x')\right) + k_h\!\left(h(x), h(x')\right)$. This can be used to obtain the predictive distribution $p\!\left(f_h(x_*) \mid \mathbf{y}\right)$ of the surrogate model for the high-fidelity system at a new test point $x_*$ (see equation (4)). Note that the terms $k_{lh}$ and $k_{hl}$ model the correlation between the high-fidelity and the low-fidelity data and are therefore of paramount importance. The key role played by $\rho$ is already well-known in the literature [4]. Along the same lines, one can easily observe the effectiveness of learning the transformation function $h(x)$ jointly from the low-fidelity and high-fidelity data.
We obtain the following joint density:
$$\begin{bmatrix} \mathbf{y} \\ f_h(x_*) \end{bmatrix} \sim \mathcal{N}\!\left(0, \begin{bmatrix} K & k_* \\ k_*^T & k_{**} \end{bmatrix}\right),$$
where $\mathbf{y} = \begin{bmatrix} \mathbf{y}_l \\ \mathbf{y}_h \end{bmatrix}$, $k_* = \begin{bmatrix} k_{lh}(\mathbf{x}_l, x_*) \\ k_{hh}(\mathbf{x}_h, x_*) \end{bmatrix}$, and $k_{**} = k_{hh}(x_*, x_*)$. From this, we conclude that
$$f_h(x_*) \mid \mathbf{y} \sim \mathcal{N}\!\left(\mu_*, \sigma_*^2\right), \quad (4)$$
where
$$\mu_* = k_*^T K^{-1} \mathbf{y}, \quad (5)$$
$$\sigma_*^2 = k_{**} - k_*^T K^{-1} k_*, \quad (6)$$
and
$$K = \begin{bmatrix} k_{ll}(\mathbf{x}_l, \mathbf{x}_l) & k_{lh}(\mathbf{x}_l, \mathbf{x}_h) \\ k_{hl}(\mathbf{x}_h, \mathbf{x}_l) & k_{hh}(\mathbf{x}_h, \mathbf{x}_h) \end{bmatrix}. \quad (7)$$
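Equations (5) and (6) amount to a standard Gaussian conditioning step, which can be sketched with a Cholesky-based solve; the function name and the single-fidelity sanity check below are our own.

```python
import numpy as np

def gp_predict(K, k_star, k_ss, y, jitter=1e-10):
    """Conditional mean (5) and variance (6) of f_h(x*) given observations y."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    v = np.linalg.solve(L, k_star)
    return k_star @ alpha, k_ss - v @ v

# Sanity check on a single-fidelity toy problem: predicting at a noise-free
# training point should reproduce the observation with near-zero variance.
x = np.array([0.0, 0.5, 1.0])
d = x[:, None] - x[None, :]
K = np.exp(-0.5 * (d / 0.3) ** 2)  # assumed squared exponential kernel
y = np.sin(x)
mean, var = gp_predict(K, K[:, 1], 1.0, y)  # test point coincides with x[1]
```

In the multifidelity case, `K` is the block matrix of equation (7) and `k_star` stacks the cross-covariances $k_{lh}(\mathbf{x}_l, x_*)$ and $k_{hh}(\mathbf{x}_h, x_*)$.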
5 Training
The Negative Marginal Log Likelihood is given by
$$\mathcal{NLML} = \frac{1}{2}\,\mathbf{y}^T K^{-1} \mathbf{y} + \frac{1}{2}\log\left|K\right| + \frac{n}{2}\log 2\pi, \quad (8)$$
where $n = n_l + n_h$. The Negative Marginal Log Likelihood, along with its gradient, can be used to estimate the parameters $\left(w, b, \rho, \theta_l, \theta_h\right)$. Finding the gradient is discussed in the following. First observe that, for any parameter $\theta$ of the model,
$$\frac{\partial}{\partial \theta}\log\left|K\right| = \operatorname{tr}\!\left(K^{-1}\frac{\partial K}{\partial \theta}\right) \quad \text{and} \quad \frac{\partial K^{-1}}{\partial \theta} = -K^{-1}\frac{\partial K}{\partial \theta}K^{-1}.$$
Therefore,
$$\frac{\partial \mathcal{NLML}}{\partial \theta} = \frac{1}{2}\operatorname{tr}\!\left(K^{-1}\frac{\partial K}{\partial \theta}\right) - \frac{1}{2}\,\mathbf{y}^T K^{-1}\frac{\partial K}{\partial \theta}K^{-1}\mathbf{y} \quad (9)$$
and
$$\frac{\partial \mathcal{NLML}}{\partial \theta} = \frac{1}{2}\operatorname{tr}\!\left(\left(K^{-1} - \alpha\alpha^T\right)\frac{\partial K}{\partial \theta}\right), \quad (10)$$
where $\alpha = K^{-1}\mathbf{y}$. We use backpropagation to find $\frac{\partial K}{\partial w}$ and $\frac{\partial K}{\partial b}$. Backpropagation is a popular method for training artificial neural networks; with this method one can calculate the gradients of $K$ with respect to all the parameters of the network.
6 Summary of the Algorithm
The following summarizes our Deep Multifidelity GP algorithm. First, train the model by minimizing the Negative Marginal Log Likelihood (8) with respect to the parameters $\left(w, b, \rho, \theta_l, \theta_h\right)$. Then, use equation (4) to predict the output of the high-fidelity function at a new test point $x_*$.
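As a minimal illustration of the training step, the following sketch minimizes the marginal likelihood objective (8) for a plain single-fidelity GP with a squared exponential kernel; the full algorithm would instead build the block covariance of equation (3) and also optimize over $(w, b, \rho)$. The function names, the toy data, and the use of a general-purpose optimizer are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def k_se(x, xp, ell, sf):
    """Squared exponential kernel on scalar inputs."""
    d = x[:, None] - xp[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def nmll(params, x, y, jitter=1e-6):
    """Negative marginal log likelihood (8), parameterized by log hyperparameters."""
    log_ell, log_sf = params
    K = k_se(x, x, np.exp(log_ell), np.exp(log_sf)) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x) * np.log(2.0 * np.pi))

# Toy training data
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x)

# Minimize (8); gradients are approximated numerically here, whereas the
# paper uses the analytical expressions (9)-(10) plus backpropagation.
res = minimize(nmll, x0=np.zeros(2), args=(x, y),
               method="L-BFGS-B", bounds=[(-2.0, 2.0)] * 2)
```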
7 Numerical Experiments
To demonstrate the effectiveness of our proposed method, we apply our Deep Multifidelity Gaussian Processes algorithm to the following challenging benchmark problems.
7.1 Step Function
The high-fidelity data are generated by a step function, and the low-fidelity data are generated by a second, related function.
In order to generate the training data, we pick uniformly distributed random points from the interval. Out of these points, some are chosen at random to constitute $\mathbf{y}_h$, and the rest are picked at random to create $\mathbf{y}_l$. We therefore obtain the dataset $\{(\mathbf{x}_l, \mathbf{y}_l), (\mathbf{x}_h, \mathbf{y}_h)\}$. This dataset is depicted in figure 1.
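The sampling procedure above can be sketched as follows. The step functions `f_high` and `f_low` and the sample sizes are hypothetical stand-ins (the paper's exact data generating functions are not reproduced here); only the scarce-high-fidelity / abundant-low-fidelity split matters.

```python
import numpy as np

# Hypothetical stand-ins for the data generating processes: a step function
# for the high fidelity and a shifted step function for the low fidelity.
f_high = lambda x: np.where(x < 0.5, 0.0, 1.0)
f_low = lambda x: np.where(x < 0.5, -0.5, 0.5)

rng = np.random.default_rng(1)
n, n_h = 40, 5                      # assumed sizes; the key point is n_h << n_l
x = rng.uniform(0.0, 1.0, n)        # uniformly distributed random points
idx = rng.permutation(n)            # random assignment to the two fidelities
x_h, x_l = x[idx[:n_h]], x[idx[n_h:]]
y_h, y_l = f_high(x_h), f_low(x_l)  # scarce high fidelity, abundant low fidelity
```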
We use a multi-layer neural network whose final layer contains two neurons, so that the feature space is two-dimensional. Moreover, the transfer function $\sigma$ is the Sigmoid function
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
As for the kernels $k_l$ and $k_h$, we use squared exponential covariance functions with Automatic Relevance Determination (ARD) (see [5]) of the form
$$k\!\left(h, h'; \theta\right) = \sigma_f^2 \exp\!\left(-\frac{1}{2}\sum_{d=1}^{Q}\frac{\left(h_d - h'_d\right)^2}{\ell_d^2}\right),$$
where $\theta = \left(\sigma_f, \ell_1, \ldots, \ell_Q\right)$ and $Q$ is the dimension of the feature space.
The predictive mean and two-standard-deviation bounds for our Deep Multifidelity Gaussian Processes method are depicted in figure 2. The 2D feature space discovered by the nonlinear mapping $h(x)$ is depicted in figure 3. Recall that, for this example, $h : \mathbb{R} \to \mathbb{R}^2$.
The discontinuity of the model is captured by the nonlinear mapping $h(x)$. Therefore, the mapping from the feature space to the outputs is smooth and can be easily handled by a regular AR(1) Cokriging model. In order to see the importance of the mapping $h(x)$, let us compare our method with AR(1) Cokriging. This comparison is depicted in figure 4.
7.2 Forrester Function [3] with Jump
The low-fidelity data are generated by the classical Forrester function, and the high-fidelity data are generated by a modified version of it that includes a jump discontinuity.
In order to generate the training data, we pick uniformly distributed random points from the interval. Out of these points, some are chosen at random to constitute $\mathbf{y}_h$, and the rest are picked at random to create $\mathbf{y}_l$. We therefore obtain the dataset $\{(\mathbf{x}_l, \mathbf{y}_l), (\mathbf{x}_h, \mathbf{y}_h)\}$. This dataset is depicted in figure 5.
Figure 6 depicts the relation between the low-fidelity and the high-fidelity data generating processes. One should notice the discontinuous and non-functional form of this relation.
Our choice of the neural network and covariance functions is as before. The predictive mean and two-standard-deviation bounds for our Deep Multifidelity Gaussian Processes method are depicted in figure 7.
The 2D feature space discovered by the nonlinear mapping $h(x)$ is depicted in figure 8.
Once again, the discontinuity of the model is captured by the nonlinear mapping $h(x)$. In order to see the importance of the mapping $h(x)$, let us compare our method with AR(1) Cokriging. This comparison is depicted in figure 9.
7.3 A Sample Function
The main objective of this section is to demonstrate the types of cross-correlation structures that our framework is capable of handling. In the following, the true mapping $h(x)$ is prescribed; it is plotted in figure 10.
Given $h(x)$, we generate a sample of the joint prior distribution (1). This gives us two sample functions $f_l(x)$ and $f_h(x)$, where $f_h(x)$ is the high-fidelity one. In order to generate the training data, we pick uniformly distributed random points from the interval. Out of these points, some are chosen at random to constitute $\mathbf{y}_h$, and the rest are picked at random to create $\mathbf{y}_l$. We therefore obtain the dataset $\{(\mathbf{x}_l, \mathbf{y}_l), (\mathbf{x}_h, \mathbf{y}_h)\}$. This dataset is depicted in figure 11.
Figure 12 depicts the relation between the low-fidelity and the high-fidelity data generating processes. One should notice the discontinuous and non-functional form of this relation.
Our choice of the neural network and covariance functions is as before. The predictive mean and two-standard-deviation bounds for our Deep Multifidelity Gaussian Processes method are depicted in figure 13.
The 2D feature space discovered by the nonlinear mapping $h(x)$ is depicted in figure 14. One should notice the discrepancy between the true mapping and the one learned by our algorithm. This discrepancy reflects the fact that the mapping from $x$ to the feature space is not necessarily unique.
Once again, the discontinuity of the model is captured by the nonlinear mapping $h(x)$. In order to see the importance of the mapping $h(x)$, let us compare our method with AR(1) Cokriging. This comparison is depicted in figure 15.
8 Conclusion
We devised a surrogate model that is capable of capturing general discontinuous correlation structures between the low and highfidelity data generating processes. The model’s efficiency in handling discontinuities was demonstrated using benchmark problems. Essentially, the discontinuity is captured by the neural network. The abundance of lowfidelity data allows us to train the network accurately. We therefore need very few observations of the highfidelity data generating process.
A major drawback of our method could be its overconfidence, which stems from the fact that, unlike Gaussian processes, neural networks are not capable of modeling uncertainty. Modeling the data transformation function $h(x)$ as a Gaussian process, instead of a neural network, might be a more principled way of modeling uncertainty. However, this becomes analytically intractable and more challenging. This could be a promising subject of future research. A good reference in this direction is [2].
Acknowledgments
This work was supported by the DARPA project on Scalable Framework for Hierarchical Design and Planning under Uncertainty with Application to Marine Vehicles (N660011524055).
References
 [1] Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876, 2014.
 [2] Andreas Damianou. Deep Gaussian processes and variational propagation of uncertainty. PhD thesis, University of Sheffield, 2015.
 [3] Alexander I. J. Forrester, András Sóbester, and Andy J. Keane. Multi-fidelity optimization via surrogate modelling. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 463, pages 3251–3269. The Royal Society, 2007.
 [4] Marc C. Kennedy and Anthony O'Hagan. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.

 [5] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.