1 Introduction
The ever-growing access to high-performance computing in scientific communities has enabled the development of complex computer models in fields such as nuclear physics, climatology, and engineering that produce massive amounts of data. These models need real-time calibration with quantified uncertainties. Bayesian methodology combined with Gaussian process modeling has been heavily utilized for the calibration of computer models due to its natural way of accounting for various sources of uncertainty; see HigdonJPG15 and King19 for examples in nuclear physics, Sexton2012 and Pollard2016 for examples in climatology, and Lawrence_2010, Plumlee16, and Zhang2019 for applications in engineering, astrophysics, and medicine.
The original framework for Bayesian calibration of computer models was developed by KoH, with extensions provided by Higdon04; Higdon08; Vlid; Plumee; Plumlee2019, and MenGu18, to name a few. Despite its popularity, however, Bayesian calibration becomes infeasible in big-data scenarios with complex and many-parameter models because it relies on Markov chain Monte Carlo (MCMC) algorithms to approximate posterior densities.
This text presents a scalable and statistically principled approach to Bayesian calibration of computer models. We offer an alternative approximation to posterior densities using variational Bayesian inference (VBI), which originated as a machine learning algorithm that approximates a target density through optimization. Statisticians and computer scientists (starting with Peterson; Jordan1999) have been widely using variational techniques because they tend to be faster and easier to scale to massive datasets. Moreover, the recently published frequentist consistency of variational Bayes by VIth established VBI as a theoretically valid procedure. The scalability of VBI in modern applications hinges on the efficiency of stochastic optimization in scenarios with independent data points (robbins1951; Hoffman13a). This efficiency, however, diminishes in the case of Bayesian calibration of computer models due to the dependence structure in the data. To maintain the speed and scalability of VBI, we adopt a pairwise decomposition of the data likelihood using vine copulas, which separate the information on the dependence structure in the data from their marginal distributions (Kurowicka2006). Our specific contributions are as follows:
We propose a novel version of black-box variational inference (Ranganath14) for Bayesian calibration of computer models that preserves the efficiency of stochastic optimization in scenarios with dependent data. Python code with our algorithm is available at https://github.com/kejzlarv/VBI_Calibration.

We provide both theoretical and empirical evidence for the scalability of our methodology and establish its superiority over the Metropolis-Hastings algorithm and the No-U-Turn sampler, both in terms of time efficiency and memory requirements.

Finally, we demonstrate the opportunities in uncertainty quantification afforded by the proposed algorithm on a real-world example in the field of nuclear physics.
1.1 Outline of this paper
In Section 2, we describe the framework for Bayesian calibration of computer models and give a general overview of VBI. In Section 3, we derive our proposed VBI approach to perform inexpensive and scalable calibration. We establish the statistical validity of the method and provide theoretical justification for its scalability. Subsequently, in Section 4, we discuss the implementation details with a focus on strategies to reduce the variance of the gradient estimators that are at the center of stochastic optimization for VBI. Section 5 presents a simulation study comparing our approach with state-of-the-art methods for approximating posterior distributions and illustrates our method on a real-data application.
2 Background and Theoretical Framework
Formally, let us consider a computer model $f_m(x, \theta)$ relying on a parameter vector $\theta$ and a vector of inputs $x$. Let $d$ be a set of experimental measurements $y_i$ and inputs $x_i$ of a physical process $\zeta(x)$. Model calibration corresponds to determining the unknown and hypothetical true value $\theta^*$ of the parameter $\theta$, at which the physical process would satisfy $\zeta(x) = f_m(x, \theta^*) + \delta(x)$; here $\delta(x)$ is the systematic discrepancy of the model, whose form is generally unknown. Overall, we can write the complete statistical model as
(1) $y_i = f_m(x_i, \theta) + \delta(x_i) + \sigma(x_i)\,\varepsilon_i, \qquad i = 1, \dots, n.$
The term $\sigma(x_i)$ is a scale function to be parametrized and inferred, and the $\varepsilon_i$ are independent random variables representing measurement errors, which we assume to be standard Gaussian. For computationally expensive models, the evaluations of $f_m$ cannot be reasonably performed at calibration runtime: they need to be done beforehand, typically on a grid or an otherwise space-filling design. Common practice is to emulate the computer model by a Gaussian process with mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot)$. In this setup, the data also include a set of model runs at predetermined points. The discrepancy function $\delta(x)$, while intrinsically deterministic, is also modeled by a Gaussian process. Under this framework, the complete dataset $d$ follows
(2) $d \mid \gamma \sim N\big(\mu(\gamma), \Sigma(\gamma)\big),$
where $\mu(\gamma)$ and $\Sigma(\gamma)$ are the mean and covariance of a multivariate normal distribution. The latent vector $\gamma$ consists of the calibration parameters $\theta$ and the hyperparameters corresponding to the Gaussian processes’ parametrization. The term “calibration” in the Bayesian paradigm includes both parameter estimation and a full evaluation of uncertainty for every parameter under a prior uncertainty expressed by $p(\gamma)$. We are therefore interested in deriving the posterior distribution $p(\gamma \mid d)$. This becomes quickly infeasible with increasing size of datasets, number of parameters, and model complexity. Traditional MCMC methods that approximate $p(\gamma \mid d)$, such as the Metropolis-Hastings (MH) algorithm (Metropolis) or more advanced ones including Hamiltonian Monte Carlo and the No-U-Turn Sampler (NUTS) (NUTS), typically fail because of the computational costs associated with the evaluation of the likelihood. The conventional approaches to scalable Bayesian inference are in general not applicable here because of the highly correlated structure of $d$ or the nature of calibration itself. Indeed, parallelization of MCMC (Neiswanger2014) works in the case of independent observations, and Gaussian process approximation methods are developed in the context of regression problems (QuinoneroCandela; pmlrv5titsias09a). We emphasize that our context is much more complex and that our approach is not developing parallel computing, but rather exploiting the probabilistic theory of approximation to reduce the computational cost.

2.1 Variational Bayes Inference (VBI)
VBI is an optimization-based method that approximates the posterior $p(\gamma \mid d)$ by a family of distributions $q(\gamma; \lambda)$ over the latent variables $\gamma$ with its own variational parameter $\lambda$. Many commonly used families exist, with the simplest, the mean-field family, assuming independence of all the components in $\gamma$; see pmlrv38hoffman15; Tran2015; Ranganath:2016 for more examples. The approximate distribution is chosen to satisfy
(3) $\lambda^* = \operatorname*{arg\,min}_{\lambda} \mathrm{KL}\big(q(\gamma; \lambda) \,\|\, p(\gamma \mid d)\big).$
Here, $\mathrm{KL}$ denotes the Kullback-Leibler divergence of $p(\gamma \mid d)$ from $q(\gamma; \lambda)$. Finding $q(\gamma; \lambda^*)$ is done in practice by maximizing the evidence lower bound (ELBO)
(4) $\mathcal{L}(\lambda) = \mathbb{E}_{q(\gamma; \lambda)}\big[\log p(d \mid \gamma)\big] - \mathrm{KL}\big(q(\gamma; \lambda) \,\|\, p(\gamma)\big),$
which is the sum of the expected data log-likelihood and the negative divergence between the combined prior distribution $p(\gamma)$ of calibration parameters and hyperparameters and the variational distribution. Maximizing the ELBO is equivalent to minimizing the original objective function (3). Note that we write $q(\gamma)$ in place of $q(\gamma; \lambda)$ for ease of notation. The ELBO is typically optimized via coordinate- or gradient-ascent methods. These techniques are inefficient for large datasets, because we must optimize the variational parameters globally for the whole dataset. Instead, it has become common practice to use a stochastic gradient ascent (SGA) algorithm, which Hoffman13a named “stochastic variational inference” (SVI). Similarly to traditional gradient ascent, SGA updates the variational parameter at the $t$-th iteration with $\lambda_{t+1} = \lambda_t + \rho_t \hat{\nabla}_{\lambda} \mathcal{L}(\lambda_t)$. Here, $\hat{\nabla}_{\lambda} \mathcal{L}$ is a realization of a random variable $\hat{G}$, so that $\mathbb{E}[\hat{G}] = \nabla_{\lambda} \mathcal{L}$, and Ranganath14 shows that the gradient of the ELBO with respect to the variational parameter can be written as
(5) $\nabla_{\lambda} \mathcal{L} = \mathbb{E}_{q(\gamma; \lambda)}\Big[\nabla_{\lambda} \log q(\gamma; \lambda)\, \big(\log p(d, \gamma) - \log q(\gamma; \lambda)\big)\Big],$
where $\nabla_{\lambda} \log q(\gamma; \lambda)$ is the gradient of the variational log-likelihood with respect to $\lambda$. SGA converges to a local maximum of $\mathcal{L}$ (global for concave $\mathcal{L}$ (bottou97)) when the learning rate $\rho_t$ follows the Robbins-Monro conditions (robbins1951)
(6) $\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty.$
The bottleneck in the computation of (5) is the evaluation of the log-likelihood $\log p(d \mid \gamma)$, which makes traditional gradient methods as hard to scale as MCMC methods. SGA algorithms address this challenge. Indeed, if we consider independent observations $y_1, \dots, y_n$, then we can define a noisy estimate of the gradient as
(7) $\hat{\nabla}_{\lambda} \mathcal{L} = \nabla_{\lambda} \log q(\gamma; \lambda)\, \big(n \log p(y_I \mid \gamma) + \log p(\gamma) - \log q(\gamma; \lambda)\big),$
where $I \sim \mathrm{Uniform}\{1, \dots, n\}$, so that $\mathbb{E}[\hat{\nabla}_{\lambda} \mathcal{L}] = \nabla_{\lambda} \mathcal{L}$. Each update of $\lambda$ computes the likelihood of only one observation at a time, which makes SVI scalable for large datasets. One can easily see that, under the framework for Bayesian calibration, the likelihood $p(d \mid \gamma)$ does not factorize over observations and that the corresponding SVI does not scale.
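As a concrete illustration of (5)-(7), the sketch below implements the score-function gradient estimator with data subsampling for a toy model (independent Gaussian observations with a mean-field Gaussian variational family). The toy model and all names are illustrative, not the paper's; the final check verifies that the data-subsampling step alone is exactly unbiased.

```python
import math
import random

# Toy setting: y_i ~ N(theta, 1) independently, prior theta ~ N(0, 1),
# variational family q(theta; lam) = N(lam_mu, exp(lam_ls)^2).
# Hypothetical names; a sketch of the score-function estimator (5) and (7).

def log_norm_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def score_grad_sample(y, lam_mu, lam_ls, rng):
    """One draw of the noisy ELBO gradient wrt lam_mu using a single data point."""
    sigma_q = math.exp(lam_ls)
    theta = rng.gauss(lam_mu, sigma_q)        # theta ~ q(theta; lam)
    i = rng.randrange(len(y))                 # I ~ Uniform{1, ..., n}
    n = len(y)
    log_joint = n * log_norm_pdf(y[i], theta, 1.0) + log_norm_pdf(theta, 0.0, 1.0)
    log_q = log_norm_pdf(theta, lam_mu, sigma_q)
    score = (theta - lam_mu) / sigma_q**2     # d/d lam_mu of log q
    return score * (log_joint - log_q)

# Unbiasedness of the subsampling step: averaging n * log p(y_I | theta)
# over all choices of I recovers the full log-likelihood exactly.
y = [0.3, -1.2, 0.7, 2.1]
theta0 = 0.5
full = sum(log_norm_pdf(yi, theta0, 1.0) for yi in y)
sub_avg = sum(len(y) * log_norm_pdf(yi, theta0, 1.0) for yi in y) / len(y)
```

With dependent observations this estimator is unavailable, because the likelihood no longer splits into per-observation terms; this is the gap the vine-copula decomposition of Section 3 fills.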
3 Variational Calibration of Computer Models
In this section, we derive an algorithm for a scalable variational inference approach to Bayesian computer model calibration. The first step is finding a convenient decomposition of the likelihood that allows for an unbiased stochastic estimate of the gradient that depends only on a small subset of the data. Multivariate copulas, and specifically their pairwise construction, which we shall introduce below, provide such a decomposition. We are not the first to use copulas in the context of VBI. For instance, Tran2015 and Smith2020 proposed a multivariate copula as a possible variational family. However, we are the first to use copulas in the context of computer model calibration implemented via VBI.
3.1 Multivariate Copulas and Likelihood Decomposition
Fundamentally, a copula separates the information on the dependence structure of random variables $X_1, \dots, X_n$ from their marginal distributions. Let us assume, for simplicity, that the marginal CDFs $F_1, \dots, F_n$ are continuous and possess inverse functions $F_1^{-1}, \dots, F_n^{-1}$. It follows from the probability integral transform that $F_i(X_i) \sim \mathrm{Uniform}(0, 1)$ and, conversely, that $F_i^{-1}(U_i) \sim F_i$ for $U_i \sim \mathrm{Uniform}(0, 1)$. With this in mind, we have
$F(x_1, \dots, x_n) = P(X_1 \le x_1, \dots, X_n \le x_n) = C\big(F_1(x_1), \dots, F_n(x_n)\big).$
The function $C$ is a distribution with support on $[0, 1]^n$ and uniform marginals, and is called a copula. A one-to-one correspondence exists between the copula and the distribution of $(X_1, \dots, X_n)$, as stated in the following theorem due to Skla59. To keep the notation consistent and readable, we restate the theorem here.
Theorem 1 (Skla59)
Given random variables $X_1, \dots, X_n$ with continuous marginals $F_1, \dots, F_n$ and joint distribution function $F$, there exists a unique copula $C$ such that for all $(x_1, \dots, x_n)$:
$F(x_1, \dots, x_n) = C\big(F_1(x_1), \dots, F_n(x_n)\big).$
Conversely, given marginals $F_1, \dots, F_n$ and a copula $C$, the function $F$ defined through the expression above is an $n$-variate distribution function with marginals $F_1, \dots, F_n$. Consequently, one can write the joint pdf of $(X_1, \dots, X_n)$ as
(8) $f(x_1, \dots, x_n) = c\big(F_1(x_1), \dots, F_n(x_n)\big) \prod_{i=1}^{n} f_i(x_i),$
where $c$ represents the copula density and $f_i$ is the marginal density of $X_i$.
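The factorization (8) can be checked numerically in the bivariate Gaussian case: the joint density equals the Gaussian pair-copula density times the two marginal densities. A small self-contained check, with function names of our own choosing:

```python
import math
from statistics import NormalDist

std = NormalDist()  # standard normal: pdf, cdf, inv_cdf

def gauss_pair_copula_density(u, v, rho):
    """Bivariate Gaussian copula density c(u, v; rho)."""
    zi, zj = std.inv_cdf(u), std.inv_cdf(v)
    r2 = rho * rho
    return (1.0 / math.sqrt(1.0 - r2)) * math.exp(
        -(r2 * (zi * zi + zj * zj) - 2.0 * rho * zi * zj) / (2.0 * (1.0 - r2))
    )

def bivariate_normal_pdf(x, y, rho):
    """Standard bivariate normal density with correlation rho."""
    z = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-z / 2.0) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

# Check (8): joint pdf == copula density evaluated at the marginal CDFs
# times the marginal pdfs.
x, y, rho = 0.4, -1.1, 0.6
lhs = bivariate_normal_pdf(x, y, rho)
rhs = gauss_pair_copula_density(std.cdf(x), std.cdf(y), rho) * std.pdf(x) * std.pdf(y)
```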
The key reason for considering copulas is that one can decompose the $n$-dimensional copula density into a product of bivariate copulas. The starting point for this construction is the recursive decomposition of the density $f$ into a product of conditional densities
(9) $f(x_1, \dots, x_n) = f(x_1) \prod_{t=2}^{n} f(x_t \mid x_1, \dots, x_{t-1}).$
For $n = 2$, Sklar’s theorem implies that $f(x_i, x_j) = c_{ij}\big(F_i(x_i), F_j(x_j)\big)\, f_i(x_i)\, f_j(x_j)$ and
(10) $f(x_i \mid x_j) = c_{ij}\big(F_i(x_i), F_j(x_j)\big)\, f_i(x_i),$
where $c_{ij}$ is a density of the copula $C_{ij}$. Using (10) for the decomposition of $x_t$ given a conditioning vector $v$, we obtain
(11) $f(x_t \mid v) = c_{t j \mid v_{-j}}\big(F(x_t \mid v_{-j}), F(v_j \mid v_{-j})\big)\, f(x_t \mid v_{-j}),$
where $v_j$ is an arbitrarily chosen component of $v$ and $v_{-j}$ denotes the vector $v$ with $v_j$ removed.
Using (9) and (11) with a specific choice of indices, we have that
(12) $f(x_1, \dots, x_n) = \prod_{k=1}^{n} f_k(x_k) \prod_{j=1}^{n-1} \prod_{i=1}^{n-j} c_{i, i+j \mid i+1, \dots, i+j-1}\big(F(x_i \mid x_{i+1}, \dots, x_{i+j-1}),\, F(x_{i+j} \mid x_{i+1}, \dots, x_{i+j-1})\big).$
Note that the $c_{i, i+j \mid i+1, \dots, i+j-1}$ are two-dimensional copula densities evaluated at the conditional CDFs $F(x_i \mid x_{i+1}, \dots, x_{i+j-1})$ and $F(x_{i+j} \mid x_{i+1}, \dots, x_{i+j-1})$. The decomposition above is called a D-vine distribution. A similar class of decompositions is possible when one applies (10) with the conditioning variables chosen as $x_1, \dots, x_{j-1}$ to get a canonical vine (C-vine) (Kurowicka2006):
$f(x_1, \dots, x_n) = \prod_{k=1}^{n} f_k(x_k) \prod_{j=1}^{n-1} \prod_{i=1}^{n-j} c_{j, j+i \mid 1, \dots, j-1}\big(F(x_j \mid x_1, \dots, x_{j-1}),\, F(x_{j+i} \mid x_1, \dots, x_{j-1})\big).$
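To make the D-vine and C-vine index patterns concrete, the following sketch enumerates the pair-copula terms of each decomposition; the function names are our own.

```python
def dvine_pairs(n):
    """Pair-copula terms c_{i, i+j | i+1, ..., i+j-1} of a D-vine on n variables.

    Returns a list of (i, k, conditioning_set) with tree level j = k - i.
    Indices are 1-based to match the usual vine notation.
    """
    terms = []
    for j in range(1, n):                  # tree level
        for i in range(1, n - j + 1):
            terms.append((i, i + j, tuple(range(i + 1, i + j))))
    return terms

def cvine_pairs(n):
    """Pair-copula terms c_{j, j+i | 1, ..., j-1} of a canonical (C-) vine."""
    terms = []
    for j in range(1, n):
        for i in range(1, n - j + 1):
            terms.append((j, j + i, tuple(range(1, j))))
    return terms
```

For five variables, both lists contain the ten possible pairs, each appearing exactly once as a conditioned set, e.g. `(1, 3, (2,))` for the D-vine term $c_{1,3\mid 2}$.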
One can imagine that many such decompositions exist. bedford2002 observed that these can be represented graphically as a sequence of nested trees with undirected edges; these are referred to as vine trees, and their decompositions as regular vines.
Here, we focus exclusively on D-vine and C-vine decompositions because they represent the most-studied instances of regular vines and provide especially efficient notation. We note, however, that the following results can be extended to any regular vine.
Properties of vine copulas (Kurowicka2006):
The vine copula construction is attractive for two reasons. First, each pair of variables occurs only once as a conditioned set. Second, the bivariate copulas have a convenient form in the case of a Gaussian likelihood. Let $X = (X_1, \dots, X_n)$ follow a multivariate normal distribution with standardized margins, and let $u_i = \Phi(x_i)$, where $\Phi$ is the standard normal CDF. The bivariate copula density is
(13) $c_{i j \mid D}(u_i, u_j) = \frac{1}{\sqrt{1 - \rho^2}} \exp\left( -\frac{\rho^2 (z_i^2 + z_j^2) - 2 \rho z_i z_j}{2 (1 - \rho^2)} \right).$
Here, $z_i = \Phi^{-1}(u_i)$, $z_j = \Phi^{-1}(u_j)$, and $\rho = \rho_{i j \mid D}$ is the partial correlation of the variables $X_i$ and $X_j$ given the conditioning set $D$. The D-vine and C-vine decompositions also involve conditional CDFs, for which we need further expressions. Let $v_j$ be a component of the conditioning vector $v$ and $v_{-j}$ the vector $v$ with $v_j$ removed. When $v$ contains more than one element, $F(x \mid v)$ is typically computed recursively as $F(x \mid v) = h\big(F(x \mid v_{-j}), F(v_j \mid v_{-j}); \rho\big)$, and the function $h$ is, for the Gaussian case, given by
(14) $h(u, v; \rho) = \Phi\left( \frac{\Phi^{-1}(u) - \rho\, \Phi^{-1}(v)}{\sqrt{1 - \rho^2}} \right).$
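A minimal implementation of the Gaussian h-function (14), checked against the closed-form conditional CDF of a bivariate normal pair; names are illustrative.

```python
import math
from statistics import NormalDist

std = NormalDist()

def h_gauss(u, v, rho):
    """Gaussian h-function (14): conditional CDF on the copula scale,
    h(u, v; rho) = Phi((Phi^-1(u) - rho Phi^-1(v)) / sqrt(1 - rho^2))."""
    return std.cdf((std.inv_cdf(u) - rho * std.inv_cdf(v)) / math.sqrt(1.0 - rho * rho))

# Sanity checks: independence gives h(u, v; 0) = u, and for a standard
# bivariate normal pair the h-function reproduces the conditional CDF
#   F(x | y) = Phi((x - rho * y) / sqrt(1 - rho^2)).
x, y, rho = 0.8, -0.3, 0.5
via_h = h_gauss(std.cdf(x), std.cdf(y), rho)
direct = std.cdf((x - rho * y) / math.sqrt(1.0 - rho * rho))
```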
3.2 Scalable Algorithm with Truncated Vine Copulas
We now consider the data likelihood according to (1) and make use of vines to construct a noisy estimate of the gradient $\nabla_{\lambda} \mathcal{L}$. The log-likelihood can be rewritten according to the D-vine decomposition as
(15) $\log p(d \mid \gamma) = \sum_{k=1}^{N} \log f_k(d_k \mid \gamma) + \sum_{j=1}^{N-1} \sum_{i=1}^{N-j} \log c_{i, i+j \mid i+1, \dots, i+j-1},$
where $N$ denotes the total number of data points in $d$ and each pair-copula term is evaluated at the corresponding conditional CDFs. This can be conveniently used in the expression of the ELBO gradient. For the D-vine, we have that
(16) $\nabla_{\lambda} \mathcal{L} = \mathbb{E}_{q(\gamma; \lambda)}\bigg[\nabla_{\lambda} \log q(\gamma; \lambda) \bigg( \sum_{k=1}^{N} \log f_k(d_k \mid \gamma) + \sum_{j=1}^{N-1} \sum_{i=1}^{N-j} \log c_{i, i+j \mid i+1, \dots, i+j-1} + \log p(\gamma) - \log q(\gamma; \lambda) \bigg)\bigg].$
Additionally, if we consider the bijection
(17) $\kappa : \Big\{1, \dots, \tfrac{N(N-1)}{2}\Big\} \to \big\{(i, j) : 1 \le j \le N - 1,\ 1 \le i \le N - j\big\}$
and the random variable $S \sim \mathrm{Uniform}\big\{1, \dots, \tfrac{N(N-1)}{2}\big\}$, we define an estimate of the gradient as
(18) $\hat{\nabla}_{\lambda} \mathcal{L} = \nabla_{\lambda} \log q(\gamma; \lambda) \bigg( \sum_{k=1}^{N} \log f_k(d_k \mid \gamma) + \tfrac{N(N-1)}{2} \log c_{\kappa(S)} + \log p(\gamma) - \log q(\gamma; \lambda) \bigg).$
This is unbiased (i.e., $\mathbb{E}[\hat{\nabla}_{\lambda} \mathcal{L}] = \nabla_{\lambda} \mathcal{L}$), as desired. Similarly, we can derive a noisy estimate of the gradient using the C-vine. We leave the details to Appendix A. As in the case of SVI for independent data, these noisy estimates allow us to update the variational parameter without the need to evaluate the whole likelihood. We need to consider only the data consisting of a copula’s conditioning and conditioned sets. Unfortunately, both estimates can be relatively costly to compute for large datasets because of the recursive nature of the calculations involved in the copula densities’ evaluation. According to VinesTruncated01; BRECHMANN201519, and DIMANN201352, the most important and strongest dependencies among variables can typically be captured best by the pair copulas of the first trees. This notion motivates the use of truncated vine copulas, where the copulas associated with higher-order trees are set to independence copulas. From the definition of a regular vine, one can show that a joint density can be decomposed as
$f(x_1, \dots, x_n) = \prod_{k=1}^{n} f_k(x_k) \prod_{i=1}^{n-1} \prod_{e \in E_i} c_{j(e), k(e) \mid D(e)},$
where $e$ is an edge in the $i$-th tree of the vine specification, $j(e)$ and $k(e)$ are the conditioned variables, and $D(e)$ is the conditioning set. We define the truncated regular vine copula as follows.
Definition 1 (VinesTruncated01)
Let $U = (U_1, \dots, U_n)$ be a random vector with uniform marginals, and let $l$ be the truncation level. Let $\Pi$ denote the bivariate independence copula. Then $U$ is said to be distributed according to an $n$-dimensional $l$-truncated R-vine copula if its copula is an $n$-dimensional R-vine copula with $c_e = \Pi$ for every edge $e$ in the trees $T_{l+1}, \dots, T_{n-1}$.
For the case of an $l$-truncated D-vine, we have
(19) $f(x_1, \dots, x_n) = \prod_{k=1}^{n} f_k(x_k) \prod_{j=1}^{l} \prod_{i=1}^{n-j} c_{i, i+j \mid i+1, \dots, i+j-1}.$
If the copula of $d$ is distributed according to an $l$-truncated D-vine, we can rewrite the log-likelihood as
(20) $\log p_l(d \mid \gamma) = L_1(\gamma) + L_2(\gamma),$
where
(21) $L_1(\gamma) = \sum_{k=1}^{N} \log f_k(d_k \mid \gamma),$
(22) $L_2(\gamma) = \sum_{j=1}^{l} \sum_{i=1}^{N-j} \log c_{i, i+j \mid i+1, \dots, i+j-1}.$
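As an illustration of (19)-(22), the sketch below assembles the $l$-truncated D-vine log-likelihood for a toy stationary Gaussian AR(1) sequence: standard normal marginals, Gaussian first-tree copulas with the lag-1 correlation, and, by the Markov property, vanishing partial correlations in all higher trees, so the 1-truncated vine is exact. All names are illustrative, not the paper's.

```python
import math
from statistics import NormalDist

std = NormalDist()

def log_gauss_pair_copula(u, v, rho):
    """Log-density of the bivariate Gaussian copula, cf. (13)."""
    zi, zj = std.inv_cdf(u), std.inv_cdf(v)
    r2 = rho * rho
    return (-0.5 * math.log(1.0 - r2)
            - (r2 * (zi * zi + zj * zj) - 2.0 * rho * zi * zj) / (2.0 * (1.0 - r2)))

def truncated_dvine_loglik(x, rho, level):
    """l-truncated D-vine log-likelihood (20)-(22) for the AR(1) toy case:
    trees beyond the first carry independence copulas (log-density zero)."""
    n = len(x)
    marg = sum(math.log(std.pdf(xi)) for xi in x)   # L1: marginal terms (21)
    pair = 0.0                                      # L2: pair-copula terms (22)
    for j in range(1, level + 1):
        rho_j = rho if j == 1 else 0.0              # zero partial correlations beyond tree 1
        if rho_j != 0.0:
            for i in range(n - j):
                pair += log_gauss_pair_copula(std.cdf(x[i]), std.cdf(x[i + j]), rho_j)
    return marg + pair

# For a Markov (AR(1)) sequence the 1-truncated D-vine matches the exact
# Markov factorization f(x_1) * prod_t f(x_t | x_{t-1}).
x, rho = [0.2, -0.5, 1.3, 0.1], 0.7
exact = math.log(std.pdf(x[0]))
for t in range(1, len(x)):
    exact += math.log(NormalDist(rho * x[t - 1], math.sqrt(1.0 - rho * rho)).pdf(x[t]))
vine = truncated_dvine_loglik(x, rho, level=1)
```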
The main idea for the scalable variational calibration (VC) of computer models is replacing the full log-likelihood in the definition of the ELBO with the likelihood based on a truncated vine copula. This yields the $l$-truncated ELBO for the $l$-truncated D-vine:
(23) $\mathcal{L}_l(\lambda) = \mathbb{E}_{q(\gamma; \lambda)}\big[L_1(\gamma) + L_2(\gamma)\big] - \mathrm{KL}\big(q(\gamma; \lambda) \,\|\, p(\gamma)\big).$
Given the bijection
(24) $\kappa_l : \Big\{1, \dots, \sum_{j=1}^{l} (N - j)\Big\} \to \big\{(i, j) : 1 \le j \le l,\ 1 \le i \le N - j\big\}$
and $S \sim \mathrm{Uniform}\big\{1, \dots, \sum_{j=1}^{l} (N - j)\big\}$, we can get an estimate of the gradient from
$\hat{\nabla}_{\lambda} \mathcal{L}_l = \nabla_{\lambda} \log q(\gamma; \lambda) \bigg( L_1(\gamma) + \Big(\sum_{j=1}^{l} (N - j)\Big) \log c_{\kappa_l(S)} + \log p(\gamma) - \log q(\gamma; \lambda) \bigg),$
which is unbiased since $\mathbb{E}[\hat{\nabla}_{\lambda} \mathcal{L}_l] = \nabla_{\lambda} \mathcal{L}_l$. We can analogously derive an unbiased estimate of the gradient using the C-vine (see Appendix A). Considering the $l$-truncated ELBO (23), our proposed algorithm for the VC of computer models with truncated vine copulas is stated in Algorithm 1. Note that $\hat{\nabla}_{\lambda} \mathcal{L}_l$ does not have a closed-form expression in general due to the expectations involved in the computation. Therefore, we resort to a Monte Carlo (MC) approximation of $\hat{\nabla}_{\lambda} \mathcal{L}_l$ using samples from the variational distribution.

Scalability Discussion:
The complexity of a bivariate copula evaluation depends on the size of its conditioning set due to the recursive nature of the calculations (Kurowicka2006). From the vine tree construction, the cardinality of the conditioning set for the D-vine and C-vine is in the worst case $N - 2$. Nevertheless, on average, we can do better. Indeed, let $B$ be the cardinality of the conditioning set of a pair copula chosen uniformly at random from the full decomposition; then
(25) $\mathbb{E}[B] = \frac{\sum_{j=1}^{N-1} (N - j)(j - 1)}{N(N-1)/2} = \frac{N - 2}{3},$
and so the cardinality of the conditioning set is on average roughly $N/3$. On the other hand, the cardinality of the conditioning set is, for the case of Algorithm 1, at most $l - 1$. Now, let $B_l$ be the cardinality of the conditioning set in the updating step of the variational parameter in Algorithm 1; then
(26) $\mathbb{E}[B_l] = \frac{\sum_{j=1}^{l} (N - j)(j - 1)}{\sum_{j=1}^{l} (N - j)} \le l - 1,$
and $\mathbb{E}[B_l] \approx \frac{l - 1}{2}$ for $N \gg l$ and truncation level $l$, which is a significant improvement over the average case $\frac{N-2}{3}$ and the worst case $N - 2$ of the full decomposition (both of which grow with $N$). This provides a heuristic yet convincing argument for the scalability of the proposed algorithm.
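The counting argument behind (25) and (26) can be checked numerically; the helper below lists the conditioning-set sizes of a D-vine, optionally truncated, and the names are ours.

```python
def dvine_conditioning_sizes(n, level=None):
    """Sizes |D(e)| of the conditioning sets of D-vine pair copulas:
    tree j contributes (n - j) copulas, each with a conditioning set of
    size j - 1.  `level` restricts to the first l trees (l-truncated vine)."""
    top = level if level is not None else n - 1
    return [j - 1 for j in range(1, top + 1) for _ in range(n - j)]

def average(xs):
    return sum(xs) / len(xs)

n = 300
full_avg = average(dvine_conditioning_sizes(n))            # (n - 2) / 3, grows with n
trunc_avg = average(dvine_conditioning_sizes(n, level=3))  # bounded by l - 1 = 2
```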
4 Implementation Details
4.1 Selection of Truncation Level
Selection of the truncation level $l$ is an important element of an effective approximation of the posterior distribution under Algorithm 1. DIMANN201352 propose a sequential approach for the selection of $l$ in the case of vine estimation: one sequentially fits models with an increasing truncation level until the quality of fit stays stable or computational resources are depleted. We adopt a similar idea for the case of VC of computer models with vine copulas. Let $\lambda^{(l)}$ represent the value of the variational parameter estimated with Algorithm 1 for a fixed truncation level $l$. One can then sequentially increase $l$ until $\Delta\big(\lambda^{(l)}, \lambda^{(l+1)}\big) < \epsilon$ for some distance metric $\Delta$ and a desired tolerance $\epsilon$.
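The sequential rule can be sketched as follows, with hypothetical `fit` and `distance` callables standing in for a run of Algorithm 1 and the chosen metric:

```python
def select_truncation_level(fit, distance, tol, max_level):
    """Sequential truncation-level selection sketch: increase l until the
    fitted variational parameter stabilizes.  `fit(l)` runs the calibration
    at truncation level l and returns the estimated variational parameter;
    `distance` compares successive estimates.  Illustrative names only."""
    lam_prev = fit(1)
    for level in range(2, max_level + 1):
        lam = fit(level)
        if distance(lam, lam_prev) < tol:
            return level - 1, lam_prev     # the previous level already suffices
        lam_prev = lam
    return max_level, lam_prev             # budget exhausted
```

A toy `fit` whose estimates stabilize at level 4 illustrates the stopping rule; in practice each `fit(l)` call is a full optimization run, so one typically reuses the previous level's estimate as a warm start.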
4.2 Variance Reduction of Monte Carlo Approximations
The computational convenience of simple MC approximations of the gradient estimators based on $l$-truncated D-vine and C-vine copulas (see Section 3.2) is typically accompanied by their large variance. The consequence in practice is the need for a small step size in the SGA portion of Algorithm 1, which results in slower convergence. In order to reduce the variance of the MC approximations, we adopt the same approach as OBBVI and use Rao-Blackwellization (RB) in combination with control variates (CV) (Ross:2006) and importance sampling. The remainder of this section focuses on the case of the D-vine decomposition; see Appendix A for the derivations for C-vines.
4.2.1 Rao-Blackwellization
The idea here is to replace the noisy estimate of the gradient with its conditional expectation with respect to a subset of the latent variables. For simplicity, let us consider a situation with $\gamma = (\gamma_1, \gamma_2)$ and a variational family that factorizes into $q(\gamma; \lambda) = q(\gamma_1; \lambda_1)\, q(\gamma_2; \lambda_2)$. Additionally, let $\hat{g}(\gamma_1, \gamma_2)$ be the MC approximation of the gradient with respect to $\lambda_1$. The conditional expectation $\mathbb{E}[\hat{g}(\gamma_1, \gamma_2) \mid \gamma_1]$ is also an unbiased estimate of the gradient, and the law of total variance shows that $\mathrm{Var}\big(\mathbb{E}[\hat{g} \mid \gamma_1]\big) \le \mathrm{Var}(\hat{g})$. The factorization of the variational family also makes the conditional expectation straightforward to compute: we just need to integrate out some variables. For the MC approximation of the gradient estimator $\hat{\nabla}_{\lambda} \mathcal{L}_l$, each entry of the Rao-Blackwellized estimator therefore retains only those terms of the log-likelihood, prior, and variational density that involve the corresponding component of $\gamma$.
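The Rao-Blackwell principle is easy to verify on a toy example: conditioning an estimator on a subset of its random inputs cannot increase its variance. The estimator below is illustrative, not the paper's gradient estimator.

```python
import random
from statistics import variance

# Toy check of Var(E[f(X, Y) | X]) <= Var(f(X, Y)) with X, Y ~ Uniform(0, 1):
# the crude estimator of E[X + Y] = 1 is f(X, Y) = X + Y, and its
# Rao-Blackwellized version conditions on X, giving E[f | X] = X + 1/2 exactly.
rng = random.Random(42)
samples = [(rng.random(), rng.random()) for _ in range(5000)]

f_vals = [x + y for x, y in samples]       # crude estimator
rb_vals = [x + 0.5 for x, y in samples]    # conditional expectation, Y integrated out

var_crude = variance(f_vals)
var_rb = variance(rb_vals)
```

Here the variance is halved because the two inputs contribute equally; in the calibration setting the gain depends on how many latent components each entry of the gradient actually involves.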
4.2.2 Control Variates
To further reduce the variance of the MC approximations, we replace the Rao-Blackwellized estimate above with a function that has the same expectation but again smaller variance. For illustration, let us first consider a target function $f$ whose variance we want to reduce, and a function $h$ with finite expectation. Define
(27) $\hat{f}(\gamma) = f(\gamma) - a\big(h(\gamma) - \mathbb{E}[h(\gamma)]\big),$
where $a$ is a scalar and $\mathbb{E}[\hat{f}(\gamma)] = \mathbb{E}[f(\gamma)]$. The variance of $\hat{f}$ is
(28) $\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) + a^2\, \mathrm{Var}(h) - 2a\, \mathrm{Cov}(f, h).$
This shows that a good choice for the function $h$ is one that has high covariance with $f$. Moreover, the value of $a$ that minimizes (28) is
(29) $a^* = \frac{\mathrm{Cov}(f, h)}{\mathrm{Var}(h)}.$
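A small numerical check of (27)-(29): using a control variate with known mean and the optimal coefficient $a^*$ reduces the variance of an MC estimate while leaving its expectation unchanged. The toy target is ours, not the paper's gradient estimator.

```python
import math
import random
from statistics import mean, variance

# Estimate E[e^X] = e - 1 for X ~ Uniform(0, 1) with the control h(X) = X - 1/2,
# whose expectation is known to be zero and which is highly correlated with e^X.
rng = random.Random(0)
xs = [rng.random() for _ in range(4000)]

f_vals = [math.exp(x) for x in xs]
h_vals = [x - 0.5 for x in xs]

# Optimal coefficient a* = Cov(f, h) / Var(h), cf. (29), estimated from the draws.
cov_fh = mean(f * h for f, h in zip(f_vals, h_vals)) - mean(f_vals) * mean(h_vals)
a_star = cov_fh / variance(h_vals)
cv_vals = [f - a_star * h for f, h in zip(f_vals, h_vals)]
```

As the text notes, estimating $a^*$ from the same draws introduces a small bias; the paper's recipe of using additional independent draws from the variational approximation avoids this.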
Let us place CV back into the context of calibration. Meeting the criteria described above, Ranganath14 propose taking $h$ to be the score function $\nabla_{\lambda} \log q(\gamma; \lambda)$, because it depends only on the variational distribution and has expectation zero. We can then set the target function $f$ to be the Rao-Blackwellized MC gradient estimate, which gives the corresponding entry of the MC approximation of the gradient estimator with CV, where $a^*$ is estimated via (29) based on additional independent draws from the variational approximation (otherwise the estimator would be biased).
4.2.3 Importance Sampling
Here, we outline the last variance reduction technique, which makes use of importance sampling. We refer to OBBVI for a full description of the method and an illustration of its efficiency in the VI framework. Fundamentally, instead of taking samples from the variational family to carry out the MC approximation of the ELBO gradient, we take samples from an over-dispersed distribution $r(\gamma; \lambda, \tau)$ in the same family that depends on an additional dispersion parameter $\tau$. Each sampled term is then multiplied by the importance weight $w(\gamma) = q(\gamma; \lambda) / r(\gamma; \lambda, \tau)$, which guarantees that the estimator remains unbiased. Combining the ideas of Rao-Blackwellization, CV, and importance sampling yields the final form of the MC approximation of the gradient estimator used in our algorithm.
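A minimal sketch of the reweighting step, assuming a univariate Gaussian variational family and an over-dispersed proposal with inflated scale; the target expectation and all names are illustrative.

```python
import random
from statistics import NormalDist

# Estimate E_q[theta^2] = 1 for q = N(0, 1) by sampling from the
# over-dispersed proposal r = N(0, tau^2) with tau = 2 and reweighting
# each draw by w = q / r, which keeps the estimator unbiased.
q = NormalDist(0.0, 1.0)          # variational distribution
r = NormalDist(0.0, 2.0)          # over-dispersed proposal, dispersion tau = 2

rng = random.Random(7)
draws = [rng.gauss(0.0, 2.0) for _ in range(20000)]
weights = [q.pdf(t) / r.pdf(t) for t in draws]
estimate = sum(w * t * t for w, t in zip(weights, draws)) / len(draws)
```

The heavier-tailed proposal samples the regions where the integrand is large more often, which is what reduces the variance of the reweighted estimator relative to sampling from $q$ directly.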
The extension of Algorithm 1 with the variance reductions of the MC approximations due to Rao-Blackwellization, CV, and importance sampling is summarized in Algorithm 2.
4.3 Choice of the learning rate
Even though SGA is straightforward in its general definition, the choice of learning rate can be challenging in practice. Ideally, one would want the rate to be small in situations where the noisy estimates of the gradient have large variance, and vice versa. The elements of the variational parameter $\lambda$ can also differ in scale, and one needs to set the learning rate so that the SGA can accommodate even the smallest scales. The rapidly increasing usage of machine learning techniques in recent years has produced various algorithms for element-wise adaptive-scale learning rates. We use the AdaGrad algorithm (AdaGrad), which has been considered in similar problems before, e.g., by Ranganath14; other popular algorithms include ADADELTA (Zeiler2012) and RMSProp (Tieleman2012). Let $g_t$ be the gradient used in the $t$-th step of the SGA algorithm, and let $G_t$ be a matrix consisting of the sum of the outer products of these gradients across the first $t$ iterations, namely $G_t = \sum_{s=1}^{t} g_s g_s^{\top}$.
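A sketch of the element-wise AdaGrad update, which divides each coordinate's step by the square root of its accumulated squared gradients (the diagonal of $G_t$); the test objective and names are ours.

```python
import math

def adagrad_update(lam, grad, hist, eta=0.05, eps=1e-8):
    """One element-wise AdaGrad step (illustrative sketch).  `hist` accumulates
    per-coordinate squared gradients, i.e. the diagonal of G_t; each coordinate
    gets an effective step eta / sqrt(hist + eps), so small-scale or rarely
    updated coordinates keep larger effective learning rates."""
    new_hist = [h + g * g for h, g in zip(hist, grad)]
    new_lam = [l + eta * g / math.sqrt(h + eps)
               for l, g, h in zip(lam, grad, new_hist)]
    return new_lam, new_hist

# Maximizing -(x^2 + 10 y^2) via its gradient (-2x, -20y): the coordinates
# differ in scale by a factor of ten, which AdaGrad absorbs automatically.
lam, hist = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    grad = [-2.0 * lam[0], -20.0 * lam[1]]
    lam, hist = adagrad_update(lam, grad, hist)
```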