Variational Inference with Vine Copulas: An Efficient Approach for Bayesian Computer Model Calibration

03/28/2020
by Vojtech Kejzlar, et al.

With the advancements of computer architectures, the use of computational models has proliferated to solve complex problems in many scientific applications such as nuclear physics and climate research. However, the potential of such models is often hindered because they tend to be computationally expensive and consequently ill-suited for uncertainty quantification. Furthermore, they are usually not calibrated with real-time observations. We develop a computationally efficient algorithm based on variational Bayes inference (VBI) for calibration of computer models with Gaussian processes. Unfortunately, the speed and scalability of VBI diminish when applied to the calibration framework with dependent data. To preserve the efficiency of VBI, we adopt a pairwise decomposition of the data likelihood using vine copulas that separate the information on the dependence structure in the data from their marginal distributions. We provide both theoretical and empirical evidence for the computational scalability of our methodology and describe all the necessary details for an efficient implementation of the proposed algorithm. We also demonstrate the opportunities given by our method for practitioners on a real-data example through calibration of the Liquid Drop Model of nuclear binding energies.


1 Introduction

The ever-growing access to high performance computing in scientific communities has enabled the development of complex computer models in fields such as nuclear physics, climatology, and engineering that produce massive amounts of data. These models need real-time calibration with quantified uncertainties. Bayesian methodology combined with Gaussian process modeling has been heavily utilized for the calibration of computer models due to its natural way of accounting for various sources of uncertainty; see HigdonJPG15 and King19 for examples in nuclear physics, Sexton2012 and Pollard2016 for examples in climatology, and Lawrence_2010, Plumlee16 and Zhang2019 for applications in engineering, astrophysics, and medicine.

The original framework for Bayesian calibration of computer models was developed by KoH, with extensions provided by Higdon04; Higdon08; Vlid; Plumee; Plumlee2019, and MenGu18, to name a few. Despite its popularity, however, Bayesian calibration becomes infeasible in big-data scenarios with complex and many-parameter models because it relies on Markov chain Monte Carlo (MCMC) algorithms to approximate posterior densities.

This text presents a scalable and statistically principled approach to Bayesian calibration of computer models. We offer an alternative approximation to posterior densities using variational Bayesian inference (VBI), which originated as a machine learning algorithm that approximates a target density through optimization. Statisticians and computer scientists (starting with Peterson; Jordan1999) have been using variational techniques widely because they tend to be faster and easier to scale to massive datasets. Moreover, the recently published frequentist consistency of variational Bayes by VIth established VBI as a theoretically valid procedure. The scalability of VBI in modern applications hinges on the efficiency of stochastic optimization in scenarios with independent data points (robbins1951; Hoffman13a). This efficiency, however, diminishes in the case of Bayesian calibration of computer models due to the dependence structure in the data. To maintain the speed and scalability of VBI, we adopt a pairwise decomposition of the data likelihood using vine copulas that separate the information on the dependence structure in the data from their marginal distributions (Kurowicka2006). Our specific contributions are as follows:

  1. We propose a novel version of black-box variational inference (Ranganath14) for Bayesian calibration of computer models that preserves the efficiency of stochastic optimization in scenarios with dependent data. Python code with our algorithm is available at https://github.com/kejzlarv/VBI_Calibration.

  2. We implement Rao-Blackwellization, control variates, and importance sampling to reduce the variance of noisy gradient estimates involved in our algorithm.

  3. We provide both theoretical and empirical evidence for the scalability of our methodology and establish its superiority over the Metropolis-Hastings algorithm and the No-U-Turn sampler in terms of both time efficiency and memory requirements.

  4. Finally, we demonstrate the opportunities in uncertainty quantification given by the proposed algorithm on a real-world example in the field of nuclear physics.

1.1 Outline of this paper

In Section 2, we describe the framework for Bayesian calibration of computer models and give a general overview of VBI. In Section 3, we derive our proposed VBI approach to perform inexpensive and scalable calibration. We establish the statistical validity of the method and provide theoretical justification for its scalability. Subsequently, in Section 4, we discuss the implementation details with a focus on strategies to reduce the variance of the gradient estimators that are at the center of stochastic optimization for VBI. Section 5 presents a simulation study comparing our approach with state-of-the-art methods for approximating posterior distributions and illustrates our method on a real-data application.

2 Background and Theoretical Framework

Formally, let us consider a computer model $f(x, \theta)$ relying on a parameter vector $\theta$ and a vector of inputs $x$. Let $y = (y_1, \dots, y_n)$ be a set of experimental measurements at inputs $x_1, \dots, x_n$ of a physical process $\zeta$. Model calibration corresponds to determining the unknown and hypothetical true value $\theta^*$ of the parameter $\theta$, at which the physical process would satisfy $\zeta(x_i) = f(x_i, \theta^*) + \delta(x_i)$; $\delta$ is the systematic discrepancy of the model, whose form is generally unknown. Overall, we can write the complete statistical model as

$$y_i = f(x_i, \theta^*) + \delta(x_i) + \sigma(x_i)\,\epsilon_i, \qquad i = 1, \dots, n. \qquad (1)$$

The term $\sigma(x_i)$ is a scale function to be parametrized and inferred, and the $\epsilon_i$ are independent random variables representing measurement errors, which we assume to be standard Gaussian. For computationally expensive models, the evaluations of $f$ cannot be reasonably performed at calibration runtime: they need to be done beforehand, typically on a grid or otherwise space-filling design. Common practice is to emulate the computer model by a Gaussian process with mean function $m_f$ and covariance function $k_f$. In this setup, the data also include a set of model runs $z = (z_1, \dots, z_m)$ at predetermined points. The discrepancy function $\delta$, while intrinsically deterministic, is also modeled by a Gaussian process. Under this framework, the complete dataset $d = (y, z)$ follows

$$d \mid \gamma \sim N\big(\mu(\gamma), \Sigma(\gamma)\big), \qquad (2)$$

where $\mu(\gamma)$ and $\Sigma(\gamma)$ are the mean and covariance of a multivariate normal distribution. The latent vector $\gamma$ consists of the calibration parameters $\theta$ and the hyperparameters corresponding to the parametrization of the Gaussian processes. The term “calibration” in the Bayesian paradigm includes both parameter estimation and a full evaluation of uncertainty for every parameter under a prior uncertainty expressed by $p(\gamma)$. We are therefore interested in deriving the posterior distribution $p(\gamma \mid d)$. This quickly becomes infeasible with increasing size of datasets, number of parameters, and model complexity. Traditional MCMC methods that approximate $p(\gamma \mid d)$, such as the Metropolis-Hastings (MH) algorithm (Metropolis) or more advanced ones including Hamiltonian Monte Carlo and the No-U-Turn Sampler (NUTS) (NUTS), typically fail because of the computational costs associated with the evaluation of the likelihood $p(d \mid \gamma)$. The conventional approaches to scalable Bayesian inference are in general not applicable here because of the highly correlated structure of $d$ or the nature of calibration itself. Indeed, parallelization of MCMC (Neiswanger2014) works in the case of independent observations, and Gaussian process approximation methods have been developed in the context of regression problems (QuinoneroCandela; pmlrv5titsias09a). We emphasize that our context is much more complex and that our approach does not develop parallel computing, but rather exploits probabilistic approximation theory to reduce the computational cost.
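To make the modeling assumptions above concrete, the following minimal sketch simulates data from a model of the form (1) with a GP discrepancy. The one-dimensional input, the squared-exponential kernel, and all function names are illustrative assumptions, not the implementation accompanying the paper.

```python
# Minimal sketch of the statistical model (1): observations are computer model
# output at the true parameter value plus a GP discrepancy plus Gaussian noise.
import numpy as np

def sq_exp_cov(x1, x2, scale=1.0, length=0.5):
    """Squared-exponential covariance between two sets of 1-d inputs."""
    d = x1[:, None] - x2[None, :]
    return scale**2 * np.exp(-0.5 * (d / length) ** 2)

def simulate_observations(f, x, theta, sigma=0.1, rng=None):
    """Draw y_i = f(x_i, theta) + delta(x_i) + sigma * eps_i with a GP discrepancy delta."""
    rng = np.random.default_rng(rng)
    n = len(x)
    cov = sq_exp_cov(x, x) + 1e-10 * np.eye(n)          # jitter for numerical stability
    delta = rng.multivariate_normal(np.zeros(n), cov)   # one draw of the GP discrepancy
    eps = rng.standard_normal(n)                        # standard Gaussian measurement errors
    return f(x, theta) + delta + sigma * eps

# Toy "computer model" and 50 synthetic measurements.
x_obs = np.linspace(0.0, 1.0, 50)
y_obs = simulate_observations(lambda x, t: t[0] * np.sin(2 * np.pi * x) + t[1],
                              x_obs, theta=np.array([1.5, 0.3]), rng=0)
```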

2.1 Variational Bayes Inference (VBI)

VBI is an optimization-based method that approximates the posterior $p(\gamma \mid d)$ by a member of a family of distributions $q(\gamma; \lambda)$ over the latent variables, indexed by a variational parameter $\lambda$. Many commonly used families exist, with the simplest mean-field family assuming independence of all the components in $\gamma$; see pmlr-v38-hoffman15; Tran2015; Ranganath:2016 for more examples. The approximate distribution is chosen to satisfy

$$q^*(\gamma) = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; KL\big(q(\gamma; \lambda) \,\|\, p(\gamma \mid d)\big). \qquad (3)$$

Here, $KL(q \,\|\, p)$ denotes the Kullback-Leibler divergence of $q(\gamma; \lambda)$ from $p(\gamma \mid d)$. Finding $q^*$ is done in practice by maximizing the evidence lower bound (ELBO)

$$\mathcal{L}(\lambda) = E_{q}\big[\log p(d \mid \gamma)\big] - KL\big(q(\gamma; \lambda) \,\|\, p(\gamma)\big), \qquad (4)$$

which is the sum of the expected data log-likelihood and the negative divergence between the variational distribution and the combined prior distribution of calibration parameters and hyperparameters. Maximizing the ELBO is equivalent to minimizing the KL divergence in (3). Note that we write $q(\gamma)$ for $q(\gamma; \lambda)$ for ease of notation. The ELBO is typically optimized via coordinate- or gradient-ascent methods. These techniques are inefficient for large datasets, because we must optimize the variational parameters globally for the whole dataset. Instead, it has become common practice to use a stochastic gradient ascent (SGA) algorithm, which Hoffman13a named “stochastic variational inference” (SVI). Similarly to traditional gradient ascent, SGA updates the variational parameter at iteration $t$ as $\lambda_{t+1} = \lambda_t + \rho_t \tilde{\nabla}_\lambda \mathcal{L}$. Here, $\tilde{\nabla}_\lambda \mathcal{L}$ is a realization of a random variable whose expectation is the gradient $\nabla_\lambda \mathcal{L}$, and Ranganath14 shows that the gradient of the ELBO with respect to the variational parameter can be written as

$$\nabla_\lambda \mathcal{L} = E_{q}\Big[\nabla_\lambda \log q(\gamma; \lambda)\,\big(\log p(d, \gamma) - \log q(\gamma; \lambda)\big)\Big], \qquad (5)$$

where $\nabla_\lambda \log q(\gamma; \lambda)$ is the gradient of the variational log-likelihood with respect to $\lambda$. SGA converges to a local maximum of $\mathcal{L}$ (a global one for concave $\mathcal{L}$ (bottou-97)) when the learning rate $\rho_t$ follows the Robbins-Monro conditions (robbins1951)

$$\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty. \qquad (6)$$

The bottleneck in the computation of $\nabla_\lambda \mathcal{L}$ is the evaluation of the log-likelihood $\log p(d \mid \gamma)$, which makes traditional gradient methods as hard to scale as MCMC methods. SGA algorithms address this challenge. Indeed, if we consider independent observations $d_1, \dots, d_n$, then we can define a noisy estimate of the gradient as

$$\tilde{\nabla}_\lambda \mathcal{L} = E_{q}\Big[\nabla_\lambda \log q(\gamma; \lambda)\,\big(n \log p(d_I \mid \gamma) + \log p(\gamma) - \log q(\gamma; \lambda)\big)\Big], \qquad (7)$$

where $I$ is drawn uniformly from $\{1, \dots, n\}$, so that the estimate is unbiased. Each update of $\lambda$ computes the likelihood of only one observation at a time, which makes SVI scalable for large datasets. One can easily see that, under the framework for Bayesian calibration, the likelihood in (2) does not factorize over independent observations, and the corresponding SVI does not scale.
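As a concrete illustration of the score-function form (5) and an SGA update obeying (6), the sketch below estimates the ELBO gradient for a one-dimensional Gaussian variational family. It is a toy under stated assumptions (a user-supplied `log_joint` returning the log joint density, a scalar latent variable); it is not the paper's implementation.

```python
# Score-function (black-box) ELBO gradient for q(gamma) = N(mu, exp(log_s)^2)
# and a stochastic gradient ascent step with a Robbins-Monro learning rate.
import numpy as np
from scipy.stats import norm

def elbo_grad_estimate(lam, log_joint, n_samples=32, rng=None):
    rng = np.random.default_rng(rng)
    mu, log_s = lam
    s = np.exp(log_s)
    grads = []
    for _ in range(n_samples):
        gamma = rng.normal(mu, s)
        # gradient of log q with respect to (mu, log_s), i.e., the score function
        score = np.array([(gamma - mu) / s**2,
                          (gamma - mu) ** 2 / s**2 - 1.0])
        weight = log_joint(gamma) - norm.logpdf(gamma, mu, s)
        grads.append(score * weight)
    return np.mean(grads, axis=0)

def sga_step(lam, grad, t, a=0.5):
    rho_t = a / (t + 1)            # satisfies the Robbins-Monro conditions (6)
    return lam + rho_t * grad

# Usage on a toy target: log joint proportional to a standard normal density.
lam = np.array([1.0, 0.0])
for t in range(200):
    lam = sga_step(lam, elbo_grad_estimate(lam, lambda g: -0.5 * g**2, rng=t), t)
```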

3 Variational Calibration of Computer Models

In this section, we derive the algorithm for our scalable variational inference approach to Bayesian computer model calibration. The first step is finding a convenient decomposition of the likelihood that allows for an unbiased stochastic estimate of the gradient that depends only on a small subset of data. Multivariate copulas, and specifically their pairwise construction which we shall introduce below, provide such a decomposition. We are not the first to use copulas in the context of VBI. For instance, Tran2015 and Smith2020 proposed a multivariate copula as a possible variational family. However, we are the first to use copulas in the context of computer model calibration implemented via VBI.

3.1 Multivariate Copulas and Likelihood Decomposition

Fundamentally, a copula separates the information on the dependence structure of random variables $X_1, \dots, X_n$ from their marginal distributions. Let us assume, for simplicity, that the marginal CDFs $F_1, \dots, F_n$ are continuous and possess inverse functions $F_1^{-1}, \dots, F_n^{-1}$. It follows from the probability integral transform that $U_i = F_i(X_i)$ is uniformly distributed on $[0, 1]$ and, conversely, that $X_i = F_i^{-1}(U_i)$. With this in mind, we have

$$F(x_1, \dots, x_n) = P(X_1 \le x_1, \dots, X_n \le x_n) = C\big(F_1(x_1), \dots, F_n(x_n)\big).$$

The function $C$ is a distribution with support on $[0, 1]^n$ and uniform marginals and is called a copula. A one-to-one correspondence exists between the copula and the distribution of $(X_1, \dots, X_n)$, as stated in the following theorem due to Skla59. To keep the notation consistent and readable, we re-state the theorem here.

Theorem 1 (Skla59)

Given random variables $X_1, \dots, X_n$ with continuous marginals $F_1, \dots, F_n$ and joint distribution function $F$, there exists a unique copula $C$ such that for all $x_1, \dots, x_n$:

$$F(x_1, \dots, x_n) = C\big(F_1(x_1), \dots, F_n(x_n)\big).$$

Conversely, given marginals $F_1, \dots, F_n$ and a copula $C$, the function $F$ defined through the expression above is an $n$-variate distribution function with marginals $F_1, \dots, F_n$.

Consequently, one can write the joint pdf of $(X_1, \dots, X_n)$ as

$$f(x_1, \dots, x_n) = c\big(F_1(x_1), \dots, F_n(x_n)\big) \prod_{i=1}^{n} f_i(x_i), \qquad (8)$$

where $c$ represents the copula density and $f_i$ is the marginal density of $X_i$.

The key reason for considering copulas is that one can decompose the $n$-dimensional copula density into a product of bivariate copulas. The starting point for this construction is the recursive decomposition of the density into a product of conditional densities

$$f(x_1, \dots, x_n) = f(x_n)\, f(x_{n-1} \mid x_n) \cdots f(x_1 \mid x_2, \dots, x_n). \qquad (9)$$

For a pair of variables $(X, V)$, Sklar's theorem implies that $f(x, v) = c_{x,v}\big(F(x), F(v)\big)\, f(x)\, f(v)$ and

$$f(x \mid v) = c_{x,v}\big(F(x), F(v)\big)\, f(x), \qquad (10)$$

where $c_{x,v}$ is the density of the copula of $(X, V)$. Using (10) for the decomposition of $x$ given a vector $\boldsymbol{v}$, we obtain

$$f(x \mid \boldsymbol{v}) = c_{x, v_j \mid \boldsymbol{v}_{-j}}\big(F(x \mid \boldsymbol{v}_{-j}),\, F(v_j \mid \boldsymbol{v}_{-j})\big)\, f(x \mid \boldsymbol{v}_{-j}), \qquad (11)$$

where $v_j$ is an arbitrary component of $\boldsymbol{v}$ and $\boldsymbol{v}_{-j}$ denotes $\boldsymbol{v}$ with the component $v_j$ removed.

Using (9) and (11) with specific index choices, we have that

$$f(x_1, \dots, x_n) = \prod_{k=1}^{n} f(x_k) \prod_{i=1}^{n-1} \prod_{j=1}^{n-i} c_{j, j+i \mid j+1, \dots, j+i-1}\big(F(x_j \mid x_{j+1}, \dots, x_{j+i-1}),\, F(x_{j+i} \mid x_{j+1}, \dots, x_{j+i-1})\big). \qquad (12)$$

Note that $c_{j, j+i \mid j+1, \dots, j+i-1}$ are two-dimensional copulas evaluated at the conditional CDFs $F(x_j \mid x_{j+1}, \dots, x_{j+i-1})$ and $F(x_{j+i} \mid x_{j+1}, \dots, x_{j+i-1})$. The decomposition above is called a D-vine distribution. A similar class of decompositions is possible when one applies (10) with a different choice of the conditioning variable in each step, which yields a canonical vine (C-vine) (Kurowicka2006):

$$f(x_1, \dots, x_n) = \prod_{k=1}^{n} f(x_k) \prod_{i=1}^{n-1} \prod_{j=1}^{n-i} c_{i, i+j \mid 1, \dots, i-1}\big(F(x_i \mid x_1, \dots, x_{i-1}),\, F(x_{i+j} \mid x_1, \dots, x_{i-1})\big).$$

One can imagine that many such decompositions exist. bedford2002 observed that these can be represented graphically as a sequence of nested trees with undirected edges, which are referred to as vine trees and their decompositions as regular vines.

Here, we focus exclusively on D-vine and C-vine decompositions because they represent the most-studied instances of regular vines and provide especially efficient notation. We note, however, that the following results can be extended to any regular vines.

Properties of vine copulas (Kurowicka2006):

The vine copula construction is attractive for two reasons. First, each pair of variables occurs only once as the conditioned set of a pair copula. Second, the bivariate copulas have a convenient form in the case of a Gaussian likelihood. Let $(X_1, \dots, X_n)$ follow a multivariate normal distribution with standardized marginals, so that $U_i = \Phi(X_i)$, where $\Phi$ is the standard normal CDF. The bivariate copula density is

$$c(u_1, u_2) = \frac{1}{\sqrt{1 - \rho^2}} \exp\left\{ -\frac{\rho^2\big(z_1^2 + z_2^2\big) - 2 \rho z_1 z_2}{2 (1 - \rho^2)} \right\}. \qquad (13)$$

Here, $z_1 = \Phi^{-1}(u_1)$, $z_2 = \Phi^{-1}(u_2)$, and $\rho$ is the partial correlation of the two conditioned variables given the variables in the conditioning set. The D-vine and C-vine decompositions also involve conditional CDFs, for which we need further expressions. Let $x$ be a scalar and $\boldsymbol{v}$ a vector of conditioning variables; when $\boldsymbol{v}$ contains more than one element, $F(x \mid \boldsymbol{v})$ is typically computed recursively as $F(x \mid \boldsymbol{v}) = h\big(F(x \mid \boldsymbol{v}_{-j}),\, F(v_j \mid \boldsymbol{v}_{-j})\big)$, and the function $h$ is for the Gaussian case given by

$$h(u_1, u_2) = \Phi\!\left( \frac{\Phi^{-1}(u_1) - \rho\, \Phi^{-1}(u_2)}{\sqrt{1 - \rho^2}} \right). \qquad (14)$$
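The two Gaussian building blocks (13) and (14) are simple enough to state in a few lines of code. The sketch below is a direct transcription under the assumption of a known (partial) correlation rho; the function names are ours, not the paper's.

```python
# Bivariate Gaussian copula density (13) and Gaussian h-function (14).
import numpy as np
from scipy.stats import norm

def gaussian_pair_copula_density(u1, u2, rho):
    """Density c(u1, u2; rho) of the bivariate Gaussian copula."""
    z1, z2 = norm.ppf(u1), norm.ppf(u2)
    return (1.0 / np.sqrt(1.0 - rho**2)
            * np.exp(-(rho**2 * (z1**2 + z2**2) - 2.0 * rho * z1 * z2)
                     / (2.0 * (1.0 - rho**2))))

def h_function(u1, u2, rho):
    """Conditional CDF h(u1, u2; rho) used in the recursive evaluation of F(x | v)."""
    return norm.cdf((norm.ppf(u1) - rho * norm.ppf(u2)) / np.sqrt(1.0 - rho**2))
```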

3.2 Scalable Algorithm with Truncated Vine Copulas

We now consider the data likelihood according to (1) and make use of vines to construct a noisy estimate of the gradient . The log-likelihood can be rewritten according to the D-vine decomposition as

(15)

where . This can be conveniently used in the expression of the ELBO gradient. For D-vine, we have that

(16)

Additionally, if we consider the bijection

(17)

and the random variable , we define an estimate of the gradient as

(18)

This is unbiased (i.e., its expectation equals the ELBO gradient) as desired. Similarly, we can derive a noisy estimate of the gradient using the C-vine. We leave the details to Appendix A. As in the case of SVI for independent data, these noisy estimates allow us to update the variational parameter without the need to evaluate the whole likelihood. We need to consider only the data in a copula's conditioning and conditioned sets. Unfortunately, both estimates can be relatively costly to compute for large datasets because of the recursive nature of the calculations involved in evaluating the copula densities. According to VinesTruncated01; BRECHMANN201519, and DIMANN201352, the most important and strongest dependencies among variables can typically be captured best by the pair copulas of the first trees. This notion motivates the use of truncated vine copulas, where the copulas associated with higher-order trees are set to independence copulas. From the definition of a regular vine, one can show that a joint density can be decomposed as

where is an edge in the tree of the vine specification. We define the truncated regular vine copula as follows.

Definition 1 (VinesTruncated01)

Let $U = (U_1, \dots, U_n)$ be a random vector with uniform marginals, and let $l \in \{1, \dots, n-1\}$ be the truncation level. Let $\Pi$ denote the bivariate independence copula. Then $U$ is said to be distributed according to an n-dimensional l-truncated R-vine copula if $U$ is distributed according to an n-dimensional R-vine copula with all pair copulas in the trees of order higher than $l$ set to $\Pi$.

For the case of an l-truncated D-vine, we have

(19)

If the copula of is distributed according to an l-truncated D-vine, we can rewrite

(20)

where , and

(21)
(22)

The main idea behind the scalable variational calibration (VC) of computer models is to replace the full log-likelihood in the definition of the ELBO with a likelihood based on a truncated vine copula. This yields the l-truncated ELBO for the l-truncated D-vine:

(23)

Given the bijection

(24)

and the corresponding random variable, we can get an estimate of the gradient, which is unbiased since its expectation recovers the gradient of the l-truncated ELBO. We can analogously derive an unbiased estimate of the gradient using the C-vine (see Appendix A). Considering the l-truncated ELBO (23), our proposed algorithm for the VC of computer models with truncated vine copulas is stated in Algorithm 1. Note that the estimate does not have a closed-form expression in general due to the expectations involved in its computation. Therefore, we resort to a Monte Carlo (MC) approximation using samples from the variational distribution.

1: Data d, mean and covariance functions for the models in the Kennedy-O’Hagan framework, variational family q(γ; λ), truncation level l
2: λ ← random initial value
3: t ← 1
4: repeat
5:     for k = 1 to K do ▷ Random sample
6:         Draw γ_k from the variational distribution q(γ; λ)
7:     end for
8:     Compute the MC approximation of the noisy l-truncated D-vine gradient estimate from γ_1, …, γ_K (with a randomly selected pair copula)
9:     ρ_t ← t-th value of a Robbins-Monro sequence
10:     λ ← λ + ρ_t × (noisy gradient estimate)
11:     t ← t + 1
12: until the change of λ is less than a chosen tolerance
Algorithm 1 Variational Calibration with Truncated Vine Copulas (D-vine version)
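The structure of Algorithm 1 can be summarized in a short driver loop. In the sketch below, `noisy_grad` stands for an assumed user-supplied routine that draws variational samples, evaluates the randomly selected pair copula of the l-truncated D-vine, and returns the MC gradient estimate; everything else is generic SGA bookkeeping.

```python
# High-level sketch of the Algorithm 1 loop: random pair-copula index,
# noisy gradient estimate, Robbins-Monro step, stopping on small change.
import numpy as np

def variational_calibration(noisy_grad, lam0, n_pairs, a=0.1, tol=1e-4,
                            max_iter=10_000, rng=None):
    rng = np.random.default_rng(rng)
    lam = np.asarray(lam0, dtype=float)
    for t in range(max_iter):
        pair = rng.integers(n_pairs)          # random pair-copula index
        grad = noisy_grad(lam, pair, rng)     # MC estimate of the noisy ELBO gradient
        rho_t = a / (t + 1)                   # Robbins-Monro learning rate
        lam_new = lam + rho_t * grad
        if np.max(np.abs(lam_new - lam)) < tol:
            return lam_new                    # change of lambda below tolerance
        lam = lam_new
    return lam
```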
Scalability Discussion:

The complexity of a bivariate copula evaluation depends on the size of its conditioning set due to the recursive nature of the calculations (Kurowicka2006). From the vine tree construction, the cardinality of the conditioning set for the D-vine and C-vine is in the worst case $n-2$. Nevertheless, on average, we can do better. Indeed, let $C$ be the cardinality of the conditioning set of a pair copula in the full D-vine (or C-vine) selected uniformly at random; then

$$E(C) = \frac{\sum_{i=1}^{n-1} (n-i)(i-1)}{n(n-1)/2} = \frac{n-2}{3}, \qquad (25)$$

so the cardinality of the conditioning set is on average roughly $n/3$. On the other hand, the cardinality of the conditioning set is, in the case of Algorithm 1, at most $l-1$. Now, let $C_T$ be the cardinality of the conditioning set in the updating step of the variational parameter in Algorithm 1; then

$$E(C_T) = \frac{\sum_{i=1}^{l} (n-i)(i-1)}{\sum_{i=1}^{l} (n-i)} \approx \frac{l-1}{2} \qquad (26)$$

for $n \gg l$ and truncation level $l$, which is a significant improvement over both the average case $(n-2)/3$ and the worst case $n-2$. This provides a heuristic yet convincing argument for scalability.
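The averages in (25) and (26) are easy to check numerically. The short sketch below computes the mean conditioning-set size over the pair copulas of a D-vine with and without truncation; it is an illustration of the heuristic above, not part of the method itself.

```python
# Average conditioning-set size over the pair copulas of a D-vine on n variables:
# tree i contains n - i copulas, each with a conditioning set of size i - 1.
import numpy as np

def avg_conditioning_size(n, l=None):
    trees = np.arange(1, (n if l is None else l + 1))    # trees 1..n-1, or 1..l if truncated
    counts = n - trees                                   # copulas per tree
    sizes = trees - 1                                    # conditioning-set size per tree
    return (counts * sizes).sum() / counts.sum()

print(avg_conditioning_size(1000))         # full vine: (n - 2) / 3 = 332.67
print(avg_conditioning_size(1000, l=10))   # 10-truncated vine: about 4.49, roughly (l - 1) / 2
```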

4 Implementation Details

4.1 Selection of Truncation Level

Selection of the truncation level is an important element of effective approximation of the posterior distribution under Algorithm 1. DIMANN201352 propose a sequential approach for the selection of the truncation level in the case of vine estimation: one sequentially fits models with an increasing truncation level until the quality of fit stays stable or computational resources are depleted. We adopt a similar idea for the case of VC of computer models with vine copulas. Let $\hat{\lambda}_l$ represent the value of the variational parameter estimated with Algorithm 1 for a fixed truncation level $l$. One can then sequentially increase $l$ until $\Delta(\hat{\lambda}_l, \hat{\lambda}_{l+1}) < \epsilon$ for some distance metric $\Delta$ and a desired tolerance $\epsilon$.
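A minimal sketch of this sequential selection is given below. Here `fit_vc` is an assumed wrapper that runs Algorithm 1 for a given truncation level and returns the estimated variational parameter, and the Euclidean norm stands in for the unspecified distance metric.

```python
# Sequentially increase the truncation level until consecutive variational
# parameter estimates differ by less than a tolerance eps.
import numpy as np

def select_truncation_level(fit_vc, l_max, eps=1e-2):
    lam_prev = fit_vc(1)
    for l in range(2, l_max + 1):
        lam = fit_vc(l)
        if np.linalg.norm(lam - lam_prev) < eps:   # fit has stabilized
            return l - 1, lam_prev
        lam_prev = lam
    return l_max, lam_prev                         # budget exhausted
```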

4.2 Variance Reduction of Monte Carlo Approximations

The computational convenience of simple MC approximations of the gradient estimators based on the l-truncated D-vine and C-vine copulas (see Section 3.2) is typically accompanied by their large variance. The consequence in practice is the need for a small step size in the SGA portion of Algorithm 1, which results in slower convergence. In order to reduce the variance of the MC approximations, we adopt the same approach as OBBVI and use Rao-Blackwellization (RB) in combination with control variates (CV) (Ross:2006) and importance sampling. The remainder of this section focuses on the case of the D-vine decomposition; see Appendix A for the corresponding derivations for C-vines.

4.2.1 Rao-Blackwellization

The idea here is to replace the noisy estimate of the gradient with its conditional expectation with respect to a subset of the latent variables in $\gamma$. For simplicity, let us consider a situation with $\gamma = (\gamma_1, \dots, \gamma_p)$ and a variational family that factorizes into $q(\gamma; \lambda) = \prod_{i=1}^{p} q_i(\gamma_i; \lambda_i)$. Additionally, let $\hat{g}$ be the MC approximation of the gradient. Now, the conditional expectation of $\hat{g}$ is also an unbiased estimate of the gradient, since the law of iterated expectations shows that taking the expectation of the conditional expectation recovers $E[\hat{g}]$. The factorization of the variational family also makes the conditional expectation straightforward to compute; i.e., we just need to integrate out some variables. Let us consider the MC approximation of the gradient estimator based on the l-truncated D-vine. The $i$-th entry of the Rao-Blackwellized estimator keeps only the terms of the log joint that include $\gamma_i$.
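The sketch below shows the Rao-Blackwellized score-function estimate for a single mean-field component: only the terms of the log joint that involve that component are used. The callables `sample_q_i`, `log_q_i`, `grad_log_q_i`, and `log_p_terms_i` are assumed pieces of a mean-field implementation and are not defined in the paper.

```python
# Rao-Blackwellized gradient estimate for the i-th mean-field component.
import numpy as np

def rao_blackwell_grad_i(lam_i, sample_q_i, log_q_i, grad_log_q_i,
                         log_p_terms_i, n_samples=32, rng=None):
    rng = np.random.default_rng(rng)
    grads = []
    for _ in range(n_samples):
        gamma_i = sample_q_i(lam_i, rng)
        # log_p_terms_i keeps only the factors of log p(d, gamma) that include gamma_i
        weight = log_p_terms_i(gamma_i) - log_q_i(gamma_i, lam_i)
        grads.append(grad_log_q_i(gamma_i, lam_i) * weight)
    return np.mean(grads, axis=0)
```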

4.2.2 Control Variates

To further reduce the variance of the MC approximations, we replace the Rao-Blackwellized estimate above with a function that has the same expectation but again smaller variance. For illustration, let us first consider a target function $f$ whose variance we want to reduce, and a function $h$ with finite expectation. Define

$$\hat{f} = f - a\,\big(h - E[h]\big), \qquad (27)$$

where $a$ is a scalar, so that $E[\hat{f}] = E[f]$. The variance of $\hat{f}$ is

$$\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) + a^2\,\mathrm{Var}(h) - 2a\,\mathrm{Cov}(f, h). \qquad (28)$$

This shows that a good choice for the function $h$ is one that has high covariance with $f$. Moreover, the value of $a$ that minimizes (28) is

$$a^* = \frac{\mathrm{Cov}(f, h)}{\mathrm{Var}(h)}. \qquad (29)$$

Let us place CV back into the context of calibration. Meeting the criteria described above, Ranganath14 propose $h$ to be the score function $\nabla_\lambda \log q(\gamma; \lambda)$, because it depends only on the variational distribution and has expectation zero. We can now set the target function to be the Rao-Blackwellized gradient estimate, which gives the corresponding entry of the MC approximation of the gradient estimator with CV, where the estimate of $a^*$ is based on additional independent draws from the variational approximation (otherwise the estimator would be biased).
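A compact sketch of the control-variate correction follows: given paired samples of the target f and of h (the score function, whose expectation is zero), the scaling from (29) is estimated from a separate pilot set of draws so that the corrected estimate stays unbiased. The names are illustrative.

```python
# Control-variate estimate of E[f] with a* estimated from independent pilot draws.
import numpy as np

def cv_estimate(f_samples, h_samples, f_pilot, h_pilot):
    a_hat = np.cov(f_pilot, h_pilot)[0, 1] / np.var(h_pilot, ddof=1)  # estimate of a* in (29)
    return np.mean(f_samples - a_hat * h_samples)   # E[h] = 0 for the score function
```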

4.2.3 Importance sampling

Here, we outline the last variance reduction technique, which makes use of importance sampling. We refer to OBBVI for a full description of the method and an illustration of its efficiency in the VI framework. Fundamentally, instead of taking samples from the variational family to carry out the MC approximation of the ELBO gradient, we take samples from an overdispersed distribution in the same family that depends on an additional dispersion parameter $\tau$. The samples are reweighted by the importance weight $q(\gamma; \lambda) / r(\gamma; \lambda, \tau)$, which guarantees that the estimator remains unbiased. Combining the ideas of Rao-Blackwellization, CV, and importance sampling, we obtain the corresponding entry of the MC approximation of the gradient estimator.
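The following sketch illustrates the overdispersed importance-sampling idea for the same one-dimensional Gaussian family used earlier: samples come from a heavier-tailed proposal with dispersion parameter tau and are reweighted by q/r so the gradient estimate remains unbiased. The Gaussian family and the parameterization of the dispersion are our assumptions for illustration.

```python
# Score-function ELBO gradient estimated with samples from an overdispersed
# proposal r = N(mu, (tau * s)^2) and importance weights w = q / r.
import numpy as np
from scipy.stats import norm

def overdispersed_grad(lam, log_joint, tau=2.0, n_samples=32, rng=None):
    rng = np.random.default_rng(rng)
    mu, log_s = lam
    s = np.exp(log_s)
    grads = []
    for _ in range(n_samples):
        gamma = rng.normal(mu, tau * s)                      # draw from the proposal r
        w = np.exp(norm.logpdf(gamma, mu, s)
                   - norm.logpdf(gamma, mu, tau * s))        # importance weight q / r
        score = np.array([(gamma - mu) / s**2,
                          (gamma - mu) ** 2 / s**2 - 1.0])
        grads.append(w * score * (log_joint(gamma) - norm.logpdf(gamma, mu, s)))
    return np.mean(grads, axis=0)
```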

The extension of Algorithm 1 with the variance reductions of the MC approximations due to Rao-Blackwellization, CV, and importance sampling is summarized in Algorithm 2.

1: Data d, mean and covariance functions for the models in the Kennedy-O’Hagan framework, variational family q(γ; λ), dispersion parameter τ, truncation level l
2: λ ← random initial value
3: t ← 1
4: repeat
5:     for k = 1 to K do ▷ Random sample
6:         Draw γ_k from the overdispersed distribution r(γ; λ, τ)
7:     end for
8:     Compute the Rao-Blackwellized, CV-corrected, and importance-weighted MC approximation of the noisy l-truncated D-vine gradient estimate from γ_1, …, γ_K
9:     ρ_t ← t-th value of a Robbins-Monro sequence
10:     λ ← λ + ρ_t × (noisy gradient estimate)
11:     t ← t + 1
12: until the change of λ is less than a chosen tolerance
Algorithm 2 Variational Calibration with Truncated Vine Copulas II (D-vine)

4.3 Choice of the learning rate

Even though SGA is straightforward in its general definition, the choice of learning rate can be challenging in practice. Ideally, one would want the rate to be small in situations where the noisy estimates of the gradient have large variance, and vice versa. The elements of the variational parameter can also differ in scale, and one needs to set the learning rate so that the SGA can accommodate even the smallest scales. The rapidly increasing usage of machine learning techniques in recent years has produced various algorithms for element-wise adaptive-scale learning rates. We use the AdaGrad algorithm (AdaGrad), which has been considered in similar problems before, e.g., by Ranganath14; however, there are other popular algorithms such as ADADELTA (Zeiler2012) or RMSProp (Tieleman2012). Let $g_t$ be the gradient used in the $t$-th step of the SGA algorithm, and let $G_t$ be a matrix consisting of the sum of the outer products of these gradients across the first $t$ iterations, namely $G_t = \sum_{i=1}^{t} g_i g_i^{\top}$.
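A minimal sketch of an AdaGrad-style update is shown below: in practice only the diagonal of G_t is kept, so each element of the variational parameter receives its own learning rate scaled by its accumulated squared gradients. The step size eta and the stabilizing constant eps are illustrative defaults.

```python
# Diagonal AdaGrad update: per-element learning rates from accumulated squared gradients.
import numpy as np

def adagrad_step(lam, grad, sq_grad_sum, eta=0.1, eps=1e-8):
    sq_grad_sum = sq_grad_sum + grad**2                      # running diagonal of G_t
    lam = lam + eta * grad / (np.sqrt(sq_grad_sum) + eps)    # element-wise scaled ascent step
    return lam, sq_grad_sum
```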