A Function Emulation Approach for Intractable Distributions

06/20/2018
by   Jaewoo Park, et al.
0

Doubly intractable distributions arise in many settings, for example in Markov models for point processes and exponential random graph models for networks. Bayesian inference for these models is challenging because they involve intractable normalising "constants" that are actually functions of the parameters of interest. Although several clever computational methods have been developed for these models, each method suffers from computational issues that makes it computationally burdensome or even infeasible for many problems. We propose a novel algorithm that provides computational gains over existing methods by replacing Monte Carlo approximations to the normalising function with a Gaussian process-based approximation. We provide theoretical justification for this method. We also develop a closely related algorithm that is applicable more broadly to any likelihood function that is expensive to evaluate. We illustrate the application of our methods to a variety of challenging simulated and real data examples, including an exponential random graph model, a Markov point process, and a model for infectious disease dynamics. The algorithm shows significant gains in computational efficiency over existing methods, and has the potential for greater gains for more challenging problems. For a random graph model example, we show how this gain in efficiency allows us to carry out accurate Bayesian inference when other algorithms are computationally impractical.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

09/10/2021

Diagnostics for Monte Carlo Algorithms for Models with Intractable Normalizing Functions

Models with intractable normalizing functions have numerous applications...
08/06/2020

Bayesian Indirect Inference for Models with Intractable Normalizing Functions

Inference for doubly intractable distributions is challenging because th...
11/17/2019

Bayesian Model Selection for Ultrahigh-Dimensional Doubly-Intractable Distributions with an Application to Network Psychometrics

Doubly intractable distributions commonly arise in many complex statisti...
06/18/2015

Hamiltonian Monte Carlo Acceleration Using Surrogate Functions with Random Bases

For big data analysis, high computational cost for Bayesian methods ofte...
10/22/2019

Reduced-dimensional Monte Carlo Maximum Likelihood for Latent Gaussian Random Field Models

Monte Carlo maximum likelihood (MCML) provides an elegant approach to fi...
09/19/2018

Efficient sampling of conditioned Markov jump processes

We consider the task of generating draws from a Markov jump process (MJP...
03/05/2021

On the Occasional Exactness of the Distributional Transform Approximation for Direct Gaussian Copula Models with Discrete Margins

The direct Gaussian copula model with discrete marginal distributions is...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Models with intractable normalising functions arise frequently, for example in exponential random graph models (cf. Robins et al., 2007, Hunter and Handcock, 2012) for social networks, autologistic models (cf. Besag, 1974, Hughes et al., 2011, for a review) for lattice data, and interaction spatial point process models (cf. Strauss, 1975, Goldstein et al., 2015). Consider

, an unnormalized probability model for a data set

given a parameter vector

. Suppose it has a normalising function . Let be the prior density for . The likelihood function, is and the posterior density of is

(1)

In Bayesian analysis this results in so-called doubly intractable posterior distributions. The major computational issue for these models is that cannot be easily evaluated. Several algorithms substitute with a Monte Carlo approximation. However, such approximations are often computationally expensive, making the resulting Markov chain Monte Carlo (MCMC) algorithm impractical. In this manuscript we provide an approach for replacing Monte Carlo approximations with fast Gaussian process approximations. We demonstrate how this algorithm is fast while producing accurate posterior approximations. Later we discuss the case where is very expensive to compute and propose a method to solve the general problem where the likelihood function is expensive to evaluate.

There is a large literature on computational methods for doubly intractable distributions. Besag (1974) proposed the pseudolikelihood approximation, a simple approximation to that does not contain

. However in the presence of strong dependence among data points, the maximum pseudo-likelihood estimator (MPLE) can be a poor approximation to the MLE.

Geyer and Thompson (1992) proposes MCMC-MLE which is based on maximizing a Monte Carlo approximation to the likelihood. This approach is elegant and practical, but the algorithm requires analytical gradients for the unnormalized likelihood, which is not available in many cases (cf. Goldstein et al., 2015). Bayesian alternatives may be useful in such situations, and also in cases where we want a convenient approach for carrying out inference for hierarchical models involving normalising function models, and for incorporating prior information. There is a growing literature on computational methods for Bayesian inference for such models (see Park and Haran, 2018, for a review). Asymptotically exact algorithms are those where the Markov chain’s stationary distribution is equal to the desired posterior distribution. Examples of elegant asymptotically exact methods for this challenging problem include Møller et al. (2006), Murray et al. (2006), Atchade et al. (2008), Lyne et al. (2015), Liang et al. (2016). While some require the ability to draw independent samples from the probability model, others are complicated to construct and require users to tune the algorithm carefully. All of them are computationally infeasible for many interesting models (Park and Haran, 2018). Asymptotically inexact approaches may be much faster (Liang, 2010, Alquier et al., 2016), but they can still be prohibitively expensive, for instance in the case of an exponential random graph model example we describe later in this manuscript. This motivates the development of computationally efficient algorithms that allow scientists to fit models to more complex models and larger data sets than previously possible.

In this manuscript we describe an algorithm that uses a fast two-stage approximation to construct an efficient MCMC algorithm for doubly intractable distributions. The two steps are as follows: (1) approximate the normalising function at several parameter values using importance sampling, and (2) interpolate the normalising function at other parameter values using a Gaussian process fit to the importance sampling approximations. These two steps allow the normalising function to be approximated when evaluating Metropolis-Hastings acceptance probabilities in a Markov chain Monte Carlo algorithm.

Gaussian processes have been widely used for interpolation in spatial statistics (Krige, 1951, Cressie, 2015), as well as in ”computer model emulation”, to approximate the relationship between input parameters and the output of a complex computer model (cf. Sacks et al., 1989, Kennedy and O’Hagan, 2001). We show how Gaussian processes are very effective in our two-stage approximation, and how our method may be useful in addressing inferential challenges for doubly intractable distributions. We also describe a second algorithm that is applicable in principle to a much wider class of problems – Bayesian inference when the likelihood function (not just its normalising function) is difficult to evaluate.

The outline of the remainder of this paper is as follows. In Section 2 we describe existing Bayesian algorithms for intractable normalising functions and discuss their computational challenges. We also introduce several function emulation approaches used in a number of works in the areas of computational statistics. In Section 3 we propose our fast Gaussian process-based function emulation approach to inference, and provide implementation details. In addition, we provide theoretical justification for our algorithm. In Section 4 we describe the application of our approach in the context of three different case studies, including a more general problem where the entire likelihood function is assumed to be intractable. We study the computational and statistical efficiency of our algorithm, showing how our algorithm is able to perform inference for a problem where others are computationally impractical. We conclude with a summary and discussion in Section 5.

2 Computational Methods

2.1 Bayesian Methods for Doubly Intractable Distributions

Several MCMC algorithms have been developed for Bayesian inference for doubly intractable distributions. Park and Haran (2018)classifies these algorithms into two broad if somewhat overlapping categories: (1) likelihood approximation approaches which directly approximate the normalising functions via importance sampling, and substitute the approximations into the Metropolis-Hastings acceptance probability (Atchade et al., 2008, Lyne et al., 2015, Alquier et al., 2016), and (2) auxiliary variable approaches which introduce an auxiliary variable that cancels out the normalising functions in the Metropolis-Hasings acceptance probability.

For instance, Lyne et al. (2015) constructs an unbiased likelihood estimate for doubly-intractable distributions. This method has the advantage that it can be shown to be a pseudo-marginal algorithm which is known to be asymptotically exact (Beaumont, 2003, Andrieu and Roberts, 2009). However, for the problems we consider in this paper, specifically doubly intractable distributions, obtaining a single likelihood estimate requires multiple Monte Carlo approximations of . This is computationally infeasible even for small data problems involving the interaction point process and exponential random graph model we consider in this paper (Park and Haran, 2018). Møller et al. (2006), Murray et al. (2006) relies on perfect sampling (Propp and Wilson, 1996), an algorithm that uses bounding Markov chains to generate an auxiliary variable that is exactly from the target distribution. These algorithms are asympotically exact. However, perfect samplers are available only for a small set of probability models, and even for these cases, they tend to be very computationally expensive; this greatly limits the applicability of algorithms that require perfect sampling. To address this, Liang (2010) proposes a double Metropolis-Hastings (DMH) algorithm by replacing perfect sampling with a standard Metropolis-Hastings algorithm. Although DMH is asymptotically inexact, among current approaches it is the most practical method for computationally expensive problems (see Park and Haran (2018) for details). But even the DMH algorithm is computationally infeasible in some situations, as we will show via examples in Section 4.

Approximate Bayesian computation (ABC) methods (Beaumont et al., 2002, Marin et al., 2012) are popular likelihood-free algorithms. In their simplest form, for a given parameter drawn from the prior distribution, ABC methods simulate a representative summary statistic from the model. If this simulated summary statistic is close to the observed summary statistic, the parameter is accepted. Accepted samples are approximately distributed according to the posterior distribution. ABC and its MCMC variant (Marjoram et al., 2003) are broadly applicable. However, for models without representative summary statistics, ABC can be very inefficient.

Current auxiliary variable and likelihood approximation algorithms are computationally expensive when the data are high-dimensional. The main expense is due to the high-dimensional auxiliary data simulations. All the algorithms require sampling a data set () from the probability model () at each iteration of the algorithm. Multiple samples are generated to construct an importance sampling estimate (likelihood approximation approach) or a single sample is simulated to cancel out (auxiliary variable approach). The sampling becomes more demanding as the dimension of the data () increases. Furthermore, adaptive algorithms (Atchade et al., 2008, Liang et al., 2016)

are computationally infeasible for high-dimensional data sets in some cases because in order to guarantee asymptotically exact inference, the adaptive algorithms require storing simulated auxiliary data with each iteration. For models without low-dimensional summary statistics, the memory costs can become prohibitively expensive.

2.2 Function Emulation

In many disciplines including climate science, mechanical engineering, computer models are used to simulate complex processes. Because these numerical simulations are expensive, it becomes difficult to study how the processes vary as functions of parameters and it is challenging to perform statistical inference. Several global approximations for such models have been developed using polynomial functions (Marzouk et al., 2007, Marzouk and Xiu, 2009)

, radial basis functions

(Joseph, 2012, Bliznyuk et al., 2012), Gaussian processes (Sacks et al., 1989, Kennedy and O’Hagan, 2001, Rasmussen, 2004, Wang and Li, 2017) and many others. However, analyzing convergence and the error of global approximations is often challenging. To overcome such difficulties of uniform modeling, local and nonstationary approximations are also studied in the Gaussian process context (Gramacy and Lee, 2008, Gramacy and Apley, 2015) and polynomial functions (Conrad et al., 2016). For instance, Conrad et al. (2016) replace likelihood functions with local polynomial approximations in the Metropolis-Hastings kernel. With increasing iterations, these local approximations are refined via sequential experimental design procedure, so that the approximations are asymptotically exact. This is an elegant approach but does not apply to our problem. They assume that it is possible to evaluate the likelihood exactly, even if each evaluation is expensive. In the doubly intractable distributions context, exact likelihood evaluations are not possible. Our problem requires another layer of approximation to the intractable normalizing functions; this can lead to further computational difficulties. We note, however, that for our disease dynamics example in the supplementary material, a version of the algorithm in Conrad et al. (2016) may be a possible alternative to our algorithm.

Several methods based on Gaussian process approximation have been proposed to accelerate inference when the likelihood function is intractable or expensive. Gaussian process approximations have also been used in the ABC context (Wilkinson, 2014, Meeds and Welling, 2014, Gutmann and Corander, 2016, Järvenpää et al., 2016). For instance, Wilkinson (2014) uses a Gaussian process to reduce the number of simulations in ABC. This method can iteratively rule out implausible regions of the parameter space. We note that there are delayed-acceptance approaches to speed up MCMC algorithms using cheap surrogates (Christen and Fox, 2005, Golightly et al., 2015, Sherlock et al., 2017). By using cheap approximations, poor proposals are rejected quickly, and expensive likelihood functions are evaluated at the next stage only for promising proposals. Although delayed-acceptance MCMC approaches are asymptotically exact, the efficiency gains are limited because the methods require evaluating expensive likelihood functions at the second stage for promising proposals.

Drovandi et al. (2018) proposes an approach to speed up pseudo-marginal methods by replacing the log of an unbiased likelihood estimate with a Gaussian process approximation. Our approach has similarities to Drovandi et al. (2018) in that we replace the log of the function estimate with a Gaussian process approximation, and also use a short run of an MCMC algorithm to obtain good design points for constructing the Gaussian process approximation. Function emulation approaches such as Drovandi et al. (2018) and the method we describe in this manuscript, are useful for problems where it is expensive to evaluate a function (or evaluate a function approximation) many times. Both our algorithm and the algorithms in Drovandi et al. (2018) require a pre-computation step which itself involves repeated approximations of the likelihood function at a set of parameter values. Drovandi et al. (2018) requires unbiased likelihood estimates in the pre-computation step, which can be prohibitively expensive for the problems we consider – interaction point processes and exponential random graph models with large data sets. In contrast, our approach, as we demonstrate later in the paper, is computationally efficient even for such problems. To our knowledge, no existing approach provides general computer model emulation framework for doubly intractable distributions. We note that Reich and Gardner (2014) develops an approximate MCMC method for the Strauss process (Strauss, 1975) using polynomial interpolation. However our approach applies more broadly due to the flexibility and nonparametric nature of the approach (the covariance mimics the role of a non-linear relationship), and we have studied its application to several challenging examples. Furthermore, we are able to provide a theoretical justification for our methodology.

3 Markov chain Monte Carlo Using Gaussian process-based Function Emulation

Here we describe two algorithms: NormEm constructs a Markov chain Monte Carlo algorithm for approximating a doubly intractable distribution, and LikEm applies more broadly to posterior distributions where the entire likelihood function is hard to evaluate.

3.1 Outline

Gaussian processes are commonly used for nonparametric regression (cf. Rasmussen, 2004), in spatial interpolation (cf. Krige, 1951, Cressie, 2015), and in approximating computationally expensive computer models (Sacks et al., 1989). The main idea of our approach is to replace expensive importance sampling estimates with fast Gaussian process approximations. We can approximate either (normalising function emulation) or (full likelihood function emulation). We begin with an outline of the normalising function emulation algorithm.

Step 1. Log of importance sampling estimates for are computed at a set of values.
Step 2. A Gaussian process model is fit to the above estimates, which allows for approximation of at other values.
Step 3. A Markov chain Monte Carlo algorithm is constructed for sampling from the posterior distribution of where for each Metropolis-Hasting accept-reject ratio, the approximation from Step 2 is used.

Our second algorithm is similar but it directly approximates instead of approximating just . We provide details in the following section.

3.2 Function Emulation Algorithms

For a -dimensional parameter vector , consider particles in , . Let be an approximation to the maximum likelihood estimate, for example, the maximum pseudolikelihood estimate (Besag, 1974) or sample mean of . For we can construct unbiased Monte Carlo estimates for via importance sampling. For each , log of the importance sampling estimate is

(2)

where each is the last draw of the th Markov chain with stationary distribution . For a robust approximation, an importance sampling estimate can also be extended through umbrella sampling (Torrie and Valleau, 1977, Atchade et al., 2008, Geyer, 2011). The log importance sampling approximations are obtained, respectively, at the particles . We can construct a Gaussian process model relating the importance approximation to the particle,

(3)

where is the mean and is a second order stationary Gaussian process. For , a symmetric and positive definite covariance function can be defined as

(4)

with partial sill , range , and nugget . Since we assume that the log-normalizing function surface is smooth, we take a Matérn class covariance function, where the smoothness parameter is set to . We assume a simple linear mean trend , where is the regression parameter. The flexible covariance structure allows, indirectly, for a ”nonparametric” non-linear mean function; this is the basis for kriging and computer model emulation.

To obtain at some new , we use basic definitions of the Gaussian process to obtain

(5)

where and . The conditional distribution of given observed is

(6)

Given true covariance parameters , a generalized least squares (GLS) estimator of regression parameter is . By minimizing the mean square error, the best linear unbiased predictor (BLUP) for can be derived as

(7)

where the mean squared error is

(8)

Since covariance parameters

are unknown in practice, we can plug in estimates of these parameters (e.g. maximum likelihood or ordinary least squares) into the covariance

, and GLS estimate . Using these plug-in estimates, (7) is called the empirical BLUP (EBLUP). With each iteration of the MCMC algorithm, this EBLUP is plugged into the log acceptance probability. The normalising function emulation algorithm is described in Algorithm 1.

Part 1: Construct two-stage approximation to for any
Step 1. Construct MCMC algorithms, each with stationary distribution . The last state of each of these Markov chains will be used in the step 2.
Step 2. Calculate importance sampling approximation (2) using the Markov chain samples for , to obtain, .
Step 3. Obtain parameters by fitting a Gaussian process via MLE to .
Part 2: MCMC algorithm with Gaussian process approximation.
Given at th iteration, construct the next step of the algorithm as follows
Step 4. Propose
Step 5. Evaluate from Gaussian process approximation as in (7) and accept with probability where
else reject (set ).
Algorithm 1 normalising function emulation algorithm

This algorithm can dramatically reduce computing time because the two-stage approximations (Step 1 - 3 in Algorithm 1) are precalculated and outside the MCMC algorithm. We can take advantage of parallel computation because constructing importance sampling estimates is embarrassingly parallel. Furthermore, the Gaussian process interpolation (Step 5 in Algorithm 1) is extremely fast with each iteration of the MCMC algorithm.

When the unnormalized likelihood is expensive to evaluate, it is computationally efficient to emulate the entire likelihood function instead of just the normalising function. We can construct log importance sampling estimates of likelihood functions for each particle as follows.

(9)

Then for a new , the log-likelihood value may be approximated in a similar fashion to (7), resulting in . This approach can also be applied to the problem where likelihood evaluations are available but still expensive. In this case we do not need to construct importance sampling estimates. The full likelihood emulation algorithm is described in Algorithm 2.

Part 1: Construct two-stage approximation to for
Step 1. Construct independent MCMC algorithms, each with stationary distribution . The last state of each of these Markov chains will be used in the step 2.
Step 2. Calculate importance sampling approximation (9) using the Markov chain samples for , to obtain, .
Step 3. Fit a Gaussian process to to obtain the parameters .
Part 2: MCMC algorithm with Gaussian process approximation.
Given at th iteration, construct the next step of the algorithm as follows
Step 4. Propose
Step 5. Evaluate from Gaussian process approximation and accept with probability where
else reject (set ).
Algorithm 2 Full likelihood function emulation algorithm

3.3 Theoretical Justifications

For these function emulation approaches, we examine the approximation error in terms of total variation distance (cf. Mitrophanov, 2005, Alquier et al., 2016). Consider a target distribution whose Markov chain transition kernel is . By plugging in into the log acceptance probability, the first-stage approximated transition kernel can be constructed, the stationary distribution of which is . The second-stage approximated kernel is constructed by replacing the with and is the corresponding stationary distribution. We make the following assumptions.

Assumption 1

constant s.t. .

Assumption 2

constant s.t. .

Assumption 3

constant , s.t. .

Assumption 4

is compact.

In many applications, and this is the case for the examples discussed in Section 4, the sample space may be reasonably assumed to be finite and the parameter space may be assumed to be a compact set (assumption 4). Hence, the assumptions 1-3 may also be easily checked. Theorem 1 quantifies the total variation distance between the target posterior distribution and the two-stage approximated Markov transition kernel.

Theorem 1

Consider Markov transition kernel constructed by plugging in into the log acceptance probability. Suppose Assumptions 1 to 4 hold. Then almost surely for bounded constant and .

Proof of Theorem 1 is provided in the supplementary material. Given the result in this theorem, the Markov chain samples from the normalising function emulation algorithm will be close to the target distribution , as the sample size for importance sampling estimates () and the number of particles () are increased ( and goes to 0 as and increases respectively). We note that learning how scales with and is also potentially of interest, but this is quite problem-specific and poses challenges. For fixed and , the stationary distribution of the proposed algorithm is , which is different from the desired target . Hence, in practice, this algorithm is asymptotically inexact. However, with an appropriate choice of and this algorithm appears to provide reasonable approximations more quickly than other asymptotically inexact algorithms, as is evident from numerous applications in Section 4.

In similar fashion, we examine the approximation error for Algorithm 2, the likelihood function emulation approach, in Corollary 1. With increasing number of and , the posterior recovered from a likelihood function emulation approach becomes close to the true target distribution . For finite and , this algorithm is also asymptotically inexact. Proof of Corollary 1 is in the supplementary material.

Corollary 1

Consider Markov transition kernel constructed by plugging in into the log acceptance probability. Suppose that Assumptions 1 to 4 hold. Then almost surely for bounded constant and .

3.4 Pre-MCMC Details for the Function Emulation Algorithms

The preliminary non-MCMC part of the function emulation algorithms involve constructing the two-stage approximation. For this, the particles should be chosen so that they cover well the important regions of the parameter space . The choice of particles is important for both algorithms. In general, we found that a short run of the double Metropolis-Hastings (DMH) (Liang, 2010) was useful in providing particles. This was an approach also used in Liang et al. (2016). The DMH Algorithm may be described as follows:

Given at th iteration.
Step 1. Propose .
Step 2. Generate the auxiliary variable approximately from probability model at : using Metropolis-Hastings updates.
Step 3. Accept with probability
else reject (set ).
Repeat Step 1 - Step 3 until we have accepted samples .
Algorithm 3 Double Metropolis-Hastings algorithm for choosing particles

There are other approaches to choosing particles. When the summary statistics are low-dimensional (e.g. the exponential random graph models), we recommend the approximate Bayesian computation (ABC) algorithm (Beaumont et al., 2002) as in Jin et al. (2013). Starting from a wide domain Algorithm 4 can search interesting region of parameter space . Then number of particles are generated over . In Step 3 of the Algorithm 4, auxiliary variables can be generated via parallel computation. In Step 4 is a tolerance, which controls the trade-off between computational efficiency and accuracy. With decreasing , we can have ABC samples from the distribution which is close to the posterior distribution, but acceptances will be rare. We set tolerance

equal to about the 0.03 quantile of the Euclidean distance between simulated and observed summary statistics

in our study.

Step 1. Select a wide rectangular domain over the parameter space using the MPLE (Besag, 1974)

and its standard error:

.
Step 2. Generate Latin hypercube design points over .
Step 3. Simulate the auxiliary variable for each : using Metropolis-Hastings updates for .
Step 4. Without loss of generality, choose from which satisfy , where denotes the Euclidean distance.
Step 5. Select a smaller rectangular domain which cover the important region of the parameter space: .
Step 6. Generate number of particles over using Latin hypercube design.
Algorithm 4 Approximate Bayesian Computation for choosing particles

If we use Algorithm 3 or Algorithm 4 to generate particles, particles seems to work well in practice for problems with up to 4-dimensional parameter spaces. Then the number of samples for constructing importance sampling estimate should be specified. Considering that our approach is asymptotically inexact, a conservative approach involves using a large value of . In this manuscript we set to .

4 Applications

We apply our approach to two general classes of models with intractable normalising functions: (1) an exponential random graph model, and (2) an attraction-repulsion point process model. Double Metropolis Hastings (DMH) was found to be the most efficient among current algorithms in terms of effective sample size per time. Hence, to illustrate the computational and statistical efficiency of our approach, we compare the normalising function emulation (NormEm) and likelihood function emulation (LikEm) algorithms with DMH. For a large social network example, DMH is too expensive to be practical, but both NormEm and LikEm take under 2 hours, a dramatic computational gain.

Our function emulation approach (LikEm) is more broadly applicable than to just doubly intractable distributions. To illustrate this, we apply this method to a susceptible-infected-recovered infectious disease model where likelihood evaluations are available but computationally expensive. The different examples we study illustrate different computational challenges. The code for our algorithms is implemented in R (Ihaka and Gentleman, 1996) and C++, using the Rcpp and RcppArmadillo packages (Eddelbuettel et al., 2011). We fit Gaussian process models to estimate hyper-parameters using the DiceKriging package (Roustant et al., 2012). The point estimates in each example are simply means of the entire sample; there was no thinning or burn-in. The highest posterior density (HPD) is calculated by using coda package in R. The calculation of Effective Sample Size (ESS) follows Kass et al. (1998), Robert and Casella (2013). For a point process model and an infectious disease model, we calculate the total variational (TV) distance of the marginal posterior distributions between each of the algorithms and a gold standard using density function in R. We cannot construct a gold standard for a social network model because of the computational expense. Figures for bivariate and univariate posterior densities are in the supplementary material. All the code was run on dual 10 core Xeon E5-2680 processors on the Penn State high performance computing cluster. The source code may be downloaded from the following repository (https://github.com/jwpark88/FuncEmul).

4.1 Social Network Models

Exponential random graph models (ERGM) (Robins et al., 2007, Hunter et al., 2008) describe relationships among actors in networks. Consider the undirected ERGM with nodes. For all , if the th node and th node are connected, otherwise and is defined as 0. Calculation of the normalising function requires summation over all network configurations, which is intractable. Consider the ERGM, where the probability model is

(10)

where is the number of edges and is the geometrically weighted edge-wise shared partnership (GWESP) statistic (Hunter and Handcock, 2006, Hunter, 2007). denotes the number of connected pairs , where and have common neighbors. models high-order transitivites, because is a function of triangles. Therefore, GWESP models edge-wise shared partnership by placing geometric weights on the edges with higher transitivites. We used uniform prior distribution for and with ranges of and

respectively. These priors were centered around the MPLE with a width of 10 standard deviations. For this model we can generate auxiliary variables (DMH) or Monte Carlo samples for importance sampling estimates (NormEm and LikEm) via Gibbs updates. For each iteration,

pairs are chosen randomly and is set to 0 or 1 according to the full conditional probabilities. See Hunter et al. (2008) for details. Here we study this model for both real and simulated data examples.

We study the Faux Magnolia high school data set (Resnick et al., 1997), which describes an in-school friendship network among 1461 students. For all the algorithms, samples are generated from the probability model through 1 cycle of Gibbs updates. In both function emulation approaches, particles are selected via approximate Bayesian computation (ABC) as in Algorithm 4. We initially generate Latin hypercube design points over . We set a tolerance equal to 0.03 quantile to obtain , the important region of the parameter space. Then we generate particles over by using a Latin hypercube design. Then numbers of samples are used to construct importance sampling estimates. We used parallel computing to obtain importance sampling estimates. The parallel computing was implemented through OpenMP (Dagum and Menon, 1998) with the samples generated in parallel across 20 processors. For all the algorithms we use a multivariate normal proposal. The covariance used in the multivariate normal is obtained as follows. The initial covariance matrix for the proposal is the inverse of the negative hessian matrix from the MPLE. We re-estimate the sample covariance matrix for the first 10,000 iterations and use this in the proposal. After 10,000 iterations the covariance of the proposal is fixed for the rest of the algorithm. Algorithms were run until the Monte Carlo standard errors calculated by batch means (Jones et al., 2006, Flegal et al., 2008) are below 0.001.

DMH
Mean -7.47 2.31
95%HPD (-7.56, -7.38) ( 2.21 ,2.41)
ESS 1568.90 1470.91
Time(hour) 41.41
minESS/Time 35.52
NormEm
Mean -7.47 2.31
95%HPD (-7.55, -7.38) (2.21,2.41)
ESS 2509.00 2332.76
Time(hour) 1.07
minESS/Time 2163.92
LikEm
Mean -7.47 2.31
95%HPD (-7.55, -7.38) (2.21,2.41)
ESS 2460.23 2552.40
Time(hour) 0.90
minESS/Time 2738.60
Table 1: Results for parameter estimates in ERGM for a Faux Magnolia high school data set. 25,000 MCMC samples are generated from each algorithm.

The function emulation approaches dramatically reduce computational time even when compared to the fastest algorithm, DMH. DMH takes about 40 hours but both function emulation approaches only take about 1 hour to run, including pre-computing time. Table 1 indicates that the estimates from the different algorithms are similar. Bivariate and univariate posterior densities are illustrated in the supplementary material. We can also account for mixing of the algorithms through effective sample size (ESS), (Kass et al., 1998) which approximates the number of independent samples that correspond to the number of dependent samples from the chain (a chain with very low dependence would return an ESS very similar to the actual Markov chain length). When accounting for mixing time, as shown in Table 1, the proposed algorithms show larger ESS than DMH for the same length. Naturally, the differences in minimum effective sample size per time (minESS/T) are even more dramatic. In summary, the function emulation approaches are much faster than current algorithms, result in better mixing chains, and provide reasonable results.

To validate our methods, we simulated a by network via 100 cycles of Gibbs updates, where the true parameter is . We study our methods for different combinations of and . The rest of the settings for all algorithms are identical to the real data example. Here we only provide the results for because similar results are observed for the other parameter. We provide results for in the supplementary material. In Table 2 we observe that both function emulation approaches can recover the true parameter value used in the simulation, across different choices of and . Implementing DMH is infeasible because auxiliary variable simulations are computationally expensive for this example. Based on our preliminary run, we estimate that it will take at least 5 days to run. Considering that DMH is the fastest approach among existing algorithms (Park and Haran, 2018), this highlights the fact that our approach can provide reliable results for large networks, and do so much faster than current approaches.

Mean 95%HPD ESS Time(hour) ESS/Time
DMH NA NA NA NA NA NA NA
NormEm 1000 400 2.01 (1.96, 2.06) 2248.88 1.54 1459.39
1000 200 2.01 (1.96, 2.06) 2447.13 1.54 1588.80
1000 100 2.01 (1.96, 2.06) 2459.70 1.54 1597.34
500 400 2.01 (1.96, 2.06) 2372.00 1.35 1755.24
500 200 2.01 (1.96, 2.06) 2371.55 1.35 1756.02
500 100 2.01 (1.96, 2.07) 2489.17 1.35 1821.87
LikEm 1000 400 2.01 (1.96, 2.06) 2570.34 1.67 1535.80
1000 200 2.01 (1.96, 2.06) 2411.25 1.67 1441.29
1000 100 2.01 (1.96, 2.06) 2552.63 1.67 1526.16
500 400 2.01 (1.96, 2.06) 2604.7 1.45 1793.38
500 200 2.01 (1.96, 2.06) 2374.33 1.45 1635.83
500 100 2.01 (1.96, 2.06) 2546.33 1.45 1759.24
Simulated Truth 2.00
Table 2: Results for in ERGM for a large simulated network. 25,000 MCMC samples are generated from each algorithm. The true value is 2. DMH was too expensive to be practical.

4.2 An Attraction-Repulsion Point Process Model

A spatial point process in two dimensions is a random set of points in a bounded plane . Consider a realisation of points , and is the distance between the coordinates of and . Then a probability model can describe spatial patterns among the points by introducing an interaction function . Goldstein et al. (2015) extends the Strauss process (Strauss, 1975) to develop a model describing both attraction and repulsion patterns of the cells infected with human respiratory syncytial virus (RSV). The interaction function is

(11)

and the probability model is

(12)

The intensity of the point process is controlled by , and control the interaction function. is the peak value of , is value of at the peak of and represents the descent rate after the peak. We used uniform priors with range for , which is a plausible range obtained from Goldstein et al. (2015). The model is able to explain both attraction and repulsion spatial associations among infected cells. Calculation of the normalising function requires integration over the continuous domain , which is infeasible. For this model we can generate auxiliary variables (DMH) or Monte Carlo samples for importance sampling estimates (NormEm and LikEm) via birth-death MCMC (Geyer and Møller, 1994). At each iteration of the chain, an existing point is removed (death) or a new point is added (birth) with equal probability. See Goldstein et al. (2015) for details. We study inference for this model in the context of real and simulated data examples.

We study the RSV A point pattern () from experiment (RSV A primary virus, RSV A secondary virus), with a 16 hour time lag (Goldstein et al., 2015). For all the algorithms, samples are generated from the probability model through 10 cycles of birth-death MCMC. For both function emulation approaches, samples are used to construct importance sampling estimates and particles are used for Gaussian process approximations. Importance sampling estimates are obtained in parallel as in the previous example. We run the DMH algorithm (Algorithm 3) until we obtain unique particles; this took 2,412 iterations with the first 1,000 iterations are discarded for burn-in. For all the algorithms we use a multivariate normal proposal with covariance matrix obtained as in the social network example. Although the DMH algorithm is asymptotically inexact, by increasing the number of iterations for the birth-death MCMC, the algorithm can provide more accurate results (Liang et al., 2016). Since other asymptotically exact algorithms are infeasible for this example as pointed out in Park and Haran (2018), we treated a run from DMH as our gold standard; it was run for 100,000 iterations with 20 cycles of birth-death MCMC. All algorithms were run until the Monte Carlo standard error is at or below 0.001.

DMH
Mean 2.97 1.34 11.52 0.22
95%HPD (2.64, 3.30) (1.30, 1.39) (10.68, 12.28) (0.17, 0.28)
TV 0.23 0.37 0.28 0.23
ESS 977.43 1037.47 1120.68 899.01
Time(hour) 18.99
minESS/Time 47.35
NormEm
Mean 2.96 1.34 11.51 0.22
95%HPD (2.62, 3.27) (1.29, 1.39) (10.65, 12.33) (0.17, 0.27)
TV 0.26 0.28 0.33 0.24
ESS 1996.07 2151.35 2167.10 1564.01
Time(hour) 3.80
minESS/Time 411.10
LikEm
Mean 2.98 1.34 11.50 0.22
95%HPD (2.61, 3.35) (1.30, 1.39) (10.65, 12.31) (0.17, 0.29)
TV 0.47 0.37 0.30 0.40
ESS 1883.62 1988.59 1815.97 1315.89
Time(hour) 2.52
minESS/Time 521.46
Gold
Mean 2.97 1.34 11.49 0.22
95%HPD (2.64, 3.28) (1.30, 1.39) (10.68, 12.27) (0.17, 0.28)
Table 3: Inference results for the RSV point pattern from experiment. 40,000 MCMC samples are generated from each algorithm.

While DMH takes about 19 hours, the function emulation approaches take between 2.5 and 4 hours. LikEm is about an hour faster than NormEm. This is because the unnormalized likelihood is expensive to evaluate in a point process example. For the ERGM, we can evaluate simply by taking the product of and once we evaluate . However for the attraction repulsion point process model, needs to be reevaluated at the distance matrix of with different parameters to calculate . Here, even though this is a normalising function problem, emulating the entire likelihood function is helpful. For a 2-dimensional ERGM example, our approach is about 40 times faster than DMH, while in the 4-dimensional attraction-repulsion point process problem, our approach is about 6-7 times faster than DMH. This difference comes from the precomputation step. Obtaining particles takes much more time for the point process example. Because a point process model neither has low-dimensional summary statistics nor analytical gradients, a short run of DMH needs to be used to obtain particles, which takes 1.5 hours. Table 3 indicates that the estimates from the algorithms are well matched to the gold standard. Bivariate and univariate posterior densities are illustrated in the supplementary material. To measure the accuracy of our algorithms we calculate the total variational (TV) distance of the marginal posterior distributions between each of the algorithms and a gold standard. TVs for function emulation approaches are comparable with those of DMH. Compared to DMH, function emulation approaches can achieve the same accuracy within a much shorter time. In Table 3, the function emulation approaches show larger ESS than DMH. When accounting for mixing, the difference increases, as is apparent from a comparison of minimum effective sample size per second (minESS/T). In summary, our algorithm is much faster and provides reasonable inference results. LikEm, in particular, has significant computational advantages over DHM.

To validate our methods, a point pattern () is simulated under RSV B settings in Goldstein et al. (2015). A point process is simulated by 100 cycles of birth-death MCMC, where the true parameter is . We study our approaches for different combinations of and . The rest of the settings for all algorithms are implemented using the same tuning parameters as in the real data example. Here we only provide the inference results regarding because similar results are observed for the other parameters. We provide results for other parameters in the supplementary material. Table 4 indicates that the estimates from the function emulation approaches are similar to the those of the gold standard, when we have large enough and ( and or and ). We observe that TVs for function emulation approaches are comparable with those of DMH for these and values. Otherwise, recovered posteriors from the function emulation approaches do not match the gold standard (TVs ). This fact demonstrates that with increasing parameter dimensions, function emulation approaches become more sensitive to the choice of and . Considering that emulation approaches are much cheaper than DMH, for parameter dimensions of 4 and under, we recommend using and .

N d Mean 95%HPD TV ESS Time(hour) ESS/Time
DMH NA NA 0.34 (0.21, 0.49) 0.18 784.58 22.10 35.50
NormEm 2,000 400 0.34 (0.20, 0.49) 0.25 931.96 4.10 227.30
2,000 200 0.34 (0.20, 0.48) 0.27 1250.63 3.18 393.28
2,000 100 0.21 (0.20, 0.21) 2.42 469.78 2.18 215.50
1,000 400 0.33 (0.20, 0.48) 0.37 973.40 3.49 278.91
1,000 200 0.26 (0.20, 0.44) 2.89 645.31 2.25 286.81
1,000 100 0.20 (0.20, 0.20) 2.50 531.26 1.63 325.93
LikEm 2,000 400 0.34 (0.21, 0.48) 0.30 1358.18 2.87 473.90
2,000 200 0.33 (0.21, 0.49) 0.71 1274.48 2.04 624.75
2,000 100 0.45 (0.21, 0.85) 3.49 1494.04 1.62 922.25
1,000 400 0.34 (0.21, 0.49) 0.35 1116.76 2.26 494.14
1,000 200 0.51 (0.21, 0.91) 5.79 1748.24 1.43 1222.54
1,000 100 0.59 (0.20, 0.96) 5.79 1677.07 1.02 1644.18
Gold NA NA 0.34 (0.21, 0.50)
Simulated Truth 0.30
Table 4: Inference results for outputs for in simulated attraction repulsion point process model. 40,000 MCMC samples are generated from each algorithm.

4.3 Susceptible-Infected-Recovered Models

Our function emulation approach may be more broadly applicable to any likelihood function that is expensive to evaluate. Susceptible-infected-recovered (SIR) compartmental models (Dietz, 1967) are widely used to quantify the dynamics of infectious diseases. Park et al. (2017) examine the rotavirus disease for children under 5 years of age in Niger with several variants of SIR compartmental models. For these models, the evaluation of the likelihood is available but computationally expensive. This is because for each set of parameter values, the periodic solution to the SIR dynamic equation is required for the likelihood calculation. Our study shows that LikEm provides comparable results, and is 10 times faster than the regular MCMC algorithm. We provide details in the the supplementary material.

4.4 Computational Complexity

We examine the computational complexity of the function emulation approaches (NormEm and LikEm) and double Metropolis-Hastings (DMH) in ERGM and interaction point process models, summarizing how our algorithms scale with an increase in the size of the data, . We denote by the number of nodes for ERGM and the number of points for the attraction-repulsion point process.

Simulated Truth Network Density
200 nodes 4.12
282 nodes 3.71
400 nodes 4.45
565 nodes 4.23
 800 nodes 4.54
Table 5: Simulation settings with different scales of networks.

We begin with a few caveats. The computational complexity of the ERGM used in the manuscript not only is dependent on the number of nodes but also is dependent on the density of the network. The density of a network is defined as where is the number of edges. Therefore, true parameter values are selected to maintain similar density of a network with different scales of simulated networks as in Table 5. To simplify calculations we assume the dimensions of the data and the simulated data are the same. We note that while this always holds for ERGMs, in the interaction point process case this is not always true as the simulated data is generated through birth-death MCMC with varying dimensions.

The main difference in calculating complexity of both examples comes from the different structure of . For ERGM, we can take the product of and summary statistics for evaluation of the unnormalized likelihood function in different . In the point process example, however, requires recalculation of the interaction function with different parameters. Here we provide our main observations (see supplement for details), where costs are per iteration of the main MCMC algorithm: (1) In the ERGM, complexity of DMH is . Ignoring pre-MCMC (particle finding and importance sampling) costs, complexity for both function emulation approaches is , where is the number of particles. (2) Complexity for the point process model is for DMH and NormEm. However the amount of calculations per iteration of NormEm is about 1/5 that of DMH. Ignoring pre-MCMC costs, the complexity of LikEm is , because we can avoid expensive evaluation in the MCMC steps. (3) Pre-MCMC costs are heavily parallelizable in both problems, and become marginal with increasing number of available cores.

Figure 1 is the observed computing time for algorithms with different scales in both models. In the ERGM, it is observed that complexity for DMH (for times larger nodes, computing time takes about times longer). Complexities of both emulation approaches are similar to each other because the unnormalized likelihood is not expensive to calculate. The results in Figure 1 are consistent with our calculations.

Figure 1: Illustration of the observed computing time for algorithms. For ERGM, time is measured for simulation settings in Table 5 with 25,000 iterations. For point process model, time is measured for RSV-B simulation settings in Goldstein et al. (2015) with 40,000 iterations.

5 Discussion

In this manuscript, we have proposed fast Gaussian process-based function emulation approaches for Bayesian inference. We describe two algorithms – one specifically targeted at doubly intractable distributions, while the other applies broadly to problems where the likelihood functions are expensive. Our study shows that our function emulation approaches provide comparable results at far lower computational cost then existing algorithms. We have also studied bounds on the total variation between a Markov chain with the exact target distribution, and the Markov chain of our approximate algorithm. Our study of applications to real and simulated data applications shows that our function emulation approaches provide similar results to the best current algorithms, but at a fraction of the computational cost.

There have been a number of recent proposals for efficient precomputation approaches for intractable normalising function problems. These include the precomputation for Monte Carlo approximations in Boland et al. (2017), and preprocessing for approximate Bayesian computation (Moores et al., 2015). The precomputation step in our method is embarrassingly parallel in that the importance sampling estimates can be constructed entirely in parallel. Therefore, with relatively little effort, the computational costs can be reduced by a factor corresponding to the number of available threads. This can be helpful given the increasing availability of parallel computing resources.

We note that if there are irregularities, for example multimodalities, in the likelihood function, it becomes more challenging to emulate accurately. In such cases, we can extend our methods via a local Gaussian process approximation. For instance, Gramacy and Apley (2015) provide a nonstationary modeling framework by constructing a Gaussian process emulator based on a local subset of the data. Their approach can provide accurate estimates in the presence of irregularities and can take advantage of parallel computation for local design. We note that it would be ideal to use a more efficient approach than importance sampling for approximating normalizing functions. However, in the interest of computational efficiency, particularly our ability to easily parallelize portions of the algorithm, as well as the limited alternative methods for approximating normalizing functions efficiently, importance sampling appears to be a good choice. Although our emulation approaches are scalable for high-dimensional data sets, we note that their performance relies on the choice of particles. Choosing particles well becomes a bigger challenge with increasing parameter dimensions. By a clever choice of particles, Drovandi et al. (2018) applies a Gaussian process approximation to an 8-dimensional parameter in the pseudo-marginal context. However, for doubly-intractable distributions there are practical implementation issues for approximating for higher dimensional , as pointed out in Park and Haran (2018). For example, Atchade et al. (2008), Liang et al. (2016) provide multiple importance sampling approaches for robust estimates of , but this method suffers from slow mixing of the stochastic approximation. To our knowledge DMH is the only algorithm applicable for higher parameter dimension doubly-intractable problems. However DMH does not provide likelihood estimates as a by-product, because the auxiliary variables introduced by the algorithm result in being canceled out in the Metropolis-Hastings acceptance ratio (see Step 3 in the Algorithm 3). Furthermore the DMH algorithm requires simulation of a high-dimensional auxiliary variable with each iteration, which is computationally demanding for the problems considered here. The methods we describe in this paper are effective for low-dimensional parameter spaces, involving high-dimensional data sets. Inference for doubly-intractable distributions arising from high-dimensional parameter space models with large data sets (e.g. thousands of nodes with 10 dimensional parameter space) remains an open challenge. Hence, our methods are ideally suited to parameter dimensions similar to those we considered in our examples, that is, between 1 and 4. Given our current computing resources we find that our methods are well suited to problems where simulating auxiliary variables, that is, producing a single draw from the probability model , takes under 20 seconds. As our examples in Section 4 illustrate, this allows for our method to be applicable to several classes of practical models for which current methods are infeasible. However, an open question is how to extend the algorithms to work beyond these parameter dimensions. Our methods can be extended to moderate dimensional parameter space models, but the number of particles would then increase exponentially with dimension, which slows computing. Therefore in this manuscript we choose particles carefully by using ABC or a short run of DMH (see also Atchade et al., 2008, Liang et al., 2016). There are function interpolation approaches that add design points sequentially to improve the accuracy of approximation (cf. Joseph, 2012, Joseph et al., 2015, Conrad et al., 2016, Wang and Li, 2017). However, direct application of these methods is challenging because they require sequential optimization, which is computationally expensive as there are no robust estimates for high-dimensional in the doubly-intractable distribution context. Developing extensions of our approach to high-dimensional parameter models may provide an interesting avenue for future research.

Acknowledgement

MH and JP were partially supported by the National Science Foundation through NSF-DMS-1418090. The authors are grateful to Galin Jones, Omiros Papaspiliopoulos and Alexander Mitrophanov for helpful discussions.

Supplementary Material

Supplementary material available online contains proofs, complexity calculations, and details for the model for infectious disease dynamics briefly described in Section 4.3 . It also provides tables and figures not included in the manuscript.

Appendix


Appendix A Proof of Theorem 1

Consider a target distribution whose Markov chain transition kernel is . The acceptance probability of which is

(13)

and denote an importance sampling estimate and a Gaussian process arppoximation as in (2) and (7) respectively. By plugging in the into the acceptance probability, first-stage approximated transition kernel can be constructed; the acceptance probability of which is . Second-stage approximated kernel is constructed by replacing with and is the corresponding acceptance probability.

a.1 Approximation Error of Importance Sampling Estimates

Bound of difference between the acceptance probabilities of and can be derived as follows.

(14)

The first inequality is from the ergodic theorem and continuous mapping theorem. With increasing , importance sampling estimates converge to almost surely. The second inequality is from Assumption 1-3 in the Theorem 1.

We now show that is uniformly ergodic for measurable subset of . From Assumption 1-3 in the Theorem 1, Markov transition kernel may be bounded as follows,

(15)

According to Theorem 16.0.2 and Theorem 16.2.4 in Meyn and Tweedie (1993), this proves that , where , and . Hence, the following conditions hold: (1) the difference between acceptance probabilities is bounded, and (2) the Markov chain with transition kernel is uniformly ergodic. The assumptions of Corollary 2.3 in Alquier et al. (2016) are therefore satisfied which implies that the approximation error of importance sampling estimates is

(16)

a.2 Approximation Error of Gaussian Process Emulation

Now we consider the second-stage approximation of the Metropolis-Hastings acceptance probability. Similar to the previous section, we can derive bound of difference between acceptance probabilities of and . The difference between acceptance probabilities is

(17)

Since the parameter space is assumed to be compact, there exists a finite -number of open balls with radius that can cover . Let be center of the -balls respectively. Here, is a continuous function with respect to . This is because is a linear function of , which is a continuous function of . This satisfies

(18)

for every (i.e. uniformly convergent). is also continuous and has value , when for . Therefore with continuous mapping theorem and Assumption 1-3 in the Theorem 1, the difference in acceptance probability approximations may be bounded as

(19)

We now show that is uniformly ergodic for measurable subset of . From Assumption 1-3 in Theorem 1, we obtain,

(20)

Therefore, t