Likelihood Inflating Sampling Algorithm

05/06/2016 ∙ by Reihaneh Entezari, et al. ∙ UNIVERSITY OF TORONTO 0

Markov Chain Monte Carlo (MCMC) sampling from a posterior distribution corresponding to a massive data set can be computationally prohibitive since producing one sample requires a number of operations that is linear in the data size. In this paper, we introduce a new communication-free parallel method, the Likelihood Inflating Sampling Algorithm (LISA), that significantly reduces computational costs by randomly splitting the dataset into smaller subsets and running MCMC methods independently in parallel on each subset using different processors. Each processor will be used to run an MCMC chain that samples sub-posterior distributions which are defined using an "inflated" likelihood function. We develop a strategy for combining the draws from different sub-posteriors to study the full posterior of the Bayesian Additive Regression Trees (BART) model. The performance of the method is tested using both simulated and real data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Markov Chain Monte Carlo (MCMC) methods are essential for sampling highly complex distributions. They are of paramount importance in Bayesian inference as posterior distributions are generally difficult to characterize analytically (e.g., Brooks et al., 2011; Craiu and Rosenthal, 2014). When the posterior distribution is based on a massive sample of size , posterior sampling can be computationally prohibitive since for some widely-used samplers at least operations are needed to draw one MCMC sample. Additional issues include memory and storage bottlenecks where datasets are too large to be stored on one computer.

A common solution relies on parallelizing the computation task, i.e. dividing the load among a number of parallel workers, where a worker can be a processing unit, a computer, etc. Given the abundant availability of processing units, such strategies can be extremely efficient as long as there is no need for frequent communication between workers. Some have discussed parallel MCMC methods (Wilkinson, 2006; Rosenthal, 2000; Laskey and Myers, 2003) such that each worker runs on the full dataset. However, these methods do not resolve memory overload, and also face difficulties in assessing the number of burn-in iterations for each processor.

A truly parallel approach is to divide the dataset into smaller groups and run parallel MCMC methods on each subset using different workers. Such techniques benefit from not demanding space on each computer to store the full dataset. Generally, one needs to avoid frequent communication between workers, as it is time consuming. In a typical divide and conquer strategy the data is partitioned into non-overlapping sub-sets, called shards, and each shard is analyzed by a different worker. For such strategies some essential MCMC-related questions are: 1) how to define the sub-posterior distributions for each shard, and 2) how to combine the MCMC samples obtained from each sub-posterior so that we can recover the same information that would have been obtained by sampling the full posterior distribution. Existing communication-free parallel methods proposed by Scott et al. (2013), Neiswanger et al. (2013) and Wang and Dunson (2013) have in common the fact that the product of the unnormalized sub-posteriors is equal to the unnormalized full posterior distribution, but differ in the strategies used to combine the samples. Specifically, Neiswanger et al. (2013)

approximate each sub-posterior using kernel density estimators, while

Wang and Dunson (2013) use the Weierstrass transformation. The popular Consensus Monte Carlo (CMC) method (Scott et al., 2013) relies on a weighted averaging approach to combine sub-posterior samples. The CMC relies on theoretical derivations that guarantee its validity when the full-data posterior and all sub-posteriors are Gaussian or mixtures of Gaussian.

We introduce a new communication-free parallel method, the Likelihood Inflating Sampling Algorithm (LISA), that also relies on independent and parallel processing of the shards by different workers to sample the sub-posterior distributions. The latter are defined differently than in the competing approaches described above. In this paper, we develop techniques to combine the sub-posterior draws obtained for LISA in the case of Bayesian Additive Regression Trees (BART) (Chipman et al., 1998, 2010; Kapelner and Bleich, 2013) and compare the performance of our method with CMC.

Sections 2 and 3 contain a brief review of the CMC algorithm and the detailed description of LISA, respectively. Section 4

illustrates the potential difference brought by LISA over CMC in a simple Bernoulli example, and includes a simple application of LISA to linear regression models. Section

5 contains the justification for a modified and improved version of LISA for BART. Numerical experiments and the analysis of socio-economic data presented in Section 6 examine the computational performance of the algorithms proposed here and compare it with CMC. We end the paper with some ideas for future work. The Appendix contains theoretical derivations and descriptions of the steps used when running BART.

2 Review of Consensus Monte Carlo

In this paper we assume that of interest is to generate samples from , the posterior distribution given the iid sample of size . The assumption is that is large enough to prohibit running a standard MCMC algorithm in which draws from are obtained on a single computer. We use the notation , where is the likelihood function corresponding to the observed data and is the prior. Major issues with MCMC posterior sampling for big data can be triggered because a) the data sample is too large to be stored on a single computer, or b) each chain update is too costly, e.g. if is sampled via a Metropolis-Hastings type of algorithm each update requires likelihood calculations.

In order to reduce the computational costs, the CMC method of Scott et al. (2013) partitions the sample into batches (i.e. ) and uses the workers independently and in parallel to sample each sub-posterior. More precisely, the -th worker () will generate samples from the -th sub-posterior distribution defined as:

Note that the prior for each batch is considered to be such that and thus the overall full-data unnormalized posterior distribution which we denote as is equal to the product of unnormalized sub-posterior distributions, i.e.

When the full posterior is Gaussian, the weighted averages of the sub-samples from all batches can be used as full-data posterior draws. That is, assuming are sub-samples from the th worker then the -th approximate full posterior draw will be:

where the weights are optimal for Gaussian models with .

In the next section we introduce an alternative method to define the sub-posteriors in each batch.

3 Likelihood Inflating Sampling Algorithm (LISA)

LISA is an alternative to CMC that also benefits from the random partition of the dataset followed by independently processing each batch on a different worker. Assuming that the data have been divided into batches of approximately equal size , we define the sub-posterior distributions for each machine by adjusting the likelihood function without making changes to the prior. Thus the -th sub-posterior distribution will be:

Since the data are assumed to be iid, inflating the likelihood function -times is intuitive because the sub-posterior from each batch of data will be a closer representation of the whole data posterior. We expect that sub-posteriors sampled by each worker will be closer to the full posterior thus improving the computational efficiency.

We indeed prove in a theorem below that under mild conditions, LISA’s sub-posterior distributions are asymptotically closer to the full posterior than those produced by the CMC-type approach.

The Taylor’s series expansion for a log-posterior density around its posterior mode yields the approximation

where . Exponentiating both sides will result in

which shows asymptotic normality, i.e. as where . Let and denote the - sub-posterior modes in LISA and CMC, respectively. Similarly, and denote the negative second derivative of the - log sub-posterior for LISA and CMC, respectively, when calculated at the mode. Then consider the assumptions,

  1. [label=)]

  2. There exist such that if we define and , then and w.p. 1 as .

  3. and w.p. 1 as .

  4. , , and are unimodal distributions that have continuous derivatives of order 2.

Theorem 1.

Assume that assumptions A1 through A3 hold and if we also assume as . If and then as

Proof.

See Appendix. ∎

Theorem 1 shows the difference between sub-posterior distributions for CMC and LISA, with LISA’s sub-posterior distributions being asymptotically similar to the full posterior distribution. This suggests that draws from LISA sub-posteriors can be combined using uniform weights.

Remarks:

  1. When data are iid we expect the shards to become more and more similar as (and thus ) increases and assumption A1 is expected to hold for general models.

  2. Assumption A2 in Theorem 1 holds due to the structural form of sub-posteriors in LISA and CMC.

  3. The validity of using uniform weights with LISA’s sub-posterior draws is justified asymptotically, but we will see that this approximation can be exact in some examples, e.g. for a Bernoulli model with balanced batch samples, while in others modified weights can improve the performance of the sampler. In this respect LISA is similar to other embarrassing parallel strategies where one must carefully consider the model of interest in order to find the best way to combine the sub-posterior samples.

In the next section we will illustrate LISA in some simple examples and compare its performance to the full-data posterior sampling as well as CMC.

4 Motivating Examples

In this section we examine some simple examples where theoretical derivations can be carried out in detail. We emphasize the difference between LISA and CMC.

4.1 Bernoulli Random Variables

Consider

to be N i.i.d. Bernoulli random variables with parameter

. Hence, we consider a prior . Assuming that we know little about the size of we set which corresponds to a prior. The resulting full-data posterior is Beta where is the total number of ones. Suppose we divide the data into batches with number of ones in batch , such that , i.e. the number of 1’s are divided equally between batches. Then the sub-posterior based on batch-data of size for each method will be:

  • CMC:

  • LISA:

    which implies

In this simple case any one of LISA ’s sub-posterior distributions is equal to the full posterior distribution if the batches are balanced, i.e. the number of 1’s are equally split across all batches. Thus, LISA’s sub-samples from any batch will represent correctly the full posterior. On the other hand, the draws from the CMC sub-posterior distributions will need to be recombined to obtain a representative sample from the true full posterior .

However, when the number of ones is unequally distributed among the batches it is not easy to pick the winner between CMC and LISA as both require a careful weighting of each batch sub-posterior samples.

In the remaining part of this paper, we will mainly focus on the performance of LISA when it is applied to the Bayesian Additive Regression Trees (BART) model. Interestingly, we discover that using a minor modification inspired by running LISA on the simpler Bayesian Linear Regression model we can approximate the full posterior. The idea behind the modification is described in the next section.

4.2 Bayesian Linear Regression

Consider a standard linear regression model

(1)

where , and with . To simplify the presentation we consider the improper prior

(2)

Straightforward calculations show that the conditional posterior distributions for the full data are

(3)
(4)

where and .

An MCMC sampler designed to sample from will iteratively sample using (3) and then via (4). If we denote the r.v. with density

then, using the iterative formulas for conditional mean and variance we obtain

and

(5)

We examine below the statistical properties of the samples produced by LISA. If the data are divided into K equal batches of size , let us denote and

the response vector and model matrix from the

th batch, respectively.

With the prior given in (2), the sub-posteriors produced by LISA have the following conditional densities

(6)
(7)

where and for all .

A simple Gibbs sampler designed to sample from will iteratively sample from (6) and then from (7).

It can be shown using the iterative formulas for conditional means and variances that

and

In order to combine the sub-posterior samples we propose using the weighted average

(8)

where and . Since we get

(9)

and

(10)

where the last approximation in (10) is based on the assumption that

as both are unbiased estimators for

based on and, respectively, observations. It is apparent that the variance computed in (10) is roughly times smaller than the target given in (5). In order to avoid underestimating the variance of the posterior distribution we propose a modified LISA sampling algorithm which consists of the following steps:

The intermediate step simply adjusts the variance samples so that

In turn, if we define

(11)

then and

(12)

Note that when the regression has only an intercept, i.e. consists of a column of 1’s, the weights .

While both (11) and (8) produce samples that have the correct mean, from equations (5), (10) and (12) we can see that the weighted average of the modified LISA samples have the variance closer to the desired target.

In the next section, we will examine LISA’s performance on a more complex model, the Bayesian Additive Regression Trees (BART). The discussion above will guide our construction of a modified version of LISA for BART.

5 Bayesian Additive Regression Trees (BART)

Consider the nonparametric regression model:

where is a -dimensional vector of inputs and is approximated by a sum of regression trees:

where denotes a binary tree consisting of a set of interior node decision rules and a set of terminal nodes. is the set of parameter values associated with the terminal nodes of . In addition, is the function that maps each to a . Thus the regression model is approximated by a sum-of-trees model

Let denote the vector of model parameters. Below, we briefly describe the prior specifications stated in Chipman et al. (2010) and Chipman et al. (1998).

Prior Specifications:

  • Prior Independence and Symmetry:

    where .

  • Recommended number of trees: m=200 (Chipman et al., 2010) and m=50 (Kapelner and Bleich, 2013)

  • Tree prior , is characterised by three aspects:

    1. The probability that a node at depth

      is non-terminal, which is assumed to have the form , where   and . (recommended values are and )

    2. The distribution on the splitting variable assignments at each interior node which is recommended to have a uniform distribution.

    3. The distribution on the splitting rule assignment in each interior node, conditional on the splitting variable which is also recommended to have a uniform distribution.

  • The conditional prior for is such that:

    with recommended.

  • The prior for is where is recommended and is chosen such that with recommended and sample variance .

Hence the posterior distribution will have the form:

(13)

Gibbs Sampling is used to sample from this posterior distribution. The algorithm iterates between the following steps:


  • where and .

  • which is the same as drawing from the conditional where denotes all trees except the -th tree, and residual is defined as:

    The sampling of is performed in two steps:

    1. and

    2. .

    Step 2 involves sampling from each component of using

    where denotes the average residual (computed without tree ) at terminal node with total number of observations . The conditional density of in step 1 can be expressed as:

    (14)

The Metropolis-Hastings (MH) algorithm is then applied to draw from (14) with four different proposal moves on trees:

  • GROW: growing a terminal node (with probability 0.25);

  • PRUNE: pruning a pair of terminal nodes (with probability 0.25);

  • CHANGE: changing a non-terminal rule (with probability 0.4) (Kapelner and Bleich, 2013, change rules only for parent nodes with terminal children);

  • SWAP: swapping a rule between parent and child (with probability 0.1) (This proposal move was removed by Kapelner and Bleich, 2013).

Detailed derivations involving the Metropolis-Hastings acceptance ratios are described in the Appendix.

Two existing packages in R, ”BayesTree” and ”bartMachine”, can be used to run BART on any dataset, but as the sample size increases, these packages tend to run slower. In these situations we expect methods such as LISA or CMC to become useful, and for a fair illustration of the advantages gained we have used our own R implementation of BART and applied the same structure to implement LISA and CMC algorithm for BART. The Metropolis-Hastings acceptance ratios for LISA and CMC are also reported in the Appendix.

As discussed by Scott et al. (2013), the approximation to the posterior produced by the CMC algorithm can be poor. Thus, for comparison reasons, we applied both LISA and CMC to BART using a simulated dataset (described further) with batches. Given Theorem 1, since LISA’s sub-posterior distributions are asymptotically equivalent to the full posterior distribution, we examined its performance by uniformly taking sub-samples from all its batches as an approximation to full posterior samples. We will see further that LISA with uniform weights produces higher prediction accuracy compared to CMC. However, they both perform poorly in approximating the posterior samples as they generate larger trees and under-estimate , which results in over-dispersed posterior distributions.

The following sub-section discusses a modified version of LISA for BART which will have significant improvement in performance.

5.1 Modified LISA for BART

The under estimation of when applying LISA to BART is similar to the problem encountered when using LISA for the linear regression model discussed in Section 4.2. This is not a coincidence since BART is also a linear regression model, albeit one where the set of independent variables is determined through a highly sophisticated process. We will show below that when applying a similar variance adjustment to the one discussed in Section 4.2, the Modified LISA (modLISA) for BART will exhibit superior computational and statistical efficiency compared to either LISA or CMC.

Just like in the regression model we “correct” the sampling algorithm by adjusting the residual variance. We start with the conditional distribution of tree from expression (14) which takes the form

Note that only the conditional distribution of the residuals, is affected by the modifications brought by LISA. The Metropolis-Hastings acceptance ratios for tree proposals contain three parts: the transition ratio, the likelihood ratio and the tree structure ratio. The modifications brought by LISA will influence only the likelihood ratio which is constructed from the conditional distributions of residuals. Consider the likelihood ratio for GROW proposal in LISA (full details are presented in the Appendix)

(15)

where is the total number of observations from batch-data that end up in terminal node . The newly grown tree, , splits terminal node into two terminal nodes (children) and , which will also divide to and which are the corresponding number of observations in each new terminal node. By factoring out in (15), we can rewrite it as

(16)

Expression (16) shows a similar residual variance that is times smaller in each batch, and hence following the discussion in Section 4.2, to achieve similar variance, we need to modify LISA for BART by adding the intermediate step when updating trees in each batch, and then taking a weighted average combination of sub-samples (similar to Bayesian linear regression). As in Section 4.2, we don’t apply any changes when updating . All our numerical experiments show that modLISA also generates accurate predictions in BART, since the modification corrects the bias in the posterior draws of and properly calibrates the size of the trees.

The BART algorithm will split the covariate space into disjoint subsets and on each subset a regression with only an intercept is fitted. Therefore, as suggested by the discussion in 4.2 the weight assigned to each batch will be proportional to the estimate of in that batch. In the following sections we examine the improvement brought by modLISA when compared to LISA and CMC.

6 Numerical Experiments

6.1 The Friedman’s function

We have simulated data of size from Friedman’s test function (Friedman, 1991)

where the covariates are simulated independently from a and with

. Note that five of the ten covariates are unrelated to the response variable. We have also generated test data containing 5000 cases. We apply BART to this simulated dataset using the default hyperparameters stated in Section

5 with to generate posterior draws of that, in turn, yield posterior draws for using the approximation for each . Since in this case the true is known, one can compute the root mean squared error (RMSE) using average posterior draws of for each (i.e. ), as an estimate to measure its performance, i.e. RMSE . It is known that SingleMachine BART may mix poorly when it is run on an extremely large dataset with small residual variance. However since the data simulated is of reasonable size and is not very small the SingleMachine BART is expected to be a good benchmark for comparison (see discussion in Pratola et al., 2016).

6.1.1 Comparison of modLISA with Competing Methods

We have implemented modLISA, LISA, and CMC for BART with batches on the simulated data for 5000 iterations with a total of 1000 posterior draws. Table 1 shows results from all methods including the SingleMachine which runs BART on the full dataset using only one machine. Results are averaged over three different realizations of train and test data, and include the Train and Test RMSE for each method, along with tree sizes,

estimates and their 95% Credible Intervals (CI). The summaries presented in Table

1 show that although LISA has better prediction performance than CMC, it does a terrible job at estimating , its estimate being orders of magnitude smaller than the one produced by CMC. CMC and LISA both generate larger trees compared to SingleMachine, with CMC generating trees that are ten times larger than LISA’s. One can see that modLISA with weighted averages dominates both CMC and LISA across all performance indicators since it yields the smallest RMSE, the smallest tree size, and less biased estimates. Generally, modLISA generates results that are by far the closest to the ones produced by SingleMachine.

Method TrainRMSE TestRMSE Tree Nodes Avg 95% CI for
2.73 2.94 602 1.91 [1.45 , 2.88]
1.18 1.19 55 0.001 [0.0009 , 0.0011]
0.57 0.59 7 7.97 [7.87 , 8.08]
0.55 0.56 7 9.04 [8.85 , 9.21]
Table 1: Comparing Train & Test RMSE, tree sizes, and average post burn-in with 95% CI in each method for to SingleMachine BART (all results are averaged over three different realizations of data).
Method GROW PRUNE CHANGE
21% 0.03% 34%
1.8% 0.5% 1.6%
20% 26% 19%
9% 10% 6%
Table 2: Average acceptance rates of tree proposal moves.

The size of trees produced by each method is in sync with the average acceptance rates of each tree proposal move shown in Table 2. It is noticeable the difference between CMC and LISA ’s average acceptance rates between growing a tree and pruning one. On the other hand, modLISA has overall larger acceptance rates with the smallest relative absolute difference between growing and pruning probabilities compared to LISA and CMC (% for modLISA, % for CMC, and % for LISA) and is closest to SingleMachine (%). Overall, modLISA induced a significant reduction in tree sizes by preserving a balance between growing and pruning trees which also improves exploring the posterior distribution.

Method TrainPredCov TestPredCov TrainCredCov TestCredCov
45.71 % 47.83 % 81.95 % 99.99 %
1.54 % 1.54 % 100 % 100 %
92.93 % 92.91 % 60.88 % 58.45 %
94.67 % 94.65 % 71.58 % 71.54 %
Table 3: Average coverage for 95% credible intervals constructed for training (TrainCredCov) and test (TestCredCov) data and 95% prediction intervals constructed for training (TrainPredCov) and test (TestPredCov) data. The prediction interval coverage is estimated based on 1000 iid samples, and . All results are averaged over three different realizations of data.

For a more clear comparison of the methods, Table 3 shows the average coverage of 95% credible intervals (CI) for predictors and 95% prediction intervals (PI) for future responses . The calculations are made for the values of and in the training and test data sets.

The coverage for CI is given by the averaging for all training or test data of

where is the CI for estimated based on the MCMC draws from .

The coverage of the PI corresponding to a pair is given by the proportion of 1000 iid samples generated from the true generative model that fall between its limits, i.e. the average over training or test data of

where is the PI for . The PI coverage in modLISA and SingleMachine are very close to nominal and vastly outperform the PI’s produced using LISA or CMC.

One can see that coverages of the CI built via CMC and LISA are high, which is not surprising since both algorithms produce over-dispersed approximations to the conditional distributions of . Our observation is that the CI for LISA and CMC are too wide to be practically useful. Also, modLISA and SingleMachine have much lower CI coverage than nominal which, as pointed out by one of the referees, is also expected due to the systematic bias induced by the discrepancy between the functional forms of the true predictor (continuous) and of the one fitted by BART (piecewise constant). Thus, the CI for will exhibit poor coverage as they are centered around a biased estimate of .

In order to verify that this is indeed the case we have generated a dataset of size 20,000 from the piecewise constant function:

where if and 0 otherwise, is a ten-dimensional input vector, with , and . Additional 5000 data have also been simulated as test cases. Table 4 summarizes the analysis with and confirms a sharp decrease in RMSEs even though the noise has the same variance . We note that the coverages of CI build under modLISA and SingleMachine are much higher.

Method TestRMSE TestCredCov
1.35 100 %
0.94 100 %
0.24 90.16 %
0.15 98.76 %
Table 4: Comparing test data RMSE and coverage of 95% credible intervals for piecewise with and .
(a) Test ,
(b) Test ,
(c) Training ,
(d) Training ,
Figure 1: Empirical distribution functions of obtained from MCMC samples produced by modLISA (red line), LISA (green line), CMC (blue line), and SingleMachine BART (black line) for two different pairs of training and test data. In this example .

6.1.2 Comparison with SingleMachine BART

In order to investigate the closeness of posterior samples in each method to the SingleMachine BART, we have plotted in Figure 1 the empirical distribution functions of generated from each algorithm for two pairs of observations in the training and test dataset. One can see that the empirical distribution functions in LISA and CMC don’t match the ones from SingleMachine, and look over-dispersed. However, the empirical distribution functions in modLISA weighted average look much closer to SingleMachine with a slight shift in location.

(a)
(b)
(c)
Figure 2: Blue lines: Fitted polynomial trends (for both train and test data) of average squared difference between empirical distribution functions of SingleMachine and the following: (a) CMC for training (solid line) and test (dot dashed line) data, (b) LISA with uniform weights for training (solid line) and test (dot dashed line) data and (c) modLISA with weighted average for training (solid line) and test (dot dashed line) data. The difference is plotted against the mean prediction produced by SingleMachine. Grey areas represent the credible intervals constructed from 100 independent replicates.

In order to assess the performance of the sampling procedures considered, we use the Cramér-von Mises distance to assess the difference between empirical distribution functions. This distance is defined to be where in our case we assume to be the empirical distribution function generated from posterior samples in SingleMachine BART and is similarly computed for the alternative method that is considered for comparison.

Using a set of equispaced points, we compute the average squared difference between the single machine and all other alternative methods for each observation in the dataset. To illustrate, for LISA we estimate using .

Figure 2 is comparing the fitted polynomial trends of (in each method) versus mean predicted in SingleMachine with their corresponding credible regions (for both train and test data). Clearly in LISA and modLISA, there are small variations around the trends with no significant changes in values of among different mean predicted , which specifies consistency within different train or test observations. In addition, the gap between trends from train and test data indicate that the average distance between LISA/modLISA and SingleMachine’s distributions are smaller for test data compared to train data. Furthermore, there are still small variations seen around CMC’s trends, but with slight changes in values of among different mean predicted , especially for the test dataset which indicates inconsistency within different observations.

(a) Train data
(b) Test data
Figure 3: Comparing fitted polynomial trends of average squared difference in empirical distribution functions of each method and SingleMachine, as functions of mean predicted in SingleMachine (for both train and test data).

To emphasize the difference in performance between modLISA and its competitors, Figure 3 shows all the fitted polynomial trends without their credible regions for the train and test data. One can see that there is a large gap between