We consider the problem of computing expectations for some complicated target distribution with density over a set and a target function
. Such expectations often arise in Bayesian inference and maximum likelihood estimation(Andrieu et al., 2003; Brooks et al., 2011). Oftentimes, has a closed form except for an unknown normalization constant, making the computation of the integral especially challenging (Andrieu et al., 2003).111In most of real world applications finding this normalization constant is as hard as computing the original expectation under . For example, when is obtained as a posterior distribution, can usually be written as from Bayes rule, where is available in closed form for any , but the value of the integral is unknown and is hard to compute. Markov chain Monte Carlo (MCMC) methods are a family of numerical estimation methods, which are successfully applied to estimate the aforementioned expectations, especially in high-dimensional problems. MCMC algorithms take random samples from an ergodic Markov chain with stationary distribution , and approximate the expectation via averaging over the produced sample.
The challenging problem in designing MCMC methods is to ensure that the distribution of the samples converge to fast, and, in practice, usually some domain-specific knowledge is used in the design of their proposal distributions to achieve fast convergence (Andrieu et al., 2003). This need for specialized design led to the development of dozens of methods for each problem, each of which has their own tunable parameters (Neufeld et al., 2014). Consequently, choosing the right method with corresponding parameters to achieve fast convergence is quite difficult and requires considerable time and effort.
A large body of work has been devoted in the literature to address this difficulty and to find ways to set the algorithms’ parameters optimally; for instance, optimal tuning of the Metropolis-Hasting algorithm (Roberts & Rosenthal, 2001; Bédard, 2008; Roberts et al., 1997; Atchadé et al., 2011; Brooks et al., 2011, Chapter 4). The problem with this line of research is that the solutions rely on some Markov chain parameters that are typically unknown (Łatuszyński et al., 2013).
A more promising line of research to address the parameter setting issue is based on adaptive MCMC methods. In this framework, the MCMC samples are used to learn about the target distribution, and the algorithms adjust their parameters as they progress (Łatuszyński et al., 2013). To do so, they rely on optimizing some objective function such as expected squared jumping distance (Pasarica & Gelman, 2010; Wang et al., 2013), the area under the autocorrelation function up to some specific lag (Mahendran et al., 2012), the difference between the proposal covariance matrix and its empirical estimation (Haario et al., 2001, 2006; Sejdinovic et al., 2014; Mbalawata et al., 2015), or the difference between the optimal acceptance rate (in practice, a recommendation thereof) and its empirical estimate (Yang & Rosenthal, 2017). Perhaps the most successful adaptive method for finding the optimal parameters (i.e., the number of leapfrog steps and the step size) in Hamiltonian Monte Carlo (HMC), called no-U-turn sampler (NUTS), is based on monitoring when the sample trajectories “turn back”, and the resulting algorithm provides state-of-the-art performance in a number of problems (Hoffman & Gelman, 2014). For more details about adaptive MCMC methods, the reader is referred to the tutorial paper by Andrieu & Thoms (2008).
In practice, making sure that a sampler can move between distant modes of the target distribution cannot be guaranteed by the aforementioned adaptive methods.222To be more precise, instead of modes the “high-probability” regions of the target distribution are of interest, but, for simplicity and following the standard language of MCMC literature, we will often refer to these as modes throughout the paper. There are two main approaches to deal with distant modes: (i) running parallel chains and combining their final samples; and (ii) sampling from powers of the target function (the inverse power is referred to as temperature), known as annealing. Indeed, in the literature, there are several successful methods based on combining these two ideas, such as parallel tempering (Earl & Deem, 2005) and (annealed) sequential Monte Carlo samplers, also known as particle filters (Doucet et al., 2001; Moral et al., 2006). The core idea of these methods and their variants is to run several chains with different temperatures and periodically exchange information between them. By increasing the temperature of a target distribution, it flattens (becomes more uniform), and the chains at high temperatures act as global search algorithms. They pass the information regarding the location of the modes to the chains with lower temperatures, which can effectively search each mode. An alternative to parallelization is the idea of regeneration (Mykland et al., 1995; Ahn et al., 2013). Regeneration partitions a long Markov chain into smaller independent segments such that the samples are unbiased in each segment, hence can be combined together without any further consideration. This process makes it possible to combine samples from different samplers and tune the parameters of the Markov chain after each regeneration. Although theoretically elegant, the application of regeneration methods is limited in practice since they require a properly tuned distribution for detecting regenerations.
In this paper we combine the strengths of adaptive and parallel MCMC methods. Instead of trying to find a single sampler that approximates the target distribution well on its whole domain, we run several samplers and select which sampler to use at any given time in a sequential manner, based on all the samples obtained before. Our main contributions are (i) adapting the bandit-based adaptive Monte Carlo (MC)333Throughout the paper, we refer to samplers taking unbiased, independent samples as MC methods method of Neufeld et al. (2014) to MCMC; and (ii) a novel method for combining samples from multiple chains. The resulting algorithm is suitable for sampling from challenging multimodal distributions and is fairly insensitive to the choice of its parameters. The next subsection gives a more detailed overview of our approach and the corresponding challenges.
1.1 Approach and challenges
In the simple case when all MCMC samplers mix well over the whole domain, our goal is to use the best sampler (which mixes the fastest) most of the time. This is very similar to the case of choosing from several unbiased MC samplers. For the latter, Neufeld et al. (2014) showed that scheduling which samplers to use at any point in time is equivalent to a stochastic multi-armed bandit problem (Bubeck & Cesa-Bianchi, 2012)
. This makes it possible to apply bandit algorithms to select which sampler to use to get the next sample, and the decision depends on the overall performance of the samplers so far, measured by the variance for each sampler. Extending the same idea to the MCMC case is not trivial, since measuring the quality of MCMC samplers is a much harder task. In fact, until recently, there has not been any empirical measure that can monitor the sample convergence. Common MCMC diagnostics such as effective sample size, trace and mean plots, or asymptotic variance assume the chain asymptotically converges to the target distribution, so they cannot detect asymptotic bias(Gorham & Mackey, 2015). To address this issue, Gorham & Mackey (2015) developed an empirical sample-quality measure that can detect non-convergence (or bias) based on Stein’s method. A kernelized version of this measure, called kernel Stein discrepancy (KSD) was subsequently developed by Liu et al. (2016); Chwialkowski et al. (2016); Gorham & Mackey (2017), which can be used to compare the quality of different samplers. As our first contribution, we extend the bandit-based racing method of Neufeld et al. (2014)
to MCMC samplers by using the KSD measure as the loss function in the bandit algorithms.This is described in detail in Section 3.1 while the background on KSD is given in Section 2.
On the negative side, KSD is not able to detect underfitting if the target distribution has well-separated modes, so it cannot distinguish between two samplers such that one samples only one mode while the other samples both modes, and the samples are equally good locally (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017). This brings us to the next problem: namely, MCMC methods using a single chain usually fail to explore the whole domain if the support has reasonably high-probability regions separated by low-probability regions (of course, such notion of separation depends on the actual sampler used). Setting the parameters of the samplers to deal with this issue, or simply detecting its presence, is hard, and to our knowledge no practical solutions thereof are available. To alleviate this problem, following the parallel MCMC framework, we run several chains in parallel only expecting that they provide good samples locally. The hope is that the multiple instances will explore the space sufficiently, finding all the important regions of the support. Then, in the end, we combine all the samples (from all the samplers) to approximate the target distribution. This is challenging for two reasons: (i) it is not obvious how the samples from different samplers should be weighted, and (ii) we do not want to waste resources to run several samplers exploring the same region of the domain. For (ii), we apply our bandit-based racing method locally (Section 3.2), while to address (i), we develop a method to estimate the probability of the region a set of samples cover based on Rényi-entropy estimates (Pál et al., 2010), and use these to weight the samples, which is our second main contribution. This is described in Section 3.3.
Our final sampling algorithm is put together in Section 4. Lastly, in Section 5 we demonstrate through a number of experiments that our method is competitive with state-of-the-art adaptive MCMC methods, such as the no-U-turn sampler NUTS (Hoffman & Gelman, 2014) or the recent sample reweighting method of Liu & Lee (2016) on simpler cases when the distribution is concentrated on a single “connected” region, while significantly outperforming the competitors, including parallel tampering and sequential MC, on harder problems where high-probability regions are separated by areas of low-probability, as well as on a challenging sensor-localization problem.
2 Measuring sample quality
As mentioned in Section 1.1, measuring the quality of samples produced by an MCMC algorithm is crucial in our approach. To this end we are going to use the recently introduced kernel Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017).
To measure the quality of a set of samples, we want to quantify how well a probability distributionover can approximate the target distribution ; in this section we assume that the density of is positive and differentiable on the whole . Oftentimes, is given by a weighted sample , where denotes the set , and with . One way to do this is to measure the maximum expectation error over a class of real-valued test functions :
where, in case of a weighted sample, (we use to distinguish the sample from ). If the class of test functions is large enough, for any sequence of probability measures , the convergence of to zero implies that converges weakly to (). One advantage of using this formulation is that we can recover different metrics in the form (1) by changing the class of test functions (Gorham & Mackey, 2015); for example, using , the measure in (1) becomes the total variation distance, while using the class of Lipschitz-continuous functions recovers the Wasserstein distance. Note that typically finding the exact solution in (1) is as hard as the original problem since one needs to calculate ; however, if we can find a class of functions such that for any function and is sufficiently rich so that (1) is still a meaningful measure of similarity, we can avoid this problem.
To this end, let be a positive definite kernel and its associated reproducing kernel Hilbert space (RKHS), be the induced norm from the inner product in . Then, for , and , we have . Given this kernel and a norm on with dual norm ,444The dual norm is defined as for any . Gorham & Mackey (2017) defined the kernel Stein set as
and the Langevin Stein operator as
where denotes the
-dimensional all-one vector (with each component being). Gorham & Mackey (2017) showed that that if is a continuous and bounded function with a continuous and bounded second derivative and , then for all . Thus, with , cancels from the definition (1) reducing it to
where we suppressed the dependence on , , and in the notation . This makes it possible to compute the optimum in the definition under the above conditions on . Let with denoting its th coordinate for any , and define
Then, if (where and are independently drawn from ) the resulting maximum becomes
When is a weighted sample , this simplifies to
where and . We call the kernel Stein discrepancy. Note that can be computed with our information about , since it only depends on through , which cancels the effect of the unknown normalization constant. Also, for any norm , if and only if it converges to zero for ; hence, in the rest of the paper we assume that is the Euclidean norm.
Under some technical conditions the KSD measure goes to if and only if converges weakly to (Gorham & Mackey, 2017): In particular, if is the inverse multiquadratic (IMQ) kernel for some and , if . Furthermore, if is Lipschitz with , whenever the Wasserstein distance of and converges to zero (which, in turn, implies ). Also note that popular choices of kernels, such as the Gaussian kernel or the Matérn kernel do not detect the non-convergence of the distribution to for (there are cases when but ), and the situation is the same for the IMQ kernel defined above with for . Note that these kernel do work for .
However, despite of the above nice theoretical guarantees, one might not be able to detect convergence based on KSD for practical sample sizes, especially when the target distribution is multimodal and the modes are well-separated: Gorham et al. (2016) demonstrated that for a one-dimensional Gaussian mixture target distribution (with two components), for practical sample sizes, the KSD measure fails to distinguish between two sets of samples, one drawn independently from one mode and the other drawn independently from the whole target distribution. Furthermore, KSD requires even more samples to distinguish between the two cases as the modes’ distance increases (see Section 6.1 of Gorham et al. (2016) for more details). Another issue is that the complexity of computing the KSD for an empirical distribution is quadratic in the sample size, which quickly becomes infeasible as the sample size grows.
An interesting property of the KSD measure is that it is convex: if are distributions and for nonnegative weights satisfying , we have
3 Sequential selection of samplers
In this section we present several strategies to select from a pool of MCMC samplers in a sequential manner. In all of our algorithms the selection of the sampler to be used next depends on the quality of the samples generated by the different samplers, where the quality will be measured by the KSD measure (or its approximations). Formally, assume we have access to MCMC samplers (e.g., multiple sampling methods and/or multiple instances of the same sampling algorithm with different parameters, such as starting point or step size), and denote the set of samplers by . At every step of the algorithm, we select one of the samplers and use it to produce the next batch of samples.
3.1 Mixing samplers
First we consider the case when each sampler is asymptotically unbiased (or consistent), that is, generates samples with an empirical distribution converging weakly to the target distribution almost surely (this is usually satisfied for any standard MCMC sampler when is unimodal). Our task is to sequentially allocate calls among the samplers to minimize the KSD measure of the set of samples we collect. In order to do so, we design an algorithm which gives preference to samplers where the convergence is faster. This setup is similar to the one considered by Neufeld et al. (2014), who designed sequential sampling strategies for MC samplers generating independent and identically distributed (i.i.d.) samples, by selecting samplers with smaller variances based on multi-armed bandit algorithms (Bubeck & Cesa-Bianchi, 2012). In this section we generalize their method to MCMC samplers, aiming to minimize the total KSD measure (cf. Eq. 3) of the sample instead of the variance (recall that for practical sample sizes–before convergence–the variance of the sample is not a good measure of quality for MCMC samplers). However, computing the KSD measure is quadratic in the sample size, and so it becomes computationally infeasible even for relatively small sample sizes—note that any computation we spend on selecting samplers could also be used for sampling. Therefore, we are going to approximate the KSD measure as the average KSD over smaller blocks of samples.
For a sampler with total sampling budget , we break the sampling process into rounds: At each round the sampler takes a batch of samples of size . Let be the KSD measure of samples from the th round; we approximate , the KSD of the full sample of size with the average . We call this the block-diagonal approximation, as it corresponds to a block-diagonal approximation of the kernel matrix in computing (3). To quantify the accuracy of the approximation, we assume that there exists a function such that and for all . Using the block-diagonal approximation, we relax our goal to competing with a sampler with the smallest average approximate block-KSD measure , where is the KSD measure of the samples generated by sampler when it is called the th time. We also assume that this is close to the KSD measure of a sample of size obtained from using sampler for blocks, that is, for all . Experimental results presented in Section 5.1 indicate that the block-diagonal approximation mostly preserves the ranking of the samplers (as defined by the true KSD measure), hence we pay very little price for the computational advantage we get.
Furthermore, solving this problem is well-suited for any bandit algorithm; here we adapt the UCB1 method of Auer et al. (2002). The resulting algorithm, which we call KSD-UCB1 and is given in Algorithm 1555The algorithm assumes that , which can be achieved by rescaling the KSD measures. In practice, a reasonable estimate for the range can be obtained from for each sampler .
, keeps track of an optimistic estimate of the average approximate KSD value for each sampler, and every time selects a sampler whose performance is estimated to be the best possible (based on high-probability confidence intervals).
If the KSD values were i.i.d., the standard bandit regret bound (Bubeck & Cesa-Bianchi, 2012) would yield . In this case the convexity of the KSD measure (recall Eq. 4) would imply that after rounds of sampling,
This shows that increasing and , the performance of KSD-UCB1 would be close to that of the best sampler. However, in our case, the are not i.i.d. Assuming that the samplers mix (which is reasonable for a single mode distribution), the are getting closer and closer to be sampled i.i.d. as increases. Also, as mentioned above, the effect of the block-diagonal approximation (and hence that of ) is small in practice (see also Section 5.1). The exact parametrization of KSD-UCB1 corresponds to the assumption that the block-KSD values are in . While this is not true in general, in practice we estimate the maximum KSD value from samples and normalize the values accordingly, before feeding them to the KSD-UCB1 algorithm.
3.2 Locally mixing samplers
In practice, if the target distribution is multimodal and the modes are far from each other, MCMC methods often get stuck in some of the modes and fail to explore all the regions where is supported; while eventually all asymptotically consistent methods reach each mode, this may not happen after any practically reasonable time. To model this situation, we assume that the support of is partitioned into sets with for all (the are pairwise disjoint and their union is the support of ) such that the empirical distribution of the samples generated by sampler converges weakly to for some , where is the conditional distribution of over . We refer to the sets as regions, and a sampler satisfying the above condition almost surely a locally mixing sampler.
For simplicity we first consider the case where there is one sampler in each region (consequently ). This setup is similar to stratified sampling: The idea is to partition the domain into non-overlapping regions (a.k.a., strata), draw samples from each region, and combine the final samples to estimate (Owen, 2013). The problem in stratified sampling is to find the optimal number of samples that need to be taken from each stratum in order to minimize the Monte Carlo integration error. Given the total number of sample , the optimal strategy for minimizing the mean squared error (MSE) is to sample each stratum times (relaxing the integrality constraints), where
is the conditional standard deviation ofgiven that falls into the th region (Carpentier et al., 2015).
One can immediately see that the problem we consider in this subsection is very similar to stratified sampling, with the important differences that our samplers are not i.i.d., and we do not minimize the squared error but the KSD measure. Denoting the distribution of samples from region by after taking samples in total, let denote the weight of sampler generating these samples (recall that here, by assumption, we have one sampler in each region). Then our total weighted sample distribution becomes
Since according to our assumptions, converges weakly to almost surely, for every , we need to have for all to ensure that converges to weakly (we will refer to this as the being asymptotically consistent). A procedure for estimating
this way will be given in the next section. Assuming for a moment that, using the convexity of the KSD measure (see Eq. 4), we have
where, as before, and denote the KSD measures of the whole sample and, resp., that of sampler (with number of samples obtained by sampler ). Based on this inequality, we could aim for minimizing and use the ideas from adaptive stratified sampling (Carpentier et al., 2015); however, multiple challenges preclude us from doing this: (i) is not known in advance; (ii) the stratified sampling algorithm is based on the known convergence rate of the sample average, but we do not know how fast approaches zero (as a function of ); and (iii) the computational complexity of calculating is . To handle (i), we will address the problem of estimating in Section 3.3. For (ii), a conservative approach is to uniformly minimize , hence selecting sampler for which this quantity is the largest. For (iii), we again use a block-diagonal approximation to , which causes problems with (ii), since the estimate does not converge to for a fixed block size. In Section 5.3 we present experiments with different strategies under different setups, but the strategies considered seem to perform rather similarly. The simplest strategy considered is to select regions uniformly at random. Another approach is to minimize the maximum , that is, choosing , which does not require the knowledge of the weights , while we also test strategies selecting regions based on their estimated weights or, similarly to stratified sampling, based on their estimated variance. Although quite different, the strategies considered seem to perform rather similarly in various settings.
Unfortunately, we cannot guarantee that we start samplers in such a way that we have a single sampler for each region. If we know which sampler belongs to which region (recall that we assume that the samplers are locally mixing and hence belong to a region ), we can combine any of the region selection strategies described above with the bandit method described in Section 3.1 in a straightforward way: in each region , run an instance of KSD-UCB1 over the samplers exploring this region, and use this KSD-UCB-1 instance as a single locally mixing sampler in any of the above region selection methods. Finally, when the sampling budget is exhausted ( samples are generated), estimate the weights , and reweight the samples corresponding to region according to (5). We refer to this procedure as KSD-UCB1-M (M stands for Multiple regions); the pseudocode of the algorithm is given in Algorithm 2. Clearly, since the bandit algorithms sample each sampler infinitely often, we have the following consistency result:
Under the local mixing assumption made at the beginning of this section, the final weighted sample (5) obtained by KSD-UCB1-M is asymptotically unbiased as long as the weight estimates are asymptotically unbiased.
Thus, we need to find some asymptotically unbiased estimates of the probabilities of the regions.
3.3 Weight estimation
In this section we consider the problem of finding the weights in (5). As discussed after the equation, this amounts to finding the probability of the region the samples cover, which is again challenging since we have access only to an unnormalized density. This problem is faced by every algorithm which tries to speed up MCMC methods by running parallel chains. As an example, in big-data scenarios it is common to split the data into subsets, run an MCMC sampler on each subset, and combine the resulting samples in the end. To our knowledge, most work in the literature solves this problem by estimating the density of each batch of samples separately (Angelino et al., 2016; Nemeth & Sherlock, 2018), using typically either a Gaussian approximation (Scott et al., 2016)
or some kernel-density estimation method(Neiswanger et al., 2014). According to Nemeth & Sherlock (2018), the first approach works well in practice, in spite of not being supported by any theory, while the second, kernel-based estimation scales poorly with the dimension .
Here we take a different approach and rather than estimating the density of the sample batches, we directly estimate the probabilities via Rényi entropy.
Formally, suppose the domain of is partitioned into non-overlapping regions , and from each region we have a set of samples . The Rényi entropy of order for a density is defined as . The conditional density of restricted to a set is denoted as , and its Rényi entropy is . From this definition it trivially follows that
In our case, instead of we only have access to for some . Replacing with in the integral, we obtain
Thus, we can estimate (or, more precisely, ) by estimating the two terms on the right-hand side of the above equation.
Given a sample taken i.i.d. from , the second term can be estimated by the empirical average , while for the first term we can use a graph-based estimate (Hero & Michel, 1999). In particular, we are going to use the estimator of Pál et al. (2010), which is based on generalized -nearest neighbor graphs of the sample , and it converges to almost surely for any as the sample size grows to infinity. Thus, we obtain that is an asymptotically unbiased estimate of . Therefore, given a fixed partition ,
More precisely, if the minimum number of samples in the regions is , then .
We use the following estimator based on -nearest neighbors (Pál et al., 2010): Let denote that set of all pairs in defining the edges of its -nearest neighbor graph, and define for some . Then for , the estimator is defined as
where and is a constant such that almost surely where is a set of i.i.d. samples drawn uniformly from the unit cube . Theorem 1 of Pál et al. (2010) ensures that is well-defined, while their Theorem 2 shows that if is bounded, then for any and ,
with probability at least where is the number of samples in . Omitting the terms from the notation for simplicity, we see that the additive error of the estimator is of order which dominates the error of for (which can be obtained by standard concentration inequalities if is bounded from below by some number, see, e.g., Boucheron et al., 2013). This implies that . Therefore, can be estimated with a multiplicative error of where now denotes the minimum number of samples over the partition cells . In other words, , and using as , we also have as . Note that although the error of our estimator scales quite unfavorably in the dimension , in practice it seems to work well even for moderately large dimensions (around 20). Furthermore, to maximize the convergence rate, in the experiments we choose to be close to one (this also controls the variance of , which was hidden in the above crude analysis); see Section 5.4 for more details.
4 The final algorithm
So far we have discussed how to solve our problem if we have either local mixing or all the samplers mix globally. While the latter is the case asymptotically for all the MCMC samplers used in practice, the mixing may be too slow to be observed for any practical number of samples. On the other hand, the problem does not simplify to the local mixing scenario, since–even if well-separated regions are actually present–the chains often jump from one region to another even if they do not cover the whole domain.
To be able to adapt our KSD-UCB1-M algorithm, we need to group together the samplers covering the same region (even though what a region is is not clearly defined in this scenario). The problem is especially hard since the grouping of the samplers is non-stationary, and we should also be able to track when a sampler leaves or joins a region (equivalently, group). Furthermore, if the groups are too large, we do not explore the whole domain, while if they are too small, we waste resources by running multiple samplers for the same region.
To solve this issue, we propose a simple heuristic to identify samplers that are close together: In each round of the sampling, we take all the samples from the last batch of each sampler, and for each sample point we look at itsnearest neighbors. Then we find the grouping where any two samplers are grouped together that have points which are nearest neighbors of each other (this can easily be done recursively). By this simple heuristic, in each round we can group together the samplers that are close to each other. Note that here we do not make any assumption regarding the number of the regions (e.g., well-separated modes) of the distribution. By having all the samples, the algorithm can easily identify multiple regions by running a clustering algorithm. The final step of the algorithm is to determine the correct weight for each sample point. Recall that in the locally mixing case we weighted the empirical distribution of each region by their estimated probability (cf. Eq. 5). Here, since we do not have these regions, in the end we assign the samples into clusters using -means clustering, and weight the empirical measure within each cluster by the estimated probability of the cluster (using Eq. 8). Algorithm 3 shows the whole procedure, called KSD-MCMC with reweighting (KSD-MCMC-WR).
In this section we empirically evaluate our choices in the algorithm design process as well as compare the performance of our final algorithm with state-of-the-art methods from the literature on several synthetic problems.
We use three different base sampling methods: Metropolis-Hastings (MH), the Metropolis-adjusted Langevin algorithm (MALA) of Roberts & Rosenthal (2002) and the no-U-turn sampler (NUTS) of Hoffman & Gelman (2014). For the latter, which is a state-of-the-art sampling method (Wang et al., 2013; Liu & Lee, 2016), we used the implementation provided in the pymc3 package (Salvatier et al., 2016) which, on top of the original NUTS, also includes several modifications recommended in the literature for boosting its performance (see the pymc3 package website for more details). Beside the initial point, MH and MALA have a single step size parameter (for MALA the preconditioning matrix is always the identity), while for NUTS we used the default setting (in our experiments the performance of NUTS was insensitive to the choice of its parameters). In our experiments, when a MH or MALA sampler is chosen randomly, the step size is selected uniformly at random from (the initial points are selected uniformly at random “not too far” from the modes of the underlying distribution).
We present one set of experiments for each component of our algorithm. Throughout this section, to compute the KSD measure, we use the inverse multiquadratic kernel suggested by Gorham & Mackey (2017) with and unless otherwise stated (experiments in Section 5.1 and 5.6 indicate that the performance of our method is quite insensitive to the choice of in the range ).
Section 5.1 analyzes the effect of using the block-diagonal approximation of the KSD measure. The performance of our bandit-based samplers for unimodal target distributions is considered in Section 5.2. Multimodal densities with separated modes and one sampler in each mode are considered in Section 5.3, including an empirical analysis of our weight estimation method in Section 5.4. Our final algorithm KSD-MCMC-WR is tested in the general setting of a multimodal target distribution with an unknown number of modes in Section 5.5, and specifically against parallel MCMC methods in Section 5.6. Finally, we consider a realistic task of node localization in a sensor network in Section 5.8.
5.1 Block-diagonal approximation of the KSD measure
The intuition behind our bandit MCMC method is that we can identify the best sampler among a group of samplers from the average KSD for small batch sizes instead of making the decision based on the KSD computed over a large set of samples. To verify this hypothesis, we checked if the average approximate KSD computed on small blocks preserves the order of two samplers as defined by the true KSD measure. In particular, we compared the average block-KSD measure of two samplers for different block sizes. We drew
samples from each sampler for a non-isotropic two-dimensional independent Gaussian distribution and computed the average block-KSD measure for these samples with block sizes(the parameters of the samplers and the distribution were selected randomly666Throughout this section the variance of the components of a Gaussian distribution are selected uniformly from ). Formally, for each sampler with block size and total sample size , we computed the average block-KSD as , where is the average block-KSD computed on the th block of data, and for a given block size , is declared to be the better sampler according to the measurements.
Table 1 shows the fraction of times the average block-KSD for different block sizes gave the same ordering of the samplers as the ordering obtained for block size (which is treated as the ground truth in this experiment), both for MH and MALA as base samplers. We can observe that the ordering obtained from the average block-KSD measures is most of the time close to the “ground truth”, justifying its use for measuring sample quality. The agreement in most cases is around at least , and rearly gets below . Interestingly, no trend regarding the kernel width is observable. Scatter plots of average block-KSD differences () for and different block sizes are given in Figure 1 for the MH algorithm. These figures show that in cases of incorrect ordering, the average block-KSD values of the samplers are really close for both block sizes, which means the two samplers considered are of approximately the same quality according to both measures (thus, practically it does not matter which of them is used for sampling).
|Kernel width||Batch Size||10||25||50||100||250||500||2000|
5.2 Unimodal target distributions with multiple samplers
In this section we consider sampling from a standard normal Gaussian distribution, where the samplers have access to the unnormalized density . The goal of the samplers is to estimate the mean of the distribution with respect to the mean squared error. We consider different MH and, respectively, MALA samplers (with step size parameters , resp.) and run our proposed algorithm KSD-UCB1 (Algorithm 1) on top of them. The method is compared with the following baselines and variations:
Uniform: This algorithm distributes the computational budget equally among the samplers, and in the end takes all the samples generated from all samplers with equal weights.
KSD-opt: This method again takes the same number of samples from each sampler (i.e., uses the aforementioned uniform algorithm), but reweights them using the method of Liu & Lee (2016). Indeed, KSD-opt is a reweighting method that can be used with any sets of samples. However, it has been originally proposed to be used on top of the uniform algorithm. This reweighting can be very effective for unimodal distributions or multimodal distributions with really close modes, but it is computationally quite demanding.
-greedy: This method is a variant of KSD-UCB1, but is based on the -greedy bandit algorithm (Bubeck & Cesa-Bianchi, 2012) instead of UCB1. That is, in every round, with probability , the sampler with the smallest approximate block-KSD measure is selected, while with probability , a sampler is selected uniformly at random. In the experiments in round .
We also compared the performance with that of NUTS. Figures 4 and 4 show the results for MH base samplers with and as a function of , respectively. One can observe that both bandit-based algorithms (i.e., KSD-UCB1 and KSD--greedy) perform similarly, almost achieving the performance of the best base sampler. Interestingly, there is no significant difference between the two methods, and -greedy, which is an inferior bandit algorithm, seems to perform slightly better. This might indicate that in our case less exploration could be preferable. Also, it is interesting to note that the bandit method with the smaller batch size of outperformed the one with bigger batch size of . Since we have already observed that the ordering of the samplers is not really affected by the batch size, this improved performance is most likely due to the increased number of decision points, which allows better adaptation. Note that given that the computational complexity of calculating the KSD measure is quadratic in the number of samples, smaller batch sizes have a huge advantage compared to bigger ones. In particular, in our experiment, the computational cost for batch size is about two orders of magnitude smaller than for batch size . The running time of the different KSD computations are shown in Figure 2.
On the other hand, although our methods are competitive with the MH samplers, one can observe that NUTS significantly outperforms all of them, hence also their combinations. Furthermore, the reweighting mechanism of KSD-opt can provide a significant boost in performance for lower dimensions.
The performance of the aggregating sampling algorithm can be significantly improved by changing the base samplers. This is shown in Figures 6 and 6, where MH is replaced with the much better MALA samplers, making the best sampler and our bandit-based sampling method competitive with NUTS. Interestingly, due to the better quality samples, in this case KSD-opt outperforms MALA for the whole range of dimensions (up to ) we consider, although the performance difference vanishes as increases, and indicates (as expected) that NUTS will be better for large values of .
5.3 Separated modes with one sampler for each mode
In this section we consider the case of a multimodal target density (Gauss mixture) with separated modes, where we have one sampler for each mode. Here we run several versions of KSD-UCB1-M, which differ in how the region (mode) is selected in the algorithm. In particular, we consider the following methods for selecting :
Uniformly at random (this is called "Equal probability" in the figures).
is selected as the sampler (region) with the largest average block-KSD measure (KSD).
is selected randomly with probability proportional to , as suggested by (6) (KSD.).
based on the stratified sampling idea of (Carpentier et al., 2015), with probability proportional to , where is the estimated weight and is the estimated standard deviation of the samples ()
For completeness, for the last reweighting step we conducted experiments with both the true weights and also with estimated weights.
We ran experiments for a target distribution with three separated modes, where the goal again was to estimate the mean of the distribution with respect to the mean squared error. We selected one sampler for each mode (with random parameters). The experiment were repeated times, both for MH and MALA base samplers, with different random initializations for the MCMC chains. Finally, the whole process was repeated times, drawing different parameters for the samplers. Figure 7 shows the results for one of the settings with MH samplers (the results for all the settings followed the same pattern); both with and without the a priori knowledge of the weights. The results with MALA instead of MH are presented in Figure 8. One can observe that, for known weights, the best method for selecting in both experiments is the random choice with probability proportional to . On the other hand, when the weights are needed to be estimated, the performance of our more informed methods (i.e., all except for the uniform random selection) is approximately the same, which is due to the unavoidable errors in estimating the weights (this effect is more visible for small sample sizes).
5.4 Weight estimation
A crucial step in the aggregation of the samples coming from different samplers is to estimate the weight of each mode. To this end, we repeated the same experiment with a uniformly random choice of , but considering different methods for the final weight estimation. In particular, we compare the Rényi-entropy-based estimates with different parameters and the natural weight estimates for Gaussian mixtures based on estimating the covariance matrix for each mode (Scott et al., 2016). The results presented in Figure 9 show that the weight estimation is not really sensitive to the choice of (the recommendation of Szabó, 2014 is to use values close to ). Interestingly, in case of MALA base samplers, the Rényi-entropy-based methods outperform the ones designed specifically for the Gaussian distribution, which is indeed the underlying distribution in our case. In a similar setup, Figure 10 compares several aggregation schemes, including the above tested Rényi- and, respectively, Gaussian-mixture-based combination, as well as uniform averaging and the computationally very expensive KSD-based reweighting scheme of Liu & Lee (2016). The results demonstrate that our method is able to outperform the others. On the other hand, extending the experiment to higher dimensions (and using NUTS as the base sampler, see Figure 11 ), the Gaussian and the Rényi-entropy-based estimators perform very similarly (recall that the Gaussian estimator is specifically fitted to the Gauss mixtures considered in the experiments).
5.5 The general case: unknown number of modes with multiple chains
In this section, we consider a realistic situation where we have access to an unnormalized density but the number of its separated regions/modes is unknown. Again, the goal is to estimate the mean of the distribution. Here we cannot guarantee that any MCMC chain used will only sample one mode; on the contrary, chains will usually move among different regions. We consider a Gaussian mixture model with modes of random parameters, and run base samplers whose parameters are also chosen randomly. We compare several combination methods discussed before: (i) each sampler is used to generate the same number of samples (Uniform); (ii) the same as the previous, but the final clustering and reweighting step of KSD-MCMC-WR (Algorithm 3) is used in the end (Uniform+clustering); finally, four versions of KSD-MCMC-WR are considered where the bandit method is -greedy or UCB1, and the selection of is uniform at random (Equal probability) or random with probabilities proportional to . For clustering, we used the "Kmean" method of the scikit-learn package (Pedregosa et al., 2011).
After randomly choosing the distribution and the samplers, we considered them fixed and ran experiments; in each of them the sampling methods are started from random initial points. The whole process was then repeated for