 # Scalable Discrete Sampling as a Multi-Armed Bandit Problem

Drawing a sample from a discrete distribution is one of the basic building blocks of Monte Carlo methods. Like other sampling algorithms, discrete sampling suffers from a high computational burden in large-scale inference problems. We study the problem of sampling a discrete random variable with a high degree of dependency, which is typical in large-scale Bayesian inference and graphical models, and propose an efficient approximate solution based on subsampling. We make a novel connection between discrete sampling and Multi-Armed Bandit problems with a finite reward population, and provide three algorithms with theoretical guarantees. Empirical evaluations show the robustness and efficiency of the approximate algorithms on both synthetic and real-world large-scale problems.


## 1 Introduction

Sampling a random variable from a discrete (conditional) distribution is one of the core operations in Monte Carlo methods. It is a ubiquitous and often necessary component of inference algorithms such as Gibbs sampling and particle filtering. Like other Monte Carlo algorithms, discrete sampling is challenging to apply in large-scale problems because of its high computational burden. Various approaches have been proposed to address different dimensions of "large scale". For example, distributed algorithms have been used to sample models with a large number of discrete variables (Newman et al., 2009; Bratières et al., 2010; Wu et al., 2011), and smart transition kernels have been designed for Markov chain Monte Carlo (MCMC) algorithms to efficiently sample a single variable with a large or even infinite state space (Li et al., 2014; Kalli et al., 2011). This paper focuses on another dimension of "large scale", where the variable to sample has a large degree of statistical dependency.

Consider a random variable $X$ with a finite domain $\mathcal{X}$ and a distribution of the form

$$p(X=x) \propto \tilde p(X=x), \quad \text{with } \tilde p(X=x) = f_0(x)\prod_{n=1}^{N} f_n(x), \tag{1}$$

where $f_n(x)$ can be any function of $x$. Such distributions occur frequently in machine learning problems. For example, in Bayesian inference for a model with parameter $x$ and i.i.d. observations $\mathbf{y}_n$, the posterior distribution of $x$ depends on all the observations when no sufficient statistic is available. The unnormalized posterior distribution can be written as $\tilde p(x) = p_0(x)\prod_{n=1}^{N} p(\mathbf{y}_n \mid x)$. In undirected graphical model inference problems where a node $X$ appears in $N$ potential functions, the distribution of $X$ depends on the values of all of those functions. The unnormalized conditional distribution is $\tilde p(X=x \mid \mathbf{x}_{-}) = \prod_{n=1}^{N} f_n(x; \mathbf{x}_{-})$, where $\mathbf{x}_{-}$ denotes the value of all the other nodes in the graph and $f_n$ denotes a potential function that includes $X$ in its scope. In this paper we study how to sample a discrete random variable in a manner that is scalable in $N$.

A common approach to the big data problem is divide-and-conquer: use parallel or distributed computing resources to process data in parallel, and synchronize the results periodically or only once at the end (Scott et al., 2013; Medlar et al., 2013; Xu et al., 2014).

An orthogonal approach has been studied for the Metropolis-Hastings (MH) algorithm in a general state space by running a sampler with subsampled data. This approach can be combined easily with the distributed computing idea for even better scalability (e.g. Ahn et al., 2015).

Maclaurin & Adams (2015) introduced an MH algorithm in an augmented state space that could achieve higher efficiency than the standard MH by processing only a subset of active data every iteration while still preserving the correct stationary distribution. But the introduction of auxiliary variables might also slow down the overall mixing rate in the augmented space.

Approximate MH algorithms based on subsampling have been proposed with high scalability. The stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) and its extensions (Ahn et al., 2012; Chen et al., 2014; Ding et al., 2014) introduced efficient proposal distributions based on subsampled data. Approximate algorithms induce bias in the stationary distribution of the Markov chain, but given a fixed amount of runtime they can reduce the expected error of the Monte Carlo estimate through a proper trade-off between variance and bias, by mixing faster with respect to runtime. This is particularly important for large-scale learning problems where runtime is one of the limiting factors for generalization performance (Bottou & Bousquet, 2008). However, the stochastic gradient MCMC approach usually skips the rejection step in order to obtain sublinear time complexity, and the induced bias is very hard to estimate or control.

Another line of research on approximate subsampled MH algorithms does not ignore the rejection step, but controls the error with an approximate rejection step based on a subset of data (Korattikara et al., 2014; Bardenet et al., 2014). The bias can thus be better controlled (Mitrophanov, 2005; Pillai & Smith, 2014). That idea has also been extended to slice sampling (DuBois et al., 2014) and to Gibbs sampling for binary variables (Korattikara et al., 2014).

In this paper we follow the last line of research and propose a novel approximate sampling algorithm to improve the scalability of sampling discrete distributions. We first reformulate the problem in Eq. 1 as a Multi-Armed Bandit (MAB) problem with a finite reward population via the Gumbel-Max trick (Papandreou & Yuille, 2011), and then propose three algorithms with theoretical guarantees on the approximation error and an upper bound of $N|\mathcal{X}|$ on the sample size. This is, to our knowledge, the first attempt to address the discrete sampling problem with a large number of dependencies, and our work will likely contribute to a more complete library of scalable MCMC algorithms. Moreover, the racing algorithm in Sec. 3.3 provides a unified framework for subsampling-based discrete sampling, MH (Korattikara et al., 2014; Bardenet et al., 2014) and slice sampling (DuBois et al., 2014) algorithms, as discussed in Sec. 4. We also show in the experiments that our algorithm can be combined straightforwardly with stochastic gradient MCMC to achieve both high efficiency and controlled bias. Lastly, the proposed algorithms are also of independent interest for MAB problems under this particular setting.

We first review an alternative way of drawing discrete variables and build a connection with MABs in Sec. 2, then propose three algorithms in Sec. 3. We discuss related work in Sec. 4 and evaluate the proposed algorithms on both synthetic data and real-world problems of Bayesian inference and graphical model inference in Sec. 5. In particular, we show how our proposed sampler can be combined conveniently, as a building component, with another subsampling-based sampler for a hierarchical Bayesian model. Sec. 6 concludes the paper with a discussion.

## 2 Approximate Discrete Sampling

### 2.1 Discrete Sampling as an Optimization Problem

The common procedure to sample from a discrete domain is to first normalize $\tilde p$ and compute the CDF $F(x) = \sum_{i \le x} p(X=i)$. Then draw a uniform random variable $u \sim \mathrm{Uniform}[0,1]$ and find the smallest $x$ that satisfies $u \le F(x)$. This procedure requires computing the sum of all the unnormalized probabilities. For $\tilde p$ in the form of Eq. 1, this costs $O(N|\mathcal{X}|)$ factor evaluations.
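As a concrete baseline, the inverse-CDF procedure can be sketched as follows (a minimal illustration with our own function and variable names, not code from the paper):

```python
import numpy as np

def exact_discrete_sample(log_p_tilde, rng):
    """Draw one exact sample from p(x) ∝ exp(log_p_tilde[x]) via the CDF.

    Every state's unnormalized probability is evaluated, which is what
    costs O(N) factor evaluations per state under Eq. 1.
    """
    m = np.max(log_p_tilde)              # log-sum-exp shift for stability
    p = np.exp(log_p_tilde - m)
    p /= p.sum()                         # normalize
    cdf = np.cumsum(p)
    u = rng.uniform()                    # u ~ Uniform[0, 1)
    return int(np.searchsorted(cdf, u))  # smallest x with u <= CDF(x)
```

Each call touches all $|\mathcal{X}|$ states, so the total cost is the $O(N|\mathcal{X}|)$ the paper sets out to reduce.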

An alternative procedure is to first draw $|\mathcal{X}|$ i.i.d. samples $\epsilon_i$ from the standard Gumbel distribution¹, and then solve the following optimization problem:

$$x = \operatorname*{\arg\max}_{i \in \mathcal{X}} \; \log \tilde p(i) + \epsilon_i. \tag{2}$$

¹The Gumbel distribution is used to model the maximum extreme value. If a random variable $\epsilon \sim \mathrm{Gumbel}$, then $P(\epsilon \le x) = e^{-e^{-x}}$. A sample can be easily drawn as $\epsilon = -\log(-\log u)$ with $u \sim \mathrm{Uniform}[0,1]$.

It is shown in Kuzmin & Warmuth (2005) that $x$ follows the distribution $p(X=x)$. With this method, after drawing $|\mathcal{X}|$ random variables that do not depend on $\tilde p$, we turn a random sampling problem into an optimization problem. While the computational complexity of drawing an exact sample is unchanged, an approximate algorithm may save computation by avoiding accurate evaluation of $\log \tilde p(i)$ for states $i$ that are unlikely to be the maximum, as discussed next.
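The Gumbel-Max procedure of Eq. 2 is simple to sketch in code (illustrative names; the inverse-transform Gumbel draw matches the footnote above):

```python
import numpy as np

def gumbel_max_sample(log_p_tilde, rng):
    """Sample x with P(x) ∝ exp(log_p_tilde[x]) via the Gumbel-Max trick:
    perturb each unnormalized log-probability with independent standard
    Gumbel noise and return the argmax (Eq. 2)."""
    u = rng.uniform(size=len(log_p_tilde))
    eps = -np.log(-np.log(u))            # standard Gumbel(0, 1) draws
    return int(np.argmax(log_p_tilde + eps))
```

Note that the noise is drawn before any factor is evaluated, which is what opens the door to skipping accurate evaluation of unlikely states.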

### 2.2 Approximate Discrete Sampling as a Multi-Armed Bandits Problem

In a Multi-Armed Bandit (MAB) problem, the $i$'th bandit is a slot machine with an arm which, when pulled, generates an i.i.d. reward from a distribution associated with that arm with an unknown mean $\mu_i$. The optimal arm identification problem for MABs (Bechhofer, 1958; Paulson, 1964) in the fixed-confidence setting is to find the arm with the highest mean reward, with confidence at least $1-\delta$, using as few pulls as possible.

Under the assumption of Eq. 1, the solution in Eq. 2 can be expressed as

$$\begin{aligned}
x &= \operatorname*{\arg\max}_{i \in \mathcal{X}} \sum_{n=1}^{N} \log f_n(i) + \log f_0(i) + \epsilon_i & (3)\\
&= \operatorname*{\arg\max}_{i \in \mathcal{X}} \sum_{n=1}^{N} \underbrace{\left( \log f_n(i) + \frac{1}{N}\big(\log f_0(i) + \epsilon_i\big) \right)}_{\overset{\text{def}}{=}\, l_{i,n}} & (4)\\
&= \operatorname*{\arg\max}_{i \in \mathcal{X}} \frac{1}{N} \sum_{n=1}^{N} l_{i,n} = \operatorname*{\arg\max}_{i \in \mathcal{X}} \mathbb{E}_{l_i \sim \mathrm{Uniform}(L_i)}[l_i] & (5)\\
&\overset{\text{def}}{=} \operatorname*{\arg\max}_{i \in \mathcal{X}} \mu_i & (6)
\end{aligned}$$

where $L_i = \{l_{i,1}, \dots, l_{i,N}\}$. After drawing $|\mathcal{X}|$ Gumbel variables $\epsilon_i$, we turn the discrete sampling problem into the optimal arm identification problem in MABs, where the reward of arm $i$ is uniformly sampled from the finite population $L_i$. An approximate algorithm that solves the problem with a fixed confidence may avoid drawing all the rewards from an obviously sub-optimal arm and thereby save computation. We characterize the induced bias in the sample distribution as follows, with the proof in Appx. A.1. If an algorithm solves (2) exactly with a probability at least $1-\delta$ for any value of $\boldsymbol{\epsilon}$, the total variation between the sample distribution $\hat p$ and the true distribution $p$ is bounded by

$$\|\hat p(X) - p(X)\|_{TV} \le \delta. \tag{7}$$

When the sampler is applied as a transition kernel in the MCMC framework, we can immediately apply the theory in Mitrophanov (2005); Pillai & Smith (2014) to show that the approximate Markov chain is uniformly ergodic under regular conditions, and analyses of the convergence rate are readily available under various assumptions. So the discrete sampling problem of this paper reduces to finding a good MAB algorithm for Eq. 2 in our problem setting.
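To make the reduction concrete, the following small sketch (our own illustrative code) builds the finite reward population of Eq. 4 and checks that the best arm coincides with the Gumbel-Max solution of Eq. 2:

```python
import numpy as np

def reward_population(log_f, log_f0, eps):
    """Rewards l[i, n] = log f_n(i) + (log f_0(i) + eps_i) / N  (Eq. 4).

    `log_f` is a (|X|, N) array of log f_n(i); `log_f0` and `eps` are
    length-|X| arrays. The arm with the highest mean reward mu_i is
    exactly the Gumbel-Max sample of Eq. 2.
    """
    N = log_f.shape[1]
    return log_f + (log_f0 + eps)[:, None] / N
```

A MAB algorithm then only needs to estimate the row means well enough to identify the argmax, rather than evaluate all $N|\mathcal{X}|$ entries.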

## 3 Algorithms for MABs with a Finite Population and Fixed Confidence

The key difference of our problem from regular MABs is that our rewards are generated from a finite population, while regular MABs assume i.i.d. rewards. Because one can obtain the exact mean $\mu_i$ by sampling all $N$ values of arm $i$ without replacement, a good algorithm should pull each arm no more than $N$ times regardless of the mean gap between arms. We introduce three algorithms in this section whose sample complexity is upper bounded by $N|\mathcal{X}|$ in the worst case, and which can be very efficient when the mean gap is large.

### 3.1 Notations

The iteration of an algorithm is indexed by $t$. We denote the entire index set by $[N] = \{1, \dots, N\}$, the set of reward indices sampled from arm $i$ up to the $t$'th iteration by $\mathcal{T}_i(t)$, and the corresponding number of sampled rewards by $T_i(t) = |\mathcal{T}_i(t)|$. We define the estimated mean of the $i$'th arm $\hat\mu_i(t) = \frac{1}{T_i(t)}\sum_{n \in \mathcal{T}_i(t)} l_{i,n}$, the natural (biased) variance estimate $\hat\sigma_i^2(t) = \frac{1}{T_i(t)}\sum_{n \in \mathcal{T}_i(t)} (l_{i,n} - \hat\mu_i(t))^2$, the variance estimate of the mean gap between two arms $\hat\sigma_{i,j}^2(t)$ (defined only when $\mathcal{T}_i(t) = \mathcal{T}_j(t)$), and the bound of the reward range $C$. Subscripts and superscripts may be dropped for notational simplicity when the meaning is clear from the context.

### 3.2 Adapted lil'UCB for a Finite Population

We first study one of the state-of-the-art algorithms for the fixed-confidence optimal arm identification problem and adjust it for the finite population setting. The lil'UCB algorithm (Jamieson et al., 2014) maintains, for every arm, an upper confidence bound (UCB) of $\mu_i$ inspired by the law of the iterated logarithm (LIL). At each iteration, it draws a single sample from the arm with the highest bound and updates that bound. The algorithm terminates when some arm is sampled much more often than all the other arms. We refer readers to Fig. 1 of Jamieson et al. (2014) for details. The time complexity grows linearly with the number of iterations. It was shown in Jamieson et al. (2014) that lil'UCB achieves the optimal sample complexity up to constants.

However, lil'UCB requires i.i.d. rewards for each arm $i$, that is, rewards sampled with replacement from $L_i$. Therefore, the total number of samples is unbounded and can be arbitrarily large when the means are close to each other. We adapt lil'UCB to our problem with the following modifications:

1. Sample without replacement for each arm, but keep different arms independent.

2. When $T_i(t) = N$ for some arm $i$, the estimate $\hat\mu_i$ becomes exact, so set that arm's UCB to $\hat\mu_i$.

3. The algorithm terminates either with the original stopping criterion or when the arm with the highest upper bound has an exact mean estimate, whichever comes first.
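A much-simplified sketch may help fix ideas. For brevity it replaces the LIL confidence width of Jamieson et al. (2014) with a plain Hoeffding-style width and keeps only the exact-mean stopping rule, so it illustrates the three modifications rather than reproducing the actual adapted lil'UCB:

```python
import numpy as np

def adapted_ucb_best_arm(rewards, delta, rng):
    """Toy UCB best-arm search on a finite population (illustrative only).

    rewards: (K, N) array, one finite reward population per arm.
    Each arm is sampled WITHOUT replacement; an arm with all N rewards
    seen gets a zero-width (exact) bound; the search stops when the
    leading arm's mean is exact. Total pulls never exceed K * N.
    """
    K, N = rewards.shape
    order = [rng.permutation(N) for _ in range(K)]   # w/o-replacement streams
    T = np.ones(K, dtype=int)                        # one initial pull per arm
    sums = np.array([rewards[i, order[i][0]] for i in range(K)], dtype=float)
    while True:
        mu = sums / T
        width = np.sqrt(np.log(1.0 / delta) / (2 * T))   # Hoeffding-style width
        width[T == N] = 0.0                          # exhausted arm: exact mean
        ucb = mu + width
        leader = int(np.argmax(ucb))
        if T[leader] == N:                           # leading arm is exact: stop
            return leader
        T[leader] += 1                               # pull the leading arm once
        sums[leader] += rewards[leader, order[leader][T[leader] - 1]]
```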

The adapted algorithm satisfies all the theoretical guarantees in Thm. 2 of Jamieson et al. (2014), with additional properties shown in the following proposition, proved in Appx. A.2. Theorem 2 of Jamieson et al. (2014) holds for the adapted lil'UCB algorithm. Moreover, $T_i(t) \le N$ for every arm $i$. Therefore, when the algorithm terminates, the total number of samples is at most $N|\mathcal{X}|$. Notice that Thm. 2 of Jamieson et al. (2014) shows that $T_i$ scales roughly as $1/\Delta_i^2$, with $\Delta_i$ the mean gap to the best arm, so $T_i \ll N$ when the gap is large.

### 3.3 Racing Algorithm for a Finite Population

When rewards are sampled without replacement, the negative correlation between rewards generally improves the convergence of $\hat\mu_i$. Unfortunately, the bound in lil'UCB ignores this negative correlation whenever $T_i(t) < N$, even with the adaptations above. We introduce a family of racing algorithms (Maron & Moore, 1994) that takes advantage of the finite population setting, as shown in Alg. 1. The choice of the uncertainty bound function $G$ differentiates specific algorithms, and two examples are discussed in the following sections.

Alg. 1 maintains a candidate set initialized with all arms. At iteration $t$, a shared mini-batch of $m(t)$ reward indices is drawn without replacement for all surviving arms in the candidate set. Then the uncertainty bound $G$ is used to eliminate sub-optimal arms with a given confidence. The algorithm stops when only one arm remains. We require of $m(t)$ that the total number of sampled indices equal $N$ at the last iteration $t^*$. In particular, we take a doubling schedule $m(t) = 2^{t-1} m(1)$ (so $t^* = O(\log N)$) and leave the first mini-batch size $m(1)$ as a free parameter. We also require $G = 0$ whenever $T(t) = N$, so that Alg. 1 always stops within $t^*$ iterations. The computational cost of the bound per iteration is linear in the number of surviving arms with the marginal estimate $\hat\sigma_i$, and quadratic with the pairwise estimate $\hat\sigma_{i,j}$. The former version is more efficient than the latter when $|\mathcal{X}|$ is large, at the price of a looser bound.

If $G$ satisfies

$$\mathcal{E} \overset{\text{def}}{=} P\Big(\exists t: \; |\hat\mu(t) - \mu| > G\big(\delta, T(t), \hat\sigma(t), C\big)\Big) \le \delta, \tag{8}$$

for every arm, then with the required confidence Alg. 1 returns the optimal arm with at most $N$ samples per arm. The proof is provided in Appx. A.3. Unlike adapted lil'UCB, Racing draws a shared set of sample indices among all the arms and can provide a tighter bound with pairwise variance estimates when there is positive correlation between arms, a typical case in Bayesian inference problems.
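The skeleton of Alg. 1 can be sketched as follows; the uncertainty bound `G` is supplied by the caller (e.g. the EBS or normal bound of the next sections), and all names are our own illustrative choices:

```python
import numpy as np

def racing(rewards, delta, G, m_schedule, rng):
    """Skeleton of the racing algorithm (Alg. 1) for a finite population.

    All surviving arms share ONE without-replacement stream of reward
    indices. Each round, G(delta, T, sigma_hat) eliminates arms whose
    upper bound falls below the current best arm's lower bound. G is
    forced to 0 once T == N (means are then exact), so the race ends.
    """
    K, N = rewards.shape
    perm = rng.permutation(N)                 # shared index stream
    alive = list(range(K))
    for m in m_schedule:                      # e.g. doubling, ending at N
        T = min(m, N)
        sub = rewards[:, perm[:T]]
        mu = sub.mean(axis=1)
        sg = sub.std(axis=1)
        bound = {a: (0.0 if T == N else G(delta, T, sg[a])) for a in alive}
        best = max(alive, key=lambda a: mu[a])
        # Keep an arm only if its upper bound reaches the best lower bound.
        alive = [a for a in alive
                 if mu[a] + bound[a] >= mu[best] - bound[best]]
        if len(alive) == 1:
            return alive[0]
    return max(alive, key=lambda a: mu[a])
```

This sketch uses the marginal variance estimate; the pairwise variant would compute a bound per arm pair from the shared indices.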

#### 3.3.1 Racing with Serfling Concentration Bounds for G

Serfling (1974) studied concentration inequalities for sampling without replacement and obtained an improved Hoeffding bound. Bardenet & Maillard (2013) extended this work and provided an empirical Bernstein-Serfling bound that was later used in the subsampling-based MH algorithm of Bardenet et al. (2014): for any $n \le N$ and any $\delta \in (0,1)$, with probability $1-\delta$, it holds that

$$\hat\mu_n - \mu \le \hat\sigma_n \sqrt{\frac{2 \rho_n \log(5/\delta)}{n}} + \frac{\kappa\, C \log(5/\delta)}{n} \overset{\text{def}}{=} B_{EBS}(\delta, n, \hat\sigma_n, C) \tag{9, 10}$$

where $\rho_n = 1 - \frac{n-1}{N}$ if $n \le N/2$ and $\rho_n = (1 - \frac{n}{N})(1 + \frac{1}{n})$ otherwise, and $\kappa = \frac{7}{3} + \frac{3}{\sqrt{2}}$. The factor $\rho_n$, which is missing in regular empirical Bernstein bounds, reduces the bound significantly when $n$ is close to $N$. In Alg. 1 we divide the error budget evenly over the $t^* - 1$ iterations with a non-zero bound, which provides a valid bound for any $t$, and set the uncertainty bound with the empirical Bernstein-Serfling (EBS) bound as

$$G_{EBS}(\delta, T, \hat\sigma, C) = B_{EBS}\!\left(\frac{\delta}{t^* - 1}, T, \hat\sigma, C\right). \tag{11}$$

It is then straightforward to show that $G_{EBS}$ satisfies the condition in Eq. 8 using a union bound over $t$.
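The EBS bound itself is cheap to compute. The following sketch implements the bound of Eq. 9 with the $\rho_n$ and $\kappa$ given above (illustrative code, not from the paper):

```python
import numpy as np

def b_ebs(delta, n, N, sigma_hat, C):
    """Empirical Bernstein-Serfling bound (Bardenet & Maillard, 2013):
    with prob. >= 1 - delta,
      mu_hat_n - mu <= sigma_hat * sqrt(2 * rho_n * log(5/delta) / n)
                       + kappa * C * log(5/delta) / n,
    where rho_n shrinks as n approaches the population size N."""
    if n <= N / 2:
        rho = 1.0 - (n - 1) / N
    else:
        rho = (1.0 - n / N) * (1.0 + 1.0 / n)
    kappa = 7.0 / 3.0 + 3.0 / np.sqrt(2.0)
    L = np.log(5.0 / delta)
    return sigma_hat * np.sqrt(2.0 * rho * L / n) + kappa * C * L / n
```

At $n = N$ the variance term vanishes ($\rho_N = 0$), leaving only the small range term, which reflects that the mean is then exact up to the bound's slack.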

#### 3.3.2 Racing with a Normal Assumption for G

The concentration bounds above often give a conservative strategy, as they hold for arbitrary bounded reward distributions. When the number of drawn samples is large, the central limit theorem suggests that $\hat\mu$ approximately follows a Gaussian distribution.

Korattikara et al. (2014) made such an assumption and obtained a tighter bound. We first provide an immediate corollary of Prop. 2 in Appx. A of Korattikara et al. (2014). Let $\hat\mu(t)$ be the sequence of estimated means obtained by sampling without replacement from any finite population with mean $\mu$ and unit variance. The joint normal random variables $\tilde\mu(t)$ that match the mean and covariance matrix of $\hat\mu(t)$ follow a Gaussian random walk process as

$$p_\mu\big(\tilde\mu(t) \mid \tilde\mu(1), \dots, \tilde\mu(t-1)\big) = \mathcal{N}\big(m_t(\tilde\mu(t-1)),\, S_t\big) \tag{12}$$

where $m_t$ and $S_t$ are the conditional mean and variance determined by matching the first two moments, with $T(t)$ the number of samples drawn up to iteration $t$. The marginal distribution is $\tilde\mu(t) \sim \mathcal{N}\big(\mu, \frac{1}{T(t)}\big(1 - \frac{T(t)-1}{N-1}\big)\big)$, whose variance approaches 0 as $T(t) \to N$. When $T(t)$ is large, we assume the central limit theorem applies, so that the joint distribution of the estimates $\hat\mu(1), \dots, \hat\mu(t)$ can be approximated by the joint distribution of $\tilde\mu(1), \dots, \tilde\mu(t)$. With this normal assumption, we choose the uncertainty bound in the following form:

$$G_{Normal}(\delta, T, \hat\sigma) = \frac{\hat\sigma}{\sqrt{T}} \left(1 - \frac{T-1}{N-1}\right)^{1/2} B_{Normal} \tag{13}$$

Intuitively, we use a constant confidence level $\Phi(B_{Normal})$ for all marginal distributions of $\tilde\mu(t)$ over $t$, where $\Phi$ is the CDF of the standard normal. To choose the constant $B_{Normal}$, we plug $G_{Normal}$ into the condition for $G$ in Eq. 8 and apply the normal random walk (12) to solve the univariate equation $\mathcal{E}(B_{Normal}) = \delta$. This way of computing $B_{Normal}$ gives a tighter bound than applying the union bound across $t$ as in the previous section, because it takes into account the correlation of the mean estimates across iterations. Appx. B provides a lookup table and a plot of $B_{Normal}$ as a function of $\delta$. Notice that $B_{Normal}$ only needs to be computed once, and we can obtain it for any $\delta$ by either interpolating the table or computing it numerically with code to be shared (runtime under a second). For the first mini-batch size $m(1)$, a small fixed value performs robustly in all experiments.
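Given a precomputed constant $B_{Normal}$ (from the lookup table in Appx. B; treated here simply as an input), the bound of Eq. 13 is a one-liner (illustrative code):

```python
import numpy as np

def g_normal(T, N, sigma_hat, B_normal):
    """Uncertainty bound under the normal assumption (Eq. 13).

    The finite-population correction (1 - (T - 1) / (N - 1)) drives the
    bound to exactly 0 at T == N, where the mean estimate is exact.
    B_normal is the constant solved once from Eq. 8 via the random walk.
    """
    return sigma_hat / np.sqrt(T) * np.sqrt(1.0 - (T - 1) / (N - 1)) * B_normal
```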

We provide the sample complexity below, with the proof in Appx. A.4. In particular, $T^*(\Delta)$ decreases toward its minimum as $\Delta \to \infty$, and approaches the total population size as $\Delta \to 0$.

Let $x^\star$ be the best arm and $\Delta$ be the minimal normalized gap of means from the other arms, defined with the marginal variance estimate $\hat\sigma_i$ or with the pairwise variance estimate $\hat\sigma_{i,j}$ accordingly. If the normal assumption above holds, then with a probability at least $1-\delta$, Racing-Normal draws no more rewards than

$$T^*(\Delta) = D \left\lceil \frac{N}{\frac{(N-1)\Delta^2}{4 B_{Normal}^2(\delta/D')} + 1} \right\rceil m(1) \tag{14}$$

where $D$ and $D'$ are constants that depend on the schedule and on whether the marginal estimate $\hat\sigma_i$ or the pairwise estimate $\hat\sigma_{i,j}$ is used.

### 3.4 Variance Reduction for Random Rewards with Control Variates

The difficulty of a MAB problem depends heavily on the ratio of the mean gap to the reward noise, $\Delta$. To improve this signal-to-noise ratio, we exploit the control variates technique (Wilson, 1984) to reduce the reward variance. Consider a variable $h_{i,n}$ whose expectation over $n$ can be computed efficiently. The residual reward $l_{i,n} - h_{i,n} + \mathbb{E}_n[h_{i,n}]$ has the same mean as $l_{i,n}$, and its variance is reduced whenever $2\,\mathrm{Cov}(l, h) > \mathrm{Var}(h)$. In the Bayesian inference experiment, where the factor is $f_n(x) = p(\mathbf{y}_n \mid x)$, we adopt a similar approach to Wang et al. (2013) and take the Taylor expansion of $l_{i,n}$ around a reference point $\hat{\mathbf{y}}$ as

$$l_{i,n} \approx l_i(\hat{\mathbf{y}}) + \mathbf{g}_i^{\mathsf T}(\mathbf{y}_n - \hat{\mathbf{y}}) + \tfrac{1}{2}(\mathbf{y}_n - \hat{\mathbf{y}})^{\mathsf T} H_i (\mathbf{y}_n - \hat{\mathbf{y}}) \overset{\text{def}}{=} h_{i,n} \tag{15}$$

where $\mathbf{g}_i$ and $H_i$ are the gradient and Hessian matrix of $l_i(\mathbf{y})$, respectively, evaluated at $\hat{\mathbf{y}}$. The expectation $\mathbb{E}_n[h_{i,n}]$ can be computed analytically from the first two moments of $\mathbf{y}_n$. A typical choice of the reference point $\hat{\mathbf{y}}$ is the sample mean of the observations.

The control variate method is mostly useful for Racing-Normal. For algorithms depending on a reward bound, obtaining a tight bound $C$ for the residual rewards requires a more restrictive condition, as in Bardenet et al. (2015), and we might end up with an even more conservative strategy in general cases.
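A sketch of the residual-reward construction of Eq. 15 follows (all names illustrative; for simplicity the expectation of $h$ is estimated here by its sample mean rather than computed analytically from the moments of $\mathbf{y}$):

```python
import numpy as np

def control_variate_rewards(l_fn, grad, hess, y, y_ref):
    """Second-order Taylor control variate (Eq. 15):
      h_n = l(y_ref) + g^T (y_n - y_ref) + 0.5 (y_n - y_ref)^T H (y_n - y_ref).
    Returns the residual rewards l_n - h_n (to which E[h] would be added
    back) and the sample mean of h as a stand-in for E[h]."""
    d = y - y_ref                                 # (N, D) deviations
    h = (l_fn(y_ref)
         + d @ grad
         + 0.5 * np.einsum('nd,de,ne->n', d, hess, d))
    l = np.array([l_fn(yn) for yn in y])
    return l - h, h.mean()
```

When $l$ is exactly quadratic, the residuals vanish entirely; in general their variance shrinks as the Taylor expansion becomes accurate near $\hat{\mathbf{y}}$.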

## 4 Related Work

The Gumbel-Max trick has been exploited in Kuzmin & Warmuth (2005); Papandreou & Yuille (2011); Maddison et al. (2014) for different problems. The closest work is Maddison et al. (2014), where this trick is extended to draw continuous random variables with a Gumbel process, reminiscent of adaptive rejection sampling.

Our work is closely related to the optimal arm identification problem for MABs with a fixed confidence. This is, to our knowledge, the first work to consider MABs with a finite reward population, and the proposed algorithms tailored to this setting could be of interest beyond the discrete sampling problem. The normal assumption in Sec. 3.3.2 is similar to UCB-Normal in Auer et al. (2002), but the latter assumes a normal distribution for individual rewards and performs poorly when that assumption does not hold.

The bounds in Sec. 3.3 are based on the subsampling-based MH algorithms in Bardenet et al. (2014); Korattikara et al. (2014). The proposed algorithm extends those ideas from MH to discrete sampling. In fact, let $x$ and $x'$ be the current and proposed values in an MH iteration; Racing-EBS and Racing-Normal reduce to the algorithms in Bardenet et al. (2014) and Korattikara et al. (2014), respectively, if we set

$$\mathcal{X} = \{x, x'\}, \quad f_0(1) = u\, p(x)\, q(x' \mid x), \tag{16}$$
$$f_0(2) = p(x')\, q(x \mid x'), \quad f_n(x) = p(\mathbf{y}_n \mid x) \tag{17}$$

where $p(\cdot)$ is the prior distribution, $u \sim \mathrm{Uniform}[0,1]$, and $q(\cdot \mid \cdot)$ is the proposal distribution. The difference with Bardenet et al. (2014) is that we distribute the error $\delta$ evenly across iterations in Eq. 11, while Bardenet et al. (2014) set the per-iteration error with a free parameter. The differences with Korattikara et al. (2014) are that we take a doubling schedule for the mini-batch sizes and replace the t-test with the normal assumption. We find that our algorithms are more efficient and robust than both original algorithms in practice. Moreover, the binary Gibbs sampling in Appx. F of Korattikara et al. (2014) is also a special case of Racing-Normal with $|\mathcal{X}| = 2$. Therefore, Alg. 1 provides a unifying approach to a family of subsampling-based samplers.

The variance reduction technique is similar to the proxies in Bardenet et al. (2015), but the control variate here is a function in the data space, while the proxy in the latter is a function in the parameter space. We do not assume the posterior distribution is approximately Gaussian, and our algorithm works with multi-modal distributions.

It is important not to confuse the focus of our algorithm, the big-$N$ problem in Eq. 1, with other algorithms that address sampling from a large state space (big $|\mathcal{X}|$) or, similarly, a high-dimensional vector of discrete variables (exponentially large $|\mathcal{X}|$). The combination of these two approaches for problems with both a big $N$ and a big $|\mathcal{X}|$ is possible, but beyond the scope of this paper.

## 5 Experiments

Since this is the first work to discuss efficient discrete sampling for problem (1), we compare the adapted lil’UCB, Racing-EBS, Racing-Normal with the exact sampler only. We report the result of Racing-Normal in real data experiments only as the speed gains of the other two are marginal.

### 5.1 Synthetic Data

We construct a distribution by sampling the rewards $l_{i,n}$ for each state from one of three distributions: normal, uniform, and log-normal. We normalize the rewards to obtain the fixed distribution in Fig. 1(a) and a reward variance $\sigma^2$ that controls the difficulty. The normal distribution is the ideal setting for Racing-Normal, and the uniform distribution is favorable for adapted lil'UCB and Racing-EBS, as the reward bound $C$ is close to the reward standard deviation. The log-normal distribution, whose excess kurtosis is large, is difficult for all methods due to the heavy tail. We use a tight bound $C$ for Racing-EBS. We set the scale parameter of adapted lil'UCB and its other parameters with the heuristic setting in Jamieson et al. (2014). Racing uses the pairwise variance estimate.

Fig. 1(b) and the following panels show the empirical error of best arm identification, obtained by drawing repeated samples for each setting while varying the target error bound $\delta$. The bound appears very loose for lil'UCB and Racing-EBS, but is sharp for Racing-Normal when the noise is large (Fig. 1(b)). This is consistent with the direct comparison of the uncertainty bounds in Fig. 1(e). Consequently, given the same error tolerance, Racing-Normal requires far fewer rewards than the other, more conservative strategies in all but the easiest settings, as shown in Fig. 1(f) and the following panels. We verify these observations with more experiments in Appx. C.1, including with the marginal estimate $\hat\sigma_i$.

Surprisingly, Racing-Normal performs robustly regardless of the reward distribution, even with a small first mini-batch size $m(1)$, while it was shown in Bardenet et al. (2014) that the algorithm with the same normal assumption in Korattikara et al. (2014) failed on log-normal data. The dramatic improvement in robustness is mainly due to our doubling scheme, under which the central limit theorem applies quickly as $m(t)$ increases exponentially. We do not claim that this single trick solves the problem completely: in theory there still exist cases with extremely heavy-tailed reward distributions where our normal assumption does not hold and the algorithm will fail to meet the confidence level. In practice, we did not observe this pathological case in any of the experiments.

Figure 2: Bayesian ARCH model selection. Solid: exact; dashed: approximate using Sub Gibbs + SGLD with Sub MH.

### 5.2 Bayesian ARCH Model Selection

We evaluate Racing-Normal on a Bayesian model selection problem for auto-regressive conditional heteroskedasticity (ARCH) models. The discrete sampler is integrated into the Markov chain as a building component to sample the hierarchical model. Specifically, we consider a mixture of ARCH models for the returns of a stock price series with Student-t innovations, each component with a different order $q$:

$$r_t = \sigma_t z_t, \quad z_t \overset{iid}{\sim} t_\nu(0, 1), \quad \sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i r_{t-i}^2, \tag{18}$$
$$q \sim \mathrm{Discrete}(\boldsymbol{\pi}), \quad \alpha_i, \nu \overset{iid}{\sim} \mathrm{Gamma}(1, 1) \tag{19}$$

where $\boldsymbol{\pi}$ is the prior distribution over the candidate models. The random variables to infer include the discrete model order $q$ and the continuous parameters $\{\alpha_i\}$ and $\nu$. We adopt the augmented MCMC algorithm in Carlin & Chib (1995) to avoid trans-dimensional moves. We apply subsampling-based scalable algorithms to sample all variables with subsampled observations: Racing-Normal Gibbs for $q$, and stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) corrected with Racing-Normal MH (Sec. 4) for $\{\alpha_i\}$ and $\nu$. We use adjusted priors, as suggested by Carlin & Chib (1995), for sufficient mixing between all models, and tune them with adaptive MCMC. The adjusted posterior is then close to uniform, and the tuned adjustment provides an estimate of the real unnormalized posterior over $q$. Control variates are also applied to reduce variance. Details of the sampling algorithm are provided in Appx. C.2.

We apply the model to the 5-minute Shanghai stock exchange composite index over one year, consisting of about 13,000 data points (Fig. 2(a)); the settings of the error tolerance and mini-batch size are given in Appx. C.2. The control variate method reduces the reward variance by 2-3 orders of magnitude. Fig. 2(b) shows the estimated log-posterior of $q$, obtained by normalizing in the adaptive MCMC, as a function of the number of likelihood evaluations (proportional to runtime). The subsampling-based sampler (Sub) converges about three times faster. We then fix the adjusted priors, for a fixed stationary distribution, and run MCMC to compare Sub with the exact sampler. The empirical error rates for Racing-Normal Gibbs and MH both stay small. Fig. 2(c) shows the estimated adjusted posterior over 5 runs, and the last panel compares the auto-correlation of the samples. Sub obtains over twice the effective sample size without noticeable bias after the burn-in period.

### 5.3 Author Coreference

We then study the performance in a large-scale graphical model inference problem. The author coreference problem for a database of scientific paper citations is to cluster the mentions of authors into real persons. Singh et al. (2012) addressed this problem with a conditional random field model with pairwise factors. The joint and conditional distributions are respectively

$$p_{\boldsymbol\theta}(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big(\sum_{y_i = y_j,\, i \neq j,\, \forall i, j} f_{\boldsymbol\theta}(x_i, x_j)\Big), \tag{20}$$
$$p_{\boldsymbol\theta}(Y_i = y_i \mid \mathbf{y}_{-i}, \mathbf{x}) \propto \exp\Big(\sum_{y_j \in C_y = \{j:\, y_j = y,\, j \neq i\}} f_{\boldsymbol\theta}(x_i, x_j)\Big) \tag{21}$$

where $\mathbf{x}$ is the set of observed author mentions and $y_i$ is the cluster index of the $i$'th mention. The factor $f_{\boldsymbol\theta}(x_i, x_j)$ measures the similarity between two mentions based on author names, coauthors, paper title, etc., parameterized by $\boldsymbol\theta$. In the conditional distribution, $Y_i$ can take the value of any non-empty cluster or of a new, empty cluster index. When a cluster contains a lot of mentions, a typical case for common author names, the number of factors to be evaluated is large. We consider the MAP inference problem with $\boldsymbol\theta$ fixed, using annealed Gibbs sampling (Finkel et al., 2005). We apply Racing-Normal to sample $y_i$ by subsampling the factors for each candidate value $y$. An important difference of this problem from Eq. 1 is that the number of factors $N_y = |C_y|$ varies across candidate values and has a heavy-tailed distribution. We let the mini-batch size depend on $N_y$, with details provided in Appx. C.3.

We run the experiment on the union of an unlabeled DBLP dataset of BibTex entries with about 5M authors and a Rexa corpus of about 11K author mentions, of which 3160 entries are labeled. We monitor the clustering performance on the labeled subset with the F-1 score (Bagga & Baldwin, 1998). The empirical error rate stays well below the target tolerance. The number of candidate values and the cluster sizes vary over a wide range as the clustering converges. Fig. 3(a) shows the F-1 score as a function of the number of factor evaluations, with 7 random runs for each algorithm. Sub Gibbs converges about three times faster than exact Gibbs. Fig. 3(b) shows F-1 as a function of iterations, with almost identical behavior for both algorithms, which suggests negligible bias in Sub Gibbs. The relative number of evaluated factors of Sub to exact Gibbs indicates about a 5-fold speed-up near convergence. The initial speed-up is small because every cluster is initialized with a single mention, i.e. $N_y = 1$.

## 6 Discussion

We considered the discrete sampling problem with a high degree of dependency and proposed three approximate algorithms under the framework of MABs, with theoretical guarantees. The racing algorithm provides a unifying approach to various subsampling-based Monte Carlo algorithms and also improves the robustness of the original MH algorithm in Korattikara et al. (2014). This is also the first work to discuss MABs under the setting of a finite reward population.

Empirical evaluations show that Racing-Normal achieves the highest speed-up among all competitors while remaining robust. Whilst adapted lil'UCB shows inferior empirical performance to Racing-Normal, it has a better sample complexity w.r.t. the number of arms $|\mathcal{X}|$. It will be a future direction to combine the bound of Racing-Normal with other MAB algorithms, including lil'UCB, for better scalability in $|\mathcal{X}|$. Another important problem is how to relax the assumptions of Racing-Normal without sacrificing performance.

It would also be an interesting direction to extend our work to draw continuous random variables efficiently with the Gumbel process (Maddison et al., 2014). In a continuous state space there are infinitely many "arms", and a naive application of our algorithm would lead to an infinitely large error bound. This problem could be alleviated with algorithms for contextual MAB problems.

## Acknowledgements

We thank Matt Hoffman for helpful discussions on the connection of our work to the MAB problems. We also thank all the reviewers for their constructive comments. We acknowledge funding from the Alan Turing Institute, Google, Microsoft Research and EPSRC Grant EP/I036575/1.

## References

• Ahn et al. (2012) Ahn, Sungjin, Korattikara, Anoop, and Welling, Max. Bayesian posterior sampling via stochastic gradient fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1591–1598, 2012.
• Ahn et al. (2015) Ahn, Sungjin, Korattikara, Anoop, Liu, Nathan, Rajan, Suju, and Welling, Max. Large-scale distributed bayesian matrix factorization using stochastic gradient mcmc. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 9–18. ACM, 2015.
• Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
• Bagga & Baldwin (1998) Bagga, Amit and Baldwin, Breck. Algorithms for scoring coreference chains. In LREC workshop on linguistics coreference, volume 1, pp. 563–566, 1998.
• Bardenet & Maillard (2013) Bardenet, Rémi and Maillard, Odalric-Ambrym. Concentration inequalities for sampling without replacement. arXiv preprint arXiv:1309.4029, 2013.
• Bardenet et al. (2014) Bardenet, Rémi, Doucet, Arnaud, and Holmes, Chris. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In Proceedings of The 31st International Conference on Machine Learning, pp. 405–413, 2014.
• Bardenet et al. (2015) Bardenet, Rémi, Doucet, Arnaud, and Holmes, Chris. On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827, 2015.
• Bechhofer (1958) Bechhofer, Robert E. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics, 14(3):408–429, 1958.
• Bottou & Bousquet (2008) Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In NIPS, volume 20, pp. 161–168, 2008.
• Bratières et al. (2010) Bratières, S., van Gael, J., Vlachos, A., and Ghahramani, Z. Scaling the iHMM: Parallelization versus Hadoop. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, pp. 1235–1240, June 2010.
• Carlin & Chib (1995) Carlin, B.P. and Chib, S. Bayesian model choice via Markov chain Monte Carlo. Journal of the Royal Statistical Society, Series B, 57:473–484, 1995.
• Chen et al. (2014) Chen, Tianqi, Fox, Emily, and Guestrin, Carlos. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1683–1691, 2014.
• Ding et al. (2014) Ding, Nan, Fang, Youhan, Babbush, Ryan, Chen, Changyou, Skeel, Robert D, and Neven, Hartmut. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211, 2014.
• DuBois et al. (2014) DuBois, Christopher, Korattikara, Anoop, Welling, Max, and Smyth, Padhraic. Approximate slice sampling for Bayesian posterior inference. In Proceedings of AISTATS, pp. 185–193, 2014.
• Finkel et al. (2005) Finkel, Jenny Rose, Grenager, Trond, and Manning, Christopher. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd ACL, pp. 363–370, 2005.
• Hoeffding (1963) Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
• Jamieson et al. (2014) Jamieson, Kevin, Malloy, Matthew, Nowak, Robert, and Bubeck, Sébastien. lil’UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th COLT, pp. 423–439, 2014.
• Kalli et al. (2011) Kalli, Maria, Griffin, Jim E, and Walker, Stephen G. Slice sampling mixture models. Statistics and computing, 21(1):93–105, 2011.
• Korattikara et al. (2014) Korattikara, Anoop, Chen, Yutian, and Welling, Max. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 181–189, 2014.
• Kuzmin & Warmuth (2005) Kuzmin, Dima and Warmuth, Manfred K. Optimum follow the leader algorithm. In Proceedings of the 18th annual conference on Learning Theory, pp. 684–686. Springer-Verlag, 2005.
• Li et al. (2014) Li, Aaron, Ahmed, Amr, Ravi, Sujith, and Smola, Alex. Reducing the sampling complexity of topic models. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2014.
• Maclaurin & Adams (2015) Maclaurin, Dougal and Adams, Ryan Prescott. Firefly Monte Carlo: Exact MCMC with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
• Maddison et al. (2014) Maddison, Chris J, Tarlow, Daniel, and Minka, Tom. A* sampling. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds.), NIPS, pp. 3086–3094, 2014.
• Maron & Moore (1994) Maron, Oded and Moore, Andrew W. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6, pp. 59–66. Morgan-Kaufmann, 1994.
• Medlar et al. (2013) Medlar, Alan, Głowacka, Dorota, Stanescu, Horia, Bryson, Kevin, and Kleta, Robert. SwiftLink: parallel MCMC linkage analysis using multicore CPU and GPU. Bioinformatics, 29(4):413–419, 2013.
• Mitrophanov (2005) Mitrophanov, A Yu. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, pp. 1003–1014, 2005.
• Newman et al. (2009) Newman, David, Asuncion, Arthur, Smyth, Padhraic, and Welling, Max. Distributed algorithms for topic models. The Journal of Machine Learning Research, 10:1801–1828, 2009.
• Papandreou & Yuille (2011) Papandreou, G. and Yuille, A. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proceedings of ICCV, pp. 193–200, Barcelona, Spain, November 2011.
• Paulson (1964) Paulson, Edward. A sequential procedure for selecting the population with the largest mean from k normal populations. The Annals of Mathematical Statistics, pp. 174–180, 1964.
• Pillai & Smith (2014) Pillai, Natesh S and Smith, Aaron. Ergodicity of approximate MCMC chains with applications to large data sets. arXiv preprint arXiv:1405.0182, 2014.
• Scott et al. (2013) Scott, Steven L, Blocker, Alexander W, Bonassi, Fernando V, Chipman, H, George, E, and McCulloch, R. Bayes and big data: The consensus Monte Carlo algorithm. EFaBBayes 250 conference, 16, 2013.
• Serfling (1974) Serfling, R. J. Probability inequalities for the sum in sampling without replacement. Ann. Statist., 2(1):39–48, 1974.
• Singh et al. (2012) Singh, Sameer, Wick, Michael, and McCallum, Andrew. Monte Carlo MCMC: efficient inference by approximate sampling. In Proceedings of EMNLP-CoNLL 2012, pp. 1104–1113, 2012.
• Wang et al. (2013) Wang, Chong, Chen, Xi, Smola, Alex J, and Xing, Eric P. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pp. 181–189, 2013.
• Welling & Teh (2011) Welling, Max and Teh, Yee W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of ICML 2011, pp. 681–688, 2011.
• Wilson (1984) Wilson, James R. Variance reduction techniques for digital simulation. American Journal of Mathematical and Management Sciences, 4(3-4):277–312, 1984.
• Wu et al. (2011) Wu, Yao, Yan, Qiang, Bickson, Danny, Low, Yucheng, and Yang, Qing. Efficient multicore collaborative filtering. In ACM KDD CUP workshop, 2011.
• Xu et al. (2014) Xu, M., Teh, Y. W., Zhu, J., and Zhang, B. Distributed context-aware bayesian posterior sampling via expectation propagation. In Advances in Neural Information Processing Systems, 2014.

## Appendix A Proofs

### A.1 Proof of Prop. 2.2

###### Proof.

For a discrete state space, the total variation distance is equal to half of the $L_1$ distance between the two probability vectors. Denote by $\hat p(X\mid\boldsymbol\epsilon)$ the distribution of the output of the approximate algorithm conditioned on the vector of Gumbel variables $\boldsymbol\epsilon$, and by $x(\boldsymbol\epsilon)$ the solution of Eq. 2 as a function of $\boldsymbol\epsilon$. According to the premise of Prop. 2.2, $\hat p(X=x(\boldsymbol\epsilon)\mid\boldsymbol\epsilon)\ge 1-\delta$ for every $\boldsymbol\epsilon$. We can bound the error of the conditional probability as

$$\begin{aligned}
&\sum_{i\in\mathcal X}\bigl|\hat p(X=i\mid\boldsymbol\epsilon)-\delta_{i,x(\boldsymbol\epsilon)}\bigr| && (22)\\
&\quad=\bigl|\hat p(X=x(\boldsymbol\epsilon)\mid\boldsymbol\epsilon)-1\bigr|+\sum_{i\neq x(\boldsymbol\epsilon)}\bigl|\hat p(X=i\mid\boldsymbol\epsilon)\bigr|\le 2\delta,\quad\forall\boldsymbol\epsilon && (23)
\end{aligned}$$

where $\delta_{i,x}$ denotes the Kronecker delta function. Then we can show

$$\begin{aligned}
\|\hat p(X)-p(X)\|_{TV} &= \tfrac{1}{2}\sum_{i\in\mathcal X}\bigl|\hat p(X=i)-p(X=i)\bigr| && (24\text{--}25)\\
&= \tfrac{1}{2}\sum_{i\in\mathcal X}\Bigl|\int_{\boldsymbol\epsilon}\bigl(\hat p(X=i\mid\boldsymbol\epsilon)-\delta_{i,x(\boldsymbol\epsilon)}\bigr)\,dP(\boldsymbol\epsilon)\Bigr| && (26)\\
&\le \tfrac{1}{2}\sum_{i\in\mathcal X}\int_{\boldsymbol\epsilon}\bigl|\hat p(X=i\mid\boldsymbol\epsilon)-\delta_{i,x(\boldsymbol\epsilon)}\bigr|\,dP(\boldsymbol\epsilon) && (27)\\
&= \tfrac{1}{2}\int_{\boldsymbol\epsilon}\Bigl(\sum_{i\in\mathcal X}\bigl|\hat p(X=i\mid\boldsymbol\epsilon)-\delta_{i,x(\boldsymbol\epsilon)}\bigr|\Bigr)\,dP(\boldsymbol\epsilon) && (28)\\
&\le \delta && (29)
\end{aligned}$$
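The argument can be checked numerically: corrupt the exact Gumbel-max output with probability at most $\delta$ for every noise draw and measure the total variation distance of the resulting marginal distribution. This self-contained sketch (all names and the cyclic-corruption choice are ours) illustrates the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # exact discrete distribution
delta = 0.05
n = 200_000

# Exact sampler via the Gumbel-max trick.
eps = rng.gumbel(size=(n, len(p)))
exact = np.argmax(np.log(p) + eps, axis=1)

# Approximate sampler: with probability delta it returns a wrong state,
# mimicking a per-epsilon failure probability of at most delta.
corrupt = rng.random(n) < delta
approx = np.where(corrupt, (exact + 1) % len(p), exact)

p_hat = np.bincount(approx, minlength=len(p)) / n
tv = 0.5 * np.abs(p_hat - p).sum()
print(tv)  # stays below delta, as the proposition predicts
```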

### A.2 Sketch of the proof of Prop. 3.2

###### Proof.

As the proof of this proposition is almost identical to that of Jamieson et al. (2014), we only outline the differences due to the adaptation. In the proof of Thm. 2 in Jamieson et al. (2014), the i.i.d. assumption for the rewards from each arm is used only in Lemma 3 to provide Chernoff's and Hoeffding's bounds. As noted in Sec. 6 of Hoeffding (1963), those bounds still hold when rewards are sampled from a finite population without replacement. Therefore, as long as no arm has exhausted its reward population, all the bounds hold for adapted lil'UCB.

When the entire reward population of an arm has been observed, the second modification sets the upper confidence bound of the mean estimate to the estimate itself. That is a valid upper bound of the true mean, in fact much tighter than the bound in the original algorithm, because the estimated mean equals the true mean exactly when the entire population is observed.

Therefore, under the same conditions, Theorem 2 in Jamieson et al. (2014) applies to adapted lil'UCB with modifications 1 and 2 only.

With the third modification, the number of samples drawn can never exceed the size of the reward population at the stopping time, which proves the second part of Prop. 3.2. The proof can then be concluded if we show that modification 3 does not change the output of adapted lil'UCB with the first two modifications only. This is true because, if we do not stop when the selected arm satisfies the condition of modification 3, we do not need to update its upper confidence bound: the estimated mean is already exact. Since no upper bound changes, that arm will always be chosen from now on, and eventually the original stopping criterion is met and the same arm is returned. ∎

### A.3 Proof of Prop. 3.3

###### Proof.

Denote by $x^{(t)}$ the arm with the highest estimated mean at iteration $t$, and by $x^*$ the optimal arm with the highest true mean, $x^*=\arg\max_i \mu_i$. If Alg. 1 does not stop in the first $t_{\max}-1$ iterations, the estimated means of all the surviving arms become exact at the last iteration $t_{\max}$, because we require the subsample to cover the entire population by then. Then $\hat\mu_i=\mu_i$ for every surviving arm $i$. As we also require the confidence interval $G$ to vanish once the entire population is observed, all the sub-optimal arms will be eliminated at the last iteration and the algorithm always returns the correct best arm. This proves the upper bound on the sample size.

Now, to prove the confidence level, all we need to show is that with probability at least $1-\delta$ arm $x^*$ survives all the iterations.

Let us first consider the case when Alg. 1 uses the marginal variance estimate $\hat\sigma^{(t)}_i$. Let the events

$$\begin{aligned}
E_i &= \Bigl\{\exists t:\ \bigl|\hat\mu^{(t)}_i-\mu_i\bigr|\ge G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_i,C_i\bigr)\Bigr\},\quad\forall i\neq x^* && (30)\\
E_{x^*} &= \Bigl\{\exists t:\ \bigl|\hat\mu^{(t)}_{x^*}-\mu_{x^*}\bigr|\ge G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_{x^*},C_{x^*}\bigr)\Bigr\} && (31)
\end{aligned}$$

Applying the condition in Eq. 8 and the union bound, we get $P\bigl(E_{x^*}\cup\bigcup_{i\neq x^*}E_i\bigr)\le\delta$. So with probability at least $1-\delta$, none of those events happens. In that case, for any iteration $t$,

$$\begin{aligned}
\hat\mu_x-\hat\mu_{x^*} &= (\hat\mu_x-\mu_x)-(\hat\mu_{x^*}-\mu_{x^*})+(\mu_x-\mu_{x^*}) && (32)\\
&< G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_x,C_x\bigr)+G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_{x^*},C_{x^*}\bigr)+(\mu_x-\mu_{x^*}) && (33)\\
&\le G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_x,C_x\bigr)+G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_{x^*},C_{x^*}\bigr) && (34)
\end{aligned}$$

So arm $x^*$ won't be eliminated at iteration $t$.

Similarly, for the case when Alg. 1 uses the pairwise variance estimate, let the events

$$E_{i,x^*}=\Bigl\{\exists t:\ \bigl|(\hat\mu^{(t)}_i-\hat\mu^{(t)}_{x^*})-(\mu_i-\mu_{x^*})\bigr|\ge G\bigl(\tfrac{\delta}{D-1},T^{(t)},\hat\sigma^{(t)}_i,C_i+C_{x^*}\bigr)\Bigr\},\quad\forall i\neq x^* \qquad (35)$$

Applying the condition in Eq. 8 and the union bound, we get $P\bigl(\bigcup_{i\neq x^*}E_{i,x^*}\bigr)\le\delta$. So with probability at least $1-\delta$, for any iteration $t$,

$$\begin{aligned}
\hat\mu_x-\hat\mu_{x^*} &= (\hat\mu_x-\hat\mu_{x^*})-(\mu_x-\mu_{x^*})+(\mu_x-\mu_{x^*}) && (36)\\
&\le G\bigl(\tfrac{\delta}{D-1},T^{(t)},\hat\sigma^{(t)}_x,C_x+C_{x^*}\bigr) && (37)
\end{aligned}$$

Therefore arm $x^*$ won't be eliminated at iteration $t$. ∎
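The elimination argument above can be illustrated with a minimal racing-style loop. This is our own simplified sketch, not Alg. 1 itself: the function name `race`, the Hoeffding-type radius, and the assumption of rewards bounded in $[0,1]$ are all ours. Arms share draws from a finite population without replacement, and an arm is eliminated when its mean estimate falls below the leader's by more than twice the confidence radius; the radius vanishes once the whole population is observed, so the best arm is always returned in the worst case.

```python
import numpy as np

def race(rewards, delta=0.05, batch=10, seed=0):
    """rewards: (D, N) array; row i holds the finite reward population of
    arm i, assumed bounded in [0, 1]. Returns the estimated best arm."""
    D, N = rewards.shape
    rng = np.random.default_rng(seed)
    order = rng.permutation(N)          # shared subsampling without replacement
    alive = set(range(D))
    n = 0
    while len(alive) > 1 and n < N:
        n = min(n + batch, N)
        est = rewards[:, order[:n]].mean(axis=1)
        # Hoeffding-type radius; it is zero once the whole population has been
        # observed, so every sub-optimal arm is eliminated by the last round.
        radius = 0.0 if n == N else np.sqrt(np.log(2 * D * N / delta) / (2 * n))
        leader = max(alive, key=lambda i: est[i])
        alive = {i for i in alive if est[leader] - est[i] <= 2 * radius}
    est = rewards[:, order[:max(n, 1)]].mean(axis=1)
    return max(alive, key=lambda i: est[i])

rng = np.random.default_rng(42)
pops = np.stack([rng.uniform(0.0, 0.2, 500),   # mean ~0.1
                 rng.uniform(0.8, 1.0, 500),   # mean ~0.9, the best arm
                 rng.uniform(0.2, 0.4, 500)])  # mean ~0.3
print(race(pops))
```

With well-separated means the loop terminates long before exhausting the 500-element populations, which is the source of the computational savings.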

### A.4 Proof of Prop. 3.3.2

###### Proof.

Denote by $x^{(t)}$ the arm with the highest estimated mean at iteration $t$. First consider the case when Alg. 1 uses the marginal variance estimate $\hat\sigma^{(t)}_i$. With the condition in Eq. 8, it follows that $P\bigl(E_{x^*}\cup\bigcup_{i\neq x^*}E_i\bigr)\le\delta$, where the events are defined in Eqs. 30 and 31. So with probability at least $1-\delta$,

$$\hat\mu^{(t)}_{x^*}-\hat\mu^{(t)}_i > \mu_{x^*}-\mu_i - G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_{x^*},C_{x^*}\bigr) - G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_i,C_i\bigr),\quad\forall i\neq x^* \qquad (38\text{--}39)$$

Alg. 1 will stop by iteration $t$ if the RHS of the inequality above satisfies the stopping criterion for all $i\neq x^*$, that is,

$$\mu_{x^*}-\mu_i > 2\Bigl(G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_{x^*},C_{x^*}\bigr)+G\bigl(\tfrac{\delta}{D},T^{(t)},\hat\sigma^{(t)}_i,C_i\bigr)\Bigr),\quad\forall i\neq x^* \qquad (40\text{--}41)$$