 # The OS* Algorithm: a Joint Approach to Exact Optimization and Sampling

Most current sampling algorithms for high-dimensional distributions are based on MCMC techniques and are approximate in the sense that they are valid only asymptotically. Rejection sampling, on the other hand, produces valid samples, but is unrealistically slow in high-dimension spaces. The OS* algorithm that we propose is a unified approach to exact optimization and sampling, based on incremental refinements of a functional upper bound, which combines ideas of adaptive rejection sampling and of A* optimization search. We show that the choice of the refinement can be done in a way that ensures tractability in high-dimension spaces, and we present first experiments in two different settings: inference in high-order HMMs and in large discrete graphical models.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Common algorithms for sampling high-dimensional distributions are based on MCMC techniques (Andrieu et al., 2003; Robert and Casella, 2004), which are approximate in the sense that they produce valid samples only asymptotically. By contrast, the elementary technique of Rejection Sampling (Robert and Casella, 2004) directly produces exact samples, but, if applied naively to high-dimensional spaces, typically requires unacceptable time before producing a first sample.

The algorithm that we propose, OS, is a joint exact Optimization and Sampling algorithm that is inspired both by rejection sampling and by classical A optimization, and which can be applied to high-dimensional spaces. The main idea is to upper-bound the complex target distribution by a simpler proposal distribution , such that a dynamic programming (or another low-complexity) method can be applied to in order to efficiently sample or maximize from it. In the case of sampling, rejection sampling is then applied to , and on a reject at point , is refined into a slightly more complex in an adaptive way. This is done by using the evidence of the reject at , implying a gap between and , to identify a (soft) constraint implicit in which is not accounted for by , and by integrating this constraint in to obtain .

The constraint which is integrated tends to be highly relevant and to increase the acceptance rate of the algorithm. By contrast, many constraints that are constitutive of are never ‘‘activated’’ by sampling from , because never explores regions where they would become visible. For example, anticipating on our HMM experiments in section 3.1, there is little point in explicitly including in a 5-gram constraint on a certain latent sequence in the HMM if this sequence is already unlikely at the bigram level: the bigram constraints present in the proposal will ensure that this sequence will never (or very rarely) be explored by .

The case of optimization is treated in exactly the same way as sampling. Formally, this consists in moving from assessing proposals in terms of the norm to assessing them in terms of the norm. Typically, when a dynamic programming procedure is available for sampling ( norm) with , it is also available for maximizing from ( norm), and the main difference between the two cases is then in the criteria for selecting among possible refinements.

#### Related work

In an heuristic optimization context the two interesting, but apparently little known, papers

(Kam and Kopec, 1996; Popat et al., 2001) , discuss a technique for decoding images based on high-order language models for which upper-bounds are constructed in terms of simpler variable-order models. Our application of OS in section 3.1 to the problem of maximizing a high-order HMM is similar to their technique; however (Kam and Kopec, 1996; Popat et al., 2001) do not attempt to generalize their approach to other optimization problems amenable to dynamic programming or discuss any connection to sampling.

In order to improve the acceptance rate of rejection sampling, one has to lower the proposal curve as much as possible while keeping it above the curve. In order to do that, some authors (Gilks and Wild, 1992; Gorur and Teh, 2008), have proposed Adaptive Rejection Sampling (ARS) where, based on rejections, the curve is updated to a lower curve with a better acceptance rate. These techniques have predominantly been applied to continuous distributions on the one-dimensional real line where convexity assumptions on the target distribution can be exploited to progressively approximate it tighter and tighter through upper bounds consisting of piecewise linear envelopes.

Also in the context of rejection sampling, (Mansinghka et al., 2009) considers the case of a probabilistic graphical model; it introduces an heuristically determined order of the variables in this model and uses this (fixed) order to define a sequence of exact samplers over an increasing set of variables, where the exact sampler over the first variables is recursively obtained by using the preceding exact sampler over the first variables and accepting or rejecting its samples based on the variable. While our experiments on graphical models in section 3.2 have some similarities to this approach, they do not use a cascade of exact samplers, but rather they partition the space of configurations dynamically based on rejects experienced by the current proposal.

## 2 The OS* algorithm

OS is a unified algorithm for optimization and sampling. For simplicity, we first present its sampling version, then move to its optimization version, and finally get to the unified view. We start with some background and notations about rejection sampling.

### 2.1 Background

Let be a measurable function with respect to a base measure on a space , i.e. . We define . The function can be seen as an unnormalized density over , and

as a normalized density which defines a probability distribution over

, called the target distribution, from which we want to sample from222By abuse of language, we will also say that a sample from is a sample from .. While we may not be able to sample directly from the target distribution , let us assume that we can easily compute for any given . Rejection Sampling (RS) (Robert and Casella, 2004) then works as follows. We define a certain unnormalized proposal density over , which is such that (i) we know how to directly sample from it (more precisely, from ), and (ii) dominates , i.e. for all . We then repeat the following process: (1) we draw a sample from , (2) we compute the ratio , (3) with probability we accept , otherwise we reject it, (4) we repeat the process with a new . It can then be shown that this procedure produces an exact sample from . Furthermore, the average rate at which it produces these samples, the acceptance rate, is equal to (Robert and Casella, 2004), where for a (measurable) subset of , we define and similarly with . In Fig. 1, panel (S1), the acceptance rate is equal to the ratio of the area below the curve with that below the curve.

### 2.2 Sampling with OS∗

The way OS does sampling is illustrated on the top of Fig. 1. In this illustration, we start sampling with an initial proposal density (see (S1)). Our first attempt produces , for which the ratio is close to ; this leads, say, to the acceptance of . Our second attempt produces , for which the ratio is much lower than , leading, say, to a rejection. Although we have a rejection, we have gained some useful information, namely that is much lower than , and we are going to use that ‘‘evidence’’ to define a new proposal (see (S2)), which has the following properties:

• One has everywhere on .

• One has .

One extreme way of obtaining such a is to take:

 q′(x)≡{p(x)if x=x2q(x)if x≠x2

which, when the space is discrete, has the effect of improving the acceptance rate, but only slightly so, by insuring that any time happens to select , it will accept it.

A better generic way to find a is the following. Suppose that we are provided with a small finite set of ‘‘one-step refinement actions" , depending on and , which are able to move from to a new such that for any such one has everywhere on and also . Then we will select among these moves the one that is such that the norm of is minimal among the possible ’s, or in other words, such that is minimal in . The idea there is that, by doing so, we will improve the acceptance rate of (which depends directly on ) as much as possible, while (i) not having to explore a too large space of possible refinements, and (ii) moving from a representation for to an only slightly more complex representation for , rather than to a much more complex representation for a that could result from exploring a larger space of possible refinements for .333In particular, even if we could find a refinement that would exactly coincide with , and therefore would have the smallest possible norm, we might not want to use such a refinement if this involved an overly complex representation for . The intuition behind such one-step refinement actions will become clearer when we consider concrete examples later in this paper. Figure 1: Sampling with OS∗ (S1, S2), and optimization with OS∗ (O1, O2).

### 2.3 Optimization with OS∗

The optimization version of OS is illustrated on the bottom of Fig. 1, where (O1) shows on the one hand the function that we are trying to maximize from, along with its (unknown) maximum , indicated by a black circle on the curve, and corresponding to in . It also shows a ‘‘proposal’’ function which is such --- analogously to the sampling case --- that (1) the function is above on the whole of the space and (2) it is easy to directly find the point in at which it reaches its maximum , shown as a black circle on the curve.

A simple, but central, observation is the following one. Suppose that the distance between and is smaller than , then the distance between and is also smaller than . This can be checked immediately on the figure, and is due to the fact that on the one hand is higher than , and that on the other hand it is below , and a fortiori below . In other words, if the maximum that we have found for is at a coordinate and we observe that , then we can conclude that we have found the maximum of up to .

In the case of in the figure, we are still far from the maximum, and so we ‘‘reject’’ , and refine into (see (O2)), using exactly the same approach as in the sampling case, but for one difference: the one-step refinement option that is selected is now chosen on the basis of how much it decreases, not the norm of , but the max of --- where, as a reminder, this max can also be notated , using the norm notation.444A formal definition of that norm is that is equal to the “essential supremum” of over (see below), but for all practical purposes here, it is enough to think of this essential supremum as being the max, when it exists.

Once this has been selected, one can then find its maximum at and then the process can be repeated with until the difference between and is smaller than a certain predefined threshold.

### 2.4 Sampling L1 vs. Optimization L∞

While sampling and optimization are usually seen as two completely distinct tasks, they can actually be viewed as two extremities of a continuous range, when considered in the context of spaces (Ash and Doléans-Dade, 1999).

If is a measure space, and if is a real-valued function on this space, one defines the norm , for as:

 ∥f∥p≡(∫X|f|p(x)dμ(x))1/p.

One also defines the the norm as:

 ∥f∥∞≡inf{C≥0:|f(x)|≤C for almost every x},

where the right term is called the essential supremum of , and can be thought of roughly as the ‘‘max’’ of the function. So, with some abuse of language, we can simply write: The space , for , is then defined as being the space of all functions for which .

Under the simple condition that for some , we have:

The standard notion of sampling is relative to . However we can introduce the following generalization --- where we use the notation instead of in order to avoid confusion with our use of for denoting the target distribution. We will say that we are performing sampling of a non-negative function relative to , for , if and if we sample --- in the standard sense --- according to the normalized density distribution . In the case , we will say that we are sampling relative to , if and if we are performing optimization relative to , more precisely, if for any , we are able to find an such that .

Informally, sampling relative to ‘‘tends’’ to sampling with (i.e. optimization), for tending to , in the sense that for a large , an sampled relative to ‘‘tends’’ to be close to a maximum for . We will not attempt to give a precise formal meaning to that observation here, but just note the connection with the idea of simulated annealing (Kirkpatrick et al., 1983), which we can view as a mix between the MCMC Metropolis-Hastings sampling technique (Robert and Casella, 2004) and the idea of sampling in spaces with larger and larger ’s.

In summary, we thus can view optimization as an extreme form of sampling. In the sequel we will often use this generalized sense of sampling in our algorithms.555Note: While our two experiments in section 3 are based on discrete spaces, the OS algorithm is more general, and can be applied to any measurable space (in particular continuous spaces); in such cases, and have to be measurable functions, and the relation should be read as a.e. (almost everywhere) relative to the base measure .

### 2.5 Os∗ as a unified algorithm

The general design of OS can be described as follows:

• Our goal is to OS-sample from , where we take the expression ‘‘OS-sample’’ to refer to a generalized sense that covers both sampling (in the standard sense) and optimization.

• We have at our disposal a family of proposal densities over the space , such that, for every , we are able to OS-sample efficiently from .

• Given a reject relative to a proposal , with , we have at our disposal a (limited) number of possible ‘‘one-step’’ refinement options , with , and such that .

• We then select one such . One possible selection criterion is to prefer the which has the smallest norm (sampling case) or norm (optimization). In one sense, this is the most natural criterion, as it means we are directly lowering the norm that controls the efficiency of the OS-sampling; for instance, for sampling, if and are two candidates refinements with , then the acceptance rate of is larger than that of , simply because then , and similarly, in optimization, if , then the gap between and is smaller than that between and , simply because then . However, using this criterion may require the computation of the norm of each of the possible one-step refinements, which can be costly, and one can prefer simpler criteria, for instance simply selecting the that minimizes .

• We iterate until we settle on a ‘‘good’’ : either (in sampling) one which is such that the cumulative acceptance rate until this point is above a certain threshold; or (in optimization) one for which the ratio is closer to than a certain threshold, with being the maximum for .

The following algorithm gives a unified view of OS, valid for both sampling and optimization. This is a high-level view, with some of the content delegated to the subroutines OS-Sample, Accept-or-Reject, Update, Refine, Stop, which are described in the text.

On entry into the algorithm, we assume that we are either in sample mode or in optimization mode, and also that we are starting from a proposal which (1) dominates and (2) from which we can sample or optimize directly. We use the terminology OS-Sample to represent either of these cases, where OS-Sample refers to sampling an according to the proposal or optimizing on (namely finding an which is an argmax of ), depending on the case. On line (1), refers to the history of the sampling so far, namely to the set of trials that have been done so far, each being marked for acceptance or rejection (in the case of sampling, this is the usual notion, in the case of optimization, all but the last proposal will be marked as rejects). The stopping criterion Stop() will be to stop: (i) in sampling mode, if the number of acceptances so far relative to the number of trials is larger than a certain predefined threshold, and in this case will return on line (8), first, the list of accepted ’s so far, which is already a valid sample from , and second, the last refinement , which can then be used to produce any number of future samples as desired with an acceptance ratio similar to the one observed so far; (ii) in optimization mode, if the last element of the history is an accept, and in this case will return on line (8), first the value , which in this mode is the only accepted trial in the history, and second, the last refinement (which can be used for such purposes as providing a ‘‘certificate of optimality of relative to ’’, but we do not detail this here).

On line (3), we compute the ratio , and then on line (4) we decide to accept or not based on this ratio; in optimization mode, we accept if the ratio is close enough to , as determined by a threshold666When is a finite domain, it makes sense to stop on a ratio equal to 1, in which case we have found an exact maximum. This is what we do in some of our experiments in section 3.; in sampling mode, we accept based on a Bernoulli trial of probability .

On line (5), we update the history by recording the trial and whether it was accepted or not.

If was rejected (line (6)), then on line (7), we perform a refinement of , based on the principles that we have explained.

### 2.6 A connection with A*

A special case of the OS algorithm, which we call ‘‘OS with piecewise bounds’’, shows a deep connection with the classical A optimization algorithm (Hart et al., 1968) and is interesting in its own right. Let us first focus on sampling, and let us suppose that represents an initial proposal density, which upper-bounds the target density over . We start by sampling with , and on a first reject somewhere in , we split the set into two disjoint subsets , obtaining a partition of . By using the more precise context provided by this partition, we may be able to improve our upper bound over the whole of into tighter upper bounds on each of and , resulting then in a new upper bound over the whole of . We then sample using , and experience at some later time another reject, say on a point in ; at this point we again partition into two subsets and , tighten the bounds on these subsets, and obtain a refined proposal over ; we then iterate the process of building this ‘‘hierarchical partition’’ until we reach a certain acceptance-rate threshold.

If we now move our focus to optimization, we see that the refinements that we have just proposed present an analogy to the technique used by A. This is illustrated in Fig. 2. In A, we start with a constant optimistic bound --- corresponding to our --- for the objective function which is computed at the root of the search tree, which we can assume to be binary. We then expand the two daughters of the root, re-evaluate the optimistic bounds there to new constants, obtaining the piecewise constant proposal , and move to the daughter with the highest bound. We continue by expanding at each step the leaf of the partial search tree with the highest optimistic bound (e.g. moving from to , etc.).777OS, when used in optimization mode, is in fact strictly more general than A, for two reasons: (i) it does not assume a piecewise refinement strategy, namely that the refinements follow a hierarchical partition of the space, where a given refinement is limited to a leaf of the current partition, and (ii) even if such a stategy is followed, it does not assume that the piecewise upper-bounds are constant. Both points will become clearer in the HMM experiments of section 3.1

, where including an higher-order n-gram in

has impact on several regions of simultaneously, possibly overlapping in complex ways with regions touched by previous refinements; in addition, the impact of a single n-gram is non-constant even in the regions it touches, because it depends of the multiplicity of the n-gram, not only on its presence or absence.

We will illustrate OS sampling with (non-constant) piece-wise bounds in the experiments of section 3.2.

## 3 Experiments

### 3.1 HMMs

Note: An extended and more detailed version of these experiments is provided in (Carter et al., 2012).

The objective in our HMM experiments is to sample a word sequence with density proportional to , where is the probability of the sequence under an -gram model and is the probability of observing the noisy sequence of observations given that the word sequence is . Assuming that the observations depend only on the current state, this probability can be written:

 p(x) = ℓ∏i=1plm(xi|xi−1i−n+1) pobs(oi|xi). (1)

#### Approach

Taking a tri-gram language model for simplicity, let us define . Then consider the observation be fixed, and write . In optimization/decoding, we want to find the argmax of , and in sampling, to sample from . Note that the state space associated with can be huge, as we need to represent explicitly all contexts in the case of a trigram model, and even more contexts for higher-order models.

We define , along with , where the maxima are taken over all possible context words in the vocabulary. These quantities, which can be precomputed efficiently, can be seen as optimistic ‘‘max-backoffs’’ of the trigram , where we have forgotten some part of the context. Our initial proposal is then . Clearly, for any sequence of words, we have . The state space of is much less costly to represent than that of .

The proposals , which incorporate n-grams of variable orders, can be represented efficiently through WFSAs (weighted FSAs). In Fig. 3(a), we show a WFSA representing the initial proposal corresponding to an example with four observations, which we take to be the acoustic realizations of the words ‘the, two, dogs, barked’. The weights on edges correspond only to unigram max-backoffs, and thus each state corresponds to a NULL-context. Over this WFSA, both optimization and sampling can be done efficiently by the standard dynamic programming techniques (Viterbi (Rabiner, 1989)and ‘‘backward filtering-forward sampling’’ (Scott, 2002)), where the forward weights associated to states are computed similarly, either in the max-product or in the sum-product semiring. Figure 3: An example of an initial q-automaton (a), and its refinement (b).

Consider first sampling, and suppose that the first sample from produces the two dog barked, marked with bold edges in the drawing. Now, computing the ratio gives a result much smaller than , in part because from the viewpoint of the full model , the trigram the two dog is very unlikely; i.e. the ratio is very low. Thus, with high probability, is rejected. When this is the case, we produce a refined proposal , represented by the WFSA in Fig. 3(b), which takes into account the more realistic bigram weight by adding a node (node 6) for the context two. We then perform a sampling trial with , which this time tends to avoid producing dog in the context of two; if we experience a reject later on some sequence , we refine again, meaning that we identify an n-gram in , which, if we extend its context by one (e.g from a unigram to a bigram or from a bigram to a trigram), accounts for some significant part of the gap between and . We stop the refinement process when we start observing acceptance rates above a certain fixed threshold.

The case of optimization is similar. Suppose that with the maximum is the two dog barked, then we observe that is lower than , reject and refine into . We stop the process at the point where the value of , at its maximum , is equal to the value of at , which implies that we have found the maximum for .

#### Setup

We evaluate our approach on an SMS-message retrieval task. Let be the number of possible words in the vocabulary. A latent variable represents a sentence defined as a sequence of words. Each word is converted into a sequence of numbers based on a mobile phone numeric keypad, assuming some level of random noise in the conversion. The task is then to recover the original message.

We use the English side of the Europarl corpus (Koehn, 2005) for training and test data (1.3 million sentences). A 5-gram language model is trained using SRILM (Stolcke, 2002) on 90% of the sentences. On the remaining 10%, we randomly select 100 sequences for lengths 1 to 10 to obtain 1000 sequences from which we remove the ones containing numbers, obtaining a test set of size 926.

#### Optimization

We limit the average number of latent tokens in our decoding experiments to 1000. In the top plot (a) of Fig. 4 we show the number of iterations (running Viterbi then updating ) that the different n-gram models of size 3, 4 and 5 take to do exact decoding of the test-set. For a fixed sentence length, we can see that decoding with larger n-gram models leads to a sub-linear increase w.r.t. in the number of iterations taken.

To demonstrate the reduced nature of our q-automaton, we show in Table 1 the distribution of n-grams in our final model for a specific input sentence of length 10. The total number of n-grams in the full model would be ; exact decoding here is not tractable using existing techniques. By comparison, our HMM has only 118 five-grams and 9008 n-grams in total.

#### Sampling

For the sampling experiments, we limit the number of latent tokens to 100. We refine our automaton until we reach a certain fixed cumulative acceptance rate (AR). We also compute a rate based only on the last 100 trials (AR-100), as this tends to better reflect the current acceptance rate. In plot (b), at the bottom of Fig. 4, we show a single sampling run using a 5-gram model for an example input, and the cumulative # of accepts (middle curve). It took 500 iterations before the first successful sample from .

We noted that there is a trade-off between the time needed to compute the forward probability weights needed for sampling, and the time it takes to adapt the variable-order HMM. To resolve this, we use batch-updates: making trials from the same -automaton, and then updating our model in one step. By doing this, we noted significant speed-ups in sampling times. Empirically, we found to be a good value. In plot (c), we show the average # of iterations in our models once refinements are finished (AR-100=20%) for different orders over different lengths. We note a sub-linear increase in the number of trials when moving to higher ; for length10, and for , average number of trials: 3-1105,   4-1238,   5-1274.

### 3.2 Discrete Probabilistic Graphical Models

#### Approach

The OS approach can be applied to exact sampling and optimization on graphical models with loops, where the objective function takes the form of a product of local potentials:

 p(x)=∏n∈Nψn(x)∏e∈Eϕe(x), (2)

where defines an undirected graph with nodes and edges . The unary potential functions are denoted by and the binary potentials by . Since integrating and sampling from (2) can be done efficiently for trees, we first determine a spanning tree of the graph . Let us denote by and the maximal and minimal values of the potential for edge . If we define:

 q(x)=∏n∈Nψn(x)∏e∈Tϕe(x)∏e∈E−Tϕmaxe, (3)

then is an upper-bound for over .888Any choice of tree produces an upper-bound, but it is advantageous to choose one for which is as “close” as possible to , which we heuristically do by using Prim’s algorithm (Prim, 1957) for selecting a maximum spanning tree on the graph having weights , . The intuition is that edges with nearly constant potentials create a small gap between the exact value and the the bound and can be left outside the tree.

To improve this upper bound, we use the conditioning idea (see e.g. (Koller and Friedman, 2009), chap. 9.5) which corresponds to partitioning the configuration space . Assume we observe all the possible values of a node, say , having the set of incident edges . The restriction of to the subspace can be written as:

 pi,k(x)=ψi(k)∏n∈N−{i}ψn(x)∏e∈Ciϕe(x)unary potentials∏e∈E−Ciϕe(x)binary % potentials,

where the binary potentials of edges incident to have now been absorbed into unary potentials associated with neighboring nodes to , and where we have used the informal notation in an obvious way. Note that, by doing so, we have eliminated from the graph the edges incident to , which may have been participating in loops; in particular, if in equation (3) we had some edge incident to , then this edge is now absorbed in one of the neighbors of , which implies that the maximum has now been replaced by its exact value, which is necessarily lower, implicitly refining . By conditioning with respect to all the possible values of , we obtain different graphs with the node removed, in other words we have partitioned the event space into subspaces; we then define on each such subspace a proposal as in equation (3), which is lower than the restriction of to . We then define the refinement of on the whole of as , where is taken to be null for . This scheme can be seen to be an instance of with piecewise bounds.

If we repeat this process iteratively based on observed rejections in one of the subspaces, we obtain a hierarchical partition of which is more fine-grained in some regions of than in others. The refinements obtained have the form (3), but on reduced graphs in which we have introduced the evidences given by the conditioned variables. While the cardinality of the hierarchical partition may grow exponentially, we can monitor the acceptance rate/computation time ratio, as we will see below.

#### Ising model experiment

In what follows, we only consider exact sampling, the problem of MAP estimation (i.e. optimization) in graphical models having been thoroughly investigated over the past decade (e.g.

(Sontag et al., 2008)). One interesting question is to understand the trade-off between improving the acceptance rate and incurring the cost of performing a refinement. An a priori possible policy could be to first refine the proposals up to a certain point and only then sample until we get the required number of samples. As the experiments will show, there are however two reasons to interleave sampling with refining: the first is that observing rejected samples from helps to choose the refinements, as argued previously; the second is to have a criterion for stopping the refinement process: the computation of this criterion requires an estimate of the current acceptance rate which can be estimated from the samples.

We consider a Ising model of 100 binary variables on a 10x10 uniform grid with unary potentials and binary coupling strengths drawn according to a centered normal distribution with standard deviation 0.1. Note that, in certain cases, exact sampling for Ising models can be done in polynomial time

(Ullrich, 2010) by using an elegant MCMC approach called coupling from the past (Propp and Wilson, 1996)

. It is based on the fact that two properly coupled Markov chains follow exactly the target distribution at the time they coaelesce. However, applications of this approach rely on certain strong assumptions on the potentials, typically that they are either all attractive or all repulsive

(Craiu and Meng, 2011), while we sample models with random positive or negative coupling strengths, making the problem much harder. One interesting extension of our work would be to use these algorithms as proposal distributions for more general models. Figure 5: Comparison of different refinement policies for the piecewise bound on a 10x10 Ising model grid.

The algorithm was run using 4 different policies for the choice of the refinements, where a policy is a function that takes as input the current proposal and returns a (subspace, conditioning-variable) pair. The first two policies in addition use a rejected sample as input, while the last two do not and are deterministic: (i) random split in the region of the rejected sample means that once an observation has been rejected, we refine the subspace which contains the sample by conditioning with respect to one of the remaining variables selected at random; (ii) highest bound improvement on rejected sample is the same but the index of the conditioning variable is selected such that refining on leads to the largest decrease in the value of (for this specific rejected sample ); this is a similar refinement strategy to the one we used in the HMM experiments; (iii) most probable region refines a variable (selected uniformly at random) in the most probable subspace of the piecewise bound ; (iv) highest acceptance rate is the most ‘‘ambitious’’ greedy policy where the largest reduction of the total mass of the proposal is identified among all the possible choices of (subspace, conditioning-variable). This has been implemented efficiently by maintaining a priority queue containing all the possible triples (subspace, conditioning-variable, maximal-bound-improvement). The acceptance rates obtained by following these policies are compared on Figure 5 (left) by running 2000 refinements with policies (i) and (iii), 800 refinements with policy (ii) and 400 refinement with policy (iv). Results confirm that the best refinement is obtained using the deterministic policy (iv). The policy (ii) based on rejected samples reaches the same acceptance rates using twice as many refinements. The other two policies (i) and (iii) are very naive and do not reach a significant acceptance rate, even after 400 iterations. Based on these results, one might conclude that the policy (iv) is the best. However, we will now see that this is not the case when the computation time is taken into account.

After trials, the partition function of , namely , can be estimated (without bias) by where the observed accept ratios and refinement weights are used. Hence,

is an unbiased estimate of the current acceptance rate

. Note that these estimators are based on all the samples , accepted or not. Hence, if we decide to stop refining now, the expected time to obtain additional exact samples from the target distribution is approximately , where is the average time to obtain one trial from the distribution . If we add the total time spent in computing the set of refinements up to the current refinement , we obtain an estimate of the expected total time to obtain samples: This quantity computed for sample is plotted against the acceptance rate on Figure 5 (right). For each policy, the expected computation time starts to decrease as the acceptance rate increases; in this regime, the refinement time is small compared to the time reduction due to a higher acceptance rate. For large values of the acceptance rate, the refinement time is no more negligeable, leading to an increase of the total computation time. We see (Figure 5 (right)) that the highest acceptance rate policy (iv), despite its very good acceptance rate for a fixed number of refinements, requires more time in total than alternative refinements policies. The difference is striking: if we look at the minimum of each curve (which corresponds to the optimal stopping time for the refinements), it is at least 10 times faster to use the policy (ii) based on a refinement rule applied only to the rejected sample. This experiment confirms the benefit of using rejected samples: by adaptively choosing the refinements, we spot the regions of the space that are important to refine much faster than by computing the best possible alternative refinement.

## 4 Conclusions and Perspectives

In this paper, we have proposed a unified viewpoint for rejection sampling and heuristic optimization, by using functional upper-bounds for both. While in sampling, the upper-bounds are refined by decreasing their integral ( norm), in optimization they are refined by decreasing their maximum ( norm). Depending on the problem, several classes of upper bounds can be used. We showed that variable-order max-backoff bounds on -gram probabilities gave state-of-the art performances on the exact decoding of high-order HMMs. For many practical problems, simpler piecewise bounds can be derived, which we illustrated on the example of sampling and decoding on a large tree-width graphical model. One interesting property of the proposed approach is the adaptive nature of the algorithm: the rejected sample is used to quickly choose an effective refinement, an approach which can be computational attractive compared to the computation of the norm of all the potential one-step refinements. In the case of graphical model sampling, we showed that this can lead to a speedup factor of an order of magnitude.

The results presented in this paper motivate further research in the development of domain-specific functional bounds. One important extension will be to derive bounds for agreement-based models corresponding to the product of two (or more) simple models and . One typical example corresponds to the agreement between an HMM and a probabilistic context-free grammar in order to take into account both the syntactic and the low-level -gram structures of the language Rush et al. (2010).

Another research direction is to improve those models that are based on piecewise bounds: their quality can be limited since their refinement is always local, i.e. they can only improve the bound on a single element of the partition. In the example of graphical model sampling, if we condition on the value of a first variable, say , and then further refine the partition by conditioning on the value of a second variable , the complementary region of the space will not be impacted. Intuitively, the quality of the bound could be improved if we could refine it jointly for and , in a way analogous to what was done in the context of HMMs, where the incorporation of one higher-order n-gram into the proposal had a global impact on the whole event space.

## References

• Andrieu et al.  Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan.

An introduction to MCMC for machine learning.

Machine Learning, 50(1-2):5--43, 2003.
• Ash and Doléans-Dade  Robert B. Ash and Catherine A. Doléans-Dade. Probability and Measure Theory. Academic Press, 1999.
• Carter et al.  Simon Carter, Marc Dymetman, and Guillaume Bouchard.

Exact Sampling and Decoding in High-Order Hidden Markov Models.

In Proceedings of EMNLP, Jeju, South Korea, July 2012.
• Craiu and Meng  Radu V. Craiu and Xiao-Li Meng. Perfection within Reach: Exact MCMC Sampling. In S. Brooks, A. Gelman, G.L. Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, pages 199--226. CRC Press, 2011.
• Gilks and Wild  W. R. Gilks and P. Wild. Adaptive rejection sampling for gibbs sampling. Applied Statistics, 42(2):337--348, 1992.
• Gorur and Teh  Dilan Gorur and Yee Whye Teh. Concave convex adaptive rejection sampling. Technical report, Gatsby Computational Neuroscience Unit, 2008.
• Hart et al.  P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions On Systems Science And Cybernetics, 4(2):100--107, 1968.
• Kam and Kopec  Anthony C. Kam and Gary E. Kopec. Document image decoding by heuristic search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:945--950, 1996.
• Kirkpatrick et al.  S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671--680, 1983.
• Koehn  Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit, pages 79--86, 2005.
• Koller and Friedman  Daphne Koller and Nir Friedman. Probabilistic Graphical Models. MIT Press, 2009.
• Mansinghka et al.  Vikash Mansinghka, Daniel Roy, Eric Jonas, and Joshua Tenenbaum. Exact and Approximate Sampling by Systematic Stochastic Search. In Proc. AISTATS, 2009.
• Popat et al.  Kris Popat, Danel Green, Justin Romberg, and Dan Bloombug. Adding linguistic constraints to document image decoding: Comparing the iterate complete path and stack algorithms. In Proceedings of IS&T/SPIE Electronic Imaging 2001: Document Recognition & Retrieval VIII, pages 259--271, 2001.
• Prim  R.C. Prim. Shortest Connection Networks and Some Generalizations. Bell System Technical Journal, 36:1389--1401, 1957.
• Propp and Wilson  J. Propp and D. Wilson. Exact sampling with coupled markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9:223--252, 1996.
• Rabiner  Lawrence R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, February 1989.
• Robert and Casella  Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2004. ISBN 0387212396.
• Rush et al.  Alexander M Rush, David Sontag, Michael Collins, and Tommi Jaakkola.

Dual decomposition and linear programming relaxations for natural language processing.

In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP 2010), pages 1--11, 2010.
• Scott  Steven L. Scott. Bayesian methods for hidden markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97:337--351, 2002.
• Sontag et al.  David Sontag, Talya Meltzer, Amir Globerson, Yair Weiss, and Tommi Jaakkola. Tightening LP relaxations for MAP using message-passing. In

24th Conference in Uncertainty in Artificial Intelligence

, pages 503--510. AUAI Press, 2008.
• Stolcke  Andreas Stolcke. Srilm - an extensible language modeling toolkit. In Proceedings of the International Conference of Spoken Language Processing (INTERSPEECH 2002), pages 257--286, 2002.
• Ullrich  M. Ullrich. Exact sampling for the ising model at all temperatures. Monte Carlo Methods Appl, 2010. to appear.