Conditional Poisson Stochastic Beam Search

by Clara Meister et al.
ETH Zurich

Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process: conditional Poisson stochastic beam search. Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to Kool et al. (2019)'s stochastic beam search (SBS). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe that CPSBS produces lower-variance and more efficient estimators than SBS, even showing improvements in high-entropy settings.





1 Introduction

Many NLP tasks require the prediction of structured outputs, such as sentences or parse trees, either during decoding or as part of a training algorithm. For today's neural architectures, beam search reddy-1977 has become the decoding algorithm of choice due to its efficiency and empirical performance AAAI1714571; edunov-etal-2018-understanding; XLNET; meister-etal-2020-best. Beam search is a deterministic method, which invites a natural question: What is the proper stochastic generalization of beam search? Several recent papers have investigated this question kool2019stochastic; Kool2020Estimating; shi2020uniquerandomizer. Here we build on this line of work and introduce an alternative stochastic beam search that we contend is a more faithful stochasticization of the original algorithm in that it recovers standard beam search as a special case. We name our algorithm conditional Poisson stochastic beam search (CPSBS) as we draw on the conditional Poisson sampling scheme hajek1964 in its construction. The relationship between CPSBS and other common decoding strategies is displayed in Table 1.

At every iteration, CPSBS replaces the top-K operator in the beam search algorithm with conditional Poisson sampling, resulting in a decoding strategy that generates samples without replacement. Importantly, annealing our sampling distribution at each time step turns local sampling into a local top-K computation and thereby recovers beam search. We subsequently show that these samples can be used to construct a statistically consistent estimator for the expected value of an arbitrary function of the output.

                Set size 1           Set size K
argmax          Greedy search        Beam search
sample          Ancestral sampling   Conditional Poisson beams

Table 1: A comparison of beam-based decoding algorithms for sequence models, by solution set size and objective. The argmax and sample variants are related through annealing: as the annealing parameter τ of the distribution goes to 0, sampling turns into computing an argmax (see § 3).

In our experiments with neural machine translation models, we observe that CPSBS leads to better estimates of expected bleu and conditional model entropy than SBS and the sum-and-sample estimator Kool2020Estimating, distinctly outperforming Monte Carlo sampling for both small sample sizes and low temperatures. Furthermore, we find that CPSBS can be used as a diverse sampling strategy. We take these results as confirmation that CPSBS is a useful tool in the newfound arsenal of sampling strategies for neural sequence models.

2 Beam Search

In this section, we overview the necessary background on neural sequence models and beam search in order to motivate our algorithm in § 3.

Neural Sequence Models.

We consider locally normalized probabilistic models p over sequences y:

p(y | x) = ∏_{t=1}^{|y|} p(y_t | y_{<t}, x)

where y is a member of a set of well-formed outputs 𝒴. In the context of language generation models, well-formed outputs are sequences of tokens from a vocabulary; all begin and end with special tokens bos and eos, respectively. We use y_{<t} to represent the subsequence ⟨y_1, …, y_{t−1}⟩. In this work, we consider the setting where the maximum sequence length is upper-bounded; we denote this upper bound T. Without loss of generality, we may condition on an input x, as is necessary for machine translation and other conditional generation tasks.

Beam Search.

Beam search is a commonly used search heuristic for finding an approximate solution to the following optimization problem:

y* = argmax_{y ∈ 𝒴} log p(y | x)

Its most straightforward interpretation is as a pruned version of breadth-first search, where the breadth of the search is narrowed to the top-K candidates. However, here we will present beam search through a nonstandard lens meister-etal-2020-beam; meister-etal-2021-determinantal in order to emphasize the connection with our stochastic generalization in § 3. Specifically, we present the algorithm as iteratively finding the highest-scoring set under a specific set function.

Under this paradigm, the initial beam Y_0 contains only the bos token. At subsequent steps t, beam search selects the K highest-scoring candidates from the set B_t that we define below (sequences already ending in eos are not extended and are simply added to the set "as is"):

B_t = { y' ∘ y : y' ∈ Y_{t−1}, y ∈ V }

where ∘ is sequence concatenation. Candidate sets with collectively higher probability under the model p have higher score. This process continues until all candidates end in eos, or t = T. For notational ease, we define N = |B_t|; throughout this paper, we will assume N ≥ K and identify the elements of B_t with the integers 1, …, N.

We can formulate the time-step-dependent set function whose argmax beam search finds as

Q_t(Y'_t | Y_{t−1}) ∝ ∏_{n ∈ Y'_t} w_n

where w_n is the weight of the n-th element of B_t. To recover beam search, we set our weights equal to probabilities under a model p, i.e., w_n = p(y_n | x). Note that we leave the constraint |Y'_t| = K implicit in Eq. 4. As should be clear from notation, this set function only assigns positive score to subsets of B_t of size exactly K, and the assigned score is proportional to the product of the probabilities of the candidates under the model p. Putting this all together, beam search may be viewed as the following iterative process:

Y_0 = {bos}
Y_t = argmax_{Y'_t ⊆ B_t} Q_t(Y'_t | Y_{t−1})   (t = 1, …, T)
return Y_T

3 Conditional Poisson Stochastic Beams

Our paper capitalizes on a very simple observation: rather than taking its argmax, we may renormalize Eq. 4 into a distribution and sample without replacement a size-K set at each iteration:

Y_0 = {bos}
Y_t ∼ Q_t(· | Y_{t−1})   (t = 1, …, T)
return Y_T

This recursion corresponds to performing conditional Poisson sampling (CPS; hajek1964; see App. A for an overview), a common sampling-without-replacement design tille_book (a sampling design is a probability distribution over sets of samples), at every time step. Thus we term this scheme conditional Poisson stochastic beam search. CPSBS gives us a probability distribution P over sets of candidates of size K, i.e., the final beam Y_T. We denote the CPSBS distribution P and we write Y_T ∼ P to indicate that Y_T is the stochastic beam at the end of a sampling run. We may write P(Y_T) as a marginal probability, summing over all sequences of beams that could have resulted in Y_T (this formulation reveals that it is wildly intractable to compute exactly):

P(Y_T) = ∑_{Y_1, …, Y_{T−1}} ∏_{t=1}^{T} Q_t(Y_t | Y_{t−1})

Note that the structural zeros of Q_t prevent any incompatible sequence of beams. We provide a theoretical analysis of the scheme in § 4 and an empirical analysis in § 5.

Normalizing Q_t.

At each time step t, we compute Q_t(· | Y_{t−1}), a distribution over subsets of size K of a base set B_t, using the CPS design. The normalizing constant for this distribution is defined as

Z = ∑_{Y' ⊆ B_t, |Y'| = K} ∏_{n ∈ Y'} w_n

Despite there being exponentially many summands, we can sum over all subsets in O(N K) time via the following recurrence relation (the reader may recognize this recurrence as a weighted generalization of Pascal's triangle, which is why we chose the notation C):

C(n, k) = C(n−1, k) + w_n · C(n−1, k−1),   with C(0, 0) = 1 and C(0, k) = 0 for k > 0

We give complete pseudocode in App. C. Correctness of this algorithm is shown in taskar_dpp. The normalizing constant can then be efficiently computed as Z = C(N, K).
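As an illustrative sketch (our own code, not the authors' released implementation; the function and variable names are ours), the recurrence can be written in a few lines of Python:

```python
def elementary_symmetric(w, K):
    """Weighted Pascal's triangle: after consuming all N weights, C[k]
    holds the sum over all size-k subsets S of prod_{n in S} w_n (the
    elementary symmetric polynomial e_k).  C[K] is the CP normalizing
    constant Z.  Runs in O(N * K) time and O(K) space."""
    C = [1.0] + [0.0] * K  # C[0] = 1: the empty product
    for w_n in w:
        # sweep k downward so C[k - 1] still refers to the previous row
        for k in range(K, 0, -1):
            C[k] += w_n * C[k - 1]
    return C

weights = [0.5, 0.2, 0.1, 0.1]
Z = elementary_symmetric(weights, 2)[2]  # normalizer for beam size K = 2
```

For the toy weights above, Z equals the sum of products over all six size-2 subsets.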

Sampling from Q_t.

We can efficiently sample sets from Q_t using the algorithm below:

1: S ← ∅   (initialization)
2: for n = N, …, 1:
3:    k ← K − |S|   (number of elements still needed)
4:    add the n-th element of B_t to S with probability w_n · C(n−1, k−1) / C(n, k)
5: return S   (guaranteed to have size K)

In words, the algorithm considers each element in turn until K elements have been sampled. Notice that line 4 adjusts the probability of sampling item n given the number of items that have already been sampled, which ensures that exactly K elements are sampled at termination.
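A runnable sketch of this procedure (our own illustrative code; the paper's exact pseudocode is in App. C). Here we use a suffix-table variant of the same recurrence, so items are scanned left to right:

```python
import random

def cp_sample(w, K, rng):
    """Sample a size-K index set under the conditional Poisson design.
    E[i][k] stores the sum over size-k subsets of w[i:] of their weight
    products; the inclusion test below mirrors line 4 of the pseudocode."""
    N = len(w)
    E = [[0.0] * (K + 1) for _ in range(N + 1)]
    for i in range(N, -1, -1):
        E[i][0] = 1.0
        for k in range(1, K + 1):
            if i < N:
                E[i][k] = E[i + 1][k] + w[i] * E[i + 1][k - 1]
    S, need = [], K
    for i in range(N):
        if need == 0:
            break
        # P(include item i | `need` slots left among items i..N-1);
        # this equals 1 when exactly `need` items remain, so |S| = K always.
        p = w[i] * E[i + 1][need - 1] / E[i][need]
        if rng.random() < p:
            S.append(i)
            need -= 1
    return S

rng = random.Random(0)
sample = cp_sample([0.5, 0.2, 0.1, 0.1], 2, rng)
```

Sets with a larger product of weights are drawn more often: with these toy weights the pair {0, 1} is the mode of the design.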

Setting w.

The weight w_n assigned to the n-th item of B_t directly affects its probability of being included in the sampled set Y_t, also termed an item's inclusion probability. In this paper, we write π(n) to denote the inclusion probability of item n under the distribution Q_t, defined as:

π(n) = ∑_{Y' ⊆ B_t, |Y'| = K, n ∈ Y'} Q_t(Y' | Y_{t−1})

One strategy is to choose w_n at time step t such that π(n) ≈ p(y_n | x). This choice recovers beam search when we anneal our chosen weights: as the temperature parameter τ of w_n^{1/τ} goes to 0, the CP distribution will assign probability 1 to the set containing the top-K elements (in the event of ties, annealed CP converges to a distribution that breaks ties uniformly at random).

Finding w_n's that result in pre-specified inclusion probabilities is possible, but it requires solving a numerical optimization problem aires1999algorithms; GRAFSTROM20092111. Further, in CPSBS, we will be sampling from a different distribution at each time step, and it would be quite slow to solve the numerical optimization problem at each iteration. Luckily, the choice of odds weights w_n = p(y_n | x) / (1 − p(y_n | x)) yields a good approximation to the target inclusion probabilities in both theory and practice hajek1981sampling; bondesson; aires1999algorithms.
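To make the annealing claim concrete, here is a toy check (our own code; the next-token probabilities are hypothetical): raising the weights to the power 1/τ pushes essentially all of the CP design's mass onto the top-K set as τ shrinks.

```python
import math
from itertools import combinations

def cp_set_prob(w, S):
    """Exact CP probability of the index set S under weights w (|S| = K),
    computed by brute-force enumeration of the normalizer."""
    K = len(S)
    Z = sum(math.prod(w[i] for i in c)
            for c in combinations(range(len(w)), K))
    return math.prod(w[i] for i in S) / Z

p = [0.5, 0.3, 0.15, 0.05]   # hypothetical next-token probabilities
top2 = (0, 1)                # the set beam search (K = 2) would return
# probability that annealed CP draws exactly the beam-search set
mass = {tau: cp_set_prob([q ** (1.0 / tau) for q in p], top2)
        for tau in (1.0, 0.5, 0.2, 0.1)}
```

At τ = 1 the top-2 set receives under half the mass, while at τ = 0.1 it exceeds 0.99: annealing smoothly interpolates between sampling and beam search.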

4 Statistical Estimation with Conditional Poisson Stochastic Beam Search

In this section, we discuss statistical estimation with CPSBS samples. To that end, we construct two estimators with different properties; however, only the second provides good performance in practice, as we discuss in § 5.

4.1 The Horvitz–Thompson Estimator

We build upon the Horvitz–Thompson (HT) estimator HorvitzThompson1952, which is a common technique for estimation from sampling-without-replacement (SWOR) schemes.

Let f : 𝒴 → ℝ be a function of interest; we seek to approximate its expected value under p:

E_{y ∼ p}[f(y)] = ∑_{y ∈ 𝒴} p(y | x) f(y)

The Monte Carlo estimator of the above quantity is

G_MC = (1/M) ∑_{m=1}^{M} f(y^{(m)})

where y^{(m)} ∼ p(· | x) i.i.d. However, in the special case of sampling from a finite population, which is extremely common in NLP, Monte Carlo sampling can be very wasteful. For example, if a distribution is very peaked, it will sample the same item repeatedly; this can lead to inaccurate approximations for some f. As a consequence, the mean squared error (MSE) of the estimator with respect to the true expectation can be quite high for small M. Indeed, we see this empirically in Fig. 2.

Taking samples without replacement allows us to cover more of the support of p in our estimate of E[f]. However, we must take into account that our samples are no longer independent and identically distributed. We now define the HT estimator, using notation specific to the case of CPSBS, where Y_T ∼ P:

G_HT = ∑_{y ∈ Y_T} (p(y | x) / π(y)) f(y)

where CPSBS's inclusion probability π(y) is

π(y) = ∑_{Y_T : y ∈ Y_T} P(Y_T)

i.e., the probability of sampling a set Y_T that contains the element y. In Eq. 11, the distribution P may be viewed as a proposal distribution in the sense of importance sampling mcbook, and the ratios p(y | x)/π(y) as the corresponding importance weight corrections. If we can exactly compute π(y), then the HT estimator is unbiased (see § B.1 for proof). Note that it is common to normalize Eq. 11 by the sum of importance weights, i.e., to divide by ∑_{y ∈ Y_T} p(y | x)/π(y); while this leads to a biased estimator, it can significantly reduce variance, which is often worthwhile. However, the summation in Eq. 12 is intractable, so we resort to estimation.
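For a single CP step, the unbiasedness claim can be verified exactly by enumeration (our own sketch with toy numbers; a real CPSBS run would require the estimated inclusion probabilities of § 4.2):

```python
import math
from itertools import combinations

p = [0.4, 0.3, 0.2, 0.1]   # toy model distribution over four sequences
f = [1.0, 2.0, 4.0, 8.0]   # arbitrary statistic f(y)
K = 2
w = p                      # CP weights (here simply the probabilities)

sets = list(combinations(range(4), K))
Z = sum(math.prod(w[i] for i in S) for S in sets)
P = {S: math.prod(w[i] for i in S) / Z for S in sets}        # the design
pi = [sum(P[S] for S in sets if i in S) for i in range(4)]   # inclusion probs

def ht(S):
    """Horvitz-Thompson estimate of E_p[f] from one sampled set S."""
    return sum(p[i] * f[i] / pi[i] for i in S)

truth = sum(p[i] * f[i] for i in range(4))     # the exact expectation
expected_ht = sum(P[S] * ht(S) for S in sets)  # E[HT] over the design
```

Averaged over the design, the HT estimate coincides exactly with the true expectation; note also that the inclusion probabilities of a size-K design always sum to K.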

4.2 Estimating Inclusion Probabilities

In this section, we develop statistical estimators of the inclusion probabilities under conditional Poisson stochastic beam search. An important caveat: the analysis in this section applies only to the estimators of the inverse inclusion probabilities themselves. Further analysis would be needed to characterize the variance of the Horvitz–Thompson estimators that make use of them.

4.2.1 Naïve Monte Carlo

It is not straightforward to estimate the reciprocal inclusion probabilities directly. Thus, we estimate the inclusion probabilities themselves and take the reciprocal of that estimate. This strategy leads to a consistent, but biased, estimator: since x ↦ 1/x is convex, by Jensen's inequality the reciprocal of an unbiased estimate of π(y) is not an unbiased estimate of 1/π(y). One obvious way to derive an inclusion probability estimator is the Monte Carlo estimator:

π̂(y) = (1/M) ∑_{m=1}^{M} 1{y ∈ Y_T^{(m)}}

where Y_T^{(m)} ∼ P i.i.d.

Proposition 4.1.

The estimator in Eq. 13 has the following two properties:

  a) π̂(y) is an unbiased estimator of π(y); and

  b) 1/π̂(y) is a consistent estimator of 1/π(y) with asymptotic variance

  AVar[1/π̂(y)] = (1 − π(y)) / (M π(y)³)

Here AVar denotes the asymptotic variance, i.e., the variance once the number of samples M is large enough that the central limit theorem applies.

Proof given in § B.2. ∎

Qualitatively, this result tells us that if we ask about the inverse inclusion probability of a candidate with a low inclusion probability, our estimator may have very high variance. Indeed, it is unlikely that we could derive an estimator without this qualitative property, due to the presence of the inverse. Moreover, the estimator given in Eq. 13 is not of practical use: if we are interested in the inverse inclusion probability of a specific candidate y, then we may have to sample a very large number of beams until we eventually sample one that actually contains y. In practice, this means that our estimate of the inclusion probability for a rare y will often be zero, which we cannot invert. (One solution would be to smooth our estimates of the inclusion probabilities, adding a small ε to ensure that we do not divide by zero, but we find our next approach more methodologically sound.) Instead, we pursue an importance sampling strategy for estimating π(y), which we outline in the next section.
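To see the failure mode concretely, here is a single-CP-step sketch (our own code with toy weights; enumeration stands in for a full CPSBS run):

```python
import math
import random
from itertools import combinations

def cp_sample_enum(w, K, rng):
    """Draw one size-K set from the CP design by explicit enumeration
    (tractable only for tiny N; used here purely for illustration)."""
    sets = list(combinations(range(len(w)), K))
    probs = [math.prod(w[i] for i in S) for S in sets]
    return rng.choices(sets, weights=probs)[0]

w = [0.6, 0.3, 0.05, 0.03, 0.02]   # item 4 is rare under the design
K, M = 2, 50
rng = random.Random(0)
draws = [cp_sample_enum(w, K, rng) for _ in range(M)]
pi_hat_0 = sum(0 in S for S in draws) / M  # common item: well estimated
pi_hat_4 = sum(4 in S for S in draws) / M  # rare item: near (or exactly) zero
```

When pi_hat_4 comes out as 0 its reciprocal is undefined, which is exactly the problem the importance-sampling estimator of § 4.2.2 sidesteps: its hindsight samples always contain the target candidate.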

4.2.2 Importance Sampling

We now turn to an inclusion probability estimator based on importance sampling. Recall from Eq. 12 that the inclusion probability for y is a massive summation over sequences of possible beams that could have generated y. Rather than computing this sum exactly, we estimate it from samples. Our procedure starts by generating hindsight samples from a proposal distribution Q̃_t that is conditioned on y: at each time step, Q̃_t is the distribution Q_t conditioned on its sets containing the prefix y_{≤t} (thus it is always the case that y ∈ Y_T). This proposal distribution can be realized through a minor modification of our algorithm in § 3, where the element corresponding to y_{≤t} is placed at the beginning of B_t and added to Y_t deterministically. For brevity, we omit an explicit notational dependence of Q̃_t and Y_t on y.

Lemma 4.1.

The joint proposal distribution Q̃ (we omit its dependency on y for brevity) may be expressed in terms of the per-time-step inclusion probabilities π_t as follows:

Q̃(Y_1, …, Y_T) = Q(Y_1, …, Y_T) / ∏_{t=1}^{T} π_t(y_{≤t})

where we define Q(Y_1, …, Y_T) = ∏_{t=1}^{T} Q_t(Y_t | Y_{t−1}) as the joint probability of the beams under the original distributions Q_t. We omit that both Q and Q̃ are conditioned on Y_0.


See § B.2. ∎

In terms of computation, Eq. 16 makes use of the fact that the per-time-step inclusion probability of item n under a given Q_t can be computed efficiently with dynamic programming using the following identity:

π_t(n) = w_n · e_{K−1}(w_{−n}) / Z

where e_{K−1}(w_{−n}) denotes the sum, over all size-(K−1) subsets excluding n, of the products of their weights. For completeness, we give pseudocode in App. C. Given M samples Y^{(1)}, …, Y^{(M)} ∼ Q̃ defined with respect to a given y, we propose the following unbiased estimator of the inclusion probabilities:

π̂(y) = (1/M) ∑_{m=1}^{M} ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1}^{(m)})

where y_{≤t} is a prefix of y. One simple derivation of Eq. 19 is as an importance sampler: we start with the estimator given in Eq. 13, rewrite the expectation of 1{y ∈ Y_T} under Q as an expectation under Q̃ weighted by Q/Q̃, and note that (i) by Lemma 4.1 this importance weight equals ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1}). This derivation serves as a simple proof that Eq. 19 inherits unbiasedness from Eq. 13.
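The leave-one-out identity for the per-time-step inclusion probabilities can be sketched as follows (our own code; we use the clearer O(N²K) form, though a forward/backward table brings this to O(NK)):

```python
def inclusion_probs(w, K):
    """Exact CP inclusion probabilities
        pi(n) = w_n * e_{K-1}(w without n) / e_K(w),
    where e_k is the elementary symmetric polynomial computed via the
    weighted-Pascal recurrence."""
    def esp(ws, kmax):
        C = [1.0] + [0.0] * kmax
        for x in ws:
            for k in range(kmax, 0, -1):
                C[k] += x * C[k - 1]
        return C
    Z = esp(w, K)[K]
    return [w[n] * esp(w[:n] + w[n + 1:], K - 1)[K - 1] / Z
            for n in range(len(w))]

pi = inclusion_probs([0.5, 0.2, 0.1, 0.1], 2)
```

As a sanity check, the inclusion probabilities of any size-K design sum to K, and for these toy weights the heaviest item is included with probability 0.8.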

Proposition 4.2.

The estimator in Eq. 19 has the following two properties:

  a) π̂(y) is an unbiased estimator of π(y);

  b) the estimator 1/π̂(y) of the inverse inclusion probability is consistent, with an upper bound on the asymptotic variance that holds under the assumption that the per-time-step importance weights Q/Q̃ are bounded for all t.

Proof given in § B.2. ∎

Proposition 4.2 tells us that we can use Eq. 19 to construct a consistent estimator of the inverse inclusion probabilities. Moreover, because every hindsight sample contains y, the importance sampling estimate satisfies π̂(y) > 0, unlike the Monte Carlo estimator, which can return zero. We further see that, to the extent that Q̃ approximates Q, we may expect the variance of Eq. 19 to be small, specifically in comparison to the naïve Monte Carlo estimator in Eq. 13; this is often the case for estimators built using importance sampling techniques when the proposal distribution is chosen judiciously monte-carlo. Thus, given our estimator in Eq. 19, we can construct a practically useful estimator of E[f] using the HT estimator in Eq. 11. In the next section, we observe that this estimator is quite efficient in the sequence model setting.

5 Experiments

We repeat the analyses performed by kool2019stochastic, running experiments on neural machine translation (NMT) models; for reproducibility, we use the pretrained Transformer model for WMT'14 bojar2014findings English–French made available by fairseq ott2019fairseq. We evaluate on the En-Fr newstest2014 set, containing 3003 sentences. Further details can be found in App. D. Our implementation of CPSBS modifies the beam search algorithm from the fairseq library. We additionally consider the beam search, stochastic beam search, diverse beam search, and ancestral sampling algorithms available in fairseq.

5.1 Statistical Estimators for Language Generation Models

Estimators have a large number of applications in machine learning. For example, the REINFORCE algorithm reinforce constructs a Monte Carlo estimator of a gradient using the score function; minimum Bayes risk decoding kumar-byrne-2004-minimum uses an estimate of risk in its optimization problem. In this section, we compare estimators of sentence-level bleu score and conditional model entropy for NMT models. Notably, NMT models trained to minimize cross-entropy with the empirical distribution (label smoothing label_smoothing is typically also employed, which leads to even higher-entropy distributions) are not peaked distributions ott2018analyzing; eikema2020map; thus, standard estimation techniques, e.g., Monte Carlo, should generally provide good results. However, we can vary the annealing parameter of our model in order to observe the behavior of our estimator with both high- and low-entropy distributions, making this a comprehensive case study. Here the annealed model distribution is computed as

p_τ(y_t | y_{<t}, x) ∝ p(y_t | y_{<t}, x)^{1/τ}

where we should expect a standard Monte Carlo estimator to provide good results at τ close to 1, when p is naturally high entropy. We test our estimator in this setting so as to give a comparison in a competitive setting. Specifically, we assess the performance of our estimator of E[f] given in Eq. 11, using inclusion probability estimates from Eq. 19 with importance weight normalization, in comparison to three other approaches: Monte Carlo (MC) sampling, the sum-and-sample (SAS) estimator, and stochastic beam search (SBS).

Monte Carlo.

Under the Monte Carlo sampling scheme with sample size M, we estimate the expected value of f under our model using Eq. 10 with a sample y^{(1)}, …, y^{(M)} ∼ p(· | x).

Figure 1: bleu score estimates for three different sentences using estimators for the respective decoding methods. τ indicates the scaling temperature; values and sentences are chosen to mimic kool2019stochastic.
Sum and Sample.

The sum-and-sample estimator botev17a; liu2019rao; Kool2020Estimating is an unbiased estimator that takes as input a deterministically chosen set B and samples an additional y′ from the remaining elements 𝒴 ∖ B; we obtain the set B using beam search in our experiments. Formally, the SAS estimator can be written as:

G_SAS = ∑_{y ∈ B} p(y | x) f(y) + (1 − ∑_{y ∈ B} p(y | x)) f(y′)

where y′ is drawn from p renormalized over 𝒴 ∖ B.
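The SAS estimator can be sketched on a toy categorical (our own code; B plays the role of the beam-search set, and unbiasedness is checked by enumerating the residual draw):

```python
p = [0.4, 0.3, 0.2, 0.1]   # toy model distribution over four sequences
f = [1.0, 2.0, 4.0, 8.0]   # statistic of interest
B = [0, 1]                 # deterministically chosen (e.g., beam search) set
P_B = sum(p[i] for i in B) # probability mass covered deterministically

def sas(y_prime):
    """Sum-and-sample estimate given one extra draw y' from outside B."""
    return sum(p[i] * f[i] for i in B) + (1.0 - P_B) * f[y_prime]

# unbiasedness: average over y' ~ p renormalized over the complement of B
rest = [i for i in range(4) if i not in B]
expected = sum((p[i] / (1.0 - P_B)) * sas(i) for i in rest)
truth = sum(p[i] * f[i] for i in range(4))
```

The deterministic part of the sum contributes no variance, which is what makes SAS competitive when the beam covers most of the probability mass.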
Stochastic Beam Search.

Stochastic beam search kool2019stochastic; Kool2020Estimating is a SWOR algorithm likewise built on top of beam search. The algorithm makes use of truncated Gumbel random variables at each iteration, resulting in a sampling design equivalent to performing the Gumbel-top-K trick vieira2014gumbel on the distribution p. Estimators built using SBS likewise follow the Horvitz–Thompson scheme of Eq. 11; we refer the reader to the original work for inclusion probability computations. The authors suggest normalizing the estimator by the sum of sample inclusion probabilities to help reduce variance; we therefore perform the same normalization in our experiments.
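For contrast with CPSBS, the Gumbel-top-K trick at the heart of SBS is easy to sketch for a single categorical distribution (our own illustrative code): perturb each log-probability with independent Gumbel noise and keep the K largest keys.

```python
import math
import random

def gumbel_top_k(logp, K, rng):
    """Sample K items without replacement from a categorical with
    log-probabilities logp via the Gumbel-top-K trick."""
    keys = []
    for i, lp in enumerate(logp):
        g = -math.log(-math.log(rng.random()))  # standard Gumbel noise
        keys.append((lp + g, i))
    keys.sort(reverse=True)                      # largest perturbed key first
    return [i for _, i in keys[:K]]

rng = random.Random(0)
logp = [math.log(q) for q in (0.5, 0.2, 0.2, 0.1)]
picked = gumbel_top_k(logp, 2, rng)
```

By the Gumbel-max property, the first returned index is distributed exactly according to the underlying categorical; SBS applies this idea over whole sequences with truncated Gumbels.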

(a) RMSE of bleu score estimator for different temperatures. Results are averaged across several sentences.
(b) RMSE of conditional model entropy estimator for various temperatures. Results are averaged across several sentences. We see a larger bias under both CPSBS and SBS at higher temperatures in these experiments.
Figure 2: RMSE evaluations.

To assess the error of our estimator, we compute its root MSE (RMSE) with respect to a baseline result. While computing the exact value of an expectation is typically infeasible in the sequence model setting, we can average our (unbiased) MC estimator in Eq. 10 over multiple runs to create a good baseline. Specifically, we compute our MC estimator 50 times for a large sample size; variances are reported in App. D.

Probabilistic models for language generation typically have large vocabularies. In this setting, the computation of Eq. 6 is inefficient due to the large number of items in the set B_t that are assigned very small probability under the model. We experiment with truncating this distribution such that the set of possible extensions of a sequence consists only of the highest-probability tokens within the core % of probability mass, similar to the process in nucleus sampling holtzman2019curious. We compare this approach to the original algorithm design in App. D and see that, empirically, results are virtually unchanged; the following results use this method. We also compare the decoding time of different sampling methods in Fig. 7.

bleu Score Estimation.

bleu Papineni:2002:BMA:1073083.1073135 is a widely used automatic evaluation metric for the quality of machine-generated translations. Estimates of bleu score are used in minimum risk training shen-etal-2016-minimum and reinforcement learning-based approaches RanzatoCAZ15 to machine translation. As such, accurate and low-variance estimates are critical for those algorithms' performance. Formally, we estimate the expected value of sentence-level bleu, whose dependence on the reference translation we leave implicit, under our NMT model. For comparison, we use the same sentences and similar annealing temperatures evaluated by kool2019stochastic (results at the highest temperature converged rapidly for all estimators, thus not providing an interesting comparison). We repeat the sampling 20 times and plot the value and standard deviation (indicated by the shaded region) of the different estimators in Fig. 1. From Fig. 1, we can see that CPSBS has lower variance than our baseline estimators across all temperatures and data points (the sampling distributions are not identical across strategies, hence the difference in variances even at equal temperatures). Especially in the low-temperature setting, our estimator converges rapidly, with minor deviation from the exact values even for small sample sizes. Additionally, in Fig. 2(a) we see that the RMSE is typically quite low except at higher temperatures. In such cases, we observe the effects of bias, similar to kool2019stochastic's observations.

Conditional Entropy Estimation.

We perform similar experiments for estimates of a model's conditional entropy, i.e., the expected value of −log p(y | x), whose dependence on x we again leave implicit. We show results in Fig. 2(b), with plots of the estimates in App. D, since results are quite similar to Fig. 1. We see further confirmation that our estimator built on CPSBS is generally quite efficient.

5.2 Diverse Sampling

We show how CPSBS can be used as a diverse sampling design for language generation models. We generate a sample of translations Y_T ∼ P according to the CPSBS scheme, with weights set as recommended in § 3 at each time step. In Fig. 3, we show the trade-off between minimum, average, and maximum sentence-level bleu score (as a quality measure) and n-gram diversity D, where we define n-gram diversity as the average fraction of unique vs. total n-grams for n = 1, 2, 3, 4 across the K sampled strings:

D = (1/4) ∑_{n=1}^{4} (# of unique n-grams in the K strings) / (# of n-grams in the K strings)
Figure 3: Average ( min and max) bleu score versus diversity for sample size . Points correspond to different annealing temperatures {0.1, …, 0.8}. Results for show very similar trends.

Metrics are averaged across the corpus. We follow the experimental setup of kool2019stochastic, using the newstest2014 dataset and comparing three other decoding methods: SBS, diverse beam search (DiverseBS; vijayakumar2018diverse), and ancestral sampling. As in their experiments, we vary the annealing temperature as a means of encouraging diversity; for DiverseBS we instead vary the strength parameter over the same range. Interestingly, we see that temperature has virtually no effect on the diversity of the set of results returned by CPSBS. Despite this artifact, for which we have not found a theoretical justification (while scaling the sampling weights by a constant should not change the distribution Q_t, exponentiating the weights, which is the computation performed by temperature annealing, should), the set returned by CPSBS is still overall more diverse (position on the x-axis) than the results returned by DiverseBS, and reflects better min, max, and average bleu in comparison to ancestral sampling. SBS provides a better spectrum for the diversity and bleu tradeoff; we thus recommend SBS when diverse sets are desired.
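The n-gram diversity metric above can be sketched as follows (our own implementation of the stated definition, operating on whitespace tokens for illustration):

```python
def ngram_diversity(sentences, max_n=4):
    """Average over n = 1..max_n of (# unique n-grams / # total n-grams),
    pooled across the K sampled sentences."""
    scores = []
    for n in range(1, max_n + 1):
        grams = [tuple(toks[i:i + n])
                 for toks in (s.split() for s in sentences)
                 for i in range(len(toks) - n + 1)]
        if grams:  # skip n larger than every sentence
            scores.append(len(set(grams)) / len(grams))
    return sum(scores) / len(scores)

div = ngram_diversity(["the cat sat", "the cat ran", "a dog ran"])
```

A set of K identical strings scores the minimum possible value (1 divided by K for each n), while K strings sharing no n-grams score 1.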

6 Conclusion

In this work, we present conditional Poisson stochastic beam search, a sampling-without-replacement strategy for sequence models. Through a simple modification to beam search, we turn this mainstay decoding algorithm into a stochastic process. We derive a low-variance, consistent estimator of inclusion probabilities under this scheme; we then present a general framework for using CPSBS to construct statistical estimators of expectations under sequence models. In our experiments, we observe a reduction in mean squared error and an increase in sample efficiency when using our estimator in comparison to several baselines, showing the benefits of CPSBS.


Appendix A Conditional Poisson Sampling

Here we provide a brief overview of the sampling design at the core of CPSBS: conditional Poisson sampling. We consider a base set B with |B| = N and map the elements of B to the integers 1, …, N. As a warm-up, we first consider (unconditional) Poisson sampling, also known as a Bernoulli point process. To sample a subset S ⊆ B, we proceed as follows: for each element n ∈ B, we flip a coin whose odds of heads are w_n; we then take S to be the subset of elements whose coin flips came up heads. However, this sampling scheme clearly does not guarantee a sample of K items, which can cause problems in our application: sampling more than K items would make the stochastic beam search process inefficient, while sampling fewer than K items, or even none, may not leave us with a large enough set at the end of our iterative process.

If we instead condition on the sets always having a prescribed size K, i.e., reject samples where |S| ≠ K, we arrive at the conditional Poisson process. Formally, the conditional Poisson distribution over subsets S ⊆ B is defined as follows:

P(S) = ∏_{n ∈ S} w_n / Z   if |S| = K,   and 0 otherwise

where Z normalizes over all size-K subsets. By analyzing Eq. 26, we can see that the sets with the largest product of weights are the most likely to be sampled; further, this distribution is invariant to rescaling of the weights due to the size requirement. This is similar to the conditions under which beam search chooses the set of largest-weight, i.e., highest-scoring, elements. Indeed, we note the extreme similarity between Eq. 4 and Eq. 26, the only difference being a dependence on a prior set. However, unlike beam search, sets with a lower weight product now have the possibility of being chosen.
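The equivalence between rejected Poisson samples and the CP design can be checked empirically (our own sketch; recall that odds w_n correspond to a coin probability of w_n / (1 + w_n)):

```python
import random
from collections import Counter

def cp_by_rejection(w, K, rng):
    """Poisson-sample (independent coin flips with odds w_n), rejecting
    until the sampled set has size exactly K."""
    N = len(w)
    incl = [wn / (1.0 + wn) for wn in w]  # odds -> coin probability
    while True:
        S = tuple(i for i in range(N) if rng.random() < incl[i])
        if len(S) == K:
            return S

rng = random.Random(0)
counts = Counter(cp_by_rejection([1.0, 0.5, 0.25, 0.25], 2, rng)
                 for _ in range(4000))
```

Under the CP design the most frequent size-2 set should be the one with the largest weight product, here {0, 1}; the empirical frequencies reflect exactly that.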

Appendix B Proofs

B.1 Unbiasedness of the Horvitz–Thompson Estimator

Proposition B.1.

Given a SWOR design P over the set 𝒴 with inclusion probabilities π, the Horvitz–Thompson estimator (Eq. 11) gives us an unbiased estimator of E_{y∼p}[f(y)], where f is a function whose expectation under p we seek to approximate.

Proof. Taking the expectation of Eq. 11 over sampled sets Y ∼ P,

E[G_HT] = ∑_Y P(Y) ∑_{y ∈ Y} (p(y | x)/π(y)) f(y) = ∑_{y ∈ 𝒴} (p(y | x)/π(y)) f(y) ∑_{Y ∋ y} P(Y) = ∑_{y ∈ 𝒴} p(y | x) f(y)

since ∑_{Y ∋ y} P(Y) = π(y). ∎
B.2 Proofs of Expected Values and Variances of Inclusion Probability Estimators

See Lemma 4.1.

Proof. Consider the probability of sampling the sequence of beams Y_1, …, Y_T according to Q̃. Algebraic manipulation reveals:

Q̃(Y_1, …, Y_T) = ∏_{t=1}^{T} Q_t(Y_t | Y_{t−1}) / π_t(y_{≤t}) = Q(Y_1, …, Y_T) / ∏_{t=1}^{T} π_t(y_{≤t})

which proves the identity. ∎

See Proposition 4.1.

Proof.

i) The estimator is easily shown to be unbiased:

E[π̂(y)] = (1/M) ∑_{m=1}^{M} E[1{y ∈ Y_T^{(m)}}] = P(y ∈ Y_T) = π(y)

and its variance may be derived as follows:

Var[π̂(y)] = (1/M) Var[1{y ∈ Y_T}] = π(y)(1 − π(y)) / M

ii) By the strong law of large numbers, we have

π̂(y) → π(y)   almost surely as M → ∞

Since x ↦ 1/x is continuous for x > 0, we may appeal to the continuous mapping theorem to achieve consistency:

1/π̂(y) → 1/π(y)   almost surely

We can compute the asymptotic variance by the delta rule:

AVar[1/π̂(y)] = (1/π(y)²)² · π(y)(1 − π(y))/M = (1 − π(y)) / (M π(y)³)

∎
See Proposition 4.2.

Proof.

i) We first prove that the estimator of the inclusion probabilities is unbiased through the following manipulation:

E_{Q̃}[π̂(y)] = E_{Q̃}[∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1})] = ∑_{Y_1, …, Y_T} Q̃(Y_1, …, Y_T) · Q(Y_1, …, Y_T) / Q̃(Y_1, …, Y_T) = π(y)

where the second equality follows from Lemma 4.1 and the final equality from the fact that every beam sequence in the support of Q̃ contains y.

ii) To show consistency, we appeal to the strong law of large numbers and the continuous mapping theorem. By the strong law of large numbers, we have that

π̂(y) → π(y)   almost surely as M → ∞

Since x ↦ 1/x is continuous for x > 0, we have

1/π̂(y) → 1/π(y)   almost surely

which shows consistency.

Now we derive a bound on the asymptotic variance of the inverse inclusion probabilities. Suppose the per-time-step importance weights are bounded as in the statement of Proposition 4.2. We start with the variance of importance sampling, a standard result monte-carlo, and then proceed with algebraic manipulation integrating the assumption above.