1 Introduction
Many NLP tasks require the prediction of structured outputs, such as sentences or parse trees, either during decoding or as part of a training algorithm. For today's neural architectures, beam search (reddy1977) has become the decoding algorithm of choice due to its efficiency and empirical performance (AAAI1714571; edunovetal2018understanding; XLNET; meisteretal2020best). Beam search is a deterministic method, which invites a natural question: what is the proper stochastic generalization of beam search? Several recent papers have investigated this question (kool2019stochastic; Kool2020Estimating; shi2020uniquerandomizer). Here we build on this line of work and introduce an alternative stochastic beam search that we contend is a more faithful stochasticization of the original algorithm, in that it recovers standard beam search as a special case. We name our algorithm conditional Poisson stochastic beam search (CPSBS), as we draw on the conditional Poisson sampling scheme (hajek1964) in its construction. The relationship between CPSBS and other common decoding strategies is displayed visually in Table 1.
At every iteration, CPSBS replaces the top-k operator in the beam search algorithm with conditional Poisson sampling, resulting in a decoding strategy that generates samples without replacement. Importantly, annealing our sampling distribution at each time step turns local sampling into a local top-k computation and thereby recovers beam search. We subsequently show that these samples can be used to construct a statistically consistent estimator for the expected value of an arbitrary function of the output.
In our experiments with neural machine translation models, we observe that CPSBS leads to better estimates of expected BLEU and conditional model entropy than SBS and the sum-and-sample estimator (Kool2020Estimating), distinctly outperforming Monte Carlo sampling for both small sample sizes and low temperatures. Furthermore, we find that CPSBS can be used as a diverse sampling strategy. We take these results as confirmation that CPSBS is a useful tool in the newfound arsenal of sampling strategies for neural sequence models.

2 Beam Search
In this section, we overview the necessary background on neural sequence models and beam search in order to motivate our algorithm in § 3.
Neural Sequence Models.
We consider locally normalized probabilistic models p over sequences y:

p(y) = ∏_{t=1}^{|y|} p(y_t | y_{<t})  (1)
where y is a member of the set of well-formed outputs Y. In the context of language generation models, well-formed outputs are sequences of tokens from a vocabulary V; all begin and end with the special tokens bos and eos, respectively. We use y_{<t} to represent the subsequence y_1 ⋯ y_{t−1}. In this work, we consider the setting where the maximum sequence length is upper-bounded; we denote this upper bound by T. Without loss of generality, we may condition on an input x, as is necessary for machine translation and other conditional generation tasks.
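To make the locally normalized factorization concrete, the sketch below (our illustration, not from the paper; the names `log_softmax` and `sequence_logprob` are ours) scores a token sequence by summing per-step log-probabilities obtained from a softmax over unnormalized scores:

```python
import math

def log_softmax(logits):
    # Normalize unnormalized scores into log-probabilities over the vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def sequence_logprob(step_logits, token_ids):
    # step_logits[t] holds the model's scores over V at step t, already
    # conditioned on the prefix y_<t (and the input x, if any);
    # log p(y) = sum_t log p(y_t | y_<t).
    assert len(step_logits) == len(token_ids)
    return sum(log_softmax(scores)[tok]
               for scores, tok in zip(step_logits, token_ids))
```

With a two-token vocabulary and uniform scores at every step, each token contributes log(1/2), illustrating how the product of conditionals becomes a sum of log-probabilities.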
Beam Search.
Beam search is a commonly used search heuristic for finding an approximate solution to the following optimization problem:

y* = argmax_{y ∈ Y} log p(y)  (2)
Its most straightforward interpretation is as a pruned version of breadth-first search, where the breadth of the search is narrowed to the top K candidates. However, here we will present beam search through a nonstandard lens (meisteretal2020beam; meisteretal2021determinantal) in order to emphasize the connection with our stochastic generalization in § 3. Specifically, we present the algorithm as iteratively finding the highest-scoring set under a specific set function.
Under this paradigm, the initial beam Y_0 contains only the bos token. At subsequent steps t, beam search selects the K highest-scoring candidates from the set B_t that we define below:^{2}^{2}2Sequences already ending in eos are not extended and are simply added to the set "as is."

B_t = { y ∘ y′ : y ∈ Y_{t−1}, y′ ∈ V̄ }  (3)

where ∘ is sequence concatenation. Candidate sets whose elements have collectively higher probability under the model p have a higher score.
This process continues until all candidates end in eos, or t = T. For notational ease, we define N = |B_t|; throughout this paper, we will assume N ≥ K and identify the elements of B_t with the integers 1, …, N. We can formulate the time-step-dependent set function whose argmax beam search finds as
Q_t(Y_t = Y | Y_{t−1}) ∝ ∏_{n ∈ Y} w_n  (4)

where w_n is the weight of the n-th element of B_t.
To recover beam search, we set the weights equal to the candidates' probabilities under the model p. Note that we leave the constraint that |Y| = K implicit in Eq. 4. As should be clear from the notation, this set function only assigns positive scores to subsets of B_t of size exactly K, and the assigned score is proportional to the product of the probabilities of the candidates under the model p.
Putting this all together, beam search
may be viewed as the following iterative process:
Y_0 = {bos}
Y_t = argmax_{Y′_t ⊆ B_t} Q_t(Y′_t | Y_{t−1})
return Y_T
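The recursion above can be sketched in a few lines. In this toy implementation (ours; the `expand` callback is a hypothetical stand-in for the model's next-token distribution), the K highest-scoring candidates are kept at each step and eos-terminated hypotheses are carried over untouched:

```python
import heapq
import math

def beam_search(expand, K, T, eos):
    # expand(prefix) -> list of (token, log p(token | prefix)) pairs.
    # Each beam entry is (sequence, cumulative log-probability).
    beams = [((), 0.0)]
    for _ in range(T):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == eos:
                # Finished hypotheses are added to the candidate set "as is".
                candidates.append((seq, lp))
            else:
                for tok, tok_lp in expand(seq):
                    candidates.append((seq + (tok,), lp + tok_lp))
        # Selecting the K candidates with the largest summed log-probability
        # is exactly the argmax over the set function of Eq. 4.
        beams = heapq.nlargest(K, candidates, key=lambda c: c[1])
    return beams
```

For instance, with a two-token vocabulary where eos has probability 0.6 at every step, the top hypothesis after two steps is the immediately terminated sequence.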
3 Conditional Poisson Stochastic Beams
Our paper capitalizes on a very simple observation: rather than taking its argmax, we may renormalize Eq. 4 into a distribution and sample without replacement a size-K set at each iteration:
Y_0 = {bos}
Y_t ∼ Q_t(· | Y_{t−1})
return Y_T
This recursion corresponds to performing conditional Poisson sampling (CPS; hajek1964; see App. A for an overview), a common sampling-without-replacement design (tille_book),^{3}^{3}3A sampling design is a probability distribution over sets of samples.
at every time step. Thus we term this scheme conditional Poisson stochastic beam search. CPSBS gives us a probability distribution over sets of candidates of size K, i.e., the final beam Y_T. We denote the CPSBS distribution by P, and we write Y_T ∼ P to indicate that Y_T is the stochastic beam at the end of a sampling run. We may write P as a marginal probability, summing over all sequences of beams that could have resulted in Y_T:^{4}^{4}4This formulation reveals that it is wildly intractable to compute P(Y_T) exactly.

P(Y_T) = Σ_{Y_1, …, Y_{T−1}} ∏_{t=1}^{T} Q_t(Y_t | Y_{t−1})  (5)

Note that the structural zeros of Q_t prevent any incompatible sequence of beams from contributing to this sum. We provide a theoretical analysis of the scheme in § 4 and an empirical analysis in § 5.
Normalizing Q_t.
At each time step t, we compute Q_t(· | Y_{t−1})—a distribution over subsets of size K of a base set B_t—using the CPS design. The normalizing constant for this distribution is defined as

Z = Σ_{Y ⊆ B_t, |Y| = K} ∏_{n ∈ Y} w_n  (6)

Despite there being exponentially many summands, we can sum over all subsets in O(NK) time via the following recurrence relation:^{5}^{5}5The reader may recognize this recurrence as the weighted generalization of Pascal's triangle, C(n, k) = C(n−1, k−1) + C(n−1, k), which is why we chose the notation W.

W(n, k) = w_n · W(n−1, k−1) + W(n−1, k)

with base cases W(n, 0) = 1 and W(0, k) = 0 for k > 0. We give complete pseudocode in App. C. Correctness of this algorithm is shown in taskar_dpp. The normalizing constant can then be efficiently computed as

Z = W(N, K)  (7)
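As a concrete sketch (our illustrative code, not the paper's App. C pseudocode), the dynamic program below fills a table W(n, k) of subset-weight sums and reads off the normalizing constant as the entry W(N, K):

```python
def cp_normalizer(w, K):
    # W[n][k] = sum over all size-k subsets of the first n weights
    # of the product of their weights (a weighted Pascal's triangle).
    N = len(w)
    W = [[0.0] * (K + 1) for _ in range(N + 1)]
    for n in range(N + 1):
        W[n][0] = 1.0  # the empty subset has product 1
    for n in range(1, N + 1):
        for k in range(1, K + 1):
            # Element n is either in the subset (contributing w[n-1]) or not.
            W[n][k] = w[n - 1] * W[n - 1][k - 1] + W[n - 1][k]
    return W[N][K]
```

With unit weights the normalizer reduces to the binomial coefficient C(N, K), which is a convenient sanity check.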
Sampling from Q_t.
We can efficiently sample sets from Q_t using the algorithm below:

In words, the algorithm considers adding each element one at a time until K elements have been sampled. Notice that line 4 adjusts the probability of sampling an item given the number of items already sampled, which ensures that exactly K elements are sampled at termination.
Setting w_n.
The weight w_n assigned to the n-th item of B_t directly affects its probability of being included in the sampled set, i.e., Pr(n ∈ Y_t), also termed an item's inclusion probability. In this paper, we write π(n) to denote the inclusion probability under the distribution Q_t, defined as:

π(n) = Σ_{Y ∋ n} Q_t(Y | Y_{t−1})  (8)

One strategy is to choose the weights at time step t to be the candidates' probabilities under the model. This choice recovers beam search when we anneal our chosen weights: replacing w_n with w_n^{1/τ}, as the temperature parameter τ → 0, the CP distribution will assign probability 1 to the set containing the top K elements.^{6}^{6}6In the event of ties, annealed CP will converge to a distribution that breaks ties uniformly at random.
Finding weights that result in prespecified inclusion probabilities is possible, but it requires solving a numerical optimization problem (aires1999algorithms; GRAFSTROM20092111). Further, in CPSBS, we sample from a different distribution at each time step, and it would be quite slow to solve the numerical optimization problem at every iteration. Luckily, the choice w_n = π(n)/(1 − π(n)) yields a good approximation to the target inclusion probabilities in both theory and practice (hajek1981sampling; bondesson; aires1999algorithms).
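To illustrate the annealing claim, the brute-force check below (ours; feasible only for tiny base sets) enumerates all size-K subsets and evaluates the CP probability of a given set under annealed weights:

```python
import itertools
import math

def cp_set_prob(logw, tau, Y, K):
    # Probability of the particular size-K set Y under conditional Poisson
    # sampling with annealed weights w_n^(1/tau) = exp(logw[n] / tau).
    w = [math.exp(lw / tau) for lw in logw]
    Z = sum(math.prod(w[i] for i in S)
            for S in itertools.combinations(range(len(w)), K))
    return math.prod(w[i] for i in Y) / Z
```

As τ shrinks, the probability of the set containing the top K elements climbs toward 1, matching the claim that annealed CP sampling recovers the beam search top-K computation.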
4 Statistical Estimation with Conditional Poisson Stochastic Beam Search
In this section, we discuss statistical estimation with CPSBS samples. To that end, we construct two estimators with different properties; however, only the second provides good performance in practice, as discussed in § 5.
4.1 The Horvitz–Thompson Estimator
We build upon the Horvitz–Thompson (HT) estimator (HorvitzThompson1952), which is a common technique for estimation from sampling-without-replacement (SWOR) schemes.
Let f : Y → ℝ be a function; we seek to approximate its expected value under p:

E_{y∼p}[f(y)] = Σ_{y ∈ Y} p(y) f(y)  (9)

The Monte Carlo estimator of the above quantity is

G_MC = (1/M) Σ_{m=1}^{M} f(y^{(m)})  (10)
where y^{(m)} ∼ p i.i.d. However, in the special case of sampling from a finite population—which is extremely common in NLP—Monte Carlo sampling can be very wasteful. For example, if a distribution is very peaked, it will sample the same item repeatedly; this could lead to inaccurate approximations for some f. As a consequence, the mean square error (MSE) of the estimator with respect to E[f] can be quite high for small M. Indeed, we see this empirically in Fig. 2.
Taking samples without replacement allows us to cover more of the support of p in our estimate of E[f]. However, we must take into account that our samples are no longer independent. We now define the HT estimator, using notation specifically for the case of CPSBS, where Y_T ∼ P:

G_HT = Σ_{y ∈ Y_T} (p(y) / π(y)) f(y)  (11)

where CPSBS's inclusion probability for a candidate y is

π(y) = Σ_{Y_T ∋ y} P(Y_T)  (12)
i.e., the probability of sampling a set that contains the element y. In Eq. 11, the distribution P may be viewed as a proposal distribution in the sense of importance sampling (mcbook), with the ratios p(y)/π(y) as the corresponding importance weight corrections. If we can exactly compute π(y), then the HT estimator is unbiased^{7}^{7}7Note that it is common to normalize Eq. 11 by the sum of importance weights, i.e., divide by the sum of the ratios p(y)/π(y). While this leads to a biased estimator, it can significantly reduce variance, which is often worthwhile. (see § B.1 for a proof). However, the summation in Eq. 12 is intractable, so we resort to estimation.
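A sketch of the HT estimator, including the footnote's self-normalized variant, is given below (our illustrative code; in practice `incl_prob` would come from the inclusion probability estimators of § 4.2):

```python
def horvitz_thompson(beam, model_prob, incl_prob, f, normalize=False):
    # beam: the K distinct sequences in the sampled set Y_T.
    # model_prob[y], incl_prob[y]: p(y) and pi(y) for each y in the beam.
    # Each item is weighted by p(y) / pi(y), the importance correction.
    weights = {y: model_prob[y] / incl_prob[y] for y in beam}
    total = sum(weights[y] * f(y) for y in beam)
    if normalize:
        # Biased but often lower-variance: divide by the summed weights.
        return total / sum(weights.values())
    return total
```

As a sanity check, if the design always includes every element (all inclusion probabilities equal to 1), the unnormalized estimator reduces to the exact expectation.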
4.2 Estimating Inclusion Probabilities
In this section, we develop statistical estimators of the inclusion probabilities under conditional Poisson stochastic beam search. An important caveat: the analysis in this section only applies to the estimators of the inverse inclusion probabilities themselves. Further analysis may be undertaken to analyze the variance of the Horvitz–Thompson estimators that make use of these estimators.
4.2.1 Naïve Monte Carlo
It is not straightforward to estimate the reciprocal inclusion probabilities directly. Thus, we attempt to estimate the inclusion probabilities and take the reciprocal of this estimate. This strategy leads to a consistent, but biased, estimator.^{8}^{8}8Since E[1/X] ≥ 1/E[X] by Jensen's inequality for a positive random variable X, the reciprocal of an unbiased estimate of π(y) is not an unbiased estimate of 1/π(y). One obvious way to derive an inclusion probability estimator is the Monte Carlo estimator:

π̂(y) = (1/M) Σ_{m=1}^{M} 1{y ∈ Y_T^{(m)}}  (13)

where Y_T^{(m)} ∼ P.
Proposition 4.1.
Eq. 13 has the following two properties:

a) π̂(y) is an unbiased estimator of π(y) and

Var(π̂(y)) = π(y)(1 − π(y)) / M  (14)

b) 1/π̂(y) is a consistent estimator of 1/π(y) with asymptotic variance

AVar(1/π̂(y)) = (1 − π(y)) / (M π(y)³)  (15)
Here AVar denotes the asymptotic variance, i.e., the variance once the number of samples M is large enough that the central limit theorem has kicked in (bickel2015mathematical).

Proof.
Proof given in § B.2. ∎
Qualitatively, this result tells us that if we are asking about the inverse inclusion probability of a candidate with a low inclusion probability, our estimator may have very high variance. Indeed, it is unlikely that we could derive an estimator without this qualitative property, due to the presence of the inverse. Moreover, the estimator given in Eq. 13 is not of practical use: if we are interested in the inverse inclusion probability of a specific candidate y, then we may have to sample a very large number of beams until we eventually sample one that actually contains y. In practice, this means that our estimate of the inclusion probability for a rare y will often be zero, which we cannot invert.^{9}^{9}9One solution would be to smooth our estimates of the inclusion probabilities, adding a small constant to ensure that we do not divide by zero, but we find the next approach to be more methodologically sound. Instead, we pursue an importance sampling strategy for estimating π(y), which we outline in the next section.
4.2.2 Importance Sampling
We now turn to an inclusion probability estimator based on importance sampling. Recall from Eq. 12 that the inclusion probability for y is a massive summation over sequences of possible beams that could have generated y. Rather than computing the sum, we estimate it by sampling. Our procedure starts by generating hindsight samples from the following proposal distribution, which is conditioned on containing y:

Q̃_t(Y | Y_{t−1}) ∝ Q_t(Y | Y_{t−1}) · 1{y_{≤t} ∈ Y}  (16)

In words, Q̃_t is conditioned on its sets containing the prefix y_{≤t} (thus it is always the case that y ∈ Y_T).^{10}^{10}10This proposal distribution can be realized through a minor modification of our algorithm in § 3, where the element corresponding to y_{≤t} is placed at the beginning and added to the sampled set deterministically. For brevity, we omit an explicit notational dependence of Q̃_t and P̃ on y.
Lemma 4.1.
The joint proposal distribution^{11}^{11}11We have omitted dependency on y for brevity. P̃ may be expressed in terms of P as follows:

P̃(Y_1, …, Y_T) = P(Y_1, …, Y_T) / ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1})  (17)

where we define P(Y_1, …, Y_T) = ∏_{t=1}^{T} Q_t(Y_t | Y_{t−1}) as the joint probability of the beams under the original distributions, and π_t(y_{≤t} | Y_{t−1}) is the per-time-step inclusion probability of the prefix y_{≤t}. We omit that both P and P̃ are conditioned on the input x.
Proof.
See § B.2. ∎
In terms of computation, Eq. 16 makes use of the fact that the per-time-step inclusion probability of a given prefix can be computed efficiently with dynamic programming using the following identity:

π_t(n) = w_n · W_{∖n}(N−1, K−1) / W(N, K)  (18)

where W_{∖n} denotes the subset-weight recurrence computed over B_t with the n-th element removed. For completeness, we give pseudocode in App. C. Given samples Y^{(m)}_{1:T} ∼ P̃ for P̃ defined via Eq. 16 with respect to a given y, we propose the following unbiased estimator of the inclusion probabilities:

π̂(y) = (1/M) Σ_{m=1}^{M} ∏_{t=1}^{T} π_t(y_{≤t} | Y^{(m)}_{t−1})  (19)
where y_{≤t} is a prefix of y. One simple derivation of Eq. 19 is as an importance sampler. We start with the estimator given in Eq. 13 and perform the standard algebraic manipulations witnessed in importance sampling:

π(y) = Σ_{Y_1, …, Y_T} P(Y_1, …, Y_T) 1{y ∈ Y_T}
= Σ_{Y_1, …, Y_T} P̃(Y_1, …, Y_T) ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1})  (i)
= E_{P̃}[ ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1}) ]  (20)

where equality (i) above follows from Lemma 4.1. This derivation serves as a simple proof that Eq. 19 inherits unbiasedness from Eq. 13.
Proposition 4.2.
Eq. 19 has the following two properties:

a) π̂(y) is an unbiased estimator of π(y);

b) The estimator 1/π̂(y) of the inverse inclusion probability is consistent, with the following upper bound on the asymptotic variance:

(21)

where we assume that the bound

(22)

holds for all y and all beam sequences.
Proof.
Proof given in § B.2. ∎
Proposition 4.2 tells us that we can use Eq. 19 to construct a consistent estimator of the inverse inclusion probabilities. Moreover, since y ∈ Y_T by construction under P̃, importance sampling always yields a strictly positive estimate π̂(y), unlike the Monte Carlo estimator of § 4.2.1. We further see that, to the extent that the per-time-step product approximates π(y), we may expect the variance of Eq. 19 to be small—specifically in comparison to the naïve Monte Carlo estimator in Eq. 13—which is often the case for estimators built using importance sampling techniques when a proposal distribution is chosen judiciously (montecarlo). Thus, given our estimator in Eq. 19, we can now construct a practically useful estimator for E[f] using the HT estimator in Eq. 11. In the next section, we observe that this estimator is quite efficient in the sequence model setting.
5 Experiments
We repeat the analyses performed by kool2019stochastic, running experiments on neural machine translation (NMT) models; for reproducibility, we use the pretrained Transformer model for WMT'14 (bojar2014findings) English–French made available by fairseq^{12}^{12}12https://github.com/pytorch/fairseq/tree/master/examples/translation (ott2019fairseq). We evaluate on the En–Fr newstest2014 set, containing 3003 sentences. Further details can be found in App. D. Our implementation of CPSBS modifies the beam search algorithm from the fairseq library. We additionally use the beam search, stochastic beam search, diverse beam search, and ancestral sampling algorithms available in fairseq.
5.1 Statistical Estimators for Language Generation Models
Estimators have a large number of applications in machine learning. For example, the REINFORCE algorithm (reinforce) constructs a gradient estimator using the score function; minimum Bayes risk decoding (kumarbyrne2004minimum) uses an estimate of risk in its optimization problem. In this section, we compare estimators of sentence-level BLEU score and conditional model entropy for NMT models. Notably, NMT models trained to minimize cross-entropy with the empirical distribution^{13}^{13}13Label smoothing (label_smoothing) is typically also employed, which leads to even higher-entropy distributions. are not peaked distributions (ott2018analyzing; eikema2020map); thus, standard estimation techniques, e.g., Monte Carlo, should generally provide good results. However, we can vary the annealing parameter of our model in order to observe the behavior of our estimator under both high- and low-entropy distributions, making this a comprehensive case study. Here the annealed model distribution is computed as

p_τ(y_t | y_{<t}) ∝ p(y_t | y_{<t})^{1/τ}  (23)
where we should expect a standard Monte Carlo estimator to provide good results at τ close to 1, when p is naturally high entropy. We test our estimator in this setting so as to give a comparison in a competitive setting. Specifically, we assess the performance of our estimator of E[f] given in Eq. 11—using inclusion probability estimates from Eq. 19 and importance weight normalization—in comparison to three other approaches: Monte Carlo (MC) sampling, the sum-and-sample (SAS) estimator, and stochastic beam search (SBS).
Monte Carlo.
Under the Monte Carlo sampling scheme with sample size M, we estimate the expected value of f under our model using Eq. 10 with a sample y^{(1)}, …, y^{(M)} ∼ p.
Sum and Sample.
The sum-and-sample estimator (botev17a; liu2019rao; Kool2020Estimating) is an unbiased estimator that takes as input a deterministically chosen set Y of size K and samples an additional y′ from the remaining elements; we obtain the set Y using beam search in our experiments. Formally, the SAS estimator can be written as:

G_SAS = Σ_{y ∈ Y} p(y) f(y) + (1 − Σ_{y ∈ Y} p(y)) f(y′)  (24)
Stochastic Beam Search.
Stochastic beam search (kool2019stochastic; Kool2020Estimating) is a SWOR algorithm likewise built on top of beam search. The algorithm makes use of truncated Gumbel random variables at each iteration, resulting in a sampling design equivalent to performing the Gumbel-top-k trick (vieira2014gumbel) on the distribution p. Estimators built using SBS likewise follow the Horvitz–Thompson scheme of Eq. 11; we refer the reader to the original work for the inclusion probability computations. The authors suggest normalizing the estimator by the sum of sample inclusion probabilities to help reduce variance; we therefore perform the same normalization in our experiments.

To assess the error of our estimators, we compute their root MSE (RMSE) with respect to a baseline result. While computing the exact value of an expectation is typically infeasible in the sequence model setting, we can average our (unbiased) MC estimator in Eq. 10 over multiple runs to create a good baseline. Specifically, we compute our MC estimator 50 times for a large sample size; variances are reported in App. D.
Probabilistic models for language generation typically have large vocabularies. In this setting, the computation of Eq. 6 is inefficient due to the large number of items in the set B_t that are assigned very small probability under the model. We experiment with truncating this distribution such that the set of possible extensions of a sequence consists only of the highest-probability tokens within the core of the probability mass, similar to the process in nucleus sampling (holtzman2019curious). We compare this approach to the original algorithm design in App. D and see that, empirically, results are virtually unchanged; the following results use this method. We also compare the decoding time of different sampling methods in Fig. 7.
BLEU Score Estimation.
BLEU (Papineni:2002:BMA:1073083.1073135) is a widely used automatic evaluation metric for the quality of machine-generated translations. Estimates of BLEU score are used in minimum risk training (shenetal2016minimum) and reinforcement learning-based approaches to machine translation (RanzatoCAZ15). As such, accurate and low-variance estimates are critical for these algorithms' performance. Formally, we estimate the expected value of sentence-level BLEU under our NMT model for a given reference translation, whose dependence we leave implicit. For comparison, we use the same sentences and similar annealing temperatures^{14}^{14}14Results at the other temperatures converged rapidly for all estimators, thus not providing an interesting comparison. evaluated by kool2019stochastic. We repeat the sampling 20 times and plot the value and standard deviation (indicated by the shaded region) of the different estimators in Fig. 1. From Fig. 1, we can see that CPSBS has lower variance than our baseline estimators across all temperatures and data points.^{15}^{15}15The sampling distributions are not the same across strategies, hence the difference in variances even at the smallest sample size. Especially in the low-temperature setting, our estimator converges rapidly, with minor deviation from the exact values even for small sample sizes. Additionally, in Fig. 1(a) we see that the RMSE is typically quite low except at higher temperatures. In such cases, we observe the effects of bias, similar to kool2019stochastic's observations.

Conditional Entropy Estimation.
We perform similar experiments for estimates of a model's conditional entropy, i.e., the expected value of −log p(y), whose dependence on the input we again leave implicit. We show results in Fig. 1(b), with plots of the estimated values in App. D, since results are quite similar to Fig. 1. We see further confirmation that our estimator built on CPSBS is generally quite efficient.
5.2 Diverse Sampling
We show how CPSBS can be used as a diverse set sampling design for language generation models. We generate a sample of translations Y_T ∼ P according to the CPSBS scheme, with weights set to annealed model probabilities at each time step, as recommended in § 3. In Fig. 3, we show the trade-off between minimum, average, and maximum sentence-level BLEU score (as a quality measure) and n-gram diversity, where we define n-gram diversity as the average fraction of unique vs. total n-grams for n = 1, 2, 3, 4 in a sentence:

d = (1/4) Σ_{n=1}^{4} (# of unique n-grams in sentence) / (# of n-grams in sentence)  (25)

Metrics are averaged across the corpus. We follow the experimental setup of kool2019stochastic, using the newstest2014 dataset and comparing three other decoding methods: SBS, diverse beam search (DiverseBS; vijayakumar2018diverse), and ancestral sampling. As in their experiments, we vary the annealing temperature as a means of encouraging diversity; for DiverseBS we instead vary the strength parameter over the same range. Interestingly, we see that temperature has virtually no effect on the diversity of the set of results returned by CPSBS. Despite this artifact, for which we have not found a theoretical justification,^{16}^{16}16While scaling sampling weights by a constant should not change the distribution Q_t, an exponentiation of the weights—which is the computation performed by temperature annealing—should. the set returned by CPSBS is still overall more diverse (position on the x-axis) than the results returned by DiverseBS, and reflects better min, max, and average BLEU in comparison to ancestral sampling. SBS provides a better spectrum for the diversity–BLEU trade-off; we thus recommend SBS when diverse sets are desired.
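For concreteness, the per-sentence metric in Eq. 25 can be computed as follows (our illustrative sketch; tokenization is assumed to be given):

```python
def ngram_diversity(tokens, max_n=4):
    # Average over n = 1..max_n of (# unique n-grams) / (# total n-grams).
    fracs = []
    for n in range(1, max_n + 1):
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if grams:  # skip orders longer than the sentence itself
            fracs.append(len(set(grams)) / len(grams))
    return sum(fracs) / len(fracs)
```

A repetition-free sentence scores 1.0, while a highly repetitive one scores close to the inverse of its repetition count.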
6 Conclusion
In this work, we present conditional Poisson stochastic beam search, a samplingwithoutreplacement strategy for sequence models. Through a simple modification to beam search, we turn this mainstay decoding algorithm into a stochastic process. We derive a lowvariance, consistent estimator of inclusion probabilities under this scheme; we then present a general framework for using CPSBS to construct statistical estimators for expectations under sequence models. In our experiments, we observe a reduction in mean square error, and an increase in sample efficiency, when using our estimator in comparison to several baselines, showing the benefits of CPSBS.
References
Appendix A Conditional Poisson Sampling
Here we provide a brief overview of the sampling design at the core of CPSBS: conditional Poisson sampling. We consider a base set B, where |B| = N, and we map the elements of B to the integers 1, …, N. As a warm-up, we first consider (unconditional) Poisson sampling, also known as a Bernoulli point process. To sample a subset Y ⊆ B, we do as follows: for each element n, we flip a coin whose odds of heads is w_n. Then, we simply take Y to be the subset of elements whose coin flips were heads. However, this sampling scheme clearly does not guarantee a sample of K items, which can cause problems in our application: sampling more than K items would make the stochastic beam search process inefficient, while sampling fewer than K—or even 0—items may not leave us with a large enough set at the end of our iterative process.

If instead we condition on the sets always having a prescribed size K, i.e., reject samples where |Y| ≠ K, we arrive at the conditional Poisson process. Formally, the conditional Poisson distribution is defined over size-K subsets Y ⊆ B as follows:

P_CP(Y) ∝ ∏_{n ∈ Y} w_n  (26)

By analyzing Eq. 26, we can see that the sets with the largest product of weights are the most likely to be sampled; further, this distribution is invariant to rescaling of the weights due to the size requirement. This is similar to the conditions under which beam search chooses the set of largest-weight, i.e., highest-scoring, elements. Indeed, we note the extreme similarity between Eq. 4 and Eq. 26, the only difference being a dependence on a prior set. However, unlike beam search, sets with a lower weight product now have the possibility of being chosen.
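The rejection view of conditional Poisson sampling can be sketched directly (our illustration; fine for exposition, though the sequential algorithm of § 3 is far more efficient when the event |Y| = K is rare):

```python
import random

def poisson_sample(probs, rng):
    # Independent Bernoulli draw for each element (a Bernoulli point process).
    return [n for n, p in enumerate(probs) if rng.random() < p]

def conditional_poisson_by_rejection(probs, K, rng=random):
    # Keep redrawing until the sampled set has exactly K elements,
    # i.e., condition the Poisson design on the event |Y| = K.
    while True:
        Y = poisson_sample(probs, rng)
        if len(Y) == K:
            return Y
```

Note that with weights set to the inclusion odds, w_n = p_n / (1 − p_n), the resulting set distribution is proportional to the product of the w_n over the sampled set, matching Eq. 26.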
Appendix B Proofs
B.1 Unbiasedness of the Horvitz–Thompson Estimator
Proposition B.1.
Given a SWOR design P over the set Y with inclusion probabilities π(·), the Horvitz–Thompson estimator (Eq. 11) gives us an unbiased estimator of E_{y∼p}[f(y)], where f is a function whose expectation under p we seek to approximate.
Proof.
E[G_HT] = E_{Y_T ∼ P}[ Σ_{y ∈ Y_T} (p(y)/π(y)) f(y) ]  (27a)
= E_{Y_T ∼ P}[ Σ_{y ∈ Y} 1{y ∈ Y_T} (p(y)/π(y)) f(y) ]  (27b)
= Σ_{y ∈ Y} (p(y)/π(y)) f(y) · E_{Y_T ∼ P}[1{y ∈ Y_T}]  (27c)
= Σ_{y ∈ Y} (p(y)/π(y)) f(y) · π(y)  (27d)
= Σ_{y ∈ Y} p(y) f(y)  (27e)
= E_{y∼p}[f(y)]  (27f)
∎
B.2 Proofs of Expected Values and Variances of Inclusion Probability Estimators
Lemma 4.1 (restated).
Proof.
Consider the probability of sampling the beam sequence Y_1, …, Y_T according to P̃. Algebraic manipulation reveals:

P̃(Y_1, …, Y_T) = ∏_{t=1}^{T} Q_t(Y_t | Y_{t−1}) · 1{y_{≤t} ∈ Y_t} / π_t(y_{≤t} | Y_{t−1})  (28a)
= P(Y_1, …, Y_T) / ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1})  (28b)
which proves the identity. ∎
Proposition 4.1 (restated).
Proof.
i) The estimator is easily shown to be unbiased:

E[π̂(y)] = (1/M) Σ_{m=1}^{M} E[1{y ∈ Y_T^{(m)}}] = π(y)  (29)

and its variance may be derived as follows:

Var(π̂(y)) = (1/M²) Σ_{m=1}^{M} Var(1{y ∈ Y_T^{(m)}})  (30a)
= (1/M²) Σ_{m=1}^{M} ( E[1{y ∈ Y_T^{(m)}}²] − E[1{y ∈ Y_T^{(m)}}]² )  (30b)
= (1/M²) Σ_{m=1}^{M} ( π(y) − π(y)² )  (30c)
= π(y)(1 − π(y)) / M  (30d)
ii) By the strong law of large numbers, we have

π̂(y) → π(y) almost surely as M → ∞  (31)

Since x ↦ 1/x is continuous for x > 0, we may appeal to the continuous mapping theorem to achieve consistency:

1/π̂(y) → 1/π(y) almost surely as M → ∞  (32)

We can compute the asymptotic variance by the delta rule, with g(x) = 1/x and g′(x) = −1/x²:

AVar(1/π̂(y)) = g′(π(y))² · Var(π̂(y))  (33a)
= (1/π(y)⁴) · π(y)(1 − π(y)) / M  (33b)
= (1 − π(y)) / (M π(y)³)  (33c)
∎
Proposition 4.2 (restated).
Proof.
i) We first prove that the estimator of the inclusion probabilities is unbiased through the following manipulation:

E[π̂(y)] = E_{P̃}[ (1/M) Σ_{m=1}^{M} ∏_{t=1}^{T} π_t(y_{≤t} | Y^{(m)}_{t−1}) ]  (34a)
= (1/M) Σ_{m=1}^{M} E_{P̃}[ ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1}) ]  (34b)
= E_{P̃}[ ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1}) ]  (34c)
= Σ_{Y_1, …, Y_T} P̃(Y_1, …, Y_T) ∏_{t=1}^{T} π_t(y_{≤t} | Y_{t−1})  (34d)
= Σ_{Y_1, …, Y_T : y ∈ Y_T} P(Y_1, …, Y_T)  (34e)
= Σ_{Y_T ∋ y} P(Y_T)  (34f)
= π(y)  (34g)

where (34e) follows from Lemma 4.1.
ii) To show consistency, we appeal to the strong law of large numbers and the continuous mapping theorem. By the strong law of large numbers, we have that

(1/M) Σ_{m=1}^{M} ∏_{t=1}^{T} π_t(y_{≤t} | Y^{(m)}_{t−1}) → π(y) almost surely as M → ∞  (35)

Since x ↦ 1/x is continuous for x > 0, we have

1/π̂(y) → 1/π(y) almost surely  (36a)

which shows consistency.
Now we derive a bound on the asymptotic variance of the inverse inclusion probabilities. Suppose the bound
(37)
holds. We start with the standard expression for the variance of an importance sampling estimator (montecarlo), and then proceed with algebraic manipulation, integrating the assumption above:
(38a)  
(38b)  
(38c)  
(38d)  