Estimating Gradients for Discrete Random Variables by Sampling without Replacement
The well-known Gumbel-Max trick for sampling from a categorical distribution can be extended to sample k elements without replacement. We show how to implicitly apply this 'Gumbel-Top-k' trick on a factorized distribution over sequences, allowing to draw exact samples without replacement using a Stochastic Beam Search. Even for exponentially large domains, the number of model evaluations grows only linear in k and the maximum sampled sequence length. The algorithm creates a theoretical connection between sampling and (deterministic) beam search and can be used as a principled intermediate alternative. In a translation task, the proposed method compares favourably against alternatives to obtain diverse yet good quality translations. We show that sequences sampled without replacement can be used to construct low-variance estimators for expected sentence-level BLEU score and model entropy.READ FULL TEXT VIEW PDF
We describe a very simple method for `consistent sampling' that allows f...
Sampling is a fundamental technique, and sampling without replacement is...
In neural machine translation (NMT), we search for the mode of the model...
Weighted sampling is a fundamental tool in data analysis and machine lea...
Despite some empirical success at correcting exposure bias in machine
We introduce Ensemble Rejection Sampling, a scheme for exact simulation ...
We are interested in the behavior of particular functionals, in a framew...
Estimating Gradients for Discrete Random Variables by Sampling without Replacement
the log-probability for each category by addingindependent Gumbel distributed noise and returning the category with maximum perturbed log-probability. This trick has recently (re)gained popularity as it allows to derive reparameterizable continous relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016). However, there is more: as was noted (in a blog post) by Vieira (2014), taking the top largest perturbations (instead of the maximum, or top 1) yields a sample of size from the categorical distribution without replacement. We call this extension of the Gumbel-Max trick the Gumbel-Top- trick.
In this paper, we consider factorized distributions over sequences, represented by (parametric) sequence models
. Sequence models are widely used, e.g. in tasks such as neural machine translation(Sutskever et al., 2014; Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015b). Many such tasks require obtaining a set of representative sequences from the model. These can be random samples, but for low-entropy distributions, a set of sequences sampled using standard sampling (with replacement) may contain many duplicates. On the other hand, a beam search can find a set of unique high-probability sequences, but these have low variability and, being deterministic, cannot be used to construct statistical estimators.
In this paper, we propose sampling without replacement as an alternative method to obtain representative sequences from a sequence model. We show that for a sequence model, we can do this by applying the Gumbel-Top- trick implicitly, without instantiating all sequences in the (typically exponentially large) domain. This procedure allows us to draw sequences using a number of model evaluations that grows only linear in the number of samples and the maximum sampled sequence length. The algorithm uses the idea of top-down sampling (Maddison et al., 2014) and performs a beam search over stochastically perturbed log-probabilities, which is why we refer to it as Stochastic Beam Search.
Stochastic Beam Search is a novel procedure for sampling sequences that avoids duplicate samples. Unlike ordinary beam search, it has a probabilistic interpretation. Thus, it can e.g. be used in importance sampling. As such, Stochastic Beam Search conceptually connects sampling and beam search, and combines advantages of both methods. In Section 4 we give two examples of how Stochastic Beam Search can be used as a principled alternative to sampling or beam search. In these experiments, Stochastic Beam Search is used to control the diversity of translation results, as well as to construct low variance estimators for sentence-level BLEU score and model entropy.
If , then has a Gumbel distribution with location and we write . From this it follows that , so we can shift Gumbel variables.
The Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) allows to sample from the categorical distribution (1) by independently perturbing the log-probabilities with Gumbel noise and finding the largest element.
Formally, let i.i.d. and let , then with . In a slight abuse of notation, we write and we call the perturbed log-probability of element or category in the domain .
Considering the maximum the top 1 (one), we can generalize the Gumbel-Max trick to the Gumbel-Top- trick to draw an ordered sample of size without replacement111With sampling without replacement from a categorical distribution, we mean sampling the first element, then renormalizing the remaining probabilities to sample the next element, etcetera. This does not mean that the inclusion probability of element is proportional to : if we sample elements all elements are included with probability 1., by taking the indices of the largest perturbed log-probabilities. Generalizing , we denote with the function that takes a sequence of values and returns the indices of the largest values, in order of decreasing value.
For , let . Then is an (ordered) sample without replacement from the distribution, e.g. for a realization it holds that
where is the domain (without replacement) for the -th sampled element.
A sequence model is a factorized parametric distribution over sequences. The parameters define the conditional probability of the next token given the partial sequence . Typically is defined as a softmax normalization of unnormalized log-probabilities with optional temperature (default ):
The normalization is w.r.t. a single token, so the model is locally normalized. The total probability of a (partial) sequence
follows from the chain rule of probability:
A sequence model defines a valid probability distribution over both partial and complete sequences. When the length is irrelevant, we simply writeto indicate a (partial or complete) sequence. If the model is additionally conditioned on a context (e.g., a source sentence), we write .
A beam search is a limited-width breadth first search. In the context of sequence models, it is often used as an approximation to finding the (single) sequence that maximizes (7), or as a way to obtain a set of high-probability sequences from the model. Starting from an empty sequence (e.g. ), a beam search expands at every step at most partial sequences (those with highest probability) to compute the probabilities of sequences with length . It terminates with a beam of
complete sequences, which we assume to be of equal length (as sequences can be padded).
We derive Stochastic Beam Search by starting with the explicit application of the Gumbel-Top- trick to sample sequences without replacement from a sequence model. This requires instantiating all sequences in the domain to find the largest perturbed log-probabilities. Then we transition to top-down sampling of the perturbed log-probabilities, and we use Stochastic Beam Search to instantiate (only) the sequences with the largest perturbed log-probabilities. As both methods are equivalent, Stochastic Beam Search implicitly applies the Gumbel-Top- trick and thus yields a sample of sequences without replacement.
We represent the sequence model (7) as a tree (as in Figure 1), where internal nodes at level represent partial sequences , and leaf nodes represent completed sequences. We identify a leaf by its index and write as the corresponding sequence, with (normalized!) log-probability . To obtain a sample from the distribution (7) without replacement, we should sample from the set of leaf nodes without replacement, for which we can naively use the Gumbel-Top- trick (Section 2.3):
Compute for all sequences . To reuse computations for partial sequences, the complete probability tree is instantiated, as in Figure 1.
Sample , so can be seen as the perturbed log-probability of sequence .
Let , then is a sample of sequences from (7) without replacement.
As instantiating the complete probability tree is computationally prohibitive, we construct an equivalent process that only requires computation linear in the number of samples and the sequence length.
For the naive implementation of the Gumbel-Top- trick, we only defined the perturbed log-probabilities for leaf nodes , which correspond to complete sequences . For the Stochastic Beam Search implementation, we also define the perturbed log-probabilities for internal nodes corresponding to partial sequences. We identify a node (internal or leaf) by the set of leaves in the corresponding subtree, and we write as the corresponding (partial or completed) sequence. Its log-probability can be computed incrementally from the parent log-probability using (6), and since the model is locally normalized, it holds that
Now for each node , we define as the maximum of the perturbed log-probabilities in the subtree leaves . By Equation (2), has a Gumbel distribution with location (hence its notation ):
Since is a Gumbel perturbation of the log-probability , we call it the perturbed log-probability of the partial sequence . We can define the corresponding Gumbel noise , which can be inferred from by the relation .
If we want to sample for all nodes, we can use the bottom-up sampling procedure: sample the leaves and recursively compute using (10). This is effectively sampling from the degenerate (constant) distribution resulting from conditioning on the children.
The recursive bottom-up sampling procedure can be interpreted as ancestral sampling from a tree-structured graphical model (somewhat like Figure 1) with edges directed upwards. Alternatively, we can reverse the graphical model and sample the tree top-down, starting with the root and recursively sampling the children conditionally.
Note that for the root (since it contains all leaves ), it holds that , so we can let 222Or we can simply set (e.g. condition on) . This does not affect the result by the independence of and .. Starting with , we can recursively sample the children conditionally on the parent variable . For it holds that and we can sample conditionally on (10), e.g. with their maximum equal to .
Sampling a set of Gumbels conditionally on their maximum being equal to a certain value is non-trivial, but can be done by first sampling the and then sampling the individual Gumbels conditionally on both the and . Alternatively, we can let independently and let . Then
is a set of Gumbels with a maximum equal to . See Appendix B for details and numerically stable implementation.
If we recursively sample the complete tree top-down, this is equivalent to sampling the complete tree bottom-up, and as a result, for all leaves, it holds that , independently. The benefit of using top-down sampling is that if we are interested only in obtaining the top leaves, we do not need to instantiate the complete tree.
The key idea of Stochastic Beam Search is to apply the Gumbel-Top- trick for a sequence model, without instantiating the entire tree, by using top-down sampling. With top-down sampling, to find the top leaves, at every level in the tree we can suffice with only expanding (instantiating the subtree for) the nodes with highest perturbed log-probability . To see this, first assume that we instantiated the complete tree using top-down sampling and consider the nodes that are ancestors of at least one of the top leaves (the shaded nodes in Figure 1). At every level of the tree, there will be at most such nodes (as each of the top leaves has only one ancestor at level ), and these nodes will have higher perturbed log-probabilities than the other nodes at level , which do not contain a top leaf in the subtree. This means that if we discard all but the nodes with highest log-probabilities , we are guaranteed to include the ancestors of the top leaves. Formally, the -th highest log-probability of the nodes at level provides a lower bound required to be among the top leaves, while is an upper bound for the set of leaves such that it can be discarded or pruned if it is lower than the lower bound, so if is not among the top .
Thus, when we apply the top-down sampling procedure, at each level we only need to expand the nodes with the highest perturbed log-probabilities to end up with the top leaves. By the Gumbel-Top- trick the result is a sample without replacement from the sequence model. The effective procedure is a beam search over the (stochastically) perturbed log-probabilities for partial sequences , hence the name Stochastic Beam Search. As we use to select the top partial sequences, we can also think of as the stochastic score of the partial sequence . We formalize Stochastic Beam Search in Algorithm 1.
Stochastic Beam Search should not only be considered as a sampling procedure, but also as a principled way to randomize a beam search. As a naive alternative, one could run an ordinary beam search, replacing the top- operation by sampling. In this scenario, at each step of the beam search we could sample without replacement from the partial sequence probabilities using the Gumbel-Top- trick.
However, in this naive approach, for a low-probability partial sequence to be extended to completion, it does not only need to be initially chosen, but it will need to be re-chosen, independently, again with low probability, at each step of the beam search. The result is a much lower probability to sample this sequence than assigned by the model. Intuitively, we should somehow commit to a sampling ‘decision’ made at step . However, a hard commitment to generate exactly one descendant for each of the partial sequences at step would prevent generating any two sequences that share an initial partial sequence.
Our Stochastic Beam Search algorithm makes a soft commitment to a partial sequence (node in the tree) by propagating the Gumbel perturbation of the log-probability consistently down the subtree. The partial sequence will then be extended as long as its total perturbed log-probability is among the top , but will fall off the beam if, despite the consistent perturbation, another sequence is more promising.
As an alternative to Stochastic Beam Search, we can draw samples without replacement by rejecting duplicates from samples drawn with replacement. However, if the domain is large and the entropy low (e.g. if there are only a few valid translations), then rejection sampling requires many samples and consequently many (expensive) model evaluations. Also, we have to estimate how many samples to draw (in parallel) or draw samples sequentially. Stochastic Beam Search executes in a single pass, and requires computation linear in the sample size and the sequence length, which (except for the beam search overhead) is equal to the computational requirement for sampling with replacement.
In this experiment we compare Stochastic Beam Search as a principled (stochastic) alternative to Diverse Beam Search (Vijayakumar et al., 2018) in the context of neural machine translation to obtain a diverse set of translations for a single source sentence . Following the setup by Vijayakumar et al. (2018) we report both diversity as measured by the fraction of unique -grams in the translations as well as mean and maximum BLEU score (Papineni et al., 2002) as an indication of the quality of the sample. The maximum BLEU score corresponds to ‘oracle performance’ reported by Vijayakumar et al. (2018), but we report the mean as well since a single good translation and completely random sentences scores high on both maximum BLEU score and diversity, while being undesirable. A good method should increase diversity without sacrificing mean BLEU score.
We compare four different sentence generations methods: Beam Search (BS), Sampling, Stochastic Beam Search (SBS) (sampling without replacement) and Diverse Beam Search with groups (DBS()) (Vijayakumar et al., 2018). For Sampling and Stochastic Beam Search, we control the diversity using the softmax temperature in Equation (5). We use , where a higher
results in higher diversity. Heuristically, we also varyfor computing the scores with (deterministic) Beam Search. The diversity of Diverse Beam Search is controlled by the diversity strengths parameter, which we vary between . We set the number of groups equal to the sample size , which Vijayakumar et al. (2018) reported as the best choice.
We modify the Beam Search implementation in fairseq333https://github.com/pytorch/fairseq to implement Stochastic Beam Search, and use the fairseq implementations for Beam Search, Sampling and Diverse Beam Search. For theoretical correctness of the Stochastic Beam Search, we disable length-normalization (Wu et al., 2016) and early stopping (and therefore also do not use these parameters for the other methods). We use the pretrained model from Gehring et al. (2017) and use the wmt14.v2.en-fr.newstest2014 444https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2 test set consisting of 3003 sentences. We run the four methods with sample sizes and plot the minimum, mean and maximum BLEU score among the translations (averaged over the test set) against the average of and -gram diversity, where -gram diversity is defined as:
In Figure 2, we represent the results as curves, indicating the trade-off between diversity and BLEU score. The points indicate datapoints and the dashed lines indicate the (averaged) minimum and maximum BLEU score. For the same diversity, Stochastic Beam Search achieves higher mean/maximum BLEU score. Looking at a certain BLEU score, we observe that Stochastic Beam Search achieves the same BLEU score as Diverse Beam Search with a significantly larger diversity. For low temperatures (), the maximum BLEU score of Stochastic Beam Search is comparable to the deterministic Beam Search, so the increased diversity does not sacrifice the best element in the sample. Note that Sampling achieves higher mean BLEU score at the cost of diversity, which may be because good translations are sampled repeatedly. However, the maximum BLEU score of both Sampling and Diverse Beam Search is lower than with Beam Search and Stochastic Beam Search.
In our second experiment, we use sampling without replacement to evaluate the expected sentence level BLEU score for a translation given a source sentence . Although we are often interested in corpus level BLEU score, estimation of sentence level BLEU score is useful, for example when training using minibatches to directly optimize BLEU score (Ranzato et al., 2016).
We leave dependence of the BLEU score on the source sentence implicit, and write . Writing the domain of (given ) as (e.g. all possible translations), we want to estimate the following expectation:
Under a Monte Carlo (MC) sampling with replacement scheme with size , we write as the set555Formally, when sampling with replacement, is a multiset. of indices of sampled sequences and estimate (11) using
If the distribution has low entropy (for example if there are only few valid translations), then MC estimation may be inefficient since repeated samples are uninformative. We can use sampling without replacement as an improvement, but we need to use importance weights to correct for the changed sampling probabilities. Using the Gumbel-Top- trick, we can implement an estimator equivalent to the estimator described by Vieira (2017), derived from priority sampling (Duffield et al., 2007):
Here is the -th largest element of , which can be considered the empirical threshold for the Gumbel-Top- trick (since if ), and we define . If we would assume a fixed threshold and variably sized sample , then and is a standard importance weight. Surprisingly, using a fixed sample size (and empirical threshold
) also yields in an unbiased estimator, and we include a proof adapted fromDuffield et al. (2007) and Vieira (2017) in Appendix D. To obtain , we need to sacrifice the last sample666For Stochastic Beam Search, we use a sample/beam size , set equal to the -th largest perturbed log-probability and compute (13) based on the remaining samples. Alternatively, we could use a beam size of ., slightly increasing variance.
We have to take care computing the importance weights as depending on the entropy the terms in the quotient can become very small, and in our case the computation of can suffer from catastrophic cancellation. For details, see Appendix C.
Because the model is not trained to use its own predictions as input, at test time errors can accumulate. As a result, when sampling with the default temperature , the expected BLEU score is very low (below 10). To improve quality of generated sentences we use lower temperatures and experiment with . We then use different methods to estimate the BLEU score:
Monte Carlo (MC), using Equation (12).
Beam Search (BS), where we compute a deterministic beam (the temperature affects the scoring) and compute . This is not a statistical estimator, but a lower bound to the target (11) which serves as a validation of the implementation and gives insight on how many sequences we need at least to capture most of the mass in (11). We also compute the normalized version , which can heuristically be considered as a ‘determinstic estimate’.
In Figure 3 we show the results of computing each estimate 100 times (BS only once as it is deterministic) for three different sentences777Sentence 1500, 2000 and 2500 from the WMT14 dataset. Sentences 0, 500, 1000 are shorter and obtained 0 BLEU score in some cases, therefore being uninteresting in comparisons. for temperatures and sample sizes to . We report the empirical mean and -th and -th percentile. The normalized SBS estimator indeed achieves significantly lower variance than the unnormalized version and for , it significantly reduces variance compared to MC, without adding observable bias. For the results are similar, but we are less interested in this case as the overall BLEU score is lower than for .
Additionally to estimating the BLEU score we use such that Equation (11) becomes the model entropy (conditioned on the source sentence ):
Entropy estimation is useful in optimization settings where we want to include an entropy loss to ensure diversity. It is a different problem than BLEU score estimation as high BLEU score (for a good model) correlates positively with model probability, while for entropy rare contribute the largest terms . We use the same experimental setup as for the BLEU score and present the results in Figure 4. The results are similar to the BLEU score experiment: the normalized SBS estimate has significantly lower variance than MC for while for , results are similar. This shows that Stochastic Beam Search can be used to construct practical statistical estimators.
The idea of sampling by solving optimization problems has been used for various purposes (Papandreou & Yuille, 2011; Hazan & Jaakkola, 2012; Tarlow et al., 2012; Ermon et al., 2013; Maddison et al., 2014; Chen & Ghahramani, 2016; Balog et al., 2017), but to our knowledge this idea has not been used for sampling without replacement.
Most closely related to our work is the work by Maddison et al. (2014), who note that the Gumbel-Max trick (Gumbel, 1954) can be applied implicitly and generalize it to continuous distributions, using an search to implicitly find the maximum of a Gumbel process. In this work, we extend the idea of top-down sampling to efficiently draw multiple samples without replacement from a factorized distribution (with possibly exponentially large domain) by implicitly applying the Gumbel-Top- trick. This is a new and practical sampling method.
The blog post by Vieira (2014) describes the relation of the Gumbel-Top- trick (as we call it) to Weighted Reservoir Sampling (Efraimidis & Spirakis, 2006), which is an algorithm for drawing weighted samples without replacement from a stream but also yields a practical parallel algorithm for sampling without replacement. In this setting, there is a fixed probability for each element , but no parametric (sequence) model. The implication that the Gumbel-Top- trick can be used for sampling without replacement is not widely known888 For example, at the time of writing the popular PyTorch
For example, at the time of writing the popular PyTorch(Paszke et al., 2017) library uses the (for large ) expensive sequential algorithm..
The Gumbel-Max trick has also been used to define relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016), which can be reparameterized for low-variance but biased gradient estimators and can also be used for training sequence models (Gu et al., 2018). We think our work is a step in the direction to improve these methods in the context of sequence models.
Beam search is widely used for approximate inference in various domains such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Ranzato et al., 2016; Vaswani et al., 2017; Gehring et al., 2017), image captioning (Vinyals et al., 2015b), speech recognition (Graves et al., 2013) and other structured prediction settings (Vinyals et al., 2015a; Weiss et al., 2015). Although typically a test-time procedure, there are works that include beam search in the training loop (Daumé et al., 2009; Wiseman & Rush, 2016; Edunov et al., 2018b; Negrinho et al., 2018; Edunov et al., 2018a) for training sequence models on the sequence level (Ranzato et al., 2016; Bahdanau et al., 2017). Many variants of beam search have been developed, such as a continuous relaxation (Goyal et al., 2018), diversity encouraging variants (Li et al., 2016; Shao et al., 2017; Vijayakumar et al., 2018) or using modifications such as length-normalization (Wu et al., 2016) or simply applying noise to the output (Edunov et al., 2018a). Our Stochastic Beam Search is a principled alternative that shares some of the benefits of these heuristic variants, such as the ability to control diversity or produce randomized output.
We introduced Stochastic Beam Search: an algorithm that is easy to implement on top of a beam search as a way to sample sequences without replacement. This algorithm relates sampling and beam search, combining advantages of these two methods. Our experiments support the idea that it can be used as a drop-in replacement in places where sampling or beam search is used. In fact, our experiments show Stochastic Beam Search can be used to yield lower-variance estimators and high-diversity samples from a neural machine translation model. In future work, we plan to leverage the probabilistic interpretation of beam search to develop new beam search related statistical learning methods.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489–500, 2018a.
Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
Speech recognition with deep recurrent neural networks.In Acoustics, Speech and Signal Processing, IEEE international conference on, pp. 6645–6649, 2013.
Learning beam search policies via imitation learning.In Advances in Neural Information Processing Systems, pp. 10673–10682, 2018.
Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, pp. 3156–3164, 2015b.
For , let . Then is an (ordered) sample without replacement from the distribution, e.g. for a realization it holds that
where is the domain (without replacement) for the -th sampled element.
First note that
The step from (16) to (17) follows from the independence of the and (Section 2.3) and the step from (17) to (18) uses the Gumbel-Max trick. The proof follows by induction on . The case is the Gumbel-Max trick, while if we assume the result (15) proven for , then
A random variable has a truncated Gumbel distribution with location and maximum (e.g. ) with CDF if:
The inverse CDF is:
In order to sample a set of Gumbel variables , e.g. with their maximum being exactly , we can first sample the , and then sample the Gumbels conditionally on both the and :
Sample . We do not need to condition on since the is independent of the (Section 2.3).
Set , since this follows from conditioning on the and .
Sample for . This works because, conditioning on the and , it holds that:
Equivalently, we can let , let and define
We can think of this as using the Gumbel-Max trick for step 1 (sampling the argmax) in the sampling process described above. Additionally, for :
so here we recover step 2 (setting ). For it holds that:
This means that , so this is equivalent to step 3 (sampling for ).
Direct computation of (22) can be unstable as large terms need to be exponentiated. Instead, we compute:
where we have defined