
Leveraging sentence similarity in natural language generation: Improving beam search using range voting

by Sebastian Borgeaud, et al.

We propose a novel method for generating natural language sentences from probabilistic language models, selecting from a beam search using a range voting procedure. The proposed method could be applied to any language model, including both n-gram models and neural network models, and could be applied to any generation task. Instead of choosing the most likely output, our method chooses the most representative output, providing a solution to the common problem of short outputs being preferred over longer and more informative ones. We evaluate our method on an image captioning task, and find that the generated captions are longer and more diverse than those generated using standard beam search, with higher BLEU scores (particularly when the beam size is large), and better performance in a human evaluation.



1 Introduction

A language model specifies a probability distribution over sequences of words. In many applications, it is desirable to output a single sequence, rather than a distribution. A common approach is to choose the most likely sequence. However, for the probabilities to sum to 1, the probability of a sequence must tend to 0 as length increases. This leads to a long-recognised problem, that choosing the most likely sequence favours short sequences (Brown et al., 1995). This is problematic when the most likely sequence is not representative of the whole distribution. For example, in dialogue generation tasks, the most likely output can be “I don’t know”, even when most of the probability mass is assigned to long informative sequences. Cao and Clark (2017) call this the “boring output problem”.

For a real-valued distribution, we can choose a representative output by taking the mean. However, for a discrete distribution (such as over sequences), the mean is not well-defined. In this paper, we choose a representative output using tools from voting theory, which allows us to avoid the boring output problem. The basic idea is that, if the distribution assigns most of the probability mass to a group of similar sequences, we would like to generate one of these sequences – even if they have low probability as individual sequences, they have high probability as a group.

We evaluate our approach on an image captioning task (see Fig. 1 for an example). We find that our approach generates longer and more diverse captions, while achieving higher BLEU scores, and performing better in a human evaluation. This suggests that our approach mitigates the boring output problem.

2 Related work

To increase the length and diversity of a model’s outputs, some authors have proposed changes to the model architecture. In dialogue generation, Cao and Clark (2017) use a latent variable model to capture the possible ‘topics’ of a response.

Others have proposed changing the objective function. In dialogue generation, Li et al. (2016a) optimise mutual information instead of probability. In machine translation, Tu et al. (2017) modify an encoder-decoder model by adding a ‘reconstructor’ to predict the input based on the output.

However, modifying the model or the objective function depends on the particular task, and applying these techniques to an existing system requires retraining the model. In this paper, we focus on general-purpose methods which can be applied to any probabilistic model in any generation task. Existing methods include length normalisation (Wu et al., 2016; Freitag and Al-Onaizan, 2017) and diverse decoding (Li et al., 2016b; Li and Jurafsky, 2016), which we discuss in §4.1.

  • 0.00230: a couple of people that are sitting on a bench
  • 0.00132: a man sitting on a bench next to a dog
  • 0.00079: a black and white photo of a man sitting on a bench
  • 0.00075: a couple of people sitting on a bench
  • 0.00066: a man sitting on a bench with a dog
  • 0.00064: a man and a woman sitting on a bench
  • 0.00048: a man and a woman sitting on a park bench
  • 0.00046: a black and white photo of a man and a horse
  • 0.00033: a black and white photo of a man and a dog
  • 0.00025: a black and white photo of a man on a horse

Figure 1: Image from the MSCOCO validation dataset, and beam captions with their probabilities (beam size $k = 10$). See §3.1 for beam search and §4.1 for the model. Range voting with a similarity measure (see §3.2) selects “a black and white photo of a man sitting on a bench”.

3 Method

3.1 Beam search

When working with a distribution over sequences, it is not feasible to consider all possible sequences. Finding the most likely sequence can be computationally expensive – in fact, for an RNN it is undecidable (Chen et al., 2018). A common solution is to use beam search, which generates the sequence one token at a time, maintaining a list of the $k$ most promising sequences at each time step (for example: Brown et al., 1995; Koehn, 2004a). Greedy search is the special case where $k = 1$.

Beam search introduces an extra hyperparameter, the beam size $k$. Increasing $k$ covers more of the search space, but increases the computational cost. It is tempting to assume that increasing $k$ will produce better results, but empirically, the quality of the most likely sequence starts to decrease after $k$ exceeds a certain threshold (Koehn and Knowles, 2017), which stems from the problem discussed in §1. Tuning the value of $k$ to maximise performance can be challenging.
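The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the decoder used in the experiments: the function `beam_search`, the callback `next_token_probs`, the end-of-sequence marker `</s>`, and the toy model are all hypothetical names introduced here.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_token_probs: Callable[[Sequence], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "</s>",
) -> List[Tuple[Sequence, float]]:
    """Return up to `beam_size` finished sequences with their log-probabilities."""
    beam: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        # Expand every sequence on the beam by one token.
        candidates = []
        for seq, logp in beam:
            for token, p in next_token_probs(seq).items():
                if p > 0.0:
                    candidates.append((seq + (token,), logp + math.log(p)))
        # Keep only the `beam_size` most probable expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = []
        for seq, logp in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, logp))
            else:
                beam.append((seq, logp))
        if not beam:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished[:beam_size]


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical next-token distribution, for illustration only."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})


greedy = beam_search(toy_model, beam_size=1, max_len=5)
wider = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy[0][0])  # greedy commits to "a" and ends with probability 0.3
print(wider[0][0])   # beam size 2 finds "b z", which has probability 0.36
```

In this toy model, greedy search commits to the locally best first token and misses the globally most likely sequence, which a beam of size 2 recovers.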

In the next section, we propose an alternative way to generate from a beam, which avoids the drop in performance as beam size increases. Rather than choosing the most likely sequence, we choose the most representative sequence.

3.2 Range voting

To formalise the idea of the most representative sequence, we propose to use a voting procedure. Although voting has been applied to ensembles of classifiers (for an overview, see: Kuncheva, 2004; Kuncheva and Rodríguez, 2014), we are not aware of work using voting to select from a distribution.

We can see each sequence as a candidate in an election, and the probability of a sequence as the proportion of votes for that candidate. From this perspective, the problem of probability mass being split across long sequences is the well-known problem of vote splitting. Suppose candidate $A$ wins an election. Now suppose we run the election again, but add an additional candidate $A'$, identical to $A$. A voting system is robust against vote splitting (and called independent of clones) if the winner must be $A$ or $A'$ (Tideman, 1987).

A well-studied system which is independent of clones is range voting (Heckscher, 1892; Smith, 2000; Tideman, 2006; Lagerspetz, 2016). Each voter scores each candidate in a fixed range, such as $[0, 1]$, and the candidate with the highest total score wins.
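To make vote splitting concrete, the following sketch compares plurality voting with range voting before and after a clone is added. The vote shares, the candidate names, and the ballot format are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A ballot bloc: (share of voters, first choice, score given to each candidate).
Ballot = Tuple[float, str, Dict[str, float]]


def plurality_winner(ballots: List[Ballot]) -> str:
    """Each bloc counts only towards its first choice."""
    totals: Dict[str, float] = {}
    for share, first_choice, _ in ballots:
        totals[first_choice] = totals.get(first_choice, 0.0) + share
    return max(totals, key=totals.get)


def range_winner(ballots: List[Ballot]) -> str:
    """Each bloc gives a score to every candidate; the highest total wins."""
    totals: Dict[str, float] = {}
    for share, _, scores in ballots:
        for candidate, score in scores.items():
            totals[candidate] = totals.get(candidate, 0.0) + share * score
    return max(totals, key=totals.get)


# Hypothetical election without a clone: A beats B.
without_clone = [
    (0.6, "A", {"A": 1.0, "B": 0.0}),
    (0.4, "B", {"A": 0.0, "B": 1.0}),
]
# Same election after adding A', a clone of A that splits A's first-choice votes.
with_clone = [
    (0.3, "A", {"A": 1.0, "A'": 1.0, "B": 0.0}),
    (0.3, "A'", {"A": 1.0, "A'": 1.0, "B": 0.0}),
    (0.4, "B", {"A": 0.0, "A'": 0.0, "B": 1.0}),
]

print(plurality_winner(without_clone), plurality_winner(with_clone))
print(range_winner(without_clone), range_winner(with_clone))
```

Under plurality, adding the clone hands the election to B, whereas under range voting the winner remains A or its clone.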

In our setting, probability mass can be seen as the proportion of votes placing a candidate as first choice (see Fig. 1 for an example). For range voting, we need to augment the votes with scores for all other candidates. We propose to do this using a similarity measure. The final score for a sequence $c$ is given in (1), for a beam of sequences $B$ and a similarity measure $\mathrm{sim}$. (An alternative way to understand this method is that each sequence acts as both voter and candidate. As a voter, each sequence is weighted by its probability.)

$$\mathrm{score}(c) = \sum_{v \in B} P(v) \cdot \mathrm{sim}(v, c) \qquad (1)$$
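As a rough sketch of this re-ranking step, the code below weights each sequence's vote by its probability and sums the similarity scores each candidate receives. The beam, its probabilities, and the `jaccard` similarity are assumptions made for this example; in particular, Jaccard overlap is a symmetric stand-in and is not one of the measures used in the paper.

```python
from typing import Callable, Dict


def range_voting_scores(
    beam: Dict[str, float], sim: Callable[[str, str], float]
) -> Dict[str, float]:
    """Total score of each candidate: sum over voters of P(voter) * sim(voter, candidate)."""
    return {
        cand: sum(prob * sim(voter, cand) for voter, prob in beam.items())
        for cand in beam
    }


def jaccard(voter: str, cand: str) -> float:
    """Stand-in similarity (an assumption): Jaccard overlap of the two word sets."""
    a, b = set(voter.split()), set(cand.split())
    return len(a & b) / len(a | b)


# Hypothetical beam; the probabilities are illustrative only.
beam = {
    "i don't know": 0.30,
    "the cat sat on the mat": 0.25,
    "the cat sat on a mat": 0.25,
    "the cat is on the mat": 0.20,
}

scores = range_voting_scores(beam, jaccard)
most_likely = max(beam, key=beam.get)
most_representative = max(scores, key=scores.get)
print(most_likely)
print(most_representative)
```

Here the single most likely sequence is the short, uninformative one, while the three similar long sequences pool their votes and one of them wins the range vote.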


Defining semantic similarity between sentences is recognised as a hard problem (Achananuparp et al., 2008; Cer et al., 2017; Pawar and Mago, 2019). In this work, we focus on simple, domain-agnostic similarity measures which do not require additional training.

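First, we consider similarity based on n-grams. For a sequence $s$, we write $\mathrm{set}_n(s)$ for its set of n-grams, and $\mathrm{bag}_n(s)$ for its bag of n-grams. We define two measures in (2–3). Both are asymmetric, to encourage informative sequences: if $c$ contains $v$ plus more information, $\mathrm{sim}(v, c)$ should be high, but $\mathrm{sim}(c, v)$ should be lower. This allows an informative sequence to gather more votes.

$$\mathrm{sim}_{n,\mathrm{set}}(v, c) = \frac{|\mathrm{set}_n(v) \cap \mathrm{set}_n(c)|}{|\mathrm{set}_n(v)|} \qquad (2)$$

$$\mathrm{sim}_{n,\mathrm{bag}}(v, c) = \frac{|\mathrm{bag}_n(v) \cap \mathrm{bag}_n(c)|}{|\mathrm{bag}_n(v)|} \qquad (3)$$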
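One way to realise asymmetric n-gram measures of this kind is sketched below. It assumes that the overlap is normalised by the voter's n-grams, so that a candidate containing everything the voter says receives the maximum score; the function names `sim_set` and `sim_bag` and the whitespace tokenisation are choices made for this example.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sim_set(voter: List[str], cand: List[str], n: int) -> float:
    """Fraction of the voter's distinct n-grams that also occur in the candidate."""
    v, c = set(ngrams(voter, n)), set(ngrams(cand, n))
    return len(v & c) / len(v) if v else 0.0


def sim_bag(voter: List[str], cand: List[str], n: int) -> float:
    """Like sim_set, but n-grams are counted with multiplicity."""
    v, c = Counter(ngrams(voter, n)), Counter(ngrams(cand, n))
    total = sum(v.values())
    # Counter & Counter keeps the minimum count of each shared n-gram.
    return sum((v & c).values()) / total if total else 0.0


short = "a man on a bench".split()
longer = "a man and a dog on a bench".split()

print(sim_set(short, longer, 1), sim_set(longer, short, 1))
print(sim_bag(short, longer, 1), sim_bag(longer, short, 1))
```

The shorter caption gives the longer one a full score, but not the other way round, which is the asymmetry that lets informative sequences gather more votes.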


Second, inspired by Mueller and Thyagarajan (2016), we consider a similarity measure based on the hidden states of the LSTM during generation (see §4.1). For each sequence, we find the average LSTM hidden state, and then compute the cosine similarity between these averages. We refer to this as the LSTM measure.
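The computation can be sketched as follows, using plain Python lists in place of real LSTM hidden states. The function names and the two-dimensional toy states are assumptions for illustration; in practice the states would come from the decoder at each generation step.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_state(states: Sequence[Sequence[float]]) -> Vector:
    """Average the hidden states of one sequence, dimension by dimension."""
    dim = len(states[0])
    return [sum(state[i] for state in states) / len(states) for i in range(dim)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim_lstm(states_a: Sequence[Sequence[float]], states_b: Sequence[Sequence[float]]) -> float:
    """Cosine similarity between the average hidden states of two sequences."""
    return cosine(mean_state(states_a), mean_state(states_b))


# Toy hidden states (one vector per generated token), for illustration only.
caption_a = [[1.0, 0.0], [0.0, 1.0]]
caption_b = [[1.0, 1.0]]
caption_c = [[1.0, 0.0]]

print(sim_lstm(caption_a, caption_b))
print(sim_lstm(caption_a, caption_c))
```

Unlike the n-gram measures, this similarity is symmetric in its two arguments, and it does not depend on sequence length once the states are averaged.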


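| Method | BLEU-1, $k=1$ | BLEU-1, $k=2$ | BLEU-1, $k=10$ | BLEU-1, $k=100$ | BLEU-4, $k=1$ | BLEU-4, $k=2$ | BLEU-4, $k=10$ | BLEU-4, $k=100$ |
|---|---|---|---|---|---|---|---|---|
| Standard beam search | 0.6666 | 0.6797 | 0.6723 | 0.6618 | 0.2539 | 0.2683 | 0.2716 | 0.2631 |
| Length normalisation | 0.6666 | 0.6847 | 0.6472 | 0.6310 | 0.2539 | 0.2672 | 0.2576 | 0.2472 |
| Diverse decoding | 0.6666 | 0.6790 | 0.6724 | 0.6643 | 0.2539 | 0.2668 | 0.2693 | 0.2637 |
| Range voting, unigram (a) | 0.6666 | 0.6855 | 0.6626 | 0.6636 | 0.2539 | 0.2647 | 0.2561 | 0.2460 |
| Range voting, unigram (b) | 0.6666 | 0.6854 | 0.6631 | 0.6646 | 0.2539 | 0.2647 | 0.2562 | 0.2458 |
| Range voting, bigram (a) | 0.6666 | 0.6820 | 0.6736 | 0.6719 | 0.2539 | 0.2682 | 0.2722 | 0.2713 |
| Range voting, bigram (b) | 0.6666 | 0.6820 | 0.6763 | 0.6721 | 0.2539 | 0.2682 | 0.2723 | 0.2713 |
| Range voting, LSTM | 0.6666 | 0.6797 | 0.6842 | 0.6910 | 0.2539 | 0.2683 | 0.2796 | 0.2823 |

Table 1: BLEU-1 and BLEU-4 scores obtained on the MSCOCO validation images, for beam sizes $k$ of 1, 2, 10 and 100. Rows (a) and (b) are the two n-gram measures of §3.2.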

4 Experiments

We evaluate our method on the MSCOCO dataset (Lin et al., 2014), which consists of 82,783 training images and 40,504 validation images, each annotated with 5 captions from human annotators.

4.1 Model and baselines

We use the ‘Show and Tell’ architecture of Vinyals et al. (2015). The task is framed as a supervised learning problem: an encoder-decoder model is trained to maximise the probability of the annotator captions given an input image. The encoder is a pretrained Inception V3 CNN (Szegedy et al., 2016), from which we extract a feature vector from the final pooling layer (Ioffe and Szegedy, 2015). The decoder is an LSTM (Hochreiter and Schmidhuber, 1997) with 512 hidden units and dropout, whose hidden state is initialised using the encoder. The vocabulary consists of the 5000 most common words in the training captions, for which embeddings of size 512 are learned from scratch. We trained the model for 20 epochs with vanilla SGD, starting with a learning rate of 2.0, which is halved every 8 epochs.
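The learning-rate schedule described above can be written as a one-line step decay. This sketch assumes epochs are numbered from 0 and that the halving takes effect at the start of epochs 8 and 16; the function name and its parameters are hypothetical.

```python
def learning_rate(epoch: int, initial: float = 2.0, halve_every: int = 8) -> float:
    """Step decay: start at `initial` and halve after every `halve_every` epochs."""
    return initial * 0.5 ** (epoch // halve_every)


# Learning rate for each of the 20 training epochs (0-indexed, an assumption).
schedule = [learning_rate(epoch) for epoch in range(20)]
print(schedule)
```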

As well as comparing to standard beam search, we consider two existing baselines. Length normalisation divides the log-probability by sequence length (Wu et al., 2016; Freitag and Al-Onaizan, 2017). Diverse decoding penalises expansions of the same initial sequence (Li et al., 2016b; Li and Jurafsky, 2016). The other methods mentioned in §2 cannot be straightforwardly applied to this task.
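As a small worked example of length normalisation, the sketch below applies it to two of the captions and probabilities listed in Figure 1. It assumes whitespace tokenisation and does not count an end-of-sequence token, either of which may differ from the cited implementations.

```python
import math


def length_normalised(prob: float, caption: str) -> float:
    """Log-probability divided by the number of whitespace-separated tokens."""
    return math.log(prob) / len(caption.split())


# Two captions and their probabilities, taken from Figure 1.
candidates = {
    "a couple of people that are sitting on a bench": 0.00230,
    "a black and white photo of a man sitting on a bench": 0.00079,
}

by_probability = max(candidates, key=candidates.get)
by_normalised = max(candidates, key=lambda c: length_normalised(candidates[c], c))
print(by_probability)
print(by_normalised)
```

With these numbers, the 10-token caption is the more probable one, but dividing by length favours the 12-token caption.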

4.2 BLEU scores

Table 1 shows BLEU scores (Papineni et al., 2002) on the MSCOCO validation set. For beam size $k = 1$, all methods reduce to greedy search.

The bigram similarity measures and the LSTM measure improve BLEU scores for almost all beam sizes. In contrast, diverse decoding has almost no effect on BLEU, while length normalisation performs worse than standard beam search. The best result is achieved by the LSTM measure at $k = 100$. This is significantly better than the best standard beam search result, according to a paired bootstrap test following Koehn (2004b).
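The idea behind a paired bootstrap test can be sketched as follows: resample the test items with replacement, score both systems on the same resample, and count how often one system beats the other. This sketch simplifies matters by summing hypothetical per-item scores, whereas corpus-level BLEU would need to be recomputed on each resample; the function name, the scores, and the number of resamples are all assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A strictly beats system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same resampled indices for both systems (hence "paired").
        indices = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in indices)
        total_b = sum(scores_b[i] for i in indices)
        if total_a > total_b:
            wins += 1
    return wins / n_resamples


# Hypothetical per-item scores for two systems on the same three items.
system_a = [0.6, 0.7, 0.8]
system_b = [0.5, 0.6, 0.7]
print(paired_bootstrap(system_a, system_b))
```

If system A wins on, say, at least 95% of the resamples, the difference is usually reported as significant at the corresponding level.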

Consistent with Ott et al. (2018) and Koehn and Knowles (2017), increasing the beam size $k$ too much reduces BLEU for standard beam search. However, this drop does not occur for our voting method.

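| Method | $k=1$ | $k=2$ | $k=10$ | $k=100$ |
|---|---|---|---|---|
| Standard beam search | 8.41 | 8.79 | 9.18 | 9.11 |
| Length normalisation | 8.41 | 9.19 | 10.24 | 10.43 |
| Diverse decoding | 8.41 | 8.71 | 9.12 | 9.15 |
| Range voting, unigram (a) | 8.41 | 9.22 | 10.40 | 11.20 |
| Range voting, unigram (b) | 8.41 | 9.21 | 10.38 | 11.15 |
| Range voting, bigram (a) | 8.41 | 8.96 | 9.86 | 10.55 |
| Range voting, bigram (b) | 8.41 | 8.96 | 9.86 | 10.55 |
| Range voting, LSTM | 8.41 | 8.79 | 9.17 | 8.82 |

Table 2: Average length of the generated captions, for beam sizes $k$ of 1, 2, 10 and 100. Rows (a) and (b) are the two n-gram measures of §3.2.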
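| Method | Captions, $k=2$ | Captions, $k=10$ | Captions, $k=100$ | Unigrams, $k=2$ | Unigrams, $k=10$ | Unigrams, $k=100$ | Bigrams, $k=2$ | Bigrams, $k=10$ | Bigrams, $k=100$ |
|---|---|---|---|---|---|---|---|---|---|
| Standard beam search | 9208 | 5488 | 4150 | 668 | 621 | 605 | 3395 | 2778 | 2479 |
| Length normalisation | 9978 | 6418 | 5039 | 681 | 627 | 587 | 3502 | 2863 | 2471 |
| Diverse decoding | 9942 | 6424 | 4403 | 672 | 646 | 612 | 3402 | 3023 | 2561 |
| Range voting, unigram (a) | 10727 | 8916 | 10808 | 687 | 646 | 628 | 3576 | 3232 | 3596 |
| Range voting, unigram (b) | 10727 | 8902 | 10768 | 687 | 645 | 638 | 3572 | 3238 | 3607 |
| Range voting, bigram (a) | 9519 | 7598 | 9221 | 673 | 620 | 580 | 3446 | 2854 | 2887 |
| Range voting, bigram (b) | 9522 | 7590 | 9248 | 673 | 620 | 581 | 3444 | 2848 | 2892 |
| Range voting, LSTM | 9208 | 7613 | 10133 | 668 | 629 | 655 | 3395 | 2891 | 3331 |

Table 3: Number of distinct captions, unigrams and bigrams, for beam sizes $k$ of 2, 10 and 100. Rows (a) and (b) are the two n-gram measures of §3.2.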

4.3 Caption length

To analyse differences between methods, we first look at caption length, shown in Table 2. Standard beam search produces slightly longer captions as the beam size $k$ increases up to 10. All n-gram measures generate longer captions than standard beam search, and length continues to increase as $k$ goes to 100. Length normalisation also increases caption length, but this is at the cost of BLEU score (see §4.2). Diverse decoding does not increase caption length. The LSTM measure produces slightly shorter captions – as it is symmetric, it does not favour long sequences as the asymmetric n-gram measures do (see §3.2).

4.4 Caption diversity

Second, we investigate the diversity of the generated captions by counting the number of distinct captions, unigrams, and bigrams (see Table 3). This follows the approach of Li et al. (2016a), Dhingra et al. (2017), and Xu et al. (2017, 2018).
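These diversity counts are straightforward to compute. The sketch below assumes whitespace tokenisation and counts bigrams within each caption only; the function name and the three example captions are hypothetical.

```python
from typing import Dict, List


def diversity_counts(captions: List[str]) -> Dict[str, int]:
    """Number of distinct captions, unigrams and bigrams in a list of captions."""
    tokenised = [caption.split() for caption in captions]
    unigrams = {token for tokens in tokenised for token in tokens}
    bigrams = {pair for tokens in tokenised for pair in zip(tokens, tokens[1:])}
    return {
        "captions": len(set(captions)),
        "unigrams": len(unigrams),
        "bigrams": len(bigrams),
    }


# Hypothetical system output for three images.
generated = [
    "a man on a bench",
    "a man on a bench",
    "a dog on a bench",
]
counts = diversity_counts(generated)
print(counts)
```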

For standard beam search, the number of distinct captions drops as the beam size $k$ increases. Both baselines weaken this effect, but the drop is still present. In contrast, range voting maintains caption diversity as $k$ increases, for all similarity measures.

Similarly, standard beam search sees a drop in the number of distinct unigrams and bigrams as the beam size $k$ increases, and the baselines do not seem to mitigate this. In contrast, the unigram measures and the LSTM measure maintain both unigram diversity and bigram diversity as $k$ increases, while the bigram measures partially maintain bigram diversity.

4.5 Human evaluation

BLEU is known to be imperfect, and does not always match human judgements (Callison-Burch et al., 2006). While the n-gram similarity measures produce similar BLEU scores to standard beam search, they also produce longer captions, and a longer caption is potentially more informative. To investigate whether they are more informative in a way that is not reflected by BLEU, we took 500 validation images for human evaluation, comparing the captions produced by standard beam search against those produced by our best-performing n-gram measure. Each pair of captions was presented in a random order, together with the original image, and judged on a five-point scale (either caption much better, either caption slightly better, or no difference). The voted caption was rated better 106 times, and worse 73 times. This is statistically significant under a two-tailed sign test, discarding ties (Emerson and Simon, 1979). However, among captions rated much better, the voted caption was better 27 times and worse 40 times. This is suggestive but not fully significant.
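For readers who wish to check the counts above, an exact two-tailed sign test can be computed from the binomial distribution. This is a minimal sketch that assumes ties have already been discarded and that the two-tailed p-value is twice the smaller tail, capped at 1; the function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(wins: int, losses: int) -> float:
    """Exact two-tailed sign test p-value under a fair-coin null hypothesis."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of seeing k or fewer of the rarer outcome in n fair trials.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts reported in the text: all preferences, then "much better" only.
p_all = sign_test_two_tailed(106, 73)
p_much = sign_test_two_tailed(27, 40)
print(p_all, p_much)
```

The first comparison falls below the conventional 0.05 threshold and the second does not, in line with the description above.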

These results support the claim that a voted caption represents more of the information present in a model’s distribution over captions – this often leads to a better caption, but where the model is wrong, adding wrong information can make the caption much worse. After all, our method is designed as a better way to select from a distribution, not as an improvement to the distribution itself.

5 Conclusion

We have proposed a new method for generating natural language from a language model, by re-ranking a beam search. Instead of choosing the most likely sequence, we choose the most representative sequence, formalising representativeness using a similarity measure and range voting.

We have evaluated our method on an image captioning task. Despite using simple similarity measures, we achieve an increase in BLEU score, an increase in caption length and diversity, and statistically significantly better performance in a human evaluation. Unlike standard beam search, performance of our method does not drop as beam size continues to increase, removing the sensitivity of results to this hyperparameter. Better similarity measures could further improve results.

Finally, our approach can be applied to any probabilistic language model, without any need for additional training. This opens up many other tasks, including machine translation, summarisation, dialogue systems, and question answering. If multiple outputs can be used (e.g. offering options to a user), our method can be extended to use reweighted range voting (Smith, 2005), a procedure which elects multiple candidates.


References

  • P. Achananuparp, X. Hu, and X. Shen (2008) The evaluation of sentence similarity measures. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pp. 305–316.
  • P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. C. Lai, and R. L. Mercer (1995) Method and system for natural language translation. Google Patents. US Patent 5,477,451.
  • C. Callison-Burch, M. Osborne, and P. Koehn (2006) Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
  • K. Cao and S. Clark (2017) Latent variable dialogue models and their diversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14.
  • Y. Chen, S. Gilroy, K. Knight, and J. May (2018) Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2261–2271.
  • B. Dhingra, L. Li, X. Li, J. Gao, Y. Chen, F. Ahmed, and L. Deng (2017) Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 484–495.
  • J. D. Emerson and G. A. Simon (1979) Another look at the sign test when ties are present: the problem of confidence intervals. The American Statistician 33 (3), pp. 140–142.
  • M. Freitag and Y. Al-Onaizan (2017) Beam search strategies for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 56–60.
  • A. G. Heckscher (1892) Bidrag til grundlæggelse af en afstemningslære. om methoderne ved udfindelse af stemmerflerhed i parlamenter (afsteming over ændringforslag m.v.) ved valg og domstole. Ph.D. Thesis, University of Copenhagen.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, pp. 448–456.
  • P. Koehn and R. Knowles (2017) Six challenges for neural machine translation. In First Workshop on Neural Machine Translation, pp. 28–39.
  • P. Koehn (2004a) Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Conference of the Association for Machine Translation in the Americas, pp. 115–124.
  • P. Koehn (2004b) Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 388–395.
  • L. I. Kuncheva and J. J. Rodríguez (2014) A weighted voting framework for classifiers ensembles. Knowledge and Information Systems 38 (2), pp. 259–275.
  • L. I. Kuncheva (2004) Combining pattern classifiers: methods and algorithms. John Wiley & Sons.
  • E. Lagerspetz (2016) Social choice and democratic values. Springer.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016a) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 110–119.
  • J. Li and D. Jurafsky (2016) Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.
  • J. Li, W. Monroe, and D. Jurafsky (2016b) A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
  • M. Ott, M. Auli, D. Grangier, and M. Ranzato (2018) Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning (ICML).
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
  • A. Pawar and V. Mago (2019) Challenging the boundaries of unsupervised learning for semantic similarity. IEEE Access 7.
  • W. D. Smith (2000) Range voting.
  • W. D. Smith (2005) Reweighted range voting – new multiwinner voting method.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
  • N. Tideman (1987) Independence of clones as a criterion for voting rules. Social Choice and Welfare 4 (3), pp. 185–206.
  • N. Tideman (2006) Collective decisions and voting: the potential for public choice. Routledge.
  • Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li (2017) Neural machine translation with reconstruction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation.
  • X. Xu, O. Dušek, I. Konstas, and V. Rieser (2018) Better conversations by modeling, filtering, and optimizing for coherence and diversity. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3981–3991.
  • Z. Xu, B. Liu, B. Wang, C. Sun, X. Wang, Z. Wang, and C. Qi (2017) Neural response generation via GAN with an approximate embedding layer. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 617–626.