A language model specifies a probability distribution over sequences of words. In many applications, it is desirable to output a single sequence, rather than a distribution. A common approach is to choose the most likely sequence. However, for the probabilities to sum to 1, the probability of a sequence must tend to 0 as length increases. This leads to a long-recognised problem, that choosing the most likely sequence favours short sequences (Brown et al., 1995). This is problematic when the most likely sequence is not representative of the whole distribution. For example, in dialogue generation tasks, the most likely output can be “I don’t know”, even when most of the probability mass is assigned to long informative sequences. Cao and Clark (2017) call this the “boring output problem”.
For a real-valued distribution, we can choose a representative output by taking the mean. However, for a discrete distribution (such as over sequences), the mean is not well-defined. In this paper, we choose a representative output using tools from voting theory, which allows us to avoid the boring output problem. The basic idea is that, if the distribution assigns most of the probability mass to a group of similar sequences, we would like to generate one of these sequences – even if they have low probability as individual sequences, they have high probability as a group.
We evaluate our approach on an image captioning task (see Fig. 1 for an example). We find that our approach generates longer and more diverse captions, while achieving higher BLEU scores, and performing better in a human evaluation. This suggests that our approach mitigates the boring output problem.
2 Related work
To increase the length and diversity of a model’s outputs, some authors have proposed changes to the model architecture. In dialogue generation, Cao and Clark (2017) use a latent variable model to capture the possible ‘topics’ of a response.
Others have proposed changing the objective function. In dialogue generation, Li et al. (2016a) optimise mutual information instead of probability. In machine translation, Tu et al. (2017) modify an encoder-decoder model by adding a ‘reconstructor’ to predict the input based on the output.
However, modifying the model or the objective function depends on the particular task, and applying these techniques to an existing system requires retraining the model. In this paper, we focus on general-purpose methods which can be applied to any probabilistic model in any generation task. Existing methods include length normalisation (Wu et al., 2016; Freitag and Al-Onaizan, 2017) and diverse decoding (Li et al., 2016b; Li and Jurafsky, 2016), which we discuss in §4.1.
3.1 Beam search
When working with a distribution over sequences, it is not feasible to consider all possible sequences. Even finding the most likely sequence can be computationally expensive – in fact, for an RNN it is undecidable (Chen et al., 2018). A common solution is to use beam search, which generates the sequence one token at a time, maintaining a list of the k most promising sequences at each time step (for example: Brown et al., 1995; Koehn, 2004a). Greedy search is the special case where k = 1.
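This procedure can be sketched in a few lines of Python. The sketch is illustrative and not tied to any particular model: `step_fn` is a hypothetical callback that returns the next-token distribution for a given prefix.

```python
import math

def beam_search(step_fn, bos, eos, k=5, max_len=20):
    """Beam search over token sequences.

    step_fn(prefix) -> dict mapping each next token to its probability.
    At each step, keep only the k highest log-probability prefixes.
    """
    beam = [((bos,), 0.0)]  # list of (sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beam:
            for tok, p in step_fn(seq).items():
                new = (seq + (tok,), logp + math.log(p))
                # Completed sequences leave the beam; others compete for a slot.
                (finished if tok == eos else candidates).append(new)
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:k]
        if not beam:
            break
    # Unfinished sequences (hit max_len) are returned alongside finished ones.
    return sorted(finished + beam, key=lambda x: x[1], reverse=True)
```

With `k=1` this reduces to greedy search, since only the single best prefix survives each step.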
Beam search introduces an extra hyper-parameter, the beam size k. Increasing k covers more of the search space, but increases the computational cost. It is tempting to assume that increasing k will produce better results, but empirically, the quality of the most likely sequence starts to decrease after k exceeds a certain threshold (Koehn and Knowles, 2017), which stems from the problem discussed in §1. Tuning k to maximise performance can therefore be challenging.
In the next section, we propose an alternative way to generate from a beam, which avoids the drop in performance as beam size increases. Rather than choosing the most likely sequence, we choose the most representative sequence.
3.2 Range voting
To formalise the idea of the most representative sequence, we propose to use a voting procedure. Although voting has been applied to ensembles of classifiers (for an overview, see: Kuncheva, 2004; Kuncheva and Rodríguez, 2014), we are not aware of work using voting to select from a distribution.
We can see each sequence as a candidate in an election, and the probability of a sequence as the proportion of votes for that candidate. From this perspective, the problem of probability mass being split across long sequences is the well-known problem of vote splitting. Suppose candidate A wins an election. Now suppose we run the election again, but add an additional candidate A′, identical to A. A voting system is robust against vote splitting (and is called independent of clones) if the winner of the new election must be A or A′ (Tideman, 1987).
A well-studied system which is independent of clones is range voting (Heckscher, 1892; Smith, 2000; Tideman, 2006; Lagerspetz, 2016). Each voter gives each candidate a score in a fixed numerical range, and the candidate with the highest total score wins.
In our setting, probability mass can be seen as the proportion of votes placing a candidate as first choice (see Fig. 1 for an example). For range voting, we need to augment the votes with scores for all other candidates. We propose to do this using a similarity measure: the final score of a sequence s is given in (1) as the probability-weighted sum, over the sequences s′ in the beam, of the similarity of s′ to s. (An alternative way to understand this method is that each sequence acts as both voter and candidate. As a voter, each sequence is weighted by its probability.)
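A minimal sketch of this voting rule, assuming the beam is given as (sequence, probability) pairs and `sim` is an arbitrary similarity measure scoring in [0, 1] (function and argument names are illustrative):

```python
def range_vote(beam, sim):
    """Select the most representative sequence from a beam.

    beam: list of (sequence, probability) pairs.
    sim(voter, candidate): similarity score in [0, 1].
    Each sequence scores every candidate, weighted by its own
    probability; the candidate with the highest total score wins.
    """
    def score(candidate):
        return sum(p * sim(voter, candidate) for voter, p in beam)
    return max((seq for seq, _ in beam), key=score)
```

With a simple word-overlap similarity, a beam such as [("the cat sat", 0.3), ("a cat sat", 0.3), ("ok", 0.4)] illustrates the point: the most likely single sequence is "ok", but the two similar sequences jointly carry more probability mass, so range voting selects one of them.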
Defining semantic similarity between sentences is recognised as a hard problem (Achananuparp et al., 2008; Cer et al., 2017; Pawar and Mago, 2019). In this work, we focus on simple, domain-agnostic similarity measures which do not require additional training.
First, we consider similarity based on n-grams. For a sequence, we use both its set of n-grams and its bag (multiset) of n-grams, giving the two measures defined in (2–3). Both are asymmetric, to encourage informative sequences: if a sequence s′ contains everything in s plus more information, the similarity of s to s′ should be high, but the similarity of s′ to s should be lower. This allows an informative sequence to gather more votes.
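A sketch of asymmetric overlap measures in this spirit, normalising the n-gram overlap by the first argument's n-gram count (an illustrative choice, not necessarily identical to the exact definitions in (2–3)):

```python
from collections import Counter

def ngrams(tokens, n):
    """All length-n subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sim_set(s, s_prime, n=2):
    """Fraction of s's distinct n-grams that also appear in s'.

    Asymmetric: if s' contains everything in s plus more,
    sim_set(s, s') = 1 while sim_set(s', s) < 1.
    """
    a, b = set(ngrams(s, n)), set(ngrams(s_prime, n))
    return len(a & b) / len(a) if a else 0.0

def sim_bag(s, s_prime, n=2):
    """Same idea with multisets, so repeated n-grams count."""
    a, b = Counter(ngrams(s, n)), Counter(ngrams(s_prime, n))
    return sum((a & b).values()) / sum(a.values()) if a else 0.0
```

Under either measure, a short sequence gives a long superset sequence a high score, but not vice versa, so the longer, more informative sequence gathers more votes.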
Second, we consider similarity based on sentence embeddings (cf. Mueller and Thyagarajan, 2016). For each sequence, we take the average of the LSTM hidden states, and then compute cosine similarity between these averaged vectors. We refer to this as the embedding measure.
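The embedding measure can be sketched as follows, assuming each sequence comes with a list of per-token hidden-state vectors (plain-Python vectors here, for illustration):

```python
import math

def average_states(states):
    """Mean-pool a list of hidden-state vectors (one per token)."""
    dim = len(states[0])
    return [sum(v[i] for v in states) / len(states) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim_embed(states_s, states_t):
    """Symmetric similarity: cosine of the averaged hidden states."""
    return cosine(average_states(states_s), average_states(states_t))
```

Unlike the n-gram measures, this measure is symmetric, which matters for the length analysis in §4.3.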
[Table (fragment): BLEU scores for standard beam search across beam sizes – 0.6666, 0.6797, 0.6723, 0.6618; 0.2539, 0.2683, 0.2716, 0.2631.]
4 Experiments

We evaluate our method on the MSCOCO dataset (Lin et al., 2014), which consists of 82,783 training images and 40,504 validation images, each annotated with 5 captions from human annotators.
4.1 Model and baselines
We use the ‘Show and Tell’ architecture of Vinyals et al. (2015). The task is framed as a supervised learning problem: an encoder-decoder model is trained to maximise the probability of the annotator captions given an input image. The encoder is a pretrained Inception V3 CNN (Szegedy et al., 2016), from which we extract a feature vector from the final pooling layer (Ioffe and Szegedy, 2015). The decoder is an LSTM (Hochreiter and Schmidhuber, 1997) with 512 hidden units, with dropout, initialising the hidden state using the encoder. The vocabulary consists of the 5000 most common words in the training captions, for which embeddings of size 512 are learned from scratch. We trained the model for 20 epochs with vanilla SGD, starting with a learning rate of 2.0, which is halved every 8 epochs.
As well as comparing to standard beam search, we consider two existing baselines. Length normalisation divides the log-probability by sequence length (Wu et al., 2016; Freitag and Al-Onaizan, 2017). Diverse decoding penalises expansions of the same initial sequence (Li et al., 2016b; Li and Jurafsky, 2016). The other methods mentioned in §2 cannot be straightforwardly applied to this task.
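Both baselines are simple re-scoring rules. They can be sketched as follows; the sibling-penalty form of diverse decoding is simplified here to a single re-scoring step, and `gamma` is an illustrative penalty weight:

```python
def length_normalised_best(beam):
    """Length normalisation: pick the sequence with the highest
    log-probability per token, rather than the highest total."""
    return max(beam, key=lambda item: item[1] / max(len(item[0]), 1))

def sibling_penalised(expansions, gamma=1.0):
    """Diverse decoding (simplified): expansions of the same prefix are
    ranked by log-probability, and each is penalised by gamma * rank,
    demoting lower-ranked siblings."""
    ranked = sorted(expansions, key=lambda item: item[1], reverse=True)
    return [(seq, logp - gamma * rank) for rank, (seq, logp) in enumerate(ranked)]
```

Length normalisation counteracts the bias towards short sequences directly; the sibling penalty instead spreads the beam across different prefixes during search.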
4.2 BLEU scores
The bigram similarity measures and the embedding measure improve BLEU scores for almost all beam sizes. In contrast, diverse decoding has almost no effect on BLEU, while length normalisation performs worse than standard beam search. The best overall result, achieved by one of our similarity measures, is significantly better than the best beam search result, under a paired bootstrap test following Koehn (2004b).
[Table 2, fragment: average caption length for standard beam search across beam sizes – 8.41, 8.79, 9.18, 9.11.]
[Table 3, fragment: for standard beam search across beam sizes – distinct captions: 9208, 5488, 4150; distinct unigrams: 668, 621, 605; distinct bigrams: 3395, 2778, 2479.]
4.3 Caption length
To analyse differences between the methods, we first look at caption length, shown in Table 2. Standard beam search produces slightly longer captions as the beam size k increases up to 10. All n-gram measures generate longer captions than standard beam search, and length continues to increase as k grows to 100. Length normalisation also increases caption length, but at the cost of BLEU score (see §4.2). Diverse decoding does not increase caption length. The embedding measure produces slightly shorter captions – as it is symmetric, it does not favour long sequences in the way the asymmetric n-gram measures do (see §3.2).
4.4 Caption diversity
Second, we investigate the diversity of the generated captions by counting the number of distinct captions, unigrams, and bigrams (see Table 3). This follows the approach of Li et al. (2016a), Dhingra et al. (2017), and Xu et al. (2017, 2018).
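These diversity statistics are straightforward to compute; a sketch over whitespace-tokenised captions:

```python
def diversity_stats(captions):
    """Count distinct captions, distinct unigrams, and distinct bigrams
    over a collection of generated captions."""
    unigrams = set()
    bigrams = set()
    for caption in captions:
        tokens = caption.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return len(set(captions)), len(unigrams), len(bigrams)
```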
For standard beam search, the number of distinct captions drops as k increases. Both baselines weaken this effect, but the drop is still present. In contrast, range voting maintains caption diversity as k increases, for all similarity measures.
Similarly, standard beam search sees a drop in the number of distinct unigrams and bigrams as k increases, and the baselines do not seem to mitigate this. In contrast, the unigram measures and the embedding measure maintain both unigram diversity and bigram diversity as k increases, while the bigram measures partially maintain bigram diversity.
4.5 Human evaluation
BLEU is known to be imperfect, and does not always match human judgements (Callison-Burch et al., 2006). While the n-gram similarity measures produce similar BLEU scores to standard beam search, they also produce longer captions, which are potentially more informative. To investigate whether they are more informative in a way that is not reflected by BLEU, we took 500 validation images for human evaluation, comparing the captions produced by standard beam search against those produced by our best-performing n-gram measure. Each pair of captions was presented in a random order, together with the original image, and judged on a five-point scale (one caption much better, slightly better, or no difference). The voted caption was rated better 106 times, and worse 73 times. This difference is statistically significant under a two-tailed sign test, discarding ties (Emerson and Simon, 1979). However, restricting attention to captions rated much better, the voted caption was better 27 times and worse 40 times; this is suggestive but not statistically significant.
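The sign test used here is easy to reproduce: after discarding ties, each remaining comparison is treated as a fair coin under the null hypothesis, and the two-tailed p-value sums both binomial tails. A sketch:

```python
from math import comb

def sign_test(wins, losses):
    """Two-tailed sign test p-value, with ties already discarded.

    Under the null hypothesis, each of the n = wins + losses non-tied
    comparisons is a fair coin flip; the p-value is the probability of
    a split at least as unbalanced as the one observed."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For the 106-vs-73 split above, this gives a p-value below 0.05, consistent with the significance claim.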
These results support the claim that a voted caption represents more of the information present in a model’s distribution over captions – this often leads to a better caption, but where the model is wrong, adding wrong information can make the caption much worse. After all, our method is designed as a better way to select from a distribution, not as an improvement to the distribution itself.
5 Conclusion

We have proposed a new method for generating natural language from a language model, by re-ranking the sequences found by beam search. Instead of choosing the most likely sequence, we choose the most representative sequence, formalising representativeness using a similarity measure and range voting.
We have evaluated our method on an image captioning task. Despite using simple similarity measures, we achieve an increase in BLEU score, an increase in caption length and diversity, and statistically significantly better performance in a human evaluation. Unlike standard beam search, the performance of our method does not drop as the beam size increases, removing the sensitivity of results to this hyperparameter. Better similarity measures could further improve results.
Finally, our approach can be applied to any probabilistic language model, without any need for additional training. This opens up many other tasks, including machine translation, summarisation, dialogue systems, and question answering. If multiple outputs can be used (e.g. offering options to a user), our method can be extended to use reweighted range voting (Smith, 2005), a procedure which elects multiple candidates.
References

- Achananuparp et al. (2008). The evaluation of sentence similarity measures. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pp. 305–316.
- Brown et al. (1995). Method and system for natural language translation. US Patent 5,477,451.
- Callison-Burch et al. (2006). Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
- Cao and Clark (2017). Latent variable dialogue models and their diversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
- Cer et al. (2017). SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14.
- Chen et al. (2018). Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2261–2271.
- Dhingra et al. (2017). Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 484–495.
- Emerson and Simon (1979). Another look at the sign test when ties are present: the problem of confidence intervals. The American Statistician, 33(3), pp. 140–142.
- Freitag and Al-Onaizan (2017). Beam search strategies for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 56–60.
- Heckscher (1892). Bidrag til grundlæggelse af en afstemningslære: om methoderne ved udfindelse af stemmerflerhed i parlamenter (afstemning over ændringsforslag m.v.), ved valg og domstole [Contribution to the foundation of a theory of voting: on methods for determining a majority in parliaments, elections, and courts]. Ph.D. thesis, University of Copenhagen.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation, 9(8), pp. 1735–1780.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456.
- Koehn and Knowles (2017). Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 28–39.
- Koehn (2004a). Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Conference of the Association for Machine Translation in the Americas, pp. 115–124.
- Koehn (2004b). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 388–395.
- Kuncheva and Rodríguez (2014). A weighted voting framework for classifiers ensembles. Knowledge and Information Systems, 38(2), pp. 259–275.
- Kuncheva (2004). Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons.
- Lagerspetz (2016). Social Choice and Democratic Values. Springer.
- Li et al. (2016a). A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 110–119.
- Li and Jurafsky (2016). Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.
- Li et al. (2016b). A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
- Lin et al. (2014). Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755.
- Mueller and Thyagarajan (2016). Siamese recurrent architectures for learning sentence similarity. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
- Ott et al. (2018). Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning (ICML).
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Pawar and Mago (2019). Challenging the boundaries of unsupervised learning for semantic similarity. IEEE Access, 7.
- Smith (2000). Range voting.
- Smith (2005). Reweighted range voting – new multiwinner voting method.
- Szegedy et al. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.
- Tideman (1987). Independence of clones as a criterion for voting rules. Social Choice and Welfare, 4(3), pp. 185–206.
- Tideman (2006). Collective Decisions and Voting: The Potential for Public Choice. Routledge.
- Tu et al. (2017). Neural machine translation with reconstruction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
- Vinyals et al. (2015). Show and tell: a neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164.
- Wu et al. (2016). Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xu et al. (2018). Better conversations by modeling, filtering, and optimizing for coherence and diversity. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3981–3991.
- Xu et al. (2017). Neural response generation via GAN with an approximate embedding layer. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 617–626.