Natural Language Generation (NLG), which includes text summarization, machine translation, and related tasks, has received great attention recently. NLG aims to generate high-quality text, but what counts as high quality varies from task to task. For instance, in summarization, evaluation emphasizes faithfulness, i.e., whether the generated text precisely summarizes the given source text, whereas the diversity of outputs is not emphasized. In story generation, by contrast, diversity and interestingness are highly regarded, so the diversity metric becomes a decisive criterion.
Thus, evaluation metrics can differ significantly across NLG tasks. There is, however, a series of NLG tasks that value both diversity and faithfulness. For example, in the data-to-text task, it is necessary to accurately cover the content of the source while avoiding dull, awkward, and repetitive text. Hence, the faithfulness-diversity tradeoff becomes a major challenge for this set of tasks.
For tasks that emphasize diversity and faithfulness equally, generating high-quality text is extraordinarily challenging. The quality of the generated text is closely tied to the decoding strategy. Beam search ranks candidate tokens by probability using a heuristic search and selects high-probability continuations to build sentences. This principle is bound to hurt the diversity metric, and degeneration, i.e., the generation of incoherent, unrelated, and repetitive text, may even appear in certain tasks. Sampling methods can alleviate degeneration to some extent through their various parameter combinations. However, in order to preserve faithfulness, they struggle to generate high-quality text: the output still lacks diversity and fails to present rich variation in syntax and sentence patterns. On the other hand, to further improve diversity, a wide range of decoding strategies known as guided decoding [6, 10, 11] have been proposed; they markedly improve the diversity of the generated text but produce unfaithful outputs. The characteristics of these decoding strategies are illustrated in Fig. 1. It remains unclear how to strike a balance between faithfulness and diversity.
To address the tradeoff between diversity and faithfulness, this paper proposes a new decoding strategy, IFDID (Information Filter upon Diversity-Improved Decoding), which is applied at the decoding stage and focuses on balancing the two. IFDID can be viewed as a two-stage decoding strategy built on an Enhance-Filter framework. In the Enhance stage, the probabilities of some typical tokens related to the source text are increased in order to enhance diversity. This stage is based on the sampling method: as the probabilities of more tokens are raised, the range of random selection is enlarged, leading to more diverse token choices. More decisive tokens receive a larger boost in their probability of being selected, and vice versa. The Filter stage is inspired by the hypothesis that text perceived as human-like should carry an amount of information close to the entropy, i.e., the expected information, of natural language strings; it can be viewed as an entropy filter that aims to ensure faithfulness. A similar idea is embodied in nucleus sampling, which acts as a probability filter in the decoding stage.
Extensive experiments on three generation tasks, namely commonsense generation (CommonGEN), story generation (RocStories), and text generation with a specific style (AdGen), demonstrate that our approach outperforms competitive baselines on several metrics. We evaluate faithfulness and diversity with both automatic and human evaluation. We observe that, compared to beam search and sampling methods, IFDID offers more diverse expressions. In addition, compared with guided decoding strategies represented by gamma sample, IFDID mitigates the loss of faithfulness while promoting diversity, thereby striking a balance between the two.
In summary, our key contributions are as follows. (a) We propose a two-stage decoding strategy, IFDID, which leverages an Enhance-Filter framework to generate high-quality text in NLG tasks that demand both diversity and faithfulness. (b) IFDID is flexible and can fit any degree of tradeoff, in favor of either faithfulness or diversity. (c) We compare IFDID extensively with beam search, sampling methods such as top-k sampling, and guided decoding methods on several popular benchmarks, demonstrating the effectiveness of our method in both English and Chinese. Experiments show that the proposed approach achieves a tradeoff between faithfulness and diversity.
2.1 Diversity-improved approach
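Gamma sample shows that controlling the probabilities of typical tokens is crucial, and it does so through a gamma parameter $\Gamma$:

$$p^{\text{out}} = f_{\Gamma}\left(p^{\text{in}}\right) \tag{1}$$

where $p^{\text{in}}$ is the distribution predicted by the model, $p^{\text{out}}$ is the post-probability distribution, and the function $f_{\Gamma}$ is the gamma sample process, under which the probabilities of typical tokens are monotonically increased. Specifically, the probabilities of typical tokens are changed by maintaining the set of typical tokens of the current decoding step, $\mathcal{A}$, and the set of frozen tokens, $\mathcal{F}$, whose probabilities have been modified in previous steps. $f_{\Gamma}$ is computed as:

$$
\begin{aligned}
p_{\mathcal{A}}^{\text{out}} &= \left(p_{\mathcal{A}}^{\text{in}}\right)^{\tan(\pi\Gamma/2)}, \\
p_{a}^{\text{out}} &= p_{a}^{\text{in}} \cdot \frac{p_{\mathcal{A}}^{\text{out}}}{p_{\mathcal{A}}^{\text{in}}}, \quad \forall a \in \mathcal{A}, \\
p_{n}^{\text{out}} &= p_{n}^{\text{in}} \cdot \left(1 + \frac{p_{\mathcal{A}}^{\text{in}} - p_{\mathcal{A}}^{\text{out}}}{p_{\mathcal{N}}^{\text{in}}}\right), \quad \forall n \notin \mathcal{A} \cup \mathcal{F},
\end{aligned}
\tag{2}
$$

where $p_{\mathcal{A}}^{\text{in}}$ and $p_{\mathcal{N}}^{\text{in}}$ refer to the sums of the probabilities of the tokens in $\mathcal{A}$ and of the tokens neither in $\mathcal{A}$ nor in $\mathcal{F}$, respectively. The quality of the generated text is controlled by modifying the probabilities of these typical tokens.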
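As a minimal sketch of the gamma transformation described above, the following Python function rescales a next-token distribution. It assumes the form recalled from the gamma sample work, in which the total mass of the typical set is raised to the power tan(πΓ/2), so that Γ < 0.5 raises that mass and Γ = 0.5 leaves it unchanged. This exponent, the name `gamma_transform`, the dictionary representation, and the cap that keeps the result a valid distribution are assumptions for illustration, not details confirmed by the text.

```python
import math


def gamma_transform(probs, typical, frozen, gamma):
    """Rescale the mass of `typical` tokens; leave `frozen` tokens untouched.

    probs   : dict mapping token -> probability (sums to 1)
    typical : tokens whose probabilities are modified at this step
    frozen  : tokens already modified at earlier steps
    gamma   : value in [0, 1]; the tan(pi * gamma / 2) exponent is an assumption
    """
    typical = set(typical) - set(frozen)
    free = [t for t in probs if t not in typical and t not in frozen]
    p_a = sum(probs[t] for t in typical if t in probs)
    p_n = sum(probs[t] for t in free)
    if p_a <= 0.0 or p_n <= 0.0:
        return dict(probs)  # no mass to move in either direction

    p_a_out = p_a ** math.tan(math.pi * gamma / 2.0)
    # Guard (assumption): never take more mass than the free tokens hold.
    p_a_out = min(p_a_out, p_a + p_n)

    out = dict(probs)
    for t in typical:
        if t in probs:
            out[t] = probs[t] * p_a_out / p_a
    for t in free:
        out[t] = probs[t] * (1.0 + (p_a - p_a_out) / p_n)
    return out
```

Because the mass gained by the typical set is taken only from tokens that are neither typical nor frozen, the output still sums to one without a separate normalization step.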
Building on decoding and gamma sample, the design of the typical tokens set is highly significant. We propose a novel and effective approach to constructing this set.
Formally, we define the input as a set of input pieces $X = \{x_1, \dots, x_m\}$ and the output as a sequence $Y = (y_1, \dots, y_n)$, where each $y_t$ represents a single token. The principles for selecting typical tokens are given below.
Theme tokens. This set of tokens is selected in order to control the relatedness of the generated text to the input pieces.
Terminal tokens. This set of tokens governs sentence length. Punctuation marks such as ’.’, ’!’, and ’?’, which frequently appear at the end of a sentence, should be included.
Repeated tokens. Duplication is one of the main issues hindering diversity and can lead to the degeneration phenomenon. In order to boost diversity, tokens that have already appeared in the output sequence should be penalized. We collect these tokens into a set of repeated tokens.
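The three selection principles above can be sketched as a small helper that assembles the typical tokens sets at one decoding step. This is only an illustration: the function name `build_typical_sets`, the `near_synonyms` mapping, and the choice to seed the theme set with the input pieces themselves are hypothetical and not taken from the paper.

```python
TERMINAL_TOKENS = {".", "!", "?"}  # sentence-final punctuation


def build_typical_sets(input_pieces, generated_tokens, near_synonyms=None):
    """Return the theme, terminal, and repeated token sets for one step.

    input_pieces     : list of input pieces (e.g. concepts or keywords)
    generated_tokens : tokens already produced in the output sequence
    near_synonyms    : optional dict piece -> related tokens (hypothetical)
    """
    near_synonyms = near_synonyms or {}

    # Theme tokens: tied to the input so the output stays on topic.
    theme = set()
    for piece in input_pieces:
        theme.add(piece)
        theme.update(near_synonyms.get(piece, []))

    # Repeated tokens: anything already generated, to be penalized.
    repeated = set(generated_tokens)

    return {
        "theme": theme,
        "terminal": set(TERMINAL_TOKENS),
        "repeated": repeated,
    }
```

A theme token that has already been generated ends up in both the theme and the repeated set; how such overlaps are resolved is left open in this sketch.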
With this typical tokens set construction approach and the gamma transformation in Sec. 2.1 as the backbone, the probabilities of more typical tokens are modified, and the diversity of the generated text improves. The following are the novel decoding principles proposed in this paper.
Handling extreme probabilities. During decoding, extreme conditions may occur in which the probability of a certain token is approximately equal to 0 or 1, leading to a highly uneven distribution. Since this condition is common at some decoding steps, such cases cannot simply be ignored; instead, this paper retains them and bounds the extreme values. If the probability distribution obtained at the current decoding step contains a token whose probability is closer to 0 or 1 than a preset threshold allows, we set the probability of this token to the value closest to 0 or 1 that does not exceed the threshold. The distribution is then renormalized.
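A minimal sketch of this clipping rule is given below. The threshold value is not recoverable from the text, so the default `eps=1e-4` and the name `clip_extremes` are placeholders chosen purely for illustration.

```python
def clip_extremes(probs, eps=1e-4):
    """Bound probabilities to [eps, 1 - eps], then renormalize.

    probs : dict mapping token -> probability
    eps   : hypothetical threshold; the paper's actual value is not given here
    """
    clipped = {tok: min(max(p, eps), 1.0 - eps) for tok, p in probs.items()}
    total = sum(clipped.values())
    return {tok: p / total for tok, p in clipped.items()}
```

After clipping, no token has probability exactly 0 or 1, so every token remains a candidate in later steps.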
Handling languages in which one input piece carries several pieces of key information. Languages such as Chinese allow multiple tokens to form a word, with each token carrying its own meaning. For instance, ’概率’ means ’probability’, while ’概’ means ’concept’ and ’率’ means ’rate’. To handle this case, we convert these tokens into word embedding vectors, sum them, and take the average as the representation of the whole multi-token input piece. In this way, the information is captured entirely and each token is assigned equal weight.
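The averaging step can be sketched in a few lines. The function name `piece_embedding` and the dictionary `embedding_table` (token to vector) are hypothetical; in practice the vectors would come from the generation model's embedding layer.

```python
def piece_embedding(tokens, embedding_table):
    """Average the embeddings of the tokens that make up one input piece.

    tokens          : list of tokens forming the piece, e.g. ["概", "率"]
    embedding_table : dict token -> list of floats (hypothetical lookup)
    """
    vectors = [embedding_table[tok] for tok in tokens]
    dim = len(vectors[0])
    # Sum component-wise, then divide by the number of tokens.
    summed = [sum(vec[i] for vec in vectors) for i in range(dim)]
    return [value / len(vectors) for value in summed]
```

Dividing by the number of tokens gives every token the same weight regardless of how many tokens the piece contains.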
2.2 Information filter
Based on the hypothesis discussed above, probability alone may not properly capture whether text conveys information in a human-like way, so it is reasonable to apply an information-theoretic strategy in the decoding algorithm. We present an information filter, grounded in information theory, that filters out tokens that do not satisfy the required entropy condition in order to enhance faithfulness. The remaining tokens must carry an amount of information close to the entropy introduced below in order to convey a human-like message.
Given the true probability distribution $p$ of natural language strings, we can compute the information content of a string precisely. Assuming that $q$, the probability distribution predicted by the model, approximates $p$ well, we can use it to estimate the information content of a sentence. The information amount of a text string $Y = (y_1, \dots, y_n)$ is defined as:

$$I(Y) = -\log q(Y) = -\sum_{t=1}^{n} \log q(y_t \mid y_{<t}) \tag{3}$$

where $I(Y)$ refers to the amount of information of $Y$, and $q$ is the distribution predicted to fit the true probability distribution $p$. Subsequently, we denote the expected information amount of a random $Y$ drawn from $q$, also known as the entropy of $q$, as:

$$H(q) = \mathbb{E}_{Y \sim q}\left[I(Y)\right] = -\sum_{Y} q(Y) \log q(Y) \tag{4}$$

At the token level, this can be viewed as the entropy $H\big(q(\cdot \mid y_{<t})\big)$ of the distribution obtained at each decoding step. Given this entropy, what is the principle for selecting a ’human-like token’? We suppose that the information content of the next token to be selected needs to be close to the entropy of the distribution obtained at that step, i.e., the entropy given the previous context, in order to convey human-like information. In other words, the amount of information contained in each token of a sentence plays a decisive role in whether the sentence is human-like. We denote the difference, $\varepsilon$, as:

$$\varepsilon(y_t) = \left| H\big(q(\cdot \mid y_{<t})\big) - I(y_t) \right| \tag{5}$$

where $I(y_t) = -\log q(y_t \mid y_{<t})$ is the information amount contained in $y_t$. The next token $y_t$ selected at each decoding step must satisfy:

$$\varepsilon(y_t) < \tau \tag{6}$$

Here, $\tau$ represents a manually set threshold indicating the maximum permitted difference. Only tokens that satisfy this requirement pass the information filter, while the probabilities of tokens that do not satisfy it are set to 0 and filtered out. After the information filter, the probability distribution is:

$$q'(y_t \mid y_{<t}) = \begin{cases} q(y_t \mid y_{<t}), & \varepsilon(y_t) < \tau \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where $q$ is the pre-probability before the information filter. Normalization is conducted after Eq. (7).
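The filter described in this section can be sketched as follows, under the assumptions that the information content of a token is its negative log-probability, that the entropy is taken over the current next-token distribution, and that the difference is an absolute value compared against a threshold `tau`. The function name `information_filter` and the fallback used when no token passes are illustrative choices, not part of the paper.

```python
import math


def information_filter(probs, tau):
    """Zero out tokens whose information content is far from the entropy.

    probs : dict token -> probability for the current decoding step
    tau   : maximum permitted |entropy - information content|
    """
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0.0)

    # Gap between the step entropy and each token's information content.
    gaps = {
        tok: abs(entropy + math.log(p))  # I(token) = -log p
        for tok, p in probs.items()
        if p > 0.0
    }
    kept = {tok: (p if gaps.get(tok, math.inf) < tau else 0.0)
            for tok, p in probs.items()}

    total = sum(kept.values())
    if total == 0.0:
        # Fallback (assumption): keep the single token closest to the entropy.
        best = min(gaps, key=gaps.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in probs}
    return {tok: p / total for tok, p in kept.items()}
```

A larger `tau` lets more tokens through and behaves more like plain sampling, while a smaller `tau` keeps only tokens whose information content is near the expected value.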
2.3 IFDID: Information filter upon diversity-improved decoding
Having introduced the diversity-improved decoding (DID) approach and the information filter (IF), we show an overview of the two-stage IFDID strategy in Fig. 2, using the story generation task as an example. The left part is the Enhance stage, i.e., the diversity-improved decoding (DID) introduced in Sec. 2.1, which aims to enhance diversity by selecting typical tokens and modifying their probabilities. In this example, the typical set includes ’happy’, ’school’, etc., and their probabilities are modified. The right part is the information filter (IF) of Sec. 2.2, i.e., the Filter stage, which is intended to guarantee faithfulness. ’IM’ refers to information content. The IM of the token ’a’ does not satisfy Eq. (6); hence, it will not be selected. After the IF stage, we randomly sample from the remaining tokens, whose probabilities are nonzero. In this example, ’school’ is ultimately chosen as the next token.
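The two-stage flow of a single decoding step can be sketched as a composition of two callables followed by sampling. The names `ifdid_step`, `enhance`, and `info_filter` are hypothetical, and the two stages are passed in as functions so that the sketch stays independent of how each stage is implemented.

```python
import random


def ifdid_step(probs, enhance, info_filter, rng=random):
    """Run one Enhance-Filter decoding step and sample the next token.

    probs       : dict token -> probability from the language model
    enhance     : callable, dict -> dict (Enhance stage, e.g. DID)
    info_filter : callable, dict -> dict (Filter stage, e.g. IF)
    rng         : object exposing `choices`, such as the random module
    """
    enhanced = enhance(probs)          # stage 1: boost typical tokens
    filtered = info_filter(enhanced)   # stage 2: drop tokens failing the filter

    # Sample only among tokens that survived with nonzero probability.
    candidates = [(tok, p) for tok, p in filtered.items() if p > 0.0]
    if not candidates:
        raise ValueError("no token passed the filter")
    tokens, weights = zip(*candidates)
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.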
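Table 1: Parameters of each decoding strategy on the three tasks.
| decoding | CommonGEN | RocStories | AdGen |
| beam. | = 5 | - | = 5 |
| gamma. | = 0.9 | = 0.7 | = 0.5 |
| | = 0.4 | = 0.4 | = 0.4 |
| | = 0.99 | = 0.99 | = 0.9 |
| | top-k = 350 | top-k = 300 | top-k = 300 |
| IFDID | = 0.1 | = 0.2 | = 0.95 |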
In this section, we explore how effectively our decoding strategy, IFDID, handles the tradeoff between diversity and faithfulness on three general NLG tasks: commonsense generation, story generation, and specific style text generation.
3.1 Experiment Setup
The parameters of each decoding strategy upon three tasks are shown in Table 1.
Evaluation metrics. We assess our decoding strategy with both automatic and human evaluation. We use the following automatic metrics across the three tasks: ROUGE, BLEU, Dist, and perplexity. ROUGE and BLEU represent faithfulness, Dist represents diversity, and perplexity measures fluency. We also conduct a human evaluation, in which reviewers score fluency, grammar, diversity, and faithfulness on a scale from 1 to 3, and interestingness on a scale from 1 to 5. We recruit a group of people to complete the template sheet (the template is available as a Google Doc).
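As an illustration of the diversity metric, the sketch below computes a Dist-n score as the ratio of distinct n-grams to total n-grams over a set of tokenized outputs. The function name `distinct_n` and the corpus-level pooling are assumptions; the exact variant used in the experiments is not specified here.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to all n-grams across tokenized texts.

    texts : list of token lists, one per generated output
    n     : n-gram order (1 for Dist-1, 2 for Dist-2, ...)
    """
    unique = set()
    total = 0
    for tokens in texts:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Higher values indicate less repetition; a score of 1.0 means every n-gram in the outputs is unique.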
3.2 Datasets and Baselines
Commonsense generation. We conduct an experiment on CommonGEN, a publicly recognized constrained commonsense generation task. Given a collection of common concepts, the aim is to produce a coherent sentence describing an everyday scenario using these concepts. We employ the T5 model fine-tuned on CommonGEN (https://huggingface.co/mrm8488/t5-base-finetuned-common_gen), available on the HuggingFace platform. We compare with previously proposed decoding algorithms, including greedy search, beam search, temperature sampling, top-k sampling, and gamma sample. For beam search, we apply the forbid_duplicate_ngrams strategy, i.e., we prohibit any n-gram from appearing twice in a sentence.
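The forbid_duplicate_ngrams constraint can be illustrated with a helper that, given the tokens generated so far, lists the next tokens that would complete an n-gram already present in the prefix. This is a generic sketch of the idea, not the implementation used in the experiments, and the name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(prefix, n):
    """Tokens that would repeat an existing n-gram if generated next.

    prefix : list of tokens generated so far
    n      : n-gram size that must not occur twice
    """
    if n < 1 or len(prefix) < n:
        return set()
    # The last n-1 tokens form the start of the n-gram being extended.
    context = tuple(prefix[len(prefix) - (n - 1):])
    banned = set()
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned
```

During beam search, the probabilities of the returned tokens would be set to zero before the beam is expanded.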
Story generation. Our experiment is carried out on RocStories (the data can be found at https://github.com/yxuansu/SimCTG/tree/main/data/ROCStories) to test story generation ability. The dataset is annotated in a specific form containing 1.5K samples, and the task is to generate stories based on the given keywords. We adopt the available fine-tuned GPT-2 generation model (https://huggingface.co/cambridgeltl/simctg_rocstories) on the HuggingFace platform. For a thorough comparison, we compare our approach with greedy search, top-k sampling, nucleus sampling, and gamma sample.
|decoding||automatic evaluation||human evaluation|
Specific style text generation. We conduct the specific style text generation experiment on AdGen, a Chinese advertising text generation dataset constructed from a Chinese e-commerce platform. The dataset contains 119K pairs of advertisement text and apparel specification tables. Each table holds a collection of attribute-value pairs that describe an item of apparel. The task is to generate text in an advertising style based on the given descriptive information. We use 3127 samples for testing and leverage a model developed in-house at OPPO (https://www.oppo.com/), which is not open-sourced. The baselines include beam search with and without the forbid_duplicate_ngrams strategy, top-k sampling, nucleus sampling, and gamma sample.
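For reference, the two sampling baselines that recur across these tasks, top-k sampling and nucleus sampling, can be sketched as filters over a next-token distribution. The function names and the dictionary representation are illustrative; the experiments rely on existing model toolkits rather than this code.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = set(sorted(probs, key=probs.get, reverse=True)[:k])
    total = sum(probs[tok] for tok in keep)
    return {tok: (probs[tok] / total if tok in keep else 0.0) for tok in probs}


def nucleus_filter(probs, p):
    """Keep the smallest top-ranked set whose mass reaches p, then renormalize."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    keep, mass = set(), 0.0
    for tok in ranked:
        keep.add(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return {tok: (probs[tok] / mass if tok in keep else 0.0) for tok in probs}
```

Both filters truncate the tail of the distribution by probability alone, in contrast to the information filter of Sec. 2.2, which selects tokens by how close their information content is to the entropy.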
3.3 Experiment Results
IFDID provides a better tradeoff between diversity and faithfulness than traditional approaches and guided decoding. Greedy search and beam search show the lowest diversity across all three tasks, and their faithfulness improves when the repetition reduction strategy is added. Sampling approaches, both temperature sampling and nucleus sampling, produce results with high faithfulness on all tasks: their ROUGE and BLEU scores are higher than those of the other decoding strategies, and their diversity lies between that of beam search/greedy search and that of guided decoding. The results of guided decoding also show a consistent tendency, with the highest diversity and the lowest faithfulness, indicating that this decoding algorithm balances diversity and faithfulness poorly. Our proposed IFDID, whose ROUGE and BLEU scores lie between those of traditional strategies and guided decoding, is more faithful than guided decoding and less faithful than traditional strategies. Similarly, in terms of the Dist metric, IFDID falls between guided decoding and traditional techniques, and thus yields the best overall balance.
The results of the human evaluation illustrate the superiority of IFDID in terms of fluency and overall quality: across the three experiments, IFDID outperforms the other decoding strategies on both. IFDID also clearly compensates for the shortcomings of guided decoding in terms of faithfulness. Moreover, IFDID is substantially ahead of all other decoding strategies in interestingness on the story generation task.
3.4 Further Analysis
To further analyze the effect of each IFDID parameter on diversity and faithfulness, we separately vary the parameter that controls the terminal tokens set, the parameter that controls theme tokens, and top-k, which determines the selection range of near-synonyms, on the CommonGEN task. The results are illustrated in Fig. 3. When the terminal-token parameter is set to 0.9, both diversity and faithfulness reach their best results, while both are impaired when the generated text is shorter, i.e., when the value is smaller. The choice of the theme-token parameter plays a decisive role in the tradeoff between diversity and faithfulness: when the value is small, diversity is poor, whereas when the value is large, faithfulness decreases significantly. Finally, it is important to choose a suitable top-k value to ensure faithfulness; in general, the larger the top-k, the higher the diversity.
In this paper, we propose a decoding strategy, IFDID, to obtain a tradeoff between diversity and faithfulness. Our approach is a two-stage decoding strategy built on the Enhance-Filter framework, which balances faithfulness and diversity at the same time. The Enhance stage improves diversity by modifying the probabilities of tokens, whereas the Filter stage ensures faithfulness. Our experiments cover a wide range of tasks and indicate that IFDID is currently the state-of-the-art decoding strategy for pursuing a tradeoff between diversity and faithfulness.
We would like to thank the reviewers for their valuable feedback. We especially thank Tensor Lab of OPPO Research Institute for support with technical discussions. Many thanks also to the people at OPPO Research Institute who helped us with the human evaluation.
-  A. Gatt and E. Krahmer, “Survey of the state of the art in natural language generation: Core tasks, applications and evaluation,” Journal of Artificial Intelligence Research, vol. 61, pp. 65–170, January 2018.
-  I. Mani, Automatic Summarization, vol. 3, John Benjamins Publishing, 2001.
-  W. J. Hutchins and H. L. Somers, “An introduction to machine translation,” 1992.
-  S. Narayan, G. Simões, Y. Zhao, J. Maynez, D. Das, M. Collins, and M. Lapata, “A well-composed text is half done! composition sampling for diverse conditional generation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022, pp. 1319–1339, Association for Computational Linguistics.
-  A. Fan, M. Lewis, and Y. Dauphin, “Hierarchical neural story generation,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, July 2018, pp. 889–898, Association for Computational Linguistics.
-  X. Lu, S. Welleck, P. West, L. Jiang, J. Kasai, D. Khashabi, R. L. Bras, L. Qin, Y. Yu, R. Zellers, N. A. Smith, and Y. Choi, “NeuroLogic a*esque decoding: Constrained text generation with lookahead heuristics,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, July 2022, pp. 780–799, Association for Computational Linguistics.
-  B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren, “CommonGen: A constrained text generation challenge for generative commonsense reasoning,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online, Nov. 2020, pp. 1823–1840, Association for Computational Linguistics.
-  T. C. Ferreira, C. Lee, E. Miltenburg, and E. Krahmer, “Neural data-to-text generation: A comparison between pipeline and end-to-end architectures,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, Nov. 2019, pp. 552–562, Association for Computational Linguistics.
-  A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in International Conference on Learning Representations, 2020.
-  N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, “CTRL: A conditional transformer language model for controllable generation,” 2019.
-  M. Ghazvininejad, X. Shi, J. Priyadarshi, and K. Knight, “Hafez: an interactive poetry generation system,” in Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, July 2017, pp. 43–48, Association for Computational Linguistics.
-  C. Meister, G. Wiher, T. Pimentel, and R. Cotterell, “On the probability–quality paradox in language generation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, May 2022, pp. 36–45, Association for Computational Linguistics.
-  N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen, “A corpus and cloze evaluation for deeper understanding of commonsense stories,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 2016, pp. 839–849, Association for Computational Linguistics.
-  Z. Shao, M. Huang, J. Wen, W. Xu, and X. Zhu, “Long and diverse text generation with planning-based hierarchical variational model,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, Nov. 2019, pp. 3257–3268, Association for Computational Linguistics.
-  S. Wu and M. Sun, “Sampling with attribute-related information for controlling language models,” arXiv preprint arXiv:2205.06036, 2022.
-  A. F. Frank and T. F. Jaeger, “Speaking rationally: Uniform information density as an optimal strategy for language production,” in CogSci, 2008.
-  C. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, Barcelona, Spain, July 2004, pp. 74–81, Association for Computational Linguistics.
-  K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311–318, Association for Computational Linguistics.
-  J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan, “A diversity-promoting objective function for neural conversation models,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 2016, pp. 110–119, Association for Computational Linguistics.
-  D. Jurafsky and J. H. Martin, Speech and Language Processing, Pearson Education India, 2000.
-  C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 1, Jan. 2020.
-  J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao, “Deep reinforcement learning for dialogue generation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, Nov. 2016, pp. 1192–1202, Association for Computational Linguistics.
-  A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, pp. 9, 2019.
-  A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. 2020, OpenReview.net.