Dissecting the components and factors of Neural Text Generation

Neural text generation has metamorphosed into several critical natural language applications, ranging from text completion to free-form narrative generation. Generating natural language has fundamentally been a human attribute, and the advent of ubiquitous NLP applications and virtual agents marks the need to impart this skill to machines. There has been a colossal research effort on various frontiers of neural text generation, including machine translation, summarization, image captioning and storytelling. We believe that this is an excellent juncture to reflect on the directions of the field. Specifically, this paper surveys the fundamental factors and components that have task agnostic impact across generation tasks such as storytelling, summarization and translation. We present an abstraction of the essential techniques with respect to learning paradigms, pretraining, modeling approaches, decoding and the key challenges. Thereby, we hope to deliver a one-stop destination for researchers in the field to facilitate a perspective on where to situate their work and how it impacts other closely related tasks.



1 Introduction

Text generation is the task of producing written or spoken narrative from structured or unstructured data. The overarching goal is seamless human-machine communication by presenting a wealth of data in a way we can comprehend. With respect to modeling approaches, there are three main paradigms for generating text, based on the schema of input and output: (i) Text-to-Text, (ii) Data-to-Text and (iii) None-to-Text. Table 1 presents the categorization of different tasks based on these paradigms. These tasks deserve undivided attention, and accordingly they have been heavily dissected, studied and surveyed in the recent past. For instance, independent and exclusive surveys are periodically conducted on summarization Lin and Ng (2019); Allahyari et al. (2017); Nenkova and McKeown (2012); Tas and Kiyani, knowledge to text generation Gardent et al. (2017); Koncel-Kedziorski et al. (2019), machine translation Chu and Wang (2018); Dabre et al. (2019); Chand (2016); Slocum (1985), dialog response generation Liu et al. (2016); Montenegro et al. (2019); Ramesh et al. (2017); Chen et al. (2017), storytelling and narrative generation Tong et al. (2018); Togelius et al. (2011), and image captioning Hossain et al. (2018), to dig deeper into task specific approaches that are foundational as well as at the bleeding edge of research. While these surveys are necessary, techniques that also benefit other tightly coupled tasks are often overlooked. The goal of this survey is to focus on the key components that are task agnostic, in order to improve the ensemble of tasks in neural text generation.

| Paradigm | Task | Input | Output |
|---|---|---|---|
| Text-to-Text | Dialog | Conversation History | Next Response |
| | Machine Translation | Source Language | Target Language |
| | Style Transfer | Style 1 Text | Style 2 Text |
| | Summarization | Single/Multiple Documents | Summary |
| Data-to-Text | Image Captioning | Image | Descriptive Text |
| | Visual Storytelling | Images | Descriptive Text |
| | Speech Recognition | Audio | Text |
| | Table to Text | Table | Text |
| | Knowledge Bases to Text | Knowledge Bases | Text |
| None-to-Text | Language Modeling | Null | Sequence of Text |
Table 1: Paradigms of Tasks in Text Generation. For the purposes of compactness, we include ’Knowledge-to-text’ paradigm within ’Data-to-text’.
Figure 1: Components and factors in approaches for Text Generation. Components are laid out in grey boxes and factors are laid out on the left in dotted lines.

There have been several studies surveying text generation. Perera and Nand (2017) present a detailed overview of information theory based approaches. Iqbal and Qureshi (2020) primarily focus on core modeling approaches, especially VAEs Kingma and Welling (2014) and GANs Goodfellow et al. (2014). Gatt and Krahmer (2018) elaborated on tasks such as captioning and style transfer, with a primary focus on data-to-text tasks. The controllability aspect is explored by Prabhumoye et al. (2020). The work closest to this is by Lu et al. (2018), who perform an empirical study on the core modeling approaches only. In contrast to these, this paper focuses on task agnostic components and factors capable of pushing the ensemble of tasks forward. Figure 1 presents the various components and factors that are important to study in neural text generation and that are elaborated in this paper.

2 Modeling Approaches

2.1 Core Modeling Paradigms

Supervised Learning:

Most generation approaches in this setting train sequence generation with a maximum likelihood objective, implemented as a sequential multi-class cross entropy, i.e., a cross-entropy loss over the vocabulary at each time step.

However, when teacher forcing is used during training, there is an inherent inconsistency in exposure to ground truth text between the training and inference stages. This leads to the problem of exposure bias Ranzato et al. (2016). During training, the token at the current time step is predicted conditioned on the ground truth prefix, which corrects the course irrespective of what word the model predicted. During inference, however, the prediction is conditioned on the generated prefix, since no ground truth prefix is available. This problem becomes more severe as the length of the output increases. A solution to address this issue is scheduled sampling Bengio et al. (2015), which mixes teacher forced embeddings and model predictions from the previous time step.
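The mixing step in scheduled sampling can be illustrated with a minimal sketch. The function names, the `teacher_prob` parameter and the toy predictor below are hypothetical and chosen only for illustration; a real system would mix embeddings inside the network and anneal the probability over training.

```python
import random

def scheduled_sampling_inputs(gold_tokens, predict_next, teacher_prob, rng, bos="<s>"):
    """Build decoder inputs for one training sequence.

    At each step, the previous-token input is the ground truth token with
    probability `teacher_prob`, and the model's own prediction otherwise.
    """
    inputs = [bos]
    for gold_prev in gold_tokens[:-1]:
        model_prev = predict_next(inputs)        # model's guess given its inputs so far
        use_gold = rng.random() < teacher_prob   # coin flip per time step
        inputs.append(gold_prev if use_gold else model_prev)
    return inputs

def toy_predict_next(prefix):
    """Stand-in for a trained model (assumption): always predicts 'the'."""
    return "the"

gold = ["a", "cat", "sat"]
rng = random.Random(0)
pure_teacher_forcing = scheduled_sampling_inputs(gold, toy_predict_next, 1.0, rng)
pure_free_running = scheduled_sampling_inputs(gold, toy_predict_next, 0.0, rng)
```

With `teacher_prob` set to 1.0 this reduces to teacher forcing, and with 0.0 the model is conditioned only on its own predictions, as at inference time.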

Reinforcement Learning:

The main issue with the supervised learning approach for text generation is the mismatch between the maximum likelihood objective that is optimized and the metrics used for text quality. Reinforcement learning addresses this mismatch by directly optimizing end metrics, which may be non-differentiable. Typically, policy gradient algorithms are used to optimize for BLEU score directly via REINFORCE. The objective is the sum of the log probabilities of the generated tokens multiplied by the reward score. The reward score itself is computed as the expected BLEU score.
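One common way to write this objective is sketched here. The notation is assumed for illustration and does not come from the text: x is the input, y the reference, ŷ a sequence sampled from the policy p_θ, and r a reward such as BLEU.

```latex
\mathcal{L}_{\mathrm{RL}}(\theta)
  = -\,\mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid x)}\big[\, r(\hat{y}, y) \,\big],
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta)
  \approx -\, r(\hat{y}, y) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(\hat{y}_t \mid \hat{y}_{<t}, x)
```

The right-hand expression is the single-sample estimate: the log probabilities of the sampled tokens are summed and scaled by the reward of the whole sequence.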


However, computing BLEU before every update is computationally expensive to incorporate in the training procedure. Another problem is the inherent inadequacy of the metric itself, i.e., BLEU is not the best measure to evaluate text quality. In practice, the policy network is usually pre-trained with the maximum likelihood objective before optimizing for BLEU score.

Adversarial Learning:

The third paradigm is adversarial learning, which comprises competing objectives. The mismatch between training and inference stages is addressed by Professor Forcing Lamb et al. (2016), which uses adversarial domain adaptation to bring the behavior of the network during training and sampling close to each other. This is done by sharing the parameters between the teacher forcing network and the free running network. Apart from this, the two main components are a generator and a discriminator. The discriminator is optimized to correctly classify a sequence as belonging to free running behavior or teacher forced behavior. The generator has two goals: (i) maximize the likelihood of the data and (ii) fool the discriminator. There are two options, depending on which behavior is kept fixed while the other is brought closer to it: either the teacher forcing network or the free running network can serve as the fixed one. Empirically, professor forcing also plays the role of a regularizer. Generative Adversarial Networks (GANs) have also gained popularity in this context in recent times. The core idea is that the gradient of the discriminator guides how to alter the generated data, and by what margin, in order to make it more realistic. Such a slight change is meaningful for continuous values, but not for language, which is a discrete space. Several variants have been adopted to address specific problems, such as SeqGAN to assess partially generated sequences Yu et al. (2017), MaskGAN to improve sample quality using text filling Fedus et al. (2018) and LeakGAN to model long term dependencies by leaking discriminator information to the generator Guo et al. (2018). The three main challenges researched in this area are:

Discrete Sampling: The sampling step that selects the argmax in language is a non-differentiable function. One solution is to replace it with a continuous approximation by adding Gumbel noise, which is the negative log of the negative log of a sample from a uniform distribution. This is also known as Gumbel Softmax.

Mode Collapse: GANs typically face the issue of sampling only specific tokens to cheat the discriminator, known as mode collapse. In this way, only a subspace of the target distribution is learnt by the generator. DP-GAN addresses this using an explicit diversity promoting reward Xu et al. (2018b).

Power dynamics between Generator and Discriminator: Another problem arises when the discriminator is trained faster than the generator. This is most often the case, and the gradient from the discriminator then vanishes, leading to no real update to the generator.
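The Gumbel noise described under Discrete Sampling can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the `temperature` parameter are assumptions for the example, and a real implementation would operate on tensors inside an autodiff framework.

```python
import math
import random

def gumbel_noise(rng):
    """Draw g = -log(-log(u)) with u sampled from Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(logits, temperature, rng):
    """Continuous relaxation of sampling a one-hot token from `logits`."""
    noisy = [(logit + gumbel_noise(rng)) / temperature for logit in logits]
    return softmax(noisy)

rng = random.Random(0)
relaxed_sample = gumbel_softmax([2.0, 1.0, 0.1], temperature=0.5, rng=rng)
```

Lower temperatures push the output closer to a one-hot vector, while higher temperatures make it smoother; the output stays a valid probability vector in both cases.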

2.2 Pre-training

Recent years have seen a major surge of interest in pre-training techniques. While they are primarily focused on language understanding tasks, there has been some work targeted at pre-training for generation as well. UniLM (UNIfied pre-trained Language Model, Dong et al. (2019a)) is proposed as a pre-training mechanism for both natural language understanding and natural language generation tasks. Fundamentally, the previously widely used ELMo Peters et al. (2018) constitutes a language model that is left to right and right to left. While GPT Radford et al. has an autoregressive left to right language model, BERT Devlin et al. (2019) has a bidirectional language model. UniLM is optimized jointly for all of the above objectives along with an additional new seq2seq LM, which is bidirectional encoding followed by unidirectional decoding. Depending on the use case, UniLM can be adapted to use a unidirectional LM (left to right), a bidirectional LM (attention on all tokens) or a seq2seq LM (attention on all tokens in the previous segment and on the left context in the current segment).

With a similar goal in mind, MASS Song et al. (2019) modified the masking patterns in the input. BERT and XLNet Yang et al. (2019) pre-train an encoder and GPT pre-trains a decoder, whereas MASS is a framework introduced to pre-train the encoder, attention and decoder together. The encoder receives the input with a consecutive fragment of length k masked, and the decoder predicts that same fragment of length k while every other token on the decoder side is masked. While the idea of jointly training the encoder-attention-decoder remains the same as in UniLM, the interesting contribution here is the way masking is utilized to bring out the following advantages. (i) The tokens masked in the decoder are the tokens that are not masked in the encoder. This complementary masking encourages joint training of the encoder and decoder. (ii) The encoder supports the decoder by extracting useful information from the unmasked tokens in order to predict the masked fragment, which improves the understanding or NLU capabilities of the model. (iii) Since a sequence of length k is decoded consecutively, the NLG capability is improved as well. Note that when k is 1, the model is closer to BERT, which is biased towards an encoder, and when k is the length of the sentence, the model is closer to GPT, which is biased towards a decoder.

Similar to UniLM, BART Lewis et al. (2019) has a bidirectional encoder and an autoregressive decoder. The underlying model is the standard transformer Vaswani et al. (2017) based neural MT framework. The main difference between BART and MASS is that the tokens masked here are not necessarily consecutive. The second difference, which is also the main idea, is to corrupt text with arbitrary noise and reconstruct the original text. The input is corrupted with the following transformations: token masking, token deletion, text infilling, sentence permutation and document rotation. Following this, Raffel et al. (2019) proposed T5 as a unifying framework that casts all NLP problems as text generation tasks with a text-in and text-out paradigm. Recently, Dathathri et al. (2020) introduced plug and play language models, which efficiently train a small number of parameters to control a huge underlying pretrained model. Finetuning these vast models for generative tasks has been studied in style transfer Sudhakar et al. (2019) and conversational agents Dinan et al. (2019).
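The complementary masking used by MASS can be made concrete with a simplified sketch. The helper name `mass_mask`, the `[MASK]` symbol and the token names are hypothetical; the layout (decoder inputs shifted right within the fragment) is an assumption about how the fragment is fed to the decoder, and details such as fragment sampling and subword handling are omitted.

```python
MASK = "[MASK]"

def mass_mask(tokens, start, k):
    """Return (encoder_input, decoder_input, decoder_target) for a fragment
    of length k beginning at index `start`."""
    end = start + k
    # Encoder side: the fragment is hidden, everything else is visible.
    encoder_input = [MASK if start <= i < end else tok
                     for i, tok in enumerate(tokens)]
    # Decoder side: only the fragment is predicted.
    decoder_target = tokens[start:end]
    # Decoder inputs are masked everywhere except inside the fragment, where
    # each position sees the previous fragment token (shifted right).
    decoder_input = [MASK] * len(tokens)
    for i in range(start + 1, end):
        decoder_input[i] = tokens[i - 1]
    return encoder_input, decoder_input, decoder_target

sentence = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_in, dec_in, dec_out = mass_mask(sentence, start=2, k=4)
```

In this example the tokens visible to the encoder (x1, x2, x7, x8) are exactly the ones hidden from the decoder, which is the complementary pattern described in point (i).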

2.3 Decoding Strategies

The natural next step after pre-training and training is decoding. The distinguishing characteristic of generation is the absence of a one to one correspondence between the time steps of the input and the output, which makes decoding a crucial component. Decoding strategies can primarily be categorized as (i) autoregressive and (ii) non-autoregressive.

Autoregressive decoding:

Traditional models with this strategy correspond well to the true distributions of words. This mainly comes from respecting the conditional dependence property from left to right. The autoregressive techniques can be further viewed as sampling and search techniques. The main disadvantage of this strategy is that it throttles transformer based models: their training can be non-sequential, but inference with autoregressive decoding remains sequential, so the speed advantage seen in training is not replicated at inference.

Non-autoregressive decoding:

Two aspects characterize this line of work relative to autoregressive decoding. First, by definition, a conditional independence property holds between the output time steps. This leads to the multimodality problem, where each time step considers different variants with respect to the entire sequence and these conditions compete with each other. Second, the main advantage is the reduction in latency during real time generation. Guo et al. (2020) addressed this problem in the context of neural machine translation using transformers by copying each of the source inputs to the decoder, either uniformly or repeatedly based on their fertility counts. This is done to address varying sequence lengths between source and target texts. These fertilities are predicted using a dedicated neural network to reduce the unsupervised problem to a supervised one, thereby enabling fertility to be used as a latent variable. These invariable replications based on fertilities may lead to duplication of words. Closely following this, van den Oord et al. (2018) took a different approach by introducing probability density distillation: a convolutional neural network is modified so that a pre-trained teacher network scores a student network, which attempts to minimize the KL divergence between itself and the teacher network. Both these works set the trend of using latent variables to capture the interdependence between different time steps in the decoder. Following this work, Lee et al. (2018) use iterative refinement by denoising the latent variables at each of the refinement steps. This idea of iterative decoding paved the way to more avenues that combine it with the benefits of cloze style mask prediction objectives from BERT Devlin et al. (2019). Some of them include insertion based techniques Gu et al. (2019), repeated masking and regenerating Ghazvininejad et al. (2019) and providing model predictions to the input Ghazvininejad et al. (2020).

Wang et al. (2019) proposed an alternative approach to address repetition (observed in Guo et al. (2020)) and completeness using regularization terms for each. Repetition is handled by regularizing similarity between consecutive words. Completeness is addressed by enabling reconstruction of source sentence from hidden states of the decoder, based on the duality of translation tasks between source to target and target to source. Concurrently, Guo et al. (2019) also address these issues by improving the inputs to decoder using additional phrase table information and sentence level alignment between source and target word embeddings.

Sampling and Search Techniques:

1. Random Sampling:

The words are sampled randomly based on the probability from the entire distribution without pruning any of the mass.


2. Greedy Decoding:

This technique simply boils down to selecting the argmax of the probability distribution at each time step. Selecting the argmax everywhere limits the diversity of generation. Note that this may not result in the best output, as there may be an alternate hypothesis along a path that does not select the most probable word at every time step.


A major disadvantage of greedy decoding is that there is no mechanism to correct the course if a mistake is made. Errors therefore accumulate over the following time steps. The resulting text is also monotonous and predictable. This is alleviated by beam search and the techniques that follow. Greedy decoding has also been worked out for discrete settings using gumbel-greedy decoding Gu et al. (2018). Variants of this were also studied by Zarrieß and Schlangen (2018).

3. Beam Search:

Beam search introduces a course correction mechanism in the approximation of the argmax by keeping a number of hypotheses equal to the beam size at each time step. When the beam size is 1, this is the same as greedy decoding, and when the beam size is the size of the vocabulary, it is computationally very expensive. It has been relatively well studied in task agnostic objectives Wang et al. (2014), including, for instance, social media text Wang and Ng (2013) and error correction Dahlmeier and Ng (2012). Small beam sizes may lead to ungrammatical sentences, which get more grammatical with increasing beam size. Similarly, outputs from small beam sizes may be less relevant with respect to content, but they get more generic with increasing beam size. There are several varieties within beam search:

(a) Noisy Parallel Approximate Decoding: This method Cho (2016) introduces some noise in each hidden state to non-deterministically make it slightly deviate from argmax.

(b) Beam Blocking: Repetition is one of the problems we see in NLG, and this technique Paulus et al. (2018) combats it by blocking repeated n-grams. It essentially adjusts the probability of any repeated n-gram to 0.

(c) Iterative Beam Search: In order to explore a more diverse search space, another technique Kulikov et al. (2019) was introduced to iteratively perform beam search several times. At each time step of the current iteration, all of the partial hypotheses encountered up to that time step in the previous iterations are avoided, based on soft or hard decisions on how to include or exclude these beams.

(d) Diverse Beam Search: One problem with beam search is that most of the time the decoded sequences still tend to come from a few highly significant beams, thereby suppressing diversity. The modification by Vijayakumar et al. (2016) adds a diversity penalty, computed (for example using Hamming distance) between the current hypothesis and the hypotheses in the groups, to readjust the scores for predicting the next word.

(e) Clustered Beam Search: The goal is to prune unnecessary beams. At each time step, Tam (2020) get the top 2b candidates and embed them using averaged GloVe representations. These are clustered using k-means to get k clusters. The top b/k candidates are then picked from each cluster to get b candidates in total for that time step.

(f) Clustering Post Decoding: The above approaches modify the decoding step itself. This technique Kriz et al. (2019) clusters after decoding is done. Sentence representations of the outputs from any of the diversity promoting beam search variants are obtained. These are then clustered, and the sentence with the highest log likelihood is selected from each cluster.
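To make the basic procedure and the blocking variant in (b) concrete, the following is a minimal sketch of beam search with optional n-gram blocking. The interface (`next_log_probs` returning a token-to-log-probability dictionary) and both toy models are hypothetical and exist only for illustration; scores are raw sums of log probabilities with no length normalization.

```python
import math

def has_repeated_ngram(tokens, n):
    """True if any n-gram occurs more than once in `tokens`."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False

def beam_search(next_log_probs, beam_size, max_len, eos, block_ngram=0):
    """Return the (tokens, score) hypothesis with the highest summed log probability."""
    beams, finished = [([], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, log_p in next_log_probs(tokens).items():
                extended = tokens + [tok]
                if block_ngram and has_repeated_ngram(extended, block_ngram):
                    continue  # same effect as setting its probability to 0
                candidates.append((extended, score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1]) if finished else None

def toy_model(prefix):
    """Toy distribution (assumption) where the greedy first choice is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    dist = table.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

def repeating_model(prefix):
    """Toy distribution (assumption) that strongly prefers repeating 'a'."""
    return {"a": math.log(0.9), "</s>": math.log(0.1)}

greedy_tokens, _ = beam_search(toy_model, beam_size=1, max_len=5, eos="</s>")
beam_tokens, _ = beam_search(toy_model, beam_size=2, max_len=5, eos="</s>")
blocked_tokens, _ = beam_search(repeating_model, beam_size=1, max_len=5,
                                eos="</s>", block_ngram=1)
```

With a beam size of 1 the search commits to "a" and ends with total probability 0.3, whereas a beam size of 2 recovers the path through "b" with probability 0.36. In the last call, unigram blocking stops the repeating model after a single "a".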

4. Top-k sampling:

This technique by Fan et al. (2018) randomly samples from the k most probable candidates from this distribution. This means that we are confining the model to select from a truncated probability mass.


If k is the size of the vocabulary, then this is random sampling, and if k is 1, then it is greedy decoding. A high value of k results in riskier word choices that are less monotonous, while a low value of k results in safe outputs that are monotonous. The problem, however, is that k is fixed to the same value in all scenarios.
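A minimal sketch of top-k sampling over a single next-token distribution is given below. The dictionary `probs` and the function names are hypothetical; a real decoder would apply the same truncation to the model's output distribution at every time step.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probability mass."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
rng = random.Random(0)
greedy_like = top_k_filter(probs, 1)          # k = 1: only the argmax survives
truncated = top_k_filter(probs, 2)            # k = 2: sample among "the" and "a"
full = top_k_filter(probs, len(probs))        # k = |V|: plain random sampling
token = sample(truncated, rng)
```

The two extremes in the text correspond to `k = 1` and `k = len(probs)` in this sketch.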

5. Top-p sampling:

The aforementioned problem of a fixed value of k is addressed by top-p sampling, also known as nucleus sampling Holtzman et al. (2020). Instead of discarding an unspecified amount of probability mass as in top-k sampling, the emphasis is shifted to the amount of probability mass preserved. This handles scenarios where there is sometimes a broad set of reasonable options and sometimes a narrow one. It is achieved by selecting a dynamic number of words, in decreasing order of probability, until their cumulative probability attains a threshold value.
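The dynamic cut-off can be sketched as follows. The function name and the example distributions are hypothetical; sampling from the returned nucleus would proceed as in ordinary random sampling.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches the threshold p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {tok: prob / cumulative for tok, prob in nucleus}

peaked = {"the": 0.9, "a": 0.05, "cat": 0.03, "dog": 0.02}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
narrow_nucleus = top_p_filter(peaked, 0.8)   # one confident option is enough
broad_nucleus = top_p_filter(flat, 0.8)      # many options are needed
```

With the same threshold, the peaked distribution keeps a single token while the flat one keeps all four, which is the adaptivity that a fixed k cannot provide.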


3 Key Challenges

For each of the challenges, this section provides a list of solutions. The pitfalls of these solutions are also described, thereby encouraging research to address these key challenges.

1. Fluency:

There are a couple of factors that are detrimental to the fluency of generated text: repetition and lack of coherence.

Solution - Beam blocking: Blocking beams that contain previously generated n-grams from subsequent generation combats repetition and encourages diversity. There are multiple options to perform this, including cutting the beam stream or selecting from the rest of the n-grams Klein et al. (2017); Paulus et al. (2018).

- Problem: However, beams with natural repetition, which humans use for instance to emphasize something, are sometimes blocked as well. Selecting the number of beams is also often a problem, since it is natural for function words to repeat more often.

- Solution to problem: Massarelli et al. (2019) extensively studied variants of beam blocking, also referred to as n-gram blocking, introduced by applying delays in beam search.

Solution - Unlikelihood objective: Welleck et al. (2020) argue that there is a fundamental flaw in the likelihood objective. The main idea is to decrease the probability of unlikely or negative candidates. The negative candidates are selected from the previous contexts, either at the token level or at the sequence level, and are essentially n-grams. This way, likelihood and unlikelihood are optimized simultaneously, discouraging the repetition of previous outputs.

- Problem: This may not seem a major issue, however, selecting negative contexts is tricky and needs to be beyond selection of simple n-gram sequences that occurred previously.
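As an illustration, the token-level form of the unlikelihood objective can be written as below. The notation is assumed here: x_t is the target token at step t, C^t is the set of negative candidates taken from the previous context, and α weights the unlikelihood term.

```latex
\mathcal{L}^{t}_{\mathrm{UL}}(\theta)
  = -\,\alpha \sum_{c \in \mathcal{C}^{t}} \log\big(1 - p_\theta(c \mid x_{<t})\big)
    \;-\; \log p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{C}^{t} = \{x_1, \ldots, x_{t-1}\} \setminus \{x_t\}
```

The first term pushes probability away from previously generated tokens, and the second is the usual likelihood term, so both are optimized together.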

Solution - Coverage penalty: This discourages the attention mechanism from attending to the same word repeatedly See et al. (2017). For each time step in the source, the attention weights it receives are accumulated across the time steps of the decoded output. If the accumulated attention reaches 1, that source time step is considered covered and its coverage penalty is log(1), which is 0. Otherwise, the coverage penalty is the log of the attention probability mass accumulated on that source time step.
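The log-based description above matches the form of the coverage penalty used in Wu et al. (2016); the sketch below follows that form as an assumption, and it differs from the coverage loss defined in See et al. (2017). The matrix layout, the function name and the default `beta` are illustrative choices.

```python
import math

def coverage_penalty(attention, beta=0.2):
    """attention[j][i] is the weight on source position i at target step j.

    Each source position contributes log(min(total attention received, 1)),
    which is 0 once the position is fully covered and negative otherwise.
    """
    num_source = len(attention[0])
    penalty = 0.0
    for i in range(num_source):
        mass = sum(step[i] for step in attention)
        mass = max(min(mass, 1.0), 1e-12)  # clip at 1 and avoid log(0)
        penalty += math.log(mass)
    return beta * penalty

fully_covered = coverage_penalty([[1.0, 0.0], [0.0, 1.0]])
under_covered = coverage_penalty([[0.9, 0.1], [0.9, 0.1]])
```

In the second call the first source position is over-attended and contributes nothing, while the second position has only 0.2 of accumulated attention and is penalized.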

Solution - Static and Dynamic Planning: This addresses coherence in terms of the layout or structural organization of the text Yao et al. (2019). A schema of static or dynamic plans is used to form an abstract flow of the text, from which the actual text is realized.

- Problem: However, underlying language models are capable of taking over, leading to hallucinations and thereby compromising the fidelity of text.

2. Length of Decoding:

One factor that distinguishes generation from the rest of the seq2seq family of tasks is the variability in the length of the generated output. The main problem here is that as the length of the sequence increases, the sum of the log probability scores decreases. This means that models prefer shorter hypotheses. Some solutions to combat this problem are the following.

Solution - Length Normalization or Penalty: The score of the generated output is normalized by dividing it by the length. Wu et al. (2016) explore a different variation of the normalization constant. This is fairly standard when the dataset has high variance in lengths.
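The effect of normalization can be seen in a small sketch. The first function divides by length directly; the second uses the length penalty form from Wu et al. (2016), with `alpha` as a tunable exponent. The function names and the example scores are hypothetical.

```python
def length_normalized(log_prob_sum, length):
    """Average log probability per generated token."""
    return log_prob_sum / length

def gnmt_length_penalty(length, alpha=0.6):
    """Length penalty of the form ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(log_prob_sum, length, alpha=0.6):
    """Score of a hypothesis after dividing by the length penalty."""
    return log_prob_sum / gnmt_length_penalty(length, alpha)

# Hypothetical hypotheses: 2 tokens at -1.0 each versus 6 tokens at -0.5 each.
short_sum, short_len = -2.0, 2
long_sum, long_len = -3.0, 6
raw_prefers_short = short_sum > long_sum
normalized_prefers_long = (length_normalized(long_sum, long_len)
                           > length_normalized(short_sum, short_len))
```

The raw sum favors the short hypothesis even though the longer one has a better per-token score; dividing by length reverses the preference.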

Solution - Probability Boosting: This technique multiplies the probability with a fixed constant at every time step. This alleviates the diminishing score problem.

Solution - Bias: Incorporate bias in the model based on empirical relations on lengths in source and target sentences in the training data.

3. Content Selection:

Certain tasks demand copying over details from the input, such as rare proper nouns in news articles. This is especially needed in tasks like summarization, which can demand a combination of extractive and abstractive techniques.

Solution - Copy Mechanism: The copy mechanism can take various forms, such as pointing to unknown words Gulcehre et al. (2016) based on attention See et al. (2017), or a joint or conditional copy mechanism Gu et al. (2016); Puduppully et al. (2019). It may be based on attention that copies segments from the input into the output. The problem is that this technique sometimes degenerates from a combination of extractive and abstractive behavior into a largely extractive system.

Solution - Hierarchical Modeling: This technique maintains a global account of the content. This is often modeled using hierarchical techniques or dual stage models Martin et al. (2018); Xu et al. (2018a); Gehrmann et al. (2018) where the first stage pre-selects relevant keywords for generation in the following stage.

- Problem: Such models possibly take a hit on fluency while connecting dots between selected content and generation. This means that Rouge-1 can be good because the right words are extracted but Rouge-2 may decrease as it affects the fluency.

4. Optimization Objective:

Similar to the observation earlier in Section 2, there is an inherent mismatch between the objective function, which is maximum likelihood, and the end metrics, such as BLEU and ROUGE.

Solution - Reinforcement Learning: A common solution for this problem is using reinforcement learning to optimize end metrics such as Rouge. Often, a combination of MLE and RL objectives are used Hu et al. (2020); Wang et al. (2018).

- Problem: However, this is still a problem, since these end metrics do not directly correlate with human judgements. Hence optimizing for BLEU or ROUGE does not ensure human quality text.

Solution - Maximum Mutual Information: The idea is to incorporate pairwise information of source and target instead of only one direction which is usually target given source Li et al. (2016). The target probability is subtracted from target given source probability to diminish the probability of generic sentences. A viable extension to this is conditioning on personality for consistency.
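One common way to express this decoding objective is sketched below. The notation is assumed for illustration: S is the source, T a candidate target, and λ a weight controlling how strongly generic targets are penalized.

```latex
\hat{T} = \arg\max_{T} \big\{ \log p(T \mid S) - \lambda \log p(T) \big\}
```

A target that is highly probable regardless of the source has a large log p(T), so subtracting it lowers the score of generic sentences.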

Solution - Distinguishability: Hallucinations in abstractive generation are unwanted byproducts of optimizing log loss. To combat this, several researchers explored optimizing for minimal distinguishability from human generated text Hashimoto et al. (2019); Theis et al. (2016). Following a similar path, Kang and Hashimoto (2020) proposed truncating the loss to get rid of unwanted samples.

5. Speed:

Practical applications call for generating text in real time, without lag in decoding, in addition to chasing state of the art results. Model compression plays a crucial part in increasing the speed of generation. Cheng et al. (2017) exhaustively surveyed the different techniques to perform model compression. While there are techniques on the hardware side, there are certain modeling approaches that can handle this problem as well Gonzalvo et al. (2016). Most of this work is studied in the context of real time interpretation of speech Fügen et al. (2007); Yarmohammadi et al. (2013); Grissom II et al. (2014). Recently, Deng and Rush (2020) proposed a cascaded decoding approach introducing Markov Transformers to demonstrate high speed and accuracy.

Quantization: Quantizing the weights Roy et al. (2018); Gray (1984), i.e., sharing the same weight value among weights that belong to the same bin, also proved helpful in improving speed. This also allows gradients to be computed only once per bin.

Distillation: It can be performed with a teacher and a smaller student network that tries to replicate the performance of the teacher with fewer parameters Chen et al. (2019).

Pruning: This technique prunes all the connections whose weights are smaller than a predetermined threshold, after which the network can be retrained in order to adjust the weights of the remaining connections.

Real time: Gu et al. (2017) trained an agent that learns to decide between the actions of reading by discarding a candidate or writing by accepting a candidate. The policy network is optimized with a combination of quality evaluated with BLEU and delay evaluated by number of consecutive words in reading stage which increases wait time.

Caching: Another trick is to cache some of the previous computations to avoid repetition.
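Two of the compression techniques above, quantization and pruning, can be illustrated on plain lists of weights. This is a minimal sketch: the function names, the uniform binning scheme and the example values are assumptions, and practical systems use learned codebooks, tensor libraries and a retraining step that is omitted here.

```python
def quantize(weights, num_bins):
    """Uniform binning: every weight in a bin shares the bin's center value."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # all weights identical, nothing to bin
    width = (hi - lo) / num_bins
    shared = []
    for w in weights:
        bin_index = min(int((w - lo) / width), num_bins - 1)
        shared.append(lo + (bin_index + 0.5) * width)
    return shared

def magnitude_prune(weights, threshold):
    """Zero out connections whose absolute weight is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

quantized = quantize([0.0, 0.1, 0.9, 1.0], num_bins=2)
pruned = magnitude_prune([0.5, -0.01, 0.2, -0.7], threshold=0.1)
```

After quantization only two distinct values remain, and after pruning the smallest connection is removed while the others are left for retraining.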

4 Evaluation

Similar to other generative modeling, text generation also faces crucial challenges in evaluation Reiter and Belz (2009); Reiter (2018). van der Lee et al. (2019) present some of the best practices of evaluating automatically generated text. The main hindrance to standardize or evaluate NLG like other standard tasks is that it is often a sub-component of other tasks. This means that the input can be in varied forms such as tables, images and text. In certain settings such as diverse image captioning, we would need more objects or entities. Sometimes in dialog, we would need pronouns to have a natural coherence instead of repeating nouns.

Desiderata of Text:

It is crucial to define the factors contributing to the quality of good text. Some of the factors include relevant content, appropriate structure in terms of coherence and suitable surface forms. In addition, fluency, grammaticality, believability and novelty in some scenarios are crucial factors.

Intrinsic and Extrinsic: Evaluation in subjective scopes such as text generation can be performed intrinsically or extrinsically. Intrinsic evaluation is performed internally with respect to the generation itself and extrinsic evaluation is typically performed on the metric used to evaluate a downstream task in which this generation is used. The quality can also be judged using automatic metrics and human evaluation.

(a) Automatic Metrics:

Here, we outline the broad categories of metrics along with their advantages and disadvantages. These metrics can be classified into the following categories:

Word overlap based metrics:

These are based on the extent of word overlap, which means that they capture the replication of words. The problem with such measures is that they do not focus on semantics but only on the surface form of words. This includes precision for n-grams (BLEU Papineni et al. (2002)), improved weighting for rare n-grams (NIST Doddington (2002)), recall for n-grams (ROUGE Lin and Hovy (2002)), an F1 equivalent for n-grams (METEOR Banerjee and Lavie (2005)), and tf-idf based cosine similarity for n-grams (CIDEr Vedantam et al. (2015)). In addition, there are specific metrics that evaluate content selection by measuring summarization content units using PYRAMID Nenkova and Passonneau (2004) and parsed scene graphs with objects and relations using SPICE Anderson et al. (2016). Stanojevic and Sima’an (2014) proposed BEER, which treats evaluation as a ranking problem and uses character n-grams along with words.
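To make the precision and recall distinction concrete, the following is a minimal sketch of clipped n-gram overlap against a single reference. It is not the full BLEU, ROUGE or METEOR definition (there is no brevity penalty, no stemming or synonym matching, and no multi-reference handling), and the function names `ngrams` and `overlap_scores` are our own choices for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Clipped n-gram precision, recall and F1 of a candidate against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clip each candidate n-gram count by its count in the reference,
    # so that repeating a matching word is not rewarded twice.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    precision = matched / max(sum(cand.values()), 1)  # BLEU-style view
    recall = matched / max(sum(ref.values()), 1)      # ROUGE-style view
    f1 = 0.0 if matched == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
paraphrase = "a feline rested upon a rug".split()

p, r, f = overlap_scores(candidate, reference, n=1)   # 5 of 6 unigrams match
p2, r2, f2 = overlap_scores(paraphrase, reference, n=1)  # no surface overlap
```

The paraphrase scores zero even though it conveys a similar meaning, which is the surface-form limitation described above.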

Language Model based metrics:

This includes perplexity Brown et al. (1992). Such metrics are good at characterizing the language model itself. Perplexity roughly gives the average number of choices available for each random variable, that is, for each predicted word. However, it does not directly evaluate the generation itself; for instance, a decrease in perplexity does not imply a decrease in the word error rate. It only means that, intrinsically, the LM is good enough to select the right next word for that corpus. Human likeness is also measured by training a model to discriminate between human and machine generated text, as in an automatic Turing test Lowe et al. (2017); Cui et al. (2018); Hashimoto et al. (2019).
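The reading of perplexity as an average number of choices can be checked with a small sketch. It assumes we already have the probability a language model assigned to each observed token; the helper name `perplexity` and the example probabilities are illustrative only.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that is always undecided between 4 equally likely words
# has a perplexity of 4, regardless of sequence length.
uniform = perplexity([0.25] * 5)

# A more confident model on the same tokens has a lower perplexity.
confident = perplexity([0.9, 0.8, 0.95, 0.7, 0.85])
```

Note that the computation only involves the model's probabilities on a held-out corpus; it never looks at text the model generates, which is why it characterizes the language model rather than the generation.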

Embedding based metrics:

These metrics have the advantage of being able to capture semantics. MEANT 2.0 Lo (2017) and YiSi-1 Lo et al. (2018) compute structural similarity using word embeddings along with shallow semantic parses, which are required in the former and optional in the latter. Recently, contextualized embeddings have been used extensively to capture semantics, as in BERTScore Zhang et al. (2020) and BLEURT Sellam et al. (2020). Metrics based on a combination of different embeddings have also been proposed Shimanaka et al. (2018); Ma et al. (2017). However, the problem of poor correlation with human judgements still persists.
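As a rough illustration of how embeddings let a metric reward paraphrases, the sketch below greedily matches each token to its most similar counterpart by cosine similarity, in the spirit of BERTScore-style matching. It is a simplification under stated assumptions: the two-dimensional vectors in `TOY_EMBEDDINGS` are made up by hand, whereas an actual metric would use embeddings from a pretrained (typically contextualized) model, and the function names are hypothetical.

```python
import math

# Hypothetical 2-d embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "cat": (1.0, 0.1),
    "feline": (0.9, 0.2),
    "sat": (0.1, 1.0),
    "rested": (0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_match_f1(candidate, reference, emb):
    """F1 over greedy best-match cosine similarities between token embeddings."""
    sims = [[cosine(emb[c], emb[r]) for r in reference] for c in candidate]
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(row) for row in sims) / len(candidate)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(row[j] for row in sims) for j in range(len(reference))) / len(reference)
    return 2 * precision * recall / (precision + recall)


# The two sentences share no words, yet score close to 1.
paraphrase_score = greedy_match_f1(["feline", "rested"], ["cat", "sat"], TOY_EMBEDDINGS)
identical_score = greedy_match_f1(["cat", "sat"], ["cat", "sat"], TOY_EMBEDDINGS)
```

A word overlap metric would assign the paraphrase pair a score of zero, whereas the embedding-based score stays high because the matched vectors point in nearly the same direction.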

(b) Emulated Automatic Metrics:

These metrics check for the intended behavior in generation based on the sub-problem the modeling approach is addressing. To check correctness or fidelity with respect to the source document, we can apply inference. Diversity can be evaluated by computing corpus based distributions over the number of distinct entities Fan et al. (2019); Dong et al. (2019b); Clark et al. (2018) and so on. Recently, Wang et al. (2020) worked on identifying factual inconsistencies in generated summaries. The idea is that when a question is posed, the source document and the summary should yield the same or similar answers.
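A corpus-level diversity statistic can be sketched in a few lines. Counting distinct entities would need an external entity tagger, so this example substitutes a simpler and commonly used proxy, the ratio of distinct n-grams to total n-grams across all generations (often called distinct-n). The function name `distinct_n` and the toy outputs are assumptions made for illustration, not the procedure of the works cited above.

```python
def distinct_n(generations, n=1):
    """Ratio of unique n-grams to total n-grams over a list of tokenized generations."""
    total = 0
    unique = set()
    for tokens in generations:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# A system that always produces the same generic response ...
generic = [["i", "do", "not", "know"], ["i", "do", "not", "know"]]
# ... versus one whose outputs differ from each other.
varied = [["the", "dog", "ran"], ["a", "cat", "slept"]]

generic_d1 = distinct_n(generic, n=1)  # 4 unique unigrams out of 8
varied_d1 = distinct_n(varied, n=1)    # 6 unique unigrams out of 6
```

Because the statistic is computed over the whole set of outputs, it captures repetition across generations that a per-sentence metric would miss.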

(c) Human Evaluation:

There are broadly two mechanisms for conducting subjective evaluation, which is a challenging component of text generation. The first is preference testing and the second is scoring. Some studies have shown that preference based testing is prone to less variance than absolute scoring. Here are some important points to keep in mind while conducting human evaluation. (i) It is very expensive to conduct, and hence it is not feasible to check the model by repeated examination. (ii) There are no standard, universally agreed upon guidelines for setting up such tasks. In other words, conducting subjective evaluation is itself subjective in nature. (iii) Scores tend to vary with the nature of the scale, that is, whether the judgements are binary, discrete integer values or continuous. (iv) Human preferences are observed to be inconsistent and biased by personal and demographic conditions. In such cases, it is important to measure inter-annotator agreement as well. (v) Some annotators are lenient and others more strict, and scores are not normalized across annotators. (vi) The task needs to be framed unambiguously to elicit the right information and maintain reproducibility. Despite these criticisms, human evaluation is still the best option we have. It is absolutely crucial to perform human evaluation in most NLG tasks. So, these problems should be taken merely as cautions for developing more rational and systematic testing conditions. Comparisons between automatic and human evaluation Belz and Reiter (2006) are also studied actively in order to bring automatic metrics closer to human evaluation.
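For point (iv), one common way to quantify inter-annotator agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal version for categorical labels; the function name `cohens_kappa` and the example judgements are illustrative assumptions, and other agreement statistics may be more appropriate for more than two annotators or for ordinal scales.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Binary fluency judgements (1 = fluent, 0 = not fluent) on eight outputs.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(annotator_1, annotator_2)  # raw agreement 7/8, chance 1/2
```

Here the annotators agree on seven of eight items, yet kappa is 0.75 rather than 0.875 because half of that agreement would be expected by chance given how often each annotator uses each label.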

5 Conclusion

The past decade witnessed text generation moving from niche scenarios into several mainstream NLP applications. This calls for a snapshot to retrospect on the progress of varied text generation tasks in unison. This paper is written with the goal of presenting a one-stop destination for task agnostic components and factors in text generation for researchers looking to situate their work and gauge its impact in this vast field. Moving forward, we envision some crucial directions to focus on for impactful innovation in text generation. These include (i) generation in real time, (ii) non-autoregressive decoding, (iii) consistency with situated contexts in real and virtual environments and games, (iv) consistency with personality and opinions, especially for virtual agents, (v) conditioning on multiple modalities together with text and data, (vi) metrics for evaluating NLG that correlate better with human judgements, on which investigation is still ongoing, and (vii) creative text generation. We believe this is the right time to extend advancements in any particular task to other tightly coupled tasks to drive improvements in text generation as a holistic task.


  • Allahyari et al. (2017) Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D Trippe, Juan B Gutierrez, and Krys Kochut. 2017. Text summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268.
  • Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: semantic propositional image caption evaluation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V, volume 9909 of Lecture Notes in Computer Science, pages 382–398. Springer.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, pages 65–72. Association for Computational Linguistics.
  • Belz and Reiter (2006) Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy. The Association for Computer Linguistics.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1171–1179.
  • Brown et al. (1992) Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of english. Comput. Linguistics, 18(1):31–40.
  • Chand (2016) Sunita Chand. 2016. Empirical survey of machine translation tools. In 2016 Second International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 181–185. IEEE.
  • Chen et al. (2017) Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. SIGKDD Explorations, 19(2):25–35.
  • Chen et al. (2019) Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2019. Distilling the knowledge of BERT for text generation. CoRR, abs/1911.03829.
  • Cheng et al. (2017) Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
  • Cho (2016) Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. CoRR, abs/1605.03835.
  • Chu and Wang (2018) Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1304–1319.
  • Clark et al. (2018) Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2250–2260. Association for Computational Linguistics.
  • Cui et al. (2018) Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. 2018. Learning to evaluate image captioning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 5804–5812. IEEE Computer Society.
  • Dabre et al. (2019) Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2019. A survey of multilingual neural machine translation. arXiv preprint arXiv:1905.05395.
  • Dahlmeier and Ng (2012) Daniel Dahlmeier and Hwee Tou Ng. 2012. A beam-search decoder for grammatical error correction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 568–578. Association for Computational Linguistics.
  • Dathathri et al. (2020) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Deng and Rush (2020) Yuntian Deng and Alexander M. Rush. 2020. Cascaded text generation with markov transformers. CoRR, abs/2006.01112.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Doddington (2002) George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, pages 138–145.
  • Dong et al. (2019a) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019a. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
  • Dong et al. (2019b) Ruo-Ping Dong, Khyathi Raghavi Chandu, and Alan W. Black. 2019b. Induction and reference of entities in a visual story. CoRR, abs/1909.09699.
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.
  • Fan et al. (2019) Angela Fan, Mike Lewis, and Yann N. Dauphin. 2019. Strategies for structuring story generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 2650–2660. Association for Computational Linguistics.
  • Fedus et al. (2018) William Fedus, Ian J. Goodfellow, and Andrew M. Dai. 2018. Maskgan: Better text generation via filling in the _______. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Fügen et al. (2007) Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine translation, 21(4):209–252.
  • Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res., 61:65–170.
  • Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.
  • Ghazvininejad et al. (2019) Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 6111–6120. Association for Computational Linguistics.
  • Ghazvininejad et al. (2020) Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.
  • Gonzalvo et al. (2016) Xavi Gonzalvo, Siamak Tazari, Chun-an Chan, Markus Becker, Alexander Gutkin, and Hanna Silen. 2016. Recent advances in google real-time hmm-driven unit selection synthesizer.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Gray (1984) Robert Gray. 1984. Vector quantization. IEEE Assp Magazine, 1(2):4–29.
  • Grissom II et al. (2014) Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP), pages 1342–1352.
  • Gu et al. (2018) Jiatao Gu, Daniel Jiwoong Im, and Victor OK Li. 2018. Neural machine translation with gumbel-greedy decoding. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Gu et al. (2019) Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019. Insertion-based decoding with automatically inferred generation order. Trans. Assoc. Comput. Linguistics, 7:661–676.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640.
  • Gu et al. (2017) Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 1053–1062. Association for Computational Linguistics.
  • Gulcehre et al. (2016) Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149.
  • Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5141–5148. AAAI Press.
  • Guo et al. (2019) Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 3723–3730. AAAI Press.
  • Guo et al. (2020) Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7839–7846. AAAI Press.
  • Hashimoto et al. (2019) Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1689–1701. Association for Computational Linguistics.
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Hossain et al. (2018) Md. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2018. A comprehensive survey of deep learning for image captioning. CoRR, abs/1810.04020.
  • Hu et al. (2020) Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. 2020. What makes A good story? designing composite rewards for visual storytelling. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7969–7976. AAAI Press.
  • Iqbal and Qureshi (2020) Touseef Iqbal and Shaima Qureshi. 2020. The survey: Text generation models in deep learning. Journal of King Saud University-Computer and Information Sciences.
  • Kang and Hashimoto (2020) Daniel Kang and Tatsunori Hashimoto. 2020. Improved natural language generation via loss truncation. CoRR, abs/2004.14589.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. Opennmt: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72.
  • Kriz et al. (2019) Reno Kriz, João Sedoc, Marianna Apidianaki, Carolina Zheng, Gaurav Kumar, Eleni Miltsakaki, and Chris Callison-Burch. 2019. Complexity-weighted loss and diverse reranking for sentence simplification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3137–3147.
  • Kulikov et al. (2019) Ilia Kulikov, Alexander Miller, Kyunghyun Cho, and Jason Weston. 2019. Importance of search and evaluation strategies in neural dialogue modeling. In Proceedings of the 12th International Conference on Natural Language Generation, pages 76–87.
  • Lamb et al. (2016) Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609.
  • van der Lee et al. (2019) Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, INLG 2019, Tokyo, Japan, October 29 - November 1, 2019, pages 355–368. Association for Computational Linguistics.
  • Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1173–1182. Association for Computational Linguistics.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
  • Lin and Hovy (2002) Chin-Yew Lin and Eduard Hovy. 2002. Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization, pages 45–51.
  • Lin and Ng (2019) Hui Lin and Vincent Ng. 2019. Abstractive summarization: A survey of the state of the art. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9815–9822.
  • Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.
  • Lo (2017) Chi-kiu Lo. 2017. MEANT 2.0: Accurate semantic MT evaluation for any output language. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pages 589–597. Association for Computational Linguistics.
  • Lo et al. (2018) Chi-kiu Lo, Michel Simard, Darlene A. Stewart, Samuel Larkin, Cyril Goutte, and Patrick Littell. 2018. Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 908–916. Association for Computational Linguistics.
  • Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1116–1126. Association for Computational Linguistics.
  • Lu et al. (2018) Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Neural text generation: Past, present and beyond. CoRR, abs/1803.07133.
  • Ma et al. (2017) Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: a novel combined MT metric based on direct assessment - CASICT-DCU submission to WMT17 metrics task. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pages 598–603. Association for Computational Linguistics.
  • Martin et al. (2018) Lara J Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O Riedl. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Massarelli et al. (2019) Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, and Sebastian Riedel. 2019. How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587.
  • Montenegro et al. (2019) Joao Luis Zeni Montenegro, Cristiano André da Costa, and Rodrigo da Rosa Righi. 2019. Survey of conversational agents in health. Expert Systems with Applications.
  • Nenkova and McKeown (2012) Ani Nenkova and Kathleen McKeown. 2012. A survey of text summarization techniques. In Mining text data, pages 43–76. Springer.
  • Nenkova and Passonneau (2004) Ani Nenkova and Rebecca J. Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2-7, 2004, pages 145–152. The Association for Computational Linguistics.
  • van den Oord et al. (2018) Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. 2018. Parallel wavenet: Fast high-fidelity speech synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3915–3923. PMLR.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.
  • Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Perera and Nand (2017) Rivindu Perera and Parma Nand. 2017. Recent advances in natural language generation: A survey and classification of the empirical literature. Comput. Informatics, 36(1):1–32.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
  • Prabhumoye et al. (2020) Shrimai Prabhumoye, Alan W Black, and Ruslan Salakhutdinov. 2020. Exploring controllable text generation techniques. arXiv preprint arXiv:2005.01822.
  • Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6908–6915.
  • (79) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.
  • Ramesh et al. (2017) Kiran Ramesh, Surya Ravishankaran, Abhishek Joshi, and K Chandrasekaran. 2017. A survey of design techniques for conversational agents. In International Conference on Information, Communication and Computing Technology, pages 336–350. Springer.
  • Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Reiter (2018) Ehud Reiter. 2018. A structured review of the validity of BLEU. Comput. Linguistics, 44(3).
  • Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  • Roy et al. (2018) Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. 2018. Theory and experiments on vector quantized autoencoders. CoRR, abs/1805.11063.
  • See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
  • Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: learning robust metrics for text generation. CoRR, abs/2004.04696.
  • Shimanaka et al. (2018) Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 751–758. Association for Computational Linguistics.
  • Slocum (1985) Jonathan Slocum. 1985. A survey of machine translation: its history, current status, and future prospects. Computational linguistics, 11(1):1–17.
  • Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
  • Stanojevic and Sima’an (2014) Milos Stanojevic and Khalil Sima’an. 2014. BEER: better evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, pages 414–419. The Association for Computer Linguistics.
  • Sudhakar et al. (2019) Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. 2019. ”transforming” delete, retrieve, generate approach for controlled text style transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3267–3277. Association for Computational Linguistics.
  • Tam (2020) Yik-Cheung Tam. 2020. Cluster-based beam search for pointer-generator chatbot grounded by knowledge. Computer Speech & Language, page 101094.
  • (94) Oguzhan Tas and Farzad Kiyani. A survey automatic text summarization. PressAcademia Procedia, 5(1):205–213.
  • Theis et al. (2016) Lucas Theis, Aäron van den Oord, and Matthias Bethge. 2016. A note on the evaluation of generative models. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Togelius et al. (2011) Julian Togelius, Georgios N. Yannakakis, Kenneth O. Stanley, and Cameron Browne. 2011. Search-based procedural content generation: A taxonomy and survey. IEEE Trans. Comput. Intell. AI Games, 3(3):172–186.
  • Tong et al. (2018) Chao Tong, Richard C. Roberts, Rita Borgo, Sean P. Walton, Robert S. Laramee, Kodzo Wegba, Aidong Lu, Yun Wang, Huamin Qu, Qiong Luo, and Xiaojuan Ma. 2018. Storytelling and visualization: An extended survey. Information, 9(3):65.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4566–4575. IEEE Computer Society.
  • Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Wang et al. (2020) Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. arXiv preprint arXiv:2004.04228.
  • Wang and Ng (2013) Pidong Wang and Hwee Tou Ng. 2013. A beam-search decoder for normalization of social media text with application to machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 471–481.
  • Wang et al. (2018) Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 899–909.
  • Wang et al. (2014) Xuancong Wang, Hwee Tou Ng, and Khe Chai Sim. 2014. A beam-search decoder for disfluency detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1457–1467.
  • Wang et al. (2019) Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 5377–5384. AAAI Press.
  • Welleck et al. (2020) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Xu et al. (2018a) Jingjing Xu, Xuancheng Ren, Yi Zhang, Qi Zeng, Xiaoyan Cai, and Xu Sun. 2018a. A skeleton-based model for promoting coherence among sentences in narrative story generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4306–4315.
  • Xu et al. (2018b) Jingjing Xu, Xu Sun, Xuancheng Ren, Junyang Lin, Bingzhen Wei, and Wei Li. 2018b. DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text. CoRR, abs/1802.01345.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.
  • Yao et al. (2019) Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7378–7385.
  • Yarmohammadi et al. (2013) Mahsa Yarmohammadi, Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Baskaran Sankaran. 2013. Incremental segmentation and decoding strategies for simultaneous translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1032–1036.
  • Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2852–2858. AAAI Press.
  • Zarrieß and Schlangen (2018) Sina Zarrieß and David Schlangen. 2018. Decoding strategies for neural referring expression generation. Proceedings of INLG 2018.
  • Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.