A Survey of Pretrained Language Models Based Text Generation

by Junyi Li, et al.

Text generation aims to produce plausible and readable text in human language from input data. The resurgence of deep learning has greatly advanced this field through neural generation models, especially the paradigm of pretrained language models (PLMs). Grounding text generation on PLMs is seen as a promising direction in both academia and industry. In this survey, we present recent advances in PLMs for text generation. We begin by introducing three key points of applying PLMs to text generation: 1) how to encode the input data as representations that preserve input semantics and can be fused into PLMs; 2) how to design a universal and performant PLM architecture to serve as the generation model; and 3) how to optimize PLMs given the reference text and ensure that the generated text satisfies special text properties. Then, we identify several challenges and future directions within each key point. Next, we summarize useful resources and typical text generation applications that work with PLMs. Finally, we conclude and summarize the contribution of this survey.





1. Introduction

Text generation, also known as natural language generation, has been one of the most important sub-fields in natural language processing (NLP). It aims at producing plausible and readable text in human language from input data in a variety of forms, including text, images, tables, and knowledge bases. In recent decades, text generation techniques have been extensively applied to a wide range of applications (Li et al., 2021b). For example, a dialog system generates responses to user utterances in a conversation (Zhou et al., 2020a); machine translation translates a text from one language into another (Conneau and Lample, 2019); and text summarization generates an abridged summary of the source text (El-Kassas et al., 2021).

FIGURE 1. An illustrative process of applying PLMs to text generation. We divide the process into three main steps: input representation learning, model architecture design and selection, model optimization.

The essential goal of text generation is to learn a mapping function from input data to output text. Early approaches usually adopt statistical language models for modeling the conditional probabilities of words given the $n$-gram context (Brown et al., 1990; Brown and Frederking, 1995). Such a statistical approach is likely to suffer from the data sparsity issue, and a number of smoothing methods have been developed in order to better estimate unobserved term occurrences (Zhai and Lafferty, 2001; Tao et al., 2006). With the emergence and development of deep learning techniques (LeCun et al., 2015), neural network models have come to dominate text generation and have achieved tremendous improvements in generating natural language text. Deep neural generation models usually adopt the sequence-to-sequence framework (Sutskever et al., 2014) based on the encoder-decoder scheme: the encoder first maps the input sequence into fixed-size low-dimensional vectors (called input embeddings), and then the decoder generates the target text conditioned on the input embeddings. Various text generation models have been proposed with different designs for the encoder-decoder architecture, such as graph neural networks (GNN) for graph data (Li et al., 2020a) and recurrent neural networks (RNN) for text data (Li et al., 2019). Besides, the attention mechanism (Bahdanau et al., 2015) and copy mechanism (See et al., 2017) are widely used to improve the performance of text generation models. An important merit of deep neural networks for text generation is that they enable end-to-end learning of semantic mappings from the input data to output text without labor-intensive feature engineering. Moreover, deep neural models employ low-dimensional semantic representations (Iqbal and Qureshi, 2020) to capture linguistic features of language, which helps alleviate data sparsity.
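To make the data sparsity issue of the statistical approach concrete, the following is a minimal sketch of a bigram language model with add-one (Laplace) smoothing. The toy corpus, the function names, and the choice of add-one smoothing are illustrative assumptions; the smoothing literature cited above studies more refined estimators.

```python
from collections import Counter

def train_bigram(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that acts as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (context, next word) pairs
    # Words that can be predicted: all corpus words plus the end marker.
    vocab = {t for s in corpus for t in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(prev, word, unigrams, bigrams, vocab, alpha=1.0):
    """P(word | prev) with add-alpha smoothing; alpha=0 gives the raw estimate."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# Hypothetical toy corpus, for illustration only.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams, vocab = train_bigram(toy_corpus)

seen = bigram_prob("the", "cat", unigrams, bigrams, vocab)    # observed bigram
unseen = bigram_prob("the", "ran", unigrams, bigrams, vocab)  # never observed
print(f"{seen:.3f} {unseen:.3f}")  # 0.333 0.111
```

Without smoothing (`alpha=0`), the unobserved bigram "the ran" would receive probability zero, which is exactly the sparsity problem that smoothing methods are designed to soften.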

Despite the success of deep neural models for text generation, a major performance bottleneck lies in the availability of large-scale labelled datasets. Most text generation methods require substantial amounts of manually labelled parallel data, which restricts their applicability in many domains that suffer from a dearth of annotated examples. To date, most existing labelled datasets for text generation tasks are small. In such cases, deep neural networks are likely to overfit on these small datasets and do not generalize well in practice. Moreover, the early neural models for text generation tasks were still relatively shallow. Therefore, these models have difficulties in modeling the relationship between the context and word meanings and in deriving contextual word representations for better generation (Qiu et al., 2020).

In recent years, the paradigm of pretrained language models (PLMs) has been thriving in NLP (Qiu et al., 2020). The basic idea is to first pretrain the models on large-scale unsupervised corpora and then fine-tune them on downstream supervised tasks to achieve state-of-the-art results. With the emergence of the Transformer (Vaswani et al., 2017) and the growth of computational power, the architecture of PLMs has advanced from shallow to deep, as in BERT (Devlin et al., 2019) and OpenAI GPT (Radford et al., 2019). Substantial work has shown that PLMs can encode massive amounts of linguistic knowledge from corpora into their large-scale parameters and learn universal and contextual representations of language with specially designed objectives, such as language modeling, during pretraining. Therefore, PLMs are generally beneficial for downstream tasks and can avoid training a new model from scratch. Following the success of PLMs in other NLP tasks, researchers have proposed to solve text generation tasks based on PLMs (Brown et al., 2020; Lewis et al., 2020b; Raffel et al., 2020). Pretrained on large-scale corpora, PLMs are able to understand natural language accurately and express it fluently in human language, both of which are critical abilities for text generation tasks. Grounding text generation on PLMs is seen as a promising direction in both academia and industry. Thus, in this survey, we focus on text generation, as this field has been transformed by these powerful PLMs.

Existing surveys in this area only partially reviewed some related topics. For example, Qiu et al. (Qiu et al., 2020) summarized two generations of PLMs for the whole NLP domain and introduced various extensions and adaptation approaches of PLMs. Kalyan et al. (Kalyan et al., 2021) gave a brief overview of the advances of self-supervised learning in Transformer-based PLMs. Han et al. (Han et al., 2021) took a deep look into the history of pretraining, especially its relation with transfer learning and self-supervised learning. Besides, El-Kassas et al. (El-Kassas et al., 2021) mainly paid attention to the current application of PLMs to automatic text summarization. Zaib et al. (Zaib et al., 2020) discussed the implementation of PLMs in dialog systems with a special emphasis on question answering systems. These studies focused on specific applications, e.g., summarization and dialogue systems, but did not go deeper into the core technique, i.e., text generation. To the best of our knowledge, our survey is the first work that presents a comprehensive review of PLM-based text generation. It aims to provide text generation researchers with a synthesis of, and pointers to, related research.

To start with, we present a general task definition of text generation and an overview of PLMs in Section 2. Given the encoded input data, the goal of text generation is to optimize the generation function (i.e., PLMs) for generating satisfactory output text. Thus, there are three key points of applying PLMs to text generation: 1) how to encode the input data as representations that preserve input semantics and can be fused into PLMs (Section 3); 2) how to design a universal and performant PLM architecture to serve as the generation function (Section 4); 3) how to optimize the generation function (i.e., PLMs) given the reference text and ensure that the generated text satisfies special text properties such as fluency and naturalness (Section 5). Then, we identify several typical non-trivial challenges and solutions within each key point in Section 6. We also present a summary of various useful resources to work with PLMs in Section 7 and review PLMs for a variety of text generation applications in Section 8. Finally, we conclude and summarize the contribution of this survey and future directions in Section 9.

2. Preliminary

In this section, we first present a general task formulation of text generation, then describe the background of PLMs, and finally introduce three key aspects of applying PLMs to text generation.

2.1. Text Generation

Generally, a text can be denoted as a sequence of tokens $y = \langle y_1, \dots, y_j, \dots, y_n \rangle$, where each token $y_j$ is drawn from a word vocabulary $\mathcal{V}$. The task of text generation aims to generate plausible and readable text in human language. In most cases, text generation is conditioned on the input data, such as text, images, tables, and knowledge bases, which can be denoted as $x$. In particular, the generated text is desired to satisfy some special properties such as fluency, naturalness, and coherence. We define the desired properties for the output text as a set $\mathcal{P}$. Thus, the task of text generation can be formally described as:

$$y = f_{\mathcal{M}}(x, \mathcal{P}),$$

where the generation function $f_{\mathcal{M}}$ takes $x$ and $\mathcal{P}$ as input to produce the output text $y$. In this paper, the generation function is specially crafted based on a PLM $\mathcal{M}$.

Specifically, according to the type of the input data $x$ and the property set $\mathcal{P}$, text generation can be categorized into different kinds of applications:

When the input data is not provided or is a random noise vector, text generation degenerates into language modeling or unconditional text generation (Radford et al., 2018, 2019). In this case, the output text is required to satisfy some common language properties, such as fluency and naturalness.

When the input data is a set of discrete attributes (e.g., topic words, sentiment labels), text generation becomes topic-to-text generation (Dathathri et al., 2020) or attribute-based generation (Keskar et al., 2019). The input data plays the role of controlling the meaning of the generated text. In this situation, the output text should be relevant to the input topics or attributes.

When the input data is structured data such as a knowledge base or a table, text generation is considered data-to-text generation (Li et al., 2021c; Gong et al., 2020). This task aims to generate descriptive text about the structured data. Therefore, the output text must be objective and accurate.

When the input data is multimedia input such as images and speech, text generation becomes image captioning (Xia et al., 2021) or speech recognition (Fan et al., 2019). In image captioning, we might expect the generated caption to be vivid in order to attract children, while in speech recognition, the transcribed text must be faithful to the original speech.

The most common form of input data is a text sequence, spanning a number of applications such as machine translation (Conneau and Lample, 2019), text summarization (Rothe et al., 2020) and dialog systems (Zhang et al., 2020c). For each kind of task, the output text is expected to satisfy some specific properties. For example, in a dialog system, the generated response should be relevant to the input dialog history and context.

2.2. Pretrained Language Models

Pretrained language models (PLMs) are pretrained on large-scale unlabelled corpora and can be fine-tuned on downstream tasks. Pretrained on text data, PLMs are able to encode massive linguistic knowledge into their vast numbers of parameters, which can enhance the understanding of language and improve the generation quality.

Owing to the great achievements of the Transformer (Vaswani et al., 2017), almost all PLMs employ the Transformer as their backbone. GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) were first developed based on the Transformer decoder and encoder, respectively. Following GPT and BERT, PLMs such as XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ERNIE (Zhang et al., 2019a), T5 (Raffel et al., 2020) and BART (Lewis et al., 2020b) were introduced. Among them, XLNet, RoBERTa and ERNIE improve over the BERT model, while T5 and BART are encoder-decoder based PLMs. Recent studies showed that the performance of PLMs can be boosted just by increasing the scale of the model parameters (Kaplan et al., 2020), which triggered the development of large-scale PLMs like GPT-3 (175B) (Brown et al., 2020), PANGU (200B) (Zeng et al., 2021), GShard (600B) (Lepikhin et al., 2021) and Switch-Transformers (1.6T) (Fedus et al., 2021), which contain billions or trillions of parameters. In addition to language understanding and generation, PLMs are also designed for other tasks like named entity recognition (Pires et al., 2019), programming (Feng et al., 2020), and networking (Louis, 2020).

According to the pretraining objectives, the PLMs for text generation can be categorized as masked language models, causal language models, prefix language models, and encoder-decoder language models. The details of each category are discussed in Section 4.

2.3. Applying PLMs to Text Generation

To leverage PLMs for downstream text generation tasks, we need to consider three key aspects from the perspectives of data, model, and optimization, respectively:

Input Data: How to effectively encode the input as representations that preserve input semantics and can be fused into the PLM? For text generation, the input data, containing critical semantic information for the target output, is often not present in a suitable form for PLMs, and the data forms also vary across tasks. Therefore, it is necessary to develop effective, flexible representation learning approaches for capturing semantic evidence from the input in different data forms.

Model Architecture: How to design an effective and performant PLM as the generation function? In the literature, a number of PLMs have been developed with different architectures (e.g., denoising autoencoder (Lewis et al., 2020b) and autoregressive decoder (Radford et al., 2019)). When adapting PLMs to text generation tasks, we need to make specific design choices on the underlying PLMs in order to achieve good task performance.

Optimization Algorithm: How to effectively optimize the generation function (i.e., PLMs) given the reference text and ensure that the generated text satisfies special text properties? After encoding the inputs and designing appropriate PLMs, it is important to develop effective optimization algorithms for producing satisfactory text. A major challenge is that some desired properties of the output text are difficult to formulate or optimize.

In the following sections, we mainly present recent research works on PLM-based text generation, focusing on the above three aspects.

PLMs for Text Generation
  Encoding Input Representations
    Unstructured
      Paragraph
        Hierarchy-based Representation Learning: DialogBERT (Gu et al., 2021b); Li et al. (Li et al., 2021d)
        Graph-based Representation Learning: BASS (Wu et al., 2021); Ouyang et al. (Ouyang et al., 2021)
      Document
        Encoding Inter-Sentential Semantics: BERTSumExt (Liu and Lapata, 2019); HIBERT (Zhang et al., 2019b)
        Capturing Critical Semantics: Nguyen et al. (Nguyen et al., 2021); Liu et al. (Liu et al., 2021g)
        Representation Learning Efficiency: Huang et al. (Huang et al., 2021a); DANCER (Gidiotis and Tsoumakas, 2020); Manakul et al. (Manakul and Gales, 2021)
      Multi-language
        Cross-Lingual: XLM (Conneau and Lample, 2019); CMLM (Ren et al., 2019)
        Multi-Lingual: mBART (Liu et al., 2020); mT5 (Xue et al., 2021a); Wang et al. (Wang et al., 2021a)
    Structured
      Bridging Semantic Gap
        Structured Data Linearization: Ribeiro et al. (Ribeiro et al., 2020); TableGPT (Gong et al., 2020)
        Representation Alignment: Li et al. (Li et al., 2021c)
      Capturing Structural Information
        Incorporating Additional Objectives: TableGPT (Gong et al., 2020); Mager et al. (Mager et al., 2020)
        Adding Structural Information as Input: Ribeiro et al. (Ribeiro et al., 2020); Fan et al. (Fan and Gardent, 2020)
        Employing Structural Module: StructAdapt (Ribeiro et al., 2021); Li et al. (Li et al., 2021c)
      Maintaining Text Fidelity
        Incorporating Additional Objectives: TableGPT (Gong et al., 2020); Harkous et al. (Harkous et al., 2020)
        Utilizing Copying Mechanism: Li et al. (Li et al., 2021c); Suadaa et al. (Suadaa et al., 2021)
        Adding Target Information as Input: Chen et al. (Chen et al., 2020a)
    Multimedia
      Image Captioning: XGPT (Xia et al., 2021); VisualGPT (Chen et al., 2021); Yang et al. (Yang et al., 2021)
      Video Captioning: VideoBERT (Sun et al., 2019b); CBT (Sun et al., 2019a); Unified VLP (Zhou et al., 2020b); UniVL (Luo et al., 2020)
      Speech Recognition: Fan et al. (Fan et al., 2019); Liao et al. (Liao et al., 2021)
  Designing PLMs as Generation Function
    Standard Architecture
      Masked LM: BERT2BERT (Rothe et al., 2020); XLM (Conneau and Lample, 2019)
      Causal LM: GPT-2 (Radford et al., 2019); GPT-3 (Brown et al., 2020); CPM (Zhang et al., 2020a); CTRL (Keskar et al., 2019); PanGu-α (Zeng et al., 2021)
      Prefix LM: UniLM (Dong et al., 2019); UniLMv2 (Bao et al., 2020a); GLM (Du et al., 2021)
      Encoder-Decoder LM: MASS (Song et al., 2019); T5 (Raffel et al., 2020); BART (Lewis et al., 2020b); ProphetNet (Qi et al., 2020); CPM-2 (Zhang et al., 2021a)
    Architecture Extensions
      Auxiliary Embeddings: Relative Positional Embeddings (Raffel et al., 2020; Qi et al., 2020); Hierarchical Positional Embeddings (Li et al., 2020c); Dialogue User Embeddings (Bao et al., 2020b); Vowel Embeddings (Xue et al., 2021b)
      Improved Attention Modules
        Multi-view Attention: Chen et al. (Chen and Yang, 2020); Liu et al. (Liu et al., 2021f)
        Cross-attention: VECO (Luo et al., 2021)
  Optimizing PLMs for Text Generation
    Fine-Tuning for Text Generation
      Vanilla Fine-Tuning: DialoGPT (Zhang et al., 2020c); Ribeiro et al. (Ribeiro et al., 2020)
      Intermediate Fine-Tuning
        DAIFT: Liu et al. (Liu et al., 2021c)
        TAIFT: Fabbri et al. (Fabbri et al., 2021); Mao et al. (Mao et al., 2019)
      Multi-Task Fine-Tuning
        Pure Multi-Task Fine-Tuning: Goodwin et al. (Goodwin et al., 2020); Bai et al. (Bai et al., 2021)
        Hybrid Multi-Task Fine-Tuning: Liu et al. (Liu et al., 2021g); Li et al. (Li et al., 2021c)
      Parameter-Efficient Fine-Tuning
        Adapter-based Fine-Tuning: Houlsby et al. (Houlsby et al., 2019); Ribeiro et al. (Ribeiro et al., 2021)
        Freezing-based Fine-Tuning: Gheini et al. (Gheini et al., 2021)
        Distillation-based Fine-Tuning: Chen et al. (Chen et al., 2020c)
    Prompt-Tuning for Text Generation
      Discrete Prompts: GPT-2 (Radford et al., 2019); GPT-3 (Brown et al., 2020)
      Continuous Prompts: Prefix-Tuning (Li and Liang, 2021); Gu et al. (Gu et al., 2021c)
    Property-Tuning for Text Generation
      Relevance: TransferTransfo (Wolf et al., 2019); DialoGPT (Zhang et al., 2020c); Zeng et al. (Zeng and Nie, 2020)
      Faithfulness: Kryscinski et al. (Kryscinski et al., 2018); TED (Yang et al., 2020d)
      Order-Preservation: CSP (Yang et al., 2020a); mRASP (Lin et al., 2020); Wada et al. (Wada and Iwata, 2018)

FIGURE 2. The main content of our paper.

3. Encoding Input Representations

As discussed in Section 2, the first aspect is how to encode the input data as representations preserving input semantics for PLMs. In this section, we will introduce three main types of input data for text generation, i.e., unstructured input, structured input, and multimedia input.

3.1. Unstructured Input

In text generation, most studies focus on modeling unstructured text input (e.g., sentences, paragraphs, and documents), which requires accurately understanding the input information and deriving meaningful semantic text representations. The aim of text representation learning is to condense the input text into low-dimensional vectors while preserving its core semantic meaning. In what follows, we discuss how to derive effective semantic representations for three kinds of unstructured input, namely paragraphs, documents and multi-lingual input text.

3.1.1. Paragraph Representation Learning

A paragraph usually contains multiple sentences, and several sentences may discuss a certain topic. To capture the low-level word meanings and high-level topical semantics in a paragraph, many studies proposed hierarchy-based or graph-based methods to learn the paragraph representation.

Hierarchy-based Representation Learning. For a multi-sentence paragraph such as a multi-turn dialogue, most previous work concatenated the sentences as the model input and predicted the output text (Zhang et al., 2020c; Bao et al., 2021). However, flat concatenation is likely to ignore the semantic dynamics across utterances, and the resulting information loss may lead to decoding errors. To fill these gaps, several studies proposed to encode the input paragraphs with a hierarchical architecture (Gu et al., 2021b; Li et al., 2021d). Specifically, Gu et al. (Gu et al., 2021b) employed a hierarchical architecture, DialogBERT, to represent the dialogue context: it first encodes dialogue utterances through a Transformer encoder and then encodes the resulting utterance vectors using a discourse-level Transformer to obtain a representation of the entire dialogue context. However, this method lacks the history information when encoding each individual utterance, although the history information is essential for understanding dialogue utterances. Thus, Li et al. (Li et al., 2021d) first employed a Transformer to encode the whole conversation to get a dense context representation, upon which a unidirectional Flow module was designed to capture the context flow at the utterance level.

Graph-based Representation Learning. In a long paragraph, multiple sentences may contain repeated, redundant or contradictory information. Exploiting the deep semantic structure of such complex text input is key to further improving paragraph-based generation performance. Compared with a sequence, a graph can aggregate relevant but disjoint context by uniformly representing pieces of context as nodes and their relations as edges (Wu et al., 2021; Ouyang et al., 2021). As a representative example, Wu et al. (Wu et al., 2021) leveraged a phrase-relation graph to improve long sequence summarization, where nodes are phrases and edges represent similarity. This graph is suitable for information aggregation with the help of coreference resolution, which substantially compresses the input. Besides, in conversational machine reading, Ouyang et al. (Ouyang et al., 2021) formulated the input text as two complementary graphs, i.e., explicit and implicit discourse graphs, to fully capture the complicated interactions among all the elementary discourse units (EDUs).

3.1.2. Document Representation Learning

In many text generation tasks such as document translation and document summarization, the input text might be a long document consisting of multiple paragraphs. During document encoding, it is challenging to encode the cross-sentence semantics and then capture the most critical semantics.

Encoding Inter-Sentential Semantics. Most PLMs are trained as masked language models, so they are forced to learn token-level representations instead of sentence-level ones. Although segment embeddings are introduced to represent different sentences, they are only applied to sentence-pair inputs. To encode the inter-sentential semantics among document inputs for text generation, several studies (Liu and Lapata, 2019; Zheng and Lapata, 2019; Zhang et al., 2019b) proposed to learn document representations in a hierarchical way. For example, Liu et al. (Liu and Lapata, 2019) inserted “[CLS]” tokens at the start of each sentence to collect sentence features in lower layers and then combined them with self-attention in higher layers. Besides, Zhang et al. (Zhang et al., 2019b) proposed HIBERT for learning document representations in a hierarchical fashion, using a sentence encoder to transform each sentence into a vector and a document encoder to learn sentence representations given their surrounding sentences as context.
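The “[CLS]” insertion step can be sketched as follows. This is only an illustration under simplifying assumptions: whitespace splitting stands in for a real subword tokenizer, the function names and the example document are hypothetical, and the "hidden states" are placeholder vectors rather than outputs of an actual PLM.

```python
def add_sentence_markers(sentences, cls_token="[CLS]"):
    """Prefix every sentence with cls_token and record where each marker lands."""
    tokens, cls_positions = [], []
    for sentence in sentences:
        cls_positions.append(len(tokens))  # index of this sentence's marker
        tokens.append(cls_token)
        tokens.extend(sentence.split())    # assumption: whitespace tokenization
    return tokens, cls_positions

def gather_sentence_vectors(hidden_states, cls_positions):
    """Pick the vector at each marker position as that sentence's feature."""
    return [hidden_states[i] for i in cls_positions]

# Hypothetical three-sentence document.
document = ["the storm hit the coast", "thousands lost power", "repairs began monday"]
tokens, cls_positions = add_sentence_markers(document)
print(cls_positions)  # [0, 6, 10]

# Placeholder per-token vectors; a real encoder would produce these.
hidden_states = [[float(i)] for i in range(len(tokens))]
sentence_vectors = gather_sentence_vectors(hidden_states, cls_positions)
```

The gathered vectors play the role of the sentence features that higher layers would then combine with self-attention.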

Capturing Critical Semantics. In practice, sentences or paragraphs in long documents inevitably complement, overlap with, or conflict with each other. Therefore, it is important to retain the most critical contents of documents and verbalize them in the generated text. To address the issue of key points missing from the output text, Nguyen et al. (Nguyen et al., 2021) introduced a topic model to capture the global semantics of the document and a mechanism to control the amount of global semantics supplied to the text generation module. Similarly, Liu et al. (Liu et al., 2021g) proposed two topic-aware contrastive learning objectives to capture the global topic information of a conversation and outline salient facts. These objectives are able to implicitly model how topics change over the course of a conversation, pushing PLMs to focus more on snippets that contain salient information from the same topics.

Representation Learning Efficiency. Efficiency is an important factor to consider for modeling long documents, especially when generating long text. Since the cost of the self-attention mechanism grows quadratically with sequence length, many works aim to improve the encoding efficiency of self-attention (Huang et al., 2021a; Manakul and Gales, 2021). A representative example is Manakul et al. (Manakul and Gales, 2021), who proposed two methods: local self-attention, allowing longer input spans during training; and explicit content selection, reducing memory and compute requirements. Besides, several researchers further adopted divide-and-conquer methods for encoding long documents. For example, Gidiotis et al. (Gidiotis and Tsoumakas, 2020) split a long document and its summary into multiple source-target short sentence pairs, which are used for training PLMs that learn to summarize each part of the document separately. By splitting a long document into short sentences, encoding the semantics of documents becomes simpler, reducing computational complexity and text noise.
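The efficiency argument can be illustrated with a small counting sketch that compares the number of query-key scores under full self-attention, under a sliding-window form of local attention, and under a divide-and-conquer split into independent chunks. The window form, the sequence length, the window size, and the chunk size are all assumptions chosen for illustration; they are not the configurations used in the cited works.

```python
def full_attention_pairs(n):
    """Number of query-key scores when every token attends to every token."""
    return n * n

def local_attention_pairs(n, window):
    """Number of scores when each token attends only to tokens within
    `window` positions on either side (clipped at the sequence boundaries)."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))

def split_into_chunks(tokens, chunk_size):
    """Divide-and-conquer: cut a long token list into consecutive chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Illustrative settings only.
n, window, chunk_size = 4096, 256, 512
document = [f"tok{i}" for i in range(n)]  # stand-in for a tokenized long document
chunks = split_into_chunks(document, chunk_size)

full = full_attention_pairs(n)
local = local_attention_pairs(n, window)
chunked = sum(full_attention_pairs(len(c)) for c in chunks)
print(full, local, chunked)
```

With these toy settings, both restricted variants compute roughly one eighth of the scores that full attention does, at the price of discarding long-range interactions across windows or chunks.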

3.1.3. Multi-language Representation Learning

Most PLMs are pretrained on abundant English text while ignoring low-resource languages. This makes it difficult to directly apply monolingual PLMs to multilingual text generation tasks such as multilingual machine translation.

Cross-lingual Representations. The idea behind learning cross-lingual representations is to learn a shared embedding space for two languages to improve the model’s ability for cross-lingual translation. A well-known cross-lingual PLM is XLM (Conneau and Lample, 2019), which proposed unsupervised and supervised objectives by leveraging monolingual and parallel data, respectively, to learn cross-lingual representations. However, these representations learned on shared BPE spaces are implicit and limited. Therefore, Ren et al. (Ren et al., 2019) calculated cross-lingual $n$-gram embeddings and inferred an $n$-gram translation table from them to provide explicit representation learning signals.

Multi-lingual Representations. Given more than two languages, multi-lingual PLMs aim to learn representations for any of the language pairs. Based on monolingual PLMs, Liu et al. (Liu et al., 2020) and Xue et al. (Xue et al., 2021a) proposed mBART and mT5, respectively, which are pretrained once for all languages. Due to the considerable differences between languages, several studies utilized contrastive learning to learn multi-lingual representations (Pan et al., 2021; Wang et al., 2021a). In particular, Wang et al. (Wang et al., 2021a) proposed two training objectives: contrastive sentence ranking (CSR) and sentence aligned substitution (SAS). CSR samples sentences from the document and constructs positive and negative pairs based on their saliency. By contrastively learning what is more important, the model is expected to acquire the ability to distinguish salient information in different languages.

3.2. Structured Input

Structured data (e.g., table, graph, and tree) is also a critical kind of input for text generation in many real-world applications, such as medical report (Hasan and Farri, 2019) and weather report (Goldberg et al., 1994) generation. However, it is non-trivial to model structured input for data-to-text tasks based on PLMs due to three main challenges: (1) there is a semantic gap between the structured input and the natural language text used for pretraining PLMs; (2) PLMs lack a way to encode the structural information contained in the input data; (3) the generated text must maintain fidelity to the input information.

3.2.1. Bridging the Semantic Gap

In general, PLMs are pretrained on unstructured text, which is different from the structured data in semantic form. In order to better leverage structured data, we need to bridge the semantic gap between the structured input and natural language input that is used for pretraining PLMs.

Structured Data Linearization. To adapt to the sequential nature of PLMs, a simple approach is to linearize the input data into a sequence (Ribeiro et al., 2020; Mager et al., 2020; Fan and Gardent, 2020). Specifically, Ribeiro et al. (Ribeiro et al., 2020) linearized a knowledge graph (KG) into a sequence by concatenating its relational triples. Besides, some studies adopted template-based heuristic methods to serialize the input data (Gong et al., 2020). For example, the attribute-value pair “name: jack reynolds” will be serialized as the sentence “name is jack reynolds”.
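Both linearization strategies can be sketched in a few lines of Python. This is only a minimal illustration, not the preprocessing used in the cited works: the function names `linearize_triples` and `serialize_record`, the separators, and the sample data are assumptions made for this example.

```python
def linearize_triples(triples):
    """Concatenate (head, relation, tail) triples into one flat string."""
    return " ".join(f"{h} {r} {t}" for h, r, t in triples)


def serialize_record(record):
    """Template-based serialization: one 'attribute is value' clause per pair."""
    return " , ".join(f"{attr} is {value}" for attr, value in record.items())


# Hypothetical sample inputs.
triples = [("jack reynolds", "plays for", "team a"),
           ("team a", "based in", "city b")]
print(linearize_triples(triples))
print(serialize_record({"name": "jack reynolds"}))
```

The resulting strings can be tokenized like ordinary text, which is what makes linearization attractive despite the structure it discards.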

Representation Alignment. Instead of directly feeding structured input into PLMs, some studies first transform the structured data into embeddings, which are then taken as input by PLMs. For example, Li et al. (Li et al., 2021c) utilized graph neural networks (GNN) to project KG entities into embeddings, and then proposed a representation alignment method to align the entity representations (encoded by the GNN) with the PLM-based entity embeddings in semantic space.

3.2.2. Capturing Structural Information

Unlike unstructured text, structured data contains structural information, such as the attribute-value pair in a table or the triple in a KG. This structural information can help models understand the input correctly and generate faithful text.

Incorporating Additional Training Objective. In order to enhance the preservation of structural information, a number of studies introduced auxiliary training objectives related to the structural information in addition to the primary generation objective (Gong et al., 2020; Li et al., 2021c; Mager et al., 2020). The first kind reconstructs the semantic structure of the input data. For example, Gong et al. (Gong et al., 2020) utilized the attribute names of the input table as labels to reconstruct the table structure from the value representations learned by PLMs, which forces PLMs to embed the data structure into their representations. The other kind adjusts the output text based on the structural information. Mager et al. (Mager et al., 2020) proposed cycle-consistency based losses to assess the quality of the system output by how well it can reconstruct the input.

Adding Structural Information as Input. Unlike previous works that implicitly model structural information with training losses, several studies proposed to explicitly take structural information as input (Ribeiro et al., 2020; Fan and Gardent, 2020). Ribeiro et al. (Ribeiro et al., 2020) directly prepended “⟨H⟩”, “⟨R⟩”, and “⟨T⟩” tokens before the head entity, the relation and tail entity of a triple to reveal the relations between entities. Besides, Fan et al. (Fan and Gardent, 2020) first encoded the AMR graph into graph embedding, which can then be taken as input. The graph embedding provides additional information to the encoder by encoding the depth of each node in the rooted graph and the subgraph each node belongs to.
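The marker-token idea can be illustrated with a short sketch. The function name `tag_triples` is hypothetical, and the ASCII markers `<H>`, `<R>`, `<T>` stand in for the “⟨H⟩”, “⟨R⟩”, and “⟨T⟩” tokens; in practice such markers would typically also be registered as special tokens in the tokenizer, which is not shown here.

```python
def tag_triples(triples, head="<H>", rel="<R>", tail="<T>"):
    """Prefix each element of a (head, relation, tail) triple with a role marker."""
    parts = []
    for h, r, t in triples:
        parts.extend([head, h, rel, r, tail, t])
    return " ".join(parts)


# Hypothetical sample triple.
print(tag_triples([("Paris", "capital of", "France")]))
```

Compared with plain concatenation, the markers tell the model where each entity and relation starts, so multi-word elements remain unambiguous.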

Employing Structural Encoding Module. Since PLMs are originally designed for sequential input, a natural method is to use additional modules to encode the structured input. A representative example is StructAdapt (Ribeiro et al., 2021). StructAdapt adds layer-wise graph convolution modules in order to learn representations built upon the graph connectivity over the PLM encoder. Similarly, Li et al. (Li et al., 2021c) employ GNN to explicitly encode entity relations in KG. The entity embeddings from GNN are regarded as input word embeddings of PLM for generating text.

3.2.3. Maintaining Text Fidelity

In the linguistics literature, fidelity means that the generated text adheres to the content of the structured data. Generating high-fidelity text that correctly describes the information in the structured input is the core of data-to-text generation.

Incorporating Additional Training Objective. To generate high-fidelity text that adheres to the input, Gong et al. (Gong et al., 2020) introduced an optimal-transport-based content matching loss, which measures the distance between the input information and the output text and thereby helps the model correctly describe important information from the table. Harkous et al. (Harkous et al., 2020), in turn, employed a semantic fidelity classification loss to detect and avoid generation errors (such as hallucination and omission).

Utilizing Copy Mechanism. The pointer-generator (See et al., 2017) is a key technique for keeping the generated text faithful to the input data by copying important words from the input into the output. Li et al. (Li et al., 2021c) adopted the pointer-generator to copy entities from input knowledge data, while Suadaa et al. (Suadaa et al., 2021) incorporated the copy mechanism by using general placeholders to avoid producing hallucinated phrases that are not supported by the table.
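To make the copy mechanism concrete, the following sketch mixes a generation distribution with a copy distribution in the spirit of the pointer-generator: a gate p_gen weights the vocabulary distribution, and the remaining mass is spread over source tokens according to the attention weights. The function name, the dictionary-based representation, and the sample numbers are assumptions for illustration; a real model computes all of these quantities with learned layers.

```python
def pointer_generator_mix(p_gen, vocab_dist, attention, source_tokens):
    """Combine generating and copying into one output distribution.

    p_gen:         probability of generating from the vocabulary (0..1)
    vocab_dist:    dict token -> probability from the decoder softmax
    attention:     attention weights over source positions (sum to 1)
    source_tokens: the tokens at those source positions
    """
    final = {tok: p_gen * p for tok, p in vocab_dist.items()}
    for tok, weight in zip(source_tokens, attention):
        # Copy mass is added per position, so repeated source tokens accumulate.
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * weight
    return final


# Hypothetical numbers: the names are out-of-vocabulary but can still be copied.
dist = pointer_generator_mix(
    p_gen=0.5,
    vocab_dist={"the": 0.6, "player": 0.4},
    attention=[0.7, 0.3],
    source_tokens=["jack", "reynolds"],
)
print(dist)
```

Because source tokens receive probability even when they are missing from the vocabulary, entity names from a table or KG can be reproduced verbatim instead of being paraphrased or hallucinated.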

Adding Target Information as Input. To combat the low-fidelity problem, Chen et al. (Chen et al., 2020a) argued that it is necessary to leverage intermediate meaning representations to achieve faithful generation. Therefore, the authors provided the generation module with a logical form representing the semantics of the target text.

3.3. Multimedia Input

In addition to the above textual data, several attempts have been made to take multimedia data (e.g., image, video, and speech) as input, in tasks such as image captioning and speech recognition.

3.3.1. Image Captioning

Image captioning, which aims to generate a textual description of an image, has been extensively studied in computer vision research. With the advent of PLMs, researchers have proposed to utilize the remarkable ability of PLMs for this task. A well-known pretrained model is XGPT (Xia et al., 2021). Inspired by GPT in the textual modality, XGPT takes images as inputs and uses the image captioning task as the basic generative task in the pretraining stage. Chen et al. (Chen et al., 2021) also proposed a pretrained image captioning model, VisualGPT. They designed a self-resurrecting attention mechanism to learn how to encode the visual information and adapt it to the PLM decoder. However, traditional vision-language pretraining fails to capture the relationship between the visual and text modalities. Yang et al. (Yang et al., 2021) proposed three pretraining tasks to help the model learn a better-aligned representation among three modalities: text word, visual object, and scene text.

3.3.2. Video Captioning

Video captioning focuses on generating natural language text describing the video content. VideoBERT (Sun et al., 2019b) and CBT (Sun et al., 2019a) were the first to investigate video-language pretraining for the video captioning task. Since they trained a separate video-to-text decoder, they tend to suffer from a pretrain-finetune discrepancy. Therefore, Unified VLP (Zhou et al., 2020b) and UniVL (Luo et al., 2020) proposed unified video and language pretraining models. Unified VLP uses a shared multi-layer Transformer network for both encoding and decoding, while UniVL encodes the text and video separately with two single-modal encoders and generates text with a decoder.

3.3.3. Speech Recognition

In practice, speech recognition is hungry for human-transcribed supervised data. Thus, a number of unsupervised and semi-supervised methods have been developed to integrate PLMs for weakly-supervised learning. For example, Fan et al. (Fan et al., 2019) proposed an unsupervised approach to pretraining an encoder-decoder model with unpaired speech and transcripts. Liao et al. (Liao et al., 2021) proposed a speech recognition post-processing model that aims to transform incorrect and noisy recognition output into text that is readable for humans and usable by downstream tasks, leveraging the Metadata Extraction (MDE) corpus to construct a small task-specific dataset.

4. Designing PLMs for Text Generation

After introducing how to encode the input data into low-dimensional embeddings, in this section we focus on how to design an effective and suitable PLM as the text generation function.

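Such a problem can be modeled as an optimization task by maximizing the conditional probability of the output text given the input, which can be formally factorized by tokens:

$$\Pr(\mathcal{Y} \mid \mathcal{X}) = \prod_{j=1}^{n} \Pr(y_j \mid y_{<j}, E_{\mathcal{X}}),$$

where $y_j$ denotes the $j$-th output token, $y_{<j}$ denotes the previously generated tokens, $n$ is the length of the output text $\mathcal{Y}$, and $E_{\mathcal{X}}$ is the embedding for input data $\mathcal{X}$.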
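The token-level factorization can be checked numerically with a toy example. The sketch below sums the log-probabilities of each output token given the previously generated tokens and an input representation. The function `toy_model` is a hypothetical stand-in for a PLM with hand-picked probabilities, and all names are assumptions made for this illustration.

```python
import math


def sequence_log_prob(next_token_probs, input_repr, output_tokens):
    """Sum log Pr(y_j | y_<j, input) over all output tokens (chain rule)."""
    total = 0.0
    for j, token in enumerate(output_tokens):
        dist = next_token_probs(input_repr, output_tokens[:j])
        total += math.log(dist[token])
    return total


def toy_model(input_repr, prefix):
    """Hypothetical stand-in for a PLM: a fixed next-token distribution."""
    if not prefix:
        return {"hello": 0.5, "world": 0.5}
    return {"hello": 0.25, "world": 0.75}


log_p = sequence_log_prob(toy_model, None, ["hello", "world"])
print(log_p, math.exp(log_p))
```

Negating this quantity and averaging over a dataset gives the usual sequence cross-entropy objective; working in log space avoids the numerical underflow that a direct product of many small probabilities would cause.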

To model the conditional probability, traditional neural models mainly adopt the RNN architecture (Sutskever et al., 2014), with a number of improved variants. In recent years, the Transformer (Vaswani et al., 2017), which is based solely on attention mechanisms, has proven better at capturing long-range dependencies in text, which is beneficial for modeling and generating text. With its excellent parallelization capacity, the Transformer has become the backbone for developing extremely large PLMs. Based on the Transformer architecture, PLMs can encode rich semantic and linguistic knowledge when trained on large-scale unlabeled corpora. Furthermore, it has been shown that PLMs can be effectively fine-tuned for different text generation tasks, which makes them the first choice for the text generation function.

4.1. Standard Architecture

Existing PLMs utilize either a single stack or a double stack of Transformer layers as the backbone. PLMs with a single stack, such as BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020) and UniLM (Dong et al., 2019), do not have explicit encoding and decoding processes. They yield three variants according to their self-attention mask strategies: masked language models, causal language models and prefix language models. In contrast, PLMs with a double stack follow the whole Transformer architecture, with cross-attention between the encoder and decoder.

4.1.1. Masked Language Models

Masked language models utilize the full attention mask, which is the same as Transformer encoders. Equipped with full attention matrix, models are usually pretrained with masked language modeling (MLM) tasks, i.e., predicting the masked tokens using the contextualized information. The most representative model is BERT (Devlin et al., 2019), which is used extensively in natural language understanding (NLU).

However, due to the discrepancy between the pretraining task of masked LMs and the downstream generation function, masked LMs are rarely utilized for text generation tasks (Yang et al., 2019). It is more common to use BERT as the encoder part of a text generation model, leveraging its excellent bidirectional encoding capacity. Rothe et al. (Rothe et al., 2020) proposed to utilize three outstanding PLMs, i.e., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and GPT-2 (Radford et al., 2019), for text generation. They experimented with initializing both the encoder and decoder with BERT, and the result is comparable with that of other PLMs specially designed for text generation.

4.1.2. Causal Language Models

Similar to the Transformer decoder, causal language models adopt a lower-triangular (causal) mask matrix. Causal LMs are designed for language modeling, i.e., estimating the probability of a sequence of words. They are straightforward to use for text generation, since they predict the next word conditioned on all the previous words.

GPT (Radford et al., 2018) was the first deep PLM that could be applied to text generation tasks. GPT-2 (Radford et al., 2019) explored the transfer capacity of language models for zero-shot generation tasks, which suggests the significance of sufficient data. GPT-3 (Brown et al., 2020) demonstrated that massive model parameters can significantly improve downstream generation tasks when the model is shown just a few examples or prompts. CTRL (Keskar et al., 2019) is trained as a conditional language model to generate text conditioned on control codes that govern style, content, and task-specific behavior. CPM (Zhang et al., 2020a) and PanGu-α (Zeng et al., 2021) explored training large-scale auto-regressive Chinese language models.

Although causal LMs are simple and straightforward for text generation, they have several structural and algorithmic limitations. Causal LMs encode tokens only from left to right; as a result, they neglect the bidirectional information on the input side. Moreover, causal LMs are not specifically designed for Seq2Seq tasks, and therefore they do not achieve satisfying results in tasks such as summarization and translation.

4.1.3. Prefix Language Models

In order to overcome the disadvantages of bidirectional masked LMs and unidirectional causal LMs in text generation, prefix LMs are proposed to combine the advantages of both. By utilizing mixed attention masks, the tokens in the source text can attend to each other, while the tokens in the target text can attend only to the source tokens and the previously generated target tokens.
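The three single-stack variants differ only in their self-attention masks, which the following sketch builds explicitly. By the convention assumed here, rows index query positions, columns index key positions, and 1 means "may attend"; the function names are chosen for this example.

```python
def full_mask(n):
    """Masked LM: every token attends to every token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Causal LM: token i attends to tokens 0..i (lower-triangular)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_mask(n_source, n_target):
    """Prefix LM: bidirectional over the source, causal over the target."""
    n = n_source + n_target
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if j < n_source:
                row.append(1)  # every position sees all source tokens
            else:
                # target keys are visible only to target queries at or after them
                row.append(1 if (i >= n_source and j <= i) else 0)
        mask.append(row)
    return mask


for row in prefix_mask(n_source=2, n_target=2):
    print(row)
```

With two source and two target tokens, the upper-left block is full (bidirectional encoding of the source) and the lower-right block is triangular (left-to-right generation of the target), which is how a single stack imitates an encoder-decoder.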

UniLM (Dong et al., 2019) was the first prefix language model. Compared to causal LMs, UniLM utilized a prefix attention mask to solve conditional generation tasks, which is similar to the encoder-decoder architecture. UniLMv2 (Bao et al., 2020a) and GLM (Du et al., 2021) improved the vanilla prefix masking strategy by introducing the permuted language modeling of XLNet (Yang et al., 2019).

Although prefix LMs are specially designed for text generation tasks, Raffel et al. (Raffel et al., 2020) compared the performance of single-stack prefix LMs and double-stack encoder-decoder LMs and concluded that adding explicit encoder-decoder attention is beneficial.

4.1.4. Encoder-Decoder Language Models

PLMs of this kind follow the standard Transformer architecture, consisting of stacks of both encoder and decoder layers. MASS (Song et al., 2019) and ProphetNet (Qi et al., 2020) take a sequence with one masked fragment as the input of the encoder, and the decoder then generates the masked tokens in an autoregressive way. T5 (Raffel et al., 2020) randomly replaces several spans in the source text with different special tokens, and then asks the decoder to predict the replaced spans in turn. BART (Lewis et al., 2020b) was pretrained as a denoising autoencoder (DAE), i.e., the model learns to recover the original text from corrupted text, which is produced with different noising methods, such as sentence permutation, token deletion, and document rotation.
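The span-replacement objective described for T5 can be sketched as follows. The spans are passed in explicitly here, whereas pretraining samples them at random; the function name `span_corrupt` is hypothetical, and the sentinel naming `<extra_id_k>` follows the convention of the public T5 vocabulary but should be treated as an assumption of this example.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Returns (source_tokens, target_tokens).
    """
    source, target, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        source.extend(tokens[cursor:start])  # keep the text before the span
        source.append(sentinel)              # one sentinel replaces the span
        target.append(sentinel)              # the target announces the span...
        target.extend(tokens[start:end])     # ...followed by its content
        cursor = end
    source.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return source, target


tokens = "Thank you for inviting me to your party last week".split()
src, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(src))
print(" ".join(tgt))
```

The encoder sees the corrupted source while the decoder only has to produce the missing spans, which keeps target sequences short compared with reconstructing the whole text as in a denoising autoencoder.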

4.2. Architecture Extensions

Based on the vanilla Transformer architecture, some researchers modify individual modules to adapt to the various input formats of specific tasks. In this part, we introduce extended embeddings and improved attention modules of Transformers. Embeddings are shallow but explicit information added to each token, while extended attention modules are usually designed to process the input information better.

4.2.1. Auxiliary Embeddings

Besides the (sub-)word embeddings, almost all Transformers utilize position embeddings. Compared to CNNs and RNNs, the self-attention operation is order-independent. Hence, it is essential to provide explicit position information in order to make use of the sequential nature of text. The original Transformer (Vaswani et al., 2017) utilizes predetermined absolute position embeddings based on sinusoidal functions, while most PLMs, such as BERT and GPT, adopt learned absolute position embeddings. Instead of absolute positions, relative position embeddings produce learned embeddings according to the offset between two tokens. T5 (Raffel et al., 2020), UniLMv2 (Bao et al., 2020a) and ProphetNet (Qi et al., 2020) employ an improved bucketed relative position method. In addition, hierarchical position embeddings are utilized to indicate inter- and intra-sentence position information, and are often applied to fixed-format text such as poems (Li et al., 2020c) and lyrics (Xue et al., 2021b).
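For reference, the fixed sinusoidal embedding of the original Transformer assigns sin(pos / 10000^(2k/d)) to dimension 2k and the matching cosine to dimension 2k+1. The sketch below computes it for a single position and, for contrast, builds the matrix of pairwise offsets on which relative position embeddings are indexed. The function names are chosen for this example, and the bucketing of offsets used by models such as T5 is not shown.

```python
import math


def sinusoidal_position_embedding(position, d_model):
    """Fixed absolute position embedding for one position."""
    emb = []
    for i in range(d_model):
        # dimensions 2k and 2k+1 share the same frequency
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def relative_offsets(n):
    """Offset j - i between every query position i and key position j."""
    return [[j - i for j in range(n)] for i in range(n)]


print(sinusoidal_position_embedding(1, 4))
print(relative_offsets(3))
```

The absolute variant depends only on a token's index, whereas the offset matrix is identical for any two tokens the same distance apart, which is why relative schemes transfer more easily to sequence lengths unseen in training.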

Moreover, auxiliary embeddings are often leveraged to enrich the input information (Kalyan et al., 2021). Similar to the segment embeddings used in BERT, dialog state embeddings are assigned to each utterance (Wolf et al., 2019; Bao et al., 2020b), and user embeddings are utilized to differentiate the characters and background knowledge involved in the conversation (Bao et al., 2020b; Ham et al., 2020). In the multilingual scenario, it is common to introduce language embeddings in order to inform the model about the language of each sentence (Song et al., 2019; Chi et al., 2020). In addition, rhyme embeddings (Li et al., 2020c) and vowel embeddings (Xue et al., 2021b) can indicate acoustic information in poems and lyrics.

4.2.2. Improved Attention Modules

Though there exist various variants of the layer normalization and position-wise FFN modules in the Transformer (Kalyan et al., 2021), they are rarely used in PLMs for text generation. Hence, we mainly focus on the variants of the self- and cross-attention modules. In order to adapt to long-form text and alleviate the quadratic complexity of the full-attention operation, sparse attention is applied in the self-attention module for long-form input. Rather than attending to all other tokens, every token attends only to specific tokens, selected with strategies such as window attention (Zaheer et al., 2020; Manakul and Gales, 2021; Pasunuru et al., 2021b), global attention (Zaheer et al., 2020; Pasunuru et al., 2021b), random attention (Zaheer et al., 2020) and Sinkhorn attention (Zhong et al., 2021).
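A small sketch makes the saving from sparse attention visible. It builds a mask that combines window attention (each token sees neighbors within a fixed distance) with global attention (selected tokens see, and are seen by, everything) and counts the attended pairs. This is a simplified illustration, not the implementation of any cited model; the function names and the half-width definition of `window` are assumptions.

```python
def sparse_mask(n, window, global_positions=()):
    """1 where query i may attend to key j under window + global attention."""
    g = set(global_positions)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # local neighborhood, or either side is a global token
            if abs(i - j) <= window or i in g or j in g:
                mask[i][j] = 1
    return mask


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


n = 6
print(attended_pairs(sparse_mask(n, window=1)))                         # window only
print(attended_pairs(sparse_mask(n, window=1, global_positions=(0,))))  # plus one global token
print(n * n)                                                            # full attention
```

For a fixed window, the number of attended pairs grows linearly with the sequence length rather than quadratically, while the few global tokens preserve a path for information to flow between distant positions.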

Besides sparse attention, several methods are employed to process input from multiple sources. It is common to leverage one or more encoders to encode multiple inputs, and then utilize different strategies to aggregate them in the cross-attention module. Golovanov et al. (Golovanov et al., 2019) conduct mean pooling over the dialogue history, current state and persona information. Chen et al. (Chen and Yang, 2020) and Liu et al. (Liu et al., 2021f) further utilize attention to process embeddings from multiple views or knowledge sources. In addition, VECO (Luo et al., 2021) unifies the encoder and decoder modules via a plug-and-play cross-attention module. BASS (Wu et al., 2021) and Ribeiro et al. (Ribeiro et al., 2021) substitute the self-attention module with a graph network to better extract structural information. Zeng et al. (Zeng and Nie, 2021) append a gating mechanism after self-attention to inject condition-aware information.

5. Optimizing PLMs for Text Generation

As discussed in Section 3, after the input data is encoded and the generation model (i.e., the PLM) is designed, the next key step is to optimize the PLM for the text generation task. We mainly consider three optimization approaches, namely a) fine-tuning, b) prompt-tuning, and c) property-tuning. Next, we describe each approach in detail.

5.1. Fine-Tuning for Text Generation

During pretraining, PLMs can encode general linguistic knowledge from large-scale corpora. However, performing downstream text generation tasks requires task-specific knowledge. For this purpose, fine-tuning is a commonly adopted way to incorporate task-specific information into PLMs by adapting their weights with text generation losses such as the sequence cross-entropy loss (Radford et al., 2019).

According to how the parameters of PLMs are updated, existing fine-tuning methods for text generation can be categorized as: 1) vanilla fine-tuning, 2) intermediate fine-tuning, 3) multi-task fine-tuning, and 4) parameter-efficient fine-tuning. Compared with vanilla fine-tuning, intermediate and multi-task fine-tuning are able to alleviate the overfitting issue on small datasets. As vanilla fine-tuning requires adjusting the entire model, parameter-efficient methods such as adapters (Houlsby et al., 2019) can fine-tune PLMs in a lightweight way.

5.1.1. Vanilla Fine-Tuning

In vanilla fine-tuning, PLMs are adapted to downstream text generation tasks based on task-specific losses such as the cross-entropy loss (Radford et al., 2019). Zhang et al. (Zhang et al., 2020c) trained the DialoGPT model on the basis of the GPT-2 architecture by modeling a multi-turn dialogue session as a long text and optimizing the generation model with the language modeling objective. Ribeiro et al. (Ribeiro et al., 2020) investigated two recent PLMs, BART and T5, for graph-to-text generation and fine-tuned them with the conventional auto-regressive cross-entropy loss. A major issue of vanilla fine-tuning is that the model is often insufficiently optimized on small datasets and is prone to overfitting.

5.1.2. Intermediate Fine-Tuning

The basic idea of intermediate fine-tuning is to incorporate an intermediate dataset consisting of sufficient labelled instances. The intermediate dataset can come from the same target text generation task in a different domain, or from a related NLP task in the same target domain. Infusing domain- or task-specific knowledge from the intermediate dataset helps alleviate the overfitting issue and enhance the performance on small target text generation datasets (Phang et al., 2018). According to the relatedness between the intermediate dataset and the target dataset, intermediate fine-tuning can be divided into two categories, i.e., domain adaptive intermediate fine-tuning (DAIFT) and task adaptive intermediate fine-tuning (TAIFT).

Domain Adaptive Intermediate Fine-Tuning. According to Kalyan et al. (Kalyan et al., 2021), DAIFT utilizes an intermediate dataset that focuses on a related NLP task (not a text generation task) from the same target domain and consists of sufficient labelled instances. By leveraging such an intermediate dataset, PLMs can be enriched with domain-specific knowledge, which helps improve the performance of the target text generation task within the same domain. DAIFT is usually adopted in machine translation to mitigate the issue of unseen languages in translation pairs. For example, to improve the translation quality of a low-resource target language (e.g., Kazakh), Liu et al. (Liu et al., 2021c) constructed a large-scale intermediate monolingual corpus of the target language and fine-tuned mBART by reconstructing the corrupted monolingual text. The intermediate dataset comes from the same language domain as the target dataset (e.g., Kazakh), which can impart language-related linguistic knowledge to PLMs for better translation.

Task Adaptive Intermediate Fine-Tuning. In contrast with DAIFT, TAIFT incorporates an intermediate dataset focused on the same target text generation task but from different domains. It aims to infuse task-specific knowledge from the massive intermediate labelled dataset to improve the same target text generation task. Many works have shown that performing the same target text generation task on an intermediate generic text corpus (e.g., Wikipedia, WebText) can improve the performance of the target generation task on a specific domain (e.g., Movie) (Fabbri et al., 2021; Mao et al., 2019). For example, Fabbri et al. (Fabbri et al., 2021) performed summarization on intermediate pseudo-summaries created from Wikipedia to improve the zero-shot and few-shot performance of abstractive summarization, and Mao et al. (Mao et al., 2019) conducted generation on intermediate BookCorpus from WebText to improve commonsense story generation on the target WritingPrompts dataset.

5.1.3. Multi-Task Fine-Tuning

By incorporating auxiliary tasks, multi-task fine-tuning can improve the primary text generation task by utilizing cross-task knowledge. Furthermore, by injecting knowledge from related NLP tasks, multi-task fine-tuning can enhance the robustness of PLMs and reduce the need for large amounts of labelled instances in the text generation task. According to the relatedness between the primary text generation task and the auxiliary tasks, multi-task fine-tuning (MTFT) can be divided into two categories, i.e., pure MTFT and hybrid MTFT.

Pure Multi-Task Fine-Tuning. In pure MTFT, the auxiliary tasks are the same as the primary text generation task. Previous works mainly utilized additional datasets to mitigate the data scarcity issue of the primary text generation task (Goodwin et al., 2020; Bai et al., 2021). Specifically, Goodwin et al. (Goodwin et al., 2020) performed MTFT on twenty-one summarization and question answering datasets to enable zero-shot summarization and question answering on previously unseen datasets. Besides, Bai et al. (Bai et al., 2021) incorporated an auxiliary monolingual summarization task to improve the primary cross-lingual summarization task in the low-resource setting.

Hybrid Multi-Task Fine-Tuning. Hybrid MTFT means the auxiliary tasks are different (not text generation tasks) from the primary text generation task. These diverse auxiliary tasks can help enhance some specific aspects of the primary generation task. For example, Liu et al. (Liu et al., 2021g) and Jin et al. (Jin et al., 2020) fine-tuned PLMs with auxiliary tasks (e.g., coherence detection, style-carrying text reconstruction) to control the content of the generated text such as the topic change and text style (humor, romance, and clickbait). Besides, to improve the faithfulness of the generated text, Li et al. (Li et al., 2021c) and Gong et al. (Gong et al., 2020) introduced auxiliary input reconstruction tasks to reconstruct KG triples and table values for aligning the input information and the generated content.

5.1.4. Parameter-Efficient Fine-Tuning

As vanilla fine-tuning requires updating all the model parameters, it is time-consuming to fine-tune the entire model in resource-limited scenarios. A number of studies have therefore developed parameter-efficient fine-tuning methods for text generation tasks.

Adapter-based Fine-Tuning. The adapter is a special neural layer proposed by Houlsby et al. (Houlsby et al., 2019) to fine-tune PLMs in a parameter-efficient way. The adapter module projects the input vector into a small vector and then projects it back into the original dimension using two feed-forward layers and a non-linear layer. Specifically, the adapters first project the original $d$-dimensional features into a smaller dimension, $m$, apply a non-linearity, then project back to $d$ dimensions. The total number of parameters added per layer, including biases, is $2md + d + m$. By setting $m \ll d$, we can limit the number of additionally added parameters per task. Thus, it is highly efficient to fix the parameters of the original PLMs and fine-tune only the adapters (Stickland et al., 2021; Chen and Shuai, 2021). To address the inefficiency and overfitting issues in low-resource abstractive summarization, Chen et al. (Chen and Shuai, 2021) inserted adapters into both the encoder and decoder of PLMs while restricting the number of trainable parameters and layers. Besides, many studies have shown that adapters can help PLMs efficiently capture some input characteristics for generating more accurate output text at a low extra cost in terms of parameters (Le et al., 2021; Ribeiro et al., 2021). For example, Ribeiro et al. (Ribeiro et al., 2021) utilized adapters to effectively model the input graph structure when fine-tuning PLMs, which are usually pretrained on natural language rather than structured data.
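The bottleneck structure and its parameter cost can be sketched in plain Python. Writing d for the feature dimension and m for the bottleneck size (symbols chosen for this sketch), the down-projection contributes d·m weights and m biases, and the up-projection m·d weights and d biases. The choice of ReLU as the non-linearity, the residual connection, and all function names are assumptions of this illustration rather than a reproduction of a specific implementation.

```python
def adapter_param_count(d, m):
    """Down-projection (d*m + m) plus up-projection (m*d + d) parameters."""
    return 2 * m * d + d + m


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v, b):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def adapter_forward(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter with a residual (skip) connection."""
    z = relu(matvec(W_down, h, b_down))     # d -> m, then non-linearity
    out = matvec(W_up, z, b_up)             # m -> d
    return [x + y for x, y in zip(h, out)]  # add the input back


# With d = 768 and m = 64, each adapter adds under 100k parameters.
print(adapter_param_count(768, 64))
```

If the up-projection is initialized to zero, the residual connection makes the adapter an identity function at the start of training, so inserting it does not disturb the frozen PLM's behavior.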

Freezing-based Fine-Tuning. This approach freezes most parameters of PLMs and updates only a small proportion of the model parameters. Recent studies have shown that not all the parameters of PLMs need to be fine-tuned for text generation tasks, and that some of them can be fixed during fine-tuning without much impact on model performance. Several studies have revealed that cross-attention (or encoder-decoder attention) layers are more important than self-attention layers when fine-tuning PLMs for machine translation (Gheini et al., 2021; You et al., 2020). Therefore, Gheini et al. (Gheini et al., 2021) fine-tuned only the cross-attention parameters while keeping the encoder and decoder fixed, which achieved translation performance close to that of fine-tuning all parameters.
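In code, freezing usually amounts to deciding per parameter whether it receives gradient updates. The sketch below selects trainable parameters by name; the parameter names and the `cross_attn` keyword are hypothetical and depend entirely on how a given model names its modules.

```python
def select_trainable(param_names, keyword="cross_attn"):
    """Map each parameter name to True (fine-tune) or False (freeze)."""
    return {name: keyword in name for name in param_names}


# Hypothetical parameter names of a tiny encoder-decoder model.
names = [
    "encoder.layer0.self_attn.weight",
    "encoder.layer0.ffn.weight",
    "decoder.layer0.self_attn.weight",
    "decoder.layer0.cross_attn.weight",
    "decoder.layer0.ffn.weight",
]
plan = select_trainable(names)
print([name for name, train in plan.items() if train])
```

In PyTorch, for instance, the same decision is typically expressed by setting `requires_grad` to False on every frozen parameter before building the optimizer.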

Distillation-based Fine-Tuning. This approach distills large teacher PLMs into small student models for efficient fine-tuning. By distilling the knowledge in PLMs for text generation into a small generative model (e.g., a sequence-to-sequence model), the student model can be efficiently fine-tuned to achieve better generation performance (Shleifer and Rush, 2020; Chen et al., 2020c). As a representative example, Chen et al. (Chen et al., 2020c) leveraged BERT as the teacher model, which generates sequences of word probability logits, and treated a Seq2Seq model as the student network, which can effectively learn from the teacher’s outputs.
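A generic word-level distillation loss for a single decoding step can be sketched as follows: the student is trained to match the teacher's soft distribution while still fitting the gold token. This is a common textbook formulation, not the exact objective of the cited works; the mixing weight `alpha`, the `temperature`, and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, gold_index,
                      alpha=0.5, temperature=1.0):
    """alpha * soft-target cross-entropy + (1 - alpha) * hard-label loss."""
    teacher = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    student = softmax(student_logits)
    soft = -sum(t * math.log(s) for t, s in zip(teacher, student_soft))
    hard = -math.log(student[gold_index])
    return alpha * soft + (1 - alpha) * hard


# A student that agrees with the teacher gets a lower loss than one that disagrees.
print(distillation_loss([2.0, 0.0], [2.0, 0.0], gold_index=0))
print(distillation_loss([0.0, 2.0], [2.0, 0.0], gold_index=0))
```

The soft targets carry information about which alternative words the teacher considers plausible, which a one-hot gold label cannot express.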

5.2. Prompt-Tuning for Text Generation

Most generative PLMs are pretrained using language modeling objectives and then fine-tuned on text generation tasks with task-specific objectives. This discrepancy between pretraining and fine-tuning affects the performance of PLMs on text generation tasks. In prompt-tuning, downstream text generation tasks are instead reformulated as the language modeling task used during pretraining.

5.2.1. Background

According to Liu et al. (Liu et al., 2021e), a prompting function $f_{\text{prompt}}(\cdot)$ is applied to modify the input text $x$ into a prompt $x' = f_{\text{prompt}}(x)$ through a two-step process:

  1. Apply a template, which is a textual string that has two slots: an input slot [X] for input $x$ and an answer slot [Z] for a generated answer text $z$ that will later be mapped into $y$.

  2. Fill slot [X] with the input text $x$.

Here the prompt can be cloze or prefix style. The cloze-style prompt is usually adopted in language understanding tasks. For example, in sentiment analysis where $x$ = “I love this movie.”, the template may take a cloze form such as “[X] It was a really [Z] movie.” to predict the answer in [Z]. In contrast, the prefix-style prompt connects the input text and the answer, such as “English: [X] German: [Z]” in machine translation. Prefix prompts are therefore commonly used in text generation, as they mesh well with the left-to-right nature of the model. In the above prompt examples, the template is composed of discrete natural language tokens. Alternatively, the template tokens could be virtual words (e.g., represented by numeric ids) that are mapped into continuous embeddings later.
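The two-step prompting process can be sketched with simple string templates, writing the input slot as [X] and the answer slot as [Z] (slot names as used in this sketch). The function names are hypothetical, and the English sentence in the translation example is made-up sample input.

```python
def fill_prompt(template, x):
    """Prompting function: put the input text x into the [X] slot."""
    return template.replace("[X]", x)


def prefix_prompt_for_generation(template, x):
    """Keep only the text before [Z], so a left-to-right model can continue it."""
    filled = fill_prompt(template, x)
    return filled.split("[Z]")[0].rstrip()


cloze_template = "[X] It was a really [Z] movie."
prefix_template = "English: [X] German: [Z]"

print(fill_prompt(cloze_template, "I love this movie."))
print(prefix_prompt_for_generation(prefix_template, "Good morning."))
```

In the cloze case the answer slot sits in the middle of the string, so the model must fill a blank, whereas in the prefix case everything the model needs precedes the slot and the answer is simply the continuation.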

5.2.2. Discrete Prompts

Most studies create prompts by manually designing templates based on human introspection. As a pioneering work, GPT-2 (Radford et al., 2019) performed text generation tasks using various manually-created prompts. For example, the prompt “translate to french, [input], [output]” is used in machine translation. The prompt defines the semantic mapping from input data to output text in a specific text generation task. By utilizing diverse prompts, a single PLM is able to perform many different text generation tasks. Previous studies relied heavily on manual effort to create prompts; however, PLMs are highly sensitive to prompts, and improperly constructed prompts cause low performance (Jiang et al., 2020a). To overcome the need to manually specify prompts, Shin et al. (Shin et al., 2020) proposed AutoPrompt to automatically search for template tokens. Besides, several methods have been proposed to discover discrete prompts automatically, such as paraphrasing existing prompts (Jiang et al., 2020a), generating prompts using PLMs (Gao et al., 2021), and mining prompts from a corpus (Jiang et al., 2020a).

5.2.3. Continuous Prompts

In addition to discrete prompts, many studies have explored continuous prompts (a.k.a. soft prompts) in the embedding space. Continuous prompts have two advantages: 1) they relax the constraint that the prompt template should consist of natural language words; 2) they remove the restriction that the template is parameterized by the PLM’s parameters. Instead, prompt templates have their own parameters that can be tuned based on the training data of the text generation tasks.

The most well-known continuous prompting method for text generation is prefix-tuning (Li and Liang, 2021), which keeps the parameters of generative PLMs (e.g., GPT-2, BART) frozen and instead optimizes a sequence of continuous vectors. In contrast to fine-tuning, which updates all PLM parameters and thus requires storing a tuned copy of the model for each text generation task, prefix-tuning only optimizes the prefix for each text generation task. Building on prefix-tuning, several works have tackled other text generation tasks such as dialog generation (Gu et al., 2021c).

5.3. Property-Tuning for Text Generation

For various generation tasks, we expect to optimize PLMs specially for different language properties, so that the generated text can satisfy the corresponding needs of the generation tasks. Next, we discuss three major properties that are enhanced via fine-tuning PLMs.

5.3.1. Relevance

According to the linguistic literature (Li et al., 2021e), relevance in text generation means that the topical semantics conveyed in the output text are highly related to the input text. As a representative example, in dialog systems, the generated responses should be relevant to the historical utterances and other conditions, such as speaker persona and discourse topic.

Compared with traditional neural generative models, PLMs utilize a powerful multi-layer cross-attention mechanism to connect the input side and the output side. Therefore, applying PLMs to the dialog generation task can improve the relevance of the generated text to the input data (Wolf et al., 2019; Zhang et al., 2020c). A good example is DialoGPT (Zhang et al., 2020c), which is formulated as an auto-regressive language model and uses GPT-2 as its model architecture. Specifically, DialoGPT is first trained on large-scale dialog pairs/sessions, which enables it to capture the joint distribution of the dialog history and the response in conversational flow for generating responses relevant to the history utterances. Furthermore, to consider various types of condition information when generating dialog, Zeng et al. (Zeng and Nie, 2020) utilized the masked language modeling objective to train the conditioned dialog generation task. Specifically, they proposed TF-IDF based masking, which selects more condition-related tokens to mask, so that PLMs can generate condition-related expressions rather than general language patterns. Besides, they used a non-parametric attention-based gating mechanism to choose between generating a general word or a condition-related word at each position.
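The idea of TF-IDF based masking can be sketched as follows. This is a simplified, deterministic illustration and not the authors' implementation: it assumes that all text belonging to one condition (e.g., one persona) forms a single document, and it masks the top-k token types, whereas the original method is described only as selecting more condition-related tokens. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: list of token lists, one per condition (an assumption).
    Returns one {token: tf-idf} dictionary per document."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    all_scores = []
    for doc in docs:
        term_freq = Counter(doc)
        all_scores.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in term_freq.items()
        })
    return all_scores

def mask_condition_tokens(tokens, scores, k=1, mask="[MASK]"):
    """Mask the k token types with the highest tf-idf for this condition."""
    top = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return [mask if tok in top else tok for tok in tokens]

# Toy data: two "conditions", each with a single utterance.
docs = [
    "i love hiking in the mountains".split(),
    "i love cooking in the kitchen".split(),
]
scores = tfidf_scores(docs)
masked = mask_condition_tokens(docs[0], scores[0], k=2)
```

Tokens shared by both conditions receive a score of zero, so only the condition-specific words are masked and must be predicted by the model.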

5.3.2. Faithfulness

Faithfulness is also an important language property to consider for text generation: the generated content should adhere to the information in the input text. For example, a text summarization system aims to generate faithful text representing the salient information within the input text. Sometimes the term is generalized to mean that the generated text accords with world facts.

To be faithful to the input text, the underlying text generation model should accurately understand the core semantics of the input or acquire sufficient world knowledge. Pretrained on large collections of text with special training objectives, PLMs show excellent natural language understanding capacities in capturing core semantics from plain text (Devlin et al., 2019). Furthermore, it has been found that PLMs indeed encode a large number of knowledge facts (Jiang et al., 2020a), which is potentially beneficial for generating faithful summaries by injecting background knowledge into text. For example, Kryscinski et al. (Kryscinski et al., 2018) utilized a contextual network in the PLM decoder to retrieve the most salient parts from the source document, improving the faithfulness of generated summaries. Besides, several studies proposed to generate faithful text by introducing additional losses besides the text generation loss (Rothe et al., 2020; Yang et al., 2020d). Specifically, Yang et al. (Yang et al., 2020d) fine-tuned PLMs through a theme modeling loss and a denoising autoencoder. The role of the theme modeling loss is to optimize PLMs for generating faithful summaries by making the generated summary semantically close to the original article through a semantic classifier. The denoising autoencoder can help PLMs extract salient information from corrupted text to further enhance the faithfulness of generated summaries.

5.3.3. Order-Preservation

In the NLP field, order-preservation is a special property meaning that the order of semantic units (words, phrases, etc.) is consistent between the input and output text. Such a property is key to several important text generation tasks, such as text paraphrase and machine translation. In machine translation, translating from the source language to the target language often requires preserving the order of phrases between the source and target text to ensure the accuracy of the translation results.

In machine translation, one line of research to achieve the order-preservation property is to perform word alignment. A representative study is Code-Switching Pre-training (CSP) (Yang et al., 2020a). CSP first extracts the word-pair alignment information from the source and target monolingual corpora automatically. Then, to enhance the order-preservation property during translation, CSP continues pretraining PLMs by predicting the sentence fragment on the source side given the aligned fragment in the target language. Moreover, to loosen the restriction of discrete word alignment, another line of research aims to conduct continuous representation alignment for improving the order-preservation property. Wada et al. (Wada and Iwata, 2018) focused on aligning word representations of each language by mapping word embeddings of each language into a common latent space, making it possible to keep the word order consistent. Although recent studies have achieved some progress on English, it is more challenging to enhance order-preservation across multiple languages. Thus, Lin et al. (Lin et al., 2020) proposed mRASP, which enforces words and phrases with similar meanings across multiple languages to be aligned in the representation space.

6. Challenges and Solutions

| View | Challenge | Solution |
|---|---|---|
| Data View | Lacking Enough Training Data | Prior knowledge transfer (Peng et al., 2020; Liu et al., 2021f; Zou et al., 2021), data augmentation (Xu et al., 2021; Pasunuru et al., 2021a; Magooda and Litman, 2021; Chen and Yang, 2021a), multi-task learning (Goodwin et al., 2020; Bai et al., 2021). |
| Data View | Domain Transfer | Continuously pretrained on specific out-domain data (Chen and Shuai, 2021; Zou et al., 2021), or on auxiliary intermediate tasks (Maurya et al., 2021). |
| Data View | Pretraining Corpus Bias | Mitigate the gender bias in word embeddings (Beutel et al., 2017), identify and mask bias-sensitive tokens (Dayanik and Padó, 2020). |
| Model View | Model Compression | Quantization by truncating PLM weights (Stock et al., 2021; Zadeh et al., 2020), pruning less critical weights (Gordon et al., 2020; Guo et al., 2019; Hou et al., 2020; Fan et al., 2020), knowledge distillation (Chen et al., 2020c; Li et al., 2020b; Jiao et al., 2020). |
| Model View | Model Extension | Large-scale PLMs (Brown et al., 2020; Zeng et al., 2021; Lepikhin et al., 2021; Fedus et al., 2021), knowledge-enriched PLMs (Li et al., 2021a; Peters et al., 2019; Zhang et al., 2019a; Hao et al., 2020), efficient PLMs (He et al., 2021; Jiang et al., 2020b). |
| Model View | Model Robustness | Utilize character embeddings rather than sub-word embeddings (Boukkouri et al., 2020; Ma et al., 2020), adversarial data augmentation (Jia and Liang, 2017; Wang and Bansal, 2018; Zhou et al., 2021b; Xie et al., 2020). |
| Optim. View | Satisfying Text Properties | Enhance coherence (Sun et al., 2019a; Li et al., 2021e), preserve factuality (Chen et al., 2020b; Li et al., 2021c; Nan et al., 2021; Dong et al., 2020), improve controllability (Dathathri et al., 2020; Khalifa et al., 2021; Pascual et al., 2021). |
| Optim. View | Mitigating Tuning Instabilities | Intermediate fine-tuning (Phang et al., 2018; Liu et al., 2021c), mixout strategy (Lee et al., 2020), supervised contrastive learning (Gunel et al., 2021). |

Table 1. Summary of the existing studies on PLMs with respect to key modules and solutions according to different challenges.

The above three sections describe three key aspects with basic methods in designing text generation models. In this section, we further summarize the major challenges and existing solutions in three different views. A summary of these challenges and solutions is presented in Table 1.

6.1. Data View

We first discuss the challenges and solutions related to the data view.

6.1.1. Lacking Sufficient Training Data

In many text generation tasks, it is difficult to obtain sufficient annotated data. Transfer learning provides an effective solution to transfer the knowledge of data-rich source tasks into data-scarce target text generation tasks. Besides, data augmentation and multi-task learning can also be used to address this problem.

Transfer Learning. To deal with the challenge of lacking enough annotated data in text generation tasks, several studies considered first fine-tuning PLMs on large amounts of external corpus and then transferring into target text generation tasks (Peng et al., 2020; Liu et al., 2021f; Zou et al., 2021). In particular, Peng et al. (Peng et al., 2020) and Zou et al. (Zou et al., 2021) first fine-tuned PLMs on large amounts of external labelled dialog/summary data and then fine-tuned for the target dialog/summarization task in a new domain with limited labelled data. Besides, Liu et al. (Liu et al., 2021f) first trained on large-scale ungrounded dialogues and unstructured knowledge base separately to improve the target knowledge-grounded dialog task.

Data Augmentation. In recent literature, data augmentation has become an important approach to constructing weakly supervised data for improving model performance in data-scarce text generation tasks. One line of research uses retrieval models to return simulated data (Xu et al., 2021; Pasunuru et al., 2021a). For the query-focused summarization task, Pasunuru et al. (Pasunuru et al., 2021a) used a search engine, i.e., Bing, to retrieve the answer passage as the simulated summary based on the ground-truth query. Another line uses perturbation-based methods to corrupt the original text (Magooda and Litman, 2021; Chen and Yang, 2021a). For example, Chen et al. (Chen and Yang, 2021a) presented a set of data augmentation methods for conversation summarization, such as random swapping/deletion, which randomly swaps or deletes utterances in conversations.
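A minimal sketch of the perturbation-based operations mentioned above (random swapping and random deletion of utterances) might look as follows. The function names, the deletion probability `p`, and the fallback that keeps at least one utterance are assumptions made for illustration.

```python
import random

def random_swap(utterances, rng):
    """Return a copy of the conversation with two utterances swapped."""
    out = list(utterances)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_delete(utterances, p, rng):
    """Drop each utterance with probability p, keeping at least one."""
    kept = [u for u in utterances if rng.random() >= p]
    return kept or [rng.choice(utterances)]

conversation = ["A: hi", "B: hello", "A: lunch?", "B: sure"]
rng = random.Random(0)          # fixed seed so the sketch is reproducible
swapped = random_swap(conversation, rng)
shortened = random_delete(conversation, 0.3, rng)
```

Each perturbed conversation is paired with the original summary to form an additional training example.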

Multi-Task Learning. Another strategy to overcome the data scarcity issue is to explore multi-task learning by leveraging other data-rich tasks and datasets. Most studies incorporated similar auxiliary text generation tasks to enhance the primary text generation task, such as machine translation (Bai et al., 2021) and abstractive summarization (Goodwin et al., 2020). However, these methods usually adopt independent decoders without sharing parameters, breaking the connections between high-resource and low-resource generation tasks. To bridge these connections, Bai et al. (Bai et al., 2021) employed a unified decoder which learns the alignments and patterns across multiple languages in machine translation.

6.1.2. Transferring from In Domain to Out Domain

For text generation, PLMs are typically pretrained on hundreds of thousands of data points, which is an infeasible requirement when applying PLMs to new domains, especially when the data distribution differs greatly from that of pretraining. Therefore, pretraining on intermediate data or tasks before applying PLMs to text generation tasks in new domains may be an effective solution.

Pretraining on Intermediate Data. As discussed in Section 5.1.2, the intermediate and target data are of the same domain but can belong to different tasks. Pretraining on the intermediate data can help PLMs gain more domain-specific knowledge, which enhances the performance of the target text generation tasks in the same domain. For example, to improve the translation quality of the low-resource target language (e.g., Kazakh), Liu et al. (Liu et al., 2021c) constructed a large-scale intermediate monolingual corpus of the target language and fine-tuned mBART by reconstructing the corrupted monolingual text.

Pretraining on Intermediate Task. The intermediate task focuses on the same target text generation task but from different domains, which can impart task-specific knowledge to PLMs. For example, Fabbri et al. (Fabbri et al., 2021) performed summarization on intermediate pseudo-summaries created from Wikipedia to improve the zero-shot and few-shot performance of abstractive summarization.

6.1.3. Data Bias from Pretraining Corpus

Since the pretraining corpora of PLMs are collected from the Web, they may contain data from various domains such as biomedical and legal text. However, such domain-specific data are likely to contain biased information, and PLMs are prone to learning and amplifying the data bias of the pretraining corpus.

It has been found that the text generated by these PLMs is likely to be biased towards some attributes (Brown et al., 2020), i.e., it may favor people of a particular race, gender, or age, which is not desired for the target generation tasks. These undesirable biases are unexpectedly captured by model components such as word embeddings (Bolukbasi et al., 2016) and attention heads (Vig et al., 2020). A simple approach to mitigating gender bias in word embeddings is to “swap” gendered terms in the training data when generating word embeddings (Zhao et al., 2018). Furthermore, simply masking names and pronouns may also reduce bias and improve the performance of certain language tasks (Dayanik and Padó, 2020). However, to date, there is still no general, unified approach to reducing the data bias of PLMs for text generation. Some of these techniques for bias detection and mitigation have been critiqued as merely capturing over-simplified dimensions of bias, with proper debiasing requiring more holistic evaluation (Gonen and Goldberg, 2019).

6.2. Model View

In this section, we present the challenges from the PLM architecture and discuss corresponding solutions for text generation.

6.2.1. Model Compression

Although PLMs have achieved great success on text generation tasks, the backbone Transformers are bulky and resource-hungry, resulting in high memory consumption, computational overhead, and energy cost. One way to address these issues is to compress the parameters of PLMs. According to Ganesh et al. (Ganesh et al., 2020), there are three kinds of methods to compress PLMs for text generation: quantization, pruning, and knowledge distillation.

Quantization. Quantization means reducing the number of unique values used to represent PLM weights, which in turn allows them to be represented using fewer bits (Ganesh et al., 2020). As most PLMs are based on the Transformer, quantization can generally be applied to the PLM weights residing in fully-connected layers (i.e., the embedding layers, the linear layers, and the feed-forward network layers). To alleviate the issue of generating unsatisfactory text when truncating PLM weights, a helpful solution is to first identify important weights and then leave them untruncated during the quantization step (Zadeh et al., 2020).
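As a rough illustration of the idea, the sketch below applies uniform quantization to a list of weights while leaving a chosen set of "important" indices untouched. Uniform rounding and the `keep` argument are simplifying assumptions; they are not the specific schemes of the cited works.

```python
def quantize(weights, bits=2, keep=frozenset()):
    """Uniformly quantize floats to at most 2**bits distinct values.

    Indices listed in `keep` are treated as important weights and are
    returned unchanged (a stand-in for the protection step described above).
    """
    body = [w for i, w in enumerate(weights) if i not in keep]
    if not body:
        return list(weights)
    lo, hi = min(body), max(body)
    intervals = 2 ** bits - 1
    step = (hi - lo) / intervals if hi > lo else 1.0
    return [
        w if i in keep else lo + round((w - lo) / step) * step
        for i, w in enumerate(weights)
    ]

weights = [-0.5, -0.1, 0.0, 0.2, 0.5, 3.0]
# Index 5 is a hypothetical "important" outlier kept in full precision.
quantized = quantize(weights, bits=2, keep={5})
```

With 2 bits, the five ordinary weights collapse onto at most four values, while the protected outlier keeps its original value and does not stretch the quantization range.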

Pruning. Pruning refers to identifying and removing redundant or less important weights and/or components (Ganesh et al., 2020). Pruning methods for text generation largely fall into two categories. The first, unstructured pruning, prunes individual weights by locating the set of least important weights in PLMs. The importance of weights can be measured by specific metrics such as absolute values (Gordon et al., 2020) and gradients (Guo et al., 2019). The second, structured pruning, prunes structured blocks of weights or even complete components in PLMs by reducing and simplifying certain modules such as attention heads (Hou et al., 2020) and Transformer layers (Fan et al., 2020).
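The two categories can be contrasted with a small sketch. The first function performs unstructured pruning using absolute value as the importance metric; the second is a structured variant that keeps whole attention heads according to an importance score. The function names and the head scores are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction with smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

def select_heads(head_scores, n_keep):
    """Structured pruning: indices of the n_keep highest-scoring heads."""
    ranked = sorted(range(len(head_scores)),
                    key=lambda i: head_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

sparse_weights = magnitude_prune([0.9, -0.05, 0.3, 0.01], sparsity=0.5)
kept_heads = select_heads([0.1, 0.7, 0.4], n_keep=2)
```

Unstructured pruning leaves the tensor shape intact but sparse, whereas structured pruning removes entire components and therefore shrinks the model directly.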

Knowledge Distillation. Knowledge distillation refers to training a smaller model (called the student) using the output of PLMs (called the teacher). First, the student model can directly learn from the output word distribution of the final softmax layer in PLMs, which allows the student to mimic the generated text of the teacher model by replicating the word distribution across the whole vocabulary (Chen et al., 2020c). Second, the student can also learn from the output tensors of PLM encoders (Li et al., 2020b). Intuitively, the representations of the PLM encoder may contain meaningful semantics and contextual relationships between input tokens, which is helpful for generating accurate text. Third, by replicating attention distributions between input data and output text, a student can also learn their contextual dependency (Jiao et al., 2020).
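The first form of distillation, matching the teacher's output word distribution, can be sketched as a divergence between two softmax distributions at a single decoding step. Using the KL divergence as the matching loss is an assumption made here for illustration; the cited works may use different objectives.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary for one decoding step."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher_logits = [2.0, 1.0, 0.1]      # toy vocabulary of three words
loss_matched = distillation_loss([2.0, 1.0, 0.1], teacher_logits)
loss_mismatched = distillation_loss([0.1, 1.0, 2.0], teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge, which is the signal used to train the student.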

6.2.2. Model Extension

Recently there is a rising interest in the research community to improve the basic architecture of PLMs for achieving better performance in text generation.

Large-scale PLMs. Kaplan et al. (Kaplan et al., 2020) have shown that the performance of PLMs is considerably related to the scale of PLM parameters. This observation triggered the advent of large-scale PLMs in text generation (Brown et al., 2020; Zeng et al., 2021). The most representative PLM for text generation is GPT-3 (Brown et al., 2020), which has 175 billion parameters, 10x more than any previous non-sparse PLM. With large-scale parameters, GPT-3 can achieve strong performance in many text generation tasks without any gradient updates or fine-tuning.

Knowledge-Enriched PLMs. Recent studies have shown that integrating the knowledge available in external knowledge sources can further improve the text generation performance of PLMs (Sun et al., 2021; Zhou et al., 2021a). Specifically, ERNIE 3.0 (Sun et al., 2021) was pretrained on a 4TB corpus consisting of plain texts and a large-scale knowledge graph for both language understanding and generation tasks. Without incorporating explicit knowledge, CALM (Zhou et al., 2021a) can pack commonsense knowledge into the parameters by teaching PLMs to write and reason with common concepts through pre-training strategies, yielding better performance on text generation tasks.

Efficient PLMs. Pretraining PLMs on large-scale text data is considerably expensive. Recently, by elaborately designing the model architecture, it is possible to achieve comparable or better text generation performance using less pretraining data (Zhou et al., 2021a) or less pretraining costs (Jiang et al., 2020b). Specifically, while only incrementally pretrained on a relatively small corpus for a few steps, CALM (Zhou et al., 2021a) achieved comparable results with some larger PLMs such as T5 on text generation tasks.

6.2.3. Model Robustness

Transformer-based PLMs are brittle to both adversarial and natural noise, which may have a great impact on the text generation performance of PLMs.

Character Embedding. One reason behind model brittleness is the use of sub-word embeddings during generation. With sub-word embeddings, once a wrong sub-word is generated, it influences the meaning and representation of the final word, which degrades the final generation quality (Ma et al., 2020). To improve the robustness of PLMs in text generation, several studies have proposed to utilize character-level embeddings (Boukkouri et al., 2020; Xie et al., 2017). In particular, Xie et al. (Xie et al., 2017) combined word- and character-level embeddings for poetry generation and showed that language models fusing word- and character-level embeddings significantly outperform models that use only word-level or only character-level embeddings.

Data Augmentation. Another reason for the vulnerability of PLMs is that they do not generalize well to the semantic neighborhood around each generation instance in the representation space (Schmidt et al., 2018). To solve this issue, data augmentation (Jia and Liang, 2017) has been proposed, which revises the original generation data to produce attack-related data for training. With the development of text generation techniques, back translation (Xie et al., 2020) and variational auto-encoders (Wang et al., 2020) are used to augment new data. Although they perform well, these methods lack generality. Zhou et al. (Zhou et al., 2021b) utilized a masked language model with Gaussian noise to augment virtual examples for improving robustness.

6.3. Optimization View

In this part, we discuss the challenges and solutions about how to optimize PLMs for text generation.

6.3.1. Satisfying Special Text Properties

In Section 5.3, we introduced three basic text properties. In this section, we present three more challenging text properties for text generation tasks, i.e., coherence, factuality, and controllability.

Enhancing the Coherence. Language coherence in linguistics is what makes a text semantically meaningful. To enhance coherence, an important approach is to elaborately plan the generated content, called text planning (Li et al., 2021e; Hua et al., 2021). For example, Li et al. (Li et al., 2021e) designed a two-level text plan: (1) the document plan is modeled as a sequence of sentence plans in order, and (2) the sentence plan is modeled as an entity-based subgraph from the KG. The local coherence is naturally enforced by KG subgraphs, and the global coherence can be improved by generating a coherent sequence of subgraphs. During the generation process, decoding strategies such as top-k sampling usually produce improper tokens, especially at sentence boundaries, which decreases the discourse coherence. Wang et al. (Wang et al., 2021b) designed an auxiliary task of discourse relation modeling to enhance the discourse coherence of the generated text by classifying adjacent sentences.

Preserving the Factuality. The input data of some text generation tasks, such as table-to-text generation, usually contains factual information. In such cases, the generated content should adhere to the original input facts. However, the universal structure of PLMs is unable to retain factuality in specific tasks. For data-to-text generation, the pointer generator (See et al., 2017) is usually adopted to copy the input facts into the output for preserving factuality (Chen et al., 2020b; Li et al., 2021c). Besides, the input text of text summarization sometimes consists of news that includes world facts. To make summarization models produce more factual summaries, some works proposed evaluation metrics or correction methods to measure and revise the generated text for preserving factuality (Nan et al., 2021; Dong et al., 2020).

Improving the Controllability. Controlling attributes of generated text becomes difficult without modifying the model architecture of large-scale PLMs to allow for extra input attributes. A representative controllable PLM is the Plug and Play Language Model (Dathathri et al., 2020), also called PPLM, which combines a PLM with one or more simple attribute classifiers that guide text generation without any further training of the PLM. Several studies achieved the goal of controllability from a distributional view (Khalifa et al., 2021; Pascual et al., 2021). For example, Pascual et al. (Pascual et al., 2021) presented a plug-and-play decoding method, which can be described in a single sentence: given a topic or keyword, the model adds a shift to the probability distribution over the vocabulary towards semantically similar words.
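The single-sentence description of the plug-and-play decoding method can be turned into a toy sketch: each word's logit is shifted in proportion to its similarity to the keyword before the softmax is applied. The additive form, the `strength` parameter, and the similarity values are assumptions for illustration and are not taken from the cited paper.

```python
import math

def shifted_distribution(logits, similarity, strength=2.0):
    """Shift next-word logits towards words similar to a keyword.

    logits:     {word: model logit}
    similarity: {word: similarity to the keyword}; missing words count as 0
    """
    shifted = {w: v + strength * similarity.get(w, 0.0) for w, v in logits.items()}
    peak = max(shifted.values())
    exps = {w: math.exp(v - peak) for w, v in shifted.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

logits = {"ocean": 1.0, "table": 1.0, "wave": 0.5}
sea_similarity = {"ocean": 0.9, "wave": 0.8}   # toy values for keyword "sea"
base = shifted_distribution(logits, {})
steered = shifted_distribution(logits, sea_similarity)
```

Because only the output distribution is modified, the underlying PLM needs no retraining, which is what makes the approach plug-and-play.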

6.3.2. Mitigating Tuning Instabilities

Due to catastrophic forgetting and the small size of text generation datasets, tuning PLMs for text generation is usually unstable, i.e., fine-tuning the model with different random seeds results in a large variance in performance. Possible solutions include a) intermediate fine-tuning, b) mixout, and c) a supervised contrastive loss.

Intermediate Fine-Tuning. Recent studies have shown that continuously pretraining PLMs on unsupervised tasks like language modeling, and then fine-tuning them on the target generation tasks can achieve significantly better target performance than using the target generation training data alone (Phang et al., 2018; Liu et al., 2021c). For example, Liu et al. (Liu et al., 2021c) constructed an intermediate monolingual corpus of the target language (e.g., Kazakh) and fine-tuned mBART to reconstruct the corrupted monolingual text for improving the translation quality of the low-resource target language.

Mixout Strategy. Fine-tuning a large PLM on a text generation task is prone to degenerate performance when only a small number of training instances are available. Lee et al. (Lee et al., 2020) introduced a regularization technique, mixout, which stochastically mixes the parameters of two PLMs. They showed that the mixout strategy regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory.
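A minimal sketch of the mixing step is given below, assuming the two models are the current fine-tuned parameters and the original pretrained parameters, represented here as flat lists. The rescaling term follows the commonly used formulation of mixout as recalled here and should be treated as an assumption; the function name and toy values are hypothetical.

```python
import random

def mixout(finetuned, pretrained, p, rng):
    """Swap each fine-tuned parameter for its pretrained counterpart with
    probability p, then rescale so the expectation equals the fine-tuned value."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    mixed = []
    for w, u in zip(finetuned, pretrained):
        chosen = u if rng.random() < p else w
        mixed.append((chosen - p * u) / (1.0 - p))
    return mixed

finetuned = [1.0, 2.0, 3.0]
pretrained = [0.0, 0.0, 0.0]
mixed = mixout(finetuned, pretrained, p=0.5, rng=random.Random(0))
```

Setting `p=0` recovers ordinary fine-tuning, while larger values pull more parameters back to the pretrained model during each training step.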

Contrastive Learning. The most widely used cross-entropy loss in text generation, i.e., the KL-divergence between one-hot label vectors and the distribution of the model’s outputs, lacks robustness to noisy labels (Zhang and Sabuncu, 2018) or adversarial examples (Elsayed et al., 2018). Therefore, fine-tuning PLMs with cross-entropy loss tends to be unstable across different runs, especially when supervised data is limited. To tackle this issue, Gunel et al. (Gunel et al., 2021) proposed an objective that includes a supervised contrastive learning term that pushes the words from the same class closer together and the words from different classes further apart.

7. Evaluation and Resources

In this section, we will present some widely-used evaluation metrics and resources with respect to PLMs for text generation.

7.1. Evaluation

With the increasing number of text generation applications and datasets, human evaluation of generated text is costly and time-consuming to design and run and, more importantly, its results are not always repeatable (Celikyilmaz et al., 2020). Therefore, in this section, we focus on automatic evaluation metrics for text generation. Following Celikyilmaz et al. (Celikyilmaz et al., 2020), we present four categories of metrics, i.e., n-gram overlap metrics, diversity metrics, semantic similarity metrics, and logit-based metrics. We list the metrics used in each generation task in Table 2.

7.1.1. N-Gram Overlap Metrics

These metrics measure the degree of “matching” between machine-generated and ground-truth texts.

BLEU. The Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) is one of the first metrics used to measure the similarity between two sentences. The metric was originally proposed for machine translation, comparing a candidate translation of a text to one or more reference translations, and is now employed in various generation tasks. BLEU-n measures the precision of the co-occurrences of n-grams between the generated and real text and applies a length penalty to shorter generated text. In particular, SacreBLEU (Post, 2018) is recommended for use in machine translation to avoid inconsistency issues. Several smoothing methods (Chen and Cherry, 2014) have also been proposed for evaluating short sentences.
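The two ingredients named above, n-gram precision and the length penalty, can be sketched for a single reference and a single n-gram order. This is a simplified illustration with hypothetical function names; full BLEU combines several orders with a geometric mean, and SacreBLEU should be preferred for reported scores.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def brevity_penalty(candidate, reference):
    """Penalize candidates shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1.0 - r / c)

reference = "the cat sat on the mat".split()
repeated = "the the the the the the".split()
short = "the cat sat".split()
```

Clipping prevents a degenerate candidate such as `repeated` from scoring perfect unigram precision, and the brevity penalty prevents `short` from being rewarded for saying less.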

ROUGE. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) is a set of metrics for evaluating automatic summarization of long texts consisting of multiple sentences or paragraphs. ROUGE-n computes the F1 score of the overlapping n-grams between generated and real text.
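A minimal sketch of the F1 computation for a single reference is shown below; the function name is hypothetical, and stemming or other preprocessing used by standard ROUGE toolkits is omitted.

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """F1 over the n-grams shared by a candidate and a single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((cand & ref).values())   # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

generated = "the cat sat on the mat".split()
gold = "the cat lay on the mat".split()
rouge_1 = rouge_n_f1(generated, gold, n=1)
rouge_2 = rouge_n_f1(generated, gold, n=2)
```

Here five of six unigrams and three of five bigrams are shared, so the bigram score is noticeably lower than the unigram score.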

METEOR. The Metric for Evaluation of Translation with Explicit ORdering (METEOR) (Banerjee and Lavie, 2005) is designed to address some of the issues found in BLEU. Compared to BLEU, METEOR is computed based on the harmonic mean of the unigram precision and recall, and measures word-to-word matches based on WordNet between generated and real text.

ChrF++. Character n-gram F-score (ChrF++) (Popovic, 2017) is an automatic evaluation metric for machine translation. Unlike the word-level co-occurrence of BLEU, ChrF++ mainly focuses on the character level to account for morpheme overlap.

7.1.2. Diversity Metrics

Lexical diversity is desirable in many text generation tasks, such as dialog systems and story generation.

Distinct. Distinct-n measures the degree of diversity by calculating the number of distinct n-grams in generated text (Li et al., 2016). The value is scaled by the total number of generated tokens to avoid favoring long sentences.
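The definition above translates almost directly into code. The sketch below follows the description literally, dividing the number of distinct n-grams by the number of generated tokens; the function name is hypothetical.

```python
def distinct_n(tokens, n):
    """Number of distinct n-grams divided by the number of generated tokens."""
    if not tokens:
        return 0.0
    grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(grams) / len(tokens)

response = "i do not know i do not care".split()
distinct_1 = distinct_n(response, 1)
distinct_2 = distinct_n(response, 2)
```

A response that keeps repeating the same phrase has few distinct n-grams relative to its length and therefore receives a low score.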

7.1.3. Semantic Similarity Metrics

Researchers utilized neural networks to capture semantic meaning and syntactic structure of sentences by mapping them into vectors, and text generation results can be evaluated by sentence embeddings from the generated and reference texts.

BERTScore. Given the excellent performance of BERT across many tasks, BERTScore (Zhang et al., 2020b) leverages the pretrained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. BERTScore has been shown to correlate well with human judgments on sentence-level and system-level evaluations.
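The matching step can be sketched with toy vectors standing in for contextual embeddings: each token is greedily matched to its most similar counterpart by cosine similarity, and the averaged similarities give precision, recall, and F1. The two-dimensional vectors and function names are assumptions; a real implementation would obtain embeddings from BERT and may add refinements such as importance weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_f1(cand_vecs, ref_vecs):
    """Greedy matching: every token is paired with its most similar
    token on the other side, then the similarities are averaged."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference_vecs = [[1.0, 0.0], [0.0, 1.0]]   # toy "embeddings" of two tokens
full_score = greedy_match_f1([[1.0, 0.0], [0.0, 1.0]], reference_vecs)
partial_score = greedy_match_f1([[1.0, 0.0]], reference_vecs)
```

Because matching happens in embedding space, a candidate can be rewarded for a paraphrase whose vectors lie close to the reference tokens even when the surface words differ.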

7.2. Resources

In this section, we will introduce some available open-source libraries and benchmarks.

7.2.1. Open-Source Libraries

There are many popular text generation libraries that help researchers conveniently work with PLMs. Libraries like Transformers (Wolf et al., 2020) and Fairseq (Ott et al., 2019) are helpful for model training and evaluation. Some libraries, like TextBox (Li et al., 2021b) and Dl-Translate (https://github.com/xhlulu/dl-translate), are built on top of the Transformers library and make it possible to construct text generation models with just a few lines of code. Others, like FastSeq (Yan et al., 2021), DeepSpeed (Rasley et al., 2020), and LightSeq (Wang et al., 2021c), are useful for increasing the inference speed of models.

7.2.2. Benchmarks

PLMs have made great progress in a host of Natural Language Understanding (NLU) tasks. Meanwhile, the development of general evaluation benchmarks has also helped drive the progress of these PLMs. In addition to GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) which are general language understanding evaluation benchmarks, an increasing number of general benchmarks designed for text generation have recently been proposed. Liu et al. (Liu et al., 2021d) introduced the General Language Generation Evaluation (GLGE) benchmark, a new multi-task benchmark for evaluating the generalization capabilities of text generation in English language.

8. Application

| Tasks | Sub-Tasks | Datasets | Metrics |
|---|---|---|---|
| Machine Translation | | WMT’14 English-French (Conneau and Lample, 2019), WMT’16 German-English (Conneau and Lample, 2019), WMT’16 Romanian-English (Conneau and Lample, 2019) | SacreBLEU |
| Sum. | Vanilla Sum. | CNN/DailyMail (Song et al., 2019), XSum (Song et al., 2019), GigaWord (Song et al., 2019) | ROUGE |
| Sum. | Dialogue Sum. | SAMSum (Chen and Yang, 2020) | ROUGE |
| Dialogue System | Dialogue System | PersonaChat (Bao et al., 2020b), DailyDialog (Bao et al., 2020b), DSTC7-AVSD (Bao et al., 2020b) | PPL (https://en.wikipedia.org/wiki/Perplexity), BLEU, Distinct |
| Dialogue System | Dialogue System | MultiWOZ (Budzianowski and Vulic, 2019) | BLEU, Inform, Success |
| Question Generation | | SQuAD (Dong et al., 2019) | BLEU, ROUGE, METEOR |
| Story Generation | | ROCStories (Guan et al., 2020), WritingPrompts (Rashkin et al., 2020) | PPL, BLEU, Distinct |
| Data-to-text Generation | | AGENDA (Ribeiro et al., 2020), LDC2017T10 (Mager et al., 2020), WikiBio (Chen et al., 2020b), WebNLG (Ribeiro et al., 2020), E2E (Chen et al., 2020d) | METEOR, chrF++ |

Table 2. Summary of common datasets and corresponding metrics used in each generation task. Sum. is short for summarization and doc. denotes document. Smoothing method 7 (with NLTK version 3.4) is usually employed in open-domain dialogue systems (Bao et al., 2020b). Inform (rate) and Success (rate) are two accuracy metrics specially designed for task-oriented dialogue systems (Budzianowski and Vulic, 2019).

As discussed in Section 2, text generation tasks can be categorized into different kinds of applications according to the input data, and PLMs have been applied successfully to a wide range of them. Table 2 shows the overall taxonomy of tasks together with the corresponding common datasets and metrics. In what follows, we highlight some classic applications, namely machine translation, text summarization and dialogue systems. We also summarize useful methods for designing a task-specific PLM and for utilizing PLMs in special tasks.
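
Among the metrics in Table 2, Distinct is simple enough to illustrate directly. The following is a minimal sketch, assuming the common definition of Distinct-n as the number of unique n-grams divided by the total number of n-grams over a set of generated responses. The function name `distinct_n`, the whitespace tokenization and the toy responses are illustrative assumptions, not part of any cited implementation.

```python
from typing import List


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over all responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Convention chosen for this sketch: return 0.0 when there are no n-grams.
    return len(unique) / total if total else 0.0


# Toy responses (made up for illustration); two of them are identical.
responses = [
    "i do not know".split(),
    "i do not know".split(),
    "i like jazz".split(),
]
print(distinct_n(responses, 1))
print(distinct_n(responses, 2))
```

Repeated, bland responses lower the score, because they add to the total n-gram count without adding new unique n-grams.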

8.1. Machine Translation

Machine translation (MT) focuses on automatically translating text from one language into another. With the development of deep learning, Neural Machine Translation (NMT) has become the dominant method in both academic research and commercial use (Dabre et al., 2020). According to whether parallel corpora are used when fine-tuning PLMs, we divide machine translation into unsupervised machine translation and supervised machine translation.

8.1.1. Unsupervised Machine Translation

Unsupervised Machine Translation (UMT) uses only monolingual corpora, without any parallel data, during both pretraining and fine-tuning of PLMs. UMT frees machine translation from its reliance on large-scale annotated corpora, and it has also brought unprecedented breakthroughs in minority language translation. In this setting, UMT is usually broken down into two steps following (Lample et al., 2018): 1) PLMs are pretrained on monolingual corpora in multiple languages, learning a satisfactory embedding of each token and a probability for each sentence in a given language. 2) Iterative back-translation is then leveraged to combine the source-to-target and target-to-source models with denoising auto-encoding and back-translation objectives.
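
To make the second step more concrete, the two objectives can be written roughly as follows. The notation is an illustrative assumption rather than a formula quoted from the cited works: $\mathcal{D}_s$ and $\mathcal{D}_t$ denote the source and target monolingual corpora, $C(\cdot)$ is a noise function that corrupts a sentence, $P_{a \to b}$ is the model translating from language $a$ to language $b$, and $\hat{y}(x)$ and $\hat{x}(y)$ are translations produced by the current models.

```latex
% Denoising auto-encoding: reconstruct x from its corrupted copy C(x) in each language
\mathcal{L}_{\mathrm{DAE}} =
  \sum_{\ell \in \{s, t\}} \mathbb{E}_{x \sim \mathcal{D}_{\ell}}
  \left[ -\log P_{\ell \to \ell}\left(x \mid C(x)\right) \right]

% Back-translation: translate with the current model, then learn to recover the original
\mathcal{L}_{\mathrm{BT}} =
  \mathbb{E}_{x \sim \mathcal{D}_{s}} \left[ -\log P_{t \to s}\left(x \mid \hat{y}(x)\right) \right]
  + \mathbb{E}_{y \sim \mathcal{D}_{t}} \left[ -\log P_{s \to t}\left(y \mid \hat{x}(y)\right) \right]
```

Under this reading, each round of back-translation produces pseudo-parallel pairs from monolingual text, and the two translation directions improve each other iteratively.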

Bilingual Unsupervised Machine Translation. Here, bilingual UMT refers to conducting UMT between two languages, without a third language. We mainly focus on the first step of UMT, i.e., how to design a pretrained monolingual language model. XLM (Conneau and Lample, 2019) and mBERT (Devlin et al., 2019) are pretrained on monolingual data in multiple languages using masked language modeling, and the pretrained model is used to initialize both the encoder and the decoder. mBART (Liu et al., 2020) follows the BART (Lewis et al., 2020b) pretraining scheme on multiple languages. However, the above-mentioned methods just perform the original pretraining task on multilingual corpora, without considering the relationship between languages. CMLM (Ren et al., 2019) proposes cross-lingual MLM, which randomly masks tokens in monolingual sentences and predicts the corresponding translation candidates in an n-gram translation table. CMLM therefore helps align the embeddings of different languages. Moreover, creating pseudo-parallel corpora is also an effective way to augment monolingual datasets. MARGE (Lewis et al., 2020a) retrieves a set of relevant texts in many languages and reconstructs the original text conditioned on the retrieved texts.
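
The cross-lingual masking idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CMLM implementation: the function `cmlm_example`, the `[MASK]` symbol, the span-selection rule and the toy English-to-French n-gram table are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol


def cmlm_example(
    tokens: List[str],
    ngram_table: Dict[Tuple[str, ...], List[str]],
    max_n: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Mask one source n-gram that has an entry in the translation table.

    Returns (masked_tokens, translation_candidates). If no n-gram of the
    sentence appears in the table, the sentence is returned unchanged
    together with an empty candidate list.
    """
    rng = rng or random.Random()
    # Collect every (start, length) span whose n-gram is in the table.
    spans = [
        (i, n)
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) in ngram_table
    ]
    if not spans:
        return list(tokens), []
    i, n = rng.choice(spans)
    masked = tokens[:i] + [MASK] * n + tokens[i + n:]
    return masked, ngram_table[tuple(tokens[i:i + n])]


# Toy n-gram translation table (made up for illustration).
table = {("good", "morning"): ["bonjour"], ("cat",): ["chat"]}
sentence = "good morning little cat".split()
print(cmlm_example(sentence, table, rng=random.Random(0)))
```

The model would then be trained to predict the returned translation candidates at the masked positions, which ties source-language context to target-language vocabulary.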

Multilingual Unsupervised Machine Translation. Compared to bilingual UMT, multilingual UMT explores UMT with the assistance of a third language. The third language can provide auxiliary monolingual data, or parallel data that involves only one of the source and target languages. Here, we mainly focus on the second step (back-translation) in UMT. Garcia et al. (Garcia et al., 2020) aggregate back-translation losses and introduce a novel cross-translation term to incorporate the auxiliary corpus. They then utilize the EM algorithm to optimize the model.

8.1.2. Supervised Machine Translation

Supervised machine translation refers to the use of parallel corpora during pretraining or fine-tuning. Next, we discuss how to utilize existing PLMs and how to design PLMs for parallel corpora.

Utilizing existing PLMs. Almost all the models mentioned above that use unsupervised pretraining, such as XLM (Conneau and Lample, 2019) and mBART (Liu et al., 2020), can be directly fine-tuned on bilingual pairs. Moreover, considering the excellent encoding capability of BERT, the BERT-fused model (Zhu et al., 2020) leverages BERT to extract contextual embeddings for the source sentence and fuses these representations with each layer of the encoder and decoder. CTNMT (Yang et al., 2020b) leverages asymptotic distillation and a dynamic switching gate to integrate the BERT embeddings. Most of the above PLMs improve zero-, low- and medium-resource translation by a large margin compared with a randomly initialized Transformer (Liu et al., 2020). These results demonstrate the effectiveness of pretraining on multilingual corpora. In contrast, multilingual PLMs usually suffer from performance degradation in high-resource translation. In this case, covering multiple languages may reduce the model capacity available for high-resource languages.

Designing PLMs for parallel corpora. All the PLMs mentioned above are pretrained on monolingual corpora using pretraining tasks such as MLM and DAE. However, these pretraining objectives differ from the downstream translation task. Hence, mRASP (Lin et al., 2020) pretrains the model on bilingual pairs with the vanilla Seq2Seq loss, so that the objective is consistent with the fine-tuning stage. It randomly substitutes words in the source sentence with words that have the same meaning in other languages, which encourages words with similar meanings across different languages to share similar representations. mRASP2 (Pan et al., 2021) augments the representation alignment on both parallel and monolingual data, and applies contrastive learning to minimize the representation gap between similar sentences and maximize that between irrelevant sentences. Pretrained on parallel data, these models can improve machine translation for any language pair, including low-resource and high-resource languages. Compared to hundreds of billions of monolingual sentences, these models require only hundreds of millions of bilingual pairs. However, acquiring such annotated data requires massive manpower and financial resources.
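
The word substitution step described for mRASP can be illustrated with a short sketch. It assumes a word-level bilingual dictionary is available. The function name, the substitution probability and the toy dictionary are hypothetical and are not taken from the mRASP code.

```python
import random
from typing import Dict, List, Optional


def random_aligned_substitution(
    tokens: List[str],
    dictionary: Dict[str, List[str]],
    prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace each word that has a dictionary entry, with probability `prob`,
    by one of its translations in another language."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        candidates = dictionary.get(tok)
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


# Toy English -> {French, German} dictionary (made up for illustration).
toy_dict = {"love": ["aime", "liebe"], "cats": ["chats", "Katzen"]}
source = "i love cats".split()
print(random_aligned_substitution(source, toy_dict, prob=0.5, rng=random.Random(0)))
```

The code-switched source sentence is paired with the unchanged target sentence, so words with the same meaning in different languages appear in the same contexts during pretraining.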

8.2. Text Summarization

Text summarization condenses texts into a concise summary that preserves the important information of the original texts. Equipped with summarization, we can comprehend the core idea of various textual contents, such as news, scientific papers and dialogues, in a time-efficient way (El-Kassas et al., 2021).

The mainstream approaches to summarization are extractive and abstractive. Extractive summarization chooses a subset of sentences and concatenates them to form the summary (Liu and Lapata, 2019; Zhang et al., 2019b). In contrast, abstractive summarization encodes the input texts into an abstract representation and generates a summary that may use different words from the original text (See et al., 2017; Zhang et al., 2020e). Extractive summarization can be seen as a binary classification task, i.e., determining whether each sentence will be preserved in the summary, while abstractive summarization follows the common text generation paradigm. Considering the text generation formulation 1, we only discuss abstractive summarization in this part.

8.2.1. Document Summarization

Documents are the most common textual form, including news, opinions, reviews and scientific papers. PLMs with a prefix LM or encoder-decoder architecture can be directly utilized for summarization, such as UniLM (Dong et al., 2019; Bao et al., 2020a), MASS (Song et al., 2019), T5 (Raffel et al., 2020), BART (Lewis et al., 2020b) and ProphetNet (Qi et al., 2020). PEGASUS (Zhang et al., 2020e) is a PLM tailored for summarization. During pretraining, the important sentences in the input document are masked and then generated from the remaining ones, which is similar in spirit to summarization. Among these PLMs, most subsequent works utilize BART or PEGASUS as the backbone for summarization.
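
The sentence-masking idea behind PEGASUS can be sketched as follows. This is a simplified illustration, not the PEGASUS implementation: sentence importance is approximated here by unigram-overlap F1 between a sentence and the rest of the document, as a stand-in for ROUGE-based selection, and the `<mask_sent>` placeholder and function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

MASK_SENT = "<mask_sent>"  # hypothetical placeholder token


def overlap_f1(candidate: List[str], reference: List[str]) -> float:
    """Unigram-overlap F1, used here as a simple stand-in for ROUGE-1."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences: List[str], num_masked: int = 1) -> Tuple[str, str]:
    """Mask the highest-scoring sentences; return (source, target) strings."""
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, sent in enumerate(tokenized):
        rest = [tok for j, other in enumerate(tokenized) if j != i for tok in other]
        scores.append(overlap_f1(sent, rest))
    # Highest scores first; the stable sort breaks ties by sentence position.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = set(ranked[:num_masked])
    source = " ".join(MASK_SENT if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in chosen)
    return source, target


# Toy document (made up for illustration).
doc = [
    "The city council approved the new park budget.",
    "The park will open next spring.",
    "Local residents praised the council.",
]
source, target = gap_sentence_example(doc)
print(source)
print(target)
```

The encoder sees the document with the selected sentences masked, and the decoder is trained to generate those sentences, so the pretraining task resembles producing a summary from its context.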

Considering that summarization aims to generate the important information from the input, several works first extract keywords, key sentences or relations as guidance and then combine them with PLMs to generate summaries. CIT (Saito et al., 2020) employs an extractor (RoBERTa) to extract the important words and sentences from the input, which are fed into the encoder together with the input. In addition, topic models can capture the global semantics of the document, which can be integrated into the summarization model (Nguyen et al., 2021). Finally, GSum (Dou et al., 2021) proposes a general framework that feeds different kinds of guidance into the generation model, including keywords, triples, highlighted sentences and retrieved summaries. Apart from external guidance, several tricks can be applied to summarization. Cao and Wang (Cao and Wang, 2021a) improve the attention mechanism to emphasize salient content in the document. Refactor (Liu et al., 2021a) first generates different candidate summaries under different setups, then scores them and selects an optimal candidate summary.

Although news is the most common textual form for summarization, several works focus on other textual forms. Goodwin et al. (Goodwin et al., 2020) study how to generate summaries conditioned on different topics or questions. DSGPT (Zhang et al., 2021b) proposes to pretrain in e-commerce scenarios and explores product title and review summarization. Furthermore, PASS (Oved and Levy, 2021) aggregates different reviews of one product into a short summary.

8.2.2. Dialogue Summarization

Different from plain documents, dialogues, such as meetings, chats and medical conversations, consist of multi-turn utterances from two or more users. Hence, it is critical to capture the semi-structured chat content and the users’ interactions in dialogue summarization (Feng et al., 2021a). Methods used in document summarization can be directly transferred to dialogue summarization. Zhang et al. (Zhang et al., 2021c) first truncate the dialogue into several chunks and summarize each chunk into a partial summary. Then they rewrite these partial summaries into a complete summary.
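
The chunk-then-rewrite procedure described above can be sketched as a two-stage pipeline. This is an illustration under stated assumptions rather than the method of the cited work: the token budget, the greedy packing rule and the `summarize` callable (a stand-in for any PLM-based summarizer) are hypothetical.

```python
from typing import Callable, List


def chunk_turns(turns: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive turns into chunks of at most `max_tokens`
    whitespace tokens. A single turn longer than the budget forms its own chunk."""
    chunks: List[List[str]] = []
    current: List[str] = []
    length = 0
    for turn in turns:
        n = len(turn.split())
        if current and length + n > max_tokens:
            chunks.append(current)
            current, length = [], 0
        current.append(turn)
        length += n
    if current:
        chunks.append(current)
    return chunks


def summarize_long_dialogue(
    turns: List[str], max_tokens: int, summarize: Callable[[str], str]
) -> str:
    """Stage 1: summarize each chunk. Stage 2: rewrite the partial summaries."""
    partial = [summarize(" ".join(chunk)) for chunk in chunk_turns(turns, max_tokens)]
    return summarize(" ".join(partial))


def first_three_words(text: str) -> str:
    """Toy stand-in for a summarization model: keep the first three words."""
    return " ".join(text.split()[:3])


# Toy dialogue (made up for illustration).
turns = ["A: hi there", "B: hello how are you", "A: fine thanks", "B: great to hear that"]
print(chunk_turns(turns, max_tokens=8))
print(summarize_long_dialogue(turns, 8, first_three_words))
```

In practice `summarize` would wrap a fine-tuned summarization PLM, and the budget would be set from the maximum input length of that model.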

Meanwhile, other works tailor summarization to the characteristics of dialogue. Chen and Yang (Chen and Yang, 2020) first extract different views of structure from conversations, which are later encoded by the conversation encoder. Afterwards, they utilize a multi-view decoder to combine these views and generate summaries. Furthermore, Chen and Yang (Chen and Yang, 2021b) construct discourse relation graphs and action graphs of conversations, in order to concentrate on the most salient utterances and understand the concrete details of users’ actions. Considering the low information density, topic drifts and frequent coreferences of dialogue (Feng et al., 2021a), some researchers use auxiliary tasks to extract intrinsic information from dialogue. Feng et al. (Feng et al., 2021b) utilize DialoGPT (Zhang et al., 2020c), a PLM specially designed for dialogue, to automatically extract keywords, detect redundant utterances and divide a dialogue into topically coherent segments. Similarly, ConDigSum (Liu et al., 2021g) detects dialogue topic transfer and generates summaries for each topic using contrastive learning.

8.3. Dialogue System

Dialogue systems (a.k.a. conversational agents) aim to let machines communicate fluently with humans. Technically, machines are required to generate a response conditioned on the history context. According to downstream applications, dialogue systems are commonly categorized into open-domain dialogue (ODD) and task-oriented dialogue (TOD). The former intends to converse with humans coherently and engagingly on open domains such as daily life, sports and entertainment (Huang et al., 2020), while the latter focuses on assisting users in completing specific tasks, such as hotel reservation and product purchase (Zhang et al., 2020d). In the following, we discuss how PLMs are used for each in turn.

8.3.1. Open-domain Dialogue System

Open-domain dialogue systems, also known as chatbots, focus on daily chat. For example, Microsoft XiaoIce is a famous open-domain dialogue system designed to satisfy the human need for communication, affection, and social belonging (Zhou et al., 2020a).

Designing Pretrained Dialogue Systems. Considering that universal PLMs, such as GPT-2, are pretrained on general corpora, some researchers extend PLMs specifically for dialogue response generation. Because large-scale dialogue corpora are difficult to obtain, it is common to pretrain language models on forum posts and comments, such as those from Reddit, Twitter and Weibo. DialoGPT (Zhang et al., 2020c) and Meena (Adiwardana et al., 2020) utilize causal language modeling like GPT-2 to pretrain on English or Chinese corpora. Blender (Roller et al., 2021) and PLATO (Bao et al., 2020b) utilize the Seq2Seq loss to generate the next utterance based on previous utterances. Moreover, PLATO (Bao et al., 2020b) applies the next utterance classification (NUC) loss, similar to the next sentence prediction (NSP) task of BERT, to judge whether the response is relevant to the dialogue history and thereby enhance the coherence of utterances. In order to penalize bland responses and decrease repetitions, DialoGPT (Zhang et al., 2020c) employs a mutual information maximization (MMI) function to predict the input given the generated response, and Blender (Roller et al., 2021) adopts the unlikelihood training objective to penalize n-grams that appear too many times.
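
As a rough illustration of the unlikelihood objective mentioned above, the sketch below computes the loss for a single decoding step. It assumes the standard token-level form, in which each negative candidate c contributes -log(1 - p(c)) in addition to the usual likelihood term. How the negative candidates are selected (for example, tokens from over-used n-grams) is left to the caller, and the function names, the weight `alpha` and the toy distribution are hypothetical.

```python
import math
from typing import Dict, Iterable


def likelihood_loss(probs: Dict[str, float], target: str) -> float:
    """Standard negative log-likelihood of the reference token."""
    return -math.log(probs[target])


def unlikelihood_loss(probs: Dict[str, float], negatives: Iterable[str]) -> float:
    """Sum over negative candidates c of -log(1 - p(c))."""
    return -sum(math.log(1.0 - probs[c]) for c in negatives)


def step_loss(
    probs: Dict[str, float], target: str, negatives: Iterable[str], alpha: float = 1.0
) -> float:
    """Likelihood term plus the weighted unlikelihood penalty for one step."""
    return likelihood_loss(probs, target) + alpha * unlikelihood_loss(probs, negatives)


# Toy next-token distribution (made up for illustration).
probs = {"yes": 0.5, "i": 0.25, "know": 0.25}
print(step_loss(probs, target="yes", negatives=["i"]))
```

The penalty grows as the model assigns more probability to a negative candidate, which pushes probability mass away from tokens that should be discouraged.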

Utilizing Existing PLMs. Aside from pretraining a dialogue model, there also exist various methods that utilize existing PLMs to solve dialogue tasks. TransferTransfo (Wolf et al., 2019) introduces dialog state embeddings as well as NUC task learning. Based on TransferTransfo, Golovanov et al. (Golovanov et al., 2019) modify the architecture to better model multiple inputs, including dialogue history, persona information, and current state. In addition, researchers explore methods to capture the hierarchical structure of dialogue. DialogBERT (Gu et al., 2021b) employs a hierarchical Transformer architecture and additional training objectives to capture the discourse-level coherence of dialogue. DialoFlow (Li et al., 2021d) proposes a dynamic flow mechanism to model the dialogue history by addressing the semantic influence of each utterance. Furthermore, some papers focus on controllable dialogue systems. Zeng and Nie (Zeng and Nie, 2021) utilize a condition-aware Transformer block to steer the response toward a specific topic label. StyleDGPT (Yang et al., 2020c) attempts to control the generated response in the target style with a KL loss at both the word level and the sentence level.

8.3.2. Task-Oriented Dialogue System

Task-oriented (a.k.a. goal-oriented) dialogue is a traditional dialogue task with many real-life applications. Before the emergence of PLMs, a task-oriented dialogue system was typically broken down into several modules, including natural language understanding, dialog state tracking, dialogue policy learning and natural language generation (Zhang et al., 2020d), each guided by its own labeled data. Though the main goal of task-oriented dialogue is to track the user’s intent and state in order to fill the necessary slot-value pairs, we mainly focus on the generation part, which generates a response conditioned on the given intent and slot pairs.

Considering the open-ended nature of dialogue, researchers usually adopt the auto-regressive GPT-2 as the backbone. SC-GPT (Peng et al., 2020) serializes the system action as the input and generates the corresponding response. Moreover, researchers also attempt to build end-to-end systems for task-oriented dialogue. Budzianowski and Vulic (Budzianowski and Vulic, 2019) propose to generate the response conditioned on the user input, system action and database state. SimpleTOD (Hosseini-Asl et al., 2020) and Ham et al. (Ham et al., 2020) generate the dialog state, system action and response successively, based on the user input and previously generated tokens. In addition, Shalyminov et al. (Shalyminov et al., 2020) propose to generate and retrieve candidate responses separately and utilize NUC to select the best one. PRAL (Gu et al., 2021a) utilizes two GPTs to model the user and the system respectively, and also involves a third GPT to perform knowledge distillation.
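
Serializing a system action into a token sequence, as described for SC-GPT above, can be sketched as follows. The concrete text format, the separator and the function names are illustrative assumptions and may differ from the format used in the cited work.

```python
from typing import Dict, List, Tuple

# A dialog act is an intent plus its slot-value pairs.
DialogAct = Tuple[str, Dict[str, str]]


def serialize_action(acts: List[DialogAct]) -> str:
    """Flatten dialog acts into text such as 'inform ( name = Hilton ; area = centre )'."""
    parts = []
    for intent, slots in acts:
        slot_text = " ; ".join(f"{slot} = {value}" for slot, value in slots.items())
        parts.append(f"{intent} ( {slot_text} )" if slot_text else f"{intent} ( )")
    return " ".join(parts)


def build_training_text(acts: List[DialogAct], response: str, sep: str = " & ") -> str:
    """Concatenate the serialized action and the response for an auto-regressive LM.
    The separator is a hypothetical choice for this sketch."""
    return serialize_action(acts) + sep + response


# Toy system action and response (made up for illustration).
acts = [("inform", {"name": "Hilton", "area": "centre"}), ("request", {})]
print(serialize_action(acts))
print(build_training_text(acts, "The Hilton is in the centre. Anything else?"))
```

At inference time, only the serialized action and the separator are given as the prompt, and the model continues the sequence with the response.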

8.4. Others

In this part, we will introduce other text generation tasks, including question generation, story generation and data-to-text generation.

8.4.1. Question Generation

Question generation (QG) can be seen as the dual problem of question answering (QA), i.e., generating a coherent question based on a given passage and answer. Existing PLMs, such as UniLM (Dong et al., 2019; Bao et al., 2020a) and ProphetNet (Qi et al., 2020), can be directly employed by concatenating the passage and answer as input. Moreover, researchers explore this task in different QA settings. Since previous works focus on one-hop question generation, Huang et al. (Huang et al., 2021b) propose a two-phase model to solve multi-hop question generation. Considering that the answer usually consists of a single sentence, Cao and Wang (Cao and Wang, 2021b) attempt to generate open-ended questions that are answered by multiple sentences. Moreover, Majumder et al. (Majumder et al., 2021) propose a clarification question generation task, which asks questions about information missing from the passage in order to reduce ambiguity.

8.4.2. Story Generation

Different from the above-mentioned tasks, story (or narrative, news) generation requires generating a long-form, open-ended text from a given title or premise. It is challenging to produce a coherent and relevant text based on such limited input (Garbacea and Mei, 2020).

In order to enhance the knowledge of PLMs, some works introduce knowledge graphs or commonsense knowledge. Guan et al. (Guan et al., 2020) and Mao et al. (Mao et al., 2019) utilize commonsense knowledge bases for intermediate fine-tuning of PLMs so that they generate reasonable stories. In order to generate a long-form text, content planning is a widely used method to select specific content and determine the output structure. PlotMachines (Rashkin et al., 2020) extracts keywords from the input as an outline. Megatron-Cntrl (Xu et al., 2020) further extends keywords to relevant sentences using knowledge base retrieval. ProGen (Tan et al., 2021) iteratively refines the generated texts to enhance their quality. Moreover, to enhance the consistency and coherence of generated long text, Guan et al. (Guan et al., 2020) leverage a contrastive learning loss, similar to the NSP loss, to judge whether two sentences are successive in the corpus.

8.4.3. Data-to-text Generation

All the above-mentioned tasks are text-to-text generation, i.e., the input is textual data. In the following, we introduce data-to-text generation, which refers to generating descriptive text about structured input data, such as tables, knowledge graphs (KGs) and abstract meaning representations (AMRs).

Some researchers design special pretraining tasks to pretrain a specific model for table-to-text generation (Xing and Wan, 2021), KG-to-text generation (Chen et al., 2020d) and AMR-to-text generation (Fan and Gardent, 2020). Beyond this, the most direct way to utilize existing PLMs is to simply linearize the structured table (Chen et al., 2020b; Gong et al., 2020) or KG (Ribeiro et al., 2020; Harkous et al., 2020) into textual form. In particular, it is common to use the depth-first traversal of an AMR as its serialization (Ribeiro et al., 2020; Mager et al., 2020). Considering the graph structure of KGs and AMRs, Li et al. (Li et al., 2021c) and Ribeiro et al. (Ribeiro et al., 2021) employ GNNs to obtain a better representation of each node. Li et al. (Li et al., 2021c) further align the entity embeddings of the PLM and the GNN to bridge the semantic gap. Moreover, a multi-task loss for reconstructing the structured table (Gong et al., 2020) or KG (Li et al., 2021c; Ke et al., 2021) is usually utilized to capture the semantic correspondence between the structured input and the output text. Besides, some works borrow the idea of dual learning to jointly learn the data-to-text generation and text-to-data parsing tasks (Ke et al., 2021).
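
The linearization strategies mentioned above can be sketched in a few lines. This is an illustration under stated assumptions: the `<H>`, `<R>`, `<T>` markers, the `attribute : value` table format, and the nested-tuple encoding of an AMR are hypothetical choices for this sketch. The AMR is treated as a tree, so a re-entrant node is simply repeated.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def linearize_triples(triples: List[Triple]) -> str:
    """Turn KG triples into a flat string with head, relation and tail markers."""
    return " ".join(f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)


def linearize_table(row: Dict[str, str]) -> str:
    """Turn one table row into 'attribute : value' pairs."""
    return " ; ".join(f"{attr} : {value}" for attr, value in row.items())


def linearize_amr(node) -> str:
    """Depth-first traversal of an AMR-like tree.
    A node is (concept, [(role, child_node), ...])."""
    concept, edges = node
    if not edges:
        return concept
    children = " ".join(f"{role} {linearize_amr(child)}" for role, child in edges)
    return f"( {concept} {children} )"


# Toy inputs (made up for illustration).
triples = [("Alan Turing", "field", "computer science")]
row = {"name": "Alan Turing", "born": "1912"}
amr = ("want-01", [(":ARG0", ("boy", [])),
                   (":ARG1", ("go-02", [(":ARG0", ("boy", []))]))])

print(linearize_triples(triples))
print(linearize_table(row))
print(linearize_amr(amr))
```

The resulting strings can be fed to a text-to-text PLM like any other input sequence, at the cost of losing the explicit graph structure. That loss is the motivation for the GNN-based encoders discussed above.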

8.4.4. Other Generation Tasks

Besides the tasks mentioned above, there also exist various other applications of PLMs, which we briefly introduce here. ColdGANs (Scialom et al., 2020) explores unconditional language generation. KG-BART (Liu et al., 2021b) investigates commonsense generation, i.e., generating natural language text that contains the provided commonsense concepts (words), which can be viewed as hard-constrained conditional generation (Garbacea and Mei, 2020). Moreover, text style transfer converts a text into another style while preserving the basic semantics of the input (Garbacea and Mei, 2020), and includes sentiment transfer, formality transfer and writing style transfer (Krishna et al., 2020). In addition, some researchers devote themselves to literary creation, such as poems (Li et al., 2020c) and lyrics (Xue et al., 2021b).

9. Conclusion and Future Directions

In this survey, we presented an overview of current representative research efforts on PLMs for text generation, and we expect it to facilitate future research. We began by introducing three key points in applying PLMs to text generation, based on which the main content of our survey is divided into three sections covering input representation learning, model architecture design, and model optimization. Besides, we discussed several non-trivial challenges with respect to the above three points. Finally, we reviewed a variety of evaluation metrics, open-source libraries, and common applications to help practitioners evaluate, choose and employ PLMs for text generation.

To advance this field, there remain several open problems and future directions.

Controllable Generation. Controllable text generation with PLMs is an interesting direction but still at a very early stage. Controlling some attributes of the generated text has many useful applications, such as generating positive responses to patients with depression in dialogue systems. However, PLMs are usually pretrained on universal corpora, which makes it difficult to control the multi-grained attributes of the generated text (e.g., sentiment, topic, and coherence). Keskar et al. (2019) have explored text generation with control codes that govern style, content and task-specific behavior. However, these control codes are preset and coarse-grained. Future work can explore multi-grained control and develop PLMs that are sufficiently steerable.

Optimization Exploration. Fine-tuning is the predominant optimization approach for distilling the linguistic knowledge learned in PLMs to downstream generation tasks. At present, prompt-based learning has become a performant and lightweight optimization method (Liu et al., 2021e). Future work can explore more kinds of optimization approaches that combine the advantages of current methods.

Language-agnostic PLMs. Nowadays, almost all PLMs for text generation are mainly based on English. These PLMs will encounter challenges when dealing with non-English generation tasks. Therefore, language-agnostic PLMs, which need to capture universal linguistic and semantic features across different languages, are worth investigating. An interesting direction is how to reuse existing English-based PLMs for text generation in non-English languages.

Ethical Concern. Currently, PLMs are pretrained on large-scale corpora crawled from the web without fine-grained filtering, potentially causing ethical issues such as generating private content about users. Therefore, researchers should try their best to prevent the misuse of PLMs. For this purpose, we can follow the key steps in (Ross, 2012), such as identifying threats and potential impacts and assessing likelihood. Besides, the text generated by PLMs might be prejudiced, in line with the bias in the training data along the dimensions of gender, race, and religion (Brown et al., 2020). Hence, we ought to intervene in PLMs to prevent such biases. Research on general debiasing approaches is extensive, but it is still preliminary for PLMs.


  • Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a Human-like Open-Domain Chatbot. arXiv preprint arXiv:2001.09977 (2020).
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
  • Bai et al. (2021) Yu Bai, Yang Gao, and Heyan Huang. 2021. Cross-Lingual Abstractive Summarization with Limited Parallel Resources. In ACL.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In IEEvaluation@ACL.
  • Bao et al. (2020a) Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. 2020a. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training. In ICML.
  • Bao et al. (2020b) Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng Wang. 2020b. PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable. In ACL.
  • Bao et al. (2021) Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhen Guo, Zhibin Liu, and Xinchao Xu. 2021. PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning. In ACL/IJCNLP Findings.
  • Beutel et al. (2017) Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. 2017. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations. arXiv preprint arXiv:1707.00075 (2017).
  • Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In NIPS.
  • Boukkouri et al. (2020) Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum, and Jun’ichi Tsujii. 2020. CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters. In COLING.
  • Brown et al. (1990) Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A Statistical Approach to Machine Translation. Comput. Linguistics 16, 2 (1990), 79–85.
  • Brown and Frederking (1995) Ralf Brown and Robert Frederking. 1995. Applying statistical English language modeling to symbolic machine translation. In TMI. 221–239.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS.
  • Budzianowski and Vulic (2019) Pawel Budzianowski and Ivan Vulic. 2019. Hello, It’s GPT-2 - How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems. In NGT@EMNLP-IJCNLP.
  • Cao and Wang (2021a) Shuyang Cao and Lu Wang. 2021a. Attention Head Masking for Inference Time Content Selection in Abstractive Summarization. In NAACL-HLT.
  • Cao and Wang (2021b) Shuyang Cao and Lu Wang. 2021b. Controllable Open-ended Question Generation with A New Question Type Ontology. In ACL/IJCNLP.
  • Celikyilmaz et al. (2020) Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of Text Generation: A Survey. arXiv preprint arXiv:2006.14799 (2020).
  • Chen and Cherry (2014) Boxing Chen and Colin Cherry. 2014. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT@ACL.
  • Chen et al. (2021) Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. 2021. VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning. arXiv preprint arXiv:2102.10407 (2021).
  • Chen and Yang (2020) Jiaao Chen and Diyi Yang. 2020. Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization. In EMNLP.
  • Chen and Yang (2021a) Jiaao Chen and Diyi Yang. 2021a. Simple Conversational Data Augmentation for Semi-supervised Abstractive Dialogue Summarization. In EMNLP.
  • Chen and Yang (2021b) Jiaao Chen and Diyi Yang. 2021b. Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs. In NAACL-HLT.
  • Chen et al. (2020d) Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. 2020d. KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation. In EMNLP.
  • Chen et al. (2020c) Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2020c. Distilling Knowledge Learned in BERT for Text Generation. In ACL.
  • Chen and Shuai (2021) Yi-Syuan Chen and Hong-Han Shuai. 2021. Meta-Transfer Learning for Low-Resource Abstractive Summarization. In AAAI.
  • Chen et al. (2020a) Zhiyu Chen, Wenhu Chen, Hanwen Zha, Xiyou Zhou, Yunkai Zhang, Sairam Sundaresan, and William Yang Wang. 2020a. Logic2Text: High-Fidelity Natural Language Generation from Logical Forms. In EMNLP Findings.
  • Chen et al. (2020b) Zhiyu Chen, Harini Eavani, Wenhu Chen, Yinyin Liu, and William Yang Wang. 2020b. Few-Shot NLG with Pre-Trained Language Model. In ACL.
  • Chi et al. (2020) Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-Lingual Natural Language Generation via Pre-Training. In AAAI.
  • Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In NeurIPS.
  • Dabre et al. (2020) Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A Survey of Multilingual Neural Machine Translation. CSUR (2020).
  • Dathathri et al. (2020) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In ICLR.
  • Dayanik and Padó (2020) Erenay Dayanik and Sebastian Padó. 2020. Masking Actor Information Leads to Fairer Political Claims Detection. In ACL.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
  • Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In NeurIPS.
  • Dong et al. (2020) Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, and Jingjing Liu. 2020. Multi-Fact Correction in Abstractive Text Summarization. In EMNLP.
  • Dou et al. (2021) Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2021. GSum: A General Framework for Guided Neural Abstractive Summarization. In NAACL-HLT.
  • Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. All NLP Tasks Are Generation Tasks: A General Pretraining Framework. arXiv preprint arXiv:2103.10360 (2021).
  • El-Kassas et al. (2021) Wafaa S. El-Kassas, Cherif R. Salama, Ahmed A. Rafea, and Hoda K. Mohamed. 2021. Automatic text summarization: A comprehensive survey. Expert Syst. Appl. (2021).
  • Elsayed et al. (2018) Gamaleldin F. Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. 2018. Large Margin Deep Networks for Classification. In NeurIPS.
  • Fabbri et al. (2021) Alexander R. Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq R. Joty, Dragomir R. Radev, and Yashar Mehdad. 2021. Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation. In NAACL-HLT.
  • Fan and Gardent (2020) Angela Fan and Claire Gardent. 2020. Multilingual AMR-to-Text Generation. In EMNLP.
  • Fan et al. (2020) Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing Transformer Depth on Demand with Structured Dropout. In ICLR.
  • Fan et al. (2019) Zhiyun Fan, Shiyu Zhou, and Bo Xu. 2019. Unsupervised pre-training for sequence to sequence speech recognition. arXiv preprint arXiv:1910.12418 (2019).
  • Fedus et al. (2021) William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv preprint arXiv:2101.03961 (2021).
  • Feng et al. (2021a) Xiachong Feng, Xiaocheng Feng, and Bing Qin. 2021a. A Survey on Dialogue Summarization: Recent Advances and New Frontiers. arXiv preprint arXiv:2107.03175 (2021).
  • Feng et al. (2021b) Xiachong Feng, Xiaocheng Feng, Libo Qin, Bing Qin, and Ting Liu. 2021b. Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization. In ACL/IJCNLP.
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In EMNLP Findings.
  • Ganesh et al. (2020) Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. arXiv preprint arXiv:2002.11985 (2020).
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In ACL.
  • Garbacea and Mei (2020) Cristina Garbacea and Qiaozhu Mei. 2020. Neural Language Generation: Formulation, Methods, and Evaluation. arXiv preprint arXiv:2007.15780 (2020).
  • Garcia et al. (2020) Xavier Garcia, Pierre Foret, Thibault Sellam, and Ankur P. Parikh. 2020. A Multilingual View of Unsupervised Machine Translation. In EMNLP Findings.
  • Gheini et al. (2021) Mozhdeh Gheini, Xiang Ren, and Jonathan May. 2021. On the Strengths of Cross-Attention in Pretrained Transformers for Machine Translation. arXiv preprint arXiv:2104.08771 (2021).
  • Gidiotis and Tsoumakas (2020) Alexios Gidiotis and Grigorios Tsoumakas. 2020. A Divide-and-Conquer Approach to the Summarization of Long Documents. TASLP (2020).
  • Goldberg et al. (1994) Eli Goldberg, Norbert Driedger, and Richard I. Kittredge. 1994. Using Natural-Language Processing to Produce Weather Forecasts. IEEE Expert (1994).
  • Golovanov et al. (2019) Sergey Golovanov, Rauf Kurbanov, Sergey I. Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf. 2019. Large-Scale Transfer Learning for Natural Language Generation. In ACL.
  • Gonen and Goldberg (2019) Hila Gonen and Yoav Goldberg. 2019. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In WNLP@ACL.
  • Gong et al. (2020) Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, and Ting Liu. 2020. TableGPT: Few-shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching. In COLING.
  • Goodwin et al. (2020) Travis R. Goodwin, Max E. Savery, and Dina Demner-Fushman. 2020. Towards Zero Shot Conditional Summarization with Adaptive Multi-task Fine-Tuning. In EMNLP Findings.
  • Gordon et al. (2020) Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. In RepL4NLP@ACL.
  • Gu et al. (2021a) Jing Gu, Qingyang Wu, Chongruo Wu, Weiyan Shi, and Zhou Yu. 2021a. PRAL: A Tailored Pre-Training Model for Task-Oriented Dialog Generation. In ACL/IJCNLP Short.
  • Gu et al. (2021b) Xiaodong Gu, Kang Min Yoo, and Jung-Woo Ha. 2021b. DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances. In AAAI.
  • Gu et al. (2021c) Xiaodong Gu, Kang Min Yoo, and Sang-Woo Lee. 2021c. Response Generation with Context-Aware Prompt Learning. arXiv preprint arXiv:2111.02643 (2021).
  • Guan et al. (2020) Jian Guan, Fei Huang, Minlie Huang, Zhihao Zhao, and Xiaoyan Zhu. 2020. A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation. TACL (2020).
  • Gunel et al. (2021) Beliz Gunel, Jingfei Du, Alexis Conneau, and Veselin Stoyanov. 2021. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning. In ICLR.
  • Guo et al. (2019) Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lin, and Yanzhi Wang. 2019. Reweighted Proximal Pruning for Large-Scale Language Representation. arXiv preprint arXiv:1909.12486 (2019).
  • Ham et al. (2020) DongHoon Ham, Jeong-Gwan Lee, Youngsoo Jang, and Kee-Eung Kim. 2020. End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2. In ACL.
  • Han et al. (2021) Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, and Jun Zhu. 2021. Pre-Trained Models: Past, Present and Future. arXiv preprint arXiv:2106.07139 (2021).
  • Hao et al. (2020) Boran Hao, Henghui Zhu, and Ioannis Ch. Paschalidis. 2020. Enhancing Clinical BERT Embedding using a Biomedical Knowledge Base. In COLING.
  • Harkous et al. (2020) Hamza Harkous, Isabel Groves, and Amir Saffari. 2020. Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity. In COLING.
  • Hasan and Farri (2019) Sadid A. Hasan and Oladimeji Farri. 2019. Clinical Natural Language Processing with Deep Learning. In Data Science for Healthcare.
  • He et al. (2021) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In ICLR.
  • Hosseini-Asl et al. (2020) Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A Simple Language Model for Task-Oriented Dialogue. In NeurIPS.
  • Hou et al. (2020) Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with Adaptive Width and Depth. In NeurIPS.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In ICML.
  • Hua et al. (2021) Xinyu Hua, Ashwin Sreevatsa, and Lu Wang. 2021. DYPLOC: Dynamic Planning of Content Using Mixed Language Models for Text Generation. In ACL/IJCNLP.
  • Huang et al. (2021a) Luyang Huang, Shuyang Cao, Nikolaus Nova Parulian, Heng Ji, and Lu Wang. 2021a. Efficient Attentions for Long Document Summarization. In NAACL-HLT.
  • Huang et al. (2020) Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in Building Intelligent Open-domain Dialog Systems. TOIS (2020).
  • Huang et al. (2021b) Xinting Huang, Jianzhong Qi, Yu Sun, and Rui Zhang. 2021b. Latent Reasoning for Low-Resource Question Generation. In ACL/IJCNLP Findings.
  • Iqbal and Qureshi (2020) Touseef Iqbal and Shaima Qureshi. 2020. The survey: Text generation models in deep learning. Journal of King Saud University-Computer and Information Sciences (2020).
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In EMNLP.
  • Jiang et al. (2020a) Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020a. How Can We Know What Language Models Know? TACL (2020).
  • Jiang et al. (2020b) Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. 2020b. ConvBERT: Improving BERT with Span-based Dynamic Convolution. In NeurIPS.
  • Jiao et al. (2020) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In EMNLP Findings.
  • Jin et al. (2020) Di Jin, Zhijing Jin, Joey Tianyi Zhou, Lisa Orii, and Peter Szolovits. 2020. Hooks in the Headline: Learning to Generate Headlines with Controlled Styles. In ACL.
  • Kalyan et al. (2021) Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, and Sivanesan Sangeetha. 2021. AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing. arXiv preprint arXiv:2108.05542 (2021).
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 (2020).
  • Ke et al. (2021) Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, and Minlie Huang. 2021. JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs. In ACL/IJCNLP Findings.
  • Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv preprint arXiv:1909.05858 (2019).
  • Khalifa et al. (2021) Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. 2021. A Distributional Approach to Controlled Text Generation. In ICLR.
  • Krishna et al. (2020) Kalpesh Krishna, John Wieting, and Mohit Iyyer. 2020. Reformulating Unsupervised Style Transfer as Paraphrase Generation. In EMNLP.
  • Kryscinski et al. (2018) Wojciech Kryscinski, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving Abstraction in Text Summarization. In EMNLP.
  • Lample et al. (2018) Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-Based & Neural Unsupervised Machine Translation. In EMNLP.
  • Le et al. (2021) Hang Le, Juan Miguel Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2021. Lightweight Adapter Tuning for Multilingual Speech Translation. In ACL/IJCNLP Short.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. 2015. Deep learning. Nat. (2015).
  • Lee et al. (2020) Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. In ICLR.
  • Lepikhin et al. (2021) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In ICLR.
  • Lewis et al. (2020a) Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida I. Wang, and Luke Zettlemoyer. 2020a. Pre-training via Paraphrasing. In NeurIPS.
  • Lewis et al. (2020b) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020b. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In NAACL-HLT.
  • Li et al. (2020a) Junyi Li, Siqing Li, Wayne Xin Zhao, Gaole He, Zhicheng Wei, Nicholas Jing Yuan, and Ji-Rong Wen. 2020a. Knowledge-Enhanced Personalized Review Generation with Capsule Graph Neural Network. In CIKM.
  • Li et al. (2020b) Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, and Yaohong Jin. 2020b. BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover’s Distance. In EMNLP.
  • Li et al. (2021b) Junyi Li, Tianyi Tang, Gaole He, Jinhao Jiang, Xiaoxuan Hu, Puzhao Xie, Zhipeng Chen, Zhuohao Yu, Wayne Xin Zhao, and Ji-Rong Wen. 2021b. TextBox: A Unified, Modularized, and Extensible Framework for Text Generation. In ACL Demonstration.
  • Li et al. (2021c) Junyi Li, Tianyi Tang, Wayne Xin Zhao, Zhicheng Wei, Nicholas Jing Yuan, and Ji-Rong Wen. 2021c. Few-shot Knowledge Graph-to-Text Generation with Pretrained Language Models. In ACL/IJCNLP Findings.
  • Li et al. (2021e) Junyi Li, Wayne Xin Zhao, Zhicheng Wei, Nicholas Jing Yuan, and Ji-Rong Wen. 2021e. Knowledge-based Review Generation by Coherence Enhanced Text Planning. In SIGIR.
  • Li et al. (2019) Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, and Yang Song. 2019. Generating Long and Informative Reviews with Aspect-Aware Coarse-to-Fine Decoding. In ACL.
  • Li et al. (2020c) Piji Li, Haisong Zhang, Xiaojiang Liu, and Shuming Shi. 2020c. Rigid Formats Controlled Text Generation. In ACL.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In ACL.
  • Li et al. (2021a) Zhongyang Li, Xiao Ding, Kuo Liao, Ting Liu, and Bing Qin. 2021a. CausalBERT: Injecting Causal Knowledge Into Pre-trained Models with Minimal Supervision. arXiv preprint arXiv:2107.09852 (2021).
  • Li et al. (2021d) Zekang Li, Jinchao Zhang, Zhengcong Fei, Yang Feng, and Jie Zhou. 2021d. Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances. In ACL/IJCNLP.
  • Liao et al. (2021) Junwei Liao, Yu Shi, Ming Gong, Linjun Shou, Sefik Emre Eskimez, Liyang Lu, Hong Qu, and Michael Zeng. 2021. Generating Human Readable Transcript for Automatic Speech Recognition with Pre-Trained Language Model. In
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.
  • Lin et al. (2020) Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020. Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information. In EMNLP.
  • Liu et al. (2021d) Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, and Nan Duan. 2021d. GLGE: A New General Language Generation Evaluation Benchmark. In ACL/IJCNLP Findings.
  • Liu et al. (2021g) Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, and Xiaojie Wang. 2021g. Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization. In EMNLP Findings.
  • Liu et al. (2021e) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021e. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv preprint arXiv:2107.13586 (2021).
  • Liu et al. (2021f) Shilei Liu, Xiaofeng Zhao, Bochao Li, Feiliang Ren, Longhui Zhang, and Shujuan Yin. 2021f. A Three-Stage Learning Framework for Low-Resource Knowledge-Grounded Dialogue Generation. In EMNLP.
  • Liu et al. (2021a) Yixin Liu, Zi-Yi Dou, and Pengfei Liu. 2021a. RefSum: Refactoring Neural Summarization. In NAACL-HLT.
  • Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. TACL (2020).
  • Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text Summarization with Pretrained Encoders. In EMNLP/IJCNLP.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu et al. (2021b) Ye Liu, Yao Wan, Lifang He, Hao Peng, and Philip S. Yu. 2021b. KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning. In AAAI.
  • Liu et al. (2021c) Zihan Liu, Genta Indra Winata, and Pascale Fung. 2021c. Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation. In ACL/IJCNLP Findings.
  • Louis (2020) Antoine Louis. 2020. NetBERT: A Pre-trained Language Representation Model for Computer Networking. Ph.D. Dissertation.
  • Luo et al. (2021) Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2021. VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation. In ACL/IJCNLP.
  • Luo et al. (2020) Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 2020. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv preprint arXiv:2002.06353 (2020).
  • Ma et al. (2020) Wentao Ma, Yiming Cui, Chenglei Si, Ting Liu, Shijin Wang, and Guoping Hu. 2020. CharBERT: Character-aware Pre-trained Language Model. In COLING.
  • Mager et al. (2020) Manuel Mager, Ramón Fernandez Astudillo, Tahira Naseem, Md. Arafat Sultan, Young-Suk Lee, Radu Florian, and Salim Roukos. 2020. GPT-too: A Language-Model-First Approach for AMR-to-Text Generation. In ACL.
  • Magooda and Litman (2021) Ahmed Magooda and Diane J. Litman. 2021. Mitigating Data Scarceness through Data Synthesis, Augmentation and Curriculum for Abstractive Summarization. In EMNLP Findings.
  • Majumder et al. (2021) Bodhisattwa Prasad Majumder, Sudha Rao, Michel Galley, and Julian J. McAuley. 2021. Ask what’s missing and what’s useful: Improving Clarification Question Generation using Global Knowledge. In NAACL-HLT.
  • Manakul and Gales (2021) Potsawee Manakul and Mark J. F. Gales. 2021. Long-Span Summarization via Local Attention and Content Selection. In ACL/IJCNLP.
  • Mao et al. (2019) Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian J. McAuley, and Garrison W. Cottrell. 2019. Improving Neural Story Generation by Targeted Common Sense Grounding. In EMNLP/IJCNLP.
  • Maurya et al. (2021) Kaushal Kumar Maurya, Maunendra Sankar Desarkar, Yoshinobu Kano, and Kumari Deepshikha. 2021. ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation. In ACL/IJCNLP Findings.
  • Nan et al. (2021) Feng Nan, Cícero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Kathleen R. McKeown, Ramesh Nallapati, Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, and Bing Xiang. 2021. Improving Factual Consistency of Abstractive Summarization via Question Answering. In ACL.
  • Nguyen et al. (2021) Thong Nguyen, Anh Tuan Luu, Truc Lu, and Tho Quan. 2021. Enriching and Controlling Global Semantics for Text Summarization. In EMNLP.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In NAACL-HLT Demonstrations.
  • Ouyang et al. (2021) Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. 2021. Dialogue Graph Modeling for Conversational Machine Reading. In ACL/IJCNLP Findings.
  • Oved and Levy (2021) Nadav Oved and Ran Levy. 2021. PASS: Perturb-and-Select Summarizer for Product Reviews. In ACL/IJCNLP.
  • Pan et al. (2021) Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021. Contrastive Learning for Many-to-many Multilingual Neural Machine Translation. In ACL/IJCNLP.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL.
  • Pascual et al. (2021) Damian Pascual, Beni Egressy, Clara Meister, Ryan Cotterell, and Roger Wattenhofer. 2021. A Plug-and-Play Method for Controlled Text Generation. In EMNLP Findings.
  • Pasunuru et al. (2021a) Ramakanth Pasunuru, Asli Celikyilmaz, Michel Galley, Chenyan Xiong, Yizhe Zhang, Mohit Bansal, and Jianfeng Gao. 2021a. Data Augmentation for Abstractive Query-Focused Multi-Document Summarization. In AAAI.
  • Pasunuru et al. (2021b) Ramakanth Pasunuru, Mengwen Liu, Mohit Bansal, Sujith Ravi, and Markus Dreyer. 2021b. Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters. In NAACL-HLT.
  • Peng et al. (2020) Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot Natural Language Generation for Task-Oriented Dialog. In EMNLP Findings.
  • Peters et al. (2019) Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. In EMNLP/IJCNLP.
  • Phang et al. (2018) Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. arXiv preprint arXiv:1811.01088 (2018).
  • Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multilingual BERT? In ACL.
  • Popovic (2017) Maja Popovic. 2017. chrF++: words helping character n-grams. In WMT.
  • Post (2018) Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In WMT.
  • Qi et al. (2020) Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. In EMNLP Findings.
  • Qiu et al. (2020) Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. arXiv preprint arXiv:2003.08271 (2020).
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog (2019).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR (2020).
  • Rashkin et al. (2020) Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-Conditioned Generation with Dynamic Plot State Tracking. In EMNLP.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In SIGKDD.
  • Ren et al. (2019) Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, and Shuai Ma. 2019. Explicit Cross-lingual Pre-training for Unsupervised Machine Translation. In EMNLP/IJCNLP.
  • Ribeiro et al. (2020) Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2020. Investigating Pretrained Language Models for Graph-to-Text Generation. arXiv preprint arXiv:2007.08426 (2020).
  • Ribeiro et al. (2021) Leonardo F. R. Ribeiro, Yue Zhang, and Iryna Gurevych. 2021. Structural Adapters in Pretrained Language Models for AMR-to-Text Generation. In EMNLP.
  • Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for Building an Open-Domain Chatbot. In EACL.
  • Ross (2012) R.S. Ross. 2012. Guide for conducting risk assessments. In NIST Special Publication.
  • Rothe et al. (2020) Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging Pre-trained Checkpoints for Sequence Generation Tasks. TACL (2020).
  • Saito et al. (2020) Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, and Junji Tomita. 2020. Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models. arXiv preprint arXiv:2003.13028 (2020).
  • Schmidt et al. (2018) Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. 2018. Adversarially Robust Generalization Requires More Data. In NeurIPS.
  • Scialom et al. (2020) Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. ColdGANs: Taming Language GANs with Cautious Sampling Strategies. In NeurIPS.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL.
  • Shalyminov et al. (2020) Igor Shalyminov, Alessandro Sordoni, Adam Atkinson, and Hannes Schulz. 2020. Hybrid Generative-Retrieval Transformers for Dialogue Domain Adaptation. arXiv preprint arXiv:2003.01680 (2020).
  • Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In EMNLP.
  • Shleifer and Rush (2020) Sam Shleifer and Alexander M. Rush. 2020. Pre-trained Summarization Distillation. arXiv preprint arXiv:2010.13002 (2020).
  • Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML.
  • Stickland et al. (2021) Asa Cooper Stickland, Xian Li, and Marjan Ghazvininejad. 2021. Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation. In EACL.
  • Stock et al. (2021) Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. 2021. Training with Quantization Noise for Extreme Model Compression. In ICLR.
  • Suadaa et al. (2021) Lya Hulliyyatus Suadaa, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura, and Hiroya Takamura. 2021. Towards Table-to-Text Generation with Numerical Reasoning. In ACL/IJCNLP.
  • Sun et al. (2019a) Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive Bidirectional Transformer for Temporal Representation Learning. arXiv preprint arXiv:1906.05743 (2019).
  • Sun et al. (2019b) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A Joint Model for Video and Language Representation Learning. In ICCV.
  • Sun et al. (2021) Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. arXiv preprint arXiv:2107.02137 (2021).
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.
  • Tan et al. (2021) Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P. Xing, and Zhiting Hu. 2021. Progressive Generation of Long Text with Pretrained Language Models. In NAACL-HLT.
  • Tao et al. (2006) Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai. 2006. Language Model Information Retrieval with Document Expansion. In NAACL.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.
  • Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart M. Shieber. 2020. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In NeurIPS.
  • Wada and Iwata (2018) Takashi Wada and Tomoharu Iwata. 2018. Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models. arXiv preprint arXiv:1809.02306 (2018).
  • Wang et al. (2019a) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In NeurIPS.
  • Wang et al. (2019b) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR.
  • Wang et al. (2020) Boxin Wang, Hengzhi Pei, Boyuan Pan, Qian Chen, Shuohang Wang, and Bo Li. 2020. T3: Tree-Autoencoder Constrained Adversarial Text Generation for Targeted Attack. In EMNLP.
  • Wang et al. (2021a) Danqing Wang, Jiaze Chen, Hao Zhou, Xipeng Qiu, and Lei Li. 2021a. Contrastive Aligned Joint Learning for Multilingual Summarization. In ACL/IJCNLP Findings.
  • Wang et al. (2021b) Wei Wang, Piji Li, and Hai-Tao Zheng. 2021b. Consistency and Coherency Enhanced Story Generation. In ECIR.
  • Wang et al. (2021c) Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, and Lei Li. 2021c. LightSeq: A High Performance Inference Library for Transformers. In NAACL-HLT Industry.
  • Wang and Bansal (2018) Yicheng Wang and Mohit Bansal. 2018. Robust Machine Comprehension Models via Adversarial Training. In NAACL-HLT Short.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP Demonstrations.
  • Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv preprint arXiv:1901.08149 (2019).
  • Wu et al. (2021) Wenhao Wu, Wei Li, Xinyan Xiao, Jiachen Liu, Ziqiang Cao, Sujian Li, Hua Wu, and Haifeng Wang. 2021. BASS: Boosting Abstractive Summarization with Unified Semantic Graph. In ACL/IJCNLP.
  • Xia et al. (2021) Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, and Ming Zhou. 2021. XGPT: Cross-modal Generative Pre-Training for Image Captioning. In NLPCC.
  • Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard H. Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised Data Augmentation for Consistency Training. In NeurIPS.
  • Xie et al. (2017) Stanley Xie, Ruchir Rastogi, and Max Chang. 2017. Deep poetry: Word-level and character-level language models for shakespearean sonnet generation. In Natural Lang. Process. Deep Learn.
  • Xing and Wan (2021) Xinyu Xing and Xiaojun Wan. 2021. Structure-Aware Pre-Training for Table-to-Text Generation. In ACL/IJCNLP Findings.
  • Xu et al. (2020) Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020. MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models. In EMNLP.
  • Xu et al. (2021) Xinnuo Xu, Guoyin Wang, Young-Bum Kim, and Sungjin Lee. 2021. AugNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation. In ACL.
  • Xue et al. (2021a) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021a. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In NAACL-HLT.
  • Xue et al. (2021b) Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L. Zhang, Tao Qin, Wei-Qiang Zhang, and Tie-Yan Liu. 2021b. DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling. In ACL/IJCNLP.
  • Yan et al. (2021) Yu Yan, Fei Hu, Jiusheng Chen, Nikhil Bhendawade, Ting Ye, Yeyun Gong, Nan Duan, Desheng Cui, Bingyu Chi, and Ruifei Zhang. 2021. FastSeq: Make Sequence Generation Faster. arXiv preprint arXiv:2106.04718 (2021).
  • Yang et al. (2020b) Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. 2020b. Towards Making the Most of BERT in Neural Machine Translation. In AAAI.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.
  • Yang et al. (2020a) Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020a. CSP: Code-Switching Pre-training for Neural Machine Translation. In EMNLP.
  • Yang et al. (2021) Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florêncio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. 2021. TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption. In CVPR.
  • Yang et al. (2020c) Ze Yang, Wei Wu, Can Xu, Xinnian Liang, Jiaqi Bai, Liran Wang, Wei Wang, and Zhoujun Li. 2020c. StyleDGPT: Stylized Response Generation with Pre-trained Language Models. In EMNLP Findings.
  • Yang et al. (2020d) Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, and Eric Darve. 2020d. TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising. In EMNLP Findings.
  • You et al. (2020) Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-Coded Gaussian Attention for Neural Machine Translation. In ACL. 7689–7700.
  • Zadeh et al. (2020) Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, and Andreas Moshovos. 2020. GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference. In MICRO.
  • Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In NeurIPS.
  • Zaib et al. (2020) Munazza Zaib, Quan Z. Sheng, and Wei Emma Zhang. 2020. A Short Survey of Pre-trained Language Models for Conversational AI - A New Age in NLP. In ACSW.
  • Zeng et al. (2021) Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, and Yonghong Tian. 2021. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation. arXiv preprint arXiv:2104.12369 (2021).
  • Zeng and Nie (2020) Yan Zeng and Jian-Yun Nie. 2020. Generalized Conditioned Dialogue Generation Based on Pre-trained Language Model. arXiv preprint arXiv:2010.11140 (2020).
  • Zeng and Nie (2021) Yan Zeng and Jian-Yun Nie. 2021. A Simple and Efficient Multi-Task Learning Approach for Conditioned Dialogue Generation. In NAACL-HLT.
  • Zhai and Lafferty (2001) ChengXiang Zhai and John D. Lafferty. 2001. Model-based Feedback in the Language Modeling Approach to Information Retrieval. In CIKM. 403–410.
  • Zhang et al. (2020e) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020e. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In ICML.
  • Zhang et al. (2021c) Longxiang Zhang, Renato Negrinho, Arindam Ghosh, Vasudevan Jagannathan, Hamid Reza Hassanzadeh, Thomas Schaaf, and Matthew R. Gormley. 2021c. Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations. In EMNLP Findings.
  • Zhang et al. (2020b) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating Text Generation with BERT. In ICLR.
  • Zhang et al. (2021b) Xueying Zhang, Yunjiang Jiang, Yue Shang, Zhaomeng Cheng, Chi Zhang, Xiaochuan Fan, Yun Xiao, and Bo Long. 2021b. DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization. In SIGIR.
  • Zhang et al. (2019b) Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In ACL.
  • Zhang et al. (2020c) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020c. DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation. In ACL Demonstrations.
  • Zhang et al. (2021a) Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen, Chaojun Xiao, Zhenbo Sun, Yuan Yao, Fanchao Qi, Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng, Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao Han, Yang Liu, Xiaoyan Zhu, and Maosong Sun. 2021a. CPM-2: Large-scale Cost-effective Pre-trained Language Models. arXiv preprint arXiv:2106.10715 (2021).
  • Zhang et al. (2019a) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019a. ERNIE: Enhanced Language Representation with Informative Entities. In ACL.
  • Zhang et al. (2020a) Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, YuSheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Sun. 2020a. CPM: A Large-scale Generative Chinese Pre-trained Language Model. arXiv preprint arXiv:2012.00413 (2020).
  • Zhang and Sabuncu (2018) Zhilu Zhang and Mert R. Sabuncu. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In NeurIPS.
  • Zhang et al. (2020d) Zheng Zhang, Ryuichi Takanobu, Qi Zhu, MinLie Huang, and XiaoYan Zhu. 2020d. Recent advances and challenges in task-oriented dialog systems. Sci. China Technol. Sci. (2020).
  • Zhao et al. (2018) Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning Gender-Neutral Word Embeddings. In EMNLP.
  • Zheng and Lapata (2019) Hao Zheng and Mirella Lapata. 2019. Sentence Centrality Revisited for Unsupervised Summarization. In ACL.
  • Zhong et al. (2021) Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization. arXiv preprint arXiv:2109.02492 (2021).
  • Zhou et al. (2021b) Kun Zhou, Wayne Xin Zhao, Sirui Wang, Fuzheng Zhang, Wei Wu, and Ji-Rong Wen. 2021b. Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models. In EMNLP.
  • Zhou et al. (2020a) Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020a. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. Comput. Linguistics (2020).
  • Zhou et al. (2020b) Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020b. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI.
  • Zhou et al. (2021a) Wangchunshu Zhou, Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, and Xiang Ren. 2021a. Pre-training Text-to-Text Transformers for Concept-centric Common Sense. In ICLR.
  • Zhu et al. (2020) Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into Neural Machine Translation. In ICLR.
  • Zou et al. (2021) Yicheng Zou, Bolin Zhu, Xingwu Hu, Tao Gui, and Qi Zhang. 2021. Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining. In EMNLP.