
MVP: Multi-task Supervised Pre-training for Natural Language Generation

06/24/2022
by   Tianyi Tang, et al.

Pre-trained language models (PLMs) have achieved notable success in natural language generation (NLG) tasks. To date, most PLMs are pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, an increasing number of models pre-trained with far less labeled data have shown superior performance compared to unsupervised models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. To pre-train the text generation model MVP, we collect a labeled pre-training corpus from 45 datasets over seven generation tasks. For each task, we further pre-train task-specific soft prompts to stimulate the model's capacity to perform that task. Extensive experiments demonstrate the effectiveness of our supervised pre-training on a number of NLG tasks, and our general methods achieve state-of-the-art performance on 12 of 17 datasets.


1 Introduction

Natural language generation (NLG, also known as text generation) is a crucial capacity for language intelligence; it aims to generate text that is credible and readable to humans (Garbacea and Mei, 2020). Since the emergence of the pre-training and fine-tuning paradigm, pre-trained language models (PLMs) have dominated the mainstream approaches to NLG, and extensive evidence shows that large models pre-trained on massive text corpora can produce highly fluent text.

To date, the majority of PLMs are pre-trained in an unsupervised (self-supervised) manner on large-scale general corpora. The basic idea is to leverage intrinsic data correlations as supervision signals for the pre-training objectives. For example, T5 utilizes the C4 corpus of approximately 750 GB as its pre-training corpus and employs a denoising objective that requires the model to recover corrupted text spans successively (Raffel et al., 2020). Pre-trained on unlabeled text data, models can capture certain types of semantic knowledge (e.g., knowledge facts) and generalize to new tasks to some extent (Jiang et al., 2021). However, unsupervised pre-training may incorporate irrelevant or noisy information that harms the performance of downstream tasks (Feng et al., 2022). Moreover, unsupervised pre-training causes models to acquire knowledge at a slower rate as model size increases (Zhang et al., 2021).

Meanwhile, more and more large-scale labeled datasets have become accessible (Deng et al., 2009; Lin et al., 2020b). There is growing evidence that pre-training with labeled data can further improve the performance of PLMs, both in computer vision (He et al., 2016; Dosovitskiy et al., 2021) and in natural language processing (McCann et al., 2017; Lin et al., 2020b). These promising developments motivate us to consider supervised pre-training of language generation models with labeled data. The advantages of supervised pre-training for natural language generation are at least twofold. First, it enables the explicit learning of task-specific characteristics or semantics, which is typically infeasible when learning from the general text-to-text relationship as in unsupervised pre-training. Second, supervised pre-training can alleviate the discrepancy between unsupervised pre-training and supervised fine-tuning (Lin et al., 2020b). It has been demonstrated that supervised pre-trained models can achieve competitive or superior performance compared to their unsupervised counterparts (CONNEAU and Lample, 2019; Liu et al., 2020), even with significantly less labeled data.

Inspired by the recent success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation by leveraging a variety of supervised text generation datasets. Specifically, we collect a large-scale labeled pre-training corpus consisting of millions of examples from 45 datasets over seven generation tasks. Since recent research shows that large-scale multi-task pre-training (Aghajanyan et al., 2021; Aribandi et al., 2022) is the key to generalizing to new tasks for large PLMs, we combine these labeled datasets for multi-task pre-training.

To develop the text generation model, we adopt a Transformer-based (Vaswani et al., 2017) sequence-to-sequence model as the pre-training backbone. However, different tasks may “neutralize” the ability learned through other tasks (He and Choi, 2021). To mitigate this potential issue, we learn task-specific soft prompts based on the MVP model. Following the structure of Prefix-tuning (Li and Liang, 2021), our prompts are inserted in a layer-wise manner. Task-specific pre-training enables the prompts to “store” specialized knowledge about the corresponding task and to stimulate the MVP model's capacity to perform that task.

In this paper, we mainly investigate the following research questions:


  • How does supervised pre-training perform for NLG tasks? Our supervised pre-trained MVP can effectively learn task-specific knowledge during pre-training, compared to the unsupervised pre-trained BART. In full tuning experiments, the proposed MVP model with task-specific prompts achieves state-of-the-art performance on 12 out of 17 datasets. In parameter-efficient tuning experiments, with only the prompts tuned, our frozen MVP model is superior to the frozen BART, which further verifies the importance of supervised pre-training.

  • Can supervised pre-trained models generalize to unseen tasks? To examine the generalizability of our model, we conduct experiments on unseen language generation and understanding tasks. The experimental results demonstrate that our supervised MVP model has a strong generalization ability for unseen tasks. In the meantime, integrating MVP with existing task-specific methods yields superior performance compared to BART-based counterparts, indicating that MVP can also be utilized to enhance existing methods (e.g., parameter initialization).

To support reproduction and reuse, we have released our models (e.g., MVP, task-specific prompts, and multi-task variants), intermediate results (e.g., the generated texts), and code for pre-training and fine-tuning at https://github.com/RUCAIBox/MVP.

2 Related Work

Pre-trained Models.

Pre-trained models have achieved exceptional success in a wide range of tasks, and the majority of them are pre-trained in an unsupervised manner (Radford et al., 2018; Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020). For example, with large-scale unsupervised plain text as the pre-training corpus, the GPT series (Radford et al., 2018, 2019; Brown et al., 2020) employs language modeling as the pre-training task, i.e., predicting the next token conditioned on previous tokens; BART (Lewis et al., 2020) learns to recover the original text from corrupted text that has been altered by arbitrary noise transformations. GPT-3 and BART utilize roughly 570 GB and 160 GB of plain text as pre-training corpora, respectively. Meanwhile, the computer vision community benefits substantially from the labeled dataset ImageNet (Deng et al., 2009). Influential models, such as ResNet (He et al., 2016), EfficientNet (Tan and Le, 2019), and ViT (Dosovitskiy et al., 2021), leverage ImageNet for pre-training. Inspired by the success of leveraging labeled data for pre-training, machine translation researchers have explored supervised pre-training (McCann et al., 2017; Lin et al., 2020b; Yang et al., 2020; Pan et al., 2021). Lin et al. (2020b) pre-train a translation model, mRASP, with parallel data in multiple languages. Despite using much less pre-training data, mRASP still achieves better performance than translation models pre-trained in an unsupervised manner (CONNEAU and Lample, 2019; Liu et al., 2020). In this paper, we propose to pre-train a universal NLG model on a large-scale collection of labeled datasets.

Multi-task Learning.

Our supervised pre-training process can also be viewed as a form of multi-task learning (MTL), a method that combines multiple tasks into a single training process (Collobert and Weston, 2008; Worsham and Kalita, 2020). A model trained with MTL can benefit from helpful knowledge of relevant tasks, resulting in improved performance (McCann et al., 2018; Subramanian et al., 2018). Recently, MT-DNN (Liu et al., 2019a) and Muppet (Aghajanyan et al., 2021) collect tens of datasets for the multi-task procedure and achieve better performance on downstream tasks. The pre-finetuning proposed in Muppet (Aghajanyan et al., 2021) shares a similar idea with our multi-task supervised pre-training. Aribandi et al. (2022) further combine the denoising pre-training task of T5 with multi-task learning to pre-train a new model, ExT5. MTL has also contributed to sub-fields of text generation, such as open-ended dialogue systems (Zhang et al., 2020; Bao et al., 2020), task-oriented dialogue systems (Su et al., 2022), text style transfer (Bujnowski et al., 2020), and question answering (Khashabi et al., 2020). At the same time, researchers have explored the transferability of models trained on multi-task datasets (Mishra et al., 2022). FLAN (Wei et al., 2022), T0 (Sanh et al., 2022), and ZeroPrompt (Xu et al., 2022) investigate the zero-shot generalization abilities of large PLMs trained on numerous datasets with well-designed prompts. Ye et al. (2021) develop the benchmark CrossFit to study the few-shot learning ability of models.

Prompt Learning.

Prompt learning is a thriving method in the field of natural language processing. It converts fine-tuning inputs into a format similar to that of pre-training in order to leverage implicit pre-training knowledge and alleviate the discrepancy between pre-training and fine-tuning (Liu et al., 2021c). GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020) add human-written task prompts (instructions) to the input text. For instance, T5 prepends “Summarize:” to the input document for summarization tasks. GPT-3 (Brown et al., 2020) further adds several demonstrations to the input to learn task patterns, which is called in-context learning. Some researchers also design elaborate prompts or demonstrations for each task and dataset and investigate their effectiveness and robustness (Wei et al., 2022; Sanh et al., 2022; Xu et al., 2022; Mishra et al., 2022). Even so, whether models can truly understand the semantic meaning of prompts remains worthy of further investigation (Webson and Pavlick, 2021). To overcome the constraints of manually constructed prompts, researchers have developed continuous (soft) prompts that can be optimized in continuous space (Lester et al., 2021; Liu et al., 2021d; Qin and Eisner, 2021; Tang et al., 2022). Prefix-tuning (Li and Liang, 2021) and P-tuning v2 (Liu et al., 2022) increase the number of prompt parameters and employ prompting at each Transformer layer. Considering the random initialization of soft prompts, Gu et al. (2022) propose PPT to pre-train continuous prompts using unlabeled data. SPoT (Vu et al., 2022), PTG (Li et al., 2022a), and UnifiedSKG (Xie et al., 2022) learn prompts on related tasks and transfer them to new tasks.

3 The MVP Model

This section introduces our MVP model: a Multi-task superVised Pre-trained model for natural language generation. We first collect labeled datasets from diverse NLG tasks as our pre-training corpus and unify their inputs and outputs in a text-to-text format. Then, we pre-train our MVP model on this corpus, i.e., a mixture of labeled data from various tasks. We further learn task-specific prompts to stimulate the MVP model to perform a certain task.

Figure 1: The overview of the pre-training process of our MVP model and task-specific prompts. In the first stage, we utilize labeled datasets from seven tasks to jointly pre-train the model. In the second stage, we freeze the MVP and pre-train specific prompts for each task using intra-task datasets.

3.1 Data Collection

The natural language generation (NLG) task aims to generate a sequence of tokens conditioned on input data (e.g., one or more pieces of text or structured data) (Li et al., 2022b). Typically, NLG tasks are categorized according to the format of their input and output data. For example, text summarization condenses a long document into a brief text containing essential information; data-to-text generation produces descriptive text about structured input; and a dialogue system creates pertinent responses given multiple dialogue turns.

In this paper, we collect 45 labeled datasets from seven representative NLG tasks (we do not consider machine translation tasks and focus on English-only tasks in this work), including data-to-text generation, open-ended dialogue system, question answering, question generation, task-oriented dialogue system, text summarization, and story generation. These datasets come from various domains and are of different sizes. Some datasets are elaborately hand-crafted and thus relatively small, while others are created for large-scale weak supervision. Despite originating from various tasks, these diverse labeled datasets contain rich task-specific supervision signals for establishing global sequence-to-sequence mapping relations. Detailed descriptions and statistics of the pre-training tasks and datasets can be found in Table 6 in Appendix B.1.

To adapt these datasets for multi-task pre-training, we transform all tasks into a unified text-to-text format, i.e., converting different input data into plain text. For instance, for data-to-text generation we linearize structured data (e.g., a knowledge graph or a table) by concatenating triples or key-value pairs with the special token “[SEP]”, and for question generation we utilize the special token “[X_SEP]” to separate the answer and the paragraph. The transformed input format of each task can be found in Appendix E. To enrich our datasets, we reverse the input and output of dual tasks to obtain new datasets (e.g., story generation and summarization, question generation and question answering). We also eliminate pre-training examples that overlap with the evaluation data to avoid data leakage (more details in Appendix B.2). Finally, we obtain a supervised pre-training corpus containing millions of input-output pairs.
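To make the unified format concrete, the following minimal Python sketch illustrates the kind of linearization described above. The helper names are hypothetical, and the exact field order and separators for each task are those listed in Appendix E:

    # Illustrative linearization into the unified text-to-text format (hypothetical helpers).
    def linearize_kg(triples):
        """Flatten (head, relation, tail) triples into one string for data-to-text generation."""
        return " [SEP] ".join(f"{h} | {r} | {t}" for h, r, t in triples)

    def build_qg_input(answer, paragraph):
        """Join an answer and its supporting paragraph for question generation."""
        return f"{answer} [X_SEP] {paragraph}"

    print(linearize_kg([("Alan Turing", "field", "computer science"),
                        ("Alan Turing", "birthYear", "1912")]))
    print(build_qg_input("1912", "Alan Turing was born in 1912 in London."))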

In addition, we do not include the datasets of paraphrase generation, text style transfer, and natural language understanding (NLU) during the pre-training phase. We leave them to evaluate the generalization ability of our methods. We reserve some common datasets (e.g., CNN/DailyMail (See et al., 2017) and XSum (Narayan et al., 2018)) for downstream fine-tuning and do not use them during pre-training. The details of the datasets used for fine-tuning evaluation are provided in Table 7 in Appendix B.1.

3.2 Model Architecture

We then pre-train our MVP model in a text-to-text format based on the supervised pre-training corpus. Our model is built upon a Transformer encoder-decoder architecture (Vaswani et al., 2017). In the first stage, the model is trained on a mixture of NLG datasets with human-written instructions (to avoid ambiguity with continuous prompts, we designate human-written prompts as instructions). For example, we use “Summarize:” as the prefix instruction for summarization tasks. This process is similar to instruction tuning in FLAN (Wei et al., 2022). The difference is that we keep only one instruction for each task. As shown by Sanh et al. (2022), a single instruction can typically lead to positive performance, whereas adding multiple instructions does not always improve performance and requires considerable human effort. The detailed instructions for each task can be found in Appendix E.
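As a usage illustration of the instruction-prefixed input format, the sketch below loads a released MVP checkpoint with Hugging Face Transformers and generates a summary. The model identifier "RUCAIBox/mvp" is an assumption based on the project repository; the exact checkpoint names should be checked at https://github.com/RUCAIBox/MVP:

    # Hedged usage sketch: load an MVP checkpoint and summarize with an instruction prefix.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("RUCAIBox/mvp")   # assumed model ID
    model = AutoModelForSeq2SeqLM.from_pretrained("RUCAIBox/mvp")

    document = "The city council approved the new transit plan on Tuesday after months of debate ..."
    # MVP keeps one human-written instruction per task, e.g. "Summarize:" for summarization.
    inputs = tokenizer("Summarize: " + document, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_length=64, num_beams=5)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))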

However, it is not easy to conduct effective multi-task training, since these tasks may compete with one another and thus “blur out” the features extracted for individual tasks (He and Choi, 2021). To address this issue, in the second stage we freeze the model and train a set of task-specific soft prompts (i.e., continuous vectors) using a mixture of the corresponding intra-task datasets, i.e., datasets under the same task (for instance, we train summarization-specific prompts using summarization datasets such as Newsroom (Grusky et al., 2018), WikiHow (Koupaee and Wang, 2018), and MSNews (Liu et al., 2021a)). These soft prompts, which are not shared between tasks, encode task-specific semantic knowledge to alleviate the blurring-out problem.

Specifically, we employ the standard Transformer encoder-decoder (Vaswani et al., 2017) as our backbone, a universal architecture suited for both NLU and NLG tasks. Compared to decoder-only architectures such as GPT-3 (Brown et al., 2020) and prefix LMs such as UniLM (Dong et al., 2019), the encoder-decoder architecture is more effective for text generation tasks (Raffel et al., 2020). As for task-specific prompts, we insert continuous vectors at each Transformer layer, following Prefix-tuning (Li and Liang, 2021). Compared to prompt tuning (Lester et al., 2021), which only adds trainable embeddings to the input word embeddings, layer-wise prompting is more effective and stable (Liu et al., 2022), especially for NLG tasks (Li and Liang, 2021).

3.3 Training Details

Our MVP backbone conforms to the model size of BARTlarge (Lewis et al., 2020), i.e., a Transformer with 12 layers in both the encoder and the decoder, a hidden size of 1,024, and a feed-forward inner hidden size of 4,096. We employ the same byte-level byte-pair-encoding tokenizer as BART, with a vocabulary of roughly 50K tokens. The whole backbone consists of approximately 400M parameters and is initialized with BARTlarge. We pre-train the model with a large batch size and leverage our collected datasets for multi-task learning using a temperature-scaled mixing strategy (Raffel et al., 2020) to mitigate the disparity between tasks and datasets.
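For intuition, the temperature-scaled mixing strategy of Raffel et al. (2020) samples each dataset in proportion to its (capped) size raised to the power 1/T. The sketch below shows the computation; the cap and temperature values are placeholders rather than the settings used for MVP:

    # Temperature-scaled mixing (Raffel et al., 2020): p_i is proportional to min(n_i, K)^(1/T).
    def mixing_probabilities(dataset_sizes, temperature=2.0, cap=2**21):
        rates = [min(n, cap) ** (1.0 / temperature) for n in dataset_sizes]
        total = sum(rates)
        return [r / total for r in rates]

    # Example: three datasets of very different sizes; larger datasets are down-weighted.
    print(mixing_probabilities([300_000, 4_000_000, 50_000]))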

The task-specific prompts follow the same schema as Prefix-tuning (Li and Liang, 2021), which prepends trainable continuous vectors to the keys and values of the multi-head attention module at each layer. We use a fixed prompt length and an MLP reparameterization function with a medium hidden size to improve training robustness and performance (Li and Liang, 2021); hence, each group of task prompts adds a comparatively small number of parameters on top of the frozen backbone. We then freeze the backbone and train seven groups of task-specific prompts, each corresponding to a different task, again using the temperature-scaled mixing strategy over the intra-task datasets.
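The following rough PyTorch sketch shows layer-wise prompting in the style of Prefix-tuning: a small prompt embedding is reparameterized by an MLP into per-layer key/value prefixes that are prepended inside every attention module while the backbone stays frozen. Shapes and hyper-parameters here are illustrative, not the values used for MVP:

    import torch
    import torch.nn as nn

    class PrefixEncoder(nn.Module):
        """Produce per-layer key/value prefixes from a small prompt embedding (illustrative)."""
        def __init__(self, prompt_len, n_layers, n_heads, head_dim, mlp_hidden=512):
            super().__init__()
            embed_dim = n_heads * head_dim
            self.prompt_len, self.n_layers = prompt_len, n_layers
            self.n_heads, self.head_dim = n_heads, head_dim
            self.embedding = nn.Embedding(prompt_len, embed_dim)
            # MLP reparameterization: prompt embedding -> keys and values for every layer.
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, mlp_hidden), nn.Tanh(),
                nn.Linear(mlp_hidden, n_layers * 2 * embed_dim),
            )

        def forward(self, batch_size):
            out = self.mlp(self.embedding(torch.arange(self.prompt_len)))
            out = out.view(self.prompt_len, self.n_layers, 2, self.n_heads, self.head_dim)
            # -> (n_layers, 2, batch, n_heads, prompt_len, head_dim), prepended to attention K/V.
            out = out.permute(1, 2, 3, 0, 4).unsqueeze(2).expand(-1, -1, batch_size, -1, -1, -1)
            return out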

In both stages, the maximum lengths of the input and output sequences are set to the same value so that examples can contain more tokens. We optimize the model with a constant learning rate using the standard sequence-to-sequence cross-entropy loss, and we apply the AdamW optimizer (Loshchilov and Hutter, 2019) with weight decay to improve training stability (Liu et al., 2019b). For testing, we select the checkpoint with the highest validation performance. All experiments run on NVIDIA Tesla V100 GPUs on Ubuntu 18.04.5 LTS. We implement the code using the Hugging Face Transformers library (Wolf et al., 2020) and the text generation toolkit TextBox (Li et al., 2021).

In summary, we pre-train an approximately 400M-parameter backbone and seven groups of task-specific prompts. For downstream tasks, one can either directly utilize the MVP backbone or integrate it with one group of task-specific prompts.

4 Experiment Results

In this section, we mainly investigate the first research question we proposed: How does supervised pre-training perform for NLG tasks? Specifically, we apply our MVP model to new datasets from pre-trained generation tasks under full tuning and parameter-efficient tuning settings.

Under the full tuning setting, we fine-tune the entire model (including the MVP backbone and prompts), while for parameter-efficient tuning, we only fine-tune the prompts and freeze the parameter weights of MVP. We apply the AdamW optimizer with default hyper-parameters and a fixed batch size and learning rate in both settings. We optimize the model with the seq2seq loss with label smoothing (Szegedy et al., 2016). We utilize the checkpoint with the best validation performance for test set inference. During inference, we use beam search with a no-repeat n-gram constraint. For evaluation, we leverage the automatic generation metrics BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) to measure the quality of the generated text, and we employ Distinct (Li et al., 2016) to evaluate its diversity. Details regarding fine-tuning and evaluation can be found in Appendix C.
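A hedged sketch of the parameter-efficient setting is shown below: all backbone weights are frozen and only the prompt parameters remain trainable, and decoding uses beam search with a no-repeat n-gram constraint. The parameter-name filter and the decoding values are placeholders; Appendix C and the released code give the actual configuration:

    # Freeze the backbone; keep only prompt parameters trainable (name filter is an assumption).
    def freeze_backbone(model):
        trainable = 0
        for name, param in model.named_parameters():
            param.requires_grad = "prompt" in name
            trainable += param.numel() if param.requires_grad else 0
        return trainable  # number of trainable (prompt) parameters

    # Decoding with beam search and a no-repeat n-gram constraint (placeholder values).
    generation_kwargs = dict(num_beams=5, no_repeat_ngram_size=3, max_length=128)
    # outputs = model.generate(**inputs, **generation_kwargs)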

In all of our experiments, we report the mean and standard deviation of the test set results over three random seeds. We also reproduce the results of baselines (e.g., BART) to compare them with our models under the same configuration.

         Summarization (CNN/DM)              | Data-to-text (WebNLG)                | QG (SQuAD)                           | QA (CoQA)
         R-1        R-2        R-L           | B-4        ME         R-L            | B-4        ME         R-L            | F1         EM
SOTA     47.16 [a]  22.55      43.87         | 66.14 [b]  47.25      76.10          | 25.97 [c]  27.33      53.43          | 73.0 [d]   84.5 [e]
BART     44.47±0.10 21.50±0.14 41.35±0.08    | 67.33±0.06 47.78±0.07 76.83±0.04     | 25.08±0.13 26.73±0.18 52.55±0.07     | 74.00±0.17 84.07±0.21
Single   44.35±0.16 21.51±0.11 41.21±0.19    | 67.40±0.24 47.80±0.03 76.72±0.20     | 25.77±0.15 27.14±0.09 53.02±0.11     | 74.90±0.20 84.57±0.06
MVP      44.45±0.05 21.44±0.12 41.34±0.08    | 67.32±0.10 47.94±0.13 76.70±0.26     | 25.91±0.07 27.22±0.10 53.08±0.16     | 75.50±0.20 85.07±0.21
MVP+R    44.46±0.07 21.59±0.11 41.31±0.06    | 67.41±0.12 47.92±0.03 76.82±0.14     | 25.31±0.13 26.69±0.04 52.68±0.03     | 75.17±0.21 84.83±0.12
MVP+S    44.30±0.07 21.45±0.03 41.15±0.08    | 67.63±0.13 47.90±0.05 76.87±0.16     | 25.62±0.02 26.98±0.10 52.98±0.09     | 75.70±0.00 85.40±0.10
MVP+M    44.33±0.03 21.42±0.07 41.19±0.03    | 67.15±0.14 47.72±0.21 76.55±0.04     | 25.29±0.10 26.69±0.12 52.81±0.12     | 75.00±0.10 84.73±0.06

         Story Generation (ROCStories)                  | Open-ended Dialogue (PersonaChat)              | TODS (MultiWOZ)
         B-1        B-2        D-1       D-4            | B-1        B-2        D-1       D-2            | B-4        Success    Inform
SOTA     33.4 [f]   15.4       –         69.3           | 48.2 [g]   39.9       1.5       9.4            | 20.50 [h]  85.30      94.40
BART     33.79±0.13 15.78±0.21 3.43±0.17 78.76±2.15     | 49.58±1.12 39.24±0.90 1.44±0.09 8.89±0.57      | 20.17±0.63 75.40±1.22 84.40±1.15
Single   33.42±0.22 15.47±0.17 3.61±0.04 81.06±0.63     | 49.57±0.03 40.32±0.11 1.31±0.07 7.90±0.67      | 20.44±0.63 74.13±1.33 82.23±0.92
MVP      33.96±0.08 15.96±0.05 3.17±0.15 76.11±1.38     | 49.56±0.44 40.41±0.10 1.55±0.07 10.20±0.46     | 20.34±0.37 75.47±0.40 84.07±0.15
MVP+R    33.19±0.20 15.47±0.08 3.14±0.12 75.14±0.96     | 47.68±0.16 39.90±0.10 1.68±0.13 10.80±1.04     | 19.23±0.15 75.00±0.70 83.40±0.87
MVP+S    33.53±0.03 15.48±0.10 3.56±0.10 80.29±0.60     | 47.70±0.34 39.91±0.21 1.77±0.04 11.72±0.23     | 19.79±0.47 77.23±0.21 85.13±1.06
MVP+M    33.42±0.02 15.52±0.06 3.13±0.36 76.10±3.56     | 47.94±0.07 40.00±0.07 1.50±0.17 9.44±1.54      | 19.54±0.32 75.00±2.36 83.23±2.01

Table 1: The main results on seven seen tasks under full tuning settings. The best and second-best results among all the methods are marked in bold and underlined, respectively. QG, QA, and TODS are short for question generation, question answering, and task-oriented dialogue system, respectively. B, R, D, and ME denote BLEU, ROUGE, Distinct, and METEOR. “–” means the SOTA paper does not compute the corresponding result. These setups and abbreviations are the same below. SOTA references: [a] Ravaut et al. (2022); [b] Ke et al. (2021); [c] Bao et al. (2021); [d] Liu et al. (2021a); [e] Xiao et al. (2020); [f] Guan et al. (2021); [g] Chen et al. (2022); [h] He et al. (2021).

4.1 Full Tuning Performance

For full tuning, we select a number of widely-used datasets for each seen task. In this setting, we consider three competitive baselines (including MVP) and three model variants with different prompts. Table 1 shows the performance of one dataset for each task. Tables 8 and 9 in Appendix D list the results of other datasets and the GEM benchmark (Gehrmann et al., 2021).

For the baselines, we compare three backbones based on different pre-training strategies:


  • BARTlarge (Lewis et al., 2020): BART is a widely-used PLM for natural language generation. We use it to initialize the MVP model during pre-training.

  • Single-task pre-training (Single): We individually train a single model for each task using intra-task datasets following the same pre-training settings in multi-task training. For instance, we pre-train a summarization model using summarization datasets (e.g., Newsroom, WikiHow, and MSNews). Therefore, we have seven single-task pre-trained models in total.

  • Multi-task pre-training (MVP): This is our MVP model, which is trained on a mixture of labeled datasets from seven tasks.

For the prompt-based variants, we integrate our MVP model with three different prompts:


  • Randomly initialized prompts (MVP+R): The prompts are randomly initialized without pre-training.

  • Single-task pre-trained prompts (MVP+S): This is the primary method in our paper, as introduced in Section 3.2. We pre-train specific prompts for each task.

  • Multi-Task pre-trained prompts (MVP+M): We only pre-train one group of prompts, using the same mixed datasets as the backbone pre-training.

Besides these baselines and variants, we further collect the state-of-the-art (SOTA) results from their original papers for comparison. From the results in Table 1, we can see that:

First, supervised pre-training (i.e., Single and MVP) achieves better performance than the unsupervised pre-trained model BART. For example, in question answering, Single and MVP achieve gains of 0.90 and 1.50 points in F1 score over BART, respectively. This observation demonstrates the effectiveness of our supervised pre-training. With labeled datasets, supervised pre-training enables the model to acquire more task-specific information during pre-training, leading to improved results on downstream tasks. Regarding single-task (Single) versus multi-task pre-training (MVP), our MVP model outperforms its single-task counterparts on the majority of metrics (combining all the metrics across tasks). This result indicates that the proposed multi-task learning approach can enhance single-task performance by learning transferable semantic information across tasks.

Second, task-specific prompt learning is effective for some tasks, such as data-to-text generation and question answering, where MVP with single-task prompt pre-training (MVP+S) consistently outperforms the other two methods. Compared with MVP+M, MVP+S can alleviate the “blurring-out” issue of multi-task learning, i.e., different tasks may “neutralize” the ability learned by others (He and Choi, 2021). Pre-trained on intra-task datasets, task-specific prompts can acquire specialized knowledge of each task and stimulate the capacity of the MVP model to perform that task. On the other hand, we find that equipping our MVP model with prompts decreases its performance on certain tasks, such as question generation, from 25.91 to 25.62 on BLEU-4. We speculate that this is due to the different degrees of convergence between the pre-trained backbone and the prompts, i.e., the prompts have been trained to perform specific tasks, while the backbone has been trained to be more applicable to different tasks.

Finally, our MVP models (MVP and MVP+R/S/M) achieve comparable or better performance than current SOTA approaches on data-to-text generation, question answering, story generation, and open-ended dialogue tasks, demonstrating strong text generation capability. As for the remaining three tasks, the SOTA models incorporate specific techniques tailored to those tasks (e.g., a re-ranking framework (Ravaut et al., 2022), self-training (Bao et al., 2021), and various task-specific objectives (He et al., 2021)), which yield better performance than our models. In contrast, the results of our models are still encouraging, considering we adopt only a simple architecture and a unified objective.

         Summarization (CNN/DM)              | Data-to-text (WebNLG)                | QG (SQuAD)                           | QA (CoQA)
         R-1        R-2        R-L           | B-4        ME         R-L            | B-4        ME         R-L            | F1         EM
FT BART  44.47±0.10 21.50±0.14 41.35±0.08    | 67.33±0.06 47.78±0.07 76.83±0.04     | 25.08±0.13 26.73±0.18 52.55±0.07     | 74.00±0.17 84.07±0.21
FT MVP   44.45±0.05 21.44±0.12 41.34±0.08    | 67.32±0.10 47.94±0.13 76.70±0.26     | 25.91±0.07 27.22±0.10 53.08±0.16     | 75.50±0.20 85.07±0.21
BART+R   42.47±0.16 19.79±0.15 39.10±0.17    | 65.15±0.66 46.74±0.18 76.06±0.34     | 24.12±0.16 25.95±0.11 51.84±0.26     | 70.80±1.31 81.10±1.21
MVP+R    42.88±0.10 20.25±0.04 39.66±0.06    | 66.02±0.58 47.10±0.13 75.75±0.39     | 25.07±0.04 26.49±0.13 52.63±0.06     | 73.97±0.64 83.93±0.49
MVP+S    42.89±0.11 20.21±0.08 39.58±0.11    | 66.68±0.08 47.45±0.04 76.23±0.12     | 25.24±0.04 26.63±0.05 52.68±0.05     | 75.03±0.06 84.63±0.15
MVP+M    43.03±0.06 20.34±0.05 39.74±0.07    | 66.70±0.35 47.41±0.23 76.17±0.28     | 25.21±0.03 26.50±0.02 52.68±0.03     | 74.43±0.06 84.00±0.10

         Story Generation (ROCStories)                  | Open-ended Dialogue (PersonaChat)              | TODS (MultiWOZ)
         B-1        B-2        D-1       D-4            | B-1        B-2        D-1       D-2            | B-4        Success    Inform
FT BART  33.79±0.13 15.78±0.21 3.43±0.17 78.76±2.15     | 49.58±1.12 39.24±0.90 1.44±0.09 8.89±0.57      | 20.17±0.63 75.40±1.22 84.40±1.15
FT MVP   33.96±0.08 15.96±0.05 3.17±0.15 76.11±1.38     | 49.56±0.44 40.41±0.10 1.55±0.07 10.20±0.46     | 20.34±0.37 75.47±0.40 84.07±0.15
BART+R   32.47±0.31 14.96±0.22 2.85±0.00 69.19±0.30     | 44.88±1.21 36.97±2.27 1.30±0.05 6.43±0.39      | 17.63±0.28 65.00±2.52 72.33±2.73
MVP+R    32.61±0.30 15.08±0.20 2.96±0.06 70.70±0.36     | 47.24±0.47 37.62±1.39 1.31±0.04 6.86±0.15      | 18.86±0.50 66.93±3.18 74.60±4.23
MVP+S    32.87±0.06 15.08±0.11 2.85±0.11 69.91±1.06     | 46.90±0.23 39.42±0.19 1.33±0.02 6.94±0.36      | 18.80±0.54 67.83±3.65 74.20±3.70
MVP+M    32.69±0.07 15.25±0.04 2.97±0.04 69.81±0.66     | 46.95±0.32 37.41±1.92 1.34±0.05 7.00±0.25      | 19.22±0.25 69.63±2.21 76.23±3.02
Table 2: The main results on seven seen tasks under parameter-efficient settings. We also include the results of BART and MVP under full tuning (denoted as FT) settings for comparison.

4.2 Parameter-Efficient Tuning Performance

In the lightweight fine-tuning setting, we only tune the prompts while freezing the backbone MVP model. We compare the following methods:


  • Prefix-tuning (Li and Liang, 2021): Prefix-tuning is a popular prompt-based lightweight tuning method for text generation. We employ BARTlarge as its backbone, denoted as BART+R.

  • Only tuning randomly initialized prompts (MVP+R): This baseline only tunes the randomly initialized prompts of MVP+R (Section 4.1), and it shares a similar idea with Prefix-tuning.

  • Only tuning single-task pre-trained prompts (MVP+S): This baseline only tunes the single-task pre-trained prompts of MVP+S (Section 4.1). These pre-trained prompts contain task-specific knowledge and can serve as a better initialization than the random ones.

  • Only tuning multi-task pre-trained prompts (MVP+M): This baseline tunes the multi-task pre-trained prompts of MVP+M (Section 4.1). Such an idea has been used in SPoT (Vu et al., 2022).

From the experimental results in Table 2, we can see that:

First, the good performance of the MVP model in lightweight settings further demonstrates the effectiveness of supervised pre-training. Comparing the two randomly initialized prompting methods (BART+R and MVP+R), we can see that MVP+R achieves superior performance to BART+R due to its multi-task supervised backbone. Furthermore, when initialized with pre-trained prompts, MVP+S and MVP+M achieve improved results over MVP+R, which is consistent with the findings of SPoT (Vu et al., 2022). When compared with MVP+M, MVP+S performs marginally better, indicating that task-specific prompts are useful for improving the model on specific generation tasks.

Surprisingly, our lightweight MVP+S can even outperform fully tuned BART on question generation and question answering tasks, showcasing the effectiveness of the proposed supervised pre-training approach. Another observation is that lightweight prompting methods (Lester et al., 2021; Vu et al., 2022) that work well on NLU tasks cannot achieve competitive performance compared to full tuning on NLG tasks. Therefore, it is necessary to design lightweight tuning models specifically for generation tasks.

5 Generalization Ability

In this section, we concentrate on the second question: Can supervised pre-trained models generalize to unseen tasks? We employ our MVP model to unseen NLG and NLU tasks to verify the generalizability of our model.

Generalization to Unseen NLG Tasks.

According to Deng et al. (2021), every NLG task can be viewed as a compression (e.g., summarization), transduction (e.g., translation), or creation (e.g., story generation) task. Since we do not include any transduction tasks during pre-training, we evaluate our MVP model on two unseen transduction NLG tasks: paraphrase generation and text style transfer.

We follow the methods that achieve SOTA performance on these two tasks, i.e., AESOP (Sun et al., 2021) for paraphrase generation and SC & BLEU (Lai et al., 2021) for text style transfer, and replace their backbone BART with our MVP model for comparison. The experimental setup remains the same as described in Section 4, and details are listed in Appendix C. From the results in Table 3, we can see that our model outperforms BART on 10 out of 11 metrics and achieves new SOTA performance, which verifies the strong generalizability of our model. In addition, combining our MVP model with existing SOTA methods yields better performance than their BART-based counterparts. This observation suggests that our MVP model can adapt to existing methods effectively by providing superior parameter initialization.

                  Paraphrase Generation (Quora)
                  B-4        R-1        R-2        R-L        ME
SOTA              47.3 [a]   73.3       54.1       75.1       49.7
BART + AESOP      48.35±0.70 74.16±0.47 55.25±0.74 75.84±0.42 50.60±0.49
MVP + AESOP       49.86±0.21 74.93±0.05 56.55±0.13 76.56±0.09 52.27±0.21

                  Style Transfer (GYAFC E&M)         | Style Transfer (GYAFC F&R)
                  B          Accuracy   HM           | B          Accuracy   HM
SOTA              76.52 [b]  93.7 [c]   83.9 [c]     | 80.29 [b]  92.0 [c]   85.2 [c]
BART + SC & BLEU  76.93±0.55 94.37±0.87 84.74±0.05   | 80.11±0.29 92.29±0.37 85.77±0.10
MVP + SC & BLEU   77.01±0.20 94.66±0.36 84.92±0.04   | 79.70±0.25 93.07±0.28 85.87±0.27

Table 3: The main results on unseen NLG tasks, including paraphrase generation and text style transfer. Accuracy is calculated by a pre-trained TextCNN to evaluate the style strength, and HM denotes the harmonic mean of BLEU and style accuracy (Lai et al., 2021). SOTA references: [a] Sun et al. (2021); [b] Chawla and Yang (2020); [c] Lai et al. (2021).
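As a quick check of the HM column, the harmonic mean of BLEU score B and style accuracy A is HM = 2BA / (B + A); for instance, for MVP + SC & BLEU on GYAFC E&M, 2 × 77.01 × 94.66 / (77.01 + 94.66) ≈ 84.9, consistent with the reported value.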
CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE Average
Matt. Accuracy F1/Accuracy P/S Corr. F1/Accuracy m./mm. Accuracy Accuracy
BART 60.30±3.20 96.30±0.10 90.47±1.46 / 86.70±2.10 90.97±0.15 / 90.30±0.10 73.03±0.23 / 89.87±0.12 90.03±0.21 / 89.27±0.15 94.60±0.10 79.83±2.76 85.17±1.05
MVP 59.87±1.04 96.43±0.32 92.07±0.25 / 89.43±0.29 91.37±0.15 / 90.90±0.35 73.20±0.10 / 90.13±0.12 89.70±0.00 / 88.73±0.25 95.10±0.26 82.87±0.58 85.88±0.37
Table 4: The main results of NLU tasks on the GLUE benchmark. We evaluate the results on the official website https://gluebenchmark.com/.

Generalization to Unseen NLU Tasks.

Although MVP is designed specifically for NLG tasks, we also evaluate its performance on unseen NLU tasks using the widely-used GLUE benchmark (Wang et al., 2019). We compare our model to BARTlarge using its original sequence classification method (Lewis et al., 2020). The detailed settings can be found in Appendix C. According to the results presented in Table 4, our MVP model outperforms BART on the majority of metrics and has superior overall performance. This result indicates the strong generalization ability of our MVP model on unseen NLU tasks. It further demonstrates that the model learned through supervised pre-training is not limited to generation tasks and can instead improve the overall semantic representations.

6 Discussion

Methods MTL model MTL prompts Usage mode Open source #NLU #NLG
FLAN (Wei et al., 2022) zero-shot 9 2
T0 (Sanh et al., 2022) zero-shot 4 0
Muppet (Aghajanyan et al., 2021) fine-tune 3 1
ExT5 (Aribandi et al., 2022) fine-tune 8 6
SPoT (Vu et al., 2022) fine-tune 6 0
MVP (ours) fine-tune 3 11
Table 5: Comparison of our paper with existing works. MTL model denotes whether the work utilizes multi-task learning to train a backbone model, similar to MTL prompts. Usage mode is the primary way for applying a model to downstream tasks. Open-source refers to whether the work has released models to the public. #NLU and #NLG are the numbers of NLU and NLG tasks for evaluation.

In this section, we discuss this work and compare it to existing methods.

Supervised Pre-training for NLG Models.

Unsupervised pre-training has been extensively investigated for natural language understanding (Devlin et al., 2019; Liu et al., 2019b) and generation (Lewis et al., 2020; Brown et al., 2020), aiming to learn universal language representations that can adapt to a variety of tasks. In spite of its effectiveness, increasing evidence indicates that a general solution cannot always generate the best task-specific representations in comparison to a supervised approach (Lin et al., 2020b; Aribandi et al., 2022). In the meantime, the growing availability of labeled data makes it feasible to conduct large-scale supervised pre-training. Inspired by the success of supervised pre-training (Dosovitskiy et al., 2021; Lin et al., 2020b), we present an important attempt to pre-train a more capable PLM for NLG tasks using labeled data. Our experiments have shown that such an approach works not only for seen tasks but also for unseen tasks (including NLU tasks), indicating that supervised pre-training is a promising, general, and effective method for various tasks.

Differences with Existing Methods.

To the best of our knowledge, existing supervised pre-training works mainly focus on NLU tasks (Aghajanyan et al., 2021; Aribandi et al., 2022) or on specific NLG tasks (Lin et al., 2020b; Pan et al., 2021). Given the superior performance yielded by supervised pre-training approaches, it is important to explore supervised pre-training for deriving both effective and general NLG models. Our work makes a significant contribution in this direction, achieving SOTA performance with a single model on 12 of 17 datasets. Compared with its strong counterpart ExT5 (Aribandi et al., 2022), our MVP model outperforms it on the majority of metrics (detailed in Appendix D.2). To better understand the difference between our work and previous supervised (multi-task) pre-training works, we present a detailed comparison in Table 5. As we can see, our work conducts the evaluation with the greatest number of NLG tasks, adopts multi-tasking to learn both the model and the task-specific prompts, and makes abundant resources publicly available.

Applicability.

To facilitate the use of our MVP model, we have released both the code and the pre-trained models. To apply our model, we consider two different settings for seen and unseen tasks. For seen tasks, one can utilize the MVP model directly or integrate it with task prompts. For unseen tasks, besides the above methods, one can also pre-train prompts using task-specific labeled data. Our model can also serve as an effective parameter initialization for adapting existing methods to diverse tasks, as described in Section 5. Furthermore, we release all the intermediate results (e.g., the generated texts) of our model on the evaluation tasks. These results provide valuable data resources for understanding and analyzing the task capacity of PLMs. In addition, the pre-trained task-specific prompts can be used to study task similarity and its effect on multi-task pre-training.

Limitations.

Despite our efforts to collect as many generation tasks and datasets as possible, we only evaluate the generation quality and generalization ability of our models on a limited number of tasks and datasets. The interpretability and robustness of our models require further analysis. Besides, there is some subjectivity in how we group intra-task datasets, despite our attempts to employ widely-recognized categorizations from the literature. Due to limited computing power, we do not study the performance of our method at different model scales. The effectiveness of multi-task pre-training from scratch, similar to ExT5 (Aribandi et al., 2022), also merits an in-depth study. Regarding evaluation methods, we only consider basic automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004); however, there remains a gap between these metrics and human judgments (Sai et al., 2022).

7 Conclusion

In this paper, we propose MVP, a multi-task supervised pre-trained model with task-specific prompts for NLG tasks. Extensive experiments demonstrate that: 1) supervised pre-training is beneficial for NLG tasks; our MVP model outperforms the unsupervised pre-trained model BART on the examined tasks and achieves SOTA performance on 12 out of 17 datasets; 2) supervised pre-trained models have a strong generalization ability on unseen generation and understanding tasks. We hope that the open-sourced MVP models will facilitate future work on supervised pre-training and contribute to the advancement of NLG research.

References

  • (1) URL https://github.com/markriedl/WikiPlots.
  • Agarwal et al. (2021) Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.278. URL https://aclanthology.org/2021.naacl-main.278.
  • Aghajanyan et al. (2021) Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. Muppet: Massive multi-task representations with pre-finetuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5799–5811, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.468. URL https://aclanthology.org/2021.emnlp-main.468.
  • Alamri et al. (2018) Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K Marks, et al. Audio visual scene-aware dialog (avsd) challenge at dstc7. arXiv preprint arXiv:1806.00525, 2018. URL http://arxiv.org/abs/1806.00525.
  • Alva-Manchego et al. (2020) Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.424. URL https://aclanthology.org/2020.acl-main.424.
  • Aribandi et al. (2022) Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. ExT5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Vzh1BFUCiIX.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909.
  • Bao et al. (2021) Hangbo Bao, Li Dong, Wenhui Wang, Nan Yang, and Furu Wei. s2s-ft: Fine-tuning pretrained transformer encoders for sequence-to-sequence learning. arXiv preprint arXiv:2110.13640, 2021. URL https://arxiv.org/abs/2110.13640.
  • Bao et al. (2020) Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng Wang. PLATO: Pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 85–96, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.9. URL https://aclanthology.org/2020.acl-main.9.
  • Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA, 2021. Association for Computing Machinery. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.
  • Bentivogli et al. (2009) Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Text Analysis Conference (TAC 2009), 2009.
  • Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160.
  • Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL http://arxiv.org/abs/2108.07258.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1547. URL https://aclanthology.org/D18-1547.
  • Bujnowski et al. (2020) Pawel Bujnowski, Kseniia Ryzhova, Hyungtak Choi, Katarzyna Witkowska, Jaroslaw Piersa, Tymoteusz Krumholc, and Katarzyna Beksa. An empirical study on multi-task learning for text style transfer and paraphrase generation. In Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, pages 50–63, Online, December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-industry.6. URL https://aclanthology.org/2020.coling-industry.6.
  • Byrne et al. (2019) Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Ben Goodrich, Daniel Duckworth, Semih Yavuz, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4516–4525, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1459. URL https://aclanthology.org/D19-1459.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://aclanthology.org/S17-2001.
  • Chawla and Yang (2020) Kunal Chawla and Diyi Yang. Semi-supervised formality style transfer using language model discriminator and mutual information maximization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2340–2354, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.212. URL https://aclanthology.org/2020.findings-emnlp.212.
  • Chen et al. (2021) Mingda Chen, Sam Wiseman, and Kevin Gimpel. WikiTableT: A large-scale data-to-text dataset for generating Wikipedia article sections. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 193–209, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.17. URL https://aclanthology.org/2021.findings-acl.17.
  • Chen et al. (2022) Wei Chen, Yeyun Gong, Song Wang, Bolun Yao, Weizhen Qi, Zhongyu Wei, Xiaowu Hu, Bartuer Zhou, Yi Mao, Weizhu Chen, Biao Cheng, and Nan Duan. DialogVED: A pre-trained latent variable encoder-decoder model for dialog response generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4852–4864, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.333.
  • Chen et al. (2020a) Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and William Yang Wang. Logical natural language generation from open-domain tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.708. URL https://aclanthology.org/2020.acl-main.708.
  • Chen et al. (2020b) Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. KGPT: Knowledge-grounded pre-training for data-to-text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8635–8648, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.697. URL https://aclanthology.org/2020.emnlp-main.697.
  • Cheng et al. (2020) Liying Cheng, Dekun Wu, Lidong Bing, Yan Zhang, Zhanming Jie, Wei Lu, and Luo Si. ENT-DESC: Entity description generation by exploring knowledge graph. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1187–1197, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.90. URL https://aclanthology.org/2020.emnlp-main.90.
  • Choi et al. (2018) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1241. URL https://aclanthology.org/D18-1241.
  • Collobert and Weston (2008) Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, pages 160–167. ACM, 2008. doi: 10.1145/1390156.1390177. URL https://doi.org/10.1145/1390156.1390177.
  • CONNEAU and Lample (2019) Alexis CONNEAU and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf.
  • Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pages 248–255, Los Alamitos, CA, USA, June 2009. IEEE Computer Society. doi: 10.1109/CVPR.2009.5206848. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2009.5206848.
  • Deng et al. (2021) Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing, and Zhiting Hu. Compression, transduction, and creation: A unified framework for evaluating natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7580–7605, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.599. URL https://aclanthology.org/2021.emnlp-main.599.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.
  • Dodge et al. (2016) Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander H. Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. In 4th International Conference on Learning Representations, ICLR 2016, 2016. URL http://arxiv.org/abs/1511.06931.
  • Dolan and Brockett (2005) William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL https://aclanthology.org/I05-5002.
  • Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/c20bb2d9a50d5ac1f713f8b34d9aac5a-Paper.pdf.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
  • El Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 207–219, Saarbrücken, Germany, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-5526. URL https://aclanthology.org/W17-5526.
  • Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrücken, Germany, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-5506. URL https://aclanthology.org/W17-5506.
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.
  • Feng et al. (2022) Yutong Feng, Jianwen Jiang, Mingqian Tang, Rong Jin, and Yue Gao. Rethinking supervised pre-training for better downstream transferring. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Jjcv9MTqhcq.
  • Garbacea and Mei (2020) Cristina Garbacea and Qiaozhu Mei. Neural language generation: Formulation, methods, and evaluation. arXiv preprint arXiv:2007.15780, 2020. URL http://arxiv.org/abs/2007.15780.
  • Gardent et al. (2017) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1017. URL https://aclanthology.org/P17-1017.
  • Gehrmann et al. (2021) Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 96–120, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.gem-1.10. URL https://aclanthology.org/2021.gem-1.10.
  • Giampiccolo et al. (2007) Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/W07-1401.
  • Gliwa et al. (2019) Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://aclanthology.org/D19-5409.
  • Gopalakrishnan et al. (2019) Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. Topical-chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, pages 1891–1895. ISCA, 2019. doi: 10.21437/Interspeech.2019-3079. URL https://doi.org/10.21437/Interspeech.2019-3079.
  • Graff et al. (2003) David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34, 2003.
  • Grusky et al. (2018) Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1065. URL https://aclanthology.org/N18-1065.
  • Gu et al. (2021) Jing Gu, Mostafa Mirshekari, Zhou Yu, and Aaron Sisto. ChainCQG: Flow-aware conversational question generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2061–2070, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.177. URL https://aclanthology.org/2021.eacl-main.177.
  • Gu et al. (2022) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT: Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8410–8423, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.576.
  • Guan et al. (2021) Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6379–6393, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.499. URL https://aclanthology.org/2021.acl-long.499.
  • Haim et al. (2006) R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 7, 2006.
  • He and Choi (2021) Han He and Jinho D. Choi. The stem cell hypothesis: Dilemma behind multi-task learning with transformer encoders. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5555–5577, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.451. URL https://aclanthology.org/2021.emnlp-main.451.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Los Alamitos, CA, USA, jun 2016. IEEE Computer Society. doi: 10.1109/CVPR.2016.90. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2016.90.
  • He et al. (2021) Wanwei He, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, Min Yang, Fei Huang, Luo Si, Jian Sun, and Yongbin Li. GALAXY: A generative pre-trained model for task-oriented dialog with semi-supervised learning and explicit policy injection. arXiv preprint arXiv:2111.14592, 2021. URL https://arxiv.org/abs/2111.14592.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf.
  • Hua and Wang (2020) Xinyu Hua and Lu Wang. PAIR: Planning and iterative refinement in pre-trained transformers for long text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 781–793, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.57. URL https://aclanthology.org/2020.emnlp-main.57.
  • Jiang et al. (2020) Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, and Wei Xu. Neural CRF model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7943–7960, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.709. URL https://aclanthology.org/2020.acl-main.709.
  • Jiang et al. (2021) Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021. doi: 10.1162/tacl_a_00407. URL https://aclanthology.org/2021.tacl-1.57.
  • Jin et al. (2020) Zhijing Jin, Qipeng Guo, Xipeng Qiu, and Zheng Zhang. GenWiki: A dataset of 1.3 million content-sharing text and graphs for unsupervised graph-to-text generation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2398–2409, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.217. URL https://aclanthology.org/2020.coling-main.217.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
  • Ke et al. (2021) Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, and Minlie Huang. JointGT: Graph-text joint representation learning for text generation from knowledge graphs. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2526–2538, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.223. URL https://aclanthology.org/2021.findings-acl.223.
  • Khashabi et al. (2020) Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.171. URL https://aclanthology.org/2020.findings-emnlp.171.
  • Kočiský et al. (2018) Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023.
  • Koncel-Kedziorski et al. (2019) Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. Text Generation from Knowledge Graphs with Graph Transformers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2284–2293, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1238. URL https://aclanthology.org/N19-1238.
  • Koupaee and Wang (2018) Mahnaz Koupaee and William Yang Wang. Wikihow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018. URL http://arxiv.org/abs/1810.09305.
  • Kumar et al. (2020) Ashutosh Kumar, Kabir Ahuja, Raghuram Vadapalli, and Partha Talukdar. Syntax-guided controlled generation of paraphrases. Transactions of the Association for Computational Linguistics, 8:329–345, 2020. doi: 10.1162/tacl_a_00318. URL https://aclanthology.org/2020.tacl-1.22.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
  • Ladhak et al. (2020) Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4034–4048, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.360. URL https://aclanthology.org/2020.findings-emnlp.360.
  • Lai et al. (2021) Huiyuan Lai, Antonio Toral, and Malvina Nissim. Thank you BART! rewarding pre-trained models improves formality style transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 484–494, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.62. URL https://aclanthology.org/2021.acl-short.62.
  • Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1128. URL https://aclanthology.org/D16-1128.
  • Lee et al. (2019) Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges, volume 8, 2019. URL http://workshop.colips.org/dstc7/dstc8/DTSC8_multidomain_task_proposal.pdf.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1014. URL https://aclanthology.org/N16-1014.
  • Li et al. (2021) Junyi Li, Tianyi Tang, Gaole He, Jinhao Jiang, Xiaoxuan Hu, Puzhao Xie, Zhipeng Chen, Zhuohao Yu, Wayne Xin Zhao, and Ji-Rong Wen. TextBox: A unified, modularized, and extensible framework for text generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 30–39, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-demo.4. URL https://aclanthology.org/2021.acl-demo.4.
  • Li et al. (2022a) Junyi Li, Tianyi Tang, Jian-Yun Nie, Ji-Rong Wen, and Wayne Xin Zhao. Learning to transfer prompts for text generation. arXiv preprint arXiv:2205.01543, 2022a. URL http://arxiv.org/abs/2205.01543.
  • Li et al. (2022b) Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. A survey of pretrained language models based text generation. arXiv preprint arXiv:2201.05273, 2022b. URL https://arxiv.org/abs/2201.05273.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
  • Li et al. (2018) Xiujun Li, Yu Wang, Siqi Sun, Sarah Panda, Jingjing Liu, and Jianfeng Gao. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. arXiv preprint arXiv:1807.11125, 2018. URL http://arxiv.org/abs/1807.11125.
  • Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing. URL https://aclanthology.org/I17-1099.
  • Liang et al. (2009) Percy Liang, Michael Jordan, and Dan Klein. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 91–99, Suntec, Singapore, August 2009. Association for Computational Linguistics. URL https://aclanthology.org/P09-1011.
  • Lin et al. (2020a) Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URL https://aclanthology.org/2020.findings-emnlp.165.
  • Lin (2004) Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  • Lin et al. (2020b) Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. Pre-training multilingual neural machine translation by leveraging alignment information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2649–2663, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.210. URL https://aclanthology.org/2020.emnlp-main.210.
  • Lison et al. (2018) Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1275.
  • Liu et al. (2021a) Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, and Nan Duan. GLGE: A new general language generation evaluation benchmark. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 408–420, Online, August 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.36. URL https://aclanthology.org/2021.findings-acl.36.
  • Liu et al. (2021b) Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, and Xiaojie Wang. Topic-aware contrastive learning for abstractive dialogue summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1229–1243, Punta Cana, Dominican Republic, November 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.106. URL https://aclanthology.org/2021.findings-emnlp.106.
  • Liu et al. (2021c) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021c. URL http://arxiv.org/abs/2107.13586.
  • Liu et al. (2021d) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021d. URL http://arxiv.org/abs/2103.10385.
  • Liu et al. (2022) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-short.8.
  • Liu et al. (2019a) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy, July 2019a. Association for Computational Linguistics. doi: 10.18653/v1/P19-1441. URL https://aclanthology.org/P19-1441.
  • Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019b. URL http://arxiv.org/abs/1907.11692.
  • Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742, 2020. doi: 10.1162/tacl_a_00343. URL https://aclanthology.org/2020.tacl-1.47.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Lu et al. (2020) Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. Gender Bias in Neural Natural Language Processing, pages 189–202. Springer International Publishing, Cham, 2020. doi: 10.1007/978-3-030-62077-6_14. URL https://doi.org/10.1007/978-3-030-62077-6_14.
  • McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/20c86a628232a67e7bd46f76fba7ce12-Paper.pdf.
  • McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018. URL http://arxiv.org/abs/1806.08730.
  • Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.244.
  • Moon et al. (2019) Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1081. URL https://aclanthology.org/P19-1081.
  • Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1098. URL https://aclanthology.org/N16-1098.
  • Mrkšić et al. (2017) Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1163. URL https://aclanthology.org/P17-1163.
  • Nan et al. (2021) Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. DART: Open-domain structured data record to text generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 432–447, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.37. URL https://aclanthology.org/2021.naacl-main.37.
  • Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. URL https://aclanthology.org/D18-1206.
  • Nguyen et al. (2021) Thong Nguyen, Anh Tuan Luu, Truc Lu, and Tho Quan. Enriching and controlling global semantics for text summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9443–9456, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.744. URL https://aclanthology.org/2021.emnlp-main.744.
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org, 2016. URL http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
  • Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-5525. URL https://aclanthology.org/W17-5525.
  • Pan et al. (2021) Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 244–258, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.21. URL https://aclanthology.org/2021.acl-long.21.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.
  • Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. In Advances in Neural Information Processing Systems, volume 34, pages 11054–11070. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/5c04925674920eb58467fb52ce4ef728-Paper.pdf.
  • Qin and Eisner (2021) Guanghui Qin and Jason Eisner. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.410. URL https://aclanthology.org/2021.naacl-main.410.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. URL https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.
  • Rao and Tetreault (2018) Sudha Rao and Joel Tetreault. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1012. URL https://aclanthology.org/N18-1012.
  • Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1534. URL https://aclanthology.org/P19-1534.
  • Rastogi et al. (2020a) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696, Apr. 2020a. doi: 10.1609/aaai.v34i05.6394. URL https://ojs.aaai.org/index.php/AAAI/article/view/6394.
  • Rastogi et al. (2020b) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696, Apr. 2020b. doi: 10.1609/aaai.v34i05.6394. URL https://ojs.aaai.org/index.php/AAAI/article/view/6394.
  • Ravaut et al. (2022) Mathieu Ravaut, Shafiq Joty, and Nancy Chen. SummaReranker: A multi-task mixture-of-experts re-ranking framework for abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4504–4524, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.309.
  • Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019. doi: 10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016.
  • Rodriguez et al. (2020) Pedro Rodriguez, Paul Crook, Seungwhan Moon, and Zhiguang Wang. Information seeking in the spirit of learning: A dataset for conversational curiosity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8153–8172, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.655. URL https://aclanthology.org/2020.emnlp-main.655.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1044. URL https://aclanthology.org/D15-1044.
  • Sai et al. (2022) Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. A survey of evaluation metrics used for NLG systems. ACM Comput. Surv., 55(2), jan 2022. ISSN 0360-0300. doi: 10.1145/3485766. URL https://doi.org/10.1145/3485766.
  • Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.
  • Sap et al. (2020) Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James Pennebaker. Recollection versus imagination: Exploring human memory and cognition via neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1970–1978, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.178. URL https://aclanthology.org/2020.acl-main.178.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. URL https://aclanthology.org/P17-1099.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.
  • Stratos (2019) Karl Stratos. Mutual information maximization for simple and accurate part-of-speech induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1095–1104, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1113. URL https://aclanthology.org/N19-1113.
  • Su et al. (2021) Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. Plan-then-generate: Controlled data-to-text generation via planning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 895–909, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.76. URL https://aclanthology.org/2021.findings-emnlp.76.
  • Su et al. (2022) Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai, and Yi Zhang. Multi-task pre-training for plug-and-play task-oriented dialogue system. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4661–4676, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.319.
  • Subramanian et al. (2018) Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B18WgG-CZ.
  • Sun et al. (2021) Jiao Sun, Xuezhe Ma, and Nanyun Peng. AESOP: Paraphrase generation with adaptive syntactic control. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5176–5189, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.420. URL https://aclanthology.org/2021.emnlp-main.420.
  • Sun et al. (2019) Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231, 2019. doi: 10.1162/tacl_a_00264. URL https://aclanthology.org/Q19-1014.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, Los Alamitos, CA, USA, jun 2016. IEEE Computer Society. doi: 10.1109/CVPR.2016.308. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2016.308.
  • Tan and Le (2019) Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/tan19a.html.
  • Tang et al. (2022) Tianyi Tang, Junyi Li, and Wayne Xin Zhao. Context-tuning: Learning contextualized prompts for natural language generation. arXiv preprint arXiv:2201.08670, 2022. URL http://arxiv.org/abs/2201.08670.
  • Trischler et al. (2017) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-2623. URL https://aclanthology.org/W17-2623.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, Los Alamitos, CA, USA, jun 2015. IEEE Computer Society. doi: 10.1109/CVPR.2015.7299087. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2015.7299087.
  • Vu et al. (2022) Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5039–5059, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.346.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7.
  • Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. doi: 10.1162/tacl_a_00290. URL https://aclanthology.org/Q19-1040.
  • Webson and Pavlick (2021) Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247, 2021. URL http://arxiv.org/abs/2109.01247.
  • Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
  • Welivita et al. (2021) Anuradha Welivita, Yubo Xie, and Pearl Pu. A large-scale dataset for empathetic response generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1251–1264, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.96. URL https://aclanthology.org/2021.emnlp-main.96.
  • Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-1042.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
  • Worsham and Kalita (2020) Joseph Worsham and Jugal Kalita. Multi-task learning for natural language processing in the 2020s: Where are we going? Pattern Recognition Letters, 136:120–126, 2020. ISSN 0167-8655. doi: https://doi.org/10.1016/j.patrec.2020.05.031. URL https://www.sciencedirect.com/science/article/pii/S0167865520302087.
  • Xiao et al. (2020) Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3997–4003. International Joint Conferences on Artificial Intelligence Organization, 7 2020. doi: 10.24963/ijcai.2020/553. URL https://doi.org/10.24963/ijcai.2020/553. Main track.
  • Xie et al. (2022) Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I Wang, et al. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. arXiv preprint arXiv:2201.05966, 2022. URL http://arxiv.org/abs/2201.05966.
  • Xu et al. (2022) Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. arXiv preprint arXiv:2201.06910, 2022. URL http://arxiv.org/abs/2201.06910.
  • Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016. doi: 10.1162/tacl_a_00107. URL https://aclanthology.org/Q16-1029.
  • Yang et al. (2020) Jian Yang, Shuming Ma, Dongdong Zhang, ShuangZhi Wu, Zhoujun Li, and Ming Zhou. Alternating language modeling for cross-lingual pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9386–9393, Apr. 2020. doi: 10.1609/aaai.v34i05.6480. URL https://ojs.aaai.org/index.php/AAAI/article/view/6480.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
  • Ye et al. (2021) Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7163–7189, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.572. URL https://aclanthology.org/2021.emnlp-main.572.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1205. URL https://aclanthology.org/P18-1205.
  • Zhang et al. (2021) Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.90. URL https://aclanthology.org/2021.acl-long.90.
  • Zhang et al. (2020) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. DIALOGPT : Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-demos.30. URL https://aclanthology.org/2020.acl-demos.30.
  • Zhou et al. (2018) Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1076. URL https://aclanthology.org/D18-1076.
  • Zhu et al. (2021) Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. MediaSum: A large-scale media interview dataset for dialogue summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5927–5934, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.474. URL https://aclanthology.org/2021.naacl-main.474.

Appendix A Broader Impacts

In this paper, we pre-trained a language model, MVP, using labeled NLG datasets. According to prior research [Bender et al., 2021, Bommasani et al., 2021], PLMs tend to “remember” what they have “seen” in the pre-training corpus. This could result in the reproduction of undesirable biases from the pre-training data on downstream tasks. Intervening on the training data could be one way to alleviate this issue [Lu et al., 2020]. It would also be interesting to investigate in future work whether supervised pre-training produces fewer biases than unsupervised pre-training.

Environmental impact is another factor to consider. We adopt a more efficient pre-training strategy and release our PLM for future work. In contrast to large PLMs with tens of billions of parameters, such as T5 [Raffel et al., 2020] and GPT-3 [Brown et al., 2020], we pre-train only a small model with hundreds of millions of parameters. In addition, we utilize supervised pre-training data and initialize our model with pre-trained BART, both of which speed up the convergence of our model. Ultimately, our model is pre-trained for substantially fewer steps than a BART model of the same size. We also release our pre-trained model and task-specific prompts for reproducing our results and for future work at https://github.com/RUCAIBox/MVP.

Appendix B Tasks and Datasets

b.1 Description of Tasks and Datasets

We provide the details of the tasks and datasets used in our paper for pre-training and fine-tuning in Tables 6 and 7. If a pre-training dataset does not have a validation set, we hold out a portion of its training set for validation.
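As a minimal illustration of this preprocessing step, the sketch below holds one slice of the training split out as a validation set when a dataset ships without one; the split ratio, seed, and field names are hypothetical, since the exact fraction is not specified here.

```python
import random

def ensure_validation_split(dataset, valid_ratio=0.1, seed=42):
    """If `dataset` has no validation set, carve one out of the training set.

    `dataset` is assumed to be a dict like {"train": [...], "valid": [...]};
    `valid_ratio` (10% here) is a hypothetical choice, not the paper's value.
    """
    if dataset.get("valid"):
        return dataset  # a validation set already exists, nothing to do

    examples = list(dataset["train"])
    random.Random(seed).shuffle(examples)
    n_valid = max(1, int(len(examples) * valid_ratio))
    return {
        "train": examples[n_valid:],
        "valid": examples[:n_valid],
        **{k: v for k, v in dataset.items() if k not in ("train", "valid")},
    }
```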

We list the licenses for all datasets that have them. All datasets are publicly available. The majority of them can be directly downloaded from GitHub or Google Drive. ROCStories [Mostafazadeh et al., 2016] and CommonGen [Lin et al., 2020a] can be obtained after filling out a form. GYAFC [Rao and Tetreault, 2018] is accessible after making a request to Yahoo and the authors of the dataset.

The tasks and datasets we use in this paper are as follows:

  • [leftmargin=*]

  • Data-to-text generation aims to generate descriptive text about structured data, such as knowledge graphs and tables. We use the following datasets for pre-training:

    1. AGENDA [Koncel-Kedziorski et al., 2019];

    2. ENT-DESC [Cheng et al., 2020];

    3. GenWiki [Jin et al., 2020];

    4. LogicNLG [Chen et al., 2020a];

    5. TEKGEN [Agarwal et al., 2021];

    6. WEATHERGOV [Liang et al., 2009];

    7. WikiTableT [Chen et al., 2021].

    We utilize the following datasets for fine-tuning evaluation:

    1. WebNLG [Gardent et al., 2017]; we utilize version 2.1;

    2. WikiBio [Lebret et al., 2016].

  • Open-ended dialogue systems, also known as chatbots, focus on daily communication. We use the following datasets for pre-training:

    1. Cleaned OpenSubtitles Dialogs (Cleaned OS Dialogs) [Welivita et al., 2021], which is a cleaned variant of OpenSubtitles Dialogs [Lison et al., 2018];

    2. CMU Document Grounded Conversations (CMUDog) [Zhou et al., 2018];

    3. Curiosity [Rodriguez et al., 2020];

    4. DREAM [Sun et al., 2019];

    5. Empathetic Dialogues [Rashkin et al., 2019];

    6. Movie Dialog [Dodge et al., 2016];

    7. MuTual [Stratos, 2019];

    8. OpenDialKG [Moon et al., 2019];

    9. Topical-Chat [Gopalakrishnan et al., 2019];

    10. Wizard of Wikipedia [Dinan et al., 2019].

    We utilize the following datasets for fine-tuning evaluation:

    1. DailyDialog [Li et al., 2017];

    2. DSTC7-AVSD [Alamri et al., 2018];

    3. PersonaChat [Zhang et al., 2018].

  • Paraphrase generation involves rewriting a sentence with the same semantic meaning but a different syntactic or lexical form. We utilize the following datasets for fine-tuning evaluation:

    1. Quora (also known as QQP-Pos) [Kumar et al., 2020], which is a subset of Quora Question Pairs (https://www.kaggle.com/c/quora-question-pairs).

  • Question answering requires the model to answer a question based on optional background information. Note that we conduct this task in a generative way in our paper. We use the following datasets for pre-training:

    1. HotpotQA [Yang et al., 2018];

    2. MS MARCO [Nguyen et al., 2016];

    3. MSQG [Liu et al., 2021a], since it is designed for QG, we reverse the question and answer to enrich QA examples;

    4. NarrativeQA [Kočiský et al., 2018];

    5. Natural Questions [Kwiatkowski et al., 2019];

    6. NewsQA [Trischler et al., 2017];

    7. QuAC [Choi et al., 2018];

    8. TriviaQA [Joshi et al., 2017];

    9. WebQuestions [Berant et al., 2013].

    We utilize the following datasets for fine-tuning evaluation:

    1. CoQA [Reddy et al., 2019];

    2. SQuAD [Rajpurkar et al., 2016]; we utilize version 1.1.

  • Question generation generates a coherent question given a passage and its corresponding answer. We use the following datasets for pre-training:

    1. HotpotQA [Yang et al., 2018];

    2. MS MARCO [Nguyen et al., 2016];

    3. MSQG [Liu et al., 2021a];

    4. NarrativeQA [Kočiský et al., 2018];

    5. NewsQA [Trischler et al., 2017];

    6. QuAC [Choi et al., 2018];

    Most of them are QA datasets, and we invert the question and answer to enrich the QG examples (a minimal sketch of this inversion is given after the list below).

    We utilize the following datasets for fine-tuning evaluation:

    1. CoQA [Reddy et al., 2019];

    2. SQuAD [Rajpurkar et al., 2016]; we utilize version 1.1.
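    To make the inversion mentioned above concrete, here is a minimal sketch of how QA examples could be flipped into QG examples (and vice versa). The field names and prompt wording are hypothetical and do not reproduce the exact preprocessing used for MVP.

```python
def qa_to_qg(example):
    """Turn a QA example (passage + question -> answer) into a QG example
    (passage + answer -> question). Field names are hypothetical."""
    return {
        "input": f"Generate a question. Passage: {example['passage']} Answer: {example['answer']}",
        "output": example["question"],
    }

def qg_to_qa(example):
    """The reverse direction: passage + question as input, answer as target."""
    return {
        "input": f"Answer the question. Passage: {example['passage']} Question: {example['question']}",
        "output": example["answer"],
    }

# Usage: enrich the QG pre-training corpus with inverted QA datasets.
qa_example = {"passage": "The Nile is a river in Africa.",
              "question": "Where is the Nile?",
              "answer": "Africa"}
qg_example = qa_to_qg(qa_example)
```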

  • Story generation creates a long and informative text from a short title. We use the following datasets for pre-training:

    1. ChangeMyView [Hua and Wang, 2020];

    2. English Gigaword [Rush et al., 2015];

    3. Hippocorpus [Sap et al., 2020];

    4. WikiPlots [wik];

    5. WritingPrompts [Fan et al., 2018]; we split the original training set into pre-training and validation portions.

    Since English Gigaword is a large summarization dataset, we use the summary as the title and generate the passage from it, thereby enriching the examples for story generation.

    We utilize the following datasets for fine-tuning evaluation:

    1. ROCStories [Mostafazadeh et al., 2016];

    2. WritingPrompts [Fan et al., 2018]; for a fair comparison, we fine-tune our model on the splits created by Guan et al. [2021], who divided the original validation and test sets into training, validation, and test sets.

  • Task-oriented dialogue systems meet the real-life needs of users, such as restaurant reservations and flight bookings. Following Su et al. [2022], we use the following datasets for pre-training:

    1. CamRest676 [Wen et al., 2017];

    2. Frames [El Asri et al., 2017];

    3. KVRET [Eric et al., 2017];

    4. MetaLWOZ [Lee et al., 2019];

    5. MSR-E2E [Li et al., 2018];

    6. MultiWOZ [Budzianowski et al., 2018];

    7. Schema-Guided [Rastogi et al., 2020a];

    8. TaskMaster [Byrne et al., 2019];

    9. WOZ [Mrkšić et al., 2017].

    We utilize the following datasets for fine-tuning evaluation:

    1. MultiWOZ [Budzianowski et al., 2018]; we use version 2.0.

  • Text style transfer modifies the style (e.g., sentiment and formality) of given texts while retaining their style-independent content. We utilize the following datasets for fine-tuning evaluation:

    1. GYAFC [Rao and Tetreault, 2018], which has two sub-domains “Entertainment and Music” (E&M) and “Family and Relationships” (F&R).

  • Text summarization condenses a long document into a brief text while retaining the essential details. We use the following datasets for pre-training:

    1. English Gigaword [Graff et al., 2003]; we use the variant provided by Rush et al. [2015];

    2. MediaSum [Zhu et al., 2021];

    3. MSNews [Liu et al., 2021a];

    4. Newsroom [Grusky et al., 2018];

    5. WikiHow [Koupaee and Wang, 2018].

    We utilize the following datasets for fine-tuning evaluation:

    1. CNN/DailyMail [Hermann et al., 2015]; we use the variant provided by See et al. [2017];

    2. SAMSum [Gliwa et al., 2019];

    3. XSum [Narayan et al., 2018].

To better compare with ExT5 [Aribandi et al., 2022], we utilize the language generation benchmark GEM [Gehrmann et al., 2021] for fine-tuning evaluation. GEM includes five tasks:


  • Commonsense generation:

    1. CommonGen (CG) [Lin et al., 2020a].

  • Data-to-text generation:

    1. DART [Nan et al., 2021];

    2. E2E NLG cleaned [Novikova et al., 2017];

    3. ToTTo [Parikh et al., 2020];

    4. WebNLG [Gardent et al., 2017].

  • Dialogue system:

    1. Schema-Guided Dialog (SGD) [Rastogi et al., 2020b].

  • Text simplification:

    1. WikiAuto + Turk/ASSET (WiA-T/A) [Jiang et al., 2020, Xu et al., 2016, Alva-Manchego et al., 2020].

  • Text summarization:

    1. Wiki-Lingua (WLE) [Ladhak et al., 2020].

To test the generalization ability of our model, we also utilize the natural language understanding benchmark GLUE [Wang et al., 2019], whose datasets we group into three tasks:


  • Natural language inference:

    1. MNLI [Williams et al., 2018];

    2. QNLI [Rajpurkar et al., 2016, Wang et al., 2019];

    3. RTE [Dagan et al., 2006, Haim et al., 2006, Giampiccolo et al., 2007, Bentivogli et al., 2009].

  • Paraphrase detection:

    1. MRPC [Dolan and Brockett, 2005];

    2. QQP (Quora Question Pairs);

    3. STS-B [Cer et al., 2017].

  • Text classification:

    1. CoLA [Warstadt et al., 2019];

    2. SST-2 [Socher et al., 2013].

b.2 Data Leakage

Since our model is pre-trained on a large number of labeled datasets, it may have “seen” examples from the fine-tuning test sets during pre-training, which would lead to an unfair comparison with other methods. Hence, we eliminate pre-training examples that share an n-gram overlap with any of the test sets. Following Brown et al. [2020], n is the 5th percentile example length in words, and the maximum value of n is capped at 13. Finally, we removed the overlapping examples from the pre-training datasets. The number of “cleaned” training examples for each dataset can be found in Table 6.
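As an illustration of this filtering step, a minimal sketch is given below. It assumes whitespace tokenization, lower-casing, and a fixed n per dataset; the helper names are ours, and the actual script may differ in detail.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(words: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams (as tuples) in a word sequence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_test_index(test_texts: Iterable[str], n: int) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that appears in any test example."""
    index: Set[Tuple[str, ...]] = set()
    for text in test_texts:
        index |= ngrams(text.lower().split(), n)
    return index


def filter_pretraining(train_texts: Iterable[str],
                       test_index: Set[Tuple[str, ...]],
                       n: int) -> List[str]:
    """Keep only pre-training examples that share no n-gram with the test sets."""
    return [text for text in train_texts
            if ngrams(text.lower().split(), n).isdisjoint(test_index)]


# Toy usage: the first example overlaps with the test set and is dropped.
test_index = build_test_index(["the quick brown fox jumps"], n=3)
print(filter_pretraining(["a quick brown fox jumps high", "totally unrelated text"],
                         test_index, n=3))
```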

Dataset #Train Cleaned #Train #Valid #Test Input Output License
AGENDA 38,720 38,720 1,000 1,000 52.1 141.2 N/A
ENT-DESC 88,652 88,652 11,081 11,081 279.9 31.0 N/A
GenWiki 681,436 681,436 75,716 1,000 21.4 29.5 MIT
LogicNLG 28,450 28,450 4,260 4,305 178.4 14.2 MIT
TEKGEN 6,310,061 6,307,995 788,746 796,982 17.0 21.2 CC BY-SA 2.0
WEATHERGOV 25,000 25,000 1,000 3,528 148.7 30.6 N/A
WikiTableT 1,453,794 1,452,778 4,533 4,351 81.0 99.7 MIT
Cleaned OS Dialogs 13,355,487 13,355,368 1,483,944 - 75.5 16.7 N/A
CMUDoG 82,818 82,818 5,555 14,510 433.0 12.2 N/A
Curiosity 64,930 64,551 8,539 8,495 144.4 20.2 CC BY-NC 4.0
DREAM 14,264 14,242 4,709 4,766 75.6 13.6 N/A
Empathetic Dialogues 64,636 64,636 9,308 8,426 52.7 12.9 CC BY-NC 4.0
Movie Dialog 762,751 762,711 8,216 8,066 126.9 44.0 N/A
MuTual 33,691 33,691 4,090 3,248 53.6 14.5 N/A
OpenDialKG 69,680 69,680 7,743 - 54.2 12.4 CC BY-NC 4.0
Topical-Chat 179,750 179,750 22,295 22,452 223.3 20.0 CDLA-Sharing-1.0
Wizard of Wikipedia 148,357 147,702 15,767 15,564 297.0 16.7 MIT
HotpotQA 90,447 87,815 7,405 - 187.9 2.2 CC BY-SA 4.0
MS MARCO 681,445 681,226 77,580 - 68.7 13.3 N/A
MSQG 198,058 198,029 11,008 - 48.1 3.7 CC BY-SA 4.0
NarrativeQA 65,494 65,494 6,922 21,114 584.1 4.2 Apache 2.0
Natural Questions 96,676 96,676 10,693 6,490 9.0 2.1 CC BY-SA 3.0
NewsQA 97,850 97,700 5,486 5,396 726.8 5.0 MIT
QuAC 83,568 83,485 31,906 - 487.9 12.5 CC BY-SA 4.0
TriviaQA 78,785 78,785 8,837 11,313 14.0 2.0 Apache 2.0
WebQuestions 8,933 8,933 4,863 4,863 6.7 2.4 CC BY 4.0
HotpotQA 90,440 87,808 6,972 - 79.6 19.8 CC BY-SA 4.0
MS MARCO 681,445 681,226 77,580 - 75.9 6.0 N/A
MSQG 198,058 198,029 11,008 11,022 45.9 6.0 CC BY-SA 4.0
NarrativeQA 65,494 65,494 6,922 21,114 579.7 8.6 Apache 2.0
NewsQA 97,850 97,700 5,486 5,396 724.2 7.6 MIT
QuAC 69,109 69,026 26,301 - 496.7 6.5 CC BY-SA 4.0
ChangeMyView 42,462 42,459 6,480 7,562 17.9 104.1 MIT
English Gigaword 3,803,957 3,802,620 189,651 1,951 8.8 33.3 MIT
Hippocorpus 6,168 6,168 686 - 34.1 262.6 CDLA-Permissive 2.0
WikiPlots 101,642 101,641 11,294 - 3.4 338.5 N/A
WritingPrompts 272,600 272,518 15,620 15,138 28.4 630.8 MIT
CamRest676 4,872 4,872 616 - 55.3 9.4 N/A
Frames 26,631 26,631 2,106 - 116.1 13.0 MIT
KVRET 14,136 14,136 1,616 - 30.5 9.3 N/A
MetaLWOZ 176,073 176,073 17,912 - 45.6 8.0 N/A
MSR-E2E 103,362 103,362 5,235 - 51.3 12.8 Microsoft
Schema-Guided 494,946 494,933 73,089 - 120.8 12.5 CC BY-SA 4.0
TaskMaster 249,664 249,662 20,680 - 95.6 12.0 CC BY 4.0
WOZ 6,364 6,359 1,260 - 47.0 10.6 N/A
English Gigaword 3,803,957 3,802,620 189,651 1,951 33.3 8.8 MIT
MediaSum 443,596 442,021 10,000 10,000 1641.0 14.4 N/A
MSNews 136,082 135,937 7,496 7,562 309.9 9.8 CC BY-SA 4.0
Newsroom 995,041 989,351 108,837 108,862 642.4 26.7 N/A
WikiHow 157,252 157,247 5,599 5,577 502.6 45.6 CC BY-NC-SA
Table 6: The statistics and licenses of datasets for pre-training our MVP model. The #Train, #Valid, and #Test denote the number of examples in the train, valid, and test sets, respectively. Cleaned #Train represents the number of training examples after filtering. Input and Output are the average number of words (split by space) in the input and output sequences, respectively. These setups and abbreviations are the same below.
Task Dataset #Train #Valid #Test Input Output License
Commonsense generation CommonGen 67,389 993 5.5 11.6 MIT
Data-to-text generation DART 62,659 2,768 27.5 21.5 MIT
E2E 33,525 4,299 9.5 20.6 CC BY-SA 4.0
ToTTo 120,761 7,700 37.8 18.0 CC BY-SA 3.0
WebNLG 34,338 4,313 4,222 18.0 19.9 CC BY-NA-SA 4.0
WebNLG (GEM) 35,426 1,667 17.7 22.7 CC BY-NA-SA 4.0
WikiBio 582,659 72,831 72,831 81.6 26.1 CC BY-SA 3.0
Open-ended dialogue DailyDialog 76,052 7,069 6,740 72.5 13.9 CC BY-NC-SA 4.0
DSTC7-AVSD 76,590 17,870 1,710 148.2 11.5 MIT
PersonaChat 122,499 14,602 14,056 132.1 11.9 MIT
SGD 164,982 10,000 134.7 11.3 CC BY-SA 4.0
Natural language inference MNLI-m 392,702 9,815 9,796 29.8 Mixed
MNLI-mm 9,832 9,847
QNLI 104,743 5,463 5,463 36.6 CC BY-SA 4.0
RTE 2,490 277 3,000 51.0 N/A
Paraphrase generation Quora 137,185 3,000 3,000 10.9 10.8 N/A
Paraphrase detection MRPC 3,668 408 1,725 43.8 N/A
QQP 363,846 40,430 390,965 22.3 N/A
STS-B 5,749 1,500 1,379 20.3 N/A
Question answering CoQA 107,286 31,621 349.4 2.6 Mixed
SQuAD 75,722 10,570 11,877 156.2 3.6 CC BY-SA 4.0
Question generation CoQA 107,286 31,621 346.6 5.5 Mixed
SQuAD 75,722 10,570 11,877 148.3 11.6 CC BY-SA 4.0
Story generation ROCStories 176,688 9,816 4,909 9.0 40.7 N/A
WritingPrompts 53,516 4,000 2,000 25.5 150.4 MIT
Task-oriented dialogue MultiWOZ 170,220 22,074 22,116 128.3 11.3 MIT
Text classification CoLA 8,551 1,043 1,063 7.7 N/A
SST-2 67,349 872 1,821 9.8 N/A
Text simplification WiA-A 483,801 20,000 359 26.2 21.5 Mixed
WiA-T 359
Text style transfer GYAFC-E&M 52,595 11,508 1,416 9.9 10.6 N/A
GYAFC-F&R 51,967 11,152 1,332 10.7 11.3
Text summarization CNN/DailyMail 287,227 13,368 11,490 679.8 48.3 MIT
SAMSum 14,732 818 819 103.4 20.3 CC BY-NC-ND 4.0
WLE 99,020 28,614 367.6 33.4 CC0 1.0
XSum 204,045 11,332 11,334 373.7 21.1 MIT
Table 7: The statistics and licenses of datasets for fine-tuning our MVP model. The license of the MNLI dataset is composed of OANC, CC BY-SA 3.0, and CC BY 3.0. The license of the CoQA dataset is composed of CC BY-SA 4.0, MSR-LA, and Apache 2.0. The license of the WiA-A/T datasets is composed of CC BY-NC 3.0, CC BY-NC 4.0, and GNU General Public License v3.0.

Appendix C Fine-tuning and Evaluation Details

In this section, we introduce the details for fine-tuning and evaluating each downstream task.

For the experiments in Section 4 (Tables 1 and 2), and Appendix D.1 (Table 8), the fine-tuning details are introduced in Section 4, and the evaluation details are presented as follows:


  • For data-to-text generation tasks, we use BLEU, ROUGE-L, and METEOR for evaluation, computed with the script provided by Chen et al. [2020b] (https://github.com/wenhuchen/Data-to-text-Evaluation-Metric);

  • For open-ended dialogue system tasks, we use BLEU-1, BLEU-2, Distinct-1, and Distinct-2 for evaluation; for DSTC7-AVSD we also report CIDEr [Vedantam et al., 2015]. We employ NLTK 3.5 with a smoothing function to compute BLEU for PersonaChat and DailyDialog (see the sketch after this list), and use the DialogVED script (https://github.com/lemuria-wchen/DialogVED/blob/main/src/utils/evaluate.py) to evaluate DSTC7-AVSD;

  • For question answering tasks, we use Exact Match (EM) and Macro-averaged F1 score (F1) for evaluation. We use the provided scripts for CoQA (https://github.com/PaddlePaddle/ERNIE/blob/repro/ernie-gen/eval/tasks/coqa/eval.py) and SQuAD (https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py);

  • For question generation tasks, we use BLEU-4, ROUGE-L, and METEOR for evaluation, computed with the script provided by Dong et al. [2019] (https://github.com/microsoft/unilm/blob/master/unilm-v1/src/qg/eval.py);

  • For story generation, we employ nucleus sampling with the same p and temperature as Guan et al. [2021]. We use corpus BLEU-1, BLEU-2, and Distinct scores for evaluation, and we use NLTK 3.5 to calculate corpus BLEU following Guan et al. [2021];

  • For task-oriented dialogue system tasks, we use BLEU, inform (rate), success (rate), and the combined score for evaluation. Inform and success are two accuracy metrics designed specifically for task-oriented dialogue systems, and the combined score is defined as BLEU + 0.5 × (Inform + Success) [Budzianowski et al., 2018]; the combined score also appears in the sketch after this list. We use the script provided by Su et al. [2022] (https://github.com/awslabs/pptod/blob/main/E2E_TOD/eval.py);

  • For text summarization tasks, we use ROUGE-1, ROUGE-2, and ROUGE-L for evaluation, computed with the files2rouge toolkit (https://github.com/pltrdy/files2rouge).
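As referenced in the dialogue bullets above, the sketch below illustrates smoothed sentence-level BLEU with NLTK, the Distinct-n metric, and the MultiWOZ combined score. It is a minimal illustration under our own assumptions (whitespace tokenization, smoothing method7); the evaluation scripts linked above remain authoritative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_1_2(reference: str, hypothesis: str):
    """Sentence-level BLEU-1/BLEU-2; the smoothing method here is an assumption."""
    ref = [reference.split()]
    hyp = hypothesis.split()
    smooth = SmoothingFunction().method7
    b1 = sentence_bleu(ref, hyp, weights=(1.0,), smoothing_function=smooth)
    b2 = sentence_bleu(ref, hyp, weights=(0.5, 0.5), smoothing_function=smooth)
    return b1, b2


def distinct_n(hypotheses, n: int) -> float:
    """Distinct-n: ratio of unique n-grams to all n-grams over the generated responses."""
    total, unique = 0, set()
    for hyp in hypotheses:
        words = hyp.split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / max(total, 1)


def combined_score(bleu: float, inform: float, success: float) -> float:
    """MultiWOZ combined score: BLEU + 0.5 * (Inform + Success)."""
    return bleu + 0.5 * (inform + success)


print(bleu_1_2("i am doing well how are you", "i am doing well thanks"))
print(distinct_n(["i am doing well", "i am fine thanks"], n=2))
print(combined_score(20.3, 84.1, 75.5))
```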

For the experiments in Section 5 (Tables 3 and 4), the fine-tuning and evaluation details are as follows:


  • For paraphrase generation tasks, we employ the fine-tuning and evaluation scripts provided by AESOP [Sun et al., 2021] (https://github.com/PlusLabNLP/AESOP). The evaluation metrics are BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR.

  • For text style transfer tasks, we employ the fine-tuning and evaluation scripts provided by SC & BLEU [Lai et al., 2021] (https://github.com/laihuiyuan/pre-trained-formality-transfer). We conduct informal-to-formal transfer and train the model on the data from both the E&M and F&R domains, following Lai et al. [2021]. The evaluation metrics are BLEU, style accuracy, and HM. Accuracy is calculated by a pre-trained TextCNN to evaluate the style strength, and HM denotes the harmonic mean of BLEU and style accuracy [Lai et al., 2021]; a small sketch of HM follows this list.

  • For GLUE tasks, we utilize the fine-tuning code provided by Hugging Face (https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification). The hyper-parameters are consistent with those of the original BART [Lewis et al., 2020] (https://github.com/facebookresearch/fairseq/blob/main/examples/bart/README.glue.md). The evaluation scores are computed by the official website (https://gluebenchmark.com/).
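As a small aside to the style transfer bullet above, HM can be computed as follows; this is the standard harmonic mean, assuming BLEU and style accuracy are reported on the same 0–100 scale.

```python
def harmonic_mean(bleu: float, style_acc: float) -> float:
    """HM of BLEU and style accuracy (both assumed to be on a 0-100 scale)."""
    if bleu + style_acc == 0:
        return 0.0
    return 2 * bleu * style_acc / (bleu + style_acc)


print(harmonic_mean(30.0, 90.0))  # 45.0
```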

For the experiments on the GEM benchmark in Appendix D.2 (Table 9), the fine-tuning settings are the same as those described in Section 4. We use BLEU, ROUGE, and METEOR for evaluation, computed with the GEM evaluation scripts (https://github.com/GEM-benchmark/GEM-metrics).

Appendix D Additional Results

In this section, we provide additional results of our MVP model and other baselines.

d.1 Results of Common Datasets

We also conduct experiments on various common datasets under full tuning settings. Due to space limits in Section 4, these results are shown in Table 8. These results share a similar trend with those in Section 4, and our models achieve SOTA performance on many of the metrics.

Summarization (XSum) Summarization (SAMSum) QG (CoQA)
R- R- R-L R- R- R-L B- ME R-L
SOTA 49.57 (a) 25.08 41.81 54.3 (b) 29.3 45.2 15.78 (c) 40.15 50.98
BART 45.80±0.10 22.56±0.02 37.50±0.03 53.64±0.17 29.18±0.31 49.58±0.36 22.24±0.14 24.00±0.08 54.10±0.18
MVP 45.67±0.04 22.53±0.06 37.41±0.03 53.82±0.19 29.03±0.43 49.32±0.16 22.52±0.21 24.20±0.16 54.50±0.22
MVP+S 45.59±0.06 22.54±0.04 37.39±0.03 53.78±0.06 29.42±0.03 49.60±0.09 22.45±0.15 24.07±0.13 54.63±0.22
Story Generation (WritingPrompts) Open-ended Dialogue (DailyDialog) WikiBio
B- B- D- D- B- B- D- D- B-
SOTA 22.4 (d) 8.4 31.3 46.1 (e) 40.7 4.1 22.2 45.1 (f)
BART 31.85±0.47 12.53±0.22 1.99±0.17 61.90±2.52 51.85±0.14 43.59±0.11 6.47±0.06 35.69±0.42 48.37±0.19
MVP 31.81±0.53 12.80±0.05 2.58±0.14 69.45±1.24 52.34±0.35 43.93±0.30 6.39±0.06 35.65±0.33 48.42±0.23
MVP+S 29.18±0.32 11.11±0.16 3.71±0.32 78.02±3.36 51.04±0.57 42.87±0.48 6.70±0.14 36.84±0.49 48.19±0.09
Open-ended Dialogue (DSTC7-AVSD) QA (SQuAD)
B- B- B- B- ME R-L CIDEr F1 EM
SOTA 83.2 (e) 70.5 59.8 50.6 31.4 63.8 1.391 91.26 (g) 96.22
BART 82.48±0.52 69.40±0.40 58.57±0.40 49.33±0.40 31.39±0.39 64.07±0.25 1.401±0.01 84.95±0.34 91.98±0.11
MVP 83.37±0.64 70.50±0.47 59.73±0.47 50.42±0.32 31.59±0.33 64.54±0.47 1.429±0.01 86.44±0.21 93.04±0.08
MVP+S 83.76±0.07 70.80±0.22 60.03±0.26 50.65±0.32 31.07±0.28 64.10±0.19 1.403±0.01 86.78±0.16 93.21±0.05
Table 8: The results on six seen tasks under full tuning settings. WikiBio is the dataset for the data-to-text generation task. SOTA references: (a) [Nguyen et al., 2021]; (b) [Liu et al., 2021b]; (c) [Gu et al., 2021]; (d) [Guan et al., 2021]; (e) [Chen et al., 2022]; (f) [Chen et al., 2020b]; (g) [Raffel et al., 2020].

d.2 Results on the GEM Benchmark

To better compare with ExT5 [Aribandi et al., 2022], we conduct experiments on the GEM benchmark [Gehrmann et al., 2021]. For the “unseen” commonsense generation and text simplification tasks, we utilize the prompts of data-to-text generation and summarization, respectively. The results are presented in Table 9. Note that, because the fine-tuning and dataset hyper-parameters of ExT5 and GEM are unavailable, the results we reproduce on some datasets differ from those in the original papers [Aribandi et al., 2022, Gehrmann et al., 2021]. Regardless, our MVP models outperform ExT5 on the majority of metrics.

Data-to-text (DART) Data-to-text (E2E) Data-to-text (ToTTo)
B- R- ME B- R- ME B- R- ME
T5.1.1 34.31 45.22 36.3 42.57 46.60 38.2 39.79 49.90 36.8
ExT5 36.62 48.14 37.6 42.25 46.70 38.1 40.14 50.33 36.9
BART 38.89±0.35 48.76±0.07 38.31±0.28 37.24±0.43 47.76±0.18 39.24±0.21 50.39±0.04 55.14±0.12 41.11±0.05
MVP 39.13±0.06 48.92±0.11 38.53±0.13 37.38±0.17 47.96±0.09 39.39±0.21 50.58±0.12 55.24±0.12 41.27±0.04
MVP+S 38.83±0.17 48.49±0.18 38.41±0.05 37.32±0.35 47.40±0.25 38.90±0.11 50.69±0.13 55.52±0.05 41.29±0.07
Data-to-text (WebNLG) CommonGen (CG) Dialogue (SGD)
B- R- ME B- R- ME B- R- ME
T5.1.1 31.67 43.31 34.4 8.38 17.01 20.2 33.15 36.17 32.4
ExT5 35.03 48.17 36.5 9.68 19.04 21.4 34.74 37.77 33.0
BART 46.67±0.30 58.63±0.44 42.19±0.07 32.68±0.85 37.16±0.28 32.81±0.24 45.14±0.11 47.93±0.05 38.19±0.11
MVP 47.03±0.06 59.00±0.19 42.34±0.08 32.59±0.92 37.71±0.59 33.00±0.07 45.63±0.10 48.29±0.14 38.48±0.08
MVP+S 47.03±0.24 59.03±0.20 42.28±0.09 34.10±0.35 37.87±0.58 33.11±0.11 45.24±0.11 48.25±0.20 38.47±0.40
Simplification (WiA-A) Simplification (WiA-T) Summarization (WLE)
B- R- ME B- R- ME B- R- ME
T5.1.1 29.30 38.37 30.1 42.12 50.52 36.2 15.55 20.47 19.6
ExT5 29.23 37.98 30.0 41.39 50.38 35.8 16.64 21.16 20.4
BART 71.07±0.21 71.09±0.06 47.46±1.69 90.81±0.24 83.36±0.07 57.58±0.19 18.51±0.01 22.57±0.05 21.78±0.09
MVP 71.55±0.18 70.88±0.17 48.19±0.13 91.73±0.20 83.46±0.08 57.34±0.06 18.80±0.10 22.84±0.08 21.95±0.06
MVP+S 70.37±0.23 70.65±0.07 47.70±0.22 91.12±0.30 83.59±0.23 56.95±0.25 18.52±0.12 22.57±0.06 22.02±0.04
Table 9: The results on the GEM benchmark under full tuning settings. We utilize the large versions of T5.1.1 and ExT5; all of their results are taken from Aribandi et al. [2022].

d.3 Results without Fine-tuning

Considering that our MVP model has already been pre-trained on several tasks, we conduct experiments on these “seen” tasks without fine-tuning the model. To some degree, this setting can be viewed as zero-shot learning; nonetheless, it does not conform to the definition of true zero-shot settings [Perez et al., 2021]. To avoid controversy, we simply refer to this setting as “without fine-tuning”.

We include T0-3B [Sanh et al., 2022] as our baseline. The results are listed in Table 10. On all tasks, methods without fine-tuning perform significantly worse than those under full tuning settings. This suggests that zero-shot strategies that are effective for NLU tasks may not produce satisfactory results for NLG tasks. Even though our model has acquired task knowledge, it struggles to perform well in a new domain without being fine-tuned. Thus, we focus mainly on full tuning settings in this paper.

Summarization (CNN/DM) Data-to-text (WebNLG) QG (SQuAD) QA (CoQA)
R- R- R-L B- ME R-L B- ME R-L F1 EM
FT BART 44.47±0.10 21.50±0.14 41.35±0.08 67.33±0.06 47.78±0.07 76.83±0.04 25.08±0.13 26.73±0.18 52.55±0.07 74.00±0.17 84.07±0.21
FT MVP 44.45±0.05 21.44±0.12 41.34±0.08 67.32±0.10 47.94±0.13 76.70±0.26 25.91±0.07 27.22±0.10 53.08±0.16 75.50±0.20 85.07±0.21
T0 01.40 10.20 18.43 3.06 12.43 14.91 06.60 13.30
MVP 29.50 11.29 25.92 34.42 31.33 52.33 2.90 13.94 15.48 18.20 29.40
MVP+S 25.60 09.51 22.67 39.43 34.32 55.34 2.96 15.23 18.23 37.30 52.40
Story Generation (ROCStories) Open-ended Dialogue (PersonaChat) TODS (MultiWOZ)
B- B- D- D- B- B- D- D- B- Success Inform
FT BART 33.79±0.13 15.78±0.21 3.43±0.17 78.76±2.15 49.58±1.12 39.24±0.90 1.44±0.09 8.89±0.57 20.17±0.63 75.40±1.22 84.40±1.15
FT MVP 33.96±0.08 15.96±0.05 3.17±0.15 76.11±1.38 49.56±0.44 40.41±0.10 1.55±0.07 10.20±0.46 20.34±0.37 75.47±0.40 84.07±0.15
T0 08.69 3.02 4.37 35.49 23.20 23.57 2.56 12.06 0.02 2.50 22.10
MVP 01.01 0.31 7.18 86.26 35.54 32.71 2.87 16.38 3.08 2.50 22.20
MVP+S 10.52 3.54 2.13 69.55 37.04 33.38 2.66 14.84 0.38 2.50 22.10
Table 10: The results on seven seen tasks without fine-tuning. We also include the results of BART and MVP under full tuning (denoted as FT) settings for comparison. Given that T0 has been pre-trained on the CNN/DailyMail dataset, we exclude its results on that dataset (denoted as “–”) to provide a fair comparison.

Appendix E Qualitative Examples

In this section, we showcase the linearized inputs, task instructions, and corresponding outputs of a single dataset for each task in Section 4. We provide the outputs of BART, MVP, and MVP+S under full tuning settings. To minimize human intervention, we select the first and second instances of each test set, using a fixed random seed.
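The inputs shown below concatenate a natural-language task instruction with linearized content, using separators such as [SEP] and [X_SEP]. The following sketch, built around a hypothetical helper build_input, shows how such strings might be assembled; only the instruction wording and separators are taken from the examples themselves.

```python
def build_input(instruction: str, segments, sep: str = " [SEP] ", xsep: str = " [X_SEP] ") -> str:
    """Join a task instruction with linearized content.

    `segments` is a list of groups: items within a group are joined with [SEP],
    and groups are joined with [X_SEP], mirroring the examples below.
    """
    body = xsep.join(sep.join(group) for group in segments)
    return f"{instruction}: {body}"


# Data-to-text style input (cf. the WebNLG examples below).
print(build_input("Describe the following data",
                  [["Abilene,_Texas | cityServed | Abilene_Regional_Airport"]]))

# Question generation style input: answer [SEP] passage (cf. the SQuAD examples below).
print(build_input("Generate the question based on the answer",
                  [["Saint Bernadette Soubirous", "Architecturally , the school has a Catholic character ."]]))
```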

Input
Summarize: Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." Robin’s comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps. All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites. The publications said that they watched the video, which was found by a source close to the investigation. "One can hear cries of ’My God’ in several languages," Paris Match reported. "Metallic banging can also be heard more than three times, perhaps of the pilot trying to open the cockpit door with a heavy object. Towards the end, after a heavy shake, stronger than the others, the screaming intensifies. Then nothing." "It is a very disturbing scene," said Julian Reichelt, editor-in-chief of Bild online. An official with France’s accident investigation agency, the BEA, said the agency is not aware of any such video. Lt. Col. Jean-Marc Menichini, a French Gendarmerie spokesman in charge of communications on rescue efforts around the Germanwings crash site, told CNN that the reports were "completely wrong" and "unwarranted." Cell phones have been collected at the site, he said, but that they "hadn’t been exploited yet." Menichini said he believed the cell phones would need to be sent to the Criminal Research Institute in Rosny sous-Bois, near Paris, in order to be analyzed by specialized technicians working hand-in-hand with investigators. But none of the cell phones found so far have been sent to the institute, Menichini said. Asked whether staff involved in the search could have leaked a memory card to the media, Menichini answered with a categorical "no." Reichelt told "Erin Burnett: Outfront" that he had watched the video and stood by the report, saying Bild and Paris Match are "very confident" that the clip is real. He noted that investigators only revealed they’d recovered cell phones from the crash site after Bild and Paris Match published their reports. "That is something we did not know before. … Overall we can say many things of the investigation weren’t revealed by the investigation at the beginning," he said. What was mental state of Germanwings co-pilot? German airline Lufthansa confirmed Tuesday that co-pilot Andreas Lubitz had battled depression years before he took the controls of Germanwings Flight 9525, which he’s accused of deliberately crashing last week in the French Alps. Lubitz told his Lufthansa flight training school in 2009 that he had a "previous episode of severe depression," the airline said Tuesday. Email correspondence between Lubitz and the school discovered in an internal investigation, Lufthansa said, included medical documents he submitted in connection with resuming his flight training. The announcement indicates that Lufthansa, the parent company of Germanwings, knew of Lubitz’s battle with depression, allowed him to continue training and ultimately put him in the cockpit. 
Lufthansa, whose CEO Carsten Spohr previously said Lubitz was 100% fit to fly, described its statement Tuesday as a "swift and seamless clarification" and said it was sharing the information and documents – including training and medical records – with public prosecutors. Spohr traveled to the crash site Wednesday, where recovery teams have been working for the past week to recover human remains and plane debris scattered across a steep mountainside. He saw the crisis center set up in Seyne-les-Alpes, laid a wreath in the village of Le Vernet, closer to the crash site, where grieving families have left flowers at a simple stone memorial. Menichini told CNN late Tuesday that no visible human remains were left at the site but recovery teams would keep searching. French President Francois Hollande, speaking Tuesday, said that it should be possible to identify all the victims using DNA analysis by the end of the week, sooner than authorities had previously suggested. In the meantime, the recovery of the victims’ personal belongings will start Wednesday, Menichini said. Among those personal belongings could be more cell phones belonging to the 144 passengers and six crew on board. Check out the latest from our correspondents. The details about Lubitz’s correspondence with the flight school during his training were among several developments as investigators continued to delve into what caused the crash and Lubitz’s possible motive for downing the jet. A Lufthansa spokesperson told CNN on Tuesday that Lubitz had a valid medical certificate, had passed all his examinations and "held all the licenses required." Earlier, a spokesman for the prosecutor’s office in Dusseldorf, Christoph Kumpa, said medical records reveal Lubitz suffered from suicidal tendencies at some point before his aviation career and underwent psychotherapy before he got his pilot’s license. Kumpa emphasized there’s no evidence suggesting Lubitz was suicidal or acting aggressively before the crash. Investigators are looking into whether Lubitz feared his medical condition would cause him to lose his pilot’s license, a European government official briefed on the investigation told CNN on Tuesday. While flying was "a big part of his life," the source said, it’s only one theory being considered. Another source, a law enforcement official briefed on the investigation, also told CNN that authorities believe the primary motive for Lubitz to bring down the plane was that he feared he would not be allowed to fly because of his medical problems. Lubitz’s girlfriend told investigators he had seen an eye doctor and a neuropsychologist, both of whom deemed him unfit to work recently and concluded he had psychological issues, the European government official said. But no matter what details emerge about his previous mental health struggles, there’s more to the story, said Brian Russell, a forensic psychologist. "Psychology can explain why somebody would turn rage inward on themselves about the fact that maybe they weren’t going to keep doing their job and they’re upset about that and so they’re suicidal," he said. "But there is no mental illness that explains why somebody then feels entitled to also take that rage and turn it outward on 149 other people who had nothing to do with the person’s problems." Germanwings crash compensation: What we know. Who was the captain of Germanwings Flight 9525? CNN’s Margot Haddad reported from Marseille and Pamela Brown from Dusseldorf, while Laura Smith-Spark wrote from London. 
CNN’s Frederik Pleitgen, Pamela Boykoff, Antonia Mortensen, Sandrine Amiel and Anna-Maja Rappard contributed to this report.
Gold
Match are "very confident" the video clip is real, an editor says. Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says.
BART
Prosecutor: "A person who has such a video needs to immediately give it to the investigators" French Gendarmerie spokesman: Cell phones have been collected at the site, but haven’t been exploited.
MVP
Prosecutor: "So far no videos were used in the crash investigation" An official with France’s accident investigation agency says the agency is not aware of any video.
MVP+S
Prosecutor: "So far no videos were used in the crash investigation" Two publications claim to have found a cell phone video of the crash. "It is a very disturbing scene," says editor-in-chief of Bild online.
Table 11: The first instance from the CNN/Daily Mail dataset. Task instructions are labeled in italic. The setting is the same below.
Input
Summarize: The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC’s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians’ efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday’s ceremony, said it was a move toward greater justice. "As Palestine formally becomes a State Party to the Rome Statute today, the world is also a step closer to ending a long era of impunity and injustice," he said, according to an ICC news release. "Indeed, today brings us closer to our shared goals of justice and peace." Judge Kuniko Ozaki, a vice president of the ICC, said acceding to the treaty was just the first step for the Palestinians. "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. These are substantive commitments, which cannot be taken lightly," she said. Rights group Human Rights Watch welcomed the development. "Governments seeking to penalize Palestine for joining the ICC should immediately end their pressure, and countries that support universal acceptance of the court’s treaty should speak out to welcome its membership," said Balkees Jarrah, international justice counsel for the group. "What’s objectionable is the attempts to undermine international justice, not Palestine’s decision to join a treaty to which over 100 countries around the world are members." In January, when the preliminary ICC examination was opened, Israeli Prime Minister Benjamin Netanyahu described it as an outrage, saying the court was overstepping its boundaries. The United States also said it "strongly" disagreed with the court’s decision. "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement. It urged the warring sides to resolve their differences through direct negotiations. "We will continue to oppose actions against Israel at the ICC as counterproductive to the cause of peace," it said. But the ICC begs to differ with the definition of a state for its purposes and refers to the territories as "Palestine." While a preliminary examination is not a formal investigation, it allows the court to review evidence and determine whether to investigate suspects on both sides. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." The war between Israel and Hamas militants in Gaza last summer left more than 2,000 people dead. The inquiry will include alleged war crimes committed since June. The International Criminal Court was set up in 2002 to prosecute genocide, crimes against humanity and war crimes. 
CNN’s Vasco Cotovio, Kareem Khadder and Faith Karimi contributed to this report.
Gold
Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June. Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis.
BART
Palestinian Authority becomes 123rd member of the International Criminal Court. The move gives the court jurisdiction over alleged crimes in Palestinian territories. Israel and the United States opposed the Palestinians’ efforts to join the body.
MVP
"Today brings us closer to our shared goals of justice and peace," foreign minister says. The Palestinians signed the ICC’s founding Rome Statute in January. The move gives the court jurisdiction over alleged crimes in Palestinian territories.
MVP+S
"Today brings us closer to our shared goals of justice and peace," foreign minister says. The United States says it "strongly" disagrees with the decision. The Palestinian Authority is the 123rd member of the International Criminal Court.
Table 12: The second instance from the CNN/Daily Mail dataset.
Input
Describe the following data: Abilene,_Texas | cityServed | Abilene_Regional_Airport
Gold
Abilene, Texas is served by the Abilene regional airport.
Abilene Regional Airport serves the city of Abilene in Texas.
BART
Abilene Regional Airport serves the city of Abilene in Texas.
MVP
Abilene Regional Airport serves the city of Abilene, Texas.
MVP+S
Abilene Regional Airport serves the city of Abilene, Texas.
Table 13: The first instance from the WebNLG dataset, which has two gold target sentences.
Input
Describe the following data: "Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas" | location | Adolfo_Suárez_Madrid–Barajas_Airport
Gold
Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
Adolfo Suarez Madrid-Barajas airport is located at Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
Adolfo Suarez Madrid-Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastian de los Reyes and Alcobendas.
BART
Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
MVP
Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
MVP+S
Adolfo Suárez Madrid–Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
Table 14: The second instance from the WebNLG dataset, which has three gold target sentences.
Input
Generate the question based on the answer: Saint Bernadette Soubirous [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ’ s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .
Gold
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France ?
BART
Who is believed to have appeared to the Virgin Mary at Lourdes ?
MVP
Who did the Virgin Mary appear to in Lourdes ?
MVP+S
The Grotto is a replica of the grotto at Lourdes , France where the Virgin Mary appeared to whom ?
Table 15: The first instance from the SQuAD dataset.

Input
Generate the question based on the answer: a copper statue of Christ [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ’ s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .
Gold
What is in front of the Notre Dame Main Building ?
BART
What is in front of the Main Building and facing it ?
MVP
What is immediately in front of the Main Building ?
MVP+S
What is immediately in front of the Main Building ?
Table 16: The second instance from the SQuAD dataset.
Input
Answer the following question: what color was cotton ? [X_SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ’ s horses slept . but cotton wasn ’ t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ’ s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ’ s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ’ s mommy rubbed her face on cotton ’ s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ’ s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ’ s fur was all all dry . " don ’ t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ’ t want that ! " then cotton thought , " i change my mind . i like being special " .
Gold
white
BART
white
MVP
white
MVP+S
white
Table 17: The first instance from the CoQA dataset.
Input
Answer the following question: what color was cotton ? [SEP] white [X_SEP] where did she live ? [X_SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ’ s horses slept . but cotton wasn ’ t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ’ s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ’ s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ’ s mommy rubbed her face on cotton ’ s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ’ s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ’ s fur was all all dry . " don ’ t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ’ t want that ! " then cotton thought , " i change my mind . i like being special " .
Gold
in a barn
BART
in a barn
MVP
in a barn
MVP+S
in a barn
Table 18: The second instance from the CoQA dataset.
Input
Given the story title: male was out jogging one morning .
Gold
the weather was crisp and cool . male felt good and energetic . he decided to keep jogging longer than normal . male went several more miles out of his way .
BART
he saw a man struggling to walk across a field . male stopped to help him . he saw a guy struggling to walking across a fields . the man thanked male profusely !
MVP
he ran into a man who was holding a knife . male was scared , so he ran away . the man threatened male . male ran back into his house .
MVP+S
he noticed a sound coming from a tree . male went to investigate . he found a small baby bird stuck in the trunk . male took the bird to the vet and they released it .
Table 19: The first instance from the ROCStories dataset.
Input
Given the story title: ken was driving around in the snow .
Gold
he needed to get home from work . he was driving slowly to avoid accidents . unfortunately the roads were too slick and ken lost control . his tires lost traction and he hit a tree .
BART
he hit a patch of black ice . ken slid and hurt his knee . he went to the doctor for it . krampus was sent home with a concussion .
MVP
he was in his car and it was very cold . ken started to back up . the car started to make a loud noise . kena had to back out and call for help .
MVP+S
he was having trouble starting his car . he had to push it a couple of blocks . his car would n’t start . ken had to call a tow truck for help .
Table 20: The second instance from the ROCStories dataset.
Input
Given the dialog: i love to meet new people . [SEP] i have a turtle named timothy . [SEP] my favorite sport is ultimate frisbee . [SEP] my parents are living in bora bora . [SEP] autumn is my favorite season . [X_SEP] hello , how are you doing tonight ?
Gold
i am well an loving this interaction how are you ?
BART
i am doing well , how are you ?
MVP
i am doing well . how are you ?
MVP+S
i am doing well . how are you ?
Table 21: The first instance from the PersonaChat dataset.
Input
Given the dialog: i just bought a brand new house . [SEP] i like to dance at the club . [SEP] i run a dog obedience school . [SEP] i have a big sweet tooth . [SEP] i like taking and posting selkies . [X_SEP] hello , how are you doing tonight ? [SEP] i am well an loving this interaction how are you ?
Gold
i am great . i just got back from the club .
BART
i am doing well , thanks for asking . what do you do for fun ?
MVP
i am doing well . i just got home from dancing at the club .
MVP+S
i am doing well . i just bought a new house .
Table 22: The second instance from the PersonaChat dataset.
Input
Given the task dialog: Belief state [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .
Given the task dialog: Dialogue action [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .
Given the task dialog: System response [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .
Gold
[taxi] destination pizza hut fen ditton departure saint john ’s college
[taxi] [request] leave arrive
what time do you want to leave and what time do you want to arrive by ?
BART
[taxi] destination pizza express fen ditton departure saint johns college
[taxi] [request] leave
what time would you like to leave or arrive ?
MVP
[taxi] destination pizza hut fen ditton departure saint john ’s college
[taxi] [request] leave
sure ! when would you like to arrive ?
MVP+S
[taxi] destination pizza hut fen ditton departure saint john ’s college
[taxi] [request] leave arrive
what time would you like to leave ?
Table 23: The first instance from the MultiWOZ dataset. The task-oriented dialogue is composed of dialogue state tracking, dialogue action learning and system response generation.
Input
Given the task dialog: Belief state [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .
Given the task dialog: Dialogue action [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .
Given the task dialog: System response [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .
Gold
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! your taxi will be [value_car] contact number is [value_phone]
BART
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! booked car type : [value_car] contact number : [value_phone]
MVP
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! booked car type : [value_car] contact number : [value_phone]
MVP+S
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! booked car type : [value_car] contact number : [value_phone]
Table 24: The second instance from the MultiWOZ dataset.