ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training

by Yu Yan, et al.

In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction as in the traditional sequence-to-sequence model, ProphetNet is optimized by n-step-ahead prediction, which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for future tokens and prevents overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB), respectively. Experimental results show that ProphetNet achieves the best performance on both abstractive summarization and question generation tasks compared to models using the same base scale pre-training dataset. For the large scale dataset pre-training, ProphetNet achieves new state-of-the-art results on Gigaword and comparable results on CNN/DailyMail using only about 1/5 of the pre-training epochs of the previous model.



1 Introduction

Large-scale pre-trained language models (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019) and sequence-to-sequence models (Lewis et al., 2019; Song et al., 2019; Raffel et al., 2019) have achieved remarkable success in both natural language understanding (NLU) tasks and natural language generation (NLG) tasks. These methods are first pre-trained on large-scale unlabeled text data with specific self-supervised objectives and then fine-tuned to adapt to downstream tasks.

Autoregressive (AR) language modeling, which estimates the probability distribution of the text corpus, is widely used for sequence modeling and sequence-to-sequence (Seq2Seq) learning (Sutskever et al., 2014). Recently, it has also become one of the successful self-supervised objectives for large-scale pre-training, as used in GPT-2 (Radford et al., 2019). Specifically, given a text sequence $x = (x_1, \dots, x_T)$, AR language modeling factorizes the likelihood into a product $p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$. In this manner, language models (LMs) and Seq2Seq models are usually trained by teacher forcing, where the models are optimized to predict the next token given all previous context tokens at each time step.
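The chain-rule factorization and teacher forcing described above can be illustrated with a small sketch. The toy conditional distribution `toy_next_token_probs` and both function names are assumptions made purely for illustration; they are not part of GPT-2 or ProphetNet.

```python
import math

def toy_next_token_probs(prefix):
    """Hypothetical conditional model p(. | prefix) over a 3-token vocabulary.

    It simply prefers to repeat the last token; a real LM would be a neural network.
    """
    vocab = ["a", "b", "c"]
    if not prefix:
        return {tok: 1.0 / len(vocab) for tok in vocab}
    last = prefix[-1]
    return {tok: (0.5 if tok == last else 0.25) for tok in vocab}

def sequence_log_likelihood(tokens):
    """log p(x) = sum over t of log p(x_t | x_<t), evaluated with teacher forcing."""
    total = 0.0
    for t, tok in enumerate(tokens):
        # The gold prefix tokens[:t] is fed to the model, not its own predictions.
        total += math.log(toy_next_token_probs(tokens[:t])[tok])
    return total
```

Each term in the sum depends only on the next token, which is the one-step-ahead behaviour the following paragraph questions.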

However, as discussed in previous works (Pascanu et al., 2013; Gulcehre et al., 2017; Serdyuk et al., 2018), AR-based models may prefer to focus on the latest tokens rather than capture long-term dependencies for the next token prediction. The reasons are as follows: (a) Local correlations such as bigram combinations are usually stronger than long-term dependencies. (b) Teacher forcing, where the model focuses on one-step-ahead prediction at each time step, has no explicit bias toward future token planning and modeling. As a result, the model may learn a bias for language modeling; that is, local token combinations are overfitted while global coherence and long-term dependencies are underfitted (Krueger et al., 2016; Merity et al., 2017; Serdyuk et al., 2018). During inference, the generations tend to maintain local coherence but lack meaningful global structure (Li et al., 2017; Serdyuk et al., 2018), especially when we use greedy decoding instead of beam search.

Figure 1: Traditional language model (left) and ProphetNet (right). We take ProphetNet decoder with future bigram prediction as an illustrated example here.

In this paper, we present a new large-scale pre-trained Seq2Seq model called ProphetNet with a novel self-supervised objective, future n-gram prediction. As shown in Figure 1, in addition to the traditional language model (LM) or Seq2Seq model that optimizes one-step-ahead prediction, ProphetNet also learns $n$-step-ahead prediction, which predicts the next $n$ tokens simultaneously based on previous context tokens at each time step during training. This future n-gram prediction serves as extra guidance that explicitly encourages the model to plan for future tokens and prevents overfitting on strong local correlations. The hidden states of ProphetNet are forced to contain useful information that helps predict not only the next token but also multiple future tokens.

Our ProphetNet is based on the Transformer (Vaswani et al., 2017) encoder-decoder architecture. There are two goals when designing ProphetNet: (a) the model should be able to simultaneously predict the future n-gram at each time step in an efficient way during the training phase, and (b) the model can be easily converted to predict only the next token, as in the original Seq2Seq model, for the inference or fine-tuning phase. To achieve that, we extend the two-stream self-attention proposed in XLNet (Yang et al., 2019) to n-stream self-attention. ProphetNet contains a main stream self-attention, which is the same as the self-attention in the original Transformer. Besides, we introduce $n$ extra self-attention predicting streams for future n-gram prediction. During training, the $i$-th predicting stream attends to the hidden states of the main stream to predict the next $i$-th future token, which guarantees that every $n$ continuous tokens in the target sequence are trained to be predicted at one time step.

Since the parameters of the main stream are shared with every predicting stream, we can disable the n-stream self-attention during inference so that only the next token is predicted at each time step, which is the same as the original Transformer Seq2Seq model. For ProphetNet pre-training, we use the proposed future n-gram prediction with the mask based auto-encoder denoising task (Song et al., 2019; Lewis et al., 2019), which has been shown to be effective for Seq2Seq pre-training in the comparison of Raffel et al. (2019). We pre-train ProphetNet on datasets of two scales: the base scale (16GB) dataset as used in BERT (Devlin et al., 2018), and the large scale (160GB) dataset similar to that of BART (Lewis et al., 2019). The pre-trained ProphetNet is further fine-tuned on several NLG tasks. Experimental results show that ProphetNet achieves the best performance on CNN/DailyMail, Gigaword, and SQuAD 1.1 question generation tasks compared to models using the same base scale pre-training dataset. For the large scale dataset pre-training experiment, ProphetNet achieves comparable results on CNN/DailyMail and a new state-of-the-art result on Gigaword, using only about 1/5 of the pre-training epochs of BART and about 1/5 of the pre-training corpus of T5 (Raffel et al., 2019) and PEGASUS (Zhang et al., 2019).

Figure 2: The architecture of ProphetNet. For simplicity, we take bigram ($n = 2$) as an example to introduce ProphetNet, whose modeling target is $p(y_t, y_{t+1} \mid y_{<t}, x)$ for each time step. The left part shows the encoder of ProphetNet, which is the same as the original Transformer encoder. The right part presents the decoder of ProphetNet, which incorporates the proposed n-stream self-attention. For Seq2Seq pre-training, we present an example of the inputs and outputs of the mask based auto-encoder denoising task. The token “_” represents the mask symbol [M]. Note that each $x_i$ and $y_i$ are the same in this task. The layer normalization and residual connections are omitted.

2 ProphetNet

We propose a new Seq2Seq pre-training model called ProphetNet, which is based on Transformer (Vaswani et al., 2017) Seq2Seq architecture. Compared to the original Transformer Seq2Seq model, ProphetNet introduces four modifications: (a) The novel self-supervised objective called future n-gram prediction as described in § 2.2. (b) The n-stream self-attention mechanism as described in § 2.3. (c) The modified positional embedding as described in § 2.4. (d) The mask based auto-encoder denoising task for Seq2Seq pre-training as described in § 2.5. Figure 2 shows the architecture of ProphetNet. Before we describe our model in detail, we first introduce the notations and sequence-to-sequence learning.

2.1 Sequence-to-Sequence Learning

Consider a text sequence pair $(x, y)$, where $x = (x_1, \dots, x_M)$ is the source sequence with $M$ tokens, and $y = (y_1, \dots, y_T)$ is the target sequence with $T$ tokens. The Seq2Seq model aims to model the conditional likelihood $p(y \mid x)$, which can be further factorized into a product $p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)$ according to the chain rule, where $y_{<t}$ denotes the preceding tokens before position $t$. In general, the Seq2Seq model employs an encoder, which encodes the source sequence into representations, and a decoder, which models the conditional likelihood with the source representations and previous target tokens as inputs. Teacher forcing is usually used for model training, where the model is optimized to predict the next target token $y_t$ given the previous golden context tokens $y_{<t}$ and $x$ at each time step.

2.2 Future N-gram Prediction

ProphetNet mainly changes the original Seq2Seq optimization of predicting the next single token, $p(y_t \mid y_{<t}, x)$, into $p(y_{t:t+n-1} \mid y_{<t}, x)$ at each time step $t$, where $y_{t:t+n-1}$ denotes the next $n$ continuous future tokens. In other words, the next $n$ future tokens are predicted simultaneously.

Figure 3: N-stream self-attention mechanism, which contains a main stream self-attention and $n$ predicting stream self-attentions. For simplicity, we take 2-stream self-attention ($n = 2$) as an example here. Figure (a) presents the attention process of the main stream self-attention. Figure (b) and Figure (c) show the attention process of the 1-st predicting stream and the 2-nd predicting stream, respectively. Figure (d) shows the inputs, outputs, and the whole multi-layer n-stream self-attention.

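Based on the Transformer Seq2Seq architecture, ProphetNet contains a multi-layer Transformer encoder with the multi-head self-attention mechanism (Vaswani et al., 2017) and a multi-layer Transformer decoder with the proposed multi-head n-stream self-attention mechanism. Given a source sequence $x = (x_1, \dots, x_M)$, ProphetNet encodes $x$ into a sequence representation, which is the same as the original Transformer encoder:

$$H_{\text{enc}} = \mathrm{Encoder}(x_1, \dots, x_M),$$

where $H_{\text{enc}}$ denotes the source sequence representations. On the decoder side, instead of predicting only the next token at each time step like the original Transformer decoder, the ProphetNet decoder predicts $n$ future tokens simultaneously as mentioned above:

$$p(y_t \mid y_{<t}, x), \dots, p(y_{t+n-1} \mid y_{<t}, x) = \mathrm{Decoder}(y_{<t}, H_{\text{enc}}),$$

where the decoder outputs $n$ probabilities at each time step. The future n-gram prediction objective can be further formalized as

$$\mathcal{L} = -\sum_{j=0}^{n-1} \alpha_j \cdot \left( \sum_{t=1}^{T-j} \log p_\theta(y_{t+j} \mid y_{<t}, x) \right) = -\underbrace{\alpha_0 \cdot \left( \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x) \right)}_{\text{language modeling loss}} - \underbrace{\sum_{j=1}^{n-1} \alpha_j \cdot \left( \sum_{t=1}^{T-j} \log p_\theta(y_{t+j} \mid y_{<t}, x) \right)}_{\text{future n-gram loss}}.$$
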
The above future n-gram prediction objective can be seen to consist of two parts: (a) the conditional LM loss, which is the same as the original teacher forcing, and (b) the future token prediction losses, which force the model to predict the future target tokens. The future n-gram prediction loss explicitly encourages the model to plan for future token prediction and prevents overfitting on strong local correlations. Furthermore, we assign different weights $\alpha_j$ to each loss as the trade-off between traditional language modeling and future n-gram prediction. We can give higher weight to the closer future token prediction, which is similar to the discount factor of future reward in reinforcement learning (Sutton et al., 1998).
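As a rough illustration of how the weighted objective combines the language modeling loss with the future token losses, the following sketch computes it from precomputed log-probabilities. The function name `future_ngram_loss` and the input layout (`log_probs[j][t]` holding the log-probability that stream j assigns to the gold token j steps ahead of position t) are assumptions for this example, not the authors' implementation.

```python
def future_ngram_loss(log_probs, alphas):
    """Weighted negative log-likelihood over n prediction streams.

    log_probs[j][t] is assumed to hold log p(y_{t+j} | y_<t, x) for stream j
    (0-based t), so stream j only has valid targets for t < T - j.
    alphas[j] is the weight of stream j; alphas[0] weights the usual LM loss.
    """
    n = len(alphas)
    T = len(log_probs[0])
    loss = 0.0
    for j in range(n):
        # Positions whose target would fall past the end of the sequence are skipped.
        stream_nll = -sum(log_probs[j][t] for t in range(T - j))
        loss += alphas[j] * stream_nll
    return loss
```

With a single weight (`alphas` of length 1) the function reduces to the standard teacher-forcing loss, which mirrors the two-part reading of the objective given above.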

2.3 N-Stream Self-Attention

Ideally, we want the ProphetNet decoder to meet two requirements: (a) ProphetNet can simultaneously predict the future n-gram at each time step in an efficient way during the training phase, and (b) the model can be easily used to predict the next $n$ tokens, or only the next token, in the inference procedure as in the traditional Transformer decoder. However, the original Transformer decoder cannot be directly used for future n-gram prediction. As shown in Figure 3, in addition to the masked multi-head self-attention (Vaswani et al., 2017) of the original Transformer decoder, which is called main stream self-attention here, the n-stream self-attention mechanism incorporates $n$ extra self-attention predicting streams, which are used to predict the next $n$ continuous future tokens respectively at each time step. To be concrete, the $i$-th predicting stream is responsible for modeling the probability $p(y_{t+i-1} \mid y_{<t}, x)$.

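As discussed in Vaswani et al. (2017), an attention function maps a query and a set of key-value pairs to an output as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where the queries $Q$, keys $K$, and values $V$ are all vectors. The input consists of queries and keys of dimension $d_k$. The multi-head attention mechanism further projects queries, keys, and values to $h$ different representation subspaces as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$

where $W^{O}$, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are trainable parameters.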
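For readers who prefer code, scaled dot-product attention as defined by Vaswani et al. (2017) can be sketched in plain Python. This is a single-head toy version operating on lists of floats, without the learned projections of multi-head attention; the function names are chosen for this example only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors (single head, no projections)."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)])
    return outputs
```

When all keys score equally for a query, the output is simply the mean of the value vectors.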

The n-stream self-attention mechanism is shown in Figure 3. As shown in Figure 3 (a), the attention mechanism of the main stream is the same as the masked multi-head self-attention in the traditional Transformer decoder, where a lower triangular matrix is used to ensure that each position can only attend to its previous tokens:

$$H^{(k+1)} = \mathrm{MultiHead}(H^{(k)}, H^{(k)}, H^{(k)}),$$

here we use $H^{(k)} = (h_0^{(k)}, \dots, h_T^{(k)})$ to denote the sequence of the $k$-th layer hidden states of the main stream.
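The lower triangular mask used by the main stream can be sketched as follows. This is a minimal plain-Python illustration under the assumption that masked positions are simply excluded from the softmax; the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math

def causal_mask(length):
    """Lower triangular 0/1 matrix: row t may attend to columns 0..t."""
    return [[1 if col <= row else 0 for col in range(length)] for row in range(length)]

def masked_softmax(scores, mask_row):
    """Softmax restricted to positions where mask_row is 1; others get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    top = max(allowed)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]
```

Applying `masked_softmax` row by row with `causal_mask` gives attention weights in which no position receives information from later tokens.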

The $i$-th predicting stream predicts the next $i$-th token based on the previous main stream hidden states at each time step. In other words, the $i$-th predicting stream predicts $y_{t+i-1}$ based on the previous tokens $y_{<t}$. For simplicity, we take bigram ($n = 2$) as an example, whose modeling target is $p(y_t, y_{t+1} \mid y_{<t}, x)$ for each time step. In this case, we have the 1-st predicting stream as shown in Figure 3 (b), and the 2-nd predicting stream as shown in Figure 3 (c). As shown in Figure 3 (d), we use the trainable vector $p_i$ as the initial input for the $i$-th predicting stream. The hidden state of the 1-st predicting stream is calculated as:

$$g_{t-1}^{(k+1)} = \mathrm{Attention}\left(g_{t-1}^{(k)},\ H_{<t}^{(k)} \oplus g_{t-1}^{(k)},\ H_{<t}^{(k)} \oplus g_{t-1}^{(k)}\right),$$

where $g_{t-1}^{(k+1)}$ denotes the $(k+1)$-th layer hidden state of the 1-st predicting stream at time step $t-1$, and $\oplus$ denotes the concatenation operation. To calculate $g_{t-1}^{(k+1)}$, $g_{t-1}^{(k)}$ is taken as the attention query while the attention values and keys are the previous hidden states $H_{<t}^{(k)}$ of the main stream. Besides, we take $g_{t-1}^{(k)}$ as an attention value and key to make $g_{t-1}^{(k+1)}$ position-aware. $g_{t-1}^{(k+1)}$ is finally used to predict $y_t$.

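Similarly, the hidden state of the 2-nd predicting stream is calculated by:

$$s_{t-1}^{(k+1)} = \mathrm{Attention}\left(s_{t-1}^{(k)},\ H_{<t}^{(k)} \oplus s_{t-1}^{(k)},\ H_{<t}^{(k)} \oplus s_{t-1}^{(k)}\right),$$

where $s_{t-1}^{(k+1)}$ denotes the $(k+1)$-th layer hidden state of the 2-nd predicting stream at time step $t-1$, which will finally be used to predict $y_{t+1}$.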
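The update of a predicting stream at one time step can be sketched as below: the stream's own state is the query, and the keys and values are the previous main stream states with the stream state appended. This is a simplified single-head sketch without learned projections, residual connections, or layer normalization; the function name `predicting_stream_step` is an assumption of this example and not taken from the authors' code.

```python
import math

def predicting_stream_step(stream_state, main_states):
    """One single-head update for a predicting stream at one time step.

    stream_state: the stream's current hidden vector (used as the query).
    main_states: main stream hidden vectors of the previous positions.
    Keys and values are main_states concatenated with stream_state itself.
    """
    memory = main_states + [stream_state]
    d_k = len(stream_state)
    scores = [sum(q * k for q, k in zip(stream_state, key)) / math.sqrt(d_k) for key in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average over the previous main stream states and the stream state.
    return [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(d_k)]
```

Because the stream state is always part of the memory, the update is well defined even at the first position, where no previous main stream state exists.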

We share the parameters of each predicting stream and main stream during training. Therefore, we can easily convert the ProphetNet decoder to the traditional Transformer decoder by disabling all the predicting streams during inference or fine-tuning.

2.4 Positional Embedding

We use the special trainable vector $p_i$ rather than the last token embedding to initialize the token embedding. However, the model does not directly know its previous token and might be more dependent on the positional information. Thus, besides the absolute positional embedding, we add additional relative positional logits in the decoder self-attention calculation, which is the same as used in T5 (Raffel et al., 2019). For mask based auto-encoder denoising tasks, the absolute positions of the decoder input tokens are their absolute positions in the original sentence.

2.5 Seq2Seq Pre-training on Denoising Task

Since it is difficult to obtain a large scale paired text corpus, we pre-train ProphetNet on a large scale unlabeled text corpus with the auto-encoder denoising task, which is widely used for Seq2Seq pre-training (Song et al., 2019; Lewis et al., 2019; Raffel et al., 2019). In general, the denoising Seq2Seq pre-training task requires the Seq2Seq model to learn to reconstruct the original text given a corrupted version of it.

There are several noise functions used to corrupt the original text, such as random token masking, token deleting, token shuffling, and token span masking. In this paper, we only consider token span masking, which is the same as in MASS (Song et al., 2019). As shown in Figure 2, we mask out some token spans of the original text as the encoder input, and the model learns to recover the masked tokens. Besides, unlike MASS, which learns to recover one next token at each time step, ProphetNet learns to recover the next $n$ future tokens within each masked token span.

3 Experiments and Results

In this section, we describe the experimental details and results. We first describe the details of ProphetNet pre-training in § 3.1. Then we fine-tune ProphetNet on two downstream NLG tasks: text summarization as described in § 3.2 and question generation as reported in § 3.3. Finally, we report the experiment of large-scale pre-training in § 3.4.

3.1 ProphetNet Pre-training

Model Configuration

Our model is based on the Transformer (Vaswani et al., 2017) encoder-decoder structure. We pre-train ProphetNet with a 12-layer encoder and a 12-layer decoder, an embedding/hidden size of 1024, and a feed-forward filter size of 4096. The batch size and training steps are set to 1024 and 500,000, respectively. Our implementation is based on FAIRSEQ. Our preliminary experiments show that ProphetNet with future trigram prediction ($n = 3$) performs slightly better than bigram ($n = 2$). However, the training of bigram is 15% faster than that of trigram. Considering the training cost, we set $n$ to 2 for ProphetNet in the following experiments.

Pre-Training Dataset

Following BERT (Devlin et al., 2018), we use BookCorpus (Zhu et al., 2015) and English Wikipedia (16GB in total) to pre-train ProphetNet. The pre-training of ProphetNet on this 16GB dataset with 500K steps takes about two weeks with GB NVIDIA V100 GPUs. Note that we also pre-train ProphetNet on a larger scale dataset, which is described in § 3.4.

Pre-Training Setting

Both the encoder input length and decoder input length of ProphetNet are set to 512 tokens. Following the pre-training settings in MASS (Song et al., 2019), we randomly pick a starting position $u$ in every 64 tokens, and then mask a continuous span from $u$. 80% of the masked tokens are replaced by [M], 10% are replaced by random tokens, and 10% are left unchanged. The masked length is set to 15% of the total number of tokens.
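The span masking procedure can be sketched as follows. This is only an illustration under stated assumptions: the span length is taken as 15% of each 64-token block (rounded down), the start position is drawn uniformly so that the span stays inside its block, and the function name `mask_spans`, its parameters, and the `vocab` list for random replacement are hypothetical rather than the authors' code.

```python
import random

MASK = "[M]"

def mask_spans(tokens, vocab, block_size=64, mask_ratio=0.15, seed=0):
    """Return (corrupted_tokens, masked_positions) for MASS-style span masking."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    masked_positions = []
    for block_start in range(0, len(tokens), block_size):
        block_len = min(block_size, len(tokens) - block_start)
        span_len = int(block_len * mask_ratio)
        if span_len == 0:
            continue
        # Pick a start position so that the whole span stays inside the block.
        u = block_start + rng.randint(0, block_len - span_len)
        for pos in range(u, u + span_len):
            masked_positions.append(pos)
            r = rng.random()
            if r < 0.8:
                corrupted[pos] = MASK               # 80%: mask symbol
            elif r < 0.9:
                corrupted[pos] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, masked_positions
```

The corrupted sequence would be used as the encoder input, while the tokens at `masked_positions` form the targets the decoder learns to recover.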

| Method | R-1 | R-2 | R-L |
|---|---|---|---|
| LEAD-3 (Nallapati et al., 2016) | 40.42 | 17.62 | 36.67 |
| PTGEN (See et al., 2017) | 36.44 | 15.66 | 33.42 |
| PTGEN+Coverage (See et al., 2017) | 39.53 | 17.28 | 36.38 |
| S2S-ELMo (Edunov et al., 2019) | 41.56 | 18.94 | 38.47 |
| Bottom-Up (Gehrmann et al., 2018) | 41.22 | 18.68 | 38.34 |
| BERTSUMABS (Liu and Lapata, 2019) | 41.72 | 19.39 | 38.76 |
| BERTSUMEXTABS (Liu and Lapata, 2019) | 42.13 | 19.60 | 39.18 |
| MASS (Song et al., 2019) | 42.12 | 19.50 | 39.01 |
| UniLM (Dong et al., 2019) | 43.33 | 20.21 | 40.51 |
| ProphetNet | 43.68 | 20.64 | 40.72 |

Table 1: Results on the CNN/DailyMail test set.

3.2 Fine-Tuning on Text Summarization

As a typical NLG task, abstractive text summarization aims to generate a short and fluent summary of a long text document. We fine-tune and evaluate ProphetNet on two widely used text summarization datasets: (a) the non-anonymized version of the CNN/DailyMail dataset (See et al., 2017), and (b) the Gigaword corpus (Rush et al., 2015).


We use the Adam optimizer (Kingma and Ba, 2015) with a peak learning rate to fine-tune ProphetNet on CNN/DailyMail. The batch size, the learning rate warmup steps, and the total number of fine-tuning epochs are set to 512, 1000, and 10, respectively. During inference, we limit the length of the output to between 45 and 110 tokens with a length penalty of 1.2. We set the beam size to 5 and remove duplicated trigrams in beam search (Fan et al., 2017).
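The duplicated-trigram removal mentioned above can be illustrated with a small check that a beam search could apply to each candidate token. This is a sketch of the general idea only; the function name `repeats_trigram` is hypothetical and the actual FAIRSEQ beam search may implement the constraint differently.

```python
def repeats_trigram(prefix, candidate):
    """True if appending `candidate` to `prefix` would repeat an existing trigram."""
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    # All trigrams already present in the generated prefix.
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen
```

A decoder using this check would discard, or heavily penalize, any hypothesis extension for which the function returns `True`.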

We compare ProphetNet against the following baselines: LEAD-3 (Nallapati et al., 2016), which takes the first three sentences as the summary; PTGEN (See et al., 2017), which is a Seq2Seq model incorporating the pointer-generator network; PTGEN+Coverage (See et al., 2017), which introduces a coverage mechanism to PTGEN; Bottom-Up (Gehrmann et al., 2018), which employs a bottom-up content selector based on a Seq2Seq model; and S2S-ELMo (Edunov et al., 2019), which uses pre-trained ELMo (Peters et al., 2018) representations. Besides, we also compare our method with several strong pre-training based baselines: BERTSUMABS (Liu and Lapata, 2019), MASS (Song et al., 2019), and UniLM (Dong et al., 2019). Note that these pre-training based baselines are all pre-trained on the 16GB BookCorpus + English Wikipedia dataset, which is the same dataset we use for ProphetNet pre-training.

Following See et al. (2017), we report the F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004). The results are presented in Table 1. From the results, we can see that the ProphetNet achieves the best performance on all metrics.
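To make the reported metric concrete, a simplified ROUGE-N F1 score can be sketched as clipped n-gram overlap between a candidate and a reference summary. This sketch omits stemming, sentence splitting, and the other options of the official ROUGE toolkit, and it does not cover ROUGE-L, so it will not reproduce the numbers in the tables exactly; the function names are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    """Simplified ROUGE-N F1 from clipped n-gram overlap."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, comparing "the cat sat on the mat" with "the cat is on the mat" shares five of six unigrams and three of five bigrams in each direction.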

| Method | R-1 | R-2 | R-L |
|---|---|---|---|
| OpenNMT (Klein et al., 2017) | 36.73 | 17.86 | 33.68 |
| Re2Sum (Cao et al., 2018) | 37.04 | 19.03 | 34.46 |
| MASS (Song et al., 2019) | 37.66 | 18.53 | 34.89 |
| UniLM (Dong et al., 2019) | 38.45 | 19.45 | 35.75 |
| ProphetNet | 39.23 | 20.36 | 36.57 |

Table 2: Results on the Gigaword test set. R is short for ROUGE.


We follow the data pre-processing of UniLM (Dong et al., 2019) to fine-tune ProphetNet on Gigaword. We use the Adam optimizer (Kingma and Ba, 2015) with a peak learning rate. The batch size is set to 128 and the warmup steps to 1000. We fine-tune the model for 4 epochs with future bigram prediction training. During inference, we set the length penalty to 1.5 and the beam size to 5.

Following UniLM (Dong et al., 2019), we compare ProphetNet against the following baselines: OpenNMT (Klein et al., 2017), which implements the standard Seq2Seq model with an attention mechanism, and Re2Sum (Cao et al., 2018), which employs an extended Seq2Seq model to generate summaries based on retrieved candidate summaries. We also compare against two strong pre-training based baselines: MASS (Song et al., 2019) and UniLM (Dong et al., 2019). The results are presented in Table 2. It can be observed that ProphetNet outperforms previous models on all metrics.

| Method | B4 | MTR | R-L |
|---|---|---|---|
| Data split of Du et al. (2017) | | | |
| CorefNQG (Du and Cardie, 2018) | 15.16 | 19.12 | - |
| SemQG (Zhang and Bansal, 2019) | 18.37 | 22.65 | 46.68 |
| UniLM (Dong et al., 2019) | 22.12 | 25.06 | 51.07 |
| ProphetNet | 24.88 | 26.62 | 52.66 |
| Data split of Zhao et al. (2018), with development and test sets reversed | | | |
| MP-GSN (Zhao et al., 2018) | 16.38 | 20.25 | 44.48 |
| SemQG (Zhang and Bansal, 2019) | 20.76 | 24.20 | 48.91 |
| UniLM (Dong et al., 2019) | 23.75 | 25.61 | 52.04 |
| ProphetNet | 26.48 | 27.36 | 53.89 |

Table 3: Results on the SQuAD 1.1 test set. B4 is short for BLEU-4, MTR is short for METEOR, and R-L is short for ROUGE-L. The model is fine-tuned for 5 epochs, and the same model is used to generate the results for the two groups, which use swapped dev and test sets.

| Dataset | Method | Corpus | R-1 | R-2 | R-L |
|---|---|---|---|---|---|
| CNN/DailyMail | T5 (Raffel et al., 2019) | 750GB | 43.52 | 21.55 | 40.69 |
| CNN/DailyMail | PEGASUSLARGE (C4) (Zhang et al., 2019) | 750GB | 43.90 | 21.20 | 40.76 |
| CNN/DailyMail | PEGASUSLARGE (HugeNews) (Zhang et al., 2019) | 3800GB | 44.17 | 21.47 | 41.11 |
| CNN/DailyMail | BART (Lewis et al., 2019) | 160GB | 44.16 | 21.28 | 40.90 |
| CNN/DailyMail | ProphetNet | 160GB | 44.14 | 21.16 | 41.27 |
| Gigaword | PEGASUSLARGE (C4) (Zhang et al., 2019) | 750GB | 38.75 | 19.96 | 36.14 |
| Gigaword | PEGASUSLARGE (HugeNews) (Zhang et al., 2019) | 3800GB | 39.12 | 19.86 | 36.24 |
| Gigaword | ProphetNet | 160GB | 39.34 | 20.47 | 36.57 |

Table 4: Results on the CNN/DailyMail and Gigaword test sets of large-scale pre-training models. R is short for ROUGE, and Corpus denotes the size of the pre-training data.

3.3 Fine-Tuning on Question Generation

Recently, the answer-aware question generation task (Zhou et al., 2017) has attracted a lot of attention in NLG. It aims to generate a question that asks about a given answer span based on a given text passage or document. We conduct experiments on this task to further evaluate the ProphetNet model. Following Du et al. (2017), we split the SQuAD 1.1 (Rajpurkar et al., 2016) dataset into training, development and test sets. We also report the results on the data split used in Zhao et al. (2018), which reverses the development set and test set.

The question generation task is typically formulated as a Seq2Seq problem. The input passage and the answer are packed as “answer [SEP] input passage” as input, and the question is used as the target output sequence. We fine-tune the ProphetNet model for 5 epochs on the training set and report the results on the two kinds of data splits mentioned above. The first 512 tokens of the passage are fed to the model. The peak learning rate is

and the batch size is set to 28.
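The input packing described above can be sketched as a small helper. Whitespace tokenization is assumed here purely for illustration (the actual model operates on subword tokens), and the function name `pack_question_generation_input` and its parameters are hypothetical.

```python
def pack_question_generation_input(answer, passage, max_passage_tokens=512, sep="[SEP]"):
    """Build 'answer [SEP] passage', keeping only the first tokens of the passage."""
    passage_tokens = passage.split()[:max_passage_tokens]
    return " ".join(answer.split() + [sep] + passage_tokens)
```

The returned string would serve as the source sequence, with the reference question as the target sequence.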

Following Dong et al. (2019), we compare our model against the following models: CorefNQG (Du and Cardie, 2018) which employs a feature-rich encoder based on Seq2Seq model; MP-GSN (Zhao et al., 2018) which incorporates a gated self-attention encoder with maxout pointer; SemQG (Zhang and Bansal, 2019) which introduces two semantics-enhanced rewards for Seq2Seq model training. Besides, we also compare our model with UniLM (Dong et al., 2019) which is the previous state-of-the-art on this task. Following Dong et al. (2019), we use the BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE-L (Lin, 2004) metrics for evaluation.

The results are shown in Table 3. It can be seen that our ProphetNet model outperforms all previous question generation methods on all metrics, achieving a new state-of-the-art for question generation on the SQuAD 1.1 dataset.

3.4 Large-scale Pre-training

Recent works show that the performance of pre-trained models on downstream tasks can be improved by using larger pre-training corpora (Lewis et al., 2019; Raffel et al., 2019). We also pre-train ProphetNet on 160GB of English-language corpora of news, books, stories and web text, which is similar to the corpus used in BART (Lewis et al., 2019). Because CC-News is not officially released, we use a similar public news corpus, REALNEWS (Zellers et al., 2019). The model configuration is the same as described in § 3.1. We fine-tune ProphetNet on two downstream tasks, CNN/DailyMail and Gigaword, after pre-training, where the setting is the same as described in § 3.2. We compare ProphetNet (160GB) against the following strong baselines: T5 (Raffel et al., 2019), which is pre-trained on a text corpus of 750GB; PEGASUSLARGE (Zhang et al., 2019), which is pre-trained on text corpora of 750GB and 3800GB, respectively; and BART (Lewis et al., 2019), which is pre-trained on a dataset similar to that of ProphetNet (160GB).

The results are shown in Table 4. Because pre-training on this 160GB dataset is extremely time-consuming even with 16 × 32GB NVIDIA V100 GPUs, in this paper we report the performance of a ProphetNet that has only been pre-trained for 16 days, or 8.5 epochs. It should be noted that this number of pre-training epochs is only about 1/5 of that of BART. Our experiments also show that at this point in pre-training, the downstream performance of ProphetNet has not yet converged, as shown in Figure 4. Nevertheless, our model still achieves performance on CNN/DailyMail comparable to the other baselines, and its ROUGE-L on CNN/DailyMail is the highest. Moreover, ProphetNet (160GB) outperforms PEGASUSLARGE (C4 750GB) and PEGASUSLARGE (HugeNews 3800GB) using only about 1/5 and 1/20 of the pre-training corpus, respectively. To the best of our knowledge, ProphetNet achieves a new state-of-the-art result on Gigaword.

Figure 4: Performance increase on the CNN/DailyMail dataset as ProphetNet pre-trains for more epochs on the 160GB large-scale dataset.

4 Related Work

Unsupervised pre-training has been successfully applied to various natural language processing tasks (Radford et al., 2018; Devlin et al., 2018; Liu et al., 2019; Joshi et al., 2019; Lan et al., 2019; Yang et al., 2019; Raffel et al., 2019; Dong et al., 2019; Song et al., 2019; Lewis et al., 2019). GPT (Radford et al., 2018) takes plain text as pre-training data to predict the next tokens given the leftward tokens. It is based on the left-to-right language model and can be used to generate stories and continue a given text. BERT (Devlin et al., 2018) and SpanBERT (Joshi et al., 2019) use a bi-directional language model to recover masked tokens/spans for a given sentence. Bi-directional information flow can be used to recover the masked positions, but no left-to-right language model dependency is learned. As a result, BERT and SpanBERT bring significant improvements for NLU tasks but are not suitable for generation tasks. XLNet (Yang et al., 2019) predicts tokens at given positions, conditioned on some tokens and their positions in the sentence, in an AR manner. Although it uses AR to build a permuted-order language model, it is also not suitable for NLG tasks because it brings too much noise for a left-to-right language model. MASS (Song et al., 2019) pre-trains the sequence-to-sequence model by dropping a continuous token span to corrupt the original text and learning to recover it. T5 (Raffel et al., 2019) investigates different model structures and different pre-training tasks, and is pre-trained on a large scale corpus named C4, which is 750GB. BART (Lewis et al., 2019) uses the encoder-decoder structure to regenerate the original sentence from its corrupted input as a denoising task. In the BART decoder, an undamaged language model is learned, which brings improvements to NLG tasks.

Natural language generation methods are typically based on left-to-right or right-to-left language models and generate one token at each time step. These methods cannot capture the information of future tokens. Recently, incorporating future information into language generation tasks has attracted the attention of researchers (Li et al., 2017; Serdyuk et al., 2018; Lawrence et al., 2019). Li et al. (2017) propose an actor-critic model which designs a value function as a critic to estimate future success. In their method, they not only consider MLE-based learning but also incorporate an RL-based value function into the decoding process. Serdyuk et al. (2018) point out that traditional Recurrent Neural Networks (RNNs) may prefer to generate each token based on the recent tokens, which makes it hard to learn long-term dependencies. To capture future information and learn long-term dependencies, they run a forward RNN and a backward RNN in parallel. Lawrence et al. (2019) concatenate the source and target to train an encoder instead of an encoder-decoder architecture. They use special placeholder tokens to replace some tokens of the target during training. At inference time, they generate the target by replacing each placeholder token.

5 Conclusion

In this paper, we introduce ProphetNet, a sequence-to-sequence pre-training model that learns to predict the future n-gram at each time step. ProphetNet achieves the best performance on both abstractive summarization and question generation tasks compared to models using the same base scale pre-training dataset. Furthermore, ProphetNet achieves comparable results on CNN/DailyMail and a new state-of-the-art result on Gigaword using only about 1/5 of the pre-training epochs of the previous model.

For future work, we will apply the proposed ProphetNet to more downstream NLG tasks and NLU tasks. We also plan to pre-train ProphetNet with other pre-training tasks and larger datasets such as C4.


  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §3.3.
  • Z. Cao, W. Li, S. Li, and F. Wei (2018) Retrieve, rerank and rewrite: soft template based neural summarization. In ACL, Cited by: §3.2, Table 2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §1, §3.1, §4.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In NeurIPS, Cited by: §3.2, §3.2, §3.2, §3.3, Table 1, Table 2, Table 3, §4.
  • X. Du and C. Cardie (2018) Harvesting paragraph-level question-answer pairs from Wikipedia. In ACL, Cited by: §3.3, Table 3.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §3.3.
  • S. Edunov, A. Baevski, and M. Auli (2019) Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722. Cited by: §3.2, Table 1.
  • A. Fan, D. Grangier, and M. Auli (2017) Controllable abstractive summarization. arXiv preprint arXiv:1711.05217. Cited by: §3.2.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792. Cited by: §3.2, Table 1.
  • C. Gulcehre, F. Dutil, A. Trischler, and Y. Bengio (2017) Plan, attend, generate: planning for sequence-to-sequence models. In NIPS, Cited by: §1.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529. Cited by: §4.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §3.2, §3.2.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In ACL, Cited by: §3.2, Table 2.
  • D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal (2016) Zoneout: regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305. Cited by: §1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §4.
  • C. Lawrence, B. Kotnis, and M. Niepert (2019) Attending to future tokens for bidirectional sequence generation. arXiv preprint arXiv:1908.05915. Cited by: §4.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1, §1, §2.5, §3.4, Table 4, §4.
  • J. Li, W. Monroe, and D. Jurafsky (2017) Learning to decode for future success. arXiv preprint arXiv:1701.06549. Cited by: §1, §4.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Cited by: §3.2, §3.3.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. Cited by: §3.2, Table 1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.
  • S. Merity, N. S. Keskar, and R. Socher (2017) Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182. Cited by: §1.
  • R. Nallapati, F. Zhai, and B. Zhou (2017) SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In AAAI, Cited by: Table 1.
  • R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023. Cited by: §3.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, Cited by: §3.3.
  • R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In ICML, Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §3.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1, §1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1, §1, §2.4, §2.5, §3.4, Table 4, §4.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §3.3.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In EMNLP, Cited by: §3.2.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §3.2, §3.2, §3.2, Table 1.
  • D. Serdyuk, N. R. Ke, A. Sordoni, A. Trischler, C. Pal, and Y. Bengio (2018) Twin networks: matching the future for sequence generation. In ICLR, Cited by: §1, §4.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §1, §1, §2.5, §2.5, §3.1, §3.2, §3.2, Table 1, Table 2, §4.
  • I. Sutskever, O. Vinyals, and Q. Le (2014) Sequence to sequence learning with neural networks. In NIPS, Cited by: §1.
  • R. S. Sutton, A. G. Barto, et al. (1998) Introduction to reinforcement learning. Vol. 2, MIT press Cambridge. Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §1, §2.2, §2.3, §2.3, §2, §3.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1, §1, §4.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. arXiv preprint arXiv:1905.12616. Cited by: footnote 2.
  • J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu (2019) PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777. Cited by: §1, §3.4, Table 4.
  • S. Zhang and M. Bansal (2019) Addressing semantic drift in question generation for semi-supervised question answering. arXiv preprint arXiv:1909.06356. Cited by: §3.3, Table 3.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910. Cited by: §3.3, §3.3, Table 3.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: §3.3.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27. Cited by: §3.1.