Pre-trained on large-scale unlabeled text corpora and fine-tuned on downstream tasks, self-supervised representation models such as GPT [GPT], BERT [devlin2019bert] and XLNet [XLNet] have achieved remarkable improvements in natural language understanding (NLU). Different from encoder-only pre-training like BERT or decoder-only pre-training like GPT, natural language generation (NLG) relies on the sequence to sequence generation framework (seq2seq) which consists of a bidirectional encoder and a unidirectional decoder. Current pre-training works in NLG such as MASS [MASS] and UNILM [UNILM] mainly focus on jointly pre-training encoder and decoder on different self-supervised tasks. However, these works pay little attention to the exposure bias issue [seq2seq], a major drawback of teacher-forcing training. This issue is due to the fact that groundtruth words are used during training, while generated words, whether predicted correctly or not, are used for inference where mistakes tend to accumulate.
To alleviate this issue, we present ERNIE-GEN, an enhanced multi-flow seq2seq training framework characterized by a carefully designed Multi-Flow Attention architecture based on Transformer [Vaswani et al., 2017], as illustrated in Figure 2. ERNIE-GEN incorporates a novel infilling generation mechanism and a noise-aware generation method into pre-training and fine-tuning; the experiments in §4.3 show that both are effective.
Infilling generation. Instead of using the last ground-truth word in training or the last generated word in inference, we insert an artificial symbol [ATTN], together with its position, to gather the history contextual representations at each step in both training and inference. This diverts the model's attention away from the last word and forces it to focus on all former representations, thus alleviating the negative influence of previous mistakes on subsequent generation, as shown in Figure 1(b).
Noise-Aware generation. We corrupt the input target sequence by randomly replacing words with arbitrary words from the vocabulary. This setup, despite its simplicity, proves to be an effective way to make the model aware of mistakes in training, so that it is able to detect mistakes and ignore them during inference.
Moreover, in light of the fact that entities, phrases and sentences in human writing are organized in a coherent manner, we incorporate a span-by-span generation task into ERNIE-GEN as a new generation flow to train the model to predict semantically-complete spans consecutively rather than predicting word by word as traditional seq2seq models do. This task is implemented through the infilling generation mechanism in parallel with an infilling-based word-by-word generation flow to facilitate convergence in training, as shown in Figure 1b.
In addition, as shown in Figure 1(c-d), recent pre-training works for NLG like UNILM and MASS only sample a single continuous segment as the target sequence. However, this sampling method compromises the correlation between encoder and decoder when it comes to pre-training on long texts (typically 512 words), given that adjacent segments are often semantically related. ERNIE-GEN adopts a multi-granularity target fragments sampling strategy to force the decoder to rely more on the encoder representations rather than on the previously generated words, thus enhancing the correlation between encoder and decoder, as shown in Figure 1(e).
Empirically, ERNIE-GEN is particularly effective and achieves state-of-the-art results on a range of NLG tasks including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue generation (Persona-Chat) and generative question answering (CoQA), utilizing a much smaller amount of pre-training data and parameters.
2 Related Work
Pre-Training for NLP Tasks.
Recently, pre-training methods have achieved state-of-the-art results in multiple NLU tasks. ELMo [elmo] pre-trains two unidirectional language models (LMs), one forward and one backward, to provide features for downstream tasks. GPT utilizes an adjusted Transformer [transformer] to learn a forward LM and then fine-tunes the forward LM on supervised datasets. BERT proposes a masked language modeling (MLM) task to learn deep bidirectional representations. Nevertheless, the above methods are usually implemented with just one encoder or decoder, which is less effective in encoder-decoder based generation tasks. Several works have therefore made preliminary explorations of pre-training for NLG by incorporating BERT's MLM into the seq2seq framework, and have shown excellent performance on a range of generation tasks. MASS masks a consecutive fragment (50%) of the input sentence with artificial [MASK] symbols and predicts it. UNILM masks several words in the input sequence, which is a pair of segments for encoder and decoder, and then predicts the masked words in accordance with BERT's MLM. BART [BART] corrupts the input sequence and trains the model to generate the original sequence as a denoising autoencoder.
Exposure Bias issue.
NLG tasks suffer from exposure bias, which is caused by teacher-forcing training. To address this issue, RNN-based variational autoencoders (VAEs) are leveraged in [cvae, vae1], but they require inference for both the posterior and prior distributions. Reinforcement learning has also been applied to text generation against the exposure bias issue [seq2seq, scst]; it is, however, inefficient during training because of the word-by-word sampling procedure. These methods are inefficient and less practical for pre-training that relies on large-scale unlabeled text corpora.
Works like [ernie1, ernie, spanbert] verify that predicting spans yields substantially better performance on NLU tasks. Meanwhile, inspired by characteristics of human expression, we want the model to have the foresight to generate a semantically-complete span at each step rather than a single word. Consequently, a span-by-span generation task is proposed to make the model capable of generating more human-like texts.
3 Proposed Framework
Built on the infilling generation mechanism, ERNIE-GEN adopts a Multi-Flow Attention architecture to train the model on word-by-word and span-by-span generation tasks in parallel. In this section, we describe ERNIE-GEN according to the training process shown in Figure 2.
3.1 Multi-Granularity Target Fragments
Given a source sequence $S = [s_1, \dots, s_n]$, we first sample a length distribution $D_i$ from a distribution set $D = \{D_1, \dots, D_{|D|}\}$ with probability $p_i$, then select fragments in $S$ according to $D_i$ iteratively until the fragment budget has been spent (e.g. 25% of $S$). We denote $S^i_j$ as the $j$-th fragment sampled under length distribution $D_i$. Sampled fragments are then removed from $S$ and stitched together to form the target sequence $T = [T_1, \dots, T_k] = [S^i_1, \dots, S^i_k]$. We denote $S'$ as the source sequence left after removing the sampled fragments.
ERNIE-GEN performs pre-training by predicting the fragmented target sequence $T$ and minimizing the negative log likelihood:
$$\mathcal{L}(\theta; S, D_i) = -\log P(T \mid S', D_i; \theta) = -\log \prod_{j=1}^{k} P(T_j \mid T_{<j}, S'; \theta),$$
where the target sequence $T$ is sorted by the position of each fragment. In each fragment $T_j = [t^1_j, \dots, t^{|T_j|}_j]$, $P(T_j \mid T_{<j}, S'; \theta) = \prod_{l=1}^{|T_j|} P(t^l_j \mid t^{<l}_j, T_{<j}, S'; \theta)$.
Following preliminary trials, we set a hyperparameter $\gamma$, which denotes the ratio of the total length of all fragments to that of the source sequence $S$. Besides, we introduce two uniform distributions in $D$, sampled with probabilities $p_1$ and $p_2$ respectively, so that the model learns representations from different perspectives. On the one hand, short fragments benefit the learning of semantic relations between words; on the other hand, longer fragments help to memorize sentence-level expressions.
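The sampling procedure described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_fragments`, the rejection of overlapping fragments, and the example length ranges and probabilities are assumptions made for the example; only the 25% budget is taken from the text.

```python
import random


def sample_fragments(source, length_ranges, probs, budget_ratio, rng):
    """Sample non-overlapping fragments from `source` until the budget is spent.

    Returns (remaining_source, target, spans); spans are sorted (start, end)
    index pairs into the original sequence.
    """
    n = len(source)
    budget = int(n * budget_ratio)
    # One length distribution is drawn per source sequence.
    low, high = rng.choices(length_ranges, weights=probs, k=1)[0]
    taken = [False] * n
    spans = []
    spent = 0
    attempts = 0
    while spent < budget and attempts < 100 * n:  # attempt cap is a safety net
        attempts += 1
        # Never exceed what is left of the budget.
        length = min(rng.randint(low, high), budget - spent)
        start = rng.randrange(0, n - length + 1)
        if any(taken[start:start + length]):
            continue  # overlaps an earlier fragment, so resample
        for i in range(start, start + length):
            taken[i] = True
        spans.append((start, start + length))
        spent += length
    spans.sort()  # target fragments are ordered by their source position
    target = [tok for s, e in spans for tok in source[s:e]]
    remaining = [tok for i, tok in enumerate(source) if not taken[i]]
    return remaining, target, spans


# Hypothetical settings: two uniform length ranges with assumed probabilities.
rng = random.Random(0)
remaining, target, spans = sample_fragments(
    list(range(40)), [(1, 4), (4, 32)], [0.4, 0.6], 0.25, rng
)
```

The returned `remaining` list plays the role of the source left for the encoder, and `target` is the stitched sequence the decoder has to predict.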
3.2 Noise-Aware Generation
To train a generation model that can detect false predictions and mitigate their impact on subsequent generation, we corrupt the ground-truth target sequence $T$ by randomly replacing words, and denote the corrupted sequence as $T'$. There are two hyperparameters, $\rho_p$ and $\rho_f$, denoting the noising rate in pre-training and fine-tuning respectively.
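A minimal sketch of this corruption step is given below. The function name `add_noise` and the choice of an independent draw per token are assumptions for illustration; the text only states that words are replaced randomly at a given noising rate.

```python
import random


def add_noise(target, vocab, noising_rate, rng):
    """Replace each target token by a random vocabulary word with
    probability `noising_rate`; all other tokens are kept unchanged."""
    noised = []
    for tok in target:
        if rng.random() < noising_rate:
            noised.append(rng.choice(vocab))
        else:
            noised.append(tok)
    return noised


rng = random.Random(0)
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
# 0.05 is the pre-training noising rate reported in Section 4.1.
noised_target = add_noise(["the", "cat", "sat"], vocab, 0.05, rng)
```

The model is still trained to predict the original, uncorrupted words, so the noise only affects the history it conditions on.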
3.3 Architecture: Multi-Flow Attention
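Formally, given a source sequence $S = [s_1, \dots, s_n]$, a target sequence $T = [t_1, \dots, t_m]$ and an artificial symbol sequence $A = [\text{[ATTN]}_1, \dots, \text{[ATTN]}_m]$ which has the same length as $T$, we denote the inference of the seq2seq network based on a shared Transformer as follows:
$$s_i^{(l+1)} \leftarrow \text{MH-Attn}(Q = s_i^{(l)},\ KV = S^{(l)}),$$
$$t_i^{(l+1)} \leftarrow \text{MH-Attn}(Q = t_i^{(l)},\ KV = [S^{(l)}, t_{\le i}^{(l)}]),$$
where $Q$, $K$, $V$ denote the query, key, and value in the Multi-Head attention operation [transformer]. $s_i^{(l)}$ and $t_i^{(l)}$ indicate the $i$-th vector representations of the $l$-th layer of Multi-Head Attention for the encoder and the decoder respectively, and $[\cdot]$ denotes the concatenation operation. In this work, we call the above procedure the Contextual Flow.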
Word-by-word Generation Flow.
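Based on the infilling generation mechanism, this flow utilizes an inserted [ATTN] symbol to gather history representations word by word (see Figure 1b). To facilitate this process, we place all inserted [ATTN] symbols together, as shown in Figure 3b. To be specific, the word-by-word generation flow is updated as follows:
$$a_i^{(l+1)} \leftarrow \text{MH-Attn}(Q = a_i^{(l)},\ KV = [S^{(l)}, t_{<i}^{(l)}, a_i^{(l)}]),$$
where $a_i^{(l)}$ indicates the $i$-th vector representation of the $l$-th layer for the artificial symbol sequence $A_W$, $S^{(l)}$ denotes the source representations and $t_{<i}^{(l)}$ the representations of the target words before step $i$.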
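The visibility pattern of the word-by-word flow can be made concrete with a small mask builder. This is a sketch under stated assumptions: the function name `word_flow_mask` and the key layout (source tokens, then target tokens, then [ATTN] symbols) are chosen for the example and are not specified in the text.

```python
def word_flow_mask(n_src, n_tgt):
    """Boolean visibility mask for the word-by-word flow.

    mask[i][j] is True when the i-th [ATTN] query may attend to key j.
    Assumed key layout: n_src source tokens, n_tgt target tokens,
    then n_tgt [ATTN] symbols.
    """
    mask = []
    for i in range(n_tgt):
        row = [True] * n_src                    # the whole source is visible
        row += [j < i for j in range(n_tgt)]    # target words before step i
        row += [j == i for j in range(n_tgt)]   # only its own [ATTN] symbol
        mask.append(row)
    return mask


for row in word_flow_mask(n_src=2, n_tgt=3):
    print("".join("1" if visible else "0" for visible in row))
```

Each [ATTN] query sees the full source and the target history, but never the target word it is supposed to predict.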
Span-by-span Generation Flow.
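Different from the word-by-word generation flow, the span-by-span flow uses [ATTN] symbols to predict complete spans consecutively, as shown in Figure 3c. Formally, given a list of span boundaries $B = [b_1, \dots, b_{|B|}]$, we conduct the span-by-span generation flow as:
$$a_j^{(l+1)} \leftarrow \text{MH-Attn}(Q = a_j^{(l)},\ KV = [S^{(l)}, t_{<b_i}^{(l)}, a_j^{(l)}]),$$
where $j \in [b_i, b_{i+1})$, and $a_j^{(l)}$ denotes the $(j - b_i)$-th vector representation of the $i$-th span. Essentially, the model is trained to predict a whole span $\{t_{b_i}, \dots, t_{b_{i+1}-1}\}$ with the same history context $[S, t_{<b_i}]$.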
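A corresponding sketch for the span-by-span flow is shown below. The function name `span_flow_mask`, the 0-based boundary convention (boundaries start at 0 and end at the target length) and the key layout are assumptions made for the example.

```python
def span_flow_mask(n_src, boundaries):
    """Boolean visibility mask for the span-by-span flow.

    `boundaries` is an increasing list of 0-based span starts that ends
    with the target length, e.g. [0, 2, 3] for spans [0, 2) and [2, 3).
    Assumed key layout: source tokens, target tokens, [ATTN] symbols.
    """
    n_tgt = boundaries[-1]
    mask = []
    for b_start, b_end in zip(boundaries, boundaries[1:]):
        for j in range(b_start, b_end):
            row = [True] * n_src                        # whole source
            row += [k < b_start for k in range(n_tgt)]  # words before the span
            row += [k == j for k in range(n_tgt)]       # its own [ATTN] only
            mask.append(row)
    return mask


for row in span_flow_mask(n_src=1, boundaries=[0, 2, 3]):
    print("".join("1" if visible else "0" for visible in row))
```

All queries inside one span share the same target history, which is what distinguishes this flow from the word-by-word flow.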
Instead of randomly sampling spans, we prefer sampling spans that carry semantic information and knowledge. Specifically, we consider the following two steps to sample spans consecutively in $T'$:
Firstly, we implement a T-test to compute t-statistic scores of all bigrams and trigrams, which is based on an initial hypothesis $H_0$: a random span of $n$ arbitrary words $w = [w_1, \dots, w_n]$ with probability $p'(w) = \prod_{i=1}^{n} p(w_i)$ cannot be a statistical $n$-gram. The t-statistic score is calculated by $\frac{p(w) - p'(w)}{\sqrt{\sigma^2 / N}}$, where $p(w) = \frac{\text{Count}(w)}{N}$ and $\sigma^2 = p(w)(1 - p(w))$ indicate the statistic probability and the variance of $w$ respectively, while $N$ denotes the total number of $n$-grams appearing in the training data. According to the t-statistic scores, we select the top 200,000 bigrams, the top 50,000 trigrams and all unigrams to construct a specific vocabulary of spans, which is represented as $V_{span}$.
Secondly, we search for the trigram, bigram and unigram in order, starting with the current word, until a span ($n$-gram, $n \le 3$) is retrieved in $V_{span}$.
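The two steps above can be sketched in a few lines. The t-statistic below follows the standard collocation t-test, with the variance approximated by p(1 - p); this formulation and the names `t_statistic` and `segment_spans` are assumptions for illustration, not the authors' code.

```python
import math


def t_statistic(ngram_count, unigram_probs, total_ngrams):
    """t-score of an n-gram against the hypothesis that its words are
    independent. Assumes 0 < ngram_count < total_ngrams."""
    p = ngram_count / total_ngrams        # observed n-gram probability
    p_indep = math.prod(unigram_probs)    # probability under independence
    variance = p * (1.0 - p)              # Bernoulli variance approximation
    return (p - p_indep) / math.sqrt(variance / total_ngrams)


def segment_spans(words, span_vocab, max_n=3):
    """Greedy left-to-right segmentation: try trigram, then bigram, then
    unigram. Unigrams are always accepted, so the loop always advances."""
    spans = []
    i = 0
    while i < len(words):
        for n in range(max_n, 0, -1):
            candidate = tuple(words[i:i + n])
            if len(candidate) == n and (n == 1 or candidate in span_vocab):
                spans.append(candidate)
                i += n
                break
    return spans


span_vocab = {("new", "york", "city"), ("is", "big")}
print(segment_spans(["new", "york", "city", "is", "big"], span_vocab))
```

In this toy example the sentence is split into one trigram span and one bigram span; words that belong to no listed n-gram would each become a unigram span.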
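To integrate the word-by-word generation flow and the span-by-span generation flow, we apply them in parallel with a shared contextual flow by leveraging the multi-flow attention architecture, as Figure 3a describes. The multi-flow attention is computed as:
$$X^{(l+1)} \leftarrow \text{MH-Attn}(Q = X^{(l)},\ KV = X^{(l)},\ M_C),$$
$$A_W^{(l+1)} \leftarrow \text{MH-Attn}(Q = A_W^{(l)},\ KV = [X^{(l)}, A_W^{(l)}],\ M_W),$$
$$A_S^{(l+1)} \leftarrow \text{MH-Attn}(Q = A_S^{(l)},\ KV = [X^{(l)}, A_S^{(l)}],\ M_S),$$
where $X$ denotes the concatenation of $S$ and $T'$, and $A_W$, $A_S$ are the artificial symbol sequences fed into the word-by-word generation flow and the span-by-span generation flow respectively. As shown in Figure 3d, the attention mask matrix $M$ determines whether a query and a key can attend to each other by modifying the attention weight $W = \text{softmax}(\frac{QK^\top}{\sqrt{d_k}} + M)$ [transformer]. Specifically, $M$ is assigned as:
$$M_{ij} = \begin{cases} 0, & \text{can be attended} \\ -\infty, & \text{prevented from attending} \end{cases}$$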
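While training, we add the losses of the word-by-word generation flow and the span-by-span generation flow with a coefficient $\lambda$:
$$\mathcal{L} = \lambda\, \mathcal{L}(T_W) + (1 - \lambda)\, \mathcal{L}(T_S),$$
where $T_W$ and $T_S$ indicate the two generated sequences, and $\mathcal{L}(\cdot)$ denotes the cross entropy loss function. In detail, $\lambda$ takes separate values in pre-training and fine-tuning.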
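As a small worked example of combining the two flow losses with a coefficient, consider the sketch below. It assumes a convex combination with weights `lam` and `1 - lam` and a mean over tokens; both choices, like the function names, are assumptions for illustration.

```python
import math


def cross_entropy(gold_token_probs):
    """Mean negative log-likelihood, given the probability the model
    assigned to each gold token."""
    return -sum(math.log(p) for p in gold_token_probs) / len(gold_token_probs)


def multi_flow_loss(word_flow_probs, span_flow_probs, lam):
    """Weighted sum of the word-by-word and span-by-span flow losses."""
    word_loss = cross_entropy(word_flow_probs)
    span_loss = cross_entropy(span_flow_probs)
    return lam * word_loss + (1.0 - lam) * span_loss


# Hypothetical per-token probabilities from the two flows.
loss = multi_flow_loss([0.5, 0.5], [0.25, 0.25], lam=0.5)
```

With `lam` set to 1.0 only the word-by-word flow contributes, and with 0.5 both flows are weighted equally.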
3.4 Infilling Decoding
During decoding, the target sequence is unknown, so we insert [ATTN] step by step to gather the representation of the history context instead of preparing an artificial symbol sequence in advance. Meanwhile, for efficiency, we drop the inserted [ATTN] after inference at each step, as detailed in Figure 4.
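The decoding loop can be sketched as follows. This is a simplified greedy illustration: `predict_fn` is a hypothetical stand-in for the model, and the `[EOS]` stop symbol and `max_len` limit are assumptions not stated in the text.

```python
ATTN = "[ATTN]"
EOS = "[EOS]"  # assumed end-of-sequence symbol


def infilling_decode(source, predict_fn, max_len=20):
    """Greedy infilling decoding.

    At each step an [ATTN] symbol is appended to the history, the next
    word is predicted at that position, and the [ATTN] symbol is then
    dropped so that only real words remain in the history.
    """
    generated = []
    for _ in range(max_len):
        context = source + generated + [ATTN]  # insert [ATTN] for this step
        next_word = predict_fn(context)        # prediction at the [ATTN] slot
        if next_word == EOS:
            break
        generated.append(next_word)            # [ATTN] itself is not kept
    return generated


def toy_predict(context):
    """Hypothetical model that copies a 3-word source, then stops."""
    src = context[:3]
    n_generated = len(context) - 1 - len(src)
    return src[n_generated] if n_generated < len(src) else EOS


print(infilling_decode(["a", "b", "c"], toy_predict))
```

A real implementation would also cache the key and value states of the history so that each step only computes the new [ATTN] position.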
4 Experiments
In this section, we compare ERNIE-GEN with previous works and conduct several ablation experiments to assess the performance of the methods proposed in §3.
4.1 Pre-training and Implementation
Analogous to BERT and UNILM, ERNIE-GEN is trained on English Wikipedia (version enwiki-20181101) and BookCorpus [book], totaling 16GB. The input sequence is lowercased and truncated to a maximum length of 512. We train a base model ERNIE-GEN$_{BASE}$ ($L$=12, $H$=768, $A$=12, Total Parameters=110M) and a large model ERNIE-GEN$_{LARGE}$ ($L$=24, $H$=1024, $A$=16, Total Parameters=340M), with parameters initialized by BERT$_{BASE}$ and BERT$_{LARGE}$ respectively; here $L$ denotes the number of layers, $H$ the hidden size and $A$ the number of self-attention heads. The Adam optimizer is employed. The peak learning rate is 5e-5 with warmup over the first 4,000 steps and linear decay scheduling. The noising rate for pre-training is 0.05. Batches are organized by limiting the maximum number of tokens to 24,576. Pre-training experiments are carried out on the PaddlePaddle platform (https://github.com/PaddlePaddle/Paddle) with Nvidia Tesla V100 GPUs. We use 32 GPU cards and 64 GPU cards for ERNIE-GEN$_{BASE}$ and ERNIE-GEN$_{LARGE}$ respectively. By virtue of float16 mixed precision training, it takes almost 4 days to train ERNIE-GEN$_{BASE}$ for 400,000 steps and almost 7 days to train ERNIE-GEN$_{LARGE}$ for 450,000 steps.
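The learning-rate schedule described above (linear warmup to the peak, then linear decay) can be written out as a short function. The decay to exactly zero at the final step and the function name `learning_rate` are assumptions; the peak rate, warmup length and step count are the values reported for the base model.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=4000, total_steps=400000):
    """Linear warmup to `peak_lr`, then linear decay.

    Assumption: the rate decays to zero at `total_steps`; the text only
    says that linear decay scheduling is used.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 2000, 4000, 202000, 400000):
    print(step, learning_rate(step))
```

The rate is half of the peak at step 2,000 during warmup and again at step 202,000, halfway through the decay phase.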
4.2 Fine-tuning on Downstream Tasks
|Task||Epoch (Base / Large)||Learning Rate (Base / Large)||Noising Rate (Base / Large)||Dropout Rate (Base / Large)||Batch Size||Label Smoothing||Beam Size||Evaluation Metric|
|SQuAD QG||10 / 10||2.5e-5 / 1.5e-5||0.7 / 0.7||0.1 / 0.2||32||0.1||1||BLEU-4, METEOR (MTR), ROUGE-L (RG-L)|
|Persona-Chat||- / 30||- / 1e-4||- / 0.0||- / 0.1||64||0.1||10||BLEU-1, BLEU-2, Distinct-1, Distinct-2|
Abstractive summarization aims at generating fluent and concise summaries without being constrained to extracting subsequences from the input articles. We run experiments on the Gigaword dataset [gigaword] and the CNN/DailyMail dataset [cnn]. The Gigaword dataset contains 3.8M articles extracted from the Gigaword corpus, while the CNN/DailyMail dataset consists of 93k articles and 220k articles from CNN and the Daily Mail respectively.
|Model||Data||Params||RG-1 / RG-2 / RG-L|
|10k training samples : Gigaword 10k|
|MASS [MASS]||18G||160M||25.03 / 9.48 / 23.48|
|UNILM [UNILM]||16G||340M||32.96 / 14.68 / 30.56|
|ERNIE-GEN$_{BASE}$||16G||110M||33.75 / 15.23 / 31.35|
|ERNIE-GEN$_{LARGE}$||16G||340M||35.05 / 16.10 / 32.50|
|Fully 3.8M training samples|
|MASS [MASS]||18G||160M||37.66 / 18.53 / 34.89|
|BERTSHARE [bertshare]||16G||110M||38.13 / 19.81 / 35.62|
|UNILM [UNILM]||16G||340M||38.45 / 19.45 / 35.75|
|PEGASUS(C4) [pegasus]||750G||568M||38.75 / 19.96 / 36.14|
|PEGASUS(HugeNews) [pegasus]||3.8T||568M||39.12 / 19.86 / 36.24|
|ERNIE-GEN$_{BASE}$||16G||110M||38.83 / 20.04 / 36.20|
|ERNIE-GEN$_{LARGE}$||16G||340M||39.25 / 20.25 / 36.53|
The results on Gigaword at two scales (10k and 3.8M) are presented in Table 2, and the fine-tuning settings are shown in Table 1. On the low-resource task (Gigaword 10k), the base ERNIE-GEN model (110M) outperforms UNILM (340M) by 0.79 points in ROUGE-L (31.35 vs. 30.56), while the large ERNIE-GEN model (340M) yields a gain of 1.94 ROUGE-L (32.50 vs. 30.56) over UNILM. On the full Gigaword dataset, the large ERNIE-GEN model achieves state-of-the-art results, outperforming various previous methods. Specifically, the base ERNIE-GEN model outperforms PEGASUS (568M) trained with C4 (750G) while using only 16G of training data and 110M parameters. Meanwhile, it is also interesting to see that as the model size scales up, gains on low-resource tasks appear to be more remarkable.
|Model||Data||Params||RG-1 / RG-2 / RG-L|
|BERTSHARE [bertshare]||16G||110M||39.25 / 18.09 / 36.45|
|BERTSUMABS [bertsum]||16G||110M||41.72 / 19.39 / 38.76|
|MASS [MASS]||18G||160M||42.12 / 19.50 / 39.01|
|UNILM [UNILM]||16G||340M||43.33 / 20.21 / 40.51|
|T5 [T5]||750G||340M||42.50 / 20.68 / 39.75|
|T5 [T5]||750G||11B||43.52 / 21.55 / 40.69|
|BART [BART]||160G||400M||44.16 / 21.28 / 40.90|
|PEGASUS(C4) [pegasus]||750G||568M||43.90 / 21.20 / 40.76|
|PEGASUS(HugeNews) [pegasus]||3.8T||568M||44.17 / 21.47 / 41.11|
|ERNIE-GEN$_{BASE}$||16G||110M||42.30 / 19.92 / 39.68|
|ERNIE-GEN$_{LARGE}$||16G||340M||44.02 / 21.17 / 41.26|
Table 3 shows the performance on CNN/DailyMail. With a similar amount of pre-training data and parameters, the base ERNIE-GEN model (110M) outperforms MASS by 0.67 ROUGE-L (39.68 vs. 39.01). In a fair comparison with UNILM (340M), the large ERNIE-GEN model (340M) obtains a gain of 0.75 ROUGE-L (41.26 vs. 40.51). Meanwhile, in spite of its small pre-training data and parameter count, the large ERNIE-GEN model also achieves a state-of-the-art result on ROUGE-L and comparable performance on ROUGE-1 and ROUGE-2.
Question generation is to generate a question according to a given input passage and a corresponding answer. We evaluate on the SQuAD 1.1 dataset [squad] for the question generation task (called SQuAD QG). Following UNILM, we redistribute the original dataset into a new training set and testing set, with the original development set unchanged. We also conduct an experiment with the reversed dev-test split, as [split] indicates.
Specifically, the input source sequence is the concatenation of the input passage and the answer text, while the target sequence is the given question. We fine-tune ERNIE-GEN with the settings shown in Table 1. In Table 4, we present the results of ERNIE-GEN and several previous works. Again, ERNIE-GEN outperforms UNILM and achieves a state-of-the-art result on question generation, as measured by BLEU-4.
|Model||BLEU-1 / BLEU-2||Distinct-1 / Distinct-2|
|LIC [plato]||40.5 / 32.0||0.019 / 0.113|
|PLATO [plato]||45.8 / 35.7||0.012 / 0.064|
|PLATO [plato]||41.8 / 32.4||0.014 / 0.081|
|ERNIE-GEN||46.8 / 36.4||0.023 / 0.168|
Generative question answering and dialogue response generation in multi-turn conversations are challenging because of complex background knowledge and diverse utterances. We conduct an experiment on the Persona-Chat dataset [persona] to generate responses according to given multi-turn conversations and persona profiles. Table 5 shows that ERNIE-GEN outperforms the current task-specific pre-training model on dialogue generation. Besides, we also run an experiment on the CoQA dataset [coqa] to generate free-form answers for input questions and conversations. Measured by F1-score, our generative question answering model works considerably better than the early models listed in Table 6.
4.3 Ablation Studies
To better understand the importance of each proposed generation method, we conduct experiments concerning the following two aspects:
The robustness of the infilling generation mechanism and the noise-aware generation method against the exposure bias.
The effectiveness of the span-by-span generation task and the complete ERNIE-GEN model.
Fine-tuning methods: (1) Noising fine-tuning: fine-tuning with noise-aware generation; (2) Masking fine-tuning: only updating the gradients of masked words.
|# Model||1 Noising: Gigaword 10k (RG-1 / RG-2 / RG-L)||1 Noising: CNN/DailyMail 10k (RG-1 / RG-2 / RG-L)||1 Noising: SQuAD QG (Bleu-4 / MTR / RG-L)||2 Masking: Gigaword 10k (RG-1 / RG-2 / RG-L)||2 Masking: CNN/DailyMail 10k (RG-1 / RG-2 / RG-L)||2 Masking: SQuAD QG (Bleu-4 / MTR / RG-L)|
|1 ERNIE-GEN||33.75 / 15.23 / 31.35||39.92 / 17.46 / 37.40||23.52 / 25.61 / 51.45||33.30 / 15.04 / 31.22||39.54 / 17.61 / 37.00||22.99 / 25.14 / 51.31|
|2 - noise-aware||33.57 / 15.15 / 31.28||39.78 / 17.63 / 37.23||23.40 / 25.50 / 51.36||33.01 / 14.94 / 31.00||39.53 / 17.61 / 36.97||23.09 / 25.15 / 51.41|
|3 - span-by-span||33.43 / 15.04 / 31.14||39.75 / 17.62 / 37.21||23.37 / 25.56 / 51.32||32.97 / 14.92 / 30.94||39.54 / 17.57 / 36.95||23.10 / 25.14 / 51.42|
|4 - 2 and 3||33.23 / 14.77 / 31.00||39.71 / 17.55 / 37.18||23.34 / 25.54 / 51.30||32.57 / 14.68 / 30.60||39.49 / 17.66 / 36.96||22.89 / 25.08 / 51.28|
|#||Task (Metrics)||Typical generation||Infilling generation|
|Fine-tuning without noise-aware generation|
|1||Gigaword 10k (RG-1 / RG-2 / RG-L)||32.98 / 14.67 / 30.51||32.93 / 14.46 / 30.53|
|2||CNN/DM 10k (RG-1 / RG-2 / RG-L)||39.25 / 16.70 / 36.65||39.56 / 16.93 / 36.94|
|3||SQuAD QG (Bleu-4 / MTR / RG-L)||21.95 / 24.53 / 50.34||22.13 / 24.66 / 50.51|
|Fine-tuning with noise-aware generation|
|4||Gigaword 10k (RG-1 / RG-2 / RG-L)||32.99 / 14.83 / 30.84||33.23 / 14.77 / 31.00|
|5||CNN/DM 10k (RG-1 / RG-2 / RG-L)||39.34 / 17.30 / 36.75||39.71 / 17.55 / 37.18|
|6||SQuAD QG (Bleu-4 / MTR / RG-L)||23.23 / 25.47 / 51.25||23.34 / 25.54 / 51.30|
In Table 8, we compare two ERNIE-GEN variants that are pre-trained with the typical generation mechanism and the infilling generation mechanism respectively, and that both generate word by word. Rows 1-3 show that, without noising the ground-truth texts, infilling generation outperforms typical generation on most metrics across tasks. Furthermore, both variants achieve remarkable improvements by fine-tuning with the noise-aware generation method (rows 4-6). Meanwhile, it is interesting that infilling generation benefits more from the noising procedure, suggesting the robustness of infilling generation against mistakes during decoding. Specifically, Figure 5a shows the results with diverse choices of noising rate on Gigaword 10k and SQuAD QG, indicating that appropriate noising substantially benefits training and alleviates the training-inference discrepancy.
To further analyze how the infilling generation mechanism behaves under noising, we compute the average attention weights of source tokens, unnoised target tokens and noised target tokens in the last self-attention layer on 1,000 samples. Average attention weights under diverse noising rates are shown in Figure 5b. They show that, as the noising rate increases in fine-tuning, the model pays more attention to the decoder side in order to locate noised points, and assigns those points lower attention weights. The model is thereby able to detect and ignore false predictions, which alleviates the accumulation of mistakes during inference.
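The attention statistic used in this analysis can be sketched for a single query as below. How the weights are averaged over heads, queries and samples is not specified in the text, so the per-key grouping here, along with the function name and group labels, is an assumption for illustration.

```python
def mean_attention_by_group(attn_row, groups):
    """Average attention weight per key group for one query.

    `attn_row` holds the attention weights over all keys, and `groups`
    gives a label for each key, e.g. 'source', 'clean' or 'noised'.
    """
    totals, counts = {}, {}
    for weight, group in zip(attn_row, groups):
        totals[group] = totals.get(group, 0.0) + weight
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical attention weights over two source tokens, one unnoised
# target token and one noised target token.
stats = mean_attention_by_group(
    [0.4, 0.2, 0.3, 0.1], ["source", "source", "clean", "noised"]
)
```

Averaging such per-query statistics over many samples would give one value per group for each noising rate, which is the kind of quantity plotted in Figure 5b.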
In column 1 of Table 7, we compare 4 ERNIE-GEN variants on three tasks. We see that noise-aware generation method and span-by-span generation task (rows 2-3 of Table 7) play an important role in ERNIE-GEN pre-training and significantly outperform the baseline model which is only pre-trained with word-by-word infilling generation flow (row 4 of Table 7). After integrating noise-aware generation method and span-by-span generation task, ERNIE-GEN boosts the performance across all three tasks, as shown in row 1 of Table 7. In addition, to verify the idea that fine-tuning with unidirectional MLM like UNILM is inefficient due to the coupling of masking (noising) and predicting, we also list the fine-tuning results obtained by a unidirectional MLM with masking probability of 0.7, as shown in column 2 of Table 7. We observe that our noise-aware generation method significantly outperforms the unidirectional MLM in fine-tuning.
5 Conclusions
We present an enhanced multi-flow seq2seq pre-training and fine-tuning framework (ERNIE-GEN) for language generation, which incorporates an infilling generation mechanism and a noise-aware generation method to alleviate the exposure bias. Besides, ERNIE-GEN integrates a new span-by-span generation task to train the model to generate texts like human writing, which further improves the performance on downstream tasks. Through extensive experiments, ERNIE-GEN achieves state-of-the-art results on a range of NLG tasks.