PALM: Pre-training an Autoencoding Autoregressive Language Model for Context-conditioned Generation

04/14/2020 · by Bin Bi, et al.

Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation, as exemplified by BERT, MASS and BART. Existing pre-training techniques employ autoencoding and/or autoregressive objectives to train Transformer-based models by recovering original word tokens from corrupted text with some masked tokens. In this work, we present PALM, which pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus, especially for downstream generation conditioned on context, such as generative question answering and conversational response generation. PALM minimizes the mismatch between pre-training and fine-tuning introduced by the existing denoising schemes, in which downstream generation is more than reconstructing the original text. With a novel pre-training scheme, PALM achieves new state-of-the-art results on a variety of language generation benchmarks, covering generative question answering (Rank 1 on the official MARCO leaderboard), abstractive summarization on Gigaword, and conversational response generation on Cornell Movie Dialogues.

1 Introduction

Self-supervised pre-training has achieved great success in natural language understanding (NLU) and a wide range of NLP tasks Dai and Le (2015); Howard and Ruder (2018); Radford (2018); Peters et al. (2018); Devlin et al. (2018). A variety of training objectives and auxiliary tasks have been introduced into pre-training on massive unlabeled text data, and the pre-trained models can be further fine-tuned for downstream NLU tasks. Among existing pre-training methods, BERT-like approaches, such as BERT Devlin et al. (2018), RoBERTa Liu et al. (2019) and ALBERT Lan et al. (2019), are the most prominent, pre-training a bidirectional Transformer Vaswani et al. (2017) encoder on a large text corpus through masked language modeling and next sentence prediction. BERT-like pre-training is designed for language understanding applications that aim to extract knowledge by comprehending given contextual text.

Different from language understanding, language generation aims at generating natural language sentences, including tasks like neural machine translation Bahdanau et al. (2015); Vaswani et al. (2017), abstractive summarization Rush et al. (2015); See et al. (2017); Gehrmann et al. (2018), generative question answering (QA) Tan et al. (2017); Bi et al. (2019) and conversational response generation Vinyals and Le (2015). Many of the language generation tasks require the models to read and to comprehend a given document, based on which output text is generated. In this paper, we present PALM, a novel approach to Pre-training an Autoencoding and Autoregressive Language Model for text generation based on reading comprehension of textual context.

Recently, several pre-training methods have been proposed for language generation. GPT Radford (2018) and GPT-2 Radford et al. (2019) use a left-to-right Transformer decoder to generate a text sequence token-by-token, which lacks an encoder to condition generation on context. In contrast, MASS Song et al. (2019) and BART Lewis et al. (2019) both employ a Transformer-based encoder-decoder framework, with a bidirectional encoder over corrupted (masked) text and a left-to-right decoder reconstructing the original text. While such denoising pre-training objectives work well for downstream generation tasks where the generated text comes from the input but is manipulated, they are less suited to comprehension-based generation tasks, which instead ask for continuations, responses or answers produced by comprehending the input context.

PALM is specifically designed to pre-train a backbone model on a large unlabeled corpus for fine-tuning on downstream comprehension-based generation tasks, one example of which is generative QA. In generative question answering, QA models are asked to generate an abstractive answer in natural language to a given question by reading and comprehending a contextual passage. Abstractive answer generation is more than manipulating tokens in the passage: an abstractive answer reflects the understanding of the passage and the question, and can include content outside the passage in order to be self-contained and well-formed. To address comprehension-based generation like generative QA, PALM uses pre-training objectives that are closely related to the downstream tasks. Specifically, it differs from existing generative pre-training methods in that it goes beyond purely autoencoding or autoregressive objectives and combines the merits of autoencoding and autoregression in a single framework. Moreover, it possesses a mechanism, built into pre-training, for generating coherent text from given context.

With the new design, PALM surpasses existing language generation methods at much lower computational cost than prior pre-training approaches: it was trained on 16 NVIDIA V100 GPUs for 3 days in our experiments, and is expected to perform even better if trained for longer. PALM gives surprisingly good empirical results on a variety of context-aware generation tasks, including pushing the state-of-the-art Rouge-L on the MARCO Q&A + Natural Language Generation benchmark to 0.498 (Rank 1 on the leaderboard, http://www.msmarco.org/leaders.aspx) and on Gigaword summarization to 0.360, as well as establishing a state-of-the-art perplexity of 21.98 on generating responses to Cornell Movie Dialogues.

We make the following major contributions in this paper:

  • We propose PALM, a novel approach to pre-training a language model on a large unlabeled text corpus, which is able to comprehend contextual text. The pre-trained model is particularly effective when fine-tuned for language generation conditioned on context.

  • With lower training cost than existing pre-training methods, PALM significantly advances the state-of-the-art results on a variety of language generation applications, including generative QA, abstractive summarization and conversational response generation. This clearly demonstrates PALM’s effectiveness and generalizability in language generation.

2 Language Modeling

PALM is built upon an extension of an encoder-decoder framework. In this section, we introduce the encoder-decoder framework for language modeling, followed by the base architecture used for PALM.

2.1 Encoder-Decoder

We denote $(X, Y)$ as a pair of text pieces, where $X = (x_1, \dots, x_m)$ is the source text with $m$ tokens, and $Y = (y_1, \dots, y_n)$ is the target text with $n$ tokens. $\mathcal{X}$ and $\mathcal{Y}$ denote the sets of source text and target text, respectively. An encoder-decoder model learns the parameter set $\theta$ to estimate the conditional probability $P(Y|X; \theta)$, with log-likelihood as the objective function: $\mathcal{L}(\theta) = \sum_{(X,Y)\in(\mathcal{X},\mathcal{Y})} \log P(Y|X; \theta)$. The conditional probability $P(Y|X; \theta)$ can be further factorized according to the chain rule: $P(Y|X; \theta) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, X; \theta)$, where $y_{<t}$ denotes the token sequence preceding position $t$.

In the encoder-decoder framework, the encoder reads the source text and generates a set of representations. With the source representations and the preceding token sequence, the decoder estimates the conditional probability of each target token. An attention mechanism Bahdanau et al. (2015) is further introduced between the encoder and the decoder to identify the subset of source representations to attend to when predicting each target token.
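
To make the chain-rule objective concrete, the following is a minimal PyTorch-style sketch of the teacher-forced log-likelihood computation; the model's encode/decode interface is a hypothetical placeholder rather than a specific library API.

import torch
import torch.nn.functional as F

def seq2seq_log_likelihood(model, src_ids, tgt_ids, pad_id):
    """Log P(Y|X) factorized by the chain rule, computed with teacher forcing.

    `model` is any encoder-decoder exposing encode/decode methods that return
    per-position vocabulary logits; this interface is an illustrative assumption.
    """
    memory = model.encode(src_ids)                    # source representations
    # The decoder input is the target shifted right; position t is predicted
    # from y_<t and the source representations X.
    logits = model.decode(tgt_ids[:, :-1], memory)    # [batch, n-1, vocab]
    log_probs = F.log_softmax(logits, dim=-1)
    labels = tgt_ids[:, 1:]                           # tokens y_t to be predicted
    token_ll = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    mask = (labels != pad_id).float()                 # ignore padding positions
    return (token_ll * mask).sum()                    # sum_t log P(y_t | y_<t, X)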

2.2 Transformer Base

PALM uses the standard Transformer encoder-decoder from Vaswani et al. (2017) as the base architecture. First, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of “blocks”, each of which comprises two subcomponents: a self-attention layer followed by a small feed-forward network. Layer normalization Ba et al. (2016) is applied to the input of each subcomponent and a residual skip connection He et al. (2016) adds each subcomponent’s input to its output. Dropout Srivastava et al. (2014) is applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack.

The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs. The output of the final decoder block is fed into a dense layer with a softmax output, whose weights are shared with the input embedding matrix. All attention mechanisms in the Transformer are split up into independent “heads” whose outputs are concatenated before being further processed.
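
As an illustration, a minimal PyTorch sketch of one such encoder block is given below, assuming the pre-norm arrangement described above and the base sizes reported later in Section 4.1; it is a simplified rendition under these assumptions, not the exact implementation.

import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: self-attention followed by a
    small feed-forward network, with layer norm on each subcomponent's input,
    residual skip connections, and dropout."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Layer norm is applied to the subcomponent input; the residual
        # connection adds that input back to the subcomponent output.
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.drop(h)
        h = self.ff(self.norm2(x))
        return x + self.drop(h)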

3 PALM for Context-conditioned Generation

This section presents the new mechanism and pre-training objectives of PALM for generation conditioned on context. The differences between PALM and prior pre-training approaches are discussed as well.

3.1 Joint Modeling of Autoencoding and Autoregression

Existing Transformer-based pre-training methods employ either autoencoding or autoregressive objectives for self-supervision. Autoencoding-based pre-training aims to reconstruct the original text from corrupted input. Notable examples are BERT and its variants RoBERTa and ALBERT, where a certain portion of input tokens are replaced by a special symbol [MASK]. The models are trained to recover the original tokens from the corrupted version by utilizing bidirectional context. However, these autoencoding methods are not directly applicable to text generation, where bidirectional context is not available.

On the other hand, an autoregressive model, such as GPT Radford (2018); Radford et al. (2019), is only trained to encode unidirectional context (either forward or backward). Specifically, at each output timestep, a token is sampled from the model’s predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on. While applicable to text generation, autoregressive methods are not effective at modeling deep bidirectional context. Yet downstream generation tasks often ask a model to condition generation on given textual context. This results in a gap between autoregressive modeling and effective pre-training.

(a) GPT: Tokens are predicted autoregressively, meaning that GPT can be used for generation. However, it lacks an encoder to condition generation on context.
(b) MASS: It is based on the encoder-decoder architecture, but the decoder predicts only the tokens that are masked in the text input to the encoder.
(c) BART: Rather than only the masked tokens, the decoder reconstructs the original full sentence from the corrupted input to the encoder. However, this mismatches most downstream generation tasks, which involve more than reconstructing the original input.
(d) PALM: The encoder predicts masked tokens by encoding context bidirectionally, and the decoder predicts the text segment subsequent to the text input to the encoder, which enables the model to generate continuations in downstream tasks.
Figure 1: A schematic comparison of PALM with GPT, MASS and BART.

To close the gap, PALM is carefully designed to autoregressively generate a text sequence by comprehending the given context in a bidirectional autoencoding manner. In particular, PALM delegates autoencoding-based comprehension to the Transformer encoder, and autoregressive generation to the Transformer decoder. The encoder and decoder are jointly pre-trained in two stages:

  1. The encoder is first trained as a bidirectional autoencoder to reconstruct the original text from corrupted context in which random tokens are sampled and replaced with [MASK] symbols, following BERT’s practice Devlin et al. (2018). The training optimizes the cross-entropy reconstruction loss between the encoder’s output and the original context, as in Masked Language Modeling (MLM) in BERT. By predicting the actual tokens in context that are masked, PALM forces the encoder to comprehend the meaning of the unmasked tokens and the full context.

  2. The encoder and decoder are then jointly trained to autoregressively generate text output from the context representations produced by the encoder. The training maximizes the log-likelihood of the ground-truth text from the decoder’s output (see the sketch after this list):

     $\mathcal{L}(\theta) = \sum_{(X,Y)\in(\mathcal{X},\mathcal{Y})} \log P(Y|X; \theta) = \sum_{(X,Y)\in(\mathcal{X},\mathcal{Y})} \sum_{t=1}^{n} \log P(y_t \mid y_{<t}, X; \theta) \qquad (1)$

     where $\mathcal{X}$ represents the set of context and $\mathcal{Y}$ represents the set of text to be generated. By conditioning the generation on context representations, PALM forces the decoder to rely deeply on the context, rather than only on the preceding generated tokens, in next-token prediction, which facilitates context-sensitive generation.
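
A minimal sketch of the two objectives is given below, assuming generic encoder/decoder modules and prediction heads whose names (encoder, decoder, mlm_head, lm_head) are illustrative placeholders; in practice the first-stage loss is optimized before the joint second stage.

import torch.nn.functional as F

def palm_pretraining_losses(encoder, decoder, mlm_head, lm_head,
                            masked_ctx_ids, ctx_labels, tgt_ids, pad_id=0):
    """Sketch of the two pre-training objectives.

    Stage 1: the encoder reconstructs masked context tokens (MLM loss).
    Stage 2: the decoder autoregressively generates the continuation,
             conditioned on the encoder's context representations.
    """
    # Stage 1: autoencoding comprehension. ctx_labels holds the original token
    # ids at masked positions and -100 elsewhere (ignored by cross_entropy).
    ctx_repr = encoder(masked_ctx_ids)                         # [B, m, d]
    mlm_logits = mlm_head(ctx_repr)                            # [B, m, |V|]
    mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2), ctx_labels,
                               ignore_index=-100)

    # Stage 2: autoregressive generation of the continuation Y given context X,
    # trained with teacher forcing (decoder input is Y shifted right).
    dec_out = decoder(tgt_ids[:, :-1], ctx_repr)               # [B, n-1, d]
    lm_logits = lm_head(dec_out)                               # [B, n-1, |V|]
    lm_loss = F.cross_entropy(lm_logits.transpose(1, 2), tgt_ids[:, 1:],
                              ignore_index=pad_id)
    return mlm_loss, lm_loss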

3.2 Input&Output Representations

In the phase of model pre-training, input and output representations are tailored to minimize the discrepancy between self-supervised pre-training and supervised fine-tuning. In a typical downstream generation task (e.g., abstractive summarization and generative QA), context is given as a rather long passage, and a model is asked to generate a shorter piece of text based on the comprehension of the context.

Given a contiguous text fragment of length $L$ (composed of a few sentences) from an unlabeled corpus, PALM uses the consecutive span of length $0.8L$ from the beginning of the fragment as context input to the encoder, and uses the remaining text span of length $0.2L$ as the text output to be generated by the decoder. This representation design mimics the input and output of downstream tasks, with the hypothesis that human-written text is coherent and thus the subsequent text span of length $0.2L$ captures the comprehension of the preceding context span. In this way, PALM learns to infer the subsequent text content from the preceding content.

The collection of text fragments is constructed from a corpus by following the practice of BERT. In our experiments, we set the maximum length of a fragment to 500 tokens, i.e., $L \le 500$. Therefore, the context input consists of at most 400 tokens, and the text output consists of at most 100 tokens.
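
As a simplified illustration, the fragment split can be implemented at the token level as sketched below; the actual construction in Section 4.1 operates on sentence boundaries, so this is only an approximation.

def make_pretraining_pair(fragment_ids, max_len=500):
    """Split one contiguous token fragment into (context, continuation),
    mimicking the 80/20 design described above."""
    fragment_ids = fragment_ids[:max_len]          # enforce L <= 500
    split = int(0.8 * len(fragment_ids))           # at most 400 context tokens
    context = fragment_ids[:split]                 # encoder input
    continuation = fragment_ids[split:]            # decoder target, <= 100 tokens
    return context, continuation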

Figure 1 shows a schematic comparison of input & output representations between PALM and the existing pre-training generation methods, GPT, MASS and BART. GPT uses a decoder to predict tokens autoregressively, without an encoder to condition generation on context. MASS and BART are both trained to recover the original tokens that are masked from corrupted text, where the inputs to the encoder and the decoder come from the same text segment (Figures 1(b) and 1(c)), and they are expected to output tokens from that same sequence. By contrast, in PALM the encoder and the decoder take two different inputs: the input to the decoder is the continuation of the text input to the encoder in the contiguous text segment (Figure 1(d)). In addition to the continuation predicted by the decoder, PALM produces an extra output from the encoder, which contains the predicted tokens masked in the input (Figure 1(d)). The output predictions from the encoder and the decoder are used for training in the two stages, respectively.

3.3 Copying Tokens from Context

In a human-written document, subsequent text often refers back to entities and tokens that appear earlier in the preceding text. Incorporating the copy mechanism into pre-training on an unlabeled corpus therefore increases the coherence of text generated downstream. It allows the model to learn during pre-training when and how to copy tokens while generating text, and this knowledge is transferred to downstream fine-tuning.

Figure 2: The pointer-generator network on top of the decoder in Transformer. For each decoding step $t$, a mixture weight between generating tokens from the vocabulary and copying tokens from the context is calculated. The two distributions are combined as a weighted sum to obtain the final distribution.

PALM incorporates the copy mechanism by plugging in the pointer-generator network See et al. (2017); Nishida et al. (2019) on top of the decoder in Transformer. Figure 2 illustrates the pointer-generator network, which allows every token to be either generated from a vocabulary or copied from context in generating text.

Extended vocabulary distribution. Let the extended vocabulary, $V_{\mathrm{ext}}$, be the union of the words in the vocabulary and all tokens present in the context. $P_v(y_t)$ then denotes the probability distribution of the $t$-th word token, $y_t$, over the extended vocabulary, defined as:

$P_v(y_t) = \mathrm{softmax}\big(W_e^\top(W_v s_t + b_v)\big) \qquad (2)$

where $s_t$ denotes the output representation of the $t$-th token from the decoder. The output embedding $W_e$ is tied with the corresponding part of the input embedding Inan et al. (2017), and $W_v$ and $b_v$ are learnable parameters.

Copy distribution. PALM uses an additional attention layer for the copy distribution on top of the decoder. In the course of generation, the layer takes $s_t$ as the query, and outputs $\alpha_t$ as the attention weights and $c_t$ as the context vector:

$e_{t,i} = w_a^\top \tanh(W_q s_t + W_k h_i) \qquad (3)$
$\alpha_{t,i} = \mathrm{softmax}_i(e_{t,i}) \qquad (4)$
$c_t = \textstyle\sum_i \alpha_{t,i} h_i \qquad (5)$

where $h_i$ is the representation of the $i$-th token in the context from the encoder, and $W_q$, $W_k$ and $w_a$ are learnable parameters. As a result, $P_c(y_t)$ is the copy distribution over the extended vocabulary, defined as:

$P_c(y_t) = \textstyle\sum_{i:\, x_i = y_t} \alpha_{t,i} \qquad (6)$

Final distribution. The final probability of generating $y_t$ is defined as a mixture of the extended vocabulary distribution and the copy distribution:

$P(y_t) = \lambda_t P_v(y_t) + (1 - \lambda_t) P_c(y_t) \qquad (7)$
$\lambda_t = \sigma(w_c^\top c_t + w_s^\top s_t + b_\lambda) \qquad (8)$

where $w_c$, $w_s$ and $b_\lambda$ are learnable parameters.

The parameters of the pointer-generator learned in pre-training are all kept and passed downstream for fine-tuning on labeled data.
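
For illustration, the mixing step of the final distribution (Eqs. (6) and (7), with the mixture weight of Eq. (8) assumed to be computed beforehand) can be sketched as follows; the tensor names are illustrative, and the generation distribution is assumed to be already projected onto the extended vocabulary.

import torch

def pointer_generator_step(p_vocab, copy_attn, src_ext_ids, lam, ext_vocab_size):
    """Mix the vocabulary distribution with the copy distribution for one
    decoding step.

    p_vocab:      [B, V_ext]  generation distribution over the extended vocabulary
    copy_attn:    [B, m]      attention weights over the m context tokens
    src_ext_ids:  [B, m]      context token ids mapped into the extended vocabulary
    lam:          [B, 1]      mixture weight (Eq. 8), computed elsewhere
    """
    # Eq. (6): scatter attention mass onto the extended-vocabulary ids of the
    # context tokens; tokens that appear several times accumulate their weights.
    p_copy = torch.zeros(copy_attn.size(0), ext_vocab_size,
                         device=copy_attn.device)
    p_copy.scatter_add_(1, src_ext_ids, copy_attn)
    # Eq. (7): weighted sum of generating from the vocabulary and copying.
    return lam * p_vocab + (1.0 - lam) * p_copy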

4 Experiments

In this section, we present the experimental setup and results of PALM pre-training on a large unlabeled corpus and fine-tuning on a variety of language generation tasks, including generative QA, abstractive summarization and conversational response generation.

Example 1
Input: A classic Aston Martin once owned by Spartacus star Peter Ustinov is set to fetch more than £1 million at auction - twice what it fetched four years ago. The actor bought the Aston Martin DB4 Cabriolet in 1962, shortly after winning a Best Supporting Actor Oscar for his role as Batiatus in Spartacus. It was one of the most luxurious cars of its day, costing £4,000 and was delivered to him at a Swiss hotel at a time when the average house price in Britain was just £2,500.
PALM: The Aston Martin DB4 Cabriolet was bought by Peter Ustinov for £4000 and was expected to fetch for £2.5 million for auction. The car was sold for £1.2 million finally.
MASS: peter ustinov’s UNK auctioned for more than $1 million.
Example 2
Input: Cape Verde’s 2-0 win over Portugal was the most eye-catching international result of the week. So, who are Cape Verde and why has this tiny island off the west coast of Africa suddenly become an international football force? Where are the Cape Verde Islands? Cape Verde is a group of islands 400 miles from Senegal off the west coast of Africa. Its population is around 500,000 and boasts both beautiful beaches and striking volcanic landscapes, making it a haven for tourism.
PALM: Cape Verde is a small island off the west coast of Africa with a population of around 500,000 and boasts both beautiful beaches and striking volcanic landscapes, making it a haven for tourism. Cape Verde is home to the Cape Verde Islands National Park with a number of islands.
MASS: tiny african island nation cape verde has beautiful beaches.
Table 1: Example continuations generated by PALM and MASS for the same text input.

4.1 Pre-training Configuration

Experimental Setup. PALM is based on the Transformer, which consists of a 12-layer encoder and a 12-layer decoder with 768 embedding/hidden size, 3072 feed-forward filter size and 12 attention heads. The parameters of the encoder are initialized with the pre-trained RoBERTa-Base model (https://github.com/pytorch/fairseq), which was trained with the Masked LM objective and removes Next Sentence Prediction from BERT.

PALM is trained with a dropout rate of 0.1 on all layers and attention weights, and a GELU activation function Hendrycks and Gimpel (2016) as used in GPT. The learning rate is set to 1e-5, with linear warmup over the first 10k steps and linear decay. The pre-training procedure runs on 16 NVIDIA V100 GPU cards for 800K steps, with each minibatch containing 64 sequences of a maximum length of 500 tokens.
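
For reference, the setup above can be summarized as the following configuration sketch; the field names are illustrative and do not correspond to actual training-script arguments.

# Approximate pre-training configuration as reported above (a sketch).
palm_pretrain_config = {
    "encoder_layers": 12, "decoder_layers": 12,
    "hidden_size": 768, "ffn_size": 3072, "attention_heads": 12,
    "encoder_init": "RoBERTa-base",
    "dropout": 0.1, "activation": "gelu",
    "learning_rate": 1e-5, "warmup_steps": 10_000, "lr_schedule": "linear_decay",
    "total_steps": 800_000, "batch_size": 64, "max_seq_len": 500,
    "gpus": 16,
}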

Pre-training Dataset. We use documents from English Wikipedia and BookCorpus Zhu et al. (2015) as our pre-training corpus, and perform WordPiece tokenization as in BERT Devlin et al. (2018). The documents are split into sentences. Different from BERT, we use multiple consecutive sentences of up to 400 tokens as the source text input to the encoder, and use the subsequent consecutive sentences of up to 100 tokens as the target text for the decoder. The pre-training dataset is constructed from the documents by a sliding window with a stride of one sentence, resulting in 50M pre-training pairs.

4.2 Unsupervised Pre-training

To understand the performance of PALM pre-training, we compare the generation quality of the pre-trained models of PALM and MASS (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth). Specifically, we feed a few sentences from a news article to both pre-trained models, and the models generate a continuation of the input sentences by beam search with a beam size of 5. News articles from CNN (https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ) are used as input text to eliminate the possibility that the text is present in the models’ pre-training corpora, i.e., Wikipedia and BookCorpus.

The overall perplexity of PALM is 17.22, which is much better than MASS’s perplexity of 170.32, indicating PALM’s stronger language modeling. Table 1 illustrates a couple of example continuations generated by PALM and MASS. In both examples, PALM generates fluent and grammatical English, while MASS outputs a short sentence that is much less relevant to the input text, since the MASS model was trained on individual sentences. In the first example, it is interesting to observe that in addition to summarizing the input content, PALM makes a non-trivial inference about the expected auction price and the final selling price of the car (though it may not be factually accurate). An inference is also made by PALM in the second example in addition to summarization, although the Cape Verde Islands National Park does not really exist.

These examples demonstrate that PALM pre-training has learned to infer and to reason from the input text. Although in the pre-training phase the generated content may not be factually accurate in the absence of rich context, the capability of inference can be transferred downstream by fine-tuning on specific generation tasks.

4.3 Fine-tuning on Generative QA

We also experiment with fine-tuning PALM on several downstream generation tasks. The MARCO benchmark Nguyen et al. (2016) released by Microsoft is the best fit for evaluating generative QA models. In the MARCO dataset, the questions are user queries issued to the Bing search engine and the contextual passages are from real web documents. The data has been split into a training set (153,725 QA pairs), a dev set (12,467 QA pairs) and a test set (101,092 questions with unpublished answers). To evaluate the generative capability, we focus on the Q&A + Natural Language Generation task, the goal of which is to provide the best answer available in natural language that could be used by a smart device / digital assistant.

The answers are human-generated and not necessarily sub-spans of the contextual passages, so we use the ROUGE-L Lin (2004) metric for our evaluation to measure the quality of generated answers against the ground truth.

We fine-tune the pre-trained PALM on the MARCO training set for 10 epochs. We set the batch size to 64, the learning rate to 1e-5, and the maximum input length to 512. The other hyper-parameters are kept the same as in pre-training. In fine-tuning PALM, the encoder takes as input a contextual passage concatenated with the question at the end, and the decoder takes the answer as input. During decoding, we use beam search with a beam size of 5.
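
For concreteness, a minimal sketch of how a fine-tuning example can be assembled is given below; the tokenizer interface and separator handling are illustrative assumptions, not the exact preprocessing used in our experiments.

def build_marco_example(tokenizer, passage, question, answer, max_input_len=512):
    """Assemble one generative-QA fine-tuning example (a sketch)."""
    # Encoder input: the contextual passage with the question concatenated at the end.
    src_ids = tokenizer.encode(passage + " " + question)[:max_input_len]
    # Decoder target: the abstractive, human-written answer.
    tgt_ids = tokenizer.encode(answer)
    return src_ids, tgt_ids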

Method Rouge-L
ConZNet Indurthi et al. (2018) 0.421
Reader-Writer 0.439
KIGN-QA 0.441
SNET+CES2S 0.450
Communicating BERT 0.483
VNET Wang et al. (2018) 0.484
Selector NLGEN 0.487
BERT+Multi-Pointer 0.495
Masque Nishida et al. (2019) 0.496
PALM 0.498
Table 2: Test results of answer generation on the official MARCO leaderboard as of December 9, 2019.

Table 2 presents the answer generation results on the test set obtained from the official MARCO leaderboard. PALM achieves 1st place on the leaderboard, outperforming all competing methods in generation quality. Note that PALM is a single pre-trained model, while some of the top-performing methods on the leaderboard, such as Masque, are ensembles. Crucially, the superiority of single-model PALM over the Masque ensemble built on pre-trained ELMo Peters et al. (2018), and over BERT-based methods, clearly demonstrates the effectiveness and generalizability of PALM over the other pre-training approaches in language modeling.

4.4 Fine-tuning on Summarization

Text summarization produces a concise and fluent summary conveying the key information in the input (e.g., a news article). We focus on abstractive summarization, a generation task where the summary is not constrained to reusing the phrases or sentences in the input text. Following MASS, we use the Gigaword dataset Graff and Cieri (2003) for model fine-tuning and evaluation, which consists of a total of 3.8M article-title pairs in English. We take the articles as the input to the encoder and the titles as the target for the decoder. We adopt the same optimization hyperparameters from generative QA fine-tuning for the summarization task. The F1 scores of Rouge-1, Rouge-2 and Rouge-L are reported on the Gigaword test set for evaluation.

Method Rouge-1 Rouge-2 Rouge-L
DAE 35.97 17.17 33.14
BERT+LM 37.75 18.45 34.85
OpenNMT 36.73 17.86 33.68
CGU 36.30 18.00 33.80
EndDec 36.30 17.31 33.88
FTSumg 37.27 17.65 34.24
ReSum 37.04 19.03 34.46
ConceptPtr 37.01 17.10 34.87
Seq2Seq+E2T 37.04 16.66 34.93
MASS 37.66 18.53 34.89
UniLM 38.45 19.45 35.75
PALM 38.75 19.79 35.98
Table 3: Results of abstractive summarization on the Gigaword test set. Results of DAE (Denoising Auto-Encoder) and BERT+LM are taken from Song et al. (2019). The other baselines are from Klein et al. (2017); Lin et al. (2018); Suzuki and Nagata (2017); Cao et al. (2018); Cao et al. (2018); Wang et al. (2019); Amplayo et al. (2018); Dong et al. (2019).

As shown in Table 3, PALM achieves better performance than all existing abstractive summarization models. It is worth noting that UniLM, MASS, BERT+LM and DAE are pre-trained on an unlabeled corpus before supervised fine-tuning on the summarization data. By consistently outperforming these pre-training methods, PALM confirms its effectiveness in leveraging self-supervision signals for language generation.

4.5 Fine-tuning on Response Generation

Conversational response generation aims to produce a flexible response to a conversation Vinyals and Le (2015). Following MASS, we conduct experiments on the Cornell Movie Dialog corpus (https://github.com/suriyadeepan/datasets/tree/master/seq2seq/cornell_movie_corpus) Danescu-Niculescu-Mizil and Lee (2011), which contains 140K conversation pairs, and use the training/test splits provided by the dataset. The same training hyperparameters from generative QA fine-tuning are adopted for the response generation task. We report results in perplexity following Vinyals and Le (2015) (lower is better).

Method Perplexity (10K Data) Perplexity (110K Data)
Baseline 82.39 26.38
BERT+LM 80.11 24.84
MASS 74.32 23.52
PALM 45.43 21.98
Table 4: Results of conversational response generation in terms of perplexity on Cornell Movie Dialog corpus (lower is better).

We compare PALM with competing methods, including the baseline trained only on the available data pairs, and the pre-trained BERT+LM and MASS. Following MASS, we train every model on 10K randomly sampled pairs and on all 110K training pairs. As shown in Table 4, PALM outperforms all the competitors by a large margin on both the 10K and 110K data, demonstrating its capability of generating responses to context thanks to its new pre-training objectives.

5 Related Work

ELMo Peters et al. (2018) is an early prominent pre-training method based on bidirectional LSTMs. It concatenates left-only and right-only representations, but does not pre-train interactions between these features. GPT Radford (2018) and GPT-2 Radford et al. (2019) build language modeling on the Transformer architecture and use only the Transformer decoder for pre-training. Edunov et al. (2019) examine different strategies (e.g., ELMo) for adding contextualized embeddings to sequence-to-sequence models, and observe the most improvement by adding the learned embeddings to the encoder.

BERT Devlin et al. (2018) introduces Masked Language Modelling, which allows pre-training to learn interactions between left and right context words. Recent work has shown that very strong performance can be achieved by training for longer Liu et al. (2019), by tying parameters across layers Lan et al. (2019), and by masking spans instead of words Joshi et al. (2019). Predictions are not made autoregressively, reducing the effectiveness of BERT for generation tasks.

UniLM Dong et al. (2019) fine-tunes BERT with an ensemble of masks, some of which use only leftward context, allowing UniLM to be used for generation tasks. A difference between UniLM and PALM is that UniLM predictions are conditionally independent, whereas PALM’s are autoregressive. PALM reduces the mismatch between pre-training and context-conditioned generation tasks by forcing the decoder to predict the continuation of the text input to the encoder on an unlabeled corpus.

MASS Song et al. (2019) and BART Lewis et al. (2019) are the two pre-training approaches most similar to PALM. In MASS, an input sequence where a contiguous span of tokens is masked is mapped to a sequence consisting of the missing tokens, whereas BART is trained to reconstruct the original text from corrupted input with some masked tokens. The difference in input & output representations between PALM and MASS & BART is detailed in Section 3.2.

6 Conclusions

In this work, we propose PALM, a novel approach to pre-training an autoencoding and autoregressive language model on a large unlabeled corpus, designed to be fine-tuned on downstream generation conditioned on context. It is built upon an extension of the Transformer encoder-decoder, and jointly pre-trains the encoder and the decoder in an autoencoding denoising stage followed by an autoregressive generation stage.

With lower training cost than existing pre-training approaches, PALM significantly advances the state-of-the-art results on a variety of context-conditioned generation applications, including generative QA (Rank 1 on the MARCO leaderboard), abstractive summarization and conversational response generation. Prior work Liu et al. (2019) has shown that training for more steps over a larger corpus can further improve the performance of pre-training. Our future work will explore the potential of training PALM for longer on much more unlabeled text data.

References

  • R. K. Amplayo, S. Lim, and S. Hwang (2018) Entity commonsense representation for neural abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 697–707. External Links: Link, Document Cited by: Table 3.
  • L. J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. CoRR abs/1607.06450. External Links: Link, 1607.06450 Cited by: §2.2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §1, §2.1.
  • B. Bi, C. Wu, M. Yan, W. Wang, J. Xia, and C. Li (2019) Incorporating external knowledge into machine reading for generative question answering. External Links: 1909.02745 Cited by: §1.
  • Z. Cao, W. Li, S. Li, and F. Wei (2018) Retrieve, rerank and rewrite: soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 152–161. External Links: Link, Document Cited by: Table 3.
  • Z. Cao, F. Wei, W. Li, and S. Li (2018) Faithful to the original: fact aware neural abstractive summarization. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: Table 3.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 3079–3087. External Links: Link Cited by: §1.
  • C. Danescu-Niculescu-Mizil and L. Lee (2011) Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, Portland, Oregon, USA, pp. 76–87. External Links: Link Cited by: §4.5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation, §1, item 1, §4.1, §5.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32, pp. 13042–13054. External Links: Link Cited by: Table 3, §5.
  • S. Edunov, A. Baevski, and M. Auli (2019) Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4052–4059. External Links: Link, Document Cited by: §5.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792. Cited by: §1.
  • D. Graff and C. Cieri (2003) English gigaword. In Linguistic Data Consortium, Philadelphia. Cited by: §4.4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document, ISSN 1063-6919 Cited by: §2.2.
  • D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §4.1.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. External Links: Link, Document Cited by: §1.
  • H. Inan, K. Khosravi, and R. Socher (2017) Tying word vectors and word classifiers: a loss framework for language modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §3.3.
  • S. R. Indurthi, S. Yu, S. Back, and H. Cuayáhuitl (2018) Cut to the chase: a context zoom-in network for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 570–575. External Links: Link, Document Cited by: Table 2.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529. Cited by: §5.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, pp. 67–72. External Links: Link Cited by: Table 3.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv abs/1909.11942. Cited by: §1, §5.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv e-prints. External Links: 1910.13461 Cited by: PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation, §1, §5.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out 2004, pp. 10. Cited by: §4.3.
  • J. Lin, X. Sun, S. Ma, and Q. Su (2018) Global encoding for abstractive summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 163–169. External Links: Link, Document Cited by: Table 3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, §5, §6.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS), Cited by: §4.3.
  • K. Nishida, I. Saito, K. Nishida, K. Shinoda, A. Otsuka, H. Asano, and J. Tomita (2019) Multi-style generative reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2273–2284. External Links: Link, Document Cited by: §3.3, Table 2.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §1, §4.3, §5.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1, §3.1, §5.
  • A. Radford (2018) Improving language understanding by generative pre-training. Cited by: §1, §1, §3.1, §5.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389. External Links: Link, Document Cited by: §1.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083. External Links: Link, Document Cited by: §1, §3.3.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. CoRR abs/1905.02450. External Links: Link, 1905.02450 Cited by: PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation, §1, Table 3, §5.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: ISSN 1532-4435, Link Cited by: §2.2.
  • J. Suzuki and M. Nagata (2017) Cutting-off redundant repeating generations for neural abstractive summarization. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 291–297. External Links: Link Cited by: Table 3.
  • C. Tan, F. Wei, N. Yang, W. Lv, and M. Zhou (2017) S-net: from answer extraction to answer generation for machine reading comprehension. CoRR abs/1706.04815. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1, §1, §2.2.
  • O. Vinyals and Q. Le (2015) A neural conversational model. ICML Deep Learning Workshop, 2015. Cited by: §1, §4.5.
  • W. Wang, Y. Gao, H. Huang, and Y. Zhou (2019) Concept pointer network for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3074–3083. External Links: Link, Document Cited by: Table 3.
  • Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang (2018) Multi-passage machine reading comprehension with cross-passage answer verification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 1918–1927. External Links: Link, Document Cited by: Table 2.
  • Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. CoRR abs/1506.06724. External Links: Link, 1506.06724 Cited by: §4.1.