1 Introduction

Self-supervised pre-training has achieved great success in natural language understanding (NLU) and a wide range of NLP tasks Dai and Le (2015); Howard and Ruder (2018); Radford (2018); Peters et al. (2018); Devlin et al. (2018). A variety of training objectives and auxiliary tasks have been introduced into pre-training on massive unlabeled text data, and the pre-trained models can be further fine-tuned for downstream NLU tasks. Among existing pre-training methods, BERT-like approaches, such as BERT Devlin et al. (2018), RoBERTa Liu et al. (2019) and ALBERT Lan et al. (2019), are the most prominent: they pre-train a bidirectional Transformer Vaswani et al. (2017) encoder on a large text corpus through masked language modeling and next sentence prediction. BERT-like pre-training is designed for language understanding applications that aim to extract knowledge by comprehending given contextual text.
Different from language understanding, language generation aims at generating natural language sentences, including tasks like neural machine translation Bahdanau et al. (2015); Vaswani et al. (2017), abstractive summarization Rush et al. (2015); See et al. (2017); Gehrmann et al. (2018), generative question answering (QA) Tan et al. (2017); Bi et al. (2019) and conversational response generation Vinyals and Le (2015). Many of the language generation tasks require the model to read and comprehend a given document, based on which the output text is generated. In this paper, we present PALM, a novel approach to Pre-training an Autoencoding & Autoregressive Language Model for text generation based on reading comprehension of textual context.
Recently, several pre-training methods have been proposed for language generation. GPT Radford (2018) and GPT-2 Radford et al. (2019) use a left-to-right Transformer decoder to generate a text sequence token-by-token, which lacks an encoder to condition generation on context. In contrast, MASS Song et al. (2019) and BART Lewis et al. (2019) both employ a Transformer-based encoder-decoder framework, with a bidirectional encoder over corrupted (masked) text and a left-to-right decoder reconstructing the original text. While such denoising pre-training objectives work well for downstream generation tasks where the generated text derives from the input but is manipulated, they are less suited to comprehension-based generation tasks, which instead require generating continuations, responses or answers by comprehending the input context.
PALM is specifically designed to pre-train a backbone model on a large unlabeled corpus for fine-tuning on the downstream comprehension-based generation tasks, one example of which is generative QA. In generative question answering, QA models are asked to generate an abstractive answer in natural language to a given question by reading and comprehending a contextual passage. Abstractive answer generation is more than manipulating tokens in the passage. An abstractive answer reflects the understanding of the passage and the question, and can include content out of the passage to be self-contained and well-formed. To address comprehension-based generation like generative QA, PALM uses the pre-training objectives that are closely related to the downstream tasks. Specifically, it differs from existing generative pre-training methods in that PALM goes beyond the solely autoencoding/autoregressive methods and combines the merits of autoencoding and autoregression in a single framework. Moreover, it possesses a mechanism built in pre-training for generating coherent text from given context.
With the new design, PALM can surpass existing language generation methods at much lower computational cost than prior pre-training approaches: it was trained on 16 NVIDIA V100 GPUs for 3 days in our experiments, and is expected to perform even better if trained for longer. PALM gives surprisingly good empirical results on a variety of context-aware generation tasks, including pushing the state-of-the-art Rouge-L on the MARCO Q&A + Natural Language Generation benchmark to 0.498 (Rank 1 on the leaderboard: http://www.msmarco.org/leaders.aspx) and on Gigaword summarization to 0.360, as well as establishing the state-of-the-art perplexity of 21.98 on generating responses to Cornell Movie Dialogues.
We make the following major contributions in this paper:
We propose PALM, a novel approach to pre-training a language model on a large unlabeled text corpus, which is able to comprehend contextual text. The pre-trained model is particularly effective to be fine-tuned for language generation conditioned on context.
With less training cost than that of existing pre-training methods, PALM significantly advances the state-of-the-art results on a variety of language generation applications, including generative QA, abstractive summarization and conversational response generation. It clearly demonstrates PALM’s effectiveness and generalizability in language generation.
2 Language Modeling
PALM is built upon an extension of an encoder-decoder framework. In this section, we introduce the encoder-decoder framework for language modeling, followed by the base architecture used for PALM.
2.1 Encoder-Decoder Framework

We denote $(x, y)$ as a pair of text pieces, where $x = (x_1, \dots, x_m)$ is the source text with $m$ tokens, and $y = (y_1, \dots, y_n)$ is the target text with $n$ tokens. $\mathcal{X}$ and $\mathcal{Y}$ denote the sets of source text and target text, respectively. An encoder-decoder model learns the parameter set $\theta$ with log-likelihood as the objective function: $\mathcal{L}(\theta) = \sum_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \log P(y \mid x; \theta)$. The conditional probability can be further factorized according to the chain rule: $P(y \mid x; \theta) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, x; \theta)$, where $y_{<t}$ denotes the token sequence preceding position $t$.
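As a quick numeric illustration of the chain-rule factorization (the per-step probabilities below are made-up stand-ins for a decoder's softmax outputs, not real model predictions), the sequence log-likelihood is simply the sum of per-token log-probabilities:

```python
import math

# Chain rule: log P(y|x) = sum_t log P(y_t | y_<t, x).
step_probs = [0.5, 0.4, 0.8]  # hypothetical P(y_t | y_<t, x) for t = 1..3

log_likelihood = sum(math.log(p) for p in step_probs)
joint = math.exp(log_likelihood)

# The exponentiated sum of log-probs equals the product of step probabilities.
assert abs(joint - 0.5 * 0.4 * 0.8) < 1e-12
```

Maximizing the summed log-probabilities is therefore equivalent to maximizing the joint probability of the target sequence.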
In the encoder-decoder framework, the encoder reads the source text and generates a set of representations. With the source representations and its preceding token sequence, the decoder estimates the conditional probability of each target token. An attention mechanism Bahdanau et al. (2015) is further introduced between the encoder and the decoder to identify a subset of source representations to attend for predicting each target token.
2.2 Transformer Base
PALM uses the standard Transformer encoder-decoder from Vaswani et al. (2017) as the base architecture. First, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of “blocks”, each of which comprises two subcomponents: a self-attention layer followed by a small feed-forward network. Layer normalization Ba et al. (2016) is applied to the input of each subcomponent and a residual skip connection He et al. (2016) adds each subcomponent’s input to its output. Dropout Srivastava et al. (2014) is applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack.
The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs. The output of the final decoder block is fed into a dense layer with a softmax output, whose weights are shared with the input embedding matrix. All attention mechanisms in the Transformer are split up into independent “heads” whose outputs are concatenated before being further processed.
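The causal self-attention in the decoder can be sketched in a few lines. This is an illustrative reimplementation (not the paper's code): each position's softmax is computed only over the positions it is allowed to see.

```python
import math

def causal_softmax(scores):
    """Row t of `scores` holds raw attention logits; mask out columns > t,
    then normalize, so position t attends only to past (and current) outputs."""
    n = len(scores)
    out = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                       # positions <= t only
        m = max(visible)                             # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        out.append([e / z for e in exps] + [0.0] * (n - t - 1))
    return out

w = causal_softmax([[0.0, 9.0, 9.0],
                    [1.0, 1.0, 9.0],
                    [1.0, 2.0, 3.0]])
assert w[0] == [1.0, 0.0, 0.0]        # first token sees only itself
assert abs(sum(w[2]) - 1.0) < 1e-12   # each row remains a distribution
```

Note how the large logits on future positions (e.g., the 9.0 entries) have no effect, since they are masked before normalization.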
3 PALM for Context-conditioned Generation
This section presents the new mechanism and pre-training objectives of PALM for generation conditioned on context. The differences between PALM and prior pre-training approaches are discussed as well.
3.1 Joint Modeling of Autoencoding and Autoregression
Existing Transformer-based pre-training methods employ either autoencoding or autoregressive objectives for self-supervision. Autoencoding-based pre-training aims to reconstruct the original text from corrupted input. Notable examples are BERT and its variants RoBERTa and ALBERT, where a certain portion of input tokens are replaced by a special symbol [MASK]. The models are trained to recover the original tokens from the corrupted version by utilizing bidirectional context. However, these autoencoding methods are not applicable to text generation where bidirectional contexts are not available.
On the other hand, an autoregressive model, such as GPT Radford (2018); Radford et al. (2019), is only trained to encode unidirectional context (either forward or backward). Specifically, at each output timestep, a token is sampled from the model's predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on. While applicable to text generation, the autoregressive methods are not effective at modeling deep bidirectional context. However, downstream generation tasks often require a model to condition generation on given textual context. This results in a gap between autoregressive modeling and effective pre-training.
To close the gap, PALM is carefully designed to autoregressively generate a text sequence by comprehending the given context in a bidirectional autoencoding manner. In particular, PALM delegates autoencoding-based comprehension to the encoder in Transformer, and autoregressive generation to the Transformer decoder. The encoder and decoder are jointly pre-trained in two stages:
The encoder is first trained as a bidirectional autoencoder to reconstruct the original text from corrupted context in which random tokens are sampled and replaced with [MASK] symbols following BERT's practice Devlin et al. (2018). The training optimizes the cross-entropy reconstruction loss between the encoder's output and the original context, as in Masked Language Modeling (MLM) in BERT. By predicting the actual tokens in context that are masked, PALM forces the encoder to comprehend the meaning of the unmasked tokens and the full context.
The encoder and decoder are then jointly trained to autoregressively generate text output out of the context representations from the encoder. The training maximizes the log-likelihood of the ground-truth text from the decoder's output:

$$\mathcal{L}(\theta) = \sum_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \log P(y \mid x; \theta)$$

where $\mathcal{X}$ represents the set of context and $\mathcal{Y}$ represents the set of text to be generated. By conditioning the generation on context representations, PALM forces the decoder to rely deeply on the context instead of only the preceding generated tokens in next-token prediction, which facilitates context-sensitive generation.
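The stage-1 corruption can be sketched as follows. `mask_tokens` is a hypothetical helper written for illustration; BERT's full 80/10/10 replacement heuristics (Devlin et al., 2018) are omitted for brevity:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return the corrupted
    sequence and the reconstruction targets (None where no loss applies)."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets.append(tok)      # the encoder must reconstruct this token
        else:
            corrupted.append(tok)
            targets.append(None)     # no loss at unmasked positions
    return corrupted, targets

corrupted, targets = mask_tokens("the cat sat on the mat".split())
assert len(corrupted) == 6
# Loss targets exist only at masked positions.
assert all(t is None or corrupted[i] == "[MASK]" for i, t in enumerate(targets))
```

In stage 2, the same encoder consumes the (uncorrupted) context span while the decoder is trained to produce the continuation, so both objectives share the encoder's parameters.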
3.2 Input & Output Representations
In the phase of model pre-training, input and output representations are tailored to minimize the discrepancy between self-supervised pre-training and supervised fine-tuning. In a typical downstream generation task (e.g., abstractive summarization and generative QA), context is given as a rather long passage, and a model is asked to generate a shorter piece of text based on the comprehension of the context.
Given a contiguous text fragment of length $L$ (composed of a few sentences) from an unlabeled corpus, PALM uses the consecutive span of length $0.8L$ from the beginning of the fragment as context input to the encoder, and uses the remaining text span of length $0.2L$ as the text output to be generated by the decoder. This representation design mimics the input and output of downstream tasks, with the hypothesis that human-written text is coherent and thus the subsequent span of length $0.2L$ captures the comprehension of the preceding context span. In this way, PALM learns to infer the subsequent text content from the preceding content.

The collection of text fragments is constructed from a corpus by following the practice of BERT. In our experiments, we set the maximum length of a fragment to 500 tokens, i.e., $L = 500$. Therefore, the context input consists of at most 400 tokens, and the text output consists of at most 100 tokens.
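The construction of a (context, target) pre-training pair from a fragment can be sketched as follows; `split_fragment` is a hypothetical helper name used only for illustration:

```python
def split_fragment(tokens, max_len=500, context_frac=0.8):
    """Truncate a fragment to `max_len` tokens, then use the first 80% as
    the encoder's context input and the remaining 20% as the decoder's target."""
    tokens = tokens[:max_len]
    cut = int(len(tokens) * context_frac)
    return tokens[:cut], tokens[cut:]   # (encoder input, decoder target)

fragment = [f"tok{i}" for i in range(500)]
context, target = split_fragment(fragment)
assert len(context) == 400 and len(target) == 100
```

With the 500-token maximum fragment length, this yields exactly the 400-token context and 100-token target described above.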
Figure 1 shows a schematic comparison of input & output representations between PALM and the existing pre-training generation methods GPT, MASS and BART. GPT uses a decoder to predict tokens autoregressively, without an encoder to condition generation on context. MASS and BART are both trained to recover the original tokens that are masked from corrupted text, where the inputs to the encoder and the decoder come from the same text segment (Figures 1(b) and 1(c)), and the models are also expected to output tokens from that same text sequence. By contrast, in PALM the encoder and the decoder take two different inputs: the input to the decoder is the continuation of the text input to the encoder in the contiguous text segment (Figure 1(d)). In addition to the continuation predicted by the decoder, PALM produces an extra output from the encoder, which contains the predicted tokens masked in the input (Figure 1(d)). The output predictions from the encoder and the decoder are used for training in the two stages, respectively.
3.3 Copying Tokens from Context
In a human-written document, subsequent text often refers back to entities and tokens present earlier in the preceding text. Therefore, incorporating the copy mechanism into pre-training on an unlabeled corpus would increase the coherence of text generated in downstream tasks. This allows the model to learn from pre-training when and how to copy tokens in generating text, and this knowledge is transferred to downstream fine-tuning.
PALM incorporates the copy mechanism by plugging in the pointer-generator network See et al. (2017); Nishida et al. (2019) on top of the decoder in Transformer. Figure 2 illustrates the pointer-generator network, which allows every token to be either generated from a vocabulary or copied from context in generating text.
Extended vocabulary distribution. Let the extended vocabulary $V_{ext}$ be the union of the words in the vocabulary $V$ and all tokens present in context. $P^v_t$ then denotes the probability distribution of the $t$-th token $y_t$ over the extended vocabulary, defined as:

$$P^v_t = \mathrm{softmax}\left(W_2 (W_1 s_t + b_1) + b_2\right)$$

where $s_t$ denotes the output representation of the $t$-th token from the decoder. The output embedding $W_2$ is tied with the corresponding part of the input embedding Inan et al. (2017), and $W_1$ and $b_1$ are learnable parameters.

Copy distribution. PALM uses an additional attention layer for the copy distribution on top of the decoder. In the course of generation, the layer takes $s_t$ as the query, and outputs $\alpha_t$ as the attention weights and $c_t$ as the context vector:

$$e_{ti} = w^\top \tanh(W_h h_i + W_s s_t + b_a), \quad \alpha_t = \mathrm{softmax}(e_t), \quad c_t = \sum_i \alpha_{ti} h_i$$

where $h_i$ is the representation of the $i$-th token in context from the encoder. $w$, $W_h$, $W_s$ and $b_a$ are learnable parameters. As a result, $P^c_t$ is the copy distribution over the extended vocabulary, defined as:

$$P^c_t(y_t) = \sum_{i: x_i = y_t} \alpha_{ti}$$

Final distribution. The final probability of generating $y_t$ is defined as a mixture of the extended vocabulary distribution and the copy distribution:

$$P(y_t) = \lambda P^v_t(y_t) + (1 - \lambda) P^c_t(y_t), \quad \lambda = \sigma\left(w_c^\top c_t + w_s^\top s_t + b_\lambda\right)$$

where $w_c$, $w_s$ and $b_\lambda$ are learnable parameters.
The parameters in pointer-generator learned in pre-training are all kept and passed downstream for fine-tuning on labeled data.
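The final mixture can be checked with a minimal numeric sketch (token-level only, ignoring the neural attention machinery; `final_distribution` and all numbers are hypothetical):

```python
def final_distribution(vocab_probs, context_tokens, attn_weights, p_gen):
    """Mix a vocabulary distribution with a copy distribution over context
    tokens, weighted by the generation gate p_gen (See et al., 2017 style)."""
    extended = {tok: p_gen * p for tok, p in vocab_probs.items()}
    for tok, a in zip(context_tokens, attn_weights):
        # copy probability mass accumulates onto (possibly out-of-vocabulary) tokens
        extended[tok] = extended.get(tok, 0.0) + (1.0 - p_gen) * a
    return extended

vocab = {"the": 0.6, "car": 0.3, "sold": 0.1}          # P_vocab over V
ctx, attn = ["aston", "martin", "car"], [0.5, 0.3, 0.2]  # attention over context
dist = final_distribution(vocab, ctx, attn, p_gen=0.7)

assert abs(sum(dist.values()) - 1.0) < 1e-12  # still a valid distribution
assert dist["aston"] > 0                      # OOV token is copyable from context
```

The key property is visible in the assertions: the mixture remains a valid distribution, and tokens absent from the vocabulary become reachable through the copy path.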
4 Experiments

In this section, we present the experimental setup and results of PALM pre-training on a large unlabeled corpus and fine-tuning on a variety of language generation tasks, including generative QA, abstractive summarization and conversational response generation.
Table 1: Example continuations generated by the pre-trained PALM and MASS models.

Input: A classic Aston Martin once owned by Spartacus star Peter Ustinov is set to fetch more than £1 million at auction - twice what it fetched four years ago. The actor bought the Aston Martin DB4 Cabriolet in 1962, shortly after winning a Best Supporting Actor Oscar for his role as Batiatus in Spartacus. It was one of the most luxurious cars of its day, costing £4,000 and was delivered to him at a Swiss hotel at a time when the average house price in Britain was just £2,500.
PALM: The Aston Martin DB4 Cabriolet was bought by Peter Ustinov for £4000 and was expected to fetch for £2.5 million for auction. The car was sold for £1.2 million finally.
MASS: peter ustinov’s UNK auctioned for more than $1 million.

Input: Cape Verde’s 2-0 win over Portugal was the most eye-catching international result of the week. So, who are Cape Verde and why has this tiny island off the west coast of Africa suddenly become an international football force? Where are the Cape Verde Islands? Cape Verde is a group of islands 400 miles from Senegal off the west coast of Africa. Its population is around 500,000 and boasts both beautiful beaches and striking volcanic landscapes, making it a haven for tourism.
PALM: Cape Verde is a small island off the west coast of Africa with a population of around 500,000 and boasts both beautiful beaches and striking volcanic landscapes, making it a haven for tourism. Cape Verde is home to the Cape Verde Islands National Park with a number of islands.
MASS: tiny african island nation cape verde has beautiful beaches.
4.1 Pre-training Configuration
Experimental Setup. PALM is based on the Transformer, consisting of a 12-layer encoder and a 12-layer decoder with 768 embedding/hidden size, 3072 feed-forward filter size and 12 attention heads. The parameters of the encoder are initialized with the pre-trained RoBERTa-Base model (https://github.com/pytorch/fairseq), which was trained with the Masked LM objective, removing Next Sentence Prediction from BERT.
PALM is trained with a dropout rate of 0.1 on all layers and attention weights, and a GELU activation function Hendrycks and Gimpel (2016) as used in GPT. The learning rate is set to 1e-5, with linear warmup over the first 10k steps followed by linear decay. The pre-training procedure runs on 16 NVIDIA V100 GPU cards for 800K steps, with each minibatch containing 64 sequences of maximum length 500 tokens.
Pre-training Dataset. We use documents of English Wikipedia and BookCorpus Zhu et al. (2015) as our pre-training corpus, and perform WordPiece tokenization as in BERT Devlin et al. (2018). The documents are split into sentences. Different from BERT, we use multiple consecutive sentences of up to 400 tokens as the source text input to the encoder, and the subsequent consecutive sentences of up to 100 tokens as the target text for the decoder. The pre-training dataset is constructed from the documents by a sliding window with a stride of one sentence, resulting in 50M pre-training pairs.
4.2 Unsupervised Pre-training
To understand the performance of PALM pre-training, we compare the generation quality of the pre-trained models of PALM and MASS (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth). Specifically, we feed a few sentences from a news article to both pre-trained models, and the models generate a continuation of the input sentences by beam search with a beam size of 5. The news articles from CNN (https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ) are used as input text to eliminate the possibility of the text being present in the models' pre-training corpora, i.e., Wikipedia and BookCorpus.
The overall perplexity of PALM is 17.22, which is much better than MASS’s perplexity of 170.32, indicating PALM’s better language modeling. Table 1 illustrates a couple of example continuations generated by PALM and MASS. In both examples, PALM generates fluent and grammatical English, while MASS outputs a short sentence that is much less relevant to input text, since the MASS model was trained on individual sentences. In the first example, it is interesting to observe that in addition to summarizing the input content, PALM is able to make a non-trivial inference of the expected auction price and the final selling price of the car (might not be factually accurate though). An inference is also made by PALM in the second example in addition to summarization, although the Cape Verde Islands National Park does not really exist.
These examples demonstrate that PALM pre-training has learned to infer and to reason from the input text. Although in the pre-training phase the generated content may not be factually accurate in the absence of rich context, the capability of inference can be transferred downstream by fine-tuning on specific generation tasks.
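For reference, the perplexity reported above is the exponentiated average negative log-likelihood per token; a minimal sketch with illustrative (made-up) numbers:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum_t log P(y_t | y_<t, x))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

good = [math.log(0.3)] * 10   # a model assigning ~0.3 to each token
bad = [math.log(0.005)] * 10  # a far less confident model

assert perplexity(good) < perplexity(bad)      # lower perplexity is better
assert abs(perplexity(good) - 1 / 0.3) < 1e-9  # uniform case: 1 / per-token prob
```

Under this measure, a gap like 17.22 versus 170.32 corresponds to roughly an order of magnitude difference in average per-token probability.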
4.3 Fine-tuning on Generative QA
We also experiment with fine-tuning PALM on several downstream generation tasks. The MARCO benchmark Nguyen et al. (2016) released by Microsoft is the best fit for evaluating generative QA models. In the MARCO dataset, the questions are user queries issued to the Bing search engine and the contextual passages are from real web documents. The data has been split into a training set (153,725 QA pairs), a dev set (12,467 QA pairs) and a test set (101,092 questions with unpublished answers). To evaluate the generative capability, we focus on the Q&A + Natural Language Generation task, the goal of which is to provide the best answer available in natural language that could be used by a smart device / digital assistant.
The answers are human-generated and not necessarily sub-spans of the contextual passages, so we use the ROUGE-L Lin (2004) metric for our evaluation to measure the quality of generated answers against the ground truth.
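ROUGE-L is based on the longest common subsequence (LCS) between a candidate and a reference. A minimal sketch of the F-measure variant follows (the standard β weighting of recall is omitted here for simplicity, so this is not the official scorer):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Unweighted LCS-based F-measure between two token lists."""
    l = lcs_len(candidate, reference)
    if l == 0:
        return 0.0
    p, r = l / len(candidate), l / len(reference)
    return 2 * p * r / (p + r)

cand = "the car sold for one million".split()
ref = "the car was sold for a million".split()
assert 0 < rouge_l(cand, ref) <= 1.0
```

Unlike n-gram overlap metrics, LCS rewards in-order matches without requiring them to be contiguous, which suits abstractive answers.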
We fine-tune the pre-trained PALM on the MARCO training set for 10 epochs. We set the batch size to 64, the learning rate to 1e-5, and the maximum input length to 512. The other hyper-parameters are kept the same as in pre-training. In fine-tuning PALM, the encoder takes as input a contextual passage concatenated with a question at the end, and the decoder takes an answer as input. During decoding, we use beam search with a beam size of 5.
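The beam search used for decoding can be sketched as follows; `step_fn` is a toy stand-in for the decoder's next-token distribution, not PALM's actual model:

```python
import math

def beam_search(step_fn, vocab, beam_size=5, max_len=3, eos="</s>"):
    """Keep the `beam_size` highest-scoring partial hypotheses per step,
    scoring by cumulative log-probability; return the best final hypothesis."""
    beams = [([], 0.0)]                       # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:  # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for tok, p in step_fn(tokens, vocab).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy next-token distribution: always prefers "a", then "b", then end-of-sequence.
def step_fn(tokens, vocab):
    return {"a": 0.6, "b": 0.3, "</s>": 0.1}

assert beam_search(step_fn, ["a", "b", "</s>"])[0] == "a"
```

Real decoders additionally apply length normalization so that longer hypotheses are not unfairly penalized; that detail is omitted here.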
Table 2: Rouge-L results of answer generation on the MARCO Q&A + Natural Language Generation test set.

ConZNet Indurthi et al. (2018): 0.421
VNET Wang et al. (2018): 0.484
Masque Nishida et al. (2019): 0.496
PALM (ours): 0.498
Table 2 presents the answer generation results on the test set obtained from the official MARCO leaderboard. PALM achieves the 1st place on the leaderboard, outperforming all competing methods in generation quality. Note that PALM pre-trains a single model, while some of the top-performing methods are ensemble models, such as Masque, on the leaderboard. Crucially, the superiority of PALM-single over Masque-ensemble with pre-trained ELMo Peters et al. (2018) and BERT-based methods clearly demonstrates the effectiveness and generalizability of PALM over the other pre-training approaches in language modeling.
4.4 Fine-tuning on Summarization
Text summarization produces a concise and fluent summary conveying the key information in the input (e.g., a news article). We focus on abstractive summarization, a generation task where the summary is not constrained to reusing the phrases or sentences in the input text. Following MASS, we use the Gigaword dataset Graff and Cieri (2003)
for model fine-tuning and evaluation, which consists of a total of 3.8M article-title pairs in English. We take the articles as the input to the encoder and titles for the decoder. We adopt the same optimization hyperparameters from generative QA fine-tuning for the summarization task. The F1 scores of Rouge-1, Rouge-2 and Rouge-L are reported on the Gigaword test set for evaluation.
As shown in Table 3, PALM achieves better performance than all existing abstractive summarization models. It is worth noting that UniLM, MASS, BERT+LM and DAE are pre-trained on an unlabeled corpus before supervised fine-tuning on the summarization data. By consistently outperforming these pre-training methods, PALM confirms its effectiveness in leveraging self-supervision signals for language generation.
4.5 Fine-tuning on Response Generation
Conversational response generation aims to produce a flexible response to a conversation Vinyals and Le (2015). Following MASS, we conduct experiments on the Cornell Movie Dialog corpus555https://github.com/suriyadeepan/datasets/tree/master/seq2seq/cornell_movie_corpus Danescu-Niculescu-Mizil and Lee (2011) that contains 140K conversation pairs, and use the training/test splits provided by the dataset. The same training hyperparameters from generative QA fine-tuning are adopted on the response generation task. We report the results in perplexity following Vinyals and Le (2015) (lower is better).
Table 4: Perplexity of response generation models trained on 10K data and 110K data (lower is better).
We compare PALM with competing methods, including a baseline trained on the available data pairs and the pre-trained BERT+LM and MASS. Following MASS, we train every model on 10K randomly sampled pairs and on all 110K training pairs. As shown in Table 4, PALM performs significantly better than all the competitors by a large margin on both the 10K and 110K data, demonstrating its capability in generating responses to context thanks to its new pre-training objectives.
5 Related Work
ELMo Peters et al. (2018) is an early prominent pre-training method based on bidirectional LSTMs. It concatenates left-only and right-only representations, but does not pre-train interactions between these features. GPT Radford (2018) and GPT-2 Radford et al. (2019) base language modeling on the Transformer architecture, using only the Transformer decoder for pre-training. Edunov et al. (2019) examine different strategies (e.g., ELMo) for adding contextualized embeddings to sequence-to-sequence models, and observe the most improvement by adding the learned embeddings to the encoder.
BERT Devlin et al. (2018) introduces Masked Language Modeling, which allows pre-training to learn interactions between left and right context words. Recent work has shown that very strong performance can be achieved by training for longer Liu et al. (2019), by tying parameters across layers Lan et al. (2019), and by masking spans instead of words Joshi et al. (2019). However, BERT's predictions are not made autoregressively, which reduces its effectiveness for generation tasks.
UniLM Dong et al. (2019) fine-tunes BERT with an ensemble of masks, some of which use only leftward context, allowing UniLM to be used for generation tasks. A difference between UniLM and PALM is that UniLM's predictions are conditionally independent, whereas PALM's are autoregressive. PALM reduces the mismatch between pre-training and context-conditioned generation tasks by forcing the decoder to predict the continuation of text input to the encoder on an unlabeled corpus.
MASS Song et al. (2019) and BART Lewis et al. (2019) are the two pre-training approaches most similar to PALM. In MASS, an input sequence where a contiguous span of tokens is masked is mapped to a sequence consisting of the missing tokens, whereas BART is trained to reconstruct the original text from corrupted input with some masked tokens. The difference in input & output representations between PALM and MASS & BART is detailed in Section 3.2.
6 Conclusion

In this work, we propose PALM, a novel approach to pre-training an autoencoding and autoregressive language model on a large unlabeled corpus, designed to be fine-tuned on downstream generation conditioned on context. It is built upon an extension of the Transformer encoder-decoder, and jointly pre-trains the encoder and the decoder in an autoencoding denoising stage followed by an autoregressive generation stage.
With less training cost than that of existing pre-training approaches, PALM significantly advances the state-of-the-art results on a variety of context-conditioned generation applications, including generative QA (Rank 1 on the MARCO leaderboard), abstractive summarization and conversational response generation. It has been shown in prior work Liu et al. (2019) that training for more steps over a larger corpus can potentially improve the performance of pre-training. Our future work will explore the potential of training PALM for longer on much more unlabeled text data.
References

- Entity commonsense representation for neural abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 697–707.
- Layer normalization. CoRR abs/1607.06450.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.
- Incorporating external knowledge into machine reading for generative question answering.
- Retrieve, rerank and rewrite: soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 152–161.
- Faithful to the original: fact aware neural abstractive summarization. In AAAI Conference on Artificial Intelligence.
- Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28, pp. 3079–3087.
- Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, Portland, Oregon, USA, pp. 76–87.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32, pp. 13042–13054.
- Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4052–4059.
- Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
- English Gigaword. Linguistic Data Consortium, Philadelphia.
- Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339.
- Tying word vectors and word classifiers: a loss framework for language modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France.
- Cut to the chase: a context zoom-in network for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 570–575.
- SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
- OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, pp. 67–72.
- ALBERT: a lite BERT for self-supervised learning of language representations. arXiv abs/1909.11942.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv e-prints.
- ROUGE: a package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out, 2004.
- Global encoding for abstractive summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 163–169.
- RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- MS MARCO: a human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, co-located with NIPS 2016.
- Multi-style generative reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2273–2284.
- Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
- Language models are unsupervised multitask learners.
- Improving language understanding by generative pre-training.
- A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389.
- Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083.
- MASS: masked sequence to sequence pre-training for language generation. CoRR abs/1905.02450.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.
- Cutting-off redundant repeating generations for neural abstractive summarization. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 291–297.
- S-Net: from answer extraction to answer generation for machine reading comprehension. CoRR abs/1706.04815.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
- A neural conversational model. In ICML Deep Learning Workshop, 2015.
- Concept pointer network for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3074–3083.
- Multi-passage machine reading comprehension with cross-passage answer verification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, Volume 1: Long Papers, pp. 1918–1927.
- Aligning books and movies: towards story-like visual explanations by watching movies and reading books. CoRR abs/1506.06724.