Encoder-Agnostic Adaptation for Conditional Language Generation

08/19/2019 ∙ by Zachary M. Ziegler, et al. ∙ Harvard University 0

Large pretrained language models have changed the way researchers approach discriminative natural language understanding tasks, leading to the dominance of approaches that adapt a pretrained model for arbitrary downstream tasks. However, it remains an open question how to use similar techniques for language generation. Early results in the encoder-agnostic setting have been mostly negative. In this work we explore methods for adapting a pretrained language model to arbitrary conditional input. We observe that pretrained transformer models are sensitive to large parameter changes during tuning. We therefore propose an adaptation that directly injects arbitrary conditioning into self attention, an approach we call pseudo self attention. Through experiments on four diverse conditional text generation tasks, we show that this encoder-agnostic technique outperforms strong baselines, produces coherent generations, and is data efficient.




1 Introduction

Large-scale language models have been shown to dramatically improve the performance of natural language understanding (NLU) systems on a broad range of tasks (Peters et al., 2018; Devlin et al., 2018; Radford and Salimans, 2018; McCann et al., 2017). The dominant paradigm is to pretrain a self-attention based language model on a large corpus of unlabeled text and then finetune the language model and task-specific head on supervised data. Optimizing the effectiveness of this approach has been the focus of much study (Houlsby et al., 2019; Wang et al., 2019; Chronopoulou et al., 2019).

Given the success of pretraining for NLU tasks, how can large language models best be adapted for conditional language generation? Ideally, one should only need to train a large language model once and then apply it as part of the decoder to a range of tasks with different source modalities (e.g. text, images, bits). In the encoder/decoder framework, a task-specific encoder can be used which encodes source information into a continuous vector. The central question is therefore how to adapt a pretrained decoder to effectively utilize arbitrary source information, i.e. how to model p(y | x) starting from a pretrained p(y).


Given the demonstrated quality of samples from large language models (Radford et al., 2019), it is natural to expect that encoder-agnostic adaptation should give improvements in coherence and grammaticality even when the source modality is not text, such as with image captioning or class-conditional generation. Unfortunately, past results indicate otherwise. Edunov et al. (2019) show for example that a straightforward extension of Peters et al. (2018) to the conditional generation setting actually hurts performance compared to a model without any pretraining. Other pretraining approaches for language generation (Song et al., 2019; Dong et al., 2019; Lample and Conneau, 2019) have demonstrated strong performance on text-to-text tasks, but these methods are constrained to tasks where the source is natural language and do not address the encoder-agnostic setting.

In this work we consider several different approaches for the problem of encoder-agnostic adaptation. We first make the observation that standard adaptation approaches perform poorly on this task. We posit that because these techniques require relearning key parts of the network structure to inject contextual conditioning, they move the parameters too far from the pretrained values. In contrast, Radford et al. (2019) observe that even trivial conditioning with the original model produces reasonable zero-shot generations without fine-tuning.

These results motivate an approach that learns the correct conditioning to control the model's output, which we call pseudo self attention. The idea is to learn a task-specific encoder that injects pseudo history into a pretrained self attention model. Because self attention works with sets of any size, the model can immediately utilize or ignore this history. Finetuning adapts the model to this new input while training a task-specific encoder.

Experiments utilize the GPT-2 (Radford et al., 2019) transformer as a pretrained model. We consider four diverse generation tasks spanning a range of source modalities: class-conditional generation, document summarization, story generation, and image paragraph captioning. Across all tasks, we find that pseudo self attention outperforms the other pretraining methods and is the most consistent. As a practical tool, pseudo self attention improves performance compared to a baseline without pretraining by large margins without sacrificing adherence to the source, even for tasks with large amounts of supervised data. We further demonstrate that the approach is data efficient and produces qualitatively more coherent outputs. Code is available at


2 Related Work

Transfer learning with language models

Extending upon the success of pretrained word embeddings (Mikolov et al., 2013), contextual word vectors based on LSTMs first demonstrated strong results across discriminative NLU tasks (McCann et al., 2017; Howard and Ruder, 2018; Peters et al., 2018). Recent work has shown that the transformer (Vaswani et al., 2017) could further improve language representation. BERT (Devlin et al., 2018) trains a transformer via a cloze task and next sentence prediction objectives, leading to state-of-the-art results on many NLU tasks. GPT and GPT-2 (Radford and Salimans, 2018; Radford et al., 2019) use a similar model in a unidirectional language modeling setting, the latter showing the additional ability to generate impressively coherent unconditional text. As they take the form of standard language models, the GPT models are a natural starting point for pretraining generation models.

Pretrained Decoder Transfer learning for NLG

Natural language generation (NLG) tasks have a long history of incorporating unconditional language models with conditional input, especially for machine translation and speech recognition (Bahl et al., 1983; Koehn et al., 2003). These approaches traditionally use the noisy channel model (i.e. Bayes' rule), with n-gram models as the language model. Recent adaptations of these ideas include the Neural Noisy Channel (Yu et al., 2017) as well as "fusion" methods (Koehn et al., 2003; Gulcehre et al., 2015; Sriram et al., 2018; Stahlberg et al., 2018) in which the output logits of a language model and a conditional model are combined to calculate the output probabilities. We consider this class of transfer learning as a baseline in a preliminary experiment (see Section 4.1), but focus on alternative "deep" approaches that incorporate the language model weights as an integral part of the model instead of an add-on at the end. Along these lines, Ramachandran et al. (2017) propose a finetuning-based method for machine translation with LSTMs, in which some of the layers of the LSTM are initialized with pretrained language model weights. As their method is specific to LSTMs, however, it is incompatible with modern transformer architectures.

Pretraining-Based Transfer Learning for NLG

Zhang et al. (2019) use BERT in the encoder and decoder of a summarization model via a unique cloze generative process. They demonstrate strong abstractive summarization performance, but the value of the BERT pretraining relative to other model components is not clear, and the cloze process significantly reduces the practicality of the model. More closely related, Edunov et al. (2019) experiment with a representation-based approach for applying ELMo (Peters et al., 2018) to the source and target sides of a standard seq2seq model separately. Their approach consistently improves performance when applied to the source, but actually hurts performance when applied to the decoder. We consider such a representation approach as a baseline in this work.

Most recently, a number of studies experiment with BERT-like masking approaches that are compatible with natural language generation (Song et al., 2019; Dong et al., 2019; Lample and Conneau, 2019). While these works demonstrate impressive performance, they are constrained to text-to-text tasks because they do not have a way to handle arbitrary conditional information. Whereas these works study pretraining methods that optimize transfer for text-to-text tasks, our study considers the separate problem of adapting a fixed pretrained model to arbitrary source conditioning.

Concurrent with this work, Golovanov et al. (2019) propose a similar approach to pseudo self attention and report initial experiments with dialogue generation. This study complements ours with positive results on dialogue generation, though we aim for experimental evidence over a wide range of language generation tasks and input modalities and comparison to strong encoder-agnostic baselines.

(a) Repr-Transformer
(b) Context-Attn
(c) Pseudo-Self
Figure 1:

Encoder-agnostic variants considered. All methods utilize a problem-specific source encoder, but vary in which parts of the decoder are pretrained and which are randomly initialized. Repr-Transformer trains a new full transformer decoder, Context-Attn trains a new context attention layer, Pseudo-Self attention only modifies part of the self attention layer. Residual connections and layernorm have been omitted for clarity. Green indicates that parameters are initialized with pretrained weights; gray indicates random initialization. Red vectors indicate the target activations at each layer; blue vectors indicate the source features at the output of the encoder. ×N indicates that the section within the dotted lines is stacked N times.

3 Methods

We assume that we have a large pretrained language model p(y) = ∏_t p(y_t | y_{<t}; θ), that the model is an auto-regressive neural network, and that it is based on self attention to implement conditioning on previous tokens, i.e.,

SA(Y) = softmax((Y W_q)(Y W_k)^T)(Y W_v)

where the input Y ∈ R^{T×d} for hidden dimension d, and W_k, W_v, W_q ∈ R^{d×d} are parameters, representing the key, value, and query projections respectively; the output is in R^{T×d}. (In practice many of these units ("heads") are stacked together via concatenation across the hidden dimension, followed by a final linear projection.)
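As a concrete reference, single-head self attention can be sketched in a few lines of NumPy. This is a minimal illustration of the formula above; causal masking and the multi-head stacking mentioned in the footnote are omitted, and the customary 1/sqrt(d) scaling is included as a comment-flagged addition:

```python
import numpy as np

def self_attention(Y, Wq, Wk, Wv):
    """Single-head self attention over target activations Y of shape (T, d)."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv            # query/key/value projections
    scores = (Q @ K.T) / np.sqrt(Y.shape[-1])   # scaled dot-product scores, (T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)               # row-wise softmax
    return w @ V                                # attention output, (T, d)

rng = np.random.default_rng(0)
T, d = 5, 8
Y = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(Y, Wq, Wk, Wv)
```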

We are interested in using this model to estimate the conditional probability p(y | x) for an arbitrary input x for which we have a small amount of supervised (x, y) pairs. The goal is to learn a model on this new data that best makes use of the pretrained model p(y), with a method that is agnostic to the form of x.

All models considered are based on the encoder/decoder architecture, and for each we follow the same high-level procedure: First, some of the weights of the decoder are initialized with weight values from a pretrained language model. Next, a problem-specific encoder and all non-pretrained decoder weights are randomly initialized. Finally, the entire model is trained/fine-tuned end-to-end using the supervised data for the given task. In all cases the input and output embeddings are tied. The models differ only in where and how they use the pretrained weights in the decoder.
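The shared setup can be sketched as a parameter-initialization routine; the parameter names and the init scale below are illustrative placeholders, not GPT-2's actual state dict:

```python
import numpy as np

def initialize_decoder(decoder_shapes, pretrained, rng):
    """Copy pretrained LM weights where names match; randomly initialize the rest
    (the encoder and any new decoder blocks). All parameters are then trained
    end-to-end on the supervised task data."""
    params = {}
    for name, shape in decoder_shapes.items():
        if name in pretrained:
            params[name] = pretrained[name].copy()        # warm start from the LM
        else:
            params[name] = rng.normal(0.0, 0.02, shape)   # random init for new parts
    return params

rng = np.random.default_rng(0)
pretrained = {"layer0.self_attn.Wk": np.ones((4, 4))}
shapes = {"layer0.self_attn.Wk": (4, 4), "layer0.context_attn.Wk": (4, 4)}
params = initialize_decoder(shapes, pretrained, rng)
```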

Baseline 1: Repr-Transformer

The first approach considered (Figure 1(a)) views the function of the pretrained LM as giving a general-purpose representation of the target text before the source information is introduced. For this method, a standard transformer decoder is used with the target word embeddings replaced by the output representation of the pretrained language model. Preliminary experiments considered both fixing and updating these representations, and found that a fixed weighted-averaging ("ELMo-style") method performed better, consistent with Edunov et al. (2019). One possible downside to this approach is that the conditioning information from the encoder is injected after all of the pretrained weights.

Baseline 2: Context-Attn

The second approach (Figure 1(b)) initializes a standard transformer decoder with the shared weights of a pretrained LM. The newly added context attention weights at each layer are randomly initialized. Compared to Repr-Transformer, the conditioning information is injected alongside the pretrained weights, but the randomly initialized context attention block may interfere with the carefully co-tuned pretrained weights of the rest of the model. This may lead to reduced performance and optimization challenges.

Proposed Model: Pseudo-Self

A more radical approach to incorporating conditional information is the "zero-shot" model proposed by Radford et al. (2019). Instead of learning a representation for x and passing it into a separate context attention block, they note that an auto-regressive model p(y) is already a conditional model. If x is the same modality as y (e.g. both language), one can condition on x by prepending the source to the target, modeling p(y | x) as p([x; y]). (This method is most successful when hand-selected, task-dependent buffer words are inserted between x and y as well, such as "tl;dr" for summarization.) While this does not produce competitive models and is limited in its applicability, it is surprising that it works at all.

Figure 2: Comparison of parameter changes in feed forward layers with different conditioning. Root median squared deviation between feed forward parameters, for the Pseudo-Self and Context-Attn models. The Context-Attn approach requires a larger deviation from the initialization to fit the data.

Model PPL Cls Acc
Test set - 90.1
GPT-2 41.21 -
Simple Fusion 38.31 65.1
Transformer 105.43 92.7
  Repr-Trans 39.69 72.7
  Context-Attn 40.74 88.8
  Pseudo-Self 34.80 92.3
Table 1: Class-Conditional Generation on IMDb movie reviews. Classification accuracy is measured by a sentiment classifier trained on the IMDb training set. Bold indicates statistically significant best results.
Taking inspiration from this approach, we propose learning this contextualization in an encoder-agnostic way. Our approach, pseudo self attention, simply injects learned encoder conditioning directly into the pretrained self attention of the model. Assume that we have a matrix X ∈ R^{M×d_x} representing a size-M encoding of x, and define pseudo self attention as,

PSA(X, Y) = softmax((Y W_q) [X U_k ; Y W_k]^T) [X U_v ; Y W_v]

where U_k, U_v ∈ R^{d_x×d} are new parameters tasked with projecting encoder outputs into decoder self attention space. Because attention is inherently variable length, these additional inputs can be injected without changing the module and only act additively on the attention output. The full model is shown in Figure 1(c).
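A minimal single-head NumPy sketch of this operation (masking and multi-head stacking omitted; the pretrained projections are assumed given):

```python
import numpy as np

def pseudo_self_attention(X, Y, Wq, Wk, Wv, Uk, Uv):
    """X: (M, dx) encoder outputs; Y: (T, d) decoder activations.
    Uk, Uv are the only new parameters; Wq, Wk, Wv come from the pretrained LM.
    Projected encoder states are prepended to the decoder's own keys/values."""
    K = np.concatenate([X @ Uk, Y @ Wk], axis=0)   # (M+T, d) keys
    V = np.concatenate([X @ Uv, Y @ Wv], axis=0)   # (M+T, d) values
    scores = (Y @ Wq) @ K.T / np.sqrt(Y.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # softmax over all M+T positions
    return w @ V                                   # (T, d), same shape as plain self attention

rng = np.random.default_rng(1)
T, M, d, dx = 5, 3, 8, 6
Y, X = rng.normal(size=(T, d)), rng.normal(size=(M, dx))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Uk, Uv = (rng.normal(size=(dx, d)) for _ in range(2))
out = pseudo_self_attention(X, Y, Wq, Wk, Wv, Uk, Uv)
```

Because the softmax normalizes over all M+T positions, the model can attend to or ignore the injected pseudo history without any change to the module's interface.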

Compared to Context-Attn, the proposed approach only introduces new parameters in the self attention block, which we expect leads to only minimal interference. To explore this quantitatively, we plot the root median squared deviation of parameters from their original values in the feed-forward layer of our first task (Figure 2). While both start with the same parameters, the Context-Attn parameters change significantly more than Pseudo-Self over training. As the pretrained LM weights encode for generation capability, deviating further from this initialization may lead to worse generation performance.
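The deviation statistic plotted in Figure 2 can be computed directly; this is our reading of "root median squared deviation", offered as a sketch:

```python
import numpy as np

def root_median_squared_deviation(theta, theta0):
    """Root of the median squared elementwise deviation between current
    parameters theta and their pretrained initialization theta0."""
    sq = (np.asarray(theta) - np.asarray(theta0)).ravel() ** 2
    return float(np.sqrt(np.median(sq)))

rmsd = root_median_squared_deviation([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```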

4 Experiments and Results

Experiments consider four diverse tasks spanning input modalities, training dataset sizes, and information about the target contained in the source. Tasks are chosen to emphasize long-form targets to probe the decoder generation capabilities of the different models in a conditional setting. Perplexity is used to measure overall performance and diversity of output, combined with standard task-specific metrics.

For all tasks, GPT-2 is used as the pretrained language model. GPT-2 is a large autoregressive transformer LM trained on 40 GB of non-Wikipedia text (Radford et al., 2019). We use the originally publicly available version of the model (117M parameters); it has 12 layers, 12 heads per layer, and a model dimension of 768 units. The Context-Attn and Pseudo-Self models use the same architecture hyperparameters. For the Repr-Transformer model, to avoid overfitting, we use 6/8/512 layers/heads/dim for the decoder (in addition to the 12/12/768 that make up GPT-2 for the initial contextual embedding in the decoder). All experiments use the same 50k-type BPE GPT-2 vocabulary.

4.1 Preliminary: Class-Conditional Generation

Model R1 / R2 / RL PPL
PointerGenerator+BottomUp 41.22 / 18.68 / 38.34 -
ELMo+SHDEMB 41.56 / 18.94 / 38.47 -
BERT+Two-Stage 41.38 / 19.34 / 38.37 -
UniLM Large+ExtractiveLoss 43.47 / 20.30 / 40.63 -
Transformer + Copy 39.94 / 17.73 / 37.09 8.21
  Repr-Trans 37.09 / 13.77 / 33.99 13.58
  Context-Attn 40.59 / 18.17 / 37.24 6.68
  Pseudo-Self 40.72 / 18.38 / 37.46 6.43
  Pseudo-Self+BU 41.62 / 18.66 / 38.46 6.43
Table 2: Abstractive summarization on CNN/DM. Literature results above, our models below. † indicates pretraining of the encoder side. PointerGenerator+BottomUp from (Gehrmann et al., 2018), ELMo+SHDEMB from (Edunov et al., 2019), BERT+Two-Stage from (Zhang et al., 2019), UniLM Large+ExtractiveLoss from (Dong et al., 2019). Bold indicates statistically significant best results among general models and encoder-agnostic models.

We first consider a control experiment with a minimal encoder model: producing class-conditional samples, e.g. movie reviews conditioned on positive or negative sentiment, from the IMDb sentiment classification dataset (Maas et al., 2011), similar to previous works on sentiment transfer (Shen et al., 2017; Zhao et al., 2018). We set x to be a sentiment bit (positive/negative) and the movie review as the target y. We maintain the original IMDb 25k/25k train/test split, with 2.5k reviews of the original train split held out for validation, and truncate reviews to 400 BPE tokens during training. Model quality is evaluated by perplexity, and adherence to the source bit is evaluated by the sentiment classification accuracy of an external classifier on generated reviews, as in Shen et al. (2017). Reviews are generated via random sampling with a temperature of 0.7. To detect sentiment, we use the fastText external classifier from Joulin et al. (2016), which has an accuracy of 90.1% on the IMDb test set.

Table 1 shows results for all models, as well as unconditional GPT-2 and results using Simple Fusion (Stahlberg et al., 2018). The GPT-2 model itself already shows a greatly reduced PPL compared to a problem-specific transformer. All pretraining methods further improve perplexity. The pseudo self attention approach significantly outperforms the other approaches in terms of class adherence. Despite being initialized as a language model, the approach sees a decrease of only 0.4% classification accuracy compared to the randomly initialized model. In contrast, the Repr-Transformer model sees a decrease in accuracy of 20.0% and the Context-Attn model a decrease of 3.9%. Compared to Pseudo-Self, Simple Fusion gives a worse PPL and extremely poor classification accuracy. Given these weak results, we focus on comparisons between the deep models for the rest of the paper.

Model PPL Rank Acc.
Transformer 30.58 80.6
  Repr-Trans 21.16 76.7
  Context-Attn 5000 9.3
  Pseudo-Self 21.21 81.8
Table 3: Story generation on the WritingPrompts dataset. Rank acc. refers to the top-1 prompt ranking accuracy metric described in Section 4.3. (Experiments use the GPT-2 BPE scheme, so PPL numbers are not directly comparable to those reported in Fan et al. (2018).) Bold indicates statistically significant best results.
Model CIDEr B4
Krause et al. (2017) 13.5 8.7
Chatterjee et al. (2018) 20.9 9.4
Melas-Kyriazi et al. (2018) 22.7 8.7
Transformer 19.9 8.0
  Repr-Trans 19.3 7.2
  Context-Attn 22.6 7.6
  Pseudo-Self 24.0 8.3
Table 4: Image paragraph captioning on Visual Genome, as measured by CIDEr and BLEU-4 (B4) scores. Bold indicates statistically significant best results.

4.2 Document Summarization

Abstractive document summarization requires the model to produce a long-form summary given a full news article. For these experiments we use the non-anonymized CNN/Daily Mail dataset (Hermann et al., 2015), comprised of 280k training examples of document-scale source news articles and corresponding 2-4 sentence target summaries. Summarization is a mature testbed with state-of-the-art models that use task-specific architecture modifications, so transfer learning methods need to mesh well with these changes. We use the transformer version of the copy mechanism from Gehrmann et al. (2018) and employ bottom-up (BU) summarization attention pruning (Gehrmann et al., 2018). Generation is conducted via beam search with a beam size of 5 and tri-gram blocking, consistent with the literature models (Edunov et al., 2019).

Table 2 shows the performance of the models tested, with recent state-of-the-art models for comparison. Compared to the baseline model without pretraining, Pseudo-Self improves ROUGE-1 by 0.78, ROUGE-2 by 0.65, and ROUGE-L by 0.37, and reduces PPL by 20%. The Context-Attn approach nearly matches these results for this task, but the Repr-Transformer approach performs more poorly.

We additionally experiment with the simple bottom-up summarization attention pruning approach without pretraining applied at inference time as in (Gehrmann et al., 2018). With this modification Pseudo-Self outperforms all literature models in ROUGE-1 except the text-to-text UniLM+ExtractLoss, which uses joint pretraining of the source and target and is trained with an additional extractive loss. The performance of all of our models can potentially be further improved with the addition of pretraining on the encoder side.

4.3 Conditional Story Generation

Conditional story generation with the WritingPrompts dataset (Fan et al., 2018) requires the model to produce an on-topic story given a short textual prompt. While summarization relies heavily on the encoder, this task gives more flexibility to the decoder. The dataset is well supervised, containing 300k single sentence writing prompts (the source) and stories (the target). Following the preprocessing of Fan et al. (2018), we truncate the stories to 1000 tokens. Due to the story lengths the total number of training tokens is on the order of 100 million, resulting in a large in-domain data setting.

To compare models we compute two metrics: perplexity (PPL) and prompt ranking. Perplexity is used as a proxy for generation quality, whereas prompt ranking is used to measure the relevance of the story to the prompt. To calculate prompt ranking, we use the procedure from Fan et al. (2018): For each story in the test set, the likelihood is evaluated under the model for the “true” corresponding prompt and 9 other randomly selected “fake” prompts from the test set. Then, the rank accuracy is the percentage of stories for which the model gave the highest likelihood to the true prompt.
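The prompt ranking procedure can be sketched as follows; `score` is a stand-in for evaluating a story's log-likelihood under the model given a prompt, an illustrative interface rather than a function from the paper's code:

```python
import random

def prompt_rank_accuracy(score, prompts, stories, n_fake=9, seed=0):
    """Top-1 prompt ranking accuracy: fraction of stories for which the model
    assigns the highest likelihood to the true prompt among n_fake+1 candidates."""
    rng = random.Random(seed)
    hits = 0
    for true_prompt, story in zip(prompts, stories):
        fakes = rng.sample([p for p in prompts if p != true_prompt], n_fake)
        candidates = [true_prompt] + fakes
        best = max(candidates, key=lambda p: score(p, story))  # highest-likelihood prompt
        hits += (best == true_prompt)
    return hits / len(stories)

# toy check: a scorer that recognizes its own prompt ranks it first
prompts = [f"prompt-{i}" for i in range(10)]
stories = [f"story about {p}" for p in prompts]
acc = prompt_rank_accuracy(lambda p, s: 1.0 if s.endswith(p) else 0.0, prompts, stories)
```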

Table 3 shows the results. Despite the large dataset size, the Repr-Transformer and Pseudo-Self approaches still substantially reduce the PPL. That the models are able to improve PPL, despite the 100 million+ target tokens, suggests these models are able to effectively make use of the GPT-2 LM. Pseudo-Self sees only a 0.3% decrease in prompt ranking accuracy, while the Repr-Transformer approach sees a larger decrease. The Context-Attn model runs into optimization challenges and fails to learn in this setting.

Model PPL Cls Acc
Pseudo-Self 117M 34.80 92.3
Pseudo-Self 345M 30.26 92.4
Figure 3: IMDb conditional movie review generation results, comparing the larger 345M-parameter GPT-2 model to the 117M-parameter GPT-2 model.
Figure 4:

Data efficiency analysis with IMDb. PPL shown in blue (left), classification accuracy shown in orange (right). Error bars show an approximate 95% confidence interval.

4.4 Image Paragraph Captioning

Image paragraph captioning on the Visual Genome dataset from Krause et al. (2017) differs from the standard image captioning task, where captions are single sentences or sentence fragments: it requires the model to generate an entire paragraph (usually 5-8 sentences) describing a given image. Recent work in the image captioning literature has argued for a greater focus on paragraph captioning because the descriptive capacity of single-sentence image captions is inherently limited. However, due to the difficulty of producing labeled paragraph captions, existing paragraph captioning datasets are quite small; whereas the MSCOCO (single-sentence captioning) dataset contains around 600,000 image-caption pairs, Visual Genome contains fewer than 20,000 image-paragraph pairs. As a result, models trained from scratch on Visual Genome have been observed to have difficulty learning the structure of language, necessitating the use of heuristics.

We use the same convolutional encoder as Krause et al. (2017), without the final pooling layer; that is, for each image, the output of the encoder is a tensor of features extracted from a ResNet. Note that in this experiment, unlike those above, the encoder (CNN) and decoder (finetuned LM) are trained separately rather than end-to-end. Since we are interested in analyzing how to most effectively utilize pretraining for generation, we only compare with approaches using the same loss function (cross-entropy). Recent work shows it is possible to improve paragraph captioning models by incorporating sequence-level (Melas-Kyriazi et al., 2018) and adversarial (Chatterjee and Schwing, 2018) losses, but these loss-function improvements are orthogonal to improvements in the underlying model architecture.

Table 4 shows the results on the captioning task, as measured by the widely-used CIDEr and BLEU-4 metrics. We compare the three transfer learning methods with a non-pretraining baseline and models from the literature. Of the three pretraining approaches Pseudo-Self gives the best performance, and is the only model to improve both CIDEr and BLEU-4 compared to the Transformer baseline. Furthermore, Pseudo-Self outperforms all other models on CIDEr but gives a slightly worse BLEU-4.

5 Analysis and Discussion

5.1 Effect of pretrained LM size

There is a continuing trend toward larger pretrained LMs. During the preparation of this manuscript, a larger version of GPT-2 was made available with 345M parameters, increasing the model dimension to 1024, the number of attention heads to 16, and the number of layers to 24. We retrained our model using this larger LM for class-conditional generation, using the same training hyperparameters and re-tuning the generation temperature (Figure 3). The larger model improves PPL by 4.5 points while attaining similarly high classification accuracy. This datapoint suggests that transfer learning effectiveness can continue to improve along with the quality of the pretrained model used.

5.2 Low-data supervision

Many of our tasks showed improvements even with medium-to-large training sets. To study the effectiveness of the approach in low-data regimes, we create artificial small datasets by subsampling the IMDb dataset to sizes between 200 and 16k datapoints. We retrain our model using the same hyperparameters and use datasize-dependent early stopping to prevent overfitting. To reduce variance and measure uncertainty we repeat the process 8 times for each dataset size, calculating the PPL and classification accuracy. Results are shown in Figure 4. Note that a non-pretrained model has a PPL of over 1000 when trained on 200 examples. The pretrained model starts with reasonable outputs (44.4 PPL after 200 examples) and increases task accuracy steadily with more data. (See Section 5.4 for representative samples.)

Model Grammaticality Non-redundancy Consistency Typicality Combined
Test set 71.3 ± 4.3 87.2 ± 3.2 85.1 ± 3.4 74.4 ± 4.1 3.18 ± 0.10
Transformer 55.4 ± 4.7 60.5 ± 4.6 53.7 ± 4.7 39.7 ± 4.6 2.09 ± 0.13
Repr-Trans 62.1 ± 4.4 71.0 ± 4.1 57.1 ± 4.5 43.7 ± 4.5 2.34 ± 0.12
Pseudo-Self 65.2 ± 4.6 69.3 ± 4.5 61.3 ± 4.7 48.4 ± 4.8 2.44 ± 0.13
Table 5: Human evaluation of story generation quality. Participants were asked specific binary questions concerning the four criteria, the numbers for the four left categories represent percentages of approval. On the right, the methods are rated on a 4-point scale based on the combination of the four criteria. Uncertainties represent a 95% confidence interval, bold indicates statistically significant maxima for each category of the models under consideration.

5.3 Human evaluation

To assess the quality of generations, we conducted a human evaluation based on the story generation task. Generation uses a temperature of 0.9 and a top-k value of 100. We ask participants on Amazon Mechanical Turk a series of four yes/no questions mapped to desirable linguistic properties outlined in Dang (2006): grammaticality, non-redundancy, consistency, and typicality. 125 stories are evaluated for each model, and each story is evaluated by 5 unique workers. Scores are calculated for each property as the total percent of positive responses. A combined score rates the model overall on a scale from 0-4 based on the equally-weighted combination of the four properties.

The results are shown in Table 5. In all four categories the Pseudo-Self and Repr-Transformer models show statistically significant performance gains compared to the baseline Transformer model. The Pseudo-Self model achieves a grammaticality score only 6.1% below the test set, indicating that grammaticality, likely a more localized property, is well learned by the pretrained LM and effectively transferred to the conditional models. In contrast, all models score significantly worse than the test data in terms of consistency and typicality. This suggests that these higher-level properties, while best transferred in the Pseudo-Self case, still represent a challenge for neural models.

Sentiment: Positive
No pretraining 1.8k: “Set’s that I liked this movie. I have seen I remember the original movie is one of the music that it is great movie. I’ve seen this film and one of the whole movie is like this movie. It is so bad, I watched the top of this movie. i would see the movie was bad, I have seen it. This movie, it’s a TV main movie is about the plot, relaxing. I liked this movie takes it is a few times, was awesome. I’m a DVD. The critics in the first time I know it were a lovely plot. You could watch I’m seen the acting, and I don’t watch this. I’ve seen what you really i love the film. The film on the DVD.
No pretraining 22k: I was enthralled by the performance of Hartnett at the time of this film. At the time this film was shot, I thought, I was absolutely fantastic in the film, and thought it was a real ride. It is about a family living in a small town San Francisco, where a couple lives in Mexico, where they are working at home with the family. In the end, they are in the family to decide the family will just have their fathers through endless periods, coffee, disillusionment, and finally, can’t escape, but there’s no story. There are some delightful performances (particularly the two leads) and the actors are well known, I really can’t help but feel that the film could have been better as a comedy, with the exception of Ernesto L’Amelie.
Pseudo-Self 1.8k: I fell in love with this film in 1985. It’s a quintessential short film that explores the everyday lives of the human condition. The main character of the movie is a man named Donald (Husband George). He buys a home and captures a great deal of information about the businessmen who live and work in his neighborhood. This film is a complete departure from the norm or what a typical American would expect from a successful businessman with a short fuse. His stereotypical actors are less than likable, and his best friend is Kelly (Preet Bharara). The film is full of humor, and a great deal of heart and soul.
Pseudo-Self 22k: When I saw the preview of this film, I thought it was going to be a horrible movie. I was wrong. The film has some of the funniest and most escapist scenes I’ve seen in a long time. The acting is superb. The story is decent, but the direction and editing may have been a bit harsh at times. The director and cast achieved a great balance of comedy and drama. I’ve seen some bad films, but this one is one of the ones I’ve seen that is really good. I loved the acting and the pace. The two leads were compelling. The only real problem with the film was that I was a bit bored with it. The ending is a bit long, but it’s still a funny, good movie. It’s efficient. I give it a 7/10.
Table 6: Example generations from models trained on the movie review generation task. In all cases the indicated sentiment was positive. The number in the left column is the number of training examples (22k is the full dataset).

5.4 Qualitative examples

Representative samples from the movie review dataset are shown in Table 6. The No-Pretraining model is the transformer from Table 1, and the number in the left column indicates the number of supervised examples in the training dataset. Samples are generated via random sampling with a temperature of 0.75. Without pretraining, the model makes a number of clear coherence mistakes. The Pseudo-Self 22k model makes no grammatical mistakes and follows a single train of thought, although it remains somewhat generic.
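The sampling procedure described above (random sampling with a temperature of 0.75) can be sketched as follows. This is a generic illustration of temperature sampling, not the authors' code; `sample_with_temperature` and the `rng` hook are hypothetical names introduced here for clarity.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.75, rng=random.random):
    """Draw one token index from a categorical distribution over logits,
    after scaling by 1/temperature (values < 1 sharpen the distribution,
    favoring high-probability tokens; temperature 1 recovers plain sampling)."""
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling: walk the cumulative distribution until it
    # exceeds a uniform random draw.
    r = rng()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point round-off
```

At generation time this would be applied to the decoder's output logits at each step, with the sampled token fed back as the next input.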

The distinction between the models is even sharper when only 1.8k supervised examples are given. The baseline model trained on only 1.8k datapoints produces an exceptionally poor generation. In contrast, the Pseudo-Self model shows significantly improved grammar and sentence structure. Despite a handful of mistakes, the review follows a consistent description of a movie over multiple sentences. Given the poor performance of the baseline model, these properties must have been transferred from the original unconditional LM. These samples were selected to be representative of the broader set for the indicated models.

6 Conclusion

We study encoder-agnostic approaches for adapting a pretrained language model to general-purpose conditional language generation. Across a set of diverse long-form conditional generation tasks, we show that pseudo self attention consistently improves performance over strong encoder-agnostic pretraining baselines. From a practical perspective, the approach gives robust, sizable improvements over a non-pretraining baseline while maintaining adherence to the source context. Furthermore, we demonstrate the data efficiency and qualitative properties of the approach.
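The mechanism can be summarized in a single-head sketch: the decoder's self-attention keys and values are augmented with projections of the arbitrary source representations, so each target position attends jointly over the source and the previously generated tokens. The names `Uk` and `Uv` for the newly learned source projections are illustrative, and the causal mask and multi-head details are omitted; this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_self_attention(x, enc, Wq, Wk, Wv, Uk, Uv):
    """Single-head pseudo self attention sketch.

    x:   (T, d) decoder hidden states
    enc: (S, d) arbitrary source representations (text, image, etc.)
    Wq, Wk, Wv: (d, d) pretrained self-attention projections
    Uk, Uv:     (d, d) newly learned projections mapping the source
                into the pretrained key/value space
    """
    q = x @ Wq  # queries come from the target side only
    # Prepend projected source states to the self-attention keys/values,
    # so attention runs over the concatenation [source; target prefix].
    k = np.concatenate([enc @ Uk, x @ Wk], axis=0)  # (S+T, d)
    v = np.concatenate([enc @ Uv, x @ Wv], axis=0)  # (S+T, d)
    scores = q @ k.T / np.sqrt(x.shape[-1])         # (T, S+T)
    # A causal mask over the target positions is omitted for brevity.
    return softmax(scores) @ v                      # (T, d)
```

Because only `Uk` and `Uv` are new, the pretrained self-attention parameters are perturbed as little as possible, which matches the observation above that pretrained transformers are sensitive to large parameter changes during tuning.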

Beyond empirical results, this study highlights the distinction between improving contextual representations of the source language and improving the language generation capability of the target language. While they appear to be similar problems, they exhibit substantially different phenomenology. For example, the representation-based approach, which works well for NLU, gives poor performance for NLG. Future work can study this distinction further.


  • L. R. Bahl, F. Jelinek, and R. L. Mercer (1983) A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-5 (2), pp. 179–190. External Links: Document, ISSN 01628828 Cited by: §2.
  • M. Chatterjee and A. G. Schwing (2018) Diverse and coherent paragraph generation from images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 729–744. Cited by: §4.4, Table 4.
  • A. Chronopoulou, C. Baziotis, and A. Potamianos (2019) An embarrassingly simple approach for transfer learning from pretrained language models. CoRR abs/1902.10547. Cited by: §1.
  • H. T. Dang (2006) Overview of DUC 2006. Proceedings of HLT-NAACL. Cited by: §5.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. External Links: 1810.04805, Link Cited by: §1, §2.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. ArXiv abs/1905.03197. Cited by: §1, §2, Table 2.
  • S. Edunov, A. Baevski, and M. Auli (2019) Pre-trained Language Model Representations for Language Generation. NAACL-HLT. External Links: arXiv:1903.09722v2 Cited by: §1, §2, §3, §4.2, Table 2.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898. External Links: 1805.04833, Link Cited by: §4.3, §4.3, Table 4.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-Up Abstractive Summarization. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. External Links: 1808.10792, Link Cited by: §4.2, §4.2, Table 2.
  • S. Golovanov, R. Kurbanov, S. Nikolenko, K. Truskovskyi, A. Tselousov, and T. Wolf (2019) Large-scale transfer learning for natural language generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6053–6058. External Links: Link Cited by: §2.
  • C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio (2015) On Using Monolingual Corpora in Neural Machine Translation. arXiv preprint arXiv:1503.03535. External Links: 1503.03535, Link Cited by: §2.
  • K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching Machines to Read and Comprehend. Proceedings of NIPS, pp. 1–14. External Links: 1506.03340, Link Cited by: §4.2.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for nlp. CoRR abs/1902.00751. Cited by: §1.
  • J. Howard and S. Ruder (2018) Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). External Links: Document, 1801.06146, ISBN 2333-0384, ISSN 23330384, Link Cited by: §2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759. External Links: 1607.01759, Link Cited by: §4.1.
  • P. Koehn, F. J. Och, and D. Marcu (2003) Statistical Phrase-Based Translation. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (June), pp. 48–54. Cited by: §2.
  • J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Pattern Recognition (CVPR), Cited by: §4.4, §4.4, Table 4.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §1, §2.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Cited by: §4.1.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in Translation: Contextualized Word Vectors. 31st Conference on Neural Information Processing Systems. External Links: 1708.00107, Link Cited by: §1, §2.
  • L. Melas-Kyriazi, A. Rush, and G. Han (2018) Training for diversity in image paragraph captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 757–761. Cited by: §4.4, Table 4.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems. Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. Proceedings of NAACL-HLT. External Links: Document, 1802.05365, Link Cited by: §1, §1, §2, §2.
  • A. Radford and T. Salimans (2018) Improving Language Understanding by Generative Pre-Training. arXiv, pp. 1–12. External Links: Link Cited by: §1, §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language Models are Unsupervised Multitask Learners. Cited by: §1, §1, §1, §2, §3, §4.
  • P. Ramachandran, P. J. Liu, and Q. V. Le (2017) Unsupervised Pretraining for Sequence to Sequence Learning. Proceedings of EMNLP. Cited by: §2.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, USA, pp. 6833–6844. External Links: ISBN 978-1-5108-6096-4, Link Cited by: §4.1.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. M. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In ICML, Cited by: §1, §2.
  • A. Sriram, H. Jun, S. Satheesh, and A. Coates (2018) Cold Fusion: Training Seq2seq Models Together with Language Models. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, pp. 387–391. External Links: 1708.06426 Cited by: §2.
  • F. Stahlberg, J. Cross, and V. Stoyanov (2018) Simple Fusion: Return of the Language Model. Proceedings of the Third Conference on Machine Translation: Research Papers 1, pp. 204–211. External Links: 1809.00125, Link Cited by: §2, §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. 31st Conference on Neural Information Processing Systems, pp. 5998–6008. External Links: 1706.03762 Cited by: §2.
  • C. Wang, M. Li, and A. J. Smola (2019) Language models with transformers. CoRR abs/1904.09408. Cited by: §1.
  • L. Yu, P. Blunsom, C. Dyer, E. Grefenstette, and T. Kocisky (2017) The Neural Noisy Channel. arXiv preprint arXiv:1611.02554, pp. 1–13. Cited by: §2.
  • H. Zhang, Y. Gong, Y. Yan, N. Duan, J. Xu, J. Wang, M. Gong, and M. Zhou (2019) Pretraining-Based Natural Language Generation for Text Summarization. arXiv preprint arXiv:1902.09243. External Links: 1902.09243, Link Cited by: §2, Table 2.
  • J. Zhao, Y. Kim, K. Zhang, A. Rush, and Y. LeCun (2018) Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 5902–5911. External Links: Link Cited by: §4.1.