Distilling the Knowledge of BERT for Text Generation

11/10/2019 ∙ by Yen-Chun Chen, et al. ∙ Microsoft ∙ Carnegie Mellon University

Large-scale pre-trained language models, such as BERT, have recently achieved great success in a wide range of language understanding tasks. However, it remains an open question how to utilize BERT for text generation tasks. In this paper, we present a novel approach to addressing this challenge in a generic sequence-to-sequence (Seq2Seq) setting. We first propose a new task, Conditional Masked Language Modeling (C-MLM), to enable fine-tuning of BERT on a target text-generation dataset. The fine-tuned BERT (i.e., teacher) is then exploited as extra supervision to improve conventional Seq2Seq models (i.e., student) for text generation. By leveraging BERT's idiosyncratic bidirectional nature, distilling the knowledge learned from BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple text generation tasks, including machine translation (MT) and text summarization. Our proposed model also achieves new state-of-the-art results on the IWSLT German-English and English-Vietnamese MT datasets.




1 Introduction

Large-scale pre-trained language models, such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), have become the de facto first encoding step for many natural language processing (NLP) tasks. For example, BERT, pre-trained with a deep bidirectional Transformer (Vaswani et al., 2017) via masked language modeling and next sentence prediction, has revolutionized the state of the art in many language understanding tasks, such as natural language inference (Bowman et al., 2015) and question answering (Rajpurkar et al., 2016).

However, beyond the common practice of fine-tuning BERT for language understanding (Wang et al., 2019), applying BERT to language generation remains an open question. Text generation aims to generate natural language sentences conditioned on certain input, with applications ranging from machine translation (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) and text summarization (Nallapati et al., 2016; Gehring et al., 2017; Chen and Bansal, 2018) to image captioning (Vinyals et al., 2015; Xu et al., 2015; Gan et al., 2017). In this paper, we study how to use BERT for better text generation, which to the best of our knowledge is still relatively unexplored territory.

Intuitively, as BERT is learned with a generative objective via Masked Language Modeling (MLM) during the pre-training stage, a natural assumption is that this training objective should have learned essential, bidirectional, contextual knowledge that can help enhance text generation. Unfortunately, this MLM objective is not auto-regressive, which encumbers its direct application to auto-regressive text generation in practice.

In this paper, we tackle this challenge by proposing a novel and generalizable approach to distilling knowledge learned in BERT for text generation tasks. We first propose a new Conditional Masked Language Modeling (C-MLM) task, inspired by MLM but requiring additional conditional input, which enables fine-tuning pre-trained BERT on a target dataset. In order to extract knowledge from the fine-tuned BERT and apply it to a text generation model, we leverage the fine-tuned BERT as a teacher model that generates sequences of word probability logits for the training samples, and treat the text generation model as a student network, which can effectively learn from the teacher’s outputs by imitation. The proposed approach improves text generation by providing a good estimate of the word probability distribution for each token in a sentence, consuming both the left and the right context, the exploitation of which encourages conventional text generation models to plan ahead.

Text generation models are usually trained via Maximum Likelihood Estimation (MLE), or teacher forcing (Bengio et al., 2015): at each time step, the model maximizes the likelihood of the next word conditioned on the previous ground-truth words. This corresponds to optimizing one-step-ahead prediction. As there is no explicit signal towards global planning in the training objective, the generation model may tend to focus on local structure rather than global coherence. With our proposed approach, BERT’s ability to look into the future can act as an effective regularization method, capturing subtle long-term dependencies that ensure global coherence and consequently boost model performance on text generation.

An alternative way to leverage BERT for text generation is to initialize the parameters of the encoder or decoder of Seq2Seq with pre-trained BERT, and then fine-tune the model on the target dataset. However, this approach requires the encoder/decoder to have the same size as BERT, inevitably making the final text generation model too large. Our approach, on the other hand, is modular and compatible with any text-generation model, and has no restriction on the model size (e.g., large or small) or model architecture (e.g., LSTM or Transformer).

The main contributions of this work are three-fold. (i) We present a novel approach to utilizing BERT for text generation. The proposed method induces sequence-level knowledge into the conventional one-step-ahead and teacher-forcing training paradigm, by introducing an effective regularization term to the MLE training loss. (ii) We conduct comprehensive evaluation on multiple text generation tasks, including machine translation and text summarization. Experiments show that our proposed approach significantly outperforms strong Transformer baselines and is generalizable to different tasks. (iii) The proposed model achieves a new state of the art on both the IWSLT14 German-English and IWSLT15 English-Vietnamese datasets.

2 Related Work

Pre-trained Language Models Prior to pre-trained language models, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) were widely used for NLP tasks. Recently, CoVe (McCann et al., 2017) introduced (conditional) language models pre-trained on a paired machine translation corpus. ELMo (Peters et al., 2018) learned a contextual language model on a large corpus with a bidirectional RNN. GPT (Radford et al., 2018) used a unidirectional Transformer to achieve better contextualized word representations. By fine-tuning pre-trained language models, ULMFit (Howard and Ruder, 2018) also achieved promising results on text classification.

In our study, we focus on BERT due to its superior performance on multiple language understanding tasks. However, different from previous work exploiting BERT for language understanding tasks, here we aim to apply BERT to text generation. To the best of our knowledge, this is still a relatively unexplored space. The proposed approach is also model-agnostic and can be applied to other pre-trained language models as well.

BERT for Text Generation There have been some recent attempts at applying BERT to text generation. Specifically, Lample and Conneau (2019) trained cross-lingual MLM and demonstrated promising results for cross-lingual natural language inference (Conneau et al., 2018) and unsupervised neural machine translation (NMT) (Lample et al., 2018). Wang and Cho (2019) formulated BERT as a Markov Random Field LM and showed preliminary results on unsupervised text generation with improved diversity. Zhang et al. (2019a) utilized an encoder with BERT and a two-stage decoder for text summarization. Song et al. (2019) proposed Masked Seq2Seq (MASS) pre-training, demonstrating promising results on unsupervised NMT, text summarization and conversational response generation. Concurrent with our work, Ghazvininejad et al. (2019) proposed a similar conditional MLM for constant-time translation, and Yang et al. (2019) studied how to fine-tune BERT for NMT.

Our approach is novel in the sense that we do not directly use the parameters of BERT in the Seq2Seq model. Instead, BERT acts as an effective regularization to the MLE training loss, by proactively injecting future information for predicting the present.

Right-to-Left Generation Our work also shares a high-level intuition with approaches that regularize left-to-right generative models with a right-to-left counterpart. Specifically, Liu et al. (2016) trained a separate reverse NMT model and performed joint decoding at inference time to enforce agreement between forward and reverse models. Twin Networks (Serdyuk et al., 2018) used a backward RNN jointly trained with a forward RNN decoder by matching their hidden states. Zhang et al. (2019b) further extended the idea to Transformer with joint training, so that the forward and the backward models iteratively improve each other. Our proposed approach stems from a similar intuition. However, we focus on using a pre-trained language model such as BERT to regularize an auto-regressive generation model.

Knowledge Distillation Our method shares the same loss formulation as Knowledge Distillation (KD) proposed in Buciluǎ et al. (2006); Hinton et al. (2015); Kim and Rush (2016), where a smaller student model is trained on soft labels provided by a larger teacher model. More recently, Tan et al. (2019) applied KD to multilingual NMT, and Sun et al. (2019) proposed patient KD for BERT model compression. Compared with these previous studies, where both the teacher and the student are trained on the same task, our approach is different in the sense that the BERT teacher is not designed to perform the student’s generation task. We focus on using KD to leverage the learned knowledge of BERT for text generation, while previous work mostly focused on model compression.

Figure 1: Illustration of the proposed approach to distilling knowledge from BERT for text generation. See Section 3.2 and 3.3 for details.

3 Proposed Approach

In this section, we present our proposed approach to distilling the knowledge in BERT for text generation in a generic sequence-to-sequence (Seq2Seq) setting. We first review Seq2Seq learning in Section 3.1, and then describe the proposed approach in Sections 3.2 and 3.3.

3.1 Sequence-to-Sequence Learning

Seq2Seq learning (Sutskever et al., 2014) aims to generate a sequence of discrete output $Y = (y_1, \dots, y_N)$ of length $N$, conditioned on a sequence of discrete input $X = (x_1, \dots, x_M)$ of length $M$. A Seq2Seq model learns parameters $\theta$ to estimate the conditional likelihood $P_\theta(Y \mid X)$, typically trained via Maximum Likelihood Estimation (MLE), or equivalently, minimizing the cross-entropy loss as follows:

$$\mathcal{L}_{xe}(\theta) = -\log P_\theta(Y \mid X) = -\sum_{t=1}^{N} \log P_\theta(y_t \mid y_{1:t-1}, X), \tag{1}$$

where each conditional probability can be calculated via an attention-based recurrent neural network (RNN) (Bahdanau et al., 2015; Luong et al., 2015), Transformer (Vaswani et al., 2017), or any other neural sequence-generation model.
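As a minimal sketch of the cross-entropy objective described above, the following pure-Python snippet sums the negative log-probabilities that a model assigns to the ground-truth tokens under teacher forcing. The function name `sequence_cross_entropy`, the dictionary representation of each per-step distribution, and the toy numbers are illustrative assumptions, not part of the original implementation.

```python
import math

def sequence_cross_entropy(step_probs, target):
    """Negative log-likelihood of `target` under per-step distributions.

    step_probs[t] maps each vocabulary word to the model's probability of
    that word at step t, computed with the ground-truth prefix as input
    (teacher forcing). `target` is the list of ground-truth tokens.
    """
    assert len(step_probs) == len(target)
    return -sum(math.log(dist[tok]) for dist, tok in zip(step_probs, target))

# Toy example with a 3-word vocabulary (made-up probabilities).
toy_probs = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
]
loss = sequence_cross_entropy(toy_probs, ["a", "b"])
```

Each term only scores the next word given its left context, which is the one-step-ahead behaviour the rest of this section tries to complement.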

3.2 Fine-tuning BERT with Conditional MLM

This generic Seq2Seq learning framework is the state of the art on a wide range of text generation tasks. Using modern deep neural networks, the conditional probabilities can be readily modeled as a sequence of classifications over the word vocabulary. However, during training, in order to generate the $t$-th token $y_t$, the model only sees a partial sentence $y_{1:t-1}$ from the ground-truth training data. Intuitively, it is reasonable to assume that a bidirectional model can be more informative than a left-to-right generation model, since additional context from the right (or future) is also incorporated to predict the current word. Unfortunately, this additional information is not utilized in a standard Seq2Seq model, since it can only be trained in a left-to-right manner, where the future context is masked out to prevent each word from indirectly “seeing itself”. To compensate for this single-directional limitation of the Seq2Seq setting, we propose a new conditional masked language modeling (C-MLM) task to enable the fine-tuning of BERT on the target generation task, in the hope that the fine-tuned bidirectional BERT can be utilized for better text generation.

BERT (Devlin et al., 2019) is a deep bidirectional Transformer trained via Masked Language Modeling (MLM).[1] In a similar setting, where the input is a sequence pair ($X$, $Y$),[2] 15% of the tokens are randomly masked. Formally, we denote the masked token sets as $X^m$ and $Y^m$, and the disjoint counterpart (i.e., the unmasked tokens) as $X^u$ and $Y^u$, respectively. The trained BERT model aims to estimate the joint probability:

$$P(x_1^m, \dots, x_i^m, y_1^m, \dots, y_j^m \mid X^u, Y^u), \tag{2}$$

where $i$ and $j$ denote the number of masked tokens in $X$ and $Y$, respectively. Each $x_\star^m \in X^m$, and each $y_\star^m \in Y^m$. Eqn. (2) can be trained with the standard word-level cross-entropy loss.

[1] Besides MLM, Devlin et al. (2019) also introduced the next sentence prediction task for training BERT. We omit this task since it is unrelated to our work.
[2] The two sequences are consecutive paragraphs sampled from a very large corpus such as Wikipedia.

We aim to marry MLM pre-training with Seq2Seq learning, to leverage a bidirectional language model for text generation. To this end, we propose conditional MLM, a variant of MLM that allows further fine-tuning of pre-trained BERT on a target dataset. For example, for machine translation, $X$ and $Y$ represent the source and the target sentence, respectively. We first concatenate them together and randomly mask 15% of the tokens only in $Y$, then train the network to model the joint probability

$$P(y_1^m, \dots, y_j^m \mid X, Y^u). \tag{3}$$

The above C-MLM objective is similar to the conditional language modeling (LM) objective in Eqn. (1), but conditional LM only permits predicting a word based on its left context. C-MLM is also related to Masked Seq2Seq (MASS) pre-training (Song et al., 2019). However, in MASS, the encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and the decoder tries to predict this masked fragment, which is different from our model design. The final goal is also different: MASS focuses on Seq2Seq pre-training, while we focus on leveraging BERT for text generation. In our experiments, we observe that the C-MLM task can obtain high accuracy and good generalization on word prediction. However, it is not feasible to generate sequential output directly from C-MLM. Instead, we use knowledge distillation to distill the knowledge learned from the fine-tuned BERT into a Seq2Seq model for direct text generation, which will be explained in the next sub-section.
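The C-MLM input construction described above can be sketched as follows: the source and target are concatenated, and masking is applied to target positions only. This is a simplified illustration under stated assumptions: the helper name `build_cmlm_example`, the `[CLS]`/`[SEP]` layout (borrowed from BERT's usual sentence-pair format), and the choice to mask a fixed count of positions are all hypothetical, and BERT's additional random-replacement rules are omitted.

```python
import random

MASK = "[MASK]"
SEP = "[SEP]"
CLS = "[CLS]"

def build_cmlm_example(source, target, mask_rate=0.15, rng=None):
    """Concatenate source and target, masking tokens only in the target.

    Returns the input token list and a dict mapping each masked target
    position to the original token the teacher must predict.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(mask_rate * len(target)))
    masked_positions = sorted(rng.sample(range(len(target)), n_mask))

    masked_target = list(target)
    labels = {}
    for pos in masked_positions:
        labels[pos] = target[pos]      # prediction target
        masked_target[pos] = MASK      # hide the token from the model

    # The source side is left fully visible: it is the conditioning input.
    tokens = [CLS] + list(source) + [SEP] + masked_target + [SEP]
    return tokens, labels
```

Because the source is never masked, every masked target word is predicted from the full source plus both the left and right target context.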

3.3 Knowledge Distillation for Generation

Our inspiration springs from the observation that the probability distribution of the masked word $y_t^m$ is estimated using both $y_{1:t-1}^u$ and $y_{t+1:N}^u$ from $Y^u$. In other words, the distribution for a given word contains information from both backward and forward contexts, which is a desirable benefit for providing sequence-level global guidance. This probability distribution can be considered as soft targets for a text generation model to mimic, which potentially contain more useful and fine-grained information than the usual hard-assigned, one-hot label, therefore enhancing conventional left-to-right generation models to look into the future.

In a knowledge distillation setting, the BERT model can be considered as a teacher, while the Seq2Seq model acts as a student. Specifically, the Seq2Seq model can be trained with the following objective function:

$$\mathcal{L}_{bidi}(\theta) = -\sum_{w \in \mathcal{V}} \Big[ P_\phi(y_t = w \mid Y^u, X) \cdot \log P_\theta(y_t = w \mid y_{1:t-1}, X) \Big], \tag{4}$$

where $P_\phi(y_t)$ is the soft target estimated by the fine-tuned BERT with learned parameters $\phi$, and $\mathcal{V}$ denotes the output vocabulary. Note that $\phi$ is fixed during the distillation process. An illustration of this learning process is provided in Figure 1, which aims to match the word probability distribution $P_\theta(y_t)$ provided by the student with $P_\phi(y_t)$ provided by the teacher (i.e., distillation).
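To make the distillation term concrete, here is a minimal sketch of the per-position loss: a cross-entropy in which the teacher's soft distribution replaces the one-hot label. The function name `distillation_loss` and the dictionary-based distributions are illustrative assumptions; a real implementation would operate on batched tensors.

```python
import math

def distillation_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher soft targets and student predictions.

    Both arguments map vocabulary words to probabilities at one target
    position. The teacher distribution is treated as a constant.
    """
    return -sum(
        p * math.log(student_probs[w])
        for w, p in teacher_probs.items()
        if p > 0.0  # zero-probability words contribute nothing
    )

# Toy distributions over three candidate words (made-up numbers).
teacher = {"at": 0.6, "with": 0.3, "in": 0.1}
student = {"at": 0.2, "with": 0.7, "in": 0.1}
loss = distillation_loss(teacher, student)
```

The loss is smallest when the student matches the teacher exactly, and with a one-hot teacher it reduces to the usual negative log-likelihood of the labeled word.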

To further improve the Seq2Seq student model, hard-assigned labels are also utilized. The final model is trained with the following compound objective:

$$\mathcal{L}(\theta) = \alpha \mathcal{L}_{bidi}(\theta) + (1 - \alpha) \mathcal{L}_{xe}(\theta), \tag{5}$$

where $\alpha$ is a hyper-parameter for tuning the relative importance of the two training targets: soft estimation from fine-tuned BERT, and ground-truth hard label. Note that our proposed approach only has a minimal requirement on the architecture of the incorporated Seq2Seq model. As long as the model is trained to estimate word-level probability as in Eqn. (1), it can be trained jointly with the proposed objective function Eqn. (5).
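The compound objective can be sketched at the sequence level as below, assuming it is a convex combination in which the hyper-parameter weights the soft (teacher) term and its complement weights the hard (ground-truth) term. The function name `combined_objective` and the default weight of 0.5 are placeholders for illustration only; the actual values used in the experiments are not given in this section.

```python
import math

def combined_objective(teacher_dists, student_dists, gold_tokens, alpha=0.5):
    """Sum over positions of alpha * soft loss + (1 - alpha) * hard loss.

    teacher_dists[t] and student_dists[t] map words to probabilities at
    position t; gold_tokens[t] is the ground-truth word at position t.
    """
    total = 0.0
    for teacher, student, gold in zip(teacher_dists, student_dists, gold_tokens):
        # Soft term: imitate the bidirectional teacher's distribution.
        soft = -sum(p * math.log(student[w]) for w, p in teacher.items() if p > 0.0)
        # Hard term: standard negative log-likelihood of the gold word.
        hard = -math.log(student[gold])
        total += alpha * soft + (1.0 - alpha) * hard
    return total
```

Setting the weight to zero recovers plain MLE training, so the student architecture needs no change beyond exposing its word-level probabilities.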

At a higher level, the additional distillation loss term $\mathcal{L}_{bidi}$ can be interpreted as a sequence-level objective function. Our auto-regressive (or causal) model tries to predict the probability distribution that matches the estimation the bidirectional teacher model predicts, hence encouraging the planning of future (right context) for generation.

4 Experiments

In this section, we describe our experiments on two well-studied text generation tasks: machine translation, and abstractive text summarization.

4.1 Datasets and Training Details

Machine Translation We consider two relatively small-scale datasets, IWSLT15 English-Vietnamese (En-Vi, 113k training samples) and IWSLT14 German-English (De-En, 160k training samples), and one medium-scale dataset, WMT14 English-German (En-De, 4.5M training samples). For IWSLT15 En-Vi, we use the pre-processed dataset provided by Luong and Manning (2015). We use tst2012 as dev set and test on tst2013. For IWSLT14 De-En, we follow the pre-processing steps and the same train/dev/test split as in Wu et al. (2019). For WMT14 En-De, we follow the pre-processing steps in Vaswani et al. (2017) for fair comparison. We use newstest2013 as the dev set and newstest2014 as the test set. We report BLEU scores (Papineni et al., 2002) for evaluation of MT performance following the Moses script.[3]

[3] For fair comparison to previous work, we report tokenized BLEU scores using https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl, and for WMT14 En-De, we further split the compound words after tokenization.

Abstractive Summarization For summarization, we conduct experiments on the Gigaword summarization dataset (Rush et al., 2015). Note that the original train/valid/test split of Gigaword is 3.8M/190k/2k. In our experiments, we observed severe distribution mismatch between the validation and test data. See Table 5 and Section 4.3 for a detailed discussion. Therefore, we further sampled 5k/5k dev/test-dev splits from the validation set and tuned hyper-parameters on the dev set only. We report ROUGE scores (Lin, 2004) on test-dev for the evaluation of our proposed approach, and include results on the standard test split for comparison with prior work.
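One simple way to draw two disjoint 5k subsets from a validation set is sketched below. The function name `sample_dev_and_test_dev`, the fixed seed, and the use of uniform sampling without replacement are assumptions for illustration; the text does not specify how the sampling was actually done.

```python
import random

def sample_dev_and_test_dev(n_valid, n_each=5000, seed=0):
    """Return two disjoint lists of example indices drawn from range(n_valid)."""
    rng = random.Random(seed)
    # Sampling 2 * n_each indices without replacement guarantees disjointness.
    picked = rng.sample(range(n_valid), 2 * n_each)
    return picked[:n_each], picked[n_each:]

# Validation set size taken from the 3.8M/190k/2k split quoted above.
dev_ids, test_dev_ids = sample_dev_and_test_dev(190_000)
```

Keeping the two subsets disjoint lets hyper-parameters be tuned on one while the other serves as an untouched in-distribution evaluation set.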

De-En Models              dev     test
Our Implementations
Transformer (base)        35.27   34.09
+ BERT teacher            36.93   35.63
Other Reported Results
ConvS2S + MRT             33.91   32.85
Transformer (big)         -       34.4
Lightweight Conv          -       34.8
Dyn. Convolution          -       35.2
Table 1: BLEU scores for IWSLT14 German-English translation. Some of the other reported results were tuned with checkpoint averaging and length penalty. ConvS2S + MRT is from Edunov et al. (2018); Transformer (big), Lightweight Conv and Dyn. Convolution are from Wu et al. (2019).

En-Vi Models              tst2012   tst2013
Our Implementations
RNN                       23.37     26.80
+ BERT teacher            25.14     27.59
Transformer (base)        27.03     30.76
+ BERT teacher            27.85     31.51
Other Reported Results
RNN                       -         26.1
ELMo                      -         29.3
CVT                       -         29.6
Table 2: BLEU scores for IWSLT15 English-Vietnamese translation. The reported RNN result is from Luong et al. (2017); ELMo and CVT are from Clark et al. (2018).

Training and Hyper-parameters

Our implementation is based on the PyTorch (Paszke et al., 2017) version of the OpenNMT (Klein et al., 2018) seq2seq toolkit. We use the ‘base’ model of 6-layer Transformer with 512-hidden 8-head attention blocks and 2048-hidden feed-forward layer for all experiments, with label smoothing regularization (LSR) (Szegedy et al., 2016) of 0.1. We batch examples with similar sequence length, and count batch size by the number of tokens. For MT we use the pre-trained BERT-base-multilingual-cased model, and for summarization we use BERT-base-uncased as the starting point of BERT fine-tuning.[4] We use the corresponding pre-trained byte-pair-encoding (Sennrich et al., 2016) shipped together with the BERT model for tokenization.

[4] BERT pre-trained models are available at https://github.com/google-research/bert. Our fine-tuning implementation is modified from code available at https://github.com/huggingface/pytorch-pretrained-BERT.

For all training methods of all Transformer models, the learning rate schedule is set to $lr = \eta \cdot d_{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$, where $d_{model}$ is the attention representation size (Vaswani et al., 2017). For all BERT fine-tuning, we follow Devlin et al. (2019) and use a triangular learning rate schedule with maximum learning rate $\eta$. The parameters are updated with the Adam optimizer (Kingma and Ba, 2015). In the distillation stage, we pre-compute BERT’s prediction logits of the training data using top-$K$ distillation (Tan et al., 2019) to reduce computation overhead and memory footprint, where $K$ is set to 8 across all the experiments. We also tune the temperature $T$ for the softmax applied at the teacher’s logits.[5]

[5] Different from the original KD, we do not apply the same temperature on the student. In our preliminary experiment we found that a high $T$ on the Seq2Seq student results in much worse performance. We hypothesize that the low-entropy nature of conditioned text generation is not suitable for temperature scaling.
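The top-K truncation and teacher-side temperature described above can be sketched as follows. This is an illustration under assumptions: the function name `top_k_soft_targets` is hypothetical, and renormalizing the softmax over only the K retained words is one plausible choice that the text does not confirm.

```python
import math

def top_k_soft_targets(logits, k=8, temperature=1.0):
    """Keep the k largest teacher logits and turn them into soft targets.

    `logits` maps vocabulary words to raw teacher scores at one position.
    Returns a dict over the k retained words whose values sum to 1.
    """
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scaled = [value / temperature for _, value in top]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return {word: e / norm for (word, _), e in zip(top, exps)}
```

Storing only K words per position keeps the pre-computed teacher outputs small, and a higher temperature spreads probability mass more evenly across those K candidates.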

For the detailed values of the hyper-parameters for each experiment, please refer to the supplementary material. We found it necessary to train longer with the distillation loss $\mathcal{L}_{bidi}$, since the model is still improving after the step at which the baseline Transformer starts to plateau. At inference time, we use beam search with beam size 4 and length penalty (Wu et al., 2016) of 0.6 across all the models. All the hyper-parameters are tuned on the development set. Note that we tuned our Transformer baseline to achieve higher scores than the reference implementation on each dataset with default hyper-parameters (in most cases comparable to the state of the art).
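For readers unfamiliar with the length penalty of Wu et al. (2016), a small sketch of how it rescales beam-search scores is given below, using the 0.6 value quoted above. The function names and the two toy hypotheses are illustrative assumptions; how the toolkit applies the penalty internally may differ in detail.

```python
def length_penalty(length, alpha=0.6):
    """Length penalty of Wu et al. (2016): ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(log_prob, length, alpha=0.6):
    """Divide a hypothesis log-probability by its length penalty."""
    return log_prob / length_penalty(length, alpha)

# Two made-up hypotheses: the longer one has a lower raw log-probability
# but ranks higher once the length penalty is applied.
short_score = rescore(-4.0, length=5)
long_score = rescore(-4.5, length=10)
```

Without this normalization, beam search tends to favor short outputs, since every additional token can only lower the total log-probability.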

4.2 Results on Machine Translation

En-De Models              NT2013       NT2014
Our Implementations
Transformer (base)        25.95        26.94
+ BERT teacher            26.22        27.53
Other Reported Results
Transformer (base)        25.8         27.3
Transformer (big)         26.5         29.3
Dyn. Convolution          26.9 ± 0.2   29.7
Table 3: BLEU scores for WMT14 English-German translation. Among the other reported results, some were tuned with checkpoint averaging, and some were trained on WMT16, a slightly different version of the training data. Transformer (base) is from Vaswani et al. (2017), Transformer (big) is from Ott et al. (2018), and Dyn. Convolution is from Wu et al. (2019).

We first validate our proposed text generation approach on the machine translation task. Experimental results are summarized in Tables 1, 2 and 3, which show that our model significantly improves over the strong Transformer baseline across all three datasets. Note that our baseline is the ‘base’ model of Transformer, which has 44M trainable parameters, and the reference implementation by Wu et al. (2019) is Transformer (big) with 176M trainable parameters.[6]

[6] Parameter counts exclude word embedding and final linear projection, which mostly depend on the vocabulary size. BERT-base has 86M trainable parameters.
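The quoted parameter counts can be checked with a short back-of-the-envelope calculation. The sketch below assumes a standard encoder-decoder Transformer with biases on every linear projection and one layer normalization per sub-layer, and takes the 'big' dimensions (1024 hidden, 4096 feed-forward) from Vaswani et al. (2017); embeddings and the output projection are excluded, as in the footnote. The function names are hypothetical.

```python
def attention_params(d_model):
    # Query, key, value and output projections, each with a bias vector.
    return 4 * (d_model * d_model + d_model)

def ffn_params(d_model, d_ff):
    # Two linear layers: d_model -> d_ff -> d_model, each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def layer_norm_params(d_model):
    # One gain and one bias vector.
    return 2 * d_model

def transformer_params(n_layers, d_model, d_ff):
    """Parameters of an encoder-decoder Transformer, excluding embeddings."""
    encoder_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                     + 2 * layer_norm_params(d_model))
    # Decoder layers add a second (cross-)attention block and a third norm.
    decoder_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                     + 3 * layer_norm_params(d_model))
    return n_layers * (encoder_layer + decoder_layer)

base = transformer_params(6, 512, 2048)    # 44,138,496
big = transformer_params(6, 1024, 4096)    # 176,357,376
```

Under these assumptions the totals round to 44M and 176M, matching the figures given for the base and big models.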

For IWSLT German-English translation, our method improves over the Transformer baseline by 1.54 BLEU points, and achieves a new state of the art. Our approach outperforms previously reported results such as ConvS2S+MRT, a convolution-based model (Gehring et al., 2017) with minimum risk training (Edunov et al., 2018), and Lightweight and Dynamic Convolution (Wu et al., 2019). Note that Wu et al. (2019) also tuned checkpoint averaging, which creates a soft ensemble effect, and that their model has roughly the same number of parameters as Transformer (big).

For IWSLT English-Vietnamese translation, since most prior work experimented with RNN models, we also report RNN-based results here. This also suggests that our method is model-agnostic. Our best model outperforms Seq2Seq-OT (Chen et al., 2019), which utilizes optimal transport for sequence-level training, as well as the ELMo and CVT results reported in Clark et al. (2018).[7] For WMT14 English-German translation, our method still improves over the well-tuned Transformer baseline. We also report the scores of Transformer (big) and the state-of-the-art Dynamic Convolution model (Wu et al., 2019) for reference.

[7] The CVT results used a much larger RNN and CNN-based character embedding, as well as a customized structure. Therefore, we did not try to use RNN to match their results.

GW Models                 R-1     R-2     R-L
Dev
Transformer (base)        46.64   24.37   43.17
+ BERT teacher            47.35   25.11   44.04
Test-Dev
Transformer (base)        46.84   24.80   43.58
+ BERT teacher            47.90   25.75   44.53
Table 4: ROUGE F1 scores for Gigaword abstractive summarization on our internal test-dev split.

GW Models                 R-1     R-2     R-L
Seq2Seq                   36.40   17.77   33.71
CGU                       36.3    18.0    33.8
FTSum                     37.27   17.65   34.24
E2T                       37.04   16.66   34.93
ReSum                     37.04   19.03   34.46
Trm + BERT teacher        37.57   18.59   34.82
Table 5: ROUGE F1 scores on the official test set. Seq2Seq is from Nallapati et al. (2016), CGU from Lin et al. (2018), FTSum from Cao et al. (2018b), E2T from Amplayo et al. (2018), and ReSum from Cao et al. (2018a).

4.3 Results on Abstractive Summarization

Table 4 and Table 5 show the results of our approach on the abstractive summarization task, where R-1, R-2, and R-L denote scores of ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Our method shows improvement on all the metrics, as shown in Table 4. We observe that the performance on the test set is much lower, which suggests that the distribution in the test set is very different from that in the validation set, as mentioned in Section 4.1. When we manually checked the test set data, we found many corrupted examples such as short input articles, meaningless text, and dominating unknown words. Given that the official test split contains only 1,951 noisy examples, we believe that our results on the dev/test-dev sets are more reliable.

On the test split, our best model is comparable to state-of-the-art models that use much more complex architectures specifically designed for summarization. CGU (Lin et al., 2018) augmented the model with convolutional gating units. FTSum (Cao et al., 2018b) leveraged extra information extraction and dependency parsing features. E2T (Amplayo et al., 2018) utilized entities provided by an external entity linking system. ReSum (Cao et al., 2018a) carefully designed a retrieve-and-rerank pipeline with human-written soft templates. Although our model has no summarization-specific design, we still achieve comparable performance to these models on all the metrics.

4.4 Ablation Study

Methods                   En-Vi (tst2012)   De-En (dev)
Transformer (base)        27.03             35.27
Trm + BERT_l2r            26.99             35.20
Trm + BERT_sm             27.68             36.32
Trm + BERT                27.85             36.93
Table 6: Ablation study. (Trm: Transformer; BERT_l2r: BERT fine-tuned with a left-to-right LM objective; BERT_sm: smaller 6-layer BERT)

There are several possible factors that could contribute to the performance gain: additional parameters of BERT, extra data (pretraining corpus) of BERT, and the bidirectional nature. To better understand the key contributions of our method, we conduct an ablation study described in the following. We finetune 2 extra teachers: BERT_sm and BERT_l2r. For BERT_sm, we use a smaller BERT (6 layers) for C-MLM finetuning, which has approximately the same number of parameters as Transformer-base.[8] For BERT_l2r, we use the full BERT model but finetune it using left-to-right LM as in the conventional Seq2Seq model. Next, we apply the proposed KD method to train the Transformer on En-Vi and De-En MT tasks. Results are shown in Table 6. BERT_sm still works well, though the full BERT provides further improvement. On the other hand, BERT_l2r slightly hurts the performance. We hypothesize that it generates noisy learning targets for the student, hence the performance drop. Empirically, we show that the bidirectional knowledge could be more important than the extra parameters, while the pre-trained weights remain useful for more stable C-MLM training.

[8] We still use the pretrained weights of BERT, otherwise the C-MLM does not converge very well.

Figure 2: BLEU scores on IWSLT German-English, WMT English-German and IWSLT English-Vietnamese for different output lengths.
Reference: my mother says that i started reading at the age of two , although i think four is probably close to the truth .
Transformer: my mother says that i started reading [red]with two years , but i think that four [red]of them probably correspond to the truth . (39.6)
Ours: my mother says that i started reading [blue]at the age of two , but i think four [blue]is more likely to be the truth . (65.2)

Reference: we already have the data showing that it reduces the duration of your flu by a few hours .
Transformer: we ’ve already got the data showing that it ’s going to [red]crash the duration of your flu by a few hours . (56.6)
Ours: we already have the data showing that it [blue]reduces the duration of your flu by a few hours . (100.0)

Reference: we now know that at gombe alone , there are nine different ways in which chimpanzees use different objects for different purposes .
Transformer: we know today that alone in gombe , there are nine different ways that chimpanzees use different objects [red]in different ways . (35.8)
Ours: we now know that in gombe alone , there are nine different ways that chimpanzees use different objects [blue]for different purposes . (71.5)
Table 7: Qualitative examples from IWSLT German-English translation. Numbers in parentheses are sentence-level BLEU scores. The [red] tag marks where the baseline Transformer makes a mistake without considering the possible future phrase and fails to recover. In contrast, our model makes the right decision at the [blue] tag, and hence generates a more coherent sentence. Please refer to Section 4.6 for a detailed explanation.

4.5 Generation for Different Lengths

We next analyze the effect of our proposed approach on different output lengths. We plot the BLEU scores on MT w.r.t. different output generation lengths on the development set.[9] Results are provided in Figure 2. For IWSLT German-English dataset (Figure 2: Left), we can see a shared trend that the proposed objective gains higher BLEU points on longer translation pairs. For WMT English-German (Figure 2: Middle), we can see that although the proposed method performs much worse when the output sentences are very short, it achieves relatively consistent improvement on longer cases, hence resulting in overall BLEU improvement. For IWSLT English-Vietnamese (Figure 2: Right), we see a similar trend when the length .

[9] For Gigaword summarization, almost all summaries are short sentences (less than 0.5% of the summaries contain more than 16 words), so we omit the analysis.

4.6 Qualitative Examples

In Table 7, we show some translation examples on the IWSLT German-English dataset. In the first example, the baseline Transformer cannot recover from ‘with’ and ‘of’, which leaves the full sentence making little sense. “I started reading with…” would make sense from the left context; however, if the model also considers the right context “the age of two”, the word ‘with’ would be assigned a lower probability by the soft labels provided by the BERT teacher. Even though at test time the model cannot ‘look ahead’, the soft targets at training time prevent the model from being over-confident on the one-hot label, hence the better generalization at test time. Similarly, other examples show that our model can generate text more coherently w.r.t. the context on the right (underlined in Table 7), thus producing more accurate and natural translations.

5 Conclusion

In this work, we propose a novel and generic approach to utilizing pre-trained language models to improve text generation without explicit parameter sharing, feature extraction, or augmenting with auxiliary tasks.

Our proposed Conditional MLM mechanism leverages unsupervised language models pre-trained on a large corpus, and then adapts them to supervised sequence-to-sequence tasks. Our distillation approach indirectly influences the text generation model by providing soft-label distributions only, and hence is model-agnostic. Experiments show that our model improves over strong Transformer baselines on multiple text generation tasks such as machine translation and abstractive summarization, and achieves a new state of the art on some of the translation tasks. For future work, we will explore the extension of Conditional MLM to multimodal input such as image captioning.


References

  • R. K. Amplayo, S. Lim, and S. Hwang (2018) Entity commonsense representation for neural abstractive summarization. In NAACL, Cited by: §4.3, Table 5.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §1, §3.1.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, Cited by: §1.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. TACL. Cited by: §2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In EMNLP, Cited by: §1.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In KDD, Cited by: §2.
  • Z. Cao, W. Li, S. Li, and F. Wei (2018a) Retrieve, rerank and rewrite: soft template based neural summarization. In ACL, Cited by: §4.3, Table 5.
  • Z. Cao, F. Wei, W. Li, and S. Li (2018b) Faithful to the original: fact aware neural abstractive summarization. In AAAI, Cited by: §4.3, Table 5.
  • L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan, H. Zhang, B. Li, D. Shen, C. Chen, and L. Carin (2019) Improving sequence-to-sequence learning via optimal transport. In ICLR, Cited by: §4.2.
  • Y. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL, Cited by: §1.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, Cited by: §1.
  • K. Clark, M. Luong, C. D. Manning, and Q. Le (2018) Semi-supervised sequence modeling with cross-view training. In EMNLP, Cited by: §4.2, Table 2.
  • A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In EMNLP, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §3.2, §4.1, footnote 1.
  • S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato (2018) Classical structured prediction losses for sequence to sequence learning. In NAACL, Cited by: §4.2, Table 2.
  • Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng (2017) Semantic compositional networks for visual captioning. In CVPR, Cited by: §1.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In ICML, Cited by: §1, §4.2.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Constant-time machine translation with conditional masked language models. arXiv preprint arXiv:1904.09324. Cited by: §2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, Cited by: §2.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In ACL, Cited by: §2.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In EMNLP, Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.
  • G. Klein, Y. Kim, Y. Deng, V. Nguyen, J. Senellart, and A. Rush (2018) OpenNMT: neural machine translation toolkit. In AMTA, Cited by: §4.1.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised machine translation using monolingual corpora only. In ICLR, Cited by: §2.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §2.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In ACL Text Summarization Branches Out Workshop, Cited by: §4.1.
  • J. Lin, X. Sun, S. Ma, and Q. Su (2018) Global encoding for abstractive summarization. In ACL, Cited by: §4.3, Table 5.
  • L. Liu, M. Utiyama, A. Finch, and E. Sumita (2016) Agreement on target-bidirectional neural machine translation. In NAACL, Cited by: §2.
  • M. Luong, E. Brevdo, and R. Zhao (2017) Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt. Cited by: Table 2.
  • M. Luong and C. D. Manning (2015) Stanford neural machine translation systems for spoken language domain. In IWSLT, Cited by: §4.1.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In EMNLP, Cited by: §3.1.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In NIPS, Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, Cited by: §2.
  • R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL, Cited by: §1, Table 5.
  • M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. In WMT, Cited by: Appendix A, Table 3.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, Cited by: §4.1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §1, §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1, §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, Cited by: §1.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In EMNLP, Cited by: §4.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In ACL, Cited by: §4.1.
  • D. Serdyuk, N. R. Ke, A. Sordoni, A. Trischler, C. Pal, and Y. Bengio (2018) Twin networks: matching the future for sequence generation. In ICLR, Cited by: §2.
  • K. S. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In ICML, Cited by: §2, §3.2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: Appendix A.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355. Cited by: §2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, Cited by: §1, §3.1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §4.1.
  • X. Tan, Y. Ren, D. He, T. Qin, Z. Zhao, and T. Liu (2019) Multilingual neural machine translation with knowledge distillation. In ICLR, Cited by: §2, §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §1, §3.1, §4.1, §4.1, Table 3.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, Cited by: §1.
  • A. Wang and K. Cho (2019) BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094. Cited by: §2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, Cited by: §1.
  • F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli (2019) Pay less attention with lightweight and dynamic convolutions. In ICLR, Cited by: §4.1, §4.2, §4.2, §4.2, Table 2, Table 3.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §4.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §1.
  • J. Yang, M. Wang, H. Zhou, C. Zhao, Y. Yu, W. Zhang, and L. Li (2019) Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672. Cited by: §2.
  • H. Zhang, J. Xu, and J. Wang (2019a) Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243. Cited by: §2.
  • Z. Zhang, S. Wu, S. Liu, M. Li, M. Zhou, and E. Chen (2019b) Regularizing neural machine translation by target-bidirectional agreement. In AAAI, Cited by: §2.

Appendix A Detailed Hyper-parameter Values

We run all experiments on a single NVIDIA Titan RTX or V100 GPU, except for WMT En-De, where we use 4 V100s for training. For large batch sizes that do not fit in GPU memory, we use gradient accumulation as in Ott et al. (2018). Batch sizes are counted in number of tokens. All hyper-parameters are tuned on the development set only.
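Token-counted batching with gradient accumulation can be sketched as follows. This is a minimal plain-Python illustration with a hypothetical one-parameter model and a squared-error loss, not the training code used for the experiments: micro-batches are packed up to a token budget, and their gradients are summed before a single parameter update, which reproduces the gradient of one large batch.

```python
def token_batches(examples, max_tokens):
    """Greedily pack (tokens, target) examples into batches by token count."""
    batch, size = [], 0
    for ex in examples:
        n = len(ex[0])
        if batch and size + n > max_tokens:
            yield batch
            batch, size = [], 0
        batch.append(ex)
        size += n
    if batch:
        yield batch


def grad_sum(w, batch):
    """Summed gradient of (w * len(tokens) - target)^2 over a batch (toy model)."""
    return sum(2.0 * (w * len(toks) - y) * len(toks) for toks, y in batch)


def accumulated_step(w, examples, max_tokens, lr):
    """One update whose gradient is accumulated over several micro-batches."""
    total_grad, total_tokens = 0.0, 0
    for micro in token_batches(examples, max_tokens):
        total_grad += grad_sum(w, micro)            # accumulate, no update yet
        total_tokens += sum(len(t) for t, _ in micro)
    return w - lr * total_grad / total_tokens       # normalize by token count


# Hypothetical data: three "sentences" of 3, 5, and 4 tokens.
examples = [(["a"] * 3, 6.0), (["b"] * 5, 10.0), (["c"] * 4, 8.0)]
w_accumulated = accumulated_step(0.0, examples, max_tokens=8, lr=0.01)
w_single_batch = accumulated_step(0.0, examples, max_tokens=100, lr=0.01)
```

With a budget of 8 tokens the data is split into two micro-batches, yet the resulting update matches the one computed from a single 12-token batch, which is why the effective batch size can exceed what fits in GPU memory.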


IWSLT De-En: For C-MLM fine-tuning, we train for 100k steps with 5k , , and batch size of 16k tokens. For the baseline model, we train for 50k steps with 4k and batch size of 6k tokens. The learning rate is set to 1. For the proposed model, we train for 100k steps with 8k and batch size of 6k tokens. The learning rate is set to 2, , and . The Seq2Seq model uses dropout (Srivastava et al., 2014) of 0.3 in both cases.


IWSLT En-Vi: For C-MLM fine-tuning and the baseline Transformer, the hyper-parameters are identical to those of IWSLT De-En. For the proposed model, we train for 100k steps with 8k and batch size of 6k tokens. The learning rate is set to 2, , and . Dropout is still 0.1.


WMT En-De: For C-MLM fine-tuning, we train for 100k steps with 5k , , and batch size of 512k tokens. For the baseline model, we train for 30k steps with 4k and batch size of 384k tokens. The learning rate is set to 4. Since this is our largest dataset and training is slow, for the proposed model we use the baseline Transformer to initialize the Seq2Seq student. For the proposed model, we continue training for 50k steps with 4k and batch size of 64k tokens. The learning rate is set to 2, , and . The Seq2Seq model uses dropout of 0.1 in both cases.


Gigaword: For C-MLM fine-tuning, we train for 100k steps with 5k , , and batch size of 64k tokens. For the baseline model, we train for 50k steps with 4k and batch size of 40k tokens. The learning rate is set to 1. For the proposed model, we train for 70k steps with 4k and batch size of 36k tokens. The learning rate is set to 2, , and . The Seq2Seq model uses dropout of 0.1 in both cases.

Appendix B Extra Generation Examples

We show Gigaword summarization examples in Table 9 and extra De-En generation examples in Table 8. Qualitatively, our Transformer + BERT Teacher outperforms the baseline Transformer and generates more coherent sentences.

Reference: the political climate in the u.s. at the time was tense , and there were debates going on about immigration .
Transformer: the political climate in the u.s. was [red: back] then , and there was constant disasters . (29.5)
Ours: the political climate in the united states at the time was [blue: tense] , and there were ongoing shifting debates . (57.3)

Reference: it would be immoral to leave these young people with a climate system spiraling out of control .
Transformer: it would be immoral to [red: let] these young people leave a climate system that was out of control . (44.6)
Ours: it would be immoral to [blue: leave] these young people with a climate system out of control . (84.3)

Reference: the tahltan have called for the creation of a tribal heritage reserve which will set aside the largest protected area in british columbia .
Transformer: tahltan demands the [red: institution] of a tribe in british columbia that should make the largest protection area in british columbia . (19.9)
Ours: the tahltan demands to [blue: build] a tribe reserve that should be the largest protected area in british columbia . (32.2)

Table 8: Qualitative examples from IWSLT German-English translation. Numbers in parentheses are sentence-level BLEU scores. The word marked [red: …] is where the baseline Transformer makes a mistake without considering the possible future phrase and fails to recover. In contrast, our model makes the right decision at the word marked [blue: …], and hence generates a more coherent sentence. Please refer to Section 4.6 in the main paper for a detailed explanation.

Reference: china offers tax exemptions for laid-off workers
Transformer: china encourages laid-off workers to seek employment
Ours: china offers tax exemptions to laid-off workers

Reference: swiss police arrest britons who allegedly ran rental car racket
Transformer: three britons arrested in swiss luxury hotel
Ours: swiss police arrest three britons in rental car racket case

Reference: south korea stocks extend declines as kia concerns intensify
Transformer: south korean stocks fall for #th time in # days ; kia leads
Ours: south korean stocks fall as kia troubles intensify

Table 9: Qualitative examples from the Gigaword summarization dataset. The baseline model suffers from early mistakes. Our model generates more coherent summaries.