Quantifying Exposure Bias for Neural Language Generation

05/25/2019 ∙ by Tianxing He, et al. ∙ MIT

The exposure bias problem refers to the training-inference discrepancy caused by teacher forcing in maximum likelihood estimation (MLE) training for recurrent neural network language models (RNNLM). It has been regarded as a central problem for natural language generation (NLG) model training. Although many algorithms have been proposed to avoid teacher forcing and therefore to remove exposure bias, there is little work showing how serious the exposure bias problem is. In this work, starting from the definition of exposure bias, we propose two simple and intuitive approaches to quantify exposure bias for MLE-trained language models. Experiments are conducted on both synthetic and real data-sets. Surprisingly, our results indicate that either exposure bias is trivial (i.e. indistinguishable from the mismatch between model and data distribution), or is not as significant as it is presumed to be (with a measured performance gap of 3%). With this work, we suggest re-evaluating the viewpoint that teacher forcing or exposure bias is a major drawback of MLE training.




1 Introduction

The language model (LM) is a central module for natural language generation (NLG) tasks (Young et al., 2017) such as machine translation (Wu et al., 2017), dialogue response generation (Li et al., 2017), and image captioning (Lin et al., 2014). For decades, maximum likelihood estimation (MLE) has been the most widely used objective for LM training. However, there is a popular belief in the natural language processing (NLP) community that standard MLE training causes “exposure bias” and leads to performance degradation during test-time language generation.

The exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016) refers to the following discrepancy between MLE model training and test-time generation for language models. During training, the language model predicts the next word conditioned on words sampled from the ground-truth data distribution. During generation, the model generates words conditioned on sequences generated by the model itself. However, due to the exposure to real data during training, the language model is biased to only perform well on the ground-truth history distribution. As a result, during generation the errors will accumulate along the generated sequence, and the distribution generated by the model will be distorted. The forced exposure to ground-truth data during training is also referred to as “teacher forcing”.

In order to avoid teacher forcing, many training algorithms (Bengio et al., 2015; Ranzato et al., 2016; Yu et al., 2016; Zhu et al., 2018; Lu et al., 2018; Lin et al., 2017; Guo et al., 2017; Rajeswar et al., 2017; Wiseman and Rush, 2016; Nie et al., 2019; Shi et al., 2018) have been proposed as alternatives to MLE training. Most of these works utilize techniques from generative adversarial networks (GAN) (Goodfellow et al., 2014) or reinforcement learning (RL) (Sutton and Barto, 1998). In this paper, we refer to these algorithms as non-MLE methods or text GANs.

Despite the huge research efforts devoted to avoiding exposure bias, surprisingly, its existence or significance is much less studied. In particular, to the best of our knowledge, no existing work attempts to quantify exposure bias in an empirical or theoretical way. This work is motivated by the belief that a good solution should be built upon a testable and quantifiable problem definition. Starting from the definition of exposure bias, we propose two intuitive quantification approaches to empirically measure the significance of exposure bias for language modelling. Our experimental results show that exposure bias is either insignificant or indistinguishable from the mismatch between data and model distribution.

2 Notations

The task of language modelling is to learn the probability distribution of the $(l+1)$-th word $W_{l+1}$ in a sentence $W$ conditioned on the word history $W_{1:l} := (W_1, \dots, W_l)$. Here, we use the uppercase $W_i$ to denote a discrete random variable distributed across the vocabulary $V$. The lower-case $w$ is used to denote some particular word in the vocabulary $V$. Given a training data-set $D$ consisting of sentences of length $L$, the standard MLE training minimizes the negative log-likelihood below:

$$L_{MLE} = \frac{1}{|D|} \sum_{W \in D} \frac{1}{L} \sum_{l=0}^{L-1} -\log P_M(W_{l+1} \mid W_{1:l}) \tag{1}$$

Note that in this work we assume all sentences are of length $L$ for simplicity. In the rest of the paper, we denote the generation distribution of the trained LM as $P_M$, and the ground-truth data distribution as $P_D$. Readers can assume $P_M$ refers to the generation distribution of an LSTM LM (Hochreiter and Schmidhuber, 1997; Sundermeyer et al., 2012) trained with the MLE objective, which is the major subject of this study.

Our quantification mainly relies on measurements of the distance from the model’s generation distribution to the data distribution. Hence we define the following notations to simplify expressions. Let $\mathcal{P}$ denote the set of probability distributions on the vocabulary $V$. Let $d$ denote a distance measure between distributions (e.g. total variation distance), $d: \mathcal{P} \times \mathcal{P} \to \mathbb{R}_{\geq 0}$.
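To make the notation concrete, the following is a minimal sketch of the per-word negative log-likelihood and of two distance measures on a vocabulary, total variation distance and Jensen-Shannon divergence. The function names and the representation of a distribution as a plain list of probabilities are assumptions made for illustration, not part of the original work.

```python
import math

def tv_distance(p, q):
    """Total variation distance between two distributions on the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; words with p(w) = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def sentence_nll(word_probs):
    """Per-word negative log-likelihood of one sentence.

    `word_probs` holds the probability the model assigned to each
    ground-truth word given the ground-truth history (teacher forcing).
    """
    return -sum(math.log(x) for x in word_probs) / len(word_probs)

if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print("TV:", tv_distance(p, q))
    print("JS:", js_divergence(p, q))
    print("NLL:", sentence_nll([0.5, 0.25, 0.125]))
```

Both distances are zero when the two distributions coincide, which is the property the later measurements rely on.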

3 Methodology

Throughout this work we study the definition of exposure bias based on the following principle: A good definition should lead to a quantifiable measurement to demonstrate its significance. Hence, we validate the following claim that immediately follows from the definition of exposure bias: During sampling, if we set the history distribution to be the ground-truth data distribution instead of the model’s own distribution (so that there is no discrepancy between training and testing), then the model’s language generation quality should be much better.

To show the necessity of quantification for exposure bias, we start with the following preliminary experiment. We feed an MLE-trained LSTM LM, trained on the EMNLP-news data-set (details are given in Section 4.2), with three kinds of prefixes: the model’s own samples, data samples, or samples from a uniform random distribution. Then we let the model complete the sentence given these prefixes as history. We list some samples in Table 1 and more in Appendix A. By manual inspection, we do not observe noticeable differences in sample quality when comparing the generated sentences from the three different distributions. Even in the extreme case where random sequences are fed, the model is still able to generate reasonable sentences. Therefore, in the following sections, we turn to more sophisticated methods to quantify the significance of exposure bias. Note that our quantification approaches are independent of the training procedure and only require inference from the trained model.

Model Samples as History Model Samples
it was only a pieces that had gone up to the forest and forces the shoppers about their chronic young
i mean we didn ’ t know what i haven ’ t considered through , " she told bbc radio
if he were the president - elect , he was known that he would run a force in business at
Data Samples as History Model Samples
what this group does is to take down various different players in the future and we play in paris we
over 1 , 600 a day have reached greece this gone in 2013 and it planned to allow civilians on
" we ’ re working through a legacy period , and i am proud of the experience of the worker
Random Sequences as History Model Samples
…RANDOM… big winter deserve , but they just say it your things goes wrong
…RANDOM… playoff north realise at its lowest level , improving their understanding in danger
…RANDOM… vital childhood registration , not previously planned for <unk> to each and reduced

Table 1: Samples of an MLE-trained model when fed with different kinds of history. To save space, we omitted the first 7 words of the random history.

4 A Quantification Approach using Marginal Distribution

4.1 Method

In this section, we describe a simple and intuitive approach to quantify exposure bias, which is applicable to real-world data-sets. Assuming a given history length $l$, we consider the marginal distribution of $W_{l+1}$ from the following three random processes, where $P_D$ is the data distribution and $P_M$ is the model distribution:

  • Draw word sequences of length $L$ from the data distribution $P_D$. Denote the marginal distribution of the random variable at position $l+1$ ($W_{l+1}$) as $P^{l+1}_{D|D}$, where $P^{l+1}_{D|D}(w) = \mathbb{E}_{W_{1:l} \sim P_D}\left[P_D(w \mid W_{1:l})\right]$ (2)

  • Draw word sequences of length $L$ from the model distribution $P_M$. Denote the marginal distribution of the random variable at position $l+1$ as $P^{l+1}_{M|M}$, where $P^{l+1}_{M|M}(w) = \mathbb{E}_{W_{1:l} \sim P_M}\left[P_M(w \mid W_{1:l})\right]$ (3)

  • First draw $W_{1:l}$ from $P_D$, then draw $W_{l+1}$ from $P_M(\cdot \mid W_{1:l})$. Denote the marginal distribution of the random variable at position $l+1$ as $P^{l+1}_{M|D}$, where $P^{l+1}_{M|D}(w) = \mathbb{E}_{W_{1:l} \sim P_D}\left[P_M(w \mid W_{1:l})\right]$ (4)

From the definition of exposure bias, $P^{l+1}_{M|M}$ suffers from the training-testing discrepancy, while $P^{l+1}_{M|D}$ should behave better and be closer to the true distribution $P^{l+1}_{D|D}$. To measure this discrepancy, define the marginal generation deviation (MGD) at history length $l$ of history distribution $P_H$ with metric $d$ as

$$\mathrm{MGD}(P_{M|H}, l, d) = d\left(P^{l+1}_{M|H},\, P^{l+1}_{D|D}\right) \tag{5}$$

where $P_H \in \{P_M, P_D\}$ denotes the history distribution. MGD measures the deviation of the marginal distribution of $W_{l+1}$ from the ground-truth data distribution. Finally, we define the rate of exposure bias (EB-M) at history length $l$ of model $P_M$ as the ratio (discrepancy) between the MGD measurements when two different history distributions are fed:

$$\text{EB-M}(M, l, d) = \frac{\mathrm{MGD}(P_{M|M}, l, d)}{\mathrm{MGD}(P_{M|D}, l, d)} \tag{6}$$

For MLE-trained models, EB-M is expected to be larger than 1, and a larger EB-M indicates a more serious exposure bias problem for the trained model. (Note that one can also directly measure $d(P^{l+1}_{M|M}, P^{l+1}_{M|D})$, but in that way we can not tell which distribution is better.) For the choice of $d$, we experiment with two popular probability metrics: total variation distance (denoted as $d_{TV}$), and Jensen-Shannon divergence (denoted as $d_{JS}$).

The remaining problem is to estimate the described marginal distributions of $W_{l+1}$. We adopt a simple sample-and-count method: $P^{l+1}_{D|D}$ is estimated by the distribution (histogram) of $W_{l+1}$ from a number (to be specified in Section 4.2) of sentences sampled from the data distribution. For $P^{l+1}_{M|M}$ and $P^{l+1}_{M|D}$, we first draw a number of history samples from the corresponding history distribution (model distribution and data distribution respectively). We then feed the sampled history sequences into the trained model and estimate the marginal distribution of the $(l+1)$-th word by averaging the predicted distribution $P_M(\cdot \mid W_{1:l})$.
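The sample-and-count procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is represented by a hypothetical callable `model_cond(history)` returning a next-word distribution, words are integer ids, and `toy_model` and `toy_data` are made-up stand-ins for a trained LSTM LM and a corpus.

```python
import random

def tv_distance(p, q):
    """Total variation distance between two distributions on the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def sample_history(model_cond, length, rng):
    """Draw `length` words one at a time from the model's own conditionals."""
    seq = []
    for _ in range(length):
        probs = model_cond(tuple(seq))
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tuple(seq)

def histogram_marginal(sentences, l, vocab_size):
    """Estimate the data marginal of the (l+1)-th word by counting."""
    counts = [0] * vocab_size
    for s in sentences:
        counts[s[l]] += 1  # 0-based index l is the (l+1)-th word
    return [c / len(sentences) for c in counts]

def averaged_marginal(model_cond, histories, vocab_size):
    """Estimate the model marginal by averaging predicted distributions over histories."""
    total = [0.0] * vocab_size
    for h in histories:
        for w, p in enumerate(model_cond(h)):
            total[w] += p
    return [t / len(histories) for t in total]

def eb_m(model_cond, data_sentences, l, vocab_size, rng, d=tv_distance):
    """Ratio of MGD under model history to MGD under data history."""
    p_data = histogram_marginal(data_sentences, l, vocab_size)
    data_hist = [tuple(s[:l]) for s in data_sentences]
    model_hist = [sample_history(model_cond, l, rng) for _ in data_sentences]
    mgd_model_hist = d(averaged_marginal(model_cond, model_hist, vocab_size), p_data)
    mgd_data_hist = d(averaged_marginal(model_cond, data_hist, vocab_size), p_data)
    # The ratio is undefined when the denominator is exactly zero.
    return mgd_model_hist / mgd_data_hist

def toy_model(history):
    """Hypothetical two-word model that tends to repeat the previous word."""
    if not history:
        return [0.5, 0.5]
    return [0.9, 0.1] if history[-1] == 0 else [0.1, 0.9]

toy_data = [(0, 0), (0, 0), (0, 1), (1, 1)]

if __name__ == "__main__":
    print(eb_m(toy_model, toy_data, l=1, vocab_size=2, rng=random.Random(0)))
```

With only four toy sentences the estimate is very noisy; the number of samples needed for a stable measurement is an empirical question.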

4.2 Experiments

We measure EB-M for MLE-trained LSTM LMs on two popular data-sets: EMNLP-news and wikitext-103 (the wikitext-103 data is publicly available). For EMNLP-news we fix the sentence length $L$, and only use data samples whose length is longer than $L$. The resulting training/validation/test set has 268k/10k/10k sentences. The vocabulary is of size 5k. We use the 10k samples in the test set for evaluation of EB-M. Note that the EMNLP-news data-set is widely used in text GAN literature (Yu et al., 2016; Lu et al., 2018). We train a one-layer LSTM LM (Sundermeyer et al., 2012) of hidden dimension 512 as the MLE baseline model for EMNLP-news.

For wikitext-103, we likewise fix $L$, and regard a paragraph in the original data as a long sentence. Further, we use half of the data for LM training, and utilize the other half for EB-M evaluation. The resulting training/validation/test/evaluation set has 300k/1.5k/1.5k/300k sentences. The vocabulary is of size 50k. We train a two-layer LSTM LM of hidden dimension 1024 as the MLE baseline model for wikitext-103. For MLE baseline model training, the Adam optimizer is used with learning rate 0.001, and no Dropout (Srivastava et al., 2014) is applied. The model is trained for 100 epochs.

We first measure EB-M on the wikitext-103 data-set, which has a large amount of evaluation data. The results are shown in Figure 1(a). We report measurements under one of the two metrics here and provide those under the other metric in Appendix C, as the results are similar. The measurements become stable when using 100k data/model samples. EB-M has an average value of 1.10, indicating a significant gap between the model’s MGD when fed with history from the model distribution $P_M$ and from the data distribution $P_D$. Further, we observe a steady growth of EB-M along the length of history, which is expected as an outcome of exposure bias.

Figure 1: (a) wikitext-103: EB-M measurements using different numbers of samples on wikitext-103 data. (b) EMNLP-news: EB-M comparison for MLE and non-MLE training on EMNLP-news data. For each training method, we show the corpus-BLEU (Yu et al., 2016) measurement using the test-set as reference set in the legend.

However, do the EB-M measurements really indicate the significance of exposure bias? Not really. The problem is that the distortion of the model’s marginal under its own history is affected not only by the presumably existing exposure bias, but also by the mismatch between the model’s history distribution and the data’s history distribution, which grows with the length of the history. Therefore, even if the measured EB-M is significantly larger than one, we can not conclude that exposure bias is the major reason. We provide an example to illustrate this argument:

Example 1.

Suppose , and . and are crafted as follows: ; And .

In Example 1, and , which gives . However, the only problem the model has is the mismatch between its history distribution and that of the data.

The next set of experiments also suggests that EB-M may not precisely reflect exposure bias. On the EMNLP-news data-set, we compare EB-M measurements for several non-MLE training methods with the baseline MLE model. We include results for Scheduled Sampling (SS) (Bengio et al., 2015), Cooperative Training (CoT) (Lu et al., 2018), and Adversarial Ranking (RankGAN) (Lin et al., 2017). We provide implementation details for non-MLE methods in Appendix B. Intuitively, these methods will cause the model to be biased to behave well with model samples as history, instead of data samples. Therefore, if exposure bias were significant, we would expect the EB-M measurements for non-MLE trained models to be smaller than those for MLE trained models. However, the results in Figure 1(b) show that the measurements for different training frameworks are almost the same. Hence we believe the EB-M measurements mainly reflect the mismatch between the history distributions.

What if exposure bias exactly refers to this mismatch between the model distribution and the data distribution? If that is the case, then this mismatch is inevitable for any imperfect model, and non-MLE training algorithms can not solve it. We believe a better, more precise definition is needed to discriminate exposure bias from this trivial mismatch. Motivated by this view, we propose a second approach in the section below.
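The point that a history mismatch alone can inflate the marginal measurement can be checked exactly on a two-word toy model. The numbers below are hypothetical and chosen only for illustration (they are not claimed to be those of Example 1): the model's conditional distribution is identical to the data's, and only the first-word distribution differs.

```python
from fractions import Fraction as F

def marginal_next_word(first_word, conditional):
    """Exact marginal of the second word given a first-word distribution."""
    vocab = range(len(first_word))
    return [sum(first_word[h] * conditional[h][w] for h in vocab) for w in vocab]

def tv_distance(p, q):
    """Total variation distance, kept exact by using fractions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

# Hypothetical numbers: words are indexed 0 and 1.
data_first = [F(9, 10), F(1, 10)]          # data distribution of the first word
model_first = [F(8, 10), F(2, 10)]         # model distribution of the first word
shared_conditional = [[F(9, 10), F(1, 10)],  # next-word distribution after word 0
                      [F(5, 10), F(5, 10)]]  # next-word distribution after word 1

p_dd = marginal_next_word(data_first, shared_conditional)   # data history, data conditional
p_md = marginal_next_word(data_first, shared_conditional)   # data history, model conditional
p_mm = marginal_next_word(model_first, shared_conditional)  # model history, model conditional

mgd_model_history = tv_distance(p_mm, p_dd)
mgd_data_history = tv_distance(p_md, p_dd)

if __name__ == "__main__":
    print(mgd_model_history, mgd_data_history)  # prints: 1/25 0
```

Here the deviation under data history is exactly zero while the deviation under model history is positive, so the ratio is unbounded even though the model's conditional predictions are perfect.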

5 A Quantification Approach using Conditional Distribution

5.1 Method

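Following the discussion in the last section, we wish our measurement to be independent of the quality of the history distribution. In light of that, we design a quantity to measure the model’s conditional generation quality. Let $P_H \in \{P_M, P_D\}$ denote the history distribution (model or data) as in the MGD definition (5). With history length $l$ fixed, we define the conditional generation deviation (CGD) with history distribution $P_H$ for the model $P_M$ using metric $d$ as:

$$\mathrm{CGD}(P_{M|H}, l, d) = \mathbb{E}_{W_{1:l} \sim P_H}\left[ d\left(P_M(\cdot \mid W_{1:l}),\, P_D(\cdot \mid W_{1:l})\right) \right] \tag{7}$$

where we assume that $P_D(\cdot \mid W_{1:l})$ is computable, and use it to measure the quality of the model’s conditional distribution. For the choice of the distribution distance $d$, in addition to total variation distance $d_{TV}$ and Jensen-Shannon divergence $d_{JS}$, we introduce greedy decoding divergence ($d_{GD}$), which is defined as:

$$d_{GD}(P, Q) = \mathbb{1}\left[\arg\max_{w} P(w) \neq \arg\max_{w} Q(w)\right] \tag{8}$$

where $\mathbb{1}$ is the indicator function, and $P$ and $Q$ are distributions on the vocabulary. We design $d_{GD}$ to reflect the model’s accuracy during greedy decoding ($d_{GD}$ qualifies as a pseudometric in mathematics). Similar to MGD, exposure bias should imply a significant gap between $\mathrm{CGD}(P_{M|M}, l, d)$ and $\mathrm{CGD}(P_{M|D}, l, d)$. We again define the rate of exposure bias at history length $l$ with metric $d$ to be:

$$\text{EB-C}(M, l, d) = \frac{\mathrm{CGD}(P_{M|M}, l, d)}{\mathrm{CGD}(P_{M|D}, l, d)} \tag{9}$$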
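A minimal sketch of the conditional measurement follows. It assumes that both the model and the data distribution are available as hypothetical callables mapping a history to a next-word distribution (which is only possible in a synthetic setting), and it takes the greedy decoding divergence to be an indicator that the two arg-max words differ, which is our reading of the description above. Ties in the arg-max are broken by the lowest word index, an arbitrary choice.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def greedy_decoding_divergence(p, q):
    """1.0 if greedy decoding under p and q picks different words, else 0.0."""
    best_p = max(range(len(p)), key=lambda w: p[w])
    best_q = max(range(len(q)), key=lambda w: q[w])
    return 0.0 if best_p == best_q else 1.0

def cgd(model_cond, data_cond, histories, d):
    """Average distance between model and data conditionals over sampled histories."""
    return sum(d(model_cond(h), data_cond(h)) for h in histories) / len(histories)

def eb_c(model_cond, data_cond, model_histories, data_histories, d):
    """Ratio of CGD under model-generated history to CGD under data history."""
    return (cgd(model_cond, data_cond, model_histories, d)
            / cgd(model_cond, data_cond, data_histories, d))

# Hypothetical toy conditionals over a two-word vocabulary.
def toy_data_cond(history):
    return [0.6, 0.4]

def toy_model_cond(history):
    # The model is accurate after word 0 and inaccurate after word 1.
    return [0.6, 0.4] if history[-1] == 0 else [0.2, 0.8]

# Hypothetical history samples: the model visits word 1 more often than the data does.
toy_model_histories = [(0,), (1,), (1,), (1,)]
toy_data_histories = [(0,), (0,), (0,), (1,)]

if __name__ == "__main__":
    for d in (tv_distance, greedy_decoding_divergence):
        print(d.__name__, eb_c(toy_model_cond, toy_data_cond,
                               toy_model_histories, toy_data_histories, d))
```

In this toy case the model's own samples land in the region where its conditionals are poor three times as often as data samples do, so the ratio is about 3 under both metrics.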
5.2 Experiments

Since CGD requires inference for the ground-truth data distribution $P_D$, we only consider experiments in a synthetic setting (we will release code to reproduce our results in the published version of this paper). In text-GAN literature (Yu et al., 2016), a randomly-initialized one-layer LSTM model with hidden dimension of 32 is widely used as the data distribution in synthetic experiments. However, that model is small-scale and it does not reflect any structure existing in real-world text. In this work, we take the MLE baseline model trained on EMNLP-news data (described in Section 4.2) as the data distribution $P_D$ in this synthetic setting, and refer to it as the data model.

We then train two LSTM LMs (each playing the role of the model distribution $P_M$) with different capacities using samples from the data model, with the standard MLE objective. One is a one-layer LSTM with hidden width of 512 (denoted as LSTM-512), the other has hidden width of 32 (denoted as LSTM-32). We train for 100 epochs using the Adam optimizer with learning rate 0.001. In each epoch, 250k sentences (the same as the size of the original EMNLP-news data) of length $L$ are sampled from the data model and used as training data. We do this to avoid over-fitting. We show perplexity (PPL) results of the trained models in Appendix D. Finally, EB-C is calculated using 100k samples from $P_M$ and $P_D$ (we show in Appendix C that 100k samples give stable measurements).

Figure 2: EB-C measurements for the (a) LSTM-32 and (b) LSTM-512 models with different metrics. The average value of EB-C over all history lengths is also shown in the legend.

In Figure 2, we show EB-C measurements with different metrics, and the two models give similar results. EB-C shows a steady but slow increasing trend as history length increases. This is expected as a consequence of exposure bias, because the model’s history distribution deviates farther from the data’s history distribution as history length increases. Further, the average value of EB-C is less than 1.03 (the largest average value occurs in the LSTM-512 experiment), meaning that the gap between the CGD under model history and the CGD under data history is not large. Also, note that in most NLG applications (such as machine translation or image captioning), the generated sequence is typically short (fewer than 20 words). In that range of history length, the EB-C measurements show minimal influence of exposure bias.

Figure 3: CGD measurements with corrupted data samples as history, for the LSTM-512 synthetic experiment.

To dive deeper into the cause of the gap in CGD, we experiment with corrupted versions of the data distribution as history distribution. We first specify a corrupt rate, and randomly substitute each word in a history sample from the data distribution with a “noise” word drawn uniformly from the vocabulary with that probability. In this way, a larger corrupt rate will cause the history distribution to deviate farther from the ground-truth data distribution. In Figure 3, we show CGD measurements with the corrupted data samples as history. Large gaps in CGD are observed once the history is corrupted. So, we deduce that the reason for the small gap between the CGD under model history and the CGD under data history is that the deviation between the model’s and the data’s history distributions is not large enough: the model has learned a “good enough” distribution that is able to keep it in the well-behaving region during sampling.

With these observations, we conclude that, in the synthetic setting considered, exposure bias does exist, but is much less serious than it is presumed to be. The key reason is as follows. Although there exists a mismatch between the history distributions of the model and the data, the mismatch is still in the model’s “comfortable zone”. In other words, the LSTM LM is more robust than exposure bias claims it to be. To make this argument concrete, we provide an example LM that has a large EB-C measurement to facilitate a better understanding.

Example 2.

Again suppose , and , the ground-truth data distribution is uniform on . is crafted as follows: . Note that the model behaves bad when , which is of high probability during sampling.

In Example 2, and , so . However, this crafted model is unlikely to be an outcome of MLE training. The fact that is better modeled indicates that in the training data more sentences begin with than . So MLE training should assign more probability to , not the other way around. (If we change to , then will be 0.2, meaning that the model has better conditional generation performance during sampling.)
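The history corruption used for the Figure 3 experiment can be sketched as below. This is an illustrative reading of the procedure rather than the authors' code: words are integer ids, and `corrupt_history` and its parameter names are hypothetical. Note that the uniformly drawn noise word may coincide with the original word.

```python
import random

def corrupt_history(history, corrupt_rate, vocab_size, rng):
    """Replace each word, independently with probability `corrupt_rate`,
    by a word drawn uniformly from the vocabulary."""
    return tuple(
        rng.randrange(vocab_size) if rng.random() < corrupt_rate else w
        for w in history
    )

if __name__ == "__main__":
    rng = random.Random(0)
    clean = (3, 1, 4, 1, 5, 9, 2, 6)
    for rate in (0.0, 0.3, 1.0):
        print(rate, corrupt_history(clean, rate, vocab_size=10, rng=rng))
```

Feeding such corrupted histories to the conditional measurement in place of clean data histories gives a history distribution whose distance from the data grows with the corrupt rate.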

Figure 4: (a) EB-C measurements comparing non-MLE methods in the LSTM-512 synthetic experiment. (b) EB-C measurements comparing RankGAN and MLE in the synthetic experiment.

Finally, we use EB-C to compare MLE and non-MLE training. We compare MLE against CoT, SS, and RankGAN in the synthetic experiments, and results are shown in Figure 4. Note that the RankGAN experiments are conducted in the randomly-initialized LSTM-32 data-model setting described in Section 5.2, as we find it hard to do a fast implementation of RankGAN for the LSTM-512 setting. (The MLE model is used as the pre-trained model for the RankGAN generator. The MLE model has an oracle NLL of 8.67, and RankGAN’s oracle NLL is 8.55.) We find that RankGAN and CoT give lower EB-C measurements than MLE, which is expected, as these methods avoid teacher forcing. However, SS gives worse EB-C measurements compared to MLE, for which we currently do not have a good explanation. We refer readers to Huszár (2015) for a discussion about the SS objective. Also, note that most non-MLE methods still rely on MLE training in some way (e.g. for pre-training).

6 Discussion

Is MLE training really biased? We believe the answer is not conclusive. Note that the MLE objective (1) can be re-written as:

$$\arg\min_{\theta} \mathbb{E}_{W \sim P_D}\left[-\log P_M(W)\right] = \arg\min_{\theta} \mathbb{E}_{W \sim P_D}\left[\log \frac{P_D(W)}{P_M(W)}\right] = \arg\min_{\theta} D_{KL}(P_D \,\|\, P_M) \tag{10}$$

where $P_D$ is the data distribution, $P_M$ is the model distribution, $D_{KL}$ denotes the Kullback–Leibler divergence, and $\theta$ denotes the trainable parameters in $P_M$. Therefore, MLE training is minimizing the divergence of $P_M$, which is exactly the model’s sampling distribution, from $P_D$. While it’s true that the training is “exposed” to data samples, we can not simply deduce that the objective is “biased”.

We want to end our discussion with two remarks. Firstly, the proposed quantification approaches should not be used as the only metric for NLG. For example, a position-aware uni-gram LM, which generates words independent of previous context, has no exposure bias problem and can pass our test easily. Further, the intention of this work is not to discourage researchers from exploring non-MLE training algorithms for LM. It is completely possible that a training objective other than $D_{KL}(P_D \,\|\, P_M)$, such as the Jensen-Shannon divergence between $P_D$ and $P_M$, can lead to better generation performance (Lu et al., 2018; Huszár, 2015). However, though non-MLE algorithms avoid teacher forcing, these algorithms (using GAN or RL for example) are usually much more difficult to tune. Given that the quantified measurement of exposure bias is insignificant, we think it should be questioned whether adopting these techniques to avoid exposure bias is a wise trade-off.
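The identity behind this rewriting is that the expected negative log-likelihood equals the KL divergence plus the entropy of the data distribution, and the entropy does not depend on the model. A minimal numeric check on a made-up three-word distribution (all numbers and names are hypothetical):

```python
import math

def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(a * math.log(a) for a in p if a > 0)

def cross_entropy(p, q):
    """Expected negative log-likelihood of q under samples from p."""
    return -sum(a * math.log(b) for a, b in zip(p, q) if a > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p_data = [0.7, 0.2, 0.1]
candidates = {
    "model_a": [0.6, 0.3, 0.1],
    "model_b": [0.3, 0.3, 0.4],
}

if __name__ == "__main__":
    for name, q in candidates.items():
        nll = cross_entropy(p_data, q)
        kl = kl_divergence(p_data, q)
        # nll - kl is the same constant (the data entropy) for every model.
        print(name, round(nll, 6), round(kl, 6), round(nll - kl, 6))
```

Because the gap between the two quantities is constant across models, ranking candidate models by expected negative log-likelihood is the same as ranking them by KL divergence from the data distribution.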

7 Related Works

Several recent works attempt to carefully evaluate whether non-MLE training methods (e.g. adversarial training) can give better NLG performance than standard MLE training for RNN LM. Caccia et al. (2018) tune a “temperature” parameter in the softmax output, and evaluate models over the whole quality-diversity spectrum. Semeniuta et al. (2018) propose to use “Reverse Language Model score” or “Frechet InferSent Distance” to evaluate the model’s generation performance. Tevet et al. (2018) propose a method for approximating a distribution over tokens from a GAN, and then evaluate the model with standard LM metrics. These works all arrive at a similar conclusion: text GANs are not convincingly better than, and are sometimes even worse than, standard MLE training. So, to some extent, these works imply that exposure bias may not be a serious problem in MLE training.

8 Conclusion

In this work, we explore two intuitive approaches to quantify the significance of exposure bias for LM training. The first approach, relying on the marginal generation distribution, suggests some ambiguity in the current definition of exposure bias. Hence we argue that we should focus on the model’s generation performance in terms of its conditional distribution and propose a second quantification approach. However, according to our measurements in a synthetic setting, there is only around a 3% performance gap between the training and testing environments. In particular, exposure bias has only a minimal effect when the history is short. These results indicate that the exposure bias problem might not be as serious as it is currently assumed to be.


We thank Hongyin Luo, Yonatan Belinkov, Hao Tang and Jianqiao Yang for useful discussions. We also want to thank authors of Santurkar et al. (2018), which this work takes inspiration from.


Appendix A Manual Sample Inspection

In Table 3, we provide more samples of an MLE-trained model when fed with different kinds of history.

Appendix B Implementation of SS, CoT, and RankGAN

We implement our MLE baseline and scheduled sampling (SS) in PyTorch. For SS, we use a linear decay schedule to move from complete teacher forcing to a fixed final replace-sample rate. We find that a larger rate gives worse performance. For CoT, we use a PyTorch implementation in https://github.com/pclucas14/GansFallingShort. We use a mediator model that has twice the size of the generator. We set M-step to be 4, and G-step to be 1. For RankGAN, we use a TensorFlow implementation in https://github.com/desire2020/RankGAN. Note that in our non-MLE experiments, the generator model is set to be the same size as the baseline MLE model. We tune the non-MLE methods using the corpus-BLEU metric, which is widely used in text GAN literature.
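As an illustration of the scheduled sampling schedule described above, the sketch below moves linearly from complete teacher forcing (replace rate 0) to a final replace-sample rate. The function names, the per-step granularity, and the value `final_rate=0.1` in the demo are all hypothetical choices made for this example and are not taken from the experiments.

```python
import random

def replace_sample_rate(step, total_steps, final_rate):
    """Linearly increase the replace-sample rate from 0 to `final_rate`,
    then hold it constant after `total_steps`."""
    progress = min(1.0, step / total_steps)
    return final_rate * progress

def choose_input_word(gold_word, sampled_word, rate, rng):
    """With probability `rate`, feed the model's own sample instead of the gold word."""
    return sampled_word if rng.random() < rate else gold_word

if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 1000, 2000):
        rate = replace_sample_rate(step, total_steps=1000, final_rate=0.1)
        print(step, rate, choose_input_word("gold", "sampled", rate, rng))
```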

Appendix C Auxiliary Plots

In Figure 5, we show that we are able to get stable measurements of EB-C with 100k samples for the LSTM-512 synthetic experiment.

Figure 5: EB-C measurements with different number of samples for the LSTM-512 synthetic experiment.

In Figure 6 and Figure 7 we provide EB-M measurements with the alternative metric mentioned in Section 4.2; the results are similar to those shown in Figure 1.

Figure 6: EB-M measurements using different numbers of samples on wikitext-103 data.
Figure 7: EB-M comparison for MLE and non-MLE training on EMNLP-news data. For each training method, we show the corpus-BLEU measurement using the test-set as reference set in the legend.

Appendix D Perplexity of the Trained Models

We show PPL results for models trained on the EMNLP-news data-set in Table 2. The MLE model for the wikitext-103 data-set discussed in Section 4.2 has PPL 84.58. Note that due to our special setting (we only keep sentences of length longer than $L$, and for wikitext-103, only half of the training data is used), our PPL results are not directly comparable to state-of-the-art LM results on these data-sets.

Model                         PPL
MLE Baseline                  55.85
LSTM-512 (MLE, synthetic)     115.3
LSTM-32 (MLE, synthetic)      156.3
CoT-512 (synthetic)           115.6
SS-512 (synthetic)            113.7
CoT                           56.83
RankGAN                       53.43
SS                            56.43
Table 2: PPL results for models trained on the EMNLP-news data-set.

Model Samples as History Model Samples
it was only a pieces that had gone up to the forest and forces the shoppers about their chronic young
i mean we didn ’ t know what i haven ’ t considered through , " she told bbc radio
if he were the president - elect , he was known that he would run a force in business at
but these are not as tired of " the same message that the harry actor does have been hours in
first opinion the agent have taken four seconds , or if they don ’ t only know anything , were
" the economy of the uk is low enough of people of defending where americans think that " brexit ,
the economy grew on 1 . 6 % since the us voted , and when it turned around 200 streets
i was able to produce on my own , which is good ; now that the theatre i ’ ve
" i ’ ve not buying boys i addressed many nervous times before , as a teenager made me is
we think about one - third of the struggles we actually want to see those very well that even more
the story of a album - which made public - was still fantastic , and for the second time in
" the test comes up before tuesday and when we ’ re feeling ahead again soon , " she posted
a year on when he was last seen in his home and he did not see him , his suffering
brady has forced the 9 - known targets to get all - of - 12 gun migration and performing communication
i asked if he himself did , i managed to show all my charges at all , it used to
Data Samples as History Model Samples
what this group does is to take down various different players in the future and we play in paris we
over 1 , 600 a day have reached greece this gone in 2013 and it planned to allow civilians on
" we ’ re working through a legacy period , and i am proud of the experience of the worker
’ the first time anyone says you need help , you don ’ t have put accurate press into the
out of those who came last year , 69 per cent of women can really take the drive to avoid
he has not played for tottenham ’ s first team this season then and sits down 15 - 0 with
so you have this man who seems to represent this bad story , which he plays minutes – because he
cnn : you made that promise , but it wasn ’ t necessarily at all the features he had in
this is a part of the population that is unk lucky to have no fault today , and it would
they picked him off three times and kept him out of the game and was in the field , the
the treatment was going to cost $ 12 , 000 as a result of the request of anyone who was
but if black political power is so important , why doesn ’ t we becomes the case that either stands
local media reported the group were not looking to hurt the animals , but would never be seen to say
Random Sequences as History Model Samples
…RANDOM… big winter deserve , but they just say it your things goes wrong
…RANDOM… playoff north realise at its lowest level , improving their understanding in danger
…RANDOM… vital childhood registration , not previously planned for <unk> to each and reduced
…RANDOM… treated ship find one as an actual three points contained at a time
…RANDOM… faith five crazy schools and could give them a " sleep " necessary
…RANDOM… domestic jason follows a 12 - year cruise line over the christmas track
…RANDOM… ownership generous tourist accounts for more than 1 per cent every month -
…RANDOM… spending raped since the file returns in january , joining groups of foreign
…RANDOM… netflix worker four centre - and said facebook text <unk> to see how
…RANDOM… race labor witnessed is great , with more to an active the <unk>
…RANDOM… treatments airlines hidden real - time out to sell on benefits to our
…RANDOM… intention short reflects showing the nature of flying in his space rather than
…RANDOM… conversation pace motion them further , but as late as they ’ ve
…RANDOM… export feb president obama agreements with president obama and her being on trump
…RANDOM… entering pocket hill and made it later in the united states and make
Table 3: More samples of an MLE-trained model when fed with different kinds of history. To save space, we omitted the first 7 words of the random history.