1 Introduction
Recently, pretrained language models (PLMs), e.g., GPT-2 (Radford et al., 2019), have shown great promise in many applications of natural language generation, such as stylized text generation (Syed et al., 2019) and dialog systems (Wolf et al., 2019). A PLM is obtained by first pretraining on large-scale raw sentences (usually a general-domain corpus), and is then used in downstream tasks by finetuning on task-specific datasets (usually from specific domains). For example, given a pretrained GPT-2 model, to generate sentences in the email domain, we usually need to finetune GPT-2 on a small email-domain corpus.
However, we argue that finetuning a PLM on a specific-domain dataset is not necessarily the best way to obtain the desired sentence outputs, especially when the finetuning dataset is small. Typically, finetuning is conducted through Maximum Likelihood Estimation (MLE), with which the resulting model distribution is asymptotically consistent with the true distribution when the finetuning dataset contains infinitely many samples. This is not the case when finetuning on small datasets, which often leads to a mismatch between the real and model distributions.
Specifically, MLE minimizes the Kullback–Leibler (KL) divergence between the model distribution $q$ and the true distribution $p$. Theis et al. (2016) point out that minimizing KL avoids assigning an extremely small probability to any data point but assigns a lot of probability mass to non-data regions, which leads to a gap between $p$ and $q$. Additionally, simple data patterns in the finetuning dataset can easily be memorized and overestimated, while complex ones may be underestimated. This problem is not severe when data samples are adequate, but becomes nontrivial when the finetuning dataset is not large enough (see Figure 1).
To address this over- and underestimation problem, in this paper we propose MCTailor, which tailors the density of the model distribution by moving probability mass from overestimated zones to underestimated zones, leading to a more realistic model distribution after finetuning. Concretely, MCTailor consists of two components: a ratio estimator, which distinguishes over- and underestimated regions of the model distribution; and an Early Rejection Sampling (ERS) component, which tailors (reassigns) probability mass and efficiently obtains sampled sentences from the model distribution. Note that the proposed ERS is inspired by Sequential Monte Carlo (SMC; Doucet et al., 2000), but avoids the degeneration problem of SMC, as it directly kills samples rather than performing resampling.
We conduct experiments on various datasets to verify the effectiveness of the proposed MCTailor. Empirical results show that MCTailor generates significantly better samples than finetuning, and that the resulting model distributions are closer to the real data distributions.
2 PreTrained Language Model
Language models generally estimate the density of sentences in real text in an autoregressive style:

$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$,   (1)
where $x = (x_1, \dots, x_T)$ is a sentence of length $T$. Recently, with extremely large numbers of parameters, pretrained language models like GPT-2 (Radford et al., 2019) and Transformer-XL (Dai et al., 2019) have shown great promise in text generation. PLMs are first trained on a huge general-domain dataset and then finetuned on specific-domain datasets of different downstream tasks.
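As a toy illustration of Eq. (1), the sketch below scores a sentence under a hand-written bigram model; the transition table is invented for illustration and has nothing to do with GPT-2's actual parameters.

```python
import math

# Invented character-level bigram model: p(x) = prod_t p(x_t | x_{t-1}).
TRANS = {
    ("<s>", "a"): 0.6, ("<s>", "b"): 0.4,
    ("a", "a"): 0.2, ("a", "b"): 0.5, ("a", "</s>"): 0.3,
    ("b", "a"): 0.4, ("b", "b"): 0.1, ("b", "</s>"): 0.5,
}

def sentence_log_prob(tokens):
    """Autoregressive log-probability of a whole sentence, as in Eq. (1)."""
    lp = 0.0
    prev = "<s>"
    for tok in tokens + ["</s>"]:
        lp += math.log(TRANS[(prev, tok)])
        prev = tok
    return lp

# log(0.6 * 0.5 * 0.5) = log(0.15)
print(round(sentence_log_prob(["a", "b"]), 4))  # → -1.8971
```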
Specifically, given a pretrained GPT-2 model, to generate sentences in the email domain, we usually need to finetune GPT-2 on a small email-domain corpus. PLMs also have other important applications: Miao et al. (2019) use finetuned language models for constrained text generation, and Wolf et al. (2019) finetune GPT-2 on a dialog dataset to boost the performance of a dialog system.
However, as stated in the Introduction, directly finetuning a PLM on a small dataset may lead to the mismatch problem, namely the over- and underestimation problem between the true distribution and the model distribution. In the next section, we propose a new method to alleviate this problem.
3 Proposed MCTailor
To mitigate the above shortcomings of finetuning, we propose MCTailor, which generates samples from a modified sample distribution. MCTailor is composed of a ratio estimator, which detects over- and underestimated regions of the model distribution, and the Early Rejection Sampling (ERS) algorithm, which accelerates sampling while ensuring sample quality.
3.1 Ratio Estimator
A ratio estimator is a common technique for measuring the gap between two related distributions (Yuxuan et al., 2020). In this work, we apply a ratio estimator to estimate $r(x) = q(x)/p(x)$, the ratio of the probability of a sentence $x$ under the finetuned model distribution $q$ to that under the true distribution $p$. To tailor the probability from a finetuned PLM, we cut the probabilities of overestimated samples. Specifically, when $r(x) > 1$, i.e., the model overestimates the probability of sample $x$, we remove $x$ with probability $1 - 1/r(x)$ to approximate $p$. After normalization, the probabilities of underestimated areas increase correspondingly. The resulting new distribution is $\tilde{q}(x) \propto q(x)\min(1, 1/r(x))$. In this work, we try several different structures of ratio estimators.
Convolutional Ratio Estimator. Since ratio estimation shares similar properties with classification problems, and convolutional neural networks (CNNs) are powerful classifiers, our first thought is to build a CNN-based ratio estimator. Concretely, we use a two-layer CNN $D(x)$ to predict whether a sentence $x$ is from the true distribution $p$ or the learned distribution $q$. Trained with the cross-entropy loss, the classifier converges to

$D(x) = \frac{p(x)}{p(x) + q(x)}$.   (2)

Naturally, we define

$r(x) = \frac{q(x)}{p(x)} = \frac{1 - D(x)}{D(x)}$.   (3)
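The classifier-to-ratio conversion can be illustrated on a toy pair of discrete distributions; here the Bayes-optimal classifier is written in closed form rather than trained with cross-entropy, and all probability values are invented.

```python
# Invented discrete distributions standing in for the true distribution p
# and the finetuned model q.
p = {"simple": 0.2, "complex": 0.8}   # true distribution
q = {"simple": 0.6, "complex": 0.4}   # model overestimates the simple pattern

def bayes_optimal_d(x):
    # With balanced real/generated training data, a cross-entropy-trained
    # classifier converges to p(x) / (p(x) + q(x)).
    return p[x] / (p[x] + q[x])

def ratio(x):
    # r(x) = q(x) / p(x), recovered from the classifier output.
    d = bayes_optimal_d(x)
    return (1 - d) / d

# r > 1 where q overestimates, r < 1 where it underestimates.
print(round(ratio("simple"), 2), round(ratio("complex"), 2))
```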
Dual Ratio Estimator. Though the basic convolutional ratio estimator is easy to apply, it makes sampling inefficient. For most sentences, we can roughly predict from the first few words whether they belong to a specific domain or suffer from overestimation. However, $r(x)$ can only be computed after a full sentence is generated, so massive computing resources are wasted on generating unpromising samples.
To determine whether a prefix $x_{1:t}$ is promising, we can estimate

$r_{\min}(x_{1:t}) = \min_{x \,:\, x_{1:t} \text{ is a prefix of } x} r(x)$,   (4)

the minimum ratio over all sentences with prefix $x_{1:t}$. If $r_{\min}(x_{1:t})$ is greater than a predefined threshold, all sentences with prefix $x_{1:t}$ should be rejected, so we do not need to waste time continuing to sample them.
But if we directly train an estimator to distinguish prefixes drawn from $p$ and $q$, we will end up estimating the average value of $r(x)$ over all sentences with prefix $x_{1:t}$, rather than the minimum. If so, some sentences with low $r(x)$ will be erroneously rejected. Luckily, the properties of the min–max dual shed some light on this problem. We first define $\tilde{r}(x_{1:t})$ as the dual form of $r_{\min}(x_{1:t})$. Under some weak conditions, we can prove that if $\tilde{r}$ approximates its dual objective, then $\tilde{r}(x_{1:t})$ approximates $\min r(x)$ over sentences $x$ with prefix $x_{1:t}$. Similar to the training of $D$, we train $\tilde{r}$ by distinguishing prefixes of real sentences from prefixes of generated ones; since the dual objective is a function of $\tilde{r}$, we can obtain a proper set of parameters for it.
Hierarchical Ratio Estimator. Since a single ratio estimator may not be powerful enough to accurately estimate $r(x)$, we break the workload down into several estimators $r_1, \dots, r_k$ in the spirit of boosting. We first train $r_1$ to estimate $q(x)/p(x)$ and obtain a first tailored distribution $q_1$. We then use $r_2$ to estimate the gap between $q_1$ and $p$, and so on. With the collaboration of $r_1, \dots, r_k$, we get a more accurate overall ratio estimate. Using hierarchical ratio estimators also avoids a single but complicated ratio estimator, which is prone to overfitting. Similarly, we can add hierarchy to the dual ratio estimator to obtain a hierarchical dual ratio estimator.
3.2 Efficient Sampling
In this part, we introduce our specially designed Early Rejection Sampling (ERS) algorithm for MCTailor. Building on Sequential Monte Carlo, ERS can efficiently generate samples with high diversity.
Rejection Sampling. Applying RS, we first generate a batch of samples from the model distribution $q$ and then accept each sample $x$ with probability $\min(1, 1/r(x))$. However, RS is very inefficient in actual use, since it rejects samples only after they are fully generated. As shown in Figure 1(a), a lot of computation is wasted on ultimately rejected samples.
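A minimal rejection-sampling sketch on an invented two-outcome toy model: samples from $q$ are accepted with probability $\min(1, 1/r(x))$, and the accepted samples follow the tailored distribution.

```python
import random

random.seed(0)

p = {"simple": 0.2, "complex": 0.8}   # true distribution (invented)
q = {"simple": 0.6, "complex": 0.4}   # model overestimating "simple" (invented)
r = {x: q[x] / p[x] for x in q}       # density ratio q(x) / p(x)

def rejection_sample():
    # Draw full "sentences" from q; accept with probability min(1, 1/r(x)),
    # which cuts probability mass from the overestimated region.
    while True:
        x = random.choices(list(q), weights=list(q.values()))[0]
        if random.random() < min(1.0, 1.0 / r[x]):
            return x

counts = {"simple": 0, "complex": 0}
for _ in range(20000):
    counts[rejection_sample()] += 1

# Tailored mass ∝ q(x) * min(1, 1/r(x)): 0.2 vs 0.4, i.e. 1/3 vs 2/3.
print(round(counts["simple"] / 20000, 2))
```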
Sequential Monte Carlo. Instead of rejecting samples at the end of sampling, SMC performs resampling at each step, with the unnormalized resampling weight at step $t$ given by the prefix-level ratio estimates, $w_t = \tilde{r}(x_{1:t-1}) / \tilde{r}(x_{1:t})$, leading to an asymptotically unbiased estimator. However, SMC suffers from a serious degeneracy problem: samples from SMC tend to share a very small number of ancestors, because most ancestors are killed during resampling. As a result, the sample diversity of SMC is critically low.
Early Rejection Sampling. To overcome the degeneracy problem of SMC and increase sample diversity, we propose the Early Rejection Sampling (ERS) algorithm. ERS first uniformly samples a real number $u \in (0,1)$ for each particle. After step $t$, if $u > 1/\tilde{r}(x_{1:t})$, the particle is killed immediately and its computation resources are released to parallel threads. The main difference between ERS and RS is that ERS kills unpromising particles before they are fully generated. But unlike SMC, there is no correlation between ERS samples, resulting in higher sample diversity.
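The ERS procedure can be sketched on an invented two-token toy model. Here the prefix bound (the minimum ratio over all completions, Eq. 4) is computed exactly by enumeration, whereas the paper estimates it with the dual ratio estimator; all numbers are invented.

```python
import random

random.seed(1)

# Toy two-token language over {a, b}: per-step model probabilities and
# per-sentence density ratios r(x), all invented for illustration.
q_step = {(): {"a": 0.5, "b": 0.5},
          ("a",): {"a": 0.8, "b": 0.2},
          ("b",): {"a": 0.3, "b": 0.7}}
r = {("a", "a"): 4.0, ("a", "b"): 2.0, ("b", "a"): 1.0, ("b", "b"): 0.5}

def r_min(prefix):
    # Minimum ratio over all completions of this prefix (here exact; the
    # paper approximates it with the dual ratio estimator).
    return min(v for k, v in r.items() if k[:len(prefix)] == prefix)

def ers_sample():
    while True:
        u = random.random()              # one uniform number per particle
        prefix = ()
        dead = False
        for _ in range(2):
            step = q_step[prefix]
            tok = random.choices(list(step), weights=list(step.values()))[0]
            prefix = prefix + (tok,)
            if u > 1.0 / r_min(prefix):  # no completion can still be accepted
                dead = True              # kill the particle early
                break
        if not dead:                     # surviving ⇔ u ≤ min(1, 1/r(x))
            return prefix

counts = {k: 0 for k in r}
for _ in range(30000):
    counts[ers_sample()] += 1
```

Accepted samples match plain rejection sampling, i.e. they follow the tailored distribution proportional to $q(x)\min(1, 1/r(x))$, but unpromising particles are abandoned after the first token.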
4 Experiments
In this section, we empirically compare the sample quality of our model and baseline models. We describe the experimental setup in Section 4.1 and show results in Section 4.2.
Table 1: RevPPLs of the finetuning baseline and MCTailor on each dataset (lower is better); the last two columns are MCTailor with single-layer and multi-layer ratio estimators, respectively.

Datasets | #Train | Style | Finetune | MCTailor (single) | MCTailor (multi)
Ontonotes-bn | 12k | Broadcast news | 124 | 117 | 111
Ontonotes-bc | 12k | Broadcast dialog | 268 | 144 | 153
Ontonotes-mz | 7k | Magazine | 126 | 112 | 110
Ontonotes-nw | 35k | Newswire | 111 | 110 | 100
Ontonotes-tc | 13k | Telephone dialog | 140 | 136 | 134
Ontonotes-wb | 17k | Web | 166 | 138 | 136
Switchboard | 203k | Formal dialog | 198 | 165 | 169
DailyDialog | 76k | Daily dialog | 120 | 117 | 113
IWSLT16 | 133k | Conference speech | 240 | 217 | 213
4.1 Experimental Setup
We conduct experiments on 9 datasets with different styles and sizes, and use five different metrics, including human evaluation, to measure the generation performance of each method.
Datasets. We use the following data sets for experiments.

Ontonotes (Pradhan et al., 2013) is a multi-genre dataset for sequence annotation. We use sentences from six genres (bn, bc, mz, nw, tc, wb) for the experiment.

Switchboard is a formal dialog dataset, and DailyDialog (Li et al., 2017) is a manually labelled multi-turn dialog dataset in daily style.

IWSLT16 (Cettolo et al., 2016) is a dataset of paired conference speeches for machine translation. We use English sentences from the De-En pairs to test model performance on the special conference-speech domain.
Table 2: NLLs of reference sentences under finetuning and MCTailor.

Refs | Sentence | NLL (Finetune) | NLL (MCTailor)
a | Thank you everyone for watching . | 18.03 | 18.65
b | Yes . | 4.01 | 4.77
c | What does that mean in the context of your book ? | 26.56 | 26.44
d | And it did n’t hurt too badly . | 23.24 | 22.97
Table 3: Samples generated by finetuning and MCTailor.

Finetune | MCTailor
Right . | She should be charged with rape .
In the case if you think of this | And do you still feel that way every day ?
Oh well . | But it would be tough .
I ’ve been there n’t said anything wrong . | He knew about the attack at the Paris offices .
Evaluation Metrics. To evaluate the generation quality and diversity, we use the following metrics.

PPL reflects the average density of test-set samples under a generative model; models with lower PPL have model distributions more similar to real text. Unlike the baseline models, MCTailor only provides an unnormalized log-probability. We estimate the normalization constant of MCTailor by importance sampling and calculate PPL directly from the normalized log-probability.

RevPPL is a good indicator of both sample quality and diversity; it is derived by first training a language model on generated samples and then calculating the PPL of the test set under that language model.

EMD-l is the earth mover's distance between the sentence-length distributions of real and generated data.

EMD-f is the earth mover's distance between the word-frequency distributions of real and generated data.

Human Evaluation Score is added to reflect comprehensive sample quality. We ask 4 volunteers to select a score from {0, 0.5, 1} for each sample according to its fluency and coherence with the target style. In 85% of cases, at least three volunteers give the same score, showing the reliability of the human evaluation.
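For two equal-sized 1-D empirical samples, the earth mover's distance used by EMD-l and EMD-f reduces to averaging the gaps between sorted values. A minimal sketch with invented sentence lengths:

```python
def emd_1d(xs, ys):
    """Earth mover's (Wasserstein-1) distance between two equal-sized 1-D
    empirical samples: sort both and average the pointwise gaps."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

real_lengths = [4, 7, 9, 12, 15]   # invented sentence lengths
gen_lengths = [3, 5, 9, 14, 19]
print(emd_1d(real_lengths, gen_lengths))  # → 1.8
```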
Model Details. In all experiments, we use the released GPT-2 with 117M parameters as the pretrained language model. We first finetune GPT-2 on each dataset and then build our tailor on top of it. Early stopping is applied to avoid overfitting. For ratio estimators, we use simple CNNs with two convolutional layers, where (filter_number, kernel_size) is set to (10, 5) and (5, 5), respectively.
4.2 Experimental Results
RevPPLs of the different models are shown in Table 1. We find that MCTailor significantly reduces RevPPL compared with the finetuning baseline on datasets of different sizes, from Ontonotes-mz with only 7k training samples to the relatively large Switchboard dataset with more than 200k samples. We also notice that multi-layer ratio estimators perform better than single-layer ones, which confirms the point in Section 3.1 that the gap between the model and true distributions is too complex for a single-layer ratio estimator. The sample NLLs of each method (Table 2) further confirm that MCTailor succeeds in decreasing the probabilities of overestimated simple patterns and reallocating them to underestimated samples.
We further compare MCTailor with the baseline model under the other metrics. From Table 4, we find that MCTailor greatly reduces PPL, which means an increased probability of generating samples similar to the test samples. With lower EMD-l and EMD-f, we can also conclude that the sample distributions of MCTailor are closer to the real distributions. Moreover, human evaluation scores of MCTailor are about 10% higher than those of finetuning, which indicates better sample quality to human eyes. The cases shown in Table 3 further demonstrate the advantage of MCTailor in fluency and informativeness. We also experimented with SeqGAN; however, the RevPPLs of GANs are even higher than those of directly finetuning GPT-2, and they are especially difficult to train, so we exclude SeqGAN from the baseline models.
The acceleration effect of ERS is also verified in the experiments. For MCTailor with 1, 2, and 3 layers of ratio estimators, ERS eliminates 30%, 79%, and 90% of the computation wasted on unpromising samples, achieving 1.5x, 2.8x, and 5x accelerations, respectively.
Table 4: Comparison between MCTailor with 3 layers of ratio estimators and direct finetuning. MCT indicates whether our proposed MCTailor is used (✓) or direct finetuning (✗). SB and DD represent the Switchboard and DailyDialog datasets, respectively. By one-tailed t-tests, we find that the improvements in human evaluation scores are significant, with p-values smaller than 0.05.

Data | MCT | PPL | EMD-l | EMD-f | Human
Onto-bn | ✗ | 34.1 | 4.31 | 0.57 | 0.60
Onto-bn | ✓ | 30.1 | 1.90 | 0.53 | 0.81
Onto-bc | ✗ | 30.9 | 6.74 | 0.67 | 0.40
Onto-bc | ✓ | 23.1 | 1.62 | 0.55 | 0.67
Onto-mz | ✗ | 43.4 | 5.60 | 0.69 | 0.71
Onto-mz | ✓ | 39.7 | 3.33 | 0.64 | 0.76
Onto-nw | ✗ | 37.0 | 4.94 | 0.61 | 0.65
Onto-nw | ✓ | 36.1 | 3.66 | 0.54 | 0.70
Onto-tc | ✗ | 24.8 | 4.19 | 0.64 | 0.54
Onto-tc | ✓ | 23.8 | 2.46 | 0.64 | 0.54
Onto-wb | ✗ | 60.9 | 3.31 | 0.61 | 0.46
Onto-wb | ✓ | 52.8 | 2.40 | 0.51 | 0.60
SB | ✗ | 19.7 | 8.75 | 0.60 | 0.48
SB | ✓ | 18.9 | 5.21 | 0.51 | 0.54
DD | ✗ | 30.3 | 5.25 | 0.47 | 0.60
DD | ✓ | 29.1 | 3.32 | 0.45 | 0.62
IWSLT | ✗ | 23.3 | 5.21 | 0.61 | 0.32
IWSLT | ✓ | 20.9 | 2.99 | 0.55 | 0.40
5 Conclusion
In this paper, we propose MCTailor to alleviate the over- and underestimation problem between the true and model distributions. MCTailor is composed of a ratio estimator, which adjusts the probabilities of MLE-finetuned PLMs to approximate the true distribution, and the ERS algorithm, which accelerates sampling while ensuring sample quality. Experiments on various datasets show the effectiveness and efficiency of MCTailor.
References
Cettolo et al. (2016). The IWSLT 2016 evaluation campaign. In IWSLT.
Dai et al. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of ACL.
Doucet et al. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3), 197–208.
Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, draft 13. Technical report.
Li et al. (2017). DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of IJCNLP.
Miao et al. (2019). CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of AAAI.
Pradhan et al. (2013). Towards robust linguistic analysis using OntoNotes. In Proceedings of CoNLL.
Radford et al. (2019). Language models are unsupervised multitask learners.
Syed et al. (2019). Adapting language models for non-parallel author-stylized rewriting. arXiv:1909.09962.
Theis et al. (2016). A note on the evaluation of generative models. In Proceedings of ICLR.
Wolf et al. (2019). TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv:1901.08149.
Yuxuan et al. (2020). Improving maximum likelihood training for text generation with density ratio estimation. In Proceedings of AISTATS.