Natural language generation is a key component of dialogue, translation, summarization, captioning, and many other applications. Text generation has been considered a daunting task for decades because of the symbolic and combinatorial nature of language and the long-range dependencies characterizing natural text. Recent advances (Radford et al., 2019) have however shown that large models trained on large datasets can produce remarkably coherent and fluent text.
The main question we investigate in this work is whether we can automatically discriminate between machine generated and human generated text. This is important for two reasons. First, it has an obvious practical application in the detection of spam and, more generally, machine generated stories. Second, it is a stepping stone towards building better text generation models, as these discriminators could be used to improve the original models used to generate text.
Most current systems are auto-regressive: they factorize the joint distribution of words into a product of conditional distributions in left-to-right order. Although words are generated sequentially, one token (word or sub-word) at a time from left to right, sequences are then scored holistically using heuristics like beam search, which alleviates the greediness of the generation procedure. Importantly, sequence scoring is performed using the same model used to generate the text in the first place.
In this work, we first investigate whether decoupling generation and scoring yields better accuracy for the task of predicting whether a sequence is machine or human generated. Our experiments clearly indicate that learning a separate predictor is a better choice.
Second, we perform an extensive empirical study to assess how well such discriminator generalizes to generators not used at training time. On the bright side, our experiments show that the discriminator is indeed able to generalize to similar generators and even weaker generators trained on similar datasets. However, generalization deteriorates once the generators used at test time are stronger, or when the generators used at test time are trained on out-of-domain datasets.
2 Related Work
Despite recent explorations of alternative generation orders and non-autoregressive decoding (Kaiser et al., 2018; Gu et al., 2019; Stern et al., 2019; Ghazvininejad et al., 2019), the dominant approach to text generation is still based on auto-regressive models, which generate one word at a time from left to right. The basic component of these models is a predictor of the next word in the sequence, conditioned on the previously generated words. Such a conditional distribution can be represented by a count-based n-gram model or by parametric functions such as a recurrent neural network (Graves, 2013; Bahdanau et al., 2014), a convolutional neural network (Gehring et al., 2017), and, more recently, attention-based models like the transformer (Vaswani et al., 2017). Recently, Radford et al. (2019) demonstrated generations remarkable in terms of fluency and coherence by using a large transformer language model trained on a large dataset.
With few exceptions (Zhang et al., 2017), generation using auto-regressive models employs beam search, a heuristic to approximately select the most likely sequence according to the joint distribution defined by the generator. Importantly, such sequence-level scoring aggregates intermediate log-likelihood scores produced by the conditional predictor. In other words, the same scoring model is used both for generating candidates and for selecting the best candidate in the set. In this work, instead, we decouple the two scoring functions and train a dedicated scoring function for the latter task.
Learning scoring functions has a long history and is generally referred to as energy-based modeling (LeCun et al., 2006). In energy-based models (EBMs), the goal of learning is to shape an energy (or scoring) function such that training data points have lower energy than spurious (e.g., noisy) data points. Inference (e.g., a completion task like predicting the next word in a sequence) reduces to finding minima in the energy landscape. EBMs have mostly been applied to vision (Teh et al., 2003; Du and Mordatch, 2019), where search is easier thanks to the continuous nature of the input space. At training time, the energy is decreased at training data points (positive examples) and increased at spurious points (negative examples), which can be found via MCMC sampling (Hinton, 2002), gradient descent (LeCun et al., 2006), etc. Our work can be interpreted as a particular instance of an EBM where negatives are produced by a pre-trained language model rather than by the energy function itself. Learning a generator and a discriminator also relates to Generative Adversarial Networks (Goodfellow et al., 2014), except that in our case the generator is trained beforehand.
Using a separately trained scoring function to evaluate candidates in the beam has been recently proposed in the context of dialogue modeling by Kulikov et al. (2018). There the negatives are randomly chosen next utterances from the training dataset. Here we use samples from a language model to build the set of negatives.
GLTR (Strobelt et al., 2019) is an interactive tool that shows how a large language model scores an input text, which in particular can be used to check whether the text was generated by a language model. While we share the same motivation, this work extends that study by showing that a dedicated discriminator can work better than the original generator's log-likelihood score. Moreover, we provide an empirical assessment of how well such a discriminator generalizes when the type of generator and the training datasets vary, since in practice the discriminator is often trained without access to the actual generator used at test time.
Finally, and concurrently with this work, a training dataset of GPT-2 language model generations was released (https://github.com/openai/gpt-2-output-dataset) for the purpose of training discriminators capable of detecting machine-generated text. While we share the same motivation, our work is a much broader investigation of the topic: we assess the generalization of several discriminator architectures to not just one but several kinds of generators and training datasets.
3 Approach

In this section, we first describe the discriminator architectures we considered; we then explain how we generate negatives and introduce the loss function used to train the discriminator. We conclude with a description of the testing protocol and metrics.
3.1 Learning to Score
Our goal is to build a function $f$ that scores the joint compatibility of an input sequence of tokens $(w_1, \dots, w_n)$. The goal of training is to score golden (human-generated) sequences higher than other sequences. We parameterize the scoring function as a neural network, using the architectures described in §4.4.
At training time, the scoring function is trained using a variant of the ranking loss (Collobert et al., 2011). Let $x^+$ be a positive sample consisting of a sequence of tokens taken from the training set, and let $\{x^-_1, \dots, x^-_k\}$ be a set of “negatives”, each derived from the same positive sequence but containing at least some machine-generated tokens (see details below). The standard ranking loss is:

$$\mathcal{L} = \sum_{i=1}^{k} \max\big(0,\; 1 - f(x^+) + f(x^-_i)\big) \qquad (1)$$
In our work, we slightly modify the above loss by a) replacing the full sum with the single term yielding the largest loss, corresponding to the most offending negative (LeCun et al., 2006), and b) splitting the loss into three parts. The first loss is the same as above, dubbed the holistic loss. The second is the same as above but specialized to suffixes, as it is used only when the generated text appears at the end of the sequence; we dub it the suffix loss. The third specializes to prefixes, as it is used only when the generated text appears at the beginning of the sequence; we dub it the prefix loss. We therefore have three scoring functions, one per loss, computed using three separate top linear layers. Empirically, we found that using these three ranking losses worked better than a single one (see the ablation study in §4.8).
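The modified loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the unit margin, and the scalar-score interface are all assumptions.

```python
def ranking_loss(pos_score, neg_scores, margin=1.0, most_offending=True):
    """Margin ranking loss over one positive and a set of negatives.

    With most_offending=True, only the hardest negative (the one producing
    the largest hinge term) contributes, as in the paper's variant;
    otherwise the standard loss sums over all negatives.
    """
    hinges = [max(0.0, margin - pos_score + s) for s in neg_scores]
    return max(hinges) if most_offending else sum(hinges)
```

Note that the most-offending variant is simply a max in place of the sum, so a well-separated positive (score at least `margin` above every negative) incurs zero loss in both cases.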
3.1.1 Generating Negatives
The most critical component of training an energy-based model is the method used to generate negatives, i.e., inputs that should receive high energy (equivalently, a low score, since they are unlikely). In settings with continuous variables, researchers have suggested MCMC (Teh et al., 2003) or Langevin dynamics (Du and Mordatch, 2019). In this work, instead, we exploit the fact that modern auto-regressive models for text are already quite good, and we use them for negative sampling.
We train two auto-regressive language models: a left-to-right one, used to produce suffixes, and a right-to-left one, used to generate prefixes. Negatives are generated by top-k sampling (Fan et al., 2018) with $k$ equal to 10. Given a trained language model (for instance, a left-to-right auto-regressive model) and a positive example

$$x^+ = (w_1, \dots, w_n), \qquad (2)$$

a negative can be written as:

$$x^- = (w_1, \dots, w_c, \hat{w}_{c+1}, \dots, \hat{w}_n), \qquad (3)$$

where $w_i$ for $i \le c$ are ground-truth words belonging to the common context and $\hat{w}_i$ for $i > c$ are words generated by the language model. In the same way, we can sample a negative with a right-to-left model, yielding:

$$x^- = (\hat{w}_1, \dots, \hat{w}_{n-c}, w_{n-c+1}, \dots, w_n). \qquad (4)$$
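As a concrete illustration, the suffix-negative construction in (3) with top-k sampling can be sketched as below. `next_token_probs` is a hypothetical stand-in for the trained left-to-right language model, and the whole interface is an assumption made for illustration.

```python
import random

def top_k_sample(probs, k=10, rng=random):
    """Sample a token id from the k most probable entries of a distribution."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

def make_suffix_negative(positive, context_len, next_token_probs, k=10):
    """Keep the first `context_len` ground-truth tokens and replace the rest
    with tokens sampled left to right from the language model.

    `next_token_probs(prefix)` returns a distribution over the vocabulary
    given the tokens produced so far.
    """
    negative = list(positive[:context_len])
    while len(negative) < len(positive):
        negative.append(top_k_sample(next_token_probs(negative), k=k))
    return negative
```

A prefix negative, as in (4), would be built symmetrically with a right-to-left model, keeping the last tokens as ground truth.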
4 Experiments

In this section, we first describe the datasets and preprocessing used, and then provide architecture details for both the generators and the scoring functions. We then present the main results of this work and extensively investigate the generalization ability of the scoring functions we considered.
4.1 Scoring Model Evaluation
We evaluate a scoring model in three settings: in-domain, cross-architecture, and cross-corpus. These settings are determined by the corpora used to train the training and testing generators, and by the train and test generator architectures. In each setting, given a testing corpus $C_{test}$ and a testing generative model $G_{test}$, we start by taking a number of positive examples from $C_{test}$ as in (2) and, for each, generate negatives with $G_{test}$ as in (3) or (4). In all experiments we use 10 negatives per positive.

The three settings are distinguished by the choice of $C_{test}$ and the architecture of $G_{test}$ compared to the corpus $C_{train}$ and generator $G_{train}$ used when training the scorer. Note that $C_{train}$ is used first to train the training generator, and then to train the scoring function. In the in-domain setting, $C_{test}$ is $C_{train}$ (except that prefixes at test time come from the test split of the corpus), and $G_{test}$ has the same architecture as $G_{train}$ (but is trained from a different random seed). In the cross-architecture setting, $C_{test}$ is again $C_{train}$, but the architecture of $G_{test}$ differs from that of $G_{train}$. In the cross-corpus setting, $G_{test}$ has the same architecture as $G_{train}$, but $C_{test}$ differs from $C_{train}$: $G_{test}$ is trained on the training split of $C_{test}$, while $G_{train}$ is trained on the training split of $C_{train}$.
In each case, the scoring function is applied to the positive and negative examples, and we measure performance in terms of precision at 1 (P@1): the fraction of test instances in which the ground-truth sequence receives the highest score.
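Concretely, P@1 over a set of (positive, negatives) instances can be computed as below; the interface is a hypothetical sketch, with `score_fn` standing in for any of the scorers described later.

```python
def precision_at_1(score_fn, test_sets):
    """P@1: fraction of test instances in which the ground-truth sequence
    out-scores all of its negatives.

    `test_sets` is an iterable of (positive, negatives) pairs.
    """
    hits = 0
    total = 0
    for positive, negatives in test_sets:
        total += 1
        if score_fn(positive) > max(score_fn(x) for x in negatives):
            hits += 1
    return hits / total
```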
4.2 Datasets

We train models on three corpora, whose statistics are reported in Appendix Tab. 7:
Books: The Toronto books corpus described in Zhu et al. (2015); Kiros et al. (2015), which consists of fiction books in 16 different genres, totaling about half a billion words.
CCNews: We collect a de-duplicated subset of the English portion of the CommonCrawl news dataset (Nagel, 2016), which totals around 16 billion words.
Wikitext: The wikitext103 dataset from Merity et al. (2016), which consists of 103 million words from English Wikipedia articles.
While Wikitext and CCNews are factual, Books is fiction and comprises a wide variety of writing styles. The CCNews corpus has the narrowest domain, and it is two orders of magnitude larger than Wikitext. Overall, these datasets are interesting because they enable us to assess the ability of the scoring function to fit and generalize across various axes, from the amount of data available at training time to the richness of style and relatedness among the different data sources.
On Wikitext and Books, we extract positive sequences from windows of text that are 160 tokens long, with a stride of 40. On the larger CCNews we do the same, except that we stride by 160 tokens (non-overlapping windows). This protocol for mining positives is used both at training and test time, although at test time we limit the evaluation to 60,000 randomly chosen positive samples.
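The window-mining protocol can be sketched as follows; this is a simplified illustration over a pre-tokenized list, with tokenization and corpus handling omitted.

```python
def extract_windows(tokens, window=160, stride=40):
    """Mine fixed-length positive sequences with a sliding window.

    Returns every full `window`-token span starting at multiples of
    `stride`: stride 40 yields overlapping windows (Books/Wikitext),
    while stride equal to the window yields disjoint ones (CCNews).
    """
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + 1, stride)]
```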
4.3 Generator Architectures
We use two different architectures for generating negatives: a fully convolutional network (Conv) (Dauphin et al., 2017) and a transformer-based network (Transf) (Vaswani et al., 2017). We have a medium and a large transformer model, yielding three language models (for each direction) in total: Conv, Transf, and TransfBig. The convolutional model uses the “GCNN-14” architecture from Dauphin et al. (2017) with 12 convolutional layers, as implemented in “fconv_lm_dauphin_wikitext103” by Ott et al. (2019). The transformer models are based on the architecture used in Baevski and Auli (2019). The medium-sized model uses 6 blocks, each containing a multi-head attention module with 8 heads. The large model uses 12 blocks, each containing a multi-head attention module with 16 heads. The transformer models are implemented in Ott et al. (2019) as “transformer_lm” and “transformer_lm_big”. We also train a “huge” model (TransfHuge) on CCNews with about 10 times the parameters of “transformer_lm_big”. See Table 8 in the Appendix for more details.
As described in Sec. 3.1.1, we use these language models to generate either a prefix or a suffix. Unless otherwise specified, positive sequences are 160 tokens long. With equal probability, we condition on either 120 or 140 tokens taken from either the beginning or the end of the original sequence when generating negatives.
4.4 Scoring Function Architectures
We consider three architectures for the scoring function:
Linear, which computes a score via a bag of tokens: $f(w_1, \dots, w_n) = \sum_{i=1}^{n} u_{w_i}$, where $u_j$ is a learnt scalar parameter corresponding to the $j$-th token in the vocabulary.
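The bag-of-tokens score reduces to a sum of per-token scalars. In this sketch `u` is a list standing in for the learnt parameter vector (an assumption for illustration):

```python
def linear_bag_score(token_ids, u):
    """Bag-of-tokens score: the sum of one learnt scalar per token.

    `u[j]` is the scalar parameter for the j-th vocabulary item; the score
    ignores word order entirely, which is why this baseline is weak.
    """
    return sum(u[t] for t in token_ids)
```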
BiLSTM (Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005), which computes a score through bidirectional layers using LSTM recurrent units (Hochreiter and Schmidhuber, 1997): $f(w_1, \dots, w_n) = \text{Linear}(\text{AvgPool}(h^L_1, \dots, h^L_n))$, where $h^L_t$ is the hidden state at position $t$ of the last layer $L$, obtained as the concatenation of the forward and backward hidden states; AvgPool averages hidden states over positions; and Linear is a vector of parameters projecting the pooled hidden state down to a scalar value. We consider two versions, referred to as BiLSTM and BiLSTM Big. Both have 4 layers, but BiLSTM has 512 units in both the embedding layer and the hidden layers, while BiLSTM Big has 758 units in the embedding layer and 2014 units in the hidden states.
Transformer (Vaswani et al., 2017; Devlin et al., 2018), which computes a score similarly to the BiLSTM scorer, except that each bi-LSTM layer is replaced by either a bidirectional transformer layer (BiTransf) or a transformer layer with causal self-attention (UniTransf). For unidirectional models we use the same averaging technique as with the BiLSTM models. For bidirectional models the score is computed as $f(w_1, \dots, w_n) = \text{Linear}(h^L_1)$, where $h^L_1$ is the top-layer hidden state at the first position (as is common practice in prior work (Devlin et al., 2018)). BiTransf uses the BERT-Base (Devlin et al., 2018) configuration: 12 bidirectional transformer layers with 768 units and 12-head attention. It is initialized from the publicly available pretrained bert-base-cased model (https://github.com/huggingface/pytorch-pretrained-BERT). UniTransf instead has 12 layers with 1024 units and 16 attention heads per layer. Following Radford et al. (2019), we used the language modeling task for pretraining.
For all models, we use the Adam optimizer (Kingma and Ba, 2014) with warmup. Training is stopped after processing 2.5M samples without any improvement on the validation set. We use data-parallel synchronous multi-GPU training with up to 8 nodes, each with 8 Nvidia V100 GPUs. To improve training speed, we use mixed-precision training (https://github.com/NVIDIA/apex). Following common practice, we clip the norm of the gradient vector (Pascanu et al., 2013). More details about hyper-parameter settings can be found in Appendix Tab. 10, while Tab. 9 in the Appendix reports the number of parameters of each scoring function.
Table 1 (excerpt): P@1 when the training language model itself is used as the scorer, ranking sequences by their log-likelihood or negative log-likelihood:

|TransfBig (language model log-likelihood)||1.1||2.2||3.2|
|TransfBig (language model negative log-likelihood)||68.2||62.2||53.3|
4.5 In-domain generalization
In Table 1 we report the results of the in-domain generalization experiment using our large language model, TransfBig. We observe that when the scoring model has representational power similar to the generator's (UniTransf; see Table 9), it is able to distinguish real from fake completions fairly accurately, reaching a P@1 with 10 negatives of more than 90% on the Books dataset (which is easier to discriminate since it exhibits a larger variety of styles and topics), and above 80% on the more challenging CCNews dataset (for which generation is easier and hence discrimination harder). Even weaker scoring models discriminate real from fake comparably to or better than the training generator used as a scorer via the negative log-probability of the sequence. In conclusion, learning a dedicated discriminator is much more effective than using the original language model generator as a scorer, and discriminators whose capacity matches the generator's work best.
4.6 Cross-architecture generalization
In Table 2, we assess how well the best scoring function we have, UniTransf, generalizes to different generator architectures at test time, using Wikitext as dataset. As a reference, the test perplexity of Conv, Transf and TransfBig is 35.4, 33.5 and 24.5, respectively. Therefore, TransfBig is expected to produce higher quality negatives.
Unsurprisingly, each scoring model does best at discriminating the generator architecture it was trained on (see the results along the diagonal). Perhaps surprisingly, there is no clear relationship between the training generator's perplexity and generalization. On one hand, the mean result of training with generations from the big transformer (averaging values in the last row) is slightly worse than the mean score of training on the medium transformer (averaging values in the row before): training only with strong negatives can leave the scorer more easily fooled by weaker negatives. On the other hand, scoring functions do exhibit good generalization to negatives produced by weaker generators at test time. For instance, if negatives are generated by the strongest generator, TransfBig, at training time but the scoring function is tested using negatives from Transf, P@1 is more than 10% higher (last row of the table). Conversely, harder negatives at test time are not discriminated well (see the decrease of P@1 along the first row).
This is further confirmed by the experiments in Tab. 4, where we compare, on CCNews, testing with negatives produced by TransfBig (which matches the training architecture) versus TransfHuge, an almost ten times bigger generator. In this case, P@1 decreases by 35%.
Overall, the scoring function exhibits good generalization as long as negatives are produced by weaker generators. Otherwise, P@1 drops significantly although not catastrophically.
4.7 Cross-Corpus generalization
In Table 3 we show the results of generalizing across corpora, using UniTransf as the scoring function and TransfBig as the generator both at training and test time. The models generalize less well across corpora than across architectures; for instance, when testing on Wikitext a scoring function trained with either Books or CCNews, P@1 does not even reach 40%. However, training on the union of two of the corpora gives a large benefit over training on just one or the other when testing on the third. Finally, training on the union of all three corpora (last row) yields a scoring function that is fairly robust to the testing conditions, with just a slight decrease of precision compared to the ideal case of an exact match between the corpora used at training and test time.
We also used the publicly available GPT-2 medium model (Radford et al., 2019) and the TransfHuge model we trained on CCNews to generate test negatives. In Table 4, the first row is in-domain (and so matches the corresponding row in Table 1). The next row is cross-architecture in the middle cell, and both cross-corpus and cross-architecture in the other cells, since TransfHuge is trained on CCNews. The final row is cross-corpus and cross-architecture throughout, since GPT-2 was trained on another dataset, to which we do not have access, and it also has a different architecture.
Table 3 (excerpt): P@1 of UniTransf trained on unions of corpora, evaluated on each test corpus (columns: Books, CCNews, Wikitext):

|Books + Wiki||93.0||60.0||69.7|
|CCNews + Wiki||63.5||86.7||74.0|
|Books + CCNews||90.9||86.3||42.6|
Table 4 (excerpt): P@1 when test negatives are generated by GPT-2:

|GPT2 (Radford et al., 2019)||79.9||44.2||53.4|
4.8 Ablation Study
First, we investigate the robustness of the best scoring function, UniTransf, with respect to the length and location of the generated text. The results in Table 5 show that the scoring function performs better on easier tasks even when the distribution of negatives differs from the one seen at training time. For instance, if training negatives are generated by a left-to-right language model producing the last 20 tokens, but test negatives have the last 40 tokens generated (a considerably easier discrimination task), P@1 increases from 53.8% to 55.5%. Conversely, P@1 drops by 20% if training negatives have 40 generated tokens but test negatives only 20. Similarly, the scoring function does not generalize well to generated text in positions other than those used for training (see the poor performance in the cells on the anti-diagonal).
Finally, we study the impact of the variants of the ranking loss in equation (1). Table 6 shows the baseline performance when using only the holistic term, as opposed to also using the losses specific to detecting machine-generated text in the prefix and suffix (first two rows). The composite loss yields almost a 1% improvement in precision. While computing the gradient using more negatives only marginally helps, selecting the most offending negative in the set is the strategy that offers the biggest gain: taking the worst negative out of 3 improves P@1 by more than 2% over averaging the loss across the same 3 negatives. We use this most-offending-negative setting in our work.
Table 5 (excerpt): P@1 when varying the direction and length of generation between training (rows) and test (columns):

|train \ test||left to right, 40 tokens||left to right, 20 tokens||right to left, 40 tokens||right to left, 20 tokens|
|left to right, 40 tokens||69.7||49.2||20.5||17.6|
|right to left, 40 tokens||19.2||16.8||83.1||61.9|
Table 6 (excerpt): P@1 for the composite (holistic + prefix + suffix) loss under different negative-selection strategies:

|holistic + prefix + suffix, 3 random negatives||68.0|
|holistic + prefix + suffix, worst negative out of 3||70.3|
|holistic + prefix + suffix, worst negative out of 6||70.6|
4.9 Stability to Other Negative Distributions
In the previous sections we saw that the scoring functions are not robust to negatives generated from a model trained on a different corpus. Even in that case, however, a negative is still a sample from an auto-regressive neural network. In Section E of the Appendix, we show examples where changing a few entities can cause large jumps in score (from negative to positive) and thus fool the scorer. More generally, the scoring function is not robust to truly out-of-domain samples: for example, the scorer assigns higher scores to blocks of randomly generated text than to real text.
These behaviors are evidence that the scoring functions have learned the regularities of generated text, as opposed to the regularities of real text. We surmise this happens because modeling the latter would be much more difficult than the former. By modeling generated text, the scoring function assigns a high score to anything that is not generated by its training generator. While not surprising, this might be considered a liability of such scoring functions. However, as a model of text, the scoring function should be considered as working on the residuals of the language model used to generate the negatives. For the examples in Appendix Section E, the language model records a large decrease in likelihood after the change of entity, and the language models of course assign much lower likelihood to random text than to gold or generated text. Therefore, the scoring function need not be accurate on examples that are already very unlikely according to these language models.
In Figure 1 we show the average effect of applying various perturbations to sequences from Wikitext103, at each location (from 1 to 160) in the sequence, on an in-domain scorer and on a language model. For all perturbations, the scorer increases its score, while the language model decreases its score (the likelihood). We also see that the scorer is more sensitive to the ends of the text, which is where the negatives differed from real text at training time.
5 Final Remarks
Through an extensive empirical analysis, we have investigated how accurately machine generated text can be detected automatically. Our experiments show that training a dedicated scoring function for this task works significantly better than using the (negative) log-likelihood scores of the language model generating the data under consideration. We then assessed generalization performance of such scoring function, and found that it generalizes well as long as language model generators used at test time are trained on similar datasets and use architectures that are similar or weaker than those used to train the scoring function.
While the lack of broader generalization across corpora limits the applicability to downstream tasks where there is no control over the data and models used to train the generators, we believe that results could further improve by scaling up the scoring function and by adding a larger variety of negatives, including sequences generated adversarially by the same scoring function.
References

- Baevski and Auli  Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxZX20qFQ.
- Bahdanau et al.  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Collobert et al.  Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
- Dauphin et al.  Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 933–941. JMLR, 2017.
- Devlin et al.  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
- Du and Mordatch  Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. CoRR, abs/1903.08689, 2019. URL http://arxiv.org/abs/1903.08689.
- Elman  Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
- Fan et al.  Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Association for Computational Linguistics, 2018.
- Freund and Schapire  Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
- Gehring et al.  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional Sequence to Sequence Learning. In Proc. of ICML, 2017.
- Ghazvininejad et al.  Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Constant-time machine translation with conditional masked language models. CoRR, abs/1904.09324, 2019. URL http://arxiv.org/abs/1904.09324.
- Goodfellow et al.  Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Graves and Schmidhuber  A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610, 2005.
- Graves  Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
- Gu et al.  Jiatao Gu, Qi Liu, and Kyunghyun Cho. Insertion-based decoding with automatically inferred generation order. CoRR, abs/1902.01370, 2019. URL http://arxiv.org/abs/1902.01370.
- Hinton  Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Kaiser et al.  Lukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. arXiv:1803.03382v6, 2018.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kiros et al.  Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.
- Kulikov et al.  Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907, 2018.
- LeCun et al.  Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu-Jie Huang. A tutorial on energy-based learning. Predicting Structured Outputs, 2006. MIT Press.
- Loshchilov and Hutter  Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Merity et al.  Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Nagel  Sebastian Nagel. Cc-news. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available/, 2016.
- Ott et al.  Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
- Pascanu et al.  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- Peters et al.  Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
- Radford et al.  A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. https://openai.com/blog/better-language-models, 2019.
- Radford et al.  Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. https://openai.com/blog/language-unsupervised/, 2018.
- Schuster and Paliwal  Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
- Sennrich et al.  Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Stern et al.  Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. In NeurIPS, 2018.
- Stern et al.  Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. Insertion transformer: Flexible sequence generation via insertion operations. CoRR, abs/1902.03249, 2019. URL http://arxiv.org/abs/1902.03249.
- Strobelt et al.  Hendrik Strobelt, Sebastian Gehrmann, and Alexander Rush. GLTR. http://web.archive.org/web/20190507175455/http://gltr.io/dist/, 2019.
- Sutskever et al.  I Sutskever, O Vinyals, and QV Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 2014.
- Teh et al.  Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- Viola and Jones  P. Viola and M. Jones. Robust real-time object detection. IJCV, 2001.
- Zhang et al.  Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Graham Neubig, and Satoshi Nakamura. Improving neural machine translation through phrase-based forced decoding. CoRR, abs/1711.00309, 2017. URL http://arxiv.org/abs/1711.00309.
- Zhu et al.  Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
Appendix A Corpora Sizes
Appendix B Model Sizes
Appendix C Hyper-parameter Setting
To train our biggest model (UniTransf) on our biggest dataset (CC-NEWS), we used 8 machines, each with 8 GPUs, in synchronous mode using data parallelism. The resulting large batch size speeds up training when combined with float16 reduced precision and cosine scheduling of the learning rate without restarts [Loshchilov and Hutter, 2016], i.e., we decay the learning rate to zero over the course of “max steps” updates and then stop training. Using these methods, we reduced training time by a factor of five compared to single-node training. For all other configurations we used a single node with up to 8 GPUs and inverse square root decay of the learning rate.
| max lr | bsz (per GPU) | GPUs | fp16 | warmup steps | max steps |
|---|---|---|---|---|---|
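The schedule described above can be sketched as follows. This is a minimal illustration only: `max_lr`, `warmup_steps`, and `max_steps` correspond to the table columns, and the exact implementation in the training framework may differ in details such as warmup shape.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, max_steps):
    """Linear warmup to max_lr, then cosine decay to zero at max_steps
    (cosine schedule without restarts, as in Loshchilov and Hutter, 2016)."""
    if step < warmup_steps:
        # warmup phase: learning rate grows linearly from 0 to max_lr
        return max_lr * step / warmup_steps
    # cosine phase: progress goes from 0 (end of warmup) to 1 (max_steps)
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

Because the cosine is not restarted, the learning rate reaches exactly zero at “max steps”, at which point training stops.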
Appendix D Score Distributions
Appendix E Perturbing the Energy Function
In this section we show that we can change a few words to make a negative example become a “positive” one as judged by the scoring function, and vice versa, by using gradient information.
Below, we show an example of a ground-truth sequence from the Wikitext dataset.
Here the block has 160 BPE tokens, where the first 120 tokens (black font) are used as context and the remaining 40 are the ground truth completion. Next, we use a language model to generate 10 negatives:
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[ Chris Elliott in 2006 as the character. Boulter has appeared in various television specials dealing with the series since its inception. <EOS> After graduating with a degree in drama, Boulter worked as a]
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[ Stephen Fry in the episode "You’re All Alone" and in the episode "The Longest Day". <EOS> He auditioned for the role in the series in 2003 but was not cast. In 2005]
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[ Ian Somerhalder on the BBC series Top Gear; this was followed up in 2007 by a role in the BBC science-fiction series Doctor Who. In 2008 Boulter appeared in the BBC]
On this example, using the big transformer model, UniTransf, as the scoring function, we are able to separate real from fake examples as shown in Figure 3. We want to perturb these negatives to violate the margin. To do so, we make use of gradient information from the scoring function $f$ and use a first-order Taylor expansion to approximate the effect of a token replacement (we abuse notation and use $x_i$ to denote the embedding of the $i$-th token in this analysis). Given the original sample $x$, we change one word $x_i$ to $x_i'$ to arrive at $x'$. The score of $x'$ is approximately:

$$f(x') \approx f(x) + (x_i' - x_i)^\top \nabla_{x_i} f(x)$$
Using this approximation, we can search for the token replacements that increase or decrease the score the most. We can easily change a negative sample into a positive one by replacing the 5 words highlighted below. In parentheses, we report both the score and the language-model perplexity.
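The replacement search can be sketched as follows. The embedding table and gradient here are random stand-ins (the actual scoring model is not reproduced); the point is that the Taylor estimate scores every candidate replacement with a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
emb = rng.normal(size=(vocab_size, dim))   # toy token-embedding table
x_i = emb[3]                               # embedding of the token to replace
grad = rng.normal(size=dim)                # gradient of the score w.r.t. x_i

# First-order Taylor estimate of the score change from replacing token i
# with candidate w:  f(x') - f(x) ≈ (emb[w] - x_i) · grad
delta = (emb - x_i) @ grad                 # one estimate per vocabulary entry

best = int(np.argmax(delta))               # replacement raising the score most
worst = int(np.argmin(delta))              # replacement lowering the score most
```

Picking `best` (or `worst`) at a handful of positions is how a negative sample can be pushed above the margin, or a positive sample below it.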
Original negative (score -0.77, PPL 20.77)
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[ Chris][ Elliott] in 2006 as the character. Boulter has appeared in various television specials[ dealing] with the series since its inception. <EOS> After graduating with a degree in[ drama], Boulter worked as a
Perturbed negative (score 0.00, PPL 117.30)
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[ Gor](-0.64, 28.97)[ Trem](-0.56, 38.86) in 2006 as the character. Boulter has appeared in various television specials[ relates](-0.77, 24.60) with the series since its inception. <EOS> After[Health](-0.35, 39.52) with a degree in[edited](-0.49, 27.45), Boulter worked as a
In the above example, we also show, in parentheses, the (score, PPL) resulting from each single-token replacement. Similarly, we can replace a few words to make a positive sample become negative.
Original positive (score -0.25, PPL 77.68)
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[ Derek] Jacobi. He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur, which was performed at the[ Drum] Theatre in[ Plymouth] and the[ Men]ier[ Chocolate] Factory in London. He was
Perturbed positive (score -0.78, PPL 142.85)
<EOS> =Robert Boulter= <EOS> <EOS> Robert Boulter is an English film, television and theatre actor. He had a guest-starring role on the television series The Bill in 2000. This was followed by a starring role in the play Herons written by Simon Stephens, which was performed in 2001 at the Royal Court Theatre. He had a guest role in the television series Judge John Deed in 2002. In 2004 Boulter landed a role as "Craig" in the episode "Teddy’s Story" of the television series The Long Firm; he starred alongside actors Mark Strong and[connected](-0.30, 118.30) Jacobi. He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur, which was performed at the[ C](-0.28, 75.36) Theatre in[ London](-0.47, 62.29) and the[ Vaughan](-0.40, 93.77)ier[cerning](-0.32, 100.71) Factory in London. He was
As shown in Figure 4, we can easily “fool” the discriminator by editing a few words. However, these edited sequences have very low probability (high PPL) under the generator we used. This explains why the discriminator is fooled: it has never seen such negatives during training.