1 Introduction
The dominant approach to parametric text generation is based on large neural autoregressive models (Radford et al., 2019). These models can be trained efficiently via maximum likelihood and they can efficiently generate samples of remarkable quality. Key to their success is local normalization, i.e. they are defined in terms of a product of conditional distributions, one for each token in the sequence. Such distributions are relatively cheap to compute with modern hardware given the limited vocabulary size of common subword units like BPE (Sennrich et al., 2015).
Unfortunately, local normalization also brings some drawbacks. First, the designer of the model needs to specify the order in which tokens are generated. Second, at training time the model is conditioned on ground truth context while at test time it is conditioned on its own generations, a discrepancy referred to as exposure bias (Ranzato et al., 2016). Finally, while heuristics like beam search somewhat help rescore at the sequence level, generation generally lacks long-range coherence because it is produced by the greedy selection of one token at a time without lookahead.
Energy-based models (EBMs) (Hinton, 2002; LeCun et al., 2006; Ranzato et al., 2007) are a more general framework which potentially addresses all these issues, as they do not require any local normalization. They only require the definition of an energy function defined over the whole input sequence. Training aims at shaping the energy function such that regions of high density of training data points have lower energy than elsewhere. In principle, EBMs are ideal for modeling text as they can score the whole input at once, they are not prone to label bias (Bottou, 1991) and they may enable generation of large chunks of text, which should help improve coherency.
However, so far EBMs have had limited application in text generation, because sampling from the model is intractable, and so is maximum likelihood training. The problem is that shaping the energy function is accomplished by updating the model parameters such that the energy is decreased at the training data points (a.k.a. positive examples) and increased at other data points (a.k.a. negative examples). In maximum likelihood training negatives are generated from the model, but in text applications we cannot use gradient-based MCMC methods (Teh et al., 2003; Du and Mordatch, 2019), and Gibbs sampling (Welling et al., 2005) is too slow to be practical. Generating negatives by local perturbations of the ground truth would be efficient but hardly useful for generation purposes, since at test time the model needs to generate from scratch.
Recently, Bakhtin et al. (2019) carefully studied the problem of training a discriminator to distinguish human written text from language model generations. They experimented with different language model and discriminator architectures, training/test time corpora and concluded that the discriminator can generalize rather well to weaker language models when the training/test corpora match. Bakhtin et al. (2019) found that the learned discriminator is not robust to random perturbations, and argued that the discriminator operates in the “residual” space of the language model.
Concurrently, Grover et al. (2019) proposed a general approach to “debias” a generator, by simply training a discriminator and using its output for importance sampling.
In this work, we build upon these two works. First, we formalize the residual interpretation by Bakhtin et al. (2019) and use a generative model of the form:
$$P_\theta(x) \propto P_{LM}(x)\,\exp(-E_\theta(x)) \qquad (1)$$
where $P_{LM}$ is a locally normalized language model which is fixed during training, and $E_\theta$ is the energy function parameterized by $\theta$. The resulting model $P_\theta$ is globally normalized due to the energy term. Note that the same residual formulation was also used in Rosenfeld et al. (2001); Wang and Ou (2018b); Parshakova et al. (2019).
This formulation has multifold benefits. First, by incorporating a locally normalized language model, we can leverage recent advancements in locally normalized language modeling. Second, the language model provides a natural proposal distribution for training (Bakhtin et al., 2019), and training can be made efficient by using the conditional noise contrastive estimation objective (Gutmann and Hyvärinen, 2010) as we shall see in §3. Lastly, this formulation enables efficient evaluation and generation via importance sampling (Horvitz and Thompson, 1952; Grover et al., 2019).
In some sense, this last point is perhaps the central contribution of the paper, as it allows estimating perplexity of the residual EBM, and thus allows these EBMs to be compared in a standard way to other models. Indeed, in §4 we show that our joint model decreases perplexity on two large datasets, when compared to various autoregressive language model baselines. Finally, the EBM generations are significantly preferred by humans according to our qualitative evaluation. To the best of our knowledge, this is the first time that an EBM has demonstrated improved generation ability against very strong autoregressive baselines, both in terms of estimated perplexity and through human evaluation.
2 Related Work
Energy-based models have a long history in machine learning (Hopfield, 1982; Hinton, 2002; LeCun et al., 2006; Ranzato et al., 2007). The key challenge of training is mining for good negatives. This can be accomplished explicitly by fantasizing inputs where the energy should be increased or implicitly via global constraints such as sparsity (Ranzato et al., 2007). Methods that attempt to maximize the likelihood of the data require sampling from the distribution induced by the model. Unfortunately, gradient-based MCMC approaches like Hybrid Monte Carlo (Teh et al., 2003) and Langevin dynamics (Ranzato et al., 2007; Du and Mordatch, 2019; Xie et al., 2016, 2017, 2019, 2018; Gao et al., 2018; Nijkamp et al., 2019) are not applicable when the input is discrete, as in text applications. Other approaches like Gibbs sampling (Hinton, 2002) were applied to binary inputs but do not scale well to large dictionaries once the energy function is a large bidirectional transformer model like the one used in this work. Several variants of auto-encoders have also been investigated for representing and generating text (Bowman et al., 2016; Zhao et al., 2018), but they have not shown significant improvements in terms of perplexity and they have so far been applied to relatively small datasets only.
Our approach appears similar to discriminative reranking approaches used in the parsing and machine translation community (Shen et al., 2004). However, our approach provides a generative model, and parameters/hyper-parameters are directly tuned to close the gap between the model distribution and the data distribution, rather than relying on surrogate ranking losses. This approach is also related to other sequence level training objectives (Edunov et al., 2018), with the major difference that in those works training aims at improving the baseline model, but generation at test time is still greedy.
Energy Networks have been used for sequence modeling (Rosenfeld et al., 2001; Wang et al., 2015, 2017; Wang and Ou, 2017, 2018a; Parshakova et al., 2019). In particular, our residual modeling form and the training algorithm are the same as in Wang and Ou (2018b), where they used an LSTM as the generator and a CNN-LSTM as the energy function, and showed significant gains compared to LSTM baselines in speech recognition. Our work builds on these prior works and develops new lower and upper bounds for the log-probability under the joint model, which makes it possible to show that the residual EBM approach gets better perplexity. We also develop an importance weighting sampling scheme used at generation time, which is focused on conditional generation as opposed to rescoring in speech recognition (Wang and Ou, 2018b). The residual EBM formalism makes it very natural to use BERT for language modeling, and we show that empirically this type of approach can outperform modern state-of-the-art language modeling baselines, both in terms of perplexity, and through human evaluation.
Generative Adversarial Networks (Goodfellow et al., 2014) also relate to EBMs, except that in EBMs the generator is implicit and negative samples are produced by the discriminator itself. In our work, the pretrained locally normalized language model can be seen as a fixed generator, like in Bakhtin et al. (2019). Azadi et al. (2018) also share our goal but their generator is not locally normalized and they propose to improve the sampling from the generator by using the discriminator for rejection sampling. Similar to our work, Grover et al. (2019) propose to use the discriminator to de-bias the pretrained generator using importance sampling. We adapt this work to the application of text generation. In particular, we apply the conditional noise contrastive estimation (NCE) objective (Ma and Collins, 2018; Gutmann and Hyvärinen, 2010) to our residual model energy function and then sample from the joint model using importance sampling. We want to note that the same formulation has been proposed in (Wang and Ou, 2018b; Parshakova et al., 2019). While Ma and Collins (2018) used conditional NCE to predict the next word in a sequence, we apply it to produce a whole sequence at once with the pretrained auto-regressive language model as the noise distribution.
3 Residual Energy-Based Models
We study the problem of conditional generation of discrete sequences. Given a prefix $x_1, \dots, x_p$ with $x_j \in V$, where $V$ is the vocabulary, we want to model the probabilities of generating a sequence of total length $T > p$ (we assume a fixed $T$ for simplicity of analysis and implementation, but our method generalizes to varying-length generation with an end-of-sequence symbol). The generative model is:
$$P_\theta(x_{p+1}, \dots, x_T \mid x_1, \dots, x_p) = \frac{P_{LM}(x_{p+1}, \dots, x_T \mid x_1, \dots, x_p)\,\exp(-E_\theta(x_1, \dots, x_T))}{Z_\theta(x_1, \dots, x_p)} \qquad (2)$$
where $Z_\theta(x_1, \dots, x_p)$ is a normalizing factor known as the partition function. Computing the partition function is intractable in our case since it involves a sum over a number of terms that grows exponentially with the sequence length: in our experiments the size of the vocabulary is 50,096 and the length of the generation is 40 tokens. We call $P_\theta$ the joint model, and $E_\theta$ the residual energy function since $P_{LM}$ is fixed throughout training. The goal of training is to learn the parameters of the energy function such that the joint model distribution gets close to the data distribution. For the sake of reducing clutter in the notation, we will drop the conditioning variables in the following discussion.
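To make the role of the partition function concrete, the following is a minimal sketch on a toy problem where exhaustive enumeration is still possible. The three-token vocabulary, the uniform stand-in for the base language model, and the hand-written energy are all hypothetical choices made for illustration, not the models used in the paper.

```python
import itertools
import math

# Hypothetical toy setup: a 3-token vocabulary and sequences of length 2.
VOCAB = ["a", "b", "c"]
SEQ_LEN = 2


def lm_prob(seq):
    """Stand-in for the base LM: uniform over tokens at every step."""
    return (1.0 / len(VOCAB)) ** len(seq)


def energy(seq):
    """Stand-in energy: each 'a' token lowers the energy (raises probability)."""
    return -0.5 * seq.count("a")


def partition_function():
    """Exact Z: sum of P_LM(x) * exp(-E(x)) over all |V|^SEQ_LEN sequences."""
    return sum(
        lm_prob(s) * math.exp(-energy(s))
        for s in itertools.product(VOCAB, repeat=SEQ_LEN)
    )


def joint_prob(seq):
    """Normalized joint probability P_LM(x) * exp(-E(x)) / Z."""
    return lm_prob(seq) * math.exp(-energy(seq)) / partition_function()


# At the scale quoted in the text the same sum has 50096**40 terms,
# i.e. roughly 10**188, which is why enumeration is not an option.
num_terms_log10 = 40 * math.log10(50096)
```

In the toy case the sum has only nine terms; the last line shows why the same computation is out of reach for a 50,096-token vocabulary and 40 generated tokens.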
3.1 Training
When the partition function is intractable, Maximum Likelihood Estimation (MLE) requires samples from the model distribution, which is usually approximated with Monte Carlo sampling or mean field inference (Hinton, 2012; LeCun et al., 2006) for globally normalized models. Unfortunately, both approaches are too computationally expensive for text applications when using large bidirectional transformer models. For instance, if we were to employ Gibbs sampling exactly, we would need to perform at every position as many forward passes as words in the dictionary to compute each conditional distribution. On large datasets where training locally normalized models on multiple machines already takes days, having such additional overhead means that the model would learn from much less data for the same amount of time, and this is seldom a beneficial strategy for learning models that generalize well. Therefore, we use neither MCMC nor mean field methods, as the latter would introduce additional variational parameters or an inference network, which in any case yields only an approximation to MLE learning.
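The cost argument above can be put into numbers with a short back-of-the-envelope sketch. The function name is hypothetical, and the count assumes one forward pass of the energy network per candidate token per position, as described in the paragraph.

```python
def gibbs_forward_passes(num_positions, vocab_size, num_sweeps=1):
    """Forward passes for exact Gibbs sampling: one per candidate token
    at every position, repeated for each sweep over the sequence."""
    return num_positions * vocab_size * num_sweeps


# With the sizes quoted in the text (40 generated tokens, 50,096-token
# vocabulary), a single sweep already needs about two million passes.
passes_per_sweep = gibbs_forward_passes(40, 50096)
```

A single sweep is rarely enough for the chain to mix, so the real cost would be a multiple of this figure.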
Instead, we train our residual energy function using Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), and more specifically its conditional version (Ma and Collins, 2018). NCE requires two distributions: the model distribution and a noise distribution. In our case, the model distribution is the joint model of Eq. 2, $P_\theta$, while the noise distribution is the pretrained language model, $P_{LM}$. NCE then trains a binary classifier on the difference of log-probability scores of these two models. Since our joint model is the product of the energy function (whose parameters we want to learn) with $P_{LM}$, the difference reduces to: $\log P_\theta - \log P_{LM} = -E_\theta$. Therefore, under these modeling assumptions of residual learning and noise model, the objective function becomes:
$$\max_\theta \; \mathbb{E}_{x_+ \sim P_{data}} \log \frac{1}{1 + \exp(E_\theta(x_+))} + \mathbb{E}_{x_- \sim P_{LM}} \log \frac{1}{1 + \exp(-E_\theta(x_-))} \qquad (3)$$
where $x_+$ is a positive sequence taken from the human generated training set, and $x_-$ is a negative sequence drawn from $P_{LM}$ (for a given ground truth prefix). In other words, training the energy function reduces to training a binary classifier to discriminate between real text and text generated by an auto-regressive language model. The aim of training is to assign as negative energy as possible to real data, and as positive energy as possible to machine generated data. Interestingly, the role of positive and negative samples is totally symmetric in this loss function; §5 will discuss the consequences of this.
With the theoretical guarantee of NCE, we can show that the optimum of the above objective is reached at the data distribution given an infinite amount of data and a model with enough capacity, which is also proved in Ma and Collins (2018) (by their Assumption 2, for conditional NCE the model needs to be flexible enough that the self-normalizing property can be satisfied conditioned on any prefix).
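The binary classification view can be sketched in a few lines. This is a minimal illustration, assuming the objective is written as a loss to minimize (the negative of the maximized objective) and that the energies of positives and negatives have already been computed; `softplus` and `nce_loss` are hypothetical helper names.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def nce_loss(pos_energies, neg_energies):
    """Binary NCE loss for a residual energy function.

    pos_energies: energies of human-written sequences (should become low).
    neg_energies: energies of base-LM samples (should become high).
    """
    # -log sigmoid(-E(x+)) = softplus(E(x+))
    pos_term = sum(softplus(e) for e in pos_energies) / len(pos_energies)
    # -log sigmoid(E(x-)) = softplus(-E(x-))
    neg_term = sum(softplus(-e) for e in neg_energies) / len(neg_energies)
    return pos_term + neg_term
```

Swapping the two sets and negating all energies leaves the loss unchanged, which is the symmetry between positives and negatives noted above.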
Theorem 1.
If $P_{LM}$ has the same support as $P_{data}$, then the objective function in Eq. 3 reaches its maximum at $\log P_{LM}(x) - E_\theta(x) = \log P_{data}(x)$, if there exists such $\theta$.
Proof.
This theorem directly follows from the proof in Gutmann and Hyvärinen (2010). Note that at optimum, $P_{LM}(x)\exp(-E_\theta(x))$ is self-normalizing: instead of $P_\theta(x) \propto P_{LM}(x)\exp(-E_\theta(x))$, we have $P_\theta(x) = P_{LM}(x)\exp(-E_\theta(x))$. However, we still need to estimate the partition function throughout the rest of this paper, since we cannot guarantee that this optimum can be reached. ∎
3.2 Evaluation
A commonly used protocol for evaluating generative sequence models, especially language models, is perplexity (PPL), which is equal to $2^{-\frac{1}{T-p}\sum_{i=p+1}^{T}\log_2 P(x_i \mid x_{i-1}, \dots, x_1)}$. PPL can be interpreted as the average number of tokens the model is uncertain of at every time step. Since the log-likelihood required by PPL relies on estimating the partition function $Z_\theta$, we derive two estimators for the log-partition function $\log Z_\theta$ based on the work of Nowozin (2018).
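As a small worked example of how perplexity depends on the partition function, the sketch below computes PPL from per-token log-probabilities and from the sequence-level score of a residual EBM. It uses natural logarithms throughout (equivalent to the base-2 form as long as the base is used consistently); the function names and arguments are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """PPL from natural-log probabilities of the generated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def joint_perplexity(lm_log_prob, energy, log_z, num_tokens):
    """PPL of the joint model for one sequence.

    The sequence log-probability is log P_LM(x) - E(x) - log Z, so any
    error in the estimate of log Z shifts the PPL directly.
    """
    joint_log_prob = lm_log_prob - energy - log_z
    return math.exp(-joint_log_prob / num_tokens)
```

With zero energy and `log_z = 0` the joint perplexity falls back to that of the base language model.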
Theorem 2.
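Denote $T_n$ as the empirical estimate of $\log \mathbb{E}_{x \sim P_{LM}} \exp(-E_\theta(x))$ with $n$ samples $x_i \sim P_{LM}$ $(i = 1, \dots, n)$: $T_n = \log \frac{1}{n}\sum_{i=1}^{n} \exp(-E_\theta(x_i))$, then $\forall \epsilon > 0$, $\exists N > 0$ such that $\forall n > N$ we have
$$\log Z_\theta - \epsilon < \mathbb{E}[T_n] < \log Z_\theta < \mathbb{E}[(2n-1)T_n - 2(n-1)T_{n-1}] < \log Z_\theta + \epsilon \qquad (4)$$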
The proof is given in Appendix A.2.
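The two estimators can be sketched as follows. This is an illustrative implementation under stated assumptions: $T_n$ is taken to be the log of the sample mean of $\exp(-E(x_i))$ over base-LM samples, the upper estimate is assumed to have the form $(2n-1)T_n - 2(n-1)T_{n-1}$, and $T_{n-1}$ is averaged over all leave-one-out subsets. The function names are hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
import math


def log_mean_exp(values):
    """Stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_partition_bounds(energies):
    """Estimated lower and upper bounds on log Z.

    energies: E(x_i) for n >= 2 sequences x_i sampled from the base LM.
    The bounds hold in expectation and only for large n.
    """
    n = len(energies)
    neg = [-e for e in energies]
    t_n = log_mean_exp(neg)
    # Leave-one-out estimate of T_{n-1}: average over the n subsets of size n-1.
    t_n_minus_1 = sum(
        log_mean_exp(neg[:i] + neg[i + 1:]) for i in range(n)
    ) / n
    lower = t_n
    upper = (2 * n - 1) * t_n - 2 * (n - 1) * t_n_minus_1
    return lower, upper
```

Because the logarithm is concave, the leave-one-out average never exceeds `t_n`, so the returned upper value is always at least the lower one for any finite sample.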
We can use the above two estimators to estimate the lower and upper bounds of the partition function, but we want to emphasize that they are true only asymptotically (when $n$ is sufficiently large). We also want to note that to get lower variance estimates we use a leave-one-out strategy to estimate $T_{n-1}$. See Nowozin (2018) for implementation details and methods to improve numeric stability.
Similarly to locally normalized models, we can also factorize the probabilities of an entire sequence step by step, as $P(x_{p+1}, \dots, x_T) = \prod_{t=p+1}^{T} P(x_t \mid x_{<t})$, and evaluate the PPL for each generation step. By marginalizing over the future, we can derive the following per step probabilities:
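$$P(x_t \mid x_{<t}) = P_{LM}(x_t \mid x_{<t})\,\frac{\mathbb{E}_{x'_{t+1}, \dots, x'_T \sim P_{LM}(\cdot \mid x_{\le t})}\left[\exp(-E_\theta(x_{\le t}, x'_{t+1}, \dots, x'_T))\right]}{\mathbb{E}_{x'_t, \dots, x'_T \sim P_{LM}(\cdot \mid x_{\le t-1})}\left[\exp(-E_\theta(x_{\le t-1}, x'_t, \dots, x'_T))\right]} \qquad (5)$$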
The step-wise probabilities in Eq. 5 are an instance of importance sampling (Horvitz and Thompson, 1952). The basic distribution $P_{LM}$ is adjusted by the probability assigned to token $x_t$ by the energy function (the numerator is clamped at $x_t$ while the denominator sums over all the possible values of the token at position $t$), with the additional marginalization over all subsequent tokens up to the horizon $T$. Since the summation involves exponentially many terms, unless $t = T$, this is approximated by samples drawn by $P_{LM}$. Since both the numerator and the denominator take the same form as the partition function, we also use Eq. 4 to estimate the upper and lower bounds. E.g., the lower bound of $\log P(x_t \mid x_{<t})$ can be obtained by using the lower bound of the numerator and the upper bound of the denominator.
For $t = T$, we can calculate the log probability by exhaustive enumeration. This gives us an idea of the true performance of our model at the last step, and it also provides a sanity-check of the tightness of our estimators.
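The last-step case can be written out directly, since no future tokens remain to be marginalized. The sketch below assumes two hypothetical inputs: the base LM's next-token distribution at the final position and the energy of the full sequence for each candidate final token.

```python
import math


def last_step_log_prob(lm_probs, energies, target):
    """Exact log-probability of the final token under the joint model.

    lm_probs[v]: base-LM probability of token v at the last position.
    energies[v]: energy of the full sequence ending in token v.
    """
    # Unnormalized log-weight of each candidate: log P_LM(v) - E(..., v).
    log_weights = {v: math.log(lm_probs[v]) - energies[v] for v in lm_probs}
    # Normalize by enumerating every candidate token (log-sum-exp).
    m = max(log_weights.values())
    log_norm = m + math.log(sum(math.exp(w - m) for w in log_weights.values()))
    return log_weights[target] - log_norm
```

The enumeration costs one energy evaluation per vocabulary item, which is affordable for a single position but not for the earlier steps, where the remaining tokens would also have to be summed out.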
3.3 Generation
Generating from the joint model is a non-trivial task. A naive way is to generate from the joint model auto-regressively, by marginalizing the future as in Eq. 5, which we term Top-k auto-regressive sampling. However, doing so is computationally expensive and impractical, and we only use this method for a qualitative analysis of the joint model in Appendix A.1.
In order to generate efficiently, we use self-normalizing importance sampling (Owen, 2013; Grover et al., 2019). Under the assumptions that the model from which we wish to draw samples is the joint model, which is the product of the auto-regressive model and the energy function, and that the proposal distribution is the auto-regressive model itself, sampling proceeds simply by: a) sampling from the auto-regressive language model, followed by b) resampling according to the energy function. The algorithm is shown in Algorithm 1, where we introduce an optional top-k constraint on the pretrained language model to improve the quality of samples in the set (adapting to other types of local constraints such as nucleus sampling (Holtzman et al., 2019) is straightforward). Without the top-k constraint, as the number of samples goes to infinity, we would recover exact samples from the joint model distribution.
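The two-step procedure described above (sample from the language model, then resample according to the energy) can be sketched as follows. This is not a transcription of Algorithm 1; `sample_from_lm` and `energy_fn` are hypothetical callables standing in for the (optionally top-k constrained) base LM and the trained energy function.

```python
import math
import random


def sample_joint(sample_from_lm, energy_fn, num_samples, rng=random):
    """Self-normalizing importance sampling from a residual EBM.

    sample_from_lm: callable returning one completion from the base LM.
    energy_fn: callable mapping a completion to its scalar energy.
    """
    # a) Draw a set of candidate completions from the proposal (the base LM).
    candidates = [sample_from_lm() for _ in range(num_samples)]
    # b) Resample one candidate with probability proportional to exp(-E(x)).
    log_weights = [-energy_fn(x) for x in candidates]
    m = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(w - m) for w in log_weights]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Since the proposal is the base LM itself, the importance weight of each candidate reduces to $\exp(-E(x))$, and the unknown partition function cancels in the normalization.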
4 Experiments
In this section, we describe the experimental setup and the results we obtained by using the residual EBM for text generation, both in terms of perplexity and generation quality.
4.1 Experimental Setup
Datasets
We consider two datasets: the Toronto Book Corpus (Zhu et al., 2015; Kiros et al., 2015) and CC-News (Bakhtin et al., 2019). The former dataset consists of fiction books in 16 different genres, totaling about half a billion words. The latter is a de-duplicated subset of the English portion of the CommonCrawl news dataset (Nagel, 2016), which totals around 16 billion words. The book corpus is more challenging because the range of style and topics is more diverse than in CC-News. Also, the book corpus is 30 times smaller than CC-News and may pose generalization challenges because of its smaller size.
In all our experiments we use a prefix of size 120 tokens and we generate the following 40 tokens; with the notation of Eq. 2, $p = 120$ and $T = 160$. For training the joint models, for efficiency we generated 16/128 samples per prefix for CC-News/Book Corpus offline, and sample uniformly from those samples at training time.
Baselines
We consider as base language model (Base LM) used to generate negatives for the residual EBM, a transformer language model with 12 layers, , , (we refer to Vaswani et al. (2017) for notations). This is also our first baseline model.
The joint model has as many parameters as the sum of the number of parameters in the base LM and the number of parameters in the energy network. To make a fair comparison, we consider two additional baselines that have the same number of parameters as our joint model.
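The first baseline is a Residual Auto-regressive Language Model baseline (RALM):
$$\log P_{RALM}(x_t \mid x_{<t}) = \log P_{LM}(x_t \mid x_{<t}) + \log P_\phi(x_t \mid x_{<t}) + \text{const} \qquad (6)$$
where $P_\phi$ takes the form of another auto-regressive language model. The parameters of $P_\phi$ are trained by exact maximum likelihood training of $P_{RALM}$.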
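A residual auto-regressive model of this kind can be illustrated at the level of a single next-token distribution. The sketch assumes that the two models' next-token log-probabilities are added and then renormalized over the vocabulary, with the constant playing the role of the normalizer; `ralm_log_probs` is a hypothetical helper name.

```python
import math


def ralm_log_probs(lm_log_probs, residual_log_probs):
    """Combine base-LM and residual-LM next-token log-probabilities.

    Both inputs are lists over the vocabulary for the same context.
    Returns the locally normalized log-probabilities of the combined model.
    """
    combined = [a + b for a, b in zip(lm_log_probs, residual_log_probs)]
    # Renormalize over the vocabulary (log-sum-exp).
    m = max(combined)
    log_norm = m + math.log(sum(math.exp(c - m) for c in combined))
    return [c - log_norm for c in combined]
```

Unlike the residual EBM, this normalizer is a sum over the vocabulary at one position, so it stays cheap to compute and the model remains locally normalized.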
The second baseline is an autoregressive language model of the same size of our joint model (sum of the base LM and energy function parameters), we dub this model Big Autoregressive Language Model (BALM). BALM has 12 layers, , , , and is trained by standard token level crossentropy loss.
Residual EBM Architecture
We consider two architectures for our residual EBM, both based on transformers (Vaswani et al., 2017; Devlin et al., 2018). The first version uses causal self-attention and is derived from the base LM, a unidirectional transformer (UniT). It has the same architecture as Base LM, except that in the final layer we project the mean-pooled hidden states to a scalar energy value. We initialize its parameters with a language model trained on the same dataset.
The second version is instead bi-directional (BiT), and the energy function is computed by projecting the mean-pooled top hidden states down to a single scalar value. We consider three variants: a BiT-Base following the architecture of RoBERTa-Base, a BiT-Large* following RoBERTa-Large (Liu et al., 2019), and a BiT-Med with the same number of parameters as UniT (such that Joint BiT-Med has roughly the same number of parameters as BALM). We use models from the HuggingFace repository at https://github.com/huggingface/transformers. We initialize the parameters with a trained BERT, and we use * to mark usage of external data; otherwise BERT was trained on our training set. Notice how our model can be interpreted as a natural way to fine-tune large bidirectional pretrained models for the text generation task.
While we expect BiT to yield better results because it can fully leverage context also for intermediate tokens, we also consider UniT to compare to the RALM baseline, which uses the same architecture and only differs in the way parameters are trained and in the presence of local normalization.
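The energy head shared by both architectures (mean-pool the top hidden states, then project to a scalar) can be sketched without any deep learning framework. The plain-list representation, the function name, and the optional bias term are assumptions made for illustration; in practice the hidden states would come from the transformer.

```python
def energy_from_hidden_states(hidden_states, weight, bias=0.0):
    """Scalar energy from per-token hidden vectors.

    hidden_states: list of per-token vectors (lists of floats) from the top layer.
    weight: projection vector with the same dimensionality as each hidden vector.
    """
    dim = len(weight)
    num_tokens = len(hidden_states)
    # Mean-pool over the sequence dimension.
    pooled = [sum(h[d] for h in hidden_states) / num_tokens for d in range(dim)]
    # Linear projection of the pooled vector down to a single scalar.
    return sum(w * p for w, p in zip(weight, pooled)) + bias
```

With causal attention (UniT) each hidden state only sees its left context, whereas with bidirectional attention (BiT) every pooled state already depends on the whole sequence.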
We train our models on 8 DGX nodes, each with 8 Nvidia V100s. To improve training speed, we use mixed precision training (https://github.com/NVIDIA/apex). We use the Adam optimizer, with cosine learning rate decay and learning rate warmup. To stabilize training we used gradient norm clipping (Pascanu et al., 2013). Detailed hyper-parameter settings can be found in Appendix A.3.
For generation, we use topk sampling with for all human evaluations. We take 10,000 samples from Base LM for our joint sampling.
4.2 Results
Automatic Evaluation
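Table 1: Perplexity of each model on CC-News and Toronto Book Corpus. A range denotes estimated lower and upper bounds.

| Model (#parameters) | CC-News Val | CC-News Test | Toronto Book Corpus Val | Toronto Book Corpus Test |
|---|---|---|---|---|
| base LM (203M) | 18.41 | 17.57 | 16.16 | 18.29 |
| RALM (LM+203M) | 17.01 | 16.17 | 15.71 | 17.85 |
| BALM (408M) | | | | |
| joint UniT (LM+203M) | 16.42-16.44 | 15.57-15.58 | 15.12-15.13 | 16.98-17.00 |
| joint BiT-Base (LM+125M) | | | | |
| joint BiT-Base* (LM+125M) | | | | |
| joint BiT-Large* (LM+355M) | | | | |
| Base LM-24L (203M) | | | | |
| RALM-24L (LM-24L+203M) | | | | |
| BALM-24L (408M) | | | | |
| joint UniT (LM-24L+203M) | | | | |
| joint BiT-Base (LM-24L+125M) | | | | |
| joint BiT-Base* (LM-24L+125M) | | | | |
| joint BiT-Med (LM-24L+203M) | | | | |
| joint BiT-Large* (LM-24L+355M) | | | | |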
Table 2: Human evaluation preference rates.

| Model1 (baseline) | Model2 (compared model) | Rate | p-value |
|---|---|---|---|
| base LM | joint UniT | 52.85% | 0.16 |
| base LM | joint BiT-Base | 56.25% | 0.015 |
| base LM | joint BiT-Large* | 58.93% | 0.00084 |
| base LM | BALM | 46.77% | 0.88 |
| BALM | joint UniT | 50.00% | 0.52 |
| BALM | joint BiT-Base | 57.89% | 0.0027 |
| BALM | joint BiT-Large* | 59.89% | 0.00020 |
| BALM-24L | joint BiT-Med (24L) | 56.23% | 0.015 |
| joint BiT-Large* (24L) | Human | 55.21% | 0.036 |
| base LM | BALM | 54.85% | 0.050 |
Our main result is reported in Table 1 where we compare models in terms of their perplexity. We can see that on both datasets, the residual EBM with causal attention, joint UniT, outperforms the baseline RALM with approximately the same number of parameters. The non-residual baseline BALM performs similarly to joint UniT, which might be due to the limitation that $P_{LM}$ is not trained jointly with the residual model in both joint UniT and RALM. However, by using our EBM approach, we can remove the causal attention mask and use bi-directional models, which achieve better performance than the baselines and joint UniT: without external data, joint BiT-Base reaches a higher performance than joint UniT with fewer parameters. By initializing from the state-of-the-art pretrained bi-directional transformers RoBERTa-Base and RoBERTa-Large, joint BiT-Base* and joint BiT-Large* reach even better performance than joint BiT-Base.
In the lower part of the table, we show that if we make the big language model baseline BALM deeper (BALM-24L, with 24 layers instead of 12, for the same number of parameters) we attain lower perplexity. However, training the joint model joint BiT-Base on the residual of a deeper language model, Base LM-24L, yields even lower perplexity, despite having fewer parameters. By using the same number of parameters as BALM-24L, joint BiT-Med further decreases perplexity. Finally, by initializing from RoBERTa-Large, joint BiT-Large* obtains the best results.
One caveat of our evaluation protocol is that the perplexity bounds are only estimates, which might not reflect the true value, particularly since the number of possible sequences grows exponentially with the number of words that are generated. We therefore break down perplexity per position in the generated sequences as in Eq. 5, and compare the estimated PPLs to the true enumerated PPLs at the last position, as shown in Figure 1. We find that at the final generation step, the estimated bounds agree remarkably well with the exact values, which indicates that our method gets a reasonable PPL estimate at least at the last generation step, and that joint BiT-Med outperforms the baselines at that step.
Human Evaluation
Better perplexity results do not necessarily imply better generations. Besides, since generation from the residual EBM requires approximations as in Algorithm 1, the limited sample size might induce approximation errors compared to truly sampling from the joint distribution. Therefore, we conducted human evaluations to compare generations from the residual EBM model to generations from the baseline language models.
For each prefix, we present one completion from each model, and ask humans to select the one that is a better continuation. More details about human evaluation can be found in Appendix A.4. The preference rates reported in Table 2 confirm that indeed the generation quality of joint BiT-Base and joint BiT-Large* is better than both language model baselines. Depending on the model variant, our joint model (with bidirectional EBM) is preferred between 56% and almost 60% of the time; interestingly, the preference rate does not change much whether we compare against base LM or BALM. In fact, humans do not seem to have a strong preference for BALM over base LM, even though the former scores two perplexity points lower. Similarly, joint UniT is not strongly preferred over base LM despite its lower perplexity score. We surmise that unidirectional scoring functions and auto-regressive models exhibit generation artifacts which are easily detected by humans, and these may overshadow the improvements brought by perplexity gains.
4.3 Analyses
Figure 2: Left: PPL upper and lower bound estimates on the CC-News validation set as we vary the number of samples. Right: Percentage of unique n-grams found in real data, samples from the joint model BiT-Base, and samples from the base language model. The joint sampling is done with 10,000 samples.

In this section, we analyze some of the results we obtained. First, we check whether we used a sufficient number of samples in our perplexity estimates. Second, we assess whether the joint model produces fewer repetitions compared to the base language model, and finally we check how well some statistics of the model and data distributions match.
Number of samples.
In Figure 2, we vary the number of samples we take in order to estimate PPL upper and lower bounds. Beyond 20,000 samples the upper estimate becomes very stable, although we have to emphasize that these estimates might be biased even though the gap between lower and upper bound closes as we take more samples.
Repetitions.
A typical artifact of autoregressive language models is their tendency to repeat phrases. It is then interesting to check whether the joint model is able to alleviate this artifact. Fig. 2 shows that indeed the joint model has a slightly higher percentage of unique ngrams compared to the baseline language model with , although still not as high as the original human generated text.
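One simple way to quantify repetition is the share of distinct n-grams in a token sequence. The sketch below assumes the statistic is the number of distinct n-grams divided by the total number of n-grams; the exact definition and tokenization used for the figure are not stated in the text, so this is an illustrative choice.

```python
def unique_ngram_percentage(tokens, n):
    """Percentage of n-grams in `tokens` that are distinct."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n
    return 100.0 * len(set(ngrams)) / len(ngrams)


# A repetitive sample scores low, a non-repetitive one scores 100.
repetitive = unique_ngram_percentage("the cat the cat the cat".split(), 2)
diverse = unique_ngram_percentage("a b c d".split(), 2)
```

Here the repetitive sample contains five bigrams of which two are distinct, so it scores 40%, while the second sample has no repeated bigram.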
A necessary condition for the model to match the data distribution.
If the joint model $P_\theta$ matches the data distribution $P_{data}$, then statistics computed on a large population of samples from the two distributions should also match. In particular, Fig. 3 shows the density plots of log-likelihood scores of the baseline language model (left) and joint model (right) when fed with their own samples versus samples from the test set. We observe that the histogram of samples from the joint model matches the real data distribution more closely: the difference of means in the base LM case is 21.64 whereas the difference is 6.20 in the joint approach.
5 Limitations
In the previous sections we highlighted the strengths of residual EBMs, namely their simplicity, efficiency both at training and test time, and their improved perplexity scores against strong autoregressive language model baselines. In this section, we comment on their limitations to caution the reader about when these methods are more likely to succeed and to inform other researchers about what future avenues of research may naturally derive from this work.
In order to make training efficient and sidestep costly negative mining using the energy function itself, the current approach uses negatives generated from a pretrained auto-regressive language model. Therefore, our model works as long as the base language model from which we draw samples is strong enough, and as long as the ground truth and other plausible sequences are reachable by the baseline language model.
If the base language model has poor quality, then generation from our joint model is going to be poor as well, as the joint model merely resamples generations from the original language model. Moreover, training is going to be trivial if the base language model is poor, because the residual energy function merely needs to detect trivial generation artifacts from the base language model. In fact, observe that the role of positive and negative samples is symmetric in the loss of Eq. 3. This means that the energy function can choose to minimize the loss by either modeling the true data or the negative samples; since the latter have much simpler structure, it is going to model the negative samples. Therefore, importance sampling amounts to mostly down-weighting the worst samples from the base language model. The consequence of this is that search with a poor base language model is going to be catastrophically inefficient, as we would need to sample an impractically large number of negatives in order to find samples that are reasonably close to the true data manifold.
To summarize, this work makes a rather strong implicit assumption on the quality of the base language model, and it is expected to work well only when this is rather strong. In our application, this assumption is met quite well in practice as large autoregressive language models trained on large datasets have improved significantly in recent years (Radford et al., 2019). In general however, residual learning always carries liability to its base model.
6 Conclusions and Future work
We investigated an EBM trained on the residual of a pretrained auto-regressive language model (Wang and Ou, 2018b; Parshakova et al., 2019). The resulting joint model scores sequences holistically, thanks to the energy function. Training is very efficient and consists of a binary classification task between positives from the training set and pre-generated negatives from the fixed language model. Generation is also very efficient as it amounts to resampling from the large set of negatives produced by the base language model. Our estimates show that the resulting model has lower perplexity than the base language model. Finally, this approach may be interpreted as a natural way to fine-tune a large bidirectional transformer like BERT for text generation applications.
In the future, we plan to investigate other ways to generate negatives that may strike a better tradeoff between the amount of compute each negative requires and their closeness to the joint model distribution. It would also be interesting to explore other loss functions and the generation of longer pieces of text by using this model autoregressively at the chunk level, as opposed to the token level.
References
 Discriminator rejection sampling. arXiv preprint arXiv:1810.06758. Cited by: §2.
 Real or fake? learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351. Cited by: §1, §1, §1, §2, §4.1.
 Une approche théorique de l’apprentissage connexionniste: applications à la reconnaissance de la parole. Note: Ph.D. thesis, Université de Paris XI, Orsay, France Cited by: §1.
 Generating sentences from a continuous space. In SIGNLL Conference on Computational Natural Language Learning, Cited by: §2.
 BERT: pretraining of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §4.1.
 Implicit generation and generalization in energybased models. CoRR abs/1903.08689. Cited by: §1, §2.
 Classical structured prediction losses for sequence to sequence learning. In North American Chapter of the Association for Computational Linguistics, Cited by: §2.

 Learning generative convnets via multigrid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9155–9164. Cited by: §2.
 Generative adversarial nets. In NIPS, Cited by: §2.
 Bias correction of learned generative models using likelihoodfree importance weighting. arXiv preprint arXiv:1906.09531. Cited by: §1, §1, §2, §3.3.

 Noisecontrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1, §2, §3.1, §3.1.
 Training products of experts by minimizing contrastive divergence. Neural Computation 14, pp. 1771–1800. Cited by: §1, §2.
 A practical guide to training restricted boltzmann machines. In Neural networks: Tricks of the trade, pp. 599–619. Cited by: §3.1.
 The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: footnote 3.
 Neural networks and physical systems with emergent collective computational abilities. In National Academy of Sciences of the USA, Vol. 79, pp. 2554–2558. Cited by: §2.
 A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. Cited by: §1, §3.2.

 Skipthought vectors. arXiv preprint arXiv:1506.06726. Cited by: §4.1.
 A tutorial on energybased learning. Predicting Structured Outputs. Note: MIT Press Cited by: §1, §2, §3.1.
 Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.1.

 Noise contrastive estimation and negative sampling for conditional models: consistency and statistical efficiency. In Empirical Methods for Natural Language Processing, Cited by: §2, §3.1, §3.1, footnote 2.
 CCnews. Note: http://web.archive.org/save/http://commoncrawl.org/2016/10/newsdatasetavailable/ Cited by: §4.1.
 Learning nonconvergent nonpersistent shortrun mcmc toward energybased model. In Advances in Neural Information Processing Systems, pp. 5233–5243. Cited by: §2.

 Debiasing evidence approximations: on importanceweighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, Cited by: §A.2, §3.2, §3.2.
 Monte carlo theory, methods and examples. Note: chapter 9. Cited by: §3.3.
 Global autoregressive models for dataefficient sequence learning. In Conference on Computational Natural Language Learning, Cited by: §1, §2, §2, §6.
 On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §4.1.
 Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1, §5.
 A unified energybased framework for unsupervised learning. In 11th International Workshop on Artificial Intelligence and Statistics (AISTATS), Cited by: §1, §2.
 Sequence level training with recurrent neural networks. In International Conference on Learning Representation, Cited by: §1.
 Wholesentence exponential language models: a vehicle for linguisticstatistical integration. Computer Speech & Language 15 (1), pp. 55–73. Cited by: §1, §2.
 Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §1.
 Discriminative reranking for machine translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLTNAACL 2004, pp. 177–184. Cited by: §2.
 Energybased models for sparse overcomplete representations. Journal of Machine Learning Research 4, pp. 1235–1260. Cited by: §1, §2.
 Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §4.1, §4.1.
 Transdimensional random fields for language modeling. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 785–794. Cited by: §2.
 Learning transdimensional random fields with applications to language modeling. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 876–890. Cited by: §2.

 Language modeling with neural transdimensional random fields. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 294–300. Cited by: §2.
 Improved training of neural transdimensional random field language models with dynamic noisecontrastive estimation. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 70–76. Cited by: §2.
 Learning neural transdimensional random field language models with noisecontrastive estimation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6134–6138. Cited by: §1, §2, §2, §6.
 Exponential family harmoniums with an application to information retrieval. In Neural Information Processing Systems, Cited by: §1.
 A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644. Cited by: §2.
 Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8629–8638. Cited by: §2.
 Synthesizing dynamic patterns by spatialtemporal generative convnet. In Proceedings of the ieee conference on computer vision and pattern recognition, pp. 7093–7101. Cited by: §2.
 Learning energybased spatialtemporal generative convnets for dynamic patterns. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
 Adversarially regularized autoencoders. In International Conference in Machine Learning, Cited by: §2.
 Aligning books and movies: towards storylike visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §4.1.
Appendix A
A.1 Top-k autoregressive sampling
In this subsection, we factorize the joint model BiTBase autoregressively and compare it with Base LM. Since even estimating the per-step probabilities according to Eq. 5 is too computationally expensive, we further approximate it by considering only the top 128 words predicted by Base LM, sampling 10,000 completions for each of them to estimate $P(x_t \mid x_{<t})$. We then take the top 10 entries, renormalize, and compare them to the top 10 probabilities of Base LM.
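The final truncation and renormalization step can be sketched as follows. This is a minimal illustration that assumes the per-word estimates are already available as a dictionary; the function name `top_k_renormalized` and the input format are hypothetical, and the Monte Carlo estimation from sampled completions is not shown.

```python
def top_k_renormalized(scores, k=10):
    """Keep the k highest-scoring words and rescale them to sum to 1.

    scores: dict mapping each candidate word to its estimated
            (possibly unnormalized) per-step probability.
    """
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(score for _, score in top)
    return [(word, score / total) for word, score in top]


# Toy usage with made-up numbers, truncated to the top 2 entries.
ranked = top_k_renormalized({"able": 0.5, "willing": 0.3, "eager": 0.2}, k=2)
```

Applying the same truncation to both models puts their top entries on a common scale, so that the probabilities can be compared row by row.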
Our initial explorations suggested that the joint model tends to generate fewer repetitions. We therefore picked a few LM samples that contain a repetition at position $t$, and used the same context $x_{<t}$ to estimate $P(x_t \mid x_{<t})$ for the joint model. Some examples of $P(x_t \mid x_{<t})$ for Base LM and BiTBase are presented in Table 3. Indeed, BiTBase usually assigns lower probabilities to repetitions even though the top-k words remain the same. This is not surprising: a repetition is a strong indicator that the text comes from the LM, which leads to a higher energy value and hence a lower joint probability.
| Context | Model | Rank | Word | Probability |
| --- | --- | --- | --- | --- |
| (Excerpt from https://www.swissinfo.ch/eng/multinationalprinciples_swissgovernmentgivesgreenlightforunmigrationaccord/44464186) … is aimed at setting common benchmarks for orderly migration practices, thereby reducing irregular flows. The Global Compact contains ten guiding principles, including that migrants cannot be settled by countries with better integration policies and a fair and sustainable development. ”For the first time in our history, a legally binding and | Base LM | 0 | binding* | 0.39 |
| | | 1 | legally* | 0.33 |
| | | 2 | internationally | 0.06 |
| | | 3 | comprehensive | 0.05 |
| | | 4 | transparent | 0.04 |
| | BiTBase | 0 | binding* | 0.18 |
| | | 1 | legally* | 0.17 |
| | | 2 | internationally | 0.12 |
| | | 3 | comprehensive | 0.09 |
| | | 4 | transparent | 0.08 |
| (Excerpt from https://www.forbes.com/sites/markmurphy/2018/05/11/thisistheonepieceofdatathat85ofrecruitersaremissing/#25917c765dad) … companies that land their firstchoice candidates 90100% of the time, 24% of them have ”thoroughly defined” their high performer attitudes. By contrast, only 1% of companies that struggle to land their firstchoice candidates ”thoroughly defined” their high performer attitudes. So it seems pretty clear that companies that land their topchoice candidates are not always as willing and | Base LM | 0 | able | 0.66 |
| | | 1 | willing* | 0.09 |
| | | 2 | eager | 0.07 |
| | | 3 | ready | 0.05 |
| | | 4 | well | 0.04 |
| | BiTBase | 0 | able | 0.75 |
| | | 1 | willing* | 0.05 |
| | | 2 | eager | 0.05 |
| | | 3 | ready | 0.04 |
| | | 4 | well | 0.03 |
| (Excerpt from https://www.reuters.com/article/ususafedpowell/fednomineepowelloncehawkishnowchampionsyellensfocusonjobsidUSKBN1DS0FG) … it reveals a key skill needed to lead the Fed. ”You need to know what you don’t know. And you need to be willing to listen when you don’t know something,” said Karen Dynan, who as an assistant Treasury Secretary in Barack Obama’s second administration would regularly meet Fed governors. <EOS> New Delhi Dec 5 The following are mergers under review by India’s financial services and | Base LM | 0 | banking | 0.64 |
| | | 1 | financial* | 0.10 |
| | | 2 | insurance | 0.09 |
| | | 3 | technology | 0.05 |
| | | 4 | IT | 0.04 |
| | BiTBase | 0 | banking | 0.92 |
| | | 1 | financial* | 0.06 |
| | | 2 | insurance | 0.01 |
| | | 3 | technology | 0.00 |
| | | 4 | IT | 0.00 |

Entries marked with * are highlighted in red in the original table; they repeat a word from the end of the context.
A.2 Proof of Theorem 2
Theorem 2.
Denote as the empirical estimate of with samples , and let , then , such that we have
(7) 
Proof.
From Nowozin (2018) Eq. 35, we can write as
(8) 
Where , . Equivalently,
(9) 
Therefore, . So , such that when , . On the other hand, , so such that when we have . Up to this point, we have proved that .
For the other half of the proof, using Eq. 8 we have
(10) 
where is a constant. Therefore, . Therefore , hence , such that . Furthermore, , so such that when we have .
Putting the above together, , let , then ,
∎
A.3 Optimization Settings
| Model | fp16 | batch size | warmup steps | max steps | max lr | max grad norm |
| --- | --- | --- | --- | --- | --- | --- |
| base LM | - | 32 | 2,000 | 180,000 | 0.0001 | 10 |
| RALM | - | 64 | 2,000 | 180,000 | 0.0001 | 10 |
| BALM | - | 32 | 2,000 | 180,000 | 0.0001 | 10 |
| joint UniT | + | 64 | 2,000 | 180,000 | 0.0003 | 10 |
| joint BiTBase | - | 60 | 2,000 | 90,000 | 0.00005 | 0.25 |
| joint BiTBase* | - | 60 | 2,000 | 90,000 | 0.00005 | 0.25 |
| joint BiTLarge* | + | 64 | 2,000 | 90,000 | 0.0003 | 10 |
| base LM24L | - | 50 | 2,000 | 90,000 | 0.0003 | 0.25 |
| RALM24L | - | 28 | 1,000 | 90,000 | 0.00015 | 0.25 |
| BALM24L | - | 28 | 2,000 | 90,000 | 0.0003 | 0.25 |
| joint UniT (LM24L) | + | 64 | 2,000 | 180,000 | 0.0003 | 10 |
| joint BiTBase (LM24L) | - | 60 | 2,000 | 90,000 | 0.00005 | 0.25 |
| joint BiTBase* (LM24L) | - | 60 | 2,000 | 90,000 | 0.00005 | 0.25 |
| joint BiTMed (LM24L) | - | 32 | 2,000 | 90,000 | 0.00005 | 0.25 |
| joint BiTLarge* (LM24L) | - | 20 | 2,000 | 90,000 | 0.00005 | 0.25 |

The optimization settings are presented in Table 4.
A.4 Human Evaluation
A screenshot of the human evaluation experiments can be found in Fig 4. Every page asks for 4 comparisons, for one of which we know the ground-truth answer. We subsampled 333 sentences from the test set of CCNews and asked 3 Amazon Mechanical Turk workers (turkers) to vote on each. We consider one continuation better if it gets more votes. To check the quality of the received ratings, we ran a qualification task beforehand in which one of the continuations is real text, and we kept the top-performing half of the turkers for further evaluation (corresponding to higher than 66.67% accuracy at discriminating real from LM samples, for a total of 26 qualified turkers). In the actual experiment, we then use one out of every four comparisons as an attention check and drop responses if the turker did not pass the check.
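The aggregation of ratings can be sketched as follows. This is an illustration only: the record layout (a dictionary with a `passed_check` field) and the function names are hypothetical, and returning `None` on a tie is an assumption, since the text does not specify how ties are handled.

```python
from collections import Counter


def keep_attentive(responses):
    """Drop responses from pages where the attention check was failed.

    responses: list of dicts with a boolean 'passed_check' field
               (hypothetical record layout).
    """
    return [r for r in responses if r["passed_check"]]


def preferred_continuation(votes):
    """Return the label with the most votes, or None if there is no winner."""
    if not votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: neither continuation is counted as better
    return counts[0][0]


# Toy usage: three ratings for one comparison, one of which is dropped.
ratings = [
    {"vote": "joint", "passed_check": True},
    {"vote": "baseline", "passed_check": False},
    {"vote": "joint", "passed_check": True},
]
winner = preferred_continuation([r["vote"] for r in keep_attentive(ratings)])
```

In this toy example the failed response is discarded first, and the remaining two votes agree, so `winner` is `"joint"`.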
A.5 Qualitative analysis
We present generation examples where our approach BiTBase outperforms the baseline BALM in Table 5, and examples where our approach underperforms in Table 6. Here the judgment is based on human evaluation, restricted to cases where all three turkers unanimously voted in favor of one model over the other.
| Prefix | Model | Generation |
| --- | --- | --- |
| (Excerpt from https://chicago.cbslocal.com/2017/11/20/chancetherapperjewelosco/) the timing of the donation couldn’t be better, and even went on to say that no one is more relevant than Chance right now. Thank you to Doug and @jewelosco for donating $1,000,000 to The New Chance Arts & Literature Fund, which will ensure more students have access to arts enrichment education! #supportcps pic.twitter.com/MXZtpscU5b — SocialWorks (@SocialWorks_Chi) November 20, 2017 “In the past, CPS hasn’t really gotten a lot of money from nonprofits | Base LM | . And the fact that the money is coming from the government, it makes a big difference,” he said. “We’re not the biggest donor of arts education, so to |
| | BALM | , and it doesn’t even have the kind of resources or funding for the arts or literary programs that we need to do the things that people want to support,” said Chance. |
| | BiTBaseWorst | ,” said Cpl. Chris Bissonnette of the Stony Brook Police Department. “So the reason is, as I think about the impact of these grants and these grants to schools |
| | BiTBase | to fund programming. So, I think that it’s a great time for us to do something to support that,” said Chance. “We want to make sure that our programs |
| | Gold | and charitable arms and organizations because there’s been a past of corrupt activity in the city of Chicago,” Chance said. “And so, when it comes down to it, it |
| (Excerpt from https://chiefswire.usatoday.com/2017/12/10/halftimeanalysisforchiefsvsraidersweek14/) quarter. The penalties are still somewhat of a problem but tight ends Travis Kelce and Demetrius Harris made some impressive grownman football plays. It was nice to see running back Kareem Hunt get in the end zone for the first time since Week 3. He must feel good to end the drought. Kelce was visibly frustrated on the sidelines and rightfully so. The officials seem to be leaning toward Oakland with calls today. Still, Kelce should’ve caught that easy pass that he dropped. Quarterback Alex Smith has delivered a couple of nice deep balls to | Base LM | get his hands on the ball this week. He threw two touchdown passes on Thursday. He should get another touchdown as the season goes on. He’s got a good chance to be one of |
| | BALM | the Chiefs and the Chiefs defense has made them pay for it. The offense has done a nice job with the ball, and they’s been a factor. It’s nice to have |
| | BiTBaseWorst | tight end Martavis Bryant to get the ball back, but the throw from Smith to Davis Webb is incomplete, which could lead to an incompletion. The Chiefs are now 52 after winning |
| | BiTBase | help the Chiefs win this game. His completion of a deep ball to wide receiver Sammy Watkins is a nice sign. Jalen Ramsey should be back in the lineup today. The Jaguars are going |
| | Gold | receivers Albert Wilson and Tyreek Hill. The footwork by Wilson on the deep ball was amazing and Hill’s balltracking skills were on point as usual. Good things happen when Smith trusts |
| (Excerpt from https://www.glamour.com/story/kourtneykardashianintensedetox) has an amazing ability to detoxify itself namely by the liver, kidneys, and skin,” she says. Detox programs and diets do sometimes make people feel better, but that happens more often if you’re eating and/or drinking things that make feel like crap (something it doesn’t sound like Kardashian was doing in the first place). And if, somehow, you discover that you have elevated levels of mercury or lead in your body like Kardashian said she had, it’s important to figure out why, Dr. Wider says. “Finding the source is extremely important for any heavy metal, so you | Base LM | can’t say ’no or I’ve never seen something of that value, so I’ll try and find it again and see what happens.” So don’t be afraid to find a way to detox or |
| | BALM | want to get the full picture and not just have to look at the whole picture,” she says. “It can be a great idea to talk to your doctor to determine what’s causing your symptoms.” |
| | BiTBaseWorst | can get rid of toxins that are found in foods like whole wheat bread, nuts like walnuts, walnuts and walnuts,” she says. ”If a source of the metals has elevated levels of |
| | BiTBase | can’t say, ’Oh my God, it’s a lead, but I’m fine.’ But it’s definitely a problem in our bodies,” she says. “And if it’s a problem in |
| | Gold | can lower your exposure over time,” she says. Meaning, a detox diet probably isn’t going to help you as much as, say, reducing your intake of foods that are high in mercury (like |

| Prefix | Model | Generation |
| --- | --- | --- |
| (Excerpt from https://www.libraryjournal.com/?detailStory=lakewoodohmomsueslibraryoverteensroughtreatment) but as the audio only came on halfway through the incident, it did not record his initial words to the girl or her response. It was also dropped in the hallway during their interaction, so it did not record any of what happened once the two left the library. Jones brought her outside to the library’s front steps, where he waited for responding officers to arrive. They did not call for an ambulance, according to the lawsuit. Rather, when the officers arrived, they told her to spit blood on the grass instead of the concrete, and to move from the steps to a bench. Some 20 minutes | Base LM | later, she was dead, according to the lawsuit. “It’s not uncommon for an officer to spit blood on an unarmed, vulnerable adult. The use of spit is illegal in a public place,” |
| | BALM | later, Jones was arrested for disorderly conduct and released after posting bond, the suit said. She was arrested again on May 2. In the suit, Jones is seeking more than $1 million. A |
| | BiTBaseWorst | later, he walked up on the ground, where they began to yell and yell again. When Jones refused, the officers ordered him not to leave or leave any more. Jones then pulled his gun on |
| | BiTBase | after the officers’ arrival, Jones and the girl left the library and headed to the parking lot of a nearby apartment complex. “This is the type of behavior that is most likely to cause harm to |
| | Gold | later, J.G’s mother arrived and took her daughter to Cleveland Clinic in Lakewood, where she was treated for a dislocated jaw, a head injury, and an injury to her knee. |
| (Excerpt from https://www.sunsentinel.com/community/delraysun/fldrfvillageacademysteam041820180410story.html) , Bronson said. “The initiative provides a variety of supports to early childhood programs’ children, families and staff. The resources provided through this partnership increase the quality of the participating programs, which benefits the community and impacts our future in such a positive way,” Scott said. Visit PNCGrowUpGreat.com/donorschoose. \nHere are Varsity sports headlines for April 13, 2018. Refresh to get the latest as we add news throughout the night as we collect scores: Best of the best in track and field Our Sentinel coverage area top performers lists for girls track and field | Base LM | at the Varsity Track & Field Invite.\nThe U.S. Army Corps of Engineers has approved $2 billion in construction work along the U.S.Mexico boundary as a way to |
| | BALM | . Check back frequently for updates. The Sentinel also has a feature on the boys basketball and wrestling teams. Boys golf The Sentinel boys golf and wrestling teams have been one of those teams who has had some |
| | BiTBaseWorst | .\nLONDON, April 13 (IFR) The following are some of the main factors expected to affect Spanish corporate bond prices on Wednesday. BAML: Spanish sovereign wealth fund PPV |
| | BiTBase | .\nA few weeks back, it’s been rumored that the HTC Desire was going to be the company’s last flagship phone, and now, a new leak has confirmed that it |
| | Gold | and boys track and field are updated going into the Saturday district meets. The season is heating up with more district and region races coming up next week. Click these links for girls top performers and boys top |
| (Excerpt from https://seekingalpha.com/article/4215142applelookstorechargetechsector) leaders meeting November 1718 in Papua New Guinea as potential XiTrump meet dates. If all else fails, Trump and Xi are also expected to meet for a bit at the G20 meeting at the end of November. On the economic calendar next week, the update on jobs and the U.S. trade deficit are the headliners on November 2. Notable earnings reports: Akamai Technologies (NASDAQ:AKAM), Mondelez International (NASDAQ:MDLZ) and Olin Corp. (NYSE:OLN) on October 29; Under Armour (NYSE: | Base LM | UAA), General Motors (NYSE:GM) on November 4; and Procter & Gamble (NYSE:PG) for October. On the retail front, Lowe’s Companies (NYSE:L |
| | BALM | UA) on October 30; CVS Health (NASDAQ:CVS) on November 27; Intel Corporation (NASDAQ:INTC) on October 28; and Verizon Communications (NYSE:V |
| | BiTBaseWorst | UAA) and Adidas (OTCPK:ADDYYF; OTCQX:ADDYYFGF; OLYMP), on November 30; and Qualcomm Incorporated (NASDAQ: |
| | BiTBase | UAA), Johnson Controls (NYSE:JCI) and Cisco Systems (NASDAQ:CSCO) on November 6.\nA woman who had to have her nose and mouth taped as punishment |
| | Gold | UAA), eBay (NASDAQ:EBAY), General Electric (NYSE:GE), CocaCola (NYSE:KO), Pfizer (NYSE:PFE) and Electronic Arts (NAS |