1 Introduction
Recent advances in language modeling have resulted in significant breakthroughs on a wide variety of benchmarks in natural language processing (Dai et al., 2018; Gong et al., 2018; Takase et al., 2018). Capturing long-term dependencies has especially been a major focus, with approaches ranging from explicit memory-based neural networks (Grave et al., 2016; Ke et al., 2018) to optimization improvements aimed at stabilizing training (Le et al., 2015; Trinh et al., 2018). In this paper, we address a basic question: how do the long-term dependencies in a language model's generations compare to those of the underlying language? Furthermore, if there are measurable discrepancies, can we use them to improve these models, and how?

Starting from Shannon's seminal work that essentially introduced statistical language modeling (Shannon, 1951), the most classical and widely studied long-term property of a language model is its entropy rate: the average amount of information contained per word, conditioned on the preceding words. A learned model provides an upper bound on the entropy rate of a language via its cross-entropy loss. The exponential of the entropy rate can be interpreted as the effective support size of the distribution over the next word (intuitively, the average number of "plausible" word choices to continue a document), and the perplexity of a model (the exponential of its cross-entropy loss) is an upper bound for this quantity. In state-of-the-art models trained on billion-scale corpora, this number ranges between 10 and 30 (Melis et al., 2017; Radford et al., 2019). A natural diagnostic question, with which we begin our work, is whether the long-term generations of these models exhibit the same entropy rates as the underlying languages they model predictively.
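To make the correspondence concrete: perplexity is the exponential of the per-token cross entropy, and can be read as an effective number of plausible next words. A minimal sketch (the helper names are illustrative):

```python
import math

def entropy_rate(log_probs):
    """Average negative log-probability per token (nats)."""
    return -sum(log_probs) / len(log_probs)

def perplexity(log_probs):
    """exp(entropy rate): the effective number of plausible next words."""
    return math.exp(entropy_rate(log_probs))

# A model that assigns each observed token probability 1/20 has
# perplexity 20: on average, 20 "plausible" choices per step.
lp = [math.log(1 / 20)] * 100
print(perplexity(lp))
```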
Table 1: Test perplexity vs. entropy rate of generated text.

Model                                 Corpus   Test ppl.  EntRate
AWD-LSTM (Merity et al., 2017)        PTB      58.3       93.1
CNN-LSTM (Jozefowicz et al., 2016)    GBW      29.8       49.4
Transformer (Vaswani et al., 2017b)   GBW      28.1       34.7
Transformer (Radford et al., 2019)    WebText  23.7       61.2
Empirically, and perhaps surprisingly, it turns out that the entropy rate of generated text is substantially higher than the estimate for true text derived from the model's one-step predictions. As seen in Table 1 (see also Figure 1), this is true for both state-of-the-art LSTMs and Transformers trained on a variety of datasets. As a timely example, the GPT-2 model (Radford et al., 2019), the object of much recent attention for its seemingly coherent and on-topic generations, suffers a dramatic degradation in its entropy rate, from 23.7 to 61.2. We defer the details of this experiment to the supplementary material.

This empirical finding is notable since neural attention- and memory-based techniques have been steadily improving on standard metrics like perplexity and, in some cases, even produce remarkably coherent text (often with some heuristics to reject poor generations). That the perplexity of generated text is so much higher than it is under the true distribution suggests that there are significant gaps in our current methodologies for accurately learning language models, particularly if we are interested in generating text that globally resembles the modeled language itself.
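At a high level, the EntRate measurements above are obtained by sampling from the model and averaging the per-token negative log-probability along the sampled trajectory. A toy bigram stand-in (the transition table is illustrative, not one of the paper's models):

```python
import math, random

random.seed(0)

# Hypothetical learned bigram model P(next | prev) over a two-word vocabulary.
model = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}

def gen_entropy_rate(n_steps):
    """Entropy rate of the model's own generations: sample a trajectory
    from the model and average -log P(x_t | x_{t-1}) along it (nats)."""
    prev, total = "a", 0.0
    for _ in range(n_steps):
        nxt = "a" if random.random() < model[prev]["a"] else "b"
        total += -math.log(model[prev][nxt])
        prev = nxt
    return total / n_steps

print(gen_entropy_rate(100000))  # converges to the chain's entropy rate
```

For a trained language model, amplification shows up when this quantity exceeds the cross entropy measured on held-out true text.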
Our contributions.
The focus of this work is twofold: to improve generations based on any measured mismatch in a long-term property of the model (e.g., the entropy rate), and to quantify the way a model's predictions depend on the distant past. Central to both is a calibration-based approach, as utilized in statistics and other areas of machine learning (Dawid, 1982, 1985; Foster, 1991; Zadrozny and Elkan, 2002; Platt, 1999; Guo et al., 2017; Niculescu-Mizil and Caruana, 2005).

First, we show that, from a worst-case perspective, even an extremely accurate model (with average KL divergence $\epsilon$ from the true distribution) may generate text with a substantially different entropy rate than that of the true distribution. Indeed, we show that this worst-case amplification may occur for a variety of long-term properties of a probabilistic language model; this is because the one-step KL divergence does not in general provide tight control over the expectation of a bounded function. The observed entropy rate amplification (as seen in Table 1) demonstrates that this is not only a theoretical concern. We then describe a calibration procedure that fixes this mismatch while simultaneously improving the perplexity of the language model. From a statistical perspective, the procedure is simple, and we discuss approaches to make it computationally efficient.
Second, we provide a definition of long-term memory in language models as the mutual information between the model's predictions and the distant past of the input. We then provide an upper bound on this mutual information using calibrated distributions (with a single-parameter exponent). This allows us to estimate the amount of context used by a language model as a function of the distance of past tokens from the current prediction timestep.
We perform empirical studies to accompany our theoretical results. We first use the entropy rate calibration algorithm to fix an LSTM language model, resulting in a drop of around 20 perplexity points in the generated text (so that the entropy rate of the model more accurately matches that of the language itself). Then, we empirically estimate and compare the long-term memory of state-of-the-art language models. Our insights point towards new ways of assessing (and fixing) language models, especially in terms of their long-term properties, in a manner complementary to existing metrics like perplexity.
2 Related Work
Improving language modeling with long-term dependencies.
Recent approaches to improving language modeling have focused on several ways to better capture long-term dependencies: using manually defined context representations (Mikolov and Zweig, 2012; Ji et al., 2015; Wang and Cho, 2016) or document-level topics (Wang et al., 2017), or using LSTM recurrent neural networks with careful initialization (Le et al., 2015), auxiliary loss signals (Trinh et al., 2018), or augmented memory structures (Grave et al., 2016; Ke et al., 2018). More recent work has demonstrated the applicability of Transformer networks (Vaswani et al., 2017a) to the task, potentially sidestepping issues in training recurrent networks (e.g., vanishing/exploding gradients) and scaling to longer contexts (Dai et al., 2018; Radford et al., 2018). All these papers propose either architectural or optimization innovations to improve language model training. In contrast, we define and measure explicit long-term properties of language models and show that calibrating them correctly can provide improvements to any black-box language model.

Information-theoretic approaches.
While most language models aim to predict a distribution over the next token conditioned on the context, there have been alternative approaches relying on information-theoretic measures. Jost and Atwell (1994) propose a model which makes use of mutual information between word pairs to generate word sequences that retain longer-term dependencies. McAllester (2018) proposes a training objective based on mutual information for predictive modeling, and demonstrates its application to phoneme prediction. Clarkson and Robinson (1999) develop a hybrid metric using both perplexity and entropy rate, and show that it correlates better with a downstream metric like word error rate. These works propose alternative optimization objectives; in contrast, we show how to use information-theoretic measures to improve models with respect to existing objectives like cross-entropy.
Measuring long-term statistics.
Khandelwal et al. (2018) analyze LSTM-based language models and empirically show that such models use only a finite context for prediction. Lin and Tegmark (2017) measure the mutual information between any two symbols in human languages, and show that it decays with distance, roughly following a power law. Takahashi and Tanaka-Ishii (2018) provide an upper bound for the (character-level) entropy of human languages by training neural language models with various context and data sizes and extrapolating to infinity. While we also make use of measures like entropy and mutual information across longer contexts, our goal is to use these to better calibrate the language model and provably improve its perplexity.
Calibration and integral probability metrics.
The idea of matching properties of the model's predictions to the empirical outcomes, in an online setting, goes back (at least) to the "prequential principle" of Dawid (1982, 1985), with subsequent work in online and game-theoretic settings (Foster, 1991; Vovk, 2001; Kalai et al., 1999). The idea of improving probability scores is also common in machine learning (Zadrozny and Elkan, 2002; Platt, 1999; Guo et al., 2017; Niculescu-Mizil and Caruana, 2005). Examining the expectations of functions as a metric for the distance between two distributions sometimes goes under the name of integral probability metrics (Müller, 1997; Sriperumbudur et al., 2009), a notion that is becoming increasingly relevant again in unsupervised learning through connections to GANs (Mroueh and Sercu, 2017). In this work, we focus directly on the KL divergence, where our use of calibration is largely based on basic facts about exponential families (Brown, 1986).

3 Preliminaries
We first define some useful quantities for our analyses. Let $\Pr$ represent the true underlying distribution over length-$T$ sequences of words, where the vocabulary is of size $M$. Let $X_{1:T} = (X_1, \ldots, X_T)$ denote a random sequence of length $T$, with distribution $\Pr$. For clarity of exposition, we assume that all sequences (i.e. sentences or documents or books) are of equal length $T$.
For any distributions $D$ and $D'$ over length-$T$ sequences, recall that the entropy, KL divergence, and entropy rate are, respectively, defined by:

$$H(D) = \mathbb{E}_{X \sim D}\left[\log \tfrac{1}{D(X)}\right], \qquad \mathrm{KL}(D \,\|\, D') = \mathbb{E}_{X \sim D}\left[\log \tfrac{D(X)}{D'(X)}\right], \qquad \mathrm{EntRate}(D) = \tfrac{1}{T}\, H(D).$$
Let $\widehat{\Pr}$ denote a learned distribution over sequences. In the typical sequential prediction setting, the probabilistic model is implicitly defined by the conditional distributions $\widehat{\Pr}(X_t \mid X_{1:t-1})$, which are typically efficiently computable. It is standard for such a language model to be trained to minimize the cross-entropy objective:

$$\mathrm{CE}(\widehat{\Pr}) = \mathbb{E}_{X_{1:T} \sim \Pr}\left[\frac{1}{T} \sum_{t=1}^{T} \log \frac{1}{\widehat{\Pr}(X_t \mid X_{1:t-1})}\right].$$
Note that for an accurate language model, we would hope that

$$\mathrm{EntRate}(\widehat{\Pr}) \approx \mathrm{CE}(\widehat{\Pr}),$$

i.e. that the entropy rate of sequences generated under the learned model is nearly the cross entropy of the model (with respect to the true distribution $\Pr$).
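The cross-entropy objective above can be computed directly from the model's conditional probabilities; the sketch below uses a hypothetical context-ignoring uniform model and a tiny corpus purely for illustration:

```python
import math

def cross_entropy(corpus, cond_prob):
    """Token-level cross entropy: average -log q(x_t | x_<t) over a corpus."""
    total, count = 0.0, 0
    for seq in corpus:
        for t, tok in enumerate(seq):
            total += -math.log(cond_prob(seq[:t], tok))
            count += 1
    return total / count

# Hypothetical model: uniform over a 4-word vocabulary, ignoring context.
uniform = lambda ctx, tok: 0.25
corpus = [["the", "cat", "sat"], ["a", "dog", "ran", "far"]]
print(cross_entropy(corpus, uniform))  # log 4 nats per token
```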
Throughout, we assume that

$$\mathrm{CE}(\widehat{\Pr}) - \mathrm{EntRate}(\Pr) \le \epsilon \qquad (1)$$

holds for some $\epsilon \ge 0$. In other words, the (unknown) $\epsilon$ measures the degree of suboptimality of the learned model; this is often referred to as the Bayes regret.
4 Calibration and Entropy Rates
In this section, we assess the long-term properties of language models when generating text. Specifically, we quantify the amplification of the entropy rate in generations from an accurate model (Eq. 1). We then provide a procedure to fix this amplification without increasing the perplexity of the model. Proofs of all statements are provided in the supplementary material.
For generality, consider a function $f$ defined on length-$T$ sequences. Let the mean and variance of $f$ under a distribution $D$ be denoted by

$$\mu_D(f) = \mathbb{E}_{X \sim D}[f(X)], \qquad \sigma_D^2(f) = \mathrm{Var}_{X \sim D}[f(X)].$$

4.1 Error amplification under our model
If our learned model is accurate, we may hope that $\mu_{\Pr}(f) \approx \mu_{\widehat{\Pr}}(f)$, i.e. that the expected value of $f$ under the true distribution is close to its expected value under our model. We can quantify this gap as follows:
Lemma 4.1.
(Pinsker's inequality; Csiszar and Körner (2011)) Suppose that $|f(x)| \le B$ for all $x$. Then:

$$\left|\mu_{\Pr}(f) - \mu_{\widehat{\Pr}}(f)\right| \le B \sqrt{2\, \mathrm{KL}(\Pr \,\|\, \widehat{\Pr})}.$$
Since this holds for any bounded function, we can obtain the error amplification of the entropy rate of $\widehat{\Pr}$ simply by choosing $f(x) = \frac{1}{T} \log \frac{1}{\widehat{\Pr}(x)}$.
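A quick numerical check of this Pinsker-style bound on toy distributions (the distributions and the bounded function are illustrative):

```python
import math

def kl(p, q):
    """KL divergence between finite distributions (nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_gap(f, p, q):
    """|E_p[f] - E_q[f]| for a function given by its values on the support."""
    return abs(sum(fi * (pi - qi) for fi, pi, qi in zip(f, p, q)))

p = [0.5, 0.3, 0.2]   # "true" distribution (toy)
q = [0.2, 0.3, 0.5]   # model distribution (toy)
f = [1.0, -1.0, 0.5]  # any function with |f| <= B = 1
B = 1.0

bound = B * math.sqrt(2 * kl(p, q))
print(mean_gap(f, p, q), "<=", bound)
```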
Before we proceed, in order to rule out amplification of the entropy rate due to arbitrarily small probabilities (which can make $f$ unbounded), it is helpful to define the mixture distribution

$$\widetilde{\Pr} = (1 - \gamma)\, \widehat{\Pr} + \gamma\, U,$$

where $\gamma \in (0, 1)$ is small and $U$ is the uniform distribution over all length-$T$ sequences. We will then consider the model $\widetilde{\Pr}$, which has only a minor degradation in cross entropy compared to $\widehat{\Pr}$ and yet may have a large amplification in entropy rate.

Corollary 4.2.
(Entropy rate amplification under generations) Suppose the bound in Equation 1 holds. The mixture distribution $\widetilde{\Pr}$ has KL divergence bounded as:
We have that:
This bound shows that, in the worst case, even a small cross entropy may provide little control over the generations under our model (in terms of entropy rate). In fact, even at values of $\epsilon$ for which we may consider the model accurate, the bound is vacuous; a remark in the supplementary material shows this worst-case bound is unimprovable.
The above theorems suggest that entropy rate amplification is a theoretical possibility in the worst case, and our experiments show that it is in fact prevalent in practice. These entropy rate amplifications are evident in Figure 1. Regardless of the text corpus or the language model, we observe that the entropy rate under the model's generations quickly increases with time, indicating that this is a persistent problem even for state-of-the-art language models.
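The uniform-mixture device used above to rule out arbitrarily small probabilities can be sketched at the level of a single distribution (the smoothing weight `gamma` is illustrative):

```python
def smooth(model_probs, gamma):
    """Mix a distribution with the uniform one so every outcome has
    probability at least gamma / len(model_probs), which keeps the
    log(1/p) terms in the entropy rate bounded."""
    u = 1.0 / len(model_probs)
    return [(1 - gamma) * p + gamma * u for p in model_probs]

q = smooth([0.998, 0.002, 0.0, 0.0], gamma=0.01)
print(min(q))  # every outcome now has probability >= 0.01 / 4
```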
4.2 Model calibration
We now describe a procedure to fix this error amplification. First, for $\alpha \in \mathbb{R}$, define a distribution $\widehat{\Pr}_\alpha$ by exponentially tilting the learned model with $f$:

$$\widehat{\Pr}_\alpha(x) = \frac{\widehat{\Pr}(x)\, e^{\alpha f(x)}}{Z_\alpha}, \qquad Z_\alpha = \mathbb{E}_{X \sim \widehat{\Pr}}\left[e^{\alpha f(X)}\right].$$

We can then recover a calibrated model that does not suffer from error amplification in $f$:
Lemma 4.3.
(Calibration to $f$ with model improvement) Suppose the variance of $f$ is uniformly bounded, in that there exists $\sigma^2$ such that $\sigma^2_{\widehat{\Pr}_\alpha}(f) \le \sigma^2$ for all $\alpha$. Let $\alpha^\star = \arg\min_\alpha \mathrm{CE}(\widehat{\Pr}_\alpha)$. We have

$$\mu_{\widehat{\Pr}_{\alpha^\star}}(f) = \mu_{\Pr}(f), \qquad \mathrm{CE}(\widehat{\Pr}_{\alpha^\star}) \le \mathrm{CE}(\widehat{\Pr}) - \frac{\left(\mu_{\Pr}(f) - \mu_{\widehat{\Pr}}(f)\right)^2}{2 \sigma^2 T}.$$
Entropy rate calibration.
We can now apply the previous result to fix the entropy rate amplification seen in Table 1. Note that it would be trivial to avoid the amplification if we were allowed to degrade the quality of our model in terms of perplexity (e.g., a unigram model does not suffer from it). However, we show that it is possible to match the entropy rate without sacrificing the quality of the model. In fact, we can both improve the model and more accurately match the entropy rate by fitting a family of one-parameter models.
Theorem 4.4.
This result shows that we need only a single parameter $\alpha$ to define a new model class, a powered-up version of our original model with $\widehat{\Pr}_\alpha(x) \propto \widehat{\Pr}(x)^{\alpha}$. We can then fit $\alpha$ to minimize the cross entropy of the new model with respect to the true distribution $\Pr$, in order to eliminate the entropy rate amplification.
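As an illustration of the single-parameter fit, here it is carried out on a single next-token distribution rather than whole sequences, with a grid search standing in for a proper one-dimensional convex solver (the distributions are toy values):

```python
import math

def power_dist(p, alpha):
    """Raise probabilities to the power alpha and renormalize."""
    w = [pi ** alpha for pi in p]
    z = sum(w)
    return [wi / z for wi in w]

def cross_entropy(p_true, q):
    return -sum(pt * math.log(qt) for pt, qt in zip(p_true, q) if pt > 0)

def fit_alpha(p_true, p_model):
    """Grid-search the single exponent alpha minimizing the cross entropy
    of the powered-up model against the true distribution."""
    grid = [0.5 + 0.01 * i for i in range(200)]
    return min(grid, key=lambda a: cross_entropy(p_true, power_dist(p_model, a)))

p_true = [0.7, 0.2, 0.1]
p_model = [0.5, 0.3, 0.2]  # too flat: its generations look too random
print(fit_alpha(p_true, p_model))  # optimal alpha > 1: sharpen the model
```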
Even though this algorithm fits only a single parameter, it is not easily implementable since, at least in its exact form, it requires an integration over sequences. One future direction would be a sample-based approach. This may be an interesting alternative to ideas like beam search (Steinbiss et al., 1994; Ortmanns and Ney, 2000; Antoniol et al., 1995), which also aims to minimize a global cost function on sequences that is inconsistent with the token-level perplexity loss used to train the underlying generative model.
Lookahead algorithms.
In order to sidestep the computational issues of Algorithm 1, we provide another simple approach based on what can be viewed as a "one-step" lookahead correction (Algorithm 2). Let $X_{t+1}$ be a random variable with conditional distribution $\widehat{\Pr}(\cdot \mid X_{1:t})$, and let $H(X_{t+1} \mid x_{1:t})$ denote the entropy of this conditional distribution, i.e.

$$H(X_{t+1} \mid x_{1:t}) = \sum_{x} \widehat{\Pr}(x \mid x_{1:t}) \log \frac{1}{\widehat{\Pr}(x \mid x_{1:t})}.$$

Note that the conditioning includes the word $x_t$, so we must compute this entropy one step ahead when predicting $x_t$ using the learned model.
For a conditional distribution, let us define the lookahead model:

$$\widehat{\Pr}_\beta(x_t \mid x_{1:t-1}) \propto \widehat{\Pr}(x_t \mid x_{1:t-1})\, e^{-\beta\, H(X_{t+1} \mid x_{1:t})}.$$

Thus, $\widehat{\Pr}_\beta$ reweights $\widehat{\Pr}$ by the one-step lookahead entropy when sampling the last word (at every timestep). Intuitively, with a positive $\beta$, the resulting model suppresses sampling words that lead to larger entropy and instead encourages words that stabilize the entropy one step ahead. Therefore, if our learned language model were accurate, we would hope that the entropy rate of its generations matches its cross entropy. The following corollary shows that this is achievable, along with improving the model's perplexity.
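A minimal sketch of this lookahead reweighting, on a hypothetical three-word bigram model (the table `P` and the value of `beta` are illustrative, not from the paper's experiments):

```python
import math

# Hypothetical bigram model P(next | prev) over a three-word vocabulary.
P = {
    "a": {"a": 0.6, "b": 0.3, "c": 0.1},
    "b": {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
    "c": {"a": 0.8, "b": 0.1, "c": 0.1},
}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def lookahead_dist(prev, beta):
    """Reweight next-word probabilities by exp(-beta * H(P(. | word))):
    words whose own continuations are high-entropy get suppressed."""
    w = {x: P[prev][x] * math.exp(-beta * entropy(P[x])) for x in P[prev]}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

base = P["a"]
cal = lookahead_dist("a", beta=2.0)
# "b" leads to the highest-entropy next state, so its probability drops:
print(base["b"], "->", cal["b"])
```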
This result provides us with Algorithm 2, which is computationally quite tractable. We first use the learned model to define a new model class $\widehat{\Pr}_\beta$, which scales $\widehat{\Pr}$ by an exponential in the weighted one-step lookahead entropy. Then, similar to Algorithm 1, we simply fit the single parameter $\beta$ to minimize the cross entropy of the new model with respect to $\Pr$, which fixes the entropy amplification in the resulting model. We observe this empirically in Figure 2: our calibration results in a perplexity drop of almost 20 points over long-term generations under an LSTM model. Model and implementation details are in the supplementary material.

Generations from a calibrated model.
Table 2 provides sample generations from a calibrated Transformer model trained on the GBW dataset, compared to its original version. Qualitatively, the calibrated generations: (1) are shorter and more concise, and (2) display a better grasp of discourse structure across sentences. More generations are provided in the supplementary material.
Table 2: Sample generations from the original and calibrated models.

Example 1, original model: Actual results could differ materially from those indicated by these forward-looking statements as a result of various important factors , including , without limitation : changes in general economic and business conditions , including more difficult real estate environments ; […174 tokens…] risks related to investigations by other companies ; inadequate information systems ; the impact of reduced availability of ; * assumptions upon such companies using such as ours to gauge CNET 's financial condition ; and other factors .

Example 1, calibrated model: Actual results could differ materially from those indicated by these forward-looking statements as a result of a variety of factors , including but not limited to ( i ) the risk that the tender offer could close in one or more manner or at all ; ( ii ) risks associated with conducting business in foreign jurisdictions ; ( iii ) difficulties in combining some or all of the businesses under one roof ; ( iv ) decreased demand for electricity , natural gas and other energy products , including adverse effects on the pricing of oil and natural gas ; and ( v ) the risks associated with doing business internationally .

Example 2, original model: Bluepoint Games , Inc. is a highly experienced and multifaceted publisher of licensed virtual worlds for gamers , developers and technology professionals . […114 tokens…] James Upon , CEO of MyNetSheltetWeb and the three previous Developers of MySQL . Based in Redwood City , California , BlueMountain is the leader in franchise and game development for the massively multiplayer online game .

Example 2, calibrated model: Bluepoint Games , Inc. is a highly experienced licensing , gaming and entertainment firm focused on developing the next generation of casual games based on the PlayStation ( R ) BRAVIA family of video game machines for the North American market . Bluepoint is a wholly owned subsidiary of Bluehill ID Holdings L.P.
5 Calibration and Memory
Defining a notion of memory in language models is challenging, and multiple equally sensible notions may coexist. Here we present our choice from first principles. Let us say that $\widehat{X}_t$ is a sample from the model at time $t$, i.e. $\widehat{X}_t \sim \widehat{\Pr}(\cdot \mid X_{1:t-1})$, where the context $X_{1:t-1}$ is drawn from the true distribution. We define the memory at gap $\tau$ as the mutual information between $\widehat{X}_t$ and the distant past (those words greater than $\tau$ steps ago), conditioned on the recent subsequence $X_{t-\tau:t-1}$. Precisely,

$$I_\tau = I\left(\widehat{X}_t;\, X_{1:t-\tau-1} \mid X_{t-\tau:t-1}\right),$$

where we do not explicitly denote the $t$ dependence in this definition.^1

^1 While we may attempt to estimate this quantity for a given $t$, we can remove the $t$ dependence either by averaging over $t$ or by using appropriate stationarity assumptions. In our experiments, we average over $t$.
Intuitively, $I_\tau$ can be viewed as the amount of uncertainty (entropy) in the prediction that the model is able to reduce by utilizing the deep past $X_{1:t-\tau-1}$ in addition to the recent past $X_{t-\tau:t-1}$.
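This conditional mutual information can be computed exactly on a small synthetic example. The two toy processes below, a first-order chain (where the distant token carries no extra information about the prediction) and an order-two process (where it does), are illustrative stand-ins:

```python
import math

def cond_mutual_info(joint):
    """I(Y; A | B) from a joint distribution {(a, b, y): prob} (nats)."""
    pb, pab, pby = {}, {}, {}
    for (a, b, y), p in joint.items():
        pb[b] = pb.get(b, 0.0) + p
        pab[(a, b)] = pab.get((a, b), 0.0) + p
        pby[(b, y)] = pby.get((b, y), 0.0) + p
    return sum(p * math.log(p * pb[b] / (pab[(a, b)] * pby[(b, y)]))
               for (a, b, y), p in joint.items() if p > 0)

P = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}

# First-order chain: Y depends only on the recent token B -> zero memory.
markov = {(a, b, y): 0.5 * P[a][b] * P[b][y]
          for a in P for b in P for y in P}

# Order-two process: Y also tracks the distant token A -> positive memory.
order2 = {(a, b, y): 0.5 * P[a][b] * (0.9 if y == a else 0.1)
          for a in P for b in P for y in P}

print(cond_mutual_info(markov), cond_mutual_info(order2))
```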
The difficulty in estimating this mutual information lies in estimating the marginalized model $\widehat{\Pr}(\widehat{X}_t \mid X_{t-\tau:t-1})$. Marginalizing a model distribution over the deep past, even approximately, is statistically difficult, since it requires access to a pool of samples of the deep past that share a common recent past $X_{t-\tau:t-1}$. Nevertheless, we now show that it is possible to obtain an upper bound which is computationally efficient to estimate.
Upper bounding mutual information using calibrated models.
Above, we considered the mutual information between $\widehat{X}_t$ and $X_{1:t-\tau-1}$, conditioned on $X_{t-\tau:t-1}$. Let us now consider a more general setting with a distribution over random variables $(A, B, Y)$; we will eventually take these to be $X_{1:t-\tau-1}$, $X_{t-\tau:t-1}$, and $\widehat{X}_t$, respectively.
For a conditional distribution $q(y \mid b)$ and exponents $\alpha > 0$, define the powered-up family $q_\alpha(y \mid b) \propto q(y \mid b)^\alpha$. We say that $q$ is calibrated to $Y$ if its cross entropy is unimprovable within this family, i.e. for all $\alpha$:

$$\mathbb{E}\left[\log \frac{1}{q(Y \mid B)}\right] \le \mathbb{E}\left[\log \frac{1}{q_\alpha(Y \mid B)}\right].$$

Note this condition is achievable because calibrating a model to $Y$ involves a one-dimensional (convex) estimation problem (over $\alpha$).
Theorem 5.1.
Suppose we have a model $\widehat{\Pr}$, and suppose $q$ is a model of $Y$ that depends only on $B$. Suppose that $q$ is calibrated to $Y$. Then we have that:
Memory estimation.
We first learn another, limited-context model $q$, and then calibrate $q$ to $Y$.
Corollary 5.2.
Suppose $q$ is a model calibrated to $Y$. For the random variables defined above, we have that:
This corollary gives us a means to efficiently compute upper bounds on the mutual information. The key is that, since $q$ is efficiently computable, we can directly estimate the bound through Monte Carlo estimation. We measure the upper bounds on $I_\tau$ for an LSTM model using trained limited-memory models (see details in the supplementary material) and report them in Figure 3. As expected, the memory estimate gradually decays with larger gaps $\tau$, indicating that the models make more use of the recent past to generate text.
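A sketch of such a Monte Carlo estimate on a toy process, where the "full" and "limited" models are hypothetical stand-ins for the trained and limited-memory language models (here the limited model happens to be the exact marginal, so the estimate equals the mutual information rather than merely upper-bounding it):

```python
import math, random

random.seed(1)
TOK = ["a", "b"]

def p_full(a, b):
    """P(Y = "a" | distant A = a, recent B = b): uses the deep past."""
    return 0.9 if a == "a" else 0.2

def p_limited(b):
    """Limited-context model: A marginalized out (A is uniform here)."""
    return 0.5 * p_full("a", b) + 0.5 * p_full("b", b)

def mc_memory_bound(n):
    """Monte Carlo estimate of E[log p_full(Y|A,B) - log p_limited(Y|B)].
    With the exact marginal this equals I(Y; A | B); with a learned,
    calibrated limited model it yields an upper bound on the memory."""
    total = 0.0
    for _ in range(n):
        a, b = random.choice(TOK), random.choice(TOK)
        y_is_a = random.random() < p_full(a, b)
        pf = p_full(a, b) if y_is_a else 1 - p_full(a, b)
        pl = p_limited(b) if y_is_a else 1 - p_limited(b)
        total += math.log(pf) - math.log(pl)
    return total / n

print(mc_memory_bound(200000))  # close to I(Y; A | B), about 0.275 nats
```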

6 Conclusion
We have introduced a calibration-based approach to detect and provably correct discrepancies between the long-term generations of language models and the true distributions they estimate sequentially. In particular, for state-of-the-art neural language models, we have observed large degradations of the entropy rate under iterative generation, and proposed a first-order correction that is both computationally tractable and effective. Using the same calibration approach, we have derived estimators for the amount of information these models extract from the deep past.
Aside from the empirical findings and improvements, we hope that this work will inspire a more principled line of discourse on the quality of long-term generations in language models. It remains an interesting open problem to study other "future-aware" generation-improving heuristics (beam search, reverse language models, GANs) in this framework of calibration.
Acknowledgments
S. K. gratefully acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, the ONR award N000141812247, and NSF Award CCF1703574.
References
 Antoniol et al. (1995) Giuliano Antoniol, Fabio Brugnara, Mauro Cettolo, and Marcello Federico. Language model representations for beamsearch decoding. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 588–591. IEEE, 1995.
 Brown (1986) L. D. Brown. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayworth, CA, USA, 1986. ISBN 0940600102.
 Clarkson and Robinson (1999) Philip Clarkson and Tony Robinson. Towards improved language model evaluation measures. In Sixth European Conference on Speech Communication and Technology, 1999.
 Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). WileyInterscience, New York, NY, USA, 2006. ISBN 0471241954.
 Csiszar and Körner (2011) Imre Csiszar and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
 Dai et al. (2018) Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency. 2018.
 Dawid (1982) A. P. Dawid. The well-calibrated Bayesian. Journal of the Am. Stat. Assoc, 77, 1982.
 Dawid (1985) A. P. Dawid. The impossibility of inductive inference. Journal of the Am. Stat. Assoc, 80, 1985.
 Foster (1991) D. P. Foster. Prediction in the worst case. Annals of Statistics, 19, 1991.
 Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. FRAGE: frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pages 1341–1352, 2018.
 Grave et al. (2016) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org, 2017.
 Ji et al. (2015) Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015.
 Jost and Atwell (1994) Uwe Jost and ES Atwell. Proposal for a mutualinformation based language model. In Proceedings of the 1994 AISB Workshop on Computational Linguistics for Speech and Handwriting Recognition. AISB, 1994.
 Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 Kalai et al. (1999) E. Kalai, E. Lehrer, and R. Smorodinsky. Calibrated forecasting and merging. Games and Economic Behavior, 29, 1999.
 Ke et al. (2018) Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, pages 7651–7662, 2018.
 Khandelwal et al. (2018) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 284–294, 2018.
 Le et al. (2015) Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
 Lin and Tegmark (2017) Henry Lin and Max Tegmark. Critical behavior in physics and probabilistic formal languages. Entropy, 19(7):299, 2017.
 Marcus et al. (1993) Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. 1993.
 McAllester (2018) David McAllester. Information theoretic cotraining. arXiv preprint arXiv:1802.07572, 2018.
 Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
 Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.
 Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An Analysis of Neural Language Modeling at Multiple Scales. arXiv preprint arXiv:1803.08240, 2018.
 Mikolov and Zweig (2012) Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234–239. IEEE, 2012.
 Mroueh and Sercu (2017) Youssef Mroueh and Tom Sercu. Fisher GAN. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2513–2523. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6845fishergan.pdf.
 Müller (1997) Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443, 06 1997. doi: 10.2307/1428011.

 Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.
 Ortmanns and Ney (2000) Stefan Ortmanns and Hermann Ney. Look-ahead techniques for fast beam search. Computer Speech & Language, 14(1):15–32, 2000.

 Platt (1999) John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
 Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3uswest2. amazonaws. com/openaiassets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018.
 Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
 Shannon (1951) Claude E Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
 Sriperumbudur et al. (2009) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert Lanckriet. On integral probability metrics, phi-divergences and binary classification. 2009.
 Steinbiss et al. (1994) Volker Steinbiss, BachHiep Tran, and Hermann Ney. Improvements in beam search. In Third International Conference on Spoken Language Processing, 1994.
 Takahashi and Tanaka-Ishii (2018) Shuntaro Takahashi and Kumiko Tanaka-Ishii. Cross entropy of neural language models at infinity: a new bound of the entropy rate. Entropy, 20(11):839, 2018.
 Takase et al. (2018) Sho Takase, Jun Suzuki, and Masaaki Nagata. Direct output connection for a highrank language model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609, 2018.
 Trinh et al. (2018) Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. Learning longerterm dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.
 Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017a.
 Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017b.
 Vovk (2001) V. Vovk. Competitive online statistics. International Statistical Review, 69, 2001.
 Wang and Cho (2016) Tian Wang and Kyunghyun Cho. Largercontext language modelling with recurrent neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1319–1329, 2016.
 Wang et al. (2017) Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. Topic compositional neural language model. arXiv preprint arXiv:1712.09783, 2017.
 Zadrozny and Elkan (2002) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates, 2002.
Appendix A Proofs for Section 4
Proof.
Proof.
Proof.
For the second claim,
By Taylor’s theorem, we have:
Taking the value of the parameter which minimizes the upper bound leads to the second claim. ∎
Remark A.1.
(Sharpness) If , then there exists a problem where the bound is sharp and takes on the maximal value of . As an example, consider a model , that starts by generating words under the true distribution and has a probability of transitioning into a mode in which it generates words uniformly at random thereafter.
Proof.
Now observe that:
and, similarly,
These imply:
which completes the proof of the first claim.
Now we move on to the proof of Corollary 4.5.
Let $g$ be a function of the sequence. For a conditional distribution, let us now define:
Define:
and
Lemma A.1.
Suppose . Let
We have that:
and that
Proof.
(sketch) The proof is identical to that of Lemma 4.3, with the addition of using linearity of expectation. ∎
Appendix B Proofs for Section 5
Proof.
(of Theorem 5.1) It is convenient to define the distribution:
We then have:
by the definition of mutual information.
The proof consists of showing that:
Let us take . The zero gradient condition for the optimality at implies:
This implies:
where the last step uses the definition of and Jensen’s inequality. ∎
Appendix C Experimental Details
In this section, we outline the experimental setups used to obtain the empirical results throughout the paper. For the calibration and memory experiments (Table 1 row 1, Figure 1 (left), Figures 2, 3), our base model is a 3-layer LSTM with embedding dimension 400 and 1150 hidden units. We train it on the Penn Treebank (PTB) corpus Marcus et al. [1993], following the setup of Merity et al. [2017] and Merity et al. [2018], for 500 epochs using SGD with batch size 20 and BPTT length 70. The trained base model achieves 64.3 validation perplexity and 58.3 test perplexity.
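Recall that perplexity is the exponential of the per-token cross-entropy loss (in nats); the following minimal sketch (the loss value is illustrative, chosen to match the reported test perplexity) shows the conversion:

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)

# Illustrative only: a test loss of about 4.066 nats per token
# corresponds to the 58.3 test perplexity reported above.
print(round(perplexity(4.066), 1))  # prints 58.3
```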
The limited-memory models used for the memory estimation in Section 5 share the same architecture as our base model, but during training the hidden state is reinitialized after reading every tokens ( takes values in ).
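The reinitialization scheme can be sketched as follows; this is a toy illustration (the step function and names are placeholders, not the actual LSTM), showing only the control flow of resetting the recurrent state every n tokens:

```python
def run_limited_memory(tokens, step_fn, init_state, n):
    """Run a recurrent step function over a token stream, re-initializing
    the hidden state every n tokens so the model cannot carry information
    across segment boundaries. step_fn(state, token) -> new_state."""
    outputs = []
    state = init_state
    for i, tok in enumerate(tokens):
        if i > 0 and i % n == 0:
            state = init_state  # forget everything older than n tokens
        state = step_fn(state, tok)
        outputs.append(state)
    return outputs

# Toy step function that accumulates the tokens seen since the last reset:
states = run_limited_memory([1, 2, 3, 4, 5], lambda s, t: s + [t], [], n=2)
# Each state holds at most n tokens of history.
```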
Finally, for the entropy rate measurements of larger-scale state-of-the-art language models (Table 1 rows 2–4, Figure 1 (right)), we used the pretrained weights published alongside Jozefowicz et al. [2016], Radford et al. [2019] for rows 2 and 4, while we trained the row 3 model using the tensor2tensor framework. The model for row 2 is an LSTM with CNN-embedded inputs, trained on the Google Billion Words (GBW) corpus. The other two are Transformer Vaswani et al. [2017a] models trained on GBW (row 3) and a proprietary corpus derived from a web crawl (WebText; row 4). For GPT-2, since the authors have not published training or validation data, we used the text of several New York Times articles as a stand-in validation set; the cross-entropy loss we measure there is comparable to that reported on the validation set. The entropy rate amplification plot in Figure 1 (bottom) corresponds to the setup from row 4.
To measure the conditional entropy after generations, we measured the empirical conditional entropy of the th word over independent generations, which were produced in the standard way, by iteratively sampling from the predicted next-word conditional distribution, seeded with ground-truth text up to random points in the validation set. We used the entropy rate at as a proxy for the asymptotic limit in Table 1.
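A minimal sketch of this measurement, assuming generations have already been collected as lists of tokens (plug-in entropy estimator, no bias correction; function and variable names are illustrative):

```python
import math
from collections import Counter

def conditional_entropy_at_position(generations, t):
    """Empirical entropy (in nats) of the t-th generated word,
    estimated over a collection of independent generations."""
    words = [g[t] for g in generations]
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy example: four generations, measuring entropy at position 1.
gens = [["the", "cat"], ["the", "dog"], ["the", "cat"], ["the", "sat"]]
h = conditional_entropy_at_position(gens, 1)  # 0.5*ln2 + 2*0.25*ln4 nats
```

In practice one would sweep t over positions and plot the resulting curve against the model's reported cross-entropy.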
Appendix D Additional Generation Samples
In this section, to provide a better sense of the qualitative effect of calibration, we provide below some additional generations, seeded by 10-token prefixes of the held-out (validation) sentences from the Google Billion Words dataset. Here, we used the model we trained for row 3 of Table 1. To identify a failure mode of the uncalibrated model, we selected the seed prefixes which resulted in unusually long generations by the uncalibrated model.
Original model:
Actual results could differ materially from those indicated by these forward-looking statements as a result of numerous factors including the risks associated with the timely and efficient completion and integration of the Temporary Liquidity Guarantee Department ’s supervision into the commercial , open market , solar energy , energy efficiency , electric utility transmission , and water demands of residential and business customers , Comcast ’s ability to successfully implement its business plan , timing of completion of the acquisition and the effectiveness of the efforts and strategies involved in the integration of Rhapsody , timing of regulatory and client approvals and availability of key enhancements .

Calibrated model:
Actual results could differ materially from those indicated by these forward-looking statements as a result of a variety of factors , including but not limited to ( i ) the risk that the tender offer could close in one or more manner or at all ; ( ii ) risks associated with conducting business in foreign jurisdictions ; ( iii ) difficulties in combining some or all of the businesses under one roof ; ( iv ) decreased demand for electricity , natural gas and other energy products , including adverse effects on the pricing of oil and natural gas ; and ( v ) the risks associated with doing business internationally .

Original model:
Actual results could differ materially from those indicated by these forward-looking statements as a result of various important factors , including , without limitation : changes in general economic and business conditions , including more difficult real estate environments ; declines in information technology spending ; continued availability of capital and government regulations ; changes in general economic and business conditions ; the possibility that extended unemployment and healthcare policies may change , or may reduce access to quality care services ; failure to obtain adequate and affordable medications ; changes in certain CME / CE product mix ; disruption in CME credit markets ; uncertainty of the outcomes of regulatory investigations of companies in which the Company has an interest ; dependence on suppliers for most of its products ; consolidation among financial institutions ; ability to attract and retain skilled personnel ; changes in rapidly changing technology and regulatory environments ; arrogance and complacency among financial analysts ; the impact of competition ; inability to retain and motivate senior management ; difficulties in the integration of acquired businesses ; the effects of redundancy and loss of key employees ; litigation , including claims and the challenge of insurance practices ; uncertainties relating to litigation ; risks related to investigations by other companies ; inadequate information systems ; the impact of reduced availability of ; * assumptions upon such companies using such as ours to gauge CNET ’s financial condition ; and other factors .

Calibrated model:
Actual results could differ materially from those indicated by such forward-looking statements as a result of various important factors , including those discussed in the company ’s periodic reports that are filed with the Securities and Exchange Commission and available on the SEC ’s website at www.sec.gov.

Original model:
Actual results could differ materially from those indicated by such forward-looking statements as a result of a variety of factors , including our ability to improve our liquidity . Among these factors are changes in the general economy , changes in political and economic conditions , changes in interest rates , changes in technology and implementation of regulatory policies and legislation , the direction of interest rates and changes in the banking industry , changes in loan prepayment activity , changes in consumer preferences and consumer and business lending markets , legislation or public compliance with applicable laws and regulations and changes in the business or regulatory environment . We caution you that there are many uncertainties that could cause actual results to differ materially from those indicated in the forward-looking statements . Among them are the risk factors that could cause results to differ from those expressed in the forward-looking statements . These factors include , but are not limited to : general economic and business conditions , including the financial markets ; fluctuations in interest rates ; government regulation of the financial services industry and possible failures ; planning assumptions and estimates ; potential funding requirements ; unexpected changes in cost increases ( including goodwill impairment ) ; competition ; the potentially lengthy , protracted U.S. recession ; and migratory consumer and business conditions .

Calibrated model:
Actual results could differ materially from those indicated by these forward-looking statements as a result of various important factors , including those discussed in the ” Risk Factors ” section of the Company ’s Annual Report on Form 10-K for the most recently ended fiscal year .

Original model:
Bluepoint Games , Inc. is a highly experienced and multifaceted publisher of licensed virtual worlds for gamers , developers and technology professionals . The company is based in Vancouver , Canada . BlueKai ’s innovative games are distributed by Devices EA , LLC , and Club Penguin . BlueKai owns and is the exclusive licensor of Scrabulous . BluetoothQ Interactive Inc. has acquired JoShearSwain Media , LLC , a premier developer and publisher of community based games for the handheld game device . For further information , please visit : www.netgear.com / ngcleveld . Sprint ’s fantasy game publisher and Web doing business within the Entertainment Group is James Upon , CEO of MyNetSheltetWeb and the three previous Developers of MySQL . Based in Redwood City , California , BlueMountain is the leader in franchise and game development for the massively multiplayer online game .

Calibrated model:
Bluepoint Games , Inc. is a highly experienced gaming and entertainment company with several renowned blockbuster franchises including PC , GameHouse ( ( R ) ) GameHouse ( ( R ) ) , Heavenly Sword ( ( TM ) ) , EverQuest ( R ) , Untold Story ( TM ) and EverQuest ( R ) II . Through its wholly-owned subsidiary , Bluehill ID ( R ) , the Bluehill ID logo and tagline are registered trademarks of Bluehill ID Corporation and its subsidiaries in the U.S. and in other countries .

Original model:
Bluepoint Games , Inc. is a highly experienced gaming , entertainment and mobile games company with a vertically integrated portfolio including : games ( TM ) , social network , mobile , casual games , MMORPG , production , distribution , and licensing including its flagship games , SUIT and TIMMERIX ( TM ) , as well as its award-winning gaming , basketball and entertainment network . In order to create a highly integrated , pure and socially responsible Game ( R ) family , Bluepoint has collaborated with Amplify Systems International , Inc. on various titles for PlayStation ( R ) 2 , PLAYSTATION 3 ( R ) 5 , Wii ( TM ) 3 , PS3 , Wii ( TM ) ( and PS3 titles ) as well as PC games for PC , PSP , POOL , Wii ( TM ) ( and successor title ) and IP ( R ) , in addition to its focused gaming , entertainment and communication services . BlueBay ’s exclusive licensee worldwide licensee of the Bluepoint ( TM ) ZMFAO Gateway series , it is the world ’s leading portable gaming , PC and mobile phone company . For more information , see UNK , Inc. and ” Oakpoint : ZWC ’s Community Health Business Development Center .

Calibrated model:
Bluepoint Games , Inc. is a highly experienced licensing , gaming and entertainment firm focused on developing the next generation of casual games based on the PlayStation ( R ) BRAVIA family of video game machines for the North American market . Bluepoint is a wholly owned subsidiary of Bluehill ID Holdings L.P.

Original model:
Bluepoint Games , Inc. is a highly experienced , innovative entertainment sports gaming company whose products and services are used by some of the most recognized and respected names in the world of gaming including : Pokemon , Macau ( Valve ) , Quattro , Super Smash Bros. , Good Neighbor Games , IGN Games , Vail Resorts , Kania ( Ocean Spray , Pemberton and Roatenham ) , PURE Holdings , TeenNick , National Amusements , SEGA Games , Cirrus ( Aircraft ) and www.netapool.com.

Calibrated model:
Bluepoint Games , Inc. is a highly experienced player in the growing genre of casual games for both casual and active gaming enthusiasts . Bluepoint is an early stage Company with a significant following among youth and adults in Europe and the United States with an impressive track record in global online gaming opportunities .

Original model:
Nursing Homes : Genworth ’s 2009 Cost of Care Survey , conducted by the Robert Wood Johnson Foundation and released today , reveals the extent to which members of the U.S. population adheres to practices recommended since 1995 , including : a rolling three-hour ” Python for Life ” that fell asleep from 11 p.m. to 2 a.m. , sleep time from 11 p.m. to 3 a.m. , spare time from 8 a.m. to 9 p.m. , and use of state-of-the-art noninvasive technologies . A remodeling and refurbishment of hospital facilities is underway as the nation ’s economy begins to gain momentum . Similar to the previous years , Thinking About Health - Hear how health plans are working to address various congressional proposals to advance best practices in patient care and provide greater accountability , advocacy and transparency to consumers .

Calibrated model:
Nursing Homes : Genworth ’s 2009 Cost of Care Survey is based on interviews with 516 family , friends and neighbors of insured and self-employed people conducted from Jan .

Original model:
Nursing Homes : Genworth ’s 2009 Cost of Care Survey is based on a double-blind , randomized , double-blind , placebo-controlled survey which involved an assessment of the cost-effectiveness of healthcare associated with an adequate diet and regular physical activity compared to its managed-care counterparts . The margin of error for this survey is +/- 3.3 percentage points at the 95 percent level of confidence .

Calibrated model:
Nursing Homes : Genworth ’s 2009 Cost of Care Survey , conducted by Harris Interactive , performed significantly worse than a control group of its peers who provided care but were not able to offer health care to their employees .

Original model:
Nursing Homes : Genworth ’s 2009 Cost of Care Survey , conducted by CareScout ( R ) and published in the April 2009 issue , evaluated findings from the 10-year , nearly 900,000-member Specialty Health Management Association ’s more than 6,000 professionals living in the United States .

Calibrated model:
Nursing Homes : Genworth ’s 2009 Cost of Care Survey includes a series of health and medical cost reports on more than 100 home medical equipment and related products , including more than 3.9 million units of durable medical equipment . IBC ’s cost of more than $ 100 billion is a significant portion of Medicare spending on home health care .