1 Introduction

Recent advances in language modeling have resulted in significant breakthroughs on a wide variety of benchmarks in natural language processing Dai et al. (2018); Gong et al. (2018); Takase et al. (2018). Capturing long-term dependencies has especially been a major focus, with approaches ranging from explicit memory-based neural networks Grave et al. (2016); Ke et al. (2018) to optimization improvements aimed at stabilizing training Le et al. (2015); Trinh et al. (2018). In this paper, we address a basic question: how do the long-term dependencies in a language model's generations compare to those of the underlying language? Furthermore, if there are measurable discrepancies, this raises the question of whether and how we can use them to improve these models.
Starting from Shannon’s seminal work that essentially introduced statistical language modeling Shannon (1951), the most classical and widely studied long-term property of a language model is its entropy rate — the average amount of information contained per word, conditioned on the preceding words. A learned model provides an upper bound for the entropy rate of a language, via its cross-entropy loss. The exponential of the entropy rate can be interpreted as the effective support size of the distribution of the next word (intuitively, the average number of “plausible” word choices to continue a document), and the perplexity score of a model (the exponential of the cross entropy loss) is an upper bound for this quantity. In state-of-the-art models trained on billion-scale corpora, this number ranges between 10 and 30 Melis et al. (2017); Radford et al. (2019). A natural diagnostic question, with which we begin our work, is whether the long-term generations of these models exhibit the same entropy rates as the underlying languages they are modeling predictively.
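To make the "effective support size" interpretation concrete, here is a small illustrative sketch (our own, not from any cited implementation) that converts per-token log-losses into an entropy rate and its exponential:

```python
import math

def entropy_rate(token_log_losses):
    """Average per-token cross-entropy, in nats."""
    return sum(token_log_losses) / len(token_log_losses)

def effective_support(token_log_losses):
    """exp(entropy rate): roughly the average number of plausible
    choices for the next word."""
    return math.exp(entropy_rate(token_log_losses))

# A model that, on average, assigns each observed token probability 1/20
losses = [math.log(20.0)] * 1000
print(effective_support(losses))  # ~20 plausible next-word choices
```

Perplexity reported in base 2 or base e both work here, as long as the exponential uses the matching base.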
| Model | Corpus | Test perplexity | Perplexity of generations |
| --- | --- | --- | --- |
| AWD-LSTM (Merity et al., 2017) | PTB | 58.3 | 93.1 |
| CNN-LSTM (Jozefowicz et al., 2016) | GBW | 29.8 | 49.4 |
| Transformer (Vaswani et al., 2017b) | GBW | 28.1 | 34.7 |
| Transformer (Radford et al., 2019) | WebText | 23.7 | 61.2 |
Empirically, and perhaps surprisingly, it turns out that the entropy rate of generated text is substantially higher than the estimate for true text derived from the model's one-step predictions. As seen in Table 1 (see also Figure 1), this is true for both state-of-the-art LSTMs and Transformers trained on a variety of datasets. As a timely example, the GPT-2 model Radford et al. (2019), the object of much recent attention for its seemingly coherent and on-topic generations, suffers a dramatic degradation in its entropy rate under generation, from 23.7 to 61.2 in terms of perplexity (Table 1). We defer the details of this experiment to the supplementary material.
This empirical finding is notable since the neural attention- and memory-based techniques have been steadily improving on standard metrics like perplexity and, in some cases, even produce remarkably coherent text (often with some heuristics to reject poor generations). That the perplexity of generated text is so much higher than it is under the true distribution suggests that there are significant gaps in our current methodologies in accurately learning language models, particularly if we are interested in generating text that globally resembles the modeled language itself.
The focus of this work is twofold: to improve generations based on any measured mismatch in a long-term property of the model (e.g. the entropy rate), and to quantify how a model's predictions depend on the distant past. Central to both is a calibration-based approach, as utilized in statistics and other areas of machine learning Dawid (1982, 1985); Foster (1991); Zadrozny and Elkan (2002); Platt (1999); Guo et al. (2017); Niculescu-Mizil and Caruana (2005).
First, we show that, from a worst-case perspective, even an extremely accurate model (with an average per-word KL divergence of $\epsilon$ from the true distribution) may generate text whose entropy rate differs substantially from that of the true distribution. Indeed, we show that this worst-case amplification may occur for a variety of long-term properties of a probabilistic language model; this is because the one-step KL divergence does not in general provide tight control over the expectation of a bounded function. The observed entropy rate amplification (as seen in Table 1) demonstrates that this is not only of theoretical concern. We then describe a calibration procedure to fix this mismatch while simultaneously improving the perplexity of the language model. From a statistical perspective, the procedure is simple, and we discuss approaches to make it computationally efficient.
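The mixture construction behind this worst case can be simulated numerically. The following single-token example (our own toy construction, with an arbitrary vocabulary size) exhibits a model whose cross-entropy degrades by only about 0.01 nats while its entropy grows by an order of magnitude more:

```python
import math

M = 10_000       # vocabulary size (arbitrary choice for illustration)
gamma = 0.01     # weight on the uniform component

# True distribution and base model: all mass on a single word (entropy 0)
p = [1.0] + [0.0] * (M - 1)
q_mix = [(1 - gamma) * pi + gamma / M for pi in p]

cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q_mix) if pi > 0)
entropy_mix = -sum(qi * math.log(qi) for qi in q_mix if qi > 0)

# Cross-entropy rises by only -log(1 - gamma) ~ 0.01 nats, while the
# entropy of the model's own samples rises by roughly gamma * log(M / gamma)
print(cross_entropy, entropy_mix)
```

Sampling from `q_mix` looks nearly perfect one step at a time, yet its generations carry far more entropy than the (deterministic) true source.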
Second, we provide a definition for long-term memory in language models as the mutual information between the model’s predictions and the distant past in the input. We then provide an upper bound on the amount of this mutual information using calibrated distributions (with a single-parameter exponent). This allows us to estimate the amount of context used by a language model as a function of the distance of past tokens from the current prediction timestep.
We perform empirical studies to accompany our theoretical results. We first use the entropy rate calibration algorithm to fix an LSTM language model, resulting in a drop of around 20 perplexity points in the generated text (so that the entropy rate of the model more accurately matches that of the language itself). Then, we empirically estimate and compare the long-term memory of state-of-the-art language models. Our insights point towards new ways of assessing (and fixing) language models, especially in terms of their long-term properties, in a manner complementary to existing metrics like perplexity.
2 Related Work
Improving language modeling with long-term dependencies.
Recent approaches to improving language modeling have focused on several ways to better capture long-term dependencies, from using manually-defined context representations Mikolov and Zweig (2012); Ji et al. (2015); Wang and Cho (2016) or document-level topics Wang et al. (2017) to using LSTM recurrent neural networks with careful initialization Le et al. (2015), auxiliary loss signals Trinh et al. (2018), or augmented memory structures Grave et al. (2016); Ke et al. (2018). More recent work has demonstrated the applicability of Transformer networks Vaswani et al. (2017a) to the task, potentially side-stepping issues in training recurrent networks (e.g. vanishing/exploding gradients) and scaling to longer contexts Dai et al. (2018); Radford et al. (2018). All these papers propose either architectural or optimization innovations to improve language model training. In contrast, we define and measure explicit long-term properties of language models and show that calibrating them correctly can provide improvements to any black-box language model.
While most language models aim to predict a distribution over the next token conditioned on the context, there have been alternative approaches relying on information-theoretic measures. Jost and Atwell (1994) propose a model which makes use of mutual information between word pairs to generate word sequences that retain longer-term dependencies. McAllester (2018) propose a training objective based on mutual information for predictive modeling, and demonstrate its application for phoneme prediction. Clarkson and Robinson (1999) develop a hybrid metric using both perplexity and entropy rate, and show that it correlates better with a downstream metric like word error rate. Such works propose alternative optimization objectives; in contrast, we show how to use information-theoretic measures to improve models with respect to existing objectives like cross-entropy.
Measuring long-term statistics.
Khandelwal et al. (2018) analyze LSTM-based language models and empirically show that such models make use of a finite context for prediction. Lin and Tegmark (2017) measure mutual information between any two symbols in human languages, and show that it decays with distance, roughly following a power law distribution. Takahashi and Tanaka-Ishii (2018) provide an upper bound for the entropy (character-level) of human languages by training neural language models with various context and data sizes and extrapolating to infinity. While we also make use of measures like entropy and mutual information across longer contexts, our goal is to use these to better calibrate the language model and provably improve its perplexity.
Calibration and integral probability metrics.
The idea of matching properties of a model's predictions to empirical outcomes, in an online setting, goes back (at least) to the "prequential principle" of Dawid (1982, 1985), with subsequent work in online and game-theoretic settings Foster (1991); Vovk (2001); Kalai et al. (1999). The idea of improving probability scores is also common in machine learning Zadrozny and Elkan (2002); Platt (1999); Guo et al. (2017); Niculescu-Mizil and Caruana (2005). The notion of examining the expectation of functions as a metric for the distance between two distributions sometimes goes under the name of integral probability metrics Müller (1997); Sriperumbudur et al. (2009), and this notion is becoming increasingly relevant again in unsupervised learning through connections to GANs Mroueh and Sercu (2017). In this work, we directly focus on the KL divergence, where our use of calibration is largely based on basic facts about exponential families Brown (1986).
3 Preliminaries

We first define some useful quantities for our analyses. Let $\Pr$ represent the true underlying distribution over length-$T$ sequences of words, where the vocabulary is of size $M$. Let $W_{1:T} = (W_1, \ldots, W_T)$ denote a random sequence of length $T$, with distribution $\Pr$. For clarity of exposition, we assume that all sequences (i.e. sentences or documents or books) are of equal length $T$.
For any distributions $P$ and $Q$ over length-$T$ sequences, recall that the entropy, KL divergence, and entropy rate are, respectively, defined by:

$$H(P) = \mathbb{E}_{W \sim P}\left[\log \frac{1}{P(W)}\right], \qquad \mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{W \sim P}\left[\log \frac{P(W)}{Q(W)}\right], \qquad \mathrm{EntRate}(P) = \frac{1}{T} H(P).$$

Let $\widehat{\Pr}$ denote a learned distribution over sequences. In the typical sequential prediction setting, the probabilistic model is implicitly defined by the conditional distributions $\widehat{\Pr}(W_t \mid W_{1:t-1})$, which are typically efficiently computable. It is standard for such a language model to be trained to minimize the cross-entropy objective:

$$\mathrm{CE}(\widehat{\Pr}) = \mathbb{E}_{W_{1:T} \sim \Pr}\left[\frac{1}{T}\log \frac{1}{\widehat{\Pr}(W_{1:T})}\right].$$
Note that for an accurate language model, we would hope that $\mathrm{EntRate}(\widehat{\Pr}) \approx \mathrm{CE}(\widehat{\Pr})$, i.e. that the entropy rate of the sequences generated under the learned model is nearly the cross entropy of the model (with respect to the true distribution $\Pr$).
Throughout, we assume that

$$\frac{1}{T}\,\mathrm{KL}(\Pr \,\|\, \widehat{\Pr}) \le \epsilon \tag{1}$$

holds for some $\epsilon \ge 0$. In other words, the (unknown) $\epsilon$ measures the degree of sub-optimality of the learned model; this is often referred to as the Bayes regret.
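For concreteness, these definitions (in the single-token case, $T = 1$) and the identity $\mathrm{CE} = H + \mathrm{KL}$ behind the Bayes-regret interpretation can be checked directly; a minimal sketch:

```python
import math

def H(p):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def KL(p, q):
    """KL divergence KL(p || q), assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" next-word distribution
q = [0.4, 0.4, 0.2]     # model

# Cross-entropy decomposes as entropy plus KL: the model's excess loss
# over H(p) is exactly the per-word KL term bounded in Equation 1.
assert abs(cross_entropy(p, q) - (H(p) + KL(p, q))) < 1e-12
```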
4 Calibration and Entropy Rates
In this section, we assess the long-term properties of language models when generating text. Specifically, we quantify the amplification in the entropy rate of generations under an $\epsilon$-accurate model (Eq. 1). We then provide a procedure to fix this amplification, without increasing the perplexity of the model. Proofs for all statements are provided in the supplementary material.
For generality, consider a function $f$ defined on length-$T$ sequences. Let the mean and variance of $f$ under distribution $P$ be denoted by $\mu_f(P) = \mathbb{E}_{W \sim P}[f(W)]$ and $\sigma_f^2(P) = \mathrm{Var}_{W \sim P}[f(W)]$, respectively.
4.1 Error amplification under our model
If our learned model is accurate, we may hope that $\mu_f(\widehat{\Pr}) \approx \mu_f(\Pr)$, i.e. that the expected value of $f$ under the true distribution is close to its expected value under our model. We can quantify this gap as follows:

(Pinsker's Inequality, Csiszar and Körner (2011)) Suppose that for all $w$, $|f(w)| \le B$. Then:

$$\big|\mu_f(\Pr) - \mu_f(\widehat{\Pr})\big| \le B\sqrt{2\,\mathrm{KL}(\Pr \,\|\, \widehat{\Pr})} \le B\sqrt{2T\epsilon}.$$

Since this holds for any bounded function, we can obtain the error amplification of the entropy rate of $\widehat{\Pr}$ simply by choosing $f(w) = \frac{1}{T}\log\frac{1}{\widehat{\Pr}(w)}$.
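The inequality is easy to check numerically; here is a toy single-token example of our own:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]          # "true" distribution
q = [0.5, 0.3, 0.2]          # model
f = [0.0, 0.5, 1.0]          # any function bounded by B = 1

# Gap in expectations vs. the Pinsker-style bound B * sqrt(2 KL)
gap = abs(sum(fi * (pi - qi) for fi, pi, qi in zip(f, p, q)))
bound = 1.0 * math.sqrt(2 * kl(p, q))
print(gap, bound)
assert gap <= bound
```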
Before we proceed, in order to rule out amplification of this entropy rate due to arbitrarily small probabilities (which can blow up $\log\frac{1}{\widehat{\Pr}(w)}$), it is helpful to define the $\gamma$-mixture distribution as $\widehat{\Pr}_\gamma = (1-\gamma)\,\widehat{\Pr} + \gamma\,\mathrm{Uni}$, where $\mathrm{Uni}$ is the uniform distribution over all $M^T$ sequences. We will then consider the model $\widehat{\Pr}_\gamma$, which has only a minor degradation in the cross entropy compared to $\widehat{\Pr}$ and, yet, may have a large amplification in the entropy rate.
(Entropy rate amplification under generations) Suppose the bound in Equation 1 holds. The $\gamma$-mixture distribution has KL bounded as:

$$\frac{1}{T}\,\mathrm{KL}(\Pr \,\|\, \widehat{\Pr}_\gamma) \le \epsilon + \frac{1}{T}\log\frac{1}{1-\gamma}.$$

We also have that:

$$\mathrm{EntRate}(\widehat{\Pr}_\gamma) \ge (1-\gamma)\,\mathrm{EntRate}(\widehat{\Pr}) + \gamma \log M.$$
This bound shows that, in the worst case, even a small cross entropy may provide little control over the generations under our model (in terms of entropy rate). In fact, for $\epsilon \gtrsim 1/T$ (which we may hope is an accurate model), the bound is vacuous; the following remark shows that this worst-case bound is unimprovable (see the supplementary material).
The above theorems show that entropy rate amplification is a theoretical possibility in the worst case, and our experiments show that it is in fact prevalent in practice. These entropy rate amplifications are evident from the plots in Figure 1. Regardless of the text corpus or the language model, we observe that the entropy rate under the model's generations quickly increases with time, indicating that this is a persistent problem even for state-of-the-art language models when generating text.
4.2 Model calibration
We now describe a procedure to fix this error amplification. First, let us define a one-parameter family of distributions $\widehat{\Pr}_\beta$ such that:

$$\widehat{\Pr}_\beta(w) = \frac{\widehat{\Pr}(w)\, e^{\beta f(w)}}{Z_\beta}, \qquad Z_\beta = \sum_{w'} \widehat{\Pr}(w')\, e^{\beta f(w')}.$$

We can then recover a calibrated model $\widehat{\Pr}_{\beta^*}$ that does not suffer from error amplification in $f$:
(Calibration to $f$ with model improvement) Suppose the variance of $f$ is uniformly bounded, in that there exists $\sigma^2$ such that for all $\beta$, $\sigma_f^2(\widehat{\Pr}_\beta) \le \sigma^2$. Let $\beta^* = \arg\min_\beta\, \mathrm{KL}(\Pr \,\|\, \widehat{\Pr}_\beta)$. We have

$$\mu_f(\widehat{\Pr}_{\beta^*}) = \mu_f(\Pr) \quad \text{and} \quad \mathrm{KL}(\Pr \,\|\, \widehat{\Pr}_{\beta^*}) \le \mathrm{KL}(\Pr \,\|\, \widehat{\Pr}) - \frac{\big(\mu_f(\Pr) - \mu_f(\widehat{\Pr})\big)^2}{2\sigma^2}.$$
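This calibration step can be sketched on explicit single-token distributions (a toy example of our own; a grid search stands in for the one-dimensional convex fit over the tilt parameter):

```python
import math

def tilt(q, f, beta):
    """One-parameter exponential tilt: q_beta(w) ~ q(w) * exp(beta * f(w))."""
    w = [qi * math.exp(beta * fi) for qi, fi in zip(q, f)]
    z = sum(w)
    return [wi / z for wi in w]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]           # "true" distribution
q = [0.3, 0.4, 0.3]           # imperfect model
f = [0.0, 0.5, 1.0]           # property we want to calibrate

# Fit beta by minimizing cross-entropy over a simple grid
betas = [b / 100 for b in range(-500, 501)]
beta_star = min(betas, key=lambda b: cross_entropy(p, tilt(q, f, b)))
q_cal = tilt(q, f, beta_star)

mean = lambda d: sum(fi * di for fi, di in zip(f, d))
# Calibrated model: matches E_p[f] and is no worse in cross-entropy
print(mean(p), mean(q_cal))
assert cross_entropy(p, q_cal) <= cross_entropy(p, q) + 1e-9
```

At the fitted optimum, the first-order condition forces the tilted model's mean of `f` to equal the true mean, which is exactly the calibration property above.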
Entropy rate calibration.
We can now apply the previous result to fix the entropy rate amplification seen in Table 1. Note that it would be trivial to avoid the entropy rate amplification if we were allowed to degrade the quality of our model in terms of perplexity (e.g. a unigram model does not have this amplification). However, we show that it is possible to match the entropy rate without sacrificing the quality of our model. In fact, we can both improve our model and more accurately match the entropy rate, by fitting a family of one-parameter models.
This result shows that we need only a single parameter to define a new model class, a powered-up version of our original model. We can then fit this parameter to minimize the cross-entropy of the new model with respect to the true distribution $\Pr$, in order to eliminate the entropy rate amplification.

Even though this algorithm fits only a single parameter, it is not easily implementable in its exact form, since it requires an integration over sequences. One future direction would be to use a sample-based approach. This may be an interesting alternative to ideas like beam search Steinbiss et al. (1994); Ortmanns and Ney (2000); Antoniol et al. (1995), which also aims to minimize a global cost function on sequences that is inconsistent with the token-level perplexity loss used to train the underlying generative model.
Let $W_t$ be a random variable with conditional distribution $\widehat{\Pr}(\cdot \mid w_{1:t-1})$, and let $H\big(\widehat{\Pr}(\cdot \mid w_{1:t})\big)$ denote the entropy of the conditional distribution one step ahead, i.e.

$$H\big(\widehat{\Pr}(\cdot \mid w_{1:t})\big) = \sum_{w} \widehat{\Pr}(w \mid w_{1:t}) \log \frac{1}{\widehat{\Pr}(w \mid w_{1:t})}.$$

Note that $w_{1:t}$ includes the word $w_t$, so we require computing the entropy at time $t+1$ when predicting $w_t$ using a learned model.

For a conditional distribution $\widehat{\Pr}(\cdot \mid w_{1:t-1})$, let us define:

$$\widehat{\Pr}_\beta(w_t \mid w_{1:t-1}) = \frac{\widehat{\Pr}(w_t \mid w_{1:t-1})\, e^{-\beta H(\widehat{\Pr}(\cdot \mid w_{1:t}))}}{Z_\beta(w_{1:t-1})}.$$

Thus, $\widehat{\Pr}_\beta$ reweights $\widehat{\Pr}$ by the weighted 1-step lookahead entropy when sampling the last word (at every timestep). Intuitively, the resulting model with a positive $\beta$ would suppress sampling words leading to larger entropy and instead encourage words that stabilize the entropy one step ahead in the future. Therefore, if our learned language model were accurate, we would hope that the entropy rate of the calibrated model matches its cross entropy. The following corollary shows that this is achievable, along with improving the model's perplexity.
This result provides us with Algorithm 2, which is computationally quite tractable. We first use the learned model $\widehat{\Pr}$ to define a new model class $\widehat{\Pr}_\beta$, which scales $\widehat{\Pr}$ by an exponential in the weighted 1-step lookahead entropy. Then, similarly to Algorithm 1, we simply fit the single parameter $\beta$ to minimize the cross-entropy of the new model with respect to $\Pr$, which fixes the entropy amplification in the resulting model. We observe this empirically in Figure 2: our calibration results in a perplexity drop of almost 20 points over long-term generations under an LSTM model. Model and implementation details are in the supplementary material.
Generations from a calibrated model.
Table 2 provides sample generations from a calibrated Transformer model trained on the GBW dataset, compared to its original version. Qualitatively, the calibrated generations: (1) are shorter and more concise, and (2) display a better grasp of discourse structure across sentences. More generations are provided in the supplementary material.
| Original model | Calibrated model |
| --- | --- |
| Actual results could differ materially from those indicated by these forward-looking statements as a result of various important factors , including , without limitation : changes in general economic and business conditions , including more difficult real estate environments ; […174 tokens…] risks related to investigations by other companies ; inadequate information systems ; the impact of reduced availability of ; * assumptions upon such companies using such as ours to gauge CNET ’s financial condition ; and other factors . | Actual results could differ materially from those indicated by these forward-looking statements as a result of a variety of factors , including but not limited to ( i ) the risk that the tender offer could close in one or more manner or at all ; ( ii ) risks associated with conducting business in foreign jurisdictions ; ( iii ) difficulties in combining some or all of the businesses under one roof ; ( iv ) decreased demand for electricity , natural gas and other energy products , including adverse effects on the pricing of oil and natural gas ; and ( v ) the risks associated with doing business internationally . |
| Bluepoint Games , Inc. is a highly experienced and multi-faceted publisher of licensed virtual worlds for gamers , developers and technology professionals . […114 tokens…] James Upon , CEO of MyNetSheltetWeb and the three previous Developers of MySQL . Based in Redwood City , California , BlueMountain is the leader in franchise and game development for the massively multiplayer online game . | Bluepoint Games , Inc. is a highly experienced licensing , gaming and entertainment firm focused on developing the next generation of casual games based on the PlayStation ( R ) BRAVIA family of video game machines for the North American market . Bluepoint is a wholly owned subsidiary of Bluehill ID Holdings L.P. |
5 Calibration and Memory
Defining a notion of memory in language models is challenging, and multiple equally sensible notions may co-exist. Here we present our choice from first principles. Let us say that $W_t$ is a sample from a model $\widehat{\Pr}$ at time $t$, i.e. $W_t \sim \widehat{\Pr}(\cdot \mid W_{1:t-1})$. Let us also assume that $W_{1:t-1} \sim \Pr$. We will define the memory at gap $\tau$ as the mutual information $I_\tau$ between $W_t$ and the distant past (those words greater than $\tau$ steps ago) conditioned on the recent subsequence $W_{t-\tau:t-1}$. Precisely,

$$I_\tau = I\big(W_t \,;\, W_{1:t-\tau-1} \mid W_{t-\tau:t-1}\big),$$

where we are not explicitly denoting the $t$ dependence in this definition. (While we may attempt to estimate $I_\tau$ for a given $t$, we can remove the $t$ dependence either by defining this quantity with an average over $t$ or by using appropriate stationarity assumptions; in our experiments, we average over $t$.)
Intuitively, $I_\tau$ can be viewed as how much uncertainty (entropy) in the prediction $W_t$ the model is able to reduce by utilizing the deep past $W_{1:t-\tau-1}$ in addition to the recent past $W_{t-\tau:t-1}$.
The difficulty in estimating this mutual information lies in estimating the conditional entropy of $W_t$ given only the recent past, which requires the marginalized model $\widehat{\Pr}(W_t \mid W_{t-\tau:t-1})$. Marginalizing a model distribution over the deep past, even approximately, is statistically difficult, since it requires access to a pool of samples of the deep past that share a common recent past $W_{t-\tau:t-1}$. Nevertheless, we now show that it is possible to obtain an upper bound which is computationally efficient to estimate.
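For intuition, conditional mutual information can be computed exactly when the joint distribution is small and explicit. The following is a toy construction of our own (not the paper's estimator), with binary stand-ins for the deep past, prediction, and recent past:

```python
import math

# Joint distribution p(x, y, z) over binary variables, as a dict:
# x = deep past, y = next-token prediction, z = recent past
p = {}
for x in (0, 1):
    for z in (0, 1):
        for y in (0, 1):
            # y copies x with probability 0.9, independently of z
            p[(x, y, z)] = 0.25 * (0.9 if y == x else 0.1)

def marg(p, keep):
    out = {}
    for (x, y, z), v in p.items():
        key = tuple(w for w, k in zip((x, y, z), "xyz") if k in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def cond_mi(p):
    """I(X; Y | Z) = sum p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]."""
    pz, pxz, pyz = marg(p, "z"), marg(p, "xz"), marg(p, "yz")
    total = 0.0
    for (x, y, z), v in p.items():
        if v > 0:
            total += v * math.log(v * pz[(z,)] / (pxz[(x, z)] * pyz[(y, z)]))
    return total

print(cond_mi(p))  # > 0: the deep past x carries information about y
```

With realistic vocabularies and contexts this exact sum is intractable, which is why the marginalization step above is the bottleneck.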
Upper bounding mutual information using calibrated models.
In the above, we were considering the mutual information between $W_t$ and $W_{1:t-\tau-1}$ conditioned on $W_{t-\tau:t-1}$. Let us now consider a more general setting, with a joint distribution over random variables $X$, $Y$, and $Z$. We will eventually take $X$, $Y$, and $Z$ to be $W_{1:t-\tau-1}$, $W_t$, and $W_{t-\tau:t-1}$, respectively.
For conditional distributions $P(\cdot \mid z)$ and $Q(\cdot \mid z)$ and for $\alpha \in \mathbb{R}$, define the single-parameter family

$$Q_\alpha(y \mid z) = \frac{Q(y \mid z)^{\alpha}}{\sum_{y'} Q(y' \mid z)^{\alpha}}.$$

We say that $Q$ is calibrated to $P$ if $\alpha = 1$ is unimprovable, in that for all $\alpha$,

$$\mathbb{E}_{P}\left[\log \frac{1}{Q(Y \mid Z)}\right] \le \mathbb{E}_{P}\left[\log \frac{1}{Q_\alpha(Y \mid Z)}\right].$$

Note that this condition is achievable because calibrating a model $Q$ to $P$ involves a one-dimensional (convex) estimation problem (over $\alpha$).
Suppose we have a model $\widehat{\Pr}(y \mid x, z)$ and that $Y \sim \widehat{\Pr}(\cdot \mid x, z)$. Suppose $\widehat{Q}(y \mid z)$ is a model dependent only on $z$, and that $\widehat{Q}$ is calibrated to the distribution of $(Y, Z)$. Then we have that:

$$I(X; Y \mid Z) \;\le\; \mathbb{E}\left[\log \frac{1}{\widehat{Q}(Y \mid Z)}\right] - \mathbb{E}\left[H\big(\widehat{\Pr}(\cdot \mid X, Z)\big)\right].$$

In practice, we first learn another model $\widehat{Q}(y \mid z)$ that conditions only on the recent past, and then calibrate $\widehat{Q}$ as above.

Suppose $\widehat{Q}$ is a model calibrated as above. For the random variables $W_t$, $W_{1:t-\tau-1}$, and $W_{t-\tau:t-1}$ in our setting, we have that:

$$I_\tau \;\le\; \mathbb{E}\left[\log \frac{1}{\widehat{Q}(W_t \mid W_{t-\tau:t-1})}\right] - \mathbb{E}\left[H\big(\widehat{\Pr}(\cdot \mid W_{1:t-1})\big)\right].$$
This corollary gives us a means to efficiently compute upper bounds on the mutual information. The key is that, since $\widehat{Q}$ is efficiently computable, we can directly estimate the upper bound through Monte Carlo estimation. We measure the upper bounds on $I_\tau$ for an LSTM model using trained limited-memory models (see details in the supplementary material) and report them in Figure 3. As expected, the memory estimate gradually decays with larger gaps $\tau$, indicating that the models make more use of the recent past to generate text.
6 Conclusion

We have introduced a calibration-based approach to detect and provably correct discrepancies between the long-term generations of language models and the true distributions they estimate sequentially. In particular, for state-of-the-art neural language models, we have observed large degradations of the entropy rate under iterative generation, and proposed a first-order correction which is both computationally tractable and effective. Using the same calibration approach, we have derived estimators for the amount of information extracted by these models from the deep past.
Aside from the empirical findings and improvements, we hope that this work will inspire a more principled line of discourse on the quality of long-term generations in language models. It remains an interesting open problem to study other "future-aware" generation-improving heuristics (beam search, reverse language models, GANs) in this framework of calibration.
Acknowledgments

S. K. gratefully acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, the ONR award N00014-18-1-2247, and NSF Award CCF-1703574.
References

- Antoniol et al. (1995) Giuliano Antoniol, Fabio Brugnara, Mauro Cettolo, and Marcello Federico. Language model representations for beam-search decoding. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 588–591. IEEE, 1995.
- Brown (1986) L. D. Brown. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayworth, CA, USA, 1986. ISBN 0-940-60010-2.
- Clarkson and Robinson (1999) Philip Clarkson and Tony Robinson. Towards improved language model evaluation measures. In Sixth European Conference on Speech Communication and Technology, 1999.
- Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006. ISBN 0471241954.
- Csiszar and Körner (2011) Imre Csiszar and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
- Dai et al. (2018) Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency. 2018.
- Dawid (1982) A. P. Dawid. The well-calibrated bayesian. Journal of the Am. Stat. Assoc, 77, 1982.
- Dawid (1985) A. P. Dawid. The impossibility of inductive inference. Journal of the Am. Stat. Assoc, 80, 1985.
- Foster (1991) D. P. Foster. Prediction in the worst case. Annals of Statistics, 19, 1991.
- Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pages 1341–1352, 2018.
- Grave et al. (2016) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.
- Ji et al. (2015) Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015.
- Jost and Atwell (1994) Uwe Jost and ES Atwell. Proposal for a mutual-information based language model. In Proceedings of the 1994 AISB Workshop on Computational Linguistics for Speech and Handwriting Recognition. AISB, 1994.
- Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- Kalai et al. (1999) E. Kalai, E. Lehrer, and R. Smorodinsky. Calibrated forecasting and merging. Games and Economic Behavior, 29, 1999.
- Ke et al. (2018) Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, pages 7651–7662, 2018.
- Khandelwal et al. (2018) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 284–294, 2018.
- Le et al. (2015) Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
- Lin and Tegmark (2017) Henry Lin and Max Tegmark. Critical behavior in physics and probabilistic formal languages. Entropy, 19(7):299, 2017.
- Marcus et al. (1993) Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 1993.
- McAllester (2018) David McAllester. Information theoretic co-training. arXiv preprint arXiv:1802.07572, 2018.
- Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
- Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.
- Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An Analysis of Neural Language Modeling at Multiple Scales. arXiv preprint arXiv:1803.08240, 2018.
- Mikolov and Zweig (2012) Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234–239. IEEE, 2012.
- Mroueh and Sercu (2017) Youssef Mroueh and Tom Sercu. Fisher gan. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2513–2523. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6845-fisher-gan.pdf.
- Müller (1997) Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443, 06 1997. doi: 10.2307/1428011.
- Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM, 2005.
- Ortmanns and Ney (2000) Stefan Ortmanns and Hermann Ney. Look-ahead techniques for fast beam search. Computer Speech & Language, 14(1):15–32, 2000.
- Platt (1999) John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/research-covers/languageunsupervised/language understanding paper. pdf, 2018.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- Shannon (1951) Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951.
- Sriperumbudur et al. (2009) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert Lanckriet. On integral probability metrics, phi-divergences and binary classification. 01 2009.
- Steinbiss et al. (1994) Volker Steinbiss, Bach-Hiep Tran, and Hermann Ney. Improvements in beam search. In Third International Conference on Spoken Language Processing, 1994.
- Takahashi and Tanaka-Ishii (2018) Shuntaro Takahashi and Kumiko Tanaka-Ishii. Cross entropy of neural language models at infinity—a new bound of the entropy rate. Entropy, 20(11):839, 2018.
- Takase et al. (2018) Sho Takase, Jun Suzuki, and Masaaki Nagata. Direct output connection for a high-rank language model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609, 2018.
- Trinh et al. (2018) Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. Learning longer-term dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.
- Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017a.
- Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017b.
- Vovk (2001) V. Vovk. Competitive on-line statistics. International Statistical Review, 69, 2001.
- Wang and Cho (2016) Tian Wang and Kyunghyun Cho. Larger-context language modelling with recurrent neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1319–1329, 2016.
- Wang et al. (2017) Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. Topic compositional neural language model. arXiv preprint arXiv:1712.09783, 2017.
- Zadrozny and Elkan (2002) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates, 2002.
Appendix A Proofs for Section 4
(of Corollary 4.2) First observe:
For the first claim, we have
using our assumption in Equation 1.
For the second claim, taking $f(w) = \frac{1}{T}\log\frac{1}{\widehat{\Pr}(w)}$ in Lemma 4.1, we have:
which completes the proof. ∎
(of Lemma 4.3) By definition,
The first claim now follows from optimality of .
For the second claim,
By Taylor’s theorem, we have:
Taking the $\beta$ which minimizes the upper bound leads to the second claim. ∎
(Sharpness) There exists a problem where the bound is sharp and takes on its maximal value. As an example, consider a model $\widehat{\Pr}$ that starts by generating words under the true distribution and has a small probability of transitioning into a mode in which it generates words uniformly at random thereafter.
(of Theorem 4.4) We can apply the previous lemma using
and so our calibration condition implies:
Now observe that:
which completes the proof of the first claim.
The proof of the second claim uses
and, by Equation 2,
which completes the proof. ∎
Now we move on to the proof of Corollary 4.5.
Suppose $f$ is a function of $w_{1:t}$. For a conditional distribution $\widehat{\Pr}(\cdot \mid w_{1:t-1})$, let us now define:
Suppose . Let
We have that:
(sketch) The proof is identical to that of Lemma 4.3, with the addition of using linearity of expectation. ∎
Appendix B Proofs for Section 5
(of Theorem 5.1) It is convenient to define the distribution:
We then have:
by the definition of mutual information.
The proof consists of showing that:
Let us take . The zero gradient condition for the optimality at implies:
where the last step uses the definition of and Jensen’s inequality. ∎
Appendix C Experimental Details
In this section, we outline the experimental setups used to obtain the empirical results throughout the paper. For the calibration and memory experiments (Table 1, row 1; Figure 1 (left); Figures 2 and 3), our base model is a 3-layer LSTM with a 400-dimensional embedding and 1150 hidden units per layer. We train it on the Penn Treebank (PTB) corpus Marcus et al. , following the setup of Merity et al.  and Merity et al. ,
for 500 epochs using SGD with batch size 20 and BPTT length 70. The trained base model achieves 64.3 validation perplexity and 58.3 test perplexity.
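For reference, the base model and optimization hyperparameters described above can be collected into a single configuration. This is only an illustrative sketch; the dictionary and its field names are our own, not from any released codebase.

```python
# Hyperparameters of the base PTB language model, as described in the text.
# Key names are illustrative, not taken from the original training code.
BASE_LM_CONFIG = {
    "num_layers": 3,        # 3-layer LSTM
    "embedding_dim": 400,   # word-embedding size
    "hidden_dim": 1150,     # hidden units per LSTM layer
    "corpus": "PTB",        # Penn Treebank
    "optimizer": "SGD",
    "epochs": 500,
    "batch_size": 20,
    "bptt_length": 70,      # truncated backpropagation-through-time window
}

# Perplexities reported for the trained base model.
REPORTED_PPL = {"validation": 64.3, "test": 58.3}
```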
The limited-memory models used for the memory estimation in Section 5 share the same architecture as our base model; during training, however, the hidden state is re-initialized after reading every  tokens ( takes values in ).
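The hidden-state reset can be sketched as follows. This is a minimal illustration with a stand-in recurrence, not the actual LSTM training code; the function and parameter names (`run_with_limited_memory`, `reset_period`) are our own.

```python
def run_with_limited_memory(tokens, step, init_state, reset_period):
    """Process a token stream, re-initializing the recurrent state every
    `reset_period` tokens so the model cannot carry information across
    chunk boundaries.

    `step(state, token) -> state` is a stand-in for one RNN update.
    Returns the list of states observed after each token.
    """
    states = []
    state = init_state
    for i, tok in enumerate(tokens):
        if i > 0 and i % reset_period == 0:
            state = init_state  # forget everything read so far
        state = step(state, tok)
        states.append(state)
    return states

# Toy recurrence: the state counts tokens seen since the last reset,
# so it can never exceed the reset period.
count_step = lambda state, tok: state + 1
states = run_with_limited_memory(range(10), count_step, 0, reset_period=4)
```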
Finally, for the entropy rate measurements of larger-scale state-of-the-art language models (Table 1, rows 2-4; Figure 1 (right)), we used the pretrained weights published alongside Jozefowicz et al.  and Radford et al.  for rows 2 and 4, while we trained the row 3 model ourselves using the tensor2tensor framework. The model for row 2 is an LSTM with CNN-embedded inputs, trained on the Google Billion Words (GBW) corpus. The other two are Transformer models Vaswani et al. [2017a]: one trained on GBW (row 3), and one trained on a proprietary corpus derived from a web crawl (WebText; row 4). For GPT-2, since the authors have not published their training or validation data, we used the text of several New York Times articles as a stand-in validation set; the resulting cross-entropy loss is comparable to the reported validation loss. The entropy rate amplification plot in Figure 1 (bottom) corresponds to the setup from row 4.
To measure the conditional entropy after  generation steps, we measured the empirical conditional entropy of the -th word over  independent generations, produced in the standard way by iteratively sampling from the model's predicted next-word distribution, seeded with ground-truth text up to random points in the validation set. We used the entropy rate at  as a proxy for the asymptotic limit in Table 1.
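The empirical entropy at a fixed generation position can be estimated as follows. This is a sketch under our own naming: it computes the plug-in entropy of the empirical distribution of the word at position `t` across a collection of independently sampled generations.

```python
import math
from collections import Counter

def empirical_entropy_at_position(generations, t):
    """Plug-in estimate (in nats) of the entropy of the t-th generated
    word, computed from a list of independently sampled generations
    (each a list of tokens)."""
    counts = Counter(gen[t] for gen in generations)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy example: at position 0 all generations agree (entropy 0);
# at position 1 the word is split evenly between two choices (entropy log 2).
gens = [["the", "cat"], ["the", "dog"], ["the", "cat"], ["the", "dog"]]
```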
Appendix D Additional Generation Samples
In this section, to give a better sense of the qualitative effect of calibration, we provide some additional generations, seeded with 10-token prefixes of held-out (validation) sentences from the Google Billion Words dataset. Here, we used the model trained for row 3 of Table 1. To highlight a failure mode of the uncalibrated model, we selected seed prefixes that resulted in unusually long generations from the uncalibrated model.
|Original model||Calibrated model|
|Actual results could differ materially from those indicated by these forward-looking statements as a result of numerous factors including the risks associated with the timely and efficient completion and integration of the Temporary Liquidity Guarantee Department ’s supervision into the commercial , open market , solar energy , energy efficiency , electric utility transmission , and water demands of residential and business customers , Comcast ’s ability to successfully implement its business plan , timing of completion of the acquisition and the effectiveness of the efforts and strategies involved in the integration of Rhapsody , timing of regulatory and client approvals and availability of key enhancements .||Actual results could differ materially from those indicated by these forward-looking statements as a result of a variety of factors , including but not limited to ( i ) the risk that the tender offer could close in one or more manner or at all ; ( ii ) risks associated with conducting business in foreign jurisdictions ; ( iii ) difficulties in combining some or all of the businesses under one roof ; ( iv ) decreased demand for electricity , natural gas and other energy products , including adverse effects on the pricing of oil and natural gas ; and ( v ) the risks associated with doing business internationally .|
|Actual results could differ materially from those indicated by these forward-looking statements as a result of various important factors , including , without limitation : changes in general economic and business conditions , including more difficult real estate environments ; declines in information technology spending ; continued availability of capital and government regulations ; changes in general economic and business conditions ; the possibility that extended unemployment and healthcare policies may change , or may reduce access to quality care services ; failure to obtain adequate and affordable medications ; changes in certain CME / CE product mix ; disruption in CME credit markets ; uncertainty of the outcomes of regulatory investigations of companies in which the Company has an interest ; dependence on suppliers for most of its products ; consolidation among financial institutions ; ability to attract and retain skilled personnel ; changes in rapidly changing technology and regulatory environments ; arrogance and complacency among financial analysts ; the impact of competition ; inability to retain and motivate senior management ; difficulties in the integration of acquired businesses ; the effects of redundancy and loss of key employees ; litigation , including claims and the challenge of insurance practices ; uncertainties relating to litigation ; risks related to investigations by other companies ; inadequate information systems ; the impact of reduced availability of ; * assumptions upon such companies using such as ours to gauge CNET ’s financial condition ; and other factors .||Actual results could differ materially from those indicated by such forward-looking statements as a result of various important factors , including those discussed in the company ’s periodic reports that are filed with the Securities and Exchange Commission and available on the SEC ’s website at www.sec.gov.|
|Actual results could differ materially from those indicated by such forward-looking statements as a result of a variety of factors , including our ability to improve our liquidity . Among these factors are changes in the general economy , changes in political and economic conditions , changes in interest rates , changes in technology and implementation of regulatory policies and legislation , the direction of interest rates and changes in the banking industry , changes in loan prepayment activity , changes in consumer preferences and consumer and business lending markets , legislation or public compliance with applicable laws and regulations and changes in the business or regulatory environment . We caution you that there are many uncertainties that could cause actual results to differ materially from those indicated in the forward-looking statements . Among them are the risk factors that could cause results to differ from those expressed in the forward-looking statements . These factors include , but are not limited to : general economic and business conditions , including the financial markets ; fluctuations in interest rates ; government regulation of the financial services industry and possible failures ; planning assumptions and estimates ; potential funding requirements ; unexpected changes in cost increases ( including goodwill impairment ) ; competition ; the potentially lengthy , protracted U.S. recession ; and migratory consumer and business conditions .||Actual results could differ materially from those indicated by these forward-looking statements as a result of various important factors , including those discussed in the ” Risk Factors ” section of the Company ’s Annual Report on Form 10-K for the most recently ended fiscal year .|
|Bluepoint Games , Inc. is a highly experienced and multi-faceted publisher of licensed virtual worlds for gamers , developers and technology professionals . The company is based in Vancouver , Canada . BlueKai ’s innovative games are distributed by Devices EA , LLC , and Club Penguin . BlueKai owns and is the exclusive licensor of Scrabulous . BluetoothQ Interactive Inc. has acquired JoShear-Swain Media , LLC , a premier developer and publisher of community based games for the handheld game device . For further information , please visit : www.netgear.com / ngcleveld . Sprint ’s fantasy game publisher and Web doing business within the Entertainment Group is James Upon , CEO of MyNetSheltetWeb and the three previous Developers of MySQL . Based in Redwood City , California , BlueMountain is the leader in franchise and game development for the massively multiplayer online game .||Bluepoint Games , Inc. is a highly experienced gaming and entertainment company with several renowned blockbuster franchises including PC , GameHouse ( ( R ) ) GameHouse ( ( R ) ) , Heavenly Sword ( ( TM ) ) , EverQuest ( R ) , Untold Story ( TM ) and EverQuest ( R ) II . Through its wholly-owned subsidiary , Bluehill ID ( R ) , the Bluehill ID logo and tagline are registered trademarks of Bluehill ID Corporation and its subsidiaries in the U.S. and in other countries .|
|Bluepoint Games , Inc. is a highly experienced gaming , entertainment and mobile games company with a vertically integrated portfolio including : games ( TM ) , social network , mobile , casual games , MMORPG , production , distribution , and licensing including its flagship games , SUIT and TIMMERIX ( TM ) , as well as its award-winning gaming , basketball and entertainment network . In order to create a highly integrated , pure and socially responsible Game ( R ) family , Bluepoint has collaborated with Amplify Systems International , Inc. on various titles for PlayStation ( R ) 2 , PLAYSTATION 3 ( R ) 5 , Wii ( TM ) 3 , PS3 , Wii ( TM ) ( and PS3 titles ) as well as PC games for PC , PSP , POOL , Wii ( TM ) ( and successor title ) and IP ( R ) , in addition to its focused gaming , entertainment and communication services . BlueBay ’s exclusive licensee worldwide licensee of the Bluepoint ( TM ) ZMFAO Gateway series , it is the world ’s leading portable gaming , PC and mobile phone company . For more information , see UNK , Inc. and ” Oakpoint : ZWC ’s Community Health Business Development Center .||Bluepoint Games , Inc. is a highly experienced licensing , gaming and entertainment firm focused on developing the next generation of casual games based on the PlayStation ( R ) BRAVIA family of video game machines for the North American market . Bluepoint is a wholly owned subsidiary of Bluehill ID Holdings L.P.|
|Bluepoint Games , Inc. is a highly experienced , innovative entertainment sports gaming company whose products and services are used by some of the most recognized and respected names in the world of gaming including : Pokemon , Macau ( Valve ) , Quattro , Super Smash Bros. , Good Neighbor Games , IGN Games , Vail Resorts , Kania ( Ocean Spray , Pemberton and Roatenham ) , PURE Holdings , TeenNick , National Amusements , SEGA Games , Cirrus ( Aircraft ) and www.netapool.com.||Bluepoint Games , Inc. is a highly experienced player in the growing genre of casual games for both casual and active gaming enthusiasts . Bluepoint is an early stage Company with a significant following among youth and adults in Europe and the United States with an impressive track record in global on-line gaming opportunities .|
|Nursing Homes : Genworth ’s 2009 Cost of Care Survey , conducted by the Robert Wood Johnson Foundation and released today , reveals the extent to which members of the U.S. population adheres to practices recommended since 1995 , including : a rolling three-hour ” Python for Life ” that fell asleep from 11 p.m. to 2 a.m. , sleep time from 11 p.m. to 3 a.m. , spare time from 8 a.m. to 9 p.m. , and use of state-of-the art non-invasive technologies . A remodeling and refurbishment of hospital facilities is underway as the nation ’s economy begins to gain momentum . Similar to the previous years , Thinking About Health - Hear how health plans are working to address various congressional proposals to advance best practices in patient care and provide greater accountability , advocacy and transparency to consumers .||Nursing Homes : Genworth ’s 2009 Cost of Care Survey is based on interviews with 516 family , friends and neighbors of insured and self-employed people conducted from Jan .|
|Nursing Homes : Genworth ’s 2009 Cost of Care Survey is based on a double-blind , randomized , double-blind , placebo-controlled survey which involved an assessment of the cost-effectiveness of healthcare associated with an adequate diet and regular physical activity compared to its managed-care counterparts . The margin of error for this survey is + / - 3.3 percentage points at the 95 percent level of confidence .||Nursing Homes : Genworth ’s 2009 Cost of Care Survey , conducted by Harris Interactive , performed significantly worse than a control group of its peers who provided care but were not able to offer health care to their employees .|
|Nursing Homes : Genworth ’s 2009 Cost of Care Survey , conducted by CareScout ( R ) and published in the April 2009 issue , evaluated findings from the 10-year , nearly 900,000-member Specialty Health Management Association ’s more than 6,000 professionals living in the United States .||Nursing Homes : Genworth ’s 2009 Cost of Care Survey includes a series of health and medical cost reports on more than 100 home medical equipment and related products , including more than 3.9 million units of durable medical equipment . IBC ’s cost of more than $ 100 billion is a significant portion of Medicare spending on home health care .|