1 Introduction
Natural language is fundamentally creative: speakers and listeners frequently produce and comprehend sentences which have never been produced before (fodor1975language; fodor1988connectionism; chomsky1975logical). As a side effect of this property, distributions in natural language are characterized by a heavy tail of individually improbable events which collectively account for a significant amount of the total probability mass of the distribution (Khmaladze1988TheSA; Baayen2001WordFD). Precisely approximating this large number of rare events is one of the foundational challenges for models of natural language (goodturing; Jelinek1980InterpolatedEO; katz; kneserney; wood2011sequence; goldwater11a). Autoregressive neural language models (bengio2003neural; mikolov2013linguistic; Radford2019LanguageMA) attempt to do so by decomposing the probability of an event (a sequence) into a series of conditional distributions, each parameterized by a shared neural network.
Recently, a growing body of work has sought to understand how these language models (LMs) fit the distribution of a language beyond standard measures such as perplexity. meister2021language, for example, investigated the statistical tendencies of the distribution defined by neural LMs, whereas moderecovery explored whether they adequately capture the modes of the distribution they attempt to model. At the same time, increased focus has been given to performance on rare or novel events in the data distribution, both for models of natural language (mccoy2021much; lent2021language; dudy2020words; orenetal2019distributionally) and neural models more generally (see, for example, sagawa2020investigation; d2021tale; chen2021non; BlevinsZettlemoyer2020; czarnowskaetal2019dont; devilintails; Ouyang_2016_CVPR; bengio2015battle; Zhu_2014_CVPR). Neither of these branches of work, however, has explored instance-level LM performance on rare sequences in the distribution. As a result, we have relatively little understanding of how neural LMs approximate sequences in the heavy tail characteristic of natural language.
In this work, we introduce a controlled methodology to explore how LMs estimate the probability of sequences in the heavy tail of the distribution. Our instance-level evaluation scheme explicitly compares the target probability distribution of the language to the distribution defined by the LM. Since the true distribution of any natural language is in practice unknown, we use a Transformer LM trained on natural data as a generative model to define target artificial languages for which we can exactly compute sequence probabilities. Training LSTM and Transformer LMs on sequences sampled from these target artificial languages, we compare the sequence-level probability estimates given by neural LMs to the target probabilities in the language. By controlling the entropy of the generative model’s conditional distributions, we create a set of artificial languages with varying distributional properties, and analyze how LM estimation behaviour is modulated by the properties of the target distribution.
Our experiments uncover the extent to which neural LMs provide a distorted fit of the language they are trained to model. We find that LSTM and Transformer LMs (i) systematically underestimate the probability of sequences drawn from the target language and (ii) do so more when such sequences are rare. Where did this underestimated probability mass go? We do not find that the underestimation is accompanied by overestimation in the head of the distribution. Rather, we find that LMs tend to (iii) overestimate the probability of rare perturbed (ill-formed) sequences. Interpreted together, these findings indicate that on the one hand, neural LMs underrepresent well-formed sequences in the tail of the language they attempt to model, and on the other hand, overrepresent ill-formed sequences far away from high-probability zones in sequence space. In addition, we find that (iv) greater amounts of training data lessen underestimation but do not eliminate it and that (v) underestimation is exacerbated for target distributions with lower entropy.
2 Background
We begin by briefly characterizing why distributions with a large number of rare events (LNRE) emerge in natural language, and why these events pose challenges for LMs. Furthermore, we motivate the need for instance-level evaluation when dealing with a large number of rare events.
Productivity
In the context of language production, a language user has the ability to produce, at any given point in their linguistic lifespan, an utterance which they have never produced before. This creativity is the result of the generative property of productivity, which states that on the basis of finite linguistic experience, a language user can produce and comprehend an unbounded number of grammatically acceptable utterances (chomsky1975logical). Productive processes induce a distribution which places non-zero probability on unseen events at all practical sample sizes. Because of this property, many of the distributions in natural language, particularly the distribution over the sequences of a language, are characterized by a heavy tail of rare events.
LNRE Zone
To make explicit the connection between productivity and a heavy tail of rare events, let $P_N$ denote the probability of sampling a novel (previously unseen) event from some distribution after having sampled $N$ events. Then productivity as described above states that $P_N \gg 0$ for all sample sizes $N$ that occur in practice. The range of sample sizes $N$ for which it is the case that $P_N \gg 0$ is known as the LNRE zone (Khmaladze1988TheSA; Baayen2001WordFD). The LNRE zone for natural language appears to be very large, and it seems likely that $P_N$ will remain greater than $0$ for samples of natural language many orders of magnitude larger than all the data currently available for training LMs (for an empirical validation of this claim on a sample of practical size from the OpenWebText corpus, see the Appendix). In the LNRE zone, it is difficult to obtain accurate estimates of the probability of events using straightforward maximum likelihood estimation (MLE). Accounting for this enormous amount of novelty is thus a central challenge in language modeling.
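As a rough illustration of how the probability of novelty can be measured on a sample, the sketch below uses the Good-Turing estimate: the share of the sample made up of events seen exactly once approximates the chance that the next draw is unseen. The Zipf-like toy distribution, the function names and the sample sizes are assumptions chosen for illustration; they are not taken from the paper's Appendix.

```python
from collections import Counter
import random


def novelty_probability(sample):
    """Good-Turing estimate of the chance that the next draw is unseen.

    It equals the number of events seen exactly once divided by the sample size.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def zipf_sample(n_draws, n_types=20_000, seed=0):
    """Draw from a hypothetical Zipf-like distribution (rank r has weight 1/r)."""
    rng = random.Random(seed)
    weights = [1.0 / r for r in range(1, n_types + 1)]
    return rng.choices(range(n_types), weights=weights, k=n_draws)


if __name__ == "__main__":
    # The estimate shrinks as the sample grows, but slowly for heavy-tailed data.
    for n in (1_000, 10_000, 50_000):
        print(n, round(novelty_probability(zipf_sample(n)), 3))
```

For a distribution over whole sequences rather than a closed set of toy types, the same estimate would be expected to stay far from zero at any practical sample size.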
Language modeling
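A model $\mathcal{M}$ of the language $\mathcal{L}$ attempts to define a distribution $p_{\mathcal{M}}$ which closely resembles the true distribution of the language $p_{\mathcal{L}}$. In a locally-normalized autoregressive model, this distribution is defined by assigning probabilities to variable-length sequences $\mathbf{x} = (x_1, \dots, x_{|\mathbf{x}|})$ via a chain-rule decomposition:

$$p_{\mathcal{M}}(\mathbf{x}) = \prod_{i=1}^{|\mathbf{x}|} p_{\mathcal{M}}(x_i \mid \mathbf{x}_{<i}) = \prod_{i=1}^{|\mathbf{x}|} \frac{s(x_i, \mathbf{x}_{<i}; \theta)}{\sum_{x' \in \Sigma} s(x', \mathbf{x}_{<i}; \theta)} \qquad (1)$$

where $\Sigma$ is the vocabulary, and $s(x_i, \mathbf{x}_{<i}; \theta)$ is the non-negative score of token $x_i$ given the sequence prefix $\mathbf{x}_{<i}$, which is computed by a neural network with parameters $\theta$.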
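The chain-rule decomposition with local normalization can be sketched with a toy scoring function standing in for the neural network. Everything named below (`VOCAB`, `toy_score`, the score values) is hypothetical and only illustrates the mechanics: score every vocabulary item given the prefix, normalize over the vocabulary, and sum the log conditionals.

```python
import math

VOCAB = ["<eos>", "a", "b"]  # hypothetical toy vocabulary


def toy_score(token, prefix):
    """Hypothetical non-negative score; a neural network would compute this."""
    if token == "<eos>":
        return 1.0 + len(prefix)   # ending becomes likelier as the prefix grows
    if prefix and prefix[-1] == token:
        return 0.5                 # discourage immediate repetition
    return 2.0


def conditional(token, prefix):
    """Locally normalize the scores over the whole vocabulary."""
    total = sum(toy_score(v, prefix) for v in VOCAB)
    return toy_score(token, prefix) / total


def sequence_log_prob(tokens):
    """Chain rule: sum of log conditional probabilities, one per token."""
    return sum(
        math.log(conditional(tok, tokens[:i])) for i, tok in enumerate(tokens)
    )


if __name__ == "__main__":
    print(sequence_log_prob(["a", "b", "<eos>"]))
```

Because each conditional is normalized separately, every prefix always distributes a full unit of probability over its continuations, whether or not that prefix is plausible.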
For $p_{\mathcal{M}}$ to perfectly approximate $p_{\mathcal{L}}$, we expect $p_{\mathcal{M}}(\mathbf{x}) = p_{\mathcal{L}}(\mathbf{x})$ for all $\mathbf{x} \in \Sigma^*$, where $\Sigma^*$ is the set of all strings of finite length (the Kleene closure of $\Sigma$). In the LNRE zone, $p_{\mathcal{L}}$ is defined over a support containing a very large set of sequences which have never occurred in a training corpus (or equivalently, have all occurred with frequency $0$), and which take on a very wide array of differing probabilities. For example, while the sequences $\mathbf{x}_1, \mathbf{x}_2 \in \Sigma^*$ have likely never occurred in any sample of English, most would agree that $\mathbf{x}_1$ is far more probable than $\mathbf{x}_2$:

$\mathbf{x}_1$: The East pond in Parc Lafontaine was filled to the brim with Diet Coke.

$\mathbf{x}_2$: Certain nak indicate liberationing among theorter codity voters vandalized.
LM Evaluation
For a perfect LM of English, we would expect the estimated probabilities of the sequences $\mathbf{x}_1$ and $\mathbf{x}_2$ to match their probabilities under the true distribution $p_{\mathcal{L}}$. However, since $p_{\mathcal{L}}$ and its underlying generative process are unknown, it is not possible to explicitly evaluate how closely instance-level probability estimates align. As a proxy, the mean perplexity of the model on a holdout set of sequences is typically used, which measures whether the model, on average, assigns high likelihood to unseen instances. This measure does not, however, tell us whether instance-level estimates align with their true counterparts, nor is it necessarily indicative of performance on rare, idiosyncratic events in $p_{\mathcal{L}}$. In this way, the lack of access to the ground-truth distribution severely complicates LM evaluation on the heavy tail of rare sequences in language. The following section introduces a methodology to overcome these limitations.
3 Language Model Evaluation in the LNRE Zone
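Table 1: Components of the evaluation methodology.

| Component | Notation | Description |
|---|---|---|
| Generative model | $\mathcal{L}$ | An LM trained on natural language data. |
| Artificial language | $p_{\mathcal{L}}$ | The distribution over sequences induced by a sampling scheme from $\mathcal{L}$. |
| Language model | $p_{\mathcal{M}}$ | The distribution of an LM trained on sequences sampled from $p_{\mathcal{L}}$. |
| Target probabilities | $p_{\mathcal{L}}(\mathbf{x})$ | The probability assigned by $p_{\mathcal{L}}$ to the sequence $\mathbf{x}$. |
| Model probabilities | $p_{\mathcal{M}}(\mathbf{x})$ | The probability assigned by $p_{\mathcal{M}}$ to the sequence $\mathbf{x}$. |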
We propose evaluating language model performance on the heavy tail of rare events via a known probability distribution over sequences. Specifically, we train a Transformer LM on sequences sampled from a corpus of natural language to define a generative model $\mathcal{L}$. The distribution over sequences induced by a sampling scheme from $\mathcal{L}$, denoted $p_{\mathcal{L}}$, is then our artificial language. We expect a model $p_{\mathcal{M}}$ of this artificial language to assign probabilities $p_{\mathcal{M}}(\mathbf{x})$ to sequences $\mathbf{x}$ which match the target probabilities of $\mathbf{x}$ under $p_{\mathcal{L}}$. To characterize neural LM behaviour on rare events, we train Transformer and LSTM LMs on data sampled from $p_{\mathcal{L}}$, and compare the instance-level probability estimates given by $p_{\mathcal{M}}$ to target probabilities under $p_{\mathcal{L}}$. We summarize the components of this methodology in Table 1, and overview it in greater detail in the following section.
3.1 Artificial Languages as Target Distributions
Table 2: Three sequences sampled from the artificial language.

“we’re very excited to have the opportunity to help them,” he says.
so what’s going to happen?
to me, the fisheries are in the midst of a global financial crisis.
To define a generative model $\mathcal{L}$, we train a randomly-initialized GPT2-medium on 1.5M sentences sampled from the OpenWebText corpus (Gokaslan2019OpenWeb). (All Transformer implementations were obtained from Huggingface, and training was done on two or four RTX8000 GPUs, depending on model size, with mixed floating point precision.) We set the maximum sequence length to be 128 tokens. We additionally train a byte-pair encoding (BPE) tokenizer on this data set with a standard GPT2 vocabulary size of 50,257 tokens. For simplicity, this tokenizer is used for all models.
Using this generative model $\mathcal{L}$, we define the target distribution over sequences (the artificial language $p_{\mathcal{L}}$) as the distribution induced by an ancestral sampling scheme from $\mathcal{L}$. Thus, we draw instances $\mathbf{x} \sim p_{\mathcal{L}}$ from our language by recursively sampling from the conditional distribution over tokens at each time step: $x_t \sim p_{\mathcal{L}}(\cdot \mid \mathbf{x}_{<t})$, where $x_0$ is the beginning-of-sequence symbol and $x_{|\mathbf{x}|}$ is the end-of-sequence symbol. All experiments up until Section 4.5 are conducted on the distribution induced by ancestrally sampling from $\mathcal{L}$ with a fixed softmax temperature $T$. (We temper $\mathcal{L}$ in an effort to define a ground-truth distribution whose entropy more closely resembles that of a natural language. Note that our findings hold for untempered ($T = 1$) ground-truth distributions as well; see A.1.1 and A.1.2.) In Section 4.5, we explore the effects of different values of $T$ when sampling from $\mathcal{L}$. Table 2 shows three sequences sampled from this distribution.
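A minimal sketch of tempered ancestral sampling follows, assuming a toy logit function in place of the trained generative model. `toy_logits`, the vocabulary and the temperature passed in the usage line are all hypothetical. The point is that the sampler can record the exact log-probability of each sequence it emits, which is what makes a generated language usable as a ground truth.

```python
import math
import random

VOCAB = ["<eos>", "the", "pond", "was", "full"]  # hypothetical toy vocabulary


def toy_logits(prefix):
    """Hypothetical stand-in for the generative model's pre-softmax logits."""
    last = prefix[-1] if prefix else None
    # Favour <eos> more as the prefix grows; discourage repeating the last token.
    return [
        0.5 * len(prefix) if v == "<eos>" else (-1.0 if v == last else 1.0)
        for v in VOCAB
    ]


def tempered_softmax(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [u / temperature for u in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ancestral_sample(temperature, rng, max_len=128):
    """Sample token by token; return the sequence and its exact log-probability.

    If max_len is reached before <eos>, the value is the log-probability of the
    truncated prefix rather than of a complete sequence.
    """
    prefix, log_prob = [], 0.0
    while len(prefix) < max_len:
        probs = tempered_softmax(toy_logits(prefix), temperature)
        idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        log_prob += math.log(probs[idx])
        prefix.append(VOCAB[idx])
        if VOCAB[idx] == "<eos>":
            break
    return prefix, log_prob


if __name__ == "__main__":
    # 0.9 is an arbitrary temperature chosen for this illustration.
    print(ancestral_sample(0.9, random.Random(0)))
```

Note that the temperature is applied both when sampling and when scoring, so the recorded log-probability is that of the tempered distribution.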
3.2 Sequence Probability Estimation in the LNRE Zone
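Given an artificial language $p_{\mathcal{L}}$, the task of $p_{\mathcal{M}}$ (the language model) is to define a distribution whose sequence-level probability estimates closely align with the sequence-level probabilities given by $p_{\mathcal{L}}$. We refer to any deviation from this desideratum as model estimation error. To quantify the model estimation error for a sequence $\mathbf{x}$, we take the difference between the sequence’s log probability under $p_{\mathcal{M}}$ and its true log probability under $p_{\mathcal{L}}$:

$$\text{err}(\mathbf{x}) = \log p_{\mathcal{M}}(\mathbf{x}) - \log p_{\mathcal{L}}(\mathbf{x}) \qquad (2)$$

This quantity is the log probability ratio, which measures, in log-space, the number of times more or less likely the sequence $\mathbf{x}$ is under the language model $p_{\mathcal{M}}$. Note that $\text{err}(\mathbf{x}) < 0$ indicates that $p_{\mathcal{M}}$ underestimates the probability of $\mathbf{x}$, whereas $\text{err}(\mathbf{x}) > 0$ indicates that $p_{\mathcal{M}}$ overestimates the probability of $\mathbf{x}$. In practice, we train $p_{\mathcal{M}}$ on a set of sequences $\mathcal{D}_{\text{train}}$ sampled from $p_{\mathcal{L}}$, and compute model estimation error on a separate set of sequences $\mathcal{D}_{\text{test}}$ sampled from $p_{\mathcal{L}}$. In all cases, we compute the probability of a sequence as its chain rule decomposition: $p(\mathbf{x}) = \prod_{i=1}^{|\mathbf{x}|} p(x_i \mid \mathbf{x}_{<i})$, where $x_0$ is the beginning-of-sequence symbol and $x_{|\mathbf{x}|}$ is the end-of-sequence symbol. When computing the ground-truth sequence probabilities for $p_{\mathcal{L}}$, we take into account any softmax tempering.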
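To make the sign convention concrete, the short sketch below computes the estimation error from two log-probabilities and converts it back into a likelihood ratio. Natural logarithms are an assumption here (the text does not state the base), and the example probabilities are invented.

```python
import math


def estimation_error(model_log_prob, target_log_prob):
    """Log probability ratio: negative means the model underestimates."""
    return model_log_prob - target_log_prob


def times_less_likely(error):
    """How many times less likely the sequence is under the model.

    Values above 1 indicate underestimation; values below 1, overestimation.
    """
    return math.exp(-error)


if __name__ == "__main__":
    # Invented example: target probability 1e-20, model probability 1e-22.
    err = estimation_error(math.log(1e-22), math.log(1e-20))
    print(err, times_less_likely(err))
```

Working in log-space keeps the comparison stable for sequences whose raw probabilities are far too small to subtract meaningfully.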
3.3 Neural Language Models
We study the estimation performance of two neural LM architectures: the Transformer (vaswani2017attention) and the LSTM (Melis2020Mogrifier). When training either architecture, we halve the learning rate if validation loss increases at the end of an epoch. For all model sizes, we use a batch size of 128 sequences. Models with the lowest cross-entropy loss on a withheld validation set are used in experiments unless otherwise mentioned.
We use the Huggingface (wolfetal2020transformers) implementations of GPT2-small, GPT2-medium and GPT2-large (Radford2019LanguageMA) as representative Transformer LMs. We use Adam optimization, with the learning rate set separately for GPT2-small, medium and large. Since these models are in the same model class as our artificial target language $p_{\mathcal{L}}$, we expect the task of recovering the ground-truth distribution to be relatively easy compared to the true problem faced in modeling natural language, where both the distribution and the underlying generative process are unknown. For the LSTM (hochreiter1997long), we follow the implementation of the baseline LM described in Melis2020Mogrifier. We use 2 layers and adjust the hidden state and embedding dimension (2048 and 1024, respectively) such that the total number of parameters is approximately equal to that of GPT2-small (110M).
4 Results
4.1 Estimation Error with Fixed Data
We begin by exploring model estimation error on a fixed training set $\mathcal{D}_{\text{train}}$ of 1M sequences sampled from $p_{\mathcal{L}}$. We first train LSTM and GPT2 models on $\mathcal{D}_{\text{train}}$, early-stopping as described above. Following training, we sample a test set $\mathcal{D}_{\text{test}}$ of 500,000 sequences from $p_{\mathcal{L}}$, and score each sequence under both the model distribution $p_{\mathcal{M}}$ and the true language distribution $p_{\mathcal{L}}$. From this, we obtain a set of probability estimates: $P_{\text{test}} = \{(p_{\mathcal{L}}(\mathbf{x}), p_{\mathcal{M}}(\mathbf{x})) \mid \mathbf{x} \in \mathcal{D}_{\text{test}}\}$. If the model $p_{\mathcal{M}}$ perfectly models the language $p_{\mathcal{L}}$, then for each $\mathbf{x} \in \mathcal{D}_{\text{test}}$ we would expect $p_{\mathcal{M}}(\mathbf{x}) = p_{\mathcal{L}}(\mathbf{x})$. Figures 2(A) and 2(B) visualize this relationship with the x- and y-axes denoting the true and model estimated sequence probabilities respectively, and a dashed line representing equality. To compare probability estimates, we represent the set $P_{\text{test}}$ in the form of a joint histogram over this coordinate space. Histogram bins are shaded based on the number of tuples which lie in the coordinate range they define. Importantly, any deviation of this histogram from the identity line indicates that the model distorts the shape of the distribution of the language.
Figures 2(A) and 2(B) provide evidence for distributional distortion in the form of underestimation. The majority of probability tuples in $P_{\text{test}}$ lie to the right of the identity line, indicating that LSTM and GPT2 models consistently underestimate the probability of sequences sampled from $p_{\mathcal{L}}$. Furthermore, the distance between the identity line and the probability tuples grows nonlinearly as a function of the true sequence probability, indicating that underestimation error is more severe for rarer sequences in the language. We validate these observations in the rightmost plot of Figure 2, which shows mean estimation error growing more negative, nonlinearly, as the target sequence probability decreases. (To compute this curve, we split the target sequence probability range into equally sized bins, by probability range. We report the mean estimation error for each bin containing a minimum number of sequences. We additionally compute 95% confidence intervals by bootstrapping each mean, with a resample size equal to the number of sequences in the given bin.)
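The binning-and-bootstrap procedure for the mean estimation curve can be sketched as follows. The number of bins, the minimum bin count and the number of bootstrap resamples are placeholders chosen for illustration, and binning over log-probability rather than raw probability is likewise an assumption of this sketch.

```python
import random
import statistics


def binned_mean_error(log_probs, errors, n_bins=10, min_count=5,
                      n_boot=1000, seed=0):
    """Mean error per equal-width bin of target log-probability, with 95% CIs."""
    rng = random.Random(seed)
    lo, hi = min(log_probs), max(log_probs)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    bins = [[] for _ in range(n_bins)]
    for lp, err in zip(log_probs, errors):
        idx = min(int((lp - lo) / width), n_bins - 1)
        bins[idx].append(err)

    rows = []
    for i, members in enumerate(bins):
        if len(members) < min_count:
            continue  # skip sparsely populated bins
        # Percentile bootstrap: resample the bin with replacement, same size.
        means = sorted(
            statistics.fmean(rng.choices(members, k=len(members)))
            for _ in range(n_boot)
        )
        rows.append({
            "bin_start": lo + i * width,
            "mean": statistics.fmean(members),
            "ci": (means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]),
        })
    return rows
```

Each returned row gives the left edge of a bin, the mean error of the sequences inside it, and a percentile interval for that mean.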
In addition, comparing underestimation behaviour across model size, we find that while GPT2-medium performs slightly better than GPT2-small, these improvements are typically within the range defined by the bootstrapped 95% confidence intervals. See A.1.3 for evidence indicating that this underestimation behaviour also occurs in pre-trained models fine-tuned on $\mathcal{D}_{\text{train}}$.
4.2 Estimation Error across Training Time
To understand the training dynamics underlying the previously reported underestimation, we compute model probability estimates on subsets of $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ at the end of each training iteration. Once computed, we sort each set of probability tuples $P_{\text{train}}$ and $P_{\text{test}}$ by their target sequence probabilities $p_{\mathcal{L}}(\mathbf{x})$, and split the probability tuples into 50 equally sized bins. We plot estimation curves in Figure 3: each curve represents a 50th of the sequences, with darker curves denoting estimation error for sequences with lower target probabilities (rarer sequences). At any given point, then, the distance between estimation curves represents the degree to which estimation error is dependent on the target probability of the sequence.
Figure 3 (left) visualizes underestimation error for sequences seen in training. Around the fifth epoch, estimation error for train sentences converges to zero, that is, $\text{err}(\mathbf{x}) \approx 0$, indicating that GPT2-medium is able to almost perfectly recover the target probabilities of training sequences no matter their target probability. At the same time, this convergence happens almost simultaneously for all sequences, indicating that a complete reduction in error during training occurs throughout the entire range of target sequence probabilities.
Figure 3 (right) visualizes the GPT2-medium model’s performance on a separate set of test sequences. First, unlike for $\mathcal{D}_{\text{train}}$, estimation error for $\mathcal{D}_{\text{test}}$ does not converge to zero, meaning that even when the model has perfectly recovered the target probability of train sequences, the target probabilities for test sequences remain underestimated. Second, in the case of $\mathcal{D}_{\text{train}}$, the difference between estimation curves of different shades converges to zero, indicating that estimation performance becomes uniform across the distribution of train sequences. We do not see such behaviour in $\mathcal{D}_{\text{test}}$. Instead, the error curves remain at a relatively consistent distance from one another, indicating that the discrepancy in estimation error at different parts of the distribution is unchanging for sequences not seen during training.
4.3 Estimation Error by Amount of Training Data
Our previous experiment trained language models on a set of 1M sequences. A plausible explanation for the model’s underestimation behaviour on unseen test sequences is therefore that the language model has not seen enough samples from the target distribution. Here we explore how estimation error varies as a function of the amount of training data. We train GPT2-medium, GPT2-large and an LSTM model in the online “Ideal World” setting (Nakkiran2020online) by sampling, at the beginning of each training iteration, a fresh set of 500,000 sequences from $p_{\mathcal{L}}$, and training on this sample. Doing so for 60 iterations, we obtain LMs which have been trained on 30 million sequences. We compute model estimation error on $\mathcal{D}_{\text{test}}$ at the end of each iteration. Figure 4 visualizes underestimation error throughout training for these LMs. We again split test sequences by their true probability, with darker lines denoting estimation trends for less probable target sequences.
The estimation curves in Figure 4 suggest that while increasing the amount of data in training initially leads to lower estimation error, this reduction eventually asymptotes. In the insets of Figure 4, we visualize the relative change in mean estimation error between successive epochs. Relative change in estimation error eventually fluctuates around 0 (minimal change) for the majority of the distribution. Comparing architectures, we find that the Transformer is significantly more efficient at reducing mean estimation error throughout the distribution.
4.4 Where did the probability mass go?
In the previous section, we saw that even when increasing the amount of training data, $p_{\mathcal{M}}$ consistently underestimates the probabilities of sequences sampled from the tail of $p_{\mathcal{L}}$. At the same time, we did not find that $p_{\mathcal{M}}$ overestimated sequences in the head of $p_{\mathcal{L}}$. Under the assumption that $p_{\mathcal{M}}$ is a proper probability distribution (see welleck2020consistency for discussion on consistency in the context of neural language model decoding), that is, $\sum_{\mathbf{x} \in \Sigma^*} p_{\mathcal{M}}(\mathbf{x}) = 1$, these findings suggest that there exist sequences in $\Sigma^*$ whose probability is overestimated by the model. In this section, we investigate where this probability mass went.
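The step from underestimation to overestimation is a conservation argument, and it may help to write it out. In the sketch below, $p_{\mathcal{M}}$ and $p_{\mathcal{L}}$ stand for the model and target distributions over all finite strings $\Sigma^*$, and $U$ is notation introduced here (not in the original text) for the set of underestimated sequences. Both distributions are assumed to be proper.

```latex
\begin{align*}
U &= \{\mathbf{x} \in \Sigma^* : p_{\mathcal{M}}(\mathbf{x}) < p_{\mathcal{L}}(\mathbf{x})\}, \\
\sum_{\mathbf{x} \in \Sigma^*} p_{\mathcal{M}}(\mathbf{x})
  &= \sum_{\mathbf{x} \in \Sigma^*} p_{\mathcal{L}}(\mathbf{x}) = 1 \\
\Longrightarrow \quad
\sum_{\mathbf{x} \in U} \bigl(p_{\mathcal{L}}(\mathbf{x}) - p_{\mathcal{M}}(\mathbf{x})\bigr)
  &= \sum_{\mathbf{x} \notin U} \bigl(p_{\mathcal{M}}(\mathbf{x}) - p_{\mathcal{L}}(\mathbf{x})\bigr).
\end{align*}
```

If $U$ is non-empty, the left-hand side is strictly positive, so at least one sequence outside $U$ must satisfy $p_{\mathcal{M}}(\mathbf{x}) > p_{\mathcal{L}}(\mathbf{x})$. The total overestimated mass equals the total underestimated mass; the argument says nothing about where in sequence space it sits.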
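Table 3: Sequence perturbation operations.

| Operation | Description |
|---|---|
| Swap | Swap two tokens in $\mathbf{x}$. |
| Delete | Delete a token from $\mathbf{x}$. |
| Insert | Insert a token from $\Sigma$ at a position in $\mathbf{x}$. |
| Substitute | Substitute a token in $\mathbf{x}$ with a token from $\Sigma$. |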
To do so, we compute model estimation error on perturbed sequences from $\Sigma^*$: sequences in $\Sigma^*$ which are increasingly far away from the high-probability zones in $p_{\mathcal{L}}$. We build a corpus of perturbed sequences by recursively applying 30 random perturbations to each sequence $\mathbf{x} \in \mathcal{D}_{\text{test}}$. Formally, the set of sequences at perturbation step $i$ can be expressed as $X_i = \{\text{perturb}(\mathbf{x}) \mid \mathbf{x} \in X_{i-1}\}$, where $\text{perturb}(\cdot)$ is a function which returns a novel perturbed version of $\mathbf{x}$, and $X_0 = \mathcal{D}_{\text{test}}$. Sequence perturbation operations are shown in Table 3. While it is possible that these operations produce other well-formed strings, we expect this to be a relatively rare outcome. We score each of these 15M sequences under both the target generative model $p_{\mathcal{L}}$ and the LM $p_{\mathcal{M}}$. Note that we use as LM the GPT2-medium model from the previous section (trained on 30M sequences).
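The recursive perturbation scheme can be sketched as below. The uniform choice among the four operations, the uniform choice of positions and replacement tokens, and the function names are assumptions of this sketch. Unlike the procedure described above, it does not check that each output is novel, since a swap of two identical tokens, for instance, leaves the sequence unchanged.

```python
import random


def perturb(tokens, vocab, rng):
    """Apply one random edit: swap, delete, insert or substitute."""
    tokens = list(tokens)
    ops = ["insert"]                      # always possible, even when empty
    if len(tokens) >= 1:
        ops += ["delete", "substitute"]
    if len(tokens) >= 2:
        ops.append("swap")
    op = rng.choice(ops)

    if op == "swap":
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    elif op == "delete":
        del tokens[rng.randrange(len(tokens))]
    elif op == "insert":
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(vocab))
    else:  # substitute
        tokens[rng.randrange(len(tokens))] = rng.choice(vocab)
    return tokens


def perturbation_chain(tokens, vocab, steps=30, seed=0):
    """Return the sequence after 1, 2, ..., `steps` cumulative perturbations."""
    rng = random.Random(seed)
    chain = [list(tokens)]
    for _ in range(steps):
        chain.append(perturb(chain[-1], vocab, rng))
    return chain[1:]


if __name__ == "__main__":
    toy_vocab = ["the", "pond", "was", "full", "of", "coke"]  # hypothetical
    for step in perturbation_chain(["the", "pond", "was", "full"], toy_vocab, steps=5):
        print(" ".join(step))
```

Keeping every intermediate step, rather than only the final one, is what allows error to be plotted against the number of perturbations.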
Figure 5 visualizes GPT2-medium’s mean estimation error for these 15M sequences across two dimensions. On the x-axis, we plot the target probability of the sequence under $p_{\mathcal{L}}$, and on the y-axis, the number of perturbations performed on the sequence. For example, the bottom-left corner (1) of Figure 5 visualizes estimation behaviour for rare sequences sampled directly from $p_{\mathcal{L}}$, whereas the top-left corner (2) visualizes estimation error for sequences which are equally rare, but which have been perturbed up to 30 times.
Figure 5 offers a nuanced characterization of underestimation behaviour. The brown area on the bottom of the figure restates the underestimation findings of the previous section. When increasing the number of perturbations performed, however, we begin entering into a space of sequences which are at first well-estimated by $p_{\mathcal{M}}$ (the white areas) but then are quickly overestimated by $p_{\mathcal{M}}$ (the dark blue areas), confirming that there are indeed sequences in $\Sigma^*$ which are overestimated by the language model. Furthermore, these findings suggest that the tail of rare events defined by the language model does not match the tail of the artificial language: the rare events typical in $p_{\mathcal{L}}$ are underrepresented in $p_{\mathcal{M}}$ in favour of other sequences in $\Sigma^*$. See Section A.1.4 for experiments finding that random sequences from $\Sigma^*$ are also overestimated by $p_{\mathcal{M}}$.
4.5 Modulating the Shape of the Target Distribution
Up to this point, the target artificial language $p_{\mathcal{L}}$ was given as the distribution induced by an ancestral sampling scheme with a fixed softmax temperature $T$ from the generative model $\mathcal{L}$. In the previous section, we saw that $p_{\mathcal{M}}$ placed excess probability mass on areas in $\Sigma^*$ with low probability under $p_{\mathcal{L}}$. Here we modulate the shape of the sequence space defined by $p_{\mathcal{L}}$ to investigate how estimation error varies with respect to systematic interventions in the target distribution. To adjust the way that $\mathcal{L}$ allocates probability mass over $\Sigma^*$, we control the entropy of the conditional distributions at each generation step, $p_{\mathcal{L}}(x_t \mid \mathbf{x}_{<t})$, by dividing the pre-softmax logits by a temperature value $T$. (Formally, for the $i$th component of the length-$|\Sigma|$ pre-softmax logits $\mathbf{u}$, this operation is given as $\text{softmax}(\mathbf{u})_i = \exp(u_i / T) / \sum_{j=1}^{|\Sigma|} \exp(u_j / T)$.)
We visualize the effects of $T$ on the shape of the distribution in the left of Figure 6. By increasing the value of $T$, we increase the entropy of the distributions over next tokens, which, in turn, spreads probability mass across a larger number of sequences in $\Sigma^*$. We define four artificial languages with varying $T$ and train GPT2-medium on an ancestral sample of 1M sequences from each of these artificial languages. Figure 6 visualizes model estimation error by true sequence probability for each model. We find that models trained on languages with increased entropy perform comparatively better than models trained on low-entropy languages. Estimation error for models trained on languages with greater $T$ is less severe, and this holds throughout nearly all target sequence probabilities. These results indicate that neural LMs provide a more accurate approximation of target distributions which spread mass more uniformly across $\Sigma^*$.
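The effect of temperature on the entropy of a next-token distribution is easy to check numerically. The logits and temperature values below are arbitrary illustrations, not the settings used in the experiments.

```python
import math


def tempered_softmax(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [u / temperature for u in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
    for t in (0.5, 1.0, 2.0):
        probs = tempered_softmax(toy_logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Raising the temperature flattens each conditional distribution, so entropy rises towards its maximum of $\log |\Sigma|$; lowering it concentrates mass on the highest-scoring tokens.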
5 Related Work
This paper contributes to recent work investigating the properties of the distributions defined by LMs. Prior studies have focused on exploring (takahashi2019; takahashi2017neural) and developing frameworks (meister2021language) to better understand whether the large-scale statistical tendencies of natural language, such as Zipf’s law (zipf1949human), are captured by LMs. We take a more fine-grained approach, proposing a methodology which draws on instance-level evaluation schemes (zhong2021larger) and the experimental control afforded by artificial corpora (whitecotterellartificial; zipfjurafskistructure). Indeed, closely related to our work is moderecovery’s, in which artificial corpora produced by generative models were used to explore mode recovery in neural language modeling. Our analysis exploring the overestimation of ill-formed sequences extends previous findings on locally normalized conditional models assigning arbitrary probability mass to unlikely sequences (andor2016globally; goyal2019empirical; labelbiaslafferty), neural LMs assigning high likelihood to sequences with repetitions (Welleck2020Neural), the consistency of decoding algorithms (welleck2020consistency), and on machine translation models placing significant probability mass on the empty sequence (stahlbergbyrne2019nmt).
We additionally contribute to the body of work seeking to characterize and adapt neural model performance on rare or novel examples and classes (devilintails; bengio2015battle). In the context of language modeling, lent2021language explored performance on under-resourced languages, whereas orenetal2019distributionally did so on underrepresented domains in training corpora. mccoy2021much introduced analyses to assess sequential and syntactic novelty in LMs. Focusing on the word frequency distribution, dudy2020words found that LMs underperform when less frequent examples are encountered at test time. In the classification setting, various approaches have been proposed to help alleviate class imbalance in the data distribution, such as data augmentation (sagawa2020investigation) or the transfer of knowledge from high-frequency classes to infrequent ones (Ouyang_2016_CVPR; Zhu_2014_CVPR; chen2021non). Prior to the current neural paradigm (bengio2003neural), multiple approaches were proposed to deal with the heavy tail, such as smoothing and back-off approaches in statistical n-grams (chen1999empirical) and two-stage Bayesian approaches (johnsonPL2006).
6 Conclusion
Emerging as a result of a language user’s ability to produce and comprehend novel expressions, the heavy tail of rare events is one of the fundamental features of distributions in natural language. In this work, we introduce a controlled methodology to evaluate instance-level LM performance on this set of individually rare but collectively frequent events. We use generative models trained on natural language corpora to define a set of artificial languages for which we can exactly compute the probability of sequences. Training LSTM and Transformer LMs on sequences sampled from these artificial languages, our analysis compares the probability estimates given to sequences by the LMs to the target probabilities of sequences under the artificial language.
Our results indicate that neural LMs systematically underrepresent sequences in the tail of the target distribution, even when increasing the amount of training data. Investigating where this probability mass went, our perturbation experiments reveal that neural LMs do not tend to overestimate the head of the distribution, but rather overestimate the probability of sequences outside those typical in the target distribution. Comparing model performance on target distributions with varying properties, we find that neural LMs tend to provide more accurate approximations of distributions with greater entropy. Interpreted together, these results indicate that autoregressive neural language models have a tendency to spread probability mass too uniformly across the space of possible sequences.
Finally, we would like to acknowledge that we do not know the degree of structural difference between our Transformer-generated ground-truth distributions and the distributions of actual natural languages. It is likely that the distribution defined by our ground-truth models is less structured than the distribution of a natural language. Therefore, it is possible that some systematic difference between natural language distributions and our ground-truth distributions may affect our results to a certain degree. That being said, our experiments in Section 4.5 suggest that it may actually be easier for neural LMs to learn less structured distributions, and we expect the task of recovering a ground-truth distribution to be made easier when the target distribution and LM are in the same model class. Nevertheless, future work should seek to conduct similar experiments using ground-truth distributions with more explicit structure.
References
Appendix A Appendix
A.1 Additional Experiments
A.1.1 Pre-Trained Ground-Truth Model
In our previous experiments, our ground-truth model was given as a Transformer language model trained on 1.5M sequences from the OpenWebText corpus. Here we explore underestimation behaviour when the ground-truth distribution is given by a pre-trained GPT2-medium model fine-tuned on 1.5M sequences from the OpenWebText corpus. We sample from this model with softmax temperature $T$.
Analogously to the experiment in Section 4.1, we train a randomly-initialized GPT2-medium model on 1M sequences sampled from the fine-tuned model. In the center of Figure 7, we visualize mean model estimation error for test sequences as a function of true sequence probability. Similarly to our experiments in Section 4.1, we find that the LM underestimates the probability of the majority of sequences, and does so more severely for less probable sequences. Note that this LM obtains a test perplexity of .
Finally, to ensure that our findings regarding the overestimation of ill-formed sequences hold, we compute model estimation error on random sequences sampled from $\Sigma^*$ (see A.1.4 for details on how these sequences are constructed). Figure 7 center visualizes mean model estimation error as a function of target sequence probability. We find that $p_{\mathcal{M}}$ overestimates the majority of ill-formed sequences, indicating that these findings hold when the ground-truth distribution is defined using a pre-trained model.
A.1.2 Untempered (T = 1.00) Ground-Truth Distribution
In Sections 4.1 to 4.4, we conducted all experiments on an artificial language defined by an ancestral sampling scheme with a tempered softmax. In Section 4.5, we saw that the underestimation findings held regardless of $T$. To provide further evidence that these results hold for other values of $T$, we conduct similar experiments as in Section 4.3 with an artificial language defined by an ancestral sampling scheme with $T = 1.00$. Specifically, we train GPT2-medium and GPT2-large on a total of 30M sequences sampled from $p_{\mathcal{L}}$, and we compute model estimation error on a set of withheld test sequences at each training iteration.
A.1.3 Pre-Trained Model Estimation Error
In Section 4.3, we explored how estimation error varies as a function of the amount of training data, finding that while increased data reduces estimation error, the underestimation behaviour persists. As an alternative way to explore how underestimation varies with increasing data, we fine-tune Huggingface’s pre-trained GPT2-small, medium and large models on the set of 1M sequences used in Section 4.1. Computing estimation error on a set of unseen test sequences, we find, analogously to our experiments on models trained from scratch, that pre-trained models underestimate the probability of the majority of sequences sampled from the target distribution, and do so more severely for rarer sequences (we visualize this in Figure 9). Furthermore, in Figure 10, we increase the amount of fine-tuning data substantially, and we plot test sequence estimation behaviour at the end of each epoch for a pre-trained GPT2-medium model. Once again, increased training data lessens but does not eliminate the underestimation behaviour.
A.1.4 Alternative Perturbations
In Section 4.4, we study where the language model’s underestimated probability mass went by computing model estimation error on perturbed sequences. We obtain a set of perturbed sequences by (i) sampling a sequence from the ground-truth distribution and then (ii) recursively perturbing this sequence according to the perturbations provided in Table 3.
This method provides us with sequences which are increasingly far away from high-probability zones under the ground-truth distribution. However, it does so with initial sequences sampled directly from that distribution, and as a result, produces strings which are edit-adjacent to its high-probability zones.
To ensure that our results hold in other low-probability subspaces, we conduct an analogous experiment on randomly sampled sequences. We construct each sequence by first sampling a sequence length from a Poisson distribution with a fixed rate parameter. Given this length, we sample that many tokens from the vocabulary and concatenate them to form a sequence. We then score the sequence under both the ground-truth distribution and the language model, and compute model estimation error. The artificial language is GPT2-medium with a fixed softmax temperature, and the language model is a GPT2-medium model trained on 30M sequences ancestrally sampled from this artificial language. Figure 11 visualizes mean estimation error as a function of the ground-truth probability of the sequence. As in all other perturbation experiments, we find that the model overestimates these sequences, regardless of their true sequence probability.
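The random-sequence construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the experiments: the Poisson rate (`rate=10.0`), the toy vocabulary, and the choice to draw tokens uniformly and independently are all hypothetical, and `estimation_error` assumes that estimation error is the difference between model and ground-truth log-probabilities.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_random_sequence(vocab, rate, rng):
    """Sample a length from Poisson(rate), then that many tokens from vocab.

    Assumption: tokens are drawn uniformly and independently.
    """
    length = sample_poisson(rate, rng)
    return [rng.choice(vocab) for _ in range(length)]


def estimation_error(log_p_model, log_p_true):
    """Assumed error: model log-probability minus ground-truth log-probability.

    Positive values mean the model overestimates the sequence.
    """
    return log_p_model - log_p_true


rng = random.Random(0)
toy_vocab = ["the", "of", "bacon", "senate", "robot"]  # hypothetical vocabulary
sequences = [sample_random_sequence(toy_vocab, rate=10.0, rng=rng) for _ in range(1000)]
mean_length = sum(len(s) for s in sequences) / len(sequences)
```

Note that a Poisson draw can be zero, so a pipeline built on this sketch would need to decide how to treat empty sequences. Scoring each sequence under the two distributions is left out here because it depends on the models involved.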
A.1.5 Estimation Error by Sequence Length
Autoregressive neural language models decompose the joint distribution over sequences into a series of conditional distributions. Generating a sequence, then, involves estimating one conditional distribution per token. Since the probability of a sequence is inversely correlated with its length, our finding that estimation error is worse for rarer sentences may be explained by compounding errors.
To test this claim, we ask whether the observed estimation error is worse than would be expected if it were due to error compounding. Specifically, in Figure 12, we plot the expected model estimation error (black) and the mean observed error (blue) by sequence length. Expected estimation error for a given sequence length is computed by multiplying the average token-level estimation error by the sequence length. Observed estimation error is computed as the mean estimation error for test sentences of that length. The shaded areas around this curve denote the 95% bootstrapped confidence intervals for this mean. As shown in Figure 12, observed estimation error for both GPT2-small and GPT2-medium is more severe than expected estimation error as we increase sentence length. This suggests that estimation error for longer (and typically rarer) sequences is not solely due to error compounding.
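The comparison between expected and observed error can be illustrated with a short sketch. It assumes that each test sequence comes with a sequence-level estimation error (for instance a log-probability difference) and that the average token-level error is the total error divided by the total number of tokens; both are assumptions made for illustration, as are the function names and the toy data.

```python
import random
from collections import defaultdict


def expected_and_observed_error(records):
    """records: list of (sequence_length, sequence_error) pairs.

    Returns {length: (expected_error, observed_mean_error)}, where the expected
    error is length * (total error / total tokens).
    """
    total_error = sum(err for _, err in records)
    total_tokens = sum(length for length, _ in records)
    mean_token_error = total_error / total_tokens

    by_length = defaultdict(list)
    for length, err in records:
        by_length[length].append(err)

    return {
        length: (length * mean_token_error, sum(errs) / len(errs))
        for length, errs in sorted(by_length.items())
    }


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data (hypothetical): longer sequences are underestimated more per token.
toy_records = [(2, -1.0), (2, -1.0), (4, -4.0), (4, -4.0)]
result = expected_and_observed_error(toy_records)
```

In this toy example the observed mean error at length 4 is more negative than the expected error at that length, which is the qualitative pattern that would indicate more than simple error compounding.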
A.2 Empirically Measuring the LNRE Zone
In Section 2, we formally defined the LNRE zone as the range of sample sizes for which there is nonzero probability of sampling a novel event at the next draw. Here we introduce a frequentist estimator for this probability. This in turn allows us to empirically verify whether a sample lies in the LNRE zone.
A.2.1 Estimating the Potential Productivity
Suppose we have a set of events drawn from some generative process. Given this set, we aim to obtain an empirical estimate for the potential productivity: the amount of probability allocated to unseen events as a function of the sample size. We can do so using the Good-Turing estimate for the probability of an event given its frequency (goodturing).
Specifically, consider the frequency with which each event occurs in the sample, and count the number of types (unique events) that occur with each frequency. Good-Turing says that for a large sample, the probability of an event given that it has occurred with a particular frequency in our sample is equal to
(3) 
To obtain an estimate for the potential productivity, we set this frequency to zero:
(4) 
which states that the total amount of probability mass allocated to unseen events is equal to the proportion of events which occurred only once (hapax legomena) in . This quantity is known as the potential productivity of a linguistic process (baayen200941; Baayen2001WordFD; baayen1994productivity).
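As an illustration, the estimator described above can be sketched as follows. The sketch uses the standard unsmoothed form of the Good-Turing estimate, in which a type seen m times in a sample of size N receives probability (m + 1) / N multiplied by the ratio of the number of types seen m + 1 times to the number seen m times; this notation and the function names are assumptions made for the example. The potential productivity is then the number of hapax legomena divided by the sample size.

```python
from collections import Counter


def frequency_spectrum(events):
    """Map each frequency m to the number of types observed exactly m times."""
    type_counts = Counter(events)
    return Counter(type_counts.values())


def good_turing_probability(m, spectrum, sample_size):
    """Good-Turing estimate for one type seen m >= 1 times.

    Uses the unsmoothed form (m + 1) / N * V_{m+1} / V_m.
    """
    if spectrum[m] == 0:
        raise ValueError(f"no types with frequency {m}")
    return (m + 1) / sample_size * spectrum[m + 1] / spectrum[m]


def potential_productivity(events):
    """Estimated mass of unseen events: hapax legomena count / sample size."""
    events = list(events)
    if not events:
        raise ValueError("empty sample")
    return frequency_spectrum(events)[1] / len(events)


# Toy sample: "a" occurs 3 times, "b" twice, and "c", "d", "e" once each.
sample = "a b a c d a b e".split()
productivity = potential_productivity(sample)
```

For the toy sample, three of the eight events are hapax legomena, so the estimated probability of drawing an unseen event next is 3/8. In practice the unsmoothed estimate is unreliable for large m, where the frequency spectrum becomes sparse, but only the m = 0 case is needed here.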
A.2.2 The LNRE Zone in OpenWebText
We apply this method to a subset of OpenWebText, a popular language modeling corpus. In Figure 13, we plot the empirical estimate of the potential productivity as a function of the sample size for n-grams sampled from a subset of OpenWebText. Particularly for higher-order n-grams, we find that there is a significant probability of sampling a previously unseen event, even at large sample sizes.
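A curve of this kind can be computed in a single pass over a stream of n-grams by tracking the number of hapax legomena as the sample grows. The sketch below is a minimal illustration on a toy sentence: the helper names are hypothetical, and reading n-grams in corpus order rather than sampling them is an assumption made for simplicity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def productivity_curve(events, checkpoints):
    """Hapax proportion after the first N events, for each N in checkpoints."""
    counts = Counter()
    hapaxes = 0
    curve = {}
    targets = set(checkpoints)
    for seen, event in enumerate(events, start=1):
        counts[event] += 1
        if counts[event] == 1:
            hapaxes += 1  # a new type enters as a hapax
        elif counts[event] == 2:
            hapaxes -= 1  # a repeated type stops being a hapax
        if seen in targets:
            curve[seen] = hapaxes / seen
    return curve


text = "the cat sat on the mat and the cat sat on the rug".split()
unigram_curve = productivity_curve(ngrams(text, 1), [4, 13])
trigram_curve = productivity_curve(ngrams(text, 3), [4, 11])
```

Even on this toy input, the estimate for trigrams at the end of the stream is higher than the estimate for unigrams, reflecting that longer n-grams repeat less often.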
A.3 Model Perplexity Values
In this section we report relevant perplexity values for all models used. For each model, we report mean perplexity across sentences drawn from (i) the validation set generated by the artificial language they attempt to model and (ii) the OpenWebText corpus. While we include the perplexity values for our models on sentences of English, this is not to claim that our ground-truth models are meant to define a distribution which closely resembles the distribution of English.
| Model | Mean perplexity (val) | Mean perplexity (eng) |
| --- | --- | --- |
| LSTM | 36.53 | 158.31 |
| GPT2-small | 26.66 | 129.84 |
| GPT2-medium | 25.42 | 132.56 |
| GPT2-small (pre-trained) | 21.02 | 53.79 |
| GPT2-medium (pre-trained) | 20.87 | 48.42 |
| GPT2-large (pre-trained) | 20.90 | 47.30 |
| LSTM (increased data) | 35.01 | 149.02 |
| GPT2-medium (increased data) | 21.60 | 94.03 |
| GPT2-large (increased data) | 21.32 | 91.28 |
| GPT2-medium (ground-truth model) | | 73.79 |

| Softmax T | Mean perplexity (val) | Mean perplexity (eng) |
| --- | --- | --- |
| 0.70 | 11.61 | 223.60 |
| 0.85 | 25.42 | 132.56 |
| 1.00 | 75.30 | 106.21 |
| 1.15 | 290.69 | 106.29 |

A.4 Language Samples
he was a complete east end player .  
a former harvard university graduate  
told cnn that in recent weeks , u.s. intelligence officials have  
begun to gather evidence that trump ’s campaign  
colluded with russia to influence the election .  
in the current study , we examined whether  
participants in the study performed more or less  
“ active ” in weight loss lrb p = 0.05 rrb .  
you , the one that is the republican candidate ,  
who is taking over the senate and government  
as a democrat and who is a bipartisan democrat ,  
and you ’ve got to be able to get that done .  
since the 1970s , the city has been in the midst of a landmark urban pride . 
” i have no doubt that ’s a big part of any kind of campaign machine , ” he told al jazeera .  
and it ’s not a very good idea to work with claims in comics and film novels .  
according to the new york times , susan macmahon , 54 ,  
and her husband , elizabeth bailey , were in the vehicle with their son ,  
aged between 12 and 19 .  
anticipating experience  
this is the reason i ’d like to take advantage of these ideas  
in an attempt to transform my life . 
beautiful bacon cookies  
” this accident , despite most of its life ,  
is dependent on one of the most precious bodies on this planet ,  
which is comparable to that on mostrughenai mountains .  
it ’s not fair to say that we ’re not in a position to permit same   
sex marriages , and all the same issues that are presented on regular basis . ”  
from video lrb without editing rrb you can see original images produced at  
the official website , competitors , and event prices .  
consumer electronics have become a perfect fit for bm ’s ultra  low  cost turn . 
zy brace gym fingerprint – bom jerozo dell ’  
shortly thereafter , will be a buoyant as the dynamo bust series starts one .  
disjitor doom efeco became a gothicrevolution manager ofby  
city library for marvel studios .  
earlier this week chicago ’s writer andrew o’t text ’  
american catholics to interested readers the holiday kingsburyobee lrb expiration rrb  
at red for now atmotionweism race and bachelor whitney university ,  
register as 1852 jim banner of airbus beer  
and of course smiled at us in formula 1 where he tells friends by name , ” finnish catholics :  
the encryptings on robot  type mothers .  
middleware in germany 