
Evaluating Distributional Distortion in Neural Language Modeling

A fundamental characteristic of natural language is the high rate at which speakers produce novel expressions. Because of this novelty, a heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language (Baayen, 2001). Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate. As a result, we have relatively little understanding of whether neural LMs accurately estimate the probability of sequences in this heavy-tail of rare events. To address this gap, we develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages from which we can exactly compute sequence probabilities. Training LMs on generations from these artificial languages, we compare the sequence-level probability estimates given by LMs to the true probabilities in the target language. Our experiments reveal that LSTM and Transformer language models (i) systematically underestimate the probability of sequences drawn from the target language, and (ii) do so more severely for less-probable sequences. Investigating where this probability mass went, (iii) we find that LMs tend to overestimate the probability of ill-formed (perturbed) sequences. In addition, we find that this underestimation behaviour (iv) is weakened, but not eliminated, by greater amounts of training data, and (v) is exacerbated for target distributions with lower entropy.


1 Introduction

Figure 1: GPT2 sequence probability estimates plotted against the true sequence probabilities. Neural LMs underestimate the probability of sequences drawn from the language they are trained to model.

Natural language is fundamentally creative—speakers and listeners frequently produce and comprehend sentences which have never been produced before (fodor1975language; fodor1988connectionism; chomsky1975logical). As a side-effect of this property, distributions in natural language are characterized by a heavy-tail of individually improbable events which collectively account for a significant amount of the total probability mass of the distribution (Khmaladze1988TheSA; Baayen2001WordFD). Precisely approximating this large number of rare events is one of the foundational challenges for models of natural language (good-turing; Jelinek1980InterpolatedEO; katz; kneser-ney; wood2011sequence; goldwater11a). Autoregressive neural language models (bengio2003neural; mikolov2013linguistic; Radford2019LanguageMA) attempt to do so by decomposing the probability of an event (a sequence) into a series of conditional distributions, each parameterized by a shared neural network.

Recently, a growing body of work has sought to understand how these language models (LMs) fit the distribution of a language beyond standard measures such as perplexity. meister2021language, for example, investigated the statistical tendencies of the distribution defined by neural LMs, whereas mode-recovery explored whether they adequately capture the modes of the distribution they attempt to model. At the same time, increased focus has been given to performance on rare or novel events in the data distribution, both for models of natural language (mccoy2021much; lent2021language; dudy2020words; oren-etal-2019-distributionally) and neural models more generally (see, for example, sagawa2020investigation; d2021tale; chen2021non; BlevinsZettlemoyer2020; czarnowska-etal-2019-dont; devil-in-tails; Ouyang_2016_CVPR; bengio2015battle; Zhu_2014_CVPR). Neither of these branches of work, however, has explored instance-level LM performance on rare sequences in the distribution. As a result, we have relatively little understanding of how neural LMs approximate sequences in the heavy-tail characteristic of natural language.

In this work, we introduce a controlled methodology to explore how LMs estimate the probability of sequences in the heavy-tail of the distribution. Our instance-level evaluation scheme explicitly compares the target probability distribution of the language to the distribution defined by the LM. Since the true distribution of any natural language is in practice unknown, we use a Transformer LM trained on natural data as a generative model to define target artificial languages for which we can exactly compute sequence probabilities. Training LSTM and Transformer LMs on sequences sampled from these target artificial languages, we compare the sequence-level probability estimates given by neural LMs to the target probabilities in the language. By controlling the entropy of the generative model's conditional distributions, we create a set of artificial languages with varying distributional properties, and analyze how LM estimation behaviour is modulated by the properties of the target distribution.

Our experiments uncover the extent to which neural LMs provide a distorted fit of the language they are trained to model. We find that LSTM and Transformer LMs (i) systematically underestimate the probability of sequences drawn from the target language and (ii) do so more when such sequences are rare. Where did this underestimated probability mass go? We do not find that the underestimation is accompanied by overestimation in the head of the distribution. Rather, we find that LMs tend to (iii) overestimate the probability of rare perturbed (ill-formed) sequences. Interpreted together, these findings indicate that on the one hand, neural LMs under-represent well-formed sequences in the tail of the language they attempt to model, and on the other hand, over-represent ill-formed sequences far away from high-probability zones in sequence-space. In addition, we find that (iv) greater amounts of training data lessen underestimation but do not eliminate it and that (v) underestimation is exacerbated for target distributions with lower entropy.

2 Background

We begin by briefly characterizing why distributions with a large number of rare events (LNRE) emerge in natural language, and why these events pose challenges for LMs. Furthermore, we motivate the need for instance-level evaluation when dealing with a large number of rare events.

Productivity

In the context of language production, a language user has the ability to produce, at any given point in their linguistic lifespan, an utterance which they have never produced before. This creativity is the result of the generative property of productivity, which states that on the basis of finite linguistic experience, a language user can produce and comprehend an unbounded number of grammatically acceptable utterances (chomsky1975logical). Productive processes induce a distribution which places non-zero probability on unseen events at all practical sample sizes. Because of this property, many of the distributions in natural language—particularly the distribution over the sequences of a language—are characterized by a heavy-tail of rare events.

LNRE Zone

To make explicit the connection between productivity and a heavy-tail of rare events, let $\mathcal{P}(N)$ denote the probability of sampling a novel (previously unseen) event from some distribution after having sampled $N$ events. Productivity as described above states that $\mathcal{P}(N) > 0$ for all sample sizes $N$ that occur in practice. The range of sample sizes for which $\mathcal{P}(N) > 0$ is known as the LNRE zone (Khmaladze1988TheSA; Baayen2001WordFD). The LNRE zone for natural language appears to be very large, and it seems likely that $\mathcal{P}(N)$ will remain greater than zero for samples of natural language many orders of magnitude larger than all the data currently available for training LMs (for an empirical validation of this claim on a sample of practical size from the OpenWebText corpus, see the Appendix). In the LNRE zone, it is difficult to obtain accurate estimates of the probability of events using straightforward maximum likelihood estimation (MLE). Accounting for this enormous amount of novelty is thus a central challenge in language modeling.

Language modeling

A model of the language attempts to define a distribution $q_\theta$ which closely resembles the true distribution of the language $p$. In a locally-normalized autoregressive model, this distribution is defined by assigning probabilities to variable-length sequences $\mathbf{w} = (w_1, \dots, w_n)$ via a chain-rule decomposition:

$$q_\theta(\mathbf{w}) = \prod_{t=1}^{n} q_\theta(w_t \mid \mathbf{w}_{<t}) \qquad (1)$$

where each $w_t$ belongs to the vocabulary $V$, and $q_\theta(w_t \mid \mathbf{w}_{<t})$ is the non-negative score of token $w_t$ given the sequence prefix $\mathbf{w}_{<t}$, which is computed by a neural network with parameters $\theta$.

For $q_\theta$ to perfectly approximate $p$, we expect $q_\theta(\mathbf{w}) = p(\mathbf{w})$ for all $\mathbf{w} \in V^*$, where $V^*$ is the set of all strings of finite length (the Kleene closure of $V$). In the LNRE zone, $p$ is defined over a support containing a very large set of sequences which have never occurred in a training corpus (or equivalently, have all occurred with frequency 0), and which take on a very wide array of differing probabilities. For example, while the sequences $\mathbf{w}_1$ and $\mathbf{w}_2$ below have likely never occurred in any sample of English, most would agree that $\mathbf{w}_1$ is far more probable than $\mathbf{w}_2$:

  • $\mathbf{w}_1$: The East pond in Parc Lafontaine was filled to the brim with Diet Coke.

  • $\mathbf{w}_2$: Certain nak indicate liberationing among theorter codity voters vandalized.

LM Evaluation

For a perfect LM of English, we would expect the estimated probabilities of the sequences $\mathbf{w}_1$ and $\mathbf{w}_2$ to match their probabilities under the true distribution $p$. However, since $p$ and its underlying generative process are unknown, it is not possible to explicitly evaluate how closely instance-level probability estimates align. As a proxy, the mean perplexity of the model on a holdout set of sequences is typically used, which measures whether the model, on average, assigns high likelihood to unseen instances. This measure does not, however, tell us whether instance-level estimates align with their true counterparts, nor is it necessarily indicative of performance on rare, idiosyncratic events in $p$. In this way, the lack of access to the ground-truth distribution severely complicates LM evaluation on the heavy-tail of rare sequences in language. The following section introduces a methodology to overcome these limitations.
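Before moving on, here is a minimal sketch (not from the paper) of the aggregate metric in question: per-token perplexity over a holdout set, computed from previously collected per-token log probabilities. Two models can share this number while assigning very different probabilities to individual rare sequences, which is exactly the gap the paper targets.

```python
import math

def mean_perplexity(token_log_probs_per_seq):
    """token_log_probs_per_seq: list of per-sequence lists of natural-log token probabilities."""
    total_log_prob = sum(sum(seq) for seq in token_log_probs_per_seq)
    total_tokens = sum(len(seq) for seq in token_log_probs_per_seq)
    return math.exp(-total_log_prob / total_tokens)   # exp of the mean negative log-likelihood

# Illustrative values only: two short "sequences" with made-up token log-probabilities.
print(mean_perplexity([[-2.3, -0.7, -1.1], [-4.6, -0.2]]))
```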

3 Language Model Evaluation in the LNRE Zone

Component | Notation | Description
Generative model | $G$ | A LM trained on natural language data.
Artificial language | $p$ | The distribution over sequences induced by a sampling scheme from $G$.
Language model | $q_\theta$ | The distribution of a LM trained on sequences sampled from $p$.
Target probabilities | $p(\mathbf{w})$ | The probability assigned by $p$ to the sequence $\mathbf{w}$.
Model probabilities | $q_\theta(\mathbf{w})$ | The probability assigned by $q_\theta$ to the sequence $\mathbf{w}$.
Table 1: Components of our instance-level evaluation scheme. Training $q_\theta$ on samples from $p$, we compare $q_\theta(\mathbf{w})$ to $p(\mathbf{w})$ for $\mathbf{w} \sim p$.

We propose evaluating language model performance on the heavy-tail of rare events via a known probability distribution over sequences. Specifically, we train a Transformer LM on sequences sampled from a corpus of natural language to define a generative model $G$. The distribution over sequences induced by a sampling scheme from $G$, denoted $p$, is then our artificial language. We expect a model of this artificial language to assign probabilities to sequences which match their target probabilities under $p$. To characterize neural LM behaviour on rare events, we train Transformer and LSTM LMs $q_\theta$ on data sampled from $p$, and compare the instance-level probability estimates given by $q_\theta$ to the target probabilities under $p$. We summarize the components of this methodology in Table 1, and overview it in greater detail in the following section.

3.1 Artificial Languages as Target Distributions

“we’re very excited to have the opportunity to help them,” he says.
so what’s going to happen?
to me, the fisheries are in the midst of a global financial crisis.
Table 2: Sample of sequences drawn from our artificial language.

To define a generative model $G$, we train a randomly-initialized GPT2-medium on 1.5M sentences sampled from the OpenWebText corpus (Gokaslan2019OpenWeb). (All Transformer implementations were obtained from Huggingface, and training was done on two or four RTX-8000 GPUs, depending on model size, with mixed floating point precision.) We set the maximum sequence length to be 128 tokens. We additionally train a byte-pair-encoding (BPE) tokenizer on this data set with a standard GPT2 vocabulary size of 57,256 tokens. For simplicity, this tokenizer is used for all models.

Using this generative model $G$, we define the target distribution over sequences—the artificial language $p$—as the distribution induced by an ancestral sampling scheme from $G$. Thus, we draw instances from our language by recursively sampling from the conditional distribution over tokens at each time step, $w_t \sim G(\cdot \mid \mathbf{w}_{<t})$, starting from a beginning-of-sequence token and stopping at an end-of-sequence token. All experiments up until Section 4.5 are conducted on the distribution induced by ancestrally sampling from $G$ with softmax temperature $T = 0.85$. (We temper in an effort to define a ground-truth distribution whose entropy more closely resembles that of a natural language. Note that our findings hold for untempered ($T = 1.00$) ground-truth distributions as well; see A.1.1 and A.1.2.) In Section 4.5, we explore the effects of different values of $T$ when sampling from $G$. Table 2 shows three sequences sampled from this distribution.
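For concreteness, the following is a minimal sketch of ancestral sampling with a softmax temperature. It is not the authors' released code: the pretrained "gpt2" checkpoint and its tokenizer are stand-ins for the paper's own GPT2-medium generative model and BPE tokenizer, and the default temperature simply mirrors the value used for the main artificial language.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # assumption: stand-in for the paper's tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # assumption: stand-in for the generative model G

@torch.no_grad()
def ancestral_sample(temperature: float = 0.85, max_len: int = 128) -> str:
    # Start from a beginning-of-sequence token and sample token by token.
    ids = torch.tensor([[tokenizer.bos_token_id]])
    for _ in range(max_len):
        logits = model(ids).logits[0, -1]                    # conditional scores for the next token
        probs = torch.softmax(logits / temperature, dim=-1)  # temper the logits, then normalize
        next_id = torch.multinomial(probs, num_samples=1)    # draw w_t ~ G(. | w_<t)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:         # stop at the end-of-sequence token
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(ancestral_sample())
```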

3.2 Sequence Probability Estimation in the LNRE Zone

Given an artificial language $p$, the task of $q_\theta$—the language model—is to define a distribution whose sequence-level probability estimates closely align with the sequence-level probabilities given by $p$. We refer to any deviation from this desideratum as model estimation error. To quantify the model estimation error for a sequence $\mathbf{w}$, we take the difference between the sequence's true log probability under $p$ and its log probability under $q_\theta$:

$$\text{error}(\mathbf{w}) = \log p(\mathbf{w}) - \log q_\theta(\mathbf{w}) \qquad (2)$$

This quantity is the log probability ratio, which measures, in log-space, the number of times more or less likely the sequence is under the language model $q_\theta$ than under the target language. Note that $\text{error}(\mathbf{w}) > 0$ indicates that $q_\theta$ underestimates the probability of $\mathbf{w}$, whereas $\text{error}(\mathbf{w}) < 0$ indicates that $q_\theta$ overestimates the probability of $\mathbf{w}$. In practice, we train $q_\theta$ on a set of sequences $D_{\text{train}}$ sampled from $p$, and compute model estimation error on a separate set of sequences $D_{\text{test}}$ sampled from $p$. In all cases, we compute the probability of a sequence via its chain-rule decomposition, as in Equation (1). When computing the ground-truth sequence probabilities under $p$, we take into account any softmax tempering.
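A minimal sketch of this computation, under the assumption of Huggingface-style causal LMs and the notation reconstructed above: both the target generative model and the trained LM score a token-id sequence (beginning with a BOS token), and the estimation error is the log-ratio of Equation (2). The tempering of the target model's logits reflects the softmax temperature used to define $p$.

```python
import torch

@torch.no_grad()
def sequence_log_prob(model, ids: torch.Tensor, temperature: float = 1.0) -> float:
    """Sum of log-probabilities of tokens ids[1:], each conditioned on its prefix."""
    logits = model(ids.unsqueeze(0)).logits[0]                    # [seq_len, vocab]
    log_probs = torch.log_softmax(logits / temperature, dim=-1)   # apply tempering before normalizing
    token_lp = log_probs[:-1].gather(1, ids[1:].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item()

def estimation_error(g_model, q_model, ids: torch.Tensor, target_temperature: float = 0.85) -> float:
    log_p = sequence_log_prob(g_model, ids, target_temperature)   # log p(w), tempering taken into account
    log_q = sequence_log_prob(q_model, ids)                       # log q_theta(w)
    return log_p - log_q      # > 0: q_theta underestimates w; < 0: q_theta overestimates w
```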

3.3 Neural Language Models

We study the estimation performance of two neural LM architectures: the Transformer (vaswani2017attention) and the LSTM (Melis2020Mogrifier). When training either architecture, we halve the learning rate if validation loss increases at the end of an epoch. For all model sizes, we use a batch size of 128 sequences. Models with the lowest cross-entropy loss on a withheld validation set are used in experiments unless otherwise mentioned.

We use the Huggingface (wolf-etal-2020-transformers) implementations of GPT2-small, GPT2-medium and GPT2-large (Radford2019LanguageMA) as representative Transformer LMs. We use the Adam optimizer, with learning rates chosen separately for GPT2-small, -medium and -large. Since these models are in the same model class as the generative model defining our artificial target language $p$, we expect the task of recovering the ground-truth distribution to be relatively easy compared to the true problem faced in modeling natural language, where both the distribution and the underlying generative process are unknown. For the LSTM (hochreiter1997long), we follow the implementation of the baseline LM described in (Melis2020Mogrifier). We use 2 layers and adjust the hidden state and embedding dimension (2048 and 1024, respectively) to be such that the total number of parameters is approximately equal to GPT2-small (110M).

4 Results

4.1 Estimation Error with Fixed Data

Figure 2: Test sequence probability estimates given by neural LMs. Three left-most figures: The joint histograms of sequence probability estimates. The dotted line denotes the cases in which the model’s estimates perfectly align with the target probability; shading to the right of this line denotes underestimation. Right-most figure: Mean sequence estimation error by target sequence probability.

We begin by exploring model estimation error on a fixed training set $D_{\text{train}}$ of 1M sequences sampled from $p$. We first train LSTM and GPT models on $D_{\text{train}}$, early-stopping as described above. Following training, we sample a test set $D_{\text{test}}$ of 500,000 sequences from $p$, and score each sequence under both the model distribution $q_\theta$ and the true language distribution $p$. From this, we obtain a set of probability estimate pairs $\{(p(\mathbf{w}), q_\theta(\mathbf{w})) : \mathbf{w} \in D_{\text{test}}\}$. If the model perfectly models the language $p$, then for each sequence we would expect $q_\theta(\mathbf{w}) = p(\mathbf{w})$. Figures 2(A) and 2(B) visualize this relationship with the x- and y-axes denoting the true and model-estimated sequence probabilities respectively, and a dashed line representing equality. To compare probability estimates, we represent the set of pairs in the form of a joint histogram over this coordinate space. Histogram bins are shaded based on the number of tuples which lie in the coordinate range they define. Importantly, any deviation of this histogram from the identity line indicates that the model distorts the shape of the distribution of the language.

Figures 2(A) and 2(B) provide evidence for distributional distortion in the form of underestimation. The majority of probability tuples lie to the right of the identity line, indicating that LSTM and GPT2 models consistently underestimate the probability of sequences sampled from $p$. Furthermore, the distance between the identity line and the probability tuples grows non-linearly as a function of the true sequence probability, indicating that underestimation error is more severe for rarer sequences in the language. We validate these observations in the right-most plot of Figure 2, which shows mean estimation error decreasing non-linearly as a function of the target sequence probability. (To compute this curve, we split the target sequence probability range into equally sized bins, by probability range, and report the mean estimation error for each bin containing a sufficient number of sequences. We additionally compute bootstrapped 95% confidence intervals for each mean, resampling as many sequences as lie in the given bin.)
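The binned error curve could be computed along the following lines. This is only a sketch: the bin count, minimum bin size, and number of bootstrap resamples are illustrative assumptions, since the exact values are not reproduced here, and the sign convention follows Equation (2) as reconstructed above.

```python
import numpy as np

def binned_error_curve(log_p_target, log_q_model, n_bins=50, n_boot=1000, min_count=10, seed=0):
    """Mean estimation error per target-probability bin, with bootstrapped 95% CIs."""
    rng = np.random.default_rng(seed)
    log_p_target = np.asarray(log_p_target)
    log_q_model = np.asarray(log_q_model)
    error = log_p_target - log_q_model                     # Eq. (2): > 0 means underestimation
    edges = np.linspace(log_p_target.min(), log_p_target.max(), n_bins + 1)
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        errs = error[(log_p_target >= lo) & (log_p_target < hi)]
        if errs.size < min_count:                          # skip sparsely populated bins
            continue
        boot = [rng.choice(errs, size=errs.size, replace=True).mean() for _ in range(n_boot)]
        curve.append(((lo + hi) / 2, errs.mean(),
                      np.percentile(boot, 2.5), np.percentile(boot, 97.5)))
    return curve    # (bin center, mean error, CI lower, CI upper)
```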

In addition, comparing underestimation behaviour across model size, we find that while GPT2-medium performs slightly better than GPT2-small, these improvements are typically within the range defined by the bootstrapped 95% confidence intervals. See A.1.3 for evidence indicating that this underestimation behaviour also occurs in pre-trained models fine-tuned on $p$.

4.2 Estimation Error across Training Time

To understand the training dynamics underlying the previously reported underestimation, we compute model probability estimates on subsets of $D_{\text{train}}$ and $D_{\text{test}}$ at the end of each training iteration. Once computed, we sort each set of probability tuples by their target sequence probabilities $p(\mathbf{w})$, and split the probability tuples into 50 equally-sized bins. We plot estimation curves in Figure 3: each curve represents a 50th of the sequences, with darker curves denoting estimation error for sequences with lower target probabilities (rarer sequences). At any given point, then, the distance between estimation curves represents the degree to which estimation error is dependent on the target probability of the sequence.

Figure 3 left visualizes underestimation error for sequences seen in training. Around the fifth epoch, estimation error for train sentences converges to zero, that is, $q_\theta(\mathbf{w}) \approx p(\mathbf{w})$, indicating that GPT2-medium is able to almost perfectly recover the target probabilities of training sequences no matter their target probability. At the same time, this convergence happens almost simultaneously for all sequences, indicating that a complete reduction in error during training occurs throughout the entire range of target sequence probabilities.

Figure 3: Mean model estimation error by training epoch (GPT2-medium). Each line denotes the mean estimation error for a 50th of the sequences; darker lines represent less probable sequences. The shaded area denotes the area in which validation cross-entropy reaches a minimum.

Figure 3 right visualizes the GPT2-medium model's performance on the separate set of test sequences $D_{\text{test}}$. First, unlike for $D_{\text{train}}$, estimation error for $D_{\text{test}}$ does not converge to zero, meaning that even when the model has perfectly recovered the target probability of train sequences, the target probabilities of test sequences remain underestimated. Second, in the case of $D_{\text{train}}$, the difference between estimation curves of different shades converges to zero, indicating that estimation performance becomes uniform across the distribution of train sequences. We do not see such behaviour for $D_{\text{test}}$. Instead, the error curves remain at a relatively consistent distance from one another, indicating that the discrepancy in estimation error at different parts of the distribution is unchanging for sequences not seen during training.

4.3 Estimation Error by Amount of Training Data

Figure 4: GPT2-medium, GPT2-large and LSTM trained on 30M sequences sampled from $p$. Main plots: Mean estimation error on test sequences as a function of the number of sequences seen in training. Inset plots: Relative change in mean estimation error of test sequences as a function of the number of sequences seen in training. In both cases, each line denotes estimation behaviour for a 50th of the test sequences; darker lines represent less probable sequences.

Our previous experiment trained language models on a set of 1M sequences. A plausible explanation for the model's underestimation behaviour on unseen test sequences is therefore that the language model has not seen enough samples from the target distribution. Here we explore how estimation error varies as a function of the amount of training data. We train GPT2-medium, GPT2-large and an LSTM model in the online "Ideal World" setting (Nakkiran2020online) by sampling, at the beginning of each training iteration, a fresh set of 500,000 sequences from $p$, and training on this sample. Doing so for 60 iterations, we obtain LMs which have been trained on 30 million sequences. We compute model estimation error on $D_{\text{test}}$ at the end of each iteration. Figure 4 visualizes underestimation error throughout training for these LMs. We again split test sequences by their true probability, with darker lines denoting estimation trends for less probable target sequences.

The estimation curves in Figure 4 suggest that while increasing the amount of data in training initially leads to lower estimation error, this reduction eventually asymptotes. In the insets of Figure 4, we visualize the relative change in mean estimation error between consecutive iterations. Relative change in estimation error eventually fluctuates around 0 (minimal change) for the majority of the distribution. Comparing architectures, we find that the Transformer is significantly more efficient at reducing mean estimation error throughout the distribution.

4.4 Where did the probability mass go?

Figure 5: GPT2-medium estimation behaviour for 15M sequences in $V^*$ across two dimensions: sequence rarity (x-axis) and degree of perturbation (y-axis). The heat map is shaded based on estimation error severity; blue areas indicate overestimation, whereas brown areas indicate underestimation. We also include example sequences from two zones in this sequence space.

In the previous section, we saw that even when increasing the amount of training data, $q_\theta$ consistently underestimates the probabilities of sequences sampled from the tail of $p$. At the same time, we did not find that $q_\theta$ overestimated sequences in the head of $p$. Under the assumption that $q_\theta$ is a proper probability distribution, that is, $\sum_{\mathbf{w} \in V^*} q_\theta(\mathbf{w}) = 1$ (see welleck2020consistency for discussion on consistency in the context of neural language model decoding), these findings suggest that there exist sequences in $V^*$ whose probability is overestimated by the model. In this section, we investigate where this probability mass went.

Swap two tokens in $\mathbf{w}$.
Delete a token from $\mathbf{w}$.
Insert a token from $V$ at a position in $\mathbf{w}$.
Substitute a token in $\mathbf{w}$ with a token from $V$.
Table 3: The perturbation function $\pi$ randomly applies one of these perturbations to $\mathbf{w}$.

To do so, we compute model estimation error on perturbed sequences from $D_{\text{test}}$—sequences in $V^*$ which are increasingly far away from the high-probability zones in $p$. We build a corpus of perturbed sequences by recursively applying 30 random perturbations to each sequence $\mathbf{w} \in D_{\text{test}}$. Formally, the set of sequences at perturbation step $k$ can be expressed as $D_k = \{\pi(\mathbf{w}) : \mathbf{w} \in D_{k-1}\}$, where $\pi$ is a function which returns a novel perturbed version of $\mathbf{w}$, and $D_0 = D_{\text{test}}$. Sequence perturbation operations are shown in Table 3. While it is possible that these operations produce other well-formed strings, we expect this to be a relatively rare outcome. We score each of these 15M sequences under both the target generative model $G$ and the LM $q_\theta$. Note that we use as LM the GPT2-medium model from the previous section (trained on 30M sequences).
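A sketch of the perturbation function $\pi$ over token-id sequences is given below. The uniform choice among the four operations and the omission of an explicit novelty check are simplifying assumptions; the operations themselves follow Table 3.

```python
import random

def perturb(tokens: list, vocab_size: int) -> list:
    """Apply one randomly chosen perturbation from Table 3 to a token-id sequence."""
    tokens = list(tokens)
    op = random.choice(["swap", "delete", "insert", "substitute"])
    if op == "swap" and len(tokens) >= 2:
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]                  # swap two tokens in w
    elif op == "delete" and len(tokens) >= 2:
        del tokens[random.randrange(len(tokens))]                    # delete a token from w
    elif op == "insert":
        tokens.insert(random.randrange(len(tokens) + 1),
                      random.randrange(vocab_size))                  # insert a token from V
    else:
        tokens[random.randrange(len(tokens))] = random.randrange(vocab_size)  # substitute with a token from V
    return tokens

def perturb_k_times(tokens: list, k: int, vocab_size: int) -> list:
    """Recursive perturbation: an element of D_k obtained from an element of D_0."""
    for _ in range(k):
        tokens = perturb(tokens, vocab_size)
    return tokens
```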

Figure 5 visualizes GPT2-medium's mean estimation error for these 15M sequences across two dimensions. On the x-axis, we plot the target probability of the sequence under $p$, and on the y-axis, the number of perturbations performed on the sequence. For example, the bottom-left corner (1) of Figure 5 visualizes estimation behaviour for rare sequences sampled directly from $p$, whereas the top-left corner (2) visualizes estimation error for sequences which are equally rare, but which have been perturbed up to 30 times.

Figure 5 offers a nuanced characterization of underestimation behaviour. The brown area on the bottom of the figure re-states the underestimation findings of the previous section. When increasing the number of perturbations performed, however, we begin entering a space of sequences which are at first well-estimated by $q_\theta$ (the white areas) but then quickly overestimated by $q_\theta$ (the dark blue areas), confirming that there are indeed sequences in $V^*$ which are overestimated by the language model. Furthermore, these findings suggest that the tail of rare events defined by the language model does not match the tail of the artificial language—the rare events typical in $p$ are under-represented in $q_\theta$ in favour of other sequences in $V^*$. See Section A.1.4 for experiments finding that random sequences from $V^*$ are also overestimated by $q_\theta$.

4.5 Modulating the Shape of the Target Distribution

Figure 6: Model estimation error on test sequences as a function of target sequence probability for three different artificial languages. Each line visualizes estimation error for a GPT2-medium model trained to model a language with a specific softmax temperature parameter $T$.

Up to this point, the target artificial language $p$ was given as the distribution induced by an ancestral sampling scheme with tempered softmax from the generative model $G$. In the previous section, we saw that $q_\theta$ placed excess probability mass on areas of $V^*$ with low probability under $p$. Here we modulate the shape of the sequence space defined by $p$ to investigate how estimation error varies with respect to systematic interventions in the target distribution. To adjust the way that $p$ allocates probability mass over $V^*$, we control the entropy of the conditional distributions at each generation step by dividing the pre-softmax logits by a temperature value $T$. (Formally, for the $i$-th component of the pre-softmax logit vector $\mathbf{z}$, this operation is given as $z_i / T$, applied before normalization.)

We visualize the effects of $T$ on the shape of the distribution in the left of Figure 6. By increasing the value of $T$, we increase the entropy of the distributions over next tokens, which in turn spreads probability mass across a larger number of sequences in $V^*$. We define four artificial languages with varying $T$ and train GPT2-medium on an ancestral sample of 1M sequences from each of these artificial languages. Figure 6 visualizes model estimation error by true sequence probability for each model. We find that models trained on languages with increased entropy perform comparatively better than models trained on low-entropy languages. Estimation error for models trained on languages with greater $T$ is less severe, and this holds throughout nearly all target sequence probabilities. These results indicate that neural LMs provide a more accurate approximation of target distributions which spread mass more uniformly across $V^*$.
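A toy illustration of the tempering operation (the logit values below are made up): dividing the pre-softmax logits by a larger $T$ flattens each conditional distribution, raising its entropy and spreading probability mass over more continuations. The temperature values match those used in Table 5.

```python
import torch

logits = torch.tensor([4.0, 2.0, 1.0, 0.5, 0.1])          # toy pre-softmax scores for 5 next tokens
for T in (0.70, 0.85, 1.00, 1.15):
    probs = torch.softmax(logits / T, dim=-1)              # the z_i / T operation, then normalize
    entropy = -(probs * probs.log()).sum()
    print(f"T={T:.2f}  max prob={probs.max().item():.3f}  entropy={entropy.item():.3f} nats")
```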

5 Related Work

This paper contributes to recent work investigating the properties of the distributions defined by LMs. Prior studies have focused on exploring (takahashi2019; takahashi2017neural) and developing frameworks (meister2021language) to better understand whether the large-scale statistical tendencies of natural language, such as Zipf's law (zipf1949human), are captured by LMs. We take a more fine-grained approach, proposing a methodology which draws on instance-level evaluation schemes (zhong2021larger) and the experimental control afforded by artificial corpora (white-cotterell-artificial; zipf-jurafski-structure). Indeed, closely related to our work is that of mode-recovery, in which artificial corpora produced by generative models were used to explore mode recovery in neural language modeling. Our analysis exploring the overestimation of ill-formed sequences extends previous findings on locally normalized conditional models assigning arbitrary probability mass to unlikely sequences (andor2016globally; goyal2019empirical; label-bias-lafferty), neural LMs assigning high likelihood to sequences with repetitions (Welleck2020Neural), the consistency of decoding algorithms (welleck2020consistency), and machine translation models placing significant probability mass on the empty sequence (stahlberg-byrne-2019-nmt).

We additionally contribute to the body of work seeking to characterize and adapt neural model performance on rare or novel examples and classes (devil-in-tails; bengio2015battle). In the context of language modeling, lent2021language explored performance on under-resourced languages, whereas oren-etal-2019-distributionally did so on under-represented domains in training corpora. mccoy2021much introduced analyses to assess sequential and syntactic novelty in LMs. Focusing on the word frequency distribution, dudy2020words found that LMs under-perform when less frequent examples are encountered at test time. In the classification setting, various approaches have been proposed to help alleviate class imbalance in the data distribution, such as data augmentation (sagawa2020investigation) or the transfer of knowledge from high-frequency classes to infrequent ones (Ouyang_2016_CVPR; Zhu_2014_CVPR; chen2021non). Prior to the current neural paradigm (bengio2003neural), multiple approaches were proposed to deal with the heavy-tail, such as smoothing and back-off in statistical n-gram models (chen1999empirical) and two-stage Bayesian approaches (johnsonPL2006).

6 Conclusion

Emerging as a result of a language user’s ability to produce and comprehend novel expressions, the heavy-tail of rare events is one of the fundamental features of distributions in natural language. In this work, we introduce a controlled methodology to evaluate instance-level LM performance on this set of individually rare but collectively frequent events. We use generative models trained on natural language corpora to define a set of artificial languages for which we can exactly compute the probability of sequences. Training LSTM and Transformer LMs on sequences sampled from these artificial languages, our analysis compares the probability estimates given to sequences by the LMs to the target probabilities of sequences under the artificial language.

Our results indicate that neural LMs systematically under-represent sequences in the tail of the target distribution, even when increasing the amount of the training data. Investigating where this probability mass went, our perturbation experiments reveal that neural LMs do not tend to overestimate the head of the distribution, but rather overestimate the probability of sequences outside those typical in the target distribution. Comparing model performance on target distributions with varying properties, we find that neural LMs tend to provide more accurate approximations of distributions with greater entropy. Interpreted together, these results indicate that autoregressive neural language models have a tendency to spread probability mass too uniformly across the space of possible sequences.

Finally, we would like to acknowledge that we do not know the degree of structural difference between our Transformer-generated ground-truth distributions and the distributions of actual natural languages. It is likely that the distribution defined by our ground truth models is less structured than the distribution of a natural language. Therefore, it is possible that some systematic difference between natural language distributions and our ground-truth distributions may affect our results to a certain degree. That being said, our experiments in Section 4.5 suggest that it may actually be easier for neural LMs to learn less structured distributions, and we expect the task of recovering a ground-truth distribution to be made easier when the target distribution and LM are in the same model class. Nevertheless, future work should seek to conduct similar experiments using ground-truth distributions with more explicit structure.

References

Appendix A Appendix

A.1 Additional Experiments

A.1.1 Pre-Trained Ground-Truth Model

In our previous experiments, our ground-truth model $G$ was given as a Transformer language model trained on 1.5M sequences from the OpenWebText corpus. Here we explore underestimation behaviour when the ground-truth distribution is instead given by a pretrained GPT2-medium model fine-tuned on 1.5M sequences from the OpenWebText corpus. We sample from this fine-tuned model with a tempered softmax.

Analogously to the experiment in Section 4.1, we train a randomly-initialized GPT2-medium model on 1M sequences sampled from the fine-tuned model. In the center of Figure 7, we visualize mean model estimation error for test sequences as a function of true sequence probability. Similarly to our experiments in Section 4.4, we find that the LM underestimates the probability of the majority of sequences, and does so more severely for less probable sequences. Note that this LM obtains a test perplexity of .

Figure 7: Left: Joint histogram of sequence probability estimates for test sequences. Center: Mean model estimation error by true sequence probability for test sequences. Right: Mean model estimation error by true sequence probability for sequences randomly sampled from $V^*$.

Finally, to ensure that our findings regarding the overestimation of ill-formed sequences hold, we compute model estimation error on random sequences sampled from $V^*$ (see A.1.4 for details on how these sequences are constructed). Figure 7 (right) visualizes mean model estimation error as a function of target sequence probability. We find that $q_\theta$ overestimates the majority of ill-formed sequences, indicating that these findings hold when the ground-truth distribution is defined using a pretrained model.

A.1.2 Untempered (T = 1.00) Ground-Truth Distribution

In Sections 4.1 to 4.4, we conducted all experiments on an artificial language defined by an ancestral sampling scheme with $T = 0.85$. In Section 4.5, we saw that the underestimation findings held regardless of $T$. To provide further evidence that these results hold for other values of $T$, we conduct similar experiments as in Section 4.3 with an artificial language defined by an ancestral sampling scheme with $T = 1.00$. Specifically, we train GPT2-medium and GPT2-large on a total of 30M sequences sampled from this untempered language, and we compute model estimation error on a set of withheld test sequences at each training iteration.

Figure 8: GPT2-medium and GPT2-large trained on 30M sequences sampled from the artificial language with $T = 1.00$. Main plots: Mean estimation error on test sequences as a function of the number of sequences seen in training. Inset plots: Relative change in mean estimation error of test sequences as a function of the number of sequences seen in training. In both cases, each line denotes estimation behaviour for a 50th of the test sequences; darker lines represent less probable sequences.

A.1.3 Pre-Trained Model Estimation Error

In Section 4.3, we explored how estimation error varies as a function of the amount of training data, finding that while increased data weakens estimation error, the underestimation behaviour persists. As an alternative way to explore how underestimation varies with increasing data, we fine-tune Huggingface's pre-trained GPT2-small, -medium and -large models on the set of 1M sequences used in Section 4.1. Computing estimation error on a set of unseen test sequences, we find, analogously to our experiments on models trained from scratch, that pre-trained models underestimate the probability of the majority of sequences sampled from the target distribution, and do so more severely for rarer sequences (we visualize this in Figure 9). Furthermore, in Figure 10, we increase the amount of fine-tuning data substantially, and we plot test sequence estimation behaviour at the end of each epoch for a pre-trained GPT2-medium model. Once again, increased training data lessens but does not eliminate the underestimation behaviour.

Figure 9: Test sequence probability estimates given by pretrained neural LMs fine-tuned on 1M sequences sampled from $p$. Three left-most figures: The joint histograms of sequence probability estimates. The dotted line denotes the cases in which the model's estimates perfectly align with the target probability; shading to the right of this line denotes underestimation. Right-most figure: Mean sequence estimation error by target sequence probability.
Figure 10: Pre-trained GPT2-medium fine-tuned on 30M sequences sampled from $p$. Main plots: Mean estimation error on test sequences as a function of the number of sequences seen in fine-tuning. Inset plots: Relative change in mean estimation error of test sequences as a function of the number of sequences seen in fine-tuning. In both cases, each line denotes estimation behaviour for a 50th of the test sequences; darker lines represent less probable sequences.

A.1.4 Alternative Perturbations

In Section 4.4, we study where the language model's underestimated probability mass went by computing model estimation error on perturbed sequences. We obtain a set of perturbed sequences by (i) sampling a sequence from $p$ and then (ii) recursively perturbing this sequence according to the perturbations provided in Table 3.

This method provides us with sequences which are increasingly far away from high-probability zones under $p$. However, it does so with initial sequences sampled directly from $p$, and as a result, produces strings which are edit-adjacent to the high-probability zones (under $p$) in $V^*$.

Figure 11: Mean model estimation error by true sequence probability for sequences randomly sampled from $V^*$.

To ensure that our results hold in other low-probability subspaces, we conduct an analogous experiment on sequences randomly sampled from $V^*$. We sample a sequence from $V^*$ by first sampling a sequence length from a Poisson distribution. Given this length, we sample tokens from the vocabulary $V$ and concatenate all tokens to form a sequence. We then score the sequence under both $G$ and $q_\theta$, and compute model estimation error. We use as the artificial language the GPT2-medium generative model with the same softmax temperature as in Sections 4.1 to 4.4, and we use as the language model a GPT2-medium model trained on 30M sequences ancestrally sampled from it.
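A sketch of this sampling procedure follows; the Poisson mean is a placeholder, since the value is not reproduced in the text, and the vocabulary size matches the tokenizer described in Section 3.1.

```python
import numpy as np

def random_sequence(vocab_size: int, mean_length: float, seed=None) -> list:
    """Draw a 'random' element of V*: length ~ Poisson, tokens drawn from the vocabulary V."""
    rng = np.random.default_rng(seed)
    length = max(1, rng.poisson(mean_length))                  # at least one token
    return rng.integers(0, vocab_size, size=length).tolist()

# Example call; mean_length=20 is an illustrative placeholder.
print(random_sequence(vocab_size=57256, mean_length=20, seed=0))
```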

Figure 11 visualizes mean estimation error as a function of the ground-truth probability of the sequence. Similarly to all other perturbation experiments, we do indeed find that $q_\theta$ overestimates these sequences, regardless of their true sequence probability.

A.1.5 Estimation Error by Sequence Length

Autoregressive neural language models decompose the joint distribution over sequences into a series of conditional distributions, as in Equation (1). Generating a sequence of length $n$, then, involves estimating $n$ conditional distributions. Since the probability of a sequence is inversely correlated with its length, our finding that estimation error is worse for rarer sentences may be explained by compounding errors.

Figure 12: Expected and observed estimation error by sentence length for GPT2-small and GPT2-medium.

To test this claim, we ask whether the observed estimation error is worse than would be expected if it were due solely to error compounding. Specifically, in Figure 12, we plot the expected model estimation error (black) and the mean observed error (blue) by sequence length. Expected estimation error for sequence length $n$ is computed by multiplying the average token-level estimation error $\bar{\epsilon}$ (computed over all test tokens) by the sequence length, i.e., $n \cdot \bar{\epsilon}$. Observed estimation error is computed as the mean estimation error for test sentences of length $n$. Note that the shaded areas around this curve denote the 95% bootstrapped confidence intervals for this mean. As shown in Figure 12, observed estimation error for both GPT2-small and GPT2-medium is more severe than expected estimation error as we increase sentence length. This suggests that estimation error for longer (and typically rarer) sequences is not solely due to error compounding.
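A sketch of this comparison; the inputs are illustrative arrays of per-sequence errors, their lengths, and per-token errors, all measured with the convention of Equation (2).

```python
import numpy as np

def expected_vs_observed(seq_lengths, seq_errors, token_errors):
    """Compare n * (mean token-level error) against the observed mean error at each length n."""
    seq_lengths = np.asarray(seq_lengths)
    seq_errors = np.asarray(seq_errors)
    eps_bar = float(np.mean(token_errors))                 # average token-level estimation error
    out = {}
    for n in np.unique(seq_lengths):
        observed = seq_errors[seq_lengths == n].mean()     # mean error for test sentences of length n
        out[int(n)] = (n * eps_bar, observed)              # (expected under compounding, observed)
    return out
```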

A.2 Empirically Measuring the LNRE Zone

In Section 2, we formally defined the LNRE zone as the range of values of $N$ for which there is non-zero probability of sampling a novel event at the $N$-th draw. Here we introduce a frequentist estimator for this probability. This in turn allows us to empirically verify whether a sample lies in the LNRE zone.

A.2.1 Estimating the Potential Productivity

Suppose we have a sample $S$ of $N$ events drawn from some generative process. Given $S$, we aim to obtain an empirical estimate for the potential productivity $\mathcal{P}(N)$: the amount of probability allocated to unseen events as a function of $N$. We can do so using the Good-Turing estimate for the probability of an event given its frequency (good-turing).

Specifically, let $f_S(e)$ be a function which returns the frequency of the event $e$ in $S$. Let $N_r$ denote the number of types (unique events) in $S$ for which $f_S(e) = r$. Good-Turing says that for large $N$, the probability of an event $e$ given that it has occurred with frequency $r$ in our sample of size $N$ is approximately

$$\mathbb{P}(e \mid f_S(e) = r) \approx \frac{(r+1)}{N} \cdot \frac{N_{r+1}}{N_r} \qquad (3)$$

To obtain an estimate for $\mathcal{P}(N)$, we set $r = 0$ and sum this probability over the $N_0$ unseen types:

$$\mathcal{P}(N) = N_0 \cdot \frac{1}{N} \cdot \frac{N_1}{N_0} = \frac{N_1}{N} \qquad (4)$$

which states that the total amount of probability mass allocated to unseen events is equal to the proportion of events which occurred only once (hapax legomena) in $S$. This quantity is known as the potential productivity of a linguistic process (baayen200941; Baayen2001WordFD; baayen-1994-productivity).
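A minimal sketch of this hapax-based estimator over an arbitrary sample of events (for example, word $n$-grams from a tokenized corpus):

```python
from collections import Counter

def potential_productivity(events) -> float:
    """Hapax-based Good-Turing estimate of P(N) = N_1 / N, as in Eq. (4)."""
    counts = Counter(events)
    n = sum(counts.values())                               # sample size N (number of event tokens)
    n1 = sum(1 for c in counts.values() if c == 1)         # N_1: number of hapax legomena
    return n1 / n

# e.g., for word bigrams over a tokenized corpus:
# bigrams = list(zip(tokens, tokens[1:]))
# print(potential_productivity(bigrams))
```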

A.2.2 The LNRE Zone in OpenWebText

We apply this method to a subset of OpenWebText, a popular language modeling corpus. In Figure 13, we plot the empirical estimate of $\mathcal{P}(N)$ as a function of the sample size $N$ for $n$-grams sampled from a subset of OpenWebText. Particularly for higher-order $n$-grams, we find that there is significant probability of sampling a previously unseen event, even at the largest sample sizes we consider.

Figure 13: The probability of sampling a novel item (potential productivity $\mathcal{P}(N)$) as a function of sample size $N$. Many distributions in natural language are characterized by a potentially unbounded amount of novelty.

A.3 Model Perplexity Values

In this section we report relevant perplexity values for all models used. For each model, we report mean perplexity across sentences drawn from (i) the validation set generated by the artificial language they attempt to model and (ii) the OpenWebText corpus. While we include the perplexity values for our models on sentences of English, this is not to claim that our ground-truth models are meant to define a distribution which closely resembles the distribution of English.

Model | mean perplexity (val) | mean perplexity (eng)
LSTM | 36.53 | 158.31
GPT2-small | 26.66 | 129.84
GPT2-medium | 25.42 | 132.56
GPT2-small (pretrained) | 21.02 | 53.79
GPT2-medium (pretrained) | 20.87 | 48.42
GPT2-large (pretrained) | 20.90 | 47.30
LSTM (increased data) | 35.01 | 149.02
GPT2-medium (increased data) | 21.60 | 94.03
GPT2-large (increased data) | 21.32 | 91.28
GPT2-medium (ground-truth model) | - | 73.79
Table 4: Mean perplexity on the validation set generated by the artificial language the LM attempts to model (val), and on real sentences sampled from OpenWebText (eng).
Softmax temperature $T$ | mean perplexity (val) | mean perplexity (eng)
0.70 | 11.61 | 223.60
0.85 | 25.42 | 132.56
1.00 | 75.30 | 106.21
1.15 | 290.69 | 106.29
Table 5: GPT2-medium mean perplexity on the validation set generated by the artificial language the LM attempts to model (val), and on real sentences sampled from OpenWebText (eng), for the models used in Section 4.5.

A.4 Language Samples

he was a complete east end player .
a former harvard university graduate
told cnn that in recent weeks , u.s. intelligence officials have
begun to gather evidence that trump ’s campaign
colluded with russia to influence the election .
in the current study , we examined whether
participants in the study performed more or less
“ active ” in weight loss -lrb- p = 0.05 -rrb- .
you , the one that is the republican candidate ,
who is taking over the senate and government
as a democrat and who is a bipartisan democrat ,
and you ’ve got to be able to get that done .
since the 1970s , the city has been in the midst of a landmark urban pride .
Table 6: Sequences ancestrally sampled from the artificial language generated by GPT2-medium with softmax temperature $T = 0.70$.
” i have no doubt that ’s a big part of any kind of campaign machine , ” he told al jazeera .
and it ’s not a very good idea to work with claims in comics and film novels .
according to the new york times , susan macmahon , 54 ,
and her husband , elizabeth bailey , were in the vehicle with their son ,
aged between 12 and 19 .
anticipating experience
this is the reason i ’d like to take advantage of these ideas
in an attempt to transform my life .
Table 7: Sequences ancestrally sampled from the artificial language generated by GPT2-medium with softmax temperature $T = 0.85$.
beautiful bacon cookies
” this accident , despite most of its life ,
is dependent on one of the most precious bodies on this planet ,
which is comparable to that on mostrughenai mountains .
it ’s not fair to say that we ’re not in a position to permit same -
sex marriages , and all the same issues that are presented on regular basis . ”
from video -lrb- without editing -rrb- you can see original images produced at
the official website , competitors , and event prices .
consumer electronics have become a perfect fit for bm ’s ultra - low - cost turn .
Table 8: Sequences ancestrally sampled from the artificial language generated by GPT2-medium with softmax temperature $T = 1.00$.
zy brace gym fingerprint – bom jerozo dell ’
shortly thereafter , will be a buoyant as the dynamo bust series starts one .
disjitor doom efeco became a gothicrevolution manager ofby
city library for marvel studios .
earlier this week chicago ’s writer andrew o’t text ’
american catholics to interested readers the holiday kingsburyobee -lrb- expiration -rrb-
at red for now atmotionweism race and bachelor whitney university ,
register as 1852 jim banner of airbus beer
and of course smiled at us in formula 1 where he tells friends by name , ” finnish catholics :
the encryptings on robot - type mothers .
middleware in germany
Table 9: Sequences ancestrally sampled from the artificial language generated by GPT2-medium with softmax temperature $T = 1.15$.