Language modeling is the task of generating language from context. It is traditionally framed as an autoregressive task, in which a model predicts the conditional probability of the next token given the sequence of previously observed or generated tokens. Language modeling has seen a recent surge in relevance thanks to its success as a pretraining objective for self-supervised representation learning. The most prominent language models today are Transformer-based models (Vaswani et al., 2017) such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019). Language models are most commonly trained with backpropagation using standard NLP loss metrics such as cross-entropy, which reward the model for assigning high probability to text that appears commonly in the gold training corpus. The energy and computational resources needed to train a state-of-the-art language model from scratch are very high, to the point of impracticality for most researchers. One recent estimate suggests that training a single model with an architecture search uses more energy than five cars consume in their entire lifetimes (Strubell et al., 2019). This high cost of training from scratch is sidestepped by pretraining, in which a generic language model is trained by those with sufficient resources on a general dataset and released for use by other researchers. Once pretrained, the parameters of a language model can be updated for use in related downstream tasks through finetuning. A sufficiently general language model can be finetuned on a subdomain in order to generate text that matches the style and syntax of that specific domain (Howard and Ruder, 2018). While using pretrained models avoids creating a new model for each task, the cost of finetuning such large networks remains relatively high: finetuning to completion for a single task can easily take in excess of a day on multiple energy-intensive GPUs (Strubell et al., 2019).
Recent work analyzing the finetuning process has shown that it has high variability between runs and is particularly sensitive to seemingly arbitrary factors such as data ordering (Dodge et al., 2020). Those authors propose to overcome this variability by training models with many random seeds and keeping only the best, effectively trading computational efficiency for model performance. While this technique improves the final model, the reasons for the high variability between random seeds have yet to be explored. We hypothesize that much of this variability can be explained by the random selection of highly “informative” training examples, which most effectively capture low-level distributional statistics of a given corpus such as token unigram and bigram frequencies. If this is the case, then it should be possible to quickly screen for these informative training examples using a simple model.
Rather than their retrospective approach of testing many random seeds, we suggest an alternative, prospective approach to improving the robustness of the training procedure. In this paper, we show that most of the benefit of the finetuning process comes from learning low-level frequency-based distributional statistics of the training corpus. Based on this observation, we present a new technique for more efficient finetuning of language models that employs a secondary learner to estimate the usefulness of finetuning on each given training example. Our method is valuable both as a novel learning result and potentially as a means to mitigate the energy impact of deep language modeling, as our approach requires significantly fewer backpropagation steps than other techniques that trade computational power for performance to achieve finetuned models of equal quality.
2 Finetuning is Simplistic
We hypothesized that it is possible to determine whether a given example is worth learning from by examining only low-level features of that context, such as unigram frequency. To test this hypothesis, we performed an experiment in which we finetuned a language model on (1) real example sequences from a corpus, (2) artificial sequences constructed by independently sampling each token from the unigram frequency distribution of the corpus, or (3) sequences constructed by sampling tokens uniformly at random. We then measured the change in loss on a separate portion of the corpus. Figure 1 shows the results of this experiment. The average reduction in loss for examples constructed using the unigram frequency distribution is almost as high as for real examples. Thus, a significant amount of the benefit of training on contexts can be predicted merely from the unigram frequency distribution from which those contexts were derived, which is easily estimable without knowing the particular parameterization of the language model itself. This suggests that we can inexpensively estimate whether a given context generalizes well to the target corpus, and preferentially choose to train on those contexts over others.
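As a concrete sketch, conditions (2) and (3) of this experiment can be constructed as below. This is an illustrative reconstruction, not the authors' code; it assumes the corpus is already tokenized into integer token ids.

```python
import numpy as np

def unigram_sampled_sequences(corpus_tokens, n_seqs, seq_len, seed=None):
    """Condition (2): artificial sequences whose tokens are drawn i.i.d.
    from the corpus unigram frequency distribution."""
    rng = np.random.default_rng(seed)
    ids, counts = np.unique(corpus_tokens, return_counts=True)
    return rng.choice(ids, size=(n_seqs, seq_len), p=counts / counts.sum())

def uniform_sampled_sequences(vocab_size, n_seqs, seq_len, seed=None):
    """Condition (3): tokens drawn uniformly at random over the vocabulary."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, vocab_size, size=(n_seqs, seq_len))
```

Sequences from each condition can then be fed to the same finetuning procedure, and the change in held-out loss compared across conditions.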
Dodge et al. (2020) observed that the quality of a finetuning run can usually be established by looking at the trajectory of the loss curve very early in training. We therefore attempted to determine whether training on good contexts early is an important element of the variability in data order between finetuning runs. Figure 2 compares test perplexity after training on a randomly sampled first batch against the test perplexity after many randomly sampled batches. Good early batches improve the probability of converging to an ideal final value: the correlation between the test perplexity after a single batch and the test perplexity after 50 batches, which is near convergence for most runs, is moderately high.
We use this pair of observations, that (1) early data order is important, and (2) that it may be possible to ensure its quality, to motivate our approach to selectively modifying data order during finetuning to improve overall model performance. Specifically, if we carefully ensure that early batches are good, then we will likely end up with a superior model after convergence.
3 Information Gain Filtration
3.1 Overview of Method
Our method trains a secondary learner that attempts to predict the “informativeness” of each example sampled from a training corpus for finetuning the language model toward our chosen target corpus. We then set a threshold on this informativeness to determine whether to backpropagate on or skip each example during finetuning. By filtering examples in this way, we aim to reduce the effect of the variability in data order observed in previous work (Dodge et al., 2020) and improve the performance of our language model. Due to its intuitive similarity with the notion in deep Q-learning (Mnih et al., 2013) of using a network to approximate the expected value of a given action, we abbreviate this normalized informativeness metric as a “Q-value”.
Formally, we define our language model $LM_\theta$ as a conditional probability distribution induced by a set of parameters $\theta$ which, when conditioned on an ordered sequence of tokens $X = (t_1, \ldots, t_{n-1})$, outputs a probability distribution over the next token, $t_n$:
$$LM_\theta(X) = \hat{P}(t_n \mid t_1, \ldots, t_{n-1})$$
We define our loss function $\mathcal{L}$ as the perplexity of a given set of (sequence, next token) pairs that we denote the test set, $T$, under a given parameterization ($\theta$) of our language model:
$$\mathcal{L}(T, \theta) = \exp\left(\frac{1}{|T|} \sum_{(X, t_n) \in T} H\big(\delta_{t_n}, LM_\theta(X)\big)\right)$$
where $H$ denotes cross-entropy and $\delta_{t_n}$ denotes the one-hot probability distribution that assigns all of its probability mass to the token $t_n$.
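This perplexity loss can be sketched in a few lines; here `lm_probs` is a hypothetical stand-in for the language model's predicted next-token distribution, and contexts and tokens are assumed to be integer-encoded:

```python
import numpy as np

def perplexity(test_set, lm_probs):
    """Perplexity of a test set of (context, next_token) pairs: the
    exponential of the mean cross-entropy between the one-hot target
    and the model's predicted next-token distribution."""
    cross_entropies = []
    for context, token in test_set:
        p = lm_probs(context)                      # distribution over next token
        cross_entropies.append(-np.log(p[token]))  # H(delta_t, LM(context))
    return float(np.exp(np.mean(cross_entropies)))
```

For instance, a model that is uniformly uncertain over a vocabulary of size V has perplexity exactly V under this definition.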
When encountering a new training example $X$, the model has a choice between two actions:
Backprop, which updates the language model parameters by backpropagating the loss on $X$ and taking a single step of gradient descent, updating the parameters from $\theta$ to $\theta'$, and
Skip, which leaves the language model parameters unchanged.
We utilize a held-out subset of the training data, which we call the objective set $O$, to inform our decision about which contexts are informative. To quantify the informativeness of a given example, we compute the difference in perplexity measured against this objective set before and after training on that example. We denote this difference the information gain (IG):
$$IG(X) = \mathcal{L}(O, \theta) - \mathcal{L}(O, \theta')$$
where $\theta$ is the initial parameterization of the language model and $\theta'$ is the parameterization of the new language model after backpropagating on the loss associated with the training example $X$. For notational brevity, we write $IG(X)$ rather than $IG(\theta, \theta')$, since there exists an implicit direct bijection between examples $X$ and updated parameterizations $\theta'$.
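The single-step IG measurement can be written generically as below. The callables `loss` and `one_backprop_step` are hypothetical stand-ins for the language model's perplexity evaluation and training step, not part of any particular library:

```python
def information_gain(example, params, one_backprop_step, loss, objective_set):
    """IG(X): reduction in objective-set loss produced by a single
    gradient step on example X. `one_backprop_step` and `loss` are
    stand-ins for the language model's training step and perplexity
    evaluation."""
    before = loss(objective_set, params)             # L(O, theta)
    new_params = one_backprop_step(params, example)  # theta -> theta'
    after = loss(objective_set, new_params)          # L(O, theta')
    return before - after                            # positive = informative
```

Note that the parameters are restored (or the step discarded) after measurement when building a dataset of IG values, so that every example is scored against the same initial parameterization.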
Given an example, and regardless of the current language model parameters, we intend to estimate whether it is worthwhile to backpropagate over that example. Given our previous observation that early batches are especially important for finetuning, we expect that even though $IG(X)$ only measures the single-step change in perplexity during training, it will be a sufficient estimator of long-term data quality. To each of our two actions, we assign a value:
$$Q(\text{backprop}, X) = IG(X) - \tau, \qquad Q(\text{skip}, X) = 0$$
where $\tau$ is a free “threshold” parameter for deciding which $IG$ values are sufficiently high to warrant backpropagation. We call this technique information gain filtration, or simply IGF.
We construct a training dataset for our secondary learner by measuring $IG(X)$ for a randomly selected set of examples from our training set. Each entry in this dataset is a pair $(X, IG(X))$ of the input text and its associated information gain. Using this constructed dataset, we train a secondary learner $\widehat{IG}$ to approximate $IG(X)$ given $X$.
During language model finetuning, we apply our newly trained secondary learner $\widehat{IG}$ with a greedy policy: we backprop on an example $X$ if $\widehat{IG}(X) > \tau$, and skip it otherwise.
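The resulting filtered finetuning loop is simple; this sketch takes the secondary learner's prediction and the language model's update step as opaque callables (both hypothetical names):

```python
def igf_finetune(examples, params, predicted_ig, backprop_step, tau):
    """IGF loop sketch: backprop only when the secondary learner's
    predicted information gain exceeds the threshold tau, i.e. when
    Q(backprop, X) = IG_hat(X) - tau > 0; otherwise skip."""
    n_used = 0
    for X in examples:
        if predicted_ig(X) > tau:
            params = backprop_step(params, X)
            n_used += 1
    return params, n_used
```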
In practice, our secondary learner $\widehat{IG}$ represents the input text by embedding it with the 768-dimensional GPT-2 byte-pair embeddings. We then pass the input representations through a convolution with kernel width 3, followed by a max-pooling operation over the time axis and a 2-layer feedforward network. This architecture was refined through coordinate descent and evaluated on a separate held-out set of measured $IG$ values. The choice of architecture does not strongly affect method performance (see Appendix A, Figure 8). Additionally, a neural network is not necessary for the learner, as simpler learning methods are sufficient (see Figure 5).
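A minimal numpy sketch of this forward pass (width-3 convolution, max-pool over time, 2-layer feedforward head) is given below. The weight shapes are illustrative and biases are omitted; this is not the authors' exact configuration:

```python
import numpy as np

def secondary_learner_forward(emb, W_conv, W1, W2):
    """Forward pass sketch: width-3 convolution over token embeddings,
    max-pool over the time axis, then a 2-layer feedforward head that
    outputs a scalar predicted IG. Shapes: emb (T, d), W_conv (3, d, h),
    W1 (h, m), W2 (m,)."""
    T, _ = emb.shape
    k = W_conv.shape[0]
    conv = np.stack([np.tensordot(emb[t:t + k], W_conv, axes=([0, 1], [0, 1]))
                     for t in range(T - k + 1)])   # (T - 2, h)
    pooled = conv.max(axis=0)                      # max-pool over time -> (h,)
    hidden = np.maximum(0.0, pooled @ W1)          # ReLU hidden layer
    return float(hidden @ W2)                      # scalar IG estimate
```

The max-pool over time makes the learner largely order-invariant within the context, which is consistent with its reliance on low-level n-gram statistics.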
3.2 Scheduled Thresholding
Our method approximates $IG(X)$ using only the difference in loss between the initial pretrained model parameters $\theta$ and the model parameters after one backpropagation step $\theta'$. This means that the effectiveness of the learner at distinguishing “high quality” examples from “low quality” examples decays over the course of finetuning, as the parameters of the pretrained model diverge from their initial values: examples that are useful to train on at the beginning of finetuning are not necessarily useful later. To ameliorate this problem, IGF can be modified by changing $\tau$ over the finetuning process. Since $\widehat{IG}$ is most accurate at the first step, we have found that scheduling $\tau$ to alternate between highly selective (a high $\tau$ value) and highly permissive (a low $\tau$ value) is an effective strategy. This allows the model to take advantage of the accurate early predictions without overfitting once those predictions become less accurate later in finetuning. We find alternating $\tau$ to be superior to permanently decaying its selectivity because alternation enables the language model to exploit as many or as few high-quality examples as necessary without overfitting to them.
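An alternating threshold schedule can be as simple as the sketch below; the specific high/low values and period are illustrative placeholders, not tuned settings:

```python
def alternating_tau(step, high=1.5, low=-0.5, period=10):
    """Alternate the IGF threshold between a selective (high) and a
    permissive (low) value every `period` steps. The numbers here are
    placeholders for illustration only."""
    return high if (step // period) % 2 == 0 else low
```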
3.3 Iterated Information Gain Filtration
Instead of scheduling the selectivity of the secondary learner to taper off as the finetuning process continues, we might instead replace the learner periodically with a new learner trained on a new dataset of $(X, IG(X))$ pairs generated using the current parameterization of the language model. This process, which we call iterated information gain filtration (IIGF), allows us to replace the obsolete learner that was trained to predict $IG$ for early examples with a learner that is more relevant later in finetuning. IIGF has the added advantage of allowing us to keep $\tau$ high throughout finetuning, as secondary learner irrelevance is no longer a concern. This procedure is very computationally expensive, as the overhead of generating the new dataset and learner far exceeds the computational cost of finetuning. Nonetheless, it enables finer control of data order throughout the finetuning process and further improvements in final perplexity over IGF with scheduled thresholding.
4 Results

We analyze the effectiveness of our method from two perspectives: final model performance, and the method's ability to efficiently trade a computational budget for improved performance. We compare IGF directly to standard finetuning, which we define as basic batched stochastic gradient descent with Adam (Kingma and Ba, 2014) using random samples from the target corpus. For our tests, we used the pretrained GPT-2 Small Transformer model, a commonly used unidirectional language model, via the publicly available implementation in the transformers package (Wolf et al., 2019). We test our approach in two settings: a standard Books dataset (Zhu et al., 2015) and a “Mixed” dataset that is composed of training examples from two corpora (the Books corpus and a corpus of scraped Reddit comments (Huth et al., 2016); Intel authors did not use or process any data, and Intel does not control or audit third-party data) but whose test set comes only from one corpus (Books). The Books corpus allows us to fairly compare standard finetuning against IGF, whereas the Mixed corpus allows us to analyze the effectiveness of the method at separating informative contexts from uninformative ones. For both methods, batches of size 16 were used to train the language model with Adam. The convolutional network that we used for our secondary learner was similarly trained with Adam.
4.1 Corpus Separation
We created a dataset of 10,000 $(X, IG(X))$ pairs using an objective set of 160 examples drawn solely from the Books corpus, and used this dataset to train an example secondary learner. Next, this secondary learner was fed randomly sampled contexts from both the Books and Reddit corpora. Because the objective set contains examples from only one corpus, we expect the secondary learner to assign higher $IG$ values to other examples from that same corpus. Figure 3 shows that there is indeed a significant difference in the distributions of predicted $IG$ values between these two corpora, and thus that the corpora can be effectively separated by this simple secondary learner. Almost all examples from the Reddit corpus are expected by the IGF secondary learner to produce a reduction in perplexity at least one standard deviation below the mean. This indicates that the secondary learner identifies, with strong confidence, Books corpus examples as more informative for finetuning towards the Books objective than Reddit corpus examples. It is worth noting that the secondary learner achieves this separation despite having access to just 160 labelled examples of 32 tokens in our objective set, a total of just 5,120 tokens from the Books corpus, and no examples from the Reddit corpus.
4.2 Effectiveness of the Secondary Learner
The fully trained convolutional secondary learner achieves an MSE from the true $IG$ metric of just 0.21 standard deviations. Considering that the secondary learner only has access to low-level statistics of the contexts, this supports the view that whether or not an example is useful for language model finetuning is a question that is almost entirely answerable from simple statistics of the example itself, and does not require a learner as complex as the language model. That is, simple statistical regularities such as unigram, bigram, and trigram frequencies, which can be captured by the width-3 convolutional architecture of the secondary learner, are sufficient to adequately estimate the usefulness of an example during training. Since the secondary learner can be used to fully filter out the examples in the Mixed dataset that originated from the Reddit corpus, IGF performs equally well on the Mixed and Books corpora. This shows that finetuning with IGF is resilient to imperfections in the training set, such as incorrectly labelled data.
4.3 Language Model Finetuning
We next used the secondary learner to finetune our language model towards the target Books corpus. We generated 50 runs each of standard finetuning on training examples sampled from the Mixed corpus and, separately, from the easier Books corpus. We then generated 50 runs of IGF using two thresholding schedules, one with a fixed $\tau$ and one with an alternating $\tau$. Both types of IGF runs were performed on the more challenging Mixed corpus only. Figure 4 plots the averaged finetuning curves of these four conditions over 60 batches. IGF (green and red) significantly improves final test perplexity compared to standard finetuning on both the Mixed corpus (blue) and the Books corpus (orange). Standard finetuning on Books achieves a median perplexity of 57.3, compared to 56.9 for IGF with a constant threshold (green) and 54.0 for IGF with the alternating threshold schedule (red). All 50 runs of IGF with an alternating schedule outperformed all 50 standard finetuning runs, meaning that the improvements to data order that IGF achieves through selective sampling of informative contexts are far in excess of what might reasonably be achieved through random sampling of contexts during finetuning. Due to its computational expense, we also ran a smaller set of 5 tests of iterated information gain filtration, training a new secondary learner on an $(X, IG(X))$ dataset derived from a language model that had already been fully finetuned to the Books corpus. IIGF improved these already-converged models by an average of 0.29 additional perplexity points after reconverging, with a standard deviation of 0.11 points.
4.4 Choice of Secondary Learner
For the other results presented here, we used the simple convolutional neural network described in Section 3 as our secondary learner. However, it is generally not necessary to choose an end-to-end neural network as the secondary learner for approximating $IG$. Indeed, much simpler machine learning methods suffice for almost equal performance. Figure 5 shows predicted vs. actual normalized $IG$ values for several learning methods. While the convolutional neural network is most effective at approximating $IG$, other learners perform almost as well. We encoded the contexts both with the standard GPT-2 Small word embeddings and with a simple one-hot encoding of the token identities within each context. Standard linear regression on both encoding types performs nearly as well at approximating $IG$ as the convolutional model. We also tested an even simpler learner that assigns to each token a value computed by averaging the $IG$ values of every context in a held-out training set that contains that token; the $IG$ value of a new context in our test set is then predicted as the average of the values of the tokens it contains. Even this extremely simple model is a reasonably predictive approximator of $IG$. This underscores that while $IG$ is an extremely complex function to compute exactly, since it depends on the precise parameterization of the language model, it can nevertheless be effectively approximated using simple unigram information.
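The token-averaging learner just described can be sketched in a few lines; function names and the integer-token representation are illustrative:

```python
from collections import defaultdict

def fit_token_value_table(train_pairs):
    """Give each token the mean IG of the training contexts containing it.
    `train_pairs` is a list of (token_list, measured_ig) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, ig in train_pairs:
        for t in set(tokens):          # count each token once per context
            sums[t] += ig
            counts[t] += 1
    return {t: sums[t] / counts[t] for t in sums}

def predict_ig(tokens, table, default=0.0):
    """Predict a new context's IG as the mean value of its tokens."""
    return sum(table.get(t, default) for t in tokens) / len(tokens)
```

Despite ignoring all token order and interaction, a table of this form already captures much of the unigram signal that dominates $IG$.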
4.5 Efficiency of IGF
When compared to standard finetuning, IGF takes fewer batches to train to convergence. This reduces the number of backpropagation steps needed during training and improves runtime and energy usage during finetuning (see Appendix A, Figure 9). However, there is a significant constant-time overhead in generating the dataset on which the secondary learner is trained, so IGF improves efficiency only if the model will be trained many times, such as during a hyperparameter or architecture search. Dodge et al. (2020) showed that model performance can be improved by rerunning the finetuning process many times with different random seeds to find a good data order, and then choosing the best resulting model by testing against a validation set. Since IGF aims to replace this random search for a good seed with a principled, directed search, we would expect it to be significantly more effective than random seed testing in terms of the number of runs needed to achieve meaningfully improved performance. Figure 6 compares the computational efficiency of IGF against random seed testing: using IGF to improve data order significantly outperforms random seed testing.
5 Conclusion and Future Work
We have shown that a substantial portion of the finetuning process consists of merely updating the model's estimates of low-level distributional statistics, such as unigram and bigram frequencies. We then presented information gain filtration, a method for improving the data and energy efficiency of the finetuning process that uses this observation to efficiently estimate the usefulness of each example encountered during training. This data filtration technique yields significant improvements in final model performance and also converges with roughly 40% fewer batches than standard finetuning.
There are some significant questions that our research leaves for future study. Since the focus of our filtering method was on a lightweight technique without significant overhead, our secondary learner was a simple convolutional network that treated each example sequence as a bag of trigrams. Data efficiency during training could potentially be further improved if one were willing to sacrifice simplicity and energy efficiency by using a more complex model. The question of how far one could take a function-approximator network for estimating information gain remains unexplored.
Finally, we have left several questions of a more theoretical nature unanswered in our analysis here. Specifically, we lack an understanding of why improving performance on early batches results in a long-term improvement in model performance at convergence. Is this property exclusive to language models, or do other types of networks and learning tasks exhibit this phenomenon? What factors allow IGF to generalize in some cases and not in others, and can a more general, language-model-invariant method for filtering useful examples be developed? We believe these are exciting questions that cut close to the heart of a better understanding of the underlying processes of LM finetuning.
We would like to thank all of the people whose contributions, opinions and suggestions helped with the development of this project, especially Nicole Beckage, Kaj Bostrom, Greg Durrett, Shailee Jain, and Vy Vo. This project was funded by a generous grant from Intel.
References

- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., and Smith, N. A. (2020). Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.
- Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
- Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., and Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458.
- Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Strubell, E., Ganesh, A., and McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Wolf, T., et al. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.