Selecting Informative Contexts Improves Language Model Finetuning

We present a general finetuning meta-method that we call information gain filtration for improving the overall training efficiency and final performance of language model finetuning. This method uses a secondary learner which attempts to quantify the benefit of finetuning the language model on each given example. During the finetuning process, we use this learner to decide whether each given example should be trained on or skipped. We show that it suffices for this learner to be simple and that the finetuning process itself is dominated by the relatively trivial relearning of a new unigram frequency distribution over the modelled language domain, a process which the learner aids. Our method trains to convergence using roughly 40% fewer batches than standard finetuning, and achieves a median perplexity of 54.0 on a books dataset compared to a median perplexity of 57.3 for standard finetuning using the same neural architecture.




1 Introduction

Language modeling is the task of generating language from context. This is traditionally framed as an autoregressive task, where a model predicts the conditional probability of the next word based on the sequence of previously observed or generated tokens. Language modeling has seen a recent surge in relevance thanks to its success as a pretraining objective for self-supervised representation learning. The most prominent language models today are Transformer-based models (Vaswani et al., 2017) such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019). Language models are most commonly trained with backpropagation using traditional NLP loss metrics such as cross entropy. These loss metrics are designed so that the models are rewarded for assigning high probability to text that appears commonly in the gold training corpus. The energy and computational resources used to train a state-of-the-art language model from scratch are very high, to the point of impracticality for most researchers. One recent estimate suggests that training a single model with an architecture search takes more energy than five cars will use in their entire lifetimes (Strubell et al., 2019).

This high cost of training from scratch is sidestepped by pretraining, where a generic language model is trained by those with sufficient resources on a general dataset and released for use by other researchers. Once pretrained, the parameters of a language model can be updated for use in related downstream tasks through finetuning. A sufficiently general language model can be finetuned on a subdomain in order to generate text that matches the style and syntax of that specific domain (Howard and Ruder, 2018). While using pretrained models avoids having to create a new model for each task, the cost of finetuning such large networks is still relatively high. Finetuning to completion for a single task can easily take in excess of a day on multiple energy-intensive GPUs (Strubell et al., 2019).

Recent work analyzing the finetuning process has shown that it has high variability between runs and is particularly sensitive to seemingly arbitrary factors such as data ordering (Dodge et al., 2020). Those authors propose to overcome this variability by training models using many random seeds and then keeping only the best, effectively trading computational efficiency for model performance. While this technique improves the final model, the reasons for the high variability between random seeds have yet to be explored. We hypothesize that much of this variability can be explained by the random selection of highly “informative” training examples, which most effectively capture low-level distributional statistics of a given corpus such as token unigram and bigram frequencies. If this is the case, then it should be possible to quickly screen for these informative training examples using a simple model.

We suggest a prospective approach to improving the robustness of the training procedure, as an alternative to the retrospective approach of testing many random seeds suggested by Dodge et al. (2020). In this paper, we show that most of the benefit of the finetuning process comes from learning low-level frequency-based distributional statistics of the training corpus. Based on this observation, we present a new technique for more efficient finetuning of language models that employs a secondary learner to estimate the usefulness of finetuning on each given training example. Our method is valuable both as a novel learning result and potentially as a means to mitigate the energy impact of deep language modeling, as our approach requires significantly fewer backpropagation steps than other techniques that trade computational power for performance to achieve finetuned models of equal quality.

2 Finetuning is Simplistic

Figure 1: Learning the New Unigram Frequency Distribution Constitutes Most of the Benefit of Finetuning: These plots show the reduction in cross-entropy of a GPT-2 language model, tested on a Reddit corpus, after training on individual 32-token contexts sampled from different distributions. Each example consisted of a word along with the preceding 32 words of context. Positive values indicate that learning from that example resulted in reduced loss on the test dataset; negative values indicate increased loss. (left) Actual sequence from corpus. The language model learns something useful from every example when finetuned on text from the corpus. (middle) Random sequence with preserved word probabilities. For this sequence, 32 tokens are sampled to generate a context using the unigram probabilities for the Reddit corpus. Here the model also learns something useful from every example, despite being finetuned on scrambled text. (right) Random sequence with uniform word probabilities. When the unigram probability distribution is replaced with a uniform probability distribution, the model no longer consistently learns. All pairs of distributions are significantly different.


We hypothesized that it is possible to determine whether a given example is worth learning from by examining only the low-level features of that context, such as unigram frequency. To test this hypothesis, we performed an experiment in which we finetuned a language model on one of three kinds of input: (1) real example sequences from a corpus, (2) artificial sequences that were constructed by independently sampling each token from the frequency distribution of the corpus, or (3) sequences constructed by uniformly sampling tokens. We then measured the change in loss on a separate portion of the corpus. Figure 1 shows the results of this experiment. The average reduction in loss for examples constructed using the unigram frequency distribution is almost as high as for real examples. Thus, a significant amount of the benefit of training on contexts can be estimated by merely knowing the unigram frequency distribution from which those contexts were derived, which is easily estimable without knowing the particular parameterization of the language model itself. This suggests that we can inexpensively estimate whether a given context generalizes well to the target corpus, and preferentially choose to train on those contexts over others.
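The three kinds of training context in this experiment can be sketched as follows. This is an illustrative sketch only: the function name `build_contexts`, the whitespace-level token list, and the seeding are assumptions made for the example and are not taken from the original experimental code.

```python
import random
from collections import Counter


def build_contexts(corpus_tokens, n_contexts, context_len=32, seed=0):
    """Return (real, unigram, uniform) lists of contexts built from a token list.

    Hypothetical helper: the corpus is assumed to be an in-memory list of tokens.
    """
    rng = random.Random(seed)
    counts = Counter(corpus_tokens)
    vocab = sorted(counts)
    weights = [counts[token] for token in vocab]

    real, unigram, uniform = [], [], []
    for _ in range(n_contexts):
        # (1) A contiguous window of real text from the corpus.
        start = rng.randrange(len(corpus_tokens) - context_len + 1)
        real.append(corpus_tokens[start:start + context_len])
        # (2) Tokens drawn independently from the corpus unigram distribution.
        unigram.append(rng.choices(vocab, weights=weights, k=context_len))
        # (3) Tokens drawn uniformly from the vocabulary.
        uniform.append(rng.choices(vocab, k=context_len))
    return real, unigram, uniform
```

Contexts of type (2) keep the word frequencies of the corpus but destroy word order, which is what isolates the contribution of the unigram distribution.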

Dodge et al. (2020) observed that the quality of a finetuning run could usually be established by looking at the trajectory of the loss curve very early during training. We therefore attempted to determine whether training on good contexts early is an important element of the variability in data order between finetuning runs. Figure 2 compares test perplexity after training from a randomly sampled first batch against the test perplexity after many randomly sampled batches. Good early batches improve the probability of converging to an ideal final value. The correlation between the test perplexity after a single batch and the test perplexity after 50 batches, which is near convergence for most runs, is moderately high.

We use this pair of observations, that (1) early data order is important, and (2) that it may be possible to ensure its quality, to motivate our approach to selectively modifying data order during finetuning to improve overall model performance. Specifically, if we carefully ensure that early batches are good, then we will likely end up with a superior model after convergence.

Figure 2: Reduction in Perplexity in Early Steps is Predictive of Total Reduction: If the first batch in a finetuning run leads to a large reduction in perplexity, the finetuning run as a whole will tend to converge to a lower value. This correlation is statistically significant.

3 Information Gain Filtration

3.1 Overview of Method

Our method generates a secondary learner that attempts to predict the “informativeness” of each example sampled from a training corpus for finetuning the language model toward our chosen target corpus. We then set a threshold on this informativeness to determine whether to backpropagate or skip that example during finetuning. By filtering examples in this way, we aim to reduce the effect of variability in data order observed in previous work (Dodge et al., 2020) and improve the performance of our language model. Due to its intuitive similarity with notions in deep Q-learning (Mnih et al., 2013) of using a network to approximate the expected value of a given action, we abbreviate this normalized informativeness metric as a “Q-value”.

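Formally, we define our language model $LM(X; \theta)$ as a conditional probability distribution induced by a set of parameters $\theta$, which, when conditioned on an ordered sequence of tokens $X = (x_1, \ldots, x_n)$, outputs a probability distribution over the next token, $y$:

$$LM(X; \theta) = \hat{P}(y \mid X)$$

We define our loss function $\Lambda(\mathcal{O}; \theta)$ as the perplexity of a given set of (sequence, next token) pairs that we denote the test set, $\mathcal{O} = \{(X_1, y_1), \ldots, (X_m, y_m)\}$, under a given parameterization ($\theta$) of our language model:

$$\Lambda(\mathcal{O}; \theta) = 2^{-\frac{1}{m} \sum_{i=1}^{m} \omega(y_i) \cdot \log_2 LM(X_i; \theta)}$$

where $\omega(y_i)$ denotes the one-hot probability distribution that assigns all of its probability mass to the token $y_i$.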
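Because the target distribution is one-hot, each cross-entropy term reduces to the negative log-probability that the model assigned to the true next token. A minimal sketch of the perplexity computation follows, assuming the standard convention of averaging cross-entropy in bits over the examples; the function name and the list-of-probabilities input are illustrative choices.

```python
import math


def perplexity(next_token_probs):
    """Perplexity from the probabilities the model gave to each true next token.

    Assumes base-2 logarithms and a per-example average (standard convention).
    """
    if not next_token_probs:
        raise ValueError("need at least one (sequence, next token) pair")
    mean_bits = -sum(math.log2(p) for p in next_token_probs) / len(next_token_probs)
    return 2.0 ** mean_bits
```

For instance, a model that assigns probability 0.5 to every true next token has a perplexity of 2, as if it were choosing uniformly between two options at each step.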

When encountering a new training example $(X, y)$, the model has a choice between two actions:

Backprop, which updates the language model parameters by backpropagating the loss and taking a single step of gradient descent, updating parameters $\theta$ to $\theta'$, and

Skip, which leaves the language model parameters unchanged.

We intend to utilize a held-out subset of training data that we call the objective set in order to inform our decision about which contexts are informative. We will compute the difference in perplexity measured against this objective set before and after training on a given example in order to quantify the informativeness of that example. We denote this difference as the information gain (IG):

$$IG_{\mathcal{O}}(X, y) = \Lambda(\mathcal{O}; \theta) - \Lambda(\mathcal{O}; \theta'(X, y))$$

where $\mathcal{O}$ is the objective set, $\Lambda$ is the perplexity loss, $\theta$ is the initial parameterization of the language model, and $\theta'(X, y)$ is the parameterization of the new language model after backpropagating on the loss associated with the training example $(X, y)$. For notational brevity, we denote $IG_{\mathcal{O}}(X, y)$ as simply $IG(X)$ since there exists an implicit direct bijection between all $X$’s and $y$’s.
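The information gain computation can be illustrated on a toy model. The sketch below swaps in a context-free softmax model over a three-token vocabulary in place of GPT-2, purely so that the before-and-after perplexity difference can be computed in a few lines; the model, the learning rate, and all function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perplexity(logits, objective_tokens):
    """Perplexity of the toy model on a list of objective-set token ids."""
    probs = softmax(logits)
    bits = -sum(math.log2(probs[y]) for y in objective_tokens) / len(objective_tokens)
    return 2.0 ** bits


def sgd_step(logits, y, lr=0.5):
    """One gradient step on the cross-entropy of token y (gradient = softmax - one_hot)."""
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if i == y else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]


def information_gain(logits, y, objective_tokens, lr=0.5):
    """Objective-set perplexity before the step minus perplexity after it."""
    before = perplexity(logits, objective_tokens)
    after = perplexity(sgd_step(logits, y, lr), objective_tokens)
    return before - after
```

With uniform initial logits and an objective set dominated by token 0, training on token 0 yields a positive information gain, while training on a token absent from the objective set yields a negative one.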

Given an example, regardless of the current language model parameters, we intend to estimate whether it is worthwhile to backpropagate over that example. Given our previous observation that early batches are especially important for finetuning, we expect that even though $IG(X)$ is only a measurement of the single-step change in perplexity during training, it will be a sufficient estimator of long-term data quality. To each of our two actions, we assign a value:

$$Q(X, \text{Backprop}) = IG(X), \qquad Q(X, \text{Skip}) = T_{\text{SKIP}}$$

where $T_{\text{SKIP}}$ is a free “threshold” parameter for deciding which $IG(X)$ values are sufficiently high to warrant backpropagation. We call this technique information gain filtration or simply IGF.

We construct a training dataset for our secondary learner by measuring $IG(X)$ for a randomly selected set of training examples from our training set. Each entry in this dataset consists of a pair of the input text $X$ and its associated $IG(X)$ value, $(X, IG(X))$. Using this constructed dataset, we generate a secondary learner $\hat{Q}$ to approximate $IG(X)$ given $X$.

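During language model finetuning we apply our newly-trained secondary learner with a greedy policy:

$$\pi(X) = \arg\max_{a \in \{\text{Backprop},\, \text{Skip}\}} \hat{Q}(X, a)$$

where $\hat{Q}(X, \text{Backprop})$ is the secondary learner's prediction for the example and $\hat{Q}(X, \text{Skip})$ is the fixed skip threshold.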
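In code, the greedy policy amounts to comparing the predicted value of an example against the skip threshold. The following is a minimal sketch; the names `greedy_action`, `filter_examples`, `predict_q`, and `t_skip` are hypothetical, and the choice to skip on an exact tie is an assumption.

```python
def greedy_action(predicted_q, t_skip):
    """Return "backprop" when the predicted value beats the skip threshold."""
    # Assumption: an exact tie is resolved in favour of skipping.
    return "backprop" if predicted_q > t_skip else "skip"


def filter_examples(examples, predict_q, t_skip):
    """Yield only the examples that the greedy policy chooses to train on."""
    for example in examples:
        if greedy_action(predict_q(example), t_skip) == "backprop":
            yield example
```

Here `predict_q` stands in for the trained secondary learner; any callable that maps an example to a predicted value can be passed in.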

In practice, our secondary learner, $\hat{Q}$, represents the input text $X$ by embedding it with the 768-dimensional GPT-2 byte-pair embeddings. We then pass the input representations through a convolution with kernel width 3, followed by a max-pooling operation over the time axis and a 2-layer feedforward network. This architecture was refined through coordinate descent, and evaluated on a separate held-out set of measured $IG(X)$ values. The choice of architecture does not strongly affect method performance (see Appendix A, Figure 8). Additionally, a neural network is not necessary for the learner, as simpler learning methods are sufficient (see Figure 5).

3.2 Scheduled Thresholding

Our method approximates $IG(X)$ using only the difference in loss between the initial pretrained model parameters $\theta$ and the model parameters after one backpropagation step, $\theta'$. This means that the effectiveness of the learner at distinguishing “high quality” examples from “low quality” examples decays over time during finetuning, as the parameters of the pretrained model diverge from their initial values. Examples that are useful to train on at the beginning of finetuning are not necessarily useful to train on later. To ameliorate this problem, IGF can be modified by changing the threshold $T_{\text{SKIP}}$ over the finetuning process. Since the secondary learner $\hat{Q}$ is most accurate at the first step, we have found that scheduling $T_{\text{SKIP}}$ to alternate between highly selective (a high value) and highly permissive (a low value) is an effective strategy. This allows the model to take advantage of the accurate predictions for $IG(X)$ early in the finetuning process without overfitting once those predictions become less accurate later on. We find alternating to be superior to setting the permissiveness of $T_{\text{SKIP}}$ to decay permanently because it enables the language model to take advantage of as many or as few high-quality examples as necessary without overfitting them.
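An alternating schedule of this kind can be sketched in one function. The default values (alternating between 1 and -1 every 10 batches) follow the setting reported in the Figure 4 caption; the function name and the choice to begin with the selective (high) value are assumptions.

```python
def alternating_threshold(batch_index, high=1.0, low=-1.0, period=10):
    """Skip threshold for a batch: `high` for `period` batches, then `low`, repeating.

    Assumption: the schedule starts in the selective (high) phase.
    """
    return high if (batch_index // period) % 2 == 0 else low
```

Under these defaults, batches 0 through 9 use the selective threshold, batches 10 through 19 use the permissive one, and the pattern then repeats.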

3.3 Iterated Information Gain Filtration

Instead of scheduling the selectivity of the secondary learner to taper off as the finetuning process continues, we might instead replace the learner periodically with a new learner trained on a new dataset of $(X, IG(X))$ pairs generated using the current parameterization of the language model. This process, which we call iterated information gain filtration (IIGF), allows us to replace the obsolete learner that was trained to predict $IG(X)$ for early examples with a learner that is more relevant later in finetuning. IIGF has the added advantage of allowing us to keep the threshold $T_{\text{SKIP}}$ high throughout finetuning, as secondary learner irrelevance is no longer a concern. This procedure is very computationally expensive, as the overhead in generating the new dataset and learner far exceeds the computational cost of finetuning. Nonetheless, this enables finer control of data order throughout the finetuning process and further improvements in final perplexity over IGF with scheduled thresholding.

4 Results

We focused on analyzing the effectiveness of our method from two perspectives: final model performance, and the ability of the method to efficiently trade a computational budget for improved performance. We compare IGF directly to standard finetuning, which we define as basic batched stochastic gradient descent with Adam (Kingma and Ba, 2014) using random samples from the target corpus. For our tests, we used the pretrained GPT-2 Small Transformer model, a commonly used unidirectional language model. We use the publicly available GPT-2 Small implementation of the transformers package (Wolf et al., 2019). We test our approach in two settings: a standard Books dataset (Zhu et al., 2015) and a “Mixed” dataset, which is composed of training examples from two corpora (the Books corpus and a corpus of scraped Reddit comments (Huth et al., 2016)) but whose test set comes from only one corpus (Books). (Footnote: Intel authors did not use or process any data. Intel does not control or audit third-party data.) The Books corpus allows us to fairly compare standard finetuning against IGF, whereas the Mixed corpus allows us to analyze the effectiveness of the method at separating informative contexts from uninformative ones. For both methods, batches of size 16 were used to train the language model. The convolutional network that we used for our secondary learner was similarly trained using SGD with Adam.

Figure 3: Normalized Predicted Q’s by Training Corpus: In the mixed setting, a corpus composed of Reddit comments (25% of total contexts) and a corpus of books (75% of total contexts) were mixed into a single training dataset. Using the predicted Q-values generated by our secondary learner (DQN), we can achieve reasonable separation of the corpora using the information gain metric despite computing the true Q-value using a small objective set. The percentages of examples from the Books corpus whose predicted values exceed several frequently referenced thresholds are given for our dataset. Figure 7 in Appendix A gives the CDF of the predicted values of contexts over different thresholds. This can be used to compute the selectivity of a chosen threshold.

4.1 Corpus Separation

We created a dataset of 10,000 $(X, IG(X))$ pairs using an objective set of 160 examples drawn solely from the Books corpus. We used this dataset to train an example secondary learner. Next, this secondary learner was fed randomly sampled contexts from both the Books and Reddit corpora. Because the objective set only contains examples from one corpus, we expect that the secondary learner should assign higher predicted values to other examples from the same corpus. Figure 3 shows that there is indeed a significant difference in the distributions of predicted values between these two corpora. Further, this shows that the corpora can be effectively separated by this simple secondary learner. Almost all examples from the Reddit corpus are expected by the IGF secondary learner to produce a reduction in perplexity that is at least one standard deviation below the mean. This indicates that the secondary learner can identify with strong confidence that Books corpus examples are more informative for finetuning towards the Books objective than Reddit corpus examples. It is worthwhile to note that the secondary learner achieves dataset separation despite having access to just 160 labelled examples of 32 tokens in our objective set, a total of just 5120 tokens from the Books corpus, and no examples from the Reddit corpus.
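Statements such as “one standard deviation below the mean” presuppose that values are expressed in standardized units. The sketch below shows one way to standardize a set of values and to read off the selectivity of a threshold; treating the normalization as a plain z-score over the measured dataset is an assumption, as are the function names.

```python
from statistics import mean, pstdev


def normalize(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("values must not all be identical")
    return [(v - mu) / sigma for v in values]


def fraction_above(normalized_values, threshold):
    """Share of examples whose normalized value exceeds the threshold."""
    return sum(v > threshold for v in normalized_values) / len(normalized_values)
```

Evaluating `fraction_above` over a range of thresholds traces out one minus the empirical CDF of the normalized values, which is how the selectivity of a chosen threshold can be estimated.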

4.2 Effectiveness of the Secondary Learner

The fully-trained convolutional secondary learner achieves an MSE from the true metric of just 0.21 standard deviations. Considering that the secondary learner only has access to low-level statistics of the contexts, this supports the view that the question of whether or not an example is useful for language model finetuning is one that is almost entirely answerable by simple statistics of the example itself, and not something that requires a learner that is as complex as the language model. That is, simple statistical regularities such as unigram, bigram, and trigram frequencies that can be captured by the width-3 convolutional architecture that the secondary learner uses are sufficient to adequately estimate the usefulness of an example during training. Since the secondary learner can be used to fully filter out the examples in the Mixed dataset that originated from the Reddit corpus, it performs equally well on the Mixed and Books corpora. This shows that finetuning with IGF is resilient to imperfections in the training set, such as incorrectly labelled data.

Figure 4: Comparing IGF to Standard Finetuning: IGF with constant and alternating thresholding significantly outperforms SGD with Adam using no data order control in both corpora. The left-hand figure shows the averaged progression of the training curve for each method. The right-hand figure gives the variability in the different methods, and shows a clear improvement in the overall performance of IGF over standard batched finetuning with Adam. The median Standard Books run reached a perplexity of 57.3 against a median perplexity of 54.0 for IGF with an alternating threshold. For the constant threshold, $T_{\text{SKIP}}$ was set to 0.75. For the alternating threshold, $T_{\text{SKIP}}$ was set to alternate between 1 and -1 every 10 batches. In both cases where our method was employed, the Mixed corpus was used and a set of 160 example contexts of 32 tokens each from the Books corpus was used as the objective set. Error bars in the left-hand figure show the standard error. All methods are averaged over 50 runs.

4.3 Language Model Finetuning

We next used the secondary learner to finetune our language model towards the target Books corpus. We generated 50 runs each of standard finetuning on training examples sampled from the Mixed corpus and separately from the easier Books corpus. We then generated 50 runs of IGF using two thresholding schedules, one with a fixed $T_{\text{SKIP}}$ and one with an alternating $T_{\text{SKIP}}$. Both types of IGF runs were performed on the more challenging Mixed corpus only. Figure 4 plots the averaged finetuning curves of these 4 different categories over 60 batches. We can see that IGF (green and red) significantly improves final test perplexity when compared to standard finetuning on both the Mixed corpus (blue) and the Books corpus (orange). Standard finetuning on Books achieves a median perplexity of 57.3, compared to 56.9 for IGF with a constant threshold (green) and 54.0 for IGF with the alternating threshold schedule (red). All 50 runs of IGF with an alternating schedule outperformed all 50 standard finetuning runs. This means that the overall improvements to data order that IGF achieves through selective sampling of informative contexts are far in excess of what might be reasonably achieved through random sampling of contexts during finetuning. Because of its computational expense, we ran only a small set of 5 tests of iterated information gain filtration, training a secondary learner on a dataset built from $(X, IG(X))$ example pairs derived from a language model that had already been fully finetuned to the Books corpus. IIGF was able to improve these already-converged models by an average of 0.29 additional perplexity points after reconverging, with a standard deviation of 0.11 points.

Figure 5: Comparing The Ability of Simple Learners To Estimate Information Gain: The above plots show the prediction accuracy for a variety of secondary learners. Each performs well in estimating $IG(X)$ when trained on a dataset of $(X, IG(X))$ pairs. The convolutional network (far left) which we chose as our secondary learner moderately outperforms the other simple learners. As alternative learners, we also tested linear regression where $X$ is represented as a one-hot encoding over the token values (center left), linear regression where $X$ is represented as its average embedded representation in the GPT-2 byte-pair embedding space (center right), and a trivial learner which estimated the value of a context as the average of the values of the tokens that compose it, whose values are in turn computed as the average value of the training contexts they occur in (far right). For the one-hot and average token value learners, contexts with tokens appearing in the training set and not in the test set were excluded. All learners were trained on a dataset of 50,000 training examples.

Figure 6: Prospective IGF Is More Efficient Than Retrospective Random Seed Search: Dodge et al. (2020) showed that computation can be traded for finetuning performance through the testing of many random seeds that govern data order and weight initialization. We show boxplots of the best run from differently-sized sets of runs to visualize the expected benefit of using random seed testing and compare it to the benefit of using IGF. Even one IGF run is significantly more effective than 50 random seed tests using standard finetuning. We further observe that the improvements to data order that come from IGF are somewhat disjoint from the improvements to data order that come with random seed testing, so both approaches can be applied simultaneously for further perplexity reduction. Sets of runs of each size were generated by sampling without replacement from a pool of 50 independent runs for each method. For the 50 run case, the minimum over the entire pool of runs for each method is plotted instead.

4.4 Choice of Secondary Learner

For other results presented here, we used the simple convolutional neural network described in Section 3 as our secondary learner. However, it is generally not necessary to choose an end-to-end neural network as the secondary learner for $IG(X)$. Indeed, much simpler machine learning methods suffice for almost equal performance. Figure 5 shows predicted vs. actual normalized values for several learning methods. While the convolutional neural network is most effective at approximating $IG(X)$, other learners perform almost as well. We encoded the contexts both by using the standard GPT-2 Small word embedding and with a simple one-hot encoding of the token identities within each context. Standard linear regression performed on both encoding types performs nearly as well at approximating $IG(X)$ as the convolutional model. We also tested an even simpler learner that assigned to each token an associated value that was computed by averaging the values for each context that contained that token in a held-out training set. The values of new contexts in our test set were then computed as the average of token values contained in that context. Even this extremely simple model is a reasonably predictive approximator of $IG(X)$. This underscores that while $IG(X)$ is an extremely complex function to compute exactly, since it is dependent on the precise parameterization of the language model, it can nevertheless be effectively approximated through simple unigram information.
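The simplest learner described above, which averages per-token values, can be sketched directly from its description. The function names are hypothetical, and two details are assumptions: each context counts once per distinct token when fitting, and a new context containing any unseen token returns `None` so that the caller can exclude it.

```python
from collections import defaultdict


def fit_token_values(contexts, values):
    """Average, for each token, the values of the training contexts containing it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, value in zip(contexts, values):
        for token in set(tokens):  # assumption: count each context once per token
            sums[token] += value
            counts[token] += 1
    return {token: sums[token] / counts[token] for token in sums}


def predict_context_value(tokens, token_values):
    """Average the learned token values over a new context.

    Returns None when the context contains a token never seen in training.
    """
    if not tokens or any(token not in token_values for token in tokens):
        return None
    return sum(token_values[token] for token in tokens) / len(tokens)
```

This learner has no parameters beyond a lookup table, which makes it a useful lower bound on how much of the signal is carried by unigram identity alone.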

4.5 Efficiency of IGF

When compared to standard finetuning, IGF takes fewer batches to train to convergence. This reduces the number of backpropagation steps necessary during training and improves runtime and energy usage during finetuning (see Appendix A, Figure 9). However, it is important to note that there is a significant constant-time overhead for generating the dataset on which to train the secondary learner, so IGF is beneficial for this purpose only if the model will be trained many times, such as during a hyperparameter or architecture search.

Dodge et al. (2020) showed that model performance can be improved by rerunning the finetuning process many times with different random seeds to determine a good data order, and then choosing the best resulting model by testing against a validation set. Since IGF aims to replace this random search for a good seed with a principled, directed search, we would expect IGF to be significantly more effective than random seed testing in terms of the number of runs necessary to achieve meaningfully improved performance. Figure 6 compares the computational efficiency of IGF against random seed testing. We can see that using IGF to improve data order significantly outperforms random seed testing.
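The best-of-k comparison in Figure 6 is described in its caption as sampling sets of runs without replacement from a pool of finished runs and keeping the best run in each set. A minimal sketch of that resampling follows; the function name, the number of trials, and the seeding are assumptions, and any pool of final perplexities passed in is supplied by the caller.

```python
import random


def best_of_k(pool, k, n_trials=1000, seed=0):
    """Best (lowest) final perplexity among k runs sampled without replacement.

    Returns one value per trial; when k equals the pool size, the single
    minimum over the whole pool is returned instead.
    """
    if not 1 <= k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    if k == len(pool):
        return [min(pool)]
    rng = random.Random(seed)
    return [min(rng.sample(pool, k)) for _ in range(n_trials)]
```

Plotting the distribution of these values for increasing k shows how much improvement random seed search can be expected to buy for a given number of full finetuning runs.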

5 Conclusion and Future Work

We have shown that a substantial portion of the finetuning process is composed of merely changing the model’s estimation of low-level distributional statistics, such as unigram and bigram frequency. We then presented information gain filtration, a method for improving the data and energy efficiency of the finetuning process that uses this observation to efficiently estimate the usefulness of each example encountered during training. This data filtration technique yields significant improvements in final model performance and also converges with roughly 40% fewer batches than standard finetuning.

There are some significant questions that our research has left for future study. Since the focus of our filtering method was on a lightweight technique that did not require a significant overhead, our secondary learner was a simple convolutional network that treated each example sequence as a bag of trigrams. Data efficiency during training could potentially be further improved if one was willing to sacrifice on model complexity and energy efficiency by using a more complex model. The question of how far one could logically take a function approximator network for estimating information gain remains unexplored.

Finally, we have left several questions of a more theoretical nature unanswered in our analysis here. Specifically, we lack an understanding of why improving performance on early batches results in a long-term improvement in the model performance at convergence. Is this a property exclusive to language models, or are there other types of networks and learning tasks that exhibit this phenomenon? What are the factors that allow IGF to generalize in some cases and not in others, and can a more general method for filtering useful examples be developed that is language-model invariant? We believe these are exciting questions that drive near the heart of a better understanding of the underlying processes of LM finetuning.


We would like to thank all of the people whose contributions, opinions and suggestions helped with the development of this project, especially Nicole Beckage, Kaj Bostrom, Greg Durrett, Shailee Jain, and Vy Vo. This project was funded by a generous grant from Intel.


  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. Smith (2020) Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  • A. G. Huth, W. A. De Heer, T. L. Griffiths, F. E. Theunissen, and J. L. Gallant (2016) Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532 (7600), pp. 453–458.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.

Appendix A Supplementary Material

Figure 7: CDF of Predicted Q’s: CDFs of the datasets against the Books objective set. Note that a threshold of -1 almost entirely excludes contexts in the Mixed corpus that originated from the Reddit corpus. This allows IGF with a constant threshold of -1 on the Mixed dataset to perform almost identically to standard finetuning on just the Books corpus.
Figure 8: Architecture Invariance: The method performs similarly regardless of the convolutional setup of the model. Allowing the DQN to be informed by higher-order frequencies such as trigrams and 10-grams does not significantly affect performance.
Figure 9: Improved Finetuning Efficiency Over Standard Finetuning: We plot the number of batches it takes for each threshold schedule to exceed the perplexity of standard finetuning at each step. This serves as a barometer for comparing the relative efficiency of finetuning. In the early stages of finetuning, we can see that IGF requires 30%-40% fewer backpropagation steps than standard finetuning. This suggests that IGF could be used as a more energy-efficient alternative to standard language model finetuning. Note that since IGF converges to a lower final value than standard finetuning, these values asymptote to a fixed value.