Pre-trained language models (Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Lewis et al., 2019, 2020) provide the defacto initialization for modeling most existing NLP tasks. However, the process of fine-tuning them on often very small target task datasets remains somewhat mysterious. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?
We propose intrinsic dimensionality as a new lens through which fine-tuning can be analyzed (Li et al., 2018). An objective function’s intrinsic dimensionality describes the minimum dimension needed to solve the optimization problem it defines to some precision level. In the context of pre-trained language models, measuring intrinsic dimensional will tell us how many free parameters are required to closely approximate the optimization problem that is solved while fine-tuning for each end task. For example, we will show that 200 parameters (randomly projected back into the full parameter space) are enough to represent the problem of tuning a RoBERTa model to within 90% of the performance of the full model. More generally, we also describe a set of strong empirical and theoretical connections between intrinsic dimensionality, number of parameters, pre-training, and generalization.
We first empirically show that standard pre-trained models can learn a large set of NLP tasks with very few parameters and that the process of pre-training itself implicitly minimizes the intrinsic dimension of later tuning for different NLP tasks. We continue by conducting a study across over a dozen various pre-trained models to show that number of parameters strongly inversely correlates with intrinsic dimensionality, at least in part to justify the extreme effectiveness of such models. We interpret pre-training as providing a framework that learns how to compress the average NLP task. Finally, we connect intrinsic dimensional with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count, further justifying why these methods generalize so well in practice across tasks.
The contributions of our paper are the following:
We empirically show that common NLP tasks within the context of pre-trained representations have an intrinsic dimension several orders of magnitudes less than the full parameterization.
We propose a new interpretation of intrinsic dimension as the downstream fine-tuning task’s minimal description length within the framework of the pre-trained model. Within this interpretation, we empirically show that the process of pre-training implicitly optimizes the description length over the average of NLP tasks, without having direct access to those same tasks.
We measure the intrinsic dimension of a large set of recently developed pre-training methods. We discover that there exists a fortuitous trend where larger models tend to have a smaller intrinsic dimension.
Lastly, we show that compression based generalization bounds can be applied to our intrinsic dimension framework to provide generalization bounds for large pre-trained models independent of the pre-trained model parameter count.
2 Related Work
Calculating the intrinsic dimension of an objective function was proposed Li et al. (2018). In their paper, they analyzed the impact of various architectures on the intrinsic dimensionality of their objective. Our work is a direct extension of this paper, focusing on analyzing pre-trained representations instead.
There is a large collection of literature analyzing pre-trained models from the perspective of capacity. For example, a recent line of work has shown that pre-trained models such as BERT are redundant in their capacity, allowing for significant sparsification without much degradation in end metrics (Chen et al., 2020; Prasanna et al., 2020; Desai et al., 2019). Houlsby et al. (2019) showed that fine-tuning top layers of pre-trained models is not effective and that alternate methods allow fine-tuning effectively with a couple of percent of the parameters. Furthermore, we can view computing the intrinsic dimensionality as a continuous relaxation of the sparsification problem.
Moreover, standard approaches towards fine-tuning seem to have non-trivial effects on the generalization of pre-trained representations (Aghajanyan et al., 2020). A holistic explanatory picture of the successes of fine-tuning has not yet been painted. A clear understanding of the underlying mechanisms which lead to the incredible generalization of fine-tuned pre-trained representations is currently missing. Moreover, we still do not understand why various pre-training methodology manifests in universally useful representations.
3 Intrinsic Dimensionality of Finetuning
An objective function’s intrinsic dimension measures the minimum number of parameters needed to reach satisfactory solutions to the respective objective (Li et al., 2018)
. Alternatively, the intrinsic dimension represents the lowest dimensional subspace in which one can optimize the original objective function to within a certain level of approximation error. Computing the exact intrinsic dimensional of the objective function is computation intractable; therefore, we resort to heuristic methods to calculate an upper bound. Letbe a set of parameters that parameterize some model . Instead of optimizing the empirical loss in the original parameterization (), the subspace method fine-tunes the model via the following re-parametrization in the lower-dimensionsal -dimensions:
where projects from a parameter from a lower dimensional to the higher dimensional . Intuitively, we do an arbitrary random projection onto a much smaller space; usually, a linear projection, we then solve the optimization problem in that smaller subspace. If we reach a satisfactory solution, we say the dimensionality of that subspace is the intrinsic dimension. This methodology was proposed in the seminal paper by Li et al. (2018). Concretely Li et al. (2018) proposed 3 various actualizations of ; a random linear dense projection (), random linear sparse projection() and random linear projection via the Fastfood transform (Le et al., 2013).
We will primarily use the Fastfood transform, defined as:
The factorization of consists of , a Hadamard matrix, , a random diagonal matrix with independent standard normal entries,
a random diagonal matrix with equal probabilityentries, and a random permutation matrix. Furthermore, the matrix multiplication with a Hadamard matrix can be computed in via the Fast Walsh-Hadamard Transform. Note that everything but is fixed; therefore, the optimization problem lies only in -dimensions. Note that if we place a constraint of being a binary matrix, we recover the sparsification problem; therefore, we can view finding intrinsic dimensionality as a continuous relaxation of the sparsification problem.
The standard method of measuring the intrinsic dimensionality of an objective as proposed by Li et al. (2018) requires searching over various , training using standard SGD over the subspace reparameterization and selecting the smallest which provides us with a satisfactory solution (). Li et al. (2018) defined the satisfactory solution as being 90% of the full training metric. For example, if we reach 85% accuracy training a model with all of its parameters, the goal is to find the smallest , which would reach accuracy; we call this dimension . Let us also note that by merely initializing we recover the original parameterization which in the context of fine-tuning represents the original weights of the pre-trained model.
The way Li et al. (2018) define a satisfactory solution reduces the dependence of the dataset’s size on the calculation of intrinsic dimension. For a small dataset, we will generally have worse end metrics; therefore, we have a lower cut-off; inversely, a larger dataset will require a more non-trivial cut-off.
Structure Aware Intrinsic Dimension
Due to the large size of pre-trained language models (generally in the hundreds of millions of parameters), the only computationally reasonable subspace optimization method is one that utilizes the Fastfood transform. For example, if we are interested in subspace training with for the RoBERTa-Large model using a dense matrix, we would require 1.42 terabytes of memory to store just the projection matrix.
Unfortunately, the method of finding the intrinsic dimension proposed by Li et al. (2018) is unaware of the layer-wise structure of the function parameterized by . Existing literature argues that in attention-based pre-trained models, individual layers specialize separately (Clark et al., 2019); therefore, it is useful to incorporate a notion of structure when computing . We define Structure-Aware Intrinsic Dimension (SAID) as the following
For layers, we trade parameters from our subspace parameter to allow for layer-wise scaling through jointly learned , thus becomes . This allows the SAID method to focus a larger capacity of towards specific layers what might carry more relevant information for the task at hand. Conversely, we will refer to the layer unaware method (Equation 2) as the Direct Intrinsic Dimension (DID) method.
4 Intrinsic Dimensionality of Common NLP Tasks
4.1 Sentence Prediction
We first empirically calculate the intrinsic dimension of various pre-trained models on a set of sentence prediction tasks from the GLUE Benchmark (Wang et al., 2018). We focus on analyzing BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) at both the base and large model sizes.
We chose to experiment with MRPC (Dolan and Brockett, 2005) and QQP (Iyer et al., 2017) as reference examples of small and large tuning datasets. MRPC is a binary classification task for predicting semantic equivalency for two paraphrases with roughly 3700 training samples, while QQP is a binary classification task for predicting semantic equality of two questions, with roughly 363k samples. For every dataset and every model, we run 100 subspace trainings with
ranging from 10 to 10000 on a log scale. For every training run, we do a small hyperparameter search across four learning rates. We initialize every
to the zero vector to allow for our starting point to be the original pre-trained model. Our subspace optimization method also operates over the randomly initialized sentence classification head to ensure we have exactlyparameters to optimize.
The first takeaway is the incredible low dimensionality of viable solutions. With RoBERTa-Large, we can reach 90% of the full fine-tuning solution of MRPC using roughly 200 parameters and 800 parameters for QQP (Table 1). Recall that our approximation of intrinsic dimension is necessarily crude by using random projections and restricting them to the use of Fastfood transform; therefore, it is likely that the true intrinsic dimension is much lower.
Furthermore, RoBERTa consistently outperforms BERT across various subspace dimensions while having more parameters. We leave a more in-depth analysis of model parameter size on intrinsic dimensionality to a later section (§5.2).
Lastly we see that adding a notion of structure in the computation of intrinsic dimension is beneficial with the SAID method consistently improving over the structure unaware DID method.
5 Intrinsic Dimension, Pre-Training, and Generalization Gap
One interpretation of the intrinsic parameter vector is that it encodes the task at hand with respect to the original pre-trained representations. Therefore, we can interpret as the minimal description length of the task within the framework dictated by the pre-trained representations (Hinton and Zemel, 1993). Under this interpretation of intrinsic dimensionality, we hypothesize that pre-training is implicitly lowering the intrinsic dimensionality of the average NLP task, and therefore compress the minimal description length of those same tasks.
What do we more precisely mean by intrinsic parameter encoding a task within the framework provided by the pre-trained representations? Traditionally, a finetuned model (e.g. for a classification tasks) simply consists of a classification head , parameterized by applied to fine-tuned representations , parameterized by per sample . Therefore, to fully describe a task, we need to pack together parameterizations and weights . This model description is completely decoupled from the original weights of the pre-trained representation , therefore to represent classification tasks, we need to maintain ; additionally, the task representation is incredibly high dimensional. Conversely, fine-tuning utilizing SAID in -dimensions requires storing only per task, a single random seed used to generate and the original pre-trained weights . Therefore, we can represent arbitrary NLP tasks within a single pre-trained model framework with parameters.
For example, in the last section, we represented MRPC with roughly 200 parameters, which translates to needing less than a kilobyte of data to encode a complex natural language task within the framework provided by RoBERTa.
We hypothesize that the better the pre-trained models are, the fewer bits (description length) are needed to represent the average NLP task, as we will demonstrate empirically in the next section.
5.1 Pre-Training Intrinsic Dimension Trajectory
To verify our hypothesis of pre-training optimizing intrinsic dimension, we retrain a RoBERTa-Base from scratch and measure various NLP tasks’ intrinsic dimensions using the SAID method across various checkpoints. We completely replicate the setting as described by (Liu et al., 2019) apart from only training for a total of 200k steps (instead of 500k) with half the batch size (1k). To calculate the intrinsic dimension more efficiently, we reuse the best learning rates discovered in Section 4 for and use a fixed learning rate for anything else. To find we do a binary search across per each checkpoint, with a minimum of 100 and a maximum of 4 million. The “full solution” that we use when deciding cut-off is computed by fine-tuning the checkpointed model in the standard way. We compute SAID on six datasets; MRPC, QQP, Yelp Polarity (Zhang et al., 2015), SST-2 (Socher et al., 2013), MNLI (Williams et al., 2018) and ANLI using all rounds of data (Nie et al., 2019).
We present our results in Figure 2. We see that the intrinsic dimensionality of RoBERTa-Base monotonically decreases as we continue pre-training. We do not explicitly optimize for intrinsic dimensionality, specifically during pre-training (the language model does not have access to downstream datasets!), but none-the-less the intrinsic dimension of these downstream tasks continues to decrease.
More so, tasks that are easier to solve consistently show lower intrinsic dimensionality across all checkpoints, for example, Yelp Polarity vs. the notoriously tough ANLI dataset. The correlation between tasks traditionally hard for RoBERTa and their large intrinsic dimension hints at a connection between generalization and intrinsic dimension. We will discuss generalization further in Section §5.3.
Given our task representation interpretation of intrinsic dimensionality, we argue that the large scale training of Masked Language Models (MLM) learns generic and distributed enough representations of language to facilitate downstream learning of highly compressed task representations. Furthermore, we argue for another perspective of pre-training learning representations that form a compression framework with respect to various NLP tasks.
5.2 Parameter Count and Intrinsic Dimension
We would also like to measure the relationships between the parameter count of arbitrary pre-trained models and the intrinsic dimension of downstream NLP tasks. The optimal experiment to run would be to fix the pre-training method, e.g., MLM RoBERTa style, vary the architecture size from small to very big, and compute the intrinsic dimension of a group of tasks at every size of the model. Unfortunately, such an experiment is computationally infeasible due to the need to train many RoBERTa models.
Due to these constraints, we opt to do an empirical study over existing pre-trained models, regardless of the pre-training method. We show that the trend is strong enough to overcome differences in training methodology. We select the following pre-trained models in our study: BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2019), Electra (Clark et al., 2020), Albert (Lan et al., 2019), XLNet (Yang et al., 2019), T5 (Raffel et al., 2019), and XLM-R (Conneau et al., 2019). Furthermore, we selected various sizes of these models, as available publicly within the HuggingFace Transformers library (Wolf et al., 2019).
We used the MRPC dataset and computed intrinsic dimension for every pre-trained model utilizing the same binary search methodology mentioned in the previous section with additional small hyper-parameter searches across learning rate (due to the wide range of learning rates needed by various models).
We present our results in Figure 3. We see a strong general trend that as the number of parameters increases, the intrinsic dimension of fine-tuning on MRPC decreases. We ran this experiment on other datasets to ensure that this is not an artifact of the dataset. Our experiments showed the same trend; we refer to the Appendix for all trends per dataset.
Within the same window of number of parameters, pre-training methodology becomes essential. For example, in the regime of parameters, the RoBERTa method of pre-training dominates similar sized pre-training methods. However, there does not seem to be a method that can overcome the limitations induced by the number of parameters. Interpreting these results through the lens of learning a compression framework for NLP tasks is straightforward; the more parameters we have in the model, the less we need to represent a task.
5.3 Generalization Bounds through Intrinsic Dimension
We have shown strong empirical evidence connecting pre-training, fine-tuning, and intrinsic dimensionality. However, we have yet to argue the connection between intrinsic dimensionality and generalization. Given that we have seen pre-training minimize intrinsic dimension, we hypothesize that generalization improves as the intrinsic dimension decreases.
To do so, we will empirically experiment with the connections between and evaluation set performance by looking at various checkpoints from our RoBERTa experiments in Section §5.1. We also plot the relative generalization gap (delta between train time performance and test time performance).
In Figure 4 we plot the evaluation accuracy’s achieved by our pre-training experiment in Section §5.1. A lower intrinsic dimension is strongly correlated with better evaluation performance. Additionally we are interested in measuring relative generalization gap (
) across intrinsic dimension. We select the training accuracy that provides us with the best evaluation metrics when computing this figure.
We present our results in Figure 5. Lower intrinsic dimension once again correlates strongly with a smaller relative generalization gap. If we interpret the intrinsic dimension as a measure of complexity, we expect the generalization gap to decrease with intrinsic dimension.
5.3.1 Generalization Bounds
By applying standard compression based generalization bounds, we can provide theoretical backing to the empirical connection between intrinsic dimension and generalization (Arora et al., 2018).
Consider the following definition of multi-class classification loss with an optional margin over our supervised dataset .
When , recovers the standard classification loss. Furthermore, Let be an unbiased empirical estimate of the margin loss.
Let be a function which is parameterized by as described in Equation 1 with a total of trainable intrinsic parameters on a dataset with samples. Then with a high probability, we can state the following asymptotic generalization bound
This generalization bound is independent of the underlying parameter count () of the pre-trained model but depends on the ability to compress the downstream task (). Moreover, given that our previous section shows larger models compress better, our bounds are aligned with general intuition and recent empirical evidence that larger pre-trained models generalize better. Explicitly, these bounds only apply to pre-trained methods trained with the intrinsic dimension subspace method; research has yet to show that standard SGD optimizes in this low dimensional space (although experimentally, this seems to be confirmed). We leave the theoretical contribution of showing SGD optimizes in this space, resembling something such as intrinsic subspace, for future work.
We want to highlight that generalization is not necessarily measured by the pre-trained model’s parameter count or measure of complexity, but the pre-trained model’s ability to facilitate the compression of downstream tasks. In some sense, if we want to compress downstream tasks better, we must expect pre-trained representations to have a considerable measure of complexity.
In conclusion, we proposed viewing the various phenomena surrounding fine-tuning and pre-training through the lens of intrinsic dimensionality. We empirically showed that common natural language tasks could be learned with very few parameters, sometimes in the order of hundreds, when utilizing pre-trained representations. We provided an interpretation of pre-training as providing a compression framework for minimizing the average description length of natural language tasks and showed that pre-training implicitly minimizes this average description length.
We continued by doing an empirical study of existing pre-training methods and their respective intrinsic dimension, uncovering the phenomena that intrinsic dimensionality decreases as we increase the number of pre-trained representation parameters. This phenomenon provides some intuitions to the trend of growing pre-trained representations. We connected intrinsic dimensionality with generalization by first showing that pre-trained models with lower intrinsic dimensions across various tasks achieve higher evaluation accuracies and lower relative generalization gaps. Furthermore, we explain these empirical results by applying well-known generalization bounds to the intrinsic dimension to get generalization bounds that grow on the order of the intrinsic dimension, not on the pre-trained model’s parameter count.
Intrinsic dimensionality is a useful tool for understanding the complex behavior of large models. We hope that future work will make explicit theoretical connections between SGD and optimizing the intrinsic dimension as well as explain exactly why pre-training methods optimize the intrinsic dimensionailty of tasks before not seen.
- Better fine-tuning by reducing representational collapse. arXiv preprint arXiv:2008.03156. Cited by: §2.
- Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296. Cited by: §A.1, §A.1, §A.1, §5.3.1, §5.3.1.
- The lottery ticket hypothesis for pre-trained bert networks. Advances in Neural Information Processing Systems 33. Cited by: §2.
- What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341. Cited by: §3.
- Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: §5.2.
- Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Cited by: §5.2.
- Evaluating lottery tickets under distributional shifts. arXiv preprint arXiv:1910.12708. Cited by: §2.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4.1, §5.2.
- Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Cited by: §4.1.
- Autoencoders, minimum description length and helmholtz free energy. Advances in neural information processing systems 6, pp. 3–10. Cited by: §5.
Parameter-efficient transfer learning for nlp. arXiv preprint arXiv:1902.00751. Cited by: §2.
- First quora dataset release: question pairs. External Links: Cited by: §4.1.
Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §5.2.
Fastfood-approximating kernel expansions in loglinear time.
Proceedings of the international conference on machine learning, Vol. 85. Cited by: §3.
- Pre-training via paraphrasing. External Links: Cited by: §1.
Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1, §5.2.
- Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838. Cited by: §1, §2, §3, §3, §3, §3.
- Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §4.1, §5.1, §5.2.
- Adversarial nli: a new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599. Cited by: §5.1.
- When bert plays the lottery, all tickets are winning. arXiv preprint arXiv:2005.00561. Cited by: §2.
- Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §5.2.
Recursive deep models for semantic compositionality over a sentiment treebank.
Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §5.1.
GLUE: a multi-task benchmark and analysis platform for natural language understanding.
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Cited by: §4.1.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. External Links: Cited by: §5.1.
- HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv, pp. arXiv–1910. Cited by: §4.1, §5.2.
- Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763. Cited by: §5.2.
- Character-level Convolutional Networks for Text Classification. arXiv:1509.01626 [cs]. External Links: Cited by: §5.1.
Appendix A Appendix
Arora et al. (2018) define compressible using helper string as the following.
compressible using helper string
Suppose is a class of
classifiers indexed by trainable parameters A and fixed strings s. A classifier
is a class of classifiers indexed by trainable parameters A and fixed strings s. A classifieris -compressible with respect to using helper string s if there exists such that for any , we have for all y
If we parameterize via the intrinsic dimension approach as defined in Equation 1, then is compressible losslessly using a helper string consisting of the random seed used to generate the static random projection weights and the initial pre-trained representation . Therefore we say parameterized by either DID or SAID is compressible.
Theorem in Arora et al. (2018) states given a compression consisting of discrete states we achieve the following generalization bound.
We can trivially represent our parameters in a discrete fashion through discretization (as was done in Arora et al. (2018)), and the number of states is dependent on the level of quantization but is static once chosen (FP32 vs. FP16).
We then connect the fact that models trained in low dimensional subspace using SAID/DID methods are (0, S)-compressible to derive the final asymptotic bound.