1 Introduction
Pretrained language models (Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Lewis et al., 2019, 2020) provide the de facto initialization for modeling most existing NLP tasks. However, the process of finetuning them on often very small target task datasets remains somewhat mysterious. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?
We propose intrinsic dimensionality as a new lens through which finetuning can be analyzed (Li et al., 2018). An objective function’s intrinsic dimensionality describes the minimum dimension needed to solve the optimization problem it defines to some precision level. In the context of pretrained language models, measuring intrinsic dimensionality will tell us how many free parameters are required to closely approximate the optimization problem that is solved while finetuning for each end task. For example, we will show that 200 parameters (randomly projected back into the full parameter space) are enough to represent the problem of tuning a RoBERTa model to within 90% of the performance of the full model. More generally, we also describe a set of strong empirical and theoretical connections between intrinsic dimensionality, number of parameters, pretraining, and generalization.
We first empirically show that standard pretrained models can learn a large set of NLP tasks with very few parameters and that the process of pretraining itself implicitly minimizes the intrinsic dimension of later tuning for different NLP tasks. We continue with a study of more than a dozen pretrained models, showing that the number of parameters is strongly inversely correlated with intrinsic dimensionality, which at least partly explains the extreme effectiveness of such models. We interpret pretraining as providing a framework that learns how to compress the average NLP task. Finally, we connect intrinsic dimensionality with low-dimensional task representations and compression-based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count, further justifying why these methods generalize so well in practice across tasks.
The contributions of our paper are the following:

We empirically show that common NLP tasks within the context of pretrained representations have an intrinsic dimension several orders of magnitude less than the full parameterization.

We propose a new interpretation of intrinsic dimension as the downstream finetuning task’s minimal description length within the framework of the pretrained model. Within this interpretation, we empirically show that the process of pretraining implicitly optimizes the description length over the average of NLP tasks, without having direct access to those same tasks.

We measure the intrinsic dimension of a large set of recently developed pretraining methods. We discover that there exists a fortuitous trend where larger models tend to have a smaller intrinsic dimension.

Lastly, we show that compression based generalization bounds can be applied to our intrinsic dimension framework to provide generalization bounds for large pretrained models independent of the pretrained model parameter count.
2 Related Work
Calculating the intrinsic dimension of an objective function was proposed by Li et al. (2018). In their paper, they analyzed the impact of various architectures on the intrinsic dimensionality of their objective. Our work is a direct extension of this paper, focusing on analyzing pretrained representations instead.
There is a large collection of literature analyzing pretrained models from the perspective of capacity. For example, a recent line of work has shown that pretrained models such as BERT are redundant in their capacity, allowing for significant sparsification without much degradation in end metrics (Chen et al., 2020; Prasanna et al., 2020; Desai et al., 2019). Houlsby et al. (2019) showed that finetuning top layers of pretrained models is not effective and that alternate methods allow finetuning effectively with a couple of percent of the parameters. Furthermore, we can view computing the intrinsic dimensionality as a continuous relaxation of the sparsification problem.
Moreover, standard approaches towards finetuning seem to have nontrivial effects on the generalization of pretrained representations (Aghajanyan et al., 2020). A holistic explanatory picture of the successes of finetuning has not yet been painted. A clear understanding of the underlying mechanisms which lead to the incredible generalization of finetuned pretrained representations is currently missing. Moreover, we still do not understand why various pretraining methodologies manifest in universally useful representations.
3 Intrinsic Dimensionality of Finetuning
Background
An objective function’s intrinsic dimension measures the minimum number of parameters needed to reach satisfactory solutions to the respective objective (Li et al., 2018). Alternatively, the intrinsic dimension represents the lowest-dimensional subspace in which one can optimize the original objective function to within a certain level of approximation error. Computing the exact intrinsic dimension of the objective function is computationally intractable; therefore, we resort to heuristic methods to calculate an upper bound. Let $\theta^D$ be a set of $D$ parameters that parameterize some model $f(\cdot, \theta)$. Instead of optimizing the empirical loss in the original parameterization ($\theta^D$), the subspace method finetunes the model via the following reparameterization in the lower-dimensional $d$ dimensions:
$$\theta^D = \theta^D_0 + P(\theta^d) \qquad (1)$$
where $P: \mathbb{R}^d \rightarrow \mathbb{R}^D$ projects a parameter from the lower-dimensional $d$ to the higher-dimensional $D$. Intuitively, we apply an arbitrary random projection (usually a linear one) onto a much smaller space and then solve the optimization problem in that smaller subspace. If we reach a satisfactory solution, we say the dimensionality of that subspace is the intrinsic dimension. This methodology was proposed in the seminal paper by Li et al. (2018). Concretely, Li et al. (2018) proposed three realizations of $P$: a random linear dense projection ($\theta^d W$), a random linear sparse projection ($\theta^d W_{\text{sparse}}$), and a random linear projection via the Fastfood transform (Le et al., 2013).
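To make the subspace idea concrete, the following is a minimal sketch of the dense random projection variant on a toy quadratic objective. It is an illustration only, not the implementation used in the experiments: the helper names (`dense_projection`, `reparameterize`, `subspace_gradient`), the Gaussian scaling, the toy loss, and the learning rate are all assumptions.

```python
import random

def dense_projection(d, D, seed=0):
    """Fixed random d x D matrix W; it is generated once and never trained."""
    rng = random.Random(seed)
    scale = D ** -0.5  # assumed scaling so that rows have roughly unit norm
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(D)] for _ in range(d)]

def reparameterize(theta_D0, theta_d, W):
    """Full parameters = starting weights + theta_d projected through W."""
    return [
        w0 + sum(t * row[j] for t, row in zip(theta_d, W))
        for j, w0 in enumerate(theta_D0)
    ]

def subspace_gradient(grad_theta_D, W):
    """Chain rule: gradient w.r.t. theta_d is W applied to the full gradient."""
    return [sum(w * g for w, g in zip(row, grad_theta_D)) for row in W]

def loss(theta_D, target):
    """Toy objective (hypothetical): squared distance to a target vector."""
    return sum((x - t) ** 2 for x, t in zip(theta_D, target))

D, d = 50, 5
theta_D0 = [0.0] * D          # stand-in for the pretrained weights
target = [1.0] * D
W = dense_projection(d, D)
theta_d = [0.0] * d           # zero init: we start exactly at theta_D0

initial_loss = loss(reparameterize(theta_D0, theta_d, W), target)
for _ in range(100):          # plain gradient descent over d numbers only
    theta_D = reparameterize(theta_D0, theta_d, W)
    grad_D = [2.0 * (x - t) for x, t in zip(theta_D, target)]
    grad_d = subspace_gradient(grad_D, W)
    theta_d = [t - 0.1 * g for t, g in zip(theta_d, grad_d)]
final_loss = loss(reparameterize(theta_D0, theta_d, W), target)
```

Only the `d` entries of `theta_d` are ever updated; `W` and `theta_D0` stay fixed, which is what makes the dimension of the subspace a meaningful measurement.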
We will primarily use the Fastfood transform, defined as:
$$\theta^D = \theta^D_0 + \theta^d M, \qquad M = HG\Pi HB \qquad (2)$$
The factorization of $M$ consists of $H$, a Hadamard matrix; $G$, a random diagonal matrix with independent standard normal entries; $B$, a random diagonal matrix with equal-probability $\pm 1$ entries; and $\Pi$, a random permutation matrix. Furthermore, the matrix multiplication with a Hadamard matrix can be computed in $\mathcal{O}(D \log d)$ via the Fast Walsh-Hadamard Transform. Note that everything but $\theta^d$ is fixed; therefore, the optimization problem lies only in $d$ dimensions. Note that if we constrain $M$ to be a binary matrix, we recover the sparsification problem; therefore, we can view finding intrinsic dimensionality as a continuous relaxation of the sparsification problem.
The standard method of measuring the intrinsic dimensionality of an objective, as proposed by Li et al. (2018), requires searching over various $d$, training using standard SGD over the subspace reparameterization $\theta^D$, and selecting the smallest $d$ that provides a satisfactory solution ($d_{90}$). Li et al. (2018) defined the satisfactory solution as being 90% of the full training metric. For example, if we reach 85% accuracy training a model with all of its parameters, the goal is to find the smallest $d$ that reaches $0.9 \times 85\% = 76.5\%$ accuracy; we call this dimension $d_{90}$. Let us also note that by merely initializing $\theta^d = 0$ we recover the original parameterization $\theta^D_0$, which in the context of finetuning represents the original weights of the pretrained model.
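The Fastfood factorization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the input is zero-padded to the next power of two at or above the output size, the factors are applied in the order sign flip, Hadamard, permutation, Gaussian scaling, Hadamard, and the result is divided by the padded length. The function names and the normalization are hypothetical choices.

```python
import random

def fwht(x):
    """Unnormalized Fast Walsh-Hadamard Transform; len(x) must be a power of two."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def make_fastfood(D, seed=0):
    """Draw the fixed random factors once; a seed fully determines them."""
    rng = random.Random(seed)
    L = 1 << max(0, (D - 1).bit_length())            # next power of two >= D
    B = [rng.choice((-1.0, 1.0)) for _ in range(L)]  # random sign flips
    G = [rng.gauss(0.0, 1.0) for _ in range(L)]      # Gaussian diagonal
    perm = list(range(L))
    rng.shuffle(perm)                                # random permutation
    return {"D": D, "L": L, "B": B, "G": G, "perm": perm}

def fastfood_project(theta_d, ff):
    """Map a d-dimensional vector to D dimensions without storing a dense matrix."""
    L = ff["L"]
    x = list(theta_d) + [0.0] * (L - len(theta_d))   # zero-pad to length L
    x = [b * v for b, v in zip(ff["B"], x)]
    x = fwht(x)
    x = [x[p] for p in ff["perm"]]
    x = [g * v for g, v in zip(ff["G"], x)]
    x = fwht(x)
    # Dividing by L offsets the growth of two unnormalized transforms
    # (this normalization is an assumption of the sketch).
    return [v / L for v in x[:ff["D"]]]
```

The stored state is three length-`L` vectors rather than a `d` by `D` matrix, and the map stays linear in `theta_d`, so the zero vector still maps to zero and recovers the starting weights.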
The way Li et al. (2018) define a satisfactory solution reduces the dependence of the calculated intrinsic dimension on the dataset’s size. For a small dataset, we will generally have worse end metrics and therefore a lower absolute cutoff; conversely, a larger dataset will require a higher cutoff.
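The 90% criterion can be written down directly. The sketch below assumes we already have, for each candidate dimension, the best metric reached by subspace training; the `results` values in the test are made up for illustration and are not measurements.

```python
def d90_threshold(full_metric, fraction=0.9):
    """Cutoff a subspace run must reach: a fraction of the full finetuning metric."""
    return fraction * full_metric

def smallest_satisfactory_d(results, full_metric, fraction=0.9):
    """results maps a subspace dimension d to the best metric reached at that d.

    Returns the smallest d whose metric meets the cutoff, or None if none does.
    """
    cutoff = d90_threshold(full_metric, fraction)
    satisfactory = [d for d, metric in results.items() if metric >= cutoff]
    return min(satisfactory) if satisfactory else None
```

Because the cutoff is relative to the full finetuning metric, a dataset on which full finetuning scores lower automatically gets a lower absolute target.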
Structure Aware Intrinsic Dimension
Due to the large size of pretrained language models (generally in the hundreds of millions of parameters), the only computationally reasonable subspace optimization method is one that utilizes the Fastfood transform. For example, if we are interested in subspace training with $d = 1000$ for the RoBERTa-Large model using a dense matrix, we would require 1.42 terabytes of memory to store just the projection matrix.
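The 1.42 terabyte figure can be reproduced with simple arithmetic. The sketch below takes a subspace dimension of 1000 and assumes roughly 355 million parameters for RoBERTa-Large and 4-byte (FP32) matrix entries; the parameter count and storage width are assumptions of this example.

```python
def dense_projection_bytes(num_model_params, d, bytes_per_value=4):
    """Memory needed to store a dense d x D projection matrix."""
    return num_model_params * d * bytes_per_value

ROBERTA_LARGE_PARAMS = 355_000_000   # assumed approximate parameter count
size_bytes = dense_projection_bytes(ROBERTA_LARGE_PARAMS, d=1000)
size_terabytes = size_bytes / 1e12   # decimal terabytes
```

By contrast, a Fastfood-style projection only needs a handful of vectors whose length is on the order of the model size, which is why it is the practical choice at this scale.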
Unfortunately, the method of finding the intrinsic dimension proposed by Li et al. (2018) is unaware of the layerwise structure of the function parameterized by $\theta$. Existing literature argues that in attention-based pretrained models, individual layers specialize separately (Clark et al., 2019); therefore, it is useful to incorporate a notion of structure when computing $d_{90}$. We define Structure-Aware Intrinsic Dimension (SAID) as the following:
$$\theta^D_i = \theta^D_{0,i} + \lambda_i P(\theta^{d-m})_i \qquad (3)$$
For $m$ layers, we trade $m$ parameters from our subspace parameter $\theta^d$ to allow for layerwise scaling through jointly learned $\lambda$; thus $\theta^d$ becomes $[\theta^{d-m}, \lambda]$. This allows the SAID method to focus a larger share of the capacity of $\theta^{d-m}$ on specific layers that might carry more relevant information for the task at hand. Conversely, we will refer to the layer-unaware method (Equation 2) as the Direct Intrinsic Dimension (DID) method.
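The layerwise scaling in SAID can be illustrated with a small sketch. It assumes the projected update has already been computed as one flat vector and is split into per-layer chunks, each scaled by its own learned scalar; the helper names and the toy numbers in the test are hypothetical.

```python
def split_by_layer(flat, layer_sizes):
    """Cut a flat parameter vector into consecutive per-layer chunks."""
    chunks, start = [], 0
    for size in layer_sizes:
        chunks.append(flat[start:start + size])
        start += size
    return chunks

def said_update(theta_D0_layers, projected_layers, lambdas):
    """Per layer i: starting weights + lambda_i * projected update for that layer."""
    return [
        [w0 + lam * p for w0, p in zip(layer0, proj)]
        for layer0, proj, lam in zip(theta_D0_layers, projected_layers, lambdas)
    ]

def said_budget(d, num_layers):
    """Split a total budget d into projected parameters and per-layer scalars."""
    return {"projected": d - num_layers, "layer_scales": num_layers}
```

A layer whose scalar is driven toward zero is effectively left at its pretrained weights, so the remaining budget concentrates on the layers that matter for the task.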
4 Intrinsic Dimensionality of Common NLP Tasks
4.1 Sentence Prediction
We first empirically calculate the intrinsic dimension of various pretrained models on a set of sentence prediction tasks from the GLUE Benchmark (Wang et al., 2018). We focus on analyzing BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) at both the base and large model sizes.
We chose to experiment with MRPC (Dolan and Brockett, 2005) and QQP (Iyer et al., 2017) as reference examples of small and large tuning datasets. MRPC is a binary classification task for predicting semantic equivalence of two paraphrases with roughly 3700 training samples, while QQP is a binary classification task for predicting semantic equivalence of two questions, with roughly 363k samples. For every dataset and every model, we run 100 subspace trainings with $d$ ranging from 10 to 10000 on a log scale. For every training run, we do a small hyperparameter search across four learning rates. We initialize every $\theta^d$ to the zero vector to allow for our starting point to be the original pretrained model. Our subspace optimization method also operates over the randomly initialized sentence classification head to ensure we have exactly $d$ parameters to optimize.
We use both the SAID and DID subspace optimization methods, which we implemented in the Huggingface Transformers library (Wolf et al., 2019). We present the results in Figure 1.
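One way to lay out such a sweep is a geometric grid of subspace dimensions. The text does not specify how the 100 values were chosen, so the sketch below assumes evenly log-spaced values rounded to integers (which produces some repeated values at the low end); the function name is hypothetical.

```python
def log_spaced_dims(low=10, high=10_000, count=100):
    """Integer subspace dimensions spaced evenly on a log scale from low to high."""
    ratio = (high / low) ** (1.0 / (count - 1))
    return [round(low * ratio ** k) for k in range(count)]

dims = log_spaced_dims()
```

Each value in `dims` would correspond to one subspace training run, repeated for each of the four learning rates.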
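Table 1: Estimated $d_{90}$ intrinsic dimension of MRPC and QQP for each pretrained model, under the SAID and DID methods.
Model  SAID MRPC  SAID QQP  DID MRPC  DID QQP
BERT-Base  1608  8030  1861  9295
BERT-Large  1037  1200  2493  1389
RoBERTa-Base  896  896  1000  1389
RoBERTa-Large  207  774  322  774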
4.2 Analysis
The first takeaway is the incredibly low dimensionality of viable solutions. With RoBERTa-Large, we can reach 90% of the full finetuning solution of MRPC using roughly 200 parameters and 800 parameters for QQP (Table 1). Recall that our approximation of intrinsic dimension is necessarily crude because it uses random projections and restricts them to the Fastfood transform; therefore, it is likely that the true intrinsic dimension is much lower.
Furthermore, RoBERTa consistently outperforms BERT across various subspace dimensions while having more parameters. We leave a more in-depth analysis of model parameter size on intrinsic dimensionality to a later section (§5.2).
Lastly, we see that adding a notion of structure in the computation of intrinsic dimension is beneficial, with the SAID method consistently improving over the structure-unaware DID method.
5 Intrinsic Dimension, Pre-Training, and Generalization Gap
One interpretation of the intrinsic parameter vector is that it encodes the task at hand with respect to the original pretrained representations. Therefore, we can interpret $d$ as the minimal description length of the task within the framework dictated by the pretrained representations (Hinton and Zemel, 1993). Under this interpretation of intrinsic dimensionality, we hypothesize that pretraining implicitly lowers the intrinsic dimensionality of the average NLP task and therefore compresses the minimal description length of those same tasks.
What do we more precisely mean by the intrinsic parameter encoding a task within the framework provided by the pretrained representations? Traditionally, a finetuned model (e.g., for a classification task) simply consists of a classification head $g$, parameterized by $w_g$, applied to finetuned representations $f$, parameterized by $w_f$, per sample $x$. Therefore, to fully describe a task, we need to pack together parameterizations and weights $\{g, f, w_g, w_f\}$. This model description is completely decoupled from the original weights of the pretrained representation $w_{f_0}$; therefore, to represent $n$ classification tasks, we need to maintain $n \{w_g, w_f\}$; additionally, the task representation is incredibly high dimensional. Conversely, finetuning utilizing SAID in $d$ dimensions requires storing only $\theta^d$ per task, a single random seed used to generate $M$, and the original pretrained weights $w_{f_0}$. Therefore, we can represent arbitrary NLP tasks within a single pretrained model framework with $d + 1$ parameters.
For example, in the last section, we represented MRPC with roughly 200 parameters, which translates to needing less than a kilobyte of data to encode a complex natural language task within the framework provided by RoBERTa.
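The storage claim is easy to check. The sketch below assumes each intrinsic parameter is stored as a 4-byte (FP32) value and that the random seed occupies 8 bytes; both widths are assumptions of this example.

```python
def task_encoding_bytes(d, bytes_per_param=4, seed_bytes=8):
    """Bytes needed to store one task: d intrinsic parameters plus one random seed."""
    return d * bytes_per_param + seed_bytes

mrpc_bytes = task_encoding_bytes(200)   # roughly 200 parameters for MRPC
```

The shared pretrained weights are stored once and reused, so adding another task costs only another vector of this size.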
We hypothesize that the better the pretrained models are, the fewer bits (description length) are needed to represent the average NLP task, as we will demonstrate empirically in the next section.
5.1 Pre-Training Intrinsic Dimension Trajectory
To verify our hypothesis of pretraining optimizing intrinsic dimension, we retrain a RoBERTa-Base from scratch and measure various NLP tasks’ intrinsic dimensions using the SAID method across various checkpoints. We completely replicate the setting described by Liu et al. (2019), apart from only training for a total of 200k steps (instead of 500k) with half the batch size (1k). To calculate the intrinsic dimension more efficiently, we reuse the best learning rates discovered in Section 4 for $d < 10000$ and use a fixed learning rate for anything else. To find $d_{90}$ we do a binary search across $d$ for each checkpoint, with a minimum of 100 and a maximum of 4 million. The “full solution” that we use when deciding the cutoff is computed by finetuning the checkpointed model in the standard way. We compute SAID on six datasets: MRPC, QQP, Yelp Polarity (Zhang et al., 2015), SST-2 (Socher et al., 2013), MNLI (Williams et al., 2018), and ANLI using all rounds of data (Nie et al., 2019).
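The binary search over subspace dimensions can be sketched as follows. The sketch assumes a hypothetical callable `evaluate(d)` that runs subspace training at dimension `d` and returns the resulting metric, and it assumes the metric is non-decreasing in `d`, which is what makes binary search valid. The real search may well probe a coarser grid.

```python
def binary_search_d90(evaluate, full_metric, lo=100, hi=4_000_000, fraction=0.9):
    """Smallest d in [lo, hi] with evaluate(d) >= fraction * full_metric.

    Returns None when even the largest dimension misses the cutoff.
    """
    cutoff = fraction * full_metric
    if evaluate(hi) < cutoff:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(mid) >= cutoff:
            hi = mid          # mid is satisfactory; try something smaller
        else:
            lo = mid + 1      # mid is too small; move up
    return lo
```

Each probe is a full subspace training run, so the logarithmic number of probes (about 22 for this range) is what keeps the per-checkpoint measurement affordable.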
We present our results in Figure 2. We see that the intrinsic dimensionality of RoBERTa-Base monotonically decreases as we continue pretraining. We do not explicitly optimize for intrinsic dimensionality during pretraining (the language model does not have access to downstream datasets!), but nonetheless the intrinsic dimension of these downstream tasks continues to decrease.
Moreover, tasks that are easier to solve consistently show lower intrinsic dimensionality across all checkpoints, for example, Yelp Polarity vs. the notoriously tough ANLI dataset. The correlation between tasks traditionally hard for RoBERTa and their large intrinsic dimension hints at a connection between generalization and intrinsic dimension. We will discuss generalization further in Section 5.3.
Given our task representation interpretation of intrinsic dimensionality, we argue that the large scale training of Masked Language Models (MLM) learns generic and distributed enough representations of language to facilitate downstream learning of highly compressed task representations. Furthermore, we argue for another perspective of pretraining learning representations that form a compression framework with respect to various NLP tasks.
5.2 Parameter Count and Intrinsic Dimension
We would also like to measure the relationships between the parameter count of arbitrary pretrained models and the intrinsic dimension of downstream NLP tasks. The optimal experiment to run would be to fix the pretraining method, e.g., MLM RoBERTa style, vary the architecture size from small to very big, and compute the intrinsic dimension of a group of tasks at every size of the model. Unfortunately, such an experiment is computationally infeasible due to the need to train many RoBERTa models.
Due to these constraints, we opt to do an empirical study over existing pretrained models, regardless of the pretraining method. We show that the trend is strong enough to overcome differences in training methodology. We select the following pretrained models in our study: BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2019), Electra (Clark et al., 2020), Albert (Lan et al., 2019), XLNet (Yang et al., 2019), T5 (Raffel et al., 2019), and XLM-R (Conneau et al., 2019). Furthermore, we selected various sizes of these models, as available publicly within the HuggingFace Transformers library (Wolf et al., 2019).
We used the MRPC dataset and computed intrinsic dimension for every pretrained model utilizing the same binary search methodology mentioned in the previous section with additional small hyperparameter searches across learning rate (due to the wide range of learning rates needed by various models).
We present our results in Figure 3. We see a strong general trend that as the number of parameters increases, the intrinsic dimension of finetuning on MRPC decreases. We ran this experiment on other datasets to ensure that this is not an artifact of the dataset. Our experiments showed the same trend; we refer to the Appendix for all trends per dataset.
Within the same window of number of parameters, pretraining methodology becomes essential. For example, in the regime of $10^8$ parameters, the RoBERTa method of pretraining dominates similar-sized pretraining methods. However, there does not seem to be a method that can overcome the limitations induced by the number of parameters. Interpreting these results through the lens of learning a compression framework for NLP tasks is straightforward; the more parameters we have in the model, the less we need to represent a task.
5.3 Generalization Bounds through Intrinsic Dimension
We have shown strong empirical evidence connecting pretraining, finetuning, and intrinsic dimensionality. However, we have yet to argue the connection between intrinsic dimensionality and generalization. Given that we have seen pretraining minimize intrinsic dimension, we hypothesize that generalization improves as the intrinsic dimension decreases.
To do so, we will empirically examine the connection between $d_{90}$ and evaluation set performance by looking at various checkpoints from our RoBERTa experiments in Section 5.1. We also plot the relative generalization gap (delta between train time performance and test time performance).
In Figure 4 we plot the evaluation accuracies achieved by our pretraining experiment in Section 5.1. A lower intrinsic dimension is strongly correlated with better evaluation performance. Additionally, we are interested in measuring the relative generalization gap ($\frac{acc_{train} - acc_{eval}}{1 - acc_{eval}}$) across intrinsic dimension. We select the training accuracy that provides us with the best evaluation metrics when computing this figure.
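As a small worked example, the relative generalization gap can be computed as below. The sketch assumes the gap is the train/eval accuracy difference normalized by the evaluation error, with accuracies expressed as fractions in [0, 1); the function name is hypothetical and the numbers in the test are illustrative, not results.

```python
def relative_generalization_gap(acc_train, acc_eval):
    """(train accuracy - eval accuracy) divided by the eval error rate."""
    if acc_eval >= 1.0:
        raise ValueError("evaluation accuracy must be below 1.0")
    return (acc_train - acc_eval) / (1.0 - acc_eval)
```

Normalizing by the evaluation error makes gaps comparable across tasks whose absolute accuracies differ widely.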
We present our results in Figure 5. Lower intrinsic dimension once again correlates strongly with a smaller relative generalization gap. If we interpret the intrinsic dimension as a measure of complexity, we expect the generalization gap to decrease with intrinsic dimension.
5.3.1 Generalization Bounds
By applying standard compression based generalization bounds, we can provide theoretical backing to the empirical connection between intrinsic dimension and generalization (Arora et al., 2018).
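Consider the following definition of multiclass classification loss with an optional margin $\gamma$ over our supervised dataset $\mathcal{D}$.
$$\mathcal{L}_\gamma(f) = \mathbb{P}_{(x,y) \sim \mathcal{D}}\left[ f(x)[y] \le \gamma + \max_{j \ne y} f(x)[j] \right] \qquad (4)$$
When $\gamma = 0$, $\mathcal{L}_0$ recovers the standard classification loss. Furthermore, let $\hat{\mathcal{L}}_\gamma(f)$ be an unbiased empirical estimate of the margin loss.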
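Theorem 1.
Let $f$ be a function which is parameterized by $\theta^D$ as described in Equation 1 with a total of $d$ trainable intrinsic parameters on a dataset with $m$ samples. Then, with high probability, we can state the following asymptotic generalization bound:
$$\mathcal{L}_0(f) \le \hat{\mathcal{L}}_0(f) + \mathcal{O}\left(\sqrt{\frac{d}{m}}\right) \qquad (5)$$
Proof.
We defer the proof to Section A.1 in the Appendix.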
This generalization bound is independent of the underlying parameter count ($D$) of the pretrained model but depends on the ability to compress the downstream task ($d$). Moreover, given that our previous section shows larger models compress better, our bounds are aligned with general intuition and recent empirical evidence that larger pretrained models generalize better. Explicitly, these bounds only apply to pretrained methods trained with the intrinsic dimension subspace method; research has yet to show that standard SGD optimizes in this low-dimensional space (although experimentally, this seems to be confirmed). We leave the theoretical contribution of showing that SGD optimizes in a space resembling such an intrinsic subspace for future work.
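To give a feel for why the dependence on the intrinsic dimension rather than the full parameter count matters, the sketch below evaluates the square-root complexity term for both. The constant hidden by the asymptotic notation is ignored, and the specific inputs (about 207 intrinsic parameters and 3700 samples for MRPC, and an assumed 355 million parameters for RoBERTa-Large) are used only for illustration.

```python
import math

def complexity_term(num_params, num_samples):
    """The sqrt(parameters / samples) term of the bound, without constants."""
    return math.sqrt(num_params / num_samples)

intrinsic_term = complexity_term(207, 3700)            # uses intrinsic dimension
full_term = complexity_term(355_000_000, 3700)         # uses assumed full count
```

With the intrinsic dimension the term stays below one, so the bound can say something about accuracy; with the full parameter count it is in the hundreds and the bound is vacuous.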
We want to highlight that generalization is not necessarily measured by the pretrained model’s parameter count or measure of complexity, but the pretrained model’s ability to facilitate the compression of downstream tasks. In some sense, if we want to compress downstream tasks better, we must expect pretrained representations to have a considerable measure of complexity.
6 Conclusion
In conclusion, we proposed viewing the various phenomena surrounding finetuning and pretraining through the lens of intrinsic dimensionality. We empirically showed that common natural language tasks could be learned with very few parameters, sometimes on the order of hundreds, when utilizing pretrained representations. We provided an interpretation of pretraining as providing a compression framework for minimizing the average description length of natural language tasks and showed that pretraining implicitly minimizes this average description length.
We continued with an empirical study of existing pretraining methods and their respective intrinsic dimension, uncovering the phenomenon that intrinsic dimensionality decreases as we increase the number of pretrained representation parameters. This phenomenon provides some intuition for the trend of growing pretrained representations. We connected intrinsic dimensionality with generalization by first showing that pretrained models with lower intrinsic dimensions across various tasks achieve higher evaluation accuracies and lower relative generalization gaps. Furthermore, we explained these empirical results by applying well-known generalization bounds to the intrinsic dimension to get generalization bounds that grow on the order of the intrinsic dimension, not the pretrained model’s parameter count.
Intrinsic dimensionality is a useful tool for understanding the complex behavior of large models. We hope that future work will make explicit theoretical connections between SGD and optimizing the intrinsic dimension, as well as explain exactly why pretraining methods optimize the intrinsic dimensionality of previously unseen tasks.
References
Aghajanyan et al. (2020). Better finetuning by reducing representational collapse. arXiv preprint arXiv:2008.03156. Cited by: §2.
Arora et al. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296. Cited by: §5.3.1, §A.1.
Chen et al. (2020). The lottery ticket hypothesis for pretrained bert networks. Advances in Neural Information Processing Systems 33. Cited by: §2.
Clark et al. (2019). What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341. Cited by: §3.
Clark et al. (2020). Electra: pretraining text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: §5.2.
Conneau et al. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Cited by: §5.2.
Desai et al. (2019). Evaluating lottery tickets under distributional shifts. arXiv preprint arXiv:1910.12708. Cited by: §2.
Devlin et al. (2018). Bert: pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4.1, §5.2.
Dolan and Brockett (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005). Cited by: §4.1.
Hinton and Zemel (1993). Autoencoders, minimum description length and helmholtz free energy. Advances in neural information processing systems 6, pp. 3–10. Cited by: §5.
Houlsby et al. (2019). Parameter-efficient transfer learning for nlp. arXiv preprint arXiv:1902.00751. Cited by: §2.
Iyer et al. (2017). First quora dataset release: question pairs. Cited by: §4.1.
Lan et al. (2019). Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §5.2.
Le et al. (2013). Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning, Vol. 85. Cited by: §3.
Lewis et al. (2020). Pretraining via paraphrasing. arXiv:2006.15020. Cited by: §1.
Lewis et al. (2019). Bart: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1, §5.2.
Li et al. (2018). Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838. Cited by: §1, §2, §3.
Liu et al. (2019). Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §4.1, §5.1, §5.2.
Nie et al. (2019). Adversarial nli: a new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599. Cited by: §5.1.
Prasanna et al. (2020). When bert plays the lottery, all tickets are winning. arXiv preprint arXiv:2005.00561. Cited by: §2.
Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1.
Raffel et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §5.2.
Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §5.1.
Wang et al. (2018). GLUE: a multitask benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. Cited by: §4.1.
Williams et al. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: §5.1.
Wolf et al. (2019). HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv, pp. arXiv–1910. Cited by: §4.1, §5.2.
Yang et al. (2019). Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763. Cited by: §5.2.
Zhang et al. (2015). Character-level Convolutional Networks for Text Classification. arXiv:1509.01626 [cs]. Cited by: §5.1.
Appendix A
A.1 Proofs
Arora et al. (2018) define $(\gamma, S)$-compressibility using a helper string $s$ as follows.
Definition 1.
$(\gamma, S)$-compressible using helper string $s$.
Suppose $G_{\mathcal{A},s} = \{g_{A,s} \mid A \in \mathcal{A}\}$ is a class of classifiers indexed by trainable parameters $A$ and fixed strings $s$. A classifier $f$ is $(\gamma, S)$-compressible with respect to $G_{\mathcal{A},s}$ using helper string $s$ if there exists $A \in \mathcal{A}$ such that for any $x \in S$, we have for all $y$
$$|f(x)[y] - g_{A,s}(x)[y]| \le \gamma \qquad (6)$$
Remark 1.
If we parameterize $f(x; \theta)$ via the intrinsic dimension approach as defined in Equation 1, then $f$ is compressible losslessly using a helper string consisting of the random seed used to generate the static random projection weights and the initial pretrained representation $\theta^D_0$. Therefore we say $f$ parameterized by either DID or SAID is $(0, S)$-compressible.
A theorem in Arora et al. (2018) states that, given a compression consisting of $r$ discrete states, we achieve the following generalization bound.
$$\mathcal{L}_0(f) \le \hat{\mathcal{L}}_\gamma(f) + \mathcal{O}\left(\sqrt{\frac{d \log r}{m}}\right) \qquad (7)$$
We can trivially represent our parameters $\theta^d$ in a discrete fashion through discretization (as was done in Arora et al. (2018)), and the number of states $r$ depends on the level of quantization but is static once chosen (FP32 vs. FP16).
We then use the fact that models trained in a low-dimensional subspace using the SAID/DID methods are $(0, S)$-compressible to derive the final asymptotic bound.
$$\mathcal{L}_0(f) \le \hat{\mathcal{L}}_0(f) + \mathcal{O}\left(\sqrt{\frac{d}{m}}\right) \qquad (8)$$
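The step from the compression bound to the final asymptotic bound can be spelled out as a short derivation. This is a sketch under two assumptions that follow the discussion above: the compression is lossless (margin $\gamma = 0$), and the number of quantization states $r$ per parameter is a fixed constant once the numeric format is chosen, so $\log r$ is absorbed into the asymptotic notation. Here $d$ is the number of intrinsic parameters and $m$ the number of samples.

```latex
% From the compression bound to the final bound, assuming gamma = 0
% and a constant number r of discrete states per parameter.
\begin{align*}
\mathcal{L}_0(f)
  &\le \hat{\mathcal{L}}_{\gamma}(f)
     + \mathcal{O}\!\left(\sqrt{\frac{d \log r}{m}}\right) \\
  &= \hat{\mathcal{L}}_{0}(f)
     + \mathcal{O}\!\left(\sqrt{\frac{d \log r}{m}}\right)
     && \text{(lossless compression, $\gamma = 0$)} \\
  &= \hat{\mathcal{L}}_{0}(f)
     + \mathcal{O}\!\left(\sqrt{\frac{d}{m}}\right)
     && \text{($\log r = \mathcal{O}(1)$, e.g. $r \le 2^{32}$ for FP32)}
\end{align*}
```

The full parameter count never enters the derivation, because the pretrained weights and the projection seed live in the helper string rather than in the trainable parameters.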