1 Introduction & Related Work
The recent emergence of successful large-scale language representation models (cf. Devlin et al. (2018); Radford et al. (2019); Yang et al. (2019); Liu et al. (2019); Shoeybi et al. (2019); Keskar et al. (2019)) has led to an explosion of self-supervised language representation models trained on massive corpora of internet text. Even though these large models are trained on self-supervision [1] tasks such as Next-Sentence Prediction (Devlin et al., 2018), Masked-Language Modeling (Devlin et al., 2018), and vanilla next-token-prediction language modeling, they perform well on tasks such as entailment or question answering for which they were not trained, whether in a zero-shot fashion (Radford et al. (2019); Wang et al. (2019), among others) or in a finetuned setting (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Liu et al., 2019).
[1] There is a larger debate, which we do not intend to participate in, as to whether or not such tasks are “self-supervised”, “unsupervised”, “supervised”, or something in between.
Although such models perform well, relatively little is known about why they work (Hao et al., 2019). Recent work has shown that properly tuning such models with nearly identical configuration and simply training on more data (Liu et al., 2019) leads to large gains in performance. Such models have also been shown to be highly compressible either through distillation (Sanh et al., 2019; Jiao et al., 2019) or quantization (Zafrir et al., 2019; Shen et al., 2019).
An open question remains as to whether or not such models have to be over-parametrized (Zhang et al., 2019) in order to perform well. Kovaleva et al. (2019) and Hao et al. (2019), among others, have speculated that models such as BERT (Devlin et al., 2018) are vastly over-parametrized for the downstream tasks on which they are finetuned. In this work, we provide empirical evidence in favor of this view, specifically for text classification.
2 Experimental Setup
We conduct all experiments on the task of fine-tuning our model for text classification. In particular, we evaluate on three standard datasets for this task: IMDB (Maas et al., 2011), AG-news (Zhang et al., 2015), and DBpedia (Zhang et al., 2015). These datasets differ widely in number of examples and in target space cardinality (see Table 1). Running our analyses on all three provides an initial check on whether the conclusions we draw generalize.
For simplicity, we fine-tune BERT with an identical experimental setup for each dataset. Each model is finetuned from the pretrained BERT-base (Devlin et al., 2018) model configuration which uses a hidden size of 768 with 12 transformer blocks and 12 attention heads. To adapt it for classification, we follow Devlin et al. (2018) and add a simple linear layer on top of the [CLS] token embedding.
We follow Sun et al. (2019) for the selection of hyperparameters. We use a batch size of 32 via gradient accumulation. We use the AdamW optimizer à la Loshchilov and Hutter (2017) with and , and combine this with slanted triangular learning rate decay (Howard and Ruder, 2018), with a cycle of 4 epochs, a warm-up proportion of 0.1, and a minimum learning rate of 0. Moreover, we adapt the maximum learning rate to each layer. The last layer is initialized with a maximum learning rate of, and a decay of 0.95 is applied for each layer below (i.e., the maximum learning rate of each layer is 0.95 times that of the layer above it). Anecdotally, we observed that using slanted triangular learning rates and decaying learning rates slightly improved results for all datasets. In addition, we use a dropout (Srivastava et al., 2014) value of post-[CLS]-token.
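The schedule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the training code used for the experiments: it uses a simplified linear warm-up followed by linear decay rather than the exact formulation of Howard and Ruder (2018), and the function names, `TOTAL_STEPS`, and the value of `TOP_LR` are hypothetical placeholders that do not come from the text.

```python
import math


def slanted_triangular(step, total_steps, max_lr, warmup=0.1, min_lr=0.0):
    """Linear warm-up to max_lr over the first `warmup` fraction of steps,
    then linear decay down to min_lr (simplified slanted triangular shape)."""
    cut = max(1, math.floor(total_steps * warmup))
    if step < cut:
        frac = step / cut
    else:
        frac = 1.0 - (step - cut) / max(1, total_steps - cut)
    return min_lr + (max_lr - min_lr) * frac


def layerwise_max_lrs(top_lr, num_layers=12, decay=0.95):
    """Maximum learning rate per transformer block.
    Index num_layers - 1 is the last block; each block below is scaled by `decay`."""
    return [top_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


TOP_LR = 2e-5       # hypothetical value, NOT taken from the text
TOTAL_STEPS = 1000  # hypothetical number of optimizer steps in one cycle

max_lrs = layerwise_max_lrs(TOP_LR)
# Learning rate of the last block at every step of the cycle.
schedule = [slanted_triangular(s, TOTAL_STEPS, max_lrs[-1]) for s in range(TOTAL_STEPS + 1)]
```

In a real training loop, each entry of `max_lrs` would serve as the peak of its own block's schedule, for example through per-parameter-group learning rates in the optimizer.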
We respectively retained 10%, 5%, and 5% of the training set of IMDB, AG-news, and DBpedia for early-stopping, and observed a convergence in under four epochs for all experiments. Our base-level fine-tuning results on the test sets are given in Table 2.
We sanity-checked our initial results through https://paperswithcode.com/. At the time this paper was written, our results place us in 9th place for IMDB, 4th place for AG-news and 2nd place for DBpedia. Although ancillary to the primary argument of this work, this epitomizes how powerful these pretrained systems are: within a few hours and at close to no cost, it is now possible to train models that get close to state-of-the-art results on most datasets.
|Dataset||Training Set||Test Set||Output Classes|
3 Results and Analyses
In the following exposition, “pretrained models” refers to models that have not been fully finetuned: a linear classifier is trained on top of the pretrained embedding from the [CLS] token, but the BERT model itself is frozen. We refer to a “finetuned model” as one that has been fully finetuned end-to-end on the relevant dataset and task.
Several experiments were conducted to inspect and understand the outputs of BERT-based models, in both pretrained-only and finetuned states. Most results and observations are consistent through all three datasets, which points towards the generalizability of the emergent properties derived from our studies toward other domains.
3.1 Dimension Reduction via Principal Component Analysis
To understand how information is stored and represented in BERT vectors, we begin by analyzing model outputs through Principal Component Analysis (PCA), which decomposes a set of such vectors into an orthogonal basis and thereby exposes the dimensions of maximal variance in BERT vectors.
3.1.1 Patterns in the output of pretrained BERT
Do dimensions with maximal variance generalize across domains for text classification? The answer indicates how “multi-task” the components of BERT representations are. To investigate, we probe BERT using a general-domain PCA model and attempt to understand how compressible BERT representations are when carried cross-domain.
We computed [CLS] token embeddings on a random sample of 1M sentences from a cleaned, English-language Wikipedia dump and trained a PCA model on these embeddings. We then trained a linear classifier on top of this BERT + PCA combination for each of the datasets in our experimental setup.
We compare the performance of this process with two alternatives: the case where the PCA model is trained on the dataset of interest (removing the out-of-domain component), and the case where a random subset of BERT dimensions is selected. We show the behavior under these three scenarios for IMDB in Figure 1, for AG-news in Figure 2, and for DBpedia in Figure 3.
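The three feature sets being compared can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the random arrays stand in for real [CLS] embeddings, PCA is computed through a plain SVD rather than whatever implementation was used in the experiments, and the helper names `fit_pca`, `project`, and `random_dimensions` are hypothetical. The downstream linear classifier is omitted.

```python
import numpy as np


def fit_pca(embeddings, n_components):
    """Return (mean, components); components are rows ordered by decreasing variance."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(embeddings, mean, components):
    """Project embeddings onto the given principal axes."""
    return (embeddings - mean) @ components.T


def random_dimensions(embeddings, n_dims, rng):
    """Baseline: keep a random subset of the raw embedding dimensions."""
    cols = rng.choice(embeddings.shape[1], size=n_dims, replace=False)
    return embeddings[:, cols]


rng = np.random.default_rng(0)
general = rng.normal(size=(2000, 768))  # stand-in for general-domain [CLS] embeddings
task = rng.normal(size=(500, 768))      # stand-in for task-specific [CLS] embeddings

# Scenario 1: PCA fitted out-of-domain, applied to the task embeddings.
mean, components = fit_pca(general, n_components=25)
task_out_of_domain = project(task, mean, components)

# Scenario 2: PCA fitted in-domain on the task embeddings themselves.
mean_in, components_in = fit_pca(task, n_components=25)
task_in_domain = project(task, mean_in, components_in)

# Scenario 3: randomly selected raw dimensions.
task_random = random_dimensions(task, 25, rng)
```

Each of the three resulting matrices would then be fed to the same linear classifier, so that any difference in accuracy is attributable to the choice of features.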
The behavior in Figures 1, 2, and 3 clearly demonstrates that there is generalizable structure and information in BERT outputs which makes a large portion of output features redundant. Although performance clearly degrades when the PCA model is trained out-of-domain versus in-domain, the gap between the two is smaller than the gap to randomly selected dimensions.
An interesting anecdotal result is that PCA does not require a large sample to reach full capacity, as evaluated by downstream performance as a feature set on a given task. We initially trained PCA on around 1M embeddings, but after manual analysis, it was determined that around 32,000 embeddings are sufficient to obtain performance on par with the full set of 1M.
3.1.2 Patterns in the output of finetuned BERT
We empirically observe that when BERT is finetuned, the information in the [CLS] token embedding is compressed into a small number of dimensions: surprisingly good performance is obtained with only a handful of dimensions. With 5 principal components obtained from the general PCA, accuracy is only 0.004 (IMDB) / 0.35 (AG-news) / 1.05 (DBpedia) percentage points away from that of models trained atop all principal components. With 25 principal components, the accuracy gap drops to 0.02 percentage points for DBpedia, and significantly less for the others. Hence, output information can be drastically compressed.
3.1.3 Explained variance
Principal Component Analysis admits the useful property of maximizing the variance of the projected data across each component, subject to maintaining orthogonality to existing components. We utilize this fact, as is commonly done, to examine the percentage of observed variance encoded in each projected axis. In training PCA on the general Wikipedia dataset, we observed shared phenomena across all three datasets.
We observe that the first principal axes become more important when the model is finetuned, i.e., they explain significantly more variance than their counterparts in the pretrained models. However, the number of such principal axes is low. Surprisingly, the point at which we encounter the first principal component that explains less variance in the finetuned model than in the pretrained model roughly corresponds to the number of classes contained in the task at hand.
Define the ith variance ratio to be the percentage of variance explained by the ith principal component of the finetuned model divided by the percentage of variance explained by the ith principal component of the pretrained model. We display the variance ratios for the first 20 principal components across the three datasets under consideration in Figure 4.
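The variance ratio defined above can be computed as in the following NumPy sketch. It is an illustration only: the two synthetic matrices stand in for finetuned and pretrained [CLS] embeddings, the explained-variance percentages are obtained from separate SVDs of each matrix (an assumption about how the two PCA spectra are compared), and the function names are hypothetical.

```python
import numpy as np


def explained_variance_ratio(embeddings):
    """Fraction of total variance explained by each principal component."""
    centered = embeddings - embeddings.mean(axis=0)
    # Squared singular values are proportional to the per-component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def variance_ratios(finetuned, pretrained, n_components=20):
    """i-th entry: explained variance of component i (finetuned) over the same (pretrained)."""
    ft = explained_variance_ratio(finetuned)[:n_components]
    pt = explained_variance_ratio(pretrained)[:n_components]
    return ft / pt


rng = np.random.default_rng(0)
pretrained = rng.normal(size=(1000, 64))  # stand-in: variance spread evenly
finetuned = rng.normal(size=(1000, 64))   # stand-in: variance concentrated in 4 axes
finetuned[:, :4] *= 5.0

ratios = variance_ratios(finetuned, pretrained)
```

In this synthetic setup the first four ratios exceed 1 and the fifth falls below 1, mimicking the qualitative pattern described for a four-class task.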
For AG-news, we note that the first four axes are significantly more important in the finetuned model than in the pretrained-only model, with variance ratios between 1.3 and 3.7, before dropping to a variance ratio of 0.76 in the fifth principal component. For DBpedia, the first 13 variance ratios are greater than or equal to 1, and for IMDB, the first variance ratio is 1.25 and drops below 1 for the second dimension. [2]
[2] As a reminder, AG-news has 4 output categories, DBpedia has 14 output categories, and IMDB has 2 output categories (see Table 1).
We hypothesize, without substantive testing, that the BERT embeddings corresponding to [CLS] tokens develop a natural dimensionality close to the number of natural categories of each dataset (cf. Table 1). This is consistent with the intuition that BERT is vastly overparameterized for many of the tasks at hand. [3]
[3] We suspect this is because the standard text classification tasks chosen here tend to require little-to-no reasoning.
3.2 Does BERT have salient neurons?
|IMDB||50 % (2)||88.0%||67.0%||77.4%||93.7%||93.7%||93.8%||93.7%|
|AG-news||25 % (4)||90.4%||43.5%||70.3%||94.7%||83.6%||94.3%||94.3%|
|DBpedia||7.14 % (14)||99.02%||17.67%||53.67%||99.33%||60.9%||99.0%||99.2%|
We now investigate how individual dimensions of BERT [CLS] embeddings learn to store information, removing the explicit cumulative variance-maximizing projections of PCA in the previous section. To do this, we explicitly select output dimensions that are most useful for each specific task. We refer to these dimensions as salient neurons, a slight nod towards the sentiment neurons referenced in Radford et al. (2017).
3.2.1 Salient neurons in pretrained BERT
We begin by investigating, in a similar spirit to Radford et al. (2017), which dimensions or neurons from the pretrained BERT model directly exhibit useful information for our three datasets. To select the best neuron for a given classification task, we perform 5-fold cross-validation on the train set and select the individual dimension with the best mean accuracy score over the folds.
As the original number of dimensions to search over is high (768 for the BERT-base model), selecting the subset of a given size that maximizes our cross-validated accuracy score suffers from combinatorial explosion. We therefore proceed in a greedy fashion, choosing each new dimension to add to the existing ones by cross-validation. [4]
[4] An interesting extension for future work is to understand whether a more optimal search for hidden unit combinations leads to better performance than the results shown in Table 3.
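The greedy procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code behind Table 3: a nearest-class-mean classifier stands in for the linear classifier, the data is synthetic with one deliberately informative dimension, and all function names are hypothetical.

```python
import numpy as np


def centroid_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a nearest-class-mean classifier (stand-in for a linear classifier)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dists.argmin(axis=1)] == test_y).mean())


def cv_accuracy(x, y, dims, n_folds=5, seed=0):
    """Mean accuracy over folds, using only the embedding columns listed in dims."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(centroid_accuracy(x[train_idx][:, dims], y[train_idx],
                                        x[test_idx][:, dims], y[test_idx]))
    return float(np.mean(scores))


def greedy_select(x, y, n_select):
    """Add, one at a time, the dimension that most improves cross-validated accuracy."""
    selected = []
    for _ in range(n_select):
        remaining = [d for d in range(x.shape[1]) if d not in selected]
        best = max(remaining, key=lambda d: cv_accuracy(x, y, selected + [d]))
        selected.append(best)
    return selected


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
x = rng.normal(size=(300, 20))  # stand-in for [CLS] embeddings
x[:, 7] += 3.0 * y              # make dimension 7 informative about the label

selected = greedy_select(x, y, n_select=2)
best_single_score = cv_accuracy(x, y, selected[:1])
```

Greedy selection evaluates on the order of 768 candidates per added dimension instead of every possible subset, which is what keeps the search tractable; the trade-off is that the chosen subset is not guaranteed to be the best one of its size.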
Results for the pretrained-only model are presented in the left part of Table 3. As is evident, certain neurons encode, with no finetuning, information that is useful for the classification tasks outlined in Section 2, even though the model has not received explicit signal for these tasks before. This suggests that through the auxiliary tasks used to train BERT (i.e., masked word prediction and next-sentence prediction), the model has learned to embed information which encapsulates sentiment and other topical categorizations. As is clear in Table 3, however, results on selected pretrained subsets of embedding dimensions are still quite far from those obtained from finetuned models, showing the value of domain alignment via finetuning.
3.2.2 Salient neurons in finetuned BERT
With finetuned models, one would expect that salient neurons would manifest themselves more clearly. As shown in Table 3, this is confirmed in our experiments. These salient neurons are surprisingly effective, in many cases removing the need for the fully finetuned classification layer. For the IMDB dataset, using a linear classifier on even a single neuron from the [CLS]-token embedding provides equal performance to models which utilize the full embedding. For AG-news and DBpedia this is less pronounced, primarily due to the number of classes contained in the given datasets. However, the best neuron manages to provide very strong separation between classes, and does much more than only separating one class from all the others, as shown in Figure 5, which clearly illustrates the power of a single-dimensional representation from this neuron.
An interesting anecdotal observation from our experiments is that the performance when selecting a number of neurons equal to the number of classes is approximately equal to the case where we utilize the entire embedding. This leads us to hypothesize that the over-parametrization of the finetuned BERT model for the tasks at hand results in embedding vectors lying on a significantly lower-dimensional manifold than the full 768 dimensions, with the dimensionality of this natural manifold tending towards the number of classes. [5]
[5] A byproduct of the finetuning being explicitly for class separation.
A natural follow-up question to the previous analysis is whether or not salient neurons are unique – put differently, are there many redundant salient neurons?
To investigate, we consider the empirical accuracy distribution of individual classifiers trained on single elements of the [CLS] token embedding. In Figure 6, we display these empirical distributions for IMDB (Fig. 5(a)), AG-news (Fig. 5(b)), and DBpedia (Fig. 5(c)). As is evident, the distribution of accuracy scores is quite dense, i.e., there are no gaps in how individual neurons perform. Interestingly, the number of salient neurons that are highly performant is quite low, though there does seem to be some redundant information. For example, on the IMDB dataset, around 7% of neurons result in an accuracy greater than 0.93, but only a single neuron gives an accuracy score greater than 0.936. [6]
[6] As a reminder, neurons are evaluated on the test set here, but when selecting neurons for inclusion in Table 3, selection occurs via cross-validation.
4 Discussion and Future Work
In this work, we show empirical evidence that both the pretrained and finetuned representations in a BERT [CLS]-token embedding contain significant amounts of redundancy, and, in the case of finetuning, exhibit low dimensional manifold structure intimately related to the problem at hand. We conclude, through experimental investigation, that BERT is vastly over-parametrized for standard text classification tasks, echoing Sanh et al. (2019).
One major consequence is that it is possible to reduce the BERT [CLS]-token output to a very small number of dimensions and lose almost no accuracy. This can be useful in settings where one needs to store many such embeddings, such as text retrieval, or where a downstream task requires sending the embeddings through a network. In addition, we suspect that designing connective patterns between the multi-head attention components and the final [CLS]-token vectors that can take advantage of sparsity in the vector output may be useful for pruning-aware training.
As a field, natural language processing is currently under a deluge of transformer-based language representation models, where each one seems to be slightly better than all that came before. A key dimension of future work would be to run the same experiments on these new systems. We speculate that since their architectures are very similar, results are likely to be close.
Finally, our study exclusively concerned standard classification tasks. We intend to adapt this study to a wider range of tasks to determine if such strong evidence of over-parametrization exists across language understanding benchmarks.
The authors would like to acknowledge useful conversations with Ian Lane, Jungsuk Kim, Osman Başkaya, Richie Feng, Alfredo Láinez, Warren Green, Umair Akeel, and the entire Twilio AI team.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §1, §1, §2.
- Visualizing and understanding the effectiveness of bert. arXiv preprint arXiv:1908.05620. Cited by: §1, §1.
- Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §2.
- TinyBERT: distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351. Cited by: §1.
- Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §1.
- Revealing the dark secrets of bert. arXiv preprint arXiv:1908.08593. Cited by: §1.
- Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §1.
- Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101. Cited by: §2.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pp. 142–150. Cited by: §2.
- Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444. Cited by: §3.2.1, §3.2.
- Language models are unsupervised multitask learners. Cited by: §1.
- DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §1, §4.
- Q-bert: hessian based ultra low precision quantization of bert. arXiv preprint arXiv:1909.05840. Cited by: §1.
- Megatron-lm: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §1.
- Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
- How to fine-tune bert for text classification?. arXiv preprint arXiv:1905.05583. Cited by: §2.
- Cross-lingual bert transformation for zero-shot dependency parsing. arXiv preprint arXiv:1909.06775. Cited by: §1.
- XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
- Q8BERT: quantized 8bit bert. arXiv preprint arXiv:1910.06188. Cited by: §1.
- Identity crisis: memorization and generalization under extreme overparameterization. arXiv preprint arXiv:1902.04698. Cited by: §1.
- Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §2.