Transformer-based pretrained language models are a battle-tested solution to a plethora of natural language processing tasks. In this paradigm, a transformer-based language model is first trained on copious amounts of text, then fine-tuned on task-specific data. BERT Devlin et al. (2019), XLNet Yang et al. (2019), and RoBERTa Liu et al. (2019) are some of the most well-known models, representing the current state of the art in natural language inference, question answering, and sentiment classification, to name a few. These models are extremely expressive, consisting of at least a hundred million parameters, a hundred attention heads, and a dozen layers.
An emerging line of work questions the need for such a parameter-loaded model, especially on a single downstream task. Michel et al. (2019), for example, note that only a few attention heads need to be retained in each layer for acceptable effectiveness. Kovaleva et al. (2019) find that, on many tasks, just the last few layers change the most after the fine-tuning process. We take these observations as evidence that only the last few layers necessarily need to be fine-tuned.
The central objective of our paper is, then, to determine how many of the last layers actually need fine-tuning. Why is this an important subject of study? Pragmatically, a reasonable cutoff point saves computational memory when fine-tuning across multiple tasks, which bolsters the effectiveness of existing parameter-saving methods Houlsby et al. (2019). Pedagogically, understanding the relationship between the number of fine-tuned layers and the resulting model quality may guide future work in modeling.
Our research contribution is a comprehensive evaluation, across multiple pretrained transformers and datasets, of the number of final layers that need fine-tuning. We show that, on most tasks, we need to fine-tune only a fourth of the final layers to reach within 10% of full-model quality. Surprisingly, on SST-2, a sentiment classification dataset, we find that not fine-tuning all of the layers leads to improved quality.
2 Background and Related Work
2.1 Pretrained Language Models
In the pretrained language modeling paradigm, a language model (LM) is trained on vast amounts of text, then fine-tuned on a specific downstream task. Peters et al. (2018) are among the first to successfully apply this idea, outperforming the previous state of the art in question answering, textual entailment, and sentiment classification. Their model, dubbed ELMo, comprises a two-layer BiLSTM pretrained on the Billion Word Corpus Chelba et al. (2014).
Furthering this approach with more data and improved modeling, Devlin et al. (2019) pretrain deep 12- and 24-layer bidirectional transformers Vaswani et al. (2017) on the entirety of Wikipedia and BooksCorpus Zhu et al. (2015). Their approach, called BERT, achieves state of the art across all tasks in the General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2018), as well as the Stanford Question Answering Dataset (Rajpurkar et al., 2016).
As a result of this development, a flurry of recent papers has followed this more-data-plus-better-models principle. Two prominent examples include XLNet Yang et al. (2019) and RoBERTa Liu et al. (2019), both of which contest the present state of the art. XLNet proposes to pretrain two-stream attention-augmented transformers on an autoregressive LM objective, instead of the original cloze and next sentence prediction (NSP) tasks from BERT. RoBERTa primarily argues for pretraining longer, using more data, and removing the NSP task from BERT.
2.2 Layerwise Interpretability
The prevailing evidence in the neural network literature suggests that earlier layers extract universal features, while later ones perform task-specific modeling. Zeiler and Fergus (2014) visualize the per-layer activations in image classification networks, finding that the first few layers function as corner and edge detectors, and the final layers as class-specific feature extractors. Gatys et al. (2016) demonstrate that the low- and high-level notions of content and style are separable in convolutional neural networks, with lower layers capturing content and higher layers style.
Pretrained transformers. In the NLP literature, similar observations have been made for pretrained language models. Clark et al. (2019) analyze BERT’s attention and observe that the bottom layers attend broadly, while the top layers capture linguistic syntax. Kovaleva et al. (2019) find that the last few layers of BERT change the most after task-specific fine-tuning. Similar to our work, Houlsby et al. (2019) fine-tune the top layers of BERT, as part of the baseline comparison for their parameter-efficient adapter approach. However, none of these studies comprehensively examines the number of necessary final layers across multiple pretrained transformers and datasets.
3 Experimental Setup
We conduct our experiments on NVIDIA Tesla V100 GPUs with CUDA v10.1. We run the models from the Transformers library (v2.1.1; Wolf et al., 2019) using PyTorch v1.2.0.
3.1 Models and Datasets
We choose BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019) as the subjects of our study, since they represent the state of the art and share the same architecture. XLNet Yang et al. (2019) is another alternative; however, it uses a slightly different attention structure, and our preliminary experiments encountered reproducibility difficulties with the Transformers library. Each model has base and large variants that contain 12 and 24 layers, respectively. We denote them by appending the variant name as a subscript to the model name.
| Model | Embeddings | Per layer | Output | Total |
|---|---|---|---|---|
| BERT-base | 24M (22%) | 7M (7%) | 0.6M (0.5%) | 110M |
| RoBERTa-base | 39M (31%) | 7M (6%) | 0.6M (0.5%) | 125M |
| BERT-large | 32M (10%) | 13M (4%) | 1M (0.3%) | 335M |
| RoBERTa-large | 52M (15%) | 13M (4%) | 1M (0.3%) | 355M |
Within each variant, the two models display slight variability in parameter count—110 and 125 million in the base variant, and 335 and 355 million in the large one. These differences are mostly attributed to RoBERTa using many more embedding parameters—about 63% more for both variants. For in-depth, layerwise statistics, see Table 1.
For our datasets, we use the GLUE benchmark, which comprises tasks in natural language inference, sentiment classification, linguistic acceptability, and semantic similarity. Specifically, for natural language inference (NLI), it provides the Multi-Genre NLI (MNLI; Williams et al., 2018), Question NLI (QNLI; Wang et al., 2018), Recognizing Textual Entailment (RTE; Bentivogli et al., 2009), and Winograd NLI (WNLI; Levesque et al., 2012) datasets. For semantic textual similarity and paraphrasing, it contains the Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), and Quora Question Pairs (QQP; Iyer et al.). Finally, its single-sentence tasks consist of the binary-polarity Stanford Sentiment Treebank (SST-2; Socher et al., 2013) and the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018).
3.2 Fine-Tuning Procedure
Our fine-tuning procedure closely resembles those of BERT and RoBERTa. We choose the Adam optimizer Kingma and Ba (2014) with a batch size of 16 and fine-tune BERT for 3 epochs and RoBERTa for 10, following the original papers. For hyperparameter tuning, the best learning rate differs for each task, and each set of original authors chooses one within a small interval; thus, we perform a line search over that interval with a fixed step size. We report the best results in Table 2.
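The line search itself is straightforward to sketch. The helper below is illustrative only—the interval endpoints and step size in the example are placeholders, not the values used in our experiments—and simply evaluates a dev-set metric at evenly spaced learning rates, keeping the best.

```python
def line_search(evaluate, lo, hi, step):
    """Evaluate a scoring function at lr = lo, lo + step, ..., up to hi,
    and return the best (learning_rate, score) pair."""
    best_lr, best_score = None, float("-inf")
    lr = lo
    while lr <= hi + 1e-12:  # small epsilon guards against float drift
        score = evaluate(lr)
        if score > best_score:
            best_lr, best_score = lr, score
        lr += step
    return best_lr, best_score
```

In practice, `evaluate` would fine-tune the model at the given learning rate and return its development-set score.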
On each model, we freeze the embeddings and the weights of the first N layers, then fine-tune the rest using the best hyperparameters of the full model. Specifically, if L is the number of layers, we explore a range of values of N between 0 and L. Due to computational limitations, we set half as the cutoff point for the exhaustive per-layer sweep. Additionally, we restrict our comprehensive all-datasets exploration to the base variant of BERT, since the large model variants and RoBERTa are much more computationally intensive. On the smaller CoLA, SST-2, MRPC, and STS-B datasets, we comprehensively evaluate both models. These choices do not substantially affect our analysis.
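As a concrete sketch of this freezing scheme, the predicate below decides, from a parameter's name, whether it belongs to the embeddings or to one of the first N transformer layers. The name patterns follow the Hugging Face BERT convention (`embeddings.*`, `encoder.layer.<i>.*`); they are an assumption for illustration, not our actual code.

```python
import re

def frozen(param_name: str, n_frozen_layers: int) -> bool:
    """Return True if this parameter should be frozen (not fine-tuned)."""
    if param_name.startswith("embeddings."):
        return True  # embeddings are always frozen in this scheme
    match = re.match(r"encoder\.layer\.(\d+)\.", param_name)
    if match is not None:
        return int(match.group(1)) < n_frozen_layers
    return False  # pooler and task-specific layers stay trainable
```

With a real model, one would then set `param.requires_grad = not frozen(name, n)` for each `(name, param)` pair yielded by `model.named_parameters()`.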
4.1 Operating Points
We examine three operating points: two extreme operating points and an intermediate one. The extremes are self-explanatory, indicating fine-tuning all or none of the nonoutput layers. The intermediate point denotes the number of layers necessary to reach at least 90% of the full model quality, excluding CoLA, which is an outlier.
From the reported results in Tables 3–5, fine-tuning only the output layer and the task-specific layers is insufficient for all tasks—see the rows corresponding to 0, 12, and 24 frozen layers. However, we find that fine-tuning the first half of the model is unnecessary; the base models, for example, need fine-tuning of only 3–5 layers out of the 12 to reach 90% of the original quality—see Table 4, middle subrow of each row group. Similarly, fine-tuning only a fourth of the layers is sufficient for the large models (see Table 5): only 6 layers out of 24 for BERT and 7 for RoBERTa.
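The 90% criterion can be made precise with a small helper: given dev scores keyed by the number of fine-tuned final layers, it returns the fewest layers whose score reaches the threshold relative to the full model. This is a hypothetical reimplementation of the selection rule, not code from our experiments.

```python
def min_layers_for_parity(scores: dict, threshold: float = 0.9) -> int:
    """scores maps a count of fine-tuned final layers to a dev metric.
    Return the smallest count reaching threshold * full-model score."""
    full_score = scores[max(scores)]  # all layers fine-tuned
    for n in sorted(scores):
        if scores[n] >= threshold * full_score:
            return n
    return max(scores)  # fallback: only the full model qualifies
```

For example, with base-model scores at 0, 3, 6, and 12 fine-tuned layers, the helper picks the smallest count clearing 90% of the 12-layer score.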
4.2 Per-Layer Study
In Figure 1, we examine how the relative quality changes with the number of frozen layers. To compute a relative score, we subtract each frozen model’s results from its corresponding full model. The relative score aligns the two baselines at zero, allowing the fair comparison of the transformers. The graphs report the average of five trials to reduce the effects of outliers.
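Following the description above, the relative score is simply each setting's trial average subtracted from the full model's average. A minimal sketch, with an assumed input layout:

```python
from statistics import mean

def relative_scores(frozen_trials: dict, full_trials: list) -> dict:
    """frozen_trials maps a frozen-layer count to its list of trial scores;
    full_trials holds the full model's trial scores. Subtracting each
    setting's average from the full-model average puts the baseline at zero."""
    baseline = mean(full_trials)
    return {n: baseline - mean(trials) for n, trials in frozen_trials.items()}
```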
When every component except the output layer and the task-specific layer is frozen, the fine-tuned model achieves only 64% of the original quality, on average. As more layers are fine-tuned, the model effectiveness often improves drastically—see CoLA and STS-B, the first and fourth vertical pairs of subfigures from the left. This demonstrates that the gains do not decompose additively with respect to the number of frozen initial layers. Fine-tuning subsequent layers shows diminishing returns, with every model rapidly approaching the baseline quality once half of the network is fine-tuned; hence, we believe that half is a reasonable cutoff point for characterizing the models.
Finally, for the large variants of BERT and RoBERTa on SST-2 (second subfigure from both the top and the left), we observe a surprisingly consistent increase in quality when freezing 12–16 layers. This finding suggests that these models may be overparameterized for SST-2.
5 Conclusions and Future Work
In this paper, we present a comprehensive evaluation of the number of final layers that need to be fine-tuned for pretrained transformer-based language models. We find that only a fourth of the layers necessarily need to be fine-tuned to obtain 90% of the original quality. One line of future work is to conduct a similar, more fine-grained analysis on the contributions of the attention heads.
This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and enabled by computational resources provided by Compute Ontario and Compute Canada.
- Bentivogli et al. (2009) Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC 2009 Workshop.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation.
- Chelba et al. (2014) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
- Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. arXiv:1906.04341.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Dolan and Brockett (2005) William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing.
- Gatys et al. (2016) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning.
- Iyer et al. Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
- Kovaleva et al. (2019) Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv:1908.08593.
- Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv:1905.10650.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.
- Warstadt et al. (2018) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv:1805.12471.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv:1910.03771.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237.
- Zeiler and Fergus (2014) Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.