How fine can fine-tuning be? Learning efficient language models

04/24/2020 · Evani Radiya-Dixit et al. (Stanford University)

State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning only introduces small differences from the pre-trained model in the parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it suffices to fine-tune only the most critical layers. Further, we find that there are surprisingly many good solutions in the set of sparsified versions of the pre-trained model. As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero, saving both task-specific parameter storage and computational cost.


1 Introduction

Modern deep neural networks operate in a regime where the generalization gap diminishes with growing model capacity, defying the classical bias-variance tradeoff (Belkin et al., 2018). Increased model capacity consistently leads to better generalization, a trend especially prominent in the natural language understanding domain. From BERT (340M parameters, Devlin et al., 2018), to GPT-2 (1.5B parameters, Radford et al., 2019), and to Megatron-LM (8.3B parameters, Shoeybi et al., 2019), state-of-the-art performance on language comprehension tasks keeps improving with larger model capacity.

These huge language models are first pre-trained on large text corpora. Pre-training is a learning procedure, often unsupervised, that yields a good common initialization for further supervised learning of various downstream tasks. This further learning, called fine-tuning, is an additional optimization of model parameters jointly with a very small number of extra task-specific parameters (e.g. Table 1).

Though larger models generalize better, they are more expensive computationally, and the costs grow with the number of tasks learned. The high computational cost is usually addressed by network compression techniques that produce compact models for efficient inference (e.g. Zhao et al., 2019). To reduce task-specific cost, transfer learning and continual learning methods are useful for maximizing parameter sharing across tasks (e.g. Houlsby et al., 2019; Liu et al., 2019). In this work, we attempt to achieve these two desirable outcomes with a single effort. We use Bidirectional Encoder Representations from Transformers (BERT, Devlin et al., 2018) and the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) as a working example.

Our intuition comes from the observation that the amount of fine-tuning necessary to learn each task is very small (five orders of magnitude smaller than the dimensionality of the parameter space, Table 1). This is not surprising: the high quality of a pre-trained model should naturally lead to rather few iterations needed to fine-tune it to perform specific tasks. But importantly, such light fine-tuning might result in fine-tuned models hypothetically closer to the pre-trained one in parameter space (this vague notion of closeness, viz. separation by few gradient update steps in the parameter space, will be made explicit later in the text). This suggests a potentially high degree of computational redundancy across tasks that might be avoided at inference time.

Figure 1: An illustration of the $L_0$-closeness and sparsification constraints on a pre-trained parameter vector in a three-dimensional parameter space. Individual fine-tuning procedures for different tasks send the pre-trained parameters to distinct optimized parameters in a close neighborhood, of which every component is in principle subject to change. The $L_0$-closeness constraint (red, with saturation encoding closeness) forces optimization within those parameter configurations that share a certain fraction of components with the pre-trained parameters, i.e. having a small number of differing components. The sparsification constraint (blue, with saturation encoding density) further confines optimization to a discrete subset of $L_0$-close parameters, where all changed components have to be set to zero.

We first observe that the fine-tuned and pre-trained parameters are both $L_2$-close and angular-close in parameter space, consistent with the small number of fine-tuning iterations separating them. Despite this closeness, the fine-tuned parameters are not constrained to share any components with the pre-trained weights, and thus are equally expensive to store and to compute per iteration. We conjecture that good fine-tuned parameters also exist under efficient computational constraints, even though they might be more $L_2$-distant or angular-distant. To make fine-tuned models share parameters with the pre-trained one, we optimize parameters $L_0$-close to the pre-trained ones (Figure 1, red) by fine-tuning only the most sensitive layers (i.e. those most distant in parameter subspaces) of the network. Furthermore, we attempt to learn a task by sparsifying the pre-trained weights (Figure 1, blue). Surprisingly, our results reveal an abundance of good task-specific parameter configurations within sparsified versions of pre-trained models: a specific task can be learned simply by masking an appropriate fraction of the pre-trained weights to zero, over a wide range of sparsity levels.

A major contribution of the present work is the demonstration that fine-tuning can be realized by sparsification, which has favorable practical implications. By forcing fine-tuned parameters to be $L_0$-close to the pre-trained ones, one only needs to store a small number of different weights per task, in addition to the common pre-trained weights, substantially saving parameter memory when there are many tasks to perform. By forcing fine-tuned parameters to be sparse, one potentially saves both memory and compute, because each task only requires a binary mask on top of the common pre-trained parameters, and sparse linear algebraic operations could be used instead of dense ones.
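The storage implication can be made concrete with back-of-the-envelope arithmetic. The sketch below (plain Python) compares the per-task cost of a full float32 copy of the 109M parameters, an $L_0$-close diff under an assumed 10% changed-component budget (the 10% figure is purely illustrative, not a result of this work), and a one-bit-per-parameter binary mask.

```python
# Back-of-the-envelope per-task storage for a 109M-parameter model.
# The 10% changed-component fraction is an illustrative assumption.
N_PARAMS = 109_000_000
BYTES_PER_FLOAT32 = 4

full_copy_mb = N_PARAMS * BYTES_PER_FLOAT32 / 1e6        # independent fine-tuned copy
changed_fraction = 0.10                                   # assumed L0-close budget
# store (index, value) pairs for changed entries: 4-byte index + 4-byte value
l0_close_mb = N_PARAMS * changed_fraction * (4 + 4) / 1e6
binary_mask_mb = N_PARAMS / 8 / 1e6                       # 1 bit per parameter

print(f"full float32 copy     : {full_copy_mb:8.1f} MB/task")
print(f"L0-close diff (10%)   : {l0_close_mb:8.1f} MB/task")
print(f"binary supermask      : {binary_mask_mb:8.1f} MB/task")
```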

2 Related work

A large body of literature is concerned with the sparsification of large networks for efficient inference (e.g. Zhu and Gupta, 2017). Our search for $L_0$-close fine-tuning solutions is motivated by the observation that the sensitivity of the optimization objective to different layers in a network is highly variable (Zhang et al., 2019). Zhou et al. (2019) trained sparse connectivity patterns, termed supermasks, over randomly initialized network parameters, suggesting that sparsification plays a role similar and complementary to gradient-based learning of the objective. This is also related to network architecture search (NAS).

The most similar study to ours is piggyback and its variants (Mallya et al., 2018; Mancini et al., 2019), where in a multi-task visual object classification scenario, the authors trained task-specific binary masks on top of a shared set of pre-trained parameters. In this work, we not only applied similar techniques to larger pre-trained language models, but also studied the sparsity-accuracy tradeoff in a systematic way. Houlsby et al. (2019) added adapter modules to pre-trained language models, achieving parameter sharing across multiple tasks, but not reducing the computational cost of the resultant fine-tuned networks. Also, note that randomly generated high-dimensional masks can also support multi-task learning, e.g. Cheung et al. (2019).

To impose parameter sparseness differentiably in combination with the $L_0$-closeness constraint, instead of principled approaches to $L_0$-regularization (Louizos et al., 2017), we used the simpler straight-through estimator, much like binary quantization techniques (Courbariaux et al., 2015; Courbariaux and Bengio, 2016); note that this is also used by Mallya et al. (2018) and Zhou et al. (2019).

3 Methods

GLUE Task MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE
Additional parameter count
Fine-tuning iteration count
Table 1: Task-specific model information of BERT_BASE (total parameter count 109M).

3.1 Notations and model architecture

Consider a pre-trained network $f_{\theta}$ that transforms an input sequence into a good representation. It is parameterized by $\theta$, noted as a subscript for convenience. The fine-tuning procedure to perform a task $t \in \mathcal{T}$ ($\mathcal{T}$ being the set of tasks) can be described as a supervised training procedure of the model $g^{(t)} \circ f_{\theta}$ on the fine-tuning set of task $t$. Here $g^{(t)}$ is a task-specific last layer unique to task $t$ and is parameterized by $\phi^{(t)}$; $\circ$ denotes function composition.

In the case of BERT, we have a stack of modules

$$f_{\theta} = f_{\text{pool}} \circ f_L \circ \cdots \circ f_1 \circ f_{\text{emb}}, \qquad (1)$$

among which $f_{\text{emb}}$ is the embedding layer, $f_{\text{pool}}$ a final pooling layer, and each $f_l$ is a transformer block

$$f_l(x) = \mathrm{LN}\!\left(a + \mathrm{Drop}\!\left(W_O^{(l)}\,\mathrm{GELU}\!\left(W_I^{(l)} a\right)\right)\right), \quad a = \mathrm{LN}\!\left(x + \mathrm{Drop}\!\left(W_D^{(l)}\,\mathrm{Attn}\!\left(W_Q^{(l)} x,\, W_K^{(l)} x,\, W_V^{(l)} x\right)\right)\right), \qquad (2)$$

where $\theta_l = \{W_Q^{(l)}, W_K^{(l)}, W_V^{(l)}, W_D^{(l)}, W_I^{(l)}, W_O^{(l)}\}$ collects all the learnable parameter matrices in the block. $W_Q$, $W_K$, $W_V$, and $W_D$ are the query, key, value, and dense self-attention projection matrices, respectively. $W_I$ and $W_O$ are the intermediate and output feed-forward matrices, respectively. $\mathrm{Attn}$ represents multi-head scaled dot-product attention (Vaswani et al., 2017), $\mathrm{Drop}$ dropout, $\mathrm{LN}$ layer normalization, and $\mathrm{GELU}$ the Gaussian error linear unit activation function (Hendrycks and Gimpel, 2016). We experimented with the BERT_BASE model (Devlin et al., 2018), for which $L = 12$, with a total parameter count of 109M (pre-trained parameters obtained from https://github.com/google-research/bert). See Table 1 for additional task-specific parameter counts, all 5 orders of magnitude smaller than the total parameter count. Optimization of them alone fails to fine-tune (see Appendix A).
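To make the block structure concrete, below is a minimal PyTorch sketch in the spirit of Eq. (2), reduced to single-head attention and omitting dropout for brevity; the hidden and intermediate sizes (768 and 3072) follow the BERT_BASE configuration, but the module is illustrative rather than a drop-in reimplementation of the pre-trained model.

```python
import math
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative single-head version of Eq. (2): an attention sublayer and a GELU
    feed-forward sublayer, each followed by a residual connection and layer normalization."""

    def __init__(self, hidden=768, intermediate=3072):
        super().__init__()
        self.w_q = nn.Linear(hidden, hidden)        # query projection
        self.w_k = nn.Linear(hidden, hidden)        # key projection
        self.w_v = nn.Linear(hidden, hidden)        # value projection
        self.w_d = nn.Linear(hidden, hidden)        # dense self-attention projection
        self.w_i = nn.Linear(hidden, intermediate)  # intermediate feed-forward matrix
        self.w_o = nn.Linear(intermediate, hidden)  # output feed-forward matrix
        self.ln_attn = nn.LayerNorm(hidden)
        self.ln_ffn = nn.LayerNorm(hidden)

    def forward(self, x):                           # x: (batch, seq_len, hidden)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = torch.softmax(scores, dim=-1) @ v
        a = self.ln_attn(x + self.w_d(attn))        # attention sublayer with residual
        h = self.w_o(nn.functional.gelu(self.w_i(a)))
        return self.ln_ffn(a + h)                   # feed-forward sublayer with residual

x = torch.randn(2, 16, 768)
print(TransformerBlock()(x).shape)                  # torch.Size([2, 16, 768])
```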

3.2 GLUE benchmark

The GLUE benchmark is a collection of diverse natural language understanding tasks (Wang et al., 2018): CoLA (Warstadt et al., 2018), SST (Socher et al., 2013), MRPC (Dolan and Brockett, 2005), STS (Cer et al., 2018), QQP (Shankar Iyer et al., 2017), MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). We exclude the problematic WNLI set (Levesque et al., 2012); see item (12) in https://gluebenchmark.com/faq. We fine-tune on these tasks and report performance on the evaluation sets: F1 for QQP and MRPC, Matthews correlation for CoLA, Pearson and Spearman correlation for STS-B, and accuracy for all other tasks.

3.3 Constrained fine-tuning procedures

For all fine-tuning procedures, we use the exact hyperparameters described in the original paper (Devlin et al., 2018) unless specified otherwise, with the additional constraints described below. No constraints are imposed on the task-specific last layers $g^{(t)}$.

$L_0$-close fine-tuning

To search for fine-tuned solutions that are $L_0$-close to the pre-trained parameters, we selectively fix the least sensitive parameter matrices at their pre-trained values and perform the fine-tuning optimization in a lower-dimensional parameter space.
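A minimal sketch of how such selective freezing can be implemented in PyTorch follows; the parameter-name patterns and the toy module are hypothetical placeholders, since actual names depend on the BERT implementation in use.

```python
import torch.nn as nn

def freeze_for_l0_close_finetuning(model: nn.Module, frozen_patterns):
    """Fix selected parameter matrices at their pre-trained values by disabling their
    gradients, so fine-tuning optimizes only the remaining (most sensitive) layers."""
    frozen = trainable = 0
    for name, param in model.named_parameters():
        if any(pat in name for pat in frozen_patterns):
            param.requires_grad = False
            frozen += param.numel()
        else:
            trainable += param.numel()
    print(f"frozen {frozen:,} parameters; fine-tuning {trainable:,}")
    return model

# Toy stand-in for a pre-trained encoder; real parameter names depend on the implementation.
toy = nn.ModuleDict({
    "word_embeddings": nn.Embedding(30522, 768),
    "w_k": nn.Linear(768, 768),   # key projection (candidate for freezing)
    "w_v": nn.Linear(768, 768),   # value projection (kept trainable)
})
freeze_for_l0_close_finetuning(toy, frozen_patterns=("w_k", "word_embeddings"))
```

The optimizer is then constructed only over parameters with requires_grad set, so the frozen matrices remain exactly at their pre-trained values.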

Sparsification as fine-tuning

In order to search for fine-tuned networks that are both sparse and $L_0$-close to the pre-trained one, we reparameterize the model by a multiplicative binary mask

$$\theta = \theta_{\text{pre}} \odot m, \qquad m \in \{0, 1\}^D, \qquad (3)$$

where $\theta_{\text{pre}}$ is the pre-trained parameters, $m$ the mask, $D$ the dimensionality of the parameter space, and $\odot$ the Hadamard product.

If learning is purely through optimizing the mask $m$ while holding $\theta_{\text{pre}}$ constant, the mask is called a supermask (Zhou et al., 2019). Since $m$ is discrete-valued and thus not differentiable, we reparameterize it as

$$m = \mathrm{Bern}\!\left(\sigma(\nu)\right), \qquad (4)$$

where $\mathrm{Bern}(p)$ denotes an element-wise independent Bernoulli sampler with probability $p$, and $\sigma$ the sigmoid function, applied element-wise on $\nu \in \mathbb{R}^D$, the continuous mask parameter that is task-specific. We treat gradient backpropagation through the Bernoulli sampler as a straight-through estimator, similar to the techniques used in Mallya et al. (2018); Zhou et al. (2019). The same fine-tuning hyperparameters as described in Devlin et al. (2018) were used, except for the learning rate (see Appendix B).

Control over the final sparsity (defined as the fraction of zero components, equal to one minus density) is exerted through the initialization of $\nu$ for fine-tuning. We initialize $\nu$ according to a soft magnitude-based pruning mask: a chosen fraction of small-magnitude weights have their mask logits initialized to a low value and the rest to a high value. We found that the initial sparsity directly controls the final sparsity (see Appendix C), allowing us to produce masks across a wide range of sparsity levels.
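The reparameterization of Eqs. (3) and (4) with a straight-through estimator can be sketched in PyTorch as follows; the particular straight-through variant (passing the gradient of the Bernoulli sample through the sigmoid probability) and the initialization constant are illustrative choices rather than a prescription.

```python
import torch
import torch.nn as nn

class SupermaskLinear(nn.Module):
    """Linear layer whose frozen pre-trained weight is gated by a learned binary mask,
    in the spirit of Eqs. (3)-(4). The Bernoulli sample is non-differentiable, so gradients
    reach the mask logits nu via a straight-through estimator."""

    def __init__(self, weight, bias, init_logit=2.0):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)  # theta_pre, shared across tasks
        self.bias = nn.Parameter(bias, requires_grad=False)
        # nu: task-specific continuous mask parameter; init_logit sets the initial keep probability
        self.nu = nn.Parameter(torch.full_like(weight, init_logit))

    def forward(self, x):
        p = torch.sigmoid(self.nu)
        m = torch.bernoulli(p)                 # binary mask sample
        m = m + p - p.detach()                 # straight-through: gradient flows through p
        return nn.functional.linear(x, self.weight * m, self.bias)

lin = nn.Linear(768, 768)
masked = SupermaskLinear(lin.weight.data.clone(), lin.bias.data.clone())
loss = masked(torch.randn(4, 768)).sum()
loss.backward()
print(masked.nu.grad.abs().mean())             # gradients reach the mask logits; weights stay fixed
```

Initializing the logits from a soft magnitude-based pruning mask then amounts to setting $\nu$ to a low value for a chosen fraction of the smallest-magnitude entries and a high value for the rest.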

4 Experimental results

Table 2: $L_2$ and angular distances between fine-tuned and pre-trained parameters, compared against the expected values between two independent random initializations (uniformly or normally distributed, with scale set by the hidden dimension $H$), as well as those between the pre-trained parameters and a random initialization. Statistics presented in the rightmost column are across all GLUE tasks.
Figure 2: $L_2$ and angular distances in parameter subspaces between pre-trained and fine-tuned weights. Shown are metrics across the 12 encoder stack layers for the self-attention projection matrices ($W_Q$, $W_K$, $W_V$, and $W_D$) and the feed-forward matrices ($W_I$ and $W_O$). The results presented here are for MNLI fine-tuning, but similar patterns are observed across all GLUE tasks.
Layers excluded from fine-tuning | Task-specific parameter storage | F1 score
None (baseline) | 109M float |
(1) Key projection layers in self-attention | 102M float |
(2) Deepest encoder stack layers | 95M float |
(3) Word embedding layer | 86M float |
(1), (2), and (3) | 66M float |
(1), (2), and (3) with sparse fine-tuning | 66M binary |
(1), (2), and (3) with sparse fine-tuning | 66M binary |

Table 3: $L_0$-close fine-tuning results: layers excluded from fine-tuning, the corresponding task-specific parameter storage, and the fine-tuning performance on the MRPC task (F1 score); other GLUE tasks show similar patterns. We report the mean and standard deviation across 10 independent runs.

4.1 Fine-tuned and pre-trained parameters are $L_2$-close and angular-close

We observe that the original fine-tuning procedures for the GLUE tasks all take a number of parameter update steps (Table 1) that is negligible compared to the dimensionality of the parameter space. Thus, we first asked whether fine-tuned parameters are indeed close to the pre-trained ones in parameter space. We measured the $L_2$-distances, i.e. the $L_2$-norm of the parameter difference, and the angular distances (Table 2). Specifically, we inspect the weight matrices in all self-attention layers, of size $H \times H$ where $H$ is the hidden state dimension. We report the minimum and maximum values across GLUE tasks: RTE showed the smallest values, and MNLI the largest. Evidently, we see significantly higher $L_2$- and angular-closeness between fine-tuned and pre-trained parameters than the expected distance between two independent random initializations, or that between an initialization and the pre-trained parameters. This confirms that, during the course of fine-tuning, the very few parameter updates traverse a very short distance in the parameter space. Comparing the parameter distance across GLUE tasks, we find that it scales with the number of fine-tuning iterations (see Appendix D).
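The two distance measures can be computed per parameter matrix as in the sketch below; whether angular distance is reported in radians or further normalized is an implementation choice, so the function is indicative rather than the exact evaluation code.

```python
import torch

def l2_and_angular_distance(w_finetuned: torch.Tensor, w_pretrained: torch.Tensor):
    """Distances between two parameter (sub)vectors: L2 norm of the difference, and the
    angle between them (arccos of cosine similarity), as used for Table 2 and Figure 2."""
    a, b = w_finetuned.flatten(), w_pretrained.flatten()
    l2 = torch.norm(a - b).item()
    cos = torch.dot(a, b) / (torch.norm(a) * torch.norm(b))
    angular = torch.acos(cos.clamp(-1.0, 1.0)).item()   # radians
    return l2, angular

# Illustrative check: a tiny random perturbation stays both L2- and angular-close.
w0 = torch.randn(768, 768)
w1 = w0 + 1e-3 * torch.randn_like(w0)
print(l2_and_angular_distance(w1, w0))
```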

Further, we inspect the closeness in parameter subspaces for each layer. We find that, though all layers change very little during fine-tuning, there is nevertheless a high degree of variability across different parameter matrices (Figure 2). Blocks deeper in the encoder stack are less $L_2$-close but more angular-close than shallower ones. In all self-attention modules, the value and dense projection matrices ($W_V$ and $W_D$) change considerably more than the query and key projection matrices ($W_Q$ and $W_K$) during fine-tuning.

4.2 $L_0$-close fine-tuning

Inspired by the high degree of variability in each layer's parameter change during fine-tuning, we ask whether effective fine-tuning can be achieved by optimizing only a fraction of layers while keeping the others fixed at their pre-trained values, resulting in fine-tuned models that are $L_0$-close in parameter space.

Our results suggest this is indeed feasible (Table 3). Informed by the different layers' sensitivity to fine-tuning, we performed fine-tuning experiments by progressively excluding: (1) the key projection layers in self-attention across all encoder stack layers, (2) the penultimate and ultimate encoder stacks, and (3) the word embedding layer. Each of these exclusions on its own, or all three combined, does not substantially degrade performance, while reducing the number of parameters to fine-tune by about 40% (from 109M to 66M).

Figure 3: Performance of supermask fine-tuned models across GLUE tasks. We show the mean of performance metrics across 10 independent Bernoulli sampling procedures. Note the baseline performance for each task marked by the leftmost end of each curve.
Figure 4: Supermask sparsity levels across layers. Shown is the low-sparsity MNLI supermask; similar patterns are observed across all GLUE tasks.

4.3 Sparsification as fine-tuning

Encouraged by these results, we ask whether more aggressive constraints can be imposed on the fine-tuning process to further reduce computational cost. Though $L_0$-close fine-tuning obviates optimization of a substantial fraction of parameters, avoiding full storage of all parameters for each fine-tuned task, all operations still need to be performed at inference time. In order to reduce operations, we seek to sparsify parameters. This amounts to a search over a binary mask in a high-dimensional parameter space. We adopt supermask training (see Section 3) to this end.

Figure 3 shows fine-tuned model performance across GLUE tasks obtained by supermask training. The final sparsity level of the supermask is controlled by its initialization (see Section 3 and Appendix C). We note that there is little task performance degradation over a wide range of parameter sparsity levels, very close to the sparse networks produced by iterative pruning (Zhu and Gupta, 2017) but underperforming them at high sparsity levels (see Appendix E). Layer-wise sparsity levels of supermasks also demonstrate systematic trends (Figure 4): (1) across GLUE tasks, certain types of parameter matrices are consistently sparser than others, and (2) shallower encoder stack layers are sparser than deeper ones. Moreover, we show that supermask fine-tuning of only a fraction of sensitive layers can also achieve performance with little degradation from baseline (Table 3).

4.4 Many good, sparse fine-tuned supermasks exist, but for pre-trained parameters only

One surprising finding of this study is the frequent occurrence of good fine-tuned parameters among the sparsified configurations $\{\theta_{\text{pre}} \odot m : m \in \{0, 1\}^D\}$, viz. vertices of a $D$-dimensional hypercube scaled component-wise by the pre-trained values, even though most of them are quite distant from the pre-trained parameters by the $L_2$-metric.

First, for all GLUE tasks there exist highly sparse supermasks without remarkable performance degradation, and for some tasks even sparser ones (Figure 3, right end). Second, for any task, below this maximum sparsity we found good masks at any sparsity level (Figure 3), which can be controlled by the initialization of the supermask (see Appendix C). Finally, while it is natural that performance drops as the mask becomes extremely sparse (Figure 3, right end), it is rather counterintuitive that there exist good supermasks at the dense extreme (Figure 3, left end), since we observe that the pre-trained model with only the task-specific last layer fine-tuned utterly fails to learn any task (Appendix A). Noticeably, good supermasks selectively prune important weights of large magnitudes (Appendix F).

Figure 5: Low-sparsity supermask performance, i.e. task performance of supermasks initialized at 0% sparsity, compared against baseline.
GLUE Task MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE
Baseline
Supermask
Final sparsity
Table 4: Low-sparsity supermask performance. We report the sparsity levels achieved when the supermasks were initialized at 0% sparsity. For several tasks, fine-tuning is achieved with only a small fraction of the pre-trained weights pruned. For the supermask evaluation results, we include the mean and standard deviation of 10 Bernoulli samplings of a single run.
Figure 6: Fractions of overlap of zero elements in supermasks across GLUE tasks, compared to randomly generated masks. Each value in the grid shows the fraction of pruned elements in one task (horizontal axis) that are also pruned in the other (vertical axis). Here, we show low-sparsity supermasks (initialized at 0% sparsity) and compare the masks in the value layer of the first encoder, which is one of the sparsest layers in the entire model.

To understand this phenomenon better, we study the supermasks trained with an all-dense initialization (Figure 5). Surprisingly, these low-sparsity supermasks successfully learn to perform all the tasks without noticeable degradation from baseline. Essentially, complicated tasks like MNLI and QQP can be learned by clamping a modest fraction of the pre-trained weights to zero (see Appendix G for how model performance improves with sparsity), whereas for simple tasks like MRPC and RTE, setting only a very small fraction of the pre-trained weight entries to zero suffices (Table 4). Fine-tuning can indeed be very fine, suggesting relatively frequent occurrences of good solutions within a sparse $L_0$-neighborhood of the pre-trained parameters.

Moreover, we ask whether such frequent occurrences of good sparsified versions of the parameters is a property unique to the pre-trained weights. In other words, can one also obtain good supermasks on parameters that are not pre-trained? To answer this question, we perform supermask fine-tuning on pre-trained parameters with their components shuffled (thus norm-preserving). Performance degrades significantly; for the MRPC task, for instance, the F1 score drops substantially with shuffled pre-trained parameters. It is clear that one cannot obtain good masks this way, suggesting that having high-performance sparsified versions is unique to pre-trained parameters.
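The shuffling control can be reproduced with a few lines of PyTorch; the sketch below permutes entries within a single weight matrix (whether shuffling is applied per matrix or more globally is an implementation choice), which preserves the norm by construction.

```python
import torch

def shuffle_components(weight: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly permute the entries of a pre-trained weight matrix. The multiset of values
    (and hence every norm) is preserved, but the learned structure is destroyed."""
    g = torch.Generator().manual_seed(seed)
    flat = weight.flatten()
    perm = torch.randperm(flat.numel(), generator=g)
    return flat[perm].reshape(weight.shape)

w = torch.randn(768, 768)
w_shuffled = shuffle_components(w)
print(torch.allclose(w.norm(), w_shuffled.norm()))   # True: norm-preserving by construction
```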

4.5 Task-uniqueness of fine-tuned supermasks

Finally, we ask whether the supermasks learned to perform different tasks share commonalities. Specifically, we quantify the amount of overlapping zeros in learned supermasks across different tasks (Figure 6). It seems the overlaps are not substantially larger than what would have been caused by chance, suggesting that, even though there seem to be many good supermasks for each task, these masks are largely distinct from each other, each unique to the task it learns.

5 Discussion

One very puzzling fact about modern deep neural networks is that overparameterization helps both generalization and optimization. On the one hand, given an effective network architecture reflecting proper inductive biases, better generalizing models are always larger models (Hestness et al., 2017). On the other hand, sheer increases in the dimensionality of the parameter space seldom make stochastic gradient-based optimization more difficult: deeper and/or wider networks take about the same, if not a lower, number of training iterations to converge. For example, ResNet-18 (11.7M parameters) and ResNet-152 (60.2M parameters) both train to convergence, at similar convergence rates, in no more than 600K iterations on ImageNet (He et al., 2015). Thus, given adequate computing infrastructure, one always trains the largest possible model in order to obtain the best performance. This is perhaps most prominent in recent pre-trained huge language models (e.g. Devlin et al., 2018; Radford et al., 2019; Shoeybi et al., 2019) that have achieved state-of-the-art performance on language comprehension tasks. Similarly, fine-tuning larger pre-trained language models is just as easy as, if not easier than, fine-tuning smaller ones. Fine-tuning steps are usually five orders of magnitude fewer than the dimensionality of the parameter space (Table 1). A direct consequence is that, in the parameter space, fine-tuned networks do not deviate substantially from the pre-trained one, which we quantitatively establish in this study. Analogous to the contrast between the low generalization performance of small networks and the high compressibility of large networks in the case of ResNets (e.g. Zhu and Gupta, 2017; Frankle and Carbin, 2018), we are faced with the high generalization performance of large language models and the low level of dissimilarity before and after fine-tuning. Just as network compression can generate compact models for efficient inference, this parameter closeness can likewise be exploited to achieve efficient computation, which we demonstrate in this work.

We show that, owing to the surprisingly frequent occurrences of good parameter configurations both in a close $L_0$-neighborhood of and among sparsified versions of large pre-trained language models, two techniques are highly effective in producing efficient fine-tuned networks for specific language understanding tasks: (1) optimizing only the most sensitive layers and (2) learning to sparsify pre-trained parameters as fine-tuning. In contrast to commonly employed post-training sparsification methods, which always incur performance degradation, our procedure of sparsifying pre-trained networks (similar to Mallya et al., 2018; Mancini et al., 2019) is by itself an optimization process that learns specific tasks.

6 Acknowledgements

We thank Sofia Samaniego de la Fuente for help with the experiments. We also wish to thank Robert S. Schreiber, Urs Köster, Jorge Albericio, Natalia S. Vassilieva, and Marcel Nassar for discussions and feedback on the manuscript.

References

  • R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor (2006) The Second PASCAL Recognising Textual Entailment Challenge. External Links: Document, ISBN 3540334270, ISSN 03029743, Link Cited by: §3.2.
  • M. Belkin, D. Hsu, S. Ma, and S. Mandal (2018) Reconciling modern machine learning and the bias-variance trade-off. External Links: 1812.11118, Link Cited by: §1.
  • L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo (2009) The Sixth PASCAL Recognizing Textual Entailment Challenge. External Links: Link Cited by: §3.2.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2018) SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. pp. 1–14. External Links: Document, 1708.00055 Cited by: §3.2.
  • B. Cheung, A. Terekhov, Y. Chen, P. Agrawal, and B. Olshausen (2019) Superposition of many models into one. External Links: 1902.05522, Link Cited by: §2.
  • M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: Training Deep Neural Networks with binary weights during propagations. Nips, pp. 1–9. External Links: Document, 1511.00363, ISSN 10495258, Link Cited by: §2.
  • M. Courbariaux and Y. Bengio (2016) BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv, pp. 9. External Links: 1602.02830, ISBN 9781510829008, Link Cited by: §2.
  • I. Dagan, O. Glickman, and B. Magnini (2006) The PASCAL Recognising Textual Entailment Challenge. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 3944 LNAI, pp. 177–190. External Links: Document, ISBN 3540334270, ISSN 03029743 Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. External Links: 1810.04805, Link Cited by: §1, §1, §3.1, §3.3, §3.3, §5.
  • W. B. Dolan and C. Brockett (2005) Automatically Constructing a Corpus of Sentential Paraphrases. Proceedings of the Third International Workshop on Paraphrasing (IWP2005), pp. 9–16. External Links: Link Cited by: §3.2.
  • J. Frankle and M. Carbin (2018) The Lottery Ticket Hypothesis: Finding Small, Trainable Neural Networks. External Links: 1803.03635, Link Cited by: §5.
  • D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan (2007) The Third PASCAL Recognizing Textual Entailment Challenge. Technical report External Links: Link Cited by: §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. Arxiv.Org 7 (3), pp. 171–180. External Links: Document, 1512.03385, ISBN 978-1-4673-6964-0, ISSN 1664-1078, Link Cited by: §5.
  • D. Hendrycks and K. Gimpel (2016) Gaussian Error Linear Units (GELUs). External Links: 1606.08415, Link Cited by: §3.1.
  • J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, Md. M. A. Patwary, Y. Yang, and Y. Zhou (2017) Deep Learning Scaling is Predictable, Empirically. External Links: 1712.00409, Link Cited by: §5.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-Efficient Transfer Learning for NLP. External Links: 1902.00751, Link Cited by: §1, §2.
  • H. J. Levesque, E. Davis, and L. Morgenstern (2012) The Winograd Schema Challenge. Technical report External Links: Link Cited by: §3.2.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-Task Deep Neural Networks for Natural Language Understanding. External Links: 1901.11504, Link Cited by: §1.
  • C. Louizos, M. Welling, and D. P. Kingma (2017) Learning Sparse Neural Networks through $L_0$ Regularization. External Links: 1712.01312, Link Cited by: §2.
  • A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11208 LNCS, pp. 72–88. External Links: Document, 1801.06519, ISBN 9783030012243, ISSN 16113349, Link Cited by: Appendix C, §2, §2, §3.3, §5.
  • M. Mancini, E. Ricci, B. Caputo, and S. R. Bulò (2019) Adding new tasks to a single network with weight transformations using binary masks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11130 LNCS, pp. 180–189. External Links: Document, 1805.11119, ISBN 9783030110116, ISSN 16113349, Link Cited by: §2, §5.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language Models are Unsupervised Multitask Learners. Cited by: §1, §5.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings, pp. 2383–2392. External Links: Document, 1606.05250, ISBN 9781945626258 Cited by: §3.2.
  • Shankar Iyer, N. Dandekar, and K. Csernai (2017) First Quora Dataset Release: Question Pairs. External Links: Link Cited by: §3.2.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. External Links: 1909.08053, Link Cited by: §1, §5.
  • R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP 2013 - 2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 1631–1642. External Links: ISBN 9781937284978, Link Cited by: §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. (NIPS). External Links: 1706.03762, Link Cited by: §3.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. External Links: 1804.07461, Link Cited by: §1, §3.2.
  • A. Warstadt, A. Singh, and S. R. Bowman (2018) Neural Network Acceptability Judgments. External Links: 1805.12471, Link Cited by: §3.2.
  • A. Williams, N. Nangia, and S. Bowman (2018) A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. pp. 1112–1122. External Links: Document, 1704.05426 Cited by: §3.2.
  • C. Zhang, S. Bengio, and Y. Singer (2019) Are All Layers Created Equal?. External Links: 1902.01996, Link Cited by: Appendix B, Appendix C, §2.
  • S. Zhao, R. Gupta, Y. Song, and D. Zhou (2019) Extreme Language Model Compression with Optimal Subwords and Shared Projections. External Links: 1909.11687, Link Cited by: §1.
  • H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. External Links: 1905.01067, Link Cited by: §2, §2, §3.3.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. External Links: 1710.01878, Link Cited by: Appendix E, §2, §4.3, §5.


Appendix A Optimization of task-specific last layers alone fails to fine-tune

Optimization of only the task-specific layers does not lead to successful fine-tuning. For instance, for the MRPC task, freezing all parameter weights in the pre-trained model and optimizing the task-specific last layer alone yields a non-performing model: across 10 independent runs, the model consistently predicts a single class for the paraphrase classification task, a significant degradation compared to the baseline performance (Table 3). Thus, it is critical to fine-tune layers in the pre-trained model and not just the task-specific layers alone.

Appendix B Learning rate of supermask training

Supermask training requires a much larger learning rate than typical weight training (Zhang et al., 2019): the learning rate used for optimizing masks is substantially larger than the one used for optimizing weights. We notice a degradation in performance at smaller learning rates for supermask training (Table 5). This pattern holds across GLUE tasks.

Learning rate | F1 score
Table 5: MRPC low-sparsity supermask performance at a range of learning rates.

Appendix C Correlation between initial and final sparsities of supermasks

Previous reports of supermask training offer no straightforward control over the amount of weights pruned (Zhang et al., 2019; Mallya et al., 2018). We find that setting the initial sparsity through a soft magnitude-based pruning mask controls the final sparsity level, which we use to produce supermasks of varied sparsity levels. Figure 7 shows this correlation between initial and final sparsities of supermasks for different GLUE tasks. We note that, at lower initial sparsity levels, the supermask is pushed to a greater final sparsity, whereas at higher initial sparsity levels it is pushed to a lower final sparsity. This pattern is similar across GLUE tasks but is most prominent for MNLI, scaling with the number of fine-tuning steps (Table 1).

Figure 7: Initial versus final sparsity levels of supermasks.

Appendix D Correlation of parameter distance with fine-tuning steps

In order to understand how distance in parameter space increases as a function of fine-tuning steps, we study this relationship across GLUE tasks. We find that the parameter distance scales with the number of fine-tuning steps approximately as a power law (Figure 8).

Figure 8: Correlation of parameter distance with the number of fine-tuning iterations. Shown are angular distances. Each data point corresponds to a different GLUE task.

Appendix E Fine-tuning with iterative pruning

We also use iterative pruning (Zhu and Gupta, 2017) during fine-tuning to produce sparse models. Pruning is based on weight magnitudes in each layer and is performed periodically during fine-tuning, with sparsity gradually increasing from its initial level to a final level according to a cubic schedule.
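For reference, the cubic schedule of Zhu and Gupta (2017) can be written as in the sketch below; the step counts and the final sparsity target are illustrative values.

```python
def cubic_sparsity_schedule(step, begin_step, end_step, initial_sparsity=0.0, final_sparsity=0.9):
    """Gradual pruning schedule of Zhu and Gupta (2017): sparsity ramps from the initial to the
    final level as s_f + (s_i - s_f) * (1 - progress)**3, then stays constant."""
    if step < begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Example: sparsity targets at a few points of a 10k-step fine-tuning run.
for t in (0, 2500, 5000, 7500, 10000):
    print(t, round(cubic_sparsity_schedule(t, begin_step=0, end_step=10000), 3))
```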

Figure 9: Iterative pruning during fine-tuning. We plot the evaluation performance across a range of sparsity levels for the GLUE tasks. Note the baseline performance for each task marked by the leftmost end of each curve.

Iterative pruning during fine-tuning (Figure 9) outperforms supermask training (Figure 3) at higher sparsity levels: while supermask training remains successful only up to moderate sparsity, iterative pruning reaches considerably higher sparsity, for some tasks higher still, without significant performance degradation. Though iterative pruning produces sparse models, the fine-tuned models do not share parameters: one still needs to store all parameters for each task. Fine-tuned supermasks, on the other hand, store only a binary mask over certain layers for each task, with all tasks sharing the same set of underlying pre-trained weights.

Figure 10: Pruned weight distributions, compared between supermask and magnitude-based pruning. Shown for the RTE and MNLI fine-tuning tasks.

Appendix F Fine-tuned supermasks are not trivial

How does the learning of a supermask actually work? Does a supermask simply learn to prune away the weights with smallest magnitudes? Since pure magnitude-based pruning of pre-trained weights does not perform any task-specific learning, we reason that the weight entries being set to zero by the supermask must have significant values. Here, we inspect the magnitudes of the pre-trained weights zeroed by the supermasks (Figure 10, Table 6). These weights turn out to have remarkably higher magnitudes than the smallest entries, suggesting the learning of supermasks is not trivial magnitude-based pruning.

GLUE task MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE
Pruned max
Supermask max
Pruned mean
Supermask mean
Overlap
Table 6: Comparison between weights pruned with low-sparsity supermasks (initialized at 0% sparsity) and weights pruned with magnitude-based pruning at the same final sparsity. We report the maximum and mean magnitude of the pruned weights. The last row shows the percentage of overlap between the supermask and the magnitude-based pruning mask, i.e. the percentage of weights zeroed by the supermask that are also among the smallest weights.
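The comparison underlying Table 6 can be sketched as follows: given a learned supermask and a pre-trained weight matrix, compute the mean magnitude of the weights the supermask prunes, the mean magnitude that magnitude-based pruning would remove at the same sparsity, and the overlap between the two pruned sets. The random weight and mask below are placeholders for the actual pre-trained matrix and learned supermask.

```python
import torch

def compare_to_magnitude_pruning(weight: torch.Tensor, supermask: torch.Tensor):
    """For a supermask (1 = keep, 0 = prune) over a weight matrix, report the mean magnitude
    of the weights it prunes, the mean magnitude that plain magnitude-based pruning would
    remove at the same sparsity, and the overlap between the two pruned sets."""
    pruned = supermask == 0
    k = int(pruned.sum())
    mag_pruned = torch.zeros_like(pruned)
    if k > 0:
        idx = weight.abs().flatten().argsort()[:k]       # k smallest-magnitude entries
        mag_pruned.view(-1)[idx] = True
    overlap = (pruned & mag_pruned).sum() / max(k, 1)
    return {
        "supermask pruned mean |w|": weight.abs()[pruned].mean().item(),
        "magnitude pruned mean |w|": weight.abs()[mag_pruned].mean().item(),
        "overlap fraction": overlap.item(),
    }

# Illustrative run with a random weight and a random low-sparsity mask.
w = torch.randn(768, 768)
mask = (torch.rand_like(w) > 0.05).to(torch.bool)        # roughly 5% of entries pruned
print(compare_to_magnitude_pruning(w, mask))
```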

Appendix G Learning curves of low-sparsity supermask fine-tuning

Our results suggest that supermask fine-tuning, if initialized at 0% sparsity, gradually increases sparsity during optimization, reaching a final sparsity level that correlates with the number of fine-tuning steps (Table 4). MNLI, the GLUE task with the most fine-tuning steps, reaches the highest final sparsity. We ask how prediction accuracy grows with sparsity during fine-tuning. As shown in Figure 11, sparsity, like model performance, grows rapidly during the initial phase of fine-tuning, so that model performance increases roughly linearly with sparsity.

Figure 11: Learning curves of MNLI low-sparsity supermask fine-tuning.