
UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning

Conventional fine-tuning of pre-trained language models tunes all model parameters and stores a full model copy for each downstream task, which has become increasingly infeasible as model sizes grow. Recent parameter-efficient language model tuning (PELT) methods manage to match the performance of fine-tuning with far fewer trainable parameters and perform especially well when the training data is limited. However, different PELT methods may perform rather differently on the same task, making it nontrivial to select the most appropriate method for a specific task, especially considering the fast-growing number of new PELT methods and downstream tasks. In light of model diversity and the difficulty of model selection, we propose a unified framework, UniPELT, which incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup. Remarkably, on the GLUE benchmark, UniPELT consistently achieves 1 to 3 point gains over the best individual PELT method that it incorporates and even outperforms fine-tuning under different setups. Moreover, UniPELT often surpasses the upper bound obtained by taking the best performance of all its submodules used individually on each task, indicating that a mixture of multiple PELT methods may be inherently more effective than single methods.





1 Introduction

As pre-trained language models (PLMs) Devlin et al. (2019); Brown et al. (2020) grow larger and larger, it becomes increasingly infeasible to perform conventional fine-tuning, where separate replicas of the model parameters are modified per single task. To solve the issue, there has recently been a surge of studies on parameter-efficient language model tuning (PELT), namely how to effectively tune the PLMs with fewer trainable parameters. One line of work proposes to only tune a small subset of the parameters such as the top layers Lee et al. (2019) or the bias terms Ben Zaken et al. (2021). Other studies take a step further by freezing the entire PLM and adding a small number of additional trainable parameters Houlsby et al. (2019); Li and Liang (2021); Lester et al. (2021); Guo et al. (2021); Hu et al. (2021).

Existing PELT research generally aims at achieving performance comparable to conventional fine-tuning with as few trainable parameters as possible, which has seen significant progress – the task-specific trainable parameters used in most recent approaches Lester et al. (2021); Guo et al. (2021) are almost negligible compared to the total parameters of the PLM (<1%). A more challenging yet barely studied problem is whether one can achieve better performance than fine-tuning with fewer parameters. Recent studies He et al. (2021); Li and Liang (2021); Karimi Mahabadi et al. (2021) find that some PELT methods could be more effective than fine-tuning when the training data is limited, possibly due to the reduced risk of overfitting. However, as found in our analytical experiments, various PELT methods may exhibit diverse characteristics and perform rather differently on the same task, which makes it nontrivial to select the most appropriate method for a specific task, especially considering the fast-growing number of new PELT methods as well as downstream tasks.

In light of the diverse performance of PELT methods and the cost of selecting the best method, we propose a unified PELT framework, named UniPELT, which incorporates different PELT methods as submodules and learns to dynamically activate the submodules that best suit the current data or task setup. As a result, model selection is no longer needed, and consistently better performance is achieved under different setups. The activation of each submodule in UniPELT is controlled by a gating mechanism, which learns to favor (assign more weight to) the submodules that perform well on a given task. In addition, since the number of parameters introduced by each submodule is generally small, combining multiple methods leads to negligible losses in parameter efficiency – the trainable parameters in UniPELT still amount to <1% of the PLM.

We select two PELT methods as the representatives for our experiments – adapter-tuning Houlsby et al. (2019) and prefix-tuning Li and Liang (2021), as they (and their extensions) largely represent the most popular PELT methods to date; we plan to incorporate more methods in the next version.

At a high level, adapter-tuning increases model depth by inserting bottleneck layers into each Transformer layer of the PLM, while prefix-tuning increases model width by prepending continuous vectors (virtual tokens) to the input of each Transformer layer before multi-head attention. In both methods, the original parameters of the PLM are frozen and only the newly added parameters are updated.

We conduct extensive experiments on the General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2019). Experimental results show that UniPELT is more effective and robust than using each method alone in various scenarios. Specifically, UniPELT consistently improves the best submodule that it incorporates by 1 to 3 points and even outperforms fine-tuning, achieving the best average performance on the GLUE benchmark under different setups. More remarkably, UniPELT often surpasses the upper bound obtained by taking the best performance of all its submodules used individually on each task, which indicates that UniPELT successfully learns to leverage different submodules under different setups and maintains (near) optimal performance. The fact that UniPELT outperforms the upper bound also suggests that a mixture of PELT methods may be inherently more effective than single methods.

Contributions. (1) We conduct analytical experiments on two representative PELT methods under the same testbed and present valuable findings. (2) We propose a unified PELT framework that can incorporate multiple PELT methods as submodules and automatically learn to activate the most appropriate submodule for a given task without model selection. (3) Our proposed framework achieves better performance than fine-tuning and the PELT methods that it incorporates on the GLUE benchmark under different setups, with negligible losses in parameter efficiency.

2 Preliminaries

2.1 PELT methods w/o Additional Parameters

PLMs are often used as feature extractors where only the top layers or the prediction head are fine-tuned Lee et al. (2019). However, such approaches generally lead to degenerate model performance that is much worse than fine-tuning all parameters Lee et al. (2019); Pfeiffer et al. (2021). A recent method, BitFit Ben Zaken et al. (2021), which only fine-tunes the bias terms of the model, achieves performance comparable to fine-tuning when the training data is limited. In the extreme form, in-context prompting as used by models such as GPT-3 Brown et al. (2020) involves no parameter tuning at all but merely few-shot demonstrations provided in the model input.

2.2 PELT methods w/ Additional Parameters

Alternatively, some methods fix the entire PLM and introduce a small number of new trainable parameters. Notable examples in this category include adapter-tuning Houlsby et al. (2019) and its extensions Pfeiffer et al. (2021); Karimi Mahabadi et al. (2021); Mahabadi et al. (2021), prefix-tuning Li and Liang (2021) and its extensions Lester et al. (2021), and additive methods Zhang et al. (2020); Guo et al. (2021); Hu et al. (2021).

Next, we will introduce these methods (mostly the primary version) in more detail to facilitate the introduction of our proposed framework. An illustration is shown in Fig. 1 for better understanding.

Adapter-tuning. Adapter-tuning Houlsby et al. (2019) is a lightweight alternative to fine-tuning, which adds a trainable bottleneck layer after the feedforward network in each Transformer layer of the PLM. A bottleneck layer consists of a down+up projection pair that shrinks and recovers the size of token hidden states. Mathematically, if we denote the output of the feedforward network (after residual connection & layer normalization) as $h_{FN} \in \mathbb{R}^{D_{hidden}}$, with hidden size $D_{hidden}$ and bottleneck size $D_{mid}$, then the output of the bottleneck layer $h_A$ is:

$$h_A = W_{up}^\top \phi(W_{down}^\top h_{FN}),$$

where $W_{down} \in \mathbb{R}^{D_{hidden} \times D_{mid}}$, $W_{up} \in \mathbb{R}^{D_{mid} \times D_{hidden}}$, $\phi$ is a nonlinear activation function, and the bias terms are omitted for brevity. The parameters in layer normalization and the final prediction head are sometimes also fine-tuned, depending on the specific adapter variant.

Adapter-tuning has shown to be on par with fine-tuning and sometimes exhibits better effectiveness in the low-resource setting He et al. (2021). Later studies extend adapter-tuning to multi-lingual Pfeiffer et al. (2021) and multi-task Karimi Mahabadi et al. (2021) settings, or further reduce the trainable parameters Mahabadi et al. (2021), which can be easily incorporated into UniPELT as a replacement of the vanilla adapter-tuning.
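As a concrete illustration, the bottleneck computation described above can be sketched in a few lines of NumPy. The dimensions and initialization below are our own illustrative choices, not values prescribed by any particular adapter implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

D_hidden, D_mid = 768, 48   # illustrative sizes; D_mid << D_hidden
W_down = rng.normal(scale=0.02, size=(D_hidden, D_mid))
W_up = rng.normal(scale=0.02, size=(D_mid, D_hidden))

def relu(x):
    return np.maximum(x, 0.0)

def adapter(h_fn):
    """Bottleneck layer: down-project, nonlinearity, up-project, residual."""
    h_a = relu(h_fn @ W_down) @ W_up   # shrink to D_mid, recover D_hidden
    return h_a + h_fn                  # residual connection

h_fn = rng.normal(size=(D_hidden,))    # output of the feedforward network
out = adapter(h_fn)
print(out.shape)                       # (768,)
```

Only `W_down` and `W_up` (plus biases, omitted here) are trained; the rest of the PLM stays frozen.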

Prefix-tuning. Prefix-tuning Li and Liang (2021) prepends a number of task-specific trainable vectors to the input of multi-head attention in each Transformer layer, as if they were virtual tokens that the original tokens can attend to during multi-head attention. Specifically, we denote the prefix length as $L$ and the hidden state of the $i$-th token before multi-head attention as $h_i$. Then, for each $h_i$ with $i \le L$, there is a corresponding trainable vector $p_i$ from an embedding matrix $P \in \mathbb{R}^{L \times D_{hidden}}$. The rest of the $h_i$ ($i > L$) are the hidden states of the original tokens (in the actual natural language input), which depend on the output of the previous Transformer layer $h^{(l-1)}$:

$$h_i = \begin{cases} p_i, & i \le L, \\ \text{Transformer}(h^{(l-1)})_i, & i > L. \end{cases}$$

To allow for more expressiveness, the embedding matrix $P$ is reparameterized by a two-layer feedforward network:

$$P = \phi(P' W_1)\, W_2,$$

where $P' \in \mathbb{R}^{L \times D_{hidden}}$, $W_1 \in \mathbb{R}^{D_{hidden} \times D_{mid}}$, $W_2 \in \mathbb{R}^{D_{mid} \times 2 N_{layer} D_{hidden}}$, and $N_{layer}$ denotes the number of Transformer layers. The parameters of this network can be discarded after training is complete, and only the prefix vectors, of size $2 \times N_{layer} \times L \times D_{hidden}$, are left to be prepended to the key and value states of multi-head attention in each of the $N_{layer}$ Transformer layers.

Prefix-tuning was originally used for natural language generation, and we adapt it to understanding tasks. Note that prefix-tuning is different from prompt-based fine-tuning methods Schick and Schütze (2021); Gao et al. (2021) in multiple ways: (1) Prompt-based fine-tuning is not parameter-efficient, as it updates all model parameters, while prefix-tuning only updates the prefix embedding matrix and its reparameterization network. (2) The prompts are only used in the model input for prompt-based fine-tuning, but added to every Transformer layer in prefix-tuning (stored as different vectors). (3) Prompt-based fine-tuning typically leverages carefully designed natural language prompts, while prefix-tuning uses continuous prompts (virtual tokens). A follow-up method of prefix-tuning, named prompt-tuning Lester et al. (2021), further reduces task-specific parameters by limiting the prefix to the first layer, but only performs competitively at very large model sizes (billions of total parameters), and is thus not considered in our study.
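To make the reparameterization concrete, here is a minimal NumPy sketch (layer counts, prefix length, and hidden sizes are illustrative assumptions) that maps a small trainable seed matrix to per-layer key/value prefix vectors and shows the size of what is kept after training:

```python
import numpy as np

rng = np.random.default_rng(0)

L, D_hidden, D_mid, N_layer = 10, 768, 512, 12   # illustrative sizes

P_prime = rng.normal(scale=0.02, size=(L, D_hidden))  # trainable seed matrix
W1 = rng.normal(scale=0.02, size=(D_hidden, D_mid))
W2 = rng.normal(scale=0.02, size=(D_mid, 2 * N_layer * D_hidden))

def relu(x):
    return np.maximum(x, 0.0)

# Two-layer feedforward reparameterization of the prefix embeddings.
P = relu(P_prime @ W1) @ W2              # shape (L, 2 * N_layer * D_hidden)

# After training, only these vectors are kept: one key prefix and one
# value prefix of length L for each Transformer layer.
prefixes = P.reshape(L, N_layer, 2, D_hidden).transpose(1, 2, 0, 3)
print(prefixes.shape)                    # (12, 2, 10, 768)
```

The reparameterization network (`W1`, `W2`) is discarded after training; only `prefixes` is stored per task.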

Additive Methods. Additive PELT methods treat the model parameters after fine-tuning as an addition of the pre-trained parameters $\theta_{pre}$ and task-specific differences $\delta_{task}$, where $\theta_{pre}$ is fixed and a new (sub)set of model parameters is added on top: $\theta_{task} = \theta_{pre} + \delta_{task}$. There are various ways to parameterize the task-specific differences $\delta_{task}$, leading to different additive methods such as LoRA Hu et al. (2021), diff pruning Guo et al. (2021), and side-tuning Zhang et al. (2020). We plan to incorporate additive methods into UniPELT in the next version.
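As one example of parameterizing the task-specific difference, a minimal LoRA-style sketch (hidden size and rank are illustrative assumptions, not values from any paper) expresses the difference for one weight matrix as a low-rank product, so only the two small factors are trained:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 768, 8                                 # illustrative hidden size and rank
W_pre = rng.normal(scale=0.02, size=(d, d))   # frozen pre-trained weight

# Trainable low-rank factors; the task-specific difference is delta = B @ A.
A = rng.normal(scale=0.02, size=(r, d))
B = np.zeros((d, r))                          # zero init => delta starts at 0

def effective_weight():
    """theta_task = theta_pre + delta_task, with delta_task = B @ A."""
    return W_pre + B @ A

# At initialization the effective weight equals the frozen weight,
# so training starts from the pre-trained model's behavior.
print(np.allclose(effective_weight(), W_pre))  # True
```

Only `A` and `B` (2·d·r values) are stored per task, instead of the full d×d matrix.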

Figure 1: Illustration of UniPELT inside one Transformer layer. Each submodule of UniPELT is controlled by a gating function. The trainable parameters are in green. Q, K, V, and P denote Query, Key, Value, and Prefix, respectively.

3 Unifying PELT Methods

3.1 Task Formulation

Given a large PLM $\mathcal{M}$ with parameter size $|\Theta|$ that cannot be fine-tuned directly due to computational or storage cost, suppose that we have a list of PELT methods $\{m_i\}$, whose trainable parameters $\{\theta_i\}$ are negligible (i.e., $\sum_i |\theta_i| \ll |\Theta|$). Our goal is to design a unified PELT framework that incorporates $\{m_i\}$ as submodules and learns to dynamically activate (upweight) different submodules when appropriate under different scenarios, such that one can achieve satisfactory results in terms of both model effectiveness and robustness without the hassle of trying out each method individually.

3.2 Proposed Method

In our analytical experiments, we observe that different PELT methods exhibit diverse characteristics and perform rather differently on the same task. For example, prefix-tuning generally performs well on natural language inference tasks regardless of the size of training data. Also, as can be seen in Fig. 1 and Sec. 2, different PELT methods often involve different parts of the PLM architecture (e.g., before multi-head attention for prefix-tuning and after feedforward layer for adapter-tuning), making it feasible to combine multiple PELT methods without (directly) interfering with each other.

In light of the two observations above, we propose a unified PELT framework, UniPELT, which takes a hybrid approach by incorporating multiple PELT methods as submodules. At a high level, UniPELT learns to activate (upweight) the submodules that best suit the current task or specific data sample and deactivate (downweight) the rest.

Gating Mechanism. To achieve fine-grained control of submodule (de)activation, we add a trainable gate $\mathcal{G}_{m_i}$ for each submodule $m_i$ in every Transformer layer (see Fig. 1). Ideally, if a submodule $m_i$ is useful for a given data or task setup, the gate output for $m_i$ would be high, such that $m_i$ plays a more important role in the current setup.

Specifically, for adapter-tuning, there is a residual connection between the feedforward network and the adapter submodule that sums the adapter input (before normalization) $h_{FN}$ and output $h_A$ as its final output: $h'_A = h_A + h_{FN}$. We design a gating function $\mathcal{G}_A \in (0, 1)$ that estimates the importance of adapter-tuning by its direct input $h_{FN}$, using a feedforward network with sigmoid activation, and then scales its output:

$$h'_A = \mathcal{G}_A h_A + h_{FN}.$$

Intuitively, the adapter submodule is effectively bypassed if $\mathcal{G}_A \approx 0$.

Similarly, for prefix-tuning, we design a gating function $\mathcal{G}_P \in (0, 1)$ that is applied to the prefix vectors $p_i$ while keeping the representations of the original tokens intact:

$$h_i = \begin{cases} \mathcal{G}_P \, p_i, & i \le L, \\ \text{Transformer}(h^{(l-1)})_i, & i > L. \end{cases}$$

In this way, the impact of the prefix is diminished if the gate output of the prefix-tuning submodule is low. The gating function $\mathcal{G}_P$ is estimated from the Transformer layer input with another feedforward network.
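The two gated computations can be sketched together in NumPy. The gate networks here are single linear layers with sigmoid (a simplification of the description above), and all sizes and the toy adapter are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

D_hidden, L = 768, 10               # illustrative sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate for adapter-tuning, estimated from the adapter's direct input h_FN.
w_gate_a = rng.normal(scale=0.02, size=(D_hidden,))

def gated_adapter(h_fn, adapter_fn):
    g_a = sigmoid(w_gate_a @ h_fn)            # scalar gate in (0, 1)
    return g_a * adapter_fn(h_fn) + h_fn      # scaled adapter output + residual

# Gate for prefix-tuning, applied to the prefix vectors only; the original
# token states pass through untouched.
def gated_prefix(prefix, token_states, g_p):
    return np.concatenate([g_p * prefix, token_states], axis=0)

h_fn = rng.normal(size=(D_hidden,))
out = gated_adapter(h_fn, adapter_fn=np.tanh)   # toy stand-in adapter

prefix = rng.normal(size=(L, D_hidden))
tokens = rng.normal(size=(5, D_hidden))
states = gated_prefix(prefix, tokens, g_p=0.0)  # gate 0 => prefix suppressed
print(np.allclose(states[:L], 0.0))             # True
```

With the gate at 0 the submodule is effectively bypassed; with the gate near 1 its full contribution is kept, matching the intuition above.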

Despite the seeming simplicity of UniPELT, we note that it is nontrivial for a unified approach to work well under different scenarios. Naively combining different PELT methods as a hybrid may lead to worse performance than using individual methods, as observed in both our experiments and prior studies Hu et al. (2021).

4 Experiments

4.1 Experiment Setup

Task Setup. We conduct extensive experiments on the General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2019), which involves four types of natural language understanding tasks: linguistic acceptability (CoLA), sentiment analysis (SST-2), similarity and paraphrase tasks (MRPC, STS-B, QQP), and natural language inference (MNLI, QNLI, RTE). WNLI is omitted following prior studies Houlsby et al. (2019); Devlin et al. (2019); He et al. (2021); Ben Zaken et al. (2021) due to its adversarial nature.

Data Setup. We first consider a low-resource setting where training data is limited. We sample a small subset of the training set for each task with size K ∈ {100, 500, 1,000}. As it is infeasible to submit a large number of runs to the GLUE leaderboard (2 submissions/day), we take 1,000 samples of the training set as the development set to select the best checkpoint and use the original development set as the test set, following He et al. (2021). Specifically, we randomly shuffle the training set with a data seed, take the first K samples as the new training set, and the next 1,000 samples as the development set. To reduce random variance, we shuffle the data with 5 different data seeds (keeping the model training seed fixed) and report the average performance. We also conduct another set of experiments by fixing the data and using 5 different random seeds for model training, the results of which are similar. Next, we consider a high-resource setting where the whole training set is used for every task, and the best performance on the GLUE development set is recorded.
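A sketch of this sampling procedure (function names and the seed value are ours, purely for illustration, not from any released code):

```python
import random

def low_resource_split(train_set, k, seed):
    """Shuffle with a data seed; first k samples become the new training
    set, the next 1,000 become the development set."""
    data = list(train_set)
    random.Random(seed).shuffle(data)   # deterministic given the seed
    return data[:k], data[k:k + 1000]

# Toy example: 5,000 "examples", K = 100, one illustrative data seed.
full_train = list(range(5000))
train, dev = low_resource_split(full_train, k=100, seed=111)
print(len(train), len(dev))  # 100 1000
```

Repeating this with 5 different seeds and averaging the results gives the variance-reduced numbers reported in the tables.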

Compared Methods. We mainly compare UniPELT with conventional fine-tuning and the PELT methods that UniPELT incorporates, namely, adapter-tuning Houlsby et al. (2019) and prefix-tuning Li and Liang (2021) when used individually. We additionally compare with a baseline, UniPELT-NoGate, where the submodules are simply used together without gating.

Implementation Details. We use BERT as the major model in the experiments. We adopt AdapterHub Pfeiffer et al. (2020), a library based on HuggingFace Transformers Wolf et al. (2019), as our codebase. We re-implement the other submodules in the same codebase to ensure a fair comparison among all compared methods. We largely follow the recommended hyperparameters of AdapterHub and keep them the same across tasks due to practical considerations. Specifically, we set the input length to 128 and the training batch size to 16. We set the number of epochs to 50 to ensure that all methods under different setups are well trained, adopting early stopping with a patience of 10 non-increasing epochs. We set the learning rates of fine-tuning and adapter-tuning to 2e-5 and 1e-4 according to prior studies Pfeiffer et al. (2020); He et al. (2021). We tune the learning rates of prefix-tuning and UniPELT from {1e-4, 2e-4, 5e-4} on the development set and set them to 2e-4 and 5e-4, respectively. We fix the prefix length and the adapter bottleneck size across all tasks.

[K = 100] Dev Performance
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     81.78   81.96   17.91   58.15   70.02   74.07   45.08   61.65   61.33
Adapter-tuning  81.89   81.56    2.19   53.92   72.71   77.32   41.39   62.40   59.17
Prefix-tuning   66.24   81.22    0.00   57.14   72.36   57.69   42.53   15.75   49.12
UniPELT         80.54   81.88   16.53   57.90   73.52   79.14   45.39   65.05   62.50

[K = 100] Test Performance
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     79.61   81.81   16.56   55.88   69.25   74.07   42.56   60.41   60.02
Adapter-tuning  80.48   81.40    2.02   52.78   72.25   77.32   38.81   60.88   58.24
Prefix-tuning   60.87   81.22    0.00   55.96   71.91   57.69   40.58   15.68   47.99
UniPELT         77.22   81.86   14.42   55.52   72.26   79.14   42.59   63.41   60.80

[K = 500] Dev Performance
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     87.01   83.49   38.42   63.07   78.03   84.96   59.30   69.51   70.47
Adapter-tuning  85.86   83.00   39.13   63.52   78.39   83.52   52.60   69.40   69.43
Prefix-tuning   86.72   83.27   41.47   66.08   78.97   79.75   61.17   54.64   69.01
UniPELT         86.63   83.59   43.59   65.12   79.53   84.53   60.15   69.09   71.53

[K = 500] Test Performance
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     85.67   83.34   36.47   59.64   77.30   84.96   55.84   68.23   68.93
Adapter-tuning  84.54   82.53   38.65   59.35   77.39   83.52   50.04   68.12   68.02
Prefix-tuning   83.65   82.96   38.16   63.18   78.50   79.75   58.06   54.34   67.32
UniPELT         84.84   83.25   39.84   63.32   78.36   84.53   56.08   68.14   69.79

[K = 1,000] Dev Performance
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     87.79   85.00   43.59   64.90   79.59   86.39   64.88   72.14   73.04
Adapter-tuning  87.31   84.81   43.67   65.62   80.34   85.52   60.36   71.24   72.36
Prefix-tuning   87.86   84.00   45.60   68.87   80.93   82.38   66.08   69.08   73.10
UniPELT         87.88   85.98   46.17   67.36   81.36   86.82   66.19   71.26   74.13

[K = 1,000] Test Performance
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     86.54   84.87   43.26   62.31   79.03   86.39   61.95   71.09   71.93
Adapter-tuning  85.60   84.49   42.33   61.81   79.68   85.52   57.86   70.32   70.95
Prefix-tuning   85.09   83.66   44.07   66.71   80.34   82.38   63.59   68.58   71.81
UniPELT         86.17   85.86   44.33   64.91   80.65   86.82   62.17   69.95   72.61

Table 1: Results on the GLUE benchmark with K ∈ {100, 500, 1,000} training samples. The evaluation metrics are Matthews correlation for CoLA, F1 for MRPC and QQP, Spearman's correlation for STS-B, and accuracy for the rest. We report average performance over five random seeds.

4.2 Analysis of Individual PELT Methods

In Table 1, we show the comparison results on the GLUE benchmark with various sizes of training data. As one can see, although the average performance of different methods over the 8 tasks is sometimes similar, the differences are quite significant under certain setups and can be as large as 5~9 points on a specific task (e.g., STS-B and MNLI with K = 500), even when excluding cases where some methods fail to learn (e.g., prefix-tuning on QQP with K = 100). Next, we take a closer look at the submodules of UniPELT when used individually.

Analysis of Adapter-tuning. The performance of adapter-tuning is relatively stable – there is no significantly better or worse result than fine-tuning that is consistent across tasks or sizes of training data. In general, adapter-tuning is slightly worse than fine-tuning in most cases. We do not observe that adapter-tuning consistently outperforms fine-tuning in the low-resource setting as in prior studies He et al. (2021), possibly because they tuned model hyperparameters on each task, which could be computationally prohibitive in real-world applications. For example, the bottleneck size of adapter-tuning is tuned from {64, 128, 256} in He et al. (2021), while the bottleneck size in UniPELT involves fewer parameters and is fixed across tasks. Another difference is that we only add one adapter submodule in each Transformer layer, which has been shown to be on par with adding two while using half of the parameters Pfeiffer et al. (2021).

On the other hand, there are certain tasks (e.g., STS-B) on which adapter-tuning largely outperforms prefix-tuning regardless of the size of training data, suggesting that one should favor adapter-tuning over prefix-tuning under certain scenarios.

Analysis of Prefix-tuning. For prefix-tuning, we observe that it sometimes fails to learn effectively when the training data is limited (e.g., on SST-2 and QQP with small K), leading to unsatisfactory performance and/or huge variance across different runs. Similar phenomena have been observed in a concurrent study Gu et al. (2021) on few-shot prompt-tuning. Overall, prefix-tuning performs poorly with very limited training data (K = 100) and becomes on par with fine-tuning as well as adapter-tuning when K reaches 1,000.

On the other hand, prefix-tuning performs especially well on certain tasks such as natural language inference (QNLI and MNLI) with various sizes of training data, which suggests that a hybrid approach that learns to activate (assign more weight to) prefix-tuning on these tasks is likely to yield decent results.

4.3 Effectiveness of UniPELT

Now let us turn to the effectiveness of our proposed framework UniPELT, which incorporates existing PELT methods as submodules.

Low-Resource Performance. We observe that UniPELT consistently achieves the best performance when averaged over 8 GLUE tasks, on both the development and test sets regardless of the number of training samples. Such results demonstrate the advantages of our hybrid approach regarding both model effectiveness and generalizability. The gains are generally 1~3 points over the submodule that performs the best (when used individually).

Training Size   Best Submodule   UniPELT
100             58.86            60.80
500             69.69            69.79
1,000           72.58            72.61
Table 2: Comparison of averaged test performance between UniPELT and the upper bound obtained by taking the best performance of its submodules on each task.
[All training data] Best Performance on GLUE Dev
Method          SST-2   MRPC    CoLA    RTE     QNLI    STS-B   MNLI    QQP     Avg
Fine-tuning     91.63   90.94   62.08   66.43   89.95   89.76   83.23   87.35   82.67
Adapter-tuning  91.86   89.86   61.51   71.84   90.55   88.63   83.14   86.78   83.02
Prefix-tuning   90.94   91.29   55.37   76.90   90.39   87.19   81.15   83.30   82.07
UniPELT-NoGate  91.74   90.18   58.63   71.12   90.30   88.76   81.58   83.36   81.96
UniPELT         91.86   90.28   61.15   71.84   90.77   88.86   83.41   86.74   83.12
Table 3: Results on the GLUE benchmark when all training samples are used.

Moreover, UniPELT performs the best or 2nd best on 6/8/7 out of 8 tasks when trained with 100/500/1,000 samples, and never performs the worst in any setup across different tasks, which indicates that UniPELT is quite robust and performs reliably under different scenarios. The improvements of UniPELT are generally larger with fewer training samples, suggesting that UniPELT performs especially well in the low-resource regime. In particular, on tasks where both adapter-tuning and prefix-tuning fail to learn, such as CoLA with K = 100, UniPELT manages to achieve performance close to fine-tuning.

UniPELT vs. Upper Bound. In Table 2, we show the comparison of UniPELT and the upper bound obtained by taking the best performance of its submodules on each task. Perhaps surprisingly, UniPELT performs even better than the upper bound (although sometimes marginally), which indicates that UniPELT successfully learns to leverage different submodules and maintains (near) optimal performance under different setups. The fact that UniPELT outperforms the upper bound also suggests that a mixture of PELT methods might be inherently more effective than single methods.

High-Resource Performance. In Table 3, we compare the performance of different methods on the development set of GLUE when all training samples are used. UniPELT again achieves the best overall performance, although the gains are not as significant as in the low-resource setting. Also, simply combining multiple PELT methods without gating may not work very well – although UniPELT-NoGate never performs the worst in each task, its overall performance is rather poor, which suggests that a more careful mixture of PELT methods is important for achieving better model effectiveness.

4.4 Efficiency of UniPELT

Parameter Efficiency. Table 4 lists the number of trainable parameters in different PELT methods. A general trend is that the trainable parameters in recent PELT methods have been continuously decreasing. For example, for adapter-tuning, the number of task-specific parameters used to achieve competitive performance on GLUE has been reduced to 0.047% Mahabadi et al. (2021) from 3.6% in the primary version Houlsby et al. (2019). Prefix-tuning Li and Liang (2021) typically involves 0.1% to 1% additional parameters, while its successor prompt-tuning Lester et al. (2021) reaches under 0.01% for most model sizes.

As the trainable parameters in recent PELT methods are almost negligible, combining multiple methods does not lead to significant losses in parameter efficiency. UniPELT still has <1% trainable parameters in total, where its submodules prefix-tuning and adapter-tuning use 0.17% and 0.81%, respectively. The number can be further reduced (e.g., to <0.1%) if one uses more parameter-efficient variants of the two methods, which can be easily swapped with the vanilla versions used in the current framework.
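The reported percentages can be roughly reproduced by simple parameter counting. The sketch below uses BERT-base-sized dimensions; the prefix length and bottleneck size are our illustrative assumptions chosen to show how such an accounting works, not values quoted from the table:

```python
# Rough parameter accounting for the two submodules on a BERT-base-sized
# model (~110M parameters). Prefix length and bottleneck size below are
# our illustrative assumptions.
total_params = 110_000_000
n_layer, d_hidden = 12, 768
prefix_len, bottleneck = 10, 48

# Prefix-tuning keeps 2 (key + value) x N_layer x L x D_hidden vectors.
prefix_params = 2 * n_layer * prefix_len * d_hidden

# Adapter-tuning adds a down + up projection (with biases) per layer.
adapter_params = n_layer * (d_hidden * bottleneck + bottleneck
                            + bottleneck * d_hidden + d_hidden)

print(f"prefix:  {prefix_params / total_params:.2%}")
print(f"adapter: {adapter_params / total_params:.2%}")
print(f"total:   {(prefix_params + adapter_params) / total_params:.2%}")
```

Under these assumptions the two submodules together stay well under 1% of the PLM's parameters, consistent with the text above.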

Training and Inference Efficiency. We observe that incorporating multiple PELT methods into UniPELT does not suffer from slower training. UniPELT also has comparable inference speed as the baseline methods. The evaluation time on the development set (1,000 samples) is 3~4 seconds for every method.

Method #Param.
Adapter-tuning & extensions 0.047% ~ 3.6%
Prefix-tuning & extensions 0.01% ~ 2%
BitFit Ben Zaken et al. (2021) 0.01% ~ 0.09%
Diff pruning Guo et al. (2021) 0.5%
LoRA Hu et al. (2021) 0.01%
UniPELT 0.17%+0.81%=0.98%
Table 4: Number of trainable parameters in different PELT methods. Combining multiple methods leads to insignificant losses in parameter efficiency as the trainable parameters in each method are negligible.

5 Related Work

Parameter-Efficient Tuning of PLMs. As it is infeasible to train and store a full copy of a large PLM for each downstream task in practice, how to efficiently tune the PLM with a small number of trainable parameters becomes critical. Existing PELT methods can be largely divided into two categories based on whether new trainable parameters are introduced. Specifically, one may either train a subset of the model parameters, such as the prediction head Lee et al. (2019) and bias terms Ben Zaken et al. (2021), or introduce task-specific parameters at different parts of the PLM, such as before multi-head attention Li and Liang (2021) or after the feedforward layer Houlsby et al. (2019). As the number of PELT methods keeps increasing, the purpose of UniPELT is to better understand and leverage the differences among existing methods instead of proposing yet another one.

Mixture-of-Experts. UniPELT is also related to approaches that involve a high-capacity network and activate different parts of the network given different inputs. One notable example is Mixture-of-Experts (MoE) Shazeer et al. (2017); Hazimeh et al. (2021), which maintains a set of experts (neural networks) and one or more trainable gates that select a combination of the experts specific to each input example. Despite being conceptually similar, UniPELT is different from MoE in several ways: (1) The submodules in UniPELT are not combined explicitly by summation as in MoE but placed in sequential order, affecting each other implicitly. (2) The "experts" are heterogeneous and diverse in UniPELT, while usually homogeneous or identical in MoE methods. (3) The importance of each submodule in UniPELT is estimated individually instead of by a shared gate using the same representation.

6 Conclusion

In this paper, we propose a unified framework that incorporates different PELT methods as submodules and learns to automatically activate the most appropriate submodules for a given data or task setup. Our proposed framework consistently outperforms conventional fine-tuning as well as the submodules that it incorporates under different setups, and often surpasses the upper bound when taking the best performance of each submodule used individually on each task. Our findings suggest that a mixture of multiple PELT methods may be favorable in terms of both model effectiveness and robustness with negligible losses in parameter efficiency. For future work, we will conduct more analytical experiments on existing PELT methods and incorporate more of them into our framework. We will also try to better understand and explain the performance discrepancy of various PELT methods in different scenarios.

Acknowledgments

We thank Xiang Lisa Li, Hai Ye, Rabeeh Karimi Mahabadi, and Liyuan Liu for helpful discussions and feedback.