Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

08/30/2021 ∙ Ningyu Zhang, et al. ∙ Zhejiang University

Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) plugged into any pre-trained language model; (ii) extended to widespread classification tasks. A comprehensive evaluation on standard NLP tasks demonstrates that the proposed approach achieves better few-shot performance.

1 Introduction

The pre-train-fine-tune paradigm has become the de facto standard for natural language processing (NLP) and has achieved excellent results on several benchmarks Devlin et al. (2019); Liu et al. (2019); Lewis et al. (2020); Dong et al. (2019); Bao et al. (2020a). The success of these pioneers seems to suggest that large-scale pre-trained models are a panacea for boosting machine intelligence. However, supervised fine-tuning still relies heavily on labeled data in practice and faces non-negligible challenges owing to variations across domains, languages, and tasks. These drawbacks motivate an important line of research, few-shot learning, which can significantly improve the learning capabilities of machine intelligence and practical adaptive applications by accessing only a small number of labeled examples.

The GPT-3 model, introduced by Brown et al. (2020), exhibits impressive few-shot learning capabilities. Given a natural language prompt and 16 labeled samples as demonstrations in the contextual input, GPT-3 achieves 80% of the SOTA results. However, GPT-3 is a fully dense transformer model with 175B parameters, which makes it challenging to deploy in most real-world applications.

Figure 1: The architecture of the DifferentiAble pRompT (DART) model compared with MLM pre-training and conventional fine-tuning, where the template and label tokens are unused or special tokens in the vocabulary.

Recently, an emerging fine-tuning methodology has arisen to equip smaller language models (LMs) with few-shot capabilities: adapting the pre-trained LM directly as a predictor through the completion of a cloze task Schick and Schütze (2021, 2020); Gao et al. (2020); Liu et al. (2021c), which treats the downstream task as a (masked) language modeling problem. These prompts can be used in fine-tuning to provide the classifier with additional task information, especially in the low-data regime. Notably, Scao and Rush (2021) observe that prompting can often compensate for hundreds of data points on average across multiple classification tasks. However, determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets Perez et al. (2021). Recent studies Lu et al. (2021); Zhao et al. (2021) have reported that the manual prompt format can be sub-optimal, causing accuracy to vary from near-random guessing to near the state of the art. Therefore, previous approaches have attempted to search for discrete prompt tokens automatically. However, obtaining an optimized prompt template and target label token is non-trivial for widespread classification tasks. For example, specific classification tasks such as relation extraction have labels that cannot be expressed by a single token in the vocabulary.

In this paper, we propose a novel DifferentiAble pRompT (DART) fine-tuning approach, which is model-agnostic, parameter-efficient, and free of prompt engineering. As illustrated in Figure 1, the key idea is to leverage a few parameters (unused tokens) in the language model, which serve as the template and label tokens, and to optimize them in the continuous space using backpropagation. Subsequently, we introduce differentiable prompt learning to obtain optimized prompt templates as well as labels. Since fine-tuning with limited samples can suffer from instability Dodge et al. (2020); Zhang et al. (2021), we propose a two-stage optimization algorithm that first learns the templates and labels and then optimizes all parameters. We further introduce an auxiliary fluency constraint objective to ensure the association among the prompt embeddings.

We conduct extensive experiments on 15 NLP datasets. With only a few training samples across all the tasks, our approach (DART) obtains better performance. Notably, an absolute performance improvement of up to 23.28% over conventional fine-tuning is obtained on average in the K = 8 setting (and 1.55% in the fully supervised setting) on relation extraction datasets with complex label semantics. Our approach can be applied to real-world classification tasks without the high cost of collecting and annotating a large amount of data. The main contributions of this study are as follows:

  • We propose a new simple framework for few-shot learning, which is pluggable, extensible, and efficient without prompt engineering. To the best of our knowledge, optimizing label tokens in continuous space is also a new branch of research that has not been explored in language model prompting.

  • A systematic evaluation of 15 NLP tasks shows that the simple-yet-effective method contributes towards improvements across all these tasks. Remarkably, given only 8 labeled samples per class, our proposed approach can achieve 90% of the performance of the SOTA results obtained with the full dataset.

2 Related Work

Language Model Prompting.

Language model prompting has emerged with the introduction of GPT-3 Brown et al. (2020), which demonstrates excellent few-shot performance Liu et al. (2021b). However, GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts (in-context learning Liu et al. (2021a); Zhao et al. (2021); Ding et al. (2021); Min et al. (2021)). Thus, recent studies Qin and Eisner (2021); Hambardzumyan et al. (2021); Chen et al. (2021) have focused on automatically searching for prompts. Schick and Schütze (2021, 2020) propose PET, which reformulates NLP tasks as cloze-style questions and performs gradient-based fine-tuning. Tam et al. (2021) improve PET with a denser supervision objective during fine-tuning. Shin et al. (2020) propose AUTOPROMPT, which creates prompts for a diverse set of tasks based on a gradient-guided search. Han et al. (2021) propose PTR, which leverages logic rules to construct prompts with sub-prompts for many-class text classification. Wang et al. (2021) reformulate potential NLP tasks as entailment and then fine-tune the model with few-shot samples. Hu et al. (2021) propose an approach that incorporates an external knowledge graph into the verbalizer with calibration. Additionally, Gao et al. (2020) present LM-BFF (better few-shot fine-tuning of language models), which leverages T5 Raffel et al. (2020) to generate templates and to search for label tokens in the vocabulary. However, the utilization of a generative model and the label search with validation is computation-intensive. Moreover, prompt search over a discrete space is sub-optimal due to the continuous nature of neural networks.

To overcome these limitations, Liu et al. (2021c) propose P-tuning, which employs trainable continuous prompt embeddings learned by an LSTM. Zhong et al. (2021) propose an effective continuous method called OPTIPROMPT to optimize prompts for factual probing. Li and Liang (2021) propose prefix-tuning, which keeps the language model parameters frozen but optimizes a small continuous task-specific vector for natural language generation tasks. Lester et al. (2021) propose a mechanism for learning "soft prompts" to condition frozen language models to perform downstream tasks. However, these approaches still have to optimize external parameters (e.g., the LSTM in P-tuning) and struggle with complex label spaces.

Conversely, this study aims to develop a novel few-shot learning framework based on pre-trained language models that requires neither prompt engineering (of templates or labels) nor the optimization of external parameters. Furthermore, the proposed approach only requires a non-invasive modification of the model, so it can be plugged into any pre-trained language model and extended to widespread classification tasks.

Few-shot Learning.

Few-shot learning can significantly improve the learning capabilities of machine intelligence and practical adaptive applications by accessing only a small number of labeled examples Zhang et al. (2020). The proposed approach relates to other few-shot NLP methods, including: (1) meta-learning Yu et al. (2018); Bao et al. (2020b); Bansal et al. (2020); Deng et al. (2020b, a); Yu et al. (2020), in which quantities of auxiliary tasks are optimized; (2) intermediate training Phang et al. (2018); Yin et al. (2020), which supplements the pre-trained LMs with further training on data-rich supervised tasks; and (3) semi-supervised learning Miyato et al. (2017); Xie et al. (2020), which leverages unlabeled samples. The proposed approach focuses on a more realistic few-shot setting (the number of labeled instances per class can be any variable).

3 Background

3.1 Language Model Prompting

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a sentence, where $x_i$ is the $i$-th token in the input sentence and $n$ is the number of tokens. Specifically, $X$ is converted to a fixed token sequence $\tilde{X}$ and then mapped to a sequence of hidden vectors $\{\mathbf{h}_k \in \mathbb{R}^d\}$. Given the input sequence $\tilde{X} = ([\text{CLS}], X, [\text{SEP}])$, the conventional fine-tuning approaches leverage a generic head layer over the [CLS] embedding (e.g., an MLP layer) to predict an output class. For the prompt-based method, a task-specific pattern string (template $\mathcal{T}$) is designed to coax the model into producing a textual output corresponding to a given class (label token); we refer to these two things together as a prompt. Specifically, the template $\mathcal{T}$, which contains one [MASK] token, is concatenated with the input and fed to the MLM as:

$X_{\text{prompt}} = [\text{CLS}]\; X\; \mathcal{T}\; [\text{SEP}]$. (1)

When the prompt is fed into the MLM, the model produces a probability distribution over the [MASK] position, from which the probability of a candidate class $y \in \mathcal{Y}$ is obtained as:

$p(y \mid X_{\text{prompt}}) = p\big([\text{MASK}] = \mathcal{M}(y) \mid X_{\text{prompt}}\big)$, (2)

where $\mathcal{M}(y)$ represents the label token of class $y$. To further understand the mechanism of language model prompting, we theoretically analyze the underlying intuitions.
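To make Eqs. (1) and (2) concrete, the following minimal sketch scores a sentence by reading the MLM distribution at the [MASK] position and restricting it to the label tokens. It assumes the Hugging Face transformers library; the handcrafted template "It was [MASK]." and the verbalizer great/terrible are illustrative choices, and bert-base-uncased is used here only to keep the verbalizer tokenization simple (the paper's classification experiments use RoBERTa-large).

```python
# Minimal sketch of Eqs. (1)-(2): prompt-based classification with a masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def prompt_class_probs(sentence, label_words=("great", "terrible")):
    # X_prompt = [CLS] X it was [MASK] . [SEP]  (illustrative template)
    prompt = f"{sentence} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits                                # (1, seq_len, |V|)
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    label_ids = tokenizer.convert_tokens_to_ids(list(label_words))
    # p(y | X_prompt) = p([MASK] = M(y) | X_prompt), renormalized over the label set
    return torch.softmax(logits[0, mask_pos, label_ids], dim=-1)

print(prompt_class_probs("a charming and often affecting journey ."))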

4 Our Approach

4.1 Motivation

It can be observed from previous empirical findings Gao et al. (2020); Scao and Rush (2021) that an optimal prompt is necessary to improve pre-trained language models as few-shot learners. Since templates with discrete tokens may be sub-optimal and insufficient to represent a specific class (it is non-trivial to evaluate all options of templates and label tokens), this study proposes DifferentiAble pRompT, referred to as DART, which reduces the requirement for prompt engineering and thereby improves the applicability of the proposed method to various domains.

4.2 Differentiable Template Optimization

Since language tokens are discrete variables, finding the optimal prompts by searching over tokens is non-trivial and can easily fall into local minima. To overcome these limitations, we utilize pseudo tokens to construct templates and then optimize them with backpropagation. Specifically, given the template $\mathcal{T} = \{[T_1], \ldots, [T_i], [\text{MASK}], [T_{i+1}], \ldots, [T_m]\}$, traditional discrete prompts satisfy $[T_j] \in \mathcal{V}$ and map $\mathcal{T}$ into:

$\big\{\mathbf{e}([T_1]), \ldots, \mathbf{e}([T_i]), \mathbf{e}([\text{MASK}]), \mathbf{e}([T_{i+1}]), \ldots, \mathbf{e}([T_m])\big\}$. (3)

DART instead considers $[T_1], \ldots, [T_m]$ as pseudo tokens and maps the template as follows:

$\big\{h_1, \ldots, h_i, \mathbf{e}([\text{MASK}]), h_{i+1}, \ldots, h_m\big\}$, (4)

where $h_j$ ($1 \le j \le m$) are trainable parameters. Differentiable template optimization can thereby obtain expressive templates beyond the original vocabulary $\mathcal{V}$. Lastly, the templates $h_{1:m}$ are differentially optimized by:

$\hat{h}_{1:m} = \mathop{\arg\min}_{h_{1:m}} \mathcal{L}\big(X_{\text{prompt}}, y\big)$, (5)

where $\mathcal{L}$ is the training objective introduced in §4.4. Note that the values of the prompt embeddings $h_{1:m}$ must be co-dependent on each other rather than independent. Unlike P-tuning Liu et al. (2021c), which utilizes a bidirectional LSTM, DART leverages an auxiliary fluency constraint objective without any external parameters to associate the prompt embeddings with one another, thus stimulating the model to focus on context representation learning.
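A minimal sketch of the differentiable template follows, assuming a BERT-style MLM whose [unusedK] vocabulary entries play the role of the pseudo tokens $h_{1:m}$. The gradient-masking hook is one simple way to keep only those embedding rows trainable and is not necessarily the authors' exact implementation; the number of pseudo tokens and the prompt layout are illustrative.

```python
# Sketch of Eqs. (3)-(5): template slots are [unusedK] tokens whose rows in the
# MLM's own word-embedding table are optimized by backpropagation.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

m = 3  # number of template pseudo tokens h_1..h_m (illustrative)
template_ids = [tokenizer.convert_tokens_to_ids(f"[unused{i}]") for i in range(1, m + 1)]

word_emb = model.get_input_embeddings()                  # nn.Embedding over the vocabulary
trainable_rows = torch.zeros(word_emb.weight.size(0), 1)
trainable_rows[template_ids] = 1.0                       # only pseudo-token rows keep their gradients

hook = word_emb.weight.register_hook(lambda g: g * trainable_rows.to(g.device))

def build_prompt_ids(sentence):
    # One illustrative layout: [CLS] X [SEP] [T_1] ... [T_m] [MASK] [SEP]
    sent_ids = tokenizer(sentence, add_special_tokens=False).input_ids
    return torch.tensor([[tokenizer.cls_token_id, *sent_ids, tokenizer.sep_token_id,
                          *template_ids, tokenizer.mask_token_id, tokenizer.sep_token_id]])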

4.3 Differentiable Label Optimization

Prompt-based fine-tuning requires filling in one word at the masked position, and the masked-word prediction is mapped by a verbalizer to a class (e.g., "Yes" maps to True, "No" to False). For each class, previous approaches such as LM-BFF Gao et al. (2020) estimate the conditional likelihood of an initial model on a pruned set of the top-$k$ vocabulary words.

However, such brute-force label searching: (1) is computationally intensive and tedious because the vocabulary $\mathcal{V}$ is generally very large, requiring multiple rounds of evaluation; and (2) scales poorly as the number of classes increases (many classification datasets have more than 100 classes), since the number of candidate label assignments grows exponentially with the total number of classes and thus becomes intractable. Additionally, the labels of classes contain rich, complex semantic knowledge, and one discrete token may be insufficient to represent this information.

Specifically, given the labels $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$, and unlike previous approaches that convert a class $Y_j$ into a variable number of label tokens $\{\ldots, v_1, \ldots, v_k, \ldots\}$, DART maps $Y_j$ into a continuous vocabulary space as follows:

$\mathcal{M}(Y_j) = h_{m+j}$, (6)

where $m$ is the number of trainable embeddings in the template. To avoid optimizing any external parameters, each trainable embedding $h$ is instantiated with an unused token (e.g., [unused1]) or a special token in the vocabulary $\mathcal{V}$ to generate the template and label tokens, as shown in Figure 1.
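Continuing the template sketch above, the label side can be treated the same way: each class $Y_j$ is verbalized as one additional unused token whose embedding row is learned, and classification reads the MLM logits at [MASK] restricted to those learned label tokens, as in Eq. (6). The particular [unusedK] ids and the two-class setup are arbitrary illustrative choices.

```python
# Sketch of Eq. (6), continuing the previous snippet: one learnable unused token per class.
num_classes = 2
label_token_ids = [tokenizer.convert_tokens_to_ids(f"[unused{m + 1 + j}]")
                   for j in range(num_classes)]
trainable_rows[label_token_ids] = 1.0                  # label rows are optimized as well

def class_logits(input_ids):
    logits = model(input_ids=input_ids).logits                         # (B, L, |V|)
    batch = torch.arange(input_ids.size(0))
    mask_pos = (input_ids == tokenizer.mask_token_id).float().argmax(dim=1)
    return logits[batch, mask_pos][:, label_token_ids]                 # (B, num_classes)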

4.4 Training Objectives

Since the pseudo tokens in the prompt template must be co-dependent on each other, we introduce an auxiliary fluency constraint, inspired by Liu et al. (2021c); Tam et al. (2021), that does not optimize any additional parameters. Overall, there are two objectives: the class discrimination objective $\mathcal{L}_C$ and the fluency constraint objective $\mathcal{L}_F$.

Class Discrimination Objective

The class discrimination objective is the main objective, which aims to classify the sentences. As shown in Figure 1, given an input $X$, we generate $X_{\text{prompt}}$ and compute:

$\mathcal{L}_C = \sum_{(X, y)} \mathrm{CE}\big(p(y \mid X_{\text{prompt}}),\, y\big)$, (7)

where $\mathrm{CE}$ is the cross-entropy loss function and $\mathcal{L}_C$ represents the class discrimination loss.
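A code-level sketch of Eq. (7), continuing the snippets above (the batching and data loop are omitted, and the function name is ours):

```python
# L_C: cross-entropy between the label-token distribution at [MASK] and the gold class.
import torch.nn.functional as F

def class_discrimination_loss(input_ids, labels):
    # labels: LongTensor of gold class indices, aligned with label_token_ids
    return F.cross_entropy(class_logits(input_ids), labels)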

Fluency Constraint Objective

To ensure the association among the template tokens and to retain the language-understanding ability inherited from the PLMs, we leverage a fluency constraint objective based on masked language modeling. As shown in Figure 1, one token in the input sentence is randomly masked and masked-language prediction is conducted. Let $X$ and $X'$ be the original and masked sequences, respectively, and let $x'$ be the target token that has been masked out in $X'$; its probability is maximized as follows (we use the golden label rather than the [MASK] token in the input of the fluency constraint objective):

$p(x' \mid X') = \dfrac{\exp\big(\mathbf{e}(x')^{\top} \mathbf{h}_{\text{mask}}\big)}{\sum_{v \in \mathcal{V}} \exp\big(\mathbf{e}(v)^{\top} \mathbf{h}_{\text{mask}}\big)}$, (8)

$\mathcal{L}_F = -\sum_{x' \in X} \log p\big(x' \mid X'\big)$, (9)

where $\mathbf{h}_{\text{mask}}$ is the hidden vector at the masked position. By optimizing $\mathcal{L}_F$, the language model can obtain a better contextual representation with rich associations among the template tokens. We have the following overall training objective:

$\mathcal{L} = \mathcal{L}_C + \lambda\, \mathcal{L}_F$, (10)

where $\lambda$ is a hyper-parameter. Lastly, we describe the overall optimization procedure of DART. To mitigate the instability of few-shot fine-tuning, we adopt the two-stage optimization procedure shown in Algorithm 1 (a code-level sketch follows the algorithm). In the first stage (lines 3-7), the template and label parameters described in §4.2 and §4.3 are optimized to obtain the optimal prompts (implemented by stopping the gradients of all other parameters); in the second stage (lines 8-12), all parameters are optimized.

1: Require: $f(\theta)$: stochastic objective function with parameters $\theta$; $\eta_1, \eta_2$: learning rates; $\theta_p \subseteq \theta$: parameters of the templates and label tokens;
2: initialize $\theta$;
3: while $\theta_p$ not converged do    Template and label optimization with learning rate $\eta_1$
4:     sample a mini-batch from the few-shot training set;
5:     compute the gradient $\nabla_{\theta_p} f(\theta)$ on the mini-batch;
6:     $\theta_p \leftarrow \theta_p - \eta_1 \nabla_{\theta_p} f(\theta)$;
7: end while
8: while $\theta$ not converged do    All-parameter optimization with learning rate $\eta_2$
9:     sample a mini-batch from the few-shot training set;
10:     compute the gradient $\nabla_{\theta} f(\theta)$ on the mini-batch;
11:     $\theta \leftarrow \theta - \eta_2 \nabla_{\theta} f(\theta)$;
12: end while
Algorithm 1 Differentiable Prompt Fine-tuning Algorithm with Two-stage Optimization
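The objective of Eq. (10) and the two-stage schedule of Algorithm 1 can be sketched as follows, continuing the earlier snippets. The single-token masking policy, the value of lambda, the learning rates, and the freezing strategy are illustrative placeholders rather than the authors' exact settings.

```python
# Sketch of Eqs. (8)-(10) and Algorithm 1, continuing the snippets above.
import torch
import torch.nn.functional as F
from torch.optim import AdamW

def dart_loss(input_ids, labels, lam=1.0):
    l_c = class_discrimination_loss(input_ids, labels)

    # Fluency constraint L_F: fill the prompt's [MASK] with the golden label token,
    # mask one random position, and ask the MLM to recover it. A real implementation
    # would avoid masking special or template positions.
    batch = torch.arange(input_ids.size(0))
    corrupted = input_ids.clone()
    prompt_mask_pos = (input_ids == tokenizer.mask_token_id).float().argmax(dim=1)
    corrupted[batch, prompt_mask_pos] = torch.tensor(
        [label_token_ids[y] for y in labels.tolist()])
    rand_pos = torch.randint(1, input_ids.size(1) - 1, (input_ids.size(0),))
    targets = corrupted[batch, rand_pos].clone()
    corrupted[batch, rand_pos] = tokenizer.mask_token_id
    mlm_logits = model(input_ids=corrupted).logits
    l_f = F.cross_entropy(mlm_logits[batch, rand_pos], targets)

    return l_c + lam * l_f

def run_stage(batches, lr, prompt_only):
    # Stage 1: only the pseudo template/label rows move (the gradient hook zeroes the
    # other embedding rows and every other module is frozen). Stage 2: tune everything.
    for p in model.parameters():
        p.requires_grad = not prompt_only
    word_emb.weight.requires_grad = True
    opt = AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    for input_ids, labels in batches:
        opt.zero_grad()
        dart_loss(input_ids, labels).backward()
        opt.step()

# run_stage(train_batches, lr=1e-3, prompt_only=True)   # lines 3-7: templates and labels
# hook.remove()                                          # afterwards, let all embedding rows move
# run_stage(train_batches, lr=1e-5, prompt_only=False)   # lines 8-12: all parameters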

4.5 Comparison to Previous Prompt-tuning Approaches

Since prompt learning has become a new paradigm for human-PLM communication, it has attracted many researchers, and with its rapid development, similar ideas (learned embeddings) have been introduced by different research teams. Table 1 summarizes the major differences between our model and other approaches.

To conclude, our approach is quite simple and requires no external parameters (unlike WARP, Prefix-Tuning, P-tuning, and ADAPET). Moreover, our approach unifies the optimization of the template and the answer.

Model External Parameter External Architecture Template Answer
Prefix-Tuning Li and Liang (2021) yes no continuous no
WARP Hambardzumyan et al. (2021) yes no continuous continuous
P-tuning Liu et al. (2021c) yes LSTM continuous no
ADAPET Tam et al. (2021) yes no discrete discrete
DART (Ours) no no continuous continuous
Table 1: The difference between DART and previous prompt-tuning approaches.

5 Experiments

In this section, we detail comprehensive experiments conducted on classification tasks. The promising results demonstrate that our proposed DART substantially outperforms the conventional fine-tuning method, thus making pre-trained language models better few-shot learners.

5.1 Dataset Statistics

We conduct a comprehensive study across 15 NLP tasks, covering sentiment analysis, natural language inference, paraphrase, sentence similarity, relation extraction, and event extraction (we only report event argument extraction performance). The evaluation includes 10 popular sentence classification datasets (SST-2, MR, CR, Subj, TREC, MNLI, SNLI, QNLI, MRPC, QQP). To further evaluate the effectiveness of the proposed approach with complex label spaces, we conduct experiments on relation extraction and event extraction datasets, including SemEval-2010 Task 8 (Hendrickx et al., 2010), TACRED-Revisit (Alt et al., 2020), Wiki80 (Han et al., 2019) (https://github.com/thunlp/OpenNRE/), ChemProt (Kringelum et al., 2016), and ACE-2005 (https://catalog.ldc.upenn.edu/LDC2006T06).

5.2 Settings

The proposed model is implemented using PyTorch Paszke et al. (2019). Our experiments follow the same setting as LM-BFF Gao et al. (2020), which measures the average performance over a fixed set of seeds across five different sampled few-shot training/development splits for each task. We utilize a grid search over multiple hyperparameters and select the best result as measured on the development split of each sampled set. We employ AdamW as the optimizer. For a fair comparison with LM-BFF, we conduct experiments with RoBERTa-large Liu et al. (2019) on the classification tasks. We leverage an uncased BERT-large Devlin et al. (2019) for the relation extraction datasets, except that we use SciBERT Beltagy et al. (2019) for the ChemProt dataset. We follow Soares et al. (2019) and uniformly use special entity markers to highlight the entity mentions for relation extraction.
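The few-shot protocol described above can be sketched as follows. This is a simplified illustration: the seed values, record fields, and the exact train/dev sampling details are assumptions rather than the released LM-BFF tooling, and each class is assumed to have at least K examples.

```python
# Sketch of the evaluation protocol: for each seed, sample K examples per class,
# train and evaluate on the corresponding split, and report mean/std over seeds.
import random
from statistics import mean, stdev

def sample_few_shot(dataset, k, seed):
    # dataset: iterable of {"text": ..., "label": ...} records (hypothetical schema)
    rng = random.Random(seed)
    by_class = {}
    for example in dataset:
        by_class.setdefault(example["label"], []).append(example)
    return [ex for exs in by_class.values() for ex in rng.sample(exs, k)]

def few_shot_scores(dataset, train_and_eval, k=16, seeds=(13, 21, 42, 87, 100)):
    scores = [train_and_eval(sample_few_shot(dataset, k, s)) for s in seeds]
    return mean(scores), stdev(scores)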

Model SST-2 (acc) MR (acc) CR (acc) Subj (acc) TREC (acc)
Majority 50.9 50.0 50.0 50.0 18.8
Prompt-based zero-shot 83.6 80.8 79.5 51.4 32.0
“GPT-3” in-context learning 84.8 (1.3) 80.5 (1.7) 87.4 (0.8) 53.6 (1.0) 26.2 (2.4)
Fine-tuning 81.4 (3.8) 76.9 (5.9) 75.8 (3.2) 90.8 (1.8) 88.8 (2.1)
LM-BFF 92.3 (1.0) 85.5 (2.8) 89.0 (1.4) 91.2 (1.1) 88.2 (2.0)
P-Tuning 92.2 (0.4) 86.7 (1.2) 91.8 (1.1) 90.3 (2.2) 86.3 (4.5)
DART 93.5 (0.5) 88.2 (1.0) 91.8 (0.5) 90.7 (1.4) 87.1(3.8)
Fine-tuning (full) 95.0 90.8 89.4 97.0 97.4

Model MNLI (acc) SNLI (acc) QNLI (acc) MRPC (F1) QQP (F1)
Majority 32.7 33.8 49.5 81.2 0.0
Prompt-based zero-shot 50.8 49.5 50.8 61.9 49.7
“GPT-3” in-context learning 52.0 (0.7) 47.1 (0.6) 53.8 (0.4) 45.7 (6.0) 36.1 (5.2)
Fine-tuning 45.8 (6.4) 48.4 (4.8) 60.2 (6.5) 76.6 (2.5) 60.7 (4.3)
LM-BFF 68.3 (2.5) 77.1 (2.1) 68.3 (7.4) 76.2 (2.3) 67.0 (3.0)
P-Tuning 61.5 (2.1) 72.3 (3.0) 64.3 (2.8) 74.5 (7.6) 65.6 (3.0)
DART 67.5 (2.6) 75.8 (1.6) 66.7 (3.7) 78.3 (4.5) 67.8 (3.2)
Fine-tuning (full) 89.8 92.6 93.3 91.4 81.7

Table 2: Our main results with RoBERTa-large. Fine-tuning (full): the full training set is used. Prompt-based zero-shot: no training examples are used. Otherwise, we use K = 16 (# examples per class). We report mean (and standard deviation) performance over 5 different splits. Majority: majority class. "GPT-3" in-context learning: using the in-context learning proposed in Brown et al. (2020) with RoBERTa-large (no parameter updates). LM-BFF: we report the performance in Gao et al. (2020). full: fine-tuning using the full training set.

5.3 Main Results

As shown in Table 2, we observe that our approach obtains better performance than conventional fine-tuning and achieves results comparable with LM-BFF. Note that DART does not need any prompt engineering or external model (e.g., T5 in LM-BFF) to generate templates, so it is easy to adapt to other datasets. DART obtains an 11.3% improvement with only 16 training samples per class on the MR dataset, comparable with LM-BFF, which leverages T5 to generate appropriate prompts. These results indicate that DART can better stimulate the potential ability of the pre-trained language model and make it a better few-shot learner. We also notice that DART yields better performance than P-tuning, which indicates that label optimization is beneficial.

Dataset Model K=8 K=16 K=32 Full
SemEval Fine-tuning 26.3 43.8 64.2 87.8
LM-BFF 43.2 62.0 72.9 88.0
DART 51.8 (+25.5) 67.2 (+23.4) 77.3 (+13.1) 89.1 (+1.3)
TACRED-Revisit Fine-tuning 7.4 15.5 25.8 75.0
LM-BFF 21.0 23.7 27.1 76.4
DART 25.8 (+18.4) 30.1 (+14.6) 31.8 (+6.0) 77.8 (+2.8)
WiKi80 Fine-tuning 46.3 60.3 70.0 87.5
LM-BFF 66.5 73.5 78.1 86.2
DART 68.5 (+22.2) 75.2 (+14.9) 79.4 (+9.4) 88.1 (+0.6)
ChemProt Fine-tuning 30.2 41.5 52.5 79.5
LM-BFF 55.0 56.1 60.0 79.1
DART 57.2 (+27.0) 60.8 (+19.3) 63.1 (+10.6) 81.0 (+1.5)
Table 3: Results on the RE dataset Wiki80 (accuracy) and on the other datasets (micro F1). We use K = 8, 16, 32 (# examples per class); Full represents the full training set.
Method K=8 K=16 K=32 Full
Conventional FT 26.3 43.8 64.2 87.8
DART 51.8 67.2 77.3 89.1
  -fluency constraint object 50.3 66.1 76.0 88.2
  -differentiable template 49.8 66.3 76.2 88.4
  -differentiable label 47.5 62.5 73.7 87.8

Table 4: Ablation of DART with different components on SemEval. (FT= Fine tuning)

For the classification tasks with a complex label space, as shown in Table 3 and Figure 2(a), we observe that DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on the relation extraction and event extraction datasets, in both the few-shot and fully supervised settings. The proposed approach achieves an absolute improvement of 2.8% on the TACRED-Revisit dataset with full supervision and yields 18.4% gains with only 8 training samples per class. These findings also indicate that more relevant templates and labels can be determined without expert intervention, making it possible to generalize the proposed approach to other domains. Furthermore, we notice that the improvement decays slowly as K becomes larger. Our approach is a simple yet effective fine-tuning paradigm that does not require prompt engineering for complex label spaces, making it an appropriate plug-in for some SOTA models.

(a) Event extraction results on ACE-2005.
(b) BERT-large & GPT-2-medium results on SemEval.
Figure 2: (a) Few-shot results on ACE-2005 with BERT, using K = 4, 8, 16, and 32 (# examples per class). (FT = fine-tuning) (b) BERT-large vs. GPT-2-medium results on SemEval. For lower K, our method consistently outperforms conventional fine-tuning.

5.4 Ablation Study

We conduct an ablation study to validate the effectiveness of the components in the proposed approach. As shown in Table 4, DART exhibits a performance decay in the absence of any one of the modules, i.e., the fluency constraint objective, the differentiable template, or the differentiable label, demonstrating that all the modules are advantageous. Furthermore, we notice that differentiable label optimization has the largest impact on performance and is highly beneficial for DART, especially in low-resource settings. Since the proposed approach is the first to utilize differentiable label optimization, these findings illustrate that a suitable label token is important.

5.5 Analysis and Discussion

Can DART Be Applied to Other Pre-trained LMs?

To evaluate whether the proposed approach can be applied to other LMs, we conduct experiments using GPT-2-medium. From Figure 2(b), we observe that DART with GPT-2-medium yields better performance than the conventional fine-tuning approach. Furthermore, we notice that DART with GPT-2-medium can achieve performance on par with BERT-large, as observed by Liu et al. (2021c), indicating that the potential of GPT-style architectures for natural language understanding has been underestimated.

What Exactly Is the Optimized Prompt?

Figure 3: A 3D visualization of several label representations learned in DART on WiKi80 dataset with t-SNE and normalization.

Since the prompt templates and label tokens in the proposed approach are mapped to continuous embeddings $\{h_1, \ldots, h_m, \ldots\}$, we further analyze what exactly the optimized labels have learned. We conduct a nearest-neighbor vocabulary embedding search to project the top-3 optimized pseudo-label tokens onto readable natural language, and we use t-SNE Van der Maaten and Hinton (2008) with normalization to visualize the labels on the Wiki80 dataset. For example, the pseudo label shown in red in Figure 3 represents a relation type learned by optimizing the pseudo label in the continuous space, and the tokens displayed next to it are those closest to the learned label in the vocabulary. This finding indicates that the optimized label embeddings have better semantic representation ability.
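The nearest-neighbor inspection can be reproduced with a few lines, continuing the earlier snippets; cosine similarity over the MLM's word-embedding table is one natural choice of metric, and the top-3 cutoff follows the text.

```python
# Project a learned pseudo-label embedding back to readable vocabulary tokens.
import torch
import torch.nn.functional as F

def nearest_label_words(class_index, top_k=3):
    emb = word_emb.weight.detach()                            # (|V|, d)
    pseudo = emb[label_token_ids[class_index]].unsqueeze(0)   # optimized label embedding
    sims = F.cosine_similarity(pseudo, emb, dim=-1)           # similarity to every vocab entry
    sims[label_token_ids] = float("-inf")                     # skip the pseudo tokens themselves
    top = torch.topk(sims, top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top)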

DART vs. Conventional Fine-tuning

The ability of the proposed approach to perform few-shot learning can be attributed to the fact that predicting the label token is a true language understanding task: once the model is capable of performing it correctly, it can easily apply this knowledge to other tasks that are framed the same way. Specifically, (i) DART does not optimize any new parameters, whereas conventional fine-tuning must learn an explicit classifier head over the [CLS] embedding, which may fail in the low-data regime; (ii) DART has the same task setting as large-scale language model pre-training and has a small theoretical upper bound for downstream classification tasks Saunshi et al. (2021).

Limitations

Our approach may fail when the distribution of the task corpus differs from that of the pre-training corpus; for example, a general pre-trained language model may need to be fine-tuned with more training instances in a specific domain (e.g., the medical domain). This issue can be addressed by intermediate training Phang et al. (2018); Yin et al. (2020); Zhao et al. (2021) and will be analyzed in future work. Besides, our approach also shows an instability with respect to hyper-parameters, a volatility of few-shot learning in NLP also observed by Dodge et al. (2020); Zhang et al. (2021); Perez et al. (2021). Overall, however, we believe our work will inspire future research on few-shot settings with more practical applications in low-data scenarios, e.g., those involving low-resource languages or expensive expert annotation.

6 Conclusion and Future Work

This paper presents DART, a simple-yet-effective fine-tuning approach that improves the few-shot learning ability of pre-trained language models. The proposed approach produces satisfactory improvements in few-shot scenarios compared with conventional fine-tuning approaches. The proposed method is also pluggable for other language models and can be extended to other tasks, such as intent detection. Intuitively, the results obtained in this study suggest two future research directions for few-shot learning in NLP: (i) extending the proposed approach to a semi-supervised setting to further leverage unlabeled data; (ii) extending the proposed approach to few-shot lifelong learning, where prompts must be optimized adaptively across tasks.

Broader Impact

The pre-train-fine-tune approach has become the standard for natural language processing (NLP). However, supervised fine-tuning still depends heavily on labeled data in practice. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. We believe that our study makes a significant contribution to the literature: determining appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets; these issues are overcome by the proposed method, which is model-agnostic, parameter-efficient, and independent of prompt engineering. We experimentally verified the proposed approach on 15 standard NLP tasks and observed that it outperforms conventional fine-tuning baselines.

References

  • [1] C. Alt, A. Gabryszak, and L. Hennig (2020) TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 1558–1569. External Links: Link, Document Cited by: §5.1.
  • [2] T. Bansal, R. Jha, and A. McCallum (2020) Learning to few-shot learn across diverse natural language classification tasks. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, D. Scott, N. Bel, and C. Zong (Eds.), pp. 5108–5123. External Links: Link, Document Cited by: §2.
  • [3] H. Bao, L. Dong, F. Wei, W. Wang, N. Yang, X. Liu, Y. Wang, J. Gao, S. Piao, M. Zhou, and H. Hon (2020) UniLMv2: pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 642–652. External Links: Link Cited by: §1.
  • [4] Y. Bao, M. Wu, S. Chang, and R. Barzilay (2020) Few-shot text classification with distributional signatures. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §2.
  • [5] I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 3613–3618. External Links: Link, Document Cited by: §5.2.
  • [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §1, §2.
  • [7] X. Chen, N. Zhang, X. Xie, S. Deng, Y. Yao, C. Tan, F. Huang, L. Si, and H. Chen (2021) Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. arXiv preprint arXiv:2104.07650. Cited by: §2.
  • [8] S. Deng, N. Zhang, J. Kang, Y. Zhang, W. Zhang, and H. Chen (2020) Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, J. Caverlee, X. (. Hu, M. Lalmas, and W. Wang (Eds.), pp. 151–159. External Links: Link, Document Cited by: §2.
  • [9] S. Deng, N. Zhang, Z. Sun, J. Chen, and H. Chen (2020) When low resource NLP meets unsupervised language model: meta-pretraining then meta-learning for few-shot text classification (student abstract). In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 13773–13774. External Links: Link Cited by: §2.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link, Document Cited by: §1, §5.2.
  • [11] N. Ding, Y. Chen, X. Han, G. Xu, P. Xie, H. Zheng, Z. Liu, J. Li, and H. Kim (2021) Prompt-learning for fine-grained entity typing. arXiv preprint arXiv:2108.10604. Cited by: §2.
  • [12] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. A. Smith (2020) Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. CoRR abs/2002.06305. External Links: Link, 2002.06305 Cited by: §1, §5.
  • [13] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 13042–13054. External Links: Link Cited by: §1.
  • [14] T. Gao, A. Fisch, and D. Chen (2020) Making pre-trained language models better few-shot learners. CoRR abs/2012.15723. External Links: Link, 2012.15723 Cited by: §1, §2, §4.1, §4.3, §5.2, Table 2.
  • [15] K. Hambardzumyan, H. Khachatrian, and J. May (2021) WARP: word-level adversarial reprogramming. CoRR abs/2101.00121. External Links: Link, 2101.00121 Cited by: §2, Table 1.
  • [16] X. Han, T. Gao, Y. Yao, D. Ye, Z. Liu, and M. Sun (2019) OpenNRE: an open and extensible toolkit for neural relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019 - System Demonstrations, S. Padó and R. Huang (Eds.), pp. 169–174. External Links: Link, Document Cited by: §5.1.
  • [17] X. Han, W. Zhao, N. Ding, Z. Liu, and M. Sun (2021) PTR: prompt tuning with rules for text classification. CoRR abs/2105.11259. External Links: Link, 2105.11259 Cited by: §2.
  • [18] I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó. Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2010) SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval@ACL 2010, Uppsala University, Uppsala, Sweden, July 15-16, 2010, K. Erk and C. Strapparava (Eds.), pp. 33–38. External Links: Link Cited by: §5.1.
  • [19] S. Hu, N. Ding, H. Wang, Z. Liu, J. Li, and M. Sun (2021) Knowledgeable prompt-tuning: incorporating knowledge into prompt verbalizer for text classification. CoRR abs/2108.02035. External Links: Link, 2108.02035 Cited by: §2.
  • [20] J. Kringelum, S. K. Kjærulff, S. Brunak, O. Lund, T. I. Oprea, and O. Taboureau (2016) ChemProt-3.0: a global chemical biology diseases mapping. Database J. Biol. Databases Curation 2016. External Links: Link, Document Cited by: §5.1.
  • [21] B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. CoRR abs/2104.08691. External Links: Link, 2104.08691 Cited by: §2.
  • [22] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 7871–7880. External Links: Link, Document Cited by: §1.
  • [23] X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. CoRR abs/2101.00190. External Links: Link, 2101.00190 Cited by: Table 1.
  • [24] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen (2021) What makes good in-context examples for gpt-3?. CoRR abs/2101.06804. External Links: Link, 2101.06804 Cited by: §2.
  • [25] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig (2021) Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR abs/2107.13586. External Links: Link, 2107.13586 Cited by: §2.
  • [26] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang (2021) GPT understands, too. CoRR abs/2103.10385. External Links: Link, 2103.10385 Cited by: §1, §2, §4.2, §4.4, Table 1, §5.5.
  • [27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, §5.2.
  • [28] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp (2021) Fantastically ordered prompts and where to find them: overcoming few-shot prompt order sensitivity. CoRR abs/2104.08786. External Links: Link, 2104.08786 Cited by: §1.
  • [29] S. Min, M. Lewis, H. Hajishirzi, and L. Zettlemoyer (2021) Noisy channel language model prompting for few-shot text classification. CoRR abs/2108.04106. External Links: Link, 2108.04106 Cited by: §2.
  • [30] T. Miyato, A. M. Dai, and I. J. Goodfellow (2017) Adversarial training methods for semi-supervised text classification. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.
  • [31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §5.2.
  • [32] E. Perez, D. Kiela, and K. Cho (2021) True few-shot learning with language models. arXiv preprint arXiv:2105.11447. Cited by: §1, §5.
  • [33] J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. CoRR abs/1811.01088. External Links: Link, 1811.01088 Cited by: §2, §5.
  • [34] G. Qin and J. Eisner (2021) Learning how to ask: querying lms with mixtures of soft prompts. CoRR abs/2104.06599. External Links: Link, 2104.06599 Cited by: §2.
  • [35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. External Links: Link Cited by: §2.
  • [36] N. Saunshi, S. Malladi, and S. Arora (2021) A mathematical exploration of why language models help solve downstream tasks. In International Conference on Learning Representations, Cited by: §5.5.
  • [37] T. L. Scao and A. M. Rush (2021) How many data points is a prompt worth?. CoRR abs/2103.08493. External Links: Link, 2103.08493 Cited by: §1, §4.1.
  • [38] T. Schick and H. Schütze (2020) It’s not just size that matters: small language models are also few-shot learners. CoRR abs/2009.07118. External Links: Link, 2009.07118 Cited by: §1, §2.
  • [39] T. Schick and H. Schütze (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), pp. 255–269. External Links: Link Cited by: §1, §2.
  • [40] T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, and S. Singh (2020) AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 4222–4235. External Links: Link, Document Cited by: §2.
  • [41] L. B. Soares, N. FitzGerald, J. Ling, and T. Kwiatkowski (2019) Matching the blanks: distributional similarity for relation learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 2895–2905. External Links: Link, Document Cited by: §5.2.
  • [42] D. Tam, R. R. Menon, M. Bansal, S. Srivastava, and C. Raffel (2021) Improving and simplifying pattern exploiting training. CoRR abs/2103.11955. External Links: Link, 2103.11955 Cited by: §2, §4.4, Table 1.
  • [43] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: §5.5.
  • [44] S. Wang, H. Fang, M. Khabsa, H. Mao, and H. Ma (2021) Entailment as few-shot learner. CoRR abs/2104.14690. External Links: Link, 2104.14690 Cited by: §2.
  • [45] Q. Xie, Z. Dai, E. H. Hovy, T. Luong, and Q. Le (2020) Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §2.
  • [46] W. Yin, N. F. Rajani, D. R. Radev, R. Socher, and C. Xiong (2020) Universal natural language processing with limited annotations: try few-shot textual entailment as a start. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 8229–8239. External Links: Link, Document Cited by: §2, §5.
  • [47] H. Yu, N. Zhang, S. Deng, H. Ye, W. Zhang, and H. Chen (2020) Bridging text and knowledge with multi-prototype embedding for few-shot relational triple extraction. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, D. Scott, N. Bel, and C. Zong (Eds.), pp. 6399–6410. External Links: Link, Document Cited by: §2.
  • [48] M. Yu, X. Guo, J. Yi, S. Chang, S. Potdar, Y. Cheng, G. Tesauro, H. Wang, and B. Zhou (2018) Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 1206–1215. External Links: Link, Document Cited by: §2.
  • [49] N. Zhang, S. Deng, Z. Sun, J. Chen, W. Zhang, and H. Chen (2020) Relation adversarial network for low resource knowledge graph completion. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, Y. Huang, I. King, T. Liu, and M. van Steen (Eds.), pp. 1–12. External Links: Link, Document Cited by: §2.
  • [50] T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi (2021) Revisiting few-sample {bert} fine-tuning. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
  • [51] T. Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh (2021) Calibrate before use: improving few-shot performance of language models. CoRR abs/2102.09690. External Links: Link, 2102.09690 Cited by: §1, §2, §5.
  • [52] Z. Zhong, D. Friedman, and D. Chen (2021) Factual probing is[mask]: learning vs. learning to recall. In North American Association for Computational Linguistics (NAACL), Cited by: §2.