BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

02/07/2019
by Asa Cooper Stickland, et al.

Multi-task learning allows the sharing of useful information between multiple related tasks. In natural language processing, several recent approaches have successfully leveraged unsupervised pre-training on large amounts of data to perform well on various tasks, such as those in the GLUE benchmark. These results are based on fine-tuning on each task separately. We explore the multi-task learning setting for the recent BERT model on the GLUE benchmark, and how best to add task-specific parameters to a pre-trained BERT network, with a high degree of parameter sharing between tasks. We introduce new adaptation modules, PALs or 'projected attention layers', which use a low-dimensional multi-head attention mechanism, based on the idea that it is important to include layers with inductive biases useful for the input domain. By using PALs in parallel with BERT layers, we match the performance of fine-tuned BERT on the GLUE benchmark with roughly 7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.
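To make the architecture concrete, below is a minimal PyTorch sketch of the parallel-adapter idea described in the abstract; it is an illustration under stated assumptions, not the authors' released implementation. The hidden states are encoded into a low-dimensional space, multi-head self-attention is applied there, the result is decoded back to the BERT hidden size, and this task-specific output is added in parallel with the output of the shared BERT layer. Names such as ProjectedAttentionLayer, pal_size, and LayerWithPAL are hypothetical, and details such as where layer normalization sits and whether the projection matrices are shared across layers are omitted.

```python
import torch
import torch.nn as nn

class ProjectedAttentionLayer(nn.Module):
    """Sketch of a PAL: multi-head self-attention in a low-dimensional
    projected space. Dimensions below are illustrative defaults."""

    def __init__(self, hidden_size=768, pal_size=204, num_heads=12):
        super().__init__()
        self.encode = nn.Linear(hidden_size, pal_size)    # project down
        self.attn = nn.MultiheadAttention(pal_size, num_heads, batch_first=True)
        self.decode = nn.Linear(pal_size, hidden_size)    # project back up

    def forward(self, hidden_states):
        h = self.encode(hidden_states)        # (batch, seq, pal_size)
        h, _ = self.attn(h, h, h)             # low-dimensional self-attention
        return self.decode(h)                 # back to (batch, seq, hidden_size)


class LayerWithPAL(nn.Module):
    """Shared transformer layer plus a task-specific PAL added in parallel.
    `shared_layer` stands in for a BERT layer; a real HuggingFace BertLayer
    also takes attention masks and returns a tuple, which is omitted here."""

    def __init__(self, shared_layer, pal):
        super().__init__()
        self.shared_layer = shared_layer   # shared across all tasks
        self.pal = pal                     # small, task-specific

    def forward(self, hidden_states):
        return self.shared_layer(hidden_states) + self.pal(hidden_states)


# Quick shape check for the PAL alone.
pal = ProjectedAttentionLayer()
x = torch.randn(2, 16, 768)                # (batch, seq_len, hidden_size)
assert pal(x).shape == x.shape
```

The point of the low-dimensional attention, as the abstract emphasizes, is that the task-specific module keeps an inductive bias suited to the input domain while adding only a small number of parameters per task.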

Related research

09/19/2020 - Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data
Multi-Task Learning (MTL) has emerged as a promising approach for transf...

06/08/2021 - Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks
State-of-the-art parameter-efficient fine-tuning methods rely on introdu...

07/01/2019 - Pentagon at MEDIQA 2019: Multi-task Learning for Filtering and Re-ranking Answers using Language Inference and Question Entailment
Parallel deep learning architectures like fine-tuned BERT and MT-DNN, ha...

05/25/2021 - Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization
The Lottery Ticket Hypothesis suggests that an over-parametrized network...

10/31/2019 - DiaNet: BERT and Hierarchical Attention Multi-Task Learning of Fine-Grained Dialect
Prediction of language varieties and dialects is an important language p...

02/22/2021 - Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks
We study the problem of incorporating prior knowledge into a deep Transf...

04/29/2020 - Emerging Relation Network and Task Embedding for Multi-Task Regression Problems
Multi-task learning (MTL) provides state-of-the-art results in many appl...
