Lifelong Language Pretraining with Distribution-Specialized Experts

05/20/2023
by Wuyang Chen, et al.

Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). However, adapting a pretrained model to new data distributions that target different downstream tasks poses significant challenges: naive fine-tuning may incur catastrophic forgetting, where the over-parameterized LM overfits the new data but fails to preserve its pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream over time, yet most prior work modifies only the training recipe and assumes a fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. We therefore propose Lifelong-MoE, an extensible Mixture-of-Experts (MoE) architecture that dynamically adds model capacity by introducing new experts, combined with regularized pretraining. Our results show that, by introducing only a limited number of extra experts while keeping the computation cost constant, the model can steadily adapt to data distribution shifts while preserving previously acquired knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
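To make the core idea concrete, below is a minimal, illustrative sketch of an expandable MoE layer in PyTorch. This is not the authors' implementation: the class and method names (LifelongMoELayer, Expert, add_experts, freeze_old_parameters) and the top-1 routing choice are assumptions based only on the abstract, and the specific regularization used during pretraining in the paper is not shown. The sketch illustrates how new experts can be appended when a new data distribution arrives, how previously trained experts can be frozen to help preserve old knowledge, and how per-token compute stays roughly constant because each token is still routed to a single expert.

```python
# Minimal, illustrative sketch of an expandable MoE layer (PyTorch).
# Assumption-labeled: names and structure are hypothetical, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard two-layer feed-forward expert."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.ff(x)


class LifelongMoELayer(nn.Module):
    """Top-1 gated MoE layer whose expert pool can grow over time.

    Top-1 routing keeps the per-token compute constant even as experts
    are added for new data distributions.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.d_model = d_model
        self.d_hidden = d_hidden
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def add_experts(self, num_new: int):
        """Grow the expert pool and extend the gate for a new distribution."""
        for _ in range(num_new):
            self.experts.append(Expert(self.d_model, self.d_hidden))
        old_gate = self.gate
        new_gate = nn.Linear(self.d_model, len(self.experts), bias=False)
        new_gate = new_gate.to(old_gate.weight.device)
        with torch.no_grad():
            # Carry over the routing weights for the existing experts.
            new_gate.weight[: old_gate.out_features] = old_gate.weight
        self.gate = new_gate

    def freeze_old_parameters(self, num_old_experts: int):
        """Freeze previously trained experts to help preserve old knowledge."""
        for expert in list(self.experts)[:num_old_experts]:
            for p in expert.parameters():
                p.requires_grad_(False)

    def forward(self, x):
        # x: (batch, d_model); route each token to its top-1 expert.
        logits = self.gate(x)                      # (batch, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_prob, top_idx = probs.max(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = LifelongMoELayer(d_model=16, d_hidden=64, num_experts=2)
    x = torch.randn(8, 16)
    y_old = layer(x)                       # train on the first distribution
    layer.freeze_old_parameters(num_old_experts=2)
    layer.add_experts(num_new=2)           # extra capacity for a new distribution
    y_new = layer(x)                       # same per-token cost: still top-1 routing
    print(y_old.shape, y_new.shape)
```

In a lifelong setting, one would call add_experts() when a new corpus arrives and freeze_old_parameters() to protect experts trained on earlier distributions; the paper additionally relies on regularized pretraining to limit forgetting, which this sketch omits.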


