SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

03/18/2023
by Vithursan Thangarasa, et al.

The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of training directly on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3 billion parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.
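To make the idea concrete, here is a minimal sketch (not the authors' implementation) of how unstructured weight sparsity can be imposed during pre-training via a fixed binary mask on a linear layer and then lifted for dense fine-tuning so the zeroed weights are free to learn. The `MaskedLinear` class, the `sparsity` argument, and the `sparse` toggle are illustrative names introduced here, not part of the paper or any library.

```python
# Minimal sketch of sparse pre-training / dense fine-tuning with a fixed
# random unstructured mask. Names are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Linear):
    """Linear layer whose weight matrix is held at a fixed unstructured sparsity."""

    def __init__(self, in_features, out_features, sparsity=0.75):
        super().__init__(in_features, out_features)
        # Randomly keep a (1 - sparsity) fraction of the weights; zero the rest.
        mask = (torch.rand_like(self.weight) > sparsity).float()
        self.register_buffer("mask", mask)
        with torch.no_grad():
            self.weight.mul_(mask)   # start the masked weights at exactly zero
        self.sparse = True           # set to False for dense fine-tuning

    def forward(self, x):
        # While sparse, masked weights contribute nothing and receive zero
        # gradient, so only the unmasked subset of weights is trained.
        weight = self.weight * self.mask if self.sparse else self.weight
        return F.linear(x, weight, self.bias)


# Sparse pre-training: only ~25% of this layer's weights are updated.
layer = MaskedLinear(1024, 1024, sparsity=0.75)

# Dense fine-tuning: drop the mask so the previously zeroed weights can learn,
# recovering the layer's full representational capacity.
layer.sparse = False
```

Note that simulating sparsity with masked dense matrices, as above, does not by itself reduce compute; realizing the reported pre-training FLOP savings requires hardware or kernels that actually exploit unstructured weight sparsity.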


