Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

05/03/2023
by   Cheng-Yu Hsieh, et al.

Deploying large language models (LLMs) is challenging because they are memory-inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling with LLM-generated labels. However, both finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for small models within a multi-task training framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of the available data on a benchmark task.
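The abstract describes a multi-task setup in which the small model is trained both to predict task labels and to generate LLM-extracted rationales. Below is a minimal sketch of how such an objective could look, assuming a T5-style student model and the Hugging Face transformers library; the task prefixes, the toy example, and the loss weight `lam` are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of a multi-task objective: the small model learns to predict
# labels and to generate LLM-provided rationales from the same input.
# Prefixes and the weight `lam` are hypothetical, for illustration only.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lam = 0.5  # hypothetical weight on the rationale-generation loss

def multitask_loss(question, label, rationale):
    """Sum of the label-prediction and rationale-generation losses."""
    losses = []
    for prefix, target in (("[label]", label), ("[rationale]", rationale)):
        inputs = tokenizer(f"{prefix} {question}", return_tensors="pt")
        targets = tokenizer(target, return_tensors="pt").input_ids
        out = model(**inputs, labels=targets)  # seq2seq cross-entropy
        losses.append(out.loss)
    return losses[0] + lam * losses[1]

# One illustrative training step on a toy example.
optimizer.zero_grad()
loss = multitask_loss(
    question="A coin lands heads 3 times in a row. Is the next flip fair?",
    label="yes",
    rationale="Each flip is independent, so prior outcomes do not change the odds.",
)
loss.backward()
optimizer.step()
```

In this kind of setup, the rationales act purely as extra training supervision: at inference time only the label-prediction prefix would be used, so no rationale generation is needed when the small model is deployed.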


