Meta-Learning Fast Weight Language Models

12/05/2022
by Kevin Clark, et al.

Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3x more compute than standard inference. We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time so the model learns to make good use of gradient updates. FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and significantly improve language modeling perplexity.
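The core mechanism described in the abstract, replacing per-token gradient updates with a linear-attention fast weight update layered on top of a transformer, can be illustrated with a small sketch. The module below is a hypothetical, simplified illustration rather than the paper's exact architecture: the class name FastWeightLayer, the ELU+1 feature map, and the dimensions d_model/d_fast are assumptions chosen for clarity.

```python
# Minimal sketch of a fast-weight layer in the spirit of FWLs.
# Assumes a PyTorch-style stack; names and details are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastWeightLayer(nn.Module):
    """Accumulates outer products of (key, value) features from previous
    tokens into a per-sequence fast weight matrix, then reads it out with a
    query -- linear attention standing in for a per-token gradient update."""

    def __init__(self, d_model: int, d_fast: int):
        super().__init__()
        self.to_q = nn.Linear(d_model, d_fast)
        self.to_k = nn.Linear(d_model, d_fast)
        self.to_v = nn.Linear(d_model, d_fast)
        self.out = nn.Linear(d_fast, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) hidden states from a base transformer.
        q = F.elu(self.to_q(h)) + 1  # positive feature map, as in linear attention
        k = F.elu(self.to_k(h)) + 1
        v = self.to_v(h)

        batch, seq_len, d_fast = q.shape
        fast_w = h.new_zeros(batch, d_fast, d_fast)  # fast weights start at zero
        outputs = []
        # Sequential loop kept for clarity; the same recurrence can be
        # computed in parallel as causal linear attention.
        for t in range(seq_len):
            # Read the current token out against weights built from past tokens only.
            outputs.append(torch.einsum("bd,bde->be", q[:, t], fast_w))
            # Rank-1 "gradient-like" update from the current token's key/value.
            fast_w = fast_w + torch.einsum("bd,be->bde", k[:, t], v[:, t])
        y = torch.stack(outputs, dim=1)
        return h + self.out(y)  # residual connection onto the base model
```

Because a layer like this runs during training as well as at test time, the model can learn to exploit the fast weight updates, which is the advantage over dynamic evaluation claimed in the abstract; the per-token cost here is a single rank-1 update rather than a full backward pass.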


Related research

Reconsidering the Past: Optimizing Hidden States in Language Models (12/16/2021)
We present Hidden-State Optimization (HSO), a gradient-based method for ...

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute (02/24/2021)
Large language models have become increasingly difficult to train becaus...

Training Compute-Optimal Large Language Models (03/29/2022)
We investigate the optimal model size and number of tokens for training ...

Dynamic Evaluation of Transformer Language Models (04/17/2019)
This research note combines two methods that have recently improved the ...

Memorizing Transformers (03/16/2022)
Language models typically need to be trained or finetuned in order to ac...

Learning Associative Inference Using Fast Weight Memory (11/16/2020)
Humans can quickly associate stimuli to solve problems in novel contexts...

Iterative Forward Tuning Boosts In-context Learning in Language Models (05/22/2023)
Large language models (LLMs) have exhibited an emergent in-context learn...
