Focused Transformer: Contrastive Training for Context Scaling

07/06/2023
by Szymon Tworkowski, et al.

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of this approach is often constrained by the limited effective context length. One solution is to give an attention layer access to an external memory of (key, value) pairs. Yet, as the number of documents grows, the proportion of relevant keys to irrelevant ones shrinks, leading the model to attend to irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values may overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This approach enhances the structure of the (key, value) space, enabling an extension of the context length, and allows pre-existing large-scale models to be fine-tuned for a longer effective context. We demonstrate this by fine-tuning 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, show improvements on tasks requiring a long context; in particular, they handle a 256k context length for passkey retrieval.
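To make the memory-attention idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: an attention layer that attends jointly over its local (causal) context and the top-k (key, value) pairs retrieved from an external memory. All names here (MemoryAttention, mem_keys, mem_values, top_k) are illustrative assumptions rather than identifiers from the paper or the LongLLaMA code, and FoT's contrastive-style training procedure itself is only noted in the comments.

```python
# Sketch of memory-augmented attention in the spirit of FoT (assumed names,
# not the paper's code). mem_keys / mem_values are (key, value) pairs cached
# from previously processed context, assumed to live in the layer's key/value
# space. FoT's contrastive-style training additionally exposes such a layer
# to keys from unrelated documents so that relevant keys become easier to
# distinguish; that training loop is not shown here.

import torch
from torch import nn


class MemoryAttention(nn.Module):
    def __init__(self, d_model: int, top_k: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k
        self.scale = d_model ** -0.5

    def forward(self, x, mem_keys=None, mem_values=None):
        # x: (batch, seq, d_model); mem_keys, mem_values: (n_mem, d_model)
        B, S, _ = x.shape
        q = self.q_proj(x) * self.scale
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Local causal attention scores: (B, S, S)
        local_scores = q @ k.transpose(1, 2)
        causal = torch.triu(
            torch.ones(S, S, dtype=torch.bool, device=x.device), diagonal=1
        )
        local_scores = local_scores.masked_fill(causal, float("-inf"))

        if mem_keys is None:
            return local_scores.softmax(dim=-1) @ v

        # Per-query top-k retrieval from the external memory: (B, S, n_mem)
        mem_scores = q @ mem_keys.t()
        top_scores, top_idx = mem_scores.topk(self.top_k, dim=-1)  # (B, S, K)
        v_mem = mem_values[top_idx]                                # (B, S, K, D)

        # One softmax over local keys plus retrieved memory keys, so relevant
        # memory entries must compete with (and stand out from) the rest.
        scores = torch.cat([local_scores, top_scores], dim=-1)     # (B, S, S+K)
        attn = scores.softmax(dim=-1)
        out_local = attn[..., :S] @ v                              # (B, S, D)
        out_mem = (attn[..., S:].unsqueeze(-1) * v_mem).sum(dim=2) # (B, S, D)
        return out_local + out_mem


# Usage: a toy memory of 1024 entries, queried by a 16-token context.
layer = MemoryAttention(d_model=64, top_k=8)
x = torch.randn(2, 16, 64)
memory_k, memory_v = torch.randn(1024, 64), torch.randn(1024, 64)
print(layer(x, memory_k, memory_v).shape)  # torch.Size([2, 16, 64])
```

The single softmax over local and memory keys is deliberate in this sketch: it is exactly where the distraction issue arises, since irrelevant memory keys compete with relevant ones for attention mass, and it is this competition that the paper's contrastive-style training is meant to make easier to resolve.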


Related research

09/21/2023
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
We present LongLoRA, an efficient fine-tuning approach that extends the ...

08/31/2023
YaRN: Efficient Context Window Extension of Large Language Models
Rotary Position Embeddings (RoPE) have been shown to effectively encode ...

12/21/2022
Prompt-Augmented Linear Probing: Scaling Beyond The Limit of Few-shot In-Context Learners
Through in-context learning (ICL), large-scale language models are effec...

04/01/2022
Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning
Pre-trained Language Models (PLMs) have achieved remarkable performance ...

01/27/2021
CNN with large memory layers
This work is centred around the recently proposed product key memory str...

05/21/2018
A Simple Cache Model for Image Recognition
Training large-scale image recognition models is computationally expensi...

05/25/2023
Landmark Attention: Random-Access Infinite Context Length for Transformers
While transformers have shown remarkable success in natural language pro...
