Trainable Transformer in Transformer

07/03/2023
by Abhishek Panigrahi, et al.

Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., a linear model or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TinT model with fewer than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants, and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe that TinT for an OPT-125M model improves performance by 4-16% absolute on average. These results suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TinT is included.
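To make the "internal fine-tuning during inference" idea concrete, the sketch below spells out the dynamic TinT is built to emulate: take an auxiliary language model, run a single explicit gradient step on the in-context portion of the input, then score the remaining tokens with the updated weights. TinT approximates this entire procedure inside one forward pass of a larger transformer; here the two stages are written out explicitly for clarity. The toy model, data, and hyperparameters are illustrative stand-ins, not the OPT-125M setup from the paper.

```python
# A minimal sketch, assuming a small PyTorch stand-in for the auxiliary model.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, CTX = 100, 32, 16

class TinyLM(nn.Module):
    """A toy causal language model standing in for the auxiliary model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.block = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        x = self.embed(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        x = self.block(x, src_mask=mask)
        return self.head(x)

def lm_loss(model, ids):
    """Next-token cross-entropy over a token sequence."""
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def one_step_internal_finetune(model, context_ids, query_ids, lr=1e-2):
    """Emulate the limited one-step budget: one SGD step on the in-context
    tokens, then evaluate the query under the updated auxiliary model."""
    tuned = copy.deepcopy(model)
    opt = torch.optim.SGD(tuned.parameters(), lr=lr)
    opt.zero_grad()
    lm_loss(tuned, context_ids).backward()
    opt.step()
    with torch.no_grad():
        return lm_loss(tuned, query_ids).item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyLM()
    context = torch.randint(0, VOCAB, (1, CTX))   # in-context demonstrations
    query = torch.randint(0, VOCAB, (1, CTX))     # tokens to be predicted
    with torch.no_grad():
        before = lm_loss(model, query).item()
    after = one_step_internal_finetune(model, context, query)
    print(f"query loss before: {before:.3f}  after one internal step: {after:.3f}")
```

The point of TinT is that this gradient step never happens as a separate backward pass: the simulating transformer's own attention and MLP layers are constructed to carry out (an approximation of) the forward computation, gradient, and weight update of the auxiliary model within a single inference call.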


