Composable Function-preserving Expansions for Transformer Architectures

08/11/2023
by   Andrea Gesmundo, et al.

Training state-of-the-art neural networks requires a large amount of compute and time. Model scale is recognized as a critical factor in achieving and improving the state of the art. Increasing the scale of a neural network normally requires restarting from scratch and randomly initializing all of the model's parameters, since the change of architectural parameters does not allow a straightforward transfer of knowledge from smaller models. In this work, we propose six composable transformations that incrementally increase the size of transformer-based neural networks while preserving functionality, allowing the model's capacity to be expanded as needed. For each transformation, we provide a proof of exact function preservation under minimal initialization constraints. The proposed methods may enable efficient training pipelines for larger and more powerful models by progressively expanding the architecture throughout training.
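As a concrete illustration of what "function preserving" means here, below is a minimal NumPy sketch of one such expansion: widening the hidden layer of a transformer's feed-forward block, with the outgoing weights of the new units zero-initialized so the expanded model computes exactly the same function. The function names, initialization scale, and dimensions are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Standard two-layer feed-forward block: relu(x W1 + b1) W2 + b2
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def expand_ffn_width(W1, b1, W2, new_hidden):
    """Grow the hidden dimension of a feed-forward block while preserving its function.

    The incoming weights/biases of the new hidden units can be arbitrary
    (random here); their outgoing weights are zero-initialized, so the added
    units contribute nothing to the output until training updates them.
    """
    d_model, d_hidden = W1.shape
    assert new_hidden >= d_hidden
    extra = new_hidden - d_hidden
    W1_new = np.concatenate([W1, 0.02 * np.random.randn(d_model, extra)], axis=1)
    b1_new = np.concatenate([b1, np.zeros(extra)])
    W2_new = np.concatenate([W2, np.zeros((extra, W2.shape[1]))], axis=0)
    return W1_new, b1_new, W2_new

# Check exact function preservation on random inputs.
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W1 = rng.standard_normal((d_model, d_hidden))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_hidden, d_model))
b2 = rng.standard_normal(d_model)
x = rng.standard_normal((4, d_model))

W1e, b1e, W2e = expand_ffn_width(W1, b1, W2, new_hidden=64)
assert np.allclose(ffn(x, W1, b1, W2, b2), ffn(x, W1e, b1e, W2e, b2))
```

Composability in the paper's sense means expansions of this kind (e.g. widening, adding heads, adding layers) can be applied in sequence, each step keeping the network's outputs unchanged at the moment of expansion.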


