Alternating Updates for Efficient Transformers

01/30/2023
by Cenk Baykal, et al.

It is well established that increasing scale in deep transformer networks leads to improved quality and performance, but this increase in scale often comes with higher compute cost and inference latency. Consequently, methods that realize the benefits of increased scale without a corresponding increase in compute cost are important. We introduce Alternating Updates (AltUp), a simple-to-implement method that increases a model's capacity without a commensurate computational burden. AltUp widens the learned representation without increasing computation time by operating on a subblock of the representation at each layer. Our experiments on various transformer models and language tasks demonstrate the consistent effectiveness of alternating updates across a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension and show how AltUp can be combined synergistically with existing approaches, such as sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.
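
The core mechanism described above is to keep a widened token representation but route only one sub-block of it through the expensive transformer layer at a time, alternating which sub-block is active from layer to layer. Below is a minimal, hypothetical PyTorch sketch of that idea; the `AltUpBlock` wrapper, the `num_blocks` round-robin schedule, and the learned `mix` coefficients used to update the inactive sub-blocks are illustrative assumptions of this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class AltUpBlock(nn.Module):
    """Illustrative sketch: a widened representation updated one sub-block at a time."""

    def __init__(self, layer: nn.Module, num_blocks: int = 2):
        super().__init__()
        self.layer = layer              # the expensive d_model-wide transformer layer
        self.num_blocks = num_blocks
        # Cheap learned coefficients that propagate the computed update to the
        # sub-blocks that did not go through the layer (an assumption of this sketch).
        self.mix = nn.Parameter(torch.ones(num_blocks))

    def forward(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # x has shape (batch, seq, num_blocks * d_model); split into sub-blocks.
        blocks = list(x.chunk(self.num_blocks, dim=-1))
        active = layer_idx % self.num_blocks        # alternate the active sub-block
        update = self.layer(blocks[active]) - blocks[active]
        # The active sub-block receives the full update; inactive sub-blocks
        # receive only a cheaply scaled version of it.
        out = [b + self.mix[k] * update for k, b in enumerate(blocks)]
        out[active] = blocks[active] + update
        return torch.cat(out, dim=-1)


# Usage: the representation is 2x wider than the wrapped layer, but each layer
# still runs attention/FFN on only a d_model-sized slice of it.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
altup = AltUpBlock(layer, num_blocks=2)
x = torch.randn(4, 128, 2 * 512)
y = altup(x, layer_idx=0)               # shape: (4, 128, 1024)
```

Because only one d_model-wide sub-block passes through attention and the feed-forward network per layer, per-layer compute stays close to that of the unwidened model while the residual stream carries a wider state, which is the trade-off the abstract describes.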

Related research

- Efficient Language Modeling with Sparse all-MLP (03/14/2022): All-MLP architectures have attracted increasing interest as an alternati...
- The Power of External Memory in Increasing Predictive Model Capacity (01/31/2023): One way of introducing sparsity into deep networks is by attaching an ex...
- Deep Transformers with Latent Depth (09/28/2020): The Transformer model has achieved state-of-the-art performance in many ...
- Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity (05/03/2023): Mixture-of-experts (MoE) models that employ sparse activation have demon...
- Primer: Searching for Efficient Transformers for Language Modeling (09/17/2021): Large Transformer models have been central to recent advances in natural...
- Controlling Computation versus Quality for Neural Sequence Models (02/17/2020): Most neural networks utilize the same amount of compute for every exampl...
- Vertical, Temporal, and Horizontal Scaling of Hierarchical Hypersparse GraphBLAS Matrices (08/15/2021): Hypersparse matrices are a powerful enabler for a variety of network, he...