Brainformers: Trading Simplicity for Efficiency

05/29/2023
by Yanqi Zhou, et al.

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks with different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
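To make the architectural idea concrete, below is a minimal PyTorch sketch of a Brainformer-style block that mixes the layer primitives named in the abstract: a sparsely gated (mixture-of-experts) feed-forward layer, a dense feed-forward layer, a self-attention layer, and layer normalization. The class names (MoEFeedForward, BrainformerBlock), the top-1 routing, the specific layer ordering, and all hyperparameters are illustrative assumptions; the actual Brainformer layout is discovered by architecture search and is given in the paper, not here.

# Minimal sketch (PyTorch) of a Brainformer-style block: instead of the
# uniform attention -> dense-FFN alternation of a vanilla Transformer,
# the block interleaves several layer primitives, including a sparsely
# gated (top-1) mixture-of-experts feed-forward layer. The ordering and
# hyperparameters below are illustrative placeholders, not the layout
# discovered in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Sparsely gated feed-forward layer: each token is routed to one expert."""

    def __init__(self, d_model, d_ff, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        top_score, top_idx = scores.max(dim=-1)    # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = expert(x[mask]) * top_score[mask].unsqueeze(-1)
        return out


class BrainformerBlock(nn.Module):
    """One non-uniform block mixing attention, dense FFN, and sparse FFN sub-layers."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, num_experts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dense_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.moe_ffn = MoEFeedForward(d_model, d_ff, num_experts)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x):
        # Illustrative permutation: sparse FFN -> attention -> dense FFN,
        # each sub-layer with a pre-norm residual connection.
        x = x + self.moe_ffn(self.norms[0](x))
        h = self.norms[1](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.dense_ffn(self.norms[2](x))
        return x


if __name__ == "__main__":
    block = BrainformerBlock()
    tokens = torch.randn(2, 16, 512)               # (batch, seq, d_model)
    print(block(tokens).shape)                     # torch.Size([2, 16, 512])

The sketch only shows how a block can permute heterogeneous sub-layers; in the paper this permutation, together with capacity factors, gating mechanisms, and activation choices, is part of the search space rather than a fixed design.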

Related research

05/22/2023 · Parallel Attention and Feed-Forward Net Design for Pre-training and Inference on Transformers
In this paper, we introduce Parallel Attention and Feed-Forward Net Desi...

07/02/2019 · Augmenting Self-attention with Persistent Memory
Transformer networks have led to important progress in language modelin...

05/06/2021 · Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
The strong performance of vision transformers on image classification an...

05/28/2020 · When Can Self-Attention Be Replaced by Feed Forward Layers?
Recently, self-attention models such as Transformers have given competit...

03/02/2023 · Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention
Recently, a considerable number of studies in computer vision involves d...

03/28/2022 · Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Transformer-based language models (LMs) are at the core of modern NLP, b...

06/10/2021 · GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
Attention based language models have become a critical component in stat...
