Shatter: An Efficient Transformer Encoder with Single-Headed Self-Attention and Relative Sequence Partitioning

08/30/2021
by Ran Tian, et al.

The highly popular Transformer architecture, based on self-attention, is the foundation of large pretrained models such as BERT, which have become an enduring paradigm in NLP. While powerful, the computational resources and time required to pretrain such models can be prohibitive. In this work, we present an alternative self-attention architecture, Shatter, that more efficiently encodes sequence information by softly partitioning the space of relative positions and applying different value matrices to different parts of the sequence. This mechanism further allows us to simplify the multi-headed attention in Transformer to single-headed. We conduct extensive experiments showing that Shatter achieves better performance than BERT, with pretraining being faster per step (15%) and offering considerable memory savings (>50%). Shatter can be pretrained on 8 V100 GPUs in 7 days and match the performance of BERT_Base, making the cost of pretraining much more affordable.
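
To make the idea concrete, below is a minimal sketch (in NumPy, not the authors' code) of a single-headed attention layer whose value projection is softly partitioned over relative positions. The names and shapes here (shatter_like_attention, Wv_parts, partition_logits) are illustrative assumptions, not the paper's actual parameterization.

    # Sketch of single-headed self-attention with value matrices mixed by a
    # soft partition of relative positions. Illustrative only.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def shatter_like_attention(x, Wq, Wk, Wv_parts, partition_logits):
        """
        x:                (T, d)      token representations
        Wq, Wk:           (d, d)      single-head query/key projections
        Wv_parts:         (P, d, d)   one value matrix per soft partition
        partition_logits: (2T-1, P)   score of each relative position for each partition
        """
        T, d = x.shape

        q = x @ Wq                                            # (T, d)
        k = x @ Wk                                            # (T, d)
        attn = softmax(q @ k.T / np.sqrt(d), axis=-1)         # (T, T) single-headed attention

        # Soft partition weights for every relative position j - i in [-(T-1), T-1].
        rel = np.arange(T)[None, :] - np.arange(T)[:, None] + (T - 1)   # (T, T) index
        part_w = softmax(partition_logits, axis=-1)[rel]                # (T, T, P)

        # Per-partition value projections of every token.
        v = np.einsum('td,pde->pte', x, Wv_parts)                       # (P, T, d)

        # Mix the value matrices according to the soft partition of the relative
        # position, then aggregate with the attention weights.
        return np.einsum('ij,ijp,pjd->id', attn, part_w, v)             # (T, d)

Intuitively, because the soft partition lets the single head apply different value matrices to different relative-position ranges, it takes over part of the role that multiple heads play in the standard Transformer, which is roughly the simplification the abstract describes.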


