Representational Strengths and Limitations of Transformers

06/05/2023
by Clayton Sanford, et al.

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.
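To make the two tasks in the abstract concrete, the sketch below gives a rough, brute-force statement of them in NumPy. It loosely follows how the tasks are usually described for this paper; the function names, the modulus `M`, and the toy sizes are illustrative choices, and the precise definitions (including precision and scaling assumptions) should be taken from the full text.

```python
import numpy as np

def sparse_average(values, index_sets):
    """Sparse averaging (sketch): token i must output the mean of the
    values at the q positions listed in its index set S_i.
    values:     (N, d) array of per-token vectors
    index_sets: (N, q) array of integer indices into [0, N)
    """
    return np.stack([values[S].mean(axis=0) for S in index_sets])

def triple_detection(x, M):
    """Triple detection (sketch): token i outputs 1 iff there exist j, k
    with x_i + x_j + x_k == 0 (mod M). Brute force, for illustration only."""
    N = len(x)
    out = np.zeros(N, dtype=int)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                if (x[i] + x[j] + x[k]) % M == 0:
                    out[i] = 1
    return out

# Tiny usage example (illustrative sizes, not the paper's).
rng = np.random.default_rng(0)
N, d, q, M = 8, 4, 2, 17
vals = rng.standard_normal((N, d))
sets = rng.integers(0, N, size=(N, q))
print(sparse_average(vals, sets).shape)          # (8, 4)
print(triple_detection(rng.integers(0, M, N), M))
```

The paper's claims concern how cheaply each architecture can represent these maps: transformers handle sparse averaging with complexity growing only logarithmically in the input size, while triple detection forces attention layers to scale linearly.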

