Attention-Only Transformers and Implementing MLPs with Attention Heads

09/15/2023
by Robert Huben et al.

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
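A quick way to see the main construction: a softmax over the two logits (x, 0) yields the weights (sigmoid(x), 1 - sigmoid(x)), so an attention head that attends over the current position and a fixed zero "bias" position, using a neuron's pre-activation x as both the score and the value, outputs sigmoid(x) * x = SiLU(x) with internal dimension 1. The sketch below is a minimal numerical illustration of that identity, not the paper's exact construction; the two-position setup and all names are assumptions for illustration.

```python
import numpy as np

def silu(x):
    # Reference SiLU: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def silu_via_attention_head(x):
    # Toy two-position attention head with internal (value) dimension 1.
    # Position 0 holds the neuron's pre-activation x; position 1 is a
    # fixed zero "bias" position (an illustrative assumption).
    scores = np.array([x, 0.0])                      # attention logits
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> (sigmoid(x), 1 - sigmoid(x))
    values = np.array([x, 0.0])                      # scalar values: internal dimension 1
    return weights @ values                          # sigmoid(x) * x = SiLU(x)

for x in (-2.0, 0.5, 3.0):
    assert np.isclose(silu(x), silu_via_attention_head(x))
```

Stacking one such head per MLP neuron, with masking so each head sees only its own position and the bias position, is what drives the large increase in head count when converting an MLP-and-attention transformer into an attention-only one.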


