A Probabilistic Interpretation of Transformers

04/28/2022
by Alexander Shim, et al.

We propose a probabilistic interpretation of the exponential dot-product attention of transformers, and of contrastive learning, based on exponential families. The attention sublayer of a transformer is equivalent to a gradient ascent step on the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state theoretical limitations of our interpretation and of the Hopfield theory, and suggest directions for resolving them.
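As a sketch of the identity the abstract refers to (our notation, not the paper's: a query q and keys k_1, ..., k_n are assumed, with values taken equal to keys and any temperature absorbed into the vectors), the log normalizer of the induced exponential family is

\[ \log Z(q) = \log \sum_{j=1}^{n} \exp\!\left(q^\top k_j\right), \]

and its gradient with respect to the query is the softmax-weighted combination of keys,

\[ \nabla_q \log Z(q) = \sum_{j=1}^{n} \frac{\exp\!\left(q^\top k_j\right)}{\sum_{l=1}^{n} \exp\!\left(q^\top k_l\right)} \, k_j, \]

so a gradient ascent step \( q \leftarrow q + \eta \, \nabla_q \log Z(q) \) reproduces, up to the step size \(\eta\) and the value projection, the exponential dot-product attention update that the log-sum-exp term plays in the Hopfield view of attention.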

