A Probabilistic Interpretation of Transformers

04/28/2022
by Alexander Shim, et al.

We propose a probabilistic interpretation, based on exponential families, of the exponential dot-product attention of transformers and of contrastive learning. The attention sublayer of a transformer is equivalent to a gradient ascent step on the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state theoretical limitations of our theory and of the Hopfield theory, and suggest directions for their resolution.
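
To make the central claim concrete, here is a minimal sketch of the underlying identity, stated in notation borrowed from the Hopfield attention literature rather than from the paper itself. Collect stored patterns $x_1, \dots, x_N$ as the columns of $X$, let $\xi$ be the query state, and let $\beta$ be an inverse temperature. The log normalizer is the log-sum-exp term

\[ \mathrm{lse}(\beta, X^{\top}\xi) = \beta^{-1} \log \sum_{i=1}^{N} \exp\big(\beta\, x_i^{\top}\xi\big), \]

and its gradient with respect to $\xi$ is the softmax-weighted average of the stored patterns,

\[ \nabla_{\xi}\, \mathrm{lse}(\beta, X^{\top}\xi) = \sum_{i=1}^{N} \frac{\exp(\beta\, x_i^{\top}\xi)}{\sum_{j} \exp(\beta\, x_j^{\top}\xi)}\, x_i = X\, \mathrm{softmax}(\beta X^{\top}\xi), \]

which is exactly the exponential dot-product attention output, so one ascent step on the log normalizer reproduces the attention update. The identity can be checked numerically; the following sketch (shapes and constants are our illustrative choices, not the paper's) compares a finite-difference gradient of the log normalizer against the attention output:

import numpy as np

def lse(beta, scores):
    # Scaled log normalizer: (1/beta) * log(sum_i exp(beta * score_i)).
    return np.log(np.exp(beta * scores).sum()) / beta

rng = np.random.default_rng(0)
d, N, beta = 4, 6, 0.5
X = rng.normal(size=(d, N))   # stored patterns x_1..x_N as columns
xi = rng.normal(size=d)       # query state

# Attention output: softmax-weighted average of the stored patterns.
p = np.exp(beta * X.T @ xi)
p /= p.sum()
attention_out = X @ p

# Central-difference gradient of the log normalizer with respect to xi.
eps = 1e-6
grad = np.array([
    (lse(beta, X.T @ (xi + eps * e)) - lse(beta, X.T @ (xi - eps * e))) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(grad, attention_out))  # True: the gradient is the attention output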

Related research

06/30/2021
On the Power of Saturated Transformers: A View from Circuit Complexity
Transformers have become a standard architecture for many NLP problems. ...

09/14/2021
Sum-Product-Attention Networks: Leveraging Self-Attention in Probabilistic Circuits
Probabilistic circuits (PCs) have become the de-facto standard for learn...

05/17/2021
Pay Attention to MLPs
Transformers have become one of the most important architectural innovat...

05/19/2020
Normalized Attention Without Probability Cage
Attention architectures are widely used; they recently gained renewed po...

06/23/2021
Probabilistic Attention for Interactive Segmentation
We provide a probabilistic interpretation of attention and show that the...

11/06/2020
Extending Equational Monadic Reasoning with Monad Transformers
There is a recent interest for the verification of monadic programs usin...

08/17/2021
Investigating transformers in the decomposition of polygonal shapes as point collections
Transformers can generate predictions in two approaches: 1. auto-regress...