Transformers from an Optimization Perspective

05/27/2022
by Yongyi Yang, et al.

Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond to the Transformer forward pass? By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has frequently been adopted in the past to elucidate simpler deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms, like the Transformer, has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques that at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition about and understanding of Transformers, while potentially laying the groundwork for new model designs.
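
To make the "unfolding" viewpoint concrete, the sketch below unrolls update steps on a simple energy into attention-like layers. The energy used here is the well-known modern-Hopfield energy (Ramsauer et al., 2020), chosen purely as a familiar illustration of how minimizing an energy can reproduce softmax attention; it is an assumption for this example and is not the energy function constructed in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def energy(xi, X, beta=1.0):
    # Modern-Hopfield-style energy: -lse(beta, X xi) + 0.5 ||xi||^2.
    # Used only as an illustration; NOT the energy derived in this paper.
    lse = np.log(np.exp(beta * X @ xi).sum()) / beta
    return -lse + 0.5 * xi @ xi

def attention_step(xi, X, beta=1.0):
    # One update step on the energy above: a softmax attention read-out
    # over the stored patterns X (rows = patterns), which does not
    # increase the energy.
    return X.T @ softmax(beta * X @ xi)

# "Unfolding": run K update steps and treat each one as a layer of a
# forward pass; the printed energy should be non-increasing across layers.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))   # 8 stored patterns of dimension 16
xi = rng.normal(size=16)       # query / state vector

for k in range(5):
    print(f"layer {k}: energy = {energy(xi, X):.4f}")
    xi = attention_step(xi, X)
```

In this toy setting each "layer" is literally one descent-style step on an explicit energy, which is the kind of correspondence the paper seeks for full Transformer blocks.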

Related research

07/26/2023 | Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?
Existing analyses of the expressive capacity of Transformer models have ...

11/20/2022 | Convexifying Transformers: Improving optimization and understanding of transformer networks
Understanding the fundamental mechanism behind the success of transforme...

03/01/2023 | Are More Layers Beneficial to Graph Transformers?
Despite that going deep has proven successful in many neural architectur...

12/10/2021 | Human Interpretation and Exploitation of Self-attention Patterns in Transformers: A Case Study in Extractive Summarization
The transformer multi-head self-attention mechanism has been thoroughly ...

10/04/2021 | VTAMIQ: Transformers for Attention Modulated Image Quality Assessment
Following the major successes of self-attention and Transformers for ima...

09/11/2023 | Uncovering mesa-optimization algorithms in Transformers
Transformers have become the dominant model in deep learning, but the re...

06/01/2023 | White-Box Transformers via Sparse Rate Reduction
In this paper, we contend that the objective of representation learning ...
