Energy Transformer

by Benjamin Hoover, et al.

Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of its architectural building blocks remains rather limited. In contrast, Dense Associative Memory models, or Modern Hopfield Networks, have a well-established theoretical foundation but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET differs from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
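As a rough intuition for the energy-descent view described above, the sketch below treats each "layer" as one gradient-descent step on a scalar energy defined over the token representations. The energy here is a toy stand-in (an attention-like log-sum-exp term plus a Hopfield-style memory term), not the paper's actual ET energy; the memory matrix `M`, the step size, and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 4, 8, 16            # tokens, embedding dim, stored memories
X = rng.normal(size=(N, D))   # token representations
M = rng.normal(size=(K, D))   # hypothetical stored "memory" patterns

def energy(X):
    """Toy scalar energy over tokens: an attention-like log-sum-exp term
    plus a Hopfield-style memory term (illustrative, not the paper's form)."""
    scores = X @ X.T / np.sqrt(D)
    mask = ~np.eye(N, dtype=bool)
    att = -sum(np.log(np.exp(scores[i][mask[i]]).sum()) for i in range(N))
    hop = -0.5 * np.sum(np.maximum(X @ M.T, 0.0) ** 2)
    return att + hop

def num_grad(f, X, eps=1e-5):
    """Finite-difference gradient (slow, but dependency-free)."""
    g = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        g[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return g

alpha = 0.01                  # small step size for the descent
energies = [energy(X)]
for _ in range(5):            # each "layer" applies one descent step
    X = X - alpha * num_grad(energy, X)
    energies.append(energy(X))

assert energies[-1] < energies[0]   # the energy decreases across "layers"
```

The point of the sketch is only the computational principle: the token update rule is not an arbitrary learned map but the gradient of a single scalar energy, so stacking "layers" corresponds to running the dynamics toward an energy minimum.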



