White-Box Transformers via Sparse Rate Reduction

06/01/2023
by Yaodong Yu, et al.

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at <https://github.com/Ma-Lab-Berkeley/CRATE>.
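
Below is a minimal sketch, in PyTorch, of the two alternating steps the abstract describes: an attention-like operator playing the role of a gradient step on a lossy coding-rate (compression) term, followed by an ISTA-style proximal-gradient step that sparsifies the token representations in place of the usual MLP. The class and parameter names (CompressionStep, SparsificationStep, CRATELikeBlock, step_size, lam) are illustrative assumptions, and the update rules are simplified relative to the operators derived in the paper; see the linked CRATE repository for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompressionStep(nn.Module):
    """Attention-like step: compress tokens against learned subspaces,
    playing the role of one gradient step on a lossy coding-rate term."""

    def __init__(self, dim, heads, step_size=1.0):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.step_size = step_size
        # A single projection shared across queries/keys/values per head,
        # standing in for the subspace bases of the mixture model.
        self.U = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, z):
        b, n, d = z.shape
        dh = d // self.heads
        u = self.U(z).view(b, n, self.heads, dh).transpose(1, 2)    # (b, h, n, dh)
        attn = torch.softmax(u @ u.transpose(-2, -1) / dh ** 0.5, dim=-1)
        compressed = (attn @ u).transpose(1, 2).reshape(b, n, d)
        # Gradient-descent-style update of the tokens toward their compression.
        return z + self.step_size * self.out(compressed)


class SparsificationStep(nn.Module):
    """ISTA-style step replacing the usual MLP: one proximal-gradient
    iteration toward a sparse code of each token in a learned dictionary D,
    i.e. one step on 0.5 * ||x - D z||^2 + lam * ||z||_1, with z initialized at x."""

    def __init__(self, dim, step_size=0.1, lam=0.1):
        super().__init__()
        self.D = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.step_size = step_size
        self.lam = lam

    def forward(self, x):
        z = x                                     # initialize the code at the input
        grad = (z @ self.D.t() - x) @ self.D      # D^T (D z - x), rows are tokens
        z = z - self.step_size * grad             # gradient step on the data term
        return F.softshrink(z, lambd=self.step_size * self.lam)  # prox of the l1 penalty


class CRATELikeBlock(nn.Module):
    """One 'white-box' block: compression (attention), then sparsification."""

    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.compress = CompressionStep(dim, heads)
        self.sparsify = SparsificationStep(dim)

    def forward(self, z):
        z = self.compress(self.norm1(z))
        z = self.sparsify(self.norm2(z))
        return z


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)        # (batch, tokens, embedding dim)
    block = CRATELikeBlock(dim=64, heads=4)
    print(block(tokens).shape)             # torch.Size([2, 16, 64])
```

Stacking such blocks gives a transformer-like network in which every layer is an interpretable optimization step on the sparse rate reduction objective, rather than a heuristically designed module.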


research
05/21/2021

ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction

This work attempts to provide a plausible theoretical framework that aim...
research
08/30/2023

Emergence of Segmentation with Minimalistic White-Box Transformers

Transformer-like models for vision tasks have recently proven effective ...
research
10/27/2020

Deep Networks from the Principle of Rate Reduction

This work attempts to interpret modern deep (convolutional) networks fro...
research
12/21/2022

What Makes for Good Tokenizers in Vision Transformer?

The architecture of transformers, which recently witness booming applica...
research
05/27/2022

Transformers from an Optimization Perspective

Deep learning models such as the Transformer are often constructed by he...
research
03/15/2023

Attention-likelihood relationship in transformers

We analyze how large language models (LLMs) represent out-of-context wor...
research
11/30/2021

Playing Ping Pong with Light: Directional Emission of White Light

Over the last decades, light-emitting diodes (LED) have replaced common ...
