Emergence of Segmentation with Minimalistic White-Box Transformers

08/30/2023
by Yaodong Yu et al.

Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection. Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks. In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms, or if the same emergence can be achieved under much broader conditions through proper design of the model architecture. Through extensive experimental results, we demonstrate that when employing a white-box transformer-like architecture known as CRATE, whose design explicitly models and pursues low-dimensional structures in the data distribution, segmentation properties, at both the whole and parts levels, already emerge with a minimalistic supervised training recipe. Layer-wise finer-grained analysis reveals that the emergent properties strongly corroborate the designed mathematical functions of the white-box network. Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable. Code is at <https://github.com/Ma-Lab-Berkeley/CRATE>.

