SWAT: Spatial Structure Within and Among Tokens

11/26/2021
by Kumara Kahatapitiya, et al.

Modeling visual data as tokens (i.e., image patches) and applying attention mechanisms or feed-forward networks on top of them has proven highly effective in recent years. The common pipeline in such approaches consists of a tokenization method followed by a set of layers/blocks for information mixing, both within tokens and among tokens. In common practice, image patches are flattened when converted into tokens, discarding the spatial structure within each patch. A module such as multi-head self-attention then captures the pairwise relations among the tokens and mixes them. In this paper, we argue that models can achieve significant gains when spatial structure is preserved during tokenization and is explicitly used in the mixing stage. We propose two key contributions: (1) Structure-aware Tokenization and (2) Structure-aware Mixing, both of which can be combined with existing models with minimal effort. We introduce a family of models (SWAT), showing improvements over models such as DeiT, MLP-Mixer, and Swin Transformer, across multiple benchmarks including ImageNet classification and ADE20K segmentation. Our code and models will be released online.
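To illustrate the distinction the abstract draws, the sketch below contrasts standard flattened tokenization with a structure-preserving variant in PyTorch. The class names, the grid size, and the use of a small-stride convolution to keep a grid layout inside each patch are assumptions made for this example; the abstract does not specify how SWAT realizes structure-aware tokenization, so this is a minimal sketch of the idea rather than the paper's actual method.

```python
import torch
import torch.nn as nn

class FlattenedTokenizer(nn.Module):
    """Standard ViT-style tokenization: each patch x patch region is
    projected to a single vector, discarding its internal spatial layout."""
    def __init__(self, in_ch=3, patch=16, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, C, H, W)
        tokens = self.proj(x)                      # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim)

class StructureAwareTokenizer(nn.Module):
    """Hypothetical structure-aware tokenization: each patch is embedded
    with a small-stride convolution so that every token keeps a
    grid x grid spatial map instead of a flat vector. This is an assumed
    realization for illustration, not SWAT's exact design."""
    def __init__(self, in_ch=3, patch=16, dim=384, grid=4):
        super().__init__()
        # stride patch//grid keeps a grid x grid layout inside every patch
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch // grid,
                              stride=patch // grid)
        self.grid = grid

    def forward(self, x):                          # x: (B, C, H, W)
        feat = self.proj(x)                        # (B, dim, H*grid/patch, W*grid/patch)
        B, D, Hg, Wg = feat.shape
        g = self.grid
        # regroup the feature map so each patch becomes one token of
        # shape (dim, grid, grid), preserving within-patch structure
        feat = feat.reshape(B, D, Hg // g, g, Wg // g, g)
        tokens = feat.permute(0, 2, 4, 1, 3, 5)    # (B, Hp, Wp, dim, g, g)
        return tokens.flatten(1, 2)                # (B, N, dim, g, g)

if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    print(FlattenedTokenizer()(x).shape)           # torch.Size([2, 196, 384])
    print(StructureAwareTokenizer()(x).shape)      # torch.Size([2, 196, 384, 4, 4])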
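```

A structure-aware mixing stage would then operate on these (dim, grid, grid) tokens, for example with convolutions or locality-aware attention over the preserved layout, rather than on flat vectors; the specific mixing operator used by SWAT is described in the full paper.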

Related research

02/16/2022  Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Vision Transformers (ViTs) take all the image patches as tokens and cons...

12/24/2021  SimViT: Exploring a Simple Vision Transformer with sliding windows
Although vision Transformers have achieved excellent performance as back...

06/21/2021  TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
In this paper, we introduce a novel visual representation learning which...

06/08/2021  On Improving Adversarial Transferability of Vision Transformers
Vision transformers (ViTs) process input images as sequences of patches ...

06/13/2022  MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Convolutional Neural Networks (CNNs) have been regarded as the go-to mod...

02/14/2022  Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
Token-mixing multi-layer perceptron (MLP) models have shown competitive ...

07/23/2020  Spatially Aware Multimodal Transformers for TextVQA
Textual cues are essential for everyday tasks like buying groceries and ...