OAMixer: Object-aware Mixing Layer for Vision Transformers

12/13/2022
by Hyunwoo Kang, et al.

Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks, emerging as alternatives to classic convolutional networks. While the initial patch-based models (ViTs) treated all patches equally, recent studies reveal that incorporating inductive biases such as spatiality benefits the representations. However, most prior works focused solely on the location of patches, overlooking the scene structure of images. Thus, we aim to further guide the interaction of patches using object information. Specifically, we propose OAMixer (object-aware mixing layer), which calibrates the patch mixing layers of patch-based models based on object labels. Here, we obtain the object labels in an unsupervised or weakly-supervised manner, i.e., no additional human annotation cost is necessary. Using the object labels, OAMixer computes a reweighting mask with a learnable scale parameter that intensifies the interaction of patches containing similar objects, and applies the mask to the patch mixing layers. By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models, including ViTs, MLP-Mixers, and ConvMixers. Moreover, we show that OAMixer enhances various downstream tasks, including large-scale classification, self-supervised learning, and multi-object recognition, verifying the generic applicability of OAMixer.
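To make the mechanism concrete, below is a minimal PyTorch sketch of how an object-aware reweighting mask could be computed and applied to an attention-style patch mixing layer. The class name `OAMixerMask`, the cosine-similarity comparison of object labels, and the exponential parameterization of the learnable scale `tau` are illustrative assumptions based on the abstract, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OAMixerMask(nn.Module):
    """Hypothetical sketch of an object-aware reweighting mask.

    Given per-patch object labels (soft assignments over K objects,
    obtained without extra human annotation, e.g. via unsupervised
    clustering), builds an N x N mask that up-weights pairs of patches
    with similar object assignments. `tau` is the learnable scale.
    """

    def __init__(self, init_tau: float = 1.0):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(init_tau))  # learnable scale

    def forward(self, obj_labels: torch.Tensor) -> torch.Tensor:
        # obj_labels: (B, N, K) soft object assignments per patch
        labels = F.normalize(obj_labels, dim=-1)
        sim = labels @ labels.transpose(-2, -1)   # (B, N, N) label similarity
        # Larger tau sharpens the preference for same-object patch pairs;
        # values lie in (0, 1], with 1 for identical object labels.
        return torch.exp(self.tau * (sim - 1.0))

def object_aware_attention(q, k, v, mask):
    """Scaled dot-product attention reweighted by the object mask,
    renormalized so each row of weights still sums to one."""
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    attn = attn * mask
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v
```

Under these assumptions, the same mask can multiply the token-mixing matrix of an MLP-Mixer or the kernel responses of a ConvMixer, which is consistent with the claim that OAMixer applies across patch-based architectures.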


