Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

06/01/2023
by Hossein Adeli, et al.

The spreading of attention has been proposed as a mechanism by which humans group features to segment objects. However, such a mechanism has not yet been implemented and tested on naturalistic images. Here, we leverage the feature maps from self-supervised vision Transformers to propose a model of human object-based attention spreading and segmentation. Attention spreads within an object through a feature affinity signal between different patches of the image. We also collected behavioral data on how people group objects in natural images, asking them to judge whether two dots lie on the same object or on two different objects. We found that affinity-spread models built on feature maps from self-supervised Transformers significantly outperformed baseline and CNN-based models at predicting human reaction time patterns, despite not being trained on the grouping task or with any object labels. Our work provides new benchmarks for evaluating models of visual representation learning, including Transformers.
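The core mechanism described above lends itself to a short sketch. Below is a minimal, hypothetical illustration of affinity-based attention spreading over ViT patch features, assuming a DINO ViT-S/8 backbone loaded from torch.hub. The seed patch, number of diffusion steps, and affinity threshold are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: spread attention from a seed patch via feature affinity.
# Assumes a DINO ViT-S/8 backbone; steps and tau are illustrative choices.
import torch
import torch.nn.functional as F

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

@torch.no_grad()
def affinity_spread(image, seed_patch, steps=10, tau=0.2):
    """image: (1, 3, H, W) tensor, H and W divisible by 8.
    seed_patch: flat index of the patch where attention starts."""
    # Patch tokens from the last block (drop the CLS token at index 0).
    tokens = model.get_intermediate_layers(image, n=1)[0][:, 1:, :]  # (1, N, D)
    feats = F.normalize(tokens.squeeze(0), dim=-1)                   # (N, D)

    # Pairwise cosine affinity, thresholded so attention only spreads
    # across sufficiently similar (putatively same-object) patches.
    affinity = (feats @ feats.T).clamp(min=0)
    affinity = torch.where(affinity > tau, affinity, torch.zeros_like(affinity))
    # Row-normalize so each step is a weighted diffusion over patches.
    transition = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    # Start with all attention on the seed patch and diffuse it.
    attn = torch.zeros(feats.shape[0])
    attn[seed_patch] = 1.0
    for _ in range(steps):
        attn = transition.T @ attn
    return attn  # per-patch attention; reshape to (H/8, W/8) to visualize
```

Under this reading of the model, the number of steps needed for the spreading attention to reach a second probe location is what would be compared against human same-object judgment reaction times.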


Related research

10/25/2022
Learning Explicit Object-Centric Representations with Vision Transformers
With the recent successful adaptation of transformers to the vision doma...

10/11/2021
Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning
Studies on self-supervised visual representation learning (SSL) improve ...

03/31/2020
Attention-based Assisted Excitation for Salient Object Segmentation
Visual attention brings significant progress for Convolution Neural Netw...

03/15/2023
SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers
Self-supervised pre-training and transformer-based networks have signifi...

06/09/2022
Spatial Entropy Regularization for Vision Transformers
Recent work has shown that the attention maps of Vision Transformers (VT...

12/13/2022
OAMixer: Object-aware Mixing Layer for Vision Transformers
Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have sh...

08/23/2023
CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
We present a method for teaching machines to understand and model the un...
