Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

04/06/2023
by Mingyu Ding, et al.

Humans possess a versatile mechanism for extracting structured representations of our visual world. When looking at an image, we can decompose the scene into entities and their parts and also obtain the dependencies between them. To mimic this capability, we propose Visual Dependency Transformers (DependencyViT), which can induce visual dependencies without any labels. We achieve this with a novel neural operator called reversed attention that naturally captures long-range visual dependencies between image patches. Specifically, we formulate it as a dependency graph in which a child token in reversed attention is trained to attend to its parent tokens and send information following a normalized probability distribution, rather than gathering information as in conventional self-attention. With this design, hierarchies naturally emerge from the reversed attention layers, and a dependency tree is progressively induced from the leaf nodes to the root node without supervision. DependencyViT offers several appealing benefits. (i) Entities and their parts in an image are represented by different subtrees, enabling part partitioning from dependencies. (ii) Dynamic visual pooling becomes possible: leaf nodes that rarely send messages can be pruned without hindering model performance, and on this basis we propose the lightweight DependencyViT-Lite to reduce the computational and memory footprint. (iii) DependencyViT works well under both self- and weakly-supervised pretraining paradigms on ImageNet, and demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised part and saliency segmentation, recognition, and detection.
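
The send-rather-than-gather aggregation described above can be sketched in a few lines of code. The following is a minimal, simplified PyTorch illustration, not the authors' implementation: the class name, layer sizes, and the omission of components such as the paper's message controller and leaf-pruning logic are my own assumptions for exposition.

```python
# Hypothetical sketch of "reversed attention" (not the paper's implementation).
import torch
import torch.nn as nn


class ReversedAttention(nn.Module):
    """Each child token attends over candidate parents (softmax per child),
    then sends its value along those weights; a parent's output is the sum of
    messages received from its children (transposed aggregation)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = (
            self.qkv(x)
            .reshape(B, N, 3, self.num_heads, C // self.num_heads)
            .permute(2, 0, 3, 1, 4)
        )
        q, k, v = qkv[0], qkv[1], qkv[2]          # each (B, heads, N, head_dim)

        # Child-to-parent scores: row i is child i's normalized distribution
        # over its candidate parents.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)

        # Conventional self-attention would compute `attn @ v` (each token
        # gathers information). Reversed attention transposes the weights so
        # children send instead: parent i accumulates attn[j, i] * v[j] over
        # all children j.
        out = attn.transpose(-2, -1) @ v          # (B, heads, N, head_dim)

        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 384)             # e.g. 14x14 patches, dim 384
    layer = ReversedAttention(dim=384, num_heads=6)
    print(layer(tokens).shape)                    # torch.Size([2, 196, 384])
```

Under this formulation, a token that receives many messages behaves as a parent, while a token whose outgoing weights are rarely used behaves as a leaf; the dynamic pooling mentioned in the abstract prunes such leaves.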
