Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows

03/20/2022
by Danyang Tu, et al.

This paper presents a new vision Transformer, named Iwin Transformer, which is specifically designed for human-object interaction (HOI) detection, a detailed scene understanding task involving a sequential process of human/object detection and interaction recognition. Iwin Transformer is a hierarchical Transformer that progressively performs token representation learning and token agglomeration within irregular windows. The irregular windows, obtained by augmenting regular grid locations with learned offsets, 1) eliminate redundancy in token representation learning, which leads to efficient human/object detection, and 2) enable the agglomerated tokens to align with humans/objects of different shapes, which facilitates the acquisition of highly abstracted visual semantics for interaction recognition. The effectiveness and efficiency of Iwin Transformer are verified on two standard HOI detection benchmark datasets, HICO-DET and V-COCO. Results show that our method outperforms existing Transformer-based methods by large margins (3.7 mAP gain on HICO-DET and 2.0 mAP gain on V-COCO) while using half as many training epochs (0.5×).
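The idea of an irregular window, a regular grid whose sampling locations are shifted by per-location offsets, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`bilinear_sample`, `irregular_window`, `agglomerate`), the use of bilinear interpolation, the mean-based agglomeration, and the fixed offset values are all assumptions made for illustration. In the paper the offsets are learned, and the abstract does not specify how tokens are sampled or merged.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that offsets pushing a location outside the map remain valid.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def irregular_window(feature_map, top, left, size, offsets):
    """Sample a size x size window whose regular grid locations are shifted
    by one (dy, dx) offset per location (hypothetical stand-ins for learned offsets)."""
    samples = []
    for i in range(size):
        for j in range(size):
            dy, dx = offsets[i * size + j]
            samples.append(bilinear_sample(feature_map, top + i + dy, left + j + dx))
    return samples


def agglomerate(samples):
    """Merge the tokens of one window into a single token (here: a plain mean)."""
    return sum(samples) / len(samples)


# Toy 4x4 single-channel feature map with value 4*row + col.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# Zero offsets reproduce an ordinary regular 2x2 window.
regular = irregular_window(feature_map, 0, 0, 2, [(0.0, 0.0)] * 4)

# Non-zero offsets move every sampling location off the regular grid.
irregular = irregular_window(feature_map, 0, 0, 2, [(0.5, 0.5)] * 4)

print(regular, agglomerate(regular))
print(irregular, agglomerate(irregular))
```

With zero offsets the window reduces to a regular grid, so the offsets are the only thing that lets the sampled region deform toward the shape of a human or object.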
