Video-based Human-Object Interaction Detection from Tubelet Tokens

06/04/2022
by   Danyang Tu, et al.
0

We present a novel vision Transformer, named TUTOR, which is able to learn tubelet tokens, served as highly-abstracted spatiotemporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically-related patch tokens along spatial and temporal domains, which enjoy two benefits: 1) Compactness: each tubelet token is learned by a selective attention mechanism to reduce redundant spatial dependencies from others; 2) Expressiveness: each tubelet token is enabled to align with a semantic instance, i.e., an object or a human, across frames, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results shows our method outperforms existing works by large margins, with a relative mAP gain of 16.14% on VidHOI and a 2 points gain on CAD-120 as well as a 4 × speedup.

READ FULL TEXT

page 2

page 3

page 13

research
03/20/2022

Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows

This paper presents a new vision Transformer, named Iwin Transformer, wh...
research
08/16/2023

Agglomerative Transformer for Human-Object Interaction Detection

We propose an agglomerative Transformer (AGER) that enables Transformer-...
research
07/16/2022

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Text-Video retrieval is a task of great practical value and has received...
research
06/01/2022

Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment

Detection transformers like DETR have recently shown promising performan...
research
03/25/2023

Selective Structured State-Spaces for Long-Form Video Understanding

Effective modeling of complex spatiotemporal dependencies in long-form v...
research
09/28/2022

DeViT: Deformed Vision Transformers in Video Inpainting

This paper proposes a novel video inpainting method. We make three main ...
research
05/02/2022

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

Recently, large-scale pre-training methods like CLIP have made great pro...

Please sign up or login with your details

Forgot password? Click here to reset