Target Adaptive Context Aggregation for Video Scene Graph Generation

08/18/2021
by Yao Teng, et al.

This paper addresses the challenging task of video scene graph generation (VidSGG), which can serve as a structured video representation for high-level understanding tasks. We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design and introduces two distinctive blocks: Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, the HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy that links TRACE's frame-level detections across frames to yield video-level VidSGG results. We perform experiments on two VidSGG benchmarks, ImageNet-VidVRD and Action Genome, and the results demonstrate that TRACE achieves state-of-the-art performance. The code and models are made available at <https://github.com/MCG-NJU/TRACE>.
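The abstract does not spell out the temporal association step, so the following is only an illustrative sketch of what linking frame-level relation detections into video-level tracks could look like. It assumes a generic greedy linker that joins detections in consecutive frames when they share the same (subject, predicate, object) triplet and their boxes overlap sufficiently. The names `RelationDetection`, `RelationTrack`, `associate`, and the `iou_threshold` parameter are hypothetical and are not taken from the TRACE code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class RelationDetection:
    """One frame-level relation prediction (hypothetical structure)."""
    frame: int
    subject_box: Box
    object_box: Box
    triplet: Tuple[str, str, str]  # (subject class, predicate, object class)
    score: float


@dataclass
class RelationTrack:
    """A video-level relation instance: same triplet linked over frames."""
    triplet: Tuple[str, str, str]
    detections: List[RelationDetection] = field(default_factory=list)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(frames: List[List[RelationDetection]],
              iou_threshold: float = 0.5) -> List[RelationTrack]:
    """Greedily link detections in consecutive frames (assumed strategy)."""
    tracks: List[RelationTrack] = []
    active: List[RelationTrack] = []  # tracks that were extended in the previous frame
    for frame_dets in frames:
        next_active: List[RelationTrack] = []
        used = set()  # indices of active tracks already matched in this frame
        # Higher-scoring detections get first choice of track.
        for det in sorted(frame_dets, key=lambda d: d.score, reverse=True):
            best_idx, best_overlap = None, iou_threshold
            for idx, track in enumerate(active):
                if idx in used or track.triplet != det.triplet:
                    continue
                last = track.detections[-1]
                # Both the subject and the object box must overlap.
                overlap = min(iou(last.subject_box, det.subject_box),
                              iou(last.object_box, det.object_box))
                if overlap >= best_overlap:
                    best_idx, best_overlap = idx, overlap
            if best_idx is None:
                track = RelationTrack(det.triplet, [det])
                tracks.append(track)
            else:
                used.add(best_idx)
                track = active[best_idx]
                track.detections.append(det)
            next_active.append(track)
        active = next_active
    return tracks
```

Under these assumptions, a track ends as soon as a frame contains no matching detection, which keeps the linker simple but makes it sensitive to missed detections. A practical system would likely tolerate short gaps and aggregate per-frame scores into a track-level score.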
