Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

12/08/2021
by Kaifeng Gao, et al.

Today's VidSGG models are all proposal-based: they first generate numerous paired subject-object snippets as proposals, and then classify the predicate for each proposal. In this paper, we argue that this prevalent proposal-based framework has three inherent drawbacks: 1) The ground-truth predicate labels for proposals are only partially correct. 2) It breaks the high-order relations among different predicate instances of the same subject-object pair. 3) VidSGG performance is upper-bounded by the quality of the proposals. To this end, we propose a new classification-then-grounding framework for VidSGG, which avoids all three drawbacks. Under this framework, we reformulate video scene graphs as temporal bipartite graphs, where entities and predicates are two types of nodes with time slots, and the edges denote the different semantic roles between these nodes. This formulation takes full advantage of the new framework. Accordingly, we further propose a novel BIpartite Graph based SGG model, BIG, which consists of a classification stage and a grounding stage: the former classifies the categories of all nodes and edges, and the latter localizes the temporal span of each relation instance. Extensive ablations on two VidSGG datasets attest to the effectiveness of our framework and of BIG. Code is available at https://github.com/Dawn-LX/VidSGG-BIG.
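To make the temporal bipartite graph formulation concrete, here is a minimal sketch of the data structure the abstract describes: entity and predicate nodes, each carrying time slots, connected by edges labeled with semantic roles. All class and field names are illustrative assumptions, not the authors' actual code (see the linked repository for that); a predicate node holds one time slot per relation instance, which is what the grounding stage would localize.

```python
from dataclasses import dataclass, field

@dataclass
class EntityNode:
    """An entity tracklet: a category plus the frame span it covers."""
    category: str
    time_slot: tuple  # (start_frame, end_frame), illustrative

@dataclass
class PredicateNode:
    """A predicate node with possibly several temporal instances."""
    category: str
    time_slots: list = field(default_factory=list)  # one (start, end) per instance

@dataclass
class TemporalBipartiteGraph:
    """Entities and predicates are the two node types; edges carry
    semantic roles (subject / object) linking predicates to entities."""
    entities: list = field(default_factory=list)
    predicates: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (pred_idx, ent_idx, role)

# Toy example: "person rides bicycle" occurring over two separate time spans,
# represented as one predicate node with two time slots rather than two proposals.
g = TemporalBipartiteGraph()
g.entities += [EntityNode("person", (0, 120)), EntityNode("bicycle", (0, 120))]
g.predicates.append(PredicateNode("ride", [(10, 45), (70, 110)]))
g.edges += [(0, 0, "subject"), (0, 1, "object")]
```

Note how the two instances of "ride" share a single predicate node, which is how this formulation preserves the high-order relations among predicate instances of the same subject-object pair that proposal-based methods break apart.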


Related research

09/03/2020 · Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
The prevailing framework for solving referring expression grounding is b...

08/18/2021 · Social Fabric: Tubelet Compositions for Video Relation Detection
This paper strives to classify and detect the relationship between objec...

07/17/2020 · Visual Relation Grounding in Videos
In this paper, we explore a novel task named visual Relation Grounding i...

03/24/2021 · Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
Visual grounding, which aims to build a correspondence between visual ob...

03/27/2023 · DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
We propose a new formulation of temporal action detection (TAD) with den...

05/12/2021 · VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching
The prevailing framework for matching multimodal inputs is based on a tw...

01/25/2022 · Explore and Match: End-to-End Video Grounding with Transformer
We present a new paradigm named explore-and-match for video grounding, w...
