Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

08/16/2020
by Zhu Zhang, et al.

Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to a given sentence. Most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing the critical object relations that identify the queried target. However, existing approaches cannot distinguish the notable objects and therefore model relations among irrelevant objects ineffectively. We thus propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches for object-aware region modeling, where each branch focuses on one crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture the critical object relationships between the main branch and the auxiliary branches. Moreover, we apply a diversity loss that makes each branch attend only to its corresponding object and boosts multi-branch learning. Extensive experiments show the effectiveness of the proposed method.
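The abstract does not spell out the form of the diversity loss, so the following is only an illustrative sketch of one common way to discourage several branches from attending to the same regions; it should not be read as the authors' formulation. It assumes each branch produces a raw score per candidate region, turns those scores into an attention distribution with a softmax, and penalizes the mean pairwise overlap (dot product) between the distributions of different branches. The names `softmax`, `diversity_penalty`, and `branch_scores` are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw region scores into an attention distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_penalty(branch_scores):
    """Mean pairwise overlap between the branches' attention distributions.

    branch_scores: list of per-branch score lists, all over the same regions.
    Assumption: overlap is measured as a dot product of attention weights;
    the paper may define its diversity loss differently.
    """
    attn = [softmax(s) for s in branch_scores]
    n = len(attn)
    if n < 2:
        return 0.0  # a single branch has nothing to overlap with
    overlap = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            overlap += sum(a * b for a, b in zip(attn[i], attn[j]))
            pairs += 1
    return overlap / pairs


# Two branches focused on the same region versus on different regions.
same_focus = diversity_penalty([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
different_focus = diversity_penalty([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
```

Under these assumptions, `same_focus` is larger than `different_focus`, so minimizing the penalty pushes each branch toward a distinct object.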
