Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

06/06/2019
by Zhenfang Chen, et al.

In this paper, we address a novel task: weakly-supervised spatio-temporal grounding of a natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the sentence, without relying on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, is extracted from the video. We then encode these instances and the sentence with our proposed attentive interactor, which exploits their fine-grained relationships to characterize their matching behaviors. In addition to a ranking loss, a novel diversity loss is introduced to train the attentive interactor so that it strengthens the matching behaviors of reliable instance-sentence pairs and penalizes unreliable ones. We also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for the task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches.
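The abstract does not give the loss formulas, so the following is only a minimal sketch of how weak supervision over instance-sentence matching scores could look, under stated assumptions. It assumes each video yields a list of per-instance matching scores (standing in for the attentive interactor's output), uses a max-over-instances margin ranking loss between a matched and a mismatched video-sentence pair, and uses an entropy penalty as a stand-in for a term that favors a few reliable instances. All function names, the margin value, and the entropy form are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_score(instance_scores):
    """Video-level score: the best-matching instance (an assumed pooling choice)."""
    return max(instance_scores)


def ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss pushing a matched video-sentence pair above a mismatched one.

    pos_scores: instance scores for a video paired with its own sentence.
    neg_scores: instance scores for a video paired with an unrelated sentence.
    """
    return max(0.0, margin - video_score(pos_scores) + video_score(neg_scores))


def entropy_penalty(instance_scores):
    """Entropy of the normalized scores (assumed stand-in, not the paper's loss).

    Minimizing it concentrates probability mass on a few instances.
    """
    probs = softmax(instance_scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ground(instance_scores):
    """At inference, return the index of the highest-scoring instance (tube)."""
    return max(range(len(instance_scores)), key=lambda i: instance_scores[i])
```

Only video-level pairing (which sentence belongs to which video) is used as supervision here; no tube annotations enter either loss term, which is the sense in which such a setup is weakly supervised.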

research
11/10/2020

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

In this work, we introduce a novel task - Humancentric Spatio-Temporal V...
research
10/19/2022

Grounded Video Situation Recognition

Dense video understanding requires answering several questions such as w...
research
01/25/2020

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

In this paper, we study the problem of weakly-supervised temporal ground...
research
03/16/2020

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos

The task of temporally grounding textual queries in videos is to localiz...
research
12/01/2021

Weakly-Supervised Video Object Grounding via Causal Intervention

We target at the task of weakly-supervised video object grounding (WSVOG...
research
10/22/2022

Weakly-Supervised Temporal Article Grounding

Given a long untrimmed video and natural language queries, video groundi...
research
12/16/2021

Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark

This paper presents the baseline method proposed for the Sports Video ta...
