What, when, and where? – Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

03/29/2023
by Brian Chen, et al.

Spatio-temporal grounding is the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding-box supervision. This work addresses the task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained only on loosely aligned video and subtitles, without any human annotation. To this end, we combine local representation learning, which leverages fine-grained spatial information, with a global representation encoding that captures higher-level representations, and incorporate both in a joint approach. To evaluate this challenging task in a real-life setting, we propose a new benchmark dataset that provides dense spatio-temporal grounding annotations for over 5K events in long, untrimmed, multi-action instructional videos. We evaluate the proposed approach and other methods on this benchmark and on standard downstream tasks, showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.
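The abstract only describes the joint local/global objective at a high level, so the sketch below is a rough illustration of what such a combined loss could look like in PyTorch, not the paper's actual implementation. All names (joint_grounding_loss, region_feats, clip_feats, the logsumexp aggregation, the weighting lam) are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss over a video-text similarity matrix (B x B)."""
    sim = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

def joint_grounding_loss(region_feats, clip_feats, text_feats, lam=0.5):
    """Hypothetical joint objective combining local (spatial) and global alignment.

    region_feats: (B, R, D) fine-grained region/patch features per clip
    clip_feats:   (B, D)    global clip-level features
    text_feats:   (B, D)    subtitle/narration sentence features
    """
    # Global branch: clip-level video-text alignment.
    g_sim = F.normalize(clip_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    loss_global = info_nce(g_sim)

    # Local branch: aggregate region-to-sentence similarities with a soft max,
    # so the most relevant spatial region dominates the clip-sentence score.
    r = F.normalize(region_feats, dim=-1)           # (B, R, D)
    t = F.normalize(text_feats, dim=-1)             # (B, D)
    region_sim = torch.einsum('brd,kd->bkr', r, t)  # (B, B, R)
    l_sim = torch.logsumexp(region_sim, dim=-1)     # soft-max over regions -> (B, B)
    loss_local = info_nce(l_sim)

    # Joint objective: weighted sum of the local and global alignment terms.
    return lam * loss_local + (1.0 - lam) * loss_global
```

Under these assumptions, the global term encourages coarse clip-to-subtitle matching for temporal grounding, while the local term ties individual spatial regions to the narration, which is what enables spatial localization without bounding-box supervision.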
