Compositional Temporal Visual Grounding of Natural Language Event Descriptions

12/04/2019
by Jonathan C. Stroud, et al.

Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions. Compositional modeling is central to this task: we first ground atomic descriptions (e.g., "girl eating an apple," "batter hitting the ball") to short video segments, and then establish the temporal relationships between those segments. This compositional structure enables models to recognize a wider variety of events not seen during training by recognizing their atomic sub-events. Explicit temporal modeling accounts for the wide variety of temporal relationships that can be expressed in language: e.g., in the description "girl stands up from the table after eating an apple," the visual ordering of the events is the reverse of the order in which they are mentioned, with "eating an apple" occurring first, followed by "standing up from the table." We leverage these observations to develop a unified deep architecture, CTG-Net, to perform temporal grounding of natural language event descriptions to videos. We demonstrate that our system outperforms prior state-of-the-art methods on the DiDeMo, Tempo-TL, and Tempo-HL temporal grounding datasets.
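
The two-stage idea described above (ground each atomic description to a segment, then enforce the temporal relationship between the segments) can be illustrated with a toy sketch. This is not CTG-Net and does not reflect its architecture; the function name `ground_composite`, the per-segment score lists, and the restriction to the relations "before" and "after" are all assumptions made purely for illustration. The scores stand in for whatever a learned model would assign to each (atomic description, video segment) pair.

```python
from itertools import product


def ground_composite(scores_a, scores_b, relation):
    """Pick one segment for each of two atomic sub-events under a temporal constraint.

    Hypothetical helper for illustration only.
    scores_a[i]: score that segment i depicts sub-event A (the main clause).
    scores_b[j]: score that segment j depicts sub-event B (the subordinate clause).
    relation:    "after"  means A happens after B, so B's segment must come first;
                 "before" means A happens before B, so A's segment must come first.
    Returns (index_for_a, index_for_b, total_score), or None if no pair is valid.
    """
    if relation not in ("before", "after"):
        raise ValueError(f"unsupported relation: {relation!r}")

    best = None
    for i, j in product(range(len(scores_a)), range(len(scores_b))):
        # Keep only segment pairs whose visual order matches the stated relation.
        if relation == "after" and not j < i:
            continue
        if relation == "before" and not i < j:
            continue
        total = scores_a[i] + scores_b[j]
        if best is None or total > best[2]:
            best = (i, j, total)
    return best


# "girl stands up from the table after eating an apple"
# A = "stands up from the table", B = "eating an apple" (made-up scores).
stand_scores = [0.1, 0.2, 0.9, 0.3]
eat_scores = [0.2, 0.8, 0.1, 0.7]
a_idx, b_idx, _ = ground_composite(stand_scores, eat_scores, "after")
print(a_idx, b_idx)  # segment for "stands up", segment for "eating"
```

With these made-up scores, the "after" constraint selects an "eating" segment that precedes the "standing up" segment, even though "stands up" is mentioned first in the sentence. An exhaustive search over segment pairs is used only because it keeps the sketch short; it says nothing about how the paper's model performs inference.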
