Localizing Moments in Video with Natural Language

08/04/2017
by Lisa Anne Hendricks, et al.

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN), which localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, that is, text descriptions which uniquely identify a corresponding moment. We therefore collect the Distinct Describable Moments (DiDeMo) dataset, which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods, and we believe that our initial results, together with the release of DiDeMo, will inspire further research on localizing video moments with natural language.
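To make the idea of combining local and global video features concrete, here is a minimal sketch of how a moment could be ranked against a query. It is not the authors' implementation: the abstract does not specify the architecture, so the mean pooling, the normalized start/end positions, the squared-distance score, and every function name (`mean_pool`, `candidate_moments`, `moment_feature`, `localize`, `project`) are assumptions made purely for illustration. In a trained model, `project` would presumably be a learned mapping into a space shared with the language embedding; here it is a toy stand-in.

```python
from itertools import combinations_with_replacement
from math import fsum


def mean_pool(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(vectors)
    return [fsum(column) / n for column in zip(*vectors)]


def candidate_moments(num_clips):
    """Enumerate every contiguous (start, end) clip span, end inclusive."""
    return list(combinations_with_replacement(range(num_clips), 2))


def moment_feature(clip_feats, start, end):
    """Concatenate local (span) and global (whole-video) pooled features
    with the span's normalized start and end positions (an assumption)."""
    n = len(clip_feats)
    local = mean_pool(clip_feats[start:end + 1])
    global_context = mean_pool(clip_feats)
    return local + global_context + [start / n, (end + 1) / n]


def localize(clip_feats, query_vec, project):
    """Rank all candidate spans by squared distance between the projected
    moment feature and the query embedding (smaller is better)."""
    def distance(span):
        projected = project(moment_feature(clip_feats, *span))
        return fsum((a - b) ** 2 for a, b in zip(projected, query_vec))

    return sorted(candidate_moments(len(clip_feats)), key=distance)


# Toy data: three clips with 2-dimensional features, and a query embedding
# that matches the second clip exactly.
clip_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [0.0, 1.0]

# Hypothetical stand-in for a learned projection: keep only the local part.
project = lambda feature: feature[:2]

ranked = localize(clip_feats, query_vec, project)
best_start, best_end = ranked[0]
print(f"Best moment spans clips {best_start}..{best_end}")
```

Exhaustively scoring every contiguous span is practical only when a video is divided into a small number of clips, since the number of candidates grows quadratically with the clip count.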


