CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

09/22/2022
by Zhijian Hou, et al.

Video temporal grounding (VTG) aims to localize temporal moments in an untrimmed video according to a natural language (NL) description. Since real-world applications provide a never-ending video stream, there is growing demand for temporal grounding in long-form videos, which poses two major challenges: (1) the sheer video length makes it difficult to process the entire video without lowering the sample rate, and it incurs a high computational burden; (2) accurate multi-modal alignment becomes harder as the number of moment candidates grows. To address these challenges, we propose CONE, an efficient window-centric COarse-to-fiNE alignment framework, which flexibly handles long-form video inputs with higher inference speed and enhances temporal grounding via a novel coarse-to-fine multi-modal alignment scheme. Specifically, we dynamically slice the long video into candidate windows with a sliding-window approach. Centering on these windows, CONE (1) learns inter-window (coarse-grained) semantic variance through contrastive learning and speeds up inference by pre-filtering the candidate windows relevant to the NL query, and (2) ranks intra-window (fine-grained) candidate moments by exploiting the powerful multi-modal alignment ability of a contrastive vision-text pre-trained model. Extensive experiments on two large-scale long-video VTG benchmarks consistently show substantial performance gains (from 3.13% to 6.87% on MAD and from 10.46% to 13.46% on Ego4d-NLQ), and CONE achieves SOTA results on both datasets. Analysis confirms the effectiveness of each component and the higher efficiency of our system in long-video grounding: it improves inference speed by 2x on Ego4d-NLQ and 15x on MAD while keeping CONE's SOTA performance.
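To make the two-stage pipeline concrete, below is a minimal PyTorch-style sketch of the coarse-to-fine idea described in the abstract. All function names, the mean-pooling of features, the cosine-similarity scoring, and the tensor dimensions are illustrative assumptions for exposition; the actual CONE model uses learned components (e.g., contrastively trained window scoring and a pre-trained vision-text model for moment ranking), not this exact code.

```python
import torch
import torch.nn.functional as F

def slice_into_windows(video_feats, window_size, stride):
    """Slice a long video feature sequence [T, D] into overlapping
    candidate windows via a sliding window (hypothetical helper)."""
    T = video_feats.shape[0]
    windows, starts = [], []
    for s in range(0, max(T - window_size, 0) + 1, stride):
        windows.append(video_feats[s:s + window_size])
        starts.append(s)
    return torch.stack(windows), starts  # [N, W, D], window start frames

def prefilter_windows(windows, query_emb, top_k):
    """Coarse stage (assumed scoring): compare each window to the NL
    query embedding with cosine similarity and keep the top-k windows,
    standing in for the contrastively learned window pre-filtering."""
    window_emb = F.normalize(windows.mean(dim=1), dim=-1)  # [N, D]
    query_emb = F.normalize(query_emb, dim=-1)             # [D]
    scores = window_emb @ query_emb                        # [N]
    keep = scores.topk(min(top_k, scores.shape[0])).indices
    return keep, scores

def rank_moments(window, query_emb, proposals):
    """Fine stage (assumed scoring): rank candidate moments inside one
    selected window by matching pooled moment features to the query,
    a stand-in for the vision-text pre-trained alignment model."""
    query_emb = F.normalize(query_emb, dim=-1)
    scored = []
    for start, end in proposals:  # frame offsets within the window
        moment = F.normalize(window[start:end].mean(dim=0), dim=-1)
        scored.append(((start, end), (moment @ query_emb).item()))
    return sorted(scored, key=lambda x: -x[1])

# Usage sketch with random features (dimensions are illustrative):
video = torch.randn(3000, 512)   # long video, one feature per frame
query = torch.randn(512)         # NL query embedding
wins, starts = slice_into_windows(video, window_size=90, stride=45)
keep, _ = prefilter_windows(wins, query, top_k=5)
best = rank_moments(wins[keep[0]], query, proposals=[(0, 30), (30, 60)])
```

Only the few pre-filtered windows reach the fine-grained stage, which is what lets this window-centric design scale to hour-long videos instead of scoring every moment candidate in the full video.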

