Language-free Training for Zero-shot Video Grounding

10/24/2022
by Dahye Kim et al.

Given an untrimmed video and a language query describing a specific temporal moment in the video, video grounding aims to localize the corresponding time interval by understanding the text and video jointly. One of the most challenging issues is the extremely time- and cost-consuming collection of annotations, which consist of video captions in natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network from video data alone, without any annotations. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without forcing generated fake (pseudo) text queries into natural language form. Specifically, we propose to learn a video grounding model by selecting a temporal interval as a hypothetical correct answer and, leveraging the well-aligned visual-language space of CLIP, treating a visual feature selected from that interval as the language feature. Extensive experiments demonstrate the effectiveness of our language-free training framework, which outperforms the existing zero-shot video grounding method, and even several weakly-supervised approaches, by large margins on two standard datasets.
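To make the training recipe concrete, below is a minimal sketch, not the authors' implementation, of how a language-free pseudo-sample might be built with the open-source CLIP package. The random interval sampler, the mean-pooling of frame features, and the normalized-interval output are illustrative assumptions layered on the idea described in the abstract.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)


def make_language_free_sample(frames):
    """Build one language-free pseudo-supervision pair from an unlabeled video.

    frames: CLIP-preprocessed video frames, a tensor of shape (T, 3, 224, 224).
    Returns a pseudo ground-truth interval normalized to [0, 1] and a CLIP-space
    feature that stands in for the language query.
    """
    T = frames.shape[0]
    # 1) Pick a random temporal interval as the hypothetical correct answer.
    s, e = sorted(torch.randint(0, T, (2,)).tolist())
    e = max(e, s + 1)  # ensure a non-empty interval
    # 2) Encode the frames inside the interval with CLIP's image encoder.
    with torch.no_grad():
        feats = clip_model.encode_image(frames[s:e].to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
    # 3) Treat a pooled visual feature as the "language" feature, relying on
    #    CLIP's aligned vision-language embedding space (mean pooling is an
    #    assumption made for this sketch).
    pseudo_query = feats.mean(dim=0)
    return (s / T, e / T), pseudo_query
```

In this sketch, a grounding network would consume the full video features together with `pseudo_query` and be trained to regress the sampled interval; at inference time, the pseudo query is simply replaced by the CLIP text embedding of the real sentence, e.g. `clip_model.encode_text(clip.tokenize(query).to(device))`, which lives in the same embedding space.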
