LocVTP: Video-Text Pre-training for Temporal Localization

07/21/2022
by Meng Cao, et al.

Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, remains under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one via a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned features, we propose a context projection head and a temporal-aware contrastive loss to perceive contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that our LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs and training strategies.
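The fine-grained alignment described above can be illustrated with a short sketch. The following is a minimal, hedged NumPy illustration of the general idea, not the authors' implementation: each clip is matched to its most similar word (a simple stand-in for the paper's clip-word correspondence discovery), the maxima are averaged into a video-text similarity matrix, and a symmetric InfoNCE loss contrasts matched pairs against the rest of the batch. All function names and shapes here are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize features to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fine_grained_similarity(clip_feats, word_feats):
    """Illustrative clip-word alignment (not the paper's exact scheme):
    every clip attends to its best-matching word, and the per-clip maxima
    are averaged into one similarity score per (video, caption) pair.

    clip_feats: (B, Nc, D) clip features, word_feats: (B, Nw, D) word features.
    Returns a (B, B) video-to-caption similarity matrix.
    """
    c = l2_normalize(clip_feats)
    w = l2_normalize(word_feats)
    # Cosine similarity between every video's clips and every caption's words.
    sim = np.einsum('bcd,ewd->becw', c, w)   # (B, B, Nc, Nw)
    # Max over words = best-matching word per clip; mean over clips = pair score.
    return sim.max(axis=-1).mean(axis=-1)    # (B, B)

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a B x B similarity matrix: the diagonal holds
    the matched (positive) pairs, off-diagonal entries are negatives."""
    logits = sim / temperature
    labels = np.arange(sim.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice such a loss would be computed on encoder outputs within each training batch; the max-over-words aggregation is one common choice for token-level alignment, with attention-weighted pooling being a frequent alternative.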

