Temporal Perceiving Video-Language Pre-training

01/18/2023
by Fan Ma, et al.

Video-language pre-training models have recently brought significant improvements to a variety of multi-modal downstream tasks. Previous dominant works mainly adopt contrastive learning to achieve global feature alignment across modalities. However, the local associations between videos and texts are not modeled, restricting the pre-training models' generality, especially for tasks that require localizing temporal boundaries in videos for a given query text. This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment, so that the trained model can accurately perceive temporal boundaries in videos given a text description. Specifically, text-video localization consists of moment retrieval, which predicts the start and end boundaries in a video given a text description, and text localization, which matches a subset of the texts to the video features. To produce temporal boundaries, frame features from several videos are merged into a single long video sequence that then interacts with a text sequence, as sketched below. With the localization task, our method connects fine-grained frame representations with word representations and implicitly distinguishes representations of different instances within a single modality. Comprehensive experimental results show that our method significantly improves on the state of the art across various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization, and temporal moment retrieval. The code will be released soon.
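To make the pretext task concrete, below is a minimal PyTorch sketch of the moment-retrieval half: frame features from several clips are merged into one long video sequence, and a small head predicts start/end boundary logits conditioned on a query-text feature. The names (`MomentRetrievalHead`, `merge_clips`) and design details are hypothetical illustrations, not the authors' unreleased code; real frame and text features would come from the pre-training encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentRetrievalHead(nn.Module):
    """Scores every frame of a merged video sequence as a start or end
    boundary, conditioned on a pooled query-text feature (a sketch, not
    the paper's exact architecture)."""
    def __init__(self, dim: int):
        super().__init__()
        self.start_proj = nn.Linear(2 * dim, 1)
        self.end_proj = nn.Linear(2 * dim, 1)

    def forward(self, frame_feats, text_feat):
        # frame_feats: (B, T, D) long sequence merged from several clips
        # text_feat:   (B, D)    pooled feature of the query text
        text = text_feat.unsqueeze(1).expand_as(frame_feats)   # (B, T, D)
        fused = torch.cat([frame_feats, text], dim=-1)         # (B, T, 2D)
        start_logits = self.start_proj(fused).squeeze(-1)      # (B, T)
        end_logits = self.end_proj(fused).squeeze(-1)          # (B, T)
        return start_logits, end_logits

def merge_clips(clips, target_idx=0):
    """Concatenate frame features from several clips into one long video
    and return the start/end frame indices of the target clip."""
    lengths = [c.shape[1] for c in clips]
    merged = torch.cat(clips, dim=1)                # (B, sum(T_i), D)
    start = sum(lengths[:target_idx])
    end = start + lengths[target_idx] - 1
    return merged, start, end

# Toy usage: merge three 16-frame clips (batch 2, 256-d features) and
# supervise the head to localize the first clip from its "description".
clips = [torch.randn(2, 16, 256) for _ in range(3)]
merged, gt_start, gt_end = merge_clips(clips, target_idx=0)
head = MomentRetrievalHead(dim=256)
text_feat = torch.randn(2, 256)  # stands in for the encoded query text
s_logits, e_logits = head(merged, text_feat)
# Boundary prediction treated as classification over frame positions.
loss = F.cross_entropy(s_logits, torch.full((2,), gt_start)) \
     + F.cross_entropy(e_logits, torch.full((2,), gt_end))
```

The text-localization half would analogously score which sentences in a text set match the merged video; it is omitted here for brevity.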


Related research

07/21/2022 · LocVTP: Video-Text Pre-training for Temporal Localization
Video-Text Pre-training (VTP) aims to learn transferable representations...

02/20/2023 · STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Although large-scale video-language pre-training models, which usually b...

01/13/2022 · BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions
Pre-training a model to learn transferable video-text representation for...

08/21/2023 · UnLoc: A Unified Framework for Video Localization Tasks
While large-scale image-text pretrained models such as CLIP have been us...

11/14/2020 · ActBERT: Learning Global-Local Video-Text Representations
In this paper, we introduce ActBERT for self-supervised learning of join...

04/26/2022 · MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training work for video-text retrieval mainly adopt the "du...

11/18/2020 · A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
Identifying a short segment in a long video that semantically matches a ...
