STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

02/20/2023
by Weihong Zhong, et al.

Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the idea of adopting fine-grained information during the pre-training stage is not well explored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model regards object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is a dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is a spatial-temporal action set prediction task, which guides the model to generate consistent action features by predicting the actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g., a 3.7-point ROUGE-L improvement on the MSR-VTT video captioning benchmark and a 2.9-point accuracy improvement on the MSVD video question answering benchmark, compared to previous approaches).
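To make the object-text alignment idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how such an objective could be instantiated: each noun token is matched to its most similar object-trajectory feature, the matched similarities are averaged into a video-text score, and a standard symmetric contrastive loss over the batch pulls matched pairs together. All function names, shapes, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def object_noun_score(traj_feats, noun_feats):
    """traj_feats: (T, D) object-trajectory features of one video.
    noun_feats: (N, D) noun-token embeddings of one caption.
    Returns a scalar alignment score: mean over nouns of the
    best-matching trajectory similarity."""
    traj = F.normalize(traj_feats, dim=-1)
    nouns = F.normalize(noun_feats, dim=-1)
    sim = nouns @ traj.t()  # (N, T) noun-to-trajectory cosine similarities
    return sim.max(dim=-1).values.mean()


def alignment_loss(video_trajs, caption_nouns, temperature=0.07):
    """video_trajs / caption_nouns: lists of per-sample feature tensors for a
    batch of matched video-caption pairs. Builds a batch score matrix and
    applies a symmetric InfoNCE-style loss with matched pairs on the diagonal."""
    scores = torch.stack([
        torch.stack([object_noun_score(t, n) for n in caption_nouns])
        for t in video_trajs
    ]) / temperature  # (B, B)
    labels = torch.arange(scores.size(0))
    return 0.5 * (F.cross_entropy(scores, labels) +
                  F.cross_entropy(scores.t(), labels))


# Hypothetical usage: a batch of 2 videos with random trajectory/noun features.
loss = alignment_loss([torch.randn(4, 256), torch.randn(5, 256)],
                      [torch.randn(3, 256), torch.randn(6, 256)])
```

This is only one plausible form of fine-grained alignment; the paper's actual dynamic object-text alignment and spatial-temporal action set prediction objectives are described in the full text.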

