VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation

05/04/2023
by Xilun Chen, et al.

We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering. A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and is then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. As a result, our VideoOFA model achieves new state-of-the-art performance on four video captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr score. It also outperforms existing models on two open-ended video question answering datasets, showcasing its generalization capability as a universal video-to-text model.

research
05/22/2023

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Large-scale image-text contrastive pre-training models, such as CLIP, ha...
research
04/01/2021

CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning

This work concerns video-language pre-training and representation learni...
research
10/13/2021

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sent...
research
05/06/2022

Dual-Level Decoupled Transformer for Video Captioning

Video captioning aims to understand the spatio-temporal semantic concept...
research
12/06/2022

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

We present a simple approach which can turn a ViT encoder into an effici...
research
04/12/2023

CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes

Training models to apply linguistic knowledge and visual concepts from 2...
research
05/27/2022

GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transforme...
