End-to-end Generative Pretraining for Multimodal Video Captioning

by   Paul Hongsuck Seo, et al.

Recent video-and-language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos that can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, ours jointly trains a multimodal video encoder and a sentence decoder. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
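The bidirectional generation objective can be sketched in toy form as follows. This is a purely illustrative sketch, not the paper's implementation: all function names are hypothetical, the "encoder" and "decoder" are trivial stand-ins for transformer models, and the token-mismatch count stands in for the negative log-likelihood loss used in practice.

```python
# Hypothetical sketch of MV-GPT's bidirectional generation objective.
# The real model uses transformer encoder-decoders trained with an NLL loss;
# here, toy callables and a mismatch count illustrate only the loss structure.

def caption_loss(decoder, context, target_utterance):
    """Toy surrogate loss: token mismatches between the decode and the target."""
    prediction = decoder(context)
    return sum(p != t for p, t in zip(prediction, target_utterance))

def bidirectional_pretraining_loss(encoder, decoder,
                                   frames, present_utterance, future_utterance):
    # Forward direction: encode present frames + present speech,
    # then generate the FUTURE utterance.
    fwd_context = encoder(frames, present_utterance)
    loss_fwd = caption_loss(decoder, fwd_context, future_utterance)
    # Backward direction: encode frames + future speech,
    # then generate the PRESENT utterance.
    bwd_context = encoder(frames, future_utterance)
    loss_bwd = caption_loss(decoder, bwd_context, present_utterance)
    # Both directions are optimised jointly, end-to-end.
    return loss_fwd + loss_bwd

# Trivial toy encoder/decoder, for demonstration only.
toy_encoder = lambda frames, text: list(text)  # "context" is just the token list
toy_decoder = lambda context: context          # "decodes" the context verbatim

loss = bidirectional_pretraining_loss(
    toy_encoder, toy_decoder,
    frames=["frame_1", "frame_2"],
    present_utterance=["add", "the", "flour"],
    future_utterance=["mix", "the", "batter"],
)
```

The key structural point the sketch captures is that a single encoder-decoder pair is supervised in both temporal directions, so the unlabelled transcript itself provides the generation targets.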

