End-to-end Generative Pretraining for Multimodal Video Captioning

01/20/2022
by Paul Hongsuck Seo, et al.

Recent video-and-language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos that can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, ours jointly trains a multimodal video encoder and a sentence decoder. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
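The abstract gives enough detail to sketch the bidirectional generation objective in code. Below is a minimal, hypothetical PyTorch sketch: the class and function names (MultimodalEncoderDecoder, bidirectional_loss), the feature dimensions, and the use of a plain nn.Transformer are illustrative assumptions, not the authors' actual architecture. What it shows is the core idea: a single shared encoder-decoder is trained in both directions, predicting the future utterance from the present multimodal context and the present utterance from future observations.

```python
# Minimal sketch of the bidirectional generation objective described in the
# abstract. Class names, dimensions, and the nn.Transformer backbone are
# illustrative assumptions, not MV-GPT's actual implementation.
import torch
import torch.nn as nn


class MultimodalEncoderDecoder(nn.Module):
    """Hypothetical joint video+utterance encoder with a sentence decoder."""

    def __init__(self, vocab_size=30522, feat_dim=2048, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(feat_dim, d_model)  # assumed frame-feature dim
        self.transformer = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, context_ids, target_ids):
        # Encoder input: projected frame features concatenated with the
        # embedded transcribed utterance (the "multimodal context").
        src = torch.cat(
            [self.video_proj(video_feats), self.text_emb(context_ids)], dim=1
        )
        tgt = self.text_emb(target_ids)
        # Standard causal mask so the decoder generates left to right.
        t = tgt.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.lm_head(out)


def bidirectional_loss(model, video_feats, present_ids, future_ids):
    """Sum the two generation directions of the pretraining objective."""
    ce = nn.CrossEntropyLoss()
    # Forward direction: present multimodal context -> future utterance.
    logits_f = model(video_feats, present_ids, future_ids[:, :-1])
    loss_f = ce(logits_f.reshape(-1, logits_f.size(-1)), future_ids[:, 1:].reshape(-1))
    # Backward direction: future observations -> present utterance.
    logits_b = model(video_feats, future_ids, present_ids[:, :-1])
    loss_b = ce(logits_b.reshape(-1, logits_b.size(-1)), present_ids[:, 1:].reshape(-1))
    return loss_f + loss_b


# Toy usage with random inputs: 2 clips, 8 frames each, 12-token utterances.
model = MultimodalEncoderDecoder()
video = torch.randn(2, 8, 2048)
present = torch.randint(0, 30522, (2, 12))
future = torch.randint(0, 30522, (2, 12))
bidirectional_loss(model, video, present, future).backward()
```

Because both directions flow through the same encoder-decoder, the sentence decoder is pretrained end-to-end alongside the multimodal encoder rather than attached only at fine-tuning time, which is the property the abstract highlights as the difference from prior video-language pretraining frameworks.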


04/01/2022 · Learning Audio-Video Modalities from Image Captions
A major challenge in text-video and text-audio retrieval is the lack of ...

01/28/2021 · VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
We present Vx2Text, a framework for text generation from multimodal inpu...

11/10/2020 · Multimodal Pretraining for Dense Video Captioning
Learning specific hands-on skills such as cooking, car maintenance, and ...

12/10/2020 · Look Before you Speak: Visually Contextualized Utterances
While most conversational AI systems focus on textual dialogue only, con...

08/31/2017 · Video Captioning with Guidance of Multimodal Latent Topics
The topic diversity of open-domain videos leads to various vocabularies ...

05/22/2022 · Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
The goal of this work is to build flexible video-language models that ca...

11/23/2020 · TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
Understanding videos is challenging in computer vision. In particular, t...