Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

01/28/2021
by   Xudong Lin, et al.

We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability of tokenization on continuous inputs (e.g., video or audio), we utilize a relaxation scheme that enables end-to-end training. Furthermore, unlike prior encoder-only models, our network includes an autoregressive decoder to generate open-ended text from the multimodal embeddings fused by the language encoder. This renders our approach fully generative and makes it directly applicable to different "video+x to text" problems without the need to design specialized network heads for each task. The proposed framework is not only conceptually simple but also remarkably effective: experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks – captioning, question answering and audio-visual scene-aware dialog.
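The key trick described above is making tokenization of continuous inputs differentiable: a hard argmax over a token vocabulary blocks gradients, so a softmax-style relaxation (in the spirit of Gumbel-softmax) is used instead, producing a convex combination of language embeddings through which the modality encoder can be trained end to end. The sketch below illustrates this idea; the function and variable names (`soft_tokenize`, `classifier`, `vocab_embeddings`) are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def soft_tokenize(features, vocab_embeddings, classifier, tau=1.0):
    """Differentiable 'tokenization' of continuous modality features
    (e.g. video or audio clips) into the language embedding space.

    A hard argmax over the vocabulary is non-differentiable; a
    temperature-scaled softmax yields a soft mixture of vocabulary
    embeddings instead, so gradients flow back into the encoder.

    features:         (batch, feat_dim)  continuous modality features
    vocab_embeddings: (vocab, emb_dim)   language-model token embeddings
    classifier:       linear layer mapping feat_dim -> vocab logits
    """
    logits = classifier(features)             # (batch, vocab)
    probs = F.softmax(logits / tau, dim=-1)   # relaxed token choice
    return probs @ vocab_embeddings           # (batch, emb_dim)

# Toy usage: soft-tokenize a batch of 4 "video" feature vectors.
torch.manual_seed(0)
vocab, emb_dim, feat_dim = 100, 32, 64
classifier = torch.nn.Linear(feat_dim, vocab)
vocab_emb = torch.randn(vocab, emb_dim)
video_feats = torch.randn(4, feat_dim, requires_grad=True)
tokens = soft_tokenize(video_feats, vocab_emb, classifier, tau=0.5)
tokens.sum().backward()  # gradients reach the continuous input
```

The resulting soft embeddings can be concatenated with ordinary text-token embeddings and fed to a standard language encoder, which is what lets fusion happen entirely in the language space.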



Related research

07/11/2023 – Generative Pretraining in Multimodality
We present Emu, a Transformer-based multimodal foundation model, which c...

04/17/2023 – VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...

10/26/2022 – End-to-End Multimodal Representation Learning for Video Dialog
Video-based dialog task is a challenging multimodal learning task that h...

01/17/2021 – Narration Generation for Cartoon Videos
Research on text generation from multimodal inputs has largely focused o...

09/07/2021 – Predicting Mood Disorder Symptoms with Remotely Collected Videos Using an Interpretable Multimodal Dynamic Attention Fusion Network
We developed a novel, interpretable multimodal classification method to ...

09/06/2021 – Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Multimodal abstractive summarization (MAS) models that summarize videos ...

08/08/2023 – OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
This paper presents OmniDataComposer, an innovative approach for multimo...
