Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

09/06/2021
by Tiezheng Yu, et al.

Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs' powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information? In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task, using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our visual guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.
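The fusion mechanism described in the abstract, attention-based add-on layers that let text hidden states attend over video features, can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumptions, not the authors' released implementation: the class name `VisionGuidedLayer`, the gated residual fusion, and the feature dimensions are hypothetical choices for demonstration.

```python
# Minimal sketch (assumed, not the paper's code) of an attention-based
# add-on layer: text hidden states from a GPLM encoder layer attend over
# projected video features, and a learned gate controls how much visual
# signal enters the residual stream.
import torch
import torch.nn as nn

class VisionGuidedLayer(nn.Module):  # hypothetical name
    def __init__(self, d_text: int, d_vision: int, n_heads: int = 8):
        super().__init__()
        # Project video features into the GPLM hidden size.
        self.vision_proj = nn.Linear(d_vision, d_text)
        # Text-to-vision cross-modal multi-head attention.
        self.cross_attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        # Gate that decides, per position and dimension, how much
        # visual information to mix into the text representation.
        self.gate = nn.Linear(2 * d_text, d_text)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden:  (batch, text_len, d_text), e.g. BART encoder states
        # vision_feats: (batch, video_len, d_vision), e.g. 3D-CNN features
        v = self.vision_proj(vision_feats)
        attn_out, _ = self.cross_attn(query=text_hidden, key=v, value=v)
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, attn_out], dim=-1)))
        # Gated residual: the pre-trained text pathway stays dominant,
        # one plausible way to inject vision without hurting the GPLM's
        # generation ability (research question 1).
        return self.norm(text_hidden + g * attn_out)

# Toy usage with made-up shapes:
layer = VisionGuidedLayer(d_text=768, d_vision=2048)
text = torch.randn(2, 50, 768)      # transcript token states
video = torch.randn(2, 30, 2048)    # video segment features
fused = layer(text, video)          # (2, 50, 768)
```

Where such a block is inserted, i.e. after which encoder or decoder layer, is exactly research question 2; the ablation studies mentioned above compare fusion methods and fusion locations, and a block like this could in principle be attached at any layer to run that comparison.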

