Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization

08/24/2022
by   Xinnian Liang, et al.

Most current multi-modal summarization methods follow a cascaded approach: an off-the-shelf object detector first extracts visual features, and these features are then fused with language representations to generate the summary with an encoder-decoder model. This cascaded approach cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities: the reordering task guides the model to learn paragraph-level semantic alignment, and the selection task guides the model to select summary-related images for the final summary. Experimental results show that ViL-Sum significantly outperforms current state-of-the-art methods. Further analysis shows that the two tasks and the joint multi-modal encoder effectively guide the model to learn reasonable paragraph-image and summary-image relations.
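The abstract does not say how training examples for the image reordering task are built. The sketch below shows one plausible construction, under the assumption that the images of a document are shuffled and the model is asked to predict each image's original position. The function names `make_reordering_example` and `restore_order` are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def make_reordering_example(
    image_ids: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Shuffle images and return (shuffled_images, original_positions).

    original_positions[i] is the index that shuffled_images[i] had in the
    original document order. Under the assumption stated above, this is the
    label a reordering head would be trained to predict.
    """
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    order = list(range(len(image_ids)))
    rng.shuffle(order)
    shuffled = [image_ids[i] for i in order]
    return shuffled, order


def restore_order(shuffled: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Invert the shuffle given gold or predicted original positions."""
    restored = [""] * len(shuffled)
    for item, pos in zip(shuffled, positions):
        restored[pos] = item
    return restored
```

With gold labels, `restore_order` recovers the original image sequence. With predicted labels, the same function could serve to check how closely a model's ordering matches the document order.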
