CFSum: A Coarse-to-Fine Contribution Network for Multimodal Summarization

07/06/2023
by Min Xiao, et al.

Multimodal summarization often suffers from an unclear contribution of the visual modality. Existing multimodal summarization approaches focus on designing fusion methods for the different modalities, while ignoring the conditions under which visual information is actually useful. We therefore propose a novel Coarse-to-Fine contribution network for multimodal Summarization (CFSum), which accounts for the different contributions that images make to summarization. First, to eliminate the interference of useless images, we propose a pre-filter module that discards them. Second, to make accurate use of useful images, we propose two visual complement modules, one at the word level and one at the phrase level. Specifically, image contributions are calculated and used to guide the attention over both the textual and the visual modalities. Experimental results show that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Furthermore, our analysis verifies that useful images can even help generate non-visual words that are only implicitly represented in the image.
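The abstract outlines a two-stage design (a pre-filter that drops unhelpful images, followed by contribution-guided visual complement modules), but the paper's implementation is not reproduced on this page. The PyTorch sketch below is only a rough, hypothetical illustration of how such a pipeline could be wired together; the module names (ImagePreFilter, ContributionGuidedAttention), the sigmoid-score threshold, and the way contribution scores rescale the attention keys are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ImagePreFilter(nn.Module):
    """Scores each image's usefulness against the text and masks out
    images whose score falls below a threshold (hypothetical design)."""
    def __init__(self, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(2 * d_model, 1)
        self.threshold = threshold

    def forward(self, text_repr, image_feats):
        # text_repr: (batch, d_model) pooled text representation
        # image_feats: (batch, n_images, d_model)
        text_exp = text_repr.unsqueeze(1).expand_as(image_feats)
        pair = torch.cat([image_feats, text_exp], dim=-1)
        scores = torch.sigmoid(self.scorer(pair)).squeeze(-1)   # (batch, n_images)
        keep_mask = (scores >= self.threshold).float()
        filtered = image_feats * keep_mask.unsqueeze(-1)        # zero out useless images
        return filtered, scores, keep_mask

class ContributionGuidedAttention(nn.Module):
    """Word-level visual complement: cross-attention from text tokens to
    images, with the attention keys rescaled by per-image contribution scores."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, image_feats, contribution):
        # contribution: (batch, n_images), acts as a soft bias on the keys
        keys = image_feats * contribution.unsqueeze(-1)
        visual_ctx, _ = self.attn(text_tokens, keys, image_feats)
        return self.norm(text_tokens + visual_ctx)

if __name__ == "__main__":
    d = 256
    text_tokens = torch.randn(2, 30, d)     # token-level text features
    pooled_text = text_tokens.mean(dim=1)   # crude pooled text representation
    images = torch.randn(2, 4, d)           # features of 4 candidate images

    prefilter = ImagePreFilter(d)
    filtered, scores, mask = prefilter(pooled_text, images)

    complement = ContributionGuidedAttention(d)
    fused = complement(text_tokens, filtered, scores * mask)
    print(fused.shape)  # torch.Size([2, 30, 256])
```

In this toy setup the pre-filter zeroes out images scoring below the threshold, and the word-level complement reuses the surviving scores to down-weight the attention keys of weaker images; the phrase-level module described in the abstract would presumably follow the same pattern over phrase spans rather than individual tokens.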


Related research

12/16/2021 · Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization
Multimodal summarization with multimodal output (MSMO) generates a summa...

08/11/2021 · Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference
Multimodal abstractive summarization with sentence output is to generate...

02/20/2023 · CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization
Multimodal summarization (MS) aims to generate a summary from multimodal...

05/24/2023 · Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion
Video multimodal fusion aims to integrate multimodal signals in videos, ...

05/17/2021 · AudioVisual Video Summarization
Audio and vision are two main modalities in video data. Multimodal learn...

11/28/2018 · Exploiting "Quantum-like Interference" in Decision Fusion for Ranking Multimodal Documents
Fusing and ranking multimodal information remains always a challenging t...

06/03/2020 · M2P2: Multimodal Persuasion Prediction using Adaptive Fusion
Identifying persuasive speakers in an adversarial environment is a criti...
