Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference

08/11/2021
by   Zijian Zhang, et al.

Multimodal abstractive summarization with sentence output aims to generate a textual summary given a multimodal triad of sentence, image, and audio, which has been shown to improve user satisfaction and convenience. Existing approaches focus mainly on enhancing multimodal fusion while ignoring the misalignment among the multiple inputs and the varying importance of different segments within the fused feature, resulting in superfluous multimodal interaction. To alleviate these problems, we propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities (via a low-level cross-modal interaction module) and the respective characteristics within a single fusion feature (via a high-level selective routing module). In detail, it first aligns the inputs from different sources and then adopts a divide-and-conquer strategy to highlight or de-emphasize parts of the multimodal fusion representation; this can be seen as a sparse feed-forward model in which different groups of parameters are activated for different feature segments. We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and training-from-scratch strategies. Experimental results on the MSMO dataset demonstrate that our model outperforms state-of-the-art baselines in terms of ROUGE, relevance scores, and human evaluation.
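As a rough illustration of the high-level selective routing idea described above, the hedged sketch below treats each position of the fused multimodal representation as a segment and routes it to one of several feed-forward parameter groups, so only a subset of parameters is active per segment. This is a minimal sketch under that assumption; all names (SelectiveRouting, num_experts, etc.) are hypothetical and not taken from the paper's implementation.

```python
# Illustrative sketch of segment-wise selective routing over a fused
# multimodal feature: a router scores each segment and a single expert
# (group of feed-forward parameters) is activated for it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveRouting(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each segment
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model * 2),
                           nn.ReLU(),
                           nn.Linear(d_model * 2, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, seq_len, d_model) multimodal fusion representation;
        # each position is treated as one "segment".
        gate = F.softmax(self.router(fused), dim=-1)  # (B, L, num_experts)
        top1 = gate.argmax(dim=-1)                    # hard top-1 routing
        out = torch.zeros_like(fused)
        for e, expert in enumerate(self.experts):
            mask = (top1 == e).unsqueeze(-1)          # segments routed to expert e
            # scale by the gate value so emphasized segments are highlighted
            out = out + mask * gate[..., e:e + 1] * expert(fused)
        return out


if __name__ == "__main__":
    x = torch.randn(2, 10, 512)               # toy fused multimodal features
    print(SelectiveRouting(512)(x).shape)     # torch.Size([2, 10, 512])
```

Top-1 hard routing is used here only to make the "different parameters for different segments" behavior explicit; a soft or top-k gate would work equally well in this sketch.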

