Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference

by Zijian Zhang, et al.

Multimodal abstractive summarization with sentence output generates a textual summary from a multimodal triad – sentence, image and audio – and has been shown to improve user satisfaction and convenience. Existing approaches focus mainly on enhancing multimodal fusion while ignoring the misalignment among the multiple inputs and the differing importance of segments within the fused feature, which results in superfluous multimodal interaction. To alleviate these problems, we propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities (through a low-level cross-modal interaction module) and the respective characteristics within the fused feature (through a high-level selective routing module). In detail, it first aligns the inputs from the different sources and then adopts a divide-and-conquer strategy to highlight or de-emphasize the multimodal fusion representation; this can be seen as a sparse feed-forward model in which different groups of parameters are activated for different feature segments. We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and training-from-scratch strategies. Further experimental results on MSMO demonstrate that our model outperforms state-of-the-art baselines in terms of ROUGE, relevance scores and human evaluation.
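The high-level selective routing module described above resembles a mixture-of-experts layer: each segment of the fused multimodal feature is scored by a gate and dispatched to one group of feed-forward parameters, so only a subset of the network is active per segment. The sketch below is a minimal illustration of that idea, not the paper's implementation; all names (`SelectiveRouter`, `n_experts`, the top-1 dispatch and gate-score scaling) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SelectiveRouter:
    """Toy top-1 selective routing over fused feature segments.

    A gate scores each segment against `n_experts` feed-forward parameter
    groups; the winning expert transforms the segment, and the gate score
    scales the result (highlighting or de-emphasizing that segment).
    Hypothetical sketch, not the mhsf implementation.
    """
    def __init__(self, d_model, n_experts, rng):
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w1 = rng.standard_normal((n_experts, d_model, 2 * d_model)) * 0.02
        self.w2 = rng.standard_normal((n_experts, 2 * d_model, d_model)) * 0.02

    def __call__(self, x):
        # x: (n_segments, d_model) segments of the fused multimodal feature
        scores = softmax(x @ self.gate)      # (n_segments, n_experts)
        choice = scores.argmax(-1)           # top-1 expert per segment
        out = np.zeros_like(x)
        for e in range(self.w1.shape[0]):
            idx = np.where(choice == e)[0]   # segments routed to expert e
            if idx.size == 0:
                continue
            h = np.maximum(x[idx] @ self.w1[e], 0.0)     # expert FFN, ReLU
            # gate score re-weights the output: "highlight or de-emphasize"
            out[idx] = (h @ self.w2[e]) * scores[idx, e:e + 1]
        return out, choice

router = SelectiveRouter(d_model=16, n_experts=4, rng=rng)
segments = rng.standard_normal((10, 16))
routed, choice = router(segments)
print(routed.shape, choice)
```

Because each segment touches only one expert's weights, the layer is sparsely activated: the parameter count grows with `n_experts` while per-segment compute stays roughly that of a single feed-forward block.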






