Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion

05/24/2023
by   Shaoxaing Wu, et al.
0

Video multimodal fusion aims to integrate multimodal signals in videos, such as visual, audio and text, to make a complementary prediction with multiple modalities contents. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both visual and audio modalities. Prior denoising methods like forget gate are coarse in the granularity of noise filtering. They often suppress the redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restrained receptive field. On the other hand, we use a mutual information maximization module to regulate the filter-out module to preserve key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. It proves that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at https://github.com/WSXRHFG/DBF.

READ FULL TEXT

page 2

page 4

page 8

research
10/31/2022

Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations

Learning effective joint embedding for cross-modal data has always been ...
research
09/01/2021

Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis

In multimodal sentiment analysis (MSA), the performance of a model highl...
research
06/16/2018

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Multimodal sentiment analysis is a very actively growing field of resear...
research
11/22/2017

Integrating both Visual and Audio Cues for Enhanced Video Caption

Video caption refers to generating a descriptive sentence for a specific...
research
07/06/2023

CFSum: A Coarse-to-Fine Contribution Network for Multimodal Summarization

Multimodal summarization usually suffers from the problem that the contr...
research
09/09/2022

MIntRec: A New Dataset for Multimodal Intent Recognition

Multimodal intent recognition is a significant task for understanding hu...
research
04/04/2018

Fine-grained Video Attractiveness Prediction Using Multimodal Deep Learning on a Large Real-world Dataset

Nowadays, billions of videos are online ready to be viewed and shared. A...

Please sign up or login with your details

Forgot password? Click here to reset