Multimodal Understanding Through Correlation Maximization and Minimization

05/04/2023
by Yifeng Shi, et al.

Multimodal learning has mainly focused on training large models on, and fusing feature representations from, different modalities to improve performance on downstream tasks. In this work, we take a detour from this trend and study the intrinsic nature of multimodal data by asking two questions: 1) Can we learn more structured latent representations of general multimodal data? 2) Can we intuitively understand, both mathematically and visually, what the latent representations capture? To answer 1), we propose a general and lightweight framework, Multimodal Understanding Through Correlation Maximization and Minimization (MUCMM), that can be incorporated into any large pre-trained network. MUCMM learns both common and individual representations: the common representations capture what is shared between the modalities, and the individual representations capture the aspects unique to each modality. To answer 2), we propose novel scores that summarize the learned common and individual structures, and we visualize the gradients of these scores with respect to the input to discern what the different representations capture. We further provide mathematical intuition for the computed gradients in a linear setting and demonstrate the effectiveness of our approach through a variety of experiments.
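The abstract does not state MUCMM's actual objective. As a hedged illustration of what "correlation maximization and minimization" can mean for common versus individual representations, the toy sketch below assumes two modalities, fixed linear one-dimensional projections, Pearson correlation, and a penalty on squared correlations. The names `pearson`, `project`, and `corr_max_min_loss` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        return 0.0  # a constant score carries no correlation signal
    return cov / (su * sv)

def project(rows: Sequence[Sequence[float]], w: Sequence[float]) -> List[float]:
    """Hypothetical linear 'encoder': one scalar score per sample."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in rows]

def corr_max_min_loss(x_rows, y_rows, a_common, b_common, a_indiv, b_indiv) -> float:
    """Toy objective (assumption, not the paper's loss):
    reward correlation between the two common scores, and penalize
    squared correlation between common and individual scores within each
    modality and between the two individual scores."""
    zx_c, zy_c = project(x_rows, a_common), project(y_rows, b_common)
    zx_i, zy_i = project(x_rows, a_indiv), project(y_rows, b_indiv)
    maximize = pearson(zx_c, zy_c)
    minimize = (pearson(zx_c, zx_i) ** 2
                + pearson(zy_c, zy_i) ** 2
                + pearson(zx_i, zy_i) ** 2)
    return -maximize + minimize

# Toy data: both modalities share `shared`; each also has a private signal.
# The three signals are mutually uncorrelated by construction.
shared = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
priv_x = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
priv_y = [1.0, 1.0, -2.0, -2.0, 1.0, 1.0]
x_rows = [[s, p] for s, p in zip(shared, priv_x)]
y_rows = [[s, q] for s, q in zip(shared, priv_y)]

# Common scores read the shared column, individual scores the private one.
good = corr_max_min_loss(x_rows, y_rows, [1, 0], [1, 0], [0, 1], [0, 1])
# Roles swapped: common scores read private signals.
bad = corr_max_min_loss(x_rows, y_rows, [0, 1], [0, 1], [1, 0], [1, 0])
```

With these hand-picked projections, the loss is about -1 when the common scores track the shared signal and about 1 when the roles are swapped. In practice the projections would be learned (for example, by gradient descent on the outputs of a pre-trained network), which this sketch does not attempt. Squaring the penalized correlations makes the minimization indifferent to sign, so a strong negative correlation is penalized as much as a strong positive one.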
