Learning Multimodal VAEs through Mutual Supervision

06/23/2021
by   Tom Joy, et al.

Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g. vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, MEME, which avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing – something that most existing approaches either cannot handle, or do so only to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image-image) and CUB (image-text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data.
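To make the contrast concrete, the "explicit product" combination that the abstract argues against can be sketched as a product of Gaussian experts: each modality's recognition network produces a unimodal posterior q(z|x_m), and the joint posterior is formed by adding precisions and precision-weighting the means. This is an illustrative sketch of that baseline scheme (not code from the paper, and the function name is our own):

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Combine per-modality Gaussian posteriors q(z|x_m) via a
    product of experts: precisions add, and the joint mean is the
    precision-weighted average of the experts' means.
    Illustrative only -- MEME avoids this explicit combination."""
    precisions = [np.exp(-lv) for lv in logvars]  # 1 / sigma^2 per expert
    total_precision = sum(precisions)
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * sum(p * m for p, m in zip(precisions, mus))
    return joint_mu, np.log(joint_var)

# Two modalities' unimodal posteriors over a 2-D latent, both unit variance
mu_img, lv_img = np.array([0.0, 1.0]), np.zeros(2)
mu_txt, lv_txt = np.array([2.0, 1.0]), np.zeros(2)
mu, lv = product_of_experts([mu_img, mu_txt], [lv_img, lv_txt])
# with equal precisions, the joint mean is the average and the variance halves
```

Note how a missing modality must be handled specially under this scheme (its expert simply drops out of the product), whereas MEME's mutual-supervision formulation treats partial observation natively.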


