Taxonomy of multimodal self-supervised representation learning

12/25/2020

Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors and are combined based on the factors they share. This perceptual system has motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We show empirically, on two versions of multimodal MNIST and on a multimodal brain imaging dataset, that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task, and (3) maximizing the similarity between representations has a regularizing effect on a neural network, which can sometimes reduce downstream performance but can still reveal multimodal relations. Consequently, we outperform previous unsupervised encoder-decoder methods based on CCA or on variational mixtures (MMVAE) on various datasets under the linear evaluation protocol.
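As a rough illustration of what "optimizing a similarity metric between model components" can look like for two modalities, the sketch below computes a symmetric InfoNCE-style contrastive loss between paired embeddings. This is not the authors' implementation: the choice of InfoNCE, cosine similarity, the temperature value, and all names (`cosine`, `info_nce`, `symmetric_info_nce`, `z_mod1`, `z_mod2`) are assumptions made for illustration, and the embeddings are synthetic stand-ins for encoder outputs.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match candidates[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive (same-index) pair.
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_info_nce(z_a, z_b, temperature=0.1):
    """Average of the two directions (modality A to B and B to A)."""
    return 0.5 * (info_nce(z_a, z_b, temperature) + info_nce(z_b, z_a, temperature))


def noisy_copy(rows, scale, rng):
    """Hypothetical stand-in for a modality-specific encoder output."""
    return [[x + scale * rng.gauss(0.0, 1.0) for x in row] for row in rows]


rng = random.Random(0)
# Shared factors seen by both modalities (synthetic, for illustration only).
shared = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
z_mod1 = noisy_copy(shared, 0.01, rng)
z_mod2 = noisy_copy(shared, 0.01, rng)

loss_aligned = symmetric_info_nce(z_mod1, z_mod2)
# Reversing one modality breaks the pairing, so the loss should increase.
loss_shuffled = symmetric_info_nce(z_mod1, z_mod2[::-1])
print(loss_aligned, loss_shuffled)
```

In this toy setting the correctly paired embeddings yield a lower loss than the mispaired ones, which is the signal a cross-modal contrastive objective of this kind would use to pull together representations of the same sample across modalities.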

Related Research

On self-supervised multi-modal representation learning: An application to Alzheimer's disease

Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Self-supervised multimodal neuroimaging yields predictive representations for a spectrum of Alzheimer's phenotypes

Understanding Collapse in Non-Contrastive Siamese Representation Learning

Can representation learning for multimodal image registration be improved by supervision of intermediate layers?

Rethinking 360° Image Visual Attention Modelling with Unsupervised Learning