Taxonomy of multimodal self-supervised representation learning

by Alex Fedorov, et al.

Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors and are combined based on the factors they share. This mechanism motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We show empirically, on two versions of multimodal MNIST and on a multimodal brain-imaging dataset, that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart; (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task; (3) maximizing the similarity between representations has a regularizing effect on a neural network, which can sometimes reduce downstream performance yet still reveal multimodal relations. Consequently, we outperform previous unsupervised encoder-decoder methods based on CCA or the variational mixture MMVAE on several datasets under the linear evaluation protocol.
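To make the "similarity metrics between model components" concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective between the embeddings of two modality encoders. This is an illustrative NumPy implementation under generic assumptions (L2-normalized embeddings, positives on the diagonal of the cross-modal similarity matrix), not the authors' exact code; the function names `info_nce` and `_cross_entropy_diag` are hypothetical.

```python
import numpy as np

def _cross_entropy_diag(logits):
    """Row-wise cross-entropy where the positive pair sits on the diagonal."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of modality embeddings.

    z1, z2: (batch, dim) arrays from two modality encoders;
    matching rows are treated as positive pairs.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (batch, batch) cosine similarities
    # Average both anchor directions: modality 1 -> 2 and modality 2 -> 1.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits.T))
```

In the taxonomy of the paper, the choice of which encoder outputs (or fused components) play the roles of `z1` and `z2`, and how several such objectives are composed, is exactly what distinguishes the different multimodal self-supervised methods.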




Code Repositories


Fusion is a self-supervised framework for data with multiple sources; in particular, it aims to support neuroimaging applications.

