Self-Supervised Multimodal Learning: A Survey

03/31/2023
by Yongshuo Zong, et al.

Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to leverage supervision from raw multimodal data. In this survey, we provide a comprehensive review of the state of the art in SSML, which we categorize along three orthogonal axes: objective functions, data alignment, and model architectures. These axes correspond to the inherent characteristics of self-supervised learning methods and multimodal data. Specifically, we classify training objectives into instance discrimination, clustering, and masked prediction categories. We also discuss multimodal input data pairing and alignment strategies during training. In addition, we review model architectures, including the design of encoders, fusion modules, and decoders, which are essential components of SSML methods. We then review downstream multimodal application tasks, reporting the concrete performance of the state-of-the-art image-text models and multimodal video models, and also review real-world applications of SSML algorithms in diverse fields such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML. A collection of related resources can be found at: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning.

research
06/06/2022

Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Recently, Self-Supervised Representation Learning (SSRL) has attracted m...
research
03/06/2020

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

One of the key factors of enabling machine learning models to comprehend...
research
06/18/2022

Self-Supervised Learning for Videos: A Survey

The remarkable success of deep learning in various domains relies on the...
research
10/18/2021

Self-Supervised Representation Learning: Introduction, Advances and Challenges

Self-supervised representation learning methods aim to provide powerful ...
research
05/06/2021

Generalized Multimodal ELBO

Multiple data types naturally co-occur when describing real-world phenom...
research
07/29/2021

UIBert: Learning Generic Multimodal Representations for UI Understanding

To improve the accessibility of smart devices and to simplify their usag...
research
11/04/2021

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

We consider the use of automated supervised learning systems for data ta...
