Training Transitive and Commutative Multimodal Transformers with LoReTTa

05/23/2023
by Manuel Tran et al.

Collecting a multimodal dataset with two paired modalities A and B or B and C is difficult in practice. Obtaining a dataset with three aligned modalities A, B, and C is even more challenging. For example, some public medical datasets have only genetic sequences and microscopic images for one patient, and only genetic sequences and radiological images for another, but no dataset includes both microscopic and radiological images for the same patient. This makes it difficult to integrate all modalities into a single large pre-trained neural network. We introduce LoReTTa (Linking mOdalities with a tRansitive and commutativE pre-Training sTrAtegy) to address this understudied problem. Our self-supervised framework combines causal masked modeling with the rules of commutativity and transitivity to transition within and between different modalities. Thus, it can model the relation A -> C via A -> B -> C. Given a dataset containing only the disjoint combinations (A, B) and (B, C), we show that a transformer pre-trained with LoReTTa can handle any modality combination at inference time, including the never-seen pair (A, C) and the triplet (A, B, C). We evaluate our approach on a multimodal dataset derived from MNIST containing speech, vision, and language, as well as a real-world medical dataset containing mRNA, miRNA, and RPPA samples from TCGA. Compared to traditional pre-training methods, we observe up to a 100-point reduction in perplexity for autoregressive generation tasks and up to a 15-point gain in classification accuracy for modality pairs never seen together during pre-training.
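To make the commutative and transitive rules concrete, here is a minimal, hypothetical sketch of how training sequences might be built from disjoint (A, B) and (B, C) pairs. The data structures, separator tokens, and the `generate_a_from_b` hook are illustrative assumptions, not the authors' actual implementation.

```python
# Toy stand-ins for tokenized multimodal samples: each sample maps a
# modality name to a list of tokens (assumed format, for illustration).
paired_ab = [{"A": [1, 2], "B": [3, 4]}]
paired_bc = [{"B": [5, 6], "C": [7, 8]}]

# Hypothetical modality separator tokens marking where each block starts.
SEP = {"A": "<A>", "B": "<B>", "C": "<C>"}

def make_sequence(sample, order):
    """Concatenate modality blocks in the given order for causal modeling,
    prefixing each block with its modality token."""
    seq = []
    for m in order:
        seq.append(SEP[m])
        seq.extend(sample[m])
    return seq

def training_sequences(sample):
    """Commutativity: both orderings of a pair are valid training
    sequences, so the model learns A -> B as well as B -> A."""
    mods = list(sample)
    return [make_sequence(sample, mods), make_sequence(sample, mods[::-1])]

def pseudo_pair(sample_bc, generate_a_from_b):
    """Transitivity (sketch): once the model can map B -> A, a (B, C)
    sample yields a synthetic (A, C) pair, linking the never-seen
    combination. `generate_a_from_b` stands in for model sampling."""
    a_hat = generate_a_from_b(sample_bc["B"])
    return {"A": a_hat, "C": sample_bc["C"]}
```

For example, `training_sequences(paired_ab[0])` yields both `["<A>", 1, 2, "<B>", 3, 4]` and its reversed-block counterpart, and `pseudo_pair` stitches a synthetic (A, C) sample from a (B, C) one.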

Related research

11/01/2021 · Masking Modalities for Cross-modal Video Retrieval
Pre-training on large scale unlabelled datasets has shown impressive per...

10/30/2022 · token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
Self-supervised pre-training has been successful in both text and speech...

11/03/2022 · Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Self-supervised pre-training recently demonstrates success on large-scal...

09/06/2023 · Gene-induced Multimodal Pre-training for Image-omic Classification
Histology analysis of the tumor micro-environment integrated with genomi...

09/12/2021 · TEASEL: A Transformer-Based Speech-Prefixed Language Model
Multimodal language analysis is a burgeoning field of NLP that aims to s...

11/11/2020 · Improving Multimodal Accuracy Through Modality Pre-training and Attention
Training a multimodal network is challenging and it requires complex arc...

02/11/2023 · Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing
Recently, vision transformer (ViT) based multimodal learning methods hav...
