Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

06/06/2022
by Basil Mustafa, et al.

Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
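The abstract names two technical ingredients: a contrastive loss over paired image/text embeddings and an entropy-based regularizer that keeps expert routing stable and balanced. The JAX sketch below is a minimal, hypothetical illustration of those ideas, not the paper's implementation: contrastive_loss is a standard symmetric CLIP-style loss, and entropy_auxiliary_losses shows one way an entropy-based scheme can push each token toward confident routing while keeping overall expert usage even. All function names, shapes, and the temperature value are assumptions.

# Illustrative sketch only; assumed shapes: img_emb, txt_emb are [batch, dim],
# router_logits is [tokens, experts].
import jax
import jax.numpy as jnp

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature        # [B, B]; matching pairs on the diagonal
    loss_i2t = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits, axis=1)))
    loss_t2i = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits, axis=0)))
    return 0.5 * (loss_i2t + loss_t2i)

def entropy_auxiliary_losses(router_logits):
    """Entropy-based regularizers on MoE router probabilities.

    local_entropy:  mean entropy of each token's routing distribution
                    (low => each token commits to a few experts).
    global_entropy: entropy of the batch-averaged routing distribution
                    (high => experts are used evenly across the batch).
    A training objective would minimize the first and maximize the second.
    """
    probs = jax.nn.softmax(router_logits, axis=-1)            # [tokens, experts]
    local_entropy = -jnp.mean(jnp.sum(probs * jnp.log(probs + 1e-9), axis=-1))
    mean_probs = jnp.mean(probs, axis=0)                      # [experts]
    global_entropy = -jnp.sum(mean_probs * jnp.log(mean_probs + 1e-9))
    return local_entropy, global_entropy

A plausible objective would add small weights on such auxiliary terms to the contrastive loss, computed separately for image and text tokens in line with the abstract's per-modality framing; the exact losses and weights used by LIMoE are specified in the paper, not here.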

Related research

04/11/2022 · Mixture-of-experts VAEs can disregard variation in surjective multimodal data
Machine learning systems are often deployed in domains that entail data ...

05/10/2023 · Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
We present Integrated Multimodal Perception (IMP), a simple and scalable...

12/15/2022 · Image-and-Language Understanding from Pixels Only
Multimodal models are becoming increasingly effective, in part due to un...

07/26/2022 · Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Large-scale multi-modal contrastive pre-training has demonstrated great ...

03/08/2022 · Mutual Contrastive Learning to Disentangle Whole Slide Image Representations for Glioma Grading
Whole slide images (WSI) provide valuable phenotypic information for his...

02/13/2023 · Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data
Language-supervised vision models have recently attracted great attentio...

04/22/2022 · Balancing Expert Utilization in Mixture-of-Experts Layers Embedded in CNNs
This work addresses the problem of unbalanced expert utilization in spar...
