ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

10/04/2022
by Antonio Norelli, et al.

Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP trains both an image and a text encoder, while LiT manages to train just the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a comparatively modest number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined by the embedded pairs of all the entries in the multimodal dataset, in addition to the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions about their data efficiency and about the role of retrieval in machine learning.
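To make the idea concrete, below is a minimal NumPy sketch of the relative-representation mechanism the abstract describes: a query image or caption is represented by its similarities to the embedded anchor pairs of the multimodal dataset, sparsified and sharpened, and image-text similarity is then computed between these relative representations. Everything here is a stand-in, not the paper's implementation: the random embeddings replace real frozen encoder outputs, and the hyperparameter names and defaults (`k`, `p`) are illustrative assumptions.

```python
# Hypothetical sketch of the ASIF idea: align two frozen unimodal encoders
# through relative representations computed against a shared collection of
# embedded image-text anchor pairs. No network is trained.

import numpy as np

def l2_normalize(v, axis=-1, eps=1e-8):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def relative_rep(z, anchors, k=800, p=8):
    """Sparse relative representation of embedding z w.r.t. anchor embeddings.

    z:       (d,) unimodal embedding of the query (image or text)
    anchors: (n, d) unimodal embeddings of the n anchor entries, same modality
    k:       number of largest similarities kept (sparsification); assumption
    p:       exponent sharpening the kept similarities; assumption
    """
    sims = anchors @ z                     # cosine similarities (all inputs unit-norm)
    keep = np.argpartition(sims, -k)[-k:]  # indices of the k largest similarities
    rel = np.zeros_like(sims)
    rel[keep] = np.sign(sims[keep]) * np.abs(sims[keep]) ** p
    return l2_normalize(rel)

def asif_scores(image_emb, prompt_embs, img_anchors, txt_anchors, k=800, p=8):
    """Zero-shot scores: compare the image's relative representation with each
    candidate caption's relative representation in the shared anchor space."""
    r_img = relative_rep(image_emb, img_anchors, k, p)
    r_txt = np.stack([relative_rep(t, txt_anchors, k, p) for t in prompt_embs])
    return r_txt @ r_img                   # one score per candidate caption

# Toy usage with random stand-ins for real frozen-encoder outputs.
rng = np.random.default_rng(0)
n, d = 10_000, 512                                    # anchor pairs, embedding dim
img_anchors = l2_normalize(rng.normal(size=(n, d)))   # image side of the pairs
txt_anchors = l2_normalize(rng.normal(size=(n, d)))   # text side of the pairs
image_emb = l2_normalize(rng.normal(size=d))          # query image embedding
prompt_embs = l2_normalize(rng.normal(size=(5, d)))   # e.g. 5 class prompts
print(asif_scores(image_emb, prompt_embs, img_anchors, txt_anchors).shape)  # (5,)
```

The sketch also illustrates the memory/processing split the abstract highlights: the "model" is just the two anchor matrices plus the frozen encoders, so swapping in a different anchor dataset changes the model without any retraining.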

Related research

06/13/2023 · Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Contrastive visual language pretraining has emerged as a powerful method...

10/08/2022 · APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Recent advances in learning aligned multimodal representations have been...

10/12/2022 · That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data
Pretraining multimodal models on Electronic Health Records (EHRs) provid...

05/10/2023 · Text-To-Concept (and Back) via Cross-Model Alignment
We observe that the mapping between an image's representation in one mod...

04/27/2023 · DataComp: In search of the next generation of multimodal datasets
Large multimodal datasets have been instrumental in recent breakthroughs...

09/10/2021 · MURAL: Multimodal, Multitask Retrieval Across Languages
Both image-caption pairs and translation pairs provide the means to lear...

06/21/2023 · OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Large multimodal models trained on natural documents, which interleave i...
