APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

10/08/2022
by Elan Rosenfeld, et al.

Recent advances in learning aligned multimodal representations have been driven primarily by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) the state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a properly chosen alignment distribution, our method surpasses the prior state of the art for ImageNet zero-shot classification on public data while using two orders of magnitude less time and data and training 77% fewer parameters.


Related Research

ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training (10/04/2022)
Aligning the visual and language spaces requires training deep neural ne...

Natural Language Supervision for General-Purpose Audio Representations (09/11/2023)
Audio-Language models jointly learn multimodal text and audio representa...

Zero-Shot Learning from scratch (ZFS): leveraging local compositional representations (10/22/2020)
Zero-shot classification is a generalization task where no instance from...

CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations (02/08/2022)
Deriving multimodal representations of audio and lexical inputs is a cen...

GeNet: Deep Representations for Metagenomics (01/30/2019)
We introduce GeNet, a method for shotgun metagenomic classification from...

Higher-order Comparisons of Sentence Encoder Representations (09/01/2019)
Representational Similarity Analysis (RSA) is a technique developed by n...

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression (03/25/2023)
Recent vision architectures and self-supervised training methods enable ...
