ViT-Lens: Towards Omni-modal Representations

08/20/2023
by Weixian Lei, et al.

Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited by the need for large-scale data, which is expensive or even inapplicable to collect for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into the shared embedding space, where they are then processed by a strong ViT that carries pretrained image knowledge. The encoded multimodal representations are optimized to align with a modality-independent space pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of an increasing number of modalities, with two appealing benefits: (i) it exploits the pretrained ViT across tasks and domains effectively under an efficient data regime; (ii) emergent downstream capabilities of novel modalities arise from the shared alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, reaching 52.0% accuracy on Objaverse-LVIS and 87.4% on ModelNet40. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.
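The abstract outlines the core mechanism: a small, trainable lens projects signals from a new modality into the token space of a frozen, image-pretrained ViT, and the ViT's output is pulled toward a pre-defined embedding space (e.g., CLIP features) with a contrastive objective. Below is a minimal PyTorch sketch of that idea; the module names (`PointCloudLens`, `alignment_loss`), layer sizes, and training details are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the ViT-Lens recipe (hypothetical names, not the authors' code):
# a modality-specific "lens" maps raw 3D inputs to ViT tokens; the frozen,
# image-pretrained ViT encodes them; a contrastive loss aligns the result
# with anchors from a pre-defined space such as CLIP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointCloudLens(nn.Module):
    """Hypothetical lens: embeds a point cloud as a sequence of ViT tokens."""
    def __init__(self, dim=768, num_tokens=196):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)           # per-point features
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, points):                         # points: (B, N, 3)
        feats = self.point_embed(points)               # (B, N, dim)
        q = self.queries.expand(points.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)         # cross-attend to points
        return tokens                                  # (B, num_tokens, dim)

def alignment_loss(modal_emb, anchor_emb, temperature=0.07):
    """InfoNCE-style loss pulling lens+ViT outputs toward pre-computed
    anchors from the pre-defined space (e.g., CLIP image/text features)."""
    modal_emb = F.normalize(modal_emb, dim=-1)
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    logits = modal_emb @ anchor_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Training-step sketch: only the lens is tuned; the image-pretrained ViT
# (a plain transformer encoder here stands in for it) stays frozen.
lens = PointCloudLens()
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), num_layers=2)
for p in vit.parameters():
    p.requires_grad = False

points = torch.randn(4, 1024, 3)                       # dummy 3D batch
clip_anchor = torch.randn(4, 768)                      # pre-computed anchors
emb = vit(lens(points)).mean(dim=1)                    # pooled 3D embedding
loss = alignment_loss(emb, clip_anchor)
loss.backward()                                        # updates only the lens
```

Because the pooled embedding lives in the same space as the anchors, it can be compared against CLIP text embeddings for zero-shot classification, or, as the abstract notes for InstructBLIP, handed to a model that already consumes features from that space.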

