ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

05/18/2023
by Peng Wang, et al.

In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across the vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows new modalities to be added easily via adapters and FFNs, while the shared self-attention layers enable multi-modal fusion. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which concurrently align the semantic spaces of different modalities and capture fine-grained details within each modality. With its scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
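To make the described design concrete, below is a minimal PyTorch sketch of one Transformer block with self-attention shared across modalities and a separate FFN per modality, plus a symmetric InfoNCE-style loss illustrating the cross-modal aligning contrast. All class names, hyperparameters (dimension, head count, temperature), the pre-norm layout, and the loss formulation are illustrative assumptions, not the released ONE-PEACE implementation; see the GitHub link above for the actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OnePeaceStyleBlock(nn.Module):
    """Hypothetical block in the spirit of ONE-PEACE: attention weights are
    shared across modalities (enabling fusion), while each modality keeps its
    own FFN (capturing modality-specific features). The modality adapters
    (patchify, audio frontend, text embedding) are assumed to run before this
    block and are omitted here."""

    def __init__(self, dim=768, num_heads=12, ffn_dim=3072,
                 modalities=("vision", "audio", "language")):
        super().__init__()
        # Shared self-attention: the same parameters serve every modality.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per modality; extending to a new modality adds one entry.
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                             nn.Linear(ffn_dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):
        # Pre-norm residual attention, then the modality-specific FFN.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffns[modality](self.norm2(x))
        return x

def cross_modal_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over paired batches from two modalities, a common
    way to realize a cross-modal aligning contrast (assumed here, not taken
    from the paper's exact objective)."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage: the same block processes tokens from different modalities.
# block = OnePeaceStyleBlock()
# vision_out = block(torch.randn(2, 196, 768), modality="vision")
# audio_out = block(torch.randn(2, 128, 768), modality="audio")

Under this sketch, the extensibility argument of the abstract falls out directly: supporting a new modality only requires a new adapter in front of the block and one more entry in the FFN dictionary, while the shared attention layers remain untouched.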



Related research:

- 02/18/2022: Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
- 11/21/2022: Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
- 08/23/2023: Understanding Dark Scenes by Contrasting Multi-Modal Observations
- 05/12/2022: One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
- 03/02/2022: HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
- 07/17/2023: BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
- 06/02/2021: Exploring modality-agnostic representations for music classification
