i-Code: An Integrative and Composable Multimodal Learning Framework

05/03/2022
by Ziyi Yang, et al.

Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives, including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
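The pipeline the abstract describes (pretrained single-modality encoders, an attention-based fusion network, and a cross-modality contrastive objective) can be summarized in a short sketch. The code below is a minimal illustration under assumptions, not the paper's implementation: the names ICodeSketch, FusionBlock, and contrastive_loss are hypothetical, the pretrained encoders are replaced by plain linear projections, and the fusion network is ordinary self-attention over concatenated modality tokens rather than the paper's novel attention mechanisms.

```python
# Minimal, hypothetical sketch of the i-Code pipeline described in the
# abstract. Module and function names are placeholders; the paper's actual
# encoders, fusion layers, and loss weights are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """One fusion layer: self-attention over concatenated modality tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ff(x))

class ICodeSketch(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 2):
        super().__init__()
        # Stand-ins for the pretrained single-modality encoders; in the real
        # system these would be full vision/speech/language transformers.
        self.proj = nn.ModuleDict({
            m: nn.Linear(dim, dim) for m in ("vision", "speech", "language")
        })
        self.fusion = nn.Sequential(*[FusionBlock(dim) for _ in range(depth)])

    def forward(self, feats: dict) -> torch.Tensor:
        # Any subset of modalities may be present (single, dual, or triple),
        # so the fusion network sees a variable-length token sequence.
        tokens = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        return self.fusion(tokens)

def contrastive_loss(a, b, temperature: float = 0.07):
    """Symmetric InfoNCE between pooled representations of two modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: fuse vision + language features for a batch of 4 examples, then
# apply a cross-modality contrastive loss between pooled modality vectors.
model = ICodeSketch()
feats = {"vision": torch.randn(4, 16, 768), "language": torch.randn(4, 32, 768)}
fused = model(feats)                                   # shape (4, 48, 768)
v = model.proj["vision"](feats["vision"]).mean(1)      # pooled vision vector
l = model.proj["language"](feats["language"]).mean(1)  # pooled language vector
loss = contrastive_loss(v, l)
```

The other pretraining objective named in the abstract, masked modality unit modeling, would add a reconstruction loss over masked spans of each modality's token sequence; it is omitted from this sketch for brevity.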


