i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

05/21/2023
by   ZiYi Yang, et al.
0

The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence, however the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder in order to flexibly project combinations of modalities into a shared representational space. Next, language tokens are generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that can be generalized across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/03/2022

i-Code: An Integrative and Composable Multimodal Learning Framework

Human intelligence is multimodal; we integrate visual, linguistic, and a...
research
04/17/2023

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
research
05/19/2023

Any-to-Any Generation via Composable Diffusion

We present Composable Diffusion (CoDi), a novel generative model capable...
research
12/09/2021

MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Large-scale pretraining is fast becoming the norm in Vision-Language (VL...
research
03/30/2023

A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

There has been a recent explosion of computer vision models which perfor...
research
12/20/2022

InterMulti:Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis

Humans are sophisticated at reading interlocutors' emotions from multimo...
research
05/27/2021

Self-Supervised Multimodal Opinion Summarization

Recently, opinion summarization, which is the generation of a summary fr...

Please sign up or login with your details

Forgot password? Click here to reset