Any-to-Any Generation via Composable Diffusion

05/19/2023
by Zineng Tang, et al.

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel, and its input is not limited to a subset of modalities such as text or image. Despite the absence of training datasets for many combinations of modalities, we propose aligning modalities in both the input and output spaces. This allows CoDi to freely condition on any input combination and generate any group of modalities, even those not present in the training data. CoDi employs a novel composable generation strategy that builds a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality and outperforms or is on par with the unimodal state of the art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io
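
The abstract describes two ideas: aligning every modality's conditioner into one shared embedding space ("bridging alignment") and composing any subset of those aligned conditions to drive generation. The sketch below is a minimal PyTorch illustration of that intuition, not the authors' released implementation; every class, function, dimension, and feature shape here is a hypothetical stand-in.

```python
# Hypothetical sketch of composable conditioning (not CoDi's released code).
# Idea: project each modality into a shared, unit-normalized embedding space,
# then interpolate any subset of condition embeddings into a single tensor
# that a diffusion model could consume as cross-attention context.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Projects one modality's features into the shared conditioning space."""
    def __init__(self, in_dim: int, shared_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)
        # Unit-normalize so embeddings from different modalities are comparable.
        return z / z.norm(dim=-1, keepdim=True)

def compose_conditions(embeddings, weights=None):
    """Interpolate any subset of aligned condition embeddings into one."""
    if weights is None:
        weights = [1.0 / len(embeddings)] * len(embeddings)
    return sum(w * e for w, e in zip(weights, embeddings))

# Usage: condition a (hypothetical) diffusion model on text + audio jointly.
text_enc = ModalityEncoder(in_dim=512)
audio_enc = ModalityEncoder(in_dim=128)
text_feat = torch.randn(1, 77, 512)   # stand-in text features
audio_feat = torch.randn(1, 77, 128)  # stand-in audio features
cond = compose_conditions([text_enc(text_feat), audio_enc(audio_feat)])
print(cond.shape)  # torch.Size([1, 77, 768]) -- shared conditioning space
```

The point of the shared space, per the abstract, is that once all conditioners are aligned, any combination of inputs maps into that same space, so the model can be conditioned on modality combinations that never co-occurred in the training data.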

Related research

09/11/2023 · NExT-GPT: Any-to-Any Multimodal LLM
While recently Multimodal Large Language Models (MM-LLMs) have made exci...

05/21/2023 · i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
The convergence of text, visual, and audio data is a key step towards hu...

05/24/2023 · DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
The recent progress in diffusion-based text-to-image generation models h...

12/01/2022 · Unite and Conquer: Cross Dataset Multimodal Synthesis using Diffusion Models
Generating photos satisfying multiple constraints finds broad utility in ...

09/27/2022 · Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion
Digital art synthesis is receiving increasing attention in the multimedi...

11/02/2022 · Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild
Laughter is considered one of the most overt signals of joy. Laughter is...

12/22/2021 · Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding
Personality computing and affective computing have gained recent interes...
