Multimodal Foundation Models: From Specialists to General-Purpose Assistants

09/18/2023
by Chunyuan Li, et al.

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics – methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics – unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience of the paper comprises researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.


