A Survey on Multimodal Large Language Models

06/23/2023
by Shukang Yin, et al.

Multimodal Large Language Models (MLLMs) have recently become a rising research hotspot. They use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional methods, suggesting a potential path to artificial general intelligence. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the formulation of MLLMs and delineate related concepts. Then, we discuss the key techniques and applications, including Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). Finally, we discuss existing challenges and point out promising research directions. Because the era of MLLMs has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub repository collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

