MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

09/14/2023
by Haozhe Zhao, et al.

Since the resurgence of deep learning, vision-language models (VLMs) that benefit from large language models (LLMs) have never been so popular. However, while LLMs can exploit extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts that contain multiple images. The issue can be traced back to the architectural design of VLMs or their pre-training data. Specifically, current VLMs primarily emphasize utilizing multi-modal data with a single image, rather than multi-modal prompts with interleaved multiple images and text. Even though some newly proposed VLMs can handle user prompts with multiple images, their pre-training data does not provide multi-modal prompts more sophisticated than interleaved image-text pairs crawled from the web. We propose MMICL to address this issue from both the model and the data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner, together with the MIC dataset, which reduces the gap between the training data and the complex user prompts of real-world applications by providing: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks such as MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. Experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe underlies its advanced performance.
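The abstract describes prompts that interleave multiple images with text and give each image a textual reference. The Python sketch below is only an illustration of that idea, not the authors' actual data format or code; the ImageRef class and build_interleaved_prompt helper are hypothetical names introduced here for clarity.

```python
# Minimal sketch (assumed format, not MMICL's real pipeline) of assembling an
# interleaved multi-image prompt where each image has a textual reference.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageRef:
    """Placeholder for an image; `tag` is the textual reference used in the prompt."""
    path: str
    tag: str  # e.g. "image 0"


def build_interleaved_prompt(segments: List[Union[str, ImageRef]]) -> str:
    """Flatten text and image segments into one prompt string.

    Each image is replaced by its declared reference token; a VLM's vision
    encoder would consume the image file separately, while the text stream
    only sees the reference (e.g. "[image 0]").
    """
    parts = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            parts.append(f"[{seg.tag}]")
        else:
            parts.append(seg)
    return " ".join(parts)


# Example: a two-image prompt asking about a relationship between the images.
prompt = build_interleaved_prompt([
    ImageRef("cat.jpg", "image 0"), "shows a cat on a sofa.",
    ImageRef("dog.jpg", "image 1"), "shows a dog in a garden.",
    "Which animal, image 0 or image 1, is indoors?",
])
print(prompt)
```

Running the example prints a single text stream in which every image appears only as its declared reference, which is roughly the property that lets a prompt refer back to specific images when reasoning over several of them.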

Related research

05/08/2023 - A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Conditional inference on joint textual and visual clues is a multi-modal...

07/31/2023 - Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
In recent times there has been a surge of multi-modal architectures base...

08/23/2022 - Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Is more data always better to train vision-and-language models? We study...

05/31/2023 - Joint Adaptive Representations for Image-Language Learning
Image-language learning has made unprecedented progress in visual unders...

06/14/2023 - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Recent research on Large Language Models (LLMs) has led to remarkable ad...

08/09/2023 - Prompting In-Context Operator Learning with Sensor Data, Equations, and Natural Language
In the growing domain of scientific machine learning, in-context operato...

03/18/2021 - Reading Isn't Believing: Adversarial Attacks On Multi-Modal Neurons
With Open AI's publishing of their CLIP model (Contrastive Language-Imag...
