LMEye: An Interactive Perception Network for Large Language Models

05/05/2023
by Yunxin Li, et al.

Training a Large Visual Language Model (LVLM) like GPT-4 from scratch is resource-intensive. Our paper proposes an alternative method, LMEye, a play-and-plug Interactive Perception Network for Large Language Models (LLMs) that aims to improve image understanding in LVLMs. Previous approaches to infusing visual information into LLMs rely on a static visual mapping network and lack dynamic interaction between the LLM and the visual input. LMEye addresses this issue by allowing the LLM to incorporate visual information aligned with human instructions. Specifically, LMEye consists of a static visual mapping network that provides the LLM with a basic perception of the image, together with additional linear layers responsible for acquiring requests from the LLM, decomposing the image features, and transmitting the interleaved information back to the LLM, respectively. In this way, the LLM is in charge of understanding human instructions, sending the corresponding requests to the interactive perception network, and generating responses based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal question answering and reasoning tasks, demonstrating that it significantly improves the zero-shot performance of LLMs on multimodal tasks compared with previous methods.
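To make the description of the interactive perception network more concrete, the snippet below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation. It assumes a frozen visual encoder producing patch features of size vis_dim and a frozen LLM with hidden size llm_dim, and it stands in for the paper's feature-decomposition linear layers with a single attention layer; all class, parameter, and variable names are illustrative.

```python
# Minimal sketch (not the authors' code) of an LMEye-style interactive
# perception module. Assumptions: a frozen image encoder (e.g. CLIP) gives
# patch features of size vis_dim, and a frozen LLM has hidden size llm_dim.
import torch
import torch.nn as nn


class InteractivePerceptionNetwork(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Static visual mapping: an instruction-agnostic "basic perception"
        # token projected into the LLM embedding space.
        self.static_map = nn.Linear(vis_dim, llm_dim)
        # Request acquisition: turns the LLM's instruction hidden state
        # into a query over the image features.
        self.request_proj = nn.Linear(llm_dim, vis_dim)
        # Feature decomposition: here approximated by attention with the
        # request as query, selecting instruction-relevant image content.
        self.decompose = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        # Transmission: maps the selected content back to LLM embeddings.
        self.transmit = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor, instr_hidden: torch.Tensor):
        """
        patch_feats:  (B, N, vis_dim)  frozen image-encoder patch features
        instr_hidden: (B, llm_dim)     LLM hidden state for the instruction
        returns: static image token and interactive token, each (B, 1, llm_dim)
        """
        # Basic perception token from pooled patch features.
        static_token = self.static_map(patch_feats.mean(dim=1, keepdim=True))
        # Instruction-conditioned request over the image features.
        request = self.request_proj(instr_hidden).unsqueeze(1)            # (B, 1, vis_dim)
        selected, _ = self.decompose(request, patch_feats, patch_feats)   # (B, 1, vis_dim)
        interactive_token = self.transmit(selected)
        return static_token, interactive_token


if __name__ == "__main__":
    ipn = InteractivePerceptionNetwork()
    feats = torch.randn(2, 257, 1024)   # e.g. ViT-L/14 patch features
    instr = torch.randn(2, 4096)        # LLM hidden state for the instruction
    static_tok, inter_tok = ipn(feats, instr)
    print(static_tok.shape, inter_tok.shape)
```

In such a setup, the two returned tokens would be prepended to the LLM's input embeddings alongside the text, so the frozen LLM receives both a basic view of the image and content selected according to the instruction.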


