VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

05/18/2023
by Wenhai Wang et al.

Large language models (LLMs) have notably accelerated progress toward artificial general intelligence (AGI): their impressive zero-shot capacity for user-tailored tasks gives them immense potential across a range of applications. In computer vision, however, despite the availability of numerous powerful vision foundation models (VFMs), these models remain restricted to tasks in a pre-defined form and struggle to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. The framework provides a unified perspective on vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed through language instructions. An LLM-based decoder then makes appropriate predictions for these open-ended tasks based on the instructions. Extensive experiments show that VisionLLM achieves different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. A demo will be released at https://github.com/OpenGVLab/InternGPT, and the code at https://github.com/OpenGVLab/VisionLLM.
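To make the instruction-driven formulation concrete, below is a minimal Python sketch of how a detection task could be phrased as a language instruction and how a decoder's textual answer could be parsed back into structured boxes. The <image> placeholder, the output format, and all function names are illustrative assumptions for this sketch, not VisionLLM's actual interface.

import ast
from dataclasses import dataclass

@dataclass
class Prediction:
    """One detected object: a class label and a normalized bounding box."""
    label: str
    box: tuple  # (x1, y1, x2, y2), each coordinate in [0, 1]

def build_instruction(classes):
    # Phrase detection as an open-ended language task over an <image> placeholder.
    return (
        "For each object in <image> belonging to the class set "
        f"{classes}, output one line containing a tuple of the class "
        "label and its bounding box as (x1, y1, x2, y2)."
    )

def parse_decoder_output(text):
    # Parse the decoder's one-object-per-line textual answer into predictions.
    preds = []
    for line in text.strip().splitlines():
        label, box = ast.literal_eval(line)
        preds.append(Prediction(label=label, box=tuple(box)))
    return preds

# Round trip with a stubbed decoder response (no model is called here):
instruction = build_instruction(["person", "dog"])
stub_output = '("person", (0.12, 0.30, 0.45, 0.88))\n("dog", (0.50, 0.55, 0.78, 0.95))'
print(instruction)
for pred in parse_decoder_output(stub_output):
    print(pred)

Under this framing, swapping the instruction text is enough to redefine the task (e.g., counting or captioning instead of detection), which is the task-level customization the abstract describes.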

Related research

05/09/2023 · InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
We present an interactive visual framework named InternGPT, or iGPT for ...

06/02/2023 · Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators
The recent success of Large Language Models (LLMs) signifies an impressi...

06/13/2023 · Retrieve Anyone: A General-purpose Person Re-identification Task with Instructions
Human intelligence can retrieve any person according to both visual and ...

07/20/2023 · FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Evaluation of Large Language Models (LLMs) is challenging because aligni...

06/25/2023 · Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
Segment anything model (SAM) is a prompt-guided vision foundation model ...

07/27/2023 · How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges
Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT...

08/03/2023 · The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
We present the All-Seeing (AS) project: a large-scale data and model for...
