The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

08/03/2023
by   Weiyun Wang, et al.

We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare real-world concepts, and contains 132.2 billion tokens describing the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset will be released at https://github.com/OpenGVLab/All-Seeing, and a demo is available at https://huggingface.co/spaces/OpenGVLab/all-seeing.


Related research

- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (08/24/2023)
  We introduce the Qwen-VL series, a set of large-scale vision-language mo...

- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (01/28/2022)
  Vision-Language Pre-training (VLP) has advanced the performance for many...

- Zero-shot Visual Question Answering with Language Model Feedback (05/26/2023)
  In this paper, we propose a novel language model guided captioning appro...

- Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning (08/22/2023)
  Text-to-music generation (T2M-Gen) faces a major obstacle due to the sca...

- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks (05/18/2023)
  Large language models (LLMs) have notably accelerated progress towards a...

- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering (10/07/2022)
  With the recent advance in large pre-trained language models, researcher...

- VLG: General Video Recognition with Web Textual Knowledge (12/03/2022)
  Video recognition in an open and dynamic world is quite challenging, as ...
