Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

08/24/2023
by Jinze Bai, et al.

We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks such as image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks, including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate that Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present the models' architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.


Related research:

- 02/27/2023 · Language Is Not All You Need: Aligning Perception with Language Models
  A big convergence of language, multimodal perception, action, and world ...

- 05/24/2023 · GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
  Generalization to unseen tasks is an important ability for few-shot lear...

- 09/23/2020 · X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
  Mirroring the success of masked language models, vision-and-language cou...

- 08/03/2023 · The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
  We present the All-Seeing (AS) project: a large-scale data and model for...

- 03/06/2023 · PaLM-E: An Embodied Multimodal Language Model
  Large language models excel at a wide range of complex tasks. However, e...

- 05/29/2023 · Contextual Object Detection with Multimodal Large Language Models
  Recent Multimodal Large Language Models (MLLMs) are remarkable in vision...

- 05/22/2022 · Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
  The goal of this work is to build flexible video-language models that ca...
