GLIPv2: Unifying Localization and Vision-Language Understanding

06/12/2022
by Haotian Zhang, et al.

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
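The three pre-training objectives are only named at a high level in the abstract. As a rough illustration, the sketch below shows what a region-word level contrastive loss of the kind mentioned above might look like in PyTorch. The function name, tensor shapes, temperature value, and use of a soft-target cross-entropy are illustrative assumptions, not GLIPv2's actual implementation.

```python
# Illustrative sketch only: a generic region-word contrastive objective in the
# spirit of the abstract. Shapes, normalization, and the loss form are
# assumptions for demonstration, not the paper's implementation.
import torch
import torch.nn.functional as F


def region_word_contrastive_loss(region_feats, word_feats, alignment, temperature=0.07):
    """Contrast region features against word (token) features.

    region_feats: (num_regions, dim)  pooled visual features, one per region/box
    word_feats:   (num_words, dim)    token features from the language encoder
    alignment:    (num_regions, num_words) ground-truth map; 1 where a region is
                  grounded to a word (e.g., from phrase-grounding labels)
    """
    # L2-normalize so the dot product is a cosine similarity.
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    alignment = alignment.float()

    # Similarity logits between every region and every word.
    logits = region_feats @ word_feats.t() / temperature  # (num_regions, num_words)

    # Each region's aligned words are positives; all other words act as
    # negatives. Soft targets spread the mass over multiple positive words.
    targets_r = alignment / alignment.sum(dim=-1, keepdim=True).clamp(min=1)
    loss_r2w = -(targets_r * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    # Symmetric word-to-region direction.
    targets_w = alignment.t() / alignment.t().sum(dim=-1, keepdim=True).clamp(min=1)
    loss_w2r = -(targets_w * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()

    return 0.5 * (loss_r2w + loss_w2r)
```

A symmetric two-direction loss (region-to-word and word-to-region) is one common design choice for this kind of alignment objective; how GLIPv2 draws its negatives (e.g., within or across images in a batch) is a detail of the paper that this sketch does not try to reproduce.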


