PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

05/23/2022, by Yuan Yao, et al.

Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, and VLP models without reliance on object detectors are becoming the mainstream due to their superior computation efficiency and competitive performance. However, removing the object detector also deprives VLP models of the capability for explicit object modeling, which is essential to various position-sensitive vision-language (VL) tasks, such as referring expression comprehension and visual commonsense reasoning. To address this challenge, we introduce PEVL, which enhances the pre-training and prompt tuning of VLP models with explicit object position modeling. Specifically, PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training and also enables flexible prompt tuning for various downstream tasks. We show that PEVL achieves state-of-the-art performance among detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves performance on position-insensitive tasks with grounded inputs. We make the data and code for this paper publicly available at https://github.com/thunlp/PEVL.
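To make the reformulation concrete, below is a minimal sketch of how discretized object positions can be expressed as ordinary text tokens. The bin count, token format, and function names are illustrative assumptions, not PEVL's actual implementation:

# Illustrative sketch only: quantize bounding-box coordinates into a small
# vocabulary of position tokens and splice them into the text, so positions
# and words share one token sequence. Bin count (512) and the "[pos_k]"
# token format are assumptions, not taken from the PEVL codebase.

def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Map a pixel bounding box (x1, y1, x2, y2) to discrete position tokens."""
    x1, y1, x2, y2 = box
    # Normalize each coordinate to [0, 1], then quantize into num_bins bins.
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [f"[pos_{b}]" for b in bins]

def ground_phrase(phrase, box, image_w, image_h):
    """Append position tokens to a phrase, yielding a grounded text span."""
    return " ".join([phrase] + box_to_position_tokens(box, image_w, image_h))

# Example: grounding the phrase "a dog" in a 640x480 image.
print(ground_phrase("a dog", (64, 120, 320, 440), 640, 480))
# -> "a dog [pos_51] [pos_128] [pos_256] [pos_469]"

Because positions and words then live in the same token sequence, a single masked language modeling objective can predict either kind of token, which is what allows the same framework to serve both pre-training and prompt tuning on position-sensitive downstream tasks.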

Related research

12/19/2022  Position-guided Text Prompt for Vision-Language Pre-training
Vision-Language Pre-Training (VLP) has shown promising capabilities to a...

09/24/2021  CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabi...

06/12/2022  GLIPv2: Unifying Localization and Vision-Language Understanding
We present GLIPv2, a grounded VL understanding model, that serves both l...

05/31/2023  ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Two-Tower Vision-Language (VL) models have shown promising improvements ...

08/18/2023  RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Relational Language-Image Pre-training (RLIP) aims to align vision repre...

04/23/2021  Playing Lottery Tickets with Vision and Language
Large-scale transformer-based pre-training has recently revolutionized v...

09/07/2023  DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
As it is empirically observed that Vision Transformers (ViTs) are quite ...
