RLIPv2: Fast Scaling of Relational Language-Image Pre-training

08/18/2023
by Hangjie Yuan, et al.

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the pre-training and fine-tuning time. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a dedicated Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning and yields 32.22 mAP with just 1% of the data. Code is available at https://github.com/JacobYuan7/RLIPv2.
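To make the ALIF idea concrete, the sketch below shows one way a gated cross-modal fusion layer of the kind described in the abstract could be written: sparsified language-encoder states attend to image features, and a zero-initialised gate controls how much of the fused signal is injected. Module names, dimensions and the gating scheme here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of gated cross-modal fusion (ALIF-style); all names and
# hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-initialised gate: fusion starts as an identity mapping and
        # opens gradually during pre-training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, lang_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # lang_tokens: (B, L, dim) states from the sparsified language layers
        # vis_tokens:  (B, V, dim) states from the image/detection encoder
        fused, _ = self.cross_attn(query=self.norm(lang_tokens),
                                   key=vis_tokens, value=vis_tokens)
        return lang_tokens + torch.tanh(self.gate) * fused
```

Because the gate starts at zero, stacking such blocks earlier and deeper in the network does not perturb the pre-trained unimodal encoders at initialisation, which is one plausible reason a design like this can converge quickly.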
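The pseudo-labelling pipeline can likewise be sketched in a few lines: relation triplets parsed from a captioner's output (e.g., BLIP) are assigned to pairs of detected regions to form scene-graph pseudo-labels. The string-matching rule below is a deliberate simplification for illustration; the paper's Relation Tagger is a learned module, and the helper names are hypothetical.

```python
# Toy sketch of relation tagging over detection datasets; the matching rule
# is a simplification, not the paper's learned Relation Tagger.
from typing import List, Tuple

Box = Tuple[float, float, float, float]      # (x1, y1, x2, y2)
Detection = Tuple[str, Box]                  # (category name, box)
Triplet = Tuple[str, str, str]               # (subject text, relation text, object text)

def tag_relations(detections: List[Detection],
                  caption_triplets: List[Triplet]) -> List[Tuple[Box, str, Box]]:
    """Assign each parsed (subject, relation, object) text to region pairs
    whose detected category names appear in the subject/object phrases."""
    pseudo_labels = []
    for subj, rel, obj in caption_triplets:
        subj_boxes = [box for name, box in detections if name in subj]
        obj_boxes = [box for name, box in detections if name in obj]
        for sb in subj_boxes:
            for ob in obj_boxes:
                if sb != ob:
                    pseudo_labels.append((sb, rel, ob))
    return pseudo_labels

# Example: one image from an object-detection dataset plus one caption
# parsed into a relation triplet.
dets = [("person", (10, 20, 120, 300)), ("bicycle", (80, 150, 260, 320))]
triplets = [("a person", "riding", "a bicycle")]
print(tag_relations(dets, triplets))
```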
