CLIP-VG: Self-paced Curriculum Adapting of CLIP via Exploiting Pseudo-Language Labels for Visual Grounding

05/15/2023
by   Linhui Xiao, et al.

Visual Grounding (VG) refers to locating a region described by a natural-language expression in a given image, a critical topic in vision-language research. To alleviate the dependence on labeled data, existing unsupervised methods try to locate regions using task-unrelated pseudo-labels. However, a large proportion of these pseudo-labels are noisy, and their language taxonomy lacks diversity. Inspired by advances in vision-language pretraining (VLP), we consider utilizing VLP models to realize unsupervised transfer learning on the downstream grounding task. We therefore propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP by exploiting pseudo-language labels to solve the VG problem. Building on an efficient model structure, we propose single-source and multi-source curriculum adapting methods for unsupervised VG that progressively sample more reliable cross-modal pseudo-labels to obtain the optimal model, thereby achieving implicit knowledge exploiting and denoising. Our method outperforms the existing state-of-the-art unsupervised VG method, Pseudo-Q, by a large margin (6.78%) in both single-source and multi-source scenarios, and also surpasses existing weakly supervised methods. The code and models will be released at <https://github.com/linhuixiao/CLIP-VG>.
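The core idea of self-paced curriculum adapting — progressively admitting more pseudo-labels as their reliability clears a gradually relaxed bar — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the reliability scoring function, the threshold schedule, and the round structure are all assumptions for illustration; in the actual method the model would be adapted on each round's selection and re-score the pool.

```python
# Hypothetical sketch of self-paced curriculum sampling of pseudo-labels.
# Assumption (not from the paper): each pseudo-label has a reliability
# score in [0, 1] produced by the current model; we relax the threshold
# over rounds so training moves from "easy/reliable" to harder samples.

def select_curriculum(pseudo_labels, reliability, threshold):
    """Keep only pseudo-labels whose reliability meets the current bar."""
    return [p for p, r in zip(pseudo_labels, reliability) if r >= threshold]

def self_paced_rounds(pseudo_labels, score_fn, thresholds):
    """Run curriculum rounds with a progressively relaxed threshold.

    In the real method, the model would be fine-tuned on `selected`
    after each round, so `score_fn` would improve between rounds.
    """
    selected = []
    for t in thresholds:
        reliability = [score_fn(p) for p in pseudo_labels]
        selected = select_curriculum(pseudo_labels, reliability, t)
        # ... adapt (fine-tune) the grounding model on `selected` here
    return selected

# Toy usage: reliability scores are looked up from a fixed table.
labels = ["a", "b", "c", "d"]
scores = {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.8}
final = self_paced_rounds(labels, scores.get, thresholds=[0.8, 0.5])
print(final)  # ['a', 'b', 'd']
```

The design point the sketch captures is that denoising is implicit: low-reliability pseudo-labels are simply never sampled at the strict early thresholds, so the model is first shaped by cleaner supervision before noisier samples enter.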



