PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting

07/14/2023
by   Zixin Guo, et al.

Vision-language (VL) pre-training (VLP) has been shown to generalize VL models well over a wide range of downstream tasks, especially cross-modal retrieval. However, it hinges on a huge amount of image-text pairs, whose curation is tedious and costly. In contrast, weakly-supervised VLP (W-VLP) sidesteps paired captions by relying on object tags generated from images by a pre-trained object detector (OD). Yet such methods still require paired information, i.e. images with object-level annotations, as supervision to train the OD. To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL), which prompts knowledge from large language models (LLMs) to describe images. Concretely, given the category label of an image, e.g. refinery, the knowledge extracted by the LLM, e.g. "a refinery could be seen with large storage tanks, pipework, and ...", is used as the language counterpart. The knowledge supplements, for example, the common relations among entities most likely to appear in a scene. With PiTL we create IN14K, a new VL dataset of 9M images and 1M descriptions covering 14K categories from ImageNet21K. Empirically, VL models pre-trained on PiTL-generated pairs are strongly favored over other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks, while using less supervision. These results demonstrate the effectiveness of PiTL-generated pairs for VLP.
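The prompting loop sketched in the abstract can be illustrated with a short snippet. This is a minimal sketch, not the authors' implementation: the prompt template and the `query_llm` stub are assumptions standing in for a real LLM call, and the returned text here is just the paper's refinery example.

```python
# Sketch of PiTL-style pair generation: for each category label, prompt an
# LLM for scene-level knowledge, then pair that description with every image
# of the category. The prompt wording and the LLM call are hypothetical.

def build_prompt(category: str) -> str:
    """Ask the LLM what typically appears in a scene of this category."""
    return (f"Describe what could be seen in an image of a {category}, "
            f"including the objects and relations most likely present.")

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API request)."""
    # Canned response matching the paper's refinery illustration.
    return ("a refinery could be seen with large storage tanks, "
            "pipework, and smokestacks")

def generate_pairs(images_by_category: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Pair every image with the LLM-generated description of its category."""
    pairs = []
    for category, images in images_by_category.items():
        description = query_llm(build_prompt(category))
        pairs.extend((img, description) for img in images)
    return pairs
```

Because one description is shared by all images of a category, the supervision needed per image-text pair is only the image's category label, which is the point of the method.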


Related research

- 10/24/2022 — Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision. Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at lea...
- 10/24/2020 — Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions. Pre-trained contextual vision-and-language (V&L) models have brought i...
- 05/12/2022 — A Computational Acquisition Model for Multimodal Word Categorization. Recent advances in self-supervised modeling of text and images open new ...
- 03/20/2017 — Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval. This paper contributes a new large-scale dataset for weakly supervised c...
- 07/27/2023 — Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation. Training an image captioner without annotated image-sentence pairs has g...
- 06/20/2023 — Quilt-1M: One Million Image-Text Pairs for Histopathology. Recent accelerations in multi-modal applications have been made possible...
- 03/14/2022 — MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization. In this work we present a new State-of-The-Art on the text-to-video retr...
