HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval

05/24/2022
by Feilong Chen, et al.

In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval into a new era. However, due to its latency and computation demands, applying VLP in a real-time online retrieval system remains challenging. To alleviate this problem, this paper proposes Hierarchical Vision-Language Pre-training (HiVLP) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective, which uses representations of different dimensions for coarse-to-fine ITR, i.e., low-dimensional representations for large-scale coarse retrieval and high-dimensional representations for small-scale fine retrieval. We evaluate the proposed HiVLP on two popular image-text retrieval benchmarks, i.e., Flickr30K and COCO. Extensive experiments demonstrate that HiVLP not only has fast inference speed but also can be easily scaled to large-scale ITR scenarios. Detailed results show that HiVLP is 1,427∼120,649× faster than the fusion-based model UNITER and 2∼5× faster than the fastest embedding-based model LightningDOT in different candidate scenarios. It also achieves about +4.9 AR on COCO and +3.8 AR on Flickr30K over LightningDOT, and reaches performance comparable to the state-of-the-art (SOTA) fusion-based model METER.
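
For intuition, the hierarchical objective amounts to a two-stage nearest-neighbor search: a cheap scan over low-dimensional embeddings prunes the gallery to a small shortlist, and the expensive high-dimensional similarity is computed only on that shortlist. The following NumPy sketch illustrates the idea; the function name, embedding dimensions, and shortlist size are illustrative assumptions, not HiVLP's actual implementation.

import numpy as np

def coarse_to_fine_retrieval(query_coarse, query_fine,
                             gallery_coarse, gallery_fine,
                             k_coarse=100, k_fine=5):
    """Two-stage retrieval sketch (assumed interface, not the paper's code).

    query_coarse:   (d_low,)    low-dimensional query embedding
    query_fine:     (d_high,)   high-dimensional query embedding
    gallery_coarse: (N, d_low)  low-dim embeddings of all candidates
    gallery_fine:   (N, d_high) high-dim embeddings of all candidates
    """
    # Stage 1: large-scale coarse retrieval with cheap low-dim dot products.
    coarse_scores = gallery_coarse @ query_coarse            # (N,)
    k = min(k_coarse, gallery_coarse.shape[0])
    shortlist = np.argpartition(-coarse_scores, k - 1)[:k]   # top-k candidates

    # Stage 2: small-scale fine retrieval with high-dim embeddings,
    # restricted to the coarse shortlist.
    fine_scores = gallery_fine[shortlist] @ query_fine       # (k,)
    order = np.argsort(-fine_scores)[:k_fine]
    return shortlist[order], fine_scores[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d_low, d_high = 10_000, 64, 768                       # assumed sizes
    gallery_coarse = rng.normal(size=(N, d_low)).astype(np.float32)
    gallery_fine = rng.normal(size=(N, d_high)).astype(np.float32)
    query_coarse = rng.normal(size=d_low).astype(np.float32)
    query_fine = rng.normal(size=d_high).astype(np.float32)
    ids, scores = coarse_to_fine_retrieval(query_coarse, query_fine,
                                           gallery_coarse, gallery_fine)
    print(ids, scores)

In a production system, the coarse stage would typically be served by an approximate nearest-neighbor index rather than the brute-force matrix product used here; the dense scan just keeps the example self-contained.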


Related research

01/22/2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
In this paper, we introduce a new vision-language pre-trained model – Im...

03/16/2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Multimodal pre-training has propelled great advancement in vision-and-la...

04/10/2023
Delving into E-Commerce Product Retrieval with Vision-Language Pre-training
E-commerce search engines comprise a retrieval phase and a ranking phase...

05/01/2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a Hierarchical EncodeR for Omni-representation learning...

02/06/2023
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval
Image-text retrieval (ITR) is a task to retrieve the relevant images/tex...

06/12/2023
Retrieval-Enhanced Contrastive Vision-Text Models
Contrastive image-text models such as CLIP form the building blocks of m...

05/28/2023
ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval
Large-scale pre-trained text-image models with dual-encoder architecture...
