Log In Sign Up

HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval

by   Feilong Chen, et al.

In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval to a new era. However, due to the latency and computation demand, it is commonly challenging to apply VLP in a real-time online retrieval system. To alleviate the defect, this paper proposes a Hierarchical Vision-Language Pre-Training (HiVLP) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective, which uses the representation of different dimensions for coarse-to-fine ITR, i.e., using low-dimensional representation for large-scale coarse retrieval and high-dimensional representation for small-scale fine retrieval. We evaluate our proposed HiVLP on two popular image-text retrieval benchmarks, i.e., Flickr30k and COCO. Extensive experiments demonstrate that our HiVLP not only has fast inference speed but also can be easily scaled to large-scale ITR scenarios. The detailed results show that HiVLP is 1,427∼120,649× faster than the fusion-based model UNITER and 2∼5 faster than the fastest embedding-based model LightingDot in different candidate scenarios. It also achieves about +4.9 AR on COCO and +3.8 AR on Flickr30K than LightingDot and achieves comparable performance with the state-of-the-art (SOTA) fusion-based model METER.


page 13

page 14


ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

In this paper, we introduce a new vision-language pre-trained model – Im...

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Multimodal pre-training has propelled great advancement in vision-and-la...

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

We present HERO, a Hierarchical EncodeR for Omni-representation learning...

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

We introduce CommerceMM - a multimodal model capable of providing a dive...

VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval

Cross-model retrieval has emerged as one of the most important upgrades ...

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

We study joint learning of Convolutional Neural Network (CNN) and Transf...

DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval

Many previous methods on text-based person retrieval tasks are devoted t...