LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

03/16/2021
by Siqi Sun, et al.

Multimodal pre-training has propelled great advances in vision-and-language (V+L) research. These large-scale pre-trained models, although successful, inevitably suffer from slow inference because of the enormous computational cost of cross-modal attention in the Transformer architecture. In real-life applications, this latency and computational demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature V+L application scenario, which was widely studied even before the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates ITR inference by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up the retrieval process. In fact, LightningDOT achieves a new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume on the order of 1000x more computational hours. Code and pre-training checkpoints are available at https://github.com/intersun/LightningDOT.
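
The retrieval flow described in the abstract (embeddings indexed offline, dot-product matching at query time, then re-ranking of a short candidate list) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-dimensional vectors, and the `toy_cross_modal_score` stand-in for a slower cross-modal scorer are all hypothetical, chosen only to show the shape of the two-stage pipeline.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query: Vector, index: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Stage 1: score every pre-computed embedding with a dot product
    and keep the k highest-scoring (item_id, score) pairs."""
    scored = [(item_id, dot(query, emb)) for item_id, emb in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def rerank(candidates: List[Tuple[int, float]],
           slow_score: Callable[[int], float]) -> List[int]:
    """Stage 2: re-order only the short candidate list with a slower,
    more accurate scorer (assumed here to take an item id)."""
    return sorted((item_id for item_id, _ in candidates),
                  key=slow_score, reverse=True)


# Hypothetical image embeddings, assumed to be computed offline.
image_index = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
# Hypothetical text-query embedding, computed at query time.
text_query = [0.8, 0.6]

candidates = retrieve_top_k(text_query, image_index, k=2)
candidate_ids = [item_id for item_id, _ in candidates]


def toy_cross_modal_score(item_id: int) -> float:
    """Placeholder for an expensive cross-modal model (assumption)."""
    return {0: 0.9, 1: 0.7, 2: 0.1}[item_id]


reranked_ids = rerank(candidates, toy_cross_modal_score)
print(candidate_ids, reranked_ids)
```

In this sketch the expensive scorer runs on only `k` candidates rather than on the whole index, which is the general reason a two-stage design keeps query-time cost low; a real system would likely replace the exhaustive loop in `retrieve_top_k` with a vector-search library.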


