LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval

03/10/2022
by   Jie Lei, et al.

Dual encoders and cross encoders have been widely used for image-text retrieval. The dual encoder encodes the image and text independently and scores them with a dot product, while the cross encoder takes the image and text jointly as input and performs dense multi-modal fusion. These two architectures are typically modeled separately, without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are performed efficiently together in the same model. Our work centers on empirical analyses of this combined architecture, with the main focus on the design of the distillation objective. Our experimental results highlight the benefits of training the two encoders in the same network, and demonstrate that distillation can be quite effective with just a few hard negative examples. Experiments on two standard datasets (Flickr30K and COCO) show our approach achieves state-of-the-art dual encoder performance compared with approaches using a similar amount of data.
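The loop described above can be sketched in a toy NumPy example. This is a minimal illustration, not the paper's implementation: the linear projections stand in for the real transformer encoders, `cross_scores` is a hypothetical stand-in for attention-based fusion, and the batch sizes and top-k are arbitrary. It shows the two steps: the dual encoder mines hard negatives, and the cross encoder's softmax over those candidates becomes the distillation target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "dual encoder": independent linear projections per modality
# (a stand-in for the real transformer encoders in the paper).
W_img, W_txt = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))

def dual_scores(imgs, txts):
    """Embed each modality independently, then score with a dot product."""
    zi = imgs @ W_img
    zt = txts @ W_txt
    zi /= np.linalg.norm(zi, axis=1, keepdims=True)
    zt /= np.linalg.norm(zt, axis=1, keepdims=True)
    return zi @ zt.T  # (N_img, N_txt) similarity matrix

def cross_scores(imgs, txts, pairs):
    """Toy 'cross encoder' that jointly scores selected (image, text) pairs.
    A real cross encoder would fuse the two inputs with dense attention."""
    return np.array([imgs[i] @ W_img @ W_txt.T @ txts[j] for i, j in pairs])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

imgs = rng.normal(size=(6, 8))  # 6 image features, matched index-wise
txts = rng.normal(size=(6, 8))  # with 6 text features

# Step 1: the dual encoder mines hard negatives for image 0 --
# the highest-scoring non-matching texts.
S = dual_scores(imgs, txts)
ranked = np.argsort(-S[0])
hard_negs = [j for j in ranked if j != 0][:3]
pairs = [(0, 0)] + [(0, j) for j in hard_negs]  # positive + 3 hard negatives

# Step 2: the cross encoder rescores only the positive and hard negatives;
# its softmax is the teacher distribution, and the dual encoder (student)
# is trained to match it via a KL-divergence distillation loss.
teacher = softmax(cross_scores(imgs, txts, pairs))
student = softmax(np.array([S[0, j] for _, j in pairs]))
kl_loss = np.sum(teacher * (np.log(teacher) - np.log(student)))
```

Restricting the cross encoder to the positive plus a handful of mined negatives is what keeps the joint training cheap: the expensive fusion model never scores the full batch, matching the abstract's observation that a few hard negatives suffice for effective distillation.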


