Retrieval-Enhanced Contrastive Vision-Text Models

06/12/2023
by   Ahmet Iscen, et al.
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities that are rare, or even absent, in the pre-training dataset. Hence, a key ingredient of their success has been the use of large-scale curated pre-training data aimed at expanding the set of concepts they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example, +10.9 on Stanford Cars, +10.2 on CUB-2011, and +7.3 on the recent OVEN benchmark.
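The inference-time idea described above (retrieve cross-modal neighbors from an external memory, then fuse them with the original embedding) can be sketched as follows. This is not the paper's implementation: it is a minimal NumPy illustration in which the single-layer fusion transformer is reduced to one self-attention step where the query embedding attends over itself and its retrieved neighbors; the function names (`retrieve`, `fuse`) and the dimensions are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP-style models."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(query_emb, memory_embs, k=4):
    """Return the k memory entries most similar to the query.

    query_emb:   (d,)   unit-norm embedding of one input (e.g. an image)
    memory_embs: (N, d) unit-norm embeddings of the external memory
                 (in RECO, the *other* modality, e.g. text descriptions)
    """
    sims = memory_embs @ query_emb          # cosine similarities, shape (N,)
    idx = np.argsort(-sims)[:k]             # indices of the top-k neighbors
    return memory_embs[idx]                 # shape (k, d)

def fuse(query_emb, retrieved):
    """Refine the query embedding with its retrieved neighbors.

    A stand-in for the paper's single-layer fusion transformer: one
    softmax-attention step where the query attends over [query; neighbors].
    """
    d = query_emb.shape[-1]
    keys = np.vstack([query_emb[None, :], retrieved])   # (k+1, d)
    scores = keys @ query_emb / np.sqrt(d)              # scaled dot-product
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                  # softmax weights
    fused = attn @ keys                                 # weighted average
    return l2_normalize(fused)                          # back to unit norm
```

Zero-shot prediction then proceeds as usual, except the cosine similarity with class-name text embeddings is computed from the fused embedding rather than the raw one.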


Related research

07/30/2021 · CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification
03/21/2020 · Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring
05/08/2022 · Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework
03/23/2023 · CoBIT: A Contrastive Bi-directional Image-Text Generation Model
08/22/2023 · GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
07/31/2023 · Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
05/24/2022 · HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
