Learning Visual Representations with Caption Annotations

Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively annotated ImageNet dataset, recent approaches have explored ways to perform such pretraining with noisy, fewer, or even no annotations. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM) – a proxy task to learn visual representations over image-caption pairs. ICMLM consists of predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks. Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations. Project website: https://europe.naverlabs.com/icmlm.
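
The abstract describes ICMLM as a masked-word prediction task where a textual encoder is conditioned on features from a visual encoder. Below is a minimal, hypothetical PyTorch sketch of that setup; the specific backbone (ResNet-50), the cross-attention fusion, and all hyperparameters are illustrative assumptions and not the paper's exact architecture.

```python
# Hypothetical sketch of image-conditioned masked language modeling (ICMLM).
# Assumptions: ResNet-50 visual encoder, small transformer text encoder,
# cross-attention fusion. The paper's actual modules may differ.
import torch
import torch.nn as nn
import torchvision.models as tvm


class ICMLMSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # Visual encoder: ResNet-50 body producing a grid of region features.
        backbone = tvm.resnet50(weights=None)
        self.visual = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, 7, 7)
        self.visual_proj = nn.Linear(2048, d_model)

        # Textual encoder: token + position embeddings, then a transformer.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(128, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_enc = nn.TransformerEncoder(enc_layer, n_layers)

        # Fusion: caption tokens attend over visual regions (one possible choice).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, token_ids):
        B, L = token_ids.shape
        # Encode the image into a set of region features.
        v = self.visual(images)                          # (B, 2048, H, W)
        v = v.flatten(2).transpose(1, 2)                 # (B, HW, 2048)
        v = self.visual_proj(v)                          # (B, HW, d_model)

        # Encode the (partially masked) caption.
        pos = torch.arange(L, device=token_ids.device).unsqueeze(0)
        t = self.text_enc(self.tok_emb(token_ids) + self.pos_emb(pos))

        # Let each token position gather visual cues, then predict the vocabulary.
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        return self.mlm_head(fused)                      # (B, L, vocab_size)


if __name__ == "__main__":
    model = ICMLMSketch()
    images = torch.randn(2, 3, 224, 224)
    token_ids = torch.randint(0, 30522, (2, 20))  # caption ids with some masked
    logits = model(images, token_ids)
    print(logits.shape)  # torch.Size([2, 20, 30522])
```

Training would apply a cross-entropy loss only at the masked positions; the visual encoder learned as a by-product is the representation transferred to downstream tasks.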


