LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

08/26/2021
by   Zhijian Liu, et al.

Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex, which takes advantage of low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable or even improved performance on COCO instance segmentation. When provided with the same amount of annotations, LocTex achieves around 4% higher accuracy than the previous state-of-the-art "vision+language" pre-training approach on the task of PASCAL VOC image classification.
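The abstract describes two training signals: a contrastive loss between image and caption embeddings, and supervision of the cross-modal attention map with rendered mouse traces. A minimal numpy sketch of these two components is below; the function names, the symmetric InfoNCE form, and the L1 trace loss are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-caption contrastive loss (InfoNCE-style sketch).

    Matching image/caption pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatches apart.
    """
    # L2-normalize both embedding sets so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]

    def xent(l):
        # Numerically stable cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def trace_supervision(attn_map, trace_mask):
    """Coarse localization loss: penalize the gap between the cross-modal
    attention map and a rendered mouse-trace mask (L1 here, an assumption)."""
    return np.abs(attn_map - trace_mask).mean()
```

In this sketch the full pre-training objective would be a weighted sum of the two terms; an attention map that exactly matches its rendered trace mask contributes zero localization loss.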


Related research:

- FreeSOLO: Learning to Segment Objects without Annotations (02/24/2022)
- VirTex: Learning Visual Representations from Textual Annotations (06/11/2020)
- Joint Learning of Localized Representations from Medical Images and Reports (12/06/2021)
- Learning Visual Representations with Caption Annotations (08/04/2020)
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training (07/05/2020)
- Deep feature transfer between localization and segmentation tasks (11/06/2018)
- ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity (04/11/2023)
