Training Vision-Language Transformers from Captions Alone

05/19/2022
by   Liangke Gui, et al.
0

We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.

READ FULL TEXT

page 7

page 9

page 13

page 14

page 15

research
11/25/2018

Learning to discover and localize visual objects with open vocabulary

To alleviate the cost of obtaining accurate bounding boxes for training ...
research
11/21/2022

NeRF-RPN: A general framework for object detection in NeRFs

This paper presents the first significant object detection framework, Ne...
research
10/05/2020

Non-anchor-based vehicle detection for traffic surveillance using bounding ellipses

Cameras for traffic surveillance are usually pole-mounted and produce im...
research
12/01/2019

Learning to Relate from Captions and Bounding Boxes

In this work, we propose a novel approach that predicts the relationship...
research
03/28/2021

TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

Transformer networks have proven extremely powerful for a wide variety o...
research
05/22/2023

Type-to-Track: Retrieve Any Object via Prompt-based Tracking

One of the recent trends in vision problems is to use natural language c...
research
01/25/2022

Main Product Detection with Graph Networks for Fashion

Computer vision has established a foothold in the online fashion retail ...

Please sign up or login with your details

Forgot password? Click here to reset