Caption supervision enables robust learners

10/13/2022
by Benjamin Feuer, et al.

Vision-language models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision: the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that CNNs trained with a standard cross-entropy loss can also benefit from caption supervision on the same data, in some cases even more than VL models. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset of over 50,000 new human-labeled, ImageNet-compliant samples with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision. We also provide the codebase necessary to reproduce our experiments at https://github.com/penfever/vlhub/.
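To make the caption-supervision setup concrete, below is a minimal sketch (not the authors' released code; see the vlhub repository for that) of one way image-linked captions can be converted into cross-entropy targets for a CNN: match caption text against per-class keywords, drop unmatched samples, and train as usual. The synonym map, matching rule, model choice, and hyperparameters here are illustrative assumptions.

```python
# Sketch: caption supervision as noisy cross-entropy labels for a CNN.
# The CLASS_SYNONYMS map and training details are hypothetical placeholders.
import torch
import torch.nn as nn
import torchvision

# Hypothetical mapping from class index to caption keywords (e.g. WordNet synonyms).
CLASS_SYNONYMS = {0: ["tench", "tinca tinca"], 1: ["goldfish", "carassius auratus"]}

def caption_to_label(caption: str):
    """Treat an image-linked caption as a noisy ground-truth label:
    return the first class whose keyword appears in the caption, else None."""
    text = caption.lower()
    for label, synonyms in CLASS_SYNONYMS.items():
        if any(s in text for s in synonyms):
            return label
    return None  # unmatched captions are simply dropped

model = torchvision.models.resnet50(num_classes=len(CLASS_SYNONYMS))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(images: torch.Tensor, captions: list[str]):
    """One cross-entropy update on the caption-labeled subset of a batch."""
    labels = [caption_to_label(c) for c in captions]
    keep = [i for i, lab in enumerate(labels) if lab is not None]
    if not keep:
        return None  # no caption in this batch matched a class
    x = images[keep]
    y = torch.tensor([labels[i] for i in keep])
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```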

Related research

08/07/2023 · Distributionally Robust Classification on a Data Budget
Real world uses of deep learning require predictable model behavior unde...

11/01/2022 · Training Vision-Language Models with Less Bimodal Supervision
Standard practice in pretraining multimodal models, such as vision-langu...

03/14/2022 · SimMatch: Semi-supervised Learning with Similarity Matching
Learning with few labeled data has been a longstanding problem in the co...

07/19/2020 · Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets
We present a new loss function called Distribution-Balanced Loss for the...

09/27/2021 · GANiry: Bald-to-Hairy Translation Using CycleGAN
This work presents our computer vision course project called bald men-to...

12/09/2022 · A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
Machine learning models have been found to learn shortcuts – unintended ...

05/31/2019 · Max-MIG: an Information Theoretic Approach for Joint Learning from Crowds
Eliciting labels from crowds is a potential way to obtain large labeled ...
