SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models

10/07/2022
by Omiros Pantazis, et al.

Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image and text pairs, and have been shown to sometimes exhibit impressive zero- and low-shot image classification performance. However, due to their size, fine-tuning these models on new datasets can be prohibitively expensive in terms of both the supervision and the compute required. To combat this, a series of lightweight adaptation methods have been proposed to efficiently adapt such models when limited supervision is available. In this work, we show that while effective on internet-style datasets, even these remedies under-deliver on classification tasks whose images differ significantly from those commonly found online. To address this issue, we present a new approach called SVL-Adapter that combines the complementary strengths of vision-language pretraining and self-supervised representation learning. We report an average classification accuracy improvement of 10% on a set of challenging visual classification tasks. Further, we present a fully automatic way of selecting an important blending hyperparameter for our model that does not require any held-out labeled validation data. Code for our project is available here: https://github.com/omipan/svl_adapter.
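
The blending hyperparameter mentioned above can be pictured as weighting the zero-shot CLIP predictions against the predictions of a classifier trained on top of self-supervised features. The snippet below is a minimal illustrative sketch, assuming a convex-combination form; the function and variable names (blend_predictions, clip_logits, adapter_logits, alpha) are hypothetical and the exact formulation used by SVL-Adapter is given in the paper and the linked repository.

```python
import torch
import torch.nn.functional as F

def blend_predictions(clip_logits: torch.Tensor,
                      adapter_logits: torch.Tensor,
                      alpha: float) -> torch.Tensor:
    """Convex combination of zero-shot CLIP scores and adapter scores.

    alpha = 1.0 relies entirely on zero-shot CLIP,
    alpha = 0.0 relies entirely on the self-supervised adapter.
    (Illustrative only; not the paper's exact blending rule.)
    """
    clip_probs = F.softmax(clip_logits, dim=-1)
    adapter_probs = F.softmax(adapter_logits, dim=-1)
    return alpha * clip_probs + (1.0 - alpha) * adapter_probs

# Toy usage: 4 images, 5 classes, arbitrary scores.
clip_logits = torch.randn(4, 5)
adapter_logits = torch.randn(4, 5)
blended = blend_predictions(clip_logits, adapter_logits, alpha=0.5)
predicted_classes = blended.argmax(dim=-1)
```

A value of alpha closer to 0 would favor the self-supervised adapter, which is the regime one would expect to help on images that differ strongly from typical internet photos; the paper's contribution includes selecting this weight automatically without labeled validation data.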

