Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

07/13/2023
by   Denis Coquenet, et al.

Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, largely thanks to their free-text inputs. However, they struggle with some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capabilities of vision-language foundation models. Using the CLIP architecture as a baseline, we show strong improvements on fine-grained bird attribute detection and localization tasks, while also increasing classification performance on the CUB200-2011 dataset. For reproducibility, source code is available at https://github.com/FactoDeepLearning/MultitaskVLFM.
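The positive/negative prompt formulation can be illustrated with a minimal sketch: an attribute is predicted present if the image embedding is closer to a positive prompt ("a bird with a red crown") than to its paired negative prompt ("a bird without a red crown"). The toy below uses random vectors in place of CLIP's actual image and text encoders; it is an assumption-laden illustration of the scoring idea, not the authors' implementation (which is in the linked repository).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
dim = 8  # toy embedding size; CLIP uses e.g. 512

# Stand-ins for CLIP embeddings (hypothetical: a real pipeline would
# encode the image and the two prompts with a pretrained CLIP model).
image_emb = rng.normal(size=dim)
pos_prompt_emb = rng.normal(size=dim)  # e.g. "a bird with a red crown"
neg_prompt_emb = rng.normal(size=dim)  # e.g. "a bird without a red crown"

# Attribute detection as a binary choice between the paired prompts:
# softmax over the two similarities gives a presence probability.
scores = np.array([cosine(image_emb, pos_prompt_emb),
                   cosine(image_emb, neg_prompt_emb)])
probs = np.exp(scores) / np.exp(scores).sum()
present = bool(probs[0] > 0.5)
print(f"attribute present: {present}, p = {probs[0]:.3f}")
```

In practice one such prompt pair is built per attribute, and fine-tuning adjusts the encoders so these pairwise decisions align with ground-truth attribute labels.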

