Context-Aware Robust Fine-Tuning

11/29/2022
by Xiaofeng Mao, et al.

Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]" by measuring the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Thanks to exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g., background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in an image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets downstream tasks inherit the context-aware ability of CLIP and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. Experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% accuracy, setting a new state-of-the-art.
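
A minimal PyTorch sketch of the regularizer described above, assuming a CLIP-style setup; the function names, the KL direction, and the loss weighting are illustrative assumptions rather than the authors' released code:

```python
import torch
import torch.nn.functional as F

def context_distribution(image_features, context_prompt_embeds, temperature=0.01):
    """Softmax over cosine similarities between image features and the text
    embeddings of context prompts (e.g. "a photo of ...", "a sketch of ...")."""
    image_features = F.normalize(image_features, dim=-1)
    context_prompt_embeds = F.normalize(context_prompt_embeds, dim=-1)
    logits = image_features @ context_prompt_embeds.t() / temperature
    return F.softmax(logits, dim=-1)

def car_ft_loss(images, labels, image_encoder_ft, image_encoder_zs,
                class_weights, context_prompt_embeds, lam=1.0):
    """Downstream cross-entropy plus a KL term that keeps the fine-tuned
    model's context distribution close to the zero-shot one (assumed form)."""
    feats_ft = image_encoder_ft(images)          # encoder being fine-tuned
    with torch.no_grad():
        feats_zs = image_encoder_zs(images)      # frozen original CLIP encoder

    # Task objective: classify with prompt-derived class weights (one row per class).
    logits = F.normalize(feats_ft, dim=-1) @ F.normalize(class_weights, dim=-1).t()
    ce = F.cross_entropy(logits / 0.01, labels)

    # Context-aware regularizer: KLD between the two context distributions.
    p_ft = context_distribution(feats_ft, context_prompt_embeds)
    p_zs = context_distribution(feats_zs, context_prompt_embeds)
    kld = F.kl_div(p_ft.log(), p_zs, reduction="batchmean")

    return ce + lam * kld
```

The idea is simply to score each image against a bank of context-prompt text embeddings with both the frozen zero-shot encoder and the encoder being fine-tuned, and to penalize divergence between the two resulting context distributions while training on the downstream labels.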


Related research

CLIPood: Generalizing CLIP to Out-of-Distributions (02/02/2023)
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (03/10/2022)
How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models? (06/09/2023)
Efficient Zero-shot Visual Search via Target and Context-aware Transformer (11/24/2022)
Context-aware Fine-tuning of Self-supervised Speech Models (12/16/2022)
Are Sample-Efficient NLP Models More Robust? (10/12/2022)
Context-Aware Abbreviation Expansion Using Large Language Models (05/08/2022)
