Context-Aware Robust Fine-Tuning

by   Xiaofeng Mao, et al.

Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]" by using the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Based on exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g. background, style, viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation to show that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture context information. Specifically, we use zero-shot prompt weights to get the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets the context-aware ability of CLIP be inherited by downstream tasks, and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. The experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, and meanwhile brings accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% averaged accuracy on the DomainBed benchmark, setting a new state of the art.



1 Introduction

With rough category-level annotation, traditional visual models (he2016deep; krizhevsky2017imagenet) focus only on the task-related object for classification and neglect the rest of the image, i.e. its context or domain information. This paradigm has a natural disadvantage: model behaviour becomes unstable and unpredictable when facing abnormal inputs whose context or domain lies outside the training set (he2021towards; zhang2022nico++; moreno2012unifying). Even when high In-Distribution (ID) accuracy is reached, a large degradation of performance still occurs on Out-Of-Distribution (OOD) data (taori2020measuring; hendrycks2021many). Recently, Contrastive Language-Image Pre-training (CLIP) (radford2021learning) has opened up a promising path for narrowing the gap between ID and OOD performance. Instead of category labels, CLIP adopts exhaustive text descriptions as supervision to learn visual features. Guided by precise descriptions, CLIP can capture visual concepts not only for the classified object but also for other content in the image, e.g. background, style, viewpoint, etc. We collectively refer to these as context. Such context-aware visual features help generalization to other domains or tasks. Although CLIP has shown superior zero-shot performance, supervised fine-tuning is still necessary for further improvement on a specific downstream task. However, several works (kumar2021fine; wortsman2022robust) have pointed out that fine-tuning can distort pre-trained features and make CLIP lose its robustness and generalization. andreassen2021evolution explored several fine-tuning approaches but found that they struggle to improve robustness at high accuracy. How to robustly fine-tune pre-trained models is increasingly important and remains an open problem. Although existing methods add a pre-step of linear probing (kumar2021fine) or a post-step of weight averaging (wortsman2022robust) to prevent the decline caused by fine-tuning, their complex procedures introduce additional computation or heuristic hyper-parameters that are adapted to specific tasks and have limited versatility.

In this paper, we explore a simple yet effective fine-tuning method from the perspective of context. We first conduct an empirical investigation to show that fine-tuning leads pre-trained models to catastrophically forget old knowledge about context. In detail, we construct a task of prompt-based (petroni2019language; liu2021pre) zero-shot context recognition in Figure 1. The original CLIP model has an initial ability to recognize the type of context with 82.5% accuracy. However, this ability vanishes within a few epochs of fine-tuning, which appears as a dramatic drop in context recognition accuracy. As a consequence, the fine-tuned model cannot take advantage of context-awareness for classifying images with different contexts, leading to underperformance on OOD classification.

To alleviate this effect, we propose Context-Aware Robust Fine-tuning (CAR-FT), a novel fine-tuning paradigm for pre-trained models. CAR-FT regularizes the model during fine-tuning to capture context information. However, since common datasets do not always provide annotations of image context, CAR-FT borrows the zero-shot ability of CLIP models to extract the context distribution of the input images. Specifically, the context distribution is a predicted probability vector over context prompts, which have the same form of “a [CONTEXT] of a [CLASS]” but are used for classifying the “[CONTEXT]” of the image instead of the “[CLASS]”. In this way, we can constrain the fine-tuned features to encode useful context information by minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the zero-shot and fine-tuned CLIP models. The detailed procedure is shown in Figure 2.

Benefiting from context-aware features, CAR-FT can improve both ID and OOD accuracy on downstream tasks. Thorough experiments are conducted to validate the effectiveness of our method. Concretely, CAR-FT achieves 83.3% ID accuracy and 56.9% averaged OOD accuracy on ImageNet classification (deng2009imagenet), which is 2.1% and 8.2% higher than the fine-tuning baseline respectively. Beyond ImageNet, CAR-FT brings accuracy gains on nine downstream tasks. As an application to Domain Generalization (DG), CAR-FT reaches 78.5% averaged accuracy on the DomainBed (gulrajani2020search) benchmarks, surpassing existing DG methods and setting a new state of the art.

Our main contributions are in three aspects:

  1. We empirically demonstrate that the initial context-aware ability of pre-trained models will be corrupted by downstream fine-tuning. As a consequence, models after fine-tuning cannot encode useful image contexts and underperform on OOD classification.

  2. We propose a novel fine-tuning paradigm for pre-trained models called Context-Aware Robust Fine-tuning (CAR-FT). It regularizes the model during fine-tuning so that it also encodes context information, and lets the context-aware ability be inherited by downstream tasks.

  3. The experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, and meanwhile brings accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% averaged accuracy on the DomainBed benchmark, setting a new state of the art.

2 Related Work

Contrastive Language-Image Pre-training Models. CLIP (radford2021learning) has demonstrated that dual-encoder models pre-trained with contrastive objectives on massive image-language pairs can learn generic visual representations. Such representations have zero-shot transfer ability to a variety of downstream classification tasks via prompting. Moreover, they exhibit remarkable robustness under multiple natural distribution shifts. After this initial success, subsequent works proposed further improvements to the CLIP framework, including scaling up (pham2021combined; jia2021scaling), using pre-trained visual encoders (zhai2022lit), combining a sub-task of image captioning (yu2022coca) or supporting more data formats (yuan2021florence). As the performance of CLIP correlates strongly with the datasets it uses (fang2022data), there are efforts (schuhmannlaion; thomee2016yfcc100m) to create plentiful and useful image-text pairs and open them to the community. Various open pre-trained models trained on these datasets are used in our paper. Our CAR-FT is applicable to most CLIP-based foundation models that support prompting. We also compare the effectiveness of CAR-FT across different backbone scales, datasets and other CLIP variants.

Figure 2: (Top left) Prompt-based zero-shot classification by CLIP. (Bottom left) Fine-tuned CLIP with linear classifier on downstream tasks. (Right) The procedure of our CAR-FT.

Robustness under Distribution Shifts. Practical machine learning systems require stability under distribution shifts. Previous methods improve OOD robustness by exploring richer data augmentation (hendrycks2019augmix; hendrycks2021many), adversarial training mechanisms (xie2020adversarial; mao2022enhance; herrmann2022pyramid), advanced network designs (paul2022vision; mao2022towards; bai2021ood; wang2022can) or regularization for a flatter minimum (cha2021swad; foret2020sharpness). However, although many advanced techniques have been proposed, a large gap between ID and OOD accuracy still exists (miller2021accuracy), which means the robustness of deep models is still far from satisfactory. Fortunately, CLIP (radford2021learning) takes a big step towards closing this gap. It opens a feasible route to general robustness through contrastive pre-training on web-scale language-image pairs. Through scaling the model size and training data size of CLIP, powerful foundation models (pham2021combined; jia2021scaling) can be built that achieve state-of-the-art results on most OOD benchmarks. This work relies on contrastive language-image pre-trained models, which provide a sufficient context prior for images. Our CAR-FT uses this context knowledge as far as possible to improve generalization on downstream tasks.

Robust Fine-tuning of Pre-trained Models. Fine-tuning a pre-trained model has become the de facto standard for transfer learning in Computer Vision (CV) and Natural Language Processing (NLP). With the explosive growth in the capability of large-scale pre-trained models (radford2021learning), advanced fine-tuning techniques are receiving increasing attention. Methods that improve fine-tuning usually aim at higher accuracy on downstream tasks (ge2017borrowing; guo2019spottune). However, the robustness and generalization of downstream models have not been extensively studied. Some works (kumar2021fine; wortsman2022robust; andreassen2021evolution) have investigated the evolution of OOD robustness during downstream fine-tuning. They find that the effective robustness of pre-trained models gradually vanishes during fine-tuning. LP-FT (kumar2021fine) proposes to solve this problem with a two-step strategy: first linear probing to find a good classification head, and then full fine-tuning, which reduces the distortion of robust pre-trained features. WiSE-FT (wortsman2022robust) adopts a weight-space ensemble of the original pre-trained weights and the fine-tuned weights to achieve both higher ID and OOD accuracy. This work also aims at robustness and accuracy on downstream tasks, but differently, we modify the fine-tuning method by simply adding a loss term, without introducing extra training cost or heuristic hyper-parameters for specific tasks.

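Input: Pre-trained image encoder $f(\cdot;\theta_0)$ and text encoder $g$; text prompts $P$.
Output: Fine-tuned image encoder weights $\theta$ and classification weights $W_{cls}$.
Compute $W_{cls}$ based on $g(P)$ using Equation (1)
Compute $W_{ctx}$ based on $g(P)$ using Equation (2)
Fix parameters $\theta_0$, $W_{ctx}$ and those of $g$
Initialize $\theta \leftarrow \theta_0$
for each training step do
     Sample a mini-batch of images $x$ with labels $y$
     Get reference context distribution of zero-shot model: $p_{ctx}(x;\theta_0)$
     Get predicted context distribution of fine-tuned model: $p_{ctx}(x;\theta)$
     Compute KL divergence loss $\mathcal{L}_{KL}$
     Compute classification loss $\mathcal{L}_{CE}$
     Update parameters $\theta$, $W_{cls}$ for minimizing $\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{KL}$
end for
Algorithm 1: Pseudo code of CAR-FT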

3 Method

3.1 Preliminaries

Vision-Language Pre-training (VLP) models like CLIP (radford2021learning) and ALIGN (jia2021scaling) consist of an image encoder $f$ and a text encoder $g$ in a dual form. Both are pre-trained with a contrastive loss (hadsell2006dimensionality) that pulls the embeddings of matched image-text pairs together while pushing those of mismatched pairs apart. After pre-training, we can get the visual representation $f(x) \in \mathbb{R}^d$ corresponding to an input image $x$, where $d$ is the feature dimension.

Zero-shot CLIP. Consider a downstream classification task with image $x$ and label $y$, $y \in \{1, \dots, m\}$. Borrowing the pre-training objective of text-image matching, we can design text prompts to fit CLIP to zero-shot classification. Specifically, let $T = \{t_1, \dots, t_n\}$ be a set of prompt templates such as “a [CONTEXT] of [CLASS].”, where [CLASS] and [CONTEXT] are the placeholders for a specific class name and context description respectively. $C = \{c_1, \dots, c_m\}$ is the set of class names corresponding to each $y$. All combinations of prompt templates and class names form the text prompt set $P$. We feed them to the text encoder $g$ to get $W = g(P)$, where $W \in \mathbb{R}^{n \times m \times d}$. The final classification weights can be calculated by:

$$W_{cls} = \frac{\sum_{i=1}^{n} W_{i,:,:}}{\left\lVert \sum_{i=1}^{n} W_{i,:,:} \right\rVert_F} \tag{1}$$

where $W_{cls} \in \mathbb{R}^{m \times d}$ is the text features corresponding to each class, and $\lVert \cdot \rVert_F$ is the Frobenius norm. If not specified, both $W_{cls}$ and $f(x)$ have been normalized, such that we can compare $W_{cls}$ with the image features $f(x)$, and use their cosine similarity $W_{cls} f(x)$ as the zero-shot prediction.
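The zero-shot procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt embeddings are made-up toy numbers rather than outputs of a real text encoder, the function names `class_weights` and `zero_shot_predict` are hypothetical, and each class row is L2-normalized so that the dot product is a cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_weights(W):
    """Average prompt embeddings over templates, then normalize each class row.

    W is a nested list of shape [n_templates][m_classes][d].
    """
    n, m, d = len(W), len(W[0]), len(W[0][0])
    rows = []
    for j in range(m):
        mean = [sum(W[i][j][k] for i in range(n)) / n for k in range(d)]
        rows.append(l2_normalize(mean))
    return rows

def zero_shot_predict(W_cls, feature):
    """Return (index of best class, cosine similarities to every class)."""
    f = l2_normalize(feature)
    scores = [sum(w_k * f_k for w_k, f_k in zip(w, f)) for w in W_cls]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy prompt embeddings: 2 templates, 3 classes, 2-d features (assumed values).
W = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.8, 0.2], [0.1, 0.9], [-0.9, 0.1]],
]
W_cls = class_weights(W)
pred, scores = zero_shot_predict(W_cls, [0.2, 1.5])
print(pred, [round(s, 3) for s in scores])
```

In this toy setup the image feature points mostly along the second axis, so the second class receives the highest similarity.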

Fine-tuned CLIP. A standard fine-tuning paradigm adds a set of learnable parameters $W_{cls}$ for mapping the pre-trained representation $f(x)$ into the label space of the downstream task, where $W_{cls}$ is usually randomly initialized. Since CLIP has its own zero-shot ability, a better choice is to use the zero-shot classification weights for initialization instead of random initialization (wortsman2022robust; wortsman2022model). Both the image encoder weights $\theta$ and $W_{cls}$ are optimized by minimizing the cross-entropy loss. Note that we only consider end-to-end fine-tuning in this work, which usually brings the best results when training data is sufficient.
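The two initialization choices for the classification head can be contrasted with a small sketch. It is illustrative only: `init_head` is a hypothetical helper, the Gaussian standard deviation of 0.01 is an assumed value, and the zero-shot weights are toy numbers.

```python
import random

def init_head(num_classes, dim, zero_shot_weights=None, std=0.01, seed=0):
    """Build the initial classification head as a list of class weight rows."""
    if zero_shot_weights is not None:
        # Prompt-based initialization: copy the zero-shot class weights.
        return [list(row) for row in zero_shot_weights]
    # Random initialization: small Gaussian values (std is an assumption).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_classes)]

W_zero_shot = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # toy values
head_random = init_head(3, 2)
head_prompt = init_head(3, 2, zero_shot_weights=W_zero_shot)
```

Copying the rows keeps the zero-shot weights untouched while the head is updated, which matters when the original weights are still needed as a frozen reference.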

3.2 An Empirical Study of CLIP’s Context-Awareness

In this section, we empirically study the ability of CLIP models to recognize specific contexts, and how it changes during fine-tuning. To this end, we construct a context recognition task whose aim is to classify the four contexts, i.e. art painting, cartoon, photo, sketch, contained in the PACS dataset (li2017deeper). As shown in Figure 1, the original CLIP and the model fine-tuned with category labels are compared on this task. Since class names and context descriptions are given in PACS, we follow the procedure of zero-shot CLIP in Section 3.1 to get the prompt weights $W$. The zero-shot context prompt weights are then obtained by:

$$W_{ctx} = \frac{\sum_{j=1}^{m} W_{:,j,:}}{\left\lVert \sum_{j=1}^{m} W_{:,j,:} \right\rVert_F} \tag{2}$$

which are used for classifying contexts. Interestingly, we find that a zero-shot CLIP model can achieve 82.5% accuracy, exhibiting a good context-aware capability. But this ability is soon lost in the fine-tuning stage: the context recognition accuracy of the fine-tuned CLIP falls to 26.7%, close to a random guess (25% for four contexts). This investigation shows that downstream fine-tuning obscures context-related features. As a consequence, the fine-tuned CLIP models cannot take advantage of context-awareness for classifying images with unusual contexts, which results in underperformance on OOD classification.
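The context recognition task can be sketched as follows: prompt embeddings are averaged over class names instead of templates, and the resulting per-context weights classify the context of each image feature. This is a toy illustration under assumptions: the embeddings, features and labels are invented, `context_weights` and `context_accuracy` are hypothetical names, and row-wise L2 normalization is assumed.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def context_weights(W):
    """Average prompt embeddings over classes to get one weight per context.

    W is a nested list of shape [n_contexts][m_classes][d].
    """
    m, d = len(W[0]), len(W[0][0])
    return [
        l2_normalize([sum(W_i[j][k] for j in range(m)) / m for k in range(d)])
        for W_i in W
    ]

def context_accuracy(W_ctx, features, context_labels):
    """Fraction of features whose most similar context weight matches the label."""
    correct = 0
    for feature, label in zip(features, context_labels):
        f = l2_normalize(feature)
        scores = [sum(a * b for a, b in zip(w, f)) for w in W_ctx]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == label)
    return correct / len(features)

# Toy embeddings: 2 contexts, 2 classes, 2-d features (assumed values).
W = [
    [[1.0, 0.2], [0.8, -0.2]],
    [[0.1, 1.0], [-0.1, 0.9]],
]
W_ctx = context_weights(W)
features = [[2.0, 0.5], [0.3, 1.0], [1.0, 0.9]]
labels = [0, 1, 1]
acc = context_accuracy(W_ctx, features, labels)
print(round(acc, 3))
```

The third toy feature lies slightly closer to the first context, so it is misclassified and the accuracy is two out of three.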

3.3 Context-Aware Robust Fine-tuning

We introduce the proposed Context-Aware Robust Fine-tuning (CAR-FT) in this section. Based on the empirical analysis in Section 3.2, we are motivated to retain the context-awareness of CLIP models during fine-tuning. Our main idea is to use zero-shot CLIP models to induce context-related information that guides the fine-tuning process, such that context-aware features can be learnt on downstream tasks to achieve both strong ID and OOD performance. Specifically, CAR-FT uses the context prompt weights $W_{ctx}$ to get the reference distribution of context in the input image $x$:

$$p_{ctx}(x;\theta_0) = \mathrm{softmax}\big(W_{ctx}\, f(x;\theta_0)\big) \tag{3}$$

where $\theta_0$ is the image encoder weights of zero-shot CLIP. At the start of the fine-tuning stage, we duplicate a new copy of $\theta_0$ as $\theta$. We assume the fine-tuned CLIP model is parameterized by $\theta$, and $W_{ctx}$ is shared between the pre-trained and fine-tuned CLIP. The induced context distribution of the fine-tuned model is:

$$p_{ctx}(x;\theta) = \mathrm{softmax}\big(W_{ctx}\, f(x;\theta)\big) \tag{4}$$

To regularize the fine-tuned visual representation $f(x;\theta)$ to encode effective context information, we make the predicted context distribution $p_{ctx}(x;\theta)$ closer to $p_{ctx}(x;\theta_0)$. This regularization term is realized by minimizing the Kullback-Leibler Divergence (KLD):

$$\mathcal{L}_{KL} = \mathrm{KL}\big(p_{ctx}(x;\theta_0)\,\|\,p_{ctx}(x;\theta)\big) \tag{5}$$
Another objective of CAR-FT is the regular downstream classification loss $\mathcal{L}_{CE}$. The overall loss of our CAR-FT is:

$$\mathcal{L} = \mathcal{L}_{CE} + \alpha\,\mathcal{L}_{KL} \tag{6}$$

where $\alpha$ is a factor that trades off the impact of the KLD term $\mathcal{L}_{KL}$. In this work we set $\alpha$ empirically. It should also be noted that only $\theta$ and $W_{cls}$ are updated during fine-tuning and all other parameters are frozen. The detailed procedure of our CAR-FT is summarized in Algorithm 1.
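To make the combined objective concrete, the sketch below evaluates a cross-entropy term plus a weighted KL term for a single sample. It is a hedged illustration, not the paper's code: features and weights are toy values, `car_ft_loss` is a hypothetical name, no temperature or logit scale is applied before the softmax, the KL direction (reference to prediction) is an assumption, and `alpha=1.0` is a placeholder rather than the value used by the authors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def matvec(W, f):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * x for w, x in zip(row, f)) for row in W]

def car_ft_loss(W_cls, W_ctx, feat_ft, feat_zs, label, alpha):
    """Return (total, cross-entropy, KL) for one sample.

    feat_ft: feature from the encoder being fine-tuned.
    feat_zs: feature from the frozen zero-shot encoder.
    """
    p_ref = softmax(matvec(W_ctx, feat_zs))   # reference context distribution
    p_ft = softmax(matvec(W_ctx, feat_ft))    # predicted context distribution
    l_kl = kl_divergence(p_ref, p_ft)
    l_ce = -math.log(softmax(matvec(W_cls, feat_ft))[label])
    return l_ce + alpha * l_kl, l_ce, l_kl

W_ctx = [[1.0, 0.0], [0.0, 1.0]]               # toy context weights
W_cls = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy class weights
alpha = 1.0                                    # placeholder trade-off factor

total, l_ce, l_kl = car_ft_loss(W_cls, W_ctx, [0.0, 1.0], [1.0, 0.0], 1, alpha)
_, _, kl_same = car_ft_loss(W_cls, W_ctx, [1.0, 0.0], [1.0, 0.0], 0, alpha)
print(round(total, 4), round(l_ce, 4), round(l_kl, 4), kl_same)
```

When the fine-tuned feature equals the zero-shot feature, the KL term is zero and only the classification loss remains, so the regularizer penalizes exactly the drift in context predictions.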

4 Experiments

We demonstrate the robustness of Context-Aware Robust Fine-tuning (CAR-FT) against distribution shifts by evaluating on large-scale ImageNet classification and on DomainBed, which consists of five Domain Generalization (DG) tasks. We further show in Section 4.2 that CAR-FT improves accuracy on downstream tasks. Finally, thorough ablation experiments are conducted to study the impact of hyper-parameters, model scales and so on. Unless otherwise specified, we adopt ViT-B/16 as the default backbone for all experiments.

4.1 Robustness to Distribution Shifts

4.1.1 ImageNet Classification

Benchmarks. We use five OOD test sets for the ImageNet classification task. Each of them represents a type of OOD scenario in which the classifier is prone to make mistakes. IN-V2 (recht2019imagenet) is a new ImageNet test set with distribution shift; IN-R (hendrycks2021many) collects online images of artificial creations, e.g. cartoons, graphics, video game renditions, etc.; IN-Sketch (wang2019learning) contains sketches instead of natural images; ObjectNet (barbu2019objectnet) places ImageNet objects in hard contexts, e.g. unusual backgrounds, rotations or imaging viewpoints; IN-A (hendrycks2021natural) consists of images misclassified by a ResNet-50 (he2016deep). We report the top-1 accuracy on all of the above datasets.

Implementation. For training on the large-scale ImageNet dataset, we adopt the AdamW optimizer (loshchilov2018decoupled) with an initial learning rate of 3e-5, a weight decay of 0.1 and a cosine annealing decay schedule. Only random resized crop is used for data augmentation. All models are trained for 10 epochs with a batch size of 512.
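A cosine annealing schedule of the kind mentioned above can be written as a small function. This is a generic sketch: the initial learning rate of 3e-5 comes from the text, while the final learning rate of zero, the absence of warmup and the function name `cosine_lr` are assumptions.

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-5, min_lr=0.0):
    """Learning rate after `step` updates under cosine annealing.

    Decays from base_lr at step 0 to min_lr at total_steps.
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Inspect the schedule at the start, middle and end of a 100-step run.
for step in (0, 50, 100):
    print(step, cosine_lr(step, 100))
```

The rate starts at the initial value, passes through half of it at the midpoint and reaches the minimum at the final step.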

Distribution shifts: IN-V2, IN-R, IN-Sketch, ObjectNet, IN-A
Method | IN | IN-V2 | IN-R | IN-Sketch | ObjectNet | IN-A | Avg. shifts | Avg. all
CLIP Zero-shot | 68.3 | 61.9 | 77.6 | 48.3 | 29.8 | 50.1 | 53.5 | 56.0
Fine-tuning Only Methods
FT | 81.0 | 70.9 | 54.7 | 42.1 | 26.6 | 31.3 | 45.1 | 51.1
TP-FT | 81.2 | 70.7 | 65.0 | 44.9 | 27.4 | 35.3 | 48.7 | 54.1
LP-FT | 81.7 | 71.6 | 72.9 | 48.4 | 28.2 | 49.1 | 54.0 | 58.7
CAR-FT (Ours) | 83.3 | 74.0 | 75.4 | 53.0 | 32.6 | 49.5 | 56.9 | 61.3
Combined with Weight-Space Ensemble
WiSE-FT (opt. weight) | 81.7 | 72.7 | 78.8 | 53.2 | 33.4 | 52.2 | 58.1 | 62.0
+ CAR-FT | 82.1 | 73.3 | 79.2 | 54.5 | 33.8 | 53.6 | 58.9 | 62.8
Greedy Model Soups | 84.8 | 75.1 | 74.1 | 53.8 | 30.3 | 48.1 | 56.3 | 61.0
+ CAR-FT | 85.0 | 75.8 | 74.4 | 54.6 | 31.6 | 48.9 | 57.1 | 61.7
Uniform Model Soups | 83.4 | 74.7 | 76.8 | 54.6 | 31.5 | 50.0 | 57.5 | 61.8
+ CAR-FT | 83.9 | 75.1 | 77.3 | 55.5 | 32.0 | 51.1 | 58.2 | 62.5

Table 1: Top@1 accuracy of compared methods on ImageNet and its derived distribution shifts. Avg. shifts presents the mean accuracy among five distribution shifts. We use ViT-B/16 as basic backbone.
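As a quick check of how the two average columns are formed, the snippet below recomputes them for the CAR-FT row of Table 1. The per-dataset numbers are taken from the table; the variable names are illustrative.

```python
# CAR-FT (Ours) row of Table 1: top-1 accuracy per test set.
row = {
    "IN": 83.3,
    "IN-V2": 74.0,
    "IN-R": 75.4,
    "IN-Sketch": 53.0,
    "ObjectNet": 32.6,
    "IN-A": 49.5,
}

# "Avg. shifts" averages the five shifted test sets and excludes ImageNet itself.
shift_scores = [acc for name, acc in row.items() if name != "IN"]
avg_shifts = sum(shift_scores) / len(shift_scores)

# "Avg. all" averages all six test sets, ImageNet included.
avg_all = sum(row.values()) / len(row)

print(round(avg_shifts, 1), round(avg_all, 1))
```

Both values round to the figures reported in the table, 56.9 and 61.3.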

Results. We first compare our CAR-FT with previous fine-tuning methods. Among them, FT, TP-FT (li2022elevater; wortsman2022robust; wortsman2022model) and LP-FT (kumar2021fine) adopt random weights, text prompt weights and linear probing weights respectively to initialize the classification head. The results show that linear probing weights are the better initialization of the classification layer for downstream fine-tuning. However, to get linear probing weights, LP-FT must spend an extra round of training. By contrast, our CAR-FT, which uses text prompt weights for initialization, has a lower training cost. Meanwhile, compared with LP-FT, CAR-FT achieves +1.6% and +2.9% improvement in ID and averaged OOD accuracy respectively (83.3% vs. 81.7% and 56.9% vs. 54.0% in Table 1). This suggests that preserving context-awareness during fine-tuning does help OOD classification.

To demonstrate the versatility of our method, we combine CAR-FT with weight-space ensemble methods to further robustify the model. Both WiSE-FT (wortsman2022robust) and ModelSoups (wortsman2022model) take advantage of the shared training trajectory among models fine-tuned from the same pre-trained weights, and use weight-space ensembling to integrate the power of multiple models. Our CAR-FT can be applied by simply modifying the preceding fine-tuning stage of these methods. For WiSE-FT, we use the optimal interpolation weight for comparison. The results are shown in Table 1. An interesting phenomenon is that the ID accuracy of CAR-FT drops from 83.3% to 82.1% after combining with WiSE-FT. This finding is contrary to the WiSE-FT paper, where weight-space ensembling always helps improve accuracy. We think CAR-FT has already learnt useful knowledge from the zero-shot CLIP model via the KLD loss, so ensembling with a less accurate zero-shot model brings no complementary benefit and even offsets part of the original performance. However, although ID accuracy decreases, the robustness of CAR-FT under distribution shifts is enhanced by weight-space ensembling: our method yields +0.8% averaged OOD accuracy over WiSE-FT. ModelSoups fine-tunes 72 models with various combinations of hyper-parameters, and conducts weight-space ensembling of them with a uniform or greedy policy. We test CAR-FT augmented with uniform and greedy soups. Our CAR-FT further obtains +0.2% and +0.5% ID accuracy and +0.8% and +0.7% OOD accuracy over the greedy soup and the uniform soup respectively.
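The two weight-space ensembling operations discussed here can be sketched on toy parameter dictionaries. This is a simplified illustration under assumptions: real models hold tensors rather than short lists, the names `interpolate` and `uniform_soup` are hypothetical, and the greedy soup policy (which needs held-out accuracy) is not shown.

```python
def interpolate(theta_zero_shot, theta_fine_tuned, mix):
    """Linear interpolation of two weight sets, parameter by parameter.

    mix = 0 returns the zero-shot weights, mix = 1 the fine-tuned weights.
    """
    return {
        name: [(1.0 - mix) * a + mix * b
               for a, b in zip(theta_zero_shot[name], theta_fine_tuned[name])]
        for name in theta_zero_shot
    }

def uniform_soup(models):
    """Average the weights of several fine-tuned models with equal weight."""
    return {
        name: [sum(values) / len(models)
               for values in zip(*(m[name] for m in models))]
        for name in models[0]
    }

theta_zs = {"w": [1.0, 2.0]}   # toy zero-shot weights
theta_ft = {"w": [3.0, 6.0]}   # toy fine-tuned weights
halfway = interpolate(theta_zs, theta_ft, 0.5)
soup = uniform_soup([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}])
print(halfway, soup)
```

Because both operations act only on stored weights, they add no inference cost: the ensemble is a single model with averaged parameters.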

4.1.2 Domain Generalization

Benchmarks. We present our results on five benchmark datasets included in DomainBed (gulrajani2020search): PACS (li2017deeper) (4 domains, 7 classes, and 9,991 images), VLCS (torralba2011unbiased) (4 domains, 5 classes, and 10,729 images), OfficeHome (venkateswara2017deep) (4 domains, 65 classes, and 15,588 images), TerraIncognita (beery2018recognition) (4 domains, 10 classes, and 24,788 images), and DomainNet (peng2019moment) (6 domains, 345 classes, and 586,575 images). The leave-one-out evaluation protocol is adopted, where we iteratively choose a single domain as the test domain and use the other domains for training. The accuracy averaged over all chosen test domains is reported as the general performance on each dataset. We repeat each CAR-FT experiment three times to reduce the fluctuation of results.
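The leave-one-out protocol can be expressed as a short loop over domains. The sketch below uses the four PACS domain names; the `evaluate` callback stands in for training on the listed domains and testing on the held-out one, and the accuracy values in the example are made up for illustration.

```python
def leave_one_domain_out(domains):
    """Yield (training domains, held-out test domain) for every domain."""
    for held_out in domains:
        yield [d for d in domains if d != held_out], held_out

def average_accuracy(domains, evaluate):
    """Average the test accuracy over all held-out domains.

    `evaluate(train_domains, test_domain)` is a placeholder for a full
    train-and-test run and must return an accuracy.
    """
    accuracies = [evaluate(train, test) for train, test in leave_one_domain_out(domains)]
    return sum(accuracies) / len(accuracies)

pacs_domains = ["art painting", "cartoon", "photo", "sketch"]
splits = list(leave_one_domain_out(pacs_domains))

# Dummy evaluator with invented accuracies, only to exercise the loop.
fake_scores = {"art painting": 1.0, "cartoon": 0.5, "photo": 1.0, "sketch": 0.5}
mean_acc = average_accuracy(pacs_domains, lambda train, test: fake_scores[test])
print(splits[0], mean_acc)
```

Each domain is held out exactly once, so a dataset with four domains requires four training runs per trial.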

Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg.
Previous SOTA using RegNetY-16GF
MIRO (cha2022domain) | 97.4 | 79.9 | 80.4 | 58.9 | 53.8 | 74.1
EoA (arpit2021ensemble) | 95.8 | 81.1 | 83.9 | 61.1 | 60.9 | 76.6
MIRO+SWAD (cha2022domain; cha2021swad) | 96.8 | 81.7 | 83.3 | 64.3 | 60.7 | 77.3
ViT-B/16 Pre-trained by CLIP
ERM | 93.7 | 82.7 | 78.5 | 52.3 | 53.8 | 72.2
MIRO (cha2022domain) | 95.6 | 82.2 | 82.5 | 54.3 | 54.0 | 73.7
DPL (zhang2022domain) | 97.3 | 84.3 | 84.2 | 52.6 | 56.7 | 75.0
CAR-FT (Ours) | 96.8 | 85.5 | 85.7 | 61.9 | 62.5 | 78.5
Table 2: Comparison with domain generalization methods on DomainBed. The reported accuracies are averaged over three trials.

Method | ImageNet | Flowers | Aircraft | Pets | CIFAR10 | CIFAR100 | Cars | DTD | SUN397 | Avg.
FT | 81.0 | 93.7 | 53.0 | 93.4 | 97.6 | 85.6 | 88.1 | 75.8 | 71.5 | 82.2
TP-FT | 81.2 | 93.1 | 51.6 | 90.6 | 97.7 | 89.5 | 86.6 | 77.4 | 74.8 | 82.5
LP-FT | 81.7 | 95.0 | 53.3 | 93.2 | 97.8 | 85.4 | 88.5 | 75.4 | 71.6 | 82.4
WiSE-FT (opt. weight) | 82.5 | 93.6 | 53.2 | 90.7 | 98.2 | 90.6 | 88.3 | 78.5 | 78.6 | 83.8
CAR-FT (Ours) | 83.3 | 94.4 | 54.1 | 92.5 | 98.7 | 91.1 | 88.9 | 79.0 | 78.9 | 84.5
Table 3: Performance on nine downstream tasks using different fine-tuning methods. The reported accuracy is averaged over three trials.

Implementation. Our CAR-FT requires designed context prompts. However, for Domain Generalization (DG) datasets, there is no off-the-shelf description for each class or context, so we construct the prompts ourselves. For PACS, OfficeHome and DomainNet, since domain names are provided, we use a template of “a [CONTEXT] of [CLASS].”, where [CONTEXT] is the domain name. Taking PACS as an example, [CONTEXT] is one of {art painting, cartoon, photo, sketch}. For TerraIncognita and VLCS, even the domain names are not available, so we directly use the ImageNet prompt templates from the CLIP paper (radford2021learning). The detailed text prompts of each dataset can be found in the Supplementary. Next we present the training configuration. The models in the domain generalization experiments are optimized by AdamW with a learning rate of 5e-6 and a weight decay of 0.1. For data augmentation, simple random resized crop and random horizontal flip are used. The other hyper-parameters, such as batch size, dropout rate and training steps, are kept consistent with the default configuration in DomainBed.
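Building the prompt grid from domain names and class names is a simple string-formatting step. The sketch below is illustrative: the template wording follows the form quoted in the text, the helper `build_prompts` is hypothetical, and only two example class names are used instead of the full class list.

```python
def build_prompts(contexts, class_names, template="a {context} of a {cls}."):
    """Return a grid of prompts indexed as prompts[context][class]."""
    return [
        [template.format(context=context, cls=cls) for cls in class_names]
        for context in contexts
    ]

pacs_domains = ["art painting", "cartoon", "photo", "sketch"]
example_classes = ["dog", "guitar"]  # illustrative subset of class names

prompts = build_prompts(pacs_domains, example_classes)
for row in prompts:
    print(row)
```

Keeping the grid indexed by context first and class second makes it easy to average the encoded prompts along either axis, for class weights or for context weights.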

Results. We make a fair comparison with ERM, MIRO (cha2022domain), and DPL (zhang2022domain). All of them use ViT-B/16 pre-trained by CLIP on 400M image-text pairs. Benefiting from large-scale pre-training on massive data, the simple ERM baseline obtains 72.2% averaged accuracy on the five datasets. The recently proposed DG algorithms MIRO and DPL yield +1.5% and +2.8% average improvement respectively. By contrast, our CAR-FT surpasses all other methods and achieves the largest improvement of +6.3%. We also compare CAR-FT with the previously reported state of the art on the DG task. The best reported performance is MIRO with SWAD (cha2021swad) using a RegNetY-16GF (radosavovic2020designing) pre-trained on 3.6B images. Our CAR-FT sets a new state of the art with a lead of +1.2%.

4.2 Accuracy on Downstream Tasks

Benchmarks. Beyond robustness to distribution shift, we also select 9 image classification datasets used in the downstream transfer experiments of CLIP: ImageNet (deng2009imagenet), CIFAR10, CIFAR100, FGVCAircraft (maji2013fine), OxfordPets (parkhi2012cats), StanfordCars (krause20133d), Flowers102 (nilsback2008automated), SUN397 (xiao2016sun) and DTD (cimpoi2014describing) (see Supplementary for their statistics). These benchmarks cover diverse classification tasks on textures, image details and generic objects. We use them for a comprehensive evaluation of the compared fine-tuning methods.

Implementation. For each downstream dataset, we use the corresponding text prompts in the CLIP repository. The model is fine-tuned end to end. Other training settings are consistent with the ImageNet experiment in Section 4.1.1.

Results. We fine-tune CLIP models with various methods and report the top@1 accuracy on downstream tasks. The baseline methods in Section 4.1.1 are compared except for ModelSoups, which ensembles a large number of models and may make the comparison unfair. Since we have shown previously that WiSE-FT cannot improve the ID accuracy of CAR-FT models, in this experiment we only use plain CAR-FT without weight-space ensemble methods. The results are shown in Table 3. Generally, text prompts can be regarded as a good prior for initializing the classification head. However, our experiment shows that TP-FT falls behind FT on several datasets, e.g. Flowers102, OxfordPets and StanfordCars, which demonstrates that fine-tuning a randomly initialized head is in some cases even better than initializing with text prompts. As for the other methods, both WiSE-FT and our CAR-FT consistently improve on the TP-FT baseline. Our CAR-FT achieves the best averaged accuracy, surpassing WiSE-FT by 0.7%. This experiment suggests that learning context-aware features can also benefit performance on in-domain data.

Method | ImageNet | Avg. shifts
ResNet50
Zero-shot | 59.6 | 38.5
TP-FT | 76.1 | 37.6
LP-FT | 76.2 | 39.6
WiSE-FT | 76.2 | 42.2
CAR-FT | 76.7 | 42.9
CAR-FT+WiSE-FT | 76.3 | 43.3
ViT-L/14
Zero-shot | 75.5 | 65.1
TP-FT | 85.0 | 59.9
LP-FT | 85.3 | 66.2
WiSE-FT | 86.0 | 68.8
CAR-FT | 87.1 | 67.8
CAR-FT+WiSE-FT | 86.6 | 69.7
ViT-L/14@336px
Zero-shot | 76.2 | 67.3
TP-FT | 86.2 | 64.2
LP-FT | 86.3 | 68.0
WiSE-FT | 87.1 | 71.7
CAR-FT | 87.7 | 70.2
CAR-FT+WiSE-FT | 87.3 | 72.1
ViT-L/14 (using LAION-2B)
Zero-shot | 75.2 | 62.0
TP-FT | 84.3 | 59.4
LP-FT | 84.4 | 62.8
WiSE-FT | 85.3 | 66.0
CAR-FT | 86.1 | 66.0
CAR-FT+WiSE-FT | 85.6 | 67.1
ViT-H/14 (using LAION-2B)
Zero-shot | 78.0 | 65.2
TP-FT | 84.7 | 58.7
LP-FT | 84.9 | 66.1
WiSE-FT | 86.0 | 68.2
CAR-FT | 86.6 | 69.7
CAR-FT+WiSE-FT | 86.3 | 69.9
Table 4: The effect of CAR-FT on different CLIP backbones.

4.3 Ablation Study

Trade-off Between Loss Terms. We set a hyper-parameter $\alpha$ in Equation 6 to balance the effect of the two loss terms. To study the sensitivity of our method w.r.t. $\alpha$, we run CAR-FT on ImageNet with multiple values of $\alpha$. The results are shown in Figure 3. The best performance appears at around the empirical value of $\alpha$ used in this paper. Besides, we find that both ID and OOD accuracy follow roughly the same trend as $\alpha$ changes: they become worse as $\alpha$ moves to smaller or larger values. The figure also suggests that CAR-FT is insensitive to larger values of $\alpha$: the performance decreases slowly once $\alpha$ is larger than 2. This phenomenon also indicates that the KLD loss term is the key component for the enhancement.

Figure 3: The performance of our CAR-FT with different $\alpha$ on ImageNet. The solid line indicates the validation accuracy, and the dashed line presents the averaged accuracy under distribution shifts.
Figure 4: Visualization of top@1 model prediction on context prompt. The text in red presents incorrect prediction.

Different Backbones. We conduct an ablation on backbones in Table 4 to learn how model scales or architectures affect the performance of our CAR-FT. For simplicity, we omit the detailed accuracy on each OOD dataset and directly show the averaged accuracy under the 5 distribution shifts. ResNet50 and ViT are adopted as the typical convolution-based and transformer-based networks respectively. We adopt the official CLIP weights for ResNet50, ViT-L/14 and ViT-L/14@336px. For more types of backbones, we additionally use the ViT-L/14 and ViT-H/14 weights released in OpenCLIP (ilharco_gabriel_2021_5143773), which are pre-trained on LAION-2B (schuhmannlaion). Our CAR-FT outperforms the other methods across multiple model scales and architectures. In particular, plain CAR-FT without weight-space ensemble can even surpass WiSE-FT on ResNet50 and ViT-H/14. By sacrificing a little accuracy on ImageNet, the robustness of CAR-FT against distribution shifts can be further improved by WiSE-FT. Even with this trade-off, the results remain better than WiSE-FT alone, with both higher ID and OOD accuracy.

Template type | Template num | ImageNet | Avg. shifts
CLIP | 80 | 83.3 | 56.9
CLIP | 7 | 83.2 | 56.5
Searched | 7 | 83.2 | 56.9
Table 5: CAR-FT with different types or amounts of text prompts.

Method | ImageNet | Avg. shifts
ViT-B/32 by DeCLIP-88M
Zero-shot | 66.2 | 43.0
TP-FT | 74.2 | 38.9
LP-FT | 74.0 | 42.8
CAR-FT | 74.3 | 43.7
ViT-B/32 by FILIP-YFCC15M
Zero-shot | 39.5 | 19.6
TP-FT | 59.4 | 24.2
LP-FT | 59.3 | 24.1
CAR-FT | 59.4 | 24.4
ViT-B/32 by SLIP-YFCC15M
Zero-shot | 34.3 | 15.5
TP-FT | 56.0 | 20.7
LP-FT | 56.1 | 20.7
CAR-FT | 56.1 | 21.0
Table 6: Performance of CAR-FT on other CLIP variants.

Impact of Context Prompts. In our CAR-FT, prompts provide language supervision of context. This supervision is influenced by the types and quantity of the prompts used. Intuitively, the finer the granularity and the greater the quantity of prompts, the more types of image context CAR-FT can perceive. However, we find that a few concise context prompts are sufficient for CAR-FT to achieve desirable results. This finding is validated by our ablation experiment comparing CAR-FT models trained with various types and quantities of prompts. Specifically, we adopt the official CLIP templates (radford2021learning), and change the number of prompts by randomly sampling a subset from the original template set. A subset size of 7 is compared. The results in Table 5 suggest that using only 7 randomly sampled context prompts gives slightly lower ID and OOD performance (83.2% and 56.5% vs. 83.3% and 56.9%). To compare different types of context prompts, we use the 7 searched templates. With these better organized prompts, the OOD accuracy rises to 56.9%, matching the model that uses all 80 templates.
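Sampling a smaller template subset, as in the ablation above, takes one call to the standard library. The sketch is illustrative only: the ten templates listed are a short stand-in for the full template set, and the seed and the helper name `sample_templates` are assumptions.

```python
import random

def sample_templates(templates, k, seed=0):
    """Draw k distinct templates at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(templates, k)

# Illustrative stand-in for the full template set.
templates = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a dark photo of the {}.",
    "a photo of the hard to see {}.",
    "a low resolution photo of the {}.",
    "a close-up photo of a {}.",
    "a sketch of a {}.",
    "a painting of the {}.",
    "a cartoon {}.",
    "art of the {}.",
]

subset = sample_templates(templates, 7)
print(subset)
```

Fixing the seed keeps the sampled subset identical across runs, so differences between configurations are not caused by a different draw of templates.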

Figure 5: Visualized attention of CLIP models with zero-shot, traditional fine-tuning and our CAR-FT.
Figure 6: More results of interpolated weights using WiSE-FT. We compare original WiSE-FT with our CAR-FT based WiSE-FT.

Beyond CLIP. Table 6 examines whether CAR-FT remains helpful on other CLIP variants such as DeCLIP (li2021supervision), SLIP (mu2022slip) and FILIP (yao2021filip). Among them, DeCLIP is pre-trained on a private 88M dataset, and the others are pre-trained on YFCC15M, a filtered version of YFCC100M (thomee2016yfcc100m). We find that the effect of CAR-FT depends on the zero-shot classification capability of the pre-trained model. For zero-shot models with low performance that are pre-trained on smaller datasets, e.g. FILIP and SLIP, CAR-FT obtains only about +0.3% improvement in OOD accuracy. This is reasonable: a poorly performing zero-shot model cannot provide an accurate prediction of the context distribution, which limits the gains of CAR-FT.

5 Discussion

We further analyze the phenomena observed with CAR-FT and provide visualizations for a deeper understanding of how CAR-FT works.

Can CAR-FT recognize image contexts? Since the motivation of CAR-FT is to reduce the corruption of context-aware features during fine-tuning, it is necessary to check whether CAR-FT models truly recognize image context. We rerun the empirical experiment of Section 3.2 to validate the context-awareness of CAR-FT, keeping all other settings fixed and only replacing the image encoder with the one fine-tuned by CAR-FT. CAR-FT achieves 83.5% zero-shot accuracy in recognizing the contexts of the PACS dataset, which is even 1% higher than the original CLIP model. This indicates that our method is effective in learning context-aware features. We additionally visualize the top-1 context prediction of CLIP models on ImageNet classification. We choose test samples from the ImageNet-R dataset and present the best-matched context prompt in Figure 4. Note that a single image may match more than one prompt. For example, in the first row of Figure 4, both “a photo of the hard to see jellyfish.” and “a dark photo of the jellyfish.” are reasonable descriptions to a human. We therefore judge the correctness of a context prediction subjectively. For the zero-shot CLIP model, most predictions are aligned with the image context and conform with human understanding. After fine-tuning, however, the model becomes context-unaware and predicts wrong text prompts. CAR-FT can fix these mistakes and even output more accurate context predictions. The visualization illustrates the strong context understanding of CAR-FT. Additionally, we use the interpretation method of chefer2021generic to visualize the attention of CAR-FT models. In Figure 5, we sample images from ImageNet-R with unusual contexts and compare the attention on them using zero-shot, fine-tuned, and CAR-FT CLIP models. CAR-FT exhibits stronger interpretability and focuses on the important objects even under uncommon contexts.
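As a rough illustration of the top-1 context prediction discussed above, the sketch below picks the prompt whose text feature has the highest cosine similarity with the image feature. The helper names (`cosine`, `top1_context`), the two-dimensional toy features, and the second prompt string are hypothetical; in practice the features would come from the CLIP image and text encoders.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top1_context(image_feature: Sequence[float],
                 prompt_features: Dict[str, Sequence[float]]) -> str:
    """Return the prompt whose text feature best matches the image feature."""
    return max(prompt_features,
               key=lambda prompt: cosine(image_feature, prompt_features[prompt]))


# Toy features (made up for illustration, not real CLIP embeddings).
prompt_features = {
    "a dark photo of the jellyfish.": [0.9, 0.1],
    "a sketch of the jellyfish.": [0.1, 0.9],
}
image_feature = [1.0, 0.2]

best_prompt = top1_context(image_feature, prompt_features)
```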

Other variants of CAR-FT. CAR-FT admits multiple variants that differ in how the context prompts are constructed or how the divergence between context distributions is measured. For example, Equation 2 computes the context prompt weights by summing the context prompt weights over all categories. A subjectively more proper choice, however, is to use only the context prompt of the ground-truth category. We experiment with both constructions and find that they perform similarly. For the objective function, this paper uses KLD to measure the discrepancy between context distributions, although alternative metrics exist, such as Maximum Mean Discrepancy (MMD) or the Wasserstein distance. Replacing KLD with these more advanced metrics does not bring better performance. Therefore, a simple KLD term is minimized in CAR-FT.
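To make the two prompt-weight constructions and the KLD term concrete, a minimal sketch follows. It is not the paper's implementation: the function names, the layout `prompt_weights[context][class]`, the omission of a temperature, and the direction of the divergence (zero-shot distribution first) are all assumptions made for illustration, and the toy numbers are made up.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weights_summed_over_classes(prompt_weights):
    """One weight vector per context, summed over all categories."""
    return [[sum(dims) for dims in zip(*per_class)] for per_class in prompt_weights]


def weights_for_ground_truth(prompt_weights, label: int):
    """One weight vector per context, using only the ground-truth category."""
    return [list(per_class[label]) for per_class in prompt_weights]


def context_distribution(image_feature, context_weights) -> List[float]:
    """Softmax over the similarity between the image and each context weight."""
    logits = [sum(w * x for w, x in zip(row, image_feature))
              for row in context_weights]
    return softmax(logits)


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy prompt weights: 2 contexts x 2 classes x 2 feature dimensions (made up).
prompt_weights = [
    [[1.0, 0.0], [0.8, 0.2]],  # context 0
    [[0.0, 1.0], [0.2, 0.8]],  # context 1
]
context_weights = weights_summed_over_classes(prompt_weights)

# Features of the same image from the zero-shot and fine-tuned encoders (made up).
p_zero_shot = context_distribution([1.0, 0.0], context_weights)
p_fine_tuned = context_distribution([0.0, 1.0], context_weights)

# Regularization term: assumed here to be KL(zero-shot || fine-tuned).
loss = kl_divergence(p_zero_shot, p_fine_tuned)
```

Swapping `weights_summed_over_classes` for `weights_for_ground_truth` gives the second variant discussed above; the rest of the computation is unchanged.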

6 Conclusion

In this paper, we propose CAR-FT, which alleviates the loss of the context-aware ability of zero-shot models during fine-tuning. By carrying context-awareness over to downstream tasks, CAR-FT builds fine-tuned models with enhanced accuracy and robustness. Our work relies on image-language pre-trained models and task-related text prompts. These strict preconditions may limit the applicability of CAR-FT to models pre-trained on images only or to tasks without designed text prompts. How to apply the idea of context-aware fine-tuning to vision-only pre-trained models and remove the prerequisite of text prompts is left as future work.

Supplementary information

We provide supplementary information on the datasets and prompts used, together with more results of CAR-FT combined with WiSE-FT.

Datasets Details. The detailed statistics of the nine downstream datasets, as well as the five OOD testsets of ImageNet, are shown in Table 7.

Dataset               Classes   Train       Test
ImageNet              1,000     1,281,167   50,000
CIFAR10               10        50,000      10,000
CIFAR100              100       50,000      10,000
FGVCAircraft          100       6,667       3,333
OxfordPets            37        3,680       3,669
StanfordCars          196       8,144       8,041
OxfordFlowers102      102       2,040       6,149
SUN397                397       19,850      19,850
Describable Textures  47        3,760       1,880
ImageNet-V2           1,000     N/A         10,000
ImageNet-R            200       N/A         30,000
ImageNet-Sketch       1,000     N/A         50,889
ObjectNet             313       N/A         50,000
ImageNet-A            200       N/A         7,500
Table 7: Details of the datasets used in our paper. For domain generalization benchmarks, we have introduced them in Section 4.1.2.

Prompts Details. The following prompt templates are used for PACS, OfficeHome and DomainNet in the domain generalization experiments. For TerraIncognita and VLCS, we directly use the templates of ImageNet.

templates = ['a art painting of [CLASS].',
             'a cartoon [CLASS].',
             'a photo of [CLASS].',
             'a sketch of [CLASS].']
Listing 1: Templates for PACS
templates = ['a art painting of [CLASS].',
             'a clipart of [CLASS].',
             'a product of [CLASS].',
             'a photo of [CLASS].']
Listing 2: Templates for OfficeHome
templates = ['a clipart of [CLASS].',
             'a infograph of [CLASS].',
             'a painting of [CLASS].',
             'a quickdraw of [CLASS].',
             'a photo of [CLASS].',
             'a sketch of [CLASS].']
Listing 3: Templates for DomainNet
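
The templates in the listings above are turned into concrete prompts by substituting each class name for the [CLASS] placeholder. A minimal sketch of that substitution is given below; the helper name `build_prompts` and the two example class names are chosen for illustration and are not taken from the paper's code.

```python
from typing import Dict, List

# PACS templates, copied from Listing 1.
PACS_TEMPLATES = [
    'a art painting of [CLASS].',
    'a cartoon [CLASS].',
    'a photo of [CLASS].',
    'a sketch of [CLASS].',
]


def build_prompts(templates: List[str], class_names: List[str]) -> Dict[str, List[str]]:
    """Map each class name to the list of prompts obtained by filling [CLASS]."""
    return {
        name: [template.replace('[CLASS]', name) for template in templates]
        for name in class_names
    }


# Two example class names (illustrative only).
prompts = build_prompts(PACS_TEMPLATES, ['dog', 'guitar'])
```

Each class then has one prompt per context, which a text encoder would map to the prompt weights used by CAR-FT.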

More results of CAR-FT combined with WiSE-FT. In Tables 1 and 4, we only report the results of CAR-FT combined with WiSE-FT at the optimal interpolation weight. To show more detailed results, we take 10 weights uniformly from the interval [0,1] and linearly interpolate between the zero-shot and fine-tuned models, so that an ID-OOD accuracy curve can be plotted to see whether CAR-FT achieves a better trade-off. Figure 6 shows that the curve of CAR-FT combined with WiSE-FT lies above the original WiSE-FT curve: most interpolated CAR-FT models are consistently better than their WiSE-FT counterparts.
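
The interpolation sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions: model weights are represented as dictionaries of flat float lists, the names `interpolate_weights` and `sweep_alphas` are hypothetical, and the 10 coefficients are taken evenly spaced with both endpoints included, which is one reading of "uniformly from the interval [0,1]".

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate_weights(zero_shot: Weights, fine_tuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned for every parameter."""
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * z + alpha * f
               for z, f in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }


def sweep_alphas(num: int = 10) -> List[float]:
    """Evenly spaced coefficients in [0, 1], endpoints included (an assumption)."""
    return [i / (num - 1) for i in range(num)]


# Toy weights standing in for real checkpoints (made up for illustration).
zero_shot = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
fine_tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}

# One interpolated model per coefficient; each would then be evaluated
# on the ID and OOD test sets to trace the accuracy curve.
models = [interpolate_weights(zero_shot, fine_tuned, a) for a in sweep_alphas()]
```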