Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization

by Xin Zhang, et al.
The University of Tokyo

Domain generalization (DG) is a difficult transfer learning problem that aims to learn a model that generalizes to unseen domains. Recent massive pre-trained models such as CLIP and GPT-3, i.e., foundation models (FMs), have been shown to be robust to many distribution shifts and should therefore lead to substantial improvements in DG. In this work, we study generic ways to adopt CLIP for DG problems in image classification, evaluating both naive zero-shot learning and full DG learning settings. For the latter, we propose AP (Amortized Prompt), a novel approach to domain inference in the form of prompt generation. On several standard DG benchmark datasets, namely PACS, VLCS, OfficeHome, and TerraIncognita, CLIP provides comparable performance without fine-tuning any parameters, suggesting the applicability and importance of FMs in DG. In addition, we show that combining domain prompt inference with CLIP enables AP to outperform strong baselines and the naive CLIP baseline by a large margin, raising accuracy from 71.3% to 79.6%. We hope that the simplicity and success of our approach emphasize the importance of foundation models and lead to their wider adoption and analysis in the field of domain generalization.


1 Introduction

Figure 2: The conceptual difference between (a) standard DG methods that use ResNet18 or ResNet50 as a backbone and (b) a foundation model such as CLIP. Most standard DG methods (a) explicitly or implicitly perform domain alignment to learn domain-invariant representations, or add samples or regularization to avoid overfitting to the source domains. In this work, we propose (b) to use a foundation model, whose representation is more effective for adaptation to a target domain.

Transfer learning is often our first choice when we want to build AI applications, where pre-train then fine-tune is the most popular paradigm: many natural language processing (NLP) applications are based on BERT, and many computer vision (CV) applications are based on pre-trained ResNet or Vision Transformer (ViT) [14]. However, in practical application scenarios, domain shifts pose a substantial challenge to transferring models successfully [42]. For example, differences between patient data collected at different hospitals can disrupt medical diagnosis AI models, and traffic conditions in different cities can make self-driving algorithms unstable. Much work in domain adaptation (DA) [37, 50] and domain generalization (DG) [51, 49, 42] has addressed this domain shift issue, but it remains an unsolved and critical problem.

A group of large-scale pre-trained models such as GPT-3 [10] and CLIP [38], recently coined foundation models (FMs) [7], are in the spotlight for their impressive broad generalization performance [12]. Because their capabilities are hard to predict and bring both opportunities and risks, many studies work on new applications, analyses, and interpretations [7]. We ask the following foundational question for foundation models: how would FMs perform on domain generalization, arguably the hardest form of transfer learning, which requires both robustness to distribution shift and unsupervised adaptation to test domains? While the latest work [25] conducts experiments with various backbones in DG, we focus on CLIP (Contrastive Language-Image Pre-Training) [38] and visual classification DG datasets. First, we consider naively adapting CLIP to domain generalization. We fix the CLIP model and focus on how to extract useful information, under the assumption that the representation of CLIP is powerful enough to cover the target domain. A problem in DG is how to design a CLIP prompt for the target domain, because the target domain is unknown during the training phase. To address this problem, we propose Amortized Prompt (AP), which amortizes prompt inference for the target domain. AP generates a domain prompt for the target domain from input images, since we find in our experiments (Section 4.1) that domain information improves the performance of CLIP.

We conduct experiments on four standard datasets included in DG benchmarks to evaluate AP, following the experimental setup in [24], including hyperparameter tuning and model selection. We show that CLIP with AP outperforms the strong baselines by a large margin, raising the accuracy from 71.3% to 79.6%. Moreover, in comparisons with ERM using various backbones, including Big Transfer [28] (BiT-M-R50x3, BiT-M-R101x3, and BiT-M-R152x4), Vision Transformer (ViT-B16 and ViT-L16 [14], Hybrid ViT, DeiT [45]), and MLP-Mixer [44] (Mixer-L16), CLIP with AP outperforms all of these backbones. Lastly, through careful analysis of the datasets, we find that AP stably improves CLIP in domains that are hard to distinguish and for which designing a prompt is hard even for humans.

In summary, our main contributions are:

  1. We are the first to introduce a foundation model to domain generalization and show its strong empirical performance.

  2. We propose a novel domain-prompt inference architecture to apply CLIP to DG effectively.

  3. We conduct experiments on various standard DG datasets, compare with a series of SoTA backbone architectures, and show that our Amortized Prompt outperforms previous approaches by a large margin.

2 Related work

2.1 Domain Generalization

Over the past decade, various studies have proposed different approaches to DG. Most prior works focus on regularizing the model using knowledge from multiple source domains (Figure 2). For example, domain-invariant representation learning [21] is a major branch of domain generalization that aims to reduce domain gaps in the space of latent representations. There are many different approaches to measuring the domain gaps, including adversarial classifiers [32, 20, 21], kernel mapping [6, 23], metric learning [36, 26], and invariant risk minimization [1]. Similarly, several works try to generate samples with diverse styles so that models can learn domain-invariant features from them [41, 53, 8]. Other methods use meta-learning to learn how to regularize the model to improve robustness [15, 31].

Our work investigates the importance of CLIP [38] in DG and proposes a lightweight way to fine-tune CLIP for DG. Several recent observations motivate us to benchmark CLIP in the DG setup. First, [24] shows that many prior approaches do not provide significant improvements over simple supervised learning. The results imply that regularizing the model is not enough to achieve high performance in DG. Second, despite significant work on this front, most studies focus on medium-scale pre-trained models such as ResNet18 or ResNet50, even though very-large-scale models often lead to substantial improvements. Notably, the latest work [25] compares larger backbone networks, including Big Transfer [28] (BiT-M-R50x3, BiT-M-R101x3, and BiT-M-R152x4), Vision Transformer (ViT-B16 and ViT-L16 [14], Hybrid ViT, DeiT [45]), and MLP-Mixer [44] (Mixer-L16), and shows that the selection of the backbone network does indeed matter in DG. In contrast to [25], we show that CLIP performs surprisingly well without fine-tuning the entire model on the source domains, which is time-consuming in practice.

From the methodological perspective, our work relates to several prior works [21, 53, 8] that try to leverage domain features rather than discard them as in domain-invariant learning. However, our work differs in that we propose a CLIP-specific way to leverage the domain features by combining them with prompt tuning. Our work also relates to [25] in that both approaches modulate their predictions given unlabeled data available at test time. Specifically, [25] proposes T3A, which replaces the linear classifier using pseudo-labeling and prototypical classification, and shows that it stably improves performance in unseen domains. However, T3A cannot be directly applied to CLIP, as it assumes a simple linear classifier that CLIP does not employ.

2.2 Prompt Tuning

The success of GPT-3 demonstrated the importance of prompt tuning. There are various prompting strategies, such as discrete natural language prompts and continuous prompts, and many others have appeared [34].

Many works focus on learning from discrete natural language prompts. AutoPrompt [43] elicits knowledge from language models with automatically generated discrete prompts. PADA [5] proposes a domain adaptation algorithm that trains T5 [39], a language foundation model, to generate unique domain-relevant features for each input. The motivation of PADA is similar to that of AP, but PADA uses discrete prompts for NLP applications, whereas AP uses continuous prompts in computer vision.

Lately, many works [33, 29] have directly tuned prompts in continuous vector form, because the primary purpose of prompt tuning is often to extract knowledge from language models rather than to keep the prompts interpretable for humans. In another recent work, P-Tuning v2 [35] shows that continuous prompt tuning achieves the same performance as fine-tuning in various settings.

Due to the successful applications of CLIP, prompt tuning is also of great interest in computer vision. CoOp (Context Optimization) [52] demonstrated that the performance of CLIP is highly sensitive to prompts and that a suitable prompt can improve performance on image recognition tasks. CLIP-Adapter [22] was proposed to learn with an additional adapter network.

In this work, we first introduce CLIP to DG for image recognition. Then, we propose a prompt tuning method that involves predicting the domain labels. Unlike CoOp and CLIP-Adapter, which need class labels to tune prompts, we generate domain prompts from input images as a simple way to amortize prompt inference to an unseen domain.

3 Method

Figure 3: Architecture for (a) Empirical Risk Minimization (ERM) fine-tuning from prior works, (b) naive CLIP without fine-tuning, and (c) CLIP + AP. Gray boxes are networks that stay fixed during learning, while blue boxes are the learned components. Instead of intervening directly on the representations of the backbone vision network as in (a) ERM, our (c) CLIP + AP intervenes through prompt generation in the language representation, which is passed through the backbone language network.

In this section, we first introduce the notation and definitions of domain generalization following [49]. Then, we explain how to use CLIP in DG and introduce Amortized Prompt to effectively enhance the performance of CLIP in DG.

3.1 Problem Setup of Domain Generalization

Let $\mathcal{X}$ denote an input space and $\mathcal{Y}$ an output space. A domain is composed of data that is sampled from a distribution. We denote the dataset from the $i$-th distribution as $S^i = \{(x_j^i, y_j^i)\}_{j=1}^{n_i} \sim P_{XY}^i$, where $x \in \mathcal{X}$ is an input image, $y \in \mathcal{Y}$ denotes the class associated with $x$, and $P_{XY}^i$ denotes the joint distribution of the sample and output label in domain $i$. $X$ and $Y$ denote the corresponding random variables.

In domain generalization, we are interested in the performance of a predictor $f: \mathcal{X} \to \mathcal{Y}$ on data from an unseen domain $S^T$. To achieve this goal, prior works fine-tune a pre-trained image encoder $f_\phi$ (usually ResNet18 or ResNet50) in conjunction with a randomly initialized classification head (linear classifier) $g_\theta$ using data from multiple different datasets. Specifically, given $M$ datasets $S^1, \dots, S^M$ collected from several different domains, the parameters $\phi$ and $\theta$ are updated by

$$\min_{\phi, \theta} \frac{1}{M} \sum_{i=1}^{M} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(g_\theta(f_\phi(x_j^i)), y_j^i\big),$$

where $\ell$ is a loss function. In the simplest case, $\ell$ is a cross-entropy loss, and minimizing this objective is called empirical risk minimization (ERM). As discussed in Section 2.1, different DG methods use other loss functions by designing regularization terms to prevent overfitting to specific domains. These datasets are often called source domains, as distinguished from the target domain, where we want the model to work well.
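The ERM objective described above can be sketched in NumPy as an average of per-domain mean cross-entropy losses. This is only an illustration under stated assumptions: the linear `encoder_w` and `head_w` are hypothetical stand-ins for the pre-trained image encoder and the classification head, the data is random, and equal weighting of the source domains is assumed.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def erm_loss(domains, encoder_w, head_w):
    """Average the per-domain mean loss over all source domains."""
    losses = []
    for x, y in domains:
        features = x @ encoder_w      # stand-in for the image encoder
        logits = features @ head_w    # linear classification head
        losses.append(cross_entropy(logits, y))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
num_classes, in_dim, feat_dim = 3, 8, 4
# Three toy source domains with different sample counts.
domains = [(rng.normal(size=(n, in_dim)), rng.integers(0, num_classes, size=n))
           for n in (5, 7, 6)]
encoder_w = rng.normal(size=(in_dim, feat_dim))
head_w = np.zeros((feat_dim, num_classes))  # zero head gives uniform predictions
loss = erm_loss(domains, encoder_w, head_w)
```

With a zero-initialized head every class receives the same probability, so the loss equals the logarithm of the number of classes, which is a convenient sanity check before training.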

3.2 CLIP for DG

This paper adapts CLIP to domain generalization setups. CLIP consists of two parts: an image encoder $f_{clip}$ and a language model $g_{clip}$. There are two notable differences between our work (using CLIP in DG) and the conventional DG setup. First, the image encoder $f_{clip}$ is pre-trained on massive datasets. We are interested in how robust such massively trained models are to unknown distribution shifts that may not be included in the training dataset. Second, CLIP classifies image features based on their similarity to the embedding of a text prompt $p$, such as ‘dog’ or any other class label, rather than using a classification head trained from scratch. Specifically, given an image $x$ and class prompts $p_1, \dots, p_K$, CLIP outputs a prediction using both $f_{clip}$ and $g_{clip}$:

$$\hat{y}_{clip} = \arg\max_{k \in \{1, \dots, K\}} \big\langle f_{clip}(x), g_{clip}(p_k) \big\rangle,$$

where $K$ is the number of categories and $\langle \cdot, \cdot \rangle$ is cosine similarity.
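The prediction rule described above, choosing the class prompt whose text embedding has the highest cosine similarity with the image embedding, can be illustrated with a minimal NumPy sketch. The hand-written toy vectors below are assumptions standing in for the outputs of the frozen CLIP image encoder and language model; no real CLIP weights are involved.

```python
import numpy as np

def l2_normalize(v):
    """Scale each row to unit length so dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_predict(image_features, text_features):
    """Return the best-matching class index per image and the similarity matrix."""
    sims = l2_normalize(image_features) @ l2_normalize(text_features).T
    return sims.argmax(axis=1), sims

# Toy embeddings: three class prompts and two images in a 4-d shared space.
text_features = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 1.0]])
image_features = np.array([[0.9, 0.1, 0.0, 0.0],
                           [0.0, 0.2, 2.0, 2.0]])
pred, sims = zero_shot_predict(image_features, text_features)
```

Because both sides are normalized, the scale of the second image embedding does not affect the prediction; only its direction matters.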

It is worth noting that we keep the image encoder $f_{clip}$ and the language model $g_{clip}$ fixed throughout the experiments. In other words, we evaluate CLIP in a zero-shot manner, rather than fine-tuning the backbone networks on the source domains as in prior works. Instead, we simply set the text prompts to the class labels used in each dataset. By doing so, we show how powerful the representation of a massively pre-trained model (CLIP) is in the DG setup, without any additional computational cost of re-training such a large model entirely. In addition, we propose a novel way to design the prompt to improve performance on an unseen dataset without fine-tuning the entire model.

3.3 AP: Amortized Prompt for CLIP in DG

As discussed in Section 2.2, designing a prompt is a powerful way to improve the performance of transformer-based models. Not only is it powerful, it should also be easier to train, since the dimension of a prompt is far smaller than the number of parameters of the image encoder $f_{clip}$ and the language model $g_{clip}$. For example, if we had access to a supervised dataset from the target domain, we could simply optimize a prefix vector $p_{pre}$ with a simple supervised loss:

$$\min_{p_{pre}} \mathbb{E}_{(x, y)} \, \ell(\hat{y}^*, y),$$

where $\hat{y}^*$ is

$$\hat{y}^* = \arg\max_{k} \big\langle f_{clip}(x), g_{clip}(p_k^*) \big\rangle,$$

where $p_k^*$ is a concatenation of the trainable parameters $p_{pre}$ and the class prompt $p_k$. Note that $g_{clip}$ outputs a fixed-length vector regardless of the input dimension (i.e., the size of $p_k^*$). The size of $p_{pre}$ is a hyperparameter.

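Unfortunately, such labeled training data is not available for the target domain in DG. We propose AP to replace the per-domain optimization of the prefix $p_{pre}$ with the training of a novel prompt generator $F$ that generates a prompt given a small number of unlabeled images from a distribution. Specifically, we simply use a fully connected network $F$ to generate a prompt $p_{ap}^i$ from input images:

$$p_{ap}^i = \frac{1}{N} \sum_{j=1}^{N} F\big(f_{clip}(x_j^i)\big),$$

where $N$ is the batch size for each domain and $x^i$ denotes the images from the $i$-th distribution. Given a batch of data from multiple source distributions, we optimize $F$ using the following loss function:

$$\min_{F} \frac{1}{N} \sum_{j=1}^{N} \ell(\hat{y}_j^i, y_j^i), \qquad \hat{y}_j^i = \arg\max_{k} \big\langle f_{clip}(x_j^i), g_{clip}(p_k^i) \big\rangle,$$

where $p_k^i$ is a concatenation of the pre-defined class prompt $p_k$ and $p_{ap}^i$. We show the architecture of AP in Figure 3.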
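The forward pass of the prompt generator described above can be sketched as follows. This is a minimal NumPy illustration under several assumptions that are not from the paper: the generator is a single linear layer (`gen_w`, `gen_b`), the frozen language model is replaced by a hypothetical mean-pooling `text_encoder`, the generated prompt is placed before the class tokens, and all dimensions and data are toy values. Training, i.e. backpropagating a classification loss into the generator while both CLIP encoders stay fixed, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3                   # images per domain batch, number of classes
d_img, d_tok = 8, 5           # image-feature and token-embedding widths
prompt_len, class_len = 2, 1  # tokens in the generated prompt and in a class name

# Frozen stand-in for the language model (hypothetical; the real one is a transformer).
text_proj = rng.normal(size=(d_tok, d_img))

def text_encoder(tokens):
    """Map (K, L, d_tok) token sequences of any length L to (K, d_img) features."""
    return tokens.mean(axis=1) @ text_proj

# The only trainable part: a fully connected prompt generator.
gen_w = rng.normal(size=(d_img, prompt_len * d_tok)) * 0.1
gen_b = np.zeros(prompt_len * d_tok)

def amortized_prompt(image_features):
    """Generate one prompt per image, then average over the domain batch."""
    per_image = image_features @ gen_w + gen_b   # (N, prompt_len * d_tok)
    return per_image.mean(axis=0).reshape(prompt_len, d_tok)

def class_prompts(domain_prompt, class_tokens):
    """Prepend the shared domain prompt to every class's token embeddings."""
    shape = (len(class_tokens),) + domain_prompt.shape
    tiled = np.broadcast_to(domain_prompt, shape)
    return np.concatenate([tiled, class_tokens], axis=1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

image_features = rng.normal(size=(N, d_img))   # image-encoder outputs for one domain batch
class_tokens = rng.normal(size=(K, class_len, d_tok))

domain_prompt = amortized_prompt(image_features)
prompts = class_prompts(domain_prompt, class_tokens)   # (K, prompt_len + class_len, d_tok)
text_features = text_encoder(prompts)
logits = normalize(image_features) @ normalize(text_features).T   # (N, K) cosine similarities
pred = logits.argmax(axis=1)
```

Averaging over the batch yields one prompt per domain rather than per image, so at test time the same generator can produce a prompt for an unseen domain from a handful of unlabeled images.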

4 Experiment

Figure 4: Examples from each domain in our experiments. For each dataset, we randomly selected images from two classes in each domain. Note that VOC2007, LabelMe, Caltech101, and SUN09 are the names of the datasets included in VLCS. L100 denotes images collected by the camera at location 100 in the TerraIncognita dataset.

| Algorithm | VLCS | PACS | OfficeHome | Terra | Avg |
|---|---|---|---|---|---|
| ERM (ViT-B16) | 79.2 ± 0.3 | 85.7 ± 0.1 | 78.4 ± 0.3 | 41.8 ± 0.6 | 71.3 |
| CLIP | 76.6 ± 0.0 | 95.8 ± 0.1 | 79.9 ± 0.1 | 36.4 ± 0.1 | 72.2 |
| CLIP + AP | 84.3 ± 0.4 | 97.3 ± 0.2 | 84.2 ± 0.2 | 52.6 ± 0.6 | 79.6 |
| (CLIP + AP) - (CLIP) | +7.7 | +1.5 | +4.3 | +16.2 | +7.4 |

Table 1: Average results of ERM (baseline), CLIP, and CLIP + AP on the DG benchmark. The last row shows the improvement due to AP.

| Backbone Model | VLCS | PACS | OfficeHome | Terra | Avg |
|---|---|---|---|---|---|
| ResNet18 | 73.2 ± 0.9 | 80.3 ± 0.4 | 55.7 ± 0.2 | 40.7 ± 0.3 | 62.5 |
| ResNet50 | 75.5 ± 0.1 | 83.9 ± 0.2 | 64.4 ± 0.2 | 45.4 ± 1.2 | 67.3 |
| BiT-M-R50x3 | 76.7 ± 0.1 | 84.4 ± 1.2 | 69.2 ± 0.6 | 52.5 ± 0.3 | 70.7 |
| BiT-M-R101x3 | 75.0 ± 0.6 | 84.0 ± 0.7 | 67.7 ± 0.5 | 47.8 ± 0.8 | 68.6 |
| BiT-M-R152x2 | 76.7 ± 0.3 | 85.2 ± 0.1 | 71.3 ± 0.6 | 51.4 ± 0.6 | 71.1 |
| Mixer-L16 | 76.4 ± 0.2 | 81.3 ± 1.0 | 69.4 ± 1.6 | 37.1 ± 0.4 | 66.1 |
| ViT-B16 | 79.2 ± 0.3 | 85.7 ± 0.1 | 78.4 ± 0.3 | 41.8 ± 0.6 | 71.3 |
| ViT-L16 | 78.2 ± 0.5 | 84.6 ± 0.5 | 78.0 ± 0.1 | 42.7 ± 1.9 | 70.9 |
| DeiT | 79.3 ± 0.4 | 87.8 ± 0.5 | 76.6 ± 0.3 | 50.0 ± 0.2 | 73.4 |
| HViT | 79.2 ± 0.5 | 89.7 ± 0.4 | 80.0 ± 0.2 | 51.4 ± 0.9 | 75.1 |
| CLIP (ViT-B16) | 76.6 ± 0.0 | 95.8 ± 0.1 | 79.9 ± 0.1 | 36.4 ± 0.1 | 72.2 |
| CLIP (ViT-B16) + AP | 84.4 ± 0.4 | 97.3 ± 0.2 | 84.2 ± 0.2 | 52.4 ± 0.7 | 79.6 |

Table 2: Results of ERM with various backbone networks on the DG benchmark. CLIP with AP outperforms the results of ResNet, BiT, DeiT, HViT, ViT, and Mixer-L16 reported in [25]. The accuracy of CLIP + AP surpasses that of HViT, which ranks second, by a large margin (4.5%). Note that we keep CLIP fixed and train only a simple fully connected network, while the other methods fine-tune their backbones.

In this section, we show that CLIP outperforms the strong baseline ERM with the same backbone architecture. Then, we demonstrate that CLIP with AP provides effective generalization on four standard DG datasets and that it outperforms ERM with various backbones. Finally, we analyze the reason for the surprising results.


As [24] pointed out, Empirical Risk Minimization (ERM) [47] is still a strong baseline when experiments are run under a unified specification. For a fair comparison, we use ViT-B16 as the backbone in both the ERM baseline and CLIP. We also compare with various other backbones, such as Big Transfer [28] in three different configurations (BiT-M-R50x3, BiT-M-R101x3, BiT-M-R152x2), Vision Transformer variants (ViT-B16, ViT-L16, HViT [3], DeiT [45]), and MLP-Mixer [44]. All of the results we compare against are reported in T3A [25].


We choose four real-world datasets from the DomainBed benchmark and show examples in Figure 4. VLCS [18] gathers four photographic datasets {Caltech101 [19], LabelMe [40], SUN09 [11], VOC2007 [16]}, containing 10,729 samples of 5 classes. PACS [30] comprises four domains {art, cartoons, photos, sketches}, with 9,991 samples and 7 classes. OfficeHome [48] includes the domains {art, clipart, product, real}, with 15,588 samples and 65 classes. TerraIncognita [4] includes photos of wild animals taken by cameras at different locations. Following [24], we use the datasets of {Location 100, Location 38, Location 43, Location 46}, with a total of 24,788 samples and 10 classes.

Hyperparameters and model selection.

We set up experiments on DomainBed and implemented AP based on CLIP, with reference to CoOp. In contrast to the other algorithms in DomainBed, we follow CoOp and use stochastic gradient descent [9] with momentum as the optimizer, because we find that Adam [27] does not work well in our setting. We apply the transforms proposed in CLIP to each image instead of the data augmentation used in DomainBed.

Since model selection affects performance significantly, we exactly follow the basic selection criterion of [24]. We select hyperparameters by standard training-domain validation, which uses a subset of each training domain to choose a model [24]. We split the data of each training domain into 80% for training the model and 20% for selecting hyperparameters, and pool the subsets across training domains. Following [24], we conduct a random search of 20 trials over a joint distribution of all hyperparameters. We run three trials of each hyperparameter setting, holding out one domain for testing and using the rest for training. Finally, we select the set of hyperparameters that maximizes validation accuracy over the training domains and report the final accuracy averaged over all three trials.
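The selection protocol described above can be sketched in plain Python. This is a hedged illustration, not DomainBed's implementation: `split_domain`, `sample_hparams`, `select`, and `dummy_trial` are hypothetical names, the hyperparameter ranges are placeholders rather than the ones used in the paper, and `dummy_trial` stands in for actually training the prompt generator.

```python
import random

def split_domain(samples, train_frac=0.8, seed=0):
    """Shuffle one domain's samples and split them into train / validation parts."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def sample_hparams(rng):
    """Draw one configuration; the ranges are placeholders, not the paper's."""
    return {"lr": 10 ** rng.uniform(-4, -2), "batch_size": rng.choice([16, 32, 64])}

def select(domains, test_domain, run_trial, n_configs=20, n_seeds=3):
    """Pick the configuration with the best mean training-domain validation accuracy."""
    train_domains = {d: s for d, s in domains.items() if d != test_domain}
    rng = random.Random(0)
    best = None
    for _ in range(n_configs):
        hparams = sample_hparams(rng)
        results = []
        for seed in range(n_seeds):
            train, val = [], []
            for samples in train_domains.values():
                tr, va = split_domain(samples, seed=seed)
                train += tr   # pool the training splits
                val += va     # pool the validation splits
            results.append(run_trial(hparams, train, val, domains[test_domain], seed))
        val_acc = sum(r[0] for r in results) / n_seeds
        test_acc = sum(r[1] for r in results) / n_seeds
        if best is None or val_acc > best["val_acc"]:
            best = {"hparams": hparams, "val_acc": val_acc, "test_acc": test_acc}
    return best

def dummy_trial(hparams, train, val, test, seed):
    """Placeholder for one training run; returns (validation acc, test acc)."""
    score = 1.0 - abs(hparams["lr"] - 1e-3)
    return score, score - 0.05

domains = {name: list(range(100)) for name in ["art", "cartoon", "photo", "sketch"]}
best = select(domains, "sketch", dummy_trial)
```

The held-out test domain never influences which configuration is chosen; its accuracy is only reported for the configuration that wins on the pooled validation split.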

4.1 Results

| VLCS | Caltech101 | LabelMe | SUN09 | VOC2007 | Avg |
|---|---|---|---|---|---|
| ERM | 97.1 ± 0.5 | 65.4 ± 1.2 | 76.6 ± 0.7 | 77.8 ± 0.5 | 79.2 |
| CLIP | 99.5 ± 0.1 | 61.9 ± 0.5 | 69.8 ± 0.2 | 75.3 ± 0.6 | 76.6 |
| CLIP + Domain label | 99.6 ± 0.0 | 69.6 ± 0.3 | 65.3 ± 0.5 | 80.9 ± 0.4 | 78.8 |
| CLIP + AP | 100.0 ± 0.0 | 69.5 ± 1.1 | 79.9 ± 0.9 | 87.6 ± 0.3 | 84.3 |
| (CLIP + AP) - (CLIP) | +0.5 | +7.6 | +10.1 | **+12.3** | +7.7 |

| PACS | Art | Cartoon | Photo | Sketch | Avg |
|---|---|---|---|---|---|
| ERM | 89.9 ± 0.7 | 86.3 ± 0.4 | 99.2 ± 0.1 | 67.2 ± 0.1 | 85.7 |
| CLIP | 96.9 ± 0.1 | 98.8 ± 0.0 | 99.9 ± 0.0 | 87.8 ± 0.3 | 95.8 |
| CLIP + Domain label | 96.8 ± 0.2 | 99.3 ± 0.1 | 99.9 ± 0.0 | 90.2 ± 0.3 | 96.6 |
| CLIP + AP | 98.1 ± 0.2 | 99.0 ± 0.1 | 99.9 ± 0.1 | 92.3 ± 0.4 | 97.3 |
| (CLIP + AP) - (CLIP) | +1.2 | +0.2 | +0.0 | **+4.5** | +1.5 |

| OfficeHome | Artistic | Clipart | Product | Real World | Avg |
|---|---|---|---|---|---|
| ERM | 76.5 ± 0.6 | 62.7 ± 0.2 | 86.9 ± 0.1 | 87.6 ± 0.3 | 78.4 |
| CLIP | 80.4 ± 0.2 | 64.4 ± 0.2 | 86.3 ± 0.1 | 88.3 ± 0.3 | 79.9 |
| CLIP + Domain label | 82.0 ± 0.3 | 67.6 ± 0.1 | 89.1 ± 0.2 | 89.2 ± 0.4 | 81.9 |
| CLIP + AP | 82.5 ± 0.7 | 71.7 ± 0.4 | 91.2 ± 0.6 | 91.5 ± 0.3 | 84.2 |
| (CLIP + AP) - (CLIP) | +2.1 | **+7.3** | +4.9 | +3.2 | +4.3 |

| Terra | Location 100 | Location 38 | Location 43 | Location 46 | Avg |
|---|---|---|---|---|---|
| ERM | 53.1 ± 1.9 | 26.0 ± 1.0 | 50.3 ± 0.6 | 37.9 ± 0.6 | 41.8 |
| CLIP | 50.1 ± 0.4 | 31.8 ± 0.4 | 35.0 ± 0.2 | 28.9 ± 0.0 | 36.4 |
| CLIP + Domain label | 27.5 ± 0.2 | 26.3 ± 0.3 | 37.4 ± 0.6 | 22.6 ± 0.2 | 28.4 |
| CLIP + AP | 58.2 ± 2.1 | 57.2 ± 1.0 | 50.1 ± 0.2 | 44.9 ± 2.6 | 52.6 |
| (CLIP + AP) - (CLIP) | +8.1 | **+25.4** | +15.1 | +16.0 | +16.2 |

Table 3: Detailed results on VLCS, PACS, OfficeHome, and TerraIncognita. The row (CLIP + AP) - (CLIP) shows consistent improvement over CLIP on all of the datasets. We highlight the most improved domain in each dataset.

Table 1 shows the results of CLIP and CLIP + AP compared with our baseline ERM. The number for each dataset is the average of the results over all domains. We report the mean and standard error over three repetitions with different weight initializations and dataset splits.

Table 2 summarizes the results of ERM with different kinds of backbones. We implement CLIP (ViT-B16) and CLIP + AP; the numbers for the other methods are those reported in [25]. Note that we keep CLIP fixed and train only a fully connected network for CLIP + AP, whereas the others are fine-tuned on the source domains with the classification loss.

Table 3 shows the results of CLIP with AP on each dataset in detail. The prompt of CLIP is simply a class label, such as ‘Dog’. CLIP + Domain label represents CLIP with a domain prompt generated from a domain label with a template. For example, the domain prompt for a dog in the photo domain is ‘A photo of a dog’, where ‘photo’ and ‘dog’ are the domain label and the class label. Our CLIP + AP is trained to generate a continuous prompt from the input image, so CLIP + AP uses the prompt generated in the testing phase.
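The two hand-written prompt styles compared in Table 3 can be illustrated with a small Python sketch. The template string is inferred from the examples quoted in the text (‘A photo of a dog’, ‘A cartoon of a horse’); the function names are hypothetical, and the exact template and punctuation used in the experiments are an assumption.

```python
def class_prompt(class_name):
    """Plain CLIP baseline: the prompt is just the class label."""
    return class_name

def domain_prompt(domain, class_name):
    """CLIP + Domain label: insert the domain label into a fixed template."""
    return f"A {domain} of a {class_name}"

classes = ["dog", "horse"]
plain = [class_prompt(c) for c in classes]
with_domain = [domain_prompt("photo", c) for c in classes]
terra = domain_prompt("location 100", "dog")
```

The last line shows why this template carries little information on TerraIncognita: a camera location index says nothing about what the image looks like, unlike ‘photo’ or ‘cartoon’.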

Figure 5: Samples from each domain per dataset with how much AP improves CLIP.

Does AP improve performance of CLIP and outperform the baseline?

Table 1 shows that naive CLIP can outperform ERM (ViT-B16) without fine-tuning. This surprising result shows that the fixed CLIP is powerful enough to beat many DG methods, since ERM is itself a strong baseline in our experimental setting [24]. Although prompt tuning for CLIP has been proposed in previous works, in domain generalization we cannot access samples from the target domain during training. Therefore, AP learns a prompt generator that can generate a prompt in the testing phase. The results show that CLIP with AP outperforms ERM by a large margin, raising the accuracy from 71.3% to 79.6%. Notably, AP boosts the performance of CLIP from 36.4% to 52.6%, an improvement of 16.2 percentage points, on the TerraIncognita dataset.

To further show how well CLIP with AP performs, we compare it with the results of ERM with other, stronger backbones in Table 2. We observe that CLIP + AP stably surpasses all other backbones and leads HViT, which ranks second, by a large margin (4.5%). Note that CLIP + AP trains only a simple fully connected network, while HViT and the others are fine-tuned. To summarize, we suggest using CLIP + AP as a backbone and classifier because it is effective and lightweight. Moreover, CLIP + AP could be combined with various DG algorithms and could also be fine-tuned for further performance gains.
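The averages and margins quoted above follow from the per-dataset accuracies in Tables 1 and 2, which the short Python check below recomputes. The per-dataset numbers and the HViT average are copied from those tables; the variable names are only for illustration.

```python
# Per-dataset accuracies (VLCS, PACS, OfficeHome, Terra) copied from Table 1.
table1 = {
    "ERM (ViT-B16)": [79.2, 85.7, 78.4, 41.8],
    "CLIP":          [76.6, 95.8, 79.9, 36.4],
    "CLIP + AP":     [84.3, 97.3, 84.2, 52.6],
}
hvit_avg = 75.1  # average accuracy of the second-best backbone in Table 2

# Average over the four datasets, rounded to one decimal as in the tables.
avg = {name: round(sum(v) / len(v), 1) for name, v in table1.items()}

# Per-dataset gain of AP over plain CLIP, and the margin over HViT.
gain_over_clip = [round(a - c, 1)
                  for a, c in zip(table1["CLIP + AP"], table1["CLIP"])]
margin_over_hvit = round(avg["CLIP + AP"] - hvit_avg, 1)
```

The largest per-dataset gain is on TerraIncognita, which is also the dataset where plain CLIP falls furthest behind ERM.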

Does domain information prompt help CLIP to DG?

To answer this question, we analyze each dataset in detail. The results in Table 3 show that CLIP + Domain label outperforms CLIP on VLCS, PACS, and OfficeHome. This suggests that the prompt ‘A cartoon of a horse’ is better than ‘horse’ for CLIP, and that domain information might help models in DG. In TerraIncognita, however, CLIP + Domain label does not work well. We assume the reason is that the domain labels of TerraIncognita, as in ‘A location 100 of a dog.’, are clearly meaningless for classification. This implies that how to design the prompt is a critical problem in DG, because the target domain label is unknown.

On the other hand, Table 3 shows that CLIP + AP performs unexpectedly well on all datasets, including TerraIncognita. To explain this, we take a closer look at the detailed results for each dataset in Table 3 and at the samples in the datasets. We randomly selected examples from two classes for each domain in Figure 4. Intuitively, the samples in PACS and OfficeHome are easier for humans to distinguish than those in VLCS and TerraIncognita. We highlight the accuracy of the domain most improved by AP and show the corresponding samples in Figure 5. Both the quantitative results in Table 3 and the qualitative results in Figure 5 indicate that AP improves the performance of CLIP more on harder domains. These observations support the effectiveness of AP.

5 Conclusion

We propose a novel approach for domain generalization called Amortized Prompt (AP). By amortizing the unknown target domain prompt conditional on input images, CLIP+AP brings substantial improvements over standard ERM fine-tuning baselines and naive CLIP baseline without adaptation. We hope our work can expand and inspire the roles of foundation models in domain generalization, arguably the hardest mode of transfer learning.

5.1 Limitation and Future Work

From the technical perspective, because AP generates a prompt from the input images, it cannot capture domain shift that lies outside the images, such as the label shift present in TerraIncognita. We consider addressing the various kinds of domain shift [2] a promising direction for domain generalization, since they arise in real-world applications.

From the social impact perspective, we should note that many images and text descriptions from web data are used directly to train CLIP. Although CLIP benefits from low-cost data that requires no manual labeling, this inevitably leads to biases and privacy issues being embedded in CLIP and other foundation models [7]. This requires us to pay closer attention to the opportunities and risks of foundation models. In particular, we want to work on interpretability [17], which is essential both in domain generalization and in utilizing foundation models.