VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

by Renrui Zhang, et al.

Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. However, there exists a semantic gap between specific applications and the generally pre-trained knowledge, which makes matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the text feature to adaptively explore informative regions of the image and aggregate the visual feature through a cross-attention mechanism. In this way, the visual-guided text becomes more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate its effectiveness. The code will be released soon.




1 Introduction

Figure 1: Attention maps from the cross-attention module in VT-CLIP, where the text feature is generated from the sentence "a photo of an airliner." During matching, the text feature focuses more on the engine region of the image.
Figure 2: Overview of VT-CLIP, where the text encoder and visual encoder refer to the encoders in vision-language models such as CLIP.

The traditional visual representation learning approach is to train vision models on the image classification task, aided by large-scale high-quality datasets like ImageNet [imagenet] and well-designed architectures [he16] [dos21]. However, existing datasets have two limitations: a lack of generality, and the expense of scaling, since specifying each new visual concept requires additional human-annotated labeled data. Recently, contrastive vision-language pre-training models such as CLIP [rad21] and ALIGN [jia21] have emerged as a promising alternative; they train on large-scale noisy image-text pair datasets. The main approach is to maximize the similarity of a paired image and raw text in the embedding space. Through large-scale pre-training, vision-language models learn open-set visual concepts and can readily be transferred to downstream tasks. In particular, for zero-shot classification, one can synthesize the classification weights by feeding natural language descriptions of the classes of interest to the text encoder, and compare them with image features produced by the image encoder.

CLIP learns an image encoder and a text encoder through pre-training. The image encoder contains two parts: a neural network backbone (ResNet or ViT), and an attention pooling layer or a projection layer. The backbone generates the contextual-level visual feature, which is fed into the attention pooling layer or projection layer to produce the class-level feature. For the image classification task, the final prediction is computed from the class-level image feature and the class-level text feature, which is generated directly by the text encoder.

Previous work on enhancing the transfer ability of vision-language models focuses on prompt tuning, which learns suitable prompts for downstream tasks. Context Optimization (CoOp) [coop] proposes to learn soft prompts represented by continuous context vectors as an alternative to hand-crafted prompts, which brings significant improvements on few-shot classification over both zero-shot CLIP and linear-probe CLIP [rad21]. CoOp shows the great potential of prompt tuning to leverage the rich semantic information learned by large-scale vision-language models. However, the analysis in CoOp shows that sentences recovered from the learned context vectors are mostly not interpretable, which is inconsistent with the motivation of CoOp. CLIP-Adapter applies feature adapters to either the visual or the language branch and further improves performance with a simpler architecture. However, CLIP-Adapter's adapters fine-tune a single text or image branch without considering the interaction between the two branches. Neither CoOp nor CLIP-Adapter leverages the contextual-level visual feature in the image encoder.

In this paper, we introduce a different approach for enhancing vision-language models: visual-guided texts instead of prompt tuning. Different from CoOp, which performs soft prompt optimization, we leverage the contextual-level visual feature with a cross-attention mechanism to guide the text feature to adaptively explore informative regions of the image and aggregate the visual feature. As the scale of pre-trained models has increased in recent years, it has become expensive and inefficient to fine-tune an entire pre-trained model on downstream tasks, which also leads to overfitting when training data is insufficient. The gap between the pre-training objective and the downstream task objective also makes transfer learning on vision-language models challenging. Motivated by the improvements and limitations of CoOp and CLIP-Adapter, we introduce VT-CLIP, which fine-tunes only a small number of parameters rather than all of CLIP's parameters. VT-CLIP adopts a cross-attention module that leverages the contextual-level visual feature to guide the text feature, so that the text feature becomes more semantically correlated with the image. To demonstrate the effectiveness of VT-CLIP, we benchmark on 11 datasets covering a diverse set of visual recognition tasks, including classification of generic objects, scenes, and actions, as well as fine-grained tasks such as recognizing textures and satellite imagery. The results show that VT-CLIP outperforms CoOp and CLIP-Adapter under the same settings.

Our main contributions are summarized as follows:

  • We propose VT-CLIP, which adaptively blends the text feature with the contextual-level visual feature via a cross-attention mechanism to achieve efficient few-shot transfer learning via fine-tuning.

  • Compared with CoOp and CLIP-Adapter, VT-CLIP performs better on the few-shot classification task while having a clearer motivation and a simpler architecture, demonstrating that VT-CLIP is a strong competitor to prompt tuning.

  • We perform extensive analysis of VT-CLIP through experimental results and model visualization to offer a comprehensive picture of how to enhance vision-language models. The code will be released soon.

2 Related works

Vision-Language Models

Recently, vision-language models have shown great potential in learning generic visual representations with natural language supervision, enabling zero-shot transfer to various downstream classification tasks. Early research leverages object detection models to extract image features and explore the interaction between text and image ([LXMERT] [ViLBERT] [gao2019dynamic]). Inspired by the success of pre-trained language models [BERT], UNITER [UNITER], Oscar [Oscar], and SimVLM [SimVLM] use attention architectures to further improve performance on vision-language tasks. The recent breakthroughs in vision-language learning, particularly CLIP [rad21] and ALIGN [jia21], are driven by the noisy large-scale datasets available on the Internet: 400 million image-text pairs for CLIP and 1.8 billion noisy image-text pairs for ALIGN. To fine-tune vision-language models on downstream tasks such as few-shot classification, CoOp [coop] proposes to learn soft prompts represented by continuous context vectors as an alternative to hand-crafted prompts, while CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. While CoOp and CLIP-Adapter achieve significant performance from the perspectives of prompt learning and feature adapters, our VT-CLIP explores the impact of the contextual-level visual feature on the text feature through a cross-attention module.

Prompt Design

Prompt learning aims to better mine knowledge from pre-trained models without fine-tuning the entire model, by generating a prompting template or function that bridges the gap between the pre-training objective and downstream tasks [shin20] [jiang20] [liliang] [zhong21] [lester] [gao20]. Prompt engineering is an important topic in prompt learning. Early research focuses on designing hand-crafted prompts, which generate cloze-style ("fill-in-the-blank") templates and benefit a number of downstream tasks such as sentiment analysis [jiang20]. Recently, [liliang] [zhong21] [lester] introduced gradient-based approaches that optimize continuous vectors through forward and backward propagation in the word embedding space. The limitation of prompt learning is that hand-crafted prompts require specific domain knowledge, while continuous prompt learning methods lack a clear way to visualize the learned prompt content. In this paper, we demonstrate that guiding the text feature with the contextual-level image feature through a cross-attention module is an alternative to prompt learning for large-scale vision-language models, one that is more interpretable and simpler in architecture.

Attention Methods

The attention mechanism was first introduced by [vaswani] in NLP for the machine translation task. Its basic module is the self-attention layer, composed of multi-head scaled dot-product attention and a feed-forward network, which is vital for extracting relationships between words at multiple levels. A main advantage of attention methods is their non-local computation and global memory, which makes them better suited than RNNs for sequential data [huang2017instance] [huang2015bidirectional]. Recently, cross-attention mechanisms have been designed to fuse inputs from different feature modalities in the embedding space and have been widely applied to various visual tasks, including object detection [detr] [ji2020casnet], image captioning [neuralimage] [lee2018stacked], and image classification [hou19] [chen2021crossvit] [liu2021multiscale], demonstrating the potential of cross-attention. SMCA [gao2021fast] proposed a modulated attention mechanism that achieves fast convergence on object detection. In this paper, we leverage cross-attention to let the contextual-level visual feature adaptively guide the text feature, making it more semantically correlated with the image content.

3 Method

In this section, we introduce the architecture of VT-CLIP. In Section 3.1, we revisit the zero-shot classification of CLIP. In Section 3.2, we explain the detailed architecture of VT-CLIP. We choose CLIP with a ResNet-50 visual backbone as the default setting for convenience.

3.1 Zero-shot CLIP

Let us first revisit the zero-shot classification of CLIP. CLIP's image encoder consists of a ResNet backbone and an attention pooling layer, while CLIP's text encoder is a Transformer encoder. Denote the input image as $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are the height and width of the image. The dataset consists of $K$ categories, denoted $\{c_1, c_2, \dots, c_K\}$. The prompt template proposed by [rad21] (e.g., "a photo of a {CLASS}.") is denoted $\pi$. The class-level text features and the contextual-level visual feature are generated as follows:

$$f_v = \mathrm{ResNet}(I), \qquad w_i = \mathrm{TextEncoder}(\pi(c_i)), \quad i = 1, \dots, K,$$

where $f_v$ is the contextual-level feature from the ResNet and $c_i$ is the $i$-th category. By feeding $f_v$ to the attention pooling layer, CLIP generates a class-level image feature $f \in \mathbb{R}^d$, which is used to compute similarities with the text features $\{w_i\}_{i=1}^K$ generated from the $K$ sentences describing the categories. Here $d$ is the dimension of the class-level feature, which is 1024 under the current setting. The similarity score for each category can be written as:


$$s_i = \frac{\exp\big(\mathrm{sim}(f, w_i)/\tau\big)}{\sum_{j=1}^{K} \exp\big(\mathrm{sim}(f, w_j)/\tau\big)},$$

where $s_i$ is the similarity score for the $i$-th category, $w_i$ is the text feature of the sentence describing the $i$-th category, $\tau$ is a temperature parameter learned by CLIP, and $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity. The final prediction for zero-shot classification is the category with the maximum similarity score.
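As a concrete sketch (not the authors' released code), the zero-shot scoring step above takes only a few lines of NumPy. The random vectors below are stand-ins for real CLIP embeddings, and `temperature` plays the role of the learned temperature $\tau$:

```python
import numpy as np

def zero_shot_scores(image_feat, text_feats, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities between one
    class-level image feature and K class-level text features."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = txt @ img / temperature      # (K,) scaled cosine similarities
    logits -= logits.max()                # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# Toy example: d = 1024 as in the ResNet-50 setting, K = 5 classes.
rng = np.random.default_rng(0)
f = rng.normal(size=1024)          # stand-in for the class-level image feature
W = rng.normal(size=(5, 1024))     # stand-ins for the K text features w_i
s = zero_shot_scores(f, W)
pred = int(np.argmax(s))           # predicted category index
```

The same scoring routine is reused at VT-CLIP inference time, with the visual-guided text features taking the place of the $w_i$.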


3.2 VT-CLIP

We present a different approach to enhancing vision-language models instead of prompt learning: we leverage the contextual-level visual feature to guide the text feature to adaptively explore informative regions of the image. The cross-attention module follows the standard Transformer architecture introduced in [vaswani]; it consists of a self-attention layer, a co-attention layer, and a feed-forward network. In VT-CLIP, the text feature $w_i$ is fed into the cross-attention module as the input query, while the contextual-level visual feature $f_v$ serves as the input key and value. The process can be formulated as:


$$\hat{w}_i = \mathrm{MHCA}(w_i, f_v, f_v),$$

where $\mathrm{MHCA}(\cdot)$ is the multi-head cross-attention module, $w_i$ is the class-level text feature, and $f_v$ is the contextual-level visual feature. Through training, the visual-guided class-level text feature $\hat{w}_i$ becomes more semantically correlated with the image. After obtaining the new text features, we make the final prediction of the category probabilities as follows:


$$p_i = \frac{\exp\big(\mathrm{sim}(f, \hat{w}_i)/\tau\big)}{\sum_{j=1}^{K} \exp\big(\mathrm{sim}(f, \hat{w}_j)/\tau\big)},$$

where $f$ is the class-level image feature and $\tau$ is the temperature. To optimize the weights in the cross-attention module, the objective function for few-shot training is a cross-entropy loss defined as:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{K} y_i \log p_i,$$

where $y_i = 1$ if $i$ equals the ground-truth category label and $y_i = 0$ otherwise, $p_i$ is the predicted probability, and $\theta$ is the collection of all learnable parameters in the cross-attention module.
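To make the module concrete, here is a minimal single-head NumPy sketch of the cross-attention update and the cross-entropy objective. The shapes, the 0.1 weight scale, and the random features are illustrative stand-ins, not the paper's actual implementation (which uses a multi-head module with a feed-forward network):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: text queries attend over the
    contextual-level visual tokens (keys and values)."""
    Q = queries @ Wq                       # (K, d) projected text queries
    Km = keys_values @ Wk                  # (N, d) projected visual keys
    V = keys_values @ Wv                   # (N, d) projected visual values
    attn = softmax(Q @ Km.T / np.sqrt(Q.shape[-1]))  # (K, N), rows sum to 1
    return attn @ V, attn                  # visual-guided text features

def cross_entropy(probs, label):
    return -np.log(probs[label])

rng = np.random.default_rng(0)
d, K, N = 64, 5, 49                        # toy dim, classes, spatial tokens
text = rng.normal(size=(K, d))             # stand-in class-level text features
visual = rng.normal(size=(N, d))           # stand-in contextual visual tokens
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))  # learnable
guided, attn = cross_attention(text, visual, Wq, Wk, Wv)

# Prediction and few-shot loss, as in the equations above.
img = rng.normal(size=d)                   # stand-in class-level image feature
guided_n = guided / np.linalg.norm(guided, axis=1, keepdims=True)
probs = softmax(guided_n @ (img / np.linalg.norm(img)) / 0.01)
loss = cross_entropy(probs, label=2)       # label 2 as the toy ground truth
```

In training, only `Wq`, `Wk`, and `Wv` (and the rest of the module's parameters) would receive gradients; the CLIP encoders stay frozen.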

Method | ImageNet | DTD
VT-CLIP + prompt engineering | 63.54 | 65.72
VT-CLIP + prompt ensemble | 63.88 | 65.48
Table 1: Prompt Ensemble – Comparison between the hand-crafted prompts ("a photo of a {CLASS}." for ImageNet and "{CLASS} texture." for DTD) and the prompt ensemble on both datasets in VT-CLIP.
Heads 4 8 16 32
ImageNet(%) 63.85 63.88 63.78 63.35
DTD(%) 64.42 65.72 64.78 65.43
Table 2: Number of Heads – Experimental results of VT-CLIP with different numbers of heads in the cross-attention module.
Figure 3: Experiment Results – Main results of few-shot learning on 11 datasets. VT-CLIP shows overall better performance than previous baselines across different training shots.
Figure 4: Visualization of the prediction logits and attention map. The upper tables show the prediction logits from zero-shot CLIP, while the lower tables show the prediction logits from VT-CLIP.

4 Results

4.1 Few-Shot Learning

4.1.1 Training Settings

For fair comparison with CLIP [rad21], CLIP-Adapter [clip], and CoOp [coop], we select the 11 datasets used in CLIP, including ImageNet [imagenet], Caltech101 [feifei04], OxfordPets [parkhi], StanfordCars [krause], Flowers102 [nizi], Food101 [bossard], FGVCAircraft [maji], SUN397 [xiao10], DTD [cimpoi], EuroSAT [helber], and UCF101 [soomro]. We follow the few-shot evaluation protocol adopted in CLIP, using 1, 2, 4, 8, and 16 shots for training and testing the model on the full test set. We conduct all experiments on a single Nvidia RTX TITAN GPU.

We follow the same training hyperparameters as CoOp: a batch size of 32 and the same learning rate for all datasets. We use CLIP with ResNet-50 [he16] as the visual backbone (visual encoder) provided by OpenAI. The number of layers in the cross-attention module is set to 1 and the number of heads to 8. Instead of the learnable continuous prompts of CoOp, we adopt the same hand-crafted prompts as CLIP. For specific-category datasets like DTD, we adopt the prompt template "{CLASS} texture.", which focuses on describing the texture materials in the DTD dataset. For generic-category datasets like ImageNet, we adopt a prompt ensemble in which 7 prompt templates, "itap of a {CLASS}.", "a bad photo of the {CLASS}.", "a origami {CLASS}.", "a photo of the large {CLASS}.", "a {CLASS} in a video game.", "art of the {CLASS}.", and "a photo of the small {CLASS}.", are averaged to generate the ensemble.

4.1.2 Baseline Model

We compare VT-CLIP with three baselines: zero-shot CLIP [rad21], CoOp [zhou21], and CLIP-Adapter [clip]. For fair comparison, we adopt the same hand-crafted prompts as zero-shot CLIP [rad21] and CLIP-Adapter [clip]. For comparison with CoOp [zhou21], we choose the learnable context length and class-token position with the best performance reported in the paper, i.e., a context length of 16 with the class token placed at the end of the prompt template. For comparison with CLIP-Adapter, we choose the variant with only the visual adapter, which is the best of its three variants.

4.1.3 Performance Comparison & Analysis

The main results are presented in Fig. 3. The average accuracy over the 11 datasets is shown in the top-left corner; VT-CLIP performs better than the other three baselines.

Compared with zero-shot CLIP [rad21], our VT-CLIP achieves significant improvements on all 11 datasets. Large performance boosts, ranging from 20% to 50%, are achieved on DTD and EuroSAT. The improvements on generic-category datasets like ImageNet and Caltech101 are smaller, and on OxfordPets and Food101 zero-shot CLIP already achieves decent performance.

Compared with CoOp [zhou21], our VT-CLIP performs better for numbers of shots ranging from 1 to 16. Although CoOp achieves considerable improvement over zero-shot CLIP, its performance is still worse than VT-CLIP's. Note that CoOp obtains its results from the perspective of prompt learning; this demonstrates that prompt-based methods are a competitive approach for enhancing vision-language models, but their potential is less promising than fine-tuning an additional module over the frozen pre-trained model.

Compared with CLIP-Adapter [clip], VT-CLIP performs better in average accuracy over all datasets, and under the 16-shot setting VT-CLIP achieves higher accuracy than CLIP-Adapter except for a slight degradation on DTD and EuroSAT. Although CLIP-Adapter is a strong competitor that trains an adapter module over the pre-trained model, similar to our method in the "pre-train then fine-tune" paradigm, our VT-CLIP achieves higher performance. The key difference is that VT-CLIP leverages the contextual-level visual feature to guide the text feature, which enables the text feature to become more semantically correlated with the downstream task. The experimental results demonstrate that VT-CLIP's interaction between the visual and text branches of CLIP is a more promising direction for enhancing vision-language models.

4.2 Visualization

We use heat maps to visualize the attention maps of the cross-attention module after training VT-CLIP, on two images of the Airliner class from ImageNet, to show the characteristics learned by VT-CLIP. The result is displayed in Fig. 4. In the attention map, the redder a region is, the more attention is paid to it. We can observe that the attention of VT-CLIP focuses on the engines of the airliner, which are a vital characteristic of the class. We also display the predicted logits of VT-CLIP and zero-shot CLIP for the same examples. The logits from VT-CLIP are more concentrated on the ground-truth category than those of zero-shot CLIP, while the scores for the other categories are lower. From this comparison, we claim that VT-CLIP is more effective at recognizing the ground-truth category among similar classes. In summary, the visualization results show that the visual-guided text feature pays more attention to the informative regions of the image and enhances the vision-language model under the few-shot setting.

4.3 Ablation Study

In this section, we perform several ablation studies for VT-CLIP. We compare the performance of the prompt ensemble and hand-crafted prompts on the generic-category dataset ImageNet and the specific-category dataset DTD, and we vary the number of heads in the cross-attention module.

4.3.1 Prompt Ensemble

We first conduct an ablation study on the prompt ensemble by comparing 16-shot classification performance on ImageNet with and without the prompt ensemble. The results are shown in Tab. 1. For datasets with generic categories, the prompt ensemble outperforms a single hand-crafted prompt. The performance gap arises from the difficulty of writing a single precise prompt that describes the generic categories in a dataset. However, for datasets with a specific range of categories, a well-designed prompt is competitive with a prompt ensemble.

4.3.2 Cross Attention Module Architecture

For the ablation study of model architecture, we vary the number of heads in the cross-attention module of VT-CLIP and conduct experiments on ImageNet and DTD. The results are shown in Tab. 2. Under few-shot settings, performance drops as the complexity of the model architecture increases, mainly because of the lack of training data. Under the 16-shot setting, ImageNet, with 1000 categories, has 16,000 training samples, while DTD, with 47 categories, has only 752 training samples. The small scale of the training set causes more complex models with more heads to overfit the insufficient training data.

5 Conclusions and Future Work

We present VT-CLIP, a differentiable approach to few-shot classification. VT-CLIP fine-tunes a cross-attention module instead of the entire pre-trained vision-language model. We leverage the contextual visual feature to guide the text feature to highlight the important regions of the image through the cross-attention module. We claim that the contextual visual feature, and the interaction between the image and text branches of vision-language models, have great potential for enhancing their ability on various downstream tasks. According to the experimental results, VT-CLIP outperforms all competitive baselines in few-shot settings on 11 carefully selected datasets that represent generic real-world objects. Ablation studies confirm our design choices and give a view of the broader performance of VT-CLIP. In the future, we hope to combine VT-CLIP with prompt-learning-based approaches to push the boundary of fine-tuning vision-language models further. We will also explore the potential of VT-CLIP on other vision and textual tasks.