
Revisiting Few-Shot Learning from a Causal Perspective

Few-shot learning with the N-way K-shot scheme is an open challenge in machine learning. Many approaches have been proposed to tackle this problem, e.g., Matching Networks and CLIP-Adapter. Although these approaches have shown significant progress, the mechanism behind their success has not been well explored. In this paper, we interpret these few-shot learning methods via the causal mechanism. We show that the existing approaches can be viewed as specific forms of front-door adjustment, which removes the effects of confounders. Based on this, we introduce a general causal method for few-shot learning, which considers not only the relationship between examples but also the diversity of representations. Experimental results demonstrate the superiority of our proposed method in few-shot classification on various benchmark datasets. Code is available in the supplementary material.

Introduction

When labeled data is scarce, few-shot learning (FSL) Wang et al. (2020) has emerged as an appealing approach and has been widely applied to various machine learning tasks Fan et al. (2021); Liu et al. (2022). In this paper, we focus on N-way K-shot image classification Vinyals et al. (2016); Snell et al. (2017), where there are N classes and each class has only K labeled training examples.

Many algorithms have been proposed for FSL. For example, metric-based methods Li et al. (2019) make predictions based on distance computations between the test example and training examples in a learned metric space. Matching Networks Vinyals et al. (2016) proposed a weighted nearest-neighbour classifier over the training data. Prototypical Networks Snell et al. (2017) perform classification by computing distances between the test example and class prototypes of the training examples. Recently, large-scale pre-trained vision-language models have also shown surprising results for few-shot learning. CLIP-Adapter Gao et al. (2021) builds on the pre-trained model CLIP Radford et al. (2021), which is used as a fixed feature extractor, and only fine-tunes a light-weight additional feature adapter for few-shot tasks. Tip-Adapter Zhang et al. (2021) further improves CLIP-Adapter by acquiring the weights of the feature adapter from a cache model constructed from the training data. Despite the success of these methods, our theoretical understanding of this success remains limited.

Figure 1: (a) An unobserved confounder such as the background can mislead the trained model into learning spurious correlations: after training, the model tends to classify test examples with grass in the background as dogs, which is incorrect for examples of birds with grass in the background. (b) Our proposed causal graph for few-shot learning methods, which aim to remove the effects of the confounder. U: unobserved confounder, X: example, Z: representation of the example, Y: label.

In this paper, we introduce the causal mechanism Glymour et al. (2016) to interpret these FSL methods, which provides a unified explanation. In few-shot learning tasks, one of the main challenges is the unobserved confounder Glymour et al. (2016), which causes spurious correlations between examples and labels. This problem is especially serious in the few-shot setting, since the small number of training examples makes data bias more likely. As illustrated in Figure 1 (a), an unobserved confounder, such as the background “grass”/“sky”, can mislead the trained model into learning spurious correlations, which may lead to the misclassification of examples from a new distribution, e.g., a bird with grass in the background. To block the paths from the confounder, back-door adjustment Glymour et al. (2016) and front-door adjustment Glymour et al. (2016) are two popular approaches. In this paper, we revisit the above few-shot learning methods and formalize a causal graph as illustrated in Figure 1 (b). In our proposed causal graph, we interpret the representation Z of an example as the intermediate variable between the example X and the label Y. The unobserved confounder U affects the distributions of X and Y. We find that existing mainstream FSL methods like Matching Networks and Prototypical Networks fit well into the intervened causal framework as special cases of the front-door criterion Glymour et al. (2016) (please refer to the section Causal Interpretation for more details).

Based on our causal interpretation, we further propose a general causal method for FSL. We find that previous metric-based methods mainly focus on how examples interact, while little attention has been paid to the diversity of the representations of examples. Therefore, we propose a general causal few-shot method that considers not only the interaction between examples but also the diversity of representations. In this paper, we propose to use diverse representations from CLIP Radford et al. (2021) and BLIP Li et al. (2022). The final prediction is a linear combination of intermediate predictions. Experimental results show that our proposed method achieves significant improvements on various benchmark datasets. Our method improves 1-shot accuracy from 60.93% to 64.21% and 16-shot accuracy from 65.34% to 72.91% on ImageNet.

The contribution of our work is three-fold:

  • We formalize FSL within a causal framework. We find that existing mainstream FSL methods can be well explained within this framework.

  • We propose a general causal method to deal with FSL problems, which considers both the relationship between examples and the diversity of representations. We also give our method a causal interpretation.

  • We evaluate our method on 10 benchmark few-shot image classification datasets and conduct ablation studies to explore its characteristics. Experimental results demonstrate the compelling performance of our method across the board.

Related Work

Metric-Based Few-Shot Learning

Metric-based FSL methods conduct few-shot learning by comparing examples in a learned metric space. Matching Networks Vinyals et al. (2016) proposed to compare training and test examples in a learned metric space. Prototypical Networks Snell et al. (2017) assign the test example to the nearest class prototype. CLIP-Adapter Gao et al. (2021) adopts a simple residual layer as a feature adapter over features extracted by the pre-trained vision-language model CLIP Radford et al. (2021). Tip-Adapter Zhang et al. (2021) improves CLIP-Adapter by effectively initializing the weights of the feature adapter.

Nevertheless, the success of these methods still lacks a solid theoretical explanation. In this paper, we interpret the success of metric-based methods from the perspective of causal mechanism. Along with the interpretation, we propose a general causal method to deal with few-shot tasks.

Figure 2: Illustration of front-door adjustment. U: unobserved confounder, X: example, Z: representation of the example, Y: label.
Figure 3: The pipeline of our method and its corresponding causal interpretation in Eq. (3). For a given test example, we first obtain its representations from diverse visual encoders (CLIP and BLIP). We then perform intermediate predictions via different Attention Blocks. The final prediction is a linear combination of the intermediate predictions. Different parts of the pipeline and their corresponding causal interpretations are separated by dotted lines. Note that during training we only update the cached representations as learnable parameters.

Causal Inference

Causal inference was recently introduced to machine learning and applied to various tasks, such as image captioning Yang et al. (2021), visual grounding Huang et al. (2021), semantic segmentation Zhang et al. (2020), representation learning Mitrovic et al. (2021), and few-shot learning Yue et al. (2020). A general approach in machine learning is to maximize the observational conditional probability P(Y | X) given example X and label Y. However, the existence of a confounder Glymour et al. (2016) will mislead the trained model into learning spurious correlations between the cause X and the effect Y. Causal inference proposes to deconfound Glymour et al. (2016) the training by using a new objective P(Y | do(X)) instead of P(Y | X), which aims to find what truly causes Y. The do-calculus denotes the pursuit of the causal effect of X on Y by intervening on X. One useful deconfounding causal tool is front-door adjustment Glymour et al. (2016).

Front-Door Adjustment.

Front-door adjustment is a deconfounding technique in the causal mechanism. Consider the causal graph in Figure 2, where X is the example, Z is the representation of X, and Y is the label. X → Z → Y represents that X causes Z and Z causes Y. U is an unobserved confounder that influences both X and Y. The confounder U can cause spurious correlations when the example is irrelevant to the label.

For X = x, Z = z, and Y = y, considering the partial effect of X on Z, we have P(z | do(x)) = P(z | x), since there is no backdoor path from X to Z. On the other side, considering the partial effect of Z on Y, we have P(y | do(z)) = Σ_x' P(y | x', z) P(x'). This is because the backdoor path from Z to Y, namely Z ← X ← U → Y, can be blocked by conditioning on X. Finally, by chaining together the two partial effects P(z | do(x)) and P(y | do(z)), we have the following front-door adjustment formula:

P(y | do(x)) = Σ_z P(z | x) Σ_x' P(y | x', z) P(x')        (1)

where P(y | do(x)) is the true causal effect of X on Y.
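To make the adjustment concrete, here is a minimal numerical sketch (not from the paper) that evaluates Eq. (1) on a toy discrete distribution consistent with the causal graph of Figure 2; all probability tables below are illustrative.

```python
import numpy as np

# Toy structural model matching Figure 2: U -> X, X -> Z, (Z, U) -> Y.
# All probability tables below are illustrative, not from the paper.
p_u = np.array([0.7, 0.3])                            # P(U)
p_x_given_u = np.array([[0.9, 0.1],                   # P(X | U=0)
                        [0.2, 0.8]])                  # P(X | U=1)
p_z_given_x = np.array([[0.8, 0.2],                   # P(Z | X=0)
                        [0.3, 0.7]])                  # P(Z | X=1)
p_y_given_zu = np.array([[[0.9, 0.1], [0.4, 0.6]],    # P(Y | Z=0, U=0) and P(Y | Z=0, U=1)
                         [[0.5, 0.5], [0.1, 0.9]]])   # P(Y | Z=1, U=0) and P(Y | Z=1, U=1)

# Observational joint P(x, z, y) = sum_u P(u) P(x|u) P(z|x) P(y|z,u)
joint = np.einsum('u,ux,xz,zuy->xzy', p_u, p_x_given_u, p_z_given_x, p_y_given_zu)

p_x = joint.sum(axis=(1, 2))                              # P(x)
p_z_given_x_obs = joint.sum(axis=2) / p_x[:, None]        # P(z | x)
p_y_given_xz = joint / joint.sum(axis=2, keepdims=True)   # P(y | x, z)

def p_y_do_x(x, y):
    """Front-door adjustment, Eq. (1): P(y|do(x)) = sum_z P(z|x) sum_x' P(y|x',z) P(x')."""
    return sum(p_z_given_x_obs[x, z] *
               sum(p_y_given_xz[xp, z, y] * p_x[xp] for xp in range(2))
               for z in range(2))

print("P(Y=1 | do(X=1)) =", p_y_do_x(1, 1))                  # causal effect
print("P(Y=1 | X=1)     =", joint[1, :, 1].sum() / p_x[1])   # confounded observational estimate
```

The gap between the two printed values illustrates how the confounder U biases the plain conditional probability away from the causal effect.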

Pre-trained Vision-Language Models

Pre-trained vision-language models aim to improve the performance of downstream vision and language tasks by pre-training on large amounts of image-text pairs. Pre-trained vision-language models like ALBEF Li et al. (2021), BLIP Li et al. (2022), and CLIP Radford et al. (2021) have shown remarkable results on a wide range of downstream tasks even without fine-tuning. They learn the representations of images and texts via contrastive representation learning Gao et al. (2021), where an image and a text from the same pair are expected to have similar representations.

Causal Interpretation

We consider the traditional N-way K-shot classification setting in FSL, where we are asked to perform classification over N classes while we only have access to K training examples from each class.
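For readers unfamiliar with the protocol, the following is a minimal sketch of how an N-way K-shot episode (a K-shot training/support set plus a query/test set) is typically sampled; the dataset format and function name are illustrative assumptions, not the paper's data pipeline.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample an N-way K-shot episode from `dataset`, a list of (image, label) pairs.

    Returns a support (training) set with K examples per class and a query (test)
    set over the same N classes. The data format is a placeholder.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    classes = rng.sample(sorted(by_class), n_way)               # pick N classes
    support, query = [], []
    for label in classes:
        images = rng.sample(by_class[label], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]    # K labeled shots
        query += [(img, label) for img in images[k_shot:]]      # held-out queries
    return support, query
```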

In this paper, we formalize FSL methods using the causal graph illustrated in Figure 1 (b). Given an example X, a common procedure in machine learning is to first generate its representation Z. Then we use the representation Z to make the prediction Y, i.e., the label of image X. U is the unobserved confounder that affects the distributions of example X and label Y. We observe that variable Z satisfies the front-door criterion Glymour et al. (2016) between variable X and variable Y. For each example x, we wish to identify its causal effect on label y. We adopt the front-door adjustment Glymour et al. (2016) to derive the intervened FSL framework:

P(Y = y | do(X = x)) = Σ_z P(z | x) Σ_x' P(y | x', z) P(x')        (2)

where the probability P(y | x', z) can be interpreted as the probability predicted by exploring the relationship between the representation of x' and the representation z.

In the following, we will show that the existing few-shot learning methods can be viewed as specific forms of front-door adjustment by setting different values for Z. Letting z range over a finite set of specific representation values, we have the following approximation for P(y | do(x)):

P(y | do(x)) ≈ Σ_z P(z | x) Σ_x' P(y | x', z) P(x')        (3)

Interpretation for Matching Networks

In Matching Networks (MN) Vinyals et al. (2016), a distance computation is performed for each pair of training example and test example. Let the labeled training set be D = {(x_i, y_i)}; given a test example x, MN computes a probability over labels y as follows:

P(y | x, D) = Σ_{(x_i, y_i) ∈ D} a(x, x_i) y_i        (4)

where a(·, ·) is an attention mechanism that computes the similarity between two examples. This can be interpreted as a special case of Eq. (3) by letting the representation of the test example be unique: the cardinality of the representation set is then exactly one and P(z | x) = 1. Choosing the remaining terms of Eq. (3) such that the attention-weighted sum is a valid conditional probability, we have:

(5)

where D is the training set and P(z | x) = 1 since z is the representation of x.

According to Eq. (5), we can observe that MN is a method that aims to remove the effects of the confounder by considering all examples in the training set. For example, in Figure 1, if the training set contains more examples, e.g., a cat with grass in the background, it becomes possible to remove the effect of the confounder, since grass is no longer a distinguishing feature.
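To make this concrete, here is a minimal sketch (not the authors' code) of the attention-weighted label aggregation in Eq. (4), using cosine similarity followed by a softmax as the attention mechanism a(·, ·); the function name and the choice of similarity are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def matching_networks_predict(query_feat, support_feats, support_labels, n_way, tau=1.0):
    """Weighted nearest-neighbour prediction in the spirit of Eq. (4).

    query_feat:     (d,)     embedding of the test example
    support_feats:  (N*K, d) embeddings of the training (support) examples
    support_labels: (N*K,)   integer class labels
    The softmax over cosine similarities plays the role of the attention a(x, x_i)
    and of the normaliser that keeps the weighted sum a valid distribution.
    """
    q = F.normalize(query_feat, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    attn = F.softmax(s @ q / tau, dim=0)                 # a(x, x_i), sums to one
    one_hot = F.one_hot(support_labels, n_way).float()   # y_i as one-hot vectors
    return attn @ one_hot                                # P(y | x) = sum_i a(x, x_i) y_i
```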

Interpretation for Prototypical Networks

In Prototypical Networks (PN) Snell et al. (2017), the representation of the test example is also unique, and the distance computation is performed between the test example and the class prototypes of the training examples. Let the labeled training set be D and the set of training examples of class k be D_k. PN first computes the class prototype c_k as follows:

c_k = (1 / |D_k|) Σ_{(x_i, y_i) ∈ D_k} z_i        (6)

where z_i is the representation of x_i. Given a test example x, PN computes the probability as follows:

P(y | x) = Σ_k a(x, c_k) y_k        (7)

where a(·, ·) is an attention mechanism that computes the similarity and y_k is the label shared by the examples in D_k. This can be interpreted as a special case of Eq. (3):

(8)

where, analogously to Matching Networks, the representation of the test example is unique and the class prototypes of the training examples play the role of the training examples x' in Eq. (3).
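As a minimal sketch (not the authors' code) of Eqs. (6)-(7), the following computes class prototypes as mean embeddings and scores the query with a softmax over negative squared Euclidean distances; the distance-based softmax is the standard Prototypical Networks choice and is an assumption with respect to this paper's exact attention.

```python
import torch
import torch.nn.functional as F

def prototypical_predict(query_feat, support_feats, support_labels, n_way):
    """Prototype-based prediction in the spirit of Eqs. (6)-(7).

    The class prototype is the mean embedding of each class's support examples;
    the query is scored with a softmax over negative squared Euclidean distances.
    """
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0)    # Eq. (6): class prototype c_k
        for c in range(n_way)
    ])
    dists = torch.cdist(query_feat[None], prototypes)[0]  # Euclidean distance to each prototype
    return F.softmax(-dists.pow(2), dim=0)                # Eq. (7): P(y = k | x)
```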

Interpretation for CLIP/Tip-Adapter

In CLIP-Adapter Gao et al. (2021), the representation of an example is generated by a pre-trained CLIP visual encoder. CLIP-Adapter freezes the CLIP backbone and explores the relationship between the visual representation and the textual representations of the class names. The logits of CLIP-Adapter (Eq. (2) in Zhang et al. (2021)) can be rephrased as follows:

(9)

where the adapter produces an updated representation of the example. This can be interpreted as a special case of Eq. (3):

(10)

where, as in Eq. (3), the representation of the test example is unique and the textual representations of the class names play the role of the training examples x'.

In Tip-Adapter Zhang et al. (2021), the representation of an example is also generated by the pre-trained CLIP visual encoder. Classification is then performed by summing the predictions of two classifiers given the generated representation. The weights of one classifier are derived from the visual representations of the training set, while the weights of the other classifier are derived from the textual representations of the class names. The logits of Tip-Adapter (Eq. (6) in Zhang et al. (2021)) can be rephrased as follows:

(11)

where the logits involve a hyper-parameter, a distance function between representations, the one-hot label matrix of the training set, and the textual representations of the class names. This can be interpreted as a special case of Eq. (3):

(12)

Here, as in CLIP-Adapter, Tip-Adapter uses the “class centroid” in the textual domain provided by the textual encoder given the class name.
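For reference, the following is a minimal sketch of a Tip-Adapter-style cache classifier in the spirit of Eq. (11); the function name, the exp(-beta * (1 - cosine)) transform, and the default hyper-parameter values follow the public Tip-Adapter description and are assumptions with respect to this paper's exact notation.

```python
import torch

def tip_adapter_logits(f_test, cache_keys, cache_values, text_weights, alpha=1.0, beta=5.5):
    """Cache-model logits in the spirit of Eq. (11).

    f_test:       (d,)     L2-normalised CLIP feature of the test image
    cache_keys:   (N*K, d) L2-normalised CLIP features of the training images
    cache_values: (N*K, C) one-hot labels of the training images
    text_weights: (C, d)   L2-normalised CLIP text embeddings of the class names
    alpha, beta:  residual-ratio / sharpness hyper-parameters (values illustrative)
    """
    affinity = cache_keys @ f_test                                      # cosine similarities
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values   # few-shot knowledge
    clip_logits = text_weights @ f_test                                 # zero-shot (prior) knowledge
    return clip_logits + alpha * cache_logits
```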

The aforementioned methods mainly focus on the relationship between examples, i.e., the term P(y | x', z) in Eq. (3), but pay little attention to the diversity of representations, i.e., the term P(z | x). Though recent work Dvornik et al. (2019) has made some effort to improve the diversity of representations, the improvement is limited: 1) they need to fine-tune the whole model, which can lead to over-fitting on some datasets, and training is time-consuming for large-scale vision-language models Gao et al. (2021); Zhang et al. (2021); Sung et al. (2022); 2) they use the same backbone for representation learning, whereas we use different backbones, which greatly improves the diversity of representations.

Method

Here we propose a general causal method for FSL motivated by Eq. (3). As indicated by Eq. (3), we can design a method considering three aspects: 1) taking the term P(z | x) into consideration, we can obtain diverse representations of the example; 2) taking the term P(y | x', z) into consideration, we can make predictions by exploring the relationship between examples, such as computing distances in a learned metric space; 3) taking the term P(x') into consideration, the importance of a training example should be related to its frequency of occurrence.

The pipeline of our method and its corresponding causal interpretation in Eq. (3) are illustrated in Figure 3. We obtain diverse representations of the test example from the CLIP Radford et al. (2021) backbone and the BLIP Li et al. (2022) backbone. Intermediate predictions are produced by the CLIP Attention Block and the BLIP Attention Block. The final prediction is a linear combination of the intermediate predictions.

CLIP Attention Block.

Figure 4: Illustration of the CLIP Attention Block. The product operation shown is the dot product.

We first consider the representations generated by the CLIP backbone, which consists of two parts: a visual encoder and a textual encoder. For a test example, we first obtain its L2-normalized visual representation from the CLIP visual encoder:

(13)

Then, for all training examples, we obtain their L2-normalized representations, also using the CLIP visual encoder:

(14)

We also collect the one-hot labels of the training examples.

We now treat the test representation as the query, the training representations as the keys, and the one-hot labels as the values. The intermediate prediction is computed by the Attention Layer formulated as follows:

(15)

Since the query and keys are L2-normalized, the dot-product term is equivalent (up to a monotone transformation) to computing the Euclidean distances between the query representation and the key representations. The exponential function serves as a non-negative transformation of the computed distances.

On the other side, we obtain the L2-normalized textual representations of the class names from the CLIP textual encoder. The intermediate prediction is calculated as follows:

(16)

The generation of these two predictions is illustrated in Figure 4. The first prediction serves as new knowledge from the few-shot task, while the second preserves the prior knowledge from the CLIP model.
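The two intermediate predictions described above can be sketched as follows (a simplified illustration, not the authors' implementation; the tensor names and the exact form of the non-negative transform in Eq. (15) are assumptions).

```python
import torch

def clip_attention_block(f_test, train_feats, train_onehot, text_feats):
    """Sketch of the two intermediate predictions of the CLIP Attention Block.

    f_test:       (d,)     L2-normalised CLIP visual feature of the test example (Eq. (13))
    train_feats:  (N*K, d) L2-normalised CLIP visual features of the training examples (Eq. (14))
    train_onehot: (N*K, C) one-hot labels of the training examples (the values)
    text_feats:   (C, d)   L2-normalised CLIP textual features of the class names
    """
    sim = train_feats @ f_test                           # dot products; for L2-normalised vectors
                                                         # these determine the Euclidean distances
    p_fewshot = torch.exp(-(1.0 - sim)) @ train_onehot   # Eq. (15): new knowledge from the task
    p_zeroshot = text_feats @ f_test                     # Eq. (16): prior knowledge from CLIP
    return p_fewshot, p_zeroshot
```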

BLIP Attention Block.

We then consider the representations generated by the BLIP backbone, which also consists of two parts: a visual encoder and a textual encoder. For a test example, we first obtain its L2-normalized visual representation from the BLIP visual encoder:

(17)

We then obtain the L2-normalized textual representations of the class names from the BLIP textual encoder. The intermediate prediction is calculated as follows:

(18)

Final Prediction.

The final prediction is the linear combination of intermediate predictions mentioned above:

(19)

where the two hyper-parameters control, respectively, the trade-off between new knowledge from the few-shot task and prior knowledge from the pre-trained models, and the trade-off between the representations from the CLIP backbone and the BLIP backbone.
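As an illustration only, one plausible arrangement of the combination in Eq. (19) is sketched below; the exact placement of the two hyper-parameters is the authors', and this particular form, together with the names alpha and gamma, is an assumption consistent with the ablation behaviour described later.

```python
def final_prediction(p_clip_fewshot, p_clip_zeroshot, p_blip_zeroshot, alpha=100.0, gamma=0.6):
    """One plausible combination in the spirit of Eq. (19) -- an assumed arrangement,
    not the authors' exact formula.

    alpha trades new knowledge (the cache prediction) against prior knowledge (the
    zero-shot CLIP prediction): the smaller alpha, the more the model relies on the
    few-shot task. gamma trades the CLIP branch against the BLIP branch.
    """
    p_clip = p_clip_fewshot + alpha * p_clip_zeroshot         # new vs. prior knowledge (assumption)
    return gamma * p_clip + (1.0 - gamma) * p_blip_zeroshot   # CLIP vs. BLIP branch
```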

During fine-tuning, we only update the cached representations as learnable parameters on the few-shot training data using the cross-entropy loss, while all other parameters remain fixed. We do not fine-tune the whole model, since fine-tuning a large-scale vision-language model would lead to severe over-fitting Gao et al. (2021); Zhang et al. (2021); Sung et al. (2022). Besides, the one-hot labels serving as ground-truth annotations are kept frozen to prevent forgetting. Owing to the good initialization of the Attention Layer, our method can adapt to new tasks after a few epochs of updates on the training set.

Causal Interpretation

As indicated by Eq. (3), our method considers not only the relationship between examples but also the diversity of representations. We can rephrase our method as follows:

(20)

where two of the representations are obtained from the CLIP visual encoder, one is obtained from the BLIP visual encoder, and the remaining terms are the “class centroid” in the textual domain provided by the CLIP textual encoder and the “class centroid” in the textual domain provided by the BLIP textual encoder.

Experimental Results

In this section, we evaluate our method w.r.t. classification accuracy on various datasets compared with several state-of-the-art baselines including Zero-shot BLIP Li et al. (2022), Zero-shot CLIP Gao et al. (2021), Tip-Adapter Zhang et al. (2021), and Tip-Adapter-F Zhang et al. (2021). Besides, we also conduct ablation studies to explore the characteristics of our method.

Datasets and Implementation Detail

We conduct experiments on 10 benchmark image classification datasets: ImageNet Deng et al. (2009), Caltech101 Fei-Fei et al. (2004), DTD Cimpoi et al. (2014), EuroSAT Helber et al. (2019), FGVCAircraft Maji et al. (2013), Flowers102 Nilsback and Zisserman (2008), Food101 Bossard et al. (2014), OxfordIIIPet Parkhi et al. (2012), StanfordCars Krause et al. (2013), and SUN397 Xiao et al. (2010). For each dataset, we sample images to construct two disjoint sets: a training set and a test set. Each method is trained on the training set and evaluated on the test set.

For the pre-trained CLIP backbone, we use ResNet-50 He et al. (2016) as the visual encoder and a transformer Vaswani et al. (2017) as the textual encoder. For the pre-trained BLIP backbone, we use ViT-Large Dosovitskiy et al. (2020) as the visual encoder and BERT Devlin et al. (2018) as the textual encoder. The prompt design for the CLIP backbone and the image pre-processing follow Zhang et al. (2021).

During fine-tuning, we only update the representations in Eq. (15). We use AdamW Loshchilov and Hutter (2018) as the optimizer with a fixed learning rate. The batch size is set to 256 for the training set. The two hyper-parameters in Eq. (19) are set to 100 and 0.6 by default. The number of shots, i.e., the number of training examples per class, is set to 16 by default.
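A minimal sketch of this fine-tuning stage under the stated setup (only the cached representations are trainable, AdamW, cross-entropy loss) is given below; the function, tensor names, epoch count, and learning-rate value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def finetune_cache_keys(train_feats, train_onehot, train_labels, epochs=20, lr=1e-3):
    """Sketch of the fine-tuning stage: only the cached training-example representations
    (the keys of Eq. (15)) are learnable; the one-hot label values and both backbones
    stay frozen. Features are assumed precomputed with the frozen encoders; the epoch
    count and learning rate here are illustrative.

    train_feats:  (N*K, d) L2-normalised CLIP features of the training images
    train_onehot: (N*K, C) frozen one-hot labels (the values)
    train_labels: (N*K,)   integer labels used as cross-entropy targets
    """
    cache_keys = torch.nn.Parameter(train_feats.clone())   # initialised from the CLIP features
    optimizer = torch.optim.AdamW([cache_keys], lr=lr)
    for _ in range(epochs):
        keys = F.normalize(cache_keys, dim=-1)
        logits = torch.exp(-(1.0 - train_feats @ keys.T)) @ train_onehot  # few-shot logits
        loss = F.cross_entropy(logits, train_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return cache_keys.detach()
```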

Models ImageNet Caltech101 DTD EuroSAT FGVCAircraft
Zero-shot BLIP 47.48 89.11 47.18 35.25 5.52
Zero-shot CLIP 60.32 83.96 41.33 23.50 16.20
Tip-Adapter 61.81 85.35 47.02 45.62 20.13
Tip-Adapter-F 65.34 90.89 67.29 78.25 32.76
Ours 72.91 96.14 69.84 82.75 31.14
Models Flowers102 Food101 OxfordIIIPet StanfordCars SUN397
Zero-shot BLIP 57.11 71.81 59.80 71.06 45.57
Zero-shot CLIP 62.99 76.93 83.05 54.37 58.95
Tip-Adapter 68.87 78.08 84.08 60.45 64.11
Tip-Adapter-F 91.42 81.31 88.58 72.86 71.14
Ours 95.34 86.61 91.01 84.52 75.13
Table 1: Accuracy (%) of different models on various datasets. Methods are evaluated with 16 shots.

Comparison on Various Datasets

Table 1 shows the performance of our method on 10 benchmark image classification datasets. The CLIP visual backbone is ResNet-50 and the BLIP visual backbone is ViT-Large for all methods. As Table 1 shows, our method greatly improves classification accuracy on various datasets, with gains of at least 2.43% on ImageNet, Caltech101, DTD, EuroSAT, Flowers102, Food101, OxfordIIIPet, StanfordCars, and SUN397. The accuracy of our method is marginally lower than that of Tip-Adapter-F on FGVCAircraft, by 1.62%. We attribute the degraded performance on FGVCAircraft to the limited generalization of BLIP to this dataset, since Zero-shot BLIP achieves only 5.52% accuracy on it.

Comparison on ImageNet

Figure 5: Accuracy(%) of different models with various shots on ImageNet. (Best viewed in color.)

As Figure 5 shows, our method can achieve significant improvement on ImageNet with various shots. Ours surpasses all previous methods in all few-shot settings on ImageNet, which demonstrates the superiority of using diverse representations.

We test our model with various CLIP visual encoders on ImageNet, with the number of shots set to 16. We adopt ResNet He et al. (2016) and ViT Dosovitskiy et al. (2020) variants for comparison. As Table 2 shows, our method gains significant improvements with all CLIP visual encoders.

Models RN50 RN101 ViT/32 ViT/16
Zero-shot CLIP 60.32 62.53 63.81 68.74
Tip-Adapter 61.81 64.00 65.31 70.29
Tip-Adapter-F 65.34 68.49 68.62 73.64
Ours 72.91 73.84 73.69 75.82
Table 2: Accuracy (%) of different models with various CLIP visual encoders. RN50 refers to ResNet-50 while ViT/32 refers to ViT-Base with 32 × 32 patch size. Methods are evaluated with 16 shots.

Ablation Studies

In this set of experiments, we conduct ablation studies on ImageNet to analyze the characteristics of our method. All experiments are tested with 16 shots.

Analysis of the Knowledge Trade-off Hyper-parameter.

As indicated by Eq. (19), this hyper-parameter controls the trade-off between new knowledge from the few-shot task and prior knowledge from the pre-trained model. As it becomes smaller, our model tends to learn more from the current few-shot task, and less otherwise. We vary it from 1 to 10000 while fixing the backbone trade-off to 0.6. As the top rows of Table 3 show, our model performs best at the moderate value of 100. This implies that we should strike a good balance between new knowledge from the few-shot task and prior knowledge from the pre-trained model.

Analysis of the Backbone Trade-off Hyper-parameter.

As indicated by Eq. (19), this hyper-parameter controls the trade-off between the representations from the CLIP backbone and the BLIP backbone. As it becomes larger, our model depends more on information provided by the CLIP backbone, and less otherwise. We vary it from 0.3 to 0.9 while fixing the knowledge trade-off to 100. As the bottom rows of Table 3 show, our model performs best at the moderate value of 0.6. This implies that the CLIP and BLIP backbones contribute approximately equally to the final prediction.

Analysis of Different Intermediate Predictions.

We further study the effect of each intermediate prediction on the final prediction w.r.t. Eq. (19). We use a tick (✓) to indicate that an intermediate prediction is adopted for the final prediction. Here we set the knowledge trade-off hyper-parameter to 100 and the backbone trade-off hyper-parameter to 0.6 whenever they appear as coefficients in the final prediction. Note that during fine-tuning we only update the representations in Eq. (15) where applicable. The cache-based prediction of Eq. (15) alone is essentially Matching Networks in which a few parameters can be updated, the CLIP textual prediction of Eq. (16) alone represents Zero-shot CLIP, and the BLIP textual prediction of Eq. (18) alone represents Zero-shot BLIP. The cache-based prediction combined with the CLIP textual prediction represents Tip-Adapter-F under our hyper-parameter setting, and all three predictions together represent our proposed method. As Table 4 shows, predictions from diverse representations are always better than those from a single representation, which strongly demonstrates the necessity of considering the diversity of representations.

Knowledge trade-off   1      10     100    1000   10000
Accuracy (%)          55.65  59.62  72.91  70.26  68.58
Backbone trade-off    0.3    0.45   0.6    0.75   0.9
Accuracy (%)          71.38  72.47  72.91  71.93  69.02
Table 3: Ablation studies of our method varying the two hyper-parameters of Eq. (19): the knowledge trade-off (top) and the backbone trade-off (bottom).
Accuracy (%)
55.21
60.32
47.48
64.72
70.50
63.19
72.91
Table 4: Ablation studies of our method with different intermediate predictions. We use a tick (✓) to indicate that the intermediate prediction is adopted for the final prediction.

Conclusion

In this paper, we interpreted existing mainstream few-shot learning methods from the perspective of the causal mechanism. We interpreted these methods using a causal graph, and based on it we deconfounded the training by intervening on the example via front-door adjustment. We observed that the intervened causal framework explains existing mainstream few-shot learning methods well. Motivated by this framework, we proposed a general causal method for few-shot tasks that considers not only the relationship between examples but also the diversity of representations. Experimental results have demonstrated the superiority of our proposed method. In future work, we aim to provide better theoretical explanations for more few-shot learning methods.

References

  • L. Bossard, M. Guillaumin, and L. V. Gool (2014) Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, pp. 446–461. Cited by: Datasets and Implementation Detail.
  • M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613. Cited by: Datasets and Implementation Detail.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Datasets and Implementation Detail.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Datasets and Implementation Detail.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: Datasets and Implementation Detail, Comparison on ImageNet.
  • N. Dvornik, C. Schmid, and J. Mairal (2019) Diversity with cooperation: ensemble methods for few-shot classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3723–3731. Cited by: Interpretation for CLIP/Tip-Adapter.
  • Z. Fan, Y. Ma, Z. Li, and J. Sun (2021) Generalized few-shot object detection without forgetting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4527–4536. Cited by: Introduction.
  • L. Fei-Fei, R. Fergus, and P. Perona (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pp. 178–178. Cited by: Datasets and Implementation Detail.
  • P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao (2021) Clip-adapter: better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544. Cited by: Introduction, Metric-Based Few-Shot Learning, Pre-trained Vision-Language Models, Interpretation for CLIP/Tip-Adapter, Interpretation for CLIP/Tip-Adapter, Final Prediction., Experimental Results.
  • M. Glymour, J. Pearl, and N. P. Jewell (2016) Causal inference in statistics: a primer. John Wiley & Sons. Cited by: Introduction, Causal Inference, Causal Interpretation.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Datasets and Implementation Detail, Comparison on ImageNet.
  • P. Helber, B. Bischke, A. Dengel, and D. Borth (2019) EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (7), pp. 2217–2226. Cited by: Datasets and Implementation Detail.
  • J. Huang, Y. Qin, J. Qi, Q. Sun, and H. Zhang (2021) Deconfounded visual grounding. arXiv preprint arXiv:2112.15324. Cited by: Causal Inference.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: Datasets and Implementation Detail.
  • H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang (2019) Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1–10. Cited by: Introduction.
  • J. Li, D. Li, C. Xiong, and S. Hoi (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086. Cited by: Introduction, Pre-trained Vision-Language Models, Method, Experimental Results.
  • J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, pp. 9694–9705. Cited by: Pre-trained Vision-Language Models.
  • J. Liu, Y. Bao, G. Xie, H. Xiong, J. Sonke, and E. Gavves (2022) Dynamic prototype convolution network for few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11553–11562. Cited by: Introduction.
  • I. Loshchilov and F. Hutter (2018) Decoupled weight decay regularization. In International Conference on Learning Representations, Cited by: Datasets and Implementation Detail.
  • S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: Datasets and Implementation Detail.
  • J. Mitrovic, B. McWilliams, J. C. Walker, L. H. Buesing, and C. Blundell (2021) Representation learning via invariant causal mechanisms. In International Conference on Learning Representations. Cited by: Causal Inference.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: Datasets and Implementation Detail.
  • O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar (2012) Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Datasets and Implementation Detail.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. Cited by: Introduction, Introduction, Metric-Based Few-Shot Learning, Method.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. Advances in neural information processing systems 30. Cited by: Introduction, Introduction, Metric-Based Few-Shot Learning, Interpretation for Prototypical Networks.
  • Y. Sung, J. Cho, and M. Bansal (2022) VL-Adapter: parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5227–5237. Cited by: Interpretation for CLIP/Tip-Adapter, Final Prediction.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: Datasets and Implementation Detail.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. Advances in neural information processing systems 29. Cited by: Introduction, Introduction, Metric-Based Few-Shot Learning, Interpretation for Matching Networks.
  • Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni (2020) Generalizing from a few examples: a survey on few-shot learning. ACM computing surveys (csur) 53 (3), pp. 1–34. Cited by: Introduction.
  • J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: Datasets and Implementation Detail.
  • X. Yang, H. Zhang, and J. Cai (2021) Deconfounded image captioning: a causal retrospect. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Causal Inference.
  • Z. Yue, H. Zhang, Q. Sun, and X. Hua (2020) Interventional few-shot learning. Advances in neural information processing systems 33, pp. 2734–2746. Cited by: Causal Inference.
  • D. Zhang, H. Zhang, J. Tang, X. Hua, and Q. Sun (2020) Causal intervention for weakly-supervised semantic segmentation. Advances in Neural Information Processing Systems 33, pp. 655–666. Cited by: Causal Inference.
  • R. Zhang, R. Fang, P. Gao, W. Zhang, K. Li, J. Dai, Y. Qiao, and H. Li (2021) Tip-adapter: training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930. Cited by: Introduction, Metric-Based Few-Shot Learning, Interpretation for CLIP/Tip-Adapter, Interpretation for CLIP/Tip-Adapter, Interpretation for CLIP/Tip-Adapter, Final Prediction., Datasets and Implementation Detail, Experimental Results.