
Rectifying the Shortcut Learning of Background: Shared Object Concentration for Few-Shot Image Recognition

by Xu Luo, et al.

Few-shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks. Typically, each task involves only a few training examples from brand-new categories. This requires the pretraining models to focus on well-generalizable knowledge and ignore domain-specific information. In this paper, we observe that the image background serves as a source of domain-specific knowledge, which is a shortcut for models to learn in the source dataset but is harmful when adapting to brand-new classes. To prevent the model from learning this shortcut knowledge, we propose COSOC, a novel few-shot learning framework, to automatically extract foreground objects at both the pretraining and evaluation stages. COSOC is a two-stage algorithm motivated by the observation that foreground objects from different images within the same class share more similar patterns than backgrounds. At the pretraining stage, for each class, we cluster contrastive-pretrained features of randomly cropped image patches, such that crops containing only foreground objects can be identified as a single cluster. We then force the pretraining model to focus on the found foreground objects via a fusion sampling strategy. At the evaluation stage, among the training images of each class in a few-shot task, we seek the shared content and filter out the background; the recognized foreground objects of each class are then used to match the foreground of testing images. Extensive experiments tailored to inductive FSL tasks on two benchmarks demonstrate the state-of-the-art performance of our method.





1 Introduction

Through observing a few samples at a glance, humans can accurately identify brand-new objects. This ability comes from years of experience accumulated by the human vision system. Inspired by such learning capabilities, Few-Shot Learning (FSL) was developed to tackle the problem of learning from limited data Alexander et al. (1995, 1995). A typical FSL framework has two separate stages: the pretraining stage and the evaluation stage. At the pretraining stage, FSL models absorb knowledge from a large-scale dataset; at the evaluation stage, the learned knowledge is leveraged to solve a series of downstream classification tasks, each of which contains very few training images from brand-new categories. The category difference leads to an overwhelming distribution gap between these two stages Alexander et al. (1995, 1995). This challenge places high demands on FSL models to learn well-generalizable pretrained knowledge.

In this paper, we find that the image background serves as one source of such domain-specific knowledge in FSL. The background is distracting information that usually does not determine the category of the image. But as pointed out in Alexander et al. (1995), there are spurious correlations between the background and the category of an image (e.g., birds usually stand on branches, and shells often lie on beaches), which serve as shortcut knowledge for models to learn. As illustrated in Fig. 1, background knowledge is useful only when no category gap (i.e., the classes remain unchanged) exists between the pretraining and evaluation datasets. However, in FSL, the correlation cannot generalize to brand-new classes and will probably mislead the predictions. For a more thorough empirical study, please refer to Sec. 3.


Figure 1: An illustrative example that demonstrates why background information is useful for regular classification but is harmful for Few-Shot Learning.

A very natural solution is to force the model to concentrate on foreground objects at both stages, but this is not easy since we do not have any prior knowledge of the identity and position of the foreground object in an image. Our key insight is that foreground objects from images of the same class share more common patterns than backgrounds do. Motivated by this observation, we customize algorithms separately for the pretraining and evaluation stages.

At the pretraining stage, we seek foreground objects by clustering random crops of the images within each class. In this way, crops containing only foreground objects are more likely to form a single cluster and thus can be identified easily. The premise, however, is that we have a feature extractor that maps foreground objects into compact regions of the feature space. To this end, we adopt a contrastive-pretrained feature extractor Alexander et al. (1995, 1995), which is shown in Sec. 3.2 to be good at identifying foreground objects. After figuring out the clusters that contain crops with only foreground objects, we define the foreground score of each image crop based on the feature distance to its closest cluster center. The crops with the highest foreground scores in each image are picked and treated as potential foreground objects of the image. We call this framework Clustering-based Object Seeker (COS). During pretraining, foreground objects obtained by COS are used to replace the original image with a probability depending on the foreground score. In this way, we aim to maximally suppress shortcut learning of the background and improve the transferability of models.

At the few-shot evaluation stage, we propose the Shared Object Concentrator (SOC) to grasp the shared information (i.e., the foreground) among the few training images of each class. This is achieved by repeatedly seeking one patch per image such that the sum of pairwise distances between these patches is minimized. The average features of the obtained patches are regarded as representations of foreground objects and are used to locate foreground objects of testing images by feature matching. In this way, we remove irrelevant background information, such that reliable predictions can be made via comparisons between the foreground objects of testing and training images in the few-shot task.

In summary, our contributions are as follows. i) By conducting empirical studies on image foreground and background in FSL, we reveal that the image background serves as a source of shortcut knowledge which harms evaluation performance. ii) To solve this problem, we propose COSOC, a two-stage framework combining COS and SOC, which draws the model’s attention to the image foreground at both the pretraining and evaluation stages. iii) Extensive experiments on inductive (non-transductive) FSL tasks on two benchmarks demonstrate the state-of-the-art performance of our method.

2 Related Works

Few-shot Image Classification. Few-shot learning Alexander et al. (1995) focuses on recognizing samples from new classes with only scarce labeled examples. This is a challenging task because of the risk of overfitting. Plenty of previous works tackled this problem in the meta-learning framework Alexander et al. (1995, 1995), where a model gains experience in solving few-shot tasks by tackling pseudo few-shot classification tasks constructed from the pretraining dataset. Existing meta-learning methods can be broadly divided into three groups. (1) Optimization-based methods learn how to optimize the model given few training samples; such methods either meta-learn a good model initialization Alexander et al. (1995, 1995, 1995, 1995, 1995), meta-learn the whole optimization process Alexander et al. (1995, 1995, 1995, 1995), or meta-learn both Alexander et al. (1995, 1995). (2) Hallucination-based methods Alexander et al. (1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995) learn to augment similar training samples in few-shot tasks and thus can greatly alleviate the low-shot problem. (3) Metric-based methods Alexander et al. (1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995) learn to map images into a metric feature space and classify testing images by computing feature distances to training images. Among them, several recent works Alexander et al. (1995, 1995, 1995, 1995) sought correspondence between images, either by attention or by meta-filters, in order to obtain a more reasonable similarity measure. Similar to our SOC algorithm in Sec. 4.2, these methods can also align foreground objects in pairs of images at the evaluation stage. However, such pairwise alignment can easily be misled by similar backgrounds, leading to unreliable similarity measures between images. Instead of pairwise matching, SOC seeks the content shared among all images within the same class and thus can reliably remove the background.

Contrastive Learning. Recent success in contrastive learning of visual representations has greatly promoted the development of unsupervised learning Alexander et al. (1995, 1995, 1995, 1995). The promising performance of contrastive learning relies on an instance-level discrimination loss which maximizes agreement between transformed views of the same image and minimizes agreement between transformed views of different images. Recently there have been some attempts Alexander et al. (1995, 1995, 1995, 1995, 1995) at integrating contrastive learning into the FSL framework. Although they achieve good results, these works lack an in-depth understanding of why contrastiveness is useful in FSL. To this end, one goal of our work is to reveal the advantage of contrastive learning over traditional supervised FSL models in identifying the core objects of images, and then to show how to use this advantage to improve FSL performance.

3 Empirical Investigation

Problem Definition. Few-shot learning involves a pretraining dataset D_pre and an evaluation dataset D_eval which share no overlapping classes. D_pre contains a large amount of labeled data and is usually used first to pretrain a backbone network f. After pretraining, a set of N-way K-shot classification tasks is constructed, each by first sampling N classes from D_eval and then sampling K training and Q testing images from each class to constitute the training set S and the test set T, respectively. In each task, given the learned backbone f and the small training set S consisting of K labeled images from each of the N classes, a few-shot classification algorithm is designed to classify the images in the test set T.
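As an illustration of this episode construction, the following sketch (ours, not part of the paper; the dataset layout and function name are hypothetical) samples one N-way K-shot task from a class-to-images mapping:

```python
import random

def sample_episode(dataset, n_way, k_shot, q_query, rng=random):
    """Sample one N-way K-shot task from a {class_name: [images]} mapping.
    Labels are re-indexed to 0..n_way-1 inside the episode."""
    classes = rng.sample(sorted(dataset), n_way)            # N brand-new classes
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(dataset[cls], k_shot + q_query)  # disjoint images
        support += [(img, label) for img in picks[:k_shot]]
        query += [(img, label) for img in picks[k_shot:]]
    return support, query
```

A 5-way 5-shot task with 15 queries per class thus yields 25 support and 75 query images.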

Preparation. To investigate the influence of background and foreground in few-shot learning, we create a subset of miniImageNet Alexander et al. (1995) and manually crop each image in it so that the background is removed entirely and only the core object for recognition remains. We refer to the uncropped version of the subset as the original subset, and to the cropped version as the foreground subset. We carry out experiments on two well-known FSL baselines: Cosine Classifier (CC) Alexander et al. (1995) and Prototypical Networks (PN) Alexander et al. (1995). The details of constructing the subset and introductions to CC and PN can be found in Appendix A.

3.1 The Role of Foreground and Background in Few-Shot Image Classification

Figure 2: 5-way 5-shot FSL performance under different variants of the pretraining and evaluation datasets described in Sec. 3. (a) Empirical exploration of image foreground and background in FSL using two models: PN and CC. (b) Comparison between CC and Exampler pretrained on full miniImageNet and evaluated on the original and foreground subsets.

Fig. 2 shows the average 5-way 5-shot classification accuracy obtained by pretraining CC and PN on each of the original and foreground subsets, and evaluating on each of them. Additional 5-way 1-shot experiments can be found in Appendix F.

Category gap disables generalization of background knowledge. It can first be noticed that, under any condition, the performance is consistently and significantly improved if the background is removed at the evaluation stage (switching evaluation from the original subset to the foreground subset). The result implies that background information at the evaluation stage in FSL is distracting information. This is the opposite of what is reported in Alexander et al. (1995), which shows that removing the background harms the performance of the traditional classification task, where no category gap exists between pretraining and evaluation. Thus we can infer that the class/distribution gap in FSL disables the generalization of background knowledge and degrades performance.

Removing background at the pretraining stage prevents shortcut learning. When only the foreground is given at the evaluation stage, models pretrained with only the foreground perform much better than those pretrained with the original images. This indicates that models pretrained with original images do not pay enough attention to the foreground object, which is what really matters for classification. Background information at the pretraining stage serves as a shortcut for the model to learn and cannot generalize to brand-new classes. In contrast, models pretrained with only the foreground "learn to compare" different objects, which is the desirable ability for reliable generalization to downstream few-shot learning with out-of-domain classes.

Pretraining with background helps models handle complex scenes. When evaluating on the original subset, models pretrained on the original dataset are slightly better than those pretrained on the foreground subset. We attribute this to a kind of domain shift: models trained only on foregrounds never encounter images with complex backgrounds and do not know how to handle them. In Appendix D.1 we show the evaluation accuracy of each class under the above two pretraining settings, which further verifies this assertion. Note that since we apply random-crop augmentation during pretraining, the same domain shift does not arise if the model is pretrained on the original subset and evaluated on the foreground subset instead.

Simple fusion sampling combines the advantages of both sides. We want to cut off shortcut learning of the background while maintaining the model's adaptability to complex scenes. A simple way is to mix the data sampling process: given an image as input, choose its foreground version with probability p and its original version with probability 1 − p, where p is simply set to a fixed constant. We refer to the dataset using this sampling strategy as the fusion subset. From Fig. 2, we can observe that models trained this way indeed combine the advantages of both sides, achieving relatively good performance on both the original and foreground subsets. In Appendix C, we compare the training curves of PN models pretrained on the three versions of the dataset to further investigate the effectiveness of fusion sampling.
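The mixing step above amounts to a coin flip per training image. A minimal sketch (ours, not the authors' code; the default p = 0.5 is an illustrative value for an even mix):

```python
import random

def fusion_sample(original, foreground, p=0.5, rng=random):
    """Simple fusion sampling: return the foreground version of a training
    image with probability p, otherwise return the original image.
    p = 0.5 is an illustrative even mix."""
    return foreground if rng.random() < p else original
```

Over a training epoch this exposes the model to roughly equal numbers of foreground-only and full-scene views of each image.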

The analysis above gives us new insights into how to further improve few-shot learning: (1) During pretraining, fusion sampling is required to enhance the transferability of pretrained models. (2) Because background information is distracting during evaluation, we need to figure out where the foreground object is and then use only those parts for evaluation. Therefore, a reliable foreground-object identification mechanism is required at both the pretraining stage (for fusion sampling) and the evaluation stage.

3.2 Contrastive Learning is Good at Identifying Objects

In this subsection, we reveal the potential of contrastive learning for identifying foreground objects without the interference of the background, which will be utilized in Sec. 4.1. Given one transformed view of an image, contrastive learning aims to identify another transformed view of the same image among thousands of views of different images. We give a comprehensive introduction to contrastive learning in Appendix B. The two augmented views of the same image always cover the same object, but probably with different parts, sizes, and colors. To discriminate two augmented patches from thousands of other image patches, the model has to learn to identify the key discriminative information of the object under varying environments. In this manner, contrastive learning explicitly models semantic relations among crops of images, thereby clustering semantically similar contents automatically. The features of different images are pushed apart, but features of similar objects in different images are pulled closer. Thus we speculate that contrastive learning equips models with a better ability to identify single objects than supervised models.
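The instance-level discrimination loss underlying this behavior is the standard InfoNCE objective; a self-contained sketch for one anchor view (ours, for illustration; real methods compute this over large batches or memory banks of encoder features):

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor view: the negative log-probability of
    picking the positive view among the positive plus all negative views,
    with cosine similarities scaled by a temperature tau."""
    def cos(u, v):
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    logits = [cos(anchor, positive) / tau] + [cos(anchor, n) / tau for n in negatives]
    m = max(logits)                                    # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]                           # -log p(positive)
```

The loss is near zero when the two views of the same image are far more similar to each other than to any negative, which is exactly the pressure that pulls crops of the same object together in feature space.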

To verify this, we pretrain CC and a contrastive learning model on the whole miniImageNet and compare their accuracy on the original and foreground subsets. The contrastive learning method we use is Exampler Alexander et al. (1995), a modified version of MoCo Alexander et al. (1995). Unsurprisingly, Fig. 2 shows that Exampler performs much better when only the foregrounds of testing images are given, which affirms that contrastive learning has a better ability to discriminate single objects. The evaluation accuracy of Exampler on the original subset is slightly worse than that of CC: contrastive learning aims to discriminate between every pair of images without knowing what foreground and background are, so the model judges two images to be similar only if both foreground and background are similar. This may lead to misclassification when the foreground object is the same but the background differs. In Appendix D.2, we provide a more in-depth analysis of why contrastive learning has these properties and find that the shape bias and viewpoint invariance of contrastive learning may play an important role.

4 Rectifying Shortcut Learning of Background

Given the analysis in Sec. 3.1, we want the model to focus more on the image foreground at both the pretraining and evaluation stages, and in Sec. 3.2 we revealed the potential of contrastive learning for discriminating foreground objects. Thus, in this section, we propose COSOC, a two-stage framework which utilizes contrastive learning to draw the model’s attention to the foreground objects of images.

4.1 Pretraining Stage: Clustering-based Object Seeker (COS) with Fusion Sampling

Since contrastive learning is good at discriminating foreground objects, we utilize it to make up for the deficiencies of supervised few-shot learning models. The first step is to pretrain a backbone using Exampler Alexander et al. (1995). Then the COS algorithm is used to extract the "objects" identified by the pretrained model. The basic idea is that the features of foreground objects in images within one class, as extracted by contrastive learning models, are similar and can thereby be identified via a clustering algorithm; see Fig. 3. All images within the c-th class of the pretraining set form a set I_c = {x_1, ..., x_N}; we omit the class index c for brevity. The scheme for seeking foreground objects in each class is as follows:

Figure 3: Simplified schematic illustration of COS algorithm. We show how we obtain foreground objects of three exemplified images. The number under each crop denotes its foreground score.

1) For each image x_i, we randomly crop it M times to obtain image patches x_i^1, ..., x_i^M. Each image patch x_i^j is then passed through the pretrained model f to get a normalized feature vector z_i^j = f(x_i^j) / ||f(x_i^j)||.

2) We run a clustering algorithm on all feature vectors and obtain clusters C_1, ..., C_K, where c_k is the center feature of the k-th cluster.

Figure 4: Examples of obtained objects from pretraining set of miniImageNet. The first row shows the original images. The second row shows the picked patch with the highest foreground score.

3) We say an image x_i belongs to cluster C_k if there exists j ∈ {1, ..., M} such that z_i^j ∈ C_k. Let p_k be the proportion of images in the class that belong to C_k. If p_k is small, then the cluster is not representative of the whole class and is possibly background. Thus we remove all clusters whose p_k falls below a threshold that controls the generality of clusters. The remaining clusters are the “objects” of the class that we find.

4) The foreground score of an image patch x_i^j is defined from the distance between z_i^j and the center of its closest remaining cluster, rescaled so that the score lies in [0, 1] (the smaller the distance, the higher the score). The top-k scores of each image are then obtained; the corresponding patches are regarded as possible crops of the foreground object in image x_i, with the foreground scores as their confidence.
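The four steps above can be condensed into a small sketch (ours, not the authors' code). Real crop features would come from the contrastive backbone, and the exact score normalization in the paper may differ; here the score is 1 − d/2, which maps Euclidean distances between unit-norm features into [0, 1]:

```python
import math
import random

def _dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, k, iters=50, rng=random):
    """A tiny k-means: returns k cluster centers."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: _dist(p, centers[i]))].append(p)
        centers = [[sum(c) / len(g) for c in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def cos_objects(crop_feats, k, threshold, rng=random):
    """COS sketch. crop_feats[i] holds the crop features of image i (all
    images from one class). Returns, per image, the index of its
    highest-scoring crop and that foreground score."""
    all_feats = [f for feats in crop_feats for f in feats]
    centers = kmeans(all_feats, k, rng=rng)

    def nearest(f):
        return min(range(k), key=lambda i: _dist(f, centers[i]))

    # step 3: keep only clusters reached by a large proportion of the images
    props = [sum(any(nearest(f) == i for f in feats) for feats in crop_feats)
             / len(crop_feats) for i in range(k)]
    kept = [centers[i] for i in range(k) if props[i] >= threshold] or centers
    # step 4: score each crop by closeness to the nearest kept "object" center
    best = []
    for feats in crop_feats:
        scores = [1 - min(_dist(f, c) for c in kept) / 2 for f in feats]
        j = max(range(len(feats)), key=scores.__getitem__)
        best.append((j, scores[j]))
    return best
```

With well-separated foreground and background crops, the background clusters cover only a fraction of the images and are filtered out, so the top-scoring crop of each image is the foreground one.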

We visualize some obtained “objects” in Fig. 4 and display more results in Appendix G. From the results, we can see that the model indeed extracts relatively reliable crops of foreground objects. We then use them as prior knowledge to rectify the shortcut learning of background in supervised models.

The sampling strategy resembles the simple fusion strategy introduced in Sec. 3.1. For an image x_i, the probability of choosing the original version decreases with the foreground scores of its crops, and the probability of choosing a crop x_i^j grows with its foreground score s_i^j. Whichever is chosen, we then randomly crop the chosen image patch, ensuring that the minimum area proportion relative to the original image is a constant. We use this strategy to pretrain a few-shot learning model from scratch or to finetune from an Exampler-pretrained model, to maximally improve the model's ability to discriminate foreground objects while retaining its ability to adapt to complex scenes.
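One plausible instantiation of this score-dependent fusion sampling is sketched below (ours; the exact sampling distribution is not fully specified in the text, so the choice of 1 − max score for the original image and score-proportional weights for the crops is an assumption):

```python
import random

def cos_fusion_sample(image, crops, scores, rng=random):
    """Score-weighted fusion sampling sketch. `crops` are the top-k patches
    found by COS and `scores` their foreground scores in [0, 1]. With
    probability 1 - max(scores) the original image is kept; otherwise one
    crop is drawn with probability proportional to its score (assumed form)."""
    s_max = max(scores)
    if rng.random() >= s_max:      # low foreground confidence: keep original
        return image
    return rng.choices(crops, weights=scores, k=1)[0]
```

Images whose best crop has a low foreground score thus mostly keep their full scene, while confidently localized objects are trained on directly.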

4.2 Downstream Stage: Few-shot Learning with Shared Object Concentrator (SOC)

As shown in Sec. 3.1, if an accurate crop of the foreground object in each image is given at the evaluation stage, the performance of the model improves by a large margin, serving as an upper bound on model performance. To approach this upper bound, we propose SOC, an algorithm that imitates human inferential thinking, capturing foreground objects by seeking the content shared among the training images of the same class and the testing images.

Figure 5: The overall pipeline of step 1 in SOC. Points of one color represent the features of crops from one image. The red points represent the optimized shared-content features.

Step 1: Shared Content Searching within Each Class. For each image x_i within one class of the training set S, we randomly crop it M times and obtain corresponding candidates x_i^1, ..., x_i^M. Each patch is individually sent to the learned backbone f to obtain a normalized feature vector z_i^j. Thus we have in total K × M feature vectors within a class, where K is the number of images (shots) per class. Our goal is to obtain a feature vector φ that maximally represents the shared information of all images in the class. Ideally, φ represents the center of the K most similar image patches, one from each image, which can be formulated as

    φ = (1/K) Σ_{i=1}^{K} z_i^{σ*(i)},   σ* = argmin_{σ ∈ Σ} Σ_{i ≠ i'} d(z_i^{σ(i)}, z_{i'}^{σ(i')}),

where d denotes the cosine distance and Σ denotes the set of functions that take {1, ..., K} as domain and {1, ..., M} as range. While σ* can be obtained by enumerating all possible combinations of image patches, the computational complexity of this brute-force method is O(M^K), which is prohibitive when M or K is large. Thus, when the computation is not affordable, we turn to a simplified method that leverages iterative optimization. Instead of seeking the closest image patches, we directly optimize φ so that the sum of the minimum distances to the patches of each image is minimized, i.e.,

    φ* = argmin_φ Σ_{i=1}^{K} min_{j ∈ {1, ..., M}} d(φ, z_i^j),

which can be solved by any optimization method; SGD is applied in our experiments. After optimization, we remove the image patch most similar to φ* in each image, leaving K × (M − 1) feature vectors, and repeat the above optimization until no vectors are left, as shown in Fig. 5. We finally obtain M sorted feature vectors φ_1, ..., φ_M, which we use to represent possible objects of the class. If the shot K = 1, we simply use the original feature vectors.

Step 2: Locating Foreground Objects of Testing Images. Once the core objects of each class are identified, the next step is to use them to locate the foreground objects of the testing images in the test set T. For each testing image x, we again randomly crop it M times and obtain candidate features u_1, ..., u_M. For each class c, we have the sorted feature vectors φ_1, ..., φ_M from Step 1. We then match the most similar pair between the testing features and the class features, i.e.,

    (j*, k*) = argmax_{j, k} sim(u_j, φ_k);

the two matched features are removed, and the above process repeats until no features are left. Finally, the score of x w.r.t. class c is obtained as a weighted sum of all matched similarities, where the weight of each matched pair decays with its matching round and with the rank of the matched class feature, controlled by importance factors that reflect the belief of each crop being a foreground object. The predicted class of x is the one with the highest score.
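Step 2 can be sketched as greedy bipartite matching with decaying weights (ours; the geometric decay with factors alpha and beta is one plausible reading of the weighting described above, not a confirmed formula):

```python
import math

def _cos_sim(u, v):
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def soc_score(test_feats, class_feats, alpha=0.5, beta=0.5):
    """SOC Step 2 sketch: greedily match test-image crop features against a
    class's sorted shared features, removing each matched pair, and sum the
    similarities with weights that decay with the rank k of the matched
    shared feature (alpha) and with the matching round (beta)."""
    tests = list(test_feats)
    objs = list(enumerate(class_feats))     # remember the rank k of each phi_k
    score, rounds = 0.0, 0
    while tests and objs:
        # take the globally most similar (test crop, class feature) pair
        j, (k, sim) = max(((j, (k, _cos_sim(t, phi)))
                           for j, t in enumerate(tests) for k, phi in objs),
                          key=lambda x: x[1][1])
        score += (alpha ** k) * (beta ** rounds) * sim
        tests.pop(j)
        objs = [(kk, p) for kk, p in objs if kk != k]
        rounds += 1
    return score
```

The predicted class is then the argmax of this score over all classes in the task.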

5 Experiments

5.1 Experiment Setup

Dataset. We adopt two benchmark datasets which are the most representative in few-shot learning. The first is miniImageNet Alexander et al. (1995), a small subset of ILSVRC-12 Alexander et al. (1995) that contains 600 images in each of 100 categories. The categories are split into 64, 16, and 20 classes for pretraining, validation, and evaluation, respectively. The second dataset, tieredImageNet Alexander et al. (1995), is a much larger subset of ILSVRC-12 and is more challenging. It is constructed by choosing 34 super-classes containing 608 categories. The super-classes are split into 20, 6, and 8 super-classes, which ensures separation between pretraining and evaluation categories. The final dataset contains 351, 97, and 160 classes for pretraining, validation, and evaluation, respectively. On both datasets, the input image size is 84 × 84 for fair comparison.

Evaluation Protocols. We follow the 5-way 5-shot (1-shot) FSL evaluation setting. Specifically, 2000 tasks, each containing 15 testing images and 5 (1) training images per class, are randomly sampled from the evaluation set, and the average classification accuracy is computed. This is repeated 5 times, and the mean of the average accuracy with a 95% confidence interval is reported.
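The reported mean and 95% confidence interval over the repeated runs can be computed as follows (a standard normal-approximation sketch, ours, not the authors' evaluation code):

```python
import math

def mean_ci95(accs):
    """Mean and 95% confidence-interval half-width (normal approximation)
    over a list of per-run average accuracies."""
    n = len(accs)
    mean = sum(accs) / n
    var = sum((a - mean) ** 2 for a in accs) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)                    # z-value for 95% CI
    return mean, half
```

For example, five runs of [80, 82, 81, 79, 83] give a mean of 81 with a half-width of about 1.39, reported as 81 ± 1.39.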

Implementation Details. The backbone we use throughout the article is ResNet-12, which is widely used in few-shot learning. We use PyTorch Alexander et al. (1995) to implement all experiments on two NVIDIA 1080Ti GPUs. We train the models using SGD with a cosine learning-rate schedule without restarts, which reduces the number of hyperparameters (i.e., at which epochs to decay the learning rate). The initial learning rates for Exampler and CC are set separately, and their batch sizes are 256 and 128, respectively. For miniImageNet, we train Exampler for 150k iterations and CC for 6k iterations; for tieredImageNet, we train Exampler for approximately 900k iterations and CC for 120k iterations. We choose k-means Alexander et al. (1995) as the clustering algorithm for COS. The cluster-filtering threshold, the number of top-scoring crops kept per image at the pretraining stage, the number of random crops per image at the evaluation stage, and the two importance factors of Sec. 4.2 are fixed hyperparameters of our method.
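The cosine schedule without restarts mentioned above has a simple closed form; a sketch (ours, mirroring the common definition rather than any code from the paper):

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine learning-rate schedule without restarts: decays base_lr to 0
    over total_steps along half a cosine period."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```

The rate starts at base_lr, reaches half of it at the midpoint of training, and decays smoothly to zero, with no decay-epoch hyperparameters to tune.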

5.2 Model Analysis

In this subsection, we show the effectiveness of each component of our method. Tab. 1 shows the ablation study conducted on the full miniImageNet. Since the aim of the SOC algorithm is to find foreground objects, it is not necessary to evaluate SOC on the foreground version of the evaluation set.

Importance of the COS Algorithm. When COS is applied to CC, the performance improves on both versions of the evaluation set. In particular, it improves CC on the foreground version by 3.45% and 2.75% under the 1-shot and 5-shot settings, respectively. While less substantial, it still improves CC on the original version by 2.38% and 0.96% under the 1-shot and 5-shot settings, respectively. This verifies that the foreground objects obtained by the COS algorithm are relatively reliable, and that the model pretrained with these objects using the fusion strategy improves its ability to discriminate core objects while maintaining its ability to handle complex backgrounds.

COS | SOC | Original eval: 1-shot | Original eval: 5-shot | Foreground eval: 1-shot | Foreground eval: 5-shot
✗ | ✗ | 62.67 ± 0.32 | 80.22 ± 0.24 | 66.69 ± 0.32 | 82.86 ± 0.19
✓ | ✗ | 65.05 ± 0.06 | 81.16 ± 0.17 | 71.36 ± 0.30 | 86.20 ± 0.14
✗ | ✓ | 64.41 ± 0.22 | 81.54 ± 0.28 | - | -
✓ | ✓ | 69.29 ± 0.12 | 84.94 ± 0.28 | - | -
Table 1: Ablation study on miniImageNet. All models are pretrained on full miniImageNet.
Figure 6: Visualization examples of the SOC algorithm. The first row displays 5 images each from the dalmatian and guitar classes of the miniImageNet evaluation set. The second row shows the image patches picked in the first round of the SOC algorithm. Our method successfully focuses on the shared contents/foreground.
Figure 7: Comparison of training and validation curves between CC with and without COS.

In Fig. 7, we show the curves of training and validation error during the pretraining of CC with and without COS. Both models are pretrained and validated on the full miniImageNet. As we can see, CC sinks into overfitting: the training error drops to zero, and the validation accuracy stops improving before the end of pretraining. Meanwhile, the COS algorithm slows down convergence and prevents the training error from reaching zero. This makes the validation accuracy comparable at first but higher at the end. Our COS with the fusion strategy weakens the “background shortcut” and draws the model’s attention toward discriminating between foreground objects, which increases the difficulty of the task and improves generalization.

Effectiveness of the SOC Algorithm. The results in Tab. 1 show that the SOC algorithm is the key to maximally exploiting the model's object-discrimination ability: it further boosts performance on the original evaluation set by 4.24% and 3.78% under the 1-shot and 5-shot settings, respectively. This even approaches the upper-bound performance obtained by evaluating the model on the foreground version, whose images perfectly bound the foreground objects, implying that the model with the SOC algorithm successfully concentrates on the foreground objects.

Note that if we apply only the SOC algorithm on CC, the performance degrades. This indicates that COS and SOC are both necessary: COS provides the ability to discriminate foreground objects, and SOC leverages it to maximally boost performance.

Model | Backbone | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot
MatchingNet Alexander et al. (1995) | ResNet-12 | 63.08 ± 0.80 | 75.99 ± 0.60 | 68.50 ± 0.92 | 80.60 ± 0.71
TADAM Alexander et al. (1995) | ResNet-12 | 58.50 ± 0.30 | 76.70 ± 0.30 | - | -
LEO Alexander et al. (1995) | WRN-28-10 | 61.76 ± 0.08 | 77.59 ± 0.12 | 66.33 ± 0.05 | 81.44 ± 0.09
wDAE-GNN Alexander et al. (1995) | WRN-28-10 | 61.07 ± 0.15 | 76.75 ± 0.11 | 68.18 ± 0.16 | 83.09 ± 0.12
MetaOptNet Alexander et al. (1995) | ResNet-12 | 62.64 ± 0.82 | 78.63 ± 0.46 | 65.99 ± 0.72 | 81.56 ± 0.53
DC Alexander et al. (1995) | ResNet-12 | 62.53 ± 0.19 | 79.77 ± 0.19 | - | -
CTM Alexander et al. (1995) | ResNet-18 | 64.12 ± 0.82 | 80.51 ± 0.13 | 68.41 ± 0.39 | 84.28 ± 1.73
CAM Alexander et al. (1995) | ResNet-12 | 63.85 ± 0.48 | 79.44 ± 0.34 | 69.89 ± 0.51 | 84.23 ± 0.37
AFHN Alexander et al. (1995) | ResNet-18 | 62.38 ± 0.72 | 78.16 ± 0.56 | - | -
DSN Alexander et al. (1995) | ResNet-12 | 62.64 ± 0.66 | 78.83 ± 0.45 | 66.22 ± 0.75 | 82.79 ± 0.48
AM3+TRAML Alexander et al. (1995) | ResNet-12 | 67.10 ± 0.52 | 79.54 ± 0.60 | - | -
FEAT Alexander et al. (1995) | ResNet-12 | 66.78 ± 0.20 | 82.05 ± 0.14 | - | -
DeepEMD Alexander et al. (1995) | ResNet-12 | 68.77 ± 0.29 | 84.13 ± 0.53 | 74.29 ± 0.32 | 86.98 ± 0.60
Net-Cosine Alexander et al. (1995) | ResNet-12 | 63.85 ± 0.81 | 81.57 ± 0.56 | - | -
RFS-Distill Alexander et al. (1995) | ResNet-12 | 64.82 ± 0.60 | 82.14 ± 0.43 | 71.52 ± 0.69 | 86.03 ± 0.49
CA Alexander et al. (1995) | WRN-28-10 | 65.92 ± 0.60 | 82.85 ± 0.55 | 74.40 ± 0.68 | 86.61 ± 0.59
MABAS Alexander et al. (1995) | ResNet-12 | 65.08 ± 0.86 | 82.70 ± 0.54 | - | -
ConsNet Alexander et al. (1995) | ResNet-12 | 64.89 ± 0.23 | 79.95 ± 0.17 | - | -
IEPT Alexander et al. (1995) | ResNet-12 | 67.05 ± 0.44 | 82.90 ± 0.30 | 72.24 ± 0.50 | 86.73 ± 0.34
MELR Alexander et al. (1995) | ResNet-12 | 67.40 ± 0.43 | 83.40 ± 0.28 | 72.14 ± 0.51 | 87.01 ± 0.35
IER-Distill Alexander et al. (1995) | ResNet-12 | 67.28 ± 0.80 | 84.78 ± 0.52 | 72.21 ± 0.90 | 87.08 ± 0.58
LDAMF Alexander et al. (1995) | ResNet-12 | 67.76 ± 0.46 | 82.71 ± 0.31 | 71.89 ± 0.52 | 85.96 ± 0.35
FRN Alexander et al. (1995) | ResNet-12 | 66.45 ± 0.19 | 82.83 ± 0.13 | 72.06 ± 0.22 | 86.89 ± 0.14
COSOC (ours) | ResNet-12 | 69.28 ± 0.49 | 85.16 ± 0.42 | 73.57 ± 0.43 | 87.57 ± 0.10

Table 2: Comparison with state-of-the-art models on miniImageNet and tieredImageNet. The average inductive 5-way few-shot classification accuracies with 95% confidence intervals are reported.

Visualization of SOC. Fig. 6 shows two visualization examples of SOC. In each example, we show images within one class from a 5-shot task. The first row shows the original images, and the second row shows the corresponding automatically-picked patches. The original images come from unseen classes ("dalmatian" and "guitar") that the model has never encountered, and they contain noisy backgrounds. For instance, the first and third guitar images do not merely depict guitars; they show scenes of a band playing on stage, in which guitars happen to appear and occupy only a small area. Without any prior knowledge, the model does not know what to focus on in an image and thus has to embed the whole image, so the resulting feature contains much class-irrelevant information. From the second row, we can observe that SOC successfully focuses on the foreground and extracts the corresponding patches by searching for information shared across all images. More visualization results can be found in Appendix G.

5.3 Comparison to State-of-the-Arts

Tab. 2 presents 5-way 1-shot and 5-shot classification results on miniImageNet and tieredImageNet. We compare with state-of-the-art few-shot learning methods, sorted by year of publication. Our method uses a relatively shallow backbone (ResNet-12) but achieves comparable or stronger performance on both datasets. In particular, our method outperforms the most recent work IER-Distill Rizve et al. (2021) by 2.0% on the 1-shot task and 0.38% on the 5-shot task on miniImageNet; on tieredImageNet the improvement is 1.36% on the 1-shot task and 0.49% on the 5-shot task. Our method achieves state-of-the-art performance under all settings except the 1-shot task on tieredImageNet, where it is slightly worse than CA Afrasiyabi et al. (2020), which uses the deeper WRN-28-10 backbone as the feature extractor.

6 Conclusion

Few-shot image classification benefits from increasingly complex network and algorithm designs, but little attention has been paid to the image itself. In this paper, we reveal that image background serves as a source of harmful knowledge that few-shot learning models easily absorb. This problem is tackled by our COSOC method, which successfully draws the model's attention to the image foreground at both the pretraining and evaluation stages. However, random cropping in our algorithm introduces instability: what if no crop accurately bounds a foreground object? Future work may address this problem by identifying approximate object-based regions as candidate cropping positions, which could be achieved with unsupervised segmentation or detection algorithms.


  • Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In ECCV, 2020.
  • Sungyong Baik, Myungsub Choi, Janghoon Choi, Heewon Kim, and Kyoung Mu Lee. Meta-learning with adaptive hyperparameters. In NIPS, 2020.
  • Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NIPS, 2020.
  • Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  • Zitian Chen, Yanwei Fu, Yu-Xiong Wang, Lin Ma, Wei Liu, and Martial Hebert. Image deformation meta-networks for one-shot learning. In CVPR, 2019.
  • Carl Doersch, Ankush Gupta, and Andrew Zisserman. CrossTransformers: spatially-aware few-shot transfer. In NIPS, 2020.
  • Nanyi Fei, Zhiwu Lu, Tao Xiang, and Songfang Huang. MELR: meta-learning via modeling episode-level relationships for few-shot learning. In ICLR, 2021.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • Yizhao Gao, Nanyi Fei, Guangzhen Liu, Zhiwu Lu, Tao Xiang, and Songfang Huang. Contrastive prototype learning with augmented embeddings for few-shot learning. arXiv preprint arXiv:2101.09499, 2021.
  • Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
  • Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
  • Spyros Gidaris and Nikos Komodakis. Generating classification weights with GNN denoising autoencoders for few-shot learning. In CVPR, 2019.
  • Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: a new approach to self-supervised learning. In NIPS, 2020.
  • Bharath Hariharan and Ross B. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
  • Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439, 2020.
  • Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Cross attention network for few-shot classification. In NIPS, 2019.
  • Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In CVPR, 2019.
  • Jaekyeom Kim, Hyoungseok Kim, and Gunhee Kim. Model-agnostic boundary-adversarial sampling for test-time generalization in few-shot learning. In ECCV, 2020.
  • Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In CVPR, 2019.
  • Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. Boosting few-shot learning with adaptive margin loss. In CVPR, 2020.
  • Fei-Fei Li, Robert Fergus, and Pietro Perona. One-shot learning of object categories. In TPAMI, 2006.
  • Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang. Finding task-relevant features for few-shot learning by category traversal. In CVPR, 2019.
  • Kai Li, Yulun Zhang, Kunpeng Li, and Yun Fu. Adversarial feature hallucination networks for few-shot learning. In CVPR, 2020.
  • Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
  • Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, and Andrei Bursuc. Dense classification and implanting for few-shot learning. In CVPR, 2019.
  • Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: understanding margin in few-shot classification. In ECCV, 2020.
  • Chen Liu, Yanwei Fu, Chengming Xu, Siqian Yang, and Jilin Li. Learning a few-shot embedding model with contrastive learning. In AAAI, 2021.
  • James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
  • Orchid Majumder, Avinash Ravichandran, Subhransu Maji, Marzia Polito, Rahul Bhotika, and Stefano Soatto. Revisiting contrastive learning for few-shot classification. arXiv preprint arXiv:2101.11058, 2021.
  • Tsendsuren Munkhdalai and Hong Yu. Meta networks. In ICML, 2017.
  • Boris N. Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: task dependent adaptive metric for improved few-shot learning. In NIPS, 2018.
  • Yassine Ouali, Céline Hudelot, and Myriam Tami. Spatial contrastive learning for few-shot classification. arXiv preprint arXiv:2012.13831, 2020.
  • Eunbyung Park and Junier B. Oliva. Meta-curvature. In NIPS, 2019.
  • Seong-Jin Park, Seungju Han, Ji-Won Baek, Insoo Kim, Juhwan Song, Haebeom Lee, Jae-Joon Han, and Sung Ju Hwang. Meta variance transfer: learning to augment from the others. In ICML, 2020.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performance deep learning library. In NIPS, 2019.
  • Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In NIPS, 2019.
  • Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
  • Mamshad Nayeem Rizve, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In CVPR, 2021.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. In IJCV, 2015.
  • Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
  • Eli Schwartz, Leonid Karlinsky, Joseph Shtok, Sivan Harary, Mattias Marder, Abhishek Kumar, Rogério Schmidt Feris, Raja Giryes, and Alexander M. Bronstein. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In NIPS, 2018.
  • Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In CVPR, 2020.
  • Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
  • Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: relation network for few-shot learning. In CVPR, 2018.
  • Sebastian Thrun and Lorien Y. Pratt. Learning to learn: introduction and overview. In Learning to Learn, 1998.
  • Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In ECCV, 2020.
  • Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NIPS, 2016.
  • Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. Generalizing from a few examples: a survey on few-shot learning. In ACM Comput. Surv., 2020.
  • Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
  • Davis Wertheimer, Luming Tang, and Bharath Hariharan. Few-shot classification with feature map reconstruction networks. In CVPR, 2021.
  • Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: the role of image backgrounds in object recognition. In ICLR, 2021.
  • Chengming Xu, Chen Liu, Li Zhang, Chengjie Wang, Jilin Li, Feiyue Huang, Xiangyang Xue, and Yanwei Fu. Learning dynamic alignment via meta-filter for few-shot learning. In CVPR, 2021.
  • Jin Xu, Jean-Francois Ton, Hyunjik Kim, Adam R. Kosiorek, and Yee Whye Teh. MetaFun: meta-learning with iterative functional updates. In ICML, 2020.
  • Weijian Xu, Yifan Xu, Huaijin Wang, and Zhuowen Tu. Attentional constellation nets for few-shot learning. In ICLR, 2021.
  • Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, 2020.
  • Sung Whan Yoon, Jun Seo, and Jaekyun Moon. TapNet: neural network augmented with task-adaptive projection for few-shot learning. In ICML, 2019.
  • Zhongqi Yue, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. Interventional few-shot learning. In NIPS, 2020.
  • Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: few-shot image classification with differentiable earth mover's distance and structured classifiers. In CVPR, 2020.
  • Manli Zhang, Jianhong Zhang, Zhiwu Lu, Tao Xiang, Mingyu Ding, and Songfang Huang. IEPT: instance-level and episode-level pretext tasks for few-shot learning. In ICLR, 2021.
  • Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. MetaGAN: an adversarial approach to few-shot learning. In NIPS, 2018.
  • Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau, and Stephen Lin. What makes instance discrimination good for transfer learning? In ICLR, 2021.
  • Luisa M. Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In ICML, 2019.

Appendix A Details of Section 3

Dataset Construction. We construct a subset of miniImageNet ---. is created by randomly picking out of images from the first categories of -; and is created by randomly picking out of images from all categories of -. We then crop each image in so that the foreground object is tightly bounded. Some examples are displayed in Fig. 8.

Cosine Classifier (CC) and Prototypical Network (PN). In CC Gidaris and Komodakis (2018), the feature extractor is pre-trained together with a cosine-similarity based classifier in the standard supervised way. The loss can be formally described as


where denotes the number of classes in , denotes cosine distance, and denotes the learnable prototype for class . To solve a downstream few-shot classification task , CC adopts a non-parametric, metric-based algorithm. Specifically, all images in are mapped into features by the pre-trained feature extractor . All features from the same class in are then averaged to form a prototype . The cosine distance between a test image and each prototype is then calculated to obtain a score w.r.t. the corresponding class. In summary, the score for a test image w.r.t. class can be written as


and the predicted class for is the one with the highest score.
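The scoring rule above (class prototypes as averaged support features, cosine similarity to rank classes) can be sketched in a few lines. This is a minimal NumPy sketch with function names of our own choosing, not the paper's code:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(support_feats, support_labels, query_feat):
    """Nearest-prototype prediction under cosine similarity.

    support_feats: (N, D) array of pre-extracted support features
    support_labels: length-N list of class ids
    query_feat: (D,) feature of a test image
    """
    classes = sorted(set(support_labels))
    # Prototype of each class = mean of its support features
    protos = {
        c: support_feats[[i for i, y in enumerate(support_labels) if y == c]].mean(axis=0)
        for c in classes
    }
    # Score each class by cosine similarity; predict the argmax
    scores = {c: cosine(protos[c], query_feat) for c in classes}
    return max(scores, key=scores.get)
```

Note that the procedure is entirely non-parametric: nothing is trained at evaluation time, which is what makes the quality of the pretrained features so critical.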

The difference between PN and CC lies only in the pre-training stage. PN follows the meta-learning/episodic paradigm, in which a pseudo -way -shot classification task is sampled from during each iteration and solved using the same algorithm as (5). The loss at iteration is the average prediction loss over all test images and can be described as


Implementation Details in Sec. 3. For all experiments in Sec. 3, we pretrain CC and PN with ResNet-12 for 60 epochs. The initial learning rate is 0.1 with cosine decay schedule without restart. Random crop is used as data augmentation. The batch size for CC is 128 and for PN is 4.
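The episodic paradigm that PN follows can be illustrated with a toy sampler. For brevity this sketch draws pseudo tasks from precomputed per-class feature arrays, which is our simplification; in practice episodes are sampled as raw images and passed through the network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(features_by_class, n_way=5, k_shot=1, q_query=3):
    """Sample a pseudo n-way k-shot episode from a pretraining set.

    features_by_class: dict mapping class id -> (M, D) feature array.
    Returns support/query features with labels relabelled to 0..n_way-1.
    """
    classes = rng.choice(list(features_by_class), size=n_way, replace=False)
    support, query, s_lab, q_lab = [], [], [], []
    for new_id, c in enumerate(classes):
        feats = features_by_class[c]
        # Draw k_shot support and q_query query examples without replacement
        idx = rng.choice(len(feats), size=k_shot + q_query, replace=False)
        support.append(feats[idx[:k_shot]]); s_lab += [new_id] * k_shot
        query.append(feats[idx[k_shot:]]);  q_lab += [new_id] * q_query
    return np.vstack(support), s_lab, np.vstack(query), q_lab
```

Each sampled episode is then solved with the same prototype-and-cosine scoring described for CC, and the cross-entropy over the query predictions gives the training loss for that iteration.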

Appendix B Contrastive Learning

Contrastive learning maximizes the agreement between transformed views of the same image and minimizes the agreement between transformed views of different images. Specifically, let be a convolutional neural network with output feature space . Two augmented patches of one image are mapped by , producing a query feature and a key feature . Additionally, a queue containing thousands of negative features is produced from patches of other images. This queue can be generated either online, using all images in the current batch Chen et al. (2020), or offline, using stored features from the last few epochs He et al. (2020). Given , contrastive learning aims to identify among thousands of features , and can be formulated as:

where denotes a temperature parameter and a similarity measure. In Exampler Zhao et al. (2021), all samples in that belong to the same class as are removed in order to “preserve the unique information of each positive instance while utilizing the label information in a weak manner”.
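A minimal sketch of this InfoNCE-style objective for a single query follows; `tau = 0.07` is a common default rather than a value taken from this paper, and all features are assumed L2-normalised so that dot products equal cosine similarities:

```python
import numpy as np

def info_nce(q, k_pos, negatives, tau=0.07):
    """InfoNCE loss for one query: identify the positive key among negatives.

    q, k_pos: (D,) L2-normalised query/positive-key features
    negatives: (N, D) L2-normalised negative features (the "queue")
    tau: temperature (assumed value; the paper's setting may differ)
    """
    # Similarity logits: positive first, then all negatives
    logits = np.concatenate(([q @ k_pos], negatives @ q)) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0
```

The loss is small when the query is much closer to its positive key than to any negative, and grows as negatives become competitive, which is exactly the "identify the positive among thousands" behaviour described above.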

Appendix C Shortcut Learning in PN

Figure 8: Examples of images from the constructed datasets . The first row shows images in , which are original images of miniImageNet; the second row shows the corresponding cropped versions in , in which only the foreground objects remain.
Figure 9: Comparison of training and validation curves of PN pretrained under three different settings.

Fig. 9 shows training and validation curves of PN pretrained on -, - and -. It can be observed that the pretraining errors of models trained on - and - both decrease to zero within 10 epochs. However, the validation error does not decrease to a relatively low value and remains high after convergence, which reflects severe overfitting. By contrast, PN with fusion sampling converges much more slowly but maintains a much lower validation error throughout pretraining. There are apparently shortcuts for PN on both - and - that are avoided by the fusion sampling strategy. In the paper we have shown that the shortcut for dataset - is the statistical correlation between background and label, which is resolved by focusing more on the foreground objects. For dataset -, however, the shortcut is not clear; we speculate that an appropriate amount of background information injects noise into the optimization process, which can help the model escape local minima. We leave further exploration of this phenomenon to future work.

Appendix D Comparisons of Class-wise Evaluation Performance

Common few-shot evaluation focuses on the average performance over the whole evaluation set, which cannot tell why, and in what respects, one method is better than another. To this end, we propose a more fine-grained class-wise evaluation protocol that reports the average few-shot performance per class instead of a single overall number.
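The protocol amounts to aggregating correctness per class rather than globally. A sketch, where the record format (one `(class, correct)` pair per test image collected over many tasks) is our assumption:

```python
from collections import defaultdict

def classwise_accuracy(records):
    """Average accuracy per evaluation class instead of one global number.

    records: iterable of (class_name, correct) pairs collected over many
    few-shot tasks, where correct is True/False for one test image.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for cls, correct in records:
        hits[cls] += bool(correct)
        totals[cls] += 1
    return {cls: 100.0 * hits[cls] / totals[cls] for cls in totals}

def classwise_diff(acc_a, acc_b):
    """Per-class score of model a vs. model b, as reported in Tab. 3."""
    return {cls: acc_a[cls] - acc_b[cls] for cls in acc_a if cls in acc_b}
```

The per-class differences produced by `classwise_diff` are what the columns of Tab. 3 report: positive scores mean model a wins on that class, negative scores mean model b wins.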

Figure 10: Illustrative examples of images in -. The number under each class of images denotes the Signal-to-Full (SNF) ratio, i.e. the average ratio of foreground area to full image area within each class. A higher SNF approximately means less noise inside the images.

We first visualize some images from each class of - in Fig. 10. The classes are sorted by the Signal-to-Full (SNF) ratio, the average ratio of foreground area to full image area within each class. For instance, the class with the highest SNF is bookshop; its images typically depict a whole indoor scene that can almost entirely be regarded as foreground. In contrast, images from the class ant usually contain large background regions that are irrelevant to the category semantics, and thus have a low SNF. Note, however, that SNF does not reflect the complexity of the background. For example, although the class vase has a low SNF, the backgrounds of its images are usually simple, whereas the class king crab has a relatively higher SNF but backgrounds that are often cluttered with irrelevant objects, making it more difficult to recognize.
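Assuming the foreground area of each image is measured by a bounding box (the exact measurement used here is not specified), the SNF ratio of a class can be computed as:

```python
def snf_ratio(boxes_by_image):
    """Signal-to-Full ratio of one class: mean foreground/full-area ratio.

    boxes_by_image: list of ((bw, bh), (W, H)) pairs giving the foreground
    bounding-box size and the full image size for each image in the class.
    """
    ratios = [(bw * bh) / (W * H) for (bw, bh), (W, H) in boxes_by_image]
    return sum(ratios) / len(ratios)
```

For example, a class whose two images have foregrounds covering 25% and 100% of the frame would get an SNF of 0.625.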

- vs. - | - vs. - | -: Exampler vs. CC
Evaluated on: - | -

class | score | class | score | class | score
- | +3.81 | theater curtain | +7.39 | electric guitar | +17.28
theater curtain | +3.47 | mixing bowl | +7.04 | vase | +10.64
mixing bowl | +1.61 | trifle | +4.33 | ant | +8.88
- | +1.13 | vase | +3.84 | nematode | +7.72
- | +0.12 | ant | +3.62 | cuirass | +4.63
school bus | -0.52 | scoreboard | +2.92 | mixing bowl | +4.30
electric guitar | -0.87 | crate | +1.18 | theater curtain | +3.35
black-footed ferret | -0.91 | nematode | +0.93 | bookshop | +2.27
- | -1.55 | lion | +0.83 | crate | +1.64
- | -1.60 | electric guitar | +0.61 | lion | +1.46
- | -2.04 | hourglass | -0.57 | African hunting dog | +1.42
- | -3.12 | black-footed ferret | -0.69 | trifle | +1.13
African hunting dog | -3.99 | school bus | -0.86 | scoreboard | +1.06
- | -4.05 | king crab | -1.95 | school bus | +0.73
king crab | -4.44 | bookshop | -2.25 | hourglass | -0.94
- | -5.32 | cuirass | -2.54 | dalmatian | -1.66
- | -5.50 | golden retriever | -3.18 | malamute | -2.39
- | -9.71 | African hunting dog | -3.84 | king crab | -2.48
golden retriever | -10.27 | dalmatian | -3.90 | golden retriever | -3.13
- | -12.00 | malamute | -5.72 | black-footed ferret | -5.81

Table 3: Comparison of class-wise evaluation performance. The first row shows the pretraining datasets on which the compared models are trained; the second row shows the dataset we evaluate on. Each score denotes the per-class difference in average accuracy, e.g. a vs. b: (performance of a) - (performance of b).

D.1 Domain Shift

We first analyse the domain shift of few-shot models pretrained on - and evaluated on -. The first column in Tab. 3 displays the class-wise performance difference between CC pretrained on - and -. It can be seen that the worst-performing classes of the model pretrained on - are those with a low SNF and complex backgrounds. This indicates that the model pretrained on - fails to recognise objects that only take up a small space, because it has never seen such images during pretraining.

D.2 Shape Bias and View-Point Invariance of Contrastive Learning

The third column of Tab. 3 shows the class-wise performance difference between Exampler and CC evaluated on -. We first look at the classes on which contrastive learning performs much better than CC: electric guitar, vase, ant, nematode, cuirass and mixing bowl. What they have in common is that instances within each class share a similar shape. Take vase as an example: all vases have a similar shape but various textures. Geirhos et al. (2019) point out that CNNs are strongly biased towards recognising textures rather than shapes, which differs from human perception and is harmful for some downstream tasks. We therefore speculate that one reason contrastive learning outperforms supervised models in some respects is that it relies more on shape information to recognise objects. To verify this in a simple way, we hand-draw the shapes of some examples from the evaluation dataset, see Fig. 11. We then calculate the similarity between the features of each original image and its shape image using different feature extractors. The results are shown in Fig. 11. As we can see, Exampler recognises objects based on shape information more than the other two supervised methods. We note that this is more a conjecture than an assertion, and the validation is not fully conclusive. We leave a deeper exploration of the shape bias of contrastive learning to future work.

Figure 11: Shape similarity test. Each number denotes the feature similarity between the image above and its shape, using the corresponding pretrained feature extractor.

Next, let us look at the classes on which contrastive learning performs relatively poorly: black-footed ferret, golden retriever, king crab and malamute. These classes all refer to animals that look different under different viewpoints. For example, a dog seen from the front and the same dog seen from the side look totally different. The supervised loss pulls all views of one kind of animal closer, equipping the model with the ability to discriminate objects across viewpoints. In contrast, contrastive learning pushes different images apart and only pulls together patches of one and the same image, which share a single viewpoint, so it has no prior of viewpoint invariance. This suggests that contrastive learning could be further improved if viewpoint invariance were injected into the learning process.

D.3 The Similarity between Pretraining Supervised Models with Foreground and Pretraining Models with Contrastive Learning

The second and third columns of Tab. 3 are notably similar. This indicates that supervised models trained on foreground-only images and models trained with contrastive learning capture similar patterns. However, some classes behave quite differently. For instance, the performance gain of contrastive learning over CC on the class electric guitar is much higher than that of CC pretrained on - over -. It would be interesting to investigate what distinguishes the representations learned by contrastive learning from those learned by supervised learning; this could point to further improvements in representation learning.

Appendix E Additional Ablative Studies

In Fig. 12, we show how different values of and influence the performance of our model. and serve as importance factors in SOC that express how strongly we trust the foreground objects found first. As we can see, the performance of our model suffers from either an excessively firm (small values) or an excessively weak (high values) belief. As and approach zero, we put more attention on the first few detected objects, increasing the risk of incorrectly matched foreground objects; as and approach one, all feature weights tend to become equal, losing the emphasis on foreground objects.
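The described behaviour is consistent with, for example, geometrically decaying importance weights over the detected objects. The following sketch is illustrative only; the exact weighting used in SOC may differ from this simplification:

```python
import numpy as np

def object_weights(alpha, n):
    """Geometric importance weights over the first n detected objects.

    As alpha -> 0 the weight concentrates on the earliest detections;
    as alpha -> 1 the weights become uniform, matching the two failure
    modes discussed above (overly firm vs. overly weak belief).
    """
    w = np.array([alpha ** i for i in range(n)], dtype=float)
    return w / w.sum()   # normalise so the weights sum to 1
```

With `alpha = 1.0` every object is weighted equally, while a very small `alpha` puts nearly all weight on the first detection.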

Appendix F Detailed Performance in Sec. 3

We show detailed performance (both 1-shot and 5-shot) in Tab. 4 and Tab. 5. From the tables, we can see that 5-way 1-shot performance follows the same trend as 5-way 5-shot performance which we discuss in the main article.

Appendix G Additional Visualization Results

In Fig. 13-16 we display more visualization results of the COS algorithm on four classes from the pretraining set of miniImageNet. For each image, we show the top 3 out of 30 crops with the highest foreground scores. From the visualization results, we can conclude that: (1) our COS algorithm reliably extracts foreground regions from images, even if the foreground objects are very small or the backgrounds are extremely noisy; (2) when an image contains an object that is similar to the foreground object but comes from a distinct class, our COS algorithm accurately distinguishes them and focuses on the right one, e.g. the last group of pictures in Fig. 14; (3) when multiple instances of the foreground object exist in one picture, our COS algorithm captures them simultaneously, distributing them across different crops, e.g. the last few groups in Fig. 13. Fig. 17 shows additional visualization results of the SOC algorithm. Each small group of images displays one 5-shot example from one class of the evaluation set of miniImageNet. The observations are similar to those discussed in the main article.

Figure 12: The effect of different values of and . The left figure shows the 5-way 1-shot accuracies, while the right figure shows the 5-way 5-shot accuracies with fixed as .
Model | Pretraining dataset | - (1-shot) | - (5-shot) | - (1-shot) | - (5-shot)
CC | - | 45.29 ± 0.27 | 62.73 ± 0.36 | 49.03 ± 0.28 | 66.75 ± 0.15
CC | - | 44.84 ± 0.20 | 60.85 ± 0.32 | 52.22 ± 0.35 | 68.65 ± 0.22
CC | - | 46.02 ± 0.18 | 62.91 ± 0.40 | 51.87 ± 0.39 | 68.98 ± 0.22
PN | - | 40.57 ± 0.32 | 52.74 ± 0.11 | 44.24 ± 0.45 | 56.75 ± 0.34
PN | - | 40.25 ± 0.36 | 53.25 ± 0.33 | 46.93 ± 0.50 | 61.16 ± 0.35
PN | - | 45.25 ± 0.44 | 59.23 ± 0.28 | 50.72 ± 0.43 | 64.96 ± 0.20

Table 4: 5-way few-shot performance of CC and PN with different variants of pretraining and evaluation datasets.
Model | - (1-shot) | - (5-shot) | - (1-shot) | - (5-shot)
CC | 62.67 ± 0.32 | 80.22 ± 0.23 | 66.69 ± 0.32 | 82.86 ± 0.20
Exampler | 61.14 ± 0.14 | 78.13 ± 0.23 | 70.14 ± 0.12 | 85.12 ± 0.21

Table 5: Comparison of 5-way few-shot performance of CC and Exampler pretrained on the full miniImageNet and evaluated on two versions of the evaluation dataset.
Figure 13: Visualization results of the COS algorithm on class house finch.
Figure 14: Visualization results of the COS algorithm on class Saluki.
Figure 15: Visualization results of the COS algorithm on class ladybug.
Figure 16: Visualization results of the COS algorithm on class unicycle.
Figure 17: Additional visualization results of the first step of SOC algorithm. In each group of images, we show a 5-shot example from one class.