Ranking Distance Calibration for Cross-Domain Few-Shot Learning

Recent progress in few-shot learning promotes a more realistic cross-domain setting, where the source and target datasets are from different domains. Due to the domain gap and disjoint label spaces between source and target datasets, their shared knowledge is extremely limited. This encourages us to explore more information in the target domain rather than to overly elaborate training strategies on the source domain as in many existing methods. Hence, we start from a generic representation pre-trained by a cross-entropy loss and a conventional distance-based classifier, along with an image retrieval view, to employ a re-ranking process for calibrating a target distance matrix by discovering the reciprocal k-nearest neighbours within the task. Assuming the pre-trained representation is biased towards the source, we construct a non-linear subspace to minimise task-irrelevant features therewithin while keeping more transferable discriminative information via a hyperbolic tangent transformation. The calibrated distance in this target-aware non-linear subspace is complementary to that in the pre-trained representation. To impose such distance calibration information onto the pre-trained representation, a Kullback-Leibler divergence loss is employed to gradually guide the model towards the calibrated distance-based distribution. Extensive evaluations on eight target domains show that this target ranking calibration process can improve conventional distance-based classifiers in few-shot learning.


1 Introduction

Few-Shot Learning (FSL) promises to allow a machine to learn novel concepts from limited experience, i.e. few novel target data and data-rich source data. Typically, the default FSL setting assumes that the source and target data are in the same domain but belong to different classes. In practice, FSL is required to generalise to different target domains. Cross-Domain Few-Shot Learning (CD-FSL)  [tseng2020cross_fwt, guo2020broader, liang2021boosting_cdfsl, fu2021metamixup] has been studied more recently. In CD-FSL, the target data not only has a different label space but also comes from a different domain than the source data.

Figure 1: An illustration of the ranking distance calibration process in an FSL task. The idea is to first discover likely positive samples (e.g. sample 7) for each instance (e.g. sample 1) and then to calibrate their pairwise distances. This is achieved by mining the reciprocal ranking relations for each instance retrieval task in the target domain so as to expand the k-nearest neighbours set.

It is nontrivial to directly extend the general FSL approach to address the CD-FSL challenges. In fact, many promising FSL methods [snell2017prototypical, finn2017model, sung2018learning, satorras2018fewgnn] perform poorly in CD-FSL [guo2020broader, tseng2020cross_fwt]. The central idea of these general FSL methods is to transfer and generalise the visual representations learned from source data to target data. However, the significant visual domain gap between the source and target data in CD-FSL makes it fundamentally difficult to learn a shared visual representation across different domains.

A few recent CD-FSL studies [tseng2020cross_fwt, liu2020urt, liang2021boosting_cdfsl, wang2021cross_ata] try to learn a generalisable feature extractor to improve model transferability, a popular idea in domain generalisation and domain adaptation [volpi2018generalizing_dataaug, NEURIPS2020_adver_dataaug, zhou2021dg_mixstyle, li2021simple, wu2021striking], where the source and target domains share the same label space. Empirically, this approach shows some improvement on CD-FSL, but it does not model any visual and label characteristics of the target domain and, more importantly, their cross-domain impact on the pre-trained source domain representation. We argue that this cross-domain mapping between the source domain representation and its interpretation in the context of the target domain data characteristics is essential for effective CD-FSL. From a related perspective, other CD-FSL studies have considered fine-tuning the source domain feature representation by augmenting additional support data in the target domain, e.g. either explicitly augmenting the support data by adversarial training [wang2021cross_ata] or image transformations [guo2020broader], or implicitly augmenting the support data by training an auto-encoder [liang2021boosting_cdfsl]. However, these methods for CD-FSL are straightforward data-augmentation methods for increasing training data in target domain model fine-tuning, without considering how to quantify the cross-domain relevance of the pre-trained source domain representation.

In this work, we consider an alternative approach with a new perspective that treats cross-domain few-shot learning as an image retrieval task. We wish to optimise model adaptation by leveraging the target domain retrieval task context, that is, not only the labelled support data but also the unlabelled query data. To that end, we use a generic representation pre-trained by a cross-entropy loss and a simple distance-based classifier as a baseline, then employ a k-reciprocal neighbour discovery (as in Fig. 1) and encoding process to calibrate pairwise distances between each unlabelled query image and its likely matches. Our idea is both orthogonal and complementary to other generalisable model learning methods [guo2020broader, liu2020feature, liang2021boosting_cdfsl]. It can be flexibly used with or without fine-tuning based model learning.

Generally, the distance matrix for a CD-FSL task contains many incorrect results, as this distance is built on a potentially biased pre-trained source domain representational space. To calibrate this distance matrix towards the target domain, so as to reduce its bias to the source domain, we explore the re-ranking concept in the target domain by considering CD-FSL optimisation as re-ranking in a retrieval task given few shots as anchor points. As in Fig. 1, re-ranking first computes a k-nearest neighbour ranking list. This is further expanded by discovering the k-reciprocal nearest neighbours in the target domain. The expanded ranking list is used for re-computing a Jaccard distance to measure the difference between the original ranking list and the expanded ranking list, achieving a more robust and accurate distance matrix.

Critically, a pre-trained representation from the source domain is biased and generalises poorly across domains in CD-FSL. The reason is that conventional FSL methods implicitly assume mostly linear transformations between the source and target data, as they are sampled from the same domain. This assumption becomes invalid in CD-FSL, where the transformations between source and target domains are mostly nonlinear. To address this problem, we propose a task-aware subspace mapping to minimise transferring task-irrelevant representational information from the source domain. In particular, instead of mapping to a linear projection space, we explore a hyperbolic tangent function to project the source domain representation to a non-linear space. Compared to the linear Euclidean space, this non-linear space performs a dimensionality reduction to optimise the retention of transferable information from the source to the target domain. Moreover, we explore the idea of re-ranking to calibrate and align two distance matrices in two representational spaces: the original pre-trained source domain linear space and the new non-linear subspace. The calibrated matrices are combined to construct a single distance matrix for the target domain in CD-FSL. We call this Ranking Distance Calibration (RDC).

To impose the above distance calibration into the representational space transform, we approximate the distance matrices by their corresponding distributions, and then a Kullback-Leibler (KL) divergence loss function is optimised for iteratively mapping the original distance distribution from the source domain towards the calibrated space. This provides an additional RDC Fine-Tuning (RDC-FT) model optimisation.

Our contributions from this work are three-fold: (1) To transform the biased distance matrix in the source domain representational space towards the target domain in CD-FSL, we use a re-ranking method to re-compute a Jaccard distance for distance calibration by discovering the reciprocal nearest neighbours within the task. We call this Ranking Distance Calibration. (2) We propose a non-linear subspace to shadow the pre-trained source domain representational space. This is designed to model any inherent non-linear transform in CD-FSL and is used to facilitate the distance calibration process between the source and target domains. By explicitly modelling this nonlinearity, we formulate a more robust and generalisable Ranking Distance Calibration (RDC) model for CD-FSL. (3) We further impose RDC as a constraint on the model optimisation process. This is achieved by RDC with Fine-Tuning (RDC-FT), which iteratively maps the original source domain distance distribution to a calibrated target domain distance distribution for more stable and improved CD-FSL.

We evaluated the proposed RDC and RDC-FT methods for CD-FSL on eight target domains. The results show that RDC can notably improve the conventional distance-based classifier, and RDC-FT can improve the representation for the target domain to achieve competitive or better performance than state-of-the-art CD-FSL models.

2 Related Work

Few-shot learning The approaches for general FSL can be broadly divided into two categories: optimisation-based methods [finn2017model, ravi2016optimization, oh2021boil], which learn a generalisable model initialisation and then adapt the model on a novel task with limited labelled data, and metric learning methods [snell2017prototypical, sung2018learning, zhang2020deepemd, li2021plain], which meta-learn a discriminative embedding space where the samples in a novel task can be well classified by a common or learned distance metric. Recently, several studies [chen2019closer, tian2020rethinking, li2020few_self] have shown that a simple pre-training method followed by a fine-tuning stage can achieve competitive or better performance than the metric learning methods. This observation also seems to hold in CD-FSL [guo2020broader].

Cross-domain few-shot learning The problem of CD-FSL was preliminarily studied in FSL [chen2019closer, tian2020rethinking, pan2021mfl]; then [tseng2020cross_fwt, guo2020broader] expanded this setup and proposed two benchmarks that train a model on a single source domain and then generalise it to other domains. Some CD-FSL works focus on learning a generalisable model by explicit or implicit data augmentation [tseng2020cross_fwt, wang2021cross_ata, liang2021boosting_cdfsl]. These approaches improve model generalisation but easily yield ambiguous optimisation results since they ignore the adaptation process for the target domain. Other methods [phoo2021STARTUP, das2021distractor_cdfsl, fu2021metamixup] target adaptation on the target domain by leveraging additional unlabelled data [phoo2021STARTUP], labelled data [fu2021metamixup] or the base data [das2021distractor_cdfsl]. In practice, the additional data can help model adaptation on the target domain, but this information is not easy to obtain. In this work, we address the CD-FSL problem from an image retrieval view and mine the intra-task information to guide a ranking distance calibration process.

Figure 2: An overview of the proposed ranking distance calibration pipeline. $f_\theta$ is the feature extractor pre-trained on the source dataset with a standard cross-entropy loss. Our RDC method contains two parts: 1) a ranking process in the original space to calibrate $D$ to $\hat{D}$, and 2) a ranking process in a non-linear subspace to calibrate $D^{sub}$ to $\hat{D}^{sub}$. The proposed RDC-FT method uses a KL loss between the softened distributions of $D$ and $D^{RDC}$ to fine-tune $f_\theta$. The NPC is used to classify the query data according to the pairwise distances, i.e. the distances calibrated by RDC and the Euclidean distances on the embeddings fine-tuned with RDC-FT, respectively.

Ranking in image retrieval  Image Retrieval (IR) is a classical vision task that aims to search unlabelled gallery data to find the images most relevant to a probe image. Ranking is a classical approach in the IR field [he2004manifold_rank, huang2015cross_retrivel, liu2013pop_rank, loy2013person_rank]. Generally, a ranking method computes a ranking list based on a distance metric, and a series of re-ranking methods have been proposed as post-processing steps to improve the initial ranking result. One typical re-ranking method [zhong2017rerank] used the concept of k-reciprocal nearest neighbours [qin2011hello_kreciprocal] to explore more hard positive samples; the enlarged k-nearest neighbours list is then used to recompute a Jaccard distance as an auxiliary distance matrix. A number of works [liu2018adaptivererank, sarfraz2018posererank] extended this idea to further improve retrieval performance. In this work, we reuse the re-ranking approach to address the CD-FSL problem.

3 Methodology

Problem formulation.  We start by defining a general FSL problem: given a source dataset $\mathcal{D}_s$ and a target dataset $\mathcal{D}_t$, the classes in $\mathcal{D}_s$ and $\mathcal{D}_t$ are disjoint. FSL aims to address an $N$-way $K$-shot classification task in $\mathcal{D}_t$ by leveraging the limited data in $\mathcal{D}_t$ and the prior knowledge learned from $\mathcal{D}_s$, which contains many labelled images. Specifically, an FSL task $\mathcal{T}$ contains a labelled support set $\mathcal{S}$ and an unlabelled query set $\mathcal{Q}$, where the images in $\mathcal{S}$ and $\mathcal{Q}$ are from the same $N$ classes and $K$ / $Q$ are the numbers of images per class in $\mathcal{S}$ / $\mathcal{Q}$. The goal of FSL is to recognise the unlabelled query set when $K$ is small. Notably, $\mathcal{D}_s$ and $\mathcal{D}_t$ in CD-FSL are from different domains. For instance, $\mathcal{D}_s$ is a dataset containing many natural images while $\mathcal{D}_t$ is collected from the remote sensing field.
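To make the task structure concrete, the sketch below (a minimal illustration using a hypothetical `images_by_class` mapping, not the authors' code) shows how an N-way K-shot episode with Q query images per class is typically sampled; only the support labels are exposed to the learner, while the query labels are kept for evaluation.

```python
import random

def sample_task(images_by_class, n_way=5, k_shot=1, q_query=15):
    """Sample an N-way K-shot FSL episode from a labelled target-domain pool."""
    classes = random.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(images_by_class[cls], k_shot + q_query)
        support += [(img, label) for img in imgs[:k_shot]]   # K labelled shots per class
        query += [(img, label) for img in imgs[k_shot:]]     # labels used only for evaluation
    return support, query
```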

Nearest prototype classifier.  Before developing our main method, we define the prototype classifier used here. Given a feature extractor $f_\theta$, we can extract the embedding $z_i = f_\theta(x_i)$ for an image $x_i$ in an FSL task $\mathcal{T}$. The Nearest Prototype Classifier (NPC) first computes the prototypes for the $N$ classes, where the prototype $c_n$ for class $n$ is:

$$c_n = \frac{1}{K}\sum_{(x_i,\,y_i)\in\mathcal{S},\; y_i=n} f_\theta(x_i) \qquad (1)$$

With the prototypes, the label for a query image $x_j \in \mathcal{Q}$ is assigned by:

$$\hat{y}_j = \arg\min_{n} \; d\big(f_\theta(x_j),\, c_n\big) \qquad (2)$$

where $d(\cdot,\cdot)$ is a distance metric, the Euclidean distance in this work, and $d\big(f_\theta(x_j), c_n\big)$ is the distance between the query embedding and prototype $c_n$.
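A minimal NumPy sketch of the NPC classifier in Eq.(1)-(2); embeddings are assumed to have been extracted already, and the variable names are ours.

```python
import numpy as np

def npc_classify(support_feats, support_labels, query_feats, n_way):
    """Eq.(1): class prototypes as mean support embeddings; Eq.(2): assign each
    query to the class of its nearest prototype (Euclidean distance).
    support_labels is an integer array of class indices in [0, n_way)."""
    prototypes = np.stack([support_feats[support_labels == c].mean(axis=0)
                           for c in range(n_way)])                       # Eq.(1)
    dists = np.linalg.norm(query_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)                                          # Eq.(2)
```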

Overview. The key insight of this paper is to formulate FSL as an Image Retrieval (IR) task, sharing the same angle as [triantafillou2017fewretrivel], where the authors propose to optimise the mean average precision for FSL. Furthermore, our view of "FSL as IR" also emphasises the importance of maximally leveraging all the available information in this low-data regime while focusing on the calibration of pairwise distances in FSL. In particular, this work follows this view for FSL, considers each sample in an FSL task as the probe data in IR, and treats the whole FSL task data as the gallery data. To this end, we propose a ranking distance calibration process for CD-FSL, and our key methodology is to repurpose re-ranking to find the relevant images within the FSL task for a given image. We give an overview of the proposed method in Fig. 2.

3.1 Ranking Distance Calibration (RDC)

Motivation.  Previous works [zhong2017rerank, qin2011hello_kreciprocal] have suggested that discovering the k-reciprocal nearest neighbours within the gallery data can benefit the re-ranking result for image retrieval. This observation encourages us, when considering few-shot learning as image retrieval, to reuse this k-reciprocal nearest neighbour discovery process to calibrate the pairwise distances within an FSL task. We give an intuitive example in Fig. 1: the relative relationships among the samples are mined by k-reciprocal nearest neighbours to cluster the potential positive instances of the same class for re-weighting the pairwise distances. In this manner, the pairwise distances in the FSL task can be encoded into a new distance matrix.

Here, we briefly describe the re-ranking process for our ranking distance calibration and detail the Jaccard distance computation in Sec. D of the supplementary material. For an FSL task, we start by computing an original pairwise Euclidean distance matrix:

$$D_{i,j} = \Big\lVert \tfrac{z_i}{\lVert z_i\rVert_2} - \tfrac{z_j}{\lVert z_j\rVert_2} \Big\rVert_2 \qquad (3)$$

$D_{i,j}$ is the Euclidean distance between samples $i$ and $j$, computed on $\ell_2$-normalised embeddings. Referring to $D$, we can obtain the $k$-nearest neighbours set $\mathcal{N}(i,k)$ of each sample $i$. The re-ranking idea [zhong2017rerank] is to expand $\mathcal{N}(i,k)$ by discovering more hard-positive samples for $i$. The expansion process for $\mathcal{N}(i,k)$ is guided by a $k$-reciprocal nearest neighbours algorithm [qin2011hello_kreciprocal], and the expanded ranking list $\mathcal{R}^*(i,k)$ is used to estimate a calibrated distance matrix.

$k$-reciprocal discovery and encoding.  The principle of $k$-reciprocal nearest neighbour discovery is that if $j$ is in $\mathcal{N}(i,k)$, then $i$ should also occur in $\mathcal{N}(j,k)$ [qin2011hello_kreciprocal]. With this assumption, $\mathcal{N}(i,k)$ can be refined into the $k$-reciprocal set $\mathcal{R}(i,k)$ by:

$$\mathcal{R}(i,k) = \big\{\, j \;\big|\; j \in \mathcal{N}(i,k) \;\wedge\; i \in \mathcal{N}(j,k) \,\big\} \qquad (4)$$

where $j$ indexes the samples in the task and $k$ is the number of neighbours. Moreover, for many-shot cases, we further expand $\mathcal{R}(i,k)$ by uniting it with $\mathcal{R}(i',k)$, where $i$ and $i'$ are both in the support set and of the same class. In this manner, the $k$-nearest neighbours set is finally expanded into $\mathcal{R}^*(i,k)$. To assign larger weights to closer neighbours and smaller weights to farther ones, $\mathcal{R}^*(i,k)$ is further used to encode the pairwise distances of sample $i$ into a vector $V_i$, where each element $V_{i,j}$ is defined as the Gaussian kernel of the pairwise distance:

$$V_{i,j} = \begin{cases} \exp\!\big(-D_{i,j}\big), & j \in \mathcal{R}^*(i,k) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

After that, a query expansion strategy is employed to integrate the most-likely samples and update the encoding of sample $i$ by averaging: $V_i \leftarrow \frac{1}{|\mathcal{N}(i,k_2)|}\sum_{j\in\mathcal{N}(i,k_2)} V_j$, where $k_2$ is the number of samples used for the update.
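The sketch below illustrates the k-reciprocal discovery and Gaussian-kernel encoding of Eq.(4)-(5), followed by query expansion. It follows the standard re-ranking recipe of [zhong2017rerank] under our reading of this section (the support-set union used for many-shot tasks is omitted), so details may differ from the authors' implementation.

```python
import numpy as np

def k_reciprocal_encode(D, k=8, k2=3):
    """D: (M, M) pairwise distance matrix over all samples in the task.
    Returns V: (M, M) Gaussian-kernel encodings of the expanded neighbour sets."""
    M = D.shape[0]
    knn = np.argsort(D, axis=1)[:, :k]              # N(i, k): k nearest neighbours of each sample
    V = np.zeros_like(D)
    for i in range(M):
        recip = [j for j in knn[i] if i in knn[j]]  # Eq.(4): mutual (k-reciprocal) neighbours
        V[i, recip] = np.exp(-D[i, recip])          # Eq.(5): Gaussian kernel of the distance
    # query expansion: average the encodings of each sample's k2 nearest neighbours
    V = np.stack([V[knn[i, :k2]].mean(axis=0) for i in range(M)])
    return V
```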

Jaccard distance. Referring to [bai2016sparse, zhong2017rerank], the expanded ranking list is used as contextual knowledge to compute a Jaccard distance matrix by:

$$D^{J}_{i,j} = 1 - \frac{\big|\mathcal{R}^*(i,k) \cap \mathcal{R}^*(j,k)\big|}{\big|\mathcal{R}^*(i,k) \cup \mathcal{R}^*(j,k)\big|} \qquad (6)$$

Following the re-weighting method in [zhong2017rerank], the numbers of candidates in the intersection and union sets can be calculated as $\lVert \min(V_i, V_j)\rVert_1$ and $\lVert \max(V_i, V_j)\rVert_1$, where $\min$ and $\max$ operate element-wise on the two input vectors and $\lVert\cdot\rVert_1$ is the $\ell_1$ norm. Then the Jaccard distance in Eq.(6) can be reformulated as:

$$D^{J}_{i,j} = 1 - \frac{\lVert \min(V_i,\, V_j) \rVert_1}{\lVert \max(V_i,\, V_j) \rVert_1} \qquad (7)$$

Distance calibration.  The Jaccard distance $D^{J}$ is then used to calibrate the original distance matrix $D$ by a weighting strategy:

$$\hat{D} = \lambda D + (1-\lambda)\, D^{J} \qquad (8)$$

where $\lambda$ is a trade-off scalar balancing the two matrices.
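Given the encodings V from the previous sketch, the re-weighted Jaccard distance of Eq.(7) and the calibration of Eq.(8) can be written as follows; the value passed as `lam` is illustrative.

```python
import numpy as np

def jaccard_distance(V):
    """Eq.(7): element-wise min/max of the encoding vectors approximate the sizes of
    the intersection and union of the expanded neighbour sets."""
    inter = np.minimum(V[:, None, :], V[None, :, :]).sum(-1)   # ||min(V_i, V_j)||_1
    union = np.maximum(V[:, None, :], V[None, :, :]).sum(-1)   # ||max(V_i, V_j)||_1
    return 1.0 - inter / np.maximum(union, 1e-12)

def calibrate(D, D_jaccard, lam=0.3):
    """Eq.(8): weighted combination of the original and Jaccard distance matrices."""
    return lam * D + (1.0 - lam) * D_jaccard
```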

3.2 RDC in Task-adaptive Non-linear Subspace

To further bridge the domain gap, we propose to further improve RDC in a non-linear subspace. In particular, we tailor a discriminative subspace to help calibrate the ranking in our CD-FSL task. The subspace is built upon Principal Component Analysis (PCA) to extract crucial features from the original space. Specifically, given the feature representations $Z$ of the task samples, we have

$$\tilde{Z} = Z\, W \qquad (9)$$

where $W \in \mathbb{R}^{m\times d}$ is a transformation matrix mapping a feature with $m$ dimensions to a reduced feature with $d$ dimensions.

Hyperbolic tangent transformation. Generally, the PCA method can be applied directly to the original embedding space. However, the original representation is scattered due to the biased and less discriminative embedding; thus the dimensionality reduction easily causes information loss. To mitigate this issue, we consider transforming the original representations into a compact and representative non-linear space. Using the idea of kernels, we adopt a hyperbolic tangent function to construct a task-adaptive non-linear subspace. Our non-linear PCA method first computes a feature-toward kernel function by:

$$K = \tanh\!\big(Z^{\top} Z\big) \qquad (10)$$

Then we use Singular Value Decomposition (SVD) to compute the eigenvectors of $K$ and select those associated with the $d$ largest eigenvalues, forming the transformation matrix $W$. To this end, a task-adaptive non-linear subspace is constructed by Eq.(9) and Eq.(10).

Complementary distance calibration.  Our distance calibration process can be applied in the original linear embedding space (Sec. 3.1) and in the non-linear subspace (Sec. 3.2). The original space has higher dimensionality and contains the full information, but is also disturbed by noisy task-irrelevant features, while the non-linear subspace reduces some task-irrelevant signal at the cost of losing some information. To this end, our RDC method co-leverages the calibrated distances in the two spaces, averaging them into a robust and complementary distance matrix $D^{RDC}$. The computing process of the RDC method is given in lines 4-8 of Alg. 1.
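A sketch of the task-adaptive non-linear subspace under our reading of Eq.(9)-(10): the tanh kernel is computed over the (L2-normalised) task embeddings, its leading eigenvectors form the projection W, and the calibration of Sec. 3.1 is then run in both spaces before the two calibrated matrices are averaged into D_RDC (Alg. 1, lines 4-8). Kernel scaling and centering details are assumptions.

```python
import numpy as np

def tanh_subspace(Z, d=64):
    """Z: (M, m) task embeddings. Eq.(10): K = tanh(Z^T Z); Eq.(9): project onto the
    top-d eigenvectors of K, giving a reduced (M, d) task-adaptive representation."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # assume L2-normalised embeddings, as in Eq.(3)
    K = np.tanh(Z.T @ Z)                               # non-linear, task-adaptive kernel, (m, m)
    U, _, _ = np.linalg.svd(K)                         # SVD of the symmetric kernel matrix
    W = U[:, :d]                                       # transformation matrix, (m, d)
    return Z @ W                                       # Eq.(9): reduced representation

# Complementary calibration (Alg. 1, lines 4-8): apply the ranking calibration to both
# the original embeddings Z and to tanh_subspace(Z), then average the two calibrated
# matrices to obtain D_RDC.
```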

Remark. As the hyperbolic non-linear space has a larger capacity than the Euclidean space [khrulkov2020hyperbolic, fang2021kernel_hyperbolic, yan2021unsupervised_hyperbolic], it can alleviate the information loss caused by the dimensionality reduction. Therefore, we use a hyperbolic tangent transformation to map the source domain linear space to a non-linear space. We note that subspace learning has been preliminarily explored in FSL [yoon2019tapnet, simon2020adaptive] to learn task-adaptive or class-adaptive subspaces. Critically, our subspace construction method is different from [yoon2019tapnet, simon2020adaptive], and our method does not need a sophisticated episodic training process.

3.3 Fine-tuning with RDC

As $D^{RDC}$ provides a more robust and discriminative distance matrix, it is natural to ask whether this type of calibration knowledge can be used to optimise the feature extractor. To achieve this, we fine-tune the feature extractor by iteratively mapping the original distance distribution to the calibrated distance distribution, forming the RDC with Fine-Tuning (RDC-FT) method as in Alg. 1.

Data: pre-trained feature extractor $f_\theta$; support set $\mathcal{S}$; query set $\mathcal{Q}$; RDC hyper-parameters ($k$, $k_2$, $\lambda$, $d$); fine-tuning hyper-parameters (epochs $T$, learning rate $\gamma$, temperature $\tau$).
Result: Fine-tuned feature extractor $f_\theta$.
/* RDC-FT: optimise $f_\theta$ by RDC */
1  Initialise $f_\theta$ ;
2  for epoch in $1, \dots, T$ do
3      Extract embeddings for $\mathcal{S}$ and $\mathcal{Q}$ by $f_\theta$ ;
       /* RDC: distance calibration */
4      Compute original distances $D$ by Eq.(3) ;
5      Compute calibrated distances $\hat{D}$ by Eq.(8) ;
6      Construct a non-linear subspace by Eq.(9, 10) ;
7      Calibrate the distances $\hat{D}^{sub}$ in the subspace by Eq.(8) ;
8      Compute $D^{RDC} = \big(\hat{D} + \hat{D}^{sub}\big)/2$ ;
       /* FT: optimise $f_\theta$ with $D^{RDC}$ */
9      Get the softened distributions of $\tilde{D}$ / $\tilde{D}^{RDC}$ ;
10     Compute the KL loss by Eq.(12) ;
11     Update $f_\theta$ with learning rate $\gamma$ ;
12 end for
Algorithm 1 Ranking Distance Calibration with Fine-Tuning (RDC-FT)

Expanded $k$-reciprocal list as attention. As in Eq.(6), the expanded ranking list $\mathcal{R}^*$ is used to re-compute the pairwise distances. The calibrated pairwise distances within $\mathcal{R}^*$ are more robust than those outside it. Thus $\mathcal{R}^*$ can naturally be used to build an attention mask $A$. In particular, each entry $A_{i,j}$ is computed by

$$A_{i,j} = \begin{cases} 1 + \alpha, & j \in \mathcal{R}^*(i,k) \\ 1, & \text{otherwise} \end{cases} \qquad (11)$$

where $\alpha$ is an attention scalar. During the fine-tuning process, $A$ is used to re-weight the distance matrices $D$ and $D^{RDC}$ into $\tilde{D}$ and $\tilde{D}^{RDC}$, respectively.

Choices of loss functions.  To achieve the distance distribution alignment, Mean Squared Error (MSE) loss and Kullback-Leibler (KL) divergence loss are the two candidates. The MSE loss directly regresses towards the target distances, while the KL divergence loss focuses on distribution matching [kim2021comparing]. As the KL loss learns this mapping in a softened way, it is better suited to embedding the calibration knowledge into the representation. Here we use:

$$\mathcal{L}_{KL} = \sum_{i} \mathrm{KL}\Big( p\big(\tilde{D}^{RDC}_{i}\big) \,\Big\|\, p\big(\tilde{D}_{i}\big) \Big) \qquad (12)$$

where $\tau$ is the temperature-scaling hyper-parameter, and $p(\tilde{D}^{RDC}_i)$ and $p(\tilde{D}_i)$ are the softened distributions of the re-weighted distance matrices $\tilde{D}^{RDC}$ and $\tilde{D}$. Given a row vector $\tilde{D}_i$ of a distance matrix, the softened distribution is denoted by $p(\tilde{D}_i)_j = \exp(-\tilde{D}_{i,j}/\tau) \,/\, \sum_{l}\exp(-\tilde{D}_{i,l}/\tau)$, where $\tilde{D}_{i,j}$ is the $j$-th value of $\tilde{D}_i$.
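A PyTorch sketch of the fine-tuning objective under our reading of Eq.(11)-(12): the attention mask up-weights entries inside the expanded neighbour list, rows are softened with a temperature-scaled softmax over negative distances, and the calibrated distribution serves as a fixed target. The default values of `alpha` and `tau` are illustrative, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def rdc_ft_loss(D, D_rdc, in_expanded_list, alpha=0.5, tau=4.0):
    """D: (M, M) distances in the current embedding (differentiable).
    D_rdc: (M, M) calibrated RDC distances (treated as a fixed target).
    in_expanded_list: (M, M) boolean mask, True where j is in R*(i, k)."""
    A = 1.0 + alpha * in_expanded_list.float()     # Eq.(11), our reading: up-weight R* entries
    D_t, D_r = A * D, A * D_rdc                    # re-weighted distance matrices
    log_p = F.log_softmax(-D_t / tau, dim=1)       # softened rows of the embedding distances
    q = F.softmax(-D_r.detach() / tau, dim=1)      # softened target rows from the RDC distances
    return F.kl_div(log_p, q, reduction="batchmean")   # Eq.(12)
```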

Method 5-way 1-shot
CUB Cars Places Plantae CropDisease EuroSAT ISIC ChestX Ave.
ProtoNet [snell2017prototypical] 38.66±0.4 31.34±0.3 47.89±0.5 31.75±0.4 51.22±0.5 52.93±0.5 29.20±0.3 21.57±0.2 38.07
NPC 37.72±0.4 32.66±0.3 41.08±0.4 33.37±0.4 63.06±0.5 53.95±0.5 29.39±0.3 22.47±0.2 39.21
NPC+norm 43.88±0.4 35.91±0.4 48.16±0.4 38.61±0.5 66.53±0.5 63.15±0.5 31.34±0.3 22.49±0.2 43.76
RDC (ours) 47.77±0.5 38.74±0.5 58.82±0.5 41.88±0.5 80.88±0.5 67.58±0.5 32.29±0.3 22.66±0.2 48.83
Method 5-way 5-shot
CUB Cars Places Plantae CropDisease EuroSAT ISIC ChestX Ave.
ProtoNet [snell2017prototypical] 57.55±0.4 43.98±0.4 68.05±0.4 46.18±0.4 79.98±0.3 75.36±0.4 39.98±0.3 24.19±0.2 54.41
NPC 59.36±0.4 51.28±0.4 67.61±0.4 51.41±0.4 86.37±0.3 75.94±0.4 38.72±0.3 25.55±0.2 57.03
NPC+norm 62.00±0.4 52.14±0.4 70.18±0.4 53.87±0.4 87.39±0.3 78.81±0.4 40.75±0.3 25.98±0.2 58.89
RDC (ours) 63.39±0.4 52.75±0.4 72.83±0.4 55.30±0.4 88.03±0.3 79.12±0.4 42.10±0.3 25.10±0.2 59.83
Table 1: Comparisons with the baselines using an NPC classifier w/o fine-tuning. The classification accuracies on 8 datasets with ResNet10 as the backbone. Our RDC method exploits the full data of the FSL task with an NPC classifier. Bold: the best scores.

4 Experiments

Dataset.  Following the benchmarks in [wang2021cross_ata, liang2021boosting_cdfsl], we used miniImageNet as the source domain and another eight datasets, i.e. CUB, Cars, Places, Plantae, CropDisease, EuroSAT, ISIC and ChestX, as target domains. Specifically, miniImageNet [vinyals2016matching] is a subset of ILSVRC-2012. CUB, Cars, Places and Plantae are the target domains proposed in [tseng2020cross_fwt] for evaluation on natural image domains, while CropDisease, EuroSAT, ISIC and ChestX are four domains proposed in [guo2020broader] for generalising the model to domains with different visual characteristics. For all experiments, we resized all images to 224×224 pixels and used the data augmentations in [wang2021cross_ata, tseng2020cross_fwt] as image transformations.

Evaluation protocol.  We followed the evaluation protocol in [wang2021cross_ata] to evaluate our method on CD-FSL. Specifically, for each target domain, we randomly selected 2000 FSL tasks, and each task contains 5 different classes. Each class has 1/5 labelled support images and 15 additional unlabelled images for evaluating the performance, formulating the 5-way 1/5-shot CD-FSL problem.

In all experiments, we reported the mean classification accuracy as well as the 95% confidence interval on the query set of each domain. For comprehensive comparison, we listed the average accuracy (shown as Ave. in Tabs. 1, 2 and 5) of the 8 domains.

Implementation details.  Following previous works [tseng2020cross_fwt, wang2021cross_ata, guo2020broader], we used a ResNet10 as the feature extractor. Further, we used the same hyper-parameters for the experiments on all domains to fairly validate the generalisation ability. Specifically, the feature extractor is pre-trained for 400 epochs on the base classes of miniImageNet with an Adam optimizer, a learning rate of 0.001 and a batch size of 64. For our RDC method, we use fixed values of the neighbourhood size $k$, the expansion size $k_2$ and the trade-off scalar $\lambda$, and a fixed reduced dimension $d$ for the non-linear subspace. For the fine-tuning stage in RDC-FT, we fix the attention scalar $\alpha$, the temperature $\tau$ and the number of epochs $T$, training with an Adam optimizer and a learning rate of 0.001.

4.1 Comparison with baselines

As our methods are based on a simple NPC classifier, we start by comparing our RDC method with baseline methods that also use an NPC classifier and do not require fine-tuning on a target domain. These baselines are: NPC, which uses an NPC classifier on the pre-trained embedding; NPC+norm, which applies an NPC classifier to normalised feature embeddings; and ProtoNet [snell2017prototypical], which meta-learns a task-agnostic NPC classifier on miniImageNet. The results in Tab. 1 show that RDC largely outperforms these baselines, boosting the simple NPC classifier into a strong one. In particular, the performance on 1-shot learning is improved notably, with clear increases in the Ave. accuracy over the baselines. This observation indicates that RDC effectively calibrates the distances by fully leveraging the task information. We also note that the improvement on 5-shot is not as large as that on 1-shot. The reason is that the prototypes for the NPC classifier are more robust under the many-shot setting, so the original distances are less biased and the calibration process improves less when the embedding is fixed. This limitation can be mitigated by the fine-tuning stage of our RDC-FT method.

Method 5-way 1-shot
CUB Cars Places Plantae CropDisease EuroSAT ISIC ChestX Ave.
GNN+FT [tseng2020cross_fwt] 45.50±0.5 32.25±0.4 53.44±0.5 32.56±0.4 60.74±0.5 55.53±0.5 30.22±0.3 22.00±0.2 41.53
GNN+LRP [sun2021explanationcdfsl] 43.89±0.5 31.46±0.4 52.28±0.5 33.20±0.4 59.23±0.5 54.99±0.5 30.94±0.3 22.11±0.2 41.01
TPN+ATA [wang2021cross_ata] 50.26±0.5 34.18±0.4 57.03±0.5 39.83±0.4 77.82±0.5 65.94±0.5 34.70±0.4 21.67±0.2 47.68
Fine-tuning [guo2020broader] 43.53±0.4 35.12±0.4 50.57±0.4 38.77±0.4 73.43±0.5 66.17±0.5 34.60±0.3 22.13±0.2 45.54
ConFT [das2021distractor_cdfsl] 45.57±0.8 39.11±0.7 49.97±0.8 43.09±0.8 69.71±0.9 64.79±0.8 34.47±0.6 23.31±0.4 46.25
RDC-FT (ours) 50.09±0.5 39.04±0.5 61.17±0.6 41.30±0.6 85.79±0.5 70.51±0.5 36.28±0.4 22.32±0.2 50.82
Method 5-way 5-shot
CUB Cars Places Plantae CropDisease EuroSAT ISIC ChestX Ave.
GNN+FT [tseng2020cross_fwt] 64.97±0.5 46.19±0.4 70.70±0.5 49.66±0.4 87.07±0.4 78.02±0.4 40.87±0.4 24.28±0.2 57.72
GNN+LRP [sun2021explanationcdfsl] 62.86±0.5 46.07±0.4 71.38±0.5 50.31±0.4 86.15±0.4 77.14±0.4 44.14±0.4 24.53±0.3 57.82
TPN+ATA* [wang2021cross_ata] 65.31±0.4 46.95±0.4 72.12±0.4 55.08±0.4 88.15±0.5 79.47±0.3 45.83±0.3 23.60±0.2 59.57
Fine-tuning [guo2020broader] 63.76±0.4 51.21±0.4 70.68±0.4 56.45±0.4 89.84±0.3 81.59±0.3 49.51±0.3 25.37±0.2 61.06
ConFT [das2021distractor_cdfsl] 70.53±0.7 61.53±0.7 72.09±0.7 62.54±0.7 90.90±0.6 81.52±0.6 50.79±0.6 27.50±0.5 64.68
NSAE(CE+CE) [liang2021boosting_cdfsl] 68.51±0.8 54.91±0.7 71.02±0.7 59.55±0.8 93.14±0.5 83.96±0.6 54.05±0.6 27.10±0.4 64.03
RDC-FT (ours) 67.23±0.4 53.49±0.5 74.91±0.4 57.47±0.4 93.30±0.3 84.29±0.3 49.91±0.3 25.07±0.2 63.21
Table 2: Comparisons with SoTA methods. The 5-way 1/5-shot classification accuracies on 8 domains with ResNet10 as the backbone. Part of the results are those reported in [wang2021cross_ata]. Fine-tuning based methods use a fine-tuning stage on the target domain. * means exploiting the full data of the FSL task. The ConFT results are from our implementation with the official code of [das2021distractor_cdfsl]. Bold: the best scores.

4.2 Comparison with state-of-the-art methods

We further compare our RDC-FT method with State-of-The-Art (SoTA) methods: 1) meta-learners: GNN+FT [tseng2020cross_fwt], which meta-trains a GNN [satorras2018fewgnn] model with an additional feature transformation layer; GNN+LRP [sun2021explanationcdfsl], which uses Layer-wise Relevance Propagation to guide the GNN model training; and TPN+ATA [wang2021cross_ata], which meta-learns TPN [liu2019fewTPN] with Adversarial Task Augmentation. 2) fine-tuning methods: a general Fine-tuning [guo2020broader] method; ConFT [das2021distractor_cdfsl], which fine-tunes the model by reusing the base classes; and NSAE [liang2021boosting_cdfsl], which pre-trains and fine-tunes the model with an additional autoencoder task to improve model generalisation. From Tab. 2, we observe that our RDC-FT method is superior to the SoTA methods on 1-shot learning and competitive with them on 5-shot learning. We also notice that our performance is not superior to the ConFT and NSAE methods for 5-shot learning. The reasons behind this are: 1) our method explores the task information in an unsupervised way while these methods focus on fine-tuning with more labelled data, so they benefit considerably from the many-shot setting; 2) ConFT reuses more data from the base classes for model fine-tuning, so similar classes between the source and target domain, e.g. birds and cars, help to build more robust decision boundaries when learning on related target domains, e.g. CUB and Cars, but this approach requires more data and expensive computing resources; 3) NSAE adopts an autoencoder to implicitly augment data to pre-train a generalisable model, and our method is theoretically orthogonal to it for solving the CD-FSL problem.

4.3 Ablation study

Component analysis. To investigate the efficacy of different components in RDC-FT, we ablate the contribution of each element: RDC w/o subspace, RDC (in two spaces), RDC-FT w/o subspace and RDC-FT (in two spaces). As shown in Tab. 3, a simple RDC process without subspace learning, which calibrates the distances only on the pre-trained representation, largely boosts the baseline NPC classifier, i.e. by 8.72% (1.87%) on 1-shot (5-shot). The fine-tuning process, as shown in the results of RDC-FT w/o subspace, enlarges this improvement during the iterative mapping process, achieving 11.15% (5.82%) improvement on 1-shot (5-shot). Interestingly, we observe that the contribution of the subspace for RDC (0.90% on 1-shot) is larger than that for RDC-FT (0.46% on 1-shot). This indicates that the fine-tuning process can gradually alleviate the bias of the pre-trained representation, so the benefit of the subspace becomes smaller in RDC-FT.

Method 5-way 1-shot 5-way 5-shot
Baseline NPC 39.21 57.03
+RDC w/o subspace 47.93 (+8.72%) 58.90 (+1.87%)
+RDC 48.83 (+9.62%) 59.83 (+2.80%)
+RDC-FT w/o subspace 50.36 (+11.15%) 62.85 (+5.82%)
+RDC-FT 50.82 (+11.61%) 63.21 (+6.18%)
Table 3: Component analysis of the proposed RDC-FT method. The results are the average accuracies of the 8 target domains. (+x%) indicates the absolute improvement compared to the NPC baseline.
Method 5-way 1-shot
N/A linear Gaussian Poly. Sigmoid Ours
NPC+ norm 43.76 45.30 45.29 45.23 45.46 46.31
RDC 47.70 47.75 47.15 47.14 47.20 48.42
Method 5-way 5-shot
N/A linear Gaussian Poly. Sigmoid Ours
NPC+ norm 58.89 58.17 58.18 58.09 58.23 59.50
RDC 58.92 58.48 58.49 58.42 58.52 59.68
Table 4: Comparisons of different subspaces. The average accuracies of the 8 target domains using the NPC+norm and RDC methods on the subspaces constructed by KPCA with different kernels. N/A represents the original representation without a subspace.

Comparison of different PCA methods.  We compare our non-linear subspace with the subspaces constructed by Kernel PCA (KPCA) using different kernel types (linear, Gaussian, polynomial and sigmoid). For a fair comparison, the dimensions of all subspaces are set to 64 and we used the default parameters of the KPCA implementation in scikit-learn. Table 4 shows that our non-linear subspace performs better than the KPCA alternatives. Interestingly, without the RDC method, the KPCA methods can largely improve the 1-shot performance (compared to N/A), but they are only on par with the original space once RDC is applied on the different subspaces. In contrast, our non-linear subspace achieves consistent and stable improvement both with and without the RDC method, verifying its robustness and superiority.
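The baseline subspaces in Tab. 4 can be reproduced in spirit with scikit-learn's KernelPCA (default kernel parameters, 64 components), as sketched below; the exact preprocessing used by the authors is an assumption.

```python
from sklearn.decomposition import KernelPCA

def kpca_subspace(Z, kernel="rbf", d=64):
    """Project the task embeddings Z (M x m) onto a d-dimensional KPCA subspace.
    kernel in {'linear', 'rbf' (Gaussian), 'poly', 'sigmoid'}, matching Tab. 4."""
    return KernelPCA(n_components=d, kernel=kernel).fit_transform(Z)
```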

Effect of loss choices. We evaluated the performance of RDC-FT with different loss functions. The results in Fig. 3 show that the compared losses achieve competitive performance on CUB, Cars, ISIC and ChestX, while the KL loss mostly performs better than the MSE loss on Places, Plantae and EuroSAT. These observations suggest the superiority of matching the distance matrices as softened distributions. We conjecture this should be attributed to the softening process, which can alleviate the negative effect of the calibrated distances. Moreover, we note that the performance of the KL loss can be further improved by an attention strategy on the distance matrices, verifying the efficacy of employing the expanded k-nearest neighbours list as an attention reference.

(a) 5-way 1-shot evaluation.
(b) 5-way 5-shot evaluation.
Figure 3: Results of RDC-FT with different loss functions. The evaluation is conducted on 2000 tasks from 8 target domains.
Figure 4: Ranking lists of a 5-way 1-shot task from CUB. The images with red/blue rectangle are the ground-truth support data for a given query. The RDC method calibrates the original ranking list to yield correct recognition results (the images with red rectangle) or closer pairwise distances (the images with blue rectangle).

Visualisation. To qualitatively show the effectiveness of our RDC and RDC-FT methods, we first show a case study of an FSL task from CUB by comparing the original ranking list with the ranking list after RDC. As shown in Fig. 4, for a given query, our RDC method pulls the ground-truth support data closer to the query, arriving at a more accurate position; this is achieved by the calibration process of our RDC method. For the RDC-FT method, we use t-SNE [van2008visualizing] to visualise the feature embeddings of FSL tasks randomly selected from the target domains CUB, CropDisease and EuroSAT. As shown in Fig. 5, the feature representations with RDC-FT (2nd-row plots) have smaller within-class variations and larger class margins compared to those without RDC-FT (1st-row plots), showing that RDC-FT can guide a task-specific embedding where the samples are easily classified by a simple NPC classifier. Moreover, our RDC-FT method, as expected, functions as an implicit clustering process for the FSL task; this can be qualitatively verified by the clustering effect in the 2nd-row plots of Fig. 5.

Figure 5: T-SNE visualisation of 5-way 1-shot tasks. Different colours refer to different classes. We visualise the task features before (the 1st row) and after (the 2nd row) the RDC-FT method.

Incorporation with other methods. As RDC is a post-processing method, it can be flexibly combined with other methods. Here we employed RDC on top of a general data augmentation method [yeh2020large]. The results in Tab. 5 indicate that RDC achieves consistent improvements over the other method, showing its generalisation ability. We cannot currently evaluate our method on [liang2021boosting_cdfsl, liu2020feature] until their code is released.

Method 5-way 1-shot
CUB Cars Places Plantae Crop. Euro. ISIC ChestX
NPC+DA 42.63 36.16 48.04 38.20 71.61 66.79 34.49 22.38
RDC+DA 46.81 38.51 58.00 40.18 81.93 71.76 35.67 22.27
Method 5-way 5-shot
CUB Cars Places Plantae Crop. Euro. ISIC ChestX
NPC+DA 63.63 53.16 69.57 55.06 89.88 81.61 50.25 25.56
RDC+DA 65.39 54.10 72.82 56.46 91.27 81.88 50.91 25.21
Table 5: Incorporating RDC with another method. NPC+DA represents using NPC on the model fine-tuned with the data augmentation in [yeh2020large].

5 Conclusions

In this paper, we proposed a Ranking Distance Calibration (RDC) method to calibrate the biased distances in CD-FSL. The calibration is achieved by a re-ranking method with a k-reciprocal discovery and encoding process. As the pre-trained linear embedding is biased for the target domain, we further proposed a non-linear subspace and a calibration process within it. Our RDC method averages the calibrated distances of the two spaces into a robust distance matrix. Moreover, we introduced an RDC-FT method to fine-tune the embedding with the calibrated distances, yielding a discriminative representation for the CD-FSL task.

Limitation and discussion.  As the image retrieval perspective of our approach discovers the task information in an unsupervised way, a more comprehensive combination of the label information and the task information should be considered, especially in many-shot cases, e.g. 5-shot.

We notice that a concurrent work [xi2021_reranking] also uses a re-ranking process for few-shot learning. Although our work and [xi2021_reranking] share the image retrieval angle on few-shot learning, the key differences are as follows: 1) Our work relies on the k-reciprocal process to calibrate the pairwise distances both in the pre-trained linear space and in a task-adaptive non-linear space, while [xi2021_reranking] uses a graph view to improve the subgraph similarity in the pre-trained space; 2) For the representation optimisation stage, we use the calibrated distance matrix as guidance to fine-tune the pre-trained feature extractor with a Kullback-Leibler divergence loss, whilst [xi2021_reranking] designs a Multi-Layer Perceptron to meta-learn a subgraph similarity refiner and optimises the feature extractor with a Cross Entropy loss. In particular, the meta-learning strategy in [xi2021_reranking] and the fine-tuning strategy in our work are orthogonal, as they focus on learning on the source data and the target data, respectively.

References

Appendix A Overview

In this supplementary material, we present:

  • To validate the robustness of the proposed RDC and RDC-FT methods, we analyse the sensitivities of different hyper-parameters, i.e. the trade-off scalar $\lambda$, the reduced dimension $d$ and the attention scalar $\alpha$ (in Sec. B);

  • To visualise the advantages and limitations of the proposed RDC and RDC-FT methods, we use t-SNE [maaten2008visualizing] to display the classification results (Figure 7 in Sec. C);

  • For better understanding the computing process of the Jaccard distance, we illustrate the algorithm of the Jaccard distance in Sec. D.

  • The notations for all symbols and hyper-parameters used in the main paper are defined (in Sec. E);

Appendix B Sensitivity analysis of the hyper-parameters

In all experiments of the main paper, we reported the results on the 8 target domains with the same hyper-parameters. In practice, our method is robust to the hyper-parameter selection, as shown in Fig. 6. Further, we analyse in depth three key hyper-parameters: the trade-off scalar $\lambda$, the reduced dimension $d$ and the attention scalar $\alpha$.

(a) Sensitivity analysis of $\lambda$.
(b) Sensitivity analysis of $d$.
(c) Sensitivity analysis of $\alpha$.
Figure 6: Analysis of the hyper-parameters $\lambda$, $d$ and $\alpha$. The evaluation results are the average accuracies of the 8 target domains.

B.1 Effect of the trade-off scalar $\lambda$

The trade-off scalar $\lambda$ is used to balance the original distance and the Jaccard distance in the proposed RDC method, and is thus a critical hyper-parameter for RDC. We conducted experiments to test RDC with $\lambda \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ on the pre-trained space.

The results are shown in Tab. 6 and Fig. 6(a), from which we can see that assigning smaller weights to the original distance (smaller $\lambda$) is a better choice for RDC. In particular, the best $\lambda$ for 1-shot is 0.3 while that for 5-shot is 0.5. This indicates that the original distance becomes more robust when the number of shots increases, so the original space should occupy a larger weight in the calibrated distance. Besides, when $\lambda$ is between 0.1 and 0.5, the average accuracies of RDC are stable, verifying the robustness of $\lambda$. Therefore, we conducted all experiments in the main paper with a fixed $\lambda$ in this range.

5-way 1-shot
$\lambda$ CUB Car Places Plantae Crop Euro. ISIC ChestX Ave.
0.1 46.53 36.59 56.84 40.91 79.93 68.68 31.04 22.22 47.84
0.3 46.92 37.16 57.03 41.15 79.62 68.67 31.32 22.36 48.03
0.5 46.89 37.52 56.64 41.14 78.62 68.21 31.58 22.48 47.89
0.7 46.44 37.53 55.29 40.72 76.35 67.11 31.63 22.48 47.19
0.9 45.02 37.03 51.95 39.77 71.59 64.99 31.34 22.49 45.52
5-way 5-shot
$\lambda$ CUB Car Places Plantae Crop Euro. ISIC ChestX Ave.
0.1 60.16 49.99 70.02 54.69 88.39 79.14 40.34 24.61 58.42
0.3 61.04 51.17 70.58 55.08 88.37 79.27 41.16 24.77 58.93
0.5 61.42 51.85 70.54 54.86 87.80 78.95 41.39 24.86 58.96
0.7 61.11 51.96 69.68 54.07 86.41 77.96 41.11 24.82 58.39
0.9 59.90 51.25 67.27 52.65 83.31 76.17 40.46 24.64 56.96
Table 6: Analysis of the trade-off scalar $\lambda$. Results of RDC with different $\lambda$ on the pre-trained space.

B.2 Influence of the reduced dimension $d$

The dimension $d$ of the subspace is a key parameter in building our non-linear space. We choose $d \in \{16, 32, 64, 128, 256, 512\}$ (512 being the dimensionality of the original space) to test the effect of different dimensions.

Table 7 and Fig. 6(b) show that the performance on the different subspaces is stable when $d$ is smaller than 128. This observation shows that the subspaces constructed by the hyperbolic tangent transformation are not sensitive to the reduced dimension. In particular, $d = 16$ is the best dimension for 1-shot learning and $d = 128$ is the best for 5-shot learning. To balance the different shot settings, we set $d = 64$ in all experiments of the main paper.

5-way 1-shot
$d$ CUB Car Places Plantae Crop Euro. ISIC ChestX Ave.
16 47.05 37.10 53.39 39.89 71.31 63.52 31.81 22.56 45.83
32 46.13 37.40 51.69 39.37 68.77 62.54 31.62 22.54 45.01
64 46.30 37.86 52.16 39.49 69.09 62.41 31.69 22.52 45.19
128 46.56 38.31 53.23 39.95 69.97 63.07 31.95 22.63 45.71
256 43.77 36.02 48.16 39.31 68.18 63.45 32.01 22.56 44.18
512 43.29 35.96 48.32 38.69 67.43 63.18 31.45 22.54 43.86
5-way 5-shot
$d$ CUB Car Places Plantae Crop Euro. ISIC ChestX Ave.
16 63.69 50.11 70.29 53.72 88.32 78.35 40.42 25.75 58.83
32 64.22 52.08 70.17 54.36 88.20 78.65 41.28 25.98 59.37
64 64.05 52.64 70.46 54.68 88.20 78.74 41.84 26.18 59.60
128 64.03 53.29 71.11 54.91 88.22 78.75 41.87 26.16 59.79
256 63.16 52.49 69.66 55.09 88.43 79.35 41.89 26.15 59.53
512 62.38 52.37 69.74 54.16 87.86 78.79 41.05 25.92 59.03
Table 7: Analysis of the reduced dimension $d$. Results of NPC on the proposed non-linear subspaces with different $d$.
5-way 1-shot
$\alpha$ CUB Car Places Plantae Crop Euro. ISIC ChestX Ave.
0 50.18 39.30 60.13 41.55 85.52 68.14 35.17 22.39 50.30
0.1 49.86 38.85 58.94 41.61 85.54 68.84 35.36 22.27 50.16
0.3 50.00 39.07 59.12 42.05 85.46 69.30 35.52 22.48 50.38
0.5 50.09 39.04 61.17 41.30 85.79 70.51 36.28 22.32 50.82
0.7 49.90 38.65 59.37 41.54 85.36 69.83 35.68 22.23 50.32
0.9 50.31 38.01 59.75 41.22 85.35 70.05 36.00 22.35 50.38
5-way 5-shot
$\alpha$ CUB Car Places Plantae Crop Euro. ISIC ChestX Ave.
0 66.37 53.49 71.81 56.52 93.15 81.47 47.38 25.28 62.01
0.1 66.34 53.91 72.65 57.11 93.12 82.59 47.61 25.08 62.30
0.3 66.13 52.97 73.05 57.89 93.22 83.72 48.94 25.06 62.62
0.5 67.23 53.49 74.91 57.47 93.30 84.29 49.91 25.07 63.21
0.7 67.04 53.45 73.73 57.83 93.56 83.91 49.53 25.04 63.01
0.9 66.96 53.20 73.74 57.68 93.44 84.07 49.69 25.18 63.00
Table 8: Analysis of the attention scalar $\alpha$. Results of RDC-FT with different attention scalars $\alpha$; $\alpha = 0$ represents the RDC-FT results without the attention strategy.

B.3 Effect of the attention scalar $\alpha$

The attention scalar $\alpha$ is used to increase the weights of the calibrated distances that occur in $\mathcal{R}^*$. Here we investigate the effect of different $\alpha$ and show the results in Tab. 8 and Fig. 6(c).

The results show that this attention strategy can benefit the representation adaptation for the FSL task in the target domain. Specifically, moderately increasing the attention scalar ($\alpha$ from 0.1 to 0.5) improves the effectiveness of the attention strategy. On the contrary, overly increasing the attention scalar introduces a negative effect, resulting in decreased performance ($\alpha$ from 0.5 to 0.9). Therefore, the choice of $\alpha = 0.5$ in the main paper is a moderate and robust setting for the attention strategy.

Appendix C Visualisation

To visualise the advantages and limitations of the proposed RDC and RDC-FT methods, we use t-SNE [maaten2008visualizing] to display a 5-way 1-shot FSL task from CropDisease in Fig. 7. From the figure we observe that:

  • RDC can correct some misclassified samples that are close to the support exemplars, i.e. the samples in red solid rectangles in plot(II) and plot(III). However, RDC cannot properly handle the misclassified samples lying between different support exemplars, i.e. the failure cases in plot(III).

  • From the samples in red dashed rectangles of plot(I) and plot(IV), we note that RDC-FT can calibrate the distance-based distributions in the representational space, encouraging the feature representations to have smaller within-class variations and larger class margins. Thus the fine-tuned representations are more discriminative for classification.

  • The failure cases of RDC, M-R1, M-R3, M-R24, and M-R5 in plot(III), can be correctly classified by RDC-FT with a simple NPC classifier in plot(V). This verifies the superiority of RDC-FT, which gradually embeds the calibration information from RDC into the representational space.

Appendix D Details of Jaccard distance

The Jaccard distance computation is an important part of RDC. Specifically, the concept of the Jaccard distance derives from [bai2016sparse], and the re-weighting strategy for the Jaccard distance is also used in [zhong2017rerank]. We briefly introduced the computing process of the Jaccard distance in the main paper; here we illustrate it in more detail in Algorithm 1 of this supplementary material. In this pseudo-code, the $k$-reciprocal discovery and encoding are given in lines 3-9, with the discovery process in lines 5-8 and the encoding process in line 9. The query expansion and Jaccard distance computation are then illustrated in lines 11-14 of Algorithm 1.

Appendix E Symbols and hyper-parameters

To clearly and quickly understand the equations in the main paper, we list the symbols and hyper-parameters in Tab. 9 and Tab. 10, respectively.

Symbol Meaning
$\mathcal{T}$ FSL task in the target domain
$z_i$ Feature of the $i$-th sample in $\mathcal{T}$
$D$ Euclidean distance matrix in the original space
$D^J$ Jaccard distance matrix
$\hat{D}$ Calibrated distance matrix in the original space
$\hat{D}^{sub}$ Calibrated distance matrix in the subspace
$D^{RDC}$ Complementary calibrated distance matrix
$D_{i,j}$ Pairwise distance between samples $i$ and $j$
$D_i$ Pairwise distances between sample $i$ and all samples
$D^J_{i,j}$ Jaccard distance between samples $i$ and $j$
$\mathcal{N}(i,k)$ $k$-nearest neighbours ranking list of sample $i$
$\mathcal{R}^*(i,k)$ Expanded $k$-nearest neighbours ranking list of sample $i$
$V_{i,j}$ Gaussian kernel of the pairwise distance between samples $i$ and $j$
Table 9: Explanation of the symbols.
Hyper-parameter Meaning
$k$ Number of candidates in $\mathcal{N}(i,k)$
$k_2$ Number of samples for updating $V_i$
$\lambda$ Trade-off scalar to balance $D$ and $D^J$
$d$ Dimension of the features in the subspace
$T$ Number of epochs in the fine-tuning stage
$\tau$ Temperature-scaling hyper-parameter
$\alpha$ Attention scalar
Table 10: Explanation of the hyper-parameters.
Figure 7: t-SNE visualisation of a 5-way 1-shot FSL task from CropDisease. We show the classification results with the NPC classifier and the proposed RDC method on different representations, i.e. the pre-trained representation and the task-adaptive representation produced by RDC-FT. The different colours of the points (round/cross/star) represent the ground-truth labels or the labels assigned by NPC/RDC.

Algorithm 1 (supplementary): Computing the Jaccard distance ($k$-reciprocal discovery and encoding, query expansion, Jaccard distance computation).