Few-Shot Learning (FSL) promises to allow a machine to learn novel concepts from limited experience, i.e. a few novel target samples together with data-rich source data. Typically, the default FSL setting assumes that the source and target data are in the same domain but belong to different classes. In practice, FSL is required to generalise to different target domains. Cross-Domain Few-Shot Learning (CD-FSL) [tseng2020cross_fwt, guo2020broader, liang2021boosting_cdfsl, fu2021metamixup] has been studied more recently. In CD-FSL, the target data not only has a different label space but also comes from a different domain to the source data.
It is nontrivial to directly extend the general FSL approach to address the CD-FSL challenges. In fact, many promising FSL methods [snell2017prototypical, finn2017model, sung2018learning, satorras2018fewgnn] performed poorly in CD-FSL [guo2020broader, tseng2020cross_fwt]. The central idea of these general FSL methods is to transfer and generalise the visual representations learned from source data to target data. However, the significant visual domain gap between the source and target data in CD-FSL makes it fundamentally difficult to learn a shared visual representation across different domains.
A few recent CD-FSL studies [tseng2020cross_fwt, liu2020urt, liang2021boosting_cdfsl, wang2021cross_ata] try to learn a generalisable feature extractor to improve model transferability, a popular idea in domain generalisation and domain adaptation [volpi2018generalizing_dataaug, NEURIPS2020_adver_dataaug, zhou2021dg_mixstyle, li2021simple, wu2021striking], where the source and target domains share the same label space. Empirically, this approach shows some improvement on CD-FSL, but it does not model any visual and label characteristics of the target domain, and more importantly their cross-domain impact on the pre-trained source domain representation. We argue this cross-domain mapping between the source domain representation and its interpretation in the context of the target domain data characteristics is essential for effective CD-FSL. From a related perspective, other CD-FSL studies have considered fine-tuning the source domain feature representation by augmenting additional support data in the target domain, e.g. either explicitly augmenting the support data by adversarial training [wang2021cross_ata] or image transformations [guo2020broader], or implicitly augmenting the support data by training an auto-encoder [liang2021boosting_cdfsl]. However, these are straightforward data-augmentation methods for increasing training data in target domain model fine-tuning, without considering how to quantify the cross-domain relevance of the pre-trained source domain representation.
In this work, we consider an alternative approach with a new perspective, treating cross-domain few-shot learning as an image retrieval task. We wish to optimise model adaptation by leveraging the target domain retrieval task context, that is, not only the labelled support data but also the unlabelled query data. To that end, we use a generic representation pre-trained with a cross entropy loss and a simple distance-based classifier as a baseline, then employ a k-reciprocal neighbour discovery (as in Fig. 1) and encoding process to calibrate the pairwise distances between each unlabelled query image and its likely matches. Our idea is both orthogonal and complementary to other generalisable model learning methods [guo2020broader, liu2020feature, liang2021boosting_cdfsl]. It can be flexibly used with or without fine-tuning based model learning.
Generally, the distance matrix for a CD-FSL task contains many incorrect results, as the distances are computed in a potentially biased pre-trained source domain representational space. To calibrate this distance matrix towards the target domain so as to reduce its bias to the source domain, we explore the re-ranking concept in the target domain by considering CD-FSL optimisation as re-ranking in a retrieval task given few shots as anchor points. As in Fig. 1, re-ranking first computes a k-nearest neighbour ranking list. This is further expanded by discovering the k-reciprocal nearest neighbours in the target domain. The expanded ranking list is used for re-computing a Jaccard distance to measure the difference between the original ranking list and the expanded ranking list, achieving a more robust and accurate distance matrix.
Critically, a pre-trained representation from the source domain is biased and generalises poorly across domains in CD-FSL. The reason is that conventional FSL methods implicitly assume mostly linear transformations between the source and target data, as they are sampled from the same domain. This assumption becomes invalid in CD-FSL, where the transformations across source and target domains are mostly nonlinear. To address this problem, we propose a task-aware subspace mapping to minimise transferring task-irrelevant representational information from the source domain. In particular, instead of mapping to a linear projection space, we explore a hyperbolic tangent function to project the source domain representation to a non-linear space. Compared to the linear Euclidean space, this non-linear space performs a dimensionality reduction to optimise the retention of transferable information from the source to the target domain. Moreover, we explore the idea of re-ranking to calibrate and align the two distance matrices in the two representational spaces: the original pre-trained source domain linear space and the new non-linear subspace. The calibrated matrices are combined to construct a single distance matrix for the target domain in CD-FSL. We call this Ranking Distance Calibration (RDC).
To impose the above distance calibration into the representational space transform, we approximate the distance matrices by their corresponding distributions, and then a Kullback-Leibler (KL) divergence loss function is optimised for iteratively mapping the original distance distribution from the source domain towards the calibrated space. This provides an additional RDC Fine-Tuning (RDC-FT) model optimisation.
Our contributions from this work are three-fold: (1) To transform the biased distance matrix in the source domain representational space towards the target domain in CD-FSL, we use a re-ranking method to re-compute a Jaccard distance for distance calibration by discovering the k-reciprocal nearest neighbours within the task. We call this Ranking Distance Calibration. (2) We propose a non-linear subspace to shadow the pre-trained source domain representational space. This is designed to model any inherent non-linear transform in CD-FSL and is used to facilitate the distance calibration process between the source and target domains. By modelling this nonlinearity explicitly, we formulate a more robust and generalisable Ranking Distance Calibration (RDC) model for CD-FSL. (3) We further impose RDC as a constraint on the model optimisation process. This is achieved by RDC with Fine-Tuning (RDC-FT), iteratively mapping the original source domain distance distribution to a calibrated target domain distance distribution for a more stable and improved CD-FSL.
We evaluated the proposed RDC and RDC-FT methods for CD-FSL on eight target domains. The results show that RDC can notably improve the conventional distance-based classifier, and RDC-FT can improve the representation for the target domain to achieve competitive or better performance than the state-of-the-art CD-FSL models.
2 Related Work
Few-shot learning The approaches for general FSL can be broadly divided into two categories: optimisation-based methods [finn2017model, ravi2016optimization, oh2021boil], which learn a generalisable model initialisation and then adapt the model on a novel task with limited labelled data, and metric learning methods [snell2017prototypical, sung2018learning, zhang2020deepemd, li2021plain], which meta-learn a discriminative embedding space where the samples in a novel task can be well classified by a common or learned distance metric. Recently, some studies [chen2019closer, tian2020rethinking, li2020few_self] show that a simple pre-training method followed by a fine-tuning stage can achieve competitive or better performance than the metric learning methods. This observation also seems to hold in CD-FSL [guo2020broader].
Cross-domain few-shot learning The problem of CD-FSL was preliminarily studied in FSL works [chen2019closer, tian2020rethinking, pan2021mfl]; then [tseng2020cross_fwt, guo2020broader] expanded this setup and proposed two benchmarks that train a model on a single source domain and then generalise it to other domains. Some CD-FSL works [tseng2020cross_fwt, wang2021cross_ata, liang2021boosting_cdfsl] focus on learning a generalisable model by explicit or implicit data augmentation. These approaches improve the model's generalisation ability but easily result in ambiguous optimisation results, since they ignore the adaptation process for the target domain. Other methods [phoo2021STARTUP, das2021distractor_cdfsl, fu2021metamixup] target adaptation on the target domain by leveraging additional unlabelled data [phoo2021STARTUP], labelled data [fu2021metamixup] or the base data [das2021distractor_cdfsl]. In practice, the additional data can help model adaptation on the target domain, but such information is not easy to obtain. In this work, we address the CD-FSL problem from an image retrieval view and mine the intra-task information to guide a ranking distance calibration process.
Ranking in image retrieval Image Retrieval (IR) is a classical vision task that aims to search unlabelled gallery data for the images most relevant to a probe image. Ranking is a classical approach in the IR field [he2004manifold_rank, huang2015cross_retrivel, liu2013pop_rank, loy2013person_rank]. Generally, a ranking method computes a ranking list based on a distance metric, and a series of re-ranking methods were proposed as a post-processing step to improve the initial ranking result. One typical re-ranking based method [zhong2017rerank] used the concept of k-reciprocal nearest neighbours [qin2011hello_kreciprocal] to explore more hard positive samples; the enlarged k-nearest neighbour list is then used to recompute a Jaccard distance as an auxiliary distance matrix. A number of works [liu2018adaptivererank, sarfraz2018posererank] extended this idea to further improve retrieval performance. In this work, we reuse the re-ranking approach to address the CD-FSL problem.
Problem formulation. We start by defining a general FSL problem: given a source dataset D_s and a target dataset D_t, the classes in D_s and D_t are disjoint. FSL aims to address an N-way K-shot classification task in D_t by leveraging the limited data in D_t and the prior knowledge learned from D_s, which contains many labelled images. Specifically, an FSL task contains a labelled support set S and an unlabelled query set Q, where the images in S and Q are both from the same N classes, and K/K_q are the numbers of images per class in S/Q. The goal of FSL is to recognise the unlabelled query set when K is small. Notably, D_s and D_t in CD-FSL are from different domains. For instance, D_s is a dataset containing many natural images while D_t is a dataset collected from the remote sensing field.
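As a concrete illustration of the N-way K-shot setup above, the following sketch samples one episode (support and query sets) from a labelled feature set; all function and variable names are our own, not from the paper's code.

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way K-shot episode: pick n_way classes, then split
    k_shot support and n_query query samples per class, relabelled 0..n_way-1."""
    rng = np.random.default_rng(rng)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + n_query]
        sx.append(features[idx[:k_shot]])
        sy += [new_label] * k_shot
        qx.append(features[idx[k_shot:]])
        qy += [new_label] * n_query
    return (np.concatenate(sx), np.array(sy),
            np.concatenate(qx), np.array(qy))
```

In a 5-way 1-shot task this yields 5 labelled support samples and 75 unlabelled query samples, matching the evaluation protocol used later.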
Nearest prototype classifier. Before fully developing our main method, we define the prototype classifier used here. Given a feature extractor f_θ, we can extract the embedding f_θ(x) for each image x in an FSL task. The Nearest Prototype Classifier (NPC) first computes the prototypes for the N classes, where the prototype c_n for class n is the mean embedding of the support samples of that class:

c_n = (1/K) Σ_{(x_i, y_i) ∈ S, y_i = n} f_θ(x_i).

With the prototypes, the label for a query x_q in Q is assigned by:

ŷ_q = argmin_n d(f_θ(x_q), c_n),

where d(·, ·) is a distance metric, e.g. the Euclidean distance in this work, and d(f_θ(x_q), c_n) is the distance between the query embedding and the prototype c_n.
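The NPC classifier described above can be sketched in a few lines of NumPy (working directly on pre-extracted embeddings; the function name is our own):

```python
import numpy as np

def nearest_prototype_predict(support_x, support_y, query_x):
    """Nearest Prototype Classifier: assign each query to the class whose
    mean support embedding (prototype) is closest in Euclidean distance."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    # (n_query, n_class) pairwise Euclidean distance matrix
    dists = np.linalg.norm(query_x[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)], dists
```

The returned distance matrix is exactly the quantity the RDC method later calibrates.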
Overview. The key insight of this paper is to formulate FSL as an Image Retrieval (IR) task, sharing the same view as [triantafillou2017fewretrivel], where the authors propose to optimise the mean average precision for FSL. Furthermore, our view of "FSL as IR" also emphasises the importance of maximally leveraging all the available information in this low-data regime, while focusing on the calibration of pairwise distances in FSL. In particular, this work follows this view and considers each sample in an FSL task as probe data in IR, treating the whole FSL task data as the gallery. To this end, we propose a ranking distance calibration process for CD-FSL, and our key methodology is to repurpose re-ranking to find the relevant images within the FSL task for a given image. We give an overview of the proposed method in Fig. 2.
3.1 Ranking Distance Calibration (RDC)
Motivation. Previous works [zhong2017rerank, qin2011hello_kreciprocal] have suggested that discovering the k-reciprocal nearest neighbours within the gallery data can benefit the re-ranking result for image retrieval. This observation encourages us, when considering few-shot learning as image retrieval, to reuse this k-reciprocal nearest neighbour discovery process to calibrate the pairwise distances within an FSL task. We give an intuitive example in Fig. 1: by mining the relative relationships among the samples via k-reciprocal nearest neighbours, the potential positive instances of the same class are clustered together for re-weighting the pairwise distances. In this manner, the pairwise distances in the FSL task can be encoded as a new distance matrix.
Here, we briefly describe the re-ranking process for our ranking distance calibration and detail the Jaccard distance computation in Sec. D of the supplementary material. For an FSL task, we start by computing an original pairwise Euclidean distance matrix D_E, where each entry d(x_i, x_j) is the Euclidean distance between x_i and x_j computed on normalised features. Referring to D_E, we can obtain the k-nearest neighbour set N(x_i, k) for each sample x_i. The re-ranking idea [zhong2017rerank] is to expand N(x_i, k) by discovering more hard-positive samples for x_i. The expansion is guided by a k-reciprocal nearest neighbours algorithm [qin2011hello_kreciprocal], and the expanded ranking list is used to estimate a calibrated distance matrix.
k-reciprocal discovery and encoding. The principle of k-reciprocal nearest neighbour discovery is that if x_j is in N(x_i, k), then x_i should also occur in N(x_j, k) [qin2011hello_kreciprocal]. With this assumption, N(x_i, k) can be expanded to the k-reciprocal set

R(x_i, k) = { x_j | x_j ∈ N(x_i, k) and x_i ∈ N(x_j, k) },

where x_j is a sample in the task and k is the number of neighbours. Moreover, for many-shot cases, we further expand R(x_i, k) by uniting it with R(x_j, k), where x_i and x_j are both in the support set and of the same class. In this manner, the k-nearest neighbour set is finally expanded to R*(x_i, k). To assign larger weights to closer neighbours and smaller weights to farther ones, R*(x_i, k) is further used to encode the pairwise distances into a vector V_i, whose entries are defined by the Gaussian kernel of the pairwise distance:

V_{i,j} = exp(−d(x_i, x_j)) if x_j ∈ R*(x_i, k), and V_{i,j} = 0 otherwise.

After that, a query expansion strategy is employed to integrate the most-likely samples, averaging their encodings to update the representation of x_i.
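The k-reciprocal discovery and Gaussian-kernel encoding can be sketched as follows. This is a simplified illustration: the sets include the sample itself, the half-k incremental expansion of the full algorithm in [zhong2017rerank] is omitted, and all names are our own.

```python
import numpy as np

def knn_sets(dist, k):
    """k-nearest neighbour index lists (including self) for every sample."""
    return np.argsort(dist, axis=1)[:, :k + 1]

def k_reciprocal(dist, i, k):
    """Samples j that are in i's k-NN list AND have i in their own k-NN list."""
    knn = knn_sets(dist, k)
    return np.array([j for j in knn[i] if i in knn[j]])

def encode(dist, i, expanded, n):
    """Gaussian-kernel encoding: V[j] = exp(-d(i, j)) for j in the
    expanded neighbour set of i, and 0 elsewhere."""
    v = np.zeros(n)
    v[expanded] = np.exp(-dist[i, expanded])
    return v
```

Note how a distant outlier never satisfies the reciprocity test: it may pick close samples as its neighbours, but they do not pick it back.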
Jaccard distance. Referring to [bai2016sparse, zhong2017rerank], the expanded ranking list is used as contextual knowledge to compute a Jaccard distance matrix, where each entry measures one minus the ratio between the sizes of the intersection and the union of the two expanded neighbour sets.
Following the re-weighting method in [zhong2017rerank], the numbers of candidates in the intersection and union sets can be calculated as ||min(V_i, V_j)||_1 and ||max(V_i, V_j)||_1, where min and max perform element-wise minimisation and maximisation of the two input vectors, and ||·||_1 is the L1 norm. Then the Jaccard distance in Eq. (6) can be re-formulated as:

d_J(x_i, x_j) = 1 − ||min(V_i, V_j)||_1 / ||max(V_i, V_j)||_1.
Distance calibration. The Jaccard distance matrix D_J is then used to calibrate the original distance matrix D_E by a weighting strategy:

D* = λ D_E + (1 − λ) D_J,

where λ is a trade-off scalar to balance the two matrices.
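Putting the pieces together, the Jaccard distance over encoded neighbour vectors and the weighted calibration can be sketched as below; the exact placement of the trade-off scalar in the combination is our assumption of the weighting form.

```python
import numpy as np

def jaccard_distance(V):
    """V: (n, n) matrix whose i-th row is the Gaussian-kernel encoding of
    x_i's expanded neighbour set. Returns pairwise Jaccard distances
    1 - ||min(V_i, V_j)||_1 / ||max(V_i, V_j)||_1."""
    inter = np.minimum(V[:, None, :], V[None, :, :]).sum(axis=-1)
    union = np.maximum(V[:, None, :], V[None, :, :]).sum(axis=-1)
    return 1.0 - inter / np.maximum(union, 1e-12)

def calibrate(d_orig, d_jac, lam=0.3):
    """Weighted combination of the original and Jaccard distance matrices,
    with lam controlling the weight of the original distances."""
    return lam * d_orig + (1.0 - lam) * d_jac
```

Samples with identical neighbour encodings get Jaccard distance 0 and fully disjoint ones get 1, which is what makes the matrix a useful correction signal.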
3.2 RDC in Task-adaptive Non-linear Subspace
To further bridge the domain gap, we propose to further improve RDC in a non-linear subspace. We particularly tailor a discriminative subspace to help calibrate the ranking in our CD-FSL task. The subspace is built upon Principal Component Analysis (PCA) to extract crucial features from the original space. Specifically, given the feature representations, we learn a transformation matrix W that maps a feature with D dimensions to a reduced feature with m dimensions.
Hyperbolic tangent transformation. Generally, the PCA method can be directly applied to the original embedding space. However, the original representation is scattered due to the biased and less-discriminative embedding; thus the dimensionality reduction easily causes information loss. To mitigate this issue, we consider transforming the original representations into a compact and representative non-linear space. Using the idea of kernels, we adopt a hyperbolic tangent function to construct a task-adaptive non-linear subspace. Our non-linear PCA method first computes a kernel matrix by applying the hyperbolic tangent function to the pairwise feature similarities.
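One plausible instantiation of such a tanh-kernel PCA is sketched below; the kernel scaling, the centring step and the component normalisation are our own assumptions following standard kernel PCA, and the paper's exact kernel form may differ.

```python
import numpy as np

def tanh_kernel_pca(X, m):
    """Project features X (n, d) onto an m-dim non-linear subspace using a
    hyperbolic-tangent kernel matrix, centred and eigendecomposed as in
    standard kernel PCA. The 1/d kernel scaling is an illustrative choice."""
    n, d = X.shape
    K = np.tanh(X @ X.T / d)                    # tanh (sigmoid-style) kernel
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one  # centre in feature space
    w, V = np.linalg.eigh(Kc)                   # ascending eigenvalues
    idx = np.argsort(w)[::-1][:m]               # keep the top-m components
    w, V = w[idx], V[:, idx]
    return V * np.sqrt(np.maximum(w, 1e-12))    # embeddings of the n samples
```

Because the whole task (support and query features) forms the kernel matrix, the resulting subspace is task-adaptive by construction.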
Complementary distance calibration. Our distance calibration process can be applied in the original linear embedding space (in Sec. 3.1) and in the non-linear subspace (in Sec. 3.2). The original space has higher dimensionality and carries full information, but is also disturbed by noisy task-irrelevant features, while the non-linear subspace reduces some task-irrelevant signal but loses some information. To this end, our RDC method co-leverages the calibrated distances in the two spaces to obtain a robust and complementary distance matrix. The computing process of the RDC method is given in lines 4-8 of Alg. 1.
Remark. As the hyperbolic non-linear space has a larger capacity than the Euclidean space [khrulkov2020hyperbolic, fang2021kernel_hyperbolic, yan2021unsupervised_hyperbolic], it can alleviate the information loss caused by the dimensionality reduction. Therefore, we use a hyperbolic tangent transformation to map the source domain linear space to a non-linear space. We note that subspace learning has been preliminarily explored in FSL works [yoon2019tapnet, simon2020adaptive] to learn task-adaptive or class-adaptive subspaces. Critically, our subspace construction method is different from [yoon2019tapnet, simon2020adaptive], and our method does not need a sophisticated episodic training process.
3.3 Fine-tuning with RDC
As RDC provides a more robust and discriminative distance matrix, it is natural to ask whether this calibration knowledge can be used to optimise the feature extractor. To achieve this, we fine-tune the feature extractor by iteratively mapping the original distance distribution to the calibrated distance distribution, formulating the RDC with Fine-Tuning (RDC-FT) method as in Alg. 1.
Expanded k-reciprocal list as attention. As in Eq. (6), the expanded ranking list is used to re-compute the pairwise distances. The calibrated pairwise distances between pairs within the expanded list are more robust than those outside it. Thus the expanded list can naturally be used as an attention mask, assigning full weight to pairs inside the expanded list and an attention scalar α to those outside. During the fine-tuning process, this mask is used to re-weight the original and the calibrated distance matrices element-wise.
Choices of loss functions. To achieve the distance distribution alignment, the Mean Squared Error (MSE) loss and the Kullback-Leibler (KL) divergence loss are both candidates. The MSE loss directly regresses towards the target distances, while the KL divergence loss focuses on distribution matching [kim2021comparing]. As the KL loss learns this mapping in a softened way, it is better suited to embedding the calibration knowledge into the representations. Here we minimise the KL divergence between the softened distributions of the re-weighted original and calibrated distance matrices, where τ is the temperature-scaling hyper-parameter. Given a row vector of a distance matrix, its softened distribution is obtained by a temperature-scaled softmax, whose j-th value corresponds to sample x_j.
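The softened-distribution KL objective can be sketched as follows. The sign convention of the softmax (closer samples receiving higher probability) is our assumption, and all names are illustrative.

```python
import numpy as np

def soften(dist_rows, tau=1.0):
    """Row-wise softmax over -distance/tau: each row of the distance matrix
    becomes a probability distribution where closer samples get higher mass."""
    z = -dist_rows / tau
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def kl_alignment_loss(d_orig, d_calib, tau=1.0):
    """Mean row-wise KL(P_calib || P_orig): pull the softened original
    distance distribution towards the softened calibrated one."""
    p = soften(d_calib, tau)   # target: calibrated distances
    q = soften(d_orig, tau)    # current: original distances
    eps = 1e-12
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))
```

In the full RDC-FT procedure this scalar would be back-propagated through the feature extractor (e.g. via an autodiff framework); the NumPy version here only illustrates the loss itself.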
Dataset. Following the benchmarks in [wang2021cross_ata, liang2021boosting_cdfsl], we used miniImageNet as the source domain and another eight datasets, i.e. CUB, Cars, Places, Plantae, CropDisease, EuroSAT, ISIC and ChestX, as target domains. Specifically, miniImageNet [vinyals2016matching] is a subset of ILSVRC-2012. CUB, Cars, Places and Plantae are the target domains proposed in [tseng2020cross_fwt] for evaluation on natural image domains, while CropDisease, EuroSAT, ISIC and ChestX are four domains proposed in [guo2020broader] for generalising the model to domains with different visual characteristics. For all experiments, we resized all images to 224×224 pixels and used the data augmentations in [wang2021cross_ata, tseng2020cross_fwt] as image transformations.
We followed the evaluation protocols in [wang2021cross_ata] to evaluate our method on CD-FSL. Specifically, for each target domain, we randomly selected 2000 FSL tasks, each containing 5 different classes. Each class has 1/5 labelled support samples and an additional 15 unlabelled samples for evaluating performance, formulating the 5-way 1/5-shot CD-FSL problem.
In all experiments, we reported the mean classification accuracy as well as the 95% confidence interval on the query set of each domain. For comprehensive comparison, we also listed the average accuracy (shown as Ave. in Tab. 1, 2 and 5) over the 8 domains.
Implementation details. Following previous works [tseng2020cross_fwt, wang2021cross_ata, guo2020broader], we used a ResNet10 as the feature extractor. Further, we used the same hyper-parameters for the experiments on different domains to fairly validate the generalisation ability. Specifically, the feature extractor is pre-trained for 400 epochs on the base classes of miniImageNet with an Adam optimizer, using a learning rate of 0.001 and a batch size of 64. For our RDC method, we fixed the neighbourhood size k, the trade-off scalar λ and the reduced dimension m of the non-linear subspace across all domains. For the fine-tuning stage in RDC-FT, we fixed the attention scalar α and the temperature τ, and trained the model using an Adam optimizer with a learning rate of 0.001.
4.1 Comparison with baselines
As our methods are based on a simple NPC classifier, we start by comparing our RDC method with some baseline methods that also use an NPC classifier and do not need fine-tuning on a target domain. These baselines are: NPC, which uses an NPC classifier on the pre-trained embedding; NPC + norm, which applies an NPC classifier on normalised feature embeddings; and ProtoNet [snell2017prototypical], which meta-learns a task-agnostic NPC classifier on miniImageNet. The results in Tab. 1 show that RDC largely outperforms these baselines, boosting the simple NPC classifier into a strong one. In particular, the performance on 1-shot learning is improved notably, with clear increases in Ave. accuracy compared to the baselines. This observation indicates that RDC is effective at calibrating the distances by fully leveraging the task information. We also note that the improvement on 5-shot is not as large as that on 1-shot. The reason is that the prototypes for the NPC classifier are more robust under the many-shot setting; thus the original distances are less biased and the calibration process improves less when the embedding is fixed. This limitation can be mitigated by the fine-tuning stage of our RDC-FT method.
4.2 Comparison with state-of-the-art methods
We further compare our RDC-FT method with State-of-The-Art (SoTA) methods: 1) meta-learners: GNN-FT [tseng2020cross_fwt], which meta-trains a GNN [satorras2018fewgnn] model with an additional Feature Transformation layer; GNN-LRP, which uses Layer-wise Relevance Propagation to guide the GNN model training; and TPN+ATA [wang2021cross_ata], which meta-learns TPN [liu2019fewTPN] with Adversarial Task Augmentation. 2) fine-tuning methods: a general Fine-tuning [guo2020broader] method; ConFT [das2021distractor_cdfsl], which fine-tunes the model reusing the base classes; and NSAE [liang2021boosting_cdfsl], which pre-trains and fine-tunes the model with an additional autoencoder task to improve generalisation. From Tab. 2, we observe that our RDC-FT method is superior to the SoTA methods on 1-shot learning and competitive on 5-shot learning. Also, we notice that the performance is not superior to the ConFT and NSAE methods for 5-shot learning. The reasons are: 1) our method explores the task information in an unsupervised way while the others focus on fine-tuning with more labelled data; thus these methods benefit a lot from the many-shot setting. 2) ConFT reuses more data from the base classes for model fine-tuning. Thus similar classes between the source and target domains, e.g. birds and cars, help to build more robust decision boundaries when learning on related target domains, e.g. CUB and Cars. But this approach requires more data and expensive computing resources. 3) NSAE adopts an autoencoder to implicitly augment data to pre-train a generalisable model, and our method is theoretically orthogonal to this method for solving the CD-FSL problem.
4.3 Ablation study
Component analysis. To investigate the efficacy of different components in RDC-FT, we ablate the contribution of each element: RDC w/o subspace, RDC (in two spaces), RDC-FT w/o subspace and RDC-FT (in two spaces). As in Tab. 3, a simple RDC process without subspace learning, which calibrates the distances only on the pre-trained representation, largely boosts the baseline NPC classifier, i.e. an 8.72% (1.87%) improvement on 1-shot (5-shot). The fine-tuning process, as in the results of RDC-FT w/o subspace, can enlarge this improvement through the iterative mapping process, achieving an 11.15% (5.82%) improvement on 1-shot (5-shot). Interestingly, we observe that the contribution of the subspace for RDC (0.90% on 1-shot) is larger than that for RDC-FT (0.46% on 1-shot). This indicates that the fine-tuning process can gradually alleviate the bias of the pre-trained representations; thus the benefit of the subspace becomes smaller in RDC-FT.
| Method | 5-way 1-shot | 5-way 5-shot |
| --- | --- | --- |
| +RDC w/o subspace | 47.93 (+8.72%) | 58.90 (+1.87%) |
| +RDC-FT w/o subspace | 50.36 (+11.15%) | 62.85 (+5.82%) |
Comparison of different PCA methods. Here, we compare our non-linear subspace to subspaces constructed by Kernel PCA (KPCA) methods with different kernel types (linear, Gaussian, polynomial and sigmoid). For fair comparison, all subspace dimensions are set to 64 and we used the default parameters of the KPCA methods in scikit-learn. Table 4 shows that our non-linear subspace performs better than the other KPCA methods. Interestingly, without the RDC method, the KPCA methods can largely improve the performance (compared to N/A) on 1-shot learning, but they are only competitive to the original space when the RDC method is applied on the different subspaces. In contrast, our non-linear subspace achieves consistent and stable improvement both with and without the RDC method, verifying its robustness and superiority.
Effect of loss choices. We evaluated the performance of RDC-FT with different loss functions. The results in Fig. 3 show that the three losses achieve competitive performance on CUB, Cars, ISIC and ChestX, while the KL loss mostly performs better than the MSE loss on Places, Plantae and EuroSAT. These observations suggest the superiority of mapping the distance matrix via softened distributions. We conjecture this is attributable to the softening process, which can alleviate the negative effect of imperfectly calibrated distances. Moreover, we note that the performance of the KL loss can be further improved by an attention strategy on the distance matrices, verifying the efficacy of employing the expanded k-nearest neighbours list as an attention reference.
Visualisation. To qualitatively show the effectiveness of our RDC and RDC-FT methods, we first show a case study of an FSL task from CUB, comparing the original ranking list with the ranking list after RDC. As in Fig. 7, for a given query sample, our RDC method pulls the ground-truth support data closer to the query, arriving at a more accurate position; this is achieved by the calibration process of our RDC method. For the RDC-FT method, we use t-SNE [van2008visualizing] to visualise the feature embeddings of FSL tasks randomly selected from target domains, i.e. CUB, CropDisease and EuroSAT. As in Fig. 5, the feature representations with RDC-FT (in the 2nd-row plots) have smaller within-class variations and larger class margins compared to those without RDC-FT (in the 1st-row plots), showing that the RDC-FT method can guide the model towards a task-specific embedding where the samples can easily be classified by a simple NPC classifier. Moreover, our RDC-FT method, as expected, functions as an implicit clustering process for the FSL task. This can be qualitatively verified by the clustering effect observed in the 2nd-row plots of Fig. 5.
Incorporation with other methods. As RDC is a post-processing method, it can be flexibly combined with other methods. Here we applied RDC on top of a general data augmentation method [yeh2020large]. The results in Tab. 5 indicate that RDC achieves consistent improvement over the other method, showing its generalisation ability. Currently we cannot evaluate our method on [liang2021boosting_cdfsl, liu2020feature] until their code is released.
In this paper, we proposed a Ranking Distance Calibration (RDC) method to calibrate the biased distances in CD-FSL. The calibration is achieved by a re-ranking method with a k-reciprocal discovery and encoding process. As the pre-trained linear embedding is biased for the target domain, we further proposed a non-linear subspace together with a calibration process on it. Our RDC method averages the calibrated distances of the two spaces into a robust distance matrix. Moreover, we introduced an RDC-FT method to fine-tune the embedding with the calibrated distances, yielding a discriminative representation for the CD-FSL task.
Limitation and discussion. As the image retrieval perspective of our approach discovers the task information in an unsupervised way, a more comprehensive way of jointly leveraging the label information and the task information should be considered, especially in many-shot cases, e.g. 5-shot.
We notice that a concurrent work [xi2021_reranking] also uses a re-ranking process for few-shot learning. Although our work and [xi2021_reranking] share the image retrieval angle on few-shot learning, their key differences are summarised as follows: 1) Our work focuses on the k-reciprocal process to calibrate the pairwise distances both in the pre-trained linear space and in a task-adaptive non-linear space, while [xi2021_reranking] uses a graph view to improve the subgraph similarity on the pre-trained space; 2) For the representation optimisation stage, we use the calibrated distance matrix as guidance to fine-tune the pre-trained feature extractor with a Kullback-Leibler divergence loss, whilst [xi2021_reranking] designs a Multi-Layer Perceptron to meta-learn a subgraph similarity refiner that optimises the feature extractor with a Cross Entropy loss. In particular, the meta-learning strategy in [xi2021_reranking] and the fine-tuning strategy in our work are orthogonal, as they focus on learning on the source data and the target data, respectively.
Appendix A Overview
In this supplementary material, we present:
To validate the robustness of the proposed RDC and RDC-FT methods, we analyse the sensitivities of different hyper-parameters, i.e. the trade-off scalar, the reduced dimension and the attention scalar (in Sec. B);
We visualise the advantages and limitations of RDC and RDC-FT on a sample FSL task (in Sec. C);
For a better understanding of the computing process of the Jaccard distance, we illustrate its algorithm (in Sec. D);
The notations for all symbols and hyper-parameters used in the main paper are defined (in Sec. E);
Appendix B Sensitivity analysis of the hyper-parameters
In all experiments of the main paper, we reported the results on 8 target domains with the same hyper-parameters. In practice, our method is robust to the hyper-parameter selection, as shown in Fig. 6. Further, we analyse in depth three key hyper-parameters: the trade-off scalar, the reduced dimension of the subspace, and the attention scalar.
B.1 Effect of the trade-off scalar
The trade-off scalar is used to balance the original distance and the Jaccard distance in the proposed RDC method, thus it is a critical hyper-parameter for RDC. We conducted experiments to test RDC with different values of the trade-off scalar on the pre-trained space.
The results are shown in Tab. 6 and Fig. 6(a), from which we can see that assigning a smaller weight to the original distance (a smaller trade-off scalar) is a better choice for RDC. In particular, the best value for 1-shot is 0.3 while that for 5-shot is 0.5. This indicates that the original distance becomes more robust as the number of shots increases, thus the original space should receive a larger weight in the calibrated distance. Besides, when the trade-off scalar is between 0.1 and 0.5, the average accuracies of RDC are stable, verifying its robustness. Therefore, we conducted all experiments in the main paper with a fixed value in this range.
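For concreteness, the role of the trade-off scalar can be sketched as a convex combination of the two distance matrices. This is a minimal sketch; the function and parameter names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def calibrate_distance(d_orig, d_jaccard, lam=0.3):
    """Convex combination of the original and Jaccard distance matrices.
    A smaller `lam` gives the original distance less weight, which the
    sensitivity analysis suggests is preferable in the 1-shot case."""
    return lam * d_orig + (1.0 - lam) * d_jaccard
```

With `lam=0.5` the two distances are weighted equally, matching the observation that the original distance deserves more weight as the number of shots grows.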
B.2 Influence of the reduced dimension
The dimension of the subspace is a key parameter in building our non-linear space. We test the effects of different reduced dimensions, ranging from small values up to the dimension of the original space.
Table 7 and Fig. 6(b) show that the performance on different subspaces is stable when the dimension is smaller than 128. This observation shows that the subspaces constructed by the hyperbolic tangent transformation are not sensitive to the reduced dimension. In particular, the best dimensions for 1-shot and 5-shot learning differ slightly. To balance the different shot settings, we fixed a single dimension in all experiments of the main paper.
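A possible construction of such a reduced non-linear subspace can be sketched as follows. The paper specifies only the hyperbolic tangent transformation; the choice of a PCA-style projection basis (top singular vectors) and the function name are our assumptions for illustration.

```python
import numpy as np

def tanh_subspace(features, dim=64):
    """Project centred features onto the top-`dim` principal directions,
    then apply a hyperbolic tangent to obtain a non-linear subspace."""
    mu = features.mean(axis=0, keepdims=True)
    x = features - mu
    # top-`dim` right singular vectors give the projection basis
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    z = x @ vt[:dim].T
    return np.tanh(z)
```

The tanh squashes each projected coordinate into (-1, 1), which bounds the pairwise distances in the subspace regardless of the feature scale.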
B.3 Effect of the attention scalar
The results show that this attention strategy can benefit the representation adaptation for the FSL task in the target domain. Specifically, moderately increasing the attention scalar (from 0.1 to 0.5) improves the effectiveness of the attention strategy. On the contrary, overly increasing the attention scalar (from 0.5 to 0.9) introduces a negative effect, resulting in a performance decrease. Therefore, the value chosen in the main paper is a moderate and robust setting for the attention strategy.
Appendix C Visualisation
To visualise the advantages and limitations of the proposed RDC and RDC-FT methods, we use t-SNE [maaten2008visualizing] to display a 5-way 1-shot FSL task from CropDisease in Fig. 7. From the figure we observe that:
RDC can correct some misclassified samples that are close to the support exemplars, i.e. the samples in red solid rectangles in plot (II) and plot (III). However, RDC cannot well address the misclassified samples lying between different support exemplars, i.e. the failure cases in plot (III).
From the samples in red dashed rectangles of plot (I) and plot (IV), we note that RDC-FT can calibrate the distance-based distributions in the representational space, encouraging the feature representations to have smaller within-class variations and larger class margins. Thus the fine-tuned representations are more discriminative for classification.
The failure cases of RDC, M-R1, M-R3, M-R24 and M-R5 in plot (III), can be correctly classified by RDC-FT with a simple NPC classifier in plot (V). This verifies the superiority of RDC-FT, which gradually embeds the calibration information from RDC into the representational space.
Appendix D Details of Jaccard distance
The Jaccard distance computation is an important part of RDC. Specifically, the concept of the Jaccard distance derives from [bai2016sparse], and the re-weighting strategy for the Jaccard distance is also used in [zhong2017rerank]. We briefly introduced the computing process of the Jaccard distance in the main paper. Here, we provide more details in Algorithm 1. In this pseudo-code, the k-reciprocal discovery and encoding process is illustrated in lines 3-9, where the discovery process is presented in lines 5-8 and the encoding process in line 9. Then, the query expansion and the Jaccard distance computation are illustrated in lines 11-14 of Algorithm 1.
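The core of the computation can be sketched as follows. This is a simplified sketch in the spirit of the k-reciprocal re-ranking of [zhong2017rerank]: it keeps the discovery, Gaussian-weighted encoding, and min/max Jaccard steps, but omits the half-k set expansion and the query expansion refinements of Algorithm 1; all names are illustrative.

```python
import numpy as np

def k_reciprocal_jaccard(dist, k=8):
    """Jaccard distance from k-reciprocal neighbour sets.
    `dist` is a symmetric pairwise distance matrix with zero diagonal."""
    n = dist.shape[0]
    ranks = np.argsort(dist, axis=1)
    # discovery: keep j as a neighbour of i only if i is also in j's top-k
    recip = [set() for _ in range(n)]
    for i in range(n):
        for j in ranks[i, :k + 1]:
            if i in ranks[j, :k + 1]:
                recip[i].add(int(j))
    # encoding: Gaussian-kernel-weighted membership vectors, row-normalised
    V = np.zeros((n, n))
    for i in range(n):
        idx = np.fromiter(recip[i], dtype=int)
        w = np.exp(-dist[i, idx])
        V[i, idx] = w / w.sum()
    # Jaccard distance from the min/max overlap of the encoded vectors
    jaccard = np.zeros((n, n))
    for i in range(n):
        minv = np.minimum(V[i], V).sum(axis=1)
        maxv = np.maximum(V[i], V).sum(axis=1)
        jaccard[i] = 1.0 - minv / maxv
    return jaccard
```

Because each encoded row sums to one, the resulting Jaccard distance is zero on the diagonal and bounded in [0, 1].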
Appendix E Symbols and hyper-parameters
|FSL task in the target domain|
|Feature of the i-th sample in the task|
|Euclidean distance matrix in the original space|
|Jaccard distance matrix|
|Calibrated distance matrix in the original space|
|Calibrated distance matrix in the subspace|
|Complementary calibrated distance matrix|
|Pairwise distance between two samples|
|Pairwise distances between a sample and a set of samples|
|Jaccard distance between two samples|
|k-nearest neighbours ranking list of a sample|
|Expanded k-nearest neighbours ranking list of a sample|
|Gaussian kernel of the pairwise distance between two samples|
|Number of candidates in the ranking list|
|Number of samples used in the query expansion update|
|Trade-off scalar to balance the original distance and the Jaccard distance|
|Dimension of the feature in the subspace|
|Number of epochs in the fine-tuning stage|