Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation

01/23/2020 · Hung-Yu Tseng et al. · University of California, Merced; Virginia Polytechnic Institute and State University

Few-shot classification aims to recognize novel categories with only few labeled images in each class. Existing metric-based few-shot classification algorithms predict categories by comparing the feature embeddings of query images with those from a few labeled images (support examples) using a learned metric function. While promising performance has been demonstrated, these methods often fail to generalize to unseen domains due to large discrepancy of the feature distribution across domains. In this work, we address the problem of few-shot classification under domain shifts for metric-based methods. Our core idea is to use feature-wise transformation layers for augmenting the image features using affine transforms to simulate various feature distributions under different domains in the training stage. To capture variations of the feature distributions under different domains, we further apply a learning-to-learn approach to search for the hyper-parameters of the feature-wise transformation layers. We conduct extensive experiments and ablation studies under the domain generalization setting using five few-shot classification datasets: mini-ImageNet, CUB, Cars, Places, and Plantae. Experimental results demonstrate that the proposed feature-wise transformation layer is applicable to various metric-based models, and provides consistent improvements on the few-shot classification performance under domain shift.


1 Introduction

Few-shot classification (Lake et al., 2015) aims to recognize instances from novel categories (query instances) with only a few labeled examples in each class (support examples). Among various recent approaches for addressing the few-shot classification problem, metric-based meta-learning methods (Garcia and Bruna, 2018; Sung et al., 2018; Vinyals et al., 2016; Snell et al., 2017; Oreshkin et al., 2018) have received considerable attention due to their simplicity and effectiveness. In general, metric-based few-shot classification methods make the prediction based on the similarity between the query image and the support examples. As illustrated in Figure 1, metric-based approaches consist of 1) a feature encoder E and 2) a metric function M. Given an input task consisting of few labeled images (the support set) and unlabeled images (the query set) from novel classes, the encoder first extracts the image features. The metric function then takes the features of both the labeled and unlabeled images as input and predicts the categories of the query images. Despite the success of recognizing novel classes sampled from the same domain as in the training stage (e.g., both training and testing on mini-ImageNet classes), Chen et al. (2019a) recently raised the issue that existing metric-based approaches often do not generalize well to categories from different domains. The ability to generalize to unseen domains, however, is of critical importance due to the difficulty of constructing large training datasets for rare classes (e.g., recognizing rare bird species in a fine-grained classification setting). As a result, understanding and addressing the domain shift problem for few-shot classification is of great interest.

Figure 1: Problem formulation and motivation. Metric-based meta-learning models usually consist of a feature encoder E and a metric function M. We aim to improve the generalization ability of models trained on seen domains to arbitrary unseen domains. The key observation is that the distributions of the image features extracted from tasks in the unseen domains are significantly different from those in the seen domains.

To alleviate the domain shift issue, numerous unsupervised domain adaptation techniques have been proposed (Pan and Yang, 2010; Chen et al., 2018; Tzeng et al., 2017). These methods focus on adapting the classifier of the same categories from the source to the target domain. Building upon the domain adaptation formulation, Dong and Xing (2018) relax the constraint and transfer knowledge across domains for recognizing novel categories in the one-shot setting. However, unsupervised domain adaptation approaches assume that numerous unlabeled images are available in the target domain during training. In many cases, this assumption is not realistic. For example, the cost and effort of collecting numerous images of rare bird species can be prohibitively high. On the other hand, domain generalization methods (Blanchard et al., 2011; Li et al., 2019) have been developed to learn classifiers that generalize well to multiple unseen domains without requiring access to data from those domains. Yet, existing domain generalization approaches aim at recognizing instances from the same categories as in the training stage.

In this paper, we tackle the domain generalization problem for recognizing novel categories in the few-shot classification setting. As shown in Figure 1(c), our key observation is that the distributions of the image features extracted from tasks in different domains are significantly different. As a result, during the training stage, the metric function may overfit to the feature distributions encoded only from the seen domains and thus fail to generalize to unseen domains. To address this issue, we propose to integrate feature-wise transformation layers into the feature encoder to modulate the feature activations with affine transformations. The use of these feature-wise transformation layers allows us to simulate various distributions of image features during the training stage, and thus improve the generalization ability of the metric function in the testing phase. Nevertheless, the hyper-parameters of the feature-wise transformation layers may require meticulous hand-tuning due to the difficulty of modeling the complex variation of image feature distributions across domains. In light of this, we develop a learning-to-learn algorithm to optimize the proposed feature-wise transformation layers: we optimize the layers so that, after the model is trained on the seen domains, it works well on unseen domains. We make the source code and datasets publicly available to stimulate future research in this field.[1]

[1] https://github.com/hytseng0509/CrossDomainFewShot

We make the following three contributions in this work:

  • We propose to use feature-wise transformation layers to simulate various image feature distributions extracted from the tasks in different domains. Our feature-wise transformation layers are method-agnostic and can be applied to various metric-based few-shot classification approaches for improving their generalization to unseen domains.

  • We develop a learning-to-learn method to optimize the hyper-parameters of the feature-wise transformation layers. In contrast to the exhaustive parameter hand-tuning process, the proposed learning-to-learn algorithm is capable of finding the hyper-parameters for the feature-wise transformation layers to capture the variation of image feature distribution across various domains.

  • We evaluate the performance of three metric-based few-shot classification models (including MatchingNet (Vinyals et al., 2016), RelationNet (Sung et al., 2018), and Graph Neural Networks (Garcia and Bruna, 2018)) with extensive experiments under the domain generalization setting. We show that the proposed feature-wise transformation layers can effectively improve the generalization ability of metric-based models to unseen domains. We also demonstrate further performance improvement with our learning-to-learn scheme for learning the feature-wise transformation layers.

2 Related Work

Few-shot classification.

Few-shot classification aims to learn to recognize novel categories with a limited number of labeled examples in each class. Significant progress has been made using the meta-learning based formulation. There are three main classes of meta-learning approaches for addressing the few-shot classification problem. First, recurrent-based frameworks (Rezende et al., 2016; Santoro et al., 2016) sequentially process and encode the few labeled images of novel categories. Second, optimization-based schemes (Finn et al., 2017; Rusu et al., 2019; Tseng et al., 2019; Vuorio et al., 2019) learn to fine-tune the model with the few example images by integrating the fine-tuning process in the meta-training stage. Third, metric-based methods (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Oreshkin et al., 2018; Sung et al., 2018; Lifchitz et al., 2019) classify the query images by computing the similarity between the query image and few labeled images of novel categories.

Among these three classes, metric-based methods have attracted considerable attention due to their simplicity and effectiveness. Metric-based few-shot classification approaches consist of 1) a feature encoder to extract features from both the labeled and unlabeled images and 2) a metric function that takes the image features as input and predicts the categories of the unlabeled images. For example, MatchingNet (Vinyals et al., 2016) applies cosine similarity along with a recurrent network, ProtoNet (Snell et al., 2017) utilizes the Euclidean distance, RelationNet (Sung et al., 2018) uses CNN modules, and GNN (Garcia and Bruna, 2018) employs graph convolution modules as the metric function. However, these metric functions may fail to generalize to unseen domains since the distributions of the image features extracted from tasks in various domains can be drastically different. Chen et al. (2019a) recently showed that the performance of existing few-shot classification methods degrades significantly under domain shift. Our work focuses on improving the generalization ability of metric-based few-shot classification models to unseen domains.

Domain adaptation.

Domain adaptation methods (Pan and Yang, 2010) aim to reduce the domain shift between the source and target domains. Since the emergence of domain adversarial neural networks (DANN) (Ganin et al., 2016), numerous frameworks have applied adversarial training to align the source and target distributions at the feature level (Tzeng et al., 2017; Chen et al., 2018; Hsu et al., 2020) or at the pixel level (Tsai et al., 2018; Hoffman et al., 2018; Bousmalis et al., 2017; Chen et al., 2019b; Lee et al., 2019a). Most domain adaptation frameworks, however, target adapting knowledge of the same categories learned from the source to the target domain and are thus less effective at handling novel categories as in the few-shot classification scenario. One exception is the work by Dong and Xing (2018), which addresses the domain shift issue in the one-shot learning setting. Nevertheless, these domain adaptation methods require access to unlabeled images in the target domain during training. Such an assumption may not be feasible in many applications due to the difficulty of collecting abundant examples of rare categories (e.g., rare bird species).

Domain generalization.

In contrast to domain adaptation frameworks, domain generalization (Blanchard et al., 2011) methods aim at generalizing from a set of seen domains to unseen domains without accessing instances from the unseen domains during the training stage. Before the emergence of learning-to-learn (i.e., meta-learning) approaches (Ravi and Larochelle, 2017; Finn et al., 2017), several methods had been proposed for tackling the domain generalization problem. Examples include extracting domain-invariant features from various seen domains (Blanchard et al., 2011; Li et al., 2018b; Muandet et al., 2013), fusing classifiers learned from the seen domains (Niu et al., 2015a, b), and decomposing classifiers into domain-specific and domain-invariant components (Khosla et al., 2012; Li et al., 2017a). Another stream of work learns to augment the input data with adversarial learning (Shankar et al., 2018; Volpi et al., 2018). Most recently, a number of methods apply the learning-to-learn strategy to simulate the generalization process in the training stage (Balaji et al., 2018; Li et al., 2018a, 2019). Our method adopts a similar approach to train the proposed feature-wise transformation layers. The application context, however, differs from prior work, as we focus on recognizing novel categories from unseen domains in few-shot classification. The goal of this work is to make few-shot classification algorithms robust to domain shifts.

Learning-based data augmentation.

Data augmentation methods are designed to increase the diversity of data for the training process. Unlike hand-crafted approaches such as horizontal flipping and random cropping, several recent approaches learn the data augmentation (Cubuk et al., 2019; DeVries and Taylor, 2017a; Lemley et al., 2017; Perez and Wang, 2017; Sixt et al., 2018; Tran et al., 2017). For instance, the SmartAugmentation (Lemley et al., 2017) scheme trains a network that combines multiple images from the same category. The Bayesian DA (Tran et al., 2017) method augments the data according to the distribution learned from the training set, and the RenderGAN (Sixt et al., 2018) model simulates realistic images using generative adversarial networks. In addition, the AutoAugment (Cubuk et al., 2019) algorithm learns the augmentation via reinforcement learning. Two recent frameworks (Shankar et al., 2018; Volpi et al., 2018) augment the data by modeling the variation across different domains with adversarial learning. Similar to these approaches for capturing the variations across multiple domains, we develop a learning-to-learn process to optimize the proposed feature-wise transformation layers for simulating various distributions of image features encoded from different domains.

Conditional normalization.

Conditional normalization aims to modulate activations via a learned affine transformation conditioned on external data (e.g., an image of an artwork for capturing a specific style). Conditional normalization methods, including conditional batch normalization (Dumoulin et al., 2017), adaptive instance normalization (Huang and Belongie, 2017), and SPADE (Park et al., 2019), are widely used in style transfer and image synthesis (Karras et al., 2019) tasks. In addition to image stylization and generation, conditional normalization has also been applied to align different data distributions for domain adaptation (Cariucci et al., 2017; Li et al., 2017b). In particular, the TADAM method (Oreshkin et al., 2018) applies conditional batch normalization to metric-based models for the few-shot classification task. TADAM aims to model the training task distribution within the same domain. In contrast, we focus on simulating various feature distributions from different domains.

Regularization for neural networks.

Adding some form of randomness in the training stage is an effective way to improve generalization (Srivastava et al., 2014; Wan et al., 2013; Larsson et al., 2017; DeVries and Taylor, 2017b; Zhang et al., 2018; Ghiasi et al., 2018). The proposed feature-wise transformation layer for modulating the feature activations of intermediate layers (by applying random affine transformations) can also be viewed as a way to regularize network training.

3 Methodology

3.1 Preliminaries

Few-shot classification and metric-based method.

The few-shot classification problem is typically characterized as N-way (number of categories) K-shot (number of labeled examples for each category). Figure 1 shows how a metric-based framework operates in the N-way K-shot few-shot classification task. A metric-based algorithm generally contains a feature encoder E and a metric function M. For each iteration during the training stage, the algorithm randomly samples N categories and constructs a task T. We denote an input image as x and its categorical label as y. A task T consists of a support set S = {(x_s, y_s)} and a query set Q = {(x_q, y_q)}, which are respectively formed by randomly selecting K support samples and several query samples from each of the N categories.
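
To make the episodic setup concrete, the following minimal sketch builds one such task from a plain list of labeled images. This is our own illustration under stated assumptions, not the authors' released code; the function name, signature, and default sizes are illustrative.

```python
import random
from collections import defaultdict

def sample_task(dataset, n_way=5, n_shot=5, n_query=16):
    """Construct one N-way K-shot task T = (support set S, query set Q).

    `dataset` is assumed to be an iterable of (image, label) pairs.
    """
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    # Randomly pick N categories, then K support and n_query query images each.
    classes = random.sample(sorted(by_class), n_way)
    support, query = [], []
    for task_label, cls in enumerate(classes):
        images = random.sample(by_class[cls], n_shot + n_query)
        # Relabel the sampled classes as 0..n_way-1 within the task.
        support += [(img, task_label) for img in images[:n_shot]]
        query += [(img, task_label) for img in images[n_shot:]]
    return support, query
```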

The feature encoder E first extracts the features from both the support and query images. The metric function M then predicts the category of each query image according to the encoded query image E(x_q), the encoded support images E(x_s), and the labels of the support images y_s. The process can be formulated as

$$\hat{y}_q = M\big(E(x_q),\; \{(E(x_s),\, y_s)\}_{(x_s, y_s)\in S}\big). \tag{1}$$

Finally, the training objective of a metric-based framework is the classification loss of the images in the query set,

$$L_{\mathrm{cls}}(T; E, M) = \frac{1}{|Q|}\sum_{(x_q, y_q)\in Q} \ell\big(\hat{y}_q,\, y_q\big), \tag{2}$$

where $\ell$ denotes the cross-entropy loss.

The main difference among various metric-based algorithms lies in the design choice of the metric function M. For instance, the MatchingNet (Vinyals et al., 2016) method utilizes long short-term memory (LSTM) modules, the RelationNet (Sung et al., 2018) model applies convolutional neural networks (CNNs), and the GNN (Garcia and Bruna, 2018) scheme uses graph convolutional networks.
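
As a concrete instance of Equations 1 and 2, the sketch below uses a ProtoNet-style Euclidean metric as the metric function M. It is a simplified stand-in for the learned metric functions above, assuming a generic `encoder` that maps an image batch to embedding vectors; the helper names are ours.

```python
import torch
import torch.nn.functional as F

def metric_predict(encoder, support_x, support_y, query_x, n_way):
    """Equation (1) with a ProtoNet-style Euclidean metric as M."""
    z_support = encoder(support_x)   # (n_way * n_shot, d)
    z_query = encoder(query_x)       # (n_query_total, d)

    # Class prototypes: mean support embedding per category.
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_way)])

    # Negative squared Euclidean distance as class logits.
    logits = -torch.cdist(z_query, prototypes) ** 2
    return logits

def classification_loss(logits, query_y):
    """Cross-entropy over the query set (equation 2)."""
    return F.cross_entropy(logits, query_y)
```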

Problem setting.

In this work, we address the few-shot classification problem under the domain generalization setting. We denote a domain D as a collection of few-shot classification tasks, and assume that a set of seen domains {D_1^seen, ..., D_N^seen} is available in the training phase. The goal is to learn a metric-based few-shot classification model using the seen domains such that the model generalizes well to an unseen domain D^unseen. For example, one can train the model with the mini-ImageNet (Ravi and Larochelle, 2017) dataset as well as some publicly available fine-grained few-shot classification domains, e.g., CUB (Welinder et al., 2010), and then evaluate the generalization ability of the model on an unseen plants domain. Note that our problem formulation does not access images in the unseen domain at the training stage.

Figure 2: Method overview. (a) We propose a feature-wise transformation layer to modulate the intermediate feature activations in the feature encoder E with scaling and bias terms sampled from Gaussian distributions parameterized by the hyper-parameters θ_γ and θ_β. During the training phase, we insert a collection of feature-wise transformation layers into the feature encoder to simulate feature distributions extracted from tasks in various domains. (b) We design a learning-to-learn algorithm to optimize the hyper-parameters θ_γ and θ_β of the feature-wise transformation layers by maximizing the performance of the applied metric-based model on the pseudo-unseen domain (bottom) after it is optimized on the pseudo-seen domain (top).

3.2 Feature-Wise Transformation Layer

Our focus in this work is to improve the generalization ability of metric-based few-shot classification models to arbitrary unseen domains. As shown in Figure 1, due to the discrepancy between the feature distributions extracted from tasks in the seen and unseen domains, the metric function M may overfit to the seen domains and fail to generalize to the unseen domains. To address this problem, we propose to integrate feature-wise transformation layers into the feature encoder E to augment the intermediate feature activations with affine transformations. Intuitively, the feature encoder integrated with the feature-wise transformation layers can produce more diverse feature distributions, which improves the generalization ability of the metric function M. As shown in Figure 2(a), we insert the feature-wise transformation layer after the batch normalization layer in the feature encoder E. The hyper-parameters θ_γ and θ_β indicate the standard deviations of the Gaussian distributions for sampling the affine transformation parameters. Given an intermediate feature activation map z in the feature encoder with dimension C × H × W, we first sample the scaling term γ and bias term β from Gaussian distributions,

$$\gamma \sim \mathcal{N}\big(\mathbf{1},\, \mathrm{softplus}(\theta_\gamma)\big), \qquad \beta \sim \mathcal{N}\big(\mathbf{0},\, \mathrm{softplus}(\theta_\beta)\big). \tag{3}$$

We then compute the modulated activation $\hat{z}$ as

$$\hat{z}_{c,h,w} = \gamma_c \times z_{c,h,w} + \beta_c, \tag{4}$$

where γ, β ∈ R^C and c, h, w index the channel and spatial dimensions. In practice, we insert the feature-wise transformation layers into the feature encoder at multiple levels.
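
A minimal PyTorch sketch of the layer defined by Equations 3 and 4 is given below. The module name is ours, and we assume the softplus parameterization above to keep the sampled standard deviations positive, together with the 0.3/0.5 initialization described in Section 4.2; treat it as an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWiseTransformation(nn.Module):
    """Feature-wise transformation layer (equations 3-4), sketched in PyTorch.

    theta_gamma and theta_beta are the learnable hyper-parameters theta_f;
    one scalar per feature channel.
    """

    def __init__(self, num_channels, init_gamma=0.3, init_beta=0.5):
        super().__init__()
        self.theta_gamma = nn.Parameter(torch.full((1, num_channels, 1, 1), init_gamma))
        self.theta_beta = nn.Parameter(torch.full((1, num_channels, 1, 1), init_beta))

    def forward(self, x):
        if not self.training:  # the layers are removed (identity) at test time
            return x
        # Sample per-sample, per-channel scaling and bias terms (equation 3),
        # broadcast over the spatial dimensions H and W.
        gamma = 1.0 + torch.randn_like(x[:, :, :1, :1]) * F.softplus(self.theta_gamma)
        beta = torch.randn_like(x[:, :, :1, :1]) * F.softplus(self.theta_beta)
        # Modulate the activation map (equation 4).
        return gamma * x + beta
```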

3.3 Learning the Feature-Wise Transformation Layers

While we can empirically determine the hyper-parameters of the feature-wise transformation layers, it remains challenging to hand-tune a generic set of parameters that is effective in different settings (i.e., different metric-based frameworks and different seen domains). To address this problem, we design a learning-to-learn algorithm to optimize the hyper-parameters of the feature-wise transformation layers. The core idea is that training the metric-based model integrated with the proposed layers on the seen domains should improve the performance of the model on the unseen domains.

We illustrate the process in Figure 2(b) and Algorithm 1. In each training iteration t, we sample a pseudo-seen domain D^ps and a pseudo-unseen domain D^pu from the set of seen domains. Given a metric-based model with feature encoder E (parameterized by θ_e) and metric function M (parameterized by θ_m), we first integrate the proposed layers, with hyper-parameters collectively denoted θ_f = {θ_γ, θ_β}, into the feature encoder. We then use the loss in Equation 2 to update the parameters of the metric-based model with a pseudo-seen task T^ps, namely

$$(\theta_e^t,\, \theta_m^t) = (\theta_e^{t-1},\, \theta_m^{t-1}) - \alpha \nabla_{\theta_e, \theta_m} L_{\mathrm{cls}}\big(T^{ps};\, \theta_e^{t-1}, \theta_m^{t-1}, \theta_f^{t-1}\big), \tag{5}$$

where α is the learning rate. We then measure the generalization ability of the updated metric-based model by 1) removing the feature-wise transformation layers from the model and 2) computing the classification loss of the updated model on a pseudo-unseen task T^pu, namely

$$L^{pu} = L_{\mathrm{cls}}\big(T^{pu};\, \theta_e^t, \theta_m^t\big). \tag{6}$$

Finally, as the loss L^pu reflects the effectiveness of the feature-wise transformation layers, we optimize the hyper-parameters by

$$\theta_f^t = \theta_f^{t-1} - \alpha \nabla_{\theta_f} L^{pu}. \tag{7}$$

Note that the metric-based model and the feature-wise transformation layers are jointly optimized in the training stage.

Algorithm 1: Learning-to-Learn Feature-Wise Transformation.
Require: Seen domains {D_1^seen, ..., D_N^seen}, learning rate α
Randomly initialize θ_e, θ_m, and θ_f
while training do
    Randomly sample non-overlapping pseudo-seen and pseudo-unseen domains D^ps and D^pu from the seen domains
    Sample a pseudo-seen task T^ps from D^ps and a pseudo-unseen task T^pu from D^pu
    // Update the metric-based model with the pseudo-seen task:
    Obtain θ_e^t, θ_m^t using Equation 5
    // Update the feature-wise transformation layers with the pseudo-unseen task:
    Obtain θ_f^t using Equations 6 and 7
end while
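
The sketch below implements one iteration of Algorithm 1 (Equations 5-7) in a functional style. `loss_with_ft` and `loss_without_ft` are hypothetical helpers of ours that evaluate Equation 2 with and without the feature-wise transformation layers, given explicit parameter lists. Note that the inner update must keep the autograd graph (`create_graph=True`) so that second-order gradients can flow back to θ_f; this is also why the LSTM re-implementation mentioned in Appendix A.2 is needed.

```python
import torch

def learn_to_learn_step(model_params, ft_params, loss_with_ft, loss_without_ft,
                        pseudo_seen_task, pseudo_unseen_task, lr):
    """One iteration of Algorithm 1, sketched under stated assumptions."""
    # Equation 5: inner update of the metric-based model on the pseudo-seen
    # task, keeping the graph so gradients can flow back to ft_params.
    seen_loss = loss_with_ft(model_params, ft_params, pseudo_seen_task)
    grads = torch.autograd.grad(seen_loss, model_params, create_graph=True)
    updated_params = [p - lr * g for p, g in zip(model_params, grads)]

    # Equation 6: evaluate the updated model on the pseudo-unseen task with
    # the feature-wise transformation layers removed.
    unseen_loss = loss_without_ft(updated_params, pseudo_unseen_task)

    # Equation 7: update the FT hyper-parameters via second-order gradients.
    ft_grads = torch.autograd.grad(unseen_loss, ft_params)
    with torch.no_grad():
        for p, g in zip(ft_params, ft_grads):
            p -= lr * g

    # Commit the inner update so the metric-based model keeps training.
    with torch.no_grad():
        for p, new in zip(model_params, updated_params):
            p.copy_(new)
```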

4 Experimental Results

4.1 Experimental Setups

We validate the efficacy of the proposed feature-wise transformation layers with three existing metric-based algorithms (Vinyals et al., 2016; Sung et al., 2018; Garcia and Bruna, 2018) under two experimental settings. First, we empirically determine the hyper-parameters of the feature-wise transformation layers and analyze their impact. We train the few-shot classification model on the mini-ImageNet (Ravi and Larochelle, 2017) domain and evaluate the trained model on four different domains: CUB (Welinder et al., 2010), Cars (Krause et al., 2013), Places (Zhou et al., 2017), and Plantae (Van Horn et al., 2018). Second, we demonstrate the importance of the proposed learning-to-learn scheme for optimizing the hyper-parameters of the feature-wise transformation layers. We adopt a leave-one-out setting: we select one unseen domain from CUB, Cars, Places, and Plantae, and the mini-ImageNet domain together with the remaining domains serves as the seen domains for training both the metric-based model and the feature-wise transformation layers using Algorithm 1. After training, we evaluate the trained model on the selected unseen domain.

Datasets.

We conduct experiments using five datasets: mini-ImageNet (Ravi and Larochelle, 2017), CUB (Welinder et al., 2010), Cars (Krause et al., 2013), Places (Zhou et al., 2017), and Plantae (Van Horn et al., 2018). Since the mini-ImageNet dataset serves as the seen domain for all experiments, we select the training iterations with the best accuracy on the validation set of the mini-ImageNet dataset for evaluation. More details of dataset processing are presented in Appendix A.1.

Implementation details.

We apply the feature-wise transformation layers to three metric-based frameworks: MatchingNet (Vinyals et al., 2016), RelationNet (Sung et al., 2018), and GNN (Garcia and Bruna, 2018). We use the public implementation from Chen et al. (2019a) to train both the MatchingNet and RelationNet models.[2] For the GNN approach, we integrate the official implementation of the graph convolutional network into Chen et al.'s implementation.[3] In all experiments, we adopt the ResNet-10 (He et al., 2016) model as the backbone network for the feature encoder E.

[2] https://github.com/wyharveychen/CloserLookFewShot
[3] https://github.com/vgsatorras/few-shot-gnn

We present the average results over multiple randomly sampled trials for all experiments. In each trial, we randomly sample N categories (e.g., 5 classes for 5-way classification). For each category, we randomly select K images (e.g., 1 image for 1-shot, 5 for 5-shot) for the support set S and several images for the query set Q. We discuss the implementation details in Appendix A.2.

Pre-trained feature encoder.

Prior to the few-shot classification training stage, we first pre-train the feature encoder by minimizing the standard cross-entropy classification loss on the training categories in the mini-ImageNet dataset. This strategy can significantly improve the performance of metric-based models and is widely adopted in several recent frameworks (Rusu et al., 2019; Gidaris and Komodakis, 2018; Lifchitz et al., 2019).
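
This pre-training step can be sketched as standard supervised training with a temporary linear head. In the sketch below, the 64 classes correspond to the mini-ImageNet training split (Table 3); the remaining names and values are illustrative assumptions, not the paper's exact recipe.

```python
import torch.nn as nn
import torch.optim as optim

def pretrain_encoder(encoder, feat_dim, train_loader, num_classes=64, lr=1e-3):
    """Cross-entropy pre-training of the feature encoder on the seen-domain
    training categories; the linear head is discarded afterwards."""
    head = nn.Linear(feat_dim, num_classes)
    optimizer = optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    criterion = nn.CrossEntropyLoss()

    encoder.train()
    for images, labels in train_loader:
        logits = head(encoder(images))
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return encoder  # the classification head is dropped for meta-training
```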

4.2 Feature-Wise Transformation with Manual Parameter Tuning

We train the model using the mini-ImageNet dataset and evaluate the trained model on four other unseen domains: CUB, Cars, Places, and Plantae. During the training stage, we add the proposed feature-wise transformation layers after the last batch normalization layer of each residual block in the feature encoder. We empirically set θ_γ and θ_β in all feature-wise transformation layers to 0.3 and 0.5, respectively. Table 1 shows that the metric-based models trained with the feature-wise transformation layers perform favorably against the individual baselines. We attribute the improved generalization to the proposed layers making the feature encoder produce more diverse feature distributions in the training stage. As a by-product, we also observe an improvement on the seen domain (i.e., mini-ImageNet), since there is still a slight discrepancy between the feature distributions extracted from the training and testing sets of the same domain. Note that we also compare the proposed method with several recent approaches (e.g., Lee et al. (2019b)) in Table 8 and Table 9. With the proposed feature-wise transformation layers, the GNN (Garcia and Bruna, 2018) model performs favorably against state-of-the-art frameworks on both the seen domain (i.e., mini-ImageNet) and the unseen domains.

5-way 1-Shot FT mini-ImageNet CUB Cars Places Plantae
MatchingNet -
RelationNet -
GNN -
5-way 5-Shot FT mini-ImageNet CUB Cars Places Plantae
MatchingNet -
RelationNet -
GNN -
Table 1: Few-shot classification results trained with the mini-ImageNet dataset. We train the model on the mini-ImageNet domain and evaluate the trained model on another domain. FT indicates that we apply the feature-wise transformation layers with empirically determined hyper-parameters to train the model.
5-way 1-Shot CUB Cars Places Plantae
MatchingNet -
FT
LFT
RelationNet -
FT
LFT
GNN -
FT
LFT
5-way 5-Shot CUB Cars Places Plantae
MatchingNet -
FT
LFT
RelationNet -
FT
LFT
GNN -
FT
LFT
Table 2: Few-shot classification results trained with multiple datasets. We use the leave-one-out setting to select the unseen domain and train the model as well as the feature-wise transformation layers using Algorithm 1. FT and LFT indicate applying the pre-determined and learning-to-learned feature-wise transformation, respectively.

4.3 Generalization from Multiple Domains

Here we validate the effectiveness of the proposed learning-to-learn algorithm for optimizing the hyper-parameters of the feature-wise transformation layers. We compare the metric-based models trained with the proposed learning procedure to those trained with the pre-determined feature-wise transformation layers. The leave-one-out setting is used to select one domain from CUB, Cars, Places, and Plantae as the unseen domain for evaluation; the mini-ImageNet and the remaining domains serve as the seen domains for training the model. Since we select the training iteration according to the validation performance on the mini-ImageNet domain, we do not consider mini-ImageNet as the unseen domain. We present the results in Table 2, where FT and LFT denote applying pre-determined feature-wise transformation layers and layers optimized with the proposed learning-to-learn algorithm, respectively. The models optimized with the proposed learning scheme outperform those trained with the pre-determined feature-wise transformation layers, since the optimized layers can better capture the variation of feature distributions across different domains. Table 1 and Table 2 show that the proposed feature-wise transformation layers, together with the learning-to-learn algorithm, effectively mitigate the domain shift problem for metric-based frameworks.

Note that since the proposed learning-to-learn approach optimizes the hyper-parameters via stochastic gradient descent, it may not find the global minimum that achieves the best performance. It is certainly possible to manually find another hyper-parameter setting that achieves better performance. However, this requires meticulous and computationally expensive hyper-parameter tuning. Specifically, the dimension of the hyper-parameters θ_γ^i and θ_β^i is C_i for the i-th feature-wise transformation layer in the feature encoder, where C_i is the number of feature channels in that layer. As there are multiple feature-wise transformation layers in the feature encoder E, we would need to perform the hyper-parameter search in a space whose dimension is twice the total number of feature channels across all such layers, which is prohibitively large in practice.
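
As a back-of-the-envelope calculation, if we assume the feature-wise transformation layers follow the last batch normalization of each ResNet-10 residual block with the standard stage widths (an assumption about the backbone configuration, not a figure from the paper), the search space is already on the order of thousands of dimensions:

```python
# One theta_gamma and one theta_beta scalar per feature channel.
channels = [64, 128, 256, 512]              # assumed channels per residual block
search_dim = sum(2 * c for c in channels)   # 2 * sum(C_i)
print(search_dim)                           # 1920-dimensional search space
```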


(a) RelationNet (b) RelationNet + FT (c) RelationNet + LFT
Figure 3: t-SNE visualization of the image features extracted from tasks in different domains. We show the t-SNE visualization of the features extracted by (a) the original feature encoder, (b) the feature encoder with pre-determined feature-wise transformation layers, and (c) the feature encoder with learning-to-learned feature-wise transformation layers.
Figure 4: Visualization of the feature-wise transformation layers. We show the quartile visualization of the scaling terms γ and bias terms β sampled from each feature-wise transformation layer optimized by the proposed learning-to-learn algorithm.

Visualizing feature-wise transformed features.

To demonstrate that the proposed feature-wise transformation layers can simulate various feature distributions extracted from tasks in different domains, we show the t-SNE visualizations of the image features extracted by the feature encoder of the RelationNet (Sung et al., 2018) model in Figure 3. The model is trained under the 5-way 5-shot classification setting on the mini-ImageNet, Cars, Places, and Plantae domains (i.e., corresponding to the fifth block of the second column in Table 2). We observe that the distance between features extracted from different domains becomes smaller with the help of the feature-wise transformation layers. Furthermore, the proposed learning-to-learn scheme helps the feature-wise transformation layers better capture the variation of feature distributions across domains, thus closing the domain gap and improving the generalization ability of metric-based models.
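
A visualization like Figure 3 can be reproduced, in spirit, by pooling encoder outputs from tasks of each domain and projecting them jointly with t-SNE. The sketch below uses scikit-learn; `features_by_domain` is a hypothetical data layout of ours, a mapping from a domain name to an (n_samples, feat_dim) array of E(x) outputs.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features_2d(features_by_domain, perplexity=30):
    """Jointly project per-domain encoder features into 2-D for plotting."""
    names, blocks = zip(*features_by_domain.items())
    embedded = TSNE(n_components=2, perplexity=perplexity).fit_transform(
        np.concatenate(blocks))
    # Split back per domain so each can be drawn in its own color.
    splits = np.cumsum([len(b) for b in blocks])[:-1]
    return dict(zip(names, np.split(embedded, splits)))
```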

Visualizing feature-wise transformation layers.

To better understand how the learned feature-wise transformation layers operate, we show the values of the scaling terms γ and bias terms β in each feature-wise transformation layer in Figure 4. The values of the scaling terms tend to become smaller in the deeper layers, particularly in the last residual block. On the other hand, the depth of the layer does not seem to have an apparent impact on the distributions of the bias terms β, and the distributions also differ across metric-based classification methods. These results suggest the importance of the proposed learning-to-learn algorithm, because no single set of hyper-parameters for the feature-wise transformation layers works well with all metric-based approaches.

5 Conclusions

We propose a method to effectively enhance metric-based few-shot classification frameworks under domain shifts. The core idea of our method lies in using the feature-wise transformation layer to simulate various feature distributions extracted from the tasks in different domains. We develop a learning-to-learn approach for optimizing the hyper-parameters of the feature-wise transformation layers by simulating the generalization process using multiple seen domains. From extensive experiments, we demonstrate that our technique is applicable to different metric-based few-shot classification algorithms and show consistent improvement over the baselines.

References

  • Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) MetaReg: towards domain generalization using meta-regularization. In NeurIPS.
  • G. Blanchard, G. Lee, and C. Scott (2011) Generalizing from several related classification tasks to a new unlabeled sample. In NIPS.
  • K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR.
  • F. M. Cariucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò (2017) AutoDIAL: automatic domain alignment layers. In ICCV.
  • W. Chen, Y. Liu, Z. Kira, Y. Wang, and J. Huang (2019a) A closer look at few-shot classification. In ICLR.
  • Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive Faster R-CNN for object detection in the wild. In CVPR.
  • Y. Chen, Y. Lin, M. Yang, and J. Huang (2019b) CrDoCo: pixel-level domain transfer with cross-domain consistency. In CVPR.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation policies from data. In CVPR.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • T. DeVries and G. W. Taylor (2017a) Dataset augmentation in feature space. In ICLR Workshop.
  • T. DeVries and G. W. Taylor (2017b) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • N. Dong and E. P. Xing (2018) Domain adaption in one-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
  • V. Dumoulin, J. Shlens, and M. Kudlur (2017) A learned representation for artistic style. In ICLR.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR 17 (1), pp. 2096–2030.
  • V. Garcia and J. Bruna (2018) Few-shot learning with graph neural networks. In ICLR.
  • G. Ghiasi, T. Lin, and Q. V. Le (2018) DropBlock: a regularization method for convolutional networks. In NeurIPS.
  • S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In CVPR.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • N. Hilliard, L. Phillips, S. Howland, A. Yankov, C. D. Corley, and N. O. Hodas (2018) Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376.
  • J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML.
  • H. Hsu, C. Yao, Y. Tsai, W. Hung, H. Tseng, M. Singh, and M. Yang (2020) Progressive domain adaptation for object detection. In WACV.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR.
  • A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba (2012) Undoing the damage of dataset bias. In ECCV.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Workshop.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
  • G. Larsson, M. Maire, and G. Shakhnarovich (2017) FractalNet: ultra-deep neural networks without residuals. In ICLR.
  • H. Lee, H. Tseng, Q. Mao, J. Huang, Y. Lu, M. K. Singh, and M. Yang (2019a) DRIT++: diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270.
  • K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019b) Meta-learning with differentiable convex optimization. In CVPR.
  • J. Lemley, S. Bazrafkan, and P. Corcoran (2017) Smart augmentation: learning an optimal data augmentation strategy. IEEE Access 5, pp. 5858–5869.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017a) Deeper, broader and artier domain generalization. In ICCV.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018a) Learning to generalize: meta-learning for domain generalization. In AAAI.
  • H. Li, S. Jialin Pan, S. Wang, and A. C. Kot (2018b) Domain generalization with adversarial feature learning. In CVPR.
  • Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou (2017b) Revisiting batch normalization for practical domain adaptation. In ICLR.
  • Y. Li, Y. Yang, W. Zhou, and T. M. Hospedales (2019) Feature-critic networks for heterogeneous domain generalization. In ICML.
  • Y. Lifchitz, Y. Avrithis, S. Picard, and A. Bursuc (2019) Dense classification and implanting for few-shot learning. In CVPR.
  • K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In ICML.
  • L. Niu, W. Li, and D. Xu (2015a) Multi-view domain generalization for visual recognition. In ICCV.
  • L. Niu, W. Li, and D. Xu (2015b) Visual recognition by learning from web data: a weakly supervised domain generalization approach. In ICCV.
  • B. Oreshkin, P. R. López, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS.
  • S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
  • T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR.
  • L. Perez and J. Wang (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.
  • S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In CVPR.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR.
  • D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra (2016) One-shot generalization in deep generative models. In ICML.
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. In ICLR.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In ICML.
  • S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi (2018) Generalizing across domains via cross-gradient training. In ICLR.
  • L. Sixt, B. Wild, and T. Landgraf (2018) RenderGAN: generating realistic labeled data. Frontiers in Robotics and AI 5, pp. 66.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NIPS.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR 15 (1), pp. 1929–1958.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In CVPR.
  • T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid (2017) A Bayesian data augmentation approach for learning deep models. In NIPS.
  • Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR.
  • H. Tseng, S. De Mello, J. Tremblay, S. Liu, S. Birchfield, M. Yang, and J. Kautz (2019) Few-shot viewpoint estimation. In BMVC.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR.
  • G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The iNaturalist species classification and detection dataset. In CVPR.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In NIPS.
  • R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese (2018) Generalizing to unseen domains via adversarial data augmentation. In NeurIPS.
  • R. Vuorio, S. Sun, H. Hu, and J. J. Lim (2019) Multimodal model-agnostic meta-learning via task-aware modulation. In NeurIPS.
  • L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using DropConnect. In ICML.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) mixup: beyond empirical risk minimization. In ICLR.
  • B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. TPAMI.

Appendix A Appendix

A.1 Dataset Collection

We use five few-shot classification datasets in our experiments: mini-ImageNet, CUB, Cars, Places, and Plantae. We follow the settings in Ravi and Larochelle (2017) and Hilliard et al. (2018) to process the mini-ImageNet and CUB datasets, respectively. For the other datasets, we process them by randomly splitting the classes. The numbers of training, validation, and testing categories for each dataset are summarized in Table 3.

Datasets mini-ImageNet CUB Cars Places Plantae
Source Deng et al. (2009) Welinder et al. (2010) Krause et al. (2013) Zhou et al. (2017) Van Horn et al. (2018)
# Training categories 64 100 98 183 100
# Validation categories 16 50 49 91 50
# Testing categories 20 50 49 91 50
Split setting Ravi and Larochelle (2017) Hilliard et al. (2018) randomly split randomly split randomly split
Table 3: Summary of the datasets (domains). We additionally collect and split the Cars, Places, and Plantae datasets.

A.2 Additional Implementation Details

We use the implementation and adopt the hyper-parameter settings from Chen et al. (2019a),[4] including the learning rate and the number of training iterations. For the feature-wise transformation layers, we additionally apply L2 regularization. The learning-to-learn scheme uses a fixed number of inner iterations per outer update.

[4] https://github.com/wyharveychen/CloserLookFewShot

Matching Network.

We cannot directly use the MatchingNet implementation from Chen et al. (2019a), since it relies on the PyTorch built-in LSTM module, which does not support second-order backpropagation. Without second-order backpropagation, we are unable to optimize the feature-wise transformation layers using the proposed learning-to-learn algorithm. As a result, we re-implement the LSTM module for the MatchingNet model. To verify the correctness of our implementation, we evaluate the 5-way 5-shot performance with the ResNet-10 (He et al., 2016) backbone network on the mini-ImageNet dataset (Ravi and Larochelle, 2017); our implementation achieves accuracy similar to the result reported by Chen et al. (2019a). We make the source code and datasets publicly available to foster future progress in this field.[5]

[5] https://github.com/hytseng0509/CrossDomainFewShot

A.3 Additional Experimental Results

Ablation study on the pre-trained feature encoder.

As described in Section 4.1, we pre-train the feature encoder by minimizing the cross-entropy classification loss over the training categories of the mini-ImageNet dataset. To understand the impact of this pre-training, we conduct an ablation study using the leave-one-out experiment described in Section 4.3. As shown in Table 4, pre-training the feature encoder substantially improves the few-shot classification performance of the metric-based frameworks. Note that such a pre-training process is also adopted by several recent frameworks (Rusu et al., 2019; Gidaris and Komodakis, 2018; Lifchitz et al., 2019) to boost few-shot classification performance.

1-Shot Pre-trained CUB Cars Places Plantae
MatchingNet -
RelationNet -
GNN -
5-Shot Pre-trained CUB Cars Places Plantae
MatchingNet -
RelationNet -
GNN -
Table 4: Ablation study on the pre-trained feature encoder. We use the leave-one-out setting to select the unseen domain and study the effectiveness of pre-training the feature encoder on the mini-ImageNet dataset.

Number of ways in testing stage.

In this experiment, we consider a practical scenario in which the number of ways in the testing phase differs from that in the training stage. Since the GNN (Garcia and Bruna, 2018) framework requires the number of ways to be consistent between training and testing, we evaluate the MatchingNet (Vinyals et al., 2016) and RelationNet (Sung et al., 2018) models under this setting. Table 5 reports the performance of the models trained on the mini-ImageNet, Cars, Places, and Plantae domains under the 5-way 5-shot setting (i.e., corresponding to the fourth and fifth blocks of the second column in Table 2). The feature-wise transformation layers optimized with the proposed learning-to-learn algorithm improve the generalization of metric-based models to the unseen domain under various numbers of ways in the testing stage.

5-Shot CUB 2-way CUB 5-way CUB 10-way CUB 20-way
MatchingNet -
FT
LFT
RelationNet -
FT
LFT
Table 5: Few-shot classification results under various numbers of ways in the testing stage. We compare the 5-shot performance under various numbers of ways in the testing phase. The CUB dataset is selected as the testing (unseen) domain. All models are trained under the 5-way 5-shot setting.

Pre-determined hyper-parameters of feature-wise transformation layers.

In this experiment, we demonstrate the difficulty of hand-tuning the hyper-parameters of the proposed feature-wise transformation layers. Different from the setting described in Section 4.2, we set the hyper-parameters θ_γ and θ_β in all feature-wise transformation layers to 1. The model is trained under the 5-way setting using the mini-ImageNet domain and evaluated on the other domains. We report the 1-shot and 5-shot performance in Table 6, denoting the feature-wise transformation layers with {θ_γ = 0.3, θ_β = 0.5} as FT and those with {θ_γ = 1, θ_β = 1} as FT*. We observe that the metric-based models with FT perform favorably against those with FT*. In several cases, applying FT* even yields inferior results compared to the original training without the feature-wise transformation layers. This suggests the difficulty of hand-tuning the hyper-parameters and the importance of the proposed learning-to-learn scheme for optimizing the hyper-parameters of the feature-wise transformation layers.

1-Shot mini-ImageNet CUB Cars Places Plantae
MatchingNet FT
FT*
RelationNet FT
FT*
GNN FT
FT*
5-Shot mini-ImageNet CUB Cars Places Plantae
MatchingNet FT
FT*
RelationNet FT
FT*
GNN FT
FT*
Table 6: Few-shot classification results with different pre-determined hyper-parameters of the feature-wise transformation layers. We train the model on mini-ImageNet with different pre-determined hyper-parameters. FT and FT* indicate applying the feature-wise transformation layers with hyper-parameters {θ_γ = 0.3, θ_β = 0.5} and {θ_γ = 1, θ_β = 1}, respectively.

Hyper-parameter initialization for learning-to-learn.

For all experiments, we initialize the hyper-parameters θ_γ and θ_β to 0.3 and 0.5, the values we empirically determined in Section 4.2, before training the feature-wise transformation layers. In practice, we find that the cross-domain performance is not sensitive to the initialization as long as the initialized values are within the same order of magnitude. As an example, we train the RelationNet model with a different initialization of the same order, using the CUB dataset as the unseen domain and conducting the training described in Algorithm 1. The resulting classification accuracy on the CUB dataset is similar to the one we report in Table 2.

Learning-to-learn using a single domain.

The proposed learning-to-learn method requires multiple seen domains for training. Here we apply the learning-to-learn method with only a single domain. More specifically, in each iteration of the training process described in Algorithm 1, we randomly sample two different tasks from the mini-ImageNet dataset; one serves as the pseudo-seen task, while the other serves as the pseudo-unseen task. We train the RelationNet model with this setting on 5-way 5-shot classification using the mini-ImageNet dataset only. As shown in Table 7, the performance improvement of the RelationNet model is not significant compared to the model trained with pre-determined hyper-parameters. This suggests that using a single domain for learning the feature-wise transformation layers is not as effective as using multiple domains (demonstrated in Table 2). The reason is that, during the training phase, the discrepancy between the feature distributions extracted from the pseudo-seen and pseudo-unseen tasks is not significant, since the two tasks are sampled from the same domain. As a result, the hyper-parameters cannot be effectively optimized to capture the variation of feature distributions across domains.

5-way 5-Shot mini-ImageNet CUB Cars Places Plantae
RelationNet FT
RelationNet LFT*
Table 7: Few-shot classification results of applying the learning-to-learn approach with a single seen domain. We conduct the proposed learning-to-learn training with a single seen domain, denoted as LFT*. We train the model using the mini-ImageNet dataset and report the 5-way 5-shot classification accuracy.

Comparison with the state-of-the-art few-shot classification on the mini-ImageNet.

We compare the metric-based frameworks equipped with the proposed feature-wise transformation layers to state-of-the-art few-shot classification methods in Table 8. In this experiment, we train the models with the pre-determined hyper-parameters of the feature-wise transformation layers on the training set of the mini-ImageNet (Ravi and Larochelle, 2017) dataset. Note that we do not use the learned version of the feature-wise transformation layers in this experiment, to ensure a fair comparison. Combining Table 4 and Table 8, we observe that metric-based frameworks trained with 1) a pre-trained feature encoder and 2) feature-wise transformation layers with carefully hand-tuned hyper-parameters demonstrate competitive performance.

backbone method 5-way 1-shot 5-way 5-shot
ResNet-12 TADAM (Oreshkin et al., 2018)
DC (Lifchitz et al., 2019)
DC + IMP (Lifchitz et al., 2019) -
MetaOptNet-SVM-trainval (Lee et al., 2019b)
WRN-28 Qiao et al. (2018)
LEO (Rusu et al., 2019)
ResNet-10 MatchingNet -
FT
RelationNet -
FT
GNN -
FT
Table 8: Comparison to state-of-the-art few-shot classification algorithms. We compare the metric-based frameworks applied with the proposed feature-wise transformation layers using pre-determined hyper-parameters (denoted as FT) to other state-of-the-art few-shot classification methods. Note that all methods are trained only on the mini-ImageNet dataset; to ensure fair comparisons, we do not use the learned version of the feature-wise transformation layers described in Section 3.3. By augmenting existing metric-based few-shot classification models with the proposed feature-wise transformation layers, we obtain competitive performance compared with many recent and more complicated methods. The best results in each block are highlighted in bold.

Comparison to the state-of-the-art few-shot classification under domain shift.

We compare the metric-based frameworks with the proposed feature-wise transformation layers against the state-of-the-art MetaOptNet approach (Lee et al., 2019b). We use the model trained on the mini-ImageNet dataset provided by the authors for the evaluation on the other datasets.[6] As shown in Table 9, while the MetaOptNet method achieves state-of-the-art performance on the mini-ImageNet dataset, it also suffers from domain shift in the cross-domain setting. Training the GNN framework with the pre-trained feature encoder and the proposed feature-wise transformation layers performs favorably against the MetaOptNet method under the cross-domain setting.

[6] https://github.com/kjunelee/MetaOptNet

method CUB Cars Places Plantae
MetaOptNet-SVM-trainval
MatchingNet -
FT
RelationNet -
FT
GNN -
FT
Table 9: Evaluation against a state-of-the-art approach under the cross-domain setting. We evaluate the metric-based frameworks with the proposed feature-wise transformation layers using pre-determined hyper-parameters (denoted as FT) against the state-of-the-art MetaOptNet-SVM-trainval (Lee et al., 2019b) method. Note that all methods are trained only on the mini-ImageNet dataset; to ensure fair comparisons, we do not use the learned version of the feature-wise transformation layers described in Section 3.3. By augmenting the existing metric-based few-shot classification models with the proposed feature-wise transformation layers, we obtain competitive performance compared with recent and more complicated methods. The best results are highlighted in bold.