Few-shot classification [10] aims to classify instances from unseen classes with few labeled samples in each class. To this end, many meta-learning based models elaborately design various task-shared inductive biases (e.g., the metric function, the inference mechanism [4, 14]) to solve few-shot classification tasks. They demonstrate promising performance when evaluated on tasks from the same domain as the training tasks (e.g., both training and testing are on the mini-ImageNet classes). However, some works [3, 8]
have shown that existing meta-learning models perform poorly when there is a domain shift between training tasks and test tasks (e.g., training on the mini-ImageNet classes and testing on the ISIC classes), and even underperform traditional pre-training and fine-tuning. As a result, the cross-domain few-shot classification problem has attracted considerable attention from the machine learning community, especially the difficult single domain generalization problem [19, 8].
To generalize to unseen domains without accessing any data from those domains, some domain generalization models have been proposed [20, 12]. They learn classifiers that generalize to the unseen domains, and assume that the source and unseen domains share the same classes. However, in the few-shot classification problem, the classes in the target tasks have not been seen before. The works most similar to ours are [19] and [17], which aim to improve the performance of meta-learning models in cross-domain tasks. [19] introduces feature-wise transformation layers for metric-based meta-learning models, which modulate the feature activations with affine transformations to improve the robustness of the metric functions. But as mentioned above, different meta-learning models have various inductive biases, not just metric functions. [17] uses explanation-guided training to prevent the feature extractor from overfitting to specific classes, but it requires manually deriving the explanations for different meta-learning models.
We aim to find a method that is general, easy to implement, and able to improve the robustness of various inductive biases. To this end, we resort to task augmentation techniques, which construct ’challenging’ virtual tasks to increase the diversity of training tasks. For image classification, various hand-crafted data augmentation techniques (e.g., horizontal flip, random crop and color jitter) can be used for task augmentation. However, they have limited effect and cannot perform adaptive augmentation for different inductive biases. Recently, some works [16, 20] proposed adaptive sample (e.g., image) augmentation methods to improve the robustness of the model. Inspired by these works, we propose an inductive bias-adaptive task augmentation method to improve the cross-domain generalization ability of meta-learning models.
Concretely, we consider the worst-case problem around the source task distribution $\mathcal{T}_0$:
$$\min_{\theta \in \Theta} \; \sup_{\mathcal{T}:\, D(\mathcal{T}, \mathcal{T}_0) \le \rho} \; \mathbb{E}_{T \sim \mathcal{T}}\left[\mathcal{L}(T; \theta)\right] \tag{1}$$
where $\theta \in \Theta$ represents the model parameters, $\mathcal{L}(T; \theta)$ is the loss function, which depends on the model’s inductive bias, and $D(\mathcal{T}, \mathcal{T}_0)$ is the distance metric between task distributions. Compared with minimizing the loss function on the source task distribution $\mathcal{T}_0$ alone, the solution to the worst-case problem (1) guarantees good performance on the wider space of task distributions that are within distance $\rho$ of $\mathcal{T}_0$, as illustrated in Figure 1. By solving the worst-case problem (1), we propose a task augmentation method. Since the loss function depends on the inductive bias, our method can adaptively generate ’challenging’ tasks according to the different inductive biases and increase the diversity of training tasks, which improves the robustness of the model under domain shift. Moreover, our method can be used as a plug-and-play module for various meta-learning models.
The main contributions of this work are as follows:
- To the best of our knowledge, this is the first work that introduces task augmentation into cross-domain few-shot classification to improve the generalization ability of meta-learning models under domain shift.
- We consider the worst-case problem around the source task distribution $\mathcal{T}_0$, and propose a plug-and-play inductive bias-adaptive task augmentation method, which can be conveniently used for various meta-learning models.
- We evaluate our method on the RelationNet [18], the GNN [4] and one of the state-of-the-art models, TPN [14], with extensive experiments under the cross-domain setting. Experimental results show that our method significantly improves the cross-domain generalization performance of these models and outperforms [19] and [17]. Moreover, under the same settings, the meta-learning models with our adversarial task augmentation module outperform traditional pre-training and fine-tuning under domain shift.
2 Related Work
Cross-domain few-shot classification.
Although various meta-learning models for few-shot classification have achieved impressive performance, they fail to generalize to unseen domains. To address this, [19] uses feature-wise transformation layers to simulate various distributions of image features during training and thus improve the generalization capability of the metric function. [17] uses explanation methods to upscale the features that are more relevant to the prediction, and penalizes them more when overfitting occurs, to prevent the intermediate features from specializing towards fixed classes. Different from them, we focus on improving the robustness of various inductive biases. Other models [13, 21] that appear in the CVPR 2020 Cross-Domain Few-Shot Learning Challenge use various techniques to solve cross-domain few-shot classification tasks, e.g., batch spectral regularization, model ensembles and a large margin mechanism.
Domain generalization methods [20, 12] have been developed to generalize from single or multiple seen domains to unseen domains without accessing samples from them. However, these models consider the setting in which the seen and unseen domains share the same categories. In contrast, in the cross-domain few-shot classification problem, the seen and the unseen domains have completely disjoint categories.
Adversarial training [7] aims to make deep neural networks resistant to adversarial attacks. [16] proposes principled adversarial training through distributionally robust optimization, where virtual images are model-adaptively generated by maximizing some risk, and the models learned with these new images become more robust. In this work, we introduce a similar model-adaptive augmentation method into meta-learning models, and propose a plug-and-play module that generates virtual ’challenging’ tasks to improve the robustness of various meta-learning models.
3.1.1 Few-Shot Classification
Each few-shot classification task $T$ consists of a support set $S$ and a query set $Q$. If the support set $S$ contains $N$ classes with $K$ samples in each class, the few-shot classification task is called $N$-way $K$-shot. The query set $Q$ contains samples from the same classes as the support set $S$. Formally, a few-shot task can be defined as $T = (S, Q)$, where $S = \{(x_i^s, y_i^s)\}_{i=1}^{N \times K}$ and $Q = \{(x_j^q, y_j^q)\}_{j=1}^{N_q}$, with $N_q$ denoting the number of query samples. Given the support set $S$, our goal is to classify the samples in the query set $Q$ correctly into one of the $N$ classes. Typically, a base learner $\mathcal{A}$ is needed to output the optimal classifier of the task based on the support set $S$, i.e., $\mathcal{A}(S; \theta)$, and it depends on the inductive bias.
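The episode structure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' data pipeline: the function name `sample_episode`, the toy feature array, and the relabeling of the sampled classes to `0..N-1` are assumptions made here for the example.

```python
import numpy as np

def sample_episode(features, labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task: a support set and a query set (hypothetical helper)."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for new_label, c in enumerate(classes):
        # Shuffle the indices of class c, then split into support and query samples.
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + n_query]
        sx.append(features[idx[:k_shot]])
        qx.append(features[idx[k_shot:]])
        sy += [new_label] * k_shot      # classes are relabeled 0..n_way-1 inside the task
        qy += [new_label] * n_query
    support = (np.concatenate(sx), np.array(sy))
    query = (np.concatenate(qx), np.array(qy))
    return support, query

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))        # toy data: 10 classes x 20 samples, 8-dim features
labels = np.repeat(np.arange(10), 20)
(support_x, support_y), (query_x, query_y) = sample_episode(
    features, labels, n_way=5, k_shot=1, n_query=16, rng=rng
)
```

Because labels are reassigned inside each task, the same label index can correspond to different real categories in different tasks.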
The main difference among meta-learning models for few-shot classification lies in the design choices for the inductive bias. For example, the RelationNet [18] chooses a metric function based on convolutional neural networks (CNNs), the GNN [4] applies a generic message-passing inference mechanism on a partially observed graphical model, and the TPN [14] utilizes transductive label propagation. Meta-learning models aim to learn these inductive biases over a collection of tasks which are assumed to be sampled from the task distribution $\mathcal{T}_0$, and the learning objective is
$$\min_{\theta} \; \mathbb{E}_{T \sim \mathcal{T}_0}\left[\ell\big(Q; \mathcal{A}(S; \theta)\big)\right] \tag{2}$$
where $\mathcal{A}(S; \theta)$ is the classifier that the base learner outputs from the support set $S$, $\ell$ is the loss function, such as the classification loss of the samples in the query set $Q$, and $\theta$ represents the model parameters.
3.1.2 Cross-Domain Setting
Generally, the target tasks are assumed to come from the source task distribution $\mathcal{T}_0$. However, in this work we consider few-shot classification under domain shift. Concretely, we focus on the single domain generalization problem, because data from multiple training domains may not always be available due to data acquisition budgets or privacy issues. We define a domain as a distribution over few-shot classification tasks. The target tasks come from several unknown domains. The goal is to learn a meta-learning model using the single source domain $\mathcal{T}_0$, such that the model can generalize to the several unseen domains.
means using our adversarial task augmentation. Marked in bold are the best results in each block, as well as other results with an overlapping confidence interval.
3.2 Adversarial Task Augmentation
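Next, we solve the worst-case problem (1) to get a plug-and-play model-adaptive task augmentation module. To make the loss function depend on the inductive bias of the meta-learning models, inspired by Equation (2), we define it as
$$\mathcal{L}(T; \theta) = \ell\big(Q; \mathcal{A}(S; \theta)\big),$$
where $T = (S, Q)$, $\ell$ is the query-set loss in Equation (2), and $\mathcal{A}(S; \theta)$ is the classifier that the base learner outputs from the support set $S$.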
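To allow task distributions whose support differs from that of the source task distribution $\mathcal{T}_0$, we use the Wasserstein distance as the metric $D$. Concretely, for task distributions $\mathcal{T}$ and $\mathcal{T}_0$ both supported on the task space $\mathcal{H}$, let $\Pi(\mathcal{T}, \mathcal{T}_0)$ denote their couplings, meaning measures $M$ on $\mathcal{H}^2$ with $M(A, \mathcal{H}) = \mathcal{T}(A)$ and $M(\mathcal{H}, A) = \mathcal{T}_0(A)$. The Wasserstein distance between $\mathcal{T}$ and $\mathcal{T}_0$ is
$$D(\mathcal{T}, \mathcal{T}_0) = \inf_{M \in \Pi(\mathcal{T}, \mathcal{T}_0)} \mathbb{E}_{(T, T_0) \sim M}\left[c(T, T_0)\right]$$
where $c: \mathcal{H} \times \mathcal{H} \to \mathbb{R}_+$ is the transportation cost from $T_0$ to $T$, satisfying $c(T, T_0) \ge 0$ and $c(T, T) = 0$.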
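As a small worked illustration of the coupling-based definition, consider the simplest possible case, which is an assumption made purely for this example: two empirical distributions on the real line with the same number of equally weighted points and cost $c(a, b) = |a - b|$. In that special case the optimal coupling matches the sorted points, so the distance reduces to a mean of absolute differences. The function name `wasserstein_1d` is hypothetical, and tasks in the paper are high-dimensional vectors rather than scalars.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein distance between two equal-size 1-D empirical distributions, cost |a - b|."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes the same number of points on both sides")
    # In 1-D with a convex cost, matching sorted points is an optimal coupling.
    return float(np.mean(np.abs(a - b)))

source = [0.0, 2.0, 5.0]
shifted = [1.0, 3.0, 4.0]
distance = wasserstein_1d(source, shifted)
```

The distance of a distribution to itself is zero, in line with the requirement $c(T, T) = 0$ on the transportation cost.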
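Lemma 1. Let $\mathcal{L}: \mathcal{H} \times \Theta \to \mathbb{R}$ and $c: \mathcal{H} \times \mathcal{H} \to \mathbb{R}_+$ be continuous. Let $\phi_\gamma(\theta; T_0) = \sup_{T \in \mathcal{H}} \{\mathcal{L}(T; \theta) - \gamma c(T, T_0)\}$ be the cross-domain surrogate. For any distribution $\mathcal{T}_0$ and any $\rho > 0$,
$$\sup_{\mathcal{T}:\, D(\mathcal{T}, \mathcal{T}_0) \le \rho} \mathbb{E}_{T \sim \mathcal{T}}\left[\mathcal{L}(T; \theta)\right] = \inf_{\gamma \ge 0} \left\{\gamma \rho + \mathbb{E}_{T_0 \sim \mathcal{T}_0}\left[\phi_\gamma(\theta; T_0)\right]\right\},$$
and for any $\gamma \ge 0$, we have
$$\sup_{\mathcal{T}} \left\{\mathbb{E}_{T \sim \mathcal{T}}\left[\mathcal{L}(T; \theta)\right] - \gamma D(\mathcal{T}, \mathcal{T}_0)\right\} = \mathbb{E}_{T_0 \sim \mathcal{T}_0}\left[\phi_\gamma(\theta; T_0)\right].$$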
Thus, the loss function $\mathcal{L}$ and the transportation cost $c$ must be continuous with respect to the task $T$ in order to solve the worst-case problem (1). For this, we model the task $T$ as a vector of fixed dimension. A common approach is to model tasks with task embeddings, but it is not applicable here, for two reasons: 1) it conflicts with the definition of the loss function, i.e., calculating $\mathcal{L}(T; \theta)$ requires the support set $S$ and the query set $Q$, not the task embedding; 2) we expect $\mathcal{T}$ and $\mathcal{T}_0$ to be distributions over tasks, so that virtual tasks can be generated, rather than distributions over task embeddings. We therefore treat each task as the vector obtained by concatenating the samples and labels it contains, i.e.,
$$T = (x_1 \oplus y_1) \oplus (x_2 \oplus y_2) \oplus \cdots \oplus (x_n \oplus y_n) \tag{7}$$
where $(x_i, y_i)$, $i = 1, \ldots, n$, are the samples and labels in the support and query sets and $\oplus$ denotes the concatenation operation. This definition is equivalent to treating the distribution of the tasks as the joint distribution of the samples and labels within the task, i.e., $\mathcal{T}(T) = P(x_1, y_1, \ldots, x_n, y_n)$.
Meanwhile, we assume that the number of samples in a task is fixed, and so is the dimension of $T$. Changing the elements of the samples in task $T$ changes $T$ itself, so the continuity of $\mathcal{L}$ and $c$ with respect to $T$ can be satisfied. Another reason for assuming a fixed number of samples in a task is that we want the generated virtual task to contain the same number of samples as the source task $T_0$.
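The fixed-dimension task vector can be illustrated with a short sketch. The helper name `task_to_vector` and the one-hot encoding of the labels are assumptions made for this example; the source only states that samples and labels are concatenated.

```python
import numpy as np

def task_to_vector(samples, labels, n_classes):
    """Concatenate every (flattened sample, one-hot label) pair into one fixed-length vector."""
    one_hot = np.eye(n_classes)[labels]          # assumption: labels are encoded one-hot
    parts = [np.concatenate([x.ravel(), y]) for x, y in zip(samples, one_hot)]
    return np.concatenate(parts)

samples = np.arange(12, dtype=float).reshape(3, 2, 2)   # three toy 2x2 "images"
labels = np.array([0, 1, 0])
task_vector = task_to_vector(samples, labels, n_classes=2)
```

With three samples of four pixels each and two classes, the vector has 3 x (4 + 2) = 18 entries, and its dimension stays fixed as long as the number of samples per task is fixed.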
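In the worst-case problem (1), the supremum over task distributions is intractable, so we consider its Lagrangian relaxation with penalty parameter $\gamma \ge 0$:
$$\min_{\theta \in \Theta} \; \sup_{\mathcal{T}} \left\{\mathbb{E}_{T \sim \mathcal{T}}\left[\mathcal{L}(T; \theta)\right] - \gamma D(\mathcal{T}, \mathcal{T}_0)\right\}.$$
Applying Lemma 1, our optimization problem becomes
$$\min_{\theta \in \Theta} \; \mathbb{E}_{T_0 \sim \mathcal{T}_0}\left[\phi_\gamma(\theta; T_0)\right], \qquad \phi_\gamma(\theta; T_0) = \sup_{T \in \mathcal{H}} \left\{\mathcal{L}(T; \theta) - \gamma c(T, T_0)\right\}.$$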
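Lemma 2. Let $\mathcal{L}(T; \theta)$ be $L$-Lipschitz smooth in $T$ and $c(\cdot, T_0)$ be 1-strongly convex for each $T_0 \in \mathcal{H}$. If $\gamma > L$, there is a unique $T^\star$ satisfying
$$T^\star = \arg\sup_{T \in \mathcal{H}} \left\{\mathcal{L}(T; \theta) - \gamma c(T, T_0)\right\}, \qquad \nabla_\theta \phi_\gamma(\theta; T_0) = \nabla_\theta \mathcal{L}(T^\star; \theta). \tag{11}$$
In Lemma 2, $\gamma > L$ ensures that the function $\mathcal{L}(T; \theta) - \gamma c(T, T_0)$ is $(\gamma - L)$-strongly concave in $T$, so that there exists a unique $T^\star$.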
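A one-dimensional toy case may help make the role of the penalty parameter concrete. The sketch below is an illustration under assumptions chosen here, not the paper's setting: the "loss" is $\sin(t)$, which is 1-Lipschitz smooth, the cost is $\tfrac{1}{2}(t - t_0)^2$, which is 1-strongly convex, and the penalty is 2, so the penalized objective is strongly concave and plain gradient ascent converges to its unique maximizer.

```python
import math

def inner_maximizer(t0, gamma, steps=200, lr=0.1):
    """Gradient ascent on sin(t) - gamma/2 * (t - t0)^2, initialized at the source point t0."""
    t = t0
    for _ in range(steps):
        grad = math.cos(t) - gamma * (t - t0)   # derivative of the penalized objective
        t += lr * grad
    return t

t0, gamma = 0.3, 2.0            # gamma exceeds the smoothness constant 1 of sin
t_star = inner_maximizer(t0, gamma)
loss_source, loss_virtual = math.sin(t0), math.sin(t_star)
```

At the maximizer the gradient of the penalized objective vanishes, and the loss at the maximizer is at least the loss at the source point, because the penalized objective at the maximizer is no smaller than its value at `t0`, where the cost is zero.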
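By Lemma 2, the model parameters can be updated with the gradient evaluated at $T^\star$, i.e., $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(T^\star; \theta)$, where $\alpha$ is the learning rate. From Equation (11), we can draw two insights: 1) for the meta-learning models, the virtual task $T^\star$ is more ’challenging’ than the source task $T_0$ and the loss function satisfies $\mathcal{L}(T^\star; \theta) \ge \mathcal{L}(T_0; \theta)$, so a model learned with it tends to be more robust; 2) since the loss function $\mathcal{L}$ depends on the inductive bias, solving Equation (11) is equivalent to adaptively generating the virtual task that is more ’challenging’ to the currently learned inductive bias.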
For deep networks and other complex models, the supremum problem in Equation (11) cannot be solved exactly, so we use gradient ascent with early stopping to solve it approximately. Concretely, let the set of all samples in a task be $X$ and their corresponding labels be $Y$, i.e., $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_i\}_{i=1}^{n}$; then the source task is $T_0 = (X_0, Y_0)$ and the virtual task is $T = (X, Y)$. We use the source task $T_0$ as the initialization of $T$, and the task vector defined in Equation (7) as the optimization variable. Because samples with the same label can correspond to different real categories (e.g., cat, dog) in different few-shot classification tasks, changes to the labels are not considered here, i.e., we keep $Y = Y_0$. In the $i$-th iteration, the update is
$$X^{(i)} = X^{(i-1)} + \beta \nabla_{X} \mathcal{L}\big((X^{(i-1)}, Y_0); \theta\big), \qquad X^{(0)} = X_0,$$
where $\beta$ is the learning rate of the gradient ascent process.
Here the regularization term $\gamma c(T, T_0)$ is removed from the iteration objective, for two reasons: 1) this term constrains the proximity of the virtual task $T$ to the source task $T_0$, but using the source task as the initialization together with early stopping achieves the same effect (see Section 4.3 for a detailed discussion); 2) removing it reduces the computational overhead and the number of hyper-parameters that require hand-tuning. After $T_{\max}$ iterations, where $T_{\max}$ is the early-stopping budget, we obtain the virtual ’challenging’ task and update the model parameters $\theta$ with it. See Algorithm 1 for the full description of the training process. Given an unseen task, the inference process is the same as that of the original meta-learning model. Note that if $T_{\max} = 0$, Algorithm 1 becomes the original meta-learning training process, so our method is a plug-and-play module.
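The early-stopped gradient ascent described above can be sketched on a toy problem. Everything model-specific here is an assumption for illustration only: the task loss is a prototype classifier on a linear embedding (a stand-in for a real meta-learning model such as RelationNet, GNN or TPN), gradients are taken by finite differences instead of automatic differentiation, and the names `task_loss`, `numerical_grad` and `adversarial_task` are hypothetical.

```python
import numpy as np

def task_loss(sx, sy, qx, qy, theta):
    """Toy task loss: cross-entropy of a prototype classifier on the embedding x @ theta."""
    zs, zq = sx @ theta, qx @ theta
    protos = np.stack([zs[sy == c].mean(axis=0) for c in range(int(sy.max()) + 1)])
    logits = -((zq[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(qy)), qy].mean())

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of the scalar function f at the array x."""
    grad = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def adversarial_task(sx, sy, qx, qy, theta, n_iters, ascent_lr):
    """Gradient ascent on the samples only (labels stay fixed), stopped after n_iters steps."""
    sx, qx = sx.copy(), qx.copy()               # the source task is the initialization
    for _ in range(n_iters):
        g_s = numerical_grad(lambda s: task_loss(s, sy, qx, qy, theta), sx)
        g_q = numerical_grad(lambda q: task_loss(sx, sy, q, qy, theta), qx)
        sx, qx = sx + ascent_lr * g_s, qx + ascent_lr * g_q
    return sx, qx

rng = np.random.default_rng(0)
theta = 0.3 * rng.normal(size=(4, 3))           # toy "model parameters"
sy = np.array([0, 1, 2])
qy = np.array([0, 0, 1, 1, 2, 2])
sx = rng.normal(size=(3, 4))
qx = sx[qy] + 0.5 * rng.normal(size=(6, 4))

adv_sx, adv_qx = adversarial_task(sx, sy, qx, qy, theta, n_iters=5, ascent_lr=0.1)
loss_src = task_loss(sx, sy, qx, qy, theta)
loss_adv = task_loss(adv_sx, sy, adv_qx, qy, theta)

# One descent step of the model parameters on the virtual task.
theta_new = theta - 0.01 * numerical_grad(
    lambda th: task_loss(adv_sx, sy, adv_qx, qy, th), theta
)
```

With zero ascent iterations the function returns the source task unchanged, which mirrors the plug-and-play property noted in the text. No proximity penalty appears in the ascent step; closeness to the source task comes only from the initialization and the small number of iterations.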
In this paper, we mainly consider cross-domain few-shot image classification, for which convolutional neural networks (CNNs) are the necessary tools. However, CNNs tend to overfit to superficial local textures [5], so we use random convolutions [11], which change the local textures while keeping the shape unchanged, as an auxiliary augmentation technique for our adversarial task augmentation. Concretely, given an input image $x \in \mathbb{R}^{C \times H \times W}$, where $H$ and $W$ are the height and width and $C$ is the number of feature channels, the filter size $k$ is first randomly sampled from a candidate pool, and the filter weights are then sampled from the Xavier normal distribution [6]. In practice, for each task sampled from $\mathcal{T}_0$, we either keep all of its samples unchanged with a given probability, or apply the same random convolution to all of its samples to get a new task for training, as shown in the fourth line of Algorithm 1.
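A minimal NumPy sketch of this auxiliary augmentation follows. The candidate pool `(1, 3, 5, 7)`, the parameter name `keep_prob`, the zero padding, and the choice to keep the number of output channels equal to the number of input channels are all assumptions made for the example, since the source text does not preserve these details.

```python
import numpy as np

def random_conv(images, kernel_size, rng):
    """Apply one shared random convolution with Xavier-normal weights to images (n, C, H, W)."""
    n, c, h, w = images.shape
    fan_in = fan_out = c * kernel_size * kernel_size
    std = np.sqrt(2.0 / (fan_in + fan_out))                  # Xavier normal standard deviation
    weight = rng.normal(0.0, std, size=(c, c, kernel_size, kernel_size))
    pad = kernel_size // 2                                   # zero padding keeps H and W (odd sizes)
    padded = np.pad(images, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(images.shape, dtype=float)
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            patch = padded[:, :, dy:dy + h, dx:dx + w]       # shifted view, shape (n, C, H, W)
            out += np.einsum("oc,nchw->nohw", weight[:, :, dy, dx], patch)
    return out

def augment_task(images, keep_prob, kernel_pool, rng):
    """Keep the task unchanged with probability keep_prob, else filter all its samples alike."""
    if rng.random() < keep_prob:
        return images
    kernel_size = int(rng.choice(kernel_pool))               # hypothetical pool of odd sizes
    return random_conv(images, kernel_size, rng)

rng = np.random.default_rng(0)
task_images = rng.normal(size=(4, 3, 8, 8))                  # all samples of one toy task
augmented = augment_task(task_images, keep_prob=0.0, kernel_pool=(1, 3, 5, 7), rng=rng)
```

Using a single filter for every sample of a task changes the local texture consistently across the support and query sets, so the task stays internally coherent.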
In this section, we evaluate the adversarial task augmentation method on the RelationNet [18], the GNN [4] and one of the state-of-the-art meta-learning models, TPN [14], and compare it with [19] and [17]. These meta-learning models have different kinds of inductive bias, which allows us to verify the versatility and effectiveness of our method.
4.1 Experimental Settings
We conduct extensive experiments under cross-domain settings, using nine few-shot classification datasets: mini-ImageNet [15], CUB, Cars, Places, Plantae, CropDiseases, EuroSAT, ISIC and ChestX, which are introduced by [19] and [8]. Each dataset consists of train/val/test splits; please refer to these references for more details. We use the mini-ImageNet domain as the single source domain, and evaluate the trained model on the other eight domains. We select the model parameters with the best accuracy on the validation set of mini-ImageNet for model evaluation.
In all experiments, we use the ResNet-10  as the feature extractor and use the Adam optimizer with the learning rate . We find that setting or is sufficient to obtain satisfactory results, and we choose the learning rate of the gradient ascent process from . We set for all experiments and choose from . We evaluate the model in the 5-way 1-shot/5-shot settings using 2,000 randomly sampled episodes with 16 query samples per class, and report the average accuracy () as well as confidence interval.
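The reported statistics can be computed from per-episode accuracies as sketched below. The confidence level is not preserved in the text above, so the 95% level and the normal-approximation factor 1.96 are assumptions of this example, as is the helper name `mean_and_ci95`.

```python
import numpy as np

def mean_and_ci95(episode_accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% confidence interval."""
    acc = np.asarray(episode_accuracies, dtype=float)
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return float(acc.mean()), float(half_width)

mean_acc, half_width = mean_and_ci95([0.5, 0.7])   # two toy episodes
```

In an evaluation with many episodes, the list would hold one accuracy per sampled episode, and the result would be reported as the mean plus or minus the half-width.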
Pre-trained feature extractor.
Instead of optimizing from scratch, we apply an additional pre-training strategy as in [19], which pre-trains the feature extractor by minimizing the standard cross-entropy classification loss on the 64 training classes of the mini-ImageNet dataset.
4.2 Evaluation for Adversarial Task Augmentation
We apply the adversarial task augmentation module to the RelationNet, the GNN, and the TPN models to evaluate its effect on the cross-domain generalization ability of meta-learning models, and compare it with [19], which adds feature-wise transformation layers to the feature extractor, and [17], which uses explanation-guided training. All models are trained and tested in the same environment for a fair comparison, and the results are shown in Table 1.
We can observe that with our adversarial task augmentation module, the cross-domain few-shot classification accuracy of the meta-learning models is consistently and significantly improved. Compared with [19] and [17], our method achieves comparable or larger improvements, which suggests that adaptively enhancing different inductive biases is more effective than enhancing a manually chosen inductive bias. Moreover, applying the feature-wise transformation layers even harms the cross-domain generalization performance of the TPN model, while our method remains effective, which suggests that our method is more general and not only suitable for metric-based meta-learning models (the RelationNet and the GNN models).
4.3 Ablation Study
Effect of the random convolution.
As mentioned above, we use the random convolution for auxiliary task augmentation. Here we study its effect. Figure 2 shows the average few-shot classification accuracy on the eight unseen domains without the random convolution and with the complete method. As we can see, even without the random convolution, our method still improves the cross-domain generalization ability of the meta-learning models, and outperforms ’+FT’ [19] and ’+LRP’ [17]. Using the random convolution brings further improvements.
Is the regularization term useful?
In Section 3.2, we remove the regularization term from the iteration objective; here we show that this is reasonable. We consider two common candidates for the distance $c$ and find that they do not bring benefits. As we assumed, all few-shot classification tasks share the same label composition, so the distance between tasks $T$ and $T_0$ depends only on the samples $X$ and $X_0$. Both candidates are computed on the feature vectors of $X$ and $X_0$: the first is the direct sample-wise Euclidean distance, and the second is the maximum mean discrepancy (MMD) distance. Figure 3 shows the average few-shot classification accuracy on the eight unseen domains without the regularization term, with the sample-wise Euclidean distance regularization term, and with the MMD distance regularization term. We fix the penalty hyper-parameter $\gamma$ and do not use the random convolution, for a clear comparison. As we can see, using the regularization term brings no obvious benefit and can even be harmful, which shows that early stopping already imposes enough constraints, and that adding the regularization term over-constrains the virtual tasks.
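The two candidate distances can be sketched as follows. The exact formulas are not preserved in the text above, so both functions are assumptions: the first averages squared Euclidean distances between paired feature vectors, and the second is an MMD with a linear kernel, which reduces to the squared distance between the two feature means. The function names are hypothetical.

```python
import numpy as np

def samplewise_euclidean(feats, feats0):
    """Mean squared Euclidean distance between paired feature vectors (rows)."""
    return float(np.mean(np.sum((feats - feats0) ** 2, axis=1)))

def linear_mmd(feats, feats0):
    """Linear-kernel MMD: squared distance between the two sets' mean feature vectors."""
    return float(np.sum((feats.mean(axis=0) - feats0.mean(axis=0)) ** 2))

source_feats = np.array([[1.0, 0.0], [1.0, 0.0]])
virtual_feats = np.array([[0.0, 0.0], [2.0, 0.0]])
d_euclid = samplewise_euclidean(virtual_feats, source_feats)
d_mmd = linear_mmd(virtual_feats, source_feats)
```

In this toy case every sample moves by one unit, so the sample-wise distance is 1, while the feature means coincide and the linear MMD is 0. The two penalties therefore constrain the virtual task in different ways: one per sample, the other only at the level of the whole set.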
4.4 Comparison with Fine-tuning
[8] shows that in the cross-domain few-shot classification problem, traditional pre-training and fine-tuning outperform the meta-learning models. Here we re-examine this phenomenon through a different, fair comparison, i.e., using data augmentation while solving an unseen task. Given an unseen task consisting of support samples and query samples, for fine-tuning we use the pre-trained feature extractor as initialization and a fully connected layer as the classification head. For each epoch, we generate pseudo samples for each class based on the support samples using the data augmentation method from [21], and use these pseudo samples and the support samples for fine-tuning, where we use the SGD optimizer with learning rate 0.01 and momentum 0.9 as in [8]. For the meta-learning models, we use the parameters trained on mini-ImageNet with our adversarial task augmentation method as the initialization and adapt the meta-learning models to the same samples as above at each iteration, where the pseudo samples are used as the pseudo query set, and we use the Adam optimizer with learning rate 0.001. All models are fine-tuned for 30 (or 50) epochs in the 5-way 1-shot (or 5-shot) tasks. Since all models use the same amount of target domain data when solving each unseen task, it is a fair comparison. The results are shown in Table 2. As we can see, the meta-learning models with our adversarial task augmentation module significantly outperform traditional pre-training and fine-tuning even under domain shift.
In this paper, we aim to design a new method that improves the cross-domain generalization capability of meta-learning models in cross-domain few-shot learning. For this, we consider the worst-case problem around the source task distribution $\mathcal{T}_0$, and propose a plug-and-play inductive bias-adaptive task augmentation method, which significantly improves the cross-domain few-shot classification capability of various meta-learning models and outperforms existing works. This is the first work to achieve the above objective by generating ‘challenging’ virtual tasks. We also compare the meta-learning models with pre-training and fine-tuning under the same settings, and find that the meta-learning models with our method outperform fine-tuning under domain shift.
- [1] (2019) Quantifying distributional model risk via optimal transport. Mathematics of Operations Research 44 (2), pp. 565–600.
- [2] (2013) Perturbation analysis of optimization problems. Springer Science & Business Media.
- [3] (2019) A closer look at few-shot classification. In 7th International Conference on Learning Representations, ICLR 2019.
- [4] (2018) Few-shot learning with graph neural networks. In 6th International Conference on Learning Representations, ICLR 2018.
- [5] (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, ICLR 2019.
- [6] Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- [7] (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, Y. Bengio and Y. LeCun (Eds.).
- [8] (2020) A broader study of cross-domain few-shot learning. In ECCV.
- [9] (2016) Deep residual learning for image recognition. pp. 770–778.
- [10] (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
- [11] Network randomization: A simple technique for generalization in deep reinforcement learning. In 8th International Conference on Learning Representations, ICLR 2020.
- [12] (2019) Feature-critic networks for heterogeneous domain generalization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, K. Chaudhuri and R. Salakhutdinov (Eds.).
- [13] (2020) Feature transformation ensemble model with batch spectral regularization for cross-domain few-shot classification. arXiv preprint arXiv:2005.08463.
- [14] (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In 7th International Conference on Learning Representations, ICLR 2019.
- [15] (2017) Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, ICLR 2017.
- [16] (2018) Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018.
- [17] (2020) Explanation-guided training for cross-domain few-shot classification. arXiv preprint arXiv:2007.08790.
- [18] (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208.
- [19] (2020) Cross-domain few-shot classification via learned feature-wise transformation. In 8th International Conference on Learning Representations, ICLR 2020.
- [20] (2018) Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pp. 5334–5344.
- [21] (2020) Large margin mechanism and pseudo query set on cross-domain few-shot learning. arXiv preprint arXiv:2005.09218.