Despite the great success of machine learning, a clear gap between human and artificial intelligence lies in the ability to learn from small samples, e.g., learning to recognize objects from limited examples. Inspired by humans' ability to learn to learn from experience, meta-learning vanschoren2018meta
aims to transfer generic experience learned from multiple tasks with limited data to efficiently complete new tasks. As one of the most successful applications of meta-learning, few-shot learning targets learning from a limited number of labeled examples and has recently become a research trend. Few-shot image classification is a task in which the classifier must learn to accommodate new classes, not seen during training, from limited examples.
Existing meta-learning algorithms can be categorized into three groups: (1) Metric-based methods vinyals2016matching; snell2017prototypical that learn an encoder for the examples and apply parameter-free inference (e.g., nearest neighbor) in the embedding space, (2) Optimization-based methods finn2017model; rusu2018meta; park2019meta that extract the meta-knowledge of optimization algorithms for fast adaption, and (3) Black-box- or Model-based methods santoro2016meta; mishra2017simple; munkhdalai2017meta
that directly learn to embed the datasets to model parameters for prediction. Among them, (1) and (2) have become the most popular methodologies and have proved effective in various few-shot settings. However, two challenges remain largely unexplored. First, most existing methods do not consider time and resource efficiency or budgets, which limits their ability to meet the requirements of many real-world applications. Furthermore, the success of existing methods heavily relies on careful hyperparameter design (e.g., backbones, learning rates, etc.) for each specific dataset. In real-world scenarios, the datasets and tasks may be unknown, diverse, or even changing over time, making the manual design of the most suitable hyperparameters very laborious.
To tackle these challenges, we design a novel practical meta-learning system (MetaDelta) for few-shot image classification tasks in this paper. Following the metric-based methods, MetaDelta first adopts pretrained convolutional networks as backbones to project images to latent vectors and trains the backbones with linear classifiers in a non-episodic way on the training classes. To improve the system's generalization to unknown datasets under time and memory budgets, we employ multiple meta-learning models with multi-processing, while a central controller in the main process manages time and resources. Moreover, we implement a late-fusion meta-ensemble mechanism to improve generalization by taking the prediction of each model into account. MetaDelta consistently outperforms competitive baselines on various datasets and ranks first in the final phase of the AAAI 2021 MetaDL Challenge, which shows the superiority of our proposed meta-learning system.
2 AAAI 2021 MetaDL Challenge
In this section, we first introduce the workflow of a meta-learning system for the few-shot image classification task in AAAI 2021 MetaDL Challenge and then review the details and challenges of this competition.
As illustrated in Fig. 1, the workflow of such a meta-learning system is as follows. A meta-learner is first trained on the episodes or batches of data generated by the meta-train data generator. An episode refers to an $N$-way $K$-shot image classification task $\mathcal{T} = (\mathcal{S}, \mathcal{Q})$, where the support set $\mathcal{S}$ and the query set $\mathcal{Q}$ are the training set and the test set of this task, respectively. Note that $|\mathcal{S}| = N \times K$ and $|\mathcal{Q}| = N \times K_q$, where $N$ is the number of image categories of the task, and $N$, $K$, and $K_q$ are all manually adjustable before meta-training. A batch refers to examples randomly sampled from the meta-training data. The meta-learner outputs a learner, which is then trained on the support set of each meta-test episode to output a specific predictor for episodic evaluation.
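As a concrete illustration, episode sampling as described above can be sketched in NumPy as follows (a minimal sketch; the function and argument names are ours, not the challenge's actual data-generator API):

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=1, k_query=15, rng=None):
    """Sample index sets for an N-way K-shot episode from a labeled pool.

    `labels` holds one integer class label per example in the meta-train pool.
    Returns (support_idx, query_idx) with n_way*k_shot and n_way*k_query
    indices, respectively.
    """
    rng = rng or np.random.default_rng()
    # Draw N distinct classes for this episode.
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                # K labeled support examples
        query.extend(idx[k_shot:k_shot + k_query])  # K_q query examples
    return np.asarray(support), np.asarray(query)
```

Support and query indices for each class are disjoint, mirroring the train/test split inside a single task.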
The AAAI 2021 MetaDL Challenge consists of a feedback phase and a final phase. During the feedback phase, an offline public dataset (the Omniglot dataset lake2015human) and an online feedback dataset
(i.e., the dataset is unknown/unavailable and only used to evaluate submissions uploaded by participants) are provided for the participants to develop their meta-learning systems. During the final phase, new online feedback datasets are used to evaluate the submissions. The evaluation metric is the average classification accuracy on the query sets of 600 meta-test episodes. All the meta-test episodes are defined as 5-way 1-shot image classification tasks with an unknown number of query examples per class.
As a competition on meta-learning in few-shot image classification settings with an online judge, it poses the following challenges.
Fast adaption without overfitting. This is the core challenge of meta-learning and is especially critical for few-shot learning settings. The 5-way 1-shot classification problem in the competition requires the meta-learners to learn generalized prior knowledge from meta-training tasks and a proper way for fast adaption on limited novel data without overfitting.
Time and resource efficiency. Another challenging aspect of this competition is the efficiency requirement for the submissions: the whole meta-training and meta-testing workflow must finish within 2 hours on an Azure NV24 machine with 4 Tesla M60 GPUs and 224 GB of RAM. A superior meta-learning algorithm should thus not only learn fast (fast adaption), but also meta-learn fast.
Generalization across different datasets. During both the feedback and final phases, the online feedback datasets are unknown/unavailable to participants. Therefore, a good submission must work well on any unknown dataset without manually tailored hyperparameters (e.g., learning rate, backbone structure, etc.). This is difficult since the image distribution of the feedback dataset may differ heavily from existing offline datasets, with different numbers of query examples, image sizes, etc. Furthermore, under the time and resource limitations, common AutoML methods such as hyperparameter optimization (HPO) and neural architecture search (NAS) can be too expensive to apply for automatically specifying the best hyperparameters on the feedback/final dataset.
3 MetaDelta
In this section, we elaborate on our meta-learning system (MetaDelta) for the AAAI 2021 MetaDL Challenge.
3.1 System Overview
The MetaDelta system is illustrated in Fig. 2. To tackle the challenge of time and resource efficiency, we adopt a central controller in the main process to dispatch data and decide when to start and stop meta-training/testing (top of Fig. 2). To achieve good and robust performance on unknown feedback datasets, a meta-ensemble is learned over 4 different meta-learners. The 4 meta-learners are derived by training with different hyperparameters in parallel on 4 GPUs, managed by the central controller.
A specific meta-learner in MetaDelta is instantiated and used following the framework illustrated at the bottom of Fig. 2, which is capable of fast adaption in the few-shot image classification problem. During the meta-training period, we leverage a batch training strategy to train a deep model to classify all the meta-training classes (e.g., if the meta-training set includes 500 classes, the model performs 500-way classification). For time efficiency and generalization to unknown datasets, we leverage universal pretrained CNN encoders (e.g., ResNet50 pretrained on ImageNet) to embed images into features, and add a classifier head onto the encoder for fine-tuning. During the meta-testing period, we discard the classifier head, map the images to embeddings with the fine-tuned encoder, and apply an efficient parameter-free decoder to predict the class labels of query images based on the embeddings. The optimal meta-learner components are selected based on experimental evaluations on various offline datasets.
As mentioned in Sec. 1, metric-based (e.g., Prototypical Networks, a.k.a. ProtoNet snell2017prototypical) and optimization-based (e.g., MAML finn2017model) methods are the most popular and effective meta-learners in the existing literature. Through extensive experiments on 5-way 1-shot image classification tasks, we find that ProtoNet-like methods outperform MAML-like methods on various datasets, and thus select a ProtoNet-like framework for our meta-learners (bottom of Fig. 2).
Different from ProtoNet, we apply non-episodic (i.e., batch) training instead of episodic training to learn a CNN encoder that embeds images into feature vectors, since we find non-episodic training leads to a more effective encoder in our experiments (see Table 2). A parameter-free decoder is then used during the meta-valid/test periods to decode the vectors of each episode into predicted labels. In detail, we iteratively train and evaluate the CNN encoder: First, the encoder is meta-trained for a number of epochs using the non-episodic training strategy. Then, the episodic classification accuracy is calculated using the parameter-free decoder on the meta-valid dataset. The model with the best meta-valid episodic accuracy is saved for further use.
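The iterative train-and-validate loop above can be sketched as follows (a schematic only: the three callables are placeholders for one round of batch training, episodic validation with a parameter-free decoder, and model checkpointing):

```python
def fit_with_episodic_selection(train_epoch, episodic_accuracy, snapshot, rounds):
    """Alternate non-episodic (batch) training with episodic validation,
    keeping the snapshot that achieves the best meta-valid episodic accuracy.
    All three callables are placeholders for the real training step,
    decoder-based evaluation, and checkpointing."""
    best_acc, best_state = float("-inf"), None
    for _ in range(rounds):
        train_epoch()              # one round of non-episodic training
        acc = episodic_accuracy()  # parameter-free decoder on meta-valid episodes
        if acc > best_acc:
            best_acc, best_state = acc, snapshot()
    return best_acc, best_state
```

The key point is that model selection uses episodic accuracy, not the batch-training loss.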
Fine-tuned CNN Encoder
We select pretrained CNN backbones (pretrained on ImageNet deng2009imagenet) as the initial encoders and then fine-tune them on batches of meta-training data. We use pretrained deep backbones because they have strong generalization capacity and help the meta-learner generalize to unknown feedback datasets. To some degree, the pretrained models can also be regarded as a collection of meta-knowledge about ImageNet. Moreover, compared to learning from scratch, fine-tuning pretrained CNNs also saves time and computing resources, enabling effective training of powerful deep models within the time limit. In our experiments, we select ResNet50 he2016deep, ResNet152 he2016deep, WRN50 zagoruyko2016wide, and MobileNet sandler2018mobilenetv2 for the four meta-learners in MetaDelta.
To fine-tune the backbones, a linear classifier head (i.e., a fully-connected layer) is added on top of the CNN encoder, and we randomly sample $N_b$-way $K_b$-shot batches from the meta-train classes for training. Here, $N_b$-way $K_b$-shot means each batch consists of $N_b$ classes with $K_b$ labeled examples per class, which keeps the classes balanced. To augment the image data and learn a more robust encoder, we follow chen2019self to apply the rotation loss: First, we rotate each image in a batch by 0, 90, 180, and 270 degrees to get four images, which should be classified into the same class by the original classifier head. Then, another 4-way linear classifier head is added on top of the CNN encoder to predict the four kinds of rotations. Finally, we optimize the weights of the encoder by minimizing the following loss:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{rot},$$
where $\mathcal{L}_{cls}$ is the classification loss, $\mathcal{L}_{rot}$ is the rotation loss, and $\lambda$ is a balancing hyperparameter.
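The data side of the rotation augmentation can be sketched in NumPy as follows (the helper names and the default balancing weight `lam` are illustrative choices of ours; the two classifier heads live in the network itself and are omitted here):

```python
import numpy as np

def build_rotation_batch(images):
    """Expand a batch of shape (B, H, W, C) into four rotated copies
    (0/90/180/270 degrees) plus rotation labels for the 4-way rotation head."""
    rotated = np.concatenate([np.rot90(images, k, axes=(1, 2)) for k in range(4)])
    rot_labels = np.repeat(np.arange(4), len(images))  # 0,0,...,1,1,...,3,3
    return rotated, rot_labels

def total_loss(cls_loss, rot_loss, lam=0.5):
    """Combined objective L = L_cls + lam * L_rot; lam is the balancing
    hyperparameter (0.5 is an arbitrary placeholder, not the paper's value)."""
    return cls_loss + lam * rot_loss
```

Every rotated copy keeps the original class label for the main head, while the rotation label supervises the auxiliary 4-way head.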
With the feature vectors encoded by the fine-tuned CNN encoder, we can further predict the labels of query examples in meta-valid/testing episodes with the help of parameter-free decoders. During the meta-valid period, we use the decoder in ProtoNet snell2017prototypical for inference. The model with the best few-shot classification accuracy on the meta-valid dataset is chosen as the encoder for further use.
Specifically, given a meta-valid episode of $N$-way $K$-shot, we first compute the prototype $c_k$ from the support set $\mathcal{S}_k$ of the $k$-th class:
$$c_k = \frac{1}{K} \sum_{(x_i, y_i) \in \mathcal{S}_k} f_\phi(x_i), \quad (1)$$
where $f_\phi$ denotes the CNN encoder. Then, the ProtoNet decoder produces a distribution over classes for each query example based on a softmax over the Euclidean distances between its embedding and the prototypes:
$$p(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}, \quad (2)$$
where $x$ is a query example, and $d(u, v)$ denotes the Euclidean distance between vectors $u$ and $v$. The prediction is then made by classifying the example to the most probable class.
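A minimal NumPy sketch of this decoder (using squared Euclidean distance, as in the original ProtoNet formulation; the function names are ours):

```python
import numpy as np

def prototypes(support_emb, support_y, n_way):
    """Class prototypes: per-class mean of support embeddings (Eq. 1)."""
    return np.stack([support_emb[support_y == k].mean(axis=0) for k in range(n_way)])

def protonet_probs(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype (Eq. 2)."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Prediction is simply `probs.argmax(axis=1)`, i.e., the nearest prototype in embedding space.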
During the meta-test period, we implement the soft $k$-means based transductive decoder in MCT kye2020transductive to build more accurate prototypes by considering query embeddings. Concretely, the initial prototypes $c_k^{(0)}$ are the same as those in Eq. 1. The MCT decoder iteratively updates the prototypes for $T$ steps. For each step $t$, we first calculate the confidence score $q_{i,k}^{(t)}$ of each query example $x_i$ belonging to class $k$ in the same way as Eq. 2:
$$q_{i,k}^{(t)} = \frac{\exp(-d(f_\phi(x_i), c_k^{(t)}))}{\sum_{k'} \exp(-d(f_\phi(x_i), c_{k'}^{(t)}))}.$$
Then, we update the prototypes based on the confidence scores for all query examples in the episodic query set $\mathcal{Q}$:
$$c_k^{(t+1)} = \frac{\sum_{(x_i, y_i) \in \mathcal{S}_k} f_\phi(x_i) + \sum_{x_j \in \mathcal{Q}} q_{j,k}^{(t)} f_\phi(x_j)}{K + \sum_{x_j \in \mathcal{Q}} q_{j,k}^{(t)}}.$$
The predictions are finally made based on $c_k^{(T)}$. (We do not use the learnable distance metric proposed in MCT, as it brings no improvement in our preliminary experiments.)
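A simplified soft k-means refinement in the spirit of the MCT decoder can be sketched as follows (the step count, temperature, and exact weighting here are illustrative assumptions, and the learnable metric of MCT is omitted, as in our system):

```python
import numpy as np

def soft_assign(query_emb, protos, temp):
    """Confidence of each query for each class: softmax over negative
    squared distances to the current prototypes."""
    d = ((query_emb[:, None] - protos[None]) ** 2).sum(-1)
    logits = -d / temp
    logits -= logits.max(1, keepdims=True)
    q = np.exp(logits)
    return q / q.sum(1, keepdims=True)

def transductive_refine(support_emb, support_y, query_emb, n_way, steps=10, temp=1.0):
    """Alternate confidence estimation and confidence-weighted prototype
    updates that mix support means with query embeddings (soft k-means)."""
    protos = np.stack([support_emb[support_y == k].mean(0) for k in range(n_way)])
    for _ in range(steps):
        q = soft_assign(query_emb, protos, temp)  # confidence scores
        for k in range(n_way):
            num = support_emb[support_y == k].sum(0) + (q[:, k:k + 1] * query_emb).sum(0)
            den = (support_y == k).sum() + q[:, k].sum()
            protos[k] = num / den                 # weighted prototype update
    return soft_assign(query_emb, protos, temp)   # final class probabilities
```

Because the query set shapes the prototypes, this decoder is transductive: all queries of an episode are classified jointly.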
We find in our experiments that the accuracy trend lines of the ProtoNet decoder and the MCT decoder are generally the same (see Fig. 4), while the latter leads to higher accuracy given the same CNN encoder. Therefore, we take the ProtoNet decoder for meta-validation to accelerate training without missing the best models, and the MCT decoder for meta-testing to make more accurate inferences. Note that using a decoder during meta-validation to calculate episodic accuracies (instead of batch-wise classification accuracies as during meta-training) is reasonable, since a CNN encoder that achieves low meta-training loss does not ensure high episodic accuracy during the meta-valid/test periods.
The meta-ensemble module is designed to tackle the challenge of generalization capacity, i.e., to improve the performance of MetaDelta on any unknown feedback/final dataset. Ensemble methods have been empirically proved to be effective in various supervised classification tasks rokach2010ensemble. In MetaDelta, the meta-ensemble module integrates the predicted probabilities of the four meta-learners and outputs the final predictions, as illustrated in Fig. 2.
The meta-ensemble model is trained after finishing the meta-training of all meta-learners. To train the meta-ensemble model, we divide the meta-valid data into a training set $\mathcal{D}_{tr}$ and a test set $\mathcal{D}_{te}$. Taking the concatenation of the predicted probabilities from the four loaded best meta-learners as input, several meta-ensemble models are trained on $\mathcal{D}_{tr}$ simultaneously and evaluated on $\mathcal{D}_{te}$ based on episodic accuracy. The best meta-ensemble model is then saved for inference in the meta-test period. In our experiments, we implement voting, Gradient Boost Machine, General Linear Model, Naive Bayesian Classifier, and Random Forest (we leverage the Gradient Boost Machine implementation from LightGBM (lightgbm.readthedocs.io/en/latest/index.html); the Naive Bayesian Classifier and Random Forest models are implemented with scikit-learn scikit-learn) as the meta-ensemble candidate models. Due to the diversity of scenarios these models suit, we argue that our meta-ensemble module can dynamically adapt to the unknown feedback dataset by selecting the best ensemble model according to the meta-valid data. This design further improves the robustness of our system.
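The selection logic of the meta-ensemble can be sketched as follows (with two toy candidate combiners, mean-probability and hard majority voting, standing in for the actual GBM/GLM/NB/RF candidates; all names are ours):

```python
import numpy as np

def select_ensemble(candidates, probs_te, y_te):
    """Pick the candidate combiner with the best accuracy on the held-out
    split D_te. `probs_te` has shape (num_learners, num_examples, num_classes);
    each candidate maps it to predicted labels."""
    def accuracy(pred):
        return (pred == y_te).mean()
    scored = {name: accuracy(fn(probs_te)) for name, fn in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored

# Two illustrative candidate combiners.
mean_vote = lambda P: P.mean(axis=0).argmax(axis=1)          # average probabilities
majority_vote = lambda P: np.apply_along_axis(
    lambda v: np.bincount(v).argmax(), 0, P.argmax(axis=2))  # hard majority vote
```

In the real system, each candidate is trained on $\mathcal{D}_{tr}$ first; here the candidates are parameter-free, so only the selection step is shown.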
3.4 Central Controller
The central controller module aims at improving the time and resource efficiency of our system and avoiding timeouts or memory overruns. The multi-thread and multi-processing designs in this module enable our system to support a greater degree of parallelism and make full use of the computing resources.
The design of the central controller (during the meta-training period) is illustrated in Fig. 3. First, a timer is set in the main process to measure and estimate the time cost of meta-training/testing epochs, based on which the whole meta-training/testing procedure is supervised. A subprocess may be killed early by the central controller if it is predicted to exceed the time budget. Under this framework, the main process starts a data manager thread to load, copy, and dispatch the episode/batch data. Then, four data preprocessor subprocesses are started to receive and preprocess the data copies according to the requirements of the specific meta-learner trainer subprocesses. The preprocessed data is sent to the corresponding data buffer, which is designed to support asynchronous meta-training in different subprocesses. A data preprocessor subprocess keeps producing until timeout or until all the buffers are full. Each meta-learner trainer subprocess runs on one GPU, while the main process and the data preprocessor subprocesses run on the CPU and RAM.
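The buffer-based data path can be sketched with a thread-level producer/consumer (a simplification of our design: the real system uses one subprocess per GPU and a timeout-aware controller, both omitted here):

```python
import queue
import threading

def run_pipeline(batches, preprocess, train_step, buffer_size=8):
    """A preprocessor thread fills a bounded buffer (it blocks, providing
    backpressure, when the buffer is full) while the trainer consumes it."""
    buffer = queue.Queue(maxsize=buffer_size)
    DONE = object()  # sentinel marking the end of the data stream

    def producer():
        for b in batches:
            buffer.put(preprocess(b))  # blocks when the buffer is full
        buffer.put(DONE)

    t = threading.Thread(target=producer)
    t.start()
    results = []
    while True:
        item = buffer.get()
        if item is DONE:
            break
        results.append(train_step(item))
    t.join()
    return results
```

The bounded queue is what lets preprocessing and training proceed asynchronously without unbounded memory growth.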
During the meta-valid and meta-test periods, the same central controller framework is used, except for the training and inference of the meta-ensemble module in the main process.
4 Experiments
In this section, we present our ranking in the final phase and experimental results from offline evaluations.
4.1 Final Phase Results
The AAAI 2021 MetaDL Challenge consists of a feedback phase and a final phase, each using different unknown online feedback datasets to test the participants' submissions. The ranks of the top 3 teams in the final phase are shown in Table 1. Our team ranks first with a meta-test accuracy of 0.4042, which validates the effectiveness and generalization capacity of MetaDelta.
Table 1: Team name and accuracy score of the top-3 teams in the final phase.
| Dataset | # Meta-tr | # Meta-val | # Meta-te |
| --- | --- | --- | --- |
| Omniglot | 882 (25) | 81 (5) | 659 (20) |
| tieredImageNet | 350 (10) | 56 (2) | 167 (8) |
4.2 Offline Evaluation Results
To evaluate the proposed MetaDelta in an offline environment, we conduct experiments on four public datasets for the few-shot image classification task and compare MetaDelta with several meta-learning baselines.
Besides the public offline dataset Omniglot lake2015human provided by the Challenge, we also select three popular few-shot image classification datasets to evaluate the generalization capacity of MetaDelta: CIFAR-100 krizhevsky2009learning, miniImageNet vinyals2016matching, and tieredImageNet ren2018meta. For each dataset (except the officially provided Omniglot), we randomly partition the classes into meta-train, meta-valid, and meta-test sets with a ratio of 5:1:4. The statistics of these datasets are shown in Table 3. Note that the images in all datasets are reshaped to a fixed size to be consistent with the official interface (this is not counted in the time budget, as the official images are claimed to always be of this size).
Several representative optimization-based and metric-based meta-learning methods are adopted as baselines. For optimization-based methods, we select MAML finn2017model and MetaCurvature park2019meta, an enhanced version of MAML that transforms the inner-update gradients to improve generalization capacity. CifarCNN is chosen as the backbone (base learner) for MAML and MetaCurvature due to its effectiveness and efficiency: larger backbones like (pretrained) ResNet50 cannot converge to a good optimum within the time limit of the competition. For metric-based methods, we select ProtoNet snell2017prototypical with CifarCNN and pretrained ResNet50 backbones, which applies episodic training rather than the batch training used in MetaDelta. Moreover, the following variants of MetaDelta are adopted as baselines to show the impact of different components: 1) Base: fine-tunes a pretrained ResNet50 by batch training and makes inferences with the ProtoNet decoder during meta-testing. 2) Base+MCT: adopts the MCT decoder for more accurate prediction during meta-testing. 3) Base+Rot+MCT: further applies the rotation loss augmentation during batch training. 4) MetaDelta: our final system with a meta-ensemble over the predictions of four meta-learners.
The performance comparison is listed in Table 2. The proposed MetaDelta significantly outperforms the other baselines on all datasets except Omniglot. We do not adopt MAML-based methods as meta-learners in our system due to their low performance on the majority of datasets. As shown in Table 2, the meta-learner adopted in MetaDelta (Base and its variants) surpasses the typical MAML- and ProtoNet-based baselines by a large margin on CIFAR-100, miniImageNet, and tieredImageNet, and the MCT decoder together with the rotation loss augmentation helps to further boost the performance.
Why ProtoNet Works Better. We notice in our experiments that, on almost all datasets, our ProtoNet baseline (Base) and its variants outperform MAML by a large margin. We suspect this superiority of ProtoNet-like methods derives from the implicit utilization of prior knowledge about image data (e.g., locality, translation invariance, etc.) when combining pretrained encoders with distance-based decoders. In particular, most ProtoNet-like methods are specifically designed for few-shot image classification, projecting images into latent vectors and making inferences based on pairwise distances. In contrast, MAML-like methods adopt a more general framework without any assumptions on the data or tasks, and thus cannot leverage this prior knowledge.
We further conduct several ablation experiments to demonstrate the functionality of backbones and parameter-free decoders. Concretely, we implement single meta-learners with the backbones of ResNet50, ResNet152, WRN50 and MobileNet, and apply decoders of ProtoNet snell2017prototypical (Euclidean), MCT kye2020transductive, Laplacian ziko2020laplacian, and Graph Propagation (Graph) rodriguez2020embedding.
Table 4 lists the ablation results on the pretrained backbones, indicating that different backbones show superiority on different datasets. This observation motivates our design of taking different backbones in the four meta-learners and applying the meta-ensemble module, which aims at improving the generalization capacity of MetaDelta to unknown feedback datasets.
Table 5 and Fig. 4 demonstrate the ablation results of different decoders in the meta-learners. We observe that the meta-valid accuracy curves of different decoders are akin to each other, which indicates that the best models saved according to the Euclidean decoder and the MCT decoder during meta-validation are, with high probability, the same. Therefore, we apply the Euclidean decoder during meta-validation for acceleration and the MCT decoder during meta-testing for higher accuracy (as shown in Table 5).
5 Conclusion
In this paper, we propose MetaDelta, a meta-learning system for few-shot image classification that tackles two challenges of practical significance: 1) time and resource efficiency, and 2) generalization to unknown feedback datasets. For the meta-learners in MetaDelta, we adopt a pretrained CNN encoder fine-tuned by batch training and a parameter-free decoder for inference. The meta-training of multiple meta-learners is arranged by a central controller with multi-processing techniques, and a meta-ensemble module is applied to integrate the predictions. The resulting system ranks first in the final phase of the AAAI 2021 MetaDL Challenge. For future work, we plan to apply domain generalization techniques carlucci2019domain; li2019episodic
in computer vision to further enhance the generalization capacity of MetaDelta to any unknown datasets.
Acknowledgments
This research is supported by the National Key Research and Development Program of China (No. 2020AAA0106300, 2020AAA0107800, 2018AAA0102000) and the National Natural Science Foundation of China (No. 62050110).