Optimized Generic Feature Learning for Few-shot Classification across Domains

01/22/2020 ∙ by Tonmoy Saikia, et al. ∙ University of Freiburg, Google

To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning. In this paper, we propose to use cross-domain, cross-task data as the validation objective for hyperparameter optimization (HPO) to improve on this goal. Given a rich enough search space, optimization of the hyperparameters learns features that maximize validation performance and, due to the choice of objective, generalize across tasks and domains. We demonstrate the effectiveness of this strategy on few-shot image classification within and across domains. The learned features outperform all previous few-shot and meta-learning approaches.







1 Introduction

Generalization to unseen samples distinguishes machine learning from traditional data mining. For machine learning, it is not enough to store sufficiently many examples and to access them efficiently, but the examples are supposed to yield a generic model, which makes reasonable decisions (far) beyond the observed training samples. Pushing the limits of generalization is probably the most important goal of machine learning.

To measure generalization, a dataset is typically split into a training and a test set. With no overlap between these sets, the test error indicates generalization to unseen samples. However, the data in the training and test sets are typically very similar. More challenging test schemes are cross-dataset testing and few-shot learning. In cross-dataset testing, the test set comes from an independent dataset, which involves a considerable domain shift. In few-shot learning, a trained model is tested on unseen classes. Only few (sometimes one) samples of the new classes are made available to learn their decision function. This is only possible if the features from the pre-trained model are rich enough to provide the discriminative information for the new classes: the features must generalize to new tasks.

Deep networks have far more parameters than there are training samples. Without appropriate regularization, they would simply memorize the training samples and not generalize at all. However, we all know that deep networks do generalize. This is partially due to explicit regularization, such as weight decay, dropout, data augmentation, or ensembles, and also due to implicit regularization imposed by the optimization settings and network architecture [25, 40, 61]. Although the research community has developed good practices for all the many design choices, it is difficult to find the right balance, since there are mutual dependencies between parameters [22]. Even worse, the right balance depends much on the training set the network draws its information from: training on ImageNet can require very different regularization than training on a smaller, less diverse dataset.

In this paper, we propose rigorous hyperparameter optimization (HPO) to automatically find the optimal balance between a set of factors that influence regularization. The use of HPO for improving the test performance of deep networks is very common [5, 12, 20, 23, 30]. What distinguishes this paper from previous works is that we analyze the suitability of HPO for improving generalization rather than optimal test performance on a particular benchmark dataset. To this end, we choose the validation objective to match what we want: generalization. We investigate two levels of generic feature learning: (1) within-domain generalization in few-shot learning, and (2) cross-domain generalization. Few-shot learning often does not work well in unseen domains [2, 11]. Thus, in case (2) we investigate if HPO can improve cross-domain performance when cross-domain tasks are used for optimization.

Our investigation yielded several scientific insights. (1) We show that HPO boosts few-shot performance clearly beyond the state of the art, even with a rather small search space. This confirms that manually finding the right balance in the set of options that affect regularization is extremely hard. (2) Cross-domain HPO further improves few-shot performance in the target domain. (3) The search confirms some common best-practice choices and discovers some new trends regarding the batch size, the choice of the optimizer, and the way to use data augmentation.

Figure 1: Hyperparameter optimization (train, validation and test stages). Multiple feature extractors are learned on the training set with hyperparameters sampled by BOHB. The validation objective is the average accuracy over multiple few-shot tasks sampled from the validation split. A few-shot task consists of a support set and a query set. For each few-shot task a classifier is learned based on the support set and the feature extractor from the training stage. The query set is used for evaluation, i.e., to measure classification accuracy. The best model is then evaluated on the test set using the same protocol as during validation. Few-shot tasks may be from a domain different from that of the training set.

2 Related work

Feature learning. A standard approach to obtain a common feature representation is to train a classification network on ImageNet. These features generalize surprisingly well and are commonly used for finetuning on target domains or tasks [18, 17, 32, 49]. Studies have shown that data augmentation [5, 4, 31] and regularization techniques [14, 15] can help learn feature representations which generalize better. Unsupervised approaches use proxy tasks and loss functions to learn good feature representations. Examples of proxy tasks are recovering images under corruption [42, 56] or image patch orderings [8, 39]. Loss functions focus on contrastive [55, 59] or adversarial learning [9, 10]. In most cases, unsupervised feature learning is not competitive with features obtained from ImageNet.

Few-shot learning. Few-shot learning approaches can primarily be categorized into optimization [13, 43, 46, 54] and metric learning approaches [27, 53, 57]. Optimization approaches are designed with the goal of adapting quickly to new few-shot tasks. Ravi and Larochelle [43] used an LSTM to learn an update rule for a good initialization of a model's parameters, such that it performs well given a small training set on a new classification task. Finn et al. [13] proposed MAML, which uses gradient updates obtained from target tasks (unseen during training) to learn a better weight initialization that prevents over-fitting. However, methods such as MAML have been shown to work poorly on larger datasets [54]. State-of-the-art methods [29, 41, 52] often rely on metric learning [7]. Typically such methods meta-learn an embedding over multiple meta-training tasks [57], which is then used to classify query examples based on some distance metric. For instance, Vinyals et al. [57] proposed matching networks, which label query set examples as a linear combination of the support set labels. Examples in the support set whose embeddings are closer to the query example are assigned a higher weight. Snell et al. [52] assume that there exists an embedding space where samples of the same class cluster around a prototype. Their method builds class prototypes by taking the mean of the support set embeddings for each class. Query set examples are then assigned class labels based on their closeness to class prototypes.

Recently, it has been shown that the standard feature learning approach, of first training a backbone network on a large number of base classes and learning a few-shot classifier on top, is competitive [2, 11, 54]. In this work, we use standard feature learning.

Hyperparameter optimization. Bayesian optimization is the most common approach for hyperparameter search and has been used to obtain state-of-the-art results on CIFAR-10 [1]. Standard Bayesian optimization does not scale well to tasks with longer training times or large hyperparameter search spaces. In such cases, methods that mostly run random search, like Hyperband [30], are often more efficient. BOHB [12] adds Bayesian optimization to Hyperband to learn from previously sampled hyperparameter configurations. In computer vision, BOHB was used to optimize the training of a disparity network [47]. We use BOHB for hyperparameter optimization targeted towards optimal features for few-shot learning tasks.

3 Optimized feature learning

To optimize generalization via hyperparameter optimization (HPO), we must define an appropriate search space, an informative validation objective, and the dataset splits on which we optimize. An overview of the whole optimization pipeline with the different data splits is shown in Figure 1.

3.1 Hyperparameter search space

The definition of the search space is decisive for the success of HPO. With a naive way of thinking, one would make the search space as large as possible to explore all possibilities. However, this will make HPO terribly inefficient, and the search will not find optimal values even after many GPU years. Thus, some informed choice of interesting hyperparameters has to be made. We experiment with two search spaces, which are motivated by best practice insights of previous works [35, 51, 61] and by scientific questions that we want to get answered in this work.

Search space S1 (Optimization). In this smaller search space we consider five training parameters, which are relevant for regularization: the choice of optimizer, the batch size, the amount of regularization, the initial learning rate, and its decay frequency, i.e., the number of mini-batch updates until the learning rate is decreased. The learning rate is decayed to zero following a cosine schedule [33].
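A search space like S1 can be expressed as a small set of samplers, one per hyperparameter. The sketch below is illustrative only: the parameter names and the concrete ranges are our assumptions, not the values used in the paper (which are given in Table 2).

```python
import random

# Hypothetical sketch of search space S1 (optimization hyperparameters).
# Ranges here are illustrative assumptions, not the paper's actual values.
S1 = {
    "optimizer":   lambda: random.choice(["SGD", "ADAM"]),
    "lr":          lambda: 10 ** random.uniform(-4, -1),  # log-uniform learning rate
    "l2_reg":      lambda: 10 ** random.uniform(-6, -2),  # log-uniform weight decay
    "decay_every": lambda: random.randint(1000, 30000),   # updates between LR decays
    "batch_size":  lambda: random.choice([16, 32, 64, 128]),
}

def sample_config(space):
    """Draw one hyperparameter configuration from the search space."""
    return {name: sampler() for name, sampler in space.items()}
```

In BOHB this sampling is model-based rather than purely random, but the shape of a configuration is the same.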

Search space S2 (Optimization & Augmentation). Since data augmentation plays an important role for regularization, we define a second, larger search space that includes data augmentation on top of S1. Search spaces on data augmentation can quickly become very large. AutoAugment [5] searched for optimal augmentation policies, i.e., sequences of augmentation operations, using reinforcement learning. A huge search space tends to yield sub-optimal results after finite compute time. Indeed, in a follow-up work, Cubuk et al. [4] claimed that the full policy search was unnecessary, as randomly choosing transformations from the overall set of transformations performs on par. This leaves the question on the table: is this a) because it is not critical to tune exact magnitudes for each augmentation operation as long as there is a diverse set of operations, or b) because the search with reinforcement learning found a sub-optimal policy?

We shed some more light on this using a smaller search space than in AutoAugment, i.e., a subset of 10 operations from AutoAugment (rotate, posterize, solarize, color, contrast, brightness, sharpness, shear, translate, cutout [6]) and only a single magnitude parameter for each of them. In particular, we sample the transformation magnitude from a clipped Gaussian distribution and optimize its standard deviation parameter (search range described in Section 4.1). We allow augmentation operations to be applied at random and optimize the number of operations applied at a time, which yields a single additional parameter.
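The scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the clip range [0, 1] and the zero-mean Gaussian are assumptions; only the operation list and the two tuned parameters (sigma and the number of operations) come from the text.

```python
import random

# Operation names follow the list in the text.
OPS = ["rotate", "posterize", "solarize", "color", "contrast",
       "brightness", "sharpness", "shear", "translate", "cutout"]

def sample_augmentation(sigma, n_ops, rng=random):
    """Pick n_ops operations uniformly at random and draw one clipped
    Gaussian magnitude per operation; sigma is the tuned hyperparameter."""
    policy = []
    for op in rng.sample(OPS, n_ops):
        mag = abs(rng.gauss(0.0, sigma))
        mag = max(0.0, min(1.0, mag))  # assumed clip range [0, 1]
        policy.append((op, mag))
    return policy
```

Each training image would then be transformed by the sampled `(operation, magnitude)` pairs in sequence.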

3.2 Validation objective

Apart from the training objective, which is optimized on the training set, HPO requires an additional validation objective on a held out validation set. We propose to use the accuracy of few-shot classification as validation objective to measure generalization performance. The overall scheme, as illustrated in Figure 1, contains multiple training and optimization stages and different data subsets used for these. We describe it in detail in the following subsections.

3.2.1 Training a classification network for feature extraction

The inner part of the optimization is training the weights of a normal classification network using the hyperparameter sample currently evaluated by HPO. We use the ResNet18 implementation from [54]. Each training run with a hyperparameter configuration λ yields a different network with its own weights and corresponding features φ_λ.

3.2.2 Sampling of few-shot learning tasks

The features are validated on how well they perform as basis for a few-shot learning task. To this end, we evaluate the average result of multiple few-shot classification tasks, which are randomly sampled from the validation set.

Each few-shot task consists of a small number of classes, with a few labeled examples per class. For each task, a classifier is trained on these few labeled examples (the support set) and is evaluated on unseen examples of the same classes (the query set). The number of classes in a task is commonly called ways, and the number of labeled examples per class is termed shots. We measure for each task the classification accuracy on its query set. Each task may have a variable number of ways and shots.
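The task-sampling procedure described above can be sketched as follows. This is a generic episode sampler written for illustration, assuming labeled examples are given as (id, class) pairs; it is not the Meta-Dataset sampler, which additionally randomizes ways and shots.

```python
import random
from collections import defaultdict

def sample_few_shot_task(labels, n_ways, k_shots, n_query, rng=random):
    """Sample one few-shot task (episode) from a labeled dataset.

    labels: list of (example_id, class_label) pairs.
    Returns disjoint support and query sets over n_ways sampled classes.
    """
    by_class = defaultdict(list)
    for ex, y in labels:
        by_class[y].append(ex)
    classes = rng.sample(sorted(by_class), n_ways)
    support, query = [], []
    for y in classes:
        pool = rng.sample(by_class[y], k_shots + n_query)
        support += [(ex, y) for ex in pool[:k_shots]]   # k labeled examples
        query   += [(ex, y) for ex in pool[k_shots:]]   # held-out evaluation examples
    return support, query
```

Averaging the query-set accuracy over many such sampled tasks gives the validation objective used by HPO.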

3.2.3 Training the few-shot classifier for a task

For classifying images in a few-shot task, we freeze the parameters of the feature extractor and train a classifier C: R^d → R^n on the support set, where d is the dimensionality of the feature embedding and n is the number of classes or ways. This classifier takes the feature embedding as input.

We experiment with two types of classifiers: a linear classifier and a nearest centroid classifier [36]. The weight matrix of the linear classifier is trained using the cross-entropy loss. The nearest centroid classifier [36] computes a class prototype by averaging the embeddings of each class's support set examples. Each query set example is then assigned the class whose prototype is nearest according to the negative cosine similarity.
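The nearest-centroid rule is simple enough to state in a few lines. The following NumPy sketch implements prototype averaging and cosine-similarity assignment as described above; function and variable names are ours.

```python
import numpy as np

def nearest_centroid_predict(support_emb, support_labels, query_emb):
    """Nearest-centroid few-shot classifier.

    Prototypes are the mean embedding of each class's support examples;
    each query gets the class whose prototype has the highest cosine
    similarity (equivalently, the lowest negative cosine similarity).
    """
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    # After L2-normalization, cosine similarity is a plain dot product.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return classes[np.argmax(q @ p.T, axis=1)]
```

Note that, unlike the linear classifier, this rule requires no training on the support set beyond computing the class means.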


3.2.4 Optimizing the hyperparameters with BOHB

The task of BOHB is to sample hyperparameter configurations λ for training a feature extractor φ_λ, such that the average accuracy over the few-shot tasks on the validation set is maximized. We emphasize that λ refers to the hyperparameters for training the feature extractor and not the weights of the few-shot classifiers.

BOHB [12] combines the benefits of Bayesian optimization [50] and Hyperband [30]. Hyperband performs efficient random search by dynamically allocating more resources to promising configurations and stopping bad ones early. To improve over Hyperband, BOHB replaces its random sampling with model based sampling once there are enough samples to build a reliable model.

To improve efficiency, BOHB uses cheaper approximations f̃(λ, b) of the validation objective, where b is a budget with b_min ≤ b ≤ b_max; the true validation objective is recovered at b = b_max. The budget usually refers to the number of mini-batch updates or the number of epochs [12, 47], but may also be other parameters such as image resolution or training time [60]. We use the number of mini-batch updates as budget. Similar to Hyperband, BOHB repeatedly calls Successive Halving (SH) [24] to advance promising configurations evaluated on smaller budgets to larger ones. SH starts with a fixed number of configurations on the cheapest budget and retains the best fraction (1/η) for advancing to the next, more expensive budget. BOHB uses a multivariate kernel density estimator (KDE) to model the densities of the good and bad configurations and uses them to sample promising hyperparameter configurations. For more details we refer to the original paper [12].
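The Successive Halving loop at the core of Hyperband and BOHB can be sketched in a few lines. This is a simplified illustration under our own assumptions (a score where higher is better, budgets multiplied by η at each rung), not the BOHB implementation itself.

```python
def successive_halving(configs, evaluate, min_budget, max_budget, eta=3):
    """Sketch of Successive Halving (SH) as used inside Hyperband/BOHB.

    evaluate(config, budget) -> validation score (higher is better).
    At each rung, the best 1/eta fraction of configurations advances
    to an eta-times larger budget; the rest are stopped early.
    """
    budget = min_budget
    while budget < max_budget and len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(configs) // eta)   # retain the best 1/eta fraction
        configs = scored[:keep]
        budget *= eta
    # Evaluate the survivor(s) on the full budget.
    return max(configs, key=lambda c: evaluate(c, max_budget))
```

BOHB replaces the uniform-random choice of the initial `configs` with samples from its KDE model once enough observations are available.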


3.3 Testing the generalization performance

At test time, we take the best model found by BOHB, and evaluate it on few-shot tasks sampled from the test set. Like the validation objective, we compute the test objective as the average classification accuracy over few-shot tasks.

4 Experiments and results

With the experiments we are interested in answering the following scientific questions: (1) Does HPO improve generalization performance as measured by few-shot tasks? (2) Does HPO help transfer features to a target domain? (3) Can it yield features that generalize better to unseen domains? (4) How critical is the optimization of data augmentation parameters? (5) How much do we gain by ensembling top performing models from HPO?

4.1 Experimental setup

Datasets. We use mini-ImageNet [57] and datasets from the more challenging Meta-Dataset benchmark [54].

The mini-ImageNet dataset is commonly used for few-shot learning. It consists of 100 classes with 600 images per class. The training, validation and testing splits consist of 64, 16, and 20 classes each. Since all data is from the same dataset, it only enables evaluating cross-task, within-domain generalization, but not generalization across domains.

To this end, we use the datasets from the Meta-Dataset benchmark. It is much larger than previous few-shot learning benchmarks and consists of multiple datasets with different data distributions. Also, it does not restrict few-shot tasks to have fixed ways and shots, thus representing a more realistic scenario. It consists of datasets from diverse domains: ILSVRC-2012 [45] (the ImageNet dataset, consisting of natural images from 1000 categories), Omniglot [28] (hand-written characters, 1623 classes), Aircraft [34] (a dataset of aircraft images, 100 classes), CUB-200-2011 [58] (a dataset of birds, 200 classes), Describable Textures [3] (different kinds of texture images with 47 categories), Quick Draw [26] (black and white sketches of 345 different categories), Fungi [48] (a large dataset of mushrooms), VGG Flower [38] (a dataset of flower images with 102 categories), Traffic Signs [21] (German traffic sign images with 43 classes) and MSCOCO (images collected from Flickr, 80 classes). All datasets except Traffic Signs and MSCOCO have a training, validation and test split (proportioned roughly into 70%, 15%, 15%). The datasets Traffic Signs and MSCOCO are reserved for testing only. We note that there exist other ImageNet splits, such as TieredImageNet [44]; to avoid ambiguity, we refer to Meta-Dataset's ImageNet version as ImageNet-GBM (Google Brain Montreal).

Group	Datasets
D1	ImageNet-GBM
D2	Birds, Omniglot, QuickDraw, DTD, Fungi, Aircraft, VGG Flower
D3	MSCOCO, TrafficSign
Table 1: Dataset groups (Meta-Dataset). D1 is used for training and evaluation whereas D2 and D3 are used for evaluation only.

We group the datasets in the Meta-Dataset benchmark into three subsets D1, D2, D3, as shown in Table 1. A dataset in each subset has a training, validation, and test split. We always use only the respective split for the training, validation and testing stages. Testing on D2, for example, means using the test splits of the datasets in D2. The splits of D1 (ImageNet-GBM) are used for training, validation, and testing, while the splits of D2 are used only for validation and testing. D3 is reserved for testing.

Evaluation details. We compute the average accuracy over few-shot tasks sampled from a dataset. Few-shot tasks are sampled from the validation split during validation and the test split during testing. Sampled tasks from mini-ImageNet have fixed ways and shots. The Meta-Dataset benchmark tasks have variable ways and shots, which are randomly sampled. We use the same implementation and sampler settings as defined for the Meta-Dataset benchmark [54].

As described in Section 3.2, we experiment with two types of few-shot classifiers (Linear and N-Centroid). If a linear classifier is used, a new weight matrix must be learned to map the features to the target classes. We train the linear classifier for a fixed number of steps using SGD with momentum. The learning rate and regularization factor are fixed in all experiments.
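Training the linear head on frozen embeddings amounts to a small softmax regression. The sketch below illustrates this with plain NumPy; the step count, learning rate, momentum, and L2 factor are illustrative defaults, not the paper's (unspecified here) values.

```python
import numpy as np

def train_linear_classifier(feats, labels, n_classes, steps=100,
                            lr=0.01, momentum=0.9, l2=0.001):
    """Train a linear softmax classifier on frozen feature embeddings
    with SGD + momentum and cross-entropy loss (hyperparameters are
    illustrative assumptions)."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    vel = np.zeros_like(W)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = feats.T @ (probs - onehot) / n + l2 * W   # CE gradient + L2
        vel = momentum * vel - lr * grad
        W += vel
    return W
```

The feature extractor stays frozen throughout; only `W` is learned per few-shot task.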

We ran BOHB on parallel GPU (Nvidia P100) workers using the default settings (η = 3) and the number of mini-batch updates as the budget parameter (see Section 3.2.4). Each worker trains a feature extractor and computes the validation accuracy on few-shot tasks. For mini-ImageNet and Meta-Dataset, we allocate a fixed maximum training budget of mini-batch updates, chosen assuming a reference batch size. Note that the batch size is itself a hyperparameter in our search spaces. Therefore, the number of updates needs to be scaled up or down so that the effective number of epochs stays the same: for instance, halving the sampled batch size doubles the resulting number of mini-batch updates. We ran BOHB for several successive halving iterations.

Search space ranges. The ranges for the search spaces S1 and S2 are shown in Table 2.

Hyperparameter Range/Values
Optimization params
Optimizer {SGD, ADAM}
Learning rate
L2 regularization
Decay every
Batch size
Augmentation params
Number of operations (N)
Standard deviation for magnitudes (σ)

Table 2: Search space ranges

4.2 Impact of hyperparameter optimization

Classifier Hyperparams Test accuracy
1-shot 5-shot
Linear Random
Linear Default
Linear BOHB
N-Centroid Random
N-Centroid Default
N-Centroid BOHB
Table 3: HPO on mini-ImageNet. Test accuracies are reported for 5-way 1-shot and 5-shot tasks.
Classifier Hyperparams Test Accuracy
Linear Random
Linear Default
Linear BOHB
N-Centroid Random
N-Centroid Default
N-Centroid BOHB
Table 4: HPO on the Meta-Dataset. We report average test accuracies for each dataset group. D1 shows within-domain, and D2 and D3 show cross-domain performance (see Table 1).

To assess the impact of hyperparameters on few-shot performance, we performed an experiment training the feature extractor under three settings: 1) with randomly sampled hyperparameters; 2) with hyperparameters provided by publicly available code (the result of manual optimization); 3) with hyperparameters obtained with BOHB. For sampling hyperparameters in 1) and 3), we used the same search space S1, consisting of optimization hyperparameters only. For the random baseline we report the mean and standard deviation over multiple runs. For each setting, we allocated the same training budget. In the default setting, we used the hyperparameter choices from Chen et al. [2]. When running BOHB on the Meta-Dataset, we used D1 for training and validation, and we performed a separate HPO for each classifier type (Linear or N-Centroid).

Table 3 and Table 4 show results on mini-ImageNet and Meta-Dataset, respectively. In both cases, there is a clear benefit of HPO over random and default parameters. The N-centroid classifier always outperformed the linear classifier, even on the Meta-Dataset, where the test datasets (D2 and D3) may have a very different data distribution than the training and validation set. Thus, for the remaining experiments we only report the results with the N-centroid classifier.

4.3 Optimizing for different data distributions

Validation Test Accuracy
ImageNet-GBM Omniglot Quickdraw Birds
Table 5: Test performance of models optimized with BOHB (search space S1) and ImageNet-GBM for training. Cross-domain validation improves the domain transfer compared to within-domain validation, even if the validation set is in a different domain than the test set.

Can HPO help learn a feature extractor from ImageNet which transfers better to target domains (under a limited data regime)? We trained on ImageNet-GBM's train split and validated on few-shot tasks sampled from a different domain's validation split. We consider QuickDraw (consisting of sketch images) and Omniglot (handwritten character images) as domains that are very different from ImageNet. As a similar domain, we consider the Birds dataset. After optimization, the feature extractor is evaluated on few-shot tasks sampled from a target domain's test split. We only use the search space S1 for this experiment. The results are shown in Table 5.

Improved transfer to other domains. Using the target domain's data for validation is useful to obtain a feature extractor that transfers better to that domain. For QuickDraw and Omniglot, this leads to a clear improvement over an optimized baseline which uses ImageNet as the validation domain. For the Birds dataset, we see only a small improvement, because the validation domain is already mostly included in the training domain.

It is worth noting that optimizing for a target domain (e.g., Omniglot) does not destroy much of the performance on other domains (e.g., ImageNet). Thus, there is no over-fitting problem of the kind one would get by finetuning the network on Omniglot, which would destroy its performance on ImageNet.

Transferring to mixed domains. We also experimented with a validation objective where few-shot tasks are sampled from a mixture of different domains. We used the validation splits from the dataset group D2; see Table 1. The results are shown in Table 6. We still observe a performance gain, but the effect is diminished compared to validating on single datasets.

Validation Test Accuracy
Table 6: Mixed domain transfer. Feature extractors are trained on D1 and validated on few-shot tasks sampled from D2 (see Table 1). We report average test performance on both D2 and D3.

4.4 Evaluating generalization on unseen domains

In the previous section we showed that it is possible to tune a feature extractor, such that it performs better on target domains that are different from the training domain by using a validation objective from that domain.

It would be even more practical if we could learn universal features that generalize to domains unseen during training or validation. To this end, we performed a cross-validation experiment with feature extractors trained on ImageNet-GBM, validated on a subset D_val, and tested on a subset D_test, where D_val and D_test are disjoint sets of datasets which do not include ImageNet-GBM. Few-shot tasks are sampled uniformly at random from D_val (during validation) and D_test (during testing).

Split	Validation (D_val)	Test (D_test)
CV1	Aircraft, Birds, Textures, Quickdraw, Fungi	VGG Flower, Omniglot, Traffic Sign, MSCOCO
CV2	VGG Flower, Omniglot, Fungi, Aircraft, QuickDraw	Birds, Textures, Traffic Sign, MSCOCO
CV3	Birds, Textures, Omniglot, VGG Flower, Fungi	Aircraft, QuickDraw, Traffic Sign, MSCOCO
CV4	QuickDraw, Birds, Textures, VGG Flower, Aircraft	Fungi, Omniglot, Traffic Sign, MSCOCO
Table 7: Cross-validation splits for evaluating generalization. The cross-validation splits are CV1, CV2, CV3, and CV4.

We construct four cross-validation splits (CV1, CV2, CV3, CV4), each having a validation and a test subset (shown in Table 7). Since Traffic Sign and MSCOCO are always used only for testing in the Meta-Dataset framework [54], we do not use them for validation. For each cross-validation split, we run BOHB to optimize the features using its validation subset. The best model is then tested on the unseen datasets in the split. The test results are shown in Table 8. To compute a final generalization score, for each dataset we average the available test scores across the cross-validation splits; we then average these per-dataset accuracies and standard deviations (in the last column) across all datasets. We compare the average performance with a model validated on ImageNet-GBM only (last row).
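The aggregation just described is a masked two-level average. The sketch below illustrates it with made-up numbers: each cell holds a test accuracy, and a NaN marks a dataset that served for validation in that split (so it contributes no test score there).

```python
import numpy as np

# Rows: cross-validation splits; columns: datasets. Values are invented
# for illustration; NaN = dataset used for validation in that split.
acc = np.array([
    [np.nan, 0.70,   0.65],
    [0.80,   np.nan, 0.60],
    [0.85,   0.75,   np.nan],
])
per_dataset = np.nanmean(acc, axis=0)  # average over splits, per dataset
final_score = per_dataset.mean()       # average over datasets
```

`nanmean` skips the validation cells, matching the rule that only test scores enter the per-dataset average.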

Consistent generalization. Results in Table 8 show that the cross-validation score is similar to that of a model trained and validated on ImageNet-GBM only. Irrespective of the validation dataset used, we do not observe a tendency to overfit and our models generalize equally well on unseen domains. However, we also do not achieve a substantial benefit regarding generalization to unseen domains from using cross-validation: the average performance is similar to validation only on ImageNet-GBM.

Test source Dataset avg.
VGG Flower
Traffic Sign
Average acc. After cross validation
Model validated with ImageNet-GBM
Table 8: Cross-validation results. We report results for each test dataset in a cross-validation split. All validation datasets in a split are marked with "-". In the rightmost column we report average scores for each dataset. The final cross-validation score is computed by taking the average across this column.

4.5 Optimization of data augmentation parameters

We pick up the questions from Section 3.1 and test whether HPO can be useful in the context of data augmentation or whether it is superfluous. We used the larger search space S2 described in Section 3.1. We experiment with three settings: first, tuning the optimization hyperparameters and a separate standard deviation σ for each operation (search space S2); second, modifying S2 to have a common standard deviation σ shared by all operations; and third, tuning the optimization hyperparameters under the condition that the augmentation operations have random σ's and a fixed number of operations N. The results are reported in Table 9 and show that adding data augmentation to the search space does improve performance. The improvement is not as large as the gain from optimizing the parameters in S1, but it is substantial. We observed that the validation objective is more sensitive towards hyperparameters which are more important. To analyze this, we train multiple models by fixing one subset of S2 and randomizing the rest. We consider three cases: 1) freeze the optimization parameters and randomize the augmentation parameters; 2) randomize the number of operations N along with σ; 3) freeze the augmentation parameters and randomize the optimization parameters. The variance of the validation objective is shown in Figure 2. We observe that randomizing the optimization hyperparameters shows the largest performance variance, which indicates that they must be carefully tuned.

Tune Opt. Data aug. Test accuracy

1-shot 5-shot
default (fixed)
default (fixed)
Table 9: Tuning data augmentation. We report test performance on mini-ImageNet. The first and second rows represent models trained with fixed augmentation settings (random crops, flipping and intensity changes from [2]). In the second row we tune the optimization parameters (search space S1). The third row shows performance after tuning search space S2. In the fourth row, we tune a common standard deviation σ shared across augmentation operations. The configuration in the last row fixes N and samples a random σ for each operation. The first column denotes whether the optimization parameters were tuned or not.
Figure 2: Performance variation with search space randomization. Randomizing the optimization search space shows the biggest performance variation, suggesting that these parameters are sensitive and thus more important to tune.

4.6 Comparison to the state of the art

With the improved generalization obtained with HPO, how do the results compare to the state of the art in few-shot learning? In Table 10 we observe that BOHB-N-Centroid tuned on the simple search space S1 (optimization only) with fixed standard augmentation (random crops, flipping and intensity changes) clearly improves over comparable methods [2]. Using the larger search space S2 (augmentation & optimization), we are comparable to the test accuracy of a network ensemble [11]. We also report our results after ensembling. Unlike [11], we do not retrain the best model multiple times, but rather take the ensemble of the top models discovered by BOHB.

Also on Meta-Dataset, the results compare favorably to existing approaches (see Table 11). It is worth noting that our performance gains come without the need for any additional adaptation (i.e., the feature extractor does not need to be optimized on few-shot tasks during evaluation). Since ensembling provides such large performance gains [11], we also report ensemble results for the Meta-Dataset benchmark. Since the compared methods do not use ensembles, this comparison is not on the same ground anymore, yet it shows that the gains are mostly complementary. Also, we found that optimizing hyperparameters using the larger search space S2 did not lead to an improvement with ImageNet as the training source. We conjecture that ImageNet data provides sufficient regularization for our ResNet18 feature extractor and additional data augmentation does not help.

Model Aug. Network Input Ensm. Test accuracy
1-shot 5-shot
TADAM  [41] standard ResNet12
MetaOptNet  [29] standard ResNet12
LEO  [46] standard WideResNet
FEAT  [19] standard WideResNet
Robust20  [11] standard WideResNet
Robust20  [11] standard ResNet18
Linear  [2] standard ResNet18
Cosine  [2] standard ResNet18
BOHB-NC- standard ResNet18
standard ResNet18
BOHB-NC- learned ResNet18
learned ResNet18
Table 10: Comparison to the state of the art (mini-ImageNet). We report results using only the nearest-centroid classifier (BOHB-NC). The suffixes S1 and S2 denote the search space used. Since different methods use different input image resolutions and network backbones, we indicate these as well. The "Ensm." column indicates whether a model uses ensembling during evaluation.
Method Val Adapt Test accuracy
BOHB-NC-S1 (ensm.)
Table 11: Comparison to the state of the art (Meta-Dataset). We compare the performance of our models (BOHB-NC-S1) to other approaches on the Meta-Dataset benchmark (numbers are from [54]). We report results of models trained on ImageNet-GBM and optimized with within-domain and cross-domain validation sets. We also report performance after ensembling the top models in each case (denoted by ensm.). The "Adapt" column indicates whether the parameters of the feature extractor are also optimized during few-shot evaluation. Note that here the baseline methods also use hyperparameter optimization [54].

5 Analysis

(a) BOHB-N-Centroid (mini-ImageNet)
(b) BOHB-N-Centroid (ImageNet-GBM)
Figure 3: Hyperparameter relationships. Parallel coordinate plots are shown for BOHB runs using N-Centroid classifiers on mini-ImageNet (top) and ImageNet-GBM (bottom). The first axis represents the validation objective (accuracy). We consider samples in the good performance region, i.e., within a fixed margin of the maximum validation accuracy. The remaining axes represent the hyperparameters in the search space. The numbers above and below each axis show its minimum and maximum value. A connected line across the axes maps a validation accuracy to a hyperparameter configuration. Yellow and blue correspond to good and bad hyperparameter configurations, respectively.

What are good hyperparameters? We use parallel coordinate plots (PCP) [37] to visualize relationships between the validation objective and the different hyperparameters in our search space (Figure 3). We observe that configurations with a lower learning rate and a smaller batch size (resulting in more mini-batch updates) typically lead to better performance. This is in line with the common understanding in the community [35]. In practice, we found that a small batch size consistently performed well. For the mini-ImageNet experiments, BOHB finds good configurations with both ADAM and SGD; for the experiments on ImageNet-GBM, however, SGD clearly outperforms ADAM. With regard to the augmentation search space, we observed that the number of randomly sampled augmentation operations is often chosen to be one or two, i.e., it is advantageous to apply only few transformations in sequence. More details are in the supplementary material.
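The "good performance region" that the parallel coordinate plots restrict to can be extracted with a simple filter over the configurations sampled by BOHB. A minimal sketch, where the `margin` value and the data layout are our own assumptions:

```python
def good_configs(results, margin=0.01):
    """Return hyperparameter configurations whose validation accuracy lies
    within `margin` of the best observed accuracy.

    results: list of (val_accuracy, config_dict) pairs sampled by BOHB.
    margin: width of the good performance region (an assumed value).
    """
    best = max(acc for acc, _ in results)
    return [cfg for acc, cfg in results if acc >= best - margin]
```

Only the surviving configurations are then drawn as lines in the parallel coordinate plot, colored by their validation accuracy.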

6 Conclusions

In this paper, we aimed for features that generalize well across tasks and across domains. We showed that hyperparameter optimization with few-shot classification as the validation objective is a very powerful tool to this end. Apart from the typical within-domain analysis, we also investigated few-shot learning across domains. We saw that HPO adds to cross-domain generalization even when the optimization is not run across domains but on a rich enough dataset like ImageNet. Moreover, it provides a way to adapt to a specific domain without destroying generalization on other domains. We shed more light on the ongoing discussion of whether data augmentation can benefit from optimizing its parameters and obtained a positive answer for a reasonably sized search space. Finally, we found that HPO combines well with ensembles.


  • [1] H. Bertrand, R. Ardon, M. Perrot, and I. Bloch (2017) Hyperparameter optimization of deep neural networks: combining hyperband with Bayesian model selection. In CAP, Cited by: §2.
  • [2] W. Chen, Y. Liu, Z. Kira, Y. Wang, and J. Huang (2019) A closer look at few-shot classification. In ICLR, Cited by: §1, §2, §4.2, §4.6, Table 10, Table 9.
  • [3] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In CVPR, Cited by: §4.1.
  • [4] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §2, §3.1.
  • [5] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation strategies from data. In CVPR, Cited by: §1, §2, §3.1.
  • [6] T. Devries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv abs/1708.04552. External Links: Link, 1708.04552 Cited by: §3.1.
  • [7] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto (2020) A baseline for few-shot image classification. In ICLR, External Links: Link Cited by: §2.
  • [8] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §2.
  • [9] J. Donahue, P. Krähenbühl, and T. Darrell (2017) Adversarial feature learning. In ICLR, External Links: Link Cited by: §2.
  • [10] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. arXiv abs/1907.02544. External Links: Link, 1907.02544 Cited by: §2.
  • [11] N. Dvornik, C. Schmid, and J. Mairal (2019) Diversity with cooperation: ensemble methods for few-shot classification. In ICCV, Cited by: §1, §2, §4.6, §4.6, Table 10.
  • [12] S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. In ICML, External Links: Link Cited by: §1, §2, §3.2.4, §3.2.4.
  • [13] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §2.
  • [14] X. Gastaldi (2017) Shake-shake regularization. In ICLRW, Cited by: §2.
  • [15] G. Ghiasi, T. Lin, and Q. V. Le (2018) Dropblock: a regularization method for convolutional networks. In NeurIPS, Cited by: §2.
  • [16] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In CVPR, Cited by: §3.2.3.
  • [17] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §2.
  • [18] R. Girshick (2015) Fast r-cnn. In ICCV, Cited by: §2.
  • [19] F. Hao, J. Cheng, L. Wang, and J. Cao (2019) Instance-level embedding adaptation for few-shot learning. IEEE Access 7, pp. 100501–100511. External Links: Link, Document Cited by: Table 10.
  • [20] D. Ho, E. Liang, X. Chen, I. Stoica, and P. Abbeel (2019) Population based augmentation: efficient learning of augmentation policy schedules. In ICML, External Links: Link Cited by: §1.
  • [21] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel (2013) Detection of traffic signs in real-world images: the German Traffic Sign Detection Benchmark. In IJCNN, Cited by: §4.1.
  • [22] F. Hutter, H. Hoos, and K. Leyton-Brown (2014) An efficient approach for assessing hyperparameter importance. In ICML, Cited by: §1.
  • [23] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu (2017) Population based training of neural networks. arXiv abs/1711.09846. External Links: Link, 1711.09846 Cited by: §1.
  • [24] K. Jamieson and A. Talwalkar (2016) Non-stochastic best arm identification and hyperparameter optimization. In AISTATS, Cited by: §3.2.4.
  • [25] Y. Jiang, B. Neyshabur, D. Krishnan, H. Mobahi, and S. Bengio (2020) Fantastic generalization measures and where to find them. In ICLR, External Links: Link Cited by: §1.
  • [26] J. Jongejan, H. Rowley, T. Kawashima, J. Kim, and N. Fox-Gieg (2016) The quick, draw!-ai experiment. Mount View, CA, accessed Feb 17, pp. 2018. Cited by: §4.1.
  • [27] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICMLW, Cited by: §2.
  • [28] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §4.1.
  • [29] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In CVPR, Cited by: §2, Table 10.
  • [30] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017-01) Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18 (1), pp. 6765–6816. External Links: ISSN 1532-4435, Link Cited by: §1, §2, §3.2.4.
  • [31] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim (2019) Fast autoaugment. In NeurIPS, Cited by: §2.
  • [32] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §2.
  • [33] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, External Links: Link Cited by: §3.1.
  • [34] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. Technical report External Links: 1306.5151 Cited by: §4.1.
  • [35] D. Masters and C. Luschi (2018) Revisiting small batch training for deep neural networks. arXiv abs/1804.07612. External Links: Link, 1804.07612 Cited by: §3.1, §5.
  • [36] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka (2013-11) Distance-based image classification: generalizing to new classes at near-zero cost. IEEE Trans. PAMI 35 (11), pp. 2624–2637. External Links: Document, ISSN Cited by: §3.2.3.
  • [37] R. E. Moustafa (2011) Parallel coordinate and parallel coordinate density plots. Wiley Interdisciplinary Reviews: Computational Statistics 3 (2), pp. 134–148. Cited by: §5.
  • [38] M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In ICCV, Cited by: §4.1.
  • [39] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • [40] R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2018) Sensitivity and generalization in neural networks: an empirical study. In ICLR, External Links: Link Cited by: §1.
  • [41] B. Oreshkin, P. R. López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. In NeurIPS, Cited by: §2, Table 10.
  • [42] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §2.
  • [43] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, External Links: Link Cited by: §2.
  • [44] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. In ICLR, External Links: Link Cited by: §4.1.
  • [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §4.1.
  • [46] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. In ICLR, External Links: Link Cited by: §2, Table 10.
  • [47] T. Saikia, Y. Marrakchi, A. Zela, F. Hutter, and T. Brox (2019) AutoDispNet: improving disparity estimation with automl. In ICCV, Cited by: §2, §3.2.4.
  • [48] B. Schroeder and Y. Cui (2018) FGVCx fungi classification challenge 2018. External Links: Link Cited by: §4.1.
  • [49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In ICLRW, External Links: Link Cited by: §2.
  • [50] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104, pp. 148–175. Cited by: §3.2.4.
  • [51] S. L. Smith and Q. V. Le (2018) A bayesian perspective on generalization and stochastic gradient descent. In ICLR, External Links: Link Cited by: §3.1.
  • [52] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, Cited by: §2.
  • [53] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In CVPR, Cited by: §2.
  • [54] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2020) Meta-dataset: a dataset of datasets for learning to learn from few examples. In ICLR, External Links: Link Cited by: §2, §2, §3.2.1, §4.1, §4.1, §4.4, Table 11.
  • [55] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv abs/1807.03748. External Links: Link, 1807.03748 Cited by: §2.
  • [56] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In ICML, Cited by: §2.
  • [57] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In NeurIPS, Cited by: §2, §4.1.
  • [58] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1.
  • [59] Z. Wu, Y. Xiong, X. Y. Stella, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §2.
  • [60] A. Zela, A. Klein, S. Falkner, and F. Hutter (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. In ICMLW, Cited by: §3.2.4.
  • [61] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In ICLR, External Links: Link Cited by: §1, §3.1.

Appendix A Data augmentation

a.1 Number of augmentation operations

During our experiments with BOHB on search space S2 (see Section 3.1), we found that good configurations set the number of data augmentation operations applied per mini-batch to low values, as shown in Figure (i), with the best configurations using only one or two operations. We observe that the validation accuracy drops as the number of operations increases.

Figure (i): Optimal number of augmentation operations. We plot validation accuracy versus the number of augmentation operations applied per mini-batch for a BOHB run on mini-ImageNet while optimizing hyperparameters on search space S2. Each circle represents a configuration sampled by BOHB.
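Sampling a small number of augmentation operations and applying them in sequence, as favored by BOHB here, can be sketched as follows. The operation set and function names are illustrative, not the paper's exact pipeline:

```python
import random

def apply_random_ops(batch, ops, n_ops=2, rng=random):
    """Apply n_ops randomly chosen augmentation operations in sequence.

    batch: the object being augmented (e.g. a mini-batch of images).
    ops: list of callables, each mapping a batch to an augmented batch.
    n_ops: how many distinct operations to sample for this mini-batch.
    """
    for op in rng.sample(ops, n_ops):
        batch = op(batch)
    return batch
```

In this framing, `n_ops` is the hyperparameter whose low optimal values are shown in Figure (i).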

a.2 Results for Meta-Dataset on search space S2

We found that optimizing hyperparameters on the larger search space S2 (optimization + augmentation) did not lead to an improvement on the Meta-Dataset benchmark. The results compared to search space S1 are shown in Table (i). We conjecture that the ImageNet-GBM training data provides sufficient regularization for our ResNet18 feature extractor, so additional data augmentation does not help.

Method Test accuracy
Table (i): Results on Meta-Dataset (with search spaces S1 and S2). We compare test performance on the three dataset groups (described in Section 4.1 of the paper). The BOHB models were trained and validated on ImageNet-GBM. The suffixes S1 and S2 denote the search space used.

Appendix B More results with a linear classifier

b.1 Cross-domain validation

We report results using the linear classifier with cross-domain validation, which again shows better transfer to tasks from a different data distribution (see Section 4.3). The results are shown in Table (ii) and complement Table 5 in the main paper (where an N-Centroid classifier was used).
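Unlike the nearest-centroid variant, the linear classifier fits a softmax linear head on the frozen support features for each few-shot task. A minimal gradient-descent sketch under assumed hyperparameters (learning rate, step count, and function names are our own; the paper's exact training procedure may differ):

```python
import numpy as np

def fit_linear_classifier(feats, labels, n_classes, lr=0.1, steps=500):
    """Fit a softmax linear classifier on frozen features.

    feats: (n, d) support features; labels: (n,) integer class ids.
    Returns a weight matrix W of shape (d + 1, n_classes), last row = bias.
    """
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append bias input
    onehot = np.eye(n_classes)[labels]
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the cross-entropy loss w.r.t. W.
        W -= lr * X.T @ (probs - onehot) / len(X)
    return W

def predict_linear(W, feats):
    """Predict class ids for query features with a fitted linear head."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return np.argmax(X @ W, axis=1)
```

Only the small linear head is trained per task; the feature extractor itself stays fixed, as in the nearest-centroid case.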

Validation Test Accuracy
ImageNet-GBM Omniglot Quickdraw Birds
Table (ii): Test performance of models trained on ImageNet-GBM and optimized with BOHB (search space S1). Cross-domain validation improves the domain transfer. Here, we report results using the linear classifier for few-shot classification tasks. These results complement Table 5 in the main paper.

b.2 Hyperparameter relationships

Similar to Figure 3 in the main paper, we use parallel coordinate plots to visualize relationships between the validation objective and the different hyperparameters in our search space, this time using the linear classifier (Figure (ii)). Supporting the discussion in Section 5, configurations with smaller batch sizes also lead to optimal performance with the linear classifier. A similar pattern is observed for the choice of optimizer (SGD outperforms ADAM on ImageNet-GBM).

(a) BOHB-Linear (mini-ImageNet)
(b) BOHB-Linear (ImageNet-GBM)
Figure (ii): Hyperparameter relationships. Parallel coordinate plots are shown for BOHB runs using linear classifiers on mini-ImageNet (top) and ImageNet-GBM (bottom). The first axis represents the validation objective (accuracy). The remaining axes represent the hyperparameters in the search space.

Appendix C Ensembling

In Figure (iii), we observe that ensembling the logits of the top models obtained from BOHB performs better on 5-shot tasks than an ensemble of the same size obtained by re-training the best model. On 1-shot tasks, however, both variants reach the same performance as the ensemble size increases.
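The logit-ensembling step itself amounts to averaging the class scores of the individual models before taking the argmax. An illustrative sketch, not the exact evaluation code:

```python
import numpy as np

def ensemble_predict(logits_per_model):
    """Average the logits of several models, then take the argmax class.

    logits_per_model: list of (n_samples, n_classes) arrays, one per model,
    e.g. from the top configurations discovered by BOHB.
    """
    avg_logits = np.mean(np.stack(logits_per_model), axis=0)
    return np.argmax(avg_logits, axis=1)
```

The only difference between the two curves in Figure (iii) is where the member models come from: the top BOHB configurations versus repeated re-trainings of the single best configuration.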

(a) 1-shot test performance
(b) 5-shot test performance
Figure (iii): Performance vs. ensemble size. We visualize the effect of increasing the ensemble size on test accuracy for 5-way few-shot tasks sampled from mini-ImageNet. The top and bottom figures show performance trajectories for 1-shot and 5-shot tasks, respectively. The blue curve denotes the performance trajectory of ensembling the top configurations from BOHB, while the orange curve shows the one obtained by re-training the best model found by BOHB multiple times.

Appendix D Detailed results on Meta-Dataset

We report more detailed results showing performance on individual datasets from Meta-Dataset. Results for all our models in comparison to baselines from the Meta-Dataset benchmark are shown in Table (iii).

A visual comparison of our best models with the baseline models (ProtoNet and Proto-MAML) is shown in Figure (iv). We observe that BOHB-NC performs comparably or better. With ensembling, we see large boosts in performance, even for difficult datasets such as QuickDraw and Omniglot, which come from a very different data distribution.

With adaptation No adaptation No adaptation (Ours)
Test source Finetune Proto-MAML MatchingNet ProtoNet BOHB-L BOHB-NC BOHB-L BOHB-NC BOHB-NC-E BOHB-NC-E
Val. Val. Val. Val. Val. Val. Val. Val. Val. Val.
VGG Flower
Traffic Sign

Table (iii): Results on Meta-Dataset. We report results of our models (prefixed with BOHB) on all datasets from the Meta-Dataset benchmark. The suffixes "L" and "NC" indicate the choice of few-shot classifier: linear or nearest centroid. "NC-E" indicates that the results are from an ensembled model using the nearest-centroid classifier. The validation dataset(s) used is shown under the model name. We compare to the top two baselines from the Meta-Dataset benchmark with adaptation (Finetune, Proto-MAML) and without adaptation (ProtoNet, MatchingNet). Here, adaptation indicates whether the parameters of the feature extractor were optimized during few-shot evaluation.
Figure (iv): Visualizing performance on individual datasets from Meta-Dataset. Each model uses ImageNet-GBM as the training source. In comparison to the baselines ProtoNet and Proto-MAML, BOHB-NC performs comparably or better. With ensembling, we see large boosts in performance, even on datasets from a different data distribution (e.g., QuickDraw and Omniglot).