Generalization to unseen samples distinguishes machine learning from traditional data mining. For machine learning, it is not enough to store sufficiently many examples and to access them efficiently; the examples are supposed to yield a generic model that makes reasonable decisions (far) beyond the observed training samples. Pushing the limits of generalization is probably the most important goal of machine learning.
To measure generalization, a dataset is typically split into a training and a test set. With no overlap between these sets, the test error indicates generalization to unseen samples. However, data in the training and test set are typically very similar. More challenging test schemes are cross-dataset testing and few-shot learning. In cross-dataset testing, the test set comes from an independent dataset, which involves a considerable domain shift. In few-shot learning, a trained model is tested on unseen classes. Only a few (sometimes one) samples of the new classes are made available to learn their decision function. This is only possible if the features from the pre-trained model are rich enough to provide the discriminative information for the new classes: the features must generalize to new tasks.
Deep networks have far more parameters than there are training samples. Without appropriate regularization, they would memorize the training samples and not generalize at all. In practice, however, deep networks do generalize. This is partially due to explicit regularization, such as weight decay, dropout, data augmentation, or ensembles, and also due to implicit regularization imposed by the optimization settings and the network architecture [25, 40, 61]. Although the research community has developed good practices for all the many design choices, it is difficult to find the right balance, since there are mutual dependencies between parameters. Even worse, the right balance depends much on the training set the network draws its information from: training on ImageNet can require very different regularization than training on a smaller, less diverse dataset.
In this paper, we propose rigorous hyperparameter optimization (HPO) to automatically find the optimal balance between a set of factors that influence regularization. The use of HPO for improving the test performance of deep networks is very common [5, 12, 20, 23, 30]. What distinguishes this paper from previous works is that we analyze the suitability of HPO for improving generalization rather than optimal test performance on a particular benchmark dataset. To this end, we choose the validation objective to match what we want: generalization. We investigate two levels of generic feature learning: (1) within-domain generalization in few-shot learning, and (2) cross-domain generalization. Few-shot learning often does not work well in unseen domains [2, 11]. Thus, in case (2) we investigate whether HPO can improve cross-domain performance when cross-domain tasks are used for optimization.
Our investigation yielded several scientific insights. (1) We show that HPO boosts few-shot performance clearly beyond the state of the art, even with a rather small search space. This confirms that manually finding the right balance among the options that affect regularization is extremely hard. (2) Cross-domain HPO further improves few-shot performance in the target domain. (3) The search confirms some common best-practice choices and discovers some new trends regarding the batch size, the choice of the optimizer, and the way to use data augmentation.
2 Related work
Feature learning. A standard approach to obtain a common feature representation is to train a classification network on ImageNet. These features generalize surprisingly well and are commonly used for finetuning on target domains or tasks [18, 17, 32, 49]. Studies have shown that data augmentation [5, 4, 31] and regularization techniques [14, 15] can help learn feature representations that generalize better. Unsupervised approaches use proxy tasks and loss functions to learn good feature representations. Examples of proxy tasks are recovering images under corruption [42, 56] or image patch orderings [8, 39]. Loss functions focus on contrastive [55, 59] or adversarial learning [9, 10]. In most cases, unsupervised feature learning is not competitive with features obtained from ImageNet.
Few-shot learning. Few-shot learning approaches can primarily be categorized into optimization [13, 43, 46, 54] and metric learning approaches [27, 53, 57]. Optimization approaches are designed to adapt quickly to new few-shot tasks. Ravi and Larochelle used an LSTM to learn an update rule for a good initialization of a model's parameters, such that it performs well given a small training set on a new classification task. Finn et al. proposed MAML, which uses gradient updates obtained from target tasks (unseen during training) to learn a better weight initialization and prevent over-fitting. However, methods such as MAML have been shown to work poorly on larger datasets. State-of-the-art methods [29, 41, 52] often rely on metric learning. Typically such methods meta-learn an embedding over multiple meta-training tasks, which is then used to classify query examples based on some distance metric. For instance, Vinyals et al. proposed matching networks, which label query set examples as a linear combination of the support set labels. Examples in the support set whose embedding representations are closer to the query set example are assigned a higher weight. Snell et al. assume that there exists an embedding space in which samples of the same class cluster around a prototype. Their method builds class prototypes by taking the mean of the support set embeddings for each class. Query set examples are then assigned class labels based on their closeness to the class prototypes.
Recently, it has been shown that the standard feature learning approach, of first training a backbone network on a large number of base classes and learning a few-shot classifier on top, is competitive [2, 11, 54]. In this work, we use standard feature learning.
Hyperparameter optimization. Bayesian optimization is the most common approach for hyperparameter search and has been used to obtain state-of-the-art results on CIFAR-10. Standard Bayesian optimization does not scale well to tasks with long training times or large hyperparameter search spaces. In such cases, methods that rely mostly on random search, like Hyperband, are often more efficient. BOHB adds Bayesian optimization to Hyperband to learn from previously sampled hyperparameter configurations. In computer vision, BOHB was used to optimize the training of a disparity network. We use BOHB for hyperparameter optimization targeted towards optimal features for few-shot learning tasks.
3 Optimized feature learning
To optimize generalization via hyperparameter optimization (HPO), we must define an appropriate search space, an informative validation objective, and the dataset splits on which we optimize. An overview of the whole optimization pipeline with the different data splits is shown in Figure 1.
3.1 Hyperparameter search space
The definition of the search space is decisive for the success of HPO. Naively, one would make the search space as large as possible to explore all possibilities. However, this makes HPO terribly inefficient, and the search will not find optimal values even after many GPU years. Thus, an informed choice of interesting hyperparameters has to be made. We experiment with two search spaces, which are motivated by best-practice insights of previous works [35, 51, 61] and by scientific questions that we want answered in this work.
Search space S1 (Optimization). This smaller search space considers five training parameters relevant for regularization: the choice of optimizer, the batch size, the amount of regularization, the initial learning rate, and its decay frequency, i.e., the number of mini-batch updates until the learning rate is decreased. The learning rate is decayed following a cosine schedule.
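As a minimal sketch, search space S1 can be represented as a set of samplers, one per hyperparameter, together with the cosine learning-rate schedule. The parameter names and ranges below are illustrative placeholders, not the paper's actual ranges (those are listed in its Table 2):

```python
import math
import random

# Illustrative sketch of search space S1; names and ranges are assumptions.
S1 = {
    "optimizer":    lambda rng: rng.choice(["sgd", "adam"]),
    "batch_size":   lambda rng: rng.choice([8, 16, 32, 64, 128]),
    "lr":           lambda rng: 10 ** rng.uniform(-4, -1),    # log-uniform
    "weight_decay": lambda rng: 10 ** rng.uniform(-6, -3),    # log-uniform
    "decay_every":  lambda rng: rng.randint(1_000, 50_000),   # updates per decay period
}

def sample_config(rng):
    """Draw one hyperparameter configuration from S1."""
    return {name: draw(rng) for name, draw in S1.items()}

def cosine_lr(initial_lr, step, period):
    """Cosine-decayed learning rate, restarting every `period` mini-batch updates."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * (step % period) / period))
```

In a real run, each sampled configuration would parameterize one training of the feature extractor.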
Search space S2 (Optimization & Augmentation). Since data augmentation plays an important role in regularization, we define a second, larger search space that adds data augmentation on top of S1. Search spaces over data augmentation can quickly become very large. AutoAugment searched for optimal augmentation policies, i.e., sequences of augmentation operations, using reinforcement learning. A huge search space tends to yield sub-optimal results after finite compute time. Indeed, in a follow-up work Cubuk et al. claimed that the full policy search was unnecessary, as randomly choosing transformations from the overall set of transformations performs on par. This leaves the question on the table: is this a) because it is not critical to tune exact magnitudes for each augmentation operation as long as there is a diverse set of operations, or b) because the search with reinforcement learning found a sub-optimal policy?
We shed some more light on this using a smaller search space than in AutoAugment: a subset of 10 operations from AutoAugment (rotate, posterize, solarize, color, contrast, brightness, sharpness, shear, translate, cutout) and only a single magnitude parameter for each of them. In particular, we sample the transformation magnitude from a clipped Gaussian distribution and optimize its standard deviation (search ranges described in Section 4.1). Augmentation operations are applied at random, and we optimize the number of operations applied at a time, which yields a single additional parameter.
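The magnitude sampling described above can be sketched as follows. The zero mean, the [0, 1] clipping range, and the toy operations are illustrative assumptions, not values from the paper; only the per-operation standard deviation and the number of operations are searched:

```python
import random

def sample_magnitude(sigma, low=0.0, high=1.0, rng=random):
    """Draw an augmentation magnitude from a Gaussian clipped to [low, high].

    Only sigma is part of the search space; mean and clipping range are
    illustrative assumptions here.
    """
    return min(high, max(low, abs(rng.gauss(0.0, sigma))))

def apply_random_ops(image, ops, n_ops, sigmas, rng=random):
    """Apply `n_ops` randomly chosen operations, each with a magnitude drawn
    from its own clipped Gaussian. `ops` maps operation names to functions
    of (image, magnitude)."""
    for name in rng.sample(list(ops), n_ops):
        image = ops[name](image, sample_magnitude(sigmas[name], rng=rng))
    return image
```

`n_ops` corresponds to the single additional parameter mentioned above.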
3.2 Validation objective
Apart from the training objective, which is optimized on the training set, HPO requires an additional validation objective on a held out validation set. We propose to use the accuracy of few-shot classification as validation objective to measure generalization performance. The overall scheme, as illustrated in Figure 1, contains multiple training and optimization stages and different data subsets used for these. We describe it in detail in the following subsections.
3.2.1 Training a classification network for feature extraction
The inner part of the optimization is training the weights of a regular classification network using the hyperparameter sample currently evaluated by HPO. We use a standard ResNet18 as feature extractor. Each training run with a hyperparameter configuration λ yields a different network with its own weights and corresponding features.
3.2.2 Sampling of few-shot learning tasks
The features are validated on how well they serve as a basis for few-shot learning. To this end, we evaluate the average result of multiple few-shot classification tasks, which are randomly sampled from the validation set.
Each few-shot task consists of a small set of classes with a few labeled examples per class. For each task, a classifier is trained on these few labeled examples (the support set) and is evaluated on unseen examples of the same classes (the query set). The number of classes in the task is commonly called the number of ways, and the number of labeled examples per class is termed the number of shots. For each task we measure the classification accuracy on its query set. Tasks may have a variable number of ways and shots.
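The task sampling above can be sketched for the fixed-ways/shots case. The function and parameter names are ours, and this simplified sampler ignores Meta-Dataset's variable-ways/shots scheme:

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, query_per_class, seed=None):
    """Sample one N-way K-shot task: a support set and a query set.

    `dataset` is assumed to be a list of (example, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    support, query = [], []
    # Pick n_way classes, then split each class's drawn examples into
    # k_shot support examples and query_per_class query examples.
    for c in rng.sample(sorted(by_class), n_way):
        chosen = rng.sample(by_class[c], k_shot + query_per_class)
        support += [(x, c) for x in chosen[:k_shot]]
        query += [(x, c) for x in chosen[k_shot:]]
    return support, query
```

Averaging the query-set accuracy over many such episodes gives the validation objective.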
3.2.3 Training the few-shot classifier for a task
For classifying images in a few-shot task, we freeze the parameters of the feature extractor and train a classifier on the support set, with one output per class (way) of the task. This classifier takes the feature embedding as input.
We experiment with two types of classifiers: a linear classifier and a nearest centroid classifier. The weight matrix of the linear classifier is trained using the cross-entropy loss. The nearest centroid classifier computes a class prototype by averaging the embeddings of each class's support set examples. Each query set example is then assigned the class whose prototype is nearest in terms of cosine similarity.
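The nearest centroid classifier can be sketched in a few lines of NumPy; the function name is ours:

```python
import numpy as np

def nearest_centroid_predict(support_feats, support_labels, query_feats):
    """Nearest-centroid few-shot classification on frozen features.

    Class prototypes are the means of each class's support embeddings;
    each query is assigned the class whose prototype has the highest
    cosine similarity to the query embedding.
    """
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    # L2-normalize so a dot product equals cosine similarity.
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return classes[np.argmax(q @ protos.T, axis=1)]
```

Note that, unlike the linear classifier, this involves no training at all on the support set beyond averaging embeddings.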
3.2.4 Optimizing the hyperparameters with BOHB
The task of BOHB is to sample hyperparameter configurations λ for training the feature extractor, such that the average accuracy over the few-shot tasks on the validation set is maximized. We emphasize that λ refers to the hyperparameters for training the feature extractor, not to the weights of the few-shot classifiers.
BOHB combines the benefits of Bayesian optimization and Hyperband. Hyperband performs efficient random search by dynamically allocating more resources to promising configurations and stopping bad ones early. To improve over Hyperband, BOHB replaces its random sampling with model-based sampling once there are enough samples to build a reliable model.
To improve efficiency, BOHB evaluates cheaper approximations of the validation objective at smaller budgets b, with b_min ≤ b ≤ b_max, where the true validation objective is recovered at the maximum budget b_max. The budget usually refers to the number of mini-batch updates or the number of epochs [12, 47], but may also be other parameters such as image resolution or training time. We use the number of mini-batch updates as the budget. Similar to Hyperband, BOHB repeatedly calls Successive Halving (SH) to advance promising configurations evaluated on smaller budgets to larger ones. SH starts with a fixed number of configurations on the cheapest budget and retains the best fraction (1/η) for advancement to the next, more expensive budget. BOHB uses a multivariate kernel density estimator (KDE) to model the densities of the good and bad configurations and uses them to sample promising hyperparameter configurations. For more details we refer to the original paper.
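One Successive Halving bracket, as used inside Hyperband and BOHB, can be sketched as follows. This is a simplified illustration with our own function names, not the BOHB implementation:

```python
def successive_halving(configs, evaluate, min_budget, max_budget, eta=3):
    """One Successive Halving bracket.

    All configurations are evaluated on the cheapest budget; the best
    1/eta fraction advances to an eta-times larger budget, until the
    maximum budget is reached. `evaluate(config, budget)` returns a
    validation score (higher is better).
    """
    budget = min_budget
    while True:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        if budget >= max_budget:
            return ranked[0]
        configs = ranked[:max(1, len(ranked) // eta)]
        budget = min(budget * eta, max_budget)
```

In our setting, `evaluate` would train a feature extractor for `budget` mini-batch updates and return the average few-shot validation accuracy.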
3.3 Testing the generalization performance
At test time, we take the best model found by BOHB and evaluate it on few-shot tasks sampled from the test set. As with the validation objective, we compute the test objective as the average classification accuracy over few-shot tasks.
4 Experiments and results
With the experiments we are interested in answering the following scientific questions: (1) Does HPO improve generalization performance as measured by few-shot tasks? (2) Does HPO help transfer features to a target domain? (3) Can it yield features that generalize better to unseen domains? (4) How critical is the optimization of data augmentation parameters? (5) How much do we gain by ensembling top performing models from HPO?
4.1 Experimental setup
The mini-ImageNet dataset is commonly used for few-shot learning. It consists of 100 classes with 600 images per class. The training, validation, and test splits consist of 64, 16, and 20 classes, respectively. Since all data comes from the same dataset, it only enables evaluating cross-task, within-domain generalization, not generalization across domains.
To this end, we use the datasets from the Meta-Dataset benchmark. It is much larger than previous few-shot learning benchmarks and consists of multiple datasets with different data distributions. Also, it does not restrict few-shot tasks to fixed ways and shots, thus representing a more realistic scenario. It consists of datasets from diverse domains: ILSVRC-2012 (the ImageNet dataset, natural images from 1000 categories), Omniglot (hand-written characters, 1623 classes), Aircraft (aircraft images, 100 classes), CUB-200-2011 (birds, 200 classes), Describable Textures (different kinds of texture images with 43 categories), Quick Draw (black-and-white sketches of 345 different categories), Fungi (a large dataset of mushrooms with 1394 categories), VGG Flower (flower images with 102 categories), Traffic Signs (German traffic sign images with 43 classes), and MSCOCO (images collected from Flickr, 80 classes). All datasets except Traffic Signs and MSCOCO have a training, validation, and test split (proportioned roughly 70%, 15%, 15%). Traffic Signs and MSCOCO are reserved for testing only. We note that there exist other ImageNet splits, such as TieredImageNet; to avoid ambiguity we refer to Meta-Dataset's ImageNet version as ImageNet-GBM (Google Brain Montreal).
We group the datasets of the Meta-Dataset benchmark into three subsets, as shown in Table 1. A dataset in each subset has a training, validation, and test split, and we always use only the respective split for the training, validation, and testing stages. Testing on a subset, for example, means using the test splits of its datasets. The splits of the first subset (ImageNet-GBM) are used for training, validation, and testing, while the splits of the second subset are used only for validation and testing. The third subset is reserved for testing.
Evaluation details. We compute the average accuracy over few-shot tasks sampled from a dataset. Few-shot tasks are sampled from the validation split during validation and from the test split during testing. Tasks sampled from mini-ImageNet have fixed ways and shots. The Meta-Dataset benchmark tasks have variable ways and shots, which are randomly sampled. We use the same implementation and sampler settings as defined for the Meta-Dataset benchmark.
As described in Section 3.2, we experiment with two types of few-shot classifiers (Linear and N-Centroid). If a linear classifier is used, a new weight matrix must be learned to map the features to the target classes. We train the linear classifier for a fixed number of steps using SGD with momentum; the learning rate and regularization factor are kept fixed in all experiments.
We ran BOHB on multiple parallel GPU (Nvidia P100) workers using its default settings, with the number of mini-batch updates as the budget parameter (see Section 3.2.4). Each worker trains a feature extractor and computes the validation accuracy on few-shot tasks. For mini-ImageNet and Meta-Dataset we allocate different maximum training budgets of mini-batch updates, chosen assuming a fixed reference batch size. Note that the batch size is itself a hyperparameter in our search spaces; therefore, the number of updates needs to be scaled up or down so that the effective number of epochs stays the same. For instance, sampling twice the reference batch size halves the resulting number of mini-batch updates. We ran BOHB for several successive halving iterations.
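The budget scaling above can be written out explicitly. The reference batch size of 64 below is an illustrative assumption, since only the ratio between reference and sampled batch size matters:

```python
def scaled_updates(reference_updates, batch_size, reference_batch_size=64):
    """Scale the maximum number of mini-batch updates so the effective
    number of epochs stays constant when the sampled batch size differs
    from the reference batch size (64 here is an assumed placeholder)."""
    return round(reference_updates * reference_batch_size / batch_size)
```

For example, doubling the batch size relative to the reference halves the number of updates, and halving it doubles them.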
Search space ranges. The search space ranges for search spaces S1 and S2 are shown in Table 2.
4.2 Impact of hyperparameter optimization
To assess the impact of hyperparameters on few-shot performance, we trained the feature extractor under three settings: 1) with randomly sampled hyperparameters; 2) with hyperparameters provided by publicly available code (the result of manual optimization); 3) with hyperparameters obtained with BOHB. For sampling hyperparameters in 1) and 3) we used the same search space S1, consisting of optimization hyperparameters only. For the random baseline we report the mean and standard deviation over multiple runs. For each setting, we allocated the same training budget. In the default setting, we used the hyperparameter choices from Chen et al. When running BOHB on Meta-Dataset, we used ImageNet-GBM for training and validation, and we performed a separate HPO for each classifier type (Linear or N-Centroid).
Table 3 and Table 4 show results on mini-ImageNet and Meta-Dataset, respectively. In both cases, there is a clear benefit of HPO over random and default parameters. The N-Centroid classifier always outperformed the linear classifier, even on Meta-Dataset, where the test datasets may have very different data distributions compared to the training and validation set. Thus, for the remaining experiments we only report results with the N-Centroid classifier.
4.3 Optimizing for different data distributions
Can HPO help learn a feature extractor on ImageNet that transfers better to target domains (under a limited data regime)? We trained on ImageNet-GBM's train split and validated on few-shot tasks sampled from a different domain's validation split. We consider Quick Draw (sketch images) and Omniglot (hand-written character images) as domains that are very different from ImageNet. As a similar domain, we consider the Birds dataset. After optimization, the feature extractor is evaluated on few-shot tasks sampled from the target domain's test split. We only use search space S1 for this experiment. The results are shown in Table 5.
Improved transfer to other domains. Using the target domain's data for validation helps make a feature extractor that transfers better to that domain. For Quick Draw and Omniglot, this leads to a clear improvement over an optimized baseline that uses ImageNet as the validation domain. For the Birds dataset, we see only a small improvement, because the validation domain is already mostly included in the training domain.
It is worth noting that optimizing for a target domain (e.g., Omniglot) does not destroy much of the performance on other domains (e.g., ImageNet). Thus, there is no over-fitting problem of the kind one would get by finetuning the network on Omniglot, which would destroy its performance on ImageNet.
Transferring to mixed domains. We also experimented with a validation objective where few-shot tasks are sampled from a mixture of different domains, using the validation splits of the second dataset group (see Table 1). The results are shown in Table 6. We still observe a performance gain, but the effect is diminished compared to validating on single datasets.
4.4 Evaluating generalization on unseen domains
In the previous section we showed that, by using a validation objective from a target domain different from the training domain, a feature extractor can be tuned to perform better on that domain.
It would be even more practical if we could learn universal features that generalize to domains unseen during training or validation. To this end, we performed a cross-validation experiment with feature extractors trained on ImageNet-GBM, validated on one subset of datasets, and tested on a disjoint subset, where neither subset includes ImageNet-GBM. Few-shot tasks are sampled uniformly at random from the validation subset (during validation) and the test subset (during testing).
We construct four cross-validation splits, each having a validation and a test subset (shown in Table 7). Since Traffic Signs and MSCOCO are always used only for testing in the Meta-Dataset framework, we do not use them for validation. For each cross-validation split, we run BOHB to optimize the features using its validation subset. The best model is then tested on the unseen datasets of the split. The test results are shown in Table 8. To compute a final generalization score, for each dataset we average the non-zero test scores across the cross-validation splits. Finally, we average the accuracies and standard deviations (in the last column) across datasets. We compare this average performance with a model validated on ImageNet-GBM only (last row).
Consistent generalization. Results in Table 8 show that the cross-validation score is similar to that of a model trained and validated on ImageNet-GBM only. Irrespective of the validation dataset used, we do not observe a tendency to overfit and our models generalize equally well on unseen domains. However, we also do not achieve a substantial benefit regarding generalization to unseen domains from using cross-validation: the average performance is similar to validation only on ImageNet-GBM.
4.5 Optimization of data augmentation parameters
We pick up the questions from Section 3.1 and test whether HPO is useful in the context of data augmentation or whether it is superfluous. We used the larger search space S2 described in Section 3.1. We experiment with three settings: first, tuning the optimization hyperparameters, a separate standard deviation for each operation, and the number of operations (search space S2); second, modifying S2 to have a common standard deviation shared by all operations; third, tuning the optimization hyperparameters while the augmentation operations use random standard deviations and a fixed number of operations. The results are reported in Table 9 and show that adding data augmentation to the search space does improve performance. The improvement is not as large as the gain from optimizing the parameters in S1, but it is substantial. We also analyzed how sensitive the validation objective is to the different groups of hyperparameters: the more important a hyperparameter group, the more the objective should vary when it is randomized. To this end, we train multiple models by fixing one subset of S2 and randomizing the rest. We consider three cases: 1) freeze the optimization parameters and randomize the augmentation standard deviations; 2) additionally randomize the number of operations; 3) freeze the augmentation parameters and randomize the optimization parameters. The variance of the validation objective is shown in Figure 2. Randomizing the optimization hyperparameters shows the largest performance variance, which indicates that they must be tuned most carefully.
4.6 Comparison to the state of the art
With the improved generalization obtained by HPO, how do the results compare to the state of the art in few-shot learning? In Table 10 we observe that BOHB-N-Centroid tuned on the simple search space S1 (optimization only) with fixed standard augmentation (random crops, flipping, and intensity changes) clearly improves over comparable methods. Using the larger search space S2 (optimization & augmentation), we are comparable to the test accuracy of a network ensemble. We also report our results after ensembling. Unlike previous work, we do not retrain the best model multiple times, but rather build the ensemble from the top models discovered by BOHB.
Also on Meta-Dataset, our results compare favorably to existing approaches (see Table 11). It is worth noting that our performance gains come without any additional adaptation (i.e., the feature extractor does not need to be optimized on few-shot tasks during evaluation). Since ensembling provides such large performance gains, we also report ensemble results for the Meta-Dataset benchmark. Since the compared methods do not use ensembles, this comparison is not on the same ground anymore, yet it shows that the gains are mostly complementary. Also, we found that optimizing hyperparameters using the larger search space did not lead to an improvement with ImageNet as the training source. We conjecture that ImageNet data provides sufficient regularization for our ResNet18 feature extractor, so additional data augmentation does not help.
What are good hyperparameters? We use parallel coordinate plots (PCP) to visualize relationships between the validation objective and the different hyperparameters in our search space (shown in Figure 3). We observe that configurations with a lower learning rate and a smaller batch size (resulting in more mini-batch updates) typically lead to better performance. This is in line with the common understanding in the community. In practice, we found that a relatively small batch size consistently performed well. For the mini-ImageNet experiments, BOHB finds good configurations with both ADAM and SGD; however, for the experiments on ImageNet-GBM, SGD clearly outperforms ADAM. With regard to search space S2, we observed that the number of randomly sampled augmentation operations is often chosen to be one or two, i.e., it is advantageous to apply only one or two transformations in sequence. More details are in the supplementary material.
5 Conclusion
In this paper we aimed for features that generalize well across tasks and across domains. We showed that hyperparameter optimization with few-shot classification as the validation objective is a very powerful tool to this end. Apart from the typical within-domain analysis, we also investigated few-shot learning across domains. We saw that HPO adds to cross-domain generalization even when the optimization is not run across domains but on a rich enough dataset like ImageNet. Moreover, it provides a way to adapt to a specific domain without destroying generalization on other domains. We shed more light on the ongoing discussion of whether data augmentation can benefit from optimizing its parameters and obtained a positive answer for a reasonably sized search space. Finally, we found that HPO is well compatible with ensembles.
References
- Hyperparameter optimization of deep neural networks: combining Hyperband with Bayesian model selection. In CAP.
- (2019) A closer look at few-shot classification. In ICLR.
- (2014) Describing textures in the wild. In CVPR.
- (2019) RandAugment: practical data augmentation with no separate search. arXiv:1909.13719.
- (2019) AutoAugment: learning augmentation strategies from data. In CVPR.
- Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552.
- (2020) A baseline for few-shot image classification. In ICLR.
- (2015) Unsupervised visual representation learning by context prediction. In ICCV.
- (2017) Adversarial feature learning. In ICLR.
- (2019) Large scale adversarial representation learning. arXiv:1907.02544.
- (2019) Diversity with cooperation: ensemble methods for few-shot classification. In ICCV.
- (2018) BOHB: robust and efficient hyperparameter optimization at scale. In ICML.
- (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
- (2017) Shake-shake regularization. In ICLRW.
- (2018) DropBlock: a regularization method for convolutional networks. In NeurIPS.
- (2018) Dynamic few-shot visual learning without forgetting. In CVPR.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
- (2015) Fast R-CNN. In ICCV.
- (2019) Instance-level embedding adaptation for few-shot learning. IEEE Access 7, pp. 100501–100511.
- (2019) Population based augmentation: efficient learning of augmentation policy schedules. In ICML.
- (2013) Detection of traffic signs in real-world images: the German Traffic Sign Detection Benchmark. In IJCNN.
- (2014) An efficient approach for assessing hyperparameter importance. In ICML.
- (2017) Population based training of neural networks. arXiv:1711.09846.
- (2016) Non-stochastic best arm identification and hyperparameter optimization. In AISTATS.
- (2020) Fantastic generalization measures and where to find them. In ICLR.
- (2016) The Quick, Draw! A.I. experiment. Online resource, accessed 2018.
- (2015) Siamese neural networks for one-shot image recognition. In ICMLW.
- (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
- (2019) Meta-learning with differentiable convex optimization. In CVPR.
- (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18 (1), pp. 6765–6816.
- (2019) Fast AutoAugment. In NeurIPS.
- (2015) Fully convolutional networks for semantic segmentation. In CVPR.
- SGDR: stochastic gradient descent with warm restarts. In ICLR.
- (2013) Fine-grained visual classification of aircraft. Technical report.
- (2018) Revisiting small batch training for deep neural networks. arXiv:1804.07612.
- (2013) Distance-based image classification: generalizing to new classes at near-zero cost. IEEE Trans. PAMI 35 (11), pp. 2624–2637.
- (2011) Parallel coordinate and parallel coordinate density plots. Wiley Interdisciplinary Reviews: Computational Statistics 3 (2), pp. 134–148.
- (2008) Automated flower classification over a large number of classes. In ICCV.
- (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.
- (2018) Sensitivity and generalization in neural networks: an empirical study. In ICLR.
- (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS.
- (2016) Context encoders: feature learning by inpainting. In CVPR.
- (2017) Optimization as a model for few-shot learning. In ICLR.
- (2018) Meta-learning for semi-supervised few-shot classification. In ICLR.
- (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252.
- (2019) Meta-learning with latent embedding optimization. In ICLR.
- (2019) AutoDispNet: improving disparity estimation with AutoML. In ICCV.
- (2018) FGVCx fungi classification challenge 2018.
- (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In ICLRW.
- (2016) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104, pp. 148–175.
- (2018) A Bayesian perspective on generalization and stochastic gradient descent. In ICLR.
- (2017) Prototypical networks for few-shot learning. In NeurIPS.
- (2018) Learning to compare: relation network for few-shot learning. In CVPR.
- (2020) Meta-Dataset: a dataset of datasets for learning to learn from few examples. In ICLR.
-  (2018) Representation learning with contrastive predictive coding. arXiv abs/1807.03748. External Links: Cited by: §2.
Extracting and composing robust features with denoising autoencoders. In ICML, Cited by: §2.
-  (2016) Matching networks for one shot learning. In NeurIPS, Cited by: §2, §4.1.
-  (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1.
-  (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §2.
Towards automated deep learning: efficient joint neural architecture and hyperparameter search. In ICMLW, Cited by: §3.2.4.
-  (2017) Understanding deep learning requires rethinking generalization. In ICLR, External Links: Cited by: §1, §3.1.
Appendix A Data augmentation
A.1 Number of augmentation operations
During our experiments with BOHB on the search space from Section 3.1, we found that good configurations set the number of data augmentation operations applied per mini-batch to low values, as shown in Figure (i). We observe that the validation accuracy drops as this number increases.
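Applying a fixed number of augmentation operations per mini-batch can be sketched as follows. The operation pool and its names here are purely illustrative placeholders, not the operations actually searched over in our experiments:

```python
import random

# Hypothetical pool of augmentation operations (illustrative only);
# each operation maps a sample to an augmented sample.
AUGMENTATIONS = {
    "flip": lambda x: x[::-1],
    "shift": lambda x: x[1:] + x[:1],
    "identity": lambda x: x,
}

def augment_batch(batch, num_ops, rng=random):
    """Apply `num_ops` randomly chosen operations to every sample
    in the mini-batch; `num_ops` is the hyperparameter tuned by BOHB."""
    out = []
    for sample in batch:
        for _ in range(num_ops):
            op = rng.choice(list(AUGMENTATIONS))
            sample = AUGMENTATIONS[op](sample)
        out.append(sample)
    return out
```

With `num_ops = 0` the batch passes through unchanged; larger values compose more random operations and thus distort samples more strongly.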
A.2 Results for Meta-Dataset on search space S2
We found that optimizing hyperparameters on the larger search space S2 (optimization + augmentation) did not lead to an improvement on the Meta-Dataset benchmark. Table (i) compares the results to those from the smaller search space. We conjecture that the ImageNet-GBM training data already provides sufficient regularization for our ResNet18 feature extractor, so additional data augmentation does not help.
Appendix B More results with a linear classifier
B.1 Cross-domain validation
We report results using the linear classifier with cross-domain validation, which again shows better transfer to tasks from a different data distribution (see Section 4.3). The results are shown in Table (ii) and complement Table 5 in the main paper (where an N-Centroid classifier was used).
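As a sketch of this evaluation protocol, a linear head is fit on the frozen features of each episode. The ridge-regression formulation below is a simplified stand-in for the linear classifier, not the exact head used in our experiments:

```python
import numpy as np

def fit_linear_classifier(feats, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression linear classifier on frozen features
    against one-hot targets; a simplified stand-in for the linear
    head used in few-shot evaluation."""
    onehot = np.eye(num_classes)[labels]               # (N, C) targets
    X = np.hstack([feats, np.ones((len(feats), 1))])   # append bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ onehot)
    return W

def predict(W, feats):
    """Classify by the argmax over the linear scores."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)
```

Because the feature extractor stays frozen, only this small linear problem is solved per episode, which keeps few-shot evaluation cheap.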
B.2 Hyperparameter relationships
Similar to Figure 3 in the main paper, we use parallel coordinate plots to visualize relationships between the validation objective and the different hyperparameters in our search space when using the linear classifier (shown in Figure (ii)). Supporting the discussion in Section 5, configurations with smaller batch sizes lead to optimal performance also with the linear classifier. A similar pattern is observed for the choice of optimizer (SGD outperforms ADAM on ImageNet-GBM).
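A parallel coordinate plot requires bringing all hyperparameters onto a shared vertical scale before drawing one polyline per configuration. A minimal sketch of that normalization step, using made-up configurations and values:

```python
import numpy as np

# Hypothetical BOHB results: one row per sampled configuration with
# columns (lr, batch_size, weight_decay, val_accuracy); values are
# illustrative, not actual results.
runs = np.array([
    [0.10, 64, 1e-4, 0.61],
    [0.01, 16, 1e-5, 0.72],
    [0.05, 32, 5e-5, 0.68],
])

# Min-max normalize each column so every hyperparameter (and the
# objective) shares one [0, 1] vertical axis in the plot.
lo, hi = runs.min(axis=0), runs.max(axis=0)
coords = (runs - lo) / (hi - lo)
```

Each row of `coords` is then drawn as one line across the parallel axes; lines with high values on the objective axis reveal which hyperparameter ranges correlate with good validation accuracy.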
Appendix C Ensembling
In Figure (iii), we observe that ensembling logits from the top models obtained from BOHB performs better on 5-shot tasks than an ensemble of the same size obtained by re-training the best model. On 1-shot tasks, however, both approaches reach the same performance as the ensemble size increases.
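Logit ensembling of this kind can be sketched in a few lines; the member models and their outputs below are placeholders:

```python
import numpy as np

def ensemble_logits(logit_list):
    """Average the logits produced by several models (e.g. the top
    BOHB configurations) for the same queries, then take the argmax
    as the joint prediction."""
    return np.mean(logit_list, axis=0).argmax(axis=1)
```

Averaging in logit space before the argmax lets confident members outvote uncertain ones, which is why the ensemble of diverse BOHB configurations can outperform re-trained copies of a single configuration.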
Appendix D Detailed results on Meta-Dataset
We report more detailed results showing performance on individual datasets from Meta-Dataset. Results for all our models in comparison to baselines from the Meta-Dataset benchmark are shown in Table (iii).
A visual comparison of our best models with the baseline models (ProtoNet and ProtoMAML) is shown in Figure (iv). We observe that BOHB-NC performs comparably or better. With ensembling we see large boosts in performance, even on difficult datasets such as QuickDraw and Omniglot, which come from a very different data distribution.
| With adaptation | No adaptation | No adaptation (Ours) |
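The N-Centroid classifier behind BOHB-NC can be sketched as a prototype-style nearest-centroid rule, here assuming Euclidean distance on the frozen features; this is an illustration, not the exact implementation used:

```python
import numpy as np

def nearest_centroid_predict(support_feats, support_labels, query_feats):
    """Prototype-style classifier: each class centroid is the mean of
    its support features; every query gets the label of the closest
    centroid under squared Euclidean distance."""
    classes = np.unique(support_labels)
    centroids = np.stack([support_feats[support_labels == c].mean(axis=0)
                          for c in classes])
    # Pairwise squared distances between queries (Q, D) and centroids (C, D).
    dists = ((query_feats[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]
```

Since the rule has no trainable parameters beyond the frozen features, evaluation on a new episode reduces to computing class means and distances.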