Convolutional neural networks [lecun1989backpropagation] have become standard tools in computer vision to model images, leading to outstanding results in many tasks such as classification [krizhevsky2012imagenet], object detection [blitznet, ssd, faster-rcnn], or semantic segmentation [blitznet, long2015fully, ronneberger2015u]. Massively annotated datasets such as ImageNet [imagenet] or COCO [coco] seem to have played a key role in this success. However, annotating a large corpus is expensive and not always feasible, depending on the task at hand. Improving the generalization capabilities of deep neural networks and removing the need for huge sets of annotations is thus of utmost importance.
While such a grand challenge may be addressed from different complementary points of view, e.g., large-scale unsupervised learning [caron2018deep], self-supervised learning [doersch2017multi, gidaris2018unsupervised], or by developing regularization techniques dedicated to deep networks [bietti2019kernel, yoshida2017spectral], we choose in this paper to focus on variance-reduction principles based on ensemble methods.
Specifically, we are interested in few-shot classification, where a classifier is first trained from scratch on a medium-sized annotated corpus (that is, without leveraging external data or a pre-trained network), and then evaluated on its ability to adapt to new classes, for which only very few annotated samples are provided (typically 1 or 5). Unfortunately, simply fine-tuning a convolutional neural network on a new classification task with very few samples has been shown to provide poor results [finn2017model], which has motivated the community to develop dedicated approaches.
The dominant paradigm in few-shot learning builds upon meta-learning [finn2017model, ravioptimization, schmidhuber1997shifting, thrun1998lifelong, snell2017prototypical, vinyals2016matching], which is formulated as a principle to learn how to adapt to new learning problems. These approaches split a large annotated corpus into classification tasks, and the goal is to transfer knowledge across tasks in order to improve generalization. While the meta-learning principle seems appealing for few-shot learning, its empirical benefits have not been clearly established yet. There is indeed strong evidence [chen19closerfewshot, gidaris2018dynamic, qiao2018few] that training CNNs from scratch using meta-learning performs substantially worse than training CNN features in a standard fashion, that is, by minimizing a classical loss function relying on corpus annotations; on the other hand, learning only the last layer with meta-learning has been found to produce better results [gidaris2018dynamic, qiao2018few]. Then, it was recently shown in [chen19closerfewshot] that simple distance-based classifiers could achieve accuracy similar to that of meta-learning approaches.
Our paper goes a step further and shows that meta-learning-free approaches can be improved and significantly outperform the current state of the art in few-shot learning. Our angle of attack consists of using ensemble methods to reduce the variance of few-shot learning classifiers, which is inevitably high given the small number of annotations. Given an initial medium-sized dataset (following the standard setting of few-shot learning), the most basic ensemble approach consists of first training several CNNs independently before freezing them and removing the last prediction layer. Then, given a new class (with few annotated samples), we build a mean centroid classifier for each network and estimate class probabilities of test samples, according to a basic probabilistic model, based on the distance to the centroids [mensink2013distance, snell2017prototypical]. The obtained probabilities are then averaged over networks, resulting in higher accuracy.
While we show that the basic ensemble method where networks are trained independently already performs well, we introduce penalty terms that allow the networks to cooperate during training, while encouraging enough diversity of predictions, as illustrated in Figure 1. The motivation for cooperation is that of easier learning and regularization, where individual networks from the ensemble can benefit from each other. The motivation for encouraging diversity is classical for ensemble methods [dietterich2000ensemble], where a collection of weak learners making diverse predictions often performs better together than a single strong one. Whereas these two principles seem in contradiction with each other at first sight, we show that both principles are in fact useful and lead to significantly better results than the basic ensemble method. Finally, we also show that a single network trained by distillation [hinton2015distilling] to mimic the behavior of the ensemble also performs very well, which brings a significant speed-up at test time. In summary, our contributions are three-fold:
- We introduce mechanisms to encourage cooperation and diversity for learning an ensemble of networks. We study these two principles for few-shot learning and characterize the regimes where they are useful.
- We show that it is possible to significantly outperform current state-of-the-art techniques for few-shot classification without using meta-learning.
- As a minor contribution, we also show how to distill the knowledge from an ensemble into a single network with minor loss in accuracy on novel categories, by using additional unlabeled data.
2 Related Work
In this section, we discuss related work on few-shot learning, meta-learning, and ensemble methods.
Typical few-shot classification problems consist of two parts called meta-training and meta-testing [chen19closerfewshot]. During the meta-training stage, one is given a large-enough annotated dataset, which is used to train a predictive model. During meta-testing, novel categories are provided along with few annotated examples, and we evaluate the capacity of the predictive model to retrain or adapt, and then generalize on these new classes.
Meta-learning approaches typically sample few-shot learning classification tasks from the meta-training dataset, and train a model such that it should generalize on a new task that has been left aside. For instance, in [finn2017model] a “good network initialization” is learned such that a small number of gradient steps on a new problem is sufficient to obtain a good solution. In [ravioptimization], the authors learn both the network initialization and an update rule (optimization model) represented by a Long Short-Term Memory network (LSTM). Inspired by few-shot learning strategies developed before deep learning approaches became popular [mensink2013distance], distance-based classifiers based on the distance to a centroid were also proposed, e.g., prototypical networks [snell2017prototypical], or more sophisticated classifiers with attention [vinyals2016matching]. All these methods consider a classical backbone network, and train it from scratch using meta-learning.
Recently, these meta-learning approaches were found to be sub-optimal [gidaris2018dynamic, oreshkin2018tadam, qiao2018few]. Specifically, better results were obtained by training the network on the classical classification task using the meta-training data in a first step, and then only fine-tuning with meta-learning in a second step [oreshkin2018tadam, rusu2018meta, ye2018learning]. Others such as [gidaris2018dynamic, qiao2018few] simply freeze the network obtained in the first step, and train a simple prediction layer with meta-learning, which results in similar performance. Finally, the paper [chen19closerfewshot] demonstrates that simple baselines without meta-learning, relying on distance-based classifiers, work equally well. Our paper pushes such principles even further and shows that by appropriate variance-reduction techniques, these approaches can significantly outperform the current state of the art.
It is well known that ensemble methods reduce the variance of estimators and subsequently may improve the quality of prediction [friedman2001elements]. To gain accuracy from averaging, various randomization or data augmentation techniques are typically used to encourage a high diversity of predictions [breiman1996heuristics, dietterich2000ensemble]. While individual classifiers of the ensemble may perform poorly, the quality of the average prediction turns out to be sometimes surprisingly high.
Even though ensemble methods are costly at training time for neural networks, it was shown that a single network trained to mimic the behavior of the ensemble could perform almost equally well [hinton2015distilling]—a procedure called distillation—thus removing the overhead at test time. To improve the scalability of distillation in the context of highly-parallelized implementations, an online distillation procedure is proposed in [anil2018large]. There, each network is encouraged to agree with the averaged predictions made by other networks of the ensemble, which results in more stable models. The objective of our work is however significantly different. The form of cooperation they encourage between networks is indeed targeted to scalability and stability (due to industrial constraints), but online distilled networks do not necessarily perform better than the basic ensemble strategy. Our goal, on the other hand, is to improve the quality of prediction and do better than basic ensembles.
To this end, we encourage cooperation in a different manner, by encouraging predictions between networks to match in terms of class probabilities conditioned on the prediction not being the ground truth label. While we show that such a strategy alone is useful in general when the number of networks is small, encouraging diversity becomes crucial when this number grows. Finally, we show that distillation can help to reduce the computational overhead at test time.
3 Our Approach
In this section, we present our approach for few-shot classification, starting with preliminary components.
3.1 Preliminaries about mean-centroid classifiers
We now explain how to perform few-shot classification with a fixed feature extractor and a mean centroid classifier.
Few-shot classification with prototype classifier.
During the meta-training stage, we are given a dataset $D_b$ with annotations, which we use to train a prediction function $f_\theta$ represented by a CNN. Formally, after training the CNN on $D_b$, we remove the final prediction layer and use the resulting vector $\hat{f}_\theta(x)$ as a set of visual features for a given image $x$. The parameters $\theta$ represent the weights of the network, which are frozen after this training step.
During meta-testing, we are given a new dataset $D_q = \{(x_i, y_i)\}_{i=1}^{nk}$, where $n$ is the number of new categories and $k$ is the number of available examples for each class. The $(x_i, y_i)$'s represent image-label pairs. Then, we build a mean centroid classifier, leading to the class prototypes
$$c_j = \frac{1}{k} \sum_{i \,:\, y_i = j} \hat{f}_\theta(x_i), \quad j = 1, \dots, n. \qquad (1)$$
Finally, a test sample is assigned to the nearest centroid’s class. Simple mean-centroid classifiers have proven to be effective in the context of few-shot classification [chen19closerfewshot, mensink2013distance, snell2017prototypical], which is confirmed in the following experiment.
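The mean-centroid classifier described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the toy 2-D "features", and the use of the Euclidean distance are assumptions made for the example.

```python
import numpy as np

def class_prototypes(features, labels, n_classes):
    """Mean feature vector of the support examples of each class."""
    return np.stack([features[labels == j].mean(axis=0) for j in range(n_classes)])

def nearest_centroid(query_features, prototypes):
    """Assign each query to the class of its closest prototype (Euclidean distance)."""
    # Pairwise squared distances, shape (n_queries, n_classes).
    diffs = query_features[:, None, :] - prototypes[None, :, :]
    return np.argmin((diffs ** 2).sum(axis=-1), axis=1)

# Toy 2-way, 2-shot task with hand-made 2-D features (hypothetical data).
support = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
support_labels = np.array([0, 0, 1, 1])
prototypes = class_prototypes(support, support_labels, n_classes=2)
predictions = nearest_centroid(np.array([[1.0, 1.0], [9.0, 1.0]]), prototypes)
```

In practice the features would come from the frozen network with its prediction layer removed; here they are plain arrays so that the example stays self-contained.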
Motivation for mean-centroid classifier.
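Consider a parametrized version of (1), in which each prototype is a weighted combination of the features of the available examples:
$$c_j = \sum_{i=1}^{nk} \alpha_i^j \, \hat{f}_\theta(x_i), \quad j = 1, \dots, n, \qquad (2)$$
where the weights $\alpha_i^j$ can be learned with gradient descent by maximizing the likelihood of the probabilistic model
$$p(y = l \mid x) = \frac{\exp\big(-d(\hat{f}_\theta(x), c_l)\big)}{\sum_{j=1}^{n} \exp\big(-d(\hat{f}_\theta(x), c_j)\big)}, \qquad (3)$$
where $\hat{f}_\theta(x)$ denotes the features of an image $x$ and $d(\cdot, \cdot)$ is a distance function, such as Euclidean distance or negative cosine similarity. Since the coefficients $\alpha_i^j$ are learned from data and not set arbitrarily to $1/k$ as in (1), one would potentially expect this method to produce better classifiers if appropriately regularized. When we run the evaluation of the aforementioned classifiers on 1000 5-shot learning tasks sampled from MiniImageNet-test (see experimental section for details about this dataset), we get similar results on average for (1) and for (2), confirming that learning meaningful parameters in this very-low-sample regime is difficult.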
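As a rough illustration of the probabilistic model above, the sketch below turns distances to the prototypes into class probabilities with a softmax over negative distances. It assumes the negative cosine similarity as distance and a rescaling factor of 10, as stated in the experimental setup; the function names and toy vectors are hypothetical.

```python
import numpy as np

def negative_cosine(x, prototypes):
    """Negative cosine similarity between one feature vector and each prototype."""
    x = x / np.linalg.norm(x)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return -(p @ x)

def class_probabilities(x, prototypes, scale=10.0):
    """Softmax over negative (rescaled) distances to the prototypes."""
    logits = -scale * negative_cosine(x, prototypes)
    logits = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

# Two orthogonal prototypes and a query that leans towards the first one.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = class_probabilities(np.array([2.0, 1.0]), prototypes)
```

The rescaling factor controls how peaked the resulting distribution is: larger values push the probabilities closer to a hard nearest-centroid assignment.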
3.2 Learning ensembles of deep networks
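During meta-training, one needs to minimize the following loss function over a training set $\{(x_i, y_i)\}_{i=1}^{m}$:
$$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_i, \sigma(f_\theta(x_i))\big) + \lambda \|\theta\|_2^2, \qquad (4)$$
where $f_\theta$ is a CNN as before. The cost function $\ell(\cdot, \cdot)$ is the cross-entropy between ground-truth labels and predicted class probabilities $p = \sigma(f_\theta(x))$, where $\sigma$ is the normalized exponential function, and $\lambda$ is a weight decay parameter.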
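When training an ensemble of $K$ networks $f_{\theta_j}$ independently, one would solve (4) for each network separately. While these terms may look identical, solutions provided by deep neural networks will typically differ when trained with different initializations and random seeds, making ensemble methods appealing in this context.
In this paper, we are interested in ensembles of $K$ networks, but we also want to model relationships between their members; this may be achieved by considering a pairwise penalty function $\psi$, leading to the joint formulation:
$$L(\boldsymbol{\theta}) = \sum_{j=1}^{K} \Big( \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_i, \sigma(f_{\theta_j}(x_i))\big) + \lambda \|\theta_j\|_2^2 \Big) + \frac{\gamma}{m (K-1)} \sum_{i=1}^{m} \sum_{j \neq l} \psi\big(y_i, f_{\theta_j}(x_i), f_{\theta_l}(x_i)\big), \qquad (5)$$
where $\ell$, $\sigma$, and $\lambda$ are the cross-entropy, the normalized exponential function, and the weight decay parameter of (4), and $\boldsymbol{\theta}$ is the vector obtained by concatenating all the parameters $\theta_j$. By carefully designing the function $\psi$ and appropriately setting the parameter $\gamma$, it is possible to achieve desirable properties of the ensemble, such as diversity of predictions or collaboration during training.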
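The structure of the joint formulation (individual losses plus a weighted pairwise penalty) can be sketched as follows. This is only an illustration under stated assumptions: the inputs are taken to be already-averaged per-network losses and pairwise penalties, and the normalization by the number of partner networks is an assumption of the sketch, not something confirmed by the text.

```python
import numpy as np

def joint_loss(per_network_losses, pairwise_penalty, gamma):
    """Sum of individual losses plus a weighted penalty over ordered pairs of networks.

    per_network_losses: length-K sequence, one (regularized) loss per network.
    pairwise_penalty:   K x K array; entry (j, l) is the penalty between
                        networks j and l, averaged over the training examples.
    gamma:              weight of the pairwise term.
    """
    K = len(per_network_losses)
    # Only pairs of distinct networks contribute, so the diagonal is removed.
    off_diagonal = pairwise_penalty.sum() - np.trace(pairwise_penalty)
    # Assumption: divide by (K - 1) so the penalty scale does not grow with K.
    return float(np.sum(per_network_losses) + gamma * off_diagonal / (K - 1))

# Hypothetical values for an ensemble of three networks.
losses = [1.0, 2.0, 3.0]
penalty = np.full((3, 3), 0.5)
total = joint_loss(losses, penalty, gamma=2.0)
independent = joint_loss(losses, penalty, gamma=0.0)
```

Setting the penalty weight to zero recovers independent training, which is the basic ensemble baseline.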
3.3 Ensembles with diversity and cooperation
To reduce the high variance of few-shot learning classifiers, we use ensemble methods trained with a particular interaction function $\psi$, as in (5). Then, once the network parameters $\theta_j$ have been learned during meta-training, classification in meta-testing is performed by considering a collection of mean-centroid classifiers, one per network, associated to the basic probabilistic model presented in Eq. (3). Given a test image, the class probabilities are averaged over the networks. Such a strategy was found to perform empirically better than a voting scheme.
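To make the difference between the two aggregation schemes concrete, the sketch below compares probability averaging with majority voting on a hand-made example where they disagree. The numbers are hypothetical and only illustrate the mechanics; they say nothing about which scheme is more accurate in general.

```python
import numpy as np

def average_probabilities(probs):
    """probs has shape (n_networks, n_classes); predict from the mean distribution."""
    return int(np.argmax(probs.mean(axis=0)))

def majority_vote(probs):
    """Each network votes for its most likely class; the most voted class wins."""
    votes = np.argmax(probs, axis=1)
    return int(np.bincount(votes, minlength=probs.shape[1]).argmax())

# One confident network and two hesitant ones (hypothetical probabilities).
probs = np.array([
    [0.90, 0.10, 0.00],
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
])
averaged_prediction = average_probabilities(probs)
voted_prediction = majority_vote(probs)
```

Averaging keeps the confidence of each network, whereas voting discards it, which is why a single confident network can outweigh two hesitant ones here.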
As we show in the experimental section, the choice of pairwise relationship function significantly influences the quality of the ensemble. Here, we describe three different strategies, which all provide benefits in different regimes, starting by a criterion encouraging diversity of predictions.
One way to encourage diversity consists of introducing randomization in the learning procedure, e.g., by using data augmentation [breiman1996heuristics, friedman2001elements] or various initializations. Here, we also evaluate the effect of an interaction function that acts directly on the network predictions. Given an image $x$, two models parametrized by $\theta_i$ and $\theta_j$ respectively lead to class probabilities $p_i = \sigma(f_{\theta_i}(x))$ and $p_j = \sigma(f_{\theta_j}(x))$, with $\sigma$ the softmax function. During training, $p_i$ and $p_j$ are naturally encouraged to be close to the assignment vector $e_y$ in $\{0, 1\}^d$ with a single non-zero entry at position $y$, where $y$ is the class label associated to $x$ and $d$ is the number of classes.
From [hinton2015distilling], we know that even though only the largest entry of $p_i$ or $p_j$ is used to make predictions, other entries (typically not corresponding to the ground truth label $y$) carry important information about the network. It then becomes natural to consider the probabilities $\hat{p}_i$ and $\hat{p}_j$ conditioned on not being the ground truth label. Formally, these are obtained by setting to zero the entry $y$ in $p_i$ and $p_j$ and renormalizing the corresponding vectors such that they sum to one. Then, we consider the following diversity penalty
$$\phi(\hat{p}_i, \hat{p}_j) = \cos(\hat{p}_i, \hat{p}_j). \qquad (6)$$
When combined with the loss function, the resulting formulation encourages the networks to make the right prediction according to the ground-truth label, but then they are also encouraged to make different second-best, third-best, and so on, choice predictions (see Figure 1). This penalty turns out to be particularly effective when the number of networks is large, as shown in the experimental section. It typically worsens the performance of individual classifiers on average, but makes the ensemble prediction more accurate.
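The diversity penalty can be sketched as follows: zero out the ground-truth entry, renormalize, and compare two networks with the cosine similarity. The function names and the small `eps` guard against division by zero are assumptions of this sketch.

```python
import numpy as np

def conditional_non_gt(p, y, eps=1e-12):
    """Probabilities conditioned on the class not being the ground-truth label y."""
    q = np.array(p, dtype=float)  # copy, so the input is left untouched
    q[y] = 0.0
    return q / max(q.sum(), eps)

def cosine_penalty(p_i, p_j, y, eps=1e-12):
    """Cosine similarity between the conditional probabilities of two networks."""
    a, b = conditional_non_gt(p_i, y), conditional_non_gt(p_j, y)
    return float(a @ b / max(np.linalg.norm(a) * np.linalg.norm(b), eps))

# Both networks predict class 0 correctly but disagree on the runner-up.
diverse = cosine_penalty([0.7, 0.3, 0.0], [0.7, 0.0, 0.3], y=0)
# Identical predictions give the maximal penalty.
identical = cosine_penalty([0.7, 0.2, 0.1], [0.7, 0.2, 0.1], y=0)
conditional = conditional_non_gt([0.7, 0.2, 0.1], y=0)
```

Minimizing this quantity pushes the two networks towards different second-best choices while leaving the probability assigned to the correct class to the classification loss.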
Apparently opposite to the previous principle, encouraging the conditional probabilities to be similar—though with a different metric—may also improve the quality of prediction by allowing the networks to cooperate for better learning. Our experiments show that such a principle alone may be effective, but it appears to be mostly useful when the number of training networks is small, which suggests that there is a trade-off between cooperation and diversity that needs to be found.
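Specifically, our experiments show that using the negative cosine (in other words, the opposite of (6)) is ineffective. However, a penalty such as the symmetrized KL-divergence between the conditional probabilities $\hat{p}_i$ and $\hat{p}_j$ of two networks turned out to provide the desired effect:
$$\phi(\hat{p}_i, \hat{p}_j) = \frac{1}{2} \Big( \mathrm{KL}(\hat{p}_i \,\|\, \hat{p}_j) + \mathrm{KL}(\hat{p}_j \,\|\, \hat{p}_i) \Big). \qquad (7)$$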
By using this penalty, we managed to obtain more stable and faster training, resulting in better performing individual networks, but also—perhaps surprisingly—a better ensemble. Unfortunately, we also observed that the gain of ensembling diminishes with the number of networks in the ensemble since the individual members become too similar.
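A symmetrized KL-divergence between two probability vectors can be sketched as below. The factor 1/2 and the clipping constant `eps` (which avoids taking the logarithm of zero) are choices made for this illustration.

```python
import numpy as np

def symmetrized_kl(a, b, eps=1e-12):
    """0.5 * (KL(a || b) + KL(b || a)) for two discrete distributions."""
    a = np.clip(np.asarray(a, dtype=float), eps, None)
    b = np.clip(np.asarray(b, dtype=float), eps, None)
    a, b = a / a.sum(), b / b.sum()  # renormalize after clipping
    kl_ab = np.sum(a * np.log(a / b))
    kl_ba = np.sum(b * np.log(b / a))
    return float(0.5 * (kl_ab + kl_ba))

# Hypothetical conditional probabilities of two networks.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
same = symmetrized_kl(p, p)
forward = symmetrized_kl(p, q)
backward = symmetrized_kl(q, p)
```

Used as a penalty to be minimized, this quantity is zero only when the two networks agree, which is what drives cooperation.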
Robustness and cooperation.
Given experiments conducted with the two previous penalties, a trade-off between cooperation and diversity seems to correspond to two regimes (low vs. high number of networks). This motivated us to develop an approach designed to achieve the best trade-off. When considering the cooperation penalty (7), we try to increase diversity of prediction by several additional means. i) We randomly drop some networks from the ensemble at each training iteration, which causes the networks to learn on different data streams and reduces the speed of knowledge propagation. ii) We introduce Dropout within each network to increase randomization. iii) We feed each network with a different (crop, color) transformation of the same image, which makes the ensemble more robust to input image transformations. Overall, this strategy was found to perform best in most scenarios (see Figure 2).
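The first of these mechanisms, randomly dropping networks at each training iteration, can be sketched as follows. The drop probability of 0.2 is the value reported in the experimental section; the function name and the rule of always keeping at least one network are assumptions of this sketch.

```python
import numpy as np

def sample_active_networks(n_networks, drop_prob, rng):
    """Boolean mask of the networks that take part in the current iteration."""
    keep = rng.random(n_networks) >= drop_prob
    if not keep.any():
        # Assumption: never train on an empty ensemble, keep one network at random.
        keep[rng.integers(n_networks)] = True
    return keep

rng = np.random.default_rng(0)
# Simulate 1000 training iterations for an ensemble of 20 networks.
masks = np.array([sample_active_networks(20, 0.2, rng) for _ in range(1000)])
fraction_active = masks.mean()
```

Only the active networks would receive the current mini-batch and contribute to the pairwise penalty, so each network ends up seeing a different stream of data.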
3.4 Ensemble distillation
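As with most ensemble methods, our ensemble strategy introduces a significant computational overhead at training time. To remove the overhead at test time, we use a variant of knowledge distillation [hinton2015distilling] to compress the ensemble into a single network $f_w$. Given the meta-training dataset, we consider the following cost function on example $(x, y)$:
$$\ell(x, y) = (1 - \alpha)\, \ell_c\big(e_y, \sigma(f_w(x))\big) + \alpha\, \ell_c\Big( \frac{1}{K} \sum_{k=1}^{K} \sigma\big(f_{\theta_k}(x) / T\big),\ \sigma\big(f_w(x) / T\big) \Big), \qquad (8)$$
where $\ell_c$ is the cross-entropy, $e_y$ is a one-hot embedding of the true label $y$, $\sigma$ is the softmax function, and $f_{\theta_1}, \dots, f_{\theta_K}$ are the networks of the ensemble. The second term performs distillation with parameter $T$ (see [hinton2015distilling]), and $\alpha$ balances the two terms. It encourages the single model to be similar to the average output of the ensemble. In our experiments, we are able to obtain a model with performance relatively close to that of the ensemble (see Section 4).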
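A distillation cost of this kind can be sketched as below, following the general recipe of [hinton2015distilling]: a cross-entropy with the true label plus a cross-entropy with the temperature-smoothed average prediction of the ensemble. The weighting by `alpha` and `1 - alpha`, the function names, and the toy logits are assumptions of this sketch rather than the exact form used in the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T along the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return float(-np.sum(target * np.log(pred + eps)))

def distillation_loss(student_logits, ensemble_logits, y, alpha, T):
    """(1 - alpha) * CE(one-hot, student) + alpha * CE(ensemble average, student)."""
    one_hot = np.eye(len(student_logits))[y]
    hard = cross_entropy(one_hot, softmax(student_logits))
    teacher = softmax(ensemble_logits, T).mean(axis=0)  # average over the networks
    soft = cross_entropy(teacher, softmax(student_logits, T))
    return (1.0 - alpha) * hard + alpha * soft

# Hypothetical logits of a two-network ensemble on a 3-class problem.
ensemble = np.array([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
label_only = distillation_loss([2.0, 0.0, 0.0], ensemble, y=0, alpha=0.0, T=2.0)
matching = distillation_loss([2.0, 0.0, 0.0], ensemble, y=0, alpha=1.0, T=2.0)
mismatching = distillation_loss([0.0, 2.0, 0.0], ensemble, y=0, alpha=1.0, T=2.0)
```

With `alpha=0` the cost reduces to the usual cross-entropy on the label; with `alpha=1` only the agreement with the ensemble matters, and no label is needed, which is what makes unlabeled data usable for the distillation term.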
Modeling out-of-distribution behavior.
When distillation is performed on the meta-training dataset, the network mimics the behavior of the ensemble on a specific data distribution. However, new categories are introduced at test time. Therefore, we also tried distillation by using additional unannotated data, which yields slightly better performance.
4 Experiments
We now present experiments to study the effect of cooperation and diversity for ensemble methods, and start with experimental and implementation details.
4.1 Experimental Setup
We use MiniImageNet [ravioptimization] which is derived from the original ImageNet [imagenet] dataset and Caltech-UCSD Birds (CUB) 200-2011 [WahCUB_200_2011]. MiniImageNet consists of 100 categories—64 for training, 16 for validation and 20 for testing—with 600 images each. The CUB dataset consists of 11,788 images of birds of more than 200 species. We adopt train, val, and test splits from [ye2018learning], which were originally created by randomly splitting all 200 species in 100 for training, 50 for validation, and 50 for testing.
In few-shot classification, the test set is used to sample 5-way classification problems, where only $k$ examples of each category are provided for training and 15 for evaluation. We follow [finn2017model, gidaris2018dynamic, oreshkin2018tadam, qiao2018few, ravioptimization] and test our algorithms for $k = 1$ and $k = 5$, with the number of classes per task set to 5. Each time, classes and corresponding train/test examples are sampled at random. For all our experiments we report the mean accuracy (in %) over the sampled tasks, together with the associated confidence interval.
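The evaluation protocol can be sketched as follows: sample a task, then summarize the accuracies of many tasks by a mean and a confidence interval. The 95% level and the normal approximation (1.96 standard errors) are assumptions of this sketch, as are the function names and the synthetic label array.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Pick n_way classes, then k_shot support and n_query query indices per class."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])
        query.extend(idx[k_shot:k_shot + n_query])
    return classes, np.array(support), np.array(query)

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a 95% interval (normal approximation)."""
    acc = np.asarray(accuracies, dtype=float)
    return acc.mean(), 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))

rng = np.random.default_rng(0)
# Synthetic test split: 20 classes with 600 images each.
labels = np.repeat(np.arange(20), 600)
classes, support_idx, query_idx = sample_episode(labels, n_way=5, k_shot=5, n_query=15, rng=rng)
mean_acc, half_width = mean_and_ci95([0.5, 0.7])
```

A full evaluation would repeat the sampling many times, classify the query images of each task from its support images, and pass the resulting list of accuracies to `mean_and_ci95`.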
For all experiments, we use the Adam optimizer [adam]; the initial learning rate is decreased by a factor of 10 once during training, when no improvement in validation accuracy is observed for $p$ consecutive epochs. The value of $p$ depends on the dataset; it is set to 20 for the CUB dataset. When distilling an ensemble into one network, $p$ is doubled. We use random crops and color augmentation during training, as well as weight decay. All experiments are conducted with the ResNet18 architecture [resnet], which allows us to train our ensembles of networks on a single GPU. Input images are re-scaled to a fixed size and organized in mini-batches of size 16. Validation accuracy is computed by running 5-shot evaluation on the validation set. During the meta-testing stage, we take central crops from images and feed them to the feature extractor. No other preprocessing is used at test time. When building a mean centroid classifier, the distance in (3) is computed as the negative cosine similarity [snell2017prototypical], which is rescaled by a factor of 10. For reproducibility purposes, our implementation will be made publicly available.
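The learning-rate rule described above (a single decrease by a factor of 10 after a number of epochs without validation improvement) can be sketched as a small helper. The class name, the initial learning rate of 1.0, and the patience of 2 used in the example are hypothetical.

```python
class DecayOnceOnPlateau:
    """Divide the learning rate by 10, a single time, after `patience`
    consecutive epochs without improvement in validation accuracy."""

    def __init__(self, lr, patience):
        self.lr = lr
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0
        self.decayed = False

    def step(self, val_acc):
        """Record the validation accuracy of one epoch and return the learning rate."""
        if val_acc > self.best:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        if not self.decayed and self.bad_epochs >= self.patience:
            self.lr /= 10.0
            self.decayed = True
        return self.lr

# Hypothetical validation curve that stops improving after the second epoch.
scheduler = DecayOnceOnPlateau(lr=1.0, patience=2)
learning_rates = [scheduler.step(acc) for acc in [0.5, 0.6, 0.6, 0.55, 0.5, 0.4]]
```

Because the decrease happens only once, later plateaus leave the learning rate unchanged.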
4.2 Ensembles for Few-Shot Classification
In this section, we study the effect of ensemble training with pairwise interaction terms that encourage cooperation or diversity. For that purpose, we analyze the link between the size of ensembles and their 1- and 5-shot classification performance on the MiniImageNet and CUB datasets.
Details about the three strategies.
When models are trained jointly, the data stream is shared across all networks and weight updates happen simultaneously. This is achieved by placing all models on the same GPU and optimizing the loss (5). When training a diverse ensemble, we use the cosine function (6) and select the penalty weight $\gamma$ that performs best on the validation set among a range of tested values, for two ensemble sizes. This value is then kept for the other ensemble sizes. To enforce cooperation between networks, we use the symmetrized KL function (7) and select the parameter $\gamma$ in the same manner. Finally, the robust ensemble strategy is trained with the cooperation relationship penalty and the same parameter $\gamma$, with three additions: we use Dropout with probability 0.1 before the last layer; each network is dropped from the ensemble with probability 0.2 at every iteration; and different networks receive different transformations of the same image, i.e., different random crops and color augmentations.
Tables 1 and 2 summarize the few-shot classification accuracies of ensembles trained with our strategies and compare with basic ensembles. On the MiniImageNet dataset, the results for 1- and 5-shot classification are consistent with each other. Training with cooperation allows smaller ensembles to perform better, which leads to higher individual accuracy of the networks from the ensemble, as seen in Figure 2. However, for larger ensembles, cooperation is less effective, as opposed to the diversity strategy, which benefits from a larger number of networks. As we can see from Figure 2, individual members of the ensemble become worse, but the ensemble accuracy improves substantially. Finally, the robust strategy seems to perform best for all ensemble sizes in almost all settings. The situation for the CUB dataset is similar, although we notice that robust ensembles perform similarly to the diversity strategy for the largest ensembles.
Table 1: Few-shot classification accuracy (in %, with confidence intervals) on MiniImageNet. Columns correspond to ensembles of increasing size, from a single network (first column) to 20 networks (last column); the distilled variants have no single-network entry.
5-shot
|Independent||77.28 ± 0.46||78.27 ± 0.45||79.38 ± 0.43||80.02 ± 0.43||80.30 ± 0.43||80.57 ± 0.42|
|Diversity||77.28 ± 0.46||78.34 ± 0.46||79.18 ± 0.43||79.89 ± 0.43||80.82 ± 0.42||81.18 ± 0.42|
|Cooperation||77.28 ± 0.46||78.67 ± 0.46||80.20 ± 0.42||80.60 ± 0.43||80.72 ± 0.42||80.80 ± 0.42|
|Robust||77.28 ± 0.46||78.71 ± 0.45||80.26 ± 0.43||81.00 ± 0.42||81.22 ± 0.43||81.59 ± 0.42|
|Robust-dist||-||79.44 ± 0.44||79.84 ± 0.44||80.01 ± 0.42||80.25 ± 0.44||80.63 ± 0.42|
|Robust-dist++||-||79.16 ± 0.46||80.00 ± 0.44||80.25 ± 0.42||80.35 ± 0.44||81.19 ± 0.43|
1-shot
|Independent||58.71 ± 0.62||60.04 ± 0.60||60.83 ± 0.63||61.34 ± 0.61||61.93 ± 0.61||62.06 ± 0.61|
|Diversity||58.71 ± 0.63||59.95 ± 0.61||61.27 ± 0.62||61.43 ± 0.61||62.23 ± 0.61||62.47 ± 0.62|
|Cooperation||58.71 ± 0.62||60.20 ± 0.61||61.46 ± 0.61||61.61 ± 0.61||62.06 ± 0.61||62.12 ± 0.62|
|Robust||58.71 ± 0.62||60.91 ± 0.62||62.36 ± 0.60||62.70 ± 0.61||62.97 ± 0.62||63.95 ± 0.61|
|Robust-dist||-||62.33 ± 0.62||62.64 ± 0.60||63.14 ± 0.61||63.01 ± 0.62||63.06 ± 0.61|
|Robust-dist++||-||62.07 ± 0.62||62.81 ± 0.60||63.39 ± 0.61||63.20 ± 0.62||63.73 ± 0.62|
Table 2: Few-shot classification accuracy (in %, with confidence intervals) on CUB. Columns correspond to ensembles of increasing size, from a single network (first column) to the largest ensemble (last column); the distilled variants have no single-network entry.
5-shot
|Independent||79.47 ± 0.49||81.34 ± 0.46||82.57 ± 0.46||83.16 ± 0.45||83.80 ± 0.45||83.95 ± 0.46|
|Diversity||79.47 ± 0.49||81.09 ± 0.45||82.23 ± 0.46||82.91 ± 0.46||84.30 ± 0.44||85.20 ± 0.43|
|Cooperation||79.47 ± 0.49||81.69 ± 0.46||82.95 ± 0.47||83.43 ± 0.47||84.01 ± 0.44||84.26 ± 0.44|
|Robust||79.47 ± 0.49||82.90 ± 0.46||83.36 ± 0.46||83.62 ± 0.45||84.47 ± 0.46||84.62 ± 0.44|
|Robust-dist||-||82.72 ± 0.47||82.95 ± 0.46||83.27 ± 0.46||83.61 ± 0.46||83.57 ± 0.45|
|Robust-dist++||-||82.53 ± 0.48||83.04 ± 0.45||83.37 ± 0.46||83.22 ± 0.46||83.21 ± 0.44|
1-shot
|Independent||64.25 ± 0.73||66.60 ± 0.72||67.64 ± 0.71||68.07 ± 0.70||68.93 ± 0.70||69.64 ± 0.69|
|Diversity||64.25 ± 0.73||65.99 ± 0.71||66.71 ± 0.72||68.19 ± 0.71||69.35 ± 0.70||70.07 ± 0.70|
|Cooperation||64.25 ± 0.73||67.21 ± 0.71||67.93 ± 0.70||68.22 ± 0.70||68.69 ± 0.70||68.80 ± 0.68|
|Robust||64.25 ± 0.73||67.33 ± 0.71||68.01 ± 0.72||68.53 ± 0.70||68.59 ± 0.70||69.47 ± 0.69|
|Robust-dist||-||67.47 ± 0.71||67.29 ± 0.72||68.09 ± 0.70||68.71 ± 0.71||68.77 ± 0.71|
|Robust-dist++||-||67.01 ± 0.74||67.62 ± 0.72||68.68 ± 0.71||68.38 ± 0.70||68.68 ± 0.69|
4.3 Distilling an ensemble into a single network
We distill robust ensembles of all sizes to study knowledge transferability with growing ensemble size. To do so, we use the meta-training dataset and optimize the distillation loss (8). For the strategy using external data, we randomly add at each iteration 8 images (without annotations) from the COCO [coco] dataset to the 16 annotated samples from the meta-training data. Those images contribute only to the distillation part of the loss (8). Tables 1 and 2 display model accuracies for both MiniImageNet and CUB datasets. For 5-shot classification on MiniImageNet, the difference between an ensemble and its distilled version is rather low (around 1%), while adding extra non-annotated data helps reduce this gap. Surprisingly, 1-shot classification accuracy is slightly higher for distilled models than for their corresponding full ensembles. On the CUB dataset, distilled models stop improving beyond a certain ensemble size, even though the performance of full ensembles keeps growing. This seems to indicate that the capacity of the single network may have been reached, which suggests using a more complex architecture here. Consistent with this hypothesis, adding extra data is not as helpful as for MiniImageNet, most likely because the data distributions of COCO and CUB are more different.
In Table 3, we also compare the performance of our distilled networks with other baselines from the literature, including current state-of-the-art meta-learning approaches, showing that our approach does significantly better on the MiniImageNet dataset.
Table 3: Comparison with other methods on MiniImageNet (classification accuracy in %, with confidence intervals).
|Method||Backbone||5-shot||1-shot|
|Cosine + Attention [gidaris2018dynamic]||ResNet18||73.00 ± 0.64||56.20 ± 0.86|
|PPA [qiao2018few]||ResNet18||73.74 ± 0.19||59.60 ± 0.41|
|TADAM [oreshkin2018tadam]||ResNet18||76.70 ± 0.30||58.50 ± 0.30|
|LEO [rusu2018meta]||WideResNet||77.59 ± 0.12||61.76 ± 0.08|
|FEAT [ye2018learning]||WideResNet||78.32 ± 0.16||61.72 ± 0.11|
|Linear Classifier [chen19closerfewshot]||ResNet18||74.27 ± 0.63||51.75 ± 0.80|
|Cosine Classifier [chen19closerfewshot]||ResNet18||75.68 ± 0.63||51.87 ± 0.77|
|Robust 20-dist (ours)||ResNet18||80.63 ± 0.43||63.06 ± 0.63|
|Robust 20-dist++ (ours)||ResNet18||81.19 ± 0.43||63.73 ± 0.62|
|Robust 20 Full||ResNet18||81.59 ± 0.42||63.95 ± 0.61|
4.4 Study of relationship penalties
There are many possible ways to model relationship between the members of an ensemble. In this subsection, we study and discuss such particular choices.
Input to relationship function.
As noted by [hinton2015distilling], class probabilities obtained by the softmax layer of a network seem to carry a lot of information and are useful for distillation. However, after meta-training, such probabilities are often close to binary vectors with a dominant value associated to the ground-truth label. To make small values more noticeable, distillation uses a temperature parameter, as in (8). Given a class probability vector computed by a network, we experimented with a strategy that introduces new, temperature-smoothed probabilities, where the contributions of non-ground-truth values are emphasized. When used within our diversity (6) or cooperation (7) penalties, we however did not see any improvement over the basic ensemble method. Instead, we found that computing the class probabilities conditioned on not being the ground truth label, as explained in Section 3.3, would perform much better.
This is illustrated by the following experiment with two ensembles of the same size, where we compared the two strategies. In the first one, we enforce similarity on the full probability vectors, computed with a temperature-scaled softmax following [anil2018large]; in the second one, we use the conditionally non-ground-truth probabilities defined in Section 3.3. When using the cooperation training formulation, the second strategy turns out to perform about 1% better than the first one (80.60% vs. 79.79%), when tested on MiniImageNet. Similar observations have been made using the diversity criterion. In comparison, the basic ensemble method without interactions achieves about 80.0% (see Table 1).
Choice of relationship function.
In principle, any similarity measure taken with the negative sign should potentially enforce cooperation, when used as a penalty. For promoting diversity, the same should hold in principle but with the opposite sign as well.
Here, we show that, in fact, selecting the right criterion for comparing probability vectors (cosine similarity, L2 distance, symmetrized KL divergence) is crucial, depending on the desired effect (cooperation or diversity). In Table 4, we perform such a comparison for an ensemble of fixed size on the MiniImageNet dataset for a 5-shot classification task, when plugging each of the above functions into formulation (5), with a specific sign. The penalty weight for each experiment is chosen such that the performance on the validation set is maximized.
When looking for diversity, the cosine similarity performs slightly better than the negative L2 distance, although the accuracies are within error bars. Using the negative symmetrized KL divergence with various penalty weights was either not distinguishable from independent training or was hurting the performance for larger values of the weight (not reported in the table). As for cooperation, the positive symmetrized KL divergence gives better results than the L2 distance or the negative cosine similarity. We believe that this behavior is due to an important difference in the way these functions compare small values in probability vectors. While negative cosine or L2 losses would penalize heavily the largest difference, the symmetrized KL divergence concentrates on values that are close to 0 in one vector and are greater in the second one.
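Table 4: Comparison of relationship functions used as penalties for cooperation or diversity on MiniImageNet (5-shot accuracy in %). The sign in parentheses is the sign with which the function of each column enters the penalty.
|Purpose (sign)||L2||$-\cos$||symmetrized KL|
|Cooperation (+)||80.14 ± 0.43||80.29 ± 0.44||80.72 ± 0.42|
|Diversity (-)||80.54 ± 0.44||80.82 ± 0.42||79.81 ± 0.43|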
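The difference in how these functions treat small probabilities can be seen on a toy example. In the sketch below, two perturbations of the same vector are at the same L2 distance from it, but only one of them moves mass onto an entry that was zero; the symmetrized KL divergence penalizes that one far more. The vectors, the factor 1/2, and the clipping constant `eps` are assumptions made for the illustration.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean distance between two probability vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def symmetrized_kl(a, b, eps=1e-12):
    """0.5 * (KL(a || b) + KL(b || a)), with clipping to avoid log(0)."""
    a = np.clip(np.asarray(a, dtype=float), eps, None)
    b = np.clip(np.asarray(b, dtype=float), eps, None)
    a, b = a / a.sum(), b / b.sum()
    return float(0.5 * (np.sum(a * np.log(a / b)) + np.sum(b * np.log(b / a))))

reference = [0.5, 0.5, 0.0]
moves_mass_to_zero_entry = [0.5, 0.4, 0.1]   # third entry goes from 0 to 0.1
shifts_large_entries = [0.6, 0.4, 0.0]       # third entry stays at 0

l2_first = l2_distance(reference, moves_mass_to_zero_entry)
l2_second = l2_distance(reference, shifts_large_entries)
kl_first = symmetrized_kl(reference, moves_mass_to_zero_entry)
kl_second = symmetrized_kl(reference, shifts_large_entries)
```

Both perturbations look identical to the L2 distance, whereas the symmetrized KL divergence singles out the disagreement on the near-zero entry.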
4.5 Performance under domain shift
Finally, we evaluate the performance of ensemble methods under domain shift. We proceed by meta-training the models on the MiniImageNet training set and evaluating them on the CUB-test set. This setting was first proposed by [chen19closerfewshot] and aims at evaluating the ability of algorithms to adapt when the difference between training and testing distributions is large. To compare to the results reported in the original work, we adopt their CUB test split. Table 5 compares our results to the ones listed in [chen19closerfewshot]. We can see that neither the full robust ensemble nor its distilled version is able to do better than training a linear classifier on top of a frozen network. Yet, both do significantly better than distance-based approaches (denoted by cosine classifier in the table). However, if a diverse ensemble is used, it achieves the best accuracy. This is not surprising and highlights the importance of having diverse models when ensembling weak classifiers.
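Table 5: Classification accuracy (in %, with confidence intervals) under domain shift, when meta-training on MiniImageNet and evaluating on CUB-test.
|Method||Accuracy|
|MatchingNet [vinyals2016matching]||53.07 ± 0.74|
|MAML [finn2017model]||51.34 ± 0.72|
|ProtoNet [snell2017prototypical]||62.02 ± 0.70|
|Linear Classifier [chen19closerfewshot]||65.57 ± 0.70|
|Cosine Classifier [chen19closerfewshot]||62.04 ± 0.76|
|Robust 20-dist++ (ours)||64.23 ± 0.58|
|Robust 20 Full (ours)||65.04 ± 0.57|
|Diverse 20 Full (ours)||66.17 ± 0.55|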
5 Conclusions
In this paper, we show that distance-based classifiers for few-shot learning suffer from high variance, which can be significantly reduced by using an ensemble of classifiers. Unlike traditional ensembling paradigms where diversity of predictions is encouraged by various randomization and data augmentation techniques, we show that encouraging the networks to cooperate during training is also important.
A single network obtained by distillation (with no computational overhead at test time) achieves state-of-the-art performance for few-shot learning, without relying on the meta-learning paradigm. While such a result may sound negative for meta-learning approaches, it may simply mean that a lot of work remains to be done in this area to truly learn how to learn or to adapt.
This work was supported by a grant from ANR (MACARON project under grant number ANR-14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project), the ERC advanced grant ALLEGRO and gifts from Amazon and Intel.