Learning Generalizable Representations via Diverse Supervision

11/29/2019 ∙ by Ziqi Pang, et al. ∙ Tsinghua University Carnegie Mellon University Peking University 9

The problem of rare category recognition has received a lot of attention recently, with state-of-the-art methods achieving significant improvements. However, we identify two major limitations in the existing literature. First, the benchmarks are constructed by randomly splitting the categories of artificially balanced datasets into frequent (head), and rare (tail) subsets, which results in unrealistic category distributions in both of them. Second, the idea of using external sources of supervision to learn generalizable representations is largely overlooked. In this work, we attempt to address both of these shortcomings by introducing the ADE-FewShot benchmark. It stands upon the ADE dataset for scene parsing that features a realistic, long-tail distribution of categories as well as a diverse set of annotations. We turn it into a realistic few-shot classification benchmark by splitting the object categories into head and tail based on their distribution in the world. We then analyze the effect of applying various supervision sources on representation learning for rare category recognition, and observe significant improvements.




1 Introduction

The world around us is inherently long-tail: a few categories, such as chairs and cars, are dully common, whereas lots of others, such as lotus flowers and ice cream cones, are disappointingly rare. Traditional methods for image recognition [18, 15, 14] used to ignore this observation, training and testing their models on artificially balanced datasets [4, 21]. As a result, the approaches that thrived in curated environments under-performed in more realistic settings [37, 12].

More recently, this issue has been addressed in the few-shot [16, 34, 7, 24], and long-tail learning [38, 41] literature. Although these works have reported significant improvements in modeling rare categories, they are limited in both how they define the experimental protocol and how they approach the problem. Below we describe each of these limitations in more detail, together with our proposed ways to address them.

Figure 1: Our ADE-FewShot benchmark. By converting ADE20K [40] into a classification dataset, we construct a benchmark with a natural long-tailed distribution of objects. To learn generalizable feature representations, diverse sources of supervision are leveraged during the training process, including parts, semantic segmentation maps, and attributes.

First, we would like to point out a major limitation in the way that current few-shot datasets are constructed. Most of the work takes existing, artificially balanced datasets, randomly splits the list of categories into head and tail, and sub-samples training examples for the tail categories. This, however, results in highly unrealistic data distributions, with lots of ice cream cones and very few cars. As attractive as living in such a world might seem, the approaches designed and evaluated on it will crumble when faced with reality. To mitigate this issue, we propose a new few-shot classification benchmark with a natural distribution of rare categories. Instead of collecting such a dataset from scratch, we re-purpose the existing ADE20K dataset [40], originally designed for scene parsing. This dataset was collected by labeling all the objects in a diverse set of images with an open vocabulary, which resulted in a natural long-tail distribution of categories (see Figure 1). We convert it into an image classification dataset by cropping regions around the ground truth object masks, which allows us to easily evaluate existing few-shot learning methods under a realistic data distribution.

Second, we observe that most of the state-of-the-art approaches are focused on either metric learning [16, 34] or meta-learning [7, 24] on the label set of the head categories. However, very recently it has been shown that the performance of these methods can be matched by a simple baseline approach that learns an image representation on the head categories using a cosine classifier [3]. Based on this observation, several works have proposed to use external cues to learn a more generalizable representation on the head. In particular, Dvornik et al. [6] proposed to ensemble several models to ensure diversity of the learned representation, whereas Gidaris et al. [8] achieved the same goal via multi-task learning with a self-supervised objective. In contrast, Tokmakov et al. [33] used supervised cues in the form of discriminative attribute annotations. In this work, we explore additional sources of supervision that can be helpful in learning representations that generalize to rare categories from few examples. These experiments are enabled by the wide variety of annotations available in ADE20K. In particular, we study the effect of localization supervision in the form of object masks and bounding boxes, background segmentation, scene-level labels, and object part annotations (see Figure 1).

Finally, it is natural to ask whether these cues are complementary, and if combining all of them can bring the performance of the tail categories close to that of the head. To answer this question, we study several ways of combining different sources of supervision and discover that cumulative improvements can be achieved to an extent, but further exploration is needed to fully bridge the gap.

To sum up, our main contributions are three-fold. (1) We propose a novel benchmark for evaluating long-tail and few-shot learning in a realistic setting. It is based on the ADE20K dataset for semantic segmentation and features a natural long-tail distribution of object categories. (2) We study the effect of heterogeneous forms of supervision on learning generalizable features and demonstrate that they can indeed significantly improve the model’s performance on rare categories. (3) We analyze the potential of combining several external cues to further boost the few-shot performance of the model, and report some initial encouraging results.

The rest of the paper is organized as follows. We begin by discussing the related work in Section 2. We then provide the details of our proposed dataset for long-tail and few-shot object recognition in Section 3. Next, we define the experimental protocol in Section 4 and report a quantitative and qualitative analysis of the effect of various forms of supervision on modeling the tail in Section 5. Finally, we conclude and outline directions for future work in Section 6.

2 Related Work

Few-shot benchmarks have been traditionally constructed by randomly splitting categories in balanced datasets, such as ImageNet [4], into frequent “head” and rare “tail” parts, and then subsampling examples in the rare categories to imitate data dearth in the real world [24, 13]. As a step in a more realistic direction, Ren et al. [25] proposed to split ImageNet on a super-category level. While these approaches are very practical, they do not properly emulate how the world is organised. We claim that common and rare categories have inherent properties (such as scale, context, or super-category), which are lost in such a random splitting. In this work we propose a new few-shot learning benchmark with a realistic category distribution, which is based on the ADE20K [40] dataset for scene parsing. In addition, this dataset provides a diverse collection of supervision sources, allowing us to study the effect of various cues on few-shot classification performance.

In [39] the authors also proposed a few-shot benchmark with a realistic, long-tailed category distribution. However, their dataset is focused entirely on animal species. In contrast, our benchmark covers a much wider vocabulary of categories. Drawing on the richness of ADE20K, our ADE-FewShot benchmark contains 482 categories in total, covering diverse concepts, both living and non-living, and both indoor and outdoor objects. ADE-FewShot also exhibits a larger head/tail imbalance. The most frequent category has more than 20,000 occurrences, while the rarest category has only 15 instances, which poses a serious challenge from the perspective of long-tail distribution.

Few-shot learning is a classic problem of recognition with only a few training examples [32]. Lake et al. [19] explicitly encode compositionality and causality properties with Bayesian probabilistic programs. Learning then boils down to constructing programs that best explain the observations and can be done efficiently with a single example per category. However, this approach is limited in that the programs have to be manually defined for each new domain.

State-of-the-art methods for few-shot learning can be categorized into the ones based on metric learning [16, 34, 28, 30] — training a network to predict whether two images belong to the same category, and the ones built around the idea of meta-learning [7, 24, 36] — training with a loss that explicitly enforces easy adaptation of the weights to new categories with only a few examples. Separately from these approaches, some work proposes to learn to generate additional examples for unseen categories [35, 13]. Recently, it has been shown that the performance of these complex approaches can be matched by a simple method based on learning a cosine classifier on the base categories [9, 3].

Based on this observation, several works have proposed to use external cues to regularize representation learning on the base set, either in an unsupervised [8, 6] or in a supervised [33] way. In this work, we explore the supervised as well as the self-supervised directions, and study the effect of diverse forms of supervision on the model’s classification performance on rare categories. Very recently, Wertheimer and Hariharan [39] proposed to utilize localization supervision to improve classification performance on rare categories. Notice, however, that they use bounding box labels for rare classes to aid learning a classifier directly. In contrast, we utilize localization supervision to learn a more generalizable representation on the base categories, and only use class labels on the novel classes. Moreover, we explore many more individual forms of supervision, in addition to bounding boxes, as well as their combinations.

Multi-task learning is the problem of learning a single model with multiple loss functions [1]. Most of the work in this domain seeks to maximize the performance of each individual task, by designing a specialized network architecture [17], balancing the weights of the loss functions [27], or introducing adaptive weight sharing between the tasks [29, 2]. In contrast, we seek to combine multiple sources of supervision to regularize the learning of an image classification representation with the goal of improving classification performance on the rare categories. We compare several multi-task learning strategies under this objective, and find that a simple sequential training approach results in the top performance.

3 The ADE-FewShot Benchmark

In this section, we set out to construct a new benchmark for learning to recognize rare categories in the wild. In our dataset, we want to capture several properties of the task that have been largely overlooked in the past. In particular, we want it to have a realistic and diverse category distribution, capturing all the richness and complexity of the visual world. In addition, unlike most of the existing datasets built around object-centered images [24, 13, 25], we want ours to capture the objects with context. At the same time, the dataset should allow evaluating existing few-shot learning methods without significantly modifying the algorithms. Finally, we want to explore the benefits of including as many supervisory signals as possible during representation learning on the frequent categories for improving the classification performance on the rare classes.

We take the following steps to satisfy these objectives. First, we choose the ADE20K [40] dataset as a basis for constructing our few-shot learning benchmark, due to its diversity and the richness of its vocabulary. It contains over 3,000 categories, covering objects, object parts, and background categories. We filter out the classes useful for our task and split them into base (frequent) and novel (rare) ones. Second, ADE20K was originally proposed as a scene parsing dataset with instance segmentation-level annotations. We convert it into a few-shot classification benchmark by cropping regions around ground truth segments. Finally, we capitalize on the rich set of labels in ADE20K, including object and part localization annotations, scene labels, and background labels. We also augment this list by providing attribute and class hierarchy labels for the selected categories. Our benchmark will be released soon. Below we describe each of these steps in more detail.

3.1 Category selection

ADE20K features more than 3,000 categories, covering objects, object parts and background classes, such as sky or grass. The distribution of categories is not artificially balanced, and thus exhibits a high imbalance. In particular, the most frequent category has more than 20,000 instances, while the least frequent one has fewer than 10 (see Figure 1). To better serve the objective of long-tailed and few-shot learning, both of these extremes are contained in our dataset. Below we summarize the filtering steps used to select the categories for our benchmark.

First, we manually split the classes into object, part, and stuff. Since in our benchmark we want to focus on object classification, only the corresponding categories are included in our label set, resulting in 1,971 classes. We keep the part and stuff labels as additional sources of supervision, however, and evaluate their effect on the novel category classification performance in Section 5.2.

Second, we further filter the object categories based on their frequency. In particular, we only keep the classes with at least 15 instances in the dataset, resulting in 482 categories. While this excludes the most challenging classes from the benchmark, we argue that including them would introduce significant noise in the evaluation. Indeed, measuring performance on 2-3 test images is dominated by noise and is not informative of the model’s recognition ability. At the same time, even after this filtering step the list of classes we are left with is sufficiently large and diverse for building a benchmark.

Finally, to mimic the typical setup in few-shot classification benchmarks, we split the categories into base and novel subsets, where the former is used for representation learning and the latter for evaluation. Instead of splitting the categories at random, however, we follow the natural distribution of objects in the world, and select the classes that have more than 100 instances in the dataset as base and the remaining ones as novel. As far as we know, this results in the most realistic few-shot learning benchmark to date, where the natural regularities between frequent and infrequent categories are captured.

Overall, our dataset contains 189 base and 293 novel categories. For each of the categories in the base set, 1/6 of the data is held out for validation. In the novel set, we randomly select 5 instances in each category for training and use the rest for evaluation. We further divide the novel set at random into 100 novel-val and 193 novel-test categories, where the former are used for hyper-parameter selection and the latter for reporting the final performance.
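The selection and splitting rules of this section can be summarized in a short sketch. This is only an illustration under stated assumptions, not the released benchmark code: the function names, the seed handling, and the flat list of per-instance labels used as input are all hypothetical, while the thresholds (at least 15 instances to be kept, more than 100 instances for base, 5 training instances per novel category) are taken from the text.

```python
import random
from collections import Counter

MIN_INSTANCES = 15    # classes with fewer instances are dropped (from the text)
BASE_THRESHOLD = 100  # classes with more than this many instances are base (from the text)
SHOTS = 5             # training instances per novel category (from the text)


def split_categories(instance_labels, num_novel_val, seed=0):
    """Return (base, novel_val, novel_test) category lists.

    `instance_labels` is a hypothetical flat list with one category name
    per object instance in the dataset.
    """
    counts = Counter(instance_labels)
    kept = {c: n for c, n in counts.items() if n >= MIN_INSTANCES}
    base = sorted(c for c, n in kept.items() if n > BASE_THRESHOLD)
    novel = sorted(c for c, n in kept.items() if n <= BASE_THRESHOLD)
    # Novel categories are divided into validation and test parts at random.
    rng = random.Random(seed)
    rng.shuffle(novel)
    return base, sorted(novel[:num_novel_val]), sorted(novel[num_novel_val:])


def split_novel_instances(instance_ids, shots=SHOTS, seed=0):
    """Randomly pick `shots` training instances; the rest are used for evaluation."""
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    return ids[:shots], ids[shots:]
```

With the counts reported above, a call such as `split_categories(labels, num_novel_val=100)` would be expected to yield 189 base, 100 novel-val, and 193 novel-test categories.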

Figure 2: Construction of our ADE-FewShot benchmark from ADE20K. With the segments annotated in ADE20K, we first crop out every object using tight bounding boxes. Then we imitate realistic data distribution by enlarging the boxes with context and applying random jitters to avoid center bias.

3.2 Converting ADE into a Classification Benchmark

Since ADE20K was originally designed for scene parsing, we need to convert it into a classification dataset to serve our needs. Because ADE20K provides instance-level masks for the objects, it can be easily transformed by cropping the regions around the masks and treating them as independent images. However, using tight crops would result in an unrealistic data distribution, since objects typically appear in context. Therefore, we propose to simulate a more realistic distribution by box enlargement and random jitter. In particular, we compute the average context ratio (area of context divided by area of tight bounding box) in the ImageNet [4] dataset and enlarge the original bounding boxes accordingly. We then apply a random shift to the box to avoid center bias (see Figure 2).
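A minimal sketch of the box enlargement and jitter step is given below. It rests on assumptions that the text does not fix: the context area is taken to be the area of the enlarged box that lies outside the tight box, the aspect ratio is preserved, and the shift is drawn uniformly as a fraction of the tight box size. The function name and the `max_shift` parameter are hypothetical.

```python
import math
import random


def enlarge_and_jitter(box, image_size, context_ratio, max_shift=0.1, rng=None):
    """Enlarge a tight box (x0, y0, x1, y1) with context and shift it randomly.

    context_ratio = context area / tight-box area, so the enlarged box has
    (1 + context_ratio) times the tight area (assumed reading of the text).
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    img_w, img_h = image_size
    w, h = x1 - x0, y1 - y0
    # Scale both sides equally so that the area grows by (1 + context_ratio).
    scale = math.sqrt(1.0 + context_ratio)
    new_w, new_h = w * scale, h * scale
    # Random shift of the box center to avoid center bias (hypothetical range).
    cx = (x0 + x1) / 2.0 + rng.uniform(-max_shift, max_shift) * w
    cy = (y0 + y1) / 2.0 + rng.uniform(-max_shift, max_shift) * h
    # Clip the result to the image boundaries.
    return (
        max(0.0, cx - new_w / 2.0),
        max(0.0, cy - new_h / 2.0),
        min(float(img_w), cx + new_w / 2.0),
        min(float(img_h), cy + new_h / 2.0),
    )
```

For example, with `context_ratio=3.0` each side of the box doubles, since the enlarged area is four times the tight area.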

3.3 Collecting Diverse Sources of Supervision

Evaluating the effect of diverse forms of supervision on the model’s few-shot classification ability is one of the main goals of this work. To this end, we accumulate all the labels provided in ADE20K, which include localization supervision for the object categories (both in the form of masks and bounding boxes), object part annotations, stuff category segmentation, as well as scene labels. In addition, inspired by [33], we collect category-level attribute annotations by reusing the attribute set defined for the ImageNet dataset and manually assigning the attributes to the base categories only, not to the novel ones. For class hierarchy labels, we extract the WordNet tree provided by ADE20K and record every node on the path from the object category to the root of the tree.
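Recording the hierarchy labels of a category amounts to walking up the tree until the root is reached. The sketch below illustrates this with a child-to-parent dictionary; the representation, the function name, and the toy hierarchy in the usage note are assumptions made for illustration and do not reflect the actual WordNet tree shipped with ADE20K.

```python
def path_to_root(category, parent):
    """Return the list of nodes from `category` up to the root.

    `parent` is a hypothetical dict mapping each node to its parent;
    the root is the only node without an entry.
    """
    path = [category]
    seen = {category}
    while path[-1] in parent:
        node = parent[path[-1]]
        if node in seen:
            # Guard against malformed input that is not a tree.
            raise ValueError("cycle detected in hierarchy")
        path.append(node)
        seen.add(node)
    return path
```

With a toy hierarchy `{"cat": "feline", "feline": "mammal", "mammal": "animal"}`, calling `path_to_root("cat", parent)` walks from `cat` through `feline` and `mammal` up to `animal`.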

Combining all these diverse forms of supervision in a single image classification framework is non-trivial. In the next section we discuss our experimental setup as well as evaluation protocol.

Figure 3: Our multi-task learning framework. The model is based on the Faster R-CNN [26] architecture for object detection. Additional forms of supervision are applied either to the feature vector of the object crop or to the feature map of the whole image. Our framework can flexibly incorporate other types of supervision, including self-supervised ones.

4 Experimental Protocol

A model operating on ADE-FewShot takes as input a specific region of a scene image, and outputs the category of the object in this region. To match the classification objective to this input form, our backbone is identical to the classification branch of Faster R-CNN [26]. The whole scene image is propagated through a convolutional neural network, such as ResNet [15], which produces a feature map. The part of the feature map that corresponds to the region is then fed into an RoI-Align layer [14], whose output is the feature vector of the region. Finally, the model predicts the category of the region with an additional linear layer (see the uppermost branch in Figure 3). On the base set, the model is trained in the standard manner of convolutional neural networks. On the novel set, we freeze the feature extractor of the model and only fine-tune the linear classification layer.

When experimenting with additional sources of supervision, we apply them either to the feature vector of the object (e.g. attributes, class hierarchy, parts, object localization) or to the feature map of the whole image (e.g. scene labels). The choice between these two options depends on whether the kind of supervision is directly related to the object or to the whole image. For example, attributes and localization are characteristics of an object, while scene labels correspond to the image.

More formally, denoting the feature extractor as $f$ and the classifier as $g$, the classification loss for an input $x$ with class label $y$ is:

$$L_{cls}(g(f(x)), y).$$

For a supervision $s$ with a label $y_s$, with $L_s$ denoting the task-specific loss and $h_s$ denoting the task-specific layer, e.g. a linear classifier for attribute supervision, the corresponding loss function is:

$$L_s(h_s(f(x)), y_s).$$

When multiple types of supervision are combined in training, the final loss function becomes the linear combination of their individual losses and the classification loss:

$$L = L_{cls}(g(f(x)), y) + \sum_{s} \lambda_s L_s(h_s(f(x)), y_s),$$

where $\lambda_s$ is a hyper-parameter balancing the different objectives, which is selected on the validation set.
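As an illustration of this linear combination, the sketch below adds weighted auxiliary losses to a classification loss. It is a simplified, framework-free example under assumptions: the classification loss is taken to be softmax cross-entropy, which the text does not state explicitly, and the function names and the dictionary-based interface are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (assumed form of the classification loss)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def combined_loss(cls_loss, aux_losses, weights):
    """Classification loss plus a weighted sum of the auxiliary supervision losses.

    `aux_losses` and `weights` are hypothetical dicts keyed by supervision name;
    the weights play the role of the balancing hyper-parameter.
    """
    return cls_loss + sum(weights[name] * value for name, value in aux_losses.items())
```

In a real training loop the individual loss values would come from the task-specific layers applied to the shared features, and the weights would be tuned on the validation set.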

| Model | Method | Top-1 | Top-5 |
|---|---|---|---|
| Baseline | - | 23.65 | 44.38 |
| +Segmentation | MTL | 24.98 | 45.41 |
| +Segmentation | CL | 26.57 | 48.18 |
| +Segmentation+Attribute | MTL | 26.14 | 47.91 |
| +Segmentation+Attribute | CL | 27.51 | 49.33 |

Table 1: Comparison of adding supervision sources through multi-task learning and curriculum learning for 5-shot learning. ‘MTL’ and ‘CL’ stand for multi-task learning and curriculum learning, respectively. Rows 2 and 3 compare the two strategies when a single supervision source (segmentation) is added, and demonstrate the superiority of curriculum learning. Rows 4 and 5 show that curriculum learning still outperforms multi-task learning when two supervision sources, segmentation and attributes, are added.

We experiment with two approaches for combining multiple supervision sources: multi-task learning, where all types of supervision are applied at once, and curriculum learning, where the losses are added one by one. According to the performance in Table 1, adding supervision sources sequentially results in a better novel classification performance, and we use it for all the remaining experiments.
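The difference between the two strategies can be made concrete by listing which losses are active in each training stage. The sketch below is only schematic and makes one assumption that the text leaves open: the curriculum is taken to start with the classification loss alone before the additional losses are added one by one. The function names are hypothetical.

```python
def multitask_stages(sources):
    """Multi-task learning: a single stage with all losses applied at once."""
    return [["classification"] + list(sources)]


def curriculum_stages(sources):
    """Curriculum learning: losses are added one by one, each in a new stage."""
    active = ["classification"]
    stages = [list(active)]  # assumed initial stage with classification only
    for source in sources:
        active.append(source)
        stages.append(list(active))
    return stages
```

For the last rows of Table 1, `curriculum_stages(["segmentation", "attribute"])` produces three stages, whereas `multitask_stages` produces a single stage with all three losses.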

With a ResNet-18 [15] backbone, we experimented with varied implementations of the model and different hyper-parameter settings. During data loading, we resize the short edge of every image to 800 pixels, following the protocol of the ADE20K dataset [40]. As for the model, we modify the down-sampling rate of ResNet-18 so that each of the first three residual blocks down-samples by a factor of 2; together they down-sample the image by a factor of 8, compared to 32 in the original ResNet. During training, we use a batch size of 8 and an SGD optimizer with a learning rate of 0.1 and a cosine scheduler [23]. Training the baseline model takes 6 epochs, roughly 3 hours on a 4-GPU machine.
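For reference, a cosine learning rate schedule can be sketched as follows. The initial learning rate of 0.1 comes from the text; the remaining choices (a single decay cycle without warm restarts, a final learning rate of zero, and the function name) are assumptions, since the exact configuration is not given.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Learning rate at `step` under a single-cycle cosine decay (assumed setup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at `base_lr`, reaches half of it at the midpoint of training, and decays to `min_lr` at the end.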

We evaluate the performance on the novel set using the standard 1-shot and 5-shot settings on the full set of classes (100-way for novel-val and 193-way for novel-test) and report top-1 and top-5 accuracy.
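Top-k accuracy counts a prediction as correct when the ground truth class is among the k highest-scoring classes. A minimal sketch of this metric is shown below; the function name and the list-of-lists input format are illustrative assumptions.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k top-scoring classes.

    `scores` holds one list of per-class scores per example,
    `labels` holds the corresponding ground truth class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

In the 193-way novel-test setting each row of `scores` would contain 193 entries, and the metric would be computed with `k=1` and `k=5`.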

5 Experimental Evaluation

5.1 Benchmarking State-of-the-Art Approaches

We begin by providing an evaluation of several recent few-shot learning methods on the ADE-FewShot benchmark. In particular, we focus on the best performing approaches from the recent study of Chen et al. [3], and report the results of prototypical networks [28], relational network [31], as well as the linear and cosine classifier baselines that have shown promising results in [3]. We show 5-shot accuracy on the novel set in Figure 4.

Figure 4: Benchmarking few-shot learning approaches on the novel set of ADE-FewShot. 5-shot accuracy is reported.

Similar to [3], we observe that prototypical networks show strong results; however, the gap between prototypical and relational networks is more significant on our benchmark. We hypothesize that this is because relational networks do not generalize well to more complex scenarios. Another interesting observation is that, although both linear and cosine classifiers show strong performance, the linear variant actually performs better in both the 1-shot and the 5-shot regimes. The opposite trend was reported in [3]. We use the linear classifier for the rest of the experiments in the paper.

5.2 Exploration of Individual Supervision Sources

Several recent works have explored the effect of additional supervisory signals on a model’s few-shot classification performance. In particular, Tokmakov et al. [33] and Li et al. [20] proposed to use category-level attribute labels and class hierarchy, respectively, to learn more generalizable representations. In a different line of work, Wertheimer and Hariharan [39] proposed to use bounding box supervision for the novel categories to improve the classifier’s discriminative ability. Finally, Gidaris et al. [8] used self-supervised learning objectives to regularize representation learning on the base categories.

In this section, we unify and generalize these results by exploring a diverse set of supervision sources on the proposed ADE-FewShot benchmark. We begin by studying the semantic supervision sources proposed in [33, 20] and additionally evaluate the effect of object part and scene labels in Section 5.2.1. Next, we turn to localization supervision in Section 5.2.2, but, in contrast to [39], we evaluate how providing different forms of localization labels for the base categories affects the generalization ability of the learned representation. Finally, in Section 5.2.3 we confirm the observation of [8] that self-supervised objectives are capable of regularizing representation learning on the base categories, but also show that the improvements are lower than those achieved with additional supervised labels. The results are summarized in Table 2.

5.2.1 Semantic Supervision Sources


We begin by exploring class-level attribute supervision. Following [33], we use a multi-label classification loss for the attributes as an additional training objective and apply it to the feature vector of the object. As can be seen from the second row of Table 2, the attribute classification objective results in learning a representation that requires fewer examples to learn to recognize novel categories. In particular, top-5 classification performance in the 5-shot scenario improves by more than 3%, whereas the improvements on the base set are much less significant, confirming the observations of [33].

One way to explain this effect derives from the observation that the learned attribute classifier is able to recognize common attributes on the novel classes, as shown in Figure 5. In particular, the classifier correctly recognizes the texture and material of the bathtub. Although the wood prediction is incorrect, it is a reasonable mistake, given the brown color of the wall. Notice that the attributes are not used directly for novel category recognition. However, the fact that they are captured in the feature space benefits learning more generalizable representations.
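A multi-label attribute objective of the kind described above is commonly implemented as an independent binary cross-entropy per attribute on top of a linear layer. The sketch below shows this form for a single object; using a sigmoid-based binary cross-entropy and averaging over attributes are assumptions, as the text only specifies a multi-label classification loss, and the function name is hypothetical.

```python
import math


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over attributes for one object.

    `logits` are raw attribute scores, `targets` are 0/1 attribute labels.
    Uses the numerically stable form max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

Because the attributes are labeled per category, every object of a given base category would share the same target vector.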

Figure 5: Classification of attributes generalizes to unseen objects. Most of the attributes predicted on this unseen image “bathtub” are consistent with the visual appearance of the object. The only wrong prediction “has wood” is also reasonable considering the brown color of the wall.

Class hierarchy.

Next, we study the effect of incorporating the hierarchical structure of the categories into the feature space of the network. In [20] the authors propose to utilize a hierarchical embedding space, where the feature representation is transformed to different levels of semantic hierarchy to classify corresponding concepts. We compare this variant to the baseline in the third row of Table 2, and observe a notable improvement, although it is lower than that of the attribute supervision.

We further experiment with a simplified version of class hierarchy supervision, that is, classifying an object on multiple concept levels. For instance, we classify a cat as cat, mammal, and animal. In our implementation, we split the class hierarchy into four levels and learn an independent classifier on each of the levels. As can be seen from row 4 of Table 2, this naive approach results in a higher performance than the complex method of [20], but it still does not outperform the richer attribute supervision.

Scene labels.

We now move beyond previously explored sources of semantic supervision, and study the effect of scene labels on learning generalizable object representations. Recall that in our architecture, object representations are obtained by applying RoI-Align to the corresponding regions of a feature map of the whole scene (see Figure 3). Thus, we can apply a scene classifier directly to the average-pooled feature map of the last convolutional layer of the network and optimize this objective jointly with the object classifier. The results are shown in row 5 of Table 2. Although scene labels are not directly related to the object categories, we observe a significant improvement in the novel classification performance. We attribute it to the fact that scenes are correlated with certain groups of object categories in the same way as class attributes or elements of the class hierarchy (e.g., farm animals tend to appear outside and vehicles are found in urban environments), and they play the same role in regularizing the learning of object representations.

Object parts.

ADE20K provides per-pixel part annotations for some objects; for example, chairs have legs and arms. With such part segmentation annotations, we can aggregate part labels for object categories. We then use them in the same way as attributes. The results in row 6 of Table 2 indicate that part labels indeed result in an improved generalization performance, although the improvement is not as significant as that of the other types of semantic supervision.
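One way to aggregate part labels per object category is to take the union of the parts observed across all annotated instances of that category and to encode it as a binary vector, which can then be used like an attribute vector. The sketch below follows this reading; the union rule, the input format, and the function name are assumptions rather than details given in the text.

```python
def aggregate_part_labels(instances, part_vocab):
    """Build a binary part vector per object category.

    `instances` is a hypothetical list of (category, parts) pairs, one per
    annotated object; a category is assumed to have a part if any of its
    instances has it. `part_vocab` fixes the order of the vector entries.
    """
    parts_by_category = {}
    for category, parts in instances:
        parts_by_category.setdefault(category, set()).update(parts)
    return {
        category: [1 if part in parts else 0 for part in part_vocab]
        for category, parts in parts_by_category.items()
    }
```

For instance, two chair instances annotated with `["leg", "arm"]` and `["leg", "back"]` would yield a chair vector that marks all three parts as present.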

| Type of supervision | Model | Base-Val | Novel-test 1-shot Top-1 | Novel-test 1-shot Top-5 | Novel-test 5-shot Top-1 | Novel-test 5-shot Top-5 |
|---|---|---|---|---|---|---|
| | Linear | 44.13 | 7.53 | 16.89 | 23.65 | 44.38 |
| Semantic supervision sources | +Attribute | 45.38 | 8.09 | 17.53 | 26.21 | 47.54 |
| | +Hierarchy Embedding | 44.57 | 8.01 | 17.87 | 25.67 | 46.85 |
| | +Hierarchy Classifier | 46.43 | 8.36 | 18.50 | 26.12 | 47.11 |
| | +Scene | 45.03 | 7.90 | 18.61 | 26.37 | 47.28 |
| | +Part | 45.68 | 7.92 | 17.76 | 25.94 | 47.12 |
| Localization supervision sources | +Bounding Box | 45.97 | 7.67 | 17.69 | 26.19 | 47.32 |
| | +Segmentation Region | 45.68 | 8.21 | 17.88 | 25.50 | 47.38 |
| | +Segmentation FCN | 45.82 | 8.37 | 18.06 | 26.57 | 48.18 |
| | +Stuff | 43.86 | 7.11 | 15.57 | 22.18 | 42.57 |
| | +(Object+ Background) | 46.03 | 7.62 | 17.45 | 26.92 | 48.47 |
| Self supervision | +Rotation | 44.31 | 6.69 | 15.69 | 24.12 | 44.96 |
| | +Patch Location | 44.43 | 7.81 | 17.17 | 25.04 | 45.45 |

Table 2: Comparison of different supervision sources on the base-validation set and the novel-test set of ADE-FewShot.

5.2.2 Localization Supervision Sources

We explore the effect of localization supervision on the generalization performance of object representations. We begin with the progressively more expensive forms of location annotations for the objects themselves, and then evaluate whether providing segmentation labels for stuff categories can improve the object representations as well.

Bounding boxes.

Unlike [39], we study the effect of providing bounding box labels for the base categories. The intuition is that this will allow the object representation to focus on the objects and not on the background, and thus to generalize better to unseen object distributions. To this end, we add a bounding box regression layer after the RoI-Align stage, following the design of the R-CNN model for object detection [11]. As can be seen from the corresponding row of Table 2, this results in a significant performance improvement on the novel categories, confirming our intuition.

Segmentation masks.

We now study whether providing more precise object localization information in the form of object masks can further improve the generalization ability of the model. We experiment with two variants of segmentation supervision. The first follows the protocol of Mask R-CNN [14], where we predict a binary mask inside an RoI-Aligned region. The second operates in the semantic segmentation regime [22], appending an additional convolutional layer after the feature map. Notice that we only apply segmentation supervision to the base categories and ignore the pixels corresponding to the novel objects and base validation objects during representation learning.

The numbers in Table 2 demonstrate that both approaches result in an improvement over the baseline, and the semantic segmentation model achieves top performance among all the variants. We hypothesize that the problem of semantic labeling of the pixels in a large feature map is more difficult than the alternative formulation of binary pixel classification in a cropped region, and thus it provides a stronger regularization signal for representation learning.

Stuff segmentation.

ADE20K provides segmentation masks for stuff categories in addition to objects and parts, allowing us to explore the effect of this supervisory signal on learning representations for few-shot object recognition. However, the results in Table 2 show a small decrease in performance with respect to the baseline. We argue that this is because the stuff supervision forces the representation to focus on the background features, which the object classifier then latches onto.

To further explore this phenomenon, we combine the foreground and stuff labels, giving a weight of 0.1 to the background classes. This combined supervision results in a performance improvement, outperforming even the variant with foreground segmentation only. This result demonstrates that stuff supervision can still be helpful, but only when combined with foreground supervision.
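One way to read the 0.1 weighting is as a per-class weight in the pixel-wise cross-entropy, so that stuff pixels still shape the features but contribute far less than object pixels. The sketch below follows that reading. The function name, the `stuff_classes` argument, and the normalization by the sum of weights are assumptions made for illustration.

```python
import math


def weighted_pixel_cross_entropy(logits, labels, stuff_classes, stuff_weight=0.1):
    """Cross-entropy in which pixels of stuff classes are down-weighted.

    logits: list of per-pixel class score lists.
    labels: list of integer class indices.
    stuff_classes: set of class indices treated as background (stuff).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for scores, label in zip(logits, labels):
        weight = stuff_weight if label in stuff_classes else 1.0
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        weighted_sum += weight * (log_z - scores[label])
        weight_total += weight
    # Normalizing by the total weight is an assumed convention.
    return weighted_sum / weight_total if weight_total else 0.0
```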

5.2.3 Self-supervision

Finally, we validate the recent observation that self-supervised objectives can also be used to regularize representation learning [8]. Following [8], we explore Rotation [10] and Relative Patch Location [5] losses. For the first method, a region with a short edge of 600 pixels is cropped out of the whole image and rotated by an angle chosen at random from a fixed set. The model then has to correctly classify the chosen rotation. For Relative Patch Location, the input image is first divided into a grid. The center crop and another randomly picked patch are then passed through the model, which has to predict their relative location.
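The two pretext tasks can be sketched as label-generating functions. This is a minimal illustration and not the configuration used in the experiments: it assumes four rotations at multiples of 90 degrees, as in [10], and a 3×3 grid for the patch task, neither of which is specified in the text above. Images are represented as nested lists, and all function names are hypothetical.

```python
import random


def rotate90(image, k):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def rotation_example(image, rng):
    """Return a rotated image and the rotation index to be classified."""
    k = rng.randrange(4)  # assumed label set: 0, 90, 180, 270 degrees
    return rotate90(image, k), k


def relative_patch_example(image, rng, grid=3):
    """Return the center patch, another patch, and the other patch's position index."""
    h, w = len(image) // grid, len(image[0]) // grid

    def patch(r, c):
        return [row[c * w:(c + 1) * w] for row in image[r * h:(r + 1) * h]]

    center = (grid // 2, grid // 2)
    others = [(r, c) for r in range(grid) for c in range(grid) if (r, c) != center]
    label = rng.randrange(len(others))  # grid * grid - 1 possible positions
    return patch(*center), patch(*others[label]), label
```

In both cases the label is derived from the image alone, which is what allows these losses to be applied without any manual annotation.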

The results of applying these approaches are shown in the last two rows of Table 2. Both objectives result in an improvement in few-shot learning performance, but the improvements are significantly lower than those obtained with supervised labels. These results demonstrate that although self-supervision has its merits, providing additional ground truth labels still results in learning more generalizable representations.

5.2.4 Varying the amount of supervision

After we have seen that providing diverse sources of supervision can result in significant improvements in few-shot learning performance, it is natural to ask whether these improvements can be obtained at a lower cost. In this section, we explore the effect of the fraction of the data for which additional labels are provided on the quality of the learned representation. In particular, we experiment with two kinds of supervision that have shown the strongest improvement before: attributes and semantic segmentation, and vary the proportion of the labeled data.

Figure 6: Novel accuracy under varying fraction of supervision. The performance increases when additional labels are available even for as little as 20% of the instances.

The results are presented in Figure 6. We can observe that for both types of supervision labeling as little as 20% of the data already results in significant improvements over the baseline. Indeed, labeling 20% of instances already covers a lot of categories as well as different scenarios, providing a sufficiently strong regularization signal for representation learning. We can also notice that the top-5 accuracy keeps improving with the proportion of the labeled data, whereas top-1 accuracy remains largely constant.
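A simple way to realize partial supervision is to fix a random subset of instances that carry the additional labels and to apply the auxiliary loss only to those instances. The sketch below assumes exactly that: unlabeled instances still contribute to the classification loss, and the function names, the `aux_weight` parameter, and the per-batch averaging are illustrative assumptions.

```python
import random


def choose_labeled_subset(num_instances, fraction, seed=0):
    """Pick the instance ids for which additional labels are provided."""
    rng = random.Random(seed)
    k = round(fraction * num_instances)
    return set(rng.sample(range(num_instances), k))


def batch_loss(cls_losses, aux_losses, instance_ids, labeled_ids, aux_weight=1.0):
    """Classification loss on all instances plus auxiliary loss on labeled ones."""
    cls_term = sum(cls_losses) / len(cls_losses)
    aux = [a for a, i in zip(aux_losses, instance_ids) if i in labeled_ids]
    # Batches without any labeled instance fall back to the classification loss.
    aux_term = sum(aux) / len(aux) if aux else 0.0
    return cls_term + aux_weight * aux_term
```

For example, `choose_labeled_subset(1000, 0.2)` selects 200 instances, corresponding to the 20% setting in Figure 6.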

5.3 Combining Multiple Supervision Sources

Finally, we explore whether even larger improvements can be obtained by combining several complementary sources of supervision. The results are reported in Table 3.

First, we observe that training a model with segmentation and attribute losses results in a significant performance improvement compared to each individual supervision, although the effect is not additive. A similar observation can be made by combining segmentation and category hierarchy labels. Both of these combinations improve top-5 accuracy in the 5-shot regime by about 5 percentage points over the linear baseline (from 44.38 to 49.33 and 49.23, respectively).

Model            Base-val   1-shot            5-shot
                            Top-1    Top-5    Top-1    Top-5
Linear           44.13      7.53     16.89    23.65    44.38
+Seg+Attr        46.07      8.41     18.56    27.51    49.33
+Seg+Hie         47.31      8.89     19.18    27.57    49.23
+Seg+Bbox        47.42      8.49     17.59    27.53    49.06
+Seg+Hie+Attr    46.64      8.52     19.02    26.95    49.32
+Seg+Attr+Hie    47.27      9.21     19.52    27.73    49.91
Table 3: Combining several supervisory signals in a single model on the base-validation and novel-test sets of ADE-FewShot. We observe that they result in a cumulative improvement, and the performance eventually saturates.

Somewhat more surprisingly, combining two localization supervision sources, segmentation and bounding boxes, also improves over segmentation alone, showing that these two signals are in fact complementary, though to a lesser extent than the combinations of localization and semantic supervision sources.

Finally, we experiment with combining three types of labels. Unfortunately, performance seems to saturate at this point, further indicating that the improvements from combining supervision sources are not additive. We can also observe that the order in which supervision sources are added matters, with the rule of thumb being that the harder signals should be applied first.
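The saturation can be read directly off Table 3 by computing each variant's gain over the linear baseline. The short script below does this for 5-shot top-5 accuracy, using only numbers copied from the table; the variable names are illustrative.

```python
# 5-shot top-5 accuracy on the novel-test set, copied from Table 3.
table3_top5 = {
    "Linear": 44.38,
    "+Seg+Attr": 49.33,
    "+Seg+Hie": 49.23,
    "+Seg+Bbox": 49.06,
    "+Seg+Hie+Attr": 49.32,
    "+Seg+Attr+Hie": 49.91,
}

baseline = table3_top5["Linear"]
gains = {
    name: round(acc - baseline, 2)
    for name, acc in table3_top5.items()
    if name != "Linear"
}

for name, gain in gains.items():
    print(f"{name:15s} +{gain:.2f}")
```

The pairs gain between 4.68 and 4.95 points, while the best triple gains 5.53, so the third source adds well under one point on top of the best pair.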

6 Conclusion

In this paper we have introduced ADE-FewShot, a realistic few-shot learning benchmark with a rich set of annotations. We have explored how these signals can be used to improve the generalization performance of a recognition model on the tail categories and observed that a variety of supervision sources, as well as their combinations, are in fact helpful for the task. However, the performance quickly saturates as more supervision sources are added. Devising novel forms of multi-task learning that are able to better combine the benefits of diverse labels is an important direction for future work.


  • [1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NeurIPS, 2007.
  • [2] Felix JS Bragman, Ryutaro Tanno, Sebastien Ourselin, Daniel C Alexander, and Jorge Cardoso. Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. In ICCV, 2019.
  • [3] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
  • [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [5] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • [6] Nikita Dvornik, Cordelia Schmid, and Julien Mairal. Diversity with cooperation: Ensemble methods for few-shot classification. In ICCV, 2019.
  • [7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [8] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Boosting few-shot visual learning with self-supervision. In ICCV, 2019.
  • [9] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
  • [10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • [11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [12] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  • [13] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In CVPR, 2017.
  • [14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.
  • [17] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In ICCV, 2017.
  • [18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
  • [19] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [20] Aoxue Li, Tiange Luo, Zhiwu Lu, Tao Xiang, and Liwei Wang. Large-scale few-shot learning: Knowledge transfer with class hierarchy. In CVPR, 2019.
  • [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [22] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [23] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [25] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [27] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In NeurIPS, 2018.
  • [28] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
  • [29] Gjorgji Strezoski, Nanne van Noord, and Marcel Worring. Many task learning with task routing. In ICCV, 2019.
  • [30] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
  • [31] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
  • [32] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In NeurIPS, 1996.
  • [33] Pavel Tokmakov, Yu-Xiong Wang, and Martial Hebert. Learning compositional representations for few-shot recognition. In ICCV, 2019.
  • [34] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NeurIPS, 2016.
  • [35] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
  • [36] Yu-Xiong Wang and Martial Hebert. Learning from small sample sets by combining unsupervised meta-training with CNNs. In NeurIPS, 2016.
  • [37] Yu-Xiong Wang and Martial Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
  • [38] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In NeurIPS, pages 7029–7039, 2017.
  • [39] Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings.
  • [40] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
  • [41] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, pages 915–922, 2014.


In this supplementary material we provide additional experiments, further validating our benchmark and the proposed approach to representation learning via diverse supervision. We begin by reporting results on the full set of categories in ADE-FewShot in Section A, and demonstrate that adding sources of supervision leads to improvements in this setting as well. We then show that the observed benefits of diverse supervision are general and not limited to linear classifiers, by reporting consistent improvements with prototypical networks in Section B. Finally, we demonstrate that the proposed approach generalizes to networks of varying depths in Section C.

Appendix A Evaluation on All Categories

In the main paper we have reported all the results on the tail categories of our benchmark under the standard few-shot learning protocol. For completeness, we now evaluate the linear classifier baseline with and without additional sources of supervision on the full set of categories. To this end, we first pre-train the backbone on the head categories, in the same way as in the few-shot learning experiments. Following the implementation of [33], we then freeze the backbone and re-learn the classifier on the full set of categories with balanced sampling. Notice that, in line with the experimental settings in the main paper, no additional sources of supervision are used in the fine-tuning stage.
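Balanced sampling can be sketched as drawing a category uniformly at random and then an instance uniformly within it, so that head and tail categories are seen equally often while the classifier is re-learned. The exact scheme in [33] may differ; the function name and the two-stage draw below are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(labels, num_samples, seed=0):
    """Draw instance indices so that every category is equally likely."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    # First pick a category uniformly, then an instance within that category.
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(num_samples)]
```

With 95 instances of a head category and 5 of a tail category, roughly half of the drawn indices belong to the tail category, instead of the 5% obtained with uniform instance sampling.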

The results are reported in Table 4. We observe similar trends to the ones seen on just tail categories. In particular, additional sources of supervision help to improve performance in the joint space. These improvements can be combined, but are not fully additive, and performance saturates as more sources are added. These experiments confirm that our approach is also applicable in the long-tail scenario.

Model            All Classes
                 Top-1    Top-5
Baseline         18.07    37.23
+Seg             19.91    39.58
+Seg+Attr        20.52    41.48
+Seg+Attr+Hie    21.33    42.26
Table 4: Additional sources of supervision improve the performance on all the classes in the long-tail scenario.

Appendix B Diverse Supervision for Prototypical Networks

We now evaluate whether representations learned under diverse sources of supervision can benefit more sophisticated few-shot learning methods. To this end, we learn prototypical networks [28], which have shown top performance in [3] and competitive performance compared to the linear classifier approach in our experiments, as reported in the main paper. We train prototypical networks on top of the backbones learned with various sources of supervision in the main paper and report the results in Table 5. Again, we observe trends similar to the ones in the main paper, which confirms that diverse sources of supervision lead to learning richer representations that are beneficial in a variety of scenarios.
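For reference, the classification rule of prototypical networks [28] can be sketched as follows: each category is represented by the mean of its support embeddings, and a query is assigned to the category with the nearest prototype under squared Euclidean distance. The sketch covers only this inference rule on pre-computed embeddings, not the episodic training; the function names are illustrative.

```python
def class_prototypes(embeddings, labels):
    """Average the support embeddings of each category."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(query, prototypes):
    """Assign the query to the category with the nearest prototype."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))
```

Since the rule depends only on distances in the embedding space, any improvement in the backbone representation transfers directly to the classifier.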

Model                    Novel-test 5-shot
                         Top-1    Top-5
Prototypical networks    23.33    42.98
+Seg                     24.17    44.98
+Seg+Attr                25.22    45.86
+Seg+Attr+Hie            26.24    46.23
Table 5: Similar to the results on linear classifiers in the main paper, adding additional supervision yields improvement for prototypical networks, and the benefits saturate as the number of supervision sources increases.

Model            ResNet-10                                        ResNet-34
                 Base-val   1-shot            5-shot              Base-val   1-shot            5-shot
                            Top-1    Top-5    Top-1    Top-5                 Top-1    Top-5    Top-1    Top-5
Baseline         42.29      7.22     15.73    25.02    43.74      45.07      7.62     16.77    23.54    43.54
+Attr            42.78      7.81     16.74    25.79    44.69      46.47      7.85     17.29    25.48    46.49
+Seg             43.96      7.69     16.57    25.82    45.28      46.58      7.73     17.14    26.61    47.94
+Seg+Attr        44.19      7.98     17.53    26.27    45.68      47.18      8.23     18.25    27.04    48.16
+Seg+Attr+Hie    44.25      8.11     18.02    26.75    46.35      48.25      8.76     19.14    27.85    49.26
Table 6: Adding diverse supervision sources has a consistent positive effect on different network architectures. The improvements increase with the network depth.

Appendix C Effect of Network Depth

Finally, we study the effect of network depth on the improvements obtained through diverse supervision. To this end, in addition to the results with a ResNet-18 architecture reported in the main paper, we evaluate a shallower ResNet-10 and a deeper ResNet-34 architecture in Table 6.

We observe that both backbones benefit from additional supervision sources. However, the improvements are not uniform, but instead increase with the depth of the network. This is natural, since deeper networks have higher capacity and thus can better incorporate additional information. Notice that this trend is the opposite of the one observed in [3], where the authors have shown that improvements of sophisticated few-shot learning methods over a baseline decrease as network depth increases. Thus, learning rich feature representations via diverse supervision is a promising approach for realistic applications.
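The depth trend can be quantified from Table 6 by comparing the baseline with the full +Seg+Attr+Hie model for each backbone. The script below uses only top-5 numbers copied from the table; the variable names are illustrative.

```python
# Top-5 novel accuracy copied from Table 6: (Baseline, +Seg+Attr+Hie).
table6_top5 = {
    "ResNet-10": {"1-shot": (15.73, 18.02), "5-shot": (43.74, 46.35)},
    "ResNet-34": {"1-shot": (16.77, 19.14), "5-shot": (43.54, 49.26)},
}

gains = {
    net: {shot: round(full - base, 2) for shot, (base, full) in rows.items()}
    for net, rows in table6_top5.items()
}

for net, rows in gains.items():
    print(net, rows)
```

The gain grows with depth in both regimes: from 2.29 to 2.37 points in the 1-shot setting and from 2.61 to 5.72 points in the 5-shot setting.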