Learning to Segment the Tail

04/02/2020 · by Xinting Hu, et al. · Nanyang Technological University

Real-world visual recognition requires handling the extreme sample imbalance in large-scale long-tailed data. We propose a "divide&conquer" strategy for the challenging LVIS task: divide the whole data into balanced parts and then apply incremental learning to conquer each one. This derives a novel learning paradigm: class-incremental few-shot learning, which is especially effective for the challenges that evolve over time: 1) the class imbalance in the old-class knowledge review and 2) the few-shot data in new-class learning. We call our approach Learning to Segment the Tail (LST). In particular, we design an instance-level balanced replay scheme, a memory-efficient approximation that balances the instance-level samples from the old-class images. We also propose a meta-module for new-class learning, whose parameters are shared across incremental phases, gaining the learning-to-learn knowledge incrementally, from the data-rich head to the data-poor tail. We empirically show that, at the expense of a small sacrifice in head-class performance, we gain a significant 8.3% AP improvement on the tail classes with fewer than 10 instances, achieving an overall 2.0% AP boost across the whole 1,230 classes.




1 Introduction

The long-tail distribution inherently exists in our visual world, where a few head classes occupy most of the instances [48, 33, 36, 44]. This is inevitable when we model large-scale datasets, because the observational probability of classes in nature follows Zipf's law [28]. It is therefore prohibitively expensive to counter nature and collect a balanced, sample-rich large-scale dataset for training a robust visual recognition system with the prevailing models [12, 8, 32, 3].

Figure 1: Overview of the proposed Learning to Segment the Tail (LST) method for LVIS [9]. To tackle the severe imbalance, we divide the overall dataset into balanced sub-parts and train the instance segmentation model phase-by-phase incrementally. We use knowledge distillation and the proposed balanced replay to counter catastrophic forgetting [26] in the fewer- and fewer-shot learning over time. The model delivered at the final phase is the resultant model.

In this paper, we study a practical large-scale visual recognition task on a challenging real-world dataset: Large Vocabulary Instance Segmentation (LVIS) [9]. As shown in Figure 1, across the 1k+ instance object classes, the number of training instances per class drops from thousands in the head to only a few in the tail (e.g., 26k+ "banana" vs. only 1 "drone"). Empirical studies show that models trained on such a long-tailed dataset tend to favor the common classes and neglect the rare ones [9]. The reasons are two-fold: 1) class imbalance causes the head classes to be trained thousands of times more than the tail classes, and 2) the few-shot samples in the long tail make generalization a great challenge (around 300 classes have fewer than 10 samples). Therefore, the key to LVIS is to address not only the imbalance but also few-shot learning at a large scale.

Unfortunately, conventional works on either "imbalance" or "few-shot" are fundamentally not scalable to LVIS. On the one hand, it is well-known that data re-sampling [11, 2, 10] (up-sampling the rare tail classes or down-sampling the frequent head classes) can prevent training from being dominated by the head. Nonetheless, as they introduce no new diversity, these methods struggle in the trade-off between tail over-fitting (heavy repetition of the few-shot samples) and head under-fitting (significant abandonment of the many-shot samples). On the other hand, conventional few-shot learning, which transfers a model from a data-rich "base set" to a data-poor "novel set" [38, 43], is not yet practical for LVIS, as any base or novel split will eventually be imbalanced due to the scale, undermining a generalization ability that is already challenged in few-shot learning [42]. Besides, the scale also raises major memory issues in the episodic training [41] adopted by recent meta-learning based methods [18, 31].

An intuitive strategy to address the scale is to divide the large "body" into "parts", conquer each of them, and then merge them incrementally. As illustrated in Figure 1, each subset is more balanced and easier to handle. Essentially, the "divide&conquer" strategy for LVIS poses a novel learning paradigm: class-incremental few-shot learning. However, the merge that stitches the parts back into a whole is no longer a trivial adoption of any off-the-shelf class-incremental learning method [30, 35]. The reason is that, different from traditional class-incremental learning scenarios, our incremental phases will, over time, face 1) more imbalanced data of the old classes and 2) fewer data of the new classes. This leaves the network more vulnerable to "catastrophic forgetting" [26] while learning the new classes, which are themselves fewer- and fewer-shot.

To implement the novel paradigm for the LVIS task, we propose the balanced replay scheme for knowledge review and the meta-learning based weight-generator module for fast few-shot adaptation. We call our approach Learning to Segment the Tail (LST). In a nutshell, LST is summarized in Algorithm 1. After training the first phase, which comes with abundant labeled data, as the bootstrap, we start the incremental learning in T phases (e.g., T = 3 in Figure 1). Given the relatively balanced subset in the t-th phase obtained by data replay (BalancedReplay in Section 3.3), new classes can be learned and old classes can be fine-tuned simultaneously (UpdateModel in Section 3.2). To transfer the knowledge step by step from the "easy" many-shot head to the "difficult" few-shot tail, we further adopt a meta weight generator [7] (MWG in Section 3.4).

1: {D_0, …, D_T} ← Dataset pre-processing
2: Output: the final-phase model parameters θ_T
3: θ_0 ← Base classes training on D_0
4: for t = 1 to T do
5:     θ_t ← UpdateModel(θ_{t−1}, D_t ∪ BalancedReplay(t))
6: end for
7: function UpdateModel(θ_{t−1}, D)
8:     θ_t ← θ_{t−1}    ▷ Model initialization
9:     repeat
10:         if use meta-module then
11:             S ← sample support set from D
12:             W_new ← MWG(S)    ▷ Few-shot weight generation
13:         end if
14:         update θ_t on D    ▷ Old & new classes fine-tuning
15:     until converge
16: end function
Algorithm 1 Learning to Segment the Tail (T+1 phases)

We validate the proposed LST on the large-scale long-tailed benchmark LVIS, which contains 1,230 entry-level instance categories. Experimental results show that our LST improves the instance segmentation results over the baseline by 7.0%–8.0% AP on the tail classes while gaining a 2.2% overall improvement across all classes. The results illuminate a promising direction for tackling the severe class imbalance in long-tailed data: class-incremental few-shot learning.

Our contributions can be summarized as follows:

  • We are among the first to study the task of large vocabulary instance segmentation, which is of high practical value, focusing on severe class imbalance and few-shot learning in the field of instance segmentation.

  • We develop a novel learning paradigm for LVIS: class-incremental few-shot learning.

  • The proposed Learning to Segment the Tail (LST) for the above paradigm outperforms baseline methods, especially over the tail classes, where the model can adapt to unseen classes instantly without training.

2 Related Work

Instance segmentation. Our instance segmentation backbone is based on the popular region-based frameworks [20, 12, 4, 23], in particular Mask R-CNN [12] and its partially supervised extension Mask^X R-CNN [16], which can transfer mask predictors from box annotations alone. However, they cannot scale up to a large-scale long-tailed dataset such as LVIS [9], which is the focus of our work.

Imbalanced classification. Re-sampling and re-weighting are the two major approaches to tackling class imbalance. The former re-balances the training samples across classes [15, 2, 10, 5], while the latter assigns different weights to adjust the loss function [17, 40, 47, 6]. Some works on generalized few-shot learning [46, 19] also deal with extremely imbalanced datasets, extending the test label space of few-shot learning to both base and novel rare classes. We propose a novel re-sampling strategy: different from previous works that re-sample at the image level, we address the dataset imbalance at the instance level.

Learning without forgetting & learning to learn. Existing works mainly focus on how to learn new knowledge with less forgetting, and how to generalize from the learning process, i.e., learning to learn. To cope with ever-evolving data, class-incremental learning methods [35, 15, 37, 1] adapt a model trained on old classes to new classes, applying knowledge distillation [14, 21] and old-data replay [30, 24] to minimize the forgetting. For few-shot learning, meta-learning based works transfer the learning-to-learn knowledge through feature representations [31, 29, 18], classifier weights [46, 7], and the regression of model parameters [42, 43] from the data-rich base classes, to obtain a good model initialization for the data-poor new classes. We propose a class-incremental few-shot learning paradigm that can be seen as a non-trivial combination of these two fields.

3 Learning to Segment the Tail

LVIS is a Large Vocabulary Instance Segmentation dataset, which contains 1,230 instance classes [9]. The number of images per class in LVIS follows a natural long-tail distribution, with 700+ classes containing fewer than 100 training samples. To tackle this challenging dataset with the proposed LST using the "divide&conquer" strategy, we first present the division method in Section 3.1 and discuss our class-incremental learning pipeline in Section 3.2. In Section 3.3 and Section 3.4, we detail how to use BalancedReplay and MWG for knowledge review and few-shot adaptation.

3.1 Dataset Pre-processing

Our guideline for the division is to alleviate the intra-phase imbalance of the dataset, so that each division is relatively balanced. We first sort the classes by the number of instance-level samples in descending order, obtaining a sorted class set C. Then we divide the sorted categories into mutually exclusive groups {C_0, C_1, …, C_T}. Correspondingly, we have a sub-dataset D_t with images and annotations for each C_t.

Specifically, after grouping the top N_0 sorted classes as the bootstrap group C_0, and splitting the remaining classes into evenly spaced bins C_1, …, C_T, we obtain the sorted class sets with T + 1 groups. By assigning data to the corresponding groups, we convert the whole dataset into {D_0, D_1, …, D_T}, as shown in line 1 of Algorithm 1, where each D_t is composed of all the annotated images containing any instance of C_t. Following this setting, the data is fed to the network step-wise, so that our model is trained in a class-incremental style.
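As a concrete illustration, the head-to-tail division above can be sketched in a few lines (a minimal sketch; the function and argument names are ours, not from the paper's code):

```python
def divide_classes(instance_counts, n_base, bin_size):
    """Sort classes by instance count (descending) and split them into
    a base group C_0 plus evenly sized incremental groups C_1..C_T.

    instance_counts: dict mapping class id -> number of training instances.
    """
    sorted_classes = sorted(instance_counts, key=instance_counts.get, reverse=True)
    groups = [sorted_classes[:n_base]]              # bootstrap group C_0
    rest = sorted_classes[n_base:]
    for i in range(0, len(rest), bin_size):         # evenly spaced tail bins
        groups.append(rest[i:i + bin_size])
    return groups

# toy example: 10 classes with exponentially long-tailed counts
counts = {f"c{i}": 2 ** (10 - i) for i in range(10)}
groups = divide_classes(counts, n_base=4, bin_size=2)
```

For LVIS, calling this with n_base = 270 and bin_size = 160 would reproduce the split used in Section 4.1: one base group plus 6 incremental groups for the remaining 960 classes.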

Figure 2:

Overview of our framework for learning the instance segmentation model incrementally. It is based on a two-stage instance segmentation architecture, training on the overall imbalanced dataset in incremental phases with sampled data for both old and new classes. In the incremental phases, the backbone weights are frozen, and the distillation between the classification logits of the current and previous networks is computed on ground-truth box annotations to avoid forgetting.

3.2 Class-Incremental Instance Segmentation

Class-incremental learning aims to learn a unified model that can recognize the classes of both previous and current phases [30]. In our scenario, we aim to train our network on {D_0, …, D_T}, obtaining models θ_0 to θ_T, and finally deliver θ_T as our resultant model that can detect all instance classes in LVIS. Here, we adopt the popular definition inherited from works in incremental learning and few-shot learning [7, 35]: classes in C_0 are termed base classes; for phase t, classes in C_1, …, C_{t−1} are called old classes and classes in the current C_t are called new classes. For training and evaluation in each phase t, we do not handle any of the future classes in C_{t+1}, …, C_T. As phases go by, the data for new classes in D_t becomes fewer and fewer, and the data for old classes becomes more and more imbalanced. To tackle this inter-phase imbalance, we propose a novel sampling scheme for the old data, which is discussed in Section 3.3.

Our overall architecture is shown in Figure 2. We build our class-incremental learning framework on Mask^X R-CNN [16], a modified version of Mask R-CNN [12]. Mask^X R-CNN is an instance segmentation model for the partially supervised setting that obtains a category's mask parameters from its bounding-box parameters. We adopt this weight transfer module so that the class-agnostic transfer function weights can be shared between incremental phases, which 1) alleviates the training burden of the massive mask layers for 1,230 classes and 2) avoids the unstable knowledge distillation of the mask logits across classes (which involve many times more values than the class logits distilled in Eq. (2)). Besides, we replace the last classifier layer in the detection branch with a scaled cosine similarity operator, since it has been shown effective in eliminating the bias caused by the significant variance in weight magnitudes [30, 7, 15]. Formally, given the feature vector x, the i-th output logit of the cosine similarity classifier with weights W = [w_1, …, w_n] is:

z_i = s · ⟨x̄, w̄_i⟩,     (1)

where x̄ and w̄_i are the L2-normalized feature and weight vectors, and s is a scaling factor.
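A minimal sketch of the scaled cosine classifier, in plain Python for clarity (names are illustrative; a real implementation would operate on batched tensors):

```python
import math

def cosine_logits(feature, weights, scale=10.0):
    """Scaled cosine-similarity classifier: logits are dot products of the
    L2-normalized feature and L2-normalized class weight vectors, multiplied
    by a scaling factor (the paper initializes it to 10)."""
    def l2norm(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    f = l2norm(feature)
    return [scale * sum(a * b for a, b in zip(f, l2norm(w))) for w in weights]

logits = cosine_logits([3.0, 4.0], [[3.0, 4.0], [-4.0, 3.0]])
# every logit is bounded in [-scale, scale], regardless of weight magnitudes
```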

Then the class-specific mask weights in the mask branch are generated from the box-head weights using the class-agnostic weight prediction function in Mask^X R-CNN [16].

The overall class-incremental learning pipeline is shown in Algorithm 1, and it is composed of two stages:

Stage 1. Base classes training. This training phase (t = 0) delivers the model θ_0 for the base classes, where the backbone and RoI heads are jointly trained. The trained classification weight vectors for the top N_0 classes are denoted as W_base. We assume that if the data in the base classes is sufficiently abundant and relatively balanced, the training of θ_0 can work effectively as the bootstrap for the whole system.

We calculate the instance segmentation loss L_inst = L_rpn + L_cls + L_box + L_mask. The RPN loss L_rpn, classification loss L_cls, bounding-box loss L_box and mask loss L_mask are identical to those defined in Fast R-CNN [8] and Mask R-CNN [12].

Figure 3: A running example of different re-sampling strategies. Given images of “person” and “guitar” from different phases, we show the observable instances for each image in training ROI heads using different re-sampling strategies. As shown in (c), compared to (a) and (b), by omitting the annotations of “person” in images except for the ones we sampled, our instance-level balanced replay can construct a relatively balanced dataset with much less computation overhead.

Stage 2. Class-incremental learning. In each phase t (from 1 to T), the number of classifiers is expanded, which leads to the following adjustments to the training procedure of Stage 1:

Network Expansion. After being initialized from the last phase's model θ_{t−1}, the current model θ_t needs to grow to recruit new class-specific layers, i.e., the classification and bounding-box regression layers and the mask prediction layer for the new classes. Recalling our modification of the architecture, the weights of the mask layers can be transferred from the weights of the box layers, so the network expansion is only implemented on the box head.
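The expansion step can be sketched as follows (a toy sketch with illustrative names; only the box-head classifier is shown, since the mask weights are transferred from it):

```python
import random

def expand_classifier(old_weights, n_new, feat_dim):
    """Network-expansion sketch: keep the old classes' classifier weight
    vectors intact and append freshly initialized vectors for the new
    classes. In the paper only the box head grows; mask weights are
    obtained via the class-agnostic weight transfer function."""
    new_rows = [[random.gauss(0.0, 0.01) for _ in range(feat_dim)]
                for _ in range(n_new)]
    return old_weights + new_rows

random.seed(0)
# 2 old classes, 3 new classes joining in this phase
W = expand_classifier([[1.0, 0.0], [0.0, 1.0]], n_new=3, feat_dim=2)
```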

Figure 4: Architecture of our framework combined with the weight generator. At the start of each step, classifier weights for old classes are copied from the previous network. Based on the features of new-class samples and the base classifier weights, weights for new classes are predicted by our weight generator and concatenated. After obtaining the whole classifier, our weight generator is jointly trained with the network on input images of both old and new categories.

Freezing and knowledge distillation. As discussed in class-incremental learning works [30, 15], these two strategies are broadly used to address catastrophic forgetting, i.e., the significant performance drop on previous data when adapting a model to new data. Data rehearsal [30] is another strategy that prevents forgetting by reviewing old data, which is discussed in Section 3.3. In our scenario, 1) freezing the weights in the backbone imposes a strong constraint on the previous representation, and 2) knowledge distillation keeps the previously learned discriminative representation from shifting severely during the new learning step. Our distillation loss is defined as:

L_dist = ‖ z_old − z_new ‖_2,     (2)

where z_old and z_new are the output logits vectors for the classes in C_0, …, C_{t−1}, produced by the old model trained in phase t−1 and by the current model θ_t, respectively. Note that since the output in phase t also incorporates the new categories in C_t, z_new denotes the sliced logits corresponding only to the previous classes. ‖·‖_2 is the L2-distance measuring the difference between logits. We choose the L2-distance here to avoid the grid search over the temperature required by the conventional distillation loss [21], thanks to the already normalized logits (i.e., all logits lie in the same range [−s, s], with s the cosine scaling factor).

The purpose of Eq. (2) is to let the new model mimic the old model's behavior (i.e., generate similar output logits), so that the knowledge learned by the old network is preserved. It is worth noting that distillation requires the same input sample to go through the old and new networks separately. Different from the classification task, proposals in instance segmentation are dynamically predicted. To this end, we use the ground-truth bounding boxes of the novel classes as samples for distillation in each step. Overall, for each incremental phase t, the knowledge distillation loss L_dist is added to the final instance segmentation loss.
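The distillation term can be sketched as follows (a toy sketch with illustrative names; real logits come from feeding the same ground-truth boxes through both models):

```python
def distill_loss(old_logits, new_logits, n_old):
    """L2 distillation on the previous classes' logits only: the new
    model's logits are sliced to the old-class entries and compared to
    the old model's output, as in Eq. (2)."""
    sliced = new_logits[:n_old]            # keep only previous-class logits
    return sum((a - b) ** 2 for a, b in zip(old_logits, sliced))

# the new model outputs one extra logit for a newly added class
loss = distill_loss([1.0, -2.0], [1.0, -1.0, 0.5], n_old=2)
```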

3.3 Instance-level Data Balanced Replay

As shown in Figure 1, within each incremental phase, the variance of the instance number is narrowed. However, the inter-phase imbalance (i.e., the gap in the number of samples among phases) remains, leading to a dilemma: if we replay all the previous data, we will certainly break the balance and reintroduce the imbalance into our network; if we discard replay, catastrophic forgetting will occur [30].

Moreover, previous re-sampling strategies [9, 34] cannot be applied gracefully to instance-level vision tasks. For image-level re-sampling that regularizes the number of images per category, the inherent class co-occurrence may hinder its effectiveness. For example, in Figure 3 (a), as "guitar" usually co-occurs with "person" (for readability, we use "person" in place of "baby", which in LVIS represents the set of synonymous labels "child", "boy", "girl", "man", "woman" and "human"), adjusting the number of "guitar" instances will always unnecessarily adjust the number of "person" instances at the same time. The alternative one-instance-per-image strategy in Figure 3 (b) can ensure absolute balance; however, the additional computational cost of feed-forwarding the same image multiple times is tremendous. Based on these observations, we propose the Instance-level Data Balanced Replay strategy. For phase t, it works as follows:

  • calculate b: the average number of instances over all categories in the current set C_t;

  • calculate b_c: for each old category c, the average number of instances of c per image over all images containing annotations of c;

  • construct the replay set: for each old category c, randomly sample b / b_c images from the images containing category c, where only the annotations belonging to category c are considered valid in training.

As illustrated in Figure 3 (c), by replaying the balanced set of old data using the above strategy, we dynamically collect a relatively balanced dataset in each phase t.
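The three steps above can be sketched as follows (illustrative names; the real scheme operates on LVIS annotations, and b / b_c is rounded here for the sketch):

```python
import random

def balanced_replay(avg_new, images_per_class, instances_per_image):
    """Instance-level balanced replay sketch.

    avg_new: average instance count b over the new classes in C_t.
    images_per_class: old category -> list of image ids containing it.
    instances_per_image: old category -> average instances of that
        category per image (b_c in the steps above).
    Returns, per old category, ~ b / b_c randomly sampled images; only
    that category's annotations would be kept valid during training.
    """
    replay = {}
    for c, imgs in images_per_class.items():
        n = max(1, round(avg_new / instances_per_image[c]))
        replay[c] = random.sample(imgs, min(n, len(imgs)))
    return replay

random.seed(0)
# "person" averages 5 instances per image; b = 10 -> sample 2 images
sel = balanced_replay(10.0, {"person": list(range(100))}, {"person": 5.0})
```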

3.4 Meta Weight Generator

So far, the proposed class-incremental pipeline is able to tackle the intra- and inter-phase imbalance while preserving the performance on classes from previous phases. However, the challenge of few-shot learning becomes severe as we approach the tail classes. Therefore, we adopt a Meta Weight Generator (MWG) module [46], as shown in Figure 4, which utilizes the base knowledge learned and inherited from the previous phases to dynamically generate the weight matrix of the current phase. The motivation is: given a robust feature backbone and classifiers learned for the base classes (i.e., Stage 1 in Section 3.2), it is possible to learn to directly "write" new classifiers for the new classes, based on the new sample's feature itself and its similarities to the base classifiers [7]. As an intuitive example, we can compose a "drone" classifier from a "drone" sample feature and how much the sample resembles the base classes, e.g., 50% "airplane", 30% "fan", and 20% "frisbee".

Formally, in the t-th incremental phase, we decompose the classifier weight matrix into two parts, W_old and W_new, for the old and the new classes, respectively. Following Gidaris & Komodakis's work [7], W_new is dynamically generated. In particular, we retrieve the base classifier weights W_base from W_old, then learn how to compose W_new. Take an image containing RoIs of a new category c as an example. For each RoI feature vector x: 1) the feature vector is fed to an attention kernel function to get the coefficients a = Att(W_q x, {k_i}), where a are the weight coefficients used to attend over the base classifier weights W_base, W_q is a learnable matrix that transforms x into the query vector, and {k_i} is a set of learnable keys (one per base category); 2) a classification weight is generated for each RoI feature independently and then averaged over all RoIs of category c in the image as the final predicted weight vector w_c of category c. For each RoI feature x, the corresponding classifier weight w is calculated as:

w = φ_avg ⊙ x̄ + φ_att ⊙ Σ_i a_i w̄_i,     (3)

where ⊙ denotes element-wise multiplication, w̄_i are the normalized base classifier weights, and φ_avg and φ_att are learnable weight vectors.
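A toy sketch of this weight composition (illustrative names; the feature is assumed already query-transformed and normalized, and the base weights already normalized):

```python
import math

def mwg_weight(feat, base_weights, keys, phi_avg, phi_att):
    """Meta weight generator sketch, as in Eq. (3): softmax attention over
    learnable keys gives coefficients a_i, which mix the base classifier
    weights; the result is combined element-wise with the feature via the
    learnable vectors phi_avg and phi_att."""
    scores = [sum(f * k for f, k in zip(feat, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    coeffs = [e / sum(exps) for e in exps]               # attention a_i
    attended = [sum(a * w[j] for a, w in zip(coeffs, base_weights))
                for j in range(len(feat))]               # sum_i a_i * w_i
    return [pa * f + pt * w
            for pa, f, pt, w in zip(phi_avg, feat, phi_att, attended)]

w_new = mwg_weight([1.0, 0.0],
                   base_weights=[[1.0, 0.0], [0.0, 1.0]],
                   keys=[[1.0, 0.0], [0.0, 1.0]],
                   phi_avg=[1.0, 1.0], phi_att=[1.0, 1.0])
```

In the paper, this prediction is made per RoI and then averaged over all RoIs of the new category in the image.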

For the initialization of the t-th phase, W_old is copied from the previous phase. For the episodic training [41], each episode is composed of a support set and a query set sampled from D_t. The support set is used to apply MWG to generate W_new (Eq. (3)), and the query set is used to compute the loss of the predictions from the full model: the concatenated classifiers [W_old, W_new] as well as the other network parameters, which are then updated. This joint training ensures that the classifier weights and the meta-learner stay synchronized in the t-th phase. After the episodic training, we set the weights for a novel category c by averaging the predicted weights over all instances of class c in D_t. Then the meta-module can be completely detached, and we are ready to deliver the model θ_t.

Method | AP1 | AP<5 | AP<10 | AP10-100 | AP100-1k | AP>1k | AP | AP50 | AP75 | APbb
Baseline [12] | 0.0 | 0.0 | 0.0 | 12.8 | 20.9 | 28.3 | 17.9 | 28.9 | 18.8 | 17.9
Modified backbone | 0.0 | 0.0 | 0.0 | 13.9 | 19.9 | 27.6 | 17.8 | 28.2 | 18.8 | 17.7
Class-aware Sampling [34] | 0.0 | 0.0 | 0.0 | 20.0 | 20.2 | 24.5 | 19.5 | 31.6 | 20.5 | 19.3
Repeat-factor Sampling [9] | 4.0 | 0.0 | 2.9 | 19.9 | 21.4 | 27.8 | 20.8 | 33.3 | 22.0 | 20.6
LST w/o MWG (Ours) | 12.0 | 9.3 | 11.7 | 27.1 | 21.3 | 22.3 | 22.8 | 36.4 | 24.1 | 22.3
LST w MWG (Ours) | 13.6 | 10.7 | 11.2 | 26.8 | 21.7 | 23.0 | 23.0 | 36.7 | 24.8 | 22.6
Table 1: Results of our LST and the comparison with other methods on the LVIS val set. All experiments are based on ResNet-50-FPN Mask R-CNN.

4 Experiments

We conducted experiments on LVIS [9] using the standard metrics for instance segmentation. AP is calculated across IoU thresholds from 0.5 to 0.95 over all categories; AP50 (or AP75) uses a single IoU threshold of 0.5 (or 0.75) to decide whether a prediction is positive. To better display the results from the head to the tail, AP1, AP<5, AP<10, AP10-100, AP100-1k and AP>1k are evaluated on the sets of categories containing only 1, fewer than 5, fewer than 10, 10–100, 100–1,000 and more than 1,000 training object instances, respectively. AP for object detection is reported as APbb.

4.1 Implementation Details

We implemented our architectures and other baselines (e.g., Mask^X R-CNN [16]) on the Mask R-CNN [12] code base maskrcnn_benchmark (https://github.com/facebookresearch/maskrcnn-benchmark). For Section 3.2, we implemented as follows: 1) mask weights were generated by a class-agnostic MLP mask branch together with the weights transferred from the classifiers of the box head, following Hu et al. [16]; 2) cosine normalization was applied to both the feature vectors and the classifier weights to obtain the classification logits. Note that the ReLU non-linearity in the final layer was removed to allow the feature vectors to take both positive and negative values.

We initialized the scaling factor of the cosine similarity to 10. All models were initialized using the released model pre-trained on COCO [22], and trained using SGD with 1e-4 weight decay and 0.9 momentum. Each minibatch had 8 training images, and the images were resized so that their shorter edge is 800 pixels. No augmentation other than horizontal flipping was used. Models were evaluated on the 5k val images. Following Gupta et al. [9], we increased the number of detections per image to the top 300 (vs. top 100 for COCO) and reduced the minimum score threshold from the default 0.05 to 0.0.

For Section 3, in Stage 1, we chose N_0 = 270, where each of the top N_0 classes has 400+ instances. 512 RoIs were selected per image, with the default positive–negative sampling ratio. For training the top N_0 classes, the learning rate was set to 0.01 and decayed to 0.001 and 0.0001 after 6 and 8 epochs (10 epochs in total). In Stage 2, we split the rest of the classes into 6 groups. For each incremental phase, we sampled only 100 proposals per image, as the number of valid annotations per image shrinks when adopting our balanced replay strategy. Recalling the freezing operation in Section 3.2, we froze the top 3 layers of ResNet [13] in the backbone in each incremental learning phase. The learning rate started from 0.002 and was divided by 10 after 6 epochs (10 epochs in total). More experiments on the choice of N_0 and the number of phases are presented in Section 4.3.

Figure 5: Performance comparison on a subset of tail classes between our LST and the joint-training baseline (Mask R-CNN). We observe that the baseline APs for many few-shot categories are zero due to the extreme imbalance.
Method | AP<10 | AP10-100 | AP100-1k | AP>1k | AP
Baseline | 3.5 | 20.1 | 25.1 | 31.5 | 23.0
Ours | 14.4 | 30.0 | 25.0 | 26.9 | 26.3
Table 2: Results of our LST and the baseline implemented on ResNeXt-101-32x8d-FPN Mask R-CNN.

4.2 Results and Analyses on LVIS

Results. As shown in Table 1, our method evaluated at the last phase, i.e., on the whole dataset, outperforms the baselines on the tail classes (AP<5 and AP<10) by a large margin. The overall AP for both object detection and instance segmentation improves. In particular, as shown in Figure 5, we randomly sampled 60 classes from the tail classes, whose number of training instances is smaller than 100, and report the results with and without our class-incremental LST. We observe that our approach obtains a remarkable improvement on most tail categories. We also compared our method with other re-sampling methods proposed to tackle imbalanced data, where repeat-factor sampling [9] essentially up-samples the images containing annotations from tail classes, and class-aware sampling [34] is an alternative oversampling method. The results show that our method surpasses all the other image-level re-sampling approaches on the tail classes, bringing an improvement in overall AP as well. In Figure 6, we visualize the predicted coefficient vectors of our weight generator for samples in the last phase. The coefficient vectors of visually or semantically similar classes tend to be close, which shows our weight generator's effectiveness in relating the learning processes for data-rich and data-poor classes. Due to limited resources, all the above models were implemented on ResNet-50-FPN. We further report the results of applying our method to ResNeXt-101-32x8d-FPN [45] in Table 2 (N_0 = 270, 3 phases), which also shows a significant improvement. With more powerful computing resources available, we would like to follow the settings of Tan et al.'s work [39] to further improve our performance. We believe that our findings hold regardless of visual backbones and data augmentation tricks.

Figure 6: t-SNE [25] embeddings of the coefficients for few-shot categories in the last phase. As noted, semantically and visually similar classes are close (i.e., “diary” and “diskette”, “custard” and “wasabi”).

Analyses. Oksuz et al. [27] pointed out that the imbalance among different foreground categories, owing to the dataset itself, undermines the performance of popular recognition models. The results of our baseline models in Table 1 validate this observation, showing that recognition of rare categories performs much worse than of frequent ones (0.0% vs. 28.3%) on LVIS. By re-balancing the dataset, previous re-sampling works such as Gupta et al. [9] or Shen et al. [34] somewhat improve the performance on the tail classes. However, we show that they are less effective than our LST, because they struggle in the trade-off between tail over-fitting and head under-fitting. Furthermore, recalling Figure 3, our method is more suitable for instance-based tasks, as we essentially tackle the overall imbalance over instances. What is more, for Gupta et al.'s work [9], the threshold guiding the re-sampling of the whole dataset is sensitive to the data distribution and thus needs careful tuning; as a result, the method is not flexible when new observations are added to the current dataset, expanding the tail. In contrast, the experiments in Section 4.3 show that our method is robust to the distribution inside each incremental phase, revealing the potential of our work to be applied to open classes with even rarer data.

4.3 Ablation Study

N_0 | phase size | # phases | AP
110 | 160 | 7 | 22.4
270 | 160 | 6 | 22.8
270 | 320 | 3 | 22.9
270 | 480 | 2 | 22.9
270 | 960 | 1 | 21.8
590 | 160 | 4 | 21.2
590 | 320 | 2 | 21.4
Table 3: Ablation study on the size N_0 of the base classes and the number of incremental phases.
Figure 7: Comparison between networks with using knowledge distillation (shadow fill bars) and without using it (solid fill bars). Results are reported for old classes (yellow bars), new classes (blue bars) and old&new classes (green bars).

Choice of N_0 and the phase size. The influence of different N_0 and numbers of phases is shown in Table 3. We empirically show that, on the one hand, the final performance is sensitive to the choice of N_0, as training on a more imbalanced base dataset (i.e., N_0 = 590) undermines the reliability of θ_0 and further influences the following phases. On the other hand, the results are relatively robust to the size of each incremental phase, as the balanced replay can always provide a relatively balanced dataset when the phase size lies in a moderate range.

Knowledge distillation. We split the remaining 960 classes into 6 phases and examined the influence of knowledge distillation in each phase by comparing the performance on new classes, old classes and new&old classes, respectively. As shown in Figure 7, models trained without distilling the classification logits of two adjacent phases perform consistently worse on new&old classes than the model using distillation. In the first few phases, the performance on new classes without distillation is higher: when the new-class data is abundant, "forgetting" all the old classes is trivially beneficial to the new-class performance. But when the number of instances per category becomes fewer and fewer, distillation becomes more important for both new and old classes. The final instance segmentation AP on the whole dataset with and without knowledge distillation is 22.8% vs. 21.6%, demonstrating the effectiveness of the distillation.

Balanced replay. Figure 8 shows the effect of our Balanced Replay (BR) compared to the baseline that uses all the data of old&new classes in each phase. It is worth noting that although more data is used for training, the severe imbalance causes the baseline's performance to become gradually worse than our method's. Besides, our method needs far less storage and far fewer training iterations to converge.

Figure 8: Performance comparison for the models trained with and without our balanced replay, shown as (a) LVIS bbox AP and (b) LVIS mask AP over 6 phases. For every incremental phase, the detection and instance segmentation performance evaluated on new&old classes is reported.
Figure 9: Performance comparison for the models trained with and without our meta weight generator. For every incremental phase, the instance segmentation performance is evaluated on the whole new&old classes, and we report only the results on the new classes to highlight the few-shot learning performance.

Meta weight generator. We examined the performance of our system with and without the meta weight generator (MWG). As shown in Table 1, both variants offer a very significant boost on few-shot recognition, while the meta-module-based method does better on the extreme few-shot classes. More specifically, we evaluated the models at each phase for all classes and report the performance on the new classes (Figure 9). It is easy to see that, of the two, the meta-module-based solution exhibits better few-shot recognition behavior, especially for the <5-shot classes in the last phase (5.3% vs. 8.0%), without affecting the recognition performance on all classes. However, compared to conventional training, the episodic training for the meta-module is memory-inefficient. In our implementation, 160 is the maximum phase size for a network armed with the MWG, so we only report the results using 6 incremental phases. We would like to explore a better combination of meta-learning and fine-tuning in future work.
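The weight-generation idea can be sketched as follows, in the spirit of the dynamic few-shot classifier of [7]: a new class's classifier weight is a learnable blend of its mean few-shot feature (the prototype) and an attention-weighted mixture of the existing base-class weights. The per-dimension gates `phi_avg` and `phi_att` are hypothetical names for the learnable parameters shared across incremental phases; the exact parameterization in our implementation may differ.

```python
import math

def meta_weight_generator(few_shot_feats, base_weights, phi_avg, phi_att):
    """Generate a classifier weight for a new class from its few-shot
    features: a gated blend of (1) the mean-feature prototype and
    (2) an attention-weighted mixture of base-class weights."""
    dim = len(few_shot_feats[0])
    # prototype: average the few-shot features per dimension
    proto = [sum(f[i] for f in few_shot_feats) / len(few_shot_feats)
             for i in range(dim)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def norm(u):
        return math.sqrt(dot(u, u)) or 1.0

    # attention: cosine similarity of the prototype to each base weight
    sims = [dot(proto, w) / (norm(proto) * norm(w)) for w in base_weights]
    exps = [math.exp(s) for s in sims]
    att = [e / sum(exps) for e in exps]
    mixed = [sum(a * w[i] for a, w in zip(att, base_weights))
             for i in range(dim)]
    # learnable per-dimension gates blend the two terms
    return [phi_avg[i] * proto[i] + phi_att[i] * mixed[i] for i in range(dim)]
```

With the attention gate zeroed out, the generator degenerates to a pure prototype classifier; learning non-zero gates lets a tail class borrow knowledge from similar head classes.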

5 Conclusions

We addressed the problem of large-scale long-tailed instance segmentation by formulating a novel paradigm: class-incremental few-shot learning, where any large dataset can be divided into groups and incrementally learned from the head to the tail. This paradigm introduces two challenges that evolve over time: 1) to counter catastrophic forgetting, the replayed old classes become more and more imbalanced, and 2) the new classes become more and more few-shot. To this end, we developed the Learning to Segment the Tail (LST) method, equipped with a novel instance-level balanced replay technique and a meta weight generator for few-shot class adaptation. Experimental results on the LVIS dataset [9] demonstrated that LST gains a significant improvement for the tail classes and achieves an overall boost for the whole 1,230 classes. LST offers a novel and practical solution for learning from large-scale long-tailed data: we accept a single downside, head-class forgetting, to trade off the two challenges of large vocabulary and few-shot learning.

Acknowledgements. We thank all the reviewers for their constructive comments. This work was supported by Alibaba-NTU JRI, and partly supported by Major Scientific Research Project of Zhejiang Lab (No. 2019DB0ZX01).


  • [1] F. M. Castro, M. J. Marin-Jimenez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-End Incremental Learning. In ECCV, Cited by: §2.
  • [2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: Synthetic Minority Over-sampling Technique. In Journal of Artificial Intelligence Research, Cited by: §1, §2.
  • [3] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) Hybrid Task Cascade for Instance Segmentation. In CVPR, Cited by: §1.
  • [4] L. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam (2018) MaskLab: Instance Segmentation by Refining Object Detection With Semantic and Direction Features. In CVPR, Cited by: §2.
  • [5] Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie (2019) Class-Balanced Loss Based on Effective Number of Samples. In CVPR, Cited by: §2.
  • [6] Q. Dong, S. Gong, and X. Zhu (2017) Class Rectification Hard Mining for Imbalanced Deep Learning. In ICCV, Cited by: §2.
  • [7] S. Gidaris and N. Komodakis (2018) Dynamic Few-Shot Visual Learning Without Forgetting. In CVPR, Cited by: §1, §2, §3.2, §3.2, §3.4, §3.4.
  • [8] R. Girshick (2015) Fast R-CNN. In ICCV, Cited by: §1, §3.2.
  • [9] A. Gupta, P. Dollar, and R. Girshick (2019) LVIS: A Dataset for Large Vocabulary Instance Segmentation. In CVPR, Cited by: Figure 1, §1, §2, §3.3, Table 1, §3, §4.1, §4.2, §4.2, §4, §5.
  • [10] H. He, Y. Bai, E. A. Garcia, and S. Li (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, Cited by: §1, §2.
  • [11] H. He and E. A. Garcia (2008) Learning from imbalanced data. In IEEE Transactions on Knowledge & Data Engineering, Cited by: §1.
  • [12] K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §1, §2, §3.2, §3.2, Table 1, §4.1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
  • [14] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the knowledge in a neural network. In NeurIPS, Cited by: §2.
  • [15] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a Unified Classifier Incrementally via Rebalancing. In CVPR, Cited by: §2, §2, §3.2, §3.2.
  • [16] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick (2018) Learning to Segment Every Thing. In CVPR, Cited by: §2, §3.2, §3.2, §4.1.
  • [17] C. Huang, Y. Li, C. Change Loy, and X. Tang (2016) Learning Deep Representation for Imbalanced Classification. In CVPR, Cited by: §2.
  • [18] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-Shot Object Detection via Feature Reweighting. In ICCV, Cited by: §1, §2.
  • [19] A. Li, T. Luo, T. Xiang, W. Huang, and L. Wang (2019) Few-Shot Learning With Global Class Representations. In ICCV, Cited by: §2.
  • [20] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully Convolutional Instance-Aware Semantic Segmentation. In CVPR, Cited by: §2.
  • [21] Z. Li and D. Hoiem (2016) Learning Without Forgetting. In ECCV, Cited by: §2, §3.2.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In ECCV, Cited by: §4.1.
  • [23] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path Aggregation Network for Instance Segmentation. In CVPR, Cited by: §2.
  • [24] Y. Liu, Y. Su, A. Liu, B. Schiele, and Q. Sun (2020) Mnemonics training: multi-class incremental learning without forgetting. In CVPR, Cited by: §2.
  • [25] L. van der Maaten and G. Hinton (2008) Visualizing Data using t-SNE. In JMLR, Cited by: Figure 6.
  • [26] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. Psychology of Learning and Motivation - Advances in Research and Theory 24, pp. 109–165. Cited by: Figure 1, §1.
  • [27] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas (2019) Imbalance Problems in Object Detection: A Review. arXiv preprint arxiv:1909.00169. Cited by: §4.2.
  • [28] D. M. W. Powers (1998) Applications and explanations of zipf’s law. Association for Computational Linguistics, pp. 151–160. Cited by: §1.
  • [29] H. Qi, M. Brown, and D. G. Lowe (2018) Low-Shot Learning With Imprinted Weights. In CVPR, Cited by: §2.
  • [30] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: Incremental Classifier and Representation Learning. In CVPR, Cited by: §1, §2, §3.2, §3.2, §3.2, §3.3.
  • [31] M. Ren (2019) Incremental few-shot learning with attention attractor networks. In NeurIPS, Cited by: §1, §2.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §1.
  • [33] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman (2008) LabelMe: a database and web-based tool for image annotation. In IJCV, Cited by: §1.
  • [34] L. Shen, Z. Lin, and Q. Huang (2016) Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, Cited by: §3.3, Table 1, §4.2, §4.2.
  • [35] K. Shmelkov, C. Schmid, and K. Alahari (2017) Incremental Learning of Object Detectors Without Catastrophic Forgetting. In ICCV, Cited by: §1, §2, §3.2.
  • [36] M. Spain and P. Perona (2007) Measuring and predicting importance of objects in our visual world. In Technical Report CNS- TR-2007-002, Cited by: §1.
  • [37] G. Sun, Y. Cong, and X. Xu (2018) Active Lifelong Learning With ”Watchdog”. In AAAI, Cited by: §2.
  • [38] Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019) Meta-transfer learning for few-shot learning. In CVPR, Cited by: §1.
  • [39] J. Tan (2020) Equalization loss for long-tailed object recognition. arXiv preprint arXiv:2003.05176. Cited by: §4.2.
  • [40] K. M. Ting (2000) A comparative study of cost-sensitive boosting algorithms. In ICML, Cited by: §2.
  • [41] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In NeurIPS, Cited by: §1, §3.4.
  • [42] Y. Wang, D. Ramanan, and M. Hebert (2017) Learning to model the tail. In NeurIPS, Cited by: §1, §2.
  • [43] Y. Wang, D. Ramanan, and M. Hebert (2019) Meta-learning to detect rare objects. In ICCV, Cited by: §1, §2.
  • [44] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, Cited by: §1.
  • [45] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §4.2.
  • [46] H. Ye, H. Hu, D. Zhan, and F. Sha (2019) Learning Classifier Synthesis for Generalized Few-Shot Learning. arXiv preprint arxiv:1906.02944. Cited by: §2, §2, §3.4.
  • [47] Z. Zhou and X. Liu (2006) On Multi-Class Cost-Sensitive Learning. In AAAI, Cited by: §2.
  • [48] G. K. Zipf (2013) The psycho-biology of language: An introduction to dynamic philology. Routledge, Cited by: §1.