This is inevitable when modeling large-scale datasets, because the observational probability of classes in nature follows Zipf's law. It is therefore prohibitively expensive to counter nature and collect a balanced, sample-rich, large-scale dataset suitable for training a robust visual recognition system with the prevailing models [12, 8, 32, 3].
In this paper, we study a practical large-scale visual recognition task on a challenging real-world dataset: Large Vocabulary Instance Segmentation (LVIS). As shown in Figure 1, across the 1k+ instance object classes, the number of training instances per class drops from thousands in the head to only a few in the tail (e.g., 26k+ "banana" vs. only 1 "drone"). Empirical studies show that models trained on such a long-tailed dataset tend to favor the common classes while neglecting the rare ones. The reasons are two-fold: 1) class imbalance causes the head classes to be trained thousands of times more than the tail classes, and 2) the few-shot samples in the long tail make generalization a great challenge (around 300 classes have fewer than 10 samples). Therefore, the key to LVIS is to address not only imbalance but also few-shot learning at a large scale.
Unfortunately, conventional works on either "imbalance" or "few-shot" learning are fundamentally not scalable to LVIS. On the one hand, it is well known that data re-sampling [11, 2, 10], i.e., up-sampling the rare tail classes or down-sampling the frequent head classes, can prevent training from being dominated by the head. Nonetheless, as it introduces no new diversity, it struggles in the trade-off between tail over-fitting, caused by heavy repetition of the few-shot samples, and head under-fitting, caused by significant abandonment of the many-shot samples. On the other hand, conventional few-shot learning, which transfers a model from a data-rich "base set" to a data-poor "novel set" [38, 43], is not yet practical for LVIS, as any base or novel split will eventually be imbalanced due to the scale, undermining the generalization ability that is already challenged in few-shot learning. Besides, the scale also raises major memory issues in the episodic training adopted by recent meta-learning based methods [18, 31].
An intuitive strategy to address the scale is to divide the large "body" into "parts", conquer each of them, and then merge them incrementally. As illustrated in Figure 1, each subset is more balanced and easier to handle. Essentially, this "divide&conquer" strategy for LVIS poses a novel learning paradigm: class-incremental few-shot learning. However, the merge that stitches the parts back into a whole is not a trivial application of any off-the-shelf class-incremental learning method [30, 35]. The reason is that, different from traditional class-incremental learning scenarios, our incremental phases will over time face 1) more imbalanced data for the old classes and 2) fewer data for the new classes. This leaves the network more vulnerable to "catastrophic forgetting" while learning the new classes, not to mention that they are fewer- and fewer-shot.
To implement this novel paradigm for the LVIS task, we propose a balanced replay scheme for knowledge review and a meta-learning based weight-generator module for fast few-shot adaptation. We call our approach Learning to Segment the Tail (LST). In a nutshell, LST is summarized in Algorithm 1. After training the first phase, which comes with abundant labeled data, as the bootstrap, we start the incremental learning in phases (e.g., 3 phases in Figure 1). Given the relatively balanced subset of the t-th phase built by data replay (BalancedReplay in Section 3.3), new classes can be learned and old classes fine-tuned simultaneously (UpdateModel in Section 3.2). To transfer knowledge step by step from the "easy" many-shot head to the "difficult" few-shot tail, we further adopt a meta weight generator (MWG in Section 3.4).
We validate the proposed LST on the large-scale long-tailed benchmark LVIS, which contains 1,230 entry-level instance categories. Experimental results show that LST improves the instance segmentation results over the baseline by 7.0%-8.0% AP on the tail classes while gaining a 2.2% overall improvement across all classes. The results illuminate a promising direction for tackling severe class imbalance in long-tailed data: class-incremental few-shot learning.
Our contributions can be summarized as follows:
We are among the first to study the task of large vocabulary instance segmentation, which is of high practical value, by focusing on severe class imbalance and few-shot learning in the field of instance segmentation.
We develop a novel learning paradigm for LVIS: class-incremental few-shot learning.
The proposed Learning to Segment the Tail (LST) method for the above paradigm outperforms baseline methods, especially on the tail classes, where the model can adapt to unseen classes instantly without training.
2 Related Work
Instance segmentation. Our instance segmentation backbone is based on the popular region-based frameworks [20, 12, 4, 23], in particular Mask R-CNN and its semi-supervised extension Mask^X R-CNN, which can transfer a mask predictor from box annotations alone. However, they cannot scale up to a large-scale long-tailed dataset such as LVIS, which is the focus of our work.
; while the latter focuses on assigning different weights to adjust the loss function [17, 40, 47, 6]. Some works on generalized few-shot learning [46, 19] also deal with extremely imbalanced datasets, extending the test label space of few-shot learning to both base and rare novel classes. We propose a novel re-sampling strategy: different from previous works that re-sample at the image level, we address dataset imbalance at the instance level.
Learning without forgetting & learning to learn. Existing works mainly focus on how to learn new knowledge with less forgetting, and how to generalize from the learning process, i.e., learning to learn. To cope with ever-evolving data, class-incremental learning methods [35, 15, 37, 1] adapt a model trained on old classes to new classes, where knowledge distillation [14, 21] and old-data replay [30, 24] are applied to minimize forgetting. For few-shot learning, meta-learning based works transfer the learning-to-learn knowledge through feature representations [31, 29, 18], classifier weights [46, 7], and the regression of model parameters [42, 43] from the data-rich base classes, to obtain a good model initialization for the data-poor new classes. We propose a class-incremental few-shot learning paradigm that can be seen as a non-trivial combination of these two fields.
3 Learning to Segment the Tail
LVIS is a Large Vocabulary Instance Segmentation dataset containing 1,230 instance classes. The number of images per class in LVIS follows a natural long-tail distribution, with 700+ classes containing fewer than 100 training samples. To tackle this challenging dataset with the proposed LST using the "divide&conquer" strategy, we first present the division method in Section 3.1 and discuss our class-incremental learning pipeline in Section 3.2. In Section 3.3 and Section 3.4, we detail how to use BalancedReplay and MWG for knowledge review and few-shot adaptation.
3.1 Dataset Pre-processing
Our guideline for the division is to alleviate the intra-phase imbalance of the dataset, so that each division is relatively balanced. We first sort the classes by the number of instance-level samples in descending order, obtaining a sorted class set. Then we divide the sorted categories into mutually exclusive groups. Correspondingly, we obtain a sub-dataset of images and annotations for each group.
Specifically, after grouping the top sorted classes as the bootstrap group and splitting the remaining classes into evenly spaced bins, we obtain the sorted class groups. By assigning data to the corresponding groups, we convert the whole dataset accordingly, as shown in line 1 of Algorithm 1, where each sub-dataset is composed of all the annotated images containing any instance of its group. Following this setting, the data is fed to the network step-wise, so that our model is trained in a class-incremental style.
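The division above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function and variable names are ours, and the grouping policy (one data-rich base group followed by evenly sized bins) follows the description in this section.

```python
def divide_classes(instance_counts, n_base, n_groups):
    """Sort classes by instance count (descending) and split them into
    one data-rich base group plus n_groups evenly sized incremental
    groups, head first, tail last."""
    ranked = [c for c, _ in sorted(instance_counts.items(),
                                   key=lambda kv: -kv[1])]
    base, rest = ranked[:n_base], ranked[n_base:]
    step = (len(rest) + n_groups - 1) // n_groups  # ceiling division
    groups = [rest[i:i + step] for i in range(0, len(rest), step)]
    return [base] + groups

# Toy counts echoing the paper's head-vs-tail example
counts = {"banana": 26000, "person": 20000, "guitar": 900,
          "fan": 40, "frisbee": 12, "drone": 1}
groups = divide_classes(counts, n_base=2, n_groups=2)
# groups[0] is the bootstrap head; the rarest classes land in the last group
```

Each resulting group spans classes with similar instance counts, which is exactly what keeps the intra-phase imbalance small.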
3.2 Class-Incremental Instance Segmentation
Class-incremental learning aims to learn a unified model that can recognize the classes of both previous and current phases. In our scenario, we train our network phase by phase and finally deliver the last model as our resultant model that can detect all instance classes in LVIS. Here, we adopt the popular definitions inherited from works on incremental learning and few-shot learning [7, 35]: classes in the bootstrap group are termed base classes; in each later phase, classes from earlier phases are called old classes and classes in the current phase are called new classes. For training and evaluation in each phase, we do not handle anything in the future classes. As phases go by, the data for the new classes becomes fewer and fewer, and the data for the old classes becomes more and more imbalanced. To tackle this inter-phase imbalance, we propose a novel sampling scheme for the old data, discussed in Section 3.3.
Our overall architecture is shown in Figure 2. We build our class-incremental learning framework on Mask^X R-CNN, a modified version of Mask R-CNN. Mask^X R-CNN is an instance segmentation model that can be used in the partially supervised setting by obtaining a category's mask parameters from its bounding-box parameters. We adopt this weight-transfer module so that the class-agnostic transfer function weights can be shared between incremental phases, which 1) alleviates the training burden of the massive mask layers for 1,230 classes and 2) avoids the unstable knowledge distillation of mask logits across classes (i.e., many times more logits than in the class-logit distillation of Eq. (2)). Besides, we replace the last classifier layer in the detection branch with a scaled cosine similarity operator, which has been shown effective in eliminating the bias caused by the significant variance in weight magnitudes [30, 7, 15].
Formally, given the feature vector $x$, the output logits vector $z$ of the cosine similarity classifier with weights $W$ is:
$z = \tau \cdot \bar{W}^{\top}\bar{x}$, (1)
where $\bar{W}$ and $\bar{x}$ are the L2-normalized versions of $W$ and $x$, and $\tau$ is the scaling factor.
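The scaled cosine classifier can be illustrated numerically. This is a minimal NumPy sketch with names of our choosing; tau=10 follows the scaling-factor initialization mentioned in Section 4.1.

```python
import numpy as np

def cosine_logits(x, W, tau=10.0):
    """Scaled cosine-similarity classifier: both the feature x and every
    classifier weight row of W are L2-normalized, so each logit equals
    tau * cos(angle) and is bounded in [-tau, tau]."""
    x_bar = x / np.linalg.norm(x)
    W_bar = W / np.linalg.norm(W, axis=1, keepdims=True)
    return tau * (W_bar @ x_bar)

x = np.array([3.0, 4.0])
W = np.array([[6.0, 8.0],     # same direction as x  -> logit +tau
              [-3.0, -4.0]])  # opposite direction   -> logit -tau
z = cosine_logits(x, W)
```

The bounded logit range is what later makes the L2 distillation of Eq. (2) work without a temperature.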
Then the class-specific mask weights in the mask branch are generated from the box classifier weights using the class-agnostic weight prediction function in Mask^X R-CNN.
The overall class-incremental learning pipeline is shown in Algorithm 1, and it is composed of two stages:
Stage 1. Base classes training. This first training phase delivers the model for the base classes, where the backbone and RoI heads are jointly trained, and the resulting classification weight vectors for the top classes serve as the base classifiers. We assume that if the data of the base classes are sufficiently abundant and relatively balanced, this training can work effectively as the bootstrap for the whole system.
Stage 2. Class-incremental learning. In each subsequent phase t, the set of classifiers is expanded, which leads to the following adjustments to the training procedure of Stage 1:
Network Expansion. After being initialized from the last phase's model, the current model grows to recruit new class-specific layers, i.e., the bounding-box classification and regression layers and the mask prediction layer for the new classes. Recall our modification of the architecture: the weights of the mask layers can be transferred from the weights of the box layers, so the expansion of the network is implemented only on the box head.
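The expansion step amounts to growing the box-head classifier matrix by one row per new class while keeping the old rows intact. The NumPy sketch below is illustrative only; in particular, the small-random initialization of the new rows is our assumption, not a detail given in the paper.

```python
import numpy as np

def expand_classifier(W_old, n_new, rng=None):
    """Grow the box-head classifier by n_new rows (one per new class);
    old rows are kept verbatim. Mask weights are later transferred
    from these box weights, so only the box head needs to grow."""
    rng = rng or np.random.default_rng(0)
    d = W_old.shape[1]
    W_new = 0.01 * rng.standard_normal((n_new, d))  # small random init (assumed)
    return np.concatenate([W_old, W_new], axis=0)

W_old = np.zeros((3, 4))              # 3 old classes, 4-dim features
W = expand_classifier(W_old, n_new=2)
```

Keeping the old rows untouched at expansion time is what lets the distillation and replay mechanisms below focus purely on preventing drift during training.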
Freezing and knowledge distillation. As discussed in class-incremental learning works [30, 15], these two strategies are broadly used to address catastrophic forgetting, the significant performance drop on previous data when adapting a model to new data. Data rehearsal is another strategy that prevents forgetting by reviewing old data, discussed in Section 3.3. In our scenario, 1) freezing the weights in the backbone imposes a strong constraint on the previous representation, and 2) knowledge distillation keeps the previously learned discriminative representation from shifting severely during the new learning step. Our distillation loss is defined as:
$\mathcal{L}_{dist} = \lVert \hat{z}^{t-1} - \hat{z}^{t} \rVert_{2}$, (2)
where $\hat{z}^{t-1}$ and $\hat{z}^{t}$ are the output logits vectors for the previous classes, produced by the old model trained in phase $t-1$ and by the current model, respectively. Note that the output in phase $t$ also incorporates the new categories; $\hat{z}^{t}$ denotes the sliced logits corresponding only to the previous classes. $\lVert \cdot \rVert_{2}$ is the L2-distance measuring the difference between logits. We choose the L2-distance to avoid the grid search over the temperature required by the conventional distillation loss, thanks to the already normalized cosine logits (i.e., all logits lie in the same bounded range).
The purpose of Eq. (2) is to let the new model mimic the old model's behavior (i.e., generate similar output logits), so that the knowledge learned by the old network is preserved. It is worth noting that distillation requires the same input sample to go through the old and new networks separately. Different from the classification task, in instance segmentation the proposals are dynamically predicted; to this end, we use the ground-truth bounding boxes of the novel classes as the samples for distillation in each step. Overall, for each incremental phase, the knowledge distillation loss is added to the final loss.
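The L2 distillation over sliced logits might look as follows. This is a NumPy sketch; the slicing convention, with old classes occupying the leading logit positions, is our assumption for exposition.

```python
import numpy as np

def distillation_loss(z_old, z_new):
    """L2 distance between the old model's logits over the previous
    classes and the new model's logits sliced to those same classes.
    Cosine logits share a bounded range, so no temperature is needed."""
    sliced = z_new[:len(z_old)]  # drop the new-class logits (assumed layout)
    return float(np.sqrt(np.sum((z_old - sliced) ** 2)))

z_prev = np.array([1.0, -2.0, 0.5])        # old model, 3 old classes
z_curr = np.array([1.0, -2.0, 0.5, 3.0])   # new model, one extra class
loss = distillation_loss(z_prev, z_curr)   # identical old behavior -> 0
```

A zero loss here means the new model's behavior on the old classes is unchanged, which is exactly the mimicry Eq. (2) rewards.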
3.3 Instance-level Data Balanced Replay
As shown in Figure 1, within each incremental phase the variance in instance numbers is narrowed. However, the inter-phase imbalance (i.e., the gap in the number of samples among phases) remains, leading to a dilemma: if we replay all the previous data, we break the balance and reintroduce imbalance into the network; if we discard replay, catastrophic forgetting happens.
Moreover, previous re-sampling strategies [9, 34] cannot be applied gracefully to instance-level vision tasks. For image-level re-sampling, which regularizes the number of images per category, the inherent class co-occurrence may hinder its effectiveness. For example, in Figure 3 (a), since "guitar" usually co-occurs with "person" (for readability, we use "person" in place of "baby", which LVIS uses to represent the synonymous labels "child", "boy", "girl", "man", "woman" and "human"), adjusting the number of "guitar" instances always unnecessarily adjusts the number of "person" instances at the same time. The alternative one-instance-per-image strategy in Figure 3 (b) can assure absolute balance; however, the additional computational cost of feed-forwarding the same image multiple times is tremendous. Based on these observations, we propose the Instance-level Data Balanced Replay strategy. For each phase, it works as follows:
calculate the average number of instances over all categories in the current set;
calculate, for each old category, the average number of instances over the images containing annotations of that category;
construct the replay set: for each old category, randomly sample the corresponding number of images from those containing that category, where only the annotations belonging to that category are considered valid in training.
As illustrated in Figure 3 (c), by replaying the balanced set of old data with the above strategy, we dynamically collect a relatively balanced dataset in each phase.
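The three steps above can be sketched as follows. This is illustrative Python, not the authors' implementation: the data layout (a mapping from image id to the list of old-class labels it contains) and the rounding of the per-category sample count are assumptions for exposition.

```python
import random
from collections import defaultdict

def balanced_replay(old_annotations, seed=0):
    """For each old category c, sample roughly
    (average instances per class) / (average instances per image of c)
    images containing c; only c's annotations count as valid replays."""
    rng = random.Random(seed)
    images_of, count_of = defaultdict(list), defaultdict(int)
    for img, labels in old_annotations.items():
        for c in set(labels):
            images_of[c].append(img)
        for c in labels:
            count_of[c] += 1
    avg_per_class = sum(count_of.values()) / len(count_of)
    replay = {}
    for c, imgs in images_of.items():
        avg_per_image = count_of[c] / len(imgs)
        n = max(1, round(avg_per_class / avg_per_image))
        replay[c] = rng.sample(imgs, min(n, len(imgs)))
    return replay

# image -> list of old-class instance labels in that image
anns = {"img1": ["cat", "cat"], "img2": ["cat"], "img3": ["dog"]}
replay = balanced_replay(anns)
```

Dividing by the per-image instance density is the instance-level twist: categories whose instances cluster densely in few images need fewer replayed images to reach the same instance budget.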
3.4 Meta Weight Generator
So far, the proposed class-incremental pipeline is able to tackle the intra- and inter-phase imbalance while preserving the performance on classes from previous phases. However, the challenge of few-shot learning becomes severe as we approach the tail classes. Therefore, we adopt a Meta Weight Generator (MWG) module, as shown in Figure 4, which utilizes the base knowledge learned and inherited from previous phases to dynamically generate the classifier weights of the current phase. The motivation is: given a robust feature backbone and classifiers learned for the base classes (i.e., Stage 1 in Section 3.2), it is possible to learn to directly "write" new classifiers for the new classes, based on the new sample feature itself and its similarities to the base classifiers. For an intuitive example, we can customize a "drone" classifier by using a "drone" sample feature and how similar the sample looks to the base classes, e.g., 50% "airplane", 30% "fan", and 20% "frisbee".
Formally, in the $t$-th incremental phase, we decompose the classifier weight matrix into two parts, for the old and the new classes, respectively. Following Gidaris & Komodakis's work, the new-class weights are dynamically generated. In particular, we retrieve the base classifier weights learned in Stage 1 and learn how to compose the new weights from them. Take an image containing RoIs of a new category as an example. For each RoI feature vector, 1) the feature is fed to an attention kernel function to obtain the coefficients used to attend over the base classifier weights, via a learnable matrix that transforms the feature into a query vector and a set of learnable keys (one per base category); 2) a classification weight is generated for each RoI feature independently and then averaged over all RoIs of the category in this image to yield the final predicted weight vector of that category. For each RoI feature, the corresponding classifier weight is calculated as:
$w = \phi_{avg} \odot \bar{x} + \phi_{att} \odot \sum_{i} a_{i}\,\bar{w}_{i}$, (3)
where $\odot$ denotes element-wise multiplication, $\bar{x}$ is the normalized RoI feature, $a_{i}$ and $\bar{w}_{i}$ are the attention coefficient and normalized classifier weight of the $i$-th base class, and $\phi_{avg}$ and $\phi_{att}$ are learnable weight vectors.
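The weight generation for a single RoI can be sketched as follows. This NumPy illustration follows the Gidaris & Komodakis-style formulation described above; all parameter names and shapes are ours.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def generate_weight(x, W_base, query_mat, keys, phi_avg, phi_att):
    """Compose a new-class classifier weight for one RoI: mix the
    normalized feature itself with an attention-weighted sum of the
    normalized base classifier weights (element-wise gating)."""
    x_bar = x / np.linalg.norm(x)
    W_bar = W_base / np.linalg.norm(W_base, axis=1, keepdims=True)
    a = softmax(keys @ (query_mat @ x_bar))  # one coefficient per base class
    w_att = a @ W_bar                        # attended base weights
    return phi_avg * x_bar + phi_att * w_att

d, k = 4, 2                                   # feature dim, base classes
x = np.array([0.0, 0.0, 3.0, 4.0])
W_base = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0, 0.0]])
# zero keys -> uniform attention; gate fully toward the attended part
w = generate_weight(x, W_base, np.eye(d), np.zeros((k, d)),
                    phi_avg=np.zeros(d), phi_att=np.ones(d))
```

With uniform attention and the gate set entirely to the attended part, the generated weight is simply the mean of the normalized base weights, matching the "50% airplane, 30% fan" intuition above when the attention is non-uniform.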
For the initialization of the $t$-th phase, the old classifier weights are copied from the previous phase. For the episodic training, each episode is composed of a support set and a query set sampled from the current phase's data. The support set is used by MWG to generate the new-class weights (Eq. (3)), and the query set is used to collect the loss from predictions of the full model, i.e., the concatenated old and new classifiers as well as the other network parameters, which are then updated. This joint training ensures that the classifier weights and the meta-learner stay synchronized within the $t$-th phase. After the episodic training, we set the weights for a novel category by averaging the predicted weights over all instances of that category. The meta-module can then be completely detached, and we are ready to deliver the final model.
| Method | AP1 | AP<5 | AP<10 | AP10-100 | AP100-1k | AP>1k | AP | AP50 | AP75 | APbb |
|---|---|---|---|---|---|---|---|---|---|---|
| Class-aware Sampling | 0.0 | 0.0 | 0.0 | 20.0 | 20.2 | 24.5 | 19.5 | 31.6 | 20.5 | 19.3 |
| Repeat-factor Sampling | 4.0 | 0.0 | 2.9 | 19.9 | 21.4 | 27.8 | 20.8 | 33.3 | 22.0 | 20.6 |
| LST w/o MWG (Ours) | 12.0 | 9.3 | 11.7 | 27.1 | 21.3 | 22.3 | 22.8 | 36.4 | 24.1 | 22.3 |
| LST w/ MWG (Ours) | 13.6 | 10.7 | 11.2 | 26.8 | 21.7 | 23.0 | 23.0 | 36.7 | 24.8 | 22.6 |
We conducted experiments on LVIS using the standard instance segmentation metrics. AP was calculated across IoU thresholds from 0.5 to 0.95 over all categories. AP50 (or AP75) uses an IoU threshold of 0.5 (or 0.75) to identify whether a prediction is positive. To better display the results from the head to the tail, AP1, AP<5, AP<10, AP10-100, AP100-1k and AP>1k were evaluated on the sets of categories containing only 1, <5, <10, 10-100, 100-1,000 and >1,000 training object instances, respectively. AP for object detection is reported as APbb.
4.1 Implementation Details
We implemented our architectures and the baselines (e.g., Mask^X R-CNN) on the Mask R-CNN code base maskrcnn_benchmark (https://github.com/facebookresearch/maskrcnn-benchmark). For Section 3.2, we implemented as follows: 1) mask weights were generated by a class-agnostic MLP mask branch together with the weights transferred from the classifiers of the box head, following Hu et al.; 2) cosine normalization was applied to both the feature vectors and the classifier weights to obtain the classification logits. Note that the ReLU non-linearity in the final layer was removed to allow the feature vectors to take both positive and negative values.
We initialized the scaling factor of the cosine similarity to 10. All models were initialized from the released model pre-trained on COCO, and trained using SGD with 1e-4 weight decay and 0.9 momentum. Each minibatch had 8 training images, and the images were resized so that the shorter edge is 800 pixels. No augmentation other than horizontal flipping was used. Models were evaluated on the 5k val images. Following Gupta et al., we increased the number of detections per image to the top 300 (vs. top 100 for COCO) and reduced the minimum score threshold from the default 0.05 to 0.0.
For Section 3, in Stage 1, we chose 270 base classes, where each of these top classes has 400+ instances. 512 RoIs were selected per image at a fixed positive-negative ratio. For training the top 270 classes, the learning rate was set to 0.01 and decayed to 0.001 and 0.0001 after 6 and 8 epochs (10 epochs in total). In Stage 2, we split the remaining classes into 6 groups. For each incremental phase, we sampled only 100 proposals per image, as the number of valid annotations per image shrinks when adopting our balanced replay strategy. Recalling the freezing operation in Section 3.2, we froze the top 3 layers of ResNet in the backbone in each incremental learning phase. The learning rate started from 0.002 and was divided by 10 after 6 epochs (10 epochs in total). More experiments on the choice of the number of base classes and the number of phases are presented in Section 4.3.
4.2 Results and Analyses on LVIS
Results. As shown in Table 1, our method, evaluated at the last phase (i.e., on the whole dataset), outperforms the baselines on the tail classes (AP1 and AP<5) by a large margin. The overall AP for both object detection and instance segmentation also improves. In particular, as shown in Figure 5, we randomly sampled 60 classes from the tail, each with fewer than 100 instances in the training set, and reported the results with and without our class-incremental LST. We observe that our approach obtains remarkable improvements on most tail categories. We also compared our method with other re-sampling methods proposed to tackle imbalanced data, where repeat-factor sampling essentially up-samples the images containing annotations from tail classes, and class-aware sampling is an alternative oversampling method. The results show that our method surpasses all the image-level re-sampling approaches on the tail classes, bringing an improvement in overall AP as well. In Figure 6, we visualize the predicted coefficient vectors of our weight generator for samples in the last phase. The coefficient vectors of visually or semantically similar classes tend to be close, which shows the weight generator's effectiveness in relating the learning processes of data-rich and data-poor classes. Due to limited resources, all the above models were implemented with ResNet-50-FPN. We further report the result of applying our method to ResNeXt-101-32x8d-FPN in Table 2 (270 base classes, 3 phases), which also shows significant improvement. With more powerful computing resources available, we would like to follow the settings of Tan et al.'s work to further improve our performance. We believe that our findings hold regardless of visual backbones and data augmentation tricks.
Analyses. Oksuz et al. pointed out that the imbalance among foreground categories, owing to the dataset itself, undermines the performance of popular recognition models. The results of our baseline models in Table 1 validate this observation, showing that recognition of rare categories performs much worse than of frequent ones (0.0% vs. 28.3%) on LVIS. By re-balancing the dataset, previous re-sampling works such as Gupta et al. or Shen et al. somewhat improve the performance on the tail classes. However, we show that they are less effective than our LST, because they struggle in the trade-off between tail over-fitting and head under-fitting. Furthermore, recalling Figure 3, our method is more suitable for instance-based tasks, as we essentially tackle the overall imbalance over instances. Moreover, for Gupta et al.'s work, the threshold guiding the re-sampling of the whole dataset is sensitive to the data distribution and thus needs careful tuning; as a result, the method is not flexible when new observations are added to the current dataset, expanding the tail. In contrast, the experiments in Section 4.3 show that our method is robust to the distribution inside each incremental phase, revealing the potential of our work to be applied to open classes with even rarer data.
4.3 Ablation Study
Choice of the base size and the phase size. The influence of different numbers of base classes and phases is shown in Table 3. We empirically show that, on the one hand, the final performance is sensitive to the choice of base size, as training on a more imbalanced base dataset (e.g., 590 base classes) undermines the reliability of the base model and further influences the following phases. On the other hand, the results are relatively robust to the size of each incremental phase, as balanced replay can always provide a relatively balanced dataset when the phase size lies in a moderate range.
Knowledge distillation. We split the remaining 960 classes into 6 phases and examined the influence of knowledge distillation in each phase by comparing performance on new, old, and new&old classes, respectively. As shown in Figure 7, models trained without distilling the classification logits between two adjacent phases perform consistently worse on new&old classes than the model using distillation. In the first few phases, the performance on new classes without distillation is higher: when the new-class data is abundant, "forgetting" all the old classes trivially benefits the new classes. But when the number of instances per category becomes fewer and fewer, distillation becomes more important for both new and old classes. The final instance segmentation AP on the whole dataset with and without knowledge distillation is 22.8% vs. 21.6%, demonstrating the effectiveness of distillation.
Balanced replay. Figure 8 shows the effect of our Balanced Replay (BR) compared to the baseline that uses all the data from old&new classes in each phase. It is worth noting that although more data is used for training, the severe imbalance leads to gradually worse performance than our method. Besides, our method needs far less storage and fewer training iterations to converge.
Meta weight generator. We examined the performance of our system with and without the meta weight generator. As shown in Table 1, both offer a significant boost on few-shot recognition, while the meta-module based method does better on the extreme few-shot classes (i.e., AP1 and AP<5). More specifically, we evaluated the models at each phase for all classes and report the performance on new classes (Figure 9). Among the two, the meta-module based solution exhibits better few-shot recognition behavior, especially for the <5-shot classes in the last phase (5.3% vs. 8.0%), without affecting recognition performance over all classes. However, compared to conventional training, the episodic training for the meta-module is memory-inefficient: in our implementation, 160 is the maximum phase size for the network armed with MWG, so we only report results using 6 incremental phases. We would like to explore a better combination of meta-learning and fine-tuning in future work.
5 Conclusion

We addressed the problem of large-scale long-tailed instance segmentation by formulating a novel paradigm: class-incremental few-shot learning, where any large dataset can be divided into groups and incrementally learned from the head to the tail. This paradigm introduces two challenges that grow over time: 1) the old classes, which must be preserved against catastrophic forgetting, become more and more imbalanced, and 2) the new classes become more and more few-shot. To this end, we developed the Learning to Segment the Tail (LST) method, equipped with a novel instance-level balanced replay technique and a meta weight generator for few-shot class adaptation. Experimental results on the LVIS dataset demonstrate that LST gains a significant improvement for the tail classes and achieves an overall boost for the whole 1,230 classes. LST offers a novel and practical solution for learning from large-scale long-tailed data: we accept a single downside, head-class forgetting, to trade off the two challenges of large vocabulary and few-shot learning.
Acknowledgements. We thank all the reviewers for their constructive comments. This work was supported by Alibaba-NTU JRI, and partly supported by Major Scientific Research Project of Zhejiang Lab (No. 2019DB0ZX01).
- (2018) End-to-End Incremental Learning. In ECCV.
- (2002) SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research.
- (2019) Hybrid Task Cascade for Instance Segmentation. In CVPR.
- (2018) MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features. In CVPR.
- (2019) Class-Balanced Loss Based on Effective Number of Samples. In CVPR.
- (2017) Class Rectification Hard Mining for Imbalanced Deep Learning. In ICCV.
- (2018) Dynamic Few-Shot Visual Learning Without Forgetting. In CVPR.
- (2015) Fast R-CNN. In ICCV.
- (2019) LVIS: A Dataset for Large Vocabulary Instance Segmentation. In CVPR.
- (2008) ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In IEEE International Joint Conference on Neural Networks.
- (2008) Learning from Imbalanced Data. IEEE Transactions on Knowledge & Data Engineering.
- (2017) Mask R-CNN. In ICCV.
- (2016) Deep Residual Learning for Image Recognition. In CVPR.
- (2014) Distilling the Knowledge in a Neural Network. In NeurIPS.
- (2019) Learning a Unified Classifier Incrementally via Rebalancing. In CVPR.
- (2018) Learning to Segment Every Thing. In CVPR.
- (2016) Learning Deep Representation for Imbalanced Classification. In CVPR.
- (2019) Few-Shot Object Detection via Feature Reweighting. In ICCV.
- (2019) Few-Shot Learning With Global Class Representations. In ICCV.
- (2017) Fully Convolutional Instance-Aware Semantic Segmentation. In CVPR.
- (2016) Learning Without Forgetting. In ECCV.
- (2014) Microsoft COCO: Common Objects in Context. In ECCV.
- (2018) Path Aggregation Network for Instance Segmentation. In CVPR.
- (2020) Mnemonics Training: Multi-Class Incremental Learning Without Forgetting. In CVPR.
- (2008) Visualizing Data Using t-SNE. In JMLR.
- (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation 24, pp. 109–165.
- (2019) Imbalance Problems in Object Detection: A Review. arXiv:1909.00169.
- (1998) Applications and Explanations of Zipf's Law. Association for Computational Linguistics, pp. 151–160.
- (2018) Low-Shot Learning With Imprinted Weights. In CVPR.
- (2017) iCaRL: Incremental Classifier and Representation Learning. In CVPR.
- (2019) Incremental Few-Shot Learning with Attention Attractor Networks. In NeurIPS.
- (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS.
- (2008) LabelMe: A Database and Web-Based Tool for Image Annotation. In IJCV.
- (2016) . In ECCV.
- (2017) Incremental Learning of Object Detectors Without Catastrophic Forgetting. In ICCV.
- (2007) Measuring and Predicting Importance of Objects in Our Visual World. Technical Report CNS-TR-2007-002.
- (2018) Active Lifelong Learning With "Watchdog". In AAAI.
- (2019) Meta-Transfer Learning for Few-Shot Learning. In CVPR.
- (2020) Equalization Loss for Long-Tailed Object Recognition. arXiv:2003.05176.
- (2000) A Comparative Study of Cost-Sensitive Boosting Algorithms. In ICML.
- (2016) Matching Networks for One Shot Learning. In NeurIPS.
- (2017) Learning to Model the Tail. In NeurIPS.
- (2019) Meta-Learning to Detect Rare Objects. In ICCV.
- (2010) SUN Database: Large-Scale Scene Recognition from Abbey to Zoo. In CVPR.
- (2017) Aggregated Residual Transformations for Deep Neural Networks. In CVPR.
- (2019) Learning Classifier Synthesis for Generalized Few-Shot Learning. arXiv:1906.02944.
- (2006) On Multi-Class Cost-Sensitive Learning. In AAAI.
- (2013) The Psycho-Biology of Language: An Introduction to Dynamic Philology. Routledge.