The Devil is in Classification: A Simple Framework for Long-tail Instance Segmentation

07/23/2020 ∙ by Tao Wang, et al. ∙ National University of Singapore ∙ Institute of Computing Technology, Chinese Academy of Sciences ∙ Salesforce

Most existing object instance detection and segmentation models only work well on fairly balanced benchmarks where per-category training sample numbers are comparable, such as COCO. They tend to suffer performance drops on realistic datasets that are usually long-tailed. This work aims to study and address such open challenges. Specifically, we systematically investigate the performance drop of the state-of-the-art two-stage instance segmentation model, Mask R-CNN, on the recent long-tail LVIS dataset, and unveil that a major cause is the inaccurate classification of object proposals. Based on this observation, we first consider various techniques for improving long-tail classification performance, which indeed enhance instance segmentation results. We then propose a simple calibration framework to more effectively alleviate the classification head bias with a bi-level class balanced sampling approach. Without bells and whistles, it significantly boosts the performance of instance segmentation for tail classes on the recent LVIS dataset and our sampled COCO-LT dataset. Our analysis provides useful insights for solving long-tail instance detection and segmentation problems, and the straightforward SimCal method can serve as a simple but strong baseline. With this method we won the 2019 LVIS challenge. Codes and models are available at <>.







1 Introduction

Object detection and instance segmentation aim to localize and segment individual object instances from an input image. The widely adopted solutions to such tasks are built on region-based two-stage frameworks, e.g., Faster R-CNN [ren2015faster] and Mask R-CNN [he2017mask]. Though these models have demonstrated remarkable performance on several class-balanced benchmarks, such as Pascal VOC [everingham2010pascal], COCO [lin2014microsoft] and OpenImages [openimage], they are seldom evaluated on datasets with the long-tail distribution that is common in realistic scenarios [reed2001pareto] and dataset creation [everingham2010pascal], [krishna2017visual], [lin2014microsoft]. Recently, Gupta et al. [gupta2019] introduced the LVIS dataset for large vocabulary long-tail instance segmentation model development and evaluation. They observe that the long-tail distribution can lead to a severe performance drop of the state-of-the-art instance segmentation model [gupta2019]. However, the reason for this performance drop has remained unclear.

In this work, we carefully study why existing models are challenged by long-tailed distributions and develop solutions accordingly. Through extensive analysis of Mask R-CNN in Sec. 3, we show that one major cause of the performance drop is inaccurate classification of object proposals, which we refer to as the bias of the classification head. Fig. 1 shows a qualitative example. Due to the long-tail distribution, under standard training schemes, object instances from the tail classes are exposed to the classifier much less frequently than those from head classes (we use head classes and many-shot classes interchangeably), leading to poor classification performance on tail classes.

To improve proposal classification, we first consider incorporating several common strategies developed for long-tail classification into current instance segmentation frameworks, including loss re-weighting [huang2016learning], [tang2008svms], adaptive loss adjustment (focal loss [lin2017focal], class-aware margin loss [cao2019learning]), and data re-sampling [he2009learning], [shen2016relay]. We find such strategies indeed improve long-tail instance segmentation performance, but their improvement on tail classes is limited and comes with a trade-off: performance on head classes is largely sacrificed. After a thorough analysis of the above strategies, we thus propose a simple and efficient framework. Our method, termed SimCal, aims to correct the bias in the classification head with a decoupled learning scheme. Specifically, after normal training of an instance segmentation model, it first collects class-balanced proposal samples with a new bi-level sampling scheme that combines image-level and instance-level sampling, and then uses these collected proposals to calibrate the classification head, improving performance on tail classes. SimCal also incorporates a simple dual head inference component that effectively mitigates the performance drop on head classes after calibration.

Based on our preliminary findings, extensive experiments are conducted on LVIS [gupta2019] dataset to verify the effectiveness of our methods. We also validate the proposed method with SOTA multi-stage instance segmentation model HTC [chen2019hybrid] and our sampled long-tail version of COCO dataset (COCO-LT). From our systematic study, we make the following intriguing observations:

  • Classification is the primary obstacle preventing state-of-the-art region-based object instance detection and segmentation models from working well on long-tail data distribution. There is still a large room for improvement along this direction.

  • By simply calibrating the classification head of a trained model with a bi-level class balanced sampling in the decoupled learning scheme, the performance for tail classes can be effectively improved.

Figure 1: (a) Examples of object proposal and instance segmentation results from ResNet50-FPN Mask R-CNN trained on the long-tail LVIS dataset. The RPN can generate high-quality object proposals (yellow bounding boxes with high confidence scores) even under long-tail distribution, e.g., cargo ship (7 training instances) and vulture (4 training instances). However, they are missed in the final detection and segmentation outputs (green bounding boxes and masks) due to poor proposal classification performance. Other proposal candidates and detection results are omitted from the images for clarity. (b) Comparison of proposal recall (COCO style Average Recall) and AP between the COCO and LVIS datasets with the Mask R-CNN model. (c) Pilot experiment results on Mask R-CNN with class-agnostic and class-wise box and mask heads on the ResNet50-FPN backbone, evaluated on the LVIS v0.5 val set. mrcnn-ag* denotes standard inference with a 0.05 confidence threshold as in the optimal settings for COCO, while mrcnn-ag means inference with threshold 0.0. Note that for all later experiments we use the 0.0 threshold. AP^b denotes box AP. props-gt means testing with ground truth labels of the proposals.

2 Related Works

2.0.1 Object Detection and Segmentation

Following the success of the R-CNN [girshick2014rich], Fast R-CNN [girshick2015fast] and Faster R-CNN [ren2015faster] architectures, the two-stage pipeline has become prevailing for object detection. Based on Faster R-CNN, He et al. [he2017mask] propose Mask R-CNN, which extends the framework to instance segmentation with a mask prediction head that predicts region-based mask segments. Many later works improve the two-stage framework for object detection and instance segmentation. For example, [huang2019mask], [jiang2018acquisition] add an IoU prediction branch to improve confidence scoring for instance segmentation and object detection respectively. Feature augmentation and various training techniques are thoroughly examined by [liu2018path]. Recently, [cai2018cascade] and [chen2019hybrid] further extend proposal-based object detection and instance segmentation to multiple stages and achieve state-of-the-art performance. In this work, we study how to improve proposal-based instance segmentation models under long-tail distribution.

2.0.2 Long-tailed Recognition

Recognition on long-tail distributions is an important research topic, as imbalanced data form a common obstacle in real-world applications. Two major approaches to tackling long-tail problems are sampling [buda2018systematic], [byrd2019effect], [he2009learning], [shen2016relay] and loss re-weighting [cui2019class], [huang2016learning], [huang2019deep]. Sampling methods over-sample minority classes or under-sample majority classes to achieve data balance to some degree. Loss re-weighting assigns different weights to different classes or training instances adaptively, e.g., by inverse class frequency. Recently, [cui2019class] proposes to re-weight the loss by the inverse effective number of samples. [cao2019learning] explores class-aware margins for classifier loss calculation. In addition, [kang2019decoupling] examines the relation between feature and classifier learning in an imbalanced setting. [wang2017learning] develops a meta learning framework that transfers knowledge from many-shot to few-shot classes. Existing works mainly focus on classification, while the crucial tasks of long-tail object detection and segmentation remain largely unexplored.

3 Analysis: Performance Drop on Long-tail Distribution

We investigate the performance decline phenomenon of popular two-stage frameworks for long-tail instance detection and segmentation.

Our analysis is based on experiments on the LVIS v0.5 train and validation sets. The LVIS dataset [gupta2019] is divided into 3 sets: rare, common, and frequent, among which rare and common contain tail classes and frequent includes head classes. We report AP on each set, denoted as AP_r, AP_c, AP_f. For simplicity, we train a baseline Mask R-CNN with a ResNet50-FPN backbone and class-agnostic box and mask prediction heads. As shown in Fig. 1 (c), our baseline model (denoted as mrcnn-ag*) performs poorly, especially on tail categories (the rare set), with 0 box and mask AP_r.

Usually, the confidence threshold is set to a small positive value (e.g., 0.05 for COCO) to filter out low-quality detections. Since LVIS contains 1,230 categories, the softmax activation gives much lower average confidence scores, so we lower the threshold here. However, even after lowering the threshold to 0 (mrcnn-ag), the performance remains very low for tail classes, and the improvement on rare is much smaller than that on common (2.7 vs 6.1 for segmentation AP, 2.8 vs 5.9 for bbox AP). This reveals that the Mask R-CNN model trained with the normal setting is heavily biased to the head classes.

We then calculate proposal recall of mrcnn-ag model and compare with the one trained on COCO dataset with the same setting. As shown in Fig. 1 (b), the same baseline model trained on LVIS only has a drop of 8.8% (55.9 to 51.0) in proposal recall compared with that on COCO, but notably, has a 45.1% (32.8 to 18.0) drop in overall mask AP. Since the box and mask heads are class agnostic, we hypothesize that the performance drop is mainly caused by the degradation of proposal classification.

To verify this, for the proposals generated by RPN [ren2015faster], we assign their ground truth class labels to the second stage as its classification results. Then we evaluate the AP. As shown in Fig. 1 (c), the mask AP for tail classes is increased by a large margin, especially on rare and common sets. Such findings also hold for the box AP. Surprisingly, with normal class-wise box and mask heads (standard version of Mask R-CNN), performance on tail classes is also boosted significantly. This suggests the box and mask head learning are less sensitive to long-tail training data than classification.

The above observations indicate that the low performance of the model over tail classes is mainly caused by poor proposal classification on them. We refer to this issue as classification head bias. Addressing the bias is expected to effectively improve object detection and instance segmentation results.

4 Solutions: Alleviating Classification Bias

Based on the above analysis and findings, we first consider using several existing strategies of long-tail classification, and then present a new calibration framework to correct the classification bias for better detection and segmentation on long-tail distribution.

4.1 Using Existing Long-tail Classification Approaches

We adapt some popular approaches from image classification to our long-tail instance detection and segmentation problem, as introduced below. We conduct experiments to see how the adapted methods work in Sec. 5.2. Given a sample x with ground truth label y, the model outputs logits z, and p_y denotes the softmax probability prediction on the true label y.
4.1.1 Loss Re-weighting

[cui2019class], [huang2016learning], [khan2017cost], [tang2008svms], [ting2000comparative] This line of works alleviates the bias by applying different weights to different samples or categories, such that tail classes or samples receive higher attention during training, thus improving the classification performance. For LVIS, we consider a simple and effective inverse class frequency re-weighting strategy adopted in [huang2016learning], [wang2017learning]. Concretely, the training samples of each class c are weighted by w_c ∝ (1/n_c)^γ, where n_c is the training instance number of class c and γ is a hyperparameter. To handle noise, the weights are clamped to a bounded range. The weight for the background class is also a hyperparameter. During training, the second stage classification loss of each sample is weighted by w_c.
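The re-weighting scheme above can be sketched as follows; this is a minimal illustration, and the function and hyperparameter names (gamma, w_min, w_max, bg_weight) are ours, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(class_counts, gamma=1.0, w_min=0.1, w_max=10.0,
                              bg_weight=1.0):
    """Per-class weights proportional to (1/n_c)^gamma, clamped to a bounded
    range to handle noisy tail-class estimates. The last index is background."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    w = (1.0 / counts) ** gamma
    w = w / w.mean()              # normalize so the average foreground weight is 1
    w = w.clamp(w_min, w_max)     # clamp to [w_min, w_max]
    return torch.cat([w, torch.tensor([bg_weight])])

def reweighted_cls_loss(logits, labels, weights):
    # Weighted cross entropy on the second-stage proposal classification head.
    return F.cross_entropy(logits, labels, weight=weights)
```

In practice the weight vector is computed once from the training-set statistics and passed to the classification loss at every step.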

4.1.2 Focal Loss

[lin2017focal] Focal loss can be regarded as loss re-weighting that adaptively assigns a weight to each sample based on the prediction. It was originally developed for the foreground-background class imbalance in one-stage detectors, but it is also applicable to alleviating the bias in long-tail problems, since head-class samples tend to get smaller losses due to sufficient training, and the influence of tail-class samples is thus enlarged. Here we use the multi-class extension of focal loss, FL(p_y) = -(1 - p_y)^γ log(p_y).
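A minimal sketch of this multi-class extension (the function name and default gamma are illustrative; gamma=2.0 is the common setting from [lin2017focal]):

```python
import torch
import torch.nn.functional as F

def multiclass_focal_loss(logits, labels, gamma=2.0):
    """FL(p_y) = -(1 - p_y)^gamma * log(p_y), averaged over the batch.
    With gamma=0 this reduces to standard cross entropy."""
    log_p = F.log_softmax(logits, dim=-1)
    log_py = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)  # log prob of true class
    py = log_py.exp()
    return (-(1.0 - py) ** gamma * log_py).mean()
```

Well-classified samples (p_y close to 1) are down-weighted by the (1 - p_y)^gamma factor, which shifts the loss toward hard, under-trained tail samples.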

4.1.3 Class-aware Margin Loss

[cao2019learning] This method assigns a class-dependent margin in the loss calculation. Specifically, a larger margin is assigned to the tail classes, so they are expected to generalize better with limited training samples. We adopt the margin formulation Δ_c = C / n_c^{1/4} [cao2019learning], where n_c is the training instance number for class c as above and C is a constant, and plug the margin into the cross entropy loss.
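A sketch of this class-aware margin loss, assuming the LDAM-style formulation above (the constant C and scale s are hyperparameters; the values here are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def class_margin_loss(logits, labels, class_counts, C=0.5, s=1.0):
    """Subtract a margin delta_c = C / n_c^{1/4} from the true-class logit
    before cross entropy, so rarer classes receive larger margins."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    margins = C / counts.pow(0.25)           # larger margin for smaller n_c
    batch_margin = margins[labels]           # margin of each sample's true class
    adjusted = logits.clone()
    adjusted[torch.arange(len(labels)), labels] -= batch_margin
    return F.cross_entropy(s * adjusted, labels)
```

Since the margin only lowers the true-class logit, the resulting loss is always at least the plain cross entropy, pushing tail-class decision boundaries further from the samples.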

4.1.4 Repeat Sampling

[he2009learning], [shen2016relay]

Repeat sampling directly over-samples data (images) with a class-dependent repeat factor, so that the tail classes are more frequently involved in optimization. Consequently, the training steps for each epoch increase due to the over-sampled instances. However, this type of method is not trivially applicable to detection frameworks, since multiple instances from different classes frequently exist in one image.

[gupta2019] developed a specific sampling strategy for LVIS dataset, calculating a per-image repeat factor based on a per-category repeat threshold and over-sampling each training image according to the repeat factor in each epoch. Note that box and mask learning will also be affected by this method.
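The repeat-factor computation can be sketched as follows, assuming the form proposed for LVIS [gupta2019]: a per-category factor max(1, sqrt(t / f_c)), where f_c is the fraction of training images containing category c and t is the per-category threshold, lifted to images by taking the max over categories present:

```python
import math

def category_repeat_factor(freq, t=0.001):
    """Per-category repeat factor: frequent categories (f_c >= t) get 1,
    rare ones get sqrt(t / f_c) > 1."""
    return max(1.0, math.sqrt(t / freq))

def image_repeat_factor(image_categories, cat_freq, t=0.001):
    """Per-image factor: the max repeat factor over categories in the image,
    so an image is over-sampled as much as its rarest category requires."""
    return max(category_repeat_factor(cat_freq[c], t) for c in image_categories)
```

During dataset construction, each image is then duplicated (stochastically, for fractional factors) according to its repeat factor in every epoch.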

We implement adapted versions of [cao2019learning], [cui2019class], [gupta2019], [lin2018focal] for experiments; see Sec. 5.2 for details. From the results, we find the above approaches indeed bring some performance improvements over the baselines, which however are very limited. Re-weighting methods tend to complicate the optimization of deep models under extreme data imbalance [cui2019class], which is the case for object detection with long-tail distribution, leading to poor performance on head classes. Focal loss well addresses the imbalance between foreground and easy background samples, but has difficulty in tackling the imbalance between foreground object classes with more similarity and correlation. For class-aware margin loss, the prior margin enforced in the loss calculation also complicates the optimization of a deep model, leading to a larger drop of performance on head classes. The repeat sampling strategy suffers from overfitting since it repeatedly samples from tail classes. It also samples more data during training, leading to increased computation cost. In general, the diverse object scales and surrounding context in object instance detection further aggravate the above limitations, making these methods hardly suitable for our detection tasks.

4.2 Proposed SimCal: Calibrating the Classifier

Figure 2: Left: Illustration of proposed bi-level sampling scheme. Refer to Sec. 4.2.1 for more details. Right: Architecture of proposed method. I: training or test image sets; R: random sampling; CBS: class-balanced sampling; C: classification head; B: box regression head; M: mask prediction head; CC: calibrated classification head. Blue modules are in training mode while grey modules indicate frozen. (a) (b) show standard Mask R-CNN training and inference respectively. (c) (d) show proposed calibration and dual head inference respectively

We find in Sec. 3 that significant performance gain on tail classes can be achieved with merely ground truth proposal labels, and as discussed in Sec. 4.1, existing classification approaches are not very suitable for tackling our long-tail instance segmentation task. Here, we propose a new SimCal framework that calibrates the classification head by retraining it with a new bi-level sampling scheme while keeping the other parts frozen after standard training. This approach is very simple and incurs negligible additional computation cost, since only the classification head requires gradient back-propagation. The details are given as follows.

4.2.1 Calibration Training with Bi-level Sampling

As shown in Fig. 2, we propose a bi-level sampling scheme to collect training instances for calibrating the classification head through retraining. To create a batch of training data, first, m object classes (c_1 to c_m) are sampled uniformly from all the classes (each with the same probability). Then, we randomly sample images that contain these categories respectively (I_1 to I_m), and feed them to the model. At the object level, we only collect proposals that belong to the sampled classes and background for training. Above, we only sample 1 image for each sampled class for simplicity, but note that the number of sampled images can also be larger. As shown in Fig. 2 right (a), after standard training, we freeze all the model parts (including backbone, RPN, box and mask heads) except for the classification head, and employ the bi-level sampling to retrain the classification head, which is initialized with the original head. The classification head is then fed with fairly balanced proposal instances, enabling the model to alleviate the bias. Different from conventional fine-tuning conducted on a small scale dataset after pretraining on a large one, our method only changes the data sample distribution. Refer to the supplementary material for more implementation details, including the foreground/background ratio and the matching IoU threshold for proposals. Formally, the classification head is trained with the loss:


L_cal = (1/N) Σ_{c=1}^{m+1} Σ_{i=1}^{n_c} L_ce(p_{c,i}, y_{c,i}),  with N = Σ_{c=1}^{m+1} n_c,

where m is the number of sampled classes per batch, n_c is the number of proposal samples for class c, c = m+1 is for background, L_ce is the cross entropy loss, and p_{c,i} and y_{c,i} denote the model prediction and ground truth label.
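The bi-level batch construction can be sketched as follows; the data structures and names (`class_to_images`, `filter_proposals`) are illustrative, not from the released code:

```python
import random

def sample_calibration_batch(class_to_images, num_classes_per_batch=4):
    """Image-level step: uniformly sample classes, then one image per class.
    `class_to_images` maps class id -> list of training images containing it."""
    classes = random.sample(list(class_to_images), num_classes_per_batch)
    images = [random.choice(class_to_images[c]) for c in classes]
    return classes, images

def filter_proposals(proposal_labels, sampled_classes, background=0):
    """Instance-level step: keep only proposals matched to the sampled classes
    or background, so the classifier sees a class-balanced proposal stream."""
    keep = set(sampled_classes) | {background}
    return [i for i, y in enumerate(proposal_labels) if y in keep]
```

Only the classification head receives gradients from the resulting loss; backbone, RPN, box and mask heads stay frozen.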

4.2.2 Dual Head Inference

After the above calibration, the classification head is balanced over classes and performs better on tail classes. However, the performance on head classes drops. To achieve optimal overall performance, we consider combining the new balanced head, which performs better on tail classes, with the original head, which performs better on head classes. We thus propose a dual head inference architecture.

An effective combining scheme is to simply average the two heads' classification predictions [alpaydin1993multiple], [breiman1996bagging], [krogh1995neural], but we find this is not optimal, since the original head is heavily biased to many-shot classes. Since the detection models adopt class-wise post-processing (i.e., NMS) and the prediction does not need to be normalized, we propose a new combining scheme that directly selects the prediction from the two classifiers for the head and tail classes:


p̃_i = p^cal_i if n_i ≤ T, otherwise p^orig_i,  for i = 1, ..., C,

where i indexes the classes, C is the number of classes, index C+1 stands for background, p^cal and p^orig denote the (C+1)-dimensional predictions of the calibrated and original heads respectively, p̃ is the combined prediction, n_i is the training instance number of class i, and T is the threshold number controlling the boundary of head and tail classes. Other parts of inference remain the same (Fig. 2 (d)). Our dual head inference adds only small overhead compared to the original model.
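The selection scheme can be sketched as follows; taking the background slot from the original head is our assumption for this illustration:

```python
import torch

def combine_dual_heads(p_cal, p_orig, class_counts, T=300):
    """Select calibrated-head scores for tail classes (n_c <= T) and
    original-head scores for head classes. p_cal / p_orig have shape
    (N, C+1); the last slot is background (taken here from the original
    head, an assumption of this sketch)."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    tail = counts <= T                                  # boolean mask over C classes
    combined = torch.where(tail, p_cal[..., :-1], p_orig[..., :-1])
    bg = p_orig[..., -1:]
    return torch.cat([combined, bg], dim=-1)
```

No renormalization is needed because class-wise NMS consumes the scores independently per class.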

4.2.3 Bi-level Sampling vs. Image Level Repeat Sampling

Image level repeat sampling (e.g., [gupta2019]), which is traditionally adopted, balances the long-tail distribution at the image level, while our bi-level sampling alleviates the imbalance at the proposal level. Image level sampling approaches train the whole model directly, while we decouple feature and classification head learning, adjusting only the classification head with bi-level class-centric sampling and keeping the other parts frozen after training under normal image-centric sampling. We also empirically find that the best setting (t=0.001) of IS [gupta2019] additionally samples about 23k training images (56k in total) per epoch, leading to more than a 40% increase of training time. Comparatively, our method incurs less than 5% additional time and costs much less GPU memory, since only a small part of the model needs backpropagation.

5 Experiments

In this section, we first report experiments of using existing classification approaches to solve our long-tail instance segmentation problem. Then we evaluate our proposed solution, i.e., the SimCal framework, analyze its model designs and test its generalizability.

Our experiments are mainly conducted on the LVIS dataset [gupta2019]. Besides, to check the generalizability of our method, we sample a new COCO-LT dataset from COCO [lin2014coco]. We devise a complementary instance-centric category division scheme that helps to more comprehensively analyze model performance. For each experiment, we report the result with the median overall AP over 3 runs.

5.1 Datasets and Metrics

Figure 3: Category distribution of COCO (2017) and sampled COCO-LT datasets. The categories are sorted in descending numbers of training instances

5.1.1 Datasets

1) LVIS [gupta2019]. It is a recent benchmark for large vocabulary long-tail instance segmentation [gupta2019]. The source images are from the COCO dataset, while the annotation follows an iterative object spotting process that captures the long-tail category statistics naturally appearing in images. The currently released version v0.5 contains 1,230 and 830 object classes respectively in its train and validation sets, with the test set unknown. Refer to Fig. 1 (a) for the train set category distribution. The three sets contain about 50k, 5k and 20k images correspondingly. 2) COCO-LT. We sample it from COCO [lin2014coco] by following an exponential distribution on training instance statistics to create a long-tail version. COCO-LT contains 80 classes and about 100k images. Fig. 3 shows the category distribution. Due to space limitations, we defer details of the sampling process to the supplement.

Set           Total | Divided by #image       | Divided by #instance
                    | rare   common  frequent | <10   10-100  100-1000  >1000
Train         1230  | 454    461     315      | 294   453     302       181
Train-on-val  830   | 125    392     313      | 67    298     284       181
Table 1: Different category division schemes, with the LVIS v0.5 dataset [gupta2019]. The left part is the division based on training image number as in [gupta2019]; the right part is the proposed scheme based on training instance number. Train-on-val means categories that appear in the validation set.

5.1.2 Metrics

We adopt AP as the overall evaluation metric. Object categories in LVIS are divided into rare, common, and frequent sets [gupta2019], respectively containing <10, 10-100, and >100 training images. We show in Table 1 the category distribution of the training and validation sets. Besides data splitting based on image number, we devise a complementary instance-centric category division scheme, considering that the number of instances is a widely adopted measurement for detection in terms of benchmark creation and model evaluation [openimage], [everingham2010pascal], [lin2014coco]. In particular, we divide all the categories into four bins (note we use "bin" and "set" interchangeably) based on the number of training instances, with #instances <10, 10-100, 100-1000, and >1000, as shown in Table 1. Accordingly, we calculate AP on each bin as complementary metrics, denoted as AP_1, AP_2, AP_3, and AP_4. Such a division scheme offers a finer dissection of model performance. For example, AP_1 corresponds to the commonly referred few-shot object detection regime [chen2018lstd], [kang2019few], [karlinsky2019repmet]. The rare set (<10 training images) contains categories that have up to 219 training instances ('chickpea'), so AP_r cannot well reflect the model's few-shot learning capability. AP_4 reflects performance on classes with COCO level training data, while most classes in the frequent set (>100 images) have much fewer than 1,000 training instances (e.g., 'fire-alarm': 117). With the two division schemes, we can report AP on both image-centric (AP_r, AP_c, AP_f) and instance-centric (AP_1, AP_2, AP_3, AP_4) bins for LVIS. For COCO-LT, since the per-category training instance number varies in a much larger range, we divide the categories into four bins with <20, 20-400, 400-8000, and >8000 training instances and report performance as AP_1, AP_2, AP_3, AP_4 on these bins. Unless specified, AP is evaluated with COCO style by mask AP.
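The instance-centric binning can be sketched as a small helper; the handling of counts that fall exactly on a bin edge is our assumption:

```python
def instance_bin(n_instances, edges=(10, 100, 1000)):
    """Assign a category to one of len(edges)+1 bins by training-instance
    count. With the LVIS edges (10, 100, 1000):
    bin 1: <10, bin 2: 10-100, bin 3: 100-1000, bin 4: >=1000."""
    for i, edge in enumerate(edges):
        if n_instances < edge:
            return i + 1
    return len(edges) + 1
```

For COCO-LT the same helper would be called with edges (20, 400, 8000), matching the four bins described above.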

5.2 Evaluating Adapted Existing Classification Methods

We apply the adapted methods discussed in Sec. 4.1 to the classification head of Mask R-CNN for long-tail instance segmentation, including [cao2019learning], [cui2019class], [gupta2019], [lin2018focal]. Results are summarized in Table 2. We can see some improvements have been achieved on tail classes, e.g., absolute margins of 6.0, 6.2, 7.7 on AP_1 and 10.1, 8.7, 11.6 on AP_r for loss re-weighting (LR), focal loss (FL) and image level repeat sampling (IS) respectively. However, they also inevitably lead to a drop of performance on head classes, e.g., more than a 2.0 drop for all methods on AP_4 and AP_f. A performance drop on head classes is also observed in imbalanced classification [he2009learning], [tang2008svms]. Overall AP is improved by at most 2.5 in absolute value (i.e., IS). Similar observations hold for box AP.

Method | AP_1  AP_2  AP_3  AP_4  AP_r  AP_c  AP_f  AP   | AP^b_1 AP^b_2 AP^b_3 AP^b_4 AP^b_r AP^b_c AP^b_f AP^b
r50    | 0.0   17.1  23.7  29.6  3.7   20.0  28.4  20.7 | 0.0    15.9   24.6   30.5   3.3    19.0   30.0   20.8
CM     | 2.6   21.0  21.8  26.6  8.4   21.2  25.5  21.0 | 2.8    20.0   22.0   26.6   6.8    20.5   26.4   20.7
LR     | 6.0   23.3  22.0  25.1  13.8  22.4  24.5  21.9 | 6.0    21.2   22.3   25.5   11.3   21.5   24.9   21.4
FL     | 6.2   21.0  22.0  27.0  12.4  20.9  25.9  21.5 | 5.8    20.5   22.7   28.0   10.5   21.0   27.0   21.7
IS     | 7.7   25.6  21.8  27.4  15.3  23.7  25.6  23.2 | 6.7    22.8   22.1   27.4   11.6   22.2   26.7   22.0
Table 2: Results on LVIS by adding common strategies in long-tail classification to Mask R-CNN in training. r50 means Mask R-CNN on the ResNet50-FPN backbone with class-wise box and mask heads (standard version). CM, LR, FL and IS denote the discussed class-aware margin loss, loss re-weighting, focal loss and image level repeat sampling respectively. AP^b denotes box AP. We report the result with the median overall AP over 3 runs.

5.3 Evaluating Proposed SimCal

In this subsection, we report the results of our proposed method applied to Mask R-CNN. We evaluate both class-wise and class-agnostic versions of the model. Here the head/tail boundary T for dual head inference is set to 300.

      Model   cal  dual | AP_1  AP_2  AP_3  AP_4  AP_r  AP_c  AP_f  AP
bbox  r50-ag            | 0.0   12.8  22.8  28.3  2.8   16.4  27.5  18.6
      r50-ag  ✓         | 12.4  23.2  20.6  23.5  17.7  21.2  23.4  21.5
      r50-ag  ✓    ✓    | 12.4  23.4  21.4  27.3  17.7  21.3  26.4  22.7
      r50               | 0.0   15.9  24.6  30.5  3.3   19.0  30.0  20.8
      r50     ✓         | 8.1   21.0  22.4  25.5  13.4  20.6  25.7  21.4
      r50     ✓    ✓    | 8.2   21.3  23.0  29.5  13.7  20.6  28.7  22.6
mask  r50-ag            | 0.0   13.3  21.4  27.0  2.7   16.8  25.6  18.0
      r50-ag  ✓         | 13.2  23.1  20.0  23.0  18.2  21.4  22.2  21.2
      r50-ag  ✓    ✓    | 13.3  23.2  20.7  26.2  18.2  21.5  24.7  22.2
      r50               | 0.0   17.1  23.7  29.6  3.7   20.0  28.4  20.7
      r50     ✓         | 10.2  23.5  21.9  25.3  15.8  22.4  24.6  22.3
      r50     ✓    ✓    | 10.2  23.9  22.5  28.7  16.4  22.5  27.2  23.4
Table 3: Results on LVIS by applying SimCal to Mask R-CNN with ResNet50-FPN. r50-ag and r50 denote models with class-agnostic and class-wise heads (box/mask) respectively. cal and dual mean calibration and dual head inference. Refer to the supplementary file for an analysis of LVIS result mean and std.
Method | AP_1  AP_2  AP_3  AP_4  AP_r  AP_c  AP_f  AP
r50    | 0.0   17.1  23.7  29.6  3.7   20.0  28.4  20.7
CM     | 4.6   21.0  22.3  28.4  10.0  21.1  27.0  21.6
LR     | 6.9   23.0  22.1  28.8  13.4  21.7  26.9  22.5
FL     | 7.1   21.0  22.1  28.4  13.1  21.5  26.5  22.2
IS     | 6.8   23.2  22.5  28.0  14.0  22.0  27.0  22.7
ours   | 10.2  23.9  22.5  28.7  16.4  22.5  27.2  23.4
Table 4: Results for augmenting the discussed long-tail classification methods with the proposed decoupled learning and dual head inference.
Method | AP_1  AP_2  AP_3  AP_4  AP
orig   | 0.0   13.3  21.4  27.0  18.0
cal    | 8.5   20.8  17.6  19.3  18.4
avg    | 8.5   20.9  19.6  24.6  20.3
sel    | 8.6   22.0  19.6  26.6  21.1
Table 5: Comparison between the proposed combining scheme (sel) and averaging (avg).

5.3.1 Calibration Improves Tail Performance

From the results in Table 3, we observe consistent improvements on tail classes for both the class-agnostic and class-wise versions of Mask R-CNN (more than 10 absolute mask and box AP improvement on tail bins). Overall mask and box AP are boosted by a large margin. But we also observe a significant drop of performance on head class bins, e.g., 23.7 to 21.9 on AP_3 and 29.6 to 25.3 on AP_4 for the class-wise version of Mask R-CNN. With calibration, the classification head is effectively balanced.

5.3.2 Dual Head Inference Mitigates Performance Drop on Head Classes

With dual head inference, the model has only a minor performance drop on the head class bins but an enormous boost on the tail class bins. For instance, we observe a 0.8 drop of AP_4 but a 13.3 increase on AP_1 for the r50-ag model. With the proposed combination method, the detection model can effectively gain the advantages of both the calibrated and original classification heads.

5.3.3 Class-wise Prediction is Better for Head Classes While Class-agnostic One is Better for Tail Classes

We observe that AP_1 of r50-ag with cal and dual is 3.1 higher (13.3 vs 10.2) than that of r50, while AP_4 is 2.5 lower (26.2 vs 28.7), which means class-agnostic heads (box/mask) have an advantage on tail classes, while class-wise heads perform better for many-shot classes. This phenomenon suggests that a further improvement can be achieved by using a class-agnostic head for tail classes, so they can benefit from other categories for box and mask prediction, and a class-wise head for many-shot classes, as they have abundant training data to learn class-wise prediction. We leave this for future work.

5.3.4 Comparing with Adapted Existing Methods

For fair comparison, we also consider augmenting the discussed imbalanced classification approaches with the proposed decoupled learning framework. With the same baseline Mask R-CNN trained in the normal setting, we freeze all parts except the classification head, and use these methods to calibrate the head. After that, we apply dual head inference for evaluation. As shown in Table 4, they have similar performance on head classes since dual head inference is used. Nearly all of them improve on tail classes over the results in Table 2 (e.g., 4.6 vs 2.6, 6.9 vs 6.0, and 7.1 vs 6.2 on AP_1 for the CM, LR, and FL methods respectively), indicating the effectiveness of the decoupled learning scheme for recognition of tail classes. Image level repeat sampling (IS) gets worse performance than in Table 2, suggesting box and mask learning also benefit a lot from the sampling. Our method achieves higher performance, i.e., 10.2 and 23.9 for AP_1 and AP_2, which validates the effectiveness of the proposed bi-level sampling scheme.

Figure 4: (a) Model performance as a function of calibration steps. The result is obtained with the r50-ag model (Table 3). (b) Effect of design choice of the calibration head. Baseline: original model result; 2fc_ncm [guerriero2018deepncm]: the deep nearest class mean classifier learned with a 2fc representation; 2fc_rand: 2-layer fully connected head with 1024 hidden units, randomly initialized; 3fc_rand: 3-layer fully connected head with 1024 hidden units, randomly initialized; 3fc_ft: 3fc initialized from the original head. (c) Effect of the head/tail boundary number T (with r50-ag)

5.4 Model Design Analysis of SimCal

5.4.1 Calibration Dynamics

As shown in Fig. 4 (a), as calibration progresses, model performance is progressively balanced over all the class bins: AP increases on the tail bins (i.e., AP1, AP2) and decreases on the head bins (i.e., AP3, AP4). After about 10-20k steps, the AP on all bins and the overall AP converge to steady values.

5.4.2 Design Choice of Calibration Head

While the proposed calibration method calibrates the original head, the calibration training can also be performed on other head choices. As shown in Fig. 4 (b), we have tried different instantiations in place of the original head. Interestingly, with random initialization, the 3-layer fully connected head performs worse than the 2-layer head on AP1 (i.e., 2fc_rand vs 3fc_rand). But when it is initialized from the original 3-layer head, performance is significantly boosted, by 4.1 and 4.3 points on AP1 and AP2 respectively (i.e., 3fc_ft). This phenomenon indicates that training under random sampling helps the classification head learn general features that transfer well when calibrating with balanced sampling. We only compare the choices on the tail class bins, since they perform on par on the head class bins with dual head inference.
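The two initialization schemes can be sketched as follows; a minimal illustration with our own names, representing a head simply as a list of (weight, bias) arrays:

```python
import numpy as np

def make_calibration_head(original_head, init="ft", rng=None):
    """Build a calibration head: '3fc_ft' warm-starts from the trained
    parameters, while '3fc_rand' re-initializes layers of the same shapes.

    original_head: list of (weight, bias) numpy arrays of the trained head.
    """
    if init == "ft":
        # 3fc_ft: copy the trained parameters as the starting point
        return [(w.copy(), b.copy()) for w, b in original_head]
    # 3fc_rand: small random weights, zero biases, identical shapes
    rng = rng or np.random.default_rng(0)
    return [(rng.normal(0.0, 0.01, w.shape), np.zeros_like(b))
            for w, b in original_head]
```

Only the initialization differs; both variants are then trained with the same balanced-sampling calibration.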

5.4.3 Combining Scheme and Head/Tail Boundary for Dual Heads

As shown in Table 5, our combination approach achieves much higher performance than simple averaging; refer to the supplementary material for more alternative combining choices. We also examine the effect of the head/tail boundary, as shown in Fig. 4 (c). For the same model, we vary the boundary threshold instance number from 10 to 1000. The AP remains very close to optimal across a wide range of thresholds, so dual head inference is insensitive to the exact value of the hyperparameter T.
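One plausible reading of this selection-based combination can be sketched as follows (names are ours, and this simplifies the paper's rule to a hard per-class switch; the exact scheme and its alternatives are in the supplementary material):

```python
import numpy as np

def dual_head_combine(scores_orig, scores_cal, instance_counts, T=100):
    """Combine the two classification heads at inference: take the calibrated
    head's scores for tail classes (fewer than T training instances) and the
    original head's scores for many-shot classes.

    scores_orig, scores_cal: (N, C) class scores from the two heads
    instance_counts:         (C,)  training-instance count per class
    """
    tail = np.asarray(instance_counts) < T          # per-class tail mask
    # broadcast the (C,) mask over the N proposals
    return np.where(tail[None, :], scores_cal, scores_orig)
```

Simple averaging would instead mix the two heads on every class, which dilutes the calibrated scores on the tail and the original scores on the head.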

Model            | Val: AP1  AP2  AP3  AP4  APr  APc  APf  AP   | Test: APr  APc  APf  AP
best [gupta2019] |      –    –    –    –    15.6 27.5 31.4 27.1 |       9.8  21.1 30.0 20.5
htc-x101         |      5.6  33.0 33.7 37.0 13.7 34.0 36.6 31.9 |       5.9  25.7 35.3 22.9
IS               |      10.2 32.3 33.2 36.6 17.6 33.0 36.1 31.9 |       –    –    –    –
2fc_rand         |      12.9 32.2 33.5 37.1 18.5 33.3 36.1 32.1 |       10.3 25.3 35.1 23.9
3fc_ft           |      18.8 34.9 33.0 36.7 24.7 33.7 36.4 33.4 |       –    –    –    –
Table 6: Results with Hybrid Task Cascade (HTC) on LVIS, with a ResNeXt101-64x4d-FPN backbone. best denotes the best single model performance reported in [gupta2019]; the remaining rows are our experiment results with HTC. 2fc_rand and 3fc_ft are different design choices of the classification head (Sec. 5.4.2). Among our calibrated models, only 2fc_rand is available on the test set, as the evaluation server is closed
Model  | cal | dual | AP1  AP2  AP3  AP4  AP   | APb1 APb2 APb3 APb4 APb
r50-ag |     |      | 0.0  8.2  24.4 26.0 18.7 | 0.0  9.5  27.5 30.3 21.4
r50-ag | ✓   |      | 15.0 16.2 22.4 24.1 20.6 | 14.5 17.9 24.8 27.6 22.9
r50-ag | ✓   | ✓    | 15.0 16.2 24.3 26.0 21.8 | 14.5 18.0 27.3 30.3 24.6
Table 7: Results on COCO-LT, evaluated on the minival set. AP1, AP2, AP3, AP4 correspond to bins of [1, 20), [20, 400), [400, 8000), and [8000, –) training instances

5.5 Generalizability Test of SimCal

5.5.1 Performance on SOTA Models

We further apply the proposed method to the state-of-the-art multi-stage cascaded instance segmentation model, Hybrid Task Cascade (HTC) [chen2019hybrid], by calibrating the classification heads at all stages. As shown in Table 6, our method brings significant gains on tail classes with only a minor drop on many-shot classes. Notably, the proposed approach achieves much higher gains than the image level repeat sampling method (IS), i.e., 8.5 and 2.5 higher on AP1 and AP2 respectively. We achieve state-of-the-art single model performance on LVIS, 6.3 higher in absolute value than the best single model reported in [gupta2019] (33.4 vs 27.1). A consistent gain is also observed on the test set.

5.5.2 Performance on COCO-LT

As shown in Table 7, a similar trend of performance boost to that on LVIS is observed. On COCO-LT, dual head inference enjoys nearly the full advantages of both the calibrated classifier on tail classes and the original one on many-shot classes. A larger drop in many-shot performance is observed on LVIS, which may be caused by much stronger inter-class competition, as LVIS has a much larger vocabulary.

6 Conclusions

In this work, we carefully investigate the performance drop of two-stage instance segmentation models on long-tail distributed data and reveal that the devil is in proposal classification. Based on this finding, we first adopt several common strategies from long-tail classification to improve the baseline model. We then propose a simple calibration approach, SimCal, for improving the second-stage classifier on tail classes. We demonstrate that SimCal significantly enhances Mask R-CNN and the SOTA multi-stage model HTC. Large room for improvement still exists along this direction, and we hope our pilot experiments and in-depth analysis, together with the simple method, will benefit future research.

6.0.1 Acknowledgement

Jiashi Feng was partially supported by MOE Tier 2 MOE 2017-T2-2-151, NUS_ECRA_FY17_P08, AISG-100E-2019-035.