1.1 Seesaw Loss
Existing object detectors struggle on long-tailed datasets, exhibiting unsatisfactory performance on rare classes. We observe that the detector’s classifier tends to predict higher confidence for frequent classes and lower scores for rare classes. Note that a training sample for a positive class is also a negative sample for all other classes in a multi-class classifier. The overwhelming number of samples from frequent classes thus leads to models whose rare-class confidences are severely suppressed.
To tackle this problem, we propose Seesaw Loss for long-tailed instance segmentation. Seesaw Loss dynamically re-balances the penalty to each category during training, according to the relative ratio of cumulative training instances between different categories. Seesaw Loss has three properties. 1) Seesaw Loss is dynamic to the relative ratio between categories. It dynamically modifies the penalty according to the relative ratio of instance numbers between each category pair, rather than splitting categories into fixed groups [12, 19]. 2) Seesaw Loss is smooth and makes no hard distinction between frequent and rare classes. It smoothly adjusts the punishment on rare classes when the training instances are positive samples of other, relatively frequent classes. 3) Seesaw Loss is self-calibrated, so it can be applied in a distribution-agnostic manner. It directly learns to balance the penalty to each category during training, without relying on known dataset distributions [1, 5, 15, 19] or a specific data sampler [7, 9].
Seesaw Loss. Seesaw Loss can be derived from the cross-entropy loss, whose general formulation can be written as
$$L(\mathbf{z}) = -\sum_{i=1}^{C} y_i \log(\sigma_i),$$
where $\mathbf{z}$ is the activation of the classifier, $\sigma_i = e^{z_i} / \sum_{j=1}^{C} e^{z_j}$ for cross-entropy loss, and $y_i \in \{0, 1\}$ is the label.
Seesaw Loss accumulates the number of training samples for each category during each training iteration. Given an instance with positive label $i$, for any other category $j$, Seesaw Loss dynamically adjusts the penalty for the negative label $j$ by the relative ratio of accumulated training samples $N_i$ and $N_j$ as
$$\sigma_i = \frac{e^{z_i}}{\sum_{j \neq i} S_{ij}\, e^{z_j} + e^{z_i}}, \qquad S_{ij} = \begin{cases} \left(N_j / N_i\right)^{p}, & N_j < N_i, \\ 1, & \text{otherwise}. \end{cases}$$
When category $i$ is more frequent than category $j$, Seesaw Loss reduces the penalty on category $j$ for samples of category $i$ by a factor of $(N_j / N_i)^{p}$, like a seesaw. The exponent $p$ adjusts the scale of the mitigation and is fixed in experiments. If category $i$ is far more frequent than category $j$, the punishment will be significantly alleviated to protect the rare category $j$. Otherwise, Seesaw Loss keeps the full penalty on negative classes to reduce misclassification.
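The per-sample computation above can be sketched in a few lines of NumPy. This is an illustration, not the official implementation: the function name `seesaw_ce`, the `counts` array of cumulative per-class instance numbers, and the default exponent value are our assumptions.

```python
import numpy as np

def seesaw_ce(logits, label, counts, p=0.8):
    """Cross-entropy with Seesaw mitigation on negative classes (sketch).

    logits: classifier activations z for one sample, shape (C,).
    label:  index i of the positive class.
    counts: cumulative number of training instances per class, shape (C,).
    p:      exponent controlling the scale of the mitigation (illustrative value).
    """
    n_i = max(counts[label], 1)
    # S_ij = (N_j / N_i) ** p when class j is rarer than the positive class i,
    # otherwise 1, so penalties on frequent negatives are kept intact.
    s = np.minimum(counts / n_i, 1.0) ** p
    s[label] = 1.0
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    sigma = exp[label] / np.dot(s, exp)   # mitigated softmax probability
    return -np.log(sigma)
```

With equal class counts every $S_{ij}$ is 1 and this reduces to the standard softmax cross-entropy; when the positive class is far more frequent than a negative class, that class's term in the denominator shrinks and its penalty is alleviated.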
Classifier Design. Different from traditional detectors, which predict the classification activation as $z = \mathcal{W}^\top x$, we adopt a normalized linear layer as
$$z_i = \tau \cdot \frac{\mathcal{W}_i^\top x}{\lVert \mathcal{W}_i \rVert \, \lVert x \rVert},$$
where $\tau$ is a temperature factor, set to 20 in experiments. The normalized linear layer reduces the scale variance of features and weights across categories, thus improving the performance of tail classes. Different from $\tau$-norm [11], which only normalizes the weights at test time, our normalization is applied to both weights and features during training and testing. The combination of the normalized linear layer and softmax shares a similar form with cosine softmax [15, 20].
To further mitigate the extreme imbalance between the background category and the large-vocabulary foreground categories, we adopt an objectness branch to predict objectness scores. This branch also adopts a normalized linear layer and is trained by a cross-entropy loss.
During inference, both the classification scores of the various categories and the objectness score are activated with a softmax function. The final detection score for category $i$ of a bounding box is
$$s_i = \sigma_i \cdot \sigma^{obj},$$
where $\sigma_i$ is the classification probability of category $i$ and $\sigma^{obj}$ is the foreground probability predicted by the objectness branch.
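A minimal NumPy sketch of the normalized classifier and the score combination at inference. The function names and the assumption that index 1 of the objectness branch holds the foreground logit are ours, for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def normalized_logits(x, W, tau=20.0):
    """Normalized linear layer: cosine similarity between the L2-normalized
    RoI feature x and each L2-normalized class weight row of W, scaled by tau."""
    x_n = x / np.linalg.norm(x)
    W_n = W / np.linalg.norm(W, axis=1, keepdims=True)
    return tau * (W_n @ x_n)

def detection_score(cls_logits, obj_logits):
    """Final per-category score: classification softmax times the softmax
    foreground probability from the objectness branch (index 1 assumed)."""
    return softmax(cls_logits) * softmax(obj_logits)[1]
```

Because the logits are scaled cosine similarities, they are bounded by $\pm\tau$, which limits the scale gap between head and tail classes.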
1.2 HTC-Lite
We propose HTC-Lite, a light-weight version of Hybrid Task Cascade (HTC) [3], to accelerate training and inference while maintaining comparable performance. As shown in Figure 1, the modifications are twofold: replacing the semantic segmentation branch with a global context encoding branch, and reducing the number of mask heads.
Context Encoding Branch. Since semantic segmentation annotations are unavailable for the LVIS dataset, we replace the semantic segmentation branch with a global context encoder [22] trained by a semantic encoding loss. The context encoder applies convolution layers and global average pooling to obtain a global vector of the image for multi-label prediction. This vector is also added to the RoI features used by the box heads and mask heads.
Reduced Mask Heads. To further reduce the cost of instance segmentation, HTC-Lite only keeps one mask head in the last stage, which also spares the original interleaved information passing.
Experimental Setting. We perform experiments on the LVIS v1 benchmark [7]. We use the train split for training and report performance on the val split for ablation studies. No external data or annotations are adopted, except the standard ImageNet-1k [17] classification dataset for pre-training the backbone. We adopt MMDetection [4] as the codebase. Model ensemble is not adopted in our challenge entry.
|Sampler||Loss||Bbox AP||Mask AP||$AP_r$||$AP_c$||$AP_f$|
|Random||EQL||19.3 (+2.4)||18.4 (+2.4)||1.8 (+1.8)||17.1 (+4.8)||27.1|
|Random||Seesaw||24.3 (+7.4)||23.3 (+7.3)||13.0 (+13.0)||22.9 (+10.6)||28.2|
2.1 Ablation Study of Seesaw Loss
We verify the effectiveness of Seesaw Loss on Mask R-CNN [8] with a ResNet-50-FPN [13] backbone, trained with multi-scale training and a random sampler under the 1x training schedule. We also compare Seesaw Loss with Equalization Loss (EQL) [19], the winning method of the LVIS Challenge 2019, to show the advantages of Seesaw Loss. As shown in Table 1, Seesaw Loss significantly improves the baseline performance and surpasses EQL, especially on rare and common classes. The remarkable improvements on $AP_r$ and $AP_c$ validate the effectiveness of Seesaw Loss for long-tailed instance segmentation.
2.2 Step by Step Results
|Modification||Schedule||Bbox AP||Mask AP||$AP_r$||$AP_c$||$AP_f$|
|+ SyncBN||2x||20.2 (+0.1)||18.9 (+0.2)||0.7||16.0||30.3|
|+ CARAFE Upsample||2x||20.4 (+0.2)||19.4 (+0.5)||0.7||16.5||30.9|
|+ HTC-Lite||2x||23.6 (+3.2)||21.9 (+2.5)||1.1||19.8||33.5|
|+ TSD||2x||25.5 (+1.9)||23.5 (+1.6)||2.3||22.3||34.0|
|+ Mask scoring||2x||25.6 (+0.1)||23.9 (+0.4)||2.8||22.4||35.0|
|+ Training-time augmentation||45e||28.1 (+2.5)||26.5 (+2.6)||3.6||25.7||37.4|
|+ Better neck||45e||29.1 (+1.0)||27.0 (+0.5)||3.5||25.8||38.6|
|+ Better backbone||45e||32.1 (+3.0)||29.9 (+2.9)||4.2||29.4||41.8|
|+ Seesaw Loss||45e||39.8 (+7.7)||36.8 (+6.9)||25.5||35.6||42.9|
|+ Finetuning||1x||40.6 (+0.8)||37.3 (+0.5)||26.4||36.3||43.1|
|+ Test-time augmentation||-||41.5 (+0.9)||38.8 (+1.5)||26.4||38.3||44.9|
CARAFE Upsample. CARAFE [21] is used for upsampling in the mask head.
HTC-Lite. We use HTC-Lite as described in Section 1.2.
TSD. TSD [18] is used to replace the box heads in all three stages of HTC-Lite.
Mask Scoring. We further use the mask IoU head [10] to improve mask results.
Training Time Augmentation. We train the model with stronger augmentations for 45 epochs. The learning rate is decreased by 0.1 at 30 and 40 epochs. We randomly resize the image with its longer edge in the range of 768 to 1792 pixels, and then randomly crop the image to a fixed size after adopting InstaBoost augmentation [6].
Better Neck. We replace the neck architecture with an enhanced version of Feature Pyramid Grids (FPG) [2]. The enhanced FPG uses deformable convolution v2 (DCNv2) [24] after feature upsampling, and a downsampling counterpart of CARAFE [21] for feature downsampling.
Seesaw Loss. We apply the proposed Seesaw Loss to the classification branches of the TSD box heads in all cascading stages. Furthermore, we remove the original progressive constraint (PC) loss on the classification branches in TSD.
Finetuning with Repeat Factor Sampling. After obtaining the model with Seesaw Loss trained by a random sampler, we freeze all components of the model. We then finetune a new classification branch for each cascading stage on the frozen model using the repeat factor sampler [7] with a 1x schedule. During inference, the scores of the original classification branches and those of the finetuned branches are averaged to obtain the final scores.
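The inference-time fusion described above amounts to averaging two softmax distributions. A minimal sketch with illustrative names:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fused_scores(orig_logits, finetuned_logits):
    """Average the softmax scores of the original classification branch
    (trained with a random sampler) and the finetuned branch (trained
    with repeat factor sampling)."""
    return 0.5 * (softmax(orig_logits) + softmax(finetuned_logits))
```

Since both inputs are valid probability distributions, the fused scores still sum to one per box.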
Test Time Augmentation. We adopt multi-scale testing with horizontal flipping. Specifically, the image scales are 1200, 1400, 1600, 1800, and 2000 pixels.
Final Performance on Test-dev. After adding the abovementioned components step by step, we finally achieve 38.8% AP on the val split and 38.92% AP on the test-dev split.
-  Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. In: Advances in Neural Information Processing Systems (2019)
-  Chen, K., Cao, Y., Loy, C.C., Lin, D., Feichtenhofer, C.: Feature pyramid grids (2020)
-  Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: Hybrid task cascade for instance segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
-  Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
-  Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
-  Fang, H.S., Sun, J., Wang, R., Gou, M., Li, Y.L., Lu, C.: Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 682–691 (2019)
-  Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: CVPR (2019)
-  He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Hu, X., Jiang, Y., Tang, K., Chen, J., Miao, C., Zhang, H.: Learning to segment the tail. In: CVPR (2020)
-  Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X.: Mask scoring r-cnn. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
-  Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., Kalantidis, Y.: Decoupling representation and classifier for long-tailed recognition. In: ICLR (2020)
-  Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., Feng, J.: Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10991–11000 (2020)
-  Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: CVPR (2017)
-  Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: CVPR (2018)
-  Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., Yu, S.X.: Large-scale long-tailed recognition in an open world. In: CVPR (2019)
-  Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., Sun, J.: MegDet: A large mini-batch object detector. CVPR (2018)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
-  Song, G., Liu, Y., Wang, X.: Revisiting the sibling head in object detector. CVPR (2020)
-  Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., Yan, J.: Equalization loss for long-tailed object recognition. In: CVPR (2020)
-  Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. CVPR (2018)
-  Wang, J., Chen, K., Xu, R., Liu, Z., Loy, C.C., Lin, D.: Carafe: Content-aware reassembly of features. In: The IEEE International Conference on Computer Vision (ICCV) (October 2019)
-  Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A.: Context encoding for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
-  Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Muller, J., Manmatha, R., Li, M., Smola, A.: Resnest: Split-attention networks. arXiv preprint arXiv:2004.08955 (2020)
-  Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. In: CVPR (2019)