Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax

06/18/2020 ∙ by Yu Li, et al. ∙ National University of Singapore ∙ Institute of Computing Technology, Chinese Academy of Sciences

Solving long-tail large vocabulary object detection with deep learning based models is a challenging and demanding task, which is however under-explored. In this work, we provide the first systematic analysis of the underperformance of state-of-the-art models when facing long-tail distributions. We find existing detection methods are unable to model few-shot classes when the dataset is extremely skewed, which can result in classifier imbalance in terms of parameter magnitude. Directly adapting long-tail classification models to detection frameworks cannot solve this problem due to the intrinsic difference between detection and classification. We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training. It implicitly modulates the training process for the head and tail classes and ensures they are both sufficiently trained, without requiring any extra sampling for the instances from the tail classes. Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors with various backbones and frameworks on both object detection and instance segmentation. It beats all state-of-the-art methods transferred from long-tail image classification and establishes a new state of the art. Code is available at https://github.com/FishYuLi/BalancedGroupSoftmax.


1 Introduction

Object detection [33, 31, 26, 24, 22, 1] is one of the most fundamental and challenging tasks in computer vision. Recent advances are mainly driven by large-scale datasets that are manually balanced, such as PASCAL VOC [10] and COCO [25]. However, in reality, the distribution of object categories is typically long-tailed [32]. Effective solutions that adapt state-of-the-art detection models to such class-imbalanced distributions are highly desired yet still absent. Recently, the long-tail large vocabulary object recognition dataset LVIS [15] was released, which greatly facilitates object detection research in much more realistic scenarios.

Figure 1: Sorted number of training instances (#ins) for categories in COCO and LVIS training set, and the corresponding classifier weight norm from Faster R-CNN model trained on COCO and LVIS. The x-axis represents the sorted category index of COCO and LVIS. We align 80 classes of COCO with 1230 classes of LVIS for better visualization. Category 0 indicates background.

A straightforward solution to long-tail object detection is to train a well-established detection model (e.g., Faster R-CNN [33]) on the long-tail training data directly. However, a large performance drop is observed when adapting detectors designed for fairly balanced datasets (e.g., COCO) to a long-tail one (e.g., LVIS), and the reasons remain unclear because multiple factors are entangled. Inspired by [21], we decouple the representation and classification modules within the detection framework, and find that the weight norms of the proposal classifier corresponding to different categories are severely imbalanced, since low-shot categories get few chances to be activated. Our analysis shows that this is one direct cause of the poor long-tail detection performance, and that it is intrinsically induced by data imbalance. As shown in Figure 1, we sort the category-wise classifier weight norms of models trained on COCO and LVIS respectively by the number of instances in the training set. For COCO, the relatively balanced data distribution leads to relatively balanced weight norms for all categories, except for the background class (CID=0, CID for Category ID). For LVIS, the category weight norms are clearly imbalanced and positively correlated with the number of training instances. Such imbalanced classifiers (w.r.t. their parameter norm) make the classification scores for low-shot categories (tail classes) much smaller than those of many-shot categories (head classes). After the standard softmax, this imbalance is further magnified, so the classifier wrongly suppresses the proposals predicted as low-shot categories.

The classifier imbalance is rooted in the data distribution imbalance: classifiers for the many-shot categories see more, and more diverse, training instances, leading to dominating magnitudes. One may consider using solutions to long-tail classification to overcome such an issue, including re-sampling training instances to balance the distribution [16, 8, 34, 27] and re-weighting the classification loss at category level [6, 2, 19] or instance level [24, 35]. The re-sampling based solutions are applicable to detection frameworks, but may lead to increased training time and a risk of over-fitting to the tail classes. Re-weighting based methods are unfortunately very sensitive to hyper-parameter choices and not well suited to detection frameworks, because it is difficult to deal with the special background class, an extremely many-shot category. We empirically find that none of these methods works well on the long-tail detection problem.

In this work, to address the classifier imbalance, we introduce a simple yet effective balanced group softmax (BAGS) module into the classification head of a detection framework. We propose to put object categories with similar numbers of training instances into the same group and compute the group-wise softmax cross-entropy loss separately. Treating categories with different instance numbers separately can effectively alleviate the domination of the head classes over tail classes. However, due to the lack of diverse negative examples when training each group, the resultant model suffers from too many false positives. Thus, BAGS further adds a category others into each group and introduces the background category as an individual group. This alleviates the suppression of tail classes by head classes by keeping their classifiers balanced, while the categories background and others prevent false positives.

We experimentally find BAGS works very well. It consistently improves the performance on tail classes by 9% – 19% for various frameworks, including Faster R-CNN [33], Cascade R-CNN [1], Mask R-CNN [17] and HTC [4] with ResNet-50-FPN [18, 23] and ResNeXt-101-64x4d-FPN [40] backbones, on the long-tail object recognition benchmark LVIS [15], with the overall mAP lifted by around 3% – 6%.

To sum up, this work makes the following contributions:

  • Through comprehensive analysis, we reveal why existing models do not perform well for long-tail detection: their classifiers are imbalanced and not trained equally well, as reflected by the observed imbalanced classifier weight norms.

  • We propose a simple yet effective balanced group softmax module to address the problem. It can be easily combined with object detection and instance segmentation frameworks to improve their long-tail recognition performance.

  • We conduct extensive evaluations with state-of-the-art long-tail classification methods for object detection. Such benchmarking not only deepens our understanding of these methods and of the unique challenges of long-tail detection, but also provides reliable and strong baselines for future research in this direction.

2 Related Works

Compared with object detection targeted at balanced distributions [13, 33, 1] and few-shot object detection [20, 3, 41, 11], the challenging and practical long-tail object detection problem is still underexplored. Though Ouyang et al. [29] propose the concept of long-tail object detection, their work focuses on the imbalanced training data distribution of the ILSVRC DET dataset [7], without a few-shot setting for tail classes as in LVIS [15]. [15] proposes repeat factor sampling (RFS), which serves as a baseline. Classification calibration [39] enhances RFS by calibrating the classification scores of tail classes with another head trained with an ROI-level class-balanced sampling strategy. Below we first review general object detection methods, and then long-tail classification methods.

General object detection

Deep learning based object detection frameworks are divided into anchor-based and anchor-free ones. Anchor-based approaches [12, 13, 33, 31, 24] explicitly or implicitly extract features for individual regions, thus converting object detection into proposal-level classification, which has been extensively explored. In contrast, anchor-free approaches focus on detecting key points of objects and construct the final detection boxes by properly combining detected key points [22, 9, 43] or expanding the representation of key points [42, 38]. For such detectors, proposal classification is achieved by classifying the key points.

These popular object detection frameworks all employ a softmax classifier for either proposal classification or key-point classification. Our proposed balanced group softmax module can be easily plugged into such mainstream detectors by simply replacing the original softmax classifier. For simplicity, we mainly experiment with anchor-based detectors Faster R-CNN [33] and Cascade R-CNN [1] as well as their corresponding instance segmentation approaches Mask R-CNN [17] and HTC [4].

Long-tail classification

Long-tail classification is attracting increasing attention due to its realistic applications. Current works leverage data re-sampling, cost-sensitive learning, or other techniques. For data re-sampling methods, training samples are either over-sampled (adding copies of training samples for tail classes) [16], under-sampled (deleting training samples for head classes) [8], or class-balanced sampled [34, 27], which motivates RFS [15]. For cost-sensitive learning, the network losses are re-weighted at category level by multiplying different weights on different categories to enlarge the influence of tail-class training samples [6, 2, 19] or at instance level by multiplying different weights on different training samples for more fine-grained control [24, 35]. Some other approaches optimize the classifier trained with long-tail data, such as the Nearest Class Mean classifier (NCM) [28, 14] and the $\tau$-normalized classifier [21]. These methods are usually sensitive to hyper-parameters and do not perform well when transferred to detection frameworks due to the inherent difference between classification and detection, as stated in Sec. 1.

Therefore, an approach specifically designed for long-tail object detection is desirable, and our work is the first successful attempt to overcome classifier imbalance through group-wise training without extra sampling from tail classes.

Figure 2: Framework of our balanced group softmax module. (a) Training: Classes containing similar numbers of training instances are grouped together. Class others is added to each group. $\mathcal{G}_0$ denotes the background group. Softmax cross entropy (CE) loss is applied to each group individually. (b) Testing: With the new prediction $\mathbf{z}$, softmax is applied in each group, and probabilities are ordered by their original category ID (CID) and re-scaled with the foreground probability, generating new probability vectors for the following post-processing.

3 Preliminary and Analysis

3.1 Preliminary

We first revisit the popular two-stage object detection framework, by taking Faster R-CNN [33] as an example. We adopt such a two-stage framework to develop and implement our idea.

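The backbone network $f_{back}$ takes an image $I$ as input and generates a feature map $F = f_{back}(I)$. The feature map is then passed to ROI-align [17] or ROI-pooling [13] to produce $K$ proposals, each with its own feature $F_k = \mathrm{ROIAlign}(F, b_k)$. Here $b_k$ denotes proposal $k$. The classification head $f_{head}$ then extracts a $d$-dimensional feature $\mathbf{h} = f_{head}(F_k)$ for each of the proposals. Finally, one FC (fully connected) layer is used to transfer $\mathbf{h}$ to the $(C+1)$-category prediction ($C$ object classes plus background) by $\mathbf{z} = W^\top \mathbf{h} + \mathbf{b}$, where $W \in \mathbb{R}^{d \times (C+1)}$ is the classifier weight matrix with each column $\mathbf{w}_j \in \mathbb{R}^d$ related to one specific category $j$, and $\mathbf{b}$ is the bias term.

During training, with ground-truth label $\mathbf{y} \in \{0, 1\}^{C+1}$, softmax cross entropy is applied to compute the loss for a specific proposal:

$$\mathcal{L}(\mathbf{z}) = -\sum_{j=0}^{C} y_j \log(p_j), \tag{1}$$

$$p_j = \mathrm{softmax}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{i=0}^{C} e^{z_i}}. \tag{2}$$

Here $z_j$ denotes the $j$-th element of $\mathbf{z}$ and $p_j$ is the predicted probability of the proposal being an instance of category $j$.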
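As a minimal sketch of the softmax cross-entropy loss used by the standard classification head, the snippet below evaluates it for a single proposal with NumPy. The score vector and the label are made-up toy values, and the helper names `softmax` and `softmax_cross_entropy` are chosen here for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = z - np.max(z)          # subtracting the max does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

def softmax_cross_entropy(z, label):
    """Cross-entropy loss for one proposal whose ground-truth class index is `label`."""
    return -np.log(softmax(z)[label])

# Toy scores: index 0 is background, indices 1..3 are object classes.
z = np.array([2.0, 1.0, 0.5, -1.0])
p = softmax(z)
loss = softmax_cross_entropy(z, label=1)
print(p, loss)
```

Because every class shares one normalizer, raising the score of any single class lowers the probability of all the others, which is the coupling that the later sections set out to weaken.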

3.2 Analysis

Current well-performing detection models often fail to recognize tail classes when the training set follows a long-tailed distribution. In this section, we try to investigate the underlying mechanism behind such performance drop from balanced dataset to long-tailed dataset, by conducting contrast experiments on their representative examples, i.e., COCO and LVIS.

We adopt a Faster R-CNN [33] model with an R50-FPN backbone. By directly comparing the mAP on the two datasets, the performance drops notably from 36.4% (COCO) to 20.9% (LVIS). Although the comparison is not entirely fair, as LVIS contains many more classes than COCO (1230 vs. 80), we can still draw some interesting observations. On head classes, the LVIS model achieves results comparable with COCO. However, when it comes to tail classes, the performance rapidly decreases to 0. Such a phenomenon implies current detection models are indeed challenged by data imbalance. To further investigate how the performance degradation is induced by data imbalance, we decouple the detection framework into a proposal feature extraction stage and a proposal classification stage, following [21].

Specifically, following the notations in Sec. 3.1, we deem the operations used to generate $\mathbf{h}$ as proposal feature extraction, and the last FC layer and softmax in Eqn. (2) as a softmax classifier. Then, we investigate the correlation between the number of training instances and the weight norm $\|\mathbf{w}_j\|$ in the classifier for each category. The results are visualized in Figure 1. We can see that for the COCO dataset, the numbers of training instances per category are relatively balanced, and the classifier weight norms are also relatively balanced (0.75-1.25) for all foreground categories (note that the first class, CID=0, is background). In contrast, for the LVIS dataset, the weight norm $\|\mathbf{w}_j\|$ is highly related to the number of training instances in the corresponding category $j$: the more training examples there are, the larger the weight magnitude will be. For the extreme few-shot categories (tail classes), the corresponding weight norms are extremely small, even close to zero. Based on such observations, one can foresee that prediction scores for tail classes will be inherently lower than those for head classes, and proposals of tail classes will be less likely to be selected after competing with those of head categories within the softmax computation. This explains why current detection models often fail on tail classes.
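The weight-norm analysis can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the authors' analysis script: the instance counts are invented, the weight matrix is synthetic with column norms that grow with the log of the count, and `rank_correlation` is a small helper written here (it assumes no ties).

```python
import numpy as np

def per_class_weight_norms(W):
    """L2 norm of each column of a (d, num_classes) classifier weight matrix."""
    return np.linalg.norm(W, axis=0)

def rank_correlation(a, b):
    """Rank correlation of two 1-D arrays, assuming no tied values."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

# Hypothetical per-category training instance counts (tail to head).
counts = np.array([5, 40, 300, 2000, 15000])

# Synthetic classifier: unit-norm random directions scaled by log10(count) / 4.
rng = np.random.default_rng(0)
directions = rng.standard_normal((16, counts.size))
directions /= np.linalg.norm(directions, axis=0)
scales = np.log10(counts) / 4.0
W = directions * scales

norms = per_class_weight_norms(W)
rho = rank_correlation(counts, norms)
print(norms, rho)
```

On a real detector, `W` would be replaced by the weights of the last classification FC layer and `counts` by the per-category instance counts of the training set.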

Why would the classifier weights be correlated with the number of training instances per class? To answer this question, let us further inspect the training procedure of Faster R-CNN. When proposals from a head class $j$ are selected as training samples, the prediction $z_j$ should be activated, while the predictions for other categories should be suppressed. As the training instances for head classes far outnumber those of tail classes (e.g., 10,000 vs. 1 in some extreme cases), the classifier weights of tail classes are suppressed by head-class samples much more frequently than they are activated, resulting in imbalanced weight norms after training.
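The activation and suppression described above can be read off the standard gradient of softmax cross entropy. This is a textbook identity rather than a derivation taken from the paper; here $\mathbf{h}$ stands for the proposal feature fed to the last FC layer, $\mathbf{w}_i$ for the classifier weight vector of category $i$, $p_i$ for the softmax probability, and $y_i$ for the one-hot label entry.

```latex
\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i,
\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_i} = (p_i - y_i)\,\mathbf{h}.
```

For a training proposal of class $j$, $y_j = 1$, so a gradient step raises $z_j$, while every other category $i \neq j$ has $p_i - y_i = p_i > 0$ and its score is pushed down. A tail class plays the role of $i$ for almost every training sample and the role of $j$ only rarely, which is consistent with the small weight norms observed for tail classes.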

Therefore, one may see why re-sampling methods [15, 39] are able to benefit tail classes in long-tail instance classification and segmentation. They simply increase the sampling frequency of tail-class proposals during training so that the weights of different classes are activated or suppressed more equally, thus balancing the tail and head classes to some degree. Loss re-weighting methods [6, 2, 19, 24, 35] can take effect in a similar way. Though the re-sampling strategy is able to alleviate data imbalance, it introduces new risks such as over-fitting to tail classes and extra computation overhead. Meanwhile, loss re-weighting is sensitive to the per-class loss weight design, which usually varies across frameworks, backbones and datasets, making it hard to deploy in real-world applications. Moreover, re-weighting based methods cannot handle the background class well in detection problems. Therefore, we propose a simple yet effective solution to balance the classifier weight norms without heavy hyper-parameter engineering.

4 Balanced Group Softmax

Our novel balanced group softmax module is illustrated in Figure 2. We first elaborate on its formulation and then explain the design details.

4.1 Group softmax

As aforementioned, detector performance is harmed by the positive correlation between weight norms and number of training examples. To solve this problem, we propose to divide classes into several disjoint groups and perform the softmax operation separately, such that only classes with similar numbers of training instances are competing with each other within each group. In this way, classes containing significantly different numbers of instances can be isolated from each other during training. The classifier weights of tail classes would not be substantially suppressed by head classes.

Concretely, we divide all the $C$ categories into $N$ groups according to their training instance numbers. We assign category $j$ to group $\mathcal{G}_n$ if

$$s_n^l \le N(j) < s_n^h, \quad n > 0, \tag{3}$$

where $N(j)$ is the number of ground-truth bounding boxes for category $j$ in the training set, and $s_n^l$ and $s_n^h$ are hyper-parameters that determine the minimal and maximal instance numbers for group $n$. In this work, we set $s_{n+1}^l = s_n^h$ to ensure there is no overlap between groups, so that each category is assigned to exactly one group. $N$ and $s_n^l$ are set empirically to make sure that categories in each group contain similar total numbers of training instances. Throughout this paper, we set $N = 4$, $s_1^l = 0$, $s_2^l = 10$, $s_3^l = 10^2$, $s_4^l = 10^3$, and $s_4^h = +\infty$.
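A minimal sketch of this grouping rule: each foreground category is mapped to a group index by comparing its instance count against a sorted list of lower bounds, with the upper bound of one group equal to the lower bound of the next. The function name `assign_groups`, the example counts, and the power-of-ten bounds used below are illustrative assumptions.

```python
import numpy as np

def assign_groups(instance_counts, lower_bounds):
    """Return a 1-based group index for every foreground category.

    A category with count c goes to group n when
    lower_bounds[n-1] <= c < lower_bounds[n]; the last group has no upper bound.
    """
    counts = np.asarray(instance_counts)
    bounds = np.asarray(lower_bounds)
    # side="right" counts how many lower bounds are <= c, which is the group index.
    return np.searchsorted(bounds, counts, side="right")

# Hypothetical per-category instance counts and illustrative group bounds.
counts = [3, 10, 99, 100, 5000]
groups = assign_groups(counts, lower_bounds=[0, 10, 100, 1000])
print(groups)
```

Index 0 is left unused by this helper so that it can be reserved for the background-only group.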

Besides, we manually set $\mathcal{G}_0$ to contain only the background category, because it has by far the most training instances (typically 10-100 times more than object categories). We adopt sigmoid cross entropy loss for $\mathcal{G}_0$ because it contains only one prediction, while for the other groups we use softmax cross entropy loss. We choose softmax because it inherently lets classes suppress one another, and is therefore less likely to produce large numbers of false positives. During training, for a proposal $b_k$ with ground-truth label $c$, two groups are activated: the background group $\mathcal{G}_0$ and the foreground group $\mathcal{G}_n$ with $c \in \mathcal{G}_n$.

4.2 Calibration via category “others”

However, we find the above group softmax design suffers from the following issue. During testing, all groups are used to predict for a proposal, since its category is unknown. Thus, at least one category per group will receive a high prediction score, and it is hard to decide which group-wise prediction to take, leading to a large number of false positives. To address this issue, we add a category others into every group to calibrate predictions among groups and suppress false positives. This category others contains the categories not included in the current group, which can be either background or foreground categories in other groups. For $\mathcal{G}_0$, category others represents the foreground classes. To be specific, for a proposal $b_k$ with ground-truth label $c$, the new prediction should be $\mathbf{z} \in \mathbb{R}^{(C+1)+(N+1)}$. The probability of class $j$ is calculated by

$$p_j = \frac{e^{z_j}}{\sum_{i \in \mathcal{G}_n} e^{z_i}}, \quad \{n \mid j \in \mathcal{G}_n\}. \tag{4}$$

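The ground-truth labels are re-mapped in each group. In groups that do not contain $c$, class others is defined as the ground-truth class. The final loss function is then

$$\mathcal{L}_k = -\sum_{n=0}^{N} \sum_{i \in \mathcal{G}_n} y_i^n \log(p_i^n), \tag{5}$$

where $y^n$ and $p^n$ represent the label and probability in group $\mathcal{G}_n$, and the outer sum runs over the background group and all foreground groups.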
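The label re-mapping and the summed group-wise loss can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the released implementation: each group's logits are stored as a separate array whose node 0 is others, the background group is treated as a two-way softmax over (others, background), and the sampling of others instances is left out, so every group contributes to the loss.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D score vector."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def group_targets(label, groups):
    """Re-map a category ID to its node index in every group (0 means 'others')."""
    return [members.index(label) + 1 if label in members else 0 for members in groups]

def group_softmax_loss(group_logits, label, groups):
    """Sum of per-group softmax cross-entropy losses for one proposal."""
    loss = 0.0
    for logits, target in zip(group_logits, group_targets(label, groups)):
        loss -= log_softmax(logits)[target]
    return loss

# groups[0] holds only background (category 0); the rest hold foreground IDs.
groups = [[0], [1, 2], [3, 4, 5]]
# One logit array per group: node 0 is 'others', followed by the group members.
group_logits = [np.zeros(len(members) + 1) for members in groups]

label = 4
targets = group_targets(label, groups)
loss = group_softmax_loss(group_logits, label, groups)
print(targets, loss)
```

With all-zero logits each group predicts a uniform distribution, so the loss is simply the sum of the logs of the group sizes, which makes the example easy to check by hand.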

4.3 Balancing training samples in groups

In the above treatment, the newly added category others will again become a dominating outlier with overwhelmingly many instances. To balance the number of training samples per group, we only sample a certain number of others proposals for training, which is controlled by a sampling ratio $\beta$. For $\mathcal{G}_0$, all training samples of others are used, since the number of background proposals is very large. For $\{\mathcal{G}_n \mid c \in \mathcal{G}_n\}$, $m_n$ others instances are randomly sampled from all others instances, where $m_n = \beta \sum_{i \in \mathcal{G}_n} N_{batch}(i)$. $\beta$ is a hyper-parameter, and we conduct an ablation study in Sec. 5.4 to show its impact. $N_{batch}(i)$ indicates the number of instances of category $i$ in the current batch.

Namely, within the groups that contain the ground-truth categories, others instances are sampled proportionally based on a mini-batch of $K$ proposals. If no normal category is activated in a group, none of its others instances is activated and the group is ignored. In this way, each group stays balanced with a low ratio of false positives. Adding category others brings a 2.7% improvement over the baseline.
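One way to picture this sampling step is as a boolean mask over the proposals of a mini-batch that decides which proposals contribute to a given foreground group's loss. The sketch below is an assumption-laden illustration: the function name `sample_others_mask`, the toy labels, and the ratio value passed as `beta` are all hypothetical, and the number of sampled others is capped by how many are available.

```python
import numpy as np

def sample_others_mask(labels, group_members, beta, rng):
    """Select the proposals that train one foreground group.

    All proposals whose label belongs to the group are kept. Of the remaining
    proposals (the group's 'others'), about beta times as many are sampled at
    random. If the group has no member in the batch, nothing is selected.
    """
    labels = np.asarray(labels)
    in_group = np.isin(labels, group_members)
    num_in_group = int(in_group.sum())
    mask = in_group.copy()
    if num_in_group == 0:
        return mask                                  # group is ignored for this batch
    others_idx = np.flatnonzero(~in_group)
    num_others = min(len(others_idx), int(beta * num_in_group))
    chosen = rng.choice(others_idx, size=num_others, replace=False)
    mask[chosen] = True
    return mask

rng = np.random.default_rng(0)
labels = np.array([1, 1, 0, 0, 0, 0, 3, 3, 3, 0])    # 0 is background
mask = sample_others_mask(labels, group_members=[1, 2], beta=2, rng=rng)
ignored = sample_others_mask(labels, group_members=[7], beta=2, rng=rng)
print(mask, ignored)
```

The per-group loss would then be averaged only over the proposals where the mask is true.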

4.4 Inference

During inference, we first generate $\mathbf{z}$ with the trained model and apply softmax in each group using Eqn. (4). Except for $\mathcal{G}_0$, all others nodes are ignored, and the probabilities of all categories are ordered by their original category IDs. $p_0^0$ in $\mathcal{G}_0$ can be regarded as the probability that a proposal is foreground. Finally, we rescale all probabilities of normal categories as $p_j' = p_0^0 \times p_j$. This new probability vector is fed to the following post-processing steps such as NMS to produce the final detection results. Note that $p'$ is technically not a real probability vector, since its entries do not sum to 1. It plays the role of the original probability vector, guiding the model in selecting the final boxes.
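The inference procedure can be sketched as below. The layout is an assumption made for illustration: each group's logits are a separate array with others at node 0, the background group is ordered as (others, background), and the background score is taken directly from that group.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

def group_softmax_inference(group_logits, groups, num_classes):
    """Return scores indexed by original category ID (0 is background)."""
    scores = np.zeros(num_classes + 1)
    p_bg_group = softmax(group_logits[0])
    fg_prob = p_bg_group[0]          # 'others' in the background group means foreground
    scores[0] = p_bg_group[1]        # background score
    for logits, members in zip(group_logits[1:], groups[1:]):
        p = softmax(logits)
        # Drop the 'others' node, put the rest back by category ID, and rescale.
        scores[members] = fg_prob * p[1:]
    return scores

groups = [[0], [1, 2], [3, 4, 5]]
group_logits = [np.zeros(len(members) + 1) for members in groups]
scores = group_softmax_inference(group_logits, groups, num_classes=5)
print(scores, scores.sum())
```

With uniform logits the scores add up to more than 1, which illustrates why the rescaled vector is not a true probability distribution even though it can still rank boxes for NMS.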

5 Experiments

5.1 Dataset and setup

We conduct experiments on the recent Large Vocabulary Instance Segmentation (LVIS) dataset [15], which contains 1,230 categories with both bounding box and instance mask annotations. For object detection experiments, we use only bounding box annotation for training and evaluation. When exploring BAGS’s generalization to instance segmentation, we use mask annotations. Please refer to our supplementary materials for implementation details.

Following [39], we split the categories in the validation set of LVIS into 4 bins according to their training instance numbers to evaluate the model performance on the head and tail classes more clearly. Bin$_i$ contains categories that have $10^{i-1}$ to $10^i$ instances. We refer to categories in the first two bins as “tail classes”, and categories in the other two bins as “head classes”. Besides the official metrics mAP, AP$_r$ (AP for rare classes), AP$_c$ (AP for common classes), and AP$_f$ (AP for frequent classes) that are provided by lvis-api (https://github.com/lvis-dataset/lvis-api), we also report AP on different bins. AP$_i$ denotes the mean AP over the categories from Bin$_i$.
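A per-bin AP can be computed from per-category AP values once each category is mapped to a bin. The sketch below is illustrative only: the bin edges are assumed to be powers of ten with the last bin left open-ended, the counts and AP values are invented, and the helper names are not part of lvis-api.

```python
import numpy as np

def bin_of(counts, edges=(10, 100, 1000)):
    """1-based bin index per category; the bin edges are an illustrative assumption."""
    return np.searchsorted(np.asarray(edges), np.asarray(counts), side="right") + 1

def mean_ap_per_bin(per_class_ap, counts, num_bins=4):
    """Mean AP over the categories in each bin (NaN when a bin is empty)."""
    ap = np.asarray(per_class_ap, dtype=float)
    bins = bin_of(counts)
    return {
        i: (ap[bins == i].mean() if np.any(bins == i) else float("nan"))
        for i in range(1, num_bins + 1)
    }

# Hypothetical training instance counts and per-category AP values.
counts = [4, 8, 50, 400, 3000, 20000]
per_class_ap = [0.0, 10.0, 20.0, 25.0, 30.0, 40.0]
ap_by_bin = mean_ap_per_bin(per_class_ap, counts)
print(ap_by_bin)
```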

| ID | Models | mAP | AP$_1$ | AP$_2$ | AP$_3$ | AP$_4$ | AP$_r$ | AP$_c$ | AP$_f$ | ACC | ACC$_1$ | ACC$_2$ | ACC$_3$ | ACC$_4$ | ACC$_{bg}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) | R50-FPN | 20.98 | 0.00 | 17.34 | 24.00 | 29.99 | 4.13 | 19.70 | 29.30 | 92.78 | 0.00 | 2.47 | 25.30 | 45.87 | 95.91 |
| (2) | x2 | 21.93 | 0.64 | 20.94 | 23.54 | 28.92 | 5.79 | 22.02 | 28.26 | 92.62 | 0.00 | 5.60 | 26.51 | 45.71 | 95.69 |
| (3) | Finetune tail | 22.28 | 0.27 | 22.58 | 23.89 | 27.43 | 5.67 | 23.54 | 27.34 | 94.81 | 0.00 | 5.04 | 5.58 | 5.86 | 99.85 |
| (4) | RFS [15] | 23.41 | 7.80 | 24.18 | 23.14 | 28.33 | 14.59 | 22.74 | 27.77 | 92.71 | 0.60 | 7.50 | 25.62 | 44.39 | 95.84 |
| (5) | RFS-finetune | 22.66 | 8.06 | 23.07 | 22.43 | 27.73 | 13.44 | 22.06 | 27.09 | 92.77 | 0.60 | 7.14 | 25.08 | 43.79 | 95.91 |
| (6) | Re-weight | 23.48 | 6.34 | 22.91 | 23.88 | 30.12 | 11.47 | 22.41 | 29.61 | 94.84 | 0.00 | 0.82 | 9.57 | 17.40 | 99.53 |
| (7) | Re-weight-cls | 24.66 | 10.04 | 24.12 | 24.57 | 31.07 | 14.16 | 23.51 | 30.28 | 94.76 | 0.00 | 0.34 | 7.72 | 16.02 | 99.64 |
| (8) | Focal loss [24] | 11.12 | 0.00 | 10.24 | 13.36 | 13.17 | 2.74 | 11.13 | 14.46 | 3.87 | 0.00 | 17.45 | 40.11 | 49.31 | 1.35 |
| (9) | Focal loss-cls | 19.29 | 1.64 | 19.30 | 20.64 | 23.70 | 6.60 | 19.81 | 23.71 | 2.90 | 0.00 | 27.67 | 48.53 | 48.89 | 0.16 |
| (10) | NCM-fc [21] | 16.02 | 5.87 | 14.13 | 16.97 | 21.40 | 10.31 | 13.92 | 20.92 | 94.29 | 0.00 | 0.02 | 0.23 | 0.15 | 100.00 |
| (11) | NCM-conv [21] | 12.56 | 4.20 | 9.71 | 13.75 | 18.46 | 6.11 | 10.39 | 17.85 | 94.29 | 0.00 | 0.00 | 0.20 | 0.10 | 100.00 |
| (12) | $\tau$-norm [21] | 11.01 | 0.00 | 11.71 | 12.01 | 12.36 | 2.07 | 12.30 | 12.97 | 5.91 | 0.00 | 30.32 | 39.49 | 49.14 | 3.42 |
| (13) | $\tau$-norm-select | 21.61 | 0.35 | 20.07 | 23.43 | 29.16 | 6.18 | 20.99 | 28.54 | 92.43 | 0.00 | 13.19 | 20.62 | 38.98 | 95.91 |
| (14) | Ours | 25.96 | 11.33 | 27.64 | 25.14 | 29.90 | 17.65 | 25.75 | 29.54 | 93.71 | 2.06 | 7.50 | 22.07 | 35.88 | 97.41 |

Table 1: Comparison with state-of-the-art methods transferred from long-tail image classification on the LVIS val set. Models (1) and (4) are initialized with a model pre-trained on the COCO dataset. All the others are initialized with model (1). “-cls” denotes that only the classification FC layer ($W$ and $\mathbf{b}$) is trained, and the other parameters are frozen. Models (10) and (11) represent NCM models using classification FC features and ROI-pooled Conv features to calculate category centers, respectively. Model (13) means using $\tau$-norm results only on foreground proposals. Refer to our supplementary materials for more implementation details.

5.2 Main results on LVIS

We transfer multiple state-of-the-art methods for long-tail classification to the Faster R-CNN framework, including fine-tuning on tail classes, repeat factor sampling (RFS) [27], category loss re-weighting, Focal Loss [24], NCM [21, 36], and $\tau$-normalization [21]. We carefully adjust the hyper-parameter settings to make them suitable for object detection. Implementation details are provided in our supplementary material. We report their detection performance and proposal classification accuracy in Table 1.

How well does the naive baseline perform? We take Faster R-CNN with a ResNet-50-FPN backbone as the baseline (model (1) in the table), which achieves 20.98% mAP but 0 AP$_1$. The baseline model misses most tail categories due to the domination of other classes. Note that the other models are initialized from model (1) and further fine-tuned for another 12 epochs. To make sure the improvement does not come from the longer training schedule, we train model (1) for another 12 epochs for a fair comparison. This gives model (2). Comparing model (2) with model (1), we find that longer training mainly improves AP$_2$, but AP$_1$ remains around 0. Namely, longer training hardly helps improve the performance for low-shot categories with fewer than 10 instances. Fine-tuning model (1) on tail-class training samples (model (3)) notably increases only AP$_2$ while decreasing AP$_4$ by 2.5%, and AP$_1$ remains close to 0. This indicates the original softmax classifier cannot perform well when the number of training instances is too small.

Do long-tail classification methods help? We observe that the sampling-based method RFS (model (4)) improves the overall mAP by 2.5%. The AP for tail classes is improved while the AP for head classes is maintained. However, RFS also increases the training time. We also try to initialize the RFS model with model (1), obtaining model (5), but mAP drops by 0.8% due to over-fitting.

For cost-sensitive learning methods, both model (6) and model (7) improve the performance, and model (7) works better. This confirms that the observation in [21], that decoupling feature learning and classifier learning benefits long-tail recognition, still applies to object detection. For focal loss, we directly apply sigmoid focal loss at proposal level. It is worth noting that in terms of proposal classification, the accuracy over the object classes (ACC$_1$ to ACC$_4$) increases notably. However, for background proposals, ACC$_{bg}$ drops from 95.8% to 0.16%, leading to a large number of false positives and low AP. This phenomenon again highlights the difference between long-tail detection and classification: the very special background class should be carefully treated.

For NCM, we try both the FC feature just before the classifier (model (10)) and the Conv feature extracted by ROI-align (model (11)). However, we observe that NCM works well for extremely low-shot classes but not for head classes. Moreover, NCM can provide a good 1-nearest-neighbour classification label, but for detection we also need the whole probability vector to be meaningful, so that the scores of different proposals on the same category can be used to evaluate the quality of proposals.

The $\tau$-normalization model (12) suffers from a similar challenge as the Focal Loss model (8): the many-shot background class is extremely dominating. Though foreground proposal accuracy is greatly increased, ACC$_{bg}$ drops hugely. Consequently, for model (13), the proposals categorized as background inherit the prediction of the original model while the others take the $\tau$-norm results. However, the improvement is limited. Note that AP$_1$ and ACC$_1$ are still close to 0 after $\tau$-norm, but AP$_2$ and ACC$_2$ are improved.

How well does our method perform? For our model, except for $\mathcal{G}_0$, we split the normal categories into 4 groups for group softmax computation, with $(s_n^l, s_n^h)$ being $(0, 10)$, $(10, 10^2)$, $(10^2, 10^3)$, $(10^3, +\infty)$ respectively. Our model is initialized with model (1), and the classification FC layer is randomly initialized since the output shape changes. Only this FC layer is trained for another 12 epochs, and all other parameters are frozen. Our results surpass all the other methods by a large margin. AP$_1$ increases by 11.3% and AP$_2$ by 10.3%, with AP$_3$ and AP$_4$ almost unchanged. This result verifies the effectiveness of our balanced group softmax module.

| Models | mAP | AP$_1$ | AP$_2$ | AP$_3$ | AP$_4$ | AP$_r$ | AP$_c$ | AP$_f$ |
|---|---|---|---|---|---|---|---|---|
| Faster R50 | 20.98 | 0.00 | 17.34 | 24.00 | 29.99 | 4.13 | 19.70 | 29.30 |
| Ours | 25.96 | 11.33 | 27.64 | 25.14 | 29.90 | 17.65 | 25.75 | 29.54 |
| Faster X101 | 24.63 | 0.79 | 22.37 | 27.45 | 32.73 | 5.80 | 24.54 | 32.25 |
| Ours | 27.83 | 14.99 | 28.07 | 27.93 | 32.02 | 18.78 | 27.32 | 32.07 |
| Cascade X101 | 27.16 | 0.00 | 24.06 | 31.09 | 36.17 | 4.84 | 27.22 | 36.00 |
| Ours | 32.77 | 19.03 | 36.10 | 31.13 | 34.96 | 28.24 | 32.11 | 35.41 |

Table 2: Results with the stronger backbone ResNeXt-101-64x4d and the stronger framework Cascade R-CNN. All Ours models are initialized with their plain counterparts.

Extension of our method to stronger models. To further verify the generalization of our method, we change the Faster R-CNN backbone to ResNeXt-101-64x4d [40]. Results are shown in Table 2. On this much stronger backbone, our method still gains a 3.2% mAP improvement. Then, we apply our method to the state-of-the-art Cascade R-CNN [1] framework by replacing all 3 softmax classifiers in its 3 stages with our BAGS module. The overall mAP increases significantly, by 5.6%. Our method thus brings a consistent gain with 3 heads.

5.3 Results for instance segmentation

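| ID | Models | Backbone | mAP | AP$_1$ | AP$_2$ | AP$_3$ | AP$_4$ | AP$_r$ | AP$_c$ | AP$_f$ | mAP$^m$ | AP$_1^m$ | AP$_2^m$ | AP$_3^m$ | AP$_4^m$ | AP$_r^m$ | AP$_c^m$ | AP$_f^m$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) | Mask-RFS* [15] | R50 | - | - | - | - | - | - | - | - | 24.40 | - | - | - | - | 14.50 | 24.30 | 28.40 |
| (2) | Mask-RFS* [15] | R101 | - | - | - | - | - | - | - | - | 26.00 | - | - | - | - | 15.80 | 26.10 | 29.80 |
| (3) | Mask-RFS* [15] | X101-32x8d | - | - | - | - | - | - | - | - | 27.10 | - | - | - | - | 15.60 | 27.50 | 31.40 |
| (4) | Mask-Calib* [39] | R50 | - | - | - | - | - | - | - | - | 21.10 | 8.60 | 22.00 | 19.60 | 26.60 | - | - | - |
| (5) | HTC-Calib* [39] | X101 | - | - | - | - | - | - | - | - | 29.85 | 16.05 | 30.60 | 29.80 | 33.50 | - | - | - |
| (6) | HTC-Calib* [39] | X101-MS-DCN | - | - | - | - | - | - | - | - | 32.10 | 12.70 | 32.10 | 33.60 | 37.00 | - | - | - |
| (7) | Mask R-CNN | R50 | 20.78 | 0.00 | 15.88 | 24.61 | 30.51 | 3.28 | 18.99 | 30.00 | 20.68 | 0.00 | 17.06 | 23.66 | 29.62 | 3.73 | 19.95 | 28.37 |
| (8) | Ours | R50 | 25.76 | 9.65 | 26.20 | 26.09 | 30.45 | 15.03 | 25.45 | 30.42 | 26.25 | 12.81 | 28.28 | 25.15 | 29.61 | 17.97 | 26.91 | 28.74 |
| (9) | HTC | X101 | 31.28 | 5.02 | 31.71 | 33.24 | 37.21 | 12.39 | 32.58 | 37.18 | 29.28 | 5.11 | 30.34 | 30.62 | 34.37 | 12.11 | 31.32 | 33.58 |
| (10) | Ours | X101 | 33.68 | 19.95 | 36.14 | 32.82 | 36.06 | 25.43 | 34.12 | 36.42 | 31.20 | 17.33 | 33.87 | 30.34 | 33.29 | 23.40 | 32.34 | 32.89 |
| (11) | HTC | X101-MS-DCN | 34.61 | 5.80 | 35.36 | 36.87 | 40.50 | 14.24 | 35.98 | 41.03 | 31.94 | 5.56 | 33.07 | 33.75 | 37.02 | 13.67 | 34.04 | 36.62 |
| (12) | Ours | X101-MS-DCN | 37.71 | 24.40 | 40.30 | 36.67 | 40.00 | 29.43 | 37.78 | 40.92 | 34.39 | 21.07 | 36.69 | 33.71 | 36.61 | 26.79 | 35.04 | 36.62 |

Table 3: Results of bounding box and mask AP when extending our method to instance segmentation on the LVIS val set. AP$^m$ denotes AP of the instance segmentation mask. All backbones are with FPN. X101 denotes X101-64x4d. * results are from the corresponding cited paper. Models (8), (10) and (12) are initialized with models (7), (9) and (11), respectively.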

We further evaluate the benefits of our method for instance segmentation models, including Mask R-CNN [17] and the state-of-the-art HTC [4], on LVIS. Here the HTC models are trained with COCO stuff annotations for a segmentation branch. Results are shown in Table 3. First, comparing our models (8), (10) and (12) with their corresponding baseline models (7), (9) and (11), the mAPs of both bounding box and mask increase considerably. Our models fit the tail classes much better, while the APs on head classes drop slightly. Second, we compare our results with the state-of-the-art results [39, 15] on the LVIS instance segmentation task. With the Mask R-CNN framework and ResNet-50-FPN backbone, our model (8) surpasses RFS (1) and Calib (4) by at least 1.8%. With the HTC framework and ResNeXt-101-FPN backbone, our model (10) is 1.4% better than Calib (5). With the ResNeXt-101-FPN-DCN backbone and multi-scale training, our model (12) is 2.3% better than Calib (6). Our method establishes a new state of the art in terms of both bounding box and mask criteria.

Figure 3: Comparison of weight norms from model (1)(4)(7)(14) in Table 1. The vertical dashed lines split all categories into Bin.

5.4 Model analysis

Does our method balance classifiers well? We visualize the classifier weight norms of models (1), (4), (7) and our model (14) of Table 1 in Figure 3. Weights of RFS on tail classes are clearly enlarged. The re-weighting method suppresses the weights of head classes and lifts the weights of tail classes. For ours, since we decouple the relationships between different groups of categories, weights of and are almost at the same level. Though weights of are still smaller, they are better balanced than in the original model. Note that the weight norms of our model are less related to the number of training instances in each group, implying that such decoupling benefits network training.
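The weight-norm comparison above can be illustrated with a minimal sketch. It assumes the classifier is available as a plain list of per-class weight vectors; the function names, the toy weights, and the bin edges passed to `mean_norm_per_bin` are hypothetical and chosen only for illustration.

```python
import math

def weight_norms(weights):
    """L2 norm of each class's weight vector (one row per class)."""
    return [math.sqrt(sum(w * w for w in row)) for row in weights]

def mean_norm_per_bin(norms, counts, edges):
    """Average the norms inside bins of training-instance count.

    `edges` are ascending bin boundaries (hypothetical values here);
    a class with count n falls into the bin indexed by the number of
    edges it has reached, so the last bin is open-ended.
    """
    bins = [[] for _ in range(len(edges) + 1)]
    for norm, n in zip(norms, counts):
        idx = sum(n >= e for e in edges)
        bins[idx].append(norm)
    return [sum(b) / len(b) if b else float("nan") for b in bins]

# Toy example: three classes with 5, 50 and 5000 training instances.
norms = weight_norms([[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]])
per_bin = mean_norm_per_bin(norms, [5, 50, 5000], edges=(10, 100, 1000))
```

Plotting `norms` against categories sorted by instance count gives a curve of the kind shown in Figure 3; a flatter curve indicates a better balanced classifier.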

How much do background and others contribute? See Table 4. Starting from baseline model (0), directly grouping normal categories into 4 sets without adding background and others to each group gives the results of model (1). For model (1), during inference, the scores of each group are fed to a separate softmax and concatenated directly for NMS. Though AP on the rarest bin improves by 5.7%, performance on all the other bins drops significantly. This is because we do not have any constraint on false positives (FPs). For a single proposal, at least one category will be activated in each group, leading to many FPs. When we add the background group (model (2)) and use it to rescale the scores of normal categories, we get a 1.9% improvement over model (1), but the result is still worse than model (0). For model (3), we add the category others to each group without using the background group, and obtain a 2.7% performance gain over model (0).

ID b o N mAP AP_1 AP_2 AP_3 AP_4 AP_r AP_c AP_f
(0) - - - 20.98 0.00 17.34 24.00 29.99 4.13 19.70 29.30
(1) - - 4 17.82 5.71 17.07 18.09 23.13 8.52 17.44 22.01
(2) ✓ - 4 19.73 7.18 19.66 18.80 25.95 9.89 19.32 24.19
(3) - ✓ 4 23.74 9.90 24.06 23.38 28.88 15.46 22.58 28.49
(4) ✓ ✓ 2 25.31 6.53 27.55 24.19 30.35 15.30 25.14 29.53
(5) ✓ ✓ 4 25.96 11.33 27.64 25.14 29.90 17.65 25.75 29.54
(6) ✓ ✓ 8 24.85 7.79 26.05 24.59 29.58 14.11 24.79 29.21
Table 4: Effect of adding different components to our module. b for background group . o for adding category others to all bins. N is number of normal category groups.

How many groups to use in BAGS? With additional rescaling by the background group, another 2.2% improvement is obtained (model (5)). If we reduce the number of groups from 4 to 2, as shown in model (4), the overall mAP drops by 0.6. Notably, AP on the rarest bin becomes much worse (11.33 to 6.53), while AP on the most frequent bin increases slightly (29.90 to 30.35). Using more groups does not help either (model (6)). Since #ins for Bin is too small for , dividing Bin into 2 bins further decreases #ins of per group, leading to highly insufficient training for tail classes. To sum up, adding the category others to each group matters a lot, and using a specially trained background group to suppress background proposals works better than using others alone. Finally, grouping categories into bins and decoupling the relationship between tail and head classes greatly benefits the learning of tail classes.
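The inference procedure discussed in this ablation can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: we assume the background group holds two logits (background, foreground), that index 0 of every normal group is the category others, and that normal-category scores are rescaled by the foreground probability. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bags_inference(bg_logits, group_logits):
    """Group-wise softmax with background rescaling (sketch).

    bg_logits:    [background, foreground] logits (assumed layout).
    group_logits: one list of logits per normal group; index 0 of
                  each list is assumed to be the category `others`.
    Returns the background probability and the rescaled scores of
    all normal categories, concatenated in group order.
    """
    p_bg, p_fg = softmax(bg_logits)
    scores = []
    for logits in group_logits:
        probs = softmax(logits)
        # Drop `others`, then rescale by the foreground probability.
        scores.extend(p_fg * p for p in probs[1:])
    return p_bg, scores

p_bg, scores = bags_inference([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0, 0.0]])
```

Without the rescaling step (model (1) and model (3) in Table 4), every group would contribute at least one confident category per proposal, which is the source of the false positives described above.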

Impact of the sampling ratio in BAGS. After adding the category others to all groups, we need to sample training instances for others. Using all others proposals would lead to an imbalance problem within each group. Thus, our strategy is to sample others with ratio , so that (#ins of others) / (#ins of normal categories) = . As shown in Fig. 4, mAP increases continuously as we increase until . If we use all others proposals in the activated groups (indicated as n on the x-axis), the performance on head classes keeps increasing, but that on tail classes drops considerably. If we train on all others proposals regardless of whether any normal category is activated (indicated as all on the x-axis), mAP gets worse. This supports our view that introducing another imbalance problem could worsen the results.
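The sampling strategy for others can be sketched as below. The function name and signature are hypothetical; the sketch assumes that a group is "activated" when it contains at least one normal-category proposal, and that the number of sampled others is capped at the ratio times that count.

```python
import random

def sample_others(num_normal, others_indices, ratio, rng=random):
    """Sample `others` proposals for one group (sketch).

    num_normal:     number of normal-category proposals in the group.
    others_indices: indices of proposals labelled `others` in the group.
    ratio:          target (#ins of others) / (#ins of normal categories).
    """
    if num_normal == 0:
        # Group not activated: no `others` proposals are trained.
        return []
    k = min(len(others_indices), int(ratio * num_normal))
    return rng.sample(others_indices, k)

picked = sample_others(2, list(range(100)), ratio=8, rng=random.Random(0))
```

Setting the ratio large enough that `k` always equals `len(others_indices)` corresponds to the setting marked n in Fig. 4; removing the `num_normal == 0` check as well corresponds to the setting marked all.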

Figure 4: Influence of sample ratio . n indicates all others in the activated groups, and all indicates all others in all the groups.

5.5 Results on COCO-LT

To further verify the generalization ability of our method, we construct a long-tail distribution COCO-LT dataset by sampling images and annotations from COCO [25]. We get similar results on COCO-LT as on LVIS. Our model still introduces over 2% improvement on mAP (+2.2% for Faster R-CNN, +2.4% for Mask R-CNN bounding box, +2.3% for Mask R-CNN mask), especially gaining large improvement on tail classes (from 0.1% to 13.0% for bounding box) with both Faster R-CNN and Mask R-CNN frameworks. Please refer to our supplementary materials for dataset construction, data details, and full results.

6 Conclusion

In this work, we first reveal, by analyzing classifier weight norms, that one reason for poor detection performance on long-tail data is that the classifier becomes imbalanced due to insufficient training on low-shot classes. Then, we investigate multiple solid baseline methods transferred from long-tail classification, but find that they are limited in addressing the challenges of the detection task. We thus propose a balanced group softmax module to address the imbalance problem of classifiers, which achieves notably better results with different strong backbones for long-tail detection as well as instance segmentation.

7 Supplementary materials

7.1 Implementation details

7.1.1 Experiment setup

Our implementations are based on the MMDetection toolbox [5] and PyTorch [30]. All the models are trained with 8 V100 GPUs, with a batch size of 2 per GPU, except for HTC models (1 image per GPU). We use the SGD optimizer with learning rate = 0.01, which decays twice at the and epochs with factor = 0.1. Weight decay = 0.0001. Learning rate warm-up is used. All Ours models are initialized with their corresponding baseline models, which are trained directly on LVIS with softmax, and only the last FC layer is trained for another 12 epochs, with learning rate = 0.01, which decays twice at the and epochs with factor = 0.1. All other parameters are frozen.

7.1.2 Transferred methods

Here, we elaborate on the detailed implementation for transferred long-tail image classification methods in Table 1 of the main text.

Repeat factor sampling (RFS)

RFS [27] is applied to LVIS instance segmentation in [15]. It increases the sampling rate for tail class instances by oversampling images containing these categories. We implement RFS with its best practice settings given by [15] with .
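As a rough illustration of the oversampling rule, the sketch below follows the repeat-factor definition given in [15]: a category-level factor max(1, sqrt(t / f_c)), where f_c is the fraction of training images containing category c, and an image-level factor equal to the maximum over the categories in the image. The threshold `t` is left as a parameter, and the value used in the example is hypothetical.

```python
import math

def repeat_factors(image_categories, t):
    """Image-level repeat factors for repeat factor sampling (sketch).

    image_categories: list of sets of category ids, one set per image.
    t:                frequency threshold (a hyper-parameter).
    """
    n = len(image_categories)
    freq = {}
    for cats in image_categories:
        for c in cats:
            freq[c] = freq.get(c, 0) + 1
    # Category-level factor: rare categories get a factor above 1.
    cat_rep = {c: max(1.0, math.sqrt(t / (k / n))) for c, k in freq.items()}
    # Image-level factor: the largest factor among its categories.
    return [max((cat_rep[c] for c in cats), default=1.0)
            for cats in image_categories]

# Toy example with a hypothetical threshold t = 1.0.
factors = repeat_factors([{0}, {0}, {0}, {0, 1}], t=1.0)
```

In the toy example, category 1 appears in a quarter of the images, so the only image containing it is repeated twice as often as the others.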

Re-weight

Re-weight is a category-level cost-sensitive learning method. Motivated by [6], we re-weight the losses of different categories according to their corresponding number of training instances. We calculate , where denotes the number of instances for category . We normalize by dividing by the mean of all , namely , and cap the values between 0.01 and 5. is set to 1 for the background class. Models (6) and (7) are both initialized with model (1). Model (6) fine-tunes all parameters in the network except for Conv1 and Conv2. Model (7) only fine-tunes the last fully connected layer, namely and in Sec. 3.1 of the main text, and is set to 0.999.
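The normalize-and-cap procedure can be sketched as follows. This is an illustration under an explicit assumption: the per-category weight is taken to be the effective-number weight of [6], (1 - beta) / (1 - beta^n), which reduces to 1/n when beta = 1. The function name and the toy counts are hypothetical, and the background weight (set to 1 in the text) is handled outside this function.

```python
def class_weights(counts, beta=0.999, lo=0.01, hi=5.0):
    """Per-category loss weights, mean-normalized and capped (sketch).

    counts: number of training instances per category (all > 0).
    beta:   effective-number hyper-parameter of [6] (assumed form).
    """
    raw = [1.0 / n if beta == 1 else (1.0 - beta) / (1.0 - beta ** n)
           for n in counts]
    mean = sum(raw) / len(raw)
    # Divide by the mean weight, then cap to [lo, hi].
    return [min(hi, max(lo, w / mean)) for w in raw]

w_inv = class_weights([1, 1000], beta=1)
w_cb = class_weights([1, 10000], beta=0.999)
```

The cap matters in practice: without it, the weight ratio between the rarest and the most frequent category would span several orders of magnitude.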

Fig. 5 (left) shows more settings we have tried for loss re-weighting. We tried the best practice of [6], {=0.999, focal, =0.5}, by setting #bg=3#fg, but only got 14.57% mAP. {=0.999, softmax}=23.07% indicates that softmax works better for Faster R-CNN. So our model (6) in Tab. 1 is an improved version of {=1, softmax} with weights truncated to [0.01, 5]. We further tried adding weight truncation to ={0.9, 0.99, 0.999}, loss={softmax, focal}, and set =1, =2 (the loss for =0.5 is too small), and finally found that {=1, softmax, truncated} (model (7)) works best.

Figure 5: Settings we tried for [6] and [21].

Focal loss

Focal loss [24] re-weights the cost at image-level for classification. We directly apply Sigmoid focal loss at proposal-level. Similar to models (6) and (7), models (8) and (9) are initialized with model (1). Then we finetune the whole backbone and classifier () respectively.
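For reference, a minimal sketch of the sigmoid focal loss of [24] for a single binary target is given below. The default values alpha = 0.25 and gamma = 2 are those recommended in [24] and are not necessarily the settings used in our experiments; the function name is hypothetical. At proposal level, this loss would be evaluated once per category for each proposal.

```python
import math

def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss for one (logit, binary target) pair (sketch)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    # p_t is the predicted probability of the true label.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # Well-classified examples (p_t close to 1) are down-weighted.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

loss_uncertain = sigmoid_focal_loss(0.0, 1)
loss_confident = sigmoid_focal_loss(5.0, 1)
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the usual binary cross-entropy, which is a convenient sanity check.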

Nearest class mean classifier (NCM)

NCM is another commonly used approach that first computes the mean feature for each class on the training set. During inference, the 1-NN algorithm is applied with cosine similarity on normalized mean features [21, 36]. Thus, for object detection, with the trained Faster R-CNN model (1), we first calculate the mean feature of the proposals of each class on the training set, except for the background class. At the inference phase, features for all the proposals are extracted. We then calculate the cosine similarity of all the proposal features with the class centers. We apply softmax over the similarities of all categories to get a probability vector for normal classes. To recognize background proposals, we directly take the probability of the background class from model (1), and update with . We try both the FC feature just before the classifier (model (10)) and the Conv feature extracted by ROI-align (model (11)) as proposal features.
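The NCM scoring step can be sketched as below. The cosine-similarity and softmax steps follow the description above; the way the background probability is combined with the normal-class probabilities is an assumption made for this sketch (scaling by 1 - p_bg), and the function names are hypothetical. Feature vectors and class means are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ncm_scores(feature, class_means, p_bg):
    """Normal-class probabilities from nearest class means (sketch).

    feature:     proposal feature vector.
    class_means: one mean feature per normal class.
    p_bg:        background probability taken from the original model.
    """
    sims = [cosine(feature, m) for m in class_means]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    # Assumed combination rule: scale by the foreground probability.
    return [(1.0 - p_bg) * e / total for e in exps]

scores = ncm_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]], p_bg=0.5)
```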

τ-normalization

τ-normalization [21] directly scales the classifier weights by $\tilde{w}_i = w_i / \|w_i\|^{\tau}$, where $\tau$ is a hyper-parameter and $\|\cdot\|$ denotes the $L_2$ norm. It achieves state-of-the-art performance on long-tail classification [21]. For model (13), we first obtain results from both the original model and the τ-normed model. The original model is good at categorizing background. Thus, if a proposal is categorized as background by the original model, we select the results of the original model for this proposal. Otherwise, the τ-norm results are selected. Besides this, we designed multiple ways to deal with bg (the background class) (Fig. 5, red bars), and found that the above way performs best. We also searched the τ value on the val set, and found that τ=1 is the best (Fig. 5, right).
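The weight scaling of [21] and the background selection rule described above can be sketched as follows. The scaling divides each class weight vector by its L2 norm raised to a temperature power, as defined in [21]; the function names and the assumption that index 0 is the background class are hypothetical, and weight vectors are assumed to be non-zero.

```python
import math

def tau_normalize(weights, tau):
    """Scale each class weight vector by 1 / ||w||^tau (sketch)."""
    out = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        out.append([w / norm ** tau for w in row])
    return out

def pick_result(orig_probs, normed_probs, bg_index=0):
    """Keep the original output when it predicts background."""
    top = max(range(len(orig_probs)), key=orig_probs.__getitem__)
    return orig_probs if top == bg_index else normed_probs

unit = tau_normalize([[3.0, 4.0]], tau=1)
same = tau_normalize([[3.0, 4.0]], tau=0)
```

A temperature of 0 leaves the weights unchanged, and a temperature of 1 gives every class a unit-norm weight vector, so intermediate values interpolate between the original and fully normalized classifiers.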

7.2 How to train our model

ID Mode Part mAP AP_1 AP_2 AP_3 AP_4 AP_r AP_c AP_f
(1) train fc-cls 23.79 8.16 24.42 23.35 29.26 14.36 23.04 28.50
(2) train head 21.18 9.34 21.32 20.94 25.69 12.39 20.67 25.31
(3) tune head 23.88 8.90 23.96 23.78 29.44 14.19 23.08 28.75
(4) tune all 24.02 8.91 24.86 23.49 29.06 14.81 23.36 28.52
Table 5: Different ways to train models. Mode “train” means train from random initialization. Mode “tune” means finetune from trained model (1). Part fc-cls, head, and all indicate the last classification FC layer, the whole classification head (2FC+fc-cls), and the whole backbone except for Conv1 and Conv2. is set to 1 here so that the results are lower than that in the main paper where .

There are several options to train a model with our proposed BAGS module. As shown in Tab.5, we try different settings with . Since adding categories others changes the dimension of classification outputs, we need to randomly initialize the classifier weights and bias . So for model (1), following [21] to decouple feature learning and classifier, we fix all the parameters for feature extraction and only train the classifier with parameters and . For model (2), we fix the backbone parameters and train the whole classification head together (2 FC and ). It is worth noticing that the 2 FC layers are initialized by model (1), while are randomly initialized. This drops mAP by 2.6%, which may be caused by the inconsistent initialization of feature and classifier. Therefore, we try to train and first with settings for model (1), and fine-tune the classification head (model (3)) and all backbones except for Conv1 and Conv2 (model (4)) respectively. Fine-tuning improves mAP slightly. However, taking the extra training time into consideration, we choose to take the setting of model (1) to directly train parameters for classifier only in all the other experiments.

7.3 Comparison with winners of LVIS 2019

Since the evaluation server for the LVIS test set is closed, all results in this paper are obtained on the val set. There are two winners: lvlvisis and strangeturtle. We compare with lvlvisis in Tab. 3 based on their report [39], and our results surpass theirs by a large margin. For strangeturtle, their Equalization Loss [37] (released on 12/11/2019) replaces softmax with sigmoid for classification and blocks some back-propagation for tail classes. With the Mask R-CNN R50 baseline (mAP 20.68%), Equalization Loss achieves 23.90% with COCO pre-training (vs. 26.25% for ours). Our method performs much better on tail classes (AP_r 11.70% [37] vs. 17.97% for ours). They also try to decrease the suppression effect of head classes over tail classes, but using sigmoid completely discards all suppression among categories, even though some of it is useful for suppressing false positives. Without bells and whistles, our method outperforms both winners on the val set.

7.4 Results on COCO-LT

Figure 6: We align 80 categories of COCO with 1230 categories of LVIS, and sample corresponding number of instances for each COCO category.

To further verify the generalization ability of our BAGS, we construct a long-tail distribution dataset COCO-LT by sampling images and annotations from COCO [25] train 2017 split.

7.4.1 Dataset construction

To get a long-tail data distribution similar to that of LVIS, we first sort all categories of LVIS and COCO by their corresponding number of training instances. As shown in Fig. 6, we align the 80 categories of COCO with the 1230 categories of LVIS, and set the target number of training instances per COCO category to the number of training instances of its corresponding LVIS category. Then, we sample the target number of instances for each COCO category. We make use of as many instances in a sampled image as possible. Training instances in a sampled image are ignored only when there are already plenty of instances belonging to that category. In this way, we sample a subset of COCO that follows a long-tail distribution just like LVIS. COCO-LT contains only 9100 training images of 80 categories, with 64504 training instances in total. For validation, we use the COCO val 2017 split, which includes 5000 images.
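The category alignment can be sketched as below. The exact alignment rule is not spelled out in the text, so this sketch makes two assumptions that should be read as illustrative only: COCO categories are mapped to LVIS categories at proportionally spaced ranks, and a target is capped by the number of instances actually available in COCO. The function name and the toy counts are hypothetical.

```python
def target_counts(coco_counts, lvis_counts):
    """Target instance number per COCO category (sketch).

    coco_counts: dict mapping COCO category name to instance count.
    lvis_counts: list of per-category instance counts in LVIS.
    """
    lvis_sorted = sorted(lvis_counts, reverse=True)
    order = sorted(coco_counts, key=coco_counts.get, reverse=True)
    k, n = len(order), len(lvis_sorted)
    targets = {}
    for rank, cat in enumerate(order):
        # Assumed rule: proportionally spaced ranks in the LVIS ordering.
        idx = rank * (n - 1) // (k - 1) if k > 1 else 0
        # Assumed cap: never ask for more instances than COCO provides.
        targets[cat] = min(coco_counts[cat], lvis_sorted[idx])
    return targets

targets = target_counts({"a": 100, "b": 50, "c": 10}, [1000, 40, 30, 20, 5])
```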

7.4.2 Main results

We compare with Faster R-CNN and Mask R-CNN (R50-FPN backbone) on the above COCO-LT dataset. The results are shown in Tab. 6. Since the number of training images is small, we initialize the baseline models with models trained on LVIS. As we can see, our models bring more than 2% improvement in mAP for both bounding box and mask. Importantly, they gain large improvements on tail classes.

Model mAP AP_1 AP_2 AP_3 AP_4
Faster R-CNN 20.3 0.1 12.9 24.3 26.7
Ours 22.5 13.0 18.6 24.1 26.4
Mask R-CNN bbox 19.1 0.0 11.1 22.9 26.4
Ours 21.5 13.4 17.7 22.5 26.0
Mask R-CNN segm 18.0 0.0 11.5 21.8 23.3
Ours 20.3 3.4 18.9 21.7 23.0
Table 6: Results on the COCO-LT dataset. A ResNet-50-FPN backbone is used for both Faster R-CNN and Mask R-CNN.

References

  • [1] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1, §1, §2, §2, §5.2.
  • [2] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma (2019) Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413. Cited by: §1, §2, §3.2.
  • [3] H. Chen, Y. Wang, G. Wang, and Y. Qiao (2018) Lstd: a low-shot transfer detector for object detection. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [4] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4974–4983. Cited by: §1, §2, §5.3.
  • [5] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §7.1.1.
  • [6] Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie (2019) Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9268–9277. Cited by: §1, §2, §3.2, Figure 5, §7.1.2, §7.1.2.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.
  • [8] C. Drummond, R. C. Holte, et al. (2003) C4. 5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, Vol. 11, pp. 1–8. Cited by: §1, §2.
  • [9] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6569–6578. Cited by: §2.
  • [10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1.
  • [11] Q. Fan, W. Zhuo, and Y. Tai (2019) Few-shot object detection with attention-rpn and multi-relation detector. arXiv preprint arXiv:1908.01998. Cited by: §2.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2015) Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence 38 (1), pp. 142–158. Cited by: §2.
  • [13] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2, §2, §3.1, §3.2.
  • [14] S. Guerriero, B. Caputo, and T. Mensink (2018) Deepncm: deep nearest class mean classifiers. Cited by: §2.
  • [15] A. Gupta, P. Dollar, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5356–5364. Cited by: §1, §1, §2, §2, §3.2, §5.1, §5.3, Table 1, Table 3, §7.1.2.
  • [16] H. Han, W. Wang, and B. Mao (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pp. 878–887. Cited by: §1, §2.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §2, §3.1, §5.3.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [19] C. Huang, Y. Li, C. L. Chen, and X. Tang (2019) Deep imbalanced learning for face recognition and attribute prediction. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2, §3.2.
  • [20] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-shot object detection via feature reweighting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8420–8429. Cited by: §2.
  • [21] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis (2019) Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217. Cited by: §1, §2, §3.2, §5.2, §5.2, Table 1, Figure 5, §7.1.2, §7.1.2, §7.2.
  • [22] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §1, §2.
  • [23] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1.
  • [24] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §1, §2, §2, §3.2, §5.2, Table 1, §7.1.2.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §5.5, §7.4.
  • [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
  • [27] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §1, §2, §5.2, §7.1.2.
  • [28] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka (2013) Distance-based image classification: generalizing to new classes at near-zero cost. IEEE transactions on pattern analysis and machine intelligence 35 (11), pp. 2624–2637. Cited by: §2.
  • [29] W. Ouyang, X. Wang, C. Zhang, and X. Yang (2016) Factors in finetuning deep model for object detection with long-tail distribution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 864–873. Cited by: §2.
  • [30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §7.1.1.
  • [31] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1, §2.
  • [32] W. J. Reed (2001) The pareto, zipf and other power laws. Economics letters 74 (1), pp. 15–19. Cited by: §1.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §1, §1, §2, §2, §2, §3.1.
  • [34] L. Shen, Z. Lin, and Q. Huang (2016) Relay backpropagation for effective learning of deep convolutional neural networks. In European conference on computer vision, pp. 467–482. Cited by: §1, §2.
  • [35] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019) Meta-weight-net: learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379. Cited by: §1, §2, §3.2.
  • [36] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §5.2, §7.1.2.
  • [37] J. Tan, C. Wang, Q. Li, and J. Yan (2019) Equalization loss for large vocabulary instance segmentation. arXiv preprint arXiv:1911.04692. Cited by: §7.3.
  • [38] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §2.
  • [39] T. Wang, Y. Li, B. Kang, J. Li, J. H. Liew, S. Tang, S. Hoi, and J. Feng (2019) Classification calibration for long-tail instance segmentation. arXiv preprint arXiv:1910.13081. Cited by: §2, §3.2, §5.1, §5.3, Table 3, §7.3.
  • [40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §1, §5.2.
  • [41] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen (2019) CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5217–5226. Cited by: §2.
  • [42] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.
  • [43] X. Zhou, J. Zhuo, and P. Krahenbuhl (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859. Cited by: §2.