Object detection aims to localize objects in an image and provide their class labels. As one of the most fundamental tasks in computer vision, it serves as a key component for many vision applications, including instance segmentation, human pose analysis, and visual reasoning. The significance of the object detection problem, together with the rapid development of deep neural networks, has led to substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, which encompass rectangular areas of an image, serve as the basic element for processing. They describe target locations of objects throughout the stages of an object detector, from anchors and proposals to final predictions. Based on these bounding boxes, features are extracted and used for purposes such as object classification and location refinement. The prevalence of the bounding box representation can partly be attributed to common metrics for object detection performance, which account for the overlap between estimated and ground truth bounding boxes of objects. Another reason lies in its convenience for feature extraction in deep networks, because of its regular shape and the ease of subdividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they provide only a coarse localization of objects that does not conform to an object’s shape and pose. Features extracted from the regular cells of a bounding box may thus be heavily influenced by background content or uninformative foreground areas that contain little semantic information. This may result in lower feature quality that degrades classification performance in object detection.
In this paper, we propose a new representation, called RepPoints, that provides more fine-grained localization and facilitates classification. Illustrated in Fig. 1, RepPoints are a set of points that learn to adaptively position themselves over an object in a manner that circumscribes the object’s spatial extent and indicates semantically significant local areas. The training of RepPoints is driven jointly by object localization and recognition targets, such that the RepPoints are tightly bound by the ground-truth bounding box and guide the detector toward correct object classification. This adaptive and differentiable representation can be coherently used across the different stages of a modern object detector, and does not require the use of anchors to sample over a space of bounding boxes.
RepPoints differs from existing non-rectangular representations for object detection, which are all built in a bottom-up manner [44, 24, 54]. These bottom-up representations identify individual points (e.g., bounding box corners or object extremities) and rely on handcrafted clustering to group them into object models. Furthermore, these representations either remain axis-aligned like bounding boxes [44, 24] or require ground-truth object masks as additional supervision. In contrast, RepPoints are learned in a top-down fashion from the input image / object features, allowing for end-to-end training and producing fine-grained localization without additional supervision.
To demonstrate the power of the RepPoints representation, we present an implementation within a deformable ConvNets framework, which provides recognition feedback suitable for guiding the adaptive sampling while maintaining convenience in feature extraction. This anchor-free detection system is found to have strong classification ability while also accurately localizing objects. Without multi-scale training and testing, our clean detector achieves 42.8 AP and 65.0 AP50 on the COCO benchmark, not only surpassing all existing anchor-free detectors but also performing on par with state-of-the-art anchor-based baselines.
2 Related Work
Bounding boxes for the object detection problem.
The bounding box has long been the dominant form of object representation in the field of object detection. One reason for its prevalence is that a bounding box is convenient to annotate with little ambiguity, while providing sufficiently accurate localization for the subsequent recognition process. This may explain why the major benchmarks all utilize annotations and evaluations based on bounding boxes [8, 29, 23]. In turn, these benchmarks motivate object detection methods to use the bounding box as their basic representation in order to align with the evaluation protocols.
Features for recognition, both before and during the deep learning era [22, 40, 42, 16], are typically extracted from an input patch with a regular grid form. It is thus convenient to use the bounding box representation to facilitate feature extraction [12, 11, 38].
Although the proposed RepPoints has an irregular form, we show that it can be amenable to convenient feature extraction. Our system utilizes RepPoints in conjunction with deformable convolution , which naturally aligns with RepPoints in that it aggregates information from input features at several sample points. Besides, a rectangular pseudo box can be readily generated from RepPoints (see Section 3.2), allowing the new representation to be used with object detection benchmarks.
Bounding boxes in modern object detectors.
The best-performing object detectors to date generally follow a multi-stage recognition paradigm [27, 4, 14, 26, 2, 41, 30], and the bounding box representation appears in almost all stages: 1) as pre-defined anchors that serve as hypotheses over the bounding box space; 2) as refined object proposals connecting successive recognition stages; and 3) as the final localization targets.
1) Bounding boxes as anchors. Significant improvements in object detection have been achieved through the use of dense predefined anchors. Most existing detectors [38, 28] use anchors of different aspect ratios and scales. Recently, there has been a trend toward better modeling of anchors, such as by guided anchoring, learning anchor functions, and adaptive optimization.
In contrast, with the proposed RepPoints, we show that an anchor-free detector can reach performance comparable to modern detectors that rely on a large set of dense anchors.
2) Bounding boxes as object proposals. Progressive refinement of object proposals is crucial to the success of multi-stage object detectors. The improved localization increases the quality of extracted features, leading to better detection performance as more recognition stages are added. However, the mechanics of bounding box refinement are somewhat non-intuitive: the width/height change is regressed through an exponential function of the input object features.
With RepPoints, refinement of object localization corresponds to adjusting the positions of its sample points towards more semantically meaningful features, which both aids in classification and naturally determines the object’s spatial extent.
3) Bounding boxes as the final target localization. Existing methods usually adopt the same bounding box regression method as used for object proposals to produce the final target localization. Recently, some alternatives have been presented for this purpose, using IoU-Net , Generalized IoU loss  or consistent optimization .
Other representations for object detection.
To address limitations of rectangular bounding boxes, there have been some attempts to develop more flexible object representations. These include an elliptic representation for pedestrian detection  and a rotated bounding box to better handle rotational variations [17, 55].
Other works aim to represent an object in a bottom-up manner. Early bottom-up representations include DPM and Poselet. Recently, bottom-up approaches to object detection have been explored with deep networks [24, 54]. CornerNet first predicts top-left and bottom-right corners and then employs a specialized grouping method to obtain the bounding boxes of objects. However, the two opposing corner points still essentially model a rectangular bounding box. ExtremeNet locates the extreme points of objects in the x- and y-directions with supervision from ground-truth mask annotations. In general, bottom-up detectors benefit from a smaller hypothesis space (for example, CornerNet and ExtremeNet both detect 2-d points instead of directly detecting a 4-d bounding box) and potentially finer-grained localization. However, they have limitations such as relying on handcrafted clustering or post-processing steps to compose whole objects from the detected points.
Similar to these bottom-up works, RepPoints is also a flexible object representation. However, the representation is constructed in a top-down manner, without the need for handcrafted clustering steps. RepPoints can automatically learn extreme points and key semantic points without supervision beyond ground-truth bounding boxes, unlike ExtremeNet  where additional mask supervision is required.
Deformation modeling in object recognition.
One of the most fundamental challenges for visual recognition is to recognize objects with various geometric variations. To effectively model such variations, a possible solution is to make use of bottom-up composition of low-level components. Representative detectors along this direction include DPM  and Poselet . An alternative is to implicitly model the transformations in a top-down manner, where a lightweight neural network block is applied on input features, either globally  or locally .
RepPoints is inspired by these works, especially the top-down deformation modeling approach . The main difference is that we aim at developing a flexible object representation for accurate geometric localization in addition to semantic feature extraction. In contrast, both the deformable convolution and deformable RoI pooling methods are designed to improve feature extraction only. The inability of deformable RoI pooling to learn accurate geometric localization is examined in Section 4. In this sense, we expand the usage of adaptive sample points in previous geometric modeling methods [19, 4] to include finer localization of objects.
3 The RepPoints Representation
We first review the bounding box representation and its use within multi-stage object detectors. This is followed by a description of RepPoints and its differences from bounding boxes.
3.1 Bounding Box Representation
The bounding box is a 4-d representation encoding the spatial location of an object, B = (x, y, w, h), with (x, y) denoting the center point and w, h denoting the width and height. Due to its simplicity and convenience of use, modern object detectors heavily rely on bounding boxes for representing objects at various stages of the detection pipeline.
Review of Multi-Stage Object Detectors
The best performing object detectors usually follow a multi-stage recognition paradigm, where object localization is refined stage by stage. The role of the object representation through the steps of this pipeline is as follows:
At the beginning, multiple anchors are hypothesized to cover a range of bounding box scales and aspect ratios. In general, high coverage is obtained through dense anchors over the large 4-d hypothesis space. For instance, 45 anchors per location are utilized in RetinaNet .
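To make the dense-coverage point concrete, the following NumPy sketch enumerates the (w, h) anchors hypothesized at a single feature-map location from a few scales and aspect ratios; the base size, ratios, and scales here are illustrative values, not any particular detector's exact configuration:

```python
import numpy as np

def make_anchors(base=32, ratios=(0.5, 1.0, 2.0),
                 scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Generate (w, h) pairs for one feature-map location.

    Each anchor preserves the area implied by base*scale while varying the
    aspect ratio h/w; tiling these over every position yields the dense
    coverage of the 4-d bounding box space described above.
    """
    anchors = []
    for r in ratios:
        for s in scales:
            area = (base * s) ** 2
            w = np.sqrt(area / r)  # solve w*h = area with h = r*w
            h = w * r
            anchors.append((w, h))
    return np.array(anchors)
```

With 3 ratios and 3 scales this yields 9 anchors per location per pyramid level; summed over 5 FPN levels, that matches the 45 hypotheses per location mentioned above.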
For an anchor, the image feature at its center point is adopted as the object feature, which is then used to produce a confidence score about whether the anchor is an object or not, as well as the refined bounding box by a bounding box regression process. The refined bounding box is denoted as “bbox proposals (S1)”.
In the second stage, a refined object feature is extracted from the refined bounding box proposal, usually by RoI-pooling  or RoI-Align . For the two-stage framework , the refined feature will produce the final bounding box target by bounding box regression. For the multi-stage approach , the refined feature is used to generate intermediate refined bounding box proposals (S2), also by bounding box regression. This step can be iterated multiple times before producing the final bounding box target.
In this framework, bounding box regression plays a central role in progressively refining object localization and object features. We formulate the process of bounding box regression in the following paragraph.
Bounding Box Regression
Conventionally, a 4-d regression vector (Δx_p, Δy_p, Δw_p, Δh_p) is predicted to map the current bounding box proposal B_p = (x_p, y_p, w_p, h_p) into a refined bounding box B_r = (x_r, y_r, w_r, h_r), where

x_r = x_p + w_p Δx_p,    y_r = y_p + h_p Δy_p,
w_r = w_p · exp(Δw_p),   h_r = h_p · exp(Δh_p).

Given the ground-truth bounding box of an object, B_t = (x_t, y_t, w_t, h_t), the goal of bounding box regression is to have B_r and B_t as close as possible. Specifically, in the training of an object detector, we use the distance between the predicted 4-d regression vector (Δx_p, Δy_p, Δw_p, Δh_p) and the expected 4-d regression vector

((x_t − x_p)/w_p, (y_t − y_p)/h_p, log(w_t/w_p), log(h_t/h_p))

as the learning target, using a smooth ℓ1 loss.

This bounding box regression process is widely used in existing object detection methods. It performs well in practice when the required refinement is small, but it tends to perform poorly when there is a large distance between the initial representation and the target. Another issue lies in the scale difference between (Δx_p, Δy_p) and (Δw_p, Δh_p), which requires tuning of their loss weights for optimal performance.
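The standard parameterization described above can be sketched as follows; `encode` computes the expected regression vector for a proposal/ground-truth pair and `decode` applies a predicted vector. This is a minimal illustration of the common scheme, not the authors' code:

```python
import numpy as np

def encode(proposal, gt):
    """Expected regression vector mapping `proposal` onto `gt`.

    Boxes are (x, y, w, h) with (x, y) the center. Note the mixed scales:
    the first two terms are normalized translations, the last two are
    log-space size changes.
    """
    xp, yp, wp, hp = proposal
    xt, yt, wt, ht = gt
    return np.array([(xt - xp) / wp,
                     (yt - yp) / hp,
                     np.log(wt / wp),
                     np.log(ht / hp)])

def decode(proposal, delta):
    """Apply a predicted regression vector to refine the proposal."""
    xp, yp, wp, hp = proposal
    dx, dy, dw, dh = delta
    return np.array([xp + wp * dx,
                     yp + hp * dy,
                     wp * np.exp(dw),
                     hp * np.exp(dh)])
```

By construction, decoding the encoded vector recovers the ground-truth box exactly; the practical difficulty noted above arises when the network must predict large deltas.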
As previously discussed, the 4-d bounding box is a coarse representation of object location. The bounding box representation considers only the rectangular spatial scope of an object, and does not account for shape and pose and the positions of semantically important local areas, which could be used toward finer localization and better object feature extraction.
To overcome the above limitations, RepPoints instead models a set of adaptive sample points:

R = {(x_k, y_k)}, k = 1, …, n,

where n is the total number of sample points used in the representation. In our work, n is set to 9 by default.
Progressively refining the bounding box localization and feature extraction is important for the success of multi-stage object detection methods. For RepPoints, the refinement can be expressed simply as

R_r = {(x_k + Δx_k, y_k + Δy_k)}, k = 1, …, n,

where {(Δx_k, Δy_k)} are the predicted offsets of the new sample points with respect to the old ones. We note that this refinement does not face the problem of scale differences among the bounding box regression parameters, since all offsets are at the same scale in the refinement process of RepPoints.
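Because the refinement is a pure per-point translation, it reduces to adding same-scale offsets, in contrast to the mixed translation/log-size scales of bounding box regression. A minimal sketch:

```python
import numpy as np

def refine_points(points, offsets):
    """Refine a RepPoints set.

    points, offsets: (n, 2) arrays holding (x_k, y_k) and (dx_k, dy_k).
    Every term is a spatial offset at the same scale, so no per-term
    loss reweighting is needed when supervising this step.
    """
    return points + offsets
```

In the detector, the offsets come from a prediction head; here they are plain arrays for illustration.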
Converting RepPoints to bounding box
To take advantage of bounding box annotations in the training of RepPoints, as well as to evaluate RepPoints-based object detectors, a method is needed for converting RepPoints into a bounding box. We perform this conversion using a pre-defined converting function T : R_P → B_P, where R_P denotes the RepPoints for object P and B_P = T(R_P) represents a pseudo box.
Three converting functions are considered for this purpose:
T = T1: Min-max function. A min-max operation over both axes is performed over all the RepPoints to determine B_P, equivalent to the tightest bounding box over the sample points.
T = T2: Partial min-max function. A min-max operation over a subset of the sample points is performed over both axes to obtain the rectangular box B_P.
T = T3: Moment-based function. The mean value and the second-order moment of the RepPoints are used to compute the center point and scale of the rectangular box B_P, where the scale is multiplied by globally-shared learnable multipliers λx and λy.
These functions are all differentiable, enabling end-to-end learning when inserted into an object detection system. In our experiments, we found them to work comparably well.
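The three converting functions can be sketched in NumPy as below. The choice of subset in the partial min-max function and the fixed multipliers in the moment-based function are assumptions for illustration (in the detector the multipliers are learned); boxes are returned as (x1, y1, x2, y2):

```python
import numpy as np

def minmax_box(pts):
    """T1: tightest axis-aligned box over all sample points."""
    return np.concatenate([pts.min(axis=0), pts.max(axis=0)])

def partial_minmax_box(pts, k=4):
    """T2: min-max over a subset of the points (here: the first k,
    an assumed choice for illustration)."""
    return minmax_box(pts[:k])

def moment_box(pts, lam_x=1.0, lam_y=1.0):
    """T3: center from the mean, scale from the second-order moment
    (standard deviation), scaled by multipliers lambda_x, lambda_y
    (learnable in the detector, fixed here)."""
    cx, cy = pts.mean(axis=0)
    sx, sy = pts.std(axis=0)
    return np.array([cx - lam_x * sx, cy - lam_y * sy,
                     cx + lam_x * sx, cy + lam_y * sy])
```

All three are differentiable in the sample point coordinates (min/max, mean, and moments all admit gradients), which is what allows the localization loss on the pseudo box to drive the points end to end.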
The learning of RepPoints is driven by both an object localization loss and an object recognition loss. To compute the object localization loss, we first convert RepPoints into a pseudo box using the previously discussed transformation function T. Then, the difference between the converted pseudo box and the ground-truth bounding box is computed. In our system, we use the smooth ℓ1 distance between the top-left and bottom-right points to represent the localization loss. This smooth ℓ1 distance does not require the tuning of different loss weights as done in computing the distance between bounding box regression vectors (i.e., for (x, y) and (w, h)). Figure 3 indicates that when the training is driven by this combination of object localization and object recognition losses, the extreme points and semantic key points of objects are automatically learned (the min-max function T1 is used in transforming RepPoints to a pseudo box).
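A sketch of this localization loss: smooth ℓ1 over the corner coordinates of the pseudo box against the ground-truth box, with boxes as (x1, y1, x2, y2). The transition point beta = 1.0 is an assumed default, not taken from the source:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Elementwise smooth-l1: quadratic near zero, linear beyond beta."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def localization_loss(pseudo_box, gt_box):
    """Sum of smooth-l1 distances over the top-left and bottom-right
    corners. All four terms are point coordinates at the same scale,
    so no per-term loss weights are needed."""
    return smooth_l1(np.asarray(pseudo_box) - np.asarray(gt_box)).sum()
```

Contrast this with the bounding box regression target, where the translation and log-size terms live at different scales and typically need reweighting.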
4 RPDet: an Anchor Free Detector
We design an anchor-free object detector that utilizes RepPoints in place of bounding boxes as its basic representation. Within a multi-stage pipeline, the object representation evolves from an initial center point into successively refined sets of RepPoints.
Our RepPoints Detector (RPDet) is constructed with two recognition stages based on deformable convolution, as illustrated in Figure 2. Deformable convolution pairs nicely with RepPoints, as its convolutions are computed on an irregularly distributed set of sample points, and conversely its recognition feedback can guide the training of the positioning of these points. In this section, we present the design of RPDet and discuss its relationship to and differences from existing object detectors.
Center point based initial object representation.
While predefined anchors dominate the representation of objects in the initial stage of object detection, we follow YOLO  and DenseBox  by using center points as the initial representation of objects, which leads to an anchor-free object detector.
An important benefit of the center point representation lies in its much tighter hypothesis space compared to the anchor based counterparts. While anchor based approaches usually rely on a large number of multi-ratio and multi-scale anchors to ensure dense coverage of the large 4-d bounding box hypothesis space, a center point based approach can more easily cover its 2-d space. In fact, all objects will have center points located within the image.
Center point based methods also face problems that limit their prevalence in modern object detectors. One is that two different objects may be located at the same position in a feature map, resulting in ambiguity of the recognition targets. In previous methods, this is mainly addressed by producing multiple targets at each position, which introduces another assignment ambiguity (if the center points of multiple ground-truth objects are located at the same feature map position, only one randomly chosen ground-truth object is assigned as the target of that position). In RPDet, we show that this issue can be greatly alleviated by using the FPN structure, for the following reasons: first, objects of different scales are assigned to different feature levels, which resolves the case of objects of different scales sharing the same center point location; second, FPN has a high-resolution feature map for small objects, which also reduces the chance of two objects having centers located at the same feature position. In fact, we observe that only 1.1% of objects in the COCO dataset suffer from the issue of center points located at the same position when FPN is used.
Another issue is that it is hard for center point based methods to predict accurate object localization, due to the large variation in the spatial scope of objects. This issue can also be alleviated in our framework, through the flexibility of the RepPoints representation and its effective refinement described above. In our experiments, we demonstrate that our proposed anchor-free object detector can be as effective as state-of-the-art anchor-based methods.
It is worth noting that the center point representation can be viewed as a special RepPoints configuration, where only a single fixed sample point is used, thus maintaining a coherent representation throughout the proposed detection framework.
Utilization of RepPoints.
As shown in Figure 2, RepPoints serve as the basic object representation throughout our detection system. Starting from the center points, the first set of RepPoints is obtained via a 3×3 regular convolution. The learning of these RepPoints is driven by two objectives: 1) the top-left and bottom-right points distance loss between the induced pseudo box and the ground-truth bounding box; 2) the object recognition loss of the subsequent stage. As illustrated in Figure 3, extreme and key points are automatically learned. The second set of RepPoints represents the final object localization, which is refined from the first set by the refinement step described above. Driven by the points distance loss alone, this second set of RepPoints aims to learn finer object localization.
Relation to deformable RoI pooling . As mentioned in Section 2, deformable RoI pooling plays a different role in object detection compared to the proposed RepPoints. Basically, RepPoints is a geometric representation of objects, reflecting more accurate semantic localization, while deformable RoI pooling is geared towards learning stronger appearance features of objects.
In fact, deformable RoI pooling cannot learn sample points representing accurate localization of objects. This can be seen from the following contradiction. If the deformable RoI pooling method indeed learns an accurate geometric representation of objects, the deformable RoI pooling layer would produce the same appearance features for two nearby proposals of the same object. In this case, the object detectors would fail, due to the inability to differentiate these two proposals. However, the deformable RoI pooling method has been shown to well differentiate two nearby proposals . This shows that the deformable RoI pooling cannot learn accurate localization of objects. Please see the appendix for a more detailed discussion.
We also note that deformable RoI pooling can be complementary to RepPoints, as indicated in Table 6.
Our FPN structure follows [27] to produce 5 feature pyramid levels from stage 3 (downsampling ratio of 8) to stage 7 (downsampling ratio of 128). Specifically, a box B with width w and height h corresponds to pyramid level s(B) = ⌊log₂(√(wh)/4)⌋, clipped to the range [3, 7]. For each ground-truth box B, we project its center point to each feature level l in the pyramid (l ∈ {3, …, 7}), where the projected bin is denoted as c_l(B). Then, we set c_{s(B)}(B) to be a positive bin and c_{s(B)−1}(B) and c_{s(B)+1}(B) as ignored bins (if s(B)−1 or s(B)+1 is not in [3, 7], i.e., s(B) equals 3 or 7, only one ignored label is assigned). All other unset bins are assigned negative labels.
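A sketch of the scale-based level assignment and center projection; the constant 4 in the level formula and the clipping behavior are reconstructed assumptions from the garbled source, not confirmed details:

```python
import numpy as np

def assign_level(w, h, lo=3, hi=7):
    """Pyramid level for a ground-truth box of size (w, h):
    s(B) = floor(log2(sqrt(w*h) / 4)), clipped to the available
    levels [lo, hi]. The divisor 4 is an assumption."""
    s = int(np.floor(np.log2(np.sqrt(w * h) / 4.0)))
    return min(max(s, lo), hi)

def project_center(x, y, level):
    """Project a box center onto the feature bin at a given level,
    where the stride of level l is 2**l."""
    stride = 2 ** level
    return int(x // stride), int(y // stride)
```

Under this rule, a 32×32 box lands on level 3 (stride 8) and a 512×512 box on level 7 (stride 128); the positive label goes to the projected bin at s(B), with the corresponding bins at adjacent levels ignored.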
5 Experiments

5.1 Implementation Details
We present experimental results of our proposed RPDet framework on the MS-COCO  detection benchmark, which contains 118k images for training, 5k images for validation (minival) and 20k images for testing (test-dev). All the ablation studies are conducted on minival with ResNet-50 , if not otherwise specified. The state-of-the-art comparison is reported on test-dev in Table 7. The code will be made available.
5.2 Ablation Study
RepPoints vs. bounding box.
To demonstrate the effectiveness of the proposed RepPoints, we compare the proposed RPDet to a version where RepPoints are all replaced by the regular bounding box representation. Specifically, the bounding box based version uses a single anchor as the initial object representation (the width and height of this anchor are a fixed multiple of the feature map stride). The intersection-over-union (IoU) criterion is used to assign classification labels to each bounding box. The bounding box regression method in Section 3.1 is used to obtain the first stage's bounding box proposals and the final bounding box targets. The RoIAlign operator is adopted to extract object features of the second stage. All other settings are the same as in the proposed RPDet method.
Table 1 shows that the change of object representation from bounding box to RepPoints can bring a +2.1 mAP improvement using ResNet-50  and a +2.0 mAP improvement using a ResNet-101  backbone, showing the advantage of the RepPoints representation over bounding boxes for object detection.
Table 5 (fragment): partial min-max converting function — AP 38.1, AP50 59.6, AP75 40.5.
Table 7 (excerpt): comparison with state-of-the-art detectors.

| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Faster R-CNN w. FPN | ResNet-101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Deep Regionlets | ResNet-101 | 39.3 | 59.8 | - | 21.7 | 43.7 | 50.9 |
| Mask R-CNN | ResNeXt-101 | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |
| LH R-CNN | ResNet-101 | 41.5 | - | - | 25.2 | 45.3 | 53.1 |
| Cascade R-CNN | ResNet-101 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
Supervision source for RepPoints learning.
RPDet uses both an object localization loss and an object recognition loss to drive the learning of the first set of RepPoints, which represents the object proposals of the first stage. Table 2 ablates the use of these two supervision sources in the learning of the object representations. As mentioned before, describing the geometric localization of objects is an important duty of a representation method. Without the object localization loss, it is hard for a representation to fulfill this duty, and removing it results in significant performance degradation of the object detector. For RepPoints, we observe a large drop of 4.5 mAP when the object localization supervision is removed, showing the importance of describing geometric localization for an object representation method.
Table 2 also demonstrates the benefit of including the object recognition loss in learning RepPoints (+0.7 mAP). The use of the object recognition loss can drive the RepPoints to locate themselves at semantically meaningful positions on an object, which leads to fine-grained localization and improves object feature extraction for the following recognition stage. Note that the object recognition loss does not benefit object detection with the bounding box representation (see the first block in Table 2), further demonstrating the advantage of RepPoints as a flexible object representation.
Anchor-free vs. anchor-based.
We first compare the center point based method (a special RepPoints configuration) and the prevalent anchor based method in representing initial object hypotheses, in Table 3. The center point based method surpasses the anchor based method by +1.4 mAP, likely because of its better coverage of ground-truth objects.
We also compare the proposed anchor-free detector based on RepPoints to RetinaNet (a popular one-stage anchor-based method), FPN with RoIAlign (a popular two-stage anchor-based method), and a YOLO-like detector adapted from the anchor-free method of YOLOv1, in Table 4. The proposed method outperforms both the RetinaNet and FPN methods, which utilize multiple anchors per scale and sophisticated anchor configurations (FPN). The proposed method also significantly surpasses the other anchor-free method (the YOLO-like detector), by +4.4 mAP and +4.1 mAP respectively, probably due to the flexible RepPoints representation and its effective refinement.
Converting RepPoints to pseudo box.
RepPoints act complementary to deformable RoI pooling .
Table 6 shows the effect of applying the deformable RoI pooling layer  to both bounding box proposals and RepPoints proposals. While applying the deformable RoI pooling layer to bounding box proposals brings a +0.7 mAP gain, applying it to RepPoints proposals also brings a +0.7 mAP gain, implying that the roles of deformable RoI pooling and the proposed RepPoints are complementary.
5.3 RepPoints Visualization
In Figure 3, we visualize the learned RepPoints and the corresponding detection results on several examples from the COCO minival set. It can be observed that RepPoints tend to be located at extreme points or key semantic points of objects. These points, distributed over the objects, are automatically learned without explicit supervision. The visualized results also indicate that the proposed RPDet, implemented here with the min-max transformation function, can effectively detect tiny objects.
5.4 State-of-the-art Comparison
We compare RPDet to state-of-the-art detectors on test-dev. Table 7 shows the results. Without any bells and whistles, RPDet achieves 42.8 AP on the COCO benchmark, which is on par with the 4-stage anchor-based Cascade R-CNN and outperforms all existing anchor-free approaches.
Moreover, without multi-scale training and testing, our clean anchor-free design achieves 65.0 AP50 (AP50 is argued in YOLOv3 to be a better detection metric), which surpasses all baselines by a significant margin. This improvement mostly comes from its performance on small objects. These observations are consistent with the findings in YOLOv3 on the advantage of using center point based label assignment.
In this paper, we propose RepPoints, a representation for object detection that models fine-grained localization information and identifies local areas significant for object classification. Based on RepPoints, we develop an object detector called RPDet that achieves competitive object detection performance without the need of anchors. Learning richer and more natural object representations like RepPoints is a direction that holds much promise for object detection.
-  L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In ECCV, pages 168–181. Springer, 2010.
-  Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
-  J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893. IEEE Computer Society, 2005.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, pages 2147–2154, 2014.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627–1645, 2010.
-  C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
-  R. Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
-  R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI, 37(9):1904–1916, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. PAMI, 29(4):671–686, 2007.
-  L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
-  B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, pages 784–799, 2018.
-  T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi. Consistent optimization for single-shot object detection. arXiv preprint arXiv:1901.06563, 2019.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
-  H. Law and J. Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, pages 734–750, 2018.
-  B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In CVPR, volume 1, pages 878–885. IEEE, 2005.
-  Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In ICCV, pages 2117–2125, 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
-  S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, pages 8759–8768, 2018.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016.
-  A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, pages 2277–2287, 2017.
-  D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. In ICCV, pages 4930–4939, 2017.
-  P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, pages 1990–1998, 2015.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
-  J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In CVPR, pages 7263–7271, 2017.
-  J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
-  H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  B. Singh and L. S. Davis. An analysis of scale invariance in object detection snip. In CVPR, pages 3578–3587, 2018.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
-  A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, pages 1653–1660, 2014.
-  L. Tychsen-Smith and L. Petersson. Denet: Scalable real-time object detection with directed sparse sampling. In ICCV, pages 428–436, 2017.
-  P. Viola, M. Jones, et al. Rapid object detection using a boosted cascade of simple features. CVPR, 1:511–518, 2001.
-  J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin. Region proposal by guided anchoring. In CVPR, 2019.
-  J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum. Learning to see physics via visual de-animation. In NIPS, pages 153–164, 2017.
-  Y. Wu and K. He. Group normalization. In ECCV, pages 3–19, 2018.
-  H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa. Deep regionlets for object detection. In ECCV, pages 798–814, 2018.
-  T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun. Metaanchor: Learning to detect objects with customized anchors. In NeurIPS, pages 318–328, 2018.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
-  S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In CVPR, pages 4203–4212, 2018.
-  Y. Zhong, J. Wang, J. Peng, and L. Zhang. Anchor box optimization for object detection. arXiv preprint arXiv:1812.00469, 2018.
-  X. Zhou, J. Zhuo, and P. Krähenbühl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
-  Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. Oriented response networks. In CVPR, pages 519–528, 2017.
-  C. Zhu, Y. He, and M. Savvides. Feature selective anchor-free module for single-shot object detection. In CVPR, 2019.
Appendix A1 Relationship between RepPoints and Deformable RoI pooling
In this section, we explain the differences between our method and deformable RoI pooling in greater detail. We first describe the translation sensitivity of the regression step in the object detection pipeline. Then, we discuss how deformable RoI pooling works and why, unlike the proposed RepPoints representation, it does not provide a geometric representation of objects.
We explain the translation sensitivity of the regression step in the context of bounding boxes. Denote a rectangular bounding box proposal before regression as $\mathcal{B}_p = (x_p, y_p, w_p, h_p)$ and the ground-truth bounding box as $\mathcal{B}_{gt} = (x_{gt}, y_{gt}, w_{gt}, h_{gt})$. The target for bounding box regression can then be expressed as
$$T(\mathcal{B}_p, \mathcal{B}_{gt}) = \left( \frac{x_{gt} - x_p}{w_p},\ \frac{y_{gt} - y_p}{h_p},\ \log\frac{w_{gt}}{w_p},\ \log\frac{h_{gt}}{h_p} \right),$$
where $T$ is a function for transforming $\mathcal{B}_p$ to $\mathcal{B}_{gt}$. This transformation is conventionally learned as a regression function $F$:
$$F(E(I, \mathcal{B}_p)) \approx T(\mathcal{B}_p, \mathcal{B}_{gt}),$$
where $I$ is the input image and $E$ is a pooling function defined over the rectangular proposal, e.g., direct cropping of the image $I$, RoIPooling, or RoIAlign. This formulation aims to predict the relative displacement to the ground-truth box based on features within the area of $\mathcal{B}_p$. Shifts in $\mathcal{B}_p$ should change the target accordingly:
$$T(\mathcal{B}_p^1, \mathcal{B}_{gt}) \neq T(\mathcal{B}_p^2, \mathcal{B}_{gt}) \quad \text{for}\ \mathcal{B}_p^1 \neq \mathcal{B}_p^2.$$
Thus, the pooled feature $E(I, \mathcal{B}_p)$ should be sensitive to the box proposal $\mathcal{B}_p$. Specifically, for any pair of proposals $\mathcal{B}_p^1 \neq \mathcal{B}_p^2$, we should have $E(I, \mathcal{B}_p^1) \neq E(I, \mathcal{B}_p^2)$. Most existing feature extractors satisfy this property. Note that the improvement of RoIAlign over RoIPooling is partly due to this guaranteed translation sensitivity.
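As a concrete illustration, the box-delta target $T$ and its translation sensitivity can be sketched in a few lines of Python. The function name and the center-size box parameterization below are illustrative choices for this sketch, not taken from the paper's implementation.

```python
import math

def regression_target(box_p, box_gt):
    """Box-delta target T(B_p, B_gt) for bounding box regression.

    Boxes are (x_center, y_center, width, height). This is the
    conventional parameterization; exact notation may differ.
    """
    xp, yp, wp, hp = box_p
    xg, yg, wg, hg = box_gt
    return (
        (xg - xp) / wp,     # horizontal shift, normalized by proposal width
        (yg - yp) / hp,     # vertical shift, normalized by proposal height
        math.log(wg / wp),  # log-scale width ratio
        math.log(hg / hp),  # log-scale height ratio
    )

gt = (50.0, 50.0, 40.0, 40.0)
p1 = (45.0, 48.0, 36.0, 44.0)
p2 = (55.0, 52.0, 36.0, 44.0)  # a shifted copy of p1
# Shifting the proposal changes the target: T(p1, gt) != T(p2, gt),
# so the pooled feature must be sensitive to the proposal.
assert regression_target(p1, gt) != regression_target(p2, gt)
```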
Analysis of Deformable RoI Pooling. In deformable RoI pooling, a set of sample points is first generated from the proposal,
$$\mathcal{S}_p = G(I, \mathcal{B}_p),$$
where $G$ is the function for generating the sample points. Then, bounding box regression aims to learn a regression function $F'$ which utilizes the features sampled via $\mathcal{S}_p$ to predict the target as follows:
$$F'(E'(I, \mathcal{S}_p)) \approx T(\mathcal{B}_p, \mathcal{B}_{gt}),$$
where $E'$ is the pooling function with respect to the sample points $\mathcal{S}_p$.
From the translation sensitivity property, we have $E'(I, \mathcal{S}_p^1) \neq E'(I, \mathcal{S}_p^2)$ for $\mathcal{B}_p^1 \neq \mathcal{B}_p^2$. Because the pooled feature is determined by the locations of the sample points, it follows that $\mathcal{S}_p^1 \neq \mathcal{S}_p^2$. This means that for two different proposals $\mathcal{B}_p^1$ and $\mathcal{B}_p^2$ of the same object, the sample points produced by deformable RoI pooling must differ. Hence, the sample points of different proposals cannot correspond to the geometry of the same object; they represent a property of the proposals rather than the geometry of the object.
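To make this argument concrete, the sketch below generates the regular grid of bin centers that deformable RoI pooling perturbs with learned offsets (here the offsets are taken as zero, a simplifying assumption). It shows that the sample points are a function of the proposal box, so two different proposals over the same object yield different points. The function name is hypothetical.

```python
import numpy as np

def grid_sample_points(box, k=3):
    """Regular k x k grid of bin centers inside a box (x1, y1, x2, y2).

    Deformable RoI pooling adds learned offsets to this grid; with zero
    offsets, it is already clear that the points follow the proposal.
    """
    x1, y1, x2, y2 = box
    xs = x1 + (np.arange(k) + 0.5) / k * (x2 - x1)
    ys = y1 + (np.arange(k) + 0.5) / k * (y2 - y1)
    return np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

# Two different proposals over the same object
s1 = grid_sample_points((10.0, 10.0, 50.0, 50.0))
s2 = grid_sample_points((15.0, 12.0, 55.0, 52.0))
# Different proposals -> different sample points; the points describe
# the proposal's geometry, not the object's.
assert not np.allclose(s1, s2)
```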
Figure 4 illustrates the contradiction that would arise if deformable RoI pooling were a representation of object geometry. Moreover, Figure 5 shows the learned sample points of two proposals for the same object, confirming that the sample points represent a property of the proposals rather than the geometry of the object.
In contrast to deformable RoI pooling, where the pooled features represent the original bounding box proposals, the features extracted from RepPoints localize the object itself. Because RepPoints are not restricted by the translation sensitivity requirement, they can learn a geometric representation of objects when localization supervision is provided on the corresponding pseudo box (see Figure 3). Such supervision is not applied to the sample points of deformable RoI pooling; we show in Table 2 that it is crucial for RepPoints.
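The pseudo box that receives this localization supervision can be obtained by one of the point-to-box conversion functions; a minimal sketch of the min-max conversion over the point set follows, with an illustrative function name.

```python
import numpy as np

def pseudo_box(points):
    """Min-max conversion from a point set to a pseudo box (x1, y1, x2, y2).

    The axis-aligned bounding rectangle of the points; localization
    supervision on this box is what drives the points toward the
    object's spatial extent.
    """
    pts = np.asarray(points, dtype=float)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    return (x1, y1, x2, y2)

# Nine representative points scattered over an object
pts = [(12, 8), (30, 5), (45, 9), (10, 20), (28, 22),
       (47, 21), (14, 34), (31, 36), (44, 33)]
assert pseudo_box(pts) == (10.0, 5.0, 47.0, 36.0)
```

A localization loss (e.g., smooth L1) between this pseudo box and the ground-truth box can then be backpropagated to the point locations, since the min and max operations are differentiable almost everywhere.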