Dense RepPoints: Representing Visual Objects with Dense Point Sets

We present an object representation, called Dense RepPoints, for flexible and detailed modeling of object appearance and geometry. In contrast to the coarse geometric localization and feature extraction of bounding boxes, Dense RepPoints adaptively distributes a dense set of points over semantically and geometrically significant positions on an object, providing informative cues for object analysis. Techniques are developed to address the challenges of supervised training of dense point sets from image segment annotations and of making this extensive representation computationally practical. In addition, the versatility of this representation is exploited to model object structure over multiple levels of granularity. Dense RepPoints significantly improves performance on geometrically-oriented visual understanding tasks, including a 1.6 AP gain in object detection on the challenging COCO benchmark.





1 Introduction

An object representation should ideally describe the appearance and geometry of an object in a manner that facilitates a broad range of object analysis tasks, such as object detection and semantic correspondence. At present, the representation most commonly used for objects is the bounding box, a rectangular window that encompasses the spatial extent of an object in an image. The prevalence of bounding boxes can be attributed to their convenience in feature extraction and ground-truth annotation. However, they provide only a coarse description that gives little indication of where an object’s boundaries and discriminative features may lie.

In order to obtain an object representation with greater utility, the recently proposed RepPoints [46] models objects by a small set of representative points that gravitate to object boundaries and semantically meaningful object locations. This flexible representation is applied to object detection, where the representation adapts to the geometric variations of objects, providing guidance on where to extract features. While RepPoints is shown to be effective for the classification and localization tasks inherent to object detection, the underlying sparse point set lacks the capacity to reveal the detailed object structure sought in applications such as instance segmentation, pose estimation, and dense correspondence learning.

Figure 1: Visual objects in different geometric forms (top row, from left to right): bounding box, contour, foreground area, binary boundary map. These various object forms can be uniformly represented by a dense point set, called Dense RepPoints (bottom row).

In this paper, we propose to use a new, significantly larger set of points with optional attributes to learn object representations. We call this new representation Dense RepPoints. With a dense collection of adaptively attributed points, it becomes possible to represent fine-scale structural information that is beneficial for geometrically-oriented object analysis. However, major challenges arise in dealing with dense point sets: (1) How can we train 0-D points from the 1-D object contours and 2-D binary foreground masks available for dense supervision? (2) How can we infer various foreground shape descriptors from a non-grid point set? (3) How can we handle the lengthy object features, whose size grows rapidly with the number of points?

In this paper, we address the first issue by converting these common forms of geometric annotation into a point set, to allow comparison to the predicted Dense RepPoints within a differentiable loss. Two conversion approaches are proposed, generating either an organized representation that establishes point-to-point correspondences with Dense RepPoints, or an unorganized point set that is compared to Dense RepPoints via Chamfer distances. To address the second challenge, we present methods for conversion in the reverse direction, from dense point sets to three different descriptors for 2D object shape. To resolve the last issue, we develop an efficient computational scheme that involves group pooling of Dense RepPoints for object recognition and shared offset fields and attribute maps for modeling point refinements over an object and point attributes.

In addition to resolving these issues, we show that Dense RepPoints can simultaneously represent object structures of different granularity using the same point set, e.g., coarsely at the object detection level and finely at the instance segment level. This unified representation allows the coarse detection task to directly benefit from finer segment annotations, in contrast to training through separate branches built on top of base features as popularized in [12, 21].

We demonstrate the flexibility and effectiveness of the Dense RepPoints representation on different geometrically-oriented visual understanding tasks. The use of denser point sets is shown to yield significant performance improvements, indicating the benefits of finer object representations. Dense RepPoints holds great potential as an alternative object representation for problems involving detailed object shape analysis, such as instance segmentation.

2 Related Work

Bounding box representation.

Most existing high-level object recognition benchmarks [17, 31, 27] employ bounding box annotations for object detection. The current top-performing two-stage object detectors [19, 20, 36, 13] use bounding boxes as anchors, proposals, and final predictions throughout their pipelines. Some early works proposed rotated boxes [23] to improve upon axis-aligned boxes, but the representation remains rectangular. For other high-level recognition tasks such as instance segmentation and human pose estimation, the intermediate proposals in top-down solutions [12, 21] are all based on bounding boxes. However, the bounding box is a coarse geometric representation that encodes only the spatial extent of an object.

Non-box object representations.

For instance segmentation, the annotation for objects is either a binary mask [17] or a set of polygons [31]. While most current top-performing approaches [11, 21, 8] use a binary mask as the final prediction, recent approaches also exploit contours for efficient interactive annotation [6, 1] and segmentation [10, 44]. This contour representation, which was popular in earlier computer vision [24, 7, 38, 39, 40], is believed to be more compatible with the semantic concepts of objects [33, 39]. Some works also use edges and superpixels [45, 25] as 2D object representations. Our proposed Dense RepPoints has the versatility to model objects in several of these non-box forms, providing a more generalized representation.

Point set representation.

There is much research focused on representing point clouds in 3D space [34, 35]. A direct instantiation of ordered point sets in 2D perception is 2D pose [41, 5, 2], which directly addresses the semantic correspondence problem. Recently, there has been increasing interest in object detection using specific point locations, including corner points [28], extreme points [48], and the center point [47, 16]. These point representations are in effect variants of the bounding box representation, which is coarse and lacks semantic information. RepPoints [46] proposes a learnable point set representation trained from localization and recognition feedback. However, it uses only a small number of points (n = 9) to represent objects, limiting its ability to represent finer geometry. In this work, we extend RepPoints [46] to a denser and finer geometric representation, enabling the use of dense supervision and taking a step toward dense semantic geometric representation.

3 Review of RepPoints for Object Detection

A rectangular bounding box is a coarse 2D object representation that encodes only the rough spatial scope of an object, without accounting for object shape and pose. To address this issue, RepPoints [46] uses a set of adaptive representative points for each object:

$\mathcal{R} = \{(x_k, y_k)\}_{k=1}^{n},$

where $n$ is the total number of representative points ($n = 9$ by default).
The set of points is used to simultaneously describe the spatial extent of an object and extract object features in a more fine-grained and targeted way than the bounding box approach. The learning process of RepPoints is also driven by both the geometric localization and recognition tasks built on the object features. The two supervision sources help to position the learned RepPoints on semantically significant and aligned locations on objects.

Geometric supervision by bbox annotation:

In object detection, only coarse bounding box annotation is provided, which cannot be directly compared to a point set representation. To utilize bounding box annotations for geometric supervision, RepPoints presents three ways to convert a point set to a pseudo box, such that it can be compared with the ground-truth bounding box. A representative conversion function is the MinMax function:

$\mathcal{B} = \big(\min_k x_k,\ \min_k y_k,\ \max_k x_k,\ \max_k y_k\big),$

which takes the extremes of the point coordinates over the two axes.
In inference, this conversion function is also used to produce bounding box output from the RepPoints.
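The MinMax conversion can be sketched in a few lines of NumPy (the function name is ours, not the paper's):

```python
import numpy as np

def minmax_pseudo_box(points):
    """MinMax conversion: the pseudo box is the axis-aligned box
    spanned by the per-axis minima and maxima of the point set."""
    pts = np.asarray(points, dtype=float)   # (n, 2) array of (x, y)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max])

# The pseudo box can then be compared with the ground-truth box.
minmax_pseudo_box([(3.0, 5.0), (1.0, 9.0), (4.0, 2.0)])  # array([1., 2., 4., 9.])
```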

Object feature extraction:

A differentiable sampler is used to extract the features of an object based on the current location of RepPoints. This operation enables recognition feedback to back-propagate onto the RepPoints. A deformable convolution module [14] is used for differentiable feature extraction where the current RepPoints are used as the deformable sampling locations in [46].

The extracted object features can then be used both for object classification and for geometric refinement. The geometric refinement process is applied in a point-wise manner, where each point is adjusted by a predicted offset:

$\mathcal{R}' = \{(x_k + \Delta x_k,\ y_k + \Delta y_k)\}_{k=1}^{n}.$
The geometric refinement mechanism enables alternating refinement of localization and feature extraction, which is commonly exploited in modern object detectors [36, 4].

4 Dense (Attributed) RepPoints

In RepPoints [46], the number of representative points used is relatively small (n = 9). This may be adequate for object detection, which requires only a coarse bounding box (4 degrees of freedom) as output. However, for more fine-grained geometric localization tasks such as instance segmentation, the representation capacity of a sparse point set may be insufficient. To represent detailed object structures, we propose a significantly larger set of points, which we call Dense RepPoints. The new representation has the potential to approximate finer structures of objects (i.e., contours, foreground, and boundary areas), as illustrated in Figure 1.

In addition, attributes can optionally be associated with each point, further strengthening its representational power. Potential attributes include the probability of a point being located on the foreground, or the visibility of a human keypoint. We name the attribute-strengthened representation Dense (Attributed) RepPoints:

$\mathcal{R} = \{(x_k, y_k, \mathbf{a}_k)\}_{k=1}^{n},$

where $\mathbf{a}_k$ is an attribute vector associated with the $k$-th point.


4.1 Object Segment Representation

An object segment accurately describes the spatial scope of an object. Several typical representations for detailed structure exist, including the binary foreground heatmap computed at grid points within a rectangular bounding box [21] and the non-box form of a contour. We show that an object segment can also be represented in other non-box forms, such as the foreground area and binary boundary map in Figure 1 (top). While representing and computing a box-based structure is relatively easy [21], the other non-box forms are non-trivial to represent and compute. In this section, we show that all these non-box forms can be conveniently represented and computed by the proposed Dense RepPoints, demonstrating its flexibility and potential for representing general object structures.

4.1.1 Object Segments in Three Non-box Forms

Figure 1 (top) illustrates three non-box forms for describing the accurate spatial scope of an object. We now explain each of them in detail.

Object contour:

An object contour describes an object segment by the boundary pixels that separate the object foreground from the background. This representation is compact due to the 1-D (curve) nature of contours. It is also widely used for object mask annotation [31], probably owing to the convenience of labeling tools for polygons (an approximation of object contours). The contour description was also extensively studied earlier in computer vision [49, 37, 24], and methods based on it generally achieve better accuracy near boundaries than area-based methods.

Foreground area:

In this representation, the set of foreground pixels is adopted to describe the spatial scope of an object. An example of such a descriptor is superpixels [45, 25]. However, this representation has rarely been used in the deep learning era.

Binary boundary map:

This descriptor is an extension of the standard binary map, where an irregular boundary region is used in place of a rectangular grid. The boundary region is more informative for object instance segmentation, as also observed and exploited by a concurrent work [26].

With adequate density, a point set can approximate all of the above non-box segment representations, which motivates us to adopt Dense RepPoints to model them.

4.1.2 Learn Dense RepPoints by Segment Supervision

A dense point set and object segment annotation cannot be directly compared, and hence designing fine localization guidance for dense point set learning is non-trivial. To allow for comparison, we propose to convert the object segment descriptor into a point set for dense supervision of an intermediate Dense RepPoints representation. We present two approaches to supervise Dense RepPoints, based on either an unorganized or organized point set representation.

Figure 2: Illustration of sampling unorganized and organized point sets.
Unorganized point set:

In this approach, Dense RepPoints learns to distribute points in a manner that minimizes the discrepancy between its point set and a target point set sampled from the segment supervision. The points are learned both to facilitate feature extraction and to minimize the overall point set discrepancy with the ground truth, without explicitly defining a meaning for each point on an object; we refer to these as unorganized points. Figure 2 (top) illustrates the target point sets used to approximate the three types of segment descriptors. For the contour representation, we sample the target points uniformly along the perimeter (see Figure 2, top left). For the foreground area representation, we uniformly sample points from the object foreground region as the target point set (see Figure 2, top middle). For the binary boundary map representation, we obtain the target point set by applying a distance transform to the contour map; each point in the object foreground area is assigned the attribute "foreground", while those in the background area are assigned the attribute "background" (see Figure 2, top right). We then utilize the Chamfer loss [18, 42] as a differentiable point set distance between the Dense RepPoints and the target point set to drive learning:

$\mathcal{L}_{\text{point}} = \frac{1}{n}\sum_{i=1}^{n}\min_{j}\lVert p_i - q_j\rVert_2 + \frac{1}{m}\sum_{j=1}^{m}\min_{i}\lVert q_j - p_i\rVert_2,$

where $\{p_i\}_{i=1}^{n}$ are the predicted Dense RepPoints and $\{q_j\}_{j=1}^{m}$ is the target point set.
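A minimal NumPy sketch of this symmetric Chamfer distance between two point sets (the function name is illustrative; training uses a differentiable analogue of the same quantity inside the network):

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two 2-D point sets: each
    point is matched to its nearest neighbour in the other set, and
    the matched distances are averaged in both directions."""
    pred = np.asarray(pred, dtype=float)      # (n, 2)
    target = np.asarray(target, dtype=float)  # (m, 2)
    # pairwise Euclidean distances, shape (n, m)
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    # nearest-target term plus nearest-prediction term
    return d.min(axis=1).mean() + d.min(axis=0).mean()

chamfer_distance([(0, 0), (1, 0)], [(0, 0), (2, 0)])  # → 1.0
```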

Organized point set:

We can alternatively address the issue by converting the non-box geometric descriptions from Cartesian space to polar coordinates, in which points can be indexed by the angle between the point regression vector and the horizontal direction. We set the centroid of an object mask as the origin of the polar system and quantize the rotation space into a number of angles. The points are organized by these angles, and we refer to them as an organized point set. We adopt different rules to sample the target points for the three non-box segment representations. For an object contour, the two intersection points at each quantized rotation angle are sampled, resulting in a point set representing the object contour (see Figure 2, bottom left). For the foreground area representation, a fixed number of uniformly distributed points within the foreground area along each quantized rotation angle is sampled as target points, resulting in a point set representing the foreground area of an object (see Figure 2, bottom middle). For the binary boundary map representation, we also sample a fixed number of uniformly distributed points for each quantized rotation angle as the target points, but the points are located within the boundary area (see Figure 2, bottom right); these points are additionally associated with an attribute denoting whether they are foreground points. The difference between the predicted point set and the sampled target point set is computed as an averaged smooth $\ell_1$ loss:

$\mathcal{L}_{\text{point}} = \frac{1}{n}\sum_{k=1}^{n} \text{smooth}_{\ell_1}\!\big(p_k - q_k\big),$

where $p_k$ and $q_k$ are the corresponding predicted and target points under the angle-based indexing.


In experiments, both approaches perform moderately well in representing all the three forms of non-box object segments. They either perform significantly better than previous methods (object contour), or are the first attempt to represent these forms (foreground area and binary boundary map). The experiments also demonstrate the necessity of using a larger set of points for accurate representation.
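The angle-based organization described above can be sketched as follows. This is a simplified stand-in for the two-intersection sampling, operating on a contour given as a point list; the function name and default bin count are illustrative, not the paper's implementation:

```python
import numpy as np

def organize_by_angle(contour, num_bins=8):
    """Organize contour points by their polar angle about the centroid,
    keeping the innermost and outermost point per angle bin as the
    two 'intersection' points for that quantized rotation angle."""
    contour = np.asarray(contour, dtype=float)
    rel = contour - contour.mean(axis=0)      # coordinates about the centroid
    ang = np.arctan2(rel[:, 1], rel[:, 0])    # angle in (-pi, pi]
    rad = np.linalg.norm(rel, axis=1)
    bins = ((ang + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    organized = {}
    for b in range(num_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size:                          # a bin may be empty for sparse contours
            organized[b] = (contour[idx[np.argmin(rad[idx])]],   # inner point
                            contour[idx[np.argmax(rad[idx])]])   # outer point
    return organized
```

As noted above, this indexing breaks down for strongly concave shapes, where a single angle can intersect the contour more than twice.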

Comparing the two types of representations, unorganized methods perform moderately better than organized ones in their current implementations, especially for the object contour representation. For contours, the organized approach cannot accurately describe extremely concave structures, since there may be multiple intersections at one specific rotation angle (e.g., the complex human pose in Figure 1), even with an unlimited number of points. For the other two non-box forms, both approaches can in principle describe a shape accurately with unlimited points.

4.1.3 Inference of Non-box Descriptors by Dense RepPoints

Figure 3: Generating image segments from Dense RepPoints using the contour representation (left), foreground area representation (middle), and binary boundary map representation (right).

In inference, Dense RepPoints requires conversion to the three non-box segment representations.

Object contour: Given the predicted Dense RepPoints for an object contour, we compute the enclosing boundary, also known as the concave hull, of the point set using a k-nearest neighbours approach [32] (see Figure 3 left).

Foreground area: Given the predicted Dense RepPoints for an object foreground area, we dilate them using an adaptive Gaussian kernel to generate the foreground target (see Figure 3 middle).
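The Gaussian dilation step can be sketched as below; `sigma` and the 0.1 threshold are placeholders, since the paper adapts the kernel to the proposal size and point count:

```python
import numpy as np

def points_to_mask(points, shape, sigma=2.0, thresh=0.1):
    """Dilate foreground points into a binary mask by splatting an
    isotropic Gaussian at each point, normalizing the heatmap, and
    thresholding it."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w))
    for x, y in points:
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    heat /= heat.max()
    return heat >= thresh   # boolean foreground mask

mask = points_to_mask([(4, 4), (8, 4)], shape=(12, 16))
```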

Binary boundary map: Given the predicted attributed Dense RepPoints for a binary boundary map, we first employ Delaunay triangulation to triangulate the image space, and then adopt linear interpolation in the barycentric coordinate system of each triangle to compute foreground scores for arbitrary image pixels inside the triangle (see Figure 3, right).
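Assuming SciPy is available, this inference step can be sketched with `LinearNDInterpolator`, which internally performs exactly a Delaunay triangulation followed by barycentric linear interpolation inside each triangle (the function name and the corner-point example are ours):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def points_to_score_map(points, scores, shape):
    """Interpolate per-point foreground scores over the image plane.
    Pixels outside the convex hull of the points default to 0
    (background); thresholding the map at 0.5 yields a binary mask."""
    interp = LinearNDInterpolator(np.asarray(points, dtype=float),
                                  np.asarray(scores, dtype=float),
                                  fill_value=0.0)
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return interp(xs, ys)   # (h, w) foreground score map

score = points_to_score_map([(0, 0), (9, 0), (0, 9), (9, 9)],
                            [1.0, 1.0, 1.0, 1.0], shape=(10, 10))
```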

4.2 Efficient Feature Extraction

Figure 4: Illustration of efficient feature extraction for Dense RepPoints. Top: group pooling operation. Bottom: shared offset fields / attribute maps for each point index.
Figure 5: Overview of our approach. For each bin in the feature map, it first generates a set of Dense RepPoints which are driven by specific dense supervision. Then, classification, point refinement and optional per-point attribute prediction outputs are generated via the extracted point features. The point refinements are added onto the previous dense point set representation to obtain the refined Dense RepPoints.

As discussed before, dense points are needed for a better approximation of complex geometric structures. However, directly applying the feature extraction method in [46] would make this impractical, due to complexity that increases linearly with the number of points. Denoting the number of points by $n$ and the feature length of each point by $C$, the computational complexity of both the classification and regression branches is $O(nC)$ in [46].

To address this issue, we introduce group pooling for classification and shared offset fields for point refinement, which make the computational complexity of both branches nearly constant with respect to the number of points.

Group pooling:

For the classification branch, given $n$ representative points, we first divide the points into $k$ groups, each containing $\lfloor n/k \rfloor$ points, except for the last group, which takes the remainder. Then, we sample the feature of each group instead of individual representative points. This can be achieved by sampling the corresponding features of a subset of points within the group and performing max-pooling over the sampled features. Finally, a $1 \times 1$ convolution is computed over the concatenated features from all groups. In this way, the computational and space complexity both reduce to $O(kC)$, independent of the number of points. In our implementation, we set $k$ to 9 by default, which works comparably well to counterparts using all point features.
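A NumPy sketch of the group pooling idea (the grouping here is a simple contiguous split under an illustrative name; in the detector, features are first sampled at the point locations):

```python
import numpy as np

def group_pool(point_feats, k=9):
    """Max-pool n point features into k group features, so the
    classification cost depends on k rather than n."""
    n, c = point_feats.shape
    # split the point indices into k nearly equal, contiguous groups
    groups = np.array_split(np.arange(n), k)
    pooled = [point_feats[g].max(axis=0) for g in groups]
    return np.concatenate(pooled)   # (k * c,) vector fed to the 1x1 conv

feats = np.random.rand(225, 8)   # n = 225 points, C = 8 channels
desc = group_pool(feats)         # 72-d descriptor, independent of n
```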

Shared offset fields and attribute maps:

Unlike the recognition branch, point location refinement and point-wise attribute prediction need information about individual points, so we cannot directly apply the grouped features used for recognition. Instead, we empirically find that local point features provide enough information for point refinement and attribute prediction. By performing point location refinement using only each point's individual feature, we further reduce computation and memory cost: all layers are shared except the last one, which produces the 2-d offsets for all points. Because the computation of the last layer is very small, the total computation and memory cost is roughly constant with respect to the number of points.

Figure 4 illustrates our efficient computation processes for both the recognition and point refinement branches.

4.3 RPSeg: A Box-Free Instance Segmentation Framework

Figure 6: Illustration of the head design. The attribute head is optional.

We design a box-free RepPoints Segmenter (RPSeg) for the task of instance segmentation that utilizes Dense RepPoints as the intermediate representation throughout the pipeline. Similar to [46], we use a center point based initial object representation and utilize Dense RepPoints as the intermediate feature sampling locations. The overall architecture is illustrated in Figure 5, using an FPN backbone like in [30, 46], where feature pyramid levels from 3 (downsampling ratio of 8) to 7 (downsampling ratio of 128) are employed. The head architecture is illustrated in Figure 6.

In addition to the class head and localization head, we introduce an optional attribute head to predict the attribute of each point. The localization subnet first computes offsets for the Dense RepPoints; the refinement and attribute predictions are then obtained by bilinear sampling on the predicted refine fields and attribute maps at the current point locations. For the classification branch, we use group pooling to sample the features of Dense RepPoints at the current point locations, and fully connected layers then predict the classification results. Owing to its anchor-free design and the efficient implementation of Section 4.2, our approach is fast and even more efficient than the one-stage RetinaNet [30] (215.7 vs. 234.5 GFLOPS), while having a finer object representation (225 points). We use the same label assignment approach as in [46]. For the additional attribute branch, we use a per-point binary cross-entropy loss for foreground/background attribute prediction.

4.4 Multi-Granular Representation

Dense RepPoints can also be employed to simultaneously represent multi-granular object structures, e.g. the coarse object bounding box and the more fine-grained object segment. This can be achieved by using both bounding box and segment annotation to jointly supervise the learning of Dense RepPoints. In this way, different 2D visual representations including bounding boxes, contours, and masks are unified by this dense point set representation.

Unlike Mask R-CNN [21], where the mask and bounding box are processed by independent branches, Dense RepPoints provides a unified representation across granularities of object geometry. Experimental results also show that the object detection task can benefit greatly from object segments as a fine-grained supervision source, even on a strong baseline, with up to a 1.6 AP improvement through this multi-granular representation (see Table 4).

5 Experiments

(a) Contours.
(b) Foreground.
(c) Boundary binary map (foreground points in brighter color).
Figure 7: Visualization of the predicted contours using the concave hull algorithm (a); the foreground area points (b); and the predicted binary mask obtained by non-grid interpolation from points (c), where predicted foreground points are shown in brighter color and background points in darker color.

We present experimental results for multiple applications, all on the MS-COCO [31] benchmark, which contains 118k images for training, 5k images for validation (minival) and 20k images for testing (test-dev). We report the ablation results on the 5k validation images (minival). The state-of-the-art comparison is reported on test-dev.

Implementation Details:

All our ablation experiments, except where specifically noted, are conducted on minival with ResNet-50 [22]. Our framework is trained with synchronized stochastic gradient descent (SGD) over 8 GPUs with a total of 16 images per minibatch (2 images per GPU) for 12 epochs (1x settings). The learning rate is initialized to 0.01 and then divided by 10 at epochs 8 and 11. The weight decay and momentum parameters are set to and 0.9, respectively. The ImageNet [15] pretrained model is used for initialization. We follow the training schedule and horizontal image flipping augmentation of [46] and use GN [43] and focal loss [30] to facilitate training. We use the unorganized point set by default. At inference, NMS is employed to post-process the results, following [30]. Code will be made available.

n    FLOPS (base [46, 14])   FLOPS (efficient)   mAP (base [46, 14])   mAP (efficient)
25   267.9G                  211.4G              38.3                  38.5
49   334.0G                  211.8G              38.7                  38.7
81   422.2G                  212.6G              38.9                  39.1
Table 1: Studies on the effect of the efficient implementation.

points #   mAP: 19.1  22.7  23.2  23.6
(a) Results using the organized contour representation.

points #   mAP: 9.0  15.5  17.0  20.2
(b) Results using the organized foreground area representation.

points #   mAP: 19.2  23.1  26.1  28.2
(c) Results using the organized binary boundary map representation.

points #   mAP: 19.7  23.9  24.6  26.0
(d) Results using the unorganized contour representation.

points #   mAP: 16.6  19.9  22.4  23.1
(e) Results using the unorganized foreground area representation.

points #   mAP: 23.5  26.7  28.2  29.2
(f) Results using the unorganized binary boundary map representation.
Table 2: Instance segmentation results of Dense RepPoints using different representations, with respect to the number of points.

5.1 Effect of Efficient Implementation.

To show the effectiveness of our efficient feature extraction design, we present detection results using the efficient implementation for the object foreground area representation. Results are shown in Table 1. At each point count, the efficient approach achieves almost the same AP as the base approach with fewer FLOPs. As $n$ increases, the accuracy of both approaches steadily improves, but the computation cost of the base approach grows with the number of points, whereas the computational complexity of our method is nearly constant with respect to $n$.

5.2 Instance Segmentation Results

Contour representation:

We use Dense RepPoints as a contour representation for instance segmentation by converting polygons to foreground masks. We report the mask AP for both organized point sets (Table 2a) and unorganized point sets (Table 2d). The performance improves as the number of points increases, indicating the necessity of dense points for fine geometric representation. Figure 7a presents some qualitative results; note that we only show the polygon vertices. In general, the points capture fine object shape and geometry.

Foreground area representation:

We supervise the Dense RepPoints representation with the target point set sampled on the object foreground. We then generate foreground masks by splatting the points onto a single Gaussian heatmap, with the kernel size chosen according to the width and height of the proposal and the number of dense points, and apply a threshold of 0.1 to the heatmap to obtain the binary mask. Morphological dilation and erosion are used to fill in holes. Results for organized and unorganized point sets are shown in Table 2b and Table 2e. As the number of points increases, the performance increases significantly. Figure 7b shows the learned Dense RepPoints on the object foreground area (n = 81).

Binary boundary map representation:

In the binary boundary map representation, the learned Dense RepPoints distribute around the object boundary, and each point predicts a foreground probability. We use the non-grid interpolation described in Section 4.1.3 to convert the attributed Dense RepPoints into a foreground mask, applying a threshold of 0.5 to the interpolated mask to obtain the binary output. Figure 7c presents some qualitative results. For each instance, the points predicted to be foreground are plotted in brighter color and the points predicted to be background in darker color. As seen in the top-left image of Figure 7c, our binary boundary map representation is even able to represent an object with disconnected parts (the skis in the image). We report the results for organized point sets in Table 2c and for unorganized point sets in Table 2f; the performance improves with the number of points. We also compare against the grid binary map representation in Table 3, where the Dense RepPoints are replaced by the same number of points in grid form, with all other settings unchanged. The change from a grid form to a non-grid representation brings substantial improvement, especially when the number of points is limited (13.3 vs. 5.0 AP at the smallest point count in Table 3). It is worth noting that we are among the first to represent object segments as non-grid scored points, which saves both time and memory: instead of verifying every pixel inside the bounding box enclosing the object, we can focus on the more difficult points near the boundary. This benefit is also observed and exploited by a concurrent work [26], and we expect more investigation in this direction.

# of points
TensorMask [9] - - - - 28.9
Grid points 5.0 15.6 22.7 25.7 28.6
Dense RepPoints 13.3 23.5 26.7 28.2 29.2
Table 3: Comparison of the Dense RepPoints (boundary binary map representation) and grid binary map representations for instance segmentation. The network structures are identical except for the module that processes the given object representation.
Comparison with other instance segmentation methods.

To better demonstrate the effectiveness of our dense representation, we compare it to other instance segmentation methods on COCO test-dev and report the results in Table 5. We use 81 points for the contour representation and 225 points for the boundary binary mask representation. It can be seen that our Dense RepPoints achieves the best performance among the non-box-based approaches. Our method differs fundamentally from previous methods, which are mostly based on grid binary maps [21, 9], and it performs significantly better than existing contour-based methods. Most importantly, we open two new directions for future study, namely object representation by foreground area and by binary boundary map. Although the current result is not as strong as the state-of-the-art Mask R-CNN [21] and TensorMask [9], there is much room for improvement. For example, the tensor bipyramid head in TensorMask [9] brings a sizable AP improvement. When our RPSeg and TensorMask both use an FPN head for fair comparison, Table 3 shows that RPSeg with the binary boundary map representation achieves a 0.3 AP improvement over TensorMask [9] on minival using a ResNet-50 [22] backbone, even though TensorMask [9] uses more training epochs (72 epochs) and a stronger FPN (pyramid levels 2 to 7). This demonstrates the great potential of the binary boundary map representation.

Dense supervision  AP    AP50  AP75  diff.
–                  37.7  58.0  39.7
foreground area    38.5  58.4  40.9  +0.8
–                  37.5  57.8  39.6
foreground area    38.7  58.9  41.9  +1.2
–                  37.5  57.7  39.8
foreground area    39.1  58.7  42.1  +1.6
Table 4: Studies on the effects of dense supervision for detection. "diff." is the bbox AP gain of foreground-area supervision over the corresponding baseline without dense supervision.
Method  Backbone  Object Detection (AP / AP50 / AP75)  Representation  Instance Segmentation (AP / AP50 / AP75)
Mask RCNN [21] ResNeXt-101 39.8 62.3 43.4 grid binary 37.1 60.0 39.4
TensorMask [9] ResNet-101 - - - - grid binary 37.1 59.3 39.4
FCIS [29] ResNet-101 - - - - grid binary 29.2 49.5 -
YOLACT [3] ResNet-101 33.7 54.3 35.9 grid binary 31.2 50.6 32.8
ExtremeNet [48] Hourglass-104 40.2 55.5 43.2 contour 18.9 44.5 13.7
CornerNet [28] Hourglass-104 40.5 56.5 43.1 - - - -
CenterNet [47] Hourglass-104 42.1 61.1 45.9 - - - -
RepPoints [46] ResNet-101-DCN 42.8 65.0 46.3 - - - -
RepPoints [46] ResNeXt-101-DCN 44.5 65.6 48.3 - - - -
Ours ResNeXt-101-DCN 45.3 65.9 49.1 contour 30.0 57.2 28.2
Ours ResNeXt-101-DCN 45.8 66.7 49.6 boundary binary 33.7 59.9 34.3
Table 5: Results of object detection and instance segmentation on COCO test-dev. Our method achieves the best detection results without using anchors, and the best instance segmentation performance among methods based on irregular representations.

5.3 Dense RepPoints Benefits Object Detection.

We perform experiments on utilizing dense supervision to help object detection by explicitly supervising the sampling locations of the Dense RepPoints. Specifically, we sample a target point set from the object foreground area and add a point-set distance between this target set and the predicted Dense RepPoints as an auxiliary loss. Table 4 presents object detection results with different numbers of points. We observe a consistent gain in bbox AP as the point set becomes denser; the fine supervision improves object detection by up to 1.6 AP in the densest setting. Also note that the AP75 metric improves substantially, which suggests that Dense RepPoints models a finer geometric representation. This novel application of explicit multi-task learning verifies the necessity of a denser point set and demonstrates the effectiveness of our multi-granular representation.
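The point-set supervision above requires a differentiable distance between two unordered point sets. One common choice, used in the point-set generation work [18] cited in Section 4.1.2, is the symmetric Chamfer distance; the sketch below illustrates it in NumPy and is an assumed stand-in for the paper's actual loss.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two (k, 2) point sets:
    average nearest-neighbor squared distance in both directions."""
    # pairwise squared Euclidean distances, shape (n, m)
    d = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The loss is zero exactly when every predicted point coincides with some target point and vice versa, and it grows smoothly as the sets drift apart, which is what makes it usable as an auxiliary training signal.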

Comparison with other object detection methods.

We perform a system-level comparison between our object detector, explicitly assisted by mask supervision, and state-of-the-art detectors on COCO test-dev [31], reporting the results in Table 5. Without multi-scale training and testing, our method achieves 45.8 AP without needing anchors, outperforming other strong competitors.

6 Conclusion

In this paper, we present Dense RepPoints, a dense attributed point set representation for 2D objects. By introducing efficient feature extraction and employing dense supervision, this work takes a step towards learning a geometric, semantic and unified representation for top-down object recognition pipelines, enabling explicit modeling between different visual entities. Experimental results show that this new dense 2D representation is not only applicable for predicting dense targets such as contours and foreground masks, but also can help improve other tasks such as object detection via its novel multi-granular object representation.

Acknowledgement We thank Jifeng Dai and Bolei Zhou for discussion and comments about this work. Jifeng Dai was involved in the early discussions of the work. He gave up the authorship after he joined another company.


  • [1] D. Acuna, H. Ling, A. Kar, and S. Fidler (2018) Efficient interactive annotation of segmentation datasets with polygon-rnn++. Cited by: §2.
  • [2] R. Alp Güler, N. Neverova, and I. Kokkinos (2018) Densepose: dense human pose estimation in the wild. In CVPR, pp. 7297–7306. Cited by: §2.
  • [3] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee (2019) YOLACT: Real-time instance segmentation. In ICCV, Cited by: Table 5.
  • [4] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §3.
  • [5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, pp. 7291–7299. Cited by: §2.
  • [6] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler (2017) Annotating object instances with a polygon-rnn. In CVPR, pp. 5230–5238. Cited by: §2.
  • [7] T. F. Chan and L. A. Vese (2001) Active contours without edges. IEEE Transactions on image processing 10 (2), pp. 266–277. Cited by: §2.
  • [8] L. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam (2018) Masklab: instance segmentation by refining object detection with semantic and direction features. In CVPR, pp. 4013–4022. Cited by: §2.
  • [9] X. Chen, R. B. Girshick, K. He, and P. Dollár (2019) TensorMask: A foundation for dense object segmentation. In ICCV, Cited by: §5.2, Table 3, Table 5.
  • [10] D. Cheng, R. Liao, S. Fidler, and R. Urtasun (2019) DARNet: deep active ray network for building segmentation. arXiv preprint arXiv:1905.05889. Cited by: §2.
  • [11] J. Dai, K. He, Y. Li, S. Ren, and J. Sun (2016) Instance-sensitive fully convolutional networks. In ECCV, pp. 534–549. Cited by: §2.
  • [12] J. Dai, K. He, and J. Sun (2016) Instance-aware semantic segmentation via multi-task network cascades. In CVPR, pp. 3150–3158. Cited by: §1, §2.
  • [13] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In NeurIPS, pp. 379–387. Cited by: §2.
  • [14] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §3, Table 1.
  • [15] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. Cited by: §5.
  • [16] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: object detection with keypoint triplets. arXiv preprint arXiv:1904.08189. Cited by: §2.
  • [17] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §2, §2.
  • [18] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In CVPR, pp. 605–613. Cited by: §4.1.2.
  • [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587. Cited by: §2.
  • [20] R. Girshick (2015) Fast r-cnn. In ICCV, pp. 1440–1448. Cited by: §2.
  • [21] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: §1, §2, §2, §4.1, §4.4, §5.2, Table 5.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §5, §5.2.
  • [23] C. Huang, H. Ai, Y. Li, and S. Lao (2007) High-performance rotation invariant multiview face detection. PAMI 29 (4), pp. 671–686. Cited by: §2.
  • [24] M. Kass, A. Witkin, and D. Terzopoulos (1988) Snakes: active contour models. IJCV 1 (4), pp. 321–331. Cited by: §2, §4.1.1.
  • [25] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother (2017) Instancecut: from edges to instances with multicut. In CVPR, pp. 5008–5017. Cited by: §2, §4.1.1.
  • [26] A. Kirillov, Y. Wu, K. He, and R. Girshick (2019) PointRend: image segmentation as rendering. arXiv preprint arXiv:1912.08193. Cited by: §4.1.1, §5.2.
  • [27] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §2.
  • [28] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV, pp. 734–750. Cited by: §2, Table 5.
  • [29] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully convolutional instance-aware semantic segmentation. In CVPR, pp. 2359–2367. Cited by: Table 5.
  • [30] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §4.3, §4.3, §5.
  • [31] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: §2, §2, §4.1.1, §5, §5.3.
  • [32] A. Moreira and M. Y. Santos (2007) Concave hull: a k-nearest neighbours approach for the computation of the region occupied by a set of points. Cited by: §4.1.3.
  • [33] S. E. Palmer (1999) Vision science: photons to phenomenology. MIT press. Cited by: §2.
  • [34] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §2.
  • [35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pp. 5099–5108. Cited by: §2.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §2, §3.
  • [37] P. Srinivasan, L. Wang, and J. Shi (2009) Grouping contours via a related image. In NeurIPS, pp. 1553–1560. Cited by: §4.1.1.
  • [38] P. Srinivasan, Q. Zhu, and J. Shi (2010) Many-to-one contour matching for describing and discriminating object shape. In CVPR, Cited by: §2.
  • [39] A. Toshev, B. Taskar, and K. Daniilidis (2012) Shape-based object detection via boundary structure segmentation. IJCV 99 (2), pp. 123–146. Cited by: §2.
  • [40] X. Wang, X. Bai, T. Ma, W. Liu, and L. J. Latecki (2012) Fan shape model for object detection. In CVPR, pp. 151–158. Cited by: §2.
  • [41] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In CVPR, Cited by: §2.
  • [42] Y. Wei, S. Liu, W. Zhao, J. Lu, and J. Zhou (2019) Conditional single-view shape generation for multi-view stereo reconstruction. In CVPR, Cited by: §4.1.2.
  • [43] Y. Wu and K. He (2018) Group normalization. In ECCV, pp. 3–19. Cited by: §5.
  • [44] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo (2019) PolarMask: single shot instance segmentation with polar representation. arXiv preprint arXiv:1909.13226. Cited by: §2.
  • [45] J. Yang, B. Price, S. Cohen, H. Lee, and M. Yang (2016) Object contour detection with a fully convolutional encoder-decoder network. In CVPR, pp. 193–202. Cited by: §2, §4.1.1.
  • [46] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) RepPoints: point set representation for object detection. In CVPR, Cited by: §1, §2, §3, §3, §4.2, §4.3, §4.3, §4, §5, Table 1, Table 5.
  • [47] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2, Table 5.
  • [48] X. Zhou, J. Zhuo, and P. Krähenbühl (2019) Bottom-up object detection by grouping extreme and center points. In CVPR, Cited by: §2, Table 5.
  • [49] Q. Zhu, L. Wang, Y. Wu, and J. Shi (2008) Contour context selection for object detection: a set-to-set contour matching approach. In ECCV, pp. 774–787. Cited by: §4.1.1.