Dense RepPoints
Dense RepPoints: Representing Visual Objects with Dense Point Sets (https://arxiv.org/abs/1912.11473)
We present an object representation, called Dense RepPoints, for flexible and detailed modeling of object appearance and geometry. In contrast to the coarse geometric localization and feature extraction of bounding boxes, Dense RepPoints adaptively distributes a dense set of points over semantically and geometrically significant positions on an object, providing informative cues for object analysis. Techniques are developed to address the challenges of supervised training for dense point sets from image segment annotations, and of making this extensive representation computationally practical. In addition, the versatility of this representation is exploited to model object structure over multiple levels of granularity. Dense RepPoints significantly improves performance on geometrically-oriented visual understanding tasks, including a 1.6 AP gain in object detection on the challenging COCO benchmark.
An object representation should ideally describe the appearance and geometry of an object in a manner that facilitates a broad range of object analysis tasks, such as object detection and semantic correspondence. At present, the representation most commonly used for objects is the bounding box, a rectangular window that encompasses the spatial extent of an object in an image. The prevalence of bounding boxes can be attributed to their convenience for feature extraction and ground-truth annotation. However, they provide only a coarse description that gives little indication of where an object's boundaries and discriminative features may lie.
In order to obtain an object representation with greater utility, the recently proposed RepPoints [46] models objects by a small set of representative points that gravitate to object boundaries and semantically meaningful object locations. This flexible representation is applied to object detection, where it adapts to the geometric variations of objects and provides guidance on where to extract features. While RepPoints is shown to be effective for the classification and localization tasks inherent to object detection, the underlying sparse point set lacks the capacity to reveal the detailed object structure sought in applications such as instance segmentation, pose estimation, and dense correspondence learning.
In this paper, we propose to learn object representations with a new, significantly larger set of points carrying optional attributes. We call this new representation Dense RepPoints. With a dense collection of adaptively attributed points, it becomes possible to represent fine-scale structural information that is beneficial for geometrically-oriented object analysis. However, major challenges arise in dealing with dense point sets: (1) How can we train 0D points using the 1D object contours and 2D binary foreground masks available as dense supervision? (2) How can we infer various foreground shape descriptors from a non-grid distributed point set? (3) How can we handle object features whose size grows heavily with the number of points?
In this paper, we address the first issue by converting these common forms of geometric annotation into a point set, to allow comparison to the predicted Dense RepPoints within a differentiable loss. Two conversion approaches are proposed, generating either an organized representation that establishes point-to-point correspondences with Dense RepPoints, or an unorganized point set that is compared to Dense RepPoints via Chamfer distances. To address the second challenge, we present methods for conversion in the reverse direction, from dense point sets to three different descriptors of 2D object shape. To resolve the last issue, we develop an efficient computational scheme that involves group pooling of Dense RepPoints for object recognition, and shared offset fields and attribute maps for modeling point refinements and point attributes over an object.
In addition to resolving these issues, we show that Dense RepPoints can simultaneously represent object structures of different granularity using the same point set, e.g., coarsely at the object detection level and finely at the instance segment level. This unified representation allows the coarse detection task to directly benefit from finer segment annotations, in contrast to training through separate branches built on top of base features as popularized in [12, 21].
We demonstrate the flexibility and effectiveness of the Dense RepPoints representation on different geometrically-oriented visual understanding tasks. The use of denser point sets is shown to yield significant performance improvements, indicating the benefits of finer object representations. Dense RepPoints holds great potential as an alternative object representation for problems involving detailed object shape analysis, such as instance segmentation.
Most existing high-level object recognition benchmarks [17, 31, 27] employ bounding box annotations for object detection. The current top-performing two-stage object detectors [19, 20, 36, 13] use bounding boxes as anchors, proposals, and final predictions throughout their pipelines. Some early works proposed rotated boxes [23]
to improve upon axis-aligned boxes, but the representation remains rectangular. For other high-level recognition tasks such as instance segmentation and human pose estimation, the intermediate proposals in top-down solutions [12, 21] are all based on bounding boxes. However, the bounding box is a coarse geometric representation which encodes only the spatial extent of an object. For instance segmentation, objects are annotated either as a binary mask [17] or as a set of polygons [31]. While most current top-performing approaches [11, 21, 8] use a binary mask as the final prediction, recent approaches also exploit contours for efficient interactive annotation [6, 1] and segmentation [10, 44]
. This contour representation, which was popular in earlier computer vision [24, 7, 38, 39, 40], is believed to be more compatible with the semantic concepts of objects [33, 39]. Some works also use edges and superpixels [45, 25] as 2D object representations. Our proposed Dense RepPoints has the versatility to model objects in several of these non-box forms, providing a more generalized representation.

There is much research on representing point clouds in 3D space [34, 35]. A direct instantiation of ordered point sets in 2D perception is 2D pose [41, 5, 2], which directly addresses the semantic correspondence problem. Recently, there has been increasing interest in object detection based on specific point locations, including corner points [28], extreme points [48], and the center point [47, 16]. These point representations are in fact variants of the bounding box representation, and remain coarse and lacking in semantic information. RepPoints [46] proposes a learnable point set representation trained from localization and recognition feedback. However, it uses only a small number of points (n = 9) to represent objects, limiting its ability to represent finer geometry. In this work, we extend RepPoints [46] to a denser and finer geometric representation, enabling the use of dense supervision and taking a step towards a dense semantic geometric representation.
A rectangular bounding box is a coarse 2D object representation that encodes only the rough spatial scope of an object and does not account for its shape and pose. To address this issue, RepPoints [46] uses a set of adaptive representative points for each object:
R = {(x_k, y_k)}_{k=1}^{n},    (1)

where (x_k, y_k) is the k-th representative point and n is the total number of points used for an object.
The set of points simultaneously describes the spatial extent of an object and guides feature extraction in a more fine-grained and targeted way than the bounding box approach. The learning of RepPoints is driven by both the geometric localization and the recognition tasks built on the object features. These two supervision sources help position the learned RepPoints at semantically significant and well-aligned locations on objects.
In object detection, only coarse bounding box annotation is provided, which cannot be directly compared to a point set representation. To utilize bounding box annotations for geometric supervision, RepPoints presents three ways to convert a point set to a pseudo box, such that it can be compared with the ground-truth bounding box. A representative conversion function is the MinMax function:
B = (min_k x_k, min_k y_k, max_k x_k, max_k y_k),    (2)

i.e., the tightest axis-aligned box enclosing all representative points.
In inference, this conversion function is also used to produce bounding box output from the RepPoints.
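As a concrete illustration, the MinMax conversion amounts to taking coordinate-wise extrema of the point set (a minimal NumPy sketch, not the paper's code):

```python
import numpy as np

def minmax_pseudo_box(points):
    """Convert an (n, 2) array of RepPoints (x, y) into a pseudo box
    (x_min, y_min, x_max, y_max) via the MinMax function."""
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max])
```

The same function serves both training (comparing the pseudo box against the ground-truth box) and inference (producing the final box output).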
A differentiable sampler is used to extract the features of an object based on the current location of RepPoints. This operation enables recognition feedback to backpropagate onto the RepPoints. A deformable convolution module [14] is used for differentiable feature extraction where the current RepPoints are used as the deformable sampling locations in [46].
The extracted object features can then be used for both object classification and geometric refinement. The geometric refinement is applied in a point-wise manner, where each point is adjusted by a predicted offset:
R' = {(x_k + Δx_k, y_k + Δy_k)}_{k=1}^{n},    (3)

where (Δx_k, Δy_k) is the predicted offset for the k-th point.
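The differentiable sampler and the point-wise refinement can be sketched as follows; this is an illustrative NumPy version (real implementations use deformable convolution [14] inside an autograd framework, so gradients flow back to the point locations):

```python
import numpy as np

def bilinear_sample(feature_map, points):
    """Sample a (C, H, W) feature map at continuous (x, y) locations,
    mimicking the differentiable sampler used to extract per-point
    object features."""
    C, H, W = feature_map.shape
    x, y = points[:, 0], points[:, 1]
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    f00 = feature_map[:, y0, x0]          # top-left neighbours, (C, n)
    f01 = feature_map[:, y0, x0 + 1]      # top-right
    f10 = feature_map[:, y0 + 1, x0]      # bottom-left
    f11 = feature_map[:, y0 + 1, x0 + 1]  # bottom-right
    return (f00 * (1 - wx) * (1 - wy) + f01 * wx * (1 - wy)
            + f10 * (1 - wx) * wy + f11 * wx * wy)

def refine_points(points, offsets):
    """Eq. (3): move each point by its predicted offset."""
    return points + offsets
```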
In RepPoints [46], the number of representative points used is relatively small (n = 9). This may be adequate for object detection, which requires only a coarse bounding box (4 degrees of freedom) as output. However, for more fine-grained geometric localization tasks such as instance segmentation, the representational capacity of a sparse point set may be insufficient. To represent detailed object structure, we propose a significantly larger set of points, which we call Dense RepPoints. The new representation can approximate finer-grained object structure (i.e., contours, foreground, and boundary areas), as illustrated in Figure 1. In addition, attributes can optionally be associated with each point, further strengthening its representational power. Potential attributes include the probability of a point lying on the foreground, or the visibility of a person keypoint. We call the attribute-strengthened representation
Dense (Attributed) RepPoints:

R = {(x_k, y_k, a_k)}_{k=1}^{n},    (4)

where a_k is an attribute vector associated with the k-th point.

An object segment accurately describes the spatial scope of an object. Several representations for this detailed structure exist, including the binary foreground heatmap computed at grid points within a rectangular bounding box [21] and the non-box form of a contour. We show that an object segment can also be represented in other non-box forms such as a foreground area or a binary boundary map, as in Figure 1 (top). While representing and computing a box-based structure is relatively easy [21], the non-box forms are non-trivial to represent and compute. In this section, we show that all of these non-box forms can be conveniently represented and computed with the proposed Dense RepPoints, demonstrating its flexibility and potential for representing general object structure.
Figure 1 (top) illustrates three non-box forms for describing the accurate spatial scope of an object. We now explain each of them in detail.
An object contour describes an object segment by the boundary pixels that separate the object foreground from the background. This representation is compact due to its 1D nature (a contour is defined by a curve). It is also widely used for object mask annotation [31], likely due to the convenience of labeling tools for polygons (an approximation of object contours). The contour description was extensively studied earlier in computer vision [49, 37, 24], and methods based on it generally achieve better accuracy near boundaries than area-based methods.
In this representation, the set of foreground pixels describes the spatial scope of an object. An example of such a descriptor is superpixels [45, 25]. However, this representation has rarely been used in the deep learning era.
This descriptor is an extension of the standard binary map, where an irregular boundary region is used in place of a rectangular grid. The boundary region is more informative for object instance segmentation, as also observed and exploited by a concurrent work [26].
With adequate density, a point set can approximate all of the above non-box segment representations, which motivates us to adopt Dense RepPoints to model them.
A dense point set and an object segment annotation cannot be directly compared, and hence designing fine localization guidance for dense point set learning is non-trivial. To allow for comparison, we propose to convert the object segment descriptor into a point set for dense supervision of an intermediate Dense RepPoints representation. We present two approaches to supervise Dense RepPoints, based on either an unorganized or an organized point set representation.
In this approach, Dense RepPoints learns to distribute points in a manner that minimizes the discrepancy between its point set and a target point set sampled from the segment supervision. The points are learned both to facilitate feature extraction and to minimize the overall point set discrepancy with the ground truth, without explicitly defining a meaning for each point on an object; we refer to them as unorganized points. Figure 2 (top) illustrates the target point sets used to approximate the three types of segment descriptors. For the contour representation, we sample the target points uniformly along the contour's perimeter (see Figure 2, top left). For the foreground area representation, we uniformly sample points from the object foreground region (see Figure 2, top middle). For the binary boundary map representation, we obtain the target point set by applying a distance transform to the contour map; each point in the object foreground area is assigned the attribute "foreground", while those in the background area are assigned "background" (see Figure 2, top right). We then use the Chamfer loss [18, 42] as a differentiable point set distance between the Dense RepPoints and the target point set to drive the learning:
L_pts(R, R') = (1/n) Σ_i min_j ||r_i - r'_j||_2 + (1/n') Σ_j min_i ||r_i - r'_j||_2,    (5)

where R = {r_i} denotes the n predicted Dense RepPoints and R' = {r'_j} the n' target points.
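A minimal NumPy sketch of this symmetric Chamfer-style set distance (illustrative only; a training implementation would be written in a differentiable framework):

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two 2D point sets of shapes
    (n, 2) and (m, 2): each predicted point is matched to its nearest
    target point, and vice versa, and the two averages are summed."""
    # Pairwise Euclidean distances, shape (n, m)
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because the loss depends only on nearest-neighbour distances between the two sets, it requires no point-to-point correspondence, which is what makes it suitable for unorganized supervision.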
We can alternatively address the issue by converting the non-box geometric descriptions from Cartesian to polar coordinates, in which points can be indexed by the angle between the point regression vector and the horizontal direction. We set the centroid of an object mask as the origin of the polar system and quantize the rotation space into a fixed number of angles. The points are organized by these angles, and we refer to them as an organized point set. We adopt different rules to sample the target points for the three non-box segment representations. For an object contour, the two intersection points at each quantized rotation angle are sampled, yielding two points per angle to represent the contour (see Figure 2, bottom left). For the foreground area representation, a fixed number of uniformly distributed points within the foreground area along each quantized rotation angle is sampled, yielding the target point set for the foreground area of an object (see Figure 2, bottom middle). For the binary boundary map representation, we likewise sample a fixed number of uniformly distributed points at each quantized rotation angle, but the points are located within the boundary area (see Figure 2, bottom right). These points are also associated with an attribute denoting whether they are foreground points. Since the organized representation establishes point-to-point correspondence, the difference between the predicted point set and the sampled target point set is computed as an averaged smooth L1 loss:

L_pts(R, R') = (1/n) Σ_k smooth_L1(r_k - r'_k),    (6)

where r_k and r'_k are the predicted and target points sharing the same polar index.
In experiments, both approaches perform well in representing all three forms of non-box object segments: they either perform significantly better than previous methods (object contour) or are the first attempt at representing these forms (foreground area and binary boundary map). The experiments also demonstrate the necessity of using a larger set of points for accurate representation.
Comparing the two types of representation, unorganized point sets perform moderately better than organized ones in their current implementations, especially for the object contour representation. For contours, the organized approach cannot accurately describe highly concave structures, since a single quantized rotation angle may have more than two intersections with the contour (e.g., the complex human pose in Figure 1), even with an unlimited number of points. For the other two non-box forms, both approaches can in principle describe a shape accurately given unlimited points.
In inference, Dense RepPoints requires conversion to the three non-box segment representations.
Object contour: Given the predicted Dense RepPoints for an object contour, we compute the enclosing boundary, also known as the concave hull, of the point set using a k-nearest-neighbours approach [32] (see Figure 3, left).
Foreground area: Given the predicted Dense RepPoints for an object foreground area, we dilate them using an adaptive Gaussian kernel to generate the foreground target (see Figure 3 middle).
Binary boundary map: Given the predicted attributed Dense RepPoints for a binary boundary map, we first employ Delaunay triangulation to triangulate the image space, and then adopt linear interpolation in the barycentric coordinate system of each triangle to compute foreground scores for arbitrary image pixels inside the triangle (see Figure 3, right).

As discussed above, dense points are needed for a better approximation of complex geometric structures. However, directly applying the feature extraction method of [46] would make this impractical, because its complexity increases linearly with the number of points. Denoting the number of points by n and the feature length of each point by C, the computational complexity of both the classification and regression branches in [46] is O(nC).
To address this issue, we introduce group pooling for classification and shared offset fields for point refinement, which make the computational complexity of both branches almost independent of the number of points.
For the classification branch, given the n representative points, we first divide them into k groups of roughly equal size (the last group takes the remainder). Then, we sample one feature per group instead of per point: the features of a subset of points within each group are sampled, and max-pooling is performed over them. Finally, a convolution is computed over the concatenated features from all groups. In this way, the computational complexity reduces to O(kC) and the space complexity to O(kC), independent of the number of points. In our implementation, we set k to 9 by default, which works comparably well to the counterparts using all point features.

Unlike the recognition branch, point location refinement and point-wise attribute prediction require information about individual points, so we cannot directly apply the grouped features used for recognition. Instead, we empirically find that local point features provide enough information for point refinement and attribute prediction. By refining each point's location using only that point's own feature, we further reduce computation and memory cost: all layers are shared across points except the last one, which produces the 2D offsets. Because the computation of the last layer is very small, the total computation and memory cost is roughly constant with respect to the number of points.
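The group pooling scheme for the classification branch can be sketched as follows (an illustrative NumPy version; the exact grouping rule for remainder points is an assumption of this sketch):

```python
import numpy as np

def group_pooled_features(point_feats, num_groups=9):
    """Group pooling: divide the (n, C) per-point features into
    num_groups groups and max-pool within each group, so the
    classifier input size depends on num_groups rather than n."""
    n, C = point_feats.shape
    group_size = int(np.ceil(n / num_groups))
    pooled = []
    for g in range(num_groups):
        chunk = point_feats[g * group_size:(g + 1) * group_size]
        if len(chunk) == 0:
            chunk = point_feats[-1:]  # guard for very small n
        pooled.append(chunk.max(axis=0))
    return np.stack(pooled)  # (num_groups, C), independent of n
```

Whether n is 25 or 81, the classifier sees the same (num_groups, C) input, which is what keeps the classification cost near constant as the point set densifies.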
Figure 4 illustrates our efficient computation processes for both the recognition and point refinement branches.
We design a box-free RepPoints Segmenter (RPSeg) for instance segmentation that uses Dense RepPoints as the intermediate representation throughout the pipeline. Similar to [46], we start from a center-point-based initial object representation and use Dense RepPoints as the intermediate feature sampling locations. The overall architecture is illustrated in Figure 5, using an FPN backbone as in [30, 46], where feature pyramid levels 3 (downsampling ratio of 8) to 7 (downsampling ratio of 128) are employed. The head architecture is illustrated in Figure 6.
In addition to the classification head and localization head, we introduce an optional attribute head to predict the attribute of each point. The localization subnet first computes offsets for the Dense RepPoints; the refinement and attribute predictions are then obtained by bilinear sampling of the predicted refine fields and attribute maps at the current point locations. For the classification branch, we use group pooling to sample the features of the Dense RepPoints at the current point locations, and fully-connected layers then predict the classification results. Owing to its anchor-free design and the efficient implementation of Section 4.2, our approach is fast, and even more efficient than the one-stage RetinaNet [30] (215.7 vs. 234.5 GFLOPS), while offering a finer object representation (225 points). We use the same label assignment approach as [46]. For the additional attribute branch, we use per-point binary cross-entropy for foreground/background attribute prediction.
Dense RepPoints can also simultaneously represent multi-granular object structures, e.g., the coarse object bounding box and the finer-grained object segment. This is achieved by using both bounding box and segment annotations to jointly supervise the learning of Dense RepPoints. In this way, different 2D visual representations, including bounding boxes, contours, and masks, are unified by this dense point set representation.
Unlike Mask R-CNN [21], where the mask and bounding box are processed by independent branches, Dense RepPoints provides a unified representation across granularities of object geometry. Experimental results also show that the object detection task can benefit greatly from object segments as a fine-grained supervision source, even on a strong baseline, with a 1.6 AP improvement from this multi-granular representation (see Table 4).
We present experimental results for multiple applications, all on the MS-COCO [31] benchmark, which contains 118k images for training, 5k images for validation (minival) and 20k images for testing (test-dev). We report ablation results on the 5k validation images (minival). The state-of-the-art comparison is reported on test-dev.
All our ablation experiments, unless otherwise noted, are conducted on minival with ResNet-50 [22]. Our framework is trained with synchronized stochastic gradient descent (SGD) over 8 GPUs with a total of 16 images per minibatch (2 images per GPU) for 12 epochs (the 1x setting). The learning rate is initialized to 0.01 and divided by 10 at epochs 8 and 11. The weight decay and momentum parameters are set to 0.0001 and 0.9, respectively. The ImageNet
[15] pre-trained model was used for initialization. We follow the training schedule and horizontal image flipping augmentation of [46] and use GN [43] and focal loss [30] to facilitate training. We use the unorganized point set by default. At inference, NMS is employed to post-process the results, following [30]. Code will be made available.

Table 1: Detection accuracy and computation cost of the base and efficient feature extraction for varying numbers of points n.

n  | FLOPS (base [46, 14]) | FLOPS (efficient) | AP (base [46, 14]) | AP (efficient)
25 | 267.9G                | 211.4G            | 38.3               | 38.5
49 | 334.0G                | 211.8G            | 38.7               | 38.7
81 | 422.2G                | 212.6G            | 38.9               | 39.1
To show the effectiveness of our efficient feature extraction design, we present detection results using the efficient implementation for the object foreground area representation. Results are shown in Table 1. With n = 25, the efficient approach achieves almost the same AP as the base approach with fewer FLOPS. As n increases, the accuracy of both approaches steadily improves, but the computation cost of the base approach also grows with the number of points. In contrast, the computational complexity of our method is nearly constant with respect to n.
We use Dense RepPoints as a contour representation for instance segmentation by converting the predicted contour polygons into foreground masks for evaluation. We report the mask AP for both organized point sets (Table 2a) and unorganized point sets (Table 2d). Performance improves as the number of points increases, indicating the necessity of dense points for fine geometric representation. Figure 7a presents some qualitative results; note that we show only the polygon vertices. In general, the points capture fine object shape and geometry.
We supervise the Dense RepPoints representation with a target point set sampled from the object foreground. We then generate foreground masks by rendering the points onto a single Gaussian heatmap: we choose a Gaussian kernel size on the order of sqrt(WH/n), where W and H denote the width and height of the proposal and n the number of dense points, and apply a threshold of 0.1 to the heatmap to obtain the binary mask. Morphological dilation and erosion are used to fill in holes. Results for organized and unorganized point sets are shown in Table 2b and Table 2e. As the number of points increases, performance improves significantly. Figure 7b shows the learned Dense RepPoints on the object foreground area (n = 81).
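The rendering procedure might be sketched as follows with SciPy; the exact kernel-size rule and the morphology structuring element are assumptions of this sketch:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_dilation, binary_erosion

def points_to_mask(points, height, width, num_points, threshold=0.1):
    """Render predicted foreground points to a binary mask: splat the
    points onto a heatmap, blur with a Gaussian whose scale grows as
    sqrt(H * W / n), threshold, then close small holes."""
    heat = np.zeros((height, width))
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            heat[yi, xi] = 1.0
    # One point "covers" roughly H*W/n pixels, hence the kernel scale
    sigma = max(1.0, np.sqrt(height * width / num_points) / 2)
    heat = gaussian_filter(heat, sigma)
    heat /= heat.max() + 1e-6
    mask = heat > threshold
    k = np.ones((3, 3), dtype=bool)
    mask = binary_erosion(binary_dilation(mask, k), k)  # fill holes
    return mask
```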
In the binary boundary map representation, the learned Dense RepPoints distribute around the object boundary, and each point predicts a foreground probability. We use the non-grid interpolation described in Section 4.1.3 to convert the attributed Dense RepPoints into a foreground mask, applying a threshold of 0.5 to the interpolated mask to obtain the binary output. Figure 7c presents some qualitative results (n = 225). For each instance, points predicted as foreground are plotted in a brighter color and points predicted as background in a darker color. As seen in the top-left image of Figure 7c, the binary boundary map representation can even represent objects with disconnected parts (the skis in the image). We report results for organized point sets in Table 2c and unorganized point sets in Table 2f; performance again improves with the number of points. We also compare against the grid binary map representation in Table 3, where the Dense RepPoints are replaced by the same number of points in grid form, with all other settings unchanged. The change from a grid to a non-grid representation brings substantial improvement, especially when the number of points is limited (8.3 AP improvement at the smallest point count). It is worth noting that we are among the first to represent object segments as non-grid scored points, which saves both time and memory: we need not score every pixel inside the bounding box, and can instead focus on the more difficult points near the boundary. A similar benefit is observed and exploited by the concurrent work [26]. We expect more investigation in this direction.
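Conveniently, SciPy's LinearNDInterpolator implements exactly this combination of Delaunay triangulation and barycentric linear interpolation, so the conversion from attributed points to a score map can be sketched as follows (an illustration under the assumption that the paper's procedure matches this standard scheme):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def interpolate_foreground(points, scores, height, width):
    """Non-grid interpolation of per-point foreground scores:
    the scattered (n, 2) points are Delaunay-triangulated and the
    scores linearly interpolated in barycentric coordinates.
    Pixels outside the convex hull receive score 0."""
    interp = LinearNDInterpolator(points, scores, fill_value=0.0)
    ys, xs = np.mgrid[0:height, 0:width]
    return interp(xs, ys)  # (height, width) foreground score map
```

Thresholding the returned score map at 0.5 then yields the binary mask output described above.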
Table 3: Mask AP of grid points vs. Dense RepPoints for the binary boundary map representation, for an increasing number of points (columns, left to right).

Method           | mask AP (increasing # of points)
TensorMask [9]   |   -      -      -      -     28.9
Grid points      |  5.0   15.6   22.7   25.7   28.6
Dense RepPoints  | 13.3   23.5   26.7   28.2   29.2
To better demonstrate the effectiveness of our dense representation, we compare it against other instance segmentation methods on COCO test-dev and report the results in Table 5. We use 81 points for the contour representation and 225 points for the binary boundary map representation. It can be seen that Dense RepPoints achieves the best performance among the non-box-based approaches. Our method differs fundamentally from previous methods, which are mostly based on grid binary maps [21, 9], and it performs significantly better than existing contour-based methods. Most importantly, we open two new directions for future study, namely object representation by foreground area and by binary boundary map. Although the current results are not as strong as the state-of-the-art Mask R-CNN [21] and TensorMask [9], there is much room for improvement. For example, the tensor bipyramid head in TensorMask [9] brings a considerable AP improvement. When our RPSeg and TensorMask both use an FPN head for fair comparison, Table 3 shows that RPSeg with the binary boundary map representation achieves a 0.3 AP improvement over TensorMask [9] on minival with a ResNet-50 [22] backbone, even though TensorMask [9] uses more training epochs (72) and a stronger FPN (pyramid levels 2 to 7). This demonstrates the great potential of the binary boundary map representation.

Table 4: Object detection with and without dense supervision from the foreground area, for varying numbers of points n (diff. is the AP gain from dense supervision).

n  | Dense supervision | AP   | AP50 | AP75 | diff.
25 | ✗                 | 37.7 | 58.0 | 39.7 |
25 | foreground area   | 38.5 | 58.4 | 40.9 | +0.8
49 | ✗                 | 37.5 | 57.8 | 39.6 |
49 | foreground area   | 38.7 | 58.9 | 41.9 | +1.2
81 | ✗                 | 37.5 | 57.7 | 39.8 |
81 | foreground area   | 39.1 | 58.7 | 42.1 | +1.6
Table 5: Comparison with state-of-the-art methods on COCO test-dev (object detection and instance segmentation AP / AP50 / AP75).

Method           | Backbone        | Anchor | Detection AP / AP50 / AP75 | Representation  | Segmentation AP / AP50 / AP75
Mask R-CNN [21]  | ResNeXt-101     | ✓      | 39.8 / 62.3 / 43.4         | grid binary     | 37.1 / 60.0 / 39.4
TensorMask [9]   | ResNet-101      |        | -                          | grid binary     | 37.1 / 59.3 / 39.4
FCIS [29]        | ResNet-101      |        | -                          | grid binary     | 29.2 / 49.5 / -
YOLACT [3]       | ResNet-101      | ✓      | 33.7 / 54.3 / 35.9         | grid binary     | 31.2 / 50.6 / 32.8
ExtremeNet [48]  | Hourglass-104   | ✗      | 40.2 / 55.5 / 43.2         | contour         | 18.9 / 44.5 / 13.7
CornerNet [28]   | Hourglass-104   | ✗      | 40.5 / 56.5 / 43.1         | -               | -
CenterNet [47]   | Hourglass-104   | ✗      | 42.1 / 61.1 / 45.9         | -               | -
RepPoints [46]   | ResNet-101-DCN  | ✗      | 42.8 / 65.0 / 46.3         | -               | -
RepPoints [46]   | ResNeXt-101-DCN | ✗      | 44.5 / 65.6 / 48.3         | -               | -
Ours             | ResNeXt-101-DCN | ✗      | 45.3 / 65.9 / 49.1         | contour         | 30.0 / 57.2 / 28.2
Ours             | ResNeXt-101-DCN | ✗      | 45.8 / 66.7 / 49.6         | boundary binary | 33.7 / 59.9 / 34.3
We perform experiments on utilizing dense supervision to assist object detection by explicitly supervising the sampling locations of the Dense RepPoints. Specifically, we sample the target point set from the object foreground area and add the point set distance between it and the predicted Dense RepPoints as an auxiliary loss. Table 4 presents object detection results with different numbers of points. We observe a consistent gain in bbox AP as the point set becomes denser; the fine supervision improves object detection by up to 1.6 AP at n = 81. Note also that the AP75 metric improves substantially, suggesting that Dense RepPoints models a finer geometric representation. This novel application of explicit multi-task learning again verifies the necessity of a denser point set and demonstrates the effectiveness of our multi-granular representation.
We also perform a system-level comparison between our object detector, explicitly assisted by mask supervision, and state-of-the-art detectors on COCO test-dev [31], reported in Table 5. Without multi-scale training and testing, our method achieves 45.8 AP without needing anchors, outperforming other strong competitors.
In this paper, we present Dense RepPoints, a dense attributed point set representation for 2D objects. By introducing efficient feature extraction and employing dense supervision, this work takes a step towards learning a geometric, semantic, and unified representation for top-down object recognition pipelines, enabling explicit modeling between different visual entities. Experimental results show that this new dense 2D representation is not only applicable to predicting dense targets such as contours and foreground masks, but can also help improve other tasks such as object detection via its novel multi-granular object representation.
Acknowledgement We thank Jifeng Dai and Bolei Zhou for discussion and comments about this work. Jifeng Dai was involved in the early discussions of the work. He gave up the authorship after he joined another company.