Location-Sensitive Visual Recognition with Cross-IOU Loss
Object detection, instance segmentation, and pose estimation are popular visual recognition tasks that require localizing the object by internal or boundary landmarks. This paper summarizes these tasks as location-sensitive visual recognition and proposes a unified solution named the location-sensitive network (LSNet). Based on a deep neural network as the backbone, LSNet predicts an anchor point and a set of landmarks which together define the shape of the target object. The key to optimizing the LSNet lies in the ability to fit various scales, for which we design a novel loss function named the cross-IOU loss that computes the cross-IOU of each anchor point-landmark pair to approximate the global IOU between the prediction and the ground-truth. The flexibly located and accurately predicted landmarks also enable LSNet to incorporate richer contextual information for visual recognition. Evaluated on the MS-COCO dataset, LSNet sets a new state-of-the-art accuracy for anchor-free object detection (a 53.5 box AP) and shows promising performance in detecting multi-scale human poses. Code is available at https://github.com/Duankaiwen/LSNet
Object recognition is a fundamental task in computer vision. Beyond image classification that depicts an image using a single semantic label, there exist other recognition tasks that not only predict the class of the object but also localize it using fine-scaled information. In this paper, we consider three popular examples including object detection [19, 36], instance segmentation [19, 13], and human pose estimation [1, 36]
. We notice that, although the rapid progress of deep learning has introduced powerful deep networks as the backbone [25, 60, 20], the designs of the head modules for detection [38, 48, 49, 52, 65, 31, 30, 63, 6], segmentation [59, 24, 44, 7], and pose estimation [51, 57, 5, 40, 66] have fallen into individual sub-fields. This is mainly due to the difference in the prediction target, i.e., a bounding box for detection, a pixel-level mask for segmentation, and a set of keypoints for pose estimation, respectively.
Going one step further, we merge the aforementioned three tasks into one, named location-sensitive visual recognition (LSVR). On the basis of this definition, we propose a location-sensitive network (LSNet) as a unified formulation to deal with them all. The LSNet is built upon any network backbone, e.g., those designed for image classification. The key is to relate an object to an anchor point and a set of landmarks that accurately localize the object. In particular, the landmarks correspond to the four extreme points for object detection, sufficiently dense boundary pixels for instance segmentation, and the keypoints for human pose estimation. Note that the anchor point as well as the landmarks can also be used for extracting discriminative features of the object and thus assisting recognition. Figure 1 illustrates the overall idea.
The major difficulty of optimizing the LSNet lies in the requirement of fitting objects of different scales and properties, which existing loss functions, including the smooth-$\ell_1$ loss and the IOU loss, cannot satisfy. This motivates us to design a novel loss function named the cross-IOU loss
. It assumes that the landmarks are uniformly distributed around the anchor point and thus approximates the IOU (between the prediction and the ground-truth) using the coordinates of the offset vectors. The cross-IOU loss is easily implemented in a few lines of code. Compared to other loss functions, it achieves a better trade-off between the global and local properties and transfers more easily to multi-scale feature maps without specific parameter tuning.
We perform all three tasks on the MS-COCO dataset. LSNet, equipped with the cross-IOU loss, achieves competitive recognition accuracy. We further equip the LSNet with a pyramid of deformable convolutions that extracts discriminative visual cues around the landmarks. As a result, LSNet reports box AP and mask AP scores that surpass those of all existing anchor-free methods. For human pose estimation, LSNet reports competitive results without using heatmaps, offering a new possibility to the community. Moreover, LSNet shows a promising ability in detecting human poses at various scales, some of which were not annotated in the dataset.
On top of these results, we claim two-fold contributions of this work. First, we present the formulation of location-sensitive visual recognition that inspires the community to consider the common property of these tasks. Second, we propose the LSNet as a unified framework in which the key technical contribution is the cross-IOU loss.
Deep neural networks have been widely applied to visual recognition tasks. Among them, image classification is the fundamental task that facilitates the design of powerful network backbones [25, 60, 20]. Beyond image-level description, there exist fine-scaled tasks, including object detection, instance segmentation, and pose estimation, which focus on depicting different aspects of the object. For example, bounding boxes locate objects simply and efficiently but lack details, while masks and keypoints reflect the shape and pose of the objects but usually need the bounding boxes to locate the object first. According to the different properties of these tasks, many representative methods have been developed.
The object detection methods can be roughly categorized into anchor-based and anchor-free. The anchor-based methods detect objects by placing a pre-defined set of anchor boxes, predicting the class and score for each anchor, and finally regressing the preserved boxes tightly around the objects. Representative methods include Fast R-CNN, Faster R-CNN, R-FCN, SSD, RetinaNet, Cascade R-CNN, etc. On another line, the anchor-free methods usually represent an object as a combination of geometric elements. Among them, CornerNet and DeNet generate a bounding box by predicting a pair of corner keypoints, and CenterNet (keypoint triplets) and CPNDet apply semantic information within the objects to filter out incorrect corner pairs. FCOS, RepPoints, FoveaBox, SAPD, CenterNet (objects as points), YOLO, etc., define a bounding box by placing a single point (called the anchor point) within the object and predicting its distances to the object boundary.
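The anchor-point formulation in the last family can be sketched in a few lines; the function name and argument order below are ours, not taken from any of the cited implementations.

```python
def decode_point_box(px, py, left, top, right, bottom):
    """Turn an anchor point (px, py) and its predicted distances to the
    four object boundaries into an (x1, y1, x2, y2) bounding box, in the
    style of point-plus-distances anchor-free detectors."""
    return (px - left, py - top, px + right, py + bottom)
```

Because the four distances are independent non-negative scalars, the same point can describe a box of any aspect ratio without pre-defined anchor shapes.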
For instance segmentation, there are mainly two kinds of methods, namely, the pixel-based and the contour-based methods. The pixel-based methods consider segmentation as predicting the class of each single pixel. One of the representative works is Mask R-CNN, which first predicts a bounding box to help locate the objects and then uses pixel-wise classification to determine the object mask. The contour-based methods instead represent an object by its contour. They often start with a set of points roughly located around the object boundary, which are then iteratively refined toward the boundary. The early representative methods are the snake series [27, 11, 23, 12], and recent efforts include using deep neural networks and anchor-free ideas to improve the features, such as DeepSnake and PolarMask.
There are two mainstreams for human pose estimation, namely, the bottom-up methods [3, 10, 26, 39, 5] and the top-down methods [66, 54, 58, 37]. The bottom-up methods first locate all the keypoints in the image and then group them into individual persons, while the top-down methods first detect each person and then locate the keypoints within it. The keypoints are often sparsely distributed in an image and thus are difficult to locate accurately. A practical solution is to detect the keypoints on a high-resolution feature map, called a heatmap [40, 10]. However, applying the heatmap makes the optimization hard and introduces a complex post-processing operation. CenterNet proposes a neat and simple method that only predicts a center heatmap; the keypoints are obtained by regressing the vectors from the object center to the keypoints.
This paper particularly focuses on the anchor-free methods for visual recognition. These methods originated from object detection and have drawn a lot of attention recently. They do not rely on pre-defined anchor boxes to locate objects but instead use points and distances. Therefore, the anchor-free methods enjoy the flexibility to extend in various directions, which offers researchers a possibility to unify the visual recognition tasks. Recent works have spent effort extending the anchor-free methods from object detection to other tasks, e.g., PolarMask extends the anchor-free idea to instance segmentation, while CenterNet applies it to pose estimation. Compared with our framework, both of them have limitations, which we discuss in detail in Section 3.2.
Visual recognition tasks start with an image. Image classification aims to assign a class label to the entire image, yet there are more challenging tasks for fine-scaled recognition. These tasks often focus on the instances (i.e., individual objects) in the image and depict the object properties from different aspects. Typical examples include object detection, which uses a rectangular box to tightly cover the object; instance segmentation, which finds every pixel that belongs to the object; and human pose estimation, which localizes the landmarks of the object (i.e., the human keypoints). The targets of these tasks are defined with respect to the image width and height and, for pose estimation, the number of keypoints.
An important motivation of our work is that, although these tasks differ from each other in the form of description, they share the common requirement that the model should be sensitive to the location of the anchor and/or landmarks. Throughout the remaining part of this paper, we refer to these tasks as location-sensitive visual recognition and design a unified framework for them.
The proposed location-sensitive network (LSNet) starts with a backbone (e.g., ResNet, ResNeXt, etc.) that extracts features from the input image. On top of the extracted features, an anchor point and a set of landmarks are predicted: the prediction for each object consists of one anchor point and a fixed, task-dependent number of landmarks.
As a unified framework, the key is to relate the prediction targets of the three tasks to the anchor point and landmarks. For object detection, this is done by finding an extreme point (a pixel that belongs to the object and is tangent to the bounding box) on each edge of the bounding box (as a disclaimer, there may exist multiple or even continuous extreme points on each edge; we assume the method finds any one of them, by which it confirms the prediction of the bounding box). For instance segmentation, we locate a fixed number of landmarks along the contour and thus use the formed polygon to approximate the shape of the object (in case the mask is not simply connected in topology, we follow PolarMask to deal with each part separately and ignore the holes). For human pose estimation, we follow the definition of the dataset to learn a fixed number of keypoints, e.g., the 17 keypoints defined in the MS-COCO dataset.
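For the detection case, composing the four extreme points into a bounding box is straightforward; the helper below is an illustrative sketch (the name is ours).

```python
def extremes_to_box(top, left, bottom, right):
    """Compose the four extreme points (each an (x, y) pair) of an
    object into the tight (x1, y1, x2, y2) bounding box they define:
    the left/right points fix the horizontal extent and the
    top/bottom points fix the vertical extent."""
    return (left[0], top[1], right[0], bottom[1])
```

Note that the extreme points carry strictly more information than the box itself: the box is recoverable from them, but not vice versa.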
Figure 2 shows the pipeline of LSNet. It belongs to the category of anchor-free methods, i.e., there is no need to pre-define a set of anchor boxes for localizing the object. LSNet is partitioned into two stages, where the first stage predicts an anchor point from the FPN head and relates it with a set of landmarks, and the second stage composes the landmarks into an object with the desired geometry (e.g., a bounding box). To facilitate accurate localization, we use the ATSS assigner to assign more anchor points to each object and extract features with deformable convolution (DCN) upon the predicted landmarks. The entire model is end-to-end trainable.
LSNet receives two sources of supervision for localization and classification, elaborated in Sections 3.3 and 3.4, respectively. The localization loss is added to both stages, where the major contribution is a unified loss that fits the properties of different tasks, and the classification loss is added to the second stage upon the DCN features.
LSNet extends the border of anchor-free methods for location-sensitive visual recognition. We briefly review two counterparts. (i) CenterNet predicts horizontal or vertical offsets beyond the anchor point for object detection. This limits its ability to find the extreme points and extract discriminative features, and it cannot perform instance segmentation. (ii) PolarMask uses a polar coordinate system for instance segmentation, making it difficult to handle the situation in which a ray intersects the object boundary multiple times in some direction. In comparison, LSNet easily handles these challenging scenarios and reports superior performance (see the experimental part, Section 4).
The unified framework raises new challenges for the supervision of localization, because the loss function needs to consider both the global and local properties of the object. To clarify, the evaluation of object detection and instance segmentation judges whether an object is correctly recognized by the global IOU between the prediction and the ground-truth, while pose estimation measures the accuracy of each individual keypoint.
To this end, we design the cross-IOU loss as the unified supervision. The loss is defined upon the predicted and ground-truth objects: for each anchor point, we denote the predicted offsets to the $K$ landmarks as $\mathbf{v}_k$ and the corresponding ground-truth offsets as $\hat{\mathbf{v}}_k$, $k = 1, 2, \ldots, K$. Each offset is expressed in a cross-coordinate system, i.e., $\mathbf{v}_k = (x_k^{\mathrm{r}}, x_k^{\mathrm{l}}, y_k^{\mathrm{t}}, y_k^{\mathrm{b}})$, where the four non-negative components record the projections of the offset onto the right, left, top, and bottom half-axes, respectively. Finally, we write the cross-IOU of each pair as:

$$f(\mathbf{v}_k, \hat{\mathbf{v}}_k) = \frac{\|\min(\mathbf{v}_k, \hat{\mathbf{v}}_k)\|_1}{\|\max(\mathbf{v}_k, \hat{\mathbf{v}}_k)\|_1}, \quad (1)$$

where $\|\cdot\|_1$ indicates the $\ell_1$-norm and $\min(\cdot)$ and $\max(\cdot)$ are taken element-wise. In other words, the cross-IOU function rewards components of similar length (in which case the prediction and the ground-truth maximally overlap) and penalizes components on different directions. Based on Eqn (1), we define the cross-IOU loss as $\mathcal{L}_{\mathrm{cross\text{-}IOU}} = \frac{1}{K}\sum_{k=1}^{K}\left(1 - f(\mathbf{v}_k, \hat{\mathbf{v}}_k)\right)$. Obviously, when $\mathbf{v}_k = \hat{\mathbf{v}}_k$ for all $k$, we have $\mathcal{L}_{\mathrm{cross\text{-}IOU}} = 0$ as expected (the current form of $f$ can cause the gradients to vanish when the corresponding dimensions of the prediction and the ground-truth are of different signs; we design a softened prediction mechanism to solve the issue, see Appendix A for details).
The cross-IOU loss brings a direct benefit: it fits different scales of features without the need for specific parameter tuning. This alleviates the difficulty of integrating multi-scale information, e.g., using the feature pyramid. In comparison, the smooth-$\ell_1$ loss is sensitive to the scale of the vectors (e.g., the loss value tends to be large when the feature resolution is large) and neglects the relationship between the components that come from the same vector. Moreover, by approximating the IOU using individual components, the cross-IOU loss transfers flexibly to instance segmentation and human pose estimation, unlike the original IOU loss, which is difficult to compute on polygons (for segmentation) and undefined for discrete keypoints (for pose estimation).
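Under a four-component, non-negative representation of each offset (as detailed in Appendix A), the loss can indeed be written in a few lines. The NumPy sketch below reflects our reading of the formulation; the array shapes and the eps guard are implementation choices, not taken from the released code.

```python
import numpy as np

def cross_iou_loss(pred, gt, eps=1e-9):
    """Cross-IOU loss for offset vectors in the cross-coordinate system.

    pred, gt: arrays of shape (K, 4) holding the non-negative
    (right, left, top, bottom) components of each anchor-to-landmark
    offset.  The per-vector cross-IOU is the ratio of the element-wise
    minimum to the element-wise maximum (an l1 intersection-over-union),
    and the loss averages 1 - IOU over the K vectors."""
    inter = np.minimum(pred, gt).sum(axis=-1)
    union = np.maximum(pred, gt).sum(axis=-1) + eps
    return float(np.mean(1.0 - inter / union))
```

Because the ratio is scale-invariant (doubling both `pred` and `gt` leaves it unchanged), the same loss applies unchanged across feature-pyramid levels.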
To enhance discriminative information for recognition, we use deformable convolution (DCN) [15, 70] to extract features from the landmarks. The standard DCN has a fixed number of offsets (nine for a 3x3 kernel), while in our case the number of offsets is determined by the number of landmarks for detection, segmentation, and pose estimation, respectively. In the latter two cases, to avoid redundant features extracted from close areas, we sample landmarks uniformly from the candidates. We further build the feature extraction module upon the feature pyramid. The offsets are adjusted to different stages by rescaling the vectors accordingly.
We name the proposed module the Pyramid-DCN and illustrate it in Figure 3. As shown in the experiments, both extracting features from the landmarks and using the pyramid structure improve recognition accuracy.
We evaluate our framework on the MS-COCO dataset, a popular, large-scale dataset for object detection, segmentation, and human pose estimation. For object detection and segmentation, it provides training, validation, and test-dev splits covering 80 object categories. For human pose, each person instance is labeled with 17 keypoints.
The average precision (AP) metric is applied to characterize the performance of our method as well as other competitors. There are subtle differences in the definition of AP across tasks. For object detection, AP is the average precision under different bounding box IOU thresholds (from 0.50 to 0.95), while the bounding box IOU is replaced with the mask IOU in instance segmentation. In the human pose task, AP is calculated based on the object keypoint similarity (OKS), which reflects the distance between the predicted keypoints and the annotations.
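For reference, the OKS computation can be sketched as follows; this is a simplified illustration of the COCO-style similarity, not the official cocoeval code, and the per-keypoint constants `k` are left to the caller.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between predicted and ground-truth
    keypoints: a Gaussian falloff of the per-keypoint squared distance,
    scaled by the object area and a per-keypoint constant, averaged
    over the annotated (visible) keypoints.

    pred, gt: (N, 2) arrays; visible: (N,) boolean mask;
    area: object area; k: (N,) per-keypoint falloff constants."""
    d2 = np.sum((np.asarray(pred) - np.asarray(gt)) ** 2, axis=-1)
    e = d2 / (2.0 * area * np.asarray(k) ** 2)
    return float(np.mean(np.exp(-e)[np.asarray(visible)]))
```

A perfect prediction yields an OKS of 1, and the score decays smoothly as keypoints drift, which is why small pixel-level deviations directly depress the high-threshold AP discussed below.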
We use backbones with weights pre-trained on ImageNet. The feature pyramid network (FPN) is applied to deal with objects of different scales. For object detection, we set four vectors for each object to learn to find the four extreme points (top, left, bottom, right). We refer to ExtremeNet to obtain extreme point annotations from the object mask (as a side comment, when annotating bounding boxes on a dataset, we recommend annotating an object by clicking its four extreme points, i.e., the top-most, left-most, bottom-most, and right-most points; this is roughly four times faster than directly annotating the bounding boxes, and the extreme points themselves carry object information). For instance segmentation and human pose estimation, we set one vector per contour point for each instance to regress the locations of the contour points, and one vector per keypoint (17 on MS-COCO) to regress the keypoints.
Training and Inference. We train our framework on eight NVIDIA Tesla V100 GPUs with two images on each GPU, using momentum SGD with weight decay. In the ablation study, we use a ResNet-50 pre-trained on ImageNet as the backbone, fine-tune the model with single-scale inputs, augment the training images with random horizontal flipping, and decay the learning rate in two late stages of training. We also use stronger backbones, longer training schedules (with correspondingly later learning rate decay steps for object detection, instance segmentation, and human pose estimation), and multi-scale input images to further improve the recognition accuracy. In the first stage, we only select the anchor point closest to the center of the object as a positive sample. In the second stage, we use the ATSS assigner to assign the anchor points for each object. The overall loss function is
$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{focal}} + \lambda_2 \mathcal{L}_{\mathrm{cross\text{-}IOU}}$, where $\mathcal{L}_{\mathrm{focal}}$ and $\mathcal{L}_{\mathrm{cross\text{-}IOU}}$ denote the Focal loss and our cross-IOU loss, respectively, and $\lambda_1$ and $\lambda_2$ are balancing coefficients fixed in the experiments. During inference, both single-scale and multi-scale testing strategies are applied. For multi-scale testing, we refer to ATSS to set the image scales. We also use the non-maximum suppression (NMS) strategy to remove redundant results.
Comparisons to SOTA. We evaluate the detection accuracy of LSNet on the MS-COCO test-dev set; the results are shown in Table 1. As Table 1 shows, with a ResNet-50 backbone, our anchor-free LSNet runs at 12.7 FPS with a box AP that is competitive with detectors equipped with deeper backbones. When equipped with stronger backbones, LSNet performs even better. This benefits from our proposed cross-IOU loss: it helps the LSNet locate the landmarks with high accuracy, and the rich global information contained in the landmarks in turn helps regress the landmarks more accurately. With the additional corner point verification (CPV) and multi-scale testing, LSNet achieves a box AP that outperforms all the anchor-free detectors we are aware of.
Cross-IOU Loss for Vector Regression. To evaluate the performance of the cross-IOU loss, we design four comparison experiments on the MS-COCO validation set: (i) the GIOU loss (a variant of the IOU loss) for rectangular bounding box regression, (ii) the smooth-$\ell_1$ loss for rectangular bounding box regression, (iii) the smooth-$\ell_1$ loss for extreme bounding box regression, and (iv) the cross-IOU loss for extreme bounding box regression. All the experiments are done in the first stage of our framework (shown in Figure 2) with ResNet-50 as the backbone, and all models are trained with the same schedule. Table 2 summarizes the results. The smooth-$\ell_1$ loss reports a lower AP when regressing the extreme bounding boxes than the rectangular bounding boxes, which reveals that it is more difficult to regress an angled vector than an axis-aligned one. By contrast, the cross-IOU loss performs much better than the smooth-$\ell_1$ loss and even produces results competitive with the IOU loss on rectangular bounding boxes. Although the GIOU loss at present still performs better than the cross-IOU loss, the cross-IOU loss allows the framework to regress the locations of the landmarks, so we can extract the discriminative information around the landmarks to enhance recognition. We will show in the next section that the combination of the cross-IOU loss and landmark feature extraction significantly boosts recognition accuracy.
Landmark Features Improve Precision. The landmarks (in particular, the extreme points in the detection task) are often related to discriminative appearance features, which may benefit visual recognition. To confirm this, we investigate different settings: using the anchor point features alone, and integrating the anchor point features with either DCN or extreme point features. We still report the detection accuracy with all other settings remaining the same as in the previous experiments (studying the cross-IOU loss). The results are shown in Figure 4. For the smooth-$\ell_1$ loss and the GIOU loss, we regress the rectangular bounding boxes, and the DCN features are extracted by the adaptively learned DCN kernel; for the cross-IOU loss, we use two sets of vectors, both of which regress the extreme bounding boxes: the first set is trained to predict the extreme points and extract the extreme features, and we use the extreme features along with the anchor point features to train the second set from scratch. As shown in Figure 4, both the extreme and DCN features boost the classification accuracy. Recall that the prior experiments suggested the extreme features are useful for localization; combining the current results, we verify that the features around the landmarks are discriminative and thus benefit visual recognition.
Pyramid-DCN Improves Precision. We further equip the LSNet with the Pyramid-DCN to extract multi-scale features around the landmarks. Table 3 shows that our method achieves a higher AP with the features extracted by the Pyramid-DCN, outperforming the single-scale features by a clear margin.
Comparisons to SOTA. We show the instance segmentation results evaluated on the MS-COCO test-dev set in Table 4. LSNet achieves competitive mask APs under both the single-scale and multi-scale testing protocols, surpassing all published contour-based methods to the best of our knowledge, and the accuracy is even competitive among the pixel-based approaches.
Comparisons with PolarMask. It is interesting to further compare our method with PolarMask, the previous best contour-based approach for instance segmentation. The major difference is that PolarMask assumes the entire object boundary to be visible from the anchor point, which may not hold for complicated objects. Once a ray along some direction intersects the boundary more than once, the method considers only one intersection and thus incurs an accuracy loss (a typical example is the 'motorcycle' contour in Figure 5). In our approach, this issue is solved by ranking the landmarks more flexibly, which makes LSNet compatible with complicated shapes.
The Number of Landmarks. LSNet represents each instance using a polygon. Using a larger number of landmarks improves the upper bound of accuracy, but can also incur heavy computational costs and make the landmark prediction module difficult to optimize. To choose a proper number of landmarks, we refer to the ground-truth masks of the MS-COCO validation set and quantize each mask into a polygon that best describes it. Evaluating the upper-bound AP under several landmark counts, we choose a moderate number as a nice tradeoff between accuracy and cost.
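The mask-to-polygon quantization can be approximated by arc-length resampling of the ground-truth contour. The sketch below is our own illustration of this step; the paper's exact quantization procedure may differ.

```python
import numpy as np

def resample_contour(points, n):
    """Uniformly resample a closed contour (an (M, 2) array of vertices)
    to n landmarks by arc length, so the polygon formed by the landmarks
    approximates the original mask boundary."""
    pts = np.asarray(points, dtype=float)
    closed = np.vstack([pts, pts[:1]])               # close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])    # arc length at each vertex
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)
    out = np.empty((n, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side='right') - 1
        denom = seg[j] if seg[j] > 0 else 1.0
        a = (t - cum[j]) / denom                     # fraction along segment j
        out[i] = closed[j] * (1 - a) + closed[j + 1] * a
    return out
```

Increasing `n` in such a resampling directly tightens the polygon's fit to the true boundary, which is the upper-bound effect measured in this ablation.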
Comparisons to SOTA. Unlike most human pose estimation methods, which predict the keypoints using heatmaps, LSNet predicts the keypoints using regression only. In the experiments, we use the object bounding boxes ('obj-box') and the keypoint-boxes ('kps-box') to assign training samples, respectively; we give a detailed discussion of the difference between the two methods in Appendix B. On the MS-COCO test-dev set, both LSNet w/ obj-box and LSNet w/ kps-box outperform CenterNet-reg with the Hourglass-104 backbone. However, LSNet does not perform as well as the heatmap-based methods, and we analyze the reason as follows.
Error Analysis. We observe that LSNet struggles particularly in the high-OKS regimes: compared to Pose-AE, the deficit of LSNet w/ obj-box is small at low OKS thresholds but grows considerably at high OKS thresholds. Note that keypoint regression is not as accurate as heatmap-based refinement, and thus LSNet is less precise in pixel-level prediction; the AP metric of pose estimation, however, is largely impacted by this factor. To show this, we artificially add small average pixel deviations to the prediction results of CenterNet-jd (with a Hourglass-104 backbone), and the AP on the MS-COCO validation set is significantly reduced as the deviation grows.
On the other hand, we use the heatmaps produced by CenterNet-jd (Hourglass-104) to refine the prediction of LSNet w/ obj-box. As a result, the AP on the MS-COCO validation set is clearly improved. This suggests that LSNet still needs further manipulation of high-resolution features to reach higher pixel-level accuracy.
The Benefit of LSNet. Despite the relatively weak pixel-level localization, LSNet (w/ obj-box) enjoys the ability to perceive multi-scale human instances, many of which are not annotated in the dataset. Some examples are shown on the right side of Table 5. Since no ground-truth is available to evaluate the impact, we refer to the heatmaps of CenterNet-jd (Hourglass-104) to deliberately remove these 'false positives'. Consequently, the AP is further improved to a level comparable with the heatmap-based methods, though the improvement seems less meaningful.
We show some visualized results of LSNet in Figure 5, including object detection, instance segmentation, and human pose estimation. Please refer to the appendix for more qualitative results.
This paper unifies three location-sensitive visual recognition tasks (object detection, instance segmentation, and human pose estimation) using the location-sensitive network (LSNet). The key module that supports the framework is a novel cross-IOU loss that is amenable to supervision from multiple scales. Equipped with a Pyramid-DCN, LSNet achieves state-of-the-art performance on anchor-free detection and segmentation. This work suggests that using keypoints to define and localize objects is a promising direction, and we hope to extend our approach to achieve a stronger ability of generalization.
As mentioned in Section 3.3, the form of $f$ (Equation 1) can cause the gradients over the offset components to be zero when the corresponding dimensions of the prediction and the ground-truth are of different signs. To solve this problem, we predict four components for each offset vector, as shown in Figure 6, so that every component is greater than 0. On the other hand, when transforming the ground-truth into the cross-coordinate system, as shown in Figure 6, we assign the minimum sides a non-zero value that is $\varepsilon$ times the corresponding maximum sides, where $0 < \varepsilon < 1$; the same $\varepsilon$ is used in all our experiments.
During inference, we transform the predicted offset vectors from the cross-coordinate system back into the rectangular coordinate system by taking the maximum value in the horizontal and vertical directions, respectively, i.e., the larger of the two sides determines the magnitude and sign of each coordinate.
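This encoding and decoding can be sketched per scalar component as follows; the `eps` default is illustrative only, not the paper's setting.

```python
def encode_cross(x, eps=0.2):
    """Soften a signed offset component into two non-negative
    cross-coordinate sides: the side matching the sign of x keeps |x|,
    and the opposite side receives eps * |x| so that gradients do not
    vanish when prediction and ground truth disagree in sign.
    The eps default here is an illustrative placeholder."""
    if x >= 0:
        return abs(x), eps * abs(x)   # (positive side, negative side)
    return eps * abs(x), abs(x)

def decode_cross(pos, neg):
    """Recover the signed component by keeping the larger side,
    as done at inference time."""
    return pos if pos >= neg else -neg
```

The round trip is lossless for any eps below 1, since the dominant side always retains the full magnitude.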
In Section 4.5, we mainly discuss the characteristics of using the object bounding boxes ('obj-box') to assign training samples, which lets LSNet perceive multi-scale human instances, especially small ones, many of which are not annotated in the dataset. In this section, we discuss the characteristics of using the keypoint-boxes ('kps-box', the bounding box generated by the topmost, leftmost, bottommost, and rightmost keypoints of an object) to assign training samples. Compared with the former, the latter no longer treats human instances that only have bounding box annotations but lack pose annotations as positive samples. This makes the network pay more attention to the human instances that have pose annotations, which helps to improve the AP score. As shown in Table 5, LSNet using keypoint-boxes reports a higher AP than LSNet using object bounding boxes.
However, we find that, along with the 'improved' AP score, the ability of the algorithm to perceive multi-scale human instances is weakened. As shown in Figure 7, the modified algorithm mostly fails to detect the small person instances. This suggests that the annotations of the dataset are biased.