Object detection has served as one of the defining problems in the field of computer vision for over 50 years papert1966summer. The “object” and its relation to the scene have no universal formalization or definition, a topic under extensive research and debate in mathematics, computer science, cognitive science, and philosophy. With every attempt at explicitly defining what it means to be a distinct object in the visual scene, a lot of valuable semantic knowledge is discarded feldman2003visual. In computer vision, objects in 2D image space have been defined by their 2D bounding boxes redmon2016you, 3D bounding boxes mousavian20173d, polygons castrejon2017annotating, splines castro2015statistical, pixels liang2018proposal, and voxels girdhar2016learning. Each representation has benchmarks and state-of-the-art algorithms. Each has strengths and weaknesses when considered from the pragmatic perspective of a particular application (e.g., robot vision), providing variable levels of fidelity, information density, and annotation cost.
We propose a new representation based on the bivariate normal distribution (5 parameters) as an alternative to the most commonly used object representation of 2D bounding boxes (4 parameters). This distribution-based representation has the benefit of robust detection of highly-overlapping objects, as shown in Fig. 1. No well-established benchmarks are available to evaluate this statistical representation, so for the detection task we rely primarily on qualitative evaluation. Conceptually, the strength of this representation is its emphasis on the object's center of visual mass rather than the object's edges, allowing for uncertainty around the latter. As a result, the derived tasks of object tracking and instance segmentation may become more robust to the inherent spatial and temporal variability of object edges and to occlusion artifacts. We provide a baseline instance segmentation approach based on this statistical representation, motivating further work on using it for the downstream segmentation and tracking tasks. Ultimately, object detection is a simplification of the general task of perception and visual scene understanding. One of the underlying open questions raised by this work is whether bounding boxes are the most useful minimalist representation of objects in real-world detection tasks.
2 Related Work
2.1 Object Detection
Convolutional neural networks (CNNs) have been used in recent years to achieve state-of-the-art performance on object detection girshick2014rich ; ren2015faster ; liu2016ssd. These CNN-based methods can be divided into two types: one-stage methods and two-stage methods. One-stage methods such as YOLO redmon2016you or SSD liu2016ssd directly predict the bounding boxes of interest with a single feedforward pass through the network. Two-stage methods such as Faster R-CNN ren2015faster or R-FCN dai2016r first generate proposals, and then exploit the region features extracted by the CNN for further refinement. Further refinements of these methods focus on addressing various drawbacks such as the lack of robustness to scale variation, often achieving state-of-the-art performance on object detection benchmarks li2019scale (e.g., on the COCO Object Detection Task lin2014microsoft).
2.2 Box-Free Instance Segmentation
Although instance segmentation has been viewed as a more advanced form of object detection, many recent advances in instance segmentation are still box-dependent. For example, dai2016instance ; li2017fully ; he2017mask ; arnab2017pixelwise first detect objects with a box and then segment each object using the box as a guide, while pinheiro2015learning ; dai2016instances generate mask proposals in a dense sliding-window manner. On the other hand, box-free methods bai2017deep ; kirillov2017instancecut ; liu2017sgn predict a class label and some auxiliary information for each image pixel, then use a clustering algorithm to group pixels into object instances. A major drawback of these methods is that the auxiliary information is usually uninterpretable, so the detection results cannot be obtained until the dense object masks are generated, which is sometimes unnecessary and expensive to compute.
3 Rethinking Object Representation
One of the basic components of object detection is the way of representing the existence of objects of interest in space and time. For 2D images, the axis-aligned minimum bounding box is a widely used representation that identifies an object by its approximate location and a very simple box descriptor of its shape and size. Despite advantages such as being easy to parameterize and annotate, the bounding box has a few drawbacks when used as the label that denotes the existence of an object: 1) It only captures the borders of the object along two directions, which is not representative if the object is rotated or not rectangular in shape. 2) It is sensitive to changes at the border of the object, meaning the box parameters may change dramatically even when the majority of the object's pixels are unchanged. 3) It cannot distinguish overlapping objects that have very similar bounding boxes, even though their masks may be very different.
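Drawback 3 can be made concrete with a small sketch. The toy masks below (two thin diagonal strokes) and the helper names `bounding_box`, `box_iou`, and `mask_iou` are illustrative assumptions, not part of the proposed method; the point is only that two very different masks can share exactly the same box.

```python
def bounding_box(pixels):
    """Axis-aligned minimum bounding box (x_min, y_min, x_max, y_max), inclusive."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def box_iou(a, b):
    """Intersection over union of two inclusive pixel boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    inter_h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(inter_w, 0) * max(inter_h, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def mask_iou(p, q):
    """Intersection over union of two pixel sets."""
    p, q = set(p), set(q)
    return len(p & q) / len(p | q)


# Two hypothetical objects: strokes along the two diagonals of a 9x9 patch.
size = 9
main_diagonal = [(i, i) for i in range(size)]
anti_diagonal = [(i, size - 1 - i) for i in range(size)]

same_box = box_iou(bounding_box(main_diagonal), bounding_box(anti_diagonal))
overlap = mask_iou(main_diagonal, anti_diagonal)
print(same_box, overlap)  # identical boxes, yet the masks share a single pixel
```

The two strokes differ only in orientation, which a box cannot express but a correlation coefficient can.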
On the other hand, another representation of 2D objects that has recently been used is the object mask, usually parameterized by a dense pixel matrix or a polygon. It gives a highly accurate shape of the object, but usually has too many parameters, which makes it hard to use as the label. For example, state-of-the-art proposal-based instance segmentation methods such as Mask R-CNN he2017mask first detect each object with a bounding box and then segment it, so the box is still responsible for denoting the existence of an object.
Considering the above two paradigms, our goal is to explore a new way of representing the existence of a 2D object that describes the object as fully as possible, making it distinguishable from other objects, with the smallest number of parameters.
3.1 2D Object Representation with Bivariate Normal Distribution
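We introduce a representation that uses the bivariate normal distribution to parameterize the visual existence of 2D objects in the scene. Specifically, for any densely or coarsely annotated object in the image, its annotation can be viewed as a set of pixels distributed in 2D space. We use the bivariate normal distribution to represent this set as
$$(X, Y) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$
where $X$ and $Y$ are the x-axis coordinates and y-axis coordinates of all the pixels that belong to the object, respectively.
Using maximum likelihood estimation, we can parameterize the distribution as
$$\boldsymbol{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix},$$
where $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$ are the means and standard deviations of $X$ and $Y$, respectively, and $\rho$ is the correlation coefficient of $X$ and $Y$.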
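As a minimal sketch of this fit, the five parameters (two means, two standard deviations, and the correlation coefficient) can be estimated from the pixel coordinates of an annotated object. The function name `fit_bivariate_normal` and the toy mask are assumptions made for illustration.

```python
import math


def fit_bivariate_normal(pixels):
    """Maximum-likelihood fit of a bivariate normal to (x, y) pixel coordinates.

    Returns (mu_x, mu_y, sigma_x, sigma_y, rho).
    """
    n = len(pixels)
    if n < 2:
        raise ValueError("need at least two pixels")
    mu_x = sum(x for x, _ in pixels) / n
    mu_y = sum(y for _, y in pixels) / n
    # Maximum-likelihood estimates divide by n rather than n - 1.
    var_x = sum((x - mu_x) ** 2 for x, _ in pixels) / n
    var_y = sum((y - mu_y) ** 2 for _, y in pixels) / n
    cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in pixels) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("degenerate object: zero variance along an axis")
    sigma_x, sigma_y = math.sqrt(var_x), math.sqrt(var_y)
    rho = cov_xy / (sigma_x * sigma_y)
    return mu_x, mu_y, sigma_x, sigma_y, rho


# Hypothetical elongated object lying along the main diagonal of a 10x10 patch.
tilted = [(x, y) for x in range(10) for y in range(10) if abs(x - y) <= 1]
print(fit_bivariate_normal(tilted))
```

For a tilted object such as this one, the estimated correlation coefficient is positive, which encodes the orientation that an axis-aligned box would discard.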
The above representation has several advantages compared with the 2D bounding box representation. Firstly, it is more precise with respect to the shape and rotation variation of general objects, especially for objects that are not rectangular in shape. Secondly, it gives a more robust indication of an object's existence in 2D space, as opposed to boxes, which vary significantly whenever the border of an object changes. Thirdly, it can handle some difficult situations such as distinguishing highly-overlapped objects, which will be discussed in detail in the next section. In addition, this representation uses only five parameters per object, one more than the bounding box but far fewer than dense pixel labeling.
3.2 Distinguishing Objects with Discrimination Information
For any successful representation of a certain kind of information, it is crucial that the representation minimizes the information loss at the time of recovery. For 2D objects, the most important information to preserve is the spatial existence of all annotated objects in a form that also keeps them distinguishable from each other (two different objects cannot share the same representation). This is also an important premise for many recent object detection approaches, namely those that output a smooth response map that triggers many imprecise object hypotheses: post-processing techniques such as non-maximum suppression can ideally remove the false positives and retain a single best detection for each object.
When parameterized distributions are used to encode objects, a natural way to distinguish them is to measure how one distribution differs from another. In this work, we use the Kullback–Leibler (KL) divergence between two parameterized distributions $P$ and $Q$ to quantify the difference between two objects. Note that the KL divergence is always non-negative, with $D_{KL}(P \,\|\, Q) = 0$ if and only if $P = Q$ almost everywhere, which is statistically unlikely to happen in natural scenes. In practice, for model optimization and inference, we use the symmetrified KL divergence $D_{SKL}(P, Q)$, which combines the two directed divergences $D_{KL}(P \,\|\, Q)$ and $D_{KL}(Q \,\|\, P)$,
in order to ensure that the measure is consistent regardless of the ordering of the two objects.
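For two bivariate normal distributions the KL divergence has a closed form, so the measure can be computed directly from the five parameters. The sketch below assumes that the symmetrified divergence is the sum of the two directed divergences (an averaged variant would differ only by a factor of 1/2); the function names and the parameter tuple layout are illustrative assumptions.

```python
import math


def covariance(sigma_x, sigma_y, rho):
    """Entries (a, b, d) of the covariance matrix [[a, b], [b, d]]."""
    return sigma_x ** 2, rho * sigma_x * sigma_y, sigma_y ** 2


def kl_divergence(p, q):
    """KL(P || Q) for bivariate normals given as (mu_x, mu_y, sigma_x, sigma_y, rho).

    Requires sigma > 0 and -1 < rho < 1 so that both covariances are invertible.
    """
    a0, b0, d0 = covariance(*p[2:])
    a1, b1, d1 = covariance(*q[2:])
    det0 = a0 * d0 - b0 * b0
    det1 = a1 * d1 - b1 * b1
    # trace(Sigma_q^{-1} Sigma_p), using the 2x2 inverse [[d1, -b1], [-b1, a1]] / det1
    trace = (d1 * a0 - 2 * b1 * b0 + a1 * d0) / det1
    # Mahalanobis distance between the means under Sigma_q
    dx, dy = q[0] - p[0], q[1] - p[1]
    maha = (d1 * dx * dx - 2 * b1 * dx * dy + a1 * dy * dy) / det1
    return 0.5 * (trace + maha - 2.0 + math.log(det1 / det0))


def symmetric_kl(p, q):
    """Symmetrified KL divergence, assumed here to be KL(P||Q) + KL(Q||P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Two hypothetical objects with the same center and spread but opposite tilt.
left_tilt = (0.0, 0.0, 1.0, 1.0, 0.8)
right_tilt = (0.0, 0.0, 1.0, 1.0, -0.8)
print(symmetric_kl(left_tilt, right_tilt))
```

Although the two example objects would produce near-identical bounding boxes, their divergence is well above zero because the correlation coefficients differ in sign.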
For object detection in the literature, intersection over union (IoU) is a major evaluation metric that measures the difference between two detections in the bounding box representation. Our proposed distribution representation with KL divergence not only shares some good properties with IoU, including invariance to the scale of the image and strictness about both the location and size of objects, but also has a few advantages. Firstly, KL divergence is fully differentiable and can be used directly in optimization. This eliminates one of the major drawbacks of IoU, namely that alternative ways must be found to optimize bounding box size and location. Secondly, it is able to handle edge cases where objects overlap and have very similar bounding boxes. This situation is very likely to occur in some real-world contexts, e.g. driving scenes, where crowds of pedestrians or vehicles are likely to contain highly overlapping objects. Fig. 2 shows some statistics obtained from the Cityscapes cordts2016cityscapes dataset, which consists of various driving scenes in urban areas. Our representation significantly improves the discrimination of highly-overlapped objects, reducing the number of failed object pair decouplings by over 70%.
4 Object Detection with Representation Modeling
To further illustrate the potential use of the proposed representation, we present a simple architecture for object detection. Similar to YOLO redmon2016you, the approach is a unified framework based on a single fully convolutional architecture. However, unlike most other detection work, we do not use bounding boxes, and the output is the distribution representations of objects. We also do not use region proposals or anchor boxes, which have proved useful in many works, because the proposed model aims to handle a wide variety of cases in object detection and thus makes no assumptions about object shape and size. We also extend the model to output dense object masks for cases where a fine object shape is needed.
4.1 Feature Extraction and Semantic Prediction
The whole model architecture can be viewed as a natural extension of any semantic segmentation model that gives a dense prediction for every pixel. For ease of implementation, we adopt DeeplabV3+ chen2018encoder as the base model in this work, which uses the Xception chollet2017xception backbone as the feature extractor. We keep the Deeplab segmentation branch as-is, and add another branch for class-agnostic object representation prediction on the shared feature map.
4.2 Representation Modeling with Mixture Density Networks
Using the bivariate normal distribution to represent an object $o$, we need to model the five parameters $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, $\rho$ (the means, standard deviations, and correlation coefficient). In order to make the prediction location-invariant, for each pixel $(x, y)$ that belongs to object $o$, the model predicts $\mu_x - x$, $\mu_y - y$, $\sigma_x$, $\sigma_y$, $\rho$ to form the distribution. This can be interpreted as the model estimating, at each pixel, the relative location, shape, size, and rotation of the object the pixel belongs to. The objective is to minimize the symmetrified KL divergence between the predicted distribution and the ground truth distribution, which is fully differentiable.
Although one can directly model the above characteristics for each pixel using a single convolution layer, a potential issue is the cost of computation if the image resolution is high. Following the common approach for dense pixel prediction, we consider making a down-scaled prediction and up-scaling it afterward. However, common up-scaling techniques (bilinear, nearest neighbor, etc.) cannot deal with the distribution parameters at object boundaries, where they are not continuous. For example, by averaging the parameters of two real object distributions, bilinear up-scaling generates unexpected values that may be read as the parameters of a third distribution, so that the boundary is predicted as another object. To solve this problem, we take the idea from Mixture Density Networks and model the target distribution with $K$ distribution candidates. For object $o$, the target distribution is modeled as
$$P_o = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
where $\pi_k$ is the likelihood value assigned to each distribution candidate $\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. We use the last convolution layer to predict $6K$ parameters, which are the 5 distribution parameters and 1 likelihood parameter for each of the $K$ distribution candidates. The intuition is to let the model predict multiple possible objects, and to assign the pixel to the object with the top likelihood. By converting a value regression problem into a classification problem, the object representation branch also fits the behavior of the base semantic segmentation architecture.
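The per-pixel decoding described above can be sketched as follows. Several details are assumptions made only for illustration and are not specified in the text: the channel order within each candidate, the use of pixel-relative offsets for the mean, the activations (exponential for the standard deviations, tanh for the correlation, softmax for the likelihoods), and the function name `decode_pixel`.

```python
import math


def decode_pixel(raw, x, y, num_candidates):
    """Decode the 6*K raw outputs at pixel (x, y).

    Returns the parameters of the top-likelihood candidate and all likelihoods.
    """
    if len(raw) != 6 * num_candidates:
        raise ValueError("expected 6 values per candidate")
    candidates, logits = [], []
    for k in range(num_candidates):
        dx, dy, log_sx, log_sy, raw_rho, logit = raw[6 * k: 6 * k + 6]
        candidates.append((
            x + dx, y + dy,                      # relative offset -> absolute mean (assumed)
            math.exp(log_sx), math.exp(log_sy),  # keeps standard deviations positive (assumed)
            math.tanh(raw_rho),                  # keeps correlation inside (-1, 1) (assumed)
        ))
        logits.append(logit)
    # Softmax over candidates, shifted by the max logit for numerical stability.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    likelihoods = [e / total for e in exps]
    best = max(range(num_candidates), key=likelihoods.__getitem__)
    return candidates[best], likelihoods


# Hypothetical raw output for K = 2 candidates at pixel (10, 20).
raw_output = [1.0, -2.0, 0.0, 0.0, 0.0, 0.1,
              5.0, 5.0, 0.0, 0.0, 0.0, 2.0]
selected, likelihoods = decode_pixel(raw_output, 10, 20, num_candidates=2)
print(selected, likelihoods)
```

Because the selection is an argmax over candidates rather than an average of parameters, a pixel on an object boundary is assigned to one of the competing objects instead of to a blend of the two.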
During training, the global loss function $L$ consists of three parts: the semantic segmentation loss $L_{seg}$, the representation loss $L_{rep}$, and the mixture density loss $L_{mdn}$:
$$L = L_{seg} + \lambda_1 L_{rep} + \lambda_2 L_{mdn},$$
where $\lambda_1$ and $\lambda_2$ are two weight parameters that balance the optimization. These parameters are related to the complexity of the scene: for a lower proportion of foreground objects to background, one may select higher values for $\lambda_1$ and $\lambda_2$.
For the semantic segmentation loss $L_{seg}$, we follow the common practice for semantic segmentation and use a per-pixel categorical cross-entropy loss over all the classes, with a binary mask that ignores the void classes and unlabeled regions.
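The representation loss $L_{rep}$ is calculated using the symmetrified KL divergence (eq. 6) between the predicted distribution representation and the ground truth distribution representation. Since the predicted representation is selected from the $K$ candidates of the mixture density network, we use a dynamic mask that selects two candidates to optimize at each pixel, the one with the highest likelihood and the one with the lowest divergence:
$$L_{rep} = \sum_{i} \sum_{k=1}^{K} m_i^k \, D_{SKL}\!\left(P_i, \hat{P}_i^k\right),$$
for $P_i$ being the ground truth distribution representation at pixel $i$ and $\hat{P}_i^k$ being one of the $K$ candidate predictions at pixel $i$. $m_i^k$ is the binary mask with $m_i^k = 1$ if $k = \arg\max_{k'} \pi_i^{k'}$ or $k = \arg\min_{k'} D_{SKL}(P_i, \hat{P}_i^{k'})$, where $\pi_i^k$ is the likelihood value for candidate $k$ at pixel $i$, and $m_i^k = 0$ otherwise.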
Lastly, the mixture density loss $L_{mdn}$ is also a categorical cross-entropy loss, computed on the likelihoods of the distribution candidates. The ground truth for which candidate is the best option is dynamically selected as the one with the lowest divergence, $k_i^{*} = \arg\min_{k} D_{SKL}(P_i, \hat{P}_i^k)$. Generally speaking, we let the mixture density network automatically find the best candidate based on its current state, and jointly optimize the best candidate and the currently selected candidate to be close to the ground truth. Pixels that do not belong to an object are ignored for $L_{rep}$ and $L_{mdn}$.
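The dynamic mask and the dynamically selected target can be sketched for a handful of foreground pixels as follows. The inputs are assumed to be precomputed divergences and likelihoods, the averaging over pixels is an assumption (the normalization is not specified in the text), and the function name is hypothetical.

```python
import math


def representation_and_mixture_losses(divergences, likelihoods):
    """Compute (representation loss, mixture density loss) over foreground pixels.

    divergences[i][k]: divergence between the ground truth at pixel i and candidate k.
    likelihoods[i][k]: predicted likelihood of candidate k at pixel i (rows sum to 1).
    Both losses are averaged over pixels, which is an assumption of this sketch.
    """
    if not divergences:
        raise ValueError("no foreground pixels")
    rep_total, mdn_total = 0.0, 0.0
    for div_row, pi_row in zip(divergences, likelihoods):
        indices = range(len(div_row))
        selected = max(indices, key=pi_row.__getitem__)  # highest likelihood
        best = min(indices, key=div_row.__getitem__)     # lowest divergence
        # Dynamic mask: only the selected and the best candidate are optimized.
        # A set avoids counting the candidate twice when both coincide.
        for k in {selected, best}:
            rep_total += div_row[k]
        # Cross-entropy with the best candidate as the target class.
        mdn_total += -math.log(pi_row[best])
    n = len(divergences)
    return rep_total / n, mdn_total / n


# Two hypothetical pixels with K = 2 candidates each.
divergences = [[0.5, 2.0], [3.0, 1.0]]
likelihoods = [[0.9, 0.1], [0.8, 0.2]]
rep_loss, mdn_loss = representation_and_mixture_losses(divergences, likelihoods)
print(rep_loss, mdn_loss)
```

At the second pixel the network currently prefers candidate 0 while candidate 1 is closer to the ground truth, so both contribute to the representation loss and the cross-entropy term pushes the likelihood toward candidate 1.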
4.4 Divergence-based Non-Max Suppression
Given the dense prediction of distribution representations of potential objects, we modify non-maximum suppression to use the symmetrified KL divergence in place of IoU, and use it to remove false positive detections and obtain the detected objects as distribution representations. In practice, we find that the threshold is invariant to object size but varies with object class. We thus use class-dependent divergence thresholds, where the class prediction is obtained from the semantic prediction.
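A greedy variant of this procedure might look like the sketch below. It takes a precomputed pairwise divergence matrix, and it assumes a per-candidate confidence score (for instance the mixture likelihood) and suppression only among candidates of the same class; the scores, the class names, and the threshold values are hypothetical. Note that the comparison is reversed relative to IoU: a low divergence indicates a duplicate.

```python
def divergence_nms(scores, divergence, classes, thresholds):
    """Greedy non-maximum suppression with a divergence in place of IoU.

    scores[i]: confidence of candidate i (assumed to be available).
    divergence[i][j]: symmetric divergence between candidates i and j.
    classes[i]: predicted class of candidate i.
    thresholds[c]: divergence below which two candidates of class c are duplicates.
    Returns the indices of the kept candidates, highest score first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        is_duplicate = any(
            classes[i] == classes[j] and divergence[i][j] < thresholds[classes[i]]
            for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


# Three hypothetical candidates: 0 and 1 describe the same car, 2 is another car.
scores = [0.9, 0.8, 0.7]
divergence = [
    [0.0, 0.1, 5.0],
    [0.1, 0.0, 5.0],
    [5.0, 5.0, 0.0],
]
classes = ["car", "car", "car"]
thresholds = {"car": 1.0}
kept = divergence_nms(scores, divergence, classes, thresholds)
print(kept)
```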
4.5 Instance Segmentation with Pixel Clustering
Since every pixel is predicted with a class label and an object representation, we simply cluster the pixels predicted as foreground object classes into instances by nearest neighbor, based on the divergence between each pixel's prediction and the objects detected after non-maximum suppression. In practice, we obtain object candidates on the down-scaled prediction for speed, and cluster the pixels at the original scale for accuracy. Note that the proposed algorithm aims to perform detection (which does not need fine object masks), but can output object masks as an alternative for evaluation purposes.
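The nearest-neighbor assignment can be sketched as below. The divergence is passed in as a callable so that the sketch stays independent of how it is computed; the squared distance between means used in the example is only a stand-in for the symmetrified KL divergence, and the function name and the label -1 for background are assumptions.

```python
def cluster_pixels(pixel_predictions, is_foreground, objects, divergence):
    """Assign each foreground pixel to the detected object with the lowest divergence.

    Returns one label per pixel: an index into `objects`, or -1 for background.
    """
    labels = []
    for prediction, foreground in zip(pixel_predictions, is_foreground):
        if not foreground or not objects:
            labels.append(-1)
            continue
        nearest = min(range(len(objects)),
                      key=lambda k: divergence(prediction, objects[k]))
        labels.append(nearest)
    return labels


def squared_distance(p, q):
    """Stand-in divergence for the example: squared distance between the means."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


# Two hypothetical detected objects and three pixels, the last one background.
objects = [(0.0, 0.0), (10.0, 10.0)]
pixel_predictions = [(1.0, 1.0), (9.0, 8.0), (5.0, 5.0)]
is_foreground = [True, True, False]
labels = cluster_pixels(pixel_predictions, is_foreground, objects, squared_distance)
print(labels)
```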
In this section, we describe the quantitative and qualitative experimental results obtained on the Cityscapes cordts2016cityscapes dataset. Cityscapes features 5000 images of ego-centric driving scenes in urban areas, split into 2975, 500 and 1525 images for training, validation and testing, respectively. The ground truth contains 8 classes of foreground objects (things) with instance-level annotation, and 11 classes of background (stuff). We choose this dataset because it covers a greater variety of scene complexity and has a higher proportion of highly complex scenes than other datasets geiger2012we ; lin2014microsoft. Since this work aims to solve difficult cases in object detection such as occlusion and overlap, it is necessary to be able to observe such cases in the dataset. Cityscapes also provides an environment similar to real-world applications, e.g. autonomous driving, which makes extreme demands on system performance and on the reliability with which edge cases are handled.
5.1 Implementation Details
We use the 2975 finely-annotated images for training. The model is initialized from pretrained DeeplabV3+ chen2018encoder semantic segmentation weights plus a randomly initialized object representation branch. The images are randomly cropped and flipped during training. Since the semantic segmentation branch is already well-trained, we use weighted sampling to focus on samples that have a higher proportion of instance labels. We set a learning rate at and train for 120k iterations on a desktop machine with a single Nvidia 1080Ti GPU at batch size 1. The evaluation is done on the testing set for instance segmentation. We use single-scale images at the original resolution during inference.
5.2 Quantitative Results
Most previous work on object detection is evaluated by average precision on bounding box IoU, which by nature cannot handle cases where the ground truth bounding boxes are highly overlapped. In order to present comparable quantitative results, we instead report instance segmentation results on the Cityscapes test set in Tab. 1. Our method achieves results competitive with state-of-the-art instance segmentation methods, even though it was not designed for dense mask prediction. We would like to highlight that our method performs much better on the person and car classes, where overlap is more likely to occur. For classes such as train or bus, which are much larger than other objects, our method has difficulty obtaining contextual information about the exact location and size of the objects, which is a common issue with single-scale models.
5.3 Qualitative Results
We visualize qualitative results to better illustrate the idea behind the proposed representation and method. Fig. 3 shows examples of detected objects with highly-overlapped bounding boxes, including both within-category and cross-category overlap. All of these cases are very likely to be failure cases for any box-dependent method. We claim that some of these situations can be critically important for real-world applications; e.g., for autonomous driving, partly-occluded vehicles in the driving path are essential to path planning and collision prevention. The proposed method may also help with other related computer vision tasks such as object tracking and motion prediction.
We also show visualizations of model performance on different driving scenes in Fig. 4. Our model provides fast and precise detection with low computational cost, and can also provide dense object masks as an extension.
We propose a statistical representation of objects for the object detection task based on the bivariate normal distribution. The qualitative evaluation shows that this representation has the benefit of robust detection of highly-overlapping objects and the potential for improved downstream tracking and instance segmentation tasks due to the statistical representation of object edges.
Future work will use this representation for improved instance segmentation in images and for temporal smoothing of both segmentation and tracking in video. Additionally, we hope that this work raises the question of whether bounding boxes are the most useful minimalist representation of objects in real-world detection tasks such as autonomous vehicle perception. For pedestrian, bicyclist, and vehicle detection, decoupling of overlapping objects may have elevated significance when used as part of explicit modeling of intent and trajectory prediction.
Support for this research was provided by Veoneer. The views and conclusions of authors expressed herein do not necessarily reflect those of Veoneer.
-  Seymour A Papert. The summer vision project. 1966.
-  Jacob Feldman. What is a visual object? Trends in Cognitive Sciences, 7(6):252–256, 2003.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7074–7082, 2017.
-  Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5230–5238, 2017.
-  Isaac Castro-Mateos, Jose M Pozo, Marco Pereañez, Karim Lekadir, Aron Lazary, and Alejandro F Frangi. Statistical interspace models (sims): application to robust 3d spine segmentation. IEEE transactions on medical imaging, 34(8):1663–1675, 2015.
-  Xiaodan Liang, Liang Lin, Yunchao Wei, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Proposal-free network for instance-level object segmentation. IEEE transactions on pattern analysis and machine intelligence, 40(12):2978–2991, 2018.
-  Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
-  Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892, 2019.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
-  Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2359–2367, 2017.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
-  Anurag Arnab and Philip HS Torr. Pixelwise instance segmentation with a dynamically instantiated network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 441–450, 2017.
-  Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990–1998, 2015.
-  Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, and Jian Sun. Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pages 534–549. Springer, 2016.
-  Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5221–5229, 2017.
-  Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. Instancecut: from edges to instances with multicut. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5008–5017, 2017.
-  Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3496–3504, 2017.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
-  Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
-  François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
-  Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
-  Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.