Multi-object detection is the task of finding multiple objects through bounding boxes with class information.
Since the breakthrough of deep neural networks (DNNs), multi-object detection has advanced considerably in both computational efficiency and accuracy, and it is now mature enough for real-world and industrial use.
There are several approaches to multi-object detection using DNNs. Most methods regress the bounding box coordinates and estimate the class probability by assigning each ground truth bounding box to specific anchors, which serve as references for the output bounding boxes. These ground-truth-assignment-based methods have become the mainstream of multi-object detection [ren2015fasterRCNN, dai2016r, liu2016ssd, he2017mask, lin2017feature, lin2017focal]. However, these methods have several problems. A ground truth bounding box must be assigned to the anchors at a specific location in the network's output, depending on its location, scale, and aspect ratio. Therefore, a large number of anchors with various scales and aspect ratios are needed to cover the four-dimensional space (xy position, height, and width, or top-left and bottom-right corners) in which a box can exist on an image. In addition, the scale and aspect ratio of the anchors affect detection performance. Thus, the anchors must be designed carefully, either heuristically or through a separate process [redmon2018yolov3, redmon2017yolo9000]. Moreover, a large number of anchors can cause the so-called positive-negative imbalance problem, where negative background samples outnumber the positive ones, making training difficult.
Recently, keypoint-based object detection methods inspired by bottom-up pose estimation have been proposed. Instead of learning the bounding box coordinates directly, these methods learn heat-maps for the points that constitute bounding boxes and derive the resulting bounding boxes from them. The keypoint-based methods solve many of the drawbacks of the ground-truth-assignment-based methods. Since they do not rely on anchors, the network design is simpler and the number of hyper-parameters is reduced. Currently, keypoint-based detectors show state-of-the-art detection performance. However, these methods have other drawbacks. Because they learn heat-maps rather than the coordinates themselves, the bounding box coordinates must be extracted from the estimated heat-maps in an additional step. Moreover, keypoint-based methods use hourglass-like networks [newell2016stacked] that require a relatively large amount of computation. When a different structure such as FPN is used, which requires a relatively small amount of computation, the detection performance becomes much lower than that of an hourglass-like structure [law2018cornernet]. This disadvantage limits the practicality of keypoint-based detection methods in real applications.
In this paper, we approach the multi-object detection task through density estimation. We propose a Mixture-Model-based Object Detector (MMOD) that captures the distribution of bounding boxes for an input image using a mixture model whose components consist of Gaussian and categorical distributions. For each component of the mixture model, the Gaussian represents the distribution of the bounding box coordinates, and the categorical distribution represents the class probability of that box. We also propose a process for sampling Regions of Interest (RoIs) from the estimated mixture model in order to learn a class probability that takes the background into account. In the training phase, the network is trained to maximize the log-likelihood of the mixture model for the ground truth bounding boxes and the sampled RoIs.
Through density estimation using a mixture model, our MMOD has the following advantages. First, unlike the ground-truth-assignment-based methods, the mixture components learn the location and class of objects through density estimation of bounding boxes, without ground truth assignment. Second, since the RoIs are sampled from the estimated mixture model that captures the ground truth bounding boxes, our MMOD is free from the positive-negative imbalance problem. Third, unlike the keypoint-based methods, our method does not need the extra process of extracting bounding boxes from heat-maps. It is also better suited to feature-pyramid-style networks, which are computationally more efficient. Finally, our proposed method achieves state-of-the-art detection performance among object detection methods of similar speed.
2 Related Works
Ground-truth-assignment-based methods: Object detection using DNNs has been studied extensively [girshick2014rich, he2015spatial, girshick2015fast]. Earlier studies such as Faster R-CNN [ren2015fasterRCNN] and SSD [liu2016ssd] attempted to cover the space of bounding boxes as fully as possible by using anchors in training. The problem is that the scale and aspect ratio of the anchors have a significant impact on detection performance [lin2017focal, ren2015fasterRCNN]. To resolve this, YOLOv2 [redmon2017yolo9000] and YOLOv3 [redmon2018yolov3] selected anchor types through k-means clustering. Subsequent studies removed anchors altogether [tian2019fcos, kong2019foveabox, zhu2019feature], generated anchor functions [yang2018metaanchor], or predicted anchor types [wang2019region, zhong2018anchor]. In addition, defining multiple anchors at every possible location causes the positive-negative imbalance problem. Early on, OHEM [shrivastava2016training] tackled this by constructing mini-batches from high-loss examples. Focal Loss [lin2017focal] tackled the problem by concentrating on the loss of hard examples. Other examples include [li2019gradient], which suggested the gradient harmonizing mechanism, [pang2019libra], which presented an effective hard mining method with Intersection over Union (IoU) balanced sampling, and [chen2019towards], which used AP-loss by redefining classification as a ranking task.
Keypoint-based methods: Recently, several studies have approached object detection without anchors by using the keypoint-based techniques of pose estimation [toshev2014deeppose, pishchulin2016deepcut, cao2017realtime]. CornerNet [law2018cornernet] used corner pooling to detect corners in the heat-maps and matched them using Associative Embedding [newell2017associative]. Later keypoint-based detectors [duan2019centernet, zhou2019objects, zhou2019bottom] further improved the performance. However, they all show their best performance with a specific backbone, the hourglass network [newell2016stacked], and are relatively slow due to the large amount of computation. There is also a top-down approach [yang2019reppoints] that uses deformable convolution [dai2017deformable] to find finer representative points.
Unlike previous methods, we perform multi-object detection by learning the distribution of bounding boxes for an image using a mixture model of Gaussian and categorical distributions. The proposed method needs neither a heuristic anchor design nor ground truth assignment, and it removes the positive-negative imbalance problem. Finally, our detection performance is superior to that of other methods with similar inference speed.
3 Mixture Model for Object Detection
The bounding box $b$ can be represented as a vector consisting of four coordinates $b_p$ representing the location (left-top and right-bottom corners) and a one-hot vector $b_c$ representing the corresponding class. The coordinates are uncertain due to occlusion, inaccurate labeling, and ambiguity of the object boundary [he2019bounding]. Thus, the distribution of $b_p$ can be considered a continuous distribution rather than a point mass. In multi-object detection, the conditional distribution of $b$ for an image $X$ may be multi-modal, depending on the number of objects in the input image. Therefore, our object detection network must be able to capture a multi-modal distribution. In this paper, we propose MMOD, which estimates this multi-modal distribution by extending the mixture density network [bishop1994mixture] to object detection. MMOD models the conditional distribution of $b$ for an image $X$ using a mixture model whose components consist of Gaussian and categorical distributions. The Gaussian and categorical distributions represent the distribution of the bounding box coordinates and the distribution of the class probability, respectively. The probability density function of this mixture model is defined by the estimated mixture parameters as follows:

$$p(b \mid X) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(b_p; \mu_k, \sigma_k) \, \mathcal{P}(b_c; p_k). \qquad (1)$$

Here, $\mathcal{N}$ and $\mathcal{P}$ denote the probability density function of the Gaussian distribution and the probability mass function of the categorical distribution, respectively. The parameters $\mu_k$, $\sigma_k$, and $\pi_k$ are the mean, standard deviation, and mixing coefficient of the $k$-th component among $K$ components. The $C$-dimensional vector $p_k$ is the probability for $C$ classes. We assume that the covariance matrix of each Gaussian is diagonal to prevent the model from being overly complicated. Each component is a four-dimensional multivariate Gaussian over the coordinates $b_p$ representing the bounding box. Thus, the multivariate Gaussian probability density function of each component of the mixture model can be factorized as follows:

$$\mathcal{N}(b_p; \mu_k, \sigma_k) = \prod_{d \in D} \frac{1}{\sqrt{2\pi}\,\sigma_{k,d}} \exp\left(-\frac{(b_{p,d} - \mu_{k,d})^2}{2\sigma_{k,d}^2}\right), \quad D = \{l, t, r, b\}. \qquad (2)$$

The objective of MMOD is to accurately estimate the parameters of the mixture model by maximizing the log-likelihood of the ground truth bounding box $b$, as follows:

$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{b \sim p_{data}(b \mid X)} \left[ \log p(b \mid X; \theta) \right]. \qquad (3)$$

Here, $p_{data}(b \mid X)$ is the empirical distribution of $b$ for a given $X$, and $\theta$ is the parameter vector that contains the mixture parameters, including the class probability $p$.
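To make the density concrete, the following is a minimal sketch of how the mixture density of a labeled box and the resulting negative log-likelihood could be evaluated. It is an illustration only, not the authors' implementation: the dictionary keys (`pi`, `mean`, `std`, `probs`) and the function names are hypothetical, and boxes are assumed to be (left, top, right, bottom) tuples.

```python
import math

def gaussian_diag_pdf(box, mean, std):
    """Density of a diagonal Gaussian over the four box coordinates."""
    density = 1.0
    for x, m, s in zip(box, mean, std):
        density *= math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)
    return density

def mixture_density(box, cls, components):
    """Sum over components of: mixing weight * box density * class probability.

    `components` is a list of dicts with the hypothetical keys
    'pi', 'mean', 'std' and 'probs'.
    """
    return sum(
        c["pi"] * gaussian_diag_pdf(box, c["mean"], c["std"]) * c["probs"][cls]
        for c in components
    )

def negative_log_likelihood(gt, components):
    """Average negative log-likelihood over (box, class) ground truth pairs."""
    return -sum(math.log(mixture_density(b, c, components)) for b, c in gt) / len(gt)
```

Maximizing the log-likelihood is equivalent to minimizing this quantity; the components are free to move toward any ground truth box because no component is tied to a particular box in advance.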
4 Mixture-Model-based Object Detector
Figure 2 shows the architecture of MMOD. The network produces four outputs, $o_\mu$, $o_\sigma$, $o_p$, and $o_\pi$, from the input feature-map concatenated with the coordinate embedding (xy-map). The xy-map holds the x and y coordinates along the spatial axes of the feature-map. The parameter maps of our mixture model, the $\mu$-map, $\sigma$-map, $p$-map, and $\pi$-map, are obtained from $o_\mu$, $o_\sigma$, $o_p$, and $o_\pi$, respectively. Each position on the spatial axes of the parameter-maps represents one mixture component.
The $\mu$-map is calculated from $o_\mu$ in three steps: the xy-limit operation is applied to $o_\mu$, the result is offset by the default coordinate $\bar{\mu}$, and the transformation function converts the outcome to corner form. The xy-limit operation, illustrated in Fig. 3, keeps the center coordinate of the bounding box from deviating much from the default coordinate $\bar{\mu}$. It is implemented by applying a bounded squashing function to the offset of the center coordinates (xy) in $o_\mu$ and multiplying the result by the limit factor $s_{lim}$. The default coordinates $\bar{\mu}$, which are similar to conventional anchors, represent the default center position and scale of the bounding box. Note that the first two channels (xy) of $\bar{\mu}$ are the xy-map of the corresponding size, and the last two channels (wh) are filled with a constant that depends on the size of the input feature from the feature pyramid. The transformation function converts coordinates represented by the center, width, and height to the left-top and right-bottom corners. In this paper, we set $s_{lim}$ equal to the spacing between adjacent $\bar{\mu}$'s on the $\bar{\mu}$-map (see Fig. 3). We use one aspect ratio and five scales of bounding boxes as our $\bar{\mu}$, depending on the layer in the feature pyramid (see Fig. 2). The width and height of $\bar{\mu}$ at each pyramid level are derived from the coordinate range, which is defined by the width and height of the input image. The $\sigma$-map is obtained by applying the softplus [dugas2001incorporating] activation to $o_\sigma$ and then adding the default std $\bar{\sigma}$. The $\bar{\sigma}$ is calculated as the width and height of $\bar{\mu}$ multiplied by a predefined std-factor $s_{std}$. The $\bar{\sigma}$ prevents the mixture model from becoming too sharp and $\sigma$ from collapsing to zero. Note that the $\sigma$-map is defined for the corner (left, top, right, bottom) coordinates, not for the center, width, and height coordinates. The $p$-map is obtained by applying the softmax function along the channel axis of $o_p$, and the $\pi$-map is obtained by applying the softmax jointly over all five spatial maps of $o_\pi$.
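The construction of the mean and standard deviation for a single mixture component can be sketched as follows. This is a hedged illustration under several assumptions that the text does not confirm: `tanh` is used as the bounded function in the xy-limit operation, the width and height outputs are added to the default width and height, and the default std for the (left, top, right, bottom) coordinates is taken as (w, h, w, h) times the std-factor. All function and parameter names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def center_to_corners(cx, cy, w, h):
    """Convert (center x, center y, width, height) to (left, top, right, bottom)."""
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

def component_mean(raw, default, limit_factor):
    """Mean box of one component from the raw output and the default coordinate."""
    dx, dy, dw, dh = raw
    cx0, cy0, w0, h0 = default
    # Assumption: tanh bounds the center offset to +/- limit_factor.
    cx = cx0 + limit_factor * math.tanh(dx)
    cy = cy0 + limit_factor * math.tanh(dy)
    # Assumption: width/height outputs are plain additive offsets.
    return center_to_corners(cx, cy, w0 + dw, h0 + dh)

def component_std(raw, default, std_factor):
    """Softplus of the raw output plus a default std tied to the default box size."""
    _, _, w0, h0 = default
    # Assumption: left/right use the default width, top/bottom the default height.
    default_std = (std_factor * w0, std_factor * h0, std_factor * w0, std_factor * h0)
    return tuple(softplus(r) + d for r, d in zip(raw, default_std))
```

Because the softplus output is positive and the default std is added on top, the standard deviation stays bounded away from zero whenever the std-factor is positive.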
Our network consists of one convolution layer with a 3×3 kernel and three convolution layers with 1×1 kernels. Leaky ReLU [maas2013leakyReLU] with a negative slope of 0.2 is used as the activation function of the 3×3 convolution layer. In this paper, we use RetinaNet's Feature Pyramid Network (FPN) [lin2017focal, lin2017feature] as the backbone network. MMOD estimates a single mixture model from all levels of feature-maps output by the FPN. Thus, the total number of components is the sum of the numbers of components of the parameter-maps corresponding to each feature-map. Here, the feature-map and the parameter-map at the same layer have the same spatial dimensions. The heights of the first and the $l$-th feature-maps are computed from the height of the input image, and the widths of the feature-maps are calculated in the same way.
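Since one mixture component sits at every spatial position of every pyramid level, the total number of components follows directly from the feature-map sizes. The sketch below counts them under assumptions that are not stated in the text: five levels, a first-level stride of 8 (as in a RetinaNet-style FPN), and each further level halving the previous size with rounding up. The function names are hypothetical.

```python
import math

def pyramid_sizes(height, width, first_stride=8, levels=5):
    """Spatial size (h, w) of each pyramid level, assuming ceil-halving per level."""
    sizes = []
    h = math.ceil(height / first_stride)
    w = math.ceil(width / first_stride)
    for _ in range(levels):
        sizes.append((h, w))
        h, w = math.ceil(h / 2), math.ceil(w / 2)
    return sizes

def num_components(height, width):
    """One mixture component per spatial position, summed over all levels."""
    return sum(h * w for h, w in pyramid_sizes(height, width))
```

Under these assumptions a 320×320 input yields levels of size 40, 20, 10, 5, and 3, so `num_components(320, 320)` is 1600 + 400 + 100 + 25 + 9.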
MMOD is trained using the negative log-likelihood of the estimated mixture model as the loss function, without any procedure such as ground truth (gt) assignment. Because the mixing coefficients of a mixture model sum to one, i.e., $\sum_{k} \pi_k = 1$, the estimated probability of the $i$-th gt bounding box $b_i$ for a given image $X$, $p(b_i \mid X)$, decreases by a factor of the number of objects in the image, $N_{gt}$. However, since object detection should be performed per object regardless of how many objects an image contains, this lower gt bounding box probability for a large $N_{gt}$ is undesirable. In this paper, we alleviate this problem through likelihood compensation, which multiplies $p(b_i \mid X)$ by the number of objects in the image, $N_{gt}$. Therefore, the loss function of MMOD for the $i$-th gt object becomes

$$\mathcal{L}(X, b_i; \theta) = -\log\left( N_{gt} \, p(b_i \mid X; \theta) \right). \qquad (4)$$
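The effect of likelihood compensation can be illustrated with a small sketch. It assumes the per-object likelihoods have already been evaluated and simply multiplies each one by the number of objects before taking the negative logarithm; the function name is hypothetical.

```python
import math

def compensated_nll(likelihoods):
    """Per-object loss with likelihood compensation.

    `likelihoods` holds the mixture density of each gt box in one image;
    each value is multiplied by the number of gt boxes before the log.
    """
    n_objects = len(likelihoods)
    return [-math.log(n_objects * p) for p in likelihoods]
```

For example, if one object in an image receives a density of 0.5, and two equally well fitted objects in another image each receive 0.25 because the probability mass is shared, the compensated loss per object is the same in both cases.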
Confidence score through RoI sampling: Note that the loss $\mathcal{L}$ is calculated only from the gt bounding boxes, $B_{gt}$. Since $B_{gt}$ generally does not include the background class, our model cannot provide a confidence score for a bounding box that takes the background probability into account. Instead, we could consider the likelihood of an arbitrary bounding box as a confidence score. However, a likelihood only expresses the density of a bounding box, not the probability that it is foreground or background or that it belongs to a certain class. In addition, a likelihood-based score makes it difficult to evaluate the model with a metric such as mean Average Precision (mAP), because only a relative comparison of boxes within one image is possible: it is hard to set a likelihood threshold that is universal to all images. Unlike the likelihood, $\pi$ represents the probability of the corresponding mixture component, but likewise, only comparisons between bounding boxes in the same image are possible. The class probability that accounts for the background is generally used as the confidence of a bounding box, and it does not suffer from the problems of the likelihood and $\pi$ mentioned above. To obtain a class probability that includes the background class, we perform an additional sampling and labeling process. We sample bounding boxes from the mixture of Gaussians (MoG) of our mixture model, ignoring the class probability. If the IoU between a sampled bounding box and a gt bounding box is above a threshold, we label the sample with the class of the gt box with the highest IoU; otherwise, we label it as the background class. Through this sampling and labeling process, we create the region of interest (RoI) set $B_{roi}$. Since $B_{roi}$ is stochastically acquired from the MoG estimated by MMOD, we do not suffer from class imbalance problems, and we can train the class probability by focusing more on the locations where objects are more likely to exist.
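The sampling and labeling procedure described above can be sketched as follows. This is a simplified illustration, not the authors' code: the IoU threshold of 0.5, the background label 0, the dictionary keys, and all function names are assumptions made for the example.

```python
import random

def iou(a, b):
    """IoU of two (left, top, right, bottom) boxes; degenerate boxes give 0."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_rois(components, count, rng):
    """Draw boxes from the mixture of Gaussians, ignoring class probabilities."""
    weights = [c["pi"] for c in components]
    chosen = rng.choices(components, weights=weights, k=count)
    return [tuple(rng.gauss(m, s) for m, s in zip(c["mean"], c["std"])) for c in chosen]

def label_rois(rois, gt_boxes, gt_classes, iou_threshold=0.5, background=0):
    """Label each RoI with the class of its best-matching gt box, else background."""
    labels = []
    for roi in rois:
        if not gt_boxes:
            labels.append(background)
            continue
        ious = [iou(roi, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=ious.__getitem__)
        labels.append(gt_classes[best] if ious[best] > iou_threshold else background)
    return labels
```

Because the samples are drawn from the model's own estimate, the share of positive RoIs grows as the mixture concentrates around the gt boxes, which is the mechanism behind the balanced sampling described in the text.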
Modified loss function: In order to train the network to represent the background probability using $B_{roi}$, we redefine the loss function of MMOD as two terms. The first loss term is the negative log-likelihood of the MoG:

$$\mathcal{L}_{MoG}(X, b_i; \theta) = -\log\left( N_{gt} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(b_{p,i}; \mu_k, \sigma_k) \right). \qquad (5)$$

Through $\mathcal{L}_{MoG}$, MMOD learns only the distribution of the coordinates of the ground truth bounding box, excluding class information, using the MoG parameters ($\mu$, $\sigma$, and $\pi$). The second loss term is the complete form of the MMOD loss that includes the class probability and is calculated as:

$$\mathcal{L}_{MM}(X, b^{roi}_j; \theta) = -\log\left( N_{roi} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(b^{roi}_{p,j}; \mu_k, \sigma_k) \, \mathcal{P}(b^{roi}_{c,j}; p_k) \right). \qquad (6)$$

$\mathcal{L}_{MM}$ is used to learn the class probability of the estimated mixture model. Note that $\mathcal{L}_{MM}$ is identical to (4) except that it is calculated on a different set of bounding box candidates, i.e., it is trained using $B_{roi}$ sampled from the estimated MoG. It is also trained such that the MoG is not relearned from its own samples. To this end, the error of $\mathcal{L}_{MM}$ is not propagated to any parameters of the mixture model other than the class probabilities. The final loss function is defined as:

$$\mathcal{L} = \mathcal{L}_{MoG} + \alpha \, \mathcal{L}_{MM}. \qquad (7)$$

Here, $\alpha$ is a hyper-parameter controlling the balance between the two terms.
In the inference phase, we choose the $\mu$'s of the mixture components as the coordinates of the predicted bounding boxes. We assume that these $\mu$'s are likely to lie near the local maxima (modes) of the mixture model estimated by MMOD. From the viewpoint of MoG-based clustering, we consider the $\mu$'s as representative values of the corresponding Gaussian clusters. Before performing non-maximum suppression (NMS), we filter out the mixture components with relatively low $\pi$ or $p$ values. Since the scale of $\pi$ depends on the input image, we filter the mixture components by the normalized $\pi$ ($\pi'$), which is calculated through min-max normalization with the minimum value fixed at zero:

$$\pi'_k = \frac{\pi_k}{\max_{j} \pi_j}. \qquad (8)$$
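The inference steps can be sketched as follows: normalize the mixing coefficients by their maximum, drop components whose normalized coefficient or class score is too low, and run NMS on the component means. This is a simplified sketch under assumptions: NMS is class-agnostic here, index 0 of the class vector is taken to be the background class, and the dictionary keys, thresholds, and function names are hypothetical.

```python
def iou(a, b):
    """IoU of two (left, top, right, bottom) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def normalized_pi(pis):
    """Min-max normalization with the minimum fixed at zero."""
    top = max(pis)
    return [p / top for p in pis]

def predict(components, pi_thr=0.001, score_thr=0.001, nms_thr=0.5, background=0):
    """Return (box, class, score) triples that survive filtering and NMS."""
    norm = normalized_pi([c["pi"] for c in components])
    candidates = []
    for comp, n_pi in zip(components, norm):
        # Best foreground class and its probability for this component.
        cls, score = max(
            ((i, p) for i, p in enumerate(comp["probs"]) if i != background),
            key=lambda item: item[1],
        )
        if n_pi > pi_thr and score > score_thr:
            candidates.append((tuple(comp["mean"]), cls, score))
    candidates.sort(key=lambda item: item[2], reverse=True)
    kept = []
    for cand in candidates:
        # Greedy NMS: keep a box only if it does not overlap a better one.
        if all(iou(cand[0], k[0]) <= nms_thr for k in kept):
            kept.append(cand)
    return kept
```

Dividing by the maximum makes the filter threshold comparable across images, since the raw mixing coefficients shrink as the number of active components grows.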
5.1 Details of Experiments
In our experiments, the MS COCO [lin2014microsoft] and Pascal VOC [everingham2015pascal] datasets are used. Gradient clipping [pascanu2013difficulty] is applied with a cutoff threshold of 7.0. An ImageNet [imagenet] pretrained ResNet-34, ResNet-50, or ResNet-101 [he2016resnet] is used as our backbone network, and the remaining layers of the network are initialized with the Xavier-uniform initializer [glorot2010understanding]. In order to generalize our network to various inputs, we augment the data as follows. First we adjust the contrast and brightness of the image, then perform the expanding and cropping process, and finally flip the data horizontally. This augmentation is applied randomly as specified in [liu2016ssd, fu2017dssd]. Unless otherwise specified in this section, we apply likelihood compensation, and the network is trained with the loss in (7) using RoIs sampled from the MoG. The size of $B_{roi}$, $N_{roi}$, is five times $N_{gt}$. Thus, the class probability that includes the background class is used as the confidence score of a bounding box. Two hyper-parameters of the method are set to 10.0 and 2.0, respectively. In the inference phase, we perform NMS with an IoU threshold of 0.5 after filtering the bounding boxes with a class probability threshold of 0.001 and a normalized-$\pi$ threshold of 0.001. We implement MMOD in PyTorch [paszke2017automatic].
By default we train the network on a single GPU, and we use six GPUs only for the network with the ResNet-101 backbone. In that case, in order to reduce the effect of per-GPU Batch Normalization [ioffe2015batch] statistics, we use Cross-GPU Batch Normalization [peng2018megdet].
5.2 Analysis of MMOD
For the analysis of our MMOD, we use MS COCO 'train2017' as the training set and 'val2017' as the test set. Input images are resized to 320×320, and ResNet-50 is used as the backbone network. The initial learning rate is 0.005. The learning rate is decayed at iterations 350k, 430k, and 470k with a decay rate of 0.1, and the network is trained for up to 500k iterations.
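The step schedule described above can be written as a small helper. This is a sketch of the stated schedule only; whether the decay takes effect exactly at the milestone iteration or one step later is an assumption, and the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.005,
                  milestones=(350_000, 430_000, 470_000), decay=0.1):
    """Step schedule: multiply the base rate by `decay` once per passed milestone."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * decay ** passed
```

With these values the rate is 0.005 until iteration 350k and is reduced by a factor of ten at each of the three milestones.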
Confidence measure: In section 4.2, we considered several types of confidence measure of a bounding box. We perform the quantitative comparison for the following confidence measures: , , , and . Here, and are normalized and , respectively. They are calculated by min-max normalization with zero min-value for the bounding boxes predicted on the same image. The results are measured from the network trained either by or . The of the network trained by contains the background probability. We compare the confidence measures through the F1 score and AP (the primary metric of MS COCO). The F1 score is calculated with an IoU threshold of 0.5, using the predicted bounding boxes of top confidence in each image. Thus, unlike AP, where alignment between all predicted bounding boxes of a dataset is important, only alignment between predicted bounding boxes in an image is required. Table 1 shows F1 scores and APs for different confidence measures. Compared with , shows better results for all kinds of confidence measures. All confidences based on show low F1 score and AP, and are considered inappropriate criteria for object detection. Confidences based on show better results than -based confidences. In addition, shows improved AP results over . In the network trained by , the F1 score is on the same level as with background probability. However, shows a lower result compared to . This is conjectured to result from the difficulty of comparison between ’s on different images. The without background shows a relatively low F1 score and AP, while with background shows the best results among all the confidence measures.
| Method | Backbone | Input size | AP | AP50 | AP75 | APS | APM | APL | Time |
|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [lin2017feature] | ResNet-101 FPN | short-800 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 | 172ms/M |
| Cascade R-CNN [cai2018cascade] | ResNet-101 FPN+ | - | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 | 140ms/P |
| Grid R-CNN [lu2019grid] | ResNet-101 FPN | short-800 | 41.5 | 60.9 | 44.5 | 23.3 | 44.9 | 53.1 | - |
| YOLOv3 [redmon2018yolov3, law2019cornerlite] | DarkNet-53 | 608x608 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 41.9 | 39ms/P |
| RefineDet320 [zhang2018single] | ResNet-101 TCB | 320x320 | 32.0 | 51.4 | 34.2 | 10.5 | 34.7 | 50.4 | - |
| RefineDet512 [zhang2018single] | ResNet-101 TCB | 512x512 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 | - |
| RetinaNet800 [lin2017focal, chen2019revisiting] | ResNet-101 FPN | short-800 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 | 104ms/P |
| FCOS [tian2019fcos] | ResNet-101 FPN | short-800 | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 | - |
| CornerNet [law2018cornernet] | Hourglass-104 | 511x511 (ori.) | 40.6 | 56.4 | 43.2 | 19.1 | 42.8 | 54.3 | 244ms/P |
| ExtremeNet [zhou2019bottom] | Hourglass-104 | 511x511 (ori.) | 40.2 | 55.5 | 43.2 | 20.4 | 43.2 | 53.1 | 322ms/P |
| CenterNet [duan2019centernet] | Hourglass-104 | 511x511 (ori.) | 44.9 | 62.4 | 48.1 | 25.6 | 47.4 | 57.4 | 340ms/P |
Positive-negative balance: Since we perform sampling from the estimated MoG, the sampled RoI set contains both positive and negative bounding boxes. In order to check the balance of positive and negative boxes in this set, we measure its positive ratio over 100 mini-batches of the training set every 50k iterations. The positive ratio is calculated for each image, except for images with no ground truth bounding box. In Figure 4, the positive ratio, which is initially low, increases as training progresses and converges to a certain value. This shows that no positive-negative imbalance problem occurs during training, so MMOD can be trained with a stable positive-negative ratio without any special processing.
Relation of uncertainty ($\sigma$) and confidence ($p$): The $\sigma$ estimates the uncertainty of the coordinates of a bounding box. In object detection problems, the confidence should reflect not only the class probability but also the uncertainty of the bounding box coordinates. In our method, the class probability $p$ is trained through the probability density function of the mixture model using RoIs sampled from the estimated MoG; thus $\sigma$ affects the training of $p$. To show how the confidence score changes with the uncertainty $\sigma$, we compare the distribution of the confidence for high and low uncertainty. As the uncertainty score, we use $\sigma$ scaled relative to the size of the bounding box (width or height). Among the predicted bounding boxes obtained from 300 randomly sampled images, the boxes with the highest uncertainty scores form the high-uncertainty set and those with the lowest uncertainty scores form the low-uncertainty set. As shown in Figure 5, for the low-uncertainty set, the confidence takes both high and low values. For the high-uncertainty set, on the other hand, the confidence is mostly distributed at very low values. This experiment shows that the confidence score reflects not only the class probability but also the uncertainty of the bounding box coordinates.
| Method | Backbone | Input size | #Boxes | mAP | FPS |
|---|---|---|---|---|---|
| Faster R-CNN [ren2015fasterRCNN, liu2016ssd] | VGG-16 | short-600 | 300 | 73.2 | 7 |
| RefineDet [zhang2018single] | VGG-16 TCB | 320x320 | 6375 | 80.0 | 40.3 |
| RefineDet [zhang2018single] | VGG-16 TCB | 512x512 | 16320 | 81.8 | 24.1 |
Likelihood compensation: In Table 2, we compare the object detection results on the MS COCO evaluation metrics according to whether likelihood compensation is performed. 'Without LC' and 'with LC' mean that MMOD is trained without and with likelihood compensation, respectively. In this table, the results 'with LC' are mostly better than those 'without LC'. In particular, likelihood compensation yields a noticeable improvement in the metrics for small objects and large objects.
Flexibility of the MMOD: The role of the default coordinate $\bar{\mu}$ in our MMOD is similar to that of anchors in that it represents pre-defined forms of bounding boxes. However, it differs in that MMOD learns flexibly, without assigning a gt bounding box to a specific mixture component during training. Because of this flexibility, the shape of the bounding boxes predicted by MMOD is not limited by the pre-defined forms. Table 3 shows the APs for two pre-defined $\bar{\mu}$ settings with different scales. '1-scale' is defined as half the size of the coordinate range, and '5-scales' follows the description of $\bar{\mu}$ in section 4.1. As can be seen in this table, the AP does not drop significantly even if fewer scales are used. In this experiment, MMOD shows relatively robust detection results across different settings of $\bar{\mu}$, compared to ground-truth-assignment-based methods that use anchors [ren2015fasterRCNN, redmon2017yolo9000].
Default std ($\bar{\sigma}$): The default std $\bar{\sigma}$, which is determined by the hyper-parameter std-factor $s_{std}$, smooths the distribution of the mixture model, preventing it from becoming too sharp around the given gt bounding boxes. To check how this affects network performance, we varied $s_{std}$. Table 4 shows the AP for various values of $s_{std}$. Setting $s_{std}$ to 0.1 yields the highest AP. In addition, when $s_{std}$ is zero, training becomes unstable because $\sigma$ is more likely to approach zero, which results in a spikier likelihood.
Ablation study: MMOD has architecture components that each play a specific role in the intermediate feature-map. In this experiment, we changed each component of the MMOD architecture one by one to see its effect. The results are shown in Table 5. The MMOD that uses all architecture components shows the best performance. It is noteworthy that performance is significantly degraded when neither the xy-map nor the default coordinate, both of which carry spatial information, is used.
5.3 Evaluation result comparison
MS COCO: To evaluate MMOD, we compare it with other object detection methods on the MS COCO dataset. We use MS COCO 'train2017' as the training set and 'test-dev2017' for evaluation. The network is trained according to the details specified in section 5.2. The processing time of MMOD is measured on a single NVIDIA GeForce 1080Ti GPU. Table 6 reports the MS COCO evaluation results and inference time for MMOD and other methods. In this table, all of our proposed models show an inference time of less than 40ms. In particular, MMOD320 with ResNet-50 shows the fastest inference speed. The MMOD320 models outperform RefineDet320 even with lighter backbones. MMOD512 yields lower detection performance than the keypoint-based methods (CornerNet, ExtremeNet, and CenterNet), but its inference is more than six times faster. These results show that our method is more competitive than other methods in terms of the trade-off between speed and performance.
Pascal VOC: We also evaluate on Pascal VOC, another representative object detection dataset. In the experiments on Pascal VOC, we use Pascal VOC '0712trainval' (the union of the Pascal VOC 07 and 12 trainval sets) as the training set and Pascal VOC '07test' for evaluation. The initial learning rate is 0.003. The learning rate is decayed at iterations 40k and 70k with a decay rate of 0.1, and the maximum number of training iterations is set to 100k. For a fair comparison, the processing time of MMOD is measured on a single NVIDIA Titan X (Maxwell) GPU. Table 7 shows the mAP results and FPS for MMOD and other object detection methods. In this table, MMOD320 with the ResNet-34 backbone is the fastest model, and MMOD512 produces the best mAP. When both mAP and FPS are considered, our models show better results than the other methods. Also, our method predicts a relatively small number of bounding boxes compared to the other methods, except for the RPN (Region Proposal Network) based methods and YOLOv2, which finds anchors through clustering.
In this paper, we proposed a new multi-object detector named the Mixture-Model-based Object Detector (MMOD). Unlike previous multi-object detection methods, our MMOD estimates the density of bounding boxes for an input image using a mixture model. To capture this distribution correctly, we proposed a mixture model whose components consist of Gaussian and categorical distributions. MMOD does not need the ground truth assignment process, and it needs a relatively small number of mixture components compared to the number of candidate boxes in most object detection methods. Also, the positive-negative imbalance problem does not occur, owing to our RoI sampling process. In addition to these advantages, MMOD shows state-of-the-art detection performance among fast object detectors. Our method is a new approach to object detection and has high potential for further research and development.