Learning Instance-Aware Object Detection Using Determinantal Point Processes

05/28/2018 ∙ by Nuri Kim, et al. ∙ Seoul National University

Recent object detectors find instances while categorizing candidate regions in an input image. As each region is evaluated independently, the number of candidate regions from a detector is usually larger than the number of objects. Since the final goal of detection is to assign a single detection to each object, an additional algorithm, such as non-maximum suppression (NMS), is used to select a single bounding box for an object. While simple heuristic algorithms, such as NMS, are effective for stand-alone objects, they can fail to detect overlapped objects. In this paper, we address this issue by training a network to distinguish different objects while localizing and categorizing them. We propose an instance-aware detection network (IDNet), which learns to extract features from candidate regions and measure their similarities. Based on pairwise similarities and detection qualities, the IDNet selects an optimal subset of candidate bounding boxes using determinantal point processes (DPPs). Extensive experiments demonstrate that the proposed algorithm performs favorably compared to existing state-of-the-art detection methods, particularly for overlapped objects, on the PASCAL VOC and MS COCO datasets.






1 Introduction

Object detection is one of the fundamental problems in computer vision. Its goal is to locate objects that belong to a set of target categories in an image girshick2014rich ; girshick2015fast ; ren2015faster ; redmon2016you ; redmon2016yolo9000 ; liu2016ssd . It has received a lot of attention because of its wide range of applications such as object tracking andriluka2008people , surveillance tian2005robust , and face detection. Most of the state-of-the-art detectors show significant performance improvements based on a deep convolutional neural network.

Despite the advances in object detection, it is still difficult to assign correct detections to all objects in an image, since detectors do not distinguish different object instances of the same class: they focus only on an instance-agnostic task, i.e., object category classification. This issue becomes critical when objects overlap. As shown in Figure 1, the person in the striped shirt is not detected because of the overlapping bounding boxes in proximity.

Figure 1: Detection results of Faster R-CNN and IDNet. The category labels are rearranged for the best view. (a) Results with overlapped objects. (b) Results with multiple detections of different categories for the same object.

In order to address this issue, we develop a method which can compare appearances of bounding boxes while considering their spatial arrangements. It is in line with how a human perceives proximity and similarity to distinguish object instances koffka2013principles . The goal of this paper is to find the most representative set of bounding boxes by extracting features of object instances, which combine visual differences and spatial positions, in addition to object classification. We propose an instance-aware detection network (IDNet), which learns to differentiate instances of objects. IDNet uses an existing detector, such as Faster R-CNN, as a component to obtain candidate bounding boxes. Given candidate boxes, IDNet extracts features for all candidates using a CNN branch, named a region identification network (RIN), which aims to increase the probability of selecting an optimal subset. To this end, IDNet is trained not only with classical losses of existing detectors, such as a classification loss and a bounding box regression loss, but also with novel losses based on determinantal point processes (DPPs) kulesza2012determinantal . Using the property that DPPs can describe the repulsiveness of the fermion system in quantum physics kulesza2012determinantal , we design an instance-aware detection loss (ID loss), which learns to increase the probability of selecting an optimal subset. Additionally, we address the problem of multiple bounding boxes on a single object. For example, as shown in Figure 1, there are two bounding boxes categorized as a sheep and a cow for the same object. Since the objective of a detector is to find a single bounding box for a single object instance, we propose a sparse-score loss (SS loss) to make IDNet assign a single bounding box to a single object, considering all categories. In particular, we formulate a loss to suppress falsely categorized bounding boxes by optimizing the weights of IDNet to give low confidence scores to bounding boxes with incorrect class labels.

Since DPPs involve calculations of determinants, the use of DPPs as a loss function to train deep neural networks introduces numerical challenges. We address this problem by scaling detection quality scores. Then, we formulate an optimization problem to select a subset of detections, which is composed of representative bounding boxes. After training, our algorithm efficiently finds an optimal set of detections using the log-submodular property of DPPs kulesza2012determinantal . Experimental results show that IDNet performs favorably against state-of-the-art detectors such as Faster R-CNN ren2015faster and LDDP azadi2017learning on the PASCAL VOC everingham2010pascal and MS COCO lin2014microsoft datasets. In the ablation study, we demonstrate that our method is more robust for detecting overlapped objects, achieving a 22.3% improvement over Faster R-CNN for PASCAL VOC and a 34.9% improvement for MS COCO in detection recall.

2 Related Work

Class-aware detection algorithms.

The goal of class-aware or multi-class object detection methods is to localize objects in an image while predicting the category of each object. These systems are usually composed of region proposal networks and region classification networks girshick2015fast ; ren2015faster ; liu2016ssd . To improve detection accuracy, a number of different optimization formulations and network architectures have been proposed ren2015faster ; kong2016hypernet ; azadi2017learning ; redmon2016you ; liu2016ssd ; redmon2016yolo9000 ; dai2016r . Ren et al. ren2015faster use convolutional networks, called region proposal networks, to get region proposals and combine them with Fast R-CNN. Kong et al. kong2016hypernet concatenate each layer's features to construct the final feature for detecting small objects in an image. A real-time multi-class object detector is proposed by combining region proposal networks and classification networks in redmon2016you . Liu et al. liu2016ssd improve the performance of redmon2016you using multiple detectors for each convolutional layer. To increase network efficiency, fully connected layers are replaced by convolution layers in dai2016r . Redmon et al. redmon2016yolo9000 extend redmon2016you by classifying thousands of categories using the hierarchical structure of categories in the dataset. DPPs have been used to improve detection qualities before. Azadi et al. azadi2017learning propose to suppress background bounding boxes using DPPs. However, this method focuses on adjusting background detection scores and uses a fixed visual similarity matrix from WordNet, while our algorithm learns the similarity matrix from data.

Instance-aware algorithms.

Instance-aware methods have been developed to provide finer solutions in different problem domains. Instance-aware segmentation aims to label instances at the pixel level dai2016instance ; ren2017end . Dai et al. dai2016instance propose a cascade network which finds each instance stage by stage. Similar to RIN, a network in dai2016instance finds features of each instance. Ren et al. ren2017end use a recurrent neural network to sequentially find each instance. A face detector which takes key points of faces as an input is suggested in li2016face . The dataset for this application contains face labels for identifying each face, while the standard object detection datasets only have a small number of categories. In object detection, Lee et al. lee2016individualness provide an inference method to find an optimal subset for binary-class detection considering the individualness of each candidate box. However, their approach is limited to a single-class detection problem. Besides, instead of training networks, they use features computed from a network pre-trained on the ImageNet dataset deng2009imagenet . The proposed method tackles a challenging multi-class detection task by learning distinctive features of object instances.

3 Proposed Method

As shown in Figure 2, IDNet is composed of VGG16 for image feature extraction, a region proposal network (RPN), a region classification network (RCN) and a region identification network (RIN) (see the detailed structure of RIN in Appendix D). Based on image feature maps from VGG16, RPN determines whether objects exist in the regions of interest (RoIs). Then, RCN proposes candidate boxes while locating and classifying them. RIN computes instance features of candidates, which are used by DPPs.

3.1 Determinantal Point Processes for Detection

Suppose that there are $N$ candidate bounding boxes, $\mathcal{Y} = \{b_1, \ldots, b_N\}$, where $b_i$ is the $i$th bounding box. A determinantal point process (DPP) defines a probability distribution over subsets of $\mathcal{Y}$ as follows kulesza2012determinantal . If $\mathbf{Y}$ is a DPP, then

$$P_L(\mathbf{Y} = Y) = \frac{\det(L_Y)}{\det(L + I)}, \tag{1}$$

where $Y \subseteq \mathcal{Y}$, a kernel matrix $L \in \mathbb{R}^{N \times N}$ is a real symmetric positive semi-definite matrix, an indexed kernel matrix $L_Y$ is the submatrix of $L$ indexed by the elements of $Y$, and $I \in \mathbb{R}^{N \times N}$ is an identity matrix. The kernel matrix $L$ can be decomposed as $L = V V^\top$, where $V$ is a feature matrix for the candidate bounding boxes with each row extracted from RIN. Similar to the kernel matrix, the indexed kernel matrix can be decomposed as $L_Y = V_Y V_Y^\top$.
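As a minimal sketch of the standard L-ensemble probability $\det(L_Y) / \det(L + I)$, the following NumPy snippet evaluates it in log space and checks that the probabilities of all subsets sum to one. The function name `dpp_log_prob` and the toy feature matrix `V` are illustrative assumptions, not part of IDNet.

```python
from itertools import combinations

import numpy as np


def dpp_log_prob(L, subset):
    """Log-probability of `subset` under an L-ensemble DPP with kernel L."""
    L = np.asarray(L, dtype=float)
    idx = np.asarray(subset, dtype=int)
    if idx.size == 0:
        # The determinant of the empty submatrix is 1 by convention.
        log_det_sub = 0.0
    else:
        sign, log_det_sub = np.linalg.slogdet(L[np.ix_(idx, idx)])
        if sign <= 0:
            # Singular (or numerically non-positive) submatrix: zero probability.
            return -np.inf
    # Normalization term log det(L + I).
    _, log_det_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return log_det_sub - log_det_norm


# Toy kernel L = V V^T (hypothetical features): boxes 0 and 1 are nearly
# identical, box 2 is different from both.
V = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
L = V @ V.T

# Probabilities over all 2^3 subsets should sum to one.
total = sum(
    np.exp(dpp_log_prob(L, s))
    for r in range(4)
    for s in combinations(range(3), r)
)
```

In this toy example the diverse pair `[0, 2]` receives a higher probability than the near-duplicate pair `[0, 1]`, which is the repulsive behavior the detection losses rely on.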

Figure 2: The training procedure of the instance-aware detection network (IDNet). The dashed arrow is only used for calculating the forward pass.

Let $s_i$ be the detection score for the $i$th bounding box $b_i$. We first scale the detection score to lie between 0 and 1 by using $\bar{s}_i = (s_i - s_{\min}) / (s_{\max} - s_{\min})$, where $s_{\min}$ and $s_{\max}$ are the minimum and maximum possible values of the detection score $s_i$, respectively. Let $q_i$ be the detection quality of $b_i$; it is a rescaled version of the normalized score, chosen to avoid numerical issues during training. (Naive logit scores or normalized scores in $[0, 1]$ might cause numerical overflow or underflow while calculating determinants, particularly when there are many detection candidates.) Let $\mathbf{q}$ be the detection quality for all detection candidates. The feature $f_i$ for $b_i$ is extracted from the last layer of RIN, and $\bar{f}_i$ denotes the normalized feature. Using the candidate bounding boxes, the intersection over union between $b_i$ and $b_j$ can be calculated as $\mathrm{IoU}_{ij} = |b_i \cap b_j| / |b_i \cup b_j|$, and these values are collected into a matrix. A similarity matrix $S$ is then constructed by combining the visual similarity given by the normalized features with the spatial similarity given by the IoU matrix. Using the detection quality vector $\mathbf{q}$ and the similarity matrix $S$, the kernel matrix for a DPP can be formed as $L = S \odot \mathbf{q}\mathbf{q}^\top$, where $\odot$ is a Hadamard product. The notations in this paper are summarized in Appendix A.
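The construction of a quality-weighted kernel can be sketched as follows. This is only an illustration under stated assumptions: the similarity is taken to be a convex combination of the IoU matrix and the cosine similarity of the features, with a hypothetical mixing weight `spatial_weight`; the exact combination used by IDNet is not reproduced here. The helper names `pairwise_iou` and `build_kernel` are likewise made up for this example.

```python
import numpy as np


def pairwise_iou(boxes):
    """IoU matrix for boxes given as rows (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / union


def build_kernel(features, boxes, quality, spatial_weight=0.6):
    """Kernel L = S * (q q^T) (element-wise) with an ASSUMED similarity S."""
    features = np.asarray(features, dtype=float)
    quality = np.asarray(quality, dtype=float)
    # Row-normalize the features so that their Gram matrix is cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    visual = normed @ normed.T
    spatial = pairwise_iou(boxes)
    # Assumption: convex combination of spatial and visual similarity.
    S = spatial_weight * spatial + (1.0 - spatial_weight) * visual
    # Hadamard product with the outer product of the quality vector.
    return S * np.outer(quality, quality)


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
quality = [0.9, 0.8, 0.7]
L = build_kernel(features, boxes, quality)
```

With this choice the diagonal of the kernel equals the squared qualities, and off-diagonal entries grow with both spatial overlap and feature similarity.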

If the similarity $S$ and detection qualities q are correctly assigned, a subset which maximizes (1) is a collection of the most distinctive detections due to the property of the determinant in a DPP kulesza2012determinantal . Since IDNet is trained to maximize the probability (1) of the ground-truth detections, IDNet learns the most distinctive features and correctly scaled detection scores to separate different object instances in order to correctly compute $S$ and q.

3.2 Learning Detection Quality

As RCN classifies each RoI into all categories, the number of candidate boxes is equal to the number of RoIs multiplied by the number of categories. As there are multiple bounding boxes with different categories for a RoI, multiple classes often have detection scores higher than a certain threshold. For example, a detector may report a horse bounding box near a cow, as they are visually similar. Conventional methods, such as NMS, typically suppress bounding boxes within each class. In this case, even if there is a true bounding box for the cow, the horse bounding box cannot be suppressed. To alleviate this issue, we refine the scores of the top-$k$ bounding boxes, i.e., the $k$ bounding boxes with the highest detection scores. We assume that the categories of the top-$k$ bounding boxes are visually similar to the correct category. By suppressing the scores of the visually similar categories, we can obtain a single bounding box with the correct category for an object.

Let $\mathcal{Y}_s$ be the union of all top-$k$ bounding boxes from all RoIs and $Y_s^{+} \subseteq \mathcal{Y}_s$ be a set of positive boxes, i.e., detected bounding boxes which are closest to the ground truth bounding boxes with correct class labels. Then, we define the SS loss as the negative log-likelihood of (1) as follows:

$$\mathcal{L}_{SS} = -\log P_L(\mathbf{Y} = Y_s^{+}) = -\log\det(L_{Y_s^{+}}) + \log\det(L_{\mathcal{Y}_s} + I). \tag{2}$$

This loss function increases the detection scores of bounding boxes in the positive set, $Y_s^{+}$. In other words, this loss suppresses the scores of all subsets which have at least one non-positive bounding box. We would like to note that the normalization term for a DPP is included for numerical stability during learning.
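To illustrate how a DPP negative log-likelihood responds to detection quality, the sketch below evaluates it for a fixed toy similarity matrix and two hypothetical quality vectors. The function `dpp_nll`, the matrix `S` and the quality values are assumptions chosen for illustration; they are not taken from the paper's experiments.

```python
import numpy as np


def dpp_nll(L, positive_idx):
    """Negative log-likelihood -log det(L_Y) + log det(L + I) of a subset Y."""
    L = np.asarray(L, dtype=float)
    idx = np.asarray(positive_idx, dtype=int)
    sign, logdet_pos = np.linalg.slogdet(L[np.ix_(idx, idx)])
    if sign <= 0:
        raise ValueError("the positive submatrix must be positive definite")
    _, logdet_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_norm - logdet_pos


# Toy similarity: boxes 0 and 1 cover the same object, box 2 is another object.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
positives = [0, 2]

# Uniform qualities versus qualities that favor the positive boxes.
q_uniform = np.array([1.0, 1.0, 1.0])
q_sparse = np.array([2.0, 0.5, 2.0])

loss_uniform = dpp_nll(S * np.outer(q_uniform, q_uniform), positives)
loss_sparse = dpp_nll(S * np.outer(q_sparse, q_sparse), positives)
```

Raising the quality of the positive boxes while lowering that of the redundant box reduces the loss, which matches the stated intent of suppressing non-positive boxes.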

We also use classification and regression losses for training RPN and RCN, similarly to Faster R-CNN ren2015faster . Suppose each of RPN and RCN outputs a probability vector $p$ over the categories. The classification loss ($\mathcal{L}_{cls}$) and the regression loss ($\mathcal{L}_{reg}$) are calculated as follows:

$$\mathcal{L}_{cls}(p, c) = -\log p_c, \qquad \mathcal{L}_{reg}(t^c, t^*) = \sum_{i \in \{x, y, w, h\}} \ell(t_i^c - t_i^*), \tag{3}$$

where $c$ is the true class, $t^c$ is the predicted location shift for the $c$th class, $t^*$ is the target location shift, and $\ell$ is a combination of L1 and L2 losses as defined in girshick2015fast . The regression loss is not applied to the background category. Since the only difference between the RPN loss and the RCN loss is the number of categories, both can be expressed in the form of (3). See ren2015faster for more details about the classification and regression losses.
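The combination of L1 and L2 losses defined in Fast R-CNN is the smooth L1 function. The sketch below shows it together with a per-box detection loss; the unweighted sum of the two terms and the use of index 0 for the background class are assumptions made for this illustration, and `detection_loss` is a hypothetical helper name.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def detection_loss(probs, true_class, pred_shift, target_shift):
    """Classification loss plus box regression loss for a single box.

    Assumes class 0 is the background, for which no regression loss is applied.
    """
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:
        return cls_loss
    reg_loss = sum(smooth_l1(p - t) for p, t in zip(pred_shift, target_shift))
    return cls_loss + reg_loss


background = detection_loss([0.5, 0.5], 0, [1, 1, 1, 1], [0, 0, 0, 0])
foreground = detection_loss([0.5, 0.5], 1, [0.5, 0.0, 0.0, 3.0], [0, 0, 0, 0])
```

The quadratic region keeps gradients small for nearly correct boxes, while the linear region limits the influence of large regression errors.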

The weights for VGG16, RPN and RCN (see Figure 2) are learned by jointly minimizing the RPN loss, the RCN loss and the SS loss.


3.3 Learning Instance Differences

An instance-agnostic detector solely based on object category information often fails to detect objects in proximity. For accurate detections from real-world images with frequent overlapping objects, it is crucial to distinguish different object instances. To address this problem, we propose an instance-aware detection loss (ID loss). The objective of this loss function is to obtain similar features from the same instance and different features from different instances. This is done by maximizing the probability of a subset of the most distinctive bounding boxes.

Let $\mathcal{Y}_d$ be a set of all candidate bounding boxes which intersect with the ground truth bounding boxes. Let $Y_d^{+} \subseteq \mathcal{Y}_d$ be a set of the most representative boxes, i.e., candidate boxes which are closest to the ground truth boxes. Then, the ID loss for all objects is defined as follows:

$$\mathcal{L}_{ID}^{all} = -\log\det(L_{Y_d^{+}}) + \log\det(L_{\mathcal{Y}_d} + I). \tag{5}$$

Due to the determinant, it increases the cosine distance between $\bar{f}_i$ and $\bar{f}_j$ if $b_i$ and $b_j$ are from different instances. As we select boxes near the ground truth bounding boxes to construct $\mathcal{Y}_d$, the network can learn which bounding boxes are similar or different.
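To see why the determinant rewards dissimilar features, consider a worked two-box case. It assumes a kernel of the form $L = S \odot \mathbf{q}\mathbf{q}^\top$ with unit self-similarity ($S_{ii} = S_{jj} = 1$); the latter is an assumption made for this illustration.

```latex
\det
\begin{pmatrix}
q_i^2 & q_i q_j S_{ij} \\
q_i q_j S_{ij} & q_j^2
\end{pmatrix}
= q_i^2 q_j^2 \left(1 - S_{ij}^2\right)
```

Under these assumptions the determinant, and hence the probability of selecting both boxes, is largest when $S_{ij} = 0$ and vanishes as $S_{ij} \to 1$, so maximizing it for boxes of different instances pushes their similarity down.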

In addition to (5), we set an additional objective which focuses on differentiating instances from the same category given $\mathcal{Y}_d^{c}$, the candidate boxes in the $c$th category, and $Y_d^{c+}$, the representative boxes for the ground truth boxes in the $c$th category. The intra-class loss is defined as follows:

$$\mathcal{L}_{ID}^{c} = -\log\det(L_{Y_d^{c+}}) + \log\det(L_{\mathcal{Y}_d^{c}} + I). \tag{6}$$

It provides an additional guidance signal to train the network since it is more difficult to distinguish similar instances from the same category than instances from different categories. Bounding boxes for a particular category, $\mathcal{Y}_d^{c}$, are illustrated in Figure 5. Then we construct the final loss by adding the two losses, with the intra-class loss summed over every category,

$$\mathcal{L}_{ID} = \mathcal{L}_{ID}^{all} + \sum_{c} \mathcal{L}_{ID}^{c}. \tag{7}$$

The goal of the ID loss is to find all instances while discriminating different instances as shown in Figure 1. Given a set of candidate bounding boxes and subsets of them, the weights of RIN (see Figure 2) are learned by minimizing the ID loss. The gradients of the SS loss and ID loss are derived in Appendix B.


3.4 Inference

Given a set of candidate bounding boxes, the similarity matrix $S$ and the detection quality q, Algorithm 1 (IDPP) finds the most representative subset of bounding boxes, using two thresholds. The problem of finding an optimal subset is NP-hard because normalizing probabilities of a finite point process has the complexity of $O(2^N)$, where $N$ is the number of candidate bounding boxes. Fortunately, due to the log-submodular property of DPPs kulesza2012determinantal , we can approximately solve the problem by using a greedy algorithm, such as Algorithm 1, which iteratively adds an index of a detection candidate until no remaining candidate makes the determinant of the new subset higher than that of the current subset azadi2017learning .

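Algorithm 1: Instance-Aware DPP Inference (IDPP). Inputs: the similarity matrix $S$, the detection quality q, and further inputs including two thresholds. The body is a while loop containing a conditional step that deletes an element from a set.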
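The greedy idea described above can be sketched generically as follows. This is not a reproduction of Algorithm 1: its exact conditions and thresholds are not recoverable here, so the single stopping parameter `min_gain` and the function name `greedy_dpp_map` are assumptions, and the toy similarities and qualities are made up.

```python
import numpy as np


def greedy_dpp_map(L, min_gain=1.0):
    """Greedily add the item that most increases det(L_Y).

    Stops when no remaining item raises the determinant above
    `min_gain` times its current value (a hypothetical stopping rule).
    """
    L = np.asarray(L, dtype=float)
    remaining = list(range(L.shape[0]))
    selected = []
    current_det = 1.0  # determinant of the empty selection
    while remaining:
        best_j, best_det = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:
                best_j, best_det = j, det
        if best_det <= min_gain * current_det:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        current_det = best_det
    return selected


# Boxes 0 and 1 are near-duplicates of one object; box 2 is a second object.
S = np.array([[1.00, 0.95, 0.05],
              [0.95, 1.00, 0.05],
              [0.05, 0.05, 1.00]])
q = np.array([3.0, 2.5, 2.0])
L = S * np.outer(q, q)
selected = greedy_dpp_map(L)
```

Note that with qualities below one, a single box would already lower the determinant relative to the empty set, so this stopping rule only behaves sensibly when the quality scores of good detections are scaled above one.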

4 Experiments

We evaluated IDNet on the standard datasets: PASCAL VOC everingham2010pascal and MS COCO lin2014microsoft . Since IDNet is, to our knowledge, the first identity-aware detection network, we compare our algorithm with the baseline methods, Faster R-CNN ren2015faster and LDDP azadi2017learning . Since the goal of our algorithm is to discriminate instances given candidate bounding boxes, we adopt Faster R-CNN as a proposal network to get candidate detections. Additionally, we do not use the SS loss during the early stage of training, since the detection scores are still very inaccurate at that point and the top-$k$ categories do not yet contain similar categories. The number of iterations for the early stage is found by a grid search. For fair comparison, the number of training iterations for adjusting scores is the same as the number of iterations required to train the other detectors.

For the inference method, we report results from three algorithms. First, NMS can be applied to all detectors described earlier. Second, LDPP, the inference method used in LDDP azadi2017learning , is applied to Faster R-CNN and LDDP. Third, IDPP (Algorithm 1) is applied to the proposed algorithm. Note that IDPP cannot be applied to other detectors as they do not have a module to extract features of instances. The detailed parameter settings for the implementation are in Appendix C.

4.1 Results

PASCAL VOC

We train the network with the VOC2007 and VOC0712 sets and test on the VOC2007 test set. The VOC2007 dataset has 5,011 images for training and 4,952 images for testing with 20 object categories. The VOC0712 train set is the union of the VOC2007 trainval set and the VOC2012 trainval set, which has 16,551 images. The performance was evaluated with the mean average precision (mAP), which is the average of the AP of all categories. Each AP is calculated by averaging the precisions at 11 uniformly spaced recall levels.
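The 11-point averaging can be sketched as below, following the usual VOC2007 convention of taking, at each recall level in {0, 0.1, ..., 1.0}, the maximum precision achieved at that recall or higher. The function name and the toy precision-recall points are assumptions for illustration and are not drawn from the reported results.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision over a precision-recall curve."""
    total = 0.0
    for k in range(11):
        level = k / 10.0
        # Interpolated precision: best precision at recall >= level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11.0


# Toy curve: precision 1.0 up to recall 0.5, then 0.5 at full recall.
ap = eleven_point_ap([0.5, 1.0], [1.0, 0.5])
```

For this toy curve, six recall levels (0 through 0.5) see precision 1.0 and the remaining five see 0.5, giving an AP of 8.5 / 11.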


| Network | Inference | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN ren2015faster | NMS | 71.4 | 70.4 | 78.2 | 69.7 | 58.9 | 56.9 | 79.5 | 83.0 | 84.3 | 53.3 | 78.6 | 64.5 | 81.7 | 83.7 | 76.1 | 77.9 | 45.4 | 70.5 | 66.7 | 74.3 | 73.3 |
| Faster R-CNN ren2015faster | LDPP | 71.1 | 72.1 | 77.6 | 67.8 | 58.5 | 54.9 | 79.0 | 80.1 | 85.5 | 53.8 | 79.9 | 64.0 | 81.7 | 83.7 | 76.7 | 78.0 | 45.0 | 70.9 | 66.7 | 74.0 | 73.0 |
| LDDP azadi2017learning | NMS | 70.5 | 69.7 | 78.6 | 69.2 | 55.0 | 54.4 | 77.0 | 82.7 | 82.6 | 52.0 | 78.7 | 66.0 | 81.7 | 83.3 | 75.3 | 77.9 | 44.5 | 69.7 | 66.0 | 73.2 | 72.2 |
| LDDP azadi2017learning | LDPP | 70.5 | 71.6 | 78.4 | 67.2 | 55.9 | 52.9 | 76.8 | 79.9 | 83.5 | 51.4 | 79.5 | 65.1 | 82.1 | 83.6 | 75.6 | 77.9 | 44.9 | 71.0 | 66.3 | 73.7 | 72.6 |
| IDNet | NMS | 71.5 | 70.1 | 78.1 | 67.8 | 56.9 | 56.2 | 82.5 | 82.1 | 83.2 | 56.1 | 81.2 | 66.0 | 81.9 | 84.3 | 76.7 | 78.5 | 42.3 | 70.3 | 65.7 | 76.2 | 73.9 |
| IDNet | IDPP | 72.2 | 70.2 | 79.5 | 70.1 | 58.0 | 55.6 | 81.1 | 83.5 | 84.2 | 56.2 | 81.3 | 64.8 | 83.0 | 84.1 | 77.3 | 80.4 | 43.6 | 72.9 | 66.9 | 76.9 | 73.7 |

Table 1: Results on VOC2007 test set (trained with VOC2007 trainval).


| Network | Inference | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN ren2015faster | NMS | 75.8 | 77.2 | 84.1 | 74.8 | 67.3 | 65.5 | 82.0 | 87.4 | 87.9 | 58.7 | 81.5 | 69.8 | 85.0 | 85.1 | 77.7 | 79.2 | 47.2 | 75.4 | 71.8 | 82.3 | 75.8 |
| Faster R-CNN ren2015faster | LDPP | 76.1 | 77.7 | 82.5 | 75.1 | 66.1 | 65.2 | 82.9 | 88.1 | 87.3 | 59.6 | 82.2 | 70.6 | 85.4 | 86.1 | 80.7 | 79.1 | 48.3 | 76.5 | 71.1 | 83.2 | 75.1 |
| LDDP azadi2017learning | NMS | 75.9 | 77.3 | 81.5 | 74.4 | 65.9 | 64.9 | 84.8 | 87.2 | 86.7 | 60.4 | 80.9 | 70.8 | 85.3 | 84.9 | 77.1 | 79.0 | 47.9 | 76.0 | 72.6 | 83.4 | 77.5 |
| LDDP azadi2017learning | LDPP | 76.4 | 76.9 | 83.0 | 75.0 | 66.5 | 64.3 | 83.4 | 87.5 | 87.7 | 61.2 | 81.5 | 70.0 | 86.0 | 84.9 | 81.9 | 83.3 | 48.6 | 75.7 | 72.3 | 82.6 | 76.5 |
| IDNet | NMS | 76.0 | 78.4 | 79.6 | 74.2 | 63.1 | 66.7 | 84.5 | 87.7 | 85.9 | 60.8 | 84.8 | 70.2 | 85.2 | 85.4 | 79.2 | 79.2 | 46.4 | 77.0 | 74.1 | 81.6 | 76.4 |
| IDNet | IDPP | 76.8 | 78.8 | 83.4 | 74.4 | 64.0 | 66.9 | 83.5 | 87.8 | 87.1 | 61.1 | 84.6 | 70.5 | 85.6 | 85.2 | 80.7 | 83.1 | 47.0 | 79.0 | 73.1 | 83.2 | 76.2 |

Table 2: Results on VOC2007 test set (trained with VOC0712 trainval).

For the VOC2007 train set, we set the number of iterations for the early stage to 40k; for VOC0712, we set it to 70k. Then, we train RIN to learn differences between instances with the ID loss for 30k and 20k iterations, respectively. As Faster R-CNN and LDDP do not have a module to extract the feature of each bounding box, we use LDPP, the inference method proposed in LDDP azadi2017learning , for them. LDPP uses a class-wise similarity matrix while IDPP uses the features extracted from RIN.

As shown in Table 1, the NMS results of IDNet show that the SS loss effectively suppresses a number of candidate boxes while leaving the correct boxes. As the number of categories is small, the number of similar categories is even smaller, which explains the marginal performance improvement. When we test the networks with several post-processing methods, such as NMS and LDPP, we observe the following results. For the VOC2007 train set, Faster R-CNN with NMS has an mAP of 71.4%, LDDP with LDPP has an mAP of 70.5% and IDNet with IDPP has an mAP of 72.2%. The proposed algorithm outperforms Faster R-CNN with NMS by 0.8% mAP. For the VOC0712 train set, Faster R-CNN with NMS has an mAP of 75.8%, LDDP with LDPP has an mAP of 76.4% and IDNet with IDPP has an mAP of 76.8%, as shown in Table 2. The overall trends for the VOC0712 train set are similar to those for VOC2007, with a 1.0% mAP improvement over Faster R-CNN with NMS. Due to space constraints, we visualize result images in Figure 9. Additionally, to measure the impact of the ID loss with respect to overlap ratios, we evaluate the performance of IDNet on test images with overlapped objects. The experimental results show that the performance gap in recall between Faster R-CNN with NMS and IDNet with IDPP increases as the overlap ratio increases. For VOC, the recall of overlapped objects that have IoU more than 0.4 is 71.3% for the proposed method while Faster R-CNN reports 58.3%, a relative improvement of 22.3% (Table 8).
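The 22.3% figure appears to be a relative rather than an absolute gain in recall, as the following arithmetic on the two reported recall values suggests:

```latex
\frac{71.3 - 58.3}{58.3} = \frac{13.0}{58.3} \approx 0.223
```

In absolute terms, the same comparison corresponds to a gain of 13.0 percentage points.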


| Network | Inference | AP @ IoU 0.5-0.95 | AP @ IoU 0.5 | AP @ IoU 0.75 | AP @ Area S | AP @ Area M | AP @ Area L | AR, # Dets 1 | AR, # Dets 10 | AR, # Dets 100 | AR @ Area S | AR @ Area M | AR @ Area L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN ren2015faster | NMS | 26.2 | 46.6 | 26.9 | 10.3 | 29.3 | 36.4 | 25.5 | 38.1 | 39.0 | 17.9 | 44.0 | 55.7 |
| Faster R-CNN ren2015faster | LDPP | 26.2 | 46.5 | 26.9 | 10.2 | 29.3 | 36.6 | 24.8 | 37.0 | 37.9 | 15.7 | 42.5 | 54.9 |
| LDDP azadi2017learning | NMS | 26.4 | 46.8 | 26.9 | 10.5 | 29.4 | 36.7 | 25.7 | 38.5 | 39.4 | 18.2 | 44.6 | 56.4 |
| LDDP azadi2017learning | LDPP | 26.4 | 46.7 | 26.8 | 10.5 | 29.4 | 36.8 | 25.0 | 37.4 | 38.4 | 16.0 | 43.1 | 55.3 |
| IDNet | NMS | 27.0 | 47.3 | 27.9 | 10.7 | 29.7 | 37.7 | 25.9 | 38.4 | 39.3 | 18.2 | 44.0 | 56.6 |
| IDNet | IDPP | 27.3 | 47.6 | 28.2 | 10.9 | 30.1 | 38.0 | 25.9 | 39.4 | 40.6 | 18.6 | 45.1 | 58.9 |

Table 3: Results on COCO 2014 validation set (mean AP and mean AR). All networks are trained with COCO 2014 train set.

Microsoft COCO

We carry out experiments with 82,783 images in the train set and 40,504 images in the validation set, which is used for testing, with 80 object categories. The number of iterations for the early stage is set to 360k. After adjusting scores, we train RIN for 20k iterations. As shown in Table 3, we evaluate different algorithms with twelve performance metrics. Average precision at IoU [.5, .95] averages AP over 10 IoU thresholds sampled uniformly from 0.5 to 0.95. This is the primary challenge metric in COCO detection evaluations. The proposed algorithm achieves 27.3% mAP @ IoU [.5, .95] on the validation set, higher than the other methods. mAP at IoU = 0.5 is the same metric as in VOC. AP at a given IoU threshold counts a predicted box as correctly detected when its overlap with the ground truth box is greater than the threshold. Metrics with area measure AP for different scales of objects. As recall is higher when there are a large number of predicted boxes, the mAR metrics constrain the number of detections per image. The mean average recall, mAR, is the maximum recall for each category given a fixed number of detections. We see that our algorithm has comparable results on all performance metrics. Additionally, as the COCO dataset has a larger number of categories, the SS loss improves the performance from 26.2% mAP to 27.0% mAP, which is a bigger improvement than that for VOC. This result indicates that the SS loss has the potential to lead to higher performance improvements when it is applied to large-scale detection datasets with a large number of categories. We visualize detection results in Figure 10 and Figure 11. The performance with respect to different overlap ratios is shown in Table 9, which shows a recall of 63.4%, while Faster R-CNN reports 47%, for the overlapped objects that have IoU more than 0.4. This is a relative improvement of 34.9%.
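The averaging over IoU thresholds can be sketched in a few lines. The helper `coco_style_ap` and the per-threshold AP values below are hypothetical and serve only to show the mechanics; they are not results from Table 3.

```python
# Ten IoU thresholds from 0.5 to 0.95 in steps of 0.05.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_at_threshold):
    """Average a mapping {IoU threshold: AP} over the ten thresholds."""
    return sum(ap_at_threshold[t] for t in iou_thresholds) / len(iou_thresholds)


# Hypothetical per-threshold AP values that decay as the threshold tightens.
example_ap = {t: 1.0 - t for t in iou_thresholds}
mean_ap = coco_style_ap(example_ap)
```

Because stricter thresholds usually yield lower AP, the averaged metric is typically well below AP at IoU = 0.5, as is also visible in Table 3.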

4.2 Ablation Study

To analyze the influence of each loss, we conduct several ablation studies. Table 4 shows the results. We test the proposed method with two post-processing methods. Since IDPP uses the features trained with the ID loss, we substitute IDPP with LDPP for the ablation experiments that do not use the ID loss (the rows with LDPP inference in Table 4). As shown in Table 4, the performance with NMS slightly increases to 71.5% mAP for the VOC2007 train set and 76.0% for the VOC0712 train set as we add the SS loss. We see that the SS loss is effective not only for DPP inference methods but also for NMS, because it keeps the precision while reducing redundant detections. We note that when we use the parameters in the paper azadi2017learning , most of the results with LDPP inference are lower than the results with NMS. The performance of IDNet trained with the ID loss is 71.9% mAP for VOC2007 and 76.7% mAP for VOC0712. This indicates that the ID loss, which learns the differences between bounding boxes, is critical for the performance improvement. Using both the SS loss and the ID loss achieves 72.2% mAP for VOC2007 and 76.8% mAP for VOC0712. The detailed analyses are given below.


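| SS loss | ID loss | Inference | VOC2007 | VOC0712 |
|---|---|---|---|---|
| x | x | NMS | 71.4 | 75.8 |
| x | x | LDPP | 71.1 | 76.1 |
| o | x | NMS | 71.5 | 76.0 |
| o | x | LDPP | 70.4 | 75.8 |
| x | o | NMS | 71.3 | 75.8 |
| x | o | IDPP | 71.9 | 76.7 |
| o | o | NMS | 71.5 | 76.0 |
| o | o | IDPP | 72.2 | 76.8 |

Table 4: Ablation results (mAP) on the VOC2007 test set. "o" marks a loss that is used and "x" a loss that is not.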

Effect of sparse-score loss.

As stated in Section 3.2, a detector often finds falsely categorized bounding boxes. The SS loss is introduced to alleviate this problem. Specifically, in our experimental setting, the SS loss suppresses all bounding boxes except for the top-1 bounding box. To validate the loss, we extract the top-5 bounding boxes having detection scores over a fixed threshold (set to 0.01) for each RoI. When a predicted box overlaps with the ground truth box by an IoU of 0.5 or more, we consider it a correct box. For each category, we compute the ratio of the number of correct boxes among the top-5 bounding boxes to the number of top-5 bounding boxes with that class label. Figure 3 shows that the proposed IDNet achieves superior performance in terms of correctly detected bounding boxes among the top-5 bounding boxes compared to other methods. On average, IDNet achieves 43.7% while Faster R-CNN has 32.4% and LDDP has 32.9% for COCO. (For VOC 2007, IDNet achieves 68.9% while Faster R-CNN has 61.0% and LDDP has 60.5%, as shown in Figure 7.) The images with scores are visualized in Figure 8, showing that the SS loss successfully suppresses bounding boxes having wrong classes.

Figure 3: Visualization of the impact of the SS loss for IDNet (trained with the COCO train set and tested with the COCO validation set). The class labels are sampled for the best view.

Effect of instance-aware detection loss.

Figure 4: Graphs showing the impact of ID loss as a function of the overlap ratio on COCO dataset.

Table 5 gives the total number of objects in the datasets and the number of overlapped objects within the same category depending on the degree of overlap. When counting the objects which have IoU over 0.3, there are only 719 objects (6.0% of all objects) for the VOC2007 test set and 16,512 objects (5.7% of all objects) for the COCO validation set. Since IDNet is more effective for overlapped objects, the small number of overlapped bounding boxes in the datasets is the reason behind the marginal improvement over other methods. To further evaluate our method, we experiment with only overlapped objects. We report the probability of finding objects among the overlapped objects in Table 8. We count overlapped objects using the ground truth object boxes when they have the same class label. Then, we check whether there are detected bounding boxes for those overlapped objects. After calculating the probability in each category, the results are averaged over categories. Since there is a small number of highly overlapped objects in the datasets that have IoU more than 0.6, the overlap ratio of 0.6 includes all objects with IoU larger than or equal to 0.6. For the overlapped objects in all overlap ratios, the probability of detecting objects is higher than that of Faster R-CNN with LDPP and LDDP with LDPP. Figure 4 demonstrates that IDNet with IDPP successfully detects overlapped objects compared to existing instance-agnostic detectors. Compared with Faster R-CNN, the detection probability is increased from 58.2% to 62.7% for COCO. (For VOC, the detection probability is increased from 72.2% to 78.9%, as shown in Figure 6.) This result shows that the ID loss is critical for detecting objects in proximity.

More results on ablation studies are in Appendix E.2 and the failure case studies are in Appendix F.2.


| Overlap | [0.0, 1.0] | (0.0, 0.1] | (0.1, 1.0] | (0.2, 1.0] | (0.3, 1.0] | (0.4, 1.0] | (0.5, 1.0] | (0.6, 1.0] |
|---|---|---|---|---|---|---|---|---|
| VOC2007 | 12032 | 6061 | 3026 | 1439 | 719 | 359 | 167 | 84 |
| COCO | 291874 | 183657 | 77749 | 35633 | 16512 | 7590 | 3248 | 1291 |

Table 5: Number of objects in VOC2007 test set and COCO validation set in the overlap ranges.

5 Conclusion

We have introduced IDNet which tackles two challenges in object detection: detecting overlapped objects and suppressing falsely categorized bounding boxes. By introducing two novel losses using determinantal point processes, we have demonstrated that the proposed method is effective for detecting overlapped objects and suppressing falsely categorized bounding boxes while maintaining correctly detected bounding boxes.

Appendix A Notations

We summarize the notations for DPPs used in this paper in Table 6.

| Notation | Definition | Description |
|---|---|---|
| RoIs | - | Region of interest boxes which are proposed from RPN. |
| b | - | Candidate bounding boxes which are proposed from RCN. |
| $\mathrm{IoU}_{ij}$ | | Intersection over union (IoU) of two bounding boxes. |
| $q_i$ | | A rescaled score. |
| $\bar{f}_i$ | | Normalized feature of a bounding box $b_i$. |
| $S_{ij}$ | | Similarity between box $b_i$ and $b_j$. |
| $L$ | | Kernel matrix of DPPs. |

Table 6: Notations in this paper.

Appendix B Gradient of losses

For notational convenience, we assume that the zero-padded matrix $[A_Y]$ has the same dimension as $A$, with its entries corresponding to $Y$ copied from $A$ while the remaining entries are filled with zero, for any matrix $A$ and indices $Y$.
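As a hedged illustration of how gradients of this kind arise, the following sketch assumes a loss of the negative log-likelihood form $\ell = -\log\det(L_Y) + \log\det(L + I)$ with $L = S \odot \mathbf{q}\mathbf{q}^\top$ and symmetric $S$. The symbol $G$ and the bracket $[\cdot]$ (zero padding to the size of $L$) are illustrative choices and not necessarily the notation used in the derivations of this appendix.

```latex
\begin{aligned}
G &:= \frac{\partial \ell}{\partial L}
   = (L + I)^{-1} - \big[(L_Y)^{-1}\big], \\
\frac{\partial \ell}{\partial S}
  &= G \odot \mathbf{q}\mathbf{q}^\top, \\
\frac{\partial \ell}{\partial \mathbf{q}}
  &= 2\,(G \odot S)\,\mathbf{q}.
\end{aligned}
```

The first line uses the derivative of the log-determinant for a symmetric matrix, and the remaining lines follow from $L_{ab} = S_{ab}\, q_a q_b$ by the chain rule, treating the entries of $S$ as independent variables.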

B.1 Gradient of Instance-Aware Detection Loss

Here, we show the gradient over the normalized feature ($\bar{f}$). As the derivative of the log-determinant is $\partial \log\det(A) / \partial A = (A^{-1})^\top$, the derivative of the intra-class ID loss is as follows:


where is the Frobenius inner product, is the Hadamard product, and is the th category. Note that is the number of categories. Since we only calculate the gradient of the ID loss with respect to the similarity feature (), the derivative of is as follows:


where . Using the property that , where are arbitrary matrices, we can derive this:


Viewing the matrix element-wise,


Since the gradient of is similar to the gradient of , we omit its derivation. We can then construct the gradient of the ID loss by summing (12) over all batches and categories as follows:


B.2 Gradient of Sparse-Score Loss

The derivation of the gradient of the sparse-score loss is similar to that of the instance-aware detection loss, except that the gradient of the sparse-score loss is derived with respect to the quality (q). The derivative of the sparse-score loss is as follows:




the final derivative is this:


Appendix C Implementation Details

The detailed settings of IDNet are as follows. Our model has three hyper-parameters that need to be tuned: the ratio between spatial similarity and visual similarity used to construct the kernel matrix of DPPs (), the dimensionality of the extracted feature , and the starting point of training with the SS loss. These hyper-parameters are found through a grid search on the validation dataset. The search ranges are [0.2; 0.7] for , [128; 1024] for the dimensionality of , and, for the starting point of the SS loss, [30k; 50k] on the PASCAL VOC dataset and [300k; 400k] on the COCO dataset. We set to 0.6 and the feature dimension to 256 for all experiments. Once the hyper-parameters are tuned, we train the model on the whole training set and evaluate it on the test set. We choose to use 0.25, 4, 5 and 0.001 for all experiments, because they are empirically the best parameters. The learning rate is set to 0.001, and the SS loss and the ID loss are multiplied by 0.01 to balance them with the classification loss (negative log probability loss) and the regression loss ( loss). Other details are the same as in chen2017implementation. As in the original Faster R-CNN, we flip the input image horizontally for data augmentation. For all experiments, we use a VGG network as the region proposal part of the detectors. IDNet is implemented in TensorFlow and optimized with stochastic gradient descent. The parameters of IDNet, except those of the RIN module, are initialized from an ImageNet pre-trained model deng2009imagenet. We run the experiments using an NVIDIA TITAN X graphics card for the PASCAL VOC 2007 and 2012 datasets and an NVIDIA TITAN Xp graphics card for the COCO dataset.
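The loss balancing described above can be sketched as a small function. This is only an illustration under stated assumptions: it treats the "starting point" of the SS loss as a training step after which the term is switched on, and the function and parameter names (`total_loss`, `ss_start_step`, `aux_weight`) are hypothetical. The 0.01 weight is the value given in the text; the exact starting step is not stated there, only its search range, so it is left as an argument.

```python
def total_loss(cls_loss, reg_loss, id_loss, ss_loss, step, ss_start_step, aux_weight=0.01):
    """Combine the four losses, scaling the ID and SS terms by `aux_weight`.

    Assumption: the SS loss contributes only once `step` reaches
    `ss_start_step`; before that, only the ID term is added.
    """
    total = cls_loss + reg_loss + aux_weight * id_loss
    if step >= ss_start_step:
        total += aux_weight * ss_loss
    return total


# Hypothetical loss values, before and after the SS loss is switched on.
early = total_loss(1.0, 0.5, 2.0, 4.0, step=10, ss_start_step=100)
late = total_loss(1.0, 0.5, 2.0, 4.0, step=100, ss_start_step=100)
print(early, late)
```

Keeping the starting step as an explicit argument mirrors the fact that it is one of the three hyper-parameters tuned by grid search.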

For training IDNet with determinantal point processes (DPPs), it is important to carefully select the most representative subset of candidate bounding boxes (). To help understanding, we show examples of in Figure 5.

Figure 5: Visualization of for an image which contains the horse and person categories. Each set has the most representative, i.e., closest to the ground truth, bounding boxes, where each box captures a different instance.

Appendix D Network Architecture

RIN consists of three fully connected layers, three max-pooling layers, one RoI-pooling layer, and nine convolutional layers, where the first two convolutional layers are shared with the VGG16 network (Table 7). Each convolutional and fully connected layer except the last layer is followed by batch normalization and a rectified linear unit (ReLU), in that order. All convolutional layers use filters with a size of 3×3 pixels and a stride of one.

Layer Type Parameter Filter size Remark
0 Convolution 3x3x3x64 3x3 Shared w/ VGG16
1 Convolution 64x3x3x64 3x3 Shared w/ VGG16
2 Max-pooling - 2x2 -
3 Convolution 64x3x3x128 3x3 -
4 Convolution 64x3x3x128 3x3 -
5 Convolution 128x3x3x256 3x3 -
6 Convolution 256x3x3x256 3x3 -
7 Max-pooling - 2x2 -
8 Convolution 256x3x3x256 3x3 -
9 Convolution 128x3x3x256 3x3 -
10 Convolution 128x3x3x256 3x3 -
11 Max-pooling - 2x2 -
12 RoI-pooling - 15x15 -
13 Fully connected 57600x1000 - -
15 Fully connected (1000+5)x1000 - Concat w/ box locations & category
16 Fully connected 1000x256 - -


Table 7: Architecture of a region identification network (RIN)
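A quick consistency check on Table 7: the input size of the first fully connected layer should equal the flattened RoI-pooled feature map. The snippet below assumes the RoI-pooling output keeps the 256 channels of the last convolutional layer. It also assumes that the 5 extra inputs of layer 15 are four box coordinates plus one category value and that each 2x2 max-pooling uses a stride of 2; neither is stated explicitly in the table.

```python
# Values read from Table 7.
roi_pool_size = 15          # RoI-pooling output is 15x15
last_conv_channels = 256    # output channels of the last convolutional layer
fc13_inputs = roi_pool_size * roi_pool_size * last_conv_channels

# Assumption: the 5 concatenated inputs are 4 box coordinates + 1 category value.
box_coords, category_values = 4, 1
fc15_inputs = 1000 + box_coords + category_values

# Assumption: every 2x2 max-pooling has stride 2, so three of them downsample by 8.
downsample_factor = 2 ** 3

print(fc13_inputs, fc15_inputs, downsample_factor)
```

The first value matches the 57600x1000 parameter shape of layer 13, and the final 256-dimensional output matches the feature dimension chosen in Appendix C.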

Appendix E More Experimental Results

E.1 Experiments with Overlapped Objects

The experimental results are evaluated over the images in which overlapped objects exist. We measure recall and mAP. Recall is calculated as the ratio of detected objects among the overlapped objects. Recall is the better measure of IDNet's robustness to overlap because it is calculated only for objects with overlap, whereas mAP is calculated for all objects in the images. As shown in Table 8 and Table 9, the performance gap between Faster R-CNN and IDNet grows as the overlap ratio gets higher. For the PASCAL VOC 2007 dataset, the recall gaps increase with the overlap ratio: 5.5%, 7.8%, 10.5%, 12.2%, and 13.0% (Table 8). For the COCO dataset, the recall gaps are 8.0%, 11.0%, 14.3%, 16.3%, and 16.4% (Table 9). The mAP gaps are smaller but also tend to grow. Since some categories have no objects with an overlap of 0.5 or more, performance is measured only up to an overlap of 0.4 or more.


Network                     Inference  Overlap     # Obj  # Ovl. obj  # Det. obj  Recall  mAP
Faster R-CNN ren2015faster  NMS        (0.0, 1.0]  5505   4714        3792        80.4    61.4
LDDP azadi2017learning      LDPP       (0.0, 1.0]  5505   4714        3758        79.7    60.8
IDNet                       IDPP       (0.0, 1.0]  5505   4714        4048        85.9    63.1
Faster R-CNN ren2015faster  NMS        (0.1, 1.0]  3802   2675        2045        76.5    60.2
LDDP azadi2017learning      LDPP       (0.1, 1.0]  3802   2675        2084        77.9    60.2
IDNet                       IDPP       (0.1, 1.0]  3802   2675        2254        84.3    62.4
Faster R-CNN ren2015faster  NMS        (0.2, 1.0]  2458   1352        941         69.6    58.3
LDDP azadi2017learning      LDPP       (0.2, 1.0]  2458   1352        999         73.9    59.7
IDNet                       IDPP       (0.2, 1.0]  2458   1352        1095        80.1    60.3
Faster R-CNN ren2015faster  NMS        (0.3, 1.0]  1310   695         437         62.9    56.8
LDDP azadi2017learning      LDPP       (0.3, 1.0]  1310   695         477         68.6    59.5
IDNet                       IDPP       (0.3, 1.0]  1310   695         522         75.1    59.3
Faster R-CNN ren2015faster  NMS        (0.4, 1.0]  734    355         207         58.3    53.8
LDDP azadi2017learning      LDPP       (0.4, 1.0]  734    355         217         61.1    54.9
IDNet                       IDPP       (0.4, 1.0]  734    355         253         71.3    58.8


Table 8: Results on PASCAL VOC 2007 test set containing overlapped objects. The networks are trained with PASCAL VOC 2007 trainval set.
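To make the recall definition concrete, the snippet below recomputes the recall column for the (0.0, 1.0] rows of Table 8 from the reported counts (detected overlapped objects divided by all overlapped objects) and the resulting gap between IDNet and Faster R-CNN. Only the counts come from the table; the dictionary keys and variable names are illustrative.

```python
# Counts copied from the (0.0, 1.0] rows of Table 8.
num_overlapped = 4714
num_detected = {
    "Faster R-CNN + NMS": 3792,
    "LDDP + LDPP": 3758,
    "IDNet + IDPP": 4048,
}

# Recall in percent: detected overlapped objects / all overlapped objects.
recall = {
    name: round(100.0 * detected / num_overlapped, 1)
    for name, detected in num_detected.items()
}
recall_gap = round(recall["IDNet + IDPP"] - recall["Faster R-CNN + NMS"], 1)

print(recall, recall_gap)
```

The recomputed values reproduce the 80.4, 79.7 and 85.9 entries of the table and the 5.5% gap quoted in the text.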


Network                     Inference  Overlap     # Obj   # Ovl. obj  # Det. obj  Recall  mAP@0.5-0.95  mAP@0.5  mAP@0.75  mAP-S  mAP-M  mAP-L  mAR-1  mAR-10  mAR-100  mAR-S  mAR-M  mAR-L
Faster R-CNN ren2015faster  NMS        (0.0, 1.0]  168687  135912      87767       64.6    22.4          41.7     22.0      9.6    26.8   32.8   19.4   33.2    34.2     16.0   40.8   52.2
LDDP azadi2017learning      LDPP       (0.0, 1.0]  168687  135912      87982       64.7    22.7          42.1     22.1      9.8    27.1   33.5   19.0   32.7    33.8     14.6   40.3   51.8
IDNet                       IDPP       (0.0, 1.0]  168687  135912      98618       72.6    23.2          42.7     23.1      10.0   27.5   34.3   19.7   34.2    35.5     16.4   41.8   55.3
Faster R-CNN ren2015faster  NMS        (0.1, 1.0]  123532  65618       42055       64.1    21.2          40.0     20.5      9.1    26.1   31.2   18.2   31.4    32.4     15.0   39.5   50.3
LDDP azadi2017learning      LDPP       (0.1, 1.0]  123532  65618       43519       66.3    21.5          40.6     20.6      9.4    26.4   32.0   17.9   31.0    32.2     14.0   39.1   50.2
IDNet                       IDPP       (0.1, 1.0]  123532  65618       49266       75.1    22.0          40.9     21.5      9.5    26.8   32.9   18.4   32.7    34.2     15.5   40.7   54.5
Faster R-CNN ren2015faster  NMS        (0.2, 1.0]  79632   31963       18856       59.0    19.9          38.2     19.0      8.7    24.8   30.7   17.4   29.8    30.8     14.2   37.8   48.7
LDDP azadi2017learning      LDPP       (0.2, 1.0]  79632   31963       20272       63.4    20.3          39.0     19.2      9.0    25.2   31.3   17.0   29.4    30.6     13.2   37.5   48.6
IDNet                       IDPP       (0.2, 1.0]  79632   31963       23423       73.3    20.9          38.8     20.4      9.4    26.2   32.4   17.5   31.9    34.2     15.1   40.6   56.0
Faster R-CNN ren2015faster  NMS        (0.3, 1.0]  44429   15268       8070        52.9    19.2          36.9     18.4      8.5    24.3   31.0   17.0   28.6    29.6     13.4   36.4   47.8
LDDP azadi2017learning      LDPP       (0.3, 1.0]  44429   15268       8944        58.6    19.6          37.9     18.6      8.9    24.6   31.6   16.6   28.4    29.6     12.9   36.4   47.7
IDNet                       IDPP       (0.3, 1.0]  44429   15268       10558       69.2    20.5          38.2     20.0      9.1    25.7   33.0   17.0   30.9    33.2     14.4   39.2   56.0
Faster R-CNN ren2015faster  NMS        (0.4, 1.0]  22369   7196        3381        47.0    18.9          35.9     18.2      8.4    23.6   31.5   17.1   28.3    29.1     13.0   35.0   46.8
LDDP azadi2017learning      LDPP       (0.4, 1.0]  22369   7196        3765        52.3    19.3          37.2     18.4      8.6    24.0   32.6   16.5   27.6    28.7     12.3   34.6   47.3
IDNet                       IDPP       (0.4, 1.0]  22369   7196        4563        63.4    20.3          38.0     19.8      9.0    24.9   34.3   17.1   30.8    33.0     14.1   38.1   56.1

Table 9: Results on COCO 2014 validation set containing overlapped objects. Mean AP (mAP) is reported at IoU 0.5-0.95, 0.5 and 0.75 and for small (S), medium (M) and large (L) object areas; mean AR (mAR) is reported for 1, 10 and 100 detections and for S, M and L areas.

E.2 Results of Ablation Study

In addition to the results showing the impact of the ID loss and the sparse-score loss on COCO, we ran the same experiments on PASCAL VOC. The results for the ID loss are in Figure 6 and the results for the sparse-score loss are in Figure 7 and Figure 8. In Figure 8, the candidate boxes with scores over a fixed threshold (0.1 for both Faster R-CNN and IDNet) are visualized. The highest score in each category is shown in the images of Figure 8, and all scores are measured in , which is the normalized score. For the images in the left column of Figure 8, the highest score of the horse category in Faster R-CNN (Figure 8(a)) is 0.546, while the score in IDNet (Figure 8(b)) is 0.154. The results clearly show that the sparse-score loss suppresses the scores of bounding boxes labeled as horse around the cow. Additionally, for the images in the right column of Figure 8, the score of the tennis racket category is 0.226 in Faster R-CNN, while it is under the threshold (0.1) in IDNet. Therefore, the SS loss successfully suppresses the scores of falsely categorized bounding boxes around a correct bounding box.

Figure 6: Graphs showing the impact of the ID loss on the VOC 2007 test set. All networks are trained with the VOC 2007 trainval set.
Figure 7: Graphs showing the impact of the sparse-score loss for IDNet. The upper graphs are results on the VOC 2007 test set (trained with the VOC 2007 trainval set). The lower graphs are results on the VOC 2007 test set (trained with the VOC 0712 trainval set).
Figure 8: Scores of bounding boxes showing the impact of the sparse-score loss. (a) Results of Faster R-CNN. (b) Results of IDNet.

Appendix F Example Visualization

We visualize the results on PASCAL VOC in Figure 9 and the results on COCO in Figure 10. The bounding boxes are selected with a score threshold of 0.6 for Faster R-CNN with NMS and LDDP with LDPP, which is the threshold designated in the LDDP paper azadi2017learning. For visualization of IDNet with IDPP, we use 0.2 as the score threshold. The results show that the instance-aware DPP inference method (IDPP) can detect overlapped objects by leveraging the features of objects.
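To illustrate how DPP-based inference can keep overlapped objects that have distinct features, the sketch below runs a generic greedy MAP selection on a kernel built from qualities and pairwise similarities. It is a textbook-style greedy procedure, not necessarily the exact IDPP algorithm. It assumes the standard quality-diversity form of the kernel (entry i, j equals quality i times quality j times similarity i, j) and rescaled qualities above 1, and it stops when adding a box no longer increases the determinant. The function names and all numbers are hypothetical.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            result = -result
        result *= a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return result


def greedy_dpp_map(quality, similarity):
    """Greedily add the box that most increases det(kernel[selected, selected])."""
    n = len(quality)
    kernel = [[quality[i] * quality[j] * similarity[i][j] for j in range(n)]
              for i in range(n)]
    selected = []
    current_det = 1.0  # determinant of the empty submatrix
    while len(selected) < n:
        best_item, best_det = None, current_det
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            candidate = det([[kernel[a][b] for b in idx] for a in idx])
            if candidate > best_det:
                best_item, best_det = i, candidate
        if best_item is None:  # no remaining box increases the determinant
            break
        selected.append(best_item)
        current_det = best_det
    return selected


# Boxes 0 and 1: near-duplicate detections of one instance (similarity 0.95).
# Box 2: a different, overlapping instance with dissimilar features (0.3).
example_quality = [2.0, 1.8, 1.9]
example_similarity = [[1.0, 0.95, 0.3],
                      [0.95, 1.0, 0.3],
                      [0.3, 0.3, 1.0]]
selected_boxes = greedy_dpp_map(example_quality, example_similarity)
print(selected_boxes)
```

In this toy setting the near-duplicate box is dropped while the box with dissimilar features is kept. The same mechanism means that several dissimilar boxes on one occluded object could all survive.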

F.1 Successful Cases

We visualize successful cases of IDNet (Figure 9 for VOC, Figure 10 and Figure 11 for COCO). In Figure 9, the images in the first row show that bounding boxes of the wrong class are suppressed while the box of the correct class is selected. The results in the other rows show that objects in proximity are detected where the other methods fail. In Figure 10 and Figure 11, overlapped objects are successfully detected by IDNet.

Figure 9: Visualization results on PASCAL VOC 2007 test set. The left column shows the outputs of Faster R-CNN. The middle column shows the outputs of the LDDP, and the right column shows the results of IDNet.

Figure 10: Visualization results on COCO validation set. The left column shows the outputs of Faster R-CNN. The middle column shows the outputs of the LDDP, and the right column shows the results of IDNet.

Figure 11: Visualization results on COCO validation set. The left column shows the outputs of Faster R-CNN. The middle column shows the outputs of the LDDP, and the right column shows the results of IDNet.

F.2 Failure Cases Analysis

Figure 12 shows that the detector assigned bounding boxes of the wrong category to avocados. Because the dataset has no avocado category, the detector chose similar classes, such as banana and apple. This case suggests that scores need to be suppressed further for objects whose class is absent from the detection categories, i.e., objects that belong to the background category. Figure 12 also shows a giraffe hidden behind two trees. When an object is occluded, detectors tend not to notice that it is a single object and choose several bounding boxes for it. Since DPP inference tries to find the most representative bounding boxes, it selects all of these bounding boxes, which increases the number of false detections.

Figure 12: Failure cases of IDNet.


  • (1) Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014)

  • (2) Girshick, R.: Fast r-cnn. In: IEEE International Conference on Computer Vision (ICCV). (2015)
  • (3) Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Neural Information Processing Systems (NIPS). (2015)
  • (4) Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • (5) Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
  • (6) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European Conference on Computer Vision (ECCV). (2016)
  • (7) Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2008)
  • (8) Tian, Y.L., Lu, M., Hampapur, A.: Robust and efficient foreground analysis for real-time video surveillance. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2005)
  • (9) Ranjan, R., Patel, V.M., Chellappa, R.: Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2017)
  • (10) Koffka, K.: Principles of Gestalt psychology. Volume 44. Routledge (2013)
  • (11) Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012)
  • (12) Azadi, S., Feng, J., Darrell, T.: Learning detection with diverse proposals. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • (13) Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV) 88(2) (2010) 303–338
  • (14) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision (ECCV). (2014)
  • (15) Kong, T., Yao, A., Chen, Y., Sun, F.: Hypernet: Towards accurate region proposal generation and joint object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • (16) Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based fully convolutional networks. In: Neural Information Processing Systems (NIPS). (2016)
  • (17) Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • (18) Ren, M., Zemel, R.S.: End-to-end instance segmentation with recurrent attention. arXiv preprint arXiv:1605.09410 (2017)
  • (19) Li, Y., Sun, B., Wu, T., Wang, Y.: Face detection with end-to-end integration of a convnet and a 3d model. In: European Conference on Computer Vision (ECCV). (2016)
  • (20) Lee, D., Cha, G., Yang, M.H., Oh, S.: Individualness and determinantal point processes for pedestrian detection. In: European Conference on Computer Vision (ECCV). (2016)
  • (21) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2009)
  • (22) Chen, X., Gupta, A.: An implementation of faster rcnn with study for region sampling. arXiv preprint arXiv:1702.02138 (2017)
  • (23) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML). (2015)