Spatial Semantic Regularisation for Large Scale Object Detection

10/10/2015, by Damian Mrowca et al.

Large scale object detection with thousands of classes introduces the problem of many contradicting false positive detections, which have to be suppressed. Class-independent non-maximum suppression has traditionally been used for this step, but it does not scale well as the number of classes grows. Traditional non-maximum suppression does not consider label- and instance-level relationships, nor does it exploit the spatial layout of detection proposals. We propose a new multi-class spatial semantic regularisation method based on affinity propagation clustering, which simultaneously optimises across all categories and all proposed locations in the image, to improve both the localisation and categorisation of selected detection proposals. Constraints are shared across the labels through the semantic WordNet hierarchy. Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations. Detection experiments are conducted on the ImageNet and COCO datasets, in settings with thousands of detected categories. Our method provides a significant precision improvement by reducing false positives, while simultaneously improving the recall.


1 Introduction

Human assistance technologies and question answering require precise and detailed object recognition of a visual scene. Recently, large scale detection approaches have been proposed which aim to distinguish hundreds or thousands of object categories [1, 3, 11, 12, 23]. While impressive progress has been shown, they suffer from competing object category candidate detections, as can be seen in Figure 1(a). Commonly, non-maximum suppression (NMS) is used to select the bounding boxes with the highest detection score for each category. This method is not globally optimal, as the highest scoring box only suppresses locally overlapping boxes. Further, in the multi-class case, it does not take semantic relations between objects into account: in Figure 1, the couch, floor stool and beanbag proposals should support the settee candidate detection box, so that it is not suppressed by doggy bag as in Figure 1(b).

With thousands of different object categories, semantic relationships become a valuable source of information. Using semantics, consistency can be ensured across different detections. Hence, this work examines the benefit of a semantic hierarchy for the detection of thousands of object categories. We show that in such a large scale setting semantic constraints significantly improve detection.

The key contribution of this work is a large scale spatial semantic regulariser for the correct selection of candidate object detection proposals. Built on the framework of Affinity Propagation Clustering (APC) [8], our method is characterised by two new ideas.

First, we present an approach which unifies within- and across-class selection of candidate object detections. Our new multi-class affinity propagation clustering (MAPC) allows for global reasoning over the whole image simultaneously, rather than reasoning locally over image parts or over single classes separately, to determine the correct configuration of an image. Unlike NMS or [20], which perform the selection separately for each class, our algorithm exploits the relationships of highly related fine-grained categories in a large scale detection setting. Based on WordNet relationships, our algorithm knows that golden retrievers, dalmatians and dachshunds are all different dog breeds and should support, rather than suppress, each other if the corresponding boxes cover almost identical regions of interest in the image.

Second, we propose a large scale detection evaluation with over a thousand categories, which requires discriminating among competing classes, in contrast to standard detection challenges, which focus on a per-category mean Average Precision (mAP) evaluation. We demonstrate that our algorithm improves performance in two challenging scenarios. First, for a large number of objects per image, we show results on COCO. Second, for a large number of categories, we evaluate on a subset of ImageNet labeled with bounding boxes for 1,825 categories, a large scale detection scenario which has not been evaluated before.

2 Related Work

Our work is most related to spatial regularisation over detection proposals. In most detection methods, detection proposals (raw bounding box outputs with scores from detectors) need to be regularised over space to remove double detections on the same object, prune false positives, and improve localisation. Although greedy non-maximum suppression (NMS) is the most frequently used spatial regularisation approach, other approaches, such as merging nearby detection boxes, are sometimes shown to be more robust [25]. In [29], overlapping detections are averaged and a threshold is set based on the number of overlapping boxes. In [25], a greedy merge strategy is proposed to group detection proposals together and reward bounding box coherence. Spatial and co-occurrence priors are introduced in [2, 28] to prune detection results. In [7], labels of detection proposals are obtained via approximate inference over several types of spatial relationships instead of greedy NMS. Recently, Affinity Propagation Clustering (APC) [8], an unsupervised clustering method based on message passing, has been used to cluster proposed bounding boxes of the same class based on their overlap [22]. In [22], background and repellence terms are introduced to APC to allow the suppression of false positives and to avoid selecting object proposals lying too close to each other. Our work builds on [22], but differs in that: (1) our algorithm clusters object proposals of the same and of different classes simultaneously, whereas [22] is applied only within each class, (2) we introduce new constraints to ensure that one label per detection proposal is selected, and (3) we design our similarity measure such that semantically close objects get clustered together.

Another line of related work exploits semantic category hierarchies in visual recognition and detection [4, 6, 11, 14, 15, 17, 20, 24, 30]. Real world object categories often form a hierarchical structure, which can provide useful information for large scale detection. Such hierarchical relationships can be obtained from predefined semantic structures such as WordNet, or learned by data-driven methods. In [4], a conditional random field based hierarchy-and-exclusion graph is proposed to represent subsumption and exclusion relationships between classes. In [11, 14], the ImageNet hierarchy, which is based on WordNet, is used to transfer bounding box annotations and segmentations to semantically close categories. In [6], an accuracy-specificity trade-off based on the ImageNet hierarchy is optimised through the DARTS algorithm. PST [20] uses the WordNet hierarchy to transfer knowledge from known to novel categories and propagates information between instances of the novel categories. In [24], a visual hierarchy is discovered based on the Chinese Restaurant Prior and used to share detector parameters between classes. In [15], a semantic hierarchy is learned based on visual and semantic constraints. Our work is complementary to previous methods in this area: we integrate a semantic hierarchy into Multi-class Affinity Propagation Clustering (MAPC) for spatial regularisation, whereas previous methods used hierarchies only to train classifiers or to share features.

Our work is also related to large scale detection. In [3], large scale detectors for over 100,000 classes are trained based on hashing. In [1], NEIL, a semi-supervised learning system, is proposed to train detectors from Internet images. One major obstacle for large scale detection is the lack of bounding box annotations, which has recently been partially resolved by weakly supervised methods such as knowledge transfer [11], Multiple Instance Learning [27, 18], domain adaptation [12] or combined approaches [13]. Among these methods, LSDA [12] is a framework for classifier-to-detector adaptation, and was shown to effectively train large scale detectors based on image-level labels. Thus, in this paper we use LSDA to train a baseline detector on 7,404 leaf classes of the ImageNet hierarchy. However, we note that our spatial regularisation method does not depend on how detectors are trained, and can be applied to arbitrary sets of detectors.

To our knowledge, this is the first time that hierarchical semantic relationships are used together with spatial information to determine the correct scene configuration from contradicting candidate object detections. Applying this algorithm in a large scale setting is all the more challenging, as it requires inference over thousands of fine-grained and diverse categories. Our detection system is unique in its number of categories, both in terms of fine-grained detail, for instance incorporating different dog breeds, and in the variety of categories, including various animals, plants, fruits, and man-made structures.

3 Spatial semantic regulariser

In this section we describe our spatial semantic regulariser. Our method is based on Affinity Propagation Clustering (APC), which has been shown to outperform other clustering techniques such as k-means clustering [8]. [22] successfully adapted APC to the task of selecting candidate object detections of the same class. This method is denoted as Single-class APC (SAPC) in the following.

Our main contribution is to extend the previous work on APC [8, 10, 22] to multi-class detection and a large scale setting with thousands of fine-grained classes. To this end, we incorporate a new constraint ensuring that each bounding box exemplar gets assigned only one label. Similar to [22], we use an intercluster repellence term and a background category to remove false positives. Additionally, in order to leverage the visual similarity of semantically related fine-grained classes, we introduce hierarchical label relations into APC to cluster semantically similar objects. The resulting Multi-class APC (MAPC) algorithm is presented in Figure 2 after introducing standard APC.

Figure 2: MAPC message passing. Messages are passed between all candidate detections until a subset of detections gets selected as exemplars. IoU stands for Intersection over Union, the spatial similarity measure. For simplicity not all messages are depicted.

3.1 Standard affinity propagation clustering

APC is a message passing based clustering method. It uses data similarities to identify exemplars such that the sum of similarities between cluster exemplars and cluster members is maximised. Let $s(i,j)$ denote the similarity between data points $i$ and $j$, with $N$ being the number of data points; $s(i,j)$ indicates how well $j$ would serve as an exemplar for $i$ [8]. The self-similarity $s(j,j)$ indicates how likely a certain point $j$ is to be chosen as an exemplar. Using the binary formulation of [10], we encode the exemplar assignment with a set of binary variables $c_{ij} \in \{0,1\}$: $c_{ij} = 1$ if $i$ is represented by $j$ and $c_{ij} = 0$ otherwise. A valid clustering must satisfy two constraints: (i) each point is represented by exactly one exemplar and (ii) when $j$ represents any other point $i$, then $j$ must be an exemplar representing itself. In the following objective function, $I_i$ represents constraint (i) and $E_j$ represents constraint (ii):

\max_{c} \sum_{i,j} S_{ij}(c_{ij}) + \sum_{i} I_i(c_{i1},\dots,c_{iN}) + \sum_{j} E_j(c_{1j},\dots,c_{Nj})    (1)

S_{ij}(c_{ij}) = s(i,j)\, c_{ij}    (2)

I_i(c_{i1},\dots,c_{iN}) = \begin{cases} -\infty & \text{if } \sum_{j} c_{ij} \neq 1 \\ 0 & \text{otherwise} \end{cases}    (3)

E_j(c_{1j},\dots,c_{Nj}) = \begin{cases} -\infty & \text{if } c_{jj} = 0 \text{ and } \exists\, i \neq j : c_{ij} = 1 \\ 0 & \text{otherwise} \end{cases}    (4)

Max-sum message passing is applied to maximise equation (1) [8, 10], consisting of two messages. The responsibility $\rho_{ij}$ (sent from $i$ to $j$) describes how well suited $j$ would be as an exemplar for $i$. The availability $\alpha_{ij}$ (sent from $j$ to $i$) reflects the accumulated evidence for point $i$ to choose point $j$ as its exemplar:

\rho_{ij} = s(i,j) - \max_{k \neq j} \left[ s(i,k) + \alpha_{ik} \right]    (5)

\alpha_{ij} = \begin{cases} \min\!\left[0,\; \rho_{jj} + \sum_{k \notin \{i,j\}} \max(0, \rho_{kj})\right] & \text{if } i \neq j \\ \sum_{k \neq j} \max(0, \rho_{kj}) & \text{if } i = j \end{cases}    (6)
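
For concreteness, the following is a minimal NumPy sketch of these standard update rules, with damping added for numerical stability as is common in APC implementations. It illustrates plain APC only, not the additional MAPC terms introduced below:

import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    # S: N x N similarity matrix; S[j, j] holds the self-similarity
    # (preference) of point j. Returns the indices chosen as exemplars.
    N = S.shape[0]
    R = np.zeros((N, N))   # responsibilities rho(i, j), Eq. (5)
    A = np.zeros((N, N))   # availabilities alpha(i, j), Eq. (6)
    for _ in range(iters):
        # Eq. (5): rho(i,j) = s(i,j) - max_{k != j} [s(i,k) + alpha(i,k)]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(N), idx]          # row-wise maximum
        AS[np.arange(N), idx] = -np.inf
        second = AS.max(axis=1)                # second-largest, for k = argmax
        R_new = S - first[:, None]
        R_new[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = damping * R + (1 - damping) * R_new
        # Eq. (6): alpha(i,j) accumulates positive responsibilities towards j
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())     # keep rho(j,j) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()         # alpha(j,j) is not clipped at 0
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    # A point is an exemplar if its combined self-message is positive.
    return np.flatnonzero((A + R).diagonal() > 0)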

3.2 Affinity propagation clustering for multi-class object detection

We introduce our novel Multi-class Affinity Propagation Clustering (MAPC) algorithm, which extends SAPC [22] from single-class to multi-class detection. In multi-class detection, most object detectors propose multiple category labels with a certain confidence score for each bounding box. However, the label with the highest confidence is not always the correct one. Hence, not only the correct location but also the correct class for each box has to be inferred. Therefore, we redefine each data point $i$ or $j$ as a combined box-class detection, e.g. box $b_1$ labeled dog, box $b_1$ labeled cat, or box $b_2$ labeled cat. This allows us to define a similarity measure between detections which includes both the spatial relation between bounding boxes and the relation between their labels (7):

s(i,j) = \lambda\, \mathrm{IoU}(i,j) + (1 - \lambda)\, \mathrm{sim}(l_i, l_j)    (7)

Whereas SAPC bases its similarities solely on the IoU between bounding boxes [22], our similarity measure clusters overlapping detections, represented by the $\mathrm{IoU}$ term, as well as semantically similar detections, represented by the $\mathrm{sim}$ term. An example can be seen in Figure 3. $\lambda$ is a weighting factor trading off spatial and semantic similarity. The Intersection over Union is defined as $\mathrm{IoU}(i,j) = |A_i \cap A_j| / |A_i \cup A_j|$, where $A_i$ is the area of the image covered by the bounding box of $i$. It describes the overlap and hence the visual similarity of two detections. The measure $\mathrm{sim}(l_i, l_j)$ denotes how semantically similar the labels of two detections are: $\mathrm{lcs}(l_i, l_j)$ denotes the lowest common subsumer of the classes $l_i$ of $i$ and $l_j$ of $j$ in the WordNet hierarchy, and $\mathrm{sim}(l_i, l_j) = IC(\mathrm{lcs}(l_i, l_j))$ equals the information content $IC(c) = -\log p(c)$ of that class, where $p(c)$ is the probability of encountering an instance of the class $c$ in a corpus. The relative corpus frequency of $c$ and the probabilities of all child classes that $c$ subsumes are used to estimate the probability $p(c)$ [19, 21].

The self-similarity $s(i,i)$ is defined in terms of the detection score $d_i$ generated by the object detector and a background threshold $\theta$, which is used to discard detections scoring lower than $\theta$ before APC inference.
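
To make the measure concrete, here is a hedged Python sketch of equation (7). Two assumptions are ours, not the paper's: labels are mapped to WordNet synsets via NLTK, and the information content is taken from NLTK's Brown corpus statistics rather than the corpus used above. The raw Resnik score is also unbounded, so in practice it would be normalised to [0, 1] before being mixed with the IoU:

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # corpus statistics estimating p(c)

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def semantic_sim(syn_i, syn_j):
    """Resnik similarity: information content of the lowest common subsumer.
    Unbounded; a real system would rescale it to [0, 1]."""
    return syn_i.res_similarity(syn_j, brown_ic)

def similarity(det_i, det_j, lam=0.5):
    """Equation (7): lam trades spatial overlap against label similarity."""
    (box_i, syn_i), (box_j, syn_j) = det_i, det_j
    return lam * iou(box_i, box_j) + (1 - lam) * semantic_sim(syn_i, syn_j)

# Two overlapping detections labeled with different dog breeds score high:
d1 = ((10, 10, 100, 100), wn.synset('golden_retriever.n.01'))
d2 = ((15, 12, 105, 98), wn.synset('dachshund.n.01'))
print(similarity(d1, d2))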

To further avoid that contradicting detections are chosen as exemplars, we introduce a new constraint: if class $k$ is an exemplar for a specific box $b$, i.e. $c_{ii} = 1$ for the detection $i = (b, k)$, no other class can be an exemplar for box $b$:

M_b(c_{11},\dots,c_{NN}) = \begin{cases} -\infty & \text{if } \exists\, i \neq j \text{ with } b_i = b_j = b \text{ and } c_{ii} = c_{jj} = 1 \\ 0 & \text{otherwise} \end{cases}    (8)
Figure 3: Combining spatial and semantic similarity in MAPC. All red boxes form one cluster, for which the blue box emerged as the exemplar. With a combined semantic-spatial similarity, semantically similar and spatially close detections get clustered, which finally results in a well localised true positive detection.

The remaining algorithm exactly follows [22], which uses a repellence term $R$, but based on our semantic-spatial similarity, to avoid selecting semantic-spatially close exemplars, and a background category to allow false positives to be suppressed, denoted by the term $B$ in equation (9). Linearly combining all of the terms presented yields the following objective function to be maximised:

\max_{c}\; \sum_{i,j} S_{ij} + \sum_{i} I_i + \sum_{j} E_j + \sum_{b} M_b + R + B    (9)

All function arguments in equation (9) are left out for the sake of clarity. To solve this optimisation problem, the message passing paradigm of [22] is used. All messages are initialised with zero and iteratively updated until convergence.

4 Experiments

In this section we evaluate the performance of MAPC in a large scale setting. At this time, there is no standardised large scale dataset with both a large number of object instances per image and a large number of different object categories. Hence, we evaluate MAPC on two different datasets. We use the Microsoft COCO dataset [16] for the evaluation on a large number of object instances within one image. To evaluate on a large number of fine-grained categories, we create a new dataset built from images with bounding box annotations from ImageNet [5]. This dataset covers 1,825 categories, but contains only a few object instances per image due to incomplete annotations.

However, we believe that our method would perform best in a setting with both thousands of fine-grained categories and dozens of object instances per image. Hence, we also present qualitative results in the supplemental material, where we show the performance of our MAPC algorithm on all 7,404 LSDA categories [12].

We mainly use precision and recall as well as the F1-score, the harmonic mean of precision and recall, to evaluate MAPC on these datasets. The mAP metric, which is usually used to evaluate performance on detection tasks, is not an appropriate performance measure for our multi-class detection setup.

mAP is a metric for retrieval tasks. Traditionally, single-class detection has been seen as a retrieval task: all window detections that contain an object of the given class are to be retrieved. As most object detectors were designed as window-scoring methods, it was natural to rank all window detections according to their scores. With the clustering view, there is no absolute score which could be used for a global ranking, so mAP cannot be applied correctly. The multi-class setting makes it even less suited: mAP favours multiple detections for each class and overall punishes across-class selection of object proposals, whereas our method aims precisely at a better selection of detections across classes. Hence, we cannot use mAP to evaluate this task. For a true understanding of a depicted scene we have to focus especially on a high precision and F1-score for selecting object proposals across classes, while trying to maintain the recall. Obviously, a high recall could also be achieved by selecting many object proposals without any across-class suppression; as can be seen in Figure 1(a), within-class suppression alone (which would be desirable for the mAP measure) still leaves unanswered which objects are actually depicted in an image. For a more detailed investigation of wrong detections, we examine whether a false positive occurred due to a wrong localisation or classification: wrong label is the fraction of all false positives that carry a wrong label, and wrong overlap is the fraction of all false positives with a wrong location.
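
A minimal sketch of this protocol follows, under the usual assumption (implicit above) that a detection is a true positive when it matches a previously unmatched ground-truth object of the same class with sufficient IoU; `iou` is the helper from the Section 3.2 sketch:

def evaluate(detections, ground_truth, iou_thresh=0.5):
    """detections, ground_truth: lists of (box, label). Returns precision,
    recall, F1 and the wrong-label / wrong-overlap shares of false positives."""
    tp, wrong_label, wrong_overlap = 0, 0, 0
    matched = set()
    for box, label in detections:
        hits = [(g, g_label) for g, (g_box, g_label) in enumerate(ground_truth)
                if iou(box, g_box) >= iou_thresh]
        correct = [g for g, g_label in hits
                   if g_label == label and g not in matched]
        if correct:
            matched.add(correct[0])
            tp += 1
        elif hits:
            wrong_label += 1      # well localised, but wrongly classified
        else:
            wrong_overlap += 1    # no sufficiently overlapping ground truth
    fp = wrong_label + wrong_overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return (precision, recall, f1,
            wrong_label / fp if fp else 0.0,
            wrong_overlap / fp if fp else 0.0)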

To set up MAPC and determine all of its parameters, we use grid search on a training set obtained from ImageNet [5] as follows. First, we search for all ImageNet categories with available bounding box annotations. Next, we determine which of these categories overlap with the 7,404 categories of the LSDA detector [12]. This results in 1,825 categories with annotated images. Next, we discard all images used during the LSDA training and in the ImageNet test set described in Section 4.2. We obtain our final training set by randomly selecting two annotated images per category from the remaining images. After performing grid search on this training set, the MAPC parameters are set such that recall and precision are maximised; a sketch of this search follows.
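
A hedged sketch of that search: `run_mapc` is a hypothetical wrapper running MAPC with a given spatial-semantic weight and background threshold, `evaluate` is the metric sketch from above, and using the mean F1-score as the single selection criterion is our simplification of jointly maximising recall and precision:

from itertools import product

def grid_search(run_mapc, evaluate, train_set, lambdas, thetas):
    """train_set: dict image_id -> ground-truth list of (box, label).
    Returns the (lambda, theta) pair with the best mean F1-score."""
    best_params, best_f1 = None, -1.0
    for lam, theta in product(lambdas, thetas):
        detections = run_mapc(train_set, lam=lam, theta=theta)  # hypothetical
        f1_scores = [evaluate(detections[img], gt)[2]
                     for img, gt in train_set.items()]
        mean_f1 = sum(f1_scores) / len(f1_scores)
        if mean_f1 > best_f1:
            best_params, best_f1 = (lam, theta), mean_f1
    return best_params

# e.g. grid_search(run_mapc, evaluate, train_set,
#                  lambdas=[0.1, 0.3, 0.5, 0.7, 0.9],
#                  thetas=[0.0, 0.1, 0.2])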

In all our experiments, common non-maximum suppression (NMS) is used as the baseline. More specifically, detections of the same category overlapping by more than a defined IoU threshold are suppressed in a first step. Then, the remaining detections are suppressed across all classes with a different IoU threshold. Both NMS thresholds were determined using grid search as described above. The best configuration resulted in a higher IoU threshold for within-class suppression than for across-class suppression. The intuition is that detections of the same class instance are typically located at very similar positions in the image, so a higher threshold is necessary to suppress within classes. This baseline is denoted Within Class and Across Class NMS (WC+AC-NMS) and is sketched below. MAPC is also compared against SAPC [22]. However, SAPC was designed for single-class detection. As we evaluate in a multi-class detection scenario, we simply accumulate the per-class output of SAPC across all classes for a first SAPC version. However, accumulating all detections without suppressing across classes is more suitable for an object retrieval task than for multi-class object detection. Thus, in a second version, we use across-class NMS (AC-NMS) on top of the accumulated SAPC output to also select object detections across classes. This makes SAPC [22] more comparable to our method. The IoU threshold for this across-class NMS was also determined using grid search.
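
For reference, a minimal sketch of this baseline, with detections as (box, label, score) tuples and `iou` reused from the Section 3.2 sketch; the default thresholds are placeholders, the actual values come from the grid search:

def greedy_nms(dets, iou_thresh, same_class_only):
    """Greedy NMS: keep the highest-scoring detection, suppress overlaps."""
    kept = []
    for det in sorted(dets, key=lambda d: d[2], reverse=True):
        box, label, _ = det
        suppressed = any(iou(box, k[0]) > iou_thresh
                         and (not same_class_only or k[1] == label)
                         for k in kept)
        if not suppressed:
            kept.append(det)
    return kept

def wc_ac_nms(dets, wc_thresh=0.5, ac_thresh=0.3):
    """Within-class suppression first (higher threshold), then across-class."""
    within = greedy_nms(dets, wc_thresh, same_class_only=True)
    return greedy_nms(within, ac_thresh, same_class_only=False)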

4.1 Multiple instance detection on COCO

The Microsoft COCO dataset [16] consists of images that depict objects in their real world context rather than as centered objects. Because of this, detection on COCO is much more challenging than on the mostly centered ImageNet pictures. Hence, this dataset is chosen to evaluate the performance of our semantic spatial regulariser in a contextual setup with numerous object instances per image.

4.1.1 Experimental setup

COCO consists of 80 different categories with on average 7 object instances per image. In a first experiment, we use the latest LSDA setup with 7,404 fine-grained categories [12]. 15 COCO categories overlap neither with the leaf node categories of LSDA nor with any of their parents in the WordNet hierarchy (traffic light, fire hydrant, stop sign, snowboard, person, kite, fork, sandwich, hot dog, pizza, donut, cake, potted plant, book, teddy bear). For those of the remaining 65 categories which overlap with a parent category, we use all of their children as input to our method and the baselines; for example, we detect beagle and dachshund instead of their parent category dog. This results in 1,133 fine-grained child categories, over which all methods have to infer. We simply relabel the children output after inference to their parent categories to compare it with the COCO ground truth. We neither train LSDA on COCO nor adapt the MAPC parameters to COCO.

In a second experiment, we fine-tune our detection network on the COCO training set using all 80 COCO categories as input to our method and the baselines. Both experiments are evaluated on the COCO validation set.

4.1.2 Experimental results

Table 1 shows the detection results of our first experiment, without fine-tuning our detector on COCO, on the COCO validation set. As can be seen, MAPC outperforms WC+AC-NMS by 3.16% in terms of precision while maintaining the recall. This performance gain can be explained by fewer wrongly labeled (65.13%) and wrongly localised (74.31%) detections. The F1 score for the chosen setup is 13.46% for WC+AC-NMS versus 15.09% for MAPC. In Figure 5(c) & (d) we vary the IoU evaluation threshold above which a detection is counted as a true positive; MAPC is consistently better than WC+AC-NMS. In general, almost all operating points of MAPC lie above WC+AC-NMS, as can be seen in the precision-recall curve depicted in Figure 5(a). These results clearly show that MAPC is superior to WC+AC-NMS in scenarios with many object instances per image. Compared to SAPC [22], our MAPC method improves all numbers except the recall of 20.72%, which SAPC reaches only because no across-class suppression is applied: many detections are selected, resulting in a cluttered outcome, which manifests in the low precision value of 5.25% and decreases the F1-score to 8.38%. As [22] was designed for within-class suppression and does not suppress across classes, these results are not surprising. When across-class NMS (AC-NMS) is applied on the accumulated outcome of [22], the precision increases to 14.66% at the cost of a recall decrease, and the F1-score increases to 13.12% overall. Still, MAPC performs best on the COCO validation set amongst all tested methods.

Figure 4: Different optimisation criteria (left: MAPC optimised for precision, right: MAPC optimised for F1 score). When optimised for F1 score instead of precision, MAPC selects more detections, resulting in more true and false positives.

The greater precision of MAPC is especially visible in example images. The pictures in Figure 6 show the output of WC+AC-NMS and MAPC after optimising both algorithms for the highest precision with comparable recall, using the detector not fine-tuned on COCO. Green boxes are true positive detections, red boxes false positives. WC+AC-NMS reaches its precision limit after suppressing all overlapping boxes, while MAPC can also suppress non-overlapping boxes. At the same time, MAPC still enables the selection of overlapping object proposals, as can be clearly seen in the example pictures. Allowing a greater overlap for WC+AC-NMS would increase true positives at the cost of lower precision and a cluttered detection output. In general, MAPC outputs fewer false positives and better localised true positives.

If required, MAPC can also be optimised towards a higher recall. Figure 4 compares, as an example, an F1-score-optimised MAPC to a precision-optimised MAPC. Clearly more boats get detected when we optimise towards F1, but more false positives are selected as well. All in all, MAPC can be optimised towards a high recall as well as a high precision, while WC+AC-NMS reaches its precision limit because it cannot suppress non-overlapping boxes. Thus, MAPC can be more precise in selecting the correct bounding box proposals.

In our second experiment, we fine-tune our object detector on COCO. The results can be seen in Table 2. As expected, all of our metrics improve considerably. Most strikingly, the MAPC precision rises to 37.64% while the recall remains comparable, which increases the F1 score difference between MAPC and WC+AC-NMS to 5.40%. The F1 score of SAPC also improves strongly, to 21.17%. All methods obviously profit greatly from better detections. Thus, a detector which provides good candidate detections in the first place is crucial for all of the examined methods.

Method          Precision  Recall  Wrong Label  Wrong Overlap  F1 Score
WC+AC-NMS           13.44   13.47        79.39          88.97     13.46
SAPC [22]            5.25   20.72        74.79          72.73      8.38
SAPC + AC-NMS       14.66   11.86        81.36          87.15     13.12
MAPC (ours)         16.60   13.84        65.13          74.31     15.09
Table 1: Detection results on COCO without fine-tuning, in %.

Method          Precision  Recall  Wrong Label  Wrong Overlap  F1 Score
WC+AC-NMS           23.50   24.80        62.99          94.97     24.10
SAPC [22]           15.66   32.61        69.01          72.43     21.17
SAPC + AC-NMS       30.01   21.97        74.95          92.90     25.39
MAPC (ours)         37.64   24.23        55.47          71.79     29.50
Table 2: Detection results on COCO, fine-tuned on COCO, in %.

Method          Precision  Recall  Wrong Label  Wrong Overlap  F1 Score
WC+AC-NMS            8.34   11.29        91.90          85.53      9.59
SAPC [22]            3.46   22.57        93.69          68.05      6.00
SAPC + AC-NMS        9.76   10.34        91.02          81.54     10.04
MAPC (ours)         10.94   16.22        86.41          68.57     13.07
Table 3: Detection results on ImageNet without fine-tuning, in %.
Figure 5: (a) & (b) Precision-recall curves for WC+AC-NMS, SAPC + AC-NMS and MAPC on COCO (a) and on the set of 1,825 ImageNet categories (b), without fine-tuning. The curved lines mark points of equal F1 score; the F1 score increases from lower left to upper right. Multiple operating points are obtained by varying the across-class IoU threshold in AC-NMS (for WC+AC-NMS and SAPC + AC-NMS) and the corresponding parameters for MAPC. MAPC consistently outperforms WC+AC-NMS and SAPC + AC-NMS on COCO and ImageNet. SAPC + AC-NMS is superior to WC+AC-NMS in the lower precision range. (c) & (d) F1-score plotted against the IoU evaluation threshold for COCO (c) and for the set of 1,825 ImageNet categories (d). MAPC consistently outperforms WC+AC-NMS on both datasets.

4.2 Fine-grained multi-class detection on ImageNet

In this section we evaluate MAPC on a large scale multi-class detection setup constructed from ImageNet data [5]. Since there is no standardised dataset with thousands of categories, we construct our own dataset to evaluate MAPC on a large number of fine-grained categories. The final dataset covers 1,825 categories, but contains only a few object instances per image due to the incomplete annotations of ImageNet.

4.2.1 Experimental setup

In order to construct a dataset with numerous fine-grained categories, we search for all ImageNet categories with available detection annotations. As we use the LSDA detector [12], we determine which of these categories overlap with its 7,404 categories. This results in 1,825 categories with annotated images. Next, all images used during the training of the LSDA detector are discarded. As most of the remaining images have only one object annotated, we further restrict our test set to images with at least two annotated objects. This way, we ensure that we evaluate in a true detection setup rather than a localisation setup. After this step, we obtain our final fine-grained ImageNet test set; the construction is sketched below.
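
A sketch of this filtering, with the three arguments as assumed stand-ins for the real ImageNet and LSDA metadata (the actual data loading is not described in the paper):

def build_imagenet_testset(annotations, lsda_categories, lsda_train_images):
    """annotations: dict image_id -> list of (category, bbox).
    Returns the filtered test set as dict image_id -> annotations."""
    boxed = {c for anns in annotations.values() for c, _ in anns}
    categories = boxed & set(lsda_categories)   # the 1,825 overlapping classes
    test_set = {}
    for image_id, anns in annotations.items():
        if image_id in lsda_train_images:       # drop LSDA training images
            continue
        anns = [(c, b) for c, b in anns if c in categories]
        if len(anns) >= 2:                      # detection, not localisation
            test_set[image_id] = anns
    return test_set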

4.2.2 Experimental results

Figure 6: Examples where MAPC outperforms WC+AC-NMS on Microsoft COCO (columns alternate WC+AC-NMS and MAPC). True positives: green, false positives: red.
Figure 7: Examples where WC+AC-NMS outperforms MAPC on Microsoft COCO (columns: Ground Truth, WC+AC-NMS, MAPC). Ground truth: blue, true positives: green, false positives: red.

Table 3 shows the detection results on our large scale ImageNet detection dataset. The same tendencies as on COCO can be observed here. The precision and recall of MAPC are 2.60% and 4.93% higher than those of WC+AC-NMS. The F1-score increases from 9.59% to 13.07%. False positives due to wrong labels drop by 5.49%, and localisation errors drop from 85.53% to 68.57%. Again, SAPC performs better after applying across-class NMS; however, MAPC still performs best, which confirms our results on COCO. Strikingly, the improvement over WC+AC-NMS and SAPC + AC-NMS is bigger in the fine-grained setting, and the improvement of MAPC over WC+AC-NMS when the IoU evaluation threshold is varied is also bigger than on COCO, as can be seen in Figure 5. It seems that the more fine-grained the categories, the more visually similar semantically similar categories are, and thus the more useful the label relations from WordNet become. This indicates that our approach is especially useful in a large scale setting where many visually similar fine-grained object categories compete against each other.

5 Conclusions and Future Work

We presented MAPC, a large scale multi-class regulariser which globally maximises both the semantic and spatial similarity, and thus the visual similarity, of clusters of detection proposals. MAPC significantly reduces false positives in multi-class detection, resulting in improved classification and localisation. Our results show that the selection of detection proposals can be significantly improved over the baseline class-independent non-maximum suppression by formulating a clustering problem across class labels and spatial dimensions, which can be solved by affinity propagation. Overall, we consistently improve precision and recall for different operating points and evaluation thresholds.

As future work, it would be interesting to compare the fine-grained category detection on COCO with detectors trained directly on the parent categories, to see whether training on finer-grained classes helps detect the parent classes. MAPC could also be extended to the temporal domain, in order to cluster over consecutive video frames for activity recognition and video description.

Acknowledgements. This work was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425, IIS-1212798, and IIS-113629, and the Berkeley Vision and Learning Center. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).

References

  • [1] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
  • [2] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories.

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , 2010.
  • [3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [4] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In Proceedings of the European Conference on Computer Vision (ECCV). 2014.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
  • [6] J. Deng, J. Krause, A. C. Berg, and L. Fei-Fei. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [7] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative Models for Multi- Class Object Layout. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009.
  • [8] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587. IEEE, 2014.
  • [10] I. E. Givoni and B. J. Frey. A binary variable model for affinity propagation. Neural computation, 21(6), 2009.
  • [11] M. Guillaumin and V. Ferrari. Large-scale knowledge transfer for object localization in ImageNet. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [12] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [13] J. Hoffman, D. Pathak, T. Darrell, U. C. Berkeley, and K. Saenko. Detector Discovery in the Wild: Joint Multiple Instance and Representation Learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [14] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation Propagation in ImageNet. Proceedings of the European Conference on Computer Vision (ECCV), 2012.
  • [15] L.-J. Li, C. Wang, Y. Lim, D. M. Blei, and L. Fei-Fei. Building and using a semantivisual image hierarchy. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • [16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV). 2014.
  • [17] M. Marszałek and C. Schmid. Semantic hierarchies for visual object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–7. IEEE, 2007.
  • [18] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully Convolutional Multi-Class Multiple Instance Learning. Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [19] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint, 1995.
  • [20] M. Rohrbach, S. Ebert, and B. Schiele. Transfer Learning in a Transductive Setting. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • [21] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1641–1648, 2011.
  • [22] R. Rothe, M. Guillaumin, and L. V. Gool. Non-maximum suppression for object detection by passing messages between windows. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2014.
  • [23] O. Russakovsky, J. Deng, Z. Huang, A. C. Berg, and L. Fei-Fei. Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going? Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
  • [24] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1481–1488, 2011.
  • [25] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [27] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2014.
  • [28] B. A. Torralba, K. P. Murphy, and W. T. Freeman. Using the Forest to See the Trees : Exploiting Context for Visual Object Detection and Localization. Advances in Neural Information Processing Systems (NIPS), 2003.
  • [29] P. Viola and M. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision (IJCV), 2004.
  • [30] A. Zweig and D. Weinshall. Exploiting object hierarchy: Combining models from different category levels. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1–8. IEEE, 2007.

6 Multiple Instance Detection on COCO based on VGG net

In order to examine the influence of the network architecture on the selection process, we conduct a third experiment on the COCO dataset, in addition to the two experiments mentioned in the main paper. As in the second experiment, we fine-tune our detection network on the COCO training set using all 80 COCO categories as input to our method and the baselines. However, we replace the original RCNN detection network architecture [9] by VGG net, a deeper neural network architecture which has been shown to significantly improve classification and detection performance [26]. We evaluate our method against the baselines using the output of the fine-tuned VGG net as input to all regularisation methods and obtain the results depicted in Table 4.

Method          Precision  Recall  Wrong Label  Wrong Overlap  F1 Score
WC+AC-NMS           11.26   29.41        51.28          98.42     16.29
SAPC [22]           21.11   32.39        80.71          76.95     25.56
SAPC + AC-NMS       32.96   23.84        82.93          93.19     27.67
MAPC (ours)         37.23   31.21        55.18          75.71     33.96
Table 4: Detection results on COCO, fine-tuned on COCO using VGG net, in %. Within Class and Across Class NMS (WC+AC-NMS), Single-class APC (SAPC) and Single-class APC with Across Class NMS (SAPC + AC-NMS) are compared against Multi-class APC (MAPC).
Figure 8: Detection examples when using VGG net with WC+AC-NMS, SAPC + AC-NMS and MAPC on Microsoft COCO (columns: Ground Truth, WC+AC-NMS, SAPC + AC-NMS, MAPC). Ground truth: blue, true positives: green, false positives: red.

When we compare Table 4 with the COCO tables from the main paper, we can clearly see that all methods benefit from the deeper VGG net except WC+AC-NMS. Compared to our second COCO experiment, the F1 score of SAPC improves from 21.17% to 25.56%, SAPC + AC-NMS improves from 25.39% to 27.67% and MAPC improves from 29.50% to 33.96%. Only WC+AC-NMS drops, from 24.10% to 16.29%. Whereas the recall and precision of all other methods increase or at least stay the same, the precision of WC+AC-NMS decreases significantly. The numbers by themselves cannot explain this behaviour, but a qualitative analysis of several detection results can. When we look at the detection results of WC+AC-NMS in the images of Figure 8, we can see that many small non-overlapping false positive detections are obtained. VGG learns better than the original RCNN network that the COCO dataset contains many small objects, and thus scores objects that are usually small in the dataset, such as persons in groups, higher. This creates many small non-overlapping detection proposals scattered all over the image. In these cases, NMS simply cannot suppress the false positive detections, since they do not overlap with any higher scoring detection. SAPC and MAPC, however, do not need false positive detections to overlap in order to suppress them. Thus, both APC methods are able to suppress more false positive detections than NMS, whose suppression ability is limited by the necessity of overlapping detection proposals; this imposes an upper bound on the precision of NMS. Hence, especially when using the VGG net, the better precision of our MAPC method compared to NMS becomes obvious. But also in comparison to SAPC and SAPC + AC-NMS, MAPC has a significantly higher F1 score, and when using the VGG net the gap between MAPC and the two SAPC approaches grows even larger. In the end, MAPC improves the detection performance in terms of the F1 score by 6.29% compared to SAPC + AC-NMS, and even by 8.40% compared to the original SAPC method. Summarising, our results clearly show that MAPC is able to improve the detection performance over state-of-the-art methods in terms of precision while maintaining the recall.

7 Comparison of MAPC, SAPC and NMS on COCO and ImageNet

In this section we show and discuss example images where our MAPC method outperforms the SAPC and NMS baselines. We use the original LSDA network, not fine-tuned to the COCO dataset. All regularisation methods were optimised to maximise the F1-score. The images were not selected randomly, but chosen based on the best performance of each method.

We evaluate on images from COCO and ImageNet, using all of the 7,404 categories of the LSDA large scale detector [12] which overlap with the respective dataset. Following the experimental setup in the main paper, we evaluate on 65 categories, i.e. on their 1,133 child categories, for the COCO validation set, and on 1,825 categories for the ImageNet 1,825 categories set. Since a dataset with ground truth annotations for all 7,404 LSDA categories is missing, we also present qualitative MAPC results on all 7,404 categories of the LSDA detector. We evaluate against the WC+AC-NMS and SAPC + AC-NMS baselines as defined in the main paper.

As examples, Figures 9, 12 and 15 depict images where MAPC both localised and classified objects better, resulting in an overall significantly better detection result. Both for a large number of object instances per image, as in COCO, and for a large number of detection classes, as in our ImageNet 1,825 categories dataset or our qualitative dataset with 7,404 categories, MAPC performs better than the baselines. Hence, based on LSDA detections, our method enables a very fine-grained and precise detection, be it between different types of bags, flowers or trees.

The following sections analyse more images qualitatively and are organised as follows. Section 7.1 discusses images where MAPC localised objects significantly better. Section 7.2 analyses images where MAPC classified objects significantly better. Finally, we show some failure cases of MAPC in Section 7.3.

7.1 Improved localisation

In this section, we show and discuss images where MAPC localises objects significantly better than the baselines. These images can be seen in Figures 10, 13 and 16. As can be seen, WC+AC-NMS often selects bigger detection proposals that group multiple objects together. NMS simply takes the maximum scoring detection proposal and locally suppresses all other detections which overlap with it. Hence, if large proposals score high in a detection setup, those detection proposals suppress all others, which makes a more detailed detection on the object level impossible. But also in other scenarios, where such large detections do not appear, WC+AC-NMS localises objects worse, as can be seen in the images. In contrast, SAPC [22] optimises over the whole set of detection proposals, which improves the localisation compared to WC+AC-NMS. However, SAPC does not regularise across classes. Thus MAPC, which groups spatially and semantically similar detections together, does much better at localising objects. MAPC benefits from the fact that usually multiple similarly sized and similarly classified detection proposals lie on the same object, due to the visual similarity of such proposals. Hence, local semantically and spatially similar clusters of detection proposals are formed over objects all over the image and grouped together by MAPC. This results in a precise object localisation, as can be seen in the images.

7.2 Improved Classification

In this section, we show and discuss images where MAPC classifies objects significantly better than the baselines. These images can be seen in Figures 11, 14 and 17. Looking at these images, we can see that MAPC clearly benefits from taking semantic similarities into account during the labelling of detection proposals. Since in multi-class detection each object detection is classified not by one class but by a confidence score distribution across all classes, the problem arises of labelling each detection with the correct class. Hence, the MAPC similarity was formulated such that not only the spatial similarity but also the semantic class similarity between all detection proposals in one cluster is maximised. In contrast, SAPC and NMS do not take class labels into account at all: NMS simply suppresses overlapping detections by the top scoring detection, whereas SAPC relies only on spatial relations for its inference. Further, when SAPC and NMS are applied to object detection, each detection proposal is simply labeled with the top scoring class. The input for MAPC, however, can contain multiple classes per detection proposal; MAPC then selects for each detection the class which maximises the semantic and spatial similarity between all detection proposals in one cluster at the same time. All in all, this results in significantly better classified detections.

7.3 Erroneous Examples

In this section we look at images where MAPC performs worse than SAPC + AC-NMS or WC+AC-NMS. As can be seen in Figure 18, MAPC sometimes produces multiple detections for the same object; hence, MAPC does not always eliminate all false positives. These false positives can be of the same category as the underlying true positive detection or of a different one, which means that the spatial or semantic clustering, respectively, did not work correctly. In these cases, the parameters can be adjusted until all false positives are eliminated. Nonetheless, there are cases where MAPC detects the wrong category or localises an object in the wrong area, whereas SAPC + AC-NMS or WC+AC-NMS find the correct object and label it correctly.

8 MAPC explained step-by-step

In order to better understand how MAPC works, this section explains step by step, on one example, how MAPC clusters are formed throughout the iterative message passing updates. This example also verifies that spatially and semantically similar object detection proposals are indeed clustered together, and that the detections which best represent these clusters are selected. The most relevant iterations of MAPC are depicted in Figure 19. All detection proposals which are cluster representatives are coloured green. All detection proposals which are not cluster representatives but belong to one of the green representatives are depicted in red. All detection proposals which belong to the background cluster are not depicted.

MAPC is an iterative algorithm which follows the message passing paradigm of [22] until convergence. As can be seen in the first image of Figure 19, all detection proposals are in the background cluster at the initial iteration (Figure 19 (a)). In the following, messages are passed between all detection proposals. Based on the spatial semantic similarity, detection proposals with similar classes and locations get clustered together, and first clusters with representatives emerge (Figure 19 (b)). More clusters are formed in this process (Figure 19 (c)). These clusters are grouped together until the similarity between each representative and each detection proposal in its cluster is maximised (Figure 19 (d)). Once the arrangement of clusters with their respective representatives does not change anymore, the algorithm has converged (Figure 19 (e)). All detection proposals except the cluster representatives are removed, which results in the depicted object detections (Figure 19 (f)). It can be observed that throughout the whole process the cluster representatives are mainly different types of chairs or cats. Thus, this example shows that the object detections chosen as cluster representatives represent the spatial and semantic distribution of all detection proposals. Summarising, MAPC clusters semantically and spatially similar detection proposals and represents them by the object detections which maximise the similarity between all detection proposals and their representatives. The sketch below mirrors this cluster-until-stable behaviour.
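
As a runnable illustration, the following uses scikit-learn's stock affinity propagation on a precomputed similarity matrix. It mirrors only the generic cluster-until-stable loop described above, with toy similarities standing in for our spatial-semantic measure and without the MAPC constraint, repellence and background terms:

import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy stand-in for the N x N spatial-semantic similarity matrix.
rng = np.random.default_rng(0)
X = rng.random((20, 4))                               # 20 fake "detections"
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # negative squared distance
np.fill_diagonal(S, np.median(S))                     # self-similarities (preferences)

# convergence_iter: stop once the exemplar set has been stable for 15
# iterations, matching the "arrangement does not change anymore" criterion.
ap = AffinityPropagation(affinity='precomputed', damping=0.5,
                         convergence_iter=15, max_iter=500).fit(S)
exemplars = ap.cluster_centers_indices_   # detections kept as representatives
members = ap.labels_                      # cluster assignment of every proposal
print(exemplars, members)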

Figure 9: Examples where MAPC outperforms WC+AC-NMS and SAPC + AC-NMS on Microsoft COCO (columns: Ground Truth, WC+AC-NMS, SAPC + AC-NMS, MAPC). Ground truth: blue, true positives: green, false positives: red.
Figure 10: Examples where MAPC localises objects better than WC+AC-NMS and SAPC + AC-NMS on Microsoft COCO (columns as in Figure 9). Ground truth: blue, true positives: green, false positives: red.
Figure 11: Examples where MAPC classifies objects better than WC+AC-NMS and SAPC + AC-NMS on Microsoft COCO (columns as in Figure 9). Ground truth: blue, true positives: green, false positives: red.
Figure 12: Examples where MAPC outperforms WC+AC-NMS and SAPC + AC-NMS on 1,825 ImageNet categories (columns as in Figure 9). Ground truth: blue, true positives: green, false positives: red.
Figure 13: Examples where MAPC localises objects better than WC+AC-NMS and SAPC + AC-NMS on 1,825 ImageNet categories (columns as in Figure 9). Ground truth: blue, true positives: green, false positives: red.
Figure 14: Examples where MAPC classifies objects better than WC+AC-NMS and SAPC + AC-NMS on 1,825 ImageNet categories (columns as in Figure 9). Ground truth: blue, true positives: green, false positives: red.
Figure 15: Examples where MAPC outperforms WC+AC-NMS and SAPC + AC-NMS on 7,404 ImageNet categories (columns: WC+AC-NMS, SAPC + AC-NMS, MAPC). No ground truth is available. Detections: blue.
Figure 16: Examples where MAPC localises objects better than WC+AC-NMS and SAPC + AC-NMS on 7,404 ImageNet categories (columns as in Figure 15). No ground truth is available. Detections: blue.
Figure 17: Examples where MAPC classifies objects better than WC+AC-NMS and SAPC + AC-NMS on 7,404 ImageNet categories (columns as in Figure 15). No ground truth is available. Detections: blue.
Figure 18: Examples where MAPC performs worse than WC+AC-NMS or SAPC + AC-NMS on Microsoft COCO (columns as in Figure 9). Ground truth: blue, true positives: green, false positives: red.
Figure 19: MAPC clustering process step-by-step. Cluster representatives: green bounding boxes. Cluster members: red bounding boxes. All bounding boxes which belong to the background cluster are not depicted. The number of iterations is illustrated in the lower right corner of each picture.