Detect2Rank : Combining Object Detectors Using Learning to Rank

12/26/2014 ∙ by Sezer Karaoglu, et al. ∙ University of Amsterdam

Object detection is an important research area in the field of computer vision. Many detection algorithms have been proposed. However, each object detector relies on specific assumptions of the object appearance and imaging conditions. As a consequence, no algorithm can be considered as universal. With the large variety of object detectors, the subsequent question is how to select and combine them. In this paper, we propose a framework to learn how to combine object detectors. The proposed method uses (single) detectors like DPM, CN and EES, and exploits their correlation by high level contextual features to yield a combined detection list. Experiments on the PASCAL VOC07 and VOC10 datasets show that the proposed method significantly outperforms single object detectors, DPM (8.4 and EES (17.0




I Introduction

Object detection is an active research area in the field of computer vision. Many detection algorithms have been proposed [3, 5, 8, 9, 11, 12, 32, 41]. Although these detection algorithms are successful for many detection tasks, they may be less accurate for some specific cases.

To gain more insight into the differences among detectors, Hoiem et al. [1] provide an extensive analysis of object detectors and their properties. Their findings are that detectors perform well for standard object appearances and for common imaging conditions. Different design properties of the detectors (e.g. search strategy, features, and model representation) influence the robustness of the methods to varying imaging conditions (e.g. occlusion, clutter, unusual views, and object size). For instance, detectors based on the sliding-window approach [5] using pre-defined window sizes and aspect ratios are good at finding likely object positions (rough object positions). However, they are less suited to detect deformable objects precisely. Hoiem et al. [1] show that this type of detector typically suffers from poor localization. On the other hand, a flexible sliding window allows detecting deformable objects, but the large number of candidate regions to be considered for detection limits the use of strong classifiers. Therefore, selective search [8] is integrated as a pre-processing step in current state-of-the-art techniques [41] to generate a reduced set of candidate regions and thereby lower the computational complexity of sliding-window based approaches. However, Hosang et al. [42] show that selective search generates candidate regions which are sensitive to changes in scale, illumination and geometrical transformations. This is because selective search is based on segmentation derived from superpixels, which are unstable under small image deformations.

Besides the method to generate proper candidate regions for detection, the choice of features influences the robustness and discriminative power of the detectors. HOG-based templates are able to preserve shape information [5, 12] of the objects but are less suited to differentiate between visually similar categories such as cats and dogs. This limitation is addressed using color information in [11], following successful results of using color information in object recognition [31]. HOG-based object detection using color [11] is suited for object classes in which the intra class color variation is low (e.g. potted plant and tv-monitor). However, the use of color negatively affects the detection accuracy for object classes in which the intra class color variation is large (e.g. bottles and buses).

Finally, the chosen model and classifier drastically influence the performance of the detectors. In general, object detectors represent all positive samples of a given category as a whole [5, 11]. However, Malisiewicz and Efros [2] show that standard categories (e.g. train, car and bus) do not form coherent visual categories. Accordingly, these methods are too generic. To address this issue, Malisiewicz et al. [12] propose to train a separate linear SVM classifier for each positive sample in the training set. Gu et al. [4] show that using only one positive sample for training significantly reduces the generalization capacity. Hence, the detection performance of [12] deteriorates for uncommon object views.

Fig. 1: Flow of the proposed method (best viewed in color). Initial detections from different detectors, namely Det1 (green), Det2 (red) and Det3 (blue), are combined by a learning to rank algorithm. False detections of the individual detectors are learned through detector-detector relations and obtain less confidence when combined, whereas detections on which Det1, Det2 and Det3 agree are rewarded by the re-ranking system.

As a consequence, no detection algorithm can be considered as universal. With the large variety of available methods, the question is how to combine these object detectors, preserving their strengths while reducing their limitations and assumptions. In this paper, we consider a rank learning approach to combine object detection methods. The proposed framework combines detections (detector outputs which consist of a classifier score and bounding box locations) of different state-of-the-art object detectors including DPM [5], CN [11] and EES [12]. Furthermore, the method extracts high-level context features such as detector-detector consistency, detector-class preference, object-saliency of a detection, and object-object relations. These features are used in a learning to rank framework to yield a combined detection list. The flow of the proposed method is summarized in Fig. 1.

The proposed approach offers the following advantages over single object detectors:

  • Missed detections (false negatives) of single detectors are compensated by combining detections of different detectors.

  • Detections are re-ranked by using information gathered by other detectors. True detections (true positives) of each detector are rewarded and false detections (false positives) of each detector are penalized within the learning to rank framework.

  • The combined list maintains the strengths of the detectors. Therefore, it is more robust than each single detector for varying imaging conditions.

To the best of our knowledge, we are the first to propose using re-ranking approaches to combine object detectors. Experiments on the PASCAL VOC07 and VOC10 datasets show that the proposed method significantly outperforms single detectors.

II Related Work

II-A Object Detection

In general, papers on object detection aim at designing a single detector, descriptor or classifier [3, 5, 9, 41, 32, 33, 39]. Felzenszwalb et al. [5] propose a part-based object detection method using HOG features and a latent SVM. This algorithm outperforms the state-of-the-art methods for standard object appearances. The use of template-based models limits the capability to detect deformable objects [1]. Moreover, template-based models (using HOG features) are designed to capture shape information and are less suited to differentiate visually similar categories (e.g. cats and dogs). In contrast to part-based detection methods, Vedaldi et al. [9] propose to use a bag-of-words model for object detection. Multiple features are used within a multiple kernel learning framework which is able to distinguish between visually similar object categories. However, Hoiem et al. [1] show that this approach is sensitive to object size due to the bag-of-words model. Khan et al. [11] propose to use additional color information for object detection. The color information contains expressive power for object classes in which the intra class color variations are low (e.g. potted plants and sheep). However, color may have a negative influence on the detection of classes in which the intra class color variations are high (e.g. bottles and buses) [11].

Malisiewicz et al. [12] propose to learn a linear classifier per exemplar in the training set. The algorithm benefits from a large collection of simpler exemplar classifiers. In this way, the method is tuned to the appearance of the exemplar. While the detections of this detector cover the objects in the dataset (high recall), the detector usually provides low average precision. This is due to the large number of false detections introduced by each of the exemplar specific classifiers. Currently, remarkable results for object detection are obtained by convolutional neural networks [3, 41]. Girshick et al. [41] apply the CNN of [38] to a set of candidate windows obtained by selective search [8].

II-B Contextual Information for Object Detection

Contextual information for object detection has been exploited over the past few years. Contextual information includes the relation between objects [15, 36], scene layout [20] or characteristics [18, 19], surrounding pixels [15, 22, 37] and background segments [21]. [19] shows that real-world scene structures can be modeled by inference rules. Therefore, in addition to the appearance of objects, contextual information provides useful information for object detection [23, 24]. For example, Choi et al. [18] model the object spatial relationships and co-occurrences by employing a tree-structured graphical model. Desai et al. [20] model the spatial arrangements between objects to detect objects in a structured prediction framework. Cinbis and Sclaroff [25] formulate the object and scene context in terms of relative spatial locations and relative scores between pairs of detections as sets of unordered items. Felzenszwalb et al. [5] re-score their DPM detections by exploiting contextual information as a post-processing step. Their re-scoring scheme relies on object co-occurrences as well as the location and size of the objects. The above methods show that contextual information is important for object detection. However, these methods also have certain limitations. For example, they rely on object-object co-occurrences and spatial relationships and hence are suited for images consisting of (many) different objects. Further, the context-based methods aim at re-scoring detections. They do not introduce new detections and hence are not able to recover from missed detections of single detectors.

II-C Score Aggregation

Aggregating the responses of classifiers and learning a second-level SVM to re-score them has been exploited in the literature for different tasks such as action recognition [28], image retrieval [29] and object recognition [30, 34]. The organizers of Pascal VOC12 use seven methods submitted to the classification challenge. The scores of each submission are concatenated to form a single vector to train another linear classifier. A substantial increase in average precision is reported for classes such as potted plants and bottles. However, aggregating the scores of different object detectors is not as straightforward as in the problems mentioned above. More precisely, for these problems each instance in the dataset has a response from each classifier. By contrast, object detectors do not generate candidate regions at (exactly) the same locations. Therefore, a candidate region does not necessarily have a response from the other detectors. Recently, Xu et al. [6] propose to combine different pedestrian detectors using score calibration and detection clustering steps. The authors reduce false and missed detections of pedestrian detectors per image. However, they do not aim at performing a global ranking of detections over the whole dataset for different object classes.

Our contributions are the following:

  • Detector combination: We propose to combine the state-of-the-art object detectors rather than proposing a new one.

  • Detector consistency: We show that the state-of-the-art detectors have many detections in common. These common detections are proven to be very informative to re-rank detection scores.

  • Detector complement: We show that existing state-of-the-art object detectors also have complementary detections. These complementary detections reduce missed detections of single detectors in a combined list.

  • Detector contextual integration: We propose high-level context features (e.g. detector-detector relations and object-saliency cues) to combine detections in a learning to rank framework.

III Object Detectors

In this section, the detectors used in this paper are outlined. We focus on publicly available detectors. Note that there are no constraints on the type of detector since the proposed method only requires detections (bounding box locations with classifier scores) of a detector.

III-A DPM

Felzenszwalb et al. [5] propose an object detector in which each object category consists of a global template and deformable parts. The global template and deformable parts are represented by HOG features extracted at different scales. Training the object models is done in a latent support vector machine (SVM) framework. Each detection $x$ in the training set is labeled as being either $y = 1$ or $y = -1$. Each detection is scored as

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z). \tag{1}$$

The set $Z(x)$ defines all possible latent values for detection $x$. $\beta$ and $\Phi(x, z)$ are a vector of model parameters and a feature vector, respectively. $\beta$ is trained by minimizing the following objective function:

$$L_D(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)), \tag{2}$$

where $\max(0, 1 - y_i f_\beta(x_i))$ is the hinge loss and the constant $C$ is the regularization parameter.
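As a rough illustration of latent SVM scoring as used in DPM [5], the sketch below scores a detection by maximizing a linear score over a small, explicitly enumerated set of latent placements and then evaluates the regularized hinge-loss objective. The function names, the enumerated feature lists and the value of C are illustrative assumptions; this is not the DPM implementation, which searches latent part placements efficiently rather than by enumeration.

```python
def latent_score(beta, candidate_features):
    """Score one detection: the best linear score over its latent placements.

    candidate_features holds one feature vector Phi(x, z) per latent value z
    (an explicit enumeration, assumed here for illustration only).
    """
    return max(sum(b * p for b, p in zip(beta, phi)) for phi in candidate_features)


def latent_svm_objective(beta, samples, labels, C=1.0):
    """0.5 * ||beta||^2 + C * sum of hinge losses over labeled detections."""
    regularizer = 0.5 * sum(b * b for b in beta)
    hinge_total = sum(
        max(0.0, 1.0 - y * latent_score(beta, feats))
        for feats, y in zip(samples, labels)
    )
    return regularizer + C * hinge_total


# Two toy detections: a positive with two latent placements, a negative with one.
beta = [1.0, -1.0]
samples = [[[2.0, 0.0], [0.0, 1.0]], [[0.0, 3.0]]]
labels = [1, -1]
objective = latent_svm_objective(beta, samples, labels)
```

Both toy detections are classified with a margin larger than one, so only the regularizer contributes to the objective in this example.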

III-B CN

Khan et al. [11] propose an object detector which uses color attributes as a complementary feature to DPM based HOG features. Color attributes are combined with HOG features in a late fusion manner. The proposed color attributes are compact and efficient. They are shown to be effective for object classes in which intra class color variations are low, such as potted plants and sheep. Apart from the HOG features being extended with color attributes, training is done exactly as in DPM.

III-C EES

Malisiewicz et al. [12] propose an object detector which is trained by a parametric SVM for each positive exemplar in the training set. Consequently, a large collection of simpler exemplar specific detectors, which are highly tuned to the appearance of the exemplars, is obtained. Each exemplar $x_E$ is represented using a rigid HOG template [14] to train a linear SVM. Then, each Exemplar-SVM, $(w_E, b_E)$, is used as a learned instance-specific HOG weight vector to score detections. $w_E$ is learned by optimizing the following convex objective function:

$$\Omega_E(w, b) = \|w\|^2 + C_1 h(w^\top x_E + b) + C_2 \sum_{x \in N_E} h(-w^\top x - b), \tag{3}$$

where $h(x) = \max(0, 1 - x)$ is the hinge loss, $N_E$ is the set of negative samples for exemplar $E$, and $C_1$ and $C_2$ are regularization parameters. Training a detector per exemplar allows tuning each detector to variations in the exemplar's appearance (viewpoint and object geometry). As a result, high recall is obtained for object detection.
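To make the per-exemplar training objective concrete, the following sketch evaluates an Exemplar-SVM style objective for one positive exemplar and a handful of negatives. The function names, the toy feature vectors and the values of `c1` and `c2` are assumptions chosen for illustration; the sketch only evaluates the objective and does not perform the optimization or the hard-negative mining used in [12].

```python
def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def hinge(value):
    """Hinge loss h(v) = max(0, 1 - v)."""
    return max(0.0, 1.0 - value)


def exemplar_svm_objective(w, b, exemplar, negatives, c1=1.0, c2=1.0):
    """||w||^2 + c1 * h(w.x_E + b) + c2 * sum over negatives of h(-w.x - b)."""
    regularizer = dot(w, w)
    positive_term = c1 * hinge(dot(w, exemplar) + b)
    negative_term = c2 * sum(hinge(-dot(w, x) - b) for x in negatives)
    return regularizer + positive_term + negative_term


# One exemplar, two negatives; the second negative lies on the decision boundary.
value = exemplar_svm_objective(
    w=[1.0, 0.0], b=0.0, exemplar=[2.0, 0.0], negatives=[[-2.0, 0.0], [0.0, 5.0]]
)
```

Weighting the single positive (`c1`) separately from the many negatives (`c2`) is what lets one exemplar stand against a large negative set.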

IV Combining Detectors by Learning to Rank

To combine detections from different detectors, learning to rank (L2R) is used. L2R aims at ranking a group of items according to their relevance to a given task. Fig. 2 illustrates a common L2R flow. In our framework, the training set consists of detections $x_i$, $i = 1, \dots, n$ ($n$ is the number of items in the training set) and the ground truth labels ($y_i$). The feature vectors $x_i$ and the labels $y_i$ are used as training data to learn a ranking model ($f$). To re-score detections, $f$ is described as follows:

$$f(x) = w^\top x. \tag{4}$$

Using different loss functions $\ell$ (see Section VI), the weight vector ($w$) is optimized by minimizing the following objective function with regularization parameter $C$:

$$\min_{w} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \ell(y_i, f(x_i)). \tag{5}$$

To learn a ranking model that performs re-ranking, the proposed method starts with a feature extraction step using detections from different detectors.
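The re-scoring step can be sketched as a linear model applied to each detection's feature vector, followed by sorting. The weight values and the two-dimensional features below are made up for illustration, and the assumption that the ranking model is linear follows from the use of a single weight vector in this section.

```python
def rescore(weights, features):
    """Apply a linear ranking model to every detection feature vector."""
    return [sum(w * x for w, x in zip(weights, feat)) for feat in features]


def rank_detections(weights, features):
    """Return detection indices ordered from highest to lowest new score."""
    scores = rescore(weights, features)
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


# Three detections described by two hypothetical context features each.
weights = [1.0, 0.5]
features = [[0.2, 0.0], [0.9, 0.4], [0.1, 1.0]]
order = rank_detections(weights, features)
```

Here the second detection is ranked first because both of its features are high, and the third overtakes the first thanks to its second feature.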

Fig. 2: Learning to rank framework for detection re-ranking.
Fig. 3: Relative scores for each detection in the VOC 2007 trainval set. Each sphere represents a detection in the trainval set, whereas each axis represents the relative score from one of the detectors, namely DPM, CN and EES. Blue, green and red denote true detections, poorly localized detections and false detections, respectively. Best viewed in color.

IV-A Context Features

The proposed method starts with high-level context feature extraction to learn how to rank the detections of different detectors in a single detection list. We aim at extracting generic features exploiting the correlation and consistency between detectors.

IV-A1 Detector-Detector Context

We define detector consistency as the case in which different object detectors generate detections for the same image region. Agreement of all object detectors on a certain location increases the probability of a correct object detection. However, different detectors may generate detections at different locations even for the same image. As a result, it is hard to obtain an exact bounding box location where all detectors provide a detection. Therefore, a detector relative score is defined. To obtain a relative score for each detection, a correspondence term is computed by considering the overlap ratios with all other detections. In this way, an image is represented as a collection of detections obtained by different object detectors, $D = \{d_i^j\}$, where $j \in \{1, \dots, m\}$ and $m$ is the number of detectors used. For the $i$-th detection $d_i$ in the image, the maximum overlapping detection of each detector is considered as follows:

$$k_j = \arg\max_{k} \; o(d_i, d_k^j), \tag{6}$$

where $o(\cdot, \cdot)$ is the overlap ratio and $k_j$ is the index of the maximum overlapping detection for detector type $j$. Then, the corresponding relative score of detector $j$ to the detection $d_i$ is derived from $s_{k_j}^j$, the initial classification score of the detector for that maximum overlapping detection. Note that if a detection has no overlap with the detections of another detector ($o = 0$), the corresponding relative score is zero. In this way, higher relative scores correspond to more reliable detections because more detectors agree on a particular location (see Fig. 3). A detection with a high relative score from every single detector has a high probability of being a true detection, whereas a low relative score indicates a false detection. Moreover, a mid-level consistency in relative scores can be considered a good indication of poor localization.
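The correspondence search can be sketched as follows: for one detection, find the maximum overlapping detection of each detector and turn it into a relative score. Two points here are assumptions rather than details confirmed by the text: the overlap ratio is taken to be intersection-over-union on continuous box coordinates, and the relative score is taken to be the overlap multiplied by the other detector's score, which is one simple choice that gives zero when there is no overlap.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def relative_scores(box, detections_by_detector):
    """Map each detector name to the relative score it gives to `box`.

    detections_by_detector: {name: [(box, score), ...]} for one image.
    Assumed definition: overlap with the best-overlapping detection times
    that detection's score; 0.0 when nothing overlaps.
    """
    result = {}
    for name, detections in detections_by_detector.items():
        best_overlap, best_score = 0.0, 0.0
        for other_box, score in detections:
            current = overlap(box, other_box)
            if current > best_overlap:
                best_overlap, best_score = current, score
        result[name] = best_overlap * best_score
    return result


image_detections = {
    "DPM": [((0, 0, 10, 10), 0.8), ((5, 5, 15, 15), 0.9)],
    "EES": [((20, 20, 30, 30), 0.7)],
}
scores = relative_scores((0, 0, 10, 10), image_detections)
```

In this toy image the DPM detection coincides with the query box, so DPM contributes its full score, while EES has no overlapping detection and contributes zero.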

The relative score of a detection does not include the information of which detector it belongs to. However, some detectors perform better than others for some classes, hence their detections should get higher scores than detections of other detectors (to benefit from the strengths of the detectors on the tasks at which they are successful). Therefore, a detector indicator term is specified. The aim is to provide information to the learning system to create detector preferences over classes. To indicate which detector a detection belongs to, a binary vector of three dimensions (i.e. three detectors in our case) is used. A dimension is set to one if the detection was produced by the corresponding detector; otherwise it is set to zero. This feature vector is at the detector level. Therefore, all detections of the same detector have the same binary coding.
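The detector indicator is a one-hot encoding of the source detector. A minimal sketch, in which the ordering of the detectors in the tuple is an arbitrary assumption:

```python
DETECTORS = ("DPM", "CN", "EES")  # order chosen for illustration


def detector_indicator(detector_name, detectors=DETECTORS):
    """One-hot vector marking which detector produced a detection."""
    if detector_name not in detectors:
        raise ValueError(f"unknown detector: {detector_name}")
    return [1 if detector_name == name else 0 for name in detectors]


cn_code = detector_indicator("CN")
```

Because the code depends only on the detector, a class-specific ranking model can use it to learn a preference for one detector over another.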

The final corresponding score feature for a detection combines its relative scores with the detector indicator. Its dimension is determined by the number of detectors.

Fig. 4: Object likelihood scores for each detection in the VOC 2007 trainval set. Each sphere represents a detection (randomly sub-sampled over all classes) in the trainval set, whereas each axis represents the object likelihood score from one of the object indicators, namely OBJ, CORE and EES. Blue, green and red denote true detections, poorly localized detections and false detections, respectively. Best viewed in color.

IV-A2 Object-Saliency

A feature vector is proposed to represent how likely a detection is to contain an object. EES [12], OBJ [16] and CORE [17] are used to measure the object-saliency of a detection. OBJ and CORE are category-independent region proposal methods. They are mostly used by current object detection algorithms to avoid exhaustive sliding-window search. These methods provide region candidates/proposals (bounding boxes) which most likely contain objects. Both methods result in approximately 1000 region candidates per image. In addition to these category-independent region proposal methods, EES [12] is also used to provide region candidates. The overlap ratios between these different region proposals and object detections are calculated according to eq. 6. Then, the feature vector for a detection is built from the sorted list of these overlaps, using a fixed number of nearest neighbors for each type of region proposal (OBJ, CORE and EES).

Additionally, the confidence scores of the maximum overlapping EES [12] neighbors of a detection are used, since these region proposals are class specific. A detection with a high object-saliency value is considered to be a good indicator for a correct detection. These features may be useful to assign lower confidence scores to false detections. Fig. 4 illustrates that true and false detections are highly correlated with the object likelihood scores of the detection.
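A possible reading of the object-saliency feature is sketched below: for each region proposal method, the overlaps between a detection and that method's proposals are sorted and the largest ones are kept. The intersection-over-union overlap, the zero padding for methods with too few proposals and the default of one neighbor are assumptions made for the sketch, not values taken from the paper.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def top_overlaps(box, proposals, neighbors=1):
    """The `neighbors` largest overlaps with the proposals, zero-padded."""
    overlaps = sorted((overlap(box, p) for p in proposals), reverse=True)
    overlaps += [0.0] * (neighbors - len(overlaps))
    return overlaps[:neighbors]


def saliency_feature(box, proposals_by_method, neighbors=1):
    """Concatenate the top overlaps for OBJ, CORE and EES proposals."""
    feature = []
    for method in ("OBJ", "CORE", "EES"):
        feature.extend(top_overlaps(box, proposals_by_method.get(method, []), neighbors))
    return feature


proposals = {
    "OBJ": [(0, 0, 5, 10), (0, 0, 10, 10)],
    "CORE": [(0, 0, 10, 5)],
    "EES": [],
}
feature = saliency_feature((0, 0, 10, 10), proposals)
```

A detection that no proposal method covers well ends up with a feature close to zero, which the ranking model can learn to penalize.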

IV-A3 Object-Object Relation

The likelihood that an object is present is inferred using the likelihoods of other object classes. Let $d_c^j$ be the detection with maximum confidence for object class $c$ ($c \in \{1, \dots, C\}$) by detector $j$ ($j \in \{1, 2, 3\}$) in an image, where $C$ denotes the number of object classes. The object-object context is then built from the confidence scores of these detections.

This feature exploits the object-object relations. For instance, when three detectors locate a cow with high confidence, it is less likely to have a sofa or tv in the same image.
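One way to picture the object-object context is as a vector with one entry per object class that records how confidently that class was detected anywhere in the image. The sketch below takes the maximum confidence over all detectors for each class and uses 0.0 for classes with no detection; both choices, and the assumption that scores are calibrated to be non-negative, are illustrative and not confirmed by the text.

```python
def object_object_context(detections, classes):
    """One entry per class: the highest confidence any detector gave it.

    detections: iterable of (detector_name, class_name, score) for one image.
    Classes without detections get 0.0 (assumes non-negative scores).
    """
    best = {name: 0.0 for name in classes}
    for _detector, class_name, score in detections:
        if score > best[class_name]:
            best[class_name] = score
    return [best[name] for name in classes]


classes = ["cow", "sofa", "tv"]
image_detections = [
    ("DPM", "cow", 0.9),
    ("CN", "cow", 0.8),
    ("EES", "cow", 0.7),
    ("EES", "sofa", 0.2),
]
context = object_object_context(image_detections, classes)
```

With all three detectors confident about a cow, the low sofa entry and the empty tv entry give the ranking model evidence against those classes in this image.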

The compactness of the proposed contextual features used in this paper is shown in Table I.

Feature Dimension
Detector Relative Score 10
Object Likelihood Measure 4
Object-Object Context 20
Total 34

TABLE I: Contextual features used in the proposed learning to rank framework.

IV-B Learning

aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP
DPM [5] 26.7 56.9 2.6 12.8 21.9 46.0 55.3 13.7 19.0 19.4 12.6 2.2 58.1 47.3 40.9 6.8 15.0 26.9 43.4 38.8 28.3
CN [11] 28.7 55.9 6.3 11.6 18.2 44.3 55.5 17.7 18.3 20.5 14.9 4.9 57.3 48.9 41.5 15.0 21.8 28.1 44.1 45.7 30.0
EES [12] 17.9 47.2 2.8 10.6 9.1 39.3 40.3 1.6 6.2 15.3 7.0 1.7 44.0 38.1 13.2 4.6 20.0 11.6 35.9 27.6 19.7
M-DPM 39.3 66.5 29.2 25.5 36.2 58.2 73.4 36.3 53.8 33.6 19.9 22.5 74.7 65.5 62.5 35.0 28.5 37.2 66.0 51.6 45.8
M-CN 43.5 61.7 26.1 20.5 34.5 56.8 72.2 39.4 46.3 33.2 22.8 22.5 73.3 63.1 65.0 38.3 38.8 43.9 62.1 60.1 46.2
M-EES 47.7 72.4 38.3 37.3 46.1 64.3 64.1 45.0 44.4 50.8 44.7 43.1 69.5 63.4 54.9 35.6 47.9 50.2 62.8 73.7 52.8
M-(DPM + EES) 60.7 80.4 48.6 46.0 54.6 73.7 80.2 59.5 67.6 58.2 51.9 51.3 82.2 73.8 73.1 49.0 51.7 60.3 76.2 76.0 63.7
M-(DPM + CN) 48.8 68.0 36.8 27.4 40.7 62.9 77.4 49.4 61.5 39.8 31.6 33.7 78.7 70.8 71.2 48.5 42.1 49.4 70.6 61.7 53.5
M-(EES + CN) 59.3 79.5 46.6 43.7 54.6 72.8 78.9 62.0 63.2 55.7 51.5 52.1 82.5 71.7 74.7 50.0 55.8 66.1 73.4 76.3 63.5
M-All 62.5 81.3 52.3 47.5 56.7 76.1 82.3 65.9 71.2 59.0 55.3 56.9 84.2 75.7 77.5 56.0 57.0 67.4 78.4 76.3 67.0
TABLE II: mAP values for baseline detectors DPM, CN and EES. Class specific and overall maximal mAP values of baseline detectors M-DPM, M-CN and M-EES, and their combinations M-(DPM+CN), M-(CN+EES), M-(DPM+EES) and M-(All) on PASCAL VOC07.

L2R methods are used to learn the ranking models. L2R methods used in this paper can be categorized in two groups [13]. The first type of algorithms is called pointwise techniques. Pointwise approaches represent the problem of ranking as a regression or classification problem. These techniques are straightforward approaches to learn the ranking model. Pointwise algorithms are preferred because of their efficiency and effectiveness. These methods have been optimized to work on large scale data.

The second type of L2R algorithms are pairwise techniques. These methods consider the problem of ranking as a pairwise classification problem. The aim is to learn a binary classifier that determines which instance of a given pair is more relevant. The goal of these algorithms is to minimize the average number of misordered pairs in the ranking rather than the misclassification rate minimized by the ordinary pointwise approach.
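The difference between the two families can be illustrated with the standard pairwise reduction: every pair of items with different relevance becomes a binary classification example on the difference of their feature vectors, and ranking quality is measured by the number of misordered pairs. The function names and toy values below are assumptions for illustration, and the sketch does not reproduce any particular RankSVM implementation.

```python
def pairwise_examples(features, relevance):
    """Turn ranked items into binary examples on feature differences."""
    examples = []
    for i in range(len(features)):
        for j in range(len(features)):
            if relevance[i] > relevance[j]:
                diff = [a - b for a, b in zip(features[i], features[j])]
                examples.append((diff, 1))                 # i should rank above j
                examples.append(([-d for d in diff], -1))  # mirrored example
    return examples


def count_misordered(scores, relevance):
    """Number of pairs where the more relevant item is not scored higher."""
    return sum(
        1
        for i in range(len(scores))
        for j in range(len(scores))
        if relevance[i] > relevance[j] and scores[i] <= scores[j]
    )


features = [[1.0], [0.0], [0.5]]
relevance = [1, 0, 0]
examples = pairwise_examples(features, relevance)
errors = count_misordered([0.2, 0.9, 0.1], relevance)
```

Pairs of equally relevant items produce no examples, so only the relative order of true and false detections affects the pairwise loss.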

V Non-maximum Suppression

Duplicate removal for the same instance is a known problem for single detectors. By combining multiple detectors, the proposed method increases the number of duplicates. Therefore, we eliminate these multiple detections by non-maximum suppression (NMS). The common application of NMS considers all bounding boxes (over a certain overlap threshold) for suppression. We use only the correspondences (overlaps with detections of other detectors) obtained for each detection in eq. 7 for suppression. After applying the re-ranking system, the corresponding detections are sorted; the highest-scoring one is kept, while the other corresponding detections that are sufficiently covered by it are suppressed.
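The suppression step restricted to correspondences can be sketched as a greedy pass over the re-ranked detections. In this sketch the correspondences are given as a precomputed mapping from a detection index to the indices of its corresponding detections, and it is assumed that the coverage test has already been applied when building that mapping; both the data layout and the function name are illustrative.

```python
def suppress_correspondences(scores, correspondences):
    """Greedy suppression: keep the best detection, drop its correspondences.

    scores: re-ranked score per detection index.
    correspondences: {index: set of indices of corresponding detections}.
    Returns the indices that survive, highest score first.
    """
    kept, suppressed = [], set()
    for index in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True):
        if index in suppressed:
            continue
        kept.append(index)
        suppressed.update(correspondences.get(index, ()))
    return kept


scores = [0.9, 0.8, 0.3, 0.7]
correspondences = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
kept = suppress_correspondences(scores, correspondences)
```

Unlike standard NMS, two detections that overlap but are not registered as correspondences of each other both survive.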

VI Experiments

Experiments are conducted on the Pascal VOC07 and VOC10 datasets. The VOC07 dataset consists of 9963 images of 20 different object classes (24640 annotated objects) with 5011 training images and 4952 test images. The VOC10 train/val dataset contains 10103 images of 20 different categories (23374 annotated objects). To learn the detector-detector context, object detections for the 2007 val set are obtained by models trained on the 2007 train set, and detections for the train set are obtained by models trained on the 2007 val set. Detections for the test sets are obtained by models trained on the 2007 trainval set for both dataset evaluations. This process is summarized in Fig. 5.

Fig. 5: Each training set is used for learning detector models and context models. To avoid overfitting, the object detectors for context models are trained on the train set to generate detections on validation. Further, they are trained on validation to provide detections on train.
aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP
DPM [5] 26.7 56.9 2.6 12.8 21.9 46.0 55.3 13.7 19.0 19.4 12.6 2.2 58.1 47.3 40.9 6.8 15.0 26.9 43.4 38.8 28.3
CN [11] 28.7 55.9 6.3 11.6 18.2 44.3 55.5 17.7 18.3 20.5 14.9 4.9 57.3 48.9 41.5 15.0 21.8 28.1 44.1 45.7 30.0
EES [12] 17.9 47.2 2.8 10.6 9.1 39.3 40.3 1.6 6.2 15.3 7.0 1.7 44.0 38.1 13.2 4.6 20.0 11.6 35.9 27.6 19.7
NaiveI 31.0 61.6 6.1 13.7 22.7 48.9 58.4 19.6 20.5 22.3 19.3 3.9 63.2 52.1 44.3 14.5 22.7 31.5 47.8 47.4 32.6
NaiveII 30.8 57.7 6.1 14.2 20.2 47.6 55.2 13.3 16.7 22.4 20.0 4.4 61.4 50.2 33.4 11.8 23.4 28.0 46.4 41.6 30.2
NaiveIII 28.3 61.3 2.8 13.3 22.8 48.1 58.7 18.5 19.5 15.3 19.0 1.8 61.7 52.6 41.9 14.8 20.0 29.3 48.9 48.3 31.4
PoW1 36.8 62.7 10.0 18.1 24.3 51.6 59.5 21.2 22.5 25.4 22.4 7.8 64.2 57.3 44.9 18.7 26.7 34.1 54.1 47.8 35.5
PoW2 36.7 62.8 13.3 18.4 27.0 52.3 59.9 24.7 21.9 24.8 25.8 10.6 65.4 55.9 44.7 19.2 21.2 37.5 54.0 46.5 36.2
PoW3 35.6 63.1 9.7 17.0 25.0 51.2 60.0 21.3 22.5 25.1 21.5 8.1 65.0 56.4 43.8 18.2 27.0 33.9 53.5 48.2 35.3
PaW1 34.5 59.4 10.2 16.2 19.8 49.5 54.4 24.6 20.7 19.7 24.0 8.0 61.0 51.5 40.9 16.7 25.9 31.1 48.3 41.5 32.9
Imp 8.1 7.1 6.2 5.6 5.1 6.3 4.5 7.0 3.5 4.9 10.9 5.7 7.4 8.4 3.4 4.2 5.2 9.4 9.9 2.4 6.3
TABLE III: The results using learning to rank algorithms. Naive: Direct merging methods without learning. Imp: The improvement over maximum baseline detector by maximum learning algorithm.

VI-A Detector Bounds

In this experiment, we evaluate the maximal mAP that can be achieved by the detections of the baseline detectors and their combinations. The maximal mAP of a detector is calculated when all true detections are ranked at the top of the detection list. Table II shows that ideally re-ranking the DPM, CN and EES detections results in a substantial performance improvement: from 28.3 to 45.8, from 30.0 to 46.2 and from 19.7 to 52.8 mAP, respectively. This result shows the positive effect of re-ranking the detection scores of object detectors.
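The idea behind the maximal score can be sketched by computing average precision for a ranked list and then for the same detections with all true detections moved to the top. The sketch uses plain (non-interpolated) average precision, which is an assumption made for simplicity; the Pascal VOC07 evaluation uses an 11-point interpolated variant, so the numbers differ from those in Table II.

```python
def average_precision(ranked_is_true, num_ground_truth):
    """Non-interpolated AP for a ranked list of true/false detections."""
    hits, precision_sum = 0, 0.0
    for rank, is_true in enumerate(ranked_is_true, start=1):
        if is_true:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def maximal_average_precision(ranked_is_true, num_ground_truth):
    """AP after an ideal re-ranking that puts every true detection first."""
    ideal = sorted(ranked_is_true, reverse=True)
    return average_precision(ideal, num_ground_truth)


# Four detections, two of them correct, for a class with four annotated objects.
ranking = [False, True, False, True]
current = average_precision(ranking, num_ground_truth=4)
upper_bound = maximal_average_precision(ranking, num_ground_truth=4)
```

With this definition the upper bound equals the recall of the detector (two of four objects found), which is why adding detections from other detectors raises the bound.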

Table II shows that DPM and CN have similar maximal mAP, 45.8 and 46.2, respectively. However, their combination has a significantly higher maximal mAP (53.5) than either of them individually. This shows that although these two detectors are very similar in nature, they are somewhat complementary to each other. Furthermore, when the detectors are designed intrinsically differently (e.g. DPM and EES, or CN and EES), they are more complementary to each other. This can be derived from the higher maximal mAP obtained by combining DPM+EES and CN+EES in Table II, 63.7 and 63.5, respectively. Consequently, the proposed method would benefit from more detectors.

Another observation that can be derived from Table II is that, besides having detections that are complementary to each other, the detectors also have common detections. While these common detections are useful to learn consistency in their output, the complementary detections resolve missed detections of each individual detector.

Table II shows that the performance of detectors is limited by their correct detections. Therefore, detector combinations always show higher mAP values than each single detector. The proposed method highly benefits from this, whereas other context based re-ranking methods lead to a limited performance improvement (limited to correct detections of a single detector).

VI-B Direct Combining of Detections

In this experiment, several ways of combining detector outputs (without learning) are investigated. Since the detectors are trained independently, detector scores are not necessarily compatible. A calibration process [35] is applied before merging different detector outputs. Given a detection score $s$ and the learned sigmoid parameters ($A$, $B$), the calibrated detection score is calculated as

$$\hat{s} = \frac{1}{1 + \exp(A s + B)},$$

where $A$ and $B$ for each detector are learned from training detections. After the scores are calibrated, we evaluate three different approaches of combining:

  • NaiveI: after scores are calibrated, detections are merged into a single list.

  • NaiveII: after scores are calibrated, detections are sorted in descending score order for each single detector. Then, detections are combined by taking them one by one from the top of each sorted detector output.

  • NaiveIII: the detectors are combined based on their training set performance. The output of the best performing detector is added to the list first, followed by the others in order of their performance.
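The calibration and the first two merging strategies can be sketched as follows. The sigmoid form with two parameters per detector is the usual Platt-style calibration and is assumed here; the parameter values, the detection identifiers and the function names are made up for illustration.

```python
import math
from itertools import zip_longest


def calibrate(score, a, b):
    """Sigmoid calibration of a raw detector score (assumed Platt-style form)."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def merge_by_score(detection_lists):
    """Pool all calibrated detections and sort them by score."""
    pooled = [det for detections in detection_lists for det in detections]
    return sorted(pooled, key=lambda det: det[1], reverse=True)


def merge_round_robin(detection_lists):
    """Take detections one by one from the top of each sorted detector list."""
    ordered = [sorted(d, key=lambda det: det[1], reverse=True) for d in detection_lists]
    merged = []
    for group in zip_longest(*ordered):
        merged.extend(det for det in group if det is not None)
    return merged


# Detections are (identifier, calibrated score) pairs from two detectors.
first = [("a1", 0.9), ("a2", 0.8)]
second = [("b2", 0.1), ("b1", 0.6)]
pooled = [name for name, _ in merge_by_score([first, second])]
alternating = [name for name, _ in merge_round_robin([first, second])]
```

The round-robin merge places the weaker detector's best detection above the stronger detector's second detection regardless of their scores, which is how imprecise detections can reach the top of the list.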

After the detections are combined in a single list, NMS (see Section V) is applied. Table III shows that naively combining detector outputs outperforms the baseline scores. The improvements are due to the increase in recall of the combined detection list.

The minimum performance improvement is obtained by NaiveII. NaiveII gives equal importance to each single detector. This means that although EES detections are not precise, they become as important as those of DPM and CN. Therefore, more false positives are introduced at the top of the detection list, which negatively affects the detection performance. This result shows the importance of properly weighting the detections.

NaiveIII is expected to perform better than the other naive methods since it incorporates the training performances of the baseline detectors. However, the trainval performance of the baseline detectors explains the lower performance of NaiveIII. To obtain trainval performance, detector models are trained on train to test on val, and trained on val to test on train (see Fig. 5). Since the detectors are trained with fewer samples for the trainval detections, baseline trainval performances do not necessarily correspond to their test performances. Training with fewer examples also has an influence on our context models.

VI-C Learning to Rank Detectors

In this experiment, four different L2R algorithms are evaluated. The pointwise methods are a regularized support vector classifier (PoW1), a logistic regressor (PoW2) and a support vector regressor (PoW3). RankSVM [26] is commonly used as a pairwise L2R method. Therefore, RankSVM (PaW1) is used as the pairwise method in our experiments. The implementations of [27] for the pointwise approaches and the RankSVM implementation by Joachims [26] are used with default parameter settings.

Ground-truth overlap ratios are taken as training labels. The Pascal VOC overlap criterion (overlap above 0.5) is used to assign positive and negative labels for PoW1 and PoW2, while overlap ratios are directly used as training labels for PoW3 and PaW1.
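The two labeling schemes can be sketched in a few lines. The 0.5 threshold reflects the usual Pascal VOC rule that the overlap with a ground-truth box must exceed 50%; the function names and the `binary` flag are illustrative assumptions.

```python
def binary_label(ground_truth_overlap, threshold=0.5):
    """Positive (+1) if the overlap exceeds the threshold, else negative (-1)."""
    return 1 if ground_truth_overlap > threshold else -1


def training_labels(ground_truth_overlaps, binary):
    """Binary labels for classifiers, raw overlaps for regression or ranking."""
    if binary:
        return [binary_label(o) for o in ground_truth_overlaps]
    return list(ground_truth_overlaps)


overlaps = [0.7, 0.5, 0.1]
classifier_labels = training_labels(overlaps, binary=True)
ranking_labels = training_labels(overlaps, binary=False)
```

Keeping the raw overlaps lets a regressor or ranker prefer a well-localized detection over a barely acceptable one, a distinction the binary labels discard.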

Table III shows that the proposed learning to rank approach outperforms the baseline detectors for all classes: DPM (7.8%), CN (6.2%) and EES (16.5%). While learning based methods always perform better, the logistic regression based learning method performs slightly better than other L2R algorithms. The performance of is slightly lower than other L2R methods. This is due to unbalanced data. The number of negative samples is significantly larger than the number of positive samples. This has also influence on the final result of .

Considering the low dimensionality of the proposed feature vector, the feature space may not be linearly separable. Therefore, non-linear kernel options for the classifier could be tried. However, we avoid learning a non-linear classifier due to its learning time and parameter selection. Instead, we use the feature mapping method proposed by Vedaldi and Zisserman [7]. The feature vector is mapped to a higher dimensional feature space. The best performing linear classifier in Table III is then trained on this new feature space and obtains an mAP improvement. Increasing the dimensionality allows the support vectors to separate the feature space better. Increasing the feature vector dimension with additional context features may further improve the results.

The improvement over direct merging methods by the proposed learning scheme in Table III indicates that the performance gain is not only due to the recall increase but also the effectiveness of the contextual information and learning scheme.

VI-D Detection Error Analysis

Fig. 6: Average normalized AP (over classes) for the highest and lowest performing subsets within each object characteristic (occlusion, truncation, bounding box area, aspect ratio, viewpoint and part visibility).

To provide more insight into the performance obtained by combining the baseline detectors, we follow the procedure introduced by Hoiem et al. [1]. The first analysis concerns detector sensitivity. The sensitivity is calculated as the difference between the maximum and minimum normalized AP for each characteristic (occlusion, truncation, bounding box area, aspect ratio, viewpoint, part visibility). Each colored plot in Fig. 6 shows the mean (over all classes) normalized AP for the specified detector. The results show that the proposed method does not reduce the sensitivity. However, it improves both the highest and lowest performing subsets for nearly all object characteristics. This indicates that the proposed method improves robustness across nearly all object characteristics. The sensitivity is not reduced because of commonly missed detections (hard cases that are difficult to detect even for human observers). While some of these hard detections are covered by one of the baseline detectors, most of them remain undetected. That is why the minimum normalized AP for each characteristic increases, but not as much as the maximum normalized AP. Consequently, the difference between the maximum and minimum normalized AP increases.
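The sensitivity measure described above is a simple range computation. The sketch below uses made-up normalized AP values (chosen only for illustration, not taken from Fig. 6) to show how both the lowest and highest performing subsets can improve while the sensitivity still grows. The function name and data layout are assumptions.

```python
from typing import Dict


def sensitivity(normalized_ap: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Max minus min normalized AP over the subsets of each characteristic."""
    return {
        characteristic: max(subsets.values()) - min(subsets.values())
        for characteristic, subsets in normalized_ap.items()
    }


# Hypothetical values for one characteristic, before and after combination.
baseline = {"occlusion": {"none": 0.5, "heavy": 0.25}}
combined = {"occlusion": {"none": 0.75, "heavy": 0.375}}

sens_baseline = sensitivity(baseline)
sens_combined = sensitivity(combined)
```

Here both subsets improve, yet the gap between them widens from 0.25 to 0.375, which mirrors the pattern discussed in the text.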

Hoiem et al. [1] point out the problem of small objects. Since small objects are mostly missed by all detectors, the minimum normalized AP for the characteristic "size" is not improved even when the three baseline detectors are combined.

Fig. 7 shows how the percentage of each false positive (FP) type changes with an increasing total number of FPs. FPs are divided into four categories as follows:

  • Poor localization occurs when the label of the detection is correct but the detection is misaligned with the ground-truth object (insufficient overlap) or is a duplicate detection.

  • Confusion with similar classes occurs when a false detection overlaps with an instance of a similar class.

  • Confusion with dissimilar object categories occurs when a false detection overlaps with an instance of a dissimilar class.

  • Confusion with background occurs when a false detection has no overlap with an instance of a similar or dissimilar class.
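The four categories above can be expressed as a small decision procedure. The sketch below is an assumption-laden illustration: it takes the overlaps of one false positive with all ground-truth objects in the image, uses a minimum overlap of 0.1 to decide whether a detection "overlaps" an object (a threshold assumed here, in the spirit of the analysis of Hoiem et al. [1]), and relies on a caller-supplied table of similar classes. None of the names come from the paper.

```python
from typing import Dict, List, Set, Tuple


def classify_false_positive(det_class: str,
                            overlaps: List[Tuple[str, float]],
                            similar: Dict[str, Set[str]],
                            is_duplicate: bool = False,
                            min_overlap: float = 0.1) -> str:
    """Assign one false positive to an error category.

    overlaps holds (ground-truth class, overlap ratio) pairs for the image.
    The input is assumed to be a false positive already.
    """
    best: Dict[str, float] = {}
    for cls, ov in overlaps:
        best[cls] = max(best.get(cls, 0.0), ov)

    # Correct class, but misaligned or a repeated detection of a found object.
    if is_duplicate or best.get(det_class, 0.0) >= min_overlap:
        return "localization"
    # Overlap with an object of a similar class.
    if any(ov >= min_overlap for cls, ov in best.items()
           if cls in similar.get(det_class, set())):
        return "similar"
    # Overlap with an object of any other class.
    if any(ov >= min_overlap for cls, ov in best.items() if cls != det_class):
        return "dissimilar"
    return "background"


similar_classes = {"cat": {"dog"}, "dog": {"cat"}}
fp_types = [
    classify_false_positive("cat", [("cat", 0.3)], similar_classes),
    classify_false_positive("cat", [("dog", 0.6)], similar_classes),
    classify_false_positive("cat", [("car", 0.4)], similar_classes),
    classify_false_positive("cat", [("car", 0.05)], similar_classes),
    classify_false_positive("cat", [], similar_classes, is_duplicate=True),
]
```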

Most of the remaining errors originate from poor localization rather than from the other error types. This shows the effectiveness of the relative score features. For instance, consider an image region where all detectors generate a detection. All detections belonging to this region have high classifier scores because of the high relative score. Consequently, these detections are ranked at the top of the detection list. However, the proposed method creates preferences over detectors for classes. Now, assume that the detection of the preferred detector has a localization error in this region. The corresponding detections of the other detectors are suppressed by non-maximum suppression (NMS), although the suppressed ones may be true detections. That is why the top ranked false positives of the proposed method are mostly due to poor localization.

Fig. 7 illustrates that the confusion with background error is significantly reduced. This shows the effectiveness of the proposed object likelihood features. Such strong object-saliency cues help the proposed method to identify false detections.

Another observation from Fig. 7 is that the proposed features could not reduce the confusion caused by similar object categories. However, they are effective in limiting the confusion between dissimilar object categories.

Fig. 7: Fraction of false positives of each type for the animal, furniture and vehicle classes as the total number of false positives increases.

VI-E Feature Importance

aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP
33.7 62.4 9.5 15.2 22.9 50.4 59.6 18.8 22.0 22.8 20.3 6.2 62.7 53.7 44.6 16.5 24.4 34.4 50.9 46.7 33.9
+ 35.6 62.1 10.6 17.4 24.6 50.8 59.3 25.1 21.3 23.2 23.9 10.5 63.1 51.0 45.5 14.7 26.3 37.8 50.6 47.5 35.1
+ 35.4 63.7 10.6 18.2 26.5 51.7 60.3 18.7 22.7 24.1 21.5 6.6 63.8 57.3 43.7 18.5 24.3 34.5 53.0 45.3 35.0
All 36.8 64.2 12.3 20.3 27.3 53.0 60.3 27.0 22.0 25.3 27.1 11.1 63.7 56.6 45.4 19.3 24.0 38.0 54.5 46.8 36.8

TABLE IV: The influence of selected features for the final detection performance.

In this experiment, we study the influence of each individual feature. The weights are obtained by averaging the absolute classifier weights over the classes. The importance of the proposed detector-detector context features is highlighted in Fig. 8. Moreover, the feature weights also emphasize the importance of the proposed object-saliency features. As stated earlier, the proposed detector-detector context and object-saliency features are more generic and independent of the number of object categories. In contrast, the object-object relations exploited by other state-of-the-art context based object detection methods [18, 20, 25] depend on the image characteristics. Therefore, the accuracy gain of these methods is limited by the image characteristics.
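The averaging of absolute classifier weights can be sketched in a few lines. The example assumes one linear weight vector per object class, stored in a dictionary. The class names and weight values are made up for illustration and the function name is hypothetical.

```python
from typing import Dict, List


def feature_importance(weights_per_class: Dict[str, List[float]]) -> List[float]:
    """Average the absolute linear-classifier weights over all classes.

    Returns one importance value per feature dimension.
    """
    n_classes = len(weights_per_class)
    columns = zip(*weights_per_class.values())  # one tuple per feature dimension
    return [sum(abs(w) for w in column) / n_classes for column in columns]


# Two hypothetical classes, two feature dimensions.
toy_weights = {"cat": [1.0, -0.5], "dog": [-3.0, 0.5]}
importance = feature_importance(toy_weights)
```

Taking absolute values before averaging keeps a feature from looking unimportant merely because its weight has opposite signs for different classes.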

We now investigate the influence of each feature on the final mAP score. The detector scores are essential to rank the detection list. Therefore, it is not possible to evaluate and individually. We evaluate mAP using only feature. For the rest of the features, is also included. It is shown in Table IV that using only improves the baseline detectors significantly. The object likelihood measure also improves the accuracy (e.g. for animal classes such as cat, dog or sheep). The significant improvement for these classes is due to the poor representation capacity of template based detectors for non-rigid objects. Deformable part based object detectors are suited to detect rigid parts of objects; see the top ranked visual results for the category cat in Fig. 9. Due to the homogeneous appearance of cats, dogs and sheep, most object proposals contain the full object shape. Therefore, detections of the entire object receive higher confidence than detections of object parts. The object size plays a role for the other animals, such as horse and cow. The object proposal methods used in this paper tend to perform better at covering small objects. Moreover, object proposal methods are less likely to generate many large bounding boxes for a specific image region. Therefore, the average overlap of a detection with these windows becomes lower. Adding object-object context slightly improves most of the object classes. However, its contribution to the average precision increases when it is combined with the object-saliency features. Furthermore, it clearly improves the accuracy for the class "bottle", whose samples usually occur within a context (usually on a table or in the hand of a person).

Fig. 8: Classifier weights are averaged over different classes to see the importance of features individually.

aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP Imp.
DPM + Context 29.8 57.5 9.9 16.4 24.1 46.3 58.0 21.1 19.6 20.4 15.1 7.5 58.3 50.4 42.0 14.3 18.2 28.0 49.0 39.6 31.3 3.0
CN + Context 33.3 55.1 11.4 13.4 22.7 44.9 57.0 22.6 18.6 19.4 17.5 8.5 56.0 50.9 42.1 17.4 20.9 31.3 48.5 45.5 31.9 1.9
EES + Context 31.4 57.2 10.6 16.9 21.0 46.6 51.5 13.3 15.5 20.6 15.2 8.1 57.3 51.5 32.9 14.1 18.0 20.1 46.9 44.5 29.7 10.0

TABLE V: The results of the re-ranked single baseline detector outputs using contextual features. The results of the single detectors are improved using context.

aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP
BL1 28.6 55.1 0.6 14.5 26.5 39.7 50.1 16.5 16.5 16.8 24.6 5.0 45.2 38.3 35.8 9.0 17.4 22.7 34.0 38.3 26.8
[20] 1.7 0 0.1 1.4 0 -3.5 1.3 0.5 -2.8 1.2 -0.7 0.2 0.5 1.1 -2.8 -1.1 -2.3 -0.7 0.5 0 -0.3
[18] 2.4 -4.2 2.3 0.8 -1.1 -0.2 -0.4 3.9 1.6 0.9 2.3 6.9 5.6 2.2 0 4.7 3.8 2.8 4.7 -0.1 1.9
[25] 5.6 2.7 9.2 0.8 3.2 1.9 3.4 5.0 0 0.7 1.4 7.9 5.9 4.6 3.5 4.2 3.1 4.9 4.9 0.3 3.6
BL2 27.8 55.9 1.4 14.6 25.7 38.1 47.0 15.1 16.3 16.7 22.8 11.1 43.8 37.3 35.2 14.0 16.9 19.3 31.9 37.3 26.4
[10] 2.4 1.9 0.5 0.2 3.2 2.6 2.9 -0.9 0.9 1.9 0.2 5.3 1.3 3.3 3.6 3.0 3.2 3.7 2.9 -0.5 2.0
BL3 26.7 56.9 2.6 12.8 21.9 46.0 55.3 13.7 19.0 19.4 12.6 2.2 58.1 47.3 40.9 6.8 15.0 26.9 43.4 38.8 28.3
Proposed 10.1 7.3 9.7 7.5 5.4 7.0 5.0 13.3 3.0 5.9 14.5 8.9 5.6 9.3 4.5 12.5 8.9 11.0 11.1 8.1 8.4

TABLE VI: Comparison of state-of-the-art context based object detection methods on the PASCAL VOC07 dataset. The results of the referred works [20, 18, 25] and the baseline scores (BL1) are reported in [25], whereas the results of [10] and the baseline scores (BL2) are reported in [10]. BL3 is the baseline score obtained in this paper. The results listed as Proposed are the improvements over the baseline in this paper.

aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP
BL1 46.3 49.5 4.8 6.4 22.6 53.5 38.7 24.8 14.2 10.5 10.9 12.9 36.4 38.7 42.6 3.6 26.9 22.7 34.2 31.2 26.6
DPM-Context[5] 0.1 1.3 2.7 1.8 -0.6 1.8 2.9 -4.8 0.5 1.3 0.7 1.0 1.5 1.5 2.5 0.6 -2.8 4.9 6.6 2.7 1.2
[40] 6.5 -0.7 7.2 4.4 6.5 1.7 6.9 7.2 0.0 2.1 2.8 3.7 3.4 5.5 2.5 4.6 8.4 3.3 7.9 3.1 4.2
BL2 37.4 51.8 5.1 3.9 20.3 51.4 39.2 13.3 15.2 9.5 7.2 4.8 40.1 43.4 41.5 9.8 13.2 16.4 31.9 26.5 24.1
Proposed 7.3 2.9 8.5 6.6 2.5 7.1 5.1 15.0 2.8 3.5 5.8 7.1 3.6 7.7 3.5 8.0 8.4 3.4 10.6 11.8 6.6

TABLE VII: Comparison of state-of-the-art context based object detection methods on PASCAL VOC10. The results of the referred work [40] and the baseline scores (BL1) are reported in [40]. BL2 is the baseline score obtained in this paper. The results listed as Proposed are the improvements over the baseline in this paper.

VI-F Re-ranking Detections from Single Detector

aero bike bird boa bot bus car cat chr cow tab dog hor mbik pers plnt shp sofa tra tv mAP
DPM [5] 36.8 50.1 4.3 10.6 14.3 50.0 40.4 13.9 15.9 14.2 9.4 4.7 41.8 43.0 40.9 5.9 11.6 15.3 33.4 31.4 24.4
CN [11] 34.5 48.8 5.3 10.4 11.4 52.1 40.9 18.7 14.9 15.7 7.1 5.9 41.3 45.5 42.2 10.1 14.0 18.1 36.2 35.8 25.4
EES [12] 22.6 34.9 3.2 9.4 4.5 45.9 25.0 2.1 7.2 10.7 4.3 2.0 21.7 31.7 10.0 2.1 11.6 8.1 21.3 23.6 15.1
PoW2 44.8 53.3 14.3 14.6 14.2 56.3 44.7 27.2 18.9 19.6 14.5 15.0 44.1 50.0 45.4 13.2 17.6 22.5 42.0 39.1 30.6
DPM [5] 37.4 51.8 5.1 3.9 20.3 51.4 39.2 13.3 15.2 9.5 7.2 4.8 40.1 43.4 41.5 9.8 13.2 16.4 31.9 26.5 24.1
CN [11] 36.6 45.0 6.0 4.7 17.9 52.5 40.2 18.8 15.3 10.6 6.5 5.2 39.7 44.4 44.0 15.5 16.4 13.0 35.6 33.8 25.1
EES [12] 19.9 36.8 1.8 3.3 7.2 46.2 23.5 2.0 4.2 6.4 2.1 1.3 20.6 30.4 9.5 2.8 14.5 7.0 24.0 24.7 14.4
PoW2 44.7 54.7 13.6 10.5 22.8 58.5 44.3 28.3 18.0 12.9 13.0 11.9 43.7 51.0 45.0 17.8 21.6 19.8 42.5 38.2 30.7

TABLE VIII: The results for baselines (, and ) and proposed detector merging scheme using PoW2 on VOC10 (upper: set and lower: set).

In this experiment, we exploit the effectiveness of context features without combining detectors into a single list. The proposed context features are only used to re-rank individual detectors. It is shown in Table V that the proposed method is still effective and improves the baseline detectors. However, the accuracy gain is relatively smaller than using the combined detector outputs in Table III. These results underline the importance of combining different detector outputs to recover from missed detections to improve the overall object detection performance.

Note that, using the proposed context features, a detector with high recall and low precision such as EES can be as powerful as the more precise detectors (DPM, CN).

VI-G Comparison to Other Context Methods

In this experiment, we compare the proposed method against the state-of-the-art context based object detection re-ranking methods. Table VI shows the baseline scores of DPM and improvements reported by the papers [25, 10] on VOC07. The gain in performance by our method indicates the importance of high level contextual features and L2R based detector merging.

Moreover, the proposed method is compared to the recent work by Mottaghi et al. [40] on the VOC10 dataset (see Table VII). The authors also report the context re-ranking method of DPM (see [5] for details), discussed in Section II.

The contextual features proposed by other methods in Table VI and Table VII are from different sources. Hence, they can be complementary to the proposed features. Combination of these features may further improve the results.

VI-H Tests on VOC10

We also evaluate our method on the PASCAL VOC10 dataset. The VOC10 annotations of the test samples are not publicly available. Therefore, we use only the ”” dataset. All the training is done on the VOC07 set, including object detection models and detector-detector relation models. Table VIII shows the results. Table VIII indicates that the proposed method outperforms the baseline detectors for all classes also on the cross dataset evaluation. The results show that the learned detector-detector context is generic and it is not dataset dependent.

Fig. 9: Top ranked false positives of the proposed method for specified classes. Blue and red colors indicate the detector type, and , respectively. Yellow and green colors correspond to poor localization and multiple detections, respectively. The image frames without color information indicates no overlap between ground truth object, either due to miss classification or background clutter.
Fig. 10: Top five ranked false positives of baseline detectors and proposed method with their rankings below (object class car). The color of the detection yellow, green and red indicates the type of detector , and respectively. The false positives of individual detectors are pushed down in the proposed method.

VII Discussion

Diversity among detectors, and thus the potential for complementary detections, exists mainly for two reasons. The first is related to the features used to represent the model. Although the selected detectors in this paper all use HOG based features, they are differentiated by additional color features or by feature extraction steps performed at different scales. The second source of diversity is the classifier. These differences have a substantial influence on the final outputs, as shown in Table II and Table III. Although these detectors have detections in common, they are complementary to each other. While common detections are useful to learn consistency in their output, complementary detections resolve missed detections of each individual detector. A higher diversity in detectors will further improve the results.

With the help of the proposed method, future object detectors can focus on more specific solutions to harder detection problems. Their results can be combined with other detection methods to carry object detection algorithms a step further. The contribution of a new method can then be measured against the combination of the state-of-the-art methods.

The pointwise and pairwise approaches do not consider specific properties of ranking. For instance, position is not used in the loss function. Enabling the position information in the loss function may further improve the final results. However, the trade-off between the complex structure and the accuracy should be taken into account.

To avoid overfitting, the object detectors are trained on to test on . Further, they are trained on to test on . The detectors are trained with fewer examples. This has an impact on the performance of detectors on set in which we learn the relationship between detectors. It is observed that for some classes the performance of object detectors on set are not inline with set. Therefore, learning the models for detectors on a larger dataset may further improve the proposed learning to rank scheme.

The non-maximum suppression technique is a widely used ad-hoc method in object detection literature. However, learning to detect multiple detections from different detectors may be more appropriate for the proposed method.

The proposed method does not provide new bounding boxes. Therefore, it cannot recover from poor localization errors. Poor localization error becomes problematic for some cases (See Fig. 9 and Fig. 10 for top ranked false positives). This problem can be resolved by proposing new bounding boxes using object proposals or using a method similar to [41].

VIII Conclusion

No detection algorithm can be considered as universal. As a consequence, we have proposed an approach to combine different object detectors. The proposed approach uses (single) object detectors to exploit their correlation by learning a re-ranking scheme.

The proposed method uses common detections of single detectors to reward a detection based on detector correlation and consistency, whereas it uses complementary detections to recover missed detections of each single detector.

Experiments on the PASCAL VOC07 and VOC10 datasets show that the proposed method significantly outperforms the individual object detectors: DPM (8.4%), CN (6.8%) and EES (17.0%) on VOC07, and DPM (6.5%), CN (5.5%) and EES (16.2%) on VOC10.

Fig. 11: Precision-recall curves on PASCAL VOC 2007. The proposed method significantly outperforms all single detectors. Furthermore, it is shown that detections of baseline detectors have remarkable differences.

References


  • [1] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing Error in Object Detectors, in ECCV 2012.
  • [2] Tomasz Malisiewicz, and Alexei A. Efros. Recognition by Association via Learning Per-exemplar Distances, in CVPR, 2008.
  • [3] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, in ICLR, 2014.
  • [4] Chunhui Gu, Pablo Arbelaez, Yuanqing Lin, Kai Yu, and Jitendra Malik. Multi-component Models for Object Detection, in ECCV 2012.
  • [5] Felzenszwalb, P. F. and Girshick, R. B. and McAllester, D. and Ramanan, D. Object Detection with Discriminatively Trained Part Based Models, in TPAMI 2010.
  • [6] Philippe Xu, Franck Davoine, and Thierry Denoeux. Evidential Combination of Pedestrian Detectors, in BMVC 2014.
  • [7] Andrea Vedaldi and Andrew Zisserman. Efficient Additive Kernels via Explicit Feature Maps, in TPAMI 2012.
  • [8] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeulders. Selective Search for Object Recognition, in IJCV 2013.
  • [9] A. Vedaldi and V. Gulshan and M. Varma and A. Zisserman. Multiple Kernels for Object Detection, in ICCV 2009.
  • [10] Yukun Zhu and Jun Zhu and Rui Zhang. Discovering spatial context prototypes for object detection, in ICME 2013.
  • [11] Fahad Shahbaz Khan and Rao Muhammad Anwer and Joost van de Weijer and Andrew D. Bagdanov and Maria Vanrell and Antonio M. Lopez. Color Attributes for Object Detection, in CVPR 2012.
  • [12] Tomasz Malisiewicz and Abhinav Gupta and Alexei A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond, in ICCV 2011.
  • [13] Tie-Yan Liu. Learning to Rank for Information Retrieval, Springer 2011.
  • [14] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection, in CVPR 2005.
  • [15] G. Heitz and D. Koller. Learning Spatial Context: Using Stuff to Find Things, in ECCV 2008.
  • [16] Alexe, Bogdan and Deselaers, Thomas and Ferrari, Vittorio. Measuring the Objectness of Image Windows, in TPAMI 2012.
  • [17] Rahtu, E. and Kannala, J. and Blaschko, M. B. Learning a Category Independent Object Detection Cascade, in ICCV 2011.
  • [18] Myung Jin Choi and Antonio Torralba and Alan S. Willsky. A Tree-Based Context Model for Object Recognition, in TPAMI 2012.
  • [19] Torralba, Antonio. Contextual Priming for Object Detection, in IJCV 2003.
  • [20] Chaitanya Desai and Deva Ramanan and Charless C. Fowlkes. Discriminative Models for Multi-Class Object Layout, in IJCV 2011.
  • [21] Li, Congcong and Parikh, Devi and Chen, Tsuhan. Extracting adaptive contextual cues from unlabeled regions, in ICCV 2011.
  • [22] Carolina Galleguillos and Brian McFee and Serge Belongie and Gert R. G. Lanckriet. Multi-Class Object Localization by Combining Local Contextual Interactions, in CVPR 2010.
  • [23] Carolina Galleguillos and Serge Belongie. Context Based Object Categorization: A Critical Survey, in CVIU 2010.
  • [24] Santosh Kumar Divvala and Derek Hoiem and James Hays and Alexei A. Efros and Martial Hebert. An empirical study of context in object detection, in CVPR 2009.
  • [25] Ramazan Gokberk Cinbis and Stan Sclaroff. Contextual Object Detection Using Set-Based Classification, in ECCV 2012.
  • [26] Joachims, Thorsten. Optimizing Search Engines Using Clickthrough Data, in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2002.
  • [27] Fan, Rong-En and Chang, Kai-Wei and Hsieh, Cho-Jui and Wang, Xiang-Rui and Lin, Chih-Jen. LIBLINEAR: A Library for Large Linear Classification, in JMLR 2008.
  • [28] Bangpeng Yao and Xiaoye Jiang and Aditya Khosla and Andy Lai Lin and Leonidas J. Guibas and Li Fei-Fei. Action Recognition by Learning Bases of Action Attributes and Parts, in ICCV 2011.
  • [29] Matthijs Douze and Arnau Ramisa and Cordelia Schmid. Combining attributes and Fisher vectors for efficient image retrieval, in CVPR 2011.
  • [30] Lorenzo Torresani and Martin Szummer and Andrew Fitzgibbon. Efficient Object Category Recognition using Classemes, in ECCV 2010.
  • [31] van de Sande, K. E. A. and Gevers, T. and Snoek, C. G. M. Evaluating Color Descriptors for Object and Scene Recognition, in TPAMI 2010.

  • [32] Cinbis, Ramazan Gokberk and Verbeek, Jakob and Schmid, Cordelia. Segmentation Driven Object Detection with Fisher Vectors, in ICCV 2013.
  • [33] Uijlings, J. R. R. and van de Sande, K. E. A. and Gevers, T. and Smeulders, A. W. M. Selective Search for Object Recognition, in IJCV 2013.
  • [34] Zheng Song and Qiang Chen and ZhongYang Huang and Yang Hua and Shuicheng Yan. Contextualizing object detection and classification, in CVPR 2011.
  • [35] John C. Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, in ADVANCES IN LARGE MARGIN CLASSIFIERS 1999.
  • [36] Ali Farhadi and Mohammad Amin Sadeghi. Phrasal Recognition, in TPAMI 2013.
  • [37] Michael Fink and Pietro Perona. Mutual boosting for contextual inference, in NIPS 2004.
  • [38] Alex Krizhevsky and Ilya Sutskever and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks, in NIPS 2012.

  • [39] Dumitru Erhan and Christian Szegedy and Alexander Toshev and Dragomir Anguelov. Scalable Object Detection using Deep Neural Networks, in CVPR 2014.
  • [40] Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Nam-Gyu Cho and Seong-Whan Lee and Sanja Fidler and Raquel Urtasun and Alan Yuille. The Role of Context for Object Detection and Semantic Segmentation in the Wild, in CVPR 2014.
  • [41] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation, in CVPR 2014.
  • [42] Jan Hendrik Hosang, Rodrigo Benenson and Bernt Schiele. How good are detection proposals, really? in BMVC 2014.