Empirical Upper-bound in Object Detection and More

by   Ali Borji, et al.
West Virginia University

Object detection remains one of the most notorious open problems in computer vision. Despite large strides in accuracy in recent years, modern object detectors have started to saturate on popular benchmarks, raising the question of how far we can reach with deep learning tools and tricks. Here, by employing 2 state-of-the-art object detection benchmarks, and analyzing more than 15 models over 4 large-scale datasets, we I) carefully determine the upper bound in AP, which is 91.6 on VOC and 58.9 on OpenImages V4, much better than the AP of the best model (47.9 COCO; IOUs=.5:.95), II) characterize the sources of errors in object detectors, in a novel and intuitive way, and find that classification error (confusion with other classes and misses) explains the largest fraction of errors and weighs more than localization and duplicate errors, and III) analyze the invariance properties of models when the surrounding context of an object is removed, when an object is placed in an incongruent background, and when images are blurred or flipped vertically. We find that models generate boxes on empty regions and that context is more important for detecting small objects than larger ones. Our work taps into the tight relationship between recognition and detection and offers insights for building better models.




1 Introduction and Motivation

Several years of extensive research on object detection have produced an overwhelming amount of knowledge regarding model backbones, tricks for model training and optimization, data collection and annotation, and model evaluation and comparison [68], to the point that separating the wheat from the chaff is very difficult. As an example, truly understanding and implementing Average Precision (AP) is frustratingly difficult. A quick Google search returns numerous blogs and code samples with discrepant explanations of AP. To make matters worse, it is not clear whether AP has started to saturate, whether progress is significant, and, more importantly, how far we can improve by following the current path, making one wonder whether we have reached the peak of performance using deep learning. Further, we do not know what is holding us back from making progress in object detection, compared to the human-level (although debatable) accuracy achieved on object recognition.

To shed light on the above matters, first we systematically and carefully approximate the empirical upper-bound in AP. We hypothesize that the upper-bound AP (UAP) is the score of the best recognition model that is trained on the training target bounding boxes and is then used to label the testing target boxes. We also investigate whether visual context surrounding a target object or its overlapping boxes can improve the upper-bound AP. Second, we identify bottlenecks by characterising the type of errors that object detectors make and measure the impact of each one on performance. Third, we study the invariance properties of various object detectors on different types of transformations including incongruent context, scale, blur, vertical flip, etc.

Figure 1: Upper-bound AP (red circles and numbers) and scores of the best models (blue; FCOS [59] on VOC and FASHION, and Hybrid Task Cascade [19] on COCO; numbers from the same model).

In a nutshell, we find that there is a large gap between the performance of the best detection models and the empirical upper-bound, as shown in Fig. 1. This suggests that there is hope of reaching this peak with the current tools, if we can find smarter ways to adopt object recognition models for object detection. We also find that classification remains the major bottleneck in object detection and is more critical over small objects. Specifically, object detection models inherit the main limitation of CNNs, which is the lack of invariance. Example failure cases include generating a lot of bounding boxes on a white background containing a single object, and failing to detect objects in incongruent contexts, vertically flipped images or blurred ones. It seems that humans can still manage to solve these tasks, although with higher effort and lower performance than on intact images.

2 Related Work

We discuss three lines of related work. The first one includes works that strive to understand detection approaches, identify their shortcomings, and pinpoint where more research is needed. Parikh et al. [49] aimed to find the weakest links in person detectors by replacing different components in a pipeline (e.g. part detection, non-maxima-suppression) with human annotations. Mottaghi et al. [46] proposed human-machine CRFs for identifying bottlenecks in scene understanding. Hoiem et al. [33] inspected detection models in terms of their localization errors, confusion with other classes, and confusion with the background on the PASCAL dataset. They also conducted a meta-analysis to measure the impact of object properties such as color, texture, and real-world size on detection performance. We replicate, simplify and extend this work on the larger COCO dataset and on image transformations. Russakovsky et al. [55] analyzed the ImageNet localization task and emphasized fine-grained recognition. Zhang et al. [64] measured how far we are from solving pedestrian detection. Vondrick et al. [61] proposed a method for visualizing object detection features to gain insights into their functioning. Some other related works in this line include [38, 67, 63].

The second line concerns research comparing object detection models. Some works have analyzed and reported statistics and performances over benchmark datasets such as PASCAL VOC [26, 25], MSCOCO [40], CityScapes [22], and OpenImages [37]. Recently, Huang et al. [35] performed a speed/accuracy trade-off analysis of modern object detectors. Dollar et al. [24] and Borji et al. [15, 17, 16] compared models for person detection, and salient object detection, respectively. In [44], Michaelis et al. assessed detection models on degraded images and observed about a 30–60% performance drop, which could be mitigated by data augmentation. In order to resolve the shortcomings of the AP score, some works have attempted to introduce alternative [29] or complementary evaluation measures [47, 53]. A large number of works have also assessed object recognition models and their robustness (e.g. [56, 12, 51, 45]).

Works in the third line study the role of context in object detection and recognition (e.g. [13, 62, 43, 32, 60, 50, 54, 27]). Heitz et al. [32] proposed a probabilistic framework to capture contextual information between “stuff” and “things” to improve detection. Barnea et al. [14] utilized co-occurrence relations among objects to improve the detection scores. Divvala et al. [23] explored different types of context in recognition. See also [32, 21, 57, 34, 43, 11].

3 Experimental Setup

3.1 Benchmarks

We base our analysis on two recent large-scale object detection benchmarks: MMDetection [6, 20] and Detectron2 [4]. The former evaluates more than 25 models. The latter includes several variants of FastRCNN [28]. In both benchmarks, all COCO models have been trained on train2017 and evaluated on val2017. Here, we use MMDetection to train and test additional models on a new dataset.

3.2 Models

We consider the latest models published in the major vision conferences and the ones included in the above benchmarks. Several variants of the RCNN model including FasterRCNN [52], MaskRCNN [30], RetinaNet [39], GridRCNN [42], LibraRCNN [48], CascadeRCNN [18], MaskScoringRCNN [36], GAFasterRCNN [66], and Hybrid Task Cascade [19] are considered. We also include SSD [41], FCOS [59], and CenterNet [65]. Different backbones for each model are also taken into account.

3.3 Datasets

We employ 4 datasets including PASCAL VOC [25], our home-brewed FASHION dataset, MSCOCO [40], and OpenImages [37]. Over VOC, we use trainval0712 for training (16,551 images, 47,223 boxes) and test2007 (4,952 images, 14,976 boxes) for testing. This dataset has 20 categories. Our FASHION dataset covers 40 categories of clothing items (39 + humans). The trainval and test sets for this dataset contain 206,530 images (776,172 boxes) and 51,650 images (193,689 boxes), respectively. Fig. 5 displays samples from this dataset (see Supplement for stats). This is a challenging dataset since clothing items are non-rigid as opposed to COCO or VOC objects. MSCOCO has 80 categories. It has carried the torch for benchmarking advances in object detection for the past 6 years. We use train2017 for training (118,287 images, 860,001 boxes) and val2017 (5,000 images, 36,781 boxes) for testing. Finally, we use the OpenImages V4 dataset, used in the Kaggle competition [10]. It has 500 classes and contains 1,743,042 images (12,195,144 boxes) for training and 41,620 images (226,811 boxes) for validation (used here for testing).

3.4 Metrics

We use the COCO evaluation code [2] to measure AP at IOU thresholds of 0.5 and 0.75 as well as the average AP over IOUs in the range 0.5:.05:0.95. APs are calculated per class and are then averaged. We also report breakdown APs over small (area < 32² pixels), medium (32² < area < 96²), and large (area > 96²) objects. Please see [2, 8, 5] for details.

4 Characterizing the Empirical Upper-bound

We hypothesize that the empirical upper-bound in AP is the score of a detector with ground truth bounding boxes labeled by the best object classifier. The classification score is considered as the detection score. This way we essentially assume that the localization problem is solved and what remains is only object recognition. However, it might be possible to improve upon this detector in at least two ways: a) by exploiting local context around an object to improve classification accuracy and hence better UAP, and b) by searching over the scene and finding boxes that are easier to classify (compared to the target box) and have enough overlap with the target box. This does not matter for the perfect IOU but may affect IOUs lower than one. We carefully investigate these possibilities in the following.
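The construction of this upper-bound detector can be sketched in a few lines. The following is only an illustration under stated assumptions: the classifier is assumed to return one probability vector per ground-truth box, and the function name `upper_bound_detections` and the dictionary layout are hypothetical, not taken from the authors' code.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def upper_bound_detections(
    gt_boxes: Sequence[Box],
    class_probs: Sequence[Sequence[float]],
) -> List[Dict[str, object]]:
    """Turn ground-truth boxes plus classifier outputs into detections.

    Each detection keeps the ground-truth coordinates (localization is
    assumed solved). Its label is the classifier's top class and its
    detection score is the corresponding classification score.
    """
    if len(gt_boxes) != len(class_probs):
        raise ValueError("one probability vector is required per box")
    detections = []
    for box, probs in zip(gt_boxes, class_probs):
        # Index of the most probable class for this box.
        label = max(range(len(probs)), key=lambda k: probs[k])
        detections.append({"box": box, "label": label, "score": probs[label]})
    return detections


if __name__ == "__main__":
    boxes = [(10, 20, 50, 80), (100, 40, 30, 30)]
    probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
    for det in upper_bound_detections(boxes, probs):
        print(det)
```

Feeding such detections to a standard AP evaluator would then give the score of a detector whose only remaining errors are classification errors.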

Figure 2: Illustration of visual context surrounding an object.
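Column prefixes: O = object only, O+C = object + context, C = context only.

| Dataset | O 0.2 | O 0.4 | O 0.6 | O 0.8 | O 1 | O+C 1.2 | O+C 1.4 | O+C 1.6 | O+C 1.8 | O+C 2 | C 1.2 | C 2 | C all img |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VOC | 39.3 | 68.0 | 82.6 | 92.5 | 94.8 | 93.0 | 91.6 | 90.6 | 88.6 | 87.0 | 63.6 | 64.9 | 35.3 |
| FASHION | - | 52.9 | 66.4 | 71.7 | 88.8 | 82.3 | 77.2 | 71.8 | 67.9 | 64.8 | 29.0 | 32.2 | 12.0 |
| COCO | - | - | 67.1 | 79.8 | 86.7 | 82.9 | 78.3 | 72.5 | 67.4 | 63.0 | 43.7 | 48.9 | 11.0 |
| OpenImg. | - | - | - | - | 69.0 | 65.1 | 62.7 | - | - | - | - | - | - |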

Table 1: Recognition accuracy using object and/or its context.

4.1 Utility of the surrounding context

We trained ResNet152 [31] (see supp.) on target boxes in three settings as shown in Fig. 2: I) object only, II) object + context, and III) context only. Standard data augmentation techniques including mean pixel subtraction, color jittering, random horizontal flip and random rotation (10 degrees) were applied. Boxes were resized to 224 × 224 pixels and models were trained for 15 epochs. Trained models were tested on the original object box. Results (top-1 accuracy) shown in Table 1 reveal that the canonical object size contains the most information regarding the category of an object over all four datasets. Enlarging or shrinking the object box lowers the performance. The context-only scenario leads to a high classification score, but one that is still below the other cases. Stretching the context to the whole scene drops the performance significantly. Training and testing models on the same condition (i.e. both on object + context) results in higher accuracy on that specific condition but does not lead to better overall recognition accuracy.
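To make the box manipulation concrete, here is a minimal sketch of how an enlarged or shrunken box could be derived from a target box. It assumes that the factor scales width and height linearly about the box center and that the result is clipped to the image; the source does not spell out these details, and `scale_box` is a hypothetical helper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def scale_box(box: Box, factor: float, img_w: float, img_h: float) -> Box:
    """Scale a box about its center by `factor` and clip it to the image.

    factor < 1 shrinks the box (part of the object only),
    factor > 1 enlarges it (object plus surrounding context).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * factor / 2.0
    half_h = (y1 - y0) * factor / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )


if __name__ == "__main__":
    target = (40.0, 40.0, 60.0, 60.0)
    for f in (0.5, 1.0, 2.0):
        print(f, scale_box(target, f, img_w=100, img_h=100))
```

A context-only crop would use the same enlarged box with the pixels inside the original target box masked out.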

4.2 Searching for the best label

Essentially the problem definition here is how we can get the best classification accuracy for recognition of objects in the scene by utilizing all the information in the scene. This is different than recognition approaches that treat objects in isolation. Note that recognition accuracy is not the same as AP, since detection scores also matter in AP calculation.

Having the best classifier at hand, we are ready to approximate the empirical upper-bound in AP. Before delving into details, let's recap how AP is calculated.

AP calculation. For each category, detections over all images are sorted according to their confidences. Starting from the top of this list, the target with the highest IOU with each detection is considered. We have a true positive (TP; hit) if their IOU is at or above the evaluation threshold and that target has not been assigned yet. We have a false positive (FP) if the IOU is below the threshold (i.e. localization error) or if the target has already been assigned (i.e. duplicate; two predictions on the same target). A target box can be matched with only one detection (the one with the highest confidence score and IOU). If a detection has sufficient IOU with two targets, it is assigned to the one with the highest IOU that is not assigned already. Scanning the sorted detection list again, a precision for each recall is obtained and is used to draw the Recall-Precision (RP) curve and to compute the AP. See [8, 5] for details.
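The matching procedure above can be sketched for a single category as follows. This is an illustrative implementation, not the COCO or VOC evaluation code: the area under the curve is computed with all-point interpolation of the precision envelope (an assumption; COCO samples the curve at fixed recall levels instead), and the data layout is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[str, float, Box]       # (image_id, confidence, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_precision(
    detections: Sequence[Detection],
    targets: Dict[str, List[Box]],
    iou_thr: float = 0.5,
) -> float:
    """AP of one category at a single IOU threshold."""
    n_targets = sum(len(boxes) for boxes in targets.values())
    if n_targets == 0:
        return 0.0
    assigned = {img: [False] * len(boxes) for img, boxes in targets.items()}
    hits = []
    # Visit detections from the most to the least confident.
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(targets.get(img, [])):
            overlap = iou(box, gt)
            # Only unassigned targets with enough overlap are candidates.
            if overlap >= iou_thr and not assigned[img][idx] and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0:
            assigned[img][best_idx] = True
            hits.append(True)    # true positive
        else:
            hits.append(False)   # false positive (mislocalized or duplicate)
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += int(hit)
        precisions.append(tp / rank)
        recalls.append(tp / n_targets)
    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Averaging this quantity over categories, and over the IOU thresholds 0.5:.05:0.95, gives a mean AP in the spirit of the COCO-style score used throughout the paper.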

We explore two strategies in pursuit of the upper-bound AP. In the first strategy, we apply the best classifier from the previous section to the target boxes. The detector built in this fashion gives the same AP regardless of the IOU threshold, since our detections are target boxes. As we argued above, it is not possible to improve this detector at IOU=1. However, if we are interested in the upper-bound for a lower IOU threshold, then it might be possible to do better by searching among the candidate boxes near a target box and choosing the one that can be classified better than the target box, or by aggregating information from nearby boxes. Thus, in our second strategy, we sample boxes around an object and either apply the original classifier (trained on canonical object size) or train and test new classifiers on the surrounding boxes. In any case, we always keep the target box but change its label and/or its confidence. First, let's take a look at our box sampling approach, which is illustrated in Fig. 3.

Figure 3: A) Illustration of our setup for finding boxes with IOU with the target box (corresponding to ; for ), B) The solutions are 4 curves represented by Eqs. 4 to 8. Four sample rectangles are shown with dashed lines.

Sampling boxes with IOU above a threshold. Here, we are interested in finding the coordinates of the top-left corner of all rectangles with IOU with the ground-truth bounding box. We use the coordinate system centered at the top-left corner of the target box ; which can be easily converted to the image level coordinate frame. The bottom-right coordinates of the desired rectangles that intersect with the target box from the top-left follow the equation , where and are width and height of rectangles, respectively (we assume all boxes have the same width and height as the target box). According to the illustration in Fig. 3(A), we have:


From these equations and assuming , and , it is easy to derive the following equations:


and also:


The same equation governs the coordinates of the bottom-left, top-left, and top-right corners of the rectangles intersecting with the target box at points , , and , respectively (in the coordinate frames centered as these points, in order). Calculating the top-left corner of these rectangles (in their corresponding coordinate frames) and representing them in the coordinate frame of point , we arrive at the following four equations (note that these are not lines):
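As a rough numerical companion to this derivation, the sketch below samples shifted boxes that lie on the IOU boundary. It relies only on the equal-size assumption stated above: for a box of width w and height h shifted by (dx, dy), the intersection is (w - |dx|)(h - |dy|) and the union is 2wh minus the intersection, so IOU ≥ t is equivalent to (w - |dx|)(h - |dy|) ≥ 2twh/(1 + t). The boundary is a hyperbola arc in each quadrant, consistent with the remark that the solutions are not lines. Function names and the random sampling scheme are illustrative choices, not the authors' implementation.

```python
import random
from typing import List, Tuple


def iou_same_size(dx: float, dy: float, w: float, h: float) -> float:
    """IOU between a w x h box and a copy of it shifted by (dx, dy)."""
    inter = max(0.0, w - abs(dx)) * max(0.0, h - abs(dy))
    return inter / (2.0 * w * h - inter)


def sample_offsets(
    w: float, h: float, thr: float, n: int = 4, seed: int = 0
) -> List[Tuple[float, float]]:
    """Sample n top-left offsets whose boxes have IOU equal to `thr`.

    Offsets are relative to the top-left corner of the target box.
    Scaling an offset toward (0, 0) yields a box with a higher IOU.
    """
    if not 0.0 < thr <= 1.0:
        raise ValueError("thr must be in (0, 1]")
    rng = random.Random(seed)
    required = 2.0 * thr * w * h / (1.0 + thr)  # minimum intersection area
    dx_max = w - required / h                   # largest shift along x
    offsets = []
    for _ in range(n):
        adx = rng.uniform(0.0, dx_max)
        # Solve (w - adx) * (h - ady) = required for ady.
        ady = max(0.0, h - required / (w - adx))
        sx = rng.choice((-1.0, 1.0))            # pick one of four quadrants
        sy = rng.choice((-1.0, 1.0))
        offsets.append((sx * adx, sy * ady))
    return offsets


if __name__ == "__main__":
    for dx, dy in sample_offsets(w=40.0, h=20.0, thr=0.5):
        print(round(dx, 2), round(dy, 2), round(iou_same_size(dx, dy, 40.0, 20.0), 3))
```

Adding each offset to the image-level coordinates of the target's top-left corner gives the sampled box in the image frame.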


| Dataset | Acc. | Most Confident Box | | | | Most Frequent Label | | | |
|---|---|---|---|---|---|---|---|---|---|
| VOC | 93.7 | 88.7 | 91.7 | 81.4 | 63.8 | 89.1 | 92.0 | 82.9 | 60 |
| FASHION | 87.4 | 68.1 | 68.6 | 61.9 | 49.5 | 67.7 | 68.2 | 60.7 | 47.8 |
| COCO | 84.8 | 76.9 | 81.8 | 80.6 | 62.8 | 76.4 | 82.0 | 80.4 | 60.7 |

Table 2: Results of our second strategy for estimating the upper-bound AP (i.e. searching for the best bounding box or object label near a target box, among boxes with sufficient IOU with it). Notice that the upper-bounds for AP, AP50 and AP75 are all the same. Underlined numbers show where we could improve over the 1st strategy.

Using the above equations, we then sample a few rectangles (here 4) around each target box (Fig. 3(B)) and label them with the label of the target box. We then train a new classifier (same ResNet152 as above) on these boxes. This is effectively a new data augmentation technique. Notice that AP is a direct consequence of the classification accuracy, so if we can better classify objects we can obtain a better AP. To estimate UAP, we sample a number of rectangles (4) near a target box, all with sufficient IOU with it, and then label the target box with: a) the label (and confidence) of the box with the highest classification score (i.e. most confident box), or b) the most frequent label among the nearby boxes (with the maximum confidence score among them).
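The two labeling rules can be sketched as follows. Two details are assumptions here rather than facts from the source: for rule (b), the confidence is taken as the maximum among the boxes that carry the winning label, and ties between equally frequent labels are broken by that confidence. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Prediction = Tuple[int, float]  # (label, confidence) for one sampled box


def most_confident(preds: Sequence[Prediction]) -> Prediction:
    """Rule (a): copy the label and confidence of the most confident box."""
    return max(preds, key=lambda p: p[1])


def most_frequent(preds: Sequence[Prediction]) -> Prediction:
    """Rule (b): majority label, scored with its highest confidence."""
    counts = Counter(label for label, _ in preds)
    top_count = max(counts.values())
    tied = {label for label, c in counts.items() if c == top_count}
    # Among boxes voting for a winning label, keep the most confident one.
    return max((p for p in preds if p[0] in tied), key=lambda p: p[1])


if __name__ == "__main__":
    nearby = [(1, 0.6), (2, 0.9), (1, 0.7), (1, 0.5)]
    print(most_confident(nearby))  # label and score assigned by rule (a)
    print(most_frequent(nearby))   # label and score assigned by rule (b)
```

In both cases the coordinates of the target box are kept; only its label and confidence change.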

4.3 Upper-bound results

Here, we report classification scores, upper-bound APs, score of the models (mean AP over all IOUs; unless specified otherwise), and the breakdown AP over categories.

Comparison of strategies. Summary results of the first strategy are shown in Fig. 1. As expected, UAPs over all IOUs are the same and are much better than the model scores. To our surprise, our second strategy did not lead to better UAP values, except for a few cases including UAPs over medium and small objects on the FASHION dataset and small objects on COCO (using most confident boxes), as shown in Table 2. Applying the original classifier, instead of training new ones on surrounding boxes, or only sampling boxes with higher IOU (e.g. 0.9) did not improve the results. Also, setting the confidence of detections to 1 lowers the UAP. We attribute the failure of the 2nd strategy to the fact that the surrounding boxes may contain additional visual content which may introduce noise in the labels. This leads to a lower classification accuracy and hence a lower AP. Therefore, in what follows we only discuss the results from the first strategy.

Figure 4: Model scores and upper-bound AP over PASCAL VOC dataset using VOC (left) and COCO AP evaluation codes (right). Categories are sorted based on the average model AP. Bar charts show classification scores. Solid red and dashed black lines represent upper-bound AP, and the best model AP, respectively.
Figure 5: Upper-bound and model APs over the Fashion dataset.
Figure 6: APs over COCO dataset borrowed from the MMDetection benchmark [6]. We add CenterNet results to MMDetection.

PASCAL VOC. Fig. 4 shows results using both VOC and COCO evaluation codes. The VOC evaluation code is based on IOU=0.5 and calculates the area under the PR curve slightly differently from COCO. For VOC, we adopt the code from the CenterNet repository [1]. We have trained and tested 5 models on this dataset including FasterRCNN, FCOS, SSD512, and two variants of CenterNet. The classification accuracy on VOC is very high (94.7%). Consequently, the UAP is very high (91.6 using the COCO code). The FCOS model does the best here with an AP of 47.9 (right panel in Fig. 4; dashed lines). As can be seen, there is a large gap between the AP of the best model and the UAP on this dataset (about 45). Models are consistent in their performance across different categories.

FASHION. Results are shown in Fig. 5. The best classification accuracy on this dataset is 88.8% (Table 1, and supplement). The UAP is 71.2 and the AP of the best model is 59.7 (FCOS). Interestingly, FCOS performs quite close to the upper-bound at IOU=0.5 (Fig. 1). Models perform better here than over VOC. The FASHION UAP is lower than the VOC UAP, perhaps because classification is more challenging on the former dataset. The gap between UAP and model AP here, however, is much smaller than on VOC. This could be partly due to the fact that FASHION scenes have less clutter and larger objects than the VOC scenes. While per-class UAP is above the AP of the best model over all VOC classes, UAPs of 5 FASHION categories fall below the best model AP (messenger bags, tunics, long sleeve shirts, blouses, and rompers). Looking at the classification scores, we find that these categories have low classification accuracy.

COCO. Existing benchmarks have provided an efficient ecosystem for developing, evaluating and comparing detection models, especially on the COCO dataset. They provide trained models over a variety of settings. Borrowing the MMDetection benchmark and adding the results from CenterNet to it, we end up comparing 15 models (71 in total; combination of models and backbones). Model scores are shown in Fig. 6. The best models here are the Hybrid Task Cascade model [19] and Cascade MaskRCNN [18], with APs of 46.9 and 45.7, respectively. See supplement for Recall results. The upper-bound AP on COCO is about 78.2. Recall that UAP does not depend on the IOU threshold since detected boxes are classified ground-truth targets. The gap between the best model AP and UAP is above 30. The gap is much smaller for AP at IOU=0.5 (about 10). The UAP over small objects is much lower than the UAP over large objects. This also holds for models. The gap between UAP and model AP over small objects is about 35, which is much higher than the gap over medium or large objects.

Breakdown APs over object categories are shown in Fig. 7. For this, we use the Detectron2 benchmark, which reports per-category results mainly over the RCNN model family. We noticed that aggregate scores on MMDetection and Detectron2 are quite consistent. Among 18 variants of Faster-RCNN and MASK-RCNN, the best model has an AP of 44.3 (shown by the dashed line), which is lower than both the best available model on COCO (46.9; Fig. 1) and the upper-bound AP. Among 80 classes, only three (snowboard, toothbrush, and toaster) have UAPs below the best model APs.

Figure 7: Detection APs over the MSCOCO dataset borrowed from the Detectron2 benchmark [4]. The horizontal dashed line corresponds to the best model among the shown models. “*”: The best AP here is 44.3, which is smaller than the best so far on COCO (46.9). See also Fig. 1.

OpenImages. This dataset [37] is the latest endeavor in object detection and is much more challenging than its predecessors. Our classifier achieves 69.0% top-1 accuracy on the validation set of OpenImages V4, which is lower than on the other three datasets. We achieve 58.9 UAP, using the TensorFlow evaluation code for computing AP [9] on this dataset, which is different from the COCO AP calculation (here we discarded grouping and super-category). We are not aware of any model scores on this set of OpenImages V4.

AP vs. classification accuracy. We found that there is a positive linear correlation (0.81 on COCO) between the UAP and the classification accuracy (Fig. 8). The higher the classification accuracy, the higher the UAP. We did not find a correlation between the accuracy and model APs, nor between the object size and accuracy (or UAP). The dependency of UAP on accuracy highlights the importance of recognition in object detection and constitutes the core of our analyses in the next two sections.

Figure 8: Correlation between classification accuracy and upper-bound AP. The higher the Acc., the better the UAP.

5 Error Diagnosis

To pinpoint the shortcomings of object detectors, we follow the analysis by Hoiem et al. [33], but revise it in two major ways. First, instead of inspecting errors across categories, here we perform a per-category error analysis (i.e. in a binary manner). This simplifies the process and makes it easier to understand. See Fig. 9. We combine all types of class confusions (e.g. similar classes, other classes, and background in Hoiem et al. [33]) into two types of classification errors: a) confusion with the background (Type I), and b) misses (Type II). Notice that this implicitly contains the above misclassification types but is much easier to investigate. In fact, recent object detectors such as FCOS [59] and CenterNet [65] also adopt this strategy to classify objects (i.e. an object either is of a particular class or is not). Second, Hoiem et al. successively remove errors to reach an AP of 1. We argue that this approach conflates different error types and does not correctly reflect the true contribution of errors (i.e. it understates or exaggerates some error types). For example, according to the COCO analysis tool [2], any matches to objects with a different class label but in the same supercategory do not count as either a FP or a TP. Also, the COCO tool removes mislocalized predictions. In this case, we argue that correcting the mislocalized predictions is more effective than removing them because it can reveal other sources of weakness in a model. For example, it may lead to generating duplicates which would have been overlooked by removing the detections. In contrast, here we explicitly handle the errors by removing, correcting or adding detections when appropriate. Similar to Hoiem et al., our analysis is also based on IOU=0.5.

We repeat the following procedure for each category-image pair (shown in Fig. 9; from left to right). First, we remove the detections whose maximum IOU with any target is too low (i.e. classification error Type I; confusion with the background). Second, we correct the mislocalized predictions, i.e. those that overlap a target but with an IOU below the matching threshold. In this step, coordinates of these boxes are replaced with their matching target box coordinates (which is the target with the max IOU) while their confidence scores and labels are preserved. Third, duplicates (i.e. redundant detections) are removed. An unmatched detection is considered a duplicate if it falls (i.e. has sufficient IOU) over a target with an already assigned detection (with a higher score). See supplement for details. Fourth, misses are treated. A miss is a target without sufficient IOU with any unmatched detection; it is added to the list of detections (with a score of 1). Before performing this step, we set the coordinates of detections to the coordinates of their matching targets, since we now know which prediction is paired with which target (i.e. one-to-one mapping; no duplicates).
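The four steps can be sketched for one category-image pair as below. This is a simplified illustration, not the authors' code: the background threshold `bg_thr=0.1` is an assumed value (the matching threshold of 0.5 follows the text), each detection is linked to its single highest-IOU target, and the stage names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Det = Tuple[float, Box]                  # (confidence, box)


def iou(a: Box, b: Box) -> float:
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def diagnose(
    detections: Sequence[Det],
    targets: Sequence[Box],
    bg_thr: float = 0.1,   # assumed value, not given in the text
    iou_thr: float = 0.5,
) -> Dict[str, List[Det]]:
    """Return the detection list after each of the four fixes."""
    stages: Dict[str, List[Det]] = {
        "no_background": [], "localized": [], "no_duplicates": [], "with_misses": []
    }
    if not targets:
        return stages  # every detection is a background confusion

    def best_target(box: Box) -> Tuple[int, float]:
        overlaps = [iou(box, t) for t in targets]
        idx = max(range(len(targets)), key=lambda k: overlaps[k])
        return idx, overlaps[idx]

    # 1) Remove confusions with the background (Type I).
    stages["no_background"] = [d for d in detections if best_target(d[1])[1] >= bg_thr]
    # 2) Correct mislocalized boxes; scores are preserved.
    for score, box in stages["no_background"]:
        idx, overlap = best_target(box)
        stages["localized"].append((score, targets[idx] if overlap < iou_thr else box))
    # 3) Remove duplicates: keep the highest-scoring detection per target.
    matched: Dict[int, Det] = {}
    for score, box in sorted(stages["localized"], key=lambda d: -d[0]):
        idx, _ = best_target(box)
        if idx not in matched:
            matched[idx] = (score, box)
    stages["no_duplicates"] = list(matched.values())
    # 4) Snap detections to their targets, then add misses with score 1.
    final = [(score, targets[idx]) for idx, (score, _box) in matched.items()]
    final += [(1.0, t) for idx, t in enumerate(targets) if idx not in matched]
    stages["with_misses"] = final
    return stages
```

Evaluating AP on each returned stage in turn would give progressive numbers analogous to the columns of Table 3, with the last stage reaching a perfect score by construction.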

Results of error diagnosis are shown in Table 3 for 3 models over 3 datasets. We start from the original detection set and progressively measure the impact of fixing each error type in the order explained above and shown in Fig. 9. Confusion with the background (and other classes; see above) has the highest contribution to the overall error, across all models. This indicates that models often mistake background clutter or other classes for a particular object category. The second most important error type is misses. Interestingly, localization error weighs more than duplicates and has a higher impact on the COCO and VOC datasets than on the FASHION dataset, possibly because the former two contain a larger number of small objects. Conversely, over the FASHION dataset, duplicates matter more, perhaps because class confusion is higher (e.g. confusion in slippers vs. sandals; different types of hats, etc.). Models behave almost consistently across the three datasets.

We also cross-checked our results with results obtained using the COCO analysis tool (implementing Hoiem et al.). Notice that numbers from the COCO analysis tool are not directly comparable to ours since our strategy is different and, unlike ours, it does not explicitly address duplicate errors. Nevertheless, based on APs and PR curves in Fig. 10, we arrive at similar conclusions. Here, again we observe that classification error Type I (Sim, Oth, and BG in Fig. 10) accounts for the largest fraction of errors, followed by misses (FN) and localization (Loc) errors.

Figure 9: Illustration of different error types in object detection.
| Dataset | Model | AP | - Cls. (Type I) | + Local. | - Duplicates | + Misses |
|---|---|---|---|---|---|---|
| FASHION | MaskRCNN | 54.1 | 85.9 | 87.7 | 88.7 | 100 |
| FASHION | CenterNet | 54.0 | 88.8 | 91.7 | 96.2 | 100 |
| FASHION | FCOS | 59.7 | 90.1 | 91.9 | 95.9 | 100 |
| COCO | MaskRCNN | 42.1 | 70.1 | 79.0 | 82.7 | 100 |
| COCO | CenterNet | 39.2 | 66.1 | 78.0 | 81.7 | 100 |
| COCO | FCOS | 42.8 | 69.6 | 80.8 | 85.4 | 100 |
| VOC | MaskRCNN | 47.3 | 73.7 | 78.8 | 79.7 | 100 |
| VOC | CenterNet | 47.8 | 79.0 | 88.5 | 92.6 | 100 |
| VOC | FCOS | 47.9 | 76.3 | 85.0 | 90.3 | 100 |

Table 3: Quantifying the contribution of errors in object detection. “Local.” and “Dup.” stand for localization error and duplicate removal, respectively. AP is the original AP by the model.

6 Invariance Analysis

Complementary to our error diagnosis, here we conduct several experiments to reduce the impact of localization or recognition in detection pipelines (one at a time). Our principal emphasis is on the recognition component. These experiments are performed over the COCO val2017 set and are illustrated in Fig. 11. Models trained on the COCO train2017 set are employed.

Analysis of context. In the first experiment, we generated stimuli in which a single object was placed on a white background or on a white-noise background (one object per image, hence the number of images equals the number of objects). Contrary to our expectation, we found that models either underestimate or overestimate the distribution of target bounding boxes. Fig. 12 shows the difference between the distribution of predicted boxes and the distribution of ground-truth boxes. Interestingly, models search all over the place. FasterRCNN and RetinaNet oversample boxes around targets, while FCOS generates a fair amount. This hints at shortcomings in the objectness prediction of models. Quantitative results, presented in Table 4, show that models perform poorly on these images (about the same in both conditions but lower than on the original images). They are hindered much more on small objects than on medium or large ones, which shows how critical context is for recognition and detection of small objects. FCOS, ranking higher on original images, does better here as well.
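One way such single-object stimuli could be generated is sketched below with NumPy. Several choices are assumptions not stated in the source: the object stays at its original location, the object region is the bounding box rather than a segmentation mask, and "white noise" is drawn as uniform random pixel values. The helper `isolate_object` is hypothetical.

```python
import numpy as np


def isolate_object(image, box, background="white", seed=0):
    """Keep only the pixels inside `box`; fill the rest of the image.

    image: uint8 array of shape (H, W) or (H, W, C).
    box: (x_min, y_min, x_max, y_max) in integer pixel coordinates.
    background: "white" for a constant 255 canvas, "noise" for uniform noise.
    """
    x0, y0, x1, y1 = box
    if background == "white":
        canvas = np.full_like(image, 255)
    elif background == "noise":
        rng = np.random.default_rng(seed)
        canvas = rng.integers(0, 256, size=image.shape, dtype=np.uint8)
    else:
        raise ValueError("background must be 'white' or 'noise'")
    # Copy the object pixels back onto the synthetic background.
    canvas[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return canvas


if __name__ == "__main__":
    img = np.zeros((120, 160, 3), dtype=np.uint8)
    out = isolate_object(img, (40, 30, 90, 80), background="noise")
    print(out.shape, out.dtype)
```

Calling the function once per annotated object yields one stimulus per object, which matches the "one object per image" setup described above.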

In the second experiment, the object-only case, we removed the image background and preserved all the objects (hence the same number of images as in COCO val2017). To our surprise, FCOS and SSD performed better on these images than on the original ones (Column 1 vs. 10 in Table 6). Compared to the original images, they did better on large objects and worse on small objects in the object-only case.

Figure 10: Quantifying the contribution of errors in object detection using the COCO analysis code [2].

In the third experiment, we placed objects in incongruent backgrounds (e.g. a boat in the street), similar to Rosenfeld et al. [54] but over a larger dataset and including more models. We placed 9 objects including bear, keyboard, refrigerator, surfboard, train, tv, cake, horse, and oven on 100 images taken from the FASHION dataset; 900 images in total. Fig. 13 shows some examples. Results are given in Table 5. Interestingly, models performed well on this dataset. They failed drastically on surfboard and oven, which seem to be somewhat hard even for humans. Cake, bear, and horse were the easiest ones. FCOS did the best among models.

Figure 11: Analysis of the impact of context and invariance in object detection. The bottom-left panel shows the distribution of target object boxes (COCOval2017) in log scale (see supplement).

Model       | white BG          | noise BG          | crop
FasterRCNN  | 31.1  42    36.1  | 31.8  39.8  36.8  | 8.4   15.0  8.2
RetinaNet   | 33.1  41.0  37.3  | 32.7  39.1  36.6  | 16.9  22.7  18.8
FCOS        | 34.5  42    37.1  | 34.2  39.8  37.4  | 14.3  18.5  15.3
SSD512      | 27.4  36.7  32.3  | 26.0  33.4  34    | 13.4  18.9  14.9

FasterRCNN  | 7.5   35.9  49.9  | 7.0   36.6  52.1  | 0     1.3   18.7
RetinaNet   | 8.3   37.5  53.2  | 6.4   38.3  54.2  | 1     5.2   34.1
FCOS        | 8.5   39.8  55.2  | 9.4   39.5  54.8  | 1     4.5   32.2
SSD512      | 7.0   31.4  45.1  | 4.6   29.3  45.2  | 1     2.9   25.7

Table 4: Results of invariance analysis over COCOval2017.
Model train horse bear surfboard cake tv keyboard oven fridge Avg.

64.0 58.4 84.7 2.4 77.9 74.3 54.7 15.5 20.3 50.2
RetinaNet 54.2 89.2 90.6 2 85.7 86.6 10.1 24.8 69.3 57.0
FCOS 73.4 91.5 94.0 17.1 87.6 92.1 9.8 44.2 76.2 65.1
SSD512 84.3 58.9 78.5 3.8 76.9 69.8 42.6 8.4 47.6 52.3


Avg. 69.0 74.5 87.0 6.3 82.0 80.7 29.3 23.2 53.4 56.2
Table 5: Model APs (IOU=.5) over objects in incongruent contexts.

Robustness to image transformations. In the fourth experiment, we evaluated models on objects that were a) cropped right out of the image, or b) cropped and resized such that their smallest dimension became 300 pixels (while preserving the aspect ratio). Models performed very poorly in both cases, as shown in Table 4 (see supplement), with RetinaNet doing better than the others. The poor performance here demonstrates how sensitive models are to object scale and that they lack robustness to object appearance. Visually inspecting the images, we found it very difficult to recognize the cropped objects, especially the small ones.
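The resizing rule in case b) amounts to a single scale factor applied to both sides of the crop. The sketch below only computes the target size; the function name `resized_shape` and the rounding to whole pixels are assumptions, not details taken from the experiments.

```python
def resized_shape(width, height, target_short_side=300):
    """Return (new_width, new_height) so that the shorter side equals
    `target_short_side` while the aspect ratio is preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # One scale factor for both sides keeps the aspect ratio unchanged.
    scale = target_short_side / min(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height


# A small 40 x 25 pixel crop is enlarged by a factor of 12.
small_crop_size = resized_shape(40, 25)
```

For a 40 x 25 crop the scale factor is 300 / 25 = 12, so small objects are magnified heavily, which is consistent with the observation that small cropped objects are hard to recognize.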

The fifth and sixth experiments test models under Gaussian blur (with an 11 × 11 kernel) and vertical flip, respectively. Results in Table 6 show that both transformations dramatically hinder performance, with a higher impact for vertical flip. We do not have a baseline for human performance in these cases, but a quick look through the images suggests that it is still possible to detect objects, albeit with more effort. RetinaNet and FCOS outperform the other models here.
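Evaluating on transformed images requires transforming the annotations consistently. The sketch below assumes that "vertical flip" means turning the image upside down and that boxes are COCO-style (x, y, w, h); it also builds the 1-D weights of an 11-tap Gaussian, where `sigma=2.0` is a placeholder because only the kernel size is given in the text. Both function names are hypothetical.

```python
import math


def flip_box_vertically(box, image_height):
    """Map an (x, y, w, h) box to its location after an upside-down flip.

    The new top edge is the mirrored position of the old bottom edge.
    """
    x, y, w, h = box
    return (x, image_height - y - h, w, h)


def gaussian_kernel_1d(size=11, sigma=2.0):
    """Normalized 1-D Gaussian weights (sigma is an assumed value).

    An 11 x 11 kernel is the outer product of this vector with itself.
    """
    center = (size - 1) / 2
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


flipped = flip_box_vertically((10, 20, 30, 40), image_height=100)
kernel = gaussian_kernel_1d()
```

Flipping a box twice returns the original box, which is a convenient sanity check when preparing flipped ground truth.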

Analysis of errors. Here we measure the impact of each error type in three detection tasks: object-only, Gaussian blur, and vertical flip. See Table 7 for results. In order of importance, the error types over the three tasks are misses, localization, misclassification (Type I), and duplicates. Models miss more objects in the vertical flip and Gaussian blur cases than in the object-only case. There is less confusion with BG in the object-only case than in the original images (classification Type I) since there is no background clutter.

Figure 12: Distribution of predicted boxes on COCOval2017 (log scale).
Figure 13: Samples of our dataset of objects in incongruent background.
Model      | objects only      | Gaussian blur     | vertical flip     | orig img.
Fa.RCNN    | 35.9  55.8  39.5  | 17.1  29.6  17.4  | 15.5  27.3  15.7  | 36.4  58.4
RetinaNet  | 39.8  58.4  43.4  | 21.5  34.7  22.5  | 18.7  30.7  19.3  | 40.0  60.9
FCOS       | 43.6  60.6  46.9  | 21.0  33.7  21.6  | 19.1  30.2  19.6  | 42.8  62.6
SSD512     | 30.5  48.6  32.9  | 15.1  26.6  15.2  | 12.1  22.2  11.9  | 29.3  49.2

Fa.RCNN    | 17.5  40.6  48.6  | 3.8   18.3  31.5  | 6.2   16.6  24.7  | 21.5  46.6
RetinaNet  | 18.9  44.5  56.4  | 5.1   22.8  39.0  | 7.5   20.5  29.5  | 23.5  52.6
FCOS       | 22.1  48.8  58.7  | 5.3   22.5  37.4  | 8.0   20.8  30.0  | 26.5  54.5
SSD512     | 9.8   35.7  48.4  | 2.0   15.2  30.9  | 4.0   12.6  22.5  | 11.8  44.7

Table 6: Additional results of invariance analysis over COCOval2017 dataset. Fa.RCNN is short for FasterRCNN.
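Dataset        | Model       | AP    | - Cls. (Type I) | + Local. | - Duplicates | + Misses
objects only   | FasterRCNN  | 55.8  | 61.5            | 69.3     | 75.2         | 100
objects only   | RetinaNet   | 58.4  | 64.6            | 72.6     | 79.9         | 100
objects only   | FCOS        | 60.6  | 67.8            | 77.0     | 82.3         | 100

Gaussian blur  | FasterRCNN  | 29.6  | 37.2            | 47.4     | 55.2         | 100
Gaussian blur  | RetinaNet   | 34.7  | 42.3            | 53.5     | 64.3         | 100
Gaussian blur  | FCOS        | 33.7  | 43.1            | 56.8     | 65.3         | 100

vertical flip  | FasterRCNN  | 27.3  | 37.0            | 49.6     | 57.3         | 100
vertical flip  | RetinaNet   | 30.7  | 41.1            | 54.1     | 64.6         | 100
vertical flip  | FCOS        | 30.2  | 41.3            | 57.1     | 65.6         | 100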

Table 7: Error analysis of models over transformed images.
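
One way to read a row of Table 7 is as a cumulative sequence: the AP, followed by the AP after each error type is corrected in turn, ending at 100 once misses are also fixed. Under that reading (an assumption about the column semantics, since the header is terse), the contribution of each error type is the difference between consecutive columns. The helper `error_contributions` below is a hypothetical name, not part of the diagnosis tool.

```python
ERROR_STAGES = ["Cls. (Type I)", "Local.", "Duplicates", "Misses"]


def error_contributions(cumulative_ap):
    """Given [AP, AP after stage 1, ..., AP after the last stage],
    return the AP gained at each stage (assumed cumulative reading)."""
    if len(cumulative_ap) != len(ERROR_STAGES) + 1:
        raise ValueError("expected one AP value plus one value per error stage")
    gains = [
        round(after - before, 1)
        for before, after in zip(cumulative_ap, cumulative_ap[1:])
    ]
    return dict(zip(ERROR_STAGES, gains))


# FasterRCNN, objects-only row of Table 7.
fasterrcnn_objects_only = [55.8, 61.5, 69.3, 75.2, 100]
contributions = error_contributions(fasterrcnn_objects_only)
```

For this row, misses account for the largest gain (100 - 75.2 = 24.8 points), in line with misses being listed as the most important error type.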

7 Discussion and Conclusion

Through extensive analyses, we found that a) models perform significantly below what is empirically possible, b) the bottleneck in object detection is object recognition, and c) detection models lack generalization in terms of searching the right places, utilizing context, recognizing small objects, and robustness to image transformations. We did not find that the surrounding context of a target, or its nearby overlapping boxes, contributes significantly to classifying it better. Further investigation of this with extensive data augmentation and optimization may increase the accuracy but is unlikely to drastically improve the UAP. To evaluate the recognition component of a model, one can feed the target boxes to the model and collect its decisions on them. This is, however, cumbersome and needs to be coded for each model separately, whereas our diagnosis tool is general.

We invite researchers to periodically update the upper-bound in detection scores including AP and other recently proposed ones such as LIP [29] and probability-based detection quality [47]. The same can also be repeated for other tasks such as semantic and instance segmentation. Further, our new diagnosis tool can be employed to pinpoint weaknesses in other object detection models.