Object recognition is believed to be (almost) solved in computer vision, as witnessed by state-of-the-art models whose error rate falls below that of humans (about 3% top-5 error on ImageNet vs. about 5% human error rate, although the latter number has not been carefully measured). Object detection (footnote 2: the best published mAP (IOUs=.5:.95) on COCO2017 test-dev is 51.0 by EfficientDet; see https://competitions.codalab.org/competitions/20794#results for the latest results on the COCO dataset), however, remains largely unsolved (66% Avg. Prec. (AP), even at 50% overlap, on COCO val2017; Fig. 1), which is far below the theoretical upper bound. Detection is much more challenging than recognition, not only because precise localization is needed but also because objects can undergo drastic transformations such as in-plane and in-depth rotation, scale changes, partial occlusions, etc. There is a larger variation of scale in detection datasets; the median scales of object instances relative to the image in ImageNet (classification) and COCO (detection) are 0.554 and 0.106, respectively. Therefore, most object instances in COCO are smaller than 1% of the image area. As such, detection can be considered a litmus test for the capability of deep learning.
Several years of extensive research on object detection (footnote 3: please see [49, 90] for a review of generic object detection methods) have resulted in the accumulation of an overwhelming amount of knowledge regarding model backbones, tricks for model training and optimization, data collection and annotation, and model evaluation and comparison, to the point that separating the wheat from the chaff is very difficult. As an example, truly understanding and implementing Avg. Prec. (AP) is frustratingly difficult. A quick Google search returns numerous blogs and code repositories with discrepant explanations of AP. To make matters worse, it is not clear whether AP has started to saturate, whether progress is significant, and, more importantly, how far we can improve by following the current path, making one wonder whether we have reached the peak of performance using deep learning. Further, we do not know what is holding us back from making progress in object detection, compared to the human-level (although debatable) accuracy of object recognition models.
To shed light on the above matters, first we systematically and carefully approximate the empirical upper bound in AP. We hypothesize that the upper bound AP (UAP) is the score of the best recognition model that is trained on the training target bounding boxes and is then used to label the testing target boxes. We also investigate whether visual context surrounding a target object or its overlapping boxes can improve the upper bound AP. Second, we identify bottlenecks by characterising the type of errors that object detectors make and measure the impact of each one on performance. Third, we study the invariance properties of various object detectors on different types of transformations including incongruent context, scale, blur, vertical flip, etc.
In a nutshell, we find that there is a large gap between the performance of the best detection models and the empirical upper bound, as shown in Fig. 1. This suggests that there is hope of reaching this peak with the current tools, if we can find smarter ways to adopt object recognition models for object detection. We also find that classification remains the major bottleneck in object detection and is more critical for small objects. Specifically, object detection models inherit the main limitation of CNNs, namely the lack of invariance. Example failure cases include generating many bounding boxes on a white background containing a single object, and failing to detect objects in incongruent contexts or in vertically flipped or blurred images. It seems that humans can still manage to solve these tasks, although with higher effort and lower performance than on intact images.
2 Related Work
2.1 Object recognition, semantic segmentation, and object detection: A unified view
A large number of architectures have been proposed in the past for three seemingly different tasks in computer vision, namely object recognition, semantic segmentation and object detection. Here, we provide a unified view of these tasks, illustrated in Fig. 2. In a simplified recognition architecture, a number of convolutional filters are applied to the input images (using a backbone) to generate a vector of class scores for the whole image. In semantic segmentation, the output is a set of maps (one per class) at the input image resolution. Each element denotes the probability of the image pixel belonging to a specific class (e.g. sky, grass, car). Object detection falls somewhere in between to compromise speed vs. accuracy. For example, YOLO uses a grid as the output map, where each cell contains information about a few boxes/anchors at that location (e.g. top-left position, width, height, objectness value). As another example, the output in CenterNet consists of maps at the image resolution where the activity at each pixel determines the probability of it being the center of an object. Additional maps are also used to predict the width and height of the box centered at a point. As can be seen, the resolution of the output can be adjusted depending on whether we want to classify the entire image, classify every single pixel, or locate an object.
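To make the unified view concrete, the following minimal sketch lists illustrative output shapes for the three tasks. The function name `output_shapes`, the 7x7 grid, the two boxes per cell, and the exact channel layouts are assumptions chosen for illustration (real detectors also predict offsets and often work at a reduced output stride); they are not taken from the text above.

```python
def output_shapes(height, width, num_classes, grid=7, boxes_per_cell=2):
    """Return illustrative output tensor shapes for each task (assumed layouts)."""
    return {
        # One score per class for the whole image.
        "recognition": (num_classes,),
        # One score per class for every pixel.
        "segmentation": (height, width, num_classes),
        # Grid-style detector: each cell holds a few boxes
        # (x, y, w, h, objectness) plus per-class scores.
        "grid_detection": (grid, grid, boxes_per_cell * 5 + num_classes),
        # Center-point detector (simplified): one center heatmap per class
        # plus two maps for box width and height.
        "center_detection": (height, width, num_classes + 2),
    }


if __name__ == "__main__":
    for task, shape in output_shapes(224, 224, 80).items():
        print(task, shape)
```

The only thing that changes across the tasks in this sketch is the spatial resolution and the number of channels of the output.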
2.2 Diagnosing object detection models
Some related works strive to understand detection approaches, identify their shortcomings, and pinpoint where more research is needed. Parikh et al. aimed to find the weakest links in person detectors by replacing different components in a pipeline (e.g. part detection, non-maxima suppression) with human annotations. Mottaghi et al. proposed human-machine CRFs for identifying bottlenecks in scene understanding. Hoiem et al. inspected detection models in terms of their localization errors, confusion with other classes, and confusion with the background on the PASCAL dataset. They also conducted a meta-analysis to measure the impact of object properties such as color, texture, and real-world size on detection performance. We replicate, simplify and extend this work on the larger COCO dataset and on image transformations. Russakovsky et al. analyzed the ImageNet localization task and emphasized fine-grained recognition. Zhang et al. measured how far we are from solving pedestrian detection. Vondrick et al. proposed a method for visualizing object detection features to gain insights into their functioning. Some other related works in this line include [45, 86, 82, 28].
2.3 Object detection benchmarks
A number of studies strive to compare object detection models. Some works have analyzed and reported statistics and performances over benchmark datasets such as PASCAL VOC [23, 24], MSCOCO, CityScapes, and Open Images. Recently, Huang et al. performed a speed/accuracy trade-off analysis of modern object detectors. Dollar et al. and Borji et al. [11, 10, 8] compared models for person detection and salient object detection, respectively. Michaelis et al. assessed detection models on degraded images and observed about a 30–60% performance drop, which could be mitigated by data augmentation. In order to resolve the shortcomings of the AP score, some works have attempted to introduce alternative or complementary evaluation measures [56, 65]. A large number of works have also assessed object recognition models and their robustness (e.g. [68, 3, 61, 54, 9]).
2.4 Contextual influences in object detection
Visual context is believed to be a rich source of information about an object's identity, location and scale, especially when appearance information is weak (see Fig. 3). Torralba et al. introduced a framework to model the relationship between context and object properties based on the correlation between the statistics of low-level features across the entire scene and its objects. Several other works have studied the role of context in object detection and recognition (e.g. [4, 78, 52, 35, 74, 60, 66, 26, 84]). Heitz et al. proposed a probabilistic framework to capture contextual information between "stuff" and "things" to improve object detection on PASCAL VOC. Barnea et al. utilized co-occurrence relations among objects to improve the detection scores. Divvala et al. explored different types of context in recognition. See also [35, 18, 70, 38, 52, 1, 42, 88].
3 Experimental Setup
We base our analysis on two recent large-scale object detection benchmarks: MMDetection (https://github.com/open-mmlab/mmdetection) and Detectron2 (https://github.com/facebookresearch/detectron2). The former evaluates more than 25 models. The latter includes several variants of FastRCNN. In both benchmarks, all COCO models have been trained on train2017 and evaluated on val2017. Here, we use MMDetection to train and test additional models on a new dataset.
We consider the latest models published in the major vision conferences and the ones included in the above benchmarks. Several variants of the RCNN model including FasterRCNN, MaskRCNN, RetinaNet, GridRCNN, LibraRCNN, CascadeRCNN, MaskScoringRCNN, GAFasterRCNN, and Hybrid Task Cascade are considered. We also include SSD, FCOS, and CenterNet. Different backbones for each model are also taken into account.
We employ 4 datasets including:
PASCAL VOC: We use trainval0712 for training (16,551 images, 47,223 boxes) and test2007 (4,952 images, 14,976 boxes) for testing. This dataset has 20 categories.
FASHION dataset: This dataset covers 40 categories of clothing items (39 + humans). The trainval and test sets for this dataset contain 206,530 images (776,172 boxes) and 51,650 images (193,689 boxes), respectively. Panel A of the FASHION statistics figure displays samples from this dataset (along with additional statistics). This is a challenging dataset since clothing items are non-rigid, as opposed to COCO or VOC objects.
OpenImages: We use the OpenImages V4 dataset, which was also used in the Kaggle competition (https://www.kaggle.com/c/open-images-2019-object-detection). It has 500 classes and contains 1,743,042 images (12,195,144 boxes) for training and 41,620 images (226,811 boxes) for validation (used here for testing).
We use the COCO evaluation code (http://cocodataset.org/#detection-eval) to measure AP at IOU thresholds of 0.5 and 0.75, as well as the average AP over IOUs in the range 0.5:0.05:0.95. APs are calculated per class and are then averaged. We also report breakdown APs over small (area < 32² pixels), medium (32² < area < 96² pixels), and large (area > 96² pixels) objects. See also https://github.com/rafaelpadilla/Object-Detection-Metrics and https://firstname.lastname@example.org/which-one-to-measure-the-performance-of-object-detectors-ap-or-olrp-936d072a6eb0.
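The evaluation protocol above can be sketched in a few lines. This is a minimal illustration, not the COCO evaluation code itself: the names `IOU_THRESHOLDS`, `size_bucket`, and `mean_ap` are hypothetical, and the size cutoffs (32² and 96² pixels) are the standard COCO conventions.

```python
import numpy as np

# The ten IOU thresholds 0.50, 0.55, ..., 0.95 used for the averaged AP.
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)


def size_bucket(box_area):
    """Assign an object to the small / medium / large bucket by pixel area."""
    if box_area < 32 ** 2:
        return "small"
    if box_area < 96 ** 2:
        return "medium"
    return "large"


def mean_ap(ap_per_class):
    """Average per-class APs (a dict: class name -> AP) into a single score."""
    return float(np.mean(list(ap_per_class.values())))


if __name__ == "__main__":
    print(IOU_THRESHOLDS)
    print(size_bucket(20 * 20), size_bucket(50 * 50), size_bucket(200 * 200))
    print(mean_ap({"cat": 0.5, "dog": 0.7}))
```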
4 Characterizing the Empirical Upper Bound
We hypothesize that the empirical upper bound in AP is the score of a detector with ground truth bounding boxes labeled by the best object classifier. The classification score is considered as the detection score. This way we essentially assume that the localization problem is solved and what remains is only object recognition. However, it might be possible to improve upon this detector in at least two ways: a) by exploiting local context around an object to improve classification accuracy and hence better UAP, and b) by searching over the scene and finding boxes that are easier to classify (compared to the target box) and have enough overlap with the target box. This does not matter for the perfect IOU but may affect IOUs lower than one. We carefully investigate these possibilities in the following.
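The construction of this upper-bound detector can be sketched as follows. This is a minimal illustration under stated assumptions: `class_probs` stands for the softmax outputs of some classifier applied to the crops of the ground-truth boxes, and the function name and the dictionary layout of a detection are hypothetical.

```python
import numpy as np


def upper_bound_detections(gt_boxes, class_probs):
    """Turn ground-truth boxes into detections labeled by a classifier.

    gt_boxes:    sequence of N boxes (any coordinate convention).
    class_probs: array of shape (N, C) with per-class probabilities
                 predicted for the crop of each ground-truth box.
    The predicted class becomes the detection label and its probability
    becomes the detection score; the box coordinates are left untouched,
    so localization is perfect by construction.
    """
    class_probs = np.asarray(class_probs, dtype=float)
    labels = class_probs.argmax(axis=1)
    scores = class_probs.max(axis=1)
    return [
        {"box": tuple(box), "label": int(label), "score": float(score)}
        for box, label, score in zip(gt_boxes, labels, scores)
    ]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (5, 5, 20, 20)]
    probs = [[0.1, 0.9], [0.7, 0.3]]
    for det in upper_bound_detections(boxes, probs):
        print(det)
```

Feeding such detections to a standard AP evaluation then measures only the recognition component, which is the quantity the hypothesis refers to.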
|Dataset||object only||object + context||context only|
|test on||0.2||0.4||0.6||0.8||1||1.2||1.4||1.6||1.8||2||1.2||2||all img|
4.1 Utility of the surrounding context
We trained ResNet152 on target boxes in three settings as shown in Fig. 4: I) object only, II) object + context, and III) context only. Standard data augmentation techniques including mean pixel subtraction, color jittering, random horizontal flip and random rotation (10 degrees) were applied. Boxes were resized to 224×224 pixels and models were trained for 15 epochs. Trained models were tested on the original object box. Results (top-1 accuracy) shown in Table 1 reveal that the canonical object size contains the most information regarding the category of an object over all four datasets (footnote 7: please see the confusion-matrix figure for these classifiers). Enlarging or shrinking the object box lowers the performance. The context-only scenario leads to a high classification score but still performs below the other cases. Stretching the context to the whole scene drops the performance significantly. Training and testing models on the same condition (i.e. both on object + context) results in higher accuracy on that specific condition but does not lead to better overall recognition accuracy.
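The box variants used in these settings can be produced by rescaling a box around its center. The sketch below is one plausible reading: it assumes the factor applies to the side lengths (not the area), that boxes are given as (x, y, width, height), and that the result is clipped to the image; `scale_box` is a hypothetical helper name.

```python
def scale_box(box, factor, img_w, img_h):
    """Rescale an (x, y, w, h) box about its center and clip it to the image.

    factor < 1 keeps only the central part of the object,
    factor > 1 adds surrounding context.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * factor, h * factor
    x0 = max(0.0, cx - new_w / 2.0)
    y0 = max(0.0, cy - new_h / 2.0)
    x1 = min(float(img_w), cx + new_w / 2.0)
    y1 = min(float(img_h), cy + new_h / 2.0)
    return (x0, y0, x1 - x0, y1 - y0)


if __name__ == "__main__":
    target = (40, 40, 20, 20)
    for factor in (0.2, 1.0, 2.0):
        print(factor, scale_box(target, factor, img_w=100, img_h=100))
```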
4.2 Searching for the best label
Essentially, the problem here is how to obtain the best classification accuracy for the objects in a scene by utilizing all the information in the scene. This is different from recognition approaches that treat objects in isolation. Note that recognition accuracy is not the same as AP, since detection scores also matter in the AP calculation.
Having the best classifier at hand, we are ready to approximate the empirical upper bound in AP. Before delving into details, let us first recap how AP is calculated.
AP calculation. For each category, detections over all images are sorted according to their confidences. Starting from the top of this list, the target with the highest IOU with each detection is considered. We have a true positive (TP; hit) if their IOU is at or above the IOU threshold and if that target has not been assigned yet. We have a false positive (FP) if the IOU is below the threshold (i.e. localization error) or if the target has already been assigned (i.e. duplicate; two predictions on the same target). A target box can be matched with only one detection (the one with the highest confidence score and IOU). If a detection has sufficient IOU with two targets, it is assigned to the one with the highest IOU that is not already assigned. Scanning the sorted detection list again, a precision for each recall is obtained and is used to draw the Recall-Precision (RP) curve and to compute the AP.
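The procedure above can be sketched for a single category and a single IOU threshold as follows. This is a minimal illustration rather than the COCO or VOC implementation: boxes are assumed to be (x1, y1, x2, y2), the function names are hypothetical, and the area under the curve is computed with all-point interpolation, whereas the COCO code samples precision at 101 recall levels.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, targets, iou_thr=0.5):
    """AP for one category.

    detections: list of (image_id, score, box).
    targets:    dict image_id -> list of ground-truth boxes.
    """
    n_targets = sum(len(boxes) for boxes in targets.values())
    assigned = {img: [False] * len(boxes) for img, boxes in targets.items()}
    hits = []
    # Visit detections from the most to the least confident.
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, target in enumerate(targets.get(img, [])):
            if assigned[img][j]:
                continue  # each target can be matched only once
            overlap = iou(box, target)
            if overlap >= iou_thr and overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0:
            assigned[img][best_j] = True
            hits.append(1)  # true positive
        else:
            hits.append(0)  # false positive: mislocalized or duplicate
    if n_targets == 0 or not hits:
        return 0.0
    hits = np.array(hits)
    tp, fp = np.cumsum(hits), np.cumsum(1 - hits)
    recall, precision = tp / n_targets, tp / (tp + fp)
    # Make precision monotonically non-increasing, then integrate over recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


if __name__ == "__main__":
    gts = {0: [(0, 0, 10, 10)], 1: [(0, 0, 10, 10)]}
    dets = [(0, 0.9, (0, 0, 10, 10)), (0, 0.8, (0, 0, 10, 10)), (1, 0.7, (0, 0, 10, 10))]
    print(average_precision(dets, gts))
```

In the usage example, the second detection is a duplicate on an already matched target, so it counts as a false positive and lowers the AP below 1 even though both targets are eventually found.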
We explore two strategies in pursuit of the upper bound AP. In the first strategy, we apply the best classifier from the previous section to the target boxes. The detector built in this fashion gives the same AP regardless of the IOU threshold, since our detections are the target boxes. As we argued above, it is not possible to improve this detector at IOU=1. However, if we are interested in the upper bound for a lower IOU threshold (i.e. below 1), then it might be possible to do better by searching among the candidate boxes near a target box and choosing the one that can be classified better than the target box, or by aggregating information from nearby boxes. Thus, in our second strategy, we sample boxes around an object and either apply the original classifier (trained on the canonical object size) or train and test new classifiers on the surrounding boxes. In any case, we always keep the target box but change its label and/or its confidence. First, let us take a look at our box sampling approach, which is illustrated in Fig. 5.
Sampling boxes with IOU above a threshold. Here, we are interested in finding the coordinates of the top-left corner of all rectangles whose IOU with the ground-truth bounding box is above a given threshold. We use the coordinate system centered at the top-left corner of the target box, which can be easily converted to the image-level coordinate frame. The bottom-right coordinates of the desired rectangles that intersect with the target box from the top-left follow an equation involving the width and height of the rectangles (we assume all boxes have the same width and height as the target box). According to the illustration in Fig. 5(A), we have:
From these equations and assuming , and , it is easy to derive the following equations:
The same equation governs the coordinates of the bottom-left, top-left, and top-right corners of the rectangles intersecting with the target box at points , , and , respectively (in the coordinate frames centered at these points, in order). Calculating the top-left corner of these rectangles (in their corresponding coordinate frames) and representing them in the coordinate frame of point , we arrive at the following four equations (note that these are not lines):
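As a hedged sketch of the geometry described above, the constraint can be derived under the stated assumption that every sampled rectangle has the same width and height as the target box. The symbols used here ($w$, $h$ for the box size, $\gamma$ for the IOU threshold, $(d_x, d_y)$ for the shift of a rectangle relative to the target, and $(x, y)$ for the bottom-right corner in the frame centered at the target's top-left corner) are names assumed for illustration.

```latex
% Intersection and IOU of two w-by-h boxes shifted by (d_x, d_y):
I = (w - |d_x|)\,(h - |d_y|), \qquad
\mathrm{IOU} = \frac{I}{2wh - I}

% Requiring IOU >= gamma and solving for I:
\frac{I}{2wh - I} \ge \gamma
\;\Longleftrightarrow\;
I \ge \frac{2\gamma}{1 + \gamma}\, wh

% Top-left case: the bottom-right corner (x, y) lies inside the target box,
% so I = xy and the boundary of the admissible region is a hyperbola:
xy \ge \frac{2\gamma}{1 + \gamma}\, wh, \qquad 0 \le x \le w,\; 0 \le y \le h
```

The top-left corner of such a rectangle is then $(x - w,\, y - h)$. Because of the absolute values in $I$, the same inequality covers the other three corner cases by symmetry, and the boundary in each case is a hyperbola branch rather than a line.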
|Acc.||Most Confident Box||Most Frequent Label|
Table 2: Results of our second strategy for estimating the upper bound AP (i.e. searching for the best bounding box or object label near a target box, among boxes with IOU above the threshold). Notice that the upper bounds for AP, AP at IOU=0.5, and AP at IOU=0.75 are all the same. Underlined numbers show where we could improve over the first strategy. Top rows: using a classifier trained on surrounding boxes. Bottom rows: using the original classifier trained on the canonical object size.
Using the above equations, we then sample a few (here, 4) rectangles with IOU above the threshold (Fig. 5(B)) and label them with the label of the target box. We then train a new classifier (the same ResNet152 as above) on these boxes. This is effectively a new data augmentation technique. Notice that AP is a direct consequence of the classification accuracy, so if we can better classify objects we can obtain a better AP. To estimate UAP, we sample a number of rectangles (4) near a target box (all with IOU above the threshold), and then label the target box with: a) the label (and confidence) of the box with the highest classification score (i.e. most confident box), or b) the most frequent label among the nearby boxes (with the maximum confidence score among them).
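The sampling and relabeling steps can be sketched as below. This is an illustration under several assumptions: boxes are (x, y, width, height), sampled boxes keep the size of the target, shifts are drawn directly inside the region where the IOU constraint holds (so the draw is not uniform over that region), and in the most-frequent-label rule the confidence is taken as the maximum over boxes carrying the winning label. All function names are hypothetical.

```python
import random
from collections import Counter


def iou_xywh(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def sample_boxes(box, min_iou, n=4, seed=0):
    """Sample n same-size boxes whose IOU with `box` is at least min_iou."""
    rng = random.Random(seed)
    x, y, w, h = box
    # Minimum fraction of the box area that must be shared: 2*g / (1 + g).
    c = 2.0 * min_iou / (1.0 + min_iou)
    boxes = []
    for _ in range(n):
        max_dx = w * (1.0 - c)
        dx = rng.uniform(-max_dx, max_dx)
        # Largest vertical shift that still keeps enough overlap given dx.
        max_dy = max(0.0, h * (1.0 - c * w / (w - abs(dx))))
        dy = rng.uniform(-max_dy, max_dy)
        boxes.append((x + dx, y + dy, w, h))
    return boxes


def most_confident(predictions):
    """predictions: list of (label, confidence); keep the top-scoring one."""
    return max(predictions, key=lambda p: p[1])


def most_frequent(predictions):
    """Majority label, paired with the highest confidence among its votes."""
    label = Counter(lbl for lbl, _ in predictions).most_common(1)[0][0]
    return (label, max(conf for lbl, conf in predictions if lbl == label))


if __name__ == "__main__":
    target = (50.0, 50.0, 40.0, 30.0)
    for b in sample_boxes(target, min_iou=0.7):
        print(b, round(iou_xywh(target, b), 3))
    preds = [("cat", 0.6), ("dog", 0.9), ("cat", 0.7)]
    print(most_confident(preds), most_frequent(preds))
```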
4.3 Upper bound results
Here, we report classification scores, upper bound APs, score of the models (mean AP over all IOUs; unless specified otherwise), and the breakdown AP over categories.
Comparison of strategies. Summary results of the first strategy are shown in Fig. 1. As expected, UAPs over all IOUs are the same and are much better than the models. To our surprise, our second strategy did not lead to better UAP values, except for a few cases including UAPs over medium and small objects on the FASHION dataset and small objects on COCO (using most confident boxes), as shown in Table 2. Applying the original classifier, instead of training new ones on surrounding boxes, or only sampling boxes with higher IOU (e.g. 0.9) did not improve the results. Also, setting the confidence of detections to 1 lowers the UAP. We attribute the failure of the second strategy to the fact that the surrounding boxes may contain additional visual content which may introduce noise in the labels. This leads to a lower classification accuracy and hence a lower AP. Therefore, in what follows we only discuss the results from the first strategy.
PASCAL VOC. Fig. 6 shows results using both VOC and COCO evaluation codes. The VOC evaluation code is based on IOU=0.5 and calculates the area under the PR curve slightly differently from COCO. For VOC, we adopt the code from the CenterNet repository (https://github.com/xingyizhou/CenterNet). We have trained and tested 5 models on this dataset including FasterRCNN, FCOS, SSD512, and two variants of CenterNet. The classification accuracy on VOC is very high (94.7%). Consequently, the UAP is very high (91.6 using the COCO API). The FCOS model does the best here with an AP of 47.9 (right panel in Fig. 6; dashed lines). As can be seen, there is a large gap between the AP of the best model and the UAP on this dataset (about 44 points: 91.6 vs. 47.9). Models are consistent in their performance across different categories.
FASHION. Results are shown in Fig. 7. The best classification accuracy on this dataset is 88.8% (Table 1). The UAP is 71.2 and the AP of the best model is 59.7 (FCOS). Interestingly, FCOS performs quite close to the upper bound at IOU=0.5 (Fig. 1). Models perform better here than over VOC. The FASHION UAP is lower than the VOC UAP, perhaps because classification is more challenging on the former dataset. The gap between UAP and model AP here, however, is much smaller than on VOC. This could be partly due to the fact that FASHION scenes have less clutter and larger objects than the VOC scenes. While per-class UAP is above the AP of the best model over all VOC classes, UAPs of 5 FASHION categories fall below the best model AP (messenger bags, tunics, long sleeve shirts, blouses, and rompers). Looking at the classification scores, we find that these categories have low classification accuracy.
COCO. Existing benchmarks have provided an efficient ecosystem for developing, evaluating and comparing detection models, especially on the COCO dataset. They provide trained models over a variety of settings. Borrowing the MMDetection benchmark and adding the results from CenterNet to it, we end up comparing 15 models (71 in total; combination of models and backbones). Model scores are shown in Fig. 8. The best models here are the Hybrid Task Cascade model and Cascade MaskRCNN, with APs of 46.9 and 45.7, respectively. The upper bound AP on COCO is about 78.2. Recall that UAP does not depend on the IOU threshold since detected boxes are classified ground truth targets. The gap between the best model AP and UAP is above 30. The gap is much smaller for AP at IOU=0.5, where it is about 10. The UAP is much lower over small objects than over large objects. This also holds for models. The gap between UAP and model AP over small objects is about 35, which is much higher than the gap over medium or large objects.
Breakdown APs over object categories are shown in Fig. 9. For this, we use the Detectron2 benchmark which reports per-category results mainly over RCNN model family. We noticed that aggregate scores on MMDetection and Detectron2 are quite consistent. Among 18 variants of Faster-RCNN and MASK-RCNN, the best model has the AP of 44.3 (shown by the dashed line) which is lower than the best available model on COCO (46.9; Fig. 1) and the upper bound AP. Among 80 classes, only three (snowboard, toothbrush, and toaster) have UAPs below the best model APs.
A summary of upper bound precision and recall values over the VOC, FASHION and COCO datasets is provided in Table 3.
|Metric||VOC: UAP||VOC: model (%)||VOC: FCOS (%)||FASHION: UAP||FASHION: model||FASHION: FCOS||COCO: UAP||COCO: model||COCO: model|
|Avg. Prec. (AP) @[ IoU=0.50:0.95, area=all, maxDets=100 ]||0.916||47.3||47.9||0.712||0.541||0.597||0.782||0.364||0.428|
|Avg. Prec. (AP) @[ IoU=0.50, area=all, maxDets=100 ]||0.916||71.3||71.0||0.712||0.698||0.711||0.782||0.584||0.626|
|Avg. Prec. (AP) @[ IoU=0.75, area=all, maxDets=100 ]||0.916||52.6||51.4||0.712||0.614||0.647||0.782||0.391||0.457|
|Avg. Prec. (AP) @[ IoU=0.50:0.95, area=small, maxDets=100 ]||0.707||08.6||11.1||0.457||0.108||0.182||0.635||0.215||0.265|
|Avg. Prec. (AP) @[ IoU=0.50:0.95, area=medium, maxDets=100 ]||0.861||30.7||32.1||0.614||0.315||0.376||0.816||0.400||0.469|
|Avg. Prec. (AP) @[ IoU=0.50:0.95, area=large, maxDets=100 ]||0.941||58.1||58.4||0.721||0.570||0.627||0.846||0.466||0.545|
|Avg. Rec. (AR) @[ IoU=0.50:0.95, area=all, maxDets=1 ]||0.579||40.3||41.2||0.662||0.618||0.692||0.483||0.304||0.345|
|Avg. Rec. (AR) @[ IoU=0.50:0.95, area=all, maxDets=10 ]||0.908||53.8||58.5||0.767||0.712||0.822||0.797||0.489||0.552|
|Avg. Rec. (AR) @[ IoU=0.50:0.95, area=all, maxDets=100 ]||0.930||54.1||59.5||0.774||0.714||0.824||0.812||0.514||0.582|
|Avg. Rec. (AR) @[ IoU=0.50:0.95, area=small, maxDets=100 ]||0.736||11.2||19.5||0.504||0.194||0.303||0.663||0.324||0.388|
|Avg. Rec. (AR) @[ IoU=0.50:0.95, area=medium, maxDets=100 ]||0.877||36.9||45.2||0.660||0.499||0.639||0.843||0.554||0.628|
|Avg. Rec. (AR) @[ IoU=0.50:0.95, area=large, maxDets=100 ]||0.954||65.7||70.2||0.782||0.742||0.850||0.893||0.645||0.735|
OpenImages. This dataset is the latest endeavor in object detection and is much more challenging than its predecessors. Our classifier achieves 69.0% top-1 accuracy on the validation set of OpenImages V4, which is lower than on the other three datasets. We achieve 58.9 UAP on this dataset, using the TensorFlow evaluation API for computing AP (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/challenge_evaluation.md), which differs from the COCO AP calculation (here we discarded grouping and super-category). We are not aware of any model scores on this set of OpenImages V4.
AP vs. classification accuracy. We found that there is a linear positive correlation (0.81 on COCO) between the UAP and the classification accuracy (Fig. 10). The higher the classification accuracy, the higher the UAP. We did not find a correlation between the accuracy and model APs, nor between the object size and accuracy (or UAP). The dependency of UAP on accuracy highlights the importance of recognition in object detection and constitutes the core of our analyses in the next two sections.
5 Error Diagnosis
To pinpoint the shortcomings of object detectors, we follow the analysis by Hoiem et al., but revise it in two major ways. First, instead of inspecting errors across categories, here we perform a per-category error analysis (i.e. in a binary manner). This simplifies the process and makes it easier to understand. See Fig. 11. We combine all types of class confusions (e.g. similar classes, other classes, and background in Hoiem et al.) into two types of classification errors: a) confusion with the background (Type I), and b) misses (Type II). Notice that this implicitly contains the above misclassification types but is much easier to investigate. In fact, recent object detectors such as FCOS and CenterNet also adopt this strategy to classify objects (i.e. an object is of a particular class or is not). Second, Hoiem et al. successively remove errors to reach an AP of 1. We argue that this approach conflates different error types and does not correctly reflect the true contribution of errors (i.e. it understates or exaggerates error types). For example, according to the COCO analysis tool, any matches to objects with a different class label but in the same supercategory do not count as either a FP or a TP. Also, the COCO tool removes mislocalized predictions. In this case, we argue that correcting the mislocalized predictions is more effective than removing them because it can reveal other sources of weakness in a model. For example, it may lead to generating duplicates which would have been overlooked by removing the detections. In contrast, here we explicitly handle the errors by removing, correcting or adding detections when appropriate. Similar to Hoiem et al., our analysis is also based on IOU=0.5.
We repeat the following procedure for each category-image pair (shown in Fig. 11; from left to right). First, we remove the detections whose maximum IOU with any target falls below a lower cutoff (i.e. classification error Type I; confusion with the background). Second, we correct the mislocalized predictions, i.e. those whose maximum IOU lies between that cutoff and the matching threshold. In this step, the coordinates of these boxes are replaced with their matching target box coordinates (the target with the max IOU) while their confidence scores and labels are preserved. Third, duplicates (i.e. redundant detections) are removed. An unmatched detection is considered a duplicate if it overlaps (i.e. has sufficient IOU with) a target that already has an assigned detection with a higher score. Fourth, misses are treated. A miss is a target that lacks sufficient IOU with any unmatched detection; it is added to the list of detections (with a score of 1). Before performing this step, we set the coordinates of detections to the coordinates of their matching targets, since we now know which prediction is paired with which target (i.e. one-to-one mapping; no duplicates).
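The four steps can be sketched for a single category-image pair as follows. This is a simplified illustration, not the authors' code: boxes are (x1, y1, x2, y2), `fg_thr=0.5` follows the IOU used in this analysis, `bg_thr=0.1` is a hypothetical lower cutoff chosen only for the example, and each detection is tied to its single best-overlapping target.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diagnose(detections, targets, fg_thr=0.5, bg_thr=0.1):
    """Progressively fix error types; return the detection list per stage.

    detections: list of (score, box); targets: list of boxes.
    """
    def best_target(box):
        if not targets:
            return -1, 0.0
        overlaps = [iou(box, t) for t in targets]
        j = max(range(len(targets)), key=lambda k: overlaps[k])
        return j, overlaps[j]

    stages = {"original": list(detections)}
    # Step 1: drop detections that barely touch any target (Type I).
    kept = [d for d in detections if best_target(d[1])[1] >= bg_thr]
    stages["no_background"] = kept
    # Step 2: snap mislocalized boxes onto their best target, keep the score.
    fixed = []
    for score, box in kept:
        j, overlap = best_target(box)
        fixed.append((score, targets[j] if overlap < fg_thr else box))
    stages["localization_fixed"] = fixed
    # Step 3: keep only the highest-scoring detection per target.
    taken, unique = set(), []
    for score, box in sorted(fixed, key=lambda d: -d[0]):
        j, _ = best_target(box)
        if j in taken:
            continue  # duplicate on an already matched target
        taken.add(j)
        unique.append((score, targets[j]))  # use the target coordinates
    stages["no_duplicates"] = unique
    # Step 4: add every unmatched target as a detection with score 1.
    missed = [(1.0, t) for j, t in enumerate(targets) if j not in taken]
    stages["misses_added"] = unique + missed
    return stages


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)),
            (0.7, (50, 50, 60, 60)), (0.6, (0, 0, 10, 4))]
    for name, stage in diagnose(dets, gts).items():
        print(name, stage)
```

Evaluating AP on the detection list of each stage, in order, gives the incremental gain attributable to each error type.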
Results of error diagnosis are shown in Table 4 for 3 models over 3 datasets. We start from the original detection set and progressively measure the impact of fixing each error type in the order explained above and shown in Fig. 11. Confusion with the background (and other classes; see above) has the highest contribution to the overall error, across all models. This indicates that models often falsely confuse background clutter or other classes as a particular object category. The second most important error type is misses. Interestingly, localization error weighs more than duplicates and has higher impact on COCO and VOC datasets than the FASHION dataset, possibly because the former two contain a larger number of small objects. Conversely, over the FASHION dataset, duplicates matter more, perhaps because class confusion is higher (e.g. confusion in slippers vs. sandals; different types of hats, etc.). Models behave almost consistently across the three datasets.
We also cross-checked our results with results obtained using the COCO analysis tool (implementing Hoiem et al.). Notice that numbers from the COCO analysis tool are not directly comparable to ours since our strategy is different and, unlike ours, it does not explicitly address duplicate errors. Nevertheless, based on APs and PR curves in Fig. 12, we arrive at similar conclusions. Here, again we observe that classification error Type I (Sim, Oth, and BG in Fig. 12) accounts for the largest fraction of errors, followed by misses (FN) and localization (Loc) errors. Fig. 13 shows the breakdown of error analysis over small, medium, and large objects using the COCO analysis tool. As can be seen, all three models miss a much larger number of small objects compared to medium or large ones. As expected, models obtain a much higher mAP over large objects than small ones. A lot of background regions, however, are still classified as large objects.
6 Invariance Analysis
Complementary to our error diagnosis, here we conduct a series of experiments to reduce the impact of localization or recognition in detection pipelines (one at a time). Our principal emphasis is on the recognition component. These experiments are performed over the COCOval2017 set and are illustrated in Fig. 14 and Fig. 16. Trained models, over the COCOtrainval0712 set, are employed.
Analysis of context.
In the first experiment, we generated stimuli in which a single object was placed on a white background or on a white noise background (one object per image, hence the number of images equals the number of objects). Contrary to our expectation, we found that models either underestimate or overestimate the distribution of target bounding boxes. Table 5 shows the number of bounding boxes generated by models. All models overestimate the number of ground-truth bounding boxes, which is 36,781. Interestingly, FasterRCNN generates a significantly lower number of boxes compared to other models. Fig. 15 shows the difference between the distribution of predicted boxes and the distribution of ground-truth boxes. Interestingly, models search all over the place. FasterRCNN and RetinaNet oversample boxes around targets, while FCOS generates a fair amount. This hints at shortcomings in objectness prediction in models. Quantitative results, presented in Table 6, show that models perform poorly on these images (about the same in both conditions but lower than on the original images). They are hindered much more on small objects than on medium or large ones, which shows how critical context is for recognition and detection of small objects. Interestingly, in the white/noise BG and object-only cases, the AP-large increases but the AP-small decreases (compared to the original images). FCOS, ranking higher on original images, does better here as well.
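Stimuli of this kind can be generated with a few lines of NumPy. The sketch below makes assumptions that the text does not specify: the object is taken to be its bounding box (not a segmentation mask), the noise is uniform per-pixel noise, and the function name and argument layout are hypothetical.

```python
import numpy as np


def object_on_plain_background(image, box, mode="white", seed=0):
    """Keep one object and replace everything else.

    image: H x W x 3 uint8 array; box: integer (x, y, w, h).
    mode:  "white" for a white canvas, "noise" for a random-noise canvas.
    """
    x, y, w, h = box
    if mode == "white":
        canvas = np.full_like(image, 255)
    elif mode == "noise":
        rng = np.random.default_rng(seed)
        canvas = rng.integers(0, 256, size=image.shape, dtype=np.uint8)
    else:
        raise ValueError("mode must be 'white' or 'noise'")
    # Paste the object back at its original location.
    canvas[y:y + h, x:x + w] = image[y:y + h, x:x + w]
    return canvas


if __name__ == "__main__":
    img = np.full((8, 8, 3), 7, dtype=np.uint8)
    print(object_on_plain_background(img, (3, 2, 3, 2))[:, :, 0])
```

Calling this once per annotated object yields one image per object, matching the one-object-per-image setup described above.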
|Model||white BG||noise BG||objects_only|
In the second experiment, the object-only case, we removed the image background and preserved all the objects (hence the same number of images as in COCOval2017). To our surprise, FCOS and SSD performed better on these images than on the original ones (Column 1 vs. 10 in Table 6). Compared to the original images, they did better on large objects and worse on small objects in the object-only case.
In the third experiment, we paste objects onto incongruent backgrounds (e.g. a boat in the street), similar to Rosenfeld et al. but over a larger dataset and including more models. Also, unlike Rosenfeld et al., we report the AP. We paste 9 objects including bear, keyboard, refrigerator, surfboard, train, tv, cake, horse, and oven on 100 images taken from the FASHION dataset; 900 images in total. Fig. 16 shows some examples. Results are given in Table 7. Interestingly, models performed well on this dataset. They failed drastically on surfboard and oven, which seem to be a little hard for humans. Cake, bear, and horse were the easiest ones. FCOS did the best among models. Overall, we did not observe a dramatic failure of models in detecting out-of-context objects, at least on the set of objects we tried. Nonetheless, in some other scenarios (e.g. smaller objects) models may fail to detect objects out of their common contexts. This aligns with the current view that deep learning models fit themselves to the statistics of the datasets. We believe that retraining the object detectors on these examples can alleviate the problem to some extent. Similar attempts have been made in the past to strengthen object detectors by training them on degraded images (e.g. as in Michaelis et al.) or to make recognition models robust to adversarial examples through adversarial training (e.g. as in Goodfellow et al.).
Robustness to image transformations. In the fourth experiment, we evaluated models on objects that were a) cropped right out of the image, or b) cropped and resized such that their smallest dimension became 300 pixels (while preserving the aspect ratio). Models performed terribly in both cases as shown in Table 8, with RetinaNet doing better. Poor performance here demonstrates how sensitive models are to object scale and that they lack robustness to object appearance. Visually inspecting the images, we found it very difficult to recognize the cropped objects, especially the small ones.
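The crop-and-resize condition reduces to a small piece of arithmetic. The sketch below is an illustration under stated assumptions: boxes are in COCO-style `[x, y, width, height]` format, images are nested lists, and the function names are hypothetical. Interpolation is left out because the text does not specify it.

```python
def crop(image, box):
    """Crop a COCO-style [x, y, width, height] box from a nested-list image."""
    x, y, w, h = (int(v) for v in box)
    x0, y0 = max(0, x), max(0, y)
    return [row[x0:x + w] for row in image[y0:y + h]]


def resize_to_min_side(width, height, target=300):
    """Return the new (width, height) so the smaller side equals `target`
    while the aspect ratio is preserved."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = crop(image, [1, 1, 3, 2])            # 3 wide, 2 tall
new_size = resize_to_min_side(40, 100)       # a small, tall object
```

A 40 × 100 object is upsampled by a factor of 7.5, which gives a sense of how much interpolation blur small objects receive in this condition.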
The fifth and sixth experiments test models under Gaussian blur (with an 11 × 11 kernel) and vertical flip, respectively. Results in Table 8 show that both transformations dramatically hinder performance, with a higher impact for vertical flip. We do not have a baseline for human performance in these cases, but a quick inspection suggests that it is still possible to detect the objects, albeit with more effort. RetinaNet and FCOS outperform the other models here.
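Both transformations are easy to reproduce. The sketch below builds a normalized 11-tap Gaussian kernel (whose outer product gives the 11 × 11 kernel) and flips an image together with its ground-truth boxes. The value `sigma=2.0` is an assumption, since the text gives only the kernel size, and the COCO-style `[x, y, width, height]` box format and the function names are likewise illustrative.

```python
import math


def gaussian_kernel_1d(size=11, sigma=2.0):
    """Normalized 1-D Gaussian; sigma is an assumed value, not from the paper."""
    half = size // 2
    weights = [math.exp(-((i - half) ** 2) / (2 * sigma ** 2))
               for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_kernel_2d(size=11, sigma=2.0):
    """Separable 2-D kernel as the outer product of the 1-D kernel."""
    k = gaussian_kernel_1d(size, sigma)
    return [[a * b for b in k] for a in k]


def vflip_image(image):
    """Flip a nested-list image top to bottom."""
    return image[::-1]


def vflip_box(box, image_height):
    """Flip a [x, y, width, height] box so it still covers the same object."""
    x, y, w, h = box
    return [x, image_height - y - h, w, h]


kernel = gaussian_kernel_2d()
flipped_box = vflip_box([10, 20, 30, 40], image_height=100)
```

Note that the ground-truth annotations must be flipped along with the image; otherwise the evaluation would penalize correct detections.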
Analysis of errors. Here we measure the impact of each error type in three detection tasks including object-only, Gaussian blur and vertical flip. See Table 9 for results. Error types in order of importance include: misses, localization, misclassification (Type I), and duplicates, over three tasks. Models miss more objects in vertical flip and Gaussian blur cases compared to the objects-only case. There is less confusion with BG in objects-only case than original images (classification Type I) since there is no background clutter.
Finally, Table 10 and the accompanying per-category table summarize all results of the invariance analysis experiments, including precision, recall, and breakdown over categories.
|Model||crop||Gaussian blur||vertical flip||orig img.|
- Cls. (Type I)
|Avg. Prec. (AP)||@[ IoU=50:95||| area= all||| maxDets=100 ]||33.1||32.7||14.3||0.084||0.398||0.215||0.187||0.569|
|Avg. Prec. (AP)||@[ IoU=50||| area= all||| maxDets=100 ]||41||39.1||19.4||0.15||0.584||0.347||0.307||0.659|
|Avg. Prec. (AP)||@[ IoU=75||| area= all||| maxDets=100 ]||37.3||36.6||15.9||0.082||0.434||0.225||0.193||0.636|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= small||| maxDets=100 ]||8.3||6.4||-1||0||0.189||0.051||0.075||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||37.5||38.3||0.001||0.013||0.445||0.228||0.205||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= large||| maxDets=100 ]||53.2||54.2||16.1||0.187||0.564||0.39||0.295||0.571|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 1 ]||55||54.3||31.7||0.214||0.335||0.223||0.216||0.67|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 10 ]||57.1||56.9||35.5||0.254||0.52||0.342||0.349||0.681|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets=100 ]||57.2||56.9||35.7||0.255||0.549||0.358||0.369||0.681|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= small||| maxDets=100 ]||25.1||22||-1||0.045||0.309||0.089||0.152||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||68.4||70.6||6.8||0.164||0.608||0.385||0.384||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= large||| maxDets=100 ]||81.4||83.2||35.8||0.435||0.73||0.625||0.577||0.681|
|RetinaNet||Avg. Prec. (AP)||@[ IoU=50:95||| area= all||| maxDets=100 ]||31.1||31.8||11.2||0.169||0.359||0.21||0.155||0.502|
|Avg. Prec. (AP)||@[ IoU=50||| area= all||| maxDets=100 ]||40.2||39.8||16.9||0.227||0.558||0.337||0.273||0.715|
|Avg. Prec. (AP)||@[ IoU=75||| area= all||| maxDets=100 ]||36.1||36.8||12.1||0.188||0.395||0.216||0.157||0.579|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= small||| maxDets=100 ]||7.5||7||-1||0.001||0.175||0.053||0.062||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||35.9||36.6||0.005||0.052||0.406||0.225||0.166||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= large||| maxDets=100 ]||49.9||52.1||13.2||0.341||0.486||0.374||0.247||0.511|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 1 ]||47||48.8||21.8||0.396||0.302||0.222||0.187||0.575|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 10 ]||48.5||50.1||24.5||0.452||0.474||0.344||0.294||0.596|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets=100 ]||48.5||50.1||24.6||0.454||0.495||0.357||0.307||0.596|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= small||| maxDets=100 ]||16.1||15.8||-1||0.099||0.266||0.094||0.119||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||58.7||61.3||10||0.459||0.554||0.385||0.322||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= large||| maxDets=100 ]||73.9||77.4||24.6||0.688||0.657||0.615||0.492||0.596|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= all||| maxDets=100 ]||34.5||34.2||12.1||0.143||0.436||0.171||0.191||0.651|
|Avg. Prec. (AP)||@[ IoU=50||| area= all||| maxDets=100 ]||40.2||39.8||15.7||0.185||0.606||0.296||0.302||0.705|
|Avg. Prec. (AP)||@[ IoU=75||| area= all||| maxDets=100 ]||37.1||37.4||13.2||0.153||0.469||0.174||0.196||0.685|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= small||| maxDets=100 ]||8.5||9.4||-1||0.001||0.221||0.038||0.08||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||39.8||39.5||0.001||0.045||0.488||0.183||0.208||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= large||| maxDets=100 ]||55.2||54.8||14.4||0.322||0.587||0.315||0.3||0.654|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 1 ]||60.4||60.6||36.7||0.454||0.357||0.192||0.221||0.765|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 10 ]||64.1||66.1||41.6||0.526||0.566||0.285||0.355||0.783|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets=100 ]||64.3||66.2||41.7||0.527||0.594||0.293||0.374||0.783|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= small||| maxDets=100 ]||34.2||38.8||-1||0.188||0.367||0.055||0.17||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||76.8||78.3||23.5||0.537||0.655||0.311||0.386||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= large||| maxDets=100 ]||85.8||87.1||41.8||0.758||0.759||0.535||0.575||0.783|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= all||| maxDets=100 ]||27.4||26||10||0.134||0.305||0.151||0.121||0.523|
|Avg. Prec. (AP)||@[ IoU=50||| area= all||| maxDets=100 ]||36.7||33.4||14.2||0.189||0.486||0.266||0.222||0.673|
|Avg. Prec. (AP)||@[ IoU=75||| area= all||| maxDets=100 ]||32.3||30.4||11.2||0.149||0.329||0.152||0.119||0.628|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= small||| maxDets=100 ]||7||4.6||-1||0.001||0.098||0.02||0.04||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||31.4||29.3||0||0.029||0.357||0.152||0.126||-1|
|Avg. Prec. (AP)||@[ IoU=50:95||| area= large||| maxDets=100 ]||45.1||45.2||10.8||0.257||0.484||0.309||0.225||0.523|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 1 ]||45.8||43.1||20||0.288||0.273||0.172||0.156||0.581|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets= 10 ]||47.3||44.6||21.5||0.317||0.407||0.246||0.236||0.594|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= all||| maxDets=100 ]||47.5||44.7||21.7||0.321||0.429||0.259||0.255||0.594|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= small||| maxDets=100 ]||16.4||11.5||-1||0.08||0.152||0.026||0.076||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area=medium||| maxDets=100 ]||57.7||53.8||5.7||0.243||0.503||0.269||0.265||-1|
|Avg. Rec. (AR)||@[ IoU=50:95||| area= large||| maxDets=100 ]||71.3||71.1||21.8||0.496||0.634||0.5||0.427||0.594|
Summary of the learned lessons. Through exhaustive analyses, we found that a) models perform significantly below what is empirically possible, b) the performance gap is larger over small objects, indicating that scale is one of the major problems in object detection, c) the bottleneck in object detection is object recognition, and d) detection models lack generalization in terms of searching the right places, utilizing context, recognition of small objects, and robustness to image transformation.
Recognizing objects in natural scenes and empirical upper bound. What we essentially did in this paper was to build the best object classifier for objects embedded in natural scenes, as opposed to isolated objects in object recognition datasets (e.g. ImageNet). Using this object classifier we then approximated the empirical upper bound in average precision. In contrast to prior investigations on the influence of context in object recognition, we did not find a significant contribution from the surrounding context in recognizing a target object, except in one case: training with 1.8 context and testing under the same condition on the COCO dataset (see Table 1). Evidence corroborating this finding came from our experiment in placing objects out of their context. Results showed that models are still able to detect objects (at least the large objects used here). Rosenfeld et al. performed a similar experiment, but over a much smaller set of images, and found that models fail to detect out-of-context objects (e.g. an elephant in the room, shown in Fig. 17). Since they did not report the AP of models on these images, a fair and careful comparison is not feasible at this point, which leaves answering this question to future investigations.
The contradiction between our results and previous investigations on the role of visual context might be due to the fact that we performed these experiments in large scale and over a much larger set of objects. A more systematic investigation of this may bring new insights. For example, it is very likely that context will play a more important role for recognition and detection of small objects or occluded ones. Further, we found that data augmentation using boxes overlapping with a target object did not lead to better classification accuracy. A further investigation of this using extensive data augmentation, external data, other backbones, or other optimization approaches may improve the upper bound slightly, but perhaps not significantly.
We invite researchers to periodically update the upper bound in detection scores including AP and other recently proposed ones such as LRP (it turns out that this score reduces to the classic AP when there is no localization error, thus the UAP computed here also applies to the LRP) and probability-based detection quality, especially over upcoming large scale datasets such as OpenImages. The same can also be repeated for other tasks such as semantic segmentation, instance segmentation, image and video captioning (a work on this already exists), saliency prediction [13, 37, 77, 34], image generation [29, 6] and detection of specific objects (e.g. faces, pedestrians). An ongoing effort is also addressing the shortcomings of detection scores. One line of research known as panoptic segmentation is encouraging a new way to evaluate segmentation models. Perhaps in the future, researchers may abandon predicting bounding boxes in images (hence getting rid of complications in AP calculation) and focus on the panoptic segmentation task, which regards classifying all image pixels (into object and stuff classes). In this sense, object detection is a subset of panoptic segmentation.
Challenges regarding object detection. Object detection models (and also object recognition models) perform very well on input images in which objects do not undergo (relatively) drastic variations (e.g. face detection or face recognition). Even in the case of faces, models suffer from issues such as low light conditions, highly occluded faces, or tiny ones. Detection of generic objects is very difficult due to several challenges. First, objects can be partially occluded (e.g. a dog behind the couch). As a result, the features extracted at the object location are not powerful enough to classify that object. To address this, a large number of data points covering those scenarios are needed. Second, object appearance and shape vary significantly across viewpoints (e.g. due to in-depth rotation). An object detector trained on specific viewpoints may fail to detect an object from a novel viewpoint at test time. Third, similarly, objects may appear large or small due to their distance to the camera or due to their natural scale. Currently, object detectors utilize data augmentation or a scale pyramid (e.g. the feature pyramid network) to solve this problem. Nonetheless, as we showed here, the scale problem still stands. Fourth, not all objects are rigid. Some non-rigid objects such as clothes can undergo drastic deformations, which makes detecting them very challenging. To make matters even worse, a non-rigid object can be split into several (disconnected) parts. Fifth, in some situations, especially in videos, captured images are occasionally blurred due to object or camera motion. This is critical to overcome, in particular for applications involving moving robots, self-driving cars, or drones.
As we mentioned repeatedly throughout the paper, the above problems are not specific to the detection methods and stem from the shortcomings of the recognition component in the detectors. After all, all modern object detectors are based on convolutional neural networks, which suffer from a lack of generalization and a high demand for annotated data. Despite these shortcomings, there is still room to improve in object detection, as models perform much lower than the empirical upper bound (calculated in this paper). This is encouraging because it means that we can still significantly boost object detection performance. The best mAP on the COCOval2017 dataset is 46.9% (see Fig. 1), which is far below the empirical upper bound of 78.2%. The best mAP over the COCO2017 test-dev dataset is 51.0% (see Fig. 18). Since results over COCO test-dev are usually higher than results over the COCO validation set, we predict the empirical upper bound to be higher over the former set, but most likely there will still be a large gap between models and the empirical upper bound on the test-dev dataset.
Error diagnosis and invariance analysis. We proposed a novel approach to study the errors of object detectors. Our error analysis experiments show that classification errors are much more prevalent than other types of errors and contribute the most to the overall error. This aligns with our argument in the previous section, which showed that the detection upper bound depends on the recognition accuracy. An alternative approach to evaluate the recognition component of an object detector is to feed the target boxes to a model and collect its decisions on those boxes. This is, however, cumbersome and needs to be implemented for each model separately (a preliminary investigation, feeding GT bounding boxes at inference time to FasterRCNN models with a ResNet50 backbone and FPN, results in an mAP of 73.3% on COCOval2017), whereas our diagnosis tool is general.
Our new diagnosis tool can be employed to pinpoint weaknesses in other object detection models. Also, error analysis of models for other tasks (e.g. object tracking ) is encouraged. Further, a more systematic investigation of invariance properties of object detectors along with adversarial examples to challenge object detectors can bring new insights into the failures of object detectors and for building better models. In this regard, the MMDetection benchmark offers code for analyzing models over transformed images (such as noise, blur, etc). Finally, here we were not concerned with the processing speed of the object detection models. Future work can study empirical upper bound when speed is also a concern.
Shall we dismiss object detection? As was mentioned in section 2.1, object detection is tightly related to object recognition and semantic segmentation. In particular, it relates to instance segmentation where the task is to label pixels belonging to individual objects of different classes (i.e. distinguishing different cars). In a sense it generalizes many other tasks. A natural question to ask is whether instance segmentation models can outperform object detection models in terms of accuracy and speed. If so, then maybe we should abandon the object detection problem (i.e. predicting bounding boxes) and focus on instance segmentation.
To gain insights regarding the above question, we generated bounding boxes from the predicted instance masks by a model, thus creating an object detector. The AP-Box of this object detector is then calculated. Conversely, predicted bounding boxes of an object detector can be considered as instance masks, thus an instance segmentation model. The AP-Mask of this model is then calculated. Performance of five models, all using the R50-FPN backbone, over the MS-COCO val2017 (36781 objects) are shown in Table 11.
The first 4 models in Table 11 fall under the category of ‘detect-then-segment’ models whereas the last one, TensorMask , performs instance segmentation directly. The latter also does object detection but independently from segmentation. The first (last) two rows in each model show AP-Box (AP-Mask). The second row shows AP when using predicted masks as boxes (i.e. circumscribed rectangles) and the fourth row shows applying boxes as masks. Note that all five models generate both boxes and masks.
Results show that AP-Box is higher (by about 1 to 2%) when using the predicted bounding boxes than when using boxes fitted to segmentation masks. This indicates that predicting bounding boxes directly leads to better accuracy (so far) and faster inference, suggesting that it makes sense to continue working on object detection. According to Table 11, AP-L when using masks as boxes is very close to AP-L when using the original predicted boxes. This is perhaps because predicted instance masks over large objects are already very accurate (witnessed by the higher AP-Mask for large objects). Also, applying boxes as masks results in a much lower AP-Mask compared to using the original predicted masks (see the fourth rows), since regions from other objects and the image background are also included in the box.
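The two conversions used in this comparison can be sketched as follows. This is an illustrative reconstruction, not the authors' code: masks are binary nested lists, boxes are in `[x, y, width, height]` format, and the function names are hypothetical.

```python
def mask_to_box(mask):
    """Circumscribed rectangle [x, y, width, height] of the nonzero pixels.

    Returns None for an empty mask.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    return [x0, y0, x1 - x0 + 1, y1 - y0 + 1]


def box_to_mask(box, height, width):
    """Binary mask that marks every pixel inside the box as foreground."""
    x, y, w, h = box
    return [[int(x <= col < x + w and y <= row < y + h)
             for col in range(width)]
            for row in range(height)]


# A plus-shaped object: 5 foreground pixels.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
box = mask_to_box(mask)                    # mask used as a box
filled = box_to_mask(box, 4, 5)            # box used as a mask
extra = sum(map(sum, filled)) - sum(map(sum, mask))
```

In this toy case the box-as-mask covers 9 pixels while the object covers only 5, so 4 background pixels are wrongly marked as foreground. This is the effect that drives the low AP-Mask in the "box as mask" rows.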
|AP-Box [ predicted boxes ]||0.373||0.590||0.402||0.219||0.409||0.481|
|AP-Box [ mask as box ]||0.366||0.575||0.392||0.203||0.402||0.485|
|AP-Mask [ predicted masks ]||0.342||0.559||0.362||0.158||0.369||0.501|
|AP-Mask [ box as mask ]||0.123||0.352||0.066||0.075||0.132||0.166|
|R-CNN||AP-Box [ mask as box ]||0.394||0.580||0.428||0.212||0.426||0.536|
|AP-Mask [ box as mask ]||0.128||0.353||0.073||0.081||0.133||0.173|
|Cascade||AP-Box [ mask as box ]||0.405||0.598||0.435||0.217||0.438||0.555|
|AP-Mask [ box as mask ]||0.132||0.367||0.074||0.087||0.139||0.178|
|R-CNN + DCN||AP-Box [ mask as box ]||0.422||0.617||0.459||0.233||0.456||0.578|
|AP-Mask [ box as mask ]||0.134||0.373||0.077||0.091||0.141||0.179|
|AP-Box [ mask as box ]||0.399||0.589||0.427||0.232||0.438||0.511|
|AP-Mask [ box as mask ]||0.127||0.358||0.070||0.087||0.135||0.167|
Evaluation measures. Evaluating object detection models has been a matter of debate. mAP is a well-established score but it has several shortcomings. Since mAP is calculated per class, it sometimes generates non-intuitive values. An example is illustrated in Fig. 20, where a detector generating boxes for non-existing objects (i.e. false positives) attains perfect mAP. Also, previous research has shown that an object detector with a lot of lower-confidence false positives can win over a detector with comparatively fewer false positives. As was stated in the Introduction section, mAP calculation is complicated. Further, it is unclear how much a small improvement in mAP (say 1%) will matter in real world applications. Eventually, as was discussed above, with the rise of instance segmentation and its corresponding evaluation measures, the suitability of mAP for object detection demands further discussion.
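The insensitivity of AP to low-confidence false positives can be seen in a small sketch. The code below computes AP for one class at one IoU threshold using 101-point interpolation in the style of the COCO protocol. It assumes that matching to ground truth has already been done, so each detection arrives as a `(score, is_true_positive)` pair. The function name and toy data are illustrative.

```python
def average_precision(detections, num_gt):
    """Interpolated AP over 101 recall levels for a single class.

    detections: list of (score, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: each point takes the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    total = 0.0
    for k in range(101):
        level = k / 100
        total += next((p for r, p in zip(recalls, precisions) if r >= level),
                      0.0)
    return total / 101


clean = [(0.9, True), (0.8, True)]
# Same detector plus five low-confidence false positives.
noisy = clean + [(0.1, False)] * 5
# One high-confidence false positive ranked above the true positives.
confident_fp = [(0.95, False)] + clean

ap_clean = average_precision(clean, num_gt=2)
ap_noisy = average_precision(noisy, num_gt=2)
ap_confident_fp = average_precision(confident_fp, num_gt=2)
```

In this toy case the five low-confidence false positives leave AP untouched, because they are ranked after full recall has been reached, while a single false positive ranked above the true positives lowers AP to about 0.67.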
Is localization ignored in our analysis? Our main intention here was to assess the power of classification in object detection. Conversely, as a complementary approach, one could take the predictions by a model and ask humans to annotate them. However, this is very cumbersome, whereas our setup is straightforward as it does not need annotations by humans. In fact, our error diagnosis does something similar to the latter. Overall, we aim to understand the power of deep learning in solving object detection. We have not neglected the localization component for the following reasons. First, we hypothesize that our setup gives the empirical upper-bound. So, we had to fix the localization to reach the upper-bound. Any error in localization will only lower the UAP (which will not be upper-bound anymore!). Second, we have investigated how inaccuracy in localization affects UAP (section 4). Third, we have provided a very detailed analysis of the localization error in models (section 5).
The role of context in object recognition versus object detection. In section 4, we found that the surrounding context is not important in classifying the center object (on average). In section 6, we analyzed context as it is incorporated in current object detection models. Here, context is more important for small objects (compare Table 6 objects_only vs. Table 8 orig_img; last columns) since both localization and classification are involved. Overall, to answer how important context is in detecting or recognizing objects depends on how it is utilized.
Sampling boxes from the background. Here we only considered object boxes to train the classifiers. Since we are looking for the best classifier to only classify objects, including the background class will only lower the accuracy of the classifier. There is no need to include the background since for computing the upper-bound, background regions are already discarded (i.e. assuming perfect localization and objectness prediction).
Understanding when and why the upper bound fails. On some rare occasions (e.g. tunics in Fig. 7, toaster in Fig. 9; often small objects), UAP is lower than model AP, possibly because our classifier has to produce a decision for every box and thus may generate more false positives than a model that misses objects (i.e. we do not have misses). This may result in lower precision for some classes for our UAP than for a model, but our setup has a higher recall.
Sampling boxes at different scales. Our proposed sampling strategy is efficient at covering the space of translation. However, it does not capture variation in scale. In contrast, two-stage detectors do so through bounding box regression in the first stage. The way we mitigate this is through extensive data augmentation during training the classifier. This classifier is applied to the scaled versions of the boxes surrounding the target box. A more general box sampling approach than ours has been proposed in parallel by Oksuz et al.  which can be used to sample boxes at different scales. See Fig. 21.
One possible reason for lower detection performance on small objects. It could be because small objects (e.g. a pen) are very small in both train and test sets. It is even hard for humans to recognize small objects out of context. There is not much scale variation in both train and test set in detection datasets over small objects. This is in contrast to object recognition datasets where scale variation is larger. Thus, even data augmentation for these classes may not help detection performance much. To detect or recognize small objects, it might be better to observe them also in large scale. To remedy this problem, one way may be to rely on external data for small objects.
Our error diagnosis compared to Hoiem et al. The main difference is that instead of removing detections, we fix them. Consider two methods (method 1 and method 2), both generating many mislocalized detections (FPs). For method 1, after correcting the mislocalized detections based on our protocol, many of them recover the misses and become TPs. In contrast, after correcting the mislocalized detections for method 2, they become redundant with the correct detections (TPs) and hence are considered FPs. Thus, the mislocalized predictions in the two methods are actually different. Our error analysis is able to distinguish method 1 from method 2, while the vanilla protocol fails to do so since it removes the mislocalized predictions blindly.
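A toy sketch can make the difference between fixing and removing concrete. Everything below is an assumption made for illustration: "fixing" is modeled as snapping a mislocalized detection onto its best-overlapping ground-truth box, a detection counts as mislocalized when its best IoU lies in [0.1, 0.5), and matching is a simplified greedy scheme. The actual protocol may differ in all of these details.

```python
def iou(a, b):
    """IoU of two [x0, y0, x1, y1] boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def count_tp_fp(dets, gts, thr=0.5):
    """Simplified greedy matching: each GT box matches at most one detection."""
    matched, tp, fp = set(), 0, 0
    for _, box in sorted(dets, key=lambda d: d[0], reverse=True):
        best = max(range(len(gts)), key=lambda i: iou(box, gts[i]))
        if iou(box, gts[best]) >= thr and best not in matched:
            matched.add(best)
            tp += 1
        else:
            fp += 1          # poor overlap or duplicate of a matched GT
    return tp, fp


def is_mislocalized(box, gts, low=0.1, thr=0.5):
    """Assumed definition: best IoU with any GT lies in [low, thr)."""
    return low <= max(iou(box, g) for g in gts) < thr


def fix(dets, gts):
    """Snap mislocalized detections onto their best-overlapping GT box."""
    return [(s, max(gts, key=lambda g: iou(b, g))
             if is_mislocalized(b, gts) else b) for s, b in dets]


def remove(dets, gts):
    """Vanilla alternative: drop mislocalized detections."""
    return [(s, b) for s, b in dets if not is_mislocalized(b, gts)]


gts = [[0, 0, 10, 10]]
shifted = [4, 0, 14, 10]                       # IoU with the GT is about 0.43
method1 = [(0.6, shifted)]                     # only a mislocalized box
method2 = [(0.9, [0, 0, 10, 10]), (0.6, shifted)]  # correct box plus the same

m1_fixed = count_tp_fp(fix(method1, gts), gts)
m2_fixed = count_tp_fp(fix(method2, gts), gts)
m1_removed = count_tp_fp(remove(method1, gts), gts)
m2_removed = count_tp_fp(remove(method2, gts), gts)
```

After fixing, the mislocalized box of method 1 recovers the missed object and becomes a TP, whereas the same box in method 2 turns into a duplicate and stays an FP. Removal discards the box in both cases, so it cannot tell whether a corrected box would have recovered a miss or merely duplicated an existing detection.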
8 Conclusion and Outlook
Modern object detectors are far from perfect despite intensive research in this area and significant progress in deep learning over the last couple of years. This signifies the limitation of deep learning to solve challenging vision tasks suggesting perhaps some fundamental ingredients are missing. Fixing the localization problem leads to the empirical upper-bound but reaching beyond that demands having better object recognition models.
As it stands, robustness to scale remains the main challenge in object detection. Scale variation (across various objects) is much higher in detection datasets than in recognition datasets because objects are captured in their natural habitats. Certain objects might appear at a certain scale most of the time (i.e. their natural scale). For example, a pen may always appear small compared to other objects in the scene. This makes recognition of such objects, out of their context, very difficult (see Fig. 16). This is less problematic in recognition datasets, such as ImageNet, since objects in those datasets are intentionally selected to be visually recognizable by humans. What all this means is that perhaps we need more data for object detection, or we need to resort to external data to improve results on existing detection datasets. Humans are much better at detecting small objects and at exploiting the surrounding context around an object, possibly due to the structural differences between the human visual system and CNNs. For instance, the human retina consists of a high resolution central region called the fovea and a lower resolution peripheral region. By moving the fovea over the scene, our eyes capture finer details of objects, whereas the resolution is fixed in still images fed to CNNs.
Lastly, our investigation here shows that we are far from solving the object detection problem. Further, this task can be considered as a litmus test to assess the capacity of deep learning and CNNs for solving vision problems. The existence of adversarial examples against object detection models (e.g. [79, 25, 12]) also exacerbates the problem and demonstrates how fragile these models are (as are many other models based on CNNs). Adversarial examples are perhaps a byproduct of the lack of robustness in vision models. Designing more powerful architectures (e.g. using architecture search techniques), incorporating heuristics, or using more data, while helpful, might not be enough to fully solve the object detection task. As an example, detecting mirrors or windows in images demands high-level reasoning and scene understanding. Please see Fig. 19.
- Alamri and Pugeault  Alamri F, Pugeault N (2019) Contextual relabelling of detected objects. arXiv preprint arXiv:190602534
- Alwassel et al  Alwassel H, Caba Heilbron F, Escorcia V, Ghanem B (2018) Diagnosing error in temporal action detectors. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 256–272
- Azulay and Weiss  Azulay A, Weiss Y (2018) Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:180512177
- Bar  Bar M (2004) Visual objects in context. Nature Reviews Neuroscience 5(8):617
- Barnea and Ben-Shahar  Barnea E, Ben-Shahar O (2019) Exploring the bounds of the utility of context for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7412–7420
- Borji  Borji A (2019) Pros and cons of gan evaluation measures. Computer Vision and Image Understanding 179:41–65
- Borji and Iranmanesh  Borji A, Iranmanesh SM (2019) Empirical upper-bound in object detection and more. arXiv preprint arXiv:191112451
- Borji and Itti  Borji A, Itti L (2012) State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence 35(1):185–207
- Borji and Itti  Borji A, Itti L (2014) Human vs. computer in scene and object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 113–120
- Borji et al  Borji A, Sihite DN, Itti L (2012) Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing 22(1):55–69
- Borji et al  Borji A, Cheng MM, Jiang H, Li J (2015) Salient object detection: A benchmark. IEEE transactions on image processing 24(12):5706–5722
- Braunegg et al  Braunegg A, Chakraborty A, Krumdick M, Lape N, Leary S, Manville K, Merkhofer E, Strickhart L, Walmer M (2019) Apricot: A dataset of physical adversarial attacks on object detection. arXiv preprint arXiv:191208166
- Bylinskii et al  Bylinskii Z, Recasens A, Borji A, Oliva A, Torralba A, Durand F (2016) Where should saliency models look next? In: European Conference on Computer Vision, Springer, pp 809–824
- Cai and Vasconcelos  Cai Z, Vasconcelos N (2018) Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6154–6162
- Chen et al [2019a] Chen K, Pang J, Wang J, Xiong Y, Li X, Sun S, Feng W, Liu Z, Shi J, Ouyang W, et al (2019a) Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4974–4983
- Chen et al [2019b] Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, Sun S, Feng W, Liu Z, Xu J, et al (2019b) Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:190607155
- Chen et al [2019c] Chen X, Girshick R, He K, Dollár P (2019c) Tensormask: A foundation for dense object segmentation. arXiv preprint arXiv:190312174
- Chen et al  Chen Z, Huang S, Tao D (2018) Context refinement for object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 71–86
- Cordts et al  Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3213–3223
- Corke et al  Corke P, Dayoub F, Hall D, Skinner J, Sünderhauf N (2020) What can robotics research learn from computer vision research? arXiv preprint arXiv:200102366
- Divvala et al  Divvala SK, Hoiem D, Hays JH, Efros AA, Hebert M (2009) An empirical study of context in object detection. In: 2009 IEEE Conference on computer vision and Pattern Recognition, IEEE, pp 1271–1278
- Dollar et al  Dollar P, Wojek C, Schiele B, Perona P (2011) Pedestrian detection: An evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence 34(4):743–761
- Everingham et al  Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88(2):303–338
- Everingham et al  Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual object classes challenge: A retrospective. International journal of computer vision 111(1):98–136
- Eykholt et al  Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Tramer F, Prakash A, Kohno T, Song D (2018) Physical adversarial examples for object detectors. arXiv preprint arXiv:180707769
- Galleguillos and Belongie  Galleguillos C, Belongie S (2010) Context based object categorization: A critical survey. Computer vision and image understanding 114(6):712–722
- Girshick  Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448
- Goldman et al  Goldman E, Herzig R, Eisenschtat A, Goldberger J, Hassner T (2019) Precise detection in densely packed scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5227–5236
- Goodfellow et al [2014a] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014a) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
- Goodfellow et al [2014b] Goodfellow IJ, Shlens J, Szegedy C (2014b) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572
- Hall et al  Hall D, Dayoub F, Skinner J, Corke P, Carneiro G, Sünderhauf N (2018) Probability-based detection quality (pdq): A probabilistic approach to detection evaluation. arXiv preprint arXiv:1811.10800
- He et al  He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
- He et al  He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
- He et al  He S, Tavakoli HR, Borji A, Pugeault N (2019) Human attention in image captioning: Dataset and analysis. In: Proceedings of the IEEE International Conference on Computer Vision, pp 8529–8538
- Heitz and Koller  Heitz G, Koller D (2008) Learning spatial context: Using stuff to find things. In: European conference on computer vision, Springer, pp 30–43
- Hoiem et al  Hoiem D, Chodpathumwan Y, Dai Q (2012) Diagnosing error in object detectors. In: European conference on computer vision, Springer, pp 340–353
- Hou et al  Hou Q, Cheng MM, Hu X, Borji A, Tu Z, Torr PH (2017) Deeply supervised salient object detection with short connections. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3203–3212
- Hu et al [2018a] Hu J, Shen L, Albanie S, Sun G, Vedaldi A (2018a) Gather-excite: Exploiting feature context in convolutional neural networks. In: Advances in Neural Information Processing Systems, pp 9401–9411
- Hu et al [2018b] Hu J, Shen L, Sun G (2018b) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
- Huang et al  Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7310–7311
- Huang et al  Huang Z, Huang L, Gong Y, Huang C, Wang X (2019) Mask scoring r-cnn. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6409–6418
- Karianakis et al  Karianakis N, Dong J, Soatto S (2016) An empirical evaluation of current convolutional architectures’ ability to manage nuisance location and scale variability. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4442–4451
- Kirillov et al  Kirillov A, He K, Girshick R, Rother C, Dollár P (2019) Panoptic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9404–9413
- Kuznetsova et al  Kuznetsova A, Rom H, Alldrin N, Uijlings J, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Duerig T, et al (2018) The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982
- Li et al  Li H, Singh B, Najibi M, Wu Z, Davis LS (2019) An analysis of pre-training on object detection. arXiv preprint arXiv:1904.05871
- Lin et al  Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European conference on computer vision, Springer, pp 740–755
- Lin et al [2017a] Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017a) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125
- Lin et al [2017b] Lin TY, Goyal P, Girshick R, He K, Dollár P (2017b) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
- Liu et al  Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M (2018) Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165
- Liu et al  Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. In: European conference on computer vision, Springer, pp 21–37
- Lu et al  Lu X, Li B, Yue Y, Li Q, Yan J (2019) Grid r-cnn. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7363–7372
- Marat and Itti  Marat S, Itti L (2012) Influence of the amount of context learned for improving object classification when simultaneously learning object and contextual cues. Visual Cognition 20(4-5):580–602
- Michaelis et al  Michaelis C, Mitzkus B, Geirhos R, Rusak E, Bringmann O, Ecker AS, Bethge M, Brendel W (2019) Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484
- Mishkin et al  Mishkin D, Sergievskiy N, Matas J (2017) Systematic evaluation of convolution neural network advances on the imagenet. Computer Vision and Image Understanding 161:11–19
- Mottaghi et al  Mottaghi R, Fidler S, Yuille A, Urtasun R, Parikh D (2015) Human-machine crfs for identifying bottlenecks in scene understanding. IEEE transactions on pattern analysis and machine intelligence 38(1):74–87
- Oksuz et al  Oksuz K, Can Cam B, Akbas E, Kalkan S (2018) Localization recall precision (lrp): A new performance metric for object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 504–519
- Oksuz et al  Oksuz K, Cam BC, Akbas E, Kalkan S (2020) Generating positive bounding boxes for balanced training of object detectors. In: IEEE Winter Conference on Applications of Computer Vision (WACV)
- Pang et al  Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra r-cnn: Towards balanced learning for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 821–830
- Parikh and Zitnick  Parikh D, Zitnick CL (2011) Finding the weakest link in person detectors. In: CVPR 2011, Citeseer, pp 1425–1432
- Rabinovich et al  Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie SJ (2007) Objects in context. In: ICCV, Citeseer, vol 1, p 5
- Recht et al  Recht B, Roelofs R, Schmidt L, Shankar V (2019) Do imagenet classifiers generalize to imagenet? arXiv preprint arXiv:1902.10811
- Redmon and Farhadi  Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767
- Redmon et al  Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
- Ren et al  Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
- Rezatofighi et al  Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S (2019) Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 658–666
- Rosenfeld et al  Rosenfeld A, Zemel R, Tsotsos JK (2018) The elephant in the room. arXiv preprint arXiv:1808.03305
- Russakovsky et al  Russakovsky O, Deng J, Huang Z, Berg AC, Fei-Fei L (2013) Detecting avocados to zucchinis: what have we done, and where are we going? In: Proceedings of the IEEE International Conference on Computer Vision, pp 2064–2071
- Russakovsky et al  Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115(3):211–252
- Singh and Davis  Singh B, Davis LS (2018) An analysis of scale invariance in object detection snip. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3578–3587
- Song et al  Song Z, Chen Q, Huang Z, Hua Y, Yan S (2011) Contextualizing object detection and classification. In: CVPR 2011, IEEE, pp 1585–1592
- Tan et al  Tan M, Pang R, Le QV (2019) Efficientdet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070
- Tian et al  Tian Z, Shen C, Chen H, He T (2019) Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355
- Torralba  Torralba A (2003) Contextual priming for object detection. International journal of computer vision 53(2):169–191
- Torralba and Sinha  Torralba A, Sinha P (2001) Statistical context priming for object detection. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, IEEE, vol 1, pp 763–770
- Tsipras et al  Tsipras D, Santurkar S, Engstrom L, Turner A, Madry A (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152
- Vondrick et al  Vondrick C, Khosla A, Malisiewicz T, Torralba A (2013) Hoggles: Visualizing object detection features. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1–8
- Wang et al  Wang W, Shen J, Guo F, Cheng MM, Borji A (2018) Revisiting video saliency: A large-scale benchmark and a new model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4894–4903
- Wolf and Bileschi  Wolf L, Bileschi S (2006) A critical view of context. International Journal of Computer Vision 69(2):251–261
- Xie et al  Xie C, Wang J, Zhang Z, Zhou Y, Xie L, Yuille A (2017) Adversarial examples for semantic segmentation and object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1369–1378
- Yang et al  Yang X, Mei H, Xu K, Wei X, Yin B, Lau RWH (2019) Where is my mirror? 1908.09101
- Yao et al  Yao L, Ballas N, Cho K, Smith JR, Bengio Y (2016) Empirical performance upper bounds for image and video captioning
- Zhang et al  Zhang P, Wang J, Farhadi A, Hebert M, Parikh D (2014) Predicting failures of vision systems. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3566–3573
- Zhang et al  Zhang S, Benenson R, Omran M, Hosang J, Schiele B (2016) How far are we from solving pedestrian detection? In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1259–1267
- Zheng et al  Zheng WS, Gong S, Xiang T (2009) Quantifying contextual information for object detection. In: 2009 IEEE 12th International Conference on Computer Vision, IEEE, pp 932–939
- Zhou et al  Zhou X, Wang D, Krähenbühl P (2019) Objects as points. arXiv preprint arXiv:1904.07850
- Zhu et al  Zhu X, Vondrick C, Ramanan D, Fowlkes CC (2012) Do we need more training data or better models for object detection? In: BMVC, Citeseer, vol 3, p 5
- Zhu et al  Zhu X, Cheng D, Zhang Z, Lin S, Dai J (2019) An empirical study of spatial attention mechanisms in deep networks. arXiv preprint arXiv:1904.05873
- Zhu et al  Zhu Z, Xie L, Yuille AL (2016) Object recognition with and without objects. 1611.06596
- Zoph and Le  Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578
- Zou et al  Zou Z, Shi Z, Guo Y, Ye J (2019) Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055