Man-made scenes can be densely packed, containing numerous objects, often identical, positioned in close proximity. We show that precise object detection in such scenes remains a challenging frontier even for state-of-the-art object detectors. We propose a novel, deep-learning based method for precise object detection, designed for such challenging settings. Our contributions include: (1) a layer for estimating the Jaccard index as a detection quality score; (2) a novel EM merging unit, which uses our quality scores to resolve detection overlap ambiguities; and (3) an extensive, annotated data set, SKU-110K, representing packed retail environments, released for training and testing under such extreme settings. Detection tests on SKU-110K and counting tests on the CARPK and PUCPR+ benchmarks show our method to outperform existing state-of-the-art with substantial margins. The code and data will be made available on <www.github.com/eg4000/SKU110K_CVPR19>.
Recent deep learning–based detectors can quickly and reliably detect objects in many real world scenes [15, 16, 19, 27, 30, 36, 37, 38]. Despite this remarkable progress, the common use case of detection in crowded images remains challenging even for leading object detectors.
We focus on detection in such densely packed scenes, where images contain many objects, often looking similar or even identical, positioned in close proximity. These scenes are typically man-made, with examples including retail shelf displays, traffic, and urban landscape images. Despite the abundance of such environments, they are underrepresented in existing object detection benchmarks. It is therefore unsurprising that state-of-the-art object detectors are challenged by such images.
Table 1: Key properties of related benchmarks.
|Benchmark ||#Img.||#Obj./img.||#Cls.||#Cls./img.||Dense||Idnt||BB|
|UCSD (2008) ||2000||24.9||1||1||✓||✗||✗|
|PASCAL VOC (2012) ||22,531||2.71||20||2||✗||✗||✓|
|ILSVRC Detection (2014) ||516,840||1.12||200||2||✗||✗||✓|
|COCO (2015) ||328,000||7.7||91||3.5||✗||✗||✓|
|Penguins (2016) ||82,000||25||1||1||✓||✗||✗|
|TRANCOS (2016) ||1,244||37.61||1||1||✓||✓||✗|
|WIDER FACE (2016) ||32,203||12||1||1||✗||✗||✓|
|CityPersons (2017) ||5000||6||1||1||✗||✗||✓|
|PUCPR+ (2017) ||125||135||1||1||✓||✓||✓|
|CARPK (2018) ||1448||61||1||1||✓||✓||✓|
|Open Images V4 (2018) ||1,910,098||8.4||600||2.3||✗||✓||✓|
To understand what makes these detection tasks difficult, consider two identical objects placed in immediate proximity, as is often the case for items on store shelves (Fig. 1). The challenge is to determine where one object ends and the other begins, minimizing overlaps between their adjacent bounding boxes. In fact, as we show in Fig. 1(a,c), the state-of-the-art RetinaNet detector often returns bounding boxes which partially overlap multiple objects, or detects adjacent object regions as separate objects.
We describe a method designed to accurately detect objects, even in such densely packed scenes (Fig. 1(b,d)). Our method includes several innovations. We propose learning the Jaccard index with a soft Intersection over Union (Soft-IoU) network layer. This measure provides valuable information on the quality of detection boxes. We explain how detections can be represented as a Mixture of Gaussians (MoG), reflecting their locations and their Soft-IoU scores. An Expectation-Maximization (EM) based method is then used to cluster these Gaussians into groups, resolving detection overlap conflicts.
To summarize, our novel contributions are as follows:
Soft-IoU layer, added to an object detector to estimate the Jaccard index between the detected box and the (unknown) ground truth box (Sec. 3.2).
EM-Merger unit, which converts detections and Soft-IoU scores into a MoG, and resolves overlapping detections in packed scenes (Sec. 3.3).
A new data set and benchmark, the store keeping unit, 110k categories (SKU-110K), for item detection in store shelf images from around the world (Sec. 4).
We test our detector on SKU-110K. Detection results show our method to outperform state-of-the-art detectors. We further test our method on the related but different task of object counting, on SKU-110K and the recent CARPK and PUCPR+ car counting benchmarks. Remarkably, although our method was not designed for counting, it offers a considerable improvement over state-of-the-art methods.
Object detection. Work on this problem is extensive and we refer to a recent survey for a comprehensive overview. Briefly, early detectors employed sliding window-based approaches, applying classifiers to window contents at each spatial location [10, 14, 45]. Later methods narrowed this search space by determining region proposals before applying sophisticated classifiers [1, 7, 35, 44, 52].
Deep learning-based methods now dominate detection results. To speed detection, proposal-based detectors such as R-CNN and Fast R-CNN were developed, followed by Faster R-CNN, which introduced a region proposal network (RPN), then accelerated even more by R-FCN. Mask-RCNN later added segmentation output and better detection pooling. We build on these methods, claiming no advantage in standard object detection tasks. Unlike our method, however, these two-stage methods were not designed for crowded scenes where small objects appear in dense formations.
To handle scale variance, the feature pyramid network (FPN) added up-scaling layers. RetinaNet utilized the same FPN model, introducing a Focal Loss to dynamically weigh hard and easy samples for better handling of class imbalances that naturally occur in detection datasets. We extend this approach, introducing a new detection overlap measure, allowing for precise detection of tightly packed objects.
These methods use hard-labeled log-likelihood detections to produce confidences for each candidate image region. We additionally predict a Soft-IoU confidence score which represents detection bounding box accuracy.
Merging duplicate detections. Standard non-maximum suppression (NMS) remains the de facto technique for merging duplicate detections, from Viola & Jones to recent deep detectors [27, 37, 38]. NMS is a hand-crafted algorithm, applied at test time as post-processing, to greedily select high scoring detections and remove their overlapping, low confidence neighbors.
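For readers unfamiliar with greedy NMS, the following is a minimal NumPy sketch of the generic algorithm, not the implementation used by any of the cited detectors. The corner box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions made for illustration.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners (assumed format)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest scoring boxes, dropping lower scoring overlapping neighbors."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    for idx in order:
        # A box survives only if it does not overlap any already-kept box too much.
        if all(box_iou(boxes[idx], boxes[k]) <= iou_threshold for k in keep):
            keep.append(int(idx))
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # -> [0, 2]
```

In this toy example the second box overlaps the first with IoU of about 0.68 and is suppressed. In a packed scene the same rule can suppress a correct detection of a neighboring item, which is the failure mode discussed in this paper.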
Alternatives to NMS have been proposed, including heuristic variants [4, 40, 23]. GossipNet proposed to perform duplicate-removal using a learnable layer in the detection network. Finally, others bin IoU values into five categories. We instead take a probabilistic interpretation of IoU prediction and a very different general approach.
Few of these methods showed improvement over simple, greedy NMS, with some also being computationally demanding. In densely packed scenes, the difficulty of resolving detection ambiguities is exacerbated by the many overlapping detections. We propose an unsupervised method, designed for clustering duplicate detections in cluttered regions.
Crowded scene benchmarks. Many benchmarks were designed for testing object detection or counting methods and we survey a few in Table 1. Importantly, we are unaware of detection benchmarks intended for densely packed scenes, such as those of interest here.
Popular object detection sets include the ILSVRC and PASCAL VOC detection challenges, MS COCO, and the very recent Open Images v4. None of these provides scenes with packed items. A number of recent benchmarks emphasize crowded scenes, but are designed for counting, rather than detection [2, 8, 34].
As evident from Table 1, our new SKU-110K dataset, described in Sec. 4, provides one to three orders of magnitude more items per image than nearly all these benchmarks (the only exception is PUCPR+, which offers two orders of magnitude fewer images, and a single object class to our more than 110k classes). Most importantly, our enormous per-image object numbers imply that all our images contain very crowded scenes, which raises the detection challenges described in Sec. 1. Moreover, identical or near identical items in SKU-110K are often positioned closely together, making detection overlaps a challenge. Finally, the large number of classes in SKU-110K implies appearance variations which add to the difficulty of this benchmark, even in challenges of object/non-object detection.
Our approach is illustrated in Fig. 2. We build on a standard detection network design, described in Sec. 3.1. We extend this design in two ways. First, we define a novel Soft-IoU layer which estimates the overlap between predicted bounding boxes and the (unknown) ground truth (Sec. 3.2). These Soft-IoU scores are then processed by a proposed EM-Merger unit, described in Sec. 3.3, which resolves ambiguities between overlapping bounding boxes, returning a single detection per object.
Our base detector is similar to existing methods [26, 27, 30, 38]. We first detect objects by building an FPN network with three upscaling layers, using ResNet-50 as a backbone. The proposed model provides three fully-convolutional output heads for each RPN: two heads are standard and used also by previous work [27, 37] (our novel third head is described in Sec. 3.2).
The first is a detection head which produces a bounding box regression output for each object, represented as a 4-tuple, $(x, y, h, w)$, for the 2D coordinates of the bounding box center, its height, and its width. The second, classification head provides an objectness score (confidence), $c \in [0, 1]$ (assuming an object/no-object detection task with one object class). In practice, we filter out detections with low objectness scores, to avoid creating a bias towards noisy detections when training our Soft-IoU layer, described next.
In non-dense scenes, greedy NMS applied to objectness scores, $c$, can resolve overlapping detections. In dense images, however, multiple overlapping bounding boxes often reflect multiple, tightly packed objects, many of which receive high objectness scores. As we later show (Sec. 5.2), in such cases, NMS does not adequately discriminate between overlapping detections or suppress partial detections.
To handle these cluttered positive detections, we propose predicting an additional value for each bounding box: the IoU (i.e., Jaccard index) between a regressed detection box and the object location. This Soft-IoU score, $c^{iou}$, is estimated by a fully-convolutional layer which we add as a third head to the end of each RPN in the detector.
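Given $N$ predicted detections, the IoU between a predicted bounding box, $b_i$, $i \in \{1, \ldots, N\}$, and its ground truth bounding box, $\hat{b}_i$, is defined as:
$$\mathrm{IoU}_i = \frac{\mathrm{Intersection}(\hat{b}_i, b_i)}{\mathrm{Union}(\hat{b}_i, b_i)} \qquad (1)$$
We chose $\hat{b}_i$ to be the closest annotated box to $b_i$ (in image coordinates). If the two do not overlap, then $\mathrm{IoU}_i = 0$. Both $\mathrm{Intersection}(\cdot)$ and $\mathrm{Union}(\cdot)$ count pixels.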
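As an illustration of how such IoU training targets could be computed, here is a minimal sketch under stated assumptions: boxes are `(x, y, h, w)` tuples with a center coordinate, "closest annotated box" is interpreted as closest by center distance, and areas are continuous rather than pixel counts. The function names are hypothetical.

```python
import numpy as np

def iou_center_format(box_a, box_b):
    """IoU of two boxes given as (center_x, center_y, height, width)."""
    ax, ay, ah, aw = box_a
    bx, by, bh, bw = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    inter_h = max(0.0, min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2))
    inter = inter_w * inter_h
    union = ah * aw + bh * bw - inter
    return inter / union if union > 0 else 0.0

def soft_iou_targets(pred_boxes, gt_boxes):
    """For each predicted box, IoU with the annotated box whose center is nearest."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    targets = []
    for box in pred:
        # Assumption: "closest" means smallest Euclidean distance between centers.
        dist = np.hypot(gt[:, 0] - box[0], gt[:, 1] - box[1])
        targets.append(iou_center_format(box, gt[np.argmin(dist)]))
    return np.array(targets)

gt = [(5, 5, 10, 10), (50, 50, 10, 10)]
pred = [(5, 5, 10, 10), (10, 5, 10, 10), (200, 200, 4, 4)]
print(soft_iou_targets(pred, gt))  # a perfect match, a half-shifted box, and a miss
```

A predicted box shifted by half its width against a same-sized annotation gets a target of one third, so the target degrades smoothly with localization error.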
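We take a probabilistic interpretation of Eq. (1), learning it with our Soft-IoU layer using a binary cross-entropy loss:
$$\mathcal{L}_{sIoU} = -\frac{1}{n}\sum_{i=1}^{n}\left[\mathrm{IoU}_i \log\left(c^{iou}_i\right) + \left(1 - \mathrm{IoU}_i\right)\log\left(1 - c^{iou}_i\right)\right] \qquad (2)$$
where $n$ is the number of samples in each batch and $c^{iou}_i$ is the Soft-IoU score predicted for the $i$-th sample.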
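The binary cross-entropy with soft (non-binary) IoU targets can be sketched in a few lines of NumPy. This is an illustrative version only: the function name and the clipping constant `eps` are assumptions, and a real training setup would use the loss of its deep learning framework.

```python
import numpy as np

def soft_iou_loss(pred_scores, iou_targets, eps=1e-7):
    """Mean binary cross-entropy between predicted Soft-IoU scores and IoU targets."""
    c = np.clip(np.asarray(pred_scores, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    t = np.asarray(iou_targets, dtype=float)
    return float(-np.mean(t * np.log(c) + (1.0 - t) * np.log(1.0 - c)))

# The loss is smaller when the predicted score matches the IoU target.
print(soft_iou_loss([0.9], [0.9]) < soft_iou_loss([0.1], [0.9]))  # -> True
```

Because the targets are soft, the minimum of this loss is the entropy of the target rather than zero; it is still minimized when the prediction equals the target.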
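The loss used to train each RPN in the detection network is therefore defined as:
$$\mathcal{L} = \mathcal{L}_{Classification} + \mathcal{L}_{Regression} + \mathcal{L}_{sIoU} \qquad (3)$$
where $\mathcal{L}_{Classification}$ and $\mathcal{L}_{Regression}$ are the losses of the classification and detection heads of Sec. 3.1, and $\mathcal{L}_{sIoU}$ is the Soft-IoU loss of Eq. (2).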
Objectness vs. Soft-IoU. The objectness score used in previous methods predicts object/no-object labels whereas our Soft-IoU predicts the IoU of a detected bounding box and its ground truth. So, for instance, a bounding box which partially overlaps an object can still have a high objectness score, $c$, signifying high confidence that the object appears in the bounding box. For the same detection, we expect the Soft-IoU score, $c^{iou}$, to be low, due to the partial overlap.
In fact, object/no-object classifiers are trained to be invariant to occlusions and translations. A good objectness classifier would therefore be invariant to the properties which our Soft-IoU layer is sensitive to. Objectness and Soft-IoU can thus be seen as reflecting complementary properties of a detection bounding box.
We now have $N$ predicted bounding box locations, each with its associated objectness, $c$, and Soft-IoU, $c^{iou}$, scores. Bounding boxes, especially in crowded scenes, often clump together in clusters, overlapping each other and their item locations. Our EM-Merger unit filters, merges, or splits these overlapping detection clusters, in order to resolve a single detection per object. We begin by formally defining these detection clusters.
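Detections as Gaussians. We consider the $N$ bounding boxes produced by the network as a set of 2D Gaussians:
$$F = \{f_i\}_{i=1}^{N} = \{\mathcal{N}(p; \mu_i, \Sigma_i)\}_{i=1}^{N} \qquad (4)$$
with $p \in \mathbb{R}^2$, a 2D image coordinate. The $i$-th detection is thus represented by a 2D mean, the central point of the box, $\mu_i = (x_i, y_i)$, and a diagonal covariance, $\Sigma_i$, reflecting the box size, $(h_i, w_i)$.
We represent these Gaussians, jointly, as a single Mixture of Gaussians (MoG) density:
$$f(p) = \sum_{i=1}^{N} \alpha_i f_i(p) \qquad (5)$$
where the mixture coefficients, $\alpha_i = c^{iou}_i / \sum_{k=1}^{N} c^{iou}_k$, reflecting our confidence that the bounding box overlaps with its ground truth, are normalized to create a MoG.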
Fig. 3 visualizes the density of Eq. (5) as heat-maps, translating detections into spatial region maps representing our per-pixel confidences of detection overlaps; each region weighted by the accumulated Soft-IoU.
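To make the detection-to-mixture construction concrete, here is a minimal sketch. The exact scaling between box size and covariance is an assumption of this example: the standard deviation along each axis is set to a quarter of the box extent, so that a two standard deviation ellipse spans the box. Function names and the `(x, y, h, w)` center format are likewise illustrative.

```python
import numpy as np

def detections_to_mog(boxes, soft_iou):
    """Turn (x, y, h, w) boxes and Soft-IoU scores into mixture weights, means, variances."""
    boxes = np.asarray(boxes, dtype=float)
    mu = boxes[:, :2]  # Gaussian means are the box centers
    # Assumed scaling: std = extent / 4; x-variance from width, y-variance from height.
    var = np.stack([(boxes[:, 3] / 4.0) ** 2, (boxes[:, 2] / 4.0) ** 2], axis=1)
    alpha = np.asarray(soft_iou, dtype=float)
    alpha = alpha / alpha.sum()  # normalize so the weights form a valid mixture
    return alpha, mu, var

def mog_density(p, alpha, mu, var):
    """Evaluate the mixture density at a 2D point p (diagonal covariances)."""
    p = np.asarray(p, dtype=float)
    exponent = -0.5 * ((p - mu) ** 2 / var).sum(axis=1)
    norm = 2.0 * np.pi * np.sqrt(var[:, 0] * var[:, 1])
    return float(np.sum(alpha * np.exp(exponent) / norm))

alpha, mu, var = detections_to_mog([(10, 10, 8, 4), (40, 10, 8, 4)], [0.2, 0.6])
print(alpha)  # -> [0.25 0.75]
print(mog_density((40, 10), alpha, mu, var) > mog_density((25, 10), alpha, mu, var))  # -> True
```

Evaluating `mog_density` on a pixel grid would give a heat-map of the kind described above, with peaks near confidently localized detections.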
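We treat the problem of resolving the final detections as finding a set of $K \ll N$ Gaussians,
$$G = \{g_j\}_{j=1}^{K} = \{\mathcal{N}(p; \mu'_j, \Sigma'_j)\}_{j=1}^{K} \qquad (6)$$
such that when aggregated, the selected Gaussians approximate the original MoG distribution of Eq. (5), formed by all $N$ detections. That is, if $f$ is defined by Eq. (5), with components $f_i$ and mixture coefficients $\alpha_i$, then we seek a mixture of Gaussians,
$$g(p) = \sum_{j=1}^{K} \beta_j g_j(p) \qquad (7)$$
for which
$$d(f, g) = \sum_{i=1}^{N} \alpha_i \min_{j \in \{1, \ldots, K\}} \mathrm{KL}(f_i \,\|\, g_j) \qquad (8)$$
is minimized, where KL is the KL-divergence, used as a non-symmetric distance between two detection boxes.
An EM-approach for selecting detections. We approximate a solution to the minimization of Eq. (8) using an EM-based method. The E-step assigns each box to the nearest box cluster, where box similarity is defined by a KL distance between the corresponding Gaussians. E-step assignments are defined as:
$$\pi(i) = \arg\min_{j \in \{1, \ldots, K\}} \mathrm{KL}(f_i \,\|\, g_j) \qquad (9)$$
The M-step then re-estimates the model parameters by:
$$\beta_j = \sum_{i \in \pi^{-1}(j)} \alpha_i, \quad \mu'_j = \frac{1}{\beta_j} \sum_{i \in \pi^{-1}(j)} \alpha_i \mu_i, \quad \Sigma'_j = \frac{1}{\beta_j} \sum_{i \in \pi^{-1}(j)} \alpha_i \left(\Sigma_i + (\mu_i - \mu'_j)(\mu_i - \mu'_j)^{\top}\right) \qquad (10)$$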
Note that these matrix computations are fast in 2D space. Moreover, all our Gaussians represent axis-aligned detections and so they all have diagonal covariances. In such cases, the KL distance between two Gaussians has a simpler form which is even more efficient to compute.
General EM theory guarantees that the iterative process described in Eqs. (9) and (10) monotonically decreases the value of Eq. (8) and converges to a local minimum. We declare convergence when the value of Eq. (8) falls below a small, fixed threshold. We found this process to nearly always converge within ten iterations and so we cap the number of iterations at ten.
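The E-step and M-step described above can be sketched as follows. This is a simplified illustration, not the released implementation. Assumptions: covariances are stored as per-axis variances and the merged covariance is kept diagonal (cross terms are dropped), the loop stops when the objective stops improving by more than `tol` rather than when it falls below an absolute value, and all function names are hypothetical.

```python
import numpy as np

def kl_diag(mu0, var0, mu1, var1):
    """KL(N0 || N1) for diagonal Gaussians; broadcasts over leading axes."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0
                        + np.log(var1 / var0), axis=-1)

def m_step(alpha, mu, var, assign, n_clusters):
    """Moment-match each cluster to its assigned, weighted detections."""
    beta = np.zeros(n_clusters)
    g_mu = np.zeros((n_clusters, 2))
    g_var = np.ones((n_clusters, 2))
    for j in range(n_clusters):
        members = assign == j
        if not members.any():
            continue  # empty cluster keeps zero weight
        beta[j] = alpha[members].sum()
        w = alpha[members][:, None] / beta[j]
        g_mu[j] = (w * mu[members]).sum(axis=0)
        g_var[j] = (w * (var[members] + (mu[members] - g_mu[j]) ** 2)).sum(axis=0)
    return beta, g_mu, g_var

def em_merge(alpha, mu, var, init_assign, n_clusters, max_iter=10, tol=1e-10):
    alpha, mu, var = (np.asarray(a, dtype=float) for a in (alpha, mu, var))
    assign = np.asarray(init_assign)
    beta, g_mu, g_var = m_step(alpha, mu, var, assign, n_clusters)
    prev = np.inf
    for _ in range(max_iter):
        # E-step: assign each detection to the cluster with the smallest KL distance.
        kl = kl_diag(mu[:, None, :], var[:, None, :], g_mu[None], g_var[None])
        kl[:, beta == 0] = np.inf  # never assign to an empty cluster
        assign = kl.argmin(axis=1)
        objective = float(np.sum(alpha * kl.min(axis=1)))
        # M-step: re-estimate cluster parameters from the new assignments.
        beta, g_mu, g_var = m_step(alpha, mu, var, assign, n_clusters)
        if prev - objective < tol:
            break
        prev = objective
    return assign, beta, g_mu, g_var

mu = [[10, 10], [11, 10], [50, 10], [51, 10]]
var = [[4, 4]] * 4
assign, beta, g_mu, g_var = em_merge([0.25] * 4, mu, var, [0, 1, 0, 1], n_clusters=2)
print(assign.tolist())  # -> [0, 0, 1, 1]
```

In this toy run the initial assignment mixes the two groups of detections, and a single E-step already separates them; the resulting cluster means are the weighted centers of each group.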
EM parameters are often initialized using fast clustering to prevent convergence to poor local minima. We initialize EM with an agglomerative, hierarchical clustering, where each detection initially represents a cluster of its own and clusters are successively merged until $K$ clusters remain.
Deep clustering methods are designed for clustering high-dimensional data, training autoencoders to map input data into a low-dimensional feature space where clustering is easier. We instead use EM, as these methods are not relevant in our settings, where the original data is two-dimensional.
Gaussians as detections. Once EM has converged, the estimated Gaussians represent a set of $K$ detections. As an upper bound for the number of detections, we use $K = \mathrm{Size}(I) / (\mu_w \mu_h)$, where $\mathrm{Size}(I)$ is the image area and $\mu_w$, $\mu_h$ are the mean detection width and height, approximating the number of non-overlapping, mean-sized boxes that fit into the image. As post-processing, we suppress less confident Gaussians which overlap other Gaussians by more than a predefined threshold. This step can be viewed as model selection and it determines the actual number of detected objects, $K' \le K$.
To extract the final detections, for each of the $K'$ Gaussians, we consider the ellipse at two standard deviations around its center, visualized in Fig. 3 in green. We then search the original set of detections (Sec. 3.1) for those whose center, $\mu_i$, falls inside this ellipse. A Gaussian is converted to a detection window by taking the median dimensions of the detections in this set.
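The conversion from a Gaussian back to a detection window might look like the sketch below. The text specifies the two standard deviation ellipse and the median box dimensions; placing the output box at the Gaussian mean, the `(x, y, h, w)` center format, and the function name are assumptions of this example.

```python
import numpy as np

def gaussian_to_box(g_mu, g_var, det_boxes, n_std=2.0):
    """Convert one axis-aligned Gaussian into an (x, y, h, w) detection window."""
    g_mu = np.asarray(g_mu, dtype=float)
    g_var = np.asarray(g_var, dtype=float)   # per-axis variances (x, y)
    det = np.asarray(det_boxes, dtype=float)  # original detections, (x, y, h, w)
    # A center lies inside the n_std ellipse if its normalized squared distance
    # from the Gaussian mean is at most n_std squared.
    d2 = ((det[:, :2] - g_mu) ** 2 / g_var).sum(axis=1)
    inside = det[d2 <= n_std ** 2]
    if len(inside) == 0:
        return None  # no supporting detections for this Gaussian
    h = float(np.median(inside[:, 2]))
    w = float(np.median(inside[:, 3]))
    # Assumption: the output box is centered on the Gaussian mean.
    return (float(g_mu[0]), float(g_mu[1]), h, w)

dets = [(10, 10, 8, 6), (11, 10, 10, 8), (12, 11, 12, 10), (30, 30, 50, 50)]
print(gaussian_to_box((10, 10), (4, 4), dets))  # -> (10.0, 10.0, 10.0, 8.0)
```

Using the median rather than the mean keeps a single oversized detection inside the ellipse from inflating the final box.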
We assembled a new labeled data set and benchmark containing images of supermarket shelves. We focus on such retail environments for two main reasons. First, to maximize sales and store real-estate usage, shelves are regularly optimized to present many items in tightly packed, efficient arrangements [3, 33]. Our images therefore represent extreme examples of dense environments; precisely the type of scenes we are interested in.
Second, retail items naturally fall into product, brand, and sub-brand object classes. Different brands and products are designed to appear differently. A typical store can sell hundreds of products, thereby presenting a detector with many inter-class appearance variations. Sub-brands, on the other hand, are often distinguishable only by fine-grained packaging differences. These subtle appearance variations increase the range of nuisances that detectors must face (e.g., spatial transformations, image quality, occlusion).
As we show in Table 1, SKU-110K is very different from existing alternatives in the numbers and density of the objects appearing in each image, the variability of its item classes, and, of course, the nature of its scenes. Example images from SKU-110K are provided in Fig. 1, 2, and 5.
Image collection. SKU-110K images were collected from thousands of supermarket stores around the world, including locations in the United States, Europe, and East Asia. Dozens of paid associates acquired our images, using their personal cellphone cameras. Images were originally taken at no less than five mega-pixel resolution but were then JPEG compressed at one megapixel. Otherwise, phone and camera models were not regulated or documented. Image quality and view settings were also unregulated and so our images represent different scales, viewing angles, lighting conditions, noise levels, and other sources of variability.
Bounding box annotations were provided by skilled annotators. We chose experienced annotators over unskilled, Mechanical Turkers, as we found the boxes obtained this way were more accurate and did not require voting schemes to verify correct annotations [28, 42]. We did, however, visually inspect each image along with its detection labels, to filter obvious localization errors.
Benchmark protocols. SKU-110K images were partitioned into train, validation, and test splits. Training consists of 70% of the images and their associated bounding boxes; 5% of the images are used for validation (with their bounding boxes). The remaining 25% of the images, with their bounding boxes, were used for testing. Images were selected at random, ensuring that the same shelf display from the same shop does not appear in more than one of these subsets.
We adopt evaluation metrics similar to those used by COCO, reporting the average precision (AP) at IoU=.50:.05:.95 (their primary challenge metric), AP at IoU=.75 (their strict metric), and average recall (AR) at IoU=.50:.05:.95, computed for a fixed maximal number of objects per image. We further report the precision value sampled from the precision-recall curve at a fixed recall for IoU=0.75 (P).
The many, densely packed items in our images are reminiscent of the settings in counting benchmarks [2, 22]. We capture both detection and counting accuracy, by borrowing the error measures used for those tasks: if $\{K'_i\}_{i=1}^{n}$ are the predicted numbers of objects in each of the $n$ test images, and $\{t_i\}_{i=1}^{n}$ are the per image ground truth numbers, then the mean absolute error (MAE) is $\frac{1}{n}\sum_{i=1}^{n} |K'_i - t_i|$ and the root mean squared error (RMSE) is $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (K'_i - t_i)^2}$.
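The two counting error measures are straightforward to compute. The following is a small illustrative helper; the function name and the toy counts are made up for the example and are not results from any benchmark.

```python
import numpy as np

def counting_errors(predicted_counts, true_counts):
    """Return (MAE, RMSE) between per-image predicted and ground truth object counts."""
    diff = np.asarray(predicted_counts, dtype=float) - np.asarray(true_counts, dtype=float)
    mae = float(np.mean(np.abs(diff)))
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return mae, rmse

mae, rmse = counting_errors([10, 20, 30], [12, 20, 26])
print(mae)  # -> 2.0
```

RMSE penalizes large per-image errors more heavily than MAE, so a method with a few badly miscounted images can have a moderate MAE but a very large RMSE.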
Table 2 compares average frames per second (FPS) and detections per second (DPS) for baseline methods and variations of our approach. Runtimes were measured on the same machine using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz and a GeForce GTX Titan X GPU.
Our base detector is modeled after RetinaNet and so their runtimes are identical. Adding our Soft-IoU layer does not affect runtime. EM-Merger is slower despite the optimizations described in Sec. 3.3, mostly because of memory swapping between GPU and CPU/RAM. Our initial tests suggest that a GPU optimized version will be nearly as fast as the base detector.
|Method||AP||AP at IoU=.75||AR||P||MAE||RMSE|
|Base & NMS||.413||.384||.484||.491||24.962||34.382|
|Soft-IoU & NMS||.418||.386||.483||.492||25.394||34.729|
|Base & EM-Merger||.482||.540||.553||.802||23.978||283.971|
|Our full approach||.492||.556||.554||.834||14.522||23.992|
Baseline methods. We compare the detection accuracies of our proposed method and recent state-of-the-art on the SKU-110K benchmark. All methods, with the exception of the Monkey detector, were trained on the training set portion of SKU-110K.
The following two baseline methods were tested using the original implementations released by their authors: RetinaNet and Faster-RCNN. YOLO9000 is not suited for images with more than 50 objects. We therefore offer results for a variant of YOLO9000, with its loss function optimized and retrained to support detection of up to 300 boxes per image.
We also report the following ablation studies, detailing the contributions of individual components of our approach.
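Monkey: Because of the tightly packed items in SKU-110K images, it is plausible that randomly tossed bounding boxes would correctly predict detections by chance. To test this naive approach, we assume we know the object number, $K'$, the mean and standard-deviation width, $\mu_w$, $\sigma_w$, and height, $\mu_h$, $\sigma_h$, for these boxes. Monkey samples 2D upper-left corners for the $K'$ bounding boxes uniformly at random, and samples their heights and widths from the Gaussian distributions $\mathcal{N}(h; \mu_h, \sigma_h)$ and $\mathcal{N}(w; \mu_w, \sigma_w)$, respectively.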
Base & NMS: Our basic detector of Sec. 3.1 with standard NMS applied to objectness scores, $c$.
Soft-IoU & NMS: Base detector with Soft-IoU (Sec. 3.2). Standard NMS applied to Soft-IoU scores, $c^{iou}$, instead of objectness scores.
Base & EM-Merger: Our basic detector, now using EM-Merger of Sec. 3.3, but applying it to original objectness scores, $c$.
Our full approach: Applying the EM-Merger unit to Soft-IoU scores, $c^{iou}$.
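The Monkey baseline listed above can be sketched as a random box sampler. This is a hedged illustration: sampling corners uniformly over the image, taking absolute values to avoid negative sizes, the `(x, y, w, h)` output layout, and the function and parameter names are all assumptions of this example.

```python
import numpy as np

def monkey_boxes(n_boxes, img_w, img_h, mu_w, sigma_w, mu_h, sigma_h, seed=0):
    """Sample random boxes as rows of (x_left, y_top, width, height)."""
    rng = np.random.default_rng(seed)
    # Upper-left corners drawn uniformly over the image (assumed).
    x = rng.uniform(0.0, img_w, n_boxes)
    y = rng.uniform(0.0, img_h, n_boxes)
    # Sizes drawn from Gaussians fitted to known box statistics; abs() guards
    # against the rare negative sample.
    w = np.abs(rng.normal(mu_w, sigma_w, n_boxes))
    h = np.abs(rng.normal(mu_h, sigma_h, n_boxes))
    return np.stack([x, y, w, h], axis=1)

boxes = monkey_boxes(150, img_w=1000, img_h=800, mu_w=60, sigma_w=15, mu_h=90, sigma_h=20)
print(boxes.shape)  # -> (150, 4)
```

Scoring such boxes with the same detection metrics as the trained detectors gives a chance-level reference for how much of the accuracy could be explained by scene density alone.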
To test MAE and RMSE we report the number of detected objects, $K'$, and compare it with the true number of items per image. In RetinaNet the number of detections is extremely high so we first filter detections with low confidences. This confidence threshold was determined using cross-validation to optimize the results of this baseline.
|Counting results on CARPK|
|Method||MAE||RMSE|
|Faster R-CNN (2015) ||24.32||37.62|
|YOLO (2016) ||48.89||57.55|
|One-Look Regression (2016) ||59.46||66.84|
|LPN Counting (2017) ||23.80||36.79|
|YOLO9000 (2017) ||45.36||52.02|
|RetinaNet (2018) ||16.62||22.30|
|IEP Counting (2019) ||51.83||-|
|Our full approach||6.77||8.52|
|Counting results on PUCPR+|
|Method||MAE||RMSE|
|Faster R-CNN (2015) ||39.88||47.67|
|YOLO (2016) ||156.00||200.42|
|One-Look Regression (2016) ||21.88||36.73|
|LPN Counting (2017) ||22.76||34.46|
|YOLO9000 (2017) ||130.40||172.46|
|RetinaNet (2018) ||24.58||33.12|
|IEP Counting (2019) ||15.17||-|
|Our full approach||7.16||12.00|
Detection results on SKU-110K. Quantitative detection results are provided in Table 3, result curves are presented in Fig. 4, and a selection of qualitative results, comparing our full approach with RetinaNet, the best performing baseline system, is offered in Fig. 5.
Despite the packed nature of our scenes, randomly tossing detections fails completely, as evident from the near zero accuracy of Monkey. Both Faster-RCNN and YOLO9000 are clearly unsuited for detecting so many tightly packed objects. RetinaNet performs much better, in fact outperforming our base network despite sharing a similar design (Sec. 3.1). This could be due to the better framework optimization of RetinaNet.
Our full system outperforms all its baselines with wide margins. Much of its advantage seems to come from our EM-Merger (Sec. 3.3). Comparing the accuracy of EM-Merger applied to either objectness scores or our Soft-IoU demonstrates the added information provided by Soft-IoU. This contribution is especially meaningful when examining the counting results, which show that Soft-IoU scores provide a much better means of filtering detection boxes than objectness scores.
It is further instructive to compare detection accuracy with counting accuracy. The counting accuracy gap between our method and the closest runner up, RetinaNet, is greater than the gap in detection accuracy (though both margins are wide). RetinaNet's drop in counting accuracy can at least partially be explained by its use of greedy NMS compared with our EM-Merger. In fact, Fig. 5 demonstrates the many overlapping and/or mis-localized detections produced by RetinaNet compared to the single detections per item predicted by our approach (see, in particular, Fig. 5(a,e)).
Finally, we note that our best results remain far from perfect: The densely packed settings represented by SKU-110K images appear to be highly challenging, leaving room for further improvement.
We test our method on data from other benchmarks, to see if our approach generalizes well to other domains beyond store shelves and retail objects. To this end, we use the recent CARPK and PUCPR+ benchmarks. Both data sets provide images of parking lots from high vantage points. We use their test protocols, comparing the number of detections per image to the ground truth numbers made available by these benchmarks. Accuracy is reported using MAE and RMSE, as in our SKU-110K (Sec. 4).
Counting results. We compare our method with results reported by others [22, 41]: Faster R-CNN, YOLO, and One-Look Regression. Existing baselines also include two methods designed and tested for counting on these two benchmarks: LPN Counting and IEP Counting. In addition, we trained and tested counting accuracy with YOLO9000 and RetinaNet.
Table 4 reports the MAE and RMSE for all tested methods. Despite not being designed for counting, our method is more accurate than recent methods designed for that task. A significant difference between these counting datasets and our SKU-110K is in the much closer proximity of the objects in our images. This issue has a significant impact on baseline detectors, as can be seen in Tables 3 and 4. Our model suffers a much lower degradation in performance due to better filtering of these overlaps. (See the project web-page for qualitative results on these benchmarks.)
The performance of modern object/no-object detectors on existing benchmarks is remarkable yet still limited. We focus on densely packed scenes typical of every-day retail environments and offer SKU-110K, a new benchmark of such retail shelf images, labeled with item detection boxes. Our tests on this benchmark show that such images challenge state-of-the-art detectors.
To address these challenges, along with our benchmark, we offer two technical innovations designed to raise detection accuracy in such settings: The first is a Soft-IoU layer for estimating the overlap between predicted and (unknown) ground truth boxes. The second is an EM-based unit for resolving bounding box overlap ambiguities, even in tightly packed scenes where these overlaps are common.
We test our approach on SKU-110K and two existing benchmarks for counting, and show it to surpass existing detection and counting methods. Still, even the best results on SKU-110K are far from saturated, suggesting that these densely packed scenes remain a challenging frontier for future work.
This research was supported by Trax Image Recognition for Retail and Consumer Goods (https://traxretail.com/). We are thankful to Dr. Yair Adato and Dr. Ziv Mhabary for their essential support in this work.
European Conf. Comput. Vision, 2016.
Parsimonious reduction of Gaussian mixture models with a variational-bayes approach. Pattern Recognition, 43(3):850–858, 2010.
AAAI Conf. on Artificial Intelligence workshops, 2012.
Robust real-time face detection. Int. J. Comput. Vision, 57(2):137–154, 2004.
Unsupervised deep embedding for clustering analysis. In Int. Conf. Mach. Learning, 2016.
Towards K-means-friendly spaces: Simultaneous deep learning and clustering. In Int. Conf. Mach. Learning, 2017.