A robot or autonomous system often operates in uncontrolled and detrimental conditions that pose severe challenges to its perception system. Robots are inherently active agents that act in, and interact with, the physical real world. They have to make decisions based on incomplete and uncertain knowledge, with potentially catastrophic results.
Benchmark challenges such as the ImageNet Large Scale Visual Recognition Challenge, Microsoft COCO, and VQA have had a significant influence on the advancements in object recognition, object detection, semantic segmentation, image captioning, and visual question answering in recent years. These challenges posed motivating problems to the computer vision and machine learning research communities and proposed datasets and evaluation metrics that allowed comparison of different approaches in a standardised way.
However, visual perception for robotics faces challenges that are not well covered or evaluated by the existing benchmarks. Specifically, deployment in open-set conditions requires reliable uncertainty estimation to identify unknown objects. A robot will inevitably encounter objects of unknown classes, and should not assign high-confidence labels to these unknown objects. Fusing semantic information with spatial information for scene understanding or semantic SLAM also requires estimation of the uncertainty not only of what an object is (semantic uncertainty), but also where it is (spatial uncertainty).
Further, most existing object detection challenges test on datasets of uncorrelated images mined from internet repositories like Flickr. However, this does not represent what a robot experiences. A robot instead receives video as input, which is highly temporally correlated, and inference may propagate information between frames.
This paper proposes a new vision task, probabilistic object detection, and a new Robotic Vision Challenge focused on evaluating it: the ACRV Robotic Vision Challenge 1 - Probabilistic Object Detection. It gives details of the new datasets used for the challenge and summarises how the challenge defines probabilistic object detections, along with the new metric used by the challenge (PDQ), which rewards detectors that accurately estimate their spatial and semantic uncertainty. The new dataset consists of video sequences, which also allows participants to design detection systems that exploit the temporal correlation between individual image frames.
II Simulated Data
ACRV Robotic Vision Challenge 1 - Probabilistic Object Detection uses video sequences captured from simulation, spanning multiple environments, day and night, and different camera heights. These video sequences are divided into a test set, a test_dev set, and a validation set, with ground truth made publicly available for the validation set. The test set is used for the first of our fixed-time challenges where top competitors will be awarded prizes at CVPR 2019. The test_dev set provides an ongoing benchmark that can be used as a baseline to drive future research.
No training set is provided for this challenge, and participants are encouraged to train on whatever data seems appropriate. This is facilitated by our challenge evaluating competitors on a subset of the well-known COCO classes. In doing so, we hope to avoid dataset bias and encourage competitors to develop systems that can generalise to the test data, rather than fitting to the particulars of this challenge.
The two test sets (CVPR test and test_dev) consist of video sequences captured in 3 different environments, with day and night lighting for each environment, and 3 different camera heights at each light level (a total of 18 sequences). The validation set is a reduced version of the test set, including data from only a single environment, in day and night modes, and 2 different camera heights (a total of 4 sequences). None of the environments or 3D models in the validation set are present within the test set. The 4 environments used are showcased in Fig. 1. Table I contains some broad statistics on each dataset, while class frequencies are presented in Fig. 2.
Table I: Broad statistics for each dataset.
| number of images | 21,491 | 56,513 | 123,704 |
| ground truth objects | 56,578 | 187,487 | 552,611 |
| avg objects per image | 2.63 | 3.31 | 4.46 |
| avg pixels per object | 2,791.1 | 2,791.3 | 2,292.4 |
II-A Benefits of Simulated Data
The use of simulation for collecting training data provides a number of advantages. The simulator can output pixel-perfect ground truth segmentation for every frame, without the costs associated with hand-labelling video data. These labels are finer-grained, more precise, and more consistent than those commonly present in human-labelled data. An example of this can be seen in Fig. 3. Automated labels allow “is_crowd” labels (as are present in Microsoft COCO) to be avoided in favour of labelling each individual instance. Simulation also allows video data to be collected in a wider range of environments, and under finer control, than are available to researchers in the real world. For example, the same location and objects can be present in an environment while the lighting conditions are adjusted to simulate day or night data capture. It also allows faster and more forgiving iteration and re-collection in the event of problems or faults.
II-B Generation process
The video sequences are all rendered using Unreal Engine 4 (https://www.unrealengine.com), using a modified version of the NVIDIA Dataset Synthesizer (https://github.com/jskinn/Dataset_Synthesizer) to record frames and ground truth. The environments used were all purchased from Evermotion (https://evermotion.org/). The test and test_dev sets use 3 environments each, while the validation set uses only a single environment. No environments are re-used between sets.
To generate realistic robot motions, we attach the camera to an agent that moves through the environment. The agent first queries the environment for all objects of known classes in the environment. It then chooses a single random object of each class, in a random class order, to produce a list of target objects that the agent will visit. The agent uses a version of recast navigation (https://github.com/recastnavigation/recastnavigation) built into Unreal Engine 4 to navigate between successive target objects in the list. When the agent reaches a particular target object, the camera zooms toward the object as far as it can in a straight line without intersecting any object in the environment. The camera then returns to its initial placement on the agent, and the agent navigates to the next target object in the list. This motion can be seen as representative of a camera mounted on a robotic arm attempting to get a better look at a given object.
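The target selection step described above can be sketched in a few lines of Python. This is only an illustration of the sampling logic, not the code used inside the simulator: the function name `choose_targets`, the dictionary layout, and the object identifiers are all hypothetical.

```python
import random


def choose_targets(objects_by_class, rng):
    """Pick one random object per known class, visiting classes in random order.

    objects_by_class maps a class name to the list of object ids of that
    class found in the environment (hypothetical data layout).
    """
    class_order = sorted(objects_by_class)  # fixed starting order for reproducibility
    rng.shuffle(class_order)                # random class order
    return [
        rng.choice(objects_by_class[name])  # a single random object of each class
        for name in class_order
        if objects_by_class[name]           # skip classes with no instances
    ]


# Hypothetical scene contents, keyed by class name.
scene = {
    "cup": ["cup_0", "cup_1"],
    "bed": ["bed_0"],
    "fork": ["fork_0", "fork_1", "fork_2"],
}
targets = choose_targets(scene, random.Random(0))
```

The resulting list contains exactly one object per class present in the scene, which is what makes the generated sequences object-centric.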
These two types of motion (moving between instances and zooming toward them) provide a wide variety of viewpoints on the test objects (see Fig. 4). Zooming the camera toward objects also helps handle objects of varying sizes, ensuring that small objects (such as knives, forks, or cell phones) are seen from close enough to detect, while introducing difficult frames for large objects like beds or tables that are too close to see the entire object. Making the generation object-centric also helps ensure that image sequences contain as many different classes as possible.
The agent has collision detection and simple blocking physics interaction with the environment, which sometimes leads to it becoming stuck on corners, or stuck against a wall trying to move closer to an object than is physically possible. This behaviour is representative of real robots, which may become stuck or remain stationary for reasons that are opaque to the object detector, and it is important for the object detectors used on robots to be able to handle these cases.
Similarly, when moving between target objects, the movement speed of the robot is decoupled from the frame rate of the data capture, which makes the distance moved between frames inconsistent. This too is representative of real robots, which cannot guarantee that the camera moves at a constant speed.
II-C Ground truth segments
Often, an image-aligned box is not a good representation of the true shape of an object. To alleviate this, instead of evaluating detections against ground truth bounding boxes, we evaluate against ground truth segments. A detection is rewarded for including pixels inside the ground truth segment, and penalised for pixels outside the bounding box of that segment. Pixels which are inside the ground truth bounding box but not part of the true segment are neither penalised nor rewarded, as an accurate box detection cannot help but include them. This helps encourage detectors that can actually find the majority of the target object, rather than the background around it. See Section IV and Hall et al. for a more detailed explanation of our evaluation process.
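The three pixel regions described above can be made concrete with a small sketch. It assumes the ground truth segment is given as a non-empty boolean mask stored as nested lists; the function name `classify_pixels` and the region names are hypothetical and chosen only for illustration.

```python
def classify_pixels(segment_mask):
    """Label every pixel relative to one ground truth segment.

    segment_mask is a list of rows of booleans (True = pixel belongs to the
    object) and is assumed to contain at least one True pixel.
    Returns a grid of strings: "reward", "ignore" or "penalise".
    """
    rows = [r for r, row in enumerate(segment_mask) if any(row)]
    cols = [c for row in segment_mask for c, inside in enumerate(row) if inside]
    top, bottom = min(rows), max(rows)   # bounding box of the segment
    left, right = min(cols), max(cols)

    labels = []
    for r, row in enumerate(segment_mask):
        label_row = []
        for c, inside in enumerate(row):
            if inside:
                label_row.append("reward")    # part of the true segment
            elif top <= r <= bottom and left <= c <= right:
                label_row.append("ignore")    # inside the box, not in the segment
            else:
                label_row.append("penalise")  # outside the ground truth box
        labels.append(label_row)
    return labels


# A small L-shaped object in a 4x4 image.
mask = [
    [False, False, False, False],
    [False, True,  False, False],
    [False, True,  True,  False],
    [False, False, False, False],
]
labels = classify_pixels(mask)
```

Here the pixel at row 1, column 2 falls inside the bounding box but outside the segment, so a detection covering it is neither rewarded nor penalised.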
II-D Tiny objects
As illustrated in Fig. 3, the simulator can label every single pixel belonging to every single object in the image. This is true even when objects are too small to feasibly detect, such as when they are only a single pixel. For this reason, we filter out ground truth objects that are less than 10px wide in either dimension, or do not have at least 100 pixels total. The filtering is implemented after detections are matched to ground truth objects, so that if by chance a detector manages to detect a tiny object, it is not penalised as a false positive.
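A minimal sketch of this filtering rule follows, assuming that "wide in either dimension" refers to the width and height of the ground truth bounding box and that matches are stored as (ground truth, detection) pairs. The names `is_tiny` and `filter_after_matching` and the dictionary keys are hypothetical.

```python
MIN_SIDE_PX = 10    # minimum extent in either dimension
MIN_AREA_PX = 100   # minimum number of labelled pixels


def is_tiny(width, height, num_pixels):
    """True if a ground truth object is too small to be evaluated."""
    return width < MIN_SIDE_PX or height < MIN_SIDE_PX or num_pixels < MIN_AREA_PX


def filter_after_matching(pairs):
    """Drop matched pairs whose ground truth object is tiny.

    pairs is a list of (ground_truth, detection) tuples produced by the
    matching step; either element may be None for an unmatched item.
    Removing the whole pair means a detection matched to a tiny object is
    not counted as a false positive, and a missed tiny object is not
    counted as a false negative.
    """
    return [
        (gt, det)
        for gt, det in pairs
        if gt is None or not is_tiny(gt["width"], gt["height"], gt["num_pixels"])
    ]


# Hypothetical matches for one frame.
pairs = [
    ({"width": 40, "height": 30, "num_pixels": 900}, "det_a"),  # kept
    ({"width": 4, "height": 3, "num_pixels": 9}, "det_b"),      # tiny, dropped with det_b
    (None, "det_c"),                                            # unmatched detection, kept
    ({"width": 12, "height": 12, "num_pixels": 60}, None),      # tiny and missed, dropped
]
kept = filter_after_matching(pairs)
```

Applying the filter before matching instead would leave `det_b` without a ground truth partner, which is exactly the false positive the challenge aims to avoid.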
III Probabilistic Bounding Box (PBox) Detection Format
Fig. 5: Probabilistic object detections represent the location of objects as probabilistic bounding boxes where the corners are modelled as 2D Gaussians (left). This induces a probability distribution over the pixels and allows the object detector to express spatial uncertainty (right). Figure courtesy of Hall et al.
Typically, the location of a detected object is expressed using a bounding box or segmentation mask. However, a standard bounding box cannot express spatial uncertainty: a pixel is either inside the box or it is not. Evaluating spatial uncertainty, as this challenge seeks to, therefore requires a detection format that can express it.
As storing pixel-wise spatial probabilities for every pixel in an image, for each detection, across all images in a dataset is memory-intensive, we choose to express spatial uncertainty using probabilistic bounding boxes (PBoxes) as outlined in Hall et al. Instead of defining the corners of a bounding box as fixed locations, as is done traditionally, PBoxes define corners as Gaussian distributions. These distributions express the detector’s uncertainty about the location of the object, and each pixel’s probability of being part of the object is its probability of being inside the box. For further details on PBox generation, we direct readers to the original paper by Hall et al. A visualisation of PBox Gaussian corners and the resulting spatial probability heatmap is shown in Fig. 5.
Our challenge accepts either PBoxes or standard bounding boxes for detections. Standard bounding boxes are simply treated as PBoxes where the corner covariances are 0, with no spatial uncertainty, typically leading to high penalties for being over-confident.
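To illustrate how Gaussian corners induce a per-pixel probability, the sketch below makes two simplifying assumptions that are not taken from the challenge definition: the two corners are independent, and each corner has an isotropic covariance (one standard deviation shared by x and y, no correlation). Under those assumptions the probability that a pixel is inside the box factorises into four one-dimensional Gaussian CDF terms. The function names are hypothetical; see Hall et al. for the general formulation.

```python
import math


def normal_cdf(value, mean, std):
    """P(X <= value) for X ~ N(mean, std**2); a step function when std is 0."""
    if std == 0:
        return 1.0 if value >= mean else 0.0
    return 0.5 * (1.0 + math.erf((value - mean) / (std * math.sqrt(2.0))))


def pixel_probability(x, y, top_left, bottom_right, tl_std, br_std):
    """Probability that pixel (x, y) lies inside a PBox.

    Assumes independent corners with isotropic covariance (a simplification).
    """
    tl_x, tl_y = top_left
    br_x, br_y = bottom_right
    p_left = normal_cdf(x, tl_x, tl_std)      # left edge is at or before x
    p_top = normal_cdf(y, tl_y, tl_std)       # top edge is at or before y
    p_right = normal_cdf(-x, -br_x, br_std)   # right edge is at or after x
    p_bottom = normal_cdf(-y, -br_y, br_std)  # bottom edge is at or after y
    return p_left * p_top * p_right * p_bottom


box = ((10, 10), (50, 50))
centre = pixel_probability(30, 30, *box, tl_std=2.0, br_std=2.0)
corner = pixel_probability(10, 10, *box, tl_std=2.0, br_std=2.0)
# A standard bounding box is the zero-covariance special case.
hard_inside = pixel_probability(30, 30, *box, tl_std=0.0, br_std=0.0)
hard_outside = pixel_probability(5, 30, *box, tl_std=0.0, br_std=0.0)
```

With zero standard deviation every pixel receives a probability of exactly 0 or 1, which is why a standard box that is even slightly misplaced is treated as fully confident about its errors.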
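Table II: Intermediate statistics from the challenge leaderboard.
| Participant | PDQ | avg pPDQ | avg spatial quality | avg label quality | TP | FP | FN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Participant 1 (FasterRCNN with fixed covariance) | 0.141 | 0.482 | 0.384 | 0.737 | 98,916 | 41,645 | 197,451 |
| Participant 2 (method unknown) | 0.133 | 0.476 | 0.372 | 0.770 | 109,241 | 91,598 | 187,126 |
| Participant 3 (method unknown) | 0.088 | 0.344 | 0.239 | 0.772 | 87,031 | 42,054 | 209,336 |
| Participant 4 (YOLOv3 with percentage covariance) | 0.082 | 0.499 | 0.378 | 0.855 | 50,713 | 12,234 | 245,654 |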
IV Probability-based Detection Quality (PDQ)
Existing object detection metrics such as mean Average Precision (mAP) cannot evaluate detections with spatial uncertainty. Therefore, for this challenge we use the Probability-based Detection Quality (PDQ). PDQ explicitly evaluates the spatial and semantic quality for each ground truth/detection pair, before combining the two into a single quality score (the “pairwise PDQ” or pPDQ). It then performs optimal assignment between ground truth segments and detections to produce a final average PDQ score.
More formally, the pairwise PDQ for a given detection and ground truth is the weighted geometric mean of two components, spatial quality and label quality.
The label quality is simply the probability assigned to the true class of the ground truth by the detection. The spatial quality is calculated from two other values, the foreground loss and the background loss.
The foreground loss measures how high a probability the detection assigns to the pixels in the ground truth segment, while the background loss measures how low a probability it assigns to pixels outside the bounding box (Section II-C explains this distinction).
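The following sketch shows one way these pieces could fit together. It is written under stated assumptions rather than copied from the challenge code: both losses are taken to be negative log-probabilities normalised by the number of ground truth segment pixels, the spatial quality is taken to be the exponential of the negated sum of the two losses, and the geometric mean uses equal weights by default. All function names and the `EPS` clamp are hypothetical; Hall et al. give the definitive definitions.

```python
import math

EPS = 1e-14  # hypothetical clamp to avoid log(0)


def foreground_loss(probs_on_segment):
    """Average negative log-probability over ground truth segment pixels."""
    total = sum(math.log(max(p, EPS)) for p in probs_on_segment)
    return -total / len(probs_on_segment)


def background_loss(probs_outside_box, num_segment_pixels):
    """Negative log of (1 - p) over detected pixels outside the ground truth box.

    Normalised by the segment size (an assumption of this sketch).
    """
    total = sum(math.log(max(1.0 - p, EPS)) for p in probs_outside_box)
    return -total / num_segment_pixels


def spatial_quality(fg_loss, bg_loss):
    """Maps the two losses to (0, 1]; equals 1 only when both losses are 0."""
    return math.exp(-(fg_loss + bg_loss))


def pairwise_pdq(spatial_q, label_q, spatial_weight=0.5):
    """Weighted geometric mean; equal weights give sqrt(spatial_q * label_q)."""
    return spatial_q ** spatial_weight * label_q ** (1.0 - spatial_weight)


# A spatially perfect detection: probability 1 on every segment pixel and
# no probability mass outside the ground truth bounding box.
perfect_spatial = spatial_quality(
    foreground_loss([1.0, 1.0, 1.0, 1.0]),
    background_loss([], num_segment_pixels=4),
)
score = pairwise_pdq(perfect_spatial, label_q=0.81)

# An over-confident detection that spills outside the box is punished hard.
spilled = spatial_quality(
    foreground_loss([1.0, 1.0, 1.0, 1.0]),
    background_loss([1.0, 1.0], num_segment_pixels=4),
)
```

Because the losses are logarithmic, assigning probability 1 to a pixel outside the ground truth box costs far more than assigning it a hedged probability such as 0.5, which is the behaviour that rewards honest uncertainty.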
For further details on how PDQ is calculated, please see Hall et al.
PDQ produces the best scores when the uncertainty is correctly calibrated. That is, it is better to produce inaccurate detections that are known to be inaccurate than to produce more accurate detections that are over- or under-confident. This helps encourage researchers to produce systems which accurately express their spatial uncertainties, and allows robotic systems to reason about the decisions they derive from object detector outputs.
Calculating the PDQ score requires the calculation of a number of intermediate values, including the spatial quality, label quality, number of true positives (non-zero pPDQ assignments), false positives (no optimally assigned ground truth), and false negatives (no optimally assigned detection). This is ideal for this challenge, as it allows us to provide fine-grained feedback to participants about the strengths and weaknesses of a particular system. Table II illustrates these useful intermediate statistics from the challenge leaderboard.
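The relationship between the assignment and these intermediate counts can be sketched for a single frame as follows. The example uses a brute-force search over permutations, which is only practical for a handful of objects; it stands in for whatever assignment solver the challenge actually uses. It also assumes the final score is the summed pPDQ divided by the total number of true positives, false positives, and false negatives. The function name `evaluate_frame` and the input layout are hypothetical.

```python
from itertools import permutations


def evaluate_frame(ppdq):
    """Optimally assign detections to ground truth objects for one frame.

    ppdq[g][d] is the pairwise PDQ between ground truth g and detection d.
    Assumes at least one ground truth row and a rectangular matrix.
    """
    n_gt, n_det = len(ppdq), len(ppdq[0])
    size = max(n_gt, n_det)
    # Pad to a square matrix with zeros so every item gets a (possibly dummy) partner.
    padded = [
        [ppdq[g][d] if g < n_gt and d < n_det else 0.0 for d in range(size)]
        for g in range(size)
    ]
    best = max(
        permutations(range(size)),
        key=lambda perm: sum(padded[g][perm[g]] for g in range(size)),
    )
    matched = [(g, best[g]) for g in range(n_gt) if best[g] < n_det]
    true_pos = [(g, d) for g, d in matched if ppdq[g][d] > 0.0]
    total = sum(ppdq[g][d] for g, d in true_pos)
    tp = len(true_pos)
    fn = n_gt - tp    # ground truth objects with no non-zero assignment
    fp = n_det - tp   # detections with no non-zero assignment
    return {"tp": tp, "fp": fp, "fn": fn, "pdq": total / (tp + fp + fn)}


# Two ground truth objects, three detections (hypothetical pPDQ values).
stats = evaluate_frame([
    [0.6, 0.0, 0.1],
    [0.0, 0.0, 0.0],
])
```

In this example one detection matches well, one ground truth object is missed, and two detections are left over, so the single pPDQ of 0.6 is spread across four outcomes. For realistic numbers of objects, a polynomial-time solver such as SciPy's `linear_sum_assignment` would be a more sensible choice than enumerating permutations.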
While participant 1 has the highest overall PDQ score, the intermediate statistics reveal that participant 4 has the highest average score when it actually detects something, but misses far more objects than the other participants. Meanwhile, participant 2 detects the most objects successfully, but at a lower quality on average than participant 1, leading to a lower overall score.
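The trade-off described above can be checked numerically from the leaderboard figures. The sketch assumes that the columns of Table II are, in order, PDQ, average pPDQ, average spatial quality, average label quality, true positives, false positives, and false negatives, and that the overall PDQ equals the average pPDQ over true positives multiplied by the number of true positives and divided by the total number of true positives, false positives, and false negatives. The function name `pdq_from_stats` is hypothetical.

```python
def pdq_from_stats(avg_ppdq, tp, fp, fn):
    """Overall PDQ implied by the average pPDQ and the TP/FP/FN counts."""
    return avg_ppdq * tp / (tp + fp + fn)


# (reported PDQ, avg pPDQ, TP, FP, FN) taken from Table II.
leaderboard = {
    "Participant 1": (0.141, 0.482, 98_916, 41_645, 197_451),
    "Participant 2": (0.133, 0.476, 109_241, 91_598, 187_126),
    "Participant 3": (0.088, 0.344, 87_031, 42_054, 209_336),
    "Participant 4": (0.082, 0.499, 50_713, 12_234, 245_654),
}

recomputed = {
    name: pdq_from_stats(avg_ppdq, tp, fp, fn)
    for name, (_, avg_ppdq, tp, fp, fn) in leaderboard.items()
}
largest_gap = max(
    abs(recomputed[name] - reported)
    for name, (reported, *_rest) in leaderboard.items()
)
```

Under these assumptions the recomputed scores agree with the reported PDQ values to within 0.002, with the difference attributable to the three-decimal rounding of the table. Participant 4's high average pPDQ is outweighed by its large false negative count in the denominator, while participant 2's many true positives are offset by its false positives.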
In this paper, we have introduced the first ACRV robotic vision challenge. This new challenge introduces probabilistic object detection, extending existing object detection tasks to provide spatial and semantic uncertainty. We introduce a new test dataset that better represents a robot’s viewpoint and motions, and use a new evaluation measure that rewards well-calibrated expressions of spatial uncertainty. We hope that this new challenge will help drive object detection research toward robotics-focused applications.
This research was conducted by the Australian Research Council Centre of Excellence for Robotic Vision under project CE140100016, and supported by a Google Faculty Research Award to Niko Sünderhauf.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question answering,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
- D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, “Dropout sampling for robust object detection in open-set conditions,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–7.
- D. Hall, F. Dayoub, J. Skinner, P. Corke, G. Carneiro, and N. Sünderhauf, “Probability-based detection quality (PDQ): A probabilistic approach to detection evaluation,” arXiv preprint arXiv:1811.10800, 2018.