Robotic systems have seen numerous innovations that enable them to maneuver over rough terrain, to avoid collisions and even to take flight. However, a major limitation of these autonomous agents remains their lack of perception of the world around them. Without perception, modern robots are incapable of autonomously interacting with the objects in their surroundings. Hence, although hardware advances in stabilizing a robot's movement have reached remarkable achievements, the perception software still lags behind.
Most robotic perception tasks aim at providing an autonomous agent with the skill of either localizing itself in the surrounding environment, or interacting with nearby objects through grasping and manipulation. Our work investigates the latter skill: we develop robotic perception techniques for real-time detection and tracking of multiple 3D objects from RGB-D data acquired with consumer depth cameras. In particular, we focus on industrial robotic applications, where intelligent robots have to localize, grasp, assemble and relocate objects on the production line, for tasks such as pick-and-place and bin picking. From a wider perspective, the solution we developed is fundamental to most of the envisioned applications in service and personal robotics, where robotic assistants help people with daily tasks in their domestic environments. It is also fundamental to AR applications, where the poses of several objects have to be efficiently estimated while the user interacts with them.
Our framework is inspired by the way humans interact with the objects around them. To interact with objects in our surroundings, we first need to localize each object in the scene, then keep track of it throughout the process. Converting this intuitive idea into an algorithm involves a two-step procedure: object detection and temporal tracking. These two methods work hand-in-hand such that the former perceives the object in the scene while the latter keeps track of the object's movement during interaction.
2 Seamless Object Detection and Tracking
The goal of the framework is to find the objects in the scene while estimating their orientation and location in the 3D space, then to continuously track them throughout the following frames. Object detection performs a sliding window approach to simultaneously find the objects in the scene and estimate their pose. Taking the resulting pose from the detector as input, the tracker estimates the relative transformation of the object between two consecutive frames and temporally relays the pose from one frame to the next. Intended for autonomous robots, the object detection and tracking run automatically in order to detect whenever the object of interest is present and stop tracking when the object is no longer visible.
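The hand-off described above can be sketched as a simple loop. This is an illustrative sketch, not the paper's implementation; `detect_pose` and `track_pose` are hypothetical stand-ins for the sliding-window detector and the frame-to-frame tracker.

```python
# Minimal sketch of the detect-then-track loop: the detector fires when no
# pose is known, and the tracker relays the pose from frame to frame.
def run_framework(frames, detect_pose, track_pose):
    """Return the estimated pose (or None) for each frame in sequence."""
    poses = []
    current_pose = None
    for frame in frames:
        if current_pose is None:
            # Detector: scans the frame with a sliding window; returns a
            # 6D pose, or None when the object is not visible.
            current_pose = detect_pose(frame)
        else:
            # Tracker: estimates the relative transformation between two
            # consecutive frames and updates the pose; returning None
            # signals tracking loss and re-triggers detection.
            current_pose = track_pose(frame, current_pose)
        poses.append(current_pose)
    return poses
```

Because a lost track leaves `current_pose` as None, the detector automatically takes over again in the next frame, matching the automatic start/stop behavior described above.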
Given the 3D CAD model of the object, both algorithms learn random forests from multiple, synthetically rendered depth images acquired by positioning the camera around the model. Since they perform different tasks, the detector predicts the rotation and translation parameters for each region of the sliding window, while the tracker predicts the relative transformation between two consecutive frames [10, 11, 12].
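The learning setup can be illustrated with a generic forest regressor. This is a sketch under stated assumptions, not the authors' forest: random vectors stand in for features of rendered depth patches, and a 6-vector (3 rotation + 3 translation parameters) stands in for the pose target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: in the paper, features come from depth images
# rendered around the CAD model; here we use random data of the same shape.
rng = np.random.default_rng(0)
X = rng.random((500, 64))   # e.g. flattened 8x8 depth patches (synthetic)
y = rng.random((500, 6))    # 3 rotation + 3 translation parameters

# A multi-output random forest mapping patch features to pose parameters.
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
pose = forest.predict(X[:1])   # predicted 6D pose for one patch
print(pose.shape)              # (1, 6)
```

The tracker's forest would instead regress the relative transformation between consecutive frames, but the training mechanics are analogous.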
Table 1: f1-scores per object sequence.

| Method | Coffee Cup | Shampoo | Joystick | Camera | Juice Carton | Milk | Mean |
|---|---|---|---|---|---|---|---|
| Point-Pair Features | 86.7% | 65.1% | 27.7% | 40.7% | 60.4% | 25.9% | 51.1% |
| Coordinate Reg. | 91.2% | 82.4% | 75.9% | 69.1% | 89.7% | 47.6% | 75.9% |
| Latent Forest | 89.1% | 79.2% | 54.9% | 39.4% | 88.3% | 39.7% | 65.1% |
| Deep Learning | 97.2% | 91.0% | 89.2% | 38.3% | 86.6% | 46.3% | 74.8% |
[Table 2: translation and rotation errors of PCL, C&C, Krull, our Learner and our AR tracker on the (a) Kinect Box and (c) Orange Juice sequences, together with each method's computational requirement (4 cores vs. 1 core); the numeric entries were lost in extraction.]
The framework satisfies various characteristics that are required by applications in robotics and AR. These include (1) the robustness and accuracy to find the object in the scene and to estimate its pose; (2) the efficiency to run in real-time with minimal computational expense; and (3) cost-effective system requirements.
3.1 Robust and Accurate to Detect and Track
Fundamentally, the goal is to develop two powerful algorithms such that, in the combined approach, each performs its assigned task well.
Robust Detector with 6D Pose Estimate. We remind the reader that the f1-score measures the detection rate by combining both precision and recall into one value. An f1-score of 99.3% thus indicates that the detection rate is almost perfect.
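The relation between precision, recall and the f1-score quoted above can be made concrete with its harmonic-mean formula:

```python
# The f1-score is the harmonic mean of precision and recall:
# f1 = 2 * precision * recall / (precision + recall).
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# When both precision and recall are 0.993, the f1-score is also 0.993.
print(round(f1_score(0.993, 0.993), 3))  # 0.993
```

Being a harmonic mean, the score is dominated by the weaker of the two quantities, so a high f1-score certifies that neither precision nor recall is low.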
An interesting observation from Table 1 is that all the other methods [1, 4, 5, 6, 7, 13] have a low f1-score on the "Milk" sequence; among them, the highest is 55.8%, reached by LineMod. The reason for the low results is that the authors of the dataset attached small objects to the object of interest as shown in Fig. 2. In effect, this occludes several regions of the object of interest and slightly changes its geometry. In contrast, we can handle these occlusions and achieve an f1-score of 99.3% on this sequence, which is 43.5% higher than the other methods.
Accurate Temporal Tracker.
On the other hand, the temporal tracker attains the lowest translation and rotation errors in Table 2 on the evaluated dataset. In addition, we also introduce a version of the tracker for AR. Compared to [10, 12], the latest version of the temporal tracker targets the user experience in AR: it minimizes jitter by optimizing over both the RGB and depth images to acquire more accurate pose estimates, as evaluated in Table 2.
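The temporal relay of the pose amounts to composing rigid transformations: the tracker's per-frame relative transform is applied on top of the previous pose. A minimal sketch with 4x4 homogeneous matrices (illustrative values, not from the paper):

```python
import numpy as np

# Temporal relay: the pose at frame t is the relative transform between
# frames t-1 and t composed with the pose at frame t-1.
def compose(delta, pose):
    """Both arguments are 4x4 homogeneous rigid transforms."""
    return delta @ pose

# Example: the detector provides an initial pose; the tracker then applies
# a small constant translation along x between consecutive frames.
pose = np.eye(4)
delta = np.eye(4)
delta[0, 3] = 0.01   # 1 cm of motion per frame (hypothetical)
for _ in range(10):
    pose = compose(delta, pose)
print(pose[0, 3])    # ~0.1 after ten frames
```

Because only the small relative transform is estimated per frame, errors stay local; the detector can re-initialize the chain whenever tracking is lost.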
Hence, the combined approach delivers high detection rates as well as highly accurate poses. Moving past the public datasets, Fig. 1 demonstrates the robustness of our algorithm in different types of challenging scenarios, including clutter, partial occlusions, and varying lighting conditions. From simple to complex geometrical shapes, the results from Table 2 and Fig. 1 show the capability of the framework to generalize its performance to different object shapes and sizes.
Table 3: Runtime and computational power of the detector and the tracker.

| | Our Detector | Our Tracker |
|---|---|---|
| Time | 872.1 ms | 1.4-2.2 ms |
| Power | 8 cores | 1 core |
3.2 Low Latency and Low Memory Consumption
Table 3 summarizes the detection time and tracking time with respect to the computational power. Note that, after detecting the object, only the temporal tracker keeps track of the object in the subsequent frames. Hence, the latency of the framework depends on the temporal tracker alone, which is approximately 2 ms per frame for each object on a single CPU core. This efficiency is a substantial improvement over the other temporal trackers from Table 2, which run at 2786.0 ms, 132.0 ms and 131.0 ms per frame, where [3, 8] use a GPU. We further show that we achieve real-time performance when tracking 108 moving objects at 30 fps with 8 CPU cores. Moreover, considering that our learning-based approach needs to store the random forests, we achieve a low memory footprint of about 40.5 MB per object.
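The 108-objects-at-30-fps figure is consistent with the ~2 ms per-object tracking time; a back-of-the-envelope check of the core-time budget:

```python
# Core-milliseconds available per object per frame when tracking
# 108 objects at 30 fps on 8 CPU cores.
fps = 30
objects = 108
cores = 8

frame_budget_ms = 1000.0 / fps              # ~33.3 ms of wall time per frame
budget_ms = frame_budget_ms * cores / objects
print(round(budget_ms, 2))                  # ~2.47 ms of core time per object
```

The available ~2.47 ms per object sits just above the measured 1.4-2.2 ms tracking time, which is why 108 objects is roughly the real-time limit on this hardware.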
3.3 Low-Cost Hardware Requirements
The hardware requirements for our framework are (1) a standard computer and (2) a consumer depth sensor (Microsoft Kinect and Kinect II, Asus Xtion, PrimeSense Carmine, Orbbec Astra) that costs about 150 USD. These sensors are much cheaper than the industrial 3D sensors currently in use. In addition, due to the framework's efficiency, the cost remains low since powerful GPUs are not required: both the tracker and the detector use only CPU cores. All the quantitative and qualitative evaluations use an Intel(R) Core(TM) i7 CPU in a Lenovo W530 laptop.
We present a seamless object detection and tracking framework that automatically finds the objects of interest in the scene and keeps track of them across time. Our evaluation shows that its robustness, accuracy, efficiency and cost-effectiveness make it an ideal framework for applications in robotic perception and interaction for industrial robotics on the production line. Another set of applications is in Augmented Reality (AR), Mixed Reality (MR) and Virtual Reality (VR), where not only does pose estimation play a fundamental role in finding the objects in the scene, but it is also an enabling technology in a pipeline composed of multiple modules running on the same power- and memory-limited hardware platform. Therefore, applications have ample time to build or render on top of our framework, as well as the capacity to fully utilize the machine's hardware resources. Notably, it is also highly suitable for applications on mobile platforms.
We prepared demonstrative videos (https://youtu.be/7rKBZZHJkFk, https://youtu.be/1P184ZocMo8) to show the framework's performance. For AR applications, we introduce another set of videos (https://youtu.be/t-WDIqEPQ3g, https://youtu.be/8-0xsc2abQs) that uses our new temporal tracker to estimate a more accurate pose and to handle challenging scenarios.
This extended abstract was submitted to the demo session of ISMAR 2017 and the 3rd International Workshop on Recovering 6D Object Pose organized at ICCV 2017.
-  E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.
-  L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
-  C. Choi and H. I. Christensen. RGB-D object tracking: A particle filter approach on GPU. In IROS, 2013.
-  A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.-K. Kim. Recovering 6D object pose and predicting next-best-view in the crowd. In CVPR, 2016.
-  B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3D object recognition. In CVPR, 2010.
-  S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In ACCV, 2012.
-  W. Kehl, F. Milletari, F. Tombari, S. Ilic, and N. Navab. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In ECCV, 2016.
-  A. Krull, F. Michel, E. Brachmann, S. Gumhold, S. Ihrke, and C. Rother. 6-DoF model based tracking via object coordinate regression. In ACCV, 2014.
-  R. B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In ICRA, 2011.
-  D. J. Tan and S. Ilic. Multi-forest tracker: A chameleon in tracking. In CVPR, 2014.
-  D. J. Tan, N. Navab, and F. Tombari. Looking beyond the simple scenarios: Combining learners and optimizers in 3D temporal tracking. TVCG, 2017.
-  D. J. Tan, F. Tombari, S. Ilic, and N. Navab. A versatile learning-based 3D temporal tracker: Scalable, robust, online. In ICCV, 2015.
-  A. Tejani, D. Tang, R. Kouskouridas, and T.-K. Kim. Latent-class Hough forests for 3D object detection and pose estimation. In ECCV, 2014.