Affordable RGB-D sensors (such as Microsoft Kinect and Intel Realsense) are becoming increasingly common in modern robotics, due to their low price and portable size. With the increasing attention on semantic understanding for mobile autonomous robots, we ask ourselves: how might mobile robots take advantage of depth information (RGB-D) for object detection in real-world robotics applications? In this work, we take hazmat sign detection (see Fig. 1) as an example task for using geometric image rectification to aid a CNN-based object detector.
and convolutional neural networks (CNN). Since the initial proposal of RCNN, state-of-the-art CNN-based approaches have achieved impressive results on large-scale standardized datasets.
A common approach for current state-of-the-art object detection networks is to feed images into the CNN detection network directly, as an end-to-end solution. This eliminates the need for feature engineering and enables the CNN to learn features automatically. With sufficient training images, state-of-the-art detection networks can thus distinguish different signs with good performance.
However, feeding images directly to a CNN detector has two main drawbacks for mobile robotics applications. First, the CNN detector needs a large number of images for training, but creating such a training set is challenging in most cases. In practice, collecting the training set turns out to be very time-consuming. As we expect the CNN detector to automatically learn how to deal with geometric information for perspective distortion and multiple scales, the network needs training data from different points of view. Given the large amount of human effort and time needed for dataset collection, this is even more difficult for real-world mobile robotics.
Second, even with a nice training set, there is no guarantee that the detector can correctly learn perspective distortion. One key issue is that an end-to-end solution requires deeper and more complicated neural networks. This is because the network needs more layers for viewpoint angle estimation and perspective distortion. In some cases, this might lead to an over-fitting situation, where the network fails to learn useful features.
Moreover, previous research has shown that it is hard for a CNN detection framework to learn from raw RGB-D images automatically. Perspective distortion is introduced when the image is not captured in a canonical view of the object. Intuitively, parallel 3D lines are no longer parallel in the 2D image. This results in differences between images from different viewing angles. For planar objects, a homography matrix can be used to transfer between those images.
In this work, we propose to utilize the depth information for rectifying RGB information with a homography matrix. In short, a homography matrix is calculated from depth information to transform input RGB images to the canonical view. The CNN detector then takes the rectified RGB images as input to perform detection. Finally, the detection result on the rectified RGB images is transformed back to the original images.
The proposed method provides two key advantages. First, image rectification simplifies the problem for the CNN detector. Typical CNN detectors suffer from multi-scale input images and show noticeable weakness in bounding box regression accuracy. Image rectification can avoid the multi-scale problem to some extent. As for bounding box accuracy, since all images have been rectified to the canonical view, the problem becomes slightly easier for the CNN. Second, because the CNN detector only takes canonical-view images as input, the proposed method requires a smaller training set, which reduces the workload for practical deployment with mobile robotics.
What should be learned? Perspective distortion is something for which exact models are known, so we should not have to learn it. This combination of geometric and semantic models is motivated by having a smaller, more efficient network that focuses on the part of the problem for which it is best suited and most needed.
Our contributions are:
We propose a feasible way of combining geometric information with CNN detectors to improve detection performance.
Our approach shows good tolerance towards noise in depth images, with homography based image rectification.
We successfully reduce the number of images needed for training the CNN detector, because perspective distortion has been dealt with in advance. It is especially meaningful for practical usage in mobile robotics when facing a new environment or target object.
We release a new hazmat sign detection dataset. To the best of our knowledge, it is the first RGB-D hazmat sign detection dataset.
I-C Paper Organization
II Related Work
II-A Hazmat Sign Detection
Hazmat sign detection has been studied by the robotics community for a long time. It is still challenging because of detection speed, illumination changes, background similarity, size variety, and inter-class variety. Due to the importance of hazmat sign detection for rescue robots, the RoboCup Rescue League competition includes the task of detecting hazmat signs in the system inspection stage and in exploration tasks.
Existing sign detection algorithms can be categorized into 4 types: 1) color-based, 2) shape-based, 3) saliency-based, and 4) deep neural network-based. Typical color-based methods use color as strong prior knowledge and key features, like the color histogram. However, color-based methods can be easily affected by illumination changes. Shape-based methods usually use feature detection and matching, like SIFT or SURF. Those methods, however, can be affected by perspective distortion from different viewpoints. Saliency-based methods use saliency detection to speed up the detection process and find the region of interest where the hazmat sign might be located. Recent research has focused on CNN methods.
II-B Image Rectification
There are two main methods for image rectification: 3D reprojection and 2D homography matrix transformation. 3D reprojection first generates a 3D point cloud from the input (RGB-D) and then reprojects the points to the new image plane. In the 2D homography matrix transformation approach, we directly apply a 3 by 3 homography matrix to the input RGB image.
Where and are images, x are the homogenous coordinates.
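To make the 2D approach concrete, the following minimal sketch applies a 3 by 3 homography to pixel coordinates in homogeneous form. The function name `apply_homography` and the example matrix `H_shift` are hypothetical and serve only as an illustration; they are not part of the described system.

```python
import numpy as np

def apply_homography(H, points):
    """Map Nx2 pixel coordinates through a 3x3 homography H."""
    pts = np.asarray(points, dtype=float)
    # Lift (x, y) to homogeneous coordinates (x, y, 1).
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ np.asarray(H, dtype=float).T
    # Divide by the third coordinate to return to pixel coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical example: a pure translation by (10, 5) pixels.
H_shift = np.array([[1.0, 0.0, 10.0],
                    [0.0, 1.0, 5.0],
                    [0.0, 0.0, 1.0]])
shifted = apply_homography(H_shift, [[0.0, 0.0], [2.0, 3.0]])
```

A general homography has a non-trivial third row, which is why the division by the third coordinate is needed.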
In the proposed approach, rectification via a homography matrix is used, due to its robustness. Low-cost RGB-D cameras (like Intel Realsense) tend to produce a quite noisy depth map. Through our experiments, we found that the point cloud recovered from the depth map is too noisy for 3D reprojection. As it does not operate in 3D space, the homography method only requires the point cloud to be accurate enough for plane extraction, a requirement that can be easily met via RANSAC.
II-C Combining Geometry and CNN
Unfortunately, it remains unclear how to take good advantage of the additional depth information from RGB-D. Feeding raw RGB-D images into a CNN will not help much, as the high variance of the depth information makes the learning process of the CNN detector even harder.
Some work instead utilizes depth information geometrically. One approach combines CNN with geometry priors by using CNN for image-based 2D part location estimates and assuming a geometry model for 3D pose reconstruction. Another utilizes the geometric constraints on translation imposed by the 2D bounding box to recover a stable and accurate 3D object pose. A third uses geometric shape features to boost the performance of neural networks. Yet another uses depth for proposal generation with contour detection and encodes the depth information before forwarding it to the CNN detector. The work closest to this paper first performs contour detection on RGB images and then uses depth to rectify the proposal to the canonical view, so that feature matching becomes easier. However, its experiments were performed with conventional feature descriptors, such as SIFT and ORB. In contrast, our system does not require contour detection. This extends the usage to complicated planar objects, where contour detection might fail. To the best of our knowledge, there is no concrete research on using image rectification for CNN-based object detectors, especially with the consideration of mobile robotics.
III-A Overview Framework
The detection framework takes 3 inputs: 1) RGB images, 2) a point cloud, and 3) the camera intrinsic matrix. The Intel Realsense driver provides the point cloud as output, so we use it as the input. Depth images could easily be substituted for it in the implementation.
III-B Rectification Pipeline
The goal of the algorithm is to rectify the input images based on planes detected in the 3D data, as shown in Figure 2. The overall pipeline contains the following stages and is shown in Figures 4 and 5. The input image is rectified to multiple parallel viewpoints, so the rectification module needs to compute a set of homography matrices as the output.
Estimate plane segmentation from 3D point cloud
Calculate virtual canonical viewpoint
Compute the initial homography matrix for image rectification (rectify to the virtual canonical viewpoint)
Refine the homography matrix by applying translation matrix (sliding through the image)
Apply rectification matrices to get rectified images
III-C Plane Segmentation
III-C1 Plane Estimation
RANSAC is used to estimate plane parameters from the input point cloud. RANSAC is robust for plane estimation on noisy inputs, which in turn improves the robustness of the whole detection framework. In this step, we obtain the major planes in the scene. For each plane, the following are calculated (with respect to the original viewpoint, ):
the plane parameters .
where is the normal of the plane, , is the distance from viewpoint to plane.
boundary points where
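As an illustration of the plane estimation step, the following minimal sketch fits a single plane to a point cloud with RANSAC. It assumes the plane is parameterized by a unit normal n and a distance d with n·x = d. The function name, the inlier threshold, the iteration count and the synthetic data are hypothetical choices, not values reported in this paper.

```python
import numpy as np

def fit_plane_ransac(points, threshold=0.01, iterations=200, seed=0):
    """Return (normal, d, inlier_mask) for the plane n.x = d with the most inliers."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(pts), dtype=bool)
    best_model = None
    for _ in range(iterations):
        # Three random points define a candidate plane.
        sample = pts[rng.choice(len(pts), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        d = normal @ sample[0]
        # Points closer than the threshold to the plane count as inliers.
        mask = np.abs(pts @ normal - d) < threshold
        if mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (normal, d)
    if best_model is None:
        raise ValueError("no non-degenerate sample found")
    return best_model[0], best_model[1], best_mask

# Synthetic example (assumed data): a noisy plane z = 1.5 plus random outliers.
rng = np.random.default_rng(1)
plane_pts = np.column_stack([rng.uniform(-1, 1, 300),
                             rng.uniform(-1, 1, 300),
                             1.5 + rng.normal(0, 0.002, 300)])
outliers = rng.uniform(-2, 2, (60, 3))
normal, d, inliers = fit_plane_ransac(np.vstack([plane_pts, outliers]))
```

The sign of the returned normal is arbitrary at this stage, since it depends on the order of the sampled points.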
III-C2 Unique Normal
Each plane has two normals, in opposite directions. In order to calculate the new viewpoint, which has to be correctly aligned with the original viewpoint, it is necessary to use one unique normal for each plane. The unique normal is defined as the one not facing toward the origin.
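The choice of the unique normal can be sketched as follows, assuming the plane is written as n·x = d with the original viewpoint at the origin. Under this assumed convention the normal points away from the origin exactly when d is positive. The function name is hypothetical.

```python
import numpy as np

def orient_away_from_origin(normal, d):
    """Return (normal, d) of the plane n.x = d with the normal pointing away from the origin."""
    normal = np.asarray(normal, dtype=float)
    # For any point p on the plane, n.p = d. A negative d means the normal
    # points from the plane back toward the origin, so both signs are flipped.
    if d < 0:
        return -normal, -d
    return normal, d

# Example: a plane 2 m in front of the camera, given with the "wrong" normal.
unique_normal, unique_d = orient_away_from_origin([0.0, 0.0, -1.0], -2.0)
```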
III-D Calculating the Virtual Canonical Viewpoint
In order to transform images to the canonical view, a virtual canonical viewpoint needs to be calculated from the plane parameters. We set the virtual viewpoint at a fixed distance away from the plane centroid. In our experiments, we set this distance to 1.2 meters to align with the training set, as the training set is collected at distances of 0.8 and 1.2 meters.
The new viewpoint is calculated as follows:
Where is a 4 * 4 matrix denotes the position and orientation of in . is the normal of the plane, , is the distance from viewpoint to plane.
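One possible construction of such a viewpoint is sketched below. It is an assumption for illustration and not necessarily the exact formula used here: the virtual camera is placed at the plane centroid minus the fixed distance along the unique normal, its optical axis (z) is aligned with the normal, and its x axis is kept as close as possible to the original x axis. The function name and the choice of reference axis are hypothetical.

```python
import numpy as np

def virtual_viewpoint(normal, centroid, distance=1.2):
    """Return a 4x4 pose of the virtual camera expressed in the original camera frame."""
    z = np.asarray(normal, dtype=float)
    z = z / np.linalg.norm(z)              # optical axis looks along the plane normal
    x = np.cross([0.0, 1.0, 0.0], z)       # keep x roughly aligned with the original x axis
    if np.linalg.norm(x) < 1e-8:
        raise ValueError("normal is parallel to the reference up axis")
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                     # completes a right-handed frame
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2] = x, y, z
    # Camera centre: step back from the centroid against the normal.
    T[:3, 3] = np.asarray(centroid, dtype=float) - distance * z
    return T

# Example with assumed values: a plane tilted by 30 degrees about the vertical axis.
angle = np.deg2rad(30.0)
normal = np.array([np.sin(angle), 0.0, np.cos(angle)])
centroid = np.array([0.3, 0.1, 2.0])
T = virtual_viewpoint(normal, centroid)
centroid_in_virtual = np.linalg.inv(T) @ np.append(centroid, 1.0)
```

In the virtual camera frame the centroid then lies on the optical axis at the chosen distance, which is what fixes the scale of the rectified image.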
III-E Calculate Homography Matrix for Virtual Viewpoint (2D)
Once the virtual viewpoint is calculated, it is easy to calculate an equivalent homography matrix that denotes the transformation between and
. Non-robust DLT (Direct Linear Transform) is used to compute the homography.
For homography calculation, four points are sampled from the 3D plane, denoted as . On both camera viewpoints (the original viewpoint and the new one), we project the 3D points to 2D images, denoted as and , as shown in Figure 2.
As the plane is an ideal infinite plane, and since the 2D image points are all obtained from re-projection, non-robust DLT (Direct Linear Transform)  is used to compute the homography.
where is homogenous coordinates.
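The non-robust DLT step can be sketched as follows. Four plane points are projected into both views, and the homography is recovered as the null vector of the stacked linear constraints. The intrinsics `K`, the relative pose `R`, `t`, and the sampled points are hypothetical example values, and the helper names are not taken from the released code.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H with dst ~ H @ src from N >= 4 point pairs (non-robust DLT)."""
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        # Each correspondence contributes two linear equations in the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)               # right singular vector of the smallest singular value
    return H / H[2, 2]                     # normalisation assumes H[2, 2] != 0

def project(K, R, t, points_3d):
    """Pinhole projection of Nx3 points given X_cam = R X + t."""
    cam = np.asarray(points_3d, float) @ R.T + t
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

def map_points(H, pts):
    """Apply a homography to Nx2 pixel coordinates."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Assumed example setup: plane z = 2 m, second view rotated by 20 degrees.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(20.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-0.5, 0.0, 0.1])
plane_pts = np.array([[-0.2, -0.2, 2.0], [0.2, -0.2, 2.0],
                      [0.2, 0.2, 2.0], [-0.2, 0.2, 2.0]])
src = project(K, np.eye(3), np.zeros(3), plane_pts)
dst = project(K, R, t, plane_pts)
H = homography_dlt(src, dst)
```

Because the correspondences come from exact reprojection of an ideal plane, no outlier rejection or coordinate normalization is included in this sketch.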
III-F Refine Homography Matrix
At this point, we have a homography matrix which transforms side-view images to the canonical view. However, there is still one minor issue: some pixels go out of view, due to the FoV (field of view) of the camera. As shown in Figure 3, in some cases only part of the pixels are included in the resulting image.
III-F1 Bounding box around the plane
In our case, as the plane is a finite plane defined by boundary points, the first step is to calculate the tight bounding box around the reprojected plane in , as is shown in blue in Figure 3.
Where denotes points around plane boundary, denotes projection matrix.
From , we can easily calculate a tight bounding box by taking the minimum and maximum. Specifically, we calculate the top-left corner , the bounding box height and the width .
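The bounding box computation can be sketched as below. The sketch assumes the boundary points are given in the frame of the virtual camera and that a 3x4 projection matrix is available. The function name and the example values are hypothetical.

```python
import numpy as np

def projected_bounding_box(P, boundary_points):
    """Project Nx3 boundary points with a 3x4 matrix P; return (u_min, v_min, width, height)."""
    X = np.asarray(boundary_points, dtype=float)
    homog = np.hstack([X, np.ones((len(X), 1))])
    pix = homog @ np.asarray(P, dtype=float).T
    uv = pix[:, :2] / pix[:, 2:3]
    # The tight axis-aligned box follows from the per-axis minimum and maximum.
    (u_min, v_min), (u_max, v_max) = uv.min(axis=0), uv.max(axis=0)
    return u_min, v_min, u_max - u_min, v_max - v_min

# Assumed example: plane corners 1.2 m in front of the virtual camera.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
P = np.hstack([K, np.zeros((3, 1))])       # camera at the origin, no rotation
corners = [[-0.6, -0.4, 1.2], [0.6, -0.4, 1.2], [0.6, 0.4, 1.2], [-0.6, 0.4, 1.2]]
box = projected_bounding_box(P, corners)
```

The top-left corner may have negative coordinates or exceed the image size, which is exactly the out-of-view case that the refinement step has to handle.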
III-F2 Refinement Algorithm
The next step is to calculate the final rectification matrices which produce the final resulting images. As is shown in Equation 7, we slide the camera window through the reprojected plane on . Every two adjacent sliding images overlap by 50%, either horizontally or vertically, so that hazmat signs around the border of one sliding image appear near the center of another. See Figure 3 for an example.
Denote as the top-left corner of the plane bounding box on , as the bounding box height and width. as height and width of the resulting image.
where i = 1 to 2 * , to 2 *
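A minimal sketch of the sliding refinement follows, under the assumption that each refined matrix is a pixel translation composed with the initial homography and that the window advances by half an image size per step. The window count formula and the function name are our illustrative choices and may differ from the exact index ranges of the equation above.

```python
import math
import numpy as np

def sliding_homographies(H, top_left, box_size, image_size):
    """Return refined homographies T_ij @ H covering the plane box with 50% overlap."""
    u0, v0 = top_left
    box_w, box_h = box_size
    img_w, img_h = image_size
    # Windows advance by half an image, so about 2 * box / image steps cover the box.
    n_cols = max(1, math.ceil(2 * box_w / img_w) - 1)
    n_rows = max(1, math.ceil(2 * box_h / img_h) - 1)
    refined = []
    for j in range(n_rows):
        for i in range(n_cols):
            # Translation that moves the window's top-left corner to pixel (0, 0).
            T = np.array([[1.0, 0.0, -(u0 + i * img_w / 2)],
                          [0.0, 1.0, -(v0 + j * img_h / 2)],
                          [0.0, 0.0, 1.0]])
            refined.append(T @ H)
    return refined

# Assumed example: a plane box twice as wide as the 640x480 output image.
windows = sliding_homographies(np.eye(3), top_left=(-100.0, 50.0),
                               box_size=(1280.0, 480.0), image_size=(640, 480))
```

In this example three windows are produced, with horizontal offsets of 0, 320 and 640 pixels relative to the left edge of the bounding box.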
The final step is to apply the refined homography matrix to obtain the final rectified images. The warping is done with bilinear interpolation. By doing that, object patches from non-canonical images are transformed into the canonical view. In addition, all rectified images are at the same distance, as we manually set it to a fixed value (1.2 meters in this case). This avoids the multi-scale problem, which is challenging for CNNs.
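For illustration, warping with bilinear interpolation can be sketched in plain NumPy as below. This is a simplified single-channel version with a hypothetical function name. Pixels whose source location falls outside the input, including the last row and column, are simply set to zero here.

```python
import numpy as np

def warp_bilinear(image, H, out_shape):
    """Warp a grayscale image with homography H (source -> output) via inverse mapping."""
    out_h, out_w = out_shape
    vs, us = np.mgrid[0:out_h, 0:out_w]
    dst = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)])
    # For every output pixel, find where it comes from in the source image.
    src = np.linalg.inv(H) @ dst
    x, y = src[0] / src[2], src[1] / src[2]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    valid = (x0 >= 0) & (y0 >= 0) & (x0 < image.shape[1] - 1) & (y0 < image.shape[0] - 1)
    x0c, y0c = np.where(valid, x0, 0), np.where(valid, y0, 0)   # safe indices for invalid pixels
    fx, fy = x - x0, y - y0
    img = image.astype(float)
    # Blend the four neighbouring source pixels.
    top = img[y0c, x0c] * (1 - fx) + img[y0c, x0c + 1] * fx
    bottom = img[y0c + 1, x0c] * (1 - fx) + img[y0c + 1, x0c + 1] * fx
    out = np.where(valid, top * (1 - fy) + bottom * fy, 0.0)
    return out.reshape(out_h, out_w)

# Toy example: shift a 4x4 ramp image by half a pixel to the right.
image = np.arange(16.0).reshape(4, 4)
H_half = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
warped = warp_bilinear(image, H_half, (4, 4))
```

In practice an optimized library routine would normally be used instead; the sketch only shows what the interpolation does.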
Currently, there are only very few publicly available hazmat sign detection datasets.  published their high-resolution RGB hazmat detection dataset. However, the dataset was collected with a hand-held single-lens camera, containing only RGB images. On the contrary, our dataset contains images from an affordable RGB-D sensor (Intel Realsense RGB-D Camera). With the additional depth information, we are able to provide geometric information.
We provide a high-resolution RGB-D hazmat dataset with labels in this paper, which can be found at https://robotics.shanghaitech.edu.cn/datasets/MARS-Hazmat-RGBD. It contains both RGB images and depth images with a resolution of . Ground truth label information of the RGB images is also provided. of the RGB and depth images contain only one of the types of hazmat label. Each of these images contains two backgrounds (plain and plywood) and five positions (top left, top right, center, bottom left, bottom right). For the rest of the images, each image contains types of hazmat labels. Nine different angles (, , , , ) with three distances (, , ) are included in these images.
V Experiment and Results
V-A Evaluation Metric
In our detection framework, we train the CNN detector on a training set containing canonical-view images only. This effectively reduces the size of the training set, thus lowering the difficulty of collecting a good training set for mobile robotics applications. In the testing stage, each image first goes through the rectification system to obtain rectified images in the canonical view, as shown in Figure 4. Then the CNN detector performs the actual detection. Finally, all detection results are warped back to the original images to produce the final output of the system.
We use the MSCOCO object detection evaluation metrics to evaluate the detection performance. The two main metrics are IoU (Intersection over Union) and mAP (mean Average Precision). We propose an extended NMS method to select the final bounding boxes. Since each image is split into a series of images, we can convert the bounding boxes back to the original image by utilizing the homography matrices calculated before. For each split image, we utilize its homography matrix to recover the final bounding boxes. The implementation of our method will be made available in the final version.
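The two ingredients mentioned above, mapping a detection back to the original image and measuring box overlap, can be sketched as follows. The sketch assumes the homography maps original pixels to rectified pixels, so its inverse is used for the way back. It does not reproduce the extended NMS itself, and the function names and example values are hypothetical.

```python
import numpy as np

def box_to_original(H, box):
    """Map a box (x1, y1, x2, y2) from a rectified image back through inv(H).

    Returns the axis-aligned box enclosing the four back-projected corners.
    """
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1.0], [x2, y1, 1.0], [x2, y2, 1.0], [x1, y2, 1.0]])
    mapped = corners @ np.linalg.inv(H).T
    uv = mapped[:, :2] / mapped[:, 2:3]
    return (*uv.min(axis=0), *uv.max(axis=0))

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Toy example: the rectified image is a scaled and shifted copy of the original.
H = np.array([[2.0, 0.0, 10.0], [0.0, 2.0, 20.0], [0.0, 0.0, 1.0]])
original_box = box_to_original(H, (30.0, 40.0, 50.0, 80.0))
overlap = iou(original_box, (10.0, 10.0, 20.0, 20.0))
```

Once all boxes from the sliding images live in the original image, duplicates of the same sign from overlapping windows can be compared with the IoU measure.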
V-B Experiments Setup
Most experiments are performed on our own self-collected dataset, because there is no publicly available RGB-D hazmat detection dataset. To show that our approach can effectively reduce the difficulty for the CNN detector by dealing with perspective distortion in advance, only images from canonical views are used for training the network. The test sets include images from various angles (-75 to 75).
We use yolov3-tiny as the CNN detector. We choose yolov3-tiny because 1) it is a typical example of an off-the-shelf CNN-based detection network and 2) it is small and fast enough for real-world deployment on mobile robots. Training takes about one and a half hours on our computer (Intel Core i7-6700 CPU, GeForce GTX 1080, GiB Memory). We trained our model from scratch with a batch size of , momentum , subdivisions , burnin , maxbatches , learningrate ; the learning rate is multiplied by 0.1 when the number of batches reaches 3000, 4000, 5000, 6000, and 7000.
V-C Rectification Parameters
For the plane segmentation we assume that 90% of the points of each frame lie on planes. As a result, RANSAC keeps extracting planes until less than 10% of the total points remain. We set the maximum number of planes per image to 1, because the test set is known to contain only one plane per test image. The distance from the virtual viewpoints to the plane is set to 1.2m. Because the training dataset is collected at distances between 1m and 1.5m, it is reasonable to assume that the CNN detector performs better at distances in this range than at others. The training dataset contains two main parts, images that contain one hazmat sign and images with 13 hazmat signs, in order to prevent overfitting to the background. The homography matrix can be calculated in closed form by using planar homography, but we currently use DLT to compute it. In the future we plan to move to the closed-form solution.
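For reference, the standard plane-induced homography gives one such closed form. The sketch below assumes the plane is written as n·X = d in the original camera frame and that points transform to the virtual camera as X' = R X + t, so that H = K (R + t nᵀ / d) K⁻¹. This sign convention and all example values are assumptions for illustration; other conventions flip the sign of the second term.

```python
import numpy as np

def planar_homography(K, R, t, normal, d):
    """Homography from camera-1 pixels to camera-2 pixels induced by the plane n.X = d."""
    n = np.asarray(normal, dtype=float).reshape(1, 3)
    t = np.asarray(t, dtype=float).reshape(3, 1)
    # On the plane n.X / d = 1, hence X' = R X + t = (R + t n^T / d) X.
    H = K @ (R + t @ n / d) @ np.linalg.inv(K)
    return H / H[2, 2]                     # normalisation assumes H[2, 2] != 0

def project(K, R, t, points_3d):
    """Pinhole projection of Nx3 points given X_cam = R X + t."""
    cam = np.asarray(points_3d, float) @ R.T + t
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

def map_points(H, pts):
    """Apply a homography to Nx2 pixel coordinates."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Assumed example: plane z = 2 m, second view rotated by 20 degrees and shifted.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(20.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-0.5, 0.0, 0.1])
H_closed = planar_homography(K, R, t, normal=[0.0, 0.0, 1.0], d=2.0)
```

Compared with DLT, this form needs no point sampling or SVD, only the plane parameters and the relative pose.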
We compare the detection performance with and without geometry rectification. The results are shown in Table I. Baseline means without geometry rectification. From Table I, we can see that, after geometry rectification, the performance is much better than before. increases nearly after geometry rectification while increases .
Table II shows the performance of our approach at different angles. The test dataset contains nine angles (). From Table II, we can see even on very large angles such as our approach can still detect some hazmat signs. In our approach the is while without geometry rectification is . On other angles such as , according to the results shown in Table II, our results are better than the previous results. increases at angle , which is a huge improvement.
Besides perspective distortion, our proposed detection approach helps the CNN detector module to avoid dealing with multi-scale detection problems by explicitly rectifying the target patch to an ideal scale, as the distance from the plane to the virtual viewpoints is fixed. Previous research has shown that the scale problem is challenging for CNNs. As a result, in Table II, we have better performance even at 0, where there is no perspective distortion. For 0 images, the mAP improves from 0.263 to 0.375.
The top images in Figure 6 and Figure 7 show examples of the detection results of the baseline algorithm without geometry rectification, while the bottom images show the detection results of our approach on the same images. In both figures our approach performs much better than the baseline approach. Our approach can detect more hazmat signs at very large angles. The bounding boxes produced by our approach are also more accurate.
Running single-threaded on a CPU, our algorithm needs about 5.7 seconds per image, a value that should be improved, since it is too slow to run live on a robot.
In this work, we showed a simple but effective way to combine geometric information with an off-the-shelf CNN-based detector. By doing image rectification explicitly in advance of the CNN detector, we take full advantage of the available geometric information from RGB-D images to 1) reduce the training time; 2) reduce the size of the training set required; 3) improve the performance of the overall detection system; and 4) produce more accurate detection results (tighter bounding boxes). This approach is also highly robust to noisy depth input (noisy point clouds), as the depth is only used to estimate the plane parameters, where RANSAC is effective even with noisy input.
For mobile robotics applications, especially hazmat sign detection in rescue robotics, our approach lowers the work required to create a good training dataset, because fewer training images are needed. In the interest of reproducible science we provide the dataset used in the paper as well as our code to the public.
- Surf: speeded up robust features. European conference on computer vision, pp. 404–417.
- (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- (2013) Indoor semantic segmentation using depth information. In International Conference on Learning Representations (ICLR2013), April 2013.
- Depth-assisted rectification for real-time object detection and pose estimation. Machine Vision and Applications 27 (2), pp. 193–219.
- (2019) Hazmat label recognition and localization for rescue robots in disaster scenarios. Electronic Imaging 2019 (7), pp. 463–1.
- (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
- Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 580–587.
- (2008) Danger Sign Detection Using Color Histograms and SURF Matching. In Proceedings of the 2008 IEEE International Workshop on Safety, Security and Rescue Robotics, pp. 13–18.
- (2014) Learning rich features from RGB-D images for object detection and segmentation. In ECCV 2014, Vol. 8695 LNCS, pp. 345–360.
- (2014) Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pp. 345–360.
- (2016) Gvnn: neural network library for geometric computer vision. In European Conference on Computer Vision, pp. 67–82.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-Octob, pp. 2980–2988.
- (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.
- (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110.
- (2018) Seeing Signs of Danger: Attention-Accelerated Hazmat Label Detection. In 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics, SSRR 2018.
- 3D bounding box estimation using deep learning and geometry. pp. 5632–5640.
- (2013) Hazardous material sign detection and recognition. In 2013 IEEE International Conference on Image Processing, ICIP 2013 - Proceedings, pp. 2640–2644.
- (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- (2011) ORB: an efficient alternative to sift or surf. In ICCV, Vol. 11, pp. 2.
- (2011) The robocuprescue robot league: guiding robots towards fieldable capabilities. In Advanced Robotics and its Social Impacts, pp. 31–34.
- (2016) 16 years of robocup rescue. KI-Künstliche Intelligenz 30 (3-4), pp. 267–277.
- (2018) An analysis of scale invariance in object detection snip. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3578–3587.
- (1974) Three-dimensional data input by tablet. Proceedings of the IEEE 62 (4), pp. 453–461.
- (2016) Differential geometry boosts convolutional neural networks for object detection. pp. 1006–1013.
- (2013) Mobile-based hazmat sign detection and recognition. In 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2013 - Proceedings, pp. 735–738.
- (2019) MonoCap: monocular human motion capture using a cnn coupled with a geometric prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 901–914.
- (2016) Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118.
- (2018) RGB-D Object Recognition Using Deep Convolutional Neural Networks. Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 2018-Janua, pp. 887–894.