Improving CNN-based Planar Object Detection with Geometric Prior Knowledge

09/23/2019, by Jianxiong Cai, et al.

In this paper, we focus on the question: how might mobile robots take advantage of affordable RGB-D sensors for object detection? Although current CNN-based object detectors have achieved impressive results, there are three main drawbacks for practical usage on mobile robots: 1) It is hard and time-consuming to collect and annotate large-scale training sets. 2) It usually requires a long training time. 3) CNN-based object detection shows significant weakness in predicting location. We propose a novel approach for the detection of planar objects, which rectifies images with geometric information to compensate for the perspective distortion before feeding them to the CNN detector module, typically an off-the-shelf CNN-based detector such as YOLO or Mask R-CNN. By dealing with the perspective distortion in advance, we eliminate the need for the CNN detector to learn it. Experiments show that this approach significantly boosts detection performance. Besides, it effectively reduces the number of training images required. In addition to the novel detection framework, we also release an RGB-D dataset for hazmat sign detection. To the best of our knowledge, this is the first publicly available hazmat sign detection dataset recorded with RGB-D sensors.


I Introduction

Affordable RGB-D sensors (such as the Microsoft Kinect and Intel Realsense) are becoming more and more common in modern robotics, due to their low price and portable size. With the increasing attention on semantic understanding for mobile autonomous robots, we ask ourselves the question: how might mobile robots take advantage of the depth information (RGB-D) for object detection in real-world robotics applications? In this work, we take hazmat sign detection (see Fig. 1) as an example task for using geometric image rectification to aid a CNN-based object detector.

I-A Background

For modern robotics, image-based object detection methods can be categorized into two main approaches: feature-based matching [14] [1] [19] and convolutional neural networks (CNN) [18] [12] [28]. Since the initial proposal of RCNN [7], state-of-the-art CNN-based approaches [18] [12] have achieved impressive results on large-scale standardized datasets [2] [13].

A common approach for current state-of-the-art object detection networks is to feed images into the CNN detection network directly, as an end-to-end solution. This eliminates the need for feature engineering and enables CNN networks to learn features automatically. With sufficient training images, state-of-the-art detection networks can thus distinguish different signs with good performance.

However, feeding images directly to the CNN detector has two main drawbacks for mobile robotics applications. First, the CNN detector needs a large number of images for training, but creating such a training set is considerably challenging in most cases. In practice, collecting the training set turns out to be very time-consuming. As we expect the CNN detector to automatically learn how to deal with geometric information such as perspective distortion and multiple scales, the network needs training data from different points of view. Given the large amount of human resources and time needed for dataset collection, this is especially difficult for real-world mobile robotics.

Second, even with a good training set, there is no guarantee that the detector can correctly learn perspective distortion. One key issue is that an end-to-end solution requires deeper and more complicated neural networks, because the network needs more layers for viewpoint angle estimation and for handling perspective distortion. In some cases, this might lead to an over-fitting situation, where the network fails to learn useful features.

Moreover, previous research has shown that it is hard for a CNN detection framework to learn from raw RGB-D images automatically [9]. Perspective distortion is introduced when the image is not captured in a canonical view of the object. Intuitively, parallel 3D lines are no longer parallel in the 2D image. This results in differences between images taken from different viewing angles. For planar objects, a homography matrix can be used to transform between those images.

Fig. 1: Hazmat Sign Reference Images. (GHS hazard pictograms)

In this work, we propose to utilize the depth information for rectifying the RGB information with a homography matrix. In short, a homography matrix is calculated from depth information to transform the input RGB image to the canonical view. The CNN detector then takes the rectified RGB images as input to perform detection. The final detection results on the rectified RGB images are transformed back to the original image in the end.

The proposed method provides two key advantages. First, image rectification simplifies the problem for the CNN detector. Typical CNN detectors suffer from multi-scale input images and show noticeable weakness in bounding box regression accuracy. Image rectification avoids the multi-scale problem to some extent, and since all images have been rectified to the canonical view, the bounding box regression problem also becomes easier for the CNN. Secondly, because the CNN detector only takes canonical-view images as input, the proposed method requires a smaller training set, which reduces the workload for practical deployment on mobile robots.

What should be learned? Perspective distortion is something for which exact models are known, so we should not have to learn it. This combination of the geometric and the semantic model is motivated by having a smaller, more efficient network that focuses on the part of the problem for which it is best suited and most needed.

I-B Contribution

Our contributions are:

  • We propose a feasible way of combining geometric information with CNN detectors to improve detection performance.

  • Our approach shows good tolerance towards noise in depth images, with homography based image rectification.

  • We successfully reduce the number of images needed for training the CNN detector, because perspective distortion has been dealt with in advance. It is especially meaningful for practical usage in mobile robotics when facing a new environment or target object.

  • We release a new hazmat sign detection dataset. To the best of our knowledge, it is the first RGB-D hazmat sign detection dataset.

I-C Paper Organization

Section II discusses related work, while Section III introduces the new detection framework with the homography-based rectification. Section IV presents the new RGB-D hazmat sign detection dataset. Experimental results are shown in Section V and conclusions are drawn in Section VI.

II Related Work

II-A Hazmat Sign Detection

Hazmat sign detection has been studied by the robotics community for a long time. It is still challenging because of detection speed, illumination changes, background similarity, size variety, and inter-class variety. Due to the importance of hazmat sign detection for rescue robots, the RoboCup Rescue League competition includes the task of detecting hazmat signs in the system inspection stage and in exploration tasks [21] [20].

Existing sign detection algorithms can be categorized into 4 types: 1) color-based, 2) shape-based, 3) saliency-based, and 4) deep neural network-based. Typical color-based methods use color as strong prior knowledge and as key features, e.g., the color histogram [8]. However, color-based methods can be easily affected by illumination changes. Shape-based methods usually use feature detection and matching, like SIFT or SURF [8] [15]. Those methods, however, can be affected by perspective distortion from different viewpoints. Saliency-based methods use saliency detection to speed up the detection process and obtain the region of interest where the hazmat sign might be located [25] [17]. Recent research has focused on CNN methods, such as [5] [27].

II-B Image Rectification

There are two main methods for image rectification: 3D reprojection or 2D homography matrix transformation. 3D reprojection first generates a 3D point cloud from the input (RGB-D) and then reprojects the points onto the new image plane. In the 2D homography matrix transformation approach, we directly apply a 3-by-3 matrix H to the input RGB image:

x' ≃ H x

where x and x' are corresponding pixels of the input and rectified images in homogeneous coordinates.

In the proposed approach, rectification via the homography matrix is used, due to its robustness. Low-cost RGB-D cameras (like the Intel Realsense) tend to have quite noisy depth maps. Through our experiments, we found that the point cloud recovered from the depth map is too noisy for the 3D reprojection method. As it does not operate in 3D space, the homography method only requires the point cloud to be accurate enough for plane extraction, a requirement that is easily met via RANSAC [6].
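As a concrete illustration, the following minimal Python/OpenCV sketch (our own illustration, not the authors' released code) applies a given 3-by-3 homography H to an RGB image; Section III describes how H is obtained from the plane parameters.

```python
import cv2
import numpy as np

def rectify_with_homography(rgb: np.ndarray, H: np.ndarray,
                            out_size: tuple) -> np.ndarray:
    """Warp an RGB image with a 3x3 homography (x' ~ H x).

    Only 2D pixel operations are involved, so noisy depth influences
    the result solely through the plane fit used to build H, not
    through a per-pixel reprojection.
    """
    return cv2.warpPerspective(rgb, H, out_size, flags=cv2.INTER_LINEAR)

# Example: the identity homography leaves the image unchanged.
# rectified = rectify_with_homography(img, np.eye(3), (640, 480))
```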

II-C Combining Geometry and CNN

With the impressive performance of CNN networks, recent works mainly focus on end-to-end CNN solutions [11] [28]. Unfortunately, it remains unclear how to take good advantage of the additional depth information from RGB-D. Feeding raw RGB-D images into a CNN will not help much, as the depth information with its high variance makes the learning process of the CNN detector even harder [3].

Some work instead utilizes depth information geometrically. [26] combines a CNN with geometry priors by using the CNN for image-based 2D part location estimates and assuming a geometric model for 3D pose reconstruction. [16] utilizes the geometric constraints on translation imposed by the 2D bounding box to recover a stable and accurate 3D object pose. [24] uses geometric shape features to boost the performance of neural networks. [10] uses depth for proposal generation with contour detection; it also encodes the depth information before forwarding it to the CNN detector. One work close to this paper is [4]. They first perform contour detection on RGB images and then use depth to rectify the proposal to the canonical view, so that it becomes easier for feature matching. However, their experiments were performed on conventional feature descriptors, such as SIFT [14] and ORB [19]. Different from [4], our system does not require contour detection. This extends the usage to complicated planar objects, where contour detection might fail. To the best of our knowledge, there is no concrete research on using image rectification for CNN-based object detectors, especially with the consideration of mobile robotics.

III Method

Fig. 2: The proposed approach rectifies the image from the original viewpoint to a virtual viewpoint (canonical view).

III-A Framework Overview

The detection framework takes 3 inputs: 1) RGB images, 2) a point cloud, and 3) the camera intrinsic matrix. The Intel Realsense driver provides the point cloud as output, so we use it directly as input; in the implementation this could easily be substituted with depth images.

III-B Rectification Pipeline

The goal of the algorithm is to rectify the input images based on planes detected in the 3D data, as shown in Figure 2. The overall pipeline contains the following stages and is shown in Figures 4 and 5; a code-level sketch of these stages follows the list. The input image is rectified to multiple parallel viewpoints, so the rectification module needs to compute a set of homography matrices as its output.

  1. Estimate plane segmentation from 3D point cloud

  2. Calculate virtual canonical viewpoint

  3. Compute the initial homography matrix for image rectification (rectify to the virtual canonical viewpoint)

  4. Refine the homography matrix by applying translation matrix (sliding through the image)

  5. Apply rectification matrices to get rectified images
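A minimal Python sketch of these five stages is given below. It only illustrates the control flow: the helper names (fit_plane, canonical_viewpoint, initial_homography, plane_bbox_in_canonical_view, project_to_view, refine_homographies) are hypothetical and are sketched in the following subsections.

```python
import cv2

def rectification_module(rgb, cloud, K, dist=1.2, out_size=(416, 416)):
    """Illustrative skeleton of the rectification module (stages 1-5).

    All helper functions are hypothetical stand-ins; sketches of them
    appear in the following subsections.
    """
    # 1. Plane segmentation from the 3D point cloud (RANSAC).
    n, d, centroid, boundary = fit_plane(cloud)
    # 2. Virtual canonical viewpoint at a fixed distance from the plane.
    T_vc = canonical_viewpoint(n, centroid, dist)
    # 3. Initial homography from the original to the canonical view (DLT).
    H = initial_homography(boundary[:4], K, T_vc)
    # 4. Refine by sliding a window over the plane's bounding box in V_c.
    top_left, Hb, Wb = plane_bbox_in_canonical_view(
        project_to_view(boundary, K, T_vc))
    homographies = refine_homographies(H, top_left, Hb, Wb, out_size)
    # 5. Warp the RGB image once per refined homography.
    return [cv2.warpPerspective(rgb, Hi, out_size) for Hi in homographies]
```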

III-C Plane Segmentation

III-C1 Plane Estimation

RANSAC [6] is used to estimate plane parameters from the input point cloud. The use of RANSAC gives good robustness for plane estimation on noisy inputs, which in turn improves the robustness of the whole detection framework. In this step, we obtain the major planes in the scene. For each plane, the following are calculated (with respect to the original viewpoint V_o):

  • the plane parameters (n, d), where n is the normal of the plane, with ||n|| = 1, and d is the distance from the viewpoint to the plane

  • the centroid point C

  • the boundary points B_i

III-C2 Unique Normal

Each plane has two normals, in opposite directions. In order to calculate the new viewpoint, which has to be correctly aligned with the original viewpoint, it is necessary to use one unique normal for each plane. The unique normal is defined as the one not facing toward the origin.

n ← −n  if  n · C < 0        (1)

where C is the plane centroid, so that the retained normal does not face toward the origin.
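A minimal sketch of this step, using Open3D's RANSAC plane fitter as a stand-in for the RANSAC plane estimation described above (function and parameter names are our own):

```python
import numpy as np
import open3d as o3d

def fit_plane(points: np.ndarray, dist_thresh: float = 0.01):
    """Fit the dominant plane and return (n, d, centroid, inlier points).

    `points` is an (N, 3) array in the original camera frame V_o, with
    the camera at the origin.
    """
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    (a, b, c, d), inlier_idx = pcd.segment_plane(
        distance_threshold=dist_thresh, ransac_n=3, num_iterations=1000)

    n = np.array([a, b, c], dtype=float)
    scale = np.linalg.norm(n)
    n, dist = n / scale, abs(d) / scale   # unit normal, origin-to-plane distance

    inliers = points[inlier_idx]
    centroid = inliers.mean(axis=0)

    # Unique normal (Eq. 1): keep the normal that does not face the origin.
    if np.dot(n, centroid) < 0:
        n = -n
    return n, dist, centroid, inliers
```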

III-D Calculating the Virtual Canonical Viewpoint

In order to transform images to the canonical view, a virtual canonical viewpoint V_c needs to be calculated from the plane parameters. We set the virtual viewpoint at a fixed distance D away from the plane centroid. In our experiments, we set this to 1.2 meters to align with the training set, as the training set was collected at distances of 0.8 and 1.2 meters.

The new viewpoint is calculated as follows: the virtual camera is placed at the distance D from the plane centroid C along the unique plane normal n, with its optical axis perpendicular to the plane, so that it faces the plane frontally. The result is T_{V_c}, a 4 × 4 matrix that denotes the position and orientation of V_c in V_o; n is the unit normal of the plane and d is the distance from the viewpoint to the plane.
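The original equations are not reproduced here, so the sketch below shows one plausible construction under the constraints stated above (virtual camera on the camera's side of the plane, D metres from the centroid, optical axis perpendicular to the plane); the choice of the up vector is our own assumption.

```python
import numpy as np

def canonical_viewpoint(n, centroid, D=1.2):
    """Build a plausible 4x4 pose T of the virtual viewpoint V_c in V_o.

    `n` is the unique plane normal (pointing away from the camera, Eq. 1),
    `centroid` the plane centroid C, and `D` the fixed plane distance.
    """
    z = n / np.linalg.norm(n)          # optical axis, perpendicular to the plane
    t = centroid - D * z               # camera centre, D metres in front of C

    # Orthonormal basis around z; the up vector is an arbitrary choice.
    up = np.array([0.0, -1.0, 0.0])
    if abs(np.dot(up, z)) > 0.99:
        up = np.array([1.0, 0.0, 0.0])
    x = np.cross(up, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)

    T = np.eye(4)
    T[:3, :3] = np.column_stack([x, y, z])
    T[:3, 3] = t
    return T
```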

Fig. 3: The middle image shows the partial loss of the plane after applying the initial homography; the result image size is kept the same as the input. Green lines denote the border of a plane, solid rectangles are image patches, and the blue bounding box is the bounding box of the plane transformed to V_c. The last image visualizes the system output after applying the refined transformation matrices; each resulting rectified image has 50% overlap with its neighbors. The result images are the rectangular boxes in black, orange and blue (img1, img2 and img3).
Fig. 4: Proposed detection framework (testing pipeline)
Fig. 5: Overall Geometric Image Rectification Pipeline (Rectification Module)

III-E Calculating the Homography Matrix for the Virtual Viewpoint (2D)

Once the virtual viewpoint is calculated, it is easy to compute an equivalent homography matrix H that denotes the transformation between V_o and V_c. For the homography calculation, four points are sampled from the 3D plane. From both camera viewpoints (the original one and the new one), we project these 3D points onto the 2D images, obtaining the corresponding pixel coordinates x_o and x_c, as shown in Figure 2. As the plane is an ideal infinite plane, and since the 2D image points are all obtained from re-projection, the non-robust DLT (Direct Linear Transform) [23] is sufficient to compute the homography:

x_c ≃ H x_o        (5)

where x_o and x_c are expressed in homogeneous coordinates.
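A minimal sketch of this step (our own; cv2.getPerspectiveTransform solves the four-point DLT exactly, and the point sampling and frame conventions are assumptions based on the description above):

```python
import cv2
import numpy as np

def initial_homography(plane_pts_3d, K, T_vc):
    """Homography mapping original-view pixels to canonical-view pixels.

    `plane_pts_3d`: four 3D points sampled on the plane, in the original
    camera frame V_o. `K`: 3x3 intrinsic matrix. `T_vc`: 4x4 pose of the
    virtual viewpoint V_c in V_o (Sec. III-D).
    """
    P = np.asarray(plane_pts_3d, dtype=np.float64)          # (4, 3) in V_o

    # Project into the original view: x_o ~ K [I | 0] P
    x_o = (K @ P.T).T
    x_o = x_o[:, :2] / x_o[:, 2:3]

    # Express the points in V_c and project: x_c ~ K T_vc^{-1} P
    P_h = np.hstack([P, np.ones((4, 1))])
    P_c = (np.linalg.inv(T_vc) @ P_h.T).T[:, :3]
    x_c = (K @ P_c.T).T
    x_c = x_c[:, :2] / x_c[:, 2:3]

    # Exact four-point correspondences: the plain (non-robust) DLT suffices.
    return cv2.getPerspectiveTransform(x_o.astype(np.float32),
                                       x_c.astype(np.float32))
```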

III-F Refining the Homography Matrix

At this point, we have a homography matrix that transforms side-view images to the canonical view. However, there is still one minor issue: some pixels go out of view, due to the FoV (field of view) of the camera. As shown in Figure 3, in some cases only part of the plane's pixels are included in the resulting image.

III-F1 Bounding Box Around the Plane

In our case, as the plane is a finite plane defined by its boundary points, the first step is to calculate the tight bounding box around the reprojected plane in V_c, as shown in blue in Figure 3.

b_i = P_{V_c} B_i        (6)

where B_i denotes the points around the plane boundary and P_{V_c} denotes the projection matrix of the virtual viewpoint.

From the projected boundary points b_i, we can easily calculate a tight bounding box by taking their minimum and maximum. Specifically, we calculate the top-left corner (x_tl, y_tl), the bounding box height H_b and the width W_b.
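A short sketch of this step (names are ours):

```python
import numpy as np

def plane_bbox_in_canonical_view(boundary_px: np.ndarray):
    """Tight axis-aligned bounding box of the reprojected plane boundary.

    `boundary_px` is an (N, 2) array of pixel coordinates of the plane
    boundary points projected into V_c (Eq. 6). Returns the top-left
    corner (x_tl, y_tl), the height H_b and the width W_b.
    """
    x_min, y_min = boundary_px.min(axis=0)
    x_max, y_max = boundary_px.max(axis=0)
    return (x_min, y_min), y_max - y_min, x_max - x_min
```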

III-F2 Refinement Algorithm

The next step is to calculate the final rectification matrices which produce the final resulting images. As shown in Equation 7, we slide the camera window across the reprojected plane on V_c. Every two neighboring sliding windows have 50% overlap, either horizontally or vertically, so that a hazmat sign near the border of one sliding image will appear near the center of a neighboring one. See Figure 3 for an example.

Denote by (x_tl, y_tl) the top-left corner of the plane bounding box on V_c, by H_b and W_b the bounding box height and width, and by h and w the height and width of the resulting image.

H_ij = T_ij H,  with T_ij the translation by (−(x_tl + (i−1)·w/2), −(y_tl + (j−1)·h/2))        (7)

where i = 1 to 2·W_b / w and j = 1 to 2·H_b / h.

The final step is to apply the refined homography matrices to obtain the final rectified images. The warping is done with bilinear interpolation. By doing that, object patches from non-canonical images are transformed into the canonical view. Besides, all rectified images are at the same distance, as we manually set it to a fixed value (1.2 meters in this case). This avoids the multi-scale problem, which is challenging for CNN networks.
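A sketch of the refinement and final warping (our reading of the 50%-overlap scheme; the exact indexing in Eq. 7 may differ):

```python
import cv2
import numpy as np

def refine_homographies(H, top_left, Hb, Wb, out_size):
    """Prepend sliding-window translations to the initial homography H.

    Windows of the output size are stepped by half their width/height
    over the plane bounding box in V_c, giving 50% overlap.
    """
    w, h = out_size
    x_tl, y_tl = top_left
    result = []
    for j in range(max(1, int(np.ceil(2 * Hb / h)))):
        for i in range(max(1, int(np.ceil(2 * Wb / w)))):
            T = np.array([[1, 0, -(x_tl + i * w / 2.0)],
                          [0, 1, -(y_tl + j * h / 2.0)],
                          [0, 0, 1]], dtype=np.float64)
            result.append(T @ H)
    return result

# Final warping, one rectified image per window, bilinear interpolation:
# images = [cv2.warpPerspective(rgb, Hi, out_size, flags=cv2.INTER_LINEAR)
#           for Hi in refine_homographies(H, (x_tl, y_tl), Hb, Wb, out_size)]
```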

IV Dataset

Currently, there are only very few publicly available hazmat sign detection datasets. [15] published a high-resolution RGB hazmat detection dataset. However, that dataset was collected with a hand-held single-lens camera and contains only RGB images. In contrast, our dataset contains images from an affordable RGB-D sensor (Intel Realsense RGB-D camera). With the additional depth information, we are able to provide geometric information.

We provide a high-resolution RGB-D hazmat dataset with labels in this paper, which can be found at https://robotics.shanghaitech.edu.cn/datasets/MARS-Hazmat-RGBD. It contains both RGB images and depth images, together with ground-truth labels for the RGB images. One part of the RGB and depth images contains only a single type of hazmat label; these images cover two backgrounds (plain and plywood) and five positions (top left, top right, center, bottom left, bottom right). The remaining images each contain multiple types of hazmat labels and cover nine different angles (0°, ±30°, ±45°, ±60°, ±75°) at three distances.

V Experiment and Results

V-A Evaluation Metric

In our detection framework, we use a training set containing canonical-view images only to train the CNN detector. This effectively reduces the size of the training set, thus lowering the difficulty of collecting a good training set for mobile robotics applications. In the testing stage, each image first goes through the rectification system to get rectified images in the canonical view, as shown in Figure 4. Then the CNN detector performs the actual detection. Finally, all detection results are warped back to the original images to obtain the final detection results as the output of the system.

We use the MSCOCO [2] object detection evaluation metrics to evaluate the detection performance. The two main metrics are IoU (Intersection over Union) and mAP (mean Average Precision). We propose an extended NMS method to select the final bounding boxes. Since each image is split into a series of rectified images, we can convert the bounding boxes back to the original image by utilizing the homography matrices calculated before: for each split image, we apply the corresponding homography to recover the final bounding boxes. The implementation of our method will be made available with the final version of the paper.
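A minimal sketch of this back-warping step (our own; it maps the corners of each detected box through the inverse homography and takes the axis-aligned box, after which standard NMS can merge detections from overlapping windows):

```python
import cv2
import numpy as np

def boxes_to_original(boxes, H):
    """Map (x1, y1, x2, y2) boxes from a rectified image to the original.

    `H` is the homography that produced the rectified image; its inverse
    sends rectified-view pixels back to original-view pixels.
    """
    H_inv = np.linalg.inv(H)
    mapped = []
    for x1, y1, x2, y2 in boxes:
        corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]],
                           dtype=np.float64).reshape(-1, 1, 2)
        warped = cv2.perspectiveTransform(corners, H_inv).reshape(-1, 2)
        mapped.append([*warped.min(axis=0), *warped.max(axis=0)])
    return np.array(mapped)
```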

V-B Experimental Setup

Most experiments are performed on our own self-collected dataset, as there is no publicly available RGB-D hazmat detection dataset. To show that our approach can effectively reduce the difficulty for the CNN detector by dealing with perspective distortion in advance, only images from canonical views are used for training the network. The test sets include images from various angles (−75° to 75°).

We use yolov3-tiny [18] as the CNN detector. We choose yolov3-tiny because 1) we take it as a typical example of an off-the-shelf CNN-based detection network and 2) it is small and fast enough for real-world deployment on mobile robots. Training takes about one and a half hours on our computer (Intel Core i7-6700 CPU, GeForce GTX 1080, GiB memory). We trained our model from scratch with a batch size of , momentum , subdivisions , burn-in , max batches , and learning rate ; the learning rate is multiplied by 0.1 when the number of batches reaches 3000, 4000, 5000, 6000, and 7000.

            mAP          mAP          mAP                    AR
            (IoU=0.50)   (IoU=0.75)   (IoU=0.50:0.05:0.95)   (IoU=0.50:0.05:0.95)
baseline    0.236        0.053        0.088                  0.15
our         0.53         0.193        0.246                  0.351
TABLE I: Experiment result with / without geometry rectification.
        method     mAP          mAP          mAP                    AR
                   (IoU=0.50)   (IoU=0.75)   (IoU=0.50:0.05:0.95)   (IoU=0.50:0.05:0.95)
-75°    baseline   0.009        0            0.001                  0.004
        our        0.132        0.006        0.034                  0.051
-60°    baseline   0.184        0.016        0.054                  0.077
        our        0.477        0.055        0.158                  0.216
-45°    baseline   0.329        0.099        0.139                  0.195
        our        0.645        0.274        0.303                  0.401
-30°    baseline   0.445        0.149        0.188                  0.272
        our        0.679        0.365        0.362                  0.456
0°      baseline   0.538        0.222        0.263                  0.357
        our        0.665        0.386        0.375                  0.492
30°     baseline   0.434        0.09         0.169                  0.247
        our        0.632        0.367        0.364                  0.455
45°     baseline   0.287        0.041        0.098                  0.145
        our        0.663        0.361        0.364                  0.46
60°     baseline   0.116        0.015        0.029                  0.045
        our        0.586        0.179        0.269                  0.351
75°     baseline   0.026        0            0.004                  0.005
        our        0.53         0.114        0.222                  0.281
TABLE II: Experiment result with / without geometry rectification at different angles.

V-C Rectification Parameters

For the plane segmentation we assume that 90% of the points of each frame lie on planes. As a result, RANSAC keeps extracting planes until less than 10% of the total points remain in the set. We set the maximum number of planes per image to 1, because it is known that the test set only contains one plane per test image. The distance from the virtual viewpoints to the plane is set to 1.2 m. Because the training dataset is collected at distances between 1 m and 1.5 m, it is reasonable to assume that the CNN detector will perform better at 1.5 m or 1 m than at other distances. The training dataset contains two main parts, images with one hazmat sign and images with 13 hazmat signs, in order to prevent overfitting to the background. The homography matrix could be calculated in closed form by using the planar homography, but we currently use DLT to compute it. In the future we plan to move to the closed-form solution.

V-D Results

We compare the detection performance with and without geometry rectification. The results are shown in Table I; baseline means without geometry rectification. From Table I, we can see that after geometry rectification the performance is much better than before: mAP (IoU=0.50) increases from 0.236 to 0.53, while mAP (IoU=0.50:0.05:0.95) increases from 0.088 to 0.246.

Fig. 6: Detection results at on background one.
Fig. 7: Detection results at on background two.

Table II shows the performance of our approach at different angles. The test dataset contains nine angles (0°, ±30°, ±45°, ±60°, ±75°). From Table II, we can see that even at very large angles such as ±75° our approach can still detect some hazmat signs; at 75°, our mAP (IoU=0.50) is 0.53, while without geometry rectification it is only 0.026. At the other angles, the results in Table II show that our results are also better than the baseline results; for example, mAP (IoU=0.50) increases from 0.287 to 0.663 at 45°, which is a huge improvement.

Besides perspective distortion, our proposed detection approach helps the CNN detector module avoid dealing with multi-scale detection problems by explicitly rectifying the target patch to an ideal scale, as the distance from the plane to the virtual viewpoints is fixed. Previous research has shown that the scale problem is challenging for CNN networks [22]. As a result, in Table II, we achieve better performance even at 0°, where there is no perspective distortion. For 0° images, the mAP (IoU=0.50:0.05:0.95) improves from 0.263 to 0.375.

The top images in Figure 6 and Figure 7 show examples of the detection results of the baseline algorithm without geometry, while the bottom images in Figure 6 and Figure 7 show the detection results of our approach on the same images. We can see that in both Figure 6 and Figure 7 our approach performs much better than the baseline approach: it detects more hazmat signs at very large angles, and its bounding boxes are more precise.

Running single-threaded on a CPU, our algorithm needs about 5.7 seconds per image, a value that should be improved, since it is too slow to run live on a robot.

VI Conclusions

In this work, we showed a simple but effective way to combine geometric information with an off-the-shelf CNN-based detector. By performing image rectification explicitly before the CNN detector, we take full advantage of the available geometric information from RGB-D images to 1) reduce the training time; 2) reduce the size of the training set required; 3) improve the performance of the overall detection system; and 4) produce more accurate detection results (tighter bounding boxes). This approach also features high robustness towards noisy depth input (a noisy point cloud), as the depth is only used to estimate the plane parameters, for which RANSAC is effective even with noisy input.

For mobile robotics applications, especially hazmat sign detection in rescue robotics, our approach lowers the work required to create a good training dataset, because fewer training images are needed. In the interest of reproducible science we provide the dataset used in this paper as well as our code to the public.

References

  • [1] H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: speeded up robust features. In European Conference on Computer Vision, pp. 404–417. Cited by: §I-A.
  • [2] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §I-A, §V-A.
  • [3] C. Couprie, C. Farabet, L. Najman, and Y. Lecun (2013) Indoor semantic segmentation using depth information. In International Conference on Learning Representations (ICLR2013), April 2013, Cited by: §II-C.
  • [4] J. P. S. do Monte Lima, F. P. M. Simões, H. Uchiyama, V. Teichrieb, and E. Marchand (2016) Depth-assisted rectification for real-time object detection and pose estimation. Machine Vision and Applications 27 (2), pp. 193–219. Cited by: §II-C.
  • [5] R. Edlinger, G. Zauner, and M. Zauner (2019) Hazmat label recognition and localization for rescue robots in disaster scenarios. Electronic Imaging 2019 (7), pp. 463–1. Cited by: §II-A.
  • [6] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §II-B, §III-C1.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 580–587. External Links: Document, 1311.2524, ISBN 9781479951178, ISSN 10636919 Cited by: §I-A.
  • [8] D. Gossow, J. Pellenz, and D. Paulus (2008) Danger Sign Detection Using Color Histograms and SURF Matching. In Proceedings of the 2008 IEEE International Workshop on Safety, Security and Rescue Robotics, pp. 13–18. External Links: ISBN 9781424420322 Cited by: §II-A.
  • [9] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik (2014) Learning rich features from RGB-D images for object detection and segmentation. In ECCV 2014, Vol. 8695 LNCS, pp. 345–360. External Links: Document, arXiv:1407.5736v1, ISBN 9783319105833, ISSN 16113349 Cited by: §I-A.
  • [10] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik (2014) Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pp. 345–360. Cited by: §II-C.
  • [11] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison (2016) Gvnn: neural network library for geometric computer vision. In European Conference on Computer Vision, pp. 67–82. Cited by: §II-C.
  • [12] K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-Octob, pp. 2980–2988. External Links: Document, 1703.06870, ISBN 9781538610329, ISSN 15505499 Cited by: §I-A.
  • [13] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §I-A.
  • [14] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §I-A, §II-C.
  • [15] M. A. Mohamed, J. Tünnermann, and B. Mertsching (2018) Seeing Signs of Danger: Attention-Accelerated Hazmat Label Detection. In 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics, SSRR 2018, External Links: Document, ISBN 9781538655726 Cited by: §II-A, §IV.
  • [16] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. pp. 5632–5640. Cited by: §II-C.
  • [17] A. Parra, B. Zhao, A. Haddad, M. Boutin, and E. J. Delp (2013) Hazardous material sign detection and recognition. In 2013 IEEE International Conference on Image Processing, ICIP 2013 - Proceedings, pp. 2640–2644. External Links: Document, ISBN 9781479923410 Cited by: §II-A.
  • [18] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §I-A, §V-B.
  • [19] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: an efficient alternative to sift or surf.. In ICCV, Vol. 11, pp. 2. Cited by: §I-A, §II-C.
  • [20] R. Sheh, T. Kimura, E. Mihankhah, J. Pellenz, S. Schwertfeger, and J. Suthakorn (2011) The robocuprescue robot league: guiding robots towards fieldable capabilities. In Advanced Robotics and its Social Impacts, pp. 31–34. Cited by: §II-A.
  • [21] R. Sheh, S. Schwertfeger, and A. Visser (2016) 16 years of robocup rescue. KI-Künstliche Intelligenz 30 (3-4), pp. 267–277. Cited by: §II-A.
  • [22] B. Singh and L. S. Davis (2018) An analysis of scale invariance in object detection snip. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3578–3587. Cited by: §V-D.
  • [23] I. E. Sutherland (1974) Three-dimensional data input by tablet. Proceedings of the IEEE 62 (4), pp. 453–461. Cited by: §III-E, §III-E.
  • [24] C. Wang and K. Siddiqi (2016) Differential geometry boosts convolutional neural networks for object detection. pp. 1006–1013. Cited by: §II-C.
  • [25] B. Zhao, A. Parra, and E. J. Delp (2013) Mobile-based hazmat sign detection and recognition. In 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2013 - Proceedings, pp. 735–738. External Links: Document, ISBN 9781479902484 Cited by: §II-A.
  • [26] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis (2019) MonoCap: monocular human motion capture using a cnn coupled with a geometric prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 901–914. Cited by: §II-C.
  • [27] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu (2016) Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118. Cited by: §II-A.
  • [28] S. Zia, B. Yüksel, D. Yüret, and Y. Yemez (2018) RGB-D Object Recognition Using Deep Convolutional Neural Networks. In Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017, pp. 887–894. External Links: Document, ISBN 9781538610343 Cited by: §I-A, §II-C.