Recent progress in computer vision has enabled the implementation of autonomous vehicle prototypes across urban and highway scenarios [schwarting2018planning]. Autonomous vehicles need accurate self-localization in their environment to plan their actions. Accurate localization requires High Definition (HD) maps of the environment that contain the 3D geometry of road boundaries, lanes, traffic signs, and other semantically meaningful landmarks. However, creating these HD maps involves expensive sensors mounted on the collection vehicles [jiao2018machine], which limits the scale of their coverage. Changes in the environment, such as the type or positions of traffic signs, must also be reflected in the map regularly. Therefore, the creation and maintenance of HD maps at scale remain a challenge.
To extend the map coverage to more regions or update the landmarks over time, crowdsourced maps are an attractive solution. However, in contrast with automotive data collection vehicles carrying high-grade calibrated sensors, crowdsourced maps require the use of consumer-grade sensors whose intrinsics may be unknown or may change over time. The commonly available sensors for crowdsourced mapping are a monocular color camera and a global positioning system (GPS). To utilize these sensors for crowdsourced mapping, camera self-calibration must be performed, followed by monocular depth or ego-motion estimation. Over the years, geometry based as well as deep learning based approaches have been proposed to compute the camera intrinsics [bogdan2018deepcalib, gordon2019depthwild, schonberger2016structure], and to estimate the depth/ego-motion [mur2015orb, engel2017direct, gordon2019depthwild, zhou2017unsupervised] from a sequence of images. However, the state-of-the-art solution to crowdsourced mapping assumes the camera intrinsics to be known a priori, and relies only upon geometry based ego-motion estimation [dabeer2017end].
Geometry based approaches for self-calibration and visual depth/ego-motion estimation often depend on carefully designed features and on matching them across frames. Thus, they fail in scenarios with limited features such as highways, under illumination changes or occlusions, or when repetitive structures cause poor matching. Recently, deep learning based approaches for camera self-calibration as well as depth and ego-motion estimation have been proposed [zhou2018deeptam, zhou2017unsupervised, godard2018digging, gordon2019depthwild]. These methods operate end-to-end and are often self-supervised, which enables their application in challenging scenarios. They are usually more accurate than geometry based approaches on short linear trajectories, resulting in a higher local agreement with the ground truth [zhou2017unsupervised, gordon2019depthwild]. Moreover, deep learning based approaches can estimate monocular depth from a single frame, as opposed to geometry based approaches that require multiple frames. Nonetheless, the localization accuracy of geometry based approaches is higher for longer trajectories due to loop closure and bundle adjustment. Therefore, we hypothesize that eliminating the requirement to know the camera intrinsics a priori, through a hybrid of geometry and deep learning methods for mapping, will increase the global map coverage and enhance the scope of its application.
In this work, we focus on the 3D positioning of traffic signs, as it is critical to the safe performance of autonomous vehicles, and is useful for traffic inventory and sign maintenance. We propose a framework for crowdsourced 3D traffic sign positioning that combines the strengths of geometry and deep learning approaches to self-calibration and depth/ego-motion estimation. Our contributions are as follows:
We evaluate the sensitivity of the 3D position triangulation to the accuracy of the self-calibration.
We quantitatively compare deep learning and multi-view geometry based approaches to camera self-calibration, as well as depth and ego-motion estimation for crowdsourced traffic sign positioning.
We demonstrate crowdsourced 3D traffic sign positioning using only GPS information and a monocular color camera without the prior knowledge of camera parameters.
We show that combining the strengths of deep learning with multi-view geometry is important for increased map coverage.
To facilitate evaluation and comparison on this task, we construct and provide an open source 3D traffic sign ground truth positioning dataset on KITTI (https://github.com/hemangchawla/3d-groundtruth-traffic-sign-positions.git).
II Related Work
Traffic sign 3D positioning
Arnoul et al. [arnoul1996traffic] used a Kalman filter for tracking and estimating positions of traffic signs in static scenes. In contrast, Madeira et al. [madeira2005automatic] estimated traffic sign positions through least-squares triangulation using GPS, Inertial Measurement Unit (IMU), and wheel odometry. Approaches using only a monocular color camera and GPS were also proposed [krsak2011traffic, welzel2014accurate]. However, Welzel et al. [welzel2014accurate] utilized prior information about the size and height of traffic signs to achieve an average absolute positioning accuracy up to 1 m. A similar problem of mapping the 3D positions and orientations of traffic lights was tackled by Fairfield et al. [fairfield2011traffic]. For related tasks of 3D object positioning and distance estimation, deep learning approaches [chen2016monocular, ku2019monocular, qin2019monogrnet, zhu2019learning] have been proposed. However, they primarily focus on volumetric objects, ignoring the near-planar traffic signs. Recently, Dabeer et al. [dabeer2017end] proposed an approach to crowdsource the 3D positions and orientations of traffic signs using cost-effective sensors with known camera intrinsics, and achieved a single journey average relative and absolute positioning accuracy of 46 cm and 57 cm respectively. All the above methods either relied upon collection hardware dedicated to mapping the positions of traffic control devices or assumed known accurate camera intrinsics.
Geometry based approaches for self-calibration use two or more views of the scene to estimate the focal lengths [bocquillon2007constant, gherardi2010practical], while often fixing the principal point at the image center [de1998self]. Structure from motion (SfM) reconstruction using a sequence of images has also been applied for self-calibration [pollefeys2008detailed, schonberger2016structure]. Moreover, deep learning approaches have been proposed to estimate the camera intrinsics using a single image through direct supervision [lopez2019deep, rong2016radial, workman2015deepfocal, zhuang2019degeneracy], or as part of a multi-task network [bogdan2018deepcalib, gordon2019depthwild]. While self-calibration is essential for crowdsourced 3D traffic sign positioning, its utility has not been evaluated until now.
Monocular Depth and Ego-Motion estimation
Multi-view geometry based monocular visual odometry (VO), and simultaneous localization and mapping (SLAM) estimate the camera trajectory using visual feature matching and local bundle adjustment [klein2007parallel, mur2015orb], or through minimization of the photometric reprojection error [engel2014lsd, engel2017direct, newcombe2011dtam]. Supervised learning approaches predict monocular depth [cao2017estimating, eigen2014depth, liu2015learning] and ego-motion [wang2017deepvo, zhou2018deeptam] using ground truth depths and trajectories, respectively. In contrast, self-supervised approaches jointly predict ego-motion and depth utilizing image reconstruction as a supervisory signal [casser2019unsupervised1, godard2018digging, gordon2019depthwild, zhou2017unsupervised, godard2017unsupervised, li2018undeepvo, zhan2018unsupervised]. Self-supervised depth prediction has also been integrated with geometry based direct sparse odometry [engel2017direct] as a virtual depth signal [yang2018deep]. However, some of these self-supervised approaches rely upon stereo image pairs during training [yang2018deep, godard2017unsupervised, li2018undeepvo, zhan2018unsupervised].
In this section, we describe our proposed system for 3D traffic sign positioning. The input is a sequence of color images of a given width and height, together with the corresponding GPS coordinates. The output is a list of detected traffic signs with their class identifiers, their absolute positions, and their positions relative to the frames in which each sign was detected. An overview of the proposed system is depicted in Fig. 2. Our system comprises the following key modules:
III-A Traffic Sign Detection & Inter-frame Sign Association
The first requirement for the estimation of 3D positions of traffic signs is detecting their coordinates in the image sequence and identifying their class. The output of this step is a list of 2D bounding boxes enclosing the detected signs, and their corresponding track and frame numbers. Using the center of the bounding box we extract the coordinates of the traffic sign in the image. However, we disregard those bounding boxes that are detected at the edge of the images to account for possible occlusions.
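The extraction of sign coordinates from bounding boxes can be sketched as follows. This is a minimal illustration, not the detector used in this work: the `Detection` record, the function name, and the `margin` threshold (in pixels) that defines "at the edge of the image" are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical detector output for one sign in one frame."""
    frame: int
    track: int
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def sign_observations(detections: List[Detection], width: int, height: int,
                      margin: float = 1.0):
    """Return (frame, track, (u, v)) using box centers, skipping edge boxes."""
    observations = []
    for det in detections:
        x0, y0, x1, y1 = det.box
        # A box touching the image border may be a partially occluded sign.
        touches_edge = (x0 <= margin or y0 <= margin or
                        x1 >= width - margin or y1 >= height - margin)
        if touches_edge:
            continue
        center = (0.5 * (x0 + x1), 0.5 * (y0 + y1))
        observations.append((det.frame, det.track, center))
    return observations
```

The margin is a tunable assumption; a larger value discards more detections near the border at the cost of shorter tracks.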
III-B Camera Self-Calibration
For utilizing the crowdsourced image sequences to estimate the 3D positions of traffic signs, we must perform self-calibration for cameras whose intrinsics are previously unknown. For this work, we utilize the pinhole camera model. From the set of geometry based approaches, we evaluate the Structure from Motion based method using Colmap [schonberger2016structure]. Note that self-calibration suffers from ambiguity for the case of forward motion with parallel optical axes [bocquillon2007constant]. Therefore we only utilize those parts of the sequences in which the car is turning. To extract the sub-sequences in which the car is turning, the Ramer-Douglas-Peucker (RDP) algorithm [ramer1972iterative, douglas1973algorithms] is used. From the deep learning based approaches, we evaluate the Self-Supervised Depth From Videos in the Wild (VITW) [gordon2019depthwild]. The burden of annotating training data [lopez2019deep, zhuang2019degeneracy] makes supervised approaches inapplicable to crowdsourced use-cases.
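One way to extract turning sub-sequences with the RDP algorithm is sketched below. The RDP simplification itself is standard; the turn criterion (heading change at a retained vertex above `min_angle_deg`) and the frame `window` around each turn are assumptions for illustration, since the text does not specify how turns are derived from the simplified trajectory.

```python
import numpy as np


def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: indices of the points kept for a 2D polyline."""
    points = np.asarray(points, dtype=float)
    keep = np.zeros(len(points), dtype=bool)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        start, end = stack.pop()
        if end - start < 2:
            continue
        seg = points[end] - points[start]
        rel = points[start + 1:end] - points[start]
        seg_len = np.linalg.norm(seg)
        if seg_len == 0.0:
            dist = np.linalg.norm(rel, axis=1)
        else:
            # Perpendicular distance of each interior point to the chord.
            dist = np.abs(seg[0] * rel[:, 1] - seg[1] * rel[:, 0]) / seg_len
        i = int(np.argmax(dist))
        if dist[i] > epsilon:
            split = start + 1 + i
            keep[split] = True
            stack.append((start, split))
            stack.append((split, end))
    return np.flatnonzero(keep)


def turn_frames(points, epsilon=2.0, min_angle_deg=30.0, window=20):
    """Frame indices near retained vertices where the heading changes sharply."""
    points = np.asarray(points, dtype=float)
    idx = rdp(points, epsilon)
    frames = set()
    for a, b, c in zip(idx[:-2], idx[1:-1], idx[2:]):
        v1, v2 = points[b] - points[a], points[c] - points[b]
        n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
        if n1 == 0.0 or n2 == 0.0:
            continue
        cos_angle = np.clip(np.dot(v1, v2) / (n1 * n2), -1.0, 1.0)
        if np.degrees(np.arccos(cos_angle)) >= min_angle_deg:
            lo, hi = max(0, int(b) - window), min(len(points), int(b) + window + 1)
            frames.update(range(lo, hi))
    return sorted(frames)
```

Here `points` would be the planar (east, north) positions derived from GPS; a sequence such as a straight road yields an empty list, which corresponds to the case where turn-based calibration is not feasible.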
III-C Camera Ego-Motion and Depth Estimation
For applying approach A described in Fig. 3 to 3D traffic sign positioning, the ego-motion of the camera must be computed from the image sequence. Note that camera calibration through Colmap involves SfM, but only utilizes those sub-sequences which contain a turn (Sec. III-B). Therefore, we evaluate the state-of-the-art geometry based monocular approach ORB-SLAM [mur2015orb] against the self-supervised Monodepth 2 [godard2018digging] and VITW. While the geometry based approaches compute the complete trajectory for the sequence, the self-supervised learning based approaches output the camera rotation and translation per image pair. The adjacent pair transformations are then concatenated to compute the complete trajectory. After performing visual ego-motion estimation, we use the GPS coordinates to scale the estimated trajectory. First, we transform the GPS geodetic coordinates to local East-North-Up (ENU) coordinates. Thereafter, using Umeyama's algorithm [umeyama1991least], a similarity transformation (rotation, translation, and scale) is computed that scales and aligns the estimated camera positions with the ENU positions, minimizing the mean squared error between them. The scaled and aligned camera positions are obtained by applying this similarity transformation to the estimated camera positions.
Thereafter, this camera trajectory is used for computation of the 3D traffic sign positions as described in section III-D.
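The similarity alignment of the estimated trajectory to the ENU positions can be sketched with NumPy as below. This follows the closed-form least-squares solution of Umeyama; the function and variable names are chosen for this example, and the conversion from geodetic to ENU coordinates is assumed to have been done beforehand.

```python
import numpy as np


def umeyama(src, dst):
    """Least-squares similarity (scale, R, t) such that dst ~ scale * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points, e.g. estimated camera
    positions and ENU positions.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    scale = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - scale * R @ mu_src
    return scale, R, t


def align(src, scale, R, t):
    """Apply the similarity transformation to all source points."""
    return scale * np.asarray(src, dtype=float) @ R.T + t
```

With noisy GPS the alignment is only optimal in the mean squared error sense, so individual aligned positions will generally not coincide with their ENU counterparts.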
For applying approach B described in Fig. 3 to 3D traffic sign positioning, dense monocular depth maps are needed. To generate the depth maps, we evaluate the self-supervised approaches Monodepth 2 and VITW. These approaches simultaneously predict the monocular depth as well as the ego-motion of the camera. While the estimated dense depth maps maintain the relative depth of the observed objects, we obtain metric depth by preserving forward and backward scale consistency. Given the camera calibration matrix, the shift in pixel coordinates due to the rotation and translation between two adjacent frames relates the unscaled depths of corresponding pixels, expressed in homogeneous coordinates, in the two frames (equation 2).
Multiplying equation 2 by the forward scale estimate shows that scaling the relative translation similarly scales the depths in both frames. This is also explained through the concept of similar triangles in Fig. 4. Given the relative ENU translation between the two frames, the scaled relative translation must match it in magnitude.
Therefore, the forward scale estimate is the ratio of the magnitude of the relative ENU translation to the magnitude of the estimated relative translation.
Similarly, the backward scale estimate is computed from the preceding frame pair. Accordingly, the scaling factor for each frame is given by the average of its forward and backward scale estimates. Thereafter, these scaled dense depth maps are used for computation of the 3D traffic sign positions as described in section III-D.
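The scale-consistency argument can be written out as below. This is a sketch under assumed notation, since the symbols are not available in the text: $\mathbf{K}$ is the calibration matrix, $\mathbf{R}_{c,c+1}$ and $\mathbf{t}_{c,c+1}$ the estimated relative rotation and translation between frames $c$ and $c+1$, $d_c$ and $d_{c+1}$ the unscaled depths at the homogeneous pixel coordinates $\mathbf{p}_c$ and $\mathbf{p}_{c+1}$, $\mathbf{g}_{c,c+1}$ the relative ENU translation, and $s^{f}_{c}$, $s^{b}_{c}$, $s_c$ the forward, backward, and final scale for frame $c$.

```latex
% Assumed notation; the symbol names are illustrative, not taken from the source.
\begin{align*}
  d_{c+1}\,\mathbf{p}_{c+1}
    &= \mathbf{K}\,\mathbf{R}_{c,c+1}\,\mathbf{K}^{-1}\, d_{c}\,\mathbf{p}_{c}
       + \mathbf{K}\,\mathbf{t}_{c,c+1}
    && \text{(pixel shift between adjacent frames)} \\
  \bigl(s^{f}_{c}\, d_{c+1}\bigr)\,\mathbf{p}_{c+1}
    &= \mathbf{K}\,\mathbf{R}_{c,c+1}\,\mathbf{K}^{-1}\bigl(s^{f}_{c}\, d_{c}\bigr)\,\mathbf{p}_{c}
       + \mathbf{K}\,\bigl(s^{f}_{c}\,\mathbf{t}_{c,c+1}\bigr)
    && \text{(multiplying by the forward scale)} \\
  s^{f}_{c}\,\lVert \mathbf{t}_{c,c+1} \rVert
    &= \lVert \mathbf{g}_{c,c+1} \rVert
    \;\;\Longrightarrow\;\;
    s^{f}_{c} = \frac{\lVert \mathbf{g}_{c,c+1} \rVert}{\lVert \mathbf{t}_{c,c+1} \rVert} \\
  s^{b}_{c}
    &= \frac{\lVert \mathbf{g}_{c-1,c} \rVert}{\lVert \mathbf{t}_{c-1,c} \rVert},
    \qquad
    s_{c} = \tfrac{1}{2}\bigl(s^{f}_{c} + s^{b}_{c}\bigr)
\end{align*}
```

The second line makes explicit why a single factor suffices: the rotation term is unaffected by scale, so multiplying the translation by a factor multiplies both depths by the same factor.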
III-D 3D Positioning and Optimization
For the final step of estimating and optimizing the 3D positions of the detected traffic signs, we adopt two approaches as shown in Fig. 3.
In approach A, the estimated camera parameters, the computed and scaled ego-motion trajectory, and the 2D sign observations in images are used to compute the sign position through triangulation. For a sign observed in multiple frames, we compute the initial sign position estimate using the mid-point algorithm [szeliski2010computer]. Thereafter, non-linear Bundle Adjustment (BA) is applied to refine the initial estimate by minimizing the reprojection error, which yields the refined absolute sign position.
To compute the sign positions relative to the frames, the estimated absolute sign position is projected to the corresponding frames in which it was observed.
If the relative depth of the sign is found to be negative, triangulation of that sign is considered to have failed. We can use this approach with the full trajectory of the sequence or with short sub-sequences corresponding to the detection tracks. The use of full and short trajectories for triangulation is compared in section IV-D.
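The initial estimate and the failure check of approach A can be sketched as follows. This is a minimal multi-ray mid-point triangulation, written under the assumptions that each pose is given as a camera-to-world rotation and a camera center in world coordinates; the function names are chosen for this example, and the subsequent bundle adjustment refinement is not shown.

```python
import numpy as np


def triangulate_midpoint(K, rotations, centers, pixels):
    """Point closest (in least squares) to all viewing rays of one sign.

    rotations: camera-to-world 3x3 matrices; centers: camera centers in world
    coordinates; pixels: (u, v) observations of the sign, one per frame.
    """
    K_inv = np.linalg.inv(K)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for R, c, (u, v) in zip(rotations, centers, pixels):
        d = R @ K_inv @ np.array([u, v, 1.0])   # ray direction in world frame
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)          # projector orthogonal to the ray
        A += P
        b += P @ np.asarray(c, dtype=float)
    return np.linalg.solve(A, b)


def relative_positions(point, rotations, centers):
    """Sign position expressed in each observing camera frame."""
    return [R.T @ (point - np.asarray(c, dtype=float))
            for R, c in zip(rotations, centers)]


def triangulation_failed(point, rotations, centers):
    """A sign behind any observing camera (non-positive depth) counts as failed."""
    return any(rel[2] <= 0.0 for rel in relative_positions(point, rotations, centers))
```

The linear system becomes ill-conditioned when all rays are nearly parallel, for example for a distant sign seen over a short baseline, which is one reason a refinement step and a failure check are useful.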
In approach B, the estimated camera parameters, the scaled dense depth maps, and the 2D sign observations in images are used to compute the 3D traffic sign positions through inverse projections. For a sign observed in multiple frames, each corresponding depth map produces a sign position hypothesis by back-projecting the pixel coordinate of the sign in that frame with its predicted depth multiplied by the corresponding depth scaling factor.
Since sign depth estimation may not be as reliable beyond a certain distance, we discard those sign position hypotheses whose estimated relative depth is more than 20 m. For computing the absolute coordinates of the sign, each relative sign position is projected to the world coordinates, and the centroid of the projected hypotheses is taken as the absolute sign position.
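The inverse projection of approach B can be sketched as below. The pose convention (camera-to-world rotation and camera center), the function name, and the argument layout are assumptions for this example; the 20 m cut-off is the threshold stated above.

```python
import numpy as np


def sign_from_depth(K, rotations, centers, pixels, depths, scales, max_depth=20.0):
    """Centroid of per-frame sign hypotheses obtained by inverse projection.

    depths: unscaled predicted depth at each sign pixel; scales: per-frame
    depth scaling factors. Returns None if every hypothesis is discarded.
    """
    K_inv = np.linalg.inv(K)
    hypotheses = []
    for R, c, (u, v), d, s in zip(rotations, centers, pixels, depths, scales):
        # Back-project the pixel with its metric depth into the camera frame.
        rel = s * d * (K_inv @ np.array([u, v, 1.0]))
        if rel[2] > max_depth:
            continue                      # depth deemed unreliable at this range
        hypotheses.append(R @ rel + np.asarray(c, dtype=float))
    if not hypotheses:
        return None
    return np.mean(hypotheses, axis=0)
```

Averaging in world coordinates, rather than in any one camera frame, means every retained hypothesis contributes equally regardless of which frame produced it.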
Finally, for both the above approaches, the metric absolute positions of traffic signs are converted back to the GPS geodetic coordinates.
In order to evaluate the best approach to 3D traffic sign positioning, it is pertinent to consider the impact of the different components on the overall accuracy of the estimation. First, we analyze the sensitivity of 3D traffic sign positioning performance against the camera calibration accuracy, demonstrating the importance of good self-calibration. Thereafter, we compare approaches to ego-motion and depth estimation, and camera self-calibration that compose the 3D sign positioning system. Finally, the relative and absolute traffic sign positioning errors corresponding to the approaches A and B are evaluated. For the above comparisons, we use the traffic signs found in the raw KITTI odometry dataset [geiger2012we], sequences (Seq) 0 to 10 (Seq 3 is missing from the raw dataset), unless specified otherwise.
IV-A Ground Truth Traffic Sign Positions
While 3D object localization datasets usually contain annotations for volumetric objects, such as vehicles and pedestrians, such annotations for near-planar objects like traffic signs are lacking. Furthermore, related works dealing with 3D traffic sign positioning have relied upon closed source datasets [dabeer2017end, welzel2014accurate]. Therefore we generate the ground truth (GT) traffic sign positions required for validation of the proposed approaches in the KITTI dataset. We choose the challenging KITTI dataset, commonly used for benchmarking ego-motion, as well as depth estimation because it contains the camera calibration parameters, and synced LiDAR information that allows annotation of GT 3D traffic sign positions.
As shown in Fig. 5, the LiDAR scans corresponding to the images captured, along with the GT trajectory poses are used to annotate the absolute as well as relative GT positions of the traffic signs. In total, we have annotated 73 signs across the 10 validation sequences.
IV-B Sensitivity to Camera Calibration
The state-of-the-art approach to 3D sign positioning relies upon multi-view geometry triangulation. In this section, we analyze the sensitivity of this method to the error in the estimate of camera focal lengths and principal point. To evaluate the sensitivity, we introduce error in the GT camera intrinsics and perform SLAM, both with (w/) and without (w/o) loop closure (LC) using the incorrect camera matrix, followed by the sign position triangulation using the full trajectory. Its performance for the corresponding set of camera intrinsics is then evaluated as the mean of relative positioning error normalized by the number of signs successfully triangulated. We perform this analysis for KITTI Seq 5 (containing multiple loops) and 7 (containing a single loop). For each combination of camera parameters, we repeat the experiment 10 times and report the minimum of the above metric.
The one-at-a-time (OAT) sensitivity analysis measures the effect of error (-15% to +15%) in a single camera parameter while keeping the others at their GT values. Fig. 6 shows the sensitivity of sign positioning performance to the error in focal lengths ($f_x$ and $f_y$ are varied simultaneously) and principal point ($c_x$ and $c_y$ are varied simultaneously). The performance w/ LC is better than that w/o LC. Furthermore, the performance gap between triangulation w/ and w/o LC is higher with a higher number of loops (Seq 5). Moreover, the triangulation is more sensitive to underestimating the focal length and to overestimating the principal point, primarily at large errors.
The interaction sensitivity analysis measures the effect of error (-5% to +5%) while varying the focal lengths and the principal point simultaneously. Fig. 7 shows the sensitivity to the combined errors in focal lengths and principal point for Seq 5 and Seq 7. The sensitivity to varying the principal point is higher than the sensitivity to varying the focal length for both sequences. Furthermore, for this shorter range of errors, underestimating the focal length and overestimating the principal point results in better performance than the converse. This is in contrast to the observed effect when the percentage errors in intrinsics are higher (cf. Fig. 6). Note that the best performance is not achieved at zero percentage errors for the focal length and principal point. We conclude that accurate estimation of the camera intrinsics is pertinent for accurate sign positioning.
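The perturbed camera matrices used in such a sensitivity study can be generated as in the sketch below. The function names and the dictionary layout are assumptions for this example; running SLAM and triangulation with each perturbed matrix is outside its scope.

```python
import itertools

import numpy as np


def perturb_intrinsics(K, focal_error=0.0, principal_error=0.0):
    """Copy of K with relative errors applied to (fx, fy) and (cx, cy)."""
    K_err = np.array(K, dtype=float)
    K_err[0, 0] *= 1.0 + focal_error       # fx
    K_err[1, 1] *= 1.0 + focal_error       # fy
    K_err[0, 2] *= 1.0 + principal_error   # cx
    K_err[1, 2] *= 1.0 + principal_error   # cy
    return K_err


def oat_grid(K, errors):
    """One-at-a-time: vary focal lengths or principal point, never both."""
    return {
        "focal": [perturb_intrinsics(K, focal_error=e) for e in errors],
        "principal": [perturb_intrinsics(K, principal_error=e) for e in errors],
    }


def interaction_grid(K, errors):
    """Interaction: every combination of focal and principal point error."""
    return {(ef, ep): perturb_intrinsics(K, ef, ep)
            for ef, ep in itertools.product(errors, errors)}
```

For instance, `oat_grid(K, np.linspace(-0.15, 0.15, 7))` would cover the -15% to +15% range, and `interaction_grid(K, np.linspace(-0.05, 0.05, 5))` the -5% to +5% range; the step counts here are illustrative.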
IV-C Sign Positioning Components Analysis
In order to compute the sign positions, we need the camera intrinsics through self-calibration, and the ego-motion/depth maps as shown in Fig. 2. Here we quantitatively compare state-of-the-art deep learning and multi-view geometry based methods to monocular camera self-calibration, as well as depth and ego-motion estimation. For these experiments, Monodepth 2 and VITW are trained on 44 sequences from KITTI raw in the city, residential, and road categories.
Table I shows the average percentage error for self-calibration with VITW and Colmap. VITW estimates the camera intrinsics for each pair of images in a sequence. Therefore, we compute the mean ($\mu$) and median (m) of each parameter across image pairs as the final estimate. To evaluate the impact of the turning radius on self-calibration with VITW, we also compute the parameters considering only those frames detected as part of a turn (through the RDP algorithm). Multi-view geometry based Colmap gives the lowest average percentage error for each parameter. The second best self-calibration estimate is given by VITW Turns (m). However, both of these fail to self-calibrate the camera on Seq 4, which does not contain any turns. For such a sequence, VITW (m) performs better than VITW ($\mu$). All methods underestimate the focal length and overestimate the principal point. Moreover, VITW estimates the focal length with a higher magnitude of error than the principal point. The upper bound on the error of an intrinsic estimate is inversely proportional to the amount of rotation about the corresponding axis [gordon2019depthwild]. Therefore, the estimates of $f_x$ and $c_x$ are better than those of $f_y$ and $c_y$ for all methods, because of the near-planar motion in the sequences.
|VITW Turns ($\mu$)||-14.69 ± 4.34||-22.96 ± 2.83||1.25 ± 0.38||4.08 ± 1.17|
|VITW Turns (m)||-11.82 ± 6.65||-22.62 ± 2.92||1.20 ± 0.34||3.88 ± 1.18|
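Aggregating per-pair intrinsics into a single estimate and scoring it as a signed percentage error can be sketched as follows. The array layout (one row of fx, fy, cx, cy per image pair) and the function name are assumptions for this example, and the numbers in the usage note are made up to show the behavior, not taken from the experiments.

```python
import numpy as np


def aggregate_intrinsics(per_pair, ground_truth):
    """Signed percentage error of the mean and median per-pair estimates.

    per_pair: (N, 4) array of (fx, fy, cx, cy) predicted for each image pair.
    ground_truth: length-4 array of the reference (fx, fy, cx, cy).
    """
    per_pair = np.asarray(per_pair, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)

    def pct_error(estimate):
        # Negative values mean the parameter is underestimated.
        return 100.0 * (estimate - gt) / gt

    return {
        "mean": pct_error(per_pair.mean(axis=0)),
        "median": pct_error(np.median(per_pair, axis=0)),
    }
```

With three hypothetical pairs predicting fx of 90, 90, and 150 against a reference of 100, the mean gives +10% while the median gives -10%, which illustrates how a few outlying pairs can pull the two aggregates apart.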
Table II shows the average absolute trajectory errors (ATE) in meters for full [horn1987closed] and 5-frame sub-sequences (ATE-5) [zhou2017unsupervised] from ego-motion estimation. The multi-view geometry based ORB-SLAM w/ LC has the lowest ATE full. However, ORB-SLAM w/o LC has a higher local agreement with the GT trajectory, as shown by the lowest ATE-5 mean of 0.014 m. Both ORB-SLAM methods suffer from track failure for Seq 1, unlike Monodepth 2 and VITW. For Seq 1, VITW performs better than Monodepth 2. While Monodepth 2 has the lowest ATE-5 Std, and an ATE-5 mean similar to that of ORB-SLAM w/o LC, its ATE full is much higher than that of the ORB-SLAM methods.
|Method||ATE Full||ATE-5 Mean||ATE-5 Std|
|ORB-SLAM (w/ LC)||17.034||0.015||0.017|
|ORB-SLAM (w/o LC)||37.631||0.014||0.015|
Table III shows the performance of depth estimation based on the metrics defined by Zhou et al.[zhou2017unsupervised]. While Monodepth 2 outperforms VITW in all the metrics, its training uses the average camera parameters from the dataset being trained on, thereby necessitating some prior knowledge about the dataset.
|Method||Abs Rel Diff||Sq Rel Diff||RMSE||RMSE (log)|
|Approach A||Approach B|
|ORB-SLAM w/ LC||ORB-SLAM w/o LC||VITW||Monodepth 2|
|VITW turns (m)||5.21||2.70||3.4||1.29||0.72||4.09||0.67||3.4||1.00||0.22||5.93||3.3||2.12||3.53||3.5||1.98|
Thus, we conclude that for self-calibration, Colmap, VITW (m), and VITW Turns (m) are the better choices. For sign positioning with Approach A using ego-motion estimation, ORB-SLAM (w/ and w/o LC) are the better choices. However, for sign positioning with Approach B using depth estimation, both Monodepth 2 and VITW need to be considered. Finally, it is hypothesized that a combination of multi-view geometry and deep learning approaches is needed for successful sign positioning in all sequences.
IV-D 3D Positioning Analysis
We compare the accuracy of 3D traffic sign positioning using Approach A against Approach B. We also compare the effect of multi-view geometry and deep learning based self-calibration on the 3D sign positioning accuracy. We compute the average relative sign positioning error normalized by the number of signs successfully positioned as the metric.
Table IV shows the comparison of the mean performance of 3D traffic sign positioning for the different combinations of self-calibration and depth/ego-motion estimation techniques. For Approach A, it reports the average relative sign positioning error using the full and the short trajectories, together with the average number of successfully positioned signs. For Approach B, it reports the relative sign positioning error using depth maps. Note that the best average performance is given by Approach A using Colmap for self-calibration and short ORB-SLAM (w/o LC) for ego-motion estimation. The better performance of ORB-SLAM w/o LC for relative sign positioning is explained by its lower ATE-5 (Table II) compared to ORB-SLAM w/ LC. Therefore, approach A using short sub-sequences for triangulation generally performs better than approach B. However, this is not the case for all sequences. For Seq 1, where ORB-SLAM fails tracking, the best sign positioning error is given by Approach B using a combination of Colmap for self-calibration and Monodepth 2 for depth estimation. For Seq 4, which does not contain any turns, calibration with Colmap or VITW Turns (m) is not feasible, and VITW (m) has to be used.
While ORB-SLAM (short) w/o LC gives a better relative positioning error than w/ LC, the average absolute positioning error is lower when using ORB-SLAM w/ LC than w/o LC. This is because the loop closures help in correcting the accumulated trajectory drift, thereby improving the absolute positions of the traffic signs.
We therefore propose a scheme for crowdsourced 3D traffic sign positioning that combines the strengths of multi-view geometry and deep learning techniques for self-calibration, ego-motion and depth estimation to increase the map coverage. This scheme is shown in Fig. 8. The mean relative and absolute 3D traffic sign positioning errors for each validation sequence, computed using this scheme are shown in Table V. With this approach, our single journey average relative and absolute sign positioning error per sequence is and respectively. The average relative positioning error for all frames is , while the absolute positioning error for all signs is . Our relative positioning accuracy is comparable to [dabeer2017end] which unlike our framework uses a camera with known intrinsics, GPS, as well as an IMU to estimate the traffic sign positions. Our absolute positioning accuracy is comparable to [welzel2014accurate], which also assumes prior knowledge of camera intrinsics as well as the size and height of traffic signs.
In this paper, we proposed a framework for 3D traffic sign positioning using crowdsourced data from only monocular color cameras and GPS, without prior knowledge of camera intrinsics. We demonstrated that combining the strengths of multi-view geometry and deep learning based approaches to self-calibration, depth and ego-motion estimation results in an increased map coverage. We validated our framework on traffic signs in the public KITTI dataset for single journey sign positioning. In the future, the sign positioning accuracy can be further improved with optimization for multiple journeys over the same path. We will also explore the effect of camera distortion and rolling shutter in the crowdsourced data to expand the scope of our method.