Line-based Camera Pose Estimation in Point Cloud of Structured Environments

11/23/2019 ∙ by Huai Yu, et al. ∙ Carnegie Mellon University ∙ Wuhan University

Accurate registration of 2D imagery with point clouds is a key technology for imagery-LiDAR point cloud fusion, camera to laser scanner calibration and camera localization. Despite continuous improvements, automatic registration of 2D and 3D data without additional texture information still faces great challenges. In this paper, we propose a new 2D-3D registration method to estimate 2D-3D line feature correspondences and the camera pose in untextured point clouds of structured environments. Specifically, we first use geometric constraints between vanishing points and 3D parallel lines to compute all feasible camera rotations. Then, we use a hypothesis testing strategy to estimate the 2D-3D line correspondences and the translation vector. By checking the consistency with the computed correspondences, the best rotation matrix can be found. Finally, the camera pose is further refined using non-linear optimization with all the 2D-3D line correspondences. The experiments demonstrate the effectiveness of the proposed method on synthetic and real datasets (outdoors and indoors) with repeated structures and rapid depth changes.




I Introduction

The 2D to 3D registration problem is to match a query image with a 3D model to establish the geometric correspondence between the two modalities and estimate the camera pose [1]. It is essential for many applications, e.g. point cloud colorization [2, 3] and camera localization [4]. Based on the estimated transformation parameters, image textures can be used to colorize untextured point clouds, which is beneficial for further interpretation. With the development of low-cost LiDAR sensors and the advancement of LiDAR-based SLAM algorithms [5, 6], point cloud 3D models can be easily obtained. Thus 2D-3D registration can be used to localize a small lightweight camera inside pre-built 3D maps, which is attractive and complementary to existing visual SLAM technology.

However, 2D to 3D registration is very challenging because of appearance differences and the modality gap. Current 2D to 3D matching methods are generally built on the same kind of descriptor (e.g. SIFT) across the two modalities [7, 8]. Nevertheless, appearance and visual features change with viewpoint, lighting conditions, weather and season, which makes visual features unsuitable for registering optical imagery with unorganized and untextured point clouds. Moreover, the same kinds of appearance and visual features are not always available for point cloud data. On the other hand, LiDAR point clouds are 3D data carrying geometric positions, while images are projected 2D data carrying texture information. These modality differences and description gaps cause common image registration methods to fail to find correspondences. Fortunately, 2D images share some geometrically consistent features with point clouds, such as line segments and planes [9]. Thus we can use geometric co-occurrence to find these feature correspondences. We focus on scenarios with rich geometric information, such as urban scenes with buildings. This kind of Manhattan world commonly consists of three sets of 3D parallel lines, which appear as 2D lines intersecting at vanishing points in the image plane [10]. These geometric constraints are beneficial for establishing correspondences between the two modalities.

Fig. 1: Pipeline of 2D-3D registration for line correspondence and pose estimation.

In this paper, we propose an automatic 2D-3D registration method to estimate line correspondences and camera pose for structured environments. As shown in Fig.1, the proposed method starts from a single image and an untextured point cloud, and does not need any pose prior or texture information. Based on the geometric relationship of 2D-3D lines, we coarsely match primary vanishing directions with 3D parallel orientations to obtain several rotation candidates for the camera pose. Most importantly, we use a simple hypothesis testing strategy to eliminate the ambiguity of the rotation matrix and simultaneously estimate the translation vector and the 2D-3D line correspondences. Finally, the camera pose is further optimized by minimizing the projection error of all 2D-3D line correspondences. The main contributions of this work can be summarized as follows:

  • We propose a new registration method to globally estimate the camera pose in untextured point clouds of structured environments. The strategy of vanishing direction matching followed by hypothesis testing does not need any pose prior and largely avoids getting stuck in a local optimum.

  • Compared with point-based registration methods, our proposed line-based method gives more repeatable and reliable 2D-3D correspondences between 2D images and point clouds. It is robust to 2D and 3D feature outliers and can deal with challenging scenes with repeated structures and rapid depth changes.

The structure of this paper is organized as follows: Section II reviews previous work related to 2D-3D registration. Section III details the methodology. Experimental results and discussions are presented in Section IV. Finally, Section V gives the conclusion of this paper.

II Related Work

For image to point cloud registration, the general approach is to transfer one modality to the other, i.e. either project the point cloud to image space [11, 8] or reconstruct a point cloud from images using multi-view geometry [2], and then register the data in the same dimension. However, reconstruction approaches do not work well for registering a single image to untextured and unorganized point clouds. For projection approaches, because there is no texture information in the point cloud, geometric features are often used, which are more robust than appearance features. Therefore, the main issue is how to estimate the 2D-3D feature correspondences and the camera pose.

With known 2D-3D feature correspondences, the pose estimation problem can be solved well by PnP or PnL algorithms [12, 13]. However, it is very challenging when both the pose and the correspondences are unknown, which is a chicken-and-egg conundrum. A conventional approach is a RANSAC-based strategy [14], which can simultaneously consider feature correspondence and geometric information. However, without any constraint, RANSAC suffers from a large search space and high time complexity if it is to avoid local optima. To guarantee the global optimum and improve efficiency, existing approaches mainly rely on point and line feature correspondences [15, 16, 17, 18, 4, 19]. Several approaches simultaneously determine the correspondence and pose between 2D and 3D points, such as SoftPosit [15] and BlindPnP [18]. However, they are local optimization methods and need a good initialization. More recent works use branch-and-bound strategies to guarantee the global optimum without a pose prior [19, 4]. However, these approaches start from existing point features and are evaluated on synthetic data with a guaranteed proportion of inlier correspondences, a condition that is very difficult to meet for real data.

For real 2D-3D data, it is difficult to extract features that are highly repeatable across modalities. Point features are often used (e.g. SIFT, junctions [20]), but they need careful design to encode geometric information and remain detectable in both modalities. Apart from point features, line features are also suitable for 2D-3D registration, since they are both stable and representative. However, it is difficult to describe the local information of a line feature for both 2D and 3D data. An early approach modifies the SoftPosit algorithm for line features [21]. However, it needs a good initialization and may get stuck in a local optimum. Recently, Brown et al. [22] use both point and line features to minimize the projection error, and then use a branch-and-bound formulation to guarantee global optimality.

Structured environments like buildings differ from natural scenes because of their repeated structures and weak texture. The aforementioned methods may fail to find a good pose for this kind of data. Fortunately, the many 3D parallel lines in structured environments generate vanishing points in 2D images. In previous research [23, 24], vanishing points have been used for rotation estimation between images, but they have rarely been used for registration between 2D and 3D data. A systematic 2D-3D registration method for 2D images and 3D range data is proposed in [17], but it works in an interactive manner. The camera orientation is recovered by matching vanishing points with 3D directions, and the translation is estimated by RANSAC over all the linear features. Because it is hard to determine the correspondence between vanishing points and 3D directions, the authors interactively rotate the 3D model to align the 3D directions with the 2D vanishing directions. Therefore, there is a great demand for an automatic solution to the 2D-3D registration problem in structured environments.

III Overview of the proposed method

To address the automatic registration of a single image with point clouds in structured environments, we propose to estimate the rotation matrix and the translation vector separately using geometric correspondences. The method starts from a single image (either RGB or grayscale) and an untextured point cloud. The outputs are the 2D-3D line correspondences and the camera pose relative to the point cloud frame. We first extract line segments from both the 2D and 3D data and cluster them into line sets (vanishing-point line sets for 2D and parallel line sets for 3D). Based on the relationship between the vanishing points of 2D images and the parallel directions of 3D lines [23], we then coarsely match the primary 2D vanishing directions and 3D parallel orientations to compute several candidate rotation matrices of the camera. For each rotation matrix candidate, we use a hypothesis testing strategy to estimate the translation parameters; the estimate with the most line correspondence inliers is the output of the iterations. Additionally, by comparing the number of inliers across rotation candidates, we obtain the final line correspondences and translation vector while rejecting the wrong rotation estimates. After obtaining the 2D-3D line correspondences, we further optimize the camera pose by minimizing the line projection error. The main steps are line segment extraction, rotation matrix candidate estimation, line correspondence estimation using the hypothesis testing strategy, and final pose refinement using the line correspondences (Fig.1).

III-A Line segment extraction for image and point cloud

For line segment extraction from images, there exist many state-of-the-art methods in the literature [25, 26]. Considering the computational efficiency and the quality of the extracted lines, we choose the LSD method [25]. Line segment extraction methods for unorganized point clouds, in contrast, depend on the geometric structure of the scene. Here, we use a simple and efficient 3D line detection algorithm [27], which is well suited to structured environments that contain many planar features. Finally, adjacent 3D line segments are merged to obtain stable and significant ones.

(a) 2D line detection (b) 3D line detection
Fig. 2: A demonstration of the 2D and 3D line segment extraction method.

Fig.2 shows an example of the line segment extraction result for both 2D and 3D data. Their structures are very similar and share many co-occurring line segments, which is very important for later correspondence estimation.

Fig. 3: Projection geometry of 3D parallel line segments

III-B Rotation matrix candidate estimation

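The projections of 3D parallel line segments are 2D line segments sharing the same vanishing point. As shown in Fig.3, there is a set of 3D parallel lines with normalized homogeneous orientation $\tilde{\mathbf{d}} = [\mathbf{d}^T, 0]^T$ in the point cloud frame. Their projections on the unit sphere in the camera frame are great circles intersecting at one point on the sphere. The direction from the camera center to this point forms a vanishing direction $\tilde{\mathbf{v}} = [\mathbf{v}^T, 0]^T$, which corresponds to the parallel 3D lines expressed in the camera frame. Thus, the transformation from $\tilde{\mathbf{d}}$ to $\tilde{\mathbf{v}}$ is a typical Euclidean 3D transform

$$\begin{bmatrix} \mathbf{v} \\ 0 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{d} \\ 0 \end{bmatrix} \quad\Longrightarrow\quad \mathbf{v} = \mathbf{R}\,\mathbf{d}, \tag{1}$$

where $\mathbf{R}$ and $\mathbf{t}$ are the rotation and translation parameters from the point cloud coordinate system to the camera coordinate system, and $\mathbf{v}$ and $\mathbf{d}$ denote the inhomogeneous 2D and 3D directions. Because the last homogeneous coordinate of a direction is zero, $\mathbf{t}$ drops out, so the rotation is completely determined by the direction correspondences. The rotation matrix can be estimated from at least two direction correspondences,

$$\mathbf{R} = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \mathbf{v}_3 \end{bmatrix} \begin{bmatrix} \mathbf{d}_1 & \mathbf{d}_2 & \mathbf{d}_3 \end{bmatrix}^{-1}, \tag{2}$$

where $\mathbf{v}_1$ corresponds to $\mathbf{d}_1$ and $\mathbf{v}_2$ corresponds to $\mathbf{d}_2$. $\mathbf{v}_3$ and $\mathbf{d}_3$ can be the cross products of the first two orientations or another correspondence.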
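As a minimal sketch of Eq.2, the rotation can be assembled from two direction correspondences, with the third column pair taken as cross products. The function name and the final SVD step are assumptions made for illustration: with noisy directions the matrix product is not exactly orthonormal, so the sketch snaps it to the nearest rotation, which the text does not describe.

```python
import numpy as np

def rotation_from_two_directions(v1, v2, d1, d2):
    """Estimate R such that v_i ~ R d_i from two direction correspondences (Eq.2)."""
    v1, v2, d1, d2 = (np.asarray(x, dtype=float) / np.linalg.norm(x)
                      for x in (v1, v2, d1, d2))
    # Third pair: cross products of the first two orientations.
    V = np.column_stack([v1, v2, np.cross(v1, v2)])
    D = np.column_stack([d1, d2, np.cross(d1, d2)])
    M = V @ np.linalg.inv(D)  # exact rotation only for noise-free input
    # Assumption (not from the paper): project M onto SO(3) via SVD.
    U, _, Vt = np.linalg.svd(M)
    S = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ S @ Vt
```

The two 3D directions must not be parallel, otherwise the matrix of 3D directions is singular and cannot be inverted.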

From the extracted 3D and 2D line segments, we need to cluster at least two primary 2D vanishing directions and two corresponding 3D directions. For 2D lines, we use a multiple RANSAC-based vanishing direction detector [28] to cluster the lines into several sets. After clustering, each set of line segments is used to compute a vanishing direction using Principal Component Analysis (PCA). Likewise, we cluster 3D parallel directions in the point cloud frame using a RANSAC-based detector on the line direction distribution: we randomly select one 3D line segment and group the other 3D lines with similar normalized 3D directions, keeping the group with the largest number of inliers. To keep the vanishing direction and 3D orientation estimation robust, the number of 2D and 3D clusters (denoted as $c$) needs to be set larger than 3, e.g. $c=5$. After obtaining the directions, we merge line sets with collinear and adjacent directions and select the two directions supported by the largest number of lines as the final 2D and 3D directions. Although more than two primary sets of line segments may exist in the 2D and 3D data, we only use the first two or three because this is already very stable for structured data.
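The 3D direction clustering described above could be sketched as follows. This is an illustrative approximation rather than the authors' implementation: the angular tolerance `angle_tol_deg`, the trial count and the function name are assumed values, and the representative direction of each set is taken as the dominant eigenvector of the scatter matrix (a PCA-style estimate that is insensitive to the sign of each line direction).

```python
import numpy as np

def cluster_parallel_directions(dirs, num_clusters=5, angle_tol_deg=2.0,
                                trials=200, seed=0):
    """Group 3D line directions into parallel sets with a RANSAC-style search.

    Returns a list of (unit_direction, member_indices), largest set first.
    """
    rng = np.random.default_rng(seed)
    dirs = np.asarray(dirs, dtype=float)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    cos_tol = np.cos(np.deg2rad(angle_tol_deg))
    remaining = np.arange(len(dirs))
    clusters = []
    for _ in range(num_clusters):
        if len(remaining) == 0:
            break
        best = np.array([], dtype=int)
        for _ in range(trials):
            seed_dir = dirs[rng.choice(remaining)]
            # |cos| treats d and -d as the same orientation.
            inliers = remaining[np.abs(dirs[remaining] @ seed_dir) >= cos_tol]
            if len(inliers) > len(best):
                best = inliers
        # Dominant eigenvector of the scatter matrix of the inlier directions.
        _, vecs = np.linalg.eigh(dirs[best].T @ dirs[best])
        clusters.append((vecs[:, -1], best))
        remaining = np.setdiff1d(remaining, best)
    clusters.sort(key=lambda item: len(item[1]), reverse=True)
    return clusters
```

Taking the first two entries of the returned list corresponds to selecting the two directions supported by the most lines.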

However, the geometric distribution of the line segment sets is not distinctive enough to determine which 2D set corresponds to which 3D set. Additionally, the ambiguity of 3D orientations (two opposite directions) adds further possible correspondences. Even if we know which 2D vanishing direction corresponds to which 3D orientation, there are still four possibilities for the rotation estimate. Thus, using two sets of 2D-3D line correspondences, there are eight rotation candidates. With three sets of 2D-3D line correspondences, the additional correspondence can be used to validate the estimate, so there are fewer than eight candidates. The rotation candidate estimation algorithm is outlined in Algorithm 1.


Algorithm 1 Rotation matrix candidate estimation
Input: 2D line segments, 3D line segments
Output: 8 rotation candidates
1: Cluster the 2D lines into sets and calculate the vanishing directions;
2: Merge the vanishing directions and pick the 2 supported by the most 2D line segments;
3: Cluster the 3D lines into parallel line sets;
4: Merge the 3D directions and pick the 2 supported by the most 3D line segments;
5: Calculate the rotation matrices using Eq.2;
6: return the rotation candidates;
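The count of eight candidates follows from two possible assignments of the 3D sets to the vanishing directions times four sign combinations of the 3D orientations. A short sketch of this enumeration is given below; the function names are hypothetical, and the SVD projection onto a rotation is an added assumption for noisy input.

```python
import itertools
import numpy as np

def rotation_candidates(v1, v2, d1, d2):
    """Enumerate the 8 rotations consistent with two unordered, unsigned direction pairs."""
    def rot(va, vb, da, db):
        V = np.column_stack([va, vb, np.cross(va, vb)])
        D = np.column_stack([da, db, np.cross(da, db)])
        # Assumption: snap the Eq.2 product to the nearest rotation.
        U, _, Vt = np.linalg.svd(V @ np.linalg.inv(D))
        return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

    v1, v2, d1, d2 = (np.asarray(x, dtype=float) / np.linalg.norm(x)
                      for x in (v1, v2, d1, d2))
    candidates = []
    for da, db in ((d1, d2), (d2, d1)):  # which 3D set matches which vanishing direction
        for sa, sb in itertools.product((1.0, -1.0), repeat=2):  # opposite-direction ambiguity
            candidates.append(rot(v1, v2, sa * da, sb * db))
    return candidates
```

Each candidate is a proper rotation, because flipping a 3D direction also flips the cross-product column, which keeps the determinant positive.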

III-C Translation vector and line correspondence estimation

Although ambiguities remain in the rotation estimation, the search space is greatly reduced. For each rotation candidate, the correspondence between the 2D line sets and the 3D line sets is determined. Even so, the projection model from the point cloud frame to the camera frame still leads to a non-linear and non-convex problem, and a bad initialization of the position will result in a local optimum. Because only three translation parameters are left, a hypothesis testing method on individual line correspondences can be used to decrease the possibility of getting stuck in a local optimum. Before picking 2D-3D line correspondences, broken segments are merged and isolated short segments are removed [17]. For each 2D-3D line segment correspondence, the transformed 3D line segment in the camera frame is co-planar with the 2D line segment (Fig.4); thus the transformed 3D line center point $\mathbf{P}_c$ is perpendicular to the normal $\mathbf{n}$ of the plane through the camera center and the corresponding 2D line segment,

$$\mathbf{n}^T \left( \mathbf{R}\,\mathbf{P}_c + \mathbf{t} \right) = 0. \tag{3}$$

By randomly selecting 3 pairs of 2D-3D line correspondences, a translation vector $\mathbf{t}$ can be calculated, since each pair contributes one linear constraint on the three unknowns. The estimated $\mathbf{R}$ and $\mathbf{t}$ are then used to count the inlier correspondences. Each 3D line is first transformed to the camera coordinate system. For each 2D segment, several co-planar 3D segments are selected using Eq.3 and projected onto the image plane as finite 2D lines. We calculate the overlap length and keep as inliers the pairs whose overlap exceeds half of the 2D segment length. This overlap constraint avoids the degenerate case where all 3D line segments become inliers when the camera is sufficiently distant. Each RANSAC iteration yields a translation estimate and a certain number of 2D-3D line correspondences. When the number of inliers exceeds a preset fraction of the total number of 2D or 3D line segments, or the number of iterations reaches the preset maximum, the procedure returns the estimate with the largest number of inliers as the estimated translation.
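A minimal sketch of the translation step: three co-planarity constraints of the form of Eq.3 are stacked into a 3x3 linear system and solved for the translation. The helper names and the use of pixel endpoints with an intrinsic matrix `K` are assumptions made for illustration.

```python
import numpy as np

def line_plane_normal(p1, p2, K):
    """Unit normal of the plane through the camera centre and a 2D segment (pixel endpoints)."""
    K_inv = np.linalg.inv(K)
    r1 = K_inv @ np.array([p1[0], p1[1], 1.0])  # back-projected ray of endpoint 1
    r2 = K_inv @ np.array([p2[0], p2[1], 1.0])  # back-projected ray of endpoint 2
    n = np.cross(r1, r2)
    return n / np.linalg.norm(n)

def translation_from_three_lines(normals, centers, R):
    """Solve n_i^T (R P_i + t) = 0 for t, given three 2D-3D line pairs."""
    N = np.asarray(normals, dtype=float)    # 3x3, one plane normal per row
    P = np.asarray(centers, dtype=float)    # 3x3, one 3D line centre per row
    b = -np.einsum("ij,ij->i", N, P @ R.T)  # right-hand side: -n_i^T R P_i
    return np.linalg.solve(N, b)
```

The system is singular when the three normals are co-planar, which happens for instance when all three 2D lines belong to the same vanishing direction, so a sampler would presumably draw the three pairs from more than one line set.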

Fig. 4: Geometry relationship between a 2D line and the matched 3D line.

Thus, we obtain 8 translation estimates, one for each of the 8 rotation candidates. The one with the most inliers among the 8 estimates is selected as the final translation vector. At the same time, we obtain the pairwise 2D-3D line correspondences and eliminate the ambiguity of the rotation estimation. The hypothesis testing strategy takes both the geometric distribution and the individual line segment correspondences into consideration, which greatly decreases the possibility of falling into a local optimum. The translation estimation is outlined in Algorithm 2.

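Algorithm 2 Translation and line correspondence estimation
Input: 2D line segments, 3D line segments and 8 rotation candidates
Output: Optimal rotation, translation and 2D-3D line correspondences
1: Merge 2D and 3D line segments and remove short isolated segments;
2: for all rotation candidates do
3:     Initialize the number of line correspondences;
4:     loop
5:         if the number of correspondences exceeds the inlier threshold or max iterations is reached then terminate;
6:         Randomly match 3 pairs of 2D and 3D segments and use Eq.3 to compute the translation;
7:         Calculate the line overlap lengths and count the number of lines with long overlap;
8:         if the count exceeds the current best then update the best count, translation and correspondences;
9: Pick the rotation candidate with the most correspondences and the corresponding translation;
10: return the optimal rotation, translation and 2D-3D line correspondences;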
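The overlap test used to count inliers in Algorithm 2 might look like the sketch below. It measures how much of the detected 2D segment is covered by the projected 3D segment along the direction of the 2D segment; the function names are hypothetical, and the default threshold of one half follows the empirical setting reported in the experiments.

```python
import numpy as np

def overlap_ratio(a1, a2, b1, b2):
    """Fraction of 2D segment (a1, a2) covered by segment (b1, b2) projected onto it."""
    a1, a2, b1, b2 = (np.asarray(x, dtype=float) for x in (a1, a2, b1, b2))
    axis = a2 - a1
    length = np.linalg.norm(axis)
    axis = axis / length
    # 1D coordinates of the projected segment along the detected segment.
    s1, s2 = sorted(((b1 - a1) @ axis, (b2 - a1) @ axis))
    overlap = min(s2, length) - max(s1, 0.0)
    return max(overlap, 0.0) / length

def is_inlier(a1, a2, b1, b2, min_ratio=0.5):
    """Accept the pair when the overlap exceeds min_ratio of the 2D segment length."""
    return overlap_ratio(a1, a2, b1, b2) > min_ratio
```

Because the ratio is bounded by the finite extent of both segments, a distant camera that shrinks every projected 3D line does not turn all of them into inliers.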

III-D Pose refinement using line correspondence

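At the former stage, we obtain the pairwise line correspondences and the 6-DOF pose parameters. Similar to point-based registration methods [29], we further optimize the camera pose by minimizing the projection error of all pairwise line correspondences. However, the Euclidean distance is not a natural measure of the projection error of a line correspondence. For a pair of matched 2D-3D line segments, the registration error consists of an overlap distance and an angle difference. Because the overlap length has already been constrained at the inlier estimation step, we further optimize with the collinearity constraint: if the projections of the 3D line end points are collinear with the corresponding 2D line, there is no angle difference between the pair. For two 3D end points $\mathbf{P}_1, \mathbf{P}_2$, the variable is the camera pose $\mathbf{T} \in SE(3)$, whose Lie algebra element is $\boldsymbol{\xi} \in \mathfrak{se}(3)$. The two projected end points are $\mathbf{p}_1, \mathbf{p}_2$,

$$s_k \begin{bmatrix} \mathbf{p}_k \\ 1 \end{bmatrix} = \mathbf{K} \left( \exp(\boldsymbol{\xi}^{\wedge})\, \tilde{\mathbf{P}}_k \right)_{1:3}, \quad k = 1, 2, \tag{4}$$

where $\mathbf{K}$ is the camera intrinsic matrix, $\tilde{\mathbf{P}}_k$ is the homogeneous form of $\mathbf{P}_k$ and $s_k$ is its depth in the camera frame.

We want to minimize the distance of both end points to the corresponding infinite 2D line $l$, whose coefficient vector is denoted as $\mathbf{l} = [a, b, c]^T$. Considering all the matched 2D-3D line segments, the minimization function can be formulated as

$$\boldsymbol{\xi}^{*} = \arg\min_{\boldsymbol{\xi}} \frac{1}{2} \sum_{i} d(\mathbf{p}^{i}, \mathbf{l}^{i})^2, \tag{5}$$

where $\mathbf{p}^{i}$ contains the two end points and the distance $d(\mathbf{p}^{i}, \mathbf{l}^{i})$ is the sum of the distances of the end points to the corresponding infinite line, each given by $d(\mathbf{p}, \mathbf{l}) = |au + bv + c| / \sqrt{a^2 + b^2}$ for $\mathbf{p} = [u, v]^T$. This is a non-linear least squares problem. With the Lie algebra, we can transform it into an unconstrained optimization problem and use the L-M algorithm to solve it. For a 3D end point $\mathbf{P}$, the 3D transformed point is $\mathbf{P}' = [X', Y', Z']^T$ and the projected 2D point is $\mathbf{p} = [u, v]^T$. The Jacobian matrix of the cost function is

$$\mathbf{J} = \frac{\partial d}{\partial \mathbf{p}} \frac{\partial \mathbf{p}}{\partial \delta\boldsymbol{\xi}}, \tag{6}$$

where $\partial d / \partial \mathbf{p}$ is the partial derivative of the point-to-line distance with respect to the 2D point,

$$\frac{\partial d}{\partial \mathbf{p}} = \frac{\operatorname{sgn}(au + bv + c)}{\sqrt{a^2 + b^2}} \begin{bmatrix} a & b \end{bmatrix}, \tag{7}$$

and $\partial \mathbf{p} / \partial \delta\boldsymbol{\xi}$ is the Jacobian of the standard 3D to 2D projection model [30] with respect to a left perturbation $\delta\boldsymbol{\xi}$ (translation components first), with focal lengths $f_x, f_y$,

$$\frac{\partial \mathbf{p}}{\partial \delta\boldsymbol{\xi}} = \begin{bmatrix} \dfrac{f_x}{Z'} & 0 & -\dfrac{f_x X'}{Z'^2} & -\dfrac{f_x X' Y'}{Z'^2} & f_x + \dfrac{f_x X'^2}{Z'^2} & -\dfrac{f_x Y'}{Z'} \\ 0 & \dfrac{f_y}{Z'} & -\dfrac{f_y Y'}{Z'^2} & -f_y - \dfrac{f_y Y'^2}{Z'^2} & \dfrac{f_y X' Y'}{Z'^2} & \dfrac{f_y X'}{Z'} \end{bmatrix}. \tag{8}$$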
We can use the g2o library [31] to implement the optimization (Eq.5). To remove outliers, we iteratively optimize the cost function and reject the outliers using the refined pose. The iteration terminates when no outliers remain or the maximum number of iterations is reached.
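As an illustration of the quantities involved in the refinement, the sketch below evaluates a per-endpoint point-to-line residual and its 1x6 Jacobian by the chain rule. It is not the g2o implementation: the function names are hypothetical, the residual is taken as the signed distance (equivalent to the unsigned one once squared), and the perturbation is assumed to be a left perturbation ordered as translation followed by rotation.

```python
import numpy as np

def project(P_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in the camera frame."""
    X, Y, Z = P_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def point_line_residual(P_cam, line, fx, fy, cx, cy):
    """Signed distance from the projection of P_cam to the line a*u + b*v + c = 0."""
    a, b, c = line
    u, v = project(P_cam, fx, fy, cx, cy)
    return (a * u + b * v + c) / np.hypot(a, b)

def residual_jacobian(P_cam, line, fx, fy):
    """1x6 Jacobian of the residual w.r.t. the pose perturbation (translation, rotation)."""
    a, b, _ = line
    X, Y, Z = P_cam
    # Derivative of the signed distance w.r.t. the 2D point.
    de_dp = np.array([a, b]) / np.hypot(a, b)
    # Derivative of the projected point w.r.t. the left perturbation.
    dp_dxi = np.array([
        [fx / Z, 0.0, -fx * X / Z**2,
         -fx * X * Y / Z**2, fx + fx * X**2 / Z**2, -fx * Y / Z],
        [0.0, fy / Z, -fy * Y / Z**2,
         -fy - fy * Y**2 / Z**2, fy * X * Y / Z**2, fy * X / Z],
    ])
    return de_dp @ dp_dxi
```

A Levenberg-Marquardt solver would stack these rows for both end points of every matched segment and update the pose on the manifold after each step.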

IV Experiments

To demonstrate the effectiveness of the proposed method, we test it on both synthetic (IV-A) and real data (IV-B). Both datasets are structured environments that include 3D parallel lines, which are the prerequisite for vanishing direction matching, as well as outliers.

IV-A Synthetic data experiments

To evaluate the proposed method in a setting where the true camera pose is known, 50 independent Monte Carlo simulations are conducted. Two sets of random 3D parallel lines are generated, and a fraction of 3D lines with random orientations is added to form the original 3D line segments; a fraction of the 3D lines is randomly selected as outliers to model occlusion; the inliers are projected onto a virtual image with a fixed focal length; Gaussian noise with a standard deviation of 2 pixels is added to the 2D line endpoints; and some 2D lines with random orientations are added to the image as 2D line outliers. With this setup, line segment extraction from images and point clouds is not needed. Visualizations of the synthetic setup and the registration results are shown in Fig.5.
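A stripped-down version of such a synthetic setup could be generated as sketched below. All numeric settings other than the 2-pixel noise level (intrinsics, sampling volume, ground-truth pose, segment lengths) are assumed values chosen for illustration, and the 2D and 3D outlier injection described above is omitted for brevity.

```python
import numpy as np

def make_synthetic_scene(n_per_set=20, sigma_px=2.0, seed=0):
    """Two sets of parallel 3D segments and their noisy 2D projections."""
    rng = np.random.default_rng(seed)
    # Assumed intrinsics and ground-truth pose (not taken from the paper).
    K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.array([0.0, 0.0, 4.0])
    dirs = (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
    sets = []
    for d in dirs:
        starts = rng.uniform(-1.0, 1.0, size=(n_per_set, 3))
        lengths = rng.uniform(0.3, 1.0, size=(n_per_set, 1))
        sets.append(np.stack([starts, starts + lengths * d], axis=1))
    lines3d = np.concatenate(sets)           # (N, 2, 3) segment end points
    cam = lines3d @ R.T + t                  # end points in the camera frame
    pix = cam @ K.T
    pix = pix[..., :2] / pix[..., 2:3]       # perspective division
    lines2d = pix + rng.normal(0.0, sigma_px, size=pix.shape)
    return lines3d, lines2d, (R, t, K)
```

Since the correspondences are known by construction, the returned arrays can be fed directly to a registration routine and the estimated pose compared against `(R, t)`.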

(a) 3D Result (b) 2D Result
Fig. 5: Synthetic 3D and 2D experimental results using random 3D lines. (a) 3D models (red lines), generated pose priors (green points) and estimated camera pose (o-xyz). (b) 2D projection alignment results. 3D line projections are shown as green lines, 2D image lines as red lines.

The quantitative results are shown in Fig.6 and Fig.7. The success rate measures the fraction of trials in which the correct pose is found, i.e. the rotation error is less than 0.1 radians and the position error relative to the ground truth is less than 0.1, as used in [4]. Compared with RANSAC (RS), the proposed method (VP) achieves a higher success rate as the number of line features grows. The running time also grows, because the data volume affects the efficiency of the translation estimation; nevertheless, the camera pose is found in less than 5 seconds, and the pose estimation errors for both rotation and translation are very small. Additionally, Fig.7 shows that the proposed method is very robust to 2D and 3D outliers.
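The success criterion can be written down compactly, as in the sketch below. The rotation error is the angle of the relative rotation; normalizing the position error by the norm of the ground-truth camera position is an assumption about how "relative to the ground truth" is defined, and the function names are hypothetical.

```python
import numpy as np

def rotation_error_rad(R_est, R_gt):
    """Angle of the relative rotation between the estimate and the ground truth."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def relative_position_error(R_est, t_est, R_gt, t_gt):
    """Camera-centre error divided by the norm of the ground-truth camera centre (assumed)."""
    c_est = -R_est.T @ t_est  # camera centre in the point cloud frame
    c_gt = -R_gt.T @ t_gt
    return float(np.linalg.norm(c_est - c_gt) / np.linalg.norm(c_gt))

def is_success(R_est, t_est, R_gt, t_gt, rot_tol=0.1, pos_tol=0.1):
    """Success test with the thresholds quoted in the text."""
    return (rotation_error_rad(R_est, R_gt) < rot_tol
            and relative_position_error(R_est, t_est, R_gt, t_gt) < pos_tol)
```

The success rate of a setting is then the mean of `is_success` over its Monte Carlo trials.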

(a) Success rate (b) Runtime (c) Camera pose error
Fig. 6: Results for synthetic dataset with different number of 3D lines. 50 Monte Carlo simulations are conducted for each setting.
(a) 3D outlier fraction (b) 2D outlier fraction
Fig. 7: Outlier analysis. (a) Mean success rates with different 3D outlier fraction. (b) Mean success rates with different 2D outlier fraction.

IV-B Real data experiments

The dataset consists of four outdoor and indoor scenes of structured environments: CMU NSH wall, Hamburg Hall windows, Hamerschlag Hall and NSH lounge. For each scene, there is one point cloud and 10 images taken from different poses. A FARO Focus3D S laser scanner and a FLIR BFLY-U3-13S2C-CS camera (Fig.8) are used to capture the 3D point clouds and the 2D images.

Fig. 8: Sensor setup for collecting 2D and 3D data.

To validate the rotation estimation framework, Fig.9(a) shows an example of the estimated vanishing directions and the corresponding 2D line segment sets, and Fig.9(b) shows the primary 3D orientations and the corresponding 3D parallel line segments. The two primary vanishing directions of the 2D image correspond to the two primary 3D line orientations. However, without visualizing the 2D lines in the image and the 3D lines in the point cloud, it is difficult to determine the 2D-3D direction correspondence automatically. Thus, we keep all possibilities of orientation matching and obtain 8 rotation matrix candidates.

(a) Vanishing directions and associated 2D lines (b) 3D orientations and associated 3D lines
Fig. 9: Vanishing direction to 3D orientation correspondence.

Some qualitative results of the line correspondences are shown in Fig.10 for the four scenes. For visualization, all the matched 3D line segments are projected onto the image plane (in green) using the estimated pose, while the corresponding 2D lines are drawn in red. We can observe that the global geometric structure aligns well. Some 2D lines have more than one corresponding 3D line and vice versa. This is reasonable because we cannot guarantee that fragmented 2D and 3D line segments are completely removed. Some lines (both 2D and 3D) have no correspondence because they contribute to the vanishing direction matching but not to the translation estimation. Based on these 2D-3D line correspondences and the coarsely estimated pose, we further use the local optimization method in Sec.III-D to refine the pose. The pose refinement usually takes fewer than 5 iterations when there are no outliers.

(a) NSH wall. (b) Hamburg Hall windows. (c) Hamerschlag Hall. (d) NSH lounge.
Fig. 10: Demonstration of the 2D-3D line correspondence for four scenes. (Green: 3D line projections, red: 2D lines.)

The final camera poses are shown in Fig.11. For each scene, we give an example of the true pose and our estimated pose on the point cloud map on the left. The true poses are drawn with dashed lines, while the estimates are drawn with solid lines. There is a small drift in the camera positions, but the corresponding axes are parallel to each other. To better visualize the registration results, we project the original point cloud onto the image plane with the same camera model and the estimated camera pose; the results are shown on the right of each figure. The projected image is fused with the original camera image to visualize the misalignment. An overlapped area (brighter area) with little blur indicates a better registration result. From a global perspective, the images overlap well and the blur is small. Where the depth changes dramatically, there is some drift caused by misalignment.

(a) camera pose. (b) point cloud projection.
Fig. 11: Camera pose estimation visualization. (First row: NSH wall; second: Hamburg Hall windows; third: Hamerschlag Hall; fourth: NSH lounge.)

To quantitatively analyze the results, we use the number of matched 2D-3D line segments ($N$), the rotation error, the position error, and the position error relative to the ground truth to measure the performance. Tab.I shows the registration results for each scene with its 10 images. For the NSH wall data, the total number of matched 2D-3D line segments is smaller than for the other two outdoor scenes, but the line distributions are reliable. The mean rotation error is 0.36 degrees and the mean position error over the ten images is 0.20 meters. For Hamburg Hall windows, there are more repeated structures, but the results are still feasible. There are more than 150 matched segments for each image. The mean rotation error is 0.65 degrees, the mean position error is about 0.39 meters, and the position errors relative to the ground truth are less than 0.1. For Hamerschlag Hall, there are both repeated structures and dramatic depth changes. The estimation errors are slightly bigger, about 0.62 degrees for the mean rotation error and about 0.54 meters for the mean position error. For NSH lounge, because it is an indoor environment, the total number of matched 2D-3D line segments is much smaller than for the outdoor scenes. Fortunately, because the distance from the camera to the objects is relatively small, we can obtain a very precise pose estimate once sufficient 2D-3D line correspondences are found. The mean rotation error is 0.20 degrees and the mean position error is 0.15 meters.

Image                     1     2     3     4     5     6     7     8     9     10
NSH wall
  N                       126   87    101   87    76    74    82    92    84    106
  Rotation error (deg)    0.27  0.15  0.13  0.54  0.34  0.54  0.67  0.45  0.34  0.22
  Position error (m)      0.29  0.15  0.31  0.11  0.10  0.21  0.12  0.43  0.25  0.06
  Relative position error 0.05  0.03  0.08  0.02  0.03  0.05  0.02  0.08  0.05  0.01
Hamburg Hall windows
  N                       182   187   192   175   198   156   202   189   172   185
  Rotation error (deg)    0.56  0.34  0.97  0.35  0.45  0.78  0.89  0.44  0.42  1.32
  Position error (m)      0.21  0.50  0.18  0.24  0.27  0.57  0.43  0.36  0.52  0.66
  Relative position error 0.03  0.07  0.03  0.03  0.04  0.08  0.06  0.05  0.07  0.09
Hamerschlag Hall
  N                       189   155   173   241   185   176   179   163   198   176
  Rotation error (deg)    0.42  0.33  0.87  0.34  0.65  0.65  0.66  1.22  0.45  0.65
  Position error (m)      0.44  0.73  0.44  0.12  0.89  0.73  0.89  0.58  0.17  0.45
  Relative position error 0.09  0.13  0.10  0.03  0.28  0.25  0.30  0.23  0.07  0.18
NSH lounge
  N                       15    13    17    20    18    17    11    16    19    15
  Rotation error (deg)    0.10  0.45  0.12  0.05  0.15  0.12  0.40  0.23  0.18  0.15
  Position error (m)      0.08  0.25  0.20  0.07  0.19  0.12  0.17  0.08  0.20  0.15
  Relative position error 0.06  0.15  0.08  0.10  0.18  0.05  0.11  0.09  0.13  0.09
TABLE I: Matching quantities and registration errors for real data.
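The per-scene means quoted in the text can be recomputed from the rotation and position error rows of Tab.I. The short script below does this arithmetic; the dictionary layout and function name are illustrative only, and the values are copied from the table.

```python
# Rotation error (degrees) and position error (meters) rows of Tab.I.
rotation_deg = {
    "NSH wall": [0.27, 0.15, 0.13, 0.54, 0.34, 0.54, 0.67, 0.45, 0.34, 0.22],
    "Hamburg Hall windows": [0.56, 0.34, 0.97, 0.35, 0.45, 0.78, 0.89, 0.44, 0.42, 1.32],
    "Hamerschlag Hall": [0.42, 0.33, 0.87, 0.34, 0.65, 0.65, 0.66, 1.22, 0.45, 0.65],
    "NSH lounge": [0.10, 0.45, 0.12, 0.05, 0.15, 0.12, 0.40, 0.23, 0.18, 0.15],
}
position_m = {
    "NSH wall": [0.29, 0.15, 0.31, 0.11, 0.10, 0.21, 0.12, 0.43, 0.25, 0.06],
    "Hamburg Hall windows": [0.21, 0.50, 0.18, 0.24, 0.27, 0.57, 0.43, 0.36, 0.52, 0.66],
    "Hamerschlag Hall": [0.44, 0.73, 0.44, 0.12, 0.89, 0.73, 0.89, 0.58, 0.17, 0.45],
    "NSH lounge": [0.08, 0.25, 0.20, 0.07, 0.19, 0.12, 0.17, 0.08, 0.20, 0.15],
}

def scene_means():
    """Return {scene: (mean rotation error, mean position error)}."""
    return {
        scene: (sum(rotation_deg[scene]) / len(rotation_deg[scene]),
                sum(position_m[scene]) / len(position_m[scene]))
        for scene in rotation_deg
    }

if __name__ == "__main__":
    for scene, (rot, pos) in scene_means().items():
        print(f"{scene}: {rot:.3f} deg, {pos:.3f} m")
```

The recomputed means agree with the figures in the text to within rounding.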

Regarding the processing time, the 3D line segment extraction step takes the most time, which depends on the volume of the point cloud data. Given the extracted 3D line segments, it takes about 8 seconds to estimate the camera pose of a single image with a MATLAB implementation on an 8-core Intel i7 CPU. The time varies slightly with the number of merged 2D and 3D line segments. The rotation candidates are processed in parallel, so estimating the translation vectors for all of them does not add much time.

During the experiments, we recommend setting the number of clusters for both the vanishing points and the 3D parallel lines to 5. For a Manhattan world, the common number of vanishing points is 3, but using 3 clusters sometimes leaves only 2 or 1 vanishing points after merging; 5 clusters yield more stable and robust output. Another parameter is the overlap length threshold: we accept a 2D-3D correspondence when the overlap length exceeds half of the 2D line length. This is an empirical setting based on performance. A larger threshold rejects more inliers, while a smaller one admits more outliers. This setting works well in both the synthetic and real data experiments. In general, the estimated poses are promising; in particular, the mean rotation error is less than 1 degree for all scenes. The translation error varies considerably with structure repetition and depth changes. Meanwhile, if we use RANSAC from the beginning with 6 line correspondences to estimate both the rotation $\mathbf{R}$ and the translation $\mathbf{t}$, it rarely succeeds, because the search easily falls into a local optimum. In our proposed method, the strategy using vanishing direction matching and hypothesis testing greatly reduces the chance of getting stuck in a local optimum.

V Conclusion

In this paper, we have presented an image to point cloud registration method that simultaneously estimates line correspondences and camera pose in structured environments. Based on geometric information, the method decouples the rotation calculation and the translation estimation into two steps. Eight rotation candidates are obtained using the correspondence of vanishing directions to 3D primary parallel directions. Then a hypothesis testing approach is used to estimate the line correspondences and the translation vector. Specifically, the hypothesis testing approach successfully resolves the rotation ambiguity that arises from matching vanishing directions with 3D orientations. The alignment using vanishing directions and the hypothesis testing strategy can be generalized to any kind of point cloud (with or without color, organized or unorganized) in structured environments. Experiments were conducted on synthetic and real data (both outdoors and indoors) with challenging repeated structures and rapid depth changes. The results demonstrate that the proposed method can effectively estimate line correspondences and camera pose. In the future, we will explore more efficient ways to find the global solution of the camera pose using line correspondences.


The authors thank Warren Whittaker from CMU for instructions on using the FARO scanner and Dylan Campbell from ANU for helpful discussions.


  • [1] D. P. Paudel, “Local and global methods for registering 2d image sets and 3d point clouds,” Ph.D. dissertation, Dijon, 2015.
  • [2] I. Stamos, L. Liu, C. Chen, G. Wolberg, G. Yu, and S. Zokai, “Integrating automated range registration with multiview geometry for the photorealistic modeling of large-scale scenes,” International Journal of Computer Vision, vol. 78, no. 2-3, pp. 237–260, 2008.
  • [3] A. Dhall, K. Chelani, V. Radhakrishnan, and K. M. Krishna, “Lidar-camera calibration using 3d-3d point correspondences,” arXiv preprint arXiv:1705.09785, 2017.
  • [4] D. J. Campbell, L. Petersson, L. Kneip, and H. Li, “Globally-optimal inlier set maximisation for camera pose and correspondence estimation,” IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [5] D. Droeschel and S. Behnke, “Efficient continuous-time slam for 3d lidar-based online mapping,” in 2018 IEEE International Conference on Robotics and Automation, 2018, pp. 1–9.
  • [6] L. Zhou, Z. Li, and M. Kaess, “Automatic extrinsic calibration of a camera and a 3d lidar using line and plane correspondences,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 5562–5569.
  • [7] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization using direct 2d-to-3d matching,” in International Conference on Computer Vision, 2011, pp. 667–674.
  • [8] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1744–1756, 2017.
  • [9] T. Goto, S. Pathak, Y. Ji, H. Fujii, A. Yamashita, and H. Asama, “Line-based global localization of a spherical camera in manhattan worlds,” in 2018 IEEE International Conference on Robotics and Automation, 2018, pp. 2296–2303.
  • [10] J. Lezama, R. Grompone von Gioi, G. Randall, and J.-M. Morel, “Finding vanishing points via point alignments in image primal and dual domains,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 509–515.
  • [11] Y. Feng, L. Fan, and Y. Wu, “Fast localization in large-scale environments using supervised indexing of binary features,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 343–358, 2016.
  • [12] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, p. 155, 2009.
  • [13] C. Xu, L. Zhang, L. Cheng, and R. Koch, “Pose estimation from line correspondences: A complete analysis and a series of solutions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1209–1222, 2017.
  • [14] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [15] P. David, D. Dementhon, R. Duraiswami, and H. Samet, “Softposit: Simultaneous pose and correspondence determination,” International Journal of Computer Vision, vol. 59, no. 3, pp. 259–284, 2004.
  • [16] B. Kamgar-Parsi and B. Kamgar-Parsi, “Matching 2d image lines to 3d models: Two improvements and a new algorithm,” in CVPR 2011, 2011, pp. 2425–2432.
  • [17] L. Liu and I. Stamos, “A systematic approach for 2d-image to 3d-range registration in urban environments,” Computer Vision and Image Understanding, vol. 116, no. 1, pp. 25–37, 2012.
  • [18] F. Moreno-Noguer, V. Lepetit, and P. Fua, “Pose priors for simultaneously solving alignment and correspondence,” in European Conference on Computer Vision, 2008, pp. 405–418.
  • [19] M. Brown, D. Windridge, and J.-Y. Guillemaut, “Globally optimal 2d-3d registration from points or lines without correspondences,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2111–2119.
  • [20] G.-S. Xia, J. Delon, and Y. Gousseau, “Accurate junction detection and characterization in natural images,” International journal of computer vision, vol. 106, no. 1, pp. 31–56, 2014.
  • [21] P. David, D. DeMenthon, R. Duraiswami, and H. Samet, “Simultaneous pose and correspondence determination using line features,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition., vol. 2, 2003, pp. II–II.
  • [22] M. Brown, D. Windridge, and J.-Y. Guillemaut, “A family of globally optimal branch-and-bound algorithms for 2d–3d correspondence-free registration,” Pattern Recognition, vol. 93, pp. 36–54, 2019.
  • [23] J.-K. Lee and K.-J. Yoon, “Real-time joint estimation of camera orientation and vanishing points,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1866–1874.
  • [24] M. E. Antone and S. Teller, “Automatic recovery of relative camera rotations for urban scenes,” in Proceedings IEEE Conference on Computer Vision and Pattern Recognition., vol. 2, 2000, pp. 282–289.
  • [25] R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: A fast line segment detector with a false detection control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 4, pp. 722–732, 2008.
  • [26] N. Xue, S. Bai, F. Wang, G.-S. Xia, T. Wu, and L. Zhang, “Learning attraction field representation for robust line segment detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [27] X. Lu, Y. Liu, and K. Li, “Fast 3d line segment detection from unorganized point cloud,” arXiv preprint arXiv:1901.02532, 2019.
  • [28] C. Rother, “A new approach to vanishing point detection in architectural environments,” Image and Vision Computing, vol. 20, no. 9-10, pp. 647–655, 2002.
  • [29] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” International journal of computer vision, vol. 74, no. 1, pp. 59–73, 2007.
  • [30] R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge university press, 2003.
  • [31] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 3607–3613.