Autonomous vehicles are often equipped with several multi-modal sensors for environmental perception. Typical perceptual sensors include range sensors which measure the distance to objects using signals (e.g., light, radio waves, and sound waves) and vision sensors which project 3D information onto a 2D image frame. If multiple sensor types are equipped, obtaining relative coordinate transformations via extrinsic calibration becomes an essential pre-processing module. According to literature, accurate extrinsic calibration enhances the performance of data fusion between multiple sensors and the complementary exploitation of other modalities. For example, Shin et al.  utilized depth information from LiDAR to calculate the photometric error of two images. Zhang and Singh  implemented a fast odometry method featuring low drift by fusing visual odometry and LiDAR odometry methods. Han et al.  conducted a study to estimate road area using LiDAR data and camera images using a conditional random field framework.
The most common extrinsic calibration method between LiDAR and cameras involves the use of a checkerboard-like target and the process of minimizing the re-projection error of the correspondences between two sensor data [4, 5, 6]. Some studies adopted a modified target such as a circular hole [7, 8]. However, such calibration methods, require a target and the performance of the calibration depends on the accuracy of the correspondence estimated from each sensor data. In particular when LiDAR measurements are non-continuous, the detection of an accurate correspondence between the two modalities is not straightforward. Furthermore, this type of calibration is not possible if the two sensors do not provide covisibility at a given moment in time
a given moment in time. For the purpose of general calibration, several algorithms were introduced to perform calibration using information on the surrounding environment from sensors without using a specific target [9, 10, 11, 12, 13]. However, these approaches rely on an overlapping FOV between two sensors. In order to overcome the non-overlapping configuration issue, a local 3D pointcloud is accumulated using the odometry to ensure that the overlapping region appears on the camera [14, 15, 16]. Another line of study [17, 18, 19], referred to as hand-eye calibration, estimates the motion of each sensor and performs extrinsic calibration based on the estimated motion relationship.
The configuration of our sensor system for extrinsic calibration consist of four LiDARs and a single stereo camera, as shown in fig:system_configuration. Two 3D LiDARs are mounted on the left and right to maximize the range of data acquisition. The two 2D LiDARs are installed facing forward and backward, respectively. As the four LiDARs have overlapping regions, calibration was performed using the geometric information from LiDARs. However, as the stereo camera is facing forward, the images from the stereo camera possess no overlap with any of the LiDAR sensor data.
Ii Related Works
In this paper, we perform an extrinsic calibration between a stereo camera and LiDARs when no overlapping FOV is guaranteed. Literature sources related to our extrinsic calibration method can be divided into two major categories: target-less and calibration with non-overlapping configurations.
Ii-a Target-less Extrinsic Calibration
Target-less extrinsic calibration refers to the method of performing calibration using surrounding data from a general space without the use of a checkerboard or a specific shape target. Establishing correspondences between heterogeneous sensor data without a specific target is a highly challenging task.
Zhao et al.  manually selected the correspondence between LiDAR intensity image and camera image to calculate the extrinsic parameter. LiDAR intensity image was generated by fusing LiDAR scan data and odometry estimated using horizontal 2D LiDAR data. However, the selection process for correspondence between heterogeneous sensor data is a laborious task and potential source of human errors. In order to mitigate this issue, many researchers have attempted to calibrate without correspondence through the use of information theory. Taylor and Nieto  proposed to find the optimal alignment between sensors by introducing the NMI (NMI) metric. NMI is a measure of the mutual dependence of data from two sensors. This study yields the optimal solution by finding the largest NMI
value using the particle swarm optimization method. Similarly,Pandey et al.  used the MI (MI) metric to estimate the extrinsic parameter. Several local optimal points are encountered if a single scan is used for optimization. In order to cope with the problem, the method of this study used multiple scan data to find global optimal points with maximum MI value. Recently, with more studies
using deep learning, there have been attempts to perform extrinsic calibration using deep learningtechnique. Iyer et al.  used the CNN architecture to estimate calibration parameters using a supervised method.
Ii-B Extrinsic Calibration of Non-overlapping Configuration
The aforementioned target-less extrinsic calibration still requires an overlap between sensors. Several studies have focused on solving extrinsic calibration for configurations without any overlap between sensors. For example, Napier et al.  performed extrinsic calibration between push-broom 2D LiDAR and cameras. The push-broom LiDAR obtained distance measurement from the left-hand and the right-hand sides of the vehicle, and the cameras obtained images from the front. This means that there was no overlapping region between the two sensor data. Therefore, extrinsic calibration was performed using edge images from the camera image and LiDAR intensity image generated by projecting the 3D pointcloud that was calculated by applying the vehicle odometry measurement.
Scott et al.  leveraged the NID (NID) metric to estimate optimal extrinsic parameter between 2D LiDAR and multiple cameras. In the optimization process, they utilized factory-calibrated transformation as constraint. Scott et al.  added a data selection module by comparing the NID value around the optimal parameter to reduce the local optimal points.
Our proposed method utilizes the road marking information for the extrinsic calibration of non-overlapping configurations. The road are observed together through both sensors (camera and LiDAR) regardless of the acquisition time of the data. In order to minimize the local optimal point, multiple images were used and the images were selected using the costs defined with the vanishing point. In the optimization process, we defined multiple costs using image and LiDAR data which include NID cost, edge alignment cost and plane fitting cost to make it more convergent. This paper presents the following:
An automatic extrinsic calibration framework for non-overlapping configuration
Utilization of road regions estimated using stereo images for a more stable optimization process
An automatic image selection process using estimated vanishing point and road markings
A multiple cost function for robust optimization even with rough initial values
Iii Notation and Local Map Generation
We first generated a 3D pointcloud using odometry and LiDAR data and performed calibration by comparing the stereo images and LiDAR intensity image over the generated local pointcloud map. For local map generation, the odometry of the vehicle was calculated with high accuracy using the wheel encoder, the 3-axis FOG (FOG), and the IMU (IMU). Before describing the details of the extrinsic calibration, this paper first introduces notations and assumptions for generating the local map.
Matrix represents the rigid body transformation that registers the data defined in coordinate system to coordinate system. Matrix is also represented by a tuple , where . Here, and represent the relative translation along each axis in meters, and and represent the relative rotation that is roll, pitch, and yaw in radians. represents the pointcloud set from sensor in coordinate system . For instance, represents the pointcloud from the LiDAR at time in the coordinate system . Subscripts , and denote LiDAR, the left camera of stereo, the vehicle center and the global coordinate system.
Iii-B Local Map Generation
Due to the fact that no overlap is attained between the camera and the LiDAR at any given instance of time, we accumulated LiDAR data to generate a local pointcloud map. We assume that accurate relative transformation is calculable using a 3-axis FOG, an IMU and wheel encoder. We also began with pre-computed extrinsic calibration between LiDARs and LiDAR-to-vehicle using our previously proposed algorithm .
Given an accurate relative sequential pose and extrinsic calibration between the four LiDARs, we accumulated the LiDAR data and constructed a local pointcloud map written in the global coordinate system. The local pointcloud that was written in each sensor coordinate system at time was represented in the vehicle coordinate system using the computed vehicle-to-LiDAR coordinate system transform as described in (1). This was followed by the sequential transformation of pointcloud from each LiDAR sensor in the vehicle coordinate system into the global coordinate frame via (2). Lastly, the global pointcloud was obtained by adding of the all pointcloud of each LiDAR via (3).
Our method used the global pointcloud , the stereo image , and the odometry for stereo extrinsic calibration. Particular attention was required as the accuracy of the local pointcloud map depends on the accuracy of the relative vehicle pose estimation. In this study, we accumulated the local map for a range of approximately 80m to minimize the potential accumulation error.
The local pointcloud map was obtained at time when the LiDAR measurement was received. As the camera images were captured at different frequencies, we needed to consider the local pointcloud at time when the image is received. The pointcloud in the global coordinate system can be transformed into the vehicle coordinate system using the odometry information at time time at which the stereo image was acquired (4). The pointcloud in the vehicle coordinate system was subsequently registered into the stereo camera coordinate system using the extrinsic parameter of the stereo camera, which is what our method is aiming to calculate. (5). The LiDAR intensity image was generated by projecting the pointcloud onto stereo image plane (fig:main_figure) using the pre-calibrated intrinsic parameter (6).
Iv Target-less Multi-modal Extrinsic Calibration with No Overlap
With the constructed local map, the proposed method peforms two steps: (i) informative image selection and (ii) multiple cost optimization. In the image selection process, the road regions are extracted using a stereo disparity map, and images that include abundant road markings were selected for the optimization process by extracting the lines facing the vanishing point. The local optima issue can be alleviated by using multiple images to estimate extrinsic calibration. fig:whole_diagram shows the overall calibration process. The following sections describe each module in detail.
Iv-a Road Detection and Informative Image Selection
Images are marked as informative if abundant road markings are contained. In this section, we detect the road region within an image and propose the image utility measure for extrinsic calibration. The overall scheme is provided in fig:image_selection_diagram.
Iv-A1 Road Region Detection using Stereo Images
In the proposed method, we compute the v-disparity image  from a disparity map for a pixel to detect the road region. The histogram of the row line on the disparity map is depicted on the row of the v-disparity image, and sample v-disparity images are shown on the right-hand side of fig:road_extraction_process.
Following this process, the road region is detected by fitting a line on the v-disparity image using RANSAC (RANSAC). Due to the fact that each pixel value of the v-disparity image denotes the histogram of the disparity map, fitting a line captures the continuously changing disparity along the direction (i.e., ground) of the image. The road region is informed from pixels corresponding to the fitted line . The sample of the detected road region can be seen in fig:road_mask.
Iv-A2 Vanishing Point Estimation
Next, the vanishing point is estimated from the detected road region. The vanishing point allows for the discerning of road markings from other features such as the shadows of street lamps and trees. Given the line in the v-disparity image (i.e., the road plane), the value at which meets the - axis is identified. The horizontal line at this value is referred to as the , as depicted in fig:image_with_v_disparity.
The is near to the vanishing point but may not be exact. A voting process is additionally applied to estimate the vanishing point near the . After defining a voting region of size by around the central point of , the lines , which are voters participating in the vote, were detected on the road region using a LSD (LSD) . The symbol is used to indicate that the corresponding data was extracted from the road region. Each line , where and are the start, end and center point of each line respectively, votes depending on the weight over each pixel in the (7).
Here, refers to an angle between voter and the line connecting and in degrees, as illustrated in fig:vanishing_region. The voting weight of each line is the inverse of the angle to emphasize lines with smaller angles (with a maximum weight of 10). In this paper, threshold was set to . The vanishing point was determined as the largest value in the voting results (8), and denotes the corresponding confidence level. fig:vanishing_point_result shows the estimated vanishing point and line segments on the road.
Iv-A3 Image Selection
The utility of each image for image selection is calculated using , and (10). A higher represents a greater number of road markings, which in turn indicates better data for calibration. In order to suppress the effect of the local optimal point in the optimization process, multiple images and the corresponding pointcloud were selected. Based on the utility, five images with the highest utility are selected.
Iv-A4 Plane Estimation of the Road
Once the informative images are selected, the plane , is estimated from the disparity map . Conversions are made from in camera coordinate system to the 3D pointcloud using the camera model: the focal length , and base line . is subsequently calculated using a RANSAC plane fitting with the pointcloud. The plane is used in optimization process to calculate the plane fitting cost.
Iv-B Multi-cost Optimization
Through the aforementioned image selection process, a LiDAR intensity image, a road masked image, and a plane model were obtained. For the extrinsic calibration, we also extracted points belonging to roads. Specifically, the road pointcloud was calculated from the global pointcloud using the region growing segmentation method . For extrinsic calibration, this paper proposes using the following three costs.
Iv-B1 Edge Alignment Cost
The first cost, edge alignment cost, is used to evaluate the discrepancy between RGB and LiDAR intensity images. A camera edge image () is extracted from a road masked image (), and a LiDAR edge image () is computed from a LiDAR intensity image () by applying the canny edge algorithm .
In order to compare the two edge images by defining a differentiable cost function, RGB-induced edge image is converted into a distance transform image . The edge alignment cost was computed by calculating the sum of the multiplied pixel value of and (11). Defining the cost using the distance image aids the algorithm to converge with rough initial parameters. The edge cost reaches a minimum value when the two data precisely match one another. fig:estimate_distance_transform shows the process of calculating the edge alignment cost.
Iv-B2 NID Cost
The second cost is induced from the NID
metric, which measures the correlation between two random variables,and . In this paper, the NID cost between and was used to identify the best alignment. The NID cost was defined as:
Here, and are single entropy and cross entropy. The computed NID metric has the range of . Lower NID costs indicate a greater degree of similarity between the distributions of two data. In our case, the intensity value of LiDAR and the image pixel value in grayscale achieves the best alignment between two images with the lowest NID cost.
Iv-B3 Plane Fitting Cost
The final cost is computed using the plane model obtained from the stereo image and the road pointcloud acquired from the global 3D map. By including this geometric cost, convergence of the and the rotation parameter is improved.
In this cost, , and are the , , and values of point in the stereo camera coordinate system. The cost is summed over the number of the points () in the target pointcloud . Lower plane fitting costs are yielded when the pointcloud and road model closely match one another.
Optimization is performed using the weighted sum of the three costs: , and . In this paper, , , and are set as 2.0, 500.0, 0.1 respectively, considering the scale of each cost.
The extrinsic parameter of the stereo camera was estimated by identifying the minimum point of sum of the costs by utilizing the downhill simplex method (15).
V Experimental Results
This section presents experimental results to evaluate the defined cost in both a quantitative and qualitative manner. We also verified the repeatability and sensitivity of the extrinsic parameter between the stereo camera and LiDAR during the proposed extrinsic calibration.
V-a Evaluation of Cost Function
We started by depicting the cost function with respect to the perturbation from the optimal values. The variation of each cost and total cost are plotted in fig:cost by increasing the number of selected images. From the first row to the fourth row, up to four informative images were incrementally included in the information order, from the most informative to the fourth most informative image. For these cases, the cost was calculated using only the information on the road region.
As shown in the results, the proposed cost reshapes the function in all parameter spaces. The NID cost has a smooth curve shape compared to the edge cost, whereas the segment for converging to the global optimal point is relatively narrow. On the other hand, the edge cost graph is somewhat noisier but has a broader segment to converge to the global optimal point. This feature of the edge cost helps to converge to the global optimal point even with a large initial error of calibration. The shape of the edge cost graph in the y-direction (lateral direction) can be explained considering that the selected informative images tended to contain more road markings with repeated patterns (e.g., cross walks). However, note that the NID cost alleviated this repeated local minimum in the y-axis direction in the summed cost function.
The plane fitting cost affects z, roll, and pitch by exploiting the geometric road information, and mainly enhances the convergence in z and roll. The plane fitting cost depends on the accuracy of the disparity map. The computed disparity map may include errors due to pixel discontinuity, thus resulting in a less sharp shape at the optimal point for the optimal point. To incorporate the global tendency while hindering the local details of the plane cost, we set the weight of the plane fitting cost to be somewhat smaller. Overall, the sum of the cost shows a distribution of shapes that can converge toward the global optimal point in the overall section.
The graph in the last row reveals the effect of the image selection. The plot shows the cost variation without an image selection process, by selecting a image randomly. When an image is arbitrarily selected, the cost produces significant amount of noise for all the parameters and fails to provide a clear global optimal point regardless of the number of images. This proves that an informative image selection process significantly influences extrinsic parameter estimation.
V-B Repeatability of Parameter Optimization
Repeatability is the property of outputting the same result for different inputs through repetition of the algorithm. In order to test repeatability, the algorithm was executed 40 times for each number of images using random initial values within the range of 0.3m and for translation and rotation, respectively (fig:repeatability_teset). The top and bottom of the box plot refer to the and percentiles respectively, and the middle red line is the median of the error. The higher the number of images, the better the overall repeatability as a whole, and the repeatability is significantly improved when two or more images are used.
V-C Comparison with Available Ground Truth
It is challenging to calculate the ground truth of the actual calibration parameters for extrinsic calibration, especially when no overlap is guaranteed as in the case of our system. Therefore, in this paper, we used the baseline of the stereo camera as the ground truth. If the stereo camera is correctly calibrated, there should be a translation of the baseline in exactly the y-axis direction between the rectified images, and the rest of the translation and rotation should be zero.
The extrinsic parameters of each of the two rectified images were calculated using the proposed algorithm, and the difference between the two parameters was compared. tab:result_table shows the comparison with the ground truth. On average, there was an error of approximately 0.02m for translation and approximately for rotation. fig:quality_eval shows the image when the global pointcloud is projected onto the left stereo camera image using the finally computed extrinsic parameter. We can see that the pointcloud of the road and the structure is projected correctly onto the corresponding pixels.
Vi Conclusion and Future Works
This paper proposed an automatic extrinsic calibration method for estimating the rigid body transformation between a stereo camera and LiDAR with no overlapping FOV. By exploiting the static features in the urban environment such as road markings, it was confirmed that extrinsic parameters could be calculated with a degree of accuracy of approximately 0.02m and without manual operation by a human operator.
In this paper, we used the assumption that the odometry for generating poitncloud is locally accurate. However, in cases where the motion of a vehicle undergoes several accelerations and decelerations, there may be a distortions in the 3D map due to errors in motion estimation. Therefore, it is preferable to avoid making assumptions about the accuracy of the odometry to use the proposed method in all situations. Therefore, if the uncertainty of the odometry and the SLAM framework are included in the calibration framework, these two problems can be simultaneously solved together through the optimization process.
- Shin et al.  Y.-S. Shin, Y. S. Park, and A. Kim, “Direct visual SLAM using sparse depth for camera-lidar system,” in Proc. IEEE Intl. Conf. on Robot. and Automat., 2018, pp. 1–8.
- Zhang and Singh  J. Zhang and S. Singh, “Visual-lidar odometry and mapping: Low-drift, robust, and fast,” in Proc. IEEE Intl. Conf. on Robot. and Automat., 2015, pp. 2174–2181.
- Han et al.  X. Han, H. Wang, J. Lu, and C. Zhao, “Road detection based on the fusion of lidar and image data,” Intl. J. of Robot. Research, vol. 14, no. 6, pp. 1–10, 2017.
- Yousef et al.  K. M. A. Yousef, B. J. Mohd, K. Al-Widyan, and T. Hayajneh, “Extrinsic calibration of camera and 2D laser sensors without overlap,” IEEE Sensors J., vol. 17, no. 10, pp. 1–24, 2017.
- Zhou  L. Zhou, “A new minimal solution for the extrinsic calibration of a 2D lidar and a camera using three plane-line correspondences,” IEEE Sensors J., vol. 14, pp. 442–454, 2014.
- Krishnan and Saripalli  A. K. Krishnan and S. Saripalli, “Cross-calibration of RGB and thermal cameras with a lidar for rgb-depth-thermal mapping,” Unmanned Sys., vol. 5, no. 02, pp. 59–78, 2017.
- Peršić et al.  J. Peršić, I. Marković, and I. Petrović, “Extrinsic 6DoF calibration of 3D LiDAR and radar,” in European Conf. on Mobile Robots, 2017, pp. 1–6.
- Alismail et al.  H. Alismail, L. D. Baker, and B. Browning, “Automatic calibration of a range sensor and camera system,” in Proc. IEEE Intl. Conf. on 3D Imaging, Modeling, Processing, Visualization and Transmission, 2012, pp. 286–292.
- Taylor and Nieto  Z. Taylor and J. Nieto, “Automatic calibration of LiDAR and camera images using normalized mutual information,” in Proc. IEEE Intl. Conf. on Robot. and Automat., 2013.
- Pandey et al.  G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Automatic extrinsic calibration of vision and lidar by maximizing mutual information,” J. of Field Robot., vol. 32, no. 5, pp. 696–722, 2015.
Iyer et al. 
G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, “Calibnet: Geometrically supervised extrinsic calibration using 3D spatial transformer networks,” inProc. IEEE/RSJ Intl. Conf. on Intell. Robots and Sys., 2018, pp. 1110–1117.
- Zhao et al.  H. Zhao, Y. Chen, and R. Shibasaki, “An efficient extrinsic calibration of a multiple laser scanners and cameras’ sensor system on a mobile platform,” in Proc. IEEE Intell. Vehicle Symposium, 2007, pp. 422–427.
- Tamas and Kato  L. Tamas and Z. Kato, “Targetless calibration of a lidar-perspective camera pair,” in Proc. of the IEEE Int. Conf. on Comp. Vision Workshops, 2013, pp. 668–675.
- Scott et al.  T. Scott, A. A. Morye, P. Piniés, L. M. Paz, I. Posner, and P. Newman, “Exploiting known unknowns: Scene induced cross-calibration of lidar-stereo systems,” in Proc. IEEE/RSJ Intl. Conf. on Intell. Robots and Sys., 2015, pp. 3647–3653.
- Scott et al.  ——, “Choosing a time and place for calibration of lidar-camera systems,” in Proc. IEEE Intl. Conf. on Robot. and Automat., 2016, pp. 4349–4356.
- Napier et al.  A. Napier, P. Corke, and P. Newman, “Cross-calibration of push-broom 2D lidars and cameras in natural scenes,” in Proc. IEEE Intl. Conf. on Robot. and Automat., 2013, pp. 3679–3684.
- Taylor and Nieto  Z. Taylor and J. Nieto, “Motion-based calibration of multimodal sensor extrinsics and timing offset estimation,” IEEE Trans. Robot., vol. 32, no. 5, pp. 1215–1229, 2016.
- Andreff et al.  N. Andreff, R. Horaud, and B. Espiau, “Robot hand-eye calibration using structure-from-motion,” Intl. J. of Robot. Research, vol. 20, no. 3, pp. 228–248, 2001.
Heller et al. 
J. Heller, M. Havlena, A. Sugimoto, and T. Pajdla, “Structure-from-motion
based hand-eye calibration using l minimization,” in
Proc. IEEE Conf. on Comput. Vision and Pattern Recog., 2011, pp. 3497–3503.
- Jeong et al.  J. Jeong, Y. Cho, Y.-S. Shin, H. Roh, and A. Kim, “Complex urban LiDAR data set,” in Proc. IEEE Intl. Conf. on Robot. and Automat., Brisbane, May. 2018, pp. 6344–6351.
- Labayrade et al.  R. Labayrade, D. Aubert, and J.-P. Tarel, “Real time obstacle detection in stereovision on non flat road geometry through” v-disparity” representation,” in Proc. IEEE Intell. Vehicle Symposium, vol. 2. IEEE, 2002, pp. 646–651.
- Gioi et al.  R. G. V. Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “LSD: A fast line segment detector with a false detection control,” IEEE Trans. Pattern Analysis and Machine Intell., vol. 32, no. 4, pp. 722–732, 2010.
- Rabbani et al.  T. Rabbani, F. V. D. Heuvel, and G. Vosselmann, “Segmentation of point clouds using smoothness constraint,” Intl. archives of photogrammetry, remote sensing and spatial info. sci., vol. 36, no. 5, pp. 248–253, 2006.
- Canny  J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Analysis and Machine Intell., no. 6, pp. 679–698, 1986.