ROS package to find a rigid-body transformation between a LiDAR and a camera
With the advent of autonomous vehicles, LiDAR and cameras have become an indispensable combination of sensors. They both provide rich and complementary data which can be used by various algorithms and machine learning to sense and make vital inferences about the surroundings. We propose a novel pipeline and experimental setup to find accurate rigid-body transformation for extrinsically calibrating a LiDAR and a camera. The pipeling uses 3D-3D point correspondences in LiDAR and camera frame and gives a closed form solution. We further show the accuracy of the estimate by fusing point clouds from two stereo cameras which align perfectly with the rotation and translation estimated by our method, confirming the accuracy of our method's estimates both mathematically and visually. Taking our idea of extrinsic LiDAR-camera calibration forward, we demonstrate how two cameras with no overlapping field-of-view can also be calibrated extrinsically using 3D point correspondences. The code has been made available as open-source software in the form of a ROS package, more information about which can be sought here: https://github.com/ankitdhall/lidar_camera_calibration .READ FULL TEXT VIEW PDF
Recently, the rapid development of Solid-State LiDAR (SSL) enables low-c...
Most sensor setups for onboard autonomous perception are composed of LiD...
The homogeneous transformation between a LiDAR and monocular camera is
Calibration of devices with different modalities is a key problem in rob...
3D LiDARs and 2D cameras are increasingly being used alongside each othe...
Calibrating narrow field of view soccer cameras is challenging because t...
This paper presents a framework for the targetless extrinsic calibration...
ROS package to find a rigid-body transformation between a LiDAR and a camera
Robotic platforms, both autonomous and remote controlled, use multiple sensors such as IMUs, multiple cameras and range sensors. Each sensor provides data in a complementary modality. For instance, cameras provide rich color and feature information which can be used by state-of-the-art algorithms to detect objects of interest (pedestrians, cars, trees, etc.). Range sensors have gained a lot of popularity recently despite being more expensive and also contain moving parts. These can provide rich structural information and if correspondence can be drawn between the camera and the LiDAR, when a pedestrian is detected in an image, it’s exact 3D location can be estimated and be used by an autonomous car to avoid obstacles and prevent accidents. Multiple sensors are employed to provide redundant information which reduces the chance of having erroneous measurements. In the above cases, it is essential to obtain data from various sensors with respect to a single frame of reference so that data can be fused and redundancy can be leveraged. Marker based
as well as automatic calibration for LiDAR and cameras has been proposed but methods and experiments discussed in these use the high-density, more expensive LiDAR and do not extend very well when a lower-density LiDAR, such as the VLP-16 is used. We propose a very accurate and repeatable method to estimate extrinsic calibration parameters in the form of 6 degrees-of-freedom between a camera and a LiDAR.
The method we propose makes use of sensor data from a LiDAR and a camera. The intrinsic parameters of the camera should be known before starting the LiDAR-camera calibration process. The camera can only sense the environment directly in front of the lens unlike a LiDAR such as the Velodyne VLP-16, which has a 360-degree view of the scene, and the camera always faces the markers directly. Each time data was collected, the LiDAR and camera were kept at arbitrary distance in 3D space. The transformation between them was measured manually. Although, the tape measurement is crude, it serves as a sanity check for values obtained using various algorithms. Measuring translation is easier than rotation. When the rotations were minimal, we assumed them to be zero, in other instances, when there was considerable rotation in the orientation of the sensors, we measured distances and estimated the angles roughly using trigonometry.
Before working on our method that uses 3D-3D point correspondences, we tried methods that involved 2D-3D correspondences. We designed our own experimental setup to help calibrate a LiDAR and camera, first, using 2D-3D methods. The setup involves markers of a specific type: hollow rectangular cardboards. Even normal cardboards work fine, however, as we shall see in the upcoming discussion, provide less correspondences as opposed to a hollowed out rectangular cardboard.
This method involves finding the 6-DoF between the camera and the LiDAR by the means of matching 2D-3D point correspondences. 2D correspondences can be easily obtained by manually marking feature points in an image with an accuracy of 3-4 pixels. Obtaining corresponding 3D points is not that straight-forward. For one reason LiDARs does not give a high density point cloud and with increasing distance (away from the LiDAR center) the point cloud becomes more and more sparse. A planar cardboard can provide 4 corner points i.e. 4 point correspondences. In 3D these points are obtained by line-fitting followed by line-intersection and their 2D correspondences can be obtained by marking pixel co-ordinates. If a hollowed out rectangular cardboard is used, it provides 8 3D-2D point correspondences: 4 corners on the outer rectangle and 4 corners on the inner rectangle; doubling the correspondences, allowing for more data points with lesser number of boards. Such a setup allows to have enough data to run a RaNSaC version of PnP algorithms and also will help reduce noisy data, in general. We use rectangular (planar cardboard) markers. If in the experimental setup, the markers are kept with one of their sides parallel to the ground, due to the horizontal nature of the LiDAR’s scan lines one can obtain the vertical edges, but not necessarily the horizontal ones. To overcome this, we tilt the board to make approximately 45 degrees between one of the edges and the ground plane. With such a setup we always obtain points on all four edges of the board. RanSaC is used to fit lines on the points from the LiDAR.
The most prominent feature on the marker is the corner. It can be marked with relative easy on the image and since we have quite accurate line equations for the four edges, their intersection is calculated in 3D. Again, these lines may not actually intersect, but come very close. We approximate the corner to the midpoint on the shortest line-segment between the two lines. As, a check, that this point is indeed a very close approximation to the actual corner, we calculated the length of the shortest line-segment. Also, since we know the dimensions of the cardboard marker, the length of the opposite sides should be very close to each other and also to the actual length measure by tape. We consistently observed that the distance between two line-segments was of the order meters and the error between the edge lengths was about 1 centimeter on average. Collecting data over multiple experiments, we observed that the edges are extremely close and the corner and intersection are at most off by 0.68 mm on average. An average absolute deviation of 1cm is observed between the expected and estimated edge lengths of the cardboard markers. With the above two observations one can conclude that the intersections are indeed a very accurate approximation of the corner in 3D.
Using hollowed out markers, we obtained 20 corner points: 2 hollow rectangular markers (8+8 points) and one solid rectangular marker (4 points) increasing the number of point correspondences from our initial experiments. Perspective n-Point (PnP) finds the rigid-body transformation between a set of 2D-3D correspondences. Equation 1 shows how the 3D points are projected after applying the which is estimated by PnP. Equation 2 represents the general cost-function to solve such a problem.
is the projection operation from 3D to 2D on the image plane
represents points in 3D
represents points in 2D
) the points (by visualizing the outliers) we were able to lower the back-projection error to 1.88 pixels on average. However, one did not observe theclose to the values measure by measuring tape between the camera and LiDAR.
|Expected Value||# of point correspondences|
|(by tape)||20 points||11 points||8 points|
|X (in m)||0||0.0094||0.0356||0.0776|
|Y (in m)||-0.51||-0.6174||-0.6142||-0.5348|
|Z (in m)||-0.755||-0.7101||-0.7420||-0.7447|
|Rot x (in deg)||-||-5.4194||-5.4547||-3.5616|
|Rot y (in deg)||-||-0.7949||-1.6225||-2.5203|
|Rot z (in deg)||-||-0.5163||-0.5538||-0.7039|
|Back projection error (in pixels)||-||4.8211||2.9476||1.8836|
In a previous experiment, when the LiDAR and camera were quite close (12cm apart) we ran the E-PnP with 12 points and did not obtain the expected values. We observed that we got an error of 10cm and if our expected value is around that measure of granularity then we can expect to obtain noisy values. In subsequent experiments the the camera and LiDAR were kept even farther apart so that the influence of any error is mitigated. While examining the data, we found that there were some noisy data points who were contributing to a large back projection error. We ran a modified E-PnP with a RanSaC algorithm on top. This would in theory ensure that noisy data is not considered while calculating the rigid body transformation between the camera and the LIDAR. RanSaC selects a random subset of the data, fits a model, estimates data points that are inliers to the fitted model (given a threshold ) and then fits the model on the inliers. This is repeated multiple times to try and exhaust large number of possible configurations. In this case, a subset of 2D-3D point correspondences are used to find using E-PnP, inliers are found and new are estimated. The back projection error was less than a pixel but the we obtained was far from what we were expecting through manual tape measurements. This could mean that minimizing back projection error may not be a holistic measure in our scenario and we may have to use a better metric which relates to in a more explicit manner. In the data we collected we also introduced slight rotation. The expected rotation were calculated using trigonometry.
|rot_data||Expected Value||E-PnP||RanSaC E-PnP|
|(by tape)||20 points||20 points|
|X (in m)||0.785||1.0346||1.0422|
|Y (in m)||-0.52||-0.557||-0.5883|
|Z (in m)||-0.69||-0.393||-0.4078|
|Rot x (in deg)||unmeasurable||-5.82||-6.613|
|Rot y (in deg)||-15||-17.54||-17.703|
|Rot z (in deg)||2.6||3.27||3.126|
|Back projection error (in pixels)||-||4.35||0.5759|
The 2D-3D correspondences method did not seem to work very well in our experimental setup. Error could have crept up due to not-so-accurate marking of 2D points (in pixels by looking at the image) or noisy data points to perform PnP. The back-projection error seemed to get minimized, however, the transformation values did not seem to be in agreement with the values measured by tape. Setups being used in real-time require extrinsic calibration to be quite accurate and produce minimal error. Fusion is one way to visualize the accuracy of the extrinsic calibration parameters. Bad calibration can result in fused data to have hallucinations in the form of duplication of objects in the fused point clouds due to bad alignment. One such application requiring real-time fusion from multiple sensors is autonomous driving. Bad calibration can result in erroneous fused data, which can be fatal for the car as well as nearby cars, pedestrians and property. This part of involves using augmented-reality (AR) tags and the LiDAR point cloud to find the extrinsic calibration parameters. Multiple versions of AR tags have been released by the open-source community  . The method proposed here uses the ArUco tags .
To find the transformation between the camera and Velodyne, we need two sets of 3D points: one in the camera frame and another in the Velodyne frame. Once, these point correspondences are found, one can try to solve for between the two sensors.
Most types of calibration employ markers, dimensions, shape and specific features of which depend on the application and type of calibration being performed. Checkerboards are the most common type of markers, generally used to estimate the intrinsic calibration parameters of a camera.  uses special markers with circular cut-outs for calibrating a LiDAR and a camera. We have devised a pipeline which uses cost-effective markers that can be constructed easily with just a planar surface such as a cardboard and an A4 sheet of paper. The design of the marker was driven keeping in mind that it should be able to provide features/correspondences that are easy to detect, both, in the camera frame as well as the LiDAR frame.
The rectangular cardboard can be of any arbitrary size. The experiments we performed used a Velodyne VLP-16 which has only 16 rings in a single scan, a handful as compared to higher density LiDARs (32 and 64 rings per scan). For a low density LiDAR, if the dimensions of the board are small and the LiDAR is kept farther than a specific distance, the number of rings hitting the board become low (2 to 3 rings resulting in only 2 to 3 points on an edge) , making it very difficult to fit lines on the edges (using RanSaC). The boards used in the experiments had length/breadth ranging between 45.0-55.0 centimeters. Keeping the LiDAR about 2.0 meters away from board with these dimensions, enough points were registered on the board edges to fit lines, calculate intersections and run the whole pipeline smoothly. It is recommended that before you run the pipeline, ensure that there are considerable number of points on the edge of the boards in the pointcloud. Any planar surface can be used: cardboards, wood or acrylic sheets. Cardboards are light-weight and can be hung easily.
The ArUco markers are special encoded patterns that facilitate the detection and error correction of the tags themselves. More details about how they work can be found here .
The tags are stuck on a planar surface such as a rectangular cardboard. If the dimensions of the cardboard (on which the ArUco tags are stuck) and the location of the ArUco marker is known, the location of the corners (from the center of the ArUco marker) can be easily calculated. The tags provide between the camera and the center of the marker. This transform can be used to convert corner points from the marker’s frame-of-reference (which is the cardboard plane with the origin being the center of the ArUcO marker) to the camera’s frame-of-reference. This allows to obtain the corners as 3D points in the camera frame. We used ZED stereo camera .
Points in the LiDAR can be found by detecting edges of the cardboard, which in turn can be solved for corners in a similar fashion described in Section 3.
The values of transformations obtained using ArUco markers, especially the translation was quite accurate and close to values measured by tape between the camera and the center of each marker. Once the two sets of point correspondences are obtained, between their co-ordinate frames can be estimated using the Iterative Closest Point (ICP) algorithm . The ICP tries to minimize the error in 3D and is given by equation 4.
The general ICP algorithm considers the closest points in both the point clouds as correspondences (there are other variants of choosing the closest points), following which, it finds the which best align the two point clouds by minimizing the euclidean distance between corresponding points. Finding the right correspondences can be tricky and may lead to an undesired solution. Since, in our proposed method the point correspondences are known, the corners of the marker in this case, a closed form solution exists. The Kabsch algorithm  finds the rotation between two point clouds and the translation can be found once the co-ordinate frames are aligned. Using the same arguments as in . First, we assume that the rotation is known and solve for the translation between the two point clouds, and .
then, the objective becomes,
using properties of the trace of a matrix, the above equation can be simplified as,
since, the is an orthonormal matrix, it preserves lengths i.e. ,
re-writing the objective function by eliminating terms that do not involve ,
substituting the value of and using property of the trace,
using SVD on ,
let, , then,
is a product of orthonormal matrices and is an orthonormal matrix as well with
. The length of each column vector inis equal to one and each component of a vector is less than or equal to one. Now, to maximize the above equation, let each , forcing the remaining components of the vector to zero to satisfy the unit vector constraint. Thus,
, an identity matrix.
to ensure, that is a proper rotation matrix, i.e. , we need to make sure that . If the obtained from equation 18 has we need to find such that takes the second largest value possible.
the second largest value of the term in equation 19 occurs when and . Taking into account the above,
where is a correction matrix,
In our initial experiments, we observed that even in a closed room where the boards are as stationary as can be, the pointcloud visualized in Rviz shows that the points (from the LiDAR), on the contrary, are not stationary and there is a small amount of position shift between two instants. To reduce any noise that might creep up we further propose to collect multiple samples of rotations and translations (using the method discussed above). Rotations and translation estimated over multiple runs can be used to obtain a more accurate and less noisy rigid-body transformation that transforms points from the LiDAR frame to the camera frame. Multiple sensor data, is collected over iterations, keeping the positions of the LiDAR and camera fixed. From each one of the runs we estimate the rotation and translation. We can average the observed translation vectors,
where and is the average translation between the two sensors. Taking average of rotation matrices is not very straight-forward, so we transform them to quaternions, compute the average quaternion in and then convert it back to rotation matrix.
where is a quaternion representation of rotation matrix obtained in the th run and is the average rotation between the two sensors represented by a unit-quaternion. The results of averaging in three separate configurations for LiDAR and camera positions can be seen in figure 9.
|(a) Sensor Configuration 1||(b) Sensor Configuration 2|
|(c) Sensor Configuration 3|
Averaging multiple estimates while keeping the position of the relative positions of the LiDAR and camera fixed helps to reduce noisy data due to imperfect marker edges and errors that might be introduced due to LiDAR points being slightly inaccurate. Also, if there is a modest amount of motion of the cardboards, we effectively observe many data points around the actual translation and rotation, averaging which shall give a better estimate of the rigid-body transformation between the LiDAR and camera.
Our main objective was to accurately calibrate multiple cameras that may not have an overlapping field-of-view. If a set of transformations can be estimated that transform all sensor data to a single frame of reference, data, such as point clouds from stereo cameras can be fused. A setup with multiple stereo cameras facing in different directions, one can obtain a point cloud that provides a 360-degree field-of-view by fusing individual point clouds from each stereo camera.
|(a) from tape measurement||(b) from our method|
To do this, we introduced the LiDAR which has a 360-degree field-of-view and very precise 3D point co-ordinates can be obtained with it. We use it to find transformations between the cameras. Once, this is done, we can remove the LiDAR; effectively using the LiDAR only for calibrating the cameras. It is to be noted that the method described in this document calibrates a monocular camera and a LiDAR. If there is a stereo camera, we only calibrate the left camera and the LiDAR. Since, the baseline and stereo camera calibration parameters are already known, calibrating only one of the cameras (left in our case) is sufficient to fuse the point clouds. If we can find a transformation between two cameras and , we can easily extend the same procedure to obtain transformation between arbitrary number of cameras. Given, two (stereo) cameras, the proposed pipeline finds the transformation that transforms all points in the LiDAR frame to the camera frame. We first run the algorithm, with and LiDAR, , and obtain a matrix,
We then run the algorithm, with and LiDAR, , and obtain a matrix,
Now, to obtain a transform that transforms all points in to , we chain the transforms, and ,
Equation 27, finds the transform between to , and if these are stereo cameras, we can obtain point clouds and fuse them using this transform. If the transform is very accurate, the two point clouds (from the two stereo cameras) will align properly. However, if there is translation error, when viewing the fused point cloud, hallucinations of objects will be clearly visible, and there will be two of everything. If there is error in the rotation, the points in the two clouds will diverge more and more as the distance from the origin increases. To verify the method in a more intuitive manner, lidar_camera_calibration was used to fuse point clouds obtained from two stereo cameras. We also provide the visualization of the fused point clouds.
First, we compare the calibration parameters obtained from our method against meticulously measured values using tape by a human.
The fused point cloud obtained when using manual measurements versus when using the method proposed in this document is shown in the video. Notice the large translation error, even when the two cameras are kept on a planar surface. Hallucinations of markers, cupboards and carton box (in the background) can be seen as a result of the two point clouds not being aligned properly. On the other hand, rotation and translation estimated by our package almost perfectly fuses the two individual point clouds. There is a very minute translation error (1-2cm) and almost no rotation error. The fused point cloud is aligned so properly, that one might actually believe that it is a single point cloud, but it actually consists of 2 clouds fused using extrinsic transformation between their sources (the stereo cameras). The resultant fused point clouds from both manual and lidar_camera_calibration methods can be seen on https://youtu.be/AbjRDtHLdz0.
We also wanted to see the potential of this method and used it to calibrate cameras kept at about 80 degrees and almost no overlapping field-of-view. In principle, with a properly designed experimental setup our method can calibrate cameras with zero overlapping field of view. However, to visualize the fusion, we needed a part to be common in both point clouds. We chose a large checkerboard to be seen in both cameras’ field-of-view, since it can be used to see how well the point clouds have aligned and if the dimensions of the checkerboard squares are known, one can even estimate the translation errors.
There is very less translation error, about 3-4 cm. Also, the ground planes align properly, at all distances, near and far from the camera, implying that the rotations estimated are correct.
The resultant fused point clouds after extrinsic calibration of stereo cameras kept at approximately 80 degrees using our method can be seen on
https://youtu.be/Om1SFPAZ5Lc. We believe, that better intrinsic calibration of the cameras can help drive down the error to about 1 centimeter or even less.
As a sanity check, a coarse translation was manually measured using a measuring tape. Rotations however were difficult to measure and their coarse values are omitted. The tabulations below compare the results obtained by using off-the-shelf ICP algorithms and an implementation of the Kabsch algorithm which exploits the information about known correspondences. The Kabsch algorithm repeatedly gives values close to the measurements and the root mean square error (RMSE) is also quite low. Datasets were collected in varying camera and LiDAR configurations with assorted rotations and translations to verify repeatability and accuracy of the proposed method. Separate experiments were performed with a different camera (Point Gray Black Fly) which has a very large focal length as compared to zed stereo camera and the results showed similar accuracy and confirmed robustness of the method proposed.
|Dataset 1||Tape measurement||ICP||Kabsch|
|X (in m)||-0.38 to -0.41||-0.3837||-0.4249||-0.4249|
|Y (in m)||0.04 to -0.06||0.0895||0.0592||0.0591|
|Z (in m)||0.14 to 0.16||0.1412||0.1399||0.1399|
|Roll (in deg)||unmeasurable||-||1.5943||1.5957|
|Pitch (in deg)||unmeasurable||-||1.4166||1.4166|
|Yaw (in deg)||unmeasurable||-||-1.0609||-1.0635|
|RMSE (in m)||-||-||0.0263||0.0240|
|Dataset 2||Tape measurement||ICP||Kabsch|
|X (in m)||-0.29 to -0.31||-0.2725||-0.3126||-0.3126|
|Y (in m)||-0.25 to -0.27||-0.0188||-0.2366||-0.2367|
|Z (in m)||0.09 to -0.11||0.1152||0.1195||0.1196|
|Roll (in deg)||unmeasurable||-||-0.3313||-0.3326|
|Pitch (in deg)||unmeasurable||-||1.6615||1.6625|
|Yaw (in deg)||unmeasurable||-||-9.1813||-9.1854|
|RMSE (in m)||-||-||0.0181||0.0225|
|Dataset 3||Tape measurement||ICP||Kabsch|
|X (in m)||-0.48 to -0.52||-0.5762||-0.2562||-0.5241|
|Y (in m)||0.36 to 0.39||0.4583||0.1480||0.3282|
|Z (in m)||0.07 to 0.09||0.1060||0.0852||0.0565|
|Roll (in deg)||unmeasurable||-||-6.6721||0.4016|
|Pitch (in deg)||unmeasurable||-||-2.3090||-2.2246|
|Yaw (in deg)||unmeasurable||-||-4.5873||-5.3311|
|RMSE (in m)||-||-||0.2476||0.0203|
The code is written in C++ and is implemented as a ROS package. It can be found at http://wiki.ros.org/lidar_camera_calibration. A comprehensive readme file is also available for setting up and getting started with the package is available on the GitHub repo at https://github.com/ankitdhall/lidar_camera_calibration.
We proposed a novel pipeline to perform accurate LiDAR-camera extrinsic calibration using 3D-3D point correspondences. An experimental setup to find correspondences in each sensor’s frame: the camera and the LiDAR. The proposed pipeline uses tags that can be easily printed and stuck on planar surfaces such as cardboards or wooden planks. A point extraction pipeline was implemented to obtain corner points of the cardboards from the pointcloud recorded using the LiDAR. The two sets of point correspondences are used to solve for the , which gives accurate and repeatable results with different cameras. As opposed to ICP which relies on matching point correspondences, our method, with relatively less number of points and correct correspondences is able to estimate transformation optimally. The method’s consistency is further improved by averaging over multiple results. We also showed how this method could be used to extrinsically calibrate two or more cameras, even when they do not have any overlapping field-of-view. We also successfully demonstrated visually, the quality of the calibration by fusing point clouds and almost perfectly aligning them. An open-source implementation is available in the form of a ROS package.
. International Journal Computer Vision, 2009