Data driven method for automatic extrinsic calibration between camera and radar sensors
Most intelligent transportation systems use a combination of radar sensors and cameras for robust vehicle perception. The calibration of these heterogeneous sensor types in an automatic fashion during system operation is challenging due to differing physical measurement principles and the high sparsity of traffic radars. We propose - to the best of our knowledge - the first data-driven method for automatic rotational radar-camera calibration without dedicated calibration targets. Our approach is based on a coarse and a fine convolutional neural network. We employ a boosting-inspired training algorithm, where we train the fine network on the residual error of the coarse network. Due to the unavailability of public datasets combining radar and camera measurements, we recorded our own real-world data. We demonstrate that our method is able to reach precise and robust sensor registration and show its generalization capabilities to different sensor alignments and perspectives.
Modern intelligent transportation systems (ITS) use many redundant sensors to obtain a robust estimate of their perceived environment. By using sensors of different modalities, the system can compensate for the weaknesses of one sensor type with the strengths of another. Especially in the field of traffic surveillance and ITS, the combination of cameras and radar sensors is common practice [1, 2, 3]. Reliably fusing measurements from such sensors requires precise spatial registration, which is necessary to construct a consistent environment model. A precise sensor registration can be achieved by an extrinsic calibration that yields the correct transformation between the reference frames of the sensors in relation to each other and the world. Fig. 1 demonstrates the effects of an accurate sensor calibration. The upper image shows uncalibrated sensors, where the projected radar detections do not align with the vehicles. After the extrinsic calibration each detection overlays its corresponding object in the image.
Manual sensor calibration is a tedious and expensive process. Especially in multi-sensor systems, automatic calibration is crucial to handle the growing number of redundant sensors. Here manual calibration does not scale. Additionally, this technique is infeasible for automatic online recalibration, which is necessary to account for changes to the sensor system. These decalibrations occur frequently in real world applications, for example due to vibrations, wear and tear of the sensor mounting, or changing weather conditions. Furthermore, these calibration methods should be independent of explicitly provided calibration targets, as their installation into such systems or their observed scenes is impractical and would suffer from deterioration as well.
Calibrating systems with cameras and radar sensors is challenging due to different physical measurement principles and the high sparsity of radar detections. For the calibration without specific targets, a complex association problem between the sensors’ measurements must be solved. As traffic radars do not provide visual features, such as edges or corners to easily associate detections with vehicles in the camera image, this association problem must be solved solely based on the relative spatial alignment and estimated distance measures between the vehicles.
In this paper we present – to the best of our knowledge – the first method for the automatic calibration of radar and camera sensors without explicit calibration targets. We focus on the rotational calibration between sensors because of its high influence on the spatial registration between cameras and radar sensors in ITS, especially for large observation distances. On the other hand, the projective error caused by translational miscalibrations in the centimeter range is low and easy to minimize in static scenarios by measuring with modern laser distance meters. To solve the problem of rotational auto-calibration, we propose a two-stream CNN that is trainable in an end-to-end fashion. We employ a boosting-inspired training algorithm, where we first train a coarse model and afterwards transform the training data with its correction estimates to train a fine model on the residual rotational errors. We evaluate our approach on real-world data, recorded on the German highway A9 and show that it is able to achieve precise sensor calibration. Furthermore, we demonstrate the generalization capability of our approach by applying it to a previously unobserved perspective.
Much research has been done on calibrating multi-sensor systems with homogeneous sensors, e.g. camera to camera, resulting in various state-of-the-art target-based and targetless calibration methods. However, heterogeneous sensors must share detectable features in the observed scene. Correlating features of different modality can be challenging due to different physical measurement principles. While camera images provide dense data in form of pixels, lidar and even more so traffic radar sensors only record sparse depth data.
Classic approaches for the calibration of camera and laser-based depth sensors use planar checkerboards as dedicated calibration targets [4, 5]. These techniques achieve very precise estimates of the relative sensor poses, but require prepared calibration scenes. Approaches without physical targets calibrate the sensors by matching features of the natural scenes. In manual methods a human has to pair the corresponding features in the image and depth data by hand. Classic automatic approaches are limited in the decalibration ranges they can handle and in their parameter extraction [7, 8]. These drawbacks restrict the application scope of those approaches and prevent them from being used for automatic calibration during system operation.
Recently, the task of sensor calibration has been approached using deep learning techniques. Early applications of CNNs in this field focused on camera relocalization. With RegNet, Schneider et al. presented the first CNN for camera and lidar registration, which performs feature extraction, matching, and pose regression in an end-to-end fashion on an automotive sensor setup. Their method is able to estimate the extrinsic parameters and to compensate decalibrations online during operation. To refine their result the authors use multiple networks trained on datasets with different calibration margins. In contrast, we do not need to define calibration margins for our networks, as our second network specializes in the errors of the first network by design. Liu et al. apply this method to the calibration of three sensors by first fusing a depth camera and a lidar that were calibrated with RegNet, and then using the resulting dense point cloud for the calibration to a camera. Iyer et al. propose CalibNet, which they train with a geometric and photometric consistency loss of the input images and point clouds, rather than the explicit calibration parameters. Due to difficulties in estimating translation parameters in a single run, they first estimate the rotation and use it to correct the depth map alignment. Then they feed the corrected depth map back into the network to predict the correct translation. However, in contrast to our approach these methods use lidar sensors with relatively dense point clouds compared to the measurements of traffic radars.
The measurement characteristics of radar sensors cause a lack of calibration methods without dedicated targets for radar-based systems. Traffic radars output preprocessed measurement data in the form of detected objects. Due to missing descriptive features in the radar data and its sparsity, spatial calibration is challenging. Additionally, measurement noise, missing object detections, and false positives amplify this problem. Therefore, existing approaches for the calibration of multi-sensor systems with radars rely on dedicated targets, for example corner reflectors or plates based on conductive material that ensure a reliable detection by radar sensors [13, 14]. Recently, these calibration concepts were extended towards the combination of radars with other sensor types. Especially the calibration with cameras is challenging, as the sensors do not share common features such as edges, shapes or depth. Natour et al. calibrate a radar-camera setup by optimizing a non-linear criterion, obtained from a single measurement with multiple targets and known inter-target distances. However, the targets in the radar and image data are extracted and matched manually. Peršić et al. designed a triangular target to calibrate a 3D lidar and an automotive radar. They experienced variable error margins in the estimated parameters due to the sparse and noisy radar data and the geometric properties of their sensor setup. As a result, an additional optimization step using a priori knowledge of the specified radar field-of-view refines these estimated parameters. Song et al. use a radar-detectable augmented reality marker for a traffic surveillance system based on a 2D radar and camera, enabling an analytic solution of the paired measurements. However, there is a lack of approaches for automatic and targetless radar-camera calibration, which we address in this work.
The calibration of a radar and a camera means estimating the transformation that projects the radar detections into the camera image, such that each detection spatially aligns with its corresponding object in the image. As we use a traffic radar, the detected objects are vehicles as shown in Fig. 1. However, our approach is not limited to the traffic domain. The described projection of detections into the image can be computed by

$$z_c \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = P \, T \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \tag{1}$$

where $(x, y, z)^\top$ is the position of a detected vehicle in the radar coordinate system, $(u, v)^\top$ are its corresponding pixel coordinates and $z_c$ the straight-line distance of the detected vehicle to the image plane of the camera, i.e. the depth of the projected pixel. The projection matrix $P$ is based on the intrinsic camera parameters and $T$ is the extrinsic calibration matrix. The latter represents the camera pose relative to the radar and is defined as

$$T = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}, \tag{2}$$

with $R$ being the rotational and $t$ the translational component. While $P$ can be estimated in a controlled calibration setting prior to deploying the sensors, $T$ must be determined after deployment in situ. In our work we focus on computing the rotational component $R$ because, compared to the translational component $t$, it is hard to measure and has a high impact on the quality of the inter-sensor registration, especially with greater observation distance.

Our goal is to estimate the transformation $T$, describing the true relative pose between the two sensors. This estimate is obtained if the projected radar measurements match the vehicles in the image. Assuming an initially incorrect calibration $T_{\text{decalib}}$, we need to determine the present decalibration $\phi_{\text{decalib}}$ that quantifies the error between $T$ and $T_{\text{decalib}}$ and thus

$$T_{\text{decalib}} = \phi_{\text{decalib}} \, T. \tag{3}$$

In fact, we estimate $\phi_{\text{decalib}}^{-1}$ since it can be directly applied to the decalibrated extrinsic sensor parameters by

$$T = \phi_{\text{decalib}}^{-1} \, T_{\text{decalib}} \tag{4}$$

to obtain the corrected, calibrated result.
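The projection and the correction step can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with a 3 x 4 projection matrix `P`, a 4 x 4 homogeneous extrinsic matrix `T`, and a decalibration that acts on the extrinsics by left-multiplication. The intrinsics, the extrinsics, the 2 degree tilt and the detection positions are all made-up values for illustration, and the helper names are hypothetical.

```python
import numpy as np

def make_transform(R, t):
    """Assemble the 4x4 homogeneous matrix [[R, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_x(angle_rad):
    """Rotation about the x-axis (used here as a stand-in for camera tilt)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def project(P, T, points_radar):
    """Map N x 3 radar positions to (N x 2 pixel coordinates, N depths)."""
    n = len(points_radar)
    homogeneous = np.hstack([points_radar, np.ones((n, 1))])  # N x 4
    scaled = (P @ T @ homogeneous.T).T                        # rows: depth * (u, v, 1)
    depth = scaled[:, 2]
    return scaled[:, :2] / depth[:, None], depth

# Hypothetical intrinsics: 1000 px focal length, principal point (960, 600).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 600.0],
              [0.0, 0.0, 1.0]])
P = np.hstack([K, np.zeros((3, 1))])  # 3 x 4 projection matrix

# Hypothetical true extrinsics: radar axes aligned with the camera, small offset.
T_true = make_transform(np.eye(3), np.array([0.0, 0.2, 0.0]))

# A 2 degree tilt decalibration, applied by left-multiplication (assumption).
phi_decalib = make_transform(rot_x(np.deg2rad(2.0)), np.zeros(3))
T_decalib = phi_decalib @ T_true

detections = np.array([[0.0, 0.0, 50.0], [3.5, 1.0, 120.0]])  # metres, made up
pixels_true, depths = project(P, T_true, detections)
pixels_decalib, _ = project(P, T_decalib, detections)

# Applying the inverse decalibration restores the true extrinsics.
T_corrected = np.linalg.inv(phi_decalib) @ T_decalib
pixels_corrected, _ = project(P, T_corrected, detections)
```

In this toy setup the small tilt error already moves the projected detections by tens of pixels, while `pixels_corrected` coincides with `pixels_true`. This is the misalignment that the networks are trained to undo.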
Our objective is to regress the relative orientation of a camera with respect to a radar sensor. To achieve this, an association problem between the radar detections and the vehicles in the camera image must be solved. This is a difficult problem, as radar detections do not contain descriptive features. A neural network can learn how to solve this association problem based on the spatial alignment between the projected radar detections and the vehicles in the image.
Our approach leverages two convolutional neural networks, where we train the first coarse network on the initially decalibrated data and then a fine network on its residual error. Both models share the same architecture, loss and hyperparameters. In this section we first explain the model and then the training process in detail.
Our model is built as a two-stream neural network, consisting of an rgb-input and a radar-input as shown in Fig. 2. It outputs a transformation to correct the rotational error of the calibration between respective camera and radar as a quaternion.
The rgb-input is a camera image and the radar-input is a sparse matrix with radar projections. The image is standardized and resized to a fixed input resolution. It is propagated through the rgb stream of our network, which starts with a cropped MobileNet with width multiplier 1.0. We crop the MobileNet after the third depthwise convolution layer (conv_dw_3_relu) to extract low-level features, while preserving spatial information. We use the weights pre-trained on ImageNet, but include the layers for fine-tuning in further training. The MobileNet is followed by two MlpConv layers, each consisting of a 2D convolution followed by two further convolutions, with 16 filter maps in each component. The task of the rgb stream is to detect vehicles and to estimate where radar detections will occur.
The radar-stream receives the projected radar detections with the same resolution as the camera image as input. Each projection occupies one cell in the sparse matrix, storing the inverse depth of respective projected detection, as proposed by . We apply a max-pooling to reduce the input dimension to a feasible size. We do not use convolutions in the radar stream to retain the sparse radar information.
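The radar input described above can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the function names, the rounding to the nearest pixel, the rule for detections that share a cell, and the pooling size `k` are assumptions, since the text does not specify them.

```python
import numpy as np

def radar_to_sparse(pixels, depths, height, width):
    """Write the inverse depth of each projected detection into a zero matrix."""
    grid = np.zeros((height, width))
    cols = np.round(pixels[:, 0]).astype(int)
    rows = np.round(pixels[:, 1]).astype(int)
    inside = ((rows >= 0) & (rows < height) &
              (cols >= 0) & (cols < width) & (depths > 0))
    # Assumption: if several detections fall into one cell, keep the nearest
    # one, i.e. the largest inverse depth.
    np.maximum.at(grid, (rows[inside], cols[inside]), 1.0 / depths[inside])
    return grid

def max_pool(grid, k):
    """Non-overlapping k x k max pooling; incomplete border windows are dropped."""
    h, w = grid.shape[0] // k, grid.shape[1] // k
    return grid[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

# Toy example: two detections share a cell, a third lies outside the image.
pixels = np.array([[10.2, 5.7], [10.4, 5.6], [500.0, 5.0]])
depths = np.array([50.0, 25.0, 10.0])
grid = radar_to_sparse(pixels, depths, height=8, width=16)
pooled = max_pool(grid, k=4)
```

Max pooling keeps every occupied cell's value alive in the downsampled matrix, whereas average pooling or a strided convolution would dilute the few non-zero entries.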
We embed each stream into a 50-dimensional latent representation using a fully-connected layer. This latent representation contains dense information and reduces the dimensionality. The following regression block consists of three layers with 512, 256 and 4 neurons. Between the first two layers we apply dropout regularization. The four output neurons correspond to a quaternion describing the calibration correction. We use linear activations for the final output layer, and PReLU activations everywhere else, except in the MobileNet block. This empirically led to better performance compared to classic ReLU activations. The regression block estimates the rotational correction that solves the misalignment between the camera image and the radar detections.
We used the Euclidean distance as the loss function,

$$\mathcal{L} = \lVert q - \hat{q} \rVert_2,$$

between the true quaternion $q$ and the predicted quaternion $\hat{q}$ that represents the estimated correction of the decalibration.
Using the Euclidean distance as a distance measure to define a rotational loss function over quaternions is common practice [9, 10]. Since this metric is ambiguous and can lead to different errors for similar rotations, we also evaluated the training performance using a geodesic quaternion metric. We added a length error term with an empirically determined weight. Without this additional length term the network's output diverges and the learning plateaus. Despite its theoretical superiority, this loss resulted in similar performance, so we used the Euclidean distance to avoid an additional hyperparameter.
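The ambiguity of the Euclidean distance can be made concrete with a small sketch. The first function is the Euclidean loss on quaternions in `(w, x, y, z)` order. The second is a widely used angle formula for unit quaternions, shown only for comparison; it is not necessarily the exact geodesic metric evaluated above, and both function names are hypothetical.

```python
import numpy as np

def euclidean_quaternion_loss(q_true, q_pred):
    """Euclidean distance between two quaternions given as (w, x, y, z)."""
    diff = np.asarray(q_true, dtype=float) - np.asarray(q_pred, dtype=float)
    return float(np.linalg.norm(diff))

def rotation_angle_between(q_true, q_pred):
    """Angle in radians of the rotation that separates two quaternions.

    Both inputs are normalized first, and the absolute value of the dot
    product makes the result identical for q and -q.
    """
    a = np.asarray(q_true, dtype=float)
    b = np.asarray(q_pred, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    cos_half = min(1.0, abs(float(np.dot(a, b))))
    return 2.0 * float(np.arccos(cos_half))

identity = np.array([1.0, 0.0, 0.0, 0.0])
quarter_turn_z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])

# q and -q encode the same rotation, yet their Euclidean distance is maximal.
loss_same_rotation = euclidean_quaternion_loss(identity, -identity)
angle_same_rotation = rotation_angle_between(identity, -identity)
angle_quarter_turn = rotation_angle_between(identity, quarter_turn_z)
```

Because the angle formula normalizes its inputs, it ignores the length of the predicted quaternion. This is consistent with the observation above that an angular loss needs an additional length term to keep the network output from drifting.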
We used the Adam optimizer with the parameters proposed by its authors and an initial learning rate that we reduced by a fixed factor once the validation loss plateaued for five epochs. To initialize our weights we used orthogonal initialization. We applied dropout, chose a batch size of 16 and used early stopping when the validation loss did not improve for 10 epochs.
To improve the calibration results of our first, coarse network trained on the original data, we train a second, fine network on the remaining residual error. This is inspired by gradient boosting algorithms, where each subsequent learner is trained on the residual error of the previous one.
Our method has multiple advantages leading to more accurate calibration parameters. During operation, the first network roughly corrects the initial calibration error and the sensors are at least approximately aligned. For the second network more radar detections can be projected into the camera image, leading to a higher number of correspondences that enable the second network to perform a more fine-grained correction. Furthermore, the second network implicitly focuses on solving the errors in those axes that the first network performed poorly on. In our case, the fine network performed much better on solving the roll error, as errors around the -axis of the camera cause only relatively small projective discrepancy, and thus the coarse network focuses on tilt and pan.
In detail, we train the first network on dataset $D$, for which the radar detections of each sample were projected with a transformation $T_{\text{decalib}}$, obtained by applying a random decalibration $\phi_{\text{decalib}}$ on the true calibration $T$. After the first network's training we transform $D$ into a new dataset $\hat{D}$ on which we train the second network. $\hat{D}$ contains training samples corrected by the output of the first network. We transform the samples by converting the output quaternion $\hat{q}$ for each sample to the transformation $\hat{\phi}_{\text{decalib}}^{-1}$. Then we compute a new, corrected extrinsic matrix

$$\hat{T}_{\text{coarse}} = \hat{\phi}_{\text{decalib}}^{-1} \, T_{\text{decalib}}$$

for each sample and reproject the corresponding radar detections as described in Eq. 1. We obtain the correction of the residual decalibration by

$$\phi_{\text{res}}^{-1} = T \, \hat{T}_{\text{coarse}}^{-1},$$

which serves as the new label in the transformed dataset $\hat{D}$. The second network is then trained on $\hat{D}$.

At inference time we obtain the approximate true calibration for a new sample by computing

$$\hat{T} = \hat{\phi}_{\text{res}}^{-1} \, \hat{\phi}_{\text{decalib}}^{-1} \, T_{\text{decalib}},$$

where $\hat{\phi}_{\text{res}}^{-1}$ is the output of the second network and $\hat{\phi}_{\text{decalib}}^{-1}$ that of the first network. Note that before computing $\hat{\phi}_{\text{res}}^{-1}$ we perform a reprojection in the same way as during training.
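The two-stage procedure can be sketched in NumPy as follows. The sketch assumes quaternions in `(w, x, y, z)` order and corrections that are applied by left-multiplication onto the extrinsic matrix. The two `*_net` functions are hypothetical stand-ins that return fixed quaternions in place of trained networks, and all angles are made up.

```python
import numpy as np

def quat_to_transform(q):
    """4x4 rotation-only transform from a quaternion (w, x, y, z); q is normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    T = np.eye(4)
    T[:3, :3] = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
    return T

def residual_label(T_true, T_coarse):
    """Training label for the fine network: the correction still missing after stage one."""
    return T_true @ np.linalg.inv(T_coarse)

def calibrate(T_decalib, coarse_net, fine_net, sample):
    """Apply the coarse correction, reproject with it, then apply the fine correction."""
    T_coarse = quat_to_transform(coarse_net(sample, T_decalib)) @ T_decalib
    # The fine network sees radar detections reprojected with T_coarse.
    T_final = quat_to_transform(fine_net(sample, T_coarse)) @ T_coarse
    return T_coarse, T_final

def quat_about_x(angle_deg):
    """Quaternion for a rotation about the x-axis, used to build the toy example."""
    half = np.deg2rad(angle_deg) / 2.0
    return np.array([np.cos(half), np.sin(half), 0.0, 0.0])

def coarse_net(sample, T_current):
    """Stand-in for the coarse network: removes 4 of the 5 degrees of error."""
    return quat_about_x(-4.0)

def fine_net(sample, T_current):
    """Stand-in for the fine network: removes the remaining 1 degree."""
    return quat_about_x(-1.0)

T_true = np.eye(4)
T_true[:3, 3] = [0.0, 0.2, 0.0]
T_decalib = quat_to_transform(quat_about_x(5.0)) @ T_true

T_coarse, T_final = calibrate(T_decalib, coarse_net, fine_net, sample=None)
label_for_fine_net = residual_label(T_true, T_coarse)
```

In this toy case `label_for_fine_net` equals the 1 degree correction that the fine stand-in returns, and `T_final` recovers `T_true`. With real networks both stages are only approximate, which is why the fine network is trained on the errors that the coarse network actually makes.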
In the field of ITS and autonomous driving, sensor data is usually available as a continuous stream. A single decalibration of the sensor setup that persists over many frames is more likely than a completely random decalibration for each sample. Temporal filtering over the correction estimates of multiple consecutive samples can reduce the influence of estimation errors made for individual samples and thus increase the robustness and accuracy of our method. In the simplest case this can be implemented as a moving average of several consecutive model outputs.
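A moving average over quaternion outputs could look like the following sketch. It is one possible realization, not the authors' implementation: the class name and window size are hypothetical, and the component-wise mean with renormalization is an approximation that is reasonable only when the averaged rotations are close to each other.

```python
from collections import deque

import numpy as np

class QuaternionMovingAverage:
    """Moving average over the last `window` correction quaternions (w, x, y, z)."""

    def __init__(self, window):
        self.buffer = deque(maxlen=window)

    def update(self, q):
        """Add a new estimate and return the current filtered unit quaternion."""
        q = np.asarray(q, dtype=float)
        q = q / np.linalg.norm(q)
        # q and -q encode the same rotation; align signs before averaging.
        if self.buffer and np.dot(q, self.buffer[-1]) < 0:
            q = -q
        self.buffer.append(q)
        mean = np.mean(list(self.buffer), axis=0)
        return mean / np.linalg.norm(mean)

# Usage with made-up per-frame estimates scattered around a small rotation.
smoother = QuaternionMovingAverage(window=5)
for estimate in ([0.9998, 0.02, 0.0, 0.0],
                 [0.9997, 0.025, 0.0, 0.0],
                 [0.9999, 0.015, 0.0, 0.0]):
    filtered = smoother.update(estimate)
```

The sign alignment matters because a network that regresses quaternions freely may return either of the two equivalent representations, and averaging them naively would cancel the estimate out.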
In this section we explain which data we used to train and evaluate our approach. Furthermore, we explain our evaluation process in detail and present quantitative, as well as qualitative results.
In the field of ITS, public datasets containing data of radars combined with cameras are not available. Therefore, we generated our own dataset using sensor setups developed within the scope of the research project Providentia . Two identical setups were installed on existing gantry bridges along the Autobahn A9, overlooking a total of eight traffic lanes. Our sensor setup is shown in Fig. 3 and consisted of a Basler acA1920-50gc camera with a lens of focal length and a smartmicro UMRR-0C Type 40 traffic radar.
The camera records rgb images with a resolution of pixels, while the radar outputs vehicle detections as positions. The radar measurements can result in undetected vehicles, multi-detections for large vehicles like trucks or buses, and false positives due to measurement noise.
Since our approach requires a reference transformation between the sensors, we put special emphasis on the initial manual calibration. This corresponds to manual labeling in supervised learning problems.
We calibrated the cameras intrinsically with a checkerboard-based method in our laboratory, while the radar is intrinsically calibrated ex-factory. The translational extrinsic parameters of the sensor setup were manually measured on-site with a spirit level and a laser distance meter. We estimated the initial rotation parameters of the sensors towards the road using vanishing-point-based camera calibration (one vanishing point, known height above the road and known focal length) and the internal calibration function of the radar sensor. Afterwards, we fine-tuned the extrinsic rotational parameters by minimizing the visual projective error.
To obtain the necessary number of samples to train and evaluate our networks, it would be infeasible to use many different sensor setups and manually estimate each ground-truth calibration. Therefore, we randomly distorted the calibration for one sensor setup per measurement point as proposed by . In particular, we randomly generated 6-DoF decalibrations for each sample and used these decalibrations to compute initial decalibrated extrinsic matrices according to Eq. 3. Afterwards, we projected the radar detections onto the image according to Eq. 1, leading to a mismatch between the detections and the vehicles in the image. In addition, we discarded generated samples with fewer than 10 remaining correspondences. This excludes training samples without sufficient correspondences between camera image and radar projection, on which learning is not possible.
In particular, the decalibration angles were sampled from uniform distributions, with a narrower range for the roll angle than for the tilt and pan. We assumed a smaller roll decalibration as this angle is easier to measure with a spirit level. We multiplied the resulting rotation matrices into a single rotational decalibration. Furthermore, we added a small random translation error. Even though translation errors are minimal as distances are easy to measure, this accounts for errors during the manual calibration process and shows that our approach is robust to them. We split the samples created in this way for the first sensor setup into a training and a validation set. Additionally, we generated an independent test set, as well as a set of samples from the separate sensor setup on the second gantry bridge, in the same manner to evaluate the generalization of our approach.
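The sample generation can be sketched in NumPy as follows. All numeric ranges in the usage example are placeholders, not the values used for the dataset. Further assumptions are intervals that are symmetric around zero, a zero-mean Gaussian translation error, and an axis convention in which tilt, pan and roll are rotations about the camera's x-, y- and z-axis. The function names are hypothetical.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def sample_decalibration(rng, tilt_pan_max_deg, roll_max_deg, translation_std_m):
    """Draw one random 6-DoF decalibration as a 4x4 homogeneous matrix."""
    tilt, pan = np.deg2rad(rng.uniform(-tilt_pan_max_deg, tilt_pan_max_deg, size=2))
    roll = np.deg2rad(rng.uniform(-roll_max_deg, roll_max_deg))
    phi = np.eye(4)
    # Combine the three single-axis rotations into one rotational decalibration.
    phi[:3, :3] = rot_z(roll) @ rot_y(pan) @ rot_x(tilt)
    phi[:3, 3] = rng.normal(0.0, translation_std_m, size=3)
    return phi

def keep_sample(pixels, depths, width, height, min_correspondences=10):
    """Keep a sample only if enough detections still project into the image."""
    inside = ((pixels[:, 0] >= 0) & (pixels[:, 0] < width) &
              (pixels[:, 1] >= 0) & (pixels[:, 1] < height) & (depths > 0))
    return int(np.count_nonzero(inside)) >= min_correspondences

# Usage with placeholder ranges: 10 degrees for tilt and pan, 2 degrees for roll.
rng = np.random.default_rng(0)
phi_decalib = sample_decalibration(rng, tilt_pan_max_deg=10.0,
                                   roll_max_deg=2.0, translation_std_m=0.05)
```

A decalibrated extrinsic matrix for a sample then follows from combining `phi_decalib` with the reference calibration as in Eq. 3, and `keep_sample` implements the filter on the number of remaining correspondences.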
We trained our model with the boosting-inspired approach described in Sec. IV-D and the dataset generated with random decalibrations as explained in Sec. V-A. The following results are generated using the independent test sets.
Tab. I presents the average angular errors of our networks on a test set of samples with random decalibrations. While the coarse network achieves significant improvements in the tilt and pan angles, it struggles to regress errors in the roll angle. The roll error is more weakly correlated with the input, as it has a small projective influence over the long distances we work with. However, our fine model decreases the roll error significantly, as roll has more influence on the total remaining projective discrepancy after the coarse correction step. In total, we reduce the mean error in tilt, pan and roll relative to the initial decalibrations. The errors are approximately normally distributed around zero, which means our approach works reliably with few outliers and can be trusted in a real-world setting (Fig. 4).
In Fig. 5 we demonstrate qualitative examples of applying our approach to different decalibration scenarios. The main task of the coarse network is to find the right correspondences between radar detections and vehicles in the images. Based on these correspondences it estimates a rough correction for the initial decalibration. In case of decalibrations with only few successfully projected detections, the network’s correction leads to more projections onto the image plane that are then provided to the fine network. This effect can be observed in the first row in Fig. 5. The fine network then makes use of the increased number of correspondences and refines the calibration as shown in column (c). The fine network is particularly good at correcting rotational errors in the roll direction. In the second row (b) it can be observed that the coarse network is not able to solve the roll error because it has a relatively small impact on the projection error. The yellow points in the left half of the image are rotated below, and in the right half of the image above the blue ground truth detections. The fine network in the second row (c) was able to correct this residual error.
We also evaluated our approach by applying the same decalibration to all samples of the test set. This is a more realistic setting. In this manner we evaluated 100 different decalibrations. For each decalibration we computed the mean remaining error over all samples, and we then averaged the resulting tilt, pan and roll errors over all 100 runs. As shown in Fig. 6 (b), taking the average over all samples with the same static decalibration significantly reduces the error variance compared to using only a single frame for calibration, as in the random decalibration test shown in Fig. 4 (b). Even for static errors, our model is able to reduce them towards a distribution with approximately zero mean, as shown for two examples in Fig. 7. This further explains why temporal filtering of the estimated decalibration corrections as proposed in Sec. IV-E is a suitable method to improve accuracy and robustness.
To demonstrate the generalization capability of our approach we applied it to a sensor setup located at a different gantry bridge that was not included in the training data and had never been observed before. In this case the trajectory of the street is different, and so is the distribution of vehicles in the image. Moreover, the true extrinsic calibration differs from that of the first sensor setup, and the perspective from which the camera observes the vehicles has changed. Despite these challenges, our approach achieved reasonable results in tilt, pan and roll for random decalibrations (Fig. 8). While the performance dropped compared to the sensor setup used for training, the result indicates that our approach is able to generalize if trained on a more diverse dataset with different perspectives. Even so, the achieved results already greatly reduce manual calibration efforts and can support other calibration methods in practice.
The manual calibration of sensors in an ITS is a tedious and expensive process, especially concerning sensor orientations. For radars and cameras there is a lack of automatic calibration methods due to differing sensor modalities, and the sparsity and lack of descriptive features in radar detections. In this work, we addressed this problem and presented the first approach for the automatic rotational calibration of radar and camera sensors without the need of dedicated calibration targets. Our approach consists of two convolutional neural networks that are trained with a boosting-inspired learning regime. We evaluated our method on a real-world dataset that we recorded on the German Autobahn A9. Our method achieves precise rotational calibration of the sensors and is robust to missing vehicle detections, multiple detections for single vehicles and noise. We demonstrated its generalization capability and achieved reasonable results by applying it on a second measurement point with a different viewing angle on the highway and vehicles. This reduces the efforts of manual calibration drastically.
We expect that in the future the generalization capabilities of our approach could be further improved by using a more diverse dataset that includes multiple camera views. Furthermore, as sensors record a time series, sequences of frames could be used for iterative calibration with a recurrent neural network to increase calibration precision and robustness. Since the association of radar detections with vehicle detections in the image could become easy to achieve with nearest-neighbor algorithms once our approach has been applied, the final results could be refined by solving a classic convex optimization problem.
International Conference on Pattern Recognition (ICPR), 2016.
International Conference on Computer Vision (ICCV), 2015.
G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, “CalibNet: Self-supervised extrinsic calibration using 3D spatial transformer networks,” International Conference on Intelligent Robots and Systems (IROS), 2018.
Z. Li and H. Leung, “An expectation maximization based simultaneous registration and fusion algorithm for radar networks,” Canadian Conference on Electrical and Computer Engineering (CCECE), 2006.
Journal of Machine Learning Research (JMLR), 2014.