A CNN-based Camera Calibration Toolbox
Camera calibration is a crucial technique that significantly influences the performance of many robotic systems. Robustness and high precision have always been the pursuit of diverse calibration methods. State-of-the-art calibration techniques based on classical Zhang's method, however, still suffer from environmental noise, radial lens distortion and sub-optimal parameter estimation. Therefore, in this paper, we propose a hybrid camera calibration framework which combines learning-based approaches with traditional methods to handle these bottlenecks. In particular, this framework leverages learning-based approaches to perform efficient distortion correction and robust chessboard corner coordinate encoding. For sub-pixel accuracy of corner detection, a specially-designed coordinate decoding algorithm with an embedded outlier rejection mechanism is proposed. To avoid sub-optimal estimation results, we improve the traditional parameter estimation with a RANSAC algorithm and achieve stable results. Compared with two widely-used camera calibration toolboxes, experimental results on both real and synthetic datasets demonstrate the better robustness and higher precision of the proposed framework. The massive synthetic dataset is the basis of our framework's performance and will be publicly available along with the code at https://github.com/Easonyesheng/CCS.
Camera calibration is crucial for many robotic applications, such as visual odometry, 3D scene reconstruction and autonomous driving. In industrial and medical applications in particular, precision and robustness are of utmost importance, so camera calibration in such systems has a significant impact on the overall performance.
The most widely-used camera calibration toolboxes [5, 6] are built on Zhang's technique. Usually, chessboard images are captured to calculate camera parameters from the feature correspondences established between the 3D world and 2D images. This pipeline is flexible and easy to implement. Building a precise and robust calibration system, however, is still a challenging problem, mainly due to the following issues:
Inexact detection. Sub-pixel feature localization is hard to achieve, especially in scenarios with noise and poor illumination.
Radial distortion. Severe radial lens distortion may result in calibration failure.
Sub-optimal estimation. Purely algebraic optimization of re-projection error leads to sub-optimal and unstable calibration results.
Earlier methods pursue sub-pixel feature localization using hand-crafted features, but they are not robust when confronted with noise. With the rapid advance of deep learning, Convolutional Neural Networks (CNNs) have been introduced [11, 12, 13] to detect chessboard corners, but they remain limited to pixel-level precision. To address this issue, we extend the deep learning paradigm to achieve sub-pixel feature localization by adopting an encoding-decoding scheme that has proved effective in body joint estimation. However, unlike joints, chessboard corners have two peculiarities: (1) their commonly-seen pattern leads to fake corner detections in the background, and (2) they obey collineation under projective transformation when distortion is not considered. Therefore, we propose a specifically-designed sub-pixel coordinate decoding algorithm, in which Gaussian surface fitting yields initial sub-pixel coordinates, distribution-aware outlier rejection eliminates unreliable corners, and collineation refinement produces the final corner coordinates after inlier candidates are selected.
The second issue concerns lens distortion correction, a non-trivial problem in camera calibration, as real cameras often exhibit lens distortion, especially radial distortion. Classical methods [16, 17, 7] estimate camera and distortion parameters simultaneously by iterative optimization. However, this entangles the parameters and introduces ambiguity, leading to failure under severe distortion. Fish-eye image rectification offers a positive example for handling this issue: the same kind of distortion is handled successfully by deep neural networks [18, 19, 20], since the distortion is related to the curvature of straight scene lines. Inspired by this, we adopt a CNN to infer correction parameters from distorted images, but with a more practical distortion model than previous works. Since corner detection accuracy suffers from radial distortion and our collineation refinement assumes no distortion, distortion correction is the first step of the proposed framework.
The third issue is caused by purely algebraic optimization, which may lead to unreasonable calibration results. Some works [21, 22] propose geometry-based algorithms to address this issue, but they are not complete enough to be widely used. The Random Sample Consensus (RANSAC) algorithm has been introduced into calibration to eliminate outliers and enhance the stability of parameter estimation [24, 25]. We also adopt a RANSAC-based calibration procedure to improve the robustness of parameter estimation. Different from previous work, since outlier rejection is already embedded in our model, the proposed RANSAC-based algorithm aims at searching for the optimal camera model.
In sum, the critical components of a calibration system (distortion correction, feature detection and parameter estimation) are reforged and integrated into a novel and efficient calibration framework (Fig. 1). The contributions can be summarized as follows.
A novel camera calibration framework is proposed, which includes radial distortion correction, sub-pixel feature detection and robust parameter estimation.
We design a sub-pixel coordinate decoding algorithm with outlier rejection and collineation refinement to cooperate with the learned coordinate encoding method.
The new framework achieves precise results on both synthetic and real data, outperforming widely-used methods by a noteworthy margin.
We review the literature on the three parts into which our framework divides calibration.
Corner Detection. Early corner detectors such as Harris were adopted first, but they are sensitive to noise. Lucchese et al. propose a sub-pixel corner detection algorithm utilizing saddle points based on the chessboard corner feature, which Placht et al. extend with a surface-fitting refinement. The widely-used findChessboardCorners in OpenCV refines coordinates to sub-pixel accuracy according to gray-distribution constraints. In , corner coordinates are refined to sub-pixel accuracy based on the chessboard structure. While these methods achieve sub-pixel precision, their heavy reliance on hand-crafted features leads to a lack of robustness. Recently, detection algorithms based on learned features have been proposed [11, 12, 13]; they are robust against noise but trapped at pixel-level accuracy. On the other hand, learned coordinate encoding with refinement decoding [14, 27] boosts body joint localization performance, which is suggestive for this task.
Radial Distortion Correction. Classical algorithms integrate distortion into the camera model and solve for it with non-linear optimization techniques . Tsai  estimates the distortion parameters with an iterative scheme, and Zhang's technique  corrects the distortion by minimizing the reprojection error. Although these methods work well under slight radial distortion, they may end up with a bad solution when the distortion is severe. In similar applications such as fisheye image rectification, some researchers [28, 18, 19] adopt deep CNNs to learn the distortion parameters and achieve decent performance.
Parameter Estimation. The widely-used Zhang's technique  first solves for an initial guess of the parameters based on the correspondences between the real world and the image, and then refines these parameters by minimizing the reprojection error. However, purely algebraic optimization is unstable and leads to suboptimal calibration results. To address this issue, geometry-based methods [22, 21] have been proposed. Moreover, RANSAC-based algorithms [25, 24] achieve robust parameter estimation by excluding unreliable images or corners.
In this section, we introduce our camera calibration framework by describing each component.
As simultaneous parameter estimation and distortion correction not only increases calibration effort but also reduces calibration precision, distortion correction is performed first and separately in our framework. Inspired by previous works [18, 20] that perform distortion correction on general images taken by fisheye cameras, we apply a learning approach to correct camera distortion. Unlike general images, chessboard images contain many more straight lines, which are warped by distortion, so the correlation between curved line features and distortion is expected to be strong. Therefore, we adopt a simple CNN-based encoder with regression layers to infer correction parameters from distorted images. Unlike previous work, we apply a more practical distortion model, the radial model (), which is symmetric and flexible . Its symmetry maintains consistency between the distortion and the correction, the latter being another radial model of higher order. Its flexibility allows us to choose an appropriate number of correction parameters for the specific distortion. To train the network, massive numbers of distorted chessboard images are generated and a robust sampling grid loss  is adopted.
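To make the radial model concrete, the sketch below applies a polynomial radial correction, x_u = x_d (1 + k1 r^2 + k2 r^4 + ...), to points around a distortion center. The coefficient vector `k` stands in for the correction parameters the network would predict; the function name, interface and model order are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def undistort_points(pts, k, center):
    """Apply a radial correction model x_u = x_d * (1 + k1*r^2 + k2*r^4 + ...)
    to distorted points. `k` is a vector of correction coefficients
    (hypothetically predicted by the network), `center` the distortion center."""
    shifted = pts - center                           # coordinates relative to center
    r2 = np.sum(shifted ** 2, axis=1, keepdims=True) # squared radius per point
    factor = np.ones_like(r2)
    for i, ki in enumerate(k):                       # 1 + k1*r^2 + k2*r^4 + ...
        factor += ki * r2 ** (i + 1)
    return center + shifted * factor
```

A longer coefficient vector gives a higher-order model, which is how the correction can itself be another radial model of higher order than the distortion it undoes.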
Learned heatmap encoding and decoding provide decent performance in body joint localization [14, 27]. Inspired by these methods, we propose a similar but more specific chessboard corner detection technique. First, a CNN-based encoder-decoder network transforms the corrected chessboard image into a coordinate heatmap, and heatmap modulation  is utilized to improve the encoding performance. This process encodes sub-pixel corner coordinates, as a 2-dimensional Gaussian distribution () is centered at the labelled coordinate of each corner in the ground-truth heatmaps used to train the network. However, unlike body joints, chessboard corners have unique properties: (i) the black-and-white pattern is commonly seen, resulting in fake corners in the background; (ii) as camera projective transformation (without distortion) is a collineation, the chessboard corners in the image should be the intersections of groups of lines. Therefore, we propose a specially-designed coordinate decoding algorithm. First, a Gaussian surface fitting algorithm is applied to each corner's distribution in the heatmap to obtain its center and variance, where  is the point-set distribution of a single corner in the transformed heatmap. In practice, however, some abnormal distributions occur in the heatmaps. They are caused by similar patterns in the background (fake corners) and by encoding failures due to occlusion, extreme pose and bad lighting (lost corners), as shown in Fig. 2. To exclude these fake and lost corners, we propose distribution-aware outlier rejection, which flags outliers by comparing each distribution's variance with the ground-truth variance used in training. After inliers are selected, collineation refinement both refines the sub-pixel coordinates and recovers the lost corners. Because distortion has already been removed by our framework, this refinement fits lines through the initial corner candidates and computes the final coordinates as the intersections of these lines.
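A minimal numpy sketch of this decoding step, under assumed values for the window size, ground-truth variance and rejection tolerance (the paper's settings are not given here). The Gaussian surface fit is approximated by weighted moments rather than a least-squares fit, and `intersect` illustrates the line intersection used by the collineation refinement:

```python
import numpy as np

def decode_corner(heatmap, peak, win=5, var_gt=2.0, tol=0.5):
    """Sub-pixel decoding of one corner's heatmap response: a moment-based
    approximation of a Gaussian surface fit, followed by the
    distribution-aware variance check (var_gt and tol are assumed values)."""
    y0, x0 = peak
    patch = heatmap[y0 - win:y0 + win + 1, x0 - win:x0 + win + 1]
    ys, xs = np.mgrid[y0 - win:y0 + win + 1, x0 - win:x0 + win + 1]
    w = patch / patch.sum()                        # normalize to a distribution
    cy, cx = (w * ys).sum(), (w * xs).sum()        # sub-pixel center (mean)
    var = (w * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum() / 2  # isotropic variance
    inlier = abs(var - var_gt) < tol               # reject fake/lost corners
    return (cx, cy), var, inlier

def intersect(l1, l2):
    """Intersection of two lines a*x + b*y + c = 0, as used by the
    collineation refinement to recompute corner coordinates."""
    (a1, b1, c1), (a2, b2, c2) = l1, l2
    d = a1 * b2 - a2 * b1
    return ((b1 * c2 - b2 * c1) / d, (a2 * c1 - a1 * c2) / d)
```

In the full refinement, lines would be fitted through the inlier candidates of each chessboard row and column, and every final corner would be recomputed as a row-line/column-line intersection, which also recovers corners lost in the heatmap.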
In order to achieve robust parameter estimation, we propose a simple yet effective method that improves Zhang's technique  with RANSAC. Different from early works [25, 24], the proposed parameter estimation aims at searching for the best camera model, one that keeps the reprojection error consistent, because our corner detection part already includes outlier rejection. The method can be outlined as follows:
Choose some of the images randomly to estimate parameters based on Zhang’s technique.
Calculate the reprojection error of all images and determine inliers whose reprojection errors are less than the threshold.
Output the parameters if the number of inliers is large enough; otherwise, repeat the above steps.
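The three steps above can be sketched as a generic RANSAC loop. Here `fit` stands in for Zhang's estimation on the sampled images and `error` for the per-image re-projection error; the sample size, thresholds and iteration cap are illustrative values, not the paper's:

```python
import random

def ransac_model_search(samples, fit, error, n_sample=5,
                        err_thresh=1.0, inlier_ratio=0.8,
                        max_iters=100, seed=0):
    """Generic RANSAC loop mirroring the three calibration steps:
    fit a model on a random subset, score all samples by error,
    and accept once enough samples are consistent with the model."""
    rng = random.Random(seed)
    for _ in range(max_iters):
        subset = rng.sample(samples, n_sample)        # step 1: random subset
        model = fit(subset)                           # e.g. Zhang's technique
        inliers = [s for s in samples
                   if error(model, s) < err_thresh]   # step 2: score all samples
        if len(inliers) >= inlier_ratio * len(samples):
            return model, inliers                     # step 3: consistent model
    return None, []
```

In the calibration setting, `fit` could wrap a standard implementation of Zhang's method (e.g. OpenCV's calibrateCamera on the sampled image subset) and `error` the mean re-projection error of one image under the candidate intrinsics.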
In this section, the performance of the proposed camera calibration framework is evaluated on both synthetic and real data. We also conduct experiments to demonstrate the capabilities of the distortion correction and corner detection parts individually.
To train our networks, we generate massive numbers of chessboard images with ground-truth corner heatmaps and camera parameters. Moreover, noise, bad lighting, distortion and fake backgrounds from the TUM dataset  are applied to the images as data augmentation. Specifically, the image distortion level is determined by the parameter .
As camera calibration aims at intrinsic parameter estimation, we use metrics related to the focal length (FL) and the principal point (PP), defined as:
where  denotes the intrinsic parameter error.
To extensively evaluate our framework, we conduct three kinds of calibration experiments: calibration under noise and bad lighting, calibration under distortion and calibration on real data. Our framework is compared with two widely-used calibration toolboxes: OpenCV  and Matlab .
Calibration under noise and bad lighting. To demonstrate the robustness and accuracy of our framework against environmental noise such as low sensor resolution and uneven illumination, we perform calibration on synthetic data with and without extra Gaussian noise and uneven brightness. The average results of 50 independent trials are summarized in Table II. Our framework not only achieves the best accuracy but also maintains the best robustness. For instance, on the dataset without noise and bad lighting, the intrinsic parameter error of our system is lower than that of both OpenCV and Matlab. Moreover, our framework performs much better under noise and bad lighting than the other methods, as the learning approaches are robust against environmental noise. The lower standard deviations also demonstrate the stability of our framework, owing to the RANSAC procedure.
Calibration under distortion. In this part, we conduct calibration experiments on distorted images to evaluate our framework's performance. To test the framework extensively, we vary the distortion parameter across a range of distortion levels. Fig. 3 shows examples of distorted images along with the images corrected by our correction part. For each distortion level, 50 independent trials of calibration are performed and the average results are shown in Fig. 4. The calibration errors of the two widely-used methods increase with the distortion level, whereas our system maintains precise results under all distortion levels, which we attribute to our distortion correction part.
Calibration on real data. To evaluate the performance of our system under realistic conditions, we calibrate a HIKROBOT MV-CA016-10GM camera using a chessboard. We repeat the calibration 20 times with different combinations of chessboard poses and report the average intrinsic parameters, the reprojection error and the standard deviation of the parameters (averaged over the 4 parameters). The results, compared with the two widely-used calibration toolboxes, are provided in Table I. The three systems produce similar results, with ours closer to Matlab's than to OpenCV's. While the reprojection error of our system is almost the same as Matlab's, our framework is more accurate because the principal point coordinates should be close to the image center according to prior knowledge about the real camera. Moreover, the lower standard deviation also indicates the higher stability of our framework.
As calibration benefits from precise chessboard corner coordinates, the accuracy of our corner detection method is tested on synthetic data in this subsection. We compare against both feature-based [5, 10] and learning-based  corner detection methods with sub-pixel accuracy, conducting experiments on synthetic images with different configurations, including extra noise, bad lighting and radial distortion (). Each configuration contains 2K chessboard images with ground-truth sub-pixel corner coordinates. The results are shown in Table III. Our method achieves the highest precision across the different images, which is consistent with the calibration experiments and confirms the precision of our method.
To measure the contribution of our distortion correction part, we compare the calibration accuracy of our framework with and without it. In addition, we combine the distortion correction part with Matlab and compare its accuracy with the original. The experiments are conducted on our synthetic dataset with distortion parameter , and the results are provided in Table IV. The distortion correction network improves the accuracy of our framework considerably. Matlab combined with the distortion correction part also gains accuracy, though the improvement is limited because distortion correction introduces noise through image interpolation. As our corner detection part achieves accurate results despite this noise, our framework handles distortion well.
In this paper, we propose a novel camera calibration framework consisting of three parts: distortion correction, corner detection and parameter estimation. The framework integrates learning-based approaches with traditional methods to achieve efficient distortion correction, accurate sub-pixel feature detection and stable parameter estimation, and it surpasses other widely-used calibration methods by a large margin in terms of accuracy on both synthetic and real datasets. Extensive experiments further demonstrate its robustness against noise, bad lighting and radial distortion. In addition, the corner detection and distortion correction parts are evaluated individually, and their decent results confirm their contributions to the framework.
S. Donné, J. De Vylder, B. Goossens, and W. Philips, “MATE: Machine Learning for Adaptive Calibration Template Detection,” Sensors, vol. 16, no. 11, pp. 1858–17, Nov. 2016.
F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu, “Distribution-aware coordinate representation for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7093–7102.