Cameras provide rich information about the scene, while being small, lightweight, inexpensive, and power efficient. Despite their wide availability, camera calibration largely remains a manual, time-consuming process that typically requires collecting images of known targets (e.g., checkerboards) as they are deliberately moved in the scene [zhang2000flexible]. While applicable to a wide range of camera models [scaramuzza2006flexible, kannala2006generic, grossberg2001general], this process is tedious and has to be repeated whenever the camera parameters change. A number of methods perform calibration “in the wild” [caprile1990using, pollefeys1997stratified, cipolla1999camera]. However, they rely on strong assumptions about the scene structure, which cannot be met during deployment in unstructured environments. Learning-based methods relax these assumptions, and regress camera parameters directly from images, either by using labelled data for supervision [bogdan2018deepcalib] or by extending the framework of self-supervised depth and ego-motion estimation [garg2016unsupervised, zhou2017unsupervised] to also learn per-frame camera parameters [gordon2019depth, vasiljevic2020neural].
While these methods enable learning accurate depth and ego-motion without calibration, they are either over-parameterized [vasiljevic2020neural] or limited to near-pinhole cameras [gordon2019depth]. In contrast, we propose a self-supervised camera calibration algorithm capable of learning expressive models of different camera geometries in a computationally efficient manner. In particular, our approach adopts a family of general camera models [usenko2018double] that can scale to higher resolutions than previously possible, while still being able to model highly complex geometries such as catadioptric lenses. Furthermore, our framework learns camera parameters per-sequence rather than per-frame, resulting in self-calibrations that are more accurate and more stable than those achieved using contemporary learning methods. We evaluate the reprojection error of our approach compared to conventional target-based calibration routines, showing comparable sub-pixel performance despite only using raw videos at training time.
Our contributions can be summarized as follows:
We propose to self-calibrate a variety of generic camera models from raw video using self-supervised depth and pose learning as a proxy objective, providing for the first time a calibration evaluation of camera model parameters learned purely from self-supervision.
We demonstrate the utility of our framework on challenging and radically different datasets, learning depth and pose on perspective, fisheye, and catadioptric images without architectural changes.
We achieve state-of-the-art depth evaluation results on the challenging EuRoC MAV dataset by a large margin, using our proposed self-calibration framework.
II. Related Work
Camera Calibration. Traditional calibration for a variety of camera models uses targets such as checkerboards or AprilTags to generate 2D-3D correspondences, which are then used in a bundle adjustment framework to recover relative poses as well as intrinsics [zhang2000flexible, hartley2000zisserman]. Targetless methods typically make strong assumptions about the scene, such as the existence of vanishing points and known (Manhattan world) scene structure [caprile1990using, pollefeys1997stratified, cipolla1999camera]. While highly accurate, these techniques require a controlled setting and manual target image capture to re-calibrate. Several models are implemented in OpenCV [bradski2000opencv] and kalibr [rehder2016extending]. These methods require specialized settings to work and thus form an upper bound on what is possible with self-calibration.
Camera Models. The pinhole camera model is ubiquitous in robotics and computer vision [leonard08, urmson2008autonomous], and especially common in recent deep learning architectures for depth estimation [zhou2017unsupervised]. There are two main families of models for high-distortion cameras. The first is the “high-order polynomial” distortion family, which includes pinhole radial distortion [fryer1986lens], omnidirectional (omni) [scaramuzza2006flexible], and Kannala-Brandt (KB) [kannala2006generic]. The second is the “unified camera model” family, which includes the Unified Camera Model (UCM) [geyer2000unifying], the Extended Unified Camera Model (EUCM) [khomutenko2015enhanced], and the Double Sphere Camera Model (DS) [usenko2018double]. Both families are able to achieve low reprojection errors for a variety of different camera geometries [usenko2018double]; however, the unprojection operation of the “high-order polynomial” models requires solving for the root of a high-order polynomial, usually via iterative optimization, which is expensive and not easily differentiable. In contrast, the “unified camera model” family, which we use as the basis for our work, has an easily computed, closed-form unprojection function.
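To make this contrast concrete, the following sketch (with hypothetical distortion coefficients `k1`, `k2`; not taken from any specific calibration) shows the Newton iteration that unprojection under a polynomial model such as Kannala-Brandt requires; the unified models avoid this inner loop entirely:

```python
import torch

def solve_theta(r_img, k1, k2, iters=10):
    """Invert a (hypothetical) distortion polynomial
    r(theta) = theta + k1*theta^3 + k2*theta^5 by Newton's method,
    as required for unprojection in 'high-order polynomial' models."""
    theta = r_img.clone()  # initial guess: assume no distortion
    for _ in range(iters):
        f = theta + k1 * theta**3 + k2 * theta**5 - r_img
        df = 1 + 3 * k1 * theta**2 + 5 * k2 * theta**4
        theta = theta - f / df  # Newton update
    return theta
```

Each unprojection requires running this loop to convergence, and differentiating through it is awkward; the closed-form unprojection of the unified models needs none of this.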
Learning Camera Calibration. Work in learning-based camera calibration can be divided into two types: supervised approaches that leverage ground-truth calibration parameters or synthetic data to train single-image calibration regressors; and self-supervised methods that utilize only image sequences. Our proposed method falls in the latter category, and aims to self-calibrate a camera system using only image sequences. Early work on applying CNNs to camera calibration focused on regressing the focal length [workman2015deepfocal] or horizon lines [workman2016horizon]; synthetic data was used for distortion calibration [rong2016radial] and fisheye rectification [yin2018fisheyerecnet]. Using panorama data to generate images with a wide variety of intrinsics, Lopez et al. [lopez2019deep] estimate both extrinsics (tilt and roll) and intrinsics (focal length and radial distortion). DeepCalib [bogdan2018deepcalib] takes a similar approach: given a panoramic dataset, they generate projections with different focal lengths and train a CNN to regress from the resulting synthetic images to their (known) focal lengths. Typically, training images are generated by taking crops of the desired focal lengths from 360-degree panoramas [hold2018perceptual, zhu2020single]. While this can be done for any kind of image and does not require image sequences, it does require access to panoramic images; furthermore, the warped “synthetic” images are not true 3D-2D projections. This approach has been extended to pan-tilt-zoom [zhang2020deepptz] and fisheye [yin2018fisheyerecnet] cameras. Methods also exist for specialized settings such as portraits [zhao2019learning], 3D point cloud data [yin2021learning], and learning rectification [yang2021progressively, liao2021deep].
Self-Supervised Depth and Ego-Motion. Self-supervised learning has also been used to learn camera parameters from geometric priors. Gordon et al. [gordon2019depth] learn a pinhole and radial distortion model, while Vasiljevic et al. [vasiljevic2020neural] learn a generalized central camera model applicable to a wider range of camera types, including catadioptric. Both methods learn calibration on a per-frame basis and do not offer a calibration evaluation of their learned camera model. Furthermore, while [vasiljevic2020neural] is much more general than [gordon2019depth], it is limited to fairly low resolutions by its complex and approximated generalized projection operation. In our work, we trade some degree of generality (i.e., a global central model vs. per-pixel) for a closed-form, efficient projection operation and ease of calibration evaluation.
First, we describe the self-supervised monocular depth learning framework that we use as a proxy for self-calibration. Then, we describe the family of unified camera models we consider and how we learn their parameters end-to-end.
III-A. Self-Supervised Monocular Depth Estimation
Self-supervised depth and ego-motion architectures consist of a depth network that produces depth maps $\hat{D}_t$ for a target image $I_t$, as well as a pose network that predicts the relative rigid-body transformation $\hat{\mathbf{T}}_{t \to c}$ between target and context frames. We train the networks jointly by minimizing the reprojection error between the actual target image $I_t$ and a synthesized image $\hat{I}_t$, generated by projecting pixels from the context image $I_c$ (usually preceding or following $I_t$ in a sequence) onto the target image using the predicted depth map $\hat{D}_t$ and ego-motion $\hat{\mathbf{T}}_{t \to c}$ [zhou2017unsupervised]. See Fig. 6 for an overview. The general pixel-warping operation is defined as:

$$\hat{\mathbf{p}}_c = \pi\left(\hat{\mathbf{T}}_{t \to c}\, \pi^{-1}(\mathbf{p}_t, \hat{d}_t, \mathbf{i}),\ \mathbf{i}\right) \tag{1}$$

where $\mathbf{i}$ are the camera intrinsic parameters modeling the geometry of the camera, required both for the projection of 3D points onto image pixels via $\pi$ and for the unprojection of image pixels into 3D points via $\pi^{-1}$, assuming an estimated pixel depth $\hat{d}_t$. The camera parameters are generally those of the standard pinhole model [hartley2000zisserman], defined by the intrinsic matrix $K$, but can follow any differentiable model, such as the Unified Camera Model family [usenko2018double] described next.
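For the pinhole special case, where $\pi$ reduces to multiplication by $K$ (up to normalization) and $\pi^{-1}(\mathbf{p}, \hat{d}) = \hat{d}\,K^{-1}\mathbf{p}$, the warping operation can be sketched in PyTorch as follows (an illustrative sketch, not the paper's implementation; function and variable names are our own):

```python
import torch
import torch.nn.functional as F

def warp_context_to_target(ctx_img, target_depth, K, R, t):
    """Synthesize the target view by sampling the context image at the
    pixel locations predicted by depth + relative pose (pinhole sketch)."""
    B, _, H, W = ctx_img.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)
    # Unproject: X = depth * K^-1 p (broadcast over the batch).
    rays = torch.linalg.inv(K) @ pix                    # (3, H*W)
    X = target_depth.reshape(B, 1, -1) * rays           # (B, 3, H*W)
    # Rigid transform into the context frame, then project with K.
    Xc = R @ X + t.reshape(B, 3, 1)
    p = K @ Xc
    u2 = p[:, 0] / p[:, 2].clamp(min=1e-6)
    v2 = p[:, 1] / p[:, 2].clamp(min=1e-6)
    # Normalize to [-1, 1] for grid_sample and resample the context image.
    grid = torch.stack([2 * u2 / (W - 1) - 1, 2 * v2 / (H - 1) - 1], dim=-1)
    return F.grid_sample(ctx_img, grid.reshape(B, H, W, 2),
                         align_corners=True)
```

With identity pose and constant depth, the warp is the identity; during training, the photometric difference between the warped context image and the target image provides the supervisory signal.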
III-B. End-to-End Self-Calibration
The Unified Camera Model (UCM) [geyer2000unifying] is a parametric global central camera model that uses only five parameters to represent a diverse set of camera geometries, including perspective, fisheye, and catadioptric. A 3D point is projected onto a unit sphere and then projected onto the image plane of a pinhole camera, shifted by $\xi$ from the center of the sphere (Fig. 7). The Extended UCM (EUCM) and Double Sphere Camera Model (DS) are two extensions of the UCM model: EUCM replaces the unit sphere with an ellipsoid as the first projection surface, and DS replaces the single unit sphere with two unit spheres in the projection process. We self-calibrate all three models (in addition to a pinhole baseline) in our experiments. For brevity, we only describe the original UCM and refer the reader to [usenko2018double] for details on the EUCM and DS models.
There are multiple parameterizations for UCM [geyer2000unifying], and we use the one from [usenko2018double] since it has better numerical properties. UCM extends the pinhole camera model with only one additional parameter $\alpha$. The 3D-to-2D projection of a point $\mathbf{P} = (x, y, z)^\top$ is defined as

$$\pi(\mathbf{P}, \mathbf{i}) = \begin{bmatrix} f_x \dfrac{x}{\alpha d + (1-\alpha) z} \\[6pt] f_y \dfrac{y}{\alpha d + (1-\alpha) z} \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix} \tag{2}$$

where the camera parameters are $\mathbf{i} = (f_x, f_y, c_x, c_y, \alpha)$ and $d = \sqrt{x^2 + y^2 + z^2}$.

The unprojection operation of pixel $\mathbf{p} = (u, v)$ at estimated depth $\hat{d}$ is:

$$\pi^{-1}(\mathbf{p}, \hat{d}, \mathbf{i}) = \hat{d}\, \frac{\xi + \sqrt{1 + (1 - \xi^2) r^2}}{1 + r^2} \begin{bmatrix} m_x \\ m_y \\ 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \hat{d}\,\xi \end{bmatrix} \tag{3}$$

with $m_x = \frac{u - c_x}{f_x}(1 - \alpha)$, $m_y = \frac{v - c_y}{f_y}(1 - \alpha)$, $r^2 = m_x^2 + m_y^2$, and $\xi = \frac{\alpha}{1 - \alpha}$.
Importantly, the UCM camera model provides closed-form projection and unprojection functions that are both differentiable. Therefore, the overall architecture is end-to-end differentiable with respect to both the neural network parameters (for pose and depth estimation) and the camera parameters. This enables learning self-calibration end-to-end from the aforementioned view synthesis objective alone. At the start of self-supervised depth and pose training, rather than pre-calibrating the camera parameters, we initialize the camera with “default” values based on the image shape only (for a detailed discussion of the initialization procedure, please see Section IV-D). Although the projection (2) and unprojection (3) operations are initially inaccurate, they quickly converge to highly accurate camera parameters with sub-pixel reprojection error (see Table I).
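As a concrete illustration, the $\alpha$-parameterized UCM projection and unprojection can be written as differentiable PyTorch functions (a sketch following [usenko2018double]; names and shapes are our own choices):

```python
import torch

def ucm_project(X, f, c, alpha):
    """UCM 3D-to-2D projection (alpha parameterization).
    X: (N, 3) points; f, c: (2,) focal / principal point; alpha: scalar."""
    x, y, z = X[:, 0], X[:, 1], X[:, 2]
    d = torch.sqrt(x * x + y * y + z * z)
    denom = alpha * d + (1 - alpha) * z
    u = f[0] * x / denom + c[0]
    v = f[1] * y / denom + c[1]
    return torch.stack([u, v], dim=-1)

def ucm_unproject(p, f, c, alpha):
    """Closed-form UCM unprojection to unit-norm rays; multiply by the
    estimated depth to obtain 3D points."""
    xi = alpha / (1 - alpha)
    mx = (p[:, 0] - c[0]) / f[0] * (1 - alpha)
    my = (p[:, 1] - c[1]) / f[1] * (1 - alpha)
    r2 = mx * mx + my * my
    coeff = (xi + torch.sqrt(1 + (1 - xi * xi) * r2)) / (1 + r2)
    ray = torch.stack([coeff * mx, coeff * my, coeff - xi], dim=-1)
    return ray / ray.norm(dim=-1, keepdim=True)
```

Because both functions are built from differentiable tensor operations, wrapping $(f_x, f_y, c_x, c_y, \alpha)$ in an `nn.Parameter` lets the view-synthesis loss update the calibration directly through backpropagation.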
As we show in our experiments, our method combines flexibility with computational efficiency. Indeed, our approach enables learning from heterogeneous datasets with potentially vastly differing sensors, for which separate camera parameters are learned. As most parameters (those of the depth and pose networks) are shared thanks to the decoupling of the projection model, this enables scaling up in-the-wild training of depth and pose networks. Furthermore, our method is efficient, with only one extra parameter relative to the pinhole model. This enables learning depth for highly-distorted catadioptric cameras at a much higher resolution than previous over-parameterized models such as [vasiljevic2020neural]. Note that, in contrast to prior works [gordon2019depth, vasiljevic2020neural], we learn intrinsics per-sequence rather than per-frame. This increases stability compared to per-frame methods, which exhibit frame-to-frame variability [vasiljevic2020neural], and can be applied to sequences of varying lengths.
In this section, we describe two sets of experimental validations of our architecture: (i) calibration, where we find that the reprojection error of our learned camera parameters compares favorably to target-based traditional calibration toolboxes; and (ii) depth evaluation, where we achieve state-of-the-art results on the challenging EuRoC MAV dataset.
IV-A. Datasets

Self-supervised depth and ego-motion learning typically uses monocular sequences [zhou2017unsupervised, godard2019digging, gordon2019depth, packnet] or rectified stereo pairs [godard2019digging, superdepth] from forward-facing cameras [geiger2012we, packnet, caesar2020nuscenes]. Given that our goal is to learn camera calibration from raw videos in challenging settings, we use the standard KITTI dataset as a baseline, and focus on the more challenging and distorted EuRoC [burri2016euroc] fisheye sequences.
KITTI [geiger2012we]. We use this dataset to show that our self-calibration procedure is able to accurately recover pinhole intrinsics alongside depth and ego-motion. Following related work [zhou2017unsupervised, godard2019digging, gordon2019depth, packnet], we use the training protocol of [eigen2014depth], filtering static images as described in [zhou2017unsupervised] and holding out a subset of images for evaluation.
EuRoC [burri2016euroc]. The dataset consists of a set of indoor MAV sequences with general six-DoF motion. Consistent with recent work [gordon2019depth], we train on center-cropped and down-sampled images, training and evaluating on the same split. For calibration evaluation, we follow [usenko2018double] and use the calibration sequences from the dataset. We evaluate the UCM, EUCM, and DS camera models in terms of reprojection error.
OmniCam [schonbein2014calibrating]. A challenging outdoor catadioptric sequence containing 12,000 frames captured by an autonomous vehicle rig. As this dataset does not provide ground-truth depth information, we only provide qualitative results.
IV-B. Training Protocol
We implement the family of unified camera models described in [usenko2018double] as differentiable PyTorch [paszke2017automatic] operations, modifying the self-supervised depth and pose architecture of monodepth2 [godard2019digging] to jointly learn depth, pose, and the unified camera model intrinsics. We use separate learning rates for the depth and pose networks and for the camera parameters, decayed with a StepLR scheduler, and run all experiments for a fixed number of epochs. The images are augmented with random vertical and horizontal flips, as well as color jittering. We train our models on a Titan X GPU with 12 GB of memory. We note that our method requires significantly less memory than that of [vasiljevic2020neural], which learns a generalized camera model parameterized through a per-pixel ray surface.
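The two-rate setup can be sketched as follows (the networks, learning rates, and schedule values below are illustrative placeholders, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the depth and pose networks (the real ones are CNNs).
depth_net, pose_net = nn.Linear(8, 8), nn.Linear(8, 6)
# Learnable UCM intrinsics (fx, fy, cx, cy, alpha), initialized from the
# image shape only; values here are illustrative.
intrinsics = nn.Parameter(torch.tensor([100.0, 100.0, 64.0, 48.0, 0.5]))

# Two parameter groups with different learning rates: one for the shared
# networks, one for the camera parameter vector.
optimizer = torch.optim.Adam([
    {"params": list(depth_net.parameters()) + list(pose_net.parameters()),
     "lr": 1e-4},
    {"params": [intrinsics], "lr": 1e-3},
])
# StepLR decays both groups by `gamma` every `step_size` epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```

Keeping the intrinsics in their own parameter group makes it easy to freeze them (for warm starts) or tune their learning rate independently of the networks.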
[Table I: mean reprojection error for each camera model, target-based vs. learned.]
IV-C. Camera Self-Calibration
We evaluate the results of the proposed self-calibration method on the EuRoC dataset; detailed depth estimation evaluations are provided in Sec. IV-F. To our knowledge, ours is the first direct calibration evaluation of self-supervised intrinsics learning; although Gordon et al. [gordon2019depth] compare ground-truth calibration to their per-frame model, they do not evaluate the reprojection error of their learned parameters.
Following [usenko19nfr], we evaluate our self-supervised calibration method on the family of unified camera models: the Unified Camera Model (UCM), the Extended Unified Camera Model (EUCM), and the Double Sphere Model (DS), as well as the perspective (pinhole) baseline. As a lower bound, we use the Basalt [usenko19nfr] toolbox and compute camera calibration parameters for each unified camera model using the calibration sequences of the EuRoC dataset. We note that, unlike Basalt, our method regresses the intrinsic calibration parameters directly from raw videos, without using any of the calibration sequences.
Table I summarizes our reprojection error results. We use the EuRoC AprilTag [olson2011apriltag] calibration sequences with Basalt to measure reprojection error using the full estimation procedure (Table I - Target Based) and the learned intrinsics (Table I - Learning). For consistency, we optimize both intrinsics and camera poses for the baselines, and only the camera poses for the learned-intrinsics evaluation. Note that with learned intrinsics, the UCM, EUCM, and DS models all achieve sub-pixel mean reprojection error despite the camera parameters having been learned from raw video data.
Table II compares the target-based calibrated parameters to our learned parameters for different camera models trained on the cam0 sequences of the EuRoC dataset. Though the parameter vectors were initialized with no prior knowledge of the camera model and updated purely based on gradients from the reprojection error, they converge to values very close to the output of a procedure that uses bundle adjustment on calibration image sequences.
IV-D. Camera Re-calibration: Perturbation Experiments
In many real-world robotics settings, the camera calibration is not completely unknown, as it has been in our setting so far; instead, we wish to re-calibrate based on a (possibly highly incorrect) prior calibration. Generally, this requires the capture of new calibration data. Instead, we can initialize our parameter vectors with this initial calibration (in this setting, a perturbation of the Basalt calibration of the EUCM model) and observe the extent to which self-supervision can nudge the parameters back to their “true” values.
Given the Basalt parameters, we perturb them by fixed relative amounts and initialize the camera parameters at the beginning of training with these values. All runs use a warm start, i.e., we freeze the gradients of the intrinsics for the first few epochs to let the depth and pose networks train. The convergence of each parameter is shown in Figure 15; for most of the parameters, we converge to within a small fraction of the Basalt value. The values of the converged parameters and the mean reprojection error (MRE) of each run are shown in Table III.
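A minimal sketch of the perturbation and warm start (the perturbation magnitude and calibration values here are illustrative, not the ones used in the experiments):

```python
import torch
import torch.nn as nn

def perturbed_intrinsics(calib, rel_noise=0.3, seed=0):
    """Start from a prior calibration and perturb each parameter by up to
    +/- rel_noise of its value (rel_noise is an illustrative choice)."""
    g = torch.Generator().manual_seed(seed)
    noise = (torch.rand(calib.shape, generator=g) * 2 - 1) * rel_noise
    return nn.Parameter(calib * (1 + noise))

# Illustrative EUCM-style parameter vector (fx, fy, cx, cy, alpha).
calib = torch.tensor([95.0, 95.0, 180.0, 120.0, 0.6])
intrinsics = perturbed_intrinsics(calib)

# Warm start: freeze the intrinsics for the first few epochs so the depth
# and pose networks settle before the camera parameters are updated.
intrinsics.requires_grad_(False)
# ... train depth and pose networks for the warm-start epochs ...
intrinsics.requires_grad_(True)   # then let gradients flow to the camera
```

After unfreezing, the same view-synthesis loss pulls the perturbed parameters back toward their calibrated values.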
IV-E. Camera Rectification
Using our learned camera parameters, we rectify the calibration sequences of the EuRoC dataset to demonstrate the quality of the calibration. EuRoC was captured with a fisheye camera, so there is a high degree of radial distortion, which causes the straight edges of the checkerboard grid to appear curved. In Figure 8, we see that our learned parameters allow the rectified grid to track closely to the true underlying checkerboard.
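Rectification amounts to computing, for every pixel of an ideal pinhole image, the corresponding sample location under the learned UCM model (a sketch with names of our own choosing; the output focal length `f_out` is a free design choice):

```python
import torch
import torch.nn.functional as F

def rectify(img, f, c, alpha, f_out):
    """Resample a distorted image onto an ideal pinhole grid using
    learned UCM parameters (f, c, alpha)."""
    B, _, H, W = img.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Ray through each rectified pixel under an ideal pinhole camera.
    x = (u - W / 2) / f_out
    y = (v - H / 2) / f_out
    z = torch.ones_like(x)
    d = torch.sqrt(x * x + y * y + z * z)
    # Project each ray with the learned UCM model to find the source pixel.
    denom = alpha * d + (1 - alpha) * z
    src_u = f[0] * x / denom + c[0]
    src_v = f[1] * y / denom + c[1]
    # Normalize to [-1, 1] and resample the distorted input image.
    grid = torch.stack([2 * src_u / (W - 1) - 1,
                        2 * src_v / (H - 1) - 1], dim=-1)
    return F.grid_sample(img, grid.expand(B, H, W, 2), align_corners=True)
```

When $\alpha = 0$ the UCM reduces to a pinhole camera, and (with matching focals and principal point) rectification is the identity, which makes for a simple sanity check.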
[Tables IV and V: depth evaluation (Method, Camera, Abs Rel, Sq Rel, RMSE). NRS [vasiljevic2020neural] row: 0.303, 0.056, 0.154, 0.556; the original paper did not evaluate on this dataset, so the model was retrained on EuRoC using the publicly available training code.]
IV-F. Depth Estimation
While in this work depth and pose estimation are only proxy tasks for camera self-calibration, the unified camera model framework allows us to achieve meaningful results compared to prior camera-learning based approaches (see Figures 16 and 19).
KITTI results. Our results on this dataset are presented in Table IV. We note that our approach is able to model the simple pinhole setting, achieving results which are on par with related work tailored specifically for this geometry. Interestingly, we record an increase in performance when using the UCM model, which we attribute to the ability to further account for and correct calibration errors.
EuRoC results. EuRoC is a significantly more challenging setting than KITTI, involving cluttered indoor sequences with 6DoF motion. Compared to the per-frame distorted camera models of [gordon2019depth] and [vasiljevic2020neural], we achieve significantly better absolute relative error, especially with EUCM, where the error is reduced by a large margin (Table V). We also train NRS [vasiljevic2020neural] on this dataset for further comparison, using the official repository.
[Table VI: per-dataset depth results (Abs Rel, Sq Rel, RMSE).]
Combining heterogeneous datasets. One of the strengths of the unified camera model is that it can represent a wide variety of cameras without prior knowledge. As long as we know which sequences come from which camera, we can learn separate calibration vectors that share the same depth and pose networks. This is particularly useful as a way to improve performance on smaller datasets, since it enables the introduction of unlabeled data from other sources. To evaluate this property, we experimented with mixing KITTI and EuRoC. In this experiment, we reshaped the KITTI images to match those in the EuRoC dataset, and found that we could improve EuRoC depth evaluation (see Table VI).
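One way to implement this (an illustrative sketch, not the paper's code) is a module holding one learnable UCM parameter vector per camera, while the depth and pose networks remain shared:

```python
import torch
import torch.nn as nn

class MultiCameraCalib(nn.Module):
    """One learnable UCM parameter vector (fx, fy, cx, cy, alpha) per
    camera/sequence; depth and pose networks are shared across all."""
    def __init__(self, image_sizes):
        super().__init__()
        # Default initialization from image shape only (illustrative choice).
        self.calibs = nn.ParameterDict({
            name: nn.Parameter(torch.tensor(
                [w / 2.0, w / 2.0, w / 2.0, h / 2.0, 0.5]))
            for name, (h, w) in image_sizes.items()
        })

    def forward(self, name):
        # Look up the calibration vector for the camera a batch came from.
        return self.calibs[name]

# Hypothetical mixed-dataset setup with two cameras of the same resolution.
cams = MultiCameraCalib({"kitti": (256, 384), "euroc": (256, 384)})
```

During training, each batch is tagged with its source camera, and only the corresponding parameter vector receives gradients from that batch; the shared networks benefit from all the data.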
IV-G. Computational Cost
Our work is closely related to the learned general camera model (NRS) of [vasiljevic2020neural], given that in both works the parameters of a central general camera model are learned in a self-supervised way. NRS, being a per-pixel model, is more general than ours, and can handle settings with local distortion that a global camera model necessarily cannot capture. However, the computational requirements of the per-pixel NRS are significantly higher. For example, training on EuRoC images consumes about 6 GB of GPU memory, and each epoch completes in minutes rather than hours.
On the same GPU, NRS uses 16 GB of GPU memory with a batch size of one to train on the same sequences, running one epoch in about two hours. This is due to the high-dimensional (yet approximate) projection operation required for a generalized camera. Thus, we trade some degree of generality for significantly higher efficiency than prior work, with higher accuracy on the EuRoC dataset (see Table V).
We proposed a procedure to self-calibrate a family of general camera models using self-supervised depth and pose estimation as a proxy task. We rigorously evaluated the quality of the resulting camera models, demonstrating sub-pixel calibration accuracy comparable to manual target-based toolbox calibration approaches. Our approach generates per-sequence camera parameters, and can be integrated into any learning procedure where calibration is needed and the projection and unprojection operations are interpretable and differentiable. As shown in our experiments, our approach is particularly amenable to online re-calibration, and can be used to combine datasets from different sources, learning independent calibration parameters while sharing the same depth and pose networks.