Self-Supervised Camera Self-Calibration from Video

Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.



I Introduction

Cameras provide rich information about the scene, while being small, lightweight, inexpensive, and power efficient. Despite their wide availability, camera calibration largely remains a manual, time-consuming process that typically requires collecting images of known targets (e.g., checkerboards) as they are deliberately moved in the scene [zhang2000flexible]. While applicable to a wide range of camera models [scaramuzza2006flexible, kannala2006generic, grossberg2001general], this process is tedious and has to be repeated whenever the camera parameters change. A number of methods perform calibration “in the wild” [caprile1990using, pollefeys1997stratified, cipolla1999camera]. However, they rely on strong assumptions about the scene structure, which cannot be met during deployment in unstructured environments. Learning-based methods relax these assumptions, and regress camera parameters directly from images, either by using labelled data for supervision [bogdan2018deepcalib] or by extending the framework of self-supervised depth and ego-motion estimation [garg2016unsupervised, zhou2017unsupervised] to also learn per-frame camera parameters [gordon2019depth, vasiljevic2020neural].

(a) Input
(b) Predicted depth
(c) Rectified image
(d) Example re-calibration results from perturbations of a camera parameter.
Fig. 5: Our self-supervised self-calibration procedure can recover accurate parameters for a wide range of cameras using a structure-from-motion objective on raw videos (EuRoC dataset, top), enabling on-the-fly re-calibration and robustness to intrinsics perturbation (bottom).

While these methods enable learning accurate depth and ego-motion without calibration, they are either over-parameterized [vasiljevic2020neural] or limited to near-pinhole cameras [gordon2019depth]. In contrast, we propose a self-supervised camera calibration algorithm capable of learning expressive models of different camera geometries in a computationally efficient manner. In particular, our approach adopts a family of general camera models [usenko2018double] that can scale to higher resolutions than previously possible, while still able to model highly-complex geometries such as catadioptric lenses. Furthermore, our framework learns camera parameters per-sequence rather than per-frame, resulting in self-calibrations that are more accurate and more stable than those achieved using contemporary learning methods. We evaluate the reprojection error of our approach compared to conventional target-based calibration routines, showing comparable sub-pixel performance despite only using raw videos at training time.

Our contributions can be summarized as follows:

  • We propose to self-calibrate a variety of generic camera models from raw video using self-supervised depth and pose learning as a proxy objective, providing for the first time a calibration evaluation of camera model parameters learned purely from self-supervision.

  • We demonstrate the utility of our framework on challenging and radically different datasets, learning depth and pose on perspective, fisheye, and catadioptric images without architectural changes.

  • We achieve state-of-the-art depth evaluation results on the challenging EuRoC MAV dataset by a large margin, using our proposed self-calibration framework.

II Related Work

Camera Calibration. Traditional calibration for a variety of camera models uses targets such as checkerboards or AprilTags to generate 2D-3D correspondences, which are then used in a bundle adjustment framework to recover relative poses as well as intrinsics [zhang2000flexible, hartley2000zisserman]. Targetless methods typically make strong assumptions about the scene, such as the existence of vanishing points and known (Manhattan world) scene structure [caprile1990using, pollefeys1997stratified, cipolla1999camera]. While highly accurate, these techniques require a controlled setting and manual target image capture to re-calibrate. Several camera models are implemented in OpenCV [bradski2000opencv] and Kalibr [rehder2016extending]. Because these methods require specialized settings to work, they form an upper bound on the accuracy that self-calibration can achieve.

Camera Models. The pinhole camera model is ubiquitous in robotics and computer vision [leonard08, urmson2008autonomous], and especially common in recent deep learning architectures for depth estimation [zhou2017unsupervised]. There are two main families of models for high-distortion cameras. The first is the “high-order polynomial” distortion family that includes pinhole radial distortion [fryer1986lens], omnidirectional (omni) [scaramuzza2006flexible], and Kannala-Brandt (KB) [kannala2006generic]. The second is the “unified camera model” family that includes the Unified Camera Model (UCM) [geyer2000unifying], Extended Unified Camera Model (EUCM) [khomutenko2015enhanced] and Double Sphere Camera Model (DS) [usenko2018double]. Both families are able to achieve low reprojection errors for a variety of different camera geometries [usenko2018double]. However, the unprojection operation of the “high-order polynomial” models requires finding the root of a high-order polynomial, which is usually done by iterative optimization, an expensive and not easily differentiable operation. In contrast, the “unified camera model” family, which we use as the basis for our work, has an easily computed, closed-form unprojection function.
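To make the cost of polynomial unprojection concrete, the sketch below inverts a radial mapping of the Kannala-Brandt form by Newton iteration. It is an illustration only: the two-coefficient truncation, the coefficient values, and the function names `kb_forward` and `kb_inverse` are assumptions made here, not taken from any calibration toolbox.

```python
def kb_forward(theta, k1, k2):
    """Radial mapping of Kannala-Brandt form (truncated to two coefficients):
    incidence angle theta -> distorted radius."""
    return theta + k1 * theta**3 + k2 * theta**5


def kb_inverse(r, k1, k2, iters=20, tol=1e-12):
    """Recover theta from r by Newton's method.

    There is no general closed form for the root of this polynomial,
    so every unprojected pixel needs an iterative solve.
    """
    theta = r  # initial guess: the undistorted case
    for _ in range(iters):
        f = kb_forward(theta, k1, k2) - r
        df = 1.0 + 3.0 * k1 * theta**2 + 5.0 * k2 * theta**4
        step = f / df
        theta -= step
        if abs(step) < tol:
            break
    return theta


# Illustrative coefficients (hypothetical, not from a real calibration).
K1, K2 = -0.01, 0.001
theta_true = 1.0
theta_rec = kb_inverse(kb_forward(theta_true, K1, K2), K1, K2)
```

The loop is what a unified-camera-model unprojection avoids: there, the inverse is a single closed-form expression.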

Learning Camera Calibration. Work in learning-based camera calibration can be divided into two types: supervised approaches that leverage ground-truth calibration parameters or synthetic data to train single-image calibration regressors; and self-supervised methods that utilize only image sequences. Our proposed method falls in the latter category, and aims to self-calibrate a camera system using only image sequences. Early work on applying CNNs to camera calibration focused on regressing the focal length [workman2015deepfocal] or horizon lines [workman2016horizon]; synthetic data was used for distortion calibration [rong2016radial] and fisheye rectification [yin2018fisheyerecnet]. Using panorama data to generate images with a wide variety of intrinsics, Lopez et al. [lopez2019deep] are able to estimate both extrinsics (tilt and roll) and intrinsics (focal length and radial distortion). DeepCalib [bogdan2018deepcalib] takes a similar approach: given a panoramic dataset, it generates projections with different focal lengths and then trains a CNN to regress from these synthetic images to their (known) focal lengths. Typically, training images are generated by taking crops of the desired focal lengths from 360-degree panoramas [hold2018perceptual, zhu2020single]. While this can be done for any kind of image, and does not require image sequences, it does require access to panoramic images. Furthermore, the warped “synthetic” images are not the true 3D-2D projections. This approach has been extended to pan-tilt-zoom [zhang2020deepptz] and fisheye [yin2018fisheyerecnet] cameras. Methods also exist for specialized settings like portraits [zhao2019learning], 3D point cloud data [yin2021learning] and learning rectification [yang2021progressively, liao2021deep].

Self-Supervised Depth and Ego-Motion. Self-supervised learning has also been used to learn camera parameters from geometric priors. Gordon et al. [gordon2019depth] learn a pinhole and radial distortion model, while Vasiljevic et al. [vasiljevic2020neural] learn a generalized central camera model applicable to a wider range of camera types, including catadioptric. These methods both learn calibration on a per-frame basis, and do not offer a calibration evaluation of their learned camera model. Furthermore, while [vasiljevic2020neural] is much more general than [gordon2019depth], it is limited to fairly low resolutions by the complex and approximated generalized projection operation. In our work, we trade some degree of generality (i.e., a global central model vs. per-pixel) for a closed-form and efficient projection operation and ease of calibration evaluation.

III Methodology

First, we describe the self-supervised monocular depth learning framework that we use as proxy for self-calibration. Then we describe the family of unified camera models we consider and how we learn their parameters end-to-end.

III-A Self-Supervised Monocular Depth Estimation

Self-supervised depth and ego-motion architectures consist of a depth network that produces depth maps $\hat{D}_t$ for a target image $I_t$, as well as a pose network that predicts the relative rigid-body transformation between target $t$ and context $c$ frames, $\hat{X}^{t \to c} = \begin{pmatrix} \hat{R}^{t \to c} & \hat{t}^{t \to c} \\ 0 & 1 \end{pmatrix} \in SE(3)$. We train the networks jointly by minimizing the reprojection error between the actual target image $I_t$ and a synthesized image $\hat{I}_t$ generated by projecting pixels from the context image $I_c$ (usually preceding or following $I_t$ in a sequence) onto the target image $I_t$ using the predicted depth map $\hat{D}_t$ and ego-motion $\hat{X}^{t \to c}$ [zhou2017unsupervised]. See Fig. 6 for an overview. The general pixel-warping operation is defined as:

$$\hat{p}^t = \pi\left(\hat{R}^{t \to c}\, \phi(p^t, \hat{d}^t, i) + \hat{t}^{t \to c},\; i\right) \tag{1}$$

where $i$ are camera intrinsic parameters modeling the geometry of the camera, which is required for both projection of 3D points $P$ onto image pixels $p$ via $\pi(P, i) = p$ and unprojection via $\phi(p, \hat{d}, i) = P$ assuming an estimated pixel depth of $\hat{d}$. The camera parameters are generally the standard pinhole model [hartley2000zisserman] defined by the $3 \times 3$ intrinsic matrix $K$, but can include any differentiable model such as the Unified Camera Model family [usenko2018double] as described next.
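The warping operation composes three steps: unproject a target pixel with its predicted depth, move the resulting 3D point with the predicted rigid-body transform, and project it into the context view. The following is a minimal single-pixel sketch for the pinhole case. The function names, the tuple layout of the intrinsics, and the treatment of depth as the z-coordinate are assumptions made for illustration; a training pipeline would apply this densely and differentiably to whole images.

```python
def unproject_pinhole(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) to a 3D point, taking depth as the z-coordinate."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project_pinhole(P, fx, fy, cx, cy):
    """Project a 3D point (with z > 0) onto the image plane."""
    x, y, z = P
    return (fx * x / z + cx, fy * y / z + cy)


def transform(R, t, P):
    """Apply the rigid-body motion R * P + t (R is a 3x3 nested list)."""
    return tuple(sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3))


def warp_pixel(u, v, depth, R, t, intrinsics):
    """Target pixel -> 3D point -> context frame -> context pixel."""
    P_target = unproject_pinhole(u, v, depth, *intrinsics)
    P_context = transform(R, t, P_target)
    return project_pinhole(P_context, *intrinsics)


# Illustrative values: identity rotation, 10 cm sideways translation.
intrinsics = (200.0, 200.0, 192.0, 128.0)  # fx, fy, cx, cy
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
warped = warp_pixel(192.0, 128.0, 2.0, R_identity, (0.1, 0.0, 0.0), intrinsics)
```

Swapping `unproject_pinhole` and `project_pinhole` for another differentiable camera model leaves `warp_pixel` unchanged, which is the property the self-calibration approach relies on.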

III-B End-to-End Self-Calibration

The Unified Camera Model (UCM) [geyer2000unifying] is a parametric global central camera model that uses only five parameters to represent a diverse set of camera geometries, including perspective, fisheye, and catadioptric. A 3D point is projected onto a unit sphere and then projected onto the image plane of a pinhole camera, shifted by $\frac{\alpha}{1-\alpha}$ from the center of the sphere, where $\alpha$ is the single parameter that UCM adds to the pinhole model (Fig. 7). The Extended UCM (EUCM) and Double Sphere Camera Model (DS) are two extensions of the UCM. EUCM replaces the unit sphere with an ellipsoid as the first projection surface, and DS replaces the single unit sphere with two unit spheres in the projection process. We self-calibrate all three models (in addition to a pinhole baseline) in our experiments. For brevity, we only describe the original UCM and refer the reader to Usenko et al. [usenko2018double] for details on the EUCM and DS models.

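There are multiple parameterizations for UCM [geyer2000unifying], and we use the one from Usenko et al. [usenko2018double] since it has better numerical properties. UCM extends the pinhole camera model $(f_x, f_y, c_x, c_y)$ with only one additional parameter $\alpha$. The 3D-to-2D projection of $P = (x, y, z)$ is defined as

$$\pi(P, i) = \begin{bmatrix} f_x \dfrac{x}{\alpha d + (1-\alpha) z} \\ f_y \dfrac{y}{\alpha d + (1-\alpha) z} \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix} \tag{2}$$

where the camera parameters are $i = (f_x, f_y, c_x, c_y, \alpha)$ and $d = \sqrt{x^2 + y^2 + z^2}$.

The unprojection operation of pixel $p = (u, v, 1)$ at estimated depth $\hat{d}$ is:

$$\phi(p, \hat{d}, i) = \hat{d}\, \frac{\xi + \sqrt{1 + (1 - \xi^2) r^2}}{1 + r^2} \begin{bmatrix} m_x \\ m_y \\ 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \hat{d}\xi \end{bmatrix} \tag{3}$$

where $m_x = \frac{u - c_x}{f_x}(1-\alpha)$, $m_y = \frac{v - c_y}{f_y}(1-\alpha)$, $r^2 = m_x^2 + m_y^2$, and $\xi = \frac{\alpha}{1-\alpha}$.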

Fig. 6: Our self-supervised self-calibration pipeline. We use gradients from the photometric loss to update the parameters of a unified camera model (Fig. 7).
Fig. 7: The Unified Camera Model [usenko2018double] used in our self-calibration pipeline. Points are projected onto a unit sphere before being projected onto an image plane of a standard pinhole camera offset by $\frac{\alpha}{1-\alpha}$ from the sphere center.
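The UCM projection and unprojection in the parameterization of [usenko2018double] can be sketched in a few lines of plain Python. The function names are illustrative and not taken from the authors' code, the intrinsics are values of roughly the magnitude reported for EuRoC, and depth is interpreted here as Euclidean distance along the viewing ray, which is what scaling the unit-sphere direction by the depth implies.

```python
import math


def ucm_project(P, fx, fy, cx, cy, alpha):
    """Project a 3D point P = (x, y, z) to pixel coordinates."""
    x, y, z = P
    d = math.sqrt(x * x + y * y + z * z)
    denom = alpha * d + (1.0 - alpha) * z
    return (fx * x / denom + cx, fy * y / denom + cy)


def ucm_unproject(u, v, depth, fx, fy, cx, cy, alpha):
    """Lift pixel (u, v) to the 3D point at distance `depth` along its ray."""
    mx = (u - cx) / fx * (1.0 - alpha)
    my = (v - cy) / fy * (1.0 - alpha)
    r2 = mx * mx + my * my
    xi = alpha / (1.0 - alpha)
    # Closed-form scale factor: no iterative root finding is needed.
    s = (xi + math.sqrt(1.0 + (1.0 - xi * xi) * r2)) / (1.0 + r2)
    return (depth * s * mx, depth * s * my, depth * (s - xi))


# Round trip with illustrative intrinsics.
params = (237.6, 247.9, 187.9, 130.3, 0.631)  # fx, fy, cx, cy, alpha
P = (0.3, -0.2, 1.0)
u, v = ucm_project(P, *params)
P_back = ucm_unproject(u, v, math.sqrt(sum(c * c for c in P)), *params)
```

Both functions use only elementary operations, so an automatic-differentiation framework can propagate gradients through them to all five parameters.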

As shown in Equations 2 and 3, the UCM camera model provides closed-form projection and unprojection functions that are both differentiable. Therefore, the overall architecture is end-to-end differentiable with respect to both neural network parameters (for pose and depth estimation) and camera parameters. This enables learning self-calibration end-to-end from the aforementioned view synthesis objective alone. At the start of self-supervised depth and pose training, rather than pre-calibrating the camera parameters, we initialize the camera with “default” values based on image shape only (for a detailed discussion of the initialization procedure, please see Section IV-D). Although the projection (2) and unprojection (3) are initially inaccurate, they quickly converge to highly accurate camera parameters with sub-pixel re-projection error (see Table I).
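As a toy illustration of what differentiability with respect to camera parameters buys, the sketch below recovers the UCM parameter $\alpha$ from reprojection residuals using its analytic derivative. This is deliberately not the paper's training setup: it assumes known synthetic 3D-2D correspondences instead of a photometric loss, fixes the pinhole parameters, and uses Gauss-Newton steps on a single scalar rather than stochastic gradient descent over networks. All names and values are hypothetical.

```python
import math


def ucm_project_with_grad(P, fx, fy, cx, cy, alpha):
    """Return the UCM pixel of P and its derivative with respect to alpha."""
    x, y, z = P
    d = math.sqrt(x * x + y * y + z * z)
    denom = alpha * d + (1.0 - alpha) * z
    pixel = (fx * x / denom + cx, fy * y / denom + cy)
    # d(denom)/d(alpha) = d - z, then apply the quotient rule.
    grad = (-fx * x * (d - z) / denom**2, -fy * y * (d - z) / denom**2)
    return pixel, grad


def refine_alpha(points, pixels, fx, fy, cx, cy, alpha0, iters=20):
    """Gauss-Newton refinement of alpha from 3D-2D correspondences."""
    alpha = alpha0
    for _ in range(iters):
        num = 0.0  # sum of residual * Jacobian
        den = 0.0  # sum of Jacobian^2
        for P, (u_obs, v_obs) in zip(points, pixels):
            (u, v), (du, dv) = ucm_project_with_grad(P, fx, fy, cx, cy, alpha)
            num += (u - u_obs) * du + (v - v_obs) * dv
            den += du * du + dv * dv
        alpha -= num / den
    return alpha


FX, FY, CX, CY = 235.0, 245.0, 186.0, 132.0  # illustrative, held fixed
TRUE_ALPHA = 0.6
points = [(0.5, 0.2, 2.0), (-0.4, 0.3, 1.5), (0.2, -0.6, 1.0), (-0.3, -0.1, 2.5)]
pixels = [ucm_project_with_grad(P, FX, FY, CX, CY, TRUE_ALPHA)[0] for P in points]
alpha_hat = refine_alpha(points, pixels, FX, FY, CX, CY, alpha0=0.5)
```

In the self-supervised setting the residual is photometric rather than geometric, but the gradient reaches the camera parameters through the same projection derivative.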

As we show in our experiments, our method combines flexibility with computational efficiency. Indeed, our approach enables learning from heterogeneous datasets with potentially vastly differing sensors for which separate parameters are learned. As most of the parameters (in the depth and pose networks) are shared thanks to the decoupling of the projection model, this enables scaling up in-the-wild training of depth and pose networks. Furthermore, our method is efficient, with only one extra parameter relative to the pinhole model. This enables learning depth for highly-distorted catadioptric cameras at a much higher resolution than previous over-parametrized models [vasiljevic2020neural]. Note that, in contrast to prior works [gordon2019depth, vasiljevic2020neural], we learn intrinsics per-sequence rather than per-frame. This increases stability compared to per-frame methods that exhibit frame-to-frame variability [vasiljevic2020neural], and can be used over sequences of varying sizes.

IV Experiments

In this section we describe two sets of experimental validations for our architecture: (i) calibration, where we find that the re-projection error of our learned camera parameters compares favorably to target-based traditional calibration toolboxes; and (ii) depth evaluation, where we achieve state-of-the-art results on the challenging EuRoC MAV dataset.

IV-A Datasets

Self-supervised depth and ego-motion learning uses monocular sequences [zhou2017unsupervised, godard2019digging, gordon2019depth, packnet] or rectified stereo pairs [godard2019digging, superdepth] from forward-facing cameras [geiger2012we, packnet, caesar2020nuscenes]. Given that our goal is to learn camera calibration from raw videos in challenging settings, we use the standard KITTI dataset as a baseline, and focus on the more challenging and distorted EuRoC [burri2016euroc] fisheye sequences.

KITTI [geiger2012we]. We use this dataset to show that our self-calibration procedure is able to accurately recover pinhole intrinsics alongside depth and ego-motion. Following related work [zhou2017unsupervised, godard2019digging, gordon2019depth, packnet] we use the training protocol of [eigen2014depth], including filtering static images as described by Zhou et al. [zhou2017unsupervised]. The resulting training set contains 39,810 images, with 697 images left for evaluation.

EuRoC [burri2016euroc]. The dataset consists of a set of indoor MAV sequences with general six-DoF motion. Consistent with recent work [gordon2019depth], we center-crop and down-sample the images for training, and use the same training and evaluation split. For calibration evaluation, we follow Usenko et al. [usenko2018double] and use the calibration sequences from the dataset. We evaluate the UCM, EUCM and DS camera models in terms of re-projection error.

OmniCam [schonbein2014calibrating]. A challenging outdoor catadioptric sequence, containing 12000 frames captured by an autonomous car rig. As this dataset does not provide ground-truth depth information, we only provide qualitative results.

IV-B Training Protocol

We implement the group of unified camera models described in [usenko2018double] as differentiable PyTorch [paszke2017automatic] operations, modifying the self-supervised depth and pose architecture of monodepth2 to jointly learn depth, pose, and the unified camera model intrinsics. We use separate learning rates for the depth and pose networks and for the camera parameters, decayed with a StepLR scheduler. All experiments are run for the same number of epochs. The images are augmented with random vertical and horizontal flips, as well as color jittering. We train our models on a Titan X GPU with 12 GB of memory. We note that our method requires significantly less memory than that of Vasiljevic et al. [vasiljevic2020neural], which learns a generalized camera model parameterized through a per-pixel ray surface.

| Method | Mean Reprojection Error, Target-based | Mean Reprojection Error, Learned |
|---|---|---|
| Pinhole | 1.950 | 2.230 |
| UCM [geyer2000unifying] | 0.145 | 0.249 |
| EUCM [khomutenko2015enhanced] | 0.144 | 0.245 |
| DS [usenko2018double] | 0.144 | 0.344 |

TABLE I: Mean reprojection error on EuRoC for the UCM, EUCM and DS models using (left) the AprilTag-based calibration toolbox Basalt [usenko19nfr] and (right) our self-supervised learned (L) calibration. Note that despite using no ground-truth calibration targets, our self-supervised procedure produces sub-pixel reprojection error.

IV-C Camera Self-Calibration

We evaluate the results of the proposed self-calibration method on the EuRoC dataset; detailed depth estimation evaluations are provided in Sec. IV-F. To our knowledge, ours is the first direct calibration evaluation of self-supervised intrinsics learning; although Gordon et al. [gordon2019depth] compare ground-truth calibration to their per-frame model, they do not evaluate the re-projection error of their learned parameters.

Following Usenko et al. [usenko19nfr], we evaluate our self-supervised calibration method on the family of unified camera models: the Unified Camera Model (UCM), Extended Unified Camera Model (EUCM), and the Double Sphere Model (DS), as well as the perspective (pinhole) baseline. As a lower bound on the reprojection error, we use the Basalt [usenko19nfr] toolbox and compute camera calibration parameters for each unified camera model using the calibration sequences of the EuRoC dataset. We note that unlike Basalt, our method regresses the intrinsic calibration parameters directly from raw videos, without using any of the calibration sequences.

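| Method | $f_x$ | $f_y$ | $c_x$ | $c_y$ | $\alpha$ | $\beta$ | $\xi$ | $w$ |
|---|---|---|---|---|---|---|---|---|
| UCM (L) | 237.6 | 247.9 | 187.9 | 130.3 | 0.631 | | | |
| UCM (B) | 235.4 | 245.1 | 186.5 | 132.6 | 0.650 | | | |
| EUCM (L) | 237.4 | 247.7 | 186.7 | 129.1 | 0.598 | 1.075 | | |
| EUCM (B) | 235.6 | 245.4 | 186.4 | 132.7 | 0.597 | 1.112 | | |
| FOV (L) | 222.5 | 232.9 | 187.9 | 140.9 | | | | 0.91 |
| FOV (B) | 218.7 | 227.8 | 186.5 | 132.9 | | | | 0.92 |
| DS (L) | 184.8 | 193.3 | 187.8 | 130.2 | 0.561 | | -0.232 | |
| DS (B) | 181.4 | 188.9 | 186.4 | 132.6 | 0.571 | | -0.230 | |

TABLE II: Intrinsic calibration evaluation of different methods on the EuRoC dataset, where B denotes intrinsics obtained from Basalt, and L denotes learned intrinsics.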
Fig. 8: EuRoC rectification results using images from the calibration sequences. Each column visualizes the results rendered using (left) the Basalt calibrated intrinsics and (right) our learned intrinsics. The top row shows that detected (small circles) and reprojected (big circles) corners are close using both calibration methods. The bottom row shows the same images after rectification.

Table I summarizes our re-projection error results. We use the EuRoC AprilTag [olson2011apriltag] calibration sequences with Basalt to measure re-projection error using the full estimation procedure (Table I, Target-based) and learned intrinsics (Table I, Learned). For consistency, we optimize for both intrinsics and camera poses for the baselines and only for the camera poses for the learned intrinsics evaluation. Note that with learned intrinsics, the UCM, EUCM and DS models all achieve sub-pixel mean reprojection error despite the camera parameters having been learned from raw video data.
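For reference, a mean reprojection error of this kind can be computed as the average pixel distance between detected target corners and their reprojections. The sketch below assumes this plain Euclidean definition; the exact residual used inside Basalt may differ, and the function name and sample coordinates are hypothetical.

```python
import math


def mean_reprojection_error(detected, reprojected):
    """Average Euclidean pixel distance between matched 2D points."""
    if len(detected) != len(reprojected) or not detected:
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for (u, v), (u_hat, v_hat) in zip(detected, reprojected):
        total += math.hypot(u - u_hat, v - v_hat)
    return total / len(detected)


# Two corners: one reprojected 0.5 px away, one reprojected exactly.
detected = [(10.0, 10.0), (20.0, 20.0)]
reprojected = [(10.3, 10.4), (20.0, 20.0)]
mre = mean_reprojection_error(detected, reprojected)
```

A value below 1.0 is what the text refers to as sub-pixel error.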

| Perturbation | $f_x$ | $f_y$ | $c_x$ | $c_y$ | $\alpha$ | $\beta$ | MRE |
|---|---|---|---|---|---|---|---|
| init | 242.3 | 253.6 | 189.5 | 130.7 | 0.5984 | 1.080 | 0.409 |
| init | 241.3 | 252.3 | 188.5 | 130.5 | 0.5981 | 1.078 | 0.367 |
| init | 240.2 | 251.4 | 187.9 | 130.0 | 0.5971 | 1.076 | 0.348 |
| init | 239.5 | 250.9 | 187.8 | 129.2 | 0.5970 | 1.076 | 0.332 |
| init | 238.8 | 249.6 | 187.7 | 129.1 | 0.5968 | 1.071 | 0.298 |
| Basalt | 235.6 | 245.4 | 186.4 | 132.7 | 0.597 | 1.112 | 0.144 |

TABLE III: EUCM perturbation test results. With perturbed initialization, all intrinsic parameters achieve sub-pixel convergence for mean reprojection error (MRE), with only a small offset to the Basalt calibration numbers.
Fig. 15: EuRoC perturbation test, showing how our proposed learning-based method is able to recover from changes in camera parameters for online self-calibration.

Table II compares the target-based calibrated parameters to our learned parameters for different camera models trained on the cam0 sequences of the EuRoC dataset. Though the parameter vectors were initialized with no prior knowledge of the camera model and updated purely based on gradients from the reprojection error, they converge to values very close to the output of a procedure that uses bundle adjustment on calibrated image sequences.

IV-D Camera Re-calibration: Perturbation Experiments

In many real-world robotics settings, the camera calibration is not completely unknown as in our setting so far; instead, we wish to re-calibrate based on a (possibly highly incorrect) prior calibration. Generally, this requires the capture of new calibration data. Instead, we can initialize our parameter vectors with this initial calibration (in this setting, a perturbation of Basalt calibration of the EUCM model) and see the extent to which self-supervision can nudge the parameters back to their “true value”.

Given the Basalt parameters, we perturb them and initialize the camera parameters at the beginning of training with the perturbed values. All runs use a warm start, i.e., the gradients for the intrinsics are frozen for the initial epochs to let the depth and pose networks train. The convergence of each parameter is shown in Figure 15; most parameters converge to values close to the Basalt calibration. The values of the converged parameters and the mean reprojection error (MRE) of each run can be seen in Table III.
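To put the residual offsets in perspective, the snippet below computes the relative deviation of one converged run from the Basalt reference, using the numbers of the last perturbed row of Table III. It assumes that the table's columns are, in order, $f_x$, $f_y$, $c_x$, $c_y$, $\alpha$, $\beta$; the helper name is hypothetical.

```python
# Reference (Basalt) and converged values, copied from Table III.
basalt = {"fx": 235.6, "fy": 245.4, "cx": 186.4, "cy": 132.7,
          "alpha": 0.597, "beta": 1.112}
converged = {"fx": 238.8, "fy": 249.6, "cx": 187.7, "cy": 129.1,
             "alpha": 0.5968, "beta": 1.071}


def relative_offsets(estimate, reference):
    """Signed relative deviation of each parameter from its reference value."""
    return {name: (estimate[name] - reference[name]) / reference[name]
            for name in reference}


offsets = relative_offsets(converged, basalt)
for name, value in offsets.items():
    print(f"{name}: {100 * value:+.2f}%")
```

For this run every parameter ends up within 4% of its Basalt value, with the distortion parameter $\alpha$ nearly identical.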

IV-E Camera Rectification

Using our learned camera parameters, we rectify calibration sequences from the EuRoC dataset to demonstrate the quality of the calibration. EuRoC was captured with a fisheye camera, so there is a high degree of radial distortion, which causes the straight edges of the checkerboard grid to appear curved. In Figure 8, we can see that our learned parameters allow the rectified grid to track the true underlying checkerboard closely.

| Method | Camera | Abs Rel | Sq Rel | RMSE | $\delta < 1.25$ |
|---|---|---|---|---|---|
| gordon2019depth | K | 0.129 | 0.982 | 5.23 | |
| gordon2019depth | L(P) | 0.128 | 0.959 | 5.23 | |
| gordon2019depth | K | 0.137 | 0.987 | 5.33 | 0.830 |
| vasiljevic2020neural | L(NRS) | 0.134 | 0.952 | 5.26 | 0.832 |
| Ours | L(P) | 0.129 | 0.893 | 4.96 | 0.846 |
| Ours | L(UCM) | 0.126 | 0.951 | 4.89 | 0.858 |

TABLE IV: Quantitative depth evaluation on the KITTI [geiger2012we] dataset, using the standard Eigen split and the Garg crop, for distances up to 80m (with median scaling). K and L(·) denote known and learned intrinsics, respectively.
| Method | Camera | Abs Rel | Sq Rel | RMSE | $\delta < 1.25$ |
|---|---|---|---|---|---|
| gordon2019depth | PB | 0.332 | 0.389 | 0.971 | 0.420 |
| vasiljevic2020neural¹ | NRS | 0.303 | 0.056 | 0.154 | 0.556 |
| Ours | UCM | 0.282 | 0.048 | 0.141 | 0.591 |
| Ours | EUCM | 0.278 | 0.047 | 0.135 | 0.598 |
| Ours | FOV | 0.316 | 0.063 | 0.159 | 0.523 |
| Ours | DS | 0.278 | 0.049 | 0.141 | 0.584 |

¹ The paper did not evaluate on this dataset; we used the training code available at X to retrain the model on EuRoC.

TABLE V: Quantitative depth evaluation of different methods on the EuRoC [burri2016euroc] dataset, using the evaluation procedure in [gordon2019depth] with center cropping. The training data consists of “Machine Room” sequences and the evaluation is on the “Vicon Room 201” sequence (with median scaling).

IV-F Depth Estimation

While in this work depth and pose estimation are only proxy tasks for camera self-calibration, the unified camera model framework allows us to achieve meaningful results compared to prior camera-learning based approaches (see Figures 16, 19).

KITTI results. Our results on this dataset are presented in Table IV. We note that our approach is able to model the simple pinhole setting, achieving results which are on par with related work tailored specifically for this geometry. Interestingly, we record an increase in performance when using the UCM model, which we attribute to the ability to further account for and correct calibration errors.

EuRoC results. EuRoC is a significantly more challenging setting than KITTI, involving cluttered indoor sequences with 6DoF motion. Compared to the per-frame distorted camera models of Gordon et al. [gordon2019depth] and Vasiljevic et al. [vasiljevic2020neural] (see Table II), we achieve significantly better absolute relative error, especially with EUCM, where it drops from 0.332 and 0.303, respectively, to 0.278 (Table V). We also train NRS [vasiljevic2020neural] on this dataset for further comparison, using the official repository.
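The depth metrics reported in the tables follow the usual monocular depth evaluation protocol. The sketch below shows how Abs Rel, Sq Rel, RMSE and the $\delta < 1.25$ accuracy are commonly computed with median scaling, on flat lists of depth values. It is an assumption-laden simplification: cropping and prediction clamping are omitted, and the function name and sample values are hypothetical.

```python
import math
from statistics import median


def depth_metrics(pred, gt, max_depth=None):
    """Standard depth metrics after median scaling of the predictions."""
    # Keep only pixels with valid ground truth (optionally capped in range).
    pairs = [(p, g) for p, g in zip(pred, gt)
             if g > 0 and (max_depth is None or g <= max_depth)]
    # Median scaling resolves the unknown scale of monocular predictions.
    scale = median(g for _, g in pairs) / median(p for p, _ in pairs)
    pairs = [(p * scale, g) for p, g in pairs]
    n = len(pairs)
    return {
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "delta1": sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n,
    }


# A prediction that is correct up to a global scale scores perfectly.
gt = [1.0, 2.0, 4.0, 8.0]
pred = [0.5 * g for g in gt]
metrics = depth_metrics(pred, gt, max_depth=80.0)
```

Lower is better for the first three metrics and higher is better for the accuracy threshold, which is why the last column of the tables moves in the opposite direction.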

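| Dataset | Abs Rel | Sq Rel | RMSE | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$ |
|---|---|---|---|---|---|---|
| EuRoC [gordon2019depth] | 0.265 | 0.042 | 0.130 | 0.600 | 0.882 | 0.966 |
| EuRoC+KITTI | 0.244 | 0.044 | 0.117 | 0.742 | 0.907 | 0.961 |

TABLE VI: Quantitative multi-dataset depth evaluation on EuRoC (without cropping and with median scaling).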
Fig. 16: Self-supervised monocular pointcloud for EuRoC, obtained by unprojecting predicted depth with our learned camera parameters (input image on the bottom right).

Combining heterogeneous datasets. One of the strengths of the unified camera model is that it can represent a wide variety of cameras without prior knowledge. As long as we know which sequences come from which camera, we can learn separate calibration vectors that share the same depth and pose networks. This is particularly useful as a way to improve performance on smaller datasets, since it enables the introduction of unlabeled data from other sources. To evaluate this property, we experimented with mixing KITTI and EuRoC. In this experiment, we reshaped the KITTI images to match those in the EuRoC dataset, and found that we could improve EuRoC depth evaluation (see Table VI).
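The bookkeeping this requires is small: one parameter vector per camera, looked up by the identity of the camera that produced each sequence, while the networks stay shared. The sketch below illustrates the idea in plain Python. The class name, the camera identifiers, the image sizes, and in particular the shape-only default initialization are hypothetical placeholders and not the paper's actual defaults.

```python
def default_ucm_params(width, height):
    """Hypothetical shape-only initialization (not the paper's actual defaults)."""
    focal = 0.5 * (width + height)
    return {"fx": focal, "fy": focal, "cx": width / 2.0, "cy": height / 2.0,
            "alpha": 0.5}


class CameraBank:
    """Holds one learnable parameter set per camera; networks are shared elsewhere."""

    def __init__(self):
        self._params = {}

    def get(self, camera_id, width, height):
        # Create the parameter set on first use, then always return the same one
        # so that updates accumulate per camera rather than per frame.
        if camera_id not in self._params:
            self._params[camera_id] = default_ucm_params(width, height)
        return self._params[camera_id]


bank = CameraBank()
euroc = bank.get("euroc_cam0", 640, 480)   # illustrative image size
kitti = bank.get("kitti_cam2", 640, 192)   # illustrative image size
euroc["alpha"] = 0.6                        # stands in for a gradient update
```

Because each camera owns its own entry, an update driven by sequences from one sensor never disturbs the calibration of another.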

IV-G Computational Cost

Our work is closely related to the learned general camera model (NRS) of Vasiljevic et al. [vasiljevic2020neural], given that in both works the parameters of a central general camera model are learned in a self-supervised way. NRS, being a per-pixel model, is more general than ours, and can handle settings with local distortion that a global camera model necessarily cannot capture. However, the computational requirements of the per-pixel NRS are significantly higher. For example, our training on EuRoC images consumes about 6 GB of GPU memory.

On the same GPU, NRS uses 16 GB of GPU memory with a batch size of one to train on the same sequences, running one epoch in about two hours. This is due to the high dimensional (yet approximate) projection operation required for a generalized camera. Thus, we trade some degree of generality for significantly higher efficiency than prior work, with higher accuracy on the EuRoC dataset (see Table V).

(a) EuRoC
(b) OmniCam
Fig. 19: Qualitative depth estimation results on non-pinhole datasets with (a) fisheye and (b) catadioptric images.

V Conclusion

We proposed a procedure to self-calibrate a family of general camera models using self-supervised depth and pose estimation as a proxy task. We rigorously evaluated the quality of the resulting camera models, demonstrating sub-pixel calibration accuracy comparable to manual target-based toolbox calibration approaches. Our approach generates per-sequence camera parameters, and can be integrated into any learning procedure where calibration is needed and the projection and un-projection operations are interpretable and differentiable. As shown in our experiments, our approach is particularly amenable to online re-calibration, and can be used to combine datasets of different sources, learning independent calibration parameters while sharing the same depth and pose network.