Efficient Pose Selection for Interactive Camera Calibration
The choice of poses for camera calibration with planar patterns is rarely considered, yet the calibration precision depends heavily on it. This work presents a pose selection method that finds a compact and robust set of calibration poses and is suitable for interactive calibration. Singular poses that would lead to an unreliable solution are avoided explicitly, while poses that reduce the uncertainty of the calibration are favoured. For this, we use uncertainty propagation. Our method takes advantage of a self-identifying calibration pattern to track the camera pose in real time. This allows us to iteratively guide the user to the target poses until the desired quality level is reached. Therefore, only a sparse set of key-frames is needed for calibration. The method is evaluated on separate training and testing sets, as well as on synthetic data. Our approach performs better than comparable solutions while requiring 30% fewer calibration frames.
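We will use the pinhole camera model that, given the camera orientation $R$, position $t$ and the parameter vector $C$, maps a 3D world point $P = [X, Y, Z]$ to a 2D image point $p = [x, y]$:

$$p = \pi(P; R, t, C) = K \, \Delta\!\left(\frac{1}{Z_c} [R \;\, t] \, P\right) \qquad (1)$$

Here $[R \;\, t]$ is a 3x4 affine transformation, $Z_c$ denotes the depth of $P$ after the affine transformation, and $K$ is the camera calibration matrix containing the focal lengths $f_x, f_y$ (and thus the aspect ratio) and the principal point $[c_x, c_y]$. Zhang also includes a skew parameter; however, for CCD cameras it is safe to assume it to be zero [12, 6]. $\Delta(\cdot)$ models the commonly used radial (2a) and tangential (2b) lens distortions as

$$\Delta(p) = p \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \qquad (2a)$$
$$\qquad\quad + \begin{bmatrix} 2 p_1 x y + p_2 (r^2 + 2 x^2) \\ p_1 (r^2 + 2 y^2) + 2 p_2 x y \end{bmatrix} \qquad (2b)$$

where $r = \sqrt{x^2 + y^2}$, so that $C = [f_x, f_y, c_x, c_y, k_1, k_2, k_3, p_1, p_2]$.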
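The projection model described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `distort` and `project` are hypothetical, and the distortion polynomial assumes the widely used radial/tangential form with coefficients `k1, k2, k3, p1, p2`.

```python
def distort(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Apply radial and tangential distortion to normalized image coordinates.

    Assumption: radial/tangential polynomial with three radial and two
    tangential coefficients.
    """
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


def project(P, R, t, fx, fy, cx, cy, dist=()):
    """Map a 3D world point P to pixel coordinates under the pinhole model."""
    # Rigid transform into the camera frame: Pc = R * P + t
    Xc, Yc, Zc = (sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3))
    # Perspective division by the depth, then lens distortion
    x, y = distort(Xc / Zc, Yc / Zc, *dist)
    # Apply the calibration matrix K (zero skew)
    return fx * x + cx, fy * y + cy


# Hypothetical example values: identity rotation, pattern 2 units in front of the camera
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project([0.2, -0.1, 0.0], I3, [0.0, 0.0, 2.0], 1000.0, 1000.0, 640.0, 360.0)
```

Passing `dist=(k1, k2, k3, p1, p2)` enables the distortion terms; with the default empty tuple the model reduces to the ideal pinhole camera.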
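Given $M$ images each containing $N$ point correspondences, the underlying calibration method minimizes the geometric error

$$\epsilon_{res} = \sum_{i}^{N} \sum_{j}^{M} \left\| p_{ij} - \pi(P_i; R_j, t_j, C) \right\|^2 \qquad (3)$$

where $\pi(\cdot)$ is the projection of Eq. (1), $p_{ij}$ is an observed (noisy) 2D point in image $j$ and $P_i$ is the corresponding 3D object point.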
Eq. (3) is also referred to as the reprojection error and is often used to assess the quality of a calibration. Yet it only measures the residual error and is subject to over-fitting, particularly when only the minimal number of point correspondences is used [6, §7.1].
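As a small illustration of the residual error, the following sketch computes the sum of squared distances between observed and re-projected points, together with the RMS value that is commonly reported in pixels. The function names are hypothetical, and the predicted points are assumed to come from some projection routine that is not shown here.

```python
import math


def residual_error(observed, predicted):
    """Sum of squared 2D distances between observed and re-projected points."""
    return sum((ox - px) ** 2 + (oy - py) ** 2
               for (ox, oy), (px, py) in zip(observed, predicted))


def rms_error(observed, predicted):
    """Root-mean-square reprojection error in pixels."""
    return math.sqrt(residual_error(observed, predicted) / len(observed))
```

A low value only states that the model fits the points it was estimated from; it does not by itself indicate that the parameters are close to the ground truth.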
The actual objective of calibration, however, is the estimation error $\epsilon_{est}$, i.e. the distance between the solution and the (unknown) ground truth. Richardson et al. propose the Max ERE as an alternative metric that correlates with the estimation error and has a similar value range (pixels). However, it requires sampling and re-projecting the current solution. For user guidance and monitoring of convergence, only the relative error of the parameters is needed. Therefore, we directly use the variance $\sigma_C^2$ of the estimated parameters. In particular, we use the index of dispersion (IOD), i.e. the variance of a parameter divided by its value, $\sigma_i^2 / C_i$, to ensure comparability of the parameters among each other.
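Given the covariance $\Sigma_x$ of the image points, the backward transport of covariance [6, §5.2.3] is used to obtain

$$\Sigma_v = \left(J^T \Sigma_x^{-1} J\right)^{+} \qquad (4)$$

where $J$ is the Jacobian matrix of the projection with respect to the unknowns, $v$ is the vector of unknowns and $(\cdot)^{+}$ denotes the pseudo inverse. For simplicity and because of the lack of prior knowledge, we assume a standard deviation of 1px in each coordinate direction for the image points, thus $\Sigma_x = I$ and (4) reduces to $\Sigma_v = (J^T J)^{+}$.
The diagonal entries of $\Sigma_v$ contain the variance of the estimated $C$. $J$ is already computed in the Levenberg-Marquardt step of the calibration method.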
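To make the uncertainty propagation concrete, the sketch below evaluates the covariance for unit measurement noise on a toy problem with only two unknowns, so that the matrix inverse can be written in closed form. The two-parameter model and both function names are assumptions made purely for illustration; the real calibration has many more unknowns and would use a general (pseudo) inverse.

```python
def covariance_two_params(J):
    """Return (J^T J)^-1 for a Jacobian with exactly two columns.

    Assumes unit measurement covariance (1 px standard deviation).
    """
    a = sum(row[0] * row[0] for row in J)
    b = sum(row[0] * row[1] for row in J)
    d = sum(row[1] * row[1] for row in J)
    det = a * d - b * b
    if abs(det) < 1e-12:
        # Degenerate configuration: the measurements do not constrain both unknowns
        raise ValueError("J^T J is singular")
    return [[d / det, -b / det], [-b / det, a / det]]


def index_of_dispersion(variance, value):
    """Variance normalized by the magnitude of the (non-zero) estimate."""
    return variance / abs(value)


# Toy Jacobian of a line model y = m * x + c sampled at x = 0, 1, 2
J = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
cov = covariance_two_params(J)
variances = [cov[0][0], cov[1][1]]
```

A Jacobian whose columns are linearly dependent raises the error above, which is the toy analogue of a singular calibration pose.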
Our approach works with any planar calibration target, e.g. the common chessboard and circle grid patterns. However, for interactive user guidance a fast board detection is crucial. Therefore, we use the self-identifying ChArUco pattern as implemented in OpenCV. Compared to the classical chessboard, this saves the time-consuming ordering of the detected rectangles into a canonical topology. Alternatively, any of the recently developed self-identifying targets [1, 2, 4] can be used here.
The pattern size is set to 9x6 squares, resulting in up to 40 measurements at the chessboard joints per captured frame. This allows the initialization to complete successfully even if not all markers are detected, as discussed in Section 3.3.
The core idea of our approach is to explicitly specify individual key-frames which are used for calibration using the method of Zhang.
In this section, the relation of intrinsic parameters and board poses is discussed first to motivate our split of the parameter vector into pinhole and distortion parameters. For each parameter group we then present our set of rules to generate an optimal pose while explicitly avoiding degenerate configurations.
Looking at Eq. (1), we see that both $K$ and $\Delta$ are applied after projection and thus describe 2D-to-2D mappings. Therefore, one might consider estimating $C$ from just one board pose that uniformly samples the image. However, as both intrinsic and extrinsic parameters are estimated simultaneously by the calibration method, ambiguities arise.
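Assuming $R = I$, i.e. a pattern parallel to the image plane, and the distortion parameters to be zero, by multiplying out (1) we get

$$p = \begin{bmatrix} f_x (X + t_x) / t_z + c_x \\ f_y (Y + t_y) / t_z + c_y \end{bmatrix}$$

for all pattern points $P = [X, Y, 0]$. In this case there are two ambiguities between
the focal length $f$ and the distance to the camera $t_z$, and
the in-plane translation $[t_x, t_y]$ and the principal point $[c_x, c_y]$.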
These ambiguities can be resolved by requiring the pattern to be tilted towards the image plane such that only one set of pinhole parameters satisfies Eq. (1) for all pattern points.
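The two ambiguities can be checked numerically. The sketch below assumes a fronto-parallel pattern (identity rotation, pattern points at Z = 0), no distortion and a single focal length; the function name and all numeric values are hypothetical.

```python
def project_frontoparallel(X, Y, f, cx, cy, tx, ty, tz):
    """Project the planar point (X, Y, 0) with R = I and zero distortion."""
    return f * (X + tx) / tz + cx, f * (Y + ty) / tz + cy


pattern = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]

# Reference camera and pose
ref = [project_frontoparallel(X, Y, 1000.0, 640.0, 360.0, 0.1, 0.0, 2.0)
       for X, Y in pattern]

# Ambiguity 1: doubling the focal length and the distance yields the same image
scaled = [project_frontoparallel(X, Y, 2000.0, 640.0, 360.0, 0.1, 0.0, 4.0)
          for X, Y in pattern]

# Ambiguity 2: an in-plane shift of 0.1 is absorbed by moving the principal
# point by -f * 0.1 / tz = -50 px
shifted = [project_frontoparallel(X, Y, 1000.0, 590.0, 360.0, 0.2, 0.0, 2.0)
           for X, Y in pattern]
```

All three point lists agree up to floating-point rounding, so a single fronto-parallel view cannot distinguish between these parameter sets. Tilting the pattern makes the depth vary across the pattern points, which removes both ambiguities.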
Considering the distortion parameters of $\Delta$ on the other hand, there are no similar ambiguities due to the non-linearity of the mapping. Rather, the parameters are determined by the maximal distortion strength evident in the image. Here it is more important to accurately measure the distortion in the corresponding image regions (see panel (a) of the distortion map figure).
Therefore, we split the parameter vector $C$ into the pinhole parameters $C_K$ and the distortion parameters $C_\Delta$ and consider each group separately.
While optimizing the pinhole parameters $C_K$, singular poses must be avoided. In addition to the case discussed above, we incorporate the singular cases identified in prior work. In particular, we restrict the 3D configuration of the calibration pattern as follows:
The pattern must not be parallel to the image plane.
The pattern must not be parallel to one of the image axes.
Given two patterns, the "reflection constraint" must be fulfilled. This means that the vanishing lines of the two planes are not reflections of each other along both a horizontal and a vertical line in the image.
These restrictions ensure that each pose adds information that further constrains the pinhole parameters.
As described in Section 2.1, each parameter group requires a different strategy to generate an optimal calibration pose.
We choose a distance such that the whole pattern is visible, maximising the number of observed 2D points.
The resulting pose would still be parallel to one of the image axes, which prevents the estimation of the principal point along that axis. Therefore, the resulting view is additionally rotated, which implements this requirement while keeping the principal orientation.
When determining the principal point, the view is further shifted along the respective image axis by 5% of the image size. This increases the spread along that axis and led to faster convergence in our experiments.
For the distortion parameters the goal is to increase sampling accuracy in image regions exhibiting strong distortions. For this we generate a distortion map based on the current calibration estimate that encodes the displacement for each pixel. Using this map we search for the distorted regions as follows:
Threshold the distortion map (see panel (a) of the distortion map figure) to find the region with the strongest distortion.
Given the thresholded image, an axis-aligned bounding box (AABB) is fitted to the region, corresponding to a parallel view of the pattern. Note that the constraints for the pinhole parameters do not apply here.
The area covered by the AABB is excluded from subsequent searches (see Figure 3). Effectively, the distorted regions are thereby visited in order of distortion strength.
The pattern is aligned with the top-left corner of the AABB and positioned at a depth such that its projection covers 33% of the image width.
The angular range and width limits mentioned above were set such that the calibration pattern could be reliably detected using the Logitech C525 camera.
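The search for distorted regions described above can be sketched as follows. This is an illustrative simplification: the distortion map is a plain 2D list of displacement magnitudes, the threshold is taken relative to the strongest remaining value, and the region is grown by a 4-connected flood fill from that peak. The function name, the flood fill and the `rel_threshold` parameter are assumptions of this sketch.

```python
def strongest_region_aabb(dist_map, rel_threshold=0.7, excluded=()):
    """Return the AABB (x0, y0, x1, y1) around the strongest distorted region.

    `excluded` holds AABBs from earlier searches; their cells are skipped.
    """
    h, w = len(dist_map), len(dist_map[0])
    blocked = set()
    for x0, y0, x1, y1 in excluded:
        blocked.update((x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1))
    free = [(dist_map[y][x], x, y)
            for y in range(h) for x in range(w) if (x, y) not in blocked]
    if not free:
        return None
    peak, sx, sy = max(free)
    limit = rel_threshold * peak
    # Grow the thresholded region around the peak (4-connected flood fill)
    stack, region = [(sx, sy)], set()
    while stack:
        x, y = stack.pop()
        if (x, y) in region or (x, y) in blocked:
            continue
        if not (0 <= x < w and 0 <= y < h) or dist_map[y][x] < limit:
            continue
        region.add((x, y))
        stack.extend([(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)])
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical 4x4 distortion map with a strong top-left and a weaker bottom-right region
dist_map = [[9, 8, 1, 1],
            [8, 7, 1, 1],
            [1, 1, 1, 1],
            [1, 1, 1, 6]]
first = strongest_region_aabb(dist_map)
second = strongest_region_aabb(dist_map, excluded=[first])
```

Feeding each returned box back through `excluded` visits the regions in order of decreasing distortion strength.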
The underlying calibration method requires at least two views of the pattern for an initial solution, which we select as follows:
Without any prior knowledge we aim at a uniform sampling for estimating the intrinsic parameters. To this end we compute a pose such that the pattern is parallel to the image plane and covers the whole view. While this violates the axis alignment requirements for pinhole-parameter poses, it still provides extra information as it is not coplanar to the first pose. Furthermore, the reflection constraint is fulfilled.
To render an accurate overlay for the first pose without prior knowledge of the camera used, we employ a bootstrapping strategy similar to prior work: if the pattern can be detected, we perform a single-frame calibration estimating the focal length only; the principal point is fixed at the center and the distortion is set to zero.
In the following we present the parameter refinement and user guidance parts as well as any employed heuristics. This completes the calibration pipeline as used for the real data experiments.
After obtaining an initial solution using two key-frames, the goal is to minimize the cumulated variance of the estimated parameters $C$. We approach this problem by targeting the variance of a single parameter at a time: we pick the parameter with the highest index of dispersion (MaxIOD). Depending on the parameter group, a pose is then generated as described in Section 2.
For determining convergence, we use a ratio test of the parameter variance: if the relative reduction of the variance is below a given threshold, we assume the parameter to have converged and exclude it from further refinement. Here, we only consider parameters from the same group, as there is typically only little reduction in the complementary group. The calibration terminates once all parameters have converged.
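The parameter selection and the convergence test can be sketched as below. The function names and the dictionary-based interface are hypothetical. Using the raw variance for a parameter whose estimate is exactly zero is an assumption of this sketch, as is the default threshold of 10%, which mirrors the value used in the evaluation.

```python
def pick_target(params, variances, converged=()):
    """Return the name of the non-converged parameter with the highest IOD."""
    best, best_iod = None, -1.0
    for name, value in params.items():
        if name in converged:
            continue
        var = variances[name]
        # Assumption: fall back to the raw variance when the estimate is zero
        iod = var / abs(value) if value != 0 else var
        if iod > best_iod:
            best, best_iod = name, iod
    return best


def has_converged(var_before, var_after, threshold=0.1):
    """Ratio test: the relative variance reduction is below the threshold."""
    return (var_before - var_after) / var_before < threshold


# Hypothetical estimates and variances after initialization
params = {"fx": 1000.0, "cx": 640.0, "k1": 0.1}
variances = {"fx": 25.0, "cx": 64.0, "k1": 0.001}
target = pick_target(params, variances)
```

After each new key-frame, `has_converged` would be evaluated for the targeted parameter and its name added to `converged` once the test succeeds.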
To guide the user, the targeted camera pose is projected using the current estimate of the intrinsic parameters. This projection is then displayed as an overlay on top of the live video stream (see Figure 1 and the video in the supplemental material).
To verify whether the user is sufficiently close to the target pose, we use the Jaccard index $J(A, B) = |A \cap B| / |A \cup B|$ (intersection over union), computed from the area $A$ covered by the projection of the pattern from the target pose and the area $B$ covered by the projection from the current pose estimate. We assume that the user has reached the desired pose if $J(A, B)$ exceeds a fixed threshold.
Comparing the projection overlap instead of using the estimated pose directly is more robust since the pose estimate is often unreliable — especially during initialization.
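A minimal sketch of the overlap test follows. For brevity, the projected pattern areas are approximated by axis-aligned rectangles rasterized into sets of pixel coordinates; in practice the projections are general quadrilaterals. Both function names and the example rectangles are hypothetical.

```python
def rasterize_rect(x0, y0, x1, y1):
    """Return the set of integer pixel coordinates covered by a rectangle."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def jaccard(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical target overlay and current pattern projection
target_area = rasterize_rect(0, 0, 10, 10)
current_area = rasterize_rect(5, 0, 15, 10)
overlap = jaccard(target_area, current_area)
```

The pose would be accepted once `overlap` exceeds the chosen acceptance threshold.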
Throughout the process we enforce the common heuristic [6, §7.2] that the number of constraints should exceed the number of unknowns by a factor of five. The calibration method used not only estimates the intrinsic parameters $C$, but also the relative pose of model plane and image plane, i.e. the parameters $R$, a 3D rotation, and $t$, a 3D translation. When using $M$ calibration images we thus have $9 + 6M$ unknowns, and each point correspondence provides two constraints. For initialization ($M = 2$) we thus have 21 unknowns, meaning 53 point correspondences are needed in total, or 27 correspondences per frame. For any subsequent frame only 15 points are required.
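The counting argument can be verified with a few lines of arithmetic. The sketch assumes nine intrinsic parameters and six pose parameters per frame, which reproduces the per-frame counts of 27 and 15 quoted above; the function names are hypothetical.

```python
import math


def points_per_init_frame(n_intrinsics=9, n_frames=2, factor=5):
    """Minimum correspondences per frame when solving for all unknowns at once."""
    unknowns = n_intrinsics + 6 * n_frames          # intrinsics + one pose per frame
    total_points = math.ceil(factor * unknowns / 2)  # two constraints per point
    return math.ceil(total_points / n_frames)


def points_per_additional_frame(factor=5):
    """Minimum correspondences for a frame that adds only six pose unknowns."""
    return math.ceil(factor * 6 / 2)
```

With the default arguments, the initialization requires 21 unknowns times 5, i.e. 105 constraints, which corresponds to 53 points in total.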
To prevent inaccurate measurements due to motion blur and rolling shutter artifacts, the pattern should be still. To ensure this, we require all points to be re-detected in the consecutive frame and the mean motion of the points to be smaller than an empirically determined threshold (in pixels).
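The stillness check can be sketched as follows, assuming that detections are stored as dictionaries mapping a point identifier to its pixel position. The function name and the `max_mean_motion` parameter are hypothetical; the threshold has to be determined empirically for the camera at hand.

```python
import math


def is_still(prev_pts, curr_pts, max_mean_motion):
    """Check that all points were re-detected and moved little on average."""
    # Every point of the previous frame must be found again
    if not prev_pts or any(pid not in curr_pts for pid in prev_pts):
        return False
    motion = [math.hypot(curr_pts[pid][0] - x, curr_pts[pid][1] - y)
              for pid, (x, y) in prev_pts.items()]
    return sum(motion) / len(motion) < max_mean_motion
```

A frame would only be accepted as a key-frame if this check passes for two consecutive detections.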
The presented method was evaluated on both synthetic and real data. The synthetic experiments aimed at validating the parameter splitting and pose generation rules presented in Section 2, while the real data was used for comparison with other methods. Furthermore, the compactness of the results with real data was estimated by optimizing directly on the testing set.
We performed multiple calibrations, each using 20 synthetic images. The first two camera poses were chosen as described in Section 2.4 to allow a rough initial solution. The next 8 poses were chosen to optimize one parameter group, while the last 10 poses optimized the other group (and vice versa).
The camera parameters were based on the calibration parameters of a Logitech C525 camera. However, the actual parameters were sampled around these reference values using a covariance matrix that allowed 10% deviation for each of the parameters.
Therefore, each synthetic calibration corresponds to using a different camera C with known ground truth parameters. To allow generalization to different camera models, we kept the above pose generation sequence, but used 20 different cameras C.
Figure 4 shows the mean standard deviation of the parameters. Notably, there is a significant drop in the standard deviation if and only if a pose matching the parameter group is used.
We also evaluated the use of MaxIOD as an error metric by comparing it to MaxERE and a known estimation error $\epsilon_{est}$. Just like the MaxERE, the MaxIOD correlates with $\epsilon_{est}$ (see panel (a) of the corresponding figure). Additionally, as panel (b) indicates, the IOD reduction is suitable for balancing calibration quality and the number of required calibration frames.
For evaluating our method with real images, we recorded a separate testing set consisting of 50 images at various distances and angles covering the whole field of view. All images were captured using a Logitech C525 webcam at a resolution of 1280x720 px. The autofocus was fixed throughout the whole evaluation, while exposure was fixed per sequence. Our method was compared to AprilCal and to calibrating without any pose restrictions using OpenCV.
We used the pattern described in Section 1.2, which provides 40 measurements per frame, for OpenCV as well as for our method. With AprilCal, we used the 5x7 AprilTag target, which generates approximately the same number of measurements.
The convergence threshold was set to 10% for our method and the stopping accuracy parameter of AprilCal was set to 2.0. As the OpenCV method does not provide convergence monitoring, we stopped calibration after 10 frames here.
Table 1 shows the mean results over 5 calibration runs for each method, measuring the required number of frames and the resulting errors. Here our method requires only 70% of the frames required by AprilCal while arriving at a 36% lower estimation error (64% compared to OpenCV).
The results in the previous section show that our method is able to provide the lowest calibration error while using fewer calibration frames than comparable approaches. However, it is not clear whether the solution uses the minimal number of frames or whether it is possible to use a subset of frames while arriving at the same calibration error.
Therefore, we further tested the compactness of our calibration result. We used a greedy algorithm that, given a set of frames captured by our method, tries to find a smaller subset. It optimizes for the testing set, directly minimizing the estimation error.
The algorithm proceeds as follows. Given a set of training images (the calibration sequence):
1. The initialization frames as described in Section 2.4 are added unconditionally.
2. Each of the remaining frames is individually added to the key-frame set and a calibration is computed.
3. For each calibration, the estimation error $\epsilon_{est}$ is computed using the testing frames.
4. The frame that minimizes $\epsilon_{est}$ is incorporated into the key-frame set. Continue at step 2.
5. Terminate if $\epsilon_{est}$ cannot be further reduced or all frames have been used.
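The greedy procedure above can be sketched as follows. The callback `estimation_error`, which is expected to calibrate on the given frames and evaluate the result on the testing set, is a hypothetical interface introduced for this sketch, and the toy error function is for illustration only.

```python
def greedy_subset(init_frames, candidates, estimation_error):
    """Greedily add the frame that lowers the estimation error the most."""
    selected = list(init_frames)   # initialization frames are always kept
    remaining = list(candidates)
    best_err = estimation_error(selected)
    while remaining:
        # Try each remaining frame individually and keep the best one
        err, frame = min(((estimation_error(selected + [f]), f) for f in remaining),
                         key=lambda pair: pair[0])
        if err >= best_err:
            break                  # no further reduction possible
        selected.append(frame)
        remaining.remove(frame)
        best_err = err
    return selected, best_err


# Toy stand-in for "calibrate and evaluate": frames are numbers, the error is
# the distance of their sum from 10
def toy_error(frames):
    return abs(10 - sum(frames))


subset, err = greedy_subset([1, 2], [4, 2, 6], toy_error)
```

Each iteration requires one calibration per remaining frame, so the cost grows quadratically with the length of the calibration sequence.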
The greedy optimal solution requires 75% of the frames compared to the proposed method while keeping the same estimation error (see Table 1). This indicates that, while a significant improvement over the compared methods, our method is not yet optimal in the compactness sense. Note that the greedy algorithm requires an a-priori recorded testing set and only finds a minimal subset of an existing calibration sequence, but cannot generate any calibration poses.
We performed an informal survey among 5 co-workers to measure the required calibration time when using our method. The tool was used for the first time and the only given instruction was that the overlay should be matched with the calibration pattern. The camera was fixed and the pattern had to be moved. On average the users required 1:33 min for capturing 8.7 frames.
We have presented a calibration approach to generate a compact set of calibration frames that is suitable for interactive user guidance. Singular pose configurations are avoided such that capturing about 9 key-frames is sufficient for a precise calibration. This is 30% fewer than comparable solutions. The provided user guidance allows even inexperienced users to perform the calibration in less than 2 minutes. Calibration precision can be weighted against the required calibration time using the convergence threshold. The camera parameter uncertainty is monitored throughout the process, ensuring that a given confidence level can be reached repeatedly.
Our evaluation shows that the number of required frames can still be reduced to speed up the process even more. We only use a widespread and simple distortion model; additional distortion coefficients such as thin prism, rational and tilted sensor are to be considered in future work. Eventually, one could incorporate a detection of unused parameters. This would allow starting with the most complex distortion model, which could be gradually reduced during calibration.
Furthermore, the method needs adaptation to special cases such as microscopy, where the depth of field limits the possible calibration angles, or calibration at large distances, where scaling the pattern accordingly is not desirable.
The OpenCV based implementation of the presented algorithm is available open-source at https://github.com/paroj/pose_calib.