Dynamic Event Camera Calibration

07/14/2021
by Kun Huang, et al.

Camera calibration is an important prerequisite towards the solution of 3D computer vision problems. Traditional methods rely on static images of a calibration pattern. This raises interesting challenges towards the practical usage of event cameras, which notably require image change to produce sufficient measurements. The current standard for event camera calibration therefore consists of using flashing patterns. They have the advantage of simultaneously triggering events in all reprojected pattern feature locations, but it is difficult to construct or use such patterns in the field. We present the first dynamic event camera calibration algorithm. It calibrates directly from events captured during relative motion between camera and calibration pattern. The method is propelled by a novel feature extraction mechanism for calibration patterns, and leverages existing calibration tools before optimizing all parameters through a multi-segment continuous-time formulation. As demonstrated through our results on real data, the obtained calibration method is highly convenient and reliably calibrates from data sequences spanning less than 10 seconds.


I Introduction

Over the past decade, we have seen the emergence of several intelligent mobile devices such as smart vehicles, intelligence augmentation devices, or factory and service automation robots. Such devices need to move either actively or passively in the real world and gain an understanding of the geometry of the environment all while keeping track of location. The resulting problem is primarily a geometric perception problem, the solution to which is commonly achieved using simple cameras. Although a large body of research has already led to a certain level of maturity, vision-based methods keep facing challenges in scenarios with high dynamics, low texture distinctiveness, or challenging illumination conditions. Dynamic vision sensors, also called event cameras, present an interesting and innovative alternative in this regard. They independently measure pixel-level logarithmic brightness changes at high temporal resolution and high dynamic range. The advantages and challenges of event-based vision are well explained by the original work of Brandli et al. [5] as well as the recent survey by Gallego et al. [13]. The present work addresses intrinsic calibration of event cameras, a fundamental problem affecting potential future use in 3D vision applications such as [27, 28, 40, 39, 17, 24, 25, 36].

Fig. 1: Concept of our event camera calibration framework. The pattern is rendered visible by keeping the camera under motion. Well-distributed sub-segments of the trajectory are chosen and optimized along with the intrinsic camera parameters.

The physical structure of an event camera is entirely comparable to a regular camera as it is composed of a lens in front of an imaging plane. The efficient usage of event cameras towards the solution of 3D reconstruction problems therefore requires the prior identification of a similar set of intrinsic calibration parameters (e.g. focal length, principal point, etc.). For regular cameras, such calibration procedures are commonly realized by using static images captured in front of a planar calibration target with known features. These features are then re-identified in each image, from which we can obtain a sufficient number of 2D-to-2D correspondences to identify the pose of each view as well as the intrinsic parameters. The problem with calibrating an event camera is that—while such calibration procedures are entirely mature—they are not immediately applicable to event streams for the obvious reason that event cameras do not produce any information when neither camera nor pattern moves. In order to produce frame-like images from events that readily permit the quasi-instantaneous reidentification of pattern feature reprojections under an identical camera pose, the calibration of event cameras requires a modified pattern with simultaneously flashing features. Even if kept still, local accumulations of events can then be used to detect the points of interest and create the necessary input data for reusing traditional calibration methods.
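For reference, such a conventional image-based calibration can be scripted with a few OpenCV calls. The sketch below is a minimal example, assuming a set of grayscale frames of an asymmetric circle grid; the grid dimensions, spacing, and file names are placeholder assumptions, not values from the paper.

```python
# Minimal sketch of conventional circle-grid calibration with OpenCV.
# Grid size, spacing, and file names are illustrative placeholders.
import cv2
import numpy as np

GRID = (4, 11)      # asymmetric circle grid: (points per row, rows) -- placeholder
SPACING = 0.02      # pattern scale in metres -- placeholder

# 3D pattern points on the z = 0 plane; the ordering/scale must match the
# physical pattern and the detection order returned by findCirclesGrid.
objp = np.array([[(2 * c + r % 2) * SPACING, r * SPACING, 0]
                 for r in range(GRID[1]) for c in range(GRID[0])], np.float32)

obj_pts, img_pts, size = [], [], None
for path in ["frame_000.png", "frame_001.png"]:        # placeholder file names
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = img.shape[::-1]
    ok, centers = cv2.findCirclesGrid(img, GRID, flags=cv2.CALIB_CB_ASYMMETRIC_GRID)
    if ok:
        obj_pts.append(objp)
        img_pts.append(centers)

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
print("RMS reprojection error:", rms, "\nK =\n", K)
```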

Although this idea sounds relatively straightforward, it suffers from two important practical drawbacks:

  • Manipulation: The most straightforward way of obtaining a flashing pattern is to visualize the pattern on a screen. However, it is sometimes desirable to move the pattern rather than the camera, which makes the usage of screens impracticable. It is furthermore difficult to ensure mechanical properties of a screen such as flatness and stiffness, especially as the screen becomes larger.

  • Construction: An alternative is given by using an array of LEDs that are triggered simultaneously by the same circuitry. While such constructions are possible, they are hard and expensive to produce given the high accuracy requirements on the placement of the LEDs.

We present dynamic event camera calibration, a convenient framework which relies on the following insights and techniques:

  • We show that it is in fact unnecessary to capture a flashing target from static camera views. A proper choice of a regular calibration target lets us easily design a robust feature extractor that works as soon as the camera or pattern is under motion.

  • After initialization using off-the-shelf techniques, our calibration concludes with a continuous-time trajectory-fragment based motion compensation scheme in which each event is accounted for with its individual time stamp.

  • The camera model can be flexibly interchanged, thus permitting the calibration of perspective cameras with or without distortion as well as more exotic fisheye, omni-directional, or catadioptric cameras.

Our method has the following advantages:

  • It relies on a regular calibration pattern. No extra efforts for making a calibration target are required.

  • The camera or calibration target can be moved at considerable speed, thus permitting accurate and convenient calibration from data sequences as short as 10s.

  • Owing to the nature of event cameras, little attention has to be paid to illumination conditions.

  • Owing to the fact that we use a common calibration target, the method easily permits the extrinsic calibration to other, possibly regular cameras.

Our fully integrated, ready-to-use event camera calibration framework builds upon existing tools and will be released to the community.

Our paper is structured as follows. Section II reviews further related work. Section III then gives an overview of our calibration framework including the details of our feature extraction technique as well as the continuous-time back-end optimization technique. Section IV presents numerous tests on real camera-lens combinations, before Section V concludes.

II Related work

Camera calibration is an important topic in geometric vision. The various presented methods depend strongly on the employed camera projection model, the most prominent of which is the perspective camera model [16]. Though alternative methods exist [34], the most prominent camera calibration method for perspective cameras was introduced by Zhengyou Zhang [37, 38], a technique which found its way into Bouguet’s highly popular calibration toolbox [3] as well as a related OpenCV [4] implementation. These frameworks are able to handle perspective cameras both with and without distortions, and have furthermore served as a foundation for other toolboxes that are able to calibrate fisheye and catadioptric cameras [30].

As mentioned in the introduction, the current standard event camera calibration method [6, 22, 7] consists of using a flashing calibration pattern. Simple short term accumulations of events complemented by a corner extraction method can deliver the instantaneous corner locations of the pattern, and thus enable the reuse of all above-mentioned calibration toolboxes. In contrast, our proposed method is able to use conventional camera calibration targets.

Our procedure relies on feature detection in event streams. Several techniques for event-based corner extraction have already been presented  [8, 35, 20, 2, 1], and they could be used for detecting the corners of a checkerboard pattern. However, as we will explain, it is advantageous to use a grid pattern of circles. Our feature extraction method hence draws analogies with the event-based ball detection presented by Glover et al. [14]. It relies on noise-resilient cluster detection [11] as well as circle fitting [19].

The core of our contribution is given by a back-end motion compensation framework that relies on a continuous-time parametrization of the camera motion. The employed representation shares analogies with the B-spline-based [26] framework presented by Furgale et al. [12], and is complemented by the cubic spline interpolation technique presented in [18] as well as the B-spline adaptations to Lie groups presented by Sommer et al. [32] and Sommer et al. [31]. Another strongly related work is given by the rolling shutter camera calibration method presented by Oth et al. [23], who also use a continuous-time parametrization to calibrate camera-inherent parameters. Continuous-time parametrizations have also been used for event-based camera localization [21, 28]. However, we are, to the best of our knowledge, the first to apply the model to intrinsic event camera calibration.

III Dynamic event camera calibration

We present a novel procedure for calibrating event cameras using only a regular calibration pattern. More specifically, our method employs patterns that consist of a square grid of regularly spaced black, circular dots. The intuition behind this choice is that such features—owing to their radially symmetric nature—may be robustly extracted using a rotation-invariant feature detector as soon as local optical flow exceeds a certain level. More specifically, the event patterns generated by the circular dots are similar up to a rotation in the image plane, and notably given by two point-symmetric clusters of positive and negative events appearing at opposing poles of each circle. We exploit this fact in order to design an efficient circle extraction algorithm. The section is organized as follows. Section III-A provides an overview of our complete calibration framework. Section III-B presents our first important sub-module given by the feature extraction. Section III-C finally concludes with our novel continuous-time back-end optimization module that refines the initial result over both identified intrinsic parameters as well as the dynamic motion parameters.

Fig. 2: Block diagram of our method containing both the front-end initialization and the back-end optimization part.

III-A Framework overview

A high-level overview of our proposed calibration framework is illustrated in Figure 2. The algorithm groups events into sufficiently small temporal subsets on which we then perform feature extraction and pattern detection. The different steps are illustrated in Figure 3. Let t_s and t_e be defined as the timestamps of the first and last event within a temporal subset. Note that the interval duration is gradually increased as long as no pattern is detected and the number of events within the interval remains below a certain threshold. The event accumulation, which we here denote as a reference frame, is buffered once the pattern is successfully detected, or cancelled if the total number of events exceeds the said threshold (i.e. no robust detection is expected to happen) or the duration of the interval becomes too long (i.e. not enough motion, pattern not in field of view, etc.). We have

A = { e_k ∈ E | t_s ≤ t_k ≤ t_e },     (1)

where E denotes the incoming stream of events e_k with timestamps t_k, and A the event accumulation forming the reference frame.

Note that a minimum gap between two subsequent intervals is furthermore enforced, which causes the frames to be sufficiently distributed. The timestamp given to a reference frame is the center of the interval.
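To make the adaptive windowing concrete, the following sketch groups a time-sorted event stream into reference frames. It is a simplified rendition of the logic described above; all threshold names and values are illustrative assumptions, not the authors' settings.

```python
# Sketch of the adaptive event-accumulation loop described above.
# All threshold names/values are illustrative assumptions.
MAX_EVENTS   = 30000    # cancel if the interval collects more events than this
MAX_DURATION = 0.2      # seconds: cancel if the interval grows too long
MIN_GAP      = 0.5      # seconds: minimum spacing between buffered reference frames

def build_reference_frames(events, detect_pattern):
    """events: list of (x, y, t, polarity), sorted by t.
    detect_pattern: callable returning True if the pattern is found in a window."""
    frames, start, last_kept_t = [], 0, -float("inf")
    while start < len(events):
        end = start
        while end < len(events):
            end += 1                                  # gradually grow the interval
            window = events[start:end]
            duration = window[-1][2] - window[0][2]
            if len(window) > MAX_EVENTS or duration > MAX_DURATION:
                break                                 # cancel this interval
            if detect_pattern(window):
                t_ref = 0.5 * (window[0][2] + window[-1][2])  # timestamp = interval centre
                if t_ref - last_kept_t >= MIN_GAP:
                    frames.append((t_ref, window))    # buffer the reference frame
                    last_kept_t = t_ref
                break
        start = end                                   # continue after the current interval
    return frames
```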

Fig. 3: The different steps of our pattern detection method: Accumulated event frame (top left), clustering result (top right), circle extraction (bottom left), and pattern detection (bottom right).

The pattern detection has two steps. The first one is given by the initial circle extraction, the details of which are outlined in Section III-B. The second step consists of employing the standard calibration pattern detection module from OpenCV [4]. Once sufficient positive reference frames have been collected, the intrinsic and extrinsic parameters for each reference frame are initialized with an off-the-shelf tool such as the OpenCV implementation of Bouguet's toolbox [3]. Extrinsic parameters for each of the views are notably solved by utilizing a PnP method [9] within a RANSAC scheme, with approximate intrinsic parameters obtained by Zhang's method [38]. The initialization is completed by outlier removal and feature cross-validation techniques. The following strategies are applied:

  • Before adding a reference frame into the buffer, we verify that the orientation of the extracted pattern is sufficiently consistent with previous reference frames. The check is performed by calculating the inscribed angle of the directions of corresponding pattern rows in successive frames divided by the time that has elapsed between those frames (i.e. a measure of the rotational velocity of the pattern).

  • Once initial extrinsic parameters have been estimated, we approximate translational and angular velocities between successive frames, and again put a threshold on those values.

  • The feature cross-validation relies on the initial intrinsic and extrinsic parameters to reproject the pattern circles into each reference frame. We start by removing circles that lie outside of the image plane. We then recluster the events based on the newly re-projected circle locations, and refit the circles. To conclude, the newly fitted circles are compared against the projected pattern circles. If the two circles differ noticeably, the corresponding features are again dropped. Reference frames with too few remaining features are dropped as well (a sketch of this cross-validation step is given after this list).
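The following sketch illustrates the feature cross-validation step with OpenCV and NumPy. It is a minimal approximation under stated assumptions: the tolerance values and the simple algebraic circle fit stand in for the authors' exact implementation.

```python
# Hedged sketch of the feature cross-validation: reproject pattern circles,
# re-assign events, refit circles, and compare. Thresholds are illustrative.
import numpy as np
import cv2

def kasa_circle_fit(pts):
    """Algebraic (Kasa-style) circle fit: returns centre (cx, cy) and radius."""
    A = np.column_stack([2 * pts[:, 0], 2 * pts[:, 1], np.ones(len(pts))])
    b = (pts ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    cx, cy = sol[0], sol[1]
    return np.array([cx, cy]), np.sqrt(sol[2] + cx ** 2 + cy ** 2)

def cross_validate(pattern_pts3d, rvec, tvec, K, dist, events_xy, img_size,
                   center_tol=3.0):
    """Return indices of pattern circles that survive the consistency check."""
    proj, _ = cv2.projectPoints(pattern_pts3d, rvec, tvec, K, dist)
    proj = proj.reshape(-1, 2)
    # re-cluster: assign each event to its nearest reprojected circle centre
    d = np.linalg.norm(events_xy[:, None, :] - proj[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    keep = []
    for i, c in enumerate(proj):
        if not (0 <= c[0] < img_size[0] and 0 <= c[1] < img_size[1]):
            continue                      # circle outside the image plane
        members = events_xy[nearest == i]
        if len(members) < 5:
            continue                      # too few events to refit a circle
        fit_c, _ = kasa_circle_fit(members)
        if np.linalg.norm(fit_c - c) < center_tol:
            keep.append(i)                # refit circle agrees with the projection
    return keep
```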

Note that after the pattern detection is completed, each surviving event is assigned to exactly one reference frame and to exactly one of the circles on the pattern. The last step of the complete calibration procedure is given by an optimization module that directly uses the asynchronous events rather than the fitted circles, the location of which is only approximate owing to the temporal aggregation of the reference frames. The optimization objective consists of a multi-segment, continuous-time trajectory optimization that jointly optimizes over motion parameters and intrinsic camera parameters by minimizing the geometric distance between back-projected events and their corresponding pattern circles. This optimization is outlined in Section III-C.

III-B Feature extraction

While for conventional cameras, calibration patterns are easily detected from regular images by using standard feature extraction approaches, the nature of event data makes this problem less trivial. As mentioned in Section III-A, we start by creating virtual frames called reference frames, which represent event accumulations and serve for the feature detection. As long as local optical flow is sufficiently high, the events in their reference frame are given by pairs of clusters of opposing polarity located at opposing poles of a circle. Our feature extraction method aims at identifying the occurrence of such patterns.

We start by running DBSCAN clustering [11] individually on the positive and negative polarity subsets of the events. We then remove clusters with too few samples, which are regarded as noise. We also calculate the median of each cluster which is adopted as the corresponding cluster center. The feature extraction mechanism then relies on one of two alternatives:

  • Hard feature extraction: For each cluster of positive events, we search its nearest neighbors in the negative cluster set, and consider them as potential cluster pairs. Cluster distance is evaluated by simply considering the distance between the cluster centers. We then apply circle fitting [19] to every potential cluster pair, and afterwards compare the diameter and center of the fitted circle against the Euclidean distance and the midpoint between the two original cluster centers. If this check is passed, we finally evaluate the normalized circle fitting error to select the best cluster pair. This method provides a reliable feature extraction result, but produces fewer reference frames due to its strict constraints, especially if the camera adopts a more inclined view onto the calibration board, thus causing the features to appear as ellipses rather than circles.

  • Soft feature extraction: In order to deal with noisy scenarios or situations in which the camera adopts a more inclined view onto the calibration pattern, we introduce a simplified feature extraction mechanism. For each positive cluster, we simply find its nearest negative cluster. We then immediately hypothesize circle features by adopting the line between two cluster centers as the diameter. Wrong features are filtered out by checking the normalized circle fitting error. The method provides slightly less stable feature extraction but produces more reference frames in highly challenging scenarios.

Note that we also do an inverse search by starting from negative clusters and exploring the nearest neighbours within the positive clusters. A mutual consistency check verifies that features are correctly extracted. The feature detection result for both methods is shown in Figure 4.
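For illustration, the sketch below outlines the clustering and pairing stage for the soft variant, using scikit-learn's DBSCAN. All parameter values and filtering thresholds are illustrative assumptions; the mutual consistency check and the fitting-error filter are omitted for brevity.

```python
# Hedged sketch of the soft feature extraction: DBSCAN per polarity,
# nearest-neighbour pairing, circle hypothesis from the two cluster centres.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_centers(xy, eps=3.0, min_samples=10):
    """Cluster event pixel locations and return the median of each cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy)
    return np.array([np.median(xy[labels == l], axis=0)
                     for l in set(labels) if l != -1])     # label -1 = noise

def soft_circle_features(pos_xy, neg_xy, max_diameter=40.0):
    """pos_xy / neg_xy: (N, 2) pixel locations of positive / negative events."""
    pos_c, neg_c = cluster_centers(pos_xy), cluster_centers(neg_xy)
    if len(pos_c) == 0 or len(neg_c) == 0:
        return []
    circles = []
    for p in pos_c:
        d = np.linalg.norm(neg_c - p, axis=1)
        j = d.argmin()                          # nearest negative cluster
        if d[j] < max_diameter:
            center = 0.5 * (p + neg_c[j])       # midpoint as circle centre
            circles.append((center, 0.5 * d[j]))  # (centre, radius) hypothesis
    return circles
```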

Fig. 4: Left: hard feature extraction result. Right: soft feature extraction result.

III-C Multi-segment based optimization

Let there be N events that survived the initial pattern detection stage and let E = {e_1, ..., e_N} denote the set of all events.¹ Each event e_k is defined by its pixel location x_k, timestamp t_k, and polarity p_k. The initial pattern detection stage also returns an ordered set of reference frames F = {F_1, ..., F_M} for which initial poses are available. Each of the events is assigned to exactly one reference frame and, more specifically, to one of the pattern circles within that frame. Let therefore r(k) denote the reference frame of event e_k, and c(k) the index of its corresponding circle within the pattern. The idea of our optimization framework consists of finding a continuous-time trajectory parametrization that minimizes the distance between the circle corresponding to an event and the intersection point between the pattern plane and the spatial ray corresponding to that event.

¹ Note that the events that are considered for the final optimization stage are actually more than the events that have been serving for the initial extraction of the circles. At the beginning of the optimization, we assign further nearby events to circles based on both temporal difference with respect to a reference view as well as geometric distance with respect to a circle center. This procedure approximately doubles the number of considered events while still ensuring computational efficiency of the overall optimization procedure.

However, it is clear that it is not always possible to robustly detect the calibration pattern, which is why we have to formulate the optimization problem as a function of multiple smooth segments rather than a single long curve. This is especially true as the calibration pattern can only be detected once there is sufficient motion dynamics. To initialize the multiple spline segments, we group the reference views based on temporal proximity. A sub-set is defined by a sequence of subsequent reference views for which the difference between the respective timestamps stays below a certain threshold value. Let there be J segments S_1, ..., S_J resulting from this grouping process. We have

F = S_1 ∪ S_2 ∪ ⋯ ∪ S_J,   with   S_i ∩ S_j = ∅  for i ≠ j.     (2)

Let t_j^s and t_j^e furthermore be the timestamps of the very first and last event in S_j, respectively.

We now proceed to the details of the continuous-time parametrization of the trajectory segments. The various degrees of freedom of each trajectory segment are expressed by a time-parametrized vectorial function, many of which appear in the form of a weighted combination of temporal basis functions. Multiple alternatives exist, such as FFTs, polynomial kernels, or Bézier splines. In this work, we use the p-th degree B-spline curve

C_j(t) = Σ_{i=0}^{n} N_{i,p}(t) P_{j,i},     (3)

where t represents the time parameter, the P_{j,i} represent the control points defining the j-th trajectory segment, and the N_{i,p}(t) are the p-th degree B-spline basis functions defined on the monotonically increasing and non-uniform knot vector U_j = {u_0, ..., u_{n+p+1}}. P_j = {P_{j,0}, ..., P_{j,n}} represents the set of the control points of the j-th trajectory segment.

Given the parameter t, in order to obtain its exact value on a B-spline curve, we firstly find the knot span of t in the knot vector U_j, i.e. the index i such that u_i ≤ t < u_{i+1}; an exception is given when t = u_{n+p+1}, in which case i = n. We then compute the basis functions N_{l,p}(t), which are non-zero only if i − p ≤ l ≤ i. Finally, we multiply the values of the nonzero basis functions with their respective control points, and sum up the terms. Note that the shape of the curve is controlled by the control points only, and the form of a B-spline and the basis functions are generally fixed.
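As a small illustration of such a segment, the sketch below evaluates a cubic, non-uniform B-spline with 7-dimensional control points using SciPy; the knots and control points are placeholders, and the quaternion part is re-normalized after interpolation, in line with the unit-quaternion approximation discussed further below.

```python
# Hedged sketch: evaluating a non-uniform cubic B-spline trajectory segment with SciPy.
# Knot vector and control points are illustrative placeholders.
import numpy as np
from scipy.interpolate import BSpline

p = 3                                          # spline degree
# clamped, non-uniform knot vector of length n + p + 2
knots = np.array([0, 0, 0, 0, 0.3, 0.7, 1, 1, 1, 1], dtype=float)
ctrl = np.random.rand(len(knots) - p - 1, 7)   # n + 1 control points, 7-DoF (xyz + quaternion)

segment = BSpline(knots, ctrl, p)
pose = segment(0.5)                            # 7-vector: position + (unnormalized) quaternion
pose[3:] /= np.linalg.norm(pose[3:])           # re-normalize the quaternion part
print(pose)
```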

We have not yet defined what the degrees of freedom are. We use the 7-dimensional B-spline curves given by

C_j(t) = [ p_j(t)^T,  q_j(t)^T ]^T,     (4)

where p_j(t) ∈ R³ is the position of the event camera expressed in a world frame, and q_j(t) its orientation as a unit quaternion. For the rotation group, we in fact have two choices to represent it by a B-spline:

  • The unit quaternion approximation by Kang et al. [18], which approximates the rotation by a 4-dimensional unit quaternion B-spline.

  • The SO(3) Lie group B-spline [32, 31].

Our choice of the former is based on our experiments, which show that there is no notable difference between the quality of results achieved by the two options, while the unit quaternion B-spline parametrization is about twice as fast.

In order to use B-splines to parametrize smooth trajectories inside our optimization back-end, we require an appropriate initialization of the control points. We apply an existing spline curve global approximation algorithm [26] to initialize each spline segment from a set of samples along the curve. The samples for the j-th segment are notably given by the 7-vector representations of the initial absolute poses of the reference frames in S_j, which we here denote s_0, ..., s_m. Let furthermore t̄_0, ..., t̄_m be the time-stamps of these reference frames. We use the automatic knot spacing algorithm described in (9.69) in [26] to define an appropriate knot vector U_j. This algorithm guarantees that every knot span contains at least one sample, which causes the linear problem of the initialization to be well-conditioned. Once the data points are given, we set P_{j,0} = s_0 and P_{j,n} = s_m. The remaining samples are then used to optimize the control point set P_j in the sense of the least-squares objective

min_{P_j}  Σ_{k=1}^{m−1}  ‖ s_k − C_j(t̄_k) ‖².     (5)
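For illustration, a least-squares B-spline approximation of the initial segment poses can be set up with SciPy as sketched below. This is a stand-in for the global approximation algorithm of [26] under illustrative data and knots; note that SciPy's routine does not pin the first and last control points to the first and last samples as described above.

```python
# Hedged sketch of the control-point initialization: least-squares B-spline
# approximation of the initial reference-frame poses over a given knot vector.
# Timestamps, poses, and knots are illustrative placeholders.
import numpy as np
from scipy.interpolate import make_lsq_spline

p = 3
t_ref = np.linspace(0.0, 2.0, 15)              # timestamps of the reference frames (placeholder)
poses = np.random.rand(15, 7)                  # initial 7-vector poses of those frames (placeholder)

# clamped knot vector; a real implementation would follow the knot spacing rule of [26]
interior = np.linspace(0.0, 2.0, 6)[1:-1]
knots = np.r_[[0.0] * (p + 1), interior, [2.0] * (p + 1)]

segment = make_lsq_spline(t_ref, poses, knots, k=p)
print(segment.c.shape)                         # (number of control points, 7)
```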

We are now ready to formulate our overall optimization objective. Let T = {C_1(t), ..., C_J(t)} be the entire set of our continuous-time, 7-dimensional B-spline trajectory segments, and P = {P_1, ..., P_J} be the set of all control point sets. Let furthermore π⁻¹(·; θ) be our image-to-camera transformation function, which transforms 2D image points into 3D points on the normalized image plane in the camera frame. π⁻¹ is a function of the camera's intrinsic parameters θ. π⁻¹ is left unspecified in the below optimization problem and may indeed be exchanged against any continuously differentiable normalization function of this kind. We use a perspective camera with radial distortion as an example, in which case the parameter vector would be given by θ = [f_x, f_y, c_x, c_y, a_1, ..., a_m]^T. It includes the parameters a_1, ..., a_m of an inverse lens distortion function. We use the higher-order polynomial based inverse radial distortion model of Drap et al. [10], which is in contrast to the common usage of a forward distortion function. Note however that both models can always be converted into one another. For an event e_k with pixel location x_k = [u_k, v_k]^T, the normalization is then given by

f_k = π⁻¹(x_k; θ) = [ x̄_k (1 + a_1 r̄_k² + ⋯ + a_m r̄_k^{2m}),  ȳ_k (1 + a_1 r̄_k² + ⋯ + a_m r̄_k^{2m}),  1 ]^T,     (6)

where

x̄_k = (u_k − c_x) / f_x,   ȳ_k = (v_k − c_y) / f_y,   r̄_k² = x̄_k² + ȳ_k².     (7)

Knowing that the world frame is defined such that the calibration pattern coincides with the plane z = 0, we may now calculate the intersection point between the pattern plane and the spatial ray defined by the normalized event location f_k and the pose of the camera at the specific time the event was fired. We denote by λ_k the depth of this intersection point along the ray. It is given by

λ_k = − [p_k]_3 / [R_k f_k]_3,     (8)

where the pose {R_k, p_k}, sampled from the trajectory segment C_{r(k)}(t) at time t_k, transforms the normalized point from the camera frame to the world coordinate frame defined by the 3D calibration pattern. [p_k]_3 is the third element of the position p_k, and [R_k f_k]_3 takes the third row of the rotation matrix R_k multiplied by f_k. Note that r(k) is redefined as the index of the trajectory segment to which event e_k belongs.

The calibration optimization objective jointly optimizes over the intrinsic parameters θ as well as the control points of each trajectory segment. It is finally given by

min_{θ, P}  Σ_{k=1}^{N}  ρ( ( ‖ W_k − G_{c(k)} ‖ − r_c )² ),   with   W_k = R_k (λ_k f_k) + p_k.     (9)

W_k represents the 3D location of event e_k in the pattern plane, r_c the pattern circle radius, and G_{c(k)} the 3D location of the corresponding pattern circle center (recall that c(k) defines the index of the circle that event e_k corresponds to). Note that the third coordinate of W_k is always zero. ρ(·) is a loss function used to mitigate the influence of outliers (e.g. the Huber loss).
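As an illustration of Eqs. (6)–(9), the following sketch computes the per-event residual before the robust loss is applied. It is a minimal Python rendition under stated assumptions (the authors' implementation is in C++ with Ceres); the helper names and the quaternion convention are assumptions for the sake of the example.

```python
# Hedged sketch of the per-event residual behind Eq. (9): normalize the event,
# intersect its ray with the z = 0 pattern plane, and compare the distance to
# the assigned circle centre with the circle radius.
import numpy as np
from scipy.spatial.transform import Rotation

def normalize_event(uv, theta):
    """theta = (fx, fy, cx, cy, a1, ..., am): inverse radial distortion model, Eqs. (6)/(7)."""
    fx, fy, cx, cy, *a = theta
    x = (uv[0] - cx) / fx
    y = (uv[1] - cy) / fy
    r2 = x * x + y * y
    scale = 1.0 + sum(ai * r2 ** (i + 1) for i, ai in enumerate(a))
    return np.array([x * scale, y * scale, 1.0])

def event_residual(uv, t_k, segment_spline, theta, circle_center, circle_radius):
    """Residual of Eq. (9) for one event (before the robust loss)."""
    pose = segment_spline(t_k)                    # 7-vector sampled from the B-spline segment
    p = pose[:3]
    q = pose[3:] / np.linalg.norm(pose[3:])       # re-normalized quaternion, assumed (x, y, z, w)
    R = Rotation.from_quat(q).as_matrix()         # camera-to-world rotation
    f = normalize_event(uv, theta)                # Eq. (6): normalized event location
    lam = -p[2] / (R[2] @ f)                      # Eq. (8): depth along the ray to z = 0
    W = R @ (lam * f) + p                         # back-projected event on the pattern plane
    return np.linalg.norm(W - circle_center) - circle_radius
```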

IV Experimental evaluation

We introduce further details about the implementation of our approach and test our methods on multiple real datasets. We assess both the quality of the estimated intrinsic parameters for different types of lenses and the accuracy of the estimated extrinsic parameters compared against groundtruth trajectories.

IV-A Experiment setup

We evaluate the performance of our method by using a DAVIS346 event camera with different types of lenses as listed in Table I. The camera has a resolution of 346×260. The produced event stream has a maximum temporal resolution of 1 µs. The camera has the advantage of also capturing regular frames at a frame rate of 30 Hz under regular illumination conditions, which lets us easily compare our method against a high-quality, regular image-based calibration method² such as the open-source OpenCV calibration pipeline. The calibration pattern used throughout the evaluation is an asymmetric circle pattern. All our experiments are conducted on a desktop with 32 GB RAM and an Intel Xeon 3.2 GHz CPU. Implementations are made in C++, and use OpenCV [4], Eigen [15], and the Ceres optimization toolbox [29] with automatic differentiation. In order to quantitatively evaluate the performance of our calibration result, we evaluate the absolute trajectory error (ATE) between the estimated extrinsic parameters and the groundtruth trajectories by utilizing the tools from the TUM-RGBD [33] benchmark. Groundtruth is provided by an OptiTrack external motion tracking system.

² We do not compare against the calibration methods from [6, 22, 7] because they effectively emulate the traditional calibration process by keeping the calibration pattern still and by generating events through the flashing pattern. Hence their achievable calibration accuracy can be considered equivalent to the one obtained by traditional calibration methods for regular cameras.

Name    Label              FOV (°)  TV Distortion (%)
Lens-1  HIK-MVL-MF1220M    40.2     -1.01
Lens-2  Kowa-LM5JCM        82.4     -0.50
Lens-3  Kowa-LM6JC         81.9     10.70
TABLE I: Specifications of the lenses used in our experiments. FOV represents the field-of-view of the lens.
Fig. 5: Effect of our feature rectification strategy (features are colored in white). Left: Original feature extraction. Right: Rectified features for poor-quality reference frames.
Dataset  Duration(s)  FrameNum  ATE (cm): rmse  mean  median  std  Intrinsics: fx  fy  cx  cy  dist.
lens1-slow1-image 110 260 0.5470 0.4660 0.4146 0.2864 345.08 345.24 167.79 123.88 -0.3695
lens1-slow1-event-soft 2136 0.9305 0.7283 0.6047 0.5792 347.99 348.90 166.43 121.44 0.3694
lens1-slow1-event-hard 1412 1.0338 0.8071 0.6659 0.6460 349.31 350.06 165.77 118.97 0.3837
lens1-slow2-image 97 207 1.1587 0.9053 0.8582 0.7231 342.53 342.78 168.056 126.065 -0.3675
lens1-slow2-event-soft 1448 1.9136 1.1268 0.9491 1.5467 344.95 344.84 167.12 121.83 0.3662
lens1-slow2-event-hard 1859 1.3622 1.0241 0.8809 0.8982 344.85 344.755 166.97 121.95 0.3598
lens2-slow-image 94 131 11.278 2.7100 1.2878 10.948 272.18 272.75 162.558 124.134 -0.0706
lens2-slow-event-soft 993 1.4796 1.3798 1.3607 0.5343 269.674 272.135 163.465 128.186 0.069
lens2-slow-event-hard 438 1.2254 1.1257 1.0515 0.4840 270.84 273.44 163.04 129.51 0.0440
lens3-slow-image 69 127 1.9512 1.5330 1.3765 1.2071 338.586 338.013 159.87 116.66 -0.3258
lens3-slow-event-soft 902 1.1889 0.9964 0.9082 0.6487 334.377 335.8 162.076 118.978 0.34097
lens3-slow-event-hard 480 1.3162 0.9768 0.9177 0.8822 335.356 336.636 161.1 118.96 0.3499
lens1-fast1-image 8.16 37 0.7605 0.6851 0.6162 0.3302 342.82 342.47 170.39 124.999 -0.3661
lens1-fast1-event-soft 101 0.8362 0.7430 0.6808 0.3835 345.097 348.76 171.94 122.31 0.4093
lens1-fast1-event-hard 76 0.7215 0.5887 0.5002 0.4172 346.18 350.725 172.72 122.56 -0.0146
lens1-fast2-image 8.4 84 0.8774 0.7277 0.6033 0.4902 337.234 338.36 169.83 128.125 -0.3637
lens1-fast2-event-soft 331 0.8722 0.6505 0.4813 0.5810 345.93 348.73 164.76 126.233 0.34578
lens1-fast2-event-hard 244 0.6025 0.4695 0.3751 0.3776 340.765 343.03 168.158 125.877 0.2943
lens1-blur1-image 70 4 3.170 2.7338 2.3104 1.6047 341.466 334.28 188.9 122.555 -1.3176
lens1-blur1-event-soft 1068 1.12 0.7524 0.6442 0.8296 344.99 347.486 158.6 120.40 0.3856
lens1-blur1-event-hard 176 0.7456 0.639 0.5718 0.3842 343.115 345.94 160.09 121.3 0.3262
lens1-blur2-event-soft 6.537 70 1.409 1.13 0.931 0.8417 359.098 365.023 165.022 123.3 0.5227
TABLE II: Comparison against ground truth. The last five columns report the estimated intrinsics (fx, fy, cx, cy) and the first radial distortion coefficient; OpenCV's distortion parameters are given for the image-based rows, and the relation to the inverse radial distortion model is explained in [10] (the coefficient has opposite sign).
Fig. 6: Undistortion of events and images using our calibration result on 1104-fisheye1.

IV-B Accuracy evaluation

We test the methods on several real sequences with different lenses, motion characteristics, and illumination conditions. For different lens types, the sequences are named lensx-slowx. They are captured under normal conditions without highly dynamic motion or challenging illumination. Next, we perform high speed calibration tests with sequences shorter than 10s, which are called lens1-fast1 and lens1-fast2. They are captured with lens1 still under normal illumination conditions. We conclude with long and short sequences under high speed motion as well as reduced illumination, which leads to noisy event sequences and blurry images. The latter sequences are named lens1-blur1 and lens1-blur2.

Note that intrinsic camera parameters are potentially inter-correlated, which is why it is not possible to regard the regular camera alternative as ground truth. In order to fairly assess the performance of our methods, we perform a quantitative comparison of the absolute trajectory error between our proposed event-based solution and the regular camera alternative. Table II summarizes all results. We evaluate the root-mean-square (rmse), mean (mean), median (median), and standard deviation (std) errors between the estimated extrinsics and ground truth. As can be easily observed, our method produces similar calibration results to OpenCV while showing higher pose accuracy than the standard image-based calibration tool. The following is worth noting:

  • Calibration for different lenses: We apply our calibration to different lenses in sequences lens1-slow1, lens1-slow2, lens2-slow, and lens3-slow. The obtained results are comparable to the normal image alternative. Note that—while our method always returns low ATE errors in absolute terms—our motion parameterization works best in more dynamic scenarios. This explains the small gap to regular camera alternatives on lens1-slow1 and lens1-slow2.

  • Comparison of different circle extraction schemes: Table II shows that the hard feature extraction method is better than soft feature extraction in terms of the ATE error. However, it has difficulties correctly identifying distortion coefficients when there are only few reference frames (cf. values on lens1-fast1). Hard feature extraction has a preference for noise-free reference frames, which tend to cover only a part of the whole trajectory; this may cause a bad distribution of the reference views, especially if there are only few of them.

  • High-speed calibration: lens1-fast1, lens1-fast2, and lens1-blur2 are high-speed sequences of less than 10s duration. Our method works well even on such short sequences.

  • High-speed calibration under bad illumination conditions: lens1-blur1 and lens1-blur2 are high-speed datasets captured under unfavourable illumination conditions. These conditions produce blurry images and noisy events, and render calibration from normal images unusable. In contrast, our proposed method still works well, especially when using the soft feature extraction strategy.

IV-C Further results

The effectiveness of our feature rectification approach is visualized in Figure 5. White circles indicate rectified features whereas colored ones represent the originally detected pattern. Outlier features are rectified and moved to correct positions, while badly observed features are removed. As demonstrated in the figure, our feature rectification method handles various scenarios such as mismatched features (top), noisy event streams (center), and heavily inclined views onto the pattern causing the features to appear in elliptical shape (bottom). Figure 6 finally visualizes undistorted images from different perspectives. As can be observed, our method produces visually pleasing results.

V Conclusion

We present a novel framework for event camera calibration that requires only traditional calibration patterns. The pattern is rendered observable as soon as the camera is under motion. Multiple trajectory segments are taken jointly into account. They are optimized in continuous time which permits every event to be accounted for with its exact time stamp. By adding several techniques to ensure the quality of the correspondences between events and pattern features, the overall framework finally achieves satisfying results that are comparable to regular camera alternatives. We will release our framework and are confident that it will benefit future research on event-based 3D vision.

References

  • [1] I. Alzugaray and M. Chli (2018) ACE: An Efficient Asynchronous Corner Tracker for Event Cameras. In Proceedings of the International Conference on 3D Vision (3DV), Cited by: §II.
  • [2] I. Alzugaray and M. Chli (2018) Asynchronous corner detection and tracking for event cameras in real time. IEEE Robotics and Automation Letters 3 (4), pp. 3177–3184. Cited by: §II.
  • [3] J. Bouguet Calibration toolbox by Jean-Yves Bouguet. Note: http://www.vision.caltech.edu/bouguetj/calib_doc/index.html Cited by: §II, §III-A.
  • [4] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §II, §III-A, §IV-A.
  • [5] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck (2014) A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits 49 (10), pp. 2333–2341. Cited by: §I.
  • [6] Calibration toolbox by RPG, University of Zurich. Note: https://github.com/uzh-rpg/rpg_dvs_ros/tree/master/dvs_calibration Cited by: §II, footnote 2.
  • [7] Calibration toolbox by VLOGroup at TU Graz. Note: https://github.com/VLOGroup/dvs-calibration Cited by: §II, footnote 2.
  • [8] X. Clady, S.-H. Ieng, and R. Benosman (2015) Asynchronous event-based corner detection and matching. Neural Networks 66, pp. 91–106. Cited by: §II.
  • [9] T. Collins and A. Bartoli (2014) Infinitesimal plane-based pose estimation. International Journal of Computer Vision (IJCV) 109 (3), pp. 252–286. Cited by: §III-A.
  • [10] P. Drap and J. Lefèvre (2016) An exact formula for calculating inverse radial lens distortions. Sensors 16 (6), pp. 807. Cited by: §III-C, TABLE II.
  • [11] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: §II, §III-B.
  • [12] P. Furgale, C.-H. Tong, T. D. Barfoot, and G. Sibley (2015) Continuous-time batch trajectory estimation using temporal basis functions. International Journal of Robotics Research (IJRR) 34 (14), pp. 1688–1710. Cited by: §II.
  • [13] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, and K. Daniilidis (2020) Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). Cited by: §I.
  • [14] A. Glover and C. Bartolozzi (2016) Event-driven ball detection and gaze fixation in clutter. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pp. 2203–2208. Cited by: §II.
  • [15] G. Guennebaud, B. Jacob, et al. (2010) Eigen v3. Note: http://eigen.tuxfamily.org Cited by: §IV-A.
  • [16] R. Hartley and A. Zisserman (2004) Multiple view geometry in computer vision. Second edition, Cambridge University Press, New York, NY, USA. Cited by: §II.
  • [17] J. Jiao, H. Huang, L. Li, Z. He, Y. Zhu, and M. Liu (2021-06) Comparing representations in tracking for event camera-based slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1369–1376. Cited by: §I.
  • [18] I.G. Kang and F.C. Park (1999) Cubic spline algorithms for orientation interpolation. International Journal for Numerical Methods in Engineering 46 (1), pp. 45–64. Cited by: §II, 1st item.
  • [19] I. Kåsa (1976) A circle fitting procedure and its error analysis. IEEE Transactions on instrumentation and measurement IM-25 (1), pp. 8–14. Cited by: §II, 1st item.
  • [20] E. Mueggler, C. Bartolozzi, and D. Scaramuzza (2017) Fast event-based corner detection. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §II.
  • [21] E. Mueggler, G. Gallego, and D. Scaramuzza (2015) Continuous-time trajectory estimation for event-based vision sensors. In Proceedings of Robotics: Science and Systems (RSS), Cited by: §II.
  • [22] G. Orchard Calibration toolbox by G. Orchard. Note: https://github.com/gorchard/DVScalibration Cited by: §II, footnote 2.
  • [23] L. Oth, P. Furgale, L. Kneip, and R. Siegwart (2013) Rolling shutter camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II.
  • [24] X. Peng, Y. Wang, L. Gao, and L. Kneip (2020) Globally-optimal event camera motion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §I.
  • [25] X. Peng, L. Gao, Y. Wang, and L. Kneip (2021) Globally-optimal contrast maximisation for event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document Cited by: §I.
  • [26] L. Piegl and W. Tiller (2012) The NURBS book. Springer Science & Business Media. Cited by: §II, §III-C.
  • [27] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza (2018) EMVS: event-based multi-view stereo—3d reconstruction with an event camera in real-time. International Journal of Computer Vision 126 (12), pp. 1394–1414. Cited by: §I.
  • [28] A. Rosinol Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza (2018) Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios. In IEEE Robotics and Automation Letters, Vol. 3, pp. 994–1001. Cited by: §I, §II.
  • [29] S. Agarwal, K. Mierle, et al. Ceres solver. Note: http://ceres-solver.org Cited by: §IV-A.
  • [30] D. Scaramuzza, A. Martinelli, and R. Siegwart (2006) A toolbox for easy calibrating omnidirectional cameras. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pp. 5695–5701. Cited by: §II.
  • [31] C. Sommer, V. Usenko, D. Schubert, N. Demmel, and D. Cremers (2020) Efficient derivative computation for cumulative b-splines on lie groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11148–11156. Cited by: §II, 2nd item.
  • [32] H. Sommer, J. R. Forbes, R. Siegwart, and P. Furgale (2016) Continuous-time estimation of attitude using b-splines on lie groups. Journal of Guidance, Control, and Dynamics 39 (2), pp. 242–261. Cited by: §II, 2nd item.
  • [33] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of rgb-d slam systems. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), Cited by: §IV-A.
  • [34] P. Sturm and S. Ramalingam (2004) A generic concept for camera calibration. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 1–13. Cited by: §II.
  • [35] V. Vasco, A. Glover, and C. Bartolozzi (2016) Fast event-based harris corner detection exploiting the advantages of event-driven cameras. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pp. 4144–4149. Cited by: §II.
  • [36] Y. Wang, J. Yang, X. Peng, P. Wu, L. Gao, K. Huang, J. Chen, and L. Kneip (2021) Visual odometry with an event camera using continuous ray warping and volumetric contrast maximization. arXiv preprint arXiv:2107.03011. Cited by: §I.
  • [37] Z. Zhang (1999) Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 666–673. Cited by: §II.
  • [38] Z. Zhang (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22 (11), pp. 1330–1334. Cited by: §II, §III-A.
  • [39] Y. Zhou, G. Gallego, and S. Shen (2021) Event-based stereo visual odometry. IEEE Transactions on Robotics. Cited by: §I.
  • [40] A. Z. Zhu, Y. Chen, and K. Daniilidis (2018) Realtime time synchronized event-based stereo. In European Conference on Computer Vision, pp. 438–452. Cited by: §I.