Event-based Camera Pose Tracking using a Generative Event Model

10/07/2015 ∙ by Guillermo Gallego, et al.

Event-based vision sensors mimic the operation of the biological retina and represent a major paradigm shift from traditional cameras. Instead of providing frames of intensity measurements synchronously, at artificially chosen rates, event-based cameras provide information on brightness changes asynchronously, when they occur. Such non-redundant pieces of information are called "events". These sensors overcome some of the limitations of traditional cameras (response time, bandwidth and dynamic range) but require new methods to deal with the data they output. We tackle the problem of event-based camera localization in a known environment, without additional sensing, using a probabilistic generative event model in a Bayesian filtering framework. Our main contribution is the design of the likelihood function used in the filter to process the observed events. Based on the physical characteristics of the sensor and on empirical evidence of the Gaussian-like distribution of spiked events with respect to the brightness change, we propose to use the contrast residual as a measure of how well the estimated pose of the event-based camera and the environment explain the observed events. The filter allows for localization in the general case of six degrees-of-freedom motions.




I Introduction

Recently, event-based cameras such as the Dynamic Vision Sensor (DVS) [1] have attracted a lot of attention from both the robotics and vision communities [2, 3, 4, 5, 6, 7, 8, 9, 10]. These bio-inspired sensors overcome some of the limitations of traditional image sensors: they respond very quickly (within microseconds) to brightness changes, have very high dynamic range (120 dB compared to 60 dB of standard cameras), and require low bandwidth [1]. Hence, they are very promising sensors for high-speed visual applications in challenging scenes with large brightness contrast. However, the output of these cameras (a stream of events) is fundamentally different from that of traditional ones, and so a paradigm shift is required to design algorithms that exploit the potential of these vision sensors. Examples of such emerging event-based algorithms are: event-based optical flow [4], visual odometry [5], localization [2, 6], Simultaneous Localization and Mapping (SLAM) [3, 9], mosaicing [7, 8], object recognition [10], etc.

We address the localization problem of a moving event-based camera in a known environment. One of the first works in this respect is [2], where a particle-filter system that is limited to planar motions and 2-D maps was introduced. In the experiments, they used an upward-looking DVS mounted on a ground robot moving at low speed. The provided map used for navigation consisted of line segments on the ceiling. In [5], a probabilistic filtering approach was designed to localize a DVS moving on a plane with respect to the temporally closest pair of frames provided by an additional RGB-D camera attached to the DVS. An algorithm to track the 6-DOF pose of the DVS with no additional sensing during high-speed maneuvers was given in [6]. They used a map consisting of the edges of a black square of known size and minimized the event-to-line reprojection distance to estimate the DVS pose.

We propose an implicit Extended Kalman Filter (EKF) approach [11] to localize the DVS with respect to a given dense map of the 3-D scene (consisting of geometric and photometric information) without additional sensing (as in [2, 6, 8]), just using the information contained in the event stream. The map is not constrained to consist only of lines, thus it is more general than those in [2, 6], and it is also richer in brightness changes than the barcoded scenes in [5]. We allow for localization in the general case of 6-DOF motion of the DVS and design the filter accordingly. Our main contribution pertains to the design of the likelihood function used in the correction step of the EKF to process the observed events (Section III-B), by measuring how well the system state (DVS pose and velocity) and the map explain an event from the DVS using a contrast residual. To do so, we first derive a simple yet compelling model for event generation (Section II-A). The technique is demonstrated on synthetic and real data in Section IV.

II Dynamic Vision Sensor (DVS): generative event model

In contrast to standard cameras, which acquire full frames at fixed rates, event-based vision sensors such as the DVS (Fig. 1a) have independent pixels that spike events at local relative brightness changes in continuous time. A visualization of the output of the DVS is shown in Fig. 1b. Events are time-stamped with microsecond resolution and transmitted asynchronously at the time they occur. Each event is a tuple $e = (x, y, t, p)$, where $(x, y)$ are the pixel coordinates of the event, $t$ is its time-stamp, and $p$ is its polarity (sign of the brightness change). The sensor’s spatial resolution is limited ($128 \times 128$ pixels; a new generation of event-based sensors with VGA resolution, $640 \times 480$, is being developed by the group [1]), but its 120 dB dynamic range notably exceeds the 60 dB of high-quality traditional image sensors.
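To make the event stream concrete, the following minimal sketch stores each event as a small record and computes, per pixel, the time elapsed since the previous event at that pixel (a quantity the generative model uses later). The class name `Event`, its field names and the helper `time_since_previous` are illustrative assumptions, not part of the sensor's interface.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One DVS event; field names are illustrative assumptions."""
    x: int         # pixel column
    y: int         # pixel row
    t: float       # time-stamp in seconds (the sensor resolves microseconds)
    polarity: int  # +1 for a brightness increase, -1 for a decrease


def time_since_previous(events):
    """Yield (event, dt), where dt is the time since the previous event
    at the same pixel, or None if the pixel has not fired before."""
    last_seen = {}
    for ev in events:
        prev = last_seen.get((ev.x, ev.y))
        yield ev, (None if prev is None else ev.t - prev)
        last_seen[(ev.x, ev.y)] = ev.t
```

Because events arrive asynchronously, the per-pixel dictionary is updated one event at a time, which mirrors event-by-event processing.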

Fig. 1: (a) The Dynamic Vision Sensor (DVS). (b) Space-time visualization of the output of a DVS viewing a rotating dot. Colored dots mark individual events. Event polarity is not displayed. Noise is visible as isolated points that are not part of the spiral. (c) The contrast of the DVS events empirically follows a unimodal distribution (e.g. Gaussian-like) centered at a selected threshold (six threshold settings are shown). Images (b) and (c) are courtesy of [1].

Next, we provide a generative event model for the DVS using a principled derivation of the equations that characterize an ideal sensor. The event model combines several hypotheses (constant brightness, temporal persistence, etc.) with particular characteristics of the DVS. The model is at the heart of data assimilation in our filtering approach for DVS localization.

II-A Scene modeling

Assume that objects in the 3-D world are represented by a surface with geometric and radiometric properties. Typically, objects are described by a mesh or depth map and a corresponding intensity (i.e., “texture”) function (in a Lambertian context).

The DVS has the same optics as traditional perspective cameras; therefore, standard models (e.g., pinhole) apply. In camera coordinates, the projection operation maps a 3-D point $\mathbf{X} = (X, Y, Z)^\top$ into the image point $\mathbf{x} = (X/Z, Y/Z)^\top$, given in normalized coordinates.

Assume a simplified radiance model where each point on the surface has an intensity, and this is the value observed by the DVS to trigger events; that is, the intensity at an image point equals the intensity defined at the corresponding surface point, for 3-D points visible from the DVS. Hence, the image plane parametrizes both the image and the surface (geometric and photometric properties).

II-B 3-D motion and apparent (2-D) motion

The motion of a moving camera is described by a smooth trajectory in the space of Euclidean transformations, $SE(3)$. Let the relative motion between the viewing camera and the scene be described, in the camera coordinate frame, by


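where $\boldsymbol{\omega}$ and $\mathbf{v}$ are body angular and linear velocities, respectively, and $[\boldsymbol{\omega}]_\times$ is the cross-product matrix: $[\boldsymbol{\omega}]_\times \mathbf{a} = \boldsymbol{\omega} \times \mathbf{a}$ for any 3-vector $\mathbf{a}$. The corresponding apparent motion of the 3-D point is described by the velocity $\dot{\mathbf{x}}$ of the image point $\mathbf{x}$, which comprises the image motion field. Specifically, the equation that relates surface velocity (in the camera frame) to feature velocity (in normalized coordinates) is (see, e.g. [12], [13, Eq. 5.87]), omitting the time dependence from the notation: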


where twist coordinates encode the relative motion and


is called the interaction matrix, image Jacobian matrix for a point feature, or feature sensitivity matrix [12], [14, p. 460-462]. Typically, the surface is assumed to admit a depth map representation with respect to the camera, and so the depth $Z$ of the 3-D point is parametrized in the image plane, $Z = Z(\mathbf{x})$. Consequently, the interaction matrix is just a function of the surface and the image point. The motion field has two separate components for translation and rotation.
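As a minimal sketch of how a motion field vector could be evaluated, the snippet below projects a 3-D point to normalized coordinates, builds a point-feature interaction matrix and applies it to a twist. The sign convention is an assumption: it follows the form common in visual servoing, where the twist holds the linear and angular velocity of the camera and the scene is static, and it may differ from the convention used in this paper. The function names and the ordering of the twist components are also illustrative.

```python
def project(X, Y, Z):
    """Pinhole projection of a 3-D point to normalized image coordinates."""
    return (X / Z, Y / Z)


def interaction_matrix(x, y, Z):
    """2x6 interaction matrix of a point feature at normalized
    coordinates (x, y) with depth Z.
    Assumed convention: twist = (vx, vy, vz, wx, wy, wz) of the camera,
    expressed in the camera frame, observing a static scene."""
    return [
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ]


def motion_field(x, y, Z, twist):
    """Image velocity (x_dot, y_dot) of the point feature: J times twist."""
    J = interaction_matrix(x, y, Z)
    return tuple(sum(J[r][c] * twist[c] for c in range(6)) for r in range(2))
```

The first three columns scale with inverse depth (translation) while the last three do not depend on depth (rotation), which is the separation into two components noted above. Under this assumed convention, a camera translating toward positive x yields a motion field pointing toward negative x, the same qualitative behavior described for Fig. 2.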

II-C Deterministic generative event model

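The standard hypothesis in measuring image motion is that the intensity structure of local time-varying image regions is approximately constant under motion for at least a short duration (temporal persistence). Formally, if $I(\mathbf{x}, t)$ is the space-time image intensity function measured by the DVS, the total derivative vanishes along trajectories $\mathbf{x}(t)$ of constant intensity values, $I(\mathbf{x}(t), t) = \text{const}$, that is,

$$\frac{dI}{dt} = \frac{\partial I}{\partial t} + \langle \nabla I, \dot{\mathbf{x}} \rangle = 0, \qquad (4)$$

where $\langle \cdot, \cdot \rangle$ is the dot product, $\nabla I = (\partial I / \partial x, \partial I / \partial y)^\top$ are the first partial derivatives with respect to spatial coordinates and $\dot{\mathbf{x}}$ is the motion field.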

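The DVS senses brightness logarithmically, $L = \log I$ (using the chain rule it is easy to verify that, if $L = \log I$, the conditions $dI/dt = 0$ and $dL/dt = 0$ are equivalent), and it generates an event at a location $\mathbf{x}$ if the amount of intensity (grey level) change during an interval $\Delta t$ (the time since the previous event at the same location), i.e., the contrast

$$\Delta L \approx \frac{\partial L}{\partial t}\,\Delta t, \qquad (5)$$

is larger in magnitude than a threshold $C$ [1, 5] (typically 10-15% relative brightness change):

$$|\Delta L| \approx \left| \langle \nabla L, \dot{\mathbf{x}} \rangle \right| \Delta t > C. \qquad (6)$$

Incorporating polarity, if the contrast $\Delta L > C$, a positive event ($p = +1$) is generated; if $\Delta L < -C$, a negative event ($p = -1$) is generated; otherwise, no event is fired.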
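The hard-decision rule can be illustrated with a short sketch for a single pixel. It assumes the contrast is the difference of log-intensities between the previous event time and now, and that the threshold is expressed in the same log units; the function names are hypothetical.

```python
import math


def contrast(intensity_prev, intensity_now):
    """Change in log-brightness since the previous event at this pixel."""
    return math.log(intensity_now) - math.log(intensity_prev)


def fire_event(intensity_prev, intensity_now, threshold):
    """Hard-decision model: return +1 or -1 when the contrast exceeds the
    threshold in magnitude, and None when no event is fired."""
    dL = contrast(intensity_prev, intensity_now)
    if dL > threshold:
        return 1
    if dL < -threshold:
        return -1
    return None
```

For example, with a threshold of 0.15 a 20% brightness increase fires a positive event, a 5% increase fires nothing, and a 20% decrease fires a negative event.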

II-D Probabilistic generative event model

Equation (6) is a hard decision model for the generation of events. A more realistic one takes into account sensor noise and manufacturing mismatches, yielding a soft decision represented by a smooth probability function. A characterization of the corresponding probability density averaged over all DVS pixels is shown in Fig. 6 of [1] (see Fig. 1c), suggesting a unimodal Gaussian-like distribution, for which they measure its standard deviation as a function of the threshold. This probabilistic generative event model can be included in a Bayesian filtering approach to process the events, as shown in the next section, where we adopt the simple yet powerful filter given by the Extended Kalman Filter (EKF), which assumes Gaussian probability distributions to keep a compact and manageable representation of the posterior probability of the DVS pose and velocity.
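One simple way to picture the soft decision is a Gaussian density over the absolute contrast, centered at the nominal threshold. The sketch below assumes exactly that; the standard deviation `sigma` is a free parameter here (the source reports it is measured as a function of the threshold but gives no values), and the function name is illustrative.

```python
import math


def contrast_likelihood(abs_contrast, threshold, sigma):
    """Gaussian density of the absolute contrast of a fired event,
    centered at the nominal threshold (an assumed soft-decision model)."""
    r = abs_contrast - threshold
    return math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
```

The density peaks when the predicted contrast equals the threshold and decays symmetrically on both sides, so predicted contrasts that are either too small or too large explain an observed event less well.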

III Bayesian filtering approach

III-A State-space design

In the popular Bayesian inference framework given by the EKF [11] we can formulate the DVS localization problem with respect to a map as that of estimating the state of a system defined by its state-space representation (state and measurement equations).

The state equation is a non-linear function of the state and the process noise


As usual, subscripts denote temporal references. The process noise is not additive, and it is assumed to be zero-mean multivariate Gaussian distributed with a given covariance matrix. The state vector describes the DVS pose (position and orientation) and its velocity. It comprises the position of the optical center of the DVS, in world coordinates; the rotation vector parametrizing the orientation of the DVS by means of the exponential coordinates (as in the filter proposed by [15]) of the rotation matrix from the world to the camera frame; and the linear and angular velocities (1), given in world and camera (body) coordinates, respectively.

We chose the motion model given by the constant velocity model, which is typical of SLAM approaches [16]. This accounts for general smooth motions of the DVS. By integration of the continuous motion over a time interval (here, the time between prediction steps in the EKF, which may or may not coincide with the time between events at the same location in (5), depending on whether events are processed in packets or individually) and discretization, (7) becomes


The $\exp$ and $\log$ operators refer to the rotation group, $SO(3)$: the exponential of the cross-product matrix associated to a 3-vector is the incremental rotation around the axis defined by that vector, by an angle equal to its norm, and the logarithm of a rotation matrix is a skew-symmetric matrix, from which the associated 3-vector is extracted.

III-B Implicit measurement equation

In the standard EKF, the likelihood is specified by an equation in which the observations are explicitly written in terms of the state and the observation noise. This is the formulation used in classical visual localization and SLAM, where the observations consist of the image coordinates of sensed map landmarks, and the measurement function predicts the observations by using the camera model to project the landmarks. This design choice implies Gaussian image coordinate noise, and it may also be applied to DVS localization [2]. However, it does not take into account the generative event model (such as (6)). In a different (non-localization) context, an alternative approach is given in [8] to estimate the intensity gradient at each pixel: the observations consist of event rates and a generative model is used to write such explicit dependency. This design choice implies that the temporal (event-rate) noise is Gaussian, which is an arbitrary choice.

We depart from the previous explicit models (spatial or temporal measurements) and propose an implicit measurement equation


to quantify how well the event generation model (6) is satisfied. This leads to an implicit EKF [17, 18]. Our design choice assumes that the deviations of the contrast from the nominal one that fires events are Gaussian, as Fig. 1c suggests. A similar unimodal density function is given in [8] only for the correction step of rotation tracking.

Fig. 2: Neighborhood of an event triggered by a moving edge. The DVS is moving horizontally to the right (positive direction). Top row: positive event (dark-to-bright transition). Bottom row: negative event (bright-to-dark transition). (a) Rendering of the map on the DVS image plane; the event is at the center of the patch. The motion field (magenta vectors) points toward the negative direction. The image gradient (perpendicular to the edge) is displayed with cyan vectors. (b) Predicted neighborhood. (c) Contrast. (d) The implicit measurement function in (10) has the same shape as the absolute contrast, which defines the likelihood that the event was triggered.

Assuming constant illumination and independence of the observations, each event is caused by a brightness change at its pixel, depending on both the DVS state and the map. Thus, a more rigorous description than (9) also includes the map as an argument, because an event is an observation of some map point. Letting $\mathbf{g}$ be a shorthand notation for the spatial gradient $\nabla L$ in (6), we define the implicit function as the difference between the absolute contrast (5) and the nominal threshold, $h = |\Delta L| - C$. Substituting $-\langle \mathbf{g}, \dot{\mathbf{x}} \rangle \Delta t$ for the contrast $\Delta L$ and replacing the sign of $\Delta L$ by the measured polarity $p$, we use (6) to define

$$h = -p\,\langle \mathbf{g}, \dot{\mathbf{x}} \rangle\,\Delta t - C, \qquad (10)$$

where $\Delta t$ is the time span since the previous event at the same location, and the inner product between the gradient $\mathbf{g}$ and the motion field $\dot{\mathbf{x}}$ depends on the event location, its corresponding 3-D point and the state. Specifically, the gradient depends on the DVS pose only (but not on its velocity) via the perspective projection between the map and the image point, whereas the motion field (2) depends on both the DVS pose (depth of the 3-D point with respect to the sensor) and velocities (twist coordinates). The gradient may be computed by taking the spatial derivatives of the predicted image intensities in a neighborhood of the current event location, obtained through rendering the dense map according to the DVS pose in the current state. Examples of the contrast function for positive and negative events are shown in Fig. 2. Patches of pixels around the event location are displayed, but the local analysis of the generative event model is only reliable close to the center. Fig. 3 reports the cases where the image gradient is parallel or almost perpendicular to the apparent motion, yielding the largest and smallest absolute contrast, respectively.
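Once the gradient, the motion field and the elapsed time are available, the contrast residual takes only a few arithmetic operations. The sketch below assumes the residual has the form $h = -p\,\langle \mathbf{g}, \dot{\mathbf{x}} \rangle\,\Delta t - C$, that is, the polarity times the contrast predicted by brightness constancy, minus the threshold. The function and argument names are illustrative.

```python
def contrast_residual(polarity, gradient, flow, dt, threshold):
    """Implicit measurement value for one event (assumed form):
    h = -p * <g, x_dot> * dt - C.

    polarity  : +1 or -1, as measured by the sensor
    gradient  : (gx, gy), spatial gradient of the predicted log-intensity
    flow      : (u, v), motion field at the event location
    dt        : time since the previous event at the same pixel
    threshold : nominal contrast threshold C
    """
    inner = gradient[0] * flow[0] + gradient[1] * flow[1]
    return -polarity * inner * dt - threshold
```

The residual is zero when the predicted contrast has the measured sign and a magnitude equal to the threshold. If the gradient is perpendicular to the motion field the inner product vanishes and the residual equals minus the threshold, so such a configuration cannot explain the event.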

Fig. 3: Neighborhood of an event triggered by a moving edge. Same notation as in Fig. 2. Top row: at the event location, the image gradient is parallel to the motion field. Bottom row: the image gradient is almost perpendicular to the motion field. Both rows correspond to a negative event.

III-C Recursive solution: Implicit EKF equations

Once the system state and measurement equations have been designed, the update equations of the parameters of the posterior in the EKF are also determined. The recursive estimation carried out in the EKF is described by the equations in Algorithm 1. We follow the notation in [11] for the posteriors and their moments. The DVS pose tracking filter also assumes that an accurate estimate of the initial configuration, with relatively small uncertainties, is given. Let us further explain the steps of Algorithm 1.

1. Mean state (pred.)
2. Error covar. (pred.), with Jacobians of the state equation.
3. Innovation
4. Innovation covar., with Jacobians of the implicit measurement function.
5. Kalman gain
6. Mean state
7. Error covar.
Algorithm 1 Extended Kalman Filter (EKF) equations for one iteration, with implicit measurement function.

In the prediction step, the projection of the posterior through the kinematic model (8) gives the predicted posterior before incorporating the measurement. The state mean and error covariance are predicted according to lines (1)-(2) in Algorithm 1. Uncertainty is propagated through the system by means of the Jacobians of (8), evaluated at the current best estimate.

The correction step is the data assimilation step, where the predicted posterior is combined with the measurement to yield the updated posterior. The state mean and error covariance are corrected according to lines (3)-(7) in Algorithm 1. Events from the DVS are fed to the generative sensor equation (10) to produce a residual that drives the update of the filter variables. With regard to Figs. 2d and 3d, the correction step changes the state such that the likelihood at the event position increases (white region). The innovation process and its covariance (lines (3)-(4) in Algorithm 1) are obtained by linearization of the implicit measurement function (10) around the current best estimate (see [17, 18]). Uncertainty is corrected in the system (up to first order) by means of the Jacobians of (10), evaluated at the current best estimate, together with the covariance of the measurement noise [17]. Since the measurement function (10) is real-valued, both the noise and the innovation covariances are scalars.
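The correction lines of Algorithm 1 can be sketched for a scalar implicit measurement as follows. This is a generic textbook form of an implicit EKF update, not the authors' implementation: linearizing the measurement function around the predicted state gives the innovation as minus its value, and because the measurement is scalar the gain needs only a division. The names `implicit_ekf_correct`, `H` and `R` are illustrative; `H` is the row of partial derivatives of the measurement function with respect to the state, and `R` stands for the scalar measurement noise variance already mapped through the Jacobian with respect to the measurement.

```python
def implicit_ekf_correct(mean, cov, h_value, H, R):
    """One EKF correction with a scalar implicit measurement h(state, z) = 0.

    mean    : predicted state mean, list of n floats
    cov     : predicted error covariance, n x n list of lists (symmetric)
    h_value : measurement function evaluated at the predicted mean
    H       : dh/dstate at the predicted mean, list of n floats
    R       : scalar measurement noise variance
    """
    n = len(mean)
    # P H^T, a column stored as a list.
    PHt = [sum(cov[i][j] * H[j] for j in range(n)) for i in range(n)]
    # Innovation covariance S = H P H^T + R (a scalar).
    S = sum(H[i] * PHt[i] for i in range(n)) + R
    # Kalman gain K = P H^T / S.
    K = [v / S for v in PHt]
    # For an implicit measurement the target value is zero.
    innovation = -h_value
    new_mean = [mean[i] + K[i] * innovation for i in range(n)]
    # (I - K H) P, using H P = (P H^T)^T for symmetric P.
    new_cov = [[cov[i][j] - K[i] * PHt[j] for j in range(n)] for i in range(n)]
    return new_mean, new_cov
```

State components with a zero entry in `H` and no correlation with the observed ones are left untouched, which reflects that a single event constrains only one direction of the state.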

III-C1 Data association

An additional advantage of our approach is that there is no data association like in the classical localization problem (associating predicted measurements to actual ones), thus removing a challenging sub-problem and common source of brittleness in localization and mapping with the EKF. This is a consequence of using a dense map (as opposed to a set of isolated landmarks) to represent the scene and to design a measurement equation (10) that exploits such a representation. There is no data association problem because a correspondence between the event location and a map point will always exist, and it can be computed via ray-tracing. The errors caused by a mismatch between the true surface point that triggered the event and the predicted one are implicitly taken into account in the EKF via the innovation (10) and its covariance. For example, the value of the gradient in the neighborhood of the event will change (with some degree of smoothness) and if the predicted value does not yield the triggering of an event, the EKF adjusts the state parameters so that a different surface point will be more likely to trigger the observed event. There is no need to artificially search for a 3-D point, close to the predicted one, that better explains the event.

IV Experiments

IV-A Synthetic data

The proposed method was tested with synthetic and real data. The synthetic data was generated using computer graphics software (Blender, https://www.blender.org/) to render images of a given map along a specified trajectory. Adjacent images were subtracted, thresholded and randomly sampled to simulate the events generated by a DVS. We chose a pinhole camera model with intrinsics identical to the ones of a lens from the real experiments: 2.6 mm lens for a 1/3” sensor. A linear trajectory with constant acceleration was simulated. Results are reported in Fig. 4.
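The subtract, threshold and sample procedure can be sketched as below. Several details are assumptions, since the text does not specify them: the images are differenced in log-intensity, all events of a pair share one time-stamp, and sampling is uniform without replacement. The function name and arguments are hypothetical.

```python
import math
import random


def simulate_events(img_prev, img_next, t, threshold, max_events, rng):
    """Simulate events between two rendered images (lists of rows of
    positive intensities, indexed as img[y][x]).

    Returns a list of (x, y, t, polarity) tuples with at most
    max_events entries, drawn at random from the candidate pixels."""
    candidates = []
    for y, (row_prev, row_next) in enumerate(zip(img_prev, img_next)):
        for x, (a, b) in enumerate(zip(row_prev, row_next)):
            dL = math.log(b) - math.log(a)  # assumed log-intensity difference
            if abs(dL) > threshold:
                candidates.append((x, y, t, 1 if dL > 0 else -1))
    if len(candidates) > max_events:
        candidates = rng.sample(candidates, max_events)
    return candidates
```

Passing a seeded `random.Random` instance as `rng` keeps the simulated stream reproducible across runs.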

Fig. 4: Constant acceleration experiment. (a) Estimated position. (b) Estimated velocity. (c) Relative errors in position and velocity between simulated trajectory and estimated one.

Groups of 500 events every 8 ms were generated between adjacent images. The algorithm processed 230k events. This experiment validated the measurement function (10) since the kinematic model (8) alone cannot predict the DVS motion. The results show that the filter successfully estimated the DVS pose and velocity, with small relative errors (Fig. 4c).

IV-B Real data

For the experiment with real data, we mounted the DVS on a model train that runs on a straight track with constant velocity. The DVS faced sideways and observed a planar scene at a constant distance. The scene contains a pattern of complex black and white stripes and a set of circles at known locations; the latter were used for extrinsic calibration. The DVS was intrinsically calibrated using standard camera calibration techniques on the imaged points detected from the projection of an array of blinking LEDs placed in a checkerboard configuration. Horizontal edges are parallel to the apparent motion and, consequently, do not trigger events. The intensities of the map were smoothed to provide non-zero gradients in the regions near sharp edges that generate events, and hence to smooth the response of the contrast function (10) and the corresponding likelihood in such regions. Fig. 5 reports some of the results of this experiment. Fig. 5c shows, for a few hundred events (Fig. 5a), the measured absolute contrast used in the implicit measurement function (10). Because the map intensities are given in arbitrary units (log of gray levels) and physical measurements of the incoming light that the DVS used to trigger the events are lacking, the threshold values in Fig. 1c are not applicable to the map, and so a few events are used to estimate the threshold corresponding to the given map. The filter processed about 100k events and successfully estimated the pose and velocities of the DVS throughout the event stream. Figs. 2 and 3 were also obtained from this experiment.
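A simple way to estimate the threshold from a handful of events is to fit a Gaussian to their absolute contrasts and take its center. The sketch below uses a moment-based fit (sample mean and standard deviation), which is an assumption and may differ from the fitting procedure actually used; the function name is illustrative.

```python
import statistics


def estimate_threshold(abs_contrasts):
    """Moment-based Gaussian fit to absolute contrast samples.
    Returns (threshold_estimate, spread): the sample mean, taken as the
    center of the Gaussian, and the sample standard deviation."""
    center = statistics.fmean(abs_contrasts)
    spread = statistics.stdev(abs_contrasts)
    return center, spread
```

The spread returned alongside the center could serve as a starting value for the measurement noise of the filter, expressed in the same arbitrary units as the map.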

Fig. 5: Experiment with approximately constant velocity motion. (a) Visualization of a few events from the DVS (positive events in cyan, negative events in magenta) used for filter initialization, overlaid on the rendered map. (b) Time since the last event at each pixel (the time span in (10)). (c) Normalized histogram of the absolute contrast in (10) (solid line) and Gaussian fit (dashed line) (cf. Fig. 1c). The mode of the Gaussian corresponds to the threshold. (d) Innovations sequence. Estimated position (e) and velocity (f) of the event-based camera.

V Conclusion

We have successfully developed an implicit EKF for event-based camera (DVS) localization based on the contrast residual (10), which provides a natural measure to define the likelihood of an event. For this, we derived a generative event model that incorporates the physical characteristics of the DVS. Our algorithm readily matches the asynchronous nature of the events and allows filter updates on an event-by-event basis. An additional advantage of our approach is that the contrast residual naturally takes into account a dense map representation of the environment, removing the data-association sub-problem. In future work, we plan to extend the developed method to event-based SLAM without additional sensing.