Compelling augmented reality (AR) experiences are achieved through the successful execution of several tasks in parallel. Notably, simultaneous localization and mapping (SLAM) [newcombe2011kinectfusion], hand tracking [mueller2017real], and object tracking in 6 degrees of freedom (6-DOF) [garon2018framework] must all be executed efficiently and concurrently with minimal latency on portable, energy-efficient devices.
This paper focuses on the task of 6-DOF rigid object tracking. In this scenario, successfully tracking the object at high speed is particularly important, since freely manipulating an object can easily result in translational and angular speeds of up to 1 m/s and respectively. Despite recent progress on real-time 6-DOF object tracking at 30 fps [garon2018framework, manhardt2018deep, li2018deepim], these methods still have trouble with very high object motion and tracking failures are still common. Increasing the speed of 6-DOF object trackers is of paramount importance to bring this problem closer to real-world applications.
To increase the speed of object tracking, one can trivially employ cameras with framerates higher than 30 fps. Indeed, 90 and even 120 fps off-the-shelf cameras are available and could be used as a drop-in replacement. However, this comes with significant practical disadvantages: higher data bandwidth, increased power consumption (since the algorithms must be executed more often), and the need for sufficient light in the scene, since the exposure time of each frame is necessarily decreased.
In this work, we propose a system to increase the speed of 6-DOF object tracking applications with a minimal increase in bandwidth and power consumption. Specifically, we propose to combine an event camera (the DAVIS346) with an RGB-D camera (the Kinect Azure) into a single “RGB-D-E” capture system. The event camera offers several key advantages: very low latency, bandwidth, and power consumption (10–30 mW), all while having much greater dynamic range (120 dB vs 60 dB) than frame-based cameras.
This paper makes the following contributions. First, we show how to calibrate the setup both spatially and temporally. Second, we provide a new challenging 6-DOF evaluation dataset that contains approximately 2,500 RGB-D-E frames of a real-world object undergoing high-speed motion, with the corresponding ground truth pose at each frame. The dataset will be made public upon acceptance of the paper. Third, we propose what we believe to be the first 6-DOF object tracker that uses event-based data. Similar to previous work [garon2018framework, manhardt2018deep, li2018deepim], our approach assumes that the object to track is rigid (non-deforming) and that its textured 3D model is known a priori. Finally, we demonstrate through a quantitative analysis on our real evaluation dataset that extending an existing deep learning approach for 6-DOF object tracking with event data results in a threefold decrease in the number of tracking failures and achieves robust tracking on fast free-interaction motions. We believe this paper brings 6-DOF object tracking one step closer to real-world augmented reality consumer applications.
2 Related work
The majority of computer vision systems rely on established frame-based camera architectures, where the scene irradiance is captured synchronously at each pixel or in a rapid, rolling shutter sequence [liang2008analysis]. However, such cameras need to stream large amounts of data (most of which is redundant), making them power- and bandwidth-hungry. Recently, a newer camera architecture built on an event-based paradigm [lichtsteiner2008128] has been gaining popularity. By triggering events at each pixel asynchronously when the brightness at that pixel changes by a certain threshold, event-based cameras can stream at a much higher frequency while consuming less power. A branch of computer vision research now focuses on developing algorithms to take advantage of this new type of data.
Event-based applications. Event-based sensors hold great promise in the field, as their low power consumption makes them ideal for embedded systems such as virtual reality headsets [gallego2017event], drones [vidal2017hybrid, delmerico2019we], or autonomous driving [maqueda2018event]. Their high temporal resolution also enables the design of robust high-frequency algorithms such as SLAM [nguyen2019real, vidal2018ultimate, weikersdorfer2014event, rebecq2016evo, bryner2019event, gallego2017event, kim2016real] or fast 2D object tracking [glover2017robust, mitrokhin2018event]. While the latter is related to our work since we also focus on tracking, these works are still restricted to tracking objects in the 2D image plane. In this paper, we extend the use of event cameras to the challenging task of fast 6-DOF object tracking by building on a state-of-the-art frame-based 6-DOF object tracker [garon-tvcg-17]. Unlike other works, we benefit from RGB, depth, and event data to propose the first RGB-D-E 6-DOF object tracker.
Deep learning with events. Using event-based data is not straightforward, since the most efficient deep architectures for vision (e.g. CNNs) are designed for processing conventional image data. In fact, it is still unclear how event-based data should be provided to networks, since each event is a 4-dimensional vector storing time, 2D position, and event polarity. Experimental architectures such as spiking neural networks [maass2004computational] hold great promise but are currently unstable or difficult to train [lee2016training]. With conventional deep frameworks, events can be converted to 2D tensors by discarding both the time and polarity dimensions [rebecq2017real], or to 3D tensors by discarding either of the two [maqueda2018event, zhu2019unsupervised]. Recent work [rebecq2019events, scheerlinck2020fast] has demonstrated that conventional grayscale frames can be reconstructed from event data, opening the way to the use of existing algorithms on these “generated” images. In this paper, we favor the Event Spike Tensor formulation from Gehrig et al. [Gehrig_2019], where the time dimension is binned. This allows us to exploit event data directly without requiring the synthesis of intermediate images, while maintaining a fast convolutional network architecture.
Event-based datasets. Finally, large amounts of training data are required. While a few event datasets exist, mostly for localization/odometry [delmerico2019we, li2018interiornet, mueggler2017event] or 2D object tracking [hu2016dvs], there is, as of yet, no 6-DOF object tracking dataset that contains event data. Instead, event data can be synthesized with a simulator such as [Rebecq18corl], which allows various types of data augmentation [rebecq2019high]. Our experiments show that a network can be trained without using real data and is not critically affected by the real-synthetic domain gap.
3 System overview and calibration
Figure 1: (a) Reprojection error computed on 51 matching planar checkerboard images. (b) Linear regression of the Kinect depth map error compared to the expected depth, computed on calibration target corners.
In this section, we describe our novel RGB-D-E hardware setup, which combines a Microsoft Kinect Azure (RGB-D) with a DAVIS346 event camera (E).
3.1 System overview
As illustrated in the figure of our capture setup, the DAVIS346 event camera is rigidly mounted over the Kinect Azure using a custom-designed, 3D-printed mount. We observed that the modulated IR signal projected on the scene by the Time-of-Flight (ToF) sensor in the Kinect triggered multiple events in the DAVIS346 camera. To remedy this, an infrared filter is placed in front of the event camera lens.
3.2 Spatial calibration
Our system contains 3 cameras that must be calibrated: the Kinect RGB, the Kinect Depth and the DAVIS346 sensor. In this paper, we describe a coordinate system transformation with the notation , denoting a transformation matrix from coordinate frame to .
The intrinsic parameters of each camera can be computed with a standard method [zhang2000flexible]. The checkerboard corners can easily be found using the color frame and the IR image from the Kinect Azure. Calibrating an event-based sensor is usually more difficult; however, the DAVIS346 possesses an APS sensor (grayscale frame-based capture) that is spatially aligned with the event-based capture sensor. We thus use the APS sensor to detect the target corners that will be used for the intrinsic and extrinsic calibration.
Intrinsics. We capture images where a checkerboard target (with 54 mm squares) is placed at positions uniformly distributed in the frustum of each camera. To account for varying fields of view, 199 images were captured for the Kinect RGB, 112 for the Kinect Depth, and 50 for the DAVIS346. For each sensor, we retrieve the intrinsic parameters (focal length and image center) with a lens distortion model including 6 radial and 2 tangential parameters.
Extrinsics. We retrieve the rigid transformations between the cameras by capturing images of the target in overlapping frustums. Once the 3D points are retrieved from the previously computed camera intrinsics and the known checkerboard geometry, PnP [fischler1981random] is used to retrieve the 6-DOF transformation between each pair of cameras.
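As a minimal sketch of the last step, assume PnP has already returned the pose of the same checkerboard view in two cameras as 4x4 homogeneous matrices. The relative transformation then follows by composing one pose with the inverse of the other. All function and variable names below are hypothetical, the numeric poses are made up, and a real calibration would aggregate many views rather than rely on a single one.

```python
import numpy as np

def make_pose(rotation_z_deg, translation):
    """Build a 4x4 rigid transform from a rotation about Z and a translation."""
    a = np.deg2rad(rotation_z_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a),  np.cos(a), 0.0],
                 [0.0,        0.0,       1.0]]
    T[:3, 3] = translation
    return T

def relative_extrinsics(T_board_to_a, T_board_to_b):
    """Transform mapping points from camera a's frame to camera b's frame."""
    return T_board_to_b @ np.linalg.inv(T_board_to_a)

# Hypothetical board poses, as PnP would return them for each camera.
T_board_to_rgb = make_pose(10.0, [0.05, 0.00, 0.80])
T_board_to_event = make_pose(12.0, [0.02, 0.06, 0.78])
T_rgb_to_event = relative_extrinsics(T_board_to_rgb, T_board_to_event)

# A board corner must land at the same place through either route.
corner_board = np.array([0.054, 0.108, 0.0, 1.0])
via_rgb = T_rgb_to_event @ (T_board_to_rgb @ corner_board)
direct = T_board_to_event @ corner_board
```

The consistency check at the end is a convenient sanity test when chaining several such transformations.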
Finally, we compare our calibration procedure with the factory presets of the Kinect Azure. Motivated by previous work [chen2018calibrate] reporting lower accuracy with the factory-preset calibration, we capture a test dataset of 45 target images and show in Fig. 1-(a) that we obtain a lower reprojection error.
3.3 Depth correction
As [hodan2017t, garon2018framework] reported for the Kinect 2, we also found that the depth from the Kinect Azure has an offset that changes linearly w.r.t. the depth distance, with an average error of approximately 8.5 mm. We compare the target points from the calibration dataset with the depth pixels in each frame and fit a 2nd-degree polynomial to the errors w.r.t. their distance to the camera. In Fig. 1-(b), we show the error with and without the polynomial correction on the test calibration set. Using the correction, the mean error on the test calibration set is less than 4 mm.
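The correction can be illustrated with a short sketch: fit a 2nd-degree polynomial to the depth error as a function of the measured depth, then subtract the predicted error. The data below is synthetic and the function names are hypothetical; we also assume here that the error is modeled against the measured depth, in millimeters.

```python
import numpy as np

def fit_depth_correction(measured_mm, expected_mm, degree=2):
    """Fit a polynomial predicting the depth error from the measured depth."""
    error = measured_mm - expected_mm
    return np.polyfit(measured_mm, error, degree)

def correct_depth(measured_mm, coeffs):
    """Subtract the predicted error from the measured depth."""
    return measured_mm - np.polyval(coeffs, measured_mm)

# Synthetic calibration data (made up): the error grows with distance.
expected = np.linspace(400.0, 1500.0, 50)
measured = expected + 2.0 + 0.006 * expected

coeffs = fit_depth_correction(measured, expected)
corrected = correct_depth(measured, coeffs)
```

In practice the coefficients would be fitted on the calibration set and evaluated on held-out target images.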
3.4 Temporal synchronization
In a multi-sensor setup, each sensor acquires data at its own frequency, aligned with its internal clock. For time-critical applications such as fast object tracking, the sensor clocks must be synchronized to ensure temporal alignment of the data. This is commonly addressed with synchronization pulses emitted by a master sensor at the beginning of each data frame acquisition, which subsequently trigger the acquisition of the other (slave) sensors.
In our setup, both the Kinect and the DAVIS346 support hardware synchronization, but we found that the Kinect (master) emits a variable number of pulses before the first RGB-D frame. This led to incorrect triggering of the DAVIS346 (slave) and thus temporal misalignment of RGB-D and event data. Because pulses are always emitted at the same frequency, we fix this by computing the pulse offset from the timestamp of the first RGB-D frame and the Kinect frame rate (here, 30 fps), and then use this offset to pair RGB-D and event frames. Fig. 2 illustrates the projection of events captured on a moving checkerboard. The events are captured between the two RGB frames. Their alignment with the borders of the pattern demonstrates the quality of the temporal and spatial calibration.
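One plausible way to implement such an offset is sketched below, assuming that timestamps are measured in seconds from the first emitted pulse. The rounding, the function names, and the index-based pairing are assumptions made for illustration and are not necessarily the exact expressions used in our system.

```python
def pulse_offset(first_frame_timestamp_s, frame_rate_hz=30):
    """Number of sync pulses emitted before the first RGB-D frame (assumed formula)."""
    return round(first_frame_timestamp_s * frame_rate_hz)

def paired_event_index(rgbd_index, offset):
    """Index of the event window assumed to match RGB-D frame `rgbd_index`."""
    return rgbd_index + offset

# Hypothetical example: the first RGB-D frame arrives 0.1 s after the first pulse.
offset = pulse_offset(0.1)
pairs = [(i, paired_event_index(i, offset)) for i in range(3)]
```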
4 Fast 6-DOF object tracking
With the sensors spatio-temporally calibrated, we enhance an existing tracking framework by adding the new event modality (E). We build on the work of Garon et al. [garon-tvcg-17, garon2018framework], who propose a deep learning approach for robust 6-DOF object tracking that relies on the refinement between a render of the object at the current pose estimate and the current Kinect RGB-D frame. While this method is robust to occlusion and small displacements, we notice that it is significantly impacted by larger motions (over 0.5 m/s), possibly because of the induced motion blur. Additionally, the network in [garon2018framework] is fundamentally limited by a maximum pose translation of 2 cm between two frames. We note that increasing the sensor frame rate is also not a practical solution, as the network computation time is the main bottleneck. In this section, we improve the tracker reactivity and robustness with the addition of an event-specific network. In the following, we first describe the generation of synthetic data for training and then explain how the frame-based and event-based trackers are jointly used.
4.1 Training data generation
Despite the existence of event datasets [hu2016dvs, li2018interiornet, zhu2019eventgan], none of them provide event data with 6-DOF object pose. Since capturing a dataset of sufficient magnitude and variety for training a deep network is prohibitive, we rely on synthetic data generated from an event camera simulator [Rebecq18corl]. The engine renders a stream of events that represent changes in pixel brightness, thus mimicking event-based sensors.
We build a training dataset by generating sequences of events where our target object (here, a toy dragon) is moved in front of a static camera.
We acquire a textured 3D model of the dragon with a Creaform GoScan™ handheld 3D scanner at 1 mm voxel resolution, subsequently cleaned manually using Creaform VxElements™ to remove background and spurious vertices.
As the camera remains stationary, we simulate the scene background with a random RGB texture from the SUN3D dataset [xiao2013sun3d] applied on a plane orthogonal to the virtual camera optical axis. We next describe the simulation setup followed by various data augmentation strategies applied to the data sample.
Simulation details. Event sequences are generated by first positioning the object in front of the camera at a random distance, sampled from a uniform distribution, and a random orientation. The center of mass of the object is aligned with the optical axis of the camera, so the object appears in the center of the frame. The object is then displaced by a random pose transformation over 33 ms and the generated events are recorded. The transformation is generated by first sampling two directions on the sphere using spherical coordinates, as in [garon2018framework], and then sampling the magnitudes of the translation and rotation. A 3D bounding box of size 0.207 m around the object is projected on the image plane. The event spatial axes are then cropped according to the projected bounding box and resized with bilinear interpolation to a fixed spatial resolution. Each 33 ms pose transformation generates a set of events, each storing $(t, x, y, p)$, where $t$ is time, $x$ and $y$ are pixel coordinates, and $p$ is the polarity of the event (positive or negative, indicating a brighter or darker transition respectively). A total of 10 such event sets are simulated for each background image, leading to 180,000 training and 18,000 validation sets.
Data augmentation. To maximize the reliability of our simulations, we randomize some parameters as in [rebecq2019high] to increase variability in the dataset and reduce the domain gap between synthetic and real data. The contrast threshold, which defines the change in brightness required to generate an event, is difficult to estimate precisely on real sensors [gallego2017event] and is instead sampled from a Gaussian distribution. Subsequently, the proportion of ambient versus diffuse lighting for the OpenGL rendering engine (employed in the simulator) is randomly sampled. To simulate tracking errors, the center of the bounding box is offset by a random displacement in pixels. Finally, we notice white noise in the events captured by the DAVIS346. To mimic this noise, we measure the number of events generated by the real DAVIS346 camera when viewing a static scene and fit Gaussian distributions to it. At training time, we sample a number $n$ from the fitted distribution, randomly select $n$ elements in the set (across $x$, $y$, and $t$), and add a noisy event at each. This process is done separately for each polarity (positive and negative). Fig. 3 shows the qualitative similarity between real samples acquired with the DAVIS346 and our synthetic samples at the same pose.
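The noise injection step can be sketched as follows. This is an illustration under stated assumptions rather than our exact implementation: events are stored as rows of (t, x, y, p), noise positions and times are drawn uniformly, and the same hypothetical Gaussian parameters are used for both polarities, whereas separate fits per polarity could be used instead.

```python
import numpy as np

def add_noise_events(events, mean_count, std_count, height, width, duration, rng):
    """Append uniformly placed noise events, drawn separately for each polarity."""
    parts = [events]
    for polarity in (1, -1):
        # Number of noise events, sampled from the fitted Gaussian.
        n = max(0, int(round(rng.normal(mean_count, std_count))))
        noise = np.column_stack([
            rng.uniform(0.0, duration, n),   # t
            rng.integers(0, width, n),       # x
            rng.integers(0, height, n),      # y
            np.full(n, polarity),            # p
        ])
        parts.append(noise)
    merged = np.concatenate(parts)
    # Keep the stream ordered by timestamp.
    return merged[np.argsort(merged[:, 0])]

rng = np.random.default_rng(0)
clean = np.array([[0.010, 5.0, 7.0, 1.0],
                  [0.020, 6.0, 7.0, -1.0]])
# A standard deviation of 0 makes the count deterministic for this example.
noisy = add_noise_events(clean, mean_count=5, std_count=0.0,
                         height=32, width=32, duration=0.033, rng=rng)
```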
4.2 Approach overview
In this paper, we assume that the pose of the object in the previous frame is known. In a full system, it could be initialized by a 3D object detector (e.g. SSD-6D [kehl2017ssd]) at the first frame. The task of a temporal object tracker is to determine the relative pose change between two frames, such that an estimate of the current pose can be obtained by applying this change to the previous pose. Note that all poses are expressed in the RGB camera coordinate system.
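As an illustration of this update, poses can be represented as 4x4 homogeneous matrices and the relative change applied by matrix multiplication. The left-multiplication convention and the helper names below are assumptions made for this sketch. Other conventions, such as adding the translation and composing the rotation separately, are equally possible.

```python
import numpy as np

def pose_matrix(rotation, translation):
    """Assemble a 4x4 homogeneous pose from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def update_pose(previous_pose, relative_pose):
    """Apply the relative change to the previous pose (left-multiplication assumed)."""
    return relative_pose @ previous_pose

# Made-up example: the object sits 0.8 m in front of the camera, then rotates
# by 90 degrees about the optical axis and shifts by 1 cm along x.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
previous = pose_matrix(np.eye(3), [0.0, 0.0, 0.8])
relative = pose_matrix(rot_z_90, [0.01, 0.0, 0.0])
current = update_pose(previous, relative)
```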
In this work, we rely on two deep networks to estimate the relative pose change. First, our novel event network takes event data accumulated during the time interval between two frames, cropped according to the previous object pose. The extrinsic camera calibration matrix from Sec. 3.2 is necessary here to transform the pose estimate into the event camera coordinate system. Second, we also employ the RGB-D frame network of Garon et al. [garon2018framework], which takes as input the current RGB-D frame and a rendering of the object at the previous pose (see [garon-tvcg-17, garon2018framework] for more details). Each network aims to estimate the relative 6-DOF pose of the object. Interestingly, while events are much more robust to fast displacements, they carry less textural information than RGB-D data, and we found that the event network used on its own is slightly less accurate. Therefore, we use a cascade approach where the event network first estimates the relative pose change, and the frame network is subsequently provided with this new estimate for refinement, with the change of coordinate system obtained from the extrinsic camera calibration matrices from Sec. 3.2 as before. Note that the frame network is an iterative method and can be run multiple times to refine its prediction. To simplify the notation we show a single iteration; in practice, 3 iterations are used as in the original implementation. A diagram overview of the method is provided in Fig. 4.
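The cascade can be sketched in a few lines, with the two networks replaced by stand-in callables. Everything here is hypothetical: the callable signatures, the way poses are moved between the RGB and event coordinate systems, and the left-multiplied pose update are assumptions made for illustration only.

```python
import numpy as np

def translation_pose(offset):
    """4x4 pose with identity rotation and the given translation."""
    T = np.eye(4)
    T[:3, 3] = offset
    return T

def cascade_step(prev_pose_rgb, T_rgb_to_event, event_net, frame_net,
                 events, rgbd, iterations=3):
    """Coarse event-based estimate followed by iterative frame-based refinement."""
    # Express the previous pose in the event camera coordinate system.
    prev_pose_event = T_rgb_to_event @ prev_pose_rgb
    pose_event = event_net(events, prev_pose_event) @ prev_pose_event
    # Bring the coarse estimate back to the RGB camera coordinate system.
    pose_rgb = np.linalg.inv(T_rgb_to_event) @ pose_event
    for _ in range(iterations):
        pose_rgb = frame_net(rgbd, pose_rgb) @ pose_rgb
    return pose_rgb

# Stand-in "networks" returning fixed relative poses (not real models).
def fake_event_net(events, pose):
    return translation_pose([0.02, 0.0, 0.0])

def fake_frame_net(rgbd, pose):
    return translation_pose([0.001, 0.0, 0.0])

prev_pose = translation_pose([0.0, 0.0, 0.8])
T_rgb_to_event = translation_pose([0.0, 0.05, 0.0])
new_pose = cascade_step(prev_pose, T_rgb_to_event, fake_event_net,
                        fake_frame_net, events=None, rgbd=None)
```

With these stand-ins, the event stage contributes a 2 cm shift and each of the three refinement iterations adds 1 mm.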
Figure 5: Architectures of the RGB-D frame network and the event network. Dropout of 30% is used after the activation function of each “fire” module and both “FC-500” layers. All layers (except “FC-6”) are followed by an activation function.
4.3 Event network
Event data is fundamentally different from frame-based data, as it possesses two extra dimensions for discretized time and polarity. We use the “Event Spike Tensor” representation from [Gehrig_2019], where the time dimension is binned (in our case, 9 bins for a 33 ms sample) and the polarity dimension is removed by simply subtracting the negative events from the positive ones. Finally, the spatial dimensions are resized as explained in the previous section. In the final tensor, each voxel represents the number of events recorded per time bin. We normalize that quantity between 0 and 1 by dividing each voxel by the maximum number of events seen in a single voxel during training.
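The binning described above can be sketched as follows. This is a simplified illustration: it accumulates signed event counts into 9 temporal bins and omits both the spatial cropping and the learned temporal filtering of [Gehrig_2019]. The function names, array layout, and the normalization constant are assumptions.

```python
import numpy as np

def events_to_tensor(t, x, y, p, height, width, duration=0.033, num_bins=9):
    """Accumulate signed event counts into a (num_bins, height, width) tensor."""
    bins = np.minimum((t / duration * num_bins).astype(int), num_bins - 1)
    signs = np.where(p > 0, 1.0, -1.0).astype(np.float32)
    tensor = np.zeros((num_bins, height, width), dtype=np.float32)
    # np.add.at handles repeated indices, so several events can hit one voxel.
    np.add.at(tensor, (bins, y, x), signs)
    return tensor

def normalize(tensor, max_count):
    """Divide by the largest per-voxel count observed during training."""
    return tensor / max_count

# Four made-up events: three near the start of the window, one near the end.
t = np.array([0.000, 0.001, 0.001, 0.032])
x = np.array([1, 1, 1, 3])
y = np.array([2, 2, 2, 0])
p = np.array([1, 1, -1, -1])
tensor = events_to_tensor(t, x, y, p, height=4, width=4)
scaled = normalize(tensor, max_count=2.0)
```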
Event network architecture. While the event spike tensor can be processed by a standard CNN, we follow [Gehrig_2019] and first learn a 1D filter in the time dimension, then apply a standard image convolution where the time dimension acts as different channels. In practice, we use the same backbone from [garon2018framework] for both the RGB-D frame network and the event network, changing only the first two input layers to match the event spike tensor. Fig. 5 shows the two architectures. The RGB-D network is trained with the same hyper-parameters as in [garon2018framework]. The event network is optimized with ADAM [kingma2014adam] at a learning rate of 0.001 and a batch size of 256. We train for 40 epochs and apply a learning rate schedule that multiplies the learning rate by 0.3 every 8 epochs.
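For concreteness, the step schedule can be written as a small function. We assume here that epochs are counted from 0 and that the decay is applied at epochs 8, 16, 24, and 32; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.3, step=8):
    """Step schedule: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Learning rate used at each of the 40 training epochs.
schedule = [learning_rate(epoch) for epoch in range(40)]
```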
We now proceed to evaluate our RGB-D-E system for the high-speed tracking of 3D objects in 6-DOF. We first describe our real test dataset, then present quantitative and qualitative results.
5.1 Test dataset
In order to compare the RGB-D and the RGB-D-E trackers, we capture a series of real sequences of a rigid object freely moving at various speeds under different environmental perturbations, and record the corresponding RGB-D frames and events using our capture setup. To provide a quantitative evaluation, we obtain the ground truth pose of the object at each frame using the approach described below. We capture a total of 10 sequences with an average duration of 10 seconds, for a total of 2,472 frames and corresponding event data. The full dataset will be made public upon acceptance of this paper.
For each sequence, we first manually align the 3D model with the object on the first RGB frame. Then, we use ICP [Pomerleau12comp] to align the visible 3D model vertices with the depth from the RGB-D frame, back-projected in 3D. To avoid back-projecting the entire depth frame, only a bounding box centered around the initial pose is kept. Vertex visibility is computed using raytracing and updated at each iteration of ICP. If the angular pose difference between two successive iterations of ICP falls below a fixed threshold, ICP is deemed to have converged and that pose is kept. If that condition is not met after a maximum of 10 iterations, ICP is deemed to have diverged and the final pose is refined manually. For all subsequent frames in the sequence, ICP is initialized with the pose from the previous frame. Every frame in our test dataset is manually inspected to ensure that a good quality pose is obtained, even when it has been determined automatically.
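We quantitatively compare our RGB-D-E tracker with the RGB-D approach of Garon et al. [garon2018framework], which is the current state-of-the-art in 6-DOF object tracking. We represent a pose $\mathbf{P}$ by a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$. The translation error between a pose estimate and its ground truth (denoted by $^*$) is reported as the L2 norm between the two translation vectors,
$$\delta_t(\mathbf{t}^*, \mathbf{t}) = \lVert \mathbf{t}^* - \mathbf{t} \rVert_2 .$$
The rotation error between the two rotation matrices is computed using
$$\delta_R(\mathbf{R}^*, \mathbf{R}) = \arccos\left(\frac{\mathrm{Tr}(\mathbf{R}^{*\top}\mathbf{R}) - 1}{2}\right),$$
where $\mathrm{Tr}$ denotes the matrix trace.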
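Both metrics are straightforward to compute. The sketch below assumes the standard trace-based geodesic angle between two rotation matrices, reported in degrees, and clips the cosine to guard against rounding errors. The function names are hypothetical.

```python
import numpy as np

def translation_error(t_est, t_gt):
    """L2 norm between the estimated and ground truth translations."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def rotation_error_deg(R_est, R_gt):
    """Angle (in degrees) of the rotation separating the two matrices."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    # Clip to the valid arccos domain to absorb floating-point drift.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Made-up example: a 30 degree rotation about z and a 5 cm translation offset.
angle = np.deg2rad(30.0)
R_gt = np.eye(3)
R_est = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
err_t = translation_error([0.03, 0.04, 0.80], [0.0, 0.0, 0.80])
err_r = rotation_error_deg(R_est, R_gt)
```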
Fig. 6 compares the translation and rotation errors obtained by both approaches. These plots report the error between two adjacent frames only: the trackers are initialized to their ground truth pose at the initial frame. Our method reports lower errors at translation speeds higher than 20 mm/frame, which corresponds to approximately 600 mm/s, and similar rotation errors overall. This is not surprising, given the fact that our method relies on the RGB-D network of Garon et al. [garon2018framework] to obtain its final pose estimate.
However, visualizing the per-frame error does not tell the whole story. Indeed, in a practical scenario, the trackers estimate a succession of predictions instead of being reset to the ground truth pose at every frame. Even small errors may therefore accumulate over time and result in tracking failure. Following [garon2018framework], we consider a tracking failure to occur when either the translation error or the rotation error exceeds its threshold. Results of this analysis are presented in Tab. 1. While the RGB-D approach of Garon et al. results in 83 failures, our RGB-D-E tracker decreases that number to just 28.
Fig. 7 shows representative qualitative results comparing both techniques with the ground truth. Those results show that the approach of Garon et al. [garon2018framework] is affected by the strong motion blur which arises under fast object motion. In contrast, our approach remains stable and can follow the object through very fast motion. Please see video results in the supplementary materials.
Table 1: Number of tracking failures on the test dataset.
|Method|Failures|
|---|---|
|Garon et al. [garon2018framework]|83|
|Ours (RGB-D-E)|28|
We present a novel acquisition setup for simultaneous RGB-D-E capture, which combines a Kinect Azure camera with a DAVIS346 sensor. With the new event modality, we show that a state-of-the-art RGB-D 6-DOF object tracker can be significantly improved in terms of tracking speed. We capture an evaluation dataset with ground truth 3D object poses that mimics difficult scenarios typically encountered in augmented reality applications: a user manipulating a small object with fast free motions. Using this dataset, we demonstrate that our approach achieves a threefold decrease in loss of tracking over the previous state-of-the-art, thereby bringing 6-DOF object tracking closer to applicability in real-life scenarios.
Limitations and future work. First, capturing an evaluation dataset is time-consuming, and obtaining the 6-DOF ground truth pose of the object is difficult, especially when fast motions are involved. While our semi-automatic approach provided a way to acquire a small number of sequences easily, scaling up to larger RGB-D-E datasets will require more sophisticated apparatus such as a motion capture (mocap) setup, as in [garon2018framework]. Indeed, mocap systems are ideal for this use case as they can track the object robustly at high framerates. Second, while the cascade scheme significantly improves the tracker's robustness to large motions, it is still inherently limited in accuracy since it always relies on the frame network. The success of the cascade configuration motivates further exploration of better ways to fuse the event modality with the previous frame-based modalities. Third, we notice that the trackers are still sensitive to dynamic backgrounds (see the last example in the supplementary video). We anticipate that this could be partially solved by generating training data with spurious structured events, such as those created by a dynamic background or a moving camera. These represent exciting future research directions that we plan to investigate in order to achieve even more robust and accurate object tracking systems that can be used in real-world augmented reality applications.