1 Introduction
With the rapid development of capture and computation devices, such as depth sensors and GPUs, real-time 3D reconstruction has made significant progress. In recent years, many works have focused on indoor scene reconstruction. For example, InfiniTAM [1] uses only depth information to reconstruct 3D models and estimates camera poses with an iterative closest point (ICP) algorithm [2]. However, depth alone is extremely brittle in situations such as geometry-poor scenes, bright windows and depth sensor noise, and cannot eliminate accumulated errors. Breakthrough progress in drift-free camera tracking has been made by monocular RGB-based methods, including direct methods [3] and feature point methods [4], but these approaches cannot reconstruct detailed and accurate 3D models. Furthermore, BundleFusion [5] and ElasticFusion [6] use both color and depth information to estimate camera motion and generate 3D models based on implicit truncated signed distance fields (TSDFs) and surfel representations, respectively.
Although these works exhibit reasonable results [7], they still rely on strong assumptions: a static scene without dynamic objects, sufficient texture and geometric information, slow camera motion and invariant illumination. These assumptions cannot be satisfied in many applications.
In this paper, we take a step further by handling fast camera motion. For both color and depth, fast camera motion leads to large inter-frame displacement, which makes it difficult to perform image feature matching (images may be blurred and feature appearances may vary greatly) and ICP-based depth alignment. Similar to Laidlow et al. [8], we address this issue by introducing IMU information gathered by an accelerometer and a gyroscope. By further combining it with color and depth information, robust camera pose estimation and geometry fusion of an indoor scene are jointly achieved. The main contributions of our work are as follows:
(1) An RGB-D-inertial 3D reconstruction system based on an extended Kalman filter (EKF) framework, which tightly combines the three kinds of information and jointly achieves camera pose estimation and patch deformation in the Kalman update step.
(2) A geometry-aware feature tracking method for handling fast camera motion, which utilizes patch features to adapt to blurry images and accounts for the deformation of patches when building feature matches between images with very different perspectives.
(3) Through experiments on public datasets (including both synthetic and real data) and our own data acquired by an Intel RealSense ZR300, we show that our approach outperforms state-of-the-art reconstruction systems under fast camera motion.
2 Method
The pipeline of our system is illustrated in Fig. 1. We introduce our method in four parts: geometry-aware feature tracking, which explores the SE effect and performs patch deformation; the filter framework, which explains the Kalman prediction and update steps; model fusion; and patch update.
2.1 Geometryaware Feature Tracking
Point-based feature tracking methods [4] extract an insufficient number of features when images are blurred or poorly textured. Thus, patch-based methods, which consider larger image regions, have been proposed [3] to track features in these situations. However, large patches may contain objects at different depth levels, which causes appearance changes in consecutive frames, especially under fast camera motion, leading to inaccurate feature tracking. To address this problem, we combine color and depth information to back-project 2D patches into 3D and reproject them into the camera of the next frame using an initial camera motion estimate. This projection deforms the original patches to model the appearance changes, so patch tracking can be achieved easily and accurately with the deformed patches.
2.1.1 SE effect and Patch Deformation
When the camera moves, a feature patch is seen from different perspectives in different frames, and thus the shape and position of the patch in image coordinates vary between frames. To account for this patch deformation, unlike Bloesch et al. [9], which only considers the 2D planar information of a patch, we use the 3D geometry of a patch to determine its 2D shape deformation between images recorded under fast camera motion.
Depending on the geometry of a patch and the irregularity of the camera motion, patches may deform differently in consecutive frames. Fig. 2 shows three representative cases of patch deformation:
Case 1. If there is no significant difference among the pixel depths in a patch, then no matter how aggressive the camera motion is, the general shape of the patch remains unchanged in two consecutive frames.
Case 2. When the camera moves slowly, the 2D shape of a patch still remains unchanged even if there is large depth variance within the patch.
Case 3. Different from Case 2, if the camera moves aggressively, the intensity distribution and shape of the patch change. If the camera moves from one viewpoint to another, a region of the patch (the yellow region in Fig. 2) may become occluded, so the patch shrinks in the current frame. Conversely, a region (the black region) that is occluded at the original viewpoint becomes visible once the camera moves, so the patch shape extends in the current frame. We call these phenomena the shrink effect and the extend effect, and refer to them jointly as the SE effect.
We design a unified deformation method to handle the SE effect. The patch deformation process is illustrated in Fig. 3 and formalized as follows:
Each pixel $j$ in a patch extracted from the last frame $k-1$ is defined as $f_j = (\mathbf{p}_j, I_j, d_j, \mathbf{n}_j)$, where $\mathbf{p}_j$ denotes the image coordinates of pixel $j$ in the patch, and $I_j$, $d_j$ and $\mathbf{n}_j$ denote its intensity, depth and 3D normal. $d_j$ and $\mathbf{n}_j$ are obtained from the depth image, and since we have 3D information, we call our patches 3D patch features. We first back-project each patch into the 3D world coordinate system:

$\mathbf{P}_j^{w} = \mathbf{T}_{k-1}\,\pi^{-1}(\mathbf{p}_j, d_j)$    (1)

Here, $\pi$ projects a point from the 3D camera coordinate system to the 2D pixel domain, and $\pi^{-1}$ represents the inverse operation. $\mathbf{T}_{k-1}$ is the camera pose of frame $k-1$, which transforms a 3D point in the camera coordinate system of frame $k-1$ to the world coordinate system. Thus $\mathbf{P}_j^{w}$ is in the world coordinate system but is indexed from a pixel in frame $k-1$; the intensity information $I_j$ does not change in the projection, and $\mathbf{P}_j^{w}$ indicates the 3D coordinate. Then we project it into the pixel coordinate system of the current frame $k$:

$\hat{\mathbf{p}}_j = \pi\big(\mathbf{T}_k^{-1}\,\mathbf{P}_j^{w}\big)$    (2)

Here, the hat in $\hat{\mathbf{p}}_j$ means the pixel is projected into this frame rather than originating from it. Note that $\mathbf{T}_k$ is unknown and affects the 2D position of the projection.
After projection, if two projected pixels happen to land on the same pixel coordinate, we consider the shrink effect to have occurred in this patch. This usually happens when a region close to the camera covers a more distant region, so we remove the pixels corresponding to the distant region. We then evaluate whether the extend effect happens: we set the shape of the projected patch to be the bounding box of all projected pixels, and the extend effect is detected if the height (or width) of the projected patch is greater than that of the original one. Whichever effect happens, we use the result as the deformed patch in the following feature tracking steps.
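As an illustration of the deformation procedure above, the following sketch back-projects a patch with an initial pose, reprojects it into the current frame, resolves the shrink effect with a z-buffer, and checks for the extend effect via the bounding box. This is a minimal sketch, not the paper's implementation; the camera intrinsics and all function names are illustrative.

```python
import numpy as np

# Hypothetical pinhole intrinsics (illustrative values only).
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5

def backproject(p, d):
    """Pixel (u, v) with depth d -> 3D point in camera coordinates (pi^-1)."""
    u, v = p
    return np.array([(u - CX) * d / FX, (v - CY) * d / FY, d])

def project(P):
    """3D camera-coordinate point -> pixel (u, v) (pi)."""
    X, Y, Z = P
    return np.array([FX * X / Z + CX, FY * Y / Z + CY])

def deform_patch(pixels, depths, T_prev, T_cur):
    """Project a patch from the previous frame into the current frame and
    detect the shrink/extend (SE) effect.

    pixels: (N, 2) pixel coordinates in the previous frame
    depths: (N,) depths of those pixels
    T_prev, T_cur: 4x4 camera-to-world poses of the two frames
    Returns projected integer pixel coords (shrink-resolved) and an SE flag.
    """
    T_rel = np.linalg.inv(T_cur) @ T_prev            # previous camera -> current camera
    projected, zbuf = {}, {}
    for p, d in zip(pixels, depths):
        Pc = T_rel @ np.append(backproject(p, d), 1.0)  # into current camera frame
        q = tuple(np.round(project(Pc[:3])).astype(int))
        # Shrink effect: two pixels land on the same coordinate -> keep the nearer one.
        if q not in zbuf or Pc[2] < zbuf[q]:
            zbuf[q] = Pc[2]
            projected[q] = p
    qs = np.array(list(projected.keys()))
    # Extend effect: bounding box of the projection grows beyond the original one.
    h0 = pixels[:, 0].max() - pixels[:, 0].min()
    w0 = pixels[:, 1].max() - pixels[:, 1].min()
    h1 = qs[:, 0].max() - qs[:, 0].min()
    w1 = qs[:, 1].max() - qs[:, 1].min()
    shrink = len(qs) < len(pixels)
    extend = h1 > h0 or w1 > w0
    return qs, shrink or extend
```

With identical poses and a flat patch, the projection reproduces the original pixels and no SE effect is flagged.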
2.1.2 Objective
In feature tracking, patches affected by the SE effect are replaced with the corresponding projected patches. We then track the projected patch features using both intensity and depth information.
The photometric error for each projected patch is computed as follows. We first extract a patch of the same size at the projected location in the current image, and then calculate the intensity difference between the extracted patch and the projected patch. The photometric error can be formalized as:

$e_{ph} = \frac{1}{N}\sum_{j=1}^{N}\big|I_k(\hat{\mathbf{p}}_j) - I_j\big|$    (3)

where $N$ denotes the number of pixels in the patch and $I_k(\hat{\mathbf{p}}_j)$ indicates the intensity of the current image at the projected pixel. We then compute the point-to-plane geometric error as:

$e_{g} = \frac{1}{N}\sum_{j=1}^{N}\mathbf{n}_j^{\top}\big(\pi^{-1}(\hat{\mathbf{p}}_j, D_k(\hat{\mathbf{p}}_j)) - \mathbf{T}_k^{-1}\mathbf{P}_j^{w}\big)$    (4)

where $D_k$ is the current depth image. Given $e_{ph}$ and $e_{g}$, the cost function for patch tracking is formulated as:

$E = \sum_{m=1}^{M}\big(e_{ph,m}^{2} + e_{g,m}^{2}\big)$    (5)

where $m$ denotes the patch index and $M$ indicates the number of patches.
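A direct transcription of Equations (3)-(5), under the assumption that the photometric and geometric errors are averaged per patch and summed in squared form, might look as follows; the function and field names are illustrative, not the paper's code.

```python
import numpy as np

def photometric_error(I_cur, I_patch):
    """Mean absolute intensity difference between the patch extracted at the
    projected location (I_cur) and the projected patch (I_patch), Eq. (3)."""
    return np.mean(np.abs(I_cur - I_patch))

def geometric_error(P_cur, P_patch, normals):
    """Point-to-plane error, Eq. (4): mean distance of each current-frame 3D
    point to the tangent plane of the corresponding patch point."""
    return np.mean(np.abs(np.einsum('ij,ij->i', normals, P_cur - P_patch)))

def tracking_cost(patches):
    """Sum of squared photometric and geometric errors over all patches, Eq. (5).
    Each patch is a dict holding the arrays used above (names are illustrative)."""
    cost = 0.0
    for m in patches:
        e_ph = photometric_error(m['I_cur'], m['I_patch'])
        e_g = geometric_error(m['P_cur'], m['P_patch'], m['n'])
        cost += e_ph**2 + e_g**2
    return cost
```

A patch whose intensities match and whose 3D points coincide contributes zero cost, which is the condition the Kalman update drives toward.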
2.2 Filter Framework
Our EKF framework aims to tightly combine the color, depth and inertial measurements. Specifically, we model the camera pose of each frame as the state of the EKF and solve for it in the filter. The observations of the EKF include both the color and depth images, and the relationship between the state and the observations is measured by the energy defined in Equation 5: if a state fits an observation exactly, the energy is zero. The inertial information, in turn, is used in the Kalman prediction step, where it serves to build the motion prediction model.
We follow the traditional Kalman filter in defining the variables. A nonlinear discrete-time system with state $\mathbf{x}_k$, observation $\mathbf{z}_k$, process noise $\mathbf{w}_k$ (with covariance $\mathbf{Q}$) and update noise $\mathbf{v}_k$ (with covariance $\mathbf{V}$) in the $k$th frame can be written as

$\mathbf{x}_k = f(\mathbf{x}_{k-1}, \mathbf{w}_{k-1})$    (6)

$\mathbf{z}_k = h(\mathbf{x}_k, \mathbf{v}_k)$    (7)
In our framework, the state of the filter is composed of the elements $\mathbf{x} = \{\mathbf{R}, \mathbf{t}\}$, with a camera rotation matrix $\mathbf{R}$ and a camera translation vector $\mathbf{t}$ relative to the world coordinate system, which is assigned to be the camera coordinate system of the first frame. In the following, the superscript '+' denotes the a-posteriori estimate of a variable calculated in the Kalman update step and '−' denotes the a-priori estimate from the Kalman prediction step.

2.2.1 Kalman Prediction and State Propagation
Given the a-posteriori estimate $\mathbf{x}_{k-1}^{+}$ with covariance $\mathbf{P}_{k-1}^{+}$, the prediction step of the EKF yields the a-priori estimate at the next frame:

$\mathbf{x}_k^{-} = f(\mathbf{x}_{k-1}^{+}, \mathbf{0})$    (8)

$\mathbf{P}_k^{-} = \mathbf{F}_{k-1}\,\mathbf{P}_{k-1}^{+}\,\mathbf{F}_{k-1}^{\top} + \mathbf{G}_{k-1}\,\mathbf{Q}\,\mathbf{G}_{k-1}^{\top}$    (9)

with the Jacobians:

$\mathbf{F}_{k-1} = \partial f/\partial \mathbf{x}\,\big|_{\mathbf{x}_{k-1}^{+}}, \quad \mathbf{G}_{k-1} = \partial f/\partial \mathbf{w}\,\big|_{\mathbf{x}_{k-1}^{+}}$    (10)
The key of the Kalman prediction step is to define the function $f$. In our EKF framework, the inertial measurements are employed in its definition. Following [10], we obtain the sensor acceleration $\mathbf{a}$ and angular velocity $\boldsymbol{\omega}$ from the inertial measurements. We assume that the IMU is synchronized with the camera and acquires measurements at a time interval $\Delta t$ much smaller than that of the camera. Hence, we denote by $m$ the number of inertial measurements acquired between two consecutive camera frames, and merge them by the preintegration method [11] to predict the camera rotation and translation between the two frames:
$\mathbf{R}_{i+1} = \mathbf{R}_i\,\exp(\boldsymbol{\omega}_i\,\Delta t)$    (11)

$\mathbf{v}_{i+1} = \mathbf{v}_i + (\mathbf{R}_i\,\mathbf{a}_i + \mathbf{g})\,\Delta t$    (12)

$\mathbf{t}_{i+1} = \mathbf{t}_i + \mathbf{v}_i\,\Delta t + \tfrac{1}{2}(\mathbf{R}_i\,\mathbf{a}_i + \mathbf{g})\,\Delta t^{2}$    (13)

In the above three equations, the subscript $i$ denotes the corresponding variable at the $i$th IMU input between consecutive camera frames. Besides, $\mathbf{v}$ is the accumulated IMU linear velocity from the last camera frame to the current camera frame, and $\mathbf{T}_{ic}$ is the extrinsic matrix from the IMU coordinate system to the camera coordinate system. $\mathbf{g}$ is the gravity acceleration, and $\exp(\cdot)$ denotes the exponential map from the Lie algebra to the Lie group. Details about this prediction step can be found in [10]. Finally, the state predicted at the current frame can be formulated as:

$\mathbf{x}_k^{-} = \{\mathbf{R}_k^{-},\,\mathbf{t}_k^{-}\} = \mathbf{T}_{ic} \circ \{\mathbf{R}_m,\,\mathbf{t}_m\}$    (14)

where $\{\mathbf{R}_m, \mathbf{t}_m\}$ is the integrated IMU pose after the $m$ measurements and $\circ$ denotes applying the IMU-to-camera extrinsic transform.
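The prediction of Equations (11)-(13) amounts to Euler integration of the IMU samples between two camera frames. A minimal sketch is shown below; the gravity convention (world z up, accelerometer reading the gravity reaction when static) and all names are assumptions for illustration.

```python
import numpy as np

def so3_exp(w, dt):
    """Exponential map: rotation vector w*dt -> rotation matrix (Rodrigues)."""
    phi = w * dt
    theta = np.linalg.norm(phi)
    if theta < 1e-10:
        return np.eye(3)
    a = phi / theta
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def preintegrate(accels, gyros, dt, R0, v0, t0, g=np.array([0., 0., -9.81])):
    """Integrate m IMU samples between two camera frames (Eqs. 11-13).
    accels, gyros: (m, 3) accelerometer / gyroscope readings
    R0, v0, t0: rotation, velocity, translation at the last camera frame
    Returns the predicted rotation, velocity and translation."""
    R, v, t = R0.copy(), v0.copy(), t0.copy()
    for a, w in zip(accels, gyros):
        acc_w = R @ a + g               # rotate acceleration into world, add gravity
        t = t + v * dt + 0.5 * acc_w * dt**2
        v = v + acc_w * dt
        R = R @ so3_exp(w, dt)
    return R, v, t
```

For a static sensor, the measured acceleration cancels gravity exactly, so the predicted pose and velocity stay unchanged.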
2.2.2 Kalman Update and Iteration
In the traditional extended Kalman update step, the measurement residual is modeled as:

$\mathbf{y}_k = \mathbf{z}_k - h(\mathbf{x}_k^{-}, \mathbf{0})$    (15)

Here, $\mathbf{0}$ means we directly use $h$ to calculate the residual without adding any Gaussian noise. The updated state is formulated as:

$\mathbf{x}_k^{+} = \mathbf{x}_k^{-} + \mathbf{K}_k\,\mathbf{y}_k$    (16)

where $\mathbf{K}_k$ is the Kalman gain. In our method, we define the residual through the photometric and geometric errors of the patches (Equation 5), so the residual can be formulated as:

$\mathbf{y}_k = \mathbf{0} - \mathbf{e}(\mathbf{x}_k^{-})$    (17)

where $\mathbf{e}(\cdot)$ stacks the photometric and geometric errors of all patches.
Notice that the deformations of the SE patches, which are used in calculating the residual by Equation 17, are heavily affected by the camera poses. So after we obtain an updated camera pose via Equation 16, we use the newly updated pose to recompute the deformations of the SE patches and refine the camera pose with Equation 16 again. In this manner, we can estimate a more accurate state. More specifically, we use $l$ to denote the iteration index, and thus we have:
$\mathbf{y}_{k,l} = \mathbf{0} - \mathbf{e}(\mathbf{x}_{k,l})$    (18)

and the Kalman gain with respect to each iteration is:

$\mathbf{K}_{k,l} = \mathbf{P}_k^{-}\,\mathbf{H}_{k,l}^{\top}\,\mathbf{S}_{k,l}^{-1}$    (19)

$\mathbf{S}_{k,l} = \mathbf{H}_{k,l}\,\mathbf{P}_k^{-}\,\mathbf{H}_{k,l}^{\top} + \mathbf{V}$    (20)

As defined at the beginning of Section 2.2, $\mathbf{V}$ is the covariance matrix of the noise $\mathbf{v}$. The Jacobian updated in every iteration is formulated as:

$\mathbf{H}_{k,l} = \partial \mathbf{e}/\partial \mathbf{x}\,\big|_{\mathbf{x}_{k,l}}$    (21)

Then the updated state of each iteration is calculated as follows:

$\mathbf{x}_{k,l+1} = \mathbf{x}_k^{-} + \mathbf{K}_{k,l}\big(\mathbf{y}_{k,l} + \mathbf{H}_{k,l}(\mathbf{x}_{k,l} - \mathbf{x}_k^{-})\big)$    (22)

Notice that $\mathbf{x}_{k,0}$ is set to be $\mathbf{x}_k^{-}$. Finally, the iteration is terminated when the magnitude of $\mathbf{x}_{k,l+1} - \mathbf{x}_{k,l}$ falls below a certain threshold, and the covariance matrix is only updated once the process has converged after $n$ iterations:

$\mathbf{P}_k^{+} = \big(\mathbf{I} - \mathbf{K}_{k,n}\,\mathbf{H}_{k,n}\big)\,\mathbf{P}_k^{-}$    (23)
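The iterated update loop can be sketched generically as follows. Here `residual_fn` and `jac_fn` stand in for the patch-error evaluation (which would recompute the SE deformations at each iterate) and its Jacobian; both names, and the vector-space state, are simplifying assumptions for illustration.

```python
import numpy as np

def iekf_update(x_prior, P_prior, residual_fn, jac_fn, V_noise,
                max_iters=10, tol=1e-6):
    """Iterated EKF update, a generic sketch of Eqs. (18)-(23).
    x_prior, P_prior: a-priori state and covariance from the prediction step
    residual_fn(x): residual y_l (here, target minus measurement function)
    jac_fn(x): Jacobian H_l of the measurement function w.r.t. the state
    V_noise: measurement-noise covariance
    """
    x = x_prior.copy()
    for _ in range(max_iters):
        y = residual_fn(x)                       # Eq. (18)
        H = jac_fn(x)                            # Eq. (21)
        S = H @ P_prior @ H.T + V_noise          # Eq. (20)
        K = P_prior @ H.T @ np.linalg.inv(S)     # Eq. (19)
        # Eq. (22): relinearize about the current iterate x_l.
        x_new = x_prior + K @ (y + H @ (x - x_prior))
        if np.linalg.norm(x_new - x) < tol:      # termination criterion
            x = x_new
            break
        x = x_new
    # Eq. (23): covariance updated once, after convergence.
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x, P_post
```

On a linear measurement with near-zero noise, the loop converges to the measured value, matching the standard Kalman update in one relinearization.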
2.3 Model Fusion and Patch Update
We use a volumetric truncated signed distance function (TSDF) [12] to incrementally fuse each consecutive depth frame into one 3D geometry model, using the associated camera pose $\mathbf{x}_k^{+}$ from the Kalman update. Details about depth fusion can be found in [13].
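A minimal weighted-average TSDF fusion step, in the spirit of [12], might look like this; the intrinsics, truncation distance and grid layout are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def fuse_depth(tsdf, weights, depth, T_cam, voxel_origin, voxel_size,
               trunc=0.05, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Weighted-average TSDF update for one depth frame (a sketch).
    tsdf, weights: (X, Y, Z) voxel grids, updated in place
    depth: (H, W) depth image; T_cam: 4x4 world-to-camera pose
    """
    H, W = depth.shape
    X, Y, Z = tsdf.shape
    for idx in np.ndindex(X, Y, Z):
        # Voxel center in world coordinates, then in camera coordinates.
        pw = voxel_origin + (np.array(idx) + 0.5) * voxel_size
        pc = (T_cam @ np.append(pw, 1.0))[:3]
        if pc[2] <= 0:
            continue
        u = int(round(fx * pc[0] / pc[2] + cx))
        v = int(round(fy * pc[1] / pc[2] + cy))
        if not (0 <= u < W and 0 <= v < H) or depth[v, u] <= 0:
            continue
        # Signed distance along the ray; skip voxels far behind the surface.
        sdf = depth[v, u] - pc[2]
        if sdf < -trunc:
            continue
        d = min(1.0, sdf / trunc)                 # truncate to [-1, 1]
        w_old = weights[idx]
        tsdf[idx] = (tsdf[idx] * w_old + d) / (w_old + 1.0)
        weights[idx] = w_old + 1.0
```

Real systems vectorize this loop on the GPU and store color alongside the signed distance; the per-voxel running average is the core idea.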
After the fusion, we update the patch features for subsequent tracking. We discard bad features based on their average pixel intensity error and re-extract square patch features for those whose shapes have become non-square due to the SE effect. Then, we add new features with distinct intensity gradients and sufficient depth information, evaluated by the FAST corner detector [14] and the number of pixels with available depth, respectively. Finally, the patch intensity information is updated from the current color image, and the depth information is acquired from the 3D geometry model, which has better quality than the current depth image.
3 Experiments
We first demonstrate the effectiveness of our geometry-aware feature tracking method, which evaluates the SE effect and deforms patches for accurate feature tracking in sequences with fast camera motion. Then, we evaluate the benefit of inertial information by comparing our system with and without the IMU. Finally, our 3D reconstruction method is compared against state-of-the-art systems on datasets with fast sensor motion.
Table 1. Average intensity error (AIE) of patches for the traditional direct method (DM) and our method.

Type  Dataset              AIE (DM)  AIE (Ours)
slow  TUM_freiburg1_desk   13.3756   9.53
slow  ICL_NUIM_lr_kt2       4.8981   4.0825
slow  Dorm_slow             8.1312   7.8934
fast  ICL Fast Motion      17.219    7.8328
fast  Dorm_fast            13.9011   7.9325
3.1 Evaluation
Feature tracking. We compare our feature tracking method against the traditional direct method, which does not take the SE effect into consideration. For a fair comparison, we use the same patch size in both methods and extract no more than 100 patches in each frame. Fig. 4 shows the tracking results of a patch feature in two consecutive frames. The tracking result of the traditional method is severely influenced by the SE effect and yields a high intensity error, while our method deforms the patch and eliminates the influence of the SE effect, achieving a lower intensity error.
We compare on several datasets, including the ICL datasets [15, 16], the TUM datasets [17] and our own datasets gathered with a handheld sensor. The average intensity error (AIE) of patches is listed in Table 1. All datasets are divided into slow and fast depending on the quality of the recorded images. More specifically, as there is no explicit criterion for dividing camera speed, we empirically define, based on the common characteristics of most public datasets, motion that creates no image motion blur as slow camera motion and motion that creates severe image blur as fast camera motion. From the table, we find that our method achieves a lower AIE on all datasets, especially those with fast camera motion.
IMU evaluation. To verify whether the integration of the IMU helps to reconstruct the scene geometry during fast camera motion, we compare the results with and without the IMU on two datasets with slow and fast camera motion, respectively. As shown in Fig. 5, on the dataset with slow camera motion, the system without the IMU works on par with the complete system, while it fails to reconstruct the model under fast camera motion. Fig. 6 shows the details of the camera motion in the two datasets. From the figure, we find that the fast dataset contains subsequences with large linear and angular camera velocities, which cause the system without the IMU to fail to track camera poses. Notice that the other fast datasets used in our experiments also contain such subsequences.
3.2 Comparison
We compare our 3D reconstruction system with InfiniTAM [1], a typical voxel-based scene reconstruction method; BundleFusion [5], which proposes an efficient global pose optimization algorithm; and the surfel-based ElasticFusion [6], which contains loop closure and performs model refinement through non-rigid surface deformations.
The results on the sequence Dorm_fast, which reconstructs the entire scene, are exhibited in Fig. 7. As BundleFusion fails once the camera speeds up and subsequently restarts when the camera slows down, we only show its reconstruction process in our supplementary video https://www.youtube.com/watch?v=Jy3SGqWuhp8. From Fig. 7 we find that InfiniTAM cannot maintain the consistency of the reconstructed geometry, mainly due to inaccurate camera pose estimation and large accumulated error. Meanwhile, the loop closure of ElasticFusion, which aims to eliminate accumulated error, consistently fails under fast camera motion, leading to the failed reconstruction of the parts shown with red and blue bounding boxes. In contrast, our system reconstructs a good geometry of the scene even without loop closure.
We encourage the reader to watch our video for a better visualization of the comparison results.
4 Conclusion and Future Work
We present a real-time system for indoor scene reconstruction that tightly couples RGB-D-inertial information in an extended Kalman filter. The key feature of our method is that it can estimate camera poses and reconstruct the 3D scene model under fast camera motion. In addition, we explore the SE effect and propose a geometry-aware patch deformation method to eliminate its influence during feature tracking. However, our system does not yet achieve loop closure under fast camera motion: the degraded image information caused by fast camera motion, such as motion blur, makes the loop detection (feature association) of loop closure methods difficult. In future work, we wish to address loop closure under fast camera motion.
References
 [1] Olaf Kähler, Victor Adrian Prisacariu, Carl Yuheng Ren, Xin Sun, Philip Torr, and David Murray, “Very high frame rate volumetric integration of depth images on mobile devices,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 11, pp. 1241–1250, 2015.
 [2] Szymon Rusinkiewicz and Marc Levoy, “Efficient variants of the ICP algorithm,” in Third International Conference on 3-D Digital Imaging and Modeling. IEEE, 2001, pp. 145–152.
 [3] Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, and Davide Scaramuzza, “SVO: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
 [4] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [5] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt, “BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 76a, 2017.
 [6] Thomas Whelan, Renato F Salas-Moreno, Ben Glocker, Andrew J Davison, and Stefan Leutenegger, “ElasticFusion: Real-time dense SLAM and light source estimation,” The International Journal of Robotics Research, vol. 35, no. 14, pp. 1697–1716, 2016.
 [7] Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu, “A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors,” IEEE Signal Processing Letters, vol. 21, no. 5, pp. 573–576, 2014.
 [8] Tristan Laidlow, Michael Bloesch, Wenbin Li, and Stefan Leutenegger, “Dense RGB-D-inertial SLAM with map deformations,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6741–6748.
 [9] Michael Bloesch, Michael Burri, Sammy Omari, Marco Hutter, and Roland Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
 [10] Anastasios I Mourikis and Stergios I Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3565–3572.
 [11] Christian Forster, Luca Carlone, Frank Dellaert, and Davide Scaramuzza, “On-manifold preintegration for real-time visual-inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
 [12] Brian Curless and Marc Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 303–312.
 [13] Hao Zhang and Feng Xu, “MixedFusion: Real-time reconstruction of an indoor scene with dynamic objects,” IEEE Transactions on Visualization and Computer Graphics, 2017.

 [14] Edward Rosten and Tom Drummond, “Machine learning for high-speed corner detection,” in European Conference on Computer Vision. Springer, 2006, pp. 430–443.
 [15] Ankur Handa, Richard A Newcombe, Adrien Angeli, and Andrew J Davison, “Real-time camera tracking: When is high frame-rate best?,” in European Conference on Computer Vision. Springer, 2012, pp. 222–235.
 [16] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison, “A benchmark for RGBD visual odometry, 3D reconstruction and SLAM,” in IEEE Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May 2014.
 [17] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgbd slam systems,” in Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.