
Real-time Indoor Scene Reconstruction with RGBD and Inertia Input

by   Zunjie Zhu, et al.

Camera motion estimation is a key technique for 3D scene reconstruction and simultaneous localization and mapping (SLAM). To make it feasible, previous works usually assume slow camera motion, which limits their usage in many real cases. We propose an end-to-end 3D reconstruction system which combines color, depth and inertial measurements to achieve robust reconstruction under fast sensor motion. Our framework extends the Kalman filter to fuse the three kinds of information and involves an iterative method to jointly optimize feature correspondences, camera poses and scene geometry. We also propose a novel geometry-aware patch deformation technique to adapt the feature appearance in the image domain, leading to more accurate feature matching under fast camera motion. Experiments show that our patch deformation method improves the accuracy of feature tracking, and that our 3D reconstruction outperforms state-of-the-art solutions under fast camera motions.





1 Introduction

With the rapid development of capture and computation devices, such as depth sensors and GPUs, real-time 3D reconstruction has advanced rapidly. In recent years, many works have focused on indoor scene reconstruction. For example, InfiniTAM[1] only uses depth information to reconstruct 3D models, and estimates camera poses by an iterative closest point (ICP) algorithm[2]. However, depth-only tracking is extremely brittle in situations such as geometry-less scenes, bright windows and depth sensor noise, and cannot eliminate accumulated errors. Monocular RGB-based methods, including direct methods[3] and feature point methods[4], have made breakthrough progress on drift-free camera tracking. However, these approaches cannot reconstruct detailed and accurate 3D models. Furthermore, BundleFusion[5] and ElasticFusion[6] use both color and depth information to estimate camera motion and generate 3D models based on implicit truncated signed distance fields (TSDFs) and surfel representations, respectively.

Although these works exhibit reasonable results[7], they still rely on strong assumptions, such as a static scene without dynamic objects, sufficient texture and geometric information, slow camera motion and invariant illumination. However, these assumptions cannot be satisfied in many applications.

In this paper, we go a step further by handling fast camera motions. For both color and depth, fast camera motion leads to large inter-frame displacement, which makes it difficult to perform image feature matching (as images may be blurred and feature appearances may vary a lot) and ICP-based depth alignment. Similar to Laidlow et al.[8], we address the issue by introducing IMU information gathered by an accelerometer and a gyroscope. By further combining it with color and depth information, robust camera pose estimation and geometry fusion of an indoor scene are jointly achieved. The main contributions of our work are as follows:

(1) An RGB-D-inertial 3D reconstruction system based on an extended Kalman filter framework, which tightly combines the three kinds of information and jointly achieves camera pose estimation and patch deformation in the Kalman update step.

(2) A geometry-aware feature tracking method for handling fast camera motion, which utilizes patch features to cope with blurry images and considers the deformation of patches when building feature matches between images with very different perspectives.

(3) Through experiments on public datasets (including both synthetic and real data) and our own data acquired with an Intel RealSense ZR300, we show that our approach outperforms state-of-the-art reconstruction systems under fast camera motions.

Figure 1: Overview of our pipeline. The red, green and blue arrows represent the input acquired from the current frame, iterative operations, and the patches from the last frame, respectively.

2 Method

The pipeline of our system is illustrated in Fig. 1. We introduce our method in four parts: geometry-aware feature tracking, which explores the SE effect and performs patch deformation; the filter framework, which explains the Kalman prediction and update steps; model fusion; and patch update.

2.1 Geometry-aware Feature Tracking

Point-based feature tracking methods[4] extract an insufficient number of features when images are blurred or poorly textured. Thus, patch-based methods, which consider larger image regions, have been proposed[3] to track features in these situations. However, large patches may contain objects at different depth levels, which causes appearance changes in consecutive frames, especially when the camera motion is fast, leading to inaccurate feature tracking. To address this problem, we combine color and depth information to back project 2D patches into 3D and re-project them to the camera of the next frame using an initial camera motion estimate. The projection lets us deform the original patches to model the appearance changes, so that patch tracking can be achieved easily and accurately with the deformed patches.
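The back-projection and re-projection step can be sketched in a few lines of Python/NumPy. This is a minimal illustration under a pinhole camera model, not the paper's implementation; the intrinsic values and function names are our own assumptions.

```python
import numpy as np

# Hypothetical pinhole intrinsics (assumed values for illustration).
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5

def back_project(u, v, depth):
    """pi^{-1}: lift pixel (u, v) with its depth to a 3D point in camera coords."""
    return np.array([(u - CX) / FX * depth, (v - CY) / FY * depth, depth])

def project(p):
    """pi: project a 3D point in camera coords to the 2D pixel domain."""
    return np.array([p[0] / p[2] * FX + CX, p[1] / p[2] * FY + CY])

def reproject_patch(pixels, depths, T_last, T_cur):
    """Warp patch pixels from the last frame into the current frame.

    T_last and T_cur are 4x4 camera-to-world poses; T_cur is only an
    initial guess (e.g. from the motion prediction) and is refined later.
    """
    T = np.linalg.inv(T_cur) @ T_last  # last camera -> current camera
    warped = []
    for (u, v), d in zip(pixels, depths):
        p = np.append(back_project(u, v, d), 1.0)  # homogeneous coordinates
        warped.append(project((T @ p)[:3]))
    return np.array(warped)
```

With identical poses a pixel maps back onto itself, which is a convenient sanity check for the warp.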

2.1.1 SE effect and Patch Deformation

Figure 2: This figure shows the patch SE effects caused by the camera motion and the geometrical shape of patches.

When the camera moves, a feature patch is seen from different perspectives in different frames, and thus the shape and position of the feature patch in image coordinates vary from frame to frame. To account for the patch deformation, unlike Bloesch et al.[9], which only considers the 2D planar information of a patch, we use the 3D geometry of a patch to determine its 2D shape deformation between images recorded under fast camera motion.

Depending on the geometry of a patch and the irregularity of the camera motion, patches may deform differently in consecutive frames. Fig. 2 shows three representative cases of patch deformation:

Case 1. If there is no significant difference among the pixel depths in a patch, then no matter how aggressive the camera motion is, the general shape of the patch remains unchanged in two consecutive frames.

Case 2. When the camera moves slowly, the 2D shape of a patch still remains unchanged even though there are large depth variations within the patch.

Case 3. Different from case 2, if the camera moves aggressively, the intensity distribution and the shape of the patch will change. If the camera moves from one viewpoint to the other, the yellow region becomes occluded and the patch shrinks in the current frame. In addition, the black region, which is occluded at the first viewpoint, becomes visible once the camera moves to the second, and thus the patch shape extends in the current frame. These phenomena are called the shrink effect and the extend effect, and we refer to them jointly as the SE effect.

We design a unified deformation method to handle the SE effect. The details of the patch deformation process are illustrated in Fig. 3 and formalized as follows:

Each pixel $i$ in a patch extracted from the last frame $k{-}1$ is defined as $p_i = (u_i, c_i, d_i, n_i)$, where $u_i$ denotes the image coordinates of pixel $i$ in the patch, and $c_i$, $d_i$ and $n_i$ denote the intensity, depth and 3D normal of pixel $i$. $d_i$ and $n_i$ are obtained from the depth image, and since we have 3D information, we call our patches 3D patch features. We first back project each patch into the 3D world coordinate system:

$$X_i = T_{k-1}\,\pi^{-1}(u_i, d_i) \tag{1}$$

Here, $\pi$ projects a point from the 3D camera coordinate system to the 2D pixel domain, and $\pi^{-1}$ represents the inverse operation. $T_{k-1}$ is the camera pose of frame $k{-}1$, which transforms a 3D point in the camera coordinate system of frame $k{-}1$ to the world coordinate system. Thus the back-projected pixel is in the world coordinate system but is indexed from a pixel in frame $k{-}1$, so we write it as $P_i = (X_i, c_i)$, where the intensity information $c_i$ does not change in the projection and $X_i$ indicates the 3D coordinate. Then we project it to the pixel coordinate system of the current frame $k$:

$$\hat{u}_i = \pi\!\left(T_k^{-1} X_i\right) \tag{2}$$

Here, the hat on $\hat{u}_i$ means it is projected to this frame, not original from this frame. Note that $T_k$ is unknown and affects the 2D position of the projection.

Figure 3: Deformation process of SE patches.

After projection, if two projected pixels happen to fall on the same pixel coordinate, we consider that the shrink effect has occurred in this patch. This usually happens when a region close to the camera occludes a more distant region. Therefore, we remove the pixels corresponding to the distant region. Then we evaluate whether the extend effect happens. We set the shape of the projected patch to be the bounding box of all projected pixels; the extend effect is detected if the height (or width) of the projected patch is greater than that of the original one. No matter which effect happens, we use the projected patch as the deformed patch in the following feature tracking steps.
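The shrink/extend test can be sketched as follows. This is an illustrative Python sketch under assumed data layouts (pixel coordinates rounded to the integer grid, occlusion resolved by a nearest-depth rule), not the paper's exact procedure.

```python
import numpy as np

def deform_patch(proj_uv, proj_depth, orig_h, orig_w):
    """Detect the SE effect on a projected patch (sketch of Sec. 2.1.1).

    proj_uv: (N, 2) projected pixel coordinates; proj_depth: (N,) depths in
    the current camera. Returns the indices of surviving pixels plus flags
    for the shrink and extend effects.
    """
    proj_uv = np.asarray(proj_uv, dtype=float)
    rounded = np.round(proj_uv).astype(int)
    keep = {}
    for i, uv in enumerate(map(tuple, rounded)):
        # Shrink effect: two pixels land on the same coordinate;
        # keep the closer one, the distant region is occluded.
        if uv not in keep or proj_depth[i] < proj_depth[keep[uv]]:
            keep[uv] = i
    idx = sorted(keep.values())
    shrink = len(idx) < len(proj_uv)
    # Extend effect: the bounding box of the projected pixels
    # exceeds the original patch size.
    kept = rounded[idx]
    h = np.ptp(kept[:, 1]) + 1
    w = np.ptp(kept[:, 0]) + 1
    extend = bool(h > orig_h or w > orig_w)
    return idx, shrink, extend
```

Both effects can occur in the same patch, as in the example below where two pixels collide while a third stretches the bounding box.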

2.1.2 Objective

In feature tracking, patches affected by the SE effect are replaced with the corresponding projected patches. We then track the projected patch features using both intensity and depth information.

Photometric error for each projected patch is computed as follows. We first extract a patch $\tilde{p}$ of the same size at the projected location in the current image, and then calculate the intensity difference between the extracted patch and the projected patch. The photometric error can be formalized as:

$$E_c = \frac{1}{m} \sum_{i=1}^{m} \left( c(\tilde{p}_i) - c(\hat{p}_i) \right)^2 \tag{3}$$

where $m$ denotes the number of pixels in the patch and $c(\cdot)$ indicates the corresponding intensity information of a patch. Then we compute the point-to-plane geometry error as:

$$E_g = \frac{1}{m} \sum_{i=1}^{m} \left( n_i^{\top} \left( T_k\,\pi^{-1}(\hat{u}_i, \tilde{d}_i) - X_i \right) \right)^2 \tag{4}$$

where $\tilde{d}_i$ is the depth observed at $\hat{u}_i$ in the current frame. Given $E_c$ and $E_g$, the cost function for patch tracking is formulated as:

$$E = \sum_{j=1}^{N} \left( E_c^j + E_g^j \right) \tag{5}$$

where $j$ denotes the patch index and $N$ indicates the number of patches.
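To make the cost concrete, here is a minimal Python sketch of the patch-tracking energy of Equation 5. The dictionary layout, field names and the use of precomputed per-pixel 3D points are our own assumptions for illustration.

```python
import numpy as np

def patch_cost(patches):
    """Sum of photometric and point-to-plane errors over all patches (Eq. 5).

    Each patch is a dict with 'c_proj'/'c_cur' (intensities of the deformed
    patch and of the patch extracted at the projected location) and
    'normals', 'p_proj', 'p_cur' (per-pixel normals and 3D points) for the
    point-to-plane term.
    """
    total = 0.0
    for P in patches:
        m = len(P["c_proj"])
        # Photometric error: mean squared intensity difference.
        e_c = np.sum((np.asarray(P["c_cur"]) - np.asarray(P["c_proj"])) ** 2) / m
        # Geometric error: mean squared point-to-plane distance.
        diff = np.asarray(P["p_cur"]) - np.asarray(P["p_proj"])
        e_g = np.sum(np.einsum("ij,ij->i", np.asarray(P["normals"]), diff) ** 2) / m
        total += e_c + e_g
    return total
```

A patch that matches perfectly in both intensity and geometry contributes zero, which is the condition the filter's observation model relies on.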

2.2 Filter Framework

Our EKF framework aims to tightly combine the color, depth and inertial measurements. Specifically, we model the camera pose of each frame as the state of the EKF and solve for it in the filter. The observation of the EKF includes both the color and depth images, and the relationship between the state and the observation is measured by the energy defined in Equation 5: if a state fits an observation exactly, the energy is zero. The inertial information, on the other hand, is used in the Kalman prediction step, where it serves to build the motion prediction model.

We follow the traditional Kalman filter to define the variables. A nonlinear discrete-time system with state $x$, observation $z$, process noise $w$ and update noise $v$ in the $k$th frame can be written as

$$x_k = f(x_{k-1}, w_{k-1}), \qquad z_k = h(x_k, v_k) \tag{6}$$
In our framework, the state of the filter is composed of the following elements $x = (R, t)$, with a camera rotation matrix $R$ and a camera translation vector $t$ relative to the world coordinate system, which is assigned to be the camera coordinate system of the first frame. In the following, the superscript '$+$' denotes the a-posteriori estimate of a variable calculated from the Kalman update step and '$-$' denotes the a-priori estimate from the Kalman prediction step.

2.2.1 Kalman Prediction and State Propagation

Given an a-posteriori estimate $\hat{x}_{k-1}^{+}$ with covariance $P_{k-1}^{+}$, the prediction step of the EKF yields the a-priori estimate at the next frame:

$$\hat{x}_k^{-} = f(\hat{x}_{k-1}^{+}, 0) \tag{7}$$

$$P_k^{-} = F_{k-1} P_{k-1}^{+} F_{k-1}^{\top} + G_{k-1} Q_{k-1} G_{k-1}^{\top} \tag{8}$$

with the Jacobians:

$$F_{k-1} = \left.\frac{\partial f}{\partial x}\right|_{(\hat{x}_{k-1}^{+},\,0)} \tag{9}$$

$$G_{k-1} = \left.\frac{\partial f}{\partial w}\right|_{(\hat{x}_{k-1}^{+},\,0)} \tag{10}$$

where $Q_{k-1}$ is the covariance matrix of the process noise $w$.
The key in the Kalman prediction step is to define the function $f$. In our EKF framework, the inertial measurements are employed in its definition. Following [10], we get the actual sensor acceleration $a$ and angular velocity $\omega$ from the inertial measurements. We assume that the IMU is synchronized with the camera and acquires measurements with a time interval $\Delta t$ which is much smaller than that of the camera. Hence, we denote by $n$ the number of inertial measurements acquired between two consecutive camera frames, and then merge them together by the pre-integration method[11] to predict the camera rotation and translation between two consecutive camera frames:

$$R_k = R_{k-1} \prod_{j=1}^{n} \exp(\omega_j \Delta t) \tag{11}$$

$$v_k = v_{k-1} + n \Delta t\, g + R_{k-1} \sum_{j=1}^{n} \Delta R_j\, a_j\, \Delta t \tag{12}$$

$$t_k = t_{k-1} + n \Delta t\, v_{k-1} + \frac{1}{2} (n \Delta t)^2 g + R_{k-1} \sum_{j=1}^{n} \left( \Delta v_j\, \Delta t + \frac{1}{2} \Delta R_j\, a_j\, \Delta t^2 \right) \tag{13}$$

In the above three equations, the subscript '$j$' denotes the corresponding variable at the $j$th IMU input between consecutive camera frames, and $\Delta R_j$ and $\Delta v_j$ are the rotation and velocity incrementally integrated from the first of these IMU inputs. Besides, $v$ is the accumulated IMU linear velocity from the last camera frame to the current camera frame, and $T_{ic}$ is the extrinsic matrix from the IMU coordinate system to the camera coordinate system, which maps the integrated IMU motion into the camera frame. $g$ is the gravity acceleration, and $\exp(\cdot)$ denotes the exponential map from the Lie algebra to the Lie group. Details about this prediction step can be found in [10]. Finally, the states predicted in the current frame can be formulated as:

$$\hat{x}_k^{-} = (R_k, t_k) \tag{14}$$
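The prediction step above can be sketched with a simple Euler integration of the IMU samples. This Python sketch is a simplification of full pre-integration: it ignores IMU biases and the IMU-to-camera extrinsic, both of which the full system accounts for.

```python
import numpy as np

def so3_exp(phi):
    """Exponential map from so(3) to SO(3) (Rodrigues' formula)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    a = phi / theta
    A = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * A + (1 - np.cos(theta)) * (A @ A)

def predict_pose(R, t, v, imu, dt, g=np.array([0.0, 0.0, -9.81])):
    """Integrate n IMU samples (a_j, w_j) between two camera frames.

    R, t, v are the rotation, translation and linear velocity at the last
    camera frame; imu is a list of (acceleration, angular velocity) pairs.
    """
    for a, w in imu:
        t = t + v * dt + 0.5 * (R @ a + g) * dt * dt  # position update
        v = v + (R @ a + g) * dt                      # velocity update
        R = R @ so3_exp(w * dt)                       # rotation update
    return R, t, v
```

A stationary sensor (acceleration exactly canceling gravity, zero angular rate) predicts an unchanged pose, while a pure rotation sample rotates the frame as expected.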
2.2.2 Kalman Update and Iteration

In the traditional extended Kalman update step, the measurement residual is modeled as:

$$r_k = z_k - h(\hat{x}_k^{-}, 0) \tag{15}$$

Here, $0$ means we directly use $h$ to calculate the residual without adding any Gaussian noise. The updated state is formulated as:

$$\hat{x}_k^{+} = \hat{x}_k^{-} + K_k r_k \tag{16}$$

where $K_k$ is the Kalman gain. In our method, we define the residual by the photometric and geometric errors of the patches (Equation 5), thus the residual can be formulated as:

$$r_k = -\left[ E_c^1, E_g^1, \ldots, E_c^N, E_g^N \right]^{\top} \tag{17}$$

where the errors of all $N$ patches are stacked into a vector; a state that fits the observation exactly yields $r_k = 0$.
Notice that the deformations of the SE patches, which are used in calculating the residual by Equation 17, are heavily affected by the camera poses. So after we obtain an updated camera pose by Equation 16, we use the newly updated camera pose to iteratively recalculate the deformations of the SE patches and refine the camera pose by Equation 16 again. In this manner, we can estimate a more accurate state. To be more specific, we use the superscript $(i)$ to denote the iteration, and thus we have:

$$r_k^{(i)} = z_k - h(\hat{x}_k^{(i)}, 0) \tag{18}$$
and the Kalman gain with respect to each iteration is:

$$K_k^{(i)} = P_k^{-} H_k^{(i)\top} \left( H_k^{(i)} P_k^{-} H_k^{(i)\top} + V_k \right)^{-1} \tag{19}$$
As defined at the beginning of Section 2.2, $V_k$ is the covariance matrix of the update noise $v$. The Jacobians updated in every iteration are formulated as:

$$H_k^{(i)} = \left.\frac{\partial h}{\partial x}\right|_{(\hat{x}_k^{(i)},\,0)} \tag{20}$$
Then the updated state of each iteration is calculated as follows:

$$\hat{x}_k^{(i+1)} = \hat{x}_k^{-} + K_k^{(i)} \left( r_k^{(i)} + H_k^{(i)} \left( \hat{x}_k^{(i)} - \hat{x}_k^{-} \right) \right) \tag{21}$$
Notice that $\hat{x}_k^{(0)}$ is set to be $\hat{x}_k^{-}$. Finally, the iteration is terminated when the norm of the state increment $\hat{x}_k^{(i+1)} - \hat{x}_k^{(i)}$ is below a certain threshold, and the covariance matrix is only updated once the process has converged after $l$ iterations:

$$P_k^{+} = \left( I - K_k^{(l)} H_k^{(l)} \right) P_k^{-} \tag{22}$$
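The iterated Kalman update described in this subsection can be sketched generically as follows. In this Python sketch, `residual_fn` and `jacobian_fn` stand in for the patch-error evaluation (with re-deformation at each iterate) and its linearization; the loop structure follows a standard iterated EKF, with the covariance updated only once at convergence.

```python
import numpy as np

def iekf_update(x_prior, P_prior, residual_fn, jacobian_fn, V,
                max_iter=10, tol=1e-6):
    """Iterated EKF update (generic sketch of Sec. 2.2.2).

    residual_fn(x) returns the stacked residuals at state x, jacobian_fn(x)
    the observation Jacobian H at x, and V the measurement noise covariance.
    """
    x = x_prior.copy()
    for _ in range(max_iter):
        r = residual_fn(x)
        H = jacobian_fn(x)
        # Kalman gain recomputed at the current linearization point.
        K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + V)
        x_new = x_prior + K @ (r + H @ (x - x_prior))
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    # Covariance updated once, with the converged gain and Jacobian.
    P = (np.eye(len(x)) - K @ H) @ P_prior
    return x, P
```

For a linear observation model the loop converges in one step to the ordinary Kalman update, which makes the sketch easy to verify.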
2.3 Model Fusion and Patch Update

We use a volumetric truncated signed distance function (TSDF)[12] to incrementally fuse each consecutive depth frame into one 3D geometry model, with the associated camera pose from the Kalman update. Details about depth fusion can be found in [13].
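A per-voxel TSDF update along one camera ray can be sketched as follows. This Python sketch follows the classic running-average scheme of [12]; the truncation distance and unit observation weights are illustrative assumptions.

```python
import numpy as np

def fuse_depth(tsdf, weights, surface_depth, voxel_depths, trunc=0.05):
    """Running-average TSDF fusion of one depth observation (sketch of Sec. 2.3).

    voxel_depths are the depths of voxels along a camera ray and
    surface_depth is the depth measured on that ray; the signed distance
    is truncated to [-trunc, trunc] and averaged into the grid.
    """
    sdf = np.clip(surface_depth - np.asarray(voxel_depths), -trunc, trunc)
    new_weights = weights + 1.0                       # unit weight per frame
    fused = (np.asarray(tsdf) * weights + sdf) / new_weights
    return fused, new_weights
```

Voxels just in front of the surface receive small positive values and voxels just behind it small negative values, so the zero crossing of the fused field marks the reconstructed surface.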

After the reconstruction, we update the patch features for subsequent tracking. We discard bad features based on average pixel intensity error, and re-extract square patch features for those with non-square shapes caused by the SE effect. Then, we add new features with distinct intensity gradients and sufficient depth information, evaluated by the FAST corner detector[14] and the number of pixels with available depth. Finally, patch intensity information is updated from the current color image, and depth information is acquired from the 3D geometry model, which has better quality than the current depth image.
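The feature bookkeeping described above might look like the following. This is a hedged Python sketch; the field names, thresholds and the fixed feature budget are our own assumptions, and corner detection itself is left to an external detector such as FAST.

```python
def update_patches(patches, candidates, err_thresh=10.0,
                   depth_ratio=0.8, max_patches=100):
    """Cull and replenish patch features after fusion (sketch of Sec. 2.3).

    Patches whose average intensity error exceeds err_thresh are dropped;
    candidate corners are admitted only if enough of their pixels carry
    valid (positive) depth. All thresholds are illustrative.
    """
    kept = [p for p in patches if p["avg_intensity_err"] <= err_thresh]
    for cand in candidates:
        if len(kept) >= max_patches:
            break  # keep the feature budget bounded
        depths = cand["depths"]
        valid = sum(1 for d in depths if d > 0)
        if valid / len(depths) >= depth_ratio:
            kept.append(cand)
    return kept
```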

3 Experiments

We first demonstrate the effectiveness of our geometry-aware feature tracking method, which evaluates the SE effect and deforms patches for accurate feature tracking in sequences with fast camera motion. Then, we evaluate the benefit of inertial information by comparing our system with and without IMU. Finally, our 3D reconstruction method is compared against state-of-the-art systems on datasets with fast sensor motion.

Figure 4: This figure shows the feature tracking results of ours and the direct method.

Type   Dataset               AIE (DM)   AIE (Ours)
slow   TUM_freiburg1_desk    13.3756    9.53
slow   ICL_NUIM_lr_kt2        4.8981    4.0825
slow   Dorm_slow              8.1312    7.8934
fast   ICL Fast Motion       17.219     7.8328
fast   Dorm_fast             13.9011    7.9325

Table 1: Comparison of patch feature tracking (average intensity error; DM = direct method).

3.1 Evaluation

Feature tracking. We compare our feature tracking method against the traditional direct method, which does not take the SE effect into consideration. In order to achieve a fair comparison, we use the same patch size in both methods and extract no more than 100 patches in each frame. Fig. 4 shows the tracking results of a patch feature in two consecutive frames. The tracking result of the traditional method is severely influenced by the SE effect and yields a large intensity error, while our method deforms the patch and eliminates the influence of the SE effect, achieving a lower intensity error.

We compare on several datasets, including the ICL datasets[15, 16], the TUM datasets[17] and our own datasets gathered by a handheld sensor. The average intensity errors (AIE) of patches are listed in Table 1. All datasets are divided into slow and fast depending on the quality of the recorded images. To be specific, as there is no explicit criterion for classifying camera speed, we empirically regard, based on the common characteristics of most public datasets, motion that creates no image motion blur as slow camera motion, and motion that creates severe image blur as fast camera motion. From the table, we find that our method achieves lower AIE on all datasets, especially those with fast camera motion.

IMU evaluation. To verify whether the integration of the IMU helps to reconstruct the scene geometry during fast camera motion, we compare the results with and without IMU on two datasets with slow and fast camera motions, respectively. As shown in Fig. 5, on the dataset with slow camera motion, the system without IMU works on par with the complete system, while it fails to reconstruct the model under fast camera motion. Fig. 6 shows the details of the camera motions in the two datasets. From the figure, we find that the fast dataset contains subsequences with large linear and angular camera velocities, which cause the system without IMU to fail to track camera poses. Notice that the other fast datasets used in our experiments also contain such subsequences.

Figure 5: The reconstruction results of a hotel under slow and fast camera motion.
Figure 6: Camera linear and angular velocity of the two sequences used in IMU evaluation.
Figure 7: Comparison of reconstruction with fast camera motion in (a) InfiniTAM, (b) ElasticFusion and (c) Ours.

3.2 Comparison

We compare our 3D reconstruction system with InfiniTAM[1], a typical voxel-based scene reconstruction method; BundleFusion[5], which proposed an efficient global pose optimization algorithm; and a surfel-based method, ElasticFusion[6], which contains loop closure and refines the model through non-rigid surface deformations.

The results on sequence Dorm_fast, which covers the entire scene, are exhibited in Fig. 7. As BundleFusion fails once the camera speeds up and only restarts when the camera slows down, we show its reconstruction process only in our supplementary video. From Fig. 7 we find that InfiniTAM cannot maintain the consistency of the reconstructed geometry, which is mainly caused by inaccurate camera pose estimation and large accumulated error. Meanwhile, the loop closure of ElasticFusion, which aims to eliminate accumulated error, always fails under fast camera motion, leading to the failed reconstruction of the parts shown with red and blue bounding boxes. In contrast, our system reconstructs good scene geometry even without loop closure.

We encourage the reader to watch our video for a better visualization of comparison results.

4 Conclusion and Future Work

We present a real-time system for indoor scene reconstruction that tightly couples RGB-D-inertial information within an extended Kalman filter. The key feature of our method is that it can estimate camera poses and reconstruct the 3D scene model under fast camera motion. In addition, we explore the SE effect and propose a geometry-aware patch deformation method to eliminate its influence during feature tracking. However, our system has not yet achieved loop closure under fast camera motion: the degraded image information caused by fast motion, such as motion blur, makes the loop detection (feature association) of loop closure methods difficult. In future work, we wish to address loop closure under fast camera motion.


  • [1] Olaf Kähler, Victor Adrian Prisacariu, Carl Yuheng Ren, Xin Sun, Philip Torr, and David Murray, “Very high frame rate volumetric integration of depth images on mobile devices,” IEEE transactions on visualization and computer graphics, vol. 21, no. 11, pp. 1241–1250, 2015.
  • [2] Szymon Rusinkiewicz and Marc Levoy, “Efficient variants of the icp algorithm,” in 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on. IEEE, 2001, pp. 145–152.
  • [3] Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, and Davide Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
  • [4] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [5] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt, “Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 76a, 2017.
  • [6] Thomas Whelan, Renato F Salas-Moreno, Ben Glocker, Andrew J Davison, and Stefan Leutenegger, “Elasticfusion: Real-time dense slam and light source estimation,” The International Journal of Robotics Research, vol. 35, no. 14, pp. 1697–1716, 2016.
  • [7] Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu, “A highly parallel framework for hevc coding unit partitioning tree decision on many-core processors,” IEEE Signal Processing Letters, vol. 21, no. 5, pp. 573–576, 2014.
  • [8] Tristan Laidlow, Michael Bloesch, Wenbin Li, and Stefan Leutenegger, “Dense rgb-d-inertial slam with map deformations,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 6741–6748.
  • [9] Michael Bloesch, Michael Burri, Sammy Omari, Marco Hutter, and Roland Siegwart, “Iterated extended kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
  • [10] Anastasios I Mourikis and Stergios I Roumeliotis, “A multi-state constraint kalman filter for vision-aided inertial navigation,” in Robotics and automation, 2007 IEEE international conference on. IEEE, 2007, pp. 3565–3572.
  • [11] Christian Forster, Luca Carlone, Frank Dellaert, and Davide Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
  • [12] Brian Curless and Marc Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 303–312.
  • [13] Hao Zhang and Feng Xu, “Mixedfusion: Real-time reconstruction of an indoor scene with dynamic objects,” IEEE Transactions on Visualization and Computer Graphics, 2017.
  • [14] Edward Rosten and Tom Drummond, “Machine learning for high-speed corner detection,” in European Conference on Computer Vision. Springer, 2006, pp. 430–443.
  • [15] Ankur Handa, Richard A Newcombe, Adrien Angeli, and Andrew J Davison, “Real-time camera tracking: When is high frame-rate best?,” in European Conference on Computer Vision. Springer, 2012, pp. 222–235.
  • [16] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in IEEE Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May 2014.
  • [17] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.