Code for "NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video", CVPR 2021 oral
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture the local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing the surfaces, resulting in accurate, coherent, and real-time surface reconstruction. The experiments on ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense coherent 3D geometry in real-time.
3D scene reconstruction is one of the central tasks in 3D computer vision with many applications. In augmented reality (AR) for example, to enable realistic and immersive interactions between AR effects and the surrounding physical scene, 3D reconstruction needs to be accurate, coherent and performed in real-time. While camera motion can be tracked accurately with state-of-the-art visual-inertial SLAM systems [3, 35, 1], real-time image-based dense reconstruction remains a challenging problem due to low reconstruction quality and high computation demands.
Most image-based real-time 3D reconstruction pipelines [38, 52] adopt the depth map fusion approach, which resembles RGB-D reconstruction methods like KinectFusion. Single-view depth maps from each key frame are first estimated with real-time multi-view depth estimation methods like [48, 24, 13, 46]. The estimated depth maps are then filtered with criteria like multi-view consistency and temporal smoothness, and fused into a Truncated Signed Distance Function (TSDF) volume. The reconstructed mesh can be extracted from the fused TSDF volume with the Marching Cubes algorithm. This depth-based pipeline has two major drawbacks. First, since single-view depth maps are estimated individually on each key frame, each depth estimate is made from scratch rather than conditioned on previous estimates, even when the view overlap is substantial. As a result, the scale factor may vary across frames even when the camera ego-motion is correct. Due to depth inconsistencies between different views, the reconstruction result is prone to be either layered or scattered. One example is shown in the red boxes in Fig. 1, where the depth-based method struggles to produce coherent depth estimations on the chairs and wall. Second, since key-frame depth maps need to be estimated separately in overlapping local windows, the geometry of the same 3D surface is estimated multiple times in different key frames, causing redundant computation.
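The conventional TSDF fusion step in such depth-based pipelines amounts to a per-voxel weighted running average of truncated depth observations. The function below is a minimal NumPy sketch of that update; the flat array layout and the truncation value are illustrative assumptions, not the exact implementation of any cited system.

```python
import numpy as np

def fuse_tsdf(tsdf, weights, new_sdf, new_weight=1.0, trunc=0.12):
    """KinectFusion-style running-average TSDF fusion.

    tsdf, weights : per-voxel global state (flat arrays)
    new_sdf       : truncated SDF observed for each voxel in the current
                    frame (NaN where the voxel is unobserved)
    """
    sdf = np.clip(new_sdf, -trunc, trunc)       # truncate the observation
    mask = ~np.isnan(sdf)                       # only update observed voxels
    w_old = weights[mask]
    tsdf[mask] = (w_old * tsdf[mask] + new_weight * sdf[mask]) / (w_old + new_weight)
    weights[mask] = w_old + new_weight
    return tsdf, weights
```

This linear averaging treats every observation equally, which is exactly the behavior the learned GRU fusion described later replaces with a data-dependent update.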
In this paper, we propose a novel framework for real-time monocular reconstruction named NeuralRecon that jointly reconstructs and fuses the 3D geometry directly in the volumetric TSDF representation. Given a sequence of monocular images and their corresponding camera poses estimated by a SLAM system, NeuralRecon incrementally reconstructs local geometry in a view-independent 3D volume instead of view-dependent depth maps. Specifically, it unprojects the image features to form a 3D feature volume and then uses sparse convolutions to process the feature volume to output a sparse TSDF volume. With a coarse-to-fine design, the predicted TSDF is gradually refined at each level. By directly reconstructing the implicit surface (TSDF), the network is able to learn the local smoothness and global shape priors of natural 3D surfaces. Different from depth-based methods that predict depth maps for each key frame separately, the surface geometry within a local fragment window is jointly predicted in NeuralRecon, and thus locally coherent geometry estimation can be produced. To make the current-fragment reconstruction globally consistent with the previously reconstructed fragments, a learning-based TSDF fusion module using the Gated Recurrent Unit (GRU) is proposed. The GRU fusion makes the current-fragment reconstruction conditioned on the previously reconstructed global volume, yielding a joint reconstruction and fusion approach. As a result, the reconstructed mesh is dense, accurate and globally coherent in scale. Furthermore, predicting the volumetric representation also removes the redundant computation in depth-based methods, which allows us to use a larger 3D CNN while maintaining real-time performance.
We validate our system on the ScanNet and 7-Scenes datasets. The experimental results show that NeuralRecon outperforms multiple state-of-the-art multi-view depth estimation methods and the volume-based reconstruction method Atlas by a large margin, while achieving real-time performance at 33 key frames per second, substantially faster than Atlas. As shown in the supplementary video, our method is able to reconstruct large-scale 3D scenes from a video stream on a laptop GPU in real-time. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense and coherent 3D scene geometry in real-time.
Multi-view Depth Estimation. The most related line of research is real-time methods
for multi-view depth estimation. Before the age of deep learning, many renowned works in monocular 3D reconstruction [47, 21, 38, 34] achieved good performance with plane-sweeping stereo and depth filters under the assumption of photo-consistency. [46, 51] optimize this line of research towards low power consumption on mobile platforms. Learning-based methods for real-time multi-view depth estimation try to relax the photo-consistency assumption with a data-driven approach. Notably, MVDepthNet and Neural RGB-D use 2D CNNs to process the 2D depth cost volume constructed from multi-view image features. CNMNet further leverages the planar structures in indoor scenes to constrain the surface normals computed from the predicted depth maps, obtaining smooth depth estimates. These learning-based methods use 2D CNNs to process the depth cost volume, maintaining a low computational cost for near real-time performance.
When the input images are high-resolution and offline computation is allowed, multi-view depth estimation is also known as the Multiple View Stereo (MVS) problem. PatchMatch-based methods [56, 37] have achieved impressive accuracy and are still the most popular methods applicable to high-resolution images. Learning-based approaches in MVS have recently dominated several benchmarks [2, 20] in terms of accuracy, but are limited to processing mid-resolution images due to GPU memory constraints. Different from the real-time methods, 3D cost volumes are constructed and 3D CNNs are used to process the cost volume, as proposed in MVSNet. Some recent works [12, 4] improve this pipeline with a coarse-to-fine approach. Similar designs can also be found in many learning-based SLAM systems [45, 57, 42, 44].
All the above-mentioned works adopt single-view depth maps as intermediate representations. SurfaceNet [15, 16] takes a different approach and uses a unified volumetric representation to predict the volume occupancy. Recently, Atlas also proposes a volumetric design and directly predicts TSDF and semantic labels with a 3D CNN. As an offline method, Atlas aggregates the image features of the entire sequence and then predicts the global TSDF volume only once with a decoder module. We further elaborate on the relationship between the proposed method and Atlas in the supplementary material. The proposed method is also related to [5, 18] in terms of using recurrent networks for multi-view feature fusion. However, their recurrent fusion is applied only to global features and their focus is to reconstruct single objects.
3D Surface Reconstruction. After depth maps are estimated and converted to point clouds, the remaining task for 3D reconstruction is to estimate the 3D surface position and produce the reconstructed mesh. In an offline MVS pipeline, Poisson reconstruction and Delaunay triangulation are often used to fulfill this purpose. Introduced by the seminal work KinectFusion, incremental volumetric TSDF fusion has been widely adopted in real-time reconstruction scenarios due to its simplicity and parallelization capability. [32, 10] improve KinectFusion by making it more scalable and robust. RoutedFusion [49, 50] changes the fusion operation from a simple linear addition into a data-dependent process.
Neural Implicit Representations. Recently, neural implicit representations [29, 33, 36, 17, 54, 25] have made significant advances. Our work also learns a neural implicit representation by predicting SDF with a neural network from the encoded image features, similar to PIFu. The key difference is that we use sparse 3D convolution to predict a discrete TSDF volume, instead of querying an MLP network with image features and 3D coordinates.
Given a sequence of monocular images and a camera pose trajectory provided by a SLAM system, the goal is to reconstruct dense 3D scene geometry accurately and in real-time. We denote the global TSDF volume to be reconstructed at the current time step $t$ as $S_t^g$. The system architecture is illustrated in Fig. 2.
To achieve real-time 3D reconstruction that is suitable for interactive applications, the reconstruction process needs to be incremental and the input images should be processed sequentially in local fragments. We seek to find a set of suitable key frames from the incoming image stream as input for the networks. To provide enough motion parallax while keeping multi-view co-visibility for reconstruction, the selected key frames should be neither too close to nor too far from each other. Following, a new incoming frame is selected as a key frame if its relative translation is greater than a threshold and the relative rotation angle is greater than a threshold. A window of consecutive key frames is defined as a local fragment. After key frames are selected, a cubic-shaped fragment bounding volume (FBV) that encloses all the key-frame view frustums is computed with a fixed maximum depth range in each view. Only the region within the FBV is considered during the reconstruction of each fragment.
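The key-frame selection rule above can be sketched as a simple pose test. In this illustrative version, the translation threshold of 0.1 m is a placeholder assumption (the actual value is a hyperparameter of the system), and the rotation angle is recovered from the trace of the relative rotation matrix:

```python
import numpy as np

def is_new_keyframe(R_prev, t_prev, R_cur, t_cur, t_min=0.1, r_min_deg=15.0):
    """Select a frame as a key frame if its motion relative to the last key
    frame exceeds both a translation and a rotation threshold
    (threshold values here are illustrative placeholders)."""
    dt = np.linalg.norm(t_cur - t_prev)
    R_rel = R_prev.T @ R_cur
    # rotation angle from the trace: tr(R) = 1 + 2*cos(angle)
    cos_a = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.degrees(np.arccos(cos_a))
    return dt > t_min and angle > r_min_deg
```

Collecting a fixed number of frames that pass this test yields one local fragment, whose FBV is then computed from the selected view frustums.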
We propose to simultaneously reconstruct the TSDF volume of a local fragment and fuse it with the global TSDF volume with a learning-based approach. The joint reconstruction and fusion is carried out in the local coordinate system. The definition of the local and global coordinate systems as well as the construction of the FBV are illustrated in Fig. 1 of the supplementary material.
Image Feature Volume Construction. The images in the local fragment are first passed through the image backbone to extract multi-level features. Similar to previous works on volumetric reconstruction [18, 15, 30], the extracted features are back-projected along each ray into the 3D feature volume. The image feature volume is obtained by averaging the features from different views according to the visibility weight of each voxel. The visibility weight is defined as the number of views from which a voxel can be observed in the local fragment. A visualization of this unprojection process can be found in Fig. 3 (i).
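The visibility-weighted averaging can be sketched as below. This is a dense NumPy stand-in for the actual sparse implementation, and the array shapes are assumptions made for illustration:

```python
import numpy as np

def average_feature_volume(per_view_feats, per_view_vis):
    """Average back-projected per-view features into one feature volume.

    per_view_feats : (V, N, C) features unprojected to N voxels from V views
                     (zeros where a voxel is not visible in that view)
    per_view_vis   : (V, N) boolean visibility masks
    Returns the (N, C) averaged features and the (N,) visibility weights.
    """
    weight = per_view_vis.sum(axis=0)            # number of observing views
    summed = per_view_feats.sum(axis=0)          # (N, C)
    avg = np.zeros_like(summed)
    seen = weight > 0
    avg[seen] = summed[seen] / weight[seen, None]
    return avg, weight
```

Voxels with zero visibility weight simply keep a zero feature, matching the intuition that unobserved space carries no image evidence.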
Coarse-to-fine TSDF Reconstruction. We adopt a coarse-to-fine approach to gradually refine the predicted TSDF volume at each level. We use 3D sparse convolution to efficiently process the feature volume. The sparse volumetric representation also naturally integrates with the coarse-to-fine design. Specifically, each voxel in the TSDF volume contains two values: the occupancy score $o$ and the SDF value $x$. At each level, both $o$ and $x$ are predicted by an MLP. The occupancy score represents the confidence of a voxel lying within the TSDF truncation distance. A voxel whose occupancy score is lower than the sparsification threshold $\theta$ is defined as void space and is sparsified away. This sparse TSDF representation is visually illustrated in Fig. 3 (iii). After sparsification, the volume is upsampled and concatenated with the image feature volume of the next level as the input for the GRU Fusion module (introduced later) at that level.
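The sparsification step can be sketched as thresholding the occupancy scores and subdividing each surviving voxel for the next, finer level. The 2× octree-style subdivision into eight children is an illustrative assumption about how a coarse level maps to a fine one; the released code may organize its levels differently:

```python
import numpy as np

def sparsify_and_upsample(coords, occ, theta=0.5):
    """Drop voxels whose occupancy score is <= theta (void space), then
    subdivide each surviving voxel into its 8 children at the next,
    2x-finer level (illustrative octree-style scheme)."""
    kept = coords[occ > theta]                               # (K, 3) integer coords
    offsets = np.array([[i, j, k] for i in (0, 1)
                                  for j in (0, 1)
                                  for k in (0, 1)])          # 8 child offsets
    children = (kept[:, None, :] * 2 + offsets[None]).reshape(-1, 3)
    return children
```

Because only near-surface voxels survive each level, the number of active voxels stays roughly proportional to the surface area rather than the scene volume, which is what keeps the sparse 3D convolutions tractable.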
Instead of estimating single-view depth maps for each key frame, NeuralRecon jointly reconstructs the implicit surface within the bounding volume of the local fragment window. This design guides the network to learn the natural surface prior directly from the training data. As a result, the reconstructed surface is locally smooth and coherent in scale. Notably, this design also leads to less redundant computation compared to depth-based methods since each area on the 3D surface is estimated only once during the fragment reconstruction.
GRU Fusion. To make the reconstruction consistent between fragments, we propose to make the current-fragment reconstruction conditioned on the reconstructions of previous fragments. We use a 3D convolutional variant of the Gated Recurrent Unit (GRU) module for this purpose. As illustrated in Fig. 3 (ii), at each level the image feature volume is first passed through 3D sparse convolution layers to extract 3D geometric features $G_t^l$. The local hidden state $H_{t-1}^l$ is extracted from the global hidden state within the fragment bounding volume. The GRU fuses $G_t^l$ with the hidden state $H_{t-1}^l$ and produces the updated hidden state $H_t^l$, which is passed through MLP layers to predict the TSDF volume at this level. $H_t^l$ is also written back to the global hidden state by directly replacing the corresponding voxels. Formally, denoting $z_t$ as the update gate, $r_t$ as the reset gate, $\sigma$ as the sigmoid function and $W_*$ as the weights for sparse convolution, the GRU fuses $G_t^l$ with the hidden state $H_{t-1}^l$ with the following operations:

$z_t = \sigma(\mathrm{SparseConv}([H_{t-1}^l, G_t^l], W_z))$
$r_t = \sigma(\mathrm{SparseConv}([H_{t-1}^l, G_t^l], W_r))$
$\tilde{H}_t^l = \tanh(\mathrm{SparseConv}([r_t \odot H_{t-1}^l, G_t^l], W_h))$
$H_t^l = (1 - z_t) \odot H_{t-1}^l + z_t \odot \tilde{H}_t^l$
Intuitively, in the context of joint reconstruction and fusion of TSDF, the update gate $z_t$ and reset gate $r_t$ in the GRU determine how much information from the previous reconstructions (i.e. the hidden state $H_{t-1}^l$) is fused into the current-fragment geometric features $G_t^l$, as well as how much information from the current fragment is fused into the hidden state $H_t^l$. As a data-driven approach, the GRU serves as a selective attention mechanism that replaces the linear running-average operation in conventional TSDF fusion. By predicting the TSDF after the GRU, the MLP network can leverage the context information accumulated from history fragments to produce consistent surface geometry across local fragments. This is also conceptually analogous to the depth filter in a non-learning-based 3D reconstruction pipeline [38, 34], where the current observation and the temporally-fused depths are fused with a Bayesian filter. The effectiveness of joint reconstruction and fusion is validated in the ablation study.
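The gate arithmetic of this GRU fusion can be illustrated with a minimal dense per-voxel sketch, where plain matrix products stand in for the sparse convolutions of the actual model; the shapes and weight layout are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_fuse(G, H_prev, Wz, Wr, Wh):
    """Per-voxel GRU fusion step (dense stand-in for sparse convolutions).

    G      : (N, C) current-fragment geometric features
    H_prev : (N, C) hidden state carried over from previous fragments
    W*     : (2C, C) gate weights acting on [H_prev, G]
    Returns the fused hidden state (N, C).
    """
    x = np.concatenate([H_prev, G], axis=1)                  # (N, 2C)
    z = sigmoid(x @ Wz)                                      # update gate
    r = sigmoid(x @ Wr)                                      # reset gate
    h_tilde = np.tanh(np.concatenate([r * H_prev, G], axis=1) @ Wh)
    return (1.0 - z) * H_prev + z * h_tilde                  # convex blend
```

When the update gate saturates near zero the previous reconstruction is kept unchanged; near one, the state is overwritten by the candidate computed from the current fragment, which is the selective behavior the linear running average lacks.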
Integration to the Global TSDF Volume. At the last coarse-to-fine level, the fragment TSDF volume $S_t^l$ is predicted and further sparsified. Since the fusion between $S_t^l$ and the global TSDF volume $S_t^g$ has already been performed in GRU Fusion, $S_t^l$ is integrated into $S_t^g$ by directly replacing the corresponding voxels after being transformed into the global coordinate system. At each time step $t$, Marching Cubes is performed on $S_t^g$ to extract the reconstructed mesh.
Supervision. Following previous work, two loss functions are used to supervise the network. The occupancy loss is defined as the binary cross-entropy (BCE) between the predicted and ground-truth occupancy values. The SDF loss is defined as the distance between the predicted and ground-truth SDF values. We log-transform both the predicted and the ground-truth SDF values before applying this loss. The supervision is applied at all coarse-to-fine levels.
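A minimal sketch of the two losses is given below. Only the overall structure (BCE on occupancy plus a distance on log-transformed SDF) is taken from the text; the exact form of the sign-preserving log transform and the restriction of the SDF term to occupied voxels are illustrative assumptions:

```python
import numpy as np

def recon_loss(occ_pred, occ_gt, sdf_pred, sdf_gt, eps=1e-7):
    """Occupancy BCE + L1 distance on log-transformed SDF values
    (SDF term applied only where the ground truth is occupied)."""
    p = np.clip(occ_pred, eps, 1.0 - eps)               # avoid log(0)
    bce = -np.mean(occ_gt * np.log(p) + (1.0 - occ_gt) * np.log(1.0 - p))

    def log_tf(x):
        # sign-preserving log compression of the SDF range (assumed form)
        return np.sign(x) * np.log(np.abs(x) + 1.0)

    mask = occ_gt > 0.5
    l1 = np.mean(np.abs(log_tf(sdf_pred[mask]) - log_tf(sdf_gt[mask]))) if mask.any() else 0.0
    return bce + l1
```

The log transform compresses large SDF magnitudes so that the loss concentrates on voxels near the surface, where accurate signed distances matter most for the extracted mesh.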
Implementation Details. The image backbone is initialized with weights pretrained on ImageNet, and a Feature Pyramid Network is used in the backbone to extract more representative multi-level features. The entire network is trained end-to-end, with all weights except the image backbone randomly initialized. The occupancy score is predicted with a sigmoid layer. The voxel size of the last level and the TSDF truncation distance are fixed; the key-frame rotation threshold is set to 15°. Nearest-neighbor interpolation is used for upsampling between coarse-to-fine levels.
In this section, we conduct a series of experiments to evaluate the reconstruction quality and different design considerations of NeuralRecon.
Datasets. We perform the experiments on two indoor datasets, ScanNet (V2) and 7-Scenes. The ScanNet dataset contains 1613 indoor scenes with ground-truth camera poses, surface reconstructions, and semantic segmentation labels. There are two training/validation splits commonly used in previous works for the ScanNet dataset. We use the same training and validation data as the corresponding baseline methods to make a fair comparison. The 7-Scenes dataset is another challenging RGB-D dataset captured in indoor scenes. Following the baseline method, we use the model trained on ScanNet to perform the validation on 7-Scenes.
Metrics. The 3D reconstruction quality is evaluated using the 3D geometry metrics presented in, as well as standard 2D depth metrics defined in. The definitions of these metrics are detailed in the supplementary material. Among these 3D and 2D metrics, we consider F-score the most suitable metric for measuring 3D reconstruction quality, since it accounts for both the accuracy and the completeness of the reconstruction.
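The F-score referred to above is commonly computed from point-to-point distances between sampled predicted and ground-truth point clouds: precision is the fraction of predicted points within a distance threshold of the ground truth, recall is the converse, and F-score is their harmonic mean. The brute-force sketch below illustrates this; the threshold value is a placeholder, and real evaluations would use a KD-tree rather than a full distance matrix:

```python
import numpy as np

def f_score(pred_pts, gt_pts, thresh=0.05):
    """F-score between two point sets (brute-force distances).

    precision : fraction of predicted points within `thresh` of ground truth
    recall    : fraction of ground-truth points within `thresh` of prediction
    """
    d = np.linalg.norm(pred_pts[:, None] - gt_pts[None], axis=-1)  # (P, G)
    precision = (d.min(axis=1) < thresh).mean()
    recall = (d.min(axis=0) < thresh).mean()
    if precision + recall == 0:
        return 0.0, precision, recall
    return 2 * precision * recall / (precision + recall), precision, recall
```

Because the harmonic mean penalizes whichever of the two is lower, a method cannot score well by being accurate but incomplete, or complete but inaccurate.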
Baselines. We compare our method with baseline methods in three categories: 1) Real-time methods for multi-view depth estimation [48, 13, 24, 26]. Due to efficiency constraints, the depth accuracy achievable by these methods is rather limited. We compare with them to demonstrate the better reconstruction accuracy of NeuralRecon at comparable efficiency. 2) Multiple View Stereo methods [37, 14, 53, 30, 28]. These offline methods have much higher accuracy than real-time methods. These baselines are used to demonstrate that NeuralRecon achieves a reconstruction quality on par with offline methods while running in real-time. 3) Learning-based SLAM methods [45, 42, 44]. These monocular SLAM methods estimate camera poses and perform reconstruction simultaneously, so the scale factor of pose and depth is usually not accurately estimated. For a fair comparison, we use ground-truth camera poses for these methods and apply a scaling factor to the predicted depth maps using ground-truth depth. Among all these baseline methods, GPMVS and Atlas are the most relevant real-time and offline methods, respectively.
Evaluation Protocols. Since our method does not estimate depth maps explicitly, we render the reconstructed mesh to the image plane to obtain depth map estimations. Key frames used for evaluation are sampled from the video sequence with an interval of 10 frames for both depth-based methods and Atlas. Following [30, 26], the methods [53, 48, 14, 13] are fine-tuned on ScanNet. To evaluate depth-based methods [37, 48, 13, 14] in 3D, we use point cloud fusion to obtain the 3D reconstruction, following Atlas. For other depth-based methods, we use the standard TSDF fusion proposed in [31, 7]. For the reasons detailed in the supplementary material, in order to make a fair comparison with Atlas, we also report evaluation results using the double-layered mesh (same as Atlas). The evaluation of 3D geometry on 7-Scenes uses the single-layered mesh. We also evaluate the depth filtering operation with a multi-view consistency check, which is elaborated in the supplementary material.
[Table fragments: one row of the ScanNet depth table for Consistent Depth (single-frame), with columns Abs Rel 0.091, Abs Diff 0.344, Sq Rel 0.461, RMSE 0.266, Comp 0.331, time 2321; and one row of a second depth table (columns Abs Rel, Sq Rel, RMSE, RMSE log, Sc Inv) for Consistent Depth: 0.073, 0.037, 0.217, 0.105, 0.103.]
ScanNet. 2D depth metrics and 3D geometry metrics are used on the ScanNet dataset. The 3D geometry evaluation results are shown in Tab. 1. Our method produces much better performance than recent learning-based methods and achieves slightly better results than COLMAP. We believe that the improvements come from the joint reconstruction and fusion design achieved by the GRU Fusion module. Compared to depth-based methods, NeuralRecon can produce coherent reconstructions both locally and globally. Our method also surpasses the volumetric baseline method Atlas in accuracy, precision, and F-score. The improvements potentially come from the local fragment separation in our method, which can act as a view-selection mechanism that prevents irrelevant image features from being fused into the 3D volume. In terms of completeness and recall, the proposed method performs worse than both depth-based methods and Atlas. Since depth-based methods predict pixel-wise depth maps on each view, the coverage of their predictions is high by nature, but at the cost of accuracy. Being an offline approach, Atlas has the advantage of a global context from the entire sequence before predicting the geometry. As a result, Atlas sometimes achieves even better completeness than the ground truth due to its TSDF completion capability. However, Atlas tends to predict over-smoothed geometries, and the completed regions may be inaccurate. As for 2D depth metrics, NeuralRecon also outperforms previous state-of-the-art methods on almost all of them, as shown in Tab. 2.
7-Scenes. 2D depth metrics and 3D geometry metrics are evaluated on the 7-Scenes dataset. As shown in Tab. 3, our method achieves comparable performance to the state-of-the-art method CNMNet  and outperforms all other methods. We believe that the accuracy of the proposed method can be further improved by leveraging the planar structure information as in CNMNet. Since the model used here is only trained on ScanNet, the results also demonstrate that NeuralRecon can generalize well beyond the domain of the training data.
Efficiency. We also report the average running time of the baselines and our method in Tab. 1. Only the inference time on key frames is counted. A detailed timing analysis for each module of NeuralRecon is presented in Tab. 4. For volumetric methods (Atlas and ours), the running time is obtained by dividing the time of reconstructing the TSDF volume of a local fragment by the number of key frames in the local fragment. Notice that the time for TSDF fusion is not included for depth-based methods. The running time for [44, 28, 24, 26, 45] and NeuralRecon is measured on an NVIDIA RTX 2080Ti GPU. For [48, 14, 37, 13, 30] and the remaining baseline, we use the running times reported in the respective papers.
[Table: columns Method, Abs Rel, Sq Rel, RMSE, Time.]
As shown in Tab. 1, our time cost is 30 ms per key frame, achieving real-time speed at 33 key frames per second and outperforming all previous methods. Specifically, our method runs 10× faster than Atlas and 77× faster than Consistent Depth. Predicting the volumetric representation removes the redundant computation in depth-based methods, which contributes to the fast running speed of our method. Compared to Atlas, incrementally reconstructing geometry in local fragments avoids processing a huge 3D volume, leading to a faster speed. The use of sparse convolution also contributes to the superior efficiency of NeuralRecon.
In this section, we conduct several ablation experiments on the ScanNet dataset to discuss the effectiveness of components in our method.
GRU Fusion. We validate the GRU Fusion design by comparing rows from (i) to (iv) in Tab. 5.
To validate the benefit of feature fusion, we compare row (i) and row (ii) in Tab. 5. Feature fusion with the average operation obtains nearly 5% improvement in precision over conventional linear TSDF fusion. The visualization in Fig. 5 shows that feature fusion with the average operation reconstructs smoother geometry. These results demonstrate that feature fusion can be more effective than TSDF fusion using the same average operation.
Comparing row (ii) and row (iii) in Tab. 5 shows that replacing the average operation with a GRU gives a 4% improvement in recall. The mesh in Fig. 5 (iii) is also more complete than that in Fig. 5 (ii). These results demonstrate that the GRU is more effective at selectively integrating only the consistent information from the current fragment into the hidden state.
The recalls in row (iii) and row (iv) in Tab. 5 show that fusion in the fragment bounding volume can produce much more complete results. Visualization results in Fig. 5 (iii) and (iv) show that, with fusion in the fragment bounding volume, our method produces fewer artifacts on the ground. Fusion in the fragment bounding volume can leverage the context information in boundaries and produce more consistent and complete surface estimation.
[Table 4: columns Img. Enc., Unproj., Sparse Conv., GRU, Total. Table 5: columns #views, Fusion, 3D Geometry Metrics.]
Number of views. We vary the fragment length over 5, 7, 9 and 11 views. As shown in row (v) in Tab. 5, the F-score improves by over 2% when 9 views are used as a fragment. As shown in the visualization in Fig. 5 (v), with more views in a fragment, the geometry can be reconstructed more accurately compared to Fig. 5 (iv).
Qualitative Results. We provide the qualitative results and the corresponding analysis in Fig. 4.
In this paper, we introduced NeuralRecon, a novel system for real-time 3D reconstruction from monocular video. The key idea is to jointly reconstruct and fuse sparse TSDF volumes for each video fragment incrementally with 3D sparse convolutions and a GRU. This design enables NeuralRecon to output accurate and coherent reconstructions in real-time. Experiments show that NeuralRecon outperforms state-of-the-art methods in both reconstruction quality and running speed. The sparse TSDF volume reconstructed by NeuralRecon can be directly used in downstream tasks like 3D object detection, 3D semantic segmentation and neural rendering. We believe that, by jointly training with downstream tasks end-to-end, NeuralRecon enables new possibilities in learning-based multi-view perception and recognition systems.
Acknowledgement. The authors would like to acknowledge the support from the National Key Research and Development Program of China (No. 2020AAA0108901), NSFC (No. 61806176), and ZJU-SenseTime Joint Lab of 3D Vision.
ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. arXiv, 2020.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NeurIPS 2014 Workshop on Deep Learning, 2014.