Deep Depth Completion of a Single RGB-D Image
The goal of our work is to complete the depth channel of an RGB-D image. Commodity-grade depth cameras often fail to sense depth for shiny, bright, transparent, and distant surfaces. To address this problem, we train a deep network that takes an RGB image as input and predicts dense surface normals and occlusion boundaries. Those predictions are then combined with raw depth observations provided by the RGB-D camera to solve for depths for all pixels, including those missing in the original observation. This method was chosen over others (e.g., inpainting depths directly) as the result of extensive experiments with a new depth completion benchmark dataset, where holes are filled in training data through the rendering of surface reconstructions created from multiview RGB-D scans. Experiments with different network inputs, depth representations, loss functions, optimization methods, inpainting methods, and deep depth estimation networks show that our proposed approach provides better depth completions than these alternatives.
Depth sensing has become pervasive in applications as diverse as autonomous driving, augmented reality, and scene reconstruction. Despite recent advances in depth sensing technology, commodity-level RGB-D cameras like Microsoft Kinect, Intel RealSense, and Google Tango still produce depth images with missing data when surfaces are too glossy, bright, thin, close, or far from the camera. These problems appear when rooms are large, surfaces are shiny, and strong lighting is abundant – e.g., in museums, hospitals, classrooms, stores, etc. Even in homes, depth images often are missing more than 50% of the pixels (Figure 1).
The goal of our work is to complete the depth channel of an RGB-D image captured with a commodity camera (i.e., fill all the holes). Though depth inpainting has received a lot of attention over the past two decades, it has generally been addressed with hand-tuned methods that fill holes by extrapolating boundary surfaces or with Markovian image synthesis. Newer methods have been proposed to estimate depth de novo from color using deep networks. However, they have not been used for depth completion, which has its own unique challenges:
Training data: Large-scale training sets are not readily available for captured RGB-D images paired with “completed” depth images (e.g., where ground-truth depth is provided for holes). As a result, most methods for depth estimation are trained and evaluated only for pixels that are captured by commodity RGB-D cameras. From this data, they can at best learn to reproduce observed depths, but not complete depths that are unobserved, which have significantly different characteristics. To address this issue, we introduce a new dataset with 105,432 RGB-D images aligned with completed depth images computed from large-scale surface reconstructions in 72 real-world environments.
Depth representation: The obvious approach to address our problem is to use the new dataset as supervision to train a fully convolutional network to regress depth directly from RGB-D. However, that approach does not work very well, especially for large holes like the one shown in the bottom row of Figure 1. Estimating absolute depths from a monocular color image is difficult even for people . Rather, we train the network to predict only local differential properties of depth (surface normals and occlusion boundaries), which are much easier to estimate . We then solve for the absolute depths with a global optimization.
Deep network design:
There is no previous work studying how best to design and train an end-to-end deep network for completing depth images from RGB-D inputs. At first glance, it seems straightforward to extend previous networks trained for color-to-depth (e.g., by providing them an extra depth channel as input). However, we found it difficult to train such networks to fill large holes from depth inputs – they generally learn only to copy and interpolate the input depth. It is also challenging for the network to learn how to adapt for misalignments of color and depth. Our solution is to provide the network with only color images as input (Figure 2). We train it to predict local surface normals and occlusion boundaries with supervision. We later combine those predictions with the input depths in a global optimization to solve for the completed depth. In this way, the network predicts only local features from color, a task where it excels. The coarse-scale structure of the scene is reconstructed through global optimization with regularization from the input depth.
Overall, our main algorithmic insight is that it is best to decompose RGB-D depth completion into two stages: 1) prediction of surface normals and occlusion boundaries only from color, and 2) optimization of global surface structure from those predictions with soft constraints provided by observed depths. In our experiments, we find that this proposed approach has significantly smaller relative error than alternative approaches. It has the extra benefit that the trained network is independent of the observed depths and so does not need to be retrained for new depth sensors.
There has been a large amount of prior work on depth estimation, inpainting, and processing.
Depth estimation from a monocular color image is a long-standing problem in computer vision. Classic methods include shape-from-shading and shape-from-defocus. Other early methods were based on hand-tuned models and/or assumptions about surface orientations [31, 60, 61]. Newer methods treat depth estimation as a machine learning problem, most recently using deep networks [19, 73]. For example, Eigen et al. first used a multiscale convolutional network to regress from color images to depths [19, 18]. Laina et al. used a fully convolutional network architecture based on ResNet. Liu et al. proposed a deep convolutional neural field model combining deep networks with Markov random fields. Roy et al. combined shallow convolutional networks with regression forests to reduce the need for large training sets. All of these methods are trained only to reproduce the raw depth acquired with commodity RGB-D cameras. In contrast, we focus on depth completion, where the explicit goal is to make novel predictions for pixels where the depth sensor has no return. Since these pixels are often missing in the raw depth, methods trained only on raw depth as supervision do not predict them well.
Depth inpainting. Many methods have been proposed for filling holes in depth channels of RGB-D images, including ones that employ smoothness priors, fast marching methods [25, 42], Navier-Stokes, anisotropic diffusion, background surface extrapolation [51, 54, 68], color-depth edge alignment [10, 77, 81], low-rank matrix completion, tensor voting, Mumford-Shah functional optimization, joint optimization with other properties of intrinsic images, and patch-based image synthesis [11, 16, 24]. Recently, methods have been proposed for inpainting color images with auto-encoders and GAN architectures. However, prior work has not investigated how to use those methods for inpainting of depth images. This problem is more difficult due to the absence of strong features in depth images and the lack of large training datasets, an issue addressed in this paper.
Depth super-resolution. Several methods have been proposed to improve the spatial resolution of depth images using high-resolution color. They have exploited a variety of approaches, including Markov random fields [48, 15, 46, 56, 63], shape-from-shading [27, 76], segmentation, and dictionary methods [21, 34, 49, 69]. Although some of these techniques may be used for depth completion, the challenges of super-resolution are quite different – there the focus is on improving spatial resolution under the assumption that low-resolution measurements are complete and regularly sampled. In contrast, our focus is on filling holes, which can be quite large and complex and thus require synthesis of large-scale content.
Depth reconstruction from sparse samples. Other work has investigated depth reconstruction from color images augmented with sparse sets of depth measurements. Hawe et al. investigated using a Wavelet basis for reconstruction . Liu et al. combined wavelet and contourlet dictionaries . Ma et al. showed that providing 100 well-spaced depth samples improves depth estimation over color-only methods by two-fold for NYUv2 , yet still with relatively low-quality results. These methods share some ideas with our work. However, their motivation is to reduce the cost of sensing in specialized settings (e.g., to save power on a robot), not to complete data typically missed in readily available depth cameras.
In this paper, we investigate how to use a deep network to complete the depth channel of a single RGB-D image. Our investigation focuses on the following questions: “how can we get training data for depth completion?,” “what depth representation should we use?,” and “how should cues from color and depth be combined?.”
The first issue we address is to create a dataset of RGB-D images paired with completed depth images.
A straightforward approach to this task would be to capture images with a low-cost RGB-D camera and align them to images captured simultaneously with a higher-cost depth sensor. This approach is costly and time-consuming – the largest public datasets of this type cover only a handful of indoor scenes (e.g., [57, 62, 75]).
Instead, to create our dataset, we utilize existing surface meshes reconstructed from multi-view RGB-D scans of large environments. There are several datasets of this type, including Matterport3D, ScanNet, SceneNN, and SUN3D [26, 72], to name a few. We use Matterport3D. For each scene, we extract a triangle mesh with 1-6 million triangles per room from a global surface reconstruction using screened Poisson surface reconstruction. Then, for a sampling of RGB-D images in the scene, we render the reconstructed mesh from the camera pose of the image viewpoint to acquire a completed depth image D*. This process provides us with a set of (RGB-D, D*) image pairs without having to collect new data.
Figure 3 shows some examples of depth image completions from our dataset. Though the completions are not always perfect, they have several favorable properties for training a deep network for our problem. First, the completed depth images generally have fewer holes. That is because a completed image is not limited by the observation of one camera viewpoint (e.g., the red dot in Figure 3), but instead draws on the union of observations from all camera viewpoints contributing to the surface reconstruction (yellow dots in Figure 3). As a result, surfaces distant to one view, but within range of another, will be included in the completed depth image. Similarly, glossy surfaces that provide no depth data when viewed at a grazing angle usually can be filled in with data from other cameras viewing the surface more directly (note the completion of the shiny floor in the rendered depth). On average, 64.6% of the pixels missing from the raw depth images are filled in by our reconstruction process.
Second, the completed depth images generally replicate the resolution of the originals for close-up surfaces, but provide far better resolution for distant surfaces. Since the surface reconstructions are constructed at a 3D grid size comparable to the resolution of a depth camera, there is usually no loss of resolution in completed depth images. However, that same 3D resolution provides an effectively higher pixel resolution for surfaces further from the camera when projected onto the view plane. As a result, completed depth images can leverage subpixel antialiasing when rendering high resolution meshes to get finer resolution than the originals (note the detail in the furniture in Figure 3).
Finally, the completed depth images generally have far less noise than the originals. Since the surface reconstruction algorithm combines noisy depth samples from many camera views by filtering and averaging, it essentially de-noises the surfaces. This is especially important for distant observations (e.g., 4 meters), where raw depth measurements are quantized and noisy.
In all, our dataset contains 117,516 RGB-D images with rendered completions, which we split into a training set with 105,432 images and a test set with 12,084 images.
A second interesting question is “what geometric representation is best for deep depth completion?”
A straight-forward approach is to design a network that regresses completed depth from raw depth and color. However, absolute depth can be difficult to predict from monocular images, as it may require knowledge of object sizes, scene categories, etc. Instead, we train the network to predict local properties of the visible surface at each pixel and then solve back for the depth from those predictions.
Previous work has considered a number of indirect representations of depth. For example, Chen et al. investigated relative depths. Chakrabarti et al. proposed depth derivatives. Li et al. used depth derivatives in conjunction with depths. We have experimented with methods based on predicted derivatives; however, we find that they do not perform best in our experiments (see Section 4).
Instead, we focus on predicting surface normals and occlusion boundaries. Since normals are differential surface properties, they depend only on local neighborhoods of pixels. Moreover, they relate strongly to local lighting variations directly observable in a color image. For these reasons, previous works on dense prediction of surface normals from color images produce excellent results [3, 18, 38, 71, 80]. Similarly, occlusion boundaries produce local patterns in pixels (e.g., edges), and so they usually can be robustly detected with a deep network [17, 80].
A critical question, though, is how we can use predicted surface normals and occlusion boundaries to complete depth images. Several researchers have used predicted normals to refine details on observed 3D surfaces [28, 55, 74], and Galliani et al. used surface normals to recover missing geometry in multi-view reconstruction for table-top objects. However, to our knowledge, surface normals have not previously been used for depth estimation or completion from monocular RGB-D images in complex environments.
Unfortunately, it is theoretically not possible to solve for depths from only surface normals and occlusion boundaries. There can be pathological situations where the depth relationships between different parts of the image cannot be inferred only from normals. For example, in Figure 4(a), it is impossible to infer the depth of the wall seen through the window based on only the given surface normals. In this case, the visible region of the wall is enclosed completely by occlusion boundaries (contours) from the perspective of the camera, leaving its depth indeterminate with respect to the rest of the image.
In practice, however, for real-world scenes it is very unlikely that a region of an image will both be surrounded by occlusion boundaries AND contain no raw depth observations at all (Figure 4(b)). Therefore, we find it practical to complete even large holes in depth images using predicted surface normals, with coherence weighted by predicted occlusion boundaries and regularization constrained by observed raw depths. In our experiments, we find that solving for depth from predicted surface normals and occlusion boundaries results in better depth completions than predicting absolute depths directly, or even than solving from depth derivatives (see Section 4).
A third interesting question is “what is the best way to train a deep network to predict surface normals and occlusion boundaries for depth completion?”
For our study, we adopt the deep network architecture proposed by Zhang et al. because it has shown competitive performance on both normal estimation and boundary detection. The model is a fully convolutional neural network built on the backbone of VGG-16 with a symmetric encoder and decoder. It is also equipped with short-cut connections and shared pooling masks for corresponding max pooling and unpooling layers, which are critical for learning local image features. We train the network with “ground truth” surface normals and silhouette boundaries computed from the reconstructed mesh.
After choosing this network, there are still several interesting questions regarding how to train it for depth completion. The following paragraphs consider these questions with a focus on normal estimation, but the issues and conclusions apply similarly for occlusion boundary detection.
What loss should be used to train the network? Unlike past work on surface normal estimation, our primary goal is to train a network to predict normals only for pixels inside holes of raw observed depth images. Since the color appearance characteristics of those pixels are likely different than the others (shiny, far from the camera, etc.), one might think that the network should be supervised to regress normals only for these pixels. Yet, there are fewer pixels in holes than not, and so training data of that type is limited. It was not obvious whether it is best to train only on holes vs. all pixels. So, we tested both and compared.
We define the observed pixels as the ones with depth data from both the raw sensor and the rendered mesh, and the unobserved pixels as the ones with depth from the rendered mesh but not the raw sensor. For any given set of pixels (observed, unobserved, or both), we train models with a loss for only those pixels by masking out the gradients on other pixels during the back-propagation.
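This per-pixel-set supervision can be sketched as a masked loss, where pixels outside the chosen set contribute nothing to the loss (and hence no gradient flows from them during back-propagation). The function below is a minimal, hypothetical numpy sketch of the idea, not the paper's actual training code:

```python
import numpy as np

def masked_normal_loss(pred, target, mask):
    """Mean (1 - cosine similarity) over masked pixels only.

    pred, target: (H, W, 3) arrays of unit surface normals
    mask:         (H, W) boolean array; True where the loss (and hence
                  the gradient) applies, e.g. observed pixels only,
                  unobserved pixels only, or their union.
    """
    cos = np.sum(pred * target, axis=-1)   # (H, W) per-pixel cosine similarity
    per_pixel = 1.0 - cos                  # 0 when normals agree exactly
    n = mask.sum()
    if n == 0:
        return 0.0
    # Pixels outside the mask contribute nothing, mimicking the
    # gradient masking described above.
    return float(per_pixel[mask].sum() / n)
```

In an autodiff framework, indexing by the mask (or multiplying by it) before reduction achieves exactly the effect of zeroing gradients on the excluded pixels.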
Qualitative and quantitative comparisons of the different trained models are shown in the supplemental material. The results suggest that models trained with all pixels perform better than ones using only observed or only unobserved pixels, and that models trained with rendered normals perform better than ones trained with raw normals.
What image channels should be input to the network? One might think that the best way to train the network to predict surface normals from a raw RGB-D image is to provide all four channels (RGBD) and train it to regress the three normal channels. However, surprisingly, we find that our networks perform poorly at predicting normals for pixels without observed depth when trained that way. They are excellent at predicting normals for pixels with observed depth, but not for the ones in holes – i.e., the ones required for depth completion. This result holds regardless of which pixels are included in the loss.
We conjecture that the network trained with raw depth mainly learns to compute normals from depth directly – it fails to learn how to predict normals from color when depth is not present, which is the key skill for depth completion. In general, we find that the network learns to predict normals better from color than depth, even if the network is given an extra channel containing a binary mask indicating which pixels have observed depth. For example, in Figure 5, we see that the normals predicted in large holes from color alone are better than from depth, and just as good as from both color and depth. Quantitative experiments support this finding in Table 1.
This result is very interesting because it suggests that we can train a network to predict surface normals from color alone and use the observed depth only as regularization when solving back for depth from normals (next section). This strategy of separating “prediction without depth” from “optimization with depth” is compelling for two reasons. First, the prediction network does not have to be retrained for different depth sensors. Second, the optimization can be generalized to take a variety of depth observations as regularization, including perhaps sparse depth samples . This is investigated experimentally in Section 4.
After predicting the surface normal image N and occlusion boundary image B, we solve a system of equations to complete the depth image D. The objective function is defined as the weighted sum of squared errors with four terms:

E = λ_D E_D + λ_N E_N + λ_S E_S

where E_D = Σ_{p ∈ T_obs} (D(p) − D_0(p))² measures the distance between the estimated depth D(p) and the observed raw depth D_0(p) at pixel p; E_N = Σ_{(p,q)} B(p) ⟨v(p,q), N(p)⟩² measures the consistency between the estimated depth and the predicted surface normal N(p), where v(p,q) is the tangent vector from pixel p to its neighbor q; and E_S = Σ_{(p,q)} (D(p) − D(q))² encourages adjacent pixels to have the same depths. The per-pixel weight B(p) ∈ [0, 1] down-weights the normal terms based on the predicted probability that a pixel is on an occlusion boundary.
In its simplest form, this objective function is non-linear, due to the normalization of the tangent vector required for the dot product with the surface normal in the normal consistency term. However, we can approximate this error term with a linear formulation by foregoing the vector normalization, as suggested in prior work. In other settings, this approximation would add sensitivity to scaling errors, since smaller depths result in shorter tangents and potentially smaller terms. However, in a depth completion setting, the data term forces the global solution to maintain the correct scale by enforcing consistency with the observed raw depth, and thus this is not a significant problem.
Since the matrix form of the system of equations is sparse and symmetric positive definite, we can solve it efficiently with a sparse Cholesky factorization (as implemented in cs_cholsol in CSparse ). The final solution is a global minimum to the approximated objective function.
This linearization approach is critical to the success of the proposed method. Surface normals and occlusion boundaries (and optionally depth derivatives) capture only local properties of the surface geometry, which makes them relatively easy to estimate. Only through global optimization can we combine them to complete the depths for all pixels in a consistent solution.
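As a concrete illustration, the global optimization can be sketched as a linear least-squares problem over per-pixel depths. The sketch below is a deliberately simplified, hypothetical formulation: instead of full 3D tangent/normal dot products, the linearized normal constraints are reduced to per-axis depth-gradient targets (gx, gy), and the function name and default weights are placeholders, not the paper's exact values:

```python
import numpy as np

def complete_depth(d0, observed, gx, gy, boundary,
                   lam_d=1e3, lam_n=1.0, lam_s=1e-3):
    """Toy linear least-squares depth completion (simplified sketch).

    d0:       (H, W) raw depth, used only where `observed` is True
    observed: (H, W) bool mask of valid raw-depth pixels
    gx, gy:   (H, W) depth-gradient targets standing in for the
              linearized surface-normal constraints
    boundary: (H, W) predicted occlusion-boundary probability in [0, 1]
    """
    H, W = d0.shape
    n = H * W
    idx = lambda y, x: y * W + x
    rows, rhs = [], []

    def add(entries, target, w):
        row = np.zeros(n)
        for col, coef in entries:
            row[col] = w * coef
        rows.append(row)
        rhs.append(w * target)

    sd, sn, ss = lam_d ** 0.5, lam_n ** 0.5, lam_s ** 0.5
    for y in range(H):
        for x in range(W):
            p = idx(y, x)
            if observed[y, x]:                      # data term E_D
                add([(p, 1.0)], d0[y, x], sd)
            if x + 1 < W:                           # horizontal neighbor
                q = idx(y, x + 1)
                wn = sn * (1.0 - boundary[y, x])    # B(p) down-weighting
                add([(q, 1.0), (p, -1.0)], gx[y, x], wn)  # normal term E_N
                add([(q, 1.0), (p, -1.0)], 0.0, ss)       # smoothness E_S
            if y + 1 < H:                           # vertical neighbor
                q = idx(y + 1, x)
                wn = sn * (1.0 - boundary[y, x])
                add([(q, 1.0), (p, -1.0)], gy[y, x], wn)
                add([(q, 1.0), (p, -1.0)], 0.0, ss)

    A, b = np.stack(rows), np.array(rhs)
    # The paper's system is sparse and symmetric positive definite and is
    # solved with a sparse Cholesky factorization (cs_cholsol in CSparse);
    # a dense normal-equations solve suffices for this toy sketch.
    return np.linalg.solve(A.T @ A, A.T @ b).reshape(H, W)
```

With zero gradient targets and a single observed anchor pixel, the solver propagates the anchored depth across the whole image, illustrating how sparse observations regularize the global solution.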
We ran a series of experiments to test the proposed methods. Unless otherwise specified, networks were pretrained on the SUNCG dataset [66, 80] and fine-tuned on the training split of our new dataset using only color as input and a loss computed over all rendered pixels. Optimizations were performed with fixed weights λ_D, λ_N, and λ_S. Evaluations were performed on the test split of our new dataset.
We find that predicting surface normals and occlusion boundaries from color at 320x256 takes 0.3 seconds on an NVIDIA TITAN X GPU. Solving the linear equations for depths takes 1.5 seconds on an Intel Xeon 2.4GHz CPU.
The first set of experiments investigates how different test inputs, training data, loss functions, depth representations, and optimization methods affect the depth prediction results (further results can be found in the supplemental material).
Since the focus of our work is predicting depth where it is unobserved by a depth sensor, our evaluations measure errors in depth predictions only for pixels of test images unobserved in the test depth image (but present in the rendered image). This is the opposite of most previous work on depth estimation, where error is measured only for pixels that are observed by a depth camera.
When evaluating depth predictions, we report the median error relative to the rendered depth (Rel), the root mean squared error in meters (RMSE), and the percentages of pixels whose predicted depths fall within a threshold of the rendered depth (δ_t = the fraction of pixels with max(d/d*, d*/d) < t), where t is 1.05, 1.10, 1.25, 1.25², or 1.25³. These metrics are standard among previous work on depth prediction, except that we add the thresholds of 1.05 and 1.10 to enable finer-grained evaluation.
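These metrics can be computed with a few lines of numpy. The helper below is a hypothetical sketch (the function name is our own, and valid pixels are assumed to be those with positive rendered depth):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-prediction metrics over valid ground-truth pixels.

    Rel:     median relative error |pred - gt| / gt
    RMSE:    root mean squared error (meters)
    delta_t: fraction of pixels with max(pred/gt, gt/pred) < t
    """
    valid = gt > 0                         # pixels with rendered depth
    p, g = pred[valid], gt[valid]
    rel = float(np.median(np.abs(p - g) / g))
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    ratio = np.maximum(p / g, g / p)
    deltas = {t: float(np.mean(ratio < t))
              for t in (1.05, 1.10, 1.25, 1.25 ** 2, 1.25 ** 3)}
    return rel, rmse, deltas
```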
When evaluating surface normal predictions, we report the mean and median errors (in degrees), plus the percentages of pixels with predicted normals less than thresholds of 11.25, 22.5, and 30 degrees.
What data should be input to the network? Table 1 shows results of an experiment to test what type of inputs are best for our normal prediction network: color only, raw depth only, or both. Intuitively, it would seem that inputting both would be best. However, we find that the network learns to predict surface normals better when given only color (median error = 17.28 for color vs. 23.07 for both), which results in depth estimates that are also slightly better (Rel = 0.089 vs. 0.090). This difference persists whether we train with depths for all pixels, only observed pixels, or only unobserved pixels (results in supplemental material). We expect the reason is that the network quickly learns to interpolate from observed depth if it is available, which hinders it from learning to synthesize new depth in large holes.
The impact of this result is quite significant, as it motivates our two-stage system design that separates normal/boundary prediction only from color and optimization with raw depth.
What depth representation is best? Table 2 shows results of an experiment to test which depth representations are best for our network to predict. We train networks separately to predict absolute depths (D), surface normals (N), and depth derivatives in 8 directions (DD), and then use different combinations to complete the depth by optimizing Equation 1. The results indicate that solving for depths from predicted normals (N) provides the best results (Rel = 0.089 for normals (N), as compared to 0.167 for depth (D), 0.100 for derivatives (DD), and 0.092 for normals and derivatives (N+DD)). We expect that this is because normals represent only the orientation of surfaces, which is relatively easy to predict. Moreover, normals do not scale with depth, unlike depths or depth derivatives, and thus are more consistent across a range of views.
Does prediction of occlusion boundaries help? The last six rows of Table 2 show results of an experiment to test whether down-weighting the effect of surface normals near predicted occlusion boundaries helps the optimizer solve for better depths. Rows 2-4 are without boundary prediction (“No” in the first column), and Rows 5-7 are with (“Yes”). The results indicate that boundary predictions improve the results by 19% (Rel = 0.089 vs. 0.110). This suggests that the network is on average correctly predicting pixels where surface normals are noisy or incorrect, as shown qualitatively in Figure 6.
How much observed depth is necessary? Figure 7 shows results of an experiment to test how much our depth completion method depends on the quantity of input depth. To investigate this question, we degraded the input depth images by randomly masking different numbers of pixels before giving them to the optimizer to solve for completed depths from predicted normals and boundaries. The two plots show curves indicating depth accuracy for pixels that are observed (left) and unobserved (right) in the original raw depth images. From these results, we see that the optimizer is able to solve for depth almost as accurately when given only a small fraction of the pixels in the raw depth image. As expected, performance is much worse on pixels unobserved in the raw depth (they are harder). However, the depth estimates are still quite good when only a small fraction of the raw pixels are provided (the rightmost point on the curve at 2000 pixels represents only 2.5% of all pixels). This result suggests that our method could be useful for other depth sensor designs with sparse measurements. In this setting, our deep network would not have to be retrained for each new depth sensor (since it depends only on color), a benefit of our two-stage approach.
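The degradation used in this experiment amounts to keeping a random subset of the valid depth pixels and zeroing the rest. A minimal sketch (function name and seeding are our own):

```python
import numpy as np

def sparsify_depth(depth, n_samples, rng=None):
    """Keep only n_samples randomly chosen valid depth pixels; zero the rest.

    Mimics the degradation experiment: the optimizer then completes depth
    from predicted normals/boundaries plus these sparse depth samples.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    sparse = np.zeros_like(depth)
    ys, xs = np.nonzero(depth > 0)                 # valid raw-depth pixels
    keep = rng.choice(len(ys), size=min(n_samples, len(ys)), replace=False)
    sparse[ys[keep], xs[keep]] = depth[ys[keep], xs[keep]]
    return sparse
```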
The second set of experiments investigates how the proposed approach compares to baseline depth inpainting and depth estimation methods.
Comparison to Inpainting Methods. Table 8 shows results of a study comparing our proposed method to typical non-data-driven alternatives for depth inpainting. The focus of this study is to establish how well-known methods perform, providing a baseline for how hard the problem is on this new dataset. The methods we consider include: a) joint bilateral filtering (Bilateral), b) the fast bilateral solver (Fast), and c) global edge-aware energy optimization (TGV). The results in Table 8 show that our method significantly outperforms these methods (Rel = 0.089 vs. 0.103-0.151 for the others). By training to predict surface normals with a deep network, our method learns to complete depth with data-driven priors, which are stronger than simple geometric heuristics. The difference to the best of the tested hand-tuned approaches (Bilateral) can be seen in Figure 8.
Comparison to Depth Estimation Methods. Table 4 shows results for a study comparing our proposed method to previous methods that estimate depth only from color. We compare against Chakrabarti et al., whose approach is most similar to ours (it uses predicted derivatives), and Laina et al., who recently reported state-of-the-art results in experiments with NYUv2. We fine-tune the former on our dataset, but use the model pretrained on NYUv2 for the latter, as their training code is not provided.
Of course, these depth estimation methods solve a different problem than ours (no input depth), and alternative methods have different sensitivities to the scale of depth values, and so we make our best attempt to adapt both their and our methods to the same setting for fair comparison. To do that, we run all methods with only color images as input and then uniformly scale their depth image outputs to align perfectly with the true depth at one random pixel (selected the same for all methods). In our case, since Equation 1 is under-constrained without any depth data, we arbitrarily set the middle pixel to a depth of 3 meters during our optimization and then later apply the same scaling as the other methods. This method focuses the comparison on predicting the “shape” of the computed depth image rather than its global scale.
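The single-pixel scale alignment described above can be sketched as follows (a hypothetical helper; the anchor pixel is chosen once at random and shared across all methods):

```python
import numpy as np

def align_scale(pred, gt, pixel):
    """Uniformly scale a predicted depth map so it matches the ground
    truth exactly at one anchor pixel, shared across all compared methods.

    pred, gt: (H, W) depth maps
    pixel:    (y, x) anchor location with nonzero predicted depth
    """
    y, x = pixel
    scale = gt[y, x] / pred[y, x]   # single-pixel scale factor
    return pred * scale             # evaluation then compares "shape"
```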
Results of the comparison are shown in Figure 9 and Table 4. From the qualitative results in Figure 9, we see that our method reproduces both the structure of the scene and the fine details best – even when given only one pixel of raw depth. According to the quantitative results shown in Table 4, our method is 23-40% better than the others, regardless of whether evaluation pixels have observed depth (Y) or not (N). These results suggest that predicting surface normals is a promising approach to depth estimation as well.
This paper describes a deep learning framework for completing the depth channel of an RGB-D image acquired with a commodity RGB-D camera. It provides two main research contributions. First, it proposes to complete depth with a two-stage process in which surface normals and occlusion boundaries are predicted from color, and completed depths are then solved from those predictions. Second, it learns to complete depth images by supervised training on data rendered from large-scale surface reconstructions. In tests with a new benchmark, we find the proposed approach outperforms previous baseline approaches for depth inpainting and estimation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5965–5974, 2016.
Direction matters: depth estimation with a surface normal classifier. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 381–389, 2015.
Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.
SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
Depth image inpainting: Improving low rank matrix completion with low gradient regularization. IEEE Transactions on Image Processing, 26(9):4311–4320, 2017.
Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017.
This section provides extra implementation details for our methods. All data and code will be released upon acceptance to ensure reproducibility.
For every scene in the Matterport3D dataset, meshes were reconstructed and rendered to provide “completed depth images” using the following process. First, each house was manually partitioned into regions roughly corresponding to rooms using an interactive floorplan drawing interface. Second, a dense point cloud was extracted containing RGB-D points (pixels) within each region, excluding pixels whose depth is beyond 4 meters from the camera (to avoid noise in the reconstructed mesh). Third, a mesh was reconstructed from the points of each region using Screened Poisson Surface Reconstruction  with octree depth 11. The meshes for all regions were then merged to form the final reconstructed mesh for each scene. “Completed depth images” were then created for each of the original RGB-D camera views by rendering from that view using OpenGL and reading back the depth buffer.
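Reading back the OpenGL depth buffer yields nonlinear values in [0, 1]; recovering metric depth requires inverting the perspective projection. A minimal sketch of the standard linearization (the function name and clip-plane values are illustrative, not from the rendering code):

```python
def linearize_depth(zbuf, near, far):
    # zbuf: value read from the OpenGL depth buffer, in [0, 1].
    # Convert to normalized device coordinates in [-1, 1], then
    # invert the perspective projection to recover metric depth.
    ndc = 2.0 * zbuf - 1.0
    return (2.0 * near * far) / (far + near - ndc * (far - near))
```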
Figure 10 shows images of a mesh produced with this process. The top row shows exterior views covering the entire house (vertex colors on the left, flat shading on the right). The bottom row shows a close-up of the mesh from an interior view. Though the mesh is not perfect, its 12.2M triangles reproduce most surface details. Please note that the mesh is complete in the regions where holes typically occur in RGB-D images (windows, shiny table tops, thin chair structures, glossy cabinet surfaces, etc.). Please also note the high level of detail on surfaces distant from the camera (e.g., the furniture in the next room visible through the doorway).
All the networks used for this project are derived from the surface normal estimation model proposed in Zhang et al., with the following modifications.
The first convolution layer of the network accepts a different number of input channels depending on the input modality.
Color. The color input is a 3-channel tensor containing the R, G, and B values, with intensities normalized to [-0.5, 0.5]. We use bilinear interpolation to resize color images when necessary.
Depth. Absolute depth values in meters are used as input. Pixels with no depth signal from the sensor are assigned a value of zero. To resolve the ambiguity between “missing” and “0 meters”, a binary mask indicating which pixels have sensor depth is added as an additional channel, as suggested in Zhang et al. Overall, the depth input contains 2 channels (absolute depth and a binary validity mask). To prevent inaccurate smoothing, we use nearest-neighbor interpolation to resize depth images.
Color+Depth. The input in this case is the concatenation of the color and depth inputs described above, resulting in a 5-channel tensor.
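The 5-channel input described above can be assembled as in this sketch (`build_input` is an illustrative name; the released code may differ):

```python
import numpy as np

def build_input(rgb, depth):
    # rgb: H x W x 3 uint8 image; depth: H x W float meters, 0 = missing.
    color = rgb.astype(np.float32) / 255.0 - 0.5        # normalize to [-0.5, 0.5]
    valid = (depth > 0).astype(np.float32)              # binary validity mask
    return np.concatenate([color,
                           depth[..., None].astype(np.float32),
                           valid[..., None]], axis=-1)  # H x W x 5
```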
The networks predicting absolute depth, surface normals, and depth derivatives produce outputs with 1, 3, and 8 channels, respectively. The occlusion boundary detection network generates a 3-channel output representing the probability of each pixel belonging to “no edge”, “depth crease”, or “occlusion boundary”.
Depth, surface normals, and derivatives are predicted as regression tasks. The SmoothL1 loss (https://github.com/torch/nn/blob/master/doc/criterion.md#nn.SmoothL1Criterion) is used for training depth and derivatives, and the cosine embedding loss (https://github.com/torch/nn/blob/master/doc/criterion.md#nn.CosineEmbeddingCriterion) is used for training surface normals. Occlusion boundary detection is formulated as a classification task and trained with the cross-entropy loss (https://github.com/torch/nn/blob/master/doc/criterion.md#nn.CrossEntropyCriterion). The last two batch normalization layers are removed because this results in better performance in practice.
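For the surface normal regression, the cosine embedding loss penalizes the angle between predicted and target normals. A minimal numpy sketch of the underlying quantity (illustrative; the Torch criterion also supports a margin for negative pairs, which is unused here):

```python
import numpy as np

def cosine_normal_loss(pred, target, eps=1e-8):
    # pred, target: N x 3 arrays of surface normals.
    # Per-normal loss is 1 - cos(angle between prediction and target):
    # 0 when perfectly aligned, 2 when pointing in opposite directions.
    p = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))
```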
The neural network training and testing are implemented in Torch. For all training tasks, the RMSprop optimization algorithm is used, with momentum set to 0.9 and a batch size of 1. The learning rate is set to 0.001 initially and reduced by half every 100K iterations. All models converge within 300K iterations.
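The learning-rate schedule above amounts to a simple step decay; a one-line sketch (the function name is illustrative):

```python
def learning_rate(iteration, base_lr=0.001, step=100_000):
    # Halve the learning rate every `step` iterations.
    return base_lr * 0.5 ** (iteration // step)
```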
This section provides extra experimental results, including ablation studies, cross-dataset experiments, and comparisons to other depth completion methods.
Section 4.1 of the paper provides results of ablation studies aimed at investigating how different test inputs, training data, loss functions, depth representations, and optimization methods affect our depth prediction results. This section provides further results of that type.
More qualitative results for surface normal estimation models trained under different settings are shown in Figure 11. Training the surface normal estimation model with our setting (i.e., using only the color image as input and all available pixels with rendered depth as supervision; the 4th column in the figure) achieves the best prediction quality, and hence benefits the global optimization for depth completion.
[Table 6: columns are Comparison, Setting, Depth Completion, and Surface Normal Estimation.]
This test studies which normals should be used as supervision for the loss when training the surface normal prediction network. We experimented with normals computed from raw depth images and with normals computed from the rendered mesh. The results in the top two rows of Table 6 (Comparison: Target) show that the model trained on rendered depth performs better than the one trained on raw depth. The improvement seems to come partly from having training pixels for unobserved regions and partly from more accurate depths (less noise).
This test studies which pixels should be included in the loss when training the surface normal prediction network. We experimented with supervision from only the unobserved pixels, from only the observed pixels, and from both. The three models were trained separately on the training split of our new dataset and then evaluated against the rendered normals in the test set. The quantitative results in the last three rows of Table 6 (Comparison: Pixel) show that the model trained with supervision from both observed and unobserved pixels (bottom row) works slightly better than the ones trained with only observed or only unobserved pixels. This shows that the unobserved pixels indeed provide additional information.
Several depth representations were considered in the paper (normals, derivatives, depths, etc.). This section provides further results regarding direct prediction of depth and disparity (i.e. one over depth) to augment/fix results in Table 2 of the paper.
The top row of Table 2 of the paper (where the Rep in column 2 is ‘D’) is mischaracterized as direct prediction of depth from color – it is actually direct prediction of completed depth from input depth. That was a mistake; we apologize for the confusion. The corrected result is in the top line of Table 5 of this document (Input=C, Rep=D). The result is quite similar and does not change any conclusions: predicting surface normals and then solving for depth is better than predicting depth directly (Rel = 0.089 vs. 0.408).
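For reference, the Rel numbers quoted here follow the standard mean absolute relative error; a minimal sketch (assuming evaluation over pixels with valid ground truth):

```python
import numpy as np

def rel_error(pred, gt):
    # Mean absolute relative error over pixels with valid ground truth.
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))
```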
We also consider predicting disparity rather than depth, as suggested in Chakrabarti et al. and other papers. We train models to estimate disparity directly from color and from raw depth, respectively. The results are shown in Table 5. We find that estimating disparity performs no better than estimating depth, whether given color or depth as input, for our depth completion application.
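Disparity here is simply the reciprocal of depth, with missing pixels kept at zero; a minimal conversion sketch:

```python
import numpy as np

def depth_to_disparity(depth):
    # Convert metric depth to disparity (1/depth), preserving
    # zero-valued pixels as "missing".
    disparity = np.zeros_like(depth)
    valid = depth > 0
    disparity[valid] = 1.0 / depth[valid]
    return disparity
```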
This test investigates whether it is possible to train our method on one dataset and then use it effectively for another.
We first conduct experiments between the Matterport3D and ScanNet datasets. Both provide 3D surface reconstructions for large sets of environments (1000 rooms each) and thus supply suitable data for training and testing our method with rendered meshes. We train a surface normal estimation model separately on each dataset, and then use it without fine-tuning to perform depth completion on the test set of the other. The quantitative results are shown in Table 7. As expected, each model works best on the test set matching the source of its training data. The model trained on Matterport3D generalizes better than the model trained on ScanNet, presumably because the Matterport3D dataset has a more diverse range of camera viewpoints. Interestingly, however, both models still work reasonably well on the other dataset, even though they were not fine-tuned at all. We conjecture this is because our surface normal prediction model is trained only on color inputs, which are relatively similar between the two datasets. Alternative methods using depth as input would probably not generalize as well, due to the significant differences between the depth images of the two datasets.
The depth maps from the Intel RealSense have better quality at short range but contain larger missing areas compared to the Structure Sensor and Kinect. The depth signal can be totally lost, or extremely sparse, for distant areas and for surfaces with special materials (e.g., shiny or dark). We train a surface normal estimation model on the ScanNet dataset and evaluate it directly on RGB-D images captured by an Intel RealSense from the SUN RGB-D dataset, without any fine-tuning. The results are shown in Figure 12. From left to right, we show the input color image, the input depth image, the completed depth image produced by our method, point cloud visualizations of the input and completed depth maps, and the surface normals converted from the completed depth. As can be seen, the depth from the RealSense contains larger missing areas than Matterport3D and ScanNet, yet our model still generates decent results. This again shows that our method can run effectively on RGB-D images captured by a variety of depth sensors with significantly different depth patterns.
Section 4.2 of the paper provides comparisons to alternative methods for depth inpainting. This section provides further results of that type in Table 8. In this additional study, we compare with the following methods:
DCT: fills in missing values by solving a penalized least-squares linear system using the discrete cosine transform, using the code from MATLAB Central (https://www.mathworks.com/matlabcentral/fileexchange/27994-inpaint-over-missing-data-in-1-d–2-d–3-d–nd-arrays).
The results of DCT  are similar to other inpainting comparisons provided in the paper. They mostly interpolate holes.
The results of FCN and CE show that methods designed for inpainting color are not very effective at inpainting depth. As described in the paper, methods that learn depth from depth using an FCN can be lazy and only learn to reproduce and interpolate the provided depth. However, the problems are more subtle than that, as depth data has many characteristics different from color. For starters, the context encoder has a shallower generator and lower resolution than our network, and thus generates blurrier depth images than ours. More significantly, the fact that ground-truth depth data can have missing values complicates the training of the discriminator network in the context encoder (CE) – in a naive implementation, the generator would be trained to predict the missing values in order to fool the discriminator. We tried multiple approaches to circumvent this problem, including propagating gradients only on unobserved pixels and filling missing areas with a mean depth value. We found that none of them works as well as our method.
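The workarounds above amount to restricting the reconstruction loss to a masked subset of pixels; a minimal sketch of such a masked loss (illustrative, not the actual training code):

```python
import numpy as np

def masked_l1(pred, target, mask):
    # Average L1 error computed only over pixels selected by `mask`
    # (e.g., only observed, or only unobserved, ground-truth pixels).
    m = mask > 0
    if not m.any():
        return 0.0
    return float(np.mean(np.abs(pred[m] - target[m])))
```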
More results of our method and comparisons to other inpainting methods can be found in Figures 14–16 at the end of this paper. Each example spans two rows: the second row shows the completed depths produced by the different methods, and the first row shows the corresponding surface normals to highlight details and 3D geometry. For each example, we show the input, ground truth, and our result, followed by the results of FCN, joint bilateral filtering, the discrete cosine transform, optimization with only the smoothness term, and PDE. As can be seen, our method generates better large-scale planar geometry and sharper object boundaries.
|Method||Rel||RMSE||δ<1.05||δ<1.10||δ<1.25||δ<1.25²||δ<1.25³|
|Garcia et al. ||0.115||0.144||36.78||47.13||61.48||74.89||81.67|
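The threshold columns in Table 8 follow the standard δ accuracy metric: the percentage of pixels whose predicted-to-true depth ratio (or its inverse) falls below the threshold. A minimal sketch (assuming evaluation over pixels with valid ground truth):

```python
import numpy as np

def delta_accuracy(pred, gt, thresh=1.25):
    # Percentage of valid pixels with max(pred/gt, gt/pred) < thresh.
    m = gt > 0
    ratio = np.maximum(pred[m] / gt[m], gt[m] / pred[m])
    return float(100.0 * np.mean(ratio < thresh))
```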
We also convert the completed depth maps into 3D point clouds for visualization and comparison, shown in Figure 13. The camera intrinsics provided in the Matterport3D dataset are used to project each pixel of the depth map to a 3D point, and the color intensities are copied from the color image. Each row shows one example, with the color image and the point clouds converted from the ground truth, the input depth (i.e., the raw sensor depth, which contains large missing areas), and the results of our method, FCN, joint bilateral filtering, and smooth inpainting. Compared to the other methods, our method preserves better 3D geometry with less bleeding at object boundaries.
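Projecting a depth map to a point cloud uses the standard pinhole back-projection with the camera intrinsics; a minimal sketch (the function name and intrinsics values are illustrative):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    # Back-project each valid depth pixel to a 3D point in camera
    # coordinates using the pinhole model: X = (u - cx) * Z / fx, etc.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)
    return points[depth > 0]  # N x 3 array, skipping missing pixels
```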