People regularly reason about free space they cannot see. For example, you might reach to grasp a cup, and your fingers will fold around the back of the cup, confident that there is room. As another example, you might put a mug down on your desk behind the laptop, even though you cannot see there. While your model of this invisible space might not be precise, you have it and use it every day. When you do so, you are using “counterfactual depth” — the depth you would see if an object had been removed. This paper shows how to predict counterfactual depth from images.
This ability to “see behind” is reproduced in scene completion methods, which seek to complete voxel maps to account for the backs of objects and to infer invisible free space. But these methods produce limited-resolution models of space and, moreover, require depth measurements to do so. Stereo pairs also provide little help in inferring scene geometry behind objects, since the large unknown depth region cannot be fully observed through small changes in camera position. While there are excellent methods for inferring depth from a single image, the resulting depth maps represent only the free space up to the nearest object.
In this paper, we describe a system that can accept an image and an object mask, and produce a depth map for the scene where the masked object has been removed (Figure 1): e.g. if you mask a cup in an image of a cup on a table, our system will show you the depth behind the cup. Our method works for the same reason that scene completion works: indoor scenes are very highly structured, and it is quite easy to come up with very good estimates of depth in unknown regions. However, image details are important: we show that our method easily outperforms Poisson smoothing of the depth map. Furthermore, our method easily outperforms the natural baseline of inpainting the image and recovering depth from the result, because inpainting often produces unnatural pixel fields.
Our approach is closely related to scene completion [42, 13], and works for the same reason that scene completion works: scene geometries have quite simple, spatially consistent structure. However, our method differs in important ways. We do not require additional depth information, and predict from the RGB image alone. Our system learns from images and depth maps (which are easy to acquire at large scale), rather than from polyhedral 3D models of scenes. Rather than actively reconstructing the entire scene at limited resolution (voxels), our method is passive: with no object mask, it reports a depth map for the image; provided with a mask, it reconstructs the depth map of the image with that object removed. This deferred computation allows us to produce smooth representations at much higher resolution than voxels can support. Our approach also differs from layered scene decomposition and depth hole filling [1, 29], which rely largely on the quality of the input depth to perceive the hidden geometry.
Our contributions are: 1) We describe a system that learns, from data, to reconstruct the depth that would be observed if one or multiple objects were removed from a scene. 2) For images where an object is removed, quantitative evaluations demonstrate that our method outperforms strong natural baselines (depth hole filling, image inpainting followed by depth prediction). 3) We introduce a carefully designed test set taken from real scenes that allows experiments investigating which scene and object properties tend to result in accurate reconstructions.
2 Related Work
Single image depth estimation is now well established. Early approaches use biased models (e.g. boxes for rooms) or aggressive smoothing. Markov random fields (MRFs) and conditional random fields (CRFs) can be applied to regress depth against monocular images. More recent approaches use deep neural networks with multi-scale predictions [11, 12], large-scale datasets [26, 2] and user interactions
Stereo provides strong cues for unsupervised learning [14, 46] or semi-supervised learning with LiDAR. Other approaches use sparse depth samples or variational models. Laina et al. propose a fully convolutional approach with an encoder-decoder structure, and utilize a per-pixel reverse Huber loss for better predictions. Chen et al. propose to learn from pixel pairs of relative depth, which is further improved with supervision from surface normals. Our approach regresses on both depth and surface normal predictions. Unlike Chen et al., we preprocess the ground truth surface normals with weighted quantized vectorization to ensure a smooth prediction. Moreover, we show experimentally that, in our task, an angular surface normal loss helps improve performance (while Chen et al. found it less effective).
Depth completion helps predict the 3D layout of a scene and the objects in a novel view. The completion can be performed on point clouds, RGBD sensor data [43, 39, 6, 45, 30], raw depth scans [35, 13, 42] or semantic segmentations. The predictions can be represented as dense depth maps [45, 30, 6], 3D meshes [35, 8], or voxels [13, 42]. Our “counterfactual depth prediction” task is challenging because we condition only on a single RGB input and a 2D object mask, and predict the dense depth map of the scene with the object removed – we predict the depth that can be seen and the depth that cannot.
We also investigate the natural baseline of removing objects from the scene via image inpainting. We can apply existing single image depth estimation approaches to the inpainted images, and obtain the predicted depth map with the objects removed. Image inpainting can be achieved by smoothing from unmasked neighbors [36, 7, 4], patch-based approaches [5, 15], planar structure guidance, or deep networks [18, 44, 28, 34]. We use the method of Iizuka et al., one of the state of the art for high resolution predictions with source code available, as our image inpainting baseline.
Assume a single RGB image is given. For any object mask that identifies an object in the scene, write M for the set of pixels lying on the object. We would like to predict the depth for the scene with that object removed (Figure 2). We write D for the predicted depth field, D_int for the depth predicted at pixels inside M (i.e. the depth behind the object in the mask), and D_ext for the depth predicted at pixels outside M. For example, if the scene had a cup on a desk, and the mask lay on the cup, then D_int would be the desk behind the cup, D_ext would be the rest of the desk, and D_int should be predictable because of the spatial coherence of the scene.
3.1 Network architecture
Figure 2 gives an overview of our network. We choose to modify the depth predictor of Laina et al., because it is fully convolutional and can model the dense spatial relationship between the depth inside and outside the mask. The encoder-decoder strategy of that method allows coarse-to-fine corrections of the predicted depth. The network takes an RGB image (height × width × channels) as input and outputs a depth map. The encoder is based on ResNet-50, with the fully-connected layers and the top pooling layers removed. The decoder consists of four up-projection blocks followed by a convolution layer. We use the object mask to guide the prediction by concatenating it to each of the input feature layers of the up-projection blocks; the mask is 1 for pixels on the object to be removed and 0 otherwise. The bottleneck forces the decoder to capture long-scale order in depth fields; the mask then informs the decoder where it should ignore image features and extrapolate depth. Extrapolation is helped by having image features encoded, because the features give some information about the likely depth behavior at the boundary of the mask, so the decoder can extrapolate into the masked region using both depth prior statistics and feature information. This comes at the cost of training difficulty: our decoder has a strictly more difficult task than Laina et al.'s decoder, because it must be willing to extrapolate into any masked region supplied at run time. We also experimented with concatenating the object mask with the input RGB image as input, but observed degraded performance.
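The mask-guidance step can be sketched in plain NumPy: the binary mask is resized to each decoder scale and appended as an extra feature channel. The shapes, the nearest-neighbor resize, and the helper names (`downsample_mask`, `concat_mask`) are illustrative assumptions, not the authors' MatConvNet code.

```python
import numpy as np

def downsample_mask(mask, fh, fw):
    """Nearest-neighbor downsample of a binary mask to (fh, fw)."""
    h, w = mask.shape
    ys = np.arange(fh) * h // fh
    xs = np.arange(fw) * w // fw
    return mask[np.ix_(ys, xs)]

def concat_mask(features, mask):
    """Append the resized mask as an extra channel of a (C, H, W)
    feature map, as done before each up-projection block."""
    c, fh, fw = features.shape
    m = downsample_mask(mask, fh, fw).astype(features.dtype)
    return np.concatenate([features, m[None]], axis=0)

# Illustrative shapes only: a binary mask (1 = object to remove) and a
# hypothetical decoder feature map at reduced resolution.
mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0
feat = np.zeros((8, 16, 16))
out = concat_mask(feat, mask)
print(out.shape)   # (9, 16, 16)
```

In the real network the concatenation happens at every up-projection scale, so the decoder sees the mask at each resolution it refines.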
3.2 Network loss
Given a predicted image depth D, and a ground truth depth D*, the overall network loss for each image is

L = λ_sn L_sn + λ_d L_d + λ_berHu L_berHu,

the weighted sum of the surface normal loss L_sn, the average image depth difference L_d, and the pixel-wise reverse Huber (berHu) loss L_berHu.
Surface normal loss with weighted smoothed ground truth. Much of the world is made of large polygons [8, 17], so that we can expect strong spatial correlations in surface normal. One can obtain small depth errors with large surface normal errors, which suggests controlling surface normal error directly. We use a loss that encourages normals derived from the predicted depth to be accurate:
The surface normal loss penalizes the average pixel-wise negative log likelihood of the angular distance between the predicted surface normal and the ground truth, with a per-pixel weight that we explain below. The surface normal is computed from the first-order derivatives of the predicted depth.
However, computing ground truth normals requires care. For two adjacent pixels only a few millimeters apart, a small error in measurement can still produce a steep change in normal direction. We apply a window-based gradient smoothing method: given the known camera focal lengths in the x and y dimensions, we compute gradients at each pixel from its neighboring pixels, then normalize the resulting normal to unit length.
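A minimal version of deriving normals from depth can be sketched as follows. The paper's window-based smoothing is not reproduced here; central differences and the f/z scaling of pixel gradients to metric gradients are our simplifications.

```python
import numpy as np

def normals_from_depth(z, fx, fy):
    """Approximate per-pixel surface normals from a depth map (meters).

    Central differences give dz/du, dz/dv in meters per pixel; one pixel
    spans roughly z/fx meters on the surface, so the metric gradients
    are scaled by fx/z and fy/z. A simplification of the paper's
    window-based computation.
    """
    dz_du = np.gradient(z, axis=1) * fx / z
    dz_dv = np.gradient(z, axis=0) * fy / z
    n = np.stack([-dz_du, -dz_dv, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A fronto-parallel plane (constant depth) should give normals (0, 0, 1).
z = np.full((8, 8), 2.0)
n = normals_from_depth(z, fx=500.0, fy=500.0)
print(np.allclose(n[4, 4], [0.0, 0.0, 1.0]))   # True
```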
We then smooth the normals spatially, using a procedure that retains sharp normal discontinuities. We quantize each ground truth normal into discrete bins, dividing the hemisphere of the normal space (assuming all normals point towards the viewpoint) into equally spanned bins of 16 latitudes and 4 azimuths. Then, we score the confidence of each bin belonging to the pixel's normal based on the weighted average angular distance to the pixel's neighbors, with weights that decrease smoothly as neighbors move further apart. Finally, we assign each pixel the normal of its highest-scoring bin. The advantage of this weighting strategy is that for a flat ground truth region, most of the processed ground truth normals will fall in the same bin, so we recover a constant plane. Similarly, at a normal discontinuity (e.g. a ridge), one normal will dominate on one side and the other on the other side, so the ridge will not be smoothed (see Figure 3). We show in experiments (Sec. 5.2) that training with this smoothed normal loss boosts our performance. It is worth noting that our approach is faster than plane fitting, and more accurate than simple partial derivatives (see Appendix A for a detailed comparison). This matters because we must re-compute surface normals for each training sample, as required by the data augmentation in Sec. 3.3.
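The binning-and-scoring idea might be sketched as below. The Gaussian spatial weight, the neighborhood radius, and the exact bin layout are assumptions; the paper's precise weighting is not reproduced.

```python
import numpy as np

def hemisphere_bins(n_lat=16, n_az=4):
    """Candidate normals on the viewer-facing hemisphere (assumed layout)."""
    lats = (np.arange(n_lat) + 0.5) * (np.pi / 2) / n_lat   # polar angle from +z
    azs = (np.arange(n_az) + 0.5) * 2 * np.pi / n_az
    bins = [(np.sin(t) * np.cos(a), np.sin(t) * np.sin(a), np.cos(t))
            for t in lats for a in azs]
    return np.array(bins)

def smooth_normals(normals, radius=2, sigma=1.5):
    """Assign each pixel the bin normal with the highest spatially
    weighted agreement (dot product) with its neighbors' raw normals.
    The Gaussian spatial weight is an assumption."""
    bins = hemisphere_bins()
    h, w, _ = normals.shape
    out = np.zeros_like(normals)
    for i in range(h):
        for j in range(w):
            scores = np.zeros(len(bins))
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        wgt = np.exp(-(di * di + dj * dj) / (2 * sigma ** 2))
                        scores += wgt * (bins @ normals[ii, jj])
            out[i, j] = bins[np.argmax(scores)]
    return out

# A flat region should map every pixel to the same bin (a constant plane).
v = np.array([0.05, 0.0, 1.0])
flat = np.tile(v / np.linalg.norm(v), (6, 6, 1))
sm = smooth_normals(flat)
print(np.allclose(sm, sm[0, 0]))   # True
```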
Depth prediction loss. We penalize the average depth difference compared to the ground truth. We use the reverse Huber (berHu) loss to penalize the per-pixel prediction error, which has shown superiority in single image depth estimation. We set the cut-off rate per batch.
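A berHu implementation under a common convention: the cut-off c is set to one fifth of the maximum absolute residual in the batch, as in Laina et al.; the paper's exact cut-off value is not stated here.

```python
import numpy as np

def berhu_loss(pred, gt):
    """Reverse Huber (berHu) loss: L1 below the cut-off c, scaled L2
    above it (continuous at r = c). c = 0.2 * max residual is assumed."""
    r = np.abs(pred - gt)
    c = 0.2 * r.max()
    if c == 0:
        return 0.0
    quad = (r ** 2 + c ** 2) / (2 * c)
    return np.mean(np.where(r <= c, r, quad))

pred = np.array([1.0, 2.0, 3.0])
gt = np.array([1.1, 2.0, 4.0])
print(round(berhu_loss(pred, gt), 4))   # 0.9
```

Relative to plain L1, the quadratic branch puts extra weight on large residuals, which helps correct gross depth errors inside the masked region.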
3.3 Implementation details
At inference time, for each input image and object mask, we first perform the largest center crop with the same aspect ratio as the network input size, then resize to fit the network input size. The output depth map is then resized back to the scale of the original cropped image by bilinear interpolation.
Initial experiments indicated that depth regressions against images tend to have quite localized support, likely because very high spatial correlations in real images mean that large-scale support is superfluous. But a network that predicts depth in locations where there are no known pixel values needs spatial support on very long scales (so that a location where pixel values are not known can draw from locations where they are). To achieve this, we randomly flip each pixel value in the object mask with a chance of 10%, i.e. a mask dropout rate of 0.1. This forces the network to use nearby pixels to predict depths. We mask out the flipped pixels when computing the loss to avoid error backpropagation. We show in experiments (Sec. 5.2) that training with mask dropout helps stabilize our performance.
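Mask dropout could be implemented as below; the helper names and the validity map used to exclude flipped pixels from the loss are our sketch of the description above, not the authors' code.

```python
import numpy as np

def mask_dropout(mask, rate=0.1, rng=None):
    """Flip each mask pixel with the given probability. Returns the
    perturbed mask and a validity map marking unflipped pixels; flipped
    pixels are excluded when computing the loss."""
    if rng is None:
        rng = np.random.default_rng(0)
    flip = rng.random(mask.shape) < rate
    noisy = np.where(flip, 1.0 - mask, mask)
    return noisy, ~flip

def masked_l1(pred, gt, valid):
    """L1 loss over valid (unflipped) pixels only."""
    return np.abs(pred - gt)[valid].mean()

mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0
noisy, valid = mask_dropout(mask)
print(round((noisy != mask).mean(), 3))   # close to the 0.1 dropout rate
```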
Data Augmentation. During training, we perform random cropping instead of center cropping to increase the number of training samples. The window size varies between fixed fractions of the size of the largest center crop. We perform the same cropping on the ground truth depth map. Note that a smaller crop is equivalent to a closer view of the object, resulting in a smaller distance to the camera. We thus divide each depth value by the zoom factor in order to preserve depth scales across different crops of the same image. We also update each crop's normals given the re-scaled depth, using the weighted quantized smoothed normal computation described in Sec. 3.2. Moreover, we perform random rotation in the image plane within a fixed range of degrees, random horizontal flipping, and image color changes, with each RGB channel multiplied independently by a weight sampled from a fixed range. Each augmentation parameter is uniformly sampled from its defined range.
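The crop-and-rescale consistency rule might look like this. The direction of the rescaling (multiply depth by the crop fraction, i.e. divide by the zoom factor) is our reading of the text, and the helper name is hypothetical.

```python
import numpy as np

def random_crop_with_depth(img, depth, scale, rng):
    """Random crop at a given scale (fraction of the full size). A crop
    of scale s, resized back to the network input, looks like a view
    from a camera closer by a factor 1/s, so depth is multiplied by s
    to keep depth scales consistent across crops of the same image."""
    h, w = depth.shape
    ch, cw = int(h * scale), int(w * scale)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    return (img[y:y + ch, x:x + cw],
            depth[y:y + ch, x:x + cw] * scale)

rng = np.random.default_rng(1)
img = np.zeros((100, 100, 3))
depth = np.full((100, 100), 2.0)
_, d = random_crop_with_depth(img, depth, scale=0.5, rng=rng)
print(d.shape, d[0, 0])   # (50, 50) 1.0
```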
To train our method, we need triples of ground truth: RGB image, object mask, and depth with the masked object removed. Such datasets do not exist, and are difficult to make at large scale. Instead, we make the ground truth tuples by rendering a synthetic dataset. However, a rendered dataset may not properly represent texture or illumination. We thus combine the data with the standard NYUd v2 real dataset (where we have only empty object masks). Training samples are selected uniformly across each training set (synthetic or real), with a 50% probability of choosing either. We apply mask dropout to all object masks.
Synthetic: AI2-THOR 
is an indoor virtual environment that supports physical simulation of objects in the scene. We modified the default simulation setting to be able to remove every object in the scene, rather than pickupable objects only. AI2-THOR has 120 predefined scenes from four categories of rooms: kitchen, living room, bedroom and bathroom. In each scene, we place an agent at a random location 100 times. The height of the agent is sampled from a normal distribution with a mean of 1.0 m and a standard deviation (std) of 0.1 m. The agent looks at the scene with a randomly sampled altitude, normally distributed around looking at the horizon. At each view, we generate the ground truth depth map with one of the objects removed. For each type of room, we use 27 scenes for training and withhold three scenes for testing. This creates 47k image-depth pairs of synthetic samples. Each rendered depth map ranges up to 5 meters.
Real: NYUd v2  is one of the most widely used RGBD datasets with real indoor scenes. We use the official train and test split in our experiments.
Synthetic. We use the test split of AI2-THOR to compare with the baselines. We keep 1162 test samples whose depth changes by at least 0.25 m per pixel after the object is removed, since samples with only slight depth changes make performance hard to assess.
| Factor | Values |
| --- | --- |
| shape complexity | simple (e.g. box), complex (e.g. chair) |
| shape rarity | common (e.g. box), rare (e.g. doll) |
| number of objects close by | 0, 1, 2 |
| object behind | wall, empty space, other objects |
| distance to the camera | 1.5m, 2.0m |
Real. We have collected a small but carefully structured RGBD dataset for evaluation using a Kinect v2, as shown in Figure 4. Our dataset contains both RGB images and the depth maps before and after the removal of objects. For each image, we carefully label a tight 2D object mask around the object to be removed. Our images are collected so as to investigate five factors that might affect the prediction error (Table 1): (1) the complexity of the object; (2) the rarity of the object in the training set; (3) the number of other non-removed objects close by with similar depth; (4) the object location; (5) the distance between the object and the camera. The first two factors focus on the object itself and the latter three focus on the spatial relationship between the object and the scene. This results in a structured set of test cases. Please find more detailed dataset configurations in Appendix C.
We implement our network using MatConvNet and train it on a single NVIDIA Titan X GPU. We use the weights of ResNet-50 pretrained on ImageNet to initialize the encoder, then train the whole network end-to-end. We use ADAM to update network parameters with a batch size of 32 and an initial learning rate of 0.01. The learning rate is halved every 5 epochs, and the whole training procedure takes around 20 epochs to converge. In our experiments, the term weights in Eq. 1 are set to fixed values.
Baselines. To demonstrate the effectiveness of our approach, we compare against three classes of natural baselines. (1) “Do nothing”: we simply ignore the mask and apply our approach to estimate image depth; in this case we predict image depth with the object still present. (2) Depth inpainting: we use the object mask to remove the object from our predicted depth map, then fill in the hole using three different methods. First, we apply Poisson editing to interpolate the missing depth from neighboring depth values. Second, we apply a vanilla auto-encoder, which takes as input the concatenation of the depth map and the object mask, and predicts the scene depth with the object removed; the encoder (decoder) consists of five convolution layers with the same bottleneck feature size as ours, trained with the same setting as our approach. Third, we compare to the state-of-the-art depth hole filling approach DepthComp by Atapour-Abarghouei and Breckon, which requires additional input of semantic segmentation maps; we use the outputs of SegNet trained on SUNRGBD to run the experiment. (3) Image inpainting: given the object mask, we inpaint the RGB image using the method of Iizuka et al., then predict depth from the inpainted image using our approach.
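With a zero guidance field, Poisson editing of the depth hole reduces to harmonic interpolation from the hole boundary. A minimal Jacobi-iteration sketch (a stand-in for a proper Poisson solver, not the baseline's production code):

```python
import numpy as np

def harmonic_fill(depth, mask, iters=500):
    """Fill masked depth by solving Laplace's equation with Jacobi
    iterations (Poisson editing with a zero guidance field). Only
    pixels inside the mask are updated."""
    d = depth.copy()
    d[mask] = depth[~mask].mean()       # rough initialization
    for _ in range(iters):
        avg = 0.25 * (np.roll(d, 1, 0) + np.roll(d, -1, 0) +
                      np.roll(d, 1, 1) + np.roll(d, -1, 1))
        d[mask] = avg[mask]
    return d

depth = np.full((20, 20), 3.0)
depth[:, 10:] = 4.0                     # a depth step in the background
mask = np.zeros((20, 20), bool)
mask[8:12, 8:12] = True
filled = harmonic_fill(depth, mask)
print(np.all(filled[~mask] == depth[~mask]))   # True: exterior untouched
```

Because the fill only interpolates boundary values, it blurs across the depth step inside the hole, which is exactly the failure mode discussed for the Poisson baseline in the qualitative results.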
For fair comparison, we use our network with no object mask to produce the initial depth map for all baselines. We evaluate the performance of our approach and all the baselines using the following standard single image depth estimation evaluation metrics:
rms: root mean squared error: sqrt( (1/N) Σ_p (D_p − D*_p)² )
mae: mean absolute error: (1/N) Σ_p |D_p − D*_p|
rel: mean absolute relative error: (1/N) Σ_p |D_p − D*_p| / D*_p
δ_i: percentage of pixels where the ratio (or its reciprocal) between the prediction and the label is within a threshold, 1.25, to the power i: max(D_p / D*_p, D*_p / D_p) < 1.25^i. We set i = 1, 2, 3.
Note that rms, mae, and rel are error metrics (the lower the better), while δ_i measures accuracy (the higher the better). For detailed analysis, we calculate average per-pixel performance using these metrics on the entire image (all pixels), the region inside the mask (interior), and the region outside the mask (exterior). Performance on the entire image shows the ability to predict image depth with an object removed; performance on the interior region demonstrates the ability to predict the scene depth behind the object; and performance on the exterior region demonstrates the ability to predict the depth of the non-removed area.
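These metrics can be computed, for instance, as below; the handling of valid pixels (all pixels assumed valid, ground truth assumed strictly positive) is a simplifying assumption.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-image depth metrics over flattened pixels.
    Assumes pred and gt are strictly positive."""
    err = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "rms": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "rel": float(np.mean(np.abs(err) / gt)),
        "delta1": float(np.mean(ratio < 1.25) * 100),
        "delta2": float(np.mean(ratio < 1.25 ** 2) * 100),
        "delta3": float(np.mean(ratio < 1.25 ** 3) * 100),
    }

pred = np.array([2.0, 2.5, 4.0])
gt = np.array([2.0, 2.0, 2.0])
m = depth_metrics(pred, gt)
print(m["mae"], m["delta1"])
```

The interior/exterior split is then just a matter of applying the function to `pred[mask], gt[mask]` and `pred[~mask], gt[~mask]` respectively.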
Ablation rows (each group of six columns reports rms, mae, rel, δ1, δ2, δ3, over all pixels, the interior, and the exterior, in that order):

|Ours w/o mask dropout| .542 | .364 | .162 | 75.0 | 93.5 | 97.8 | .569 | .407 | .133 | 80.2 | 95.3 | 99.1 | .540 | .363 | .162 | 75.1 | 93.7 | 97.8 |
|Ours w/o norm| .629 | .430 | .187 | 70.1 | 89.5 | 96.1 | .678 | .490 | .158 | 73.9 | 92.4 | 97.4 | .627 | .428 | .186 | 70.2 | 89.6 | 96.1 |
Ablation rows (columns grouped as rms, mae, rel, δ1, δ2, δ3, over all pixels, interior, exterior):

|Ours w/o mask dropout| .762 | .612 | .272 | 38.9 | 71.3 | 90.1 | .517 | .416 | .203 | 51.3 | 87.5 | 99.3 | .781 | .630 | .279 | 37.7 | 69.7 | 89.3 |
|Ours w/o norm| .455 | .364 | .188 | 66.7 | 93.6 | 99.3 | .393 | .310 | .160 | 68.8 | 96.0 | 99.8 | .460 | .369 | .191 | 66.5 | 93.4 | 99.2 |
5.1 Qualitative results
Depth with an object removed. Figure 5 shows our qualitative performance compared with other baselines on the NYUd v2 dataset. NYUd v2 has no ground truth depth with the object removed, so we can only compare qualitatively. We use the ground truth 2D segmentation in NYUd v2 as the input object mask. Our approach produces well-behaved depth behind the object and for the non-removed area, along with good normal estimates for the hidden geometry. Note that depth predictions by the inpainting baseline are mangled by inpainting errors. Poisson smoothing produces somewhat better estimates, but fails in the obvious way when one side of the background is closer than the other (first column). Figure 6 shows more qualitative results on our collected real dataset and the synthetic AI2-THOR dataset.
Depth with multiple objects removed. One important benefit of using an object mask as input is that we can remove any number of objects from the scene and predict the depth without them. Figure 7 demonstrates the ability of our network to estimate scene depth with different combinations of objects removed from the same scene. Our approach also produces consistent predictions for non-removed areas (e.g. layouts, counter) in the same scene.
5.2 Quantitative results
We show in Table 2 our quantitative comparison on the test set of the synthetic AI2-THOR dataset; Table 3 reports the performance on our collected real dataset. Poisson and DepthComp do not perturb depth outside the object mask region, so their exterior performance equals “Do nothing”; we report their exterior error metrics as *. Our method outperforms all baselines on most metrics. The image inpainting baseline does not work well; Poisson and DepthComp have trouble removing an object. The auto-encoder and our method produce comparably good interior depth (ours still slightly better), but the auto-encoder produces worse depth estimates in the exterior region. Note that for some measurements, depth prediction inside the object mask can be better than prediction over the whole image. We believe this is because objects rarely mask other clutter, so the masked scene tends to be walls, floors, etc., where depth has simpler statistics and is easier to predict.
Ablation study. Tables 2 and 3 show the performance gains from training with our smoothed ground truth normal loss (ours vs. ours w/o norm) and from the mask dropout data augmentation (ours vs. ours w/o mask dropout).
Factors that affect error. We investigate how properties of test data affect the error of the method, by regressing error against the attributes of the test images (Sec. 4.2) and looking for significant predictors. We use both individual terms and pairwise interactions, and apply an ANOVA. Please find detailed analysis in Appendix E.
Single image depth with the object. For images where no object is removed, our approach is able to predict scene depth that is of comparable quality to that of state-of-the-art single image depth estimation methods. Please find detailed evaluations in Appendix B.
We have introduced a new task – estimating the hidden geometry behind an object. Our method takes as input a single RGB image and an object mask, and predicts a depth map that describes the scene with the object removed. We show, both qualitatively and quantitatively, that our approach predicts depth behind objects better than strong baselines, and is flexible enough to remove multiple objects. Our approach can further be used for applications like object insertion and manipulation in a single RGB image.
This research is supported in part by ONR MURI grant N00014-16-1-2007.
-  A. Atapour-Abarghouei and T. P. Breckon. Depthcomp: real-time depth image completion based on prior semantic scene segmentation. 2017.
-  A. Atapour-Abarghouei and T. P. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2810, 2018.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
-  C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera. Filling-in by joint interpolation of vector fields and gray levels. IEEE transactions on image processing, 10(8):1200–1211, 2001.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28(3):24, 2009.
-  J. T. Barron and J. Malik. Intrinsic scene properties from a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 17–24, 2013.
-  M. Bertalmio, A. L. Bertozzi, and G. Sapiro. Navier-stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages I–I. IEEE, 2001.
-  A.-L. Chauve, P. Labatut, and J.-P. Pons. Robust piecewise-planar 3d reconstruction and completion from large-scale unstructured point data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1261–1268. IEEE, 2010.
-  W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738, 2016.
-  W. Chen, D. Xiang, and J. Deng. Surface normals in the wild. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, pages 22–29, 2017.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
-  M. Firman, O. Mac Aodha, S. Julier, and G. J. Brostow. Structured prediction of unobserved voxels from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5431–5440, 2016.
-  C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
-  J. Hays and A. A. Efros. Scene completion using millions of photographs. In ACM Transactions on Graphics (TOG), volume 26, page 4. ACM, 2007.
-  V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In European Conference on Computer Vision, pages 224–237. Springer, 2010.
-  J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf. Image completion using planar structure guidance. ACM Transactions on graphics (TOG), 33(4):129, 2014.
-  S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
-  K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE transactions on pattern analysis and machine intelligence, 36(11):2144–2158, 2014.
-  A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
-  Y. Kim, H. Jung, D. Min, and K. Sohn. Deep monocular depth estimation via integration of global and local predictions. IEEE Transactions on Image Processing, 27(8):4131–4144, 2018.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
-  E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.
-  Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6647–6655, 2017.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Fourth International Conference on 3D Vision (3DV), pages 239–248. IEEE, 2016.
-  Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
-  C. Liu, P. Kohli, and Y. Furukawa. Layered scene decomposition via the occlusion-crf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 165–173, 2016.
-  G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.
-  J. Liu, X. Gong, and J. Liu. Guided inpainting and filtering for kinect depth maps. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 2055–2058. IEEE, 2012.
-  M. Liu, X. He, and M. Salzmann. Building scene models by completing and hallucinating depth and semantics. In European Conference on Computer Vision, pages 258–274. Springer, 2016.
-  M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–723, 2014.
-  F. Mal and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
-  A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443(7):59–72, 2007.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
-  M. Pauly, N. J. Mitra, J. Giesen, M. H. Gross, and L. J. Guibas. Example-based 3d scan completion. In Symposium on Geometry Processing, number CONF, 2005.
-  P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on graphics (TOG), 22(3):313–318, 2003.
-  D. Ron, K. Duan, C. Ma, N. Xu, S. Wang, S. Hanumante, and D. Sagar. Monocular depth estimation via deep structured models with ordinal constraints. In 2018 International Conference on 3D Vision (3DV), pages 570–577. IEEE, 2018.
-  A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.
-  J. Shen and S.-C. S. Cheung. Layer depth denoising and completion for structured-light rgb-d cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1187–1194, 2013.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
-  S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
-  S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 190–198. IEEE, 2017.
-  L. Wang, H. Jin, R. Yang, and M. Gong. Stereoscopic inpainting: Joint color and depth completion from stereo images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
-  C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
-  Y. Zhang and T. Funkhouser. Deep depth completion of a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 175–185, 2018.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
Appendix A Evaluation of Surface Normal Computation
Because depth data collected by real-world sensors (NYUd v2) contains measurement noise, it is difficult to compute reliable surface normals directly from the depth map for training. Errors in the surface normal ground truth would substantially degrade the quality of the depth estimation. We therefore propose the smoothed surface normal ground truth computation of Sec. 3.2 in the main paper for use in the training procedure.
We demonstrate the efficacy of our surface normal computation, compared to other methods, on the synthetic AI2-THOR test set, which provides accurate ground truth depth and surface normals free of sensor error. We obtain the ground truth surface normals by computing first-order derivatives of the ground truth depth. We then simulate measurement error by adding random noise, modeled as a combination of white noise (0.001 m) and circular patches 5 pixels in diameter (0.01 m) added to the scene with probability 0.01. Table 4 compares our surface normal computation with the simple gradient method and the plane-fitting approach. “Accuracy” is defined as the average dot product between the computed surface normal and the ground truth (higher is better; ranges from -1 to 1). “Speed” reports the time (in seconds) per image. Note that our method and the gradient-based approach run on a single GPU (NVIDIA Titan X), while plane fitting runs on a single CPU (1.7 GHz, 8 cores). We observe that accuracy depends strongly on the noise type: if we add only the random circular patches, with probability 0.02, the resulting accuracies are 0.917, 0.831, and 0.881, respectively.
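For reference, the simple gradient baseline can be sketched as below. This is a minimal version that ignores camera intrinsics (it treats pixel coordinates as metric), and it is not the smoothed computation of Sec. 3.2; the function name and array sizes are hypothetical.

```python
import numpy as np

def normals_from_depth(depth):
    """Surface normals from a depth map via first-order derivatives.

    The normal at each pixel is proportional to (-dz/du, -dz/dv, 1),
    then normalized to unit length. Camera intrinsics are ignored.
    """
    # np.gradient returns derivatives along rows (v) then columns (u).
    dz_dv, dz_du = np.gradient(depth)
    n = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)  # unit vectors
    return n

# A fronto-parallel plane should yield normals pointing at the camera.
plane = np.full((8, 8), 2.0)
n = normals_from_depth(plane)
```

On the constant-depth plane above, every normal is (0, 0, 1), as expected for a surface facing the camera.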
In all, these experiments demonstrate that our surface normal computation is of high enough quality, and fast enough, to be incorporated into network training.
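The simulated noise model and the accuracy metric described above can be sketched as follows; the function names and the test arrays are made up for illustration, and the paper's exact implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_sensor_noise(depth, white_sigma=0.001, patch_diam=5,
                     patch_height=0.01, patch_prob=0.01):
    """Simulated sensor error: white noise (sigma 0.001 m) plus circular
    patches 5 pixels in diameter (0.01 m), placed with probability 0.01."""
    noisy = depth + rng.normal(0.0, white_sigma, depth.shape)
    h, w = depth.shape
    yy, xx = np.mgrid[:h, :w]
    r = patch_diam / 2.0
    # Each pixel becomes a patch centre with probability `patch_prob`.
    for cy, cx in zip(*np.where(rng.random(depth.shape) < patch_prob)):
        noisy[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] += patch_height
    return noisy

def normal_accuracy(pred, gt):
    """Average dot product between unit normal maps (1 best, -1 worst)."""
    return float(np.mean(np.sum(pred * gt, axis=-1)))

noisy = add_sensor_noise(np.full((32, 32), 2.0))
```

Identical normal maps score an accuracy of exactly 1 under this metric, which matches the upper bound quoted in the text.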
Table 4:

| | Ours | Gradient-based | Plane fitting |
| --- | --- | --- | --- |

Table 5 (excerpt):

| Method | RMS | rel | δ<1.25 | δ<1.25² | δ<1.25³ |
| --- | --- | --- | --- | --- | --- |
| **NYUd v2** | | | | | |
| Eigen & Fergus | 0.641 | 0.158 | 76.9 | 95.0 | 98.8 |
| Ma & Karaman | 0.514 | 0.143 | 81.0 | 95.9 | 98.9 |
| **Our collected evaluation dataset** | | | | | |
Appendix B Single Image Depth Estimation with the Object
While this is not our objective, we also evaluate our performance with no object removed, i.e. standard single image depth estimation. We directly test our trained network (which predicts depth with an object removed) on the NYUd v2 dataset, setting the input object mask to empty. Table 5 compares our method to a variety of state-of-the-art methods on the NYUd v2 dataset and on our collected evaluation dataset. Though our approach is trained for a different task, it shows on-par performance. The main comparison is with Laina, since we share a similar encoder-decoder structure. Our method outperforms Laina on our collected dataset but degrades on NYUd v2. We traced this to the different depth statistics of the synthetic and real datasets: as Figure 8 shows, depth maps in AI2-THOR range up to 5 meters, compared to a maximum depth of 10 meters in NYUd v2. This biases the depth predictor toward shallower depths. As a result, our network, trained on both NYUd v2 and AI2-THOR, makes larger (RMS) errors than Laina on NYUd v2 test depths beyond 5 meters. However, we cannot avoid using the synthetic dataset, since it is the only source in which we can manipulate the scene to obtain ground truth depth with an object removed.
In all, we conclude that our method, like others, performs very strongly on test sets where the distribution of depths matches that seen in training, but degrades when it encounters novel depths.
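The depth-range effect above can be checked by stratifying the RMS error by ground-truth depth; a minimal sketch, where the 5 m split mirrors the AI2-THOR depth ceiling and the helper name and sample values are hypothetical:

```python
import numpy as np

def rms_by_depth(pred, gt, split=5.0):
    """RMS error on pixels shallower vs. deeper than `split` metres."""
    err2 = (pred - gt) ** 2
    near, far = gt <= split, gt > split
    rms = lambda m: float(np.sqrt(err2[m].mean())) if m.any() else float("nan")
    return rms(near), rms(far)

# Toy example: errors grow on the deep (unseen-at-training) pixels.
gt = np.array([[2.0, 4.0], [6.0, 9.0]])
pred = gt + np.array([[0.1, -0.1], [0.5, -0.5]])
near_rms, far_rms = rms_by_depth(pred, gt)
```

In this toy case the near-range RMS is 0.1 m while the far-range RMS is 0.5 m, the pattern described for depths beyond the synthetic training range.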
Appendix C Our Collected Evaluation Dataset
We show in Table 6 the detailed configuration of our collected evaluation dataset. The configurations are based on the five factors we investigate as possible predictors of error. The top sub-table considers characteristics of the object itself: common or rare, simple or complex. The bottom sub-table, with three rows, considers variables of the object's spatial relationship to the scene: the number of objects close by, the non-removed objects behind, and the distance to the camera. We show typical samples for each of the five factors in the table.
Appendix D More Qualitative Results
We show in Figure 9 and Figure 10 more qualitative results on the NYUd v2 dataset. Our method removes objects very well. Note that our network is trained on NYUd v2 but is never trained to remove an object from this dataset (it learns to remove objects by training on AI2-THOR). Since NYUd v2 has no ground truth scene depth with an object removed, we can only show qualitative results compared to the other baselines.
Appendix E Analysis of Variance (ANOVA)
We investigate how properties of the test data affect the error of each method, by regressing error against the attributes of the test images from our collected dataset and looking for significant predictors. We use both individual terms and pairwise interactions, and apply an ANOVA. We consider the following five individual terms:

-  object rarity (common or rare)
-  object complexity (simple or complex)
-  number of objects close by
-  background (objects) behind
-  object's distance to the camera

The interaction terms are the 2-combinations of the five individual terms, resulting in (5 choose 2) = 10 terms.
We run this analysis on our approach and on the two baselines: image inpainting and Poisson inpainting.
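A regression of this shape can be sketched in a few lines; the design matrix has 1 + 5 + 10 = 16 columns (intercept, individual terms, pairwise interactions). The synthetic attributes and coefficients below are made up for illustration; the paper's analysis uses the collected test set.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def design_matrix(X):
    """Intercept + 5 individual terms + all (5 choose 2) = 10 interactions."""
    n, k = X.shape
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack([np.ones(n), X] + pairs)

def adjusted_r2(y, X):
    """Fit OLS on the full design and report adjusted R^2."""
    A = design_matrix(X)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
    n, p = A.shape
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

# Synthetic example: 200 test cases, 5 ordinal attributes in {0, 1, 2},
# error driven by one individual term plus one interaction.
X = rng.integers(0, 3, size=(200, 5)).astype(float)
y = 0.3 * X[:, 2] + 0.1 * X[:, 0] * X[:, 4] + rng.normal(0, 0.05, 200)
r2_adj = adjusted_r2(y, X)
```

With this design, a high adjusted R² (as in Sec. E.1) justifies reading hard cases off the regression coefficients; per-term F-tests as in the ANOVA would additionally require the residual degrees of freedom, which the sketch omits.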
E.1 Our method
For our method, only 3 of the 10 interaction terms achieve significance under the usual F-test (i.e. p < 0.05). The adjusted R² is 0.882, meaning the regression is quite good at predicting errors, so it is reasonable to infer hard cases from the regression coefficients. The significant cases are:
objects far from the camera with cluttered backgrounds (mild increase in error rate);
simple objects that are far from the camera (mild increase in error rate);
simple objects that are rare (mild increase in error rate).
Of the individual terms, rarity, objects behind, and distance to the camera have effects: common objects, cluttered or empty space behind, and objects far from the camera are each associated with an increase in error rate. It is odd that common objects should be associated with increased error, and odd that objects far from the camera should be. The effect of objects behind is easily understood: the cases are against a wall, on a cluttered background, or on an empty background, and it is relatively natural that predicting the depth to a wall an object is in contact with might be more accurate.
E.2 Image inpainting baseline
For the image inpainting baseline, again only 3 of the 10 interaction terms achieve significance under the usual F-test (i.e. p < 0.05). The adjusted R² is 0.633, meaning the regression is only moderately good at predicting errors. This is likely because the conditions we investigate have only a mild effect on whether inpainting is likely to succeed (image appearance around the object matters more). Significant effects are:
objects far from the camera with two other objects close by (mild decrease in error rate);
simple objects that are far from the camera (mild decrease in error rate);
rare objects that are far from the camera (mild decrease in error rate).
Of the individual terms, complexity, rarity, objects behind, and distance to the camera have effects: simple objects, common objects, cluttered or empty backgrounds, and objects far from the camera are each associated with an increase in error rate.
E.3 Poisson editing baseline
For the Poisson baseline, again only 3 of the 10 interaction terms achieve significance under the usual F-test (i.e. p < 0.05). The adjusted R² is 0.632, meaning the regression is only moderately good at predicting errors. This is likely because the conditions we investigate have only a mild effect on whether smoothing is likely to succeed (the pool of depths around the object matters more). Significant effects are:
objects far from the camera with two other objects close by (mild decrease in error rate);
simple objects that are far from the camera (mild decrease in error rate);
rare objects that are far from the camera (mild decrease in error rate).
Note that the above effects are the same as for the inpainting baseline. Of the individual terms, complexity, rarity, objects behind, and distance to the camera have effects: simple objects, common objects, cluttered or empty backgrounds, and objects far from the camera are each associated with an increase in error rate (again, the same as for the inpainting baseline).