Estimating depth from a single RGB image is an important research problem with a wide range of applications in robotics and AR [almagro2019, tateno2017cvpr, Shih3DP20]. Nevertheless, predicting accurate depth from monocular input is an inherently ill-posed problem. For the human perceptual system, depth perception is a simpler task, as we rely heavily on prior knowledge of the environment. Similarly, deep learning has recently proven to be particularly suited to such problems, as a network can also leverage visual priors when making a prediction [saxena2005, hoiem2005].
With the rise of deep learning and the increasing availability of large, suitable datasets [silberman2012, koch2019, dai2017scannet, cityscapes2016, Ranftl2020], depth prediction from single images has recently made a huge leap forward in terms of robustness and accuracy [wang2020, Miangoleh2021Boosting, yin2019virtualnormal, jiao2018deeper, fu2018dorn]. Yet, despite these improvements, current methods still often fall short of adequate quality for specific robotics applications, such as path planning or robotic interventions where robots need to operate in hazardous environments with low-albedo surfaces and clutter [almagro2019, koch2019, ramamon2020]. One of the most limiting factors is the poor quality around object edges and surfaces, which directly degrades 3D perception and can cause a robot to miss objects. The predicted depth maps are typically blurry around object boundaries due to the nature of 2D convolutions and bilinear upsampling. Since the kernel aggregates features across object boundaries, the estimated depth map commonly ends up being an undesired interpolation between foreground and background. As a result, the associated 3D point clouds cannot reflect the underlying 3D structures (see Fig. 1). In this work, our motivation is to capture object-based depth values more sharply and completely, while preserving the global consistency with the rest of the scene.
To circumvent the smeared boundary problem, i.e. to avoid undesired depth interpolation across different segments, we are interested in an operation that extracts features within a continuous object segment. To achieve this, we employ a novel convolution operation inspired by the sparse convolutions introduced by Uhrig et al. [uhrig2017]. Sparse convolutions are characterized by a mask, which defines the region in which the convolution operates. While sparse convolutions typically rely on a single mask throughout the entire image, in this work, the mask depends on the pixel location. Given a filter window, we define the mask as the feature region that belongs to the same segment as the central pixel of that window. In other words, only the pixels that belong to a certain object contribute to its feature extraction. We name our convolution operator Instance Convolution.
Using Instance Convolutions to learn the object depth values should make the depth values at the object edges sharper than with regular convolutions, i.e. prevent the interpolation problem at occlusion boundaries. Despite this advantage in terms of boundary sharpness, Instance Convolutions come with an obvious drawback. An architecture based solely on such an operation would not be able to capture object extent and global context. This inherent scale-distance ambiguity would thus make it impossible to obtain metric depth. Therefore, we propose an architecture that first extracts global features, so as to utilize scene priors, via a common backbone composed of regular convolutions. We then append a block composed of Instance Convolutions to rectify the features within an object segment, resulting in sharper depth across occlusion boundaries. Notice that we chose an optimization-based approach [achanta2012slic], producing superpixels, to obtain the corresponding segmentation, as a deep learning-driven method would simply shift the problem of clear boundaries towards the segmenter. In addition, our method does not require any semantic information, but rather only needs to understand which pixels belong to the same discontinuity-free object part.
Our contributions can be summarized as follows:
We propose a novel end-to-end method for depth estimation from monocular data, which explicitly enforces clear object boundaries by means of superpixels.
To this end, we propose a dynamic convolutional operator, Instance Convolution, which only aggregates features belonging to the same segment as the center pixel, with respect to the current kernel location.
Further, as we are required to properly propagate the correct segment information throughout the whole network, we additionally introduce the "center-pooling" operator, which keeps track of the center pixel's segment id.
We validate the usefulness of Instance Convolutions for edge-aware depth estimation on two commonly employed benchmarks, namely NYUv2 [silberman2012] and iBims [koch2019]. Thereby, we show that Instance Convolutions can consistently improve object boundaries regardless of the chosen backbone depth estimator.
II Related Work
Supervised monocular depth prediction
The first attempts to tackle monocular depth estimation were proposed by [saxena2005, saxena2009] via hand-engineered features and Markov Random Fields (MRF). Later, the advancements in deep learning established a new era for depth estimation, starting with Eigen et al. [eigen2014]. One of the main problems in learned depth regression occurs in the decoder part. Due to the successive convolution layers in neural networks, fine details of the input images are lost. A number of works approach this problem in different ways. Eigen & Fergus [eigen2015] introduced multi-scale networks to make depth predictions at multiple scales. Laina et al. [laina2016] build upon a ResNet architecture with improved up-sampling blocks to reduce information loss in the decoding phase. Xu et al. [xu2017] proposed an approach that combines deep learning with conditional random fields (CRF), where CRFs are used to fuse the multi-scale deep features.
More recently, a line of works pursued multitask learning approaches that predict semantic or instance labels [jiao2018deeper], depth edges, and normals [ramamonjisoa2019sharpnet, zhang2019pap, lee2019big] to improve depth prediction. Kendall et al. [kendall2017multi] investigated the effect of uncertainty estimation for estimating loss contributions in scene understanding tasks. Yin et al. [yin2019virtualnormal] estimated the 3D point cloud from the predicted depth map and used a local surface normal loss.
All of the above works aim to learn a globally consistent depth map, yet do not focus on fine local details, often resulting in blurred boundaries and deformed planar surfaces. Consistent with our work, Hu et al. [hu2018boundary] focus on accurate object boundaries through gradient- and normal-based losses. Ramamonjisoa et al. [ramamonjisoa2019sharpnet] aim to improve predicted depth boundaries by estimating normals and edges along with depth and establishing consensus between them. Several works apply bilateral filters to increase occlusion gaps [shih2020] or learn energy-based image-driven refinement focusing on edges [niklaus2019, ramamon2020].
Sparsity in convolutions has been investigated in several works [liu2015sparseconv, wen2016nips] aimed at improving the efficiency of neural networks by reducing the number of parameters, i.e. increasing sparsity. The Minkowski Engine was proposed as an efficient 3D spatio-temporal convolution built on sparse convolutions [choy2019minkowski]. In contrast to such works, Uhrig et al. used sparse meshes [uhrig2017] to improve structural understanding in the case of sparse inputs, e.g. depth map completion [zhao2021adaptive, Lee_2021_CVPR]. Some works used a similar convolution operator, called partial or gated convolution, for image editing and inpainting tasks [liu2018partialinpainting, yu2019gatedconv] to discard content-free regions. In our work, we are also interested in computing convolutions only on a subset of input pixels. Unlike these works, our masks do not define a random set or a normally distributed sparse set of pixels. Our masks change dynamically for each pixel position, to extract features within the same object segment.
The proposed Instance Convolution operation relies on the detection of meaningful object segments in a given scene. One alternative would be to identify all objects in a scene, either from annotated instance labels or via learned segmentation models, e.g. Mask R-CNN [he2017maskrcnn]. The former requires a large amount of annotation work for large datasets. The latter requires the objects in the corresponding dataset to match those of the pre-trained models and can additionally lead to inaccurate edges. To detect objects without pre-trained models and labeled data, in this work, we leverage over-segmentation methods. Among the available methods for over-segmentation [achanta2012slic, uziel2019bayesian, levinshtein2009turbopixels], in our experiments we mainly focus on simple linear iterative clustering (SLIC) [achanta2012slic] and Bayesian adaptive superpixel segmentation (BASS) [uziel2019bayesian] (see Fig. 3).
In this section, we present the problem statement and the individual components of the proposed method for boundary-aware monocular depth prediction (MDP).
III-A Depth Estimation Using Deep Learning
Monocular depth estimation has recently received a lot of attention in the literature, and several different methods have been proposed [wang2020, Miangoleh2021Boosting, Ranftl2020]. Interestingly, even very early methods noticed the performance degradation around occlusion boundaries, and various measures, such as skip connections [eigen2015, laina2016] or Conditional Random Fields [xu2017, liu2015], have been put in place to counteract the smearing effect. Nevertheless, despite those measures, the proposed methods still cannot capture the high frequencies of object discontinuities due to the inherent nature of 2D convolutions.
The classical convolution kernel simultaneously operates on all inputs within the kernel region, performing a weighted summation. Consequently, features originating from different object parts are simply fused, which in turn causes a blurring of the object boundaries, i.e. the edges that separate the object from the background in 3D. A corresponding example is shown in Fig. 2. Notice that this effect is more visible when viewing the associated 3D point cloud.
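The blurring effect described above can be illustrated with a toy example. The following is a minimal sketch, assuming a hypothetical 1D depth profile and a 3-tap averaging kernel standing in for a learned convolution kernel; neither is taken from the experiments in this work.

```python
import numpy as np

# Hypothetical 1D depth profile: foreground at 1 m, background at 3 m.
depth = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0])

# A 3-tap averaging kernel stands in for a learned convolution kernel.
kernel = np.ones(3) / 3.0

# 'valid' mode keeps only positions where the window lies fully inside the signal.
smoothed = np.convolve(depth, kernel, mode="valid")
print(smoothed)
```

The two windows that straddle the depth step produce values strictly between 1 m and 3 m. These intermediate depths belong to neither the foreground nor the background, and in 3D they appear as points floating between the two surfaces.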
III-B Instance Convolutions for Boundary-Aware Depth Estimation
To avoid aggregating features that belong to different image layers, we propose to leverage superpixels to guide the convolution operator. In particular, inspired by Sparsity Invariant Convolutions [uhrig2017], we propose Instance Convolutions, which apply the weighted summation only to pixels that belong to the same segment as the central pixel. Formally, this can be written as follows:
\[
f_{u,v}(x, s) = \frac{\sum_{i,j=-k}^{k} \mathbb{1}[s_{u+i,v+j} = s_{u,v}] \, x_{u+i,v+j} \, w_{i,j}}{\sum_{i,j=-k}^{k} \mathbb{1}[s_{u+i,v+j} = s_{u,v}] + \epsilon} + b
\]
where $x_{u,v}$ denotes the observed feature at pixel $(u,v)$, and $\mathbb{1}[\cdot]$ is the indicator function, which returns $1$ if the segment id $s_{u+i,v+j}$ equals the segment id of the central pixel $s_{u,v}$, and $0$ otherwise. The learnable convolution weight is denoted by $w$ with a kernel size of $(2k+1) \times (2k+1)$, and the bias by $b$. Finally, $\epsilon$ denotes a small constant, added to avoid division by zero for non-masked pixels. Notice that if all pixels belong to the same segment, this operation turns into a regular convolution, up to a constant normalization factor.
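The masked summation described above can be sketched in a few lines of NumPy. This is an illustrative single-channel version under stated assumptions, not the implementation used in this work: the function name `instance_conv2d` is hypothetical, segment ids are assumed to be non-negative integers, and the masked sum is assumed to be normalized by the number of same-segment pixels, following Sparsity Invariant Convolutions [uhrig2017].

```python
import numpy as np

def instance_conv2d(x, seg, weight, bias=0.0, eps=1e-8):
    """Single-channel Instance Convolution sketch (hypothetical helper).

    x:      (H, W) feature map
    seg:    (H, W) non-negative integer segment ids, e.g. superpixel labels
    weight: (K, K) kernel with odd K
    """
    k = weight.shape[0] // 2
    h, w = x.shape
    # Zero-pad the features; pad segment ids with -1 so padding never matches a segment.
    xp = np.pad(x, k, mode="constant", constant_values=0.0)
    sp = np.pad(seg, k, mode="constant", constant_values=-1)
    out = np.empty((h, w), dtype=float)
    for u in range(h):
        for v in range(w):
            win_x = xp[u:u + 2 * k + 1, v:v + 2 * k + 1]
            win_s = sp[u:u + 2 * k + 1, v:v + 2 * k + 1]
            # Mask is 1 where the window pixel shares the central pixel's segment.
            mask = (win_s == seg[u, v]).astype(float)
            # Normalization by the mask count is an assumption of this sketch.
            out[u, v] = (mask * win_x * weight).sum() / (mask.sum() + eps) + bias
    return out

# Toy depth step: segment 0 at 1 m (left), segment 1 at 3 m (right).
x = np.array([[1.0, 1.0, 3.0, 3.0]] * 4)
seg = np.array([[0, 0, 1, 1]] * 4)
out = instance_conv2d(x, seg, np.ones((3, 3)))
```

With an all-ones kernel, the output reproduces the sharp step of the input, because no window ever mixes values from the two segments. A regular averaging kernel would instead produce intermediate values in the two columns adjacent to the boundary.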
Since our architecture follows the standard encoder-decoder methodology (Section III-C), we have to propagate the segment information adequately through the network. However, as max pooling can lead to a loss of spatial information, we introduce the center-pooling operator, which at each downsampling operation simply forwards the segment id of the central pixel, thereby preserving the object boundaries. At upsampling, the original segmentation map (or its downsampled version from a previous layer) is used directly, as these maps are already available. For a detailed explanation of Instance Convolutions, see Fig. 4 (b).
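A minimal sketch of center-pooling is given below. It assumes an odd pooling window of size K with padding K // 2, so that the window producing output (i, j) is centered on input pixel (i * stride, j * stride); the function name `center_pool` and the window layout are assumptions made for illustration.

```python
import numpy as np

def center_pool(seg, stride=2):
    """Forward the segment id at the centre of each pooling window.

    Assumes an odd window of size K with padding K // 2, so the window that
    produces output (i, j) is centred on input pixel (i * stride, j * stride).
    """
    return seg[::stride, ::stride]

seg = np.array([
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 1],
])
pooled = center_pool(seg)
print(pooled)
```

Segment ids are categorical labels, so taking a maximum or an average over a window would yield an id that is unrelated to the pixel at the window center. Forwarding the central id keeps the downsampled map aligned with the downsampled features.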
A deep learning-based approach would simply transfer the problem of clear boundaries to the segmenter. Such methods can also never cope with all object classes in the wild. Moreover, our method does not require any semantic information, but rather only needs to understand which pixels belong to the same discontinuity-free object part. Hence, in this work, we instead rely on optimization-based approaches, i.e. SLIC [achanta2012slic] and BASS [uziel2019bayesian], to obtain the needed superpixels. These methods can provide not only object boundaries but also self-occlusions within objects (see Fig. 3). Notably, while SLIC requires the number of output segments to be defined in advance, BASS can find the optimal number of segments by itself, albeit at a higher computational cost.
III-C End-to-end Architecture
To summarize, we model our Instance Convolution such that the method is particularly suited for estimating depth at object boundaries. Nevertheless, as an output pixel never observes any feature outside of the segment it resides in, it is impossible for the model to predict metric depth due to the scale-distance ambiguity (i.e. a large object far away can have the same projection onto the image plane as a small object close by). Therefore, we harness a state-of-the-art MDP backbone to extract global information about the scene. We then feed the extracted features together with the obtained superpixels to our Instance Convolution-driven network to estimate the final edge-aware depth. Since the backbone as well as our Instance Convolution block are fully differentiable, we can train the whole model end-to-end. The proposed method can be combined with different depth predictors. In this paper, we use SharpNet [ramamonjisoa2019sharpnet], BTS [lee2019big], and VNL [yin2019virtualnormal] for feature extraction, to show the generalizability of Instance Convolutions.
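The scale-distance ambiguity mentioned above follows directly from the pinhole camera model, where the projected size of a fronto-parallel object is proportional to its metric size divided by its depth. The numbers below are hypothetical and serve only as a worked example.

```python
def projected_width_px(focal_px, width_m, depth_m):
    """Pinhole projection: image-plane width (pixels) of a fronto-parallel object."""
    return focal_px * width_m / depth_m

focal_px = 500.0  # hypothetical focal length in pixels

# A 0.5 m wide object at 1 m and a 2 m wide object at 4 m.
small_near = projected_width_px(focal_px, width_m=0.5, depth_m=1.0)
large_far = projected_width_px(focal_px, width_m=2.0, depth_m=4.0)
print(small_near, large_far)
```

Both objects cover the same number of pixels, so features restricted to a single segment cannot distinguish them. Resolving the ambiguity requires context from outside the segment, which is what the backbone features provide.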
III-D Training Loss
Our training objective is composed of three terms, namely a gradient loss, a normal loss, and a direct depth loss. The depth gradient loss penalizes the discrepancy between the horizontal ($\nabla_x$) and vertical ($\nabla_y$) gradients of the predicted depth $\hat{d}$ and those of the ground-truth depth $d$, where the gradients are calculated using the Sobel operator.
The surface normal at a pixel can be computed directly from the vertical and horizontal gradients of the depth map as
\[
n = \left[ -\nabla_x(d), \; -\nabla_y(d), \; 1 \right]^\top .
\]
We then define the normal loss as the angular distance between the per-pixel normals extracted from the ground-truth and the predicted depth maps.
The last loss is a per-pixel distance term between the predicted depth map and the ground-truth depth map.
The total training loss is the sum of the three losses.
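One possible instantiation of the three loss terms is sketched below in NumPy. Several choices are assumptions of this sketch rather than details confirmed by the text: the L1 norm for the gradient and depth terms, one minus cosine similarity as a proxy for the angular distance, edge-replication padding for the Sobel filter, and the helper names `filter3x3`, `normals_from_depth`, and `total_loss`.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """3x3 cross-correlation with edge-replication padding (padding is an assumption)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * p[i:i + h, j:j + w]
    return out

def normals_from_depth(depth):
    """Unit surface normals built from the Sobel gradients of a depth map."""
    gx, gy = filter3x3(depth, SOBEL_X), filter3x3(depth, SOBEL_Y)
    n = np.stack([-gx, -gy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def total_loss(pred, gt):
    # Gradient term: L1 difference of Sobel gradients (the norm is an assumption).
    l_grad = (np.abs(filter3x3(pred, SOBEL_X) - filter3x3(gt, SOBEL_X))
              + np.abs(filter3x3(pred, SOBEL_Y) - filter3x3(gt, SOBEL_Y))).mean()
    # Normal term: 1 - cosine similarity as a proxy for angular distance (assumption).
    cos = (normals_from_depth(pred) * normals_from_depth(gt)).sum(axis=-1)
    l_normal = (1.0 - cos).mean()
    # Depth term: mean absolute error (assumption).
    l_depth = np.abs(pred - gt).mean()
    # The three terms are summed with equal weights of 1.
    return l_grad + l_normal + l_depth

gt = np.outer(np.linspace(1.0, 2.0, 5), np.ones(6))  # toy tilted plane
loss_same = total_loss(gt, gt)
loss_shifted = total_loss(gt + 0.5, gt)
```

A constant depth offset leaves gradients and normals unchanged, so only the depth term responds to it. The gradient and normal terms, by contrast, react to changes in local shape, which is where boundary sharpness shows up.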
In this section, the experimental setup along with the proof-of-concept experiment is introduced.
IV-A Overfitting Experiment
To test the capacity and demonstrate the effectiveness of our method, we first conduct overfitting experiments, comparing classical convolution-based MDP methods with their Instance Convolution counterparts. In general, as classical convolution is limited by the pixel averaging within the kernel window, it cannot learn the sharp occlusion boundaries of an input image, which results in a structured noise effect on point clouds (see Fig. 1). In contrast, the Instance Convolution method can prevent this issue by considering only the pixels relevant to each object.
IV-B Evaluation Criteria
Standard metrics. We follow the standard MDP metrics introduced in [eigen2014] and report results with respect to the mean absolute relative error (absrel), the root mean squared error (rmse), and the accuracy under threshold ($\delta_i < 1.25^i$, $i \in \{1, 2, 3\}$).
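For reference, these standard metrics can be computed as in the following sketch. The function name `standard_metrics` and the convention of masking out pixels with zero ground-truth depth are assumptions made for illustration; the thresholds $1.25^i$ follow the common convention of [eigen2014].

```python
import numpy as np

def standard_metrics(pred, gt):
    """Return absrel, rmse and the three threshold accuracies over valid pixels."""
    valid = gt > 0  # assumption: zero marks missing ground-truth depth
    pred, gt = pred[valid], gt[valid]
    absrel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    # Symmetric ratio: a pixel counts as accurate if it is within the threshold either way.
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]
    return absrel, rmse, deltas

# Hypothetical values; the last pixel has no ground truth and is ignored.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.0, 2.2, 6.0, 5.0])
absrel, rmse, deltas = standard_metrics(pred, gt)
```

Lower is better for absrel and rmse, while higher is better for the threshold accuracies.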
Occlusion boundaries. In addition, Koch et al. [koch2019] proposed another set of metrics focusing on occlusion boundaries (DBE, DDE) and planarity (PE). DBE measures the accuracy ($\varepsilon^{\text{acc}}_{\text{DBE}}$) and completeness ($\varepsilon^{\text{comp}}_{\text{DBE}}$) of occlusion boundaries by comparing the edges of the predicted depth map with an annotated map of occlusion boundaries, calculating the Truncated Chamfer Distance (TCD) according to
\[
\varepsilon^{\text{acc}}_{\text{DBE}} = \frac{1}{\sum_{i,j} y^{\text{bin}}_{i,j}} \sum_{i,j} e^{*}_{i,j} \, y^{\text{bin}}_{i,j}
\]
where $y^{\text{bin}}$ is the binary edge map of the predicted depth and $e^{*}_{i,j}$ is the distance from pixel $(i,j)$ to the nearest pixel of the ground-truth edge map. If this distance is greater than 10 pixels, $e^{*}_{i,j}$ is set to zero to neglect irrelevant pixels. The completeness is obtained analogously, with the roles of the predicted and ground-truth edges exchanged. Furthermore, while PE denotes the surface normal error on planar region maps (provided with iBims), computed as 3D distance and angular difference, DDE is the directed depth error, which assesses depth behind and in front of planar regions. In this work, we focus on occlusion boundary quality and therefore mainly consider DBE along with the standard metrics. We also state the results for PE and DDE for completeness.
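The truncated chamfer distance can be sketched as follows. This is a brute-force illustration, not the reference implementation of [koch2019]: the function name `truncated_chamfer` is hypothetical, edge maps are assumed to be given as binary arrays, and returning zero for empty edge maps is a convention chosen here.

```python
import numpy as np

def truncated_chamfer(edges_a, edges_b, max_dist=10.0):
    """Mean distance from each edge pixel in edges_a to its nearest edge pixel in edges_b.

    Distances above max_dist are set to zero, i.e. those pixels are treated as irrelevant.
    """
    pts_a = np.argwhere(edges_a)
    pts_b = np.argwhere(edges_b)
    if len(pts_a) == 0 or len(pts_b) == 0:
        return 0.0  # convention of this sketch for empty edge maps
    # Brute-force pairwise Euclidean distances; fine for a sketch, slow for full images.
    diff = pts_a[:, None, :] - pts_b[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    nearest[nearest > max_dist] = 0.0
    return float(nearest.mean())

# Toy example: a predicted vertical edge two pixels to the right of the annotated one.
gt_edges = np.zeros((20, 20), dtype=bool)
gt_edges[:, 5] = True
pred_edges = np.zeros((20, 20), dtype=bool)
pred_edges[:, 7] = True

dbe_acc = truncated_chamfer(pred_edges, gt_edges)   # predicted edges against ground truth
dbe_comp = truncated_chamfer(gt_edges, pred_edges)  # ground truth against predicted edges
```

In practice, a Euclidean distance transform of the edge map would replace the pairwise distance computation, which scales poorly with the number of edge pixels.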
NYU v2 Dataset. NYU v2 consists of images collected in real indoor environments [silberman2012]. Depth values are captured with a Microsoft Kinect camera. The raw dataset of RGB-depth pairs (approximately 120K images) has no semantic labels. The authors created a smaller split with semantic labels and instance labels, along with refined depths and normals. In our experiments, we refer to this smaller split, which contains 1449 images in total, divided into 795 for training and 654 for testing.
NYU v2 - OB Dataset. Occlusion boundary annotations on the NYU v2 test data were released by Ramamonjisoa et al. [ramamon2020] for evaluation purposes. In this work, we use this dataset to further evaluate the occlusion boundaries of our depth predictions.
iBims Dataset. This dataset is presented as an evaluation split along with novel metrics on occlusion boundaries and planarity scores. The authors provide rich annotations of dense depth maps from different scenes, with occlusion boundaries and planar regions. iBims contains around 100 images for evaluation only [koch2019].
|Method|absrel|rmse|$\delta_1$|$\delta_2$|$\delta_3$|DBE acc|DBE comp|
|with Instance Conv.|0.124|0.456|0.847|0.971|0.993|1.961|6.489|
|with Instance Conv.|0.117|0.425|0.863|0.970|0.991|1.780|6.059|
|with Instance Conv.|0.121|0.467|0.848|0.964|0.993|1.817|6.197|
IV-D Comparison to State-of-the-Art
In Table I, we compare our results from NYU v2 with three state-of-the-art approaches, namely SharpNet [ramamonjisoa2019sharpnet], VNL [yin2019virtualnormal] and BTS [lee2019big]. Thereby, the proposed architecture (Fig. 4) uses these pre-trained models for latent depth feature extraction and applies Instance Convolution based blocks.
and decrease it by every 10 epochs. The loss terms in Eq. 6 have equal weights of 1. We use SLIC [achanta2012slic] to obtain superpixels with 64 segments and set sigma to 1. For SharpNet [ramamonjisoa2019sharpnet], we train each model with a batch size of 4. For BTS [lee2019big], we set the batch size to 3. Our proposed model employs 3 layers of Instance Convolutions with a gradually decreasing number of feature channels. The feature map resolution remains constant with a prediction kernel of size .
We clearly outperform the original methods with respect to the occlusion boundary metrics, i.e. DBE accuracy and completeness. In addition, for the classical metrics (absrel and rmse), we report results comparable to the baselines. The qualitative results in Fig. 5 further support these findings: the Instance Convolution-based predictions of each model have sharper occlusion boundaries in the resulting depth maps.
In Table II, we further report our results on the iBims evaluation dataset in order to assess the generalizability of our method. Note that this dataset is used only for inference (i.e. no training), to measure how well the models predict depth on unseen data. Here, our model again improves on the counterpart depth models in terms of DBE. The model with the VNL backbone obtains state-of-the-art results for the DBE accuracy and completeness metrics. As for SharpNet, our model has better results than the original model.
Notably, on iBims, the model with the best absrel value does not have the best DBE score (Li et al. [li2017] 0.22 absrel, 3.90 DBE vs. Liu et al. [liu2015] 0.30 absrel, 2.42 DBE). This conceptually agrees with the fact that absrel averages out per-pixel distances, while DBE calculates 3D distances between the points lying on occlusion gaps. Further, BTS achieves an absrel error of 0.22, sharing the state of the art on iBims with Li et al. [li2017], while outperforming them on DBE. The reason for the DBE degradation on BTS could lie in the atrous convolutions used in the backbone, which may lose edge information and thus generalization ability.
IV-E Ablation Studies
Table III contains the results for different parameters and configurations. For all experiments, the SharpNet model is used as backbone, with Instance Convolutions (IC) or Standard Convolutions (SC).
Superpixel information. We trained a model with SC, but provided the superpixel segmentation map as an extra input to each convolutional layer. DBE scores improved (SC 64), but the results are worse than with IC.
Number of segments in SLIC. We ablate different numbers of superpixel segments. As expected, increasing the number of segments improves DBE accuracy, yet induces a small loss in absrel. Notice that this is also the case for most state-of-the-art works. The best performing works for edges are often worse on absrel, which could be caused by imperfect annotations.
Over-segmentation with BASS. We also qualitatively evaluated our method using BASS [uziel2019bayesian] to extract superpixels. As shown in Fig. 3, BASS is able to retrieve more detailed segments from the image. However, it also detects overly noisy edges due to the excessive number of segments (400-500), which increases the model complexity and makes learning more difficult.
Instance masks. To compare the quality of instance mask prediction with that of unsupervised segmentation, we ablated our method with the state-of-the-art instance mask prediction method PointRend [kirillov2020pointrend]. Both the absrel and the DBE results were poorer than the baseline, which supports the effectiveness of the over-segmentation method, most likely because it detects self-occlusions within the images.
Runtime analysis. Full inference times are given in Table III under frames per second (FPS). Each runtime is an average over 1000 inferences. It can be seen that IC does not excessively alter the FPS (compared to both the original SharpNet and SC). As PointRend and BASS rely on external neural networks, we do not consider them in the comparisons.
In this work, we introduce a novel depth estimation method, which is particularly tailored towards tackling the problem of depth smoothing at object discontinuities. To this end, we propose a new convolutional operator, which avoids feature aggregation across discontinuities by means of superpixels. Our exhaustive evaluation on NYU v2 as well as iBims demonstrates that the proposed method is indeed capable of enhancing depth prediction around edges, while almost completely maintaining the quality in the remaining regions. In the future, we want to explore how Instance Convolutions can be incorporated into other domains, such as semantic segmentation, to similarly improve sharpness.