Depth measurement is a critical task in various applications, including robotics, augmented reality, and self-driving vehicles. It measures the distance from all or a part of the pixels in the imaging device to target objects using active or passive sensors. Such sensors are costly to equip and must operate continuously, which limits their use. Monocular depth estimation estimates the depth of pixels in a given 2D image without additional measurement. It facilitates the understanding of 3D scene geometry from a captured image, which closes the dimension gap between the physical world and an image.
Because of its importance and cost benefits, many studies [stereo, sfm1, 3d, sfm2, 10.1145/2601097.2601165] have improved depth estimation accuracy, temporal consistency, and depth ranges. Owing to their success, convolutional neural networks have also been adapted to monocular depth estimation and have produced great improvements.
Many existing monocular depth estimation methods train their networks with supervised depth labels computed from synthetic data or estimated from depth sensors [Liu01, Eigen01, Li01, Laina01]. Although such methods have provided significant improvements in depth estimation, they still raise multiple concerns: the high cost of obtaining per-pixel depth labels, the limited amount of available ground-truth depth data, the restricted depth range of sampled data, and the noticeable noise in the depth values. To avoid these shortcomings, self-supervised training methods have recently been proposed.
Notably, the SfM-Learner [Zhou01] method utilizes sets of consecutive frames in video sequences to jointly train depth and pose networks. It demonstrates comparable performance to extant supervised methods; however, recent works [dfnet, CC, Godard02, Bian01] based on SfM-Learner mostly rely on photometric loss [ssim] and smoothness constraints; hence, they suffer from limited supervision in weakly textured regions. Furthermore, moving objects and uncertainty in the pose network destabilize training, leading to incorrect depth values, especially on object boundaries (see Fig. 1).
Several recent methods have attempted to overcome this weakness by employing cross-domain knowledge learning, including leveraging scene semantics to improve monocular depth predictions [Klingner2020SelfSupervisedMD, lee2021learning, wild, casser2018depth]. They remove dynamic objects or explicitly model the object motion from the semantic instances to incorporate them into the scene geometry. In addition, a regularization of the depth smoothness within corresponding semantic objects enforces consistency between depth and semantic predictions [Chen01, Ramirez01, zhu2020edge].
In this study, we aim to improve self-supervised monocular depth estimation via the implicit use of semantic segmentation. We do not explicitly identify moving objects or regularize depth values in accordance with the semantic labels. Instead, we focus on representation enhancement, optimizing the depth network in the representation spaces, to produce semantically consistent intermediate depth representations.
Inspired by the recent use of deep metric learning [Wang_2017_ICCV, song2017deep, MLarticle], we suggest a novel semantics-guided triplet loss to refine depth representations according to implicit semantic guidance. Here, our goal is to take advantage of local geometric information from the scene semantics. For example, the adjacent pixels within each object have similar depth values, whereas those that cross semantic boundaries may have large differences. Combined with a simple but effective patch-based sampling strategy, our metric-learning approach exploits the semantics-guided local geometry information to optimize pixel representations near the object boundary, thereby yielding improved depth predictions.
We also design a cross-task attention module that refines depth features to be more semantically consistent. It computes the similarity between the reference and target features through multiple representation subspaces and effectively utilizes the cross-modal interactions among the heterogeneous representations. As a result, we quantify the semantic awareness of depth features as a form of attention and exploit it to produce better depth predictions.
Our contributions are summarized as follows. First, we present a novel training method that extracts semantics-guided local geometry with patch-based sampling and utilizes it to refine depth features in a metric-learning formulation. Second, we propose a new cross-task feature fusion architecture that fully utilizes the implicit representations of semantics for learning depth features. Finally, we comprehensively evaluate the performances of these two methods using the KITTI Eigen split and demonstrate that our method outperforms recent state-of-the-art self-supervised monocular depth prediction works in every metric.
2 Related Work
2.1 Depth Estimation with Neural Network
The recent success of neural networks has stimulated significant improvements to monocular depth estimation as a supervised regression method [Liu01, Eigen01, Laina01]. Recently, unsupervised training methods have been actively investigated. [Godard01] used predicted disparity to synthesize a virtual image and minimized its photometric loss for training. [Zhou01] trained the depth network jointly with an additional pose network, requiring only monocular sequences. Many subsequent works have built on these approaches [Godard02, Bian01, Xu2019RegionDN]. Others have made further improvements along multiple lines, such as regularizing consistency with optical flow [Tosi2020DistilledSF, dfnet, CC, Zhao2020TowardsBG] or functional geometric constraints between feature maps [Spencer2020DeFeatNetGM, Shu2020FeaturemetricLF].
Several recent works have proposed self-supervised depth prediction with semantics. They have enforced cross-task consistency and smoothness [Chen01, Ramirez01, zhu2020edge], removed dynamic objects [Klingner2020SelfSupervisedMD], or explicitly modeled object motions [lee2021learning, wild, casser2018depth]. [Guizilini2020SemanticallyGuidedRL] targeted semantics-aware representations for depth predictions, enabling them via knowledge transfer from a fixed teacher segmentation network with pixel-adaptive convolution [Su2019PixelAdaptiveCN]. In contrast, we design a multi-task network with cross-task multi-embedding attention and a semantics-guided triplet loss to produce semantics-aware representations.
2.2 Neural Attention Network
[attention, bert] designed a self-attention scheme that captures long-range dependencies to resolve the locality of recurrent operations. They proposed multi-head attention for utilizing information from different representation subspaces. Recently, cross-attention schemes have been utilized to extract features across heterogeneous representations, such as image, speech, and text [hou2019cross, Wei_2020_CVPR, ZhaoNLJCM20b]. Additionally, for self-supervised depth estimation, [Johnston2020SelfSupervisedMT, Zhou2019UnsupervisedHD] applied self-attention to capture the global context for estimating depth and combining multi-scale features from dual networks. In this study, inspired by the use of multi-head attention and cross attention, we propose a novel method of judiciously utilizing cross-task features across depth and segmentation.
2.3 Multi-task Architecture
The combination of features from multiple tasks has been widely used in recent multi-task architectures. [Xu2018PADNetMG, mtinet, ECCV18] applied a convolutional layer to extract local information from the reference task feature for multimodal distillation. [Zhou2020PatternStructureDF, zhang2019patternaffinitive, Jiao2019GeometryAwareDF, choi2020safenet] adopted affinity-guided message passing to propagate the relationship of spatially different features within the reference task to the target one. Instead, we propose a cross-task attention to fully utilize cross-modal interactions between geometry and semantics.
2.4 Deep Metric Learning
Deep metric learning [Wang_2017_ICCV, song2017deep, MLarticle] has been widely applied in various fields, such as face recognition [Schroff_2015, Hu_2014_CVPR] and image ranking [wang2014learning, ng2020solar, netvlad]. Inspired by these recent successes, we propose a semantics-guided triplet loss to refine feature representations for improving depth predictions by exploiting implicit geometry from semantic supervision.
Here, we review our baseline approach, Monodepth2 [Godard02], and present our current methodology in the following subsections.
3.1 Depth Estimation and Semantic Segmentation
3.1.1 Self-supervised Monocular Depth Estimation
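Given consecutive RGB images, $I_t$ and $I_{t'}$, one can predict $D_t$, the depth of every pixel in $I_t$, and compute a six degree-of-freedom relative pose, $T_{t \to t'}$, using a pose network. With known camera intrinsics, $K$, we can derive the projected pixel coordinates and use them to synthesize $I_{t' \to t}$ from $I_{t'}$ as:
\[ p_{t'} \sim K \, T_{t \to t'} \, D_t(p_t) \, K^{-1} p_t, \qquad I_{t' \to t}(p_t) = I_{t'} \langle p_{t'} \rangle, \tag{1} \]
where $p_t$ is the homogeneous coordinates of a pixel in $I_t$, and $p_{t'}$ is the transformed coordinates of $p_t$ by $T_{t \to t'}$. $\langle \cdot \rangle$ is a sub-differentiable bilinear sampler [Jaderberg2015SpatialTN] that obtains nearby pixels at $p_{t'}$ in $I_{t'}$ and assigns the linearly interpolated pixel at $p_t$ in $I_{t' \to t}$. Ideally, $I_t$ and $I_{t' \to t}$ should be aligned if both depth and pose networks are optimally trained. These two networks are jointly optimized to minimize the discrepancy between $I_t$ and $I_{t' \to t}$. We utilize the structural similarity index measure (SSIM) [ssim] combined with L1 loss as a photometric loss, $\mathcal{L}_{ph}$ [Godard01]:
\[ \mathcal{L}_{ph}(I_t, I_{t' \to t}) = \frac{\alpha}{2} \left( 1 - \mathrm{SSIM}(I_t, I_{t' \to t}) \right) + (1 - \alpha) \, \| I_t - I_{t' \to t} \|_1. \tag{2} \]
We compute $\mathcal{L}_{ph}$ for the two frame pairs, $(I_t, I_{t-1})$ and $(I_t, I_{t+1})$, to deal effectively with occlusions. We apply the minimum reprojection [Godard02], which selects, for each pixel, the smaller loss between the two reference frames, and we apply an auto-mask [Godard02]. The following edge-aware smoothness loss [Godard01], $\mathcal{L}_{sm}$, is also added:
\[ \mathcal{L}_{sm} = | \partial_x d_t^* | \, e^{-| \partial_x I_t |} + | \partial_y d_t^* | \, e^{-| \partial_y I_t |}, \tag{3} \]
where $d_t^*$ is the mean-normalized inverse depth.
The loss function of our baseline is obtained as follows:
\[ \mathcal{L}_{base} = \mathcal{L}_{ph} + \lambda_{sm} \mathcal{L}_{sm}, \tag{4} \]
where $\lambda_{sm}$ controls the relative strength of the smoothness factor.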
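The view-synthesis step described above can be illustrated for a single pixel. The following is a minimal sketch, assuming a pinhole camera and a relative pose given as a rotation matrix plus a translation vector; the function names, the intrinsics, and the numeric values are hypothetical and chosen only for illustration. In a real pipeline the resulting coordinates would feed a bilinear sampler over the whole image.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def reproject_pixel(u, v, depth, k, k_inv, rotation, translation):
    """Map pixel (u, v) of the target frame into the source frame.

    Computes p' ~ K (R * depth * K^-1 p + t), i.e. the rigid-scene warp
    used for view synthesis. All argument names are illustrative.
    """
    ray = mat_vec(k_inv, [u, v, 1.0])            # back-project to a viewing ray
    point = [depth * c for c in ray]             # 3D point in the target camera
    moved = mat_vec(rotation, point)             # apply the relative rotation
    moved = [moved[i] + translation[i] for i in range(3)]
    proj = mat_vec(k, moved)                     # project with the intrinsics
    return proj[0] / proj[2], proj[1] / proj[2]  # dehomogenize


# Hypothetical intrinsics: focal length 100 px, principal point (64, 32).
fx = fy = 100.0
cx, cy = 64.0, 32.0
K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
K_INV = [[1 / fx, 0.0, -cx / fx], [0.0, 1 / fy, -cy / fy], [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A point at 10 m depth on the optical axis, with a 0.5 m translation along x.
u_src, v_src = reproject_pixel(64.0, 32.0, 10.0, K, K_INV, IDENTITY, [0.5, 0.0, 0.0])
```

With these numbers the pixel shifts horizontally by fx * 0.5 / 10 = 5 pixels, which shows how the predicted depth and pose jointly determine where the photometric loss samples the source image.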
3.1.2 Supervised Semantic Segmentation
A typical network model for semantic segmentation has an encoder-decoder architecture [ronneberger2015unet] for extracting features and upsampling them for dense predictions. This structure is similar to our baseline depth network [Godard02], wherein basic features are extracted first prior to being fed into the decoder. Therefore, we adopt a shared-encoder architecture to reduce computations and benefit from both tasks.
In our proposed method, we train semantic segmentation with pseudo-labels generated by an off-the-shelf segmentation model [Zhu01]. We do not require per-image segmentation ground truth in the training dataset; thus, the method is more practically applicable. We use the cross-entropy loss, $\mathcal{L}_{CE}$, for training, and the training loss adds $\lambda_{CE} \mathcal{L}_{CE}$ to the baseline loss (Eqn. 4), where $\lambda_{CE}$ is a control parameter.
3.2 Semantics-guided Triplet Loss
Based on the local geometric relation from scene semantics, adjacent pixels within each object instance have similar depth values, whereas those across semantic boundaries may have large depth differences. Thus, we apply this intuition through a representation learning problem inspired by the recent usage of deep metric learning [Wang_2017_ICCV, MLarticle]. We first separate pixels of the local patch on the semantic label into triplets (i.e., anchor, positive, and negative), and we then divide features from the layer of the depth decoder () in accordance with the corresponding location of those triplets. We aim to optimize the distance among these triplets, following the intuition described above. However, we do not directly optimize the depth value itself. Our key idea is that the distance should be defined and optimized in the representation space. Hence, the depth decoder can produce more discriminative features on the boundary regions so that the output depth map becomes more aligned with the semantic boundaries.
3.2.1 Patch-based Candidate Sampling
We first divide the semantic label into image patches of size $k \times k$ with a stride of one. For each patch, we select its center as the anchor pixel and the pixels that have the same class as the anchor as positive pixels. The negative pixels are those whose class differs from that of the anchor. Subsequently, we define $\mathcal{P}_i^+$ and $\mathcal{P}_i^-$, the sets of positive and negative pixels in the local patch $\mathcal{P}_i$, of which the spatial location of the anchor is $i$. We use $|\mathcal{P}_i^+|$ and $|\mathcal{P}_i^-|$ to determine whether $\mathcal{P}_i$ intersects the semantic borders. For example, $|\mathcal{P}_i^-| = 0$ means that $\mathcal{P}_i$ is located inside a specific object and does not cross the borders. On the other hand, if $|\mathcal{P}_i^+|$ and $|\mathcal{P}_i^-|$ are both larger than zero, $\mathcal{P}_i$ intersects the boundaries across objects.
Additionally, the semantic labels may not be accurate or consistent because they are predictions of pre-trained segmentation networks. To reduce misclassification caused by these imperfect labels, we set a threshold, $T$, and determine that $\mathcal{P}_i$ intersects the boundaries only when $|\mathcal{P}_i^+|$ and $|\mathcal{P}_i^-|$ are both larger than $T$.
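The sampling rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the toy label map, and the threshold value used in the example are hypothetical, and excluding the anchor from the positive set and skipping out-of-image positions are assumptions made here.

```python
def sample_patch(labels, row, col, patch_size=5, threshold=0):
    """Split the patch centred on (row, col) into positive and negative pixels.

    labels is a 2D list of class ids. Returns (positives, negatives, crosses),
    where crosses is True only when both sets are larger than the threshold.
    """
    half = patch_size // 2
    anchor_class = labels[row][col]
    positives, negatives = [], []
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            if (r, c) == (row, col):
                continue  # the anchor itself is not a candidate (assumption)
            if not (0 <= r < len(labels) and 0 <= c < len(labels[0])):
                continue  # ignore positions outside the image (assumption)
            if labels[r][c] == anchor_class:
                positives.append((r, c))
            else:
                negatives.append((r, c))
    crosses = len(positives) > threshold and len(negatives) > threshold
    return positives, negatives, crosses


# Toy 5x8 label map: the five left columns are class 0, the rest class 1.
LABELS = [[0] * 5 + [1] * 3 for _ in range(5)]

inside = sample_patch(LABELS, 2, 2, patch_size=5, threshold=4)
border = sample_patch(LABELS, 2, 4, patch_size=5, threshold=4)
```

Here the patch around (2, 2) lies entirely inside one class and is rejected, whereas the patch around (2, 4) contains enough pixels of both classes to be kept as a boundary patch.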
3.2.2 Triplet Margin Loss
We group the features in each patch of the depth feature map into three classes (i.e., anchor, positive, and negative) following the corresponding pixel locations in the semantic image patch. We define the positive distance, $d^+(i)$, and the negative distance, $d^-(i)$, as the mean Euclidean distance between the L2-normalized anchor feature and the L2-normalized positive and negative features, respectively.
We aim to reduce the distance between the anchor and positive features, and increase the distance between the anchor and negative features. However, naively maximizing $d^-(i)$ as far as possible does not lead to our desired outcome because a semantic border does not always guarantee that the depths of two separate objects differ by a large amount. Instead, we adopt the triplet loss [triplet, wang2014learning] with a margin, $\max(0, d^+(i) - d^-(i) + m)$, so that the distance is no longer optimized once the negative distance exceeds the positive distance by more than a specific margin, $m$, which is a hyper-parameter.
The semantics-guided triplet loss, $\mathcal{L}_{SGT}$, is the average of this per-patch loss, containing only the patches $\mathcal{P}_i$ that satisfy the condition described in Sec. 3.2.1.
We sum $\mathcal{L}_{SGT}$ over depth features from multiple layers and add the sum, multiplied by a control parameter $\lambda_{SGT}$, to the total loss.
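The margin-based loss for a single patch can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy feature vectors are hypothetical, the plain (non-squared) Euclidean distance follows the wording of the text, and the default margin of 0.3 is the value reported in the training details.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_distance(anchor, others):
    """Mean Euclidean distance between the anchor and each vector in others."""
    return sum(math.dist(anchor, o) for o in others) / len(others)


def triplet_margin_loss(anchor, positives, negatives, margin=0.3):
    """Hinge loss that is zero once d_neg exceeds d_pos by at least the margin."""
    a = l2_normalize(anchor)
    d_pos = mean_distance(a, [l2_normalize(p) for p in positives])
    d_neg = mean_distance(a, [l2_normalize(n) for n in negatives])
    return max(0.0, d_pos - d_neg + margin)


def semantics_guided_loss(patches, margin=0.3):
    """Average the per-patch loss over the boundary patches passed in."""
    losses = [triplet_margin_loss(a, p, n, margin) for a, p, n in patches]
    return sum(losses) / len(losses)


# Negative feature already far from the anchor: the hinge is inactive.
loss_far = triplet_margin_loss([1.0, 0.0], [[2.0, 0.0]], [[0.0, 3.0]])
# Negative feature almost identical to the anchor: the loss pushes it away.
loss_near = triplet_margin_loss([1.0, 0.0], [[2.0, 0.0]], [[1.0, 0.1]])
```

The first case shows why the margin matters: once the negative features are separated well enough, the patch contributes no gradient, so objects on either side of a semantic border are not forced arbitrarily far apart.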
3.3 Cross-task Multi-embedding Attention (CMA) Module
We propose a CMA module to produce semantics-aware depth features through the representation subspaces and utilize them to refine depth predictions. As illustrated in Fig. 2, the CMA modules are located in the middle of each decoder layer and utilize the information from the other decoder. A single CMA module has uni-directional data flow, e.g., a CMA module refines the target feature with the reference feature. We use two CMA modules simultaneously to enable bidirectional feature enhancement, where depth (segmentation) becomes the target (reference) in one CMA module while their roles change in the other. In the following paragraphs, we only describe a single case where the depth feature is the target for ease of explanation.
In our model, each decoder comprises five blocks () and the spatial resolution of the feature map is doubled for each. The depth (segmentation) decoder generates a feature map, (), which has a spatial resolution of , where and are the height and width of the input image, respectively. The CMA modules can be attached to any of the five candidates.
The CMA module performs a pixel-wise operation on the two feature maps, and , through several operations. It first computes the semantic awareness of the depth features as a pixel-wise attention score through cross-task similarity (Sec. 3.3.1). We then extend this computation with multiple linear projections so that the similarity can be computed from different representation subspaces (Sec. 3.3.2). This enables selective extraction of depth features from multiple embeddings upon the corresponding semantic awareness, maximizing the utilization of cross-modality. Subsequently, the fusion function combines the input feature map, , with the refined one, (Sec. 3.3.3). We explain the details in the following sections.
3.3.1 Cross-task Similarity
We define cross-task similarity as , where is the spatial index of each feature map, and is a -dimensional
feature vector. This indicates quantitative amounts of semantic representation that each depth feature implicitly refers to. However, direct computation with raw feature vectors is infeasible, owing to the different nature of the tasks. We apply a linear projection,, that transforms the input feature from the original dimension, , to . This indirectly computes the cross-task similarity through the representation subspace. The refined feature is computed as follows:
Here, is a normalization factor scaling the input. We apply three separate linear embeddings, and each acts as query (), key (), and value () functions. The target feature map, , becomes the input for the key and value embeddings, and the reference feature map, , becomes the input for the query embedding.
For depth prediction, this imposes large attention scores on the specific depth features that are consistent with semantics, so that the network can implicitly utilize semantic region information. As mentioned above, this module is bidirectional, and the semantic feature acts as the target simultaneously. In that case, the depth feature is used to learn the features for semantic prediction so that backpropagation from the segmentation loss optimizes depth layers while offering more semantics-aware representations.
Compared with the affinity matrix for cross-task feature fusion [Zhou2020PatternStructureDF, zhang2019patternaffinitive, Jiao2019GeometryAwareDF, choi2020safenet], which is computed solely from the features of a single task, the CMA module computes the attention score from the features of both tasks. Hence, it can effectively handle cross-modal interactions for multi-task predictions.
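The pixel-wise computation described in this section can be sketched for one spatial location. This is a minimal illustration, assuming a single embedding with an identity normalization and hand-picked projection weights; the function names, the weights, and the feature values are all hypothetical, and in practice the projections would be learned.

```python
def project(weight, vec):
    """Apply a linear projection given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def cross_task_refine(target, reference, w_query, w_key, w_value):
    """Refine one pixel of the target feature using the reference feature.

    The query is computed from the reference (e.g. segmentation) feature,
    while key and value are computed from the target (e.g. depth) feature.
    The similarity is a single scalar attention score for this pixel.
    """
    query = project(w_query, reference)
    key = project(w_key, target)
    value = project(w_value, target)
    score = sum(q * k for q, k in zip(query, key))  # cross-task similarity
    return score, [score * v for v in value]        # score-weighted value


# Hypothetical projection from 2 to 4 dimensions (twice the input size).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
depth_feature = [1.0, 2.0]
seg_feature = [2.0, 1.0]

score, refined = cross_task_refine(depth_feature, seg_feature, W, W, W)
```

Swapping the roles of the two features gives the other direction of the bidirectional module, which is why the score depends on features from both tasks rather than on one task alone.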
3.3.2 Multi-embedding Attention
Inspired by multi-head attention [attention, bert], we adopt multiple linear projections to compute the similarity between feature vectors through different representation subspaces. This refines depth features with implicit semantic representations more effectively, as verified in Sec. 4. We use distinct projection functions, ; hence, the queries, keys, and values are mapped to independent subspaces. The cross-task similarity in Eqs. 9-10 can be directly extended to a multi-embedding scheme as follows:
The refined feature, , is the summation of the feature maps refined from each embedding function:
In the above equations, represents the index of multiple linear embeddings. We adopted as a normalization function, , to compute the importance of each embedding. Thus, we can selectively exploit the outputs from multiple attentions. This process is illustrated in Fig. 4.
In contrast to the original multi-head attention where the results from each embedding head are concatenated and equally handled, we compute the attention score among multiple heads and measure the significance of results from each embedding on the corresponding attention scores.
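The difference from concatenating heads can be illustrated with a small sketch. It assumes that the normalization across embeddings is a softmax over the per-embedding scores, which is an assumption made here for illustration; the function names, the projection weights, and the feature values are likewise hypothetical.

```python
import math


def project(weight, vec):
    """Apply a linear projection given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_embedding_refine(target, reference, heads):
    """Combine several embeddings for one pixel.

    heads is a list of (w_query, w_key, w_value) triples. Each embedding
    yields one scalar score; the scores are normalized across embeddings
    and used to weight the corresponding value vectors, which are summed.
    """
    scores, values = [], []
    for w_query, w_key, w_value in heads:
        query = project(w_query, reference)
        key = project(w_key, target)
        scores.append(sum(q * k for q, k in zip(query, key)))
        values.append(project(w_value, target))
    weights = softmax(scores)
    dim = len(values[0])
    refined = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, refined


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]
HEADS = [(IDENTITY, IDENTITY, IDENTITY), (IDENTITY, SWAP, SWAP)]

weights, refined = multi_embedding_refine([1.0, 0.0], [1.0, 0.0], HEADS)
```

In this toy case the first embedding agrees with the reference feature and receives the larger weight, so its value dominates the refined feature instead of being treated equally with the second embedding.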
3.3.3 Fusion Layer
Finally, the refined feature map, , is projected to the original dimension ( in Fig. 4) and fused with the initial feature map, , to produce final output, . We apply two convolution layers to concatenated feature maps, , to produce . becomes the input of layer of the depth decoder.
4.1 KITTI Dataset
The KITTI dataset [Geiger01] has been widely adopted for depth prediction benchmarks. We used the Eigen split [Eigen01] for this purpose, and preprocessing was performed to remove static frames, as in [Godard02, Zhou01]; thus, 39,910 and 4,424 images were used for training and validation, respectively, and 697 images were used for evaluation.
For training semantic segmentation, we generated pseudo-labels using an off-the-shelf network [Zhu01]. To evaluate the segmentation performance, we used 200 images and labels provided in the training set of the KITTI semantic segmentation benchmark corresponding to KITTI 2015 [Menze2015CVPR].
To evaluate depth prediction, we applied per-image median scaling with the ground truth, following the evaluation protocol in [Godard02]. The maximum depth is 80 m, as in recent studies [Godard02, Patil2020DontFT, Klingner2020SelfSupervisedMD]. We evaluated semantic segmentation using the mean intersection over union (mIoU).
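The depth evaluation protocol can be sketched as follows. This is a simplified illustration, assuming that pixels without valid ground truth are marked with zero and that scaled predictions are clipped to the maximum depth; the function name and the toy values are hypothetical, and only the absolute relative error (AbsRel) is shown.

```python
import statistics


def abs_rel_with_median_scaling(pred, gt, max_depth=80.0):
    """Absolute relative error after per-image median scaling.

    pred and gt are flat lists of depths in metres. Pixels whose ground
    truth is missing (zero) or beyond max_depth are ignored.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if 0.0 < g <= max_depth]
    # Self-supervised monocular depth is scale-ambiguous, so align medians.
    scale = statistics.median(g for _, g in valid) / statistics.median(p for p, _ in valid)
    scaled = [min(p * scale, max_depth) for p, _ in valid]
    return sum(abs(s - g) / g for s, (_, g) in zip(scaled, valid)) / len(valid)


perfect = abs_rel_with_median_scaling([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])
partial = abs_rel_with_median_scaling([1.0, 2.0, 4.0, 5.0], [2.0, 4.0, 10.0, 0.0])
```

The first call returns zero because the prediction differs from the ground truth only by a global scale, which median scaling removes; the second ignores the pixel without ground truth and reports the remaining relative error.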
|Ours w/o CMA||K||56.1|
|Ours w/o CMA (HR)||K||59.1|
Table 2: Semantic segmentation results on the KITTI 2015 training set. CS denotes Cityscapes, and K denotes KITTI. HR refers to training with high-resolution images.
4.2 Implementation Details
4.2.1 Network Architecture
The depth and segmentation network has a standard encoder-decoder architecture [ronneberger2015unet] with skip connections, as in Monodepth2 [Godard02]. The shared encoder and the pose network encoder are ResNet-18 [resnet], pre-trained on ImageNet [imagenet_cvpr09]. For the CMA module, we adopt four ( = 4) embeddings of the multi-embedding scheme. The dimension ratio between the original feature and the embedded feature is two, such that = . Thus, the projected vectors have twice the dimensions of the corresponding input features. The normalization factor, , is the identity function when (without multi-embedding) and when (w/ multi-embedding). We apply the CMA module to three of the decoder layers, .
4.2.2 Training Details
For training, we resized the original image into a resolution of
and used a batch size of 12. The Adam optimizer was used with an initial learning rate of 1.5e-4, and we trained for 20 epochs, decaying the learning rate by a factor of 0.1 twice, after 10 and 15 epochs of training. We used SSIM [ssim] with L1 loss for the photometric loss, with the weighting factor set to 0.85 following the previous work [Godard02]. We set the loss parameters as follows: , , and . The local patch size, , is set to five, and the margin, , is set to 0.3 for . This loss is applied to features from three layers, and . The threshold is set to .
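The learning-rate schedule stated above can be written as a small step function. This is a sketch, assuming zero-indexed epochs so that the decay takes effect from epochs 10 and 15 onward; the function name and its parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=1.5e-4, milestones=(10, 15), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone reached."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays


# Learning rate used in each of the 20 training epochs.
schedule = [learning_rate(epoch) for epoch in range(20)]
```

In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[10, 15]` and `gamma=0.1`, although the source does not state which scheduler was used.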
4.3 Quantitative Results and Ablation Study
Table 1a compares our proposed method with recent works. Our method achieves state-of-the-art results on the KITTI Eigen test split and outperforms previous works in every metric. Our network adopts ResNet-18 as a backbone, but we also use ResNet-50 and compare it with other methods adopting ResNet-50. Ours (ResNet-50) also achieves the best results. Note that the PackNet versions of [Guizilini2020SemanticallyGuidedRL] adopt a significantly larger backbone ( larger than ResNet-18). Therefore, we compare against the ResNet-18 and ResNet-50 versions of [Guizilini2020SemanticallyGuidedRL] and show that our method outperforms them by a large margin. Additionally, in our multi-task network, semantic information is required only for training. In contrast, [Guizilini2020SemanticallyGuidedRL] and [li2021learning] require semantics for both training and testing: [Guizilini2020SemanticallyGuidedRL] requires a teacher segmentation network for feature distillation during inference, and [li2021learning] requires semantic labels or pre-computed segmentation results as the network input. Finally, our network is highly compatible with more advanced networks [lyu2020hrdepth, Guizilini20203DPF], which have architectural differences from our baseline, Monodepth2. This indicates the potential for further improvement.
In Table 1b, we also evaluate the effectiveness of each proposed method. Adding semantic segmentation to depth via a shared encoder already shows an improvement. Applying either the semantics-guided triplet loss or the CMA module further improves on the baseline. This verifies that more semantics-aware representations lead to better depth predictions. Finally, the combination of both methods significantly improves the performance. Both techniques are designed to refine the depth representation via semantic knowledge, and they offer highly synergistic improvements.
In Table 2, we also evaluate the semantic segmentation performance on KITTI 2015 [Menze2015CVPR]. Although the proposed method outperforms the others, the comparison is not fair to the works that trained semantic segmentation with Cityscapes [Cordts2016Cityscapes] ground truth (they trained depth on KITTI). Hence, we focus more on how segmentation benefits from depth estimation via CMA than on the final performance. As shown in the last two rows, the proposed CMA module also improves the segmentation performance. Thanks to its bidirectional flow, the CMA module also refines semantic features as a target with reference to depth representations. Additionally, it is more effective when the resolution is high.
4.4 Qualitative Evaluation
We qualitatively compare our method with recent methods, SGDepth [Klingner2020SelfSupervisedMD] and PackNet-SfM [Guizilini20203DPF], as shown in Figs. 1 and 5. In Fig. 1, we compare the depth predictions and error distributions, with AbsRel fixed between 0 and 1. (Owing to the sparsity of the depth ground truth, we computed the mismatch against a top-performing supervised depth network [lee2019big].) Similar to ours, [Klingner2020SelfSupervisedMD] also adopted multi-task training with semantic segmentation via a shared encoder. However, only enhancing the encoder in a multi-task setting cannot fully exploit the semantic information. As shown in both figures, our method captures fine-grained detail, leading to more accurate depth predictions compared with others, especially at object borders. This verifies the effectiveness of the proposed fine-grained semantics-aware enhancement of representation.
4.5 Further Analysis
Table 3a shows the results of varying layers () to which semantics-guided triplet loss is applied. We selected layers and because it showed the best results. Applying to degrades the performance as it has a significantly low channel dimension (16); hence, the distance cannot be properly computed. In Table 3b, we compare the effect of the patch size, . Because separating a local patch into triplets relies on semantic labels from the off-the-shelf network, there must be noisy labels. When is small (i.e., ), the number of samples decreases, and each noisy label contributes more to the mean distance, . When is large (i.e., ), each local patch contains more non-boundary pixels and the negative distance can easily exceed the margin. In other words, the loss is computed from more easy samples, and the improvements are limited. In our experiments, was the balanced point, which was the best option. We further compare the results of different margins, , in the supplement.
Table 4 lists the results of the CMA module with varying parameters. As shown in Table 4a, our bidirectional CMA provides better results than the unidirectional CMA. This confirms that, to benefit from cross-modal representation, it is more beneficial to simultaneously improve both depth and semantics features than to improve just one.
Table 4b shows the effectiveness of the proposed multi-embedding scheme. It can fully utilize cross-modality as the number of embeddings grows. As shown in Fig. 6, the more embeddings used, the more precise the object boundary of the depth network, and the depth prediction becomes more aligned to semantics. This demonstrates that the depth network can have more semantics-aware representations, owing to our proposed multi-embedding scheme.
This paper proposed novel methods for accurate monocular depth prediction (i.e., semantics-guided triplet loss and cross-task multi-embedding attention) to make the best use of semantics-geometry cross-modality. Semantics-guided triplet loss offered a new and effective supervisory signal for optimizing depth representations. The CMA module allowed us to utilize rich and spatially fine-grained representations for multi-task training of depth prediction and semantic segmentation. The enhanced representation from these two methods exhibited a highly synergistic performance boost. Our extensive evaluation on the KITTI dataset demonstrates that the proposed methods outperformed extant state-of-the-art methods, including those that use semantic segmentation.
Acknowledgement. We would like to especially thank Soohyun Bae at Bobidi for his invaluable comments. This work was supported by the SNU-SK Hynix Solution Research Center (S3RC).