
Self-Supervised Monocular Depth Estimation: Solving the Edge-Fattening Problem

10/02/2022 · Xingyu Chen, et al. · Peking University

Self-supervised monocular depth estimation (MDE) models universally suffer from the notorious edge-fattening issue. The triplet loss, popular in metric learning, has achieved great success in many computer vision tasks. In this paper, we redesign the patch-based triplet loss in MDE to alleviate this ubiquitous edge-fattening issue. We identify two drawbacks of the raw triplet loss in MDE and present our problem-driven redesigns. First, we introduce a minimum-operator-based strategy applied to all negative samples, which prevents well-performing negatives from sheltering the error of edge-fattening negatives. Second, we split the anchor-positive distance and the anchor-negative distance from within the original triplet, which optimizes the positives directly, without any mutual effect with the negatives. Extensive experiments show that the combination of these two small redesigns achieves unprecedented results: our powerful and versatile triplet loss not only makes our model outperform all previous SoTA by a large margin, but also provides substantial performance boosts to a large number of existing models, while introducing no extra inference computation at all.



Code repository: tri-depth ([WACV 2023] Self-Supervised Monocular Depth Estimation: Solving the Edge-Fattening Problem)

1 Introduction

Estimating how far each pixel in an image is from the camera is a fundamental problem in 3D computer vision. The technique is desirable in various fields such as autonomous driving [43, 49], AR [29] and robotics [13]. Many of these applications rely on off-the-shelf hardware, e.g. LiDAR sensors or RGB-D cameras, to equip their agents with depth. In contrast, monocular videos or stereo pairs are much easier to obtain. Hence, numerous studies [51, 10, 45] have been conducted to estimate dense depth maps from only a single RGB image. Most of them and their follow-ups formulate the problem as image reconstruction [10, 51, 45, 46, 24]. In particular, given a target image, the network infers its pixel-aligned depth map. Next, with a known or estimated camera ego-motion, every pixel in the target image can be reprojected into the reference image(s), which are taken from different viewpoint(s) of the same scene. A reconstructed image is then generated by sampling from the source image, and the training loss is based on the photometric distance between the reconstructed and the target image. In this way, the network is trained under self-supervision.

Nevertheless, these approaches suffer severely from the notorious 'edge-fattening' problem, where objects' depth predictions are consistently 'fatter' than the objects themselves. We visualize the problem and analyse its cause in Fig. 1. Disappointingly, there has been no one-size-fits-all solution, let alone a lightweight one.

Deep metric learning seeks to learn an embedding space where semantically similar samples are embedded at nearby locations and semantically dissimilar samples are embedded far apart. Jung et al. [24] pioneered the patch-based semantics-guided triplet loss in MDE. Its key idea is to encourage pixels within each object instance to have similar depths, while pixels across semantic boundaries have depth differences, as shown in Fig. 2.

However, we find that a straightforward application of the triplet loss only produces poor results. In this paper, we dig into the weaknesses of the patch-based triplet loss and improve it in a problem-driven manner, reaching unprecedented performance.

First, in some boundary regions the edge-fattening area can be thin, so its contribution is small compared to the non-fattening area; in such cases, the error of the thin but defective region can be covered up. We therefore change the optimization strategy to focus only on the fattening area. The problematic case illustrated in Fig. 3 motivates this strategy: we leave the normal area alone and concentrate the optimization on the poorly performing negatives.

Second, we point out that the training objective of the original triplet loss [37] is to distinguish / discriminate: the network only has to make sure that the correct answer's score (the anchor-positive distance $d^+$) beats the other choices' scores (the anchor-negative distances $d^-$) by a predefined margin, i.e. $d^- > d^+ + \alpha$, while the absolute value of $d^+$ is not that important; see the example in Fig. 4. However, depth estimation is a regression problem, since every pixel has its unique depth solution. Here, we have no idea of the exact depth difference between intersecting objects, so it is also unknown by how much $d^-$ should exceed $d^+$. But one thing is for sure: in depth estimation, the smaller $d^+$, the better, since depths within the same object are generally the same. We therefore split $d^+$ and $d^-$ from within the original triplet and optimize them in isolation, so that errors in either the positives or the negatives are penalized individually and more directly. The problematic case illustrated in Fig. 5 motivates this strategy, where the negatives are so good that they cover up the badness of the positives. In other words, even though $d^+$ is large and needs to be optimized, $d^-$ already exceeds $d^+$ by more than $\alpha$, which hinders the optimization of $d^+$.

To sum up, this paper’s contributions are:

  • We show two weaknesses of the raw patch-based triplet optimization strategy in MDE: it (i) can miss thin but poorly performing fattening areas, and (ii) suffers from mutual effects between positives and negatives.

  • To overcome these two limitations, (i) we present a minimum-operator-based strategy applied to all negative samples, to prevent the good negatives from sheltering the error of poor (edge-fattening) negatives; (ii) we split the anchor-positive distance and anchor-negative distance from within the original triplet, to prevent the good negatives from sheltering the error of poor positives.

  • Our redesigned triplet loss is powerful, generalizable and lightweight: Experiments show that it not only makes our model outperform all previous methods by a large margin, but also provides substantial boosts to a large number of existing models, while introducing no extra inference computation at all.

2 Related Work

2.1 Self-Supervised Monocular Depth Estimation

Garg et al. [7] first introduced the concept of estimating depth without depth labels. Then, SfMLearner by Zhou et al. [51] required only monocular videos to predict depth, employing an additional pose network to learn the camera ego-motion. Godard et al. [10] presented Monodepth2, with surprisingly simple, non-learning methods to handle occlusions and dynamic objects, both of which add no network parameters. Multiple works leveraged additional supervision, e.g. estimating depth with traditional stereo matching [45, 38] or semantics [22]. HR-Depth [30] showed that higher-resolution input images can reduce the photometric loss for the same prediction. ManyDepth [46] proposed to make use of multiple frames available at test time and to leverage the geometric constraint by building a cost volume, achieving superior performance. [35] integrated wavelet decomposition into the depth decoder, reducing its computational complexity. Some other recent works estimate depth in more severe environments, e.g. indoor scenes [21] or at nighttime [40, 27]. Innovative loss functions were also developed, e.g. constraining 3D point cloud consistency [20]. To deal with the notorious edge-fattening issue illustrated in Fig. 1, most existing methods utilize an occlusion mask [11, 52] that removes the incorrect supervision of the photometric loss. We argue that although this exclusion strategy works to some extent, the masking technique can prevent the occluded regions from learning, because no supervision exists for them any longer. In contrast, our triplet loss closes this gap by providing additional supervision signals directly to these occluded areas.

2.2 Deep Metric Learning

The idea of comparing training samples in a high-level feature space [3, 1] is a powerful concept, since the feature space can carry more task-specific semantic information than the low-level image space. The contrastive loss (a.k.a. discriminative loss) [16] is formulated over whether a pair of input samples belong to the same class. It learns an embedding space where samples within the same class are close in distance, whereas unassociated ones lie farther apart. The triplet loss [37] is an extension of the contrastive loss that takes three samples as input each time, i.e. the anchor, the positive(s) and the negative(s). The triplet loss encourages the anchor-positive distance to be smaller than the anchor-negative distance by a margin. The triplet loss has been applied to face recognition and re-identification [37, 54] and image ranking [41], to name a few. Jung et al. [24] first plugged the triplet loss into self-supervised MDE, guided by a pretrained semantic segmentation network. In our experiments, we show that without the other contributions of [24], the raw semantics-guided triplet loss only yields a very limited improvement. We tackle various difficulties in plugging the triplet loss into MDE, allowing our redesigned triplet loss to outperform existing ones by a large margin, with unparalleled accuracy.

3 Analysis of the Edge-fattening Problem

Figure 1: Analysis of the edge-fattening issue. (a) Example of the edge-fattening issue. The depth predictions of foreground objects (e.g. the tree trunk and poles) are 'fatter' than the objects themselves. (b) In the left view, pixels $p_1$ and $p_2$ are located in the background with a disparity of 5 pixels; $p_1$ would already be occluded if it lay any further to the right. Pixel $p_3$ is on the tree with a disparity of 10 pixels. (c) $p_1$ and $p_3$ are fine: their ground-truth disparity is the global optimum of the photometric error (loss). $p_2$ suffers from the edge-fattening issue. Since $p_2$ is occluded by the tree in the right view, the photometric error at its ground-truth disparity of 5 is large. The photometric loss therefore seeks another location with a small loss, i.e. shifting another 5 pixels to reach the nearest visible background pixel. However, that pixel is not the true correspondence of $p_2$. As a result, the disparity of the background pixel $p_2$ equals that of the foreground $p_3$, leading to the edge-fattening issue. Details in Sec. 3.

Before delving into our powerful triplet loss, it is necessary to make the problem clear. We first show the behaviour of the so-called edge-fattening issue in self-supervised MDE, then analyse its cause in Fig. 1. This motivates us to introduce our redesigned triplet loss.

The ubiquitous edge-fattening problem, which limits the performance of the vast majority of self-supervised MDE models [10, 46, 45, 24, 51, 30], manifests itself as inaccurate object depths that partially leak into the background at object boundaries, as shown in Fig. 1a.

We first state the conclusion: the network misjudges the background near the foreground as foreground, so that the foreground looks fatter than it actually is.

The cause could be traced back to occlusions of background pixels as illustrated in Fig. 1b&c. The background pixels visible in the target (left) image but invisible in the source (right) image suffer from incorrect supervision under the photometric loss, since no exact correspondences in the source image exist for them at all.

The crux of the matter is that, for an occluded pixel (e.g. pixel $p_2$ in Fig. 1b), the photometric loss still seeks a pixel with a similar appearance in the right view to be its fake correspondence. Generally, for a background pixel, only another background pixel yields a small photometric loss, so the problem turns into finding the nearest visible background pixel for the occluded pixel ($p_2$) in the source (right) view. Since the foreground has a smaller depth than the background, it has a larger disparity owing to the geometric projection constraint $\mathrm{disp} = \frac{f \cdot B}{\mathrm{depth}}$, where $B$ is the fixed camera baseline and $f$ the focal length. Consequently, the photometric loss has to shift further to the left to find the nearest background pixel (the fake solution marked in Fig. 1b&c). In this way, the occluded background pixels share the same disparities as the foreground, forming the edge-fattening issue.
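The relation above already quantifies how severe the resulting depth error is. The following minimal numeric sketch (our own illustration, with a hypothetical focal length and baseline close to typical KITTI values) shows how an occluded background pixel that borrows the foreground disparity of Fig. 1 ends up with roughly the foreground depth.

```python
# A minimal numeric sketch (ours, with illustrative values) of the geometry
# behind Fig. 1: disparity = f * B / depth, so the nearer foreground has the
# larger disparity.
f_px = 720.0   # hypothetical focal length in pixels
B_m = 0.54     # hypothetical stereo baseline in metres

def disparity(depth_m: float) -> float:
    """disp = f * B / depth, in pixels."""
    return f_px * B_m / depth_m

background_depth = 77.8   # metres -> ~5 px disparity
foreground_depth = 38.9   # metres -> ~10 px disparity
print(round(disparity(background_depth), 1))   # ~5.0
print(round(disparity(foreground_depth), 1))   # ~10.0

# An occluded background pixel whose photometric match lands on the nearest
# visible background pixel is assigned the foreground disparity (~10 px),
# i.e. roughly half of its true depth -- the 'fattened' prediction.
print(round(f_px * B_m / disparity(foreground_depth), 1))   # ~38.9 m instead of 77.8 m
```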

4 Methodology

4.1 Self-supervised Depth Estimation

Following [51, 10], given monocular and/or stereo videos, we first train a depth network that consumes a single target image $I_t$ as input and outputs its pixel-aligned depth map $D_t$. Then, we train a pose network that takes temporally adjacent frames $(I_t, I_{t'})$ as input and outputs the relative camera pose $T_{t \to t'}$.

Given the camera intrinsics $K$, together with $D_t$ and $T_{t \to t'}$, we warp the source image $I_{t'}$ into the target view to generate the reconstructed image $I_{t' \to t}$:

$$I_{t' \to t} = I_{t'} \big\langle \operatorname{proj}(D_t, T_{t \to t'}, K) \big\rangle, \tag{1}$$

where $\langle \cdot \rangle$ is the differentiable bilinear sampling operator and $\operatorname{proj}(\cdot)$ returns the 2D coordinates of the depths $D_t$ projected into the view of $I_{t'}$ [10].
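As a concrete illustration of Eq. 1, the sketch below (ours, not the authors' released code; function and variable names are our own) back-projects the target pixels with $D_t$, transforms them with $T_{t \to t'}$, reprojects them with $K$, and bilinearly samples the source image.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src, depth, T, K):
    """Reconstruct the target view by sampling the source image at the
    locations given by proj(D_t, T_{t->t'}, K) -- a sketch of Eq. 1.

    src   : (B, 3, H, W) source image I_{t'}
    depth : (B, 1, H, W) predicted target depth D_t
    T     : (B, 4, 4)    relative pose T_{t->t'}
    K     : (B, 3, 3)    camera intrinsics
    """
    B, _, H, W = src.shape
    device = src.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1),
                       torch.ones(H * W, device=device)], dim=0)
    pix = pix.unsqueeze(0).expand(B, -1, -1)

    # Back-project to 3D camera points, then transform into the source frame.
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)            # (B, 3, H*W)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_src = (T @ cam_h)[:, :3]                                      # (B, 3, H*W)

    # Project into the source image plane and normalise to [-1, 1] for grid_sample.
    proj = K @ cam_src
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
    u = uv[:, 0] / (W - 1) * 2 - 1
    v = uv[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)

    # Differentiable bilinear sampling <.> of the source image.
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```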

To make use of multiple input frames, we build a cost volume [46] using discretized depth values from a predefined range $[d_{\min}, d_{\max}]$. Moreover, $d_{\min}$ and $d_{\max}$ are dynamically adjusted during training to find the best scale [46].
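A rough sketch of how such a cost volume can be assembled is shown below (our own simplification of the ManyDepth-style construction [46]; the bin count and depth bounds are illustrative placeholders, and the per-depth warping is assumed to be done by a routine like the one sketched above).

```python
import torch

def build_cost_volume(feat_t, warped_src_feats):
    """Stack per-depth-bin matching costs into a cost volume.

    feat_t           : (B, C, H, W) target feature map
    warped_src_feats : list of K tensors (B, C, H, W), the source features
                       warped into the target view at each candidate depth
    returns          : (B, K, H, W) cost volume (lower = better match)
    """
    costs = [torch.abs(feat_t - w).mean(dim=1) for w in warped_src_feats]
    return torch.stack(costs, dim=1)

# Candidate depths between d_min and d_max; in [46] these bounds are adapted
# during training rather than kept fixed.  Values here are placeholders.
d_min, d_max, num_bins = 0.1, 100.0, 96
depth_bins = torch.linspace(d_min, d_max, num_bins)
```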

To evaluate the reconstructed image $I_{t' \to t}$, we adopt the edge-aware smoothness loss [18] and the photometric reprojection loss $pe$ measured by a combination of SSIM and $L_1$ [51, 10]:

$$pe(I_a, I_b) = \frac{\alpha}{2}\big(1 - \operatorname{SSIM}(I_a, I_b)\big) + (1 - \alpha)\,\|I_a - I_b\|_1, \tag{2}$$

where $\alpha = 0.85$ by default and SSIM [44] computes pixel similarity over a local window. See [46] for more details of our network architecture.
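The following sketch (ours, mirroring the common Monodepth2-style implementation rather than the authors' exact code) shows how Eq. 2 is typically computed in PyTorch.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """SSIM [44] over a small local window (3x3 average pooling, as in [10])."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(pred, target, alpha=0.85):
    """pe(I_a, I_b) = alpha/2 * (1 - SSIM) + (1 - alpha) * |I_a - I_b|  (Eq. 2)."""
    l1 = torch.abs(pred - target).mean(1, keepdim=True)
    ssim_term = (1 - ssim(pred, target)).mean(1, keepdim=True) / 2
    return alpha * ssim_term + (1 - alpha) * l1
```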

4.2 Redesigned Triplet Loss

We first introduce our baseline depth-patch-based triplet loss, and then give detailed analyses of our two redesigns.

4.2.1 Patch-based Triplet Loss

Figure 2: The baseline semantics-aware triplet loss. For each pixel in the semantic boundary region, we group the local patch of its corresponding depth feature $f$ into a triplet according to the semantic patch. Next, the triplet loss minimizes the anchor-positive distance $d^+$ and maximizes the anchor-negative distance $d^-$.
Figure 3: How the error of the low-proportion poor negatives is sheltered by the large-proportion well-performing negatives. (a) An instance of the edge-fattening issue (detailed analysis in Fig. 1), where the depth prediction is 'fatter' than the object itself. In other words, the negatives in the 'fatter' area perform poorly: their depth should equal that of the background, not of the foreground object. (b) The Euclidean distance in depth-feature space between each pixel of the local patch and the anchor. The yellow plane indicates $d^+$ (the mean anchor-positive distance); the grey plane indicates $d^-$ (the mean anchor-negative distance); the green plane specifies the decision boundary of whether this triplet participates in training (see the hinge function in Eq. 8). $d^-$ is large, such that the green plane lies beneath the grey one. Thus, no learning happens. Disappointingly, the poor negative pixels in the 'fatter' area (e.g. the poor negatives marked in (a)) receive no optimization. This is caused by the average operator: the contributions of the low-proportion poor negatives are weakened by the large-proportion good negatives.

Pixels within a certain object generally have similar depths, while pixels across object boundaries may have large depth differences. However, in many cases the depth boundary does not align with the semantic boundary, as shown in Fig. 1. Therefore, similar to [24], we address this problem through deep metric learning [25, 48]. Specifically, we group the pixels in a local patch into triplets: the central pixel is the anchor; pixels sharing the same semantic class as the anchor are positives; pixels with a different semantic class from the anchor are negatives. We refer to the sets of positive and negative pixels of an anchor $a$ as $\mathcal{P}_a$ and $\mathcal{N}_a$, respectively. For example, $\mathcal{N}_a = \varnothing$ implies that $a$ is located inside an object.

The anchor-positive distance $d^+$ and anchor-negative distance $d^-$ are defined as the mean Euclidean distance between normalized depth features [24]:

$$d^+(a) = \frac{1}{|\mathcal{P}_a|} \sum_{p \in \mathcal{P}_a} \big\| \bar{f}(a) - \bar{f}(p) \big\|_2, \tag{3}$$
$$d^-(a) = \frac{1}{|\mathcal{N}_a|} \sum_{n \in \mathcal{N}_a} \big\| \bar{f}(a) - \bar{f}(n) \big\|_2, \tag{4}$$

where $\bar{f}(\cdot) = f(\cdot) / \|f(\cdot)\|_2$ denotes the L2-normalized depth feature of a pixel.

Intuitively, $d^+$ should be minimized, whereas $d^-$ should be maximized. However, naively making $d^-$ as large as possible is unfavourable, since the depth difference between two spatially adjacent objects is not always large. Hence, a margin $\alpha$ is employed to prevent $d^-$ from exceeding $d^+$ immoderately [37]. In this way, we want:

$$d^-(a) > d^+(a) + \alpha, \tag{5}$$

where $\alpha$ is the threshold controlling the least distance between $d^+$ and $d^-$.

In addition, since we adopt an off-the-shelf pretrained semantic segmentation model [53] whose predictions can also be inaccurate, we only optimize the boundary anchors with $|\mathcal{P}_a|$ and $|\mathcal{N}_a|$ both larger than a threshold $n_{\min}$ [24]:

$$|\mathcal{P}_a| > n_{\min}, \tag{6}$$
$$|\mathcal{N}_a| > n_{\min}, \tag{7}$$

where we denote by $\mathcal{B}$ the set holding all semantic boundary anchors that fulfil both constraints.

Consequently, the baseline triplet loss is defined as:

$$L_{\mathrm{trip}} = \frac{1}{|\mathcal{B}|} \sum_{a \in \mathcal{B}} \big[\, d^+(a) - d^-(a) + \alpha \,\big]_+, \tag{8}$$

where $[x]_+ = \max(x, 0)$ is the hinge function.

In this way, we require no ground-truth annotations, neither for depth nor for segmentation, which enables a fully self-supervised setting that generalizes better to unseen scenarios.
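To make Eqs. 3-8 concrete, here is a compact PyTorch sketch of the baseline patch-based triplet loss (our own illustration, not the released implementation; the patch size, margin and minimum-count threshold below are placeholder values).

```python
import torch
import torch.nn.functional as F

def baseline_patch_triplet_loss(depth_feat, sem, patch=5, margin=0.3, n_min=4):
    """Baseline patch-based triplet loss (Eqs. 3-8), sketched.

    depth_feat : (B, C, H, W) depth-decoder features
    sem        : (B, H, W)    integer semantic labels from a pretrained model
    """
    B, C, H, W = depth_feat.shape
    f = F.normalize(depth_feat, dim=1)          # L2-normalised depth features
    K = patch * patch
    pad = patch // 2

    # Unfold local patches: every pixel becomes an anchor with K neighbours.
    f_patches = F.unfold(f, patch, padding=pad).view(B, C, K, H * W)
    s_patches = F.unfold(sem.unsqueeze(1).float(), patch, padding=pad).view(B, K, H * W)
    anchor_f = f.view(B, C, 1, H * W)
    anchor_s = sem.view(B, 1, H * W).float()

    dist = torch.norm(f_patches - anchor_f, dim=1)                     # (B, K, H*W)
    pos_mask = (s_patches == anchor_s).float()
    neg_mask = 1.0 - pos_mask

    # Note: the anchor itself contributes a zero term to d_pos in this sketch;
    # a faithful implementation would exclude it.
    d_pos = (dist * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)    # Eq. 3
    d_neg = (dist * neg_mask).sum(1) / neg_mask.sum(1).clamp(min=1)    # Eq. 4

    # Boundary anchors need enough positives AND negatives (Eqs. 6-7).
    valid = (pos_mask.sum(1) > n_min) & (neg_mask.sum(1) > n_min)
    hinge = F.relu(d_pos - d_neg + margin)                             # Eq. 8
    return hinge[valid].mean() if valid.any() else hinge.sum() * 0.0
```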

4.2.2 Motivation 1: The Poor Negatives’ Error may be Sheltered

Nevertheless, naively encouraging the mean $d^-$ in the local patch to increase leads to poor results. We trace the cause to the average operator used in computing the anchor-negative distance. An example is demonstrated in Fig. 3: the majority of the negatives perform well, while only a few pixels inside the gap between the depth boundary and the semantic boundary perform poorly. A triplet loss that averages over all negatives is likely to mask the error of these poor negatives, as their contribution is small compared to the large number of well-performing negatives. As a result, the average anchor-negative distance already satisfies the margin $\alpha$, the hinge in Eq. 8 becomes inactive, and the optimization of these thin but poor negatives is hindered.

4.2.3 Solution 1: Focus only on the Fattening Area

We point out that it is not a wise idea to simply increase the margin $\alpha$ to let cases like Fig. 3 participate in training, because these are not the only cases that matter; we should find an $\alpha$ suitable for as many cases as possible. Instead, inspired by PointNetVLAD [39], a point cloud retrieval network, we select only the hardest negative to participate in training, leaving the other negatives alone. In this way, the anchor-negative distance becomes:

$$d^-_{\min}(a) = \min_{n \in \mathcal{N}_a} \big\| \bar{f}(a) - \bar{f}(n) \big\|_2. \tag{9}$$

This comes from the fact that if the worst negative is good enough, the other negatives in the local patch no longer need optimization. Compared to other computer vision tasks such as image classification [50], point cloud retrieval [39] and person re-identification [19], where hard-negative mining is expensive because the whole training set has to be revisited to find the hardest sample, our strategy is much more efficient, since the mining happens only within the local patch and all triplets can easily be processed in batches.
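Operationally, Eq. 9 only changes how the negative distances inside each patch are reduced; a sketch (ours, reusing the tensor layout of the previous sketch) is shown below.

```python
import torch

def hardest_negative_distance(dist, neg_mask, inf=1e6):
    """Eq. 9: minimum anchor-negative distance inside each local patch, so that
    a few poor (edge-fattening) negatives cannot be averaged away.

    dist     : (B, K, N) anchor-to-patch-pixel Euclidean distances
    neg_mask : (B, K, N) 1 where the patch pixel has a different semantic class
               from the anchor, 0 otherwise
    """
    # Push positives to +inf so the minimum only ranges over the negatives.
    return (dist + (1.0 - neg_mask) * inf).min(dim=1).values          # (B, N)
```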

4.2.4 Motivation 2: The Positives’ Error may be Sheltered

Figure 4: In a classification task, both Network 1 and Network 2 perform well (both satisfy Eq. 5); they undoubtedly judge the anchor as a cat, because there is a clear distance gap (larger than the margin $\alpha$) between the two choices.

In a classification task it is tolerable that the anchor-positive distance $d^+$ is a bit large (e.g. Network 2 in Fig. 4), because it only has to be sufficiently smaller than the distance to any sample belonging to a different class (e.g. the dog). In other words, for classification it is the comparison that counts, not the individual values of $d^+$ or $d^-$. A classification network only has to make sure that the score of the correct answer beats all the others, while the absolute score of the correct answer is not that important. In fact, at inference time it makes no difference in Fig. 4 which particular value $d^+$ takes.

Figure 5: Example of how the error of poor positives is sheltered by good negatives. (a) From top to bottom: an RGB image; its predicted depth; semantic segmentation. (b) An image patch from (a), where the depth prediction of the car window (positives) is wrong, as it should match the car body, while the depth predictions of the background (negatives) are extremely accurate. (c) The average anchor-negative distance $d^-$ is so large that the grey plane lies above the green plane. That is, the triplet loss stops working because Eq. 5 is satisfied. However, all poor positive car-window pixels (with large $d^+$) receive no optimization.

However, when it comes to depth estimation, the situation changes. We emphasize that depth estimation is a regression problem, which aims to predict an accurate depth value for every pixel. In this case, $d^+$ and $d^-$ matter individually. For example, as depicted in Fig. 5b, the depth prediction of the car window (positives) should be close to that of the car body (anchor), but in fact it is not. When using the original triplet loss, the depth prediction of the background negatives is so good ($d^-$ is so large) that it shelters / covers up the error of the positives. In consequence, Eq. 5 is satisfied and no learning happens: none of the poor positive pixels (car window) are optimized.

4.2.5 Solution 2: Optimize Pos and Neg in Isolation

To this end, we no longer compare the value of $d^+$ with $d^-$. Instead, we simply split $d^+$ and $d^-_{\min}$ from within the original triplet loss (Eq. 8): we compare $d^-_{\min}$ with a new margin $\alpha'$, and optimize $d^+$ directly. Concretely, our redesigned triplet loss becomes:

$$L_{\mathrm{tri}} = \frac{1}{|\mathcal{B}|} \sum_{a \in \mathcal{B}} \Big( d^+(a) + \big[\, \alpha' - d^-_{\min}(a) \,\big]_+ \Big), \tag{10}$$

where errors of the positives and of the negatives appear individually, without any mutual effect.
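Combining the two redesigns, a sketch of the full loss (ours, assuming Eq. 10 takes exactly the split form reconstructed above; the margin value corresponds to the best setting in Tab. 4) looks like this.

```python
import torch
import torch.nn.functional as F

def redesigned_triplet_loss(d_pos, d_neg_min, valid, margin=0.65):
    """Redesigned triplet loss (Eq. 10): the anchor-positive distance is
    minimised directly, while the hardest anchor-negative distance is pushed
    above a margin; the two terms no longer interact inside a single hinge.

    d_pos     : (B, N) mean anchor-positive distance (Eq. 3)
    d_neg_min : (B, N) hardest anchor-negative distance (Eq. 9)
    valid     : (B, N) boolean mask of boundary anchors (Eqs. 6-7)
    """
    per_anchor = d_pos + F.relu(margin - d_neg_min)
    return per_anchor[valid].mean() if valid.any() else per_anchor.sum() * 0.0
```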

 

Method    Data    Abs Rel    Sq Rel    RMSE    RMSE log    δ<1.25    δ<1.25²    δ<1.25³
Ranjan et al. [36] S 0.148 1.149 5.464 0.226 0.815 0.935 0.973
EPC++ [28] S 0.141 1.029 5.350 0.216 0.816 0.941 0.976
Structure2depth [2] M 0.141 1.026 5.291 0.215 0.816 0.945 0.979
Videos in the wild [12] M 0.128 0.959 5.230 0.212 0.845 0.947 0.976
Guizilini et al. [15] M 0.102 0.698 4.381 0.178 0.896 0.964 0.984
Johnston et al. [23] M 0.106 0.861 4.699 0.185 0.889 0.962 0.982
Packnet-SFM [14] M 0.111 0.785 4.601 0.189 0.878 0.960 0.982
Li et al. [26] M 0.130 0.950 5.138 0.209 0.843 0.948 0.978
Patil et al. [32] M 0.111 0.821 4.650 0.187 0.883 0.961 0.982
Wang et al. [42] M 0.106 0.799 4.662 0.187 0.889 0.961 0.982
Monodepth2 MS [10] MS 0.106 0.818 4.750 0.196 0.874 0.957 0.979
Zhou et al. [51] M 0.183 1.595 6.709 0.270 0.734 0.902 0.959
WaveletMonodepth [35] S 0.109 0.845 4.800 0.196 0.870 0.956 0.980
HR-Depth [30] MS 0.107 0.785 4.612 0.185 0.887 0.962 0.982
FSRE-Depth [24] M 0.105 0.722 4.547 0.182 0.886 0.964 0.984
Depth-Hints [45] S 0.109 0.845 4.800 0.196 0.870 0.956 0.980
CADepth [47] S 0.106 0.849 4.885 0.204 0.869 0.951 0.976
SuperDepth [33] S 0.112 0.875 4.958 0.207 0.852 0.947 0.977
ManyDepth [46] M 0.098 0.770 4.459 0.176 0.900 0.965 0.983
Refine&Distill [34] S 0.098 0.831 4.656 0.202 0.882 0.948 0.973
TriDepth (Ours) M 0.093 0.665 4.272 0.172 0.907 0.967 0.984

 

Table 1: Comparison to previous SoTA on the KITTI Eigen split [6]. The first four metrics are error metrics (lower is better); the last three are accuracy metrics (higher is better). The Data column specifies the training data type: S - stereo pairs, M - monocular video, MS - monocular video plus stereo pairs. All models are trained at the same input resolution with ResNet-18 [17] as backbone. No results are post-processed [9]. We outperform all previous methods by a large margin on all metrics.

For example, in Fig. 5 the poor positives (wrong car-window depth predictions) are penalized directly, since $d^+$ is no longer restricted to a comparison with $d^-$. This is somewhat like a contrastive loss [16], but we point out that ours is still a triplet loss, since we still follow the idea of searching for and optimizing both positives and negatives of a given anchor simultaneously. Another reason why we do not optimize the value of $d^-$ beyond the margin is that we have no prior knowledge of how large the depth difference between the two objects on either side of the boundary should be; in fact, it varies between objects. But one thing is certain: the smaller $d^+$, the better, since we always want the depths of different parts of the same object to be the same.

5 Experiments

5.1 Implementation Details

We implement our method in PyTorch [31] and train for 20 epochs. Our triplet loss is applied to every layer of the depth decoder. We use a fixed local patch size for the triplets and, following [24], filter out anchors located in potentially inaccurate semantic predictions. Because we take the minimum over all negative samples, we increase the margin relative to the original triplet loss, to $\alpha' = 0.65$ (see Tab. 4). The rest of our model's settings are the same as [46]. Note that our triplet loss is not restricted to a fixed baseline: when plugging it into a new model, all of the model's original settings are left unchanged.

 

Method    Pub.    Data    Extra inference time    Abs Rel    Sq Rel    RMSE    RMSE log    δ<1.25    δ<1.25²    δ<1.25³
Monodepth2 M [10] ICCV 2019 M 0.115 0.903 4.863 0.193 0.877 0.959 0.981
+ Ours M + 0ms 0.108 0.744 4.537 0.184 0.883 0.963 0.983
Zhou et al. [51] CVPR 2017 M 0.183 1.595 6.709 0.270 0.734 0.902 0.959
+ Ours M + 0ms 0.148 1.098 5.150 0.212 0.819 0.949 0.980
Monodepth2 S [10] ICCV 2019 S 0.109 0.873 4.960 0.209 0.864 0.948 0.975
+ Ours S + 0ms 0.107 0.826 4.822 0.201 0.866 0.953 0.978
FSRE-Depth only SGT [24] ICCV 2021 M 0.113 0.836 4.711 0.187 0.878 0.960 0.982
+ Ours M + 0ms 0.108 0.746 4.507 0.182 0.884 0.964 0.983
Monodepth2 MS [10] ICCV 2019 MS 0.106 0.818 4.750 0.196 0.874 0.957 0.979
+ Ours MS + 0ms 0.105 0.753 4.563 0.182 0.887 0.963 0.983
Depth-Hints [45] ICCV 2019 S 0.109 0.845 4.800 0.196 0.870 0.956 0.980
+ Ours S + 0ms 0.106 0.843 4.774 0.194 0.875 0.957 0.980
HR-Depth [30] AAAI 2020 M 0.109 0.792 4.632 0.185 0.884 0.962 0.983
+ Ours M + 0ms 0.107 0.760 4.522 0.182 0.886 0.964 0.984
HR-Depth MS [30] AAAI 2020 MS 0.107 0.785 4.612 0.185 0.887 0.962 0.982
+ Ours MS + 0ms 0.105 0.751 4.512 0.181 0.890 0.963 0.983
ManyDepth [46] CVPR 2021 M 0.098 0.770 4.459 0.176 0.900 0.965 0.983
+ Ours M + 0ms 0.093 0.665 4.272 0.171 0.907 0.967 0.984
CADepth [47] 3DV 2021 M 0.110 0.812 4.686 0.187 0.882 0.962 0.983
+ Ours M + 0ms 0.105 0.745 4.530 0.181 0.888 0.965 0.984

 

Table 2: Comparisons of existing models with and without our method on the KITTI Eigen split [6]. All models are trained at the same input resolution with ResNet-18 [17] as backbone. No results are post-processed [9]. The rows marked '+ Ours' augment the model above them with our triplet loss and achieve better results than their original counterparts on all metrics, while requiring no extra inference computation.

 

Configuration    Abs Rel    Sq Rel    RMSE    RMSE log    δ<1.25    δ<1.25²    δ<1.25³
Baseline (no triplet loss)    0.115 0.903 4.863 0.193 0.877 0.959 0.981
Naive triplet loss    0.113 0.836 4.711 0.187 0.878 0.960 0.982
Triplet loss w/ hard negative only    0.110 0.782 4.622 0.185 0.881 0.963 0.983
Triplet loss w/ isolated triplet only    0.111 0.823 4.645 0.186 0.881 0.962 0.982
Full (hard negative + isolated triplet)    0.108 0.746 4.507 0.182 0.884 0.964 0.983

 

Table 3: Ablations of our different strategies on KITTI Eigen split [6]. Here, we use Monodepth2 [10] as the baseline. All models are trained with monocular videos and Resnet18 [17] as backbone. All results are not post-processed [9].

5.2 Comparison to State-of-the-art

We run experiments on the KITTI dataset [8], which consists of calibrated stereo video registered to LiDAR measurements of a city, captured from a moving car. We use the KITTI Eigen split [6] to evaluate the depth predictions, which is commonly used to compare MDE models. The depth evaluation is performed against the LiDAR point cloud, and we report all seven standard metrics; see [6] for more evaluation details. To compare fairly with recent state-of-the-art models, we evaluate with Garg's crop [7], using the standard 80 m cap [9]. We report the results in Tab. 1, showing that our model achieves unprecedented performance. Specifically, even though we make no use of stereo pairs during training, our model still outperforms all SoTA by a large margin on every metric. To enable real-time applications, we abandon the post-processing technique [10], which doubles the inference computation; even so, we still show comparable results to previous post-processed SoTA results [45, 30, 24, 47]. A visualization of the alleviated edge-fattening issue is shown in Fig. 6.
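For reference, the seven standard metrics reported in Tabs. 1-4 are usually computed as in the sketch below (ours, following the common Eigen-style evaluation; median scaling and the 80 m cap are assumed to be applied beforehand).

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard KITTI depth metrics [6] on 1-D arrays of valid depths (metres)."""
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()          # δ < 1.25
    a2 = (thresh < 1.25 ** 2).mean()     # δ < 1.25²
    a3 = (thresh < 1.25 ** 3).mean()     # δ < 1.25³

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, a1=a1, a2=a2, a3=a3)
```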

5.3 Generalizability

As mentioned before, our triplet loss is not only powerful, but also versatile and lightweight. After showing our unparalleled state-of-the-art performance, we demonstrate the high generalizability of the proposed triplet loss.

We integrate our triplet loss into a large number of existing open-source methods (some of which were SoTA before ours), and show that our triplet loss helps them achieve better scores, as reported in Tab. 2. Remarkably, in every integration we bring a substantial performance boost to the original model, on all seven metrics. Furthermore, we introduce no extra inference computation at all, which demonstrates the lightness of our method. For the benefit of the whole MDE community, we publish our full implementation, since we believe this powerful triplet loss can enhance more subsequent depth estimators.

5.4 Ablation Study

In Tab. 3, we validate the effectiveness of our two design decisions in turn. The original triplet loss, migrated from deep metric learning by Jung et al. [24], yields only a slight improvement over the baseline model [10]. Either of our two contributions substantially improves performance, and, not surprisingly, our full model performs best. It is also worth noting that neither of our contributions introduces any computation overhead at inference time.

5.4.1 Consideration on Margin

We further compare different values of the margin $\alpha'$ in our redesigned triplet loss, as shown in Tab. 4. We select $\alpha' = 0.65$ because it gives the best result; either increasing or decreasing it degrades performance. We point out that selecting the margin of a triplet loss in depth estimation is difficult, since we have no prior knowledge of whether a semantic boundary guarantees a large depth difference. When $\alpha'$ is too large, the triplet loss becomes too strict, leading to false positives: boundaries whose true depth difference is smaller than what the margin encourages are wrongly penalized. When $\alpha'$ is too small, the triplet loss becomes too tolerant, leading to false negatives: boundaries with a large true depth difference, but whose prediction shows only a small difference, are neglected by the optimization.

 

Margin α'    Abs Rel    Sq Rel    RMSE    RMSE log    δ<1.25    δ<1.25²    δ<1.25³
0.50 0.110 0.818 4.646 0.186 0.882 0.962 0.982
0.60 0.109 0.806 4.667 0.184 0.883 0.961 0.983
0.65 0.108 0.746 4.507 0.182 0.884 0.964 0.983
0.70 0.109 0.763 4.613 0.184 0.882 0.963 0.983
0.80 0.110 0.787 4.601 0.185 0.880 0.962 0.983

 

Table 4: Ablation on the margin $\alpha'$. All models are trained with monocular videos and ResNet-18 [17] as backbone. No results are post-processed [9].
Figure 6: Visual comparisons of an existing model with and without our triplet loss. The edge-fattening problem is significantly alleviated. The depths of thin structures are better aligned to their RGB counterparts, e.g. road signs and poles.

5.5 Can Current MDE Go into Production?

Apart from solving the ubiquitous edge-fattening problem, we also reveal new problems for the whole MDE community as future work. We hope the community will focus not only on performance numbers, but also on the barriers that prevent practical deployment.

The scenario shown in Fig. 7 is not from any open dataset, but was captured by one of the authors. As can be seen, many self-supervised MDE models fail completely at predicting the depth of the crossbar. We speculate that there are two probable reasons:

  • Existing methods rely too heavily on the shape features of images in the dataset, and, by chance, there are no horizontal bars in the KITTI dataset [8].

  • When training with stereo pairs, for the horizontal bar, the photometric loss for all disparities is small, thus no clear global optimum exists. The network therefore tends to mix the crossbar and the background road together.

We expect future MDE models to better generalize to real-life scenes via better feature representation techniques, e.g. deformable convolution [4] or data augmentation in feature space [5].

Figure 7: A fatal failure case for autonomous driving. The depth predictions of the crossbar in the middle of the input image are totally wrong, which could lead to a fatal error in autonomous driving. When the input image is flipped vertically, the depth of the crossbar is correctly estimated, but other objects, e.g. people, then become wrong instead.

6 Conclusion

In this paper, we address the notorious and ubiquitous edge-fattening issue in self-supervised MDE by introducing a well-designed triplet loss. We tackle two drawbacks of the original patch-based triplet loss in MDE. First, we propose to apply a minimum operator when computing the anchor-negative distance, preventing the error of the edge-fattening negatives from being masked. Second, we split the anchor-positive distance and anchor-negative distance from within the original triplet loss; this strategy provides more direct optimization of the positives, without any mutual effect with the negatives. Our triplet loss is powerful, versatile and lightweight. Experiments show that it not only brings our TriDepth model unprecedented SoTA performance, but also provides substantial performance boosts to a large number of existing models, without introducing any extra inference computation at all.

Acknowledgement. This work was supported by National Natural Science Foundation of China (No. 62172021).

References

  • [1] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1993) Signature verification using a "Siamese" time delay neural network. Advances in Neural Information Processing Systems 6. Cited by: §2.2.
  • [2] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova (2019) Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8001–8008. Cited by: Table 1.
  • [3] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, pp. 539–546. Cited by: §2.2.
  • [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017-10) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §5.5.
  • [5] T. DeVries and G. W. Taylor (2017) Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538. Cited by: §5.5.
  • [6] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283. Cited by: Table 1, §5.2, Table 2, Table 3.
  • [7] R. Garg, V. K. Bg, G. Carneiro, and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European conference on computer vision, pp. 740–756. Cited by: §2.1, §5.2.
  • [8] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361. Cited by: 1st item, §5.2.
  • [9] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279. Cited by: Table 1, §5.2, Table 2, Table 3, Table 4.
  • [10] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838. Cited by: §1, §2.1, §3, §4.1, §4.1, §4.1, Table 1, §5.2, §5.4, Table 2, Table 3.
  • [11] J. L. GonzalezBello and M. Kim (2020) Forget about the lidar: self-supervised depth estimators with med probability volumes. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 12626–12637. External Links: Link Cited by: §2.1.
  • [12] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova (2019) Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8977–8986. Cited by: Table 1.
  • [13] B. Griffin, V. Florence, and J. Corso (2020) Video object segmentation-based visual servo control and object depth estimation on a mobile robot. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1647–1657. Cited by: §1.
  • [14] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020) 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2485–2494. Cited by: Table 1.
  • [15] V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon (2020) Semantically-guided representation learning for self-supervised monocular depth. arXiv preprint arXiv:2002.12319. Cited by: Table 1.
  • [16] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §2.2, §4.2.5.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1, Table 2, Table 3, Table 4.
  • [18] P. Heise, S. Klose, B. Jensen, and A. Knoll (2013) Pm-huber: patchmatch with huber regularization for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2360–2367. Cited by: §4.1.
  • [19] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §4.2.3.
  • [20] N. Hirose, S. Koide, K. Kawano, and R. Kondo (2021) Plg-in: pluggable geometric consistency loss with wasserstein distance in monocular depth estimation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 12868–12874. Cited by: §2.1.
  • [21] P. Ji, R. Li, B. Bhanu, and Y. Xu (2021-10) MonoIndoor: towards good practice of self-supervised monocular depth estimation for indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12787–12796. Cited by: §2.1.
  • [22] J. Jiao, Y. Cao, Y. Song, and R. Lau (2018) Look deeper into depth: monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European conference on computer vision (ECCV), pp. 53–69. Cited by: §2.1.
  • [23] A. Johnston and G. Carneiro (2020) Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 4756–4765. Cited by: Table 1.
  • [24] H. Jung, E. Park, and S. Yoo (2021) Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12642–12652. Cited by: §1, §1, §2.2, §3, §4.2.1, §4.2.1, §4.2.1, Table 1, §5.1, §5.2, §5.4, Table 2.
  • [25] S. Kim, D. Kim, M. Cho, and S. Kwak (2020) Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3238–3247. Cited by: §4.2.1.
  • [26] H. Li, A. Gordon, H. Zhao, V. Casser, and A. Angelova (2020) Unsupervised monocular depth learning in dynamic scenes. arXiv preprint arXiv:2010.16404. Cited by: Table 1.
  • [27] Y. Lu and G. Lu (2021-01) An alternative of lidar in nighttime: unsupervised depth estimation based on single thermal image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3833–3843. Cited by: §2.1.
  • [28] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille (2019) Every pixel counts++: joint learning of geometry and motion with 3d holistic understanding. IEEE transactions on pattern analysis and machine intelligence 42 (10), pp. 2624–2641. Cited by: Table 1.
  • [29] X. Luo, J. Huang, R. Szeliski, K. Matzen, and J. Kopf (2020) Consistent video depth estimation. ACM Transactions on Graphics (ToG) 39 (4), pp. 71–1. Cited by: §1.
  • [30] X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan (2020) HR-Depth: high resolution self-supervised monocular depth estimation. arXiv preprint arXiv:2012.07356. Cited by: §2.1, §3, Table 1, §5.2, Table 2.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §5.1.
  • [32] V. Patil, W. Van Gansbeke, D. Dai, and L. Van Gool (2020) Don’t forget the past: recurrent depth estimation from monocular video. IEEE Robotics and Automation Letters 5 (4), pp. 6813–6820. Cited by: Table 1.
  • [33] S. Pillai, R. Ambruş, and A. Gaidon (2019) Superdepth: self-supervised, super-resolved monocular depth estimation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9250–9256. Cited by: Table 1.
  • [34] A. Pilzer, S. Lathuiliere, N. Sebe, and E. Ricci (2019) Refine and distill: exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9768–9777. Cited by: Table 1.
  • [35] M. Ramamonjisoa, M. Firman, J. Watson, V. Lepetit, and D. Turmukhambetov (2021) Single image depth prediction with wavelet decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11089–11098. Cited by: §2.1, Table 1.
  • [36] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive Collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12240–12249. Cited by: Table 1.
  • [37] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2.2, §4.2.1.
  • [38] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia (2019-06) Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [39] M. A. Uy and G. H. Lee (2018) PointNetVLAD: deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4470–4479. Cited by: §4.2.3.
  • [40] M. Vankadari, S. Garg, A. Majumder, S. Kumar, and A. Behera (2020) Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation. In European Conference on Computer Vision, pp. 443–459. Cited by: §2.1.
  • [41] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393. Cited by: §2.2.
  • [42] J. Wang, G. Zhang, Z. Wu, X. Li, and L. Liu (2020) Self-supervised joint learning framework of depth estimation via implicit cues. arXiv preprint arXiv:2006.09876. Cited by: Table 1.
  • [43] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8445–8453. Cited by: §1.
  • [44] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §4.1.
  • [45] J. Watson, M. Firman, G. J. Brostow, and D. Turmukhambetov (2019) Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2162–2171. Cited by: §1, §2.1, §3, Table 1, §5.2, Table 2.
  • [46] J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman (2021) The temporal opportunist: self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1164–1174. Cited by: §1, §2.1, §3, §4.1, §4.1, Table 1, §5.1, Table 2.
  • [47] J. Yan, H. Zhao, P. Bu, and Y. Jin (2021) Channel-wise attention-based network for self-supervised monocular depth estimation. In 2021 International Conference on 3D Vision (3DV), pp. 464–473. Cited by: Table 1, §5.2, Table 2.
  • [48] D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Deep metric learning for person re-identification. In 2014 22nd international conference on pattern recognition, pp. 34–39. Cited by: §4.2.1.
  • [49] Y. You, Y. Wang, W. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar++: accurate depth for 3d object detection in autonomous driving. arXiv preprint arXiv:1906.06310. Cited by: §1.
  • [50] B. Yu, T. Liu, M. Gong, C. Ding, and D. Tao (2018) Correcting the triplet selection bias for triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 71–87. Cited by: §4.2.3.
  • [51] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858. Cited by: §1, §2.1, §3, §4.1, §4.1, Table 1, Table 2.
  • [52] S. Zhu, G. Brazil, and X. Liu (2020) The edge of depth: explicit constraints between segmentation and depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13116–13125. Cited by: §2.1.
  • [53] Y. Zhu, K. Sapra, F. A. Reda, K. J. Shih, S. Newsam, A. Tao, and B. Catanzaro (2019) Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8856–8865. Cited by: §4.2.1.
  • [54] B. Zhuang, G. Lin, C. Shen, and I. Reid (2016) Fast training of triplet-based deep binary embedding networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5955–5964. Cited by: §2.2.