On the Sins of Image Synthesis Loss for Self-supervised Depth Estimation

09/13/2021, by Zhaoshuo Li, et al.

Scene depth estimation from stereo and monocular imagery is critical for extracting 3D information for downstream tasks such as scene understanding. Recently, learning-based methods for depth estimation have received much attention due to their high performance and flexibility in hardware choice. However, collecting ground truth data for supervised training of these algorithms is costly or outright impossible. This circumstance suggests a need for alternative learning approaches that do not require corresponding depth measurements. Indeed, self-supervised learning of depth estimation provides an increasingly popular alternative. It is based on the idea that observed frames can be synthesized from neighboring frames if accurate depth of the scene is known - or in this case, estimated. We show empirically that - contrary to common belief - improvements in image synthesis do not necessarily translate into improvements in depth estimation. Rather, optimizing for image synthesis can result in diverging performance with respect to the main prediction objective - depth. We attribute this diverging phenomenon to aleatoric uncertainties, which originate from data. Based on our experiments on four datasets (spanning street, indoor, and medical) and five architectures (monocular and stereo), we conclude that this diverging phenomenon is independent of the dataset domain and not mitigated by commonly used regularization techniques. To underscore the importance of this finding, we include a survey of methods which use image synthesis, totaling 127 papers over the last six years. This observed divergence has not been previously reported or studied in depth, suggesting room for future improvement of self-supervised approaches which might be impacted by this finding.


1 Introduction

Depth estimation from images has long been recognized as an important research area and has gained substantial interest in recent years given its utility for downstream tasks such as scene understanding, registration, navigation, and control. Learning-based models often work well under strong supervision, but acquiring the necessary ground truth depth data can be cost prohibitive or impossible and often requires additional planning and extra hardware (godard2017unsupervised).

Self-supervised learning methods, which construct auxiliary objectives for training models, are often used to overcome the lack of ground truth depth information. Most work in self-supervised learning for depth estimation synthesizes images from different, but neighboring, viewpoints into one common frame, given estimated depth and ego motion, and maximizes the similarity between observed and synthesized frames in place of ground truth supervision. The learning task is shown in Figure 1. Given a disparity prediction (which is proportional to inverse depth, we use the terms interchangeably) from a primary coordinate frame and relative camera pose, the image from a neighboring viewpoint is warped to the primary frame, followed by the computation of a synthesis loss based on the appearance difference between the observed and synthesized images. The ideas behind this self-supervised approach are valid given that neighboring frames only differ in viewpoint. This paradigm creates new opportunities to train depth estimation networks in previously data-constrained domains and has led to many top-performing self-supervised depth estimation networks (xie2016deep3d; godard2019digging; tonioni2019real).

Figure 1: Overall process for self-supervised depth estimation using image synthesis. $I_p$ is the image of the primary frame and $I_n$ is the neighboring frame used to generate the synthesized image $\hat{I}_p$, given the disparity estimation $d$ and the relative camera pose $T$. In monocular depth networks, only $I_p$ is used to predict disparity. In stereo depth networks, both images are used to predict disparity.

However, during internal experimentation using the image synthesis loss for refining depth estimation networks after transfer to different data domains, we observed divergence (formal definition in Section 2.5) between performance on the surrogate objective (image synthesis) and the true target objective (depth estimation). We observed that even in cases where the networks' final performance improved, optimization failed to stabilize at the best operating point found; in other cases, the depth estimation performance of the networks directly worsened as image synthesis improved. This circumstance is worrisome because in the faithful self-supervised learning setting, where true depth measurements are unknown, a well-intended attempt at improving depth estimation performance of a transferred model through fine-tuning using image synthesis may, in fact, result in the exact opposite: worsening of depth estimation performance. Therefore, in this paper, we raise the following questions:

  1. Is image synthesis effective as an auxiliary task for training deep depth estimation networks? Yes, but only to a certain extent. We conduct extensive experiments using various contemporary networks on several datasets. We show in Section 4.1 that the gradient from image synthesis will actually disagree with that of the theoretically unobserved depth loss after a certain point. Thus, updating parameters in the negative gradient direction determined from the image synthesis loss will instead worsen depth prediction. The divergence is observed both at image-level and pixel-level (Section 4.2).

  2. Can perfect image synthesis be achieved from ground truth depth information? In short, no. In Section 4.3, we demonstrate that even using the ground truth depth, image synthesis is far from perfect. This suggests that optimizing for image synthesis may indeed deviate from the true objective of depth estimation.

  3. What is behind the diverging phenomenon between image synthesis and depth estimation? In Section 4.4, we analyze the loss manifold of image synthesis at places where we observe divergence. We show the disagreement between image synthesis loss and depth error, and the non-convexity of image synthesis loss at these places. We attribute the source of uncertainties to the data itself, also known as aleatoric uncertainty.

  4. What is the larger significance of this finding? In Appendix I, we summarize a literature review following PRISMA guidelines (moher2009preferred) that resulted in 127 papers using image synthesis loss, or small variations thereof. None of these papers reported a diverging behaviour, which underscores the impact of the image synthesis paradigm on self-supervised depth estimation. We hope that identifying this issue will enable further research on improving on the clearly impactful image synthesis paradigm for self-supervised training.

2 Background and Related Work

2.1 Learning-based Depth Estimation

Recently, deep learning has shown promising performance for depth estimation, where depth can be estimated from a monocular image or from stereo image pairs. In both cases, the depth estimate $z$ is calculated from the disparity estimate $d$, with $z \propto 1/d$. Monocular depth networks rely on data priors to regress up-to-scale depth from a single image (dijk2019neural), i.e., the exact proportionality between depth and disparity is unknown; stereo depth networks instead rely on pixel matching between image pairs to produce metric depth estimates (szeliski2010computer), i.e., depth is found exactly as $z = bf/d$, where $b$ is the stereo baseline and $f$ is the focal length. Both monocular and stereo networks can be trained in a self-supervised way using image synthesis (Figure 1).
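For concreteness, the stereo disparity-to-depth relation can be written as a one-line helper. The sketch below is illustrative only; the function name and the clamp guarding against division by zero are our own additions and not part of any specific network:

```python
import torch

def disparity_to_depth(disparity, baseline, focal_length):
    """Metric depth from stereo disparity: z = b * f / d.

    Assumes the usual units (baseline in meters, focal length and disparity in
    pixels, yielding depth in meters); the clamp avoids division by zero.
    """
    return baseline * focal_length / torch.clamp(disparity, min=1e-6)
```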

To ensure that our findings regarding the image synthesis loss are not limited to a specific network design, we have selected a diverse set of architectures, including both monocular and stereo networks. Due to the high sensitivity of monocular depth networks to camera setup and scene variation (dijk2019neural), we only included the SOTA MonoDepth2 (godard2019digging). As stereo networks are better constrained in terms of depth estimation, we included several recent networks of different design: the convolutional networks MADNet (tonioni2019real), HSM (yang2019hierarchical), and GwcNet (guo2019group), and the transformer network STTR (li2020revisiting). We refer interested readers to Appendix A and the original papers for more details.

2.2 Image Synthesis

As illustrated in Figure 1, a synthesized image can be generated given the disparity, the relative camera pose, and the image from a neighboring viewpoint $I_n$. This can be achieved either by using images from concurrent observations from different cameras (spatial stereo) or a series of observations from the same camera (temporal stereo). In both cases, image synthesis is only valid for co-observed and static objects (godard2019digging). In our experiments, we only use the valid pixels and the ground truth camera poses so that the only varying factor of image synthesis is the estimated disparity. To enable gradients to flow backwards through the disparity prediction to the network parameters, differentiable warping (jaderberg2015spatial) is used for synthesizing images.
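As a concrete illustration of the differentiable warping step for the rectified spatial-stereo case, the sketch below samples the neighboring (right) image at horizontally shifted coordinates with bilinear interpolation. Function and argument names are our own, and the sign convention assumes the disparity is predicted for the primary (left) frame:

```python
import torch
import torch.nn.functional as F

def warp_to_primary(img_neighbor, disparity):
    """Synthesize the primary-frame image by sampling the neighboring (right)
    view at disparity-shifted coordinates via differentiable bilinear sampling.

    img_neighbor: (B, C, H, W) neighboring image
    disparity:    (B, 1, H, W) disparity predicted for the primary frame, in pixels
    """
    b, _, h, w = img_neighbor.shape
    # Pixel grid of the primary frame.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    xs = xs.to(img_neighbor.device).expand(b, h, w)
    ys = ys.to(img_neighbor.device).expand(b, h, w)
    # For rectified pairs, pixel (x, y) in the primary view maps to (x - d, y).
    xs_src = xs - disparity.squeeze(1)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack((2.0 * xs_src / (w - 1) - 1.0,
                        2.0 * ys / (h - 1) - 1.0), dim=-1)
    # Bilinear sampling keeps gradients flowing back to the disparity.
    return F.grid_sample(img_neighbor, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```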

2.3 Self-supervised Training Using Image Synthesis for Depth Estimation

The most commonly used image synthesis loss was first proposed for image denoising and demosaicing tasks by Zhao et al. (zhao2015loss) and later adapted for depth estimation by Godard et al. (godard2017unsupervised). The loss itself is typically formulated as a weighted sum of a structural similarity (SSIM) loss (wang2004image) and an L1 loss between the observed image $I$ and the synthesized image $\hat{I}$, which can be written as

$$\mathcal{L}_{IS} = \alpha \, \frac{1 - \mathrm{SSIM}(I, \hat{I})}{2} + (1 - \alpha) \, \mathcal{L}_{1}(I, \hat{I}) \tag{1}$$
$$\mathrm{SSIM}(I, \hat{I}) = \frac{(2 \mu_{I} \mu_{\hat{I}} + c_1)(2 \sigma_{I \hat{I}} + c_2)}{(\mu_{I}^2 + \mu_{\hat{I}}^2 + c_1)(\sigma_{I}^2 + \sigma_{\hat{I}}^2 + c_2)} \tag{2}$$
$$\mathcal{L}_{1}(I, \hat{I}) = \frac{1}{N} \sum_{i=1}^{N} \lvert I_i - \hat{I}_i \rvert \tag{3}$$

$I_i$ and $\hat{I}_i$ are the $i$-th pixels of the $N$ total pixels. In Equation 2, $\mu$ and $\sigma$ are the mean and standard deviation within a local patch, and $c_1$ and $c_2$ are small constants to avoid division by 0. The synthesis loss is designed to measure both shape and appearance similarities. SSIM is mean balanced; therefore, it better preserves high-frequency signals and is robust to chromatic differences. Conversely, L1 preserves color and luminance regardless of local structure. $\alpha$ is a weighting factor controlling the trade-off between the two loss terms.
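A minimal sketch of Equations 1-3 is given below, assuming average pooling for the local statistics and illustrative defaults ($\alpha = 0.85$ and constants on the order of $10^{-4}$), which are common choices rather than necessarily the exact settings used here:

```python
import torch
import torch.nn.functional as F

def synthesis_loss(img, img_synth, alpha=0.85, c1=1e-4, c2=9e-4, kernel=3):
    """Weighted SSIM + L1 image synthesis loss (Equations 1-3).

    Local means and (co)variances are computed with average pooling over a
    square patch; alpha, c1, c2 and the kernel size are illustrative defaults.
    """
    pad = kernel // 2
    mu_x = F.avg_pool2d(img, kernel, 1, pad)
    mu_y = F.avg_pool2d(img_synth, kernel, 1, pad)
    sigma_x = F.avg_pool2d(img ** 2, kernel, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(img_synth ** 2, kernel, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(img * img_synth, kernel, 1, pad) - mu_x * mu_y

    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_loss = torch.clamp((1 - ssim) / 2, 0, 1).mean()
    l1_loss = torch.abs(img - img_synth).mean()
    return alpha * ssim_loss + (1 - alpha) * l1_loss
```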

2.4 Additional Regularization and Auxiliaries

It has previously been observed that $\mathcal{L}_{IS}$ suffers from a gradient locality problem, where both the depth loss and $\mathcal{L}_{IS}$ stagnate, since $\mathcal{L}_{IS}$ relies on the comparison of localized pixel intensities (sharma2019unsupervised; yin2018geonet). Multi-scale regularization, i.e., estimating disparities and computing $\mathcal{L}_{IS}$ at different spatial scales, has been reported to mitigate the gradient locality problem (sharma2019unsupervised; godard2017unsupervised; godard2019digging). Additionally, an edge-aware smoothness regularization term $\mathcal{L}_{SMTH}$ on the disparity map has also been included (li2018occlusion; tonioni2019real) to encourage smoothness of the estimation,

$$\mathcal{L}_{SMTH} = \lvert \partial_x d \rvert \, e^{-\lvert \partial_x I \rvert} + \lvert \partial_y d \rvert \, e^{-\lvert \partial_y I \rvert} \tag{4}$$

Here $d$ is the estimated disparity map, and $\partial$ denotes the first-order gradient. Due to the popularity of these two regularization approaches, we conduct additional experiments and show in Appendix B that neither of them prevents the divergent phenomenon observed (often only mitigating it to a negligible extent).
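A sketch of Equation 4 is shown below; averaging the image gradient over color channels is one common convention and an assumption of this illustration:

```python
import torch

def smoothness_loss(disparity, img):
    """Edge-aware first-order smoothness regularization (Equation 4): disparity
    gradients are down-weighted where the image itself has strong gradients."""
    # First-order gradients of the disparity map.
    d_dx = torch.abs(disparity[:, :, :, :-1] - disparity[:, :, :, 1:])
    d_dy = torch.abs(disparity[:, :, :-1, :] - disparity[:, :, 1:, :])
    # Image gradients, averaged over color channels.
    i_dx = torch.mean(torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]), 1, keepdim=True)
    i_dy = torch.mean(torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]), 1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```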

Other auxiliary techniques have also been proposed. For example, (wang2020faster) uses deep reinforcement learning to decide whether the gradient computed from $\mathcal{L}_{IS}$ for a specific stereo pair of images should be back-propagated to the network. In (zhong2018open), a recurrent neural network is used to backpropagate the gradient of $\mathcal{L}_{IS}$ conditioned on a sequence of past frames. We do not consider these works in our experiments because they add additional sophistication and challenges that would be conflated with the divergence issue under examination.

2.5 Divergence and Spearman’s Rank Coefficients

In our experiments, we report the accuracy of disparity estimation via the end-point-error (EPE), i.e., the absolute disparity error. Since EPE and $\mathcal{L}_{IS}$ are in different units, we quantitatively assess the divergence between $\mathcal{L}_{IS}$ and EPE through their monotonic relationship, which is quantified using Spearman's rank coefficient $\rho$ (ramsey1989critical) due to its robustness against outliers and noise (schober2018correlation). Given two co-observed sets of variables $X = \{x_i\}$ and $Y = \{y_i\}$ of size $N$,

$$\rho = \frac{\sum_{i=1}^{N} \left(R(x_i) - \bar{R}_X\right)\left(R(y_i) - \bar{R}_Y\right)}{\sqrt{\sum_{i=1}^{N} \left(R(x_i) - \bar{R}_X\right)^2 \sum_{i=1}^{N} \left(R(y_i) - \bar{R}_Y\right)^2}} \tag{5}$$

where $R(\cdot)$ is the rank and $\bar{R}$ is the mean rank. In Equation 5, the numerator is the cross-covariance between the ranks of $X$ and $Y$, while the denominator is the product of the standard deviations of the ranks. The coefficient ranges from -1 to +1, with -1 indicating an exact negative monotonic trend, +1 indicating an exact positive monotonic trend, and 0 indicating no correlation, as shown in Figure 2. Significance of $\rho$ is reported via critical probability ($p$) values. We introduce the following definition of divergence:

Definition 1.

Given two co-observed sets of variables $X$ and $Y$ of size $N$, a divergence is observed between $X$ and $Y$ if both of the following conditions are met:

  1. $\rho \le 0.39$: a weak or negative correlation between the two variables is observed,

  2. $p < 0.05$: the result is statistically significant.

The above thresholds are adopted from commonly established values (guilford1950fundamental; rowntree1981statistics; de2014basic). We show in Appendix C that other thresholds used in different specialities do not affect the result significantly. Intuitively, divergence occurs when $\mathcal{L}_{IS}$ decreases while EPE 1) stagnates, as shown in Figure 2 (b); 2) decreases then increases (or vice versa), as shown in Figure 2 (c); or 3) increases, as shown in Figure 2 (d).

Figure 2: Examples of Spearman's rank coefficients between two variables $X$ and $Y$: (a) positive, (b-c) weak, and (d) negative monotonic relationships.
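In practice, Definition 1 can be checked for each training run with an off-the-shelf Spearman correlation. The sketch below uses SciPy; the 0.05 significance level is an assumption of this illustration:

```python
from scipy import stats

def is_divergent(loss_values, epe_values, rho_threshold=0.39, p_threshold=0.05):
    """Check Definition 1 for one training run: divergence is declared when the
    Spearman correlation between the synthesis loss and EPE is weak or negative
    and the result is statistically significant."""
    rho, p = stats.spearmanr(loss_values, epe_values)
    return (rho <= rho_threshold) and (p <= p_threshold)
```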

2.6 Aleatoric Uncertainty

Uncertainty is an active topic in machine/deep learning. In general, uncertainty can be modeled as epistemic and aleatoric uncertainty (der2009aleatory). Epistemic uncertainty refers to the ignorance of the model, for example, the limited capacity of a specific network architecture design. On the other hand, aleatoric uncertainty refers to the randomness of the problem or data itself. An example of aleatoric uncertainty is the coin flipping problem (hullermeier2021aleatoric), where even given the best possible model, we still cannot obtain a definite head/tail prediction for the next coin flip. We refer interested readers to (hullermeier2021aleatoric) for an in-depth review of the relevant literature. In this work, we reduce epistemic uncertainty as much as possible by using various contemporary networks and ensuring sufficient training of each network on the target dataset, which leaves the observed divergence between image synthesis and depth estimation attributable to aleatoric uncertainties originating from the training data.

In the depth estimation literature, prior work such as (godard2019digging; luo2019every; vijayanarasimhan2017sfm) models aleatoric uncertainties with a pixel-wise mask that excludes certain pixels from the training process for better performance. The excluded pixels include occluded pixels that are not co-observed by the two frames and pixels of dynamic objects that break the static-scene assumption. However, none of this prior work reported the diverging phenomenon even after excluding these pixels.

3 Experimental Setup

Instead of only reporting the final validation performance as in prior work, we monitor how $\mathcal{L}_{IS}$ and EPE each change at every epoch to check whether there is a divergence between $\mathcal{L}_{IS}$ and EPE.

3.1 Dataset

We conduct experiments on four datasets to demonstrate that the disagreement between $\mathcal{L}_{IS}$ and the depth prediction error is not limited to a single domain. The datasets of interest provide varied image statistics and scene complexity. Furthermore, the ground truth data are captured with different depth modalities, which constitutes a stable backdrop for observing the effects of self-supervision. The selected datasets have corresponding ground truth depth, which can be used to monitor divergence. Since real datasets may have chromatic differences between left/right images, we align the appearance of the left and right images using a per-channel shift (jin2001real). We normalize the images to 0-1 when computing $\mathcal{L}_{IS}$ to make SSIM valid (wang2004image). More details about the datasets used can be found in Appendix A.
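A minimal sketch of such a per-channel alignment is given below; matching per-channel means is one simple realization and not necessarily the exact scheme of (jin2001real):

```python
import torch

def align_channels(img_left, img_right):
    """Shift each channel of the right image so its mean matches the left image,
    reducing chromatic differences before the synthesis loss is computed.
    Assumes images are already normalized to [0, 1]."""
    shift = img_left.mean(dim=(2, 3), keepdim=True) - img_right.mean(dim=(2, 3), keepdim=True)
    return torch.clamp(img_right + shift, 0.0, 1.0)
```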

3.1.1 Spatial Stereo Dataset

KITTI2015 (menze2015object) contains 200 pairs of street-scene images with sparse ground truth disparity maps computed from LiDAR. We use the first 150 images for training and the remaining 50 images for validation. SERV-CT (eddie2020serv) contains 16 pairs of ex vivo endoscopic images with dense ground truth disparity maps computed from an aligned CT scan. We use the first 8 images for training and the remaining 8 images for validation. Middlebury2014 Q (scharstein2014high) contains 15 images of indoor scenes at various image resolutions with dense disparity maps acquired via structured light. We use the first 10 images for training and the remaining 5 images for validation.

3.1.2 Temporal Stereo Dataset

Since dynamic objects violate the image synthesis assumption, we use the synthetic New Tsukuba Stereo Dataset (fluorescent illumination) (martull2012realistic), which consists of static indoor scenes with associated ground truth camera poses and dense disparity maps. We use the first 130 frames for training and the remaining 50 frames for validation. Using this dataset allows us to investigate the temporal effect of self-supervised depth estimation.

3.2 Hyperparameters

We use the pre-trained weights provided by the authors (see Appendix A for more details). We scale the originally used learning rate by 0.1, as the pre-trained weights are already well-initialized. We use the AdamW optimizer with a weight decay of 1e-4 (loshchilov2017decoupled). We demonstrate that the choice of optimizer does not fundamentally change the result (see Appendix G).

We follow prior work (godard2017unsupervised; tonioni2019real) and set $\alpha = 0.85$ in Equation 1, set $c_1$ and $c_2$ to small constants on the order of $10^{-4}$, and fix the kernel size used to compute SSIM in Equation 2. We also show that these hyper-parameters ($\alpha$, $c_1$, $c_2$, and the kernel size) marginally affect the result but do not resolve the divergence issue (see Appendices E and F). Following (godard2019digging), we upsample disparities from each scale to full resolution, set the weighting for $\mathcal{L}_{IS}$ from each scale to 1, and set the weighting for $\mathcal{L}_{SMTH}$ to 1e-2. We train for 200 epochs on all datasets. We run all experiments on one Nvidia Titan RTX GPU.

4 Results and Discussion

4.1 Divergence in validation

The quantitative result is summarized in Table 1. Comparing the final EPE against the initial EPE, some networks are able to improve on certain datasets (highlighted in blue), which agrees with the conclusion of prior work that image synthesis improves depth estimation. However, many networks have worse performance after optimizing for image synthesis (highlighted in red). Furthermore, we note that $\rho$ is not always positive and often falls below the weak-correlation threshold (highlighted in red), which indicates that EPE does not strictly follow $\mathcal{L}_{IS}$ monotonically. The only strongly correlated case without any divergence is HSM trained on Middlebury2014. The fraction of diverging cases is also reported; on average, 0.54 of the training instances across different networks and datasets exhibit divergence.

We plot examples of divergence from two datasets in Figure 3. It is common across networks that a positive correlation exists between EPE and $\mathcal{L}_{IS}$ while $\mathcal{L}_{IS}$ is large, but after reaching the minimum, the network worsens w.r.t. depth. In other cases, the best operating point found is the starting point. This observation suggests a practical upper bound on depth estimation performance when optimizing for image synthesis: once a network has improved on the target domain, further optimization of image synthesis can be harmful, with networks failing to stabilize at the best operating point found. Lastly, divergence is found on both spatial and temporal stereo datasets. We also observe divergence of $\mathcal{L}_{IS}$ and EPE when evaluating on the training data. Details can be found in Appendix B.

Figure 3: Validation results as training proceeds on (a) the KITTI2015 dataset and (b) the SERV-CT dataset.
Dataset Network Initial EPE (mean, std) Final EPE (mean, std) ΔEPE ρ (mean, std) Divergence
MonoDepth2 6.24 3.49 4.07 2.72 -2.18 (-35%) +0.41 0.38 0.20
MADNet 8.28 5.83 1.92 1.70 -6.35 (-73%) +0.35 0.45 0.36
HSM 1.31 0.42 1.27 0.40 -0.04 (-3%) +0.04 0.60 0.48
GwcNet 1.27 0.44 1.37 0.47 +0.10 (+9%) -0.41 0.58 0.82
KITTI2015 STTR 1.50 0.81 2.53 2.79 +1.03 (+66%) -0.01 0.21 0.40
MonoDepth2 28.18 5.31 11.85 5.39 -16.33 (-59%) +0.65 0.51 0.25
MADNet 57.71 23.67 35.10 6.23 -22.61 (-18%) +0.34 0.60 0.50
HSM 2.38 0.72 2.79 1.00 +0.42 (+18%) -0.29 0.67 0.62
GwcNet 3.53 1.89 2.73 0.82 -0.80 (-7%) -0.58 0.56 0.88
SERV-CT STTR 3.70 2.06 4.56 3.05 +0.86 (+20%) -0.73 0.38 1.00
MonoDepth2 17.07 8.52 12.83 8.58 -4.24 (-28%) -0.28 0.83 0.60
MADNet 13.06 4.50 14.58 3.88 +1.52 (+18%) -0.29 0.51 0.60
HSM 2.60 1.57 2.10 1.40 -0.49 (-21%) +0.89 0.17 0.00
GwcNet 3.07 4.00 3.21 4.17 +0.14 (+4%) +0.42 0.60 0.20
Middlebury2014 STTR 1.69 2.66 1.33 1.24 -0.35 (-5%) -0.58 0.65 0.87
MonoDepth2 10.74 4.09 8.45 3.21 -2.29 (-20%) +0.26 0.38 0.38
MADNet 26.13 11.99 16.57 6.20 -9.56 (-30%) 0.00 0.42 0.56
HSM 1.66 0.47 1.62 0.39 -0.04 (-1%) +0.15 0.47 0.48
GwcNet 1.69 1.14 1.79 1.06 +0.09 (+11%) -0.68 0.21 0.56
Tsukuba STTR 2.01 2.67 21.95 7.98 +19.94 (+1458%) -0.89 0.16 1.00
Average 0.54
Table 1: Validation result on four datasets. ΔEPE: change between final and initial EPE (blue indicates EPE decreases; red indicates EPE increases). ρ: Spearman's rank coefficient (blue indicates average above 0.39; red indicates average below 0.39). Divergence: fraction of diverging training instances observed based on Definition 1.

4.2 Change of EPE with Decreasing $\mathcal{L}_{IS}$

Instead of comparing the aggregated $\mathcal{L}_{IS}$ and EPE over the whole image, we compute the aggregated change of EPE only on pixels where $\mathcal{L}_{IS}$ decreases. This further isolates the inconsistent behavior. The quantitative result is summarized in Table 2. We note that even though in most cases the average change of EPE is negative (i.e., a smaller error after training), the standard deviation is much larger, indicating that many pixels have increasing error. We further highlight in red five cases where, on average, these pixels have increasing EPE. Therefore, minimization of $\mathcal{L}_{IS}$ is indeed not equivalent to minimization of EPE, despite the theoretical justification of the image synthesis paradigm. Qualitative results can be found in Appendix D.

ΔEPE (mean, std)
Network KITTI2015 SERV-CT Middlebury2014 New Tsukuba
MonoDepth2 -1.54 3.98 -24.20 40.26 -4.04 11.77 -0.54 5.36
MADNet -1.95 13.58 -19.03 27.36 -0.60 4.86 +0.53 3.35
HSM +1.30 1.67 +1.28 4.97 -0.66 1.18 -0.70 3.88
GwcNet +0.25 1.49 -5.42 19.26 -0.12 1.66 -0.13 1.75
STTR -1.06 3.51 -6.18 20.14 -0.13 1.07 +1.23 4.11
Table 2: Change of EPE (mean, std) on pixels where $\mathcal{L}_{IS}$ decreases.
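The statistic in Table 2 can be computed per pixel as sketched below; variable names are illustrative, and the inputs are per-pixel maps from before and after training:

```python
import torch

def epe_change_where_loss_decreases(epe_initial, epe_final,
                                    loss_initial, loss_final, valid_mask):
    """Aggregate the change in end-point-error only over valid pixels whose
    per-pixel synthesis loss decreased during training."""
    mask = valid_mask & (loss_final < loss_initial)
    delta = (epe_final - epe_initial)[mask]
    return delta.mean(), delta.std()
```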

4.3 Ground Truth Disparity and Non-zero $\mathcal{L}_{IS}$

In fact, we show that even with ground truth disparity, $\mathcal{L}_{IS}$ will not be 0 (Table 3). Therefore, if a network hypothetically predicts perfect disparity, non-zero gradients of $\mathcal{L}_{IS}$ w.r.t. disparity will still pass to the network. In this case, the non-zero gradient may perturb the model weights, leading to potentially divergent behavior. Though this non-zero gradient may be averaged out during the training process, we showed in Section 4.1 that divergence occurs after all. We visualize examples in Figure 4. Despite visual similarities between the actual and synthesized images, there are non-zero errors across the whole image. Appendix E provides an additional experiment showing that optimizing either $\mathcal{L}_{SSIM}$ or $\mathcal{L}_{1}$ alone can also lead to divergent learning.

Loss KITTI2015 SERV-CT Middlebury2014 New Tsukuba
$\mathcal{L}_{IS}$ 0.120 0.023 0.053 0.035
$\mathcal{L}_{SSIM}$ 0.132 0.023 0.057 0.039
$\mathcal{L}_{1}$ 0.051 0.022 0.035 0.011
Table 3: $\mathcal{L}_{IS}$, $\mathcal{L}_{SSIM}$ and $\mathcal{L}_{1}$ when ground truth disparity is used.
Figure 4: Visualization of (a) the input image, (b) the synthesized image, and (c) the per-pixel error (occluded regions or regions without valid ground truth depth data are shown in black). Left to right: KITTI2015, SERV-CT, Middlebury2014 and New Tsukuba dataset.

4.4 Aleatoric Uncertainties of Image Synthesis

After observing the divergence, we analyze the loss landscapes (cheng2021explore) at places where divergence occurs (Figure 5). As shown, the image synthesis loss at the pixel level is non-convex w.r.t. depth, and its minimum deviates from the ground truth location (i.e., non-zero EPE). We further elaborate below:

Figure 5: Aleatoric uncertainties and their associated loss landscapes. Top row: input images. Second row: zoomed-in view of the primary image (left) and the neighboring image (right). The red cross indicates the corresponding pixel using ground truth depth, and the blue cross indicates the location of the minimum found within the local window. Bottom row: loss landscape of $\mathcal{L}_{IS}$ w.r.t. EPE.

Specular reflection and shadow - Specular reflections on non-Lambertian surfaces and shadows occlude the true surface appearance and complicate the image synthesis process when the surface is observed from different positions. One example is illustrated in Figure 5 (a), where the differences between the reflections in the two images conflate with the true disparity value.

Object/occlusion edge - When a large disparity change occurs at object/occlusion edges, pixels have different local context in the primary and neighboring images, which leads to false correspondences as shown in Figure 5 (b).

Lack of texture or repeated texture - In a textureless (over/under-exposed or uniformly textured) or repeatedly textured area, many disparity values can minimize $\mathcal{L}_{IS}$, which makes image synthesis prone to error. Even if smoothness regularization is used, it does not guarantee the correctness of the disparity estimation (see the supplementary video material).
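A one-dimensional slice of such a landscape can be obtained by sweeping candidate disparities for a single pixel, as sketched below. For brevity the sketch uses only the L1 term over a local patch, whereas the landscapes in Figure 5 use the full $\mathcal{L}_{IS}$; names and the patch size are illustrative:

```python
import torch

def pixel_loss_landscape(img_primary, img_neighbor, x, y, d_max, patch=3):
    """Return the L1 photometric cost of the patch around pixel (x, y) for each
    candidate disparity in [0, d_max), i.e., a 1-D slice of the loss landscape.
    Assumes the swept window stays inside the image."""
    r = patch // 2
    assert x - (d_max - 1) - r >= 0, "sweep must stay inside the image"
    ref = img_primary[:, :, y - r:y + r + 1, x - r:x + r + 1]
    costs = []
    for d in range(d_max):
        cand = img_neighbor[:, :, y - r:y + r + 1, x - d - r:x - d + r + 1]
        costs.append(torch.abs(ref - cand).mean().item())
    return costs
```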

5 Significance of the Finding

Our primary finding demonstrates that self-supervised optimization of $\mathcal{L}_{IS}$ is limited in its ability to reduce the depth prediction error, especially once depth estimation has already improved to a sufficient extent during self-supervised learning. While there are cases where the final depth estimation performance on a target domain may improve (which has been the focus of most published work to date), a network may, contrary to common belief, exhibit divergent training and fail to stabilize at the best operating point found (Figure 3). This finding may point to inherent limitations of contemporary self-supervised methods for depth estimation and, further, casts doubt on using SSIM between the synthesized and primary images as an evaluation metric for assessing depth estimation performance without ground truth (gur2019single).

6 Conclusion and Limitations

In this work, we have empirically demonstrated a disagreement between the image synthesis loss and the final depth estimation accuracy during self-supervised learning of depth estimation. This phenomenon provides evidence for practical limits of image synthesis for decreasing the depth prediction error due to aleatoric uncertainties. We found 127 papers using this paradigm, yet none of them reports such cases. While we cannot contribute a solution to this issue at the moment, we believe that it warrants discussion due to the growing prevalence of image synthesis-based self-supervision. Our results are currently limited to small datasets without large-scale training due to the limited availability of ground truth data. Nor do we consider the noise and artifacts associated with the ground truth data, which may influence the result. Future work may improve upon our findings by explicitly identifying the regions which exhibit aleatoric uncertainties. Exploring additional constraints beyond image synthesis (such as geometric information (wei2020deepsfm)) for training depth estimation networks, as well as causal analysis (pearl2009causality) of the aleatoric uncertainties and divergence, are of interest for future work.

References

Appendix

Appendix A Networks and Dataset Details

MonoDepth2 [godard2019digging] is a SOTA monocular depth estimation network. The network predicts disparity through a sigmoid layer. The pre-trained weight used is named stereo_, which is trained in a self-supervised setup using stereo images on the KITTI RAW dataset [geiger2013vision] (which is different from the KITTI2015 dataset). The code and pre-trained weights are released under the Monodepth v2 License.

MADNet [tonioni2019real] is a stereo depth estimation network. The network has a correlation layer for predicting disparity. The pre-trained weight used is named Synthetic, which is trained with ground truth supervision on Scene Flow [mayer2016large]. The code and pre-trained weights are released under the Apache-2.0 License.

HSM [yang2019hierarchical] is a stereo depth estimation network. The network uses 3D convolutions for predicting disparities. The pre-trained weight used is named Middlebury model, which is trained on customized datasets (see the paper for more details). The code and pre-trained weights are released under the MIT License.

GwcNet [guo2019group] is a stereo depth estimation network. The network uses both correlation and 3D convolutions to predict disparities. The pre-trained weight used is provided in [li2020revisiting] which is trained using color-augmentation on Scene Flow [mayer2016large] with ground truth supervision. The code and pre-trained weights are released under the MIT License.

STTR [li2020revisiting] is a stereo depth estimation network. The network is a transformer based network which uses attention to predict disparities. The pre-trained weight used is named STTR-light, which is trained using color-augmentation on Scene Flow [mayer2016large]. The code and pre-trained weights are released under the Apache-2.0 License.

KITTI2015[menze2015object] uses the CC BY-NC-SA 3.0 License. SERV-CT [eddie2020serv] uses the CC BY 4.0 License. No license is provided for Middlebury2014 [scharstein2014high] and New Tsukuba [martull2012realistic] dataset. KITTI2015, SERV-CT and Middlebury2014 have distinct scenes for each data provided, while New Tsukuba provides a sequential observation of the same scene at 30 Hz. Due to the small camera motion for New Tsukuba, we downsample the frequency to 3 Hz.

Appendix B Divergence in Training

B.1 Divergence in Training

We also monitor $\mathcal{L}_{IS}$ and EPE on the training data, where no generalization is required of the network. The quantitative result is shown in Table 4. The disagreement between $\mathcal{L}_{IS}$ and EPE is also evident on the training data, as $\rho$ rarely surpasses 0.39 and many networks have worse performance after self-supervised training. In fact, the average fraction of divergent cases of 0.61 is higher than the 0.54 observed in the validation result (Table 1), suggesting that over-optimizing $\mathcal{L}_{IS}$ may harm depth performance without any knowledge of the actual depth performance.

Dataset Network Initial EPE (mean, std) Final EPE (mean, std) ΔEPE ρ (mean, std) Divergence
MonoDepth2 7.20 3.08 1.46 0.88 -5.75 (-78%) +0.54 0.21 0.23
MADNet 7.03 12.18 1.74 1.64 -5.29 (-65%) +0.38 0.51 0.41
HSM 1.21 0.49 1.14 0.66 -0.07 (-7%) +0.52 0.58 0.29
GwcNet 1.31 2.08 1.32 0.64 +0.01 (+4%) -0.67 0.40 0.95
KITTI2015 STTR 1.46 2.78 2.02 3.29 +0.56 (+37%) -0.05 0.24 0.99
MonoDepth2 35.62 6.80 11.93 5.45 -23.69 (-67%) +0.64 0.49 0.25
MADNet 36.03 20.20 19.48 10.62 -16.55 (-32%) -0.14 0.59 0.88
HSM 2.02 0.72 3.03 1.03 +1.01 (+54%) -1.00 0.00 1.00
GwcNet 2.22 0.85 2.76 0.80 +0.54 (+32%) -0.96 0.05 1.00
SERV-CT STTR 2.04 0.76 4.21 1.96 +2.16 (+141%) -0.18 0.10 1.00
MonoDepth2 14.54 4.11 10.95 5.35 -3.58 (-25%) +0.48 0.72 0.30
MADNet 27.49 13.09 17.54 6.23 -9.95 (-25%) +0.16 0.52 0.60
HSM 2.05 0.90 1.46 0.66 -0.59 (-26%) +0.68 0.55 0.20
GwcNet 1.00 0.37 0.83 0.26 -0.17 (-14%) +0.19 0.59 0.60
Middlebury2014 STTR 1.66 2.60 1.34 1.24 -0.33 (-5%) -0.57 0.65 0.87
MonoDepth2 12.08 6.65 3.84 4.44 -8.24 (-73%) +0.85 0.21 0.04
MADNet 26.98 11.62 17.37 5.49 -9.62 (-30%) +0.36 0.40 0.44
HSM 1.74 0.79 1.50 0.51 -0.24 (-10%) +0.26 0.59 0.48
GwcNet 1.30 0.46 1.40 0.45 +0.10 (+10%) -0.72 0.28 0.63
New Tsukuba STTR 1.39 0.48 19.93 10.42 +18.54 (+1346%) -0.94 0.08 1.00
Average 0.61
Table 4: Training result on four datasets. ΔEPE: change between final and initial EPE (blue indicates EPE decreases; red indicates EPE increases). ρ: Spearman's rank coefficient (blue indicates average above 0.39; red indicates average below 0.39). Divergence: fraction of diverging training instances observed based on Definition 1.

B.2 Effect of Regularization

Since we observe the diverging phenomenon during refinement, we further ablate the training under different regularization settings to ensure that it is not caused by a specific regularization. We conduct four experiments for each network: a baseline without any regularization, with smoothness regularization only (denoted SMTH), with multi-scale regularization only (denoted MS), and with both regularizations. We selected the SERV-CT and Middlebury2014 datasets for this experiment since they have dense ground truth (allowing us to inspect every pixel) and distinct scene characteristics (medical and indoor). The quantitative result is summarized in Table 5.

We find that the commonly used regularization techniques do not always improve the final EPE. For example, for HSM/GwcNet/STTR trained on SERV-CT, the final EPE is worse than the initial EPE after training (the ΔEPE column is red) regardless of the regularization techniques used. While it is commonly believed that using both regularizations gives the best result, we also find that sometimes removing smoothness regularization leads to a better result (e.g., MonoDepth2 and MADNet trained on SERV-CT, and MonoDepth2 and GwcNet trained on Middlebury2014) and sometimes removing multi-scale regularization leads to a better result (HSM trained on SERV-CT). These findings suggest that regularization techniques do not fundamentally solve the divergence issue and can lead to inconsistent results depending on the network/dataset.

Dataset Network MS SMTH Initial EPE (mean, std) Final EPE (mean, std) ΔEPE ρ (mean, std) Divergence
16.86 8.07 -19.05 (-52%) +0.59 0.42 0.25
17.02 7.93 -18.90 (-52%) +0.54 0.43 0.31
16.81 8.14 -19.11 (-52%) +0.53 0.41 0.31
MonoDepth2 35.92 9.93 17.00 8.08 -18.92 (-52%) +0.47 0.44 0.38
8.07 5.48 -15.67 (-68%) +0.65 0.32 0.12
8.06 5.67 -15.68 (-68%) +0.69 0.28 0.12
8.05 5.58 -15.69 (-68%) +0.66 0.32 0.12
MADNet 23.74 9.68 7.97 5.38 -15.77 (-68%) +0.60 0.32 0.12
4.01 2.29 +1.88 (+93%) -0.73 0.48 0.88
3.52 1.99 +1.39 (+68%) -0.55 0.61 0.88
4.02 2.53 +1.89 (+92%) -0.65 0.45 0.94
HSM 2.13 0.73 3.58 2.17 +1.44 (+70%) -0.49 0.58 0.75
5.33 1.58 +2.98 (+147%) -0.99 0.01 1.00
3.10 1.01 +0.83 (+42%) -0.96 0.09 1.00
5.41 1.78 +2.90 (+139%) -1.00 0.00 1.00
GwcNet 2.22 0.85 2.76 0.80 +0.54 (+32%) -0.96 0.05 1.00
5.42 3.50 +1.98 (+57%) -0.60 0.60 0.81
4.38 2.81 +0.93 (+29%) -0.54 0.58 0.88
4.46 2.66 +1.02 (+32%) -0.70 0.36 0.94
SERV-CT STTR 3.44 2.02 3.48 1.83 +0.03 (+5%) -0.39 0.53 0.81
10.95 5.41 -3.75 (-26%) +0.39 0.87 0.30
11.07 5.16 -3.58 (-25%) +0.42 0.83 0.30
10.88 5.37 -3.70 (-26%) +0.45 0.76 0.30
MonoDepth2 14.54 4.11 10.95 5.35 -3.58 (-25%) +0.48 0.72 0.30
17.01 6.07 -11.01 (-29%) +0.20 0.56 0.40
17.38 6.09 -10.44 (-26%) +0.29 0.51 0.30
18.17 6.20 -9.14 (-25%) +0.15 0.52 0.30
MADNet 27.49 13.09 17.54 6.23 -9.95 (-25%) +0.16 0.52 0.40
1.24 0.52 -0.71 (-35%) +0.20 0.56 0.40
1.30 0.60 -0.63 (-32%) +0.29 0.51 0.30
1.24 0.51 -0.71 (-35%) +0.15 0.52 0.30
HSM 2.05 0.90 1.27 0.58 -0.68 (-34%) +0.16 0.52 0.40
1.04 0.94 +0.09 (+6%) +0.24 0.65 0.40
0.78 0.21 -0.17 (-14%) -0.10 0.74 0.50
0.77 0.24 -0.23 (-19%) -0.01 0.67 0.50
GwcNet 1.00 0.37 0.83 0.26 -0.17 (-14%) +0.19 0.59 0.60
2.34 4.57 +0.69 (+20%) -0.74 0.53 0.87
2.48 5.36 +0.77 (+14%) -0.64 0.64 0.87
2.05 3.58 +0.39 (+18%) -0.63 0.57 0.80
Middlebury2014 STTR 1.66 2.60 1.34 1.23 -0.31 (-5%) -0.57 0.65 0.87
Table 5: Training result on two datasets with different types of regularization. MS: multi-scale regularization. SMTH: smoothness regularization. ΔEPE: change between final and initial EPE (blue indicates EPE decreases; red indicates EPE increases). ρ: Spearman's rank coefficient (blue indicates average above 0.39; red indicates average below 0.39). Divergence: fraction of diverging training instances observed based on Definition 1.

Appendix C Sensitivity Analysis of Thresholds for Definition 1

In Figure 6, we generate a set of simulation results where the original variables $X$ and $Y$ are perfectly correlated, sampled at 0.02 intervals. We add increasing corruption, randomly sampled from the range $[0.0, s]$, to $Y$ to qualitatively illustrate how Spearman's rank coefficient changes as the data change. As $s$ increases from 0.0 to 2.5, Spearman's rank coefficient decreases from 1.0 to 0.29, and the correlation between $X$ and $Y$ becomes weaker.
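A sketch of this simulation is shown below; the sampling range and the set of noise levels are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def simulate_rho(noise_levels=(0.0, 0.5, 1.0, 1.5, 2.0, 2.5), seed=0):
    """Start from perfectly correlated X = Y sampled at 0.02 intervals, add
    uniform noise of increasing magnitude s to Y, and report Spearman's rho."""
    rng = np.random.default_rng(seed)
    x = np.arange(0.0, 2.0, 0.02)
    results = []
    for s in noise_levels:
        noise = rng.uniform(0.0, s, size=x.shape) if s > 0 else np.zeros_like(x)
        rho, _ = stats.spearmanr(x, x + noise)
        results.append((s, rho))
    return results
```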

Figure 6: Simulation results of $X$ and $Y$ with increasing noise magnitude $s$ and decreasing Spearman's rank coefficient $\rho$.

One other well-established threshold of Spearman's rank coefficient for weak correlation, used in political and medical research, is 0.29 [akoglu2018user]. We present the results of Table 1 again, recomputed with this new threshold (Table 6). The presented results only differ in two cases (MADNet trained on KITTI2015 and Middlebury2014), and the average fraction of divergent cases changes from 0.54 to 0.50, indicating that our definition of divergence in Definition 1 is not highly sensitive to the selected threshold.

Threshold=0.39 Threshold=0.29
Dataset Network ρ (mean, std) Divergence ρ (mean, std) Divergence
MonoDepth2 +0.41 0.38 0.20 +0.41 0.38 0.16
MADNet +0.35 0.45 0.36 +0.35 0.45 0.30
HSM +0.04 0.60 0.48 +0.04 0.60 0.44
GwcNet -0.41 0.58 0.82 -0.41 0.58 0.76
KITTI2015 STTR -0.01 0.21 0.40 -0.01 0.21 0.38
MonoDepth2 +0.65 0.51 0.25 +0.65 0.51 0.25
MADNet +0.34 0.60 0.50 +0.34 0.60 0.25
HSM -0.29 0.67 0.62 -0.29 0.67 0.62
GwcNet -0.58 0.56 0.88 -0.58 0.56 0.88
SERV-CT STTR -0.73 0.38 1.00 -0.73 0.38 1.00
MonoDepth2 -0.28 0.83 0.60 -0.28 0.83 0.60
MADNet -0.29 0.51 0.60 -0.29 0.51 0.60
HSM +0.89 0.17 0.00 +0.89 0.17 0.00
GwcNet +0.42 0.60 0.20 +0.42 0.60 0.20
Middlebury2014 STTR -0.58 0.65 0.87 -0.58 0.65 0.87
MonoDepth2 +0.26 0.38 0.38 +0.26 0.38 0.22
MADNet 0.00 0.42 0.56 0.00 0.42 0.48
HSM +0.15 0.47 0.48 +0.15 0.47 0.40
GwcNet -0.68 0.21 0.56 -0.68 0.21 0.52
Tsukuba STTR -0.89 0.16 1.00 -0.89 0.16 1.00
Average 0.54 Average 0.50
Table 6: Comparison of results with Spearman’s Rank coefficient threshold being 0.39 and 0.29.

Appendix D Change of EPE with Decreasing $\mathcal{L}_{IS}$

We provide qualitative results of the change of EPE on pixels with decreasing $\mathcal{L}_{IS}$. We mask out pixels that are occluded or have increasing $\mathcal{L}_{IS}$ in black. As shown in Figure 7, while some pixels have decreasing EPE (cooler colors), many have increasing EPE (warmer colors). The increase in EPE is particularly evident at occlusion/object edges, textureless areas, and specular regions, as discussed in Section 4.4.

(a) SERV-CT
(b) New Tsukuba
Figure 7: Visualization of synthesized images and the change of EPE at places where $\mathcal{L}_{IS}$ decreases (black regions are either occluded or have increasing $\mathcal{L}_{IS}$). The top-left figure is the initial synthesized image, the top-right figure is the final synthesized image, and the bottom figure is the change of EPE.

Appendix E Effect of Weighting Between $\mathcal{L}_{SSIM}$ and $\mathcal{L}_{1}$

As discussed in Section 4.3, the gradient w.r.t. the disparity estimate will be non-zero due to the residual error (and the weight update is this gradient scaled by the learning rate); this means that even a network with perfect depth prediction will be pushed away from that point. We examine whether the weighting between $\mathcal{L}_{SSIM}$ and $\mathcal{L}_{1}$ affects the training behavior. We conduct three experiments: using the default weighting in Equation 1, using $\mathcal{L}_{SSIM}$ alone, and using $\mathcal{L}_{1}$ alone. Experiments are conducted with multi-scale and smoothness regularization. The quantitative result is summarized in Table 7.

We find that there are cases where using either loss alone or their combination leads to worse EPE, such as HSM/GwcNet trained on SERV-CT (the ΔEPE column is red). We conclude that the divergence problem is not due to the weighting of $\mathcal{L}_{SSIM}$ and $\mathcal{L}_{1}$; therefore, adjusting the weight will not solve the divergence issue. We also find that using $\mathcal{L}_{1}$ alone is consistently worse than using $\mathcal{L}_{SSIM}$ alone, with the exception of MADNet trained on Middlebury2014, which justifies the larger weighting proposed for $\mathcal{L}_{SSIM}$ in [zhao2015loss, godard2017unsupervised]. Moreover, we find that sometimes using $\mathcal{L}_{SSIM}$ alone is better, especially on datasets such as SERV-CT where many specular reflective regions are present.

Dataset Network $\mathcal{L}_{SSIM}$ weight $\mathcal{L}_{1}$ weight Initial EPE (mean, std) Final EPE (mean, std) ΔEPE ρ (mean, std) Divergence
1 0 20.26 9.47 -15.65 (-44%) +0.57 0.43 0.25
0 1 21.02 7.51 -14.90 (-39%) -0.07 0.38 0.56
MonoDepth2 0.85 0.15 35.92 9.93 17.00 8.08 -18.92 (-52%) +0.47 0.44 0.38
1 0 7.97 5.02 -15.76 (-68%) +0.67 0.19 0.06
0 1 13.71 6.11 -10.03 (-40%) +0.10 0.34 0.56
MADNet 0.85 0.15 23.74 9.68 7.97 5.38 -15.77 (-68%) +0.60 0.32 0.12
1 0 2.96 1.77 +0.83 (+41%) -0.22 0.81 0.56
0 1 10.27 5.46 +8.14 (+420%) -0.97 0.05 1.00
HSM 0.85 0.15 2.13 0.73 3.58 2.17 +1.44 (+70%) -0.49 0.58 0.75
1 0 2.26 0.73 +0.03 (+6%) -0.86 0.17 1.00
0 1 7.24 2.57 +4.97 (+251%) -0.90 0.18 1.00
GwcNet 0.85 0.15 2.22 0.85 2.76 0.80 +0.54 (+32%) -0.96 0.05 1.00
1 0 3.20 1.58 -0.24 (-2%) -0.26 0.55 0.69
0 1 9.49 5.36 +6.05 (+197%) -0.96 0.07 1.00
SERV-CT STTR 0.85 0.15 3.44 2.02 3.48 1.83 +0.03 (+5%) -0.39 0.53 0.81
1 0 11.24 5.48 -3.39 (-24%) +0.41 0.87 0.30
0 1 7.68 2.83 -6.15 (-43%) +0.79 0.53 0.10
MonoDepth2 0.85 0.15 14.54 4.11 10.95 5.35 -3.58 (-25%) +0.48 0.72 0.30
1 0 17.26 5.56 -10.34 (-27%) +0.16 0.50 0.50
0 1 22.12 10.10 -5.55 (-13%) -0.03 0.52 0.40
MADNet 0.85 0.15 27.49 13.09 17.54 6.23 -9.95 (-25%) +0.16 0.52 0.40
1 0 1.28 0.58 -0.67 (-33%) +0.83 0.50 0.10
0 1 1.41 0.60 -0.54 (-27%) +0.43 0.61 0.30
HSM 0.85 0.15 2.05 0.90 1.27 0.58 -0.68 (-34%) +0.16 0.52 0.10
1 0 0.84 0.29 -0.16 (-16%) +0.57 0.61 0.10
0 1 1.05 0.38 +0.05 (+5%) -0.55 0.52 0.80
GwcNet 0.85 0.15 1.00 0.37 0.83 0.26 -0.17 (-17%) +0.19 0.59 0.60
1 0 1.69 1.65 +0.03 (+3%) -0.61 0.54 0.80
0 1 1.93 1.33 +0.27 (+37%) -0.36 0.72 0.90
Middlebury2014 STTR 0.85 0.15 1.66 2.60 1.34 1.23 -0.31 (-19%) -0.57 0.65 0.87
Table 7: Training result on two datasets with different weightings between $\mathcal{L}_{SSIM}$ and $\mathcal{L}_{1}$. ΔEPE: change between final and initial EPE (blue indicates EPE decreases; red indicates EPE increases). ρ: Spearman's rank coefficient (blue indicates average above 0.39; red indicates average below 0.39). Divergence: fraction of diverging training instances observed based on Definition 1.

Appendix F Effect of Kernel Size and Type in SSIM

As mentioned in Section 3.2, there can be variations in how SSIM is computed. One parameter is the size of the local patch used for computing SSIM. The other is the type of kernel: one can use either a linear kernel (where each pixel in the local patch is weighted equally) or a Gaussian kernel. We conduct experiments to examine the effects of kernel size and kernel type on $\mathcal{L}_{IS}$. We demonstrate that these parameters do not inherently change the training behavior, as shown in Figure 8.

(a) Linear Kernel (pixels are equally weighted)
(b) Gaussian Kernel
Figure 8: Comparison of kernel sizes and kernel types for SSIM computation.

Appendix G Effect of Optimizer

To eliminate the possibility that the specific choice of optimizer causes the disagreement, we conduct an ablation study over different optimizers using STTR, as STTR exhibits the worst divergence: AdamW [loshchilov2017decoupled], Adam [kingma2014adam], Adagrad [duchi2011adaptive], RMSprop [graves2013generating], and SGD with momentum [sutskever2013importance]. All experiments are performed twice, with and without weight decay. The momentum is set to 0.9 where applicable. All experiments are performed with multi-scale and smoothness regularization. As shown in Figure 9, SGD suffers from the gradient locality of $\mathcal{L}_{IS}$ and does not make sufficient updates to the model. Adam, AdamW and RMSprop all lead to divergence between $\mathcal{L}_{IS}$ and the depth prediction error after a certain point, especially RMSprop, which minimizes $\mathcal{L}_{IS}$ the most. The only exception among the adaptive optimizers is Adagrad, which stops updating after a finite number of iterations due to gradient accumulation. This prevents $\mathcal{L}_{IS}$ from decreasing and thus avoids the inconsistency between $\mathcal{L}_{IS}$ and EPE, which can serve as a temporary alternative. However, because there is no good control over when the gradient magnitude will diminish with Adagrad, we feel that it is inadequate in practice.

Figure 9: Ablation of different optimizers. Asterisk indicates presence of weight decay.

Appendix H Synthesis Loss Beyond Intensities

An alternative to using image intensities for $\mathcal{L}_{IS}$ would be to use features extracted by the network. However, if a network already has the ability to distinguish each pixel and match them properly, the network should already predict the perfect disparity. Furthermore, we hypothesize that this learning paradigm introduces instability, as the network can freely alter the feature representation, compared to computing $\mathcal{L}_{IS}$ on image intensities. Regardless, we conduct a single experiment using features extracted by the network to compute $\mathcal{L}_{IS}$ to verify this hypothesis. The experiment is conducted on STTR with both multi-scale and smoothness regularization. As shown in Figure 10, due to the lack of constraints, the EPE increases much faster compared to computing $\mathcal{L}_{IS}$ on image intensities.

Figure 10: Ablation of $\mathcal{L}_{IS}$ computed on features versus image intensities.

Appendix I Survey of Papers Using Image Synthesis Loss

Following PRISMA guidelines [moher2009preferred], we conducted a literature review on self-supervised depth estimation using $\mathcal{L}_{IS}$. Since the primary objective of this review is to estimate the scope of image synthesis-based self-supervision, we surveyed the arXiv database with the following search keywords:

  • For monocular depth estimation - (self-supervised unsupervised) (depth disparity) (estimation prediction);

  • For stereo depth estimation - (self-supervised unsupervised adapt) (depth disparity) (stereo).

This search retrieved a total of 447 papers that were then screened further based on the following inclusion criteria:

  • Must be monocular or stereo depth estimation from images;

  • Must use the $\mathcal{L}_{IS}$ loss or another combination of $\mathcal{L}_{SSIM}$ and $\mathcal{L}_{1}$;

  • Must be original primary investigation (i.e. not a review paper);

  • Language must be English;

  • Published from 2015 up to June 1st, 2021.

After screening, 190 papers were included for full text review. A total of 63 papers were subsequently rejected after full text review, leaving 127 papers that fit the inclusion criteria. Among the selected papers, 114 papers use smoothness regularization, 67 papers use multi-scale regularization and 69 papers handle occlusion explicitly.

Paper ID Title Author Name Year M/S SMTH MS Occlusion

1 Improved Point Transformation Methods For Self-Supervised Depth Prediction Ziwen 2021 S x x
2 Learning Depth via Leveraging Semantics: Self-supervised Monocular DepthEstimation with Both Implicit and Explicit Semantic Guidance Li 2021 M x
3 Learning Monocular Depth in Dynamic Scenes viaInstance-Aware Projection Consistency Lee 2021 M x
4 Self-supervised monocular depth estimation from oblique UAV videos Madhuanand 2020 M x
5 Semantic-Guided Representation Enhancement for Self-supervised Monocular Trained Depth Estimation Li 2020 M x
6 HR-Depth : High Resolution Self-Supervised Monocular Depth Estimation Lyu 2020 M
7 Variational Monocular Depth Estimation for Reliability Prediction Hirose 2020 M x x
8 Attentional Separation-and-Aggregation Network for Self-supervised Depth-Pose Learning in Dynamic Scenes Gao 2020 M x
9 Unsupervised Monocular Depth Learning with Integrated Intrinsics and Spatio-Temporal Constraints Chen 2020 M x x
10 Unsupervised Deep Persistent Monocular Visual Odometry and Depth Estimation in Extreme Environments Almalioglu 2020 M x
11 Unsupervised Monocular Depth Learning in Dynamic Scenes Li 2020 M x
12 Geometry-based Occlusion-Aware Unsupervised Stereo Matching for Autonomous Driving Peng 2020 S x
13 Unsupervised Learning of Depth and Ego-Motion from Cylindrical Panoramic Video with Applications for Virtual Reality Sharma 2020 M x
14 SAFENet: Self-Supervised Monocular Depth Estimation with Semantic-Aware Feature Extraction Choi 2020 M x x
15 Calibrating Self-supervised Monocular Depth Estimation McCraith 2020 M x x
16 Cascade Network for Self-Supervised Monocular Depth Estimation Chai 2020 M x
17 Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery Hermann 2020 M
18 Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation Aleotti 2020 S
19 Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion Vasiljevic 2020 M x x
20 SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving Kumar 2020 M x
21 S3Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data Cheng 2020 M x x
22 P2Net: Patch-match and Plane-regularization for Unsupervised Indoor Depth Estimation Yu 2020 M x x
23 P2D: a self-supervised method for depth estimation from polarimetry Blanchon 2020 M x
24 Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance Klingner 2020 M x
25 UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models Kumar 2020 M
26 Continual Adaptation for Deep Stereo Poggii 2020 S x
27 Self-supervised Depth Estimation to Regularise Semantic Segmentation in Knee Arthroscopy Liu 2020 M x x
28 EndoSLAM Dataset and An Unsupervised Monocular Visual Odometry and Depth Estimation Approach for Endoscopic Videos: Endo-SfMLearner Ozyoruk 2020 M x x
29 MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation Liu 2020
30 Increased-Range Unsupervised Monocular Depth Estimation Imran 2020 M
31 Consistency Guided Scene Flow Estimation Chen 2020 S x
32 Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues Wang 2020 M, S
33 Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation Wei 2020 M x
34 Unsupervised Depth Learning in Challenging Indoor Video: Weak Rectification to Rescue Biang 2020 M x
35 Self-Attention Dense Depth Estimation Network for Unrectified Video Sequences Mathew 2020 M x x
36 Deep feature fusion for self-supervised monocular depth prediction Kaushik 2020 M
37 Self-Supervised Human Depth Estimation from Monocular Videos Tan 2020 M x
38 Self-Supervised Attention Learning for Depth and Ego-motion Estimation Sadek 2020 M x x
39 Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction Tiwari 2020 M x x
40 RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes Ocal 2020 M
41 Masked GANs for Unsupervised Depth and Pose Prediction with Scale Consistency Zhao 2020 M x
42 Self-Supervised Monocular Scene Flow Estimation Hur 2020 M x
43 Distilled Semantics for Comprehensive Scene Understanding from Videos Tosi 2020 M x
44 Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume Johnston 2020 M
45 DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning Spencer 2020 M x
46 DiPE: Deeper into Photometric Errors for Unsupervised Learning of Depth and Ego-motion from Monocular Videos Jiang 2020 M
47 Unsupervised Learning of Depth, Optical Flow and Pose with Occlusion from 3D Geometry Wang 2020 M x
48 Semantically-Guided Representation Learning for Self-Supervised Monocular Depth Guiziliini 2020 M
49 Single Image Depth Estimation Trained via Depth from Defocus Cues Gur 2020 M x x
50 Don’t Forget The Past: Recurrent Depth Estimation from Monocular Video Patil 2020 M
51 Self-supervised Object Motion and Depth Estimation from Video Dai 2020 M x
52 Edge-Guided Occlusion Fading Reduction for a Light-Weighted Self-Supervised Monocular Depth Estimation Peng 2019 M x
53 Unsupervised Monocular Depth Prediction for Indoor Continuous Video Streams Feng 2019 M
54 Unsupervised High-Resolution Depth Learning From Videos With Dual Networks Zhou 2019 M
55 FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving Kumar 2020 M x
56 Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment Shi 2019 M x
57 Spherical View Synthesis for Self-Supervised 360o Depth Estimation Zioulis 2019 M x x
58 Progressive Fusion for Unsupervised Binocular Depth Estimation using Cycled Networks Pilzer 2019 S x x
59 Learning Residual Flow as Dynamic Motion from Stereo Videos Lee 2019 S
60 Unsupervised Domain Adaptation for Depth Prediction from Images Tonioni 2019 M x
61 MVS2: Deep Unsupervised Multi-view Stereo with Multi-View Symmetry Dai 2019 S x
62 Improving Self-Supervised Single View Depth Estimation by Masking Occlusion Schellevis 2019 M
63 Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video Bian 2019 M x
64 Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera Chen 2019 M x
65 Non-destructive three-dimensional measurement of hand vein based on self-supervised network Chen 2019 S x x
66 Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence Lai 2019 S
67 Unsupervised Depth Completion from Visual Inertial Odometry Wong 2020 M x x
68 Semi-Supervised Monocular Depth Estimation with Left-Right Consistency Using Deep Neural Network Amiri 2019 M x
69 Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency Khot 2019 S x
70 3D Packing for Self-Supervised Monocular Depth Estimation Guizilini 2020 M
71 Learn Stereo, Infer Mono: Siamese Networks for Self-Supervised, Monocular, Depth Estimation Goldman 2019 M, S
72 Recurrent Neural Network for (Un-)supervised Learning of Monocular Video Visual Odometry and Depth Wang 2019 M
73 Learning monocular depth estimation infusing traditional stereo knowledge Tosi 2019 M
74 Learning to Adapt for Stereo Tonioni 2019 S x x
75 Geometry-Aware Symmetric Domain Adaptation for Monocular Depth Estimation Zhao 2019 M x x
76 A Novel Monocular Disparity Estimation Network with Domain Transformation and Ambiguity Learning Bello 2019 M
77 Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation Pilzer 2019 M x x
78 Unsupervised Cross-spectral Stereo Matching by Learning to Synthesize Liang 2019 S x
79 Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos Xu 2019 M
80 Unsupervised monocular stereo matching Zhang 2018 M x
81 Unsupervised Learning of Monocular Depth Estimation with Bundle Adjustment, Super-Resolution and Clip Loss Zhou 2018 M
82 Joint Unsupervised Learning of Optical Flow and Depth by Watching Stereo Videos Wang 2018 S x
83 SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation Pillai 2018 M
84 DispSegNet: Leveraging Semantics for End-to-End Learning of Disparity Estimation from Stereo Imagery Zhang 2019 S x
85 Learning structure-from-motion from motion Pinard 2018 M
86 DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency Zhou 2018 M
87 A Deeper Insight into the UnDEMoN: Unsupervised Deep Network for Depth and Ego-Motion Estimation Babu V 2018 M x
88 Learning monocular depth by distilling cross-domain stereo networks Guo 2018 M
89 Learning monocular depth estimation with unsupervised trinocular assumptions Poggii 2018 M
90 Towards real-time unsupervised monocular depth estimation on CPU Poggii 2018 M x
91 Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding Yang 2018 M
92 Digging Into Self-Supervised Monocular Depth Estimation Godard 2019 M
93 Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation Ranjan 2019 M
94 Position Estimation of Camera Based on Unsupervised Learning Wu 2018 M x x x
95 Dual CNN Models for Unsupervised Monocular Depth Estimation Repala 2019 M x
96 On the importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach Smolyanskiy 2020 S x x
97 Fusion of stereo and still monocular depth estimates in a self-supervised learning context Martins 2018 M, S x x x
98 LEGO: Learning Edge with Geometry all at Once by Watching Videos Yang 2018 M x
99 Self-Supervised Monocular Image Depth Learning and Confidence Estimation Chen 2018 M x
100 Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction Zhan 2018 S x x
101 Geonet: Unsupervised learning of dense depth, optical flow and camera pose Yin 2018 M
102 AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation Kundu 2018 M x x x
103 Unsupervised Odometry and Depth Learning for Endoscopic Capsule Robots Turan 2018 M
104 Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints Mahjourian 2018 M x
105 Learning Depth from Monocular Videos using Direct Methods Wang 2017 M x
106 Unsupervised Learning of Geometry with Edge-aware Depth-Normal Consistency Yang 2017 M x x
107 UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning Li 2018 M x x x
108 Multi-task Self-Supervised Visual Learning Doersch 2017 M x x x
109 Self-Supervised Siamese Learning on Stereo Image Pairs for Depth Estimation in Robotic Surgery Ye 2017 S x x x
110 Unsupervised Learning of Depth and Ego-Motion from Video Zhou 2017 M x
111 SfM-Net: Learning of Structure and Motion from Video Vijayanarasimhan 2017 M x x
112 Unsupervised monocular depth estimation with left-right consistency Godard 2017 M
113 Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue Garg 2016 S x
114 EffiScene: Efficient Per-Pixel Rigidity Inference for Unsupervised Joint Learning of Optical Flow, Depth, Camera Pose and Motion Segmentation Jiao 2020 S x
115 Parallax Attention for Unsupervised Stereo Correspondence Learning Wang 2020 S
116 Self-Supervised Scale Recovery for Monocular Depth and Egomotion Estimation Wagstaff 2020 M x x
117 Monocular Depth Estimation with Self-supervised Instance Adaptation McCraith 2020 M x
118 Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications Xue 2020 M x
119 AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching Song 2020 S x
120 Enhancing self-supervised monocular depth estimation with traditional visual odometry Andraghetti 2019 S
121 LiStereo: Generate Dense Depth Maps from LIDAR and Stereo Imagery Zhang 2020 S x x
122 Online Adaptation through Meta-Learning for Stereo Depth Estimation Zhang 2019 S x x x
123 Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction Wong 2019 M x
124 Real-time self-adaptive deep stereo Tonioni 2019 S x
125 Geometry meets semantics for semi-supervised monocular depth estimation Ramirez 2018 M
126 Open-World Stereo Video Matching with Deep RNN Zhong 2018 S x x
127 Self-Supervised Learning for Stereo Matching with Self-Improving Ability Zhong 2017 S x x
Table 8: Summary of literature review on self-supervised depth estimation using image synthesis. M/S: Depth from monocular or stereo images. SMTH: Smoothness regularization. MS: Multi-scale regularization. Occlusion: Occlusion handling.