Repo for our CVIU work on the Benefit of Adversarial Training on Monocular Depth Estimation
In this paper we address the benefit of adding adversarial training to the task of monocular depth estimation. A model can be trained in a self-supervised setting on stereo pairs of images, where depth (disparities) are an intermediate result in a right-to-left image reconstruction pipeline. For the quality of the image reconstruction and disparity prediction, a combination of different losses is used, including L1 image reconstruction losses and left-right disparity smoothness. These are local pixel-wise losses, while depth prediction requires global consistency. Therefore, we extend the self-supervised network to become a Generative Adversarial Network (GAN), by including a discriminator which should tell apart reconstructed (fake) images from real images. We evaluate Vanilla GANs, LSGANs and Wasserstein GANs in combination with different pixel-wise reconstruction losses. Based on extensive experimental evaluation, we conclude that adversarial training is beneficial if and only if the reconstruction loss is not too constrained. Even though adversarial training seems promising because it promotes global consistency, non-adversarial training outperforms (or is on par with) any method trained with a GAN when a constrained reconstruction loss is used in combination with batch normalisation. Based on the insights of our experimental evaluation we obtain state-of-the-art monocular depth estimation results by using batch normalisation and different output scales.
We are interested in estimating depth from single images. This is fundamentally an ill-posed problem, since a single 2D view of a scene can be explained by many 3D scenes, due to scale ambiguity among other factors. Therefore the corresponding 3D scene should be estimated, implicitly, by looking at the global scene context. Using the global context, a model prior can be estimated to reliably retrieve depth from a single image.
To take into account global scene context for single image depth estimation, elaborate image recognition models have been developed (eigen2014depth; godard2017unsupervised; saxena2006learning). Currently used deep convolutional networks (ConvNets) have enough capacity to understand the global relations between pixels in the image as well as to encode the prior information. However, they are typically trained with combinations of local per-pixel losses only.
Generative Adversarial Networks (GANs) (goodfellow2014generative; isola2017image) force generated images to be realistic and globally coherent. They do so by introducing an adaptable loss, in the form of a neural discriminator network that penalises generated output that looks different from real data. GANs have seen much research attention in recent years, with ever increasing quality of the generated data (e.g. karras2019style). Moreover, they have shown success in many tasks, including image reconstruction (isola2017image), image segmentation (ghafoorian2018gan), novel viewpoint estimation (galama18iclrw) and binocular depth estimation (pilzer2018unsupervised). In this paper, we add an adversarial discriminator network to an existing monocular depth estimation model to include a loss based on global context.
Training deep ConvNets requires large datasets with corresponding depth data, preferably dense depth data where ground truth depth is available per pixel. These are not always easy to obtain, either due to a complicated acquisition setup or due to sparse depth data of LiDaR sweeps. To circumvent this, depth prediction has recently been formulated as a novel viewpoint generation task from stereo imagery, where depth is never directly estimated, but just the valuable intermediate in an image reconstruction pipeline. godard2017unsupervised start from a stereo pair of images, and use the left image to estimate the disparities from the right-to-the-left image, which is combined with the right image to reconstruct the left image (see Fig. 1). Due to the constrained image reconstruction setting, disparities are learned from a single view. We use this model as our baseline and extend it with a discriminator network, which we train in an adversarial setup.
Previous literature (godard2017unsupervised; yang2018deep; luo2016efficient; mayer2016large) shows impressive depth estimation performance using mostly engineered photometric and geometric losses. However, most of these loss functions are defined as sums over per-pixel loss functions. The global consistency of the scene context is not taken into account in the loss formulation. We address the question: "To what extent can monocular depth estimation benefit from adversarial training?" We do so by studying the influence of adversarial learning on several combinations of photometric and geometric loss functions.
Estimating depth from single images is fundamentally different from estimating depth from stereo pairs, because it is no longer possible to triangulate points. In a monocular setting, contextual information is required, e.g. texture variation, occlusions, and scene context. These cannot reliably be detected from local image patches. For example, a patch of blue pixels could either represent distant sky, or a nearby blue coloured object. The global picture thus has to be considered. Global context information can for example be modeled by using manually engineered features (saxena2009make3d), or by using CNNs (eigen2014depth).
Depth ground truth is expensive and time-consuming to obtain, and the readings might be inaccurate, e.g. due to infrared interference or due to surface specularities. An alternative is to use self-supervised depth estimation (garg2016unsupervised; godard2017unsupervised; godard2018digging), where training data consists of pairs of left and right images. A disparity prediction model can be trained, to warp the left image into the right image, using photometric reconstruction losses. Depth can be recovered from the disparities, by using the camera intrinsics, making depth ground truth data unnecessary at train time. While stereo pairs are necessary during training, during test time depth can be predicted from a single image. In this work we use the work of godard2017unsupervised as a baseline. Their follow-up work is also concerned with depth estimation, yet based on temporal sequences of (monocular) frames (godard2018digging). The scope of our paper is on stereoscopic learning of depth.
Depth estimation from single images can be formulated as an image warping or image generation task. Often image generation is done by means of encoder-decoder networks that output newly generated images. Encoder-decoder networks trained using L1 or L2 produce blurry results (pathak2016context; zhao2017loss), since output pixels are penalised conditioned on their respective input pixels, but never on the joint configuration of the output pixels. GANs (goodfellow2014generative; mirza2014conditional; isola2017image) counteract this by introducing a structured high-level adaptable loss. GANs are used in tasks where generating globally coherent images is important. Since monocular depth estimation is largely dependent on how well global contextual information is extracted from the input view, there is reason to believe GANs can be of benefit.
Pairing the high-level adversarial loss with a low-level reconstruction loss such as L1 may boost performance even more (isola2017image). This may be due to the fact that the adversarial loss punishes high-level detail, but only slowly updates low-level detail. Combining adversarial losses and pixel-based local losses has been shown to work well for a number of tasks, including novel viewpoint estimation (huang2017beyond; wu2016learning; galama18iclrw), predicting future frames in a video (yin2018novel; mathieu2015deep), and image inpainting (pathak2016context).
Adversarial losses have already been explored for depth estimation. chen18arxiv shows that adversarial training can be beneficial when directly regressing on depth from single images using ground-truth depth data. The authors use a CNN and a CNN-CRF architecture using either the L1 or L2 norm as similarity metric for the predicted depths. Since they only ever use single images during training, they do not exploit scene geometry for more involved geometric losses. kumar18cvprw predict depth maps from monocular video sequences and successfully use an adversarial network to promote photo-realism between frames. Their generator is composed of two separate networks: a depth network and a pose network. Together these networks enable the authors to generate frames over time. Compared to the current work, their problem is less constrained, because the static scene assumption is violated: objects in the scene may themselves move between frames. pilzer2018unsupervised suggest to use a cycled architecture for estimating depth, in which two generators and two discriminators jointly learn to estimate depth. Their half-cycle architecture is close to our approach, since it uses a single discriminator. However, the generator requires the input of both left and right images to predict a disparity map and even then does not explicitly enforce consistency between the two images. Concurrent with our research is the work of aleotti2018generative, where the method of godard2017unsupervised is extended with a vanilla GAN. They address the weighting of loss components and find that a subtle adversarial loss can possibly yield improved performance, albeit marginally. In our experiments we find the opposite: none of the GAN variants used improves performance, and we show that small variations in performance could also be explained by initialisation or attributed to the use of batch normalisation.
Unlike the methods above, the purpose of the current work is to evaluate adversarial approaches when constrained reconstruction losses are used.
We evaluate different GAN objectives for depth estimation, in part based on the results of lucic2018gans, who conclude that no variant is (necessarily) better than others, given a sufficiently large computational budget and extensive hyper-parameter search. We compare the following GAN variants:
Vanilla GAN, with a PatchGAN (isola2017image) discriminator;
Gradient-Penalty Wasserstein GANs (WGAN-GP) (arjovsky2017wasserstein; gulrajani2017improved), which minimises the Wasserstein distance, to overcome the saturating loss of the original GAN formulation;
Least Square GANs (LSGAN) (mao2017least), where the sigmoid real or fake prediction is replaced by an L2-loss.
While GANs provide a powerful method to output realistic data with an adaptable loss, they are notoriously difficult to train stably (salimans2016improved). Many strategies to improve training stability of GANs have been proposed, including using feature matching (ghafoorian2018gan), using historical averaging (shrivastava2017learning), or adding batch normalisation (ioffe2015batch). We include the latter in our GAN variants; initial results have shown that batch normalisation is beneficial for vanilla GANs and LSGANs to counteract internal covariate shift.
The baseline of our single depth estimation model is the reconstruction-based architecture for depth estimation from godard2017unsupervised, which we describe first. Then, we extend this baseline with an adversarial discriminator network.
The problem of estimating depth from single images could be formulated as an image reconstruction task, like in garg2016unsupervised, where a generator network takes in a single left view image and outputs the left-to-right disparity. The right image is reconstructed from the left input image and the predicted disparity. As a consequence this network could be trained from rectified left and right image pairs, without requiring ground-truth depth-maps. At test time depth estimation is based on the disparity predicted from the (single) left image, see Fig. 1.
godard2017unsupervised improve on the work of garg2016unsupervised by using both right-to-left and left-to-right disparities and by using a bilinear sampler (jaderberg2015spatial) to generate images, which makes the objective fully differentiable.
We first describe in detail the method of godard2017unsupervised, our baseline: The generator G uses the left image of the pair to reconstruct both the left and right images. Consider a left image $I^l$ and a right image $I^r$. Image $I^l$ is used as the sole input to the generator G. The generator G, however, outputs two disparities, $d^l$ and $d^r$. Using the left-to-right disparity $d^r$, the right image can be reconstructed using the warping method $f_w$:

$$\tilde{I}^r = f_w(I^l, d^r).$$

And similarly, the reconstructed left image $\tilde{I}^l$ is obtained by:

$$\tilde{I}^l = f_w(I^r, d^l).$$
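To make the warping step concrete, here is a minimal NumPy sketch of a horizontal bilinear warp in the spirit of $f_w$. The actual model uses the differentiable bilinear sampler of jaderberg2015spatial inside the network; the function name, sign convention and border handling here are illustrative assumptions.

```python
import numpy as np

def warp_horizontal(img, disp):
    """Resample `img` along x with per-pixel disparities, so that
    out[y, x] = img[y, x + disp[y, x]] with bilinear interpolation.
    A NumPy stand-in for the warping function f_w; sign conventions
    and border handling differ per implementation."""
    H, W = disp.shape
    xs = np.arange(W)[None, :] + disp               # sampling x-coordinates
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    w1 = np.clip(xs - x0, 0.0, 1.0)                 # weight of right neighbour
    rows = np.arange(H)[:, None]
    if img.ndim == 3:                               # broadcast over channels
        w1 = w1[..., None]
    return (1.0 - w1) * img[rows, x0] + w1 * img[rows, x1]
```

Because the interpolation is piecewise linear in the disparity, gradients flow through `disp`, which is what makes the reconstruction objective trainable end-to-end.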
A good generator G should predict $d^l$ and $d^r$ such that the reconstructed images ($\tilde{I}^l$ and $\tilde{I}^r$) are close to the original image pair ($I^l$ and $I^r$). To measure this, several image reconstruction losses are used.
The quality of the reconstruction is based on multiple loss components, each with different properties for the total optimisation process. We combine:
L1 loss to minimise the absolute per-pixel distance:

$$\mathcal{L}_{L1} = \frac{1}{N} \sum_{i,j} \left| \tilde{I}_{ij} - I_{ij} \right|.$$
Note that L1 has been reported to outperform L2 (zhao2017loss).
Structural similarity (SSIM) reconstruction loss to measure the perceived quality (wang2004image):

$$\mathcal{L}_{SSIM} = \frac{1}{N} \sum_{i,j} \frac{1 - \text{SSIM}(\tilde{I}_{ij}, I_{ij})}{2}, \qquad \text{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where illumination $\mu$ and signal contrast $\sigma$ are computed around centre pixels $x$ and $y$, and $c_1$ and $c_2$ overcome divisions by (almost) zero; these values are set in line with godard2017unsupervised; yang2018deep.
Left-Right Consistency Loss (LR), which enforces the consistency between the predicted left-to-right and right-to-left disparity maps:

$$\mathcal{L}_{LR} = \frac{1}{N} \sum_{i,j} \left| d^l_{ij} - d^r_{ij + d^l_{ij}} \right|.$$
Disparity Smoothness Loss, which forces smooth disparities, i.e. small disparity gradients, unless there is an image edge, therefore using an edge-aware L1 smoothness loss:

$$\mathcal{L}_{disp} = \frac{1}{N} \sum_{i,j} \left| \partial_x d_{ij} \right| e^{-\left\| \partial_x I_{ij} \right\|} + \left| \partial_y d_{ij} \right| e^{-\left\| \partial_y I_{ij} \right\|}.$$

Since the generator outputs disparities at different scales, this loss is normalised by the disparity range at scale $s$ to equalise the output ranges.
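As an illustration, minimal NumPy versions of the four reconstruction loss components could look as follows. These are single-image, single-scale sketches in our own notation; the paper's implementation operates on image batches at multiple scales, and the SSIM window and sampling details are simplifying assumptions.

```python
import numpy as np

C1, C2 = 0.01 ** 2, 0.03 ** 2  # SSIM constants that avoid division by ~0

def l1_loss(recon, target):
    """Mean absolute per-pixel distance."""
    return np.mean(np.abs(recon - target))

def _avg_pool3(x):
    """3x3 box filter with valid padding (stand-in for a Gaussian window)."""
    out = np.zeros((x.shape[0] - 2, x.shape[1] - 2))
    for dy in range(3):
        for dx in range(3):
            out += x[dy:dy + out.shape[0], dx:dx + out.shape[1]]
    return out / 9.0

def ssim_loss(x, y):
    """Mean of (1 - SSIM)/2 over local windows; 0 for identical images."""
    mu_x, mu_y = _avg_pool3(x), _avg_pool3(y)
    var_x = _avg_pool3(x * x) - mu_x ** 2
    var_y = _avg_pool3(y * y) - mu_y ** 2
    cov = _avg_pool3(x * y) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return np.mean(np.clip((1.0 - ssim) / 2.0, 0.0, 1.0))

def lr_consistency_loss(disp_l, disp_r):
    """Mean |d^l[y,x] - d^r[y, x + d^l[y,x]]| (nearest-pixel sampling here;
    the actual model samples bilinearly)."""
    H, W = disp_l.shape
    xs = np.clip(np.rint(np.arange(W)[None, :] + disp_l).astype(int), 0, W - 1)
    return np.mean(np.abs(disp_l - disp_r[np.arange(H)[:, None], xs]))

def smoothness_loss(disp, img):
    """Edge-aware L1 smoothness: disparity gradients are penalised less
    where the image itself has strong gradients (edges)."""
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])
    dx_i = np.abs(img[:, 1:] - img[:, :-1])
    dy_i = np.abs(img[1:, :] - img[:-1, :])
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))
```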
In yang2018deep an occlusion loss component was suggested in addition to the other loss terms. Initial experimentation shows no clear benefit of using it for our set-up. Results with the occlusion loss can be found in the supplementary material.
The generator network outputs scaled disparities at intermediate layers of the decoder while it is upsampling from the bottleneck layer. For each subsequent scale, the height and width of the output image are halved. At each scale the reconstruction loss is computed, and the final reconstruction loss is a combination of the losses at the different scales $s$:

$$\mathcal{L}_{rec} = \sum_s \left( \lambda_1 \mathcal{L}^s_{L1} + \lambda_2 \mathcal{L}^s_{SSIM} + \lambda_3 \mathcal{L}^s_{LR} + \lambda_4 \mathcal{L}^s_{disp} \right),$$

where $\lambda_i$ weighs the influence of each loss component. For each loss component we use the reconstruction of the left and right image, which are defined symmetrically.
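A sketch of how the per-scale components could be combined; the dictionary keys and weight names below are illustrative, not taken from the paper's code.

```python
def total_reconstruction_loss(losses_per_scale, weights):
    """Weighted sum of loss components over all output scales.

    losses_per_scale: list with one dict per scale s, mapping a component
                      name (e.g. 'l1', 'ssim', 'lr', 'disp') to its value.
    weights:          dict mapping the same names to their weighting factors.
    """
    return sum(weights[name] * value
               for scale_losses in losses_per_scale
               for name, value in scale_losses.items())
```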
We extend the baseline model by a single discriminator network. The discriminator network is tasked with discerning between fake and real images on the right side. The schematic is shown in Fig. 1, by the green discriminator.
For adversarial training, we combine the reconstruction loss with the loss functions belonging to the specific GAN variants. Note that, unlike in the original formulation of GANs, actual data, in the form of the left image $I^l$, is fed to the generator, not noise $z$ (mirza2014conditional). The generator G produces two disparities, $d^l$ and $d^r$. However, the discriminator D is only presented with the right image reconstructed from $d^r$. Thus the discriminator D examines only (reconstructed) right images, $\tilde{I}^r$ and $I^r$, to tell apart. This leads to the following losses:
$$\mathcal{L}_D^{GAN} = -\mathbb{E}\left[\log D(I^r)\right] - \mathbb{E}\left[\log\left(1 - D(\tilde{I}^r)\right)\right], \qquad \mathcal{L}_G^{GAN} = -\mathbb{E}\left[\log D(\tilde{I}^r)\right],$$

$$\mathcal{L}_D^{LS} = \mathbb{E}\left[(D(I^r) - 1)^2\right] + \mathbb{E}\left[D(\tilde{I}^r)^2\right], \qquad \mathcal{L}_G^{LS} = \mathbb{E}\left[(D(\tilde{I}^r) - 1)^2\right],$$

$$\mathcal{L}_D^{W} = \mathbb{E}\left[D(\tilde{I}^r)\right] - \mathbb{E}\left[D(I^r)\right] + \lambda_{GP}\, \mathcal{L}_{GP}, \qquad \mathcal{L}_G^{W} = -\mathbb{E}\left[D(\tilde{I}^r)\right],$$

where $\mathcal{L}_{GP}$ denotes the gradient penalty with $\lambda_{GP} = 10$ from WGAN-GP (gulrajani2017improved); and where the generator follows the NS-GAN loss (goodfellow2014generative).
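As an illustration of how these objectives differ, minimal NumPy versions of the generator-side losses (and the WGAN-GP critic loss) could look as follows. Here `d_fake` and `d_real` stand for discriminator outputs on reconstructed and real right images; the helper names are ours.

```python
import numpy as np

def vanilla_g_loss(d_fake):
    """Non-saturating GAN generator loss; d_fake in (0, 1) (sigmoid output)."""
    return -np.mean(np.log(d_fake + 1e-12))

def lsgan_g_loss(d_fake):
    """Least-squares GAN: push D(fake) towards the 'real' label 1."""
    return np.mean((d_fake - 1.0) ** 2)

def wgan_g_loss(d_fake):
    """Wasserstein GAN: maximise the (unbounded) critic score on fakes."""
    return -np.mean(d_fake)

def wgan_d_loss(d_real, d_fake, grad_norm, lam=10.0):
    """WGAN-GP critic loss; grad_norm holds ||grad D|| at interpolated samples."""
    return (np.mean(d_fake) - np.mean(d_real)
            + lam * np.mean((grad_norm - 1.0) ** 2))
```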
The final loss used for training the generator is:

$$\mathcal{L} = \mathcal{L}_{rec} + \mu\, \mathcal{L}_G,$$

which combines the reconstruction loss $\mathcal{L}_{rec}$ with the generator part $\mathcal{L}_G$ of the GAN loss, where $\mu$ weighs its influence.
At test time only the generator is used to predict the right-to-left disparity d at the finest scale, which has the same resolution as the input image. The predicted disparity is transformed into a depth map by using the known camera intrinsic and extrinsic parameters.
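This conversion follows the standard stereo relation depth = f·B/d. A sketch, where the KITTI stereo baseline is roughly 0.54 m and the focal length below is only an example value:

```python
import numpy as np

def disparity_to_depth(disp_px, focal_px, baseline_m, min_disp=1e-3):
    """Convert per-pixel disparity (pixels) to metric depth via depth = f*B/d.
    Disparities are clamped away from zero to avoid infinite depth."""
    d = np.maximum(np.asarray(disp_px, dtype=float), min_disp)
    return focal_px * baseline_m / d
```

For example, with a 700-pixel focal length and a 0.54 m baseline, a 10-pixel disparity corresponds to a depth of 37.8 m.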
In this section we experimentally evaluate the proposed GAN models on two public datasets KITTI and CityScapes.
Tables 1 to 3 report ARD, SRD, RMSE and RMSE log (lower is better) and the accuracy thresholds (higher is better); Tab. 3 lists these per combination of loss components (L1, LR, Disp, SSIM), with and without batch normalisation (BN), and per GAN variant. The training-set-mean baseline (experiment 5) scores 0.361 ARD, 4.826 SRD, 8.102 RMSE, 0.377 RMSE log, and accuracies 0.638, 0.804 and 0.894.
For the main set of experiments we use the KITTI dataset (Geiger2013IJRR), which contains image pairs from a car driving in various environments, from highways to city centres to rural roads. To compare fairly against other methods, we follow the Eigen data split (eigen2014depth): 22.6K training images, 888 validation images and 697 test images, all resized to 256 × 512. During training no depth ground truth is used, only the available stereo imagery. For evaluation the provided Velodyne laser data is used.
To test whether the results generalise to another dataset, we use the CityScapes dataset (cordts2016cityscapes). This dataset consists of almost 25 thousand images: 22.9K training images, 500 validation images and 1525 test images. Visual inspection of the CityScapes dataset reveals two things that have consequences for data pre-processing. First, some images contain artifacts at the top and bottom of the images. Second, both left and right cameras capture part of the car on which they are mounted. To compensate, the top 50 and the bottom 224 rows of pixels are cropped. Cropping is also performed at the sides of the images to retain the width and height ratio.
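A sketch of this pre-processing; the top/bottom row counts are from the text, while the symmetric side crop is our assumption, derived from the aspect-ratio constraint:

```python
import numpy as np

def crop_cityscapes(img, top=50, bottom=224):
    """CityScapes pre-processing sketch: drop artefact rows at the top and the
    ego-vehicle rows at the bottom, then crop the sides symmetrically so the
    original width:height ratio is preserved (side-crop amount is derived,
    not taken from the paper)."""
    h, w = img.shape[:2]
    cropped = img[top:h - bottom]
    new_h = cropped.shape[0]
    new_w = int(round(new_h * (w / h)))     # keep the original aspect ratio
    side = (w - new_w) // 2
    return cropped[:, side:side + new_w]
```

On a full-resolution 1024 × 2048 CityScapes frame this yields a 750 × 1500 crop, which preserves the 2:1 ratio.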
The generator-only model from godard2017unsupervised is used as a baseline (code available at https://github.com/rickgroen/depthgan). For all experiments we use an adapted VGG30 generator network architecture, with M parameters, for fair comparison with other methods (godard2017unsupervised; pilzer2018unsupervised).
All models are trained for 50 epochs, in mini-batches of 8, with the Adam optimiser (kingma2014adam). The learning rate is updated during training using a plateau scheduler (radford2015unsupervised). Data augmentation is done in an online fashion, including gamma, brightness and colour shifts, and horizontal flipping; in case of the latter, left and right images are swapped in order to preserve their relative position. The weights of the individual loss components and of the adversarial loss are set based on recommendations from previous works and some initial experiments.
The discriminators for the Vanilla GAN and LSGAN are convolutional networks with five layers; for WGAN-GP a 3-layer fully-connected network is used.
At test time disparities are warped into depth maps, and the predicted depth is bounded between 0 and 80 metres, which is close to the maximum in the ground truths. Similar to other methods we vertically centre-crop images, see e.g. garg2016unsupervised. Ground truth depth data of the Eigen split is sparse. For quantitative evaluation we use a set of common metrics (eigen2014depth; godard2017unsupervised; pilzer2018unsupervised; yang2018deep): Absolute Relative Distance (ARD), Squared Relative Distance (SRD), Root Mean Squared Error (RMSE), log Root Mean Squared Error (log RMSE), and accuracy within threshold $t$, i.e. the fraction of pixels with $\delta = \max(\tilde{d}/d, d/\tilde{d}) < t$, with $t \in \{1.25, 1.25^2, 1.25^3\}$.
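For reference, these metrics can be computed as follows (a NumPy sketch over the valid ground-truth pixels):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common monocular-depth evaluation metrics; pred and gt are positive
    depth values at pixels with valid ground truth."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    ard = np.mean(np.abs(pred - gt) / gt)                   # abs. relative
    srd = np.mean((pred - gt) ** 2 / gt)                    # sq. relative
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)                # delta accuracy
    acc = {t: float(np.mean(ratio < t)) for t in (1.25, 1.25 ** 2, 1.25 ** 3)}
    return ard, srd, rmse, rmse_log, acc
```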
Note that the disparity of pixels at the sides of the image is ill-defined, the so-called disparity ramps. To compensate, we use the test image and its horizontally flipped version to obtain disparity maps $d$ and $d'$, where the latter is flipped again to align with $d$. Both have a disparity ramp, yet on opposite sides of the disparity map. Therefore we use $d$ for the 5% outermost right columns, the flipped $d'$ for the 5% outermost left columns, and average the predictions everywhere else.
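A sketch of this post-processing step; the hard 5% column split below is a simplification, as implementations may instead blend the two maps with a smooth ramp:

```python
import numpy as np

def postprocess_disparity(d, d_flipped, ramp=0.05):
    """Combine disparities from an image and its horizontally flipped version.
    `d_flipped` must already be flipped back so it aligns with `d`."""
    H, W = d.shape
    out = 0.5 * (d + d_flipped)          # average where both are reliable
    n = max(1, int(round(ramp * W)))
    out[:, :n] = d_flipped[:, :n]        # outer-left columns: flipped estimate
    out[:, -n:] = d[:, -n:]              # outer-right columns: direct estimate
    return out
```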
In this set of initial experiments we study the robustness of our models to initialisation and the influence of the backbone architecture.
In this first experiment we study the robustness of our models with respect to the initialisation of the weights and the randomness of the training. We therefore trained two of our models 10 times with the same hyper-parameters: our baseline model with a VGG backbone, which combines the 4 reconstruction loss components without batch normalisation, and the LSGAN model with batch normalisation. The results are shown in Tab. 1, where we show for each performance measure the minimum, the maximum and the average value, and include the standard deviation. We observe some small differences in performance between runs, in both the error and the accuracy metrics. We conclude that minor differences in performance between models might in fact be the result of initialisation rather than of model design and training choices.
In the second experiment in this section, we study the influence of the backbone network used. We compare the VGG30 network, used in all other experiments, to variants of a ResNet backbone, using the full reconstruction loss definition (i.e. L1, LR, disparity & SSIM), without any adversarial components. The results are shown in Tab. 2. We conclude that the ResNet50 architecture yields the best performance on the test set, and that while the shallow ResNet18 is outperformed by the VGG architecture, it is only by a small margin. This is in line with expectations, since residual learning has been shown to be effective for deep convolutional architectures (he2016deep). For the other experiments, however, we use the VGG30 backbone for fair comparison to related work.
To address our main research question we have performed an extensive set of experiments using different combinations of loss components paired with three different GAN variants. We compare the performance of the model when using different configurations of loss components: L1 loss, adding left-right consistency, adding disparity smoothness, and adding the SSIM loss. For each of the loss component configurations, we also compare the performance of the model without using a GAN against using a Vanilla GAN, LSGAN or WGAN-GP. The results are shown in Tab. 3; we refer to experiment indices in parentheses. From the results we observe the following:
Using only an adversarial loss, without an image reconstruction loss, yields imprecise results (1), even worse than using the training set mean (5). The poor performance of the adversarial loss could be explained by the fact that many different disparity maps may reconstruct a correct image. The global coherency loss of the GANs seems to have difficulty converging to locally geometrically viable disparities; we conclude that both global and local consistency should be taken into account.
All models (with or without GAN) improve when more constraining image reconstruction losses are combined (2, 3, 4); However, where GANs do improve over using just L1 as image reconstruction losses (2.b vs 2.d/2.e), they do not improve when multiple reconstruction losses are used (4.b vs 4.c/d/e).
Where WGAN is the best (adversarial) model for the L1 loss, it is the worst (adversarial) model when more constrained losses are used. This could be partly explained by the difficulty of training GAN models, which are highly sensitive to parameter settings and network architectures.
When considering models with a constrained loss (4), adversarial learning boosts performance beyond the baseline (4.a vs 4.c/4.d). However, upon closer inspection, the difference in performance is likely due to the use of batch normalisation (cf. 4.b vs 4.c/4.d). Batch normalisation is often used for GANs because it facilitates stable training (salimans2016improved) and it is the default in many open-source GAN implementations.
For the baseline model, we obtain 0.142 ARD (4.a) with our implementation, which is slightly better compared to 0.148 reported by godard2017unsupervised (see also Tab. 8). Training with batch normalisation yields an increase of performance to 0.132 (4.b).
In this experiment we evaluate the influence of normalisation strategies on the performance. We compare the L1 and the full reconstruction loss, trained without GAN or with Vanilla / LSGAN. The results are shown in Tab. 4. From the results, we observe that for any configuration of loss components, performance improves when batch normalisation is used. While instance normalisation does not yield better performance for a full loss model, it is beneficial for models trained with a L1 loss. We conclude that batch normalisation is important for training for any model. Qualitative investigations in Fig. 5 show that batch normalisation takes away some granularity in the predicted disparities.
In the previous section we quantitatively showed how the performance of models is affected by making use of adversarial losses. GANs optimise for photo-realism in reconstructed images, which could lead to well-reconstructed images while the predicted disparities poorly model accurate depth. This is because many disparity maps may reconstruct an image well even though they do not capture depth correctly. We evaluate the performance of the non-adversarial model versus GANs at the level of individual images, to see if there are cases for which GANs are better suited. The results are presented in Fig. 2. When using only the L1 reconstruction loss (left panel), the variance in the performance difference between models trained with and without GANs is large (i.e. many outliers away from the diagonal). However, when using the full reconstruction loss, the scatter is aligned around the diagonal, indicating that both models perform similarly on most images. We have inspected the few outliers visually, but do not find noticeable patterns indicating better photo-realism for GANs in these cases.
In Fig. 3 the reconstructed images and their corresponding disparities are shown for the L1, WGAN L1, Full and Full WGAN models. The model trained using only an L1 loss can reconstruct images that have a low pixel-wise loss, but imposes no structure on the disparities. This means that there are holes and strong transitions in the predicted disparities. For the WGAN that was trained alongside an L1 loss, it seems the WGAN prefers to predict low disparities. This way, images that are very close to the input image are generated. As such, even though the disparities are poor, the reconstructed images are photo-realistic. Adding more geometric loss components that constrain the disparities alleviates these problems.
We also visually compare the performance of the different GAN variants in Fig. 4, for models that were trained alongside an L1 loss and a full loss. Again, for models trained with an L1 loss, it can be seen that the WGAN predicts low disparities compared to the other GAN variants. This is beneficial for depth estimation, since it smooths the disparities and prevents under-estimating depth. In a way, the WGAN trained alongside the L1 loss imitates the disparity smoothness loss component. This effect is much less pronounced for models trained with a full loss.
In a next experiment we compare the image quality (rather than the depth prediction quality) of the reconstructed images, measured by L1, L2 and SSIM scores on the test set. The results are shown in Tab. 5. The first row shows the identity mapping, indicating the scores when comparing left and right ground truth images directly. In the rest of the table we can see that there are subtle differences between methods in image space. WGAN reconstruction is marginally worse than the other methods. The results are interesting because the image reconstruction score appears to have a relatively low correlation with the depth estimation scores: the image quality of the L1-based models is higher than that of the full loss models, yet their depth prediction is worse. This implies that there is a need for geometrically founded loss functions, like the left-right consistency and disparity smoothness losses, to obtain better depth predictions.
Tab. 5: image reconstruction quality on the test set. L1 and L2 errors: lower is better; SSIM: higher is better.

| Method | L1 ↓ | L2 ↓ | SSIM ↑ |
| --- | --- | --- | --- |
| L1 loss - base | 0.0499 | 0.0102 | 0.7110 |
| L1 loss - Vanilla GAN | 0.0495 | 0.0101 | 0.7130 |
| L1 loss - LSGAN | 0.0521 | 0.0112 | 0.7003 |
| L1 loss - WGAN | 0.0525 | 0.0111 | 0.6945 |
| Full loss - base | 0.0514 | 0.0110 | 0.7183 |
| Full loss - Vanilla GAN | 0.0521 | 0.0112 | 0.7115 |
| Full loss - LSGAN | 0.0518 | 0.0111 | 0.7125 |
| Full loss - WGAN | 0.0550 | 0.0126 | 0.6936 |
In this set of experiments, we vary the number of output disparity predictions from 4 (as used in godard2017unsupervised) to 1 (as used in pilzer2018unsupervised). We do so for two settings of the reconstruction loss: using only the L1 + LR loss, similar to pilzer2018unsupervised, and using the full reconstruction loss, similar to godard2017unsupervised. The results are shown in Tab. 6, using the ARD performance measure. From the results we observe that reducing the number of disparity output scales contributes positively to depth estimation quality; this holds especially when using just the L1 + LR loss. The intuition is that forcing the network to output coherent disparities early on can over-regularise disparities: with 4 output scales the smallest resolution of disparities is just 32 × 64, which acts as a (too) strict regulariser.
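The per-scale resolutions follow directly from the halving: for 256 × 512 inputs and 4 output scales, the coarsest disparity map is 32 × 64. A one-line sketch:

```python
def scale_resolutions(height, width, n_scales):
    """Resolution of the disparity output at each scale s (halved per scale)."""
    return [(height // 2 ** s, width // 2 ** s) for s in range(n_scales)]
```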
Tab. 8 (excerpt): comparison on the KITTI Eigen test split. Lower is better for ARD, SRD, RMSE and RMSE log; higher is better for the accuracies. The original table groups methods into those supervised using LiDAR depth and those supervised using left-right correspondence; K denotes training on KITTI.

| Method | Settings | Trained on | ARD | SRD | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| godard2018digging | R50, video based, v1 | K | 0.133 | 1.142 | 5.533 | 0.230 | 0.830 | 0.936 | 0.970 |
| Optimised settings | VGG, BN + S2 | K | 0.128 | 1.026 | 5.313 | 0.222 | 0.830 | 0.939 | 0.973 |
| ResNet - Baseline | R50, S4 | K | 0.123 | 0.936 | 5.145 | 0.216 | 0.843 | 0.943 | 0.975 |
| ResNet - Optimised settings | R50, BN + S2 | K | 0.122 | 0.928 | 5.119 | 0.215 | 0.847 | 0.945 | 0.975 |
In this set of experiments we evaluate our models on the CityScapes dataset, after training on the KITTI dataset. The goal is to see if the insights and results generalise to this novel domain. Results are depicted in Tab. 7. The results indicate similar behaviour on the CityScapes dataset as on the KITTI dataset: when the reconstruction losses are sufficiently constrained, adversarial training does not improve the performance. We obtain the best generalising results by using a combination of all four image reconstruction losses, trained at a single scale, using batch normalisation.
In the final set of experiments, we compare the performance of our current work to a few state-of-the-art methods on KITTI, see Tab. 8. For comparison, we report performance on the Eigen test set, using centre cropping (garg2016unsupervised). For reference we include a few seminal works on monocular depth estimation that use the LiDAR data during training (saxena2006learning; eigen2014depth) and one of the newest methods (yang2018deep). Then we compare our performance to the work of godard2017unsupervised, godard2018digging, and pilzer2018unsupervised, since these serve as baselines and inspiration for the current work. While adding adversarial losses to take scene context into account does not improve over a combination of reconstruction-based losses, using batch normalisation and just 2 output scales significantly boosts performance.
This work has sought to investigate whether using adversarial losses benefits the estimation of depth maps in a monocular setting. For many tasks where global consistency is important, adversarial training improves image reconstruction. However, after extensive experimental evaluation, we conclude that adversarial training is beneficial in monocular depth estimation if and only if the reconstruction loss does not impose too many constraints on the reconstructed images. When more involved geometrically or structurally inspired losses are introduced, adversarial training hardly contributes to the quality of the predicted depth maps and may even be harmful.
Based on our extensive experiments we also conclude that:
Batch normalisation improves depth prediction quality significantly; and
evaluating reconstruction losses at many output scales over-regularises the disparity at the final scale; this effect is stronger when the loss function itself is more constrained by SSIM and left-right consistency components.
Using those two insights, we have been able to set a new state-of-the-art monocular depth prediction based on reconstruction losses, improving from 0.148 (godard2017unsupervised) to 0.128 (using batch norm and 2 output scales).
Future research could investigate the influence of specific architectures of the discriminator network and the use of conditional GANs for guiding monocular depth estimation.
This research was supported in part by the Dutch Organisation for Scientific Research via the VENI grant What & Where awarded to Dr. Mensink.