Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation

08/17/2021 ∙ by Lina Liu, et al. ∙ Zhejiang University and Baidu, Inc.

Remarkable results have been achieved by DCNN-based self-supervised depth estimation approaches. However, most of these approaches can handle only day-time or only night-time images, and their performance degrades on all-day images due to the large domain shift and the illumination variation between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: a private domain and an invariant domain, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains the essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes day-time images and the corresponding night-time images (generated by a GAN) as input, and the private and invariant feature extractors are learned with orthogonality and similarity losses, so the domain gap is alleviated and better depth maps can be expected. In addition, reconstruction and photometric losses are utilized to estimate the complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach.


1 Introduction

Self-supervised depth estimation has been applied in a wide range of fields such as augmented reality [3][5], 3D reconstruction [17], SLAM [22][30][31] and scene understanding [1][6], since it does not need large and accurate ground-truth depth labels as supervision. The depth information can be estimated from the implicit supervision provided by the spatial and temporal consistency present in image sequences. Benefiting from well-developed deep learning technology, impressive results have been achieved by Deep Convolutional Neural Network (DCNN) based approaches [14][21][24][25], which outperform traditional methods that rely on handcrafted features and exploit camera geometry and/or camera motion for depth and pose estimation.

Figure 1: Comparison with other approaches on Oxford RobotCar dataset [28]. From left to right: (a) Night Images, (b) Monodepth2 [12], (c) HR-Depth [27], (d) Monodepth2+CycleGAN [44], (e) Ours.

However, most current DCNN-based self-supervised depth estimation approaches [18][36][37][42] mainly address depth estimation on day-time images and are evaluated on day-time benchmarks such as KITTI [9] and Cityscapes [7]. They fail to generalize well to all-day images due to the large domain shift between day and night images. Night-time images are unstable because of low visibility and the non-uniform illumination caused by multiple and moving light sources. Methods [16][19] apply commonly used depth estimation strategies to images captured in low-light conditions; however, their performance is limited by the unstable visibility. Meanwhile, generative adversarial networks (GANs), such as CycleGAN [44], have also been used to tackle depth estimation on night-time images by translating night-time information to day-time at both the image level and the feature level. Unfortunately, due to the inherent domain shift between day and night-time images, it is difficult to obtain natural day-time images or features from night-time inputs with a GAN, so the performance is also limited. Fig. 1 (b) and (c) show the results of Monodepth2 [12] and HR-Depth [27] on night-time images. Monodepth2 [12] is an effective self-supervised depth estimation approach, and HR-Depth [27] makes a series of improvements based on Monodepth2 [12]. Fig. 1 (d) shows the result of Monodepth2 [12] with a CycleGAN [44]-translated image as input. We can see that the depth details fail to be estimated due to the non-uniform illumination of night-time images.

For a real-world scene, the depth is constant if the viewpoint is fixed, while disturbing terms such as illumination vary over time, which degrades the performance of self-supervised depth estimation, especially for night-time images. [8] also shows that texture information plays a more important role in depth estimation than exact color information. To address these issues, we propose a domain-separated network for self-supervised depth estimation of all-day RGB images. The information of day and night image pairs is separated into two complementary sub-spaces: the private and invariant domains, both extracted with DCNNs. The private domain contains the unique information (illumination, etc.) of day and night-time images, which disturbs depth estimation. In contrast, the invariant domain contains invariant information (texture, etc.), which can be used for common depth estimation. Thus the disturbing information can be removed and better depth maps can be obtained.

Meanwhile, unpaired day and night images always contain inconsistent information, which interferes with the separation of private and invariant features. Therefore, the domain-separated network takes a pair consisting of a day-time image and the corresponding night-time image (generated by a GAN) as input. Private and invariant feature extractors are first utilized to extract private (illumination, etc.) and invariant (texture, etc.) features using orthogonality and similarity losses, which yields more effective features for depth estimation of both day and night-time images. Besides, constraints at the feature and Gram-matrix levels are leveraged in the orthogonality losses to alleviate the domain gap, so more effective features and fine-grained depth maps can be obtained. Then, depth maps and the corresponding RGB images are reconstructed by decoder modules with reconstruction and photometric losses. Note that real-world day-time and night-time images can be tested directly. As shown in Fig. 1 (e), our approach effectively relieves the problems of low visibility and non-uniform illumination, and achieves more appealing results for night-time images.

The main contributions can be summarized as:

  • We propose a domain-separated framework for self-supervised depth estimation of all-day images. It can relieve the influence of disturbing terms in depth estimation by separating the all-day information into two complementary sub-spaces: private (illumination, etc.) and invariant (texture, etc.) domains, thus better depth maps can be expected;

  • Private and invariant feature extractors with orthogonality and similarity losses are utilized to extract effective and complementary features to estimate depth information. Meanwhile, the reconstruction loss is employed to refine the obtained complementary information (private and invariant information);

  • Experimental results on the Oxford RobotCar dataset demonstrate that our framework achieves state-of-the-art depth estimation performance for all-day images, which confirms the superiority of our approach.

2 Related work

2.1 Day-time Depth Estimation

Self-supervised depth estimation has been extensively studied in recent years. [43][11] are the first self-supervised monocular depth estimation approaches, training a depth network along with a separate pose network. Meanwhile, [12][13][15][20] make a series of improvements for outdoor scenes, which are thoroughly evaluated on the KITTI [9] and Cityscapes [7] datasets. [23][32][38] achieve better results in indoor scenes.

The KITTI [9] and Cityscapes [7] datasets contain only day-time images, and all of the above methods are tailored to such scenes. However, self-supervised depth estimation for all-day images has not been well addressed, and the performance of current approaches on night-time images is limited due to low visibility and non-uniform illumination.

2.2 Night-time Depth Estimation

Figure 2: Overview of the network architecture. The architecture includes three parts: the shared-weights depth network (orange area), the day private branch (yellow structure) and the night private branch (green structure). Day-time and night-time images are fed into the shared-weights depth network, which first extracts the invariant features and then estimates the corresponding depth maps. Meanwhile, the day private feature extractor and the night private feature extractor (blue area) extract the private features of the day and night images, respectively, which are constrained by the orthogonality loss to obtain complementary features. The private and invariant features are added to reconstruct the original input images with the reconstruction loss. At inference, only the first day/night convolution layers and the shared-weights depth network are used to estimate depth.

Approaches have also been proposed for self-supervised depth estimation of night-time images.

[19][26] propose to use additional sensors to estimate depth from night-time images. To estimate depth at all times of day, [19] utilizes a thermal imaging camera to reduce the influence of low visibility at night, while [26] adds LiDAR to provide additional information when estimating night-time depth maps. Meanwhile, using generative adversarial networks, [33] and [34] propose effective strategies for depth estimation of night-time images. [33] utilizes a translation network that models light effects and uninformative regions and can render realistic night stereo images from day stereo images, and vice versa; during inference, separate day and night networks are used to estimate stereo depth of night-time images. [34] proposes an adversarial domain feature adaptation method to reduce the domain shift between day and night images at the feature level; independent encoders and a shared decoder are then used for day-time and night-time images during inference.

Though remarkable progress has been achieved, due to the large domain shift between day-time and night-time images, it is difficult to obtain natural day-time images or features from night-time inputs, so the performance of these approaches remains limited.

2.3 Domain Adaptation

Most domain adaptation methods for depth estimation and stereo matching focus on the transfer between the synthetic and real domains or between different datasets, and usually translate images from one domain to another. To reduce the requirement of real-world images in depth estimation, [2][40][41] explore image translation techniques to generate synthetic labeled data. [29] tackles the synthetic-to-real depth estimation problem by using domain-invariant defocus blur as direct supervision. [39] proposes a domain normalization approach for stereo matching that regularizes the distribution of learned representations to make them invariant to domain differences.

Compared with previous approaches, we propose an effective domain separation framework for all-day self-supervised depth estimation, which can effectively handle the problem of domain shift between day and night-time images.

3 Approach

We propose a domain-separated framework to relieve the influence of disturbing terms; it takes day-time images and the corresponding night-time images generated by a GAN as input. Fig. 2 shows the pipeline of the proposed framework.

3.1 Domain Separated Framework

For day-time and night-time images of the same scene, the depth information should be consistent even though the illumination of the image pair is quite different. This means that the essential information of corresponding day-time and night-time images of a scene should be similar. Here, we separate the information of day and night-time images into two parts:

$I_d = I_d^{i} + I_d^{p}, \qquad I_n = I_n^{i} + I_n^{p}$   (1)

where $I_d^{i}$ and $I_n^{i}$ denote the invariant information of the day and night images, which should be similar for the same scene, and $I_d^{p}$ and $I_n^{p}$ denote the different private information (illumination, etc.) of the day and night images, respectively.

Inspired by [8], the illumination of a scene changes over time while its depth remains constant, so the illumination components $I_d^{p}$ and $I_n^{p}$ of a scene play a minor role in self-supervised depth estimation. As shown in Fig. 2, the proposed domain-separated framework separates the images into two complementary sub-spaces at the feature level (visualized in Fig. 5), and the invariant components are utilized for depth estimation.

Moreover, it is difficult to guarantee that real-world day-time and night-time images of a scene contain the same information except for the private information (illumination, etc.), since there are always moving objects in outdoor scenes, which would mislead the network when separating images into private and invariant components. Therefore, CycleGAN [44] is used to translate day-time images into night-time images, and each day-time image and its generated night-time counterpart are regarded as an input image pair. This ensures that the invariant information is consistent and that all objects are in the same positions, reducing the loss of essential information during the separation of private information. Note that other GANs can also be used here.
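The pairing step can be sketched as follows in PyTorch; `G_day2night` is a hypothetical stand-in for a trained CycleGAN (or other image-to-image) generator, and the tensor sizes are illustrative only.

```python
# Building day/generated-night training pairs with a frozen translation network.
import torch
import torch.nn as nn

G_day2night = nn.Sequential(            # placeholder for a trained generator
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
)

day_batch = torch.rand(4, 3, 192, 320) * 2 - 1   # day frames scaled to [-1, 1]
with torch.no_grad():                            # the generator is frozen here
    fake_night_batch = G_day2night(day_batch)

# (day_batch, fake_night_batch) share scene content and differ only in
# appearance, which is what the domain-separation losses rely on.
paired_input = (day_batch, fake_night_batch)
```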

Inspired by [4], our domain-separated framework uses two network branches to extract the private and invariant information of an image at the feature level. Given the input day image sequence $\{x_d^t\}$ and the corresponding generated night image sequence $\{x_n^t\}$, where $t$ indexes the frames in chronological order, the day private feature extractor $E_d^{p}$ and the night private feature extractor $E_n^{p}$ extract the private features $F_{d,t}^{p}$ and $F_{n,t}^{p}$ of the day-time and night-time images, respectively. The invariant feature extractors $E_d^{i}$ and $E_n^{i}$ extract the invariant features $F_{d,t}^{i}$ and $F_{n,t}^{i}$ of the day-time and night-time images, respectively. Since the input day-time and night-time images contain the same essential information, $E_d^{i}$ and $E_n^{i}$ share weights and are jointly denoted $E^{i}$. The feature extraction process can then be formulated as:

$F_{d,t}^{p} = E_d^{p}(x_d^t), \quad F_{n,t}^{p} = E_n^{p}(x_n^t), \quad F_{d,t}^{i} = E^{i}(x_d^t), \quad F_{n,t}^{i} = E^{i}(x_n^t)$   (2)

where the subscript $t$ denotes the $t$-th day-time or night-time frame.
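A minimal PyTorch sketch of the feature-extraction step in Eq. (2) is given below. The encoder architecture, channel sizes and input resolution are assumptions for illustration; only the weight-sharing pattern (two separate private extractors, one shared invariant extractor) follows the description above.

```python
# Private extractors have separate weights per domain; the invariant extractor
# is shared between day and night inputs (E_d^i = E_n^i = E^i).
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """3x3 strided conv + BN + ReLU, used here as a generic encoder stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class Encoder(nn.Module):
    """Small convolutional encoder standing in for a ResNet-style backbone."""

    def __init__(self, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 32), conv_block(32, out_ch))

    def forward(self, x):
        return self.net(x)


E_d_pri, E_n_pri, E_inv = Encoder(), Encoder(), Encoder()

x_d = torch.randn(2, 3, 192, 320)   # day-time frames x_d^t
x_n = torch.randn(2, 3, 192, 320)   # generated night-time frames x_n^t

F_d_pri, F_n_pri = E_d_pri(x_d), E_n_pri(x_n)   # private features
F_d_inv, F_n_inv = E_inv(x_d), E_inv(x_n)       # invariant features (shared weights)
print(F_d_pri.shape, F_d_inv.shape)
```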

Then, decoders are used to recover the corresponding depth maps and to reconstruct the day and night-time images. As shown in Fig. 2, the red decoder $D^{dep}$ is the depth recovery module of the shared-weights depth network, while the yellow decoder $D_d$ and the green decoder $D_n$ form the image reconstruction branches. The depth estimation and image reconstruction process can be formulated as:

$\hat{x}_{d,t} = D_d(F_{d,t}^{p} + F_{d,t}^{i}), \quad \hat{x}_{n,t} = D_n(F_{n,t}^{p} + F_{n,t}^{i}), \quad \hat{d}_{d,t} = D^{dep}(F_{d,t}^{i}), \quad \hat{d}_{n,t} = D^{dep}(F_{n,t}^{i})$   (3)

where $\hat{x}_{d,t}$ and $\hat{x}_{n,t}$ are the reconstructed images of the $t$-th frame produced by $D_d$ and $D_n$, and $\hat{d}_{d,t}$ and $\hat{d}_{n,t}$ are the corresponding depth maps estimated by $D^{dep}$.
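The decoding step of Eq. (3) can be sketched in the same spirit: a shared depth decoder consumes only invariant features, while per-domain decoders reconstruct the inputs from the sum of private and invariant features. The decoder layout and channel counts below are assumptions, not the paper's exact architecture.

```python
# Shared depth decoder + per-domain image reconstruction decoders.
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """Upsampling decoder producing `out_ch` maps at 4x the feature resolution."""

    def __init__(self, in_ch=64, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, f):
        return self.net(f)


D_depth = Decoder(out_ch=1)                              # shared depth decoder (red in Fig. 2)
D_day, D_night = Decoder(out_ch=3), Decoder(out_ch=3)    # image reconstruction decoders

# Features as produced by the extractors in the previous sketch.
F_d_pri = torch.randn(2, 64, 48, 80)
F_d_inv = torch.randn(2, 64, 48, 80)
F_n_pri = torch.randn(2, 64, 48, 80)
F_n_inv = torch.randn(2, 64, 48, 80)

d_d, d_n = D_depth(F_d_inv), D_depth(F_n_inv)            # depth maps from invariant features
x_d_rec = D_day(F_d_pri + F_d_inv)                       # reconstructed day image
x_n_rec = D_night(F_n_pri + F_n_inv)                     # reconstructed night image
```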

3.2 Loss function

To obtain private and invariant features and to estimate the depth of all-day images in a self-supervised manner, several losses are leveraged, including the reconstruction loss, similarity loss, orthogonality loss and photometric loss.

3.2.1 Reconstruction Loss

The private and invariant features are complementary and together should be able to reconstruct the original RGB images. Hence, we use a reconstruction loss to refine the domain-separated framework, which is defined as:

$\mathcal{L}_{rec} = \frac{1}{N}\sum_{p}\left(\left|\hat{x}_{d,t}(p) - x_d^t(p)\right| + \left|\hat{x}_{n,t}(p) - x_n^t(p)\right|\right)$   (4)

where $p$ indexes pixels and $N$ is the number of pixels of $x_d^t$ and $x_n^t$.
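A minimal sketch of Eq. (4), assuming the reconstruction error is an L1 penalty averaged over pixels and summed over the day/night pair:

```python
# Per-pixel L1 reconstruction loss between inputs and their reconstructions.
import torch


def reconstruction_loss(x, x_rec):
    """Mean absolute error over all pixels (and channels) of the batch."""
    return torch.mean(torch.abs(x - x_rec))


x_d = torch.rand(2, 3, 192, 320)       # input day image
x_n = torch.rand(2, 3, 192, 320)       # generated night image
x_d_rec = torch.rand(2, 3, 192, 320)   # reconstruction from day features
x_n_rec = torch.rand(2, 3, 192, 320)   # reconstruction from night features

L_rec = reconstruction_loss(x_d, x_d_rec) + reconstruction_loss(x_n, x_n_rec)
```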

3.2.2 Similarity Loss

The proposed domain-separated framework takes day-time images and the corresponding generated night-time images (CycleGAN [44]) as input, and the estimated depth maps of the day-time and night-time images should be consistent. Due to the inherent advantage of day-time images in depth estimation, the estimated depth of the night image is expected to be as close as possible to that of the day image; that is, the depth of the day-time image is used as a pseudo label to constrain the depth of the night-time image. The similarity loss is therefore defined as:

$\mathcal{L}_{sim} = \frac{1}{N}\sum_{p}\left|\hat{d}_{n,t}(p) - \operatorname{sg}\!\left(\hat{d}_{d,t}(p)\right)\right|$   (5)

where $N$ is the number of pixels of $\hat{d}_{d,t}$ and $\hat{d}_{n,t}$, $p$ denotes the $p$-th pixel, and $\operatorname{sg}(\cdot)$ means that the gradient of $\hat{d}_{d,t}$ is cut off during back propagation.
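A sketch of Eq. (5); the stop-gradient on the day-time depth corresponds to `detach()` in PyTorch:

```python
# Night-time depth is pulled towards the (detached) day-time pseudo label.
import torch


def similarity_loss(depth_night, depth_day):
    # Gradient is cut off on the day-time depth (pseudo label).
    return torch.mean(torch.abs(depth_night - depth_day.detach()))


d_d = torch.rand(2, 1, 192, 320)   # day-time depth prediction
d_n = torch.rand(2, 1, 192, 320)   # night-time depth prediction
L_sim = similarity_loss(d_n, d_d)
```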

3.2.3 Orthogonality Loss

As discussed above, the private and invariant features of an image are complementary and should be completely different. Therefore, two types of orthogonality losses are utilized to guarantee this.

Direct feature orthogonality loss: the private and invariant feature extractors produce 3-D private and invariant features ($F_{d,t}^{p}$, $F_{n,t}^{p}$, $F_{d,t}^{i}$ and $F_{n,t}^{i}$) of large size. To reduce the complexity, we first use a convolution layer to reduce the size of the obtained private and invariant features, then flatten the reduced features into 1-D feature vectors. Finally, we compute the inner product (orthogonality loss) between the private and invariant feature vectors, which is denoted $\mathcal{L}_{fea}$.

Gram-matrix orthogonality loss: inspired by style transfer [10], where the Gram matrix is widely used to represent the style of features, the private and invariant features can be considered to have different styles. Hence, we first compute the Gram matrices of the private and invariant features, then flatten them into 1-D vectors, and compute the orthogonality loss between these vectors, which is denoted $\mathcal{L}_{gram}$.

The computation of $\mathcal{L}_{fea}$ and $\mathcal{L}_{gram}$ can be defined as:

$\tilde{F}_{d,t}^{i} = C_d^{i}(F_{d,t}^{i}), \quad \tilde{F}_{d,t}^{p} = C_d^{p}(F_{d,t}^{p}), \quad \tilde{F}_{n,t}^{i} = C_n^{i}(F_{n,t}^{i}), \quad \tilde{F}_{n,t}^{p} = C_n^{p}(F_{n,t}^{p})$   (6)

where $C_d^{i}$, $C_d^{p}$, $C_n^{i}$ and $C_n^{p}$ are the reduction convolutions for the invariant and private features of the day-time and night-time images, respectively;

$\mathcal{L}_{fea} = \sum_{s\in\{d,n\}}\left|\Phi(\tilde{F}_{s,t}^{p})\cdot\Phi(\tilde{F}_{s,t}^{i})\right|, \quad \mathcal{L}_{gram} = \sum_{s\in\{d,n\}}\left|\Phi\!\left(G(\tilde{F}_{s,t}^{p})\right)\cdot\Phi\!\left(G(\tilde{F}_{s,t}^{i})\right)\right|$   (7)

where $\Phi(\cdot)$ is the operation that converts multi-dimensional features into 1-D feature vectors and $G(\cdot)$ computes the Gram matrix.
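The two orthogonality terms can be sketched as below for one domain (day or night); the choice of a 1×1 reduction convolution with 16 output channels and the feature shapes are assumptions for illustration:

```python
# Feature-level and Gram-matrix orthogonality penalties between private and
# invariant features of the same domain.
import torch
import torch.nn as nn


def gram_matrix(feat):
    """Channel-by-channel Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)


def inner_product_loss(a, b):
    """Absolute inner product between flattened (per-sample) vectors."""
    a = a.flatten(start_dim=1)
    b = b.flatten(start_dim=1)
    return torch.mean(torch.abs(torch.sum(a * b, dim=1)))


reduce_pri = nn.Conv2d(64, 16, kernel_size=1)   # size-reduction convolutions
reduce_inv = nn.Conv2d(64, 16, kernel_size=1)

F_pri = torch.randn(2, 64, 48, 80)   # private features of one domain
F_inv = torch.randn(2, 64, 48, 80)   # invariant features of the same domain

F_pri_r, F_inv_r = reduce_pri(F_pri), reduce_inv(F_inv)
L_fea = inner_product_loss(F_pri_r, F_inv_r)                             # direct features
L_gram = inner_product_loss(gram_matrix(F_pri_r), gram_matrix(F_inv_r))  # Gram matrices
L_orth = L_fea + L_gram   # applied to both the day and the night pair in practice
```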

3.2.4 Photometric Loss

Following Monodepth2 [12], the photometric loss is utilized in the self-supervised depth estimation process and is formulated in the same way as in [12]:

$\mathcal{L}_{ph} = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}(I_t, \hat{I}_t)\right) + (1-\alpha)\left\|I_t - \hat{I}_t\right\|_1$   (8)

where $\hat{I}_t$ is the reprojected image obtained from the estimated depth and the relative camera pose, the pose is estimated with method [35] following [12], and $\alpha$ is set empirically (the same as in [12]).
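A sketch of a Monodepth2-style photometric error, assuming the reprojected image has already been obtained by warping a neighbouring frame with the estimated depth and pose (the warping itself is omitted); α = 0.85 follows Monodepth2 [12]:

```python
# SSIM + L1 photometric error between the target frame and the warped frame.
import torch
import torch.nn as nn
import torch.nn.functional as F


def ssim(x, y):
    """Simplified SSIM dissimilarity with a 3x3 average-pooling window."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    pad = nn.ReflectionPad2d(1)
    x, y = pad(x), pad(y)
    mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)   # already (1 - SSIM) / 2


def photometric_loss(target, reprojected, alpha=0.85):
    l1 = torch.abs(target - reprojected).mean(1, keepdim=True)
    return (alpha * ssim(target, reprojected).mean(1, keepdim=True)
            + (1 - alpha) * l1).mean()


I_t = torch.rand(2, 3, 192, 320)        # target frame
I_t_hat = torch.rand(2, 3, 192, 320)    # frame warped from a neighbouring view
L_ph = photometric_loss(I_t, I_t_hat)
```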

3.2.5 Total Loss

The total training loss of the network is

$\mathcal{L} = \mathcal{L}_{ph} + \lambda_1\mathcal{L}_{rec} + \lambda_2\mathcal{L}_{sim} + \lambda_3\left(\mathcal{L}_{fea} + \mathcal{L}_{gram}\right)$   (9)

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weight parameters, which are set empirically in this paper.
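Combining the terms of Eq. (9) is then a one-liner; the loss values and the λ weights below are placeholders, since the paper's empirical weight values are not reproduced here:

```python
# Weighted sum of the individual loss terms (placeholder values).
import torch

L_ph = torch.tensor(0.12)     # photometric loss
L_rec = torch.tensor(0.05)    # reconstruction loss
L_sim = torch.tensor(0.03)    # similarity loss
L_fea = torch.tensor(0.01)    # direct-feature orthogonality loss
L_gram = torch.tensor(0.01)   # Gram-matrix orthogonality loss

lambda_rec, lambda_sim, lambda_orth = 0.1, 0.1, 0.1   # hypothetical weights

L_total = L_ph + lambda_rec * L_rec + lambda_sim * L_sim \
          + lambda_orth * (L_fea + L_gram)
```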

3.3 Inference Process

For a day-time image, the output is the day-time depth map $\hat{d}_{d,t}$; for a night-time image, it is $\hat{d}_{n,t}$. Except for the first convolution layer, the remaining parameters of the depth estimation network are shared between day and night. At inference, only the first day/night convolution layers and the shared-weights depth network are used to estimate depth; the private branches and the losses are not involved.
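A sketch of the inference path under these assumptions: `first_conv_day`, `first_conv_night` and `shared_depth_net` are hypothetical stand-ins for the non-shared first convolutions and the shared-weights depth network of the trained model.

```python
# Day or night test images pass through their domain-specific first convolution
# and then the shared depth network; nothing else is needed at test time.
import torch
import torch.nn as nn

first_conv_day = nn.Conv2d(3, 32, 3, padding=1)     # not shared
first_conv_night = nn.Conv2d(3, 32, 3, padding=1)   # not shared
shared_depth_net = nn.Sequential(                   # everything else is shared
    nn.ReLU(inplace=True), nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
)


@torch.no_grad()
def predict_depth(image, is_night):
    stem = first_conv_night if is_night else first_conv_day
    return shared_depth_net(stem(image))


night_image = torch.rand(1, 3, 192, 320)
depth = predict_depth(night_image, is_night=True)
```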

4 Experiments

Method (test at night) | Max depth | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Monodepth2 [12] (day) | 40m | 0.477 | 5.389 | 9.163 | 0.466 | 0.351 | 0.635 | 0.826
Monodepth2 [12] (night) | 40m | 0.661 | 25.213 | 12.187 | 0.553 | 0.551 | 0.849 | 0.914
Monodepth2+CycleGAN [44] | 40m | 0.246 | 2.870 | 7.377 | 0.289 | 0.672 | 0.890 | 0.950
HR-Depth [27] | 40m | 0.512 | 5.800 | 8.726 | 0.484 | 0.388 | 0.666 | 0.827
ADFA [34]¹ | 40m | 0.201 | 2.575 | 7.172 | 0.278 | 0.735 | 0.883 | 0.942
Ours | 40m | 0.233 | 2.344 | 6.859 | 0.270 | 0.631 | 0.908 | 0.962
Monodepth2 [12] (day) | 60m | 0.432 | 5.366 | 11.267 | 0.463 | 0.361 | 0.653 | 0.839
Monodepth2 [12] (night) | 60m | 0.580 | 21.446 | 12.771 | 0.521 | 0.552 | 0.840 | 0.920
Monodepth2+CycleGAN [44] | 60m | 0.244 | 3.202 | 9.427 | 0.306 | 0.644 | 0.872 | 0.946
HR-Depth [27] | 60m | 0.462 | 5.660 | 11.009 | 0.477 | 0.374 | 0.670 | 0.842
ADFA [34]¹ | 60m | 0.233 | 3.783 | 10.089 | 0.319 | 0.668 | 0.844 | 0.924
Ours | 60m | 0.231 | 2.674 | 8.800 | 0.286 | 0.620 | 0.892 | 0.956

Method (test at day) | Max depth | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Monodepth2 [12] (day) | 40m | 0.117 | 0.673 | 3.747 | 0.161 | 0.867 | 0.973 | 0.991
Monodepth2 [12] (night) | 40m | 0.306 | 2.313 | 5.468 | 0.325 | 0.545 | 0.842 | 0.937
HR-Depth [27] | 40m | 0.121 | 0.732 | 3.947 | 0.166 | 0.848 | 0.970 | 0.991
Ours | 40m | 0.109 | 0.584 | 3.578 | 0.153 | 0.880 | 0.976 | 0.992
Monodepth2 [12] (day) | 60m | 0.124 | 0.931 | 5.208 | 0.178 | 0.844 | 0.963 | 0.989
Monodepth2 [12] (night) | 60m | 0.294 | 2.533 | 7.278 | 0.338 | 0.541 | 0.831 | 0.934
HR-Depth [27] | 60m | 0.129 | 1.013 | 5.468 | 0.184 | 0.825 | 0.958 | 0.989
Ours | 60m | 0.115 | 0.794 | 4.855 | 0.168 | 0.863 | 0.967 | 0.989

¹ The test set of ADFA [34] is not available, so our test set is not exactly the same as that of ADFA [34].

Table 1: Quantitative comparison with state-of-the-art methods. Higher is better for the last three columns; lower is better for the others. Monodepth2 [12] (day) and HR-Depth [27] are trained with day-time data of the Oxford dataset and tested on the night and day test sets. Monodepth2 [12] (night) is trained with night-time data of the Oxford dataset and tested on the night and day test sets. Monodepth2+CycleGAN [44] translates night-time Oxford images into day-time images and then uses a day-time trained Monodepth2 model to estimate depth from the translated images, the same as [34].

In this section, following [34], we compare the performance of our method with state-of-the-art approaches on the Oxford RobotCar dataset [28], which is split to support all-day monocular depth estimation.

4.1 Oxford RobotCar Dataset

The KITTI [9] and Cityscapes [7] datasets are widely used for depth estimation. However, they contain only day-time images, which cannot meet the requirements of all-day depth estimation. Therefore, we choose the Oxford RobotCar dataset [28] for training and testing. It is a large outdoor driving dataset containing images captured at various times over one year, including day and night. Following [34], we use the left images collected by the front stereo camera (Bumblebee XB3) for self-supervised depth estimation. The sequences "2014-12-09-13-21-02" and "2014-12-16-18-44-24" are used for day-time and night-time training, respectively; both training sets are selected from the first 5 splits. The testing data are collected from the other splits of the Oxford RobotCar dataset and contain 451 day-time images and 411 night-time images. We use the depth data captured by the front LMS-151 depth sensor as ground truth in the testing phase. The images are first center-cropped and then resized before being fed to the network.

4.2 Implementation Details

First, we use CycleGAN [44] to translate day-time images into night-time images; the generated images and the original day-time images are then regarded as image pairs, which form the input of the proposed domain-separated network.

The network is trained for 20 epochs in an end-to-end manner with the Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.999). We set the batch size to 6; the learning rate is initialized to 1e-4 for the initial epochs and decayed to 1e-5 for the remaining epochs. At inference, except for the first convolution layer, the day and night branches share weights in the depth estimation part.
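The optimization setup described above can be sketched as follows; `model`, the dummy data, the placeholder loss, and the epoch at which the learning rate is dropped are assumptions (the decay epoch is not stated here):

```python
# Adam with beta1=0.9, beta2=0.999, batch size 6, lr 1e-4 decayed to 1e-5, 20 epochs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(3, 1, 3, padding=1)                        # stands in for the full network
train_dataset = TensorDataset(torch.rand(24, 3, 192, 320))   # dummy training images
loader = DataLoader(train_dataset, batch_size=6, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
decay_epoch = 15                                             # hypothetical switch point

for epoch in range(20):
    if epoch == decay_epoch:                                 # drop the learning rate to 1e-5
        for group in optimizer.param_groups:
            group["lr"] = 1e-5
    for (batch,) in loader:
        loss = model(batch).abs().mean()                     # placeholder for the total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```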

Figure 3: Qualitative comparison with other state-of-the-art methods at night. From left to right: (a) Night Images, (b) Fake Day Images translated by CycleGAN[44], (c) Monodepth2[12], (d) Monodepth2+CycleGAN[44], (e) HR-Depth[27], (f) ADFA[34], (g) Ours.
Figure 4: Qualitative comparison with other state-of-the-art methods at day. From left to right: (a) Day Images, (b) Monodepth2 [12](day), (c) HR-Depth [27], (d) Ours.
Figure 5: Visualization of convolution features. From left to right: (a) Day-time Private Features, (b) Night-time Private Features, (c) Day-time Invariant Features, (d) Night-time Invariant Features. The first column shows the corresponding input images, and the remaining columns, from left to right, are the top feature maps that contain the most information.

4.3 Quantitative Results

Table 1 reports the quantitative comparison between our approach and state-of-the-art approaches. Following [34], we evaluate the performance for two depth ranges: within 40m and within 60m. In Table 1, Monodepth2 [12] (day) denotes the model trained with day-time images of the Oxford dataset, while Monodepth2 [12] (night) denotes the model trained with night-time images of the Oxford dataset. Monodepth2+CycleGAN [44] denotes translating night-time Oxford images into day-time images and then using a day-time trained Monodepth2 [12] model to estimate depth from the translated images, the same as [34].

As shown in Table 1, Monodepth2 [12] is an effective self-supervised depth estimation approach that works well for day-time images. However, its performance on night-time images is limited whether it is trained with day-time or night-time images. Due to the non-uniform illumination of night-time images, over-bright and over-dark areas lose information to varying degrees, so training directly on night images cannot produce truly good results. Meanwhile, although Monodepth2+CycleGAN [44] and HR-Depth [27] improve night-time depth estimation to a certain extent, their performance is also limited by the non-uniform illumination of night-time images. ADFA [34] reduces the domain shift between day and night images at the feature level, but its performance is bounded by the day-time results. As shown in Table 1, our domain-separated framework effectively relieves the influence of disturbing terms and improves depth estimation for all-day images: almost all metrics within the 40m and 60m depth ranges, for both day and night images, are largely improved by our approach, which proves its superiority.
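For reference, the metrics reported in Table 1 can be computed as below; the masking to valid ground-truth pixels and the clamping range are common conventions and are assumptions with respect to the paper's exact evaluation code:

```python
# Standard monocular depth metrics: Abs Rel, Sq Rel, RMSE, RMSE log, delta accuracies.
import torch


def depth_metrics(pred, gt, max_depth=40.0):
    mask = (gt > 0) & (gt <= max_depth)           # valid ground truth within the range
    pred, gt = pred[mask].clamp(1e-3, max_depth), gt[mask]

    thresh = torch.max(gt / pred, pred / gt)
    return {
        "abs_rel": torch.mean(torch.abs(pred - gt) / gt).item(),
        "sq_rel": torch.mean((pred - gt) ** 2 / gt).item(),
        "rmse": torch.sqrt(torch.mean((pred - gt) ** 2)).item(),
        "rmse_log": torch.sqrt(torch.mean((torch.log(pred) - torch.log(gt)) ** 2)).item(),
        "a1": torch.mean((thresh < 1.25).float()).item(),
        "a2": torch.mean((thresh < 1.25 ** 2).float()).item(),
        "a3": torch.mean((thresh < 1.25 ** 3).float()).item(),
    }


pred = torch.rand(1, 1, 192, 320) * 40   # predicted depth in metres
gt = torch.rand(1, 1, 192, 320) * 40     # LiDAR ground truth in metres
print(depth_metrics(pred, gt, max_depth=40.0))
```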

4.4 Qualitative Results

The qualitative comparisons on night-time images are shown in Fig. 3, where (b) shows the day-time images translated from night-time images by CycleGAN [44], (c) shows the results of Monodepth2 [12] trained with day-time images and tested on night-time images, and (d) shows the results of Monodepth2 [12] trained with day-time images and tested on the generated day-time images (CycleGAN [44]). Compared with (c), (d) obtains better visual results, which shows that CycleGAN works positively for night-time depth estimation. (f) shows the results of ADFA [34], which applies an adversarial strategy at the feature level of night-time images but cannot fully restore the contours of objects. Compared with (c)-(f), our approach produces more visually appealing results, which proves its effectiveness.

Fig. 4 shows the qualitative comparison on day-time images. More depth details are recovered by our approach, which demonstrates its ability on day-time images as well.


Configuration (test at night, 40m) | Paired | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Baseline (unpaired day/night input) | ✗ | 0.573 | 18.577 | 11.189 | 0.524 | 0.569 | 0.807 | 0.897
Baseline (paired day/generated-night input) | ✓ | 0.429 | 15.183 | 11.401 | 0.422 | 0.589 | 0.862 | 0.942
+ reconstruction loss | ✓ | 0.357 | 10.699 | 10.385 | 0.377 | 0.611 | 0.884 | 0.946
+ direct feature orthogonality loss | ✓ | 0.251 | 2.993 | 8.173 | 0.299 | 0.606 | 0.884 | 0.949
+ Gram-matrix orthogonality loss | ✓ | 0.231 | 2.453 | 7.327 | 0.282 | 0.662 | 0.900 | 0.956
+ similarity loss (full model) | ✓ | 0.233 | 2.344 | 6.859 | 0.270 | 0.631 | 0.908 | 0.962

Configuration (test at day, 40m) | Paired | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Baseline (unpaired day/night input) | ✗ | 0.280 | 4.509 | 6.359 | 0.297 | 0.661 | 0.873 | 0.949
Baseline (paired day/generated-night input) | ✓ | 0.117 | 0.758 | 3.737 | 0.170 | 0.874 | 0.967 | 0.988
+ reconstruction loss | ✓ | 0.131 | 1.355 | 3.937 | 0.178 | 0.848 | 0.967 | 0.990
+ direct feature orthogonality loss | ✓ | 0.108 | 0.569 | 3.535 | 0.152 | 0.879 | 0.977 | 0.992
+ Gram-matrix orthogonality loss | ✓ | 0.109 | 0.580 | 3.518 | 0.152 | 0.891 | 0.976 | 0.991
+ similarity loss (full model) | ✓ | 0.109 | 0.584 | 3.578 | 0.153 | 0.880 | 0.976 | 0.992






Table 2: Quantitative results for the input data (unpaired vs. paired) and the proposed losses, added cumulatively from top to bottom. All experiments are tested within the depth range of 40m.

4.5 Ablation Study

4.5.1 Private and Invariant Features

Fig. 5 shows the private and invariant features obtained by the private and invariant feature extractors for day and night-time images, where the first column shows the corresponding input images and the remaining columns, from left to right, are the top feature maps containing the most information. Rows (a) and (b) show the private features of the day and night-time images, while rows (c) and (d) show the corresponding invariant features. The feature maps in Fig. 5 (a) and (b) contain non-regular, smooth information with few structures, similar to the illumination information of the images. The feature maps in Fig. 5 (c) and (d) contain regular texture information with obvious structures, representing the invariant content of the scenes, which proves that our approach separates the private (illumination, etc.) and invariant (texture, etc.) information effectively.

4.5.2 Analysis of Input Data and Losses

Unpaired data vs. paired data: The first two rows of each block in Table 2 show the quantitative results of our method with unpaired and paired images as input. Specifically, we use day-time and night-time images captured on the same roads of the Oxford RobotCar dataset as unpaired data, and day-time images with the corresponding generated night-time images (generated by GAN) as paired data. The results obtained with paired data outperform those obtained with unpaired data, mainly because unpaired images contain inconsistent information, since they are captured at different times even though on the same roads. Hence, we use paired images in this paper.

Reconstruction loss: The private and invariant features are complementary and should together contain all the information of the original images. Therefore, we use a reconstruction loss. The third row of Table 2 shows the quantitative results of our approach with the reconstruction loss. Compared with the paired baseline, the reconstruction loss brings a large improvement for night-time depth estimation. However, the result on day-time images is slightly worse, because the invariant feature extractor is shared between day and night and no other constraints are used.

Orthogonality loss: The orthogonality loss, composed of $\mathcal{L}_{fea}$ and $\mathcal{L}_{gram}$, is used to guarantee that the private and invariant features are separated orthogonally by the private and invariant extractors. The fourth and fifth rows of Table 2 show the quantitative results of our approach with $\mathcal{L}_{fea}$ and $\mathcal{L}_{gram}$, respectively. Adding $\mathcal{L}_{fea}$ greatly improves the depth estimation performance on night-time images and also works positively for day-time images. Meanwhile, adding $\mathcal{L}_{gram}$ helps the network achieve better performance on night-time images while maintaining the performance on day-time images, thereby further improving depth estimation for all-day images.

Similarity loss: The estimated depth maps of the input image pair should be similar, because they contain consistent information. Hence, a similarity loss is employed in our approach. Since depth estimation during the day usually achieves better results than at night, we use the day-time depth as a pseudo label so that the depth of the paired night image is pushed towards the day-time depth. The last row of Table 2 shows the quantitative results of our approach with the similarity loss: the similarity constraint further improves the depth estimation of night-time images while maintaining the performance on day-time images, proving its effectiveness.

5 Conclusion

In this paper, to relieve the problems of low visibility and non-uniform illumination in self-supervised depth estimation of all-day images, we propose an effective domain-separated framework, which separates images into two complementary sub-spaces at the feature level: the private (illumination, etc.) and invariant (texture, etc.) domains. The invariant (texture, etc.) features are employed for depth estimation, so the influence of disturbing terms such as low visibility and non-uniform illumination can be relieved and effective depth information can be obtained. To alleviate the inconsistency between day and night images, the domain-separated network takes day-time images and the corresponding generated night-time images (GAN) as input. Meanwhile, orthogonality, similarity and reconstruction losses are utilized to separate and constrain the private and invariant features effectively, so better depth estimation results can be expected. The proposed approach is fully self-supervised, can be trained end-to-end, and adapts to both day and night images. Experiments on the challenging Oxford RobotCar dataset demonstrate that our framework achieves state-of-the-art results for all-day images.

Acknowledgements This work is supported in part by Robotics and Autonomous Driving Lab of Baidu Research. This work is also supported in part by the National Key R&D Program of China under Grant (No.2018YFB1305900) and the National Natural Science Foundation of China under Grant (No.61836015).

References

  • [1] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese (2017) Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105. Cited by: §1.
  • [2] A. Atapour-Abarghouei and T. P. Breckon (2018) Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2810. Cited by: §2.3.
  • [3] R. T. Azuma (1997) A survey of augmented reality. Presence: Teleoperators & Virtual Environments 6 (4), pp. 355–385. Cited by: §1.
  • [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. arXiv preprint arXiv:1608.06019. Cited by: §3.1.
  • [5] J. Carmigniani, B. Furht, M. Anisetti, P. Ceravolo, E. Damiani, and M. Ivkovic (2011) Augmented reality technologies, systems and applications. Multimedia tools and applications 51 (1), pp. 341–377. Cited by: §1.
  • [6] P. Chen, A. H. Liu, Y. Liu, and Y. F. Wang (2019) Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2624–2632. Cited by: §1.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §1, §2.1, §2.1, §4.1.
  • [8] T. v. Dijk and G. d. Croon (2019-10) How do neural networks see depth in single images? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Cited by: §1, §3.1.
  • [9] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1, §2.1, §2.1, §4.1.
  • [10] G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens (2017) Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830. Cited by: §3.2.3.
  • [11] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279. Cited by: §2.1.
  • [12] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838. Cited by: Figure 1, §1, §2.1, §3.2.4, §3.2.4, Figure 3, Figure 4, §4.3, §4.3, §4.4, Table 1.
  • [13] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova (2019-10) Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • [14] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020-06) 3D packing for self-supervised monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [15] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020) 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2485–2494. Cited by: §2.1.
  • [16] S. Im, H. Jeon, and I. S. Kweon (2018) Robust depth estimation from auto bracketed images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2954. Cited by: §1.
  • [17] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 559–568. Cited by: §1.
  • [18] A. Johnston and G. Carneiro (2020-06) Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [19] N. Kim, Y. Choi, S. Hwang, and I. S. Kweon (2018) Multispectral transfer network: unsupervised depth estimation for all-day vision. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2.2.
  • [20] M. Klingner, J. Termöhlen, J. Mikolajczyk, and T. Fingscheidt (2020) Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance. In European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [21] L. Liu, Y. Liao, Y. Wang, A. Geiger, and Y. Liu (2021) Learning steering kernels for guided depth completion. IEEE Transactions on Image Processing 30, pp. 2850–2861. Cited by: §1.
  • [22] L. Liu, X. Song, X. Lyu, J. Diao, M. Wang, Y. Liu, and L. Zhang (2021) FCFR-net: feature fusion based coarse-to-fine residual learning for depth completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 2136–2144. Cited by: §1.
  • [23] X. Long, C. Lin, L. Liu, W. Li, C. Theobalt, R. Yang, and W. Wang (2021) Adaptive surface normal constraint for depth estimation. arXiv preprint arXiv:2103.15483. Cited by: §2.1.
  • [24] X. Long, L. Liu, W. Li, C. Theobalt, and W. Wang (2020) Multi-view depth estimation using epipolar spatio-temporal network. arXiv preprint arXiv:2011.13118. Cited by: §1.
  • [25] X. Long, L. Liu, C. Theobalt, and W. Wang (2020) Occlusion-aware depth estimation with adaptive normal constraints. In European Conference on Computer Vision, pp. 640–657. Cited by: §1.
  • [26] Y. Lu and G. Lu (2021) An alternative of lidar in nighttime: unsupervised depth estimation based on single thermal image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3833–3843. Cited by: §2.2.
  • [27] X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan (2020) HR-depth: high resolution self-supervised monocular depth estimation. arXiv preprint arXiv:2012.07356. Cited by: Figure 1, §1, Figure 3, Figure 4, §4.3, Table 1.
  • [28] W. Maddern, G. Pascoe, C. Linegar, and P. Newman (2017) 1 year, 1000 km: the oxford robotcar dataset. The International Journal of Robotics Research 36 (1), pp. 3–15. Cited by: Figure 1, §4.1, §4.
  • [29] M. Maximov, K. Galim, and L. Leal-Taixé (2020) Focus on defocus: bridging the synthetic to real domain gap for depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1071–1080. Cited by: §2.3.
  • [30] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §1.
  • [31] A. A. B. Pritsker (1984) Introduction to simulation and slam ii. Halsted Press. Cited by: §1.
  • [32] M. Ramamonjisoa, Y. Du, and V. Lepetit (2020-06) Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [33] A. Sharma, L. Cheong, L. Heng, and R. T. Tan (2020) Nighttime stereo depth estimation using joint translation-stereo learning: light effects and uninformative regions. In 2020 International Conference on 3D Vision (3DV), pp. 23–31. Cited by: §2.2.
  • [34] M. Vankadari, S. Garg, A. Majumder, S. Kumar, and A. Behera (2020) Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation. In European Conference on Computer Vision, pp. 443–459. Cited by: §2.2, Figure 3, §4.1, §4.3, §4.3, §4.4, Table 1, §4, footnote 1, footnote 1.
  • [35] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey (2018) Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030. Cited by: §3.2.4.
  • [36] L. Wang, J. Zhang, O. Wang, Z. Lin, and H. Lu (2020-06) SDC-depth: semantic divide-and-conquer network for monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [37] J. Watson, M. Firman, G. J. Brostow, and D. Turmukhambetov (2019-10) Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1.
  • [38] Z. Yu, L. Jin, and S. Gao (2020) Pnet: patch-match and plane-regularization for unsupervised indoor depth estimation. In ECCV, Cited by: §2.1.
  • [39] F. Zhang, X. Qi, R. Yang, V. Prisacariu, B. Wah, and P. Torr (2020) Domain-invariant stereo matching networks. In European Conference on Computer Vision, pp. 420–439. Cited by: §2.3.
  • [40] S. Zhao, H. Fu, M. Gong, and D. Tao (2019) Geometry-aware symmetric domain adaptation for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9788–9798. Cited by: §2.3.
  • [41] Y. Zhao, S. Kong, D. Shin, and C. Fowlkes (2020) Domain decluttering: simplifying images to mitigate synthetic-real domain shift and improve depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3330–3340. Cited by: §2.3.
  • [42] J. Zhou, Y. Wang, K. Qin, and W. Zeng (2019-10) Unsupervised high-resolution depth learning from videos with dual networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1.
  • [43] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858. Cited by: §2.1.
  • [44] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on. Cited by: Figure 1, §1, §3.1, §3.2.2, Figure 3, §4.2, §4.3, §4.3, §4.4, Table 1.