Self-supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-board Videos

06/07/2021, by Shaocheng Jia et al.

Self-supervised depth estimation has drawn much attention in recent years because it requires only image sequences rather than labeled data. Moreover, it can be conveniently used in various applications, such as autonomous driving, robotics, realistic navigation, and smart cities. However, extracting global contextual information from images and predicting a geometrically natural depth map remain challenging. In this paper, we present DLNet for pixel-wise depth estimation, which simultaneously extracts global and local features with the aid of our depth Linformer block. This block consists of the Linformer and innovative soft split multi-layer perceptron blocks. Moreover, a three-dimensional geometry smoothness loss is proposed to predict a geometrically natural depth map by imposing a second-order smoothness constraint on the predicted three-dimensional point clouds, thereby realizing improved performance as a byproduct. Finally, we explore the multi-scale prediction strategy and propose the maximum margin dual-scale prediction strategy for further performance improvement. In experiments on the KITTI and Make3D benchmarks, the proposed DLNet achieves performance competitive with that of the state-of-the-art methods, reducing time and space complexities by more than 62% and 56%, respectively. Extensive testing in various real-world situations further demonstrates the strong practicality and generalization capability of the proposed model.


I Introduction

[Figure 1: performance.jpg — performance versus number of parameters for Our (LL), Our (CC), Our (CL), Monodepth2 [godard2019digging], SC-SfMLearner [bian2019unsupervised], Yang et al. [yang2017unsupervised], SfMLearner [zhou2017unsupervised], and GeoNet-Resnet [yin2018geonet]]

Fig. 1: Performance of self-supervised depth estimation and other state-of-the-art methods on the KITTI dataset [geiger2013vision]. CC, CL, and LL represent different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For example, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively.

Recovering a scene’s depth information plays a significant role in three-dimensional (3D) reconstruction, robot navigation, and scene understanding. The depth information of a scene can be obtained using two types of sensors: active detection (e.g., light detection and ranging (LiDAR) sensors) and passive receiving (e.g., camera sensors). Using LiDAR, 3D point cloud data can be directly obtained by scanning a scene; this method is accurate but is expensive for routine use. Alternatively, image data from a camera sensor can be used to recover the 3D information.

Specifically, a stereo vision system can follow the epipolar geometry restrictions to recover the depth information in a straightforward manner, but this approach necessitates a binocular camera. In most cases, however, monocular camera data are preferred considering energy consumption and cost constraints. Therefore, monocular depth estimation, as a convenient and economical method of recovering depth information, has attracted the attention of many scholars from diverse research fields. Unfortunately, extracting 3D information from a monocular vision system is challenging because the problem is inherently ill-posed.

Current monocular depth estimation methods can be divided into supervised and self-supervised methods depending on whether the ground truth is used during training. In supervised depth estimation, the ground truth depth map is used to train a deep neural network (DNN), which directly fits the relationship between the RGB image and the depth map and imposes some priors by designing different loss functions as well as devising several network variants to better extract the features. In self-supervised depth estimation, first proposed by Zhou et al. [zhou2017unsupervised], warping-based view synthesis is used as supervision to train the depth and pose networks. This approach does not require any labeled data and can simultaneously recover the depth and movement information. Although the depth map is relative, the absolute depth can be easily obtained with the aid of other information, such as real velocity from the global positioning system and the flat road assumption [absolutedepthXue2020].

In the past few years, both supervised and self-supervised depth estimation approaches have taken advantage of the powerful feature extracting ability of convolutional neural networks (CNNs). However, capturing global contextual information using pure CNNs is difficult because of the limited kernel size. To overcome this drawback, numerous studies have applied conditional random fields (CRFs) and Markov random fields (MRFs) [cao2017estimating, eigen2015predicting, li2015depth, liu2015deep, mousavian2016joint, xu2017multi, xu2018structured, karschdepth, saxena20083]. Nevertheless, CRFs and MRFs are difficult to optimize, as is applying them to build an end-to-end model.

Predicting a geometrically smooth depth map facilitates both quantitative and qualitative evaluations. However, the smoothness losses used in previous works apply constraints only to the two-dimensional (2D) depth map and do not consider the 3D geometric properties of the scene. In addition, the multi-scale prediction strategy is often applied to overcome the gradient locality issue; however, previous works have relied on four-scale prediction frameworks, which raise the learning difficulty of the networks and thus negatively affect the performance.

To tackle the aforementioned issues, we propose the depth Linformer network (DLNet), a full-Linformer-based model [Linformer], to concurrently capture global and local features and thereby improve self-supervised depth estimation. Although many attempts to apply the Transformer model [Transformer] to computer vision tasks have been reported, few have used pure Transformer or Linformer networks for pixel-wise tasks; instead, researchers have typically relied on CNNs either to extract features (encoding) or to predict results (decoding). To the best of our knowledge, the present study is the first to perform pixel-wise depth estimation with a full-Linformer-based model. Moreover, to further improve the quantitative and qualitative results, we explore geometry properties and multi-scale prediction.

Our contributions can be summarized as follows:

  • To effectively extract global and local features, we propose a soft-split multi-layer perceptron (SSMLP) block and a depth Linformer block (DLBlock) to build the DLNet, the depth decoder, and the pose decoder.

  • We propose a 3D geometry smoothness (3DGS) loss to obtain a natural and geometry-preserving depth map by applying second-order smoothness constraints on the 3D point clouds rather than on the 2D depth map.

  • We present a maximum margin dual-scale prediction (MMDSP) strategy to overcome the gradient locality issue while concurrently saving computational resources and boosting performance.

  • Compared with state-of-the-art methods, the proposed model achieves competitive performance on the KITTI [geiger2013vision] and Make3D [saxena2008make3d] benchmarks but with a lightweight configuration and without pre-training. Furthermore, the promising qualitative results on the Cityscapes dataset [cordts2016cityscapes] and real-world scenarios demonstrate the proposed model's strong generalization capability and practicality.

The remainder of this paper is organized as follows. Section II introduces related works. Section III mathematically defines the problem and presents notational conventions. Section IV presents the model design and loss functions. Section V reports on detailed experiments, and Section VI discusses the limitations of the proposed model. Section VII draws the conclusions.

II Related Work

In this section, we review the literature related to supervised depth estimation, self-supervised depth estimation, and transformer networks for computer vision.

II-A Supervised depth estimation

Prior to advances in deep learning algorithms, monocular depth estimation was largely obtained by devising efficient handcrafted features to capture the 3D information [karschdepth, saxena20083]. For example, Saxena et al. [saxena20083] extracted absolute and relative depth features from the textures and statistical histograms of images, respectively, and integrated the extracted features and MRFs to predict the final depth map. Research on depth estimation has since proliferated, mainly focusing on exploring monocular cues in images [baig2016coupled, choi2015depth, furukawa2017depth, zoran2015learning].

However, obtaining abstract and deep features through such manual design is challenging. Fortunately, CNNs can aid in extracting abstract and complicated features from images. To the best of our knowledge, Eigen et al. [eigen2015predicting] were the first to apply CNNs to monocular depth estimation, and numerous variants, focusing on network structure design, have been proposed since [chen2016single, eigen2014depth, eigen2015predicting, laina2016deeper, li2017two]. In addition, to overcome the spatial locality of the convolution operator, CRFs and recurrent neural networks (RNNs) have been introduced to capture the global information of an image [cao2017estimating, eigen2015predicting, li2015depth, liu2015deep, mousavian2016joint, xu2017multi, xu2018structured, almalioglu2019ganvo, cs2018depthnet, grigorev2017depth, mancini2017toward, tananaev2018temporally, wang2019recurrent, mypaper2020].

Typically, depth estimation is regarded as a pixel-wise regression problem, but it can also be cast as a classification problem by discretizing the continuous depth into many intervals so as to predict a specific label for each pixel [cao2017estimating, fu2018deep].

Fig. 2: Self-supervised monocular depth estimation system. Pose is the predicted ego-motion vector between the target image and a reference image.

II-B Self-supervised depth estimation

Differing from supervised depth estimation, self-supervised depth estimation uses warping-based view synthesis to reconstruct the target image and then trains the model by computing the difference between the reconstructed and target images [zhou2017unsupervised]. Self-supervised depth estimation relies on monocular data and does not require any labeled data, an advantage that has attracted many researchers [chen2019towards, garg2016unsupervised, ranjan2019competitive, yin2018geonet, zhan2018unsupervised, zhou2019unsupervised, zhou2017unsupervised, godard2017unsupervised, kuznietsov2017semi, almalioglu2019ganvo, feng2019sganvo].

However, recovering a scene’s structure from motion (SfM) is inherently problematic in some special cases, such as moving objects and occlusions. To bridge these application gaps, a series of outstanding studies have been conducted. Representative studies include the following: Godard et al. [godard2019digging] proposed a minimal reprojection loss function to effectively alleviate the occlusion/disocclusion problem; Casser et al. [casser2019depth] used an advanced semantic segmentation model to mask out potential moving objects, thus excluding their influence; Zhou et al. [zhou2017unsupervised] proposed multi-scale training for solving the gradient locality issue caused by low textures; Bian et al. [bian2019unsupervised] presented a geometry consistency loss function for achieving scale-consistent depth and ego-motion estimation within a continuous sequence; and Jia et al. [mypaper2021] modeled the prediction uncertainty and relationships between depths to realize a reliable and practical depth estimation system.

Moreover, Park et al. [park2019high] and Yang et al. [yang2019fast] have integrated data from multiple sensors, such as LiDAR sensors, visual odometers, and cameras, for improving the inference efficiency and accuracy.

II-C Transformer for computer vision

Transformer [Transformer], an attention-based model initially proposed for natural language processing (NLP), is efficient at capturing long-range dependencies between items. Numerous recent studies have applied Transformer to computer vision tasks by reshaping square images into sequence-like data, either using Transformer for CNN feature processing [T4C1, T4C2, T4C3, T4C4] or for feature extraction [C4T1, C4T2, C4T3, C4T4]. Specifically, in the former application, Transformer is used as a decoder for predictions, whereas in the latter, Transformer substitutes for CNNs and is used as an encoder for feature extraction.

However, applying the classic Transformer to pixel-wise computer vision tasks, such as semantic segmentation and depth estimation, is difficult because it requires large storage and computational resources for processing long sequence data. Hence, for efficiency in pixel-wise tasks, CNNs are generally used at the beginning or end of the network [C4T3, T4C4].

In addition, when rigidly splitting an image into many patches and using them as the input of the network, it is difficult to capture delicate features, such as edges [C4T4]. Consequently, Yuan et al. [C4T4] proposed a tokens-to-token strategy to aggregate neighboring features.

In summary, the literature reveals that effectively and efficiently extracting global and local information from images remains challenging, especially when using the emerging Transformer model. Moreover, no studies have examined the second-order geometric smoothness of the predicted point clouds. These research gaps have inspired the present work.

III Problem Setup

A self-supervised monocular depth estimation system comprises two parts, namely the depth estimation network and the pose estimation network, denoted as $\Phi_D$ and $\Phi_P$, respectively. Given a continuous image sequence containing a target image $I_t$ and reference images $I_s$, the depth estimation network solely takes the target image as the input to predict its depth map; this can be mathematically defined as $\hat{D}_t = \Phi_D(I_t)$, where $\hat{D}_t$ is the predicted depth map of the input image $I_t$. Differing from the depth estimation network, the pose estimation network takes the whole sequence as the input and predicts the ego-motion for each image pair $(I_t, I_s)$; this can be mathematically presented as $\hat{T}_{t \to s} = \Phi_P(I_t, I_s)$, where $\hat{T}_{t \to s}$ is a $4 \times 4$ pose matrix describing the movement between the target and reference images.

Theoretically, given the depth map of the target image and the ego-motion between the target and reference images, the target image can be reconstructed from the reference images by warping-based view synthesis [zhou2017unsupervised], which can be mathematically defined as Eq. 1; here, $p_s$, $K$, $\hat{T}_{t \to s}$, $\hat{D}_t(p_t)$, $K^{-1}$, and $p_t$ represent the coordinate in the reference image, the camera intrinsic matrix ($3 \times 3$), the transform matrix between the target image and reference images, the depth corresponding to $p_t$, the inverse matrix of $K$, and the coordinate in the target image, respectively.

$p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t$ (1)

During the training phase, the reconstruction loss is computed from the difference between the target and reconstructed images to train the system. Notably, the depth and pose estimation networks are trained cooperatively, but they can work separately during the testing phase.
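To make the warping step concrete, the following is a minimal PyTorch sketch of warping-based view synthesis (Eq. 1) under a pinhole camera model; the tensor layout and function name are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def view_synthesis(ref_img, depth_t, T_t2r, K, K_inv):
    """Reconstruct the target view by sampling the reference image (Eq. 1).

    ref_img: (B, 3, H, W) reference image
    depth_t: (B, 1, H, W) predicted depth of the target image
    T_t2r:   (B, 4, 4)    relative pose from target to reference
    K, K_inv:(B, 3, 3)    camera intrinsics and their inverse
    """
    b, _, h, w = ref_img.shape
    device = ref_img.device

    # Homogeneous pixel grid of the target image: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(b, -1, -1)

    # Back-project to 3D camera coordinates: P = D * K^{-1} p
    cam_points = depth_t.view(b, 1, -1) * (K_inv @ pix)

    # Transform into the reference frame and project: p_s ~ K T D K^{-1} p_t
    cam_points_h = torch.cat([cam_points, torch.ones(b, 1, h * w, device=device)], dim=1)
    proj = K @ (T_t2r @ cam_points_h)[:, :3, :]
    px = proj[:, 0, :] / (proj[:, 2, :] + 1e-7)
    py = proj[:, 1, :] / (proj[:, 2, :] + 1e-7)

    # Normalise to [-1, 1] for the bilinear, locally sub-differentiable sampler
    grid = torch.stack([2 * px / (w - 1) - 1, 2 * py / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(ref_img, grid, padding_mode="border", align_corners=True)
```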

IV Method

In this section, we first illustrate the entire monocular depth estimation system. Subsequently, the proposed DLNet is introduced. Thereafter, the DLNet-based depth and pose estimation networks are presented. Finally, the loss functions used in this paper are presented.

Fig. 3: Architectures of the Linformer layer and its components. Left to right: scaled dot-product linear attention (SDPLA), multi-head linear attention (MHLA), and Linformer block.

IV-A Model overview

The self-supervised monocular depth estimation system concurrently performs depth and pose estimation during training, as shown in Fig. 2. Following Zhou et al. [zhou2017unsupervised], the length of the image sequence used for training is set to three, and the middle frame of the sequence is regarded as the target image that requires depth estimation using the depth net. In contrast, the whole sequence of three frames is used for pose estimation.

After obtaining the depth map and pose vectors, warping-based view synthesis can be performed for reconstructing the target image from the reference images. Then, the reconstruction loss, namely the difference between the target and reconstructed images, is computed to train the system. In the following subsections, the proposed DLNet, depth net, pose net, and loss function are introduced.
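The training procedure described above can be summarized in a few lines. The sketch below is a simplified illustration that reuses the view_synthesis sketch from Section III, with depth_net, pose_net, and photo_error as placeholders for the components detailed in the following subsections; the inverse-depth parameterization and the per-pixel minimum over references are assumptions consistent with the losses introduced later.

```python
import torch

def training_step(seq, K, K_inv, depth_net, pose_net, photo_error):
    # seq: [I_{t-1}, I_t, I_{t+1}], each (B, 3, H, W); the middle frame is the target
    tgt, refs = seq[1], [seq[0], seq[2]]

    disp = depth_net(tgt)                        # disparity of the target frame
    depth = 1.0 / disp.clamp(min=1e-4)           # inverse-depth parameterisation (an assumption)

    errors = []
    for ref in refs:
        T = pose_net(tgt, ref)                   # (B, 4, 4) relative pose, target -> reference
        warped = view_synthesis(ref, depth, T, K, K_inv)   # Eq. 1
        errors.append(photo_error(warped, tgt))  # per-pixel photometric error map
    # Per-pixel minimum over reference images (cf. Eq. 12) mitigates occlusions
    return torch.stack(errors, dim=0).min(dim=0).values.mean()
```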

IV-B Depth Linformer Network (DLNet)

IV-B1 Linformer

Transformer uses the scaled dot-product attention (SDPA) mechanism to perform feature aggregation, which can be intuitively described as mapping a query and a set of key–value pairs to an output [Transformer]. In particular, the query, keys, values, and output are all vectors with dimensions $d_k$, $d_k$, $d_v$, and $d_v$, respectively. In practice, we pack a set of queries together to perform the attention computation simultaneously.

For clarity, we denote the queries of a sequence of length $n$ as the matrix $Q \in \mathbb{R}^{n \times d_k}$. Accordingly, the keys and values are denoted as the matrices $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$. Thus, the SDPA can be defined as Eq. 2:

$\mathrm{SDPA}(Q, K, V) = \mathrm{softmax}\left(QK^{\top}/\sqrt{d_k}\right)V$ (2)

The attention matrix $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})$ is obtained by multiplying two matrices, which requires $O(n^2)$ time and space complexities with respect to the length $n$ of the sequence. In many cases, the sequence length makes the storage and computational resources required by the Transformer model prohibitively large, especially for pixel-wise computer vision tasks.

To overcome this problem, Wang et al. [Linformer] proposed Linformer, a linear-complexity ($O(n)$) Transformer based on the low-rank property of the attention matrix, significantly reducing the time and space complexities. Specifically, two learnable matrices $E, F \in \mathbb{R}^{k \times n}$ are used to project the original $n$-dimensional key and value matrices $K$ and $V$ into a $k$-dimensional ($k \ll n$) space; accordingly, SDPA can be rewritten as scaled dot-product linear attention (SDPLA) (Eq. 3). For simplicity, we do not differentiate between $E$ and $F$ in the following text.

$\mathrm{SDPLA}(Q, K, V) = \mathrm{softmax}\left(Q(EK)^{\top}/\sqrt{d_k}\right)FV$ (3)

Accordingly, the multi-head linear attention (MHLA) can be described as follows (Eq. 4):

$\mathrm{MHLA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{SDPLA}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$ (4)

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable parameters, and $h$ is the number of heads. Finally, the Linformer block is designed by integrating MHLA and a multi-layer perceptron (MLP); the detailed structures are shown in Fig. 3.
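For illustration, a compact PyTorch sketch of multi-head scaled dot-product linear attention (Eqs. 3 and 4) is given below; the layer sizes, the projected length k, and the head count are assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LinearSelfAttention(nn.Module):
    def __init__(self, dim, seq_len, k=64, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dh = heads, dim // heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Learnable n -> k projections of the keys and values (the E, F matrices)
        self.proj_k = nn.Linear(seq_len, k, bias=False)
        self.proj_v = nn.Linear(seq_len, k, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, n, dim), n == seq_len
        b, n, _ = x.shape
        q = self.to_q(x).view(b, n, self.h, self.dh).transpose(1, 2)      # (B, h, n, dh)
        k = self.proj_k(self.to_k(x).transpose(1, 2))                     # (B, dim, k)
        v = self.proj_v(self.to_v(x).transpose(1, 2))                     # (B, dim, k)
        k = k.reshape(b, self.h, self.dh, -1)                             # (B, h, dh, k)
        v = v.reshape(b, self.h, self.dh, -1).transpose(2, 3)             # (B, h, k, dh)
        attn = torch.softmax(q @ k / self.dh ** 0.5, dim=-1)              # (B, h, n, k)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)                # (B, n, dim)
        return self.out(out)
```

Because the attention map has shape (n, k) rather than (n, n), both time and memory grow linearly in the sequence length, which is what makes pixel-wise sequences tractable.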

IV-B2 Depth Linformer block (DLBlock)

Most studies have rigidly divided the image into many patches, flattening the patches to vectors and using them as the input of the Transformer model. However, in this approach, obtaining fine features of the image, such as edges, is challenging because of the lack of communication between the patches. Furthermore, the original Transformer [Transformer] and Linformer [Linformer] models cannot dynamically change the feature map resolution, resulting in high computational and storage costs, especially for image processing.

To overcome this, we introduce the soft-split multi-layer perceptron (SSMLP) block to promote communication between the patches, thereby simultaneously adjusting the feature map size and reducing the computational and storage costs. Subsequently, the features obtained from the SSMLP are delivered to the Linformer block for extracting the global features. Figure 4 illustrates the detailed structure of the proposed DLBlock.

Fig. 4: Architecture of the proposed Depth Linformer block (DLBlock).
Fig. 5: Architectures of the proposed depth Linformer network (DLNet), depth decoder, and pose decoder. EBlock and DBlock represent an encoder block and a decoder block, respectively. +F indicates taking in the F feature via the corresponding shortcut connection. In DBlocks, the concatenation operation is used if F is available. Similarly, multi-scale outputs are available for the different configurations.

Let us denote the input feature as $X \in \mathbb{R}^{h \times w \times d}$, where $h$, $w$, and $d$ represent the height, width, and dimension number of the input feature, respectively. For aggregating the local feature, a moving window of size $m$, stride $s$, and padding $p$ is used to reshape the input feature to the sequence data $X' \in \mathbb{R}^{(h'w') \times (m^2 d)}$ in a soft-split manner, wherein $h'$ and $w'$ are derived from Eq. 5:

$h' = \left\lfloor (h + 2p - m)/s \right\rfloor + 1, \quad w' = \left\lfloor (w + 2p - m)/s \right\rfloor + 1$ (5)

where $\lfloor \cdot \rfloor$ denotes rounding down.

When the stride and moving window size satisfy $s < m$, adequate overlapping exists (soft split) for capturing the fine details of the image. However, in this case, the dimension of the feature is multiplied by $m^2$, significantly raising the computational and storage costs. Accordingly, an MLP is used for dimension reduction (Eq. 6):

$X'' = \sigma\left(\mathrm{LN}(X')W + b\right)$ (6)

where $X'' \in \mathbb{R}^{(h'w') \times d_r}$ represents the transformed low-dimensional features; $W \in \mathbb{R}^{(m^2 d) \times d_r}$ and $b \in \mathbb{R}^{d_r}$ are learnable parameters; and $d_r$, $\mathrm{LN}$, and $\sigma$ represent the target dimension of the dimension reduction, layer normalization, and the activation function, respectively. The broadcasting mechanism is automatically applied for the foregoing additive operation.

Furthermore, the target dimension $d_r$ is given by Eq. 7:

$d_r = \lfloor d_o / \lambda \rfloor$ (7)

where $\lambda$ is a hyperparameter and $d_o$ is the output dimension.

The aforementioned transformations and computations, called SSMLP, effectively conduct local feature extraction with the aid of the soft split and the MLP layer.
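As an illustration of the soft split, the sketch below implements an SSMLP-style layer with nn.Unfold (overlapping windows) followed by the layer-normalized MLP of Eq. 6; the window, stride, padding, and activation choices are assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class SSMLP(nn.Module):
    def __init__(self, in_dim, out_dim, window=3, stride=2, padding=1):
        super().__init__()
        # Overlapping (soft-split) patch gathering: stride < window keeps neighbour overlap
        self.unfold = nn.Unfold(kernel_size=window, stride=stride, padding=padding)
        self.window, self.stride, self.padding = window, stride, padding
        self.norm = nn.LayerNorm(in_dim * window * window)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim * window * window, out_dim),   # dimension reduction (Eq. 6)
            nn.GELU(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Output resolution as in Eq. 5 (floor of the usual convolution formula)
        h2 = (h + 2 * self.padding - self.window) // self.stride + 1
        w2 = (w + 2 * self.padding - self.window) // self.stride + 1
        seq = self.unfold(x).transpose(1, 2)     # (B, h2*w2, C*window^2): soft split
        seq = self.mlp(self.norm(seq))           # (B, h2*w2, out_dim)
        return seq, (h2, w2)                     # sequence form, ready for the Linformer block
```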

Subsequently, the feature $X''$ goes through the Linformer block to capture the global information, thereby obtaining the feature $G \in \mathbb{R}^{(h'w') \times d_r}$. Then, another MLP layer performs the dimension increase to change the feature to $\mathbb{R}^{(h'w') \times d_o}$; this is followed by reshaping to $\hat{X} \in \mathbb{R}^{h' \times w' \times d_o}$, which can be formulated as Eq. 8:

$\hat{X} = \mathrm{Reshape}\left(\sigma\left(\mathrm{LN}(G)W' + b'\right)\right)$ (8)

Finally, we bring in the initial feature $X$ through a residual connection, which is stated in Eq. 9:

$X_{\mathrm{out}} = \hat{X} + X$ (9)

where $X_{\mathrm{out}}$ is the output feature.

In summary, our DLBlock mainly comprises the following three components:

  • SSMLP. SSMLP is critical for extracting the local feature and improving the efficiency;

  • Linformer block. The Linformer block is crucial for capturing global features, automatically shifting the focus to the more important features through the inner self-attention mechanism;

  • Residual connection. The residual connection, a proven technique, mitigates gradient explosion and network degradation while avoiding, to the extent possible, the information loss caused by changes in the feature map size and dimensions; a condensed sketch of the full block follows this list.
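Putting the three components together, the following condensed sketch chains the SSMLP and linear-attention sketches from above with the dimension-increasing MLP and the residual connection (Eqs. 8 and 9). With stride 1 the spatial size is preserved so the residual addition type-checks; this, and the layer sizes, are assumptions rather than the exact DLBlock configuration.

```python
import torch.nn as nn

class DLBlock(nn.Module):
    def __init__(self, dim, seq_len, hidden=64, window=3, stride=1, padding=1):
        super().__init__()
        # seq_len must equal H*W of the incoming feature for the Linformer projection
        self.ssmlp = SSMLP(dim, hidden, window, stride, padding)     # local features (Eq. 6)
        self.attn = LinearSelfAttention(hidden, seq_len)             # global features (Eqs. 3-4)
        self.expand = nn.Linear(hidden, dim)                         # dimension increase (Eq. 8)

    def forward(self, x):                       # x: (B, C, H, W)
        seq, (h, w) = self.ssmlp(x)             # (B, H*W, hidden) when stride=1, padding=1
        seq = self.attn(seq)
        out = self.expand(seq).transpose(1, 2).reshape(x.shape[0], -1, h, w)
        return out + x                          # residual connection (Eq. 9)
```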

In what follows, DLNet, a DLBlock-based network, is introduced.

IV-B3 Depth Linformer Network (DLNet)

Inspired by the success of CNNs, a pyramid-like structure that gradually reduces the feature map size is adopted when devising the DLNet. As shown in Fig. 5 (a), an SSMLP layer is first used to embed the input image and simultaneously decrease the feature map’s resolution. After a MaxPooling layer, four stages of feature transformations are performed via the encoder blocks (Fig. 5 (b)), which include the proposed DLBlock.

In the next subsection, the depth and pose networks are devised based on the proposed DLNet.

IV-C Depth and pose estimation networks

The proposed DLNet is considered the encoder in both the depth and pose estimation networks. The integral depth and pose estimation networks are presented in this subsection by further devising the decoders using the proposed components.

Figure 5 (c) illustrates the structure of the proposed depth decoder, which consists of a few decoder blocks and output heads. For each decoder block illustrated in Fig. 5 (d), a DLBlock is primarily used to perform the feature transformation, following which an upsampling layer is used to increase the resolution. Subsequently, the upsampled feature is stacked with the feature from the skip connection with respect to the channel, when the skip connection is available. Finally, a lightweight SSMLP layer is used for feature compression. Because of the availability of the skip connections, the residual connection in DLBlock is discarded to save computational and storage resources. For each output head, an SSMLP is used to predict the disparity map.
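A decoder block (DBlock) can be sketched as follows, assuming dlblock and ssmlp placeholders that operate on (B, C, H, W) feature maps; the upsampling mode and the exact placement of the concatenation follow the description above and are otherwise assumptions.

```python
import torch
import torch.nn.functional as F

def decoder_block(x, dlblock, ssmlp, skip=None):
    x = dlblock(x)                                        # feature transformation
    x = F.interpolate(x, scale_factor=2, mode="nearest")  # upsample to the next resolution
    if skip is not None:                                  # shortcut feature from the encoder (+F)
        x = torch.cat([x, skip], dim=1)                   # channel-wise concatenation
    return ssmlp(x)                                       # lightweight feature compression
```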

Figure 5 (e) illustrates the structure of the proposed pose decoder, which simply consists of a DLBlock without the residual connection and an SSMLP layer.

Then, the depth and pose estimation networks can be obtained by simply integrating the proposed DLNet and the corresponding decoder.

IV-D Losses

In this subsection, a set of loss functions used for training the networks are presented. Specifically, basic losses that have been successfully applied in previous works are introduced. Subsequently, a novel 3D geometry smoothness (3DGS) loss function and the maximum margin dual-scale prediction (MMDSP) are presented. Thereafter, the final loss is shown.

IV-D1 Basic losses

A strong assumption, Lambertian reflection [basri2003lambertian], is imposed on all surfaces in the image, which makes a photometric constancy loss between the target image and the reconstructed image possible. Taking robustness into account, we therefore choose the L1 norm to compute the photometric loss, which can be stated as Eq. 10:

$L_{p} = \left| I_t - \hat{I}_{s \to t} \right|, \quad \hat{I}_{s \to t} = I_s\left\langle \mathrm{proj}(\hat{D}_t, \hat{T}_{t \to s}, K) \right\rangle$ (10)

where $\hat{I}_{s \to t}$ represents the reconstructed image from the reference image $I_s$; $\mathrm{proj}(\cdot)$ represents the reprojection function described in Eq. 1; and $\langle \cdot \rangle$ is the bilinear sampling operator, which is locally sub-differentiable. For simplicity, these notations are used directly in the rest of this paper.

However, the photometric loss is sensitive to illumination changes, particularly in complicated real-world scenarios. Consequently, following Godard et al. [godard2019digging], the structure similarity (SSIM) loss (Eq. 11) is used to alleviate this issue:

$L_{\mathrm{SSIM}} = \dfrac{1 - \mathrm{SSIM}(I_t, \hat{I}_{s \to t})}{2}$ (11)

To address the problem of visual inconsistencies between the target and reference images, such as occlusion and disocclusion, we follow Godard et al. [godard2019digging] in adopting the minimum reprojection loss (Eq. 12):

$L_{\mathrm{rep}} = \min_{s} \left( (1 - \alpha) L_{\mathrm{SSIM}} + \alpha L_{p} \right)$ (12)

where $\alpha$ is set as 0.15 following [godard2019digging].

Furthermore, we apply a simple binary mask proposed by Godard et al. [godard2019digging] to avoid the influence of static pixels caused by a static camera, an object moving with a relative translation equivalent to that of the camera, or low-texture regions, as follows (Eq. 13):

$\mu = \left[ L_{\mathrm{rep}} < L_{\mathrm{rep}}' \right]$ (13)

where $\mu$ is a binary mask, $[\cdot]$ denotes the Iverson bracket, and $L_{\mathrm{rep}}'$ is the difference between the target image $I_t$ and the unwarped reference image $I_s$. Therefore, the reconstruction loss can be written as Eq. 14:

$L_{\mathrm{rec}} = \mu L_{\mathrm{rep}}$ (14)

Finally, the scalar reconstruction loss can be computed by averaging over each pixel and batch, as follows (Eq. 15):

$\mathcal{L}_{\mathrm{rec}} = \dfrac{1}{BN} \sum_{i=1}^{B} \sum_{j=1}^{N} L_{\mathrm{rec}}(i, j)$ (15)

where $B$ and $N$ are the batch size and the number of pixels, and $i$ and $j$ represent the traversal of each sample and pixel.
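The basic losses above combine as in the hedged sketch below: an SSIM/L1 mix per reference image, the per-pixel minimum over references (Eq. 12), and binary automasking against the unwarped references (Eq. 13). The ssim function and the exact weighting convention for alpha are assumptions.

```python
import torch

def photometric_error(pred, target, ssim, alpha=0.15):
    # alpha weights the L1 term against SSIM, as in Monodepth2; whether the paper's
    # Eq. 12 uses the same convention is an assumption here.
    l1 = (pred - target).abs().mean(dim=1, keepdim=True)                 # Eq. 10
    return (1 - alpha) * ssim(pred, target) + alpha * l1                 # Eqs. 11-12 weighting

def reconstruction_loss(tgt, warped_refs, unwarped_refs, ssim):
    # Minimum reprojection over the reference images (Eq. 12)
    reproj = torch.cat([photometric_error(w, tgt, ssim) for w in warped_refs], dim=1)
    reproj, _ = reproj.min(dim=1, keepdim=True)

    # Automasking (Eq. 13): keep pixels where warping beats the unwarped reference
    identity = torch.cat([photometric_error(r, tgt, ssim) for r in unwarped_refs], dim=1)
    identity, _ = identity.min(dim=1, keepdim=True)
    mask = (reproj < identity).float()

    # Average over pixels and batch (Eqs. 14-15)
    return (mask * reproj).sum() / mask.sum().clamp(min=1.0)
```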

IV-D2 3D geometry smoothness (3DGS) loss

Generally, a smoothness loss is applied to obtain a smooth depth map. However, the smoothness loss used in previous works [zhou2017unsupervised, godard2019digging] simply constrains the distance between neighboring depths and does not take the geometric properties into account. Mathematically, the distance of the target depth from its neighbors can be directly minimized (Eq. 16) to encourage a smooth depth map, which solely promotes continuity of the depth values.

$L_{\mathrm{smooth}} = \sum_{p} \left( \left| \partial_x D_p \right| + \left| \partial_y D_p \right| \right)$ (16)

However, in this case, there are two major drawbacks as follows:

  • Non-differentiable artifacts. The naive smoothness loss function does not consider the differentiability of the depth map, resulting in unnatural and non-differentiable artifacts, especially in the edge regions of the objects;

  • Violation of the geometry structure. From close to far regions, the values in the depth map/disparity map increase/decrease monotonically, with varying granularity. Nevertheless, the naive smoothness loss applies identical weights at different positions, which breaks up the overall geometry structure of the scene.

Therefore, we propose the 3DGS loss, aimed at predicting a smooth, geometry-preserving, and natural depth map by imposing a gradual-change constraint on the surface normals of the reconstructed 3D point clouds.

Primarily, we need to estimate the pixel-wise surface normal from the predicted depth map. Thus, the depth map is first reprojected to 3D space using Eq. 17:

$P = \hat{D}_t(p) K^{-1} p$ (17)

where $p$, $K$, $\hat{D}_t$, and $P$ represent the image coordinates, camera intrinsic matrix, depth map, and point clouds, respectively.

Fig. 6: Illustration of the estimation of the surface normal.

Then, the target point and its eight neighborhood points can be used to determine eight vectors $v_i$, where $i = 1, 2, \dots, 8$. Any two neighboring vectors determine a surface. For each surface, we can obtain the normal by computing the cross-product of the two neighboring vectors. Finally, the target surface normal is estimated by averaging over all reference normals, as shown in Eq. 18:

$n_t = \dfrac{1}{8} \sum_{i=1}^{8} n_i = \dfrac{1}{8} \sum_{i=1}^{8} v_i \times v_{i+1}, \quad v_9 = v_1$ (18)

where $\times$, $n_t$, and $n_i$ represent the cross-product operation, target normal, and reference normal, respectively.
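The back-projection of Eq. 17 and the normal estimation of Eq. 18 can be sketched as follows; the neighbour ordering, border handling, and per-surface normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv, pix_homog):
    # depth: (B, 1, H, W); pix_homog: (B, 3, H*W) homogeneous pixel coordinates
    b, _, h, w = depth.shape
    pts = depth.view(b, 1, -1) * (K_inv @ pix_homog)      # Eq. 17
    return pts.view(b, 3, h, w)

def surface_normals(points):
    # points: (B, 3, H, W) 3-D point cloud
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]           # 8 neighbours, clockwise
    padded = F.pad(points, (1, 1, 1, 1), mode="replicate")
    vecs = [padded[:, :, 1 + dy: 1 + dy + points.shape[2],
                   1 + dx: 1 + dx + points.shape[3]] - points
            for dy, dx in shifts]
    normals = []
    for i in range(len(vecs)):                            # neighbouring vector pairs (Eq. 18)
        n = torch.cross(vecs[i], vecs[(i + 1) % len(vecs)], dim=1)
        normals.append(F.normalize(n, dim=1))
    return F.normalize(torch.stack(normals, dim=0).mean(dim=0), dim=1)
```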

Following the pixel-wise normal estimation, we apply the proposed 3DGS loss function, which constrains the surface normals of the scene to change slowly and smoothly. First, consider a continuous space. Given a surface $S(x, y)$, which is defined on a two-dimensional space without any sharp points, the surface should be continuous if Eq. 19 holds:

$\lim_{(x, y) \to (x_0, y_0)} S(x, y) = S(x_0, y_0), \quad \forall (x_0, y_0)$ (19)

In this case, the surface has $C^0$ smoothness. If the surface normal is available everywhere, we can infer that the surface is first-order differentiable (please note that we assume that there are no sharp points on the surface $S$, such as the non-differentiable point of a curve), which indicates that the surface has $C^1$ smoothness. Finally, the gradual changes of the surface normal, namely the smooth surface normals, require the surface to be second-order differentiable, making the surface have $C^2$ smoothness.

Therefore, we first define the distance between two surface normals as the sine distance (Eq. 20) to achieve surface normal smoothness:

$\mathcal{S}(n_1, n_2) = \dfrac{\left\| n_1 \times n_2 \right\|}{\left\| n_1 \right\| \left\| n_2 \right\|}$ (20)

where $\mathcal{S}(\cdot, \cdot)$ is the sine distance operator. Thus, the proposed 3DGS loss can be described as Eq. 21:

$L_{\mathrm{3DGS}} = \sum_{p} \sum_{j \in \{x, y\}} \mathcal{S}\left(N_p, N_{p + \mathbf{e}_j}\right) e^{-\left|\nabla_j d_p\right|} e^{-\left|\nabla_j I_p\right|}$ (21)

where $\nabla$, $N$, $d$, and $I$ represent the gradient operator, estimated surface normal matrix, predicted disparity map, and color image, respectively, and $\mathbf{e}_x$ and $\mathbf{e}_y$ denote the horizontal and vertical unit pixel offsets. The exponential terms slack the constraints on the edges for performing edge-aware prediction.

By imposing the 3DGS loss, the proposed model can predict a smooth and natural depth map, significantly improving the qualitative and quantitative performance, particularly in the edge regions.
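A hedged sketch of the 3DGS term follows: sine distances between horizontally and vertically neighbouring unit normals, relaxed near disparity and image edges by exponential weights. The exact form of the edge-aware weighting is an assumption based on the description above.

```python
import torch

def sine_distance(n1, n2):
    # For unit normals, |n1 x n2| equals the sine of the angle between them (Eq. 20)
    return torch.norm(torch.cross(n1, n2, dim=1), dim=1, keepdim=True)

def smoothness_3dgs(normals, disp, image):
    # normals: (B, 3, H, W) unit surface normals; disp: (B, 1, H, W); image: (B, 3, H, W)
    dist_x = sine_distance(normals[:, :, :, :-1], normals[:, :, :, 1:])
    dist_y = sine_distance(normals[:, :, :-1, :], normals[:, :, 1:, :])

    def grad_x(t): return (t[:, :, :, :-1] - t[:, :, :, 1:]).abs().mean(1, keepdim=True)
    def grad_y(t): return (t[:, :, :-1, :] - t[:, :, 1:, :]).abs().mean(1, keepdim=True)

    # Slack the constraint at disparity/image edges (edge-aware prediction, cf. Eq. 21)
    wx = torch.exp(-grad_x(disp)) * torch.exp(-grad_x(image))
    wy = torch.exp(-grad_y(disp)) * torch.exp(-grad_y(image))
    return (dist_x * wx).mean() + (dist_y * wy).mean()
```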

IV-D3 Maximum margin dual-scale prediction (MMDSP)

To overcome the gradient locality issue raised by the low-texture regions in the image, most previous works have used the multi-scale prediction strategy [zhou2017unsupervised, godard2019digging], as it is relatively easy to capture contextual information at a lower resolution so as to accurately predict the depth map for low-texture regions.

Accordingly, we adopt multi-scale training in our network. However, is the four-scale prediction used in previous works [godard2019digging] necessary?

Fig. 7: Illustrations of the conventional multi-scale prediction and the proposed maximum margin dual-scale prediction (MMDSP).


Fig. 8: Qualitative results on the KITTI dataset [geiger2013vision]. Some reflective, far, and smooth regions, where prediction is difficult, are marked with red circles. Our model yields significantly improved predictions (green circles) for those regions. CC, CL, and LL represent the different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For instance, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively.

Intuitively, the network extracts low-level vision features in its first several layers, whereas the deep features contain more semantic information. Consider the semantic levels of an output and of the feature preceding it. The four features feeding the four prediction heads, shown in Fig. 7 (a), would then share the same semantic level. Therefore, four-scale prediction requires three feature transformations between features of identical semantic level, with each transformation depending solely on a single decoder block; this approach increases the network’s learning difficulty.

Therefore, we propose the maximum margin dual-scale prediction (MMDSP) strategy to overcome the gradient locality issue, as shown in Fig. 7 (b). The proposed MMDSP only performs one transformation between features having the identical semantic level, with three decoder blocks.

The proposed MMDSP not only overcomes the gradient locality issue, thus improving performance, but also reduces the computational complexity. Ablation studies that experimentally demonstrate the effectiveness of the proposed approach are discussed in Section V-D.
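A minimal sketch of MMDSP at training time follows: losses are computed only at the finest and coarsest scales (0 and 3 in Fig. 7), with each disparity map upsampled to the input resolution before the loss is evaluated. The upsample-before-loss step and the averaging over scales are assumptions, and compute_loss stands in for the reconstruction and 3DGS terms.

```python
import torch.nn.functional as F

def multi_scale_loss(disp_by_scale, target, refs, compute_loss, scales=(0, 3)):
    # disp_by_scale: dict mapping scale index -> (B, 1, h_s, w_s) disparity map
    h, w = target.shape[-2:]
    total = 0.0
    for s in scales:
        disp = F.interpolate(disp_by_scale[s], size=(h, w),
                             mode="bilinear", align_corners=False)
        total = total + compute_loss(disp, target, refs)
    return total / len(scales)
```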

IV-D4 Final loss

We integrate the reconstruction loss, the proposed 3DGS loss, and the MMDSP to form the final loss (Eq. 22) for training our networks:

$\mathcal{L} = \dfrac{1}{|S|} \sum_{s \in S} \left( \mathcal{L}_{\mathrm{rec}}^{(s)} + \beta L_{\mathrm{3DGS}}^{(s)} \right)$ (22)

where $S$ denotes the set of MMDSP prediction scales and $\beta$ is set to 0.001 following Godard’s work [godard2019digging].

Methods GT? PT? MS AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
Eigen et al., coarse [eigen2014depth] - - 0.214 1.605 6.563 0.292 0.673 0.884 0.957
Eigen et al., fine [eigen2014depth] - - 0.203 1.548 6.307 0.282 0.702 0.890 0.958
Liu et al. [liu2015learning] - - 0.202 1.614 6.523 0.275 0.678 0.895 0.965
Kuznietsov et al. (B) [kuznietsov2017semi] - 0.113 0.741 4.621 0.189 0.862 0.960 0.986
DORN [fu2018deep] 51.0M 0.072 0.307 2.727 0.120 0.932 0.984 0.994
GeoNet-VGG (J) [yin2018geonet] - - 0.164 1.303 6.090 0.247 0.765 0.919 0.968
GeoNet-Resnet (J) [yin2018geonet] - 229.3M 0.155 1.296 5.857 0.233 0.793 0.931 0.973
DDVO [wang2018learning] - 0.151 1.257 5.583 0.228 0.810 0.936 0.974
SC-SfMLearner [bian2019unsupervised] 59.4M 0.149 1.137 5.771 0.230 0.799 0.932 0.973
Struct2depth [casser2019depth] - - 0.141 1.026 5.291 0.215 0.816 0.945 0.979
Jia et al. [mypaper2021] 57.6M 0.144 0.966 5.078 0.208 0.815 0.945 0.981
Monodepth2 [godard2019digging] 59.4M 0.128 1.087 5.171 0.204 0.855 0.953 0.978
Our (CC) 59.4M 0.128 0.990 5.064 0.202 0.851 0.955 0.980
Our (CL) 63.1M 0.128 0.979 5.033 0.202 0.851 0.954 0.980
SfMLearner [zhou2017unsupervised] 126.0M 0.208 1.768 6.856 0.283 0.678 0.885 0.957
Yang et al. (J) [yang2017unsupervised] 126.0M 0.182 1.481 6.501 0.267 0.725 0.906 0.963
Monodepth2 [godard2019digging] 59.4M 0.144 1.059 5.289 0.217 0.824 0.945 0.976
Our (LL) 25.8M 0.141 1.060 5.247 0.215 0.830 0.944 0.977
TABLE I: Quantitative comparisons of depth estimation on the KITTI dataset [geiger2013vision]. (B) indicates binocular/stereo input pairs, and (J) denotes joint learning of multiple tasks. GT, PT, and MS represent ground truth, pretraining, and model size, respectively. CC, CL, and LL represent the different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For instance, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively. - represents that the situation is unclear. Notably, all self-supervised methods are trained at a resolution of for a fair comparison. The best performances and our models are marked bold. The second best performances in the second and third cells are underlined.

V Experiments

In this section, we first introduce the experiment implementation details and then compare the results of the proposed approach with those of other state-of-the-art methods. Thereafter, the parameter analysis, ablation studies, and model complexity are discussed. All experiments are implemented with the PyTorch library on a single GTX Ti GPU card.

Errors
Methods GT? PT? AbsRel SqRel RMS RMSlog
Karsch et al. [karsch2014depth] 0.428 5.079 8.389 0.149
Liu et al. [liu2014discrete] 0.475 6.562 10.05 0.165
Laina et al. [laina2016deeper] 0.204 1.840 5.683 0.084
DDVO [wang2018learning] 0.387 4.720 8.090 0.204
Monodepth [godard2017unsupervised] 0.544 10.94 11.760 0.193
Monodepth2 [godard2019digging] 0.322 3.589 7.417 0.163
Jia et al. [mypaper2021] 0.301 3.143 6.972 0.351
Our (CL) 0.269 2.201 6.452 0.325
Our (CC) 0.267 2.188 6.406 0.322
SfMLearner [zhou2017unsupervised] 0.383 5.321 10.470 0.478
Our (LL) 0.289 2.423 6.701 0.348
TABLE II: Quantitative comparisons of depth estimation on the Make3D dataset [saxena2008make3d]. CC, CL, and LL represent the different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For instance, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively. The best performances and our models are marked bold. The second best performances in the second box are underlined.
AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
32 0.142 1.121 5.330 0.216 0.829 0.944 0.976
64 0.142 1.094 5.286 0.216 0.827 0.944 0.977
128 0.143 1.117 5.340 0.217 0.829 0.942 0.975
TABLE III: Experiments on different values. The best performances are marked bold.
AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
False 0.144 1.106 5.339 0.217 0.825 0.943 0.976
True 0.142 1.094 5.286 0.216 0.827 0.944 0.977
TABLE IV: Experiments on different . The best performances are marked bold.

V-A Implementation details

Datasets. KITTI [geiger2013vision], a large-scale and publicly available dataset widely used in various computer vision tasks, serves as the set of benchmarks for evaluation and comparison in this study. Following Zhou et al. [zhou2017unsupervised], we take and -frame sequences from the KITTI dataset as the training and validation data, respectively. Following Eigen et al.’s [eigen2014depth] split, we take images from the KITTI dataset for testing.

We apply our model trained on the KITTI dataset to test images of the Make3D dataset [saxena2008make3d] unseen during training for demonstrating the generalization capability of the proposed model. Following Godard et al. [godard2019digging], we evaluate Make3D’s images on a center crop at a ratio. Additionally, qualitative results on the Cityscapes dataset [cordts2016cityscapes] generated by the proposed model trained on the KITTI dataset are presented. In all experiments, the resolution of the input image is .

Data augmentation. Following Godard et al. [godard2019digging], the input images are augmented with random cropping, scaling, and horizontal flips. In addition, a set of color augmentations, random brightness, contrast, saturation, and hue jitter with respective ranges of , , , and are adopted with a percent chance. These color augmentations are solely applied to the images fed to the networks, not to those applied to compute loss.

Hyperparameters. Our networks are trained using the Adam optimizer [kingma2014adam]. The learning rate is initially set to and decreased by a factor of every epochs. The other parameters of the optimizer are set to the default values. The epoch, batch size, and length of the sequence are set to , , and , respectively.

During training, we initialize the weights of each linear layer using a Gaussian distribution and the biases with a constant. For the LayerNorm layers, the weights and biases are initialized with constant values. We do not pretrain our model on the ImageNet dataset because of computational resource constraints. The specific network configurations are presented in Section V-C.

V-B Results

Table I shows the quantitative results of the proposed model on the KITTI dataset [geiger2013vision], divided into three categories according to whether the ground truth and pretraining are used during the training phase. Of note is that the depth estimation results in this paper benefit from the fusion scaling strategy proposed in Jia’s work [mypaper2021].

To benefit from pretraining and for a fair comparison, we integrate the proposed loss and decoders with existing pre-trained CNNs. The first network architecture, CC, in which the networks are identical to those in Monodepth2 [godard2019digging] but apply the proposed loss function, outperforms the original Monodepth2, demonstrating the effectiveness of the proposed loss function. Then, we replace the CNN decoders with the proposed decoders, forming the second network architecture, CL, which further improves the performance. Finally, the LL network, consisting of the proposed DLNet (encoder) and decoders, achieves performance competitive with that of the state-of-the-art methods while reducing the number of parameters by more than 56%.

Figure 1 presents a comprehensive comparison of the performance and number of parameters. Without pretraining, the proposed model (LL) outperforms the other methods, with performance comparable even to those of pretrained models. Moreover, the proposed model significantly reduces the number of parameters, and simply integrating the proposed loss function and decoders with the pre-trained encoder used in previous works leads to better performance.

Figure 8 illustrates the promising qualitative results of the proposed model on the KITTI dataset. The regions marked with red circles are extremely challenging to predict accurately because of reflections, smooth texture-less surfaces, and large distances. Effectively capturing global features using CNNs is difficult and may lead to unexpected failure cases, especially for a model without pretraining. In contrast, the proposed loss and networks implicitly impose geometry constraints on the results and effectively extract the global information, respectively, significantly improving the performance and contributing to the prediction of a more accurate depth map.

AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
False 0.144 1.122 5.285 0.217 0.827 0.944 0.976
True 0.142 1.094 5.286 0.216 0.827 0.944 0.977
TABLE V: Experiments on different . The best performances are marked bold.
AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
4 0.144 1.130 5.317 0.217 0.827 0.943 0.976
8 0.142 1.094 5.286 0.216 0.827 0.944 0.977
16 0.143 1.114 5.282 0.216 0.828 0.944 0.977
TABLE VI: Experiments on different . The best performances are marked bold.
activation functions AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
ReLu [relu] 0.145 1.103 5.291 0.218 0.822 0.943 0.976
ELU [elu] 0.146 1.141 5.353 0.219 0.823 0.942 0.975
GELU [gelu] 0.142 1.094 5.286 0.216 0.827 0.944 0.977
TABLE VII: Experiments on different activation functions. The best performances are marked bold.
MS AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
0.5 58.0M 0.140 1.079 5.348 0.215 0.831 0.944 0.976
1 25.8M 0.141 1.060 5.247 0.215 0.830 0.944 0.977
2 15.1M 0.146 1.135 5.374 0.218 0.821 0.941 0.976
TABLE VIII: Experiments on different . MS represents model size. The best performances are marked bold.
Linformer depth MS AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
1 25.8M 0.141 1.060 5.247 0.215 0.830 0.944 0.977
2 30.7M 0.144 1.111 5.299 0.217 0.826 0.943 0.976
TABLE IX: Experiments on different Linformer [Linformer] depths. MS represents model size. The best performances are marked bold.
Number of DLBlocks MS AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
1 25.8M 0.141 1.060 5.247 0.215 0.830 0.944 0.977
2 34.7M 0.144 1.171 5.323 0.218 0.831 0.944 0.975
TABLE X: Experiments on different number of DLBlock. MS represents model size. The best performances are marked bold.
Scales AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
0 0.143 1.103 5.347 0.217 0.828 0.944 0.976
0,3 (MMDSP) 0.142 1.118 5.330 0.216 0.829 0.944 0.976
0,2,3 0.145 1.184 5.448 0.220 0.822 0.942 0.976
0,1,2,3 0.144 1.115 5.347 0.217 0.823 0.943 0.976
TABLE XI: Experiments on different multi-scale prediction strategies. The best performances are marked bold. The scale numbers are consistent with the numbers in Fig. 7, in which 0 and 3 are our maximum margin dual-scale prediction (MMDSP).
Methods AbsRel SqRel RMS RMSlog δ<1.25 δ<1.25² δ<1.25³
B 0.218 4.062 6.788 0.286 0.772 0.915 0.959
B+MRp 0.194 2.756 6.213 0.262 0.786 0.925 0.966
B+MRp+SSIM 0.148 1.238 5.496 0.221 0.818 0.941 0.974
B+MRp+SSIM+AM 0.144 1.120 5.308 0.217 0.826 0.944 0.976
B+MRp+SSIM+AM+MMDSP 0.142 1.118 5.330 0.216 0.829 0.944 0.976
B+MRp+SSIM+AM+MMDSP+3DGS 0.141 1.060 5.247 0.215 0.830 0.944 0.977
TABLE XII: Ablation studies on loss function. B, MRp, SSIM, AM, MMDSP, and 3DGS denote the basic photometric loss, minimal reprojection, SSIM loss, automasking, proposed maximum margin dual-scale prediction, and 3D geometry smoothness loss, respectively. The best performances are marked bold.


Fig. 9: Qualitative results on the Make3D dataset [saxena2008make3d]. CC, CL, and LL represent the different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For instance, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively. Note that we directly apply the model trained on the KITTI dataset [geiger2013vision] to the Make3D dataset [saxena2008make3d], without any refinements.

To evaluate the generalization capability of the proposed model, we directly apply our model trained on the KITTI dataset to the Make3D dataset without any refinements and training. Table II shows that the quantitative results output by the proposed model are competitive, regardless of pretraining status.

Figure 9 illustrates some qualitative results generated by the proposed model on the Make3D dataset, which indicate that the proposed model has powerful generalization capability and can predict a reasonable depth map with adequate details for unseen data. The scenarios and perspectives in Make3D differ substantially from those in the KITTI data used for training.


Fig. 10: Qualitative results on the Cityscapes dataset [cordts2016cityscapes]. CC, CL, and LL represent the different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For instance, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively. Note that we directly apply the model trained on the KITTI dataset [geiger2013vision] to the Cityscapes dataset [cordts2016cityscapes], without any refinements. PT indicates pretraining.

In Fig. 10, we present some qualitative results predicted by the proposed model on the Cityscapes dataset, thus demonstrating the model’s practicality. The scenes, color distributions, and perspectives in Cityscapes differ from those in the KITTI data used for training.

In summary, the proposed model achieves competitive performance on the KITTI, Make3D, and Cityscapes datasets, presenting excellent qualitative results, especially in some challenging regions.

V-C Parameter analysis

In this part, we report on exhaustive experiments on the different parameters of the proposed model, conducted to determine the optimal configurations.

Primarily, we conduct a series of experiments to search for the best set of parameters for the Linformer block embedded in our networks. As shown in Tables III, IV, V, and VI, we carefully perform the experiments for each parameter of the Linformer [Linformer] block and choose the parameters that result in better performance as the final configurations. Specifically, the final configuration adopts the best-performing value of each parameter reported in Tables III-VI.

We further experiment with various global network parameters. As shown in Table VII, we evaluate different activation functions and find that GELU [gelu] outperforms the other activation functions by a clear margin. Table VIII summarizes the performance of different hidden dimensions defined in Eq. 7. Considering both performance and model efficiency, the hyperparameter in Eq. 7 is set to 1.

Finally, the results of experiments on networks with various Linformer depths and numbers of DLBlocks are shown in Tables IX and X. No performance improvement could be achieved by deepening the network. Therefore, the final networks adopt a single DLBlock, in which the Linformer depth is set to 1, for each encoder block. All results reported in Section V-B are derived from these configurations.

V-D Ablation studies

In this section, we discuss experiments conducted to validate the proposed MMDSP strategy and loss items. First, Table XI shows that the proposed MMDSP is more effective than the other multi-scale prediction strategies. In fact, the proposed Linformer-based networks are capable of concurrently capturing the global and local features, which contribute to overcoming the gradient locality issue. Therefore, the performance of our network with a single prediction scale is still on par with that of the network equipped with the proposed MMDSP.

Table XII demonstrates the effectiveness of each item of the proposed loss function, which indicates that the proposed MMDSP strategy and especially the 3DGS loss indeed improve the performance.

V-E Model complexity

Table XIII summarizes the model’s time and space complexities. Clearly, the proposed model outperforms the other state-of-the-art methods by large margins in both time and space complexities.

For reference, our depth and pose networks can achieve a speed of more than and frames per second at a resolution of on a single GTX Ti card (averaging over iterations), respectively, which can satisfy the requirements of most applications.

Methods TC (GFLOPs) SC (M)
Jia et al. [mypaper2021] 3.485 57.6
Monodepth2 [godard2019digging] 3.480 59.4
Our 1.311 25.8
TABLE XIII: Model complexity. The time complexity evaluations are performed at an image resolution of on a single GTX Ti card. TC and SC represent time and space complexities, respectively. The best performances are marked bold.

VI Limitations

In this section, some failure cases of the proposed model are discussed. As shown in Fig. 11, we find that it is challenging to 1) capture extremely far objects, which requires more accurate pose estimation; 2) accurately capture moving objects, because such objects violate the static scene assumption; and 3) predict slender objects, because such objects are easily mistaken for part of the background. We will work on these issues in future studies.


Fig. 11: Failure cases of the proposed model. Left to right (columns): far objects, moving objects, and slender objects. CC, CL, and LL represent the different network architectures, in which C and L represent convolutional neural networks (CNNs) and the proposed encoder/decoder. For instance, CL indicates that the encoder and decoder adopt CNNs and the proposed decoder, respectively. PT indicates pretraining.

VII Conclusion

This paper focuses on how to simultaneously extract global and local information from images and predict a geometrically smooth depth map. For extraction, an innovative network, DLNet, is proposed along with specialized depth and pose decoders. For prediction, a 3DGS loss is presented. Moreover, we explore the multi-scale prediction strategy used for overcoming the gradient locality issue and propose a maximum margin dual-scale prediction (MMDSP) strategy for efficient and effective predictions. Detailed experiments demonstrate that the proposed model achieves performance competitive with that of state-of-the-art methods and reduces time and space complexities by more than 62% and 56%, respectively. We hope that our work contributes to the relevant academic and industrial communities.

References