I Introduction
Recovering a scene’s depth information plays a significant role in three-dimensional (3D) reconstruction, robot navigation, and scene understanding. Depth can be obtained with two types of sensors: active sensors (e.g., light detection and ranging (LiDAR) sensors) and passive sensors (e.g., cameras). A LiDAR sensor obtains 3D point cloud data directly by scanning a scene; this method is accurate but expensive for routine use. Alternatively, image data from a camera can be used to recover the 3D information.
Specifically, a stereo vision system can exploit epipolar geometry constraints to recover depth in a straightforward manner, but this approach requires a binocular camera. In most cases, however, a monocular camera is preferred because of energy consumption and cost constraints. Therefore, monocular depth estimation, as a convenient and economical method of recovering depth information, has attracted the attention of many scholars from diverse research fields. Unfortunately, extracting 3D information from a monocular vision system is challenging because the problem is inherently ill-conditioned.
Current monocular depth estimation methods can be divided into supervised and self-supervised methods depending on whether the ground truth is used during training. In supervised depth estimation, the ground truth depth map is used to train a deep neural network (DNN) that directly fits the relationship between the RGB image and the depth map; priors are imposed through the design of the loss functions, and network variants are devised to better extract features. In self-supervised depth estimation, first proposed by Zhou et al. [zhou2017unsupervised], warping-based view synthesis is used as supervision to train the depth and pose networks. This approach does not require any labeled data and can simultaneously recover the depth and movement information. Although the depth map is relative, the absolute depth can be easily obtained with the aid of other information, such as real velocity from the global positioning system and the flat road assumption [absolutedepthXue2020].
In the past few years, both supervised and self-supervised depth estimation approaches have taken advantage of the powerful feature extraction ability of convolutional neural networks (CNNs). However, capturing global contextual information using pure CNNs is difficult because of the limited kernel size. To overcome this drawback, numerous studies have applied conditional random fields (CRFs) and Markov random fields (MRFs) [cao2017estimating, eigen2015predicting, li2015depth, liu2015deep, mousavian2016joint, xu2017multi, xu2018structured, karschdepth, saxena20083]. Nevertheless, CRFs and MRFs are difficult to optimize and to incorporate into an end-to-end model.
Predicting a geometrically smooth depth map facilitates both quantitative and qualitative evaluations. However, the smoothness losses used in previous works apply constraints solely to the two-dimensional (2D) depth map, and the 3D geometric properties of the scene are not considered. In addition, the multi-scale prediction strategy is often applied to overcome the gradient locality issue. However, previous works have only used four-scale prediction frameworks, which raises the learning difficulty of the networks and thus negatively affects the performance.
To tackle the aforementioned issues, we build on Linformer [Linformer] and propose a depth Linformer network (DLNet), a full-Linformer-based model that concurrently captures global and local features, thereby improving the performance of self-supervised depth estimation. Although many attempts to apply the Transformer model [Transformer] to computer vision tasks have been reported, to the best of our knowledge, few have applied pure Transformer or Linformer networks to pixel-wise tasks. Instead, researchers have used CNNs either to extract features (encoding) or to predict results (decoding). To the best of our knowledge, the present study is the first to perform pixel-wise depth estimation with a full-Linformer-based model. Moreover, to further improve the quantitative and qualitative results, we explore geometric properties and multi-scale prediction.
Our contributions can be summarized as follows:

To effectively extract global and local features, we propose a soft-split multi-layer perceptron (SSMLP) block and a depth Linformer block (DLBlock) to build the DLNet, the depth decoder, and the pose decoder.

We propose a 3D geometry smoothness (3DGS) loss to obtain a natural and geometry-preserving depth map by applying second-order smoothness constraints on the 3D point clouds rather than on the 2D depth map.

We present a maximum margin dual-scale prediction (MMDSP) strategy to overcome the gradient locality issue while concurrently saving computational resources and boosting performance.

Compared with state-of-the-art methods, the proposed model achieves competitive performance on the KITTI [geiger2013vision] and Make3D [saxena2008make3d] benchmarks but with a lightweight configuration and without pre-training. Furthermore, the promising qualitative results on the Cityscapes dataset [cordts2016cityscapes] and real-world scenarios demonstrate the proposed model’s strong generalization capability and practicality.
The remainder of this paper is organized as follows. Section II introduces related works. Section III mathematically defines the problem and presents notational conventions. Section IV presents the model design and loss functions. Section V reports on detailed experiments, and Section VI discusses the limitations of the proposed model. Section VII draws the conclusions.
II Related Work
In this section, we review the literature related to supervised depth estimation, self-supervised depth estimation, and Transformer networks for computer vision.
II-A Supervised depth estimation
Prior to advances in deep learning algorithms, monocular depth estimation relied largely on devising efficient handcrafted features to capture the 3D information [karschdepth, saxena20083]. For example, Saxena et al. [saxena20083] extracted absolute and relative depth features from the textures and statistical histograms of images, respectively, and integrated the extracted features and MRFs to predict the final depth map. Research on depth estimation has since proliferated, mainly focusing on exploring monocular cues in images [baig2016coupled, choi2015depth, furukawa2017depth, zoran2015learning].
However, obtaining abstract and deep features through such manual design is challenging. Fortunately, CNNs can aid in extracting abstract and complicated features from images. To the best of our knowledge, Eigen et al. [eigen2015predicting] were the first to apply CNNs to monocular depth estimation, and numerous variants focusing on network structure design have been proposed since [chen2016single, eigen2014depth, eigen2015predicting, laina2016deeper, li2017two]. In addition, to overcome the spatial locality of the convolution operator, CRFs and recurrent neural networks (RNNs) have been introduced to capture the global information of an image [cao2017estimating, eigen2015predicting, li2015depth, liu2015deep, mousavian2016joint, xu2017multi, xu2018structured, almalioglu2019ganvo, cs2018depthnet, grigorev2017depth, mancini2017toward, tananaev2018temporally, wang2019recurrent, mypaper2020].
Typically, depth estimation is regarded as a pixel-wise regression problem, but it can also be cast as a classification problem by discretizing the continuous depth into many intervals so as to predict a specific label for each pixel [cao2017estimating, fu2018deep].
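The classification view of depth estimation mentioned above can be illustrated with a minimal sketch. It assumes uniform intervals between hypothetical bounds `d_min` and `d_max`; the cited works may use other spacings, so the function names and the binning rule here are illustrative only.

```python
def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a continuous depth to a class index using uniform intervals (assumption)."""
    if not d_min < d_max:
        raise ValueError("d_min must be smaller than d_max")
    # Clamp so that out-of-range depths fall into the first or last interval.
    clamped = min(max(depth, d_min), d_max)
    width = (d_max - d_min) / num_bins
    index = int((clamped - d_min) / width)
    # d_max itself belongs to the last interval.
    return min(index, num_bins - 1)


def bin_to_depth(index, d_min, d_max, num_bins):
    """Decode a class index back to a depth by taking the interval centre."""
    width = (d_max - d_min) / num_bins
    return d_min + (index + 0.5) * width
```

A per-pixel classifier would then predict `index`, and `bin_to_depth` would turn the predicted label back into a metric value; the quantization error is bounded by half the interval width.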
II-B Self-supervised depth estimation
Differing from supervised depth estimation, self-supervised depth estimation uses warping-based view synthesis to reconstruct the target image and then trains the model by computing the difference between the reconstructed and target images [zhou2017unsupervised]. Self-supervised depth estimation relies on monocular data and does not require any labeled data, an advantage that has attracted many researchers [chen2019towards, garg2016unsupervised, ranjan2019competitive, yin2018geonet, zhan2018unsupervised, zhou2019unsupervised, zhou2017unsupervised, godard2017unsupervised, kuznietsov2017semi, almalioglu2019ganvo, feng2019sganvo].
However, recovering a scene’s structure from motion (SfM) is inherently problematic in some special cases, such as moving objects and occlusions. To bridge these application gaps, a series of outstanding studies have been conducted. Representative studies include the following: Godard et al. [godard2019digging] proposed a minimum reprojection loss function to effectively mitigate the occlusion/disocclusion problem; Casser et al. [casser2019depth] used an advanced semantic segmentation model to mask out potential moving objects, thus excluding their influence; Zhou et al. [zhou2017unsupervised] proposed multi-scale training for solving the gradient locality issue caused by low textures; Bian et al. [bian2019unsupervised] presented a geometry consistency loss function for achieving scale-consistent depth and ego-motion estimation within a continuous sequence; and Jia et al. [mypaper2021] modeled the prediction uncertainty and relationships between depths to realize a reliable and practical depth estimation system.
Moreover, Park et al. [park2019high] and Yang et al. [yang2019fast] have integrated data from multiple sensors, such as LiDAR sensors, visual odometers, and cameras, for improving the inference efficiency and accuracy.
II-C Transformer for computer vision
Transformer [Transformer], an attention-based model initially proposed for natural language processing (NLP), is efficient at capturing long-range dependencies between items. Numerous recent studies have applied Transformer to computer vision tasks by reshaping square images to sequence-like data, using Transformer either for CNN feature processing [T4C1, T4C2, T4C3, T4C4] or for feature extraction [C4T1, C4T2, C4T3, C4T4]. Specifically, in the former application, Transformer is used as a decoder for predictions, whereas in the latter, Transformer substitutes CNNs and is used as an encoder for feature extraction.
However, applying the classic Transformer to pixel-wise computer vision tasks, such as semantic segmentation and depth estimation, is difficult because it requires large storage and computational resources for processing long sequence data. Hence, for efficiency in pixel-wise tasks, CNNs are generally used at the beginning or end of the network [C4T3, T4C4].
In addition, when rigidly splitting an image into many patches and using them as the input of the network, it is difficult to capture delicate features, such as edges [C4T4]. Consequently, Yuan et al. [C4T4] proposed a tokens-to-token strategy to aggregate neighboring features.
In summary, the literature reveals that effectively and efficiently extracting global and local information from images remains challenging, especially when using the emerging Transformer model. Moreover, no studies have examined the second-order geometric smoothness of the predicted point clouds. These research gaps have inspired the present work.
III Problem Setup
A self-supervised monocular depth estimation system comprises two parts, namely the depth estimation network and the pose estimation network, denoted as $f_D$ and $f_P$, respectively. Given a continuous image sequence $S$ consisting of a target image $I_t$ and reference images $I_s$, the depth estimation network takes only the target image as the input to predict its depth map; this can be mathematically defined as $D_t = f_D(I_t)$, where $D_t$ is the predicted depth map of the input image $I_t$. Differing from the depth estimation network, the pose estimation network takes the whole sequence as the input and predicts the ego-motion for each image pair $(I_t, I_s)$; this can be mathematically presented as $T_{t \to s} = f_P(S)$, where $T_{t \to s}$ is a pose matrix describing the movement between the target and reference images.
Theoretically, given the depth map of the target image and the ego-motion between the target and reference images, the target image can be reconstructed from the reference images by warping-based view synthesis [zhou2017unsupervised], which can be mathematically defined as Eq. 1; here, $p_s$, $K$, $T_{t \to s}$, $D_t(p_t)$, $K^{-1}$, and $p_t$ represent the coordinate in the reference image, the camera intrinsic matrix ($3 \times 3$), the transform matrix between the target image and reference images, the depth corresponding to $p_t$, the inverse matrix of $K$, and the coordinate in the target image, respectively.
$$p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} p_t \qquad (1)$$
During the training phase, the reconstruction loss, computed from the difference between the target and reconstructed images, is used to train the system. Notably, the depth and pose estimation networks are trained cooperatively, but they can work separately during the testing phase.
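The warping relation of Eq. 1 can be traced for a single pixel with the following minimal sketch. It assumes an ideal pinhole camera with a single focal length; the helper names `pinhole` and `project` are hypothetical and are not taken from the paper's implementation.

```python
def pinhole(f, cx, cy):
    """Return (K, K_inv) for an ideal pinhole camera (assumed model)."""
    K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    return K, K_inv


def project(p_t, depth, K, K_inv, T):
    """Map a target pixel (u, v) with known depth to reference-image coordinates.

    K, K_inv: 3x3 intrinsics and its inverse; T: 4x4 target-to-reference pose.
    """
    u, v = p_t
    homogeneous = (u, v, 1.0)
    # Back-project the pixel: X = D(p_t) * K^{-1} * [u, v, 1]^T
    ray = [sum(K_inv[r][c] * homogeneous[c] for c in range(3)) for r in range(3)]
    X = [depth * x for x in ray] + [1.0]
    # Move the 3D point into the reference camera frame (first three rows of T).
    X_ref = [sum(T[r][c] * X[c] for c in range(4)) for r in range(3)]
    # Project with K and divide by the third coordinate.
    q = [sum(K[r][c] * X_ref[c] for c in range(3)) for r in range(3)]
    return q[0] / q[2], q[1] / q[2]
```

With an identity pose the pixel maps to itself; with a pure sideways translation the horizontal shift equals focal length times translation divided by depth, which is why the reconstruction error carries a training signal for both depth and pose.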
IV Method
In this section, we first illustrate the entire monocular depth estimation system. Subsequently, the proposed DLNet is introduced. Thereafter, the DLNet-based depth and pose estimation networks are presented. Finally, the loss functions used in this paper are presented.
IV-A Model overview
The self-supervised monocular depth estimation system concurrently performs depth and pose estimation during training, as shown in Fig. 2. Following Zhou et al. [zhou2017unsupervised], the length of the image sequence used for training is set to 3, and the middle frame of the sequence is regarded as the target image, whose depth is estimated by the depth network. In contrast, the whole sequence of three frames is used for pose estimation.
After obtaining the depth map and pose vectors, warping-based view synthesis can be performed to reconstruct the target image from the reference images. Then, the reconstruction loss, namely the difference between the target and reconstructed images, is computed to train the system. In the following subsections, the proposed DLNet, depth network, pose network, and loss function are introduced.
IV-B Depth Linformer Network (DLNet)
IV-B1 Linformer
Transformer uses the scaled dot-product attention (SDPA) mechanism to perform feature aggregation, which can be intuitively described as mapping a query and a set of key–value pairs to an output [Transformer]. In particular, the query, keys, values, and output are all vectors with dimensions $d_k$, $d_k$, $d_v$, and $d_v$, respectively. In practice, however, a set of queries is packed together so that the attention is computed simultaneously.
For clarity, we denote the queries of a sequence of length $n$ as the matrix $Q \in \mathbb{R}^{n \times d_k}$. Accordingly, the keys and values are denoted as the matrices $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$. Thus, the SDPA can be defined as Eq. 2:
$$\mathrm{SDPA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \qquad (2)$$
The attention matrix $A = \mathrm{softmax}(Q K^{\top} / \sqrt{d_k})$ is obtained by multiplying two matrices, which requires $O(n^2)$ time and space complexities with respect to the sequence length $n$. In many cases, a long sequence requires a prohibitively large amount of storage and computational resources when using the Transformer model, especially for pixel-wise computer vision tasks.
To overcome this problem, Wang et al. [Linformer] proposed a linear complexity ($O(n)$) Transformer based on the low-rank property of the attention matrix $A$, called Linformer, significantly reducing the time and space complexities. Specifically, two learnable matrices $E, F \in \mathbb{R}^{k \times n}$ are used to project the original $(n \times d)$-dimensional matrices $K$ and $V$ into $(k \times d)$-dimensional ($k \ll n$) space; accordingly, SDPA can be rewritten as scaled dot-product linear attention (SDPLA) (Eq. 3). For simplicity, we do not differentiate between $d_k$ and $d_v$ in the following text.
$$\mathrm{SDPLA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q (E K)^{\top}}{\sqrt{d_k}}\right) F V \qquad (3)$$
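The difference between full attention and its linear variant can be made concrete with a minimal pure-Python sketch. The projection matrices `E` and `F` are learnable in Linformer; here they are passed in as fixed lists purely for illustration, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out


def sdpa(Q, K, V):
    """Scaled dot-product attention: the score matrix has n x n entries."""
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scores), V)


def sdpla(Q, K, V, E, F):
    """Linear attention: keys and values are first projected from n to k rows,
    so the score matrix has only n x k entries."""
    return sdpa(Q, matmul(E, K), matmul(F, V))
```

Because the projected keys and values have `k` rows regardless of the sequence length, the cost of the score matrix grows linearly with `n` once `k` is fixed. Passing identity matrices for `E` and `F` recovers ordinary attention.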
IV-B2 Depth Linformer block (DLBlock)
Most studies have rigidly divided the image into many patches, flattening the patches to vectors and using them as the input of the Transformer model. However, in this approach, obtaining fine features of the image, such as edges, is challenging because of the lack of communication between the patches. Furthermore, the original Transformer [Transformer] and Linformer [Linformer] models cannot dynamically change the feature map resolution, resulting in high computational and storage costs, especially for image processing.
To overcome this, we introduce the soft-split multi-layer perceptron (SSMLP) block to promote communication between the patches, thereby simultaneously adjusting the feature map size and reducing the computational and storage costs. Subsequently, the features obtained from the SSMLP are delivered to the Linformer block for extracting the global features. Figure 4 illustrates the detailed structure of the proposed DLBlock.
Let us denote the input feature as $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and dimension number of the input feature, respectively. For aggregating the local feature, a moving window of size $m \times m$, stride $s$, and padding $p$ is used to reshape the input feature to the sequence data $X_s \in \mathbb{R}^{H_o W_o \times m^2 C}$ in a soft-split manner, wherein $H_o$ and $W_o$ are derived from Eq. 5:
$$H_o = \left\lfloor \frac{H + 2p - m}{s} \right\rfloor + 1, \qquad W_o = \left\lfloor \frac{W + 2p - m}{s} \right\rfloor + 1 \qquad (5)$$
where $\lfloor \cdot \rfloor$ denotes rounding down.
When the stride and moving window size satisfy $s < m$, adequate overlapping exists (soft split) for capturing the fine details of the image. However, in this case, the dimension of the feature is multiplied by $m^2$, significantly raising the computational and storage costs. Accordingly, an MLP is used for dimension reduction (Eq. 6):
(6) 
where represents the transformed low-dimensional features; and are learnable parameters; and , , and represent the target dimension of dimension reduction, layer normalization, and activation function, respectively. Broadcasting is automatically performed for the foregoing additive operation. Furthermore, is given by Eq. 7:
(7) 
where is a hyperparameter and is the output dimension.
The aforementioned transformations and computations, called SSMLP, effectively extract local features with the aid of the soft split and the MLP layer.
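The effect of the soft split on the sequence length and the token dimension can be checked with a short sketch. It assumes the standard sliding-window (unfold) output-size rule, which is how we read Eq. 5; the function name and argument names are hypothetical.

```python
def soft_split_shape(height, width, channels, window, stride, padding):
    """Return (number of tokens, token dimension) after a sliding-window split.

    Assumes the usual unfold rule: floor((size + 2 * padding - window) / stride) + 1.
    """
    h_out = (height + 2 * padding - window) // stride + 1
    w_out = (width + 2 * padding - window) // stride + 1
    # Each token concatenates all window positions, so the dimension grows
    # by a factor of window * window before the MLP reduces it again.
    return h_out * w_out, window * window * channels
```

For an 8 x 8 x 3 feature map, a 3 x 3 window with stride 2 and padding 1 yields 16 overlapping tokens of dimension 27, whereas a rigid 2 x 2 split with stride 2 yields the same 16 tokens with dimension 12 and no overlap. The larger token dimension is what the subsequent MLP reduction addresses.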
Subsequently, the feature goes through the Linformer block to capture the global information, thereby obtaining the feature . Then, another MLP layer performs the dimension increase to change the feature to ; this is followed by the reshaping of to , which can be formulated as Eq. 8:
(8) 
Finally, we bring in the initial feature through a residual connection, which is stated in Eq. 9:
(9) 
where is the output feature.
In summary, our DLBlock mainly comprises the following three components:

SSMLP. SSMLP is critical for extracting the local feature and improving the efficiency;

Linformer block. The Linformer block is crucial for capturing global features, automatically shifting the focus to the more important features through the inner self-attention mechanism;

Residual connection. The residual connection, a proven technique, mitigates gradient explosion and network degradation while limiting, to the extent possible, the information loss caused by changes in the feature map size and dimensions.
In what follows, DLNet, a DLBlock-based network, is introduced.
IV-B3 Depth Linformer Network (DLNet)
Inspired by the success of CNNs, a pyramid-like structure that gradually reduces the feature map size is adopted when devising the DLNet. As shown in Fig. 5 (a), an SSMLP layer is first used to embed the input image and simultaneously decrease the feature map’s resolution. After a MaxPooling layer, four-stage feature transformations are performed via the encoder blocks (Fig. 5 (b)), which include the proposed DLBlock.
In the next subsection, the depth and pose networks are devised based on the proposed DLNet.
IV-C Depth and pose estimation networks
The proposed DLNet is considered the encoder in both the depth and pose estimation networks. The integral depth and pose estimation networks are presented in this subsection by further devising the decoders using the proposed components.
Figure 5 (c) illustrates the structure of the proposed depth decoder, which consists of a few decoder blocks and output heads. For each decoder block illustrated in Fig. 5 (d), a DLBlock is primarily used to perform the feature transformation, following which an upsampling layer is used to increase the resolution. Subsequently, the upsampled feature is stacked with the feature from the skip connection with respect to the channel, when the skip connection is available. Finally, a lightweight SSMLP layer is used for feature compression. Because of the availability of the skip connections, the residual connection in DLBlock is discarded to save computational and storage resources. For each output head, an SSMLP is used to predict the disparity map.
Figure 5 (e) illustrates the structure of the proposed pose decoder, which simply consists of a DLBlock without the residual connection and an SSMLP layer.
Then, the depth and pose estimation networks can be obtained by simply integrating the proposed DLNet and the corresponding decoder.
IV-D Losses
In this subsection, a set of loss functions used for training the networks is presented. Specifically, basic losses that have been successfully applied in previous works are introduced. Subsequently, a novel 3D geometry smoothness (3DGS) loss function and the maximum margin dual-scale prediction (MMDSP) are presented. Thereafter, the final loss is shown.
IV-D1 Basic losses
A strong assumption, Lambertian reflection [basri2003lambertian], is imposed on all surfaces in the image, which makes it possible to apply a photometric constancy loss between the target image $I_t$ and the reconstructed image. For robustness, we choose the L1 norm to compute the photometric loss, which can be stated as Eq. 10:
$$L_p = \left\| I_t - I_{s \to t} \right\|_1, \qquad I_{s \to t} = I_s \left\langle \mathrm{proj}(D_t, T_{t \to s}, K) \right\rangle \qquad (10)$$
where $I_{s \to t}$ represents the reconstructed image from the reference image $I_s$; $\mathrm{proj}(\cdot)$ represents the reprojection function described in Eq. 1, evaluated with the predicted depth $D_t$, the pose $T_{t \to s}$, and the camera intrinsic matrix $K$; and $\langle \cdot \rangle$ is the bilinear sampling operator, which is locally sub-differentiable. For simplicity, these notations are directly used in the rest of this paper.
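However, the photometric loss is sensitive to illumination changes, particularly in complicated real-world scenarios. Consequently, following Godard et al. [godard2019digging], the structural similarity (SSIM) loss between the target image $I_t$ and the reconstructed image $I_{s \to t}$ (Eq. 11) is used to mitigate this issue:
$$L_{\mathrm{SSIM}} = \frac{1 - \mathrm{SSIM}(I_t, I_{s \to t})}{2} \qquad (11)$$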
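Why an SSIM-based term is less sensitive to illumination changes than a plain L1 difference can be seen in a small sketch. It computes SSIM over one window of grey values in [0, 1] with the commonly used constants 0.01² and 0.03²; the window handling and the (1 − SSIM)/2 form are assumptions for illustration, and practical implementations evaluate SSIM over local windows of the whole image.

```python
def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two equally sized lists of intensities."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(x, y):
    """Map SSIM in [-1, 1] to a loss in [0, 1] (assumed form)."""
    return (1.0 - ssim(x, y)) / 2.0


def l1_loss(x, y):
    """Mean absolute difference, for comparison."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

For a patch and a uniformly brightened copy of it, `l1_loss` equals the brightness offset, while `ssim_loss` stays close to zero because the local structure is unchanged.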
To address the problem of visual inconsistencies in the target and reference images, such as occlusion and disocclusion, we follow Godard et al. [godard2019digging] in adopting the minimum reprojection loss (Eq. 12):
(12)  
where is set as 0.15 following [godard2019digging].
Furthermore, we apply a simple binary mask proposed by Godard et al. [godard2019digging] to avoid the influence of the static pixels caused by a static camera, an object moving at equivalent relative translation to the camera, and low-texture regions, as follows (Eq. 13):
(13) 
where is a binary mask and is the difference between the target image and the unwarped reference image . Therefore, the reconstruction loss can be written as Eq. 14:
(14) 
Finally, the scalar reconstruction loss can be computed by averaging over each pixel and batch, as follows (Eq. 15):
(15) 
where and are the batch size and the number of the pixels, and and represent the traversing of each sample and pixel.
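The way the minimum reprojection, the binary mask, and the final averaging fit together can be sketched for one image as follows. The per-pixel errors are assumed to be precomputed (for example, a weighted sum of the photometric and SSIM terms), pixels are flattened into lists, and the function name is hypothetical; averaging over a batch would add one more mean.

```python
def reconstruction_loss(warped_errors, unwarped_errors):
    """Masked minimum-reprojection loss for a single image.

    warped_errors[s][i]:   error at pixel i between the target image and the
                           image reconstructed from reference s.
    unwarped_errors[s][i]: error at pixel i between the target image and the
                           raw (unwarped) reference s.
    """
    num_pixels = len(warped_errors[0])
    total = 0.0
    for i in range(num_pixels):
        # Minimum over reference images handles occlusion and disocclusion.
        min_warped = min(err[i] for err in warped_errors)
        min_unwarped = min(err[i] for err in unwarped_errors)
        # Keep a pixel only if warping explains it better than doing nothing;
        # otherwise it is treated as static and ignored.
        mask = 1.0 if min_warped < min_unwarped else 0.0
        total += mask * min_warped
    return total / num_pixels
```

A pixel whose unwarped error is already smaller than every warped error is typically static relative to the camera or lies in a low-texture region, so it contributes nothing to the loss.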
IV-D2 3D geometry smoothness (3DGS) loss
Generally, a smoothness loss is applied to obtain a smooth depth map. However, the smoothness loss used in previous works [zhou2017unsupervised, godard2019digging] simply constrains the distance between neighboring depths and does not take the geometric properties into account. Mathematically, the distance of the target depth from its neighbors can be directly minimized (Eq. 16) to encourage a smooth depth map, which promotes continuity of the depth values only.
(16) 
However, in this case, there are two major drawbacks as follows:

Non-differentiable artifacts. The naive smoothness loss function does not consider the differentiability of the depth map, resulting in unnatural and non-differentiable artifacts, especially in the edge regions of the objects;

Violation of the geometry structure. The values from the close to the far regions in the depth map/disparity map increase/decrease monotonically with various granularities. Nevertheless, the naive smoothness loss applies identical weights over different positions, which breaks up the overall geometry structure of the scene.
Therefore, we propose the 3DGS loss, aimed at predicting a smooth, geometry-preserving, and natural depth map by imposing a gradual-change constraint on the surface normals of the reconstructed 3D point clouds.
First, we need to estimate the pixel-wise surface normal from the predicted depth map. Thus, the depth map is first reprojected to 3D space using Eq. 17:
$$P = D(p) \, K^{-1} p \qquad (17)$$
where $p$, $K$, $D$, and $P$ represent the image coordinates, camera intrinsic matrix, depth map, and point clouds, respectively.
Then, the target point and its eight neighborhood points can be used to determine eight vectors $v_i$, where $i \in \{1, \dots, 8\}$. Any two neighboring vectors determine a surface. For each surface, we can obtain the normal by computing the cross-product of the two neighboring vectors. Finally, the target surface normal is estimated by averaging over all reference normals, as shown in Eq. 18:
(18)  
where , , and represent the crossproduct operation, target normal, and reference normal, respectively.
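The normal estimation described above can be sketched in pure Python as follows. The sketch assumes a pinhole camera with one focal length, normalizes each reference normal before averaging, and normalizes the result; these normalization choices and all function names are assumptions rather than details confirmed by the text.

```python
import math

# Offsets (dy, dx) of the eight neighbours, listed in ring order around the centre.
RING = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def backproject(depth, f, cx, cy):
    """Lift pixel (u, v) with depth d to the 3D point d * K^{-1} [u, v, 1]^T."""
    return [[(d * (u - cx) / f, d * (v - cy) / f, d) for u, d in enumerate(row)]
            for v, row in enumerate(depth)]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    norm = math.sqrt(sum(c * c for c in a))
    return a if norm == 0.0 else tuple(c / norm for c in a)


def surface_normal(points, v, u):
    """Average the normals of the eight surfaces spanned by neighbouring vectors."""
    centre = points[v][u]
    vectors = [tuple(q - c for q, c in zip(points[v + dy][u + dx], centre))
               for dy, dx in RING]
    total = (0.0, 0.0, 0.0)
    for i in range(8):
        n = normalize(cross(vectors[i], vectors[(i + 1) % 8]))
        total = tuple(t + c for t, c in zip(total, n))
    return normalize(total)
```

Keeping a fixed ring order means all eight cross-products point to the same side of the surface, so they reinforce rather than cancel each other when averaged. The function is valid for interior pixels only; border handling is left out of the sketch.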
Following the pixel-wise normal estimation, we apply the proposed 3DGS loss function, which constrains the surface normals of the scene to change gradually. First, consider a continuous space. Given a surface $S(x, y)$ defined on a two-dimensional space without any sharp points, the surface is continuous if Eq. 19 holds:
$$\lim_{(\Delta x, \Delta y) \to (0, 0)} S(x + \Delta x, y + \Delta y) = S(x, y) \qquad (19)$$
In this case, the surface has $C^0$ smoothness. If the surface normal is available everywhere, we can infer that the surface is first-order differentiable (recall that we assume that the surface $S$ has no sharp points), which indicates that the surface has $C^1$ smoothness. Finally, gradual changes of the surface normal, namely smooth surface normals, require the surface to be second-order differentiable, which gives the surface $C^2$ smoothness.
Therefore, we first define the distance between the two surface normals as the sine distance (Eq. 20) to achieve the surface normal smoothness:
(20) 
where is the sine distance operator. Thus, the proposed 3DGS loss can be described as Eq. 21:
(21)  
where $\nabla$, $N$, $d$, and $I$ represent the gradient operator, estimated surface normal matrix, predicted disparity map, and color image, respectively. The exponential terms relax the constraints at edges to enable edge-aware prediction.
By imposing the 3DGS loss, the proposed model can predict a smooth and natural depth map, significantly improving the qualitative and quantitative performance, particularly in the edge regions.
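A sine distance between two surface normals can be sketched as the sine of the angle between them, obtained from the norm of their cross-product. This is one plausible reading of the sine distance used in the 3DGS loss; the exact form of Eq. 20 may differ, and the function name is hypothetical.

```python
import math


def sine_distance(n1, n2):
    """Return sin(angle) between two 3D vectors via |n1 x n2| / (|n1| |n2|)."""
    cx = n1[1] * n2[2] - n1[2] * n2[1]
    cy = n1[2] * n2[0] - n1[0] * n2[2]
    cz = n1[0] * n2[1] - n1[1] * n2[0]
    norm1 = math.sqrt(sum(c * c for c in n1))
    norm2 = math.sqrt(sum(c * c for c in n2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("normals must be non-zero vectors")
    return math.sqrt(cx * cx + cy * cy + cz * cz) / (norm1 * norm2)
```

The distance is zero for parallel normals and grows as neighboring normals diverge, so penalizing it between adjacent pixels encourages surface normals to change gradually.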
IV-D3 Maximum margin dual-scale prediction (MMDSP)
To overcome the gradient locality issue raised by the low-texture regions in the image, most previous works have used the multi-scale prediction strategy [zhou2017unsupervised, godard2019digging], as it is relatively easy to capture contextual information at a lower resolution so as to accurately predict the depth map for low-texture regions.
Accordingly, we adopt multi-scale training in our network. However, is the four-scale prediction used in previous works [godard2019digging] necessary?
Intuitively, the network would extract the lowlevel vision features in the first several layers, whereas the deep features would contain more semantic information. For this case, consider the semantic level of the output and its previous feature to be and , respectively. Then, the features , , , and , shown in Fig. 7 (a), would have the same semantic level of . Therefore, fourscale prediction would require three times the feature transformations between features having the identical semantic level, with each transformation solely depending on a single decoder block; this approach increases the network’s learning difficulty.
Therefore, we propose the maximum margin dual-scale prediction (MMDSP) strategy to overcome the gradient locality issue, as shown in Fig. 7 (b). The proposed MMDSP performs only one transformation between features having the identical semantic level, with three decoder blocks.
The proposed MMDSP not only overcomes the gradient locality issue, thus improving performance, but also reduces the computational complexity. Ablation studies that experimentally demonstrate the effectiveness of the proposed approach are discussed in Section V-D.
IV-D4 Final loss
We integrate the reconstruction loss, the proposed 3DGS loss, and the MMDSP to form the final loss (Eq. 22) for training our networks:
(22) 
where is set to 0.001 following Godard’s work [godard2019digging].
Methods  GT?  PT?  MS?  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³  
Eigen et al., coarse [eigen2014depth]      0.214  1.605  6.563  0.292  0.673  0.884  0.957  
Eigen et al., fine [eigen2014depth]      0.203  1.548  6.307  0.282  0.702  0.890  0.958  
Liu et al. [liu2015learning]      0.202  1.614  6.523  0.275  0.678  0.895  0.965  
Kuznietsov et al. (B) [kuznietsov2017semi]    0.113  0.741  4.621  0.189  0.862  0.960  0.986  
DORN [fu2018deep]  51.0M  0.072  0.307  2.727  0.120  0.932  0.984  0.994  
GeoNet-VGG (J) [yin2018geonet]      0.164  1.303  6.090  0.247  0.765  0.919  0.968  
GeoNet-ResNet (J) [yin2018geonet]    229.3M  0.155  1.296  5.857  0.233  0.793  0.931  0.973  
DDVO [wang2018learning]    0.151  1.257  5.583  0.228  0.810  0.936  0.974  
SC-SfMLearner [bian2019unsupervised]  59.4M  0.149  1.137  5.771  0.230  0.799  0.932  0.973  
Struct2depth [casser2019depth]      0.141  1.026  5.291  0.215  0.816  0.945  0.979  
Jia et al. [mypaper2021]  57.6M  0.144  0.966  5.078  0.208  0.815  0.945  0.981  
Monodepth2 [godard2019digging]  59.4M  0.128  1.087  5.171  0.204  0.855  0.953  0.978  
Our (CC)  59.4M  0.128  0.990  5.064  0.202  0.851  0.955  0.980  
Our (CL)  63.1M  0.128  0.979  5.033  0.202  0.851  0.954  0.980  
SfMLearner [zhou2017unsupervised]  126.0M  0.208  1.768  6.856  0.283  0.678  0.885  0.957  
Yang et al. (J) [yang2017unsupervised]  126.0M  0.182  1.481  6.501  0.267  0.725  0.906  0.963  
Monodepth2 [godard2019digging]  59.4M  0.144  1.059  5.289  0.217  0.824  0.945  0.976  
Our (LL)  25.8M  0.141  1.060  5.247  0.215  0.830  0.944  0.977 
V Experiments
In this section, we first introduce the experiment implementation details and then compare the results of the proposed approach with those of other state-of-the-art methods. Thereafter, the parameter analysis, ablation studies, and model complexity are discussed. All experiments are implemented with the PyTorch library on a single GTX Ti GPU card.
Errors  
Methods  GT?  PT?  AbsRel  SqRel  RMS  RMSlog 
Karsch et al. [karsch2014depth]  0.428  5.079  8.389  0.149  
Liu et al. [liu2014discrete]  0.475  6.562  10.05  0.165  
Laina et al. [laina2016deeper]  0.204  1.840  5.683  0.084  
DDVO [wang2018learning]  0.387  4.720  8.090  0.204  
Monodepth [godard2017unsupervised]  0.544  10.94  11.760  0.193  
Monodepth2 [godard2019digging]  0.322  3.589  7.417  0.163  
Jia et al. [mypaper2021]  0.301  3.143  6.972  0.351  
Our (CL)  0.269  2.201  6.452  0.325  
Our (CC)  0.267  2.188  6.406  0.322  
SfMLearner [zhou2017unsupervised]  0.383  5.321  10.470  0.478  
Our (LL)  0.289  2.423  6.701  0.348 
Errors  Errors  

AbsRel  SqRel  RMS  RMSlog  
32  0.142  1.121  5.330  0.216  0.829  0.944  0.976 
64  0.142  1.094  5.286  0.216  0.827  0.944  0.977 
128  0.143  1.117  5.340  0.217  0.829  0.942  0.975 
Errors  Errors  

AbsRel  SqRel  RMS  RMSlog  
False  0.144  1.106  5.339  0.217  0.825  0.943  0.976 
True  0.142  1.094  5.286  0.216  0.827  0.944  0.977 
V-A Implementation details
Datasets. KITTI [geiger2013vision], a large-scale and publicly available dataset widely used in various computer vision tasks, serves as the set of benchmarks for evaluation and comparison in this study. Following Zhou et al. [zhou2017unsupervised], we take and frame sequences from the KITTI dataset as the training and validation data, respectively. Following Eigen et al.’s [eigen2014depth] split, we take images from the KITTI dataset for testing.
We apply our model trained on the KITTI dataset to test images of the Make3D dataset [saxena2008make3d] unseen during training for demonstrating the generalization capability of the proposed model. Following Godard et al. [godard2019digging], we evaluate Make3D’s images on a center crop at a ratio. Additionally, qualitative results on the Cityscapes dataset [cordts2016cityscapes] generated by the proposed model trained on the KITTI dataset are presented. In all experiments, the resolution of the input image is .
Data augmentation. Following Godard et al. [godard2019digging], the input images are augmented with random cropping, scaling, and horizontal flips. In addition, a set of color augmentations (random brightness, contrast, saturation, and hue jitter with respective ranges of , , , and ) is applied with a percent chance. These color augmentations are applied only to the images fed to the networks, not to those used to compute the loss.
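The split between augmented network inputs and unaugmented loss targets can be sketched as follows. This is a minimal illustration, not the training code of this paper: the function names, the probabilities, and the brightness range are hypothetical placeholders, and only brightness jitter and horizontal flipping are shown (contrast, saturation, hue, cropping, and scaling are omitted for brevity).

```python
import random


def hflip(image):
    """Mirror an H x W x C image (nested lists) left to right."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    """Scale every channel value by `factor` and clamp the result to [0, 1]."""
    return [[[min(1.0, max(0.0, v * factor)) for v in px] for px in row]
            for row in image]


def augment(image, rng, flip_prob=0.5, color_prob=0.5,
            brightness_range=(0.8, 1.2)):
    """Return (network_input, loss_target) for one training image.

    The geometric augmentation (here only the flip) is shared by both outputs.
    The colour augmentation is applied to the network input only, so the loss
    is still computed on unjittered colours.
    All default probabilities and ranges are hypothetical placeholders.
    """
    if rng.random() < flip_prob:
        image = hflip(image)
    loss_target = image
    network_input = image
    if rng.random() < color_prob:
        factor = rng.uniform(*brightness_range)
        network_input = adjust_brightness(image, factor)
    return network_input, loss_target


# Usage: a 1 x 2 RGB image with values in [0, 1].
example = [[[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]]]
net_in, target = augment(example, random.Random(0))
```

Keeping the loss target free of colour jitter means that the photometric comparison between the synthesized and real views is not distorted by the augmentation itself.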
Hyperparameters. Our networks are trained using the Adam optimizer [kingma2014adam]. The learning rate is initially set to and decreased by a factor of every epochs. The other optimizer parameters are kept at their default values. The number of epochs, batch size, and sequence length are set to , , and , respectively.
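A step decay schedule of the kind described above can be sketched as follows. The function name and all numeric values are hypothetical placeholders chosen only for illustration, and the sketch assumes that "decreased by a factor" means multiplying the rate by a constant smaller than one at fixed epoch intervals.

```python
def step_lr(initial_lr, decay_factor, step_epochs, epoch):
    """Learning rate at a 0-indexed epoch under step decay.

    The rate is multiplied by `decay_factor` once every `step_epochs` epochs.
    """
    if step_epochs <= 0:
        raise ValueError("step_epochs must be positive")
    return initial_lr * decay_factor ** (epoch // step_epochs)


# Hypothetical values: start at 1e-4 and halve the rate every 10 epochs.
schedule = [step_lr(1e-4, 0.5, 10, epoch) for epoch in (0, 9, 10, 25)]
```

In this sketch the rate stays constant within each block of `step_epochs` epochs and drops only at block boundaries.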
During training, we initialize the weights and biases of each linear layer using a Gaussian distribution with standard deviation and constant . For each LayerNorm layer, the weights and biases are initialized with constants and , respectively. We do not pretrain our model on the ImageNet dataset because of computational resource constraints. The specific network configurations are presented in Section V-C.
V-B Results
Table I shows the quantitative results of the proposed model on the KITTI dataset [geiger2013vision], divided into three categories according to whether the ground truth and pretraining are used during the training phase. Note that the depth estimation results in this paper benefit from the fusion scaling strategy proposed by Jia et al. [mypaper2021].
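The column abbreviations in the result tables correspond to error and accuracy measures that are widely used in monocular depth evaluation. As a minimal sketch, assuming the standard definitions (the function name `depth_metrics`, the dictionary keys, and the use of the natural logarithm for RMSlog are assumptions and are not taken from this paper), they can be computed from paired predicted and ground truth depths as follows:

```python
import math


def depth_metrics(pred, gt):
    """Compute common depth metrics over paired, strictly positive depths."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    pairs = list(zip(pred, gt))
    n = len(pairs)
    # Mean absolute and squared errors, each relative to the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # Root mean squared error in linear and logarithmic depth.
    rms = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rms_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Accuracy: fraction of points whose ratio max(p/g, g/p) is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    acc = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMS": rms,
            "RMSlog": rms_log, "a1": acc[0], "a2": acc[1], "a3": acc[2]}


# Usage with two hypothetical depth values (in meters).
metrics = depth_metrics([2.0, 4.0], [2.0, 5.0])
```

Lower is better for the four error measures, and higher is better for the three threshold accuracies. Any scaling of the predictions, such as the fusion scaling mentioned above, would be applied before calling such a function and is not shown here.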
To benefit from pretraining and to enable a fair comparison, we integrate the proposed loss and decoders with existing pretrained CNNs. The first architecture, CC, uses the same networks as Monodepth2 [godard2019digging] but applies the proposed loss function; it outperforms the original Monodepth2, demonstrating the effectiveness of the proposed loss function. We then replace the CNN decoders with the proposed decoders to form the second architecture, CL, which further improves the performance. Finally, the LL network, consisting of the proposed DLNet (encoder) and decoders, achieves performance competitive with that of the state-of-the-art methods while also reducing the number of parameters by more than .
Figure 1 compares the performance and number of parameters of the evaluated methods. Without pretraining, the proposed model (LL) outperforms the other methods, and its performance is comparable even to that of pretrained models. Moreover, the proposed model significantly reduces the number of parameters, and simply integrating the proposed loss function and decoders with the pretrained encoder used in previous works leads to better performance.
Figure 8 illustrates qualitative results of the proposed model on the KITTI dataset. The regions marked with red circles are extremely challenging to predict accurately because of reflection, smoothness, and distance. Effectively capturing global features using CNNs is difficult, which may lead to unexpected failure cases in such regions, especially for a model without pretraining. In contrast, the proposed loss implicitly imposes geometric constraints on the results, and the proposed networks effectively extract global information; together, they significantly improve the performance and yield a more accurate depth map.
Value  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
False  0.144  1.122  5.285  0.217  0.827  0.944  0.976
True  0.142  1.094  5.286  0.216  0.827  0.944  0.977 
Value  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
4  0.144  1.130  5.317  0.217  0.827  0.943  0.976 
8  0.142  1.094  5.286  0.216  0.827  0.944  0.977 
16  0.143  1.114  5.282  0.216  0.828  0.944  0.977 
Activation function  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
ReLU [relu]  0.145  1.103  5.291  0.218  0.822  0.943  0.976
ELU [elu]  0.146  1.141  5.353  0.219  0.823  0.942  0.975 
GELU [gelu]  0.142  1.094  5.286  0.216  0.827  0.944  0.977 
Value  MS  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
0.5  58.0M  0.140  1.079  5.348  0.215  0.831  0.944  0.976 
1  25.8M  0.141  1.060  5.247  0.215  0.830  0.944  0.977 
2  15.1M  0.146  1.135  5.374  0.218  0.821  0.941  0.976 
Linformer depth  MS  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
1  25.8M  0.141  1.060  5.247  0.215  0.830  0.944  0.977 
2  30.7M  0.144  1.111  5.299  0.217  0.826  0.943  0.976 
Number of DLBlocks  MS  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
1  25.8M  0.141  1.060  5.247  0.215  0.830  0.944  0.977 
2  34.7M  0.144  1.171  5.323  0.218  0.831  0.944  0.975 
Scales  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
0  0.143  1.103  5.347  0.217  0.828  0.944  0.976 
0,3 (MMDSP)  0.142  1.118  5.330  0.216  0.829  0.944  0.976 
0,2,3  0.145  1.184  5.448  0.220  0.822  0.942  0.976 
0,1,2,3  0.144  1.115  5.347  0.217  0.823  0.943  0.976 
Methods  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
B  0.218  4.062  6.788  0.286  0.772  0.915  0.959 
B+MRp  0.194  2.756  6.213  0.262  0.786  0.925  0.966 
B+MRp+SSIM  0.148  1.238  5.496  0.221  0.818  0.941  0.974 
B+MRp+SSIM+AM  0.144  1.120  5.308  0.217  0.826  0.944  0.976 
B+MRp+SSIM+AM+MMDSP  0.142  1.118  5.330  0.216  0.829  0.944  0.976 
B+MRp+SSIM+AM+MMDSP+3DGS  0.141  1.060  5.247  0.215  0.830  0.944  0.977 
To evaluate the generalization capability of the proposed model, we directly apply the model trained on the KITTI dataset to the Make3D dataset without any refinement or additional training. Table II shows that the quantitative results of the proposed model are competitive, regardless of pretraining status.
Figure 9 illustrates some qualitative results generated by the proposed model on the Make3D dataset, which indicate that the proposed model has powerful generalization capability and can predict a reasonable depth map with adequate details for unseen data. The scenarios and perspectives in Make3D differ substantially from those in the KITTI data used for training.
In Fig. 10, we present some qualitative results predicted by the proposed model on the Cityscapes dataset, thus demonstrating the model’s practicality. The scenes, color distributions, and perspectives in Cityscapes differ from those in the KITTI data used for training.
In summary, the proposed model achieves competitive performance on the KITTI, Make3D, and Cityscapes datasets and produces excellent qualitative results, especially in some challenging regions.
V-C Parameter analysis
In this section, we report extensive experiments on the different parameters of the proposed model, conducted to determine the optimal configurations.
First, we conduct a series of experiments to search for the best parameters for the Linformer [Linformer] block embedded in our networks. As shown in Tables III, IV, V, and VI, we vary each parameter of the Linformer block in turn and adopt the value that yields the better performance as the final configuration. Specifically, we set the parameters , , , and to , , , and , respectively.
We further experiment with various global network parameters. As shown in Table VII, we evaluate different activation functions and find that GELU [gelu] performs best, with an AbsRel of 0.142 versus 0.145 for ReLU [relu] and 0.146 for ELU [elu]. Table VIII summarizes the performance of different hidden dimensions defined in Eq. 7. Considering both performance and model efficiency, is set to .
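For reference, the three activation functions compared in Table VII can be written in scalar form as follows. This is a sketch of their standard definitions, assuming the exact (error function based) form of GELU and the common default of 1 for the ELU scale `alpha`; whether the networks in this paper use these exact variants is not stated in the text.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Exponential linear unit: saturates smoothly towards -alpha for x < 0."""
    return x if x > 0 else alpha * math.expm1(x)


def gelu(x):
    """Gaussian error linear unit: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Usage: evaluate each activation on a few sample inputs.
samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
outputs = {f.__name__: [f(x) for x in samples] for f in (relu, elu, gelu)}
```

Unlike ReLU, both ELU and GELU pass small negative values through, and GELU is smooth everywhere, which is one commonly cited reason for its use in transformer-style blocks.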
Finally, the results of experiments on networks with various Linformer depths and numbers of DLBlocks are shown in Tables IX and X. Deepening the network yields no performance improvement. Therefore, the final networks adopt a single DLBlock in which the Linformer depth is set as for each encoder block. All results reported in Section V-B are derived from these configurations.
V-D Ablation studies
In this section, we discuss experiments conducted to validate the proposed MMDSP strategy and loss terms. First, Table XI shows that the proposed MMDSP is more effective than the other multiscale prediction strategies. In fact, the proposed Linformer-based networks can concurrently capture global and local features, which helps overcome the gradient locality issue. Therefore, the performance of our network with a single prediction scale is still on par with that of the network equipped with the proposed MMDSP.
Table XII demonstrates the effectiveness of each term of the proposed loss function and indicates that the proposed MMDSP strategy and, especially, the 3DGS loss indeed improve the performance.
V-E Model complexity
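Table XIII summarizes the time complexity (TC) and space complexity (SC) of the model. The proposed model requires 1.311 GFLOPs and 25.8M parameters, compared with 3.480 GFLOPs and 59.4M parameters for Monodepth2 [godard2019digging] and 3.485 GFLOPs and 57.6M parameters for Jia et al. [mypaper2021], a large margin in both time and space complexity.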
For reference, our depth and pose networks can achieve a speed of more than and frames per second at a resolution of on a single GTX Ti card (averaging over iterations), respectively, which can satisfy the requirements of most applications.
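A frames-per-second figure averaged over a number of iterations can be obtained with a timing loop of the following kind. This is a generic sketch rather than the benchmarking code behind the numbers above: the function name, the warm-up phase, and the default iteration counts are assumptions, and `run_once` stands for a hypothetical callable that performs one forward pass.

```python
import time


def measure_fps(run_once, iterations=100, warmup=10):
    """Average frames per second of `run_once` over `iterations` calls."""
    # Warm-up calls are excluded from timing so one-off setup costs do not count.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very coarse clock.
    return iterations / elapsed if elapsed > 0 else float("inf")


# Usage with a dummy CPU workload standing in for one forward pass.
fps = measure_fps(lambda: sum(range(1000)))
```

When timing a model on a GPU, the device generally has to be synchronized before each clock reading, because kernel launches are asynchronous; that step is omitted here to keep the sketch free of framework dependencies.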
Methods  TC (GFLOPs)  SC (M) 

Jia et al. [mypaper2021]  3.485  57.6 
Monodepth2 [godard2019digging]  3.480  59.4 
Our  1.311  25.8 
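The relative savings implied by Table XIII can be checked with a few lines of arithmetic. The sketch below only restates the table values; the helper name `relative_reduction` and the dictionary layout are illustrative choices.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# (TC in GFLOPs, SC in millions) as listed in Table XIII.
baselines = {
    "Jia et al.": (3.485, 57.6),
    "Monodepth2": (3.480, 59.4),
}
ours = (1.311, 25.8)

# For each baseline: (reduction in TC, reduction in SC), both in percent.
reductions = {
    name: (relative_reduction(tc, ours[0]), relative_reduction(sc, ours[1]))
    for name, (tc, sc) in baselines.items()
}
```

With these values, the proposed model uses roughly 62% fewer GFLOPs than either baseline and between about 55% and 57% fewer parameters.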
VI Limitations
In this section, we discuss some failure cases of the proposed model. As shown in Fig. 11, it is challenging to 1) capture extremely far objects, which requires more accurate pose estimation; 2) accurately capture moving objects, because such objects violate the static scene assumption; and 3) predict slender objects, because such objects are easily mistaken for part of the background. We will work on these issues in future studies.
VII Conclusion
This paper focuses on how to simultaneously extract global and local information from images and predict a geometrically smooth depth map. For extraction, an innovative network, DLNet, is proposed along with special depth and pose decoders. For prediction, a 3DGS loss is presented. Moreover, we explore the multiscale prediction strategy used to overcome the gradient locality issue and propose a maximum margin dual-scale prediction (MMDSP) strategy for efficient and effective predictions. Detailed experiments demonstrate that the proposed model achieves competitive performance against state-of-the-art methods and reduces time and space complexities by more than and , respectively. We hope that our work contributes to the relevant academic and industrial communities.