
Monocular Depth Distribution Alignment with Low Computation

by Fei Sheng, et al.

The performance of monocular depth estimation generally depends on the number of parameters and the computational cost. This leads to a large accuracy gap between light-weight networks and heavy-weight networks, which limits their application in the real world. In this paper, we model the majority of this accuracy gap as a difference of depth distribution, which we call "distribution drift". To address it, a distribution alignment network (DANet) is proposed. We first design a pyramid scene transformer (PST) module to capture inter-region interaction at multiple scales. By perceiving the difference of depth features between every two regions, DANet tends to predict a reasonable scene structure, which fits the shape of the distribution to the ground truth. Then, we propose a local-global optimization (LGO) scheme to supervise the global range of scene depth. Thanks to the alignment of depth distribution shape and scene depth range, DANet sharply alleviates the distribution drift and achieves performance comparable to prior heavy-weight methods while using only 1% of their FLOPs. Experiments on two datasets, namely the widely used NYUDv2 dataset and the more challenging iBims-1 dataset, demonstrate the effectiveness of our method. The source code is available at




I Introduction

Monocular depth estimation (MDE) aims to infer the 3D structure of a scene from a single 2D image, and has been widely applied in many computer vision and robotics tasks, e.g., visual SLAM [21, 33, 34], monocular 3D object detection [36, 2], obstacle avoidance [26, 38, 41], and augmented reality [25]. These applications raise high demands for both the accuracy and speed of MDE.

With the development of deep learning, many impressive works [18, 11, 16, 28] have emerged, most of which focus on improving the accuracy of the predicted depth. However, when we attempt to reduce the parameters and computation of these models, their accuracy drops sharply. The degradation is mainly caused by the inadequate feature representation of pixel-wise continuous depth values. This phenomenon also appears in some recent algorithms [27, 37] that realize real-time MDE. To our knowledge, it is difficult for the prior methods to run at low latency while achieving performance similar to the networks that focus on accuracy.

Fig. 1: Illustration of distribution drift phenomenon. The depth distribution is represented by the histogram of depth values, green for correct depth and red for predicted depth. The error map describes the pixel-wise error of depth, with red indicating too far and blue indicating too close.

The motivation of this paper is the observed major degradation of light-weight MDE models compared to heavy-weight ones. We found that, for light-weight MDE networks, whole regions of pixels in the prediction are often monolithically smaller or larger than the correct depth, which is the main indicator of accuracy degradation. As shown in the second error map of [15] in Fig. 1, almost all pixels on the wall are predicted as farther, which can be observed more intuitively in the depth distribution. The depth distribution shows the proportion of pixels at different depth values. Light-weight MDE models tend to produce a depth distribution completely different from the ground truth, which is reflected in two differences, i.e., the shape of the depth distribution and the full depth range. We call this issue 'distribution drift'. As shown in Fig. 1, [15] with a light-weight backbone obtains a shape of depth distribution and a depth range different from the ground truth.

Fig. 2: Cases of accuracy degradation of the prior state-of-the-art methods ([15, 42, 5] from left to right). For each method, the first row shows the prediction, error map and depth distribution of the models using the heavy-weight backbone, and the second row for the light-weight backbone.

In this paper, we propose a distribution alignment network to alleviate the distribution drift, enabling our method to achieve performance comparable to the state-of-the-art methods while running at low latency. Firstly, to address the shape deviation of the depth distribution, we propose a pyramid scene transformer (PST). Since light-weight models are limited in network depth, they only extract depth cues over short ranges. However, minimal depth changes within a short range can hardly be perceived, which causes whole regions to be predicted at the wrong depth. In the proposed PST, we capture the long-range interaction between every two regions at multiple scales, which constrains the depth relationship between different regions. Thus, PST helps to recover a reliable scene structure. Then, to align the depth range, a local-global optimization (LGO) scheme is proposed to optimize the local depth values and the global depth range simultaneously. By using the maximum and minimum depth as supervision, the value range of the scene depth is aligned with the ground truth. Experiments prove that we indeed align the distribution of the scene depth, which helps our method achieve performance comparable to state-of-the-art methods on the NYUDv2 and iBims-1 datasets.

The main contributions of this work lie in:

  • The distribution drift is studied to reveal the major degradation of light-weight models, which inspires us to propose a distribution alignment network (DANet). The DANet exceeds all prior light-weight works, and achieves a comparable accuracy with heavy-weight models but uses only 1% FLOPs of them.

  • A pyramid scene transformer (PST) module is proposed to gain long-range interaction between multi-scale regions, helping DANet to alleviate the shape deviation of predicted depth distribution.

  • A local-global optimization (LGO) scheme is proposed to jointly supervise the network with local depth value and global depth statistics.

II Related Work

II-A Monocular Depth Estimation

Several early monocular depth estimators utilize handcrafted features to estimate depth [30, 22] but suffer from insufficient expressive ability. Recently, many CNN-based methods have achieved great performance gains. Eigen et al. [9, 10] propose a coarse-to-fine CNN to estimate depth. Laina et al. [18] propose the up-projection for MDE to achieve higher accuracy. Xue et al. [39] improve the boundary accuracy of the predicted depth with a boundary fusion module. Yin et al. [42] enforce geometric constraints of virtual normals for depth prediction. Different from these works, Fu et al. [11] define MDE as a classification task: they divide the depth range into a set of bins with a predetermined width. Bhat et al. [1] compute the bins adaptively for each image.

However, these prior methods focus heavily on achieving high accuracy at the cost of complexity and runtime, because they stack a large number of convolutions to obtain sufficient feature representation. Fig. 2 shows three models that verify the issue we identified: once the network capacity drops, they suffer a sharp degradation in accuracy, which limits their applications. In this paper, we solve this problem by aligning the depth distribution, and thus our method achieves a better trade-off between accuracy and computation.

Fig. 3: Network architecture of DANet, which consists of an encoder-decoder network and a pyramid scene transformer. The pyramid scene transformer sits between the encoder and the decoder, and predicts the centers of the depth bins, which are combined with the output of the decoder.

II-B Real-time Monocular Depth Estimation Methods

To reduce inference latency, several methods for MDE have been proposed in recent years. Wofk et al. [37] design an extremely light-weight network, adopting MobileNet [13] and depth-wise separable convolutions [7] to build the whole network, followed by network pruning to further reduce computation. Nekrasov et al. [27] boost depth estimation by jointly learning semantic segmentation and distilling structured knowledge from a large model into a light-weight model. However, due to the limited capacity of these models, distribution drift still appears in light-weight networks.

II-C Context Learning

Context plays an important role in computer vision tasks [43, 12, 45, 46, 44]. Zhao et al. [43] propose the pyramid pooling to aggregate global context information. Lu et al. [24] propose the multi-rate context learner to capture image context by dilated convolution. Vaswani et al. [35] design the transformer to obtain global context by self-attention, which is used in multiple vision tasks [8, 3, 6]. This paper proposes a pyramid scene transformer to capture context interaction between multi-scale regions.

III Problem Formulation

In contrast to other methods [15, 5], following [1], the depth range of the whole scene (the valid depth range of the NYUDv2 dataset) is divided into $N$ bins, and the goal of depth estimation is formulated as follows: for an input image $I$, two tensors are jointly predicted, namely the center values of the depth bins $c \in \mathbb{R}^{N}$, and the bin-probability maps $p \in \mathbb{R}^{N \times H \times W}$ indicating the probability of each pixel falling into the corresponding depth bin. In the final predicted depth map, denoted as $\hat{d}$, each pixel can be formulated as the linear combination of the bin probabilities and the bin centers:

$$\hat{d}_i = \sum_{k=1}^{N} p_{i,k}\, c_k, \tag{1}$$

where $\hat{d}_i$ denotes the $i$-th pixel in the prediction $\hat{d}$.
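As a concrete illustration, the linear combination in Eq. 1 can be sketched in a few lines of pure Python (names here are illustrative, not from the paper's code):

```python
def depth_from_bins(probs, centers):
    """Per-pixel depth as the linear combination of bin probabilities
    and bin centers (Eq. 1); probs[i] sums to 1 over the N bins."""
    return [sum(p_k * c_k for p_k, c_k in zip(p_i, centers)) for p_i in probs]

centers = [1.0, 3.0, 5.0]               # N = 3 bin centers (meters)
probs = [[1.0, 0.0, 0.0],               # pixel fully in the first bin
         [0.0, 0.5, 0.5]]               # pixel split between the last two bins
print(depth_from_bins(probs, centers))  # -> [1.0, 4.0]
```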

IV Methodology

The first subsection outlines the whole architecture of DANet. The second subsection illustrates the pyramid scene transformer (PST), and the following subsection presents the local-global optimization (LGO) scheme for aligning the depth range of the scene to the correct range.

IV-A Network Structure

Fig. 3 illustrates the architecture of DANet, which consists of an encoder, a pyramid scene transformer, and a decoder. Given an image $I$, a light-weight backbone, EfficientNet B0 [32], is used to extract features; the $i$-th level feature map of the backbone is denoted as $F_i$. The feature compression block (FCB), composed of two convolutional layers, is used to reduce the channel number of each feature $F_i$ to 16. The four FCBs provide multi-level scene detail information at low time cost. At the end of the encoder, the PST is employed to capture the interaction between multi-scale regions from the deepest feature, and meanwhile to predict the center values of the depth bins $c$ in the scene (see Section IV-B). In the decoder, four up-scaling stages are employed to gradually enlarge the resolution. Each stage upsamples the output of the previous stage and sums it with the same-size feature given by the FCB; then a residual structure containing three convolutions is used to fuse these features. The channel number of the decoder features is set to 16 to meet the requirements of low latency and light weight. At the end of the decoder, we use a convolution to learn the $N$-dimensional bin-probability map $p$ from the finest-resolution feature. Referring to Eq. 1, the final prediction is obtained as the linear combination of $p$ and $c$.

Fig. 4: The detailed structure of pyramid scene transformer.

IV-B Pyramid Scene Transformer

Context interaction models the inter-region relationship of depth features, which helps to correctly estimate the depth difference between regions. From the global view of the scene, it plays a significant role in suppressing the shape deviation of the depth distribution. To this end, we design the PST to capture context interaction; it consists of three independent parallel paths, as shown in Fig. 4. These paths divide the scene into patches of different sizes to cover scene components at various scales, and the relationship between every two patches is captured by a transformer structure [8].

Specifically, an adaptive embedding convolution (AEC) is first designed to obtain the multi-scale context embeddings adaptively. Given the input feature resolution $(H_{in}, W_{in})$ and the expected output resolution $(H_{out}, W_{out})$, AEC is defined with:

  • Stride in X direction: $s_x = \lfloor W_{in} / W_{out} \rfloor$

  • Stride in Y direction: $s_y = \lfloor H_{in} / H_{out} \rfloor$

  • Kernel size: $\big(H_{in} - (H_{out} - 1)\, s_y,\; W_{in} - (W_{out} - 1)\, s_x\big)$
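A minimal sketch of this stride/kernel arithmetic, assuming it mirrors PyTorch-style adaptive pooling (the exact formula is a reconstruction, not the authors' code):

```python
def aec_params(h_in, w_in, h_out, w_out):
    """Compute stride and kernel size that tile an (h_in, w_in) feature map
    into an (h_out, w_out) grid of patches, adaptive-pooling style (assumed)."""
    s_y = h_in // h_out                     # stride in Y direction
    s_x = w_in // w_out                     # stride in X direction
    k_h = h_in - (h_out - 1) * s_y          # kernel height covers the remainder
    k_w = w_in - (w_out - 1) * s_x          # kernel width covers the remainder
    return (s_y, s_x), (k_h, k_w)

# A 15x20 feature map divided into a 4x4 grid of patch embeddings:
print(aec_params(15, 20, 4, 4))  # -> ((3, 5), (6, 5))
```

Note that the kernel is at least as large as the stride, so every input pixel is covered by some patch.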

By using AEC, the three paths re-scale the input feature into tensors of three different sizes, one per path; each pixel in a re-scaled tensor represents the context embedding of one patch of the scene. Secondly, in each path, all embeddings are fed into a transformer encoder after adding a 1-D learned positional encoding [8]. The transformer encoder perceives the interaction between every two embeddings, and outputs a sequence of embeddings with the same size as the input embeddings. Note that the first path differs from the other two: it appends an additional learnable embedding to the context embeddings, and outputs a corresponding special embedding of the same size. Thirdly, in each path, the output embeddings are reshaped to build a tensor with the same size as the path input. All output tensors are then upsampled to a common resolution so that they can be concatenated. The concatenated feature is compressed to 16 channels through convolutional layers and fed into the decoder. Meanwhile, in the first path, the output special embedding is fed into a multi-layer perceptron to obtain an $N$-dimensional vector $w'$. Subsequently, in the same way as [1], the vector is normalized to obtain the depth-range width vector $w$: $w_k = w'_k / \sum_{j=1}^{N} w'_j$. The center of the $k$-th bin is then obtained as $c_k = d_{min} + (d_{max} - d_{min}) \big( \sum_{j=1}^{k-1} w_j + w_k / 2 \big)$, where $d_{min}$ and $d_{max}$ are the minimum and maximum depth values, and $w_k$ is the $k$-th value in $w$.
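The width-normalization and cumulative-midpoint placement of the bin centers, following the scheme of [1], can be sketched as follows (the epsilon stabilizer and the function name are assumptions):

```python
def bin_centers(widths_raw, d_min, d_max, eps=1e-3):
    """Normalize raw bin widths, then place each bin center at the midpoint
    of its cumulative interval within [d_min, d_max]."""
    total = sum(w + eps for w in widths_raw)
    widths = [(w + eps) / total for w in widths_raw]
    centers, cum = [], 0.0
    for w in widths:
        centers.append(d_min + (d_max - d_min) * (cum + w / 2))
        cum += w
    return centers

print(bin_centers([1.0, 1.0], 0.0, 10.0))  # equal widths -> [2.5, 7.5]
```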

Since the transformer extracts the context interaction between every two patches in a scene, each output embedding encodes the depth interaction from one patch to all other patches, and different paths correspond to the depth correlations of patches at various scales. Moreover, unlike [1], PST is placed between the encoder and the decoder to minimize the amount of computation.

IV-C Local-Global Optimization for Depth Range Learning

To align the global depth range, we propose a local-global optimization (LGO) scheme, which trains DANet in two stages. In the local stage, we adopt two local losses from [1] as supervision. In the global stage, we propose a min-max loss and a range-based pixel weight to learn the global depth range and optimize the whole depth.

IV-C1 Loss of the local stage

The local stage aims to optimize the pixel-wise depth. To this end, a scaled version of the Scale-Invariant (SSI) loss [20] is used to minimize the pixel-wise error between the predicted depth and the correct depth:

$$\mathcal{L}_{SSI} = \alpha \sqrt{\frac{1}{T} \sum_i \omega_i\, g_i^2 - \lambda \Big( \frac{1}{T} \sum_i \omega_i\, g_i \Big)^2}, \tag{2}$$

where $g_i = \log \hat{d}_i - \log d_i$, and $T$ is the number of pixels in an image. $\hat{d}_i$ and $d_i$ are the predicted and correct depth, respectively. $\omega_i$ is a weight parameter of pixel $i$; in the local stage, $\omega_i$ is set to 1. Furthermore, following [1], the bi-directional Chamfer loss [1] is employed as a regularizer to push the bin centers close to the ground-truth depth values:

$$\mathcal{L}_{bin} = \sum_{d \in D} \min_{k} \| d - c_k \|^2 + \sum_{k} \min_{d \in D} \| d - c_k \|^2,$$

where $D$ is the set of all depth values in the ground truth.
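A runnable sketch of the weighted scale-invariant loss described above; the constants α = 10 and λ = 0.85 follow [20], and the placement of the pixel weights is an assumption of this sketch:

```python
import math

def ssi_loss(pred, gt, weights=None, alpha=10.0, lam=0.85):
    """Weighted scale-invariant log loss over flattened depth values."""
    if weights is None:
        weights = [1.0] * len(pred)      # local stage: uniform pixel weights
    t = len(pred)
    g = [math.log(p) - math.log(d) for p, d in zip(pred, gt)]
    mean_sq = sum(w * x * x for w, x in zip(weights, g)) / t
    mean = sum(w * x for w, x in zip(weights, g)) / t
    return alpha * math.sqrt(mean_sq - lam * mean ** 2)

print(ssi_loss([2.0, 4.0], [2.0, 4.0]))  # perfect prediction -> 0.0
```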

IV-C2 Loss of the global stage

The global stage aims to learn the depth range. In this stage, we supervise the first and last bin centers in $c$ by a newly designed min-max loss:

$$\mathcal{L}_{mm} = \big| c_1 - \min(D) \big| + \big| c_N - \max(D) \big|,$$

where $c_k$ is the $k$-th value of $c$, and $\min(\cdot)$ and $\max(\cdot)$ take the minimum and maximum value, respectively. The min-max loss affects all pixels during back-propagation by supervising the bins, so that it squeezes all predicted depth values into the range $[\min(D), \max(D)]$.
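As a sketch, the min-max loss only touches the two extreme bin centers; the L1 form of the penalty here is an assumption:

```python
def minmax_loss(centers, gt_depths):
    """Penalty on the first/last bin centers vs. the true depth extremes
    (sketch; the exact distance, here L1, is an assumption)."""
    return abs(centers[0] - min(gt_depths)) + abs(centers[-1] - max(gt_depths))

# Bin centers under- and over-shoot the ground-truth range [1.0, 8.0]:
print(minmax_loss([0.5, 5.0, 9.0], [1.0, 4.0, 8.0]))  # -> 1.5
```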

Fig. 5: The distribution of a depth map and the value of the depth-related weight under different coefficients.

However, since the number of pixels with the largest and smallest depth in a scene is small, the network might be insensitive to these pixels. Thus, we additionally assign a depth-related weight to each pixel. In the global stage, the parameter $\omega_i$ in Eq. 2 is taken as the depth-related weight of a pixel, which is proportional to the difference between the pixel's depth and the median depth value of the ground truth:

$$\omega_i = 1 + \mu \, \big| d_i - \mathrm{med}(D) \big|,$$

where $i$ is the pixel index, $\mu$ is a coefficient, and $\mathrm{med}(\cdot)$ denotes taking the median value. As shown in Fig. 5, if the correct depth is close to $\min(D)$ or $\max(D)$, $\omega_i$ is large, which means that the network pays more attention to this pixel. If the correct depth is close to $\mathrm{med}(D)$, $\omega_i$ tends to 1. In this way, DANet pays more attention to pixels with small and large depth, and predicts a more reasonable depth range.
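A sketch of one plausible form of the depth-related weight, assuming it grows linearly with the distance from the median depth (the exact expression and the name `depth_weight` are assumptions):

```python
from statistics import median

def depth_weight(d, gt_depths, mu=1.0):
    """Weight that grows with a pixel's distance from the median scene depth
    (assumed linear form; pixels near the depth extremes are up-weighted)."""
    return 1.0 + mu * abs(d - median(gt_depths))

depths = [1.0, 2.0, 3.0, 4.0, 9.0]      # median depth is 3.0
print(depth_weight(3.0, depths))        # median pixel -> 1.0
print(depth_weight(9.0, depths))        # far pixel gets a larger weight -> 7.0
```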

IV-C3 Training scheme

Combining the SSI loss, the Chamfer loss, and the min-max loss, the total loss is formulated as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{SSI} + \lambda_2 \mathcal{L}_{bin} + \lambda_3 \mathcal{L}_{mm},$$

where $\lambda_1, \lambda_2, \lambda_3$ are hyper-parameters. In the first stage, the min-max loss is disabled and $\omega_i$ is set to 1, so the pixel-wise depth is optimized preliminarily. In the second stage, the min-max loss and the depth-related weight are enabled, and the depth range is further optimized based on the weights learned in the first stage.

Methods Backbone FLOPs Params REL RMS log10 δ1 δ2 δ3
Eigen et al.[10] VGG16 31G 240M 0.215 0.772 0.095 0.611 0.887 0.971
Eigen et al. [9] VGG16 23G - 0.158 0.565 - 0.769 0.950 0.988
Laina et al. [18] ResNet50 17G 63M 0.127 0.573 0.055 0.811 0.953 0.988
Fu et al. [11] ResNet101 102G 85M 0.118 0.498 0.052 0.828 0.965 0.992
Lee et al. [19] DenseNet161 96G 268M 0.126 0.470 0.054 0.837 0.971 0.994
Hu et al. [15] ResNet50 107G 67M 0.130 0.505 0.057 0.831 0.965 0.991
Chen et al. [5] SENet154 150G 258M 0.111 0.420 0.048 0.878 0.976 0.993
Yin et al. [42] ResNet101 184G 90M 0.105 0.406 0.046 0.881 0.976 0.993
Lee et al. [20] ResNet101 132G 66M 0.113 0.407 0.049 0.871 0.977 0.995
Bhat et al. [1] EfficientNet b5 186G 77M 0.103 0.364 0.044 0.902 0.983 0.997
Wofk et al. [37] MobileNet 0.75G 3.9M 0.162 0.591 - 0.778 0.942 0.987
Nekrasov et al. [27] MobileNet v2 6.49G 2.99M 0.149 0.565 - 0.790 0.955 0.990
Yin et al. [42] MobileNet v2 15.6G 2.7M 0.135 - 0.060 0.813 0.958 0.991
Hu et al. [14] MobileNet v2 - 1.7M 0.138 0.499 0.059 0.818 0.960 0.990
Hu et al. [15] EfficientNet b0 14G 5.3M 0.142 0.505 0.059 0.814 0.961 0.989
Chen et al. [5] EfficientNet b0 8.22G 12M 0.135 0.514 - 0.828 0.963 0.990
Yin et al. [42] EfficientNet b0 18G 4.6M 0.145 0.567 0.067 0.771 0.947 0.988
Ours EfficientNet b0 1.5G 8.2M 0.135 0.488 0.057 0.831 0.966 0.991
TABLE I: Comparisons on the NYUDv2 dataset. Group ① contains non-light-weight methods. Group ② contains light-weight methods. Group ③ contains the re-implemented models using the same backbone as our method. δ1, δ2, δ3 denote the accuracy under thresholds δ < 1.25, 1.25², 1.25³.

V Experiments

In this section, we evaluate the proposed method on several datasets and compare it with prior methods. Moreover, we provide further discussion of the network design.

Methods REL RMS log10 δ1 δ2 δ3
Eigen et al. [10] 0.32 1.55 0.17 0.36 0.65 0.84
Eigen et al. [9] 0.25 1.26 0.13 0.47 0.78 0.93
Laina et al. [18] 0.23 1.20 0.12 0.50 0.78 0.91
Hu et al. [15] 0.24 1.20 0.12 0.48 0.81 0.92
Chen et al. [5] 0.25 1.07 0.10 0.56 0.86 0.94
Fu et al. [11] 0.23 1.13 0.12 0.55 0.81 0.92
Lee et al. [19] 0.23 1.09 0.11 0.53 0.83 0.95
Yin et al. [42] 0.24 1.06 0.11 0.54 0.84 0.94
Bhat et al. [1] 0.21 0.91 0.10 0.55 0.86 0.95
Wofk et al. [37] 0.38 1.76 0.21 0.30 0.56 0.74
Nekrasov et al. [27] 0.52 1.57 0.16 0.33 0.66 0.87
Ours 0.26 1.11 0.11 0.55 0.86 0.94
TABLE II: Comparisons on the iBims-1 dataset. The first group contains non-light-weight methods; the second group contains light-weight methods.

V-a Dataset and Implementation Details

Datasets: The NYUDv2 [31] and iBims-1 [17] datasets are used for the experiments. NYUDv2 [31] is an indoor dataset that collects 464 scenes with 120K pairs of RGB images and depth maps. Following [15, 5], we train DANet on 50K images sampled from the raw training data and adopt the same data augmentation strategy as [5]. The test set includes 654 images with filled-in depth values. The iBims-1 dataset [17] contains 100 pairs of high-quality depth maps and high-resolution images. Since this dataset lacks a training set, we evaluate generalization on it using the model trained on the NYUDv2 dataset.

Implementation Details: DANet is implemented in the PyTorch framework and trained on a single NVIDIA 3090 GPU. Our backbone, EfficientNet b0, is pre-trained on ILSVRC [29]; the other parameters are randomly initialized. The Adam optimizer is adopted. We train our model for 20 epochs with a batch size of 24: 10 epochs for the local stage and 10 epochs for the global stage. The initial learning rate is set to 0.0002 and reduced by 10% every 5 epochs.

Metrics: Following [1], we evaluate our method with the following metrics: mean absolute relative error (REL), root mean squared error (RMS), mean log10 error (log10), and the accuracy under threshold (δ < 1.25, 1.25², 1.25³). Referring to [1], to make a fair comparison, we re-evaluated some methods [10, 9, 18, 11, 15, 5, 42], whose performance therefore differs slightly from the originally reported numbers.
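These standard MDE metrics can be computed as follows (a pure-Python sketch over flattened depth arrays; names are illustrative):

```python
import math

def mde_metrics(pred, gt):
    """REL, RMS, log10 error, and threshold accuracies delta < 1.25^k."""
    n = len(pred)
    rel = sum(abs(p - d) / d for p, d in zip(pred, gt)) / n
    rms = math.sqrt(sum((p - d) ** 2 for p, d in zip(pred, gt)) / n)
    log10 = sum(abs(math.log10(p) - math.log10(d)) for p, d in zip(pred, gt)) / n
    ratios = [max(p / d, d / p) for p, d in zip(pred, gt)]
    acc = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}
    return rel, rms, log10, acc

rel, rms, log10, acc = mde_metrics([1.0, 2.5], [1.0, 2.0])
print(acc[1])  # ratio 1.25 is not < 1.25, so only one of two pixels passes -> 0.5
```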

V-B Comparison with the prior methods

Quantitative Evaluation: Table I compares our method with prior methods on the NYUDv2 dataset. The backbones of three non-real-time networks [15, 42, 5] are replaced for comparison (Group ③). DANet achieves RMS and accuracy comparable to several non-real-time networks [19, 11, 15] while expending only about 1% of their FLOPs. It also outperforms all light-weight networks [37, 27, 14] by a large margin. Furthermore, compared to the state-of-the-art methods with EfficientNet b0, DANet gains the best performance on all metrics, which demonstrates the effectiveness of distribution alignment in a light-weight network. Although DANet uses more parameters than the other light-weight models, it is far lighter than heavy-weight models, light enough to run well on embedded platforms.

Table II shows the cross-dataset evaluation on the iBims-1 dataset using the model trained on NYUDv2 without fine-tuning. Note that we do not re-normalize the depth range of the results to iBims-1. Although the iBims-1 dataset has a totally different data distribution from NYUDv2, DANet achieves the fifth-best RMS and is tied for the 3-rd best accuracy of δ1 with the state-of-the-art methods [1, 11]. Furthermore, DANet exceeds the prior real-time works [27, 37] by a large margin on all metrics. The reason is that DANet performs outstandingly on scenes with a depth range similar to NYUDv2, which proves the generalization ability of our method with distribution alignment.

Fig. 6: Visualizations on the NYUDv2 (first two rows) and iBims-1 (last two rows) datasets. The first two columns are the input and the ground truth; the remaining columns show [19, 15, 5, 1, 37, 27] and DANet. The depth distribution is shown under each depth map, with green for the correct depth and red for the prediction.
Models RMS δ1 FLOPs Params
Baseline 0.510 0.810 1.0G 3.7M
+ PST 0.498 0.820 1.5G 8.2M
+ PST + min-max loss 0.498 0.825 1.5G 8.2M
+ PST + depth-related weight 0.496 0.822 1.5G 8.2M
+ LGO 0.496 0.822 1.0G 3.7M
+ PST + LGO 0.488 0.831 1.5G 8.2M
TABLE III: Quantitative results of our proposed module.
Models RMS δ1
Ours with PPM [43] 0.497 0.823
Ours with ASPP [4] 0.494 0.825
Ours with mini ViT [1] 0.496 0.822
Ours with PST 0.488 0.831
TABLE IV: Quantitative results of context learning module.
Fig. 7: Qualitative results of each contribution.

Qualitative Evaluation: Fig. 6 shows the qualitative results on the NYUDv2 dataset (first two rows) and the iBims-1 dataset (last two rows). In the first scene, several methods predict a wrong depth for the wall behind the sofa, and thus suffer from a wrong depth range; DANet obtains a depth distribution almost coinciding with the ground truth. In the second scene, the farthest region is occluded by the cabinet. The light-weight models predict the wrong farthest region, causing a large distribution drift, whereas our method correctly estimates the depth together with the other state-of-the-art methods. The third and fourth rows show two scenes never seen during training. Many methods suffer from a deviation of the depth range, especially the light-weight models [27, 37]. Our method still estimates the depth distribution almost perfectly and predicts a reasonable depth image. These visualizations further prove the effectiveness of the proposed paradigm.

V-C Detailed Discussions

Ablation studies: We verify PST and LGO in Table III, adding them one by one to test the effectiveness of each proposal. Note that the baseline is an encoder-decoder network without PST and LGO. Compared with the baseline, PST and LGO achieve gains of 1.0 and 1.2 points in δ1, respectively. Moreover, Baseline+PST+LGO achieves the best performance on all evaluation metrics. We further validate the effectiveness of the min-max loss and the depth-related weight in LGO: compared to Baseline+PST, the performance of the model improves after adding each of them respectively.

Fig. 7 illustrates the visualized results of these variants. The model using LGO squeezes the depth range into a narrower space, but fails to optimize the distribution shape. The model using PST obtains a similar distribution with ground truth, but suffers from the wrong depth range. The model using all of them aligns the depth distribution well.

Fig. 8: Performance under various values of the depth-related weight coefficient.

Effectiveness of multi-scale interaction. To evaluate the multi-scale interaction, PST is replaced by other contextual learning modules, i.e., Pyramid Pooling Module (PPM) [43], ASPP [4], and mini ViT [1], respectively. As shown in Table IV, DANet with PST outperforms others over all metrics, because the interaction of multi-scale regions directly models the relationship between every two regions.

Coefficient of the depth-related weight. To explore the best coefficient of the depth-related weight, we vary it over a range of values; Fig. 8 shows the comparison. It can be seen that accuracy rises at the beginning and then continues to decrease as the coefficient increases, which reveals that excessive attention to the far and near areas saturates and then degrades performance. We therefore set the coefficient to the best-performing value in this paper.

VI Conclusions

In this work, DANet is designed to solve the distribution drift problem in light-weight MDE networks. To obtain an aligned depth distribution shape, the PST is introduced, which captures the interaction between multi-scale regions. In addition, a local-global optimization is proposed to guide the network to obtain a reliable depth range. Experimental results on the NYUDv2 and iBims-1 datasets prove that DANet achieves performance comparable to state-of-the-art methods with only 1% of their FLOPs. In the future, we will pursue real-time inference on embedded platforms, so that DANet can be used to improve depth-dependent tasks [47, 23] and mobile robot applications [40].


  • [1] S. F. Bhat, I. Alhashim, and P. Wonka (2021) AdaBins: depth estimation using adaptive bins. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II-A, §III, §IV-B, §IV-B, §IV-C1, §IV-C, TABLE I, Fig. 6, §V-A, §V-B, §V-C, TABLE II, TABLE IV.
  • [2] Y. Cai, B. Li, Z. Jiao, H. Li, X. Zeng, and X. Wang (2020) Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation. In AAAI Conference on Artificial Intelligence (AAAI). Cited by: §I.
  • [3] D. Chen, H. Hsieh, and T. Liu (2021) Adaptive image transformer for one-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-C.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (4), pp. 834–848. Cited by: §V-C, TABLE IV.
  • [5] X. Chen, X. Chen, and Z. Zha (2019) Structure-aware residual pyramid network for monocular depth estimation. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: Fig. 2, §III, TABLE I, Fig. 6, §V-A, §V-A, §V-B, TABLE II.
  • [6] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu (2021) Transformer tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-C.
  • [7] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-B.
  • [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. ArXiv abs/2010.11929. Cited by: §II-C, §IV-B, §IV-B.
  • [9] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In IEEE International Conference on Computer Vision (ICCV), Cited by: §II-A, TABLE I, §V-A, TABLE II.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NIPS), Cited by: §II-A, TABLE I, §V-A, TABLE II.
  • [11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A, TABLE I, §V-A, §V-B, §V-B, TABLE II.
  • [12] Y. Gao, X. Li, J. Zhang, Y. Zhou, D. Jin, J. Wang, S. Zhu, and X. Bai (2021) Video text tracking with a spatio-temporal complementary model. IEEE Transactions on Image Processing (TIP) 30 (), pp. 9321–9331. Cited by: §II-C.
  • [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861. Cited by: §II-B.
  • [14] J. Hu, C. Fan, H. Jiang, X. Guo, X. Lu, and T. L. Lam (2021) Boosting light-weight depth estimation via knowledge distillation. ArXiv abs/2105.06143v1. Cited by: TABLE I, §V-B.
  • [15] J. Hu, M. Ozay, Y. Zhang, and T. Okatani (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: Fig. 2, §I, §III, TABLE I, Fig. 6, §V-A, §V-A, §V-B, TABLE II.
  • [16] J. Jiao, Y. Cao, Y. Song, and R. Lau (2018) Look deeper into depth: monocular depth estimation with semantic booster and attention-driven loss. In The European Conference on Computer Vision (ECCV), Cited by: §I.
  • [17] T. Koch, L. Liebel, F. Fraundorfer, and M. Körner (2019) Evaluation of cnn-based single-image depth estimation methods. In European Conference on Computer Vision Workshop (ECCV Workshop), Cited by: §V-A.
  • [18] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), Cited by: §I, §II-A, TABLE I, §V-A, TABLE II.
  • [19] J. Lee and C. Kim (2019) Monocular depth estimation using relative depth maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE I, Fig. 6, §V-B, TABLE II.
  • [20] J. H. Lee, M. Han, D. W. Ko, and I. H. Suh (2020) From big to small: multi-scale local planar guidance for monocular depth estimation. ArXiv abs/1907.10326. Cited by: §IV-C1, TABLE I.
  • [21] Y. Li, Y. Ushiku, and T. Harada (2019) Pose graph optimization for unsupervised monocular visual odometry. In International Conference on Robotics and Automation (ICRA), Cited by: §I.
  • [22] M. Liu, M. Salzmann, and X. He (2014) Discrete-continuous depth estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [23] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai (2020) TANet: robust 3d object detection from point clouds with triple attention. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §VI.
  • [24] R. Lu, F. Xue, M. Zhou, A. Ming, and Y. Zhou (2019) Occlusion-shared and feature-separated network for occlusion relationship reasoning. In IEEE International Conference on Computer Vision (ICCV), Cited by: §II-C.
  • [25] X. Luo, J. Huang, R. Szeliski, K. Matzen, and J. Kopf (2020) Consistent video depth estimation. ACM Transactions on Graphics (TOG) 39 (4). Cited by: §I.
  • [26] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia (2018) J-mod: joint monocular obstacle detection and depth estimation. IEEE Robotics and Automation Letters (RA-L) 3 (3), pp. 1490–1497. Cited by: §I.
  • [27] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, and I. Reid (2019) Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §I, §II-B, TABLE I, Fig. 6, §V-B, §V-B, §V-B, TABLE II.
  • [28] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun (2020) Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §I.
  • [29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §V-A.
  • [30] A. Saxena, M. Sun, and A. Y. Ng (2009) Make3D: learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31 (5), pp. 824–840. Cited by: §II-A.
  • [31] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In The European Conference on Computer Vision (ECCV), Cited by: §V-A.
  • [32] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), Cited by: §IV-A.
  • [33] K. Tateno, F. Tombari, I. Laina, and N. Navab (2017) CNN-SLAM: real-time dense monocular slam with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [34] L. Tiwari, P. Ji, Q. H. Tran, B. Zhuang, and M. Chandraker (2020) Pseudo rgb-d for self-improving monocular slam and depth prediction. In European Conference on Computer Vision (ECCV), Cited by: §I.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Neural Information Processing Systems (NIPS), Cited by: §II-C.
  • [36] X. Weng and K. Kitani (2019) Monocular 3d object detection with pseudo-lidar point cloud. In IEEE International Conference on Computer Vision Workshop (ICCVW), Cited by: §I.
  • [37] D. Wofk, F. Ma, T. Yang, S. Karaman, and V. Sze (2019) FastDepth: fast monocular depth estimation on embedded systems. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §I, §II-B, TABLE I, Fig. 6, §V-B, §V-B, §V-B, TABLE II.
  • [38] L. Xie, S. Wang, A. Markham, and N. Trigoni (2017) Towards monocular vision based obstacle avoidance through deep reinforcement learning. In Robotics: Science and Systems Workshop (RSS Workshop), Cited by: §I.
  • [39] F. Xue, J. Cao, Y. Zhou, F. Sheng, Y. Wang, and A. Ming (2021) Boundary-induced and scene-aggregated network for monocular depth prediction. Pattern Recognition (PR) 115, pp. 107901. Cited by: §II-A.
  • [40] F. Xue, A. Ming, M. Zhou, and Y. Zhou (2019) A novel multi-layer framework for tiny obstacle discovery. In International Conference on Robotics and Automation (ICRA), Cited by: §VI.
  • [41] F. Xue, A. Ming, and Y. Zhou (2020) Tiny obstacle discovery by occlusion-aware multilayer regression. IEEE Transactions on Image Processing (TIP) 29, pp. 9373–9386. Cited by: §I.
  • [42] W. Yin, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV), Cited by: Fig. 2, §II-A, TABLE I, §V-A, §V-B, TABLE II.
  • [43] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-C, §V-C, TABLE IV.
  • [44] M. Zhou, J. Ma, A. Ming, and Y. Zhou (2018) Objectness-aware tracking via double-layer model. In IEEE International Conference on Image Processing (ICIP), Cited by: §II-C.
  • [45] Y. Zhou, X. Bai, W. Liu, and L. Latecki (2012) Fusion with diffusion for robust visual tracking. In Advances in Neural Information Processing Systems (NIPS), Cited by: §II-C.
  • [46] Y. Zhou, X. Bai, W. Liu, and L. J. Latecki. (2016) Similarity fusion for visual tracking. International Journal of Computer Vision (IJCV) 118 (3), pp. 337–363. Cited by: §II-C.
  • [46] Y. Zhou, X. Bai, W. Liu, and L. J. Latecki (2016) Similarity fusion for visual tracking. International Journal of Computer Vision (IJCV) 118 (3), pp. 337–363. Cited by: §II-C.