Lightweight Monocular Depth Estimation through Guided Decoding

by Michael Rudolph, et al.
Universität Duisburg-Essen

We present a lightweight encoder-decoder architecture for monocular depth estimation, specifically designed for embedded platforms. Our main contribution is the Guided Upsampling Block (GUB) for building the decoder of our model. Motivated by the concept of guided image filtering, GUB relies on the image to guide the decoder on upsampling the feature representation and the depth map reconstruction, achieving high resolution results with fine-grained details. Based on multiple GUBs, our model outperforms the related methods on the NYU Depth V2 dataset in terms of accuracy while delivering up to 35.1 fps on the NVIDIA Jetson Nano and up to 144.5 fps on the NVIDIA Xavier NX. Similarly, on the KITTI dataset, inference is possible with up to 23.7 fps on the Jetson Nano and 102.9 fps on the Xavier NX. Our code and models are made publicly available.




I Introduction

Depth estimation is a major perception component of robotics systems, which can also be combined with downstream vision tasks [17, 31, 41]. For instance, self-localization [11], visual odometry [25] or object detection [26] often rely on depth information in the context of automated driving or robot navigation. While RGB-D and stereo cameras, as well as laser and radar sensors, provide accurate depth estimates, they can be costly and often only deliver sparse depth information. In contrast, monocular depth estimation is an inexpensive and easily deployable approach that has recently matured thanks to advances in deep learning.

Fig. 1: Our proposed architecture outperforms related monocular depth estimation methods with respect to accuracy on KITTI [14] and NYU Depth V2 [37], while delivering competitive inference speed on the NVIDIA Jetson Nano and NVIDIA Jetson Xavier NX devices compared to FastDepth [42] and TuMDE [39], which were re-trained and evaluated using our described procedure. In addition, our extra-light model (Ours-S) reaches solid performance at a high frame rate.

A plethora of monocular depth estimation approaches has been proposed based on convolutional neural networks [27, 2, 28] and different types of supervision [15, 16, 13]. The majority of existing approaches, though, target accuracy over real-time performance [4, 38, 35]. Consequently, these approaches cannot deliver real-time execution on devices with constrained resources.

Efficient approaches [42, 39] therefore often employ hardware-specific compilation [7] and model compression [43, 20, 3] to achieve higher throughput. Additionally, only low resolution inputs and outputs are considered, resulting in the loss of fine-grained details and blurred edges in the predicted depth maps.

In this paper, we present a lightweight encoder-decoder architecture for monocular depth estimation. Our motivation is delivering real-time performance on embedded systems without the necessity to compress the model or rely on specific hardware compilation. To reconstruct high resolution depth maps, we propose the Guided Upsampling Block (GUB) for designing the decoder while relying on a standard encoder from the literature. Inspired by guided image filtering [18, 30], the proposed GUB makes use of the input image at different resolutions to guide the decoder on upsampling the feature representation as well as for the final depth map reconstruction. By stacking multiple GUBs in sequence, we build a cost-efficient decoder, allowing to infer detailed depth maps on constrained hardware. Moreover, our evaluation shows that we reach a good balance between accuracy and speed when comparing with the related work on KITTI [14] and NYU Depth V2 [37] datasets.

To sum up, we propose the Guided Upsampling Block (GUB) for building the decoder of a convolutional neural network for monocular depth estimation. Our approach outperforms the related methods that target embedded systems in terms of accuracy on both the KITTI and NYU Depth V2 benchmarks, while achieving inference speed suitable for real-time applications, as illustrated in Fig. 1.


II Related Work

The problem of monocular depth estimation with deep neural networks has been well studied over the past few years. At first, Eigen et al. [10] proposed to use a coarse-scale network, relying on fully connected layers to achieve a global receptive field, in order to give a coarse depth prediction, which is then refined locally by a fine-scale network. In contrast, by using a pre-trained ResNet [19] model as feature extractor, Laina et al. [27] proposed a fully convolutional encoder-decoder architecture for depth estimation. Based on this approach, recent methods heavily rely on convolutional neural networks by using pre-trained, general-purpose backbones as encoder and focusing on decoder design [2, 12, 28, 38].
To increase the receptive field, Atrous Spatial Pyramid Pooling (ASPP) [6] modules from semantic segmentation are leveraged in recent architectures [12, 28, 38] and skip-connections from the encoder to the decoder are used to improve optimization [2, 28, 38]. Song et al. [38] achieve state-of-the-art performance based on images from the Laplacian pyramid as guidance in the decoder. More recently, Ranftl et al. [35] show that self-attention based architectures like vision transformer [9] are capable of outperforming convolutional neural networks when provided with enough training data. Apart from architectural novelties, advances have been made to allow training with fewer data and computationally efficient augmentations [2], as well as using additional supervision through virtual normals [44].
As the above-discussed approaches are generally too complex to deploy on embedded hardware, Wofk et al. [42] proposed the usage of a pre-trained MobileNetV2 encoder in combination with a lightweight decoder leveraging depth-wise separable convolutions. Tu et al. [39] rely on a similar architecture to allow deployment to the NVIDIA Jetson Nano, but further simplify the decoder. Poggi et al. [34] proposed a hand-designed custom architecture to allow inference on the CPU of the Raspberry Pi 3, while Aleotti et al. [1] investigate the deployment of the model on mobile phones. To increase throughput, input resizing is performed through center-cropping in [34], [42] and [39], which however results in evaluating on self-defined crops, and therefore in inconsistencies when comparing performance. To further increase inference speed, models are compiled with TVM [7] in [42] and [39]. Additionally, pruning with the NetAdapt [43] framework is used by Wofk et al. [42] to reduce the model size, while Tu et al. [39] reduce model size with the help of a reinforcement learning algorithm [20]. While these post-processing steps greatly increase the throughput of lightweight models, they make deployment complex and time-consuming. We aim at closing this gap by proposing a method that achieves fast and precise inference while being easy to deploy, using only TensorRT [32] for inference. To allow a fair comparison when performing inference at lower resolution, we bilinearly upscale the prediction to enable evaluation on the original evaluation crop of the respective dataset.

It is worthwhile to investigate efficient semantic segmentation methods, as these approaches also produce pixel-wise predictions, but often rely on bilinear interpolation to replace decoders [45, 21]. Simply applying these approaches to monocular depth estimation results in blurred depth maps of insufficient quality; this stresses the importance of the decoder for monocular depth estimation. Our decoder design is inspired by guided image filtering networks [30], combined with the design principles of standard decoders. By using image guidance in the decoder, we are able to reconstruct fine-grained details in the prediction. Additionally, this allows us to reduce the size of feature maps in the decoder, lowering the computational complexity. By stacking multiple blocks sequentially, we can also reduce the size of convolutional kernels while maintaining a similar receptive field, further reducing complexity.

III Method

Assume the availability of the training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an RGB image and $y_i$ is the corresponding real-valued ground-truth depth map with the same resolution as the image. Our objective is to utilize $\mathcal{D}$ for learning the image-to-depth mapping with the deep neural network $f_{\theta}$, parametrized by $\theta$. Furthermore, the neural network is decomposed into the encoder $E$ and the decoder $D$ parts, defined as:

$f_{\theta}(x) = D(E(x)),$    (1)

where the encoder $E$ maps the input image to the latent space and the decoder $D$ reconstructs the depth map from it. In this work, our contribution is the decoder part of the network, while we rely on a standard architecture for the encoder. We propose the Guided Upsampling Block for the decoder to enable high-quality depth map reconstruction at a lower computational cost.

Fig. 2: GuideDepth architecture, consisting of the DDRNet-23-slim encoder and a novel decoder built from three Guided Upsampling Blocks (GUB). Up denotes bilinear interpolation.

III-A Guided Upsampling Block

Our Guided Upsampling Block (GUB) is motivated by the idea of guided image filtering [18] and the more recent idea of learning these filters with convolutional neural networks [30]. In guided image filtering, a guidance image helps to enhance the degraded target image, where degradation can occur, for example, due to the low spatial resolution of the target image. In our context, we propose to use the input image of the model at different resolutions for guiding the feature upsampling in the decoder part (see Fig. 2). A GUB takes as input the image and the decoder’s feature representation to deliver the upsampled feature representation. We define it as

$F_{s/2} = \mathrm{GUB}(F_s, G_{s/2}),$    (2)

where $F_s$ refers to feature maps and $G_s$ refers to guidance images. Their spatial resolution is indexed by $s$, being $1/s$ of the input resolution. Moreover, we design the decoder with several GUBs to progressively upsample the feature representation of each decoder layer and finally the reconstructed depth map, as shown in Fig. 2.

The GUB operations are illustrated in Fig. 3. A GUB upsamples a feature map by scale factor 2 while reconstructing fine-grained details from the guidance image. The fundamental block operation is the stacked convolution $C$ that consists of a convolution with kernel size 3, a batch normalization layer and a ReLU activation, followed by the same operations with a kernel size of 1, as illustrated in Fig. 3 (b). We rely on the stacked convolution operation to collect three types of features. At first, it is used to extract features from the guidance image $G_{s/2}$, defined as:

$\hat{G}_{s/2} = C(G_{s/2}),$    (3)

where $C$ denotes the stacked convolution. These features guide the upsampling step. Second, the feature representation $F_s$ of the encoder is upsampled by a factor of 2 through bilinear interpolation $\mathrm{Up}$. Next, another stacked convolution is used to refine the upsampled representation. These operations are represented as:

$\tilde{F}_{s/2} = \mathrm{Up}(F_s), \quad \hat{F}_{s/2} = C(\tilde{F}_{s/2}).$    (4)

Finally, the feature tensors $\hat{G}_{s/2}$ and $\hat{F}_{s/2}$ are concatenated to create a joint representation, as shown in Fig. 3. To attend to the most important feature channels, we apply the Squeeze-and-Excitation module [23] $\mathrm{SE}$, followed by the last stacked convolution operation $C$. We describe these operations as:

$R_{s/2} = C(\mathrm{SE}([\hat{G}_{s/2}, \hat{F}_{s/2}])),$    (5)

where we consider $R_{s/2}$ as the correction term for the upsampled feature representation $\tilde{F}_{s/2}$. For that reason, we add $R_{s/2}$ to $\tilde{F}_{s/2}$, as we illustrate in Fig. 3 (a). At last, we apply a convolution to reduce the number of feature maps.

(a) Guided Upsampling Block
(b) Stacked Convolution
Fig. 3: The Guided Upsampling Block (GUB) in (a) uses the input image as guidance to upsample the feature representation by a factor of 2, resulting in $F_{s/2}$. The fundamental block operation is the stacked convolution $C$, shown in (b). Up denotes bilinear interpolation by factor 2; SE represents the Squeeze-and-Excitation module [23].
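The block described above can be sketched in PyTorch, the framework the paper is implemented in. This is a hedged reconstruction from the text and Fig. 3, not the authors' released code: the channel widths, the SE reduction ratio, and the default output-channel choice are assumptions.

```python
# Sketch of the Guided Upsampling Block (GUB); layer widths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedConv(nn.Sequential):
    """3x3 conv + BN + ReLU followed by 1x1 conv + BN + ReLU (Fig. 3b)."""
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention [23]; ratio r is assumed."""
    def __init__(self, ch, r=16):
        super().__init__()
        hidden = max(ch // r, 1)
        self.fc = nn.Sequential(nn.Linear(ch, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, ch), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze over spatial dims
        return x * w[:, :, None, None]           # re-weight channels

class GUB(nn.Module):
    def __init__(self, feat_ch, guide_ch=3, out_ch=None):
        super().__init__()
        out_ch = out_ch or feat_ch // 2
        self.guide_conv = StackedConv(guide_ch, feat_ch)   # features from guidance image
        self.refine_conv = StackedConv(feat_ch, feat_ch)   # refine upsampled features
        self.se = SEBlock(2 * feat_ch)
        self.fuse_conv = StackedConv(2 * feat_ch, feat_ch)
        self.reduce = nn.Conv2d(feat_ch, out_ch, 1)        # reduce feature maps

    def forward(self, feat, guide):
        # guide is the input image resized to 2x the feature resolution
        up = F.interpolate(feat, scale_factor=2, mode='bilinear',
                           align_corners=False)
        g = self.guide_conv(guide)
        r = self.refine_conv(up)
        corr = self.fuse_conv(self.se(torch.cat([g, r], dim=1)))  # correction term
        return self.reduce(up + corr)            # add correction, then reduce channels
```

A usage example: a 16-channel map at 8x8 with a 3-channel guidance image at 16x16 yields an 8-channel map at 16x16.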

III-B Complete Encoder-Decoder

As defined in Eq. 1, our model consists of the encoder $E$ and the decoder $D$. As encoder, we choose DDRNet-23-slim [21], pre-trained on the ImageNet database [8]. We pick this semantic segmentation architecture since it is designed for fast inference. To use DDRNet as encoder, we change the final layer to predict feature maps and then stack three GUBs as shown in Fig. 2. We downsample the input image for guidance in the first two GUBs, while the last one receives the input image at its original resolution. Furthermore, the last GUB directly reconstructs the depth map instead of another feature representation. We refer to this architecture as GuideDepth.

III-C Network Training

We follow the training procedure proposed by Alhashim and Wonka [2]: let $\hat{y}$ be the depth prediction of the model derived from an input RGB image $x$, and $y$ the ground-truth depth map. Then, the objective function is given by:

$L(y, \hat{y}) = L_{SSIM}(y, \hat{y}) + L_{grad}(y, \hat{y}) + \lambda L_{depth}(y, \hat{y}),$    (6)

where the structural dissimilarity loss $L_{SSIM}(y, \hat{y}) = (1 - \mathrm{SSIM}(y, \hat{y}))/2$ enforces the model to predict depth maps that are perceived similarly to the ground truth by using the inverse of the Structural Similarity (SSIM) [40]. The gradient loss $L_{grad}(y, \hat{y})$ helps the model to learn edges by comparing the pixel-wise partial derivatives $\partial_x y$ and $\partial_y y$ of the ground truth in $x$ and $y$ direction with the partial derivatives $\partial_x \hat{y}$ and $\partial_y \hat{y}$ of the prediction. $L_{depth}(y, \hat{y}) = \frac{1}{n}\sum_{p=1}^{n} |y_p - \hat{y}_p|$ increases the overall accuracy of the model through pixel-wise supervision, where $n$ denotes the total number of pixels in a depth map. Moreover, the weighting term $\lambda$ balances out the magnitude of the $L_{depth}$ term, as proposed by Alhashim and Wonka [2]. To optimize the objective function, we rely on backpropagation and gradient descent.
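The objective above can be sketched in PyTorch. This is an illustrative approximation, not the released training code: the SSIM here is a simplified global variant without the usual sliding window, and the weighting value `lam=0.1` follows Alhashim and Wonka's setting as an assumption.

```python
# Sketch of the three-term depth loss; global SSIM is a simplification.
import torch

def ssim_global(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    # simplified SSIM computed over the whole map instead of a sliding window
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p = pred.var(unbiased=False)
    var_g = gt.var(unbiased=False)
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    return (((2 * mu_p * mu_g + c1) * (2 * cov + c2))
            / ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)))

def depth_loss(pred, gt, lam=0.1):
    l_depth = (pred - gt).abs().mean()            # point-wise L1 term
    l_grad = ((pred.diff(dim=-1) - gt.diff(dim=-1)).abs().mean()
              + (pred.diff(dim=-2) - gt.diff(dim=-2)).abs().mean())
    l_ssim = (1 - ssim_global(pred, gt)) / 2      # structural dissimilarity
    return l_ssim + l_grad + lam * l_depth
```

For identical prediction and ground truth, every term vanishes; any deviation increases the loss.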

| Method | Resolution | Params [M] | MACs [G] | RMSE | rel | log10 | δ1 | δ2 | δ3 | fps (Nano) | fps (NX) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TuMDE† [39] | 192×640 | 5.7 | 2.19 | 5.801 | 0.150 | 0.068 | 0.760 | 0.930 | 0.980 | 10.5 | 75.7 |
| FastDepth† [42] | 192×640 | 3.9 | 1.82 | 5.839 | 0.168 | 0.070 | 0.752 | 0.927 | 0.977 | 15.0 | 101.4 |
| GuideDepth (Ours) | 192×640 | 5.8 | 4.19 | 5.194 | 0.142 | 0.061 | 0.799 | 0.941 | 0.982 | 14.9 | 69.3 |
| GuideDepth-S (Ours) | 192×640 | 5.7 | 2.41 | 5.480 | 0.142 | 0.063 | 0.784 | 0.936 | 0.981 | 23.7 | 102.9 |
| GuideDepth (Ours) | 384×1280 | 5.8 | 16.75 | 4.956 | 0.133 | 0.056 | 0.819 | 0.952 | 0.986 | n/a | 20.5 |
| GuideDepth-S (Ours) | 384×1280 | 5.7 | 9.64 | 4.934 | 0.133 | 0.056 | 0.820 | 0.952 | 0.986 | n/a | 33.0 |

TABLE I: Model performance on KITTI [14], following the evaluation procedure described in Section IV. Our models GuideDepth and GuideDepth-S outperform related architectures with respect to accuracy, while having reasonably high throughput for real-time applications. Note that for the resolution 384×1280, performance is only reported for the Xavier NX, since the Nano cannot handle such a resolution. (†: retrained and evaluated with our procedure.)

IV Evaluation

We evaluate our approach on two standard benchmarks for monocular depth estimation. Moreover, we perform experiments on different embedded platforms and report results on different ablation studies. In addition to the GuideDepth architecture from Fig. 2, we also evaluate a smaller model where the number of feature maps in the decoder is half of the original version. We refer to it as GuideDepth-S.

Implementation & Training

Our method is implemented in PyTorch [33]. We use the Adam [24] optimizer and a batch size of 8 images. We train for 20 epochs and reduce the learning rate by a factor of 10 after 15 epochs. Additionally, we use data augmentation during training, following the same protocol as Alhashim and Wonka [2], where random horizontal flips and random colour channel swaps are applied. Finally, the inverse depth norm $y = m / y_{depth}$ is used to bring depth values to a predefined range, where $m$ is the maximum depth value in the dataset.
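As a minimal sketch, the inverse depth norm can be written as follows. The formulation follows Alhashim and Wonka [2]; the default maximum depth of 10 m (common for NYU Depth V2) is an assumption here.

```python
# Sketch of the inverse depth norm used for the training targets.
def inverse_depth_norm(depth, max_depth=10.0):
    """Map metric depth in (0, max_depth] to a target via y = m / depth."""
    return max_depth / depth
```

For example, a pixel at the maximum depth maps to 1, and nearer pixels map to larger target values.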

Hardware Platforms

Our proposed architecture targets inference on embedded devices. Hence, we evaluate the model performance on two different NVIDIA Jetson Single Board Computers (SBCs): (1) the Jetson Nano and (2) the Jetson Xavier NX. Both boards have similar dimensions but differ significantly in performance. The Jetson Nano employs a quad-core Arm Cortex-A57 processor, a 128-core NVIDIA Maxwell GPU and 4 GB of RAM. In contrast, the Jetson Xavier NX uses a 6-core NVIDIA Carmel Arm processor, a 384-core NVIDIA Volta GPU and 8 GB of RAM. The results are reported at the 10 W power mode for the Jetson Nano and 15 W for the Jetson Xavier NX.


Datasets

We evaluate on the NYU Depth V2 [37] and the KITTI [14] datasets. NYU Depth V2 [37] covers indoor scenes recorded with a Microsoft Kinect camera at a resolution of 640×480 pixels, with densely annotated ground-truth depth values. For training, the reduced dataset proposed by Alhashim and Wonka [2] is used, consisting of 50,688 images. The KITTI dataset [14] contains 23,158 images from outdoor street scenes with sparse ground-truth depth data captured by a LiDAR sensor. To provide dense ground-truth depth maps, we rely on the colorization method proposed by Levin et al. [29].

Evaluation Metrics

We make use of standard metrics from the literature [10]. Let $y_p$ be a pixel of the ground-truth depth map $y$, $\hat{y}_p$ the corresponding pixel in the prediction $\hat{y}$, and $n$ the total number of pixels. Then we rely on the following metrics: the root mean square error $RMSE = \sqrt{\frac{1}{n}\sum_p (y_p - \hat{y}_p)^2}$, the relative absolute error $rel = \frac{1}{n}\sum_p |y_p - \hat{y}_p| / y_p$, the average $\log_{10}$ error $\frac{1}{n}\sum_p |\log_{10} y_p - \log_{10} \hat{y}_p|$, and the threshold accuracy $\delta_i$: the percentage of pixels with $\max(y_p / \hat{y}_p, \hat{y}_p / y_p) < 1.25^i$ for $i \in \{1, 2, 3\}$. All results are reported on models converted to TensorRT [32] with quantization to float16. Inference times are averaged over 200 samples.
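Written out in NumPy, these metrics read as follows. `depth_metrics` is a hypothetical helper name, and masking of invalid pixels is omitted for brevity.

```python
# Standard monocular depth metrics from Eigen et al. [10], for one depth map.
import numpy as np

def depth_metrics(gt, pred):
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rel = np.mean(np.abs(gt - pred) / gt)
    log10 = np.mean(np.abs(np.log10(gt) - np.log10(pred)))
    ratio = np.maximum(gt / pred, pred / gt)
    deltas = tuple(np.mean(ratio < 1.25 ** i) for i in (1, 2, 3))
    return rmse, rel, log10, deltas
```

A perfect prediction yields zero error and threshold accuracies of 1.0 at all three levels.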

Evaluation Protocol

SOTA real-time methods such as FastDepth [42] and Tu et al. [39] evaluate their network performance on center-cropped input images to lower the input resolution of their models. However, this harms comparability, as the areas of the prediction differ depending on the crop, effectively only allowing comparison against methods using the exact same crops. We aim at using an evaluation protocol that produces comparable results for models with differing resolutions, reinforcing the need to balance inference speed and accuracy when reducing resolution.
For that purpose, we use the test splits defined by Eigen et al. [10] for NYU Depth V2 [37] and KITTI [14]. For evaluation (and training), the input image is resized to the desired resolution of the model through bilinear interpolation. This allows producing a low-resolution prediction covering the full area of the ground-truth depth map. For the computation of the evaluation metrics, the prediction is upsampled to the resolution of the ground-truth depth map. To account for asymmetries, we report the average evaluation result of the test set and the vertically flipped test set, as proposed by [2]. The test crops by Eigen et al. [10] are then used for calculating the evaluation metrics on both predictions at the resolution of the dataset. For NYU Depth V2, the cropped depth map is considered to exclude noisy ground-truth values at the borders of the images. For KITTI, the crop excludes areas with a low presence of LiDAR points. As the images in KITTI differ slightly in resolution, the crop is recalculated relative to each image's size.
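The protocol can be sketched as follows. Nearest-neighbour upsampling stands in for the bilinear interpolation to keep the sketch dependency-free, and `evaluate` is a hypothetical helper illustrating the flip-averaged reporting with RMSE as the example metric.

```python
# Sketch of the evaluation protocol: upsample low-resolution predictions to
# ground-truth resolution, then average the metric over the original and the
# flipped test set.
import numpy as np

def upsample(pred, factor):
    # nearest-neighbour stand-in for bilinear upsampling to GT resolution
    return pred.repeat(factor, axis=0).repeat(factor, axis=1)

def rmse(gt, pred):
    return float(np.sqrt(np.mean((gt - pred) ** 2)))

def evaluate(preds, preds_flipped, gts, factor):
    """Average the metric over the original and the flipped test set."""
    orig = np.mean([rmse(g, upsample(p, factor)) for g, p in zip(gts, preds)])
    flip = np.mean([rmse(g, upsample(p, factor))
                    for g, p in zip(gts, preds_flipped)])
    return 0.5 * (orig + flip)
```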

IV-A NYU Depth V2 Results

Considering the NYU Depth V2 results, GuideDepth and GuideDepth-S outperform the related work targeting embedded systems in terms of RMSE and accuracy for all experiments. We retrained FastDepth [42] and the model proposed by Tu et al. [39], to which we refer as TuMDE, with the same training and evaluation procedure as described in Sec. IV to provide a direct comparison between their architectures and our model. GuideDepth clearly outperforms the architecture of FastDepth with respect to accuracy, while achieving comparable throughput.
Additionally, the first three columns in Table I showcase the inference speeds reported by Wofk et al. [42] and Tu et al. [39], indicating an approximate speed-up by a factor of 2-3 through their compression techniques. However, these models were evaluated on the previously mentioned center-crops, meaning that accuracy and RMSE values are not directly comparable to the further results. In Fig. 4, we present a visual comparison of the results.

(a) RGB
(b) Ground Truth
(c) FastDepth [42]
(d) TuMDE [39]
(e) Ours
Fig. 4: The qualitative results on NYU Depth V2 [37] show that our predicted depth map is of significantly greater detail. Related methods are retrained and evaluated using the procedure described in Section IV.
(a) RGB
(b) Ground Truth
(c) FastDepth [42]
(d) TuMDE [39]
(e) Ours
Fig. 5: The qualitative results on KITTI [14] show that our predicted depth map is of significantly greater detail, especially when inspecting the pedestrians. For related methods, we show results of the retrained models.

IV-B KITTI Results

For the evaluation on KITTI, we retrained TuMDE [39] and FastDepth [42] with the same training and evaluation procedure as described in Section IV. Note that we did not apply post-processing steps such as model pruning and only converted the models to TensorRT for inference. This results in a decreased throughput compared to the results reported by Wofk et al. [42] and Tu et al. [39]. Furthermore, using our evaluation protocol described in Section IV makes the benchmark more challenging for models operating on low resolution, resulting in higher errors.
In this experiment, our proposed methods GuideDepth and GuideDepth-S outperform the related methods in terms of RMSE and accuracy for all experiments. Additionally, when solely comparing the architectures without considering post-processing, GuideDepth-S outperforms FastDepth on inference speed. Considering the results at the resolution of 384×1280, GuideDepth-S surprisingly achieves accuracy similar to the standard model. Note that we only report results on the Jetson Xavier NX for this resolution, since the computational requirements exceed the capabilities of the Jetson Nano. In Fig. 5, we compare the visual results of related architectures with our model, showing the improvements gained by our approach. Especially the model’s capability to capture pedestrians in the depth prediction stands out when comparing to the results of related architectures.

IV-C Ablation Studies

We perform various ablation studies on the NYU Depth V2 [37] dataset based on the previously described evaluation and training procedures. The results are presented in Table II and Table III.

IV-C1 Guidance

This experiment investigates the importance of the guidance image and the corresponding processing branch. In particular, we compare the usage of the guidance image in the Guided Upsampling Block as proposed in the method section against directly concatenating the guidance image with the feature maps, without the preceding feature extraction branch (Direct guidance in Table II). Inspired by Song et al. [38], who relied on images from the Laplacian pyramid in their decoder architecture, Laplacian images are also used as guidance in our second experiment. To increase computational efficiency, a Laplacian image is generated by subtracting an upsampled copy of the image from the next-coarser scale from the image interpolated to the target scale. Additionally, a model without a guidance image is included in the comparison.
Investigating the results, the usage of Laplacian images does not introduce any benefits in terms of prediction quality. This could be because the GUB learns to extract diverse, low-level features from the rich image, while the Laplacian image already contains reduced information in the form of edges.
Concatenating the guidance image directly with the feature maps results in faster inference and, at first glance, only shows a minor difference in accuracy and error when compared to the significantly slower GUB. However, we notice that the visual quality of depth predictions when using the GUB is noticeably better, justifying our choice of the GUB over direct image guidance.
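A sketch of this efficient Laplacian generation, under the assumption that the Laplacian image at scale $s$ is the difference between the image at $s$ and an upsampled copy from the next-coarser scale; strided subsampling and nearest-neighbour upsampling stand in for bilinear interpolation here.

```python
# Sketch of Laplacian guidance-image generation (assumed formulation).
import numpy as np

def downsample(img, factor):
    # strided subsampling as a stand-in for bilinear downscaling
    return img[::factor, ::factor]

def upsample(img, factor):
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def laplacian_guidance(img, scale):
    # L_s = I_s - Up(I_{2s}): image at the target scale minus an
    # upsampled copy from the next-coarser scale
    i_s = downsample(img, scale)
    i_coarse = downsample(img, scale * 2)
    return i_s - upsample(i_coarse, 2)
```

On a constant image the Laplacian vanishes, leaving only edge-like structure on real images.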

| Guidance | Branch | Inference [ms] | RMSE | δ1 | δ2 | δ3 |
|---|---|---|---|---|---|---|
| Image | GUB | 44 | 0.501 | 0.823 | 0.961 | 0.990 |
| Image | Direct | 37 | 0.508 | 0.821 | 0.961 | 0.990 |
| Laplacian | GUB | 44 | 0.508 | 0.819 | 0.961 | 0.990 |
| Laplacian | Direct | 38 | 0.509 | 0.818 | 0.962 | 0.990 |
| None | None | 34 | 0.510 | 0.821 | 0.961 | 0.990 |

TABLE II: Ablation study on different types of guidance images (RGB image and Laplacian image) as well as modifications to the GUB architecture. The last row shows results when using no guidance image.

IV-C2 Encoder

Since the decoder was designed to be combined with Hong et al.'s DDRNet [21] encoder, we investigate the impact of employing different encoders. General-purpose backbones like MobileNetV2 [36] and HarDNet [5] usually extract features at lower resolution with a higher number of channels. Therefore, we upsample these feature maps with two stages of the decoder from FastDepth [42], as the information from the guidance images of our decoder is not useful at this resolution. Then, our decoder is attached as described in the method section.
Furthermore, we can compare with FastDepth [42], as it is based on MobileNetV2 [36] for the encoder. Comparing the results in Table III, we note that our decoder improves the RMSE and accuracy while marginally impacting throughput. HarDNet-39-DS increases the prediction accuracy even further at similar throughput. The usage of DDRNet-23-slim, finally, reaches the desired accuracy while additionally reducing the inference time, making it the most powerful encoder for the model. Interestingly, DDRNet-23-slim allows faster inference than the other encoders despite having more parameters and MACs.

| Encoder | Params [M] | MACs [G] | Inference [ms] | RMSE | δ1 |
|---|---|---|---|---|---|
| FastDepth [42] | 3.9 | 1.20 | 45 | 0.576 | 0.777 |
| MobileNetV2 [36] | 4.7 | 2.27 | 53 | 0.547 | 0.790 |
| HarDNet39-DS [5] | 2.9 | 2.09 | 52 | 0.516 | 0.811 |
| DDRNet-23-slim [21] | 5.8 | 2.63 | 44 | 0.501 | 0.823 |

TABLE III: Comparison of different encoders combined with our decoder. Using MobileNetV2 [36] allows direct comparison between our decoder and the retrained FastDepth architecture [42].

V Conclusions

We presented an encoder-decoder architecture for monocular depth estimation, targeted at robotics systems with constrained resources. Our approach focused on accelerating the decoding part while relying on a standard encoder. We proposed the Guided Upsampling Block (GUB) for guiding the decoder on upsampling the feature representation and the final depth map, allowing us to achieve more detailed, high-resolution depth predictions. By stacking multiple GUBs together, we designed a cost-efficient decoder that strikes a good balance between accuracy and speed when compared with related architectures on the KITTI and NYU Depth V2 datasets. For future work, we aim to examine our model's capabilities when training directly on the embedded platform, similar to domain adaptation on resource-constrained hardware [22].


The authors acknowledge support by the state of Baden-Württemberg through bwHPC.


  • [1] F. Aleotti, G. Zaccaroni, L. Bartolomei, M. Poggi, F. Tosi, and S. Mattoccia (2021) Real-time single image depth perception in the wild with handheld devices. Sensors 21 (1), pp. 15. Cited by: §II.
  • [2] I. Alhashim and P. Wonka (2018) High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941. Cited by: §I, §II, §III-C, §IV, §IV, §IV.
  • [3] V. Belagiannis, A. Farshad, and F. Galasso (2018) Adversarial network compression. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 431–449. Cited by: §I.
  • [4] S. F. Bhat, I. Alhashim, and P. Wonka (2021) AdaBins: depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018. Cited by: §I.
  • [5] P. Chao, C. Kao, Y. Ruan, C. Huang, and Y. Lin (2019) Hardnet: a low memory traffic network. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3552–3561. Cited by: §IV-C2, TABLE III.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §II.
  • [7] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. (2018) tvm: An automated end-to-end optimizing compiler for deep learning. pp. 578–594. Cited by: §I, §II.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §III-B.
  • [9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. Cited by: §II.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. Vol. 27. Cited by: §II, §IV, §IV.
  • [11] N. Engel, S. Hoermann, M. Horn, V. Belagiannis, and K. Dietmayer (2019) Deeplocalization: landmark-based self-localization with deep neural networks. In Intelligent Transportation Systems Conference (ITSC), pp. 926–933. Cited by: §I.
  • [12] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2002–2011. Cited by: §II.
  • [13] R. Garg, V. K. Bg, G. Carneiro, and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European conference on computer vision, pp. 740–756. Cited by: §I.
  • [14] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: Fig. 1, §I, TABLE I, Fig. 5, §IV, §IV.
  • [15] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279. Cited by: §I.
  • [16] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838. Cited by: §I.
  • [17] R. Güldenring, E. Boukas, O. Ravn, and L. Nalpantidis (2021) Few-leaf learning: weed segmentation in grasslands. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §I.
  • [18] K. He, J. Sun, and X. Tang (2010) Guided image filtering. In European conference on computer vision, pp. 1–14. Cited by: §I, §III-A.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §II.
  • [20] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pp. 784–800. Cited by: §I, §II.
  • [21] Y. Hong, H. Pan, W. Sun, and Y. Jia (2021) Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085. Cited by: §II, §III-B, §IV-C2, TABLE III.
  • [22] J. Hornauer, L. Nalpantidis, and V. Belagiannis (2021) Visual domain adaptation for monocular depth estimation on resource-constrained hardware. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 954–962. Cited by: §V.
  • [23] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: Fig. 3, §III-A.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV.
  • [25] I. Kostavelis, E. Boukas, L. Nalpantidis, and A. Gasteratos (2016) Stereo based visual odometry for autonomous robot navigation. International Journal of Advanced Robotic Systems 13 (1). Cited by: §I.
  • [26] B. Kovács, A. D. Henriksen, J. D. Stets, and L. Nalpantidis (2021) Object detection on TPU accelerated embedded devices. In International Conference of Computer Vision Systems (ICVS). Cited by: §I.
  • [27] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pp. 239–248. Cited by: §I, §II.
  • [28] J. H. Lee, M. Han, D. W. Ko, and I. H. Suh (2019) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326. Cited by: §I, §II.
  • [29] A. Levin, D. Lischinski, and Y. Weiss (2004) Colorization using optimization. ACM Transactions on Graphics 23 (3), pp. 689–694. Cited by: §IV.
  • [30] Y. Li, J. Huang, N. Ahuja, and M. Yang (2016) Deep joint image filtering. In European Conference on Computer Vision, pp. 154–169. Cited by: §I, §II, §III-A.
  • [31] C. Nissler, N. Mouriki, C. Castellini, V. Belagiannis, and N. Navab (2015) OMG: introducing optical myography as a new human machine interface for hand amputees. In International Conference on Rehabilitation Robotics (ICORR), pp. 937–942. Cited by: §I.
  • [32] NVIDIA (2021) TensorRT. Cited by: §II, §IV.
  • [33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §IV.
  • [34] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia (2018) Towards real-time unsupervised monocular depth estimation on CPU. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 5848–5854. Cited by: §II.
  • [35] R. Ranftl, A. Bochkovskiy, and V. Koltun (2021) Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188. Cited by: §I, §II.
  • [36] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §IV-C2, TABLE III.
  • [37] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGB-D images. In European conference on computer vision, pp. 746–760. Cited by: Fig. 1, §I, Fig. 4, §IV, §IV, §IV-C.
  • [38] M. Song, S. Lim, and W. Kim (2021) Monocular depth estimation using Laplacian pyramid-based depth residuals. IEEE transactions on circuits and systems for video technology 31 (11), pp. 4381–4393. Cited by: §I, §II, §IV-C1.
  • [39] X. Tu, C. Xu, S. Liu, R. Li, G. Xie, J. Huang, and L. T. Yang (2020) Efficient monocular depth estimation for edge devices in Internet of Things. IEEE Transactions on Industrial Informatics 17 (4), pp. 2821–2832. Cited by: Fig. 1, §I, §II, TABLE I, §IV, §IV-A, §IV-B.
  • [40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §III-C.
  • [41] J. Wiederer, A. Bouazizi, U. Kressel, and V. Belagiannis (2020) Traffic control gesture recognition for autonomous vehicles. In International Conference on Intelligent Robots and Systems (IROS), pp. 10676–10683. Cited by: §I.
  • [42] D. Wofk, F. Ma, T. Yang, S. Karaman, and V. Sze (2019) FastDepth: fast monocular depth estimation on embedded systems. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6101–6108. Cited by: Fig. 1, §I, §II, TABLE I, §IV, §IV-A, §IV-B, §IV-C2, TABLE III.
  • [43] T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 285–300. Cited by: §I, §II.
  • [44] W. Yin, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5684–5693. Cited by: §II.
  • [45] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 325–341. Cited by: §II.