waveletmonodepth
[CVPR 2021] Monocular depth estimation using wavelets for efficiency
We present a novel method for predicting accurate depths from monocular images with high efficiency. This efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients. In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on coefficients. Instead, we supervise only the final depth image that is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results compared to the original model, while requiring less than half the multiply-adds in the decoder network. Code is available at https://github.com/nianticlabs/waveletmonodepth

Single-image depth estimation methods are useful in many real-time applications, for example robotics, autonomous driving and augmented reality. These areas are typically resource-constrained, so efficiency at prediction time is important.
Neural networks which estimate depth from a single image overwhelmingly use U-Net architectures, with skip connections between encoder and decoder layers [Unet]. Most work on single-image depth prediction has focused on improving depth accuracy rather than efficiency. Those works that do target efficiency typically borrow tricks from the "efficient network" literature [howard2017mobilenets, sandler2018mobilenetv2] to speed up depth estimation, while still applying standard convolutions throughout the network [WofkFastDepth19, Poggipydnet18]. All these approaches rely on standard neural network components: convolutions, multiplications and additions.
Inspired by the sparse representations that wavelet decomposition can achieve, we propose an alternative network representation for more efficient depth estimation, which we call WaveletMonodepth. We make the observation that depth images of the man-made world are typically made up of many piecewise flat regions, with a few 'jumps' in depth between the flat regions. This structure lends itself well to wavelets. A low-frequency component can represent the overall scene structure, while the 'jumps' can be well captured in high-frequency components. Crucially, the high-frequency components are sparse, which means computation can be focused only in certain areas. This saves run-time computation while still enabling high-quality depths to be estimated.
To the best of our knowledge, we are the first to train a single-image depth estimation network that reconstructs depth by predicting wavelet coefficients. Furthermore, we show that our models can be trained with a self-supervised loss on the final depth signal, in contrast to other methods that directly supervise predicted wavelet coefficients.
We evaluate on the NYUv2 and KITTI datasets, where we train supervised and self-supervised models, respectively. We show that our approach allows us to effectively trade off depth accuracy against run-time computation.
We first give an overview of monocular depth estimation, before looking at works which have made depth estimation more efficient. We then discuss related works which have used wavelets for computer vision tasks, before finally looking at other forms of efficient neural networks.
Beyond early shape-from-shading methods, most works that estimate depth from a single image have been learning-based. Early works used a Markov random field [SaxenaMake3D], but more recent works have used deep neural networks. Supervised approaches use image-to-image networks to regress depth maps [Eigen14, Eigen2015PredictingDS, kumar2018depthnet, fu2018deep]; however these require ground-truth depth data at training time. Self-supervision reduces the requirement for labelled data by using stereo frames [garg2016unsupervised, godard2017unsupervised] or nearby video frames [zhou2017unsupervised] as supervision, exploiting 3D geometry with image reconstruction losses to learn a depth estimator. Focus in this area is typically on improving depth accuracy scores, by modelling moving objects at training time [bian2019unsupervised, chen2019self, yin2018geonet, gordon2019depth, ranjan2018adversarial] or by modelling occlusion [gordon2019depth, godard2019digging]. While these improvements achieve higher scores with equivalently trained architectures, some works aim for improved depth accuracy at the expense of efficiency, for example by using higher-resolution images [luo2019every], larger networks [guizilini20203d] or classification instead of regression at the output layer [fu2018deep].
A relatively small number of works focus on efficiency specifically for depth. Poggi et al. [Poggipydnet18] introduce PyDNet, which uses an image pyramid to enable a high receptive field with a small number of parameters. Wofk et al. [WofkFastDepth19] introduce FastDepth, which uses depthwise separable layers and network pruning to achieve efficient depth estimation. An alternative angle on efficient depth estimation is to focus on the training procedure. Several works use knowledge distillation to enable a small depth estimation network to learn some of the knowledge from a larger network [CreamIROS2018, StructKDPAMI20].
In contrast to these works, our contribution is to change the internal representation of depth within the network itself. We note that our contributions could be used in conjunction with the above efficient architectures or distillation schemes.
Wavelet decomposition is an extensively used technique in signal processing, image processing and computer vision. The discrete wavelet transform (DWT) yields a sparse representation of a discrete signal, which makes it highly compressible; a notable example is image compression in the JPEG2000 format [UnserJPEG2000, JPEG2000Book]. Since wavelet decomposition is also a frequency transform, it can further be used for denoising [DonohoSoftThreshold, DonohoWaveletShrinkage, Kang2018FrameletDenoising]. Wavelet transforms have also recently been combined with deep learning to restore images affected by Moiré color artifacts, which occur when RGB sensors are unable to resolve high-frequency details [Luo2020CVPRW, liu2020waveletbased]. Li et al. [Li2020CVPR] show that by substituting pooling operations in neural networks with discrete wavelet transforms it is possible to filter out high-frequency components of the input image during prediction and thus improve noise-robustness in image classification tasks. Super-resolution methods [Guo2017DWSR, huang2017wavelet, Deng2019ICCV] learn to estimate the high-frequency wavelet coefficients of a low-resolution input image and generate a high-resolution image through the inverse wavelet transform. Closer to our work, Yang et al. [Yang2020CVPR] use wavelets in a stereo matching network, but require supervision of wavelet coefficients while we do not. Similarly, Luo et al. [Luo2020CVPR] replace the downsampling and upsampling operations of U-Net-like architectures with the DWT and inverse DWT respectively, and replace standard skip-connections with high-frequency coefficient skip-connections. However, they do not directly predict wavelet coefficients of depth and as such are unable to exploit the sparse representation of wavelets for efficiency. In contrast with both these works, we focus on efficient depth prediction from a single image.
Convolutional Neural Networks (CNNs) [Lecun1995convolutional] have revolutionized the field of computer vision, as CNN-based methods tend to outperform competing methods on regression and classification tasks when given enough training data. However, the best performing neural networks contain a large number of parameters and require a large number of floating point operations (FLOPs) at run time, making deployment to lightweight platforms problematic. Many architectures have been developed to improve the accuracy/speed trade-off in deep networks, for example depthwise separable convolutions [howard2017mobilenets], inverted residual layers [sandler2018mobilenetv2], and pointwise group convolutions [zhang2018shufflenet]. An alternative approach is to train a network first and then cut away some of its unnecessary computation.
One line of research is network pruning [Liu2017ICCV, He2017ICCV, yu2019slimmable], which removes redundant filters from a trained neural network. While this reduces the network's memory footprint as well as the number of FLOPs needed for inference, sparsity is typically enforced through regularization terms [Wei16NIPS, he2017channel] to compress the network without losing performance. Using such regularisation, however, often requires careful tuning to achieve the desired result [ye2018rethinking]. In contrast, our wavelet-based method intrinsically provides sparsity in outputs and intermediate activations, and the wavelet predictions coincide with edges in the depth map, knowledge of which has direct applications in augmented reality [SharpNet2019, holynski2018fast].
While most works focus on classification, channel pruning has also been successfully applied to depth estimation in the aforementioned FastDepth [WofkFastDepth19], which uses NetAdapt [yang2018netadapt] to perform channel pruning.
Another recent work considers spatially sparse inference in image-to-image translation tasks. PointRend [Kirillov2020CVPR] treats semantic segmentation as a rendering process, where a high-resolution estimate is obtained from a low-resolution one through a cascade of upsampling and sparse refinement operations. The locations of these sparse rendering operations are chosen based on an uncertainty measure of the classification output. However, while they demonstrate the efficiency and applicability of their method to classification tasks, it cannot directly be applied to regression tasks, because it requires evaluating an uncertainty heuristic at all pixel locations. In contrast, our method applies directly to regression tasks, as rendering locations are directly predicted by our model in the form of non-zero-valued high-frequency wavelet coefficients.

In this section, we first introduce the basics of 2D wavelet transforms. We choose Haar wavelets [haar1910theorie] for their simplicity and efficiency. Next, we describe how to use the cascade nature of wavelet representations to build our efficient depth estimation architecture, which we call WaveletMonodepth. Finally, we discuss the computational benefits of sparse representations.
The Haar wavelet basis is the simplest basis of functions for wavelet decomposition. A discrete wavelet transform (DWT) with Haar wavelets decomposes a 2D image into four coefficient maps: a low-frequency component LL and three high-frequency components LH, HL and HH, at half the resolution of the input image. For the remainder of the paper, we refer to these coefficient maps as the output of the DWT. The DWT is an invertible operation, whose inverse (IDWT) converts the four coefficient maps back into a 2D signal at twice the resolution of the coefficient maps.
The multi-scale and multi-frequency wavelet representation is built by recursively applying the DWT to the low-frequency coefficient map, starting from the input image (see Figure 2(a)). Similarly, the multi-scale representation can be recursively inverted to reconstruct a full-resolution image (Figure 2(b)). This synthesis operation is the building block of our depth reconstruction method.
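To make the decomposition concrete, the snippet below runs one level of Haar analysis and synthesis with the PyWavelets library; this is an illustrative stand-in, since the network itself uses a differentiable DWT/IDWT.

```python
# One level of 2D Haar analysis/synthesis. Names follow the paper's LL/LH/HL/HH
# convention (PyWavelets calls the detail bands cH, cV, cD).
import numpy as np
import pywt

depth = np.random.rand(480, 640)           # stand-in for a depth map

# Analysis: one low-frequency map and three high-frequency maps at half resolution.
LL, (LH, HL, HH) = pywt.dwt2(depth, 'haar')
print(LL.shape)                             # (240, 320)

# Synthesis: IDWT reconstructs the signal at twice the coefficient resolution.
recon = pywt.idwt2((LL, (LH, HL, HH)), 'haar')
print(np.allclose(recon, depth))            # True: the transform is invertible
```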
Our method, which we call WaveletMonoDepth, is summarized in Figure 3. It builds on recursive use of the IDWT operation applied to predicted coefficient maps. We reconstruct a depth map at the input scale by first predicting a coarse estimate at the bottleneck scale of a U-Net-like architecture [Unet], and then iteratively upscale and refine this estimate by predicting high-frequency coefficient maps.
In our network architecture, the coarse depth estimate is predicted at 1/16 of the input scale. This depth map is then progressively upscaled and refined using Algorithm 1. A forward pass of our model generates a collection of 5 depth maps at scales [1/16, 1/8, 1/4, 1/2, 1]. We choose to supervise only the last four scales, as in [godard2019digging]. It is worth noting that the coefficient maps are predicted at scales [1/16, 1/8, 1/4, 1/2], thus removing the need for full-resolution computation.
For piecewise flat depth maps, high-frequency coefficient maps have a small number of non-zero values, located around depth edges. Hence, for full-resolution depth reconstruction, non-zero coefficient values need to be predicted at only a few pixel locations at each scale. At any scale, we assume that these pixel locations can be determined from the high-frequency coefficient maps estimated at the previous scale, through the mask computed by GetSparseMask in Algorithm 1. A sketch of this reconstruction loop is given below.
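The following sketch paraphrases this coarse-to-fine reconstruction; `decoder_feats`, `predict_ll`, `predict_hf` and `idwt` are hypothetical stand-ins for the decoder features, prediction heads and inverse Haar transform, and the mask here merely zeroes coefficients, whereas the sparse decoder skips their computation entirely.

```python
import torch
import torch.nn.functional as F

def reconstruct_depth(decoder_feats, predict_ll, predict_hf, idwt, threshold=0.05):
    """Coarse-to-fine depth reconstruction in the spirit of Algorithm 1."""
    # Coarse low-frequency estimate at 1/16 of the input resolution.
    ll = predict_ll(decoder_feats[0])
    outputs = [ll]
    prev_hf = None
    for j, feats in enumerate(decoder_feats[1:]):
        # High-frequency coefficients (LH, HL, HH) predicted at the current scale.
        hf = predict_hf[j](feats)
        if prev_hf is not None:
            # GetSparseMask: only locations whose upsampled previous-scale
            # coefficients exceed the threshold need non-zero coefficients.
            mask = F.interpolate(prev_hf.abs().amax(1, keepdim=True),
                                 scale_factor=2, mode='nearest') > threshold
            hf = hf * mask
        prev_hf = hf
        # IDWT doubles the resolution of the running low-frequency estimate.
        ll = idwt(ll, hf)
        outputs.append(ll)
    return outputs  # depth maps at scales 1/16, 1/8, 1/4, 1/2, 1
```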
The sparsity level $s$ achieved by using the mask $m$ is

$$ s = 1 - \frac{1}{HW} \sum_{i,j} m_{ij}, \qquad (1) $$
which allows us to remove redundant computation in the decoder layers. Indeed, for a typical $k \times k$ convolution (with a bias term) on a feature tensor of size $H \times W$ that has $C_{in}$ input channels and $C_{out}$ output channels, the number of multiply-add operations is

$$ N_{\mathrm{MAC}} = H \, W \, C_{out} \left( k^2 C_{in} + 1 \right). \qquad (2) $$
With the sparsity level $s$, this becomes

$$ N_{\mathrm{MAC}}^{\mathrm{sparse}} = (1 - s) \, H \, W \, C_{out} \left( k^2 C_{in} + 1 \right). \qquad (3) $$
Note that our sparsification strategy aims to reduce FLOPs by decreasing the number of pixel locations at which we need to compute an output. This approach is orthogonal to, and complements, other approaches such as channel pruning, which instead reduces $C_{in}$ and $C_{out}$, or separable convolutions. We refer to the supplementary material for further details on these.
Considering a quite conservative threshold on the high-frequency coefficient maps, the sparse decoder computation is about 3× lower in FLOPs than running standard convolutions at all pixel locations of a full-resolution input image. The sketch below illustrates the count from Equations (2) and (3).
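As a sanity check on Equations (2) and (3), the helper below counts multiply-adds for a dense and a sparsified convolution; the layer sizes and the 90% sparsity value are illustrative, not measurements from the paper.

```python
def conv_macs(h, w, c_in, c_out, k=3, sparsity=0.0):
    """Multiply-adds of a k x k convolution with bias, evaluated at a
    (1 - sparsity) fraction of the H x W pixel locations (Eqs. 2 and 3)."""
    return (1.0 - sparsity) * h * w * c_out * (k * k * c_in + 1)

dense = conv_macs(160, 512, 64, 3)                    # wavelet head, example sizes
sparse = conv_macs(160, 512, 64, 3, sparsity=0.9)     # 90% of locations skipped
print(f"{dense / sparse:.0f}x fewer multiply-adds")   # 10x at this layer
```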
Our self-supervised losses are as described in [godard2019digging], which we briefly summarize here for completeness; see the supplementary material for further details. Given a stereo pair of images $(I_l, I_r)$, we train our network to predict a depth map $D_l$, pixel-aligned with the left image. We also assume access to the camera intrinsics $K$ and the relative camera transformation between the images in the stereo pair, $T_{l \to r}$. We use the network's current estimate of depth to synthesise an image $I_{r \to l}$, computed as
$$ I_{r \to l} = I_r \left\langle \mathrm{proj}(D_l, T_{l \to r}, K) \right\rangle, \qquad (4) $$
where $\mathrm{proj}(D_l, T_{l \to r}, K)$ are the 2D pixel coordinates obtained by projecting the depths $D_l$ into image $I_r$, and $\langle \cdot \rangle$ is the sampling operator. We follow standard practice in training with a photometric reconstruction error $pe$, so our loss becomes $L_p = pe(I_l, I_{r \to l})$. Following [godard2019digging, chen2019self] and others, we set $pe$ to a weighted sum of SSIM and $L_1$ losses.
We also include the depth smoothness loss from [godard2019digging].
For our experiments which train on monocular and stereo sequences ('MS'), we combine reprojection errors from three different source images: one frame forward in time, one frame back in time, and the corresponding stereo pair. In this case, we create synthesized images from the monocular sequence using relative poses estimated by a pose network, as described in [godard2019digging]. In this setting, we use a per-pixel minimum reprojection loss, again following [godard2019digging].
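For concreteness, a simplified version of this reprojection loss is sketched below; `ssim` is assumed to return a per-pixel dissimilarity map in the (1 - SSIM)/2 style, the weight alpha = 0.85 follows common Monodepth2-style practice, and `warped_sources` are assumed to come from the view-synthesis step of Equation (4).

```python
import torch

def photometric_error(pred, target, ssim, alpha=0.85):
    """Weighted SSIM + L1 photometric error pe(I_a, I_b), per pixel."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    ssim_term = ssim(pred, target).mean(1, keepdim=True)   # (1 - SSIM)/2 style map
    return alpha * ssim_term + (1 - alpha) * l1

def min_reprojection_loss(target, warped_sources, ssim):
    """Per-pixel minimum reprojection loss over the warped source images
    (stereo partner and/or adjacent monocular frames)."""
    errors = torch.stack([photometric_error(w, target, ssim) for w in warped_sources])
    return errors.min(dim=0).values.mean()
```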
Our validation experiments explore the task of training a CNN to predict depth from a single color image, using wavelets as an intermediate representation. Depending on the experiment, we compare against leading baselines which variously supplement, pre-process or post-process the stereo pairs used for supervision, and the output depth maps.
We conduct experiments on the KITTI and NYUv2 depth datasets. KITTI [Geiger2012CVPR] consists of 22,600 calibrated stereo video pairs captured by a car driving around a city in Germany. Models are evaluated on the Eigen split [Eigen2015PredictingDS] using the corresponding LiDAR point clouds; see [godard2017unsupervised] for details. NYUv2 [silberman2012indoor] consists of RGB-D frames captured with a Kinect sensor. There are 120K raw frames collected by scanning various indoor scenes. As in DenseDepth [Densedepth], we use a 50K-sample subset of the full dataset in which depth is inpainted using the method of Levin et al. [Levin2004colorization]. The NYUv2 evaluation is run on the 654 test frames introduced by Eigen et al. [Eigen14].
On the KITTI dataset we compute depth estimation scores based on the standard metrics introduced by Eigen et al. [Eigen14]: Abs Rel, Sq Rel, RMSE, RMSE log, δ<1.25, δ<1.25², and δ<1.25³. We use the same metrics for NYUv2, but follow standard practice [fu2018deep] in reporting log10 error instead of RMSE log. To evaluate the sharpness of depth maps on NYUv2, we use the occlusion boundary metrics introduced by Koch et al. [Koch2018EvaluationOC, koch2020comparison] and the NYU-OC++ dataset manually annotated by Ramamonjisoa et al. [SharpNet2019, ramamonjisoa2020displacement].
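For reference, a straightforward implementation of these standard metrics is sketched below (median scaling and depth-range clamping, which some evaluation protocols apply beforehand, are omitted).

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth evaluation metrics of Eigen et al., on valid ground-truth pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "sq_rel": np.mean((pred - gt) ** 2 / gt),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "a1": np.mean(thresh < 1.25),
        "a2": np.mean(thresh < 1.25 ** 2),
        "a3": np.mean(thresh < 1.25 ** 3),
    }
```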
To demonstrate the efficiency of our method, we choose two models for experiments on the NYUv2 and KITTI datasets.
For KITTI, we choose the weakly-supervised Depth Hints [watson2019depthhints] method, which adds Semi-Global Matching [hirschmuller2005accurate, hirschmuller2007stereo] supervision to the self-supervised Monodepth2 [godard2019digging] without requiring LiDAR depth supervision. At each scale of the Monodepth2 decoder there is a layer which outputs a one-channel disparity map. We replace this layer at each scale with a 3-channel output layer that predicts {LH, HL, HH}. While our baseline consumes decoder feature maps at scales [1/16, 1/8, 1/4, 1/2, 1], we only need to keep the four scales [1/16, 1/8, 1/4, 1/2], as the IDWT outputs disparity at full resolution. Both our model and the baseline are trained with the Adam optimizer, with batch size 12, for 20 epochs. Unless otherwise specified, our experiments use a ResNet50-based model trained with the depth hints loss.

For NYUv2, we implement a U-Net-like baseline similar to DenseDepth [Densedepth] using PyTorch, and detail its architecture in the supplementary material. As in our KITTI experiments, we discard the last layer of the decoder, as it is not needed, and add one extra layer at each scale to predict the wavelet coefficients. Both our model and the baseline are trained using the Adam optimizer with standard parameters, for 20 epochs with batch size 8. It is worth noting that DenseDepth predicts outputs at half the input resolution, but evaluates at full resolution after bilinear upsampling.

In this section, we study the relation between accuracy, sparsity, and efficiency of WaveletMonoDepth. For each set of experiments, we compare our method to an equivalently trained model without wavelets. We first study how wavelets contribute to high-frequency details, then show that they are sparse. Finally, we discuss how we trade off accuracy against efficiency by varying the threshold used in Algorithm 1 to filter out close-to-zero coefficients.
[Figure: qualitative comparison on KITTI [Geiger2012CVPR] and NYUv2 [Nyuv2] — rows show the RGB input, depth reconstructed from LL only, from LL + {LH, HL, HH}, and from LL + all high-frequency coefficients with sparse evaluation.]
As mentioned in Section 3.2, the wavelet representation of depth maps allows us to output depth at different resolutions, depending on how many levels of coefficients have been computed. Tables 1 and 2 report evaluation scores for depth maps produced at different levels of wavelet decomposition on the KITTI and NYUv2 datasets respectively. As can be seen, most of the signal is captured in the low-frequency estimate of the depth map at the lowest resolution. This confirms the observations of previous works [Eigen2015PredictingDS, Chen2019SARPN] that a coarse estimate of depth is sufficient to capture the global geometry of the scene. Using more wavelet levels adds more high-frequency details to the depth map, yielding sharper results. Figure 7 shows the sharpening effect of wavelets qualitatively on KITTI and NYUv2 images.
Activated HF  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³

LL only  0.104  0.668  4.415  0.179  0.878  0.962  0.985 
[3]  0.097  0.659  4.321  0.177  0.887  0.964  0.984 
[3, 2]  0.096  0.679  4.333  0.179  0.890  0.963  0.983 
[3, 2, 1]  0.096  0.702  4.366  0.180  0.891  0.963  0.983 
[3, 2, 1, 0]  0.097  0.714  4.386  0.181  0.891  0.963  0.983 
Activated HF  Depth Accuracy  Occ. Boundaries

Abs Rel  RMSE  log10  δ<1.25  δ<1.25²  δ<1.25³  ε_acc  ε_comp
LL only  0.1281  0.5549  0.0548  0.8419  0.9674  0.9915  8.3672  9.8552 
[3]  0.1264  0.5517  0.0543  0.8446  0.9680  0.9917  3.3945  8.7933 
[3, 2]  0.1259  0.5512  0.0542  0.8451  0.9682  0.9917  2.1259  7.6702 
[3, 2, 1]  0.1258  0.5515  0.0542  0.8451  0.9681  0.9917  1.8070  7.1073 
Next, we show that the high-frequency coefficients are sparse. As an example, Figure 1(b) shows one low-frequency and three high-frequency coefficient maps for a given depth map. We observe that the high-frequency maps have non-zero values near depth edges. More wavelet predictions can be found in the supplementary material. As depth edges are sparse, high-frequency coefficients at only a few pixel locations are needed to produce high-accuracy depth maps.
After training our network with standard convolutions, these are replaced with sparse ones as in Figure 3 and Algorithm 1. Varying the threshold value allows us to vary the sparsity level in Equation (3), and consequently to trade off accuracy against complexity. Because wavelets are sparse, we can compute them at only a very small number of pixel locations and suffer a minimal loss in depth accuracy. Figures 4 and 5 show relative score changes with varying sparsity threshold on the KITTI and NYUv2 datasets respectively. Note that a fixed threshold value produces different sparsity levels depending on the content of an image, so we also plot the standard deviation of sparsity levels for each threshold value. Figure 4 indicates that computing the wavelet coefficients at only 10 percent of pixel locations results in only a marginal relative loss in scores for KITTI images. Similarly, Figure 5 shows that we can compute wavelet coefficients at only 5 percent of pixel locations while suffering only a marginal loss in scores for NYUv2 images.

Finally, we demonstrate how the sparsity of high-frequency coefficient maps can be exploited for efficiency gains in the decoder. Figure 6 shows Abs Rel and δ<1.25 scores for varying thresholds used during prediction. As can be seen, the score change is minimal when using half the multiply-add operations in the decoder, and performance remains comparable to state-of-the-art methods while using only a third of the multiply-add operations. Note that the biggest efficiency gains are obtained at higher resolution, as sparsity increases with resolution.
Cit.  Method  PP  Data  H W  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³

[godard2019digging]  Monodepth2 Resnet18  ✓  S  192 640  0.108  0.842  4.891  0.207  0.866  0.949  0.976 
WaveletMonodepth Resnet18  ✓  S  192 640  0.110  0.876  4.916  0.206  0.864  0.950  0.976  
Monodepth2 Resnet50  ✓  S  192 640  0.108  0.802  4.577  0.185  0.886  0.963  0.983  
WaveletMonodepth Resnet50  ✓  S  192 640  0.106  0.824  4.824  0.205  0.870  0.949  0.975  
[watson2019depthhints]  Depth Hints  ✓  S  192 640  0.106  0.780  4.695  0.193  0.875  0.958  0.980 
WaveletMonodepth Resnet18  ✓  S  192 640  0.106  0.813  4.693  0.193  0.876  0.957  0.980  
Depth Hints Resnet50  ✓  S  192 640  0.102  0.762  4.602  0.189  0.880  0.960  0.981  
WaveletMonodepth Resnet50  ✓  S  192 640  0.105  0.813  4.625  0.191  0.879  0.959  0.981  
[godard2019digging]  Monodepth2 Resnet18  ✓  MS  192 640  0.104  0.786  4.687  0.194  0.876  0.958  0.980 
WaveletMonodepth Resnet18  ✓  MS  192 640  0.109  0.814  4.808  0.198  0.868  0.955  0.980  
[watson2019depthhints]  Depth Hints  ✓  MS +  192 640  0.105  0.769  4.627  0.189  0.875  0.959  0.982 
WaveletMonodepth Resnet18  ✓  MS +  192 640  0.110  0.840  4.741  0.195  0.868  0.956  0.981  
[godard2019digging]  Monodepth2 Resnet18  ✓  S  320 1024  0.105  0.822  4.692  0.199  0.876  0.954  0.977 
WaveletMonodepth Resnet18  ✓  S  320 1024  0.105  0.797  4.732  0.203  0.869  0.952  0.977  
[watson2019depthhints]  Depth Hints  ✓  S  320 1024  0.099  0.723  4.445  0.187  0.886  0.961  0.982 
WaveletMonodepth Resnet18  ✓  S  320 1024  0.102  0.739  4.452  0.188  0.883  0.960  0.981  
Depth Hints Resnet50  ✓  S  320 1024  0.096  0.710  4.393  0.185  0.890  0.962  0.981  
WaveletMonodepth Resnet50  ✓  S  320 1024  0.097  0.718  4.387  0.184  0.891  0.962  0.982  
WaveletMonodepth Resnet50 ()  ✓  S  320 1024  0.100  0.726  4.444  0.186  0.888  0.962  0.982 
Method  H W  Abs Rel  RMSE  log10  δ<1.25  δ<1.25²  δ<1.25³  ε_acc  ε_comp

DenseNet baseline  480 640  0.1277  0.5479  0.0539  0.8430  0.9681  0.9917  1.7170  7.0638 
WaveletMonodepth (last scale sup.)  480 640  0.1280  0.5589  0.0546  0.8436  0.9658  0.9908  1.7678  7.1433 
WaveletMonodepth  480 640  0.1258  0.5515  0.0542  0.8451  0.9681  0.9917  1.8070  7.1073 
WaveletMonodepth ()  480 640  0.1259  0.5517  0.0543  0.8450  0.9681  0.9917  1.8790  7.0746 
We summarize our results on the KITTI dataset in Table 3. Here we show that our method, which simply replaces depth or disparity predictions with wavelet predictions, can be applied to a wide range of single-image depth estimation models and losses. In each section of the table, the off-the-shelf model numbers are reported, together with numbers from a model trained with our wavelet formulation. For example, we demonstrate that wavelets can be used in self-supervised depth estimation frameworks such as Monodepth2 [godard2019digging], as well as its weakly-supervised extension Depth Hints [watson2019depthhints]. We note that we achieve our best results when using Depth Hints and high-resolution input images. This is not surprising, as supervision from SGM should give better scores; more importantly, using high-resolution inputs and outputs allows for more sparsification, as edge pixels become sparser as resolution grows. Overall, we show that replacing dense convolutional depth outputs with wavelet predictions gives models with performance comparable to the off-the-shelf, non-wavelet baselines. We show qualitative results from KITTI in Figure 7 (left).
Scores on NYUv2 are shown in Table 4. Our method performs on par with our baseline, which demonstrates that it is possible to estimate accurate depth and sparse wavelets without directly supervising the wavelet coefficients, in contrast with [Yang2020CVPR]. In Table 4, we show that supervising depth only at the last scale performs on par with our network supervised at all scales, which shows that a full multi-scale wavelet reconstruction network can be trained end-to-end. Qualitative results from NYUv2 are shown in Figure 7 (right).
In this work we combine wavelet representations with deep learning for the single-image depth prediction task. We demonstrate that a neural network can learn to predict wavelet coefficient maps through supervision of the reconstructed depth map with existing losses. Our experiments on the KITTI and NYUv2 datasets show that we can achieve scores comparable to state-of-the-art models using encoder-decoder architectures similar to the baseline models, but with wavelet representations.
We also analyze the sparsity of wavelet coefficients and show that sparsified wavelet coefficient maps can generate high-quality depth maps. Finally, we exploit this sparsity to reduce multiply-add operations in the decoder network by at least a factor of 2.
We would like to thank Aron Monszpart for helping us set up cloud experiments, and our reviewers for their useful suggestions.
Supplementary Material
The previous work WaveletStereo [Yang2020CVPR] supervises its wavelet-based stereo matching method with ground-truth wavelet coefficients at the different levels of the decomposition. However, wavelets can only reliably be supervised when ground-truth depth or disparity is provided and does not contain missing values or high-frequency noise, as they show on the synthetic SceneFlow [mayer2016Sceneflow] dataset. The sparsity of ground-truth data in the KITTI dataset, especially around edges, makes it impossible to reliably estimate ground-truth wavelet coefficients. On NYUv2, the noise in depth maps is also an issue for direct supervision of wavelets, e.g. with creases in the layout or inaccurate depth edges. This noise also prohibits the use of Semi-Global Matching ground truth for wavelet coefficient supervision.
As we show in our work, supervising the network on wavelet reconstructions allows us to ignore missing values and be robust to noisy labels.

† Work done during an internship at Niantic.
The architecture we use for our experiments is a modification of the architecture used in [godard2019digging], as described in the main paper. In Table 5, we set out our decoder architecture in detail.
Our self-supervised losses are as described in [godard2019digging], which we repeat here for completeness. Given a stereo pair of images $(I_l, I_r)$, we train our network to predict a depth map $D_l$, pixel-aligned with the left image. We also assume access to the camera intrinsics $K$ and the relative camera transformation between the images in the stereo pair, $T_{l \to r}$. We use the network's current estimate of depth to synthesise an image $I_{r \to l}$, computed as
$$ I_{r \to l} = I_r \left\langle \mathrm{proj}(D_l, T_{l \to r}, K) \right\rangle, \qquad (5) $$
where $\mathrm{proj}(D_l, T_{l \to r}, K)$ are the 2D pixel coordinates obtained by projecting the depths $D_l$ into image $I_r$, and $\langle \cdot \rangle$ is the sampling operator. We follow standard practice in training the model under a photometric reconstruction error $pe$, so our loss becomes
$$ L_p = pe(I_l, I_{r \to l}). \qquad (6) $$
Following [godard2019digging, chen2019self] and others, we use a weighted sum of SSIM and $L_1$ losses,

$$ pe(I_a, I_b) = \frac{\alpha}{2} \left( 1 - \mathrm{SSIM}(I_a, I_b) \right) + (1 - \alpha) \left\| I_a - I_b \right\|_1, $$

where $\alpha = 0.85$. We additionally follow [godard2019digging] in using the smoothness loss:
$$ L_s = \left| \partial_x d^* \right| e^{-\left| \partial_x I_l \right|} + \left| \partial_y d^* \right| e^{-\left| \partial_y I_l \right|}, \qquad (7) $$
where $d^* = d / \bar{d}$ is the mean-normalized inverse depth for image $I_l$. A sketch of this edge-aware term is given below.
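In the sketch, `disp` is the predicted inverse depth and `img` the corresponding color image, both as NCHW tensors; this is a minimal re-statement of Equation (7), not the released training code.

```python
import torch

def smoothness_loss(disp, img):
    """Edge-aware smoothness on mean-normalized inverse depth (Eq. 7)."""
    norm_disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    grad_x = (norm_disp[:, :, :, :-1] - norm_disp[:, :, :, 1:]).abs()
    grad_y = (norm_disp[:, :, :-1, :] - norm_disp[:, :, 1:, :]).abs()
    # Down-weight disparity gradients where the image itself has strong edges.
    img_gx = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    img_gy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (grad_x * torch.exp(-img_gx)).mean() + (grad_y * torch.exp(-img_gy)).mean()
```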
When we train on monocular and stereo sequences (‘MS’), we again follow [godard2019digging] — see our main paper for an overview, and [godard2019digging] for full details.
When we train with depth hints, we use the proxy loss from [watson2019depthhints], which we recap here. For stereo training pairs, we compute a proxy depth map $D^{SGM}$ using Semi-Global Matching [hirschmuller2005accurate], an off-the-shelf stereo matching algorithm. We use this to create a second synthesized image
$$ \tilde{I}_{r \to l} = I_r \left\langle \mathrm{proj}(D^{SGM}, T_{l \to r}, K) \right\rangle. \qquad (8) $$
We decide whether or not to apply a supervised loss using $D^{SGM}$ as ground truth on a per-pixel basis: we only add this supervised loss at pixels where $pe(I_l, \tilde{I}_{r \to l})$ is lower than $pe(I_l, I_{r \to l})$. The supervised loss term we use follows [watson2019depthhints]. For experiments where Depth Hints are used for training, we disable the smoothness loss term.
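A simplified version of this per-pixel selection is sketched below; `warp` is a hypothetical view-synthesis function, `pe` the photometric error from above, and the log-L1 form of the supervised term is an assumption following the Depth Hints paper rather than a detail stated here.

```python
import torch

def depth_hints_loss(depth, hint_depth, target, source, warp, pe):
    """Apply the proxy (SGM) depth supervision only at pixels where warping with
    the hint reproduces the target image better than warping with the prediction."""
    warped_pred = warp(source, depth)        # synthesis with predicted depth
    warped_hint = warp(source, hint_depth)   # synthesis with SGM proxy depth
    use_hint = (pe(warped_hint, target) < pe(warped_pred, target)).float()
    # log-L1 between predicted depth and the hint, masked per pixel.
    log_l1 = torch.log(1 + (depth - hint_depth).abs())
    return (use_hint * log_l1).sum() / (use_hint.sum() + 1e-7)
```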
Depth Decoder  
layer  k  s  chns  res  input  activation 
upconv5  3  1  256  32  econv5  ELU [clevert2015elus] 
Level 3 coefficients predictions  
iconv4  3  1  256  16  upconv5, econv4  ELU 
disp4  3  1  1  16  iconv4  Sigmoid 
wave4  3  1  3  16  iconv4  Sigmoid 
upconv4  3  1  128  16  iconv4  ELU 
IDWT  -  -  1  8  disp4, wave4  -
Level 2 coefficients predictions  
iconv3  3  1  128  8  upconv4, econv3  ELU 
wave3  3  1  3  8  iconv3  Sigmoid 
upconv3  3  1  64  8  iconv3  ELU 
IDWT  -  -  1  8  , wave3  -
Level 1 coefficients predictions  
iconv2  3  1  64  4  upconv3, econv2  ELU 
wave2  3  1  3  4  iconv2  Sigmoid 
upconv2  3  1  32  4  iconv2  ELU 
IDWT  -  -  1  8  , wave2  -
Level 0 coefficients predictions  
iconv1  3  1  32  2  upconv2, econv1  ELU 
wave1  3  1  3  2  iconv1  Sigmoid 
IDWT  -  -  1  , wave1  -
Here k is the kernel size, s the stride, chns the number of output channels for each layer, res the downscaling factor of each layer relative to the input image, and input the input of each layer, where ↑ denotes a nearest-neighbor upsampling of the layer. disp4 is used to produce the low-resolution estimate, while waveJ is used to decode the wavelet coefficients at level J. disp4 and waveJ are convolution blocks detailed in Table 6.

disp4 Layer
layer  k  s  chns  res  input  activation 
disp4(1)  1  1  chns(iconv5) / 4  16  iconv5  LeakyReLU(0.1) [xu2015empirical] 
disp4(2)  3  1  1  16  disp41  Sigmoid 
Wavelet Decoding Layer  waveJ  
layer  k  s  chns  res  input  activation 
waveJ(1+)  1  1  chns(iconv[J+1])  iconv[J+1]  LeakyReLU(0.1)  
waveJ(2+)  3  1  3  waveJ(1+)  Sigmoid  
waveJ(1−)  1  1  chns(iconv[J+1])  iconv[J+1]  LeakyReLU(0.1)  
waveJ(2−)  3  1  3  waveJ(1−)  Sigmoid  
subtract  1  1  3  waveJ(2+), waveJ(2−)  Linear
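Read as a PyTorch module, the waveJ block above amounts to something like the following sketch; the channel counts and the subtraction of a positive and a negative sigmoid branch follow Table 6, while the exact details of the released code may differ.

```python
import torch.nn as nn

class WaveletHead(nn.Module):
    """waveJ block: two sigmoid-bounded branches whose difference yields
    signed {LH, HL, HH} coefficient maps."""
    def __init__(self, in_channels):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 1),
                nn.LeakyReLU(0.1),
                nn.Conv2d(in_channels, 3, 3, padding=1),
                nn.Sigmoid(),
            )
        self.pos, self.neg = branch(), branch()

    def forward(self, x):
        # Output lies in (-1, 1); any further scaling happens downstream.
        return self.pos(x) - self.neg(x)
```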
We additionally tried training with edge-aware sparsity constraints that penalize non-zero coefficients in non-edge regions, by replacing depth gradients with wavelet coefficients in Monodepth's [Monodepth17] disparity smoothness loss, which unfortunately made training unstable. We also tried supervising wavelet coefficients using distillation [hinton2015distilling, aleotti2021real] from a teacher depth network, which resulted in lower performance.
We adapted our architecture from the PyTorch implementation of DenseDepth [Densedepth]. Our implementation uses a DenseNet161 encoder instead of a DenseNet169, and a standard decoder with up-convolutions. We first design a baseline that does not use wavelets, using the architecture detailed in Table 7. Our wavelet adaptation of that baseline is detailed in Table 8. For the experiments reported in the main paper, we follow the DenseDepth strategy and predict outputs at half the input resolution; hence, the last level of the depth decoder in Table 8 is discarded. For the experiments using a lightweight decoder discussed later in Section 9.4, which predicts depth maps from a low-resolution input image, we keep all four levels of wavelet decomposition.
Depth Decoder  

layer  k  s  chns  res  input  activation 
upconv5  3  1  1104  32  econv5  Linear 
iconv4  3  1  552  16  upconv5, econv4  LeakyReLU(0.2) 
iconv3  3  1  276  8  iconv4, econv3  LeakyReLU(0.2) 
iconv2  3  1  138  4  iconv3, econv2  LeakyReLU(0.2) 
iconv1  3  1  69  2  iconv2, econv1  LeakyReLU(0.2) 
outconv0  1  1  1  2  iconv1  Linear 
Depth Decoder  
layer  k  s  chns  res  input  activation 
upconv5  3  1  1104  32  econv5  Linear 
Level 3 coefficients predictions  
iconv4  3  1  552  16  upconv5, econv4  LeakyReLU(0.2) 
disp4  1  1  1  16  upconv5  Linear 
wave4  3  1  3  16  upconv5  Linear 
IDWT  -  -  1  8  disp4, wave4  -
Level 2 coefficients predictions  
iconv3  3  1  276  8  iconv4, econv3  LeakyReLU(0.2) 
wave3  3  1  3  8  iconv3  Linear 
IDWT  -  -  1  4  , wave3  -
Level 1 coefficients predictions  
iconv2  3  1  138  4  iconv3, econv2  LeakyReLU(0.2) 
wave2  3  1  3  4  iconv2  Linear 
IDWT  -  -  1  2  , wave2  -
For our NYU results in the main paper, we supervise depth using an L1 loss and SSIM:
$$ L_{\mathrm{depth}} = \lambda \left\| \hat{D} - D \right\|_1 + \frac{1 - \mathrm{SSIM}(\hat{D}, D)}{2}, \qquad (9) $$
where $\hat{D}$ and $D$ are respectively the predicted and ground-truth depth and $\lambda$ weights the $L_1$ term. Similar to [SharpNet2019, ramamonjisoa2020displacement], we clamp depth between 0.4 and 10 meters. A sketch of this loss is given below.
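In the sketch, `ssim` is again an assumed per-pixel dissimilarity implementation and the weight `lam` is an illustrative value rather than the one used for the reported results.

```python
import torch

def supervised_depth_loss(pred, gt, ssim, lam=0.1, min_d=0.4, max_d=10.0):
    """L1 + SSIM supervision on clamped depth maps (cf. Eq. 9)."""
    pred = pred.clamp(min_d, max_d)
    gt = gt.clamp(min_d, max_d)
    l1 = (pred - gt).abs().mean()
    ssim_term = ssim(pred, gt).mean()   # dissimilarity map, e.g. (1 - SSIM)/2
    return lam * l1 + ssim_term
```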
We report results on the improved KITTI ground truth [uhrig2017sparse] in Table 9. As in the main paper, our method is competitive on scores with the non-wavelet baselines, while, as we have shown, our wavelet decomposition enables more efficient predictions.
In this section, we show qualitative results of our method.
In Figures 10, 11 and 12 and Figures 13, 14 and 15 we first show our sparse prediction process with the corresponding sparse wavelets and masks, on the NYUv2 and KITTI datasets respectively. While we only need to compute wavelet coefficients at less than 10% of pixel locations in the decoding process, we show that our wavelets efficiently retain the relevant information. Furthermore, we show that wavelets efficiently detect depth edges and their orientation. Therefore, future work could make use of our wavelet-based depth estimation method for efficient occlusion boundary detection.
In Figure 8, we show comparative results between our baseline Depth Hints [watson2019depthhints] and our wavelet-based method.


Cit.  Method  PP  Data  H W  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³
[godard2019digging]  Monodepth2 Resnet18  ✓  S  192 640  0.079  0.512  3.721  0.131  0.924  0.982  0.994 
WaveletMonodepth Resnet18  ✓  S  192 640  0.084  0.523  3.807  0.137  0.914  0.980  0.994  
WaveletMonodepth Resnet50  ✓  S  192 640  0.081  0.477  3.658  0.133  0.920  0.981  0.994  
[watson2019depthhints]  Depth Hints  ✓  S  192 640  0.085  0.487  3.670  0.131  0.917  0.983  0.996 
WaveletMonodepth Resnet18  ✓  S  192 640  0.083  0.476  3.635  0.129  0.920  0.983  0.995  
Depth Hints Resnet50  ✓  S  192 640  0.081  0.432  3.510  0.124  0.924  0.985  0.996  
WaveletMonodepth Resnet50  ✓  S  192 640  0.081  0.449  3.509  0.125  0.923  0.986  0.996  
[godard2019digging]  Monodepth2 Resnet18  ✓  MS  192 640  0.084  0.494  3.739  0.132  0.918  0.983  0.995 
WaveletMonodepth Resnet18  ✓  MS  192 640  0.085  0.497  3.804  0.134  0.912  0.982  0.995  
[watson2019depthhints]  Depth Hints  ✓  MS +  192 640  0.087  0.526  3.776  0.133  0.915  0.982  0.995 
WaveletMonodepth Resnet18  ✓  MS +  192 640  0.086  0.497  3.699  0.131  0.914  0.983  0.996  
[godard2019digging]  Monodepth2 Resnet18  ✓  S  320 1024  0.082  0.497  3.637  0.132  0.924  0.982  0.994 
WaveletMonodepth Resnet18  ✓  S  320 1024  0.080  0.443  3.544  0.130  0.919  0.983  0.995  
WaveletMonodepth Resnet50  ✓  S  320 1024  0.076  0.413  3.434  0.126  0.926  0.984  0.995  
[watson2019depthhints]  Depth Hints  ✓  S  320 1024  0.077  0.404  3.345  0.119  0.930  0.988  0.997 
WaveletMonodepth Resnet18  ✓  S  320 1024  0.078  0.397  3.316  0.121  0.928  0.987  0.997  
Depth Hints Resnet50  ✓  S  320 1024  0.074  0.363  3.198  0.114  0.936  0.989  0.997  
WaveletMonodepth Resnet50  ✓  S  320 1024  0.074  0.357  3.170  0.114  0.936  0.989  0.997  
Our paper mainly explores computation reduction in the decoder of a U-Net-like architecture. This direction is, however, orthogonal and complementary to other lines of research on complexity reduction.
Our approach is for example complementary to FastDepth [WofkFastDepth19], which reduces the overall complexity of a depth estimation network along several dimensions: (1) the encoder, (2) the decoder, and (3) the input resolution. They argue that the deep network introduced by Laina et al. [Laina2016DeeperDP] suffers from high complexity which could largely be reduced. Here we present a set of experiments we conducted to explore these different aspects of complexity reduction.
Cit.  Method  PP  Data  H W  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³
[watson2019depthhints]  Depth Hints  ✓  S  192 640  0.106  0.780  4.695  0.193  0.875  0.958  0.980 
WaveletMonodepth MobileNetv2  ✓  S  192 640  0.109  0.851  4.754  0.194  0.870  0.957  0.980  
WaveletMonodepth Resnet18  ✓  S  192 640  0.107  0.829  4.693  0.193  0.873  0.957  0.980  
WaveletMonodepth Resnet50  ✓  S  192 640  0.105  0.813  4.625  0.191  0.879  0.959  0.981  
[watson2019depthhints]  Depth Hints  ✓  S  320 1024  0.099  0.723  4.445  0.187  0.886  0.961  0.982 
WaveletMonodepth MobileNetv2  ✓  S  320 1024  0.104  0.772  4.545  0.188  0.880  0.960  0.982  
WaveletMonodepth Resnet18  ✓  S  320 1024  0.102  0.739  4.452  0.188  0.883  0.960  0.981  
Depth Hints Resnet50  ✓  S  320 1024  0.096  0.710  4.393  0.185  0.890  0.962  0.981  
WaveletMonodepth Resnet50  ✓  S  320 1024  0.097  0.718  4.387  0.184  0.891  0.962  0.982  
Method  Encoder  Depthwise  H W  Abs Rel  RMSE  log10  δ<1.25  δ<1.25²  δ<1.25³  ε_acc  ε_comp

Dense baseline  DenseNet161    480 640  0.1277  0.5479  0.0539  0.8430  0.9681  0.9917  1.7170  7.0638 
Ours  DenseNet161    480 640  0.1258  0.5515  0.0542  0.8451  0.9681  0.9917  1.8070  7.1073 
Ours  DenseNet161  ✓  480 640  0.1275  0.5771  0.0557  0.8364  0.9635  0.9897  2.0133  7.1903 
Dense baseline  MobileNetv2    480 640  0.1772  0.6638  0.0731  0.7419  0.9341  0.9835  1.8911  7.7960 
Ours  MobileNetv2    480 640  0.1727  0.6776  0.0732  0.7380  0.9362  0.9844  1.9732  7.9004 
Ours  MobileNetv2  ✓  480 640  0.1734  0.6700  0.0731  0.7391  0.9347  0.9844  2.3036  8.0538 
First, we replace the costly ResNet [He2016ResNet] or DenseNet [huang2017densenet, huang2019densenet] backbone encoders with the efficient MobileNetv2 [sandler2018mobilenetv2]. In contrast with FastDepth, in the main paper we report results using large encoder models (ResNet18/50 or DenseNet161). Although this helps achieve better scores, we show in Table 10 and Table 11 that we can reach close to state-of-the-art results even with a small encoder such as MobileNetv2.
Secondly, FastDepth also shows that separable convolutions in their "NNConv" decoder provide the best score-efficiency trade-off. Since this approach is orthogonal to our sparsification method, it complements ours and can be used to further improve efficiency. Interestingly, we show in Table 11 that replacing sparse convolutions with sparse depthwise-separable convolutions works on par with standard convolutions. This can be explained by the fact that the IDWT is also a separable operation, and therefore combines efficiently with depthwise separable convolutions.
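For reference, a depthwise-separable decoder convolution looks like the sketch below; combining such layers with the sparse evaluation of Section 3 is what the "Depthwise" rows of Tables 11 and 12 refer to, though the sparse gather/scatter machinery is omitted here.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution,
    reducing multiply-adds versus a dense 3x3 convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, 3,
                                   padding=1, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```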
Method  Encoder  Depthwise  H W  Abs Rel  RMSE  log10  δ<1.25  δ<1.25²  δ<1.25³

Dense baseline  DenseNet161    224 224  0.1278  0.5715  0.0557  0.8368  0.9620  0.9901 
Ours  DenseNet161    224 224  0.1279  0.5651  0.0549  0.8399  0.9652  0.9899 
Ours  DenseNet161  ✓  224 224  0.1304  0.5775  0.0564  0.8329  0.9613  0.9892 
Dense baseline  MobileNetv2    224 224  0.1505  0.6221  0.0632  0.7984  0.9526  0.9878 
Ours  MobileNetv2    224 224  0.1530  0.6409  0.0655  0.7844  0.9500  0.9864 
Ours  MobileNetv2  ✓  224 224  0.1491  0.6463  0.0646  0.7880  0.9506  0.9871 
A popular approach to complexity and memory footprint reduction is channel pruning, which aims at removing unnecessary channels in convolutional layers. Note that our wavelet-enabled sparse convolutions are complementary to channel pruning, as can be seen in Figure 9. While channel pruning can, in practice, greatly reduce both complexity and memory footprint, it requires a heavy hyperparameter search, which we therefore leave for future work.
Finally, one important factor that makes FastDepth efficient is that it is trained with 224×224 inputs, against our 480×640 input. While our method is best suited to the higher-resolution regime, where the sparsity of wavelets is stronger, we show that it still achieves decent results even at low resolution, and report the corresponding scores in Table 12.