
Dyna-DM: Dynamic Object-aware Self-supervised Monocular Depth Maps

by   Kieran Saunders, et al.
Aston University

Self-supervised monocular depth estimation has been a subject of intense study in recent years, because of its applications in robotics and autonomous driving. Much of the recent work focuses on improving depth estimation by increasing architecture complexity. This paper shows that state-of-the-art performance can also be achieved by improving the learning process rather than increasing model complexity. More specifically, we propose (i) only using invariant pose loss for the first few epochs during training, (ii) disregarding small potentially dynamic objects when training, and (iii) employing an appearance-based approach to separately estimate object pose for truly dynamic objects. We demonstrate that these simplifications reduce GPU memory usage by 29





1 Introduction

The problem of depth estimation from visual data has recently received increased interest because of its emerging application in autonomous vehicles. Using cameras as a replacement for LiDAR and other distance sensors leads to cost efficiencies and can improve environment perception.

There have been several attempts at visual depth estimation, for example, using stereo images [godard2017unsupervised, chang2018pyramid, li2019stereo] and optical flow [ilg2017flownet, wang2018occlusion]. Other methods focus on supervised monocular depth estimation, which requires large amounts of ground truth depth data. Examples of supervised monocular architectures can be found in [bhat2021adabins, song2021monocular, ranftl2021vision]. Ground truth depth maps such as those available in the KITTI dataset [geiger2013vision] are usually acquired using LiDAR sensors. Supervised methods can be very precise but require expensive sensors and post-processing to obtain clear ground truth. This paper focuses on monocular depth estimation, as it can provide the most affordable physical set-up and requires less training data.

Self-supervised methods exploit information acquired from motion. Examples of architectures using this method are given in [godard2019digging, andraghetti2019enhancing, guizilini20203d, shu2020feature]. While supervised monocular depth estimation currently outperforms self-supervised methods, the performance of self-supervised methods is converging towards that of supervised ones. Additionally, research has shown that self-supervised methods are better at generalizing across a variety of environments [TENDLE2021100124] (e.g., indoor/outdoor, urban/rural scenes).

Figure 1: Dynamic objects have been consistently ignored in self-supervised monocular depth estimations. Our proposed method, Dyna-DM, isolates truly dynamic objects in a scene using an appearance-based approach.

Many works assume the entire 3D world is a rigid scene, thus ignoring objects that move independently. While convenient for computational reasons, in practice this assumption is frequently violated, leading to inexact pose and depth estimations. To account for each independently moving object, we would need to estimate the motion of the camera and each object separately. Unfortunately, slow-moving objects and deforming dynamic objects such as pedestrians make this a very challenging problem.

Attempts to account for potentially dynamic objects by detecting them and estimating their motion independently can be found in [casser2019depth] and [lee2021learning]. However, not all potentially dynamic objects in a scene will be moving at any given time (e.g., cars parked on the side of the road). Estimating the motion of every potentially dynamic object is not only computationally wasteful but, more importantly, can cause significant errors in depth estimation (see Section 4).

Self-supervised models are generally trained on the KITTI [geiger2013vision] and CityScape [cordts2016cityscapes] datasets, which are known to have a large number of potentially dynamic objects. Previous literature reports that truly dynamic objects are relatively infrequent (see Figure 1) and that static objects represent around 86% of the pixels in KITTI test images [voicila2022instance]. In addition to being computationally wasteful, estimating static objects’ motion can reduce accuracy. To address this and other issues, the paper provides the following contributions:

  • A computationally efficient appearance-based approach to avoid estimating the motion of static objects.

  • Experiments supporting the idea of removing small objects, as they tend to be far from the camera and their motion estimation does not bring significant benefits since they are “almost rigid”.

  • Finally, based on experimental results, the paper proposes limiting the use of invariant pose loss.

2 Related Works

Self-Supervised Depth Estimation

Self-supervised methods use image reconstruction between consecutive frames as a supervisory signal. Stereo methods use synchronized stereo image pairs to predict the disparity between pixels [godard2017unsupervised]. For monocular video, the framework for simultaneous self-supervised learning of depth and ego-motion via maximising photometric consistency was introduced by Zhou et al. [zhou2017unsupervised]. It uses ego-motion, depth estimation and an inverse warping function to find reconstructions for consecutive frames. This early model laid the foundation of self-supervised monocular depth estimation. Monodepth2 by Godard et al. [godard2019digging] presented further improvements by masking stationary pixels that would cause infinite-depth estimations. They cast the problem as the minimisation of image reprojection error while addressing occluded areas between two consecutive images. In addition, they employed multi-scale reconstruction, allowing for consistent depth estimations between the layers of the depth decoder. Monodepth2 is currently the most widely used baseline for depth estimation.

Object Motion Estimation

The previous methods viewed dynamic objects as a nuisance, masking non-static objects to find ego-motion. Later methods such as [casser2019depth, lee2021learning] showed that there is a performance improvement when using a separate object-motion estimation for each potentially dynamic object. Casser et al. [casser2019depth] estimated object motion for potentially dynamic objects given predefined segmentation maps, cancelling the camera motion to isolate object motion estimations. Furthermore, Insta-DM by Lee et al. [lee2021learning] improved previous models by using forward warping for dynamic objects and inverse warping for static objects to avoid stretching and warping dynamic objects. Additionally, this method uses a depth network (DepthNet) and pose network (PoseNet) based on a ResNet18 encoder-decoder structure following [ranjan2019competitive]. Both methods treat all potentially dynamic objects in a scene as dynamic, therefore calculating object motion estimations for each individual object.

There have been many appearance-based approaches using optical flow estimations [lv2018learning, ranjan2019competitive]. Recent methods have combined geometric motion segmentation and appearance-based approaches [yang2021learning]. These methods tackle the complex problem of detecting object motion. Although this is a step forward for self-supervision, motion estimations can be jittery and inaccurate.

Recently, Safadoust et al. [safadoust2021self] demonstrated detecting dynamic objects in a scene using an auto-encoder. While this seems promising for reducing computation and improving depth estimation, it tends to struggle with texture-less regions such as the sky and plain walls. More complex methods have approached the detection of dynamic objects using transformers [voicila2022instance] resulting in state-of-the-art performance in detecting dynamic objects but requiring heavy computation.

Figure 2: The input image and the background mask are used to calculate the initial ego-motion. Before removing static objects we remove small objects (objects less than 0.75% of the image’s pixel count), leading to an updated background mask. This mask is further processed using the theta filter, removing objects with a Dice value greater than theta. This results in an updated background mask with truly dynamic objects only.

3 Method

We propose Dyna-DM, an end-to-end self-supervised framework that learns depth and motion using sequential monocular video data. It introduces the detection of truly dynamic objects using the Sørensen–Dice coefficient (Dice). We later discuss the use of invariant pose and the necessity of object motion estimations for objects at a large distance from the camera.

Figure 3: On the left, a dynamic object (in red) is warped using the initial ego-motion, resulting in a warped source mask (green). Calculating the IoU and Dice coefficients between the target mask (blue) and the warped source mask (green) gives a relatively small Dice value. On the right, larger values for IoU and Dice are obtained by following the same process.

3.1 Self-supervised monocular depth estimation

To estimate depth in a self-supervised manner we consider two consecutive RGB frames in a video sequence. We estimate depth for both images by taking the inverse of the output of the disparity network (DispNet). To estimate the motion in the scene, two trainable pose networks (PoseNet) with separate weights are used, one optimized for ego-motion and the other for object motion. The initial background images are computed by removing the masks of all potentially dynamic objects from the corresponding frames. These binary masks are obtained using an off-the-shelf instance segmentation model, Mask R-CNN [he2017mask]. The next step is to estimate the ego-motion from these two background images. Following [lee2021learning], we use both inverse and forward warping as defined there.




Both warping operations take the camera matrix as an input. We then forward warp the image and mask using the previously calculated ego-motion, resulting in a forward warped image and a forward warped mask. We feed the pixel-wise product of the forward warped mask and image, together with the pixel-wise product of the target mask and target image, into the object motion network. This yields an object motion estimate for each potentially dynamic object. Using these estimates we inverse warp each object, and the background image is inverse warped separately. The final warped image is a combination of the inverse warped background image and all of the warped potentially dynamic objects.


3.2 Truly Dynamic Objects

To detect truly dynamic objects we first define an initial ego-motion network that uses pre-trained weights from Insta-DM’s ego-motion network. The binary masks from Insta-DM will also be used as our initial binary mask that contains all potentially dynamic objects.


Then, similarly to Section 3.1, we find the background images using pixel-wise multiplication and determine the initial ego-motion estimate between the two frames.


We then forward warp all of these masked potentially dynamic objects using this initial ego-motion.


Assuming perfect precision for pose and depth implies that the object mask will have been warped to the object mask in the target image if the object is represented by ego-motion i.e., if the object is static. In other words, if an object is static, there will be a significant overlap between its warped mask (under initial ego-motion) and its mask on the target image. Conversely, if the object is truly dynamic, the warped and target masks will not match. This type of overlap or discrepancy can be captured using the Sørensen–Dice coefficient (Dice) or the Jaccard Index, also known as Intersection over Union (IoU):
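For reference, the two overlap measures have the following standard definitions. The symbols $M_w$ (the warped source mask) and $M_t$ (the target mask) are notation chosen here for illustration and may differ from the paper's own.

```latex
\[
\mathrm{Dice}(M_w, M_t) = \frac{2\,|M_w \cap M_t|}{|M_w| + |M_t|},
\qquad
\mathrm{IoU}(M_w, M_t) = \frac{|M_w \cap M_t|}{|M_w \cup M_t|}
\]
```

The two are related by $\mathrm{Dice} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, so for the same pair of masks the Dice value is never smaller than the IoU value.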


Warping of dynamic and static objects using ego-motion is depicted in Figure 3. Stationary objects have greater Dice values than dynamic objects. Potentially dynamic objects that have Dice values lower than the selected value of theta will be classed as truly dynamic objects. Testing both IoU and Dice, we found that the Dice coefficient led to more accurate dynamic object detection; the proposed solution therefore uses the Dice coefficient. We found optimal theta values in the range [0.8, 1].

Note, however, that the reasoning presented above is based on the assumption that depth and pose estimations are accurate. This is unrealistic in most circumstances, so we can expect larger Dice values for dynamic objects. To mitigate this we can use greater frame separation between the source and target frames. This is done simply by calculating the reconstruction between non-adjacent frames rather than between consecutive frames. The extra distance between the frames gives dynamic objects more time to diverge from where ego-motion would warp them, which is beneficial for slow-moving objects. With modern camera frame rates, the extra distance does not cause jitter, but leads to more consistent depth and more exact pose. As these objects can be moving very slowly, calculating the reconstruction across a larger frame gap helps to reveal otherwise imperceptible object motion.

To remove static objects, the method sorts all potentially dynamic objects in decreasing order of their Dice values and selects the first 20 objects. These objects are the most likely to be static and tend to be larger. For theta less than 1, the objects with large Dice values are removed first. The discrepancy between the metrics at theta equal to 1 can be explained by each metric choosing different objects as its first 20.

In summary, we filter using Dice scores to remove all static objects in the initial binary mask, providing an updated binary mask that only contains truly dynamic objects. We refer to this filtering process as theta filtering. This can be used to calculate an updated ego-motion estimation and object motion estimations for each truly dynamic object.
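The theta filtering step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `dice_coefficient` and `theta_filter`, the treatment of two empty masks, and the toy 8×8 masks are all assumptions made for the example.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Sørensen–Dice coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(2.0 * np.logical_and(a, b).sum() / total)


def theta_filter(warped_masks, target_masks, theta=0.9):
    """Return the indices of objects classed as truly dynamic (Dice < theta)."""
    dynamic = []
    for index, (warped, target) in enumerate(zip(warped_masks, target_masks)):
        if dice_coefficient(warped, target) < theta:
            dynamic.append(index)
    return dynamic


# Toy example (hypothetical masks): a 4x4 object in an 8x8 image.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True

static_warp = target.copy()            # ego-motion explains the object: full overlap
dynamic_warp = np.zeros((8, 8), dtype=bool)
dynamic_warp[2:6, 4:8] = True          # warped mask lands two columns away

dynamic_indices = theta_filter(
    [static_warp, dynamic_warp], [target, target], theta=0.9
)
print(dynamic_indices)  # [1]
```

Only the second object falls below theta (its Dice value is 0.5), so it alone would receive a separate object motion estimate.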

3.3 Small Objects & Invariant Pose

We improve this method further in two respects: by removing small objects and by revisiting the usefulness of invariant pose. Firstly, small objects are objects that occupy a small number of pixels in an image. We define these as objects covering less than about 1% of an image’s pixel count; the exact threshold (0.75%) is selected experimentally in Section 4.4. Motion estimation for small objects tends to be either inaccurate or insignificant, as these objects are frequently at a very large distance from the camera. The removal of small objects and static objects is depicted in Figure 2.
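A minimal sketch of the small-object filter follows. The function name `remove_small_objects` and the toy masks are hypothetical; the default threshold of 0.75% is the value selected in Section 4.4.

```python
import numpy as np


def remove_small_objects(masks, min_fraction=0.0075):
    """Keep only masks covering at least `min_fraction` of the image's pixels."""
    kept = []
    for mask in masks:
        mask = np.asarray(mask, dtype=bool)
        if mask.sum() / mask.size >= min_fraction:
            kept.append(mask)
    return kept


# Toy example (hypothetical masks) on a 100x100 image, i.e. 10,000 pixels.
image_shape = (100, 100)

far_object = np.zeros(image_shape, dtype=bool)
far_object[:8, :8] = True       # 64 pixels = 0.64% of the image, below threshold

near_object = np.zeros(image_shape, dtype=bool)
near_object[:10, :10] = True    # 100 pixels = 1% of the image, above threshold

kept_masks = remove_small_objects([far_object, near_object])
print(len(kept_masks))  # 1
```

Only masks that survive this filter would be passed on to theta filtering and object motion estimation.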

Secondly, when calculating poses in 3D space, some methods exploit the relationship between the forward and backward pose as a loss function known as invariant pose, as shown in [lee2021learning, gordon2019depth].


We performed an ablation study to test the usefulness of calculating both forward and backward pose versus considering a single direction. Using a single direction reduces computation, and this reduction may be significant enough to outweigh any accuracy improvement from the consistency loss.
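One common way to express the forward/backward relationship is that the backward pose should be the inverse of the forward pose, so their product should be the identity transform. The sketch below measures how far a pair of poses is from that condition. It is an illustration under this assumption, not the paper's loss; the helper names `pose_matrix` and `invariant_pose_residual` and the example poses are hypothetical.

```python
import numpy as np


def pose_matrix(yaw, translation):
    """Build a 4x4 rigid transform from a rotation about the z-axis and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def invariant_pose_residual(pose_forward, pose_backward):
    """Mean absolute deviation of (forward @ backward) from the identity."""
    return float(np.abs(pose_forward @ pose_backward - np.eye(4)).mean())


forward = pose_matrix(0.05, [0.1, 0.0, 1.0])

# A backward pose that is exactly the inverse of the forward pose.
consistent_backward = np.linalg.inv(forward)
# A backward pose that disagrees with the forward pose.
inconsistent_backward = pose_matrix(-0.02, [0.0, 0.0, -0.8])

residual_consistent = invariant_pose_residual(forward, consistent_backward)
residual_inconsistent = invariant_pose_residual(forward, inconsistent_backward)
print(residual_consistent < residual_inconsistent)  # True
```

Enforcing such a term requires estimating both directions for the background and for every object, which is where the extra memory and power cost discussed in Section 4.2 comes from.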

3.4 Final Loss

Photometric consistency loss will be our main loss function in self-supervision. Following Insta-DM [lee2021learning], we apply consistency to the photometric loss function using a weighted valid mask to handle invalid instance regions, view exiting and occlusion.


The geometric consistency loss is a combination of the depth inconsistency map and the valid instance mask as shown in [lee2021learning];


To smooth textures and noise while preserving sharp edges, we apply the edge-aware smoothness loss proposed by [ranjan2019competitive].


Finally, as there will be trivial cases for infinite depth for moving objects that have the same motion as the camera as discussed in [godard2019digging], we use object height constraint loss as proposed by [casser2019depth] given by;


This loss uses a learnable height prior, the learnable pixel height of the object, and the mean estimated depth. The final objective function is a weighted combination of the previously mentioned loss functions.


We have described the reconstruction of one frame from the other; reconstruction in the opposite direction follows the same process.

4 Results

Invariant Pose
            AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm    GPUp
with I.P.   0.124   0.809  4.897  0.201     0.847  0.954  0.981  10.8GB  121W
w/o I.P.    0.127   0.844  4.880  0.200     0.842  0.953  0.982  7.4GB   107W

Table 1: To compare the usefulness of the pose invariant loss function, we compare the test scores with and without the invariant loss. Initially, we train on CityScape and then continue training with the KITTI dataset.

This section describes the experiments carried out. They show that (1) invariant pose is not necessary as an extra supervisory signal after the first few epochs, (2) theta filtering removes most static objects assumed to be dynamic, leading to accuracy improvements, and (3) removing small objects leads to better reconstructions. All of these modifications to previous approaches are shown to significantly reduce the cost of training.

4.1 Experimental Setup

Testing: We use the KITTI benchmark following the Eigen et al. split [eigen2014depth]. Memory usage and power consumption are reported as the maximum GPU memory used during training (GPUm) and the maximum power drawn during training (GPUp), both recorded using Weights & Biases [wandb]. For CityScape, as there is no ground truth data, we record the loss in equation 15 to inform us of improvements during hyperparameter tuning.
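The error and accuracy columns reported in the KITTI tables (AbsRel, SqRel, RMSE, RMSE log and the threshold accuracies a1, a2, a3) are commonly defined as in the sketch below. This is a minimal version assuming valid, strictly positive depths; evaluation pipelines often also apply steps such as median scaling and depth capping, which are omitted here. The function name `depth_metrics` and the example depths are hypothetical.

```python
import numpy as np


def depth_metrics(gt, pred):
    """Standard monocular depth metrics for positive ground-truth and predicted depths."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),        # fraction within a factor of 1.25
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }


# Toy depths in metres (hypothetical values).
gt = np.array([5.0, 10.0, 20.0, 40.0])

perfect = depth_metrics(gt, gt)        # identical prediction
scaled = depth_metrics(gt, 1.1 * gt)   # every depth overestimated by 10%
print(round(scaled["abs_rel"], 3), scaled["a1"])
```

Lower is better for the four error metrics, and higher is better for a1, a2 and a3.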


Implementation Details: PyTorch [paszke2017automatic] is used for all models. The networks are trained on a single NVIDIA RTX3060 GPU with the ADAM optimiser. Input images have a resolution of 832×256 and are augmented using random scaling, cropping and horizontal flipping. The batch size is set to 1, and we train the network for 100 epochs with an epoch size of 1000. The learning rate is reduced using exponential learning rate decay. Finally, the loss weights follow the definitions in [lee2021learning].

Network Details: Following [ranjan2019competitive], we use DispNet, which is an encoder-decoder. This auto-encoder uses single-scale training, as [bian2019unsupervised] suggests that this gives faster convergence and better results. Ego-PoseNet, Obj-PoseNet and Initial Ego-PoseNet are all based on multiple PoseNet auto-encoders, but they do not share weights.

4.2 Invariant pose

As we are training from scratch using He weight initialization [DBLPjournals/corr/HeZR015], we initially reap benefits from training using the forward-backward pose consistency. After a few epochs, this benefit diminishes and can lead to small reconstruction errors, as the pose estimations will differ in the two directions. Table 1 shows the test scores with and without the invariant loss. After training on the CityScape and KITTI datasets, continuing to use the invariant pose to enforce consistency between backward and forward pose leads to minimal change in the validation metrics. Furthermore, this consistency check comes at a significant cost in terms of memory and power usage. The invariant pose requires the backward and forward pose estimation for the background and each potentially dynamic object, whereas without invariant pose we only require one direction, for example, calculating just the forward pose. Also, if these networks are initialised using pre-trained weights, then accuracy improvements from invariant pose become even less significant.

4.3 Tuning the Theta parameter

This section focuses on selecting a value of the filtering parameter, theta, used to determine if an object is dynamic. Theta filtering removes all objects that are determined to be static, leaving only object motion estimations for dynamic objects. Using the IoU (Jaccard index) as our measure, we iteratively select theta values as described in Table 2. This table demonstrates that a value of 0.9 leads to the greatest reduction in loss while also reducing memory and power usage.

Jaccard Index
theta  Loss   GPUm    GPUp
1      1.220  9.08GB  100W
0.95   1.217  8.90GB  94.7W
0.9    1.216  8.54GB  87.4W
0.85   1.233  8.29GB  84.5W
0.8    1.264  8.04GB  88.9W

Table 2: Using the Jaccard index to remove static objects with varying theta values.

Furthermore, we explore an increased inter-frame distance to handle slow-moving objects. In Table 3, the optimal value of theta is again shown to be 0.9. The loss is now lower than in Table 2, suggesting that the extra distance improves the detection of dynamic objects.

Jaccard Index + Extra Distance
theta  Loss   GPUm    GPUp
1      1.215  8.79GB  99.6W
0.95   1.218  8.92GB  97.5W
0.9    1.207  9.06GB  97.4W
0.85   1.220  8.56GB  96.9W
0.8    1.251  8.19GB  89.8W

Table 3: Using the Jaccard index and extra distance to determine if an object is dynamic based on three frames rather than two frames.

Replacing the Jaccard Index with the Sørensen–Dice coefficient, we again find the optimal theta to be 0.9, as seen in Table 4. This measure gives an even greater reduction in loss, memory and power usage. Tables 2, 3 and 4 all show the loss increasing when theta is set below 0.9. Arguably, doing so increases the probability of misclassifying dynamic objects as static.

Dice coefficient + Extra Distance
theta  Loss   GPUm    GPUp
1      1.177  8.53GB  97.758W
0.95   1.174  8.87GB  94.977W
0.9    1.172  8.24GB  92.802W
0.85   1.180  8.42GB  82.233W
0.8    1.189  8.05GB  85.354W

Table 4: Using extra distance and Dice coefficient to detect dynamic objects.

Using these methods, we must determine how many potentially dynamic objects we will process for each scene in advance. Previously, Insta-DM [lee2021learning] focused on a maximum of 3 potentially dynamic objects, whereas here we will use a maximum of 20.

Dice coefficient + Extra Distance (KITTI)
theta  AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm    GPUp
1      0.118   0.775  4.975  0.200     0.860  0.954  0.980  7.54GB  99.4W
0.95   0.120   0.799  4.799  0.196     0.862  0.956  0.981  7.44GB  98.8W
0.9    0.117   0.786  4.709  0.192     0.868  0.959  0.982  7.37GB  98.2W
0.85   0.116   0.748  4.870  0.196     0.861  0.956  0.982  6.94GB  96.1W
0.8    0.118   0.770  4.816  0.195     0.863  0.957  0.982  6.74GB  97.4W

Table 5: Training on the KITTI dataset to analyse the optimal value of theta on the Eigen test set.

We base our theta experimentation on the CityScape dataset as it is known to have more potentially dynamic objects than the KITTI dataset. This allows us to determine the optimal value of theta in a setting where this method is potentially more valuable. Testing with the KITTI dataset, we obtain Table 5. A theta value of 0.9 is again optimal in this dataset, and power and memory usage again fall as theta decreases. To gather more quantitative evidence, we took the first 500 potentially dynamic objects from the validation set, labelled them as either dynamic or non-dynamic, and exported the IoU and Dice coefficient values for each object. For these values, we iteratively modified theta and determined which value was optimal for detecting whether the object was truly dynamic. Using a theta value of 0.9 with the Jaccard index, the method was accurate only 65% of the time, whereas with the Dice coefficient and a theta value of 0.9 it was accurate 74% of the time. Although a theta value of 0.9 seems to lead to the greatest loss reduction, the selection of this value is left to the user, as smaller values may cause more misclassifications but give greater GPU memory and power reductions.
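The labelling experiment described above amounts to a threshold sweep, which can be sketched as follows. The function names `detection_accuracy` and `best_theta` are hypothetical, and the six scores and labels are made-up values for illustration only, not measurements from the 500 labelled objects.

```python
import numpy as np


def detection_accuracy(scores, is_dynamic, theta):
    """Fraction of objects whose 'score < theta' prediction matches the manual label."""
    predicted_dynamic = np.asarray(scores, dtype=float) < theta
    labels = np.asarray(is_dynamic, dtype=bool)
    return float(np.mean(predicted_dynamic == labels))


def best_theta(scores, is_dynamic, candidates):
    """Return the candidate theta with the highest detection accuracy."""
    return max(
        candidates,
        key=lambda theta: detection_accuracy(scores, is_dynamic, theta),
    )


# Made-up overlap scores (e.g. Dice values) and manual dynamic/static labels.
toy_scores = [0.95, 0.97, 0.85, 0.60, 0.92, 0.40]
toy_labels = [False, False, True, True, False, True]
candidates = [1.0, 0.95, 0.9, 0.85, 0.8]

chosen = best_theta(toy_scores, toy_labels, candidates)
print(chosen, detection_accuracy(toy_scores, toy_labels, chosen))
```

Running the same sweep once with IoU scores and once with Dice scores gives the kind of per-measure accuracy comparison reported in the paragraph above.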

4.4 Small Objects

% Removing Small Objects
%      Loss   GPUm    GPUp
0      1.172  8.12GB  94.345W
0.25   1.172  7.03GB  88.437W
0.5    1.172  7.66GB  85.442W
0.75   1.169  6.64GB  83.973W
1      1.173  6.68GB  88.189W

Table 6: The table shows the removal of small objects in each image of size less than the first column’s percentage value.

Table 6 demonstrates the removal of objects smaller than a specific percentage of the image’s pixel count. We use a theta value of 0.9 and the Sørensen–Dice coefficient with extra inter-frame distance. As the percentage increases, more objects are removed, which reduces memory and power usage. However, the loss increases beyond 0.75%, as objects larger than this could be small objects that are close to the camera, such as children. Ignoring such objects would be greatly detrimental; therefore, as suggested by the increasing loss, we keep this value low so that only objects far from the camera are removed.

4.5 Comparison Table

In summary, our method, which we refer to as Dyna-DM, removes the need for forward/backward pose consistency and removes all objects that occupy less than 0.75% of the image’s pixel count. Finally and most importantly, it determines which objects are dynamic using the Sørensen–Dice coefficient with a theta value of 0.9, and therefore only calculates object motion estimations for these truly dynamic objects. We compare our method with Monodepth2 [godard2019digging], Struct2Depth [casser2019depth] and Insta-DM [lee2021learning]. Dyna-DM is initialised with weights provided by Insta-DM [lee2021learning], and therefore the invariant pose is not used. We train consecutively on CityScape and then the KITTI dataset for all methods and test with the Eigen test split. Insta-DM and Struct2Depth will be using a maximum number of dynamic objects of 13 and Dyna-DM will use 20. The results are reported in Table 7.

Methods Comparison Table
Method          AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm    GPUp
Monodepth2      0.132   1.044  5.142  0.210     0.845  0.948  0.977  9GB     112W
Struct2Depth    0.141   1.026  5.290  0.215     0.816  0.945  0.979  10GB    116W
Insta-DM        0.121   0.797  4.779  0.193     0.858  0.957  0.983  9.48GB  107.9W
Dyna-DM (ours)  0.115   0.785  4.698  0.192     0.871  0.959  0.982  6.67GB  94.0W

Table 7: Dyna-DM uses no invariant pose loss and a theta value of 0.9 with the Sørensen–Dice coefficient to detect dynamic objects, removing all objects smaller than 0.75% of the image’s pixel count. Training on the whole CityScape and KITTI datasets, we compare against Monodepth2 [godard2019digging], Insta-DM [lee2021learning] and Struct2Depth [casser2019depth].

Here we see significant improvements in all metrics except for a3. These metric improvements are accompanied by improvements in GPU usage. Most notably, we see a 29.6% reduction in memory usage during training when comparing Dyna-DM to Insta-DM. This allows us to improve the accuracy of our pose and depth estimations while requiring less computation.
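The quoted memory reduction can be checked directly from the Insta-DM and Dyna-DM rows of Table 7; the same calculation applied to the power column is included for comparison.

```latex
\[
\frac{9.48 - 6.67}{9.48} = \frac{2.81}{9.48} \approx 0.296
\quad (29.6\%\ \text{less GPU memory}),
\qquad
\frac{107.9 - 94.0}{107.9} = \frac{13.9}{107.9} \approx 0.129
\quad (12.9\%\ \text{less power})
\]
```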

Methods CityScape Only
Method          AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm     GPUp
Insta-DM        0.178   1.312  6.016  0.257     0.728  0.916  0.966  10.58GB  110.8W
Dyna-DM (ours)  0.163   1.259  5.939  0.244     0.768  0.926  0.970  6.63GB   86.99W

Table 8: Using the same setup, we train on the whole CityScape dataset and compare Dyna-DM to Insta-DM [lee2021learning].

As the CityScape dataset has more potentially dynamic objects than KITTI, we also train Dyna-DM and Insta-DM on CityScape only and test with the Eigen test split. Table 8 demonstrates even greater improvements in all metrics when comparing Dyna-DM and Insta-DM. Our model can remove most static objects from the object motion estimations while isolating truly dynamic objects, thereby leading to significant improvements in pose estimation. This further improves the reconstructions and leads to improved depth estimations. Estimated depth maps from samples of the Eigen test split are shown in Figure 4. The figure shows clear qualitative improvements in depth estimations when compared to Insta-DM. Looking closely at the potentially dynamic objects in the images, we can see better depth estimations, with sharper edges and clearer outlines.

Figure 4: Depth maps comparing Dyna-DM to Insta-DM showing qualitative improvements for Dyna-DM.

5 Conclusion

Dyna-DM reduces memory and power usage when training monocular depth estimation models while providing better accuracy at test time. Although not all dynamic objects are classified correctly (e.g., slow-moving cars), Dyna-DM detects significant movements, which are the ones that would cause the greatest reconstruction errors. We believe that the best step forward is to make this approach completely self-supervised by detecting all dynamic objects, including debris, using an auto-encoder for safety and efficiency. This means we will have to handle textureless regions and lighting issues, which will be addressed in future work. Other improvements include using depth maps to determine objects’ distances and removing object motion estimations for objects at larger distances.