1 Introduction
The problem of depth estimation from visual data has recently received increased interest because of its emerging application in autonomous vehicles. Using cameras as a replacement for LiDAR and other distance sensors leads to cost efficiencies and can improve environment perception.
There have been several attempts at visual depth estimation, for example, using stereo images [godard2017unsupervised, chang2018pyramid, li2019stereo] and optical flow [ilg2017flownet, wang2018occlusion]. Other methods focus on supervised monocular depth estimation, which requires large amounts of ground truth depth data. Examples of supervised monocular architectures can be found in [bhat2021adabins, song2021monocular, ranftl2021vision]. Ground truth depth maps such as those available in the KITTI dataset [geiger2013vision] are usually acquired using LiDAR sensors. Supervised methods can be very precise but require expensive sensors and postprocessing to obtain clear ground truth. This paper focuses on monocular depth estimation, as it can provide the most affordable physical setup and requires less training data.
Self-supervised methods exploit information acquired from motion. Examples of architectures using this approach are given in [godard2019digging, andraghetti2019enhancing, guizilini20203d, shu2020feature]. While supervised monocular depth estimation currently outperforms self-supervised methods, the gap between them is closing. Additionally, research has shown that self-supervised methods generalize better across a variety of environments [TENDLE2021100124] (e.g., indoor/outdoor, urban/rural scenes).
Many works assume the entire 3D world is a rigid scene, thus ignoring objects that move independently. While convenient for computational reasons, in practice this assumption is frequently violated, leading to inexact pose and depth estimations. To account for each independently moving object, we would need to estimate the motion of the camera and each object separately. Unfortunately, slow-moving objects and deforming dynamic objects such as pedestrians make this a very challenging problem.
Attempts to account for potentially dynamic objects by detecting them and estimating their motion independently can be found in [casser2019depth] and [lee2021learning]. However, not all potentially dynamic objects in a scene will be moving at any given time (e.g., cars parked on the side of the road). Estimating each potentially dynamic object's motion is not only computationally wasteful but, more importantly, can cause significant errors in depth estimation (see Section 4).
Self-supervised models are generally trained on the KITTI [geiger2013vision] and CityScape [cordts2016cityscapes] datasets, which are known to contain a large number of potentially dynamic objects. Previous literature reports that truly dynamic objects are relatively infrequent (see Figure 1) and that static objects represent around 86% of the pixels in KITTI test images [voicila2022instance]. In addition to being computationally wasteful, estimating static objects' motion can reduce accuracy. To address these issues, the paper provides the following contributions:

A computationally efficient appearance-based approach to avoid estimating the motion of static objects.

Experiments supporting the idea of removing small objects, as they tend to be far from the camera and their motion estimation does not bring significant benefits since they are “almost rigid”.

Finally, based on experimental results, the paper proposes limiting the use of invariant pose loss.
2 Related Works
Self-Supervised Depth Estimation
Self-supervised methods use image reconstruction between consecutive frames as a supervisory signal. Stereo methods use synchronized stereo image pairs to predict the disparity between pixels [godard2017unsupervised]. For monocular video, the framework for simultaneous self-supervised learning of depth and ego-motion via maximising photometric consistency was introduced by Zhou et al. [zhou2017unsupervised]. It uses ego-motion, depth estimation and an inverse warping function to reconstruct consecutive frames. This early model laid the foundation of self-supervised monocular depth estimation. Monodepth2 by Godard et al. [godard2019digging] presented further improvements by masking stationary pixels that would cause infinite-depth estimations. They cast the problem as the minimisation of image reprojection error while addressing occluded areas between two consecutive images. In addition, they employed multi-scale reconstruction, allowing for consistent depth estimations between the layers of the depth decoder. Monodepth2 is currently the most widely used baseline for depth estimation.
Object Motion Estimation
The previous methods viewed dynamic objects as a nuisance, masking non-static objects to find ego-motion. Later methods such as [casser2019depth, lee2021learning] showed that there is a performance improvement when using a separate object-motion estimation for each potentially dynamic object. Casser et al. [casser2019depth] estimated object motion for potentially dynamic objects given predefined segmentation maps, cancelling the camera motion to isolate object motion estimations. Furthermore, InstaDM by Lee et al. [lee2021learning] improved previous models by using forward warping for dynamic objects and inverse warping for static objects to avoid stretching and warping dynamic objects. Additionally, this method uses a depth network (DepthNet) and pose network (PoseNet) based on a ResNet18 encoder-decoder structure following [ranjan2019competitive]. Both methods treat all potentially dynamic objects in a scene as dynamic, therefore calculating object motion estimations for each individual object.
There have been many appearance-based approaches using optical flow estimations [lv2018learning, ranjan2019competitive]. Recent methods have combined geometric motion segmentation and appearance-based approaches [yang2021learning]. These methods tackle the complex problem of detecting object motion; although they are a step forward for self-supervision, their motion estimations can be jittery and inaccurate.
Recently, Safadoust et al. [safadoust2021self] demonstrated detecting dynamic objects in a scene using an autoencoder. While this seems promising for reducing computation and improving depth estimation, it tends to struggle with textureless regions such as the sky and plain walls. More complex methods have approached the detection of dynamic objects using transformers [voicila2022instance], resulting in state-of-the-art performance in detecting dynamic objects but requiring heavy computation.
3 Method
We propose DynaDM, an end-to-end self-supervised framework that learns depth and motion from sequential monocular video data. It introduces the detection of truly dynamic objects using the Sørensen–Dice coefficient (Dice). We later discuss the use of the invariant pose and the necessity of object motion estimations for objects at a large distance from the camera.
3.1 Self-supervised monocular depth estimation
To estimate depth in a self-supervised manner we consider consecutive RGB frames in a video sequence. We then estimate depth for both of these images using the inverse of the Disparity Network's (DispNet) output. To estimate the motion in the scene, two trainable pose networks (PoseNet) with differing weights are used: one optimized for ego-motion and another to handle object motion. The initial background images are computed by removing all potentially dynamic objects' masks from the corresponding frames. These binary masks are obtained using an off-the-shelf instance segmentation model, Mask R-CNN [he2017mask]. The next step is to estimate the ego-motion using these two background images. Following [lee2021learning], let us define inverse and forward warping as follows:
(1)  $\hat{I}_{t' \to t}(p_t) = I_{t'}\big\langle K\, T_{t \to t'}\, D_t(p_t)\, K^{-1} p_t \big\rangle$

and

(2)  $\hat{I}_{t \to t'}\big( K\, T_{t \to t'}\, D_t(p_t)\, K^{-1} p_t \big) = I_t(p_t)$
where $K$ is the camera matrix, $T_{t \to t'}$ the relative camera pose and $D_t$ the estimated depth. Furthermore, we forward warp the image and mask using the previously calculated ego-motion, resulting in a forward-warped image and a forward-warped mask. We feed the pixel-wise product of the forward-warped mask and image, together with the pixel-wise product of the target mask and target image, into the object motion network. This results in an object motion estimation for each potentially dynamic object. Using these object motion estimates we can inverse warp each object; the background image is inverse warped separately. The final warped image is a combination of the warped background image and all of the warped potentially dynamic objects, as follows:
(3) 
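To make the projection geometry underlying both warping directions concrete, it can be sketched for a single pixel as follows. This is a minimal NumPy sketch assuming a pinhole camera model; the intrinsics values and the helper name `inverse_warp_coords` are illustrative, not part of the original implementation.

```python
import numpy as np

def inverse_warp_coords(p_t, depth, K, R, t):
    """Project a target pixel into the source view (hypothetical helper).

    p_t: homogeneous pixel (3,) in the target image
    depth: scalar depth estimate D_t(p_t) at that pixel
    K: 3x3 camera intrinsics; R, t: relative camera rotation/translation
    Returns the normalised homogeneous source pixel K (R D K^-1 p_t + t).
    """
    cam_point = depth * (np.linalg.inv(K) @ p_t)  # back-project to a 3D point
    p_s = K @ (R @ cam_point + t)                 # apply relative pose, re-project
    return p_s / p_s[2]                           # normalise homogeneous coords

# Illustrative intrinsics (KITTI-like focal length and principal point).
K = np.array([[721.5, 0.0, 416.0],
              [0.0, 721.5, 128.0],
              [0.0, 0.0, 1.0]])
p = np.array([100.0, 50.0, 1.0])
# With identity ego-motion the pixel maps to itself.
print(inverse_warp_coords(p, 10.0, K, np.eye(3), np.zeros(3)))
# A 1 m sideways translation at depth 10 shifts x by fx * tx / Z = 72.15 px.
print(inverse_warp_coords(p, 10.0, K, np.eye(3), np.array([1.0, 0.0, 0.0])))
```

In practice the networks warp whole images: inverse warping gathers source pixels at the projected coordinates, while forward warping scatters target pixels to them.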
3.2 Truly Dynamic Objects
To detect truly dynamic objects we first define an initial ego-motion network that uses pretrained weights from InstaDM's ego-motion network. The binary masks from InstaDM are also used as our initial binary mask containing all potentially dynamic objects.
(4) 
Then, similarly to Section 3.1, we find the background image using pixel-wise multiplication and determine the initial ego-motion estimation between the two frames.
(5) 
We then forward warp all of these masked potentially dynamic objects using this initial ego-motion.
(6) 
Assuming perfect precision for pose and depth implies that the object mask will have been warped onto the object mask in the target image if the object's motion is explained by ego-motion, i.e., if the object is static. In other words, if an object is static, there will be a significant overlap between its warped mask (under initial ego-motion) and its mask in the target image. Conversely, if the object is truly dynamic, the warped and target masks will not match. This type of overlap or discrepancy can be captured using the Sørensen–Dice coefficient (Dice) or the Jaccard Index, also known as Intersection over Union (IoU):
(7)  $\mathrm{Dice}(A, B) = \dfrac{2\,|A \cap B|}{|A| + |B|}$

(8)  $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

where $A$ and $B$ are the warped and target object masks.
Warping of dynamic and static objects using ego-motion is depicted in Figure 3. Stationary objects have greater Dice values than dynamic objects. Potentially dynamic objects with Dice values lower than the selected threshold theta are classed as truly dynamic objects. Testing both IoU and Dice, we found that the Dice coefficient led to more accurate dynamic-object detection; the proposed solution therefore uses the Dice coefficient. We found optimal theta values in the range [0.8, 1].
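The static/dynamic test above reduces to comparing two binary masks. A minimal sketch of the two overlap measures on NumPy boolean arrays (the function names are our own):

```python
import numpy as np

def dice(a, b):
    """Sørensen–Dice coefficient: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def iou(a, b):
    """Jaccard Index (IoU): |A∩B| / |A∪B|."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

# A static object: the ego-motion-warped mask lands exactly on the target mask.
a = np.zeros((1, 8), dtype=bool); a[0, 0:4] = True
# A dynamic object: the two masks only partially overlap.
b = np.zeros((1, 8), dtype=bool); b[0, 2:6] = True
print(dice(a, a), iou(a, a))  # → 1.0 1.0
print(dice(a, b), iou(a, b))  # Dice = 0.5, IoU ≈ 0.333
```

For the same pair of masks Dice is never smaller than IoU, so the two measures rank borderline objects differently under the same threshold.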
Note, however, that the reasoning presented above assumes accurate depth and pose estimations. This is unrealistic in most circumstances, so we can expect larger Dice values for dynamic objects. To mitigate this challenge we can use a greater frame separation between the source and target frames, calculating the reconstruction between frames that are further apart in the sequence rather than between directly consecutive frames. This extra distance gives dynamic objects more time to diverge from where ego-motion would warp them, which is beneficial for slow-moving objects. With modern camera frame rates, the extra distance does not cause jitters, but leads to more consistent depth and more exact pose. As some objects move very slowly, calculating the reconstruction over a larger frame gap can help reveal otherwise imperceptible object motion.
To remove static objects, the method sorts all potentially dynamic objects by their Dice values in decreasing order and selects the first 20 objects. These objects are the most likely to be static and tend to be larger objects; for theta less than 1, the objects with large Dice values are removed first. The discrepancy between the two metrics at theta equal to 1 can be explained by them selecting different objects as the first 20.
In summary, we filter using Dice scores to remove all static objects from the initial binary mask, providing an updated binary mask that contains only truly dynamic objects. We refer to this filtering process as theta filtering. The updated mask can then be used to calculate an updated ego-motion estimation and object motion estimations for each truly dynamic object.
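The theta-filtering step described above amounts to a ranking and a threshold. The function below is an illustrative reimplementation, not the paper's code, assuming per-object Dice scores have already been computed:

```python
def theta_filter(dice_scores, theta=0.9, max_objects=20):
    """Return the ids of objects judged truly dynamic.

    dice_scores: {object_id: Dice overlap between the ego-motion-warped
                  mask and the target mask}
    Objects are sorted by decreasing Dice and capped at max_objects;
    an object is kept as dynamic only if its Dice falls below theta.
    """
    ranked = sorted(dice_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [obj for obj, score in ranked[:max_objects] if score < theta]

# Objects 1 and 3 overlap almost perfectly after ego-motion warping (static);
# objects 2 and 4 do not (dynamic).
print(theta_filter({1: 0.95, 2: 0.70, 3: 0.99, 4: 0.85}))  # → [4, 2]
```

Only the returned objects are passed to the object-motion network; everything else is treated as part of the rigid background.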
3.3 Small Objects & Invariant Pose
We improve this method further in two aspects: by removing small objects and by revisiting the usefulness of the invariant pose. Firstly, small objects are objects that occupy a small pixel count in an image; we define them as covering less than 1% of the image's pixel count. Motion estimation for small objects tends to be either inaccurate or insignificant, as these objects are frequently at a very large distance from the camera. The removal of small and static objects is depicted in Figure 2.
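The small-object filter can be sketched as below, assuming instance masks are binary arrays; the 1% threshold matches the definition above, while the helper name and example masks are ours.

```python
import numpy as np

def drop_small_objects(masks, image_shape, min_frac=0.01):
    """Discard masks covering less than min_frac of the image's pixels."""
    h, w = image_shape
    return [m for m in masks if m.sum() / (h * w) >= min_frac]

h, w = 256, 832                        # training resolution used in the paper
near_car = np.zeros((h, w), dtype=bool)
near_car[:64, :64] = True              # 4096 px ≈ 1.9% of the image: kept
far_car = np.zeros((h, w), dtype=bool)
far_car[0, :10] = True                 # 10 px: removed, no motion estimate
print(len(drop_small_objects([near_car, far_car], (h, w))))  # → 1
```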
Secondly, when calculating poses in 3D space, some methods exploit the relationship between the forward and backward pose as a loss function known as invariant pose, similarly to [lee2021learning, gordon2019depth].
(9)
We performed an ablation study to test whether calculating both forward and backward poses is worthwhile compared with a single direction, i.e., whether the resulting reduction in computation outweighs any accuracy improvements.
3.4 Final Loss
Photometric consistency loss is our main loss function in self-supervision. Following InstaDM [lee2021learning], we apply consistency to the photometric loss function using a weighted valid mask to handle invalid instance regions, view-exiting pixels and occlusion.
(10) 
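As a simplified illustration of how the valid mask enters the photometric term, the sketch below uses a plain L1 error; the full loss in [lee2021learning] also includes an SSIM term and per-pixel weighting, which are omitted here.

```python
import numpy as np

def masked_photometric_l1(target, warped, valid_mask):
    """Mean absolute photometric error over valid pixels only.

    target, warped: (H, W, 3) images; valid_mask: (H, W) 0/1 map that
    zeroes out invalid instance regions, view-exiting pixels and occlusions.
    """
    per_pixel = np.abs(target - warped).mean(axis=-1)   # average over channels
    return (per_pixel * valid_mask).sum() / max(valid_mask.sum(), 1)

target = np.ones((2, 2, 3))
warped = np.zeros_like(target)
mask = np.array([[1.0, 0.0], [0.0, 0.0]])    # only one pixel is valid
print(masked_photometric_l1(target, warped, mask))  # → 1.0
```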
The geometric consistency loss is a combination of the depth inconsistency map and the valid instance mask, as shown in [lee2021learning]:
(11) 
To smooth textures and noise while keeping sharp edges, we apply the edge-aware smoothness loss proposed by [ranjan2019competitive]:
(12) 
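The idea behind the edge-aware term can be sketched as follows: disparity gradients are penalised, but the penalty is down-weighted wherever the image itself has strong gradients, so depth discontinuities at object boundaries are preserved. This is an illustrative NumPy version of the idea, not the exact loss of [ranjan2019competitive]:

```python
import numpy as np

def edge_aware_smoothness(disp, img):
    """Penalise disparity gradients, weighted by image-gradient magnitude.

    disp: (H, W) disparity map; img: (H, W, 3) image in [0, 1].
    """
    dx_d = np.abs(np.diff(disp, axis=1))   # horizontal disparity gradients
    dy_d = np.abs(np.diff(disp, axis=0))   # vertical disparity gradients
    # Weights shrink where the image changes sharply (edges), so the loss
    # does not smooth over genuine depth discontinuities.
    wx = np.exp(-np.abs(np.diff(img, axis=1)).mean(axis=-1))
    wy = np.exp(-np.abs(np.diff(img, axis=0)).mean(axis=-1))
    return (dx_d * wx).mean() + (dy_d * wy).mean()

img = np.full((4, 4, 3), 0.5)             # textureless image: weights are 1
flat = np.full((4, 4), 2.0)
print(edge_aware_smoothness(flat, img))   # → 0.0 for a constant disparity
```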
Finally, as there are trivial cases of infinite depth for moving objects that have the same motion as the camera, as discussed in [godard2019digging], we use the object height constraint loss proposed by [casser2019depth]:
(13) 
The loss is built from a learnable height prior, the learnable pixel height of the object and the mean estimated depth. The final objective function is a weighted combination of the previously mentioned loss functions:
(14) 
We followed the path of reconstructing the target frame from the source frame, although reconstruction in the opposite direction would follow the same process.
4 Results
Table 1. Invariant pose (I.P.) ablation.
           AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm    GPUp
I.P.       0.124   0.809  4.897  0.201     0.847  0.954  0.981  10.8GB  121W
w/o I.P.   0.127   0.844  4.880  0.200     0.842  0.953  0.982  7.4GB   107W

This section describes the experiments carried out. They show that (1) invariant pose is not necessary as an extra supervisory signal after the first few epochs, (2) theta filtering removes most static objects assumed to be dynamic, leading to accuracy improvements, and (3) removing small objects leads to better reconstructions. All of these modifications to previous approaches are shown to significantly reduce the cost of training.
4.1 Experimental Setup
Testing: We use the KITTI benchmark following the Eigen et al. split [eigen2014depth]. Memory usage and power consumption were recorded as the maximum memory (GPUm) and maximum power (GPUp) used during training, logged with Weights & Biases [wandb]. For CityScape, as there is no ground truth depth data, we record the loss in Equation 15 to inform hyperparameter tuning.
(15) 
Implementation Details: PyTorch [paszke2017automatic] is used for all models, training the networks on a single NVIDIA RTX 3060 GPU with the ADAM optimiser. Input images have resolution 832×256 and are augmented using random scaling, cropping and horizontal flipping. The batch size is set to 1 and we train the network for 100 epochs with an epoch size of 1000, using exponential learning rate decay. The loss weights are set as defined in [lee2021learning].
Network Details: Following [ranjan2019competitive] we use DispNet, an encoder-decoder architecture. This autoencoder uses single-scale training, which [bian2019unsupervised] suggests gives faster convergence and better results. EgoPoseNet, ObjPoseNet and Initial EgoPoseNet are all based on multiple PoseNet autoencoders, but they do not share weights.
4.2 Invariant pose
As we train from scratch using He weight initialization [DBLPjournals/corr/HeZR015], we initially benefit from training with forward-backward pose consistency. After a few epochs, this benefit diminishes and can lead to small reconstruction errors, as the pose estimations differ in the two directions. Table 1 shows the test scores with and without the invariant loss. After training on the CityScape and KITTI datasets, continuing to use the invariant pose to enforce consistency between backward and forward pose leads to minimal change in the validation metrics. Furthermore, this consistency check comes at a significant cost in memory and power usage: the invariant pose requires backward and forward pose estimations for the background and each potentially dynamic object, whereas without it we only require one direction, for example the forward pose. Moreover, if these networks are initialised with pretrained weights, the accuracy improvements from the invariant pose become even less significant.
4.3 Tuning the Theta parameter
This section focuses on selecting a value of the filtering parameter theta, used to determine whether an object is dynamic. Theta filtering removes all objects determined to be static, leaving object motion estimations only for dynamic objects. Using the IoU measure (Jaccard Index), we iteratively evaluate theta values as described in Table 2. The table demonstrates that a value of 0.9 leads to the greatest reduction in loss while also reducing memory and power usage.
Table 2. Theta filtering with the Jaccard Index.
theta  Loss   GPUm    GPUp
1      1.220  9.08GB  100W
0.95   1.217  8.90GB  94.7W
0.9    1.216  8.54GB  87.4W
0.85   1.233  8.29GB  84.5W
0.8    1.264  8.04GB  88.9W

Furthermore, we explore an increased inter-frame distance to handle slow-moving objects. In Table 3, the optimal value of theta is again 0.9. The loss is now lower than in Table 2, suggesting that the extra distance improves the detection of dynamic objects.
Table 3. Jaccard Index + extra distance.
theta  Loss   GPUm    GPUp
1      1.215  8.79GB  99.6W
0.95   1.218  8.92GB  97.5W
0.9    1.207  9.06GB  97.4W
0.85   1.220  8.56GB  96.9W
0.8    1.251  8.19GB  89.8W

Replacing the Jaccard Index with the Sørensen–Dice coefficient, the optimal theta is again 0.9, as seen in Table 4. This measure gives an even greater reduction in loss, memory and power usage. Tables 2, 3 and 4 show that the loss increases for theta values under 0.9; arguably, such values increase the probability of misclassifying dynamic objects as static.
Table 4. Dice coefficient + extra distance.
theta  Loss   GPUm    GPUp
1      1.177  8.53GB  97.758W
0.95   1.174  8.87GB  94.977W
0.9    1.172  8.24GB  92.802W
0.85   1.180  8.42GB  82.233W
0.8    1.189  8.05GB  85.354W

Using these methods, we must determine in advance how many potentially dynamic objects to process for each scene. Previously, InstaDM [lee2021learning] focused on a maximum of 3 potentially dynamic objects, whereas here we use a maximum of 20.
Table 5. Dice coefficient + extra distance (KITTI).
theta  AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm    GPUp
1      0.118   0.775  4.975  0.200     0.860  0.954  0.980  7.54GB  99.4W
0.95   0.120   0.799  4.799  0.196     0.862  0.956  0.981  7.44GB  98.8W
0.9    0.117   0.786  4.709  0.192     0.868  0.959  0.982  7.37GB  98.2W
0.85   0.116   0.748  4.870  0.196     0.861  0.956  0.982  6.94GB  96.1W
0.8    0.118   0.770  4.816  0.195     0.863  0.957  0.982  6.74GB  97.4W

We base our theta experimentation on the CityScape dataset as it is known to contain more potentially dynamic objects than KITTI. This allows us to determine the optimal value of theta in a setting where the method is potentially more valuable. Testing on the KITTI dataset, we obtain Table 5. A theta value of 0.9 is again optimal, and power and memory usage again decrease as theta decreases. For more quantitative evidence, we took the first 500 potentially dynamic objects from the validation set, labelled them as either dynamic or non-dynamic, and exported the IoU and Dice coefficient values for each object. We then iteratively modified theta to determine which value was optimal for detecting whether an object was truly dynamic. With a theta value of 0.9, the Jaccard Index was accurate only 65% of the time, whereas the Dice coefficient was accurate 74% of the time. Although 0.9 appears to give the greatest loss reduction, the selection of this value is left to the user, as smaller values may lead to more misclassifications but greater reductions in GPU memory and power.
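The labelling experiment above amounts to scoring the thresholding rule against manual labels. A sketch of that evaluation with a hypothetical helper and toy values (the 500-object labels themselves are not reproduced here):

```python
def classification_accuracy(scores, labels, theta):
    """Accuracy of the rule 'overlap score < theta => dynamic'.

    scores: per-object Dice (or IoU) values under initial ego-motion warping;
    labels: True if the object was manually labelled as truly dynamic.
    """
    correct = sum((s < theta) == l for s, l in zip(scores, labels))
    return correct / len(scores)

scores = [0.95, 0.70, 0.99, 0.85]    # toy overlap values
labels = [False, True, False, True]  # manual dynamic (True) / static (False)
print(classification_accuracy(scores, labels, theta=0.9))   # → 1.0
print(classification_accuracy(scores, labels, theta=0.8))   # → 0.75
```

Sweeping theta over a grid of candidate values and keeping the most accurate one is how the comparison between the two overlap measures was carried out.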
4.4 Small Objects
Table 6. Removing small objects by percentage of image pixel count.
%     Loss   GPUm    GPUp
0     1.172  8.12GB  94.345W
0.25  1.172  7.03GB  88.437W
0.5   1.172  7.66GB  85.442W
0.75  1.169  6.64GB  83.973W
1     1.173  6.68GB  88.189W

Table 6 demonstrates the removal of objects smaller than a given percentage of the image's pixel count, using a theta value of 0.9 and the Sørensen–Dice coefficient with extra inter-frame distance. As the percentage increases we remove more objects, reducing memory and power usage. However, the loss increases after 0.75%, as objects larger than this can be small objects that are close to the camera, such as children. Ignoring such objects would be greatly detrimental; therefore, as suggested by the increasing loss, we keep this value low so that only distant objects are removed.
4.5 Comparison Table
In summary, our method, which we refer to as DynaDM, alleviates the need for forward/backward pose consistency and removes all objects that occupy less than 0.75% of the image's pixel count. Finally, and most importantly, it determines which objects are dynamic using the Sørensen–Dice coefficient with a theta value of 0.9, calculating object motion estimations only for these truly dynamic objects. We compare our method with Monodepth2 [godard2019digging], Struct2Depth [casser2019depth] and InstaDM [lee2021learning]. DynaDM is initialised with weights provided by InstaDM [lee2021learning] and therefore the invariant pose is not used. We train consecutively on CityScape and then KITTI for all methods and test with the Eigen test split. InstaDM and Struct2Depth use a maximum of 13 potentially dynamic objects, while DynaDM uses 20. The results are reported in Table 7.
Table 7. Comparison table.
Method         AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm    GPUp
Monodepth2     0.132   1.044  5.142  0.210     0.845  0.948  0.977  9GB     112W
Struct2Depth   0.141   1.026  5.290  0.215     0.816  0.945  0.979  10GB    116W
InstaDM        0.121   0.797  4.779  0.193     0.858  0.957  0.983  9.48GB  107.9W
DynaDM (ours)  0.115   0.785  4.698  0.192     0.871  0.959  0.982  6.67GB  94.0W

Here we see significant improvements in all metrics except for a3. These metric improvements are accompanied by improvements in GPU usage. Most notably, we see a 29.6% reduction in memory usage during training when comparing DynaDM to InstaDM; this allows us to improve the accuracy of our pose and depth estimations while requiring less computation.
Table 8. CityScape only.
Method         AbsRel  SqRel  RMSE   RMSE log  a1     a2     a3     GPUm     GPUp
InstaDM        0.178   1.312  6.016  0.257     0.728  0.916  0.966  10.58GB  110.8W
DynaDM (ours)  0.163   1.259  5.939  0.244     0.768  0.926  0.970  6.63GB   86.99W

As the CityScape dataset has more potentially dynamic objects than KITTI, we can train DynaDM and InstaDM on CityScape only and test with the Eigen test split. Table 8 demonstrates even greater improvements in all metrics when comparing DynaDM with InstaDM. Our model removes most static objects from the object motion estimations while isolating truly dynamic objects, leading to significant improvements in pose estimation. This further improves the reconstructions and leads to improved depth estimations. Estimated depth maps for samples of the Eigen test split are shown in Figure 4. The figure shows clear qualitative improvements in depth estimation compared to InstaDM: looking closely at the potentially dynamic objects in the images, the depth estimations have sharper edges and clearer outlines.
5 Conclusion
DynaDM reduces memory and power usage when training monocular depth estimation models while providing better accuracy at test time. Although not all dynamic objects are always classified correctly (e.g., slow-moving cars), DynaDM detects the significant movements that would cause the greatest reconstruction errors. We believe that the best step forward is to make this approach completely self-supervised by detecting all dynamic objects, including debris, using an autoencoder, for safety and efficiency. This requires handling textureless regions and lighting issues, which we will address in future work. Other improvements include using depth maps to determine objects' distances, and removing object motion estimations for objects at larger distances.